8

Vorbrodt's C++ Blog

 3 years ago
source link: https://vorbrodt.blog/2019/04/04/bloom-filters/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Bloom Filters

From Wikipedia:

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

In other words, given a set of elements, bloom filter can tell you that: A) given element is definitely not in the set, or B) given element is maybe in the set. It can give you a false positive: it can say that an element is in the set when it in fact is not. But it will never give you a false negative: it will never say that an element is not in the set when in fact it is.

In my past life I used bloom filters to check whether or not I should perform an expensive database index search 🙂

In the following example I construct a bloom filter given a set of strings set1, I then verify that each element of set1 is in the set according to the bloom filter. Finally I try a different set of elements set2, and test what the bloom filter says about those elements. Given big enough bloom filter I get 100% correct answers (that non of the elements in set2 are present). Here’s the code (bloom.cpp):

#include <iostream>
#include <string>
#include "bloom.hpp"
using namespace std;
int main()
string set1[] = {"Martin", "Vorbrodt", "C++", "Blog"};
string set2[] = {"Not", "In", "The", "Set"};
bloom_filter<string> bloom(128);
for(auto& s : set1)
bloom.add(s);
for(auto& s : set1)
cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl;
for(auto& s : set2)
cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl;

Contains “Martin” : 1
Contains “Vorbrodt” : 1
Contains “C++” : 1
Contains “Blog” : 1
Contains “Not” : 0
Contains “In” : 0
Contains “The” : 0
Contains “Set” : 0

Program output.

My implementation of the bloom filter is primitive: it uses only one hashing function which increases false positives, but it is very short and clean and can be built upon; maybe at some point I’ll write a full blown implementation; though there’s plenty of examples online; I want this post to be an introduction to the topic.

In any case, here’s my implementation of the bloom filter (bloom.hpp):

#pragma once
#include <vector>
#include <functional>
#include <cstddef>
template<typename key, typename hash = std::hash<key>>
class bloom_filter
public:
bloom_filter(std::size_t size) : m_bits(size) {}
void add(const key& data)
m_bits[hash{}(data) % m_bits.size()] = true;
bool contains(const key& data)
return m_bits[hash{}(data) % m_bits.size()];
private:
std::vector<bool> m_bits;

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK