From Wikipedia:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
In other words, given a set of elements, bloom filter can tell you that:
- given element is definitely not in the set, or
- given element is maybe in the set.
It can give you a false positive: it can say that an element is in the set when it in fact is not. But it will never give you a false negative: it will never say that an element is not in the set when in fact it is.
In my past life, I used bloom filters to check whether or not I should perform an expensive database index search.
In the following example, I construct a bloom filter given a set of string
s:
set1
I then verify that each element of:
set1
is in the set according to the bloom filter. Finally, I try a different set of elements:
set2
and test what the bloom filter says about those elements. Given big enough bloom filter, I get 100% correct answers (that none of the elements in...
set2
...are present). Here’s the code (bloom.cpp):
#include <iostream>
#include <string>
#include "bloom.hpp"
using namespace std;
int main()
{
string set1[] = {"Martin", "Vorbrodt", "C++", "Blog"};
string set2[] = {"Not", "In", "The", "Set"};
bloom_filter<string> bloom(128);
for(auto& s : set1)
bloom.add(s);
for(auto& s : set1)
cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl;
for(auto& s : set2)
cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl;
}
Program output:
Contains "Martin" : 1
Contains "Vorbrodt" : 1
Contains "C++" : 1
Contains "Blog" : 1
Contains "Not" : 0
Contains "In" : 0
Contains "The" : 0
Contains "Set" : 0
My implementation of the bloom filter is primitive: it uses only one hashing function which increases false positives, but it is very short and clean and can be built upon; maybe at some point, I’ll write a full blown implementation; though there’s plenty of examples online; I want this post to be an introduction to the topic.
In any case, here’s my implementation of the bloom filter (bloom.hpp):
#pragma once
#include <vector>
#include <functional>
#include <cstddef>
template<typename key, typename hash = std::hash<key>>
class bloom_filter
{
public:
bloom_filter(std::size_t size) : m_bits(size) {}
void add(const key& data)
{
m_bits[hash{}(data) % m_bits.size()] = true;
}
bool contains(const key& data)
{
return m_bits[hash{}(data) % m_bits.size()];
}
private:
std::vector<bool> m_bits;
};