Bloom Filter

Tags: maths

Bloom Filter

Hash tables are such data structures which have uses almost everywhere. For each problem that we get, there’s a slight chance that it can be solved by using a hash table. Arrays are simpler hash tables, so are structures like symbols tables, caches etc.

The problem with hash tables is that these store both the key as well as its associated value. While the keys can be hashed, the values are not. And if these records run into millions, then there might be a situation of how to store it into memory (although this might not be concern in the age of big data, but still).

A Bloom filter is a probabilistic data structure. Here we try to find if the key is available in the Bloom filter or not. The Bloom filter, by design, is either 100% sure that the key is not present, or it’s fairly certain (not 100%) that it’s present. Not that the actual value needs to be stored separately and is not present in the Bloom filter.

Mathematically, a Bloom filter can return the following values for a given value v, in set S:

  • 100% sure that v doesn’t exist in S, or
  • Very high probability (less than 100%) that v does exists in S.

So the non-existence is guaranteed but existence is just a high probability.

In some sense, the Bloom filter is a smaller cache in front of cache. It allows us to know if the key is present in the cache or not.

Technical Details

The underlying data structure is a very long bit string, with its size roughly running into millions. Initially, each of the bits is off.

Each key which needs to be added to the Bloom filter is hashed using multiple hash functions, with each function giving a number less than the bit-string size. Each number becomes an index in the bit-string and the associated bit is turned on. So a key hashed using 3 hash functions giving 12, 55, and 87 hash values is going to turn on the bits on index 12, 55, and 87. We keep on adding new keys into the filter.

During lookup, the same procedure is followed and the bits are calculated. If any of the bits is off, that means that the key is not present in the Bloom filter. If all of the bits are on, that either means that the key is actually present in the filter, or maybe a unique combination of some other keys have turned on all the bits. Therefore, the absence is guaranteed, but the presence is not. Any such entry which is not actually present, but the associated bits are turned on is called a false positive.

Designing a Bloom filter means that we need to have a balance between the number of elements which we want to enter, and the proportion of false positives that we are willing to tolerate.

Next we see some combination of properties we can use.

Ballpark calculations

When we start with a Bloom filter, we need to fix how many elements is the filter going to hold, and what is the percentage of false positives we are willing to bear.

Then the number of hash functions which takes the false positive percentage to a minimum is:

\[k = {m \over n} ln(2).\]

where \[k\] is the number of hash functions, \[m\] is the number of bits that we have to store per element \[n\] is the total number of elements.

A ballpark figure is to have the following bits for false positive tolerance:

4.8 bits/element for 10% false positives 9.6 bits/element for 1% false positives 14.4 bits/element for 0.1% false positives

How to get independent hash functions?

Given that having a false positive rate of 0.1% requires 10 different hash functions, we need to understand how we can get so many hash functions.

  • Trick 1

    Divide the bit of a large bit value to smaller bits, if the bits of large bits can be independent. e.g. a 32bit wide value can be split to 4x8 or 2x16 hashed bits.

    Number of bits required to represent n numbers is log2(n)

  • Trick 2

    Combine two hash functions together:

    h(…) = h1(…) + i * h2(…)

    e.g.

    h3 = h1(…) + 3 * h2(…) h4 = h1(…) + 4 * h2(…)

Example Usecase

Images which are served only once are cached. This is a waste of resources. Solution: Bloom and Cuckoo filters saves the cache to be filled with such one-off values.

Bloom filter allows us to find the requests which are one-off and then remove these from the cache. So this is like a cache for the cache.

Imagine that there’s a cache which has hundreds of thousands of items. The purpose of the cache is to send back the items which are requested by the users without making a trip to the database. But how do we know which item is frequently requested by the end users and which are seldomly requested? This is one situation where Bloom filter can be used.

The simple scenario without using a Bloom filter is when the user requests an item which is not in the cache, and we return it and save it in the cache. If the item is not requested even again, then it is wasting the cache space.

The solution would be to have a Bloom filter alongside the cache. In case of a non-cached request by the user, we do the following:

  • Return the item from the database,
  • Check the Bloom filter if the request has been served before.

    • The Bloom filter is going to either going to be 100% sure if the request has

    not been served before, or give a good enough probable answer that it has been served before.

    • In either case, we enter the value in the Bloom filter.
    • If the same item is requested again, then we check the Bloom filter and going to tell us that the item has been probably served before.
    • This item then goes into the cache.

Implementation

As per StackOverflow discussion here, one of the good hash functions for Bloom filters is Murmur3 function.

We try to target the functionality of having just 0.1% false positives.

We start with a relatively large number of elements so that the algorithm can actually become functional. As per the calculations in the previous section, we need around 14.4 bits per element to have low number of false positives. So we start by creating a BitArray having 14.4 times the size of elements.

Since we already know that we need Murmur3 hash function, we use the library. But we also have to use 10 of these so that there’s enough randomisation of the bits.

The murmur3 library returns 32bit int (there’s an option to get 128bit values as well), so we can split that 32bit int into 2 16bit values. This is because for our use case of 1million elements, unsigned 16bit will hold 65K values – well below the size of the array. This takes care of the first 2 hash functions.

For the next 8 functions, we use the same key but have the following arrangement of the functions:

\[ h_i(k) = h_1(k) + i * h_2(k) \]

By the time we reach the 8th hash function, we get a maximum of 520K index.

Entering and reading values is just a matter of setting and reading the bits from the BitArray

public class BloomFilter
    {
        private readonly BitArray _elements;
        private static readonly Murmur32 _murmur = MurmurHash.Create32();

        public BloomFilter(int numElements)
        {
            _elements = new BitArray((int)(14.4 * numElements), false);
        }

        // Since this is a 32bit integer, we can split the value
        // in the first half and second half with hash1 getting the first half
        // and hash2 getting the second half. This is doable since BitConverter
        // takes in a starting index as well. This may help with the indexes as
        // well, since we'll not have to take the modulo.
        static readonly Func<string, int> hash1 = (key) =>
        {
            byte[] hash = MurmurHash.Create32().ComputeHash(Encoding.ASCII.GetBytes(key));
            return BitConverter.ToUInt16(hash, 0);
        };

        static readonly Func<string, int> hash2 = (key) =>
        {
            byte[] hash = MurmurHash.Create32().ComputeHash(Encoding.ASCII.GetBytes(key));
            return BitConverter.ToUInt16(hash, 2);
        };

        // We need to have an array of 10 hash functions, and iterate over these
        // to get the 10 hash values and then set these 10 bits in the bitarray.
        // In order to not overflow the array, (and I'm not sure about the implementation),
        // we can take the modulo of the number.
        public void AddKey(string key)
        {
            int bitIndex = hash1(key) % _elements.Length;
            _elements.Set(bitIndex, true);
            bitIndex = hash2(key) % _elements.Length;
            _elements.Set(bitIndex, true);

            for (int i = 3; i <= 10; i++)
            {
                bitIndex = hash1(key) + i * hash2(key);
                _elements.Set(bitIndex, true);
            }
        }

        public bool ContainsKey(string key)
        {
            var contains = true;

            int bitIndex = hash1(key) % _elements.Length;
            contains = contains && _elements.Get(bitIndex);
            bitIndex = hash2(key) % _elements.Length;
            contains = contains && _elements.Get(bitIndex);

            for (int i = 3; i <= 10; i++)
            {
                bitIndex = hash1(key) + i * hash2(key);
                contains = contains && _elements.Get(bitIndex);
            }

            return contains;
        }
    }

Challenges

So in the default Bloom filter, deletions are not supported. Because of this, over time, there’s a chance that all the bits are going to be set to 1. When this happens, then every request is going to be returned as a positive.

Applications

Check the relevent section of the wikipedia article to get more ideas of where to use the Bloom filters.