How to Store and Query 100 Million Items Using Just 77MB with Python Bloom Filters

Perform lightning-fast, memory efficient membership checks in Python with this need-to-know data structure

How to Store and Query 100 Million Items Using Just 77MB with Python Bloom Filters
Programming with a view (image by ChatGPT)

A Bloom filter is a super-fast, memory-efficient data structure with many use-cases. The Bloom filter answers a simple question: does a set contain a given value? A good Bloom filter can contain 100 million items, use only 77MB of memory and still be lightning fast. It achieves this incredible efficiency by being probabilistic: when you ask if it contains an item, it can respond in two ways: definitely not or maybe yes.

A Bloom filter can either tell you with certainty that an item is not a member of a set, or that it probably is

In this article we’ll find out how a Bloom filter works, how to implement one, and we’ll go through some practical use cases. In the end you’ll have a new tool in your belt to optimize your scripts significantly! Let’s code!


Before we begin..

This article explores the mechanics of a Bloom Filter and provides a basic Python implementation to illustrate its inner workings in 6 steps:

  1. When to use a Bloom filter? Characteristics and use cases
  2. How does a Bloom filter work? a non-code explanation
  3. How do you add values and check for membership?
  4. How can I configure a Bloom filter?
  5. What role do hash functions play?
  6. Implementing a Bloom filter in Python.

The code resulting from this article is more educational than efficient. If you are looking for an optimized, memory-efficient and high-speed Bloom Filter check out bloomlib; a super-fast, easy-to-use Python package that offers a Bloom Filters, implemented in Rust. More info here.

pip install bloomlib

1. When to use a Bloom filter?

Bloom filter are very useful in situations where speed and space are at a premium. This is very much the case in data science but also in other situations when dealing with big data. Imagine you have a dictionary application. Each time a user submits a word, we execute an expensive database query. Notice that this also happens when the user makes a typo e.g.

Adding 100 million strings only used 77MB of memory

In this case we can use a Bloom filter to severely limit the number of queries we have to execute by first checking if the word actually exists. Since our Bloom filter is very memory efficient we can simply add all words from the English dictionary. This way we can quickly check if a submitted word exist, avoiding the expensive database query if it doesn’t!

I’ve tested bloomlib by adding 100 million strings and found it only used 77MB of memory (5% false positive rate). By comparison, the Oxford Dictionary only contains 273,000 words (366x fewer).

Applying Python multiprocessing in 2 lines of code
When and how to use multiple cores to execute many times faster

Characteristics

Bloom filters have a unique combination of characteristics:

  1. Space & Memory Efficient: Bloom filters require much less space than other data structures like hash tables or sets.
  2. Super-fast: Adding new items or lookups are extremely fast, which is beneficial for performance-critical applications.
  3. Scalable: The filter handles large numbers of elements, making it ideal for big data applications.
  4. Portable: the filter can be serialized and saved. This way the state of the Bloom filter can be preserved over system re-boots or distributed.

Practical Use Cases

The Bloom filter’s unique combination of characteristics make them very convenient in many cases:

  1. Data Deduplication: They are used to identify unique items in massive datasets, helping in data cleaning processes by quickly filtering out duplicates, thus saving on processing time and storage.
  2. Big Data Processing: Hadoop, Apache Spark, and other big data frameworks use Bloom filters to optimize data processing.
  3. Database Queries: Helps in quickly querying whether a record exists in a database without actually querying the database.
  4. Cache Filtering: Before querying a slow cache, a Bloom filter can be used to check if the item is probably not in the cache, saving time.
  5. Spell Checkers: Used to quickly check if a word is in a dictionary.
  6. Network Applications: Used in network routers to quickly decide if a URL is in a list of malicious URLs.
Python: __init__ is NOT a constructor: a deep dive in Python object creation
Tinkering with Python’s constructor to create fast, memory-efficient classes

2. How Does it Work?

A Bloom filter doesn’t store the whole object but a simpler representation. Think of it as not storing 3-dimensional objects but a photograph. We can store far more 2d photographs than 3d objects. The downside is that some information gets lost: “photographed” from the top, the silhouettes of a sky scraper and pyramid might look the same; a square.

In order to achieve the space efficiency, The Bloom filter simplifies objects by applying different hash functions to it and then store the results of these hash functions. It first takes a large array of bits that are all set to 0 initially. When adding an item to the filter, it is processed by multiple hash functions, each resulting in an integer that will serve as an index for our bit array. Bits at these indices are set to 1.

To check if an item is in the set, you run it through the same hash functions and see if all the bits at these positions are 1. If they are, there is a probability that the item is contained if not, it’s definitely not in the set. Later in this article we’ll take a closer look at how this works in a practical example.

What is the difference between UNION and JOIN in SQL?
5 minute guide to UNION, EXCEPT and INTERSECT in SQL

False Positives and configuration

A false positive is when the Bloom Filter tells you that it may contain the item while it does not. In a probabilistic data structure like the bloom filter this can happen because different items can result in the same hashed values.

As we’ll see later on in this article, we can “set” the rate at which false positives occur by adjusting the length of the bit array and the number of times we hash a value. Increasing either will lower the chance of false positives but also have consequences for memory usage and speed.


3. Adding and getting values — Demo example

We’ll create a new Bloom filter by initializing a bit array. This simple example has 5 bits, all set to 0 initially:

[0 0 0 0 0]

Adding a value

In order to add the string "some_value" to the filter, we need to apply a number of hash functions to it; in this example we’ll use two. Applying the first hash function to "some_value” results in the integer 3. We’ll use this as an index and set the corresponding bit in our bit array to 1.

Next we’ll apply the 2nd hash function to "some_value", resulting into 12. Since 12 is larger than the length of our bit array we’ll have to modulo it: 12 % 5 = 2, resulting in our index 2. We’ll use this to set the bit at index 2:

value gets hashed to 3 and 4, which we’ll use as an index to flip a bit in our bit array (image by author)

Adding a second value

Let’s add a second item to our bit array: "other_value”. Applying the two hash functions results in indices 0 and 3. We only have to set the bit with index 0 since the bit with index 3 is already set:

value hashes result in 0 and 3 so we’ll set the bits at those indices to 1 (image by author)
Cython for absolute beginners: 30x faster code in two simple steps
Easy Python code compilation for blazingly fast applications

Performing a lookup

We want to check if "unknown_value” is contained in the Bloom filter. When we run it through the hash functions we’ll end up with indices 2 end 4. Not all bits at these indices are set to 1 so we know for sure that our Bloom filter doesn’t include "unknown_value”.

At least one of the indices from the resulting hash functions correspond with a non-set bit: value is not contained (image by author)

Collisions — False Positives

Next we’ll look up "a_new_value” in the Bloom filter. Running it through the hash functions gives us 22 and 23. After applying the modulo we end up with indices 2 and 3. Coincidentally, these are the exact same indices that "some_value” gave us.

Value hashes coincidentally correspond with bit indices that are set. There is a chance this value is contained (image by author)

We must conclude that there is a a probability that "a_new_value” is contained in the Bloom filter. This is what makes the Bloom filter probabilistic.

Why Python is so slow and how to speed it up
Take a look under the hood to see where Python’s bottlenecks lie

4. Configuring my Bloom filter

When configuring your Bloom filter you want to aim to minimize the rate of false positives to a level that’s acceptable for our particular application. There are two ways you can calibrate your bloom filter:

  • the length of the bit array (number of bits)
  • the number of hash functions to use

Size of the bit array

More bits means more space to distribute the elements; when elements are hashed to a larger space, the chances that two different elements will have the same set of hashed positions decreases. The formula for determining the number of required bits looks like this:

Calculate the required number of bits

The simple explanation of this formulat: given the desired fp-rate (ε) and the expected number of items (n), the required number of bits (m) are returned. In Python this formula is implemented as follows:

def optimal_memory_size(n:int, p:float) -> int: 
    """ Return the required size of bit array(m) 
        m = -(n * lg(p)) / (lg(2)^2) 
    """ 
    m = -(n * math.log(p))/(math.log(2)**2) 
    return int(m)

Number of hash functions to use

Increasing the number of hash functions results in more precision since it increases the number of bits to set for each element. On the other hand, using more hash functions also increases the chance of settings the same bits for different elements, especially if the bit fills up.

Calculating the optimal number of hash functions

The formula above shows us how to find the optimal number of hash functions: given the size of the bit array (m) and the expected number of items (n) we find the required number of hash functions. In Python this is implemented as follows:

def optimal_n_hashes(self, m:int, n:int) -> int: 
    """ 
    Return the number of required hash functions 
        k = (m/n) * lg(2) 
    """ 
    k = (m/n) * math.log(2) 
    return int(k)
Should You Use Slots? How Slots Affect Your Class, and When and How to Use Them
One line of code for a 20% performance increase?

5. How to design the hash functions

Hashing is an important part of the Bloom filter. Applying the hash functions will translate our “complex” object to a “simple” one in the same way that “some_value” was translated to indices 3 and 4 in the previous chapter. Implementing hash functions can be very simple:

def hash_item(item: str, num_hashes:int) -> Iterable[int]: 
    """ Hashes item, yields bit array indices """ 
    for i in range(num_hashes): 
        yield hash(f"{item}{i}") % self.size

The code above works as follows: it loops through the range of the required number of hashes to apply. It then applies Python’s built-in hash function to the item item after suffixing the index of the num_hashes (i). You use the function like this:

for index in _hashes(item="my_value", num_hashes=6, bitarray_size=10_000): 
    print("index: ", index)  # prints each index 
 
# or 
 
print([i for i in _hashes(item="my_value", num_hashes=6, bitarray_size=10_000)]) 
# "my_value" is hashed to 6 indices: [2473, 5047, 71, 9362, 8558, 9368]

Instead of suffixing the item with an index, you can also use 6 different kinds of hashing algorithms. Just take into account that the hash functions should be independent, uniformly distributed and as fast as possible. A great example is murmur.


6. Python implementation

Now that we know how to calculate the optimal number of bits and hashes, and we know how to hash, it’s time to implement our Bloom Filter.

Remember that this implementation is purely for understaning how a simplified Bloom Filter works and is far from optimal: it can only handle strings for example. If you are looking for the faster Python bloom filter package; check out BloomLib.
class BloomFilter: 
    expected_number_of_items: int 
    desired_fp_rate: float 
    size: int 
    num_hashes: int 
    bit_array: list[int] 
 
    def __init__(self, expected_number_of_items: int, desired_false_positive_rate: float) -> None: 
        self.expected_number_of_items: int = expected_number_of_items 
        self.desired_fp_rate: float = desired_false_positive_rate 
        self.size: int = self._calculate_optimal_size(expected_number_of_items, desired_false_positive_rate) 
        self.num_hashes: int = self._calculate_optimal_num_hashes(self.size, expected_number_of_items) 
        self.bit_array: list[int] = [0] * self.size 
 
    def _calculate_optimal_size(self, n: int, p: float) -> int: 
        m = -(n * math.log(p)) / (math.log(2) ** 2) 
        return int(m) 
 
    def _calculate_optimal_num_hashes(self, m: int, n: int) -> int: 
        k = (m / n) * math.log(2) 
        return int(k) 
 
    def hash_item(self, item: str) -> typing.Iterable[int]: 
        """ Hashes item, yields bit array indices """ 
        for i in range(self.num_hashes): 
            yield hash(f"{item}{i}") % self.size 
 
    def add(self, item: str) -> None: 
        """ Takes the outputs of the hash functions and sets the corresponding bits in the bit array """ 
        for hash_value in self.hash_item(item): 
            self.bit_array[hash_value] = 1 
 
    def contains(self, item: str) -> bool: 
        """ Checks if all bit-indices from the hash functions are set to 1 in the bit array; only then the item may be contained """ 
        for hash_value in self.hash_item(item=item): 
            if not self.bit_array[hash_value]: 
                return False 
        return True

The add and contains functions are new; they use the hash_item method and either set or check the resulting indices. We can see our bloom filter in action like this:

expected_number_of_items = 1000 
desired_fp_rate = 0.05  
 
bloom = BloomFilter(expected_number_of_items, desired_fp_rate) 
 
# Add some values 
bloom.add(item="apple") 
bloom.add(item="banana") 
 
# Lookup values 
print("apple:", bloom.contains("apple"))    # Should return True 
print("banana:", bloom.contains("banana"))  # Should return True 
print("cherry:", bloom.contains("cherry"))  # Might return False, but can return True (false positive)
Python args, kwargs, and All Other Ways to Pass Arguments to Your Function
Expertly design your function parameters in 6 examples

Conclusion

In this article we took a deep dive into Python object creation, learnt a lot about how and why it works. Then we looked at some practical examples that demonstrate that we can do a lot of interesting things with our newly acquired knowledge. Controlling object creation can enable you to create efficient classes and professionalize your code significantly.

To improve your code even further, I think the most important part is to truly understand your code, how Python works and apply the right data structures. For this, check out my other articles here or this this presentation.

I hope this article was as clear as I hope it to be but if this is not the case please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics like these:

Happy coding!

— Mike

P.S: like what I’m doing? Follow me!

Mike Huls - Medium
Read writing from Mike Huls on Medium. I'm a full-stack developer with a passion for programming, technology and…