# How to Store and Query 100 Million Items Using Just 77MB with Python Bloom Filters

Perform lightning-fast, memory efficient membership checks in Python with this need-to-know data structure

A Bloom filter is a super-fast, memory-efficient data structure with many use-cases. The Bloom filter answers a simple question: does a set contain a given value? A good Bloom filter can contain 100 million items, use only 77MB of memory and still be lightning fast. It achieves this incredible efficiency by being probabilistic: when you ask if it contains an item, it can respond in two ways: **definitely not** or **maybe yes**.

A Bloom filter can either tell youwith certaintythat an item isnota member of a set, or that itprobably is

In this article we’ll find out how a Bloom filter works, how to implement one, and we’ll go through some practical use cases. In the end you’ll have a new tool in your belt to optimize your scripts significantly! Let’s code!

### Before we begin..

This article explores the mechanics of a Bloom Filter and provides a basic Python implementation to illustrate its inner workings in 6 steps:

- When to use a Bloom filter? Characteristics and use cases
- How does a Bloom filter work? a non-code explanation
- How do you add values and check for membership?
- How can I configure a Bloom filter?
- What role do hash functions play?
- Implementing a Bloom filter in Python.

The code resulting from this article is more educational than efficient. If you are looking for an optimized, memory-efficient and high-speed Bloom Filter check out **bloomlib**; a super-fast, easy-to-use Python package that offers a Bloom Filters, implemented in Rust. More info **here**.

`pip install bloomlib`

### 1. When to use a Bloom filter?

Bloom filter are very useful in situations where speed and space are at a premium. This is very much the case in data science but also in other situations when dealing with big data. Imagine you have a dictionary application. Each time a user submits a word, we execute an **expensive database query**. Notice that this also happens when the user makes a typo e.g.

Adding 100 million strings only used 77MB of memory

In this case we can use a Bloom filter to severely limit the number of queries we have to execute byfirst checking if the word actually exists. Since our Bloom filter is very memory efficient we can simply add all words from the English dictionary. This way we canquickly checkif a submitted wordexist,avoiding the expensive database queryif it doesn’t!

*I’ve tested **bloomlib **by adding 100 million strings and found it only used 77MB of memory (5% false positive rate). By comparison, the Oxford Dictionary only contains 273,000 words (366x fewer).*

#### Characteristics

Bloom filters have a unique combination of characteristics:

**Space & Memory Efficient**: Bloom filters require much less space than other data structures like hash tables or sets.**Super-fast**: Adding new items or lookups are extremely fast, which is beneficial for performance-critical applications.**Scalable**: The filter handles large numbers of elements, making it ideal for big data applications.**Portable**: the filter can be serialized and saved. This way the state of the Bloom filter can be preserved over system re-boots or distributed.

#### Practical Use Cases

The Bloom filter’s unique combination of characteristics make them very convenient in many cases:

**Data Deduplication**: They are used to identify unique items in massive datasets, helping in data cleaning processes by quickly filtering out duplicates, thus saving on processing time and storage.**Big Data Processing**: Hadoop, Apache Spark, and other big data frameworks use Bloom filters to optimize data processing.**Database Queries**: Helps in quickly querying whether a record exists in a database without actually querying the database.**Cache Filtering**: Before querying a slow cache, a Bloom filter can be used to check if the item is probably not in the cache, saving time.**Spell Checkers**: Used to quickly check if a word is in a dictionary.**Network Applications**: Used in network routers to quickly decide if a URL is in a list of malicious URLs.

### 2. How Does it Work?

A Bloom filter doesn’t store the whole object but a simpler representation. Think of it as not storing 3-dimensional objects but a photograph. We can store far more 2d photographs than 3d objects. The downside is that some information gets lost:** “photographed” from the top, the silhouettes of a sky scraper and pyramid might look the same; a square**.

In order to achieve the space efficiency, The Bloom filter **simplifies** objects by applying different hash functions to it and then store the results of these hash functions. It first takes a large array of bits that are all set to 0 initially. When adding an item to the filter, it is processed by multiple hash functions, each resulting in an integer that will serve as an index for our bit array. Bits at these indices are set to 1.

To check if an item is in the set, you run it through the same hash functions and see if all the bits at these positions are 1. If they are, there is a ** probability **that the item is contained if not, it’s

*definitely**not*in the set. Later in this article we’ll take a closer look at how this works in a practical example.

#### False Positives and configuration

A false positive is when **the Bloom Filter tells you that it may contain the item while it does not**. In a probabilistic data structure like the bloom filter this can happen because different items can result in the same hashed values.

As we’ll see later on in this article, we can “set” the rate at which false positives occur by adjusting the *length of the bit array* and the *number of times we hash *a value. Increasing either will lower the chance of false positives but also have consequences for memory usage and speed.

### 3. Adding and getting values — Demo example

We’ll create a new Bloom filter by initializing a bit array. This simple example has 5 bits, all set to 0 initially:

`[0 0 0 0 0]`

#### Adding a value

In order to add the string `"some_value"`

to the filter, we need to apply a number of hash functions to it; in this example we’ll use two. Applying the **first hash** **function **to `"some_value”`

results in the integer `3`

. We’ll use this as an index and set the corresponding bit in our bit array to 1.

Next we’ll apply the **2nd** **hash function **to `"some_value"`

, resulting into `12`

. Since 12 is larger than the length of our bit array we’ll have to modulo it: `12 % 5 = 2`

, resulting in our index `2`

. We’ll use this to set the bit at index `2:`

#### Adding a second value

Let’s add a second item to our bit array: `"other_value”`

. Applying the two hash functions results in indices `0`

and `3`

. We only have to set the bit with index `0`

since the bit with index `3`

is already set:

#### Performing a lookup

We want to check if `"unknown_value”`

is contained in the Bloom filter. When we run it through the hash functions we’ll end up with indices `2`

end `4`

. Not all bits at these indices are set to 1 so we know **for sure** that our Bloom filter doesn’t include `"unknown_value”`

.

#### Collisions — False Positives

Next we’ll look up `"a_new_value”`

in the Bloom filter. Running it through the hash functions gives us `22`

and `23`

. After applying the modulo we end up with indices `2`

and `3`

. Coincidentally, these are the exact same indices that `"some_value”`

gave us.

We must conclude that there is a **a probability** that `"a_new_value”`

is contained in the Bloom filter. This is what makes the Bloom filter probabilistic.

### 4. Configuring my Bloom filter

When configuring your Bloom filter you want to aim to **minimize the rate of false positives to a level that’s acceptable for our particular application**. There are two ways you can calibrate your bloom filter:

- the length of the bit array (number of bits)
- the number of hash functions to use

#### Size of the bit array

More bits means more space to distribute the elements; when elements are hashed to a larger space, the chances that two different elements will have the same set of hashed positions decreases. The formula for determining the number of required bits looks like this:

The simple explanation of this formulat: given the **desired fp-rate ***(ε)* and the **expected number of items*** (n)*, the **required number of bits*** (m) *are returned. In Python this formula is implemented as follows:

```
def optimal_memory_size(n:int, p:float) -> int:
""" Return the required size of bit array(m)
m = -(n * lg(p)) / (lg(2)^2)
"""
m = -(n * math.log(p))/(math.log(2)**2)
return int(m)
```

#### Number of hash functions to use

Increasing the number of hash functions results in more precision since it increases the number of bits to set for each element. On the other hand, using more hash functions also increases the chance of settings the same bits for different elements, especially if the bit fills up.

The formula above shows us how to find the optimal number of hash functions: given the **size of the bit array** *(m)* and the **expected number of items*** (n)* we find the* ***required number of hash functions**. In Python this is implemented as follows:

```
def optimal_n_hashes(self, m:int, n:int) -> int:
"""
Return the number of required hash functions
k = (m/n) * lg(2)
"""
k = (m/n) * math.log(2)
return int(k)
```

### 5. How to design the hash functions

Hashing is an important part of the Bloom filter. Applying the hash functions will translate our “complex” object to a “simple” one in the same way that `“some_value”`

was translated to indices `3`

and `4`

in the previous chapter. Implementing hash functions can be very simple:

```
def hash_item(item: str, num_hashes:int) -> Iterable[int]:
""" Hashes item, yields bit array indices """
for i in range(num_hashes):
yield hash(f"{item}{i}") % self.size
```

The code above works as follows: it loops through the range of the required number of hashes to apply. It then applies Python’s built-in `hash`

function to the item `item`

after suffixing the index of the num_hashes (`i)`

. You use the function like this:

```
for index in _hashes(item="my_value", num_hashes=6, bitarray_size=10_000):
print("index: ", index) # prints each index
# or
print([i for i in _hashes(item="my_value", num_hashes=6, bitarray_size=10_000)])
# "my_value" is hashed to 6 indices: [2473, 5047, 71, 9362, 8558, 9368]
```

Instead of suffixing the item with an index, you can also use 6 different kinds of hashing algorithms. Just take into account that the hash functions should be independent, uniformly distributed and as fast as possible. A great example is murmur.

### 6. Python implementation

Now that we know how to calculate the optimal number of bits and hashes, and we know how to hash, it’s time to implement our Bloom Filter.

Remember that this implementation is purely for understaning how a simplified Bloom Filter works and is far from optimal: it can only handle strings for example. If you are looking for the faster Python bloom filter package; check outBloomLib.

```
class BloomFilter:
expected_number_of_items: int
desired_fp_rate: float
size: int
num_hashes: int
bit_array: list[int]
def __init__(self, expected_number_of_items: int, desired_false_positive_rate: float) -> None:
self.expected_number_of_items: int = expected_number_of_items
self.desired_fp_rate: float = desired_false_positive_rate
self.size: int = self._calculate_optimal_size(expected_number_of_items, desired_false_positive_rate)
self.num_hashes: int = self._calculate_optimal_num_hashes(self.size, expected_number_of_items)
self.bit_array: list[int] = [0] * self.size
def _calculate_optimal_size(self, n: int, p: float) -> int:
m = -(n * math.log(p)) / (math.log(2) ** 2)
return int(m)
def _calculate_optimal_num_hashes(self, m: int, n: int) -> int:
k = (m / n) * math.log(2)
return int(k)
def hash_item(self, item: str) -> typing.Iterable[int]:
""" Hashes item, yields bit array indices """
for i in range(self.num_hashes):
yield hash(f"{item}{i}") % self.size
def add(self, item: str) -> None:
""" Takes the outputs of the hash functions and sets the corresponding bits in the bit array """
for hash_value in self.hash_item(item):
self.bit_array[hash_value] = 1
def contains(self, item: str) -> bool:
""" Checks if all bit-indices from the hash functions are set to 1 in the bit array; only then the item may be contained """
for hash_value in self.hash_item(item=item):
if not self.bit_array[hash_value]:
return False
return True
```

The `add`

and `contains`

functions are new; they use the `hash_item`

method and either set or check the resulting indices. We can see our bloom filter in action like this:

```
expected_number_of_items = 1000
desired_fp_rate = 0.05
bloom = BloomFilter(expected_number_of_items, desired_fp_rate)
# Add some values
bloom.add(item="apple")
bloom.add(item="banana")
# Lookup values
print("apple:", bloom.contains("apple")) # Should return True
print("banana:", bloom.contains("banana")) # Should return True
print("cherry:", bloom.contains("cherry")) # Might return False, but can return True (false positive)
```

### Conclusion

In this article we took a deep dive into Python object creation, learnt a lot about how and why it works. Then we looked at some practical examples that demonstrate that we can do a lot of interesting things with our newly acquired knowledge. Controlling object creation can enable you to create efficient classes and professionalize your code significantly.

To improve your code even further, I think the most important part is to truly understand your code, how Python works and apply the right data structures. For this, check out my **other articles here**** **or this **this presentation**.

I hope this article was as clear as I hope it to be but if this is not the case please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics like these:

- Git for absolute beginners: understanding Git with the help of a video game
- Create and publish your own Python package
- Create a fast auto-documented, maintainable, and easy-to-use Python API in 5 lines of code with FastAPI

Happy coding!

— Mike

*P.S: like what I’m doing? **Follow me**!*