If that does not ring a bell, here’s the link to wiki article on Bloom Filter
In this post, I discuss about Bloom filter variants, Applications of Bloom filters, and my project on improving the performance of Standard Bloom Filter.
40 years have passed since this probabilistic data structure- “Bloom filter ” was conceived, and even now papers on variations of Bloom filters are accepted in top-notch conferences like SIGCOMM and SIGMETRICS .An off-the cuff comment by one of my professors stuck on in my mind -” There still are new data structures to be discovered“- which is really, really intriguing
A Google search on “Bloom filter survey papers”, and similar searches brings up heavily-cited survey papers 2003 like “Network Applications of Bloom Filters: A Survey” – or nice presentations like this but a LOT of time has passed since then!
Bloom Filter -Variants
The Standard Bloom filter has these characteristics:
- No deletion operation
- Use the same group of ‘k’ hash functions for all elements insertion/lookup
There are numerous variations of the standard Bloom filter, and there are plenty of well-cited papers on the same.
Let me try to understand the points of variation that can affect the Design and Implementation of a Bloom filter by listing out the factors: Please do comment if you have a better idea to organize points of variation of Bloom filters- and the use-case of those variants.
Parameters of Bloom Filter which could vary:
a) Static /dynamic size of Bit vector,number of hash functions
b) Single vs Multiple, Parallel Bloom Filters
c) Single vs Multiple group of hash functions (Partitioned Hashing)
d) Standard/Counting bloom filters (Bit vector entry is a counter)
d) Single Threaded vs Multi-threaded Bloom filter (should be relevant for counting bloom filter- with the deletion operation)
e) Cache aware/Cache oblivious( memory hierarchy optimized)
f) Variations in Number and type of Hash functions used
g) Set membership vs Multiset membership (Spectral Bloom filter)
h)Bloom filter on Single core vs Multi-core (parallelize hash functions? )
i) Bloom filter-memory access – distributed/shared/not shared
j) Bloom filter with TTL (Time to live) for data inserted, storing ,retrieving based on time cluster
k) Bloom filter variants with dynamic bounds on number of elements stored
l) Storage of data inserted associated with the Bloom filter – Single copy/ Multiple copy
m)Bloom filter variants with bounds on false-postive rate.
From performance standpoint, these are the main factors:
a)False positive rate
b)Memory access time
c)Hash function time /space complexity
In my semester long project, the main reference work were these two papers-:
a) Fast hash table lookup using extended bloom filter: an aid to network processing - (2005 SIGCOMM paper)
b) Building high accuracy bloom filters using partitioned hashing (2007 SIGMETRICS paper)
I worked on improving the False positive rate, and Memory Access time of the Standard Bloom filter.
For false positive rate, it was by using partitioned hashing, and reducing entropy. For memory access time, it involved implementation of schemes, where single/multiple copies of the data were stored in secondary storage.For further details, the report is over here
On Application of Bloom filters-Where do you use them?
I find it both intellectually stimulating and highly irritating when I end up spending time reading literature on clever variations of Bloom filters- like a cache efficient bloom filter , that would perform better than a standard bloom filter )- when I want to get a solid understanding, and really want to KNOW, at a design level, when Bloom filter would be a good choice, what are it’s variants, when to use them, when not to use them, how to implement them, and what can you do with them! Many online sources lack depth, and many research papers have too much emphasis on the math-so it’s actual usage is less clear
As the crux of the Bloom filter is that it does NOT have false-negative, it is used to perform a quick look up if the key is present or not, and if so, retrieval is done from slower/costlier secondary data storage. This could be anything-a Hashtable, database, etc
At low level- there are two prominent use cases-which I have elaborated below.
Check out this paper: Fast packet classification using bloom filter
Packet classification at line speed is an Engineering design problem. A really great book to learn more about this is Network Algorithmics by George Varghese, Professor at UCSD . To those new to packet classification- If you have 200 million packets arriving per second at a router, and say, you have a Rule table fwith IP prefix of the form (Source IP, Destination IP,source TCP PORT, Destionation TCP Port) , which specifies what to do with each packet(drop,forward,etc), how do you do achieve this with least latency,best memory requirement,least cost? Hardware parallelism? Bit Tries? Hash Tables?Turns out Bloom filter is used to improve naive cross-product algorithm.
Check out this paper: Avoiding the disk bottleneck in the data domain deduplication file system, published in 2008.
The basic idea is this: If you can break files into segments, and keep track of segments which are identical at the binary level ,then you can compress the file storage. This is referred to as ” Identical Segment Deduplication”. We know that linux file system maintains an ‘inode cache’, in memory, to offset slow reading of inodes from the disk.Going deep into Data domain’s implementation is difficult to summarize- They use a Bloom filter to store segments already read. So, with this,they could potentially avoid costly disk look-ups.
Google’s BigTable used Bloom filters, and so does HBase- it’s open source version. Here’s a link to Bloom filter documentation HBase docs
I spent sometime researching about the use of Bloom filter: Here’s a bunch of interesting links:
- Stackoverflow: This thread, and there would be a bunch of others too-as the community is very active!
- This post from Rapleaf was an interesting read.
To summarize what Rapleaf did- They had three components – a hash table with 40 million keys, a database with those keys, and 500 million keys which needed to be verified if it was in the 40 million keys. So, they first inserted the Hash table entries into a bloom filter. Then,for each of those 500 million entries, they verified if it was present in the bloom filter. Since Bloom filters give false -positive,a check had to be made for the Database.
- This is about how Chrome uses Bloom filter)
- This is an Assembly level Bloom filter!
I haven’t touched upon the Maths behind performance analysis- average/ best case/ ,else the scope of this blog would spiral out of proportion!
From my experience, reading the performance analysis section of papers on Bloom filter variants, is definitely the way to go- even if it might make you tear your hair
I am very curious to know more about the application of Bloom filters, and it’s variants- Please do comment, and let me know if you have something interesting to share!