This post originated from an RSS feed registered with Ruby Buzz
by Andrew Johnson.
Original Post: Bloom Filter
Feed Title: Simple things ...
Feed URL: http://www.siaris.net/index.cgi/index.rss
Feed Description: On programming, problem solving, and communication.
I was recently looking up something in Knuth’s TAOCP vol. III, 2nd ed. and I came
across mention of Burton H. Bloom and his data structure for set membership
queries that now carries his name: Bloom filters. This structure has seen
something of resurgence in the past few years, so I thought I’d
contribute my own explanation and implementation.
Superimposed encoding techniques on secondary keys (attributes) had been
around for quite some (manual card filing systems for example) when Bloom,
in 1970, published an article
that basically turned the idea on its side by considering an entire dataset
as one record and the primary keys as the attributes. This provides the
basis of a relatively compact structure (an m-bit field) that allows for
fast membership queries on the primary with a probabilistically controlled
rate of false positives.
What does that mean? Let’s start with something a little more
concrete. We can simulate the stucture with a hole-punch, a strip of paper,
and a Ruby irb shell open (to calculate a few numbers for us).
Let’s assume I want to make a set consisting of the names of myself
and my three brothers: Andrew, Bradford, Gregory, and John. I can take a
strip of paper with 12 small circles drawn on it numbered from 0 to 11
(this is a 12-bit field).
For each name, we will calculate two identifying numbers between 0 and 11
— for the purposes of this little simulation it will suffice to use
Ruby’s String#sum and String#hash methods and use
the modulo operator to reduce these to numbers between 0 and 11. For
example, the two identifying numbers for my name in this situation are:
We generally refer to such operations as hashing operations or
functions — so our example is using two hashing functions. To add my
name to the set, I just use my hole punch to punch out those two circles in
the strip of paper:
And continuing by hashing Gregory, Bradford, and John we get
And after a bit of hole-punching (some holes will have already been punched
and that’s expected) we arrive at this strip of paper which
represents our loaded Bloom filter:
Now, if I want to see if any old Tom, Dick, or Harry is in the Johnson
Gang, I just calculate the 2 hashes for that name, punch out those two
holes in a new strip of paper, then line it up with my master strip (the
Bloom filter) and see if both holes align with existing holes in the master
strip.
As we can see, neither Tom nor Harry are in the gang because at least one
hole in each person’s strip doesn’t match up with a
corresponding hole in the Bloom filter strip. However, Dick’s name
just happens to hash to two holes in my Bloom filter. This is the
false-positive problem mentioned earlier. In fact, using a 12-bit field to
store 4 items by setting two bits per item results in a false positive rate
of approximately 25%. Reducing this false positive rate is accomplished by
increasing the size of the bit-field and utilizing more bits per item. For
example, to get the false positive rate down to 10% we could use a 20-bit
field with 3 hashes, and to reach 1% we’d use a 39-bit field and 7
hashes.
The way the math falls out for optimum performance with a given set size
n and an acceptable rate err of false
positives is to calculate the size of the bit-field and number of hashing
functions as follows:
m = (n * log(err) / log(1.0 / 2**log(2))).ceil
k = (log(2) * m / n).round
Where m is the size of the bit-field and k
is the number of hashes to use. So, for example, given n =
1000 here are a few data points for various error rates just to get a
general idea of the relationships:
err m k
0.25 2886 2
0.10 4793 3
0.01 9586 7
0.001 14378 10
0.0001 19171 13
One of the assumptions is a good set of hashing functions — and MD5
or SHA1 (with various salts) are common choices — for a good random
distribtion of the k-bits per key. However, a recent paper demonstrates that
distributing k-bits using only 2 hashes and a simple cycling algorithm can
achieve results comparable to using k high-quality hashing
functions. This can lead to definite speedups as k grows
and numerous keys are checked against the filter.
My Ruby
implementation utilizes the above mentioned algorithm — the core
of the implementation resides in the private bits_on method which
takes a key and computes two hashes (aka MD5 digests) and then cycles
through k bit positions yielding each position in turn. The add
and has? methods call bits_on and supply blocks to set or
test the bit positions respectively:
def add(*keys)
keys.flatten.each do |key|
raise RangeError, "Too many keys" if (self.nkeys += 1) > capacity
bits_on(key) {|pos| self.bv[pos] = 1}
end
end
alias :insert :add
def has?(key)
bits_on(key){|pos| self.bv[pos] == 1 or return false}
true
end
# yields each bit-position for a given key
def bits_on(key)
pos1, pos2 = salts.map{|s|Digest[s + key.to_s].hex % bits}
hashes.times do |i|
yield pos1
pos1 = (pos1 + pos2) % bits
pos2 = (pos2 + i) % bits
end
end
private :bits_on
On a simple test using a smallish dictionary of n = 38619
words and a false-positive rate of err = 0.0001, this
implementation was better than an order of magnitude faster than the pre-existing Ruby
version or the Perl
version from which it was derived. I’ve run a number of ad-hoc
tests and confirmed that the false positive rate does not appear to be
affected by the dependency introduced in the k-bit positions.
Traditionally, Bloom filters were used in memory constrained applications,
such as a spell-checker when available RAM couldn’t hold a dictionary
of appreciable size. More recently, Bloom filters have found uses in
network routing, cache-summary, and peer-peer applications (see this survey article, or read about loaf for a more
specific example).
Further Resources
Here are a few further resources — my implementation drew heavily
from the C# version given in the literate programming example at the end of
the following links: