Bloom filters
Bloom filters is a probabilistic data structure, which is space efficient. Bloom filters can be used to quickly check if a value don’t exists or might exists in a set, false positives are possible(with a low likelihood) but false negatives are not possible. The time to check if an element exsist or add an element is also constant O(k), where k is the number of hash functions(we will covert this later).
How does it works?
Bloom filters build upon k
hash functions and an bit array of length m
. Each value is hashed k
times, where each hash is converted to a position in the bit array, and these bits are set to positive(if they are already positive they are not swapped). In order to check if a value exists it is hashed using the k
hashing functions, if all the bits are set to 1 the value exists(or we have a false positive).
Example:
$$ x = “Thomas” \ f^1(x) = 79 \ f^2(x) = 1 \ f^k(x) = 1000 \ $$
The bits are set to 1 for each of the hash values.
Implementation
The implementation is done in GO in this case, in this case I implemented it as a struct with three methods:
- hash(Not exported) - for hashing the input values
- Add(exported) - adding a value to the bit map
- Check(exported) - check if a value exists in the filter
The struct is defined as:
type bloom struct {
hashes int // nbr of hash functions
maxVal uint64 // the length of the bit array(bool array in this case)
hashDifferentiations []string // trick to create n hash functions
h hash.Hash // the base hash functions
bloomFilter []bool // the "bit array" which stores the hash values
}
to initialize a bloom filter a New
function is created also to avoid having to export the struct it self.
func New(hashes int, cardinality uint64) *bloom {
bloomFilter := []bool{}
for i := 0; i < int(cardinality); i++ {
bloomFilter = append(bloomFilter, false)
}
return &bloom{
hashes: hashes,
maxVal: cardinality,
hashDifferentiations: []string{"a", "b", "c ", "d", "e", "f", "g", "h", "i", "j"},
h: sha1.New(),
bloomFilter: bloomFilter,
}
}
The key for a bloom filter is the hash function in this case I emulate k
hash function by making changes to the input in a deterministic way using the hashDifferentiations
, in this case the upper limit of k
is len(hashDifferentiations)
func (b *bloom) hash(input []byte) ([]uint64, error) {
defer b.h.Reset()
out := []uint64{}
if b.hashes > len(b.hashDifferentiations) {
return nil, fmt.Errorf("max hashes is: %v", len(b.hashDifferentiations))
}
for i := 0; i < b.hashes; i++ {
_, err := b.h.Write(append(input, []byte(b.hashDifferentiations[i])...))
if err != nil {
return nil, err
}
out = append(out, binary.BigEndian.Uint64(b.h.Sum(nil))%b.maxVal)
}
return out, nil
}
the hash
function is never exported byt used internally inside Add
and check
. The function to add values to the bloom filters is:
func (b *bloom) Add(input []byte) error {
hashes, err := b.hash(input)
if err != nil {
return fmt.Errorf("hash: %w", err)
}
for _, idx := range hashes {
b.bloomFilter[idx] = true
}
return nil
}
The bit map is emulated as a bool slice(since we don’t know the size are compile time). The final function of interest is the function to check if a value exists(false positives are positive).
func (b *bloom) Check(input []byte) (bool, error) {
hashes, err := b.hash(input)
if err != nil {
return false, fmt.Errorf("hash: %w", err)
}
for _, idx := range hashes {
if !b.bloomFilter[idx] {
return false, nil
}
}
return true, nil
}
The more values we add to the bloom filter the higher is the chance that we get a false positive. Since more and more of the bits in the bit array are positive.
Some tests:
func TestNew(t *testing.T) {
assert.NotNil(t, New(10, 10))
}
func TestHash(t *testing.T) {
t.Run("Single hash", func(t *testing.T) {
bloom := New(10, 10000)
assert.NoError(t, bloom.Add([]byte("Test1")))
})
t.Run("Multiple hashes hash", func(t *testing.T) {
bloom := New(10, 10000)
for i := 0; i < 10; i++ {
assert.NoError(t, bloom.Add([]byte(fmt.Sprintf("Test%v", i))))
}
})
}
func TestCheck(t *testing.T) {
t.Run("Single hash", func(t *testing.T) {
bloom := New(10, 10000)
val := []byte("Test1")
assert.NoError(t, bloom.Add(val))
exsist, err := bloom.Check(val)
assert.NoError(t, err)
assert.True(t, exsist)
})
t.Run("Multiple hashes hash", func(t *testing.T) {
bloom := New(10, 10000)
for i := 0; i < 10; i++ {
val := []byte(fmt.Sprintf("Test%v", i))
assert.NoError(t, bloom.Add(val))
exsist, err := bloom.Check(val)
assert.NoError(t, err)
assert.True(t, exsist)
exsist, err = bloom.Check(append(val, []byte("n")...))
assert.NoError(t, err)
assert.False(t, exsist)
}
})
}