Some Fun with Hash Table performance analysis

I  was trying to solve a puzzle some  days back- and it  essentially could be mapped to the following question :

Given  a hashtable of bucket size ‘M’ ,  continue inserting  elements into the hash table till all the buckets have atleast  one  value  associated with it. What is the  estimated number of insertions that should be done to achieve this? 

The answer I derived corresponds  with a quick implementation I did  using a random number generator-  While this is not the best way to  infer correctness, a large magnitude of  difference over many trial runs should give of a warning sign.

So,here goes:

Let the size of the hash table =M, and the hash function have normal distribution.

Refreshing the definition of  Estimated value for a random variable:

E(X) : X1*Probability(X1) + X2*Probability(X2)  …. + Xn*Probability(Xn) 

Let Xn  refer to the event of nth insertion in the same bucket.

E(X=2)  :  Expected  Number of insertions  for inserting 2 elements  are in two different buckets

E(X=3) :   Expected Number of insertions for inserting 3 elements in three different buckets, and so on.

The total required number of insertions= E(1) +E(2)+ E(3) ….. E(M)

So, Let the first element be inserted into the hash table.

The second time, Probability of another element going to the same bucket = 1/M

Third time, Probability of another element going to the same bucket in the second insertion=1/M * 1/M.

Continuing,

E(1) =1+ 1/M +1*(1/M^2)  + 1*(1/M^3) + 1*(1/M^4)   till infinity  etc  :

This is an infinte geometric progression,where the sum converges to    1 / ( 1-(1/M) ) =  M/( M -1)

This is for the first element.

Next,we proceed to a state where we have 2 elements in two buckets.

The expected number of insertions ,for  the keys to map on to the same 2 buckets will be :

E(2) =1+ 2/M +  (2/M)*(2/M)  + (2/M)*(2/M)*(2/M) + …. infinity.

The sum of this series= M/M-2

Now, if we continue the same reasoning, and do summation of all the infinite series,we will have

M*(1/M-1 +1/M-2 + 1/M-3 ….  1 )

The term within the summation is = (1+1/2+1/3+1/4 ….. 1/K)  which is greater than  ln(K+1)   

So,the expected number of insertions is approximately MLogM    ( Log to the base ‘e’  where M is the size of the hash table)

One response to “Some Fun with Hash Table performance analysis

  1. Him Please may I get an equation to know the probability of collision attack?
    If I know the plaintext (input) is 100-bits and hash values (output) is 160…

    Please, may I get some help, thanks in advance!

Leave a comment