I was trying to solve a puzzle some days back- and it essentially could be mapped to the following question :
Given a hashtable of bucket size ‘M’ , continue inserting elements into the hash table till all the buckets have atleast one value associated with it. What is the estimated number of insertions that should be done to achieve this?
The answer I derived corresponds with a quick implementation I did using a random number generator- While this is not the best way to infer correctness, a large magnitude of difference over many trial runs should give of a warning sign.
Let the size of the hash table =M, and the hash function have normal distribution.
Refreshing the definition of Estimated value for a random variable:
E(X) : X1*Probability(X1) + X2*Probability(X2) …. + Xn*Probability(Xn)
Let Xn refer to the event of nth insertion in the same bucket.
E(X=2) : Expected Number of insertions for inserting 2 elements are in two different buckets
E(X=3) : Expected Number of insertions for inserting 3 elements in three different buckets, and so on.
The total required number of insertions= E(1) +E(2)+ E(3) ….. E(M)
So, Let the first element be inserted into the hash table.
The second time, Probability of another element going to the same bucket = 1/M
Third time, Probability of another element going to the same bucket in the second insertion=1/M * 1/M.
E(1) =1+ 1/M +1*(1/M^2) + 1*(1/M^3) + 1*(1/M^4) till infinity etc :
This is an infinte geometric progression,where the sum converges to 1 / ( 1-(1/M) ) = M/( M -1)
This is for the first element.
Next,we proceed to a state where we have 2 elements in two buckets.
The expected number of insertions ,for the keys to map on to the same 2 buckets will be :
E(2) =1+ 2/M + (2/M)*(2/M) + (2/M)*(2/M)*(2/M) + …. infinity.
The sum of this series= M/M-2
Now, if we continue the same reasoning, and do summation of all the infinite series,we will have
M*(1/M-1 +1/M-2 + 1/M-3 …. 1 )
The term within the summation is = (1+1/2+1/3+1/4 ….. 1/K) which is greater than ln(K+1)
So,the expected number of insertions is approximately MLogM ( Log to the base ‘e’ where M is the size of the hash table)