An interesting aspect of grocery shopping is that it offers tons of opportunities to hone your statistics skills. Here is one such instance I recently encountered. Typically, while picking eggs, I randomly check a few of them to make sure that the box has no bad eggs (i.e., cracked or broken). This led me to two questions: how many eggs do I need to randomly sample in order to be 50% or more confident in a box of one dozen eggs, and how confident am I that there are no bad eggs in a box of 12 if I randomly sample 3 of them?
Intuitively, you can quantify your confidence by the number of samples. For instance, if you pick one egg at random and it turns out to be good, then you are 8.3% (or 1/12) confident. Similarly, if you sample 2 eggs and both turn out to be good, then you are 16.7% confident, and so on. Thus, to have 50% or more confidence you need to pick 6 eggs.
However, I was looking for a more rigorous solution and it wasn’t obvious to me at first. Below I walk through the thought process and the solution I finally managed to find.
Since each random sampling of an egg has two possible outcomes (good egg or bad egg), the first thing that came to my mind was to model the problem as some kind of binomial distribution. The binomial distribution gives the probability of getting k successes in n trials. In this case, a success represents finding a bad egg. If we assume each egg is equally likely to be good or bad, the probability of success (and of failure) is 0.5. However, the binomial distribution describes the number of successes across all n trials, whereas we reject the box as soon as we find the first bad egg (the first success). This kind of problem is better modeled using the geometric distribution.
So I ditched the binomial distribution for the geometric distribution. The geometric distribution gives the probability of observing n-1 failures before getting the first success on the n-th trial. The table below gives the probability of observing the first success for different numbers of trials.
|Number of trials (n)||Geometric Distribution:|
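
The values for this table can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `geometric_pmf` is my own, and it assumes p = 0.5 as in the text.

```python
# Geometric pmf: probability that the first success (bad egg)
# occurs exactly on trial n, assuming independent trials.

def geometric_pmf(n: int, p: float = 0.5) -> float:
    """Return P(first success on trial n) = (1 - p)**(n - 1) * p."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return (1 - p) ** (n - 1) * p


if __name__ == "__main__":
    # p = 0.5 is the assumption made in the text, not a measured value.
    for n in range(1, 13):
        print(f"n = {n:2d}  P = {geometric_pmf(n):.4f}")
```

Each additional trial halves the probability, so the numbers shrink as n grows.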
I guess you can see a problem here. As we increase the number of trials, the probability of finding the bad egg decreases. This is counterintuitive, because if there is a bad egg then the probability of finding it should increase with each random sample. So what's the problem?
Going back to basics, we realize that the geometric distribution makes an important assumption, and it is one that we violate here. The geometric distribution assumes that all trials are independent and identically distributed (iid), i.e., the probability of an egg being bad is the same for every egg and doesn't change from one draw to the next. Let's work through a scenario and see if that holds. Assume there is one bad egg in a box of 12 eggs. Now let's say you randomly pick an egg. The probability of it being the bad egg is 1/12. Assume it turns out to be one of the good eggs, so you place it to the side. Now there are 11 eggs in the box. Again you randomly pick an egg from the box. What's the probability that it is the bad egg? It's 1/11. So the probability of randomly picking the bad egg just went up. Technically, this approach, where the drawn sample is not placed back in the pool, is known as sampling without replacement. As seen above, when sampling without replacement our trials become dependent on each other and thus violate the iid assumption. (In the case of sampling with replacement, the probability of a randomly sampled egg turning out to be bad always remains the same.)
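
The scenario above can be sketched in a few lines of Python. The helper name `conditional_bad_probabilities` is hypothetical, and the defaults (12 eggs, 1 bad) are simply the numbers from the walkthrough. It lists the probability that the next egg is bad, given that every earlier draw was good.

```python
from fractions import Fraction


def conditional_bad_probabilities(total: int = 12, bad: int = 1) -> list[Fraction]:
    """P(next egg is bad | all earlier draws were good), draw by draw.

    Sampling is without replacement, so the pool shrinks by one good egg
    after every draw while the number of bad eggs stays the same.
    """
    if not 1 <= bad <= total:
        raise ValueError("need 1 <= bad <= total")
    # The pool shrinks from `total` eggs down to only the bad ones.
    return [Fraction(bad, remaining) for remaining in range(total, bad - 1, -1)]


if __name__ == "__main__":
    probs = conditional_bad_probabilities()
    print(", ".join(str(p) for p in probs))
```

The probabilities start at 1/12 and grow with every good egg set aside, which is exactly what the iid assumption rules out.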
So now we know that we can't use the geometric distribution, as it requires sampling with replacement and we are interested in sampling without replacement. Searching for a probability mass function that assumes sampling without replacement led me to the Wikipedia page on the hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing m successes. This is exactly what we are trying to do.
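
For reference, the standard probability mass function for this definition, written with the same symbols (k, n, N, m), is:

$$
P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}
$$

In the special case of a single success in the population (m = 1) and k = 1, this simplifies to

$$
P(X = 1) = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.
$$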
In our original problem, the population size is 12 (i.e., N = 12). Let's assume that, in the worst-case scenario, there is only one bad egg (i.e., m = 1). Since we are interested in the probability of finding at least one broken egg in a random sample of n eggs, we can set k = 1 and vary n from 1 to 12 to see the probability of getting at least one broken egg, i.e., P(X = 1). I used the following calculator to calculate the probability distribution for different numbers of trials.
|Number of trials (n)||Hypergeometric Distribution: (N = 12, m = 1)|
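
As a cross-check on the calculator, the same probabilities can be computed directly with Python's standard library. This is a rough sketch under the assumptions in the text (N = 12, m = 1). The function names are my own, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_pmf(k: int, n: int, N: int = 12, m: int = 1) -> float:
    """P(exactly k bad eggs in n draws without replacement)."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def prob_at_least_one_bad(n: int, N: int = 12, m: int = 1) -> float:
    """P(at least one bad egg in n draws) = 1 - P(no bad egg in n draws)."""
    return 1 - comb(N - m, n) / comb(N, n)


if __name__ == "__main__":
    # With m = 1, "at least one bad egg" and "exactly one bad egg" coincide.
    for n in range(1, 13):
        print(f"n = {n:2d}  P = {prob_at_least_one_bad(n):.4f}")
```

With a single bad egg the probability is simply n/12, so it grows linearly with the number of eggs checked. The second function also handles boxes with more than one bad egg (m > 1), where "at least one" and "exactly one" no longer agree.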
So if there is only one bad egg then we will certainly find it in 12 random samples without replacement. And to be 50% certain, we will need to randomly pick 6 eggs without replacement.
It feels so good that at least now I can quantify my confidence while selecting eggs. I usually sample 2-3 eggs, which gives me about 17-25% confidence :-).