The hypergeometric distribution is a discrete probability distribution with three parameters: the population size \(n\), the number of items of interest in the population \(r\), and the sample size \(m\). It gives the probability of getting exactly \(k\) items of interest when sampling \(m\) items, without replacement, from a population of \(n\) items that includes \(r\) items of interest.

Hypergeometric versus Binomial Distributions

The key difference between the hypergeometric and the binomial distributions is the sampling method. In the hypergeometric case, we sample without replacement, so the trials are dependent: the outcome of one trial changes the probabilities for the trials that follow. In the binomial case, the trials are independent because we sample with replacement, so every trial has the same success probability. To illustrate this, consider a bowl with 7 black and 3 white marbles. For the first trial, the probability of picking a black marble is \(\tfrac{7}{10}\). If we pick a black marble and do not replace it, the probability of picking another black marble in the second trial is \(\tfrac{6}{9}\). However, if we replace it, the probability of getting a black marble the second time is still \(\tfrac{7}{10}\).
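
The marble example can be checked numerically. The sketch below uses base R's `dhyper` and `dbinom` to compute the probability of drawing two black marbles in two draws, without and with replacement. The variable names are illustrative and not part of the original example.

```r
# Bowl from the example: 7 black and 3 white marbles, 2 draws
black <- 7
white <- 3
draws <- 2

# Without replacement: P(both black) = 7/10 * 6/9
# dhyper(x, m, n, k): x successes, m success items, n failure items, k draws
p_without <- dhyper(draws, black, white, draws)

# With replacement: P(both black) = 7/10 * 7/10
p_with <- dbinom(draws, size = draws, prob = black / (black + white))

print(c(without_replacement = p_without, with_replacement = p_with))
```

Without replacement the probability is \(\tfrac{7}{15}\), slightly lower than the \(0.49\) obtained with replacement, because removing a black marble makes the next black draw less likely.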

Probability Mass Function

The probability mass function of a hypergeometric random variable \(X\), for \(\max(0,\,m-(n-r)) \le k \le \min(m,\,r)\), is

\[p_X(k)=\dfrac{\displaystyle\binom{r}{k}\binom{n-r}{m-k}}{\displaystyle\binom{n}{m}}.\]
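
As a rough check of the formula, the sketch below writes the probability mass function directly with `choose` and compares it with R's built-in `dhyper`. The helper name `hyper_pmf` and the small example values are assumptions made for illustration. Note that `dhyper` takes its arguments in a different order than the notation used here.

```r
# Hypergeometric PMF written directly from the formula,
# using the document's notation (n, r, m, k)
hyper_pmf <- function(k, n, r, m) {
  choose(r, k) * choose(n - r, m - k) / choose(n, m)
}

# Small illustrative example (values are assumptions for this sketch):
# population of 10 with 7 items of interest, sample of 3
n <- 10
r <- 7
m <- 3
k <- 0:m

manual  <- hyper_pmf(k, n, r, m)
# R's argument order is dhyper(x, m, n, k), which maps to
# x = k, m = r (items of interest), n = n - r (other items), k = m (sample size)
builtin <- dhyper(k, r, n - r, m)

print(cbind(k, manual, builtin))
```

The two columns should agree, and the probabilities should sum to 1 over the possible values of \(k\).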

Expectation

The mean of a hypergeometric distribution is

\[\mathbb{E}[X]=\frac{mr}{n}.\]
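
One way to see where this formula comes from is a short argument with indicator variables (the notation \(I_j\) is introduced here only for this sketch). Let \(I_j=1\) if the \(j\)-th item drawn is an item of interest and \(I_j=0\) otherwise, so that \(X=I_1+\dots+I_m\). By symmetry, each draw is equally likely to be any of the \(n\) items, so \(\mathbb{E}[I_j]=\tfrac{r}{n}\), and linearity of expectation gives

\[\mathbb{E}[X]=\sum_{j=1}^{m}\mathbb{E}[I_j]=m\cdot\frac{r}{n}=\frac{mr}{n}.\]

Linearity of expectation does not require independence, so the argument holds even though the draws are dependent. For example, with \(n=500\), \(r=100\), and \(m=50\), the mean is \(\tfrac{50\cdot 100}{500}=10\).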

Variance

The variance of a hypergeometric random variable is

\[\mathrm{var}(X)=\frac{mr(n-r)(n-m)}{n^2(n-1)}.\]
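
The mean and variance formulas can be sanity-checked by computing both quantities directly from the probability mass function. The sketch below does this with `dhyper`; the parameter values are illustrative, and the binomial variance is included only for comparison.

```r
# Parameters in the document's notation (illustrative values)
n <- 500  # population size
r <- 100  # items of interest in the population
m <- 50   # sample size

# Closed-form mean and variance
mean_formula <- m * r / n
var_formula  <- m * r * (n - r) * (n - m) / (n^2 * (n - 1))

# Same quantities computed directly from the PMF
k <- 0:m
p <- dhyper(k, r, n - r, m)
mean_pmf <- sum(k * p)
var_pmf  <- sum((k - mean_pmf)^2 * p)

# Binomial variance with the same success probability, for comparison
var_binom <- m * (r / n) * (1 - r / n)

print(c(mean_formula = mean_formula, mean_pmf = mean_pmf))
print(c(var_formula = var_formula, var_pmf = var_pmf, var_binom = var_binom))
```

The hypergeometric variance equals the binomial variance \(m\tfrac{r}{n}\left(1-\tfrac{r}{n}\right)\) multiplied by the factor \(\tfrac{n-m}{n-1}\), so it is smaller than the binomial variance whenever \(m>1\).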

Monte Carlo Simulation

library(tidyverse)

# Set the parameters
n <- 500  # Number of items in the bowl
r <- 100  # Number of red items in the bowl
m <- 50   # Number of items to pick

# Create a bowl with n items, r are red
# Identify 1 as red and 0 otherwise
bowl <- sample(rep(c(0,1), c(n-r,r)))

B <- 10000  # Number of replications

# Pick m items from bowl and replicate B times
trials <- replicate(B, {
  items_picked <- sample(bowl, m, replace = FALSE)
  sum(items_picked)
})

# Plot the distribution
data.frame(trials) %>%
  ggplot(aes(trials, y = after_stat(prop))) +
  geom_bar(width = 0.5, color = "dodgerblue", fill = "dodgerblue") +
  labs(x = "Number of Red Items", y = "Probability")
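
To judge how well the simulation matches the theory, the simulated frequencies can be compared with the exact probabilities from `dhyper`. The following is a minimal base R sketch that repeats the simulation in a self-contained way. The seed is arbitrary and is set only so the run is reproducible.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 500    # population size
r <- 100    # items of interest (coded as 1)
m <- 50     # sample size
B <- 10000  # number of replications
bowl <- rep(c(0, 1), c(n - r, r))

# Repeat the sampling experiment B times
trials <- replicate(B, sum(sample(bowl, m, replace = FALSE)))

# Empirical versus theoretical probabilities for each observed count
k <- sort(unique(trials))
empirical   <- as.vector(table(factor(trials, levels = k))) / B
theoretical <- dhyper(k, r, n - r, m)

print(round(cbind(k, empirical, theoretical), 4))

# Empirical versus theoretical mean and variance
exact_mean <- m * r / n
exact_var  <- m * r * (n - r) * (n - m) / (n^2 * (n - 1))
print(c(sim_mean = mean(trials), exact_mean = exact_mean))
print(c(sim_var = var(trials), exact_var = exact_var))
```

With a large number of replications, the empirical column should be close to the theoretical one, and the simulated mean and variance should be close to the closed-form values.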