Snippet: Central Limit Theorem

A small demonstration of the normal distribution arises from the aggregation of many, many sample averages.

I originally created this simulation to create some GIFs demonstrating how the normal distribution arises from many, many sample means. Very Normal is the name of the Substack I write for, and it tries to illustrate statistical concepts without diving too much into the underlying theory. I thought it would be good to make my animating code transparent here as well.

The Lindeberg-Levy version of the Central Limit Theorem (aka the classical one) states that for a sequence of \(n\) independent, identically distributed random variables \(\{ X_1, ..., X_n \}\) with finite mean \(\mu\) and variance \(\sigma^2\), the following random variable has a normal distribution:

\[ \sqrt{n}( \bar{X}_n - \mu) \leadsto N(0, \sigma^2) \]

In plain terms, the sample average \(\bar{X}_n\) after being mean-centered and scaled by \(\sqrt{n}\) has a normal distribution with zero-mean and variance \(\sigma^2\).

Thanks to properties of normal distributions, we also know the sample average has the following distribution:

\[ \bar{X}_n \leadsto N(\mu, \frac{\sigma^2}{n}) \]

With the notation out of the way, here’s the code I used to develop the GIFs.


# Population of Very Normal Land
population = rep(c(1, 2, 3, 4, 5), each = 20)
frames = 300 # frame numbers for GIF

# Function for creating underlying data to animate
createDataForGIF = function(n, frames) {
  # Initialize structures to hold entire dataset
  data = tibble()
  averages = c()
  # For each frame, calculate a new sample average to add to histogram
  for (i in 1:frames) {
    s = sample(population, size = n, replace = FALSE) %>% mean
    averages = c(averages, s)
    # This dataset represents one specific frame in the dataset
    d = tibble(
      avgs = averages,
      n_samp = i
    # Aggregate into the greater data
    data = bind_rows(data, d)

# Create GIF dataset based on a sample size of 20
sampsize20 = createDataForGIF(n = 20, frames = frames)

# Create an overall ggplot of the data which will be split up in the GIF
sampsize20viz = sampsize20 %>% 
  ggplot() +
  aes(x = avgs) + 
  geom_histogram(fill = "#00A08A", color = "black", bins = 30) + 
  theme_minimal() + 
  xlim(0, 6) +
    x = "Calculated Sample Average",
    y = "Number of data collectors who got a particular average"

# Store the animations to a GIF
gif20 = sampsize20viz + 
                    transition_length = 0.01,
                    state_length = 0.5) +
  transition_time(n_samp) +
  ggtitle('Number sample averages accumulated (n = 20): {frame}')

# By default, gganimate stops at 100 frames, so we need to animate it and
# explicitly define how many frames should be in GIF
animate(a1, nframes = frames)

# Save the GIF to local (automatically saves last animation created above)

Running the above code produces the following GIF (colors may differ):

Here’s another run for a sample size of 3. Notice the resulting sampling distribution is wider and has a smaller height than the GIF above.