# Snippet: Central Limit Theorem

A small demonstration of the normal distribution arises from the aggregation of many, many sample averages.

I originally created this simulation to create some GIFs demonstrating how the normal distribution arises from many, many sample means. Very Normal is the name of the Substack I write for, and it tries to illustrate statistical concepts without diving too much into the underlying theory. I thought it would be good to make my animating code transparent here as well.

The Lindeberg-Levy version of the Central Limit Theorem (aka the classical one) states that for a sequence of $$n$$ independent, identically distributed random variables $$\{ X_1, ..., X_n \}$$ with finite mean $$\mu$$ and variance $$\sigma^2$$, the following random variable has a normal distribution:

$\sqrt{n}( \bar{X}_n - \mu) \leadsto N(0, \sigma^2)$

In plain terms, the sample average $$\bar{X}_n$$ after being mean-centered and scaled by $$\sqrt{n}$$ has a normal distribution with zero-mean and variance $$\sigma^2$$.

Thanks to properties of normal distributions, we also know the sample average has the following distribution:

$\bar{X}_n \leadsto N(\mu, \frac{\sigma^2}{n})$

With the notation out of the way, here’s the code I used to develop the GIFs.

library(tidyverse)
library(gganimate)

# Population of Very Normal Land
population = rep(c(1, 2, 3, 4, 5), each = 20)
frames = 300 # frame numbers for GIF

# Function for creating underlying data to animate
createDataForGIF = function(n, frames) {

# Initialize structures to hold entire dataset
data = tibble()
averages = c()

# For each frame, calculate a new sample average to add to histogram
for (i in 1:frames) {

s = sample(population, size = n, replace = FALSE) %>% mean
averages = c(averages, s)

# This dataset represents one specific frame in the dataset
d = tibble(
avgs = averages,
n_samp = i
)

# Aggregate into the greater data
data = bind_rows(data, d)

}

data

}

# Create GIF dataset based on a sample size of 20
sampsize20 = createDataForGIF(n = 20, frames = frames)

# Create an overall ggplot of the data which will be split up in the GIF
sampsize20viz = sampsize20 %>%
ggplot() +
aes(x = avgs) +
geom_histogram(fill = "#00A08A", color = "black", bins = 30) +
theme_minimal() +
xlim(0, 6) +
labs(
x = "Calculated Sample Average",
y = "Number of data collectors who got a particular average"
)

# Store the animations to a GIF
gif20 = sampsize20viz +
transition_states(n_samp,
transition_length = 0.01,
state_length = 0.5) +
transition_time(n_samp) +
ggtitle('Number sample averages accumulated (n = 20): {frame}')

# By default, gganimate stops at 100 frames, so we need to animate it and
# explicitly define how many frames should be in GIF
animate(a1, nframes = frames)

# Save the GIF to local (automatically saves last animation created above)
anim_save("samp_20.gif")


Running the above code produces the following GIF (colors may differ): Here’s another run for a sample size of 3. Notice the resulting sampling distribution is wider and has a smaller height than the GIF above. 