A small demonstration of the normal distribution arises from the aggregation of many, many sample averages.

I originally created this simulation to create some GIFs demonstrating how the normal distribution arises from many, many sample means. Very Normal is the name of the Substack I write for, and it tries to illustrate statistical concepts without diving too much into the underlying theory. I thought it would be good to make my animating code transparent here as well.

The Lindeberg-Levy version of the Central Limit Theorem (aka the classical one) states that for a sequence of \(n\) *independent, identically distributed* random variables \(\{ X_1, ..., X_n \}\) with **finite mean \(\mu\) and variance \(\sigma^2\)**, the following random variable has a normal distribution:

\[ \sqrt{n}( \bar{X}_n - \mu) \leadsto N(0, \sigma^2) \]

In plain terms, the sample average \(\bar{X}_n\) after being mean-centered and scaled by \(\sqrt{n}\) has a normal distribution with zero-mean and variance \(\sigma^2\).

Thanks to properties of normal distributions, we also know the sample average has the following distribution:

\[ \bar{X}_n \leadsto N(\mu, \frac{\sigma^2}{n}) \]

With the notation out of the way, here’s the code I used to develop the GIFs.

```
library(tidyverse)
library(gganimate)
# Population of Very Normal Land
population = rep(c(1, 2, 3, 4, 5), each = 20)
frames = 300 # frame numbers for GIF
# Function for creating underlying data to animate
createDataForGIF = function(n, frames) {
# Initialize structures to hold entire dataset
data = tibble()
averages = c()
# For each frame, calculate a new sample average to add to histogram
for (i in 1:frames) {
s = sample(population, size = n, replace = FALSE) %>% mean
averages = c(averages, s)
# This dataset represents one specific frame in the dataset
d = tibble(
avgs = averages,
n_samp = i
)
# Aggregate into the greater data
data = bind_rows(data, d)
}
data
}
# Create GIF dataset based on a sample size of 20
sampsize20 = createDataForGIF(n = 20, frames = frames)
# Create an overall ggplot of the data which will be split up in the GIF
sampsize20viz = sampsize20 %>%
ggplot() +
aes(x = avgs) +
geom_histogram(fill = "#00A08A", color = "black", bins = 30) +
theme_minimal() +
xlim(0, 6) +
labs(
x = "Calculated Sample Average",
y = "Number of data collectors who got a particular average"
)
# Store the animations to a GIF
gif20 = sampsize20viz +
transition_states(n_samp,
transition_length = 0.01,
state_length = 0.5) +
transition_time(n_samp) +
ggtitle('Number sample averages accumulated (n = 20): {frame}')
# By default, gganimate stops at 100 frames, so we need to animate it and
# explicitly define how many frames should be in GIF
animate(a1, nframes = frames)
# Save the GIF to local (automatically saves last animation created above)
anim_save("samp_20.gif")
```

Running the above code produces the following GIF (colors may differ):

Here’s another run for a sample size of 3. Notice the resulting sampling distribution is wider and has a smaller height than the GIF above.