Fun with Random Numbers: The Bootstrap
Let’s play a guessing game. I’m going to choose a big set of numbers. Like a billion jillion numbers.
# This is R code.
library(tidyverse)
generator_mean = floor(runif(1) * 10000 + 10) #SECRETS
generator_standard_deviation = floor(runif(1) * 30 + 10) #MORE SECRETS
my_numbers = rnorm(10000000, generator_mean, generator_standard_deviation) #OK fine it'll be 10 million instead of a billion jillion.
My set of numbers has an average that you’re trying to guess.
my_mean = mean(my_numbers)
You get some numbers, let’s say 100, from my set.
sample_size = 100
your_numbers = sample(my_numbers, sample_size)
And now you guess. If you’re right you win! If not I win. Loser goes to the 😈 BAD PLACE 😈.
Easy game, right?
NO FAIR
This game isn’t very fair right now. Getting only a single guess makes this game pretty impossible for you. (OK fine, maybe I won’t send you to the 😈 BAD PLACE 😈 if you lose). The chance that your 100 numbers average out to the average of my jillion billion numbers exactly is pretty low.
mean(my_numbers) == mean(your_numbers) #NO CHANCE
## [1] FALSE
See?
I’m a good sport kinda guy though, so I’ll make you a deal. Instead of guessing a single number, you get to guess a range. If my average is inside your range, then you win!
Well…now the game is unfair in your favor. You can pick the whole number line as your range right? Then you’d always win (really glad I’m dropping the whole 😈 BAD PLACE 😈 thing 😅).
You’re a good sport kinda person though, so you’re gonna make me a deal. We’ll play the game over and over again and keep score. If you win 95% of the time, you win! If you win 96% of the time you lose. If you win 94% of the time, or any other percent of the time besides 95% you lose.
What the f?
Quite the game ain’t it?
The Bootstrap?
“The” Bootstrap is a technique you can use to win this game. “The” is in scare quotes there because there are a number of related techniques, this is the simplest one. The first step is taking a “bootstrap sample”. To do this you sample 100 (or whatever the original sample_size
was) elements from your sample with replacement. Sampling from your sample is why The Bootstrap is called a resampling method.
take_a_bootstrap_sample = function() {
sample(your_numbers, sample_size, replace = TRUE)
}
The second step is to calculate the statistic of interest (in this case, the mean) on your bootstrap sample.
calculate_bootstrap_statistic = function() {
mean(take_a_bootstrap_sample())
}
You do steps 1 and 2 a lot. Like 10000 times, and put all the boostrap statistics together…
bootstrap_statistics = sapply(1:10000, function(x) calculate_bootstrap_statistic()) ## BASE R JASON?! ARE YOU WITHOUT COUTH?!
Step three is to find the range that contains the middle 95% of your bootstrap statistics. This range is your final guess.
conf_low = quantile(bootstrap_statistics, .025, names = FALSE)
conf_high = quantile(bootstrap_statistics, .975, names = FALSE)
So in this run your guess is that my_mean
is between 3925.2 and 3931.4
In this round of the game my_mean
= 3926. You WIN, YAY!
But…
But your deal was that we play over and over again. Let’s play 1000 times. I’ll choose a different set of numbers for each game. You should win about 95% of these.
First let’s make a helper function that plays a round of the game and returns the results…
play_game = function() {
# I choose some numbers.
my_numbers = rnorm(1000, runif(1) * 10000 + 10, runif(1) * 10 + 10)
# You get a sample.
your_numbers = sample(my_numbers, sample_size)
# You do Bootstrap step one and two a bunch of times.
bootstrap_statistics = sapply(1:10000, function(x) mean(sample(your_numbers, sample_size, replace = TRUE)))
# Then Bootstrap step three, extract the interval. Then check whether you win...
tibble(ConfLow = quantile(bootstrap_statistics, .025),
ConfHigh = quantile(bootstrap_statistics, .975),
PopulationMean = mean(my_numbers)) %>%
mutate(YouWin = ConfLow < PopulationMean & ConfHigh > PopulationMean)
}
We’ll use that function to play the game 1000 times and calculate your win percentage.
tibble(GameId = 1:1000) %>%
mutate(Game = map(GameId, ~ play_game())) %>%
unnest() %>%
count(YouWin) %>% # Count the wins
mutate(Percent = n / sum(n)) %>% # Calculate the percentage
filter(YouWin)
## # A tibble: 1 × 3
## YouWin n Percent
## <lgl> <int> <dbl>
## 1 TRUE 948 0.948
Close enough :)