By the end of today you will use infer to obtain a bootstrap distribution.
Download this application exercise by pasting the code below into your console:
download.file("https://sta101.github.io/static/appex/ae9.Rmd",
destfile = "ae9.rmd")
library(tidyverse)
library(tidymodels)
manhattan = read_csv("https://sta101.github.io/static/appex/data/manhattan.csv")
## Rows: 20 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): rent
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Bootstrapping is a re-sampling technique. The key idea is you have already collected a sample of size \(N\) from the population. To create a bootstrap sample, you sample with replacement from your original sample \(N\) times.
Let’s say you measure the height of five Duke students in meters:
heights = c(1.51, 1.62, 1.89, 2.01, 1.78)
students = data.frame(heights)
There are many ways to perform a bootstrap sample in R.
set.seed(1)
sample(heights, size = 5, replace = TRUE)
## [1] 1.51 2.01 1.51 1.62 1.78
set.seed(2)
students %>%
specify(response = heights) %>%
generate(reps = 1, type = "bootstrap")
## Response: heights (numeric)
## # A tibble: 5 × 2
## # Groups: replicate [1]
## replicate heights
## <int> <dbl>
## 1 1 1.78
## 2 1 1.51
## 3 1 1.78
## 4 1 1.51
## 5 1 2.01
From here, we can compute a bootstrap statistic, for example the median of the bootstrap sample:
set.seed(1)
sample(heights, size = 5, replace = TRUE) %>%
median()
## [1] 1.62
set.seed(2)
students %>%
specify(response = heights) %>%
generate(reps = 1, type = "bootstrap") %>%
calculate(stat = "median")
## Response: heights (numeric)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1.78
Take-away: sample() takes an individual vector (such as a single column), which can be handy for quick sampling, while the tidy syntax, specify() %>% generate() %>% calculate(), plays nice with whole data frames and sets us up to easily implement future use-cases.
The tidy way uses the infer package (included as part of tidymodels).
On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the manhattan data frame. We will use this sample to conduct inference on the typical rent of one-bedroom apartments in Manhattan.
Let’s start by using bootstrapping to estimate the mean rent of one-bedroom apartments in Manhattan.
What is a point estimate (i.e. single number summary) of the typical rent?
Let’s bootstrap!
Fill in the values from the bootstrap sample conducted in class. Once the values are filled in, un-comment the code.
# class_bootstrap = c()
# add code
We will use the infer package, included as part of tidymodels, to calculate a 95% confidence interval for the mean rent of one-bedroom apartments in Manhattan.
We start by setting a seed to ensure our analysis is reproducible.
We can use R to take many bootstrap samples, compute a statistic and then view the bootstrap distribution of that statistic.
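As a rough illustration of that idea, here is a minimal base R sketch that reuses the five heights from earlier rather than the rent data. The names boot_means and boot_sketch, the seed, and the choice of 1000 reps are assumptions made for this example only; this is not the infer pipeline you will fill in below.

```r
# Hypothetical example: the five heights from earlier, not the rent data.
heights = c(1.51, 1.62, 1.89, 2.01, 1.78)

set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Draw 1000 bootstrap samples (same size as the original, with replacement)
# and keep the mean of each one.
boot_means = replicate(
  1000,
  mean(sample(heights, size = length(heights), replace = TRUE))
)

# One row per bootstrap sample: replicate number and its statistic.
boot_sketch = data.frame(replicate = seq_along(boot_means), stat = boot_means)
head(boot_sketch)
```

Each row of boot_sketch is one bootstrap statistic, so the stat column as a whole is the bootstrap distribution of the sample mean.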
Un-comment the lines and fill in the blanks to create the bootstrap distribution of sample means and save the results in the data frame boot_dist.
Use 1000 reps for the in-class activity. (You will use about 10,000 reps for assignments outside of class.)
set.seed(7182022)
boot_dist = manhattan #%>%
#specify(______) %>%
#generate(______) %>%
#calculate(______)
boot_dist
… boot_dist? … boot_dist? What do they mean?
A sample statistic is a random variable, so we can look at its distribution.
Visualize the bootstrap distribution using a histogram. Describe the shape, center, and spread of this distribution.
# add code
Uncomment the lines and fill in the blanks to construct the 95% bootstrap confidence interval for the mean rent of one-bedroom apartments in Manhattan.
#___ %>%
# summarize(lower = quantile(______),
# upper = quantile(______))
Write the interpretation for the interval calculated above.
#calculate a 90% confidence interval
#calculate a 99% confidence interval
Question: Does a confidence interval have to be symmetric?
What is one advantage to using a 90% confidence interval instead of a 95% confidence interval to estimate a parameter?
What is one advantage to using a 99% confidence interval instead of a 95% confidence interval to estimate a parameter?
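To explore how the confidence level affects a percentile interval, the following base R sketch may help. It again uses the five heights from earlier instead of the rent data. The helper percentile_ci() is a hypothetical function written for this example (it is not part of infer), and the seed and number of reps are arbitrary.

```r
# Hypothetical example: the five heights from earlier, not the rent data.
heights = c(1.51, 1.62, 1.89, 2.01, 1.78)

set.seed(1)  # arbitrary seed for reproducibility
boot_means = replicate(
  1000,
  mean(sample(heights, size = length(heights), replace = TRUE))
)

# Hypothetical helper: percentile interval covering the middle `level`
# share of the bootstrap statistics.
percentile_ci = function(stats, level) {
  alpha = 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

ci_90 = percentile_ci(boot_means, 0.90)
ci_95 = percentile_ci(boot_means, 0.95)
ci_99 = percentile_ci(boot_means, 0.99)

# Width (upper minus lower) of each interval, for comparison.
widths = c("90%" = diff(ci_90), "95%" = diff(ci_95), "99%" = diff(ci_99))
widths
```

Comparing the three widths side by side is one way to think about the trade-off the questions above ask about.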
Next, use bootstrapping to estimate the median rent for one-bedroom apartments in Manhattan.
Generate the bootstrap distribution of the sample medians. Use 100 reps. Save the results in boot_dist_median.
set.seed(100)
## add code
# code here
## add code
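For comparison, here is a minimal base R sketch of a bootstrap distribution of medians, using the five heights from earlier rather than the rent data. The name boot_medians and the seed are assumptions made for this example; the exercise itself is meant to be done with the infer pipeline on the manhattan data.

```r
# Hypothetical example: the five heights from earlier, not the rent data.
heights = c(1.51, 1.62, 1.89, 2.01, 1.78)

set.seed(100)  # arbitrary seed for reproducibility

# Draw 100 bootstrap samples and keep the median of each one.
boot_medians = replicate(
  100,
  median(sample(heights, size = length(heights), replace = TRUE))
)

# Count how often each value appears among the bootstrap medians.
table(boot_medians)
```

Because this toy sample has an odd number of observations, every bootstrap median is one of the original heights, so the distribution takes only a few distinct values. With an even sample size, such as the 20 rents, a median can also fall halfway between two observed values.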