By the end of today you will…
Download this application exercise by pasting the code below into your console
download.file("https://sta101.github.io/static/appex/ae8.Rmd",
destfile = "ae8.rmd")
library(tidyverse)
library(knitr) # contains kable()
The sample space is the set of all possible outcomes of an experiment.
We separate each outcome by a comma and use curly braces \(\{ \}\) to denote a “set”.
Experiment 2: You flip a coin twice. The sample space is \(\{ HH, HT, TH, TT\}\)
Experiment 3: You roll a die once. The sample space is \(\{ 1, 2, 3, 4, 5, 6\}\)
Experiment 4: You send out a survey asking participants whether they prefer cats or dogs. The sample space is \(\{ \text{Cats}, \text{Dogs} \}\)
Experiment 5: A car manufacturer makes 100 vehicles. You count the number of recalls. The sample space is \(\{0, 1, 2, 3, \ldots, 99, 100 \}\)
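As a small illustration, the sample space of Experiment 2 can be enumerated in R. This is only a sketch using base R's expand.grid(); the names flips and sample_space are made up for this example and are not part of the exercise.

```r
# Enumerate every outcome of two coin flips (Experiment 2).
# `flips` and `sample_space` are illustrative names, not from the exercise.
flips = expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"),
                    stringsAsFactors = FALSE)

# Paste the two flips together so each row becomes one outcome, e.g. "HT"
sample_space = paste0(flips$flip1, flips$flip2)

sample_space
length(sample_space)  # number of outcomes in the sample space
```

The same approach should extend to more flips by adding more columns to expand.grid().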
Here we write the lower bound and upper bound of the sample space and assume we can observe all values in-between. Brackets, \([\) \(]\), are inclusive of the end values while parentheses, \((\) \()\), are not.
Experiment 7: You measure the tail length of American alligators. The sample space is \((0, c ]\) feet where \(c\) is the maximum tail length of an alligator, e.g. \(c\) might be approximately 10.
Experiment 8: You measure the geographic coordinates (longitude and latitude) of a COVID case. The sample space is \([-90, 90]\) for latitude and \([-180, 180 ]\) for longitude.
An event is a collection of 1 or more outcomes. Two events are said to be disjoint if they cannot occur at the same time.
You roll a die once. Let A be the event that you roll an even number, i.e. A \(= \{2, 4, 6 \}\). Let B be the event you roll a 1 or a 2, i.e. B \(= \{1, 2 \}\). A and B are not disjoint.
A car manufacturer makes 100 vehicles. You count the number of recalls. Let C be the event you see fewer than 10 recalls. C \(= \{0, 1, 2, 3, \ldots, 8, 9 \}\)
You observe the numeric grade you earn in a course. Let D be the event you receive a letter grade of “A”. D \(= [93, 100]\). Let E be the event that you earn a “B” or worse. E \(= [0, 87)\). D and E are disjoint events because they cannot occur simultaneously.
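For discrete events, one way to check disjointness in R is to look at the outcomes two events share. The sketch below uses the die events A and B from above and base R's intersect(); the variable name shared is an assumption made for illustration.

```r
# Events from the die example, written as vectors of outcomes
A = c(2, 4, 6)  # roll an even number
B = c(1, 2)     # roll a 1 or a 2

# Two events are disjoint when they share no outcomes
shared = intersect(A, B)

shared               # outcomes that belong to both A and B
length(shared) == 0  # TRUE only if A and B are disjoint
```

Because the outcome 2 belongs to both events, the check returns FALSE, which agrees with the statement that A and B are not disjoint.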
A probability is the long-run frequency of an event: the proportion of times we would see the event occur if we could repeat an experiment an infinite number of times. Probabilities take values between 0 and 1 inclusive.
If A and B are two disjoint events, then the probability of A or B occurring is equal to the probability of A plus the probability of B. More concisely, Pr(A or B) = Pr(A) + Pr(B).
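The addition rule can be checked by counting outcomes. The sketch below assumes a fair six-sided die, so every outcome is equally likely; the helper pr() and the events F_event and G_event are hypothetical names introduced only for this example.

```r
# Assumption: a fair six-sided die, so all outcomes are equally likely
die = 1:6

# pr() is a hypothetical helper: with equally likely outcomes, the probability
# of an event is (outcomes in the event) / (outcomes in the sample space)
pr = function(event) length(event) / length(die)

# Two disjoint events (illustrative, not from the exercise)
F_event = c(1, 2)
G_event = c(5, 6)
pr(union(F_event, G_event))  # Pr(F or G)
pr(F_event) + pr(G_event)    # matches, because F and G share no outcomes

# Two events that are NOT disjoint: the die events A and B
A = c(2, 4, 6)
B = c(1, 2)
pr(union(A, B))  # Pr(A or B)
pr(A) + pr(B)    # larger, because the shared outcome 2 is counted twice
```

The rule holds for the disjoint pair and fails for A and B, which is why the disjointness condition matters.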
Let A and B be two events.
Let’s try out some of these ideas with data from the in-class survey.
class = read_csv("https://sta101.github.io/static/appex/data/sta101-su22.csv")
## Rows: 20 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): pet, element
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What is the probability a randomly selected STA101 student is an Earthbender?
# code here
What is the probability a randomly selected STA101 student is a Waterbender given that they prefer cats?
# code here
Let A be the event that a person prefers dogs and B be the event a person is an Airbender. Are events A and B independent?
# code here
Let C be the event a person likes cats and D be the event a person is a Firebender. Are C and D disjoint?
# code here
You toss a fair coin 10 times. Let A be the event there is at least one head.
What is the probability of A?
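rbinom() arguments: n is the number of simulated values to draw, size is the number of trials in each draw, and prob is the probability of success on each trial.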
N = 100
coin_flips = data.frame(num_heads = rbinom(n = N, size = 10, prob = 0.5))
coin_flips %>%
filter(num_heads >=1) %>%
nrow() / N
## [1] 1
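The simulated proportion of 1 comes from only N = 100 repetitions. As a cross-check, the exact probability can be computed from the complement, "no heads at all", using dbinom(). This is a sketch; the names p_no_heads and p_at_least_one are made up for illustration.

```r
# Exact probability of at least one head in 10 tosses of a fair coin.
# Complement rule: Pr(at least one head) = 1 - Pr(zero heads)
p_no_heads = dbinom(0, size = 10, prob = 0.5)  # equals 0.5^10
p_at_least_one = 1 - p_no_heads

p_at_least_one
```

The exact value is 1023/1024, which is very close to 1 but not equal to it. With only 100 simulated experiments, a run with zero heads is unlikely to show up, so the simulation reports 1.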
coin_flips %>%
ggplot(aes(x = num_heads)) +
geom_histogram() +
theme_bw() +
labs(x = "Number of heads", y = "Count", title = "Distribution of the total number of heads seen in 10 coin flips")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How many people prefer dogs to cats? Make your tables pretty with kable(). Read ?kable for more details.
class %>%
count(pet) %>%
kable(col.names = c("Pet", "Number"))
| Pet | Number |
|---|---|
| Cat | 3 |
| Dog | 17 |
sample() is a powerful function. 17/20 students in STA101 prefer dogs to cats. Let’s ask 10 more people whether they prefer dogs to cats, assuming that 17/20 is the true proportion of people who prefer dogs.
set.seed(1) # ensure you and I will both get the same "random" result
outcomes = c("Dogs", "Cats")
sample(outcomes, size = 10, replace = TRUE, prob = c(0.85,0.15))
## [1] "Dogs" "Dogs" "Dogs" "Cats" "Dogs" "Cats" "Cats" "Dogs" "Dogs" "Dogs"
Alternatively, we could have used rbinom(). When using rbinom() we must label one outcome a “success” (1) and the other a “failure” (0). The labels are arbitrary and do not imply that the outcome itself is positive or negative. Let’s call “Dogs” a success and “Cats” a failure (1 and 0 respectively).
set.seed(1)
# we survey 10 people 1 time
numberOfPeople = 10
numberOfSurveys = 1
probOfSuccess = 17 / 20
rbinom(n = numberOfSurveys, size = numberOfPeople, prob = probOfSuccess)
## [1] 9
You roll a six-sided die five times. Let B be the event that you roll exactly one “2”. What is the probability of B?
outcomes = 1:6
one.roll = function() {
roll = sample(outcomes, size = 1, replace = TRUE)
return(roll)
}
five.rolls = function() {
rolls = replicate(5, one.roll())
return(sum(rolls == 2) == 1)
}
samples = replicate(10000, five.rolls())
mean(samples)
## [1] 0.4022
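The simulated value can be compared with the exact answer. The number of 2s in five independent rolls follows a binomial distribution with 5 trials and success probability 1/6 (assuming a fair die), so dbinom() gives the probability directly. The names p_exact and p_by_hand are illustrative.

```r
# Exact probability of exactly one "2" in five rolls of a fair die
p_exact = dbinom(1, size = 5, prob = 1 / 6)

# The same quantity written out: choose which roll is the "2",
# then one success and four failures
p_by_hand = choose(5, 1) * (1 / 6) * (5 / 6)^4

p_exact
p_by_hand
```

Both expressions equal 3125/7776, roughly 0.402, which is in line with the simulated proportion above.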
Modify the code above to instead generate 1000 random samples of bender element, e.g. “Airbender”, “Waterbender”, etc. where each one is equally likely. Visualize the distribution using geom_bar().
# code here
You may not have realized it, but we’ve been talking about random variables. A random variable is a function that maps an observed outcome to a number.
For example, when you ask someone what pet they prefer, and map “Dog” to 1 and “Cat” to 0, you are defining a random variable!
Random variables have distributions…
The binomial distribution models the number of successes in a series of independent and identical binary trials and is defined by two parameters: the number of trials \(k\) and the probability of success on each trial.
The sample space of a binomial random variable is \(\{0, 1, \ldots, k \}\). In words, there could be up to \(k\) successes in a binomial experiment.
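To make the sample space concrete, the sketch below lists every possible value of a binomial random variable and its probability using dbinom(). The choice of k = 10 trials is an assumption made for illustration, the success probability 17/20 is borrowed from the class survey, and the names p, support and probs are made up.

```r
k = 10       # number of trials (illustrative value)
p = 17 / 20  # probability of success, borrowed from the class survey

support = 0:k                                # the sample space {0, 1, ..., k}
probs = dbinom(support, size = k, prob = p)  # probability of each value

sum(probs)                  # probabilities over the sample space add up to 1
support[which.max(probs)]   # the most likely number of successes
```

Changing k or p and rerunning the last two lines is a quick way to see how the parameters shift the distribution.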
Assume that the true proportion of people who prefer dogs is in fact 17 / 20.
Let’s perform a simulation. Go out and ask 100 people if they prefer cats or dogs using rbinom() (where 0 is cat and 1 is dog). This is called a single sample of 100 individuals. What proportion of people prefer dogs to cats in your new sample?
Now repeat this 10 times. What is the average number of people who prefer dogs in each sample? Plot the distribution of the mean across all 10 samples.
Now ask 100 samples of 100 people the same question and plot the distribution of the mean.
Now ask 1000 samples of 100 people the same question and plot the distribution of the mean. What do you notice?
Note: you might need to play with the binwidth argument of geom_histogram() for the best results.
set.seed(714)
# code here
The normal distribution, also known as the “Gaussian distribution”, is a distribution of a continuous random variable. The sample space of a normal random variable is \((- \infty, + \infty)\), and the distribution is defined by two parameters: a mean \(\mu\) and a standard deviation \(\sigma\).
We can sample N times from a normal with mean mu and standard deviation s using rnorm(n = N, mean = mu, sd = s).
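A quick way to see what rnorm() returns is to draw a large sample and compare its summary statistics with the parameters. In this sketch the values N = 10000, mu = 100 and s = 6 and the name draws are chosen for illustration.

```r
set.seed(1)  # make the "random" draws reproducible

N = 10000  # number of draws (illustrative value)
mu = 100   # mean
s = 6      # standard deviation

draws = rnorm(n = N, mean = mu, sd = s)

mean(draws)  # should land close to mu
sd(draws)    # should land close to s

# Share of draws within two standard deviations of the mean
mean(abs(draws - mu) <= 2 * s)
```

With a sample this large, the sample mean and standard deviation should sit close to mu and s, and roughly 95% of the draws should fall within two standard deviations of the mean.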
Let’s visualize the normal density curve using the code below.
mu1 = 100
s1 = 6
mu2 = 105
s2 = 3
ggplot(data = data.frame(x = c(mu1 - s1*3, mu1 + s1*3)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = mu1, sd = s1),
color = "steelblue") +
stat_function(fun = dnorm, args = list(mean = mu2, sd = s2),
color = "orange") +
theme_bw() +
labs(title = "Two normal curves")
Try setting mu2 to 99 and s2 to 4. What do you notice? Play with a few more settings and describe what the mean and standard deviation do to the shape of the curve.
Paste the code of your histogram from exercise 3 below but change your geometry to geom_histogram(aes(y = ..density..)) (in ggplot2 3.4.0 and later, aes(y = after_stat(density)) is the preferred spelling). This will rescale your histogram so that the total area of the bars is 1. Next, use the stat_function code above as a template to superimpose a normal distribution on top of your histogram. Adjust the mean and standard deviation until you obtain a good-looking fit. What do you notice?
set.seed(714)
# code here