By the end of today you will…
dplyr
ggplot
Download this application exercise by pasting the code below into your console (bottom left of the screen):
download.file("https://sta101.github.io/static/appex/ae2.Rmd",
destfile = "ae2.rmd")
.rmd
library(tidyverse)
library(palmerpenguins)
Type ?palmerpenguins to learn more about this package.
Or better yet, check it out here.
data(penguins)
Look at the data. How many observations are there? How many variables?
When we load the tidyverse library, dplyr is loaded with it, because dplyr is one of the core tidyverse packages.
dplyr, a grammar of data manipulation, offers intuitive ‘verb’ functions that describe actions we commonly want to perform with data. The big 7 we’ll cover today are:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
group_by() sets us up to summarize across groups.
summarize() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
slice() selects, removes and duplicates rows based on their index.
(as described at https://dplyr.tidyverse.org/)
Approximate bill area (in \(mm^2\)) as bill length * bill depth:
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm)
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## # bill_area_mm2 <dbl>
It’s hard to see bill length, depth and area in the same output, so let’s select a smaller subset of the variables to look at.
# Example 1
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
select(-year)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, bill_area_mm2 <dbl>
# Example 2
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, bill_area_mm2)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, bill_area_mm2 <dbl>
%>% and a note on style.
Let’s just examine penguins on Dream island.
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
select(-year)
# code here
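As a minimal sketch of how filter() keeps only the rows that match a condition, the example below filters on a different column (species) so that the Dream island exercise stays open. The object name adelie_only is hypothetical and chosen only for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Keep only the rows where species equals "Adelie".
# Note the double equals sign: == tests equality, while = assigns.
adelie_only <- penguins %>%
  filter(species == "Adelie")

# Every remaining row should now be an Adelie penguin
unique(adelie_only$species)
```

The same pattern applies to any column: put a logical condition inside filter() and only the rows where it is TRUE are kept.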
Find the mean bill area for each sex. Fill in the blanks.
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
select(-year) %>%
# filter for Dream
group_by(___) %>%
summarize(mean_bill_area_mm2 = ___)
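To see the group_by() and summarize() pattern in isolation, here is a small sketch on made-up data. The tibble toy and its columns group and value are hypothetical and are not part of palmerpenguins.

```r
library(tidyverse)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# group_by() marks the groups; summarize() then returns one row per group
toy_means <- toy %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))

toy_means
```

Group "a" averages to 2 and group "b" to 15, so the result has two rows, one per group.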
Let’s use arrange() and slice() to report the five penguins with the greatest bill area.
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
select(bill_area_mm2, bill_length_mm) %>%
arrange(desc(bill_area_mm2))
## # A tibble: 344 × 2
## bill_area_mm2 bill_length_mm
## <dbl> <dbl>
## 1 1127. 54.2
## 2 1105. 55.8
## 3 1076. 52
## 4 1065. 53.5
## 5 1056 52.8
## 6 1050. 51.7
## 7 1043. 52.7
## 8 1032. 58
## 9 1021. 51.3
## 10 1013. 59.6
## # … with 334 more rows
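The output above is sorted but still shows all 344 rows. One possible way to keep just the first five is to add slice() at the end of the pipeline, sketched below. The object name top_five is hypothetical.

```r
library(tidyverse)
library(palmerpenguins)

top_five <- penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(bill_area_mm2, bill_length_mm) %>%
  arrange(desc(bill_area_mm2)) %>%  # largest bill area first; NA values sort last
  slice(1:5)                        # keep rows 1 through 5 of the sorted data

top_five
```

Because slice() selects rows by position, it only gives the five largest values when it comes after arrange().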
Are these the same five penguins with the longest bills?
Optional hint: if you want to be precise about which penguins are which, you could add an ID column, e.g.
penguins %>%
mutate(id = seq_len(nrow(penguins)))
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, id <int>
This takes advantage of the nrow() function. Can you guess what it returns?
Compute the average bill length, bill depth, flipper length and body mass across all islands.
Is every species on every island?
What is a statistic? It’s any mathematical function of the data. Sometimes, a statistic is referred to as a “sample statistic” since you compute it from a finite sample (the data) and not the entire population.
For example, penguins is a sample of penguins in Antarctica, not an exhaustive list of the entire population.
Examples of statistics:
… and any other arbitrary function of the data you can come up with!
Come up with your own statistic and write it in the narrative below.
To access a column of the data, we’ll use data$column.
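As a small illustration of the data$column syntax, the sketch below uses a made-up data frame. The names toy, name and score are hypothetical.

```r
# Hypothetical toy data frame, invented for illustration only
toy <- data.frame(
  name  = c("x", "y", "z"),
  score = c(10, 20, 30)
)

# data$column returns the column as a plain vector
toy$score

# The vector can be passed straight to a function
mean(toy$score)
```

Here toy$score is the numeric vector 10, 20, 30, and its mean is 20.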
Let’s write down the R function for each of the above.
Try to compute the above statistics for the penguin bill length column.
Do you receive an error? Why?
# code here
Let’s take a look at the distribution of bill length among all penguins.
penguins %>% # data
ggplot(aes(x = bill_length_mm)) + # columns we want to look at
geom_histogram() # geometry of the visualization
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
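The message and warning above suggest two optional tweaks: choosing a bin width explicitly and dropping the rows with missing bill length before plotting. A minimal sketch follows; the bin width of 2 mm is an arbitrary assumption, not a recommended value, and the name bill_hist is hypothetical.

```r
library(tidyverse)
library(palmerpenguins)

bill_hist <- penguins %>%
  filter(!is.na(bill_length_mm)) %>%   # drop the rows with missing bill length
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 2)         # each bar spans 2 mm (arbitrary choice)

bill_hist
```

Try a few different values of binwidth to see how the shape of the histogram changes.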
Let’s visualize some of our summary statistics on the plot.
Uncomment the code below and fill in the blank with the mean.
penguins %>%
ggplot(aes(x = bill_length_mm)) +
geom_histogram() #+
#geom_vline(xintercept = __, color = 'red')
Add a geom_vline for the median and another for the mode. Use separate colors for each.
Finally, let’s try out another geometry, geom_density, and clean up our graph with some axis labels.
# live coding here
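One way the density plot might look is sketched below. The label text and the object name density_plot are assumptions made for illustration; the version coded live in class may differ.

```r
library(tidyverse)
library(palmerpenguins)

density_plot <- penguins %>%
  filter(!is.na(bill_length_mm)) %>%   # avoid the missing-value warning
  ggplot(aes(x = bill_length_mm)) +
  geom_density() +                     # smooth estimate of the distribution
  labs(
    x = "Bill length (mm)",
    y = "Density",
    title = "Distribution of penguin bill length"
  )

density_plot
```

labs() sets the axis labels and title in one call, which keeps the units visible to the reader.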