Bulletin

Today

By the end of today you will…

Getting started

Download this application exercise by pasting the code below into your console (bottom left of the screen):

download.file("https://sta101.github.io/static/appex/ae2.Rmd",
destfile = "ae2.rmd")

Load packages and data

library(tidyverse)
library(palmerpenguins)

Type ?palmerpenguins to learn more about this package. Or better yet, check it out here.

data(penguins)

Exercise 1:

Look at the data. How many observations are there? How many variables?
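A few functions can help with this exercise. The sketch below is one possible approach, not the only one; the names n_obs and n_vars are made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)

data(penguins)

glimpse(penguins)        # one line per variable, with its type and first few values
n_obs  <- nrow(penguins) # number of rows (observations)
n_vars <- ncol(penguins) # number of columns (variables)
dim(penguins)            # both at once: rows first, then columns
```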

A package within a package…

When we load the tidyverse package, dplyr is loaded along with it.

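dplyr, a grammar of data manipulation, offers intuitive ‘verb’ functions that describe actions we commonly want to perform with data. The big 7 we’ll cover today are:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names
  • filter() picks cases based on their values
  • group_by() groups the data so that later operations are performed by group
  • summarize() reduces multiple values down to a single summary
  • arrange() changes the ordering of the rows
  • slice() picks rows based on their position

(as described at https://dplyr.tidyverse.org/)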

Mutate

Approximate bill area (in \(mm^2\)) as bill length * bill depth:

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm)
## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   bill_area_mm2 <dbl>
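Two things about mutate() are easy to miss: it can create several columns in one call, and it does not change penguins unless the result is saved. A minimal sketch, where penguins_extra and body_mass_kg are hypothetical names chosen for illustration:

```r
library(tidyverse)
library(palmerpenguins)

# Save the result under a new name; penguins itself is left untouched
penguins_extra <- penguins %>%
  mutate(
    bill_area_mm2 = bill_length_mm * bill_depth_mm, # same column as above
    body_mass_kg  = body_mass_g / 1000              # hypothetical extra column
  )

ncol(penguins)       # still the original number of columns
ncol(penguins_extra) # two more than penguins
```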

Select

It’s hard to see bill length, depth and area in the same output. Let’s select a smaller subset of the variables to look at.

# Example 1
penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year)
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, bill_area_mm2 <dbl>

# Example 2
penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, bill_area_mm2)
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, bill_area_mm2 <dbl>
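Both examples above still print eight columns, so bill area stays hidden in the truncated output. One way to see the three bill measurements side by side is to select only those columns. This is a sketch; bill_cols and bill_cols_alt are names made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Keep only the three bill columns, listed by name
bill_cols <- penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(bill_length_mm, bill_depth_mm, bill_area_mm2)

# Same idea using a selection helper: every column whose name starts with "bill"
bill_cols_alt <- penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(starts_with("bill"))

bill_cols
```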
  • A note on pipes %>% and a note on style.
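As a brief illustration of what the pipe does: x %>% f(y) is another way of writing f(x, y). The sketch below writes the same two steps with and without pipes; nested and piped are names made up for the comparison.

```r
library(tidyverse)
library(palmerpenguins)

# Nested calls: read from the inside out
nested <- select(
  mutate(penguins, bill_area_mm2 = bill_length_mm * bill_depth_mm),
  -year
)

# Piped calls: read from top to bottom, one step per line
piped <- penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year)

identical(nested, piped) # both versions produce the same tibble
```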

Filter

Let’s examine only the penguins on Dream island.

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year)
# code here
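One possible way to complete this step is to add a filter() call to the end of the pipeline. This is a sketch rather than the official solution, and dream_penguins is a name made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)

dream_penguins <- penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year) %>%
  filter(island == "Dream") # keep only the rows where island equals "Dream"

dream_penguins
```

Note the double equals sign: == tests for equality, whereas a single = assigns a value or names an argument.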

Group by + Summarize

Exercise 2:

Find the mean bill area for each sex. Fill in the blanks.

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year) %>%
  # filter for Dream
  group_by(___) %>%
  summarize(mean_bill_area_mm2 = ___)
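To see the group_by() + summarize() pattern on a different question, the sketch below computes mean body mass for each species. It is an analogous example, not the solution to Exercise 2; mass_by_species is a made-up name, and na.rm = TRUE is added because some penguins have missing measurements.

```r
library(tidyverse)
library(palmerpenguins)

mass_by_species <- penguins %>%
  group_by(species) %>%                         # one group per species
  summarize(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE) # drop NAs before averaging
  )

mass_by_species # one row per species
```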

Arrange + Slice

Let’s use arrange() and slice() to report the five penguins with the greatest bill area.

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(bill_area_mm2, bill_length_mm) %>%
  arrange(desc(bill_area_mm2))
## # A tibble: 344 × 2
##    bill_area_mm2 bill_length_mm
##            <dbl>          <dbl>
##  1         1127.           54.2
##  2         1105.           55.8
##  3         1076.           52  
##  4         1065.           53.5
##  5         1056            52.8
##  6         1050.           51.7
##  7         1043.           52.7
##  8         1032.           58  
##  9         1021.           51.3
## 10         1013.           59.6
## # … with 334 more rows
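The output above is sorted but still contains all 344 rows. A minimal sketch of the remaining slice() step, keeping only the first five rows of the sorted data (top_five is a name made up for illustration):

```r
library(tidyverse)
library(palmerpenguins)

top_five <- penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(bill_area_mm2, bill_length_mm) %>%
  arrange(desc(bill_area_mm2)) %>% # largest bill area first; NAs go last
  slice(1:5)                       # keep rows 1 through 5

top_five
```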

Exercise 3:

Are these the same five penguins with the longest bills?

Optional hint: if you want to be precise about which penguin is which, you could add an ID column, e.g.

penguins %>%
  mutate(id = seq_len(nrow(penguins)))
## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, id <int>

This takes advantage of the nrow() function. Can you guess what it returns?

Exercise 4:

Compute the average bill length, bill depth, flipper length and body mass across all islands.

Exercise 5:

Is every species on every island?


Summary statistics

What is a statistic? It’s any mathematical function of the data. Sometimes, a statistic is referred to as a “sample statistic” since you compute it from a finite sample (the data) and not the entire population.

For example, penguins is a sample of penguins in Antarctica, not an exhaustive list of the entire population.

Examples of statistics:

measures of central tendency:

  • mean
  • median
  • mode

measures of spread:

  • standard deviation
  • variance
  • range
  • quartiles
  • inter-quartile range (IQR)

order statistics:

  • quantiles
  • minimum (0th percentile)
  • median (50th percentile)
  • maximum (100th percentile)

… and any other arbitrary function of the data you can come up with!

Exercise 6:

Come up with your own statistic and write it in the narrative below.

To access a column of the data, we’ll use data$column.
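For instance, the sketch below pulls the body mass column out of penguins as a plain vector. The name mass is made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)

mass <- penguins$body_mass_g # the whole column as a vector

head(mass)   # first six values: 3750 3800 3250 NA 3450 3650
length(mass) # one entry per penguin
```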

Let’s write down the R function for each of the above.
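As a reference sketch, here are the corresponding functions applied to a small made-up vector x (the values are invented for illustration). R has no built-in function for the statistical mode, so stat_mode() below is a hypothetical helper, not part of R or the tidyverse.

```r
x <- c(2, 4, 4, 5, 7, 9) # made-up values for illustration

# Measures of central tendency
mean(x)
median(x)

# Hypothetical helper: returns the most frequent value (the first one if tied)
stat_mode <- function(v) {
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
stat_mode(x)

# Measures of spread
sd(x)
var(x)
range(x)    # returns the minimum and the maximum
quantile(x) # by default: 0%, 25%, 50%, 75% and 100% quantiles
IQR(x)

# Order statistics
min(x)
max(x)
quantile(x, probs = 0.9) # any other quantile, here the 90th percentile
```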

Exercise 7:

Try to compute the above statistics for the penguin bill length column.

Do you receive an error? Why?

# code here

Let’s take a look at the distribution of bill length among all penguins.

penguins %>% # data
  ggplot(aes(x = bill_length_mm)) + # columns we want to look at
  geom_histogram() # geometry of the visualization
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
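The first message above suggests choosing a bin width instead of relying on the default of 30 bins. A minimal sketch, assuming a bin width of 1 mm (an arbitrary choice for illustration; other widths may suit the data better):

```r
library(tidyverse)
library(palmerpenguins)

p <- penguins %>%
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1) # each bar covers 1 mm of bill length

p
```

The warning about removed rows will still appear, since it refers to the penguins whose bill length is missing.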

Let’s visualize some of our summary statistics on the plot.

Exercise 8:

Uncomment the code below and fill in the blank with the mean.

penguins %>% 
  ggplot(aes(x = bill_length_mm)) + 
  geom_histogram() #+ 
  #geom_vline(xintercept = __, color = 'red')

Exercise 9:

Add two more geom_vline() layers, one for the median and one for the mode. Use a separate color for each.

Exercise 10:

Finally, let’s try out another geometry, geom_density(), and clean up our graph with some axis labels.

# live coding here
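One possible version of this plot is sketched below. The label text is an assumption chosen for illustration, not the wording used in class.

```r
library(tidyverse)
library(palmerpenguins)

density_plot <- penguins %>%
  ggplot(aes(x = bill_length_mm)) +
  geom_density() + # smooth curve in place of histogram bars
  labs(
    x = "Bill length (mm)",
    y = "Density",
    title = "Distribution of penguin bill length"
  )

density_plot
```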