Exploratory data I: penguins

Bulletin

first lab today, due Friday at 11:59pm on Gradescope
be sure to complete prepare material (on the schedule) before each class

Today

By the end of today you will…

gain familiarity with the ‘big seven’ functions of dplyr
define and compute various statistics
begin to gain familiarity with ggplot

Getting started

Download this application exercise by pasting the code below into your console (bottom left of screen)

download.file("https://sta101.github.io/static/appex/ae2.Rmd",
destfile = "ae2.rmd")

A note on YAML. Create a new .rmd

Load packages and data

library(tidyverse)
library(palmerpenguins)

Type ?palmerpenguins to learn more about this package. Or better yet, check it out here.

data(penguins)

Exercise 1:

Look at the data, how many observations are there? How many variables?

A package within a package…

When we load the tidyverse library, dplyr is packaged with it.

dplyr, a grammar of data manipulation offers intuitive ‘verb’ functions that describe actions we commonly want to perform with data. The big 7 we’ll cover today are:

mutate() adds new variables that are functions of existing variables
select() picks variables based on their names.
filter() picks cases based on their values.
group_by() sets us up to summarize across groups
summarize() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
slice() select, remove and duplicate rows based on their index

(as described at https://dplyr.tidyverse.org/)

Check out documentation with ?

Mutate

Approximate bill area (in $mm^2$) as bill length * bill depth:

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm)

## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   bill_area_mm2 <dbl>

Select

It’s hard to see bill length, depth and area in the same output, let’s select a smaller subset of the variables to look at.

# Example 1
penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year)

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, bill_area_mm2 <dbl>

# Example 2
penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, bill_area_mm2)

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, bill_area_mm2 <dbl>

A note on pipes %>% and a note on style.

Filter

Let’s just examine penguins on Dream island

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year)
# code here

Group by + Summarize

Exercise 2:

Find mean bill area across sex. Fill in the blanks

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(-year) %>%
  # filter for Dream
  group_by(___) %>%
  summarize(mean_bill_area_mm2 = ___)

Arrange + Slice

Let’s use arrange() and slice() to report the five penguins with the greatest bill area.

penguins %>%
  mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm) %>%
  select(bill_area_mm2, bill_length_mm) %>%
  arrange(desc(bill_area_mm2))

## # A tibble: 344 × 2
##    bill_area_mm2 bill_length_mm
##            <dbl>          <dbl>
##  1         1127.           54.2
##  2         1105.           55.8
##  3         1076.           52  
##  4         1065.           53.5
##  5         1056            52.8
##  6         1050.           51.7
##  7         1043.           52.7
##  8         1032.           58  
##  9         1021.           51.3
## 10         1013.           59.6
## # … with 334 more rows

Exercise 3:

Are these the same five penguins with the longest bills?

Optional hint: if you want to be exactly precise about which penguins are which, you could add an ID column, e.g.

 penguins %>%
  mutate(id = seq(1:nrow(penguins)))

## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, id <int>

This takes advantage of the nrow() function. Can you guess what it returns?

Exercise 4:

Compute the average bill length, bill depth, flipper length and body mass across all islands.

Exercise 5:

Is every species on every island?

Summary statistics

What is a statistic? It’s any mathematical function of the data. Sometimes, a statistic is referred to as “sample statistic” since you compute it from a finite sample (the data) and not the entire population.

For example, penguins is a sample of penguins in Antarctica, not an exhaustive list of the entire population.

Examples of statistics:

measure of central tendency:

mean
median
mode

measures of spread:

standard deviation
variance
range
quartiles
inter-quartile range (IQR)

order statistics:

quantiles
minimum (0 percentile)
median (50th percentile)
maximum (100 percentile)

… and any other arbitrary function of the data you can come up with!

Exercise 6:

Come up with your own statistic and write it in the narrative below.

To access a column of the data, we’ll use data$column.

Let’s write down the R function for each of the above.

Exercise 7:

Try to compute the above statistics for the penguin bill length column.

Do you receive an error? Why?

# code here

Let’s take a look at the distribution of bill length among all penguins.

penguins %>% # data
  ggplot(aes(x = bill_length_mm)) + # columns we want to look at
  geom_histogram() # geometry of the visualization

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bin).

Let’s visualize some of our summary statistics on the plot.

Exercise 8:

Uncomment the code below and fill in the blank with the mean.

penguins %>% 
  ggplot(aes(x = bill_length_mm)) + 
  geom_histogram() #+ 
  #geom_vline(xintercept = __, color = 'red')

Exercise 9

Add another geom_vline with the median and mode. Use separate colors for each.

Exercise 10

Finally, let’s try out another geometry geom_density and clean up our graph with some axes labels.

# live coding here