Bulletin

Today

Getting started

Download this application exercise by pasting the code below into your console:

download.file("https://sta101.github.io/static/appex/ae16.Rmd",
              destfile = "ae16.rmd")

Load packages

library(tidyverse)
library(tidymodels)

Guidelines for Discussion

  • Listen respectfully. Listen actively and with an ear to understanding others’ views.

  • Criticize ideas, not individuals.

  • Commit to learning, not debating. Comment in order to share information, not to persuade.

  • Avoid blame, speculation, and inflammatory language.

  • Avoid assumptions about any member of the class or generalizations about social groups.

Data Representation

Misleading Data Visualizations [1]

Brexit

[Figure: Brexit graph]

  • What is the graph trying to show?

  • Why is this graph misleading?

  • How can you improve this graph?

Spurious Correlations [2]

[Figure: a spurious correlation]

  • What is the graph trying to show?

  • Why is this graph misleading?
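One common reason two unrelated series look strongly related is that both simply trend over time. The sketch below illustrates that mechanism with simulated data; the series, slopes, and seed are assumptions made for illustration and are not taken from the graph above.

```r
set.seed(101)

year <- 1:20

# Two series simulated independently of each other;
# the only thing they share is an upward trend over time.
series_a <- 2 * year + rnorm(20, sd = 2)
series_b <- 5 * year + rnorm(20, sd = 5)

# Correlation of the raw series, driven by the shared time trend
raw_cor <- cor(series_a, series_b)

# Remove the linear time trend from each series,
# then correlate what is left over
resid_a <- resid(lm(series_a ~ year))
resid_b <- resid(lm(series_b ~ year))
detrended_cor <- cor(resid_a, resid_b)

raw_cor
detrended_cor
```

Comparing `raw_cor` with `detrended_cor` shows how much of the apparent association comes from the shared trend rather than from any link between the two series.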

Ethics of collecting + handling data

Web scraping [3]

A data analyst received permission to post a data set that was scraped from a social media site. The full data set included name, screen name, email address, geographic location, IP (Internet Protocol) address, demographic profiles, and preferences for relationships. The analyst removed name and email address from the data set in an effort to de-identify it.

  • Why might it be problematic to post this data set publicly?

  • How can you store the full data set in a safe and ethical way?

  • You want to make the data available so your analysis is transparent and reproducible. How can you modify the full data set to make the data available in an ethical way?
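As one possible starting point for the last question, the sketch below drops direct identifiers and coarsens fields that could identify someone in combination. The `scraped` tibble, its column names, and every value in it are hypothetical stand-ins for the fields described above, and the age bins are an arbitrary choice. Steps like these reduce re-identification risk but do not guarantee anonymity.

```r
library(tidyverse)

# Hypothetical scraped data; every value here is invented for illustration.
scraped <- tibble(
  name        = c("Ana Ruiz", "Ben Ode", "Cy Lee"),
  screen_name = c("ana_r", "b_ode", "cylee"),
  email       = c("ana@example.com", "ben@example.com", "cy@example.com"),
  city        = c("Durham", "Raleigh", "Durham"),
  state       = c("NC", "NC", "NC"),
  ip_address  = c("192.0.2.10", "192.0.2.11", "192.0.2.12"),
  age         = c(23, 37, 41),
  preference  = c("A", "B", "A")
)

shareable <- scraped %>%
  # Drop direct identifiers, not only name and email
  select(-name, -screen_name, -email, -ip_address) %>%
  # Coarsen age into ranges instead of publishing exact values
  mutate(age_group = cut(age,
                         breaks = c(0, 29, 44, Inf),
                         labels = c("under 30", "30-44", "45+"))) %>%
  # Keep state only (drop city) and keep the binned age only
  select(state, age_group, preference)

shareable
```

The full `scraped` table would stay in restricted storage; only `shareable` would be posted.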

Algorithmic bias: deep dive

- Video

Discussion questions

  • “Simpson’s paradox” occurs when conclusions drawn from analyzing subgroups differ from conclusions drawn when the groups are combined. Can you demonstrate Simpson’s paradox with the data below? [4]
berk = read_csv("https://sta101.github.io/static/appex/data/BerkeleyAdmissionsData.csv")
## Rows: 7 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Dept
## dbl (4): MaleYes, MaleNo, FemaleYes, FemaleNo
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
berk
## # A tibble: 7 × 5
##   Dept  MaleYes MaleNo FemaleYes FemaleNo
##   <chr>   <dbl>  <dbl>     <dbl>    <dbl>
## 1 A         512    313        89       19
## 2 B         313    207        17        8
## 3 C         120    205       202      391
## 4 D         138    279       131      244
## 5 E          53    138        94      299
## 6 F          22    351        24      317
## 7 All      1158   1493       557     1278
  • A company uses a machine learning algorithm to determine which job advertisement to display for users searching for technology jobs. Based on past results, the algorithm tends to display lower-paying jobs for women than for men (after controlling for characteristics other than gender). What ethical issues should be considered when reviewing this algorithm? [5]

  • As you start working on data analyses for the STA 101 project, internships, research, etc., what are one or two things you can do to ensure you’re doing the analysis in an ethical way?
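For the Simpson’s paradox question above, one way to check is to compare admission rates within each department against the rate with all departments pooled. The sketch below retypes the department rows from the printed `berk` tibble so that it runs without a download; the names `berk_dept`, `by_dept`, and `overall` are chosen here for illustration.

```r
library(tidyverse)

# Department rows copied from the `berk` tibble printed above
# (the "All" row is left out and recomputed from the departments).
berk_dept <- tribble(
  ~Dept, ~MaleYes, ~MaleNo, ~FemaleYes, ~FemaleNo,
  "A",   512,      313,     89,         19,
  "B",   313,      207,     17,         8,
  "C",   120,      205,     202,        391,
  "D",   138,      279,     131,        244,
  "E",   53,       138,     94,         299,
  "F",   22,       351,     24,         317
)

# Admission rate within each department
by_dept <- berk_dept %>%
  mutate(
    male_rate   = MaleYes / (MaleYes + MaleNo),
    female_rate = FemaleYes / (FemaleYes + FemaleNo)
  ) %>%
  select(Dept, male_rate, female_rate)

# Admission rate with all departments pooled
overall <- berk_dept %>%
  summarize(
    male_rate   = sum(MaleYes) / sum(MaleYes + MaleNo),
    female_rate = sum(FemaleYes) / sum(FemaleYes + FemaleNo)
  )

by_dept
overall
```

With these counts, women have the higher admission rate in departments A, B, D, and F, yet the pooled rate is about 44% for men and about 30% for women. The reversal arises because most women in this data set applied to departments C through F, where admission rates are low for everyone.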


  1. Source: https://humansofdata.atlan.com/2019/02/dos-donts-data-visualization

  2. Source: https://www.tylervigen.com/spurious-correlations. Content warning: some examples include death or suicide.

  3. Modified from Modern Data Science with R, 2nd Edition

  4. Source: https://www.randomservices.org/random/data/Berkeley.html

  5. Source: Modern Data Science with R, 2nd Edition