Lab 04: Parkinson’s disease

By the end of this lab you will

compute probabilities
build a logistic regression model and assess its performance

Getting started

1. Download the lab template by pasting the code below in your console:

download.file("https://sta101.github.io/static/labs/lab04_template.Rmd",
              destfile = "lab04.rmd")

2. Under the “Files” tab on the right hand side, click on lab04.rmd to open the lab template.

3. Complete the exercises below using the space provided.

Warm up

Be sure to update the YAML at the top of the document to include your name and the date.

Packages

library(tidyverse)
library(tidymodels)
library(knitr)

Data

Load the data:

parkinsons = read_csv("https://sta101.github.io/static/labs/data/parkinsons_cleaned.csv")

This dataset comes from Little et al. (2008). The data includes various measurements of dysphonia from 32 people, 24 with Parkinson’s disease (PD). Multiple measurements were taken per individual. The measurements we examine in this subset of the data include:

name: patient ID
jitter: a measure of relative variation in fundamental frequency
shimmer: a measure of variation in amplitude (dB)
PPE: pitch period entropy
HNR: a ratio of total components vs. noise in the voice recording
status: health status (1 for PD, 0 for healthy)

Exercises

What are the identification codes (names) of healthy individuals in the data set? Print your output as a nice kable table.
What is the probability a randomly selected observation from this data set is an observation of an individual with Parkinson’s disease? What is the probability a randomly selected participant (different than an observation) in this data set has PD?
Create a new column that classifies vocal amplitude variability as high (>=.30 dB) or low (<30). Is high vocal amplitude independent of whether or not someone has PD? Why or why not?
What is the probability HNR is greater than 25 given that the participant is in the control group?
Next you will build a predictive model of PD status, but first split the data into two disjoint sets: a training set and a test set. From the original data frame, remove 4 PD and 4 healthy individuals to be in your test set. For consistency, choose the 4 healthy individuals with the lowest ID number of their respective category, e.g. phon_R01_S07 ends with 07 (the lowest number of the healthy group) so they should be placed in the test data frame. Similarly, the lowest ID number for an individual with PD is S01 so phon_R01_S01 should also be placed in the test data frame.

Your train data frame should contain 147 rows and your test data frame should contain 48. In your code chunk, print the number of rows of each data frame.

Big hint: create a group of individuals to be in your test data frame, e.g. keep_ids = c("id1", "id2", "id3") and pair this with filter logic described in ae3.

Fit a main effects logistic regression model that predicts prob(PD) status from HNR, PPE, jitter and shimmer. Print your model tidy.
Edit the code chunk below, specifically renaming model_fit and test_data where appropriate. Un-comment and run to find the predicted probabilities of Parkinson’s disease in the test data frame.

# prediction = predict(model_fit, test_data, type = "prob")

# test_result = test_data %>%
#   mutate(predicted_prob_pd = prediction$.pred_1)

Next, create a new column that classifies an individual as having PD if the predicted probability is above 50%. Repeat with a decision boundary of 75% and 90%.

How many false positives do you have in each case? False negatives? If you were to use your model as a diagnostic tool for PD to decide if someone should undergo subsequent testing, which decision boundary would you prefer and why?

Note: your narrative should read, e.g.: “With a decision boundary of 50%, my model yields X false positives and Y false negatives. With a decision boundary of 75%…” etc.

Formatting

Reminder: For all assignments in this course there is a “formatting” component to the grade. To receive full points for “formatting”, you must:

1. Have your name at the top of the knitted document

2. Pipes %>% and ggplot layers + should be followed by a newline (see formatting above)

3. Your code should be under the 80 character code limit. (You shouldn’t have to scroll to see all your code on the knitted document).

4. All exercises and corresponding pages should be linked on gradescope.

These necessary “tidyverse” style choices are good general practice and will help make your code more legible. For a more extensive list of recommended guidelines, click here.

Submitting to gradescope

Once you are fully satisfied with your lab, Knit to .pdf to create a .pdf document. You may notice that the formatting/theme of the report has changed – this is expected. Remember – you must turn in a .pdf file to the Gradescope page before the submission deadline for credit. To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner. - Click School Credentials, Duke NetID and log in using your NetID credentials.
Click on your STA 101 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the papers of your lab should be associated with at least one question (i.e., should be “checked”). Select the first page of your .pdf submission to be associated with the “Formatting” section.

Grading

Total: 50 pts.

Exercise 1: 4pts

Exercise 2: 6pts

Exercise 3: 7pts

Exercise 4: 3pts

Exercise 5: 8pts

Exercise 6: 9pts

Exercise 7: 9pts

Workflow and formatting:  4pts