By the end of this lab you will
1. Download the lab template by pasting the code below in your console:
download.file("https://sta101.github.io/static/labs/lab04_template.Rmd",
destfile = "lab04.rmd")
2. Under the “Files” tab on the right hand side, click on
lab04.rmd
to open the lab template.
3. Complete the exercises below using the space provided.
Be sure to update the YAML at the top of the document to include your name and the date.
library(tidyverse)
library(tidymodels)
library(knitr)
Load the data:
parkinsons = read_csv("https://sta101.github.io/static/labs/data/parkinsons_cleaned.csv")
This dataset comes from Little et al. (2008). The data includes various measurements of dysphonia from 32 people, 24 with Parkinson’s disease (PD). Multiple measurements were taken per individual. The measurements we examine in this subset of the data include:
name
: patient IDjitter
: a measure of relative variation in fundamental
frequencyshimmer
: a measure of variation in amplitude (dB)PPE
: pitch period entropyHNR
: a ratio of total components vs. noise in the voice
recordingstatus
: health status (1 for PD, 0 for healthy)What are the identification codes (names
) of healthy
individuals in the data set? Print your output as a nice
kable
table.
What is the probability a randomly selected observation from this data set is an observation of an individual with Parkinson’s disease? What is the probability a randomly selected participant (different than an observation) in this data set has PD?
Create a new column that classifies vocal amplitude variability as high (>=.30 dB) or low (<30). Is high vocal amplitude independent of whether or not someone has PD? Why or why not?
What is the probability HNR
is greater than 25 given
that the participant is in the control group?
Next you will build a predictive model of PD status, but first
split the data into two disjoint sets: a training set and a test set.
From the original data frame, remove 4 PD and 4 healthy individuals to
be in your test set. For consistency, choose the 4 healthy individuals
with the lowest ID number of their respective category,
e.g. phon_R01_S07
ends with 07
(the lowest
number of the healthy group) so they should be placed in the test data
frame. Similarly, the lowest ID number for an individual with PD is
S01
so phon_R01_S01
should also be placed in
the test data frame.
Your train data frame should contain 147 rows and your test data frame should contain 48. In your code chunk, print the number of rows of each data frame.
Big hint: create a group of individuals to be in your test data
frame, e.g. keep_ids = c("id1", "id2", "id3")
and pair this
with filter logic described in ae3
.
Fit a main effects logistic regression model that predicts
prob(PD) status from HNR
, PPE
,
jitter
and shimmer
. Print your model
tidy
.
Edit the code chunk below, specifically renaming
model_fit
and test_data
where appropriate.
Un-comment and run to find the predicted probabilities of Parkinson’s
disease in the test data frame.
# prediction = predict(model_fit, test_data, type = "prob")
# test_result = test_data %>%
# mutate(predicted_prob_pd = prediction$.pred_1)
Next, create a new column that classifies an individual as having PD if the predicted probability is above 50%. Repeat with a decision boundary of 75% and 90%.
How many false positives do you have in each case? False negatives? If you were to use your model as a diagnostic tool for PD to decide if someone should undergo subsequent testing, which decision boundary would you prefer and why?
Note: your narrative should read, e.g.: “With a decision boundary of 50%, my model yields X false positives and Y false negatives. With a decision boundary of 75%…” etc.
Reminder: For all assignments in this course there is a “formatting” component to the grade. To receive full points for “formatting”, you must:
1. Have your name at the top of the knitted document
2. Pipes %>%
and ggplot layers +
should
be followed by a newline (see formatting above)
3. Your code should be under the 80 character code limit. (You shouldn’t have to scroll to see all your code on the knitted document).
4. All exercises and corresponding pages should be linked on gradescope.
These necessary “tidyverse” style choices are good general practice and will help make your code more legible. For a more extensive list of recommended guidelines, click here.
Once you are fully satisfied with your lab, Knit to .pdf to create a .pdf document. You may notice that the formatting/theme of the report has changed – this is expected. Remember – you must turn in a .pdf file to the Gradescope page before the submission deadline for credit. To submit your assignment:
Go to http://www.gradescope.com and click
Log in in the top right corner. - Click School Credentials
,
Duke NetID
and log in using your NetID
credentials.
Click on your STA 101 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the papers of your lab should be associated with at least one question (i.e., should be “checked”). Select the first page of your .pdf submission to be associated with the “Formatting” section.
Total: 50 pts.
Exercise 1: 4pts
Exercise 2: 6pts
Exercise 3: 7pts
Exercise 4: 3pts
Exercise 5: 8pts
Exercise 6: 9pts
Exercise 7: 9pts
Workflow and formatting: 4pts