This regression project is due Tuesday July 19 at 11:59pm.
The deliverable for this project (what you will turn in) is a written report. See below for more details.
About the project
For this project you will create a linear model (of your choosing) to predict newborn birth weights (or some transformation of the variable) using a sample of 3000 observations from 2020 US birth data originally sourced from the CDC.
Download the data using the code below.
births = read_csv("https://sta101.github.io/static/appex/data/trainUSbirth.csv")
Here’s a code book for the data set:
newborn_birth_weight
: newborn birth weight in grams (response)month
: birth month (1 = January, …, 12 = December)mother_age
: age of the mother in yearsprenatal_care_starting_month
: month in which prenatal care began; if 0, there was no prenatal caredaily_cigarette_prepregnancy
: daily number of cigarettes smoked before the pregnancydaily_cigarette_trimester_1
: daily number of cigarettes smoked during the 1st trimester of the pregnancydaily_cigarette_trimester_2
: daily number of cigarettes smoked during the 2nd trimester of the pregnancydaily_cigarette_trimester_3
: daily number of cigarettes smoked during the 3rd trimester of the pregnancymother_height
: height of the mother in inchesmother_bmi
: body mass index of the mothermother_weight_prepregnancy
: weight of the mother before the pregnancy in poundsmother_weight_delivery
: weight of the mother at delivery in poundsmother_diabetes_gestational
: whether the mother had diabetes during the pregnancynewborn_sex
: sex of the newborngestation_week
: number of gestational weeksmother_risk_factors
: whether the mother had any risk factor (diabetes, hypertension, previous preterm birth, previous cesarean, infertility treatment used, etc)
If you wish, you can optionally evaluate the predictive ability of your model, using 1000 observations contained in the test data set below.
test_data = read_csv("https://sta101.github.io/static/appex/data/testUSbirth.csv")
Here is an example of exactly how to evaluate your predictive ability in a typical linear regression model:
# load data
births = read_csv("https://sta101.github.io/static/appex/data/trainUSbirth.csv")
test_data = read_csv("https://sta101.github.io/static/appex/data/testUSbirth.csv")
# fit a model with mother_age as a predictor of newborn_birth_weight
myPredictiveModel = linear_reg() %>%
set_engine("lm") %>%
fit(newborn_birth_weight ~ mother_age, data = births)
# predict based on new data
predict_test = test_data %>%
mutate(myPrediction = predict(myPredictiveModel, test_data)$.pred)
# compute sum squared error
predict_test %>%
mutate(squared_error = (newborn_birth_weight - myPrediction)^2) %>%
summarize(sse = sum(squared_error))
See ae7 for an example of how to do this in a logistic regression setting.
Written report
Your report must be written using R Markdown. Your written report should not exceed five pages inclusive of all tables and figures. Use the code below to download a template file for the project.
download.file("https://sta101.github.io/static/projects/regression_project_template.Rmd",
destfile = "regressionProject.rmd")
To begin, add YAML to the top and specify a project name, a team name (optional) and the names of each group member. You can use the YAML posted below as a template.
---
title: "Regression project"
author: "The Last Rbenders: Aang, Katara, Sokka, Momo"
---
All team members must contribute to the report. Before you finalize your report, make sure the printing of code chunks is turned off by including the following code chunk at the top of your RMD:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
Next, load any relevant libraries and the data.
The written report is worth 50 points, broken down as follows:
Introduction 7pts
The introduction provides motivation and context for your research.
To begin, introduce the data set in a few short sentences. Next, create a code book (aka a “data dictionary”) of the variables in the data set. Although a code book is provided above, you should include one in your report as well so that your report is self-contained.
Complete the introduction by providing a concise, clear statement of your research question and hypotheses. Be sure to motivate why the research question is interesting / useful.
Example research question and hypotheses:
Can we predict body mass with bill depth? We hypothesize that penguins with deeper bills will also have more mass.
Methodology 15pts
Here you should introduce your regression model. Specifically, do you use regular linear regression or logistic regression? Why? What’s the full model equation?
Did you choose to leave out some predictors but include others? Why or why not? You can also include summary statistics or create plots to explore the data and help inform your model choices.
Do you fit multiple models? If so, why?
Do you compare them using R^2 or AIC? If so, why?
Results 15pts
Report and interpret your final fitted model equation(s).
Report any relevant R^2 or AIC you use to select between models.
Place figure(s) here to illustrate the main results from your analysis. 1 beautiful figure is worth more than several poorly formatted figures. You must have at least 1 figure.
Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and create every possible model for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in writing about and interpreting results, and that you can accomplish these tasks using R. More is not better.
How well does your model predict new data?
Discussion 8pts
This section is a conclusion and discussion. You should
-
Summarize your main finding in a sentence or two.
-
Discuss your finding and why it is useful (put in the context of your motivation from the introduction).
-
Critique your own analyses and include a brief paragraph on what you would do differently if you were able to start the project over.
-
List a brief (1 or 2 sentence) summary of the relative contributions of each team member. E.g. “Aang built the models, Katara implemented them in R, and Sokka wrote the introduction and discussion.”
- Note: all team members should be comfortable describing all aspects of the project and understanding all code.
Formatting 5pts
Your project should be professionally formatted. For example, this means labeling graphs and figures, turning off code chunks, and using typical style guidelines.
Submitting to gradescope
-
Select one team member to upload the team’s PDF submission to Gradescope.
-
Be sure to select every team member’s name in Gradescope.
-
Associate all pages with “Full report”.