By the end of today you will…
Download this application exercise by pasting the code below into your console
download.file("https://sta101.github.io/static/appex/ae5.Rmd",
destfile = "ae5.rmd")
library(tidyverse)
library(tidymodels)
library(scatterplot3d)
Today’s data is a collection of tech stock prices from January 1st, 2020 to December 31st, 2021. I pulled this data from Yahoo Finance using their API via the tidyquant package in July 2022.
stocks = read_csv("https://sta101.github.io/static/appex/data/stocks2.csv")
From last time…
What is \(\hat{y}\)? How is it different than \(y\)?
What is \(\hat{\beta}\)? How is it different than \(\beta\)?
What is \(R^2\) in your own words?
What is a residual? How is it different than error?
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]
\(y\): the outcome variable. Also called the “response” or “dependent variable”. In prediction problems, this is what we are interested in predicting.
\(x_i\): the \(i^{th}\) predictor. Also commonly referred to as “regressor”, “independent variable”, “covariate”, “feature”, “the data”.
\(\beta_i\): “constants” or coefficients i.e. fixed numbers. These are population parameters. \(\beta_0\) has another special name, “the intercept”.
\(\epsilon\): the error. This quantity represents observational error, i.e. the difference between our observation and the true population-level expected value: \(\beta_0 + \beta_1 x_1 + \beta_2 x_2\).
Effectively this model says our data \(y\) is linearly related to \(x_1\) and \(x_2\), but is not perfectly observed due to some error.
Let’s examine the first quarter of 2020 high prices of Microsoft, IBM and Apple stocks to illustrate some ideas.
If we have three measurements (variables) then each observation is a point in three-dimensional space. In this example, we can choose one of our measurements to be the outcome variable (e.g. Apple stock price) and use our other two measurements (MSFT and IBM price) as predictors.
In general, the total number of measurements, i.e. variables (columns), in our linear model determines the spatial dimension of our model.
Our fitted linear model no longer looks like a line, but instead looks like a plane.
This plane shows our prediction of AAPL price (\(y\)) given both MSFT price (\(x_1\)) and IBM price (\(x_2\)).
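If you'd like to recreate a picture like this yourself, here is a minimal sketch using scatterplot3d (loaded above). The column names MSFT.High, IBM.High, and AAPL.High are assumptions about how the wide-format stocks data is laid out (and the sketch uses the full date range rather than just the first quarter of 2020), so adjust the names to match your data frame.

# a rough sketch: plot the three high-price series in 3D, then overlay the
# OLS plane fit with base lm() (scatterplot3d's plane3d() expects an lm object)
# NOTE: MSFT.High, IBM.High, AAPL.High are assumed column names
s3d = scatterplot3d(
  x = stocks$MSFT.High,
  y = stocks$IBM.High,
  z = stocks$AAPL.High,
  xlab = "MSFT high", ylab = "IBM high", zlab = "AAPL high"
)
plane_fit = lm(AAPL.High ~ MSFT.High + IBM.High, data = stocks)
s3d$plane3d(plane_fit)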
Demo: building intuition for higher dimensional linear models
In \(n\)-dimensional space, a linear equation creates a \(\text{insert number here}\)-dimensional object.
Find the equation of the plane above with this one simple trick!
myModelFit = linear_reg() %>%
  set_engine("lm") %>%
  fit(outcome ~ predictor1 + predictor2 + predictor3 + ..., data = data-set-here)
Notice that we can simply ‘add’ in new predictors! This code template will fit the model according to the ordinary least squares (OLS) objective function, i.e. we are finding the equation of the hyperplane that minimizes the sum of squared residuals.
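Concretely, “minimizes the sum of squared residuals” means the fitted coefficients solve

\[ \min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2 \]

where \(p\) is the number of predictors and \(n\) is the number of observations.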
You can subsequently print the coefficients (\(\beta\)s) to the screen by simply typing the model name, e.g. myModelFit, or by calling the tidy() function on your fitted model, e.g. tidy(myModelFit).
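For instance, here is a small self-contained illustration of the template using R’s built-in mtcars data (deliberately not today’s stocks data, so it doesn’t give away the exercise below); the name example_fit is just for this illustration.

# illustration only: predict fuel efficiency (mpg) from weight (wt) and horsepower (hp)
example_fit = linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

example_fit        # typing the name prints the coefficients
tidy(example_fit)  # a tidy data frame of estimates and standard errors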
In the code chunk below, fit the multiple regression model described above where
\(y\): AAPL high price, \(x_1\): MSFT high price, \(x_2\): IBM high price.
Then write the equation of your fitted model below.
apple_high_fit
# code here
The equation of the plane above:
\[ \text{your equation here} \]
Interpret the coefficients in your equation above.
[your interpretation here]
Applying a model to values outside the range of the original data is called extrapolation. Extrapolation can be very unreliable.
That being noted, it would be nice if our model were only able to predict realistic outcomes. If we extrapolate our forecast, we will see that our linear model can easily predict unrealistic values. For example, with a negative slope, a very high Microsoft price would drive our Apple prediction down to a negative value.
However, stock prices cannot be negative. A more useful modeling framework used by investors is to predict the “log return” of a stock. Over the course of a day, the log return is defined as:
\[ \log(\text{close price}) - \log(\text{open price}) = \log \left( \frac{\text{close price}}{\text{open price}} \right) \]
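For example, a mutate() along these lines computes one stock’s daily log return. The column names AAPL.Open and AAPL.Close are assumptions about how the stocks data frame is laid out, so swap in whatever names your data actually uses.

# a minimal sketch, assuming stocks has columns named AAPL.Open and AAPL.Close
stocks %>%
  mutate(AAPL.LogReturn = log(AAPL.Close) - log(AAPL.Open))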
Starting with your stocks data frame, create new columns AAPL.LogReturn, MSFT.LogReturn, and IBM.LogReturn that show the daily log return of each stock. Continue this for the remaining stocks in the data frame. Save your new data frame as stock_returns.
# code here
Fit the following model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \] where
\(y\): AAPL daily log return, \(x_1\): MSFT daily log return, \(x_2\): IBM daily log return
and report \(R^2\).
# code here
So far we’ve only used the present to predict the present, i.e. we’ve used January 1st IBM prices to predict January 1st AAPL prices. While the resulting models are quite good, they are not particularly useful.
It would be much more useful if we could predict the return of AAPL tomorrow so that we could make an informed decision about buying or selling it.
To begin such an endeavor, let’s build a model that uses yesterday’s log-return of IBM and MSFT to predict today’s log return of AAPL.
What should our data frame look like?
[ your answer here ]
Let’s make that data frame! Adapt the example below to create new columns for yesterday’s IBM and MSFT returns.
stock_returns2 = stock_returns %>%
  mutate(AAPL.LogReturnYesterday = lag(AAPL.LogReturn, 1)) %>%  # shift AAPL's return down one row
  filter(!is.na(AAPL.LogReturnYesterday))  # drop the first day, which has no "yesterday"

stock_returns2
Fit the following model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]
where
\(y\): AAPL daily log return, \(x_1\): MSFT log return yesterday, \(x_2\): IBM log return yesterday
and report \(R^2\). What do you notice?
# code here
To get a good picture of how we go about model comparison, we have to introduce some new ideas.
We’ll start by examining a common assumption about our error, \(\epsilon\).
The assumption is
\[ \epsilon \sim N(0, \sigma^2). \] In words: errors are normally distributed with mean zero and some unknown standard deviation \(\sigma\).
In a picture: [figure: the Normal density curve, a bell-shaped curve centered at 0 with spread \(\sigma\)]
The function that defines this Normal distribution curve above is:
\[ f(\epsilon) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2 \sigma^2}\epsilon^2} \]
More explicitly, this assumption states: if we observe more and more data points, we expect a histogram of the error to eventually look like the bell-shaped curve above. Since we can’t observe the true error, we might use residuals as an estimate.
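Here is a quick sketch of that idea: simulate many draws from \(N(0, \sigma^2)\) (with an arbitrary choice of \(\sigma = 2\)) and plot their histogram. With this many simulated points, the bell shape is unmistakable.

# simulate 10,000 errors from N(0, sigma^2) with sigma = 2 (an arbitrary choice)
set.seed(5)
simulated_errors = tibble(epsilon = rnorm(10000, mean = 0, sd = 2))

ggplot(simulated_errors, aes(x = epsilon)) +
  geom_histogram(bins = 40)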
Accessing the residuals can be tricky, so we’ll use the template code below to plot a histogram of the residuals:
ggplot(data = model-fit-here$fit, aes(x = .resid)) +
  geom_histogram()
Let’s take a look offline at visualizing this assumption in the context of regression.
Recall that in simple linear regression
\[ y = \beta_0 + \beta_1 x + \epsilon \]
For a given data point, e.g. \(y = 78.4\) and \(x = 164.35\), assuming the error is 0, there are combinations of slope and intercept that fit this point perfectly. In fact, there are infinitely many lines we can draw (combinations of slope and intercept) that go through this point.
But if we fix the intercept at, say, \(\beta_0 = 0\), then there is only one slope that gives us a perfect estimate. You could solve for it algebraically…
\[ 78.4 = 0 + \beta_1 \cdot164.35\\ \beta_1 = 0.477 \] but interestingly, we could also plot a likelihood function. This is, intuitively, the likelihood of the data given various values of the parameter \(\beta_1\) under our statistical model.
To generate this likelihood, we use the function that defines the normal curve but substitute
\[ \epsilon = y - (\beta_0 + \beta_1 x). \]
If we compute the likelihood for each data point, we can multiply them together to obtain an overall likelihood score.
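Here is a rough sketch of that picture for the single point above, holding \(\beta_0 = 0\) fixed and picking an arbitrary \(\sigma = 10\): evaluate the Normal density of the implied error over a grid of candidate slopes and plot it. The curve peaks near \(\beta_1 = 0.477\).

# likelihood of the single observation (y = 78.4, x = 164.35) as a function of
# the slope beta_1, with the intercept fixed at 0 and sigma = 10 (arbitrary)
y_obs = 78.4
x_obs = 164.35

likelihood_grid = tibble(
  beta1 = seq(0.3, 0.65, by = 0.001),
  likelihood = dnorm(y_obs - (0 + beta1 * x_obs), mean = 0, sd = 10)
)

ggplot(likelihood_grid, aes(x = beta1, y = likelihood)) +
  geom_line()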
AIC is defined as
\[ 2k - 2 \log(\text{likelihood}) \] where \(k\) is the number of estimated parameters (\(\beta\)s) in the model. Notice this will be 1 + the number of predictors.
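You don’t need to compute AIC by hand. For example, broom’s glance() (attached with tidymodels) reports AIC alongside \(R^2\) when applied to an lm object, and base R’s AIC() returns just the number. A sketch, reusing the illustrative example_fit from earlier:

# example_fit is the illustrative mtcars fit from above; $fit is the underlying lm object
glance(example_fit$fit)  # one-row summary including r.squared, AIC, BIC, ...
AIC(example_fit$fit)     # just the AIC value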
Re-fit the model from exercise 7 but add 1 more predictor: yesterday’s AAPL return. Before you run any code, do you think \(R^2\) will increase or decrease? After fitting the model and glancing at the results, which model do you prefer and why?
# code here
[your answer here]