Lecture 14 (4/27/2022)

Announcements

  • Lab 5 may take a little longer than previous labs

  • Pset 3 due today

  • Purva’s OH in ERCA after lab today

Last time we covered:

  • Linear regression basics (ordinary least squares)

Today’s agenda:

  • Understanding and evaluating linear regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Deep Dive: regression in python

Last time, we talked about a number of different solutions for doing regression in python.

The most popular option, and the one we’ll focus on today, is the scikit LinearRegression class.

*Context: scikit is a robust set of tools for data science and analysis. Take a look at their other model-fitting resources here.

mpg = sns.load_dataset('mpg')

mpg
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790 15.6 82 usa ford mustang gl
394 44.0 4 97.0 52.0 2130 24.6 82 europe vw pickup
395 32.0 4 135.0 84.0 2295 11.6 82 usa dodge rampage
396 28.0 4 120.0 79.0 2625 18.6 82 usa ford ranger
397 31.0 4 119.0 82.0 2720 19.4 82 usa chevy s-10

398 rows × 9 columns

# Let's look at the data
# Is there a linear relationship between "weight" and "horsepower"?

sns.scatterplot(data = mpg,
                x = "weight",
                y = "horsepower"
               )
<AxesSubplot:xlabel='weight', ylabel='horsepower'>
../../_images/Lecture_14-pre_7_1.png
# Let's make sure this data is cleaned up by checking for null values!
# Handy resource: https://datatofish.com/rows-with-nan-pandas-dataframe/


mpg[mpg.isna().any(axis=1)]

# mpg.shape
# mpg.index
mpg cylinders displacement horsepower weight acceleration model_year origin name
32 25.0 4 98.0 NaN 2046 19.0 71 usa ford pinto
126 21.0 6 200.0 NaN 2875 17.0 74 usa ford maverick
330 40.9 4 85.0 NaN 1835 17.3 80 europe renault lecar deluxe
336 23.6 4 140.0 NaN 2905 14.3 80 usa ford mustang cobra
354 34.5 4 100.0 NaN 2320 15.8 81 europe renault 18i
374 23.0 4 151.0 NaN 3035 20.5 82 usa amc concord dl
# Looks like we need to handle the null values above
mpg_clean = mpg.dropna().reset_index(drop = True)

mpg_clean[mpg_clean.isna().any(axis=1)]
# mpg.shape
# mpg.index
mpg cylinders displacement horsepower weight acceleration model_year origin name
from sklearn.linear_model import LinearRegression # first, import the LinearRegression model class

# Prepare our data
x_vals = np.array(mpg_clean['weight'])
x_vals = x_vals.reshape(len(x_vals), 1) # what's going on here?
x_vals.shape
x_vals[0, 0]


y_vals = np.array(mpg_clean['horsepower'])
y_vals = y_vals.reshape(len(y_vals), 1)


# Initialize the model
mod = LinearRegression()
# Fit the model
mod = mod.fit(X = x_vals, y = y_vals)

mod
LinearRegression()

Our (fitted) model comes with a number of attributes, including, most critically, the intercept and slope estimates

# Look at the values
mod.intercept_
mod.coef_ # why is this different from the above?

# Other attributes we won't get into here
array([[0.03917702]])

Reminder: what do these values actually mean?

sns.scatterplot(data = mpg,
                x = "weight",
                y = "horsepower"
               )

plt.axline(xy1 = (0, mod.intercept_[0]), slope = mod.coef_[0, 0], color = "r", linestyle = "--")
<matplotlib.lines._AxLine at 0x7fda2c841dc0>
../../_images/Lecture_14-pre_14_1.png

Your turn!

Take a few minutes now to choose another pair of variables from this dataset.

Plot them to see if they have a linear(ish) relationship, then fit a regression to them and examine the parameter estimates.

NOTE: when doing this, try not to just copy-paste what I have above; instead, write it out yourself while looking back at the code when you need to.

Why? Copy-pasting is a big part of coding, but it works better when you’re already familiar with the syntax and just need the right details. When you’re doing something for the first time, it’s good to internalize what you’re doing!

### YOUR CODE HERE

What variables did you choose? How did your regression do?


Above we looked at some of the attributes of our model class.

This class also has several useful functions. We’ve seen one already above (fit()).

What else can a LinearRegression object do?

# Other functions we can use

mod.predict(X = x_vals) # what is this doing?

mod.score(X = x_vals, y = y_vals) # what is this? 

# These are the primary ones
0.74742549968982

Evaluating linear regression

In the section above, we generated a regression for the relationship between weight and horsepower.

How do we know if our regression is a good fit to the data? What gives you confidence that it’s describing the relationship between the variables well?

\(R^2\) (coefficient of determination): how much variance are we explaining?

\( R^2 = 1 - \dfrac{RSS}{TSS} \)

\( RSS = \sum_{i=1}^{n}{(y_i - \hat{y_i})}^2 \)

Where have we seen this before?

\( TSS = \sum_{i=1}^{n}{(y_i - \bar{y})}^2 \)

What does this look like?

\(R^2\) ranges between 0 and 1 and can be thought of as the percentage of variance in \(y\) that our model explains.

Intuition for this metric: divide the SS for our guess by the SS if our guess was just the mean of \(y\).

  • Raw value of RSS will vary depending on how many data points we have

  • What’s the lowest value that RSS could have? …

  • What’s the highest value that RSS could have? …

So our RSS should be somewhere between 0 (good) and whatever we would get from guessing \(\bar{y}\) for every point (bad). Divide RSS by that worst case value to indicate how close we are to worst case, then subtract from 1 for how close we are to best case.

Here’s a graphic illustrating it (source: wikipedia)

rsq

Let’s calculate it by hand so we understand it better:

rsq_mod = mod.score(X = x_vals, y = y_vals) # R^2 value
rsq_mod

# Add predicted y values to dataframe
preds = mod.predict(X = x_vals)
mpg_clean = mpg_clean.assign(model_pred = pd.Series([x[0] for x in preds]))

# Calculate RSS based on predicted values
mpg_clean = mpg_clean.assign(resid_sum_sq = (mpg_clean['horsepower'] - mpg_clean['model_pred'])**2)

# Calculate TSS based on mean of y
ybar = mpg_clean['horsepower'].mean()
mpg_clean = mpg_clean.assign(total_sum_sq = (mpg_clean['horsepower'] - ybar)**2)

rsq_hand = 1 - mpg_clean['resid_sum_sq'].sum() / mpg_clean['total_sum_sq'].sum()
rsq_hand

# What is the R^2 value?
print("Model value: {}".format(rsq_mod))
print("Hand calculation: {}".format(rsq_hand))
Model value: 0.74742549968982
Hand calculation: 0.74742549968982

Out of sample prediction (more on this later!)

Since the goal with our model is prediction, another way we can test whether it’s doing a good job is to see how it predicts new data!

There are a number of different approaches for this that we’ll talk more about later.

For now, let’s demonstrate a simple one: train the model, then test on hold-out data.

Step 1 is to randomly sample, e.g., 25% of our data and set it aside as the test data.

# Randomly sample 25% of our data points to be test data

from sklearn.model_selection import train_test_split

x_vals = np.array(mpg_clean['weight'])
x_vals = x_vals.reshape(len(x_vals), 1)
y_vals = mpg_clean['horsepower']

xtrain, xtest, ytrain, ytest = train_test_split(x_vals, y_vals, test_size = 0.25)

# mpg_clean.shape
# len(Xtest)

Step 2 is to fit our regression to the training data only.

# Initialize the model
mod_tr = LinearRegression()
# Fit the model
mod_tr = mod_tr.fit(X = xtrain, y = ytrain)

mod_tr.intercept_
mod_tr.coef_
array([0.03977281])

Step 3 is to see how well our regression model performs on the held out test data

mod_preds = mod_tr.predict(X = xtest)

mod_preds
array([181.49694764, 126.76956519, 114.08203975,  92.60472397,
        78.24674063, 170.83783536,  92.56495116, 157.87190028,
       157.19576256,  79.83765291, 122.59342045, 151.07075028,
       119.05364063,  76.45696432,  68.50240291, 100.32064853,
        86.04221081, 113.24681081, 114.47976782,  86.24107484,
        66.67285379, 132.41730379,  65.32057835, 123.94569589,
       122.8718301 ,  83.1785687 ,  72.20127397,  77.01378362,
       174.69579764, 129.91161695,  95.34904765, 126.37183712,
        77.84901256, 114.83772309,  81.62742923, 101.75246958,
        73.87173186, 141.36618537, 132.09912133,  97.73541607,
       101.871788  ,  71.28649941, 185.11627308, 120.80364414,
        91.17290291,  64.92285028, 154.013938  , 123.15023975,
       154.88893975,  76.05923625, 174.57647922,  88.78653449,
        75.2240073 ,  79.04219677,  84.37175291,  59.31488449,
       171.5935187 , 109.07066607,  76.01946344,  69.89445116,
        68.90013098,  59.95124941,  56.37169677, 162.36622747,
       190.80378449,  91.6899494 , 100.43996695, 162.24690905,
        56.17283274, 165.42873361, 149.28097396,  83.85470642,
        81.22970116,  74.62741519,  74.667188  , 115.43431519,
        95.15018361, 113.36612923,  68.86035818, 141.76391344,
       124.18433274, 130.82639151,  94.55359151,  80.71265467,
        89.26380818, 160.57645115,  65.91717046, 130.74684589,
       160.41735993,  71.88309151,  99.0479187 ,  59.39443011,
       135.71844677,  71.08763537, 107.24111695,  80.63310905,
       149.36051958, 123.03092133])
sns.scatterplot(x = xtest.reshape(len(xtest)),
                y = ytest
               )

plt.axline(xy1 = (0, mod_tr.intercept_), slope = mod_tr.coef_[0], color = "r", linestyle = "--")

plt.title("Regression performance on held out test data")
Text(0.5, 1.0, 'Regression performance on held out test data')
../../_images/Lecture_14-pre_31_1.png

Step 4 is …

now what?

Let’s compare how our model did on the testing data compared to the training data it was fit to.

There are lots of ways to do this (incl. the \(R^2\) values we discussed above) and we’ll spend more time looking at some of them next week.

But one commoon approach worth highlighting is the mean squared error.

Mean squared error (MSE) is just the average of the squared residuals or squared error, the same stuff we’ve been dealing with from the beginning.

\( MSE = \dfrac{1}{n - 2} \sum_{i=1}^{n}{(y_i - \hat{y_i})}^2 = \dfrac{1}{n - 2} \sum_{i=1}^{n}{\epsilon_i}^2 \)

This tells you, on average, how close your model was to the true value across all the data points.

Why the \(n - 2\) here?

Let’s calculate it for our train and test datasets and compare them.

Do you think they’ll be similar?

# Training MSE

# Generate model predictions for the training data
train_pred = mod_tr.predict(X = xtrain)
# Calculate the squared error of these predictions
train_error = (ytrain - train_pred)**2
# Get the mean squared error
train_mse = train_error.sum() / (len(train_error) - 2)
train_mse

# Test MSE

# Generate model predictions for the test data (we did this above already)
test_pred = mod_tr.predict(X = xtest)
# Calculate the squared error of the test predictions
test_error = (ytest - test_pred)**2
# Get the mean squared error for test data
test_mse = test_error.sum() / (len(test_error) - 2)
test_mse

print("Training MSE: {} \n Test MSE: {}".format(train_mse, test_mse))
Training MSE: 353.1564266980096 
 Test MSE: 451.413560729764

What do these results mean?

Is this a pretty good regression fit?

Parameter estimates

The criteria above are mostly concerned with whether we’re doing a good job predicting our \(y\) values with this model.

Notice this completely ignores the actual parameter estimates of the model!

In many cases, part of what we’re concerned with isn’t just how well we predict our data, but what kind of relationship our model estimates between \(x\) and \(y\).

  • How large or small is the slope?

  • How confident are we in the estimate?

This isn’t data we get directly from the LinearRegression class.

Instead, we can calculate by hand or use another model class like the statsmodels class we saw on Monday.

import statsmodels.formula.api as smf

results = smf.ols('horsepower ~ weight', data = mpg_clean).fit()
results.summary()
OLS Regression Results
Dep. Variable: horsepower R-squared: 0.747
Model: OLS Adj. R-squared: 0.747
Method: Least Squares F-statistic: 1154.
Date: Wed, 25 May 2022 Prob (F-statistic): 1.36e-118
Time: 09:22:42 Log-Likelihood: -1717.0
No. Observations: 392 AIC: 3438.
Df Residuals: 390 BIC: 3446.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -12.1835 3.570 -3.412 0.001 -19.203 -5.164
weight 0.0392 0.001 33.972 0.000 0.037 0.041
Omnibus: 83.255 Durbin-Watson: 1.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 312.937
Skew: 0.892 Prob(JB): 1.11e-68
Kurtosis: 6.997 Cond. No. 1.13e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

What do we learn from the above?

Understanding the full output requires a more extensive statistics discussion than we’re going to have, but there are a few important take-aways:

  1. The coefficient for weight (0.0392) has a small standard error (0.001) and a very narrow 95% confidence interval (0.037, 0.041) based on the observations here. This means we’re pretty certain about this slope estimate.

  2. The \(t\) value for the slope (33.97) is extremely large, meaning we’re very confident that this slope value is significantly different from 0. In other words, there is a significant relationship between weight and horsepower in this data.

This is another way that we can evaluate our regression: the parameter estimates themselves tell us something useful and reasonably precise about the unerlying data.

Regression failures

Above we went through how to interpret and evaluate our regression.

In this case, we were estimating a linear slope in data that had a pretty strong linear relationship.

But what are the ways that this can break down?

Noisy data

# mean = [0, 0]
# cov = [[1, 0], [0, 1]]
# x, y = np.random.multivariate_normal(mean, cov, 100).T


g = sns.scatterplot(data = mpg_clean, x = "acceleration", y = "mpg")
../../_images/Lecture_14-pre_42_0.png
# Can we fit a regression to the data above?

m = LinearRegression().fit(
    X = np.array(mpg_clean['acceleration']).reshape(len(mpg_clean['acceleration']), 1), 
    y = np.array(mpg_clean['mpg'])
)

m.intercept_
m.coef_

m.score(
    X = np.array(mpg_clean['acceleration']).reshape(len(mpg_clean['acceleration']), 1), 
    y = np.array(mpg_clean['mpg'])
) 
# not so great...
0.1792070501562546
sns.lmplot(data = mpg_clean, x = "acceleration", y = "mpg")
<seaborn.axisgrid.FacetGrid at 0x7fda2c841e80>
../../_images/Lecture_14-pre_44_1.png

Non-linear data

# This doesn't seem like a linear relationship...

sns.scatterplot(data = mpg_clean, x = "horsepower", y = "mpg") # hp
<AxesSubplot:xlabel='horsepower', ylabel='mpg'>
../../_images/Lecture_14-pre_46_1.png
# Can we fit a regression to the data above?

m = LinearRegression().fit(
    X = np.array(mpg_clean['horsepower']).reshape(len(mpg_clean['horsepower']), 1), 
    y = np.array(mpg_clean['mpg'])
)

m.intercept_
m.coef_

m.score(
    X = np.array(mpg_clean['horsepower']).reshape(len(mpg_clean['horsepower']), 1), 
    y = np.array(mpg_clean['mpg'])
)
0.6059482578894348
sns.lmplot(data = mpg_clean, x = "horsepower", y = "mpg")
<seaborn.axisgrid.FacetGrid at 0x7fda2d5cd550>
../../_images/Lecture_14-pre_48_1.png

This graphic shows the correlation coefficient between two variables for a range of different data patterns.

The data that doesn’t have a strong correlation is also going to be difficult to fit a linear regression to.

corr_ex

Multiple predictors

mpg_clean

mpg_long = mpg_clean.melt(
    id_vars = ["name", "origin", "model_year", "cylinders", "mpg"]
)
mpg_long

sns.relplot(data = mpg_long,
            x = "value",
            y = "mpg",
            col = "variable",
            kind = "scatter",
            facet_kws={'sharey': False, 'sharex': False}
           )
<seaborn.axisgrid.FacetGrid at 0x7fda2d65e8b0>
../../_images/Lecture_14-pre_51_1.png

[next time] Failures of the linear regression assumptions

  • Ex. non-normal residuals