Lecture 16 (5/2/2022)¶
Announcements
Last time we covered:
Linear regression
Simple linear regression
Evaluating regression
Polynomial and multiple regression
Today’s agenda:
Multiple regression continued
Overfitting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Multiple Regression¶
Big idea: multiple predictors for \(y\)¶
In real-world scenarios, it’s common to explore the relationship between multiple predictor variables that all jointly impact the dependent variable.
For this, we need multiple regression.
Though the guiding principles are similar to simple regression, things can get much more complicated in multiple dimensions. Here, we’ll provide enough of an overview for you to be able to use multiple regression as part of your modeling toolbox.
Example: predicting life expectancy¶
Let’s start with a familiar example.
In Problem Set 3 and elsewhere, we’ve looked at the gapminder dataset, which explores the relationship between GDP, population, and life expectancy at the national level and over time.
Here, we’ll just look at these variables for the most recent year (2007).
gap = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/gapminder.csv")
# Let's keep just some of the variables (note for pset!)
# NOTE: we're also just restricting ourselves to 2007
gap_subset = gap.loc[gap['year'] == 2007, ('country', 'year', 'lifeExp', 'pop', 'gdpPercap')].reset_index(drop = True)
# Add log transformed population and income
gap_subset['logPop'] = np.log10(gap_subset['pop'])
gap_subset['logGdpPercap'] = np.log10(gap_subset['gdpPercap'])
gap_subset
| | country | year | lifeExp | pop | gdpPercap | logPop | logGdpPercap |
|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2007 | 43.828 | 31889923 | 974.580338 | 7.503653 | 2.988818 |
1 | Albania | 2007 | 76.423 | 3600523 | 5937.029526 | 6.556366 | 3.773569 |
2 | Algeria | 2007 | 72.301 | 33333216 | 6223.367465 | 7.522877 | 3.794025 |
3 | Angola | 2007 | 42.731 | 12420476 | 4797.231267 | 7.094138 | 3.680991 |
4 | Argentina | 2007 | 75.320 | 40301927 | 12779.379640 | 7.605326 | 4.106510 |
... | ... | ... | ... | ... | ... | ... | ... |
137 | Vietnam | 2007 | 74.249 | 85262356 | 2441.576404 | 7.930757 | 3.387670 |
138 | West Bank and Gaza | 2007 | 73.422 | 4018332 | 3025.349798 | 6.604046 | 3.480776 |
139 | Yemen, Rep. | 2007 | 62.698 | 22211743 | 2280.769906 | 7.346583 | 3.358081 |
140 | Zambia | 2007 | 42.384 | 11746035 | 1271.211593 | 7.069891 | 3.104218 |
141 | Zimbabwe | 2007 | 43.487 | 12311143 | 469.709298 | 7.090298 | 2.671829 |
142 rows × 7 columns
In Problem Set 3, you generated a graph that explored the relationship between income (GDP per-capita on \(x\)) and life expectancy (in years on \(y\)), alongside two additional predictors: region (color) and population (size).
Here, we want to know formally what the relationship is between the three continuous variables (income, life expectancy, and population). In other words, can we predict life expectancy using both income and population better than we could using only one of those variables?
Let’s start by just examining each of the predictors in isolation to see if this is a plausible hypothesis.
g = sns.lmplot(data = gap_subset,
x = "logGdpPercap", # x1
y = "lifeExp"
)
plt.title("Life expectancy as a function of income")
plt.xlabel("Income (log GDP / capita)")
plt.ylabel("Life expectancy")
plt.show()
h = sns.lmplot(data = gap_subset,
x = "logPop", # x2
y = "lifeExp"
)
plt.title("Life expectancy as a function of population")
plt.xlabel("Population (log)")
plt.ylabel("Life expectancy")
plt.show()
One of these variables (income) has a strong positive relationship with life expectancy. The other (population) seems much less clear.
How can we think about exploring the role that both of them play together in predicting life expectancy?
Multiple regression: overview¶
Multiple regression is like linear regression except that we assume our dependent variable \(y_i\) is jointly predicted by multiple independent variables:
\(x_1\), \(x_2\), …, \(x_n\).
Simple linear regression generates a predicted \(\hat{y}_i\) from \(x_i\):
\(\hat{y}_i = \beta_0 + \beta_1 x_i\)
With multiple regression, we now extend this model to include multiple predictors for our data (\(x_{1i}\), \(x_{2i}\), …, \(x_{ni}\)):
\(\hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ ... \ + \beta_n x_{ni}\)
In most cases, multiple regression once again uses the same Ordinary Least Squares (OLS) parameter estimation as simple regression. However, interpreting the parameter estimates is a little less straightforward.
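As with simple regression, OLS chooses the coefficients that minimize the sum of squared residuals, now over all of the predictors jointly:

\(\displaystyle \min_{\beta_0, \beta_1, \dots, \beta_n} \ \sum_{i} \big( y_i - (\beta_0 + \beta_1 x_{1i} + \dots + \beta_n x_{ni}) \big)^2\)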
How would we interpret \(\beta_0 = 1\), \(\beta_1 = 2\), \(\beta_2 = 3\)?
(Think of a simple example. What variables would you want to use to predict somebody’s height as accurately as possible? Now imagine each of those variables is one of your \(x_{ji}\) variables.)
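One way to read those (purely hypothetical) numbers: the fitted model would be \(\hat{y}_i = 1 + 2 x_{1i} + 3 x_{2i}\), so a one-unit increase in \(x_{1i}\), holding \(x_{2i}\) fixed, raises the prediction by 2; a one-unit increase in \(x_{2i}\), holding \(x_{1i}\) fixed, raises it by 3; and the predicted value is 1 when both predictors are 0. In other words, each slope is interpreted holding the other predictors constant.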
Multiple regression in python¶
The scikit-learn LinearRegression
class is very easy to extend to multiple regression!
We’ll also demonstrate the statsmodels
approach below for more robust statistical analysis.
# scikit learn approach
from sklearn.linear_model import LinearRegression
x_vals = np.array(gap_subset[['logGdpPercap', 'logPop']]) # Note: this double bracket syntax is important!
x_vals = x_vals.reshape(len(gap_subset), 2)
x_vals
y_vals = np.array(gap_subset['lifeExp'])
y_vals
mod = LinearRegression().fit(X = x_vals, y = y_vals)
mod.intercept_
mod.coef_
array([16.6811828 , 1.86827925])
What do the coefficient estimates above suggest?
How should we interpret them?
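One quick way to keep track of which estimate goes with which predictor is to pair the coefficient array with the column names we used (a small sketch reusing the mod fit above; the order matches the columns we passed in):

# Pair each predictor name with its estimated coefficient
# (order matches the columns used to build x_vals above)
for name, coef in zip(['logGdpPercap', 'logPop'], mod.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {mod.intercept_:.3f}")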
# R^2 for our regression
mod.score(X = x_vals, y = y_vals)
0.6649393884784984
# What if we don't include the population predictor?
x_single = np.array(gap_subset['logGdpPercap']).reshape(len(gap_subset), 1)
x_single
mod_simple = LinearRegression().fit(X = x_single, y = y_vals)
mod_simple.score(X = x_single, y = y_vals)
# Doesn't seem like population helps much...
0.6544490472926064
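One caveat when comparing these scores: adding a predictor can never lower the training \(R^2\), so a fairer head-to-head comparison uses adjusted \(R^2\), which penalizes extra predictors. Here's a small sketch computing it by hand for the two fits above (the helper function is ours, not part of scikit-learn):

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of predictors
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = len(gap_subset)
print(adjusted_r2(mod.score(X = x_vals, y = y_vals), n, p = 2))          # income + population
print(adjusted_r2(mod_simple.score(X = x_single, y = y_vals), n, p = 1)) # income only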
The statsmodels approach gives us a bit more insight here.
# statsmodels approach
import statsmodels.formula.api as smf
multiple_reg = smf.ols('lifeExp ~ logGdpPercap + logPop', data = gap_subset).fit()
# View the results
multiple_reg.summary()
# Our population variable does have a significant positive slope,
# but it's pretty small and the effect may be driven by outliers.
| Dep. Variable: | lifeExp | R-squared: | 0.665 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.660 |
| Method: | Least Squares | F-statistic: | 137.9 |
| Date: | Wed, 25 May 2022 | Prob (F-statistic): | 9.91e-34 |
| Time: | 09:23:15 | Log-Likelihood: | -477.07 |
| No. Observations: | 142 | AIC: | 960.1 |
| Df Residuals: | 139 | BIC: | 969.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -8.6161 | 7.538 | -1.143 | 0.255 | -23.520 | 6.288 |
| logGdpPercap | 16.6812 | 1.008 | 16.555 | 0.000 | 14.689 | 18.673 |
| logPop | 1.8683 | 0.896 | 2.086 | 0.039 | 0.098 | 3.639 |
| Omnibus: | 34.155 | Durbin-Watson: | 2.170 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 54.987 |
| Skew: | -1.183 | Prob(JB): | 1.15e-12 |
| Kurtosis: | 4.923 | Cond. No. | 104. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
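If you only need particular numbers from this summary, the fitted results object exposes them directly; for example:

# Pull individual quantities out of the fitted statsmodels results
print(multiple_reg.params)      # coefficient estimates
print(multiple_reg.pvalues)     # p-values for each coefficient
print(multiple_reg.conf_int())  # 95% confidence intervals
print(multiple_reg.rsquared, multiple_reg.rsquared_adj)  # R^2 and adjusted R^2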
Regression: wrap-up¶
Regression is one of the most powerful tools in our data science / analysis toolkit.
However, it’s important to be familiar with the limitations of regression as well.
One class of these limitations is when the data violate the assumptions of linear regression. We’re not going to get into these issues in this class, but it’s good to be aware that not all data (even data that looks linear!) can be accurately described by linear regression. There are a number of tricks for diagnosing this, such as plotting your residuals (we may explore this on the problem set as a way to cover some of this material).
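For example, a residual plot is a quick diagnostic: if a linear model is appropriate, the residuals should scatter fairly evenly around zero with no obvious pattern. A minimal sketch using the multiple regression fit from above:

# Residuals = observed - predicted; look for curvature, funneling, or outliers
predicted = mod.predict(X = x_vals)
residuals = y_vals - predicted
sns.scatterplot(x = predicted, y = residuals, alpha = 0.5)
plt.axhline(0, color = "k", linestyle = "--")
plt.xlabel("Predicted life expectancy")
plt.ylabel("Residual")
plt.show()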
However, an important limitation that we will discuss now is when regression overfits the data. Avoiding overfitting is something that arises in just about any modeling context. It’s easiest to illustrate with regression, but the things we discuss here will be relevant throughout the remainder of the quarter.
Overfitting¶
What is it?¶
The basic principle behind overfitting is that when your modeling approach is sufficiently unconstrained, you can fit patterns in your data that do not generalize.
What does it mean for a model to be unconstrained? What does it mean for the model not to generalize?
…
How does this happen?
There are lots of ways this can arise (and what it looks like differs across modeling techniques).
Let’s look at a canonical example just to get an intuition. For models that we learn about throughout the rest of the quarter, it’s a useful exercise to think about what overfitting might look like.
Example¶
Let’s return to our gapminder
dataset and the task of predicting life expectancy (\(y\)) based on income (\(x\)).
What kind of relationship do we think these variables have in the data below?
gap_subset
g = sns.scatterplot(data = gap_subset,
x = "logGdpPercap", # x1
y = "lifeExp",
color = "r",
alpha = 0.5
)
plt.title("Life expectancy as a function of income")
plt.xlabel("Income (log GDP / capita)")
plt.ylabel("Life expectancy")
plt.show()
Let’s start with a simple linear regression:
# Estimate the model
x_vals = np.array(gap_subset['logGdpPercap']).reshape(len(gap_subset), 1)
y_vals = np.array(gap_subset['lifeExp'])
deg1_fits = LinearRegression().fit(X = x_vals, y = y_vals)
# Add predictions for this model to our dataframe
preds = gap_subset.loc[:, ("logGdpPercap", "lifeExp")].reset_index(drop = True)
simple_preds = deg1_fits.predict(X = x_vals)
preds['deg1_pred'] = simple_preds
preds
| | logGdpPercap | lifeExp | deg1_pred |
|---|---|---|---|
0 | 2.988818 | 43.828 | 54.519342 |
1 | 3.773569 | 76.423 | 67.534497 |
2 | 3.794025 | 72.301 | 67.873765 |
3 | 3.680991 | 42.731 | 65.999076 |
4 | 4.106510 | 75.320 | 73.056337 |
... | ... | ... | ... |
137 | 3.387670 | 74.249 | 61.134339 |
138 | 3.480776 | 73.422 | 62.678496 |
139 | 3.358081 | 62.698 | 60.643606 |
140 | 3.104218 | 42.384 | 56.433262 |
141 | 2.671829 | 43.487 | 49.262068 |
142 rows × 3 columns
# Plot the fit
sns.scatterplot(data = preds,
x = "logGdpPercap",
y = "lifeExp",
color = "m",
alpha = 0.5
)
sns.lineplot(data = preds,
x = "logGdpPercap",
y = "deg1_pred"
)
plt.title("Life expectancy as a function of income")
plt.xlabel("Income (log GDP / capita)")
plt.ylabel("Life expectancy")
plt.show()
How did we do? Maybe the relationship is quadratic or cubic?
from sklearn.preprocessing import PolynomialFeatures
# Fit degree 2 polynomial regression
poly2 = PolynomialFeatures(degree = 2, include_bias = False)
x2_features = poly2.fit_transform(x_vals)
mod2 = LinearRegression().fit(x2_features, y_vals)
# Fit degree 3 polynomial regression
poly3 = PolynomialFeatures(degree = 3, include_bias = False)
x3_features = poly3.fit_transform(x_vals)
mod3 = LinearRegression().fit(x3_features, y_vals)
# Add these to our predictions
preds['deg2_pred'] = mod2.predict(X = x2_features)
preds['deg3_pred'] = mod3.predict(X = x3_features)
preds
# How do these predictions look at first glance? More accurate?
| | logGdpPercap | lifeExp | deg1_pred | deg2_pred | deg3_pred |
|---|---|---|---|---|---|
0 | 2.988818 | 43.828 | 54.519342 | 54.128139 | 54.071604 |
1 | 3.773569 | 76.423 | 67.534497 | 68.392417 | 68.422855 |
2 | 3.794025 | 72.301 | 67.873765 | 68.722745 | 68.758756 |
3 | 3.680991 | 42.731 | 65.999076 | 66.871091 | 66.875117 |
4 | 4.106510 | 75.320 | 73.056337 | 73.506613 | 73.595809 |
... | ... | ... | ... | ... | ... |
137 | 3.387670 | 74.249 | 61.134339 | 61.765814 | 61.694179 |
138 | 3.480776 | 73.422 | 62.678496 | 63.433290 | 63.381871 |
139 | 3.358081 | 62.698 | 60.643606 | 61.226743 | 61.149956 |
140 | 3.104218 | 42.384 | 56.433262 | 56.420367 | 56.336913 |
141 | 2.671829 | 43.487 | 49.262068 | 47.486363 | 47.642830 |
142 rows × 5 columns
Let’s plot these lines and see how they look:
# First, let's make our data tidy for easier plotting
preds_long = preds.melt(
id_vars = ["logGdpPercap", "lifeExp"],
var_name = "degree",
value_name = "prediction"
)
preds_long
# Now, let's take a look
sns.scatterplot(data = preds_long,
x = "logGdpPercap",
y = "lifeExp",
color = "m",
alpha = 0.1
)
sns.lineplot(
data = preds_long,
x = "logGdpPercap",
y = "prediction",
hue = "degree"
)
These lines look pretty similar. Do we see any change in \(R^2\) values for these models?
# Get R^2 for each model
mod1_rsq = deg1_fits.score(X = x_vals, y = y_vals)
mod2_rsq = mod2.score(X = x2_features, y = y_vals)
mod3_rsq = mod3.score(X = x3_features, y = y_vals)
print("Degree 1 R^2: {} \nDegree 2 R^2: {} \nDegree 3 R^2: {}".format(
mod1_rsq, mod2_rsq, mod3_rsq
))
Degree 1 R^2: 0.6544490472926064
Degree 2 R^2: 0.6588616234359694
Degree 3 R^2: 0.6589091468875565
What does this tell us?
Probably don’t need 2nd or 3rd degree polynomials for this relationship.
What happens if we try to fit a much higher-order polynomial function to this data? Similar line or something else?
???
# Fit degree 10 polynomial regression
poly10 = PolynomialFeatures(degree = 10, include_bias = False)
x10_features = poly10.fit_transform(x_vals)
mod10 = LinearRegression().fit(x10_features, y_vals)
Let’s compare this one to our \(R^2\) values above:
# Calculate R^2 for our degree 10 polynomial
mod10_rsq = mod10.score(X = x10_features, y = y_vals)
mod10_rsq
0.6708222009621823
Improvement! Let’s keep going!
# Fit degree 25 polynomial regression
poly25 = PolynomialFeatures(degree = 25, include_bias = False)
x25_features = poly25.fit_transform(x_vals)
mod25 = LinearRegression().fit(x25_features, y_vals)
# Check R^2
mod25_rsq = mod25.score(X = x25_features, y = y_vals)
mod25_rsq
0.6716693321510845
Okay maybe things are slowing down…
Let’s plot our functions and see how they look.
# Add to predictions dataframe and re-format
preds['deg10_pred'] = mod10.predict(X = x10_features)
preds['deg25_pred'] = mod25.predict(X = x25_features)
preds_long = preds.melt(
id_vars = ["logGdpPercap", "lifeExp"],
var_name = "degree",
value_name = "prediction"
)
preds_long
# Now, let's take a look
sns.scatterplot(data = preds_long,
x = "logGdpPercap",
y = "lifeExp",
color = "m",
alpha = 0.1
)
sns.lineplot(
data = preds_long,
x = "logGdpPercap",
y = "prediction",
hue = "degree"
)
Hmmm….
sns.lineplot(
data = preds,
x = "logGdpPercap",
y = "deg10_pred",
color = "r"
)
plt.xlabel("Income")
plt.ylabel("Predicted life expectancy")
plt.show()
sns.lineplot(
data = preds,
x = "logGdpPercap",
y = "deg25_pred",
color = "b"
)
plt.xlabel("Income")
plt.ylabel("Predicted life expectancy")
plt.show()
This doesn’t seem like it’s capturing the underlying process that generated our data…
Aside: overfitting is easiest to demonstrate with polynomial regression because it makes Least Squares Spaghetti. However, this problem can arise in almost any modeling context. Think about what this might look like with multiple regression.
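For a quick illustration of that point: if we tack purely random columns onto a multiple regression as extra "predictors", the training \(R^2\) still creeps upward, even though those columns can't possibly generalize (a sketch; the noise counts are arbitrary):

# Add random noise columns as "predictors" and watch training R^2 creep up
rng = np.random.default_rng(0)
X_real = np.array(gap_subset[['logGdpPercap', 'logPop']])
for n_noise in [0, 5, 20, 50]:
    X_aug = np.hstack([X_real, rng.normal(size = (len(gap_subset), n_noise))])
    r2 = LinearRegression().fit(X_aug, y_vals).score(X_aug, y_vals)
    print(f"{n_noise} noise predictors: training R^2 = {r2:.3f}")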
Overfitting summary¶
The high-degree polynomials above are unconstrained, leading them to fit our data too well.
This raises the concern that they won’t generalize to new data.
But how do we test this?
Cross-validation: avoiding overfitting¶
There are many different techniques for cross-validation, one of which we discussed on Friday.
However, at their heart, all of them involve the same underlying principle: estimate your model parameters with only a subset (or repeated subsets) of your data, then evaluate your model on the remaining data to ensure generalizability.
This technique allows you to ensure that your model is a good fit, or do model comparison to choose among different candidate models (such as our polynomial models above).
Rather than go through all of the different ways of doing this, I’ll walk through a few of the better-known ones. This will give you a feel for how cross-validation works.
Holdout cross-validation: this is the technique we reviewed on Friday (see the code sketch at the end of this section)
Split the data into training and test sets (e.g., a 75-25 split)
Train the model on the training data only
Evaluate the model on the test data
k-fold cross-validation: this is a more robust version of holdout cross-validation
Split the data into k equal-sized buckets (randomly!)
Choose one of the buckets to be the test data and assign the rest to be training data
Perform holdout cross-validation with this split
Then, repeat the above for each of the other buckets and average your test results (e.g., MSE) over the k runs
Monte Carlo cross-validation (repeated random subsampling): similar to k-fold, except each train/test split is drawn independently at random rather than by partitioning the data into fixed folds
Split the data into training and test sets randomly
Train on the training data and evaluate the model on the test data
Repeat this process many times and average the results
Exhaustive cross-validation (various methods): the most common of these are leave-one-out and leave-p-out cross-validation
For every individual value in the data (or every set of \(p\) values), train the model on all of the remaining data, then evaluate the model on that held-out individual (or set of \(p\) values)
Note: this is exhaustive because every data point (or set of \(p\) data points) appears in a test set, but it is often computationally intractable!
Other methods: specific to particular types of data (e.g., time series data, classification data)
For more info on this, you can check out this blog, or the Wikipedia page devoted to this topic, or many others like it online!
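Here is a minimal sketch of the first two approaches using scikit-learn, reusing x_single (income only), x25_features, and y_vals from earlier in the lecture; the random_state and fold count are arbitrary choices:

from sklearn.model_selection import train_test_split, cross_val_score

# Holdout: fit on 75% of the data, evaluate R^2 on the held-out 25%
X_train, X_test, y_train, y_test = train_test_split(
    x_single, y_vals, test_size = 0.25, random_state = 1)
holdout_mod = LinearRegression().fit(X_train, y_train)
print("Holdout test R^2:", holdout_mod.score(X_test, y_test))

# k-fold: repeat the train/test split across k = 5 folds and average the scores
scores = cross_val_score(LinearRegression(), x_single, y_vals, cv = 5)
print("5-fold mean R^2:", scores.mean())

# The same comparison for the degree-25 polynomial features from above:
# don't be surprised if its cross-validated score is far below its training
# R^2 -- that's overfitting showing up on the held-out folds.
poly_scores = cross_val_score(LinearRegression(), x25_features, y_vals, cv = 5)
print("Degree-25 5-fold mean R^2:", poly_scores.mean())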
Cross-Validation Summary¶
The techniques above are used with many if not all of the modeling techniques we’ll discuss this quarter.
From here on out, you should assume that some form of cross-validation is a required component of any model fitting process, much like looking at your data, cleaning it, and other core aspects of analysis.