Intuiting the Frisch-Waugh-Lovell (FWL) Theorem

[ regression ]

So after doing some regression self-study, inspired by Chief Ridge Officer Quantymacro, I decided to write a post to consolidate some of my learnings. To give context, these guys are part of QuantTwit, a collection of quants on $\mathbb{X}$. Now if there’s anything I learned from reading $\mathbb{X}$, it’s that quants have tremendous respect for regression.

The nature of the work involves quantifying relationships between noisy, high-dim, collinear, non-stationary data, and regression fits the bill. The FWL theorem lets us spruce up our interpretation of regression in a few ways that are useful.

Disclaimer: I’m just a scrub. Please feel free to correct mistakes.

Motivating the Theorem

It’s worth noting the theorem is less about the computational aspects of regression (no matrix decompositions like the SVD are involved in the derivation). I’ll have a look at ridge regression and the SVD, as per its implementation in statsmodels, in a future post.

The theorem can be motivated by this question: if you add a new predictor to an existing regression, what happens to the coefficients of the predictors already in the model?

Intuitively, you would say “if the new predictor is correlated to other predictors, the coefficients might change”. Or conversely, “if the new predictor is uncorrelated/orthogonal to the current ones, the coefficients remain the same”.

The FWL theorem formalizes this notion. It states the individual beta of a feature in a multiple regression is a partial effect: the effect of that variable after residualising/orthogonalising out the influence of the other predictors. If we have:

\[Y=\beta_1X_1+ \beta_2X_2+\epsilon\]

and want to isolate $\beta_1$, we can first regress $y \sim X_2$ and $X_1 \sim X_2$ and take their residuals, then regress the first set of residuals on the second. FWL tells us the slope of this regression is our isolated beta $\beta_1$:

\[r(y \sim X_2)=\beta_1 r(X_1 \sim X_2)\]

This ‘leave-one-out’, ‘residual-on-residual’ style of regression can be done for each beta. But why residuals?

The residual from $y \sim X_2$, i.e. $y - \hat{y}_{\mid X_2}$, is the part of $y$ that cannot be ‘explained’ by $X_2$. The residual from $X_1 \sim X_2$, i.e. $X_1 - \hat{X}_{1 \mid X_2}$, is the part of the $X_1$ feature that is orthogonal to $X_2$. Regress the first set of residuals on the second, and you have found the ‘pure’ relationship between $X_1$ and $y$ after partialling out the correlations - orthogonalizing $y$ and $X_1$ w.r.t. $X_2$.

So in theory, when assessing a single beta in a multiple regression, one need not worry about its correlation with the other features, because that correlation has been partialled out! In practice, collinearity still makes the individual estimates unstable in noisy, finite samples, which is why ridge and the SVD come into play.

The Proof

I’m going to cite the proof from Gregory Gundersen’s blog in his awesome OLS article.

First, rewrite the OLS equation into partitioned form:
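With the design matrix split into two blocks of columns, $\textbf{X}=\begin{bmatrix}\textbf{X}_1 & \textbf{X}_2\end{bmatrix}$ and $\hat{\beta}=\begin{bmatrix}\hat{\beta}_1 \\ \hat{\beta}_2\end{bmatrix}$, the fitted model is

\[\textbf{y}=\textbf{X}_1\hat{\beta}_1+\textbf{X}_2\hat{\beta}_2+\textbf{e}\]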

Relate it back to the normal equation for betas $(\textbf{X}^T\textbf{X})\hat{\beta}=\textbf{X}^T\textbf{y}$:
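\[\begin{bmatrix}\textbf{X}_1^T \\ \textbf{X}_2^T\end{bmatrix}\begin{bmatrix}\textbf{X}_1 & \textbf{X}_2\end{bmatrix}\begin{bmatrix}\hat{\beta}_1 \\ \hat{\beta}_2\end{bmatrix}=\begin{bmatrix}\textbf{X}_1^T \\ \textbf{X}_2^T\end{bmatrix}\textbf{y}\]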

Expand it out: the first two matrices multiply together like an outer product of blocks, and applying the result to the stacked beta vector gives two equations.
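\[\textbf{X}_1^T\textbf{X}_1\hat{\beta}_1+\textbf{X}_1^T\textbf{X}_2\hat{\beta}_2=\textbf{X}_1^T\textbf{y}\]

\[\textbf{X}_2^T\textbf{X}_1\hat{\beta}_1+\textbf{X}_2^T\textbf{X}_2\hat{\beta}_2=\textbf{X}_2^T\textbf{y}\]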

Solve for $\hat{\beta_1}$:
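\[\hat{\beta}_1=(\textbf{X}_1^T\textbf{X}_1)^{-1}\textbf{X}_1^T(\textbf{y}-\textbf{X}_2\hat{\beta}_2)\]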

Substitute this expression for $\hat{\beta_1}$ into the second equation:
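\[\textbf{X}_2^T\textbf{X}_1(\textbf{X}_1^T\textbf{X}_1)^{-1}\textbf{X}_1^T(\textbf{y}-\textbf{X}_2\hat{\beta}_2)+\textbf{X}_2^T\textbf{X}_2\hat{\beta}_2=\textbf{X}_2^T\textbf{y}\]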

Define the hat and residual maker matrices. The hat matrix orthogonally projects the endogenous vector onto the column space of the design matrix. Hastie, Tibshirani & Friedman illustrate this in §3.2.3 of ESL, which is the first image in this post. The hat and residual maker matrices are orthogonal projections onto complementary subspaces, so $\textbf{H}_1\textbf{M}_1=\textbf{0}$.
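\[\textbf{H}_1=\textbf{X}_1(\textbf{X}_1^T\textbf{X}_1)^{-1}\textbf{X}_1^T, \qquad \textbf{M}_1=\textbf{I}-\textbf{H}_1\]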

Put them into the equation and solve for $\hat{\beta_2}$ in terms of residual-on-residual regression.
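\[\textbf{X}_2^T\textbf{H}_1\textbf{y}-\textbf{X}_2^T\textbf{H}_1\textbf{X}_2\hat{\beta}_2+\textbf{X}_2^T\textbf{X}_2\hat{\beta}_2=\textbf{X}_2^T\textbf{y}\]

\[\textbf{X}_2^T(\textbf{I}-\textbf{H}_1)\textbf{X}_2\hat{\beta}_2=\textbf{X}_2^T(\textbf{I}-\textbf{H}_1)\textbf{y}\]

\[\hat{\beta}_2=(\textbf{X}_2^T\textbf{M}_1\textbf{X}_2)^{-1}\textbf{X}_2^T\textbf{M}_1\textbf{y}\]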

Rearrange the terms:
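\[(\textbf{X}_2^T\textbf{M}_1\textbf{X}_2)\hat{\beta}_2=\textbf{X}_2^T\textbf{M}_1\textbf{y}\]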

This expression is the normal equation for regressing $\textbf{y}$ on $\textbf{X}_2$, but with $\textbf{M}_1$ in front of $\textbf{y}$ and some of the $\textbf{X}_2$ terms. Why not in front of the leading $\textbf{X}_2^T$ as well? Because $\textbf{M}_1$ is an orthogonal projection: it is idempotent ($\textbf{M}_1=\textbf{M}_1^2$) and symmetric ($\textbf{M}_1=\textbf{M}_1^T$), so $\textbf{M}_1^T\textbf{M}_1=\textbf{M}_1$ and this is exactly the normal equation for regressing $\textbf{M}_1\textbf{y}$ on $\textbf{M}_1\textbf{X}_2$.

So the residual-on-residual idea of FWL can be expressed in terms of the residual maker matrix $\textbf{M}_1\textbf{y} = \textbf{M}_1\textbf{X}_2\hat{\beta_2} + \textbf{e}$.

Why Is This Useful

But who cares? Why is this useful? Here are some reasons:

Tree-based ML models (rf, xgb, etc) don’t have this neat interpretation of coefficients! It would be interesting to compare feature importances of a tree-based model like xgboost to the betas from linear regression both fit on the same data.

FWL also clarifies what the intercept is doing (again, this derivation is from Gundersen). In simple regression, the intercept is a difference of means - the mean of the target minus the slope times the mean of the predictor - effectively de-meaning the target and predictor:

\[\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}\]

Why would you want an intercept? This image from Gundersen’s blog shows why:

So you do a parallel shift of your regression line by a constant to account for the difference in means. In the multiple regression case, we set $\textbf{X}_1=\textbf{1}$ (a column of ones), build the corresponding $\textbf{M}_1$, and compute $\hat{\beta}_1$ in the model

\[\textbf{y}= \textbf{1}\beta_1 + \textbf{X}_2\beta_2 +\textbf{e}\]

The hat matrix expands out to $\textbf{H}_1=\textbf{1}(\textbf{1}^T\textbf{1})^{-1}\textbf{1}^T=\frac{1}{N}\textbf{1}\textbf{1}^T$ (note $\textbf{1}^T\textbf{1}=N$), a matrix whose every entry is $1/N$.

Applied to the design matrix, this $\textbf{1}$ hat matrix effectively replaces every entry of a column with that column's mean. To see this, use the inner product view of matrix multiplication: every row of $\textbf{H}_1$ is $\frac{1}{N}\textbf{1}^T$, which averages whatever column it hits.
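In symbols, writing $\bar{\textbf{x}}_2$ for the vector of column means of $\textbf{X}_2$:

\[\textbf{H}_1\textbf{X}_2=\tfrac{1}{N}\textbf{1}\textbf{1}^T\textbf{X}_2=\textbf{1}\bar{\textbf{x}}_2^T\]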

The residual maker $\textbf{M}_1=\textbf{I}-\textbf{H}_1$ therefore demeans both the design matrix and the target vector, so our $\hat{\beta}_2$ is effectively a demean followed by a regression:
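\[\textbf{M}_1\textbf{X}_2=\textbf{X}_2-\textbf{1}\bar{\textbf{x}}_2^T, \qquad \textbf{M}_1\textbf{y}=\textbf{y}-\bar{y}\textbf{1}, \qquad \hat{\beta}_2=\big((\textbf{M}_1\textbf{X}_2)^T\textbf{M}_1\textbf{X}_2\big)^{-1}(\textbf{M}_1\textbf{X}_2)^T\textbf{M}_1\textbf{y}\]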

Now use the equation for $\hat{\beta}_1$ that was derived earlier, with $\textbf{X}_1=\textbf{1}$. The difference-in-means is captured in $\hat{\beta}_1$ because the hat matrix turns each term into its mean, and we see the multiple-regression intercept as a generalization of the simple OLS intercept:
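\[\hat{\beta}_1=(\textbf{1}^T\textbf{1})^{-1}\textbf{1}^T(\textbf{y}-\textbf{X}_2\hat{\beta}_2)=\bar{y}-\bar{\textbf{x}}_2^T\hat{\beta}_2\]

which mirrors $\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}$ from the simple regression case.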

I think it’s worth understanding what the intercept does, especially later in the case of multilevel regression/mixed effects models where you can have random intercepts, random slopes or both - see 0xfdf’s post on using multilevel regression to create a signal.

And in commodities, there might be lots of chances to use mixed effects models (e.g. predicting some quantity across countries or different geographical areas).

The Case Study

Let’s look at a made-up example (purely for illustration) to see the FWL theorem in action. Suppose we want to predict attractiveness from height and face, where all three quantities are z-scores and the true relationship is linear:

\[Attr = 1.0 \cdot Face + 0.4 \cdot Height + \epsilon\]

That is, face contributes about two and a half times as much to attractiveness as height does. However, face is highly correlated with height ($\rho=0.7$). We can simulate this dataset.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(69)

n = 50
rho = 0.7               # correlation between face and height
beta1, beta2 = 1, 0.4   # true coefficients: face, height

# simulate correlated standard-normal features and a noisy linear target
cov = np.array([1, rho, rho, 1]).reshape(2, 2)
mu = np.array([0, 0])

exog = np.random.multivariate_normal(mu, cov, size=n)
endog = beta1 * exog.T[0] + beta2 * exog.T[1] + np.random.normal(0, 0.1, size=n)

df = pd.DataFrame(np.column_stack((exog, endog)), columns=['face', 'height', 'attr'])

# simple (one-feature) regressions
hf = smf.ols('height ~ face', data=df).fit()
fh = smf.ols('face ~ height', data=df).fit()
ah = smf.ols('attr ~ height', data=df).fit()
af = smf.ols('attr ~ face', data=df).fit()

Plotting individual simple regressions of face and height, we see the individual betas are over-estimated because of the correlation between the two features.
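As a quick numeric check using the fits above (a sketch; exact values depend on the seed), the simple-regression slopes should land above the true 1.0 and 0.4, since each one absorbs part of the other feature's effect through the correlation:

print(af.params['face'])    # slope of attr ~ face, inflated above the true 1.0
print(ah.params['height'])  # slope of attr ~ height, inflated above the true 0.4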

Now let’s isolate our height beta. We orthogonalize height on face and attractiveness on face.

We are left with the residuals: the variation in height and attractiveness that is orthogonal to face.

Regressing the orthogonalized (w.r.t. face) residuals against each other - they are already mean zero, since each first-stage regression included an intercept - we recover the height beta from the true relationship ($\pm$ a standard error):
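A minimal sketch of this step, reusing the fits from above (the final two-feature fit is only there to confirm the FWL result matches the multiple-regression beta):

# FWL in action: partial face out of both height and attr,
# then regress residual on residual
resid_height = hf.resid   # part of height orthogonal to face
resid_attr = af.resid     # part of attr orthogonal to face

fwl_df = pd.DataFrame({'r_attr': resid_attr, 'r_height': resid_height})
fwl = smf.ols('r_attr ~ r_height - 1', data=fwl_df).fit()   # no intercept: residuals are mean zero

full = smf.ols('attr ~ face + height', data=df).fit()
print(fwl.params['r_height'], full.params['height'])   # both should be close to the true 0.4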

Further Stuff

That concludes this short article! For further investigation, I’d very much like to know how this works …

But I’ll work on that slowly.

For now, the FWL theorem provides us with essential intuition about key facets of multiple regression without going into any implementation details yet.

I’ll look at ridge regression (from QM’s articles, of course) and how statsmodels implements it under the hood with the SVD - in particular, how it deals with collinearity and how to determine regularization strength.