Ordinary Least Squares (OLS) — a step-by-step derivation
Where do the OLS formulas come from? A full derivation of the least-squares estimator in plain words: from intuition through diagrams to a step-by-step proof
The problem and the intuition
You have a cloud of points — say years of experience (horizontal axis) and wage (vertical axis). You want to draw a single straight line that “fits best”.
But what does “best” mean? Every line is wrong somewhere — it passes near the points, not through them. The error for one point is the vertical distance: how far the line missed the actual value.
Step 0: Naming things
We have $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. We fit a line:
$$ \hat{y}_i = b_0 + b_1 x_i $$
where $b_0$ is the intercept (where the line crosses the $y$-axis) and $b_1$ is the slope (how much $y$ rises when $x$ increases by 1).
The residual (error) for point $i$ is the difference between what we observe and what the line predicts:
$$ e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i $$
Step 1: The objective function
Simply adding up the residuals would let positive and negative errors cancel, so we square each one first. Denote the sum of squared residuals by $S$:
$$ S(b_0, b_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 $$
This is a function of two unknowns, $b_0$ and $b_1$. We look for the values that make $S$ as small as possible.
The function $S$ is a paraboloid, a bowl opening upwards. As long as the $x_i$ are not all equal, it has exactly one minimum, at the bottom of the bowl. At the bottom the tangent plane is flat, i.e. both partial derivatives equal zero.
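As a minimal sketch of the objective, the function below evaluates $S(b_0, b_1)$ for a candidate line. The name `ssr` is our own choice, the data are the five observations from the worked example in Step 4, and the three candidate lines are arbitrary picks for illustration.

```python
import numpy as np

# Five observations from the worked example in Step 4
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 5, 6, 7], dtype=float)

def ssr(b0, b1, x, y):
    """Sum of squared residuals S(b0, b1) for the line y_hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# Arbitrary candidate lines (illustrative picks, not part of the derivation)
s_flat = ssr(5.0, 0.0, x, y)   # horizontal line through the mean of y
s_steep = ssr(0.0, 2.0, x, y)  # line through the origin with slope 2
s_exact = ssr(2.0, 1.0, x, y)  # the line these five points lie on

print(s_flat, s_steep, s_exact)  # 10.0 15.0 0.0
```

Different lines give different values of $S$; the derivation that follows finds the pair $(b_0, b_1)$ with the smallest value without having to try candidates one by one.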
Step 2: The proof — first-order conditions
We find the minimum by taking the partial derivatives of $S$ with respect to $b_0$ and $b_1$ and setting them to zero.
- Derivative with respect to $b_0$. Differentiate $S$ in $b_0$ (chain rule — derivative of the square times the derivative of the inside, which is $-1$): $$ \frac{\partial S}{\partial b_0} = \sum_{i=1}^n 2\left(y_i - b_0 - b_1 x_i\right)(-1) = -2\sum_{i=1}^n \left(y_i - b_0 - b_1 x_i\right) $$
- Derivative with respect to $b_1$. The same, but the derivative of the inside in $b_1$ is $-x_i$: $$ \frac{\partial S}{\partial b_1} = -2\sum_{i=1}^n x_i\left(y_i - b_0 - b_1 x_i\right) $$
- Set both to zero. Divide by $-2$ and obtain the normal equations: $$ \sum_{i=1}^n \left(y_i - b_0 - b_1 x_i\right) = 0 \qquad\text{and}\qquad \sum_{i=1}^n x_i\left(y_i - b_0 - b_1 x_i\right) = 0 $$
- Solve the first for $b_0$. Split the sum and divide by $n$ (recalling $\frac{1}{n}\sum y_i = \bar{y}$ and $\frac{1}{n}\sum x_i = \bar{x}$): $$ \sum y_i = n b_0 + b_1 \sum x_i \;\;\Longrightarrow\;\; \boxed{\,b_0 = \bar{y} - b_1 \bar{x}\,} $$ This says something important: the OLS line always passes through the point of means $(\bar{x}, \bar{y})$.
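- Substitute $b_0$ into the second equation. Plugging in $b_0 = \bar{y} - b_1\bar{x}$ turns the bracket into $(y_i - \bar{y}) - b_1(x_i - \bar{x})$, so the equation reads $\sum x_i\left[(y_i - \bar{y}) - b_1(x_i - \bar{x})\right] = 0$. Rearranging: $$ \sum x_i(y_i - \bar{y}) = b_1 \sum x_i(x_i - \bar{x}) $$
- Solve for $b_1$. Using the identity $\sum x_i(y_i-\bar y)=\sum (x_i-\bar x)(y_i-\bar y)$ (because $\sum \bar x (y_i - \bar y)=0$) and, by the same argument, $\sum x_i(x_i-\bar x)=\sum (x_i-\bar x)^2$, we obtain the final formula, valid whenever the $x_i$ are not all equal: $$ \boxed{\,b_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\,} $$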
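The steps above can be checked numerically. The sketch below codes the two boxed formulas and the two partial derivatives, then confirms that the derivatives vanish at the computed coefficients. The function names `ols_fit` and `gradient` and the data are hypothetical, made up for this illustration.

```python
import numpy as np

def ols_fit(x, y):
    """Return (b0, b1) from the two boxed closed-form formulas."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

def gradient(b0, b1, x, y):
    """Partial derivatives of S with respect to b0 and b1."""
    e = y - b0 - b1 * x
    return -2.0 * np.sum(e), -2.0 * np.sum(x * e)

# Hypothetical data, made up for this illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b0, b1 = ols_fit(x, y)
dS_db0, dS_db1 = gradient(b0, b1, x, y)

print(b0, b1)                            # intercept and slope
print(dS_db0, dS_db1)                    # both zero up to rounding error
print(b0 + b1 * x.mean() - y.mean())     # zero: the line passes through the means
```

Unlike the perfectly linear example in Step 4, these points do not lie on a line, so the residuals are nonzero even though both first-order conditions hold.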
The slope is the covariance divided by the variance of $x$, and the intercept pins the line to the point of means:
$$ b_1 = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x} $$
Step 3: Confirm it really is a minimum
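A zero derivative marks a critical point, but is it a valley or a peak? Take the second derivatives:
$$ \frac{\partial^2 S}{\partial b_0^2} = 2n > 0, \qquad \frac{\partial^2 S}{\partial b_1^2} = 2\sum_{i=1}^n x_i^2 > 0, \qquad \frac{\partial^2 S}{\partial b_0\,\partial b_1} = 2\sum_{i=1}^n x_i $$
Both pure second derivatives are positive, and the determinant of the matrix of second derivatives is $4\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right] = 4n\sum (x_i-\bar{x})^2 > 0$ whenever the $x_i$ are not all equal. So the function is strictly convex and this really is a minimum. A bowl opening upwards, exactly as in the figure above.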
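A numerical version of this check can be sketched as follows. Differentiating the Step 2 derivatives once more gives the matrix of second derivatives (the Hessian) of $S$, which depends only on the $x_i$. The $x$ values below are hypothetical, and the variable names are our own.

```python
import numpy as np

# Hypothetical x values, made up for this illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)

# Second derivatives of S, from differentiating the Step 2 derivatives again
hessian = 2.0 * np.array([
    [n,       x.sum()],         # d2S/db0^2,    d2S/db0 db1
    [x.sum(), np.sum(x ** 2)],  # d2S/db1 db0,  d2S/db1^2
])

eigenvalues = np.linalg.eigvalsh(hessian)  # symmetric matrix, ascending order
determinant = np.linalg.det(hessian)

print(eigenvalues)                                        # both positive
print(determinant, 4 * n * np.sum((x - x.mean()) ** 2))   # the two agree
```

Both eigenvalues being positive means the bowl curves upwards in every direction, not only along the $b_1$ axis. If all $x_i$ were equal, the determinant would be zero and the minimum would no longer be unique.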
Step 4: A worked numerical example
Five observations: experience $x$ (years) and wage $y$ (thousands).
| $x_i$ | $y_i$ | $x_i-\bar{x}$ | $y_i-\bar{y}$ | $(x_i-\bar{x})(y_i-\bar{y})$ | $(x_i-\bar{x})^2$ |
|---|---|---|---|---|---|
| 1 | 3 | −2 | −2 | 4 | 4 |
| 2 | 4 | −1 | −1 | 1 | 1 |
| 3 | 5 | 0 | 0 | 0 | 0 |
| 4 | 6 | 1 | 1 | 1 | 1 |
| 5 | 7 | 2 | 2 | 4 | 4 |
| Σ | | | | 10 | 10 |
Means: $\bar{x} = 3$, $\bar{y} = 5$. Plugging into the formulas:
$$ b_1 = \frac{10}{10} = 1, \qquad b_0 = 5 - 1\cdot 3 = 2 $$
So $\hat{y} = 2 + 1\cdot x$: each year of experience adds about 1 thousand on average. A check in R / Python:
```r
x <- c(1,2,3,4,5); y <- c(3,4,5,6,7)
coef(lm(y ~ x)) # (Intercept) 2, x 1
```
```python
import numpy as np
x = np.array([1,2,3,4,5]); y = np.array([3,4,5,6,7])
# population covariance over population variance (the 1/n factors cancel)
b1 = np.cov(x, y, bias=True)[0,1] / np.var(x) # 1.0
b0 = y.mean() - b1 * x.mean() # 2.0
```
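As an independent cross-check, the sketch below computes the coefficients directly from the sums in the Step 2 formulas and compares them with NumPy's degree-1 polynomial fit. Using `np.polyfit` here is our own choice of comparison, not something the derivation relies on.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 5, 6, 7], dtype=float)

# Closed-form formulas from Step 2, written out as sums of deviations
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Degree-1 least-squares fit; coefficients come highest power first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print(b0, b1)            # intercept and slope from the formulas
print(intercept, slope)  # should match up to rounding error
```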
Step 5: The geometric view (bonus)
There is a more elegant way to see OLS. Stack all the $y_i$ into a single vector $\mathbf{y}$ in $n$-dimensional space. All possible fits $\hat{\mathbf{y}}$ lie on a plane: the space spanned by the columns of $\mathbf{X}$, here a column of ones and the column of the $x_i$. OLS picks the point on that plane closest to $\mathbf{y}$, that is, the orthogonal projection of $\mathbf{y}$ onto it.
The same normal equation $\sum x_i e_i = 0$ that we derived with calculus says, geometrically: the residual is perpendicular to $x$. Likewise, $\sum e_i = 0$ says the residual is perpendicular to the column of ones. Two languages, one truth.
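The projection view can be sketched in a few lines. The code below builds the matrix $\mathbf{X}$ from a column of ones and the column of $x_i$, solves the normal equations in their matrix form $\mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{y}$ (a restatement of the two equations from Step 2), and checks the orthogonality. The data are hypothetical, made up for this illustration.

```python
import numpy as np

# Hypothetical data, made up for this illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Design matrix X: a column of ones (intercept) and the column x (slope)
X = np.column_stack([np.ones_like(x), x])

# Normal equations in matrix form: (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ b   # projection of y onto the space spanned by the columns of X
e = y - y_hat   # residual vector

print(b)          # [b0, b1]
print(X.T @ e)    # approximately [0, 0]: e is perpendicular to both columns
print(y_hat @ e)  # approximately 0: the fit and the residual are orthogonal
```

The two entries of `X.T @ e` are exactly $\sum e_i$ and $\sum x_i e_i$, so this is the same pair of normal equations written as one matrix product.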
What to remember
- The residual is how much the line missed vertically
- We minimise the sum of their squares (smooth, penalises large errors)
- Derivatives = 0 → normal equations → formulas
- Second derivatives positive, $S$ convex → it is certainly a minimum
- Geometrically → an orthogonal projection, residual ⟂ regressors