Ordinary Least Squares (OLS) — a step-by-step derivation

Abstract

Where do the OLS formulas come from? A full derivation of the least-squares estimator in plain language: from intuition, through diagrams, to a step-by-step proof.

The problem and the intuition

You have a cloud of points — say years of experience (horizontal axis) and wage (vertical axis). You want to draw a single straight line that “fits best”.

But what does “best” mean? Every line is wrong somewhere — it passes near the points, not through them. The error for one point is the vertical distance: how far the line missed the actual value.

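Figure: scatter plot of wage $y$ (vertical axis) against experience $x$ (horizontal axis), with the fitted line $\hat{y} = b_0 + b_1 x$ and the residuals $e_i$ drawn as vertical segments.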
Each red segment is an error (residual) — the vertical distance from a point to the line. OLS looks for the line whose sum of squared segments is the smallest.
Intuition
In short
A line is “best” when it is least wrong overall. We take each error, square it (so that pluses and minuses do not cancel, and large errors hurt more), add them up — and look for the line for which that sum is smallest. Hence the name: the method of least squares.

Step 0: Naming things

We have $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. We fit a line:

$$ \hat{y}_i = b_0 + b_1 x_i $$

where $b_0$ is the intercept (where the line crosses the $y$-axis) and $b_1$ is the slope (how much the predicted $y$ changes when $x$ increases by 1).

The residual (error) for point $i$ is the difference between what we observe and what the line predicts:

$$ e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i $$
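Figure: a single point at $x_i$, showing the actual value $y_i$, the predicted value $\hat{y}_i$ on the line, and the residual $e_i$ between them.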
For one point: $y_i$ is the actual value, $\hat{y}_i$ is the value on the line, and the residual $e_i = y_i - \hat{y}_i$ is their difference.
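
As a small illustration, the residuals of a candidate line can be computed directly. The sketch below borrows the five observations from the worked example in Step 4; the candidate coefficients `b0 = 1.0` and `b1 = 1.5` are hypothetical and deliberately not the best choice.

```python
import numpy as np

# Five observations (the same numbers as the worked example in Step 4).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.0, 5.0, 6.0, 7.0])

# A hypothetical candidate line, chosen only for illustration.
b0, b1 = 1.0, 1.5

y_hat = b0 + b1 * x   # what the line predicts at each x_i
e = y - y_hat         # residuals: observed minus predicted

print(e)              # positive where the line is too low, negative where it is too high
```

A positive residual means the point lies above the line, a negative one means it lies below.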

Step 1: The objective function

The sum of squared residuals — denote it $S$:

$$ S(b_0, b_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 $$

This is a function of two unknowns, $b_0$ and $b_1$. We look for the values that make $S$ as small as possible.
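
To make the objective concrete, here is a minimal sketch of $S$ as a Python function, evaluated at a few candidate lines. The helper name `sum_sq_resid` and the candidate coefficients are assumptions made for this illustration; the data are the five observations from the worked example in Step 4.

```python
import numpy as np

def sum_sq_resid(b0, b1, x, y):
    """Sum of squared residuals S(b0, b1) for the line y_hat = b0 + b1 * x."""
    e = y - b0 - b1 * x
    return float(np.sum(e ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.0, 5.0, 6.0, 7.0])

# Three hypothetical candidate lines; a smaller S means a better fit.
for b0, b1 in [(0.0, 1.0), (2.0, 0.5), (2.0, 1.0)]:
    print(b0, b1, sum_sq_resid(b0, b1, x, y))
```

Different pairs $(b_0, b_1)$ give different values of $S$; the derivation that follows finds the pair with the smallest value without trying candidates one by one.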

Intuition
The reason for squaring
If we summed the residuals $e_i$ themselves, the positive and negative ones would cancel, which makes the plain sum a poor measure. We could take absolute values $|e_i|$, but they are “kinked” (not differentiable at zero) and awkward in calculus. The square is smooth, never negative, and penalises large errors more heavily. It also yields clean, closed-form formulas, which we derive next.
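
The cancellation problem is easy to see numerically. In this sketch a flat line at the mean of $y$ (a hypothetical candidate that ignores $x$ entirely) is compared under three measures of error, using the five observations from the worked example in Step 4.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.0, 5.0, 6.0, 7.0])

# A deliberately poor candidate: a horizontal line at the mean of y.
e = y - y.mean()

sum_plain = e.sum()           # positives and negatives cancel
sum_abs = np.abs(e).sum()     # no cancellation, but not differentiable at zero
sum_sq = (e ** 2).sum()       # no cancellation, smooth, large errors weigh more

print(sum_plain, sum_abs, sum_sq)
```

The plain sum is zero even though the line misses four of the five points, which is why it cannot serve as the measure of fit.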

The function $S$ is a paraboloid: a bowl opening upwards in the two unknowns $b_0$ and $b_1$. As long as the $x_i$ are not all equal, it has exactly one minimum, at the bottom of the bowl. And at the bottom the tangent plane is flat, i.e. both partial derivatives equal zero.

Figure: $S(b_1)$ plotted against $b_1$, with the minimum at $\hat{b}_1$, where $dS/db_1 = 0$.
The sum of squared residuals $S$ as a function of the slope $b_1$ is a parabola (a bowl). The minimum is where the tangent is flat — where the derivative = 0.
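
A brute-force sketch of this picture: hold the intercept fixed, evaluate $S$ on a grid of slopes, and pick the lowest point. The fixed value `b0 = 2.0` and the grid itself are assumptions made only for this illustration; the data are the five observations from the worked example in Step 4.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.0, 5.0, 6.0, 7.0])

b0 = 2.0                               # intercept held fixed (assumption for this slice)
b1_grid = np.linspace(0.0, 2.0, 201)   # candidate slopes in steps of 0.01

# S along the slice: one value per candidate slope.
S = np.array([np.sum((y - b0 - b1 * x) ** 2) for b1 in b1_grid])

b1_best = b1_grid[np.argmin(S)]
print(b1_best, S.min())
```

A grid search like this only approximates the minimum to the grid spacing; setting the derivative to zero, as in the next step, gives the exact answer.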

Step 2: The proof — first-order conditions

We find the minimum by taking the partial derivatives of $S$ with respect to $b_0$ and $b_1$ and setting them to zero.

Proof
Deriving the OLS formulas
  1. Derivative with respect to $b_0$. Differentiate $S$ in $b_0$ (chain rule — derivative of the square times the derivative of the inside, which is $-1$): $$ \frac{\partial S}{\partial b_0} = \sum_{i=1}^n 2\left(y_i - b_0 - b_1 x_i\right)(-1) = -2\sum_{i=1}^n \left(y_i - b_0 - b_1 x_i\right) $$
  2. Derivative with respect to $b_1$. The same, but the derivative of the inside in $b_1$ is $-x_i$: $$ \frac{\partial S}{\partial b_1} = -2\sum_{i=1}^n x_i\left(y_i - b_0 - b_1 x_i\right) $$
  3. Set both to zero. Divide by $-2$ and obtain the normal equations: $$ \sum_{i=1}^n \left(y_i - b_0 - b_1 x_i\right) = 0 \qquad\text{and}\qquad \sum_{i=1}^n x_i\left(y_i - b_0 - b_1 x_i\right) = 0 $$
  4. Solve the first for $b_0$. Split the sum and divide by $n$ (recalling $\frac{1}{n}\sum y_i = \bar{y}$ and $\frac{1}{n}\sum x_i = \bar{x}$): $$ \sum y_i = n b_0 + b_1 \sum x_i \;\;\Longrightarrow\;\; \boxed{\,b_0 = \bar{y} - b_1 \bar{x}\,} $$ This says something important: the OLS line always passes through the point of means $(\bar{x}, \bar{y})$.
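  5. Substitute $b_0$ into the second equation. Plugging in $b_0 = \bar{y} - b_1\bar{x}$ gives $\sum x_i\left[(y_i - \bar{y}) - b_1(x_i - \bar{x})\right] = 0$, and rearranging: $$ \sum x_i(y_i - \bar{y}) = b_1 \sum x_i(x_i - \bar{x}) $$
  6. Solve for $b_1$. Using the identities $\sum x_i(y_i-\bar y)=\sum (x_i-\bar x)(y_i-\bar y)$ (because $\sum \bar x (y_i - \bar y)=0$) and $\sum x_i(x_i-\bar x)=\sum (x_i-\bar x)^2$ (because $\sum \bar x (x_i - \bar x)=0$), and assuming the $x_i$ are not all equal so that the denominator is non-zero, we obtain the final formula: $$ \boxed{\,b_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\,} $$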
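
The steps above can be sketched as a short function, followed by a check that both normal equations hold at the solution. The function name `ols_fit` and the five data points are hypothetical, chosen so that the points do not lie exactly on a line.

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form simple-regression OLS: returns (b0, b1)."""
    dx = x - x.mean()
    dy = y - y.mean()
    sxx = np.sum(dx ** 2)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is not identified")
    b1 = np.sum(dx * dy) / sxx        # step 6
    b0 = y.mean() - b1 * x.mean()     # step 4
    return b0, b1

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

b0, b1 = ols_fit(x, y)
e = y - b0 - b1 * x

print(b0, b1)
print(np.sum(e), np.sum(x * e))   # both normal equations: zero up to rounding
```

Both sums vanish up to floating-point rounding, which is exactly what the normal equations in step 3 require.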
Theorem
The OLS estimator

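The slope is the sample covariance of $x$ and $y$ divided by the sample variance of $x$ (the normalising factor, $1/n$ or $1/(n-1)$, cancels as long as both use the same one), and the intercept pins the line to the point of means: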

$$ b_1 = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x} $$
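
One practical caveat when computing the ratio in software is that covariance and variance must use the same normalisation. The sketch below, on hypothetical data, shows that the ratio is the same under $1/n$ and $1/(n-1)$, and that mixing the two (NumPy's defaults for `np.cov` and `np.var` differ) inflates the slope by a factor of $n/(n-1)$.

```python
import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Consistent normalisation: both divide by n, or both by n - 1.
b1_pop = np.cov(x, y, ddof=0)[0, 1] / np.var(x, ddof=0)
b1_sample = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Inconsistent: np.cov defaults to n - 1, np.var defaults to n.
b1_mixed = np.cov(x, y)[0, 1] / np.var(x)

print(b1_pop, b1_sample, b1_mixed)
```

The first two values agree; the third is too large and is not the OLS slope.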

Step 3: Confirm it really is a minimum

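A zero derivative marks a critical point, but is it a valley or a peak? Take the second derivatives:

$$ \frac{\partial^2 S}{\partial b_0^2} = 2n > 0, \qquad \frac{\partial^2 S}{\partial b_1^2} = 2\sum_{i=1}^n x_i^2 > 0, \qquad \frac{\partial^2 S}{\partial b_0\,\partial b_1} = 2\sum_{i=1}^n x_i $$

With two unknowns, positive second derivatives in each direction are not quite enough on their own; the determinant of the matrix of second derivatives must be positive as well. Here it equals $(2n)\left(2\sum x_i^2\right) - \left(2\sum x_i\right)^2 = 4n\sum (x_i - \bar{x})^2$, which is positive whenever the $x_i$ are not all equal. So the function is strictly convex and the critical point really is a minimum, and the only one. A bowl opening upwards, exactly as in the figure above.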
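
A numerical check of convexity in both directions at once: build the matrix of second partial derivatives of $S$ with respect to $(b_0, b_1)$ and confirm that its eigenvalues are positive. The sketch uses the $x$ values from the worked example in Step 4; the variable name `H` is an assumption of this illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)

# Second partial derivatives of S; note that they do not depend on y, b0 or b1.
H = 2.0 * np.array([
    [n,        x.sum()],
    [x.sum(),  np.sum(x ** 2)],
])

eigenvalues = np.linalg.eigvalsh(H)   # H is symmetric, so eigvalsh applies
det = np.linalg.det(H)

print(eigenvalues)                    # both positive: S is strictly convex
print(det, 4 * n * np.sum((x - x.mean()) ** 2))
```

If all $x_i$ were equal, the determinant would be zero and the bowl would flatten into a trough with no unique minimum.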

Step 4: A worked numerical example

Five observations: experience $x$ (years) and wage $y$ (thousands).

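| $x_i$ | $y_i$ | $x_i-\bar{x}$ | $y_i-\bar{y}$ | $(x_i-\bar{x})(y_i-\bar{y})$ | $(x_i-\bar{x})^2$ |
|---|---|---|---|---|---|
| 1 | 3 | −2 | −2 | 4 | 4 |
| 2 | 4 | −1 | −1 | 1 | 1 |
| 3 | 5 | 0 | 0 | 0 | 0 |
| 4 | 6 | 1 | 1 | 1 | 1 |
| 5 | 7 | 2 | 2 | 4 | 4 |
| Σ | | | | 10 | 10 |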

Means: $\bar{x} = 3$, $\bar{y} = 5$. Plugging into the formulas:

$$ b_1 = \frac{10}{10} = 1, \qquad b_0 = 5 - 1\cdot 3 = 2 $$

So $\hat{y} = 2 + 1\cdot x$ — each year of experience adds about 1 thousand on average. A check in R / Python:

# R
x <- c(1, 2, 3, 4, 5)
y <- c(3, 4, 5, 6, 7)
coef(lm(y ~ x))   # (Intercept) 2, x 1

# Python
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 5, 6, 7])
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # 1.0 (both use the 1/n normalisation)
b0 = y.mean() - b1 * x.mean()                    # 2.0

Step 5: The geometric view (bonus)

There is a more elegant way to see OLS. Stack all the $y_i$ into a single vector $\mathbf{y}$ in $n$-dimensional space, and let $\mathbf{X}$ be the $n \times 2$ matrix whose columns are a vector of ones and the vector of the $x_i$. All possible fits $\hat{\mathbf{y}} = b_0\mathbf{1} + b_1\mathbf{x}$ lie on a plane (the space spanned by the columns of $\mathbf{X}$). OLS picks the point on that plane closest to $\mathbf{y}$, that is, the orthogonal projection.

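Figure: the column space of $\mathbf{X}$ drawn as a plane, with the vector $\mathbf{y}$, its projection $\hat{\mathbf{y}}$ onto the plane, and the residual $\mathbf{e}$.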
The geometry of OLS: $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the space of fits. The residual $\mathbf{e}$ is perpendicular to that plane — which is exactly why $\mathbf{X}^\top\mathbf{e}=0$.

The same normal equation $\sum x_i e_i = 0$ that we derived with calculus says, geometrically: the residual is perpendicular to $x$. Likewise, $\sum e_i = 0$ says that the residual is perpendicular to the column of ones. Two languages, one truth.
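
The orthogonality can be checked numerically. This sketch builds a design matrix with a column of ones and a column of $x$, solves the least-squares problem with `np.linalg.lstsq`, and verifies that the residual is perpendicular to both columns and to the fitted vector. The data are hypothetical.

```python
import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Design matrix: a column of ones (intercept) and the column of x.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution; beta holds (b0, b1).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta      # orthogonal projection of y onto the column space of X
e = y - y_hat         # residual vector

print(beta)
print(X.T @ e)        # zero up to rounding: e is perpendicular to each column
print(y_hat @ e)      # zero up to rounding: e is perpendicular to the fit itself
```

The two entries of `X.T @ e` are the two normal equations, $\sum e_i$ and $\sum x_i e_i$, written as one matrix product.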

What to remember

Definition
Summary in one sentence
OLS picks the line by minimising the sum of squared vertical distances. The solution: $b_1 = \dfrac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$, $b_0 = \bar{y} - b_1\bar{x}$ — and the line always passes through $(\bar{x}, \bar{y})$.
  1. The residual is how much the line missed vertically
  2. We minimise the sum of their squares (smooth, penalises large errors)
  3. Derivatives = 0 → normal equations → formulas
  4. Second-order condition holds ($S$ is convex) → the critical point is a minimum
  5. Geometrically → an orthogonal projection, residual ⟂ regressors
