The Method of Least Squares — A Complete Treatise: History, Theory and Proofs

Abstract

A comprehensive account of the method of least squares: from the eighteenth-century problem of the figure of the Earth and its precursors (Cotes, Mayer, Bošković), through Legendre's first publication (1805), Gauss's original derivation (1809) and Laplace's probabilistic foundation, to the Gauss-Markov theorem (1823) and the birth of regression — with dates, names, original-language texts, full proofs and portraits.

The method of least squares was not born as an exercise in algebra. It grew out of the most demanding computational problems of the eighteenth and nineteenth centuries — determining the figure of the Earth and the orbits of celestial bodies from observations corrupted by error. This article traces that entire path: from the precursors, through the priority dispute between Legendre and Gauss, to Gauss’s two distinct justifications and the birth of modern regression — with dates, names, original-language texts and full proofs.

Part I. The problem: inconsistent observations (eighteenth century)

In the eighteenth century, astronomy and geodesy posed a difficulty of a new kind. To determine a few unknown quantities — say, the parameters of an orbit, or the flattening of the Earth — practitioners made many measurements, more than there were unknowns. Each measurement was slightly in error, so the resulting system of equations was overdetermined and inconsistent: no single solution satisfied all observations simultaneously.

Intuicja
The heart of the problem
When there are as many equations as unknowns, one solves the system. When there are more observations than unknowns, and each is slightly in error, no solution fits them all at once. One must decide which “near-solution” is best. That question gave rise to the method of least squares.

The figure of the Earth

Newton predicted that the Earth is oblate — flattened at the poles. To settle the matter, the French Academy of Sciences dispatched two expeditions to measure the length of a meridian arc: to Lapland (Maupertuis, 1736–37) and to Peru (Bouguer and La Condamine, 1735–44). From many geodetic measurements, two parameters of the ellipsoid had to be recovered — a classic overdetermined system.

Precursors: the first ways of combining observations

Portrait of Ruđer Bošković
Ruđer Bošković (1711–1787) minimised the sum of the absolute values of the deviations. Wikimedia Commons, public domain
  • Roger Cotes (1682–1716) proposed a weighted mean of observations (published posthumously in Harmonia mensurarum, 1722) — the idea of giving greater weight to more precise measurements.
  • Tobias Mayer (1723–1762), studying the libration of the Moon (c. 1750), used a “method of averages”: he grouped the equations and summed them, reducing the overdetermined system to a solvable one.
  • Ruđer Bošković (1711–1787), measuring a meridian arc near Rome with Christopher Maire (De litteraria expeditione…, 1755), proposed the criterion closest in spirit to least squares: minimise the sum of absolute deviations subject to the constraint that the signed deviations sum to zero — an $L_1$ regression, to which statistics returned only in the twentieth century.
  • Leonhard Euler and Pierre-Simon Laplace tried other “methods of combination,” but none was both general and simple.

What was missing was a single, universal principle. It appeared at the start of the nineteenth century — and almost simultaneously in two men.

Part II. Legendre: the first publication (1805)

Boilly's caricature of Legendre and Fourier
Adrien-Marie Legendre (1752–1833), at left, beside Fourier. The only authentic likeness of Legendre is a watercolour caricature by Boilly (1820). J.-L. Boilly, 1820 — Wikimedia Commons, public domain

The first publication of the method was an appendix to Adrien-Marie Legendre’s Nouvelles méthodes pour la détermination des orbites des comètes (1805), entitled Sur la méthode des moindres carrés. It was Legendre who named the method — méthode des moindres carrés — and stated it with disarming clarity:

Przykład
Legendre, 1805 — original and translation

“De tous les principes qu’on peut proposer pour cet objet, je pense qu’il n’en est pas de plus général, de plus exact, ni d’une application plus facile, que celui dont nous avons fait usage dans les recherches précédentes, et qui consiste à rendre minimum la somme des carrés des erreurs.”

“Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or more easy to apply, than that which we have used in the preceding research, and which consists in rendering the sum of the squares of the errors a minimum.”

Legendre gave the method as a computational recipe — effective and elegant, but without justification of why the squares of the errors, rather than some other power. He left that question open. Gauss answered it.

Part III. Gauss: genesis and the first justification (1809)

Portrait of Carl Friedrich Gauss
Carl Friedrich Gauss (1777–1855), portrait by Christian Albrecht Jensen (1840) — the Prince of Mathematicians. Wikimedia Commons, public domain

The missing planet, Ceres (1801)

On 1 January 1801, Giuseppe Piazzi of Palermo observed a faint moving object — the dwarf planet Ceres. He tracked it for 41 nights, after which Ceres approached the Sun and vanished in its glare. Astronomers held a handful of discordant measurements and a single question: where should one look for the planet when it re-emerged from behind the Sun?

Ceres photographed by the Dawn spacecraft
Ceres — the largest object in the asteroid belt, imaged by the Dawn spacecraft (2015). Its disappearance in 1801 spurred Gauss to refine the computation of orbits. NASA / JPL-Caltech / UCLA / MPS / DLR / IDA — public domain

Gauss, then twenty-four, undertook the task. Using his own methods — including the minimisation of the sum of squared errors for the redundant observations — he computed the orbit and indicated where to look. In December 1801, Franz von Zach and Heinrich Olbers recovered Ceres close to Gauss’s prediction. The success made him famous across Europe.

Theoria Motus (1809) and the priority dispute

In Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (1809), Gauss presented the method together with a deep justification, and stated that he had used it since 1795 — long before Legendre’s publication. This ignited one of the most famous priority disputes in the history of science. In Davis’s standard 1857 English translation, Gauss writes:

Przykład
Gauss, 1809 (trans. Davis, 1857)
“Our principle, which we have made use of since the year 1795, has lately been published by Legendre in the work Nouvelles méthodes pour la détermination des orbites des comètes, Paris, 1806, where several other properties of this principle have been explained, which, for the sake of brevity, we here omit.”

The modern assessment is usually balanced:

  • Legendre was first to publish the method and to name it;
  • Gauss was first to use it (though without publishing) and — more importantly — first to explain why the squares are the right choice.
Uwaga
Independent discovery
At about the same time (1808), the American mathematician Robert Adrain (1775–1843) independently derived the law of error and the method of least squares. This is a frequent pattern in science — a multiple, simultaneous discovery of an idea whose time had come.

Gauss’s original reasoning

Gauss asked the question Legendre had not: why the sum of squares? He took as his starting point a principle that astronomers already applied:

Definicja
The postulate of the arithmetic mean
If the same quantity is measured repeatedly with independent errors, then the most probable value of that quantity is the arithmetic mean of the measurements.

His genius lay in reversing the question: instead of “what is the best value for a given law of error?”, Gauss asked — which law of error $\varphi(\varepsilon)$ makes the arithmetic mean the best value? The answer forces the shape of the error distribution, and that shape turns out to be the normal distribution.

Postulatemean = best valueLaw of errornormal distributionMethodleast squares
Gauss’s reverse reasoning. He does not assume the normal distribution — he derives it from the universally accepted postulate of the mean, and from it least squares follows.
Proof
The postulate of the mean implies the normal law (Gauss, 1809)
  1. Setup. Let $M_1,\dots,M_n$ be $n$ independent measurements of a quantity $\mu$; the error $\varepsilon_i = M_i - \mu$ has an unknown, symmetric density $\varphi(\varepsilon)$. The likelihood is $$ \Omega(\mu) = \varphi(M_1-\mu)\cdots\varphi(M_n-\mu). $$
  2. First-order condition. From $\frac{d}{d\mu}\log\Omega = 0$, writing $f := (\log\varphi)' = \varphi'/\varphi$: $$ \sum_{i=1}^n f(M_i-\hat\mu) = 0. $$
  3. Impose the postulate. Require $\hat\mu = \bar M$ for every sample. Then $\sum_i(M_i-\bar M)=0$, so the condition reads $\sum_i f(\varepsilon_i)=0$ whenever $\sum_i \varepsilon_i = 0$.
  4. This forces linearity. For the configuration $\varepsilon_1=(n-1)t,\ \varepsilon_2=\dots=\varepsilon_n=-t$ (which sums to zero) and odd $f$, one obtains $f((n-1)t)=(n-1)f(t)$ for all $t,n$ — hence $f(\varepsilon)=k\varepsilon$.
  5. Integrate. $(\log\varphi)'=k\varepsilon \Rightarrow \varphi(\varepsilon)=C\,e^{k\varepsilon^2/2}$.
  6. Density condition. For $\varphi$ to be an integrable density, $k<0$; writing $k=-2h^2$: $$ \boxed{\;\varphi(\varepsilon) = \frac{h}{\sqrt{\pi}}\,e^{-h^2\varepsilon^2}\;} $$ This is the normal distribution; $h$ is Gauss's *measure of precision* ($\textit{Genauigkeit}$). Hence its name — the Gaussian distribution.

The final step is immediate: since $\Omega \propto \exp(-h^2\sum\varepsilon_i^2)$, maximising the likelihood is equivalent to minimising the sum of squares $\sum\varepsilon_i^2$. This is least squares as maximum-likelihood estimation.

Part IV. Laplace: the probabilistic foundation

Portrait of Pierre-Simon Laplace
Pierre-Simon de Laplace (1749–1827). His central limit theorem explained why errors are often normal. P. Guérin — Wikimedia Commons, public domain

Gauss’s 1809 argument rested on the postulate of the mean — elegant, but arbitrary. An independent foundation came from Pierre-Simon Laplace. In 1810 he proved the central limit theorem: the sum of many small, independent influences is approximately normal.

This explained why a measurement error — being the sum of many tiny, independent disturbances — is in practice Gaussian. Laplace connected this with the method of least squares (Théorie analytique des probabilités, 1812), giving it a large-sample justification independent of Gauss’s postulate. This convergence of two routes is sometimes called the Gauss-Laplace synthesis; it secured the place of the normal distribution and of least squares throughout science. See the article on the central limit theorem.

Part V. Gauss again: the Gauss-Markov theorem (1823)

Portrait of Andrey Markov
Andrey Markov (1856–1922). The theorem bears his name, although Gauss already had the result in 1823. Wikimedia Commons, public domain

Gauss himself judged the 1809 argument unsatisfactory (it depended on the assumption of normality). In Theoria combinationis observationum erroribus minimis obnoxiae (1821–1823) he gave a wholly new justification requiring no particular law of error.

Twierdzenie
The Gauss-Markov theorem
If the errors have zero expectation, constant variance and are uncorrelated (with no assumption of normality), then the least-squares estimator has the smallest variance among all linear unbiased estimators. In short, least squares is the Best Linear Unbiased Estimator (BLUE).

This was a more mature and more powerful argument. Gauss moved from “the mean is best, hence errors are normal, hence squares” to a purely variance-based argument: among reasonable (linear, unbiased) rules, least squares yields the least-dispersed estimates — irrespective of the shape of the error distribution. It is this result that makes least squares the foundation of classical econometrics, in which normality is seldom assumed yet least squares is trusted nonetheless.

Part VI. From measurement error to regression (nineteenth–twentieth centuries)

For most of the nineteenth century, least squares served astronomy and geodesy — estimating physical constants. The decisive turn towards a statistics of relationships came near the century’s end:

  • Francis Galton (1822–1911), studying the heights of parents and children, observed “regression towards the mean” (1885) and coined the term regression. He moved the method from “errors” to relationships between variables.
  • Karl Pearson (1857–1936) formalised the correlation coefficient (1896) and developed the regression apparatus.
  • George Udny Yule (1871–1951) showed (1897) that multiple regression is simply least squares applied to several explanatory variables.
  • Ronald A. Fisher (1890–1962) unified the theory: maximum likelihood, the analysis of variance, the properties of estimators. In his framework least squares is the maximum-likelihood estimator under normal errors — closing the circle Gauss had opened.

Thus a tool of astronomers became the foundation of econometrics: with the very calculus by which Gauss caught a planet, we now estimate the returns to education or the elasticity of demand.

Part VII. Matrix formulation of the model

Modern econometrics writes the linear model in matrix form, which renders the entire calculus compact and general. With $n$ observations and $k$ parameters,

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, $$

where $\mathbf{y}$ is the $n\times 1$ vector of the dependent variable, $\mathbf{X}$ the $n\times k$ matrix of regressors (the first column usually ones, for the intercept), $\boldsymbol{\beta}$ the $k\times 1$ vector of unknown parameters, and $\boldsymbol{\varepsilon}$ the $n\times 1$ vector of disturbances.

Definicja
Assumptions of the classical model
  • A1. Linearity: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.
  • A2. Full rank: $\operatorname{rank}(\mathbf{X}) = k$ (no perfect collinearity; $\mathbf{X}^\top\mathbf{X}$ invertible).
  • A3. Exogeneity: $\mathbb{E}[\boldsymbol{\varepsilon}\mid\mathbf{X}] = \mathbf{0}$.
  • A4. Sphericity: $\mathrm{Var}(\boldsymbol{\varepsilon}\mid\mathbf{X}) = \sigma^2\mathbf{I}_n$ (homoskedasticity and no autocorrelation).
  • A5. Normality (optional, for exact finite-sample inference): $\boldsymbol{\varepsilon}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_n)$.

Assumptions A1–A4 are exactly the conditions of the Gauss-Markov theorem. Normality (A5) is not required for the optimality of least squares — it serves only for exact $t$ and $F$ tests in small samples.

Part VIII. Derivation of the estimator (matrix form)

We minimise the residual sum of squares written as an inner product:

$$ S(\boldsymbol{\beta}) = (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}. $$
Proof
The normal equations and the estimator
  1. Gradient. Differentiating with respect to $\boldsymbol{\beta}$ (matrix calculus: $\partial(\mathbf{a}^\top\boldsymbol\beta)/\partial\boldsymbol\beta=\mathbf{a}$, $\partial(\boldsymbol\beta^\top\mathbf{A}\boldsymbol\beta)/\partial\boldsymbol\beta=2\mathbf{A}\boldsymbol\beta$): $$ \frac{\partial S}{\partial\boldsymbol\beta} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol\beta. $$
  2. First-order condition. Setting this to zero gives the normal equations: $$ \mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol\beta} = \mathbf{X}^\top\mathbf{y}. $$
  3. Solution. By A2, $\mathbf{X}^\top\mathbf{X}$ is invertible, so $$ \boxed{\;\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\;} $$
  4. It is a minimum. The Hessian $2\mathbf{X}^\top\mathbf{X}$ is positive definite (full rank), so the critical point is the global minimum.

Part IX. Geometry: the hat matrix and orthogonal projection

The fitted values are $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta} = \mathbf{H}\mathbf{y}$, with $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ the hat matrix — the orthogonal projection onto the column space of $\mathbf{X}$. The residuals are $\mathbf{e} = \mathbf{M}\mathbf{y}$ with $\mathbf{M} = \mathbf{I}-\mathbf{H}$, the annihilator matrix.

Twierdzenie
Properties of H and M

Both are symmetric and idempotent ($\mathbf{H}^2=\mathbf{H}$, $\mathbf{M}^2=\mathbf{M}$), and

$$ \mathbf{H}\mathbf{X}=\mathbf{X},\quad \mathbf{M}\mathbf{X}=\mathbf{0},\quad \operatorname{tr}(\mathbf{H})=k,\quad \operatorname{tr}(\mathbf{M})=n-k. $$

Hence the orthogonality of residuals to the regressors: $\mathbf{X}^\top\mathbf{e}=\mathbf{0}$ — the same normal equation, now read geometrically. The residual vector is perpendicular to the column space of $\mathbf{X}$, and $\hat{\mathbf{y}}$ is the point of that space closest to $\mathbf{y}$.

col(X)Oŷ = Hyye = y − ŷ
The geometry of least squares: the orthogonal projection of the observation vector y onto the column space col(X). The fit ŷ=Hy lies in that space, while the residual e=y−ŷ is perpendicular to it — the right angle at the foot is the geometric meaning of the normal equations: the residuals are orthogonal to the regressors.

Part X. Properties of the estimator

Substituting $\mathbf{y}=\mathbf{X}\boldsymbol\beta+\boldsymbol\varepsilon$ yields the key identity $\hat{\boldsymbol\beta} = \boldsymbol\beta + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol\varepsilon$, from which:

  • Unbiasedness: $\mathbb{E}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = \boldsymbol\beta$ (by A3).
  • Covariance: $\mathrm{Var}(\hat{\boldsymbol\beta}\mid\mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ (by A4).
  • Estimator of $\sigma^2$: $s^2 = \mathbf{e}^\top\mathbf{e}/(n-k)$ is unbiased ($k$ degrees of freedom are absorbed by the estimated parameters).
Proof
Unbiasedness of s²: why we divide by n−k
  1. Residuals as a function of the errors. Since $\mathbf{M}\mathbf{X}=\mathbf{0}$, substituting $\mathbf{y}=\mathbf{X}\boldsymbol\beta+\boldsymbol\varepsilon$ gives $\mathbf{e}=\mathbf{M}\mathbf{y}=\mathbf{M}\mathbf{X}\boldsymbol\beta+\mathbf{M}\boldsymbol\varepsilon=\mathbf{M}\boldsymbol\varepsilon$. The residuals do not depend on $\boldsymbol\beta$ — they are the filtered errors.
  2. Residual sum of squares. By the symmetry and idempotence of $\mathbf{M}$ ($\mathbf{M}^\top\mathbf{M}=\mathbf{M}^2=\mathbf{M}$): $$ \mathbf{e}^\top\mathbf{e}=\boldsymbol\varepsilon^\top\mathbf{M}^\top\mathbf{M}\boldsymbol\varepsilon=\boldsymbol\varepsilon^\top\mathbf{M}\boldsymbol\varepsilon. $$
  3. The trace trick. $\boldsymbol\varepsilon^\top\mathbf{M}\boldsymbol\varepsilon$ is a scalar, hence equal to its own trace; by the cyclic invariance of the trace, $\operatorname{tr}(\mathbf{M}\boldsymbol\varepsilon\boldsymbol\varepsilon^\top)$: $$ \mathbb{E}[\mathbf{e}^\top\mathbf{e}\mid\mathbf{X}]=\mathbb{E}\big[\operatorname{tr}(\mathbf{M}\boldsymbol\varepsilon\boldsymbol\varepsilon^\top)\big]=\operatorname{tr}\!\big(\mathbf{M}\,\mathbb{E}[\boldsymbol\varepsilon\boldsymbol\varepsilon^\top]\big)=\operatorname{tr}(\mathbf{M}\,\sigma^2\mathbf{I})=\sigma^2\operatorname{tr}(\mathbf{M}). $$
  4. Result. Since $\operatorname{tr}(\mathbf{M})=n-k$, we have $\mathbb{E}[\mathbf{e}^\top\mathbf{e}\mid\mathbf{X}]=\sigma^2(n-k)$, whence $$ \mathbb{E}[s^2\mid\mathbf{X}]=\frac{\mathbb{E}[\mathbf{e}^\top\mathbf{e}\mid\mathbf{X}]}{n-k}=\sigma^2. $$ The divisor $n-k$ — not $n$ — thus follows directly from the trace of the projection onto the residual space. Dividing by $n$ would bias the estimator downward by the factor $(n-k)/n$.

Part XI. The Gauss-Markov theorem — matrix proof

Proof
Least squares has the smallest covariance among linear unbiased estimators
  1. Any linear estimator. Let $\tilde{\boldsymbol\beta} = \mathbf{C}\mathbf{y}$. Write $\mathbf{C} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$.
  2. Unbiasedness. $\mathbb{E}[\tilde{\boldsymbol\beta}\mid\mathbf{X}]=\mathbf{C}\mathbf{X}\boldsymbol\beta$; unbiasedness for all $\boldsymbol\beta$ requires $\mathbf{C}\mathbf{X}=\mathbf{I}$, hence $\mathbf{D}\mathbf{X}=\mathbf{0}$.
  3. Covariance. Using $\mathbf{D}\mathbf{X}=\mathbf{0}$, the cross terms vanish and $\mathbf{C}\mathbf{C}^\top = (\mathbf{X}^\top\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^\top$.
  4. Conclusion. Therefore $$ \mathrm{Var}(\tilde{\boldsymbol\beta}\mid\mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^\top, $$ and $\mathbf{D}\mathbf{D}^\top$ is positive semidefinite. Thus every other linear unbiased estimator has covariance no smaller than that of least squares, with equality only for $\mathbf{D}=\mathbf{0}$. This is BLUE.

Part XII. Distributions and inference (under normality)

Under A5, $\hat{\boldsymbol\beta}\sim\mathcal{N}\big(\boldsymbol\beta,\ \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\big)$ and $(n-k)s^2/\sigma^2\sim\chi^2_{n-k}$. Replacing the unknown $\sigma^2$ by $s^2$ gives, for a single coefficient,

$$ \frac{\hat\beta_j-\beta_j}{\operatorname{SE}(\hat\beta_j)}\sim t_{n-k},\qquad \operatorname{SE}(\hat\beta_j)=s\sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}, $$

and, for $q$ joint linear restrictions, an $F_{q,\,n-k}$ statistic. This is the formal basis of hypothesis testing, the p-value and confidence intervals. Below we derive these distributions from first principles — from the single assumption that the disturbances are normal.

Proof
Independence of the estimator and the residual variance, and the t and F distributions
  1. Everything is linear in $\boldsymbol\varepsilon$. From the identities $\hat{\boldsymbol\beta}-\boldsymbol\beta=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol\varepsilon$ and $\mathbf{e}=\mathbf{M}\boldsymbol\varepsilon$, both the estimator and the residuals are linear transformations of $\boldsymbol\varepsilon\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$. They are therefore jointly normal.
  2. (a) Independence of $\hat{\boldsymbol\beta}$ and $s^2$. Since $\mathbf{X}^\top\mathbf{H}=\mathbf{X}^\top$ (because $\mathbf{H}\mathbf{X}=\mathbf{X}$ and $\mathbf{H}$ is symmetric), $\hat{\boldsymbol\beta}-\boldsymbol\beta=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{H}\boldsymbol\varepsilon$ is a function of the component $\mathbf{H}\boldsymbol\varepsilon$. The residuals depend only on $\mathbf{M}\boldsymbol\varepsilon$. Their covariance $$ \mathrm{Cov}(\mathbf{H}\boldsymbol\varepsilon,\ \mathbf{M}\boldsymbol\varepsilon)=\mathbf{H}\,(\sigma^2\mathbf{I})\,\mathbf{M}^\top=\sigma^2\mathbf{H}\mathbf{M}=\mathbf{0}, $$ since $\mathbf{H}\mathbf{M}=\mathbf{0}$. For jointly normal vectors, zero covariance means independence, so $\hat{\boldsymbol\beta}$ is independent of $\mathbf{e}$, and hence of $s^2=\mathbf{e}^\top\mathbf{e}/(n-k)$.
  3. (b) Distribution of $s^2$. By the symmetry and idempotence of $\mathbf{M}$: $(n-k)s^2=\mathbf{e}^\top\mathbf{e}=\boldsymbol\varepsilon^\top\mathbf{M}^\top\mathbf{M}\boldsymbol\varepsilon=\boldsymbol\varepsilon^\top\mathbf{M}\boldsymbol\varepsilon$. Setting $\mathbf{z}=\boldsymbol\varepsilon/\sigma\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, $$ \frac{(n-k)s^2}{\sigma^2}=\mathbf{z}^\top\mathbf{M}\mathbf{z}. $$ The matrix $\mathbf{M}$ — symmetric and idempotent — has eigenvalues $1$ (with multiplicity $\operatorname{tr}\mathbf{M}=n-k$) and $0$. In its eigenbasis $\mathbf{w}=\mathbf{Q}^\top\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, so the quadratic form reduces to a sum of $n-k$ independent squared standard normals: $$ \mathbf{z}^\top\mathbf{M}\mathbf{z}=\sum_{i=1}^{n-k}w_i^2\sim\chi^2_{n-k}. $$
  4. (c) The $t$ statistic. For a single coefficient $\hat\beta_j\sim\mathcal{N}(\beta_j,\ \sigma^2 v_{jj})$, where $v_{jj}=[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}$, so $Z=(\hat\beta_j-\beta_j)/(\sigma\sqrt{v_{jj}})\sim\mathcal{N}(0,1)$. Replacing the unknown $\sigma$ by $s$ and dividing by the square root of an independent $\chi^2$ over its degrees of freedom: $$ \frac{\hat\beta_j-\beta_j}{s\sqrt{v_{jj}}}=\frac{Z}{\sqrt{\dfrac{(n-k)s^2/\sigma^2}{n-k}}}=\frac{\mathcal{N}(0,1)}{\sqrt{\chi^2_{n-k}/(n-k)}}\sim t_{n-k}, $$ since numerator and denominator are independent by part (a). This is the definition of Student's $t$.
  5. (c) The $F$ statistic. Under $q$ restrictions $\mathbf{R}\boldsymbol\beta=\mathbf{r}$ and a true null, the vector $\mathbf{R}\hat{\boldsymbol\beta}-\mathbf{r}\sim\mathcal{N}\big(\mathbf{0},\ \sigma^2\mathbf{R}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{R}^\top\big)$. The standardised quadratic form of this vector is a $\chi^2$ on $q$ degrees of freedom: $$ \frac{(\mathbf{R}\hat{\boldsymbol\beta}-\mathbf{r})^\top[\mathbf{R}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{R}^\top]^{-1}(\mathbf{R}\hat{\boldsymbol\beta}-\mathbf{r})}{\sigma^2}\sim\chi^2_q. $$ The ratio of two independent $\chi^2$ variables, each divided by its own degrees of freedom, follows the $F$ distribution: $$ F=\frac{\chi^2_q/q}{\chi^2_{n-k}/(n-k)}\sim F_{q,\,n-k}, $$ with $\sigma^2$ cancelling between numerator and denominator — hence the final formula with $s^2$ in the denominator.

Part XIII. The Frisch-Waugh-Lovell theorem

One of the most important results on the interpretation of multiple regression (Frisch and Waugh 1933; Lovell 1963).

Twierdzenie
The FWL theorem (partialling out)

Partition the regressors as $\mathbf{X}=[\mathbf{X}_1\ \mathbf{X}_2]$. The OLS coefficient on $\mathbf{X}_2$ in the full regression is identical to the coefficient from regressing

the residuals of $\mathbf{y}$ on $\mathbf{X}_1$ against the residuals of $\mathbf{X}_2$ on $\mathbf{X}_1$.

In other words: the coefficient on a variable in a multiple regression is its effect after the influence of the remaining variables has been “partialled out” — the formal meaning of the phrase ceteris paribus.

Proof
The Frisch-Waugh-Lovell theorem
  1. Normal equations of the full model. For $\mathbf{X}=[\mathbf{X}_1\ \mathbf{X}_2]$ and $\hat{\boldsymbol\beta}=[\hat{\boldsymbol\beta}_1;\ \hat{\boldsymbol\beta}_2]$, the system $\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol\beta}=\mathbf{X}^\top\mathbf{y}$ splits into two blocks: $$ \mathbf{X}_1^\top\mathbf{X}_1\hat{\boldsymbol\beta}_1 + \mathbf{X}_1^\top\mathbf{X}_2\hat{\boldsymbol\beta}_2 = \mathbf{X}_1^\top\mathbf{y},\qquad \mathbf{X}_2^\top\mathbf{X}_1\hat{\boldsymbol\beta}_1 + \mathbf{X}_2^\top\mathbf{X}_2\hat{\boldsymbol\beta}_2 = \mathbf{X}_2^\top\mathbf{y}. $$
  2. Eliminate $\hat{\boldsymbol\beta}_1$. From the first block $\hat{\boldsymbol\beta}_1=(\mathbf{X}_1^\top\mathbf{X}_1)^{-1}\mathbf{X}_1^\top(\mathbf{y}-\mathbf{X}_2\hat{\boldsymbol\beta}_2)$. Substituting into the second, with the projection $\mathbf{P}_1=\mathbf{X}_1(\mathbf{X}_1^\top\mathbf{X}_1)^{-1}\mathbf{X}_1^\top$: $$ \mathbf{X}_2^\top\mathbf{P}_1(\mathbf{y}-\mathbf{X}_2\hat{\boldsymbol\beta}_2) + \mathbf{X}_2^\top\mathbf{X}_2\hat{\boldsymbol\beta}_2 = \mathbf{X}_2^\top\mathbf{y}. $$
  3. Collect terms. Rearranging and grouping in $\hat{\boldsymbol\beta}_2$, with $\mathbf{M}_1=\mathbf{I}-\mathbf{P}_1$ (the annihilator of $\mathbf{X}_1$): $$ \mathbf{X}_2^\top\mathbf{M}_1\mathbf{X}_2\,\hat{\boldsymbol\beta}_2 = \mathbf{X}_2^\top\mathbf{M}_1\mathbf{y}. $$
  4. Symmetry and idempotence. Since $\mathbf{M}_1=\mathbf{M}_1^\top=\mathbf{M}_1^2$, we have $\mathbf{X}_2^\top\mathbf{M}_1\mathbf{X}_2=(\mathbf{M}_1\mathbf{X}_2)^\top(\mathbf{M}_1\mathbf{X}_2)$ and $\mathbf{X}_2^\top\mathbf{M}_1\mathbf{y}=(\mathbf{M}_1\mathbf{X}_2)^\top(\mathbf{M}_1\mathbf{y})$. Hence $$ \boxed{\;\hat{\boldsymbol\beta}_2 = \big[(\mathbf{M}_1\mathbf{X}_2)^\top(\mathbf{M}_1\mathbf{X}_2)\big]^{-1}(\mathbf{M}_1\mathbf{X}_2)^\top(\mathbf{M}_1\mathbf{y})\;} $$ which is exactly the OLS estimator from regressing the **residuals** $\mathbf{M}_1\mathbf{y}$ (of $\mathbf{y}$ on $\mathbf{X}_1$) on the **residuals** $\mathbf{M}_1\mathbf{X}_2$ (of $\mathbf{X}_2$ on $\mathbf{X}_1$).

The theorem explains why coefficients of multiple regression are read as ceteris paribus effects, and it is a proof device in the theory of panel data and fixed-effects estimators.

Przykład
FWL in numbers

Take $n=5$, two regressors and an intercept:

$x_1$12345
$x_2$21534
$y$32657

The full regression of $y$ on $\{1, x_1, x_2\}$ gives the coefficient on $x_2$ equal to $\hat\beta_2 = \tfrac{27}{32} = 0.84375$.

Via FWL — first “clean” $y$ and $x_2$ of the influence of $\{1, x_1\}$ (take the residuals of each from a regression on a constant and $x_1$):

$$ \mathbf{M}_1\mathbf{y} = (0.6,\ -1.5,\ 1.4,\ -0.7,\ 0.2), \qquad \mathbf{M}_1\mathbf{x}_2 = (0.2,\ -1.4,\ 2.0,\ -0.6,\ -0.2). $$

Regressing residuals on residuals (through the origin):

$$ \hat\beta_2 = \frac{(\mathbf{M}_1\mathbf{x}_2)^\top(\mathbf{M}_1\mathbf{y})}{(\mathbf{M}_1\mathbf{x}_2)^\top(\mathbf{M}_1\mathbf{x}_2)} = \frac{5.4}{6.4} = \frac{27}{32} = 0.84375. $$

Identical — exactly as the theorem predicts.

Part XIV. Violations of the assumptions — a map

Each assumption A1–A5 can be broken; each breach has its own diagnosis and remedy. Details in the dedicated articles:

Broken assumptionEffectDiagnosisRemedy
A1 linearitybiasresidual patternquadratic terms, logarithms
A2 full ranknone/instabilityVIFdrop/combine variables, regularisation
A3 exogeneitybias and inconsistencysubject-matter argumentinstrumental variables, panel
A4 homoskedasticitywrong SEBreusch-Paganrobust SE, WLS
A4 no autocorrelationwrong SEBreusch-GodfreyHAC errors
A5 normalitysmall-sample testsQ-Q, Shapirolarger sample (CLT), bootstrap

Part XV. Computational methods and extensions

In practice one does not invert $\mathbf{X}^\top\mathbf{X}$ explicitly — it is numerically unstable, since the condition number is squared: $\kappa(\mathbf{X}^\top\mathbf{X})=\kappa(\mathbf{X})^2$. Preferred are:

  • QR decomposition: $\mathbf{X}=\mathbf{Q}\mathbf{R}$ (orthogonal-triangular), then $\mathbf{R}\hat{\boldsymbol\beta}=\mathbf{Q}^\top\mathbf{y}$ solved by back-substitution; the standard in R’s lm() and lstsq.
  • Singular value decomposition: $\mathbf{X}=\mathbf{U}\boldsymbol\Sigma\mathbf{V}^\top$ — the most stable, handling even a (near-)rank-deficient $\mathbf{X}$; yields the Moore-Penrose pseudoinverse.
  • Cholesky factorisation: $\mathbf{X}^\top\mathbf{X}=\mathbf{L}\mathbf{L}^\top$ — fast when conditioning is good.
  • Gaussian elimination — the algorithm Gauss developed precisely to solve the normal equations by hand; still the basis of solving linear systems.

The principal statistical extensions:

  • Weighted least squares (WLS): under known heteroskedasticity, minimise $\sum w_i\varepsilon_i^2$; $\hat{\boldsymbol\beta}=(\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$.
  • Generalised least squares (GLS): for any $\mathrm{Var}(\boldsymbol\varepsilon)=\sigma^2\boldsymbol\Omega$: $\hat{\boldsymbol\beta}=(\mathbf{X}^\top\boldsymbol\Omega^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol\Omega^{-1}\mathbf{y}$ (Aitken).
  • Instrumental variables / 2SLS: under endogeneity, $\hat{\boldsymbol\beta}_{IV}=(\mathbf{Z}^\top\mathbf{X})^{-1}\mathbf{Z}^\top\mathbf{y}$ — see IV.
  • Ridge and lasso: add a penalty on coefficient size; ridge: $\hat{\boldsymbol\beta}=(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$ — stabilises under collinearity and in high dimension.
  • Nonlinear least squares (NLS): minimise $\sum(y_i-f(\mathbf{x}_i,\boldsymbol\beta))^2$ for nonlinear $f$, by iterative methods (Gauss-Newton).

Part XVI. Connections to other methods of estimation

  • Maximum likelihood: under normal errors, ML coincides exactly with least squares (Part III).
  • Bayes: with a flat prior, the posterior for $\boldsymbol\beta$ is centred at $\hat{\boldsymbol\beta}_{\text{OLS}}$; a conjugate normal prior yields ridge regression.
  • Method of moments: the condition $\mathbb{E}[\mathbf{X}^\top\boldsymbol\varepsilon]=\mathbf{0}$, in sample form $\mathbf{X}^\top\mathbf{e}=\mathbf{0}$, is again the normal equations.

Part XVII. A worked example: regression step by step

The entire theory reduces to a handful of arithmetic operations. Let us carry them out by hand on a small dataset ($n=5$, one regressor and an intercept), so as to see exactly where every number reported by a statistical package comes from.

Definicja
The data
$i$$x_i$$y_i$
112
224
335
444
555

Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. Means: $\bar x = 3$, $\bar y = 4$.

Step 1. Sums and matrices

The design matrix $\mathbf{X}$ has ones in its first column (the intercept) and the values $x_i$ in the second. Hence

$$ \mathbf{X}^\top\mathbf{X} = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}, \qquad \mathbf{X}^\top\mathbf{y} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} = \begin{bmatrix} 20 \\ 66 \end{bmatrix}. $$

The determinant is $\det(\mathbf{X}^\top\mathbf{X}) = 5\cdot 55 - 15^2 = 275 - 225 = 50$, so

$$ (\mathbf{X}^\top\mathbf{X})^{-1} = \frac{1}{50}\begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix}. $$

Step 2. The estimator

$$ \hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix}\begin{bmatrix} 20 \\ 66 \end{bmatrix} = \begin{bmatrix} 22 - 19.8 \\ -6 + 6.6 \end{bmatrix} = \begin{bmatrix} 2.2 \\ 0.6 \end{bmatrix}. $$

The scalar formulae for simple regression give the same result: with $S_{xx} = \sum(x_i-\bar x)^2 = 10$ and $S_{xy} = \sum(x_i-\bar x)(y_i-\bar y) = 6$,

$$ \hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{6}{10} = 0.6, \qquad \hat\beta_0 = \bar y - \hat\beta_1\bar x = 4 - 0.6\cdot 3 = 2.2. $$

The fitted line: $\hat y = 2.2 + 0.6\,x$.

Step 3. Fitted values and residuals

$i$$x_i$$y_i$$\hat y_i = 2.2+0.6x_i$$e_i = y_i-\hat y_i$$e_i^2$
1122.8$-0.8$0.64
2243.4$0.6$0.36
3354.0$1.0$1.00
4444.6$-0.6$0.36
5555.2$-0.2$0.04
sum$0.0$2.40

The residuals sum to zero ($\sum e_i = 0$) — a direct consequence of the normal equations in the presence of an intercept. The residual sum of squares is $\text{RSS} = \mathbf{e}^\top\mathbf{e} = 2.40$.

Step 4. Residual variance and standard errors

With $n=5$, $k=2$, hence $n-k=3$ degrees of freedom:

$$ s^2 = \frac{\text{RSS}}{n-k} = \frac{2.40}{3} = 0.80. $$

The estimated covariance matrix is $\widehat{\mathrm{Var}}(\hat{\boldsymbol\beta}) = s^2(\mathbf{X}^\top\mathbf{X})^{-1}$, whose diagonal yields the standard errors:

$$ \operatorname{SE}(\hat\beta_0) = \sqrt{0.80\cdot 1.1} = \sqrt{0.88} \approx 0.938, \qquad \operatorname{SE}(\hat\beta_1) = \sqrt{0.80\cdot 0.1} = \sqrt{0.08} \approx 0.283. $$

Step 5. The $t$ statistics and significance

Under the null hypothesis $\beta_j = 0$:

$$ t_0 = \frac{2.2}{0.938} \approx 2.345, \qquad t_1 = \frac{0.6}{0.283} \approx 2.121. $$

The critical value of $t$ for $3$ degrees of freedom at the $0.05$ level (two-sided) is $3.182$. Both statistics fall below it, so with such a small sample we fail to reject the hypothesis of zero coefficients at the 5% level — despite the apparent fit. This neatly illustrates how little information $n=5$ carries. See hypothesis testing and the p-value.

Step 6. Goodness of fit

The total sum of squares is $\text{TSS} = \sum(y_i-\bar y)^2 = 4+0+1+0+1 = 6$. Therefore

$$ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{2.4}{6} = 0.60, $$

the adjusted $\bar R^2 = 1 - (1-R^2)\dfrac{n-1}{n-k} = 1 - 0.4\cdot\dfrac{4}{3} \approx 0.467$, and the statistic

$$ F = \frac{(\text{TSS}-\text{RSS})/(k-1)}{\text{RSS}/(n-k)} = \frac{3.6/1}{2.4/3} = 4.5 = t_1^2, $$

confirming the identity $F = t^2$ for a single regressor. The model explains 60% of the variation in $y$.

Intuicja
What we have just done
From five pairs of numbers we have passed — using nothing but addition, multiplication and the inversion of a $2\times 2$ matrix — to a complete regression report: the coefficients, their standard errors, the $t$ statistics, $R^2$ and the $F$ statistic. This is precisely what every statistical package does under the hood, only for larger matrices and by a more numerically stable method (Part XV).

Part XVIII. A confidence interval for E[y₀] versus a prediction interval

Having fitted the model, at a point $\mathbf{x}_0$ we may ask two distinct questions. First: where does the conditional mean $\mathbb{E}[y\mid\mathbf{x}_0]=\mathbf{x}_0^\top\boldsymbol\beta$ lie? Second: what value will a new, single observation $y_0=\mathbf{x}_0^\top\boldsymbol\beta+\varepsilon_0$ take? Both share the same point estimate $\hat y_0=\mathbf{x}_0^\top\hat{\boldsymbol\beta}$, but carry different uncertainty — and that is the origin of the distinction between a confidence interval and a prediction interval.

Proof
Two variances: for the conditional mean and for a new observation
  1. A common estimator. $\hat y_0=\mathbf{x}_0^\top\hat{\boldsymbol\beta}$ is an unbiased estimator of the mean $\mathbf{x}_0^\top\boldsymbol\beta$, since $\mathbb{E}[\hat{\boldsymbol\beta}\mid\mathbf{X}]=\boldsymbol\beta$.
  2. Variance of the conditional mean. The uncertainty comes only from estimating $\hat{\boldsymbol\beta}$: $$ \mathrm{Var}(\hat y_0\mid\mathbf{X})=\mathbf{x}_0^\top\mathrm{Var}(\hat{\boldsymbol\beta}\mid\mathbf{X})\mathbf{x}_0=\sigma^2\,\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0. $$
  3. Prediction error for a new observation. A new observation $y_0=\mathbf{x}_0^\top\boldsymbol\beta+\varepsilon_0$ carries an additional, independent term $\varepsilon_0$ of variance $\sigma^2$. The prediction error decomposes as $y_0-\hat y_0=(\mathbf{x}_0^\top\boldsymbol\beta-\mathbf{x}_0^\top\hat{\boldsymbol\beta})+\varepsilon_0$; the two terms are independent (since $\varepsilon_0$ belongs to a future observation, absent from the sample), so the variances add: $$ \mathrm{Var}(y_0-\hat y_0\mid\mathbf{X})=\sigma^2\,\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0+\sigma^2=\sigma^2\big(1+\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0\big). $$ The extra "1" is the irreducible noise of a single observation — it persists even in an infinite sample, where $\hat{\boldsymbol\beta}\to\boldsymbol\beta$.

Replacing the unknown $\sigma^2$ by $s^2$ and using the $t_{n-k}$ distribution gives two intervals (at confidence level $1-\alpha$):

$$ \underbrace{\hat y_0\pm t_{n-k,\,\alpha/2}\;s\sqrt{\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}}_{\text{confidence interval for }\mathbb{E}[y_0]},\qquad \underbrace{\hat y_0\pm t_{n-k,\,\alpha/2}\;s\sqrt{1+\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}}_{\text{prediction interval for a new }y_0}. $$
Przykład
Numerically, on the example of Part XVII

Take the model of Part XVII: $\hat{\boldsymbol\beta}=(2.2;\ 0.6)$, $s^2=0.8$, $(\mathbf{X}^\top\mathbf{X})^{-1}=\begin{bmatrix}1.1 & -0.3\\ -0.3 & 0.1\end{bmatrix}$, $n-k=3$, $t_{3;\,0.025}=3.182$. At the centre of the data $\mathbf{x}_0=(1,\ 3)$:

$$ \hat y_0=2.2+0.6\cdot 3=4.0,\qquad \mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0=\tfrac{1}{5}=0.2. $$

The leverage $\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0=0.2$ is the smallest possible here (the centre of the sample — the most reliable prediction).

  • Confidence interval for $\mathbb{E}[y_0]$: $\operatorname{SE}=\sqrt{0.8\cdot 0.2}=\sqrt{0.16}=0.4$, so $4.0\pm 3.182\cdot 0.4=4.0\pm 1.273$, that is $[2.727;\ 5.273]$.
  • Prediction interval for a new $y_0$: $\operatorname{SE}=\sqrt{0.8\cdot 1.2}=\sqrt{0.96}\approx 0.980$, so $4.0\pm 3.182\cdot 0.980=4.0\pm 3.118$, that is $[0.882;\ 7.118]$.

The prediction interval is markedly wider ($\pm 3.118$ against $\pm 1.273$) — it must cover not only the uncertainty about the location of the line, but also the random scatter of the observation itself. The gap does not vanish with the sample: as $n\to\infty$, the confidence interval shrinks to a point, while the prediction interval shrinks only to width $\pm t\cdot\sigma$.

Part XIX. The decomposition of variance: R², adjusted R² and the ANOVA table

How well does the model explain the variation in the data? The answer comes from splitting the total variation of $\mathbf{y}$ into a part explained by the regression and a residual part — the foundation of the analysis of variance (ANOVA) and of the coefficient of determination $R^2$. The three sums of squares (for a model with an intercept; $\bar y$ is the mean of $y$):

$$ \underbrace{\textstyle\sum_i (y_i-\bar y)^2}_{\text{TSS — total}}, \qquad \underbrace{\textstyle\sum_i (\hat y_i-\bar y)^2}_{\text{ESS — explained}}, \qquad \underbrace{\textstyle\sum_i (y_i-\hat y_i)^2}_{\text{RSS — residual}}. $$
Proof
The ANOVA identity: TSS = ESS + RSS
  1. Decompose the deviation. For each observation $y_i-\bar y=(\hat y_i-\bar y)+(y_i-\hat y_i)=(\hat y_i-\bar y)+e_i$. Squaring and summing over $i$: $$ \sum_i(y_i-\bar y)^2=\sum_i(\hat y_i-\bar y)^2+\sum_i e_i^2+2\sum_i(\hat y_i-\bar y)e_i. $$
  2. The cross term vanishes. $\sum_i(\hat y_i-\bar y)e_i=\sum_i\hat y_i e_i-\bar y\sum_i e_i$. By the normal equations the residuals are orthogonal to the regressors, $\mathbf{X}^\top\mathbf{e}=\mathbf{0}$; in particular the column of ones gives $\sum_i e_i=0$, and a combination of columns gives $\hat{\mathbf{y}}^\top\mathbf{e}=\hat{\boldsymbol\beta}^\top\mathbf{X}^\top\mathbf{e}=0$. Both terms are zero.
  3. Conclusion. $\text{TSS}=\text{ESS}+\text{RSS}$. Step 2 requires an intercept in the model — without it $\sum_i e_i\neq 0$ and the identity breaks down.

Hence the coefficient of determination:

$$ R^2=\frac{\text{ESS}}{\text{TSS}}=1-\frac{\text{RSS}}{\text{TSS}}\in[0,1], $$

read as the share of the variation in $y$ explained by the model. Equivalently, $R^2$ is the square of the correlation coefficient between $y_i$ and $\hat y_i$.

The difficulty is that $R^2$ never decreases when a regressor is added (each new column can only lower the RSS). To penalise an excess of parameters one uses the adjusted $R^2$:

$$ \bar R^2=1-\frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)}=1-(1-R^2)\frac{n-1}{n-k}, $$

which rises only when a new regressor improves the fit by more than chance alone would; $\bar R^2$ may be negative. The same sums of squares arrange themselves into an ANOVA table, from which the $F$ statistic for the joint significance of the regression follows directly:

SourceSum of squaresdfMean square$F$
RegressionESS$k-1$$\text{ESS}/(k-1)$$\dfrac{\text{ESS}/(k-1)}{\text{RSS}/(n-k)}$
ResidualRSS$n-k$$\text{RSS}/(n-k)=s^2$
TotalTSS$n-1$
Przykład
ANOVA on the example of Part XVII

For the data of Part XVII ($\hat y=2.2+0.6x$, $\bar y=4$):

$$ \text{TSS}=6,\qquad \text{ESS}=\sum_i(\hat y_i-4)^2=1.44+0.36+0+0.36+1.44=3.6,\qquad \text{RSS}=2.4, $$

so $\text{ESS}+\text{RSS}=3.6+2.4=6=\text{TSS}$ — the identity holds. Therefore

$$ R^2=\frac{3.6}{6}=0.60,\qquad \bar R^2=1-0.4\cdot\frac{4}{3}=\frac{7}{15}\approx0.467. $$

The ANOVA table for this fit:

SourceSSdfMS$F$
Regression3.613.64.5
Residual2.430.8
Total6.04

The statistic $F=3.6/0.8=4.5$ — exactly $t_1^2=2.121^2$ from Part XVII (for a single regressor $F=t^2$). The model explains 60% of the variation in $y$.

Uwaga
What R² does not tell you
A high $R^2$ does not mean the model is correct or causal: one can obtain $R^2\approx 1$ for a spurious regression (trending series), or a low $R^2$ for a perfectly correct model with large noise. $R^2$ does not detect omitted-variable bias, does not judge out-of-sample predictive accuracy, and — without an intercept — can be negative or incomparable across models. It is a measure of in-sample fit, not a measure of truth.

Part XX. Multiple regression and omitted-variable bias — a worked example

Part XIII showed how the FWL theorem recovers a coefficient cleaned of the influence of the other variables. Here we compute the other side of the coin: what happens when a relevant variable is omitted. We use the same data as in Part XIII:

$x_1$12345
$x_2$21534
$y$32657

The full regression of $y$ on $\{1,x_1,x_2\}$ — the solution of the normal equations $\hat{\boldsymbol\beta}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ (exactly, in fractions):

$$ \hat\beta_0=\tfrac{23}{80}=0.2875,\qquad \hat\beta_1=\tfrac{19}{32}=0.59375,\qquad \hat\beta_2=\tfrac{27}{32}=0.84375. $$

The short regression omitting $x_2$ (only $y$ on $\{1,x_1\}$):

$$ \tilde\beta_0=\tfrac{13}{10}=1.3,\qquad \tilde\beta_1=\tfrac{11}{10}=1.1. $$

The coefficient on $x_1$ has jumped from $0.594$ to $1.1$ — nearly doubling. This is omitted-variable bias: the short coefficient attributes to $x_1$ part of the effect of $x_2$ as well, because the two variables are correlated.

Proof
The omitted-variable bias formula
  1. Auxiliary regression. Let $\delta$ be the slope from regressing the omitted variable on the retained regressors — here $x_2$ on $\{1,x_1\}$ — that is $\delta=S_{x_1x_2}/S_{x_1x_1}$.
  2. Sample identity. From the partitioned solution of the normal equations (the same algebra as in the FWL proof, Part XIII) follows the exact identity: $$ \tilde\beta_1=\hat\beta_1+\hat\beta_2\,\delta. $$ The short coefficient collects the own effect of $x_1$ (namely $\hat\beta_1$) augmented by the effect of $x_2$ (namely $\hat\beta_2$) "passed through" the correlation of $x_2$ with $x_1$ (namely $\delta$).
  3. Conclusion. The bias vanishes exactly when $\hat\beta_2=0$ (the omitted variable is irrelevant) or $\delta=0$ (the omitted variable is uncorrelated with the retained ones). This is the sample counterpart of [the exogeneity assumption](/en/ekonometria/zalozenia-kmnk/) A3.
Przykład
Checking the OVB formula in numbers

Regressing $x_2$ on $\{1,x_1\}$ gives $\delta=S_{x_1x_2}/S_{x_1x_1}=6/10=0.6$ (since $\bar x_1=\bar x_2=3$, $S_{x_1x_2}=6$, $S_{x_1x_1}=10$). Then

$$ \hat\beta_1+\hat\beta_2\,\delta=\frac{19}{32}+\frac{27}{32}\cdot\frac{3}{5}=\frac{19}{32}+\frac{81}{160}=\frac{95+81}{160}=\frac{176}{160}=\frac{11}{10}=1.1=\tilde\beta_1. $$

Exactly the coefficient from the short regression. For comparison of residuals: the full model has $\text{RSS}=87/160\approx0.544$, the short model $\text{RSS}=5.1$ — omitting $x_2$ worsens the fit several-fold and distorts the estimate of $x_1$ at the same time.

Appendix A. Summary of key formulae

QuantityFormula
Estimator$\hat{\boldsymbol\beta}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$
Fitted values$\hat{\mathbf{y}}=\mathbf{H}\mathbf{y}$, $\mathbf{H}=\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$
Residuals$\mathbf{e}=\mathbf{M}\mathbf{y}$, $\mathbf{M}=\mathbf{I}-\mathbf{H}$
Covariance$\mathrm{Var}(\hat{\boldsymbol\beta})=\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$
Estimator of $\sigma^2$$s^2=\mathbf{e}^\top\mathbf{e}/(n-k)$
$t$ statistic$(\hat\beta_j-\beta_j)/\operatorname{SE}(\hat\beta_j)\sim t_{n-k}$
$R^2$$1-\mathbf{e}^\top\mathbf{e}/\sum(y_i-\bar y)^2$
Adjusted $R^2$$1-(1-R^2)\dfrac{n-1}{n-k}$
$F$ statistic (ANOVA)$\dfrac{\text{ESS}/(k-1)}{\text{RSS}/(n-k)}\sim F_{k-1,\,n-k}$

Appendix B. Notation

SymbolMeaning
$n,\ k$number of observations, number of parameters
$\mathbf{y},\ \mathbf{X},\ \boldsymbol\beta,\ \boldsymbol\varepsilon$vector of observations, regressor matrix, parameters, errors
$\hat{\boldsymbol\beta},\ \mathbf{e},\ \hat{\mathbf{y}}$estimator, residuals, fitted values
$\mathbf{H},\ \mathbf{M}$hat matrix, annihilator matrix
$\sigma^2,\ s^2$error variance and its estimator

Appendix C. Biographical sketches

Brief notes on the figures whose work converged into the method of least squares and regression.

Adrien-Marie Legendre (1752–1833) — French mathematician, author of foundational works in number theory (Essai sur la théorie des nombres), elliptic functions and geometry (his Éléments de géométrie displaced Euclid in schools for a century). He was the first to publish the method of least squares (1805) and to name it. Late in life he lost his government pension after refusing to back the authorities’ candidate in an Institute vote, and died in poverty. The only surviving likeness of him is Boilly’s caricature.

Carl Friedrich Gauss (1777–1855) — German mathematician, astronomer and physicist, called princeps mathematicorum, the “Prince of Mathematicians.” Born to a poor family in Brunswick, his gifts were recognised in childhood. In the Disquisitiones Arithmeticae (1801) he built modern number theory; he computed the orbit of Ceres; he proved the fundamental theorem of algebra (doctoral dissertation, 1799); and he advanced geodesy, magnetism and electrostatics (the unit of magnetic induction bears his name). For least squares the key works are Theoria Motus (1809) and Theoria combinationis (1821–1823).

Pierre-Simon de Laplace (1749–1827) — French mathematician and astronomer, creator of celestial mechanics (Traité de mécanique céleste) and of modern probability theory (Théorie analytique des probabilités, 1812). His central limit theorem explained why measurement errors tend to be normal, and tied least squares to the calculus of probability. He navigated the Revolution, the Empire and the Restoration while retaining influential scientific posts.

Ruđer Josip Bošković (1711–1787) — a scholar from Dubrovnik, a Jesuit, astronomer and physicist active in Italy. He proposed (1755–1760) the criterion closest in spirit to least squares: minimising the sum of absolute deviations subject to the signed deviations summing to zero — today’s $L_1$ regression, to which statistics returned only in the twentieth century.

Andrey Markov (1856–1922) — Russian mathematician, a student of Chebyshev, and the founder of the theory of Markov chains. His name entered the Gauss-Markov theorem through later textbooks, although Gauss already had the result in 1823.

Francis Galton (1822–1911) — British polymath, a cousin of Darwin, and a student of heredity. He introduced the notion of regression (“reversion to the mean”) and of correlation, transferring the method from measurement errors to relationships between variables. (He was also a proponent of eugenics — an idea now rejected.)

Karl Pearson (1857–1936) — British mathematician and statistician, who formalised the correlation coefficient, the $\chi^2$ test and much of the apparatus of mathematical statistics; founder of the first chair of statistics (University College London) and of the journal Biometrika.

Ronald Aylmer Fisher (1890–1962) — British statistician and geneticist, one of the founders of modern statistics. He introduced maximum likelihood, the analysis of variance, the design of experiments and the notion of statistical significance, casting least squares as the maximum-likelihood estimator under normal errors.

Appendix D. The original texts and sources

A register of the primary works in which the method took shape, with the location of the key passages and the standard translations. The quotations in Parts II and III are drawn directly from these editions.

Legendre (1805). A. M. Legendre, Nouvelles méthodes pour la détermination des orbites des comètes, Paris: Firmin Didot, 1805. The first publication of the method appears in the appendix Sur la méthode des moindres carrés (pp. 72–80). It is the source of the famous sentence "…rendre minimum la somme des carrés des erreurs" (Part II).

Gauss (1809). C. F. Gauss, Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, Hamburg: Perthes & Besser, 1809. The Latin original; the treatment of least squares occupies Book II, Section 3, articles 175–186. The standard English translation is C. H. Davis, Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, Boston: Little, Brown, 1857 — the source of the priority statement (Part III).

Gauss (1821–1823). C. F. Gauss, Theoria combinationis observationum erroribus minimis obnoxiae, Göttingen (Pars prior 1821, Pars posterior 1823) — the mature variance-based justification (the Gauss-Markov theorem). Annotated English translation: G. W. Stewart, Theory of the Combination of Observations Least Subject to Errors, SIAM, 1995.

Laplace (1812). P. S. Laplace, Théorie analytique des probabilités, Paris: Courcier, 1812 — the probabilistic treatment of error and the central limit theorem.

Translations and scholarship. English translations of many historical works (Legendre, Laplace and others) have been made available by Richard J. Pulskamp (probabilityandfinance.com). The classic historical account is S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Harvard University Press, 1986.

Appendix E. Bibliography and further reading

Modern treatments and classic source papers, complementing the primary texts of Appendix D. For the full history of the method, Stigler (1986), already cited above, remains indispensable.

Modern textbooks of econometrics and statistics.

  • W. H. Greene, Econometric Analysis, Pearson, 8th ed., 2018 — the encyclopaedic academic standard; a complete treatment of the matrix linear model, the Gauss-Markov theorem, FWL and the extensions.
  • J. M. Wooldridge, Introductory Econometrics: A Modern Approach, Cengage, 7th ed., 2019 — the most popular introduction; and his Econometric Analysis of Cross Section and Panel Data, MIT Press, 2nd ed., 2010 (advanced level).
  • F. Hayashi, Econometrics, Princeton University Press, 2000 — a rigorous, method-of-moments-based treatment of least squares and GMM.
  • R. Davidson, J. G. MacKinnon, Econometric Theory and Methods, Oxford University Press, 2004 — a strong emphasis on the geometry of projections and the FWL theorem.
  • C. R. Rao, Linear Statistical Inference and Its Applications, Wiley, 2nd ed., 1973 — a classic of linear estimation theory.
  • H. Theil, Principles of Econometrics, Wiley, 1971 — a historical reference point of modern econometrics.

Classic papers on the method and its history.

  • R. Frisch, F. V. Waugh, “Partial Time Regressions as Compared with Individual Trends”, Econometrica 1(4), 1933, pp. 387–401 — the source of the FWL theorem.
  • M. C. Lovell, “Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis”, Journal of the American Statistical Association 58(304), 1963, pp. 993–1010 — the generalisation and the name of the theorem.
  • A. C. Aitken, “On Least Squares and Linear Combination of Observations”, Proceedings of the Royal Society of Edinburgh 55, 1935, pp. 42–48 — generalised least squares (GLS).
  • R. L. Plackett, “The Discovery of the Method of Least Squares”, Biometrika 59(2), 1972, pp. 239–251 — a detailed reconstruction of the priority dispute.
  • S. M. Stigler, “Gauss and the Invention of Least Squares”, The Annals of Statistics 9(3), 1981, pp. 465–474 — an analysis of Gauss’s 1809 priority claim.

Chronology

YearEvent
1722Cotes — weighted mean of observations (posthumous)
1736–44Lapland and Peru expeditions — the figure of the Earth
~1750Tobias Mayer — method of averages (lunar libration)
1755–60Bošković — minimising the sum of absolute deviations ($L_1$)
1795Gauss begins (by his own account) using least squares
1801Piazzi discovers Ceres; Gauss computes the orbit; the planet is recovered
1805Legendre — first publication of the method of least squares
1808Robert Adrain — independent derivation of the law of error
1809Gauss, Theoria Motus — derivation from the postulate of the mean
1810Laplace — the central limit theorem
1823Gauss, Theoria combinationis — the Gauss-Markov theorem
1885Galton — “regression”; the method applied to relationships
1896Pearson — the correlation coefficient
1922Fisher — maximum likelihood

Legacy

From a single astronomical problem grew a tool underlying linear regression and the whole of econometrics, estimation theory, machine learning (the minimisation of squared error), and numerical linear algebra (Gaussian elimination, the QR and SVD decompositions). And the distribution Gauss derived “in passing,” while justifying his method, bears his name and is the most important distribution in all of statistics — in part through the central limit theorem.

Intuicja
One sentence to remember
Gauss did not assume the normal distribution — he derived it from the conviction that the arithmetic mean is best, and from it obtained least squares (1809). Laplace explained why errors tend to be normal (1810). Fourteen years later, Gauss showed the method is optimal even without normality (1823). These three arguments make least squares the foundation of econometrics.

Related: Ordinary Least Squares — a step-by-step derivation · The classical (OLS) assumptions

📚 Learning resources
  • C. F. Gauss, Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (1809); trans. C. H. Davis, Theory of the Motion of the Heavenly Bodies (1857), arts. 175–186
  • C. F. Gauss, Theoria combinationis observationum erroribus minimis obnoxiae (1821–1823)
  • A. M. Legendre, Nouvelles méthodes pour la détermination des orbites des comètes (1805), appendix Sur la méthode des moindres carrés
  • P. S. Laplace, Théorie analytique des probabilités (1812)
  • S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900 (1986)