A Linear Algebra Approach to Least Squares

In science and engineering, we rarely get perfect data. When we collect measurements \((t_i, y_i)\), they almost never fall perfectly on a straight line due to noise or experimental error. We are usually looking for the “best fit” line \(y = ct + d\) to describe the trend. But what does “best fit” or “least squares” actually mean mathematically?
While this is often taught as a calculus problem (minimizing the sum of squared errors), it is actually a beautiful problem in Linear Algebra. It is a story about geometry, projections, and subspaces.
1. From Statistics to Linear Algebra
Let’s say we have \(m\) data points. We are trying to solve a system of linear equations that looks like this:
$$
\begin{cases} ct_1 + d = y_1 \\ ct_2 + d = y_2 \\ \vdots \\ ct_m + d = y_m \end{cases}
$$
In matrix notation, we define our Design Matrix \(A\), our Unknowns \(x\), and our Observations \(y\):
$$
A = \begin{pmatrix} t_1 & 1 \\ \vdots & \vdots \\ t_m & 1 \end{pmatrix}, \quad x = \begin{pmatrix} c \\ d \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}
$$
We want to solve \(Ax = y\). However, because of the noise, \(y\) is typically not in the column space of \(A\) (denoted \(\mathsf{W}\)), so no exact solution exists.
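To make this concrete, here is a tiny NumPy sketch (the numbers are made-up placeholders, not the data set analyzed later) that builds the design matrix and shows that a noisy \(y\) is generally not in \(\mathsf{W}\): appending \(y\) to \(A\) raises the rank.

```python
import numpy as np

# Hypothetical measurements (placeholders, not the data set analyzed later).
t = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Design matrix: a column of t's for the slope c, a column of ones for the intercept d.
A = np.column_stack((t, np.ones_like(t)))

print(np.linalg.matrix_rank(A))                        # 2
print(np.linalg.matrix_rank(np.column_stack((A, y))))  # 3, so y is not in Col(A)
```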
Instead of a perfect solution, we seek a vector \(x\) that minimizes the “error” \(E\). In Linear Algebra, this error is the square of the norm of the residual vector:
$$
E = || y - Ax ||^2
$$
Why \(E = || y - Ax ||^2\)?
To understand this notation, we have to look at the error for each individual data point.
1. The Individual Error
For a single data point \((t_i, y_i)\), our line predicts the value \(ct_i + d\). The difference between the actual observed value (\(y_i\)) and the predicted value is the error for that specific point:
$$
e_i = y_i - (ct_i + d)
$$
2. The Error Vector
In matrix notation, \(Ax\) represents the vector of all our predicted values:
$$
Ax = \begin{pmatrix} ct_1 + d \\ \vdots \\ ct_m + d \end{pmatrix} = \text{Predicted Values}
$$
If we subtract this prediction vector from our observation vector \(y\), we get a single vector \(e\) containing all the individual errors:
$$
e = y - Ax = \begin{pmatrix} y_1 - (ct_1 + d) \\ \vdots \\ y_m - (ct_m + d) \end{pmatrix} = \begin{pmatrix} e_1 \\ \vdots \\ e_m \end{pmatrix}
$$
3. The Squared Norm
In statistics, we minimize the Sum of Squared Errors (SSE) so that positive and negative errors cannot cancel each other out. Mathematically, this is:
$$
\text{SSE} = \sum_{i=1}^{m} e_i^2 = e_1^2 + e_2^2 + \dots + e_m^2
$$
In Linear Algebra, the length (or magnitude) of a vector \(v\) is called its Norm, denoted as \(||v||\). If we use the standard inner product here, the formula for the norm is the square root of the sum of squared components:
$$
||e|| = ||y-Ax|| = \sqrt{e_1^2 + e_2^2 + \dots + e_m^2}
$$
If we square both sides to get rid of the square root, we get exactly the Sum of Squared Errors:
$$
||e||^2 = ||y-Ax||^2 = e_1^2 + e_2^2 + \dots + e_m^2 = \text{SSE}
$$
Therefore, minimizing the sum of squared errors is mathematically identical to minimizing the squared length of the error vector:
$$
E = || \underbrace{y - Ax}_{\text{error vector}} ||^2
$$
Because the norm is always non-negative (\(||v|| \ge 0\)), finding an \(x\) that minimizes \(||y-Ax||^2\) is equivalent to finding an \(x\) that minimizes \(||y-Ax||\). Thus, the problem translates simply to: Find \(x\) such that the vector \(Ax\) is as close as possible to \(y\).
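This equivalence is easy to check numerically; the values below are arbitrary, chosen only to make the script self-contained:

```python
import numpy as np

# Arbitrary candidate line (c, d) = (2, 1) and some noisy observations.
A = np.column_stack(([1.0, 2.0, 3.0, 4.0], np.ones(4)))
x = np.array([2.0, 1.0])
y = np.array([3.2, 4.9, 7.1, 8.7])

e = y - A @ x                       # residual vector
sse = np.sum(e**2)                  # sum of squared errors
norm_sq = np.linalg.norm(e)**2      # squared norm of the residual vector

print(np.isclose(sse, norm_sq))     # True
```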
2. The Solution: Geometry of Projections
To solve this, we rely on subspaces.
Theorem 6.6: Orthogonal Decomposition
This theorem tells us that any vector \(y\) can be split uniquely into two perpendicular components relative to a subspace \(\mathsf{W}\).
Let \(\mathsf{W}\) be a finite-dimensional subspace of an inner product space \(\mathsf{V}\), and let \(y \in \mathsf{V}\). Then there exist unique vectors \(u \in \mathsf{W}\) and \(z \in \mathsf{W}^{\perp}\) such that \(y = u + z\).
Connecting to Our Least Squares Problem:
To see why this theorem is the key, we map its notation to our problem:
- The Subspace (\(\mathsf{W}\)): This matches the Column Space (Range) of our matrix \(A\), defined as \(\{Ax : x \in \mathsf{F}^n\}\).
- The Vector (\(u\)): This matches our ideal approximation \(Ax_0\). It is the “shadow” or projection of \(y\) onto the subspace \(\mathsf{W}\).
- The Vector (\(z\)): This matches our error vector \(y - Ax_0\); it represents the “noise,” which is orthogonal to our model subspace \(\mathsf{W}\).
Corollary: The Closest Vector
The corollary to Theorem 6.6 gives us our optimization criterion.
The vector \(u\) is the unique vector in \(\mathsf{W}\) that is “closest” to \(y\); that is, \(||y - w|| \geq ||y - u||\) for all \(w \in \mathsf{W}\) (we write \(w\) here to avoid clashing with our coefficient vector \(x\)).
In our context, this confirms that if we can find the \(x_0\) that makes the error vector (\(y - Ax_0\)) orthogonal to the column space, we have mathematically guaranteed the minimum error.
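As a numerical illustration of the corollary (a sketch that borrows np.linalg.lstsq to compute the projection; we derive the explicit formula for it in the next two sections), no other vector \(Aw\) in the column space comes closer to \(y\) than the projection \(u\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example: 6 noisy points scattered around a line.
t = np.arange(1.0, 7.0)
A = np.column_stack((t, np.ones_like(t)))
y = 2.0 * t - 1.0 + rng.normal(scale=0.5, size=6)

x0, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
u = A @ x0                                   # projection of y onto Col(A)

# Compare ||y - u|| against ||y - Aw|| for 1000 other vectors Aw in Col(A):
# the projection is never beaten.
best_other = min(np.linalg.norm(y - A @ (x0 + rng.normal(size=2)))
                 for _ in range(1000))
print(np.linalg.norm(y - u) <= best_other)   # True
```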
3. The Algebraic Tools
To find the actual values for \(x\), we need two algebraic lemmas to manipulate our equations.
Lemma 1: The Adjoint Property
This allows us to move the matrix \(A\) from one side of an inner product to the other.
\(\langle Ax, y \rangle_m = \langle x, A^*y \rangle_n\)
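Over the real numbers the adjoint is just the transpose, so the lemma can be spot-checked numerically with random matrices (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
y = rng.normal(size=m)

# <Ax, y> computed in F^m equals <x, A^T y> computed in F^n (A^* = A^T over the reals).
print(np.isclose(np.dot(A @ x, y), np.dot(x, A.T @ y)))   # True
```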
Lemma 2: Rank Preservation
This ensures that our solution will be unique.
If \(\text{rank}(A) = n\), then \(\text{rank}(A^*A) = n\). Consequently, \(A^*A\) is invertible.
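A quick numerical spot-check of the rank statement (a random tall matrix has full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 2))            # 6x2, full column rank (n = 2) with probability 1

print(np.linalg.matrix_rank(A))        # 2
print(np.linalg.matrix_rank(A.T @ A))  # 2, so the 2x2 matrix A^T A is invertible
```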
4. The Solution: Deriving the Normal Equations
We now combine the geometry (Theorem 6.6) with the algebra (Lemma 1) to solve for \(x_0\).
Step 1: The Orthogonality Condition
From Theorem 6.6, we established that the optimal error vector must lie in \(\mathsf{W}^\perp\). Here we write it as \(Ax_0 - y\) rather than \(y - Ax_0\); the two differ only by a sign, and a subspace contains a vector exactly when it contains its negative. Being in \(\mathsf{W}^\perp\) means it must be orthogonal to every vector in the column space \(\mathsf{W}\).
Since every vector in \(\mathsf{W}\) can be written as \(Ax\) for some \(x\), we can write this condition as an inner product:
$$
\langle Ax, Ax_0 - y \rangle_m = 0 \quad \text{for all } x \in \mathsf{F}^n
$$
Step 2: Moving the Matrix (Lemma 1)
We want to isolate \(x\), so we use Lemma 1 to move \(A\) from the left side of the inner product to the right side (where it becomes the adjoint \(A^*\)):
$$
\langle x, A^*(Ax_0 - y) \rangle_n = 0 \quad \text{for all } x \in \mathsf{F}^n
$$
Step 3: Solving for Zero
If the inner product of a vector \(v\) with every vector \(x\) is 0, then \(v\) must itself be the zero vector (take \(x = v\): then \(\langle v, v \rangle = ||v||^2 = 0\), which forces \(v = 0\)). Applying this with \(v = A^*(Ax_0 - y)\) gives:
$$
A^*(Ax_0 - y) = 0
$$
Expanding this gives us:
$$
A^*Ax_0 - A^*y = 0 \quad \Rightarrow \quad A^*Ax_0 = A^*y
$$
Step 4: Invertibility (Lemma 2)
Finally, if we assume our data columns are linearly independent (rank(\(A\)) = \(n\)), Lemma 2 tells us that \(A^*A\) is invertible. We can multiply by the inverse to isolate \(x_0\).
This leads to Theorem 6.12:
Let \(A\) be an \(m \times n\) matrix with rank \(n\). The unique vector \(x_0\) that minimizes \(||Ax - y||\) is given by:
$$
x_0 = (A^*A)^{-1}A^*y
$$
The system \(A^*Ax_0 = A^*y\) derived in Step 3 is the famous set of Normal Equations; the formula above is its unique solution when \(A^*A\) is invertible.
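In code, Theorem 6.12 becomes a one-liner. A minimal sketch for the real case (so \(A^* = A^T\)): rather than forming the inverse explicitly, we solve the normal equations with np.linalg.solve and compare against np.linalg.lstsq.

```python
import numpy as np

def least_squares_fit(A: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve the normal equations A^T A x0 = A^T y (A assumed to have full column rank)."""
    return np.linalg.solve(A.T @ A, A.T @ y)

# Sanity check against NumPy's built-in least-squares routine.
rng = np.random.default_rng(3)
A = np.column_stack((np.arange(1.0, 7.0), np.ones(6)))
y = rng.normal(size=6)

x_normal = least_squares_fit(A, y)
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_normal, x_lstsq))   # True
```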
5. A Numerical Deep Dive
Let’s apply this theorem to a concrete data set to verify the geometry explicitly.
The Data:
$$
\begin{array}{c|c} t \text{ (Independent)} & y \text{ (Dependent)} \\ \hline 1 & 1.04 \\ 2 & 3.54 \\ 3 & 4.31 \\ 4 & 7.81 \\ 5 & 9.29 \\ 6 & 11.94 \\ \end{array}
$$
Step 1: Construct the Matrices
$$
A = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \\ 5 & 1 \\ 6 & 1 \end{pmatrix}, \quad y = \begin{pmatrix} 1.04 \\ 3.54 \\ 4.31 \\ 7.81 \\ 9.29 \\ 11.94 \end{pmatrix}
$$
Step 2: Solve for \(x\) (The Coefficients)
Using the formula \(x_0 = (A^T A)^{-1} A^T y\) (for real matrices, the adjoint \(A^*\) is just the transpose \(A^T\)), we calculate:
$$
c \approx 2.15, \quad d \approx -1.203
$$
So our best-fit line is \(y = 2.15t - 1.203\).
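Here is the same computation in NumPy, as a quick check of the arithmetic (np.linalg.lstsq gives the same coefficients):

```python
import numpy as np

t = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.04, 3.54, 4.31, 7.81, 9.29, 11.94])

A = np.column_stack((t, np.ones_like(t)))   # design matrix
x0 = np.linalg.solve(A.T @ A, A.T @ y)      # (A^T A)^{-1} A^T y

print(x0)   # approximately [ 2.15  -1.2033]  ->  (c, d)
```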
6. Verifying the Geometry (Theorem 6.6)
Here is where it gets interesting. We claimed that the data \(y\) splits into a signal \(u\) and noise \(z\). Let’s calculate them.
The Signal Vector \(u\) (in \(\mathsf{W}\))
$$
u = Ax_0 \approx \begin{pmatrix} 0.95 \\ 3.10 \\ 5.25 \\ 7.40 \\ 9.55 \\ 11.70 \end{pmatrix}
$$
- Dimension: \(\mathsf{W}\) is 2-dimensional.
- Membership: The vector \(u\) is a linear combination of the two columns of \(A\), so it lies in \(\mathsf{W}\).
The Noise Vector \(z\) (in \(\mathsf{W}^{\perp}\))
$$
z = y - u \approx \begin{pmatrix} 0.09 \\ 0.44 \\ -0.94 \\ 0.41 \\ -0.26 \\ 0.24 \end{pmatrix}
$$
Does \(z\) really live in \(\mathsf{W}^{\perp}\)?
For \(z\) to be in \(\mathsf{W}^{\perp}\), it must be orthogonal to the columns of \(A\).
$$
z \cdot \text{Column}_1(A) \approx 0
$$
$$
z \cdot \text{Column}_2(A) \approx 0
$$
Numerical checks confirm that both dot products are effectively zero (on the order of \(10^{-14}\), i.e., floating-point round-off). Theorem 6.6 holds!
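Here is that verification in code, reusing the same \(A\) and \(y\):

```python
import numpy as np

t = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.04, 3.54, 4.31, 7.81, 9.29, 11.94])
A = np.column_stack((t, np.ones_like(t)))

x0 = np.linalg.solve(A.T @ A, A.T @ y)
u = A @ x0          # the "signal": projection of y onto Col(A)
z = y - u           # the "noise": residual vector

print(z @ A[:, 0])  # effectively zero (floating-point round-off)
print(z @ A[:, 1])  # effectively zero (floating-point round-off)
```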
The Dimensionality of Error
It is worth noting the dimensions here.
- Our total space \(\mathsf{V}\) is 6-dimensional (6 data points).
- Our model space \(\mathsf{W}\) is 2-dimensional (slope and intercept).
- Therefore, our error space \(\mathsf{W}^{\perp}\) is 4-dimensional (\(6-2=4\)).
The vector \(z\) is not random; it lives in a specific 4-dimensional subspace, the null space of \(A^T\). We can even find a basis for this error space (4 vectors, each orthogonal to both columns of \(A\)) and express \(z\) as a unique combination of those basis vectors.
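As a sketch of that last claim, the full SVD of \(A\) hands us an orthonormal basis for \(N(A^T) = \mathsf{W}^{\perp}\), and the coordinates of \(z\) in that basis reconstruct it exactly:

```python
import numpy as np

t = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.04, 3.54, 4.31, 7.81, 9.29, 11.94])
A = np.column_stack((t, np.ones_like(t)))

# The last m - n = 4 columns of U in the full SVD span N(A^T) = W-perp.
U, s, Vt = np.linalg.svd(A)
W_perp_basis = U[:, 2:]                  # 6x4, columns are an orthonormal basis of W-perp

z = y - A @ np.linalg.solve(A.T @ A, A.T @ y)

coords = W_perp_basis.T @ z              # coordinates of z in that basis
print(np.allclose(W_perp_basis @ coords, z))   # True: z really lives in W-perp
```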
Conclusion
Least Squares is not just a formula to memorize. It is a projection. We project our messy, high-dimensional data \(y\) onto the clean, lower-dimensional subspace \(\mathsf{W}\) to find the shadow \(u\) that best represents reality.
References
The theorem numbering in this post follows Linear Algebra (4th Edition) by Friedberg, Insel, and Spence. Some explanations and details here differ from the book.