The geometry of linear regression
What does $\mathbf{Ax} = \mathbf{b}$ really look like? What does it look like to solve an overdetermined problem? Today we'll tackle those questions with geometric insights.
Linear algebra fundamental subspaces
Here I restate the four fundamental subspaces, as described by Gilbert Strang, for an $m \times n$ matrix $\mathbf{A}$:
- $\mathcal{C}(\mathbf{A})$: the column space, in $\mathbb{R}^m$
- $\mathcal{N}(\mathbf{A})$: the nullspace, in $\mathbb{R}^n$
- $\mathcal{C}(\mathbf{A^T})$: the row space, in $\mathbb{R}^n$
- $\mathcal{N}(\mathbf{A^T})$: the left nullspace, in $\mathbb{R}^m$
I will refer to these in the explanations that follow, so a bit of familiarity with linear algebra basics is expected in order to follow this post.
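To make these four subspaces a bit more concrete, here is a minimal NumPy sketch (my own illustration, with a made-up $2 \times 3$ matrix) that prints the dimension of each subspace via the rank:

```python
import numpy as np

# A made-up 2x3 example: m = 2 rows, n = 3 columns
A = np.array([[1., 2., 1.],
              [3., 1., 1.]])
m, n = A.shape
r = np.linalg.matrix_rank(A)

# dim C(A) = dim C(A^T) = rank(A); the nullspaces take up the leftover dimensions
print(f"dim C(A)   = {r}     (lives in R^{m})")
print(f"dim C(A^T) = {r}     (lives in R^{n})")
print(f"dim N(A)   = {n - r} (lives in R^{n})")
print(f"dim N(A^T) = {m - r} (lives in R^{m})")
```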
Solving linear systems when $m = n$
In the case of square matrices ($m=n$), the equation $\mathbf{Ax} = \mathbf{b}$ has exactly one solution for every $\mathbf{b}$ if and only if $\mathbf{A}$ is invertible.
$$ \underbrace{\begin{bmatrix} 1 & 2 & 1 \cr 3 & 1 & 1 \cr 1 & 0 & 1 \cr \end{bmatrix}}_{\mathbf{A}} \underbrace{\begin{bmatrix} x_0 \cr x_1 \cr x_2 \cr \end{bmatrix}}_{\mathbf{x}}= \underbrace{\begin{bmatrix} 3 \cr 2 \cr 1 \cr \end{bmatrix}}_{\mathbf{b}} $$
Geometrically, this equation asks a simple question: how should we combine $\mathbf{A}$'s columns in order to get the vector $\mathbf{b}$? I made a small 3D plot to illustrate the three columns of $\mathbf{A}$ as vectors $a_1, a_2$ and $a_3$ in $\mathbb{R}^{m=n=3}$.
Again, intuitively, solving this equation reduces to finding the linear combination of the columns of $\mathbf{A}$, with coefficients $\mathbf{x}$, that lands on $\mathbf{b}$.
Of course, this can only work for every right-hand side if the columns of $\mathbf{A}$ span the whole space $\mathbb{R}^m$, since $\mathbf{b}$ can be anywhere in $\mathbb{R}^m$. Here, if the $a_i$ were not independent, no linear combination would land on $\mathbf{b}$ (unless $\mathbf{b}$ happens to belong to $\mathcal{C}(\mathbf{A})$, which almost never happens in high dimensions).
$$ x_0 \begin{bmatrix} 1 \cr 3 \cr 1 \cr \end{bmatrix}+ x_1 \begin{bmatrix} 2 \cr 1 \cr 0 \cr \end{bmatrix}+ x_2 \begin{bmatrix} 1 \cr 1 \cr 1 \cr \end{bmatrix}=\mathbf{b} $$
Having the same number of independent equations as unknowns is still a relatively ideal case, solved using the inverse of $\mathbf{A}$, written $\mathbf{A}^{-1}$:
$$ \mathbf{x} = \mathbf{A}^{-1}\mathbf{b} $$
$$ \begin{bmatrix} x_0 \cr x_1 \cr x_2 \cr \end{bmatrix}= \underbrace{\begin{bmatrix} -0.25 & 0.5 & -0.25 \cr 0.5 & 0 & -0.5\cr 0.25 & -0.5 & 1.25 \cr \end{bmatrix}}_{\mathbf{A}^{-1}} \begin{bmatrix} 3 \cr 2 \cr 1 \cr \end{bmatrix}= \begin{bmatrix} 0 \cr 1 \cr 1 \cr \end{bmatrix} $$
For this example, $\mathbf{x} = [0, 1, 1]^T$.
(In practice, you do not want to compute the inverse numerically; look into $\mathbf{LU}$ factorization and forward/back substitution for faster and more stable solutions.)
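For reference, here is a minimal NumPy sketch of the small example above; `np.linalg.solve` relies on an $\mathbf{LU}$ factorization under the hood instead of forming $\mathbf{A}^{-1}$ explicitly:

```python
import numpy as np

A = np.array([[1., 2., 1.],
              [3., 1., 1.],
              [1., 0., 1.]])
b = np.array([3., 2., 1.])

# Preferred: solve the system directly (LU factorization + substitution)
x = np.linalg.solve(A, b)
print(x)  # [0. 1. 1.]

# Also works, but forming the inverse is slower and less stable
x_via_inverse = np.linalg.inv(A) @ b
```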
Solving linear systems when $m > n$
This is where things get a bit more interesting. Let's consider the same equation.
$$ \mathbf{Ax} = \mathbf{b} $$
where $\mathbf{A}$ is now an $m \times n$ matrix, representing $m$ data points (rows) and $n$ unknowns. If we think about rank, a fundamental relation is that the dimensions of the row space and of the column space are equal:
$$\dim({\mathcal{C}(\mathbf{A})})=\dim({\mathcal{C}(\mathbf{A}^T)})=\text{rank}(\mathbf{A})$$
With $m > n$, it does not matter how many rows we have: the rank is bounded above by $n$. Furthermore, we are trying to reach $\mathbf{b}$, which lives in $\mathbb{R}^m$, using the column space. These two observations make it clear that reaching an arbitrary vector of $\mathbb{R}^m$ with at most $n$ independent columns will be troublesome. It might work out in the rare case where $\mathbf{b}$ happens to belong to the column space $\mathcal{C}(\mathbf{A})$, but we cannot count on it.
Since $\mathbf{b}$ can't be reached by any linear combination of the columns of $\mathbf{A}$, we are invited to ask which vector comes closest to $\mathbf{b}$ while still belonging to $\mathcal{C}(\mathbf{A})$. With that vector, we would obtain a solution in some sense. The closest vector to $\mathbf{b}$ in $\mathcal{C}(\mathbf{A})$ is the projection $\mathbf{p}$ of $\mathbf{b}$ onto the subspace $\mathcal{C}(\mathbf{A})$.
Here, by closest I mean the vector that minimizes the Euclidean distance to $\mathbf{b}$. This is also why this method is said to find a solution in the least-squares sense: by finding an $\mathbf{\hat{x}}$ that solves the system for $\mathbf{p}$, $\mathbf{\hat{x}}$ is indeed the solution that minimizes the squared error (the $\mathcal{L}_2$ norm here) between $\mathbf{p} = \mathbf{A\hat{x}}$ and $\mathbf{b}$.
$$ \mathbf{A\hat{x}} = \mathbf{p} $$
Finding a relation for the projection $\mathbf{p}$ is relatively straightforward when thinking with the four fundamental subspaces. We know that the error $\mathbf{b - p}$ left after projecting $\mathbf{b}$ onto $\mathcal{C}(\mathbf{A})$ is orthogonal to $\mathcal{C}(\mathbf{A})$. What else is orthogonal to $\mathcal{C}(\mathbf{A})$? Well, by definition, the left nullspace $\mathcal{N}(\mathbf{A}^T)$. So the vector difference between $\mathbf{b}$ and $\mathbf{p}$ (the dotted white line in the plot) must belong to $\mathcal{N}(\mathbf{A}^T)$. If $\mathbf{b - p}$ belongs to the left nullspace, then it satisfies the following system of equations:
$$ \mathbf{A}^T(\mathbf{b - p}) = \mathbf{0} $$
With a bit of rearranging, and substituting $\mathbf{p} = \mathbf{A\hat{x}}$, we can get to the final solution for $\mathbf{\hat{x}}$.
$$ \begin{aligned} \mathbf{A}^T(\mathbf{b - A\hat{x}}) &= 0 \cr \mathbf{A}^T\mathbf{A\hat{x}} &= \mathbf{A}^T\mathbf{b} \cr \mathbf{\hat{x}} &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b} \cr \end{aligned} $$
The least-squares solution to an overdetermined system is then: $$\boxed{\mathbf{\hat{x}} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}}$$ (provided the columns of $\mathbf{A}$ are independent, so that $\mathbf{A}^T\mathbf{A}$ is invertible).
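As a quick numerical sanity check, here is a small NumPy sketch on made-up data comparing the normal-equations formula above with `np.linalg.lstsq`, which computes the same least-squares solution through a more stable SVD-based routine:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                       # overdetermined: more equations than unknowns
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Normal equations: x_hat = (A^T A)^{-1} A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Same solution via a dedicated least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```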
Example: estimating the parameters of an ellipse
Time for a real-life example. Suppose we are given $m=1000$ noisy measurements of points supposed to lie on an unknown ellipse, which can have any rotation or eccentricity. Each point $i$ consists of an $x_i$ and a $y_i$ position.
The quadratic form of a generic ellipse can be written like so:
$$\underbrace{\left(\frac{\cos^2{\alpha}}{a^2}+\frac{\sin^2{\alpha}}{b^2}\right)}_{C_0}x^2 +\underbrace{\sin{2\alpha}\left(\frac{1}{a^2}-\frac{1}{b^2}\right)}_{C_1}xy +\underbrace{\left(\frac{\sin^2{\alpha}}{a^2}+\frac{\cos^2{\alpha}}{b^2}\right)}_{C_2} y^2=1$$
The actual data I plotted were generated by computing the perfect ellipse with the following parameters and adding Gaussian noise of unit variance to each component.
$$ \begin{bmatrix} \alpha \cr a \cr b \cr \end{bmatrix}= \begin{bmatrix} \pi/6 \cr 2 \cr 8\cr \end{bmatrix} $$
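The original data-generation code is not shown here, but a minimal sketch of what it could look like is below: sample points on the perfect ellipse, rotate them by $\alpha$, and add unit-variance Gaussian noise to each component (the variable names are mine).

```python
import numpy as np

rng = np.random.default_rng(42)

m = 1000
alpha, a, b = np.pi / 6, 2.0, 8.0

# Points on the unrotated ellipse, then rotated by alpha and corrupted with noise
t = rng.uniform(0.0, 2.0 * np.pi, size=m)
u, v = a * np.cos(t), b * np.sin(t)
x = u * np.cos(alpha) - v * np.sin(alpha) + rng.normal(scale=1.0, size=m)
y = u * np.sin(alpha) + v * np.cos(alpha) + rng.normal(scale=1.0, size=m)
```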
I will use what we now know to estimate the $3$ parameters $a, b$ and $\alpha$ from the measurements using the least-squares method.
First thing to observe: although this form is quadratic in $x$ and $y$, it is a linear combination in the coefficients $C_0, C_1$ and $C_2$.
$$C_0x^2+C_1xy+C_2y^2=1$$
This tells us that we should be able to write our problem in the matrix form $\mathbf{Ax} = \mathbf{b}$, like so:
$$ \begin{bmatrix} x^2_0 & x_0y_0 & y^2_0 \cr x^2_1 & x_1y_1 & y^2_1 \cr \vdots & \vdots & \vdots \cr x^2_{m-1} & x_{m-1}y_{m-1} & y^2_{m-1} \cr \end{bmatrix} \begin{bmatrix} \hat{C_0} \cr \hat{C_1} \cr \hat{C_2} \cr \end{bmatrix}= \begin{bmatrix} 1 \cr 1 \cr \vdots \cr 1 \cr \end{bmatrix} $$
Because every point has measurement error, the vector $\mathbf{b}$ is almost guaranteed not to lie in the column space $\mathcal{C}(\mathbf{A})$ of $\mathbf{A}$. This means that there is no exact solution $\mathbf{x}$ for this equation. This is where the least-squares method comes in: by projecting $\mathbf{b}$ onto $\mathcal{C}(\mathbf{A})$, we solve $\mathbf{A\hat{x}} = \mathbf{p}$.
Since we have already explained this concept geometrically, I will simply apply the formula we found.
$$ \mathbf{\hat{x}} = \mathbf{(A^TA)^{-1}A^Tb} $$
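In code, assuming the noisy measurements live in the arrays `x` and `y` from the sketch above, building the design matrix and applying the formula takes only a few lines (`np.linalg.lstsq` would be the more stable alternative):

```python
import numpy as np

# One row [x_i^2, x_i*y_i, y_i^2] per measurement
A = np.column_stack([x**2, x * y, y**2])
b_vec = np.ones(len(x))

# Normal equations, exactly as derived above
C_hat = np.linalg.solve(A.T @ A, A.T @ b_vec)
print(C_hat)  # estimates of [C0, C1, C2]
```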
Actually doing the computation gives us these values:
$$ \begin{bmatrix} \hat{C_0} \cr \hat{C_1} \cr \hat{C_2} \cr \end{bmatrix}= \begin{bmatrix} 0.191 \cr 0.202 \cr 0.074\cr \end{bmatrix} $$
We are not quite done yet, as we still need to recover the more interpretable parameters $a, b$ and $\alpha$. Because this is not the core subject of this post, the details of the parameter derivation are available in the addendum below.
$$ \begin{bmatrix} \hat{\alpha} \cr \hat{a} \cr \hat{b} \cr \end{bmatrix}= \begin{bmatrix} 0.523 \cr 2.143 \cr 8.013\cr \end{bmatrix} $$
The estimated values look pretty good; let's draw the estimated (red) and true (green) ellipses for fun!
And that closes this post on the geometry of linear regression. I hope you enjoyed the read, and see you next time.
Addendum
Getting $\alpha$ back from $C_0$, $C_1$ and $C_2$.
$$ \begin{aligned} C_0 - C_2 &= \frac{1}{a^2}(\cos^2{\alpha}-\sin^2{\alpha})-\frac{1}{b^2}(\cos^2{\alpha}-\sin^2{\alpha})\cr &= \cos{2\alpha}\left(\frac{1}{a^2}-\frac{1}{b^2}\right) \cr \frac{C_1}{C_0 - C_2} &= \frac{\sin{2\alpha}}{\cos{2\alpha}}= \tan{2\alpha} \cr \end{aligned} $$
$$\boxed{\alpha = \frac{1}{2}\arctan{\left(\frac{C_1}{C_0-C_2}\right)}}$$
Then $a$ and $b$.
$$ \begin{aligned} C_0 + C_2 &= \frac{1}{a^2}(\cos^2{\alpha}+\sin^2{\alpha})+\frac{1}{b^2}(\cos^2{\alpha}+\sin^2{\alpha})\cr &= \frac{1}{a^2}+\frac{1}{b^2}\cr \end{aligned} $$
Let $k = \sin{2\alpha}$, then $C_1=k\left(\frac{1}{a^2}-\frac{1}{b^2}\right)$.
$$ \begin{aligned} C_1 - k (C_0 + C_2) &= -\frac{2k}{b^2} \cr C_1 + k (C_0 + C_2) &= +\frac{2k}{a^2} \cr \end{aligned} $$
$$ \boxed{b = \sqrt{\frac{-2k}{C_1-k(C_0+C_2)}}} \text{ and } \boxed{a = \sqrt{\frac{2k}{C_1+k(C_0+C_2)}}} $$
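Translated into code, and assuming the estimated coefficients sit in the `C_hat` array from the fit above, the recovery of $\hat{\alpha}$, $\hat{a}$ and $\hat{b}$ reads:

```python
import numpy as np

C0, C1, C2 = C_hat

# alpha from the boxed arctangent formula
alpha_hat = 0.5 * np.arctan(C1 / (C0 - C2))

# then a and b, with k = sin(2*alpha)
k = np.sin(2.0 * alpha_hat)
a_hat = np.sqrt( 2.0 * k / (C1 + k * (C0 + C2)))
b_hat = np.sqrt(-2.0 * k / (C1 - k * (C0 + C2)))

print(alpha_hat, a_hat, b_hat)  # the post reports roughly 0.523, 2.143, 8.013
```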