One of the most common applications of linear algebra is the least squares problem of finding a solution to an overdetermined linear system. Overdetermined systems are ones with more equations than unknowns; with that much data the system is usually inconsistent. In these cases, we have the matrix equation
$$ A\mathbf{x} = \mathbf{b} $$
with $A\in\mathbb{R}^{m\times n}$ and $m \gt n$. Since this system is typically inconsistent, we will not in general be able to find $\mathbf{x}\in\mathbb{R}^n$ that solves $A\mathbf{x}=\mathbf{b}$, but we can find the least squares solution $\hat{\mathbf{x}}$, the vector that comes closest to solving the system. The way we measure the "closeness" of $A\hat{\mathbf{x}}$ to $\mathbf{b}$ is by defining a residual
$$ r(\mathbf{x}) = \mathbf{b} - A\mathbf{x} $$
and computing the distance between the vectors $\mathbf{b}$ and $A\mathbf{x}$ as the norm of the residual
$$ \|r(\mathbf{x})\| = \left\| \mathbf{b} - A\mathbf{x} \right\| $$
The least squares solution $\hat{\mathbf{x}}$ of $A\mathbf{x} = \mathbf{b}$ is the vector $\mathbf{x}\in\mathbb{R}^n$ that minimizes the norm of the residual $\|r(\mathbf{x})\|$. For convenience, we note that minimizing $\|r(\mathbf{x})\|$ is equivalent to minimizing $\|r(\mathbf{x})\|^2$, so our theorems will focus on minimizing the latter.
Theorem 3.8.1
The Least Squares Solution of a Linear System Exists
Let $S$ be a subspace of $\mathbb{R}^m$. For each $\mathbf{b}\in\mathbb{R}^m$, there is a unique element $\mathbf{p}\in S$ which is closest to $\mathbf{b}$; that is,
$$ \left\| \mathbf{b} - \mathbf{y} \right\| \gt \left\| \mathbf{b} - \mathbf{p} \right\| $$
for any $\mathbf{y}\neq\mathbf{p}$ in $S$. Additionally, $\mathbf{p}\in S$ is closest to a given vector $\mathbf{b}\in\mathbb{R}^m$ if and only if $\mathbf{b}-\mathbf{p}\in S^\perp$.
$(\Leftarrow)$ Since $S$ is a subspace of $\mathbb{R}^m$, we can write $\mathbb{R}^m$ as a direct sum of $S$ and its orthogonal complement $S^\perp$; $\mathbb{R}^m = S\oplus S^\perp$. Therefore each element $\mathbf{b}\in\mathbb{R}^m$ may be expressed uniquely as a sum
$$ \mathbf{b} = \mathbf{p} + \mathbf{z} $$
where $\mathbf{p}\in S$ and $\mathbf{z}\in S^\perp$. If $\mathbf{y} \neq \mathbf{p}$ is in $S$, then
$$ \left\| \mathbf{b} - \mathbf{y} \right\|^2 = \left\| (\mathbf{b} - \mathbf{p}) + (\mathbf{p} - \mathbf{y}) \right\|^2 $$
Vector $\mathbf{p} - \mathbf{y} \in S$ and vector $\mathbf{b} - \mathbf{p}\in S^\perp$ are orthogonal, so by the Pythagorean law the above equation becomes
$$ \left\| \mathbf{b} - \mathbf{y} \right\|^2 = \left\| \mathbf{b} - \mathbf{p}\right\|^2 + \left\| \mathbf{p} - \mathbf{y} \right\|^2 $$
Since $\mathbf{y}\neq\mathbf{p}$, the term $\left\|\mathbf{p}-\mathbf{y}\right\|^2$ is strictly positive; dropping it and taking square roots produces the strict inequality
$$ \left\| \mathbf{b} - \mathbf{y} \right\| \gt \left\| \mathbf{b} - \mathbf{p}\right\| $$
This shows that if $\mathbf{p}\in S$ satisfies $\mathbf{b}-\mathbf{p}\in S^\perp$, then $\mathbf{p}$ is the unique element of the subspace $S$ closest to $\mathbf{b}$.
$(\Rightarrow)$ Conversely, suppose $\mathbf{q}\in S$ and $\mathbf{b}-\mathbf{q}\notin S^\perp$. Then $\mathbf{q}\neq \mathbf{p}$, and repeating the previous argument with $\mathbf{y} = \mathbf{q}$ yields
$$ \left\| \mathbf{b} - \mathbf{q} \right\| \gt \left\| \mathbf{b} - \mathbf{p}\right\| $$
so $\mathbf{q}$ is not the element of $S$ closest to $\mathbf{b}$. Hence the closest element must satisfy $\mathbf{b}-\mathbf{p}\in S^\perp$.
∎
In rare cases $\mathbf{b}\in S$, which allows us to find an exact solution $\mathbf{x}$ to $A\mathbf{x}=\mathbf{b}$. In this special case, writing
$$ \mathbf{b} = \mathbf{p} + \mathbf{z} $$
for $\mathbf{p}\in S$ and $\mathbf{z}\in S^\perp$, the uniqueness of direct sum representations forces $\mathbf{z}=\mathbf{0}$, so
$$ \mathbf{b} = \mathbf{p} + \mathbf{0} = \mathbf{p} $$
The least squares solution $\hat{\mathbf{x}}$ to the least squares problem $A\mathbf{x} = \mathbf{b}$ is characterized by the condition that $\mathbf{p} = A\hat{\mathbf{x}}$ is the vector in the column space $C(A)$ closest to $\mathbf{b}$. What this means geometrically is that we project $\mathbf{b}$ onto $C(A)$ to get $\mathbf{p}$, and then solve $A\hat{\mathbf{x}} = \mathbf{p}$ for $\hat{\mathbf{x}}$. By Theorem 3.8.1,
$$ \mathbf{b} - \mathbf{p} = \mathbf{b} - A\hat{\mathbf{x}} = r(\hat{\mathbf{x}}) $$
which must be in $C(A)^\perp$. Hence, $\hat{\mathbf{x}}$ is the least squares solution if and only if $r(\hat{\mathbf{x}})\in C(A)^\perp$.
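This orthogonality condition is easy to check numerically. A small sketch (using NumPy's `np.linalg.lstsq` and the same made-up system as above) confirms that $A^T r(\hat{\mathbf{x}})\approx\mathbf{0}$, i.e. the residual at the least squares solution is orthogonal to every column of $A$:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)  # least squares solution
r = b - A @ x_hat                              # residual at the solution

print(A.T @ r)   # entries are ~0: r(x_hat) lies in C(A)-perp
```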
For a vector $\mathbf{b}\in\mathbb{R}^2$ and a one-dimensional column space, we can picture this as dropping a perpendicular from $\mathbf{b}$ onto the line $C(A)$; the foot of that perpendicular is the projection $\mathbf{p} = A\hat{\mathbf{x}}$.
We will derive the normal equations by proving two short theorems: first, that $N\left(A^T A\right) = N(A)$; second, that when $A$ has full column rank, the equation $A^T A\mathbf{x} = A^T\mathbf{b}$ has a unique solution $\hat{\mathbf{x}}$.
Theorem 3.8.2
The Null Space of $A^T A$ is the same as the Null Space of $A$.
$$ N\left(A^T A\right) = N(A) $$
$(\Rightarrow)$ Suppose $\mathbf{x}\in N\left(A^T A\right)$, then
$$ A^T A\mathbf{x} = \mathbf{0} $$
This means that $\mathbf{x}$ is orthogonal to each row of $A^T A$ (and, since $A^T A$ is symmetric, to each of its columns as well). Multiplying both sides of the equation on the left by $\mathbf{x}^T$, and recalling that $\mathbf{x}^T\mathbf{0} = 0$, we write
$$ \mathbf{x}^T\mathbf{0} = \mathbf{x}^T \left(A^T A\mathbf{x}\right) = \left(\mathbf{x}^T A^T\right) A\mathbf{x} = \left(A\mathbf{x}\right)^T A\mathbf{x} = \left\langle A\mathbf{x}, A\mathbf{x} \right\rangle = 0 $$
So $A\mathbf{x} = \mathbf{0}$, since $\left\langle A\mathbf{x}, A\mathbf{x} \right\rangle = \left\|A\mathbf{x}\right\|^2 = 0$ only when $A\mathbf{x} = \mathbf{0}$. Hence, $\mathbf{x}\in N(A)$.
$(\Leftarrow)$ Suppose $\mathbf{x}\in N(A)$, then $A\mathbf{x} = \mathbf{0}$ and
$$ A^T A\mathbf{x} = A^T\left(A\mathbf{x}\right) = A^T\mathbf{0} = \mathbf{0} $$
so $\mathbf{x}\in N\left(A^T A\right)$.
∎
Theorem 3.8.3
The Normal Equations Yield a Unique Solution to the Least Squares Problem
If $A\in\mathbb{R}^{m\times n}$ and $\,\text{rank}(A)=n$, the normal equations
$$ A^T A\mathbf{x} = A^T\mathbf{b} $$
have a unique solution
$$ \hat{\mathbf{x}} = \left(A^T A\right)^{-1}A^T\mathbf{b} $$
and $\, \hat{\mathbf{x}}$ uniquely solves the least squares problem $A\mathbf{x} = \mathbf{b}$.
Since $A\in\mathbb{R}^{m\times n}$ has rank $n$, $N(A) = \left\{\mathbf{0}\right\}$. By the previous theorem, $N\left(A^T A\right) = N(A) = \left\{\mathbf{0}\right\}$. With a trivial null space, the square matrix $A^T A\in\mathbb{R}^{n\times n}$ is invertible, and its nonhomogeneous systems have unique solutions. In particular, the normal equations
$$ A^T A\mathbf{x} = A^T\mathbf{b} $$
have the unique solution given by
$$ \hat{\mathbf{x}} = \left(A^T A\right)^{-1}A^T\mathbf{b} $$
which is the unique least squares solution: since $A^T\left(\mathbf{b} - A\hat{\mathbf{x}}\right) = \mathbf{0}$, the residual $r(\hat{\mathbf{x}})$ lies in $N\left(A^T\right) = C(A)^\perp$, which by Theorem 3.8.1 is exactly the condition for $A\hat{\mathbf{x}}$ to be the vector in $C(A)$ closest to $\mathbf{b}$.
∎
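As a sanity check on Theorem 3.8.3, the sketch below (with illustrative data, not from the text) forms and solves the normal equations for a full column rank $A$, and compares the answer with NumPy's built-in least squares routine; in practice `np.linalg.lstsq` (or a QR factorization) is preferred over explicitly inverting $A^T A$.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])           # rank(A) = n = 2
b = np.array([0.1, 0.9, 2.1, 2.9])

# Normal equations  A^T A x = A^T b  (A^T A is invertible since rank(A) = n).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# The library least squares routine returns the same x_hat.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal)
print(x_lstsq)
```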
The context where you have likely encountered least squares problems before is finding a line of best fit for a data set.
$$
\begin{array}{c|r|r|r|r|r|r}
x & x_1 & x_2 & \dots & x_j & \dots & x_m \\
\hline
y & y_1 & y_2 & \dots & y_j & \dots & y_m
\end{array}
$$
Given a set of values $\left\{(x_j,y_j)\,:\,1\le j\le m\right\}$, we seek a line of the form $y = a + bx$ that best represents the data set. To do this, we make a system of equations out of the set
$$
\begin{align*}
a + bx_1 &= y_1 \\
a + bx_2 &= y_2 \\
\vdots \\
a + bx_j &= y_j \\
\vdots \\
a + bx_m &= y_m \\
\end{align*}
$$
for $j=1,2,\ldots,m$. This yields an $m\times 2$ inconsistent linear system
$$
A\mathbf{c} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_j \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_j \\ \vdots \\ y_m \end{bmatrix} = \mathbf{y},
$$
where $A$ is the linear operator, $\mathbf{c}$ is the vector of coefficients $\begin{bmatrix} a \\ b \end{bmatrix}$, and $\mathbf{y}$ is the vector of outputs $y_j$ for each $x_j$ in our table. Notice that the vector $\mathbf{c}$ is NOT a vector in $\mathbb{R}^2$; it is a vector in $P_2$, the vector space of polynomials of degree less than 2. The first coordinate of $\mathbf{c}$ is the $y$-intercept and the second coordinate is the slope.
One solves the least squares problem by obtaining the normal equations; that is, multiply both sides of the equation on the left by $A^T$:
$$
A^TA\mathbf{c} = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_m \end{bmatrix}\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_m \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = A^T\mathbf{y}
$$
Notice that the entries of $A^T A$ and $A^T\mathbf{y}$ are simply sums computed from the data, so the normal equations reduce to the $2\times 2$ system
$$
\begin{bmatrix} m & \displaystyle\sum_{j=1}^m x_j \\ \displaystyle\sum_{j=1}^m x_j & \displaystyle\sum_{j=1}^m x_j^2 \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} =
\begin{bmatrix} \displaystyle\sum_{j=1}^m y_j \\ \displaystyle\sum_{j=1}^m x_j y_j \end{bmatrix}
$$
If you know that your least squares line must pass through the origin, that is, the $y$-intercept must be zero, then the model is simply $y = ax$ and one obtains the linear system
$$
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}\begin{bmatrix} a \end{bmatrix} =
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
$$
Multiplying both sides on the left by $A^T = \begin{bmatrix} x_1 & x_2 & \ldots & x_m \end{bmatrix}$ gives the single normal equation
$$
\left(\displaystyle\sum_{i=1}^m x_i^2\right)a = \begin{bmatrix} x_1 & x_2 & \ldots & x_m \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}\begin{bmatrix} a \end{bmatrix} =
\begin{bmatrix} x_1 & x_2 & \ldots & x_m \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \displaystyle\sum_{i=1}^m x_iy_i
$$
A common laboratory exercise for physics students is using Hooke's Law to determine the spring constant of a spring experimentally. The law states that the force exerted by a spring is proportional to, and opposite in direction from, its displacement from its equilibrium (resting, neither stretched nor compressed) length, and is given by
$$ F = -kx $$
where $F$ is the force exerted by the spring, $x$ is the displacement, and $k$ is a constant of proportionality called the spring constant.
Suppose some freshmen physics students have compiled the following data
Displacement $x$ (m) | Force $F(x)$ (N) |
---|---|
0.020 m | 4.1 N |
0.030 m | 6.0 N |
0.040 m | 7.9 N |
0.050 m | 9.8 N |
To determine $k$, we work with the magnitudes of the forces (dropping the minus sign) for simplicity and construct the normal equation
$$ \begin{bmatrix} 0.02 & 0.03 & 0.04 & 0.05 \end{bmatrix}\begin{bmatrix} 0.02 \\ 0.03 \\ 0.04 \\ 0.05 \end{bmatrix}\begin{bmatrix} k \end{bmatrix} =
\begin{bmatrix} 0.02 & 0.03 & 0.04 & 0.05 \end{bmatrix}\begin{bmatrix} 4.1 \\ 6.0 \\ 7.9 \\ 9.8 \end{bmatrix} $$
This simplifies to
$$ 0.0054k = 1.068 $$
so $k\approx 197.\bar{7}\ \text{N}/\text{m}$, and those students should report $k = 2.0\times 10^2\ \text{N}/\text{m}$ to the appropriate number of significant figures.
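These numbers are easy to reproduce. A short NumPy check of the arithmetic, using the displacements and forces from the table:

```python
import numpy as np

x = np.array([0.02, 0.03, 0.04, 0.05])   # displacements (m)
F = np.array([4.1, 6.0, 7.9, 9.8])       # force magnitudes (N)

# Single normal equation: (sum of x_i^2) * k = sum of x_i * F_i
k = (x @ F) / (x @ x)
print(x @ x, x @ F, k)   # 0.0054, 1.068, k ≈ 197.8 N/m
```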
Lines of best fit may also be used to fit market data for economic analysis. Suppose data was collected showing the demand for gasoline at different costs per gallon. The collected gas price data are
Price per gallon $p$ (\$) | Demand $x(p)$ (in millions of gallons) |
---|---|
\$2.75 | 500 |
\$2.50 | 620 |
\$2.25 | 710 |
\$2.00 | 810 |
\$1.75 | 930 |
\$1.50 | 1050 |
Here it is not reasonable to assume the line passes through the origin (a zero intercept), since there is likely a minimum gas price based on the cost of producing and delivering the gasoline. First one plots the data to get a sense of its shape.
Next we decide how to model our data. In this case, as with Hooke's Law, we decide that a first order polynomial (a line) best fits the data points. If we can find such a line
$$
x = a + bp,
$$
then we would have
$$
\begin{align*}
a + 2.75b &= 500 \\
a + 2.50b &= 620 \\
a + 2.25b &= 710 \\
a + 2.00b &= 810 \\
a + 1.75b &= 930 \\
a + 1.50b &= 1050
\end{align*}
$$
So we will construct our inconsistent system of equations using the relationship $A\mathbf{c}=\mathbf{d}$, where
$$
\begin{align*}
\mathbf{c} &= \begin{bmatrix} a \\ b \end{bmatrix}\text{ is the vector of coefficients for our line,} \\
\mathbf{p} &=\text{ the vector of gas prices, and } \\
\mathbf{d} &=\text{ the vector of resulting sales in millions of gallons (the demand).}
\end{align*}
$$
Notice that the vector $\mathbf{c}$ is not a vector in $\mathbb{R}^2$; it is a vector in $P_2$, $\mathbf{c}\in P_2$. The polynomial in $P_2$ that we seek is the line $x(p) = a + bp$, with slope $b$ and intercept $a$, that best fits the data points in our table.
$$
A\mathbf{c} = \begin{bmatrix} 1 & 2.75 \\ 1 & 2.50 \\ 1 & 2.25 \\ 1 & 2.00 \\ 1 & 1.75 \\ 1 & 1.50 \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 500 \\ 620 \\ 710 \\ 810 \\ 930 \\ 1050 \end{bmatrix} = \mathbf{d}
$$
The least squares solution is
$$
\hat{\mathbf{c}} = \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 1688 \\ -432 \end{bmatrix}
$$
so the best-fit line is $x(p) = 1688 - 432p$.
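The coefficients above can be reproduced with NumPy by forming and solving the normal equations for this $6\times 2$ system (or by calling `np.linalg.lstsq` directly):

```python
import numpy as np

p = np.array([2.75, 2.50, 2.25, 2.00, 1.75, 1.50])    # price per gallon ($)
d = np.array([500., 620., 710., 810., 930., 1050.])   # demand (millions of gallons)

A = np.column_stack([np.ones_like(p), p])              # columns: 1 and p
c_hat = np.linalg.solve(A.T @ A, A.T @ d)              # normal equations

print(c_hat)   # approximately [1688., -432.], i.e. x(p) = 1688 - 432 p
```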
Another application of the least squares problem is finding a polynomial that represents a data set. Suppose that we have a data set $\left\{(x_i,y_i)\right\}$ composed of $n+1$ values. We may, by solving a particular least squares problem, find a polynomial $\mathbf{p}$ of degree $n$ or less, $\mathbf{p}\in P_{n+1}$, that passes through each of these points. This polynomial allows us to interpolate the data set, that is, to estimate values between the specific data points in $\left\{(x_i,y_i)\right\}$. The resulting polynomial $\mathbf{p}\in P_{n+1}$ is called the interpolating polynomial.
It is important to be careful when doing these kinds of interpolations. The best representation of $n+1$ data points will not necessarily be a polynomial of degree $n$; in fact, this is rarely the case. We tend to use a low-order polynomial. Look at Example 1 and Example 2: these are both actually polynomial interpolation problems, we just chose to use a line (or, technically, an affine function), which is a polynomial of degree $1$.
To construct the least squares problem for polynomial interpolation, we use the Vandermonde matrix
$$ V = \begin{bmatrix} x_1^n & x_1^{n-1} & \ldots & x_1 & 1 \\ x_2^n & x_2^{n-1} & \ldots & x_2 & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\ x_m^n & x_m^{n-1} & \ldots & x_m & 1 \end{bmatrix} $$
Each $x_j$ comes from the data set, and the matrix $V$ takes the place of $A$ in the normal equations for polynomial interpolation.
We will solve two least squares problems on the same data set, finding best-fit polynomials of degree $2$ and of degree $5$. Consider the data set
$$
\begin{array}{c|r|r|r|r|r|r|r|r|r} x & -1.95 & 0.22 & 0.88 & 1.16 & 1.25 & 4.48 & 4.91 & 5.96 & 7.16 \\\hline
y & -1.96 & -1.76 & -1.62 & -1.25 & -0.51 & 2.08 & 3.24 & 3.53 & 3.71 \end{array}
$$
For the first problem, we fit the data set using the polynomial $ c_0 + c_1 x + c_2 x^2 $, finding the least squares solution of the system
$$ \begin{bmatrix} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\
\vdots & \vdots & \vdots \\ x_m^2 & x_m & 1 \end{bmatrix}\begin{bmatrix} c_2 \\ c_1 \\ c_0 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} $$
For the second version, we use the polynomial $ c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5 $ whose system is very close in form
$$ \begin{bmatrix} x_1^5 & x_1^4 & \ldots & x_1 & 1 \\ x_2^5 & x_2^4 & \ldots & x_2 & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\ x_m^5 & x_m^4 & \ldots & x_m & 1 \end{bmatrix}\begin{bmatrix} c_5 \\ c_4 \\ \vdots \\ c_1 \\ c_0 \end{bmatrix} =
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} $$
We solve both systems and plot the two fitted polynomials on the same axes, together with the data, for comparison.
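One way to carry out both fits in NumPy is with `np.vander`, which builds exactly the matrix of powers above (highest power first), and `np.linalg.lstsq`; `np.polyfit` wraps the same computation.

```python
import numpy as np

x = np.array([-1.95, 0.22, 0.88, 1.16, 1.25, 4.48, 4.91, 5.96, 7.16])
y = np.array([-1.96, -1.76, -1.62, -1.25, -0.51, 2.08, 3.24, 3.53, 3.71])

for degree in (2, 5):
    V = np.vander(x, degree + 1)                # columns x^degree, ..., x, 1
    c, *_ = np.linalg.lstsq(V, y, rcond=None)   # coefficients, highest power first
    print(degree, c)

# np.polyfit(x, y, deg) performs the same least squares fit.
```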
Least squares problems come in many forms, and we can use other familiar equations to construct them. For example, suppose we have a data set $\left\{(x_i,y_i)\,:\,1\le i\le m\right\}$ where the points roughly trace out a circle. We can use the equation of a circle
$$
(x-c_1)^2 + (y-c_2)^2 = r^2
$$
with our data points $\left\{(x_i,y_i)\right\}$ plugged in to construct a least squares problem, after a little bit of algebra:
$$
\begin{align*}
(x-c_1)^2 + (y-c_2)^2 &= r^2 \\
\\
x^2 -2xc_1 + c_1^2 + y^2 - 2yc_2 + c_2^2 &= r^2 \\
\\
x^2 + y^2 &= 2xc_1 + 2yc_2 + \left(r^2 - c_1^2 - c_2^2\right)
\end{align*}
$$
By setting $ c_3 = r^2 - c_1^2 - c_2^2 $, we can form a linear system based on
$$
2xc_1 + 2yc_2 + c_3 = x^2 + y^2
$$
which looks like
$$
A\mathbf{c} = \begin{bmatrix} 2x_1 & 2y_1 & 1 \\ 2x_2 & 2y_2 & 1 \\ \vdots & \vdots & \vdots \\ 2x_m & 2y_m & 1 \end{bmatrix}\begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix} = \begin{bmatrix} x_1^2 + y_1^2 \\ x_2^2 + y_2^2 \\ \vdots \\ x_m^2 + y_m^2 \end{bmatrix} = \mathbf{b}
$$
The least squares solution gives us $\mathbf{c} = \left[ c_1\ c_2\ c_3 \right]^T$; we interpret $(c_1,c_2)$ as the center of the circle, and
$$
r = \sqrt{c_1^2 + c_2^2 + c_3}
$$
yields the radius.
If we use the data set
$$
\begin{array}{c|r|r|r|r|r|r|r|r|r|r} x & 8.03 & 6.86 & 5.99 & 5.83 & 4.73 & 4.02 & 4.81 & 5.41 & 5.71 & 7.77 \\\hline
y & 3.06 & 4.48 & 4.98 & 5.04 & 4.54 & 2.74 & 1.37 & 1.13 & 1.02 & 2.01 \end{array}
$$
and solve the least squares problem, we have the information required to determine the center and radius of the approximating circle, which can then be plotted together with the data.
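A NumPy sketch of this computation, using the data above (the resulting center and radius are not quoted in the text, so the printout is simply whatever the data produce):

```python
import numpy as np

x = np.array([8.03, 6.86, 5.99, 5.83, 4.73, 4.02, 4.81, 5.41, 5.71, 7.77])
y = np.array([3.06, 4.48, 4.98, 5.04, 4.54, 2.74, 1.37, 1.13, 1.02, 2.01])

A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])   # columns: 2x, 2y, 1
b = x**2 + y**2

c1, c2, c3 = np.linalg.lstsq(A, b, rcond=None)[0]
r = np.sqrt(c1**2 + c2**2 + c3)

print((c1, c2), r)   # estimated center and radius of the approximating circle
```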