Exercise 1: In polynomial regression, we are given $n$ points $(x_i,y_i)$ and fit a degree-$m$ polynomial $f(x) = c_0 + c_1 x + c_2 x^2 + \dots + c_m x^m$ so that $f(x_i)\approx y_i$ for all $i=1,\dots,n$.
In this exercise, we want to find the $m+1$ coefficients $(c_j)$ that minimize the least square error $$ \left\| ~~~ \underbrace{ \left[ \begin{array}{ccccc} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{array} \right] }_{X} ~ \underbrace{ \left[ \begin{array}{c} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_m \end{array} \right] }_{c} - \underbrace{ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] }_{y} \right\|^2 $$
- Construct the matrix $X$ from the given vector $x$ below with $m=3$.
- Find the optimal $c$ by solving the least square problem $\min_c \|Xc-y\|^2$ using NumPy (`lstsq` and `solve`). Is the result exact (no error)?
- Plot the obtained polynomial on the scatter plot.
- Redo questions 2-3 by varying the relative sizes of $n$ and $m$. What do you observe?
import numpy as np
n = 30 # Number of sample points
x = (np.random.rand(n)-0.5)*6
y = np.sign(x)*np.maximum(np.abs(x)-1.0,0.0) # soft-threshold function: zero on [-1,1], linear outside
import matplotlib.pyplot as plt
plt.scatter(x,y)
m = 3
X = np.array( [[e**i for i in range(m+1)] for e in x] ) # Construction of the matrix X
# e**i for i=0,1,2,..,m and e going through the coordinates of x
X
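As a side note (not required by the exercise), NumPy ships a built-in helper for this matrix: `np.vander` with `increasing=True` orders the columns as $1, x, x^2, \dots$, matching the construction above. A quick sanity check that the two agree:

```python
import numpy as np

x = (np.random.rand(30) - 0.5) * 6
m = 3

# List-comprehension construction, as in the solution above
X_loop = np.array([[e**i for i in range(m + 1)] for e in x])

# Built-in Vandermonde matrix; increasing=True gives columns 1, x, x^2, ..., x^m
X_vander = np.vander(x, m + 1, increasing=True)

print(np.allclose(X_loop, X_vander))  # → True
```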
res = np.linalg.lstsq(X,y,rcond=None) # we solve the least square problem \min_c \|Xc-y\|^2 (rcond=None avoids a FutureWarning)
res # res is a tuple containing the solution and other stuff (see the doc)
c = res[0] # the solution c is the first element of res
y_regression = X.dot(c) # Xc is the obtained regression value for the points
plt.scatter(x,y,color='r',label='true')
plt.scatter(x,y_regression,color='b',label='regression')
plt.legend()
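The exercise also mentions `solve`. One way to use it (an assumption on the intended approach) is through the normal equations $X^\top X c = X^\top y$, which characterize the least-squares minimizer when $X^\top X$ is invertible; a sketch comparing both routes:

```python
import numpy as np

np.random.seed(0)  # seed chosen for reproducibility (not in the original)
n, m = 30, 3
x = (np.random.rand(n) - 0.5) * 6
y = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)
X = np.vander(x, m + 1, increasing=True)

# Route 1: least-squares solver
c_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Route 2: solve the normal equations X^T X c = X^T y
c_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(c_lstsq, c_solve))  # → True
```

For well-conditioned problems the two coincide; `lstsq` is generally preferred numerically because it avoids forming $X^\top X$, which squares the condition number.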
# To see better what happens, generate 100 points in [-3,3] and compute the regression output with the optimal c
x_lin = np.linspace(-3,3,100)
X_lin = np.array( [[e**i for i in range(m+1)] for e in x_lin] )
y_pred = X_lin.dot(c)
plt.scatter(x,y,color='r',label='true')
plt.plot(x_lin,y_pred,color='b',label='regression') # I use a plot to show a solid line
plt.legend()
x_lin = np.linspace(-3,3,100)
for m in [1,3,5,11]: # repeat the above for several values of m
    X = np.array( [[e**i for i in range(m+1)] for e in x] )
    c = np.linalg.lstsq(X,y,rcond=None)[0]
    X_lin = np.array( [[e**i for i in range(m+1)] for e in x_lin] )
    y_pred = X_lin.dot(c)
    plt.plot(x_lin,y_pred,label='regression m={}'.format(m))
plt.scatter(x,y,color='r',label='true')
plt.legend()
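For the last question, a small numerical check (a sketch, not part of the original solution): when $m+1 \ge n$, the Vandermonde matrix has at least as many columns as rows, and for distinct sample points a degree-$(n-1)$ polynomial interpolates the data exactly, so the residual $\|Xc-y\|$ vanishes up to round-off. For $m+1 < n$ the fit is generally inexact.

```python
import numpy as np

np.random.seed(0)  # seed chosen for reproducibility (not in the original)
n = 8
x = (np.random.rand(n) - 0.5) * 6
y = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

# Degree n-1: the Vandermonde matrix is square and (for distinct x_i)
# invertible, so the fitted polynomial passes through every data point.
m = n - 1
X = np.vander(x, m + 1, increasing=True)
c = np.linalg.lstsq(X, y, rcond=None)[0]
res_interp = np.linalg.norm(X @ c - y)
print(res_interp)  # essentially zero (round-off only)
```

With large $m$ the fit oscillates between the data points (overfitting), which is what the plots above show for $m=11$.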