Guest course Dec. 7, 2022
Structure identification: from nonsmooth optimization to data science
Franck Iutzeler
The goal of this course/lab is to explore the notion of structure identification in nonsmooth optimization and its applications in data science.
The proximity operator is defined for a function $g:\mathbb{R}^n\to\mathbb{R}\cup \{+\infty\}$ and a point $x\in\mathbb{R}^n$ as the mapping $$ \mathbf{prox}_{\gamma g}(x) = \arg\min_{u\in\mathbb{R}^n} \left\{ g(u) + \frac{1}{2\gamma}\| x - u \|^2 \right\}. $$
Let us first consider the absolute value function $g = |\cdot|$ on $\mathbb{R}$ and plot the function $$ u \mapsto s_g(u; \gamma,x) = |u| + \frac{1}{2\gamma}\| x - u \|^2 $$ for different values of $\gamma$ and $x$.
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()
from bokeh.layouts import column,row
from bokeh.models import CustomJS, ColumnDataSource, Slider
# Set up data
import numpy as np
N = 200
u = np.linspace(-4, 4, N)
p = np.abs(u)+0.5*np.power(u,2)
source = ColumnDataSource(data=dict(u=u, p=p))
opt = ColumnDataSource(data=dict(u=[0.0], p=[0.0]))
# Set up plot
plot = figure(height=400, width=400, title="stability of minimizers",
x_range=[-4, 4], y_range=[-0.5, 12.5])
plot.line('u', 'p', source=source, line_width=3, line_alpha=0.6,legend_label="s_g")
plot.circle('u', 'p', source=opt, size = 5, color="black",legend_label=r"minimizer = prox_{\gamma g}(x)")
plot.legend.location = "top_left"
slider_gamma = Slider(start=0.01, end=4, value=1, step=.01, title=r"stepsize $$\gamma$$")
slider_x = Slider(start=-4.0, end=4.0, value=0.0, step=.1, title="input point $$x$$")
update_curve = CustomJS(args=dict(source=source, gamma=slider_gamma, xslider=slider_x ), code="""
const g = gamma.value
const x = xslider.value
const u = source.data.u
const p = Array.from(u, (u) => Math.abs(u) + 0.5*Math.pow(u-x, 2)/g)
source.data = { u, p }
""")
slider_gamma.js_on_change('value', update_curve)
slider_x.js_on_change('value', update_curve)
update_opt = CustomJS(args=dict(opt=opt, gamma=slider_gamma, xslider=slider_x ), code="""
const g = gamma.value
const x = xslider.value
const p = opt.data.p
const u = Array.from(p, (p) => Math.sign(x)*Math.max(0,Math.abs(x)-g))
opt.data = { u , p }
""")
slider_gamma.js_on_change('value', update_opt)
slider_x.js_on_change('value', update_opt)
show(row(column(slider_gamma,slider_x), plot))
1. Show that $$ \mathbf{prox}_{\gamma g}(x) = \left\{ \begin{array}{ll} x-\gamma & \text{if } x\geq \gamma \\ x+\gamma & \text{if } x\leq -\gamma \\ 0 & \text{if } -\gamma \leq x \leq \gamma \end{array} \right. $$ Notice how the points of non-differentiability of $g$ trap the minimizers of the problem.
This is a typical example of how nonsmoothness can be used to promote certain patterns in the solutions of minimization problems.
Let us now consider the $\ell_1$ norm, i.e. $g: x \mapsto \|x\|_1 = \sum_{i=1}^n |x_i|$ on $\mathbb{R}^n$ (where $x_i$ is the $i$-th component of the vector $x$).
2. How can one compute the proximity operator of the $\ell_1$ norm from the one of the absolute value?
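Since the $\ell_1$ norm is separable across coordinates, its proximity operator can be applied coordinate-wise using the soft-thresholding formula of question 1. A minimal sketch (the helper name prox_l1 is ours):
import numpy as np

def prox_l1(x, gamma):
    # coordinate-wise soft-thresholding, i.e. prox of gamma*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

# e.g. prox_l1(np.array([2.0, -0.3, 0.7]), 0.5) gives [1.5, 0.0, 0.2]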
In the first part of the lab, we will consider the classical lasso problem: \begin{align} \tag{lasso} \min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax-b\|^2 + \lambda \|x\|_1 \end{align}
3. Observe the level sets of the lasso problem implemented below. Run the code several times to generate different sensing matrices $A$ and vectors $b = A x_0 + \text{noise}$. Notice how the solution of the problem often seems to have a null first coordinate, like $x_0$.
import numpy as np
x0 = np.array([0,1,0,0.5,-0.9])
n = x0.size
m = 10
A = np.random.randn(m,n)
b = np.dot(A,x0) + 0.1*np.random.randn(m)
lam = 0.5
def lasso(x,A=A,b=b,lam=lam):
    return 0.5*np.linalg.norm(np.dot(A,x)-b)**2 + lam*np.linalg.norm(x,1)
from scipy.optimize import minimize
x_init = np.zeros(n)   # initial guess (kept distinct from the true signal x0)
res = minimize(lasso, x_init)
import matplotlib.cm as cm
import matplotlib.pyplot as plt
delta = 0.025
x = np.arange(-2.0, 2.0, delta)
y = np.arange(-2.0, 2.0, delta)
X, Y = np.meshgrid(x, y)
Z = np.copy(X)
p = np.copy(res.x)
for index,xval in np.ndenumerate(X):
    p[0] = xval
    p[1] = Y[index]
    Z[index] = lasso(p)
fig, ax = plt.subplots()
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=True, fontsize=10)
ax.set_title('Level sets of the lasso problem')
In this section, we will again consider the lasso problem $$ \min_{x\in\mathbb{R}^n} \frac{1}{2} \|Ax-b\|^2 + \lambda \|x\|_1 $$ using the data generated below and we will compare the solutions obtained by several optimization methods.
import numpy as np
n, m = 20, 50
prop_null = 0.5
null_coord = np.random.choice(n, int(prop_null*n), replace=False)
x0 = np.random.randn(n)
x0[null_coord] = 0.0
np.random.seed(42)
A = np.random.randn(m,n)
b = np.dot(A,x0) + 0.1*np.random.randn(m)
lam = 0.1*n
x0
array([ 0. , -0.83997292, 0. , 0.00300909, 0. , 0. , 0. , -0.68270716, 0. , 0.60573409, 0. , 0.95825461, 1.56716482, 0. , 0. , -2.15207088, -0.11261452, -0.03273244, 0.66114644, 0. ])
def lasso(x,A=A,b=b,lam=lam):
    return 0.5*np.linalg.norm(np.dot(A,x)-b)**2 + lam*np.linalg.norm(x,1)
In Sections 2.1 to 2.5, you will be asked to solve this problem in several ways. Make sure to complete at least one method among 2.1, 2.2, 2.3 and one among 2.4, 2.5. You will be asked to compare the solutions in Section 2.6.
4. Solve the problem with a general-purpose solver such as scipy.optimize.minimize (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize). Store the result in a variable xopt_GP.
from scipy.optimize import minimize
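For instance, one possible call is sketched below; the default quasi-Newton method only handles the nonsmooth $\ell_1$ term approximately, so the result should be read as a rough baseline.
res_GP = minimize(lasso, np.zeros(n))   # default solver, numerical gradients
xopt_GP = res_GP.x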
5. Solve the problem with a modelling software (e.g. cvxpy, https://www.cvxpy.org/) or a QP solver. Store the result in a variable xopt_CVXPY.
import cvxpy as cp
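A possible cvxpy formulation (a sketch; the solver is left to cvxpy's default choice):
x_cp = cp.Variable(n)
objective = cp.Minimize(0.5*cp.sum_squares(A @ x_cp - b) + lam*cp.norm1(x_cp))
cp.Problem(objective).solve()
xopt_CVXPY = x_cp.value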
6. Solve the problem with a dedicated library such as sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html). Store the result in a variable xopt_sklearn.
from sklearn import linear_model
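A possible use of sklearn's Lasso; note that sklearn minimizes $\frac{1}{2m}\|Ax-b\|^2 + \alpha\|x\|_1$, so taking $\alpha = \lambda/m$ and disabling the intercept matches our objective up to a global scaling (a sketch):
reg = linear_model.Lasso(alpha=lam/m, fit_intercept=False, max_iter=10000)
reg.fit(A, b)
xopt_sklearn = reg.coef_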
Recall, the iterations of the proximal gradient are of the form $$ x_{k+1} = \mathbf{prox}_{\gamma g}\left( x_k - \gamma \nabla f(x_k) \right) $$ where $\gamma$ is a stepsize to choose.
7. Solve the problem with a proximal gradient method. Store the result in a variable xopt_PG.
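A minimal sketch of such a method for the lasso, with $f(x)=\frac{1}{2}\|Ax-b\|^2$ (gradient $A^\top(Ax-b)$) and $g=\lambda\|\cdot\|_1$; the stepsize $1/\|A^\top A\|_2$ and the iteration budget are illustrative choices:
L_f = np.linalg.norm(A.T @ A, 2)            # Lipschitz constant of the gradient of f
gamma = 1.0/L_f
x_pg = np.zeros(n)
for _ in range(2000):
    v = x_pg - gamma*(A.T @ (A @ x_pg - b))                   # gradient step on f
    x_pg = np.sign(v)*np.maximum(np.abs(v) - gamma*lam, 0.0)  # prox of gamma*lam*||.||_1
xopt_PG = x_pg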
Use any proximal splitting method you wish, eg. the Douglas-Rachford method which reads \begin{align*} x_k &= \mathbf{prox}_{\gamma g}(u_k)\\ z_k &= \mathbf{prox}_{\gamma f}(2x_k-u_k)\\ u_{k+1} &= u_k + z_k - x_k \end{align*}
8. Solve the problem with a splitting method such as Douglas-Rachford, ADMM, etc. Store the result in a variable xopt_Split. Several variables may converge to the same value; you can pick one arbitrarily and get back to it later.
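A possible Douglas-Rachford sketch with $f(x)=\frac{1}{2}\|Ax-b\|^2$ and $g=\lambda\|\cdot\|_1$, where $\mathbf{prox}_{\gamma f}(v)$ amounts to solving $(I+\gamma A^\top A)z = v + \gamma A^\top b$; the stepsize and iteration budget are illustrative:
gamma = 1.0
M = np.eye(n) + gamma*(A.T @ A)      # linear system defining prox_{gamma f}
Atb = A.T @ b
u = np.zeros(n)
for _ in range(2000):
    x_dr = np.sign(u)*np.maximum(np.abs(u) - gamma*lam, 0.0)   # prox_{gamma g}(u)
    z_dr = np.linalg.solve(M, 2*x_dr - u + gamma*Atb)          # prox_{gamma f}(2x-u)
    u = u + z_dr - x_dr
xopt_Split = x_dr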
9. Compare the objective values at the points obtained above, as well as their sparsity patterns, both with each other and with that of $x_0$.
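A possible comparison sketch, assuming all of xopt_GP, xopt_CVXPY, xopt_sklearn, xopt_PG, xopt_Split have been computed (the support threshold 1e-10 is our choice):
candidates = {"GP": xopt_GP, "CVXPY": xopt_CVXPY, "sklearn": xopt_sklearn,
              "PG": xopt_PG, "Split": xopt_Split, "x0": x0}
for name, xv in candidates.items():
    support = np.where(np.abs(xv) > 1e-10)[0]
    print("{:8s} objective = {:10.4f}   support = {}".format(name, lasso(xv), support))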
10. Compare the results obtained when the dimension and the sparsity increase, e.g. n, m = 100, 200; prop_null = 0.6 or n, m = 400, 200; prop_null = 0.8.
We have some data $\mathcal{D}$ consisting of $m$ examples, each made of a feature vector $a_i\in\mathbb{R}^n$ and an observation $b_i\in \mathcal{O}$: $\mathcal{D} = \{[a_i,b_i]\}_{i=1..m}$. In this lab, we will consider the student performance dataset.
The goal of supervised learning is to construct a predictor for the observations when given feature vectors.
A popular approach relies on linear models: the goal is to find a parameter $x$ such that the real number $\langle a_i , x \rangle$ predicts the observation through a predictor function $g:\mathbb{R}\to \mathcal{O}$, i.e. $g(\langle a_i , x \rangle)$ is the value predicted from $a_i$.
In order to find such a parameter, we use the available data and a loss $\ell$ that penalizes the error made between the predicted $g(\langle a_i , x \rangle)$ and observed $b_i$ values. For each example $i$, the corresponding error function for a parameter $x$ is $f_i(x) = \ell( g(\langle a_i , x \rangle) ; b_i )$. Using the whole data, the parameter that minimizes the total error is the solution of the minimization problem $$ \min_{x\in\mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x) = \frac{1}{m} \sum_{i=1}^m \ell( g(\langle a_i , x \rangle) ; b_i ). $$
In our setup, the observations are binary: $\mathcal{O} = \{-1 , +1 \}$, and the logistic loss is used to form the following optimization problem \begin{align*} \min_{x\in\mathbb{R}^n } f(x) := \frac{1}{m} \sum_{i=1}^m \log( 1+\exp(-b_i \langle a_i,x \rangle) ) + \frac{\lambda}{2} \|x\|_2^2, \end{align*} where the last term is added as a regularization (of type $\ell_2$, a.k.a. Tikhonov) to prevent overfitting.
Under some statistical hypotheses, $x^\star = \arg\min f(x)$ maximizes the likelihood of the labels given the feature vectors. Then, for a new point $d$ with feature vector $a$, $$ p_1(a) = \mathbb{P}[d\in \text{ class } +1] = \frac{1}{1+\exp(-\langle a, x^\star \rangle)}. $$ Thus, from $a$, if $p_1(a)$ is close to $1$, one can decide that $d$ belongs to class $+1$, and make the opposite decision if $p_1(a)$ is close to $0$.
We will consider an $\ell_1$ regularization of this problem to promote sparsity of the iterates. A sparse final solution would select the most important features. The new function (below) is non-smooth but it has a smooth part, $f$, and a non-smooth part, $g$.
\begin{align*} \min_{x\in\mathbb{R}^n } F(x) := \underbrace{ \frac{1}{m} \sum_{i=1}^m \log( 1+\exp(-b_i \langle a_i,x \rangle) ) + \frac{\lambda_2}{2} \|x\|_2^2}_{f(x)} + \underbrace{\lambda_1 \|x\|_1 }_{g(x)}. \end{align*}
The dataset comprises $27$ features, described below, and the goal is to predict whether a student passes the year or not. It is thus important to investigate which features are the most significant for the student's success. We will see how the $\ell_1$ regularization can help toward this goal.
1 sex - student's sex (binary: "F" - female or "M" - male)
2 age - student's age (numeric: from 15 to 22)
3 address - student's home address type (binary: "U" - urban or "R" - rural)
4 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
5 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
6 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
7 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
9 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
10 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
11 schoolsup - extra educational support (binary: yes or no)
12 famsup - family educational support (binary: yes or no)
13 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
14 activities - extra-curricular activities (binary: yes or no)
15 nursery - attended nursery school (binary: yes or no)
16 higher - wants to take higher education (binary: yes or no)
17 internet - Internet access at home (binary: yes or no)
18 romantic - with a romantic relationship (binary: yes or no)
19 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
20 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
21 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
22 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
23 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
24 health - current health status (numeric: from 1 - very bad to 5 - very good)
25 absences - number of school absences (numeric: from 0 to 93)
26 G1 - first period grade (numeric: from 0 to 20)
27 G2 - second period grade (numeric: from 0 to 20)
import numpy as np
import csv
#### File reading
dat_file = np.load('student.npz')
A = dat_file['A_learn']
final_grades = dat_file['b_learn']
m = final_grades.size
b = np.zeros(m)
for i in range(m):
    if final_grades[i]>11:
        b[i] = 1.0
    else:
        b[i] = -1.0
A_test = dat_file['A_test']
final_grades_test = dat_file['b_test']
m_test = final_grades_test.size
b_test = np.zeros(m_test)
for i in range(m_test):
    if final_grades_test[i]>11:
        b_test[i] = 1.0
    else:
        b_test[i] = -1.0
d = 27 # features
n = d+1 # with the intercept
lam2 = 0.1 # for the 2-norm regularization best:0.1
lam1 = 0.03 # for the 1-norm regularization best:0.03
L = 0.25*max(np.linalg.norm(A,2,axis=1))**2 + lam2
The size of $x$ is given as $n$, and $L$ represents the Lipschitz constant of the gradient of $f$.
Below are functions computing $f$, its gradient, $g$, its proximity operator (soft-thresholding), and the full objective $F$.
def f(x,A=A,b=b):
    l = 0.0
    for i in range(A.shape[0]):
        if b[i] > 0 :
            l += np.log( 1 + np.exp(-np.dot( A[i] , x ) ) )
        else:
            l += np.log( 1 + np.exp(np.dot( A[i] , x ) ) )
    return l/m + lam2/2.0*np.dot(x,x)

def grad_f(x,lam2=lam2):
    g = np.zeros(n)
    for i in range(A.shape[0]):
        if b[i] > 0:
            g += -A[i]/( 1 + np.exp(np.dot( A[i] , x ) ) )
        else:
            g += A[i]/( 1 + np.exp(-np.dot( A[i] , x ) ) )
    return g/m + lam2*x

def g(x,lam1=lam1):
    return lam1*np.linalg.norm(x,1)

def prox_g(x,gamma,lam1=lam1):
    p = np.zeros(n)
    for i in range(n):
        if x[i] < - lam1*gamma:
            p[i] = x[i] + lam1*gamma
        if x[i] > lam1*gamma:
            p[i] = x[i] - lam1*gamma
    return p

def F(x):
    return f(x) + g(x)
Below are two functions that compute the training accuracy (accuracy on the training set) and the testing accuracy, given a parameter $x$.
def prediction_train(w,PRINT=True):
    pred = np.zeros(A.shape[0])
    perf = 0
    for i in range(A.shape[0]):
        p = 1.0/( 1 + np.exp(-np.dot( A[i] , w ) ) )
        if p>0.5:
            pred[i] = 1.0
            if b[i]>0:
                correct = "True"
                perf += 1
            else:
                correct = "False"
            if PRINT:
                print("True class: {:d} \t-- Predicted: {} \t(confidence: {:.1f}%)\t{}".format(int(b[i]),1,(p-0.5)*200,correct))
        else:
            pred[i] = -1.0
            if b[i]<0:
                correct = "True"
                perf += 1
            else:
                correct = "False"
            if PRINT:
                print("True class: {:d} \t-- Predicted: {} \t(confidence: {:.1f}%)\t{}".format(int(b[i]),-1,100-(0.5-p)*200,correct))
    return pred,float(perf)/A.shape[0]
def prediction_test(w,PRINT=True):
    pred = np.zeros(A_test.shape[0])
    perf = 0
    for i in range(A_test.shape[0]):
        p = 1.0/( 1 + np.exp(-np.dot( A_test[i] , w ) ) )
        if p>0.5:
            pred[i] = 1.0
            if b_test[i]>0:
                correct = "True"
                perf += 1
            else:
                correct = "False"
            if PRINT:
                print("True class: {:d} \t-- Predicted: {} \t(confidence: {:.1f}%)\t{}".format(int(b_test[i]),1,(p-0.5)*200,correct))
        else:
            pred[i] = -1.0
            if b_test[i]<0:
                correct = "True"
                perf += 1
            else:
                correct = "False"
            if PRINT:
                print("True class: {:d} \t-- Predicted: {} \t(confidence: {:.1f}%)\t{}".format(int(b_test[i]),-1,100-(0.5-p)*200,correct))
    return pred,float(perf)/A_test.shape[0]
For minimizing a function $F:\mathbb{R}^n \to \mathbb{R}$ equal to $f+g$, where $f$ is differentiable and the $\mathbf{prox}$ of $g$ is known, the algorithm below takes as inputs the objective F, the gradient grad_f of the smooth part, the proximity operator prox_g of the nonsmooth part, an initial point x0, a stepsize step, a target precision PREC, a maximum number of iterations ITE_MAX, and a printing flag PRINT. It performs iterations of the form $$ x_{k+1} = \mathbf{prox}_{\gamma g}\left( x_k - \gamma \nabla f(x_k) \right) $$ where $\gamma$ is the stepsize to choose.
11. Fill the function below with the proximal gradient algorithm.
import numpy as np
import timeit
def proximal_gradient_algorithm(F , grad_f , prox_g , x0 , step , PREC , ITE_MAX , PRINT ):
    x = np.copy(x0)
    x_tab = np.copy(x)
    if PRINT:
        print("------------------------------------\n Proximal gradient algorithm\n------------------------------------\nSTART -- stepsize = {:0}".format(step))
    t_s = timeit.default_timer()
    for k in range(ITE_MAX):
        x = x ####### ITERATION TO FILL
        x_tab = np.vstack((x_tab,x))
    t_e = timeit.default_timer()
    if PRINT:
        print("FINISHED -- {:d} iterations / {:.6f}s -- final value: {:f}\n\n".format(k,t_e-t_s,F(x)))
    return x,x_tab
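One possible way to fill the iteration, with step playing the role of $\gamma$ and using the grad_f and prox_g defined above (a sketch; the stopping test based on PREC is left aside):
# inside the for loop, a possible proximal gradient update:
x = prox_g(x - step*grad_f(x), step)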
12. Run the algorithm and investigate the decrease of the objective function along the iterations.
#### Parameters we give to our algorithm
PREC = 1e-5 # Sought precision
ITE_MAX = 1000 # Max number of iterations
x0 = np.zeros(n) # Initial point
step = 1.0/L
##### gradient algorithm
x,x_tab = proximal_gradient_algorithm(F , grad_f , prox_g , x0 , step , PREC, ITE_MAX , True)
------------------------------------
 Proximal gradient algorithm
------------------------------------
START -- stepsize = 0.030334772813507393
FINISHED -- 999 iterations / 0.009546s -- final value: 0.693147
import matplotlib.pyplot as plt
Fval = []
for i in range(x_tab.shape[0]):
    Fval.append( F(x_tab[i]))
plt.figure()
plt.plot( Fval, color="black", linewidth=1.0, linestyle="-")
plt.grid(True)
plt.show()
13. Plot the support of the vector $x_k$ (i.e. one point for every non-null coordinate of $x_k$) versus the iteration number. What do you notice? Was it expected?
plt.figure()
for i in np.arange(0,x_tab.shape[0],int(x_tab.shape[0]/40)):
    for j in range(n):
        if np.abs(x_tab[i,j])>1e-14:
            plt.plot( i , j , 'ko')
plt.grid(True)
plt.ylabel('Non-null Coordinates')
plt.xlabel('Nb. Iterations')
plt.ylim(-1,d+1)
plt.yticks(np.arange(0,d+1))
plt.show()
We saw above that the algorithm selects some coordinates while the others go to zero. For our machine learning task, this translates into the algorithm selecting a subset of the features that will be used for the prediction step.
14. Change the parameter $\lambda_1$ of the problem (lam1) in the code above and investigate how it influences the number of selected features.
In order to quantify the influence of this feature selection, let us consider the regularization path, that is, the support of the final point obtained by our minimization method as a function of $\lambda_1$.
15. For $\lambda_1 = 2^{-10},2^{-9},\dots,2^{1}$, run the proximal gradient algorithm on the corresponding problem and store the support of the final point, the prediction performance on the training set (prediction_train) and on the testing set (prediction_test). What is the best regularization choice and which are the most important features of the dataset?
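A possible sketch of this loop, reusing the functions defined above once the iteration of question 11 is filled (the support threshold 1e-14 and the silent PRINT=False calls are our choices):
lam1_values = [2.0**k for k in range(-10, 2)]
supports, acc_train, acc_test = [], [], []
for l1 in lam1_values:
    prox_cur = lambda x, gamma, l1=l1: prox_g(x, gamma, lam1=l1)   # prox with the current lambda_1
    F_cur = lambda x, l1=l1: f(x) + l1*np.linalg.norm(x, 1)        # objective with the current lambda_1
    x_final, _ = proximal_gradient_algorithm(F_cur, grad_f, prox_cur, np.zeros(n), 1.0/L, PREC, ITE_MAX, False)
    supports.append(np.where(np.abs(x_final) > 1e-14)[0])
    acc_train.append(prediction_train(x_final, PRINT=False)[1])
    acc_test.append(prediction_test(x_final, PRINT=False)[1])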
16. What happens if you replace the $\ell_1$ norm by the $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$?
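To try this, the proximity operator of $\lambda_1\|\cdot\|_\infty$ can be obtained via Moreau's decomposition as the residual of a Euclidean projection onto an $\ell_1$ ball. A possible sketch (the sorting-based projection and the function names are ours):
def proj_l1_ball(x, radius):
    # Euclidean projection onto the l1 ball {u : ||u||_1 <= radius}
    if np.linalg.norm(x, 1) <= radius:
        return x
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u*np.arange(1, x.size+1) > (css - radius))[0][-1]
    theta = (css[rho] - radius)/(rho + 1.0)
    return np.sign(x)*np.maximum(np.abs(x) - theta, 0.0)

def prox_linf(x, gamma, lam1=lam1):
    # Moreau decomposition: prox_{gamma*lam1*||.||_inf}(x) = x - proj_{||u||_1 <= gamma*lam1}(x)
    return x - proj_l1_ball(x, gamma*lam1)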
17. What happens if you replace the proximal gradient by SAGA (see https://arxiv.org/pdf/1407.0202.pdf)?
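A possible proximal SAGA sketch for this problem, writing $f = \frac{1}{m}\sum_i f_i$ with $f_i(x) = \log(1+\exp(-b_i\langle a_i,x\rangle)) + \frac{\lambda_2}{2}\|x\|^2$; the stepsize and the number of passes over the data are illustrative choices:
def grad_f_i(x, i):
    # gradient of the i-th logistic term plus the l2 regularization
    if b[i] > 0:
        gi = -A[i]/(1 + np.exp(np.dot(A[i], x)))
    else:
        gi = A[i]/(1 + np.exp(-np.dot(A[i], x)))
    return gi + lam2*x

x_saga = np.zeros(n)
table = np.array([grad_f_i(x_saga, i) for i in range(m)])   # stored per-example gradients
avg = table.mean(axis=0)
step_saga = 1.0/(3.0*L)
for k in range(30*m):                                       # roughly 30 passes over the data
    i = np.random.randint(m)
    gi = grad_f_i(x_saga, i)
    x_saga = prox_g(x_saga - step_saga*(gi - table[i] + avg), step_saga)
    avg += (gi - table[i])/m
    table[i] = gi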