What is the smallest

The lasso estimate described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t$

$\begin{aligned} f(\beta) &= \frac{1}{2n} \vert\vert y-X\beta \vert\vert_2^2 \\ g(\beta) &= \vert\vert \beta \vert\vert_1 \end{aligned}$

This optimization has a geometric representation: finding the point of contact between a multidimensional sphere and a polytope (spanned by the vectors of X). The surface of the polytope represents $g(\beta)$. The square of the sphere's radius represents the function $f(\beta)$ and is minimized when the surfaces touch.

The images below give a graphical explanation. They use the following simple problem with vectors of length 3 (for simplicity, so that it can be drawn):

$\begin{bmatrix} y_1 \\ y_2 \\ y_3\\ \end{bmatrix} = \begin{bmatrix} 1.4 \\ 1.84 \\ 0.32\\ \end{bmatrix} = \beta_1 \begin{bmatrix} 0.8 \\ 0.6 \\ 0\\ \end{bmatrix} +\beta_2 \begin{bmatrix} 0 \\ 0.6 \\ 0.8\\ \end{bmatrix} +\beta_3 \begin{bmatrix} 0.6 \\ 0.64 \\ -0.48\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3\\ \end{bmatrix}$

and we minimize $\epsilon_1^2+\epsilon_2^2+\epsilon_3^2$ subject to the constraint

$\vert\beta_1\vert+\vert\beta_2\vert+\vert\beta_3\vert \leq t$
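As a quick sanity check, the toy problem above can be set up in base R. This is only a sketch: the names `X`, `y`, `n`, `f`, `g` and `beta_ols` are chosen here for illustration and do not appear in the original answer. It uses the $\frac{1}{2n}$ scaling from the definition of $f(\beta)$.

```r
# Toy problem from the text: three regressors of length 3.
X <- cbind(x1 = c(0.8, 0.6, 0),
           x2 = c(0, 0.6, 0.8),
           x3 = c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)
n <- nrow(X)

# Objective and constraint as defined above (function names are illustrative).
f <- function(beta) sum((y - X %*% beta)^2) / (2 * n)
g <- function(beta) sum(abs(beta))

# OLS solution via the normal equations.
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))

print(beta_ols)     # close to (1, 1, 1), since y = x1 + x2 + x3
print(f(beta_ols))  # numerically zero: the sphere has shrunk to a point
print(g(beta_ols))  # about 3: the constraint is inactive once t reaches this value
```

For any $t$ below `g(beta_ols)` the constraint binds, and the lasso solution lies on the surface of the polytope rather than at the OLS point.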

The images show:

The red surface shows the constraint, a polytope spanned by X.
The green surface shows the surface being minimized, a sphere.
The blue line shows the lasso path, the solutions we find as we change $t$ or $\lambda$.
The green vector shows the OLS solution $\hat{y}$ (which was chosen as $\beta_1=\beta_2=\beta_3=1$, or $\hat{y} = x_1 + x_2 + x_3$).
The three black vectors are $x_1 = (0.8,0.6,0)$, $x_2 = (0,0.6,0.8)$ and $x_3 = (0.6,0.64,-0.48)$.

We show three images:

In the first image only a point of the polytope touches the sphere. This image shows very well why the lasso solution is not just a multiple of the OLS solution. The direction of the OLS solution adds more strongly to the sum $\vert \beta \vert_1$. In this case only a single $\beta_i$ is non-zero.
In the second image a ridge of the polytope touches the sphere (in higher dimensions we get higher-dimensional analogues). In this case several $\beta_i$ are non-zero.
In the third image a facet of the polytope touches the sphere. In this case all $\beta_i$ are non-zero.

The range of $t$ or $\lambda$ for which we have the first and third cases can be computed easily because of their simple geometric representation.

Case 1: Only a single $\beta_i$ is non-zero

The non-zero $\beta_i$ is the one for which the associated vector $x_i$ has the highest absolute value of the covariance with $\hat{y}$ (this is the point of the parallelotope that is closest to the OLS solution). We can calculate the Lagrange multiplier $\lambda_{max}$ below which we have at least one non-zero $\beta$ by taking the derivative with respect to $\pm\beta_i$ (the sign depends on whether we increase $\beta_i$ in the negative or positive direction):

$\frac{\partial ( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 - \lambda \vert \vert \beta \vert \vert_1 )}{\pm \partial \beta_i} = 0$

which leads to

$\lambda_{max} = \frac{ \left( \frac{1}{2n}\frac{\partial ( \vert \vert y - X\beta \vert \vert_2^2 )}{\pm \partial \beta_i} \right) }{ \left( \frac{ \partial ( \vert \vert \beta \vert \vert_1 )}{\pm \partial \beta_i}\right)} = \pm \frac{\partial ( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 )}{\partial \beta_i} = \pm \frac{1}{n} x_i \cdot y$

which corresponds to the $\vert \vert X^Ty \vert \vert_\infty$ mentioned in the comments.

Note that this holds only for the special case in which the tip of the polytope touches the sphere (so this is not a general solution, although the generalization is straightforward).
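The following base R sketch evaluates $\lambda_{max} = \frac{1}{n} \max_i \vert x_i \cdot y \vert$ for the three-vector toy problem and checks it against a small coordinate-descent lasso solver. The solver `lasso_cd`, the helper `soft` and all variable names are illustrative assumptions and not part of the original answer; the solver minimizes $\frac{1}{2n} \vert\vert y - X\beta \vert\vert_2^2 + \lambda \vert\vert \beta \vert\vert_1$, matching the scaling used in the text.

```r
# Toy problem from the text.
X <- cbind(x1 = c(0.8, 0.6, 0),
           x2 = c(0, 0.6, 0.8),
           x3 = c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)
n <- nrow(X)

# Largest absolute inner product with y, scaled by 1/n.
xty <- drop(crossprod(X, y))
lambda_max <- max(abs(xty)) / n
first_active <- which.max(abs(xty))

# Soft-thresholding operator used by coordinate descent.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimal cyclic coordinate descent for the lasso (illustrative, not tuned).
lasso_cd <- function(X, y, lambda, sweeps = 2000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (s in seq_len(sweeps)) {
    for (j in seq_len(ncol(X))) {
      # Partial residual with coordinate j left out.
      r_j <- drop(y - X[, -j, drop = FALSE] %*% beta[-j])
      beta[j] <- soft(sum(X[, j] * r_j) / n, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

beta_above <- lasso_cd(X, y, 1.01 * lambda_max)  # just above lambda_max
beta_below <- lasso_cd(X, y, 0.99 * lambda_max)  # just below lambda_max

print(lambda_max)    # about 0.7413 (= 2.224 / 3)
print(first_active)  # index of the first coefficient to enter (x1 here)
print(beta_above)    # all zero
print(beta_below)    # only one coefficient is non-zero
```

Taking the maximum of the absolute values, rather than of the signed inner products, is what makes the result agree with $\vert \vert X^Ty \vert \vert_\infty$.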

Case 3: All $\beta_i$ are non-zero.

In this case a facet of the polytope touches the sphere. The direction of change of the lasso path is then normal to the surface of that facet.

The polytope has many facets, with positive and negative contributions of the $x_i$. In the case of the last lasso step, when the lasso solution is close to the OLS solution, the contributions of the $x_i$ must be defined by the sign of the OLS solution. The normal of the facet can be defined by taking the gradient of the function $\vert \vert \beta(r) \vert \vert_1$, the value of the sum of the absolute coefficients at the point $r$, which is:

$n = - \nabla_r ( \vert \vert \beta(r) \vert \vert_1) = -\nabla_r ( \text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^Tr ) = -\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T$

and the equivalent change of beta for this direction is:

$\vec{\beta}_{last} = (X^TX)^{-1}X^T n = -(X^TX)^{-1}X^T [\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T]$

which after some algebraic tricks with shifting the transposes ( $A^TB^T = [BA]^T$ ) and distribution of brackets becomes

$\vec{\beta}_{last} = - (X^TX)^{-1} \text{sign} (\hat{\beta})$

We normalize this direction:

$\vec{\beta}_{last,normalized} = \frac{\vec{\beta}_{last}}{\sum \vec{\beta}_{last} \cdot \text{sign}(\hat{\beta})}$

To find the $\lambda_{min}$ below which all coefficients are non-zero, we only have to calculate back from the OLS solution to the point where one of the coefficients is zero,

$d = \min \left( \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} \right)\qquad \text{with the condition that } \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} >0$

and at this point we evaluate the derivative (as before, when we calculated $\lambda_{max}$). We use that for a quadratic function we have $q'(x) = 2 q(1) x$:

$\lambda_{min} = \frac{d}{n} \vert \vert X \vec{\beta}_{last,normalized} \vert \vert_2^2$
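The steps for $\lambda_{min}$ can be traced on the three-vector toy problem with base R. This is a sketch under the assumption that the signs of the lasso coefficients in the last segment equal the signs of the OLS coefficients, as stated above; the variable names are illustrative. As a cross-check it also evaluates the stationarity condition $\frac{1}{n} X^T (y - X\beta) = \lambda \, \text{sign}(\hat{\beta})$ at the computed point.

```r
# Toy problem from the text.
X <- cbind(x1 = c(0.8, 0.6, 0),
           x2 = c(0, 0.6, 0.8),
           x3 = c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)
n <- nrow(X)

# OLS solution and its sign pattern.
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))
s <- sign(beta_ols)

# Direction of the last lasso step, then normalized as in the text.
beta_last <- -drop(solve(crossprod(X), s))
beta_last_norm <- beta_last / sum(beta_last * s)

# Distance back from the OLS solution to the first zero coefficient.
ratio <- beta_ols / beta_last_norm
d <- min(ratio[ratio > 0])

# lambda below which all coefficients are non-zero.
lambda_min <- d / n * sum((X %*% beta_last_norm)^2)

# Coefficients at lambda_min: one of them has just reached zero.
beta_at_min <- beta_ols - d * beta_last_norm

# Stationarity check: X^T residual / n should equal lambda_min * sign.
kkt_gap <- drop(crossprod(X, y - X %*% beta_at_min)) / n - lambda_min * s

print(lambda_min)         # about 0.130
print(beta_at_min)        # third coefficient is (numerically) zero
print(max(abs(kkt_gap)))  # numerically zero
```

In this example the first coefficient grows while the other two shrink as $\lambda$ increases from zero, which is why only the positive ratios are used when computing $d$.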

Images

a point of the polytope is touching the sphere, a single $\beta_i$ is non-zero:

a ridge (or its analogue in higher dimensions) of the polytope is touching the sphere, many $\beta_i$ are non-zero:

a facet of the polytope is touching the sphere, all $\beta_i$ are non-zero:

Code example:

library(lars)    
data(diabetes)
y <- diabetes$y - mean(diabetes$y)
x <- diabetes$x

# models
lmc <- coef(lm(y~0+x))
modl <- lars(diabetes$x, diabetes$y, type="lasso")

# matrix equation
d_x <- matrix(rep(x[,1],9),length(x[,1])) %*% diag(sign(lmc[-c(1)]/lmc[1]))
x_c = x[,-1]-d_x
y_c = -x[,1]

# solving equation
cof <- coefficients(lm(y_c~0+x_c))
cof <- c(1-sum(cof*sign(lmc[-c(1)]/lmc[1])),cof)

# alternatively the last direction of change in coefficients is found by:
solve(t(x) %*% x) %*% sign(lmc)

# solution by lars package
cof_m <-(coefficients(modl)[13,]-coefficients(modl)[12,])

# last step
dist <- x %*% (cof/sum(cof*sign(lmc[])))
#dist_m <- x %*% (cof_m/sum(cof_m*sign(lmc[]))) #for comparison

# calculate back to zero
shrinking_set <- which(-lmc[]/cof>0)  #only the positive values
step_last <- min((-lmc/cof)[shrinking_set])

d_err_d_beta <- step_last*sum(dist^2)

# compare
modl[4] #all computed lambda
d_err_d_beta  # lambda last change
max(abs(t(x) %*% y)) # lambda first change

note: those last three lines are the most important

> modl[4]            # all computed lambda by algorithm
$lambda
 [1] 949.435260 889.315991 452.900969 316.074053 130.130851  88.782430  68.965221  19.981255   5.477473   5.089179
[11]   2.182250   1.310435

> d_err_d_beta       # lambda last change by calculating only last step
    xhdl 
1.310435 
> max(abs(t(x) %*% y))    # lambda first change by max(|x^T y|)
[1] 949.4353

Written by StackExchangeStrike

Sextus Empiricus

Thanks for including the edits! So far in my reading, I'm stuck just past the "case 1" subsection. The result for $\lambda_\max$ derived there is wrong since it doesn't include an absolute value or a maximum. We know further that there's a mistake since in the derivation, there's a sign mistake, a place where differentiability is wrongly assumed, an "arbitrary choice" of $i$ to differentiate with respect to, and an incorrectly evaluated derivative. To be frank, there isn't one "$=$" sign that's valid.

user795305

I have corrected it with a plus-minus sign. The change of the beta can be positive or negative. Regarding the maximum and "arbitrary choice"... "for which the associated vector $x_i$ has the highest covariance with $\hat{y}$"

Sextus Empiricus

Thanks for the update! However, there are still problems. For instance, $\frac{\partial}{\partial \beta_i} \|y - X \beta\|_2^2$ is evaluated incorrectly.

user795305

If $\beta=0$ then $\frac{\partial}{\partial\beta_i} \vert \vert y - X\beta \vert \vert_2^2$ $= \frac{\partial \vert \vert y - X\beta \vert \vert_2}{\partial\beta_i} 2 \vert \vert y - X\beta \vert \vert_2$ $= \frac{\partial \vert \vert y - s x_i \vert \vert_2}{\partial s} 2 \vert \vert y - X\beta \vert \vert_2$ $= 2 \, \text{cor}(x_i,y) \vert \vert x_i \vert \vert_2 \vert \vert y \vert \vert_2$ $= 2 x_i \cdot y$. This correlation enters the equation because, if $s=0$, then only the change of $s x_i$ tangent to $y$ is changing the length of the vector $y - s x_i$.

Sextus Empiricus

Ah, okay, so there's a limit involved in your argument! (You're using both $\beta = 0$ and that a coefficient is nonzero.) Further, the second equality in the line with $\lambda_\max$ is misleading since the sign could change due to the differentiation of the absolute value.

user795305
