I'm having trouble deriving the Hessian of the objective function $l(\theta)$, where $h_\theta(x)$ is the hypothesis.
Does anyone know a clean and simple way to derive $X^T D X$?
Answers:
Here I derive all the necessary properties and identities so that the solution is self-contained, but otherwise this derivation is clean and simple. Let us formalize our notation and write the loss function a bit more compactly. Consider $m$ samples $\{x_i, y_i\}$, such that $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Recall that in binary logistic regression the hypothesis function $h_\theta$ is typically the logistic function. Formally,
$$h_\theta(x_i) = \sigma(\omega^T x_i) = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},$$
where $\omega \in \mathbb{R}^d$ and $z_i = \omega^T x_i$. The loss function (which I believe is missing a negative sign in the OP's version) is then defined as:
$$l(\omega) = \sum_{i=1}^m -\left( y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right)$$
There are two important properties of the logistic function which I derive here for future reference. First, note that $1 - \sigma(z) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{z}} = \sigma(-z)$.
Also note that
$$\frac{\partial}{\partial z}\sigma(z) = \frac{\partial}{\partial z}(1 + e^{-z})^{-1} = e^{-z}(1 + e^{-z})^{-2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))$$
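As a quick sanity check, both identities can be verified numerically; here is a small NumPy sketch (the function name `sigmoid` is my own):

```python
import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# Property 1: 1 - sigma(z) = sigma(-z)
assert np.allclose(1.0 - sigmoid(z), sigmoid(-z))

# Property 2: d/dz sigma(z) = sigma(z) * (1 - sigma(z)),
# checked against a central finite difference
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
analytic = sigmoid(z) * (1.0 - sigmoid(z))
assert np.allclose(numeric, analytic, atol=1e-8)
```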
Instead of taking derivatives with respect to components, here we will work directly with vectors (you can review derivatives with vectors here). The Hessian of the loss function $l(\omega)$ is given by $\vec\nabla^2 l(\omega)$, but first recall that $\frac{\partial z}{\partial \omega} = x^T$ and $\frac{\partial z}{\partial \omega^T} = x$.
Let $l_i(\omega) = -y_i \log \sigma(z_i) - (1 - y_i) \log(1 - \sigma(z_i))$. Using the properties we derived above and the chain rule,
$$\frac{\partial \log \sigma(z_i)}{\partial \omega^T} = \frac{1}{\sigma(z_i)} \frac{\partial \sigma(z_i)}{\partial \omega^T} = \frac{1}{\sigma(z_i)} \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial \omega^T} = (1 - \sigma(z_i))\, x_i$$
$$\frac{\partial \log(1 - \sigma(z_i))}{\partial \omega^T} = \frac{1}{1 - \sigma(z_i)} \frac{\partial (1 - \sigma(z_i))}{\partial \omega^T} = -\sigma(z_i)\, x_i$$
It's now trivial to show that
$$\vec\nabla l_i(\omega) = \frac{\partial l_i(\omega)}{\partial \omega^T} = -y_i x_i (1 - \sigma(z_i)) + (1 - y_i) x_i \sigma(z_i) = x_i (\sigma(z_i) - y_i)$$
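The closed-form gradient $\sum_i x_i(\sigma(z_i) - y_i)$ can be checked against central finite differences of the loss on random data; a small sketch (all variable names and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5
X = rng.normal(size=(d, m))        # columns are the samples x_i
y = rng.integers(0, 2, size=m)     # binary labels y_i
w = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    # l(w) = -sum_i [ y_i log sigma(z_i) + (1 - y_i) log(1 - sigma(z_i)) ]
    z = X.T @ w                    # z_i = w^T x_i for every sample
    return -np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))

# Closed-form gradient: sum_i x_i (sigma(z_i) - y_i)
grad = X @ (sigmoid(X.T @ w) - y)

# Finite-difference check of each component
eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
assert np.allclose(grad, numeric, atol=1e-5)
```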
whew!
Our last step is to compute the Hessian
$$\vec\nabla^2 l_i(\omega) = \frac{\partial^2 l_i(\omega)}{\partial \omega\, \partial \omega^T} = x_i x_i^T\, \sigma(z_i)(1 - \sigma(z_i))$$
For $m$ samples we have $\vec\nabla^2 l(\omega) = \sum_{i=1}^m x_i x_i^T \sigma(z_i)(1 - \sigma(z_i))$. This is equivalent to concatenating the column vectors $x_i \in \mathbb{R}^d$ into a matrix $X$ of size $d \times m$, such that $\sum_{i=1}^m x_i x_i^T = X X^T$. The scalar terms are combined in a diagonal matrix $D$ such that $D_{ii} = \sigma(z_i)(1 - \sigma(z_i))$. Finally, we conclude that
$$\vec H(\omega) = \vec\nabla^2 l(\omega) = X D X^T$$
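The equivalence between the per-sample sum and the compact form $X D X^T$ is easy to confirm numerically; a minimal sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 5
X = rng.normal(size=(d, m))        # columns are the samples x_i
w = rng.normal(size=d)

sigma = 1.0 / (1.0 + np.exp(-(X.T @ w)))   # sigma(z_i) for every sample
D = np.diag(sigma * (1.0 - sigma))         # D_ii = sigma(z_i)(1 - sigma(z_i))

# Vectorized Hessian H = X D X^T ...
H = X @ D @ X.T

# ... equals the per-sample sum  sum_i x_i x_i^T sigma(z_i)(1 - sigma(z_i))
H_sum = sum(np.outer(X[:, i], X[:, i]) * sigma[i] * (1.0 - sigma[i])
            for i in range(m))
assert np.allclose(H, H_sum)
```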
A faster approach can be derived by considering all samples at once from the beginning and instead working with matrix derivatives. As an extra note, with this formulation it's trivial to show that $l(\omega)$ is convex. Let $\delta$ be any vector such that $\delta \in \mathbb{R}^d$. Then
$$\delta^T \vec H(\omega)\, \delta = \delta^T \vec\nabla^2 l(\omega)\, \delta = \delta^T X D X^T \delta = (X^T \delta)^T D\, (X^T \delta) = \left\| D^{1/2} X^T \delta \right\|^2 \ge 0$$
since the diagonal entries of $D$ are positive (so $D^{1/2}$ is well defined) and any squared norm is nonnegative. This implies $H$ is positive semi-definite and therefore $l$ is convex (but not strongly convex).
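Positive semi-definiteness can also be observed directly from the eigenvalues of the (symmetric) Hessian; a short sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 10
X = rng.normal(size=(d, m))
w = rng.normal(size=d)

sigma = 1.0 / (1.0 + np.exp(-(X.T @ w)))
D = np.diag(sigma * (1.0 - sigma))
H = X @ D @ X.T

# All eigenvalues of a positive semi-definite matrix are nonnegative
# (small tolerance for floating-point round-off)
eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals >= -1e-10)
```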