I'm having trouble deriving the Hessian of the objective function $l(\theta)$, where $h_\theta(x)$ is the hypothesis.
Does anyone know a clean and simple way to derive $X^T D X$?
Answers:
Here I derive all the necessary properties and identities so that the solution is self-contained, but otherwise this derivation is clean and simple. Let us formalize our notation and write the loss function a bit more compactly. Consider $m$ samples $\{x_i, y_i\}$, such that $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Recall that in binary logistic regression the hypothesis function $h_\theta$ is typically the logistic function. Formally,
$$h_\theta(x_i) = \sigma(\omega^T x_i) = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},$$
where $\omega \in \mathbb{R}^d$ and $z_i = \omega^T x_i$. The loss function (which I believe is missing a negative sign in the OP's version) is then defined as:
$$l(\omega) = \sum_{i=1}^m -\left( y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right)$$
There are two important properties of the logistic function which I derive here for future reference. First, note that $1 - \sigma(z) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{z}} = \sigma(-z)$.
Also note that
$$\frac{\partial}{\partial z}\sigma(z) = \frac{\partial}{\partial z}(1 + e^{-z})^{-1} = e^{-z}(1 + e^{-z})^{-2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))$$
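As a quick sanity check, both identities can be verified numerically; here is a small NumPy sketch (the function name `sigmoid` is my own):

```python
import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# Property 1: 1 - sigma(z) = sigma(-z)
assert np.allclose(1.0 - sigmoid(z), sigmoid(-z))

# Property 2: d/dz sigma(z) = sigma(z) * (1 - sigma(z)),
# checked against a central finite difference
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
analytic = sigmoid(z) * (1.0 - sigmoid(z))
assert np.allclose(numeric, analytic, atol=1e-8)
```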
Instead of taking derivatives with respect to components, here we will work directly with vectors (you can review derivatives with vectors here). The Hessian of the loss function $l(\omega)$ is given by $\vec\nabla^2 l(\omega)$, but first recall that $\frac{\partial z}{\partial \omega} = x^T$ and $\frac{\partial z}{\partial \omega^T} = x$.
Let $l_i(\omega) = -y_i \log \sigma(z_i) - (1 - y_i) \log(1 - \sigma(z_i))$. Using the properties we derived above and the chain rule,
$$\frac{\partial \log \sigma(z_i)}{\partial \omega^T} = \frac{1}{\sigma(z_i)} \frac{\partial \sigma(z_i)}{\partial \omega^T} = \frac{1}{\sigma(z_i)} \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial \omega^T} = (1 - \sigma(z_i))\, x_i$$
$$\frac{\partial \log(1 - \sigma(z_i))}{\partial \omega^T} = \frac{1}{1 - \sigma(z_i)} \frac{\partial (1 - \sigma(z_i))}{\partial \omega^T} = -\sigma(z_i)\, x_i$$
It's now trivial to show that
$$\vec\nabla l_i(\omega) = \frac{\partial l_i(\omega)}{\partial \omega^T} = -y_i x_i (1 - \sigma(z_i)) + (1 - y_i) x_i \sigma(z_i) = x_i (\sigma(z_i) - y_i)$$
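The closed-form gradient $\sum_i x_i(\sigma(z_i) - y_i)$ can be checked against central finite differences of the loss on random data; a small sketch (all variable names and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5
X = rng.normal(size=(d, m))        # columns are the samples x_i
y = rng.integers(0, 2, size=m)     # binary labels y_i
w = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    # l(w) = -sum_i [ y_i log sigma(z_i) + (1 - y_i) log(1 - sigma(z_i)) ]
    z = X.T @ w                    # z_i = w^T x_i for every sample
    return -np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))

# Closed-form gradient: sum_i x_i (sigma(z_i) - y_i)
grad = X @ (sigmoid(X.T @ w) - y)

# Finite-difference check of each component
eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
assert np.allclose(grad, numeric, atol=1e-5)
```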
whew!
Our last step is to compute the Hessian
$$\vec\nabla^2 l_i(\omega) = \frac{\partial^2 l_i(\omega)}{\partial \omega\, \partial \omega^T} = x_i x_i^T\, \sigma(z_i)(1 - \sigma(z_i))$$
For $m$ samples we have $\vec\nabla^2 l(\omega) = \sum_{i=1}^m x_i x_i^T \sigma(z_i)(1 - \sigma(z_i))$. This is equivalent to concatenating the column vectors $x_i \in \mathbb{R}^d$ into a matrix $X$ of size $d \times m$, such that $\sum_{i=1}^m x_i x_i^T = X X^T$. The scalar terms are combined in a diagonal matrix $D$ such that $D_{ii} = \sigma(z_i)(1 - \sigma(z_i))$. Finally, we conclude that
$$\vec H(\omega) = \vec\nabla^2 l(\omega) = X D X^T$$
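The equivalence between the per-sample sum and the compact form $X D X^T$ is easy to confirm numerically; a minimal sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 5
X = rng.normal(size=(d, m))        # columns are the samples x_i
w = rng.normal(size=d)

sigma = 1.0 / (1.0 + np.exp(-(X.T @ w)))   # sigma(z_i) for every sample
D = np.diag(sigma * (1.0 - sigma))         # D_ii = sigma(z_i)(1 - sigma(z_i))

# Vectorized Hessian H = X D X^T ...
H = X @ D @ X.T

# ... equals the per-sample sum  sum_i x_i x_i^T sigma(z_i)(1 - sigma(z_i))
H_sum = sum(np.outer(X[:, i], X[:, i]) * sigma[i] * (1.0 - sigma[i])
            for i in range(m))
assert np.allclose(H, H_sum)
```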
A faster approach can be derived by considering all samples at once from the beginning and instead working with matrix derivatives. As an extra note, with this formulation it's trivial to show that $l(\omega)$ is convex. Let $\delta$ be any vector such that $\delta \in \mathbb{R}^d$. Then
$$\delta^T \vec H(\omega)\, \delta = \delta^T \vec\nabla^2 l(\omega)\, \delta = \delta^T X D X^T \delta = (X^T \delta)^T D\, (X^T \delta) = \left\| D^{1/2} X^T \delta \right\|^2 \ge 0$$
since the diagonal entries of $D$ are positive (so $D^{1/2}$ is well defined) and any squared norm is nonnegative. This implies $H$ is positive semi-definite and therefore $l$ is convex (but not strongly convex).
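Positive semi-definiteness can also be observed directly from the eigenvalues of the (symmetric) Hessian; a short sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 10
X = rng.normal(size=(d, m))
w = rng.normal(size=d)

sigma = 1.0 / (1.0 + np.exp(-(X.T @ w)))
D = np.diag(sigma * (1.0 - sigma))
H = X @ D @ X.T

# All eigenvalues of a positive semi-definite matrix are nonnegative
# (small tolerance for floating-point round-off)
eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals >= -1e-10)
```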