The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. single vs. all). The actual difference, however, is simply how we choose our training labels.
For binary classification we can choose the labels y=±1 or y=0,1.
As already stated, the logistic function σ(z) is a good choice since it has the form of a probability, i.e. σ(−z)=1−σ(z) and σ(z)∈(0,1) as z→±∞. If we pick the labels y=0,1 we may assign
$$P(y=1\mid z)=\sigma(z)=\frac{1}{1+e^{-z}}$$
$$P(y=0\mid z)=1-\sigma(z)=\frac{1}{1+e^{z}}$$
which can be written more compactly as $P(y\mid z)=\sigma(z)^{y}\,(1-\sigma(z))^{1-y}$.
It is easier to maximize the log-likelihood, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For m samples $\{x_i, y_i\}$, after taking the natural logarithm and some simplification, we find:
$$l(z)=-\log\Big(\prod_i^m P(y_i\mid z_i)\Big)=-\sum_i^m \log\big(P(y_i\mid z_i)\big)=\sum_i^m\big(-y_i z_i+\log(1+e^{z_i})\big)$$
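As a quick sanity check on this simplification (my own sketch, not part of the original answer; the function and variable names are illustrative only), the compact Bernoulli form and the simplified per-sample term can be compared numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=5)          # a few random scores z_i
y = rng.integers(0, 2, size=5)  # labels y_i in {0, 1}

# Negative log-likelihood from the compact Bernoulli form
nll_bernoulli = -np.log(sigmoid(z)**y * (1 - sigmoid(z))**(1 - y))

# Simplified per-sample term: -y*z + log(1 + e^z)
nll_simplified = -y * z + np.log(1 + np.exp(z))

print(np.allclose(nll_bernoulli, nll_simplified))  # True
```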
The full derivation and additional information can be found in this Jupyter notebook. On the other hand, we may have instead used the labels y=±1. It is then pretty obvious that we can assign
P(y|z)=σ(yz).
It is also obvious that what was P(y=0|z) before now corresponds to P(y=−1|z)=σ(−z). Following the same steps as before, we minimize in this case the loss function
$$L(z)=-\log\Big(\prod_j^m P(y_j\mid z_j)\Big)=-\sum_j^m \log\big(P(y_j\mid z_j)\big)=\sum_j^m \log\big(1+e^{-y_j z_j}\big)$$
where the last step follows after taking the reciprocal induced by the negative sign. While we should not equate these two forms, since in each form y takes different values, the two are nevertheless equivalent:
$$-y_i z_i+\log(1+e^{z_i})\equiv\log\big(1+e^{-y_j z_j}\big)$$
The case $y_i=1$ is trivial to show. If $y_i\neq 1$, then $y_i=0$ on the left hand side and $y_j=-1$ on the right hand side, and both sides reduce to $\log(1+e^{z})$.
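To make the equivalence concrete, here is a small Python sketch (my own illustration, not part of either answer) that evaluates both forms on the same scores, mapping each label in {0,1} to the corresponding label in {−1,+1}:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=6)             # scores z_i
y01 = rng.integers(0, 2, size=6)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # same labels mapped to {-1, +1}

# Loss with {0, 1} labels: -y*z + log(1 + e^z)
loss_01 = -y01 * z + np.log(1 + np.exp(z))

# Loss with {-1, +1} labels: log(1 + e^{-y*z})
loss_pm = np.log(1 + np.exp(-ypm * z))

print(np.allclose(loss_01, loss_pm))  # True
```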
Finally, we can use the identity $\partial\sigma(z)/\partial z=\sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2 l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
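For completeness, here is what those two quantities look like under the common parameterisation $z_i=w^{\top}x_i$ (my own addition; the answer itself does not spell this out):

$$\nabla_w\, l(w)=\sum_i^m\big(\sigma(z_i)-y_i\big)\,x_i,\qquad \nabla_w^2\, l(w)=\sum_i^m\sigma(z_i)\big(1-\sigma(z_i)\big)\,x_i x_i^{\top}$$

Since the Hessian is a sum of positive semi-definite matrices, the loss is convex in w.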
I learned the loss function for logistic regression as follows.
Logistic regression performs binary classification, and so the label outputs are binary, 0 or 1. Let P(y=1|x) be the probability that the binary output y is 1 given the input feature vector x. The coefficients w are the weights that the algorithm is trying to learn.

$$P(y=1\mid x)=\frac{1}{1+e^{-w^{\top}x}}$$
Because logistic regression is binary, the probability P(y=0|x) is simply 1 minus the term above:

$$P(y=0\mid x)=1-\frac{1}{1+e^{-w^{\top}x}}$$
The loss function J(w) is the sum of (A) the output y=1 multiplied by P(y=1) and (B) the output y=0 multiplied by P(y=0), for one training example, summed over m training examples:

$$J(w)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log P(y=1)+(1-y^{(i)})\log P(y=0)\Big]$$
where $y^{(i)}$ indicates the ith label in your training data. If a training instance has a label of 1, then $y^{(i)}=1$, leaving the left summand in place but making the right summand with $1-y^{(i)}$ become 0. On the other hand, if a training instance has y=0, then the right summand with the term $1-y^{(i)}$ remains in place, but the left summand becomes 0. Log probability is used for ease of calculation.
If we then replace P(y=1) and P(y=0) with the earlier expressions, we get:

$$J(w)=-\frac{1}{m}\sum_{i=1}^{m}\bigg[y^{(i)}\log\Big(\frac{1}{1+e^{-w^{\top}x^{(i)}}}\Big)+(1-y^{(i)})\log\Big(1-\frac{1}{1+e^{-w^{\top}x^{(i)}}}\Big)\bigg]$$
You can read more about this form in these Stanford lecture notes.
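As a concrete illustration of this loss (my own sketch, not part of the answer; the variable and function names are arbitrary), J(w) can be computed with an explicit loop over the m training examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """Negative log-likelihood J(w), averaged over m examples.

    X : (m, n) matrix of feature vectors x^(i)
    y : (m,) vector of labels in {0, 1}
    w : (n,) weight vector
    """
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        p1 = sigmoid(X[i] @ w)  # P(y=1 | x^(i))
        total += y[i] * np.log(p1) + (1 - y[i]) * np.log(1 - p1)
    return -total / m

# Tiny usage example with made-up data
X = np.array([[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]])
y = np.array([1, 0, 1])
w = np.zeros(2)
print(logistic_loss(w, X, y))  # log(2) ≈ 0.6931 when w = 0
```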
Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. The cross-entropy loss can be split into two separate cost functions, one for y=1 and one for y=0: writing h(x) for the predicted probability P(y=1|x), the cost is −log(h(x)) if y=1 and −log(1−h(x)) if y=0.
When we put them together we have:

$$J=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h(x^{(i)})\big)\Big]$$
Multiplying by y and (1−y) in the above equation is a sneaky trick that lets us use the same equation to solve for both the y=1 and y=0 cases. If y=0, the first term cancels out; if y=1, the second term cancels out. In both cases we only perform the operation we need to perform.
If you don't want to use a for loop, you can try a vectorized form of the equation above. The entire explanation can be viewed on the Machine Learning Cheatsheet.
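Such a vectorized form might look like the following sketch (my own illustration, mirroring the loop-based version above but without the explicit loop; the feature matrix X, labels y in {0,1}, and weight vector w are assumed, since none of these names appear in the text above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, X, y):
    """Vectorized cross-entropy (log loss), no explicit for loop.

    X : (m, n) feature matrix
    y : (m,) labels in {0, 1}
    w : (n,) weight vector
    """
    m = X.shape[0]
    h = sigmoid(X @ w)  # predictions h(x) for all m examples at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Tiny usage example with made-up data
X = np.array([[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]])
y = np.array([1, 0, 1])
w = np.zeros(2)
print(cross_entropy_loss(w, X, y))  # log(2) ≈ 0.6931 when w = 0
```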