I was reading about Naive Bayes classification today. Under the heading "Parameter estimation with add-1 smoothing" I read:
Let $c$ refer to a class (such as Positive or Negative), and let $w$ refer to a token or word.

The maximum likelihood estimator for $P(w \mid c)$ is

$$\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c)}{\mathrm{count}(c)} = \frac{\text{number of times } w \text{ occurs in class } c}{\text{total number of words in class } c}.$$

This estimate of $P(w \mid c)$ could be problematic, since it would give probability $0$ to documents containing unknown words. A common way of solving this problem is to use Laplace smoothing.

Let $V$ be the set of words in the training set, and add a new element $\mathrm{UNK}$ (for "unknown") to the set of words.

Define

$$P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V| + 1},$$

where $V$ refers to the vocabulary (the words in the training set).

In particular, any unknown word will have probability

$$\frac{1}{\mathrm{count}(c) + |V| + 1}.$$
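To make the quoted formulas concrete, here is a minimal Python sketch; the class, the words, and the counts are invented purely for illustration:

```python
from collections import Counter

# Toy training counts for one class c (invented for illustration only).
counts_c = Counter({"good": 3, "great": 2, "movie": 5})
count_c = sum(counts_c.values())   # total number of word tokens in class c
V = len(set(counts_c))             # |V|: number of distinct training words

def p_mle(w):
    """Maximum likelihood estimate P(w|c) = count(w,c) / count(c)."""
    return counts_c[w] / count_c

def p_laplace(w):
    """Add-1 (Laplace) estimate P(w|c) = (count(w,c)+1) / (count(c)+|V|+1).
    The extra +1 in the denominator is the slot for the UNK element."""
    return (counts_c[w] + 1) / (count_c + V + 1)

print(p_mle("good"))          # 3/10 = 0.3
print(p_mle("terrible"))      # 0.0  -> the problem described above
print(p_laplace("good"))      # 4/14 ~ 0.286
print(p_laplace("terrible"))  # 1/14 ~ 0.071, i.e. 1/(count(c)+|V|+1) for any unknown word
```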
My question is this: why do we bother with Laplace smoothing at all? If these unknown words that we encounter in the test set have a probability that is obviously almost zero, namely $\frac{1}{\mathrm{count}(c) + |V| + 1}$, what is the point of including them in the model? Why not just disregard and delete them?
Answers:
You always need this "fail-safe" probability.

To see why, consider the worst case, in which none of the words in the training sample appear in the test sentence. Under your model we would then conclude that the sentence is impossible, yet it clearly exists, which is a contradiction.

Another extreme example is the test sentence "Alex met Steve", where "met" appears several times in the training sample but "Alex" and "Steve" do not. Your model would conclude that this statement is very likely not true.
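A quick numeric sketch of this failure mode (the word probabilities below are made up; only "met" is assumed to have been seen in training):

```python
import math

# Hypothetical MLE word probabilities for one class: only "met" appeared
# in the training sample, so "Alex" and "Steve" get probability 0.
p = {"met": 0.05, "Alex": 0.0, "Steve": 0.0}

sentence = ["Alex", "met", "Steve"]
likelihood = math.prod(p[w] for w in sentence)
print(likelihood)  # 0.0 -- a single unseen word zeroes out the whole product
```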
Let's say you have trained your Naive Bayes classifier on two classes, "Ham" and "Spam" (i.e., it classifies emails). For the sake of simplicity, we'll assume the prior probabilities are 50/50.

Now say you have an email $(w_1, w_2, \ldots, w_n)$ that your classifier rates very strongly as "Ham". So far, so good.

Now say you have a second email that is identical to the first except that it contains one additional word $w_{n+1}$ that does not appear in the training vocabulary, so its estimated probability in both classes is $0$. Suddenly,

$$P(\text{Ham} \mid w_1, w_2, \ldots, w_n, w_{n+1}) \propto P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \cdot P(w_{n+1} \mid \text{Ham}) = 0$$

and

$$P(\text{Spam} \mid w_1, w_2, \ldots, w_n, w_{n+1}) \propto P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \cdot P(w_{n+1} \mid \text{Spam}) = 0.$$
Despite the first email being strongly classified in one class, this second email may be classified differently because that last word has a probability of zero.
Laplace smoothing solves this by giving the last word a small non-zero probability for both classes, so that the posterior probabilities don't suddenly drop to zero.
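A minimal sketch of that scenario, assuming made-up word counts and emails (the word "blockchain" stands in for the unseen word $w_{n+1}$):

```python
from collections import Counter

# Invented per-class training counts (bag-of-words).
train = {
    "Ham":  Counter({"meeting": 4, "project": 3, "lunch": 3}),
    "Spam": Counter({"winner": 5, "free": 4, "meeting": 1}),
}
vocab = set().union(*train.values())

def p_word(w, cls, smooth):
    counts = train[cls]
    total = sum(counts.values())
    if smooth:
        # Add-1 smoothing with one extra slot for the unknown (UNK) word.
        return (counts[w] + 1) / (total + len(vocab) + 1)
    return counts[w] / total

def posterior(words, smooth):
    # 50/50 priors, as in the text; returned scores are unnormalized posteriors.
    scores = {}
    for cls in train:
        score = 0.5
        for w in words:
            score *= p_word(w, cls, smooth)
        scores[cls] = score
    return scores

email1 = ["meeting", "project", "lunch"]   # strongly "Ham"
email2 = email1 + ["blockchain"]           # same email plus one unseen word

print(posterior(email1, smooth=False))  # Ham clearly wins
print(posterior(email2, smooth=False))  # both classes collapse to 0.0
print(posterior(email2, smooth=True))   # both non-zero; Ham still wins
```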
This question is rather simple if you are familiar with Bayes estimators, since the result is a direct conclusion of the Bayes estimator.

In the Bayesian approach, a parameter is considered a quantity whose variation can be described by a probability distribution (a prior distribution).

So, if we view the procedure of drawing words as a multinomial distribution, we can solve the question in a few steps.
First, define $m = |V|$ (the size of the vocabulary), let $n_i$ be the count of word $i$ in the class, and let $n = \sum_{i=1}^{m} n_i$.

If we assume the prior distribution of $(p_1, \ldots, p_m)$ is the uniform distribution on the simplex, we can calculate its posterior distribution as

$$p(p_1, \ldots, p_m \mid n_1, \ldots, n_m) = \frac{\Gamma(n + m)}{\prod_{i=1}^{m} \Gamma(n_i + 1)} \prod_{i=1}^{m} p_i^{n_i}.$$

We can see that this is in fact a Dirichlet distribution, $\mathrm{Dir}(n_1 + 1, \ldots, n_m + 1)$, and the expectation of $p_i$ under it is

$$E[p_i \mid n_1, \ldots, n_m] = \frac{n_i + 1}{n + m}.$$

A natural estimate for $p_i$ is the mean of the posterior distribution, so we can give the Bayes estimator of $p_i$:

$$\hat{p}_i = E[p_i \mid n_1, \ldots, n_m] = \frac{n_i + 1}{n + m}.$$
You can see that we reach exactly the same conclusion as Laplace smoothing.
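As a quick sanity check on this derivation, one can compare the Monte Carlo mean of the Dirichlet posterior with the add-1 formula; the counts below are arbitrary:

```python
import numpy as np

# Arbitrary word counts n_1, ..., n_m for one class.
counts = np.array([5, 2, 0, 1])
m = len(counts)        # vocabulary size |V|
n = counts.sum()

# A uniform prior on the simplex is Dirichlet(1, ..., 1),
# so the posterior is Dirichlet(n_1 + 1, ..., n_m + 1).
samples = np.random.default_rng(0).dirichlet(counts + 1, size=200_000)

print(samples.mean(axis=0))    # Monte Carlo posterior mean
print((counts + 1) / (n + m))  # Laplace-smoothed estimate (n_i + 1) / (n + m)
```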
Disregarding those words is another way to handle it. It corresponds to averaging over (integrating out) all missing variables, so the result is different. How?
Assuming the notation used here:
Let's say token $t_k$ does not appear. Instead of using Laplace smoothing (which comes from imposing a Dirichlet prior on the multinomial naive Bayes model), you sum out $t_k$, which corresponds to saying: I take a weighted vote over all possibilities for the unknown tokens (having them or not).

But in practice one prefers the smoothing approach. Instead of ignoring those tokens, you assign them a low probability, which is like thinking: if I have unknown tokens, it is less likely that the document is the kind of document I would otherwise think it is.
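To make the contrast concrete, here is my own small illustration (not from the answer), using a Bernoulli-style feature $t_k$ whose value is unknown at test time: summing it out contributes a factor of 1 to every class, i.e. it is equivalent to dropping it, whereas smoothing would contribute a small class-dependent factor.

```python
# Hypothetical Bernoulli naive Bayes factors for a single feature t_k whose
# value is unknown at test time (these numbers are invented).
p_tk_given_class = {"M": 0.7, "N": 0.2}   # P(t_k present | class)

for cls, p in p_tk_given_class.items():
    # Weighted vote over both possibilities for t_k: present or absent.
    summed_out = p + (1 - p)
    print(cls, summed_out)   # 1.0 for every class -> the factor simply drops out

# Smoothing would instead multiply each class score by a small class-dependent
# probability for the unseen token, which does shift the scores.
```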
You want to know why we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead).
The answer to your question is: not all words have to be unknown in all classes.
Say there are two classes M and N with features A, B and C, as follows:
M: A=3, B=1, C=0
(In the class M, A appears 3 times and B only once)
N: A=0, B=1, C=3
(In the class N, C appears 3 times and B only once)
Let's see what happens when you throw away features that appear zero times.
A) Throw Away Features That Appear Zero Times In Any Class
If you throw away features A and C because they each appear zero times in at least one of the classes, then you are left with only feature B to classify documents with.
And losing that information is a bad thing as you will see below!
If you're presented with a test document as follows:
B=1, C=3
(It contains B once and C three times)
Now, since you've discarded the features A and C, you won't be able to tell whether the above document belongs to class M or class N.
So, losing any feature information is a bad thing!
B) Throw Away Features That Appear Zero Times In All Classes
Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?
No, because that would create its own problems!
The following test document illustrates what would happen if we did that:
A=3, B=1, C=1
The probability of M and N would both become zero (because we did not throw away the zero probability of A in class N and the zero probability of C in class M).
C) Don't Throw Anything Away - Use Smoothing Instead
Smoothing allows you to classify both of the above documents correctly, because (a) you do not lose count information in the classes where it is available and (b) you never have to contend with zero counts.
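A small sketch of this three-feature example, assuming equal priors and add-1 smoothing over the three known features (no UNK slot is needed here, so the denominator is just count(class) + |V|):

```python
train = {
    "M": {"A": 3, "B": 1, "C": 0},
    "N": {"A": 0, "B": 1, "C": 3},
}
vocab = ["A", "B", "C"]

def score(doc, cls, smooth):
    counts = train[cls]
    total = sum(counts.values())
    s = 1.0
    for feat, times in doc.items():
        if smooth:
            p = (counts[feat] + 1) / (total + len(vocab))  # add-1 smoothing
        else:
            p = counts[feat] / total                       # raw MLE
        s *= p ** times
    return s

doc1 = {"B": 1, "C": 3}            # the first test document above
doc2 = {"A": 3, "B": 1, "C": 1}    # the second test document above

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    raw = {cls: score(doc, cls, smooth=False) for cls in train}
    smoothed = {cls: score(doc, cls, smooth=True) for cls in train}
    print(name, "raw:", raw, "smoothed:", smoothed)
# doc1's raw scores already favour N, but doc2's raw scores are 0 for both
# classes; with smoothing, doc1 is classified as N and doc2 as M, as expected.
```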
Naive Bayes Classifiers In Practice
The Naive Bayes classifier in NLTK used to throw away features that had zero counts in any of the classes.
This used to make it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped up from very little training data).
I also came across the same problem while studying Naive Bayes.
In my view, whenever we encounter a test example containing a feature that we never came across during training, the posterior probability becomes 0.

By adding 1, even if we never see a particular feature/class combination during training, the posterior probability will never be 0.
Matt, you are correct, you raise a very good point - yes, Laplace smoothing is quite frankly nonsense! Simply throwing away those features can be a valid approach, particularly when the denominator is also a small number - there is simply not enough evidence to support the probability estimate.
I have a strong aversion to solving any problem via some arbitrary adjustment. The problem here is zeros; the "solution" is to just "add some small value to zero so it's not zero anymore - MAGIC, the problem is no more". Of course that's totally arbitrary.
Your suggestion of better feature selection to begin with is a less arbitrary approach, and in my experience it increases performance. Furthermore, Laplace smoothing in conjunction with naive Bayes as the model, in my experience, worsens the granularity problem - i.e. the problem where output scores tend to be close to 1.0 or 0.0 (if the number of features is infinite, every score will be 1.0 or 0.0 - this is a consequence of the independence assumption).
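That saturation effect is easy to reproduce. The following sketch (my own, with arbitrary per-feature probabilities) shows naive Bayes posteriors crowding toward 0 or 1 as the number of conditionally independent features grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each feature is only weakly informative: slightly more likely under class 1.
p1, p0 = 0.55, 0.45

for n_features in (1, 10, 100, 1000):
    x = rng.random(n_features) < p1           # sample features from class 1
    log_odds = np.where(x, np.log(p1 / p0), np.log((1 - p1) / (1 - p0))).sum()
    posterior = 1 / (1 + np.exp(-log_odds))   # P(class 1 | x) with equal priors
    print(n_features, round(float(posterior), 4))
# As the number of features grows, the posterior saturates near 1.0 (or 0.0):
# the "granularity" problem, where scores crowd toward the extremes.
```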
Alternative techniques for probability estimation do exist (other than maximum likelihood plus Laplace smoothing), but they are massively under-documented. In fact there is a whole field, called Inductive Logic and Inference Processes, that uses many tools from information theory.
What we use in practice is minimum cross-entropy updating, which is an extension of Jeffrey's updating, where we define the convex region of probability space consistent with the evidence to be the region such that, for any point in it, the maximum likelihood estimate lies within the expected absolute deviation of that point.
This has the nice property that, as the number of data points decreases, the estimates piecewise-smoothly approach the prior, and therefore their effect in the Bayesian calculation is null. Laplace smoothing, on the other hand, makes each estimate approach the point of maximum entropy, which may not be the prior, so its effect in the calculation is not null and just adds noise.