How is Naive Bayes a linear classifier?

I have seen the other thread here, but I don't think its answer addressed the actual question. What I keep reading is that Naive Bayes is a linear classifier (e.g. here), meaning that it draws a linear decision boundary, and that this is shown with the log-odds derivation.

However, I simulated two Gaussian clouds, fitted a decision boundary, and got the following result (library e1071 in R, using naiveBayes()). 1 - green, 0 - red.

As we can see, the decision boundary is non-linear. Is the claim that the parameters (conditional probabilities) form a linear combination in log space, rather than that the classifier itself separates the data linearly?

classification naive-bayes Kevin Pei

How did you create the decision boundary? I suspect it has more to do with your fitting routine than with the classifier's true decision boundary. Normally one would generate a decision boundary by computing the decision at every single point in your quadrant.

Seanv507

That is what I did: I took the two ranges X = [min(x), max(x)] and Y = [min(y), max(y)] with a spacing of 0.1. I then fed all of these grid points to the trained classifier and kept the points where the log odds were between -0.05 and 0.05.

Kevin Pei
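The grid procedure discussed in these comments can be sketched in a few lines. This is a minimal illustration in plain Python, not the e1071 code: the helper names (`fit_gaussian_nb`, `log_odds`), the cloud parameters, and the grid limits are all assumptions made for the example. It fits a Gaussian naive Bayes model by hand, classifies every grid node, and records the positions where neighboring labels differ.

```python
import math
import random

def fit_gaussian_nb(points, labels):
    """Per class: prior, per-feature means and variances (diagonal covariance)."""
    params = {}
    for c in set(labels):
        rows = [p for p, y in zip(points, labels) if y == c]
        n = len(rows)
        means = [sum(r[j] for r in rows) / n for j in range(2)]
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(2)]
        params[c] = (n / len(points), means, variances)
    return params

def log_odds(params, point):
    """log p(c=1 | x) - log p(c=0 | x) under the fitted model."""
    def log_joint(c):
        prior, means, variances = params[c]
        total = math.log(prior)
        for x, m, v in zip(point, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return log_joint(1) - log_joint(0)

random.seed(0)
# Two simulated clouds with different spreads (arbitrary illustrative values).
cloud0 = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(500)]
cloud1 = [(random.gauss(3.0, 2.0), random.gauss(3.0, 0.5)) for _ in range(500)]
points = cloud0 + cloud1
labels = [0] * 500 + [1] * 500
params = fit_gaussian_nb(points, labels)

# Classify every grid node with a spacing of 0.1.
step = 0.1
xs = [-4 + step * i for i in range(141)]
ys = [-4 + step * j for j in range(101)]
grid_labels = [[int(log_odds(params, (x, y)) > 0) for x in xs] for y in ys]

# The boundary lies wherever two horizontally adjacent nodes get different labels.
boundary = [
    (xs[i], y)
    for y, row in zip(ys, grid_labels)
    for i in range(len(xs) - 1)
    if row[i] != row[i + 1]
]
print(len(boundary), "boundary crossings found")
```

Because the two clouds are given different variances here, the log odds are quadratic in each coordinate, so a curved boundary on such a grid is what the model itself implies rather than a plotting artifact.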

Answers:

In general the naive Bayes classifier is not linear, but if the likelihood factors $p(x_i \mid c)$ are from exponential families, the naive Bayes classifier corresponds to a linear classifier in a particular feature space. Here is how to see this.

You can write any naive Bayes classifier as*

$p(c = 1 \mid \mathbf{x}) = \sigma\left( \sum_i \log \frac{p(x_i \mid c = 1)}{p(x_i \mid c = 0)} + \log \frac{p(c = 1)}{p(c = 0)} \right),$

where $\sigma$ is the logistic function. If $p(x_i \mid c)$ is from an exponential family, we can write it as

$p(x_i \mid c) = h_i(x_i)\exp\left(\mathbf{u}_{ic}^\top \phi_i(x_i) - A_i(\mathbf{u}_{ic})\right),$

and hence

$p(c = 1 \mid \mathbf{x}) = \sigma\left( \sum_i \mathbf{w}_i^\top \phi_i(x_i) + b \right),$

where

$\begin{align} \mathbf{w}_i &= \mathbf{u}_{i1} - \mathbf{u}_{i0}, \\ b &= \log \frac{p(c = 1)}{p(c = 0)} - \sum_i \left( A_i(\mathbf{u}_{i1}) - A_i(\mathbf{u}_{i0}) \right). \end{align}$

Note that this is similar to logistic regression (a linear classifier) in the feature space defined by the $\phi_i$. For more than two classes, we analogously get multinomial logistic (or softmax) regression.

If $p(x_i \mid c)$ is Gaussian, then $\phi_i(x_i) = (x_i, x_i^2)$ and we should have

$\begin{align} w_{i1} &= \sigma_1^{-2}\mu_1 - \sigma_0^{-2}\mu_0, \\ w_{i2} &= \tfrac{1}{2}\sigma_0^{-2} - \tfrac{1}{2}\sigma_1^{-2}, \\ b_i &= \log \sigma_0 - \log \sigma_1 + \tfrac{1}{2}\sigma_0^{-2}\mu_0^2 - \tfrac{1}{2}\sigma_1^{-2}\mu_1^2, \end{align}$

assuming $p(c = 1) = p(c = 0) = \frac{1}{2}$.
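The Gaussian case can be checked numerically. The sketch below assumes the standard canonical parametrization of a one-dimensional Gaussian, with natural parameters $\mathbf{u} = (\mu/\sigma^2, -1/(2\sigma^2))$ and log-partition function $A = \mu^2/(2\sigma^2) + \log\sigma$, and builds $\mathbf{w}_i$ and $b$ from the general formulas above. The function names and the parameter values are illustrative assumptions.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def natural_params(mu, sigma):
    """Canonical parameters u for sufficient statistics phi(x) = (x, x^2)."""
    return (mu / sigma ** 2, -1.0 / (2 * sigma ** 2))

def log_partition(mu, sigma):
    """A(u) written in terms of (mu, sigma); the constant from h(x) cancels in the ratio."""
    return mu ** 2 / (2 * sigma ** 2) + math.log(sigma)

def linear_form(mu1, sigma1, mu0, sigma0):
    """w = u_1 - u_0 and b = -(A(u_1) - A(u_0)), assuming equal class priors."""
    u1, u0 = natural_params(mu1, sigma1), natural_params(mu0, sigma0)
    w = (u1[0] - u0[0], u1[1] - u0[1])
    b = -(log_partition(mu1, sigma1) - log_partition(mu0, sigma0))
    return w, b

w, b = linear_form(mu1=2.0, sigma1=1.5, mu0=-1.0, sigma0=0.5)

# Compare the direct log-likelihood ratio with w . phi(x) + b at a few points.
max_gap = 0.0
for x in (-3.0, 0.0, 0.7, 4.0):
    direct = gaussian_logpdf(x, 2.0, 1.5) - gaussian_logpdf(x, -1.0, 0.5)
    via_features = w[0] * x + w[1] * x ** 2 + b
    max_gap = max(max_gap, abs(direct - via_features))
print("largest discrepancy:", max_gap)
```

The weight on $x^2$ is non-zero whenever the two class variances differ, which is why the classifier is linear in $(x, x^2)$ but not in $x$ alone.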

* Here is how to derive this result:

$\begin{align} p(c = 1 \mid \mathbf{x}) &= \frac{p(\mathbf{x} \mid c = 1) p(c = 1)}{p(\mathbf{x} \mid c = 1) p(c = 1) + p(\mathbf{x} \mid c = 0) p(c = 0)} \\ &= \frac{1}{1 + \frac{p(\mathbf{x} \mid c = 0) p(c = 0)}{p(\mathbf{x} \mid c = 1) p(c = 1)}} \\ &= \frac{1}{1 + \exp\left( -\log\frac{p(\mathbf{x} \mid c = 1) p(c = 1)}{p(\mathbf{x} \mid c = 0) p(c = 0)} \right)} \\ &= \sigma\left( \sum_i \log \frac{p(x_i \mid c = 1)}{p(x_i \mid c = 0)} + \log \frac{p(c = 1)}{p(c = 0)} \right) \end{align}$
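As a sanity check on this derivation, the posterior from Bayes' rule and the logistic function of the summed log ratios can be compared numerically. This is a small sketch; the likelihood values and the prior are made-up numbers chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior_bayes(lik1, lik0, prior1):
    """p(c=1 | x) from Bayes' rule with factorized likelihoods."""
    joint1 = prior1 * math.prod(lik1)
    joint0 = (1.0 - prior1) * math.prod(lik0)
    return joint1 / (joint1 + joint0)

def posterior_log_odds(lik1, lik0, prior1):
    """Same posterior, written as sigma(sum of log ratios + log prior ratio)."""
    z = sum(math.log(a / b) for a, b in zip(lik1, lik0))
    z += math.log(prior1 / (1.0 - prior1))
    return sigmoid(z)

# Hypothetical per-feature likelihood values p(x_i | c) for one input x.
lik1 = [0.30, 0.05, 0.80]
lik0 = [0.10, 0.20, 0.40]
p_bayes = posterior_bayes(lik1, lik0, 0.3)
p_sigma = posterior_log_odds(lik1, lik0, 0.3)
print(p_bayes, p_sigma)
```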

Lucas

Thank you for the derivation, which I now understand. Can you explain the notation in equation 2 and below (u, h(x_i), phi(x_i), etc.)? Is P(x_i | c) under an exponential family simply the value taken from the pdf?

Kevin Pei

There are different ways you can express one and the same distribution. The second equation is an exponential family distribution in canonical form. Many distributions are exponential families (Gaussian, Laplace, Dirichlet, Bernoulli, binomial, just to name a few), but their density/mass function is typically not given in canonical form. So you first have to reparametrize the distribution. This table tells you how to compute $\mathbf{u}$ (natural parameters) and $\phi$ (sufficient statistics) for various distributions: en.wikipedia.org/wiki/Exponential_family#Table_of_distributions

Lucas

Notice the important point that $\phi(x) = (x, x^2)$. What this means is that linear classifiers are a linear combination of weights $\mathbf{w}$ and potentially non-linear functions of the features! So, to the original poster's point, a plot of the datapoints may not show that they are separable by a line.

RMurphy

I find this answer misleading: as pointed out in the comment just above, and the answer just below, the Gaussian naive Bayes is not linear in the original feature space, but in a non-linear transform of it. Hence it is not a conventional linear classifier.

Gael Varoquaux

Why, if $p(x_i|c)$ is Gaussian, is $\phi_i(x_i)=(x_i,x_i^2)$? I think the sufficient statistic $T(x)$ for the Gaussian distribution should be $x/\sigma$.

Naomi

It is linear only if the class-conditional covariance matrices are the same for both classes. To see this, write down the log of the ratio of the posteriors: you only get a linear function out of it if the corresponding variances are the same. Otherwise it is quadratic.
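For a single Gaussian feature this can be written out directly. As a sketch, with $\mu_c$ and $\sigma_c$ denoting the class-conditional mean and standard deviation (notation assumed here for illustration), the log-likelihood ratio is

$\begin{align} \log \frac{p(x \mid c = 1)}{p(x \mid c = 0)} &= \log \frac{\sigma_0}{\sigma_1} - \frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_0)^2}{2\sigma_0^2}, \end{align}$

whose $x^2$ coefficient is $\frac{1}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}$. With $\sigma_0 = \sigma_1 = \sigma$ the quadratic term cancels and what remains is linear in $x$:

$\begin{align} \log \frac{p(x \mid c = 1)}{p(x \mid c = 0)} &= \frac{\mu_1 - \mu_0}{\sigma^2}\, x - \frac{\mu_1^2 - \mu_0^2}{2\sigma^2}. \end{align}$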

axk

I'd like to add one additional point: the reason for some of the confusion rests on what it means to be performing "Naive Bayes classification".

Under the broad topic of "Gaussian Discriminant Analysis (GDA)" there are several techniques: QDA, LDA, GNB, and DLDA (quadratic DA, linear DA, gaussian naive bayes, diagonal LDA). [UPDATED] LDA and DLDA should be linear in the space of the given predictors. (See, e.g., Murphy, 4.2, pg. 101 for DA and pg. 82 for NB. Note: GNB is not necessarily linear. Discrete NB (which uses a multinomial distribution under the hood) is linear. You can also check out Duda, Hart & Stork section 2.6). QDA is quadratic as other answers have pointed out (and which I think is what is happening in your graphic - see below).

These techniques form a lattice with a nice set of constraints on the "class-wise covariance matrices" $\Sigma_c$ :

QDA: $\Sigma_c$ arbitrary: arbitrary ftr. cov. matrix per class
LDA: $\Sigma_c = \Sigma$ : shared cov. matrix (over classes)
GNB: $\Sigma_c = \mathrm{diag}_c$ : class-wise diagonal cov. matrices (the assumption of ind. in the model $\rightarrow$ diagonal cov. matrix)
DLDA: $\Sigma_c = \mathrm{diag}$ : shared & diagonal cov. matrix
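The note above that discrete NB is linear can be illustrated with the simplest discrete case, binary (Bernoulli) features, used here as a stand-in for the multinomial model. In this sketch the log odds are exactly a weighted sum of the raw features plus a bias. The function names and the probabilities are assumptions made for the example.

```python
import math

def bernoulli_nb_weights(theta1, theta0, prior1=0.5):
    """Weights and bias such that the log odds equal sum_i w_i * x_i + b."""
    w = [math.log(t1 * (1 - t0) / (t0 * (1 - t1))) for t1, t0 in zip(theta1, theta0)]
    b = math.log(prior1 / (1 - prior1))
    b += sum(math.log((1 - t1) / (1 - t0)) for t1, t0 in zip(theta1, theta0))
    return w, b

def direct_log_odds(x, theta1, theta0, prior1=0.5):
    """Log odds computed straight from the naive Bayes joint probabilities."""
    def log_joint(theta, prior):
        return math.log(prior) + sum(math.log(t if xi else 1 - t) for xi, t in zip(x, theta))
    return log_joint(theta1, prior1) - log_joint(theta0, 1 - prior1)

# Hypothetical values of p(x_i = 1 | c) for three binary features.
theta1 = [0.9, 0.2, 0.6]
theta0 = [0.3, 0.5, 0.6]
w, b = bernoulli_nb_weights(theta1, theta0, prior1=0.4)

# The linear form and the direct computation should agree on every input.
max_gap = 0.0
for x in [(0, 0, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0)]:
    linear = sum(wi * xi for wi, xi in zip(w, x)) + b
    max_gap = max(max_gap, abs(linear - direct_log_odds(x, theta1, theta0, 0.4)))
print("largest discrepancy:", max_gap)
```

The third feature has the same probability in both classes, so its weight comes out as zero and it has no influence on the decision.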

While the docs for e1071 claim that it is assuming class-conditional independence (i.e., GNB), I'm suspicious that it is actually doing QDA. Some people conflate "naive Bayes" (making independence assumptions) with "simple Bayesian classification rule". All of the GDA methods are derived from the latter, but only GNB and DLDA use the former.

A big warning, I haven't read the e1071 source code to confirm what it is doing.

MrDrFenner