14

I am trying to understand the intuition behind kernel SVMs. I understand how a linear SVM works: it finds a decision line that splits the data as well as possible. I also understand the principle of mapping data into a higher-dimensional space, and how this can make it easier to find a linear decision line in that new space. What I do not understand is how a kernel is used to project data points into this new space.

What I know about a kernel is that it effectively represents the "similarity" between two data points. But how does that relate to the projection?

machine-learning svm kernel-trick Karnivaurus

3

If you go into a sufficiently high-dimensional space, all the training data points can be perfectly separated by a plane. That does not mean it will have any predictive power. I think going into a very high-dimensional space is the moral equivalent of (one form of) overfitting.

Mark L. Stone

@Mark L. Stone: That is correct (+1), but it can still be a good question to ask how a kernel can map into an infinite-dimensional space. How does that work? I gave it a try, see my answer.

I would be careful about calling the feature mapping a "projection". The feature mapping is generally a nonlinear transformation.

Paul

A very helpful post on the kernel trick visualizes the inner product space of the kernel and describes how the high-dimensional feature vectors are used to achieve this. Hopefully it answers the question concisely: eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

JStrahl

5

Let $h(x)$ be the projection to the high-dimensional space $\mathcal{F}$. Basically, the kernel function is $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$, which is the inner product. So it is not used to project data points; rather, it is an outcome of the projection. It can be considered a measure of similarity, but in an SVM it is more than that.

The optimization for finding the best separating hyperplane in $\mathcal{F}$ involves $h(x)$ only through the inner product form. That is to say, if you know $K(\cdot,\cdot)$, you do not need to know the exact form of $h(x)$, which makes the optimization easier.

Each kernel $K(\cdot,\cdot)$ also has a corresponding $h(x)$. So if you use an SVM with that kernel, you are implicitly finding the linear decision line in the space that $h(x)$ maps into.

Chapter 12 of The Elements of Statistical Learning gives a brief introduction to SVMs and goes into more detail on the connection between kernels and feature mapping: http://statweb.stanford.edu/~tibs/ElemStatLearn/
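To make the relation $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$ concrete, here is a minimal sketch. It assumes a kernel that is not discussed in this answer, the homogeneous degree-2 polynomial kernel on 2-D inputs, because one explicit $h$ for it is easy to write down. The function names are illustrative only.

```python
import math

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)^2 (assumed example kernel)."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def h(x):
    """One explicit feature map for this kernel: 2-D input -> 3-D feature space."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

x1, x2 = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x1, x2)   # computed entirely in the original 2-D space
feature_value = dot(h(x1), h(x2))    # computed in the 3-D feature space
print(kernel_value, feature_value)   # the two numbers agree up to rounding
```

The kernel evaluation never builds $h(x)$, which is the point of the second paragraph above: the optimization only needs the inner products.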

Lii

Do you mean that for a kernel $K(x,y)$ there is a unique underlying $h(x)$?

2

@fcoppens No; as a trivial example, consider $h$ and $-h$. However, there is a unique reproducing kernel Hilbert space corresponding to that kernel.

Dougal

@Dougal: Then I can agree with you, but the answer above said "a corresponding $h$", so I wanted to be sure. For the RKHS I see, but do you think it is possible to explain in an intuitive way what this transformation $h$ looks like for a kernel $K(x,y)$?

@fcoppens In general, no; finding explicit representations of these maps is hard. For certain kernels, though, it is either not too hard or it has been done before.

Dougal

1

@fcoppens You are right, $h(x)$ is not unique. You can easily make changes to $h(x)$ while keeping the inner product $\langle h(x), h(x')\rangle$ the same. However, you can consider them as basis functions, and the space they span (i.e. the RKHS) is unique.

Lii
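The non-uniqueness discussed in these comments can be checked numerically. The sketch below assumes the degree-2 polynomial kernel $K(x,x') = (x \cdot x')^2$ purely as an illustration (it is not named in the thread) and compares three different maps: $h$, $-h$, and $h$ followed by an arbitrary rotation in feature space. All names and the angle are hypothetical.

```python
import math

def h(x):
    """One explicit feature map for K(x, x') = (x . x')^2 on 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def neg_h(x):
    """The trivial alternative map -h."""
    return tuple(-c for c in h(x))

def rotated_h(x, theta=0.7):
    """h followed by a rotation of the first two feature coordinates (theta is arbitrary)."""
    a, b, c = h(x)
    return (math.cos(theta) * a - math.sin(theta) * b,
            math.sin(theta) * a + math.cos(theta) * b,
            c)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

x, x_prime = (1.0, 2.0), (3.0, -1.0)
# Each map gives the same inner product, hence the same kernel value.
values = [dot(m(x), m(x_prime)) for m in (h, neg_h, rotated_h)]
print(values)
```

Any orthogonal transformation of the feature space preserves inner products, so it yields another valid map for the same kernel.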

4

The useful properties of kernel SVMs are not universal; they depend on the choice of kernel. To get intuition, it helps to look at one of the most commonly used kernels, the Gaussian kernel. Remarkably, this kernel turns the SVM into something very much like a k-nearest neighbor classifier.

This answer explains the following:

Why perfect separation of positive and negative training data is always possible with a Gaussian kernel of sufficiently small bandwidth (at the cost of overfitting).
How this separation may be interpreted as linear in a feature space.
How the kernel is used to construct the mapping from data space to feature space. Spoiler: the feature space is a very mathematically abstract object, with an unusual abstract inner product based on the kernel.

1. Achieving perfect separation

Perfect separation is always possible with a Gaussian kernel because of the kernel's locality properties, which lead to an arbitrarily flexible decision boundary. For a sufficiently small kernel bandwidth, the decision boundary will look as if you just drew little circles around the points wherever they are needed to separate the positive and negative examples:

(Credit: Andrew Ng's online machine learning course.)

So why does this occur from a mathematical perspective?

Consider the standard setup: you have a Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp(- ||\mathbf{x}-\mathbf{z}||^2 / \sigma^2)$ and training data $(\mathbf{x}^{(1)},y^{(1)}), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(n)},y^{(n)})$, where the $y^{(i)}$ values are $\pm 1$. We want to learn a classifier function

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x})$

Now how will we ever assign the weights $w_i$? Do we need infinite-dimensional spaces and a quadratic programming algorithm? No, because I just want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $||\mathbf{x}^{(i)} - \mathbf{x}^{(j)}||$ between any two training examples, and I just set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have

$\hat{y}(\mathbf{x}^{(k)}) = \sum_{i=1}^n y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)},\mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} + \epsilon$

where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $\mathbf{x}^{(k)}$ is a billion sigmas away from any other point, so for all $i \neq k$ we have

$K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = \exp(- ||\mathbf{x}^{(i)} - \mathbf{x}^{(k)}||^2 / \sigma^2) \approx 0.$

Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data. In practice this would be terribly overfitting, but it shows the tremendous flexibility of the Gaussian kernel SVM and how it can act very similarly to a nearest neighbor classifier.
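The argument can be tried out on a toy dataset. The sketch below is only an illustration under stated assumptions: the five points and labels are made up, and $\sigma$ is set to one tenth of the smallest pairwise distance rather than a billionth, which is already enough for the cross terms to vanish numerically. All weights are $w_i = 1$, as in the text.

```python
import math

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / sigma^2), the form used in this answer."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

# Made-up training data with deliberately interleaved labels.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2), (0.5, 0.5)]
y = [+1, -1, +1, -1, +1]

# Bandwidth far below the smallest distance between any two training points.
min_dist = min(math.dist(a, b) for i, a in enumerate(X) for b in X[i + 1:])
sigma = min_dist / 10.0

def y_hat(x):
    """Classifier with all weights w_i = 1: sum_i y_i K(x_i, x)."""
    return sum(y_i * gaussian_kernel(x_i, x, sigma) for x_i, y_i in zip(X, y))

predictions = [1 if y_hat(x_k) > 0 else -1 for x_k in X]
print(predictions == y)  # every training point is classified correctly
```

No optimization is involved: each training point simply dominates the sum in its own tiny neighborhood, which is the nearest-neighbor-like behavior described above.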

2. Kernel SVM learning as linear separation

The fact that this can be interpreted as "perfect linear separation in an infinite-dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an abstract inner product in some new feature space:

$K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x}^{(j)})\rangle$

where $\Phi(\mathbf{x})$ is the mapping from the data space into the feature space. It follows immediately that $\hat{y}(\mathbf{x})$ can be written as a linear function in the feature space:

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x})\rangle = L(\Phi(\mathbf{x}))$

where the linear function $L(\mathbf{v})$ is defined on feature space vectors $\mathbf{v}$ as

$L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\mathbf{v}\rangle$

This function is linear in $\mathbf{v}$ because it's just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is just $L(\mathbf{v}) = 0$ , the level set of a linear function. This is the very definition of a hyperplane in the feature space.
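For a kernel whose feature map is finite-dimensional, the identity $\hat{y}(\mathbf{x}) = L(\Phi(\mathbf{x}))$ can be verified directly. The sketch below swaps in the degree-2 polynomial kernel $K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$ instead of the Gaussian kernel, because its $\Phi$ fits in three coordinates. The data, labels, and weights are arbitrary illustrative values, not the output of a trained SVM.

```python
import math

def kernel(x, z):
    """Assumed example kernel: K(x, z) = (x . z)^2 on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """An explicit feature map with <phi(x), phi(z)> = K(x, z)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 2.0)]
y = [+1, +1, -1, -1]
w = [0.5, 1.0, 0.25, 2.0]  # arbitrary weights for illustration only

def y_hat_kernel(x):
    """Kernel form: sum_i w_i y_i K(x_i, x)."""
    return sum(w_i * y_i * kernel(x_i, x) for w_i, y_i, x_i in zip(w, y, X))

# Collapse the sum into one fixed vector in feature space: sum_i w_i y_i phi(x_i).
normal = [sum(w_i * y_i * phi(x_i)[d] for w_i, y_i, x_i in zip(w, y, X))
          for d in range(3)]

def y_hat_linear(x):
    """Linear form: L(phi(x)) = <normal, phi(x)>."""
    return dot(normal, phi(x))

test_point = (0.3, -0.7)
print(y_hat_kernel(test_point), y_hat_linear(test_point))
```

The vector `normal` plays the role of the hyperplane's normal vector; with the Gaussian kernel the same vector exists only abstractly, which is why the kernel form is the one that gets computed.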

3. How the kernel is used to construct the feature space

Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$ . It is possible to write down a formula for $\Phi$ but the feature space it maps to is quite abstract and is only really used for proving theoretical results about SVM. If you're still interested, here's how it works.

Basically we define an abstract vector space $V$ where each vector is a function from $\mathcal{X}$ to $\mathbb{R}$ . A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$

(Here the $\mathbf{x}^{(i)}$ are just an arbitrary set of points and need not be the same as the training set.) It is convenient to write $f$ more compactly as

$f = \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}$

where $K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$ is a function giving a "slice" of the kernel at $\mathbf{x}$.

The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$\langle \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}, \sum_{j=1}^n \beta_j K_{\mathbf{x}^{(j)}} \rangle = \sum_{i,j} \alpha_i \beta_j K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$

This definition is very deliberate: its construction ensures the identity we need for linear separation, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$ .

With the feature space defined in this way, $\Phi$ is a mapping $\mathcal{X} \rightarrow V$ , taking each point $\mathbf{x}$ to the "kernel slice" at that point:

$\Phi(\mathbf{x}) = K_\mathbf{x}, \quad \text{where} \quad K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y}).$

You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.
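The construction can be mimicked in a few lines by storing each vector $f$ as its list of coefficients and center points. This is a minimal sketch under assumptions of my own: the list-of-pairs representation, the helper names, and the choice $\sigma = 1$ are all hypothetical. It checks that the abstract inner product gives $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$, and also that $\langle f, K_\mathbf{x} \rangle = f(\mathbf{x})$, the reproducing property that follows from the same definition.

```python
import math

def K(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)

def Phi(x):
    """Feature map: the kernel slice K_x, stored as a one-term combination."""
    return [(1.0, x)]

def inner(f, g):
    """<sum_i a_i K_{x_i}, sum_j b_j K_{z_j}> = sum_{i,j} a_i b_j K(x_i, z_j)."""
    return sum(a * b * K(x, z) for a, x in f for b, z in g)

def evaluate(f, x):
    """Evaluate the function f = sum_i a_i K_{x_i} at the point x."""
    return sum(a * K(p, x) for a, p in f)

x, y = (0.0, 1.0), (2.0, -1.0)
# An arbitrary vector of V: coefficients and centers are made up.
f = [(0.5, (1.0, 1.0)), (-2.0, (0.0, 0.0)), (1.5, (3.0, 1.0))]

print(inner(Phi(x), Phi(y)), K(x, y))       # identical by construction
print(inner(f, Phi(x)), evaluate(f, x))     # reproducing property
```

Nothing here ever materializes an infinite-dimensional vector; a "vector" is just a finite recipe of kernel slices, and every inner product reduces to kernel evaluations.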

Paul

Great explanation, but I think you have missed a minus sign in the definition of the Gaussian kernel: $K(\mathbf{x},\mathbf{z})=\exp(-||\mathbf{x}-\mathbf{z}||^2/\sigma^2)$. As it's written, it does not make sense with the $\epsilon$ found in part (1).

hqxortn

1

For the background and the notations I refer to How to calculate decision boundary from support vectors?.

So the features in the 'original' space are the vectors $x_i$, the binary outcomes are $y_i \in \{-1, +1\}$, and the Lagrange multipliers are $\alpha_i$.

As said by @Lii (+1), the kernel can be written as $K(x,y)=h(x) \cdot h(y)$ (where '$\cdot$' represents the inner product).

I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof, it just wants to give some feeling of how I think that this works. Do not hesitate to correct me if I am wrong.

I have to 'transform' my feature space (so my $x_i$ ) into some 'new' feature space in which the linear separation will be solved.

For each observation $x_i$, I define a function $\phi_i(x)=K(x_i,x)$, so I have one function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space, which I denote $V=\operatorname{span}(\phi_i, i=1,2,\dots,N)$.

I will try to argue that $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in the vector space $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x)=K(x_i,x)$ (see above, we defined $\phi_i$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.

The transformation, that maps my original feature space to $V$ is defined as

$\Phi: x_i \to \phi_i(x)=K(x_i, x)$ .

This map $\Phi$ maps my original feature space onto a vector space whose dimension can go up to the size of my training sample.

Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on the kernel, have a dimension that goes up to the size of the training sample, and (d) produces a space $V$ whose vectors look like $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

Looking at the function $f(x)$ in How to calculate decision boundary from support vectors? it can be seen that $f(x)=\sum_i y_i \alpha_i \phi_i(x)+b$ .

In other words, $f(x)$ is a linear combination of the $\phi_i$ and this is a linear separator in the V-space : it is a particular choice of the $\gamma_i$ namely $\gamma_i=\alpha_i y_i$ !
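As a small illustration of this view, the sketch below uses the four XOR points, which cannot be separated by a line in the original plane, and a Gaussian kernel with $\sigma = 1$. The multipliers $\alpha_i = 1$ and the offset $b = 0$ are hypothetical values chosen by hand, not the solution of the SVM's quadratic program; they merely show that a suitable choice of $\gamma_i = \alpha_i y_i$ separates the classes.

```python
import math

def K(x, z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

# XOR-style training sample: not linearly separable in the original 2-D space.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [+1, +1, -1, -1]

alpha = [1.0, 1.0, 1.0, 1.0]  # hypothetical multipliers, not from a QP solver
b = 0.0                       # hypothetical offset

def phi_values(x):
    """Evaluate the N functions phi_i(x) = K(x_i, x) at the point x."""
    return [K(x_i, x) for x_i in X]

gamma = [a_i * y_i for a_i, y_i in zip(alpha, y)]

def f(x):
    """f(x) = sum_i gamma_i phi_i(x) + b, a linear combination of the phi_i."""
    return sum(g * p for g, p in zip(gamma, phi_values(x))) + b

signs = [1 if f(x_i) > 0 else -1 for x_i in X]
print(signs == y)  # the sign of f matches every label
```

An actual SVM would pick the $\alpha_i$ by maximizing the margin, but the functional form of $f$ is the same.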

The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.

This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ , with a different dimension. This dimension depends on the kernel you use and for the RBF kernel this dimension can go up to the size of the training sample.

So kernels are a technique that allows the SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?

Community

"for each element of my training sample" -- is element here referring to a row or column (i.e. feature )

user1761806

what is x and x_i? If my X is an input of 5 columns, and 100 rows, what would x and x_i be?

user1761806

@user1761806 an element is a row. The notation is explained in the link at the beginning of the answer

1

Transform predictors (input data) to a high-dimensional feature space. It is sufficient to just specify the kernel for this step and the data is never explicitly transformed to the feature space. This process is commonly known as the kernel trick.

Let me explain it. The kernel trick is the key here. Consider the case of a Radial Basis Function (RBF) Kernel here. It transforms the input to infinite dimensional space. The transformation of input $x$ to $\phi(x)$ can be represented as shown below (taken from http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf)

The input space is finite dimensional but the transformed space is infinite dimensional. Transforming the input to an infinite dimensional space is something that happens as a result of the kernel trick. Here $x$ is the input and $\phi(x)$ is the transformed input. But $\phi$ is never computed explicitly; instead the product $\phi(x_i)^T\phi(x)$ is computed, which is just the exponential of the negative scaled squared distance between $x_i$ and $x$ .
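For one-dimensional inputs the infinite-dimensional map can be written out by expanding the exponential in a Taylor series, and truncating it shows the inner product converging to the kernel value. The sketch below assumes the parametrization $K(x_i,x) = \exp(-\gamma (x_i - x)^2)$ with $\gamma = 1$; the function names and the test points are illustrative.

```python
import math

def rbf(x, z, gamma=1.0):
    """RBF kernel for scalar inputs: exp(-gamma * (x - z)^2)."""
    return math.exp(-gamma * (x - z) ** 2)

def phi_truncated(x, n_terms, gamma=1.0):
    """First n_terms coordinates of the infinite feature map.

    Coordinate k is exp(-gamma x^2) * sqrt((2 gamma)^k / k!) * x^k, obtained by
    expanding exp(2 gamma x z) as a power series in x * z.
    """
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(n_terms)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x_i, x = 0.8, -0.5
exact = rbf(x_i, x)  # one cheap evaluation, no feature vectors needed
approximations = {n: dot(phi_truncated(x_i, n), phi_truncated(x, n))
                  for n in (1, 3, 6, 12, 20)}
for n, value in approximations.items():
    print(n, value, abs(value - exact))
```

The error shrinks as more coordinates are kept, but the kernel evaluation `rbf(x_i, x)` gives the limit directly, which is why the transformed input never has to be formed.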

There is a related question Feature map for the Gaussian kernel to which there is a nice answer /stats//a/69767/86202.

The output or decision function is a function of the kernel matrix $K(x_i,x)=\phi(x_i)^T\phi(x)$ and not of the input $x$ or transformed input $\phi$ directly.

prashanth

0

Mapping to a higher dimension is merely a trick to solve a problem that is defined in the original dimension; so concerns such as overfitting your data by going into a dimension with too many degrees of freedom are not a byproduct of the mapping process, but are inherent in your problem definition.

Basically, all that mapping does is converting conditional classification in the original dimension to a plane definition in the higher dimension, and because there is a 1 to 1 relationship between the plane in the higher dimension and your conditions in the lower dimension, you can always move between the two.

Taking the problem of overfitting, clearly, you can overfit any set of observations by defining enough conditions to isolate each observation into its own class, which is equivalent to mapping your data to (n-1)D, where n is the number of your observations.

Taking the simplest problem, where your observations are [[1,-1], [0,0], [1,1]] [[feature, value]], by moving into 2D and separating your data with a line, you are simply turning the conditional classification of feature < 1 && feature > -1 : 0 into defining a line that passes through (-1 + epsilon, 1 - epsilon). If you had more data points and needed more conditions, you would just need to add one more degree of freedom to your higher dimension for each new condition that you define.

You can replace the process of mapping to a higher dimension with any process that provides you with a 1 to 1 relationship between the conditions and the degrees of freedom of your new problem. Kernel tricks simply do that.

Hou

1

As a different example, take the problem where the phenomenon results in observations of the form of [x, floor(sin(x))]. Mapping your problem into a 2D dimension is not helpful here at all; in fact, mapping to any plane will not be helpful here, which is because defining the problem as a set of x < a && x > b : z is not helpful in this case. The simplest mapping in this case is mapping into a polar coordinate, or into the imaginary plane.

Hou

Kernel SVM: I want an intuitive understanding of the mapping to a higher-dimensional feature space, and how this makes linear separation possible
