Can anyone point me to a (recursive) online algorithm for Tikhonov regularization (regularized least squares)?

In an offline setting, I would compute $\hat\beta = (X^T X + \lambda I)^{-1} X^T Y$ using my original data set, where $\lambda$ is found via n-fold cross-validation. A new value $y$ can then be predicted for a given $x$ using $y = x^T \hat\beta$.

In an online setting, I continuously draw new data points. How can I update $\hat\beta$ when I draw new additional data samples, without doing a full recomputation on the whole data set (original + new)?
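For concreteness, here is a minimal NumPy sketch of the offline computation I mean (the names `ridge_offline`, `lam`, etc. are just my placeholders):

```python
import numpy as np

def ridge_offline(X, Y, lam):
    """Offline Tikhonov/ridge solution: beta = (X^T X + lam*I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def predict(x, beta):
    """Predict y = x^T beta for a single input x."""
    return x @ beta
```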
Answers:
Let $M_n = X^T X + \lambda I$ (so that $\hat\beta_n = M_n^{-1} X^T Y$), and suppose a new sample $(x_{n+1}, y_{n+1})$ arrives, giving $M_{n+1} = M_n + x_{n+1} x_{n+1}^T$.

By the Woodbury formula, we have

$$M_{n+1}^{-1} = M_n^{-1} - \frac{M_n^{-1} x_{n+1} x_{n+1}^T M_n^{-1}}{1 + x_{n+1}^T M_n^{-1} x_{n+1}}.$$

As a result,

$$\hat\beta_{n+1} = \hat\beta_n + \frac{M_n^{-1} x_{n+1}}{1 + x_{n+1}^T M_n^{-1} x_{n+1}}\,\bigl(y_{n+1} - x_{n+1}^T \hat\beta_n\bigr).$$

Polyak averaging indicates you can use a step size $\eta_n = n^{-\alpha}$ to approximate the gain $\frac{M_n^{-1} x_{n+1}}{1 + x_{n+1}^T M_n^{-1} x_{n+1}}$, with $\alpha$ ranging from 0.5 to 1. You may try to select the best $\alpha$ for your recursion.
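A minimal NumPy sketch of this recursion (function and variable names are mine):

```python
import numpy as np

def rls_update(beta, Minv, x, y):
    """One recursive ridge/RLS step via the Woodbury identity.

    beta : current estimate beta_n
    Minv : current inverse M_n^{-1} = (X^T X + lam*I)^{-1}
    x, y : new sample (x_{n+1}, y_{n+1})
    """
    Mx = Minv @ x                              # M_n^{-1} x_{n+1}
    denom = 1.0 + x @ Mx                       # 1 + x^T M_n^{-1} x
    gain = Mx / denom                          # gain vector from the formula above
    beta = beta + gain * (y - x @ beta)        # error-driven coefficient update
    Minv = Minv - np.outer(Mx, Mx) / denom     # Woodbury downdate of M^{-1}
    return beta, Minv
```

On the initial batch one would initialize, e.g., `Minv = np.linalg.inv(X.T @ X + lam * np.eye(p))` and `beta = Minv @ (X.T @ Y)`.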
I think it also works if you apply a batch gradient algorithm:

$$\hat\beta_{n+1} = \hat\beta_n + \frac{\eta_n}{n} \sum_{i=1}^{n} x_i \bigl( y_i - x_i^T \hat\beta_n \bigr).$$
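A sketch of that batch step (my rendering; note it requires keeping all data seen so far, unlike the recursion above):

```python
def batch_gradient_step(beta, X, Y, eta):
    """One batch gradient step: beta + (eta/n) * sum_i x_i (y_i - x_i^T beta)."""
    n = X.shape[0]
    residuals = Y - X @ beta           # y_i - x_i^T beta for all i
    return beta + (eta / n) * (X.T @ residuals)
```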
A point that no one has addressed so far is that it generally doesn't make sense to keep the regularization parameter $\lambda$ constant as data points are added. The reason is that $\|X\beta - y\|^2$ will typically grow linearly with the number of data points, while the regularization term $\lambda\|\beta\|^2$ won't.
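One way to make this concrete (my illustration, not part of the original answer): fixing the per-sample trade-off makes the effective penalty scale with $n$ automatically, since

$$\frac{1}{n}\,\|X\beta - y\|^2 + \lambda\,\|\beta\|^2 \;\propto\; \|X\beta - y\|^2 + n\lambda\,\|\beta\|^2,$$

i.e. keeping $\lambda$ fixed in the mean-based objective corresponds to an effective penalty $n\lambda$ in the sum-based one.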
Perhaps something like stochastic gradient descent could work here. Compute $\hat\beta$ using your equation above on the initial dataset; that will be your starting estimate. For each new data point you can then perform one step of gradient descent to update your parameter estimate.
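A minimal sketch of one such per-sample step; the choice of spreading the penalty as $\lambda/n$ over the points seen so far is my own assumption, not part of the answer:

```python
def sgd_step(beta, x, y, lam, eta, n):
    """One stochastic gradient step on the ridge objective for a new point.

    Uses the gradient of (y - x^T beta)^2 + (lam/n) * ||beta||^2, i.e. the
    regularization is treated as spread across the n samples seen so far.
    """
    grad = -2.0 * x * (y - x @ beta) + 2.0 * (lam / n) * beta
    return beta - eta * grad
```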
In linear regression, one possibility is updating the QR decomposition of $X$ directly, as explained here. I guess that, unless you want to re-estimate $\lambda$ after each new data point has been added, something very similar can be done with ridge regression.
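A sketch of what this might look like with SciPy's `qr_insert`, using the standard trick of appending $\sqrt{\lambda}\,I$ rows to $X$ so the ridge problem becomes an ordinary least-squares problem (the function names are mine, and whether this pays off depends on problem size, since the full $Q$ grows with $n$):

```python
import numpy as np
from scipy.linalg import qr, qr_insert, solve_triangular

def ridge_qr_init(X, Y, lam):
    """Full QR of the augmented system [X; sqrt(lam)*I] beta ~= [Y; 0]."""
    n, p = X.shape
    Xa = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    ya = np.concatenate([Y, np.zeros(p)])
    Q, R = qr(Xa)                      # full (non-economic) QR, as qr_insert expects
    return Q, R, ya

def ridge_qr_add_row(Q, R, ya, x, y):
    """Insert one new observation row and re-solve R beta = Q^T y."""
    Q, R = qr_insert(Q, R, x, 0, which='row')   # add x^T as a new first row
    ya = np.concatenate([[y], ya])
    p = R.shape[1]
    beta = solve_triangular(R[:p], (Q.T @ ya)[:p])
    return Q, R, ya, beta
```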
Here is an alternative (and less complex) approach compared to using the Woodbury formula. Note that $X^TX$ and $X^Ty$ can be written as sums. Since we are calculating things online and don't want the sums to blow up, we can alternatively use means ($X^TX/n$ and $X^Ty/n$).
If you write $X$ and $y$ row-wise as

$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},$$

we can write the online updates to $A_t = X^TX/t$ and $b_t = X^Ty/t$ (calculated up to the $t$-th row) as

$$A_t = A_{t-1} + \frac{1}{t}\bigl(x_t x_t^T - A_{t-1}\bigr), \qquad b_t = b_{t-1} + \frac{1}{t}\bigl(x_t y_t - b_{t-1}\bigr).$$

Your online estimate of $\beta$ then becomes

$$\hat\beta_t = \bigl(A_t + \lambda I\bigr)^{-1} b_t.$$
Note that this also helps with the interpretation of $\lambda$ remaining constant as you add observations!
This procedure is how https://github.com/joshday/OnlineStats.jl computes online estimates of linear/ridge regression.
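A minimal NumPy sketch of this running-means procedure (the class and names are mine; OnlineStats.jl itself is Julia):

```python
import numpy as np

class OnlineRidge:
    """Maintains A_t = X^T X / t and b_t = X^T y / t as running means."""

    def __init__(self, p, lam):
        self.A = np.zeros((p, p))   # running mean of x x^T
        self.b = np.zeros(p)        # running mean of x * y
        self.t = 0
        self.lam = lam

    def update(self, x, y):
        self.t += 1
        self.A += (np.outer(x, x) - self.A) / self.t
        self.b += (x * y - self.b) / self.t

    def coef(self):
        """Solve (A_t + lam*I) beta = b_t for the current estimate."""
        p = self.b.shape[0]
        return np.linalg.solve(self.A + self.lam * np.eye(p), self.b)
```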