Pearson correlation of data sets with possibly zero standard deviation?

12

I am having a problem computing the Pearson correlation coefficient of data sets with possibly zero standard deviation (i.e. all data have the same value).

Suppose that I have the following two data sets:

float x[] = {2, 2, 2, 3, 2};
float y[] = {2, 2, 2, 2, 2};

The correlation coefficient "r" would be computed using the following equation:

float r = covariance(x, y) / (std_dev(x) * std_dev(y));

However, because all data in data set "y" have the same value, the standard deviation std_dev(y) would be zero and "r" would be undefined.

Is there any solution for this problem? Or should I use other methods to measure the data relationship in this case?
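The undefined ratio described in the question can be reproduced directly. The following is a minimal Python sketch, not the asker's code: the helpers `mean`, `covariance` and `std_dev` are stand-ins written for this illustration, and they use the population (divide by N) convention as an assumption.

```python
import math

x = [2, 2, 2, 3, 2]
y = [2, 2, 2, 2, 2]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)

def std_dev(a):
    """Population standard deviation."""
    return math.sqrt(covariance(a, a))

cov_xy = covariance(x, y)            # zero, because y never deviates from its mean
sd_x, sd_y = std_dev(x), std_dev(y)  # sd_y is zero as well

try:
    r = cov_xy / (sd_x * sd_y)       # 0 / 0
except ZeroDivisionError:
    r = None                         # Python refuses the division: r is undefined

print(cov_xy, sd_x, sd_y, r)
```

With IEEE floating point in C-like languages the same 0/0 division typically yields NaN rather than an error, so the undefined result can propagate silently.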

correlation Andree

There is no "data relationship" in this example because y does not vary. Assigning any numerical value to r would be a mistake.

whuber

1

@whuber - it's true that $r$ is undefined, but not necessarily that the "true" unknown correlation $\rho$ cannot be estimated. You just have to use something else to estimate it.

probabilityislogic

@probabilityislogic You presuppose that this is a problem of estimation and not just of characterization. But accepting that, what estimator would you propose in the example? No answer can be universally correct, because it depends on how the estimator will be used (a loss function, essentially). In many applications, such as PCA, it seems likely that any procedure that imputes a value to $\rho$ may be worse than other procedures that recognize that $\rho$ cannot be identified.

whuber

1

@whuber - "estimate" was a bad choice of words on my part (you may have noticed that I'm not the best wordsmith). What I meant was that although $\rho$ may not be uniquely identified, this does not mean that the data are uninformative about $\rho$. My answer shows this (in an ugly way) from an algebraic viewpoint.

probabilityislogic

@probabilityislogic It seems your analysis is contradictory: if y is indeed modeled with a normal distribution, then a sample of five 2's shows that this model is inappropriate. Ultimately, you don't get something for nothing: your results depend strongly on the assumptions made about the priors. The original problems in identifying $\rho$ are still there but have been hidden by all these additional assumptions. In my opinion that just obscures the issues rather than clarifying them.

whuber

9

The "sampling theory" people will tell you that no such estimate exists. But you can get one; you just need to be reasonable about your prior information and do a lot more mathematical work.

If you specify a Bayesian method of estimation and the posterior is the same as the prior, then you can say the data say nothing about the parameter. Because things may become "singular" on us, we cannot use infinite parameter spaces. I am assuming that, because you use the Pearson correlation, you have a bivariate normal likelihood:

$p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)=\left(\sigma_x\sigma_y\sqrt{2\pi(1-\rho^2)}\right)^{-N}\exp\left(-\frac{\sum_{i}Q_i}{2(1-\rho^2)}\right)$

where

$Q_i=\frac{(x_i-\mu_x)^2}{\sigma_x^2}+\frac{(y_i-\mu_y)^2}{\sigma_y^2}-2\rho\frac{(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x\sigma_y}$

Now, to indicate that one data set may have the same value throughout, write $y_i=y$, and then we get:

$\sum_{i}Q_i=N\left[\frac{(y-\mu_y)^2}{\sigma_y^2}+\frac{s_x^2 + (\overline{x}-\mu_x)^2}{\sigma_x^2}-2\rho\frac{(\overline{x}-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]$

where

$s_x^2=\frac{1}{N}\sum_{i}(x_i-\overline{x})^2$
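The collapsed form of $\sum_i Q_i$ can be checked numerically against the term-by-term sum. The sketch below uses the data from the question; the parameter values `mu_x`, `mu_y`, `sigma_x`, `sigma_y` and `rho` are arbitrary numbers chosen only for this check, not estimates.

```python
x = [2, 2, 2, 3, 2]
y = 2.0                     # every y_i equals this constant
n = len(x)
xbar = sum(x) / n
s2x = sum((xi - xbar) ** 2 for xi in x) / n   # s_x^2 with the 1/N convention

# Arbitrary (hypothetical) parameter values, used only to test the identity.
mu_x, mu_y, sigma_x, sigma_y, rho = 2.1, 1.9, 0.5, 0.3, 0.4

def q_i(xi, yi):
    """One term Q_i of the bivariate normal exponent."""
    dx = (xi - mu_x) / sigma_x
    dy = (yi - mu_y) / sigma_y
    return dx * dx + dy * dy - 2.0 * rho * dx * dy

direct = sum(q_i(xi, y) for xi in x)

# Closed form that depends on the data only through s_x^2, y, xbar and N.
closed = n * (
    (y - mu_y) ** 2 / sigma_y ** 2
    + (s2x + (xbar - mu_x) ** 2) / sigma_x ** 2
    - 2.0 * rho * (xbar - mu_x) * (y - mu_y) / (sigma_x * sigma_y)
)

print(direct, closed)       # the two values agree up to rounding
```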

And so your likelihood depends on four numbers, $s_x^2,y,\overline{x},N$. You want an estimate of $\rho$, so you need to multiply by a prior and integrate out the nuisance parameters $\mu_x,\mu_y,\sigma_x,\sigma_y$. Now, to prepare for the integration, we "complete the square":

$\frac{\sum_{i}Q_i}{1-\rho^2}=N\left[\frac{\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{\sigma_y^2(1-\rho^{2})}+\frac{s_x^2}{\sigma_{x}^{2}(1-\rho^{2})} + \frac{(\overline{x}-\mu_x)^2}{\sigma_x^2}\right]$
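The intermediate algebra behind this step can be spelled out as follows. The shorthand $a$ and $b$ is introduced here only for illustration and does not appear in the original answer.

```latex
\begin{align*}
a &= \frac{y-\mu_y}{\sigma_y}, \qquad b = \frac{\overline{x}-\mu_x}{\sigma_x} \\
\frac{1}{N}\sum_i Q_i &= a^2 - 2\rho a b + b^2 + \frac{s_x^2}{\sigma_x^2} \\
a^2 - 2\rho a b + b^2 &= (a-\rho b)^2 + (1-\rho^2)\,b^2 \\
\frac{1}{N}\,\frac{\sum_i Q_i}{1-\rho^2} &= \frac{(a-\rho b)^2}{1-\rho^2}
  + \frac{s_x^2}{\sigma_x^2(1-\rho^2)} + b^2 \\
(a-\rho b)^2 &= \frac{\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{\sigma_y^2}
\end{align*}
```

Substituting the last line and $b^2=(\overline{x}-\mu_x)^2/\sigma_x^2$ into the fourth line gives the completed square.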

Now we should err on the side of caution and ensure a properly normalized probability. That way we can't get into trouble. One such option is to use a weakly informative prior, which just places a restriction on the range of each parameter. So we have $L_{\mu}<\mu_x,\mu_y<U_{\mu}$ for the means with a flat prior and $L_{\sigma}<\sigma_x,\sigma_y<U_{\sigma}$ for the standard deviations with a Jeffreys prior. These limits are easy to set with a bit of "common sense" thinking about the problem. I will take an unspecified prior for $\rho$ (uniform should work; if not, truncate the singularity at $\pm 1$), and so we get:

$p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)=\frac{p(\rho)}{A\sigma_x\sigma_y}$

where $A=2(U_{\mu}-L_{\mu})^{2}[\log(U_{\sigma})-\log(L_{\sigma})]^{2}$. This gives a posterior of:

$p(\rho|D)=\int p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)d\mu_y d\mu_x d\sigma_x d\sigma_y$

$=\frac{p(\rho)}{A[2\pi(1-\rho^2)]^{\frac{N}{2}}}\int_{L_{\sigma}}^{U_{\sigma}}\int_{L_{\sigma}}^{U_{\sigma}}\left(\sigma_x\sigma_y\right)^{-N-1}\exp\left(-\frac{N s_x^2}{2\sigma_{x}^{2}(1-\rho^{2})}\right) \times$

$\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N(\overline{x}-\mu_x)^2}{2\sigma_x^2}\right)\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{2\sigma_y^2(1-\rho^{2})}\right)d\mu_y d\mu_x d\sigma_x d\sigma_y$

Now the first integration over $\mu_y$ can be done by making a change of variables $z=\sqrt{N}\frac{\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\sigma_y\sqrt{1-\rho^{2}}}\implies dz=\frac{\sqrt{N}}{\sigma_y\sqrt{1-\rho^{2}}}d\mu_y$ and the first integral over $\mu_y$ becomes:

$\frac{\sigma_y\sqrt{2\pi(1-\rho^{2})}}{\sqrt{N}}\left[\Phi\left( \frac{U_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)-\Phi\left( \frac{L_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)\right]$

And you can see from here, no analytic solutions are possible. However, it is also worthwhile to note that the value $\rho$ has not dropped out of the equations. This means that the data and prior information still have something to say about the true correlation. If the data said nothing about the correlation, then we would be simply left with $p(\rho)$ as the only function of $\rho$ in these equations.

It also shows how passing to the limit of infinite bounds for $\mu_y$ "throws away" some of the information about $\rho$, which is contained in the complicated looking normal CDF function $\Phi(.)$. Now if you have a lot of data, then passing to the limit is fine, you don't lose much, but if you have very scarce information, such as in your case, it is important to keep every scrap you have. It means ugly maths, but this example is not too hard to do numerically. So we can evaluate the integrated likelihood for $\rho$ at values of, say, $-0.99,-0.98,\dots,0.98,0.99$ fairly easily. Just replace the integrals by summations over small enough intervals, so you have a triple summation.
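The triple summation described above might be sketched as follows in Python. This is an illustration under stated assumptions, not the answerer's implementation: the bounds `mu_bounds` and `sigma_bounds`, the grid size `steps`, the midpoint rule, the uniform $p(\rho)$ and all function names are choices made here, and constant factors such as $A$ are dropped because they cancel on normalization.

```python
import math

def phi(z):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def midpoints(lo, hi, steps):
    """Midpoints and width of `steps` equal cells covering [lo, hi]."""
    width = (hi - lo) / steps
    return [lo + (k + 0.5) * width for k in range(steps)], width

def integrated_likelihood(rho, n, xbar, s2x, y,
                          mu_bounds=(0.0, 4.0), sigma_bounds=(0.1, 2.0),
                          steps=12):
    """Unnormalized integrated likelihood of rho (constants dropped).

    The mu_y integral uses the closed form with the normal CDF; the
    integrals over sigma_x, sigma_y and mu_x become a triple
    midpoint-rule summation. Bounds and steps are assumed values.
    """
    lo_mu, hi_mu = mu_bounds
    sigmas, d_sigma = midpoints(sigma_bounds[0], sigma_bounds[1], steps)
    mus, d_mu = midpoints(lo_mu, hi_mu, steps)
    one_m_r2 = 1.0 - rho * rho
    total = 0.0
    for sx in sigmas:
        # Factor that depends on sigma_x (and rho) only.
        fx = math.exp(-n * s2x / (2.0 * sx * sx * one_m_r2))
        for sy in sigmas:
            scale = sy * math.sqrt(one_m_r2 / n)  # sd of the Gaussian in mu_y
            for mx in mus:
                centre = y - (xbar - mx) * rho * sy / sx
                mu_y_part = scale * math.sqrt(2.0 * math.pi) * (
                    phi((hi_mu - centre) / scale)
                    - phi((lo_mu - centre) / scale))
                mu_x_part = math.exp(-n * (xbar - mx) ** 2 / (2.0 * sx * sx))
                total += (sx * sy) ** (-n - 1) * fx * mu_x_part * mu_y_part
    return one_m_r2 ** (-n / 2.0) * total * d_sigma * d_sigma * d_mu

# Data from the question: x varies, y is constant.
x = [2, 2, 2, 3, 2]
n = len(x)
xbar = sum(x) / n
s2x = sum((xi - xbar) ** 2 for xi in x) / n
y_const = 2.0

rhos = [k / 100.0 for k in range(-99, 100)]          # -0.99, ..., 0.99
values = [integrated_likelihood(r, n, xbar, s2x, y_const) for r in rhos]
norm = sum(values)
posterior = [v / norm for v in values]               # uniform p(rho) on the grid
```

The result depends on the assumed bounds, which is the point raised in the comments: the prior ranges carry real weight when the data are this sparse. A finer `steps` value trades run time for accuracy.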

probabilityislogic

@probabilityislogic: Wow. Simply wow. After seeing some of your answers I really wonder: what should a doofus like me do to reach such a flexible Bayesian state of mind?

steffen

1

@steffen - lol. It's not that difficult, you just need to practice. And always always always remember that the product and sum rules of probability are the only rules you will ever need. They will extract whatever information is there, whether you see it or not. So you apply product and sum rules, then just do the maths. That is all I have done here.

probabilityislogic

@steffen - and the other rule - more a mathematical one than stats one - don't pass to an infinite limit too early in your calculations, your results may become arbitrary, or little details may get thrown out. Measurement error models are a perfect example of this (as is this question).

probabilityislogic

@probabilityislogic: Thank you, I'll keep this in mind... as soon as I am done working through my "Bayesian Analysis"-copy ;).

steffen

@probabilityislogic: If you could humor a nonmathematical statistician/researcher...would it be possible to summarize or translate your answer to a group of dentists or high school principals or introductory statistics students?

rolando2

6

I agree with sesqu that the correlation is undefined in this case. Depending on your type of application you could e.g. calculate the Gower similarity between both vectors, which is: $gower(v1,v2)=\frac{\sum_{i=1}^{n}\delta(v1_i,v2_i)}{n}$ where $\delta$ represents the Kronecker delta, applied as a function on $v1,v2$.

So for instance if all values are equal, gower(.,.)=1. If on the other hand they differ in only one dimension, gower(.,.)=(n-1)/n, e.g. 0.9 for n=10. If they differ in every dimension, gower(.,.)=0 and so on.

Of course this is no measure of correlation, but it allows you to calculate how close the vector with s>0 is to the one with s=0. Of course you can apply other metrics, too, if they serve your purpose better.
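The similarity defined above is a few lines of code. A minimal Python sketch follows; the function name `gower_similarity` is chosen here for illustration, and exact equality is used for the Kronecker delta, which is an assumption that suits integer-valued or categorical data better than noisy floats.

```python
def gower_similarity(v1, v2):
    """Fraction of positions at which two equal-length vectors agree."""
    if len(v1) != len(v2) or len(v1) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(v1, v2) if a == b)  # Kronecker delta per position
    return matches / len(v1)

x = [2, 2, 2, 3, 2]
y = [2, 2, 2, 2, 2]
similarity = gower_similarity(x, y)  # the vectors differ in one of five positions
```

For the data in the question the vectors agree in four of five positions, so the similarity is 4/5.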

steffen

+1 That's a creative idea. It sounds like the "Gower Similarity" is a scaled Hamming distance.

whuber

@whuber: Indeed it is !

steffen

0

The correlation is undefined in that case. If you must define it, I would define it as 0, but consider a simple mean absolute difference instead.
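The mean absolute difference suggested here could be computed as in the following Python sketch. The function name `mean_abs_diff` is hypothetical, and the measure compares values position by position, so it assumes both vectors are on the same scale.

```python
def mean_abs_diff(v1, v2):
    """Average of |v1_i - v2_i| over all positions."""
    if len(v1) != len(v2) or len(v1) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(v1, v2)) / len(v1)

x = [2, 2, 2, 3, 2]
y = [2, 2, 2, 2, 2]
distance = mean_abs_diff(x, y)  # one position differs by 1, so the sum is 1 over 5 positions
```

Unlike a correlation, this stays well defined when one vector is constant.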

sesqu

0

This question is coming from programmers, so I'd suggest plugging in zero. There's no evidence of a correlation, and the null hypothesis would be zero (no correlation). There might be other context knowledge that would provide a "typical" correlation in one context, but the code might be re-used in another context.

zbicyclist

2

There's no evidence of lack of correlation either, so why not plug in 1? Or -1? Or anything in between? They all lead to re-usable code!

whuber

@whuber - you plug in zero because the data is "less constrained" when it is independent - this is why maxent distributions are independent unless you explicitly specify correlations in the constraints. Independence can be viewed as a conservative assumption when you know of no such correlations - effectively you are averaging over all possible correlations.

probabilityislogic

1

@prob I question why it makes sense as a generic procedure to average over all correlations. In effect this procedure substitutes the definite and possibly quite wrong answer "zero!" for the correct answer "the data don't tell us." That difference can be important for decision making.

whuber

Just because the question might be from a programmer, does not mean you should convert an undefined value to zero. Zero means something specific in a correlation calculation. Throw an exception. Let the caller decide what should happen. Your function should calculate a correlation, not decide what to do if one cannot be computed.
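One way to follow this advice is sketched below in Python. The function name `pearson_r` and the choice of `ValueError` are assumptions made for illustration; the point is only that the function reports the undefined case and leaves the policy to the caller.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; raises ValueError when it is undefined."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    # A constant sample has zero variance; comparing min and max avoids
    # rounding noise in a computed variance.
    if min(x) == max(x) or min(y) == max(y):
        raise ValueError("correlation undefined: a sample has zero variance")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# The caller decides what an undefined correlation should mean.
try:
    r = pearson_r([2, 2, 2, 3, 2], [2, 2, 2, 2, 2])
except ValueError:
    r = None  # e.g. skip this pair, log it, or fall back to another measure
```

A caller who really wants zero can still substitute it explicitly, but that choice is then visible at the call site instead of hidden inside the function.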

Jared Becksfort
