Can I reconstruct a normal distribution from the sample size and the min and max values? I can use the midpoint as a proxy for the mean

I know this may be a bit shaky, statistically, but this is my problem.

I have a lot of range data, that is, the minimum, maximum, and sample size of a variable. For some of these data I also have a mean, but not for many. I want to compare these ranges with each other to quantify the variability of each range, and also to compare the means. I have good reason to assume that the distribution is symmetric around the mean and that the data will have a Gaussian distribution. For this reason I think I can justify using the midpoint of the range as a proxy for the mean when it is missing.

What I want to do is reconstruct a distribution for each range, and then use that to provide a standard deviation or standard error for that distribution. The only information I have is the maximum and minimum observed from a sample, and the midpoint as a proxy for the mean.

In this way I hope to be able to calculate weighted means for each group, and also to work out the coefficient of variation for each group, based on the range data I have and my assumptions (a symmetric and normal distribution).

I plan to use R to do this, so any code help would be appreciated as well.

r normal-distribution estimation missing-data order-statistics green_thinlake

At first you say you have data on the minimum and maximum values; later, that you only have information on the expected minimum and maximum. Which is it: observed or expected?

Scortchi - Reinstate Monica

Sorry, that's my mistake. The maximum and minimum data are observed (measured from real-life objects). I have edited the post.

green_thinlake

Answers:

The joint cumulative distribution function for the minimum $x_{(1)}$ & maximum $x_{(n)}$ for a sample of $n$ from a Gaussian distribution with mean $\mu$ & standard deviation $\sigma$ is

$F(x_{(1)},x_{(n)};\mu,\sigma) = \Pr(X_{(1)}<x_{(1)}, X_{(n)}<x_{(n)})\\ =\Pr( X_{(n)}<x_{(n)}) - \Pr(X_{(1)}>x_{(1)}, X_{(n)}<x_{(n)})\\ =\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right)^n - \left[\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) -\Phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\right]^n$

where $\Phi(\cdot)$ is the standard Gaussian CDF. Differentiating with respect to $x_{(1)}$ & $x_{(n)}$ gives the joint probability density function

$f(x_{(1)},x_{(n)};\mu,\sigma) =\\ n(n-1)\left[\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) - \Phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\right]^{n-2}\cdot\phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right)\cdot\phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\cdot\tfrac{1}{\sigma^2}$

where $\phi(\cdot)$ is the standard Gaussian PDF. Taking the log & dropping terms that don't contain parameters gives the log-likelihood function

$\ell(\mu,\sigma;x_{(1)},x_{(n)}) =\\ (n-2)\log\left[\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) - \Phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\right] + \log\phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) + \log\phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right) - 2\log\sigma$

This doesn't look very tractable but it's easy to see that it's maximized whatever the value of $\sigma$ by setting $\mu=\hat\mu=\frac{x_{(n)}+x_{(1)}}{2}$ , i.e. the midpoint—the first term is maximized when the argument of one CDF is the negative of the argument of the other; the second & third terms represent the joint likelihood of two independent normal variates.

Substituting $\hat\mu$ into the log-likelihood & writing $r=x_{(n)}-x_{(1)}$ for the observed range gives

$\ell(\sigma;x_{(1)},x_{(n)},\hat\mu)=(n-2)\log\left[1 - 2\Phi\left(\tfrac{-r}{2\sigma}\right)\right] - \frac{r^2}{4\sigma^2} -2\log{\sigma}$

This expression has to be maximized numerically (e.g. with optimize from R's stats package) to find $\hat\sigma$. (It turns out that $\hat\sigma=k(n)\cdot r$, where $k$ is a constant depending only on $n$; perhaps someone more mathematically adroit than I could show why.)
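As a minimal sketch of that numerical step, the profile log-likelihood above can be passed to optimize. The names loglik_sigma and sigma_hat, the search interval (from r/100 to 10r) and the example numbers are assumptions made for illustration; they are not part of the original answer.

```r
# Profile log-likelihood for sigma with mu fixed at the midpoint.
# r = observed range, n = sample size (n >= 2).
loglik_sigma <- function(sigma, r, n) {
  (n - 2) * log(1 - 2 * pnorm(-r / (2 * sigma))) -
    r^2 / (4 * sigma^2) - 2 * log(sigma)
}

# Maximize numerically; the search interval is an assumption.
sigma_hat <- function(r, n) {
  optimize(loglik_sigma, interval = c(r / 100, 10 * r),
           r = r, n = n, maximum = TRUE)$maximum
}

# Made-up example: n = 20 observations with minimum 3.1 and maximum 9.4
x_min <- 3.1; x_max <- 9.4; n <- 20
mu_hat <- (x_min + x_max) / 2          # midpoint estimate of the mean
s_hat  <- sigma_hat(x_max - x_min, n)  # numerical estimate of sigma

# sigma_hat / r should depend on n only, so these two should agree
k1 <- sigma_hat(1, n)
k2 <- sigma_hat(50, n) / 50
c(mu_hat = mu_hat, sigma_hat = s_hat, k1 = k1, k2 = k2)
```

For n = 2 the first term vanishes and the maximum can be found by hand at $\sigma = r/2$, which gives a quick check on the numerical result.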

Estimates are no use without an accompanying measure of precision. The observed Fisher information can be evaluated numerically (e.g. with hessian from R's numDeriv package) & used to calculate approximate standard errors:

$I(\mu)=-\left.\frac{\partial^2{\ell(\mu;\hat\sigma)}}{(\partial\mu)^2}\right|_{\mu=\hat\mu}$

$I(\sigma)=-\left.\frac{\partial^2{\ell(\sigma;\hat\mu)}}{(\partial\sigma)^2}\right|_{\sigma=\hat\sigma}$
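A rough sketch of how these quantities might be evaluated in base R follows. It uses optimHess from the stats package instead of numDeriv's hessian, purely to avoid an extra dependency. The function name loglik and the example numbers are assumptions for illustration.

```r
# Full log-likelihood in (mu, sigma) given the observed min, max and n.
loglik <- function(par, x_min, x_max, n) {
  mu <- par[1]; sigma <- par[2]
  z1 <- (x_min - mu) / sigma
  zn <- (x_max - mu) / sigma
  (n - 2) * log(pnorm(zn) - pnorm(z1)) +
    dnorm(zn, log = TRUE) + dnorm(z1, log = TRUE) - 2 * log(sigma)
}

# Made-up example values (assumption, for illustration only)
x_min <- 3.1; x_max <- 9.4; n <- 20
r <- x_max - x_min

# Estimates: midpoint for mu, numerical maximization for sigma
mu_hat <- (x_min + x_max) / 2
sigma_hat <- optimize(function(s) loglik(c(mu_hat, s), x_min, x_max, n),
                      interval = c(r / 100, 10 * r),
                      maximum = TRUE)$maximum

# Observed information: Hessian of the negative log-likelihood at the estimates
neg_loglik <- function(par) -loglik(par, x_min, x_max, n)
info <- optimHess(c(mu_hat, sigma_hat), neg_loglik)

# Approximate standard errors, one parameter at a time as in the formulas above
se <- 1 / sqrt(diag(info))
names(se) <- c("mu", "sigma")
se
```

Because the log-likelihood is symmetric in $\mu$ about the midpoint, the mixed second derivative vanishes at $\hat\mu$, so treating the two parameters separately loses nothing here.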

It would be interesting to compare the likelihood & the method-of-moments estimates for $\sigma$ in terms of bias (is the MLE consistent?), variance, & mean-square error. There's also the issue of estimation for those groups where the sample mean is known in addition to the minimum & maximum.

Scortchi - Reinstate Monica

+1. Adding the constant $2\log(r)$ to the log-likelihood does not change the location of its maximum, but converts it into a function of $\sigma/r$ and $n$, whence the value of $\sigma/r$ that maximizes it is some function $n\to k(n)$. Equivalently, $\hat\sigma=k(n)r$ as you claim. In other words, the relevant quantity to work with is the ratio of the standard deviation to the (observed) range, or equally its reciprocal, which is closely related to the Studentized range.

whuber
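To spell out that substitution, write $s=\sigma/r$ (a symbol introduced here only for illustration) in the log-likelihood with $\mu$ fixed at the midpoint, and add $2\log r$:

```latex
\ell(\sigma;x_{(1)},x_{(n)},\hat\mu) + 2\log r
  = (n-2)\log\left[1 - 2\Phi\left(\tfrac{-1}{2s}\right)\right]
    - \frac{1}{4s^2} - 2\log s,
  \qquad s = \frac{\sigma}{r}
```

The right-hand side involves only $s$ and $n$, so its maximizer is a function $k(n)$ and $\hat\sigma = k(n)\,r$. As a check, for $n=2$ the first term vanishes, and setting the derivative $\frac{1}{2s^3}-\frac{2}{s}$ to zero gives $k(2)=\tfrac12$.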

@whuber: Thanks! Seems obvious with hindsight. I'll incorporate that into the answer.

Scortchi - Reinstate Monica

You need to relate the range to the standard deviation/variance. Let $\mu$ be the mean, $\sigma$ the standard deviation and $R=x_{(n)} - x_{(1)}$ the range. For the normal distribution, $99.7\%$ of the probability mass lies within 3 standard deviations of the mean. As a practical rule, this means that with very high probability

$\mu + 3\sigma \approx x_{(n)}$ and $\mu - 3\sigma \approx x_{(1)}$

Subtracting the second from the first we obtain

$6\sigma \approx x_{(n)} - x_{(1)}= R$

(this, by the way, is where the "six-sigma" quality assurance methodology in industry comes from). You can then obtain an estimate of the standard deviation by

$\hat \sigma = \frac 16 \Big(\bar x_{(n)} - \bar x_{(1)}\Big)$

where the bar denotes averages. This applies when you assume that all sub-samples come from the same distribution (you wrote about having expected ranges). If each sample comes from a different normal distribution, with a different mean and variance, then you can use the formula for each sample separately, but the uncertainty / possible inaccuracy in the estimated value of the standard deviation will be much larger.

Having a value for the mean and for the standard deviation completely characterizes the normal distribution.
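How well the divisor 6 works depends on the sample size, which a quick simulation can illustrate. This is only a rough sketch under the question's normality assumption. The helper name mean_range_ratio, the seed and the number of replicates are choices made here, not part of the answer.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Average of (sample range / sigma) over many simulated normal samples of size n
mean_range_ratio <- function(n, reps = 2000, mu = 0, sigma = 1) {
  ranges <- replicate(reps, diff(range(rnorm(n, mu, sigma))))
  mean(ranges) / sigma
}

ratios <- sapply(c(10, 100, 1000), mean_range_ratio)
names(ratios) <- c("n=10", "n=100", "n=1000")
ratios  # compare each value with the divisor 6 used above
```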

Alecos Papadopoulos

That's neither a close approximation for small $n$ nor an asymptotic result for large $n$.

Scortchi - Reinstate Monica

@Scortchi Well, I didn't say that it is a good estimate, but I believe that it is always good to have easily implemented solutions, even very rough ones, in order to get a quantitative sense of the issue at hand, alongside the more sophisticated and efficient approaches such as the one outlined in the other answer to this question.

Alecos Papadopoulos

I wouldn't carp at "the expectation of the sample range turns out to be about 6 times the standard deviation for values of $n$ from 200 to 1000". But am I missing something subtle in your derivation, or wouldn't it work just as well to justify dividing the range by any number?

Scortchi - Reinstate Monica

@Scortchi Well, the spirit of the approach is "if we expect almost all realizations to fall within 6 sigmas, then it is reasonable to expect that the extreme realizations will be near the border"; that's all there is to it, really. Perhaps I am too used to operating under extremely incomplete information, and being obliged to say something quantitative about it... :)

Alecos Papadopoulos

I could reply that even more observations would fall within $10 \sigma$ of the mean, giving a better estimate $\hat\sigma=\frac{R}{10}$. I shan't because it's nonsense. Any number over $1.13$ will be a rough estimate for some value of $n$.

Scortchi - Reinstate Monica

It is straightforward to get the distribution function of the maximum of a sample from the normal distribution (see "P.max.norm" in the code). Inverting it gives the quantile function (see "Q.max.norm").

Using "Q.max.norm" and "Q.min.norm" you can get the medians of the maximum and the minimum, which depend on N; their difference gives a typical range for that N. Using the idea presented by Alecos Papadopoulos (in the previous answer), you can then calculate sd by dividing the observed range by this difference.

Try this:

N = 100000    # the size of the sample

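# Distribution function of the maximum of N normal draws, evaluated at q
P.max.norm <- function(q, N=1, mean=0, sd=1){
    pnorm(q,mean,sd)^N
}
# Quantile function of the maximum, given p and N
Q.max.norm <- function(p, N=1, mean=0, sd=1){
    qnorm(p^(1/N),mean,sd)
}
# Quantile function of the minimum: by symmetry, the p-quantile of the minimum
# mirrors the (1-p)-quantile of the maximum (identical at p = 0.5)
Q.min.norm <- function(p, N=1, mean=0, sd=1){
    mean-(Q.max.norm(1-p, N=N, mean=mean, sd=sd)-mean)
}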

### let's test it (takes some time)
Q.max.norm(0.5, N=N)  # The median of the maximum
Q.min.norm(0.5, N=N)  # The median of the minimum

iter = 100
median(replicate(iter, max(rnorm(N))))
median(replicate(iter, min(rnorm(N))))
# it is quite OK

### Let's try to get estimates
true_mean = -3
true_sd = 2
N = 100000

x = rnorm(N, true_mean, true_sd)  # simulation
x.vec = range(x)                  # observations

# estimation
est_mean = mean(x.vec)
est_sd = diff(x.vec)/(Q.max.norm(0.5, N=N)-Q.min.norm(0.5, N=N))

c(true_mean, true_sd)
c(est_mean, est_sd)

# Quite good, but only for large N
# -3  2
# -3.252606  1.981593

Vyga

Continuing this approach, $\operatorname{E} (R) = \sigma \int_{-\infty}^{\infty} \left[1-(1-\Phi(x))^n -\Phi(x)^n\right] \mathrm{d} x = \sigma d_2(n)$, where $R$ is the range & $\Phi(\cdot)$ the standard normal cumulative distribution function. You can find tabulated values of $d_2$ for small $n$ in the statistical process control literature, numerically evaluate the integral, or simulate for your $n$.

Scortchi - Reinstate Monica
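The numerical-integration option in the comment above might be sketched as follows, using integrate from R's stats package. The function names d2 and sigma_from_range and the example inputs are assumptions for illustration.

```r
# d2(n): expected range of n standard normal draws, by numerical integration
d2 <- function(n) {
  integrand <- function(x) {
    1 - pnorm(x, lower.tail = FALSE)^n - pnorm(x)^n
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

# Moment-type estimate of sigma: observed range divided by d2(n)
sigma_from_range <- function(r, n) r / d2(n)

# How the divisor grows with the sample size
sapply(c(2, 5, 10, 100, 1000), d2)

# Made-up example: range 6.3 observed in a sample of 20
sigma_from_range(6.3, 20)
```

For $n=2$ the integral has the closed form $2/\sqrt{\pi}\approx 1.13$, which matches the lower bound mentioned in the earlier comment and gives a simple check on the numerical result.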