What fraction of repeat experiments will have an effect size within the 95% confidence interval of the first experiment?


Let's stay with an ideal situation: random sampling, Gaussian populations, equal variances, no p-hacking, and so on.

Step 1. You run an experiment, say comparing two sample means, and compute a 95% confidence interval for the difference between the two population means.

Step 2. You run many more experiments (thousands). The difference between means varies from experiment to experiment because of random sampling.

Question: What fraction of the differences between means, from the collection of experiments in Step 2, will lie within the confidence interval from Step 1?

That cannot be answered: it all depends on what happened in Step 1. If the Step 1 experiment was very atypical, the answer to the question might be very low.

So imagine that both steps are repeated many times (with Step 2 repeated many more times). Now it should be possible, I would think, to come up with an expectation for what fraction of repeat experiments, on average, have an effect size within the 95% confidence interval of the first experiment.
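As a concrete illustration of this two-step procedure, here is a minimal R sketch (the sample sizes, means, and number of repeat experiments are arbitrary choices for the example, not part of the question):

# Step 1: one original experiment comparing two group means.
set.seed(17)
n <- 10; mu1 <- 0; mu2 <- 1; sigma <- 2
g1 <- rnorm(n, mu1, sigma); g2 <- rnorm(n, mu2, sigma)
ci <- t.test(g2, g1)$conf.int   # 95% CI for the difference in means

# Step 2: thousands of repeat experiments; record each difference in means.
n.rep <- 1e4
diffs <- replicate(n.rep, mean(rnorm(n, mu2, sigma)) - mean(rnorm(n, mu1, sigma)))

# Fraction of repeat differences falling inside the original CI.
mean(ci[1] <= diffs & diffs <= ci[2])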

It seems that the answer to this question needs to be understood in order to evaluate the reproducibility of studies, a very hot area.

Harvey Motulsky
For each original experiment (Step 1), define $x_i$ as the fraction of subsequent results (Step 2) that lie within the confidence interval of the original result. You want to compute the empirical distribution of $x_i$?
Matthew Gunn
Yes, you understand what I am asking.
Harvey Motulsky
@MatthewGunn asked whether you want the empirical distribution of the "capture fraction" for future observations. Your post asks "... it should be possible, I would think, to come up with an expectation for what fraction of repeat experiments, on average, have an effect size within the 95% confidence interval of the first experiment". That is not a distribution but an expected value (an average).
Whuber's analysis is great, but if you need a citation, here is a paper that addresses exactly this question in detail: Cumming & Maillardet, 2006, Confidence Intervals and Replication: Where Will the Next Mean Fall?. They call it the capture percentage of a confidence interval.
amoeba says Reinstate Monica

Answers:


Analysis

Because this is a conceptual question, for simplicity let us consider the situation in which a $1-\alpha$ confidence interval

$$\left[\bar x^{(1)} + Z_{\alpha/2}\,s^{(1)}/\sqrt n,\ \bar x^{(1)} + Z_{1-\alpha/2}\,s^{(1)}/\sqrt n\right]$$

is constructed for a mean $\mu$ using a random sample $x^{(1)}$ of size $n$, and a second random sample $x^{(2)}$ of size $m$ is drawn from the same Normal$(\mu,\sigma^2)$ distribution. (If you like, you may replace the $Z$s by values from the Student $t$ distribution with $n-1$ degrees of freedom; the following analysis will not change.)

The chance that the mean of the second sample lies within the CI determined by the first is

$$\Pr\left(\bar x^{(1)} + \frac{Z_{\alpha/2}}{\sqrt n}\,s^{(1)} \le \bar x^{(2)} \le \bar x^{(1)} + \frac{Z_{1-\alpha/2}}{\sqrt n}\,s^{(1)}\right) = \Pr\left(\frac{Z_{\alpha/2}}{\sqrt n}\,s^{(1)} \le \bar x^{(2)} - \bar x^{(1)} \le \frac{Z_{1-\alpha/2}}{\sqrt n}\,s^{(1)}\right).$$

Because the mean of the first sample $\bar x^{(1)}$ is independent of the first sample's standard deviation $s^{(1)}$ (this requires Normality), and the second sample is independent of the first, the difference in sample means $U = \bar x^{(2)} - \bar x^{(1)}$ is independent of $s^{(1)}$. Moreover, for this symmetric interval $Z_{\alpha/2} = -Z_{1-\alpha/2}$. Therefore, writing $S$ for the random variable $s^{(1)}$ and squaring both inequalities, the probability in question is the same as

$$\Pr\left(U^2 \le \left(\frac{Z_{1-\alpha/2}}{\sqrt n}\right)^2 S^2\right) = \Pr\left(\frac{U^2}{S^2} \le \left(\frac{Z_{1-\alpha/2}}{\sqrt n}\right)^2\right).$$

The laws of expectation imply that $U$ has a mean of $0$ and a variance of

$$\operatorname{Var}(U) = \operatorname{Var}\left(\bar x^{(2)} - \bar x^{(1)}\right) = \sigma^2\left(\frac 1 m + \frac 1 n\right).$$

Because $U$ is a linear combination of Normal variables, it also has a Normal distribution. Therefore $U^2$ is $\sigma^2\left(\frac 1 n + \frac 1 m\right)$ times a $\chi^2(1)$ variable. We already knew that $S^2$ is $\sigma^2/(n-1)$ times a $\chi^2(n-1)$ variable. Consequently, $U^2/S^2$ is $1/n + 1/m$ times a variable with an $F(1, n-1)$ distribution. The required probability is given by the $F$ distribution as

$$F_{1,n-1}\!\left(\frac{Z_{1-\alpha/2}^2}{1+n/m}\right).\tag{1}$$
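To evaluate (1) numerically, a small R helper suffices (the function name capture.prob is mine, not from the original answer):

capture.prob <- function(alpha, n, m = n) {
  # CDF of F(1, n-1) evaluated at Z^2_{1-alpha/2} / (1 + n/m), i.e. formula (1).
  pf(qnorm(1 - alpha/2)^2 / (1 + n/m), 1, n - 1)
}
capture.prob(0.05, 5)    # about 0.76
capture.prob(0.05, 50)   # about 0.83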

Discussion

An interesting case is when the second sample is the same size as the first, so that n/m=1 and only n and α determine the probability. Here are the values of (1) plotted against α for n=2,5,20,50.

[Figure: the probability (1) plotted against α for n = 2, 5, 20, 50]

The graphs rise to a limiting value at each α as n increases. The traditional test size α=0.05 is marked by a vertical gray line. For largish values of n=m, the limiting chance for α=0.05 is around 83%.

By understanding this limit, we will peer past the details of small sample sizes and better understand the crux of the matter. As n=m grows large, the F distribution approaches a χ2(1) distribution. In terms of the standard Normal distribution Φ, the probability (1) then approximates

$$\Phi\!\left(\frac{Z_{1-\alpha/2}}{\sqrt 2}\right) - \Phi\!\left(\frac{Z_{\alpha/2}}{\sqrt 2}\right) = 1 - 2\,\Phi\!\left(\frac{Z_{\alpha/2}}{\sqrt 2}\right).$$

For instance, with $\alpha = 0.05$, $Z_{\alpha/2}/\sqrt 2 \approx -1.96/1.41 \approx -1.386$ and $\Phi(-1.386) \approx 0.083$. Consequently the limiting value attained by the curves at $\alpha = 0.05$ as $n$ increases will be $1 - 2(0.083) = 1 - 0.166 = 0.834$. You can see it has almost been reached for $n = 50$ (where the chance is about $0.828$).
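As a quick check of this limiting computation, the numbers can be reproduced in R:

alpha <- 0.05
Z <- qnorm(alpha / 2)                  # about -1.96
1 - 2 * pnorm(Z / sqrt(2))             # about 0.834, the limiting capture chance
pchisq(qnorm(1 - alpha/2)^2 / 2, 1)    # the same value, via the chi-squared limit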

For small $\alpha$, the relationship between $\alpha$ and the complementary probability--the risk that the CI does not cover the second mean--is almost perfectly a power law. Another way to express this is that the log complementary probability is almost a linear function of $\log\alpha$. The limiting relationship is approximately

$$\log\left(2\,\Phi\!\left(\frac{Z_{\alpha/2}}{\sqrt 2}\right)\right) \approx -1.79712 + 0.557203\,\log(20\alpha) + 0.00657704\,\left(\log(20\alpha)\right)^2 + \cdots$$

In other words, for large n=m and α anywhere near the traditional value of 0.05, (1) will be close to

$$1 - 0.166\,(20\alpha)^{0.557}.$$
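A brief R sketch comparing the exact large-n probability with this power-law approximation (the grid of α values is my choice):

alpha <- c(0.01, 0.02, 0.05, 0.10)
exact <- 1 - 2 * pnorm(qnorm(alpha / 2) / sqrt(2))   # large-n limit of (1)
approx <- 1 - 0.166 * (20 * alpha)^0.557
round(rbind(alpha, exact, approx), 4)
# exact:  0.9314 0.9000 0.8343 0.7552
# approx: 0.9323 0.9004 0.8340 0.7558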

(This reminds me very much of the analysis of overlapping confidence intervals I posted at /stats//a/18259/919. Indeed, the magic power there, 1.91, is very nearly the reciprocal of the magic power here, 0.557. At this point you should be able to re-interpret that analysis in terms of reproducibility of experiments.)


Experimental results

These results are confirmed with a straightforward simulation. The following R code returns the frequency of coverage, the chance as computed with (1), and a Z-score to assess how much they differ. The Z-scores are typically less than 2 in size, regardless of $n, m, \mu, \sigma, \alpha$ (or even whether a Z or t CI is computed), indicating the correctness of formula (1).

n <- 3      # First sample size
m <- 2      # Second sample size
sigma <- 2 
mu <- -4
alpha <- 0.05
n.sim <- 1e4
#
# Compute the multiplier.
#
Z <- qnorm(alpha/2)
#Z <- qt(alpha/2, df=n-1) # Use this for a Student t C.I. instead.
#
# Draw the first sample and compute the CI as [l.1, u.1].
#
x.1 <- matrix(rnorm(n*n.sim, mu, sigma), nrow=n)
x.1.bar <- colMeans(x.1)
s.1 <- apply(x.1, 2, sd)
l.1 <- x.1.bar + Z * s.1 / sqrt(n)
u.1 <- x.1.bar - Z * s.1 / sqrt(n)
#
# Draw the second sample and compute the mean as x.2.
#
x.2 <- colMeans(matrix(rnorm(m*n.sim, mu, sigma), nrow=m))
#
# Compare the second sample means to the CIs.
#
covers <- l.1 <= x.2 & x.2 <= u.1
#
# Compute the theoretical chance and compare it to the simulated frequency.
#
f <- pf(Z^2 / ((n * (1/n + 1/m))), 1, n-1)
m.covers <- mean(covers)
(c(Simulated=m.covers, Theoretical=f, Z=(m.covers - f)/sd(covers) * sqrt(length(covers))))
whuber
You say that using t instead of z won't make much difference. I believe you but haven't checked yet. With small sample sizes, the two critical values can differ a lot and the t distribution is the correct way to compute the CI. Why do you prefer to use z?
Harvey Motulsky
It's purely illustrative and Z is simpler. When you use t it is interesting that the curves in the figure start high and descend to their limit. In particular, the chance of reproducing a significant result is then much higher for small samples than for large! Note that there's nothing to check, because you are free to interpret $Z_\alpha$ as a percentage point of the appropriate Student t distribution (or of any other distribution you might care to name). Nothing changes in the analysis. If you do want to see the particular effects, uncomment the qt line in the code.
whuber
+1. This is a great analysis (and your answer has way too few upvotes for what it is). I just came across a paper that discusses this very question in great detail and I thought you might be interested: Cumming & Maillardet, 2006, Confidence Intervals and Replication: Where Will the Next Mean Fall?. They call it capture percentage of a confidence interval.
amoeba says Reinstate Monica
@Amoeba Thank you for the reference. I especially appreciate one general conclusion therein: "Replication is central to the scientific method, and researchers should not turn a blind eye to it just because it makes salient the inherent uncertainty of a single study."
whuber
Update: Thanks to the ongoing discussion in the sister thread, I now believe my reasoning in the above comment was not correct. 95% CIs have 83% "replication-capture", but this is a statement about repeated sampling and cannot be interpreted as giving a probability conditioned on one particular confidence interval, at least not without further assumptions. (Perhaps both this and previous comments should better be deleted in order not to confuse further readers.)
amoeba says Reinstate Monica

[Edited to fix the bug WHuber pointed out.]

I altered @Whuber's R code to use the t distribution, and plot coverage as a function of sample size. The results are below. At high sample size, the results match WHuber's of course.

[Figure: simulated coverage plotted against sample size, for α = 0.01 and α = 0.05]

And here is the adapted R code, run twice with alpha set to either 0.01 or 0.05.

sigma <- 2
mu <- -4
alpha <- 0.01
n.sim <- 1e5
#
# For each sample size, compute the t multiplier, simulate both samples,
# and report the coverage.
#
for (n in c(3, 5, 7, 10, 15, 20, 30, 50, 100, 250, 500, 1000)) {
  T <- qt(alpha/2, df=n-1)
  #
  # Draw the first sample and compute the CI as [l.1, u.1].
  #
  x.1 <- matrix(rnorm(n*n.sim, mu, sigma), nrow=n)
  x.1.bar <- colMeans(x.1)
  s.1 <- apply(x.1, 2, sd)
  l.1 <- x.1.bar + T * s.1 / sqrt(n)
  u.1 <- x.1.bar - T * s.1 / sqrt(n)
  #
  # Draw the second sample (of the same size) and compute the mean as x.2.
  #
  x.2 <- colMeans(matrix(rnorm(n*n.sim, mu, sigma), nrow=n))
  #
  # Compare the second sample means to the CIs.
  #
  covers <- l.1 <= x.2 & x.2 <= u.1
  coverage <- mean(covers)
  print(coverage)
}

And here is the GraphPad Prism file that made the graph.

Harvey Motulsky
I believe your plots do not use the t distribution, due to a bug: you set the value of T outside the loop! If you would like to see the correct curves, just plot them directly using the theoretical result in my answer, as given at the end of my R code (rather than relying on the simulated results): curve(pf(qt(.975, x-1)^2 / ((x * (1/x + 1/x))), 1, x-1), 2, 1000, log="x", ylim=c(.8,1), col="Blue"); curve(pf(qt(.995, x-1)^2 / ((x * (1/x + 1/x))), 1, x-1), add=TRUE, col="Red")
whuber
@whuber. Yikes! Of course you are right. Embarrassing. I've fixed it. As you pointed out the coverage is higher with tiny sample sizes. (I fixed the simulations, and didn't try your theoretical function.)
Harvey Motulsky
I am glad you fixed it, because it is very interesting how high the coverage is for small sample sizes. We could also invert your question and use the formula to determine what value of $Z_{\alpha/2}$ to use if we wished to assure (before doing any experiments), with probability $p = 0.95$ (say), that the mean of the second experiment would lie within the two-sided $1-\alpha$ confidence interval determined from the first. Doing so, as a routine practice, could be one intriguing way of addressing some criticism of NHST.
whuber
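A minimal sketch of that inversion in R, in the large-sample limit where the Normal approximation above applies (the function name is illustrative, not from the thread):

# Solve 1 - 2*pnorm(-Z.star/sqrt(2)) = p for the multiplier Z.star.
capture.multiplier <- function(p) -sqrt(2) * qnorm((1 - p) / 2)
Z.star <- capture.multiplier(0.95)
Z.star               # about 2.77, versus 1.96 for an ordinary 95% CI
2 * pnorm(-Z.star)   # about 0.0056: the nominal two-sided alpha this implies

For finite n, one would instead solve (1) = p numerically, for example with uniroot applied to the pf expression above.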
@whuber I think the next step is to look at the distribution of coverage. So far, we have the average coverage (average of many first experiments, with average of many second experiments each). But depending on what the first experiment is, in some cases the average coverage will be poor. It would be interesting to see the distribution. I'm trying to learn R well enough to find out.
Harvey Motulsky
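One way to examine that distribution in R, under the same Normal assumptions (a sketch, not from the thread): conditional on a first-experiment interval [l, u], the capture fraction has the closed form Φ((u−μ)/(σ/√m)) − Φ((l−μ)/(σ/√m)), so no second round of simulation is needed.

n <- 10; m <- 10; mu <- 0; sigma <- 1; alpha <- 0.05; n.sim <- 1e4
Z <- qnorm(alpha / 2)
#
# Simulate many first experiments and their CIs.
#
x.1 <- matrix(rnorm(n * n.sim, mu, sigma), nrow = n)
x.1.bar <- colMeans(x.1)
s.1 <- apply(x.1, 2, sd)
l.1 <- x.1.bar + Z * s.1 / sqrt(n)
u.1 <- x.1.bar - Z * s.1 / sqrt(n)
#
# Conditional capture fraction of each CI for a future mean based on m observations.
#
capture <- pnorm((u.1 - mu) / (sigma / sqrt(m))) - pnorm((l.1 - mu) / (sigma / sqrt(m)))
hist(capture, breaks = 50, main = "Distribution of the capture fraction")
mean(capture)   # agrees with formula (1) up to simulation error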
Regarding the distributions, see the paper I linked to in the comments above.
amoeba says Reinstate Monica