Likelihood vs. conditional distribution for Bayesian analysis


We can write Bayes' theorem as

$$p(\theta \mid x) = \frac{f(X \mid \theta)\, p(\theta)}{\int_{\theta} f(X \mid \theta)\, p(\theta)\, d\theta}$$

where $p(\theta \mid x)$ is the posterior, $f(X \mid \theta)$ is the conditional distribution, and $p(\theta)$ is the prior.

or

$$p(\theta \mid x) = \frac{L(\theta \mid x)\, p(\theta)}{\int_{\theta} L(\theta \mid x)\, p(\theta)\, d\theta}$$

where $p(\theta \mid x)$ is the posterior, $L(\theta \mid x)$ is the likelihood function, and $p(\theta)$ is the prior.

My questions are:

  1. Why is Bayesian analysis done using the likelihood function rather than the conditional distribution?
  2. Can you say in words what the difference is between the likelihood and the conditional distribution? I know that the likelihood is not a probability distribution and that $L(\theta \mid x) \propto f(X \mid \theta)$.
kzoo
There is no difference! The likelihood is the conditional distribution $f(X \mid \theta)$ up to a proportionality constant, and that is all that matters.
kjetil b halvorsen
The parameter $\Theta$ has prior density $p_{\Theta}(\theta)$. If $\theta$ is the realized value of $\Theta$ while $x$ is the observed value of a random variable $X$, then the value of the likelihood function $L(\theta \mid x)$ is exactly $f(x \mid \theta)$, the value of the conditional density $f_{X \mid \Theta}(x \mid \Theta = \theta)$ of $X$. The difference is that $\int f_{X \mid \Theta}(x \mid \Theta = \theta)\, dx = 1$ for every realization $\theta$ of $\Theta$, whereas, as a function of $\theta$ (with $x$ fixed), $L(\theta \mid x)$ is not a density: in general, $\int L(\theta \mid x)\, d\theta \neq 1$.
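To make this concrete, here is a minimal numeric sketch. The Bernoulli model is my own illustrative choice, not part of the original comment: the same expression sums to one over $x$ but does not integrate to one over $\theta$.

```python
# Bernoulli model: f(x|theta) = theta^x * (1 - theta)^(1 - x), x in {0, 1}.
# The model choice is an illustrative assumption, not from the original comment.
from scipy.integrate import quad

def f(x, theta):
    """Conditional mass of X given Theta = theta."""
    return theta**x * (1 - theta)**(1 - x)

theta = 0.3
# As a function of x (theta fixed), the conditional distribution sums to 1:
print(f(0, theta) + f(1, theta))                  # 1.0

# As a function of theta (x fixed), the likelihood need not integrate to 1:
x_obs = 1
area, _ = quad(lambda t: f(x_obs, t), 0.0, 1.0)
print(area)                                       # 0.5, not 1
```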
Dilip Sarwate

Answers:


Suppose you have random variables $X_1, \dots, X_n$ (whose values will be observed in your experiment) that are conditionally independent given $\Theta = \theta$, with conditional densities $f_{X_i \mid \Theta}(\,\cdot \mid \theta)$, for $i = 1, \dots, n$. This is your (postulated) statistical (conditional) model, and the conditional densities express, for each possible value $\theta$ of the (random) parameter $\Theta$, your uncertainty about the values of the $X_i$'s, *before* you have access to any real data. With the help of the conditional densities you can, for example, compute conditional probabilities like

$$P\{X_1 \in B_1, \dots, X_n \in B_n \mid \Theta = \theta\} = \int_{B_1 \times \dots \times B_n} \prod_{i=1}^n f_{X_i \mid \Theta}(x_i \mid \theta)\, dx_1 \dots dx_n,$$

for every $\theta$.

After you have access to an actual sample $(x_1, \dots, x_n)$ of values (realizations) of the $X_i$'s observed in one run of your experiment, the situation changes: there is no longer any uncertainty about the observables $X_1, \dots, X_n$. Suppose that the random $\Theta$ takes values in a parameter space $\Pi$. Now, for those known (fixed) values $(x_1, \dots, x_n)$, you define a function

$$L_{x_1, \dots, x_n} : \Pi \to \mathbb{R}$$

by

$$L_{x_1, \dots, x_n}(\theta) = \prod_{i=1}^n f_{X_i \mid \Theta}(x_i \mid \theta).$$
Note that $L_{x_1, \dots, x_n}$, known as the "likelihood function", is a function of $\theta$. In this "after you have data" situation, the likelihood $L_{x_1, \dots, x_n}$ contains, for the particular conditional model that we are considering, all the information about the parameter $\Theta$ contained in this particular sample $(x_1, \dots, x_n)$. In fact, it happens that $L_{x_1, \dots, x_n}$ is a sufficient statistic for $\Theta$.
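As a concrete sketch of this "after-sample" object, assuming a $\text{N}(\theta, 1)$ model as a stand-in for the generic $f_{X_i \mid \Theta}$ (the data values below are made up):

```python
# Build the likelihood function L_{x_1,...,x_n} from fixed observed data.
# The N(theta, 1) model and the data are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def make_likelihood(x_sample):
    """Given fixed data x_1..x_n, return L_{x_1,...,x_n} as a function of theta."""
    x_sample = np.asarray(x_sample)
    def L(theta):
        # Product of the conditional densities, evaluated at the fixed data.
        return np.prod(norm.pdf(x_sample, loc=theta, scale=1.0))
    return L

x_obs = [1.2, 0.7, 1.9]        # hypothetical realizations of X_1, X_2, X_3
L = make_likelihood(x_obs)     # an "after-sample" object: a function of theta alone
print(L(1.0), L(2.0))
```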

Answering your question: to understand the differences between the concepts of conditional density and likelihood, keep in mind their mathematical definitions (which are clearly different: they are different mathematical objects, with different properties), and also remember that the conditional density is a "pre-sample" object/concept, while the likelihood is an "after-sample" one. I hope that all this also helps you to see why Bayesian inference (using your way of putting it, which I don't think is ideal) is done "using the likelihood function and not the conditional distribution": the goal of Bayesian inference is to compute the posterior distribution, and to do so we condition on the observed (known) data.

Zen
I think Zen is correct when he says that the likelihood and the conditional probability are different. In the likelihood function, θ is not a random variable, so it is different from a conditional probability.
Martine

Proportionality is used to simplify analysis

Bayesian analysis is generally done via an even simpler statement of Bayes' theorem, where we work only in terms of proportionality with respect to the parameter of interest. For a standard IID model with sampling density f(X|θ) we can express this as:

$$p(\theta \mid \mathbf{x}) \propto L_\mathbf{x}(\theta)\, p(\theta) \qquad \qquad L_\mathbf{x}(\theta) \propto \prod_{i=1}^n f(x_i \mid \theta).$$

This statement of Bayesian updating works in terms of proportionality with respect to the parameter θ. It uses two proportionality simplifications: one in the use of the likelihood function (proportional to the sampling density) and one in the posterior (proportional to the product of likelihood and prior). Since the posterior is a density function (in the continuous case), the norming rule then sets the multiplicative constant that is required to yield a valid density (i.e., to make it integrate to one).

This use of proportionality has the advantage of allowing us to ignore any multiplicative elements of the functions that do not depend on the parameter θ. This tends to simplify the problem by allowing us to sweep away unnecessary parts of the mathematics and get simpler statements of the updating mechanism. This is not a mathematical requirement (since Bayes' rule works in its non-proportional form too), but it makes things simpler for our tiny animal brains.
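To see the proportional scheme in action, here is a minimal grid sketch (the model, prior, and data are illustrative assumptions, not part of the original answer): multiply likelihood and prior pointwise, ignore all constants along the way, and let the norming step fix the scale at the end.

```python
# Grid approximation of proportional Bayesian updating.
# Model N(theta, 1), prior N(0, 1), and the data are illustrative assumptions.
import numpy as np
from scipy.stats import norm

theta_grid = np.linspace(-3.0, 3.0, 2001)
dtheta = theta_grid[1] - theta_grid[0]
x_obs = np.array([0.8, 1.1, 0.4])

# Unnormalized posterior: likelihood times prior (any constants in theta that
# we drop here are irrelevant; they cancel in the normalization below).
log_like = np.array([norm.logpdf(x_obs, loc=t, scale=1.0).sum() for t in theta_grid])
log_prior = norm.logpdf(theta_grid, loc=0.0, scale=1.0)
unnorm = np.exp(log_like + log_prior)

# The norming rule: rescale so the result integrates (numerically) to one.
posterior = unnorm / (unnorm.sum() * dtheta)
print(posterior.sum() * dtheta)   # ~1.0
```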

An applied example: Consider an IID model with observed data $X_1, \dots, X_n \sim \text{IID N}(\theta, 1)$. To facilitate our analysis we define the first two sample moments $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and $\bar{\bar{x}} = \frac{1}{n} \sum_{i=1}^n x_i^2$. For this model we have the sampling density:

$$\begin{aligned}
f(\mathbf{x} \mid \theta) &= \prod_{i=1}^n f(x_i \mid \theta) = \prod_{i=1}^n \text{N}(x_i \mid \theta, 1) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\Big( -\tfrac{1}{2} (x_i - \theta)^2 \Big) = (2\pi)^{-n/2} \exp\Big( -\tfrac{1}{2} \sum_{i=1}^n (x_i - \theta)^2 \Big) \\
&= (2\pi)^{-n/2} \exp\Big( -\tfrac{n}{2} (\theta^2 - 2\bar{x}\theta + \bar{\bar{x}}) \Big) \\
&= (2\pi)^{-n/2} \exp\Big( -\tfrac{n\bar{\bar{x}}}{2} \Big) \exp\Big( -\tfrac{n}{2} (\theta^2 - 2\bar{x}\theta) \Big).
\end{aligned}$$

Now, we can work directly with this sampling density if we want to. But notice that the first two terms in this density are multiplicative constants that do not depend on θ. It is annoying to have to keep track of these terms, so let's just get rid of them, so we have the likelihood function:

$$L_\mathbf{x}(\theta) = \exp\Big( -\frac{n}{2} (\theta^2 - 2\bar{x}\theta) \Big).$$

That simplifies things a little bit, since we don't have to keep track of an additional term. Now, we could apply Bayes' rule using its full equation version, including the integral denominator. But again, this would require us to keep track of another annoying multiplicative constant that does not depend on θ (more annoying because we have to solve an integral to get it). So let's just apply Bayes' rule in its proportional form. Using the conjugate prior $\theta \sim \text{N}(0, \lambda_0)$, parameterized here by a known precision parameter $\lambda_0 > 0$, we get the following result (by completing the square):

$$\begin{aligned}
p(\theta \mid \mathbf{x}) &\propto L_\mathbf{x}(\theta)\, p(\theta) \\
&= \exp\Big( -\tfrac{n}{2} (\theta^2 - 2\bar{x}\theta) \Big)\, \text{N}(\theta \mid 0, \lambda_0) \\
&\propto \exp\Big( -\tfrac{n}{2} (\theta^2 - 2\bar{x}\theta) \Big) \exp\Big( -\tfrac{\lambda_0}{2} \theta^2 \Big) \\
&= \exp\Big( -\tfrac{1}{2} (n\theta^2 - 2n\bar{x}\theta + \lambda_0\theta^2) \Big) \\
&= \exp\Big( -\tfrac{1}{2} \big( (n+\lambda_0)\theta^2 - 2n\bar{x}\theta \big) \Big) \\
&= \exp\Big( -\tfrac{n+\lambda_0}{2} \Big( \theta^2 - \tfrac{2n\bar{x}}{n+\lambda_0}\theta \Big) \Big) \\
&\propto \exp\Big( -\tfrac{n+\lambda_0}{2} \Big( \theta - \tfrac{n}{n+\lambda_0}\bar{x} \Big)^2 \Big) \\
&\propto \text{N}\Big( \theta \,\Big|\, \tfrac{n}{n+\lambda_0}\bar{x},\; n+\lambda_0 \Big).
\end{aligned}$$

So, from this working we can see that the posterior distribution is proportional to a normal density. Since the posterior must be a density, this implies that the posterior is that normal density:

$$p(\theta \mid \mathbf{x}) = \text{N}\Big( \theta \,\Big|\, \frac{n}{n+\lambda_0}\bar{x},\; n+\lambda_0 \Big).$$

Hence, we see that a posteriori the parameter θ is normally distributed with posterior mean and variance given by:

$$\mathbb{E}(\theta \mid \mathbf{x}) = \frac{n}{n+\lambda_0}\bar{x}, \qquad \mathbb{V}(\theta \mid \mathbf{x}) = \frac{1}{n+\lambda_0}.$$

Now, the posterior distribution we have derived has a constant of integration out the front of it (which we can find easily by looking up the form of the normal distribution). But notice that we did not have to worry about this multiplicative constant - all our working removed (or brought in) multiplicative constants whenever this simplified the mathematics. The same result can be derived while keeping track of the multiplicative constants, but this is a lot messier.
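As a quick sanity check of these closed-form expressions, one can compare them against a brute-force grid normalization (the data below are simulated, purely for illustration):

```python
# Numeric check of the conjugate-posterior formulas E = n*xbar/(n+lambda_0),
# V = 1/(n+lambda_0). Data and parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
lambda_0, n = 2.0, 50
x = rng.normal(1.5, 1.0, size=n)      # simulated N(theta=1.5, 1) sample
xbar = x.mean()

post_mean = n * xbar / (n + lambda_0)
post_var = 1.0 / (n + lambda_0)

# Brute force: normalize likelihood * prior on a grid and take moments.
theta = np.linspace(post_mean - 1.0, post_mean + 1.0, 4001)
log_unnorm = -0.5 * ((x[:, None] - theta) ** 2).sum(axis=0) - 0.5 * lambda_0 * theta**2
w = np.exp(log_unnorm - log_unnorm.max())
w /= w.sum()                          # probability weights on the grid
grid_mean = (w * theta).sum()
grid_var = (w * (theta - grid_mean) ** 2).sum()

print(post_mean, grid_mean)           # agree to several decimals
print(post_var, grid_var)             # agree to several decimals
```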

Reinstate Monica

I think Zen's answer really tells you how the likelihood function and the joint density of values of random variables differ conceptually. Still, mathematically, as a function of both the $x_i$'s and $\theta$, they are the same, and in that sense the likelihood can be looked at as a probability density. The difference you point to in the formula for the Bayes posterior distribution is just a notational difference. But the subtlety of the difference is nicely explained in Zen's answer.

This issue has come up in other questions discussed on this site regarding the likelihood function. The comments by kjetil and Dilip also seem to support what I am saying.

Michael R. Chernick