Bayesian view and Variational Inference


Published: May 01, 2020

This blog post covers the Bayesian view of statistics, the need for variational inference, and a simple mean field approximation method.

Bayesian view of statistics

Traditional ML methods treat model parameters as constants and try to find them using the maximum likelihood principle. Here the model parameters $\theta$ are unknown, but not random. This view of $\theta$ as a fixed but unknown value is the one taken in frequentist statistics. An alternative approach to parameter estimation is the Bayesian view, which treats model parameters as random variables with unknown values. The main ideas of the Bayesian view are the following.

Use prior knowledge
Choose the answer that best explains the observations
Avoid making extra assumptions (Occam’s razor)

Bayes’ Theorem

Let $\theta$ denote the model parameters and $X$ the observations. Throughout, $p(a/b)$ denotes the distribution of $a$ conditioned on $b$.

\[\underbrace{p(\theta/X)}_{\text{Posterior}} = \frac{\overbrace{p(X/\theta)}^{\text{Likelihood}}\overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(X)}_{\text{Evidence}}}\]

In the above formula:
$p(\theta)$ is the prior: the information we have about the parameters before seeing any data.
$p(X/\theta)$ is the likelihood: how well the parameters explain the observations.
$p(\theta/X)$ is the posterior: the distribution over the parameters after observing the data.
$p(X)$ is the evidence: how likely the observed data is under the model, averaged over all parameter values. It can only be calculated if we have a model that can generate the data.
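To make the four terms concrete, here is a minimal sketch that applies Bayes' theorem on a small grid of candidate parameter values. The coin-flip setting, the grid, the uniform prior and the counts are all illustrative assumptions, not part of the original post.

```python
# Coin-flip illustration of Bayes' theorem on a discrete grid of parameter values.
# The grid, the uniform prior and the observed counts are illustrative assumptions.

def grid_posterior(thetas, prior, heads, tails):
    """Return (posterior, evidence) for a Bernoulli likelihood on a grid."""
    # Likelihood p(X/theta) of the observed flips for every candidate theta.
    likelihood = [t ** heads * (1 - t) ** tails for t in thetas]
    # Numerator of Bayes' theorem: likelihood times prior.
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    # Evidence p(X): the numerator summed over all candidate parameters.
    evidence = sum(unnormalized)
    posterior = [u / evidence for u in unnormalized]
    return posterior, evidence

thetas = [i / 10 for i in range(1, 10)]        # 0.1, 0.2, ..., 0.9
prior = [1 / len(thetas)] * len(thetas)        # uniform prior
posterior, evidence = grid_posterior(thetas, prior, heads=7, tails=3)
best = thetas[posterior.index(max(posterior))]
print(f"evidence = {evidence:.6f}, most probable theta = {best}")
```

On a grid the evidence is just a finite sum, which is why this works here; with continuous, high-dimensional parameters that sum becomes an integral that is usually much harder to evaluate.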

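Bayesian methods try to find the distribution over the model parameters after observing the data, i.e. the posterior. For a supervised model with inputs $X$ and targets $y$, where the subscripts $tr$ and $ts$ denote training and test data, this gives the following at training and at inference time.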

Training

\[p(\theta/ X_{tr}, y_{tr}) = \frac{p(y_{tr}/X_{tr}, \theta) p(\theta)}{p(y_{tr}/X_{tr})}\]

Inference

\[p(y_{ts}/X_{ts}, X_{tr}, y_{tr}) = \int_{\theta} p(y_{ts}/X_{ts}, \theta) p(\theta/ X_{tr}, y_{tr}) d\theta\]
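The training and inference formulas can be sketched numerically for a one-parameter model. The example below assumes a hypothetical regression $y = \theta x + \text{noise}$ with known noise level, a broad Gaussian prior, and a grid over $\theta$ so that the integral becomes a sum; the data and all names are made up for illustration.

```python
import math

def normal_pdf(v, mean, std):
    """Density of N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical toy data, roughly following y = 2x; the noise std is assumed known.
x_tr = [0.0, 1.0, 2.0, 3.0]
y_tr = [0.1, 2.1, 3.9, 6.2]
noise_std = 0.5

# Candidate slopes on a grid, with an assumed broad prior p(theta) = N(0, 10^2).
thetas = [i / 100 for i in range(0, 401)]          # 0.00, 0.01, ..., 4.00
prior = [normal_pdf(t, 0.0, 10.0) for t in thetas]

def likelihood(theta):
    """p(y_tr / X_tr, theta) for independent Gaussian noise."""
    return math.prod(normal_pdf(y, theta * x, noise_std) for x, y in zip(x_tr, y_tr))

# Training: posterior over theta, normalized by the evidence p(y_tr / X_tr).
unnormalized = [likelihood(t) * p for t, p in zip(thetas, prior)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

def predictive_density(y_ts, x_ts):
    """Inference: average p(y_ts / x_ts, theta) over the posterior."""
    return sum(normal_pdf(y_ts, t * x_ts, noise_std) * w for t, w in zip(thetas, posterior))

def predictive_mean(x_ts):
    """Mean of the predictive distribution at x_ts."""
    return sum(t * x_ts * w for t, w in zip(thetas, posterior))

print(f"predictive mean at x = 4: {predictive_mean(4.0):.3f}")
```

Note that the prediction is not made with a single best $\theta$: every candidate contributes, weighted by its posterior probability.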

By choosing a suitable prior, we can embed our prior knowledge into the model, so the prior also acts as a regularizer. Bayes' theorem can also be used for online training:

\[\underbrace{p_k(\theta)}_{\text{New prior}} = \underbrace{p(\theta/x_k)}_{\text{Posterior}} = \frac{\overbrace{p(x_k/\theta)}^{\text{Likelihood}}\overbrace{p_{k-1}(\theta)}^{\text{Prior}}}{\underbrace{p(x_k)}_{\text{Evidence}}}\]

On every iteration we receive new data $x_k$ and use the posterior from the previous iteration as the prior for the current one. As more data arrives, the posterior incorporates more evidence and typically concentrates around the parameter values that explain the data.
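The online update can be sketched as a loop in which each posterior is fed back in as the next prior. The example assumes a Bernoulli likelihood, a grid over $\theta$ and a made-up data stream; it also checks that processing the observations one at a time gives the same posterior as processing them all at once.

```python
def bayes_update(prior, thetas, x):
    """One online step: combine the current prior with a single observation x in {0, 1}."""
    likelihood = [t if x == 1 else 1 - t for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)               # p(x_k) under the current belief
    return [u / evidence for u in unnormalized]

thetas = [i / 100 for i in range(1, 100)]      # 0.01, ..., 0.99
belief = [1 / len(thetas)] * len(thetas)       # p_0(theta): uniform prior (assumption)
stream = [1, 1, 0, 1, 1, 1, 0, 1]              # hypothetical observations arriving in order

for x in stream:
    belief = bayes_update(belief, thetas, x)   # the posterior becomes the next prior

# Batch posterior from all observations at once, for comparison.
heads = sum(stream)
tails = len(stream) - heads
unnormalized = [t ** heads * (1 - t) ** tails / len(thetas) for t in thetas]
total = sum(unnormalized)
batch = [u / total for u in unnormalized]

max_gap = max(abs(a - b) for a, b in zip(belief, batch))
print(f"largest difference between online and batch posterior: {max_gap:.2e}")
```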

Analytical Inference

The denominator in Bayes' theorem, the evidence $p(X)$, is usually difficult to compute, so the posterior distribution often cannot be obtained in closed form. Therefore, in general, we approximate the posterior. One common approach is to replace the posterior with a single point estimate. Because the evidence does not depend on $\theta$, it can be dropped when maximizing over $\theta$. The MAP (Maximum A Posteriori) estimate of $\theta$ is given by

\[ \begin{aligned} \theta_{MAP} &= \arg\max_{\theta} p(\theta/X) \newline &= \arg\max_{\theta} p(X/\theta) p(\theta) \end{aligned} \]
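As a worked example of how the prior acts as a regularizer in the MAP estimate, take logarithms and assume, purely for illustration, a zero-mean Gaussian prior $p(\theta) = \mathcal{N}(\theta/0, s^2 I)$. This particular prior is an assumption made here, not something the derivation above requires.

\[ \begin{aligned} \theta_{MAP} &= \arg\max_{\theta} \left[ \log p(X/\theta) + \log p(\theta) \right] \newline &= \arg\max_{\theta} \left[ \log p(X/\theta) - \frac{1}{2s^2} \|\theta\|^2 \right] \newline &= \arg\min_{\theta} \left[ -\log p(X/\theta) + \frac{1}{2s^2} \|\theta\|^2 \right] \end{aligned} \]

Under this assumption the MAP estimate is the maximum likelihood estimate with an added L2 penalty, and a narrower prior (smaller $s$) means a stronger penalty.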

Conjugate distributions

A prior $p(\theta)$ is conjugate to the likelihood $p(X/\theta)$ if the posterior and the prior lie in the same family of distributions. For example, let both the prior and the likelihood be normal distributions, with $p(\theta) = \mathcal{N}(\theta/m,s^2)$ and $p(X/\theta) = \mathcal{N}(X/\theta, \sigma^2)$. Then the posterior is also normal:

\[\underbrace{p(\theta/X)}_{\mathcal{N}(a, b^2)} = \frac{\overbrace{p(X/\theta)}^{\mathcal{N}(X/\theta, \sigma^2)}\overbrace{p(\theta)}^{\mathcal{N}(\theta/m, s^2)}}{p(X)}\]

That is, if we choose a prior that is conjugate to the likelihood, we can avoid computing the evidence: the posterior belongs to the same family as the prior, so only its parameters need to be determined.
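For the normal-normal case, the posterior parameters $a$ and $b^2$ have a well-known closed form when $X$ is a single observation: precisions add, and the posterior mean is a precision-weighted average. These expressions are not spelled out in the post; the sketch below states them as an assumption and checks them against a brute-force grid computation. The numeric values are arbitrary.

```python
import math

def normal_pdf(v, mean, std):
    """Density of N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def normal_posterior(x, m, s, sigma):
    """Posterior N(a, b^2) for prior N(theta/m, s^2) and likelihood N(x/theta, sigma^2)."""
    precision = 1 / s ** 2 + 1 / sigma ** 2        # precisions add
    b2 = 1 / precision
    a = b2 * (m / s ** 2 + x / sigma ** 2)         # precision-weighted mean
    return a, b2

x, m, s, sigma = 2.0, 0.0, 1.0, 1.0                # arbitrary illustrative values
a, b2 = normal_posterior(x, m, s, sigma)

# Brute-force check: normalize likelihood * prior on a grid and compare moments.
thetas = [-10 + 0.01 * i for i in range(2001)]
weights = [normal_pdf(x, t, sigma) * normal_pdf(t, m, s) for t in thetas]
total = sum(weights)
grid_mean = sum(t * w for t, w in zip(thetas, weights)) / total
grid_var = sum((t - grid_mean) ** 2 * w for t, w in zip(thetas, weights)) / total

print(f"closed form: a = {a:.4f}, b^2 = {b2:.4f}")
print(f"grid check:  mean = {grid_mean:.4f}, var = {grid_var:.4f}")
```

The closed form never touches the evidence; the grid check needs it (as `total`) only because it does not exploit conjugacy.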

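Let the likelihood be Bernoulli and the prior be a beta distribution, and let $N_1$ and $N_0$ be the number of ones and zeros among the observations $X$. Then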

\[ \begin{aligned} p(X/\theta) &= \theta^{N_1}(1-\theta)^{N_0} \newline p(\theta) &= B(\theta/a,b) \propto \theta^{a-1} (1-\theta)^{b-1} \newline p(\theta/X) &\propto p(X/\theta) p(\theta) \newline &\propto \theta^{N_1+a-1} (1-\theta)^{N_0+b-1} \newline \Rightarrow p(\theta/X) &= B(\theta/N_1+a, N_0+b) \end{aligned} \]

That is, we obtained the exact posterior without calculating the evidence.
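In code, the Beta-Bernoulli update reduces to adding counts to the prior parameters. A minimal sketch, with hypothetical function names, prior and data:

```python
def beta_bernoulli_posterior(a, b, data):
    """Return the parameters of the Beta posterior for 0/1 observations."""
    n1 = sum(data)                 # N_1: number of ones
    n0 = len(data) - n1            # N_0: number of zeros
    return a + n1, b + n0

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

data = [1, 0, 1, 1, 0, 1, 1, 1]    # hypothetical observations: six ones, two zeros
post_a, post_b = beta_bernoulli_posterior(2, 2, data)
print(f"posterior: Beta({post_a}, {post_b}), mean = {beta_mean(post_a, post_b):.3f}")
```

The prior parameters behave like pseudo-counts: a Beta(2, 2) prior pulls the posterior mean slightly towards 0.5 compared with the raw frequency 6/8.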

Variational Inference

\[p^*(z) = p(z/X) = \frac{p(X/z) p(z)}{p(X)} = \frac{p(X/z) p(z)}{\int p(X/z) p(z) dz} = \frac{\hat{p}(z)}{Z}\]

Here $z$ denotes the latent variables we want to infer, $\hat{p}(z)$ is the unnormalized posterior and $Z$ is the normalization constant.

Computing the posterior in closed form (as an analytical expression) using Bayes' formula is not possible in many cases, because the integral needed for the evidence is intractable. It is possible only in special cases, such as when the likelihood and prior are conjugate to each other. So there is a need to approximate the posterior distribution. Variational inference approximates the posterior with a simpler, known distribution. The main idea is as follows:

Pick a family of distributions $Q$ over the latent variables, indexed by variational parameters. We call this the variational family.
Find the variational parameters such that $q(z)$ is the best approximation to the posterior $p^*(z)$:

\[KL(q(z) || p^{*}(z)) \to \min_{q \in Q}\]

Use $q(z)$ with the fitted parameters as an approximation to the posterior, e.g. to form predictions over future data. Typically, the true posterior does not lie in the variational family.
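The three steps above can be sketched for a one-dimensional latent variable. Everything specific in this example is an assumption chosen for illustration: the target (a standard normal prior times a logistic likelihood), the Gaussian variational family, and the brute-force grid search that stands in for a real optimizer. Integrals are approximated by sums on a grid, which is only feasible in very low dimensions.

```python
import math

def log_p_hat(z):
    """Unnormalized log-posterior: standard normal prior times a logistic likelihood."""
    return -0.5 * z * z - math.log1p(math.exp(-4.0 * z))

def log_q(z, mu, sigma):
    """Log-density of the variational Gaussian N(mu, sigma^2)."""
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# 1-D grid on which integrals are replaced by Riemann sums.
dz = 0.02
grid = [-6.0 + dz * i for i in range(601)]
log_Z = math.log(sum(math.exp(log_p_hat(z)) for z in grid) * dz)
log_post = [log_p_hat(z) - log_Z for z in grid]    # log p*(z) on the grid

def kl_q_to_posterior(mu, sigma):
    """KL(q || p*) approximated on the grid."""
    total = 0.0
    for z, lp in zip(grid, log_post):
        lq = log_q(z, mu, sigma)
        total += math.exp(lq) * (lq - lp) * dz
    return total

# Step 1: the variational family is {N(mu, sigma^2)} on a coarse parameter grid.
candidates = [(-1.0 + 0.05 * i, 0.3 + 0.05 * j) for i in range(61) for j in range(25)]
# Step 2: pick the member with the smallest KL divergence to the posterior.
best_mu, best_sigma = min(candidates, key=lambda c: kl_q_to_posterior(*c))
best_kl = kl_q_to_posterior(best_mu, best_sigma)
# Step 3: q = N(best_mu, best_sigma^2) is then used in place of the posterior.
print(f"best q = N({best_mu:.2f}, {best_sigma:.2f}^2), KL = {best_kl:.4f}")
```

The remaining KL stays above zero because the target is skewed and therefore lies outside the Gaussian family.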

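\[ \begin{aligned} KL(q(z) || p^{*}(z)) &= KL(q(z) || \frac{\hat{p}(z)}{Z}) \newline &= \int q(z) \log\frac{q(z)}{\hat{p}(z)/Z} dz \newline &= \int q(z) \log\frac{q(z)}{\hat{p}(z)} dz + \int q(z) \log Z dz \newline &= KL(q(z) || \hat{p}(z)) + \log Z \end{aligned} \]

The last step uses $\int q(z) dz = 1$. Since $\log Z$ does not depend on $q$, minimizing $KL(q(z) || p^{*}(z))$ is equivalent to minimizing $KL(q(z) || \hat{p}(z))$. So we only need the unnormalized posterior.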
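The identity is easy to check numerically on a small discrete example. The numbers below are arbitrary; `kl` is a hypothetical helper that evaluates $\sum q \log(q/p)$ without requiring $p$ to be normalized.

```python
import math

def kl(q, p):
    """Sum of q * log(q / p); p does not have to be normalized."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p_hat = [2.0, 1.0, 4.0, 1.0]               # arbitrary unnormalized posterior
Z = sum(p_hat)                             # normalization constant
p_star = [v / Z for v in p_hat]            # normalized posterior p*
q = [0.4, 0.1, 0.4, 0.1]                   # some candidate approximation

lhs = kl(q, p_star)                        # KL(q || p*)
rhs = kl(q, p_hat) + math.log(Z)           # KL(q || p_hat) + log Z
print(f"KL(q || p*) = {lhs:.6f}, KL(q || p_hat) + log Z = {rhs:.6f}")
```

Because the two quantities differ by the same constant for every $q$, ranking candidate approximations by either one gives the same winner.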

Mean Field Approximation

In mean field variational inference, we assume that every member of the variational family factorizes over the dimensions of the latent variable:
\[Q = \{ q; q(z) = q(z_1, z_2,...z_d) = q_1(z_1) q_2(z_2)...q_d(z_d)\}\]
For example, with two latent dimensions,
\[p^*(z_1, z_2) \approx q_1(z_1) q_2(z_2)\]
The goal is to find the best approximation $q(z)$ of $p^*(z)$ within this family. We use a coordinate ascent algorithm, iteratively optimizing each factor $q_k$ while keeping the others fixed.

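Minimizing the KL divergence with respect to $q_k$, with all other factors fixed, gives

\[\log q_k(z_k) = h(z_k) + \text{const}, \qquad h(z_k) = \mathbf{E}_{q_{-k}} \log p^*(z)\]

Here $\mathbf{E}_{q_{-k}}$ denotes the expectation over all factors except $q_k$, and the constant is fixed by normalizing $q_k$. Because $p^*$ and $\hat{p}$ differ only by the constant $Z$, the unnormalized posterior $\hat{p}$ can be used in place of $p^*$.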
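A minimal sketch of this coordinate ascent for two discrete latent variables follows. The joint table is made up, and the unnormalized $\hat{p}$ is used inside the update, which changes $\log q_k$ only by a constant that normalization removes. The KL divergence to the true posterior is tracked only to illustrate that each sweep does not increase it; a real application would not have access to $p^*$.

```python
import math

def normalize_log(log_w):
    """Exponentiate and normalize log-weights in a numerically stable way."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def kl_to_target(q1, q2, p_star):
    """KL(q1 * q2 || p*) for a factorized q over two discrete variables."""
    total = 0.0
    for i, a in enumerate(q1):
        for j, b in enumerate(q2):
            if a * b > 0:
                total += a * b * math.log(a * b / p_star[i][j])
    return total

# Made-up unnormalized joint over z1 in {0, 1} and z2 in {0, 1, 2}.
p_hat = [[4.0, 1.0, 1.0],
         [1.0, 2.0, 6.0]]
Z = sum(sum(row) for row in p_hat)
p_star = [[v / Z for v in row] for row in p_hat]
log_p = [[math.log(v) for v in row] for row in p_hat]

q1 = [0.5, 0.5]                    # initial factors: uniform
q2 = [1 / 3, 1 / 3, 1 / 3]
history = [kl_to_target(q1, q2, p_star)]

for _ in range(20):
    # log q1(z1) = E_{q2} log p_hat(z1, z2) + const
    q1 = normalize_log([sum(q2[j] * log_p[i][j] for j in range(3)) for i in range(2)])
    # log q2(z2) = E_{q1} log p_hat(z1, z2) + const
    q2 = normalize_log([sum(q1[i] * log_p[i][j] for i in range(2)) for j in range(3)])
    history.append(kl_to_target(q1, q2, p_star))

print(f"KL before: {history[0]:.4f}, after: {history[-1]:.4f}")
print("q1 =", [round(v, 3) for v in q1], "q2 =", [round(v, 3) for v in q2])
```

The final KL is still positive: the table has correlated variables, so no product of independent factors can match it exactly.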

References

Bayesian Methods for Machine Learning by National Research University Higher School of Economics


Vamshi Kumar Kurva
