### Theory of variational inference

#### Premise

Given a set of $N$ observed variables $X$, the Bayesian framework of model fitting aims to find the parameters $\theta$ that are most probable given $X$. In probabilistic terms, we try to maximize the posterior $P(\theta|X)$.

This is known as maximum a posteriori (MAP) estimation.

Another formulation of the problem is to find a $\theta$ that maximizes the probability of observing $X$, i.e. the likelihood $P(X|\theta)$. This is known as maximum likelihood estimation (MLE).

The actual process behind an observation can be very complex. Therefore we often resort to a Bayesian network that leads to the observation. This kind of model is more expressive, since we can make objective claims about the generative process itself.

This more complex formalism involves a set of intermediate variables that we don't directly estimate, but that help us put more constraints on the model. These variables are commonly denoted by $Z$.

Here we don't show $\theta$ in the model, but it is still there.

#### Formulation

The goal is still the same, to maximize the log-likelihood of the observation.

Jensen's inequality says that for a concave function $f$, any point on the straight line connecting two points on the curve (i.e. $\mathbb{E}[f(x)]$) lies on or below the corresponding point on the curve itself (i.e. $f(\mathbb{E}[x])$); therefore $\mathbb{E}[f(x)] \leq f(\mathbb{E}[x])$. Since $\log$ is concave, taking $f = \log$ gives $\mathbb{E}[\log x] \leq \log \mathbb{E}[x]$.
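As a sketch of how this inequality yields the lower bound discussed next, the steps can be written out as follows. The notation $q(Z)$ for an arbitrary distribution over the hidden variables is taken from the text; the equation numbers of the original are not reproduced here, and the exact form of the original equations is an assumption.

$$
\begin{aligned}
\log P(X|\theta) &= \log \int_z P(X,Z|\theta)\, dz \\
&= \log \int_z q(Z)\, \frac{P(X,Z|\theta)}{q(Z)}\, dz \\
&= \log \mathbb{E}_{q(Z)} \Big[ \frac{P(X,Z|\theta)}{q(Z)} \Big] \\
&\geq \mathbb{E}_{q(Z)} \Big[ \log \frac{P(X,Z|\theta)}{q(Z)} \Big]
\end{aligned}
$$

The last step is Jensen's inequality with $f = \log$, applied to the ratio $P(X,Z|\theta)/q(Z)$ as the random quantity.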

There are many terms used to denote equation (9), such as variational lower bound, Evidence Lower Bound (ELBO), etc. Intuitively, it is the average (with respect to a distribution $q$) of the log-ratio between the joint likelihood of $(X,Z)$ and a fictitious variational distribution $q(Z)$. There is another way of writing (9) if we take the denominator out: the expected log joint likelihood under $q$ minus the expected value of $\log q(Z)$ under $q$.

The negative expectation of a log-probability (i.e. $-\int p(x) \log p(x)\, dx$) is also known as Shannon's entropy.

#### Basic EM algorithm

So we see that there is a gap between LHS and RHS in equation (9). We can examine the gap further,

Since $q(z|\theta)$ is a distribution over $z$ only ($\theta$ is given when we evaluate $q$) and $\log P(X|\theta)$ does not depend on $z$, $\log P(X|\theta)$ can be moved inside $\mathbb{E}_{q(z|\theta)}$.

The RHS of (14) is the Kullback-Leibler divergence between the fictitious distribution of the hidden variable $Z$ and the true posterior distribution of $Z$, generally denoted by $KL \big[ q(Z|\theta) || P(Z|X,\theta) \big]$,

The ELBO, $\mathbb{E}_{q(Z|\theta)} \Big[ \log{\frac{P(X,Z|\theta)}{q(Z|\theta)}} \Big]$, can also be written as $\mathcal{L}(q,\theta)$, and subsequently,

The $KL$ divergence is a non-negative quantity, and it is zero only when the two distributions are equal.

Equation 17 gives us another definition of $\mathcal{L}$, the variational lower bound or Evidence Lower Bound (ELBO),

Equation 19 will be useful later.
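The decomposition of the log-likelihood into the ELBO plus a KL term can be checked numerically on a tiny discrete model. The sketch below is only an illustration: the three-valued hidden variable and the numbers in `prior` and `lik` are made up for the example, and the names `elbo` and `kl` are hypothetical helpers, not part of any library.

```python
import numpy as np

# Hypothetical toy model: Z takes 3 values, X is a single fixed observation.
prior = np.array([0.5, 0.3, 0.2])     # P(Z | theta)
lik = np.array([0.10, 0.40, 0.70])    # P(X | Z, theta) for the observed X

joint = prior * lik                   # P(X, Z | theta)
evidence = joint.sum()                # P(X | theta), the "integral" over Z
posterior = joint / evidence          # P(Z | X, theta)

def elbo(q):
    # E_q[ log P(X, Z | theta) - log q(Z) ]
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    # KL[q || p] for discrete distributions with full support
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.6, 0.3, 0.1])         # an arbitrary variational distribution
log_evidence = float(np.log(evidence))

print("log P(X|theta)      :", log_evidence)
print("ELBO(q) + KL(q||post):", elbo(q) + kl(q, posterior))
print("ELBO at q = posterior:", elbo(posterior))
```

For any valid `q` the first two printed values agree, and the ELBO touches the log-likelihood only when `q` equals the posterior, which is what the E step exploits.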

An iterative algorithm known as EM (expectation-maximization) is applied: we alternately optimize $q$ and $\theta$. EM is applicable when both of these steps can be carried out exactly.

##### Actual Algorithm

Let's assume we already have a $\theta \leftarrow \theta^0$

##### Step 1: E step

In this step we minimize the KL divergence. $KL(q(Z|\theta^0)||P(Z|X,\theta^0))$ becomes 0 if $q(Z|\theta^0) = P(Z|X,\theta^0)$.

It would be awesome if we could evaluate $P(Z|X,\theta^0)$. Remember that it amounts to solving the following expression (Bayes' rule),

$$P(Z|X,\theta^0) = \frac{P(X|Z,\theta^0)\, P(Z|\theta^0)}{\int_z P(X|Z,\theta^0)\, P(Z|\theta^0)\, dz}$$

The integration in the denominator of equation (22) is the hardest part here, for many reasons: for example, $\int_z P(X|Z,\theta^0) P(Z|\theta^0) dz$ might not have an analytical closed form, or might not be integrable at all.

Say that by some superpower (or because the numerator of equation 20 is analytically so simple that you can just integrate it) you can solve the equation.

Finding this value is the crux of the E step, but not where it ends. We also evaluate the ELBO by plugging $q^0(Z|\theta) = P(Z|X,\theta^0)$ into 18.

We move to the next step for obtaining a better estimate of $\theta$.

##### Step 2: M step

Given $q^0$, the next step is to maximize $\mathcal{L}(q^0, \theta)$ with respect to $\theta$. There can be many ways to solve it. You might want to plug the function $q^0$ into the ELBO to evaluate $\mathcal{L}(q^0,\theta)$, but that would just yield a function of $\theta$,

A very naive way to solve equation (20) is to differentiate it with respect to $\theta$ (if you can), set the derivative to zero, and solve for $\theta^1$.

That's how the EM algorithm proceeds,

$\theta^0$ can be any value you deem reasonable (though a poor choice might slow down convergence).
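To make the two steps concrete, here is a minimal sketch of EM on a model where the E step is tractable: a one-dimensional mixture of two Gaussians. The choice of model, the function names `log_normal` and `em_gmm`, the synthetic data and the initial guess are all assumptions made for illustration; they do not come from the derivation above.

```python
import numpy as np

def log_normal(x, mu, var):
    # Log-density of N(mu, var) evaluated at x (broadcasts over arrays)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def em_gmm(x, n_iter=50):
    # theta^0: a crude initial guess for weights, means and variances
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    trace = []
    for _ in range(n_iter):
        # E step: set q(z) to the exact posterior P(z | x, theta)
        log_joint = np.log(w) + log_normal(x[:, None], mu, var)  # shape (N, 2)
        log_evidence = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        q = np.exp(log_joint - log_evidence[:, None])
        # With q equal to the posterior the KL term is 0,
        # so the ELBO equals the log-likelihood at the current theta.
        trace.append(log_evidence.sum())
        # M step: maximize E_q[log P(x, z | theta)] over theta in closed form
        nk = q.sum(axis=0)
        w = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var, np.array(trace)

# Synthetic data: 300 points around -2 and 700 points around 3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
w, mu, var, trace = em_gmm(x)
print("weights:", w, "means:", mu, "variances:", var)
```

The returned `trace` holds the value of the bound at each iteration; plotting it should show a curve that increases and then flattens.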

#### FAQ

##### Why is the E step called the E (expectation) step?

The name comes from evaluating the expectation in equation 27. In essence, we evaluate the expectation of the log joint likelihood of the observed data and hidden variables under the distribution $q$.

##### Why is the ELBO called the variational lower bound?

Because $q$ is called the variational distribution. In the case of EM we got super lucky and could evaluate $q$ exactly. That's not the case in most problems, where we resort to a variational distribution $q$, and given that variational distribution the ELBO is a lower bound. You can evaluate $\mathcal{L}(q^{i},\theta^{i})$ after each $i$-th M step and plot this value. If the implementation is successful, we would see an increasing curve that saturates with $i$, denoting that the lower bound is improving.

#### What if $\int_z P(X|Z,\theta^0) P(Z|\theta^0) dz$ is a hard nut to crack (intractable)?

If we can't solve the integration $\int_z P(X|Z,\theta^0) P(Z|\theta^0) dz$, then we have to figure out other ways to lower the KL divergence. We of course cannot make it 0 under $\theta^0$, but maybe we can get closer, so that the KL divergence is a small value.

We still know that our best bet is to evaluate $P(Z|X,\theta^0)$, and therefore to evaluate $\int_z P(X|Z,\theta^0) P(Z|\theta^0) dz$, i.e. $\mathbb{E}_{P(Z|\theta^0)} [P(X|Z,\theta^0)]$. If we can't do that exactly, then at the least we can use numerical algorithms to estimate this integral approximately. There are many approximation algorithms that can give us numeric estimates of an integral.

##### 1. Try half-hearted EM nevertheless

A very naive technique would be to sample a bunch of points from the distribution $P(Z|\theta^0)$ (which can be an intimidating task on its own) and evaluate $P(X|Z,\theta^0)$ at each of them.

To be more concrete, you sample (there are many good algorithms for well-structured sampling, such as MCMC and Gibbs sampling) a vector of hidden variables $\mathbf{z}^i \sim P(Z|\theta^0)$ (if you can) and evaluate $P(X|\mathbf{z}^i,\theta^0)$. Say you do it $D$ (a bunch of) times; then our best hope is to calculate the average $\frac{1}{D} \sum_{i} P(X|\mathbf{z}^i,\theta^0)$ as an estimate of the integral.

This approximation looks simple, but such a solution comes with caveats, such as,

• The expression $\frac{1}{D} \sum_{i} P(X|\mathbf{z}^i,\theta^0)$ can be far from the true $\int_z P(X|Z,\theta^0) P(Z|\theta^0) dz$ when the $\mathbf{z}^i$ are not well distributed or are autocorrelated (and therefore do not cover the space of the true hidden variable distribution).
• We often can't simply sample from $P(Z|\theta^0)$, because it has no analytical form. (This happens a lot in real-world problems.)
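The sampling estimate described above can be tried on a model where the exact answer is known, so the error is visible. The sketch below assumes a toy model chosen only for this purpose: $Z \sim \mathcal{N}(0,1)$ and $X|Z \sim \mathcal{N}(Z,1)$, for which the marginal is $X \sim \mathcal{N}(0,2)$. The helper `normal_pdf`, the observation `x_obs` and the sample count `D` are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at x
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
x_obs = 1.0      # a single observed value of X
D = 200_000      # number of samples

# Sample z^i ~ P(Z | theta^0), here a standard normal prior
z = rng.normal(0.0, 1.0, size=D)

# Monte Carlo estimate of the integral: (1/D) * sum_i P(X | z^i, theta^0)
estimate = normal_pdf(x_obs, z, 1.0).mean()

# For this toy model the integral is available in closed form: X ~ N(0, 2)
exact = normal_pdf(x_obs, 0.0, 2.0)

print("Monte Carlo estimate:", estimate)
print("Exact marginal      :", exact)
```

Here independent samples from the prior are cheap, so the estimate is accurate; the caveats in the list above apply when the samples are correlated or the prior cannot be sampled directly.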

(To be written)

##### 3. Variational autoencoder

Equation 19 states $\mathcal{L}(q,\theta) = \log P(X|\theta) - KL\big[ q(Z|\theta) || P(Z|X,\theta) \big]$, where we had taken the help of a distribution $q$. We realized that the E step finds a $q$ that minimizes the KL divergence part given an initial estimate of $\theta$, $\theta^0$. When we are lucky we can set $q(Z|\theta^0)$ to the true posterior $P(Z|X,\theta^0)$. In that case we are certain that we can fully evaluate $P(Z|X,\theta^0)$ by getting a solution to equation 22 (doing the integration etc.).

When we are not so lucky (which is most of the cases), we parameterize $q$ and try to find $q$ accordingly. Before getting into that, let's first realize what we actually did when we used equation 23,

Therefore $\mathcal{L}$ can be treated as a key component in this entire exercise.

Now, when $P(Z|X,\theta^0)$ is intractable, we cannot set $q(Z|\theta^0)$ to $P(Z|X,\theta^0)$. Instead we define another parameterized function $q_\phi(Z|X)$. (Notice that we removed $\theta^0$ here: we assume this function $q_\phi$ does not depend on the model parameters, but rather on another set of parameters, all collected in $\phi$.) Our hope is that $q_\phi(Z|X)$ will be close to $P(Z|X,\theta^0)$.

Coming back to our discussion about not having a good $q$, one can think about parameterizing $q$ itself as a function of some variables and try to improve those parameters. In other words, it would be great to have a $q_{\phi}$ so flexible that it can be moulded in any way we want. Neural networks are one such family of functions. If we let $\phi$ encode the parameters of the network (all the weights and nonlinear activation parameters), then instead of $q$ we can call such a neural network $q_\phi$.

In fact, you can think of $q_{\phi}(Z|X)$ as the output of the encoder in an auto-encoder.

In the light of this new notation we can rewrite equation 19 as

$$\mathcal{L}(\phi,\theta) = \log P(X|\theta) - KL\big[ q_\phi(Z|X) || P(Z|X,\theta) \big]$$

We would simplify the above equation a bit by assuming that we would first like to find the best $\phi$ for one data point $\mathbf{x}^i$. We would also assume that $\mathbf{x}^i$ depends on a subset of hidden variables $\mathbf{z}^i$. With these assumptions, we can write the likelihood for one data point $\mathbf{x}^i$,

$\log P(\mathbf{x}^i|\theta)$ does not contain anything to do with $q_\phi$. We can rewrite $\log P(\mathbf{x}^i|\theta) = \int_z \log P(\mathbf{x}^i|\theta) q_\phi(\mathbf{z}^i|\mathbf{x}^i) d\mathbf{z}^i$, since $q_\phi(\mathbf{z}^i|\mathbf{x}^i)$ is a probability distribution and therefore $\int_z q_\phi(\mathbf{z}^i|\mathbf{x}^i) d\mathbf{z}^i = 1$. We used a similar trick while absorbing $\log P(X|\theta)$ inside the expectation in equation 12.

We can massage the expression $P(\mathbf{x}^i|\theta) P(\mathbf{z}^i|\mathbf{x}^i,\theta)$,

The point of this mathematical juggling would be clear in a moment,

Our new variational lower bound, or Evidence Lower Bound (ELBO), is,

We are again in trouble, since the first part is also an expectation, but since $q_\phi(\mathbf{z}^i|\mathbf{x}^i)$ is something that we can choose, we can take enough samples and approximate the expression.
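For reference, the chain of steps described in this subsection can be written out for one data point as below. This is a reconstruction from the surrounding text, not a copy of the original equations: it assumes a prior $P(\mathbf{z}^i|\theta)$ over the latent variables and uses the product rule $P(\mathbf{x}^i|\theta)\, P(\mathbf{z}^i|\mathbf{x}^i,\theta) = P(\mathbf{x}^i|\mathbf{z}^i,\theta)\, P(\mathbf{z}^i|\theta)$.

$$
\begin{aligned}
\mathcal{L}(\phi,\theta;\mathbf{x}^i)
&= \log P(\mathbf{x}^i|\theta) - KL\big[ q_\phi(\mathbf{z}^i|\mathbf{x}^i) || P(\mathbf{z}^i|\mathbf{x}^i,\theta) \big] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}^i|\mathbf{x}^i)} \Big[ \log \frac{P(\mathbf{x}^i|\theta)\, P(\mathbf{z}^i|\mathbf{x}^i,\theta)}{q_\phi(\mathbf{z}^i|\mathbf{x}^i)} \Big] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}^i|\mathbf{x}^i)} \Big[ \log \frac{P(\mathbf{x}^i|\mathbf{z}^i,\theta)\, P(\mathbf{z}^i|\theta)}{q_\phi(\mathbf{z}^i|\mathbf{x}^i)} \Big] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}^i|\mathbf{x}^i)} \big[ \log P(\mathbf{x}^i|\mathbf{z}^i,\theta) \big] - KL\big[ q_\phi(\mathbf{z}^i|\mathbf{x}^i) || P(\mathbf{z}^i|\theta) \big]
\end{aligned}
$$

The first term is the expectation that has to be approximated by sampling; the second term compares $q_\phi$ with the prior and is often available in closed form when both are simple distributions.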

We need two different outcomes. Firstly, we want to optimize the ELBO, $\mathcal{L}$, and get the best values for $\theta$ and $\phi$. Secondly, we need to evaluate $\mathcal{L}$.

##### - How do we optimize the ELBO?
• E-like step (finding optimal $\phi$ given $\theta^0$)

We follow something similar to EM, but a bit more complex. In the E-like step we find a $\phi$ that maximizes $\mathcal{L}$ (given an initial value $\theta^0$).

We differentiate $\mathcal{L}(\phi,\theta^0)$ w.r.t $\phi$,

To understand how we optimize the ELBO, we first have to understand how such an expression $\mathcal{L}$ is formed in practice, because we are not going to differentiate the exact form of equation 39 directly. This optimization scheme is very much tied to the actual generation process. To briefly understand that, let's review the flow of inference.

Given the data $\mathbf{x}^i$, a trainable function parameterized by $\phi$ (such as an encoder network) is used to generate a set of hidden variables (often called latent variables) $\mathbf{z}^i$, using another easy distribution $q$ (such as a normal distribution). Given this super artificial distribution $q_\phi(\mathbf{z}^i|\mathbf{x}^i)$, we evaluate $\mathbb{E}_{q_\phi(\mathbf{z}^i|\mathbf{x}^i)}\log P(\mathbf{x}^i|\mathbf{z}^i,\theta)$. Now again, this expectation could be complex, so we resort to sampling.

So the approximation process could be drawing a sample $\mathbf{z}_d^i \sim q_\phi (\mathbf{z}^i|\mathbf{x}^i)$ and then evaluating $\log P(\mathbf{x}^i|\mathbf{z}_d^i,\theta)$, subsequently,

Summing them up and averaging over $D$ samples would give us the approximation $\mathbb{E}_{q_\phi(\mathbf{z}^i|\mathbf{x}^i)}\log P(\mathbf{x}^i|\mathbf{z}^i,\theta) \approx \frac{1}{D} \sum_d \log P(\mathbf{x}^i|\mathbf{z}_d^i,\theta)$.

The last part of equation 40 is another neural network, which can be termed a decoder with a set of parameters $\theta$ (this part is very similar to a normal hierarchical Bayesian model). With this sampling process we redefine the ELBO, replacing the expectation in its first term by the average over the $D$ samples.
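A sampled ELBO of this kind can be sketched without any neural network by picking a toy model in which everything is Gaussian. The assumptions below are made for illustration only: a prior $\mathbf{z} \sim \mathcal{N}(0,1)$, a "decoder" $P(x|z) = \mathcal{N}(z,1)$, and a variational distribution $q = \mathcal{N}(m, s^2)$ whose parameters `m` and `s` stand in for the output of an encoder. The function `elbo_estimate` is a hypothetical helper.

```python
import numpy as np

def log_normal(x, mu, var):
    # Log-density of N(mu, var) evaluated at x
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def elbo_estimate(x, m, s, D, rng):
    z = rng.normal(m, s, size=D)                     # z_d ~ q(z | x) = N(m, s^2)
    recon = log_normal(x, z, 1.0).mean()             # (1/D) sum_d log P(x | z_d)
    kl = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))    # KL[N(m, s^2) || N(0, 1)]
    return recon - kl

rng = np.random.default_rng(0)
x = 1.0

# In this toy model the evidence and the posterior are known exactly:
# X ~ N(0, 2) and Z | X = x ~ N(x / 2, 1 / 2).
log_evidence = log_normal(x, 0.0, 2.0)
elbo_post = elbo_estimate(x, 0.5, np.sqrt(0.5), 100_000, rng)  # q = posterior
elbo_bad = elbo_estimate(x, -1.0, 1.0, 100_000, rng)           # a poor q

print("log evidence      :", log_evidence)
print("ELBO, q = posterior:", elbo_post)
print("ELBO, poor q       :", elbo_bad)
```

When `q` matches the posterior the sampled ELBO sits at the log evidence up to Monte Carlo noise, and a badly chosen `q` gives a visibly lower value.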

Now we can think about differentiating the above expression. If we want to differentiate the first part of 48, we face a problem: $\nabla_\phi \Big[\frac{1}{D} \sum_d \log P(\mathbf{x}^i|\mathbf{z}_d^i,\theta)\Big]$ is hard to evaluate, because the samples inside the summation depend on $\phi$. Specifically, $\mathbf{z}_d^i \sim q_\phi (\mathbf{z}^i|\mathbf{x}^i)$ is drawn from a distribution that itself depends on $\phi$.

This is a classical problem, and it does not have anything to do with sampling in particular,

In short, this problem is formulated as follows,

Here the distribution with respect to which the expectation is taken depends on a variable $h$, and we are differentiating w.r.t. that variable. One way to solve this problem is the so-called log derivative trick, which by now might be familiar: we multiply and divide the expression inside the integral by $P(y(h))$.
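Written out, the trick looks as follows. The symbols are assumptions introduced for this sketch: $P_h(y)$ denotes a distribution over $y$ whose parameters depend on $h$, and $f(y)$ is any function that does not depend on $h$.

$$
\begin{aligned}
\nabla_h\, \mathbb{E}_{P_h(y)} [f(y)]
&= \nabla_h \int f(y)\, P_h(y)\, dy \\
&= \int f(y)\, \nabla_h P_h(y)\, dy \\
&= \int f(y)\, P_h(y)\, \frac{\nabla_h P_h(y)}{P_h(y)}\, dy \\
&= \mathbb{E}_{P_h(y)} \big[ f(y)\, \nabla_h \log P_h(y) \big]
\end{aligned}
$$

The gradient is now an expectation under $P_h$ again, so it can be estimated by drawing samples from $P_h$ and averaging $f(y)\, \nabla_h \log P_h(y)$.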

Effectively this is a good solution if we go back to 42

In practice, we use sampling to evaluate

Although the expression might look simple, it leads to a very poor, high-variance estimate. Why?

Let's think about a real-world scenario where we start with some random $\theta^0$, and $\log P(\mathbf{x}^i|\mathbf{z}^i,\theta^0)$ denotes the log-probability of generating an image from some random initialization. This probability can be a very low number (the log-likelihood of an image can be as low as $\log_{10} 10^{-10^6} = -10^6$). So while sampling we choose $\mathbf{z}_d^i \sim q_\phi (\mathbf{z}^i|\mathbf{x}^i)$, take the derivative of $\log q_\phi$ at that sample with respect to $\phi$, which could be negative or positive, and multiply it by a highly negative number, obtaining a number of very large magnitude. As a result the estimate fluctuates wildly between samples, and the sampling mechanisms (such as MCMC) will take a long time to converge.

Another way of solving this problem is a change of variables, or reparameterization. As opposed to the log derivative trick, we use a change of variables in the calculation of the expectation. This method is widely used in calculus,
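The effect of a large negative multiplier can be seen in a one-dimensional experiment. The sketch below is a hypothetical setup, not the image model from the text: samples come from $\mathcal{N}(\mu, 1)$, the gradient is taken with respect to $\mu$, and two functions are compared that differ only by a constant offset of $-1000$, so their true gradients are identical (equal to $6$ at $\mu = 0$).

```python
import numpy as np

def score_function_grads(f, mu, n, rng):
    # Per-sample log derivative estimates of d/dmu E_{z ~ N(mu, 1)}[f(z)]
    z = rng.normal(mu, 1.0, size=n)
    score = z - mu            # d/dmu log N(z; mu, 1)
    return f(z) * score

rng = np.random.default_rng(0)
mu = 0.0

f_small = lambda z: -(z - 3.0) ** 2             # moderate values
f_large = lambda z: -(z - 3.0) ** 2 - 1000.0    # same gradient, huge negative offset

g_small = score_function_grads(f_small, mu, 100_000, rng)
g_large = score_function_grads(f_large, mu, 100_000, rng)

print("true gradient           :", -2.0 * (mu - 3.0))
print("mean / std, small values:", g_small.mean(), g_small.std())
print("mean / std, large values:", g_large.mean(), g_large.std())
```

Both estimators target the same gradient, but the per-sample spread of the second one is far larger, so many more samples are needed for a usable estimate.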

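If $P(g(x))g'(x)$ can be represented as a probability distribution $P'(x)$, then an expectation over $y = g(x)$ under $P$ can be computed as an expectation over $x$ under $P'$: $\mathbb{E}_{P(y)}[f(y)] = \mathbb{E}_{P'(x)}[f(g(x))]$.
For a normal distribution, sampling $z \sim \mathcal{N}(\mu,\sigma^2)$ is the same as first sampling $\epsilon \sim \mathcal{N}(0,1)$, then multiplying by $\sigma$ and shifting by $\mu$, i.e. $z = \mu + \sigma\epsilon$.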
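A small numerical sketch of this reparameterization is given below. The test function $f(z) = -(z-3)^2 - 1000$ and the values of `mu` and `sigma` are assumptions picked for illustration, and the derivative of $f$ is written out by hand rather than computed by automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 100_000

# Reparameterization: the randomness lives in eps, which does not depend
# on mu or sigma, and z is a deterministic function of (mu, sigma, eps).
eps = rng.normal(0.0, 1.0, size=n)
z = mu + sigma * eps                  # distributed as N(mu, sigma^2)

# f(z) = -(z - 3)^2 - 1000, so f'(z) = -2 (z - 3); the offset drops out.
grad_mu = -2.0 * (z - 3.0)            # d f(mu + sigma * eps) / d mu
grad_sigma = -2.0 * (z - 3.0) * eps   # d f(mu + sigma * eps) / d sigma

print("sample mean / std of z  :", z.mean(), z.std())
print("grad wrt mu, mean / std :", grad_mu.mean(), grad_mu.std())
print("grad wrt sigma, mean    :", grad_sigma.mean())
```

For this function the exact gradients are $-2(\mu - 3) = 6$ with respect to $\mu$ and $-2\sigma = -2$ with respect to $\sigma$. The per-sample spread of `grad_mu` does not depend on the constant offset in $f$, unlike the log derivative estimate.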