## MLE versus Bayesian Approach

Let’s say I toss a coin 10 times and observe 8 heads and 2 tails. Based on this, if I say that the probability of getting head (let’s represent it by $\theta$) in the 11th toss is 0.8 or 80%, most will agree with me. However, often unknown, this is only one of the many possible estimates for $\theta$. The above estimate is based on an approach known as Maximum Likelihood Estimate (MLE).

The intuition behind the MLE approach is to find one such $\theta$ that maximizes the probability of getting the above observed dataset ($D$) i.e. we want to maximize $P(D|\theta)$. Since the observed dataset is independent and identically distributed (iid), we can write

$P(D|\theta) = \prod_{i=1}^NP(x_i|\theta)$

For practical purpose (to avoid numerical overflow) rather than maximizing the above equation we take the log of the above equation and maximize it (Since log is monotonically increasing function, it doesn’t change our max estimate).

$L = log\left ( (P(D|\theta) \right ) = \sum_{i=1}^Nlog(P(x_i|\theta)) = n_Hlog(\theta) + (N-n_H)log(1-\theta)$

Above $n_H$ represents number of heads. Now to maximize above equation we take derivative with represent to $\theta$ and equate it to zero. This will give us maximum likelihood estimate for $\theta$.

$\frac{\partial L}{\partial x} = \frac{n_H}{\theta} - \frac{N-n_H}{1-\theta} = 0$
$\therefore \theta = n_H/N$

Note that the above solution is obtained by maximizing the probability of getting the observed dataset i.e. $P(D|\theta)$. However few scholars argue that in reality what we want to maximize is the probability of getting $\theta$ given the observed dataset i.e $P(\theta|D)$. Using Bayesian approach this can be expanded as

$P(\theta|D) = \frac{P(D|\theta) \times P(\theta)}{P(D)} ~~~~~~ (1)$

One advantage of the above approach is that it allows us to include prior knowledge. For instance from past experiences we all know that the probability of getting head or tail of a fair coin is 0.5. But if a coin is tossed 10 times, rarely we get exactly 5 heads and 5 tails. That means we have some kind of distribution for $\theta$ that has a mean around 0.5 and a small variance. Bayesian approach allows us to mathematically incorporate this belief. However the downside of the Bayesian approach is that it quickly becomes lot more difficult to find the closed form solution. Further the challenges are different when trying to solve the above equation analytically and practically.

Theoretically the challenge is computing probability of evidence i.e. $P(D)$. $P(D)$ can be represented as $\int_{\theta}P(D|\theta)P(\theta)d\theta$. However solving this integration can be impossible unless we assume conjugate prior. This will become clearer after we solve the equation 1. For now, as shown below, let’s assume that we represent our prior belief about $\theta$ using beta distribution. The two parameters ($\alpha, \beta$) of the Beta distribution allows us to model our belief that a fair coin has equal chances of getting head or tail.

$Beta(\alpha, \beta) = \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$

By representing our prior using Beta distribution we can rewrite evidence as

$P(D) = \int_{0}^{1}\left ( \prod_{i=1}^NP(x_i|\theta) \right ) \times \left( \frac{1}{B(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} \right ) d\theta$
$P(D) = \int_{0}^{1}\left ( \theta^{n_H}\times (1-\theta)^{N-n_H} \right ) \times \left( \frac{1}{B(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} \right ) d\theta$
$P(D) = \frac{1}{Beta(\alpha, \beta)}\int_{0}^{1} \left ( \theta^{n_H+\alpha-1}\times (1-\theta)^{N-n_H+\beta-1} \right) d\theta$

Solving the above integration turns out to be simple and results into a constant value that can be represented as $Z$. Thus we can rewrite eqn 1. as

$P(\theta|D) = \frac{1}{Z}\times\left(\theta^{n_H}(1-\theta)^{N-n_H}\right)\times\left(\frac{1}{Beta(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\right)$
$P(\theta|D) = \frac{1}{ZBeta(\alpha, \beta)}\times\left(\theta^{n_H+\alpha-1}(1-\theta)^{N-n_H+\beta-1}\right) ~~~~~~ (2)$

Note that the above posterior $P(\theta|D)$ has the same form as that of prior distribution (i.e. beta distribution) but with different parameters. For posterior we got $\alpha = n_H + \alpha$ and $\beta = N - n_H + \beta$. For the given likelihood distribution (here bernoulli) if the choice of the prior distribution (here beta) results in the same form of distribution for the posterior then the choice of prior distribution is known as conjugate prior. For bernoulli distribution, the conjugate prior is given as beta distribution. For gaussian distribution, the conjugate prior is gaussian distribution itself. For the beta distribution, the expected value of parameter is given as

$E\left[Beta(x|\alpha, \beta)\right] = \frac{\alpha}{\alpha + \beta} ~~~~~~ (3)$

Based on equation 2 and 3, we can calculate expected value of $\theta$ as
$\theta = \frac{n_H + \alpha}{\alpha + \beta + N} ~~~~ (4)$

Let’s assume $\alpha = \beta = 2$, then $\theta = 0.71$. Thus, while as per MLE $\theta = 0.80$, as per Bayesian estimator $\theta = 0.71$. In the case of Bayesian estimation, there are two different forces that are trying to pull $\theta$ in different directions. From eqn 4 we notice that the bigger our hyper parameters ($\alpha, \beta$), i.e. the stronger we belief in the prior, the stronger is the pull towards our prior belief. On the other hand the bigger is the sample size N, the stronger is the pull towards the estimate based on the dataset. If N is sufficiently large then bayesian estimate and MLE will converge. Another difference to note is that MLE gives a point estimate (i.e it returns one value), where as Bayesian estimate returns a distribution. For instance above we got a beta distribution with certain alpha and beta parameter.

From the practical perspective, computing probability of evidence $P(D)$ is not an issue. It is merely a normalizing constant. From a practical perspective expected value for $P(\theta|D)$ can be represented as

$E\left[P(\theta|D)\right] = \int P(D|\theta)P(\theta)d\theta$

As per Monte Carlo method, above integral can be approximated by taking large number of samples from probability distribution of $P(\theta)$ and taking the average value for the above integral at these sample points i.e.

$E\left[P(\theta|D)\right] = \frac{1}{n} \sum_{i=1}^nP(D|\theta_i)P(\theta_i)$

However the challenge is how to sample $\theta$. Random $\theta$ as drawn from the prior distribution $P(\theta)$ might not be good representative of the sample space represented by $P(\theta_i|D)$. To overcome this challenge there are many different sampling techniques. Broadly these sampling techniques are referred as Importance Sampling. Below is a python example demonstrating one of the simplest importance sampling technique known as Metropolis Sampling technique. I leave the discussion for importance sampling for some other time but for now below is the python code to compute $\theta$ for the given problem.

Below I am assume that the prior follows a beta distribution with $\alpha = \beta = 2$. Thus the integral that we need to solve is

$P(D|\theta_i) \times P(\theta_i) = \left(\theta_i^{8}(1-\theta)^{2}\right) \times \left( \frac{1}{Beta(2,2})\theta_i(1-\theta_i)\right)$

Any constant can be dropped when using Monte Carlo approach. Hence dropping Beta(2,2) from the above equation, the integral that we need to solve from the Monte Carlo approach perspective is

$P(D|\theta_i) \approx \theta_i^9(1-\theta_i)^3$

import random
import scipy.stats
import scipy.special
import matplotlib.pyplot as plt

def integral(theta):
""" This is likelihood X prior after removing any constant terms. """
return theta**9*(1-theta)**3

n = 100  # Number of points to sample
samples = [random.random()] # Random starting point

for i in range(n):
#Grab last sample point
theta = samples[-1]

#Create new sample by randomly selecting a point from a normal distribution
#of mean = 0 and sd = 0.1. if the new sample is outside of the
#domain then ignore it and use existing sample
newTheta = theta + random.normalvariate(0, 0.1)
if newTheta < 0 or newTheta > 1:
newTheta = theta

#If the probability of new sample as compared to last sample is less than uniform distribution
#then ignore it.
acceptanceRatio = integral(newTheta)/integral(theta)
if acceptanceRatio > random.random(): # accept only if going uphill
samples.append(newTheta)
else:
samples.append(theta)

print "Estimate: ", sum(samples)/n

# Plot how sample theta varies with each iteration
ylab = [i for i in xrange(len(samples))]
pylab.plot(samples, ylab)
pylab.title('Random Walk Visualization')
pylab.xlabel('Theta Value')
pylab.ylabel('Time')
pylab.show()


Running the above code gave mean $\theta$ as 0.64. The plot shows sampled $\theta$ at different iteration number.

Note that the above code only sampled 100 points. Typically you sample much larger number of points and also ignore few initial points, known as burn-in period, so as to remove influence of starting point from the estimate. If we simply set $n = 1000$ then the estimate for $\theta$ is 0.71. This is same as we got analytically by applying the bayesian form.

References and Notes
One of the best resources I found on this topic is a series of tutorials by Dr. Avinash KaK. The python code is based on the Hilbert’s blog on Monte Carlo Markov Chains.

Apart from MLE and Bayesian approach, another frequently used approach is Maximum A Posterior (MAP). Similar to MLE, MAP gives a point estimate but also allows to incorporate prior beliefs.

1. Hi Ritesh, the main reason to take the logarithm of the likelihood function isn’t to avoid numerical overflow, but rather to simplify the expression. Since many likelihood functions contain exponentials (e.g. everything in the exponential family) the log simply undoes this exponential and makes the math easier. Also as per your section about Monte Carlo sampling, I’m nearly positive that, for your estimate of the expected value of the posterior, you wouldn’t want to sum $P(\theta_i)$ here. I didn’t read past this. Best, Chris