# Parameter Estimation – MLE, MAP and Conjugate Prior

Machine learning papers often mention the term “conjugate prior”. Here is an attempt to provide a practical understanding of the term. What I have realized, however, is that you can’t explain conjugate priors without first talking about parameter estimation.

Parameter estimation is concerned with estimating the parameters of the distribution that underlies a dataset. For instance, suppose we toss a coin (we don’t know whether it is biased or not) 10 times and get the following output: $X = \{H, H, H, T, H, T, H, T, H, H\}$. Now let’s say I ask you for the probability of getting a head on the 11th toss. One way to answer is to estimate the probability of getting a head (i.e. $\theta$) from the sample data above. This process of estimating a distribution parameter is called parameter estimation. Once we know the value of $\theta$, we can use it to predict the probability of a head (or a tail) on the 11th toss.

There are, however, two different approaches to parameter estimation: MLE and MAP. Conjugate priors are concerned with the second approach (MAP). I would still suggest going through MLE first, as it provides some of the foundation for MAP.

## Maximum Likelihood Estimation (MLE)

To understand the idea behind MLE, assume that $\theta$ can take any value between 0 and 1 (such as 0.1, 0.2, 0.3, …). For each value of $\theta$, we can calculate the probability of getting the above dataset $X$, i.e. $P(X|\theta)$. Assuming the tosses are independent,

$P(X|\theta) = \prod_{x \in X}{P(x|\theta)} = \prod_{x \in X}{\theta^c~(1-\theta)^{1-c}}$

where $c = 1$ if the toss $x$ is a head and $c = 0$ if it is a tail.

Suppose we could calculate $P(X|\theta)$ for every possible value of $\theta$ (which is actually impossible, since there are infinitely many). The question remains: which value of $\theta$ should we select? A natural choice is the $\theta$ that gives the highest probability for the dataset, i.e.

$\theta_{MLE} = \underset{\theta}{\arg\max}~P(X|\theta) = \underset{\theta}{\arg\max}{\prod_{x \in X}{\theta^c~(1-\theta)^{1-c}}}$
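We cannot try every value of $\theta$, but trying a finite grid of candidates gives a feel for what the argmax is doing. The following is a minimal sketch, not part of the original derivation: the encoding of heads as 1 and tails as 0, the function name `likelihood`, and the grid spacing of 0.01 are all assumptions made for illustration.

```python
# Observed tosses from the example: 1 = head, 0 = tail.
tosses = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

def likelihood(theta, data):
    """P(X | theta) for independent Bernoulli tosses."""
    result = 1.0
    for c in data:
        result *= theta ** c * (1 - theta) ** (1 - c)
    return result

# Hypothetical grid of candidate values: 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# Pick the candidate under which the observed data is most probable.
theta_best = max(grid, key=lambda t: likelihood(t, tosses))
print(theta_best)
```

A finer grid only refines the answer to the resolution of the grid, which is why a closed-form solution is preferable when one exists.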

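Multiplying many probabilities, each smaller than 1, quickly produces numbers too small to represent (numerical underflow). To avoid this, we maximize the sum of log probabilities instead of the product. Since the logarithm is monotonically increasing, the maximizing $\theta$ is unchanged:

$\theta_{MLE} = \underset{\theta}{\arg\max} \sum_{x \in X}{\log(\theta^c~(1-\theta)^{1-c})} = \underset{\theta}{\arg\max}{(n^1 \log~\theta + n^0 \log~(1-\theta))}$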
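To see why the log matters numerically, here is a small sketch that compares the raw product with the log sum. The dataset of 10,000 tosses is hypothetical (the same 7:3 ratio as the example, scaled up), chosen only because ten tosses are too few to show the problem.

```python
import math

theta = 0.7
# Hypothetical larger dataset with the same 7:3 ratio of heads to tails.
tosses = [1] * 7000 + [0] * 3000

product = 1.0
log_sum = 0.0
for c in tosses:
    p = theta if c == 1 else 1 - theta
    product *= p             # shrinks toward zero with every toss
    log_sum += math.log(p)   # stays an ordinary negative number

print(product)   # 0.0
print(log_sum)
```

The product collapses to exactly 0.0 in double precision, so every candidate $\theta$ would look equally bad, while the log sum remains finite and comparable across candidates.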

where $n^1 (=7)$ is the number of heads and $n^0 (=3)$ is the number of tails. To maximize this objective function, we take its derivative with respect to $\theta$ and set it to zero. That is

$\frac{\partial{(n^1 \log~\theta + n^0 \log~(1-\theta))}}{\partial\theta} = \frac{n^1}{\theta} - \frac{n^0}{1-\theta} = 0$

Solving the above equation, we get $\theta = \frac{n^1}{n^1 + n^0}$. This is our maximum likelihood estimate ($\theta_{MLE}$) of the probability of getting a head. Notice that it is the same as the sample mean, i.e. the fraction of tosses that came up heads. Thus, our MLE for the probability of getting a head ($\theta$) is $\frac{7}{10} = 0.7$.
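The closed-form estimate is easy to check in code. This is a minimal sketch under the assumption that the tosses are stored as a string of `H` and `T` characters. The helper `log_likelihood_gradient` is a hypothetical name for the derivative of the log-likelihood, used only to confirm that it vanishes at the estimate.

```python
tosses = "HHHTHTHTHH"
n_heads = tosses.count("H")   # n^1
n_tails = tosses.count("T")   # n^0

# Closed-form maximum likelihood estimate.
theta_mle = n_heads / (n_heads + n_tails)
print(theta_mle)  # 0.7

def log_likelihood_gradient(theta):
    """Derivative of n^1 log(theta) + n^0 log(1 - theta) with respect to theta."""
    return n_heads / theta - n_tails / (1 - theta)

# At the estimate the gradient is zero up to floating-point rounding.
print(abs(log_likelihood_gradient(theta_mle)) < 1e-9)  # True
```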

## Maximum A Posteriori Probability (MAP)

In the case of MLE, we maximized $P(X|\theta)$ to estimate $\theta$. In the case of MAP, we maximize $P(\theta|X)$ instead. An advantage of MAP is that, by modeling $P(\theta|X)$, we can apply Bayes’ theorem and let our prior belief influence the estimate of $\theta$.

Based on Bayes theorem, we can rewrite $P(\theta|X)$ as

$P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}$

$P(\theta)$ on the RHS represents our prior belief about $\theta$. For instance, based on our experience with other coins, we expect $\theta$ to be around 0.5. However, this is just a belief and we are not 100% confident in it. To model our belief about $\theta$ we need another distribution. Theoretically, we are free to select any distribution, but an arbitrary choice makes the calculation very difficult. It turns out that if we select a distribution for $P(\theta)$ such that the posterior $P(\theta|X)$ has the same form as $P(\theta)$, the computation becomes much easier. Such a distribution is known as a conjugate prior for the likelihood $P(X|\theta)$. For the Bernoulli distribution, the conjugate prior is the Beta distribution, which takes two parameters ($\alpha$ and $\beta$) and has the probability density function:

$P(\theta) = \frac{1}{B(\alpha, \beta)}{\theta^{\alpha-1}}{(1-\theta)^{\beta-1}}$

where $B(\alpha, \beta)$ represents beta function.

For different values of $\alpha$ and $\beta$ we get different shapes of the Beta distribution. The beta function is the normalizing constant $B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$, and for $\alpha, \beta > 1$ the density peaks at $\theta = \frac{\alpha-1}{\alpha+\beta-2}$.

Thus, to model our belief that $\theta$ is around 0.5, we need values of $\alpha$ and $\beta$ for which the density peaks at $\theta = 0.5$. Any $\alpha = \beta > 1$ does this, and larger values express a more confident belief. Here I simply pick $\alpha=\beta=5$. With this prior, we can rewrite Bayes’ theorem as
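To get a feel for how $\alpha$ and $\beta$ shape the prior, the sketch below evaluates the Beta density directly from its definition. It relies on two standard facts: $B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$, and that for $\alpha, \beta > 1$ the density peaks at $\frac{\alpha-1}{\alpha+\beta-2}$. The function names `beta_pdf` and `beta_mode` and the particular parameter values tried are assumptions made for illustration.

```python
import math

def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, with B = Γ(α)Γ(β)/Γ(α+β)."""
    b = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / b

def beta_mode(alpha, beta):
    """Location of the peak of the density, valid for alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)

# Symmetric priors all peak at 0.5; larger parameters make the peak taller,
# i.e. they encode a more confident belief that the coin is fair.
for a in (2, 5, 50):
    print(a, beta_mode(a, a), round(beta_pdf(0.5, a, a), 3))
```

Under these assumptions, $\alpha=\beta=5$ is one reasonable middle ground: it peaks at 0.5 without being so concentrated that ten tosses could not move the estimate.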

$P(\theta|X) = \frac{P(X|\theta)\frac{1}{B(\alpha, \beta)}{\theta^{\alpha-1}}{(1-\theta)^{\beta-1}}}{P(X)}$

We can simplify the above equation in two ways. First, we can drop the denominator $P(X)$: it only normalizes the posterior and does not depend on $\theta$, so it does not affect which $\theta$ gives the maximum. Second, as in MLE, we take the log and work with a sum instead of a product to avoid numerical underflow. Up to the additive constant $-\log P(X)$, which we omit, this gives

$\log P(\theta|X) = \log P(X|\theta) + \log \frac{1}{B(\alpha, \beta)} + \log \theta^{\alpha-1} + \log (1-\theta)^{\beta-1}$

$\log P(\theta|X) = \sum_{x \in X}{\log P(x|\theta)} + \log \frac{1}{B(\alpha, \beta)} + \log \theta^{\alpha-1} + \log (1-\theta)^{\beta-1}$

$\log P(\theta|X) = n^1 \log \theta + n^0 \log (1-\theta) + \log \frac{1}{B(\alpha, \beta)} + \log \theta^{\alpha-1} + \log (1-\theta)^{\beta-1}$

As in MLE, now we take the partial derivative of the above equation with respect to $\theta$ and equate it to zero. This will give us the value of $\theta$ that maximizes $P(\theta|X)$. Thus,

$\frac{\partial(n^1 log~\theta + n^0 log~(1-\theta) + log \frac{1}{B(\alpha, \beta)} + log \theta^{\alpha-1} + log (1-\theta)^{\beta-1})}{\partial\theta}=0$

$\frac{n^1}{\theta}-\frac{n^0}{1-\theta}+\frac{\alpha-1}{\theta}-\frac{\beta-1}{1-\theta} = 0$
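The algebra from this equation to the closed form may be easier to follow written out. Grouping the terms over each denominator and then multiplying through by $\theta(1-\theta)$, assuming $0 < \theta < 1$, gives:

$\frac{n^1+\alpha-1}{\theta} = \frac{n^0+\beta-1}{1-\theta}$

$(n^1+\alpha-1)(1-\theta) = (n^0+\beta-1)\theta$

$n^1+\alpha-1 = (n^1+n^0+\alpha+\beta-2)\theta$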

$\therefore \theta_{MAP} = \frac{n^1+\alpha-1}{n^1+n^0+\alpha+\beta-2}$

Since $n^1 = 7$, $n^0=3$ and $\alpha=\beta=5$, we get $\theta_{MAP} = \frac{11}{18} \approx 0.61$. This lies between our prior belief (0.5) and the MLE (0.7).
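The arithmetic can be reproduced with a few lines of code. This is a minimal sketch with `map_estimate` as a hypothetical helper name. The second call uses $\alpha=\beta=1$, for which the Beta density is flat (uniform) on $[0, 1]$, as a sanity check: with no preference in the prior, the formula reduces to the MLE.

```python
n_heads, n_tails = 7, 3

def map_estimate(n1, n0, alpha, beta):
    """Closed-form MAP estimate for a Bernoulli likelihood with a Beta(alpha, beta) prior."""
    return (n1 + alpha - 1) / (n1 + n0 + alpha + beta - 2)

# Prior from the text: alpha = beta = 5.
theta_map = map_estimate(n_heads, n_tails, alpha=5, beta=5)
print(round(theta_map, 3))  # 0.611

# Flat prior (alpha = beta = 1): the prior terms vanish and we recover the MLE.
theta_flat = map_estimate(n_heads, n_tails, alpha=1, beta=1)
print(theta_flat)  # 0.7
```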

## Summary

1. MLE maximizes $P(X|\theta)$ whereas MAP maximizes $P(\theta|X)$
2. MAP allows us to model our prior belief about the parameter. As a result, MAP estimates are pulled towards that prior belief.
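How strongly the prior pulls depends on how much data there is. The sketch below keeps the prior fixed at $\alpha=\beta=5$ and scales up the number of tosses while holding the 7:3 ratio of heads to tails. The larger datasets are hypothetical and the function names are chosen for illustration.

```python
def mle(n1, n0):
    """Maximum likelihood estimate: fraction of heads."""
    return n1 / (n1 + n0)

def map_estimate(n1, n0, alpha=5, beta=5):
    """MAP estimate under a Beta(alpha, beta) prior."""
    return (n1 + alpha - 1) / (n1 + n0 + alpha + beta - 2)

# Same 7:3 ratio with 10, 100 and 1000 tosses (hypothetical data).
for scale in (1, 10, 100):
    n1, n0 = 7 * scale, 3 * scale
    print(n1 + n0, mle(n1, n0), round(map_estimate(n1, n0), 3))
```

The MLE stays at 0.7 throughout, while the MAP estimate moves from about 0.61 toward 0.7 as the data increasingly outweighs the fixed prior.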

## Reference

1. Avi Kak’s Tutorial on Parameter Estimation: I found this to be one of the best and easiest to follow. However, before finding this one I already had a good understanding of MLE and MAP, so others might not find it as intuitive.
2. Gregor Heinrich Post on parameter estimation in text analysis.
