Rejection Sampling: Intuitive Understanding

Rejection sampling is one of the many techniques to generate samples from a distribution. Below is an attempt to intuitively (and visually) explain the approach.

Basic Intuition

For simplicity, as indicated by the blue line in Fig A., let’s assume the target distribution from which you want random samples is a truncated normal distribution with -3 to 3 domain i.e X \sim N(0, 1) ~~ \forall ~~ X \in (-3, 3). Also, as indicated by dashed black line, assume a rectangular enveloping region around the target distribution that is bounded by (-3, 0) and (3, 0.4167).

TruncatedNormalMean

Fig A. Truncated Normal Distribution with Random Samples

If randomly select x and y from this enveloping region and plot these points, as shown by red and green dots, some of the them will fall inside our target distribution and some outside of the target distribution. For any random point (x, y), if it falls within the target distribution i.e. y \leq P(X=x) then we accept it as a valid sample point. For instance assume the random point is (-1, 0.15). At X=-1, the probability density for the target distribution is given as P_{target}(X=-1) = 0.24 . Since 0.15 < 0.24 , we accept (-1, 0.15) as a valid point.

Let's convert the above idea into a working python code. Next we will look how to create this enveloping region around our target distribution.

import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from scipy.stats import truncnorm

#Domain of X
xdomain = [-3, 3]

def pdf(x):
    """
    Probability distribution function for Random Variable X
    from which we want to sample points. Here we assume
    we have truncated standard normal distribution in the domain of -3 to 3
    """
    return truncnorm.pdf(x, xdomain[0], xdomain[1])

def random_point_within_enveloping_region():
    """
    """
    x = random.uniform(xdomain[0], xdomain[1])
    y = random.uniform(0, 0.4167)
    return (x,y)

#Number of sample points to sample
n = 100

#Creating two arrays to capture accepted and rejected points
accepted = []
rejected = []

#Run this loop until we got required number of valid points
while len(accepted) < n:

    #Get random point
    x, y = random_point_within_enveloping_region()

    #If y is below blue curve then accept it
    if y < pdf(x):
       accepted.append((x, y))
    #otherwise reject it.
    else:
       rejected.append((x, y))

#Plot the graph
x = np.linspace(a, b, 100)
plt.plot(x, [pdf(i) for i in x], color='blue') # Plot Random Variable X
plt.plot(x, [0.4167 for i in x], color='black', ls='dashed', lw=2) # Plot Enveloping Region
plt.plot([x[0] for x in accepted], [x[1] for x in accepted] , 'ro', color='g') # Plot Accepted Points
plt.plot([x[0] for x in rejected], [x[1] for x in rejected] , 'ro', color='r') # Plot Rejected Points
plt.show()

#Calculate expected value for the truncated standard normal distribution
approxMean = sum([x[0] for x in accepted])/len(accepted)
print "Expected Mean = ", 0, pdf(0)
print "Approximated Mean = ", approxMean, pdf(approxMean)
print "Approximated Variance = ", sum([(x[0] - approxMean)**2 for x in accepted])/(len(accepted)-1)

uniformenvelop

Expected Mean = 0 0.400022258921
Approximated Mean = -0.0896272908375 0.398418781625
Approximated Variance = 1.17167227825

Creating Enveloping Region

Above the term “enveloping region” was broadly used to indicate some distribution function. Any distribution (such as gaussian, uniform, etc) can be used as an enveloping distribution as long as following condition is meet:

P_{Envelop}(X=x) \geq P_{target}(X=x) ~~~ \forall ~ X \in x ~~~~~~~ (A) .

That is at all possible value of x in the domain of the target distribution, probability of X=x based on the envelop distribution is more than that obtained by the target distribution. For instance for the truncated normal distribution at X=0, P_{target}(X=0) = 0.4 . Thus any enveloping distribution P_{Envelop}(X=0) has to be more than 0.4 at point X = 0.

In the above example I assumed a uniform distribution ranging from -3 to 3 as the envelope distribution. For this envelope distribution, P_{Envelop}(X=x) = 1/3-(-3) = 0.167. However if we simply use this uniform distribution as it is, then we will violating the above condition A. For instance, since the envelop distribution is a uniform distribution, at X=0 P_{Envelop}(X=0) = 0.167.

To overcome this challenge, rejection sampling introduces a multiplier constant “M”. In the above example I used M = 2.5. The reason I used M = 2.5 because the max height for standard normal distribution is 0.4 at X = 0. For the given envelop distribution, P(X=0) = 0.167. Thus M = 0.167/0.4 = 2.3. Based on this multiplier condition, we can reformulate the condition for enveloping distribution as follows:

M \times P_{Envelop}(X=x) \geq P_{target}(X=x) ~~~ \forall ~ X \in x ~~~~~~~ (B) .

In practice, finding the max height of the target distribution can be challenging. Also we just want to make sure that at point condition B holds true. Hence in practice we start with random M. After each sample we make sure that the above condition holds (See line number 57-62 in the modified code below). If this is false then we increment M and restart the sampling from beginning.

import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from scipy.stats import truncnorm

#Domain of X
xdomain = [-3, 3]

#Multiplier Constant
M = 2.0

def pdf(x):
    """
    Probability distribution function for Random Variable X
    from which we want to sample points. Here we assume
    we have truncated standard normal distribution in the domain of -3 to 3
    """
    return truncnorm.pdf(x, xdomain[0], xdomain[1])

def random_point_within_enveloping_region():
    """
    Return random point within the enveloping region.
    For x we will randomly sample point between -3 and 3
    Since we are assuming uniform distribution, the height of the enveloping region
    at any x is 1/6. So for Y we randomly sample point between 0 and 1/6
    """
    #Randomly sample x from -3 to 3
    x = random.uniform(xdomain[0], xdomain[1])

    # probability of obtain any x is equal to 1/6. i.e. height of enveloping region
    # for any X is 1/6.
    y = random.uniform(0, M * 1.0/6.0 )

    return (x,y)

def height_of_enveloping_region(x):
    """Return height of enveloping region at x."""
    return M * 1.0/6.0

#Number of sample points to sample
n = 100

#Creating two arrays to capture accepted and rejected points
accepted = []
rejected = []
M = 2.0

#Run this loop until we got required number of valid points
while len(accepted) < n:

    #Get random point
    x, y = random_point_within_enveloping_region()

    #If for any x if envelping region is below the distribution from which we want to sample points
    #increment the multipler constant and resample all the points.
    if height_of_enveloping_region(x) < pdf(x):
        print "Increasing M from {0} to {1}".format(M, M+1)
        accepted = []
        rejected = []
        M += 1.0
        continue

    #If y is below blue curve then accept it
    if y < pdf(x):
        accepted.append((x, y))
    #otherwise reject it.
    else:
        rejected.append((x, y))

x = np.linspace(a, b, 100)
plt.plot(x, [pdf(i) for i in x], color='blue')
plt.plot(x, [1.0/6 for i in x], color='black', ls='dashed', lw=1)
plt.plot(x, [M * 1.0/6 for i in x], color='black', ls='dashed', lw=2)
plt.plot([x[0] for x in accepted], [x[1] for x in accepted] , 'ro', color='g')
plt.plot([x[0] for x in rejected], [x[1] for x in rejected] , 'ro', color='r')
plt.show()

#Calculate expected value for the truncated standard normal distribution
approxMean = sum([x[0] for x in accepted])/len(accepted)
print "Expected Mean = ", 0, pdf(0)
print "Approximated Mean = ", approxMean, pdf(approxMean)
print "Approximated Variance = ", sum([(x[0] - approxMean)**2 for x in accepted])/(len(accepted)-1)

FAQ

Example: Using Gaussian distribution as an enveloping distribution

See example over here

Why not to sample y ranging from 0 to P_{target}(X=x)

Since we are throwing all the points that are outside of target distribution, one might question why not directly sample y in “random_point_within_enveloping_region” function between 0 and P_{target}(X=x) (instead of P_{Envelop}(X=x)). The problem with this approach is that x is not uniformly distributed in our target distribution. If we sample points from truncated normal distribution we are likely to get lot more x = 0 then x = -3. However in “random_point_within_enveloping_region” we are using uniform distribution to sample x.

Posted in Machine Learning, Python | Tagged , , | 2 Comments