1 Introduction

In these exercises, we will explore an powerful method of sampling from a distribution \(\pi\). The algorithm we will discuss is usually referred to as the Metropolis-Hastings algorithm; Metropolis was one of the authors on the ground breaking paper and Hastings considered extensions to this work and applications to statistics.

The key feature is that again we construct a Markov chain that has \(\pi\) as its stationary distribution, and we need only know \(\pi\) up to a constant factor. The following simple exercise in Bayesian statistics may help to illustrate why this is important.

1.1 Exercise

Let \(X=(X_1, \ldots, X_n)\) be a random sample from the conditional distribution of \(X_1\) given \(\Theta = \theta\), where \(X_1|\theta \sim Bern(\theta)\) and \(\Theta \sim Unif(0,1)\). Find the posterior distribution.

1.1.1 Solution

Let \(x \in \{0,1\}^n\) and \(t= x_1 + \cdots +x_n\). Let \(\theta \in (0,1)\). Let \(r(\theta) = \mathbf{1}[\theta \in (0,1)]\) be the prior pdf for \(\Theta\) and \(f_X\) be the pdf for \(X\). We have that \[\begin{eqnarray*} s(\theta|x) &=& \frac{L(x|\theta)r(\theta)}{ f_X(x)} \\ &=& \frac{\theta^t (1-\theta)^{n-t}}{f_X(x)}. \end{eqnarray*}\] We have that \[f_X(x) = \int_0 ^1 L(x|\theta) d\theta = \int_0 ^1 \theta^t(1-\theta)^{n-t}d \theta,\] where we recognize that we don’t really need to compute this integral, since we recognize that this is the beta distribution, so that \(s(\theta|x)\) is given by the pdf of beta distribution in \(\theta\) with parameters \(\alpha = t+1\) and \(\beta = n-t+1\), where \(t= x_1 + \cdots + x_n\).

A problem arises when we do not recognize the distribution, and the integral of \(f_X\) is difficult to compute, but we still want to sample, simulate, or understand \(s(\theta|x)\).

2 The Metropolis-Hastings algorithm

Suppose that we want to sample from \(\pi(i) = \tilde{\pi}(i)/C\); in this context it is often called the target distribution, and we only know \(\tilde{\pi}\) In the Metropolis algorithm, we choose a symmetric transition matrix \(Q\), called the proposal distribution and we typically write \(Q(j|i) = Q_{ij}\); it turns out than we will many choices for \(Q\), and it is an active area of research on how to choose the best \(Q\) with respect to convergence and robustness. Hastings generalized the Metropolis algorithm to allow from non-symmetric choices for \(Q\). We consider the Markov chain which advances one step in the following way.

If we are at a state \(i\), so that \(X_n=i\), then we generate a random variable \(Y=j\) with distribution \(Q(\cdot|i)\).
Given that \(Y=j\), we move to state \(j\), so that \(X_{n+1} = j\), with probability \[ \alpha(j|i) = \min\big( \tfrac{\tilde{\pi}(j)}{\tilde{\pi}(i)} \cdot \tfrac{Q(i|j)}{Q(j|i)},1\big);\]
Otherwise, we stay in state \(i\), so that \(X_{n+1} = i\).

If we do move to state \(j\) the proposal is sometimes refered to as being accepted; we can actually have some freedom to choose \(\alpha\), and the choice above is sometimes referred to as the Metropolis choice.

In order to show that this method works, we show that \(\pi\) is indeed a stationary distribution for this Markov chain, and to do this, it is enough to verify that it is a reversible distribution with respect to the transition matrix, which is given by: \[ P(j|i) = Q(j|i)\alpha(j|i),\]

when \(i \not = j\). We need to check that \[ \pi_{i}P(j|i) = \pi_{j} P(i|j);\] notice that there is nothing to check if \(i = j\).

2.1 Exercise

Check it! Note that for \(a >0,\) we have \(\min(a,1) =a\), then \(\min(1/a, 1) =1\).

2.2 Exercise

Technically, we have only verified Metropolis-Hastings for a finite state space, but we also adapt the algorithm for the case of continuous distributions. Try the Metropolis algorithm, on the first exercise, so we are given \(\tilde{\pi}(\theta) = \theta^t (1-\theta)^{n-t}\), using by letting \(Y\) be normal with mean \(i\), where we let \(Q(j|i)\) be the pdf of a normal distribution centered at \(i\) with variance \(1\). Fix \(n=5\), and \(t=2\).

2.3 Exercise

How would you adapt the Metropolis-Hastings algorithm for the case of a bivariate continuous distribution? Try it in the case \(\tilde{\pi}(x,y)= e^{-x^2/2} e^{-y^2/2}\)—of course you know \(\pi\), and a two dimensional uniform exploration.

2.3.1 Solution

To check reversibility, for \(i \not = j\), we have

\[ \begin{eqnarray*} \pi_{i}P(j|i) &=& \pi_{i}Q(j|i)\alpha(j|i) \\ &=& \pi_{i}Q(j|i)\min\big( \tfrac{\tilde{\pi}(j)}{\tilde{\pi}(i)} \cdot \tfrac{Q(i|j)}{Q(j|i)},1\big). \end{eqnarray*} \]

If \(\min\big( \tfrac{\tilde{\pi}(j)}{\tilde{\pi}(i)} \cdot \tfrac{Q(i|j)}{Q(j|i)},1\big)=\tfrac{\tilde{\pi}(j)}{\tilde{\pi}(i)} \cdot \tfrac{Q(i|j)}{Q(j|i)}\), then

\[ \begin{eqnarray*} \pi_{i}P(j|i) &=& \pi_{i}Q(j|i) \tfrac{\tilde{\pi}(j)}{\tilde{\pi}(i)} \\ &=& \pi_j Q(j|i), \end{eqnarray*} \]

and \(\min\big( \tfrac{\tilde{\pi}(i)}{\tilde{\pi}(j)} \cdot \tfrac{Q(j|i)}{Q(i|j)},1\big)=1\), so that

\[ \begin{eqnarray*} \pi_{i}P(j|i) &=& \pi_j Q(j|i) \\ &=& \pi_j Q(j|i)\min\big( \tfrac{\tilde{\pi}(i)}{\tilde{\pi}(j)} \cdot \tfrac{Q(j|i)}{Q(i|j)},1\big) \\ &=& \pi_j P(j|i), \end{eqnarray*} \] as desired. A similar argument holds, if \(\min\big( \tfrac{\tilde{\pi}(j)}{\tilde{\pi}(i)} \cdot \tfrac{Q(i|j)}{Q(j|i)},1\big)=1\).

2.3.2 Solution R code

h <- function(x){
 z=0
  if(x >0 && x <1 ) {z=(x^2) * ( (1-x)^3)}
 z
  }


Yq <- function(x){
   y= rnorm(1,x,1)
y
}

a <- function(j,i){
m=min( c( h(j)/h(i), 1 ))  
m
}

metro <- function(n){
  x=0.5
  for(k in 1:n){
    i = x[length(x)]
    j = Yq(i)
    p = a(j,i)
    x = c(x,i)
      if ( rbinom(1,1,p)==1 ) {
      x[length(x)] <- j
            }
  }
  x
}

z = replicate(2000, metro(2000))


hist(z, prob=TRUE, breaks=50)
x= seq(0, 1, by =0.1)
curve(dbeta(x,3,4), add=TRUE)

2.3.3 Remarks

Actually, the above histogram, although does give the desired beta, is not what I intended to check. What I intended to do was to take the last entry of metro(2000) a large number of times, which would directly verify that distribution of the last entry of metro(2000) is almost the beta. However, I did get the right answer, why? We do the intended computation below.

z = replicate(2500, metro(2000)[length(metro(2000))])

hist(z, prob=TRUE, breaks=50)
x= seq(0, 1, by =0.1)
curve(dbeta(x,3,4), add=TRUE)

2.3.4 Solution (2d)

First, we look at how to do 3d histograms. (I haven’t quite figured out how to easily get probability histograms yet.)

x = rnorm(5000)
y = rnorm(5000)
library(plot3D)

## Warning: package 'plot3D' was built under R version 4.0.5

x_c <- cut(x, 20)
y_c <- cut(y, 20)

z <- table(x_c, y_c)

hist3D(z=z, border="black")

h <- function(w){
 x = w[1]
 y = w[2]
  z= exp(-x^2/2)*exp(-y^2/2)
 z
  }


Yq <- function(w){
  x = w[1]
  y = w[2]
  yn =  runif(1,-0.3, 0.3) + y
   xn = runif(1, -0.2, 0.2) +x
c(xn,yn)
}

a <- function(j,i){
m=min( c( h(j)/h(i), 1 ))  
m
}

metro <- function(n){
  x=0.5
  y=0.5
  for(k in 1:n){
    i = c(x,y)
    j = Yq(i)
    p = a(j,i)
      if (rbinom(1,1,p)==1) {
      x = j[1]
      y = j[2]
            }
  }
  c(x,y)
}

w = replicate(5000, metro(500))

x = w[1,]
y = w[2,]

x_c <- cut(x, 20)
y_c <- cut(y, 20)

z <- table(x_c, y_c)

hist3D(z=z, border="black")

3 Endnotes

Version: 06 November 2023
Rmd Source

In class exercise: Metropolis algorithm