Introduction

Made available 2 November 2020
Due 18 November 2020, 13:00 London Time.
This ICA consists of five questions.
This ICA is worth 30 percent of your grade for this module.
Most questions will require you to write and run R code.
The ICA must be completed in R and typeset using R Markdown with Latex code, just like the way our module content is generated.
Please hand in the html file and the Rmd source file.
As usual, part of the grading will depend on the clarity and presentation of your solutions.
You are to do this assignment by yourselves, without any help from others.
You are allowed to use any materials and code that was presented so far.
Do not search the internet for answers to the ICA.
Do not use any fancy code or packages imported from elsewhere.

Plagiarism and collusion

Please familarize yourself with the following excerpt on plagiarism and collusion from the student handbook

By ticking the submission declaration box in Moodle you are agreeing to the following declaration:

Declaration: I am aware of the UCL Statistical Science Department’s regulations on plagiarism for assessed coursework. I have read the guidelines in the student handbook and understand what constitutes plagiarism. I hereby affirm that the work I am submitting for this in-course assessment is entirely my own.

Anonymous Marking

Please do not write your name anywhere on the submission. Please include only your student number as the proxy identifier.

1.) Central limit theorem [10 points]

Recall that if \(X_1, X_2, \ldots\) are i.i.d. random variables with finite nonzero variance then if \(S_n = X_1 + \cdots + X_n\), we have that the normalized sum

\[Z_n = \frac{S_n - \mathbb{E} S_n} {\sqrt{ var(S_n) } } = \frac{S_n - n\mathbb{E}X_1} {\sqrt{ n \cdot var(X_1) } }\] converges in distribution to a standard normal.

Illustrate this fact in R by running simulations. Specifically, consider the case where \(X_i\) are i.i.d. exponential random variables with mean \(1\), and for large values of \(n\), simulate \(k\) instances of \(Z_n\), where \(k\) is also large.

Plot the histogram of these values. [5 points]
Discuss how these simulations relate to the theory. [5 points]

2.) Transition matrices [20 points]

Consider the transition matrix \(P\) on three states \(\{1,2,3\}\) given by \[ \begin{equation*} P = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 1/4 & 1/4 & 1/2 \\ 1/8 & 1/4 & 5/8 \end{pmatrix} \end{equation*} \]

Why does \(P\) have a unique stationary distribution? [2 points]
Compute via linear algebra, using R, the stationary distribution for \(P\). [2 points]
Compute \(Q=P^n\) for large \(n\). Why do you see the stationary distribution in \(Q\)? [4 points]
Let \(X = (X_0, X_1, \ldots, X_n, \ldots)\) be a Markov chain with transition matrix \(P\). Using R, simulate \(X\) for large values of \(n\) starting at \(1\) and \(2\) [4 points]
For both simulations, find the average number of times the chain is in state \(3\). Explain why the averages are similar. [4 points]
Using R, estimate the expected time it takes a Markov chain started at \(3\) to return to \(3\). [4 points]

3.) Experimenting with R [10 points]

Let \(X\) be a Markov chain with transition matrix \(P\) as in the previous question. Let \(\pi\) be the stationary distribution and assume that \(X\) is started at stationarity, so that \(X_i\) has law \(\pi\) for all \(i\). Consider the normalized sum given by

\[Z_n = \frac{S_n - \mathbb{E} S_n} {\sqrt{ n}} = \frac{S_n - n\mathbb{E} X_1} {\sqrt{ n}}\] where \(S_n = X_1 + \cdots + X_n.\)

Is the \(var(S_n) = n \cdot var(X_1)\)? Explain. [2 points]
What do you think \(Z_n\) converges to? [4 points]
Use R to run simulations to support your guess. [4 points]

4.) Estimating transition matrices [10 points]

Suppose that \(X\) is an aperiodic and irreducible Markov chain on a finite number of states that is started at the stationary distribution. Suppose that all you see is the realization \((x_0,x_1, \ldots, x_n)\), where \(n\) is large. How can you estimate the values transition matrix \(P\)? Carefully explain your answer and why your estimates are reasonable.

5.) Poisson processes [10 points]

Suppose someone hands you the data for arrival times (in seconds) of visits to some website. It is given by \[ \begin{eqnarray*} && arr = (0.16, 0.28, 0.42, 0.66, 0.75, \\ && 1.27, 2.25, 2.33, 2.83, 4.09, \\ && 4.30, 4.80, 5.20, 5.68, 7.05) \end{eqnarray*} \] Suppose that a Poisson process is a good model for these arrival times. Use this data and the assumption that it is a Poisson process to find estimates for the following. Justify your answers.

Estimate the expected time between arrivals. [3 points]
Estimate the expected number of arrrivals in one minute. [2 points]
Estimate the variance of the number of arrivals in one minute. [2 points]
Estimate the probability that there are five arrivals in two seconds. [3 points]

Solutions

Central limit theorem

Since \(X_1\) is exponential with mean \(1\), it also has variance \(1\) and thus

\[Z_n =\frac{S_n - n }{\sqrt{n} }\] converges in distribution to the normal distribution. Hence for say \(n=100\), the law of \(Z_n\) is close to that of a standard normal. If we plot a histogram \(k=1000\) independent realizations of \(Z_n\) we expect to see the density for the standard normal.

Zapprox <- function(){
z =   (sum(rexp(100,1)) -100)/10
z
}
y= replicate(1000, Zapprox())
hist(y, prob=TRUE, breaks=50)
curve(dnorm(x), add=TRUE)

Transition matrices

All the entries of \(P\) are positive, our theory on Markov chains tells us that \(P\) is irreducible and aperiodic and has a stationary distribution.
We enter \(P\) into R

P <- matrix(c(1/4,1/4, 1/2,1/4,1/4,1/2,1/8,1/4,5/8), nrow =3)
P <-t(P)
P

##       [,1] [,2]  [,3]
## [1,] 0.250 0.25 0.500
## [2,] 0.250 0.25 0.500
## [3,] 0.125 0.25 0.625

The left-eigenvector corresponding to the stationary distribution is easily computed and normalized

eigen(t(P))

## eigen() decomposition
## $values
## [1]  1.000000e+00  1.250000e-01 -6.362224e-17
## 
## $vectors
##           [,1]          [,2]          [,3]
## [1,] 0.2752409  7.071068e-01  7.071068e-01
## [2,] 0.3853373  6.019825e-16 -7.071068e-01
## [3,] 0.8807710 -7.071068e-01  7.101010e-16

z=eigen(t(P))$vector[,1]
stat <- z/sum(z)
stat

## [1] 0.1785714 0.2500000 0.5714286

When we compute \(P^n\) for large \(n\), we expect to see the stationary measure repeated as a row vector in the matrix, since we know from our convergence theory on Markov chains that \(p_{ij}(n) \to \pi(j).\)

Q = P
for (i in 1:1000){
  Q <- Q %*% P
}
Q

##           [,1] [,2]      [,3]
## [1,] 0.1785714 0.25 0.5714286
## [2,] 0.1785714 0.25 0.5714286
## [3,] 0.1785714 0.25 0.5714286

In order to simulate the Markov chain, we can adapt the code we had used before:

step <- function(i){
  q = P[i,]
  x=-1
  u = runif(1)
  j=0
  cumq = cumsum(q)
  while(x==-1){
    j<-j+1
    if(u <= cumq[j]){x <-j}
  }
  x
}

steps <- function(y){
  x = y[1]
  n = y[2]
  for (i in 1:n){
    x <- c(x, step(x[i]))
  }
  x
}


one = steps(c(1,1000))
two = steps(c(2,1000))

sum(one==3)/1000

## [1] 0.595

sum(two==3)/1000

## [1] 0.569

As expected both averages to be close to \(\pi(3)\) by a version of the law of large numbers.

Note that the wording of the question in the third part, suggests that we should actually simulate the chain, and count the number of times it is in the state \(3\), and repeat this procedure, a large number of times, say \(100\), since it says average number of times rather than average time. We would end with answers, like \(56\) and \(57\); again, the law of large numbers explains why these two quantities are the same.

We start the chain at \(3\) and simulates until it returns, with a while loop.

time <- function(i){
  n =0
  x=i
  n <- n+1
  x <-step(x)
  while(x !=3){
    n <-n+1
    x <- step(x)
  }
  n
}

mean(replicate(1000, time(3)))

## [1] 1.701

Experimenting with R

No, since the \(X_i\) are not independent.
We suspect some version of the central limit theorem, except we don’t know the normalizing constant corresponding to the variance of \(S_n\).
We do simulations of \(Z_n\) in R

initial <- function(){
q = stat
  x=-1
  u = runif(1)
  j=0
  cumq = cumsum(q)
  while(x==-1){
    j<-j+1
    if(u <= cumq[j]){x <-j}
  }
  x
}

mstat = 1*stat[1] + 2*stat[2] + 3*stat[3]


mcclt <- function(){
  path = steps( c(initial(), 100 )  )
z=  (sum(path) - 100*mstat)/10
z  
}

y = replicate(1000, mcclt())
hist(y, prob=TRUE, breaks=50)

We can hope to obtain the standardized normal, by replacing the variance with an estimate for the variance: we can do this separately.

sumsz=replicate(1000, sum( steps( c(initial(),400 ) ) ) ) 
sdest= sd(sumsz)    
                  



samplev <- function(){
  path = steps( c(initial(),400 )  )
z=  (sum(path) - 400*mstat)/(sdest)
z  
}
y = replicate(1000, samplev())
hist(y, prob=TRUE, breaks=30)
curve(dnorm(x), add=TRUE)

Estimating transition matrices

A version of the law of large numbers as explored in Homework 3, Question 3 gives that

\[ \frac{1}{n} \sum_{k=0} ^{n-1} \mathbf{1}[X_k =i ] \mathbf{1}[X_{k+1} =j] \to p_{ij} \pi(i).\]

Thus it suffices to be able to estimate \(\pi\); but this is also no problem, as we know from a version of the law of large numbers that:

\[\frac{1}{n} \sum_{k=0} ^{n-1} \mathbf{1}[X_k =i ] \to \pi(i).\]

Poisson processes

First, we enter this data into R and compute inter-arrival times

arr = c(0.16, 0.28, 0.42, 0.66, 0.75, 1.27, 2.25, 
      2.33, 2.83, 4.09,4.30, 4.80, 5.20,5.68, 7.05)

inter = arr[1]
for (i in 2:length(arr)){
  inter <- c(inter, arr[i] - arr[i-1] )
}
arr

##  [1] 0.16 0.28 0.42 0.66 0.75 1.27 2.25 2.33 2.83 4.09 4.30 4.80 5.20 5.68 7.05

inter

##  [1] 0.16 0.12 0.14 0.24 0.09 0.52 0.98 0.08 0.50 1.26 0.21 0.50 0.40 0.48 1.37

By assumption the arrival times are independent exponentials with some rate \(\lambda = \mu^{-1}\) (arrivals/second) Thus we simply take the average to estimate \(\mu\)

mu = mean(inter)
mu

## [1] 0.47

In one minute, the number of arrivals is a Poisson random variable with mean \(60 \cdot \lambda\).

1/mu * 60

## [1] 127.6596

We report the same numerical value for the variance, since it is a Poisson random variable.
Since this is a Poisson process, the probability that there are 5 arrivals in two seconds is given by the Poisson distribution with mean \(2 \cdot \lambda\)

dpois(5, 2 * 1/mu)

## [1] 0.1649749

Version: 31 December 2020

Rmd Source

Stat0009 (2020) ICA 1