Please familiarize yourself with the following excerpt on plagiarism and collusion from the student handbook.
By ticking the submission declaration box in Moodle you are agreeing to the following declaration:
Declaration: I am aware of the UCL Statistical Science Department’s regulations on plagiarism for assessed coursework. I have read the guidelines in the student handbook and understand what constitutes plagiarism. I hereby affirm that the work I am submitting for this in-course assessment is entirely my own.
Please do not write your name anywhere on the submission. Please include only your student number as the proxy identifier.
Recall that if \(X_1, X_2, \ldots\) are i.i.d. random variables with finite nonzero variance and \(S_n = X_1 + \cdots + X_n\), then the normalized sum
\[Z_n = \frac{S_n - \mathbb{E} S_n}{\sqrt{\operatorname{Var}(S_n)}} = \frac{S_n - n\,\mathbb{E}X_1}{\sqrt{n \operatorname{Var}(X_1)}}\] converges in distribution to a standard normal.
Illustrate this fact in R by running simulations. Specifically, consider the case where \(X_i\) are i.i.d. exponential random variables with mean \(1\), and for large values of \(n\), simulate \(k\) instances of \(Z_n\), where \(k\) is also large.
Consider the transition matrix \(P\) on three states \(\{1,2,3\}\) given by \[ \begin{equation*} P = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 1/4 & 1/4 & 1/2 \\ 1/8 & 1/4 & 5/8 \end{pmatrix} \end{equation*} \]
Let \(X\) be a Markov chain with transition matrix \(P\) as in the previous question. Let \(\pi\) be the stationary distribution and assume that \(X\) is started at stationarity, so that \(X_i\) has law \(\pi\) for all \(i\). Consider the normalized sum given by
\[Z_n = \frac{S_n - \mathbb{E} S_n} {\sqrt{ n}} = \frac{S_n - n\mathbb{E} X_1} {\sqrt{ n}}\] where \(S_n = X_1 + \cdots + X_n.\)
Suppose that \(X\) is an aperiodic and irreducible Markov chain on a finite number of states that is started at the stationary distribution. Suppose that all you see is the realization \((x_0, x_1, \ldots, x_n)\), where \(n\) is large. How can you estimate the values of the transition matrix \(P\)? Carefully explain your answer and why your estimates are reasonable.
Suppose someone hands you the data for arrival times (in seconds) of visits to some website. It is given by \[ \begin{aligned} \mathrm{arr} = (\,&0.16, 0.28, 0.42, 0.66, 0.75, \\ &1.27, 2.25, 2.33, 2.83, 4.09, \\ &4.30, 4.80, 5.20, 5.68, 7.05\,) \end{aligned} \] Suppose that a Poisson process is a good model for these arrival times. Use this data and the assumption that it is a Poisson process to find estimates for the following. Justify your answers.
Since \(X_1\) is exponential with mean \(1\), it also has variance \(1\) and thus
\[Z_n = \frac{S_n - n}{\sqrt{n}}\] converges in distribution to a standard normal. Hence for, say, \(n=100\), the law of \(Z_n\) is close to that of a standard normal. If we plot a histogram of \(k=1000\) independent realizations of \(Z_n\), we expect to see the density of the standard normal.
# One realization of Z_n for n = 100 exponential(1) variables:
# Z_n = (S_n - n)/sqrt(n), here (sum - 100)/10
Zapprox <- function(){
  z <- (sum(rexp(100, 1)) - 100)/10
  z
}
# Generate k = 1000 independent realizations and overlay the standard
# normal density on the histogram
y <- replicate(1000, Zapprox())
hist(y, prob = TRUE, breaks = 50)
curve(dnorm(x), add = TRUE)
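As an extra sanity check (our addition, not part of the original solution), the sample mean and standard deviation of y should be close to \(0\) and \(1\), and a normal Q-Q plot should be nearly a straight line:
# Sample moments of the simulated Z_n values
c(mean(y), sd(y))
# Normal quantile-quantile plot against the reference line
qqnorm(y)
qqline(y)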
Since all the entries of \(P\) are positive, the theory of Markov chains tells us that \(P\) is irreducible and aperiodic and has a unique stationary distribution \(\pi\).
We enter \(P\) into R:
# matrix() fills column by column, so we enter the rows of P and then
# transpose to obtain the intended matrix
P <- matrix(c(1/4, 1/4, 1/2, 1/4, 1/4, 1/2, 1/8, 1/4, 5/8), nrow = 3)
P <- t(P)
P
## [,1] [,2] [,3]
## [1,] 0.250 0.25 0.500
## [2,] 0.250 0.25 0.500
## [3,] 0.125 0.25 0.625
The stationary distribution is the left eigenvector of \(P\) with eigenvalue \(1\), that is, a right eigenvector of the transpose of \(P\); it is easily computed and then normalized to sum to one.
eigen(t(P))
## eigen() decomposition
## $values
## [1] 1.000000e+00 1.250000e-01 -6.362224e-17
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.2752409 7.071068e-01 7.071068e-01
## [2,] 0.3853373 6.019825e-16 -7.071068e-01
## [3,] 0.8807710 -7.071068e-01 7.101010e-16
# Extract the eigenvector for eigenvalue 1 (the first column) and
# normalize it so the entries sum to 1
z <- eigen(t(P))$vectors[,1]
stat <- z/sum(z)
stat
## [1] 0.1785714 0.2500000 0.5714286
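As a quick check (an addition to the solution above), a stationary distribution must satisfy \(\pi P = \pi\):
# stat, treated as a row vector, should be mapped to itself by P
stat %*% P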
Alternatively, since the chain is irreducible and aperiodic, every row of \(P^n\) converges to \(\pi\) as \(n \to \infty\); we check this by computing a high power of \(P\).
Q <- P
# After the loop, Q = P^1001; every row should be close to pi
for (i in 1:1000){
  Q <- Q %*% P
}
Q
## [,1] [,2] [,3]
## [1,] 0.1785714 0.25 0.5714286
## [2,] 0.1785714 0.25 0.5714286
## [3,] 0.1785714 0.25 0.5714286
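The rows of the matrix power agree with the eigenvector computation; as a hypothetical numerical check:
# Largest deviation of the first row of Q = P^1001 from stat
max(abs(Q[1,] - stat))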
To simulate the chain we use the function step, which takes the current state \(i\) and returns the next state, sampled by inverting the cumulative distribution given by row \(i\) of \(P\).
step <- function(i){
  q <- P[i,]           # transition probabilities out of state i
  x <- -1
  u <- runif(1)        # one uniform random variable
  j <- 0
  cumq <- cumsum(q)    # cumulative transition probabilities
  while(x == -1){      # return the first j with u <= cumq[j]
    j <- j + 1
    if(u <= cumq[j]){x <- j}
  }
  x
}
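An equivalent one-line alternative (our addition, not used in the solution above) lets R's built-in sample() do the inversion; step2 is a hypothetical name:
# Sample the next state directly with probabilities from row i of P
step2 <- function(i) sample(1:3, 1, prob = P[i,])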
# steps(c(x0, n)) simulates n steps of the chain started at x0 and
# returns the whole path (x0, x1, ..., xn)
steps <- function(y){
  x <- y[1]
  n <- y[2]
  for (i in 1:n){
    x <- c(x, step(x[i]))
  }
  x
}
# Two paths of length 1000 started from states 1 and 2; we record the
# proportion of time each spends in state 3
one <- steps(c(1,1000))
two <- steps(c(2,1000))
sum(one==3)/1000
## [1] 0.595
sum(two==3)/1000
## [1] 0.569
As expected, both averages are close to \(\pi(3) \approx 0.571\), by a version of the law of large numbers.
Note that the wording of the question in the third part suggests that we should actually simulate the chain, count the number of times it is in state \(3\), and repeat this procedure a large number of times, say \(100\), since it says average number of times rather than average time. We would end up with answers like \(56\) and \(57\) per \(100\) steps; again, the law of large numbers explains why these two quantities are close. A minimal sketch of that procedure follows.
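This sketch is our addition; it assumes paths of \(100\) steps, so the counts come out near \(100 \cdot \pi(3) \approx 57\):
# Count visits to state 3 along a path of 100 steps, averaged over
# 100 independent runs
mean(replicate(100, sum(steps(c(1, 100)) == 3)))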
We start the chain at \(3\) and simulate until it returns, using a while loop.
# time(i) simulates the chain from state i and returns the number of
# steps until it first reaches state 3
time <- function(i){
  n <- 1               # count the first step
  x <- step(i)
  while(x != 3){       # keep stepping until the chain hits state 3
    n <- n + 1
    x <- step(x)
  }
  n
}
mean(replicate(1000, time(3)))
## [1] 1.701
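This is close to the theoretical value: for the chain started at \(3\), the expected return time to \(3\) is \(1/\pi(3) = 7/4 = 1.75\) by Kac's formula, which we can compute directly (our addition):
# Theoretical expected return time to state 3: 1/pi(3) = 1.75
1/stat[3]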
To start the chain at stationarity, we sample the initial state from \(\pi\) using the same inversion method as in step.
initial <- function(){
  q <- stat            # stationary probabilities
  x <- -1
  u <- runif(1)
  j <- 0
  cumq <- cumsum(q)
  while(x == -1){      # return the first j with u <= cumq[j]
    j <- j + 1
    if(u <= cumq[j]){x <- j}
  }
  x
}
We now simulate \(k=1000\) realizations of \(Z_n\) for \(n=100\), with the chain started at stationarity.
# mstat is the stationary mean E[X_1] = sum over i of i * pi(i)
mstat <- 1*stat[1] + 2*stat[2] + 3*stat[3]
mcclt <- function(){
  # one realization of Z_n with n = 100, normalized by sqrt(100) = 10
  path <- steps( c(initial(), 100) )
  z <- (sum(path) - 100*mstat)/10
  z
}
y <- replicate(1000, mcclt())
hist(y, prob=TRUE, breaks=50)
The histogram is bell-shaped but does not have unit variance, since for a Markov chain the variance of \(S_n\) is no longer \(n \operatorname{Var}(X_1)\). We can hope to obtain a standard normal by replacing the denominator with an estimate of the standard deviation of \(S_n\), which we compute separately.
# First estimate sd(S_n) for n = 400 from an independent batch of paths
sumsz <- replicate(1000, sum( steps( c(initial(), 400) ) ) )
sdest <- sd(sumsz)
# Then standardize fresh realizations of S_400 by the estimated sd
samplev <- function(){
  path <- steps( c(initial(), 400) )
  z <- (sum(path) - 400*mstat)/sdest
  z
}
y <- replicate(1000, samplev())
hist(y, prob=TRUE, breaks=30)
curve(dnorm(x), add=TRUE)
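As a side observation (our addition, not part of the original solution), sdest also yields an estimate of the asymptotic standard deviation \(\sigma\) in the Markov chain central limit theorem, where \((S_n - n \cdot \mathtt{mstat})/\sqrt{n}\) converges in distribution to \(N(0, \sigma^2)\):
# sdest estimates sd(S_400) = sigma * sqrt(400), so divide by sqrt(400)
sdest / sqrt(400)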
A version of the law of large numbers, as explored in Homework 3, Question 3, gives that
\[ \frac{1}{n} \sum_{k=0} ^{n-1} \mathbf{1}[X_k =i ] \mathbf{1}[X_{k+1} =j] \to p_{ij} \pi(i).\]
Thus it suffices to be able to estimate \(\pi\); but this is also no problem, as we know from a version of the law of large numbers that
\[\frac{1}{n} \sum_{k=0} ^{n-1} \mathbf{1}[X_k =i ] \to \pi(i).\]
Dividing the first limit by the second, the natural estimator of \(p_{ij}\) is the proportion of visits to \(i\) that are immediately followed by a jump to \(j\):
\[\hat p_{ij} = \frac{\sum_{k=0}^{n-1} \mathbf{1}[X_k = i]\, \mathbf{1}[X_{k+1} = j]}{\sum_{k=0}^{n-1} \mathbf{1}[X_k = i]}.\]
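A minimal R sketch of this estimator (our addition, reusing the simulation functions above; path and estP are hypothetical names):
# Estimate the transition matrix from one long realization of the chain
path <- steps(c(initial(), 10000))           # one path (x_0, ..., x_n)
estP <- matrix(0, 3, 3)
for (i in 1:3){
  visits <- which(path[-length(path)] == i)  # times k < n with x_k = i
  for (j in 1:3){
    # transitions i -> j divided by the number of visits to i
    estP[i, j] <- sum(path[visits + 1] == j) / length(visits)
  }
}
round(estP, 3)   # should be close to P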
First, we enter this data into R and compute the inter-arrival times:
arr <- c(0.16, 0.28, 0.42, 0.66, 0.75, 1.27, 2.25,
         2.33, 2.83, 4.09, 4.30, 4.80, 5.20, 5.68, 7.05)
# The first inter-arrival time is arr[1] (measured from time 0); the
# rest are successive differences of consecutive arrival times
inter <- arr[1]
for (i in 2:length(arr)){
  inter <- c(inter, arr[i] - arr[i-1])
}
arr
## [1] 0.16 0.28 0.42 0.66 0.75 1.27 2.25 2.33 2.83 4.09 4.30 4.80 5.20 5.68 7.05
inter
## [1] 0.16 0.12 0.14 0.24 0.09 0.52 0.98 0.08 0.50 1.26 0.21 0.50 0.40 0.48 1.37
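Equivalently (an idiomatic alternative, not in the original solution), diff() reproduces inter in one line:
# Successive differences of the arrival times with 0 prepended
inter2 <- diff(c(0, arr))
all.equal(inter, inter2)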
# The mean inter-arrival time estimates 1/lambda, in seconds per arrival
mu <- mean(inter)
mu
## [1] 0.47
# Estimated arrival rate: 1/mu per second, i.e. 60/mu per minute
1/mu * 60
## [1] 127.6596
We report the same numerical value for the variance of the number of arrivals per minute, since the number of arrivals in a fixed interval is a Poisson random variable, and the variance of a Poisson random variable equals its mean.
Since this is a Poisson process, the probability that there are exactly \(5\) arrivals in two seconds is given by the Poisson distribution with mean \(2 \cdot \lambda\), where we estimate the rate per second \(\lambda\) by 1/mu:
dpois(5, 2 * 1/mu)
## [1] 0.1649749