```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(reticulate) # py_install("matplotlib") ``` # Introduction * Made available UCL Week 10: 2 November 2021 * Due UCL Week 12: 18 November 2021, 13:00 London Time. * This ICA consists of five questions, worth 50 points; * another 10 points will be given based on presentation, so that the total available points is 60. * This ICA is worth 30 percent of your grade for this module. * Many questions will require you to write and run R/Python code. * The ICA must be completed in R Markdown/Juptyer Notebook and typeset using Markdown with Latex code, just like the way our module content is generated. You can choose to use either R or Python. * Please hand in the html file and the Rmd/ipynb source file. * As usual, part of the grading will depend on the clarity and presentation of your solutions. * You are to do this assignment by yourselves, without any help from others. * You are allowed to use any materials and code that was presented so far. * Do not search the internet for answers to the ICA. * Do not use any fancy code or packages imported from elsewhere. ## Plagiarism and collusion Please familarize yourself with [the following excerpt on plagiarism and collusion from the student handbook ](https://tsoo-math.github.io/ucl/Plagiarism-Collusion.pdf) By ticking the submission declaration box in Moodle you are agreeing to the following declaration:
**Declaration:** I am aware of the UCL Statistical Science Department's regulations on plagiarism for assessed coursework. I have read the guidelines in the student handbook and understand what constitutes plagiarism. I hereby affirm that the work I am submitting for this in-course assessment is entirely my own. ## Anonymous Marking Please do **not** write your name anywhere on the submission. Please include only your **student number** as the proxy identifier. # Questions ## 1.) Empirical distribution [10 points] * Let $X_1, \ldots, X_n$ be i.i.d. random variables all with cdf $F$. Show that for every $x \in \mathbb{R}$, we have $$ \frac{1}{n} \sum_{i=1}^n \mathbf{1}[ X_i \leq x] \to F(x). \ \text{[5 points]}$$ * Demonstrate this fact by running simulations of a large number of random variables that are uniformly distributed in the unit interval $[0,1]$. [5 points] ## 2.) Generating random variables [10 points] * Show that if $X$ is a continuous random variable taking values on $D$ with a cdf that is strictly increasing on $D$, then the random variable $F(X)$ is uniformly distributed on the unit interval $[0,1]$. [3 points] * Show that if $U$ is uniformly distributed in $[0, \tfrac{\pi}{2}]$, then $\sin^2(U)$ has the beta distribution with parameters $(\tfrac{1}{2}, \tfrac{1}{2})$. [3 points] * Suppose that you have access to a true source of randomization given by say radioactive decay; that is, you have access to independent random variables that are exponentially distributed with rate $1$. Show that you can generate random variables with the beta distribution with parameters $(\tfrac{1}{2}, \tfrac{1}{2})$. [2 points] * Demonstrate your procedure in the last question, by computer simulations, and plot a histogram of the results against pdf of the beta $(\tfrac{1}{2}, \tfrac{1}{2})$. [2 points] ## 3.) Random walk [5 points] Let $S_n = X_1 + \cdots + X_n$, where $X_i$ are i.i.d. random variables, with $$\mathbb{P}(X_1 = 1) = \tfrac{1}{2} = \mathbb{P}(X_1 = -1).$$ Let $$L_n = \# \{ 1 \leq k \leq n : S_k >0\}.$$ Demonstrate, by simulations, that $L_n/n$ converges in distribution to the beta $(\tfrac{1}{2}, \tfrac{1}{2})$ distribution. ## 4.) Exporting and importing data [5 points] * Simulate $250$ random variables that are uniformly distributed in $[0,1]$. * Export them to a tab delimited text file named *export.txt*. * Now import them back and save them under the variable *imported*. * Plot a probability histogram of the imported data. You may need to do some processing as the values may be in a table, rather than a vector. ## 5.) Estimating the stationary distribution [10 points] * Suppose you are given the output of a $100000$ steps of a irreducible and aperiodic finite state Markov chain. Carefully explain how you could estimate the stationary distribution for this Markov chain, and why you estimator is reasonable. [5 points] * Import the data from the file *markovchain.txt* and use this data and your method above to estimate the stationary distribution. [5 points] ## 6.) Poisson processes [10 points] Suppose a shop that operates daily in the time interval $[a,b]$. It has customers arriving according to a Poisson process of intensity $3$ in the time interval $[a, c)$, and a Poisson process of intensity $5$ in the time interval $[c,b)$; here $a$ and $b$ are known, but $c$ is *unknown*. You can imagine the shop keeper notices that at some point in the day, the shop seems to get busier. The shop keeper has a log of all the arrival times, for each of $n$ days of operation, where $n$ is large. * Given an open interval $(r,s) \subset [a,b]$, explain how you can use the shop keeper's log to make a good guess at whether or not $(r,s)$ contains the unknown time $c$; show that as $n \to \infty$ you will know with certainty whether $c \in (r,s)$. Carefully explain your answer. [5 points] * Demonstrate your answer by running simulations; for example, choose $a=0$, $b=8$, and $c=4$, and simulate the arrivals to generate the shop keeper's log. Now apply your method with the intervals $(2.7, 4.3)$ and $(5,6)$. [5 points] # Endnotes * Version: `r format(Sys.time(), '%d %B %Y')` * [Rmd Source](https://tsoo-math.github.io/ucl2/2021-ica1-stat9.Rmd) # Solutions ## Empirical Distributions * Fix $x \in \mathbb{R}$. We observe that the Bernoulli random variables $\mathbf{1}[X_i \leq x]$ are i.i.d. with mean $\mathbb{P}(X_i \leq x) = F(x)$. Thus the desired convergence is immediate from the law of large numbers. * We will run our simulations in Python. ```{python} import numpy as np u = [np.random.uniform() for _ in range(1000)] t = t=np.linspace(0,1,num=1000) def emperical(x): return (1/len(u)) * sum( k <= x for k in u) import matplotlib.pyplot as plt plt.plot(t,t, label='uniform cdf') plt.plot(t,emperical(t), label='empirical distribution') plt.legend(loc='upper left') plt.show() ``` ## Generating random variables * Note that $F(X)$ takes values in $[0,1]$. Let $x \in [0,1]$, then $\{ F(X) \leq x \} = \{ X \leq F^{-1}(x)\}$. Thus $$\mathbb{P}(F(X) \leq x ) = \mathbb{P}(X \leq F^{-}(x)) = F( F^{-1}(x)) = x,$$\ as desired. * Note that $\sin^2(U)$ takes values in $[0,1]$. For $x \in [0,1]$, we have $$\mathbb{P}(\sin^2(U) \leq x) =\mathbb{P}( U \leq \sin^{-1}(\sqrt{x}) = \frac{2}{\pi} \cdot \sin^{-1}(\sqrt{x});$$ hence taking a derviative with respect to $x$, we obtain the pdf $$ x \mapsto \frac{1}{\pi \sqrt{x(1-x)}},$$ as desired. * If $X$ is exponential, then we know that cdf is given by $$ F(x) = 1 - e^{-x}.$$ Thus the first part gives that $F(X)$ is uniformly distributed on $[0, 1]$, so that $\tfrac{\pi}{2} F(X)$ is uniformly distributed on $[0, \tfrac{\pi}{2}]$. Now the previous part gives that $\sin^2(\tfrac{\pi}{2} F(X))$ has the required beta distribution. * We do our simulations in Python ```{python} def F(x): return 1-np.exp(-x) print(1+1) x = np.random.exponential(size=10000) trans = (np.sin((np.pi/2)*F(x)))**2 suppress=plt.hist(trans, bins = 25, density=True, label='Proability Histogram') t=np.linspace(0.001,0.999,num=100) plt.plot(t, 1/ (np.pi * np.sqrt(t*(1-t))), label='Exact Density') plt.legend(loc='upper left') plt.show() ``` ## Random walk We do our simulations in Python ```{python} def L(n): s=np.array([0]) for i in range(n): x= 2*np.random.binomial(1,0.5,1)-1 s = np.append(s, s[-1]+x) return sum(0