-
Notifications
You must be signed in to change notification settings - Fork 0
/
Report.Rmd
183 lines (135 loc) · 7.05 KB
/
Report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
title: "Simulation of Exponential Distribution using R"
author: "Carlos Hernández"
date: "25/9/2020"
output:
pdf_document:
latex_engine: xelatex
highlight: espresso
toc: true
toc_depth: 4
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
set.seed(1)
library(ggplot2)
theme_set(theme_bw())
```
## Overview
In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.
To achieve this we will investigate the distribution of averages of 40 exponentials with lambda 0.2, replicated a thousand times.
```{r}
lambda <- 0.2
n <- 40
```
## Definitions
#### Exponential Distribution
The exponential distribution describes the arrival time of a randomly recurring independent event sequence. If μ is the mean waiting time for the next event recurrence, its probability density function is:
\begin{equation}
\nonumber f_X(x) = \left\{
\begin{array}{l l}
\lambda e^{-\lambda x} & \quad x > 0\\
0 & \quad \textrm{otherwise}
\end{array} \right.
\end{equation}
Here is a graph of the exponential distribution with mean = 1.
```{r}
data <- data.frame(value = c(t(rexp(1000, rate = 1))))
ggplot(data, aes(x=value)) +
geom_histogram(aes(y=..density..),binwidth=.25, col="black", fill="lightblue")+
labs(title= "Exponential distribution with mean = 1",
caption="Produced by Carlos Hernández") +
xlab("x") +
ylab("y")
```
#### Central Limit Theorem
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population, then the distribution of the sample means will be approximately normally distributed.
This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). If the population is normal, then the theorem holds true even for samples smaller than 30. In fact, this also holds true even if the population is binomial, provided that min(np, n(1-p))> 5, where n is the sample size and p is the probability of success in the population. This means that we can use the normal probability model to quantify uncertainty when making inferences about a population mean based on the sample mean.
## Simulation
First we are going to simulate our exponential distribution, and replicated it one thousand times.
```{r}
expData <- replicate(1000, rexp(n, lambda)) # replicate 1000 times
expData <- data.frame(value = c(t(expData))) # convert to data frame
# plot
ggplot(expData, aes(x=value)) +
geom_histogram(aes(y=..density..), binwidth=.8,colour="black", fill="lightblue") +
labs(title= "Exponential distribution with lambda = 0.2 and 40 observations",
subtitle = "Replicated 1000 times",
caption="Produced by Carlos Hernández") +
xlab("x") +
ylab("exp(x)")
```
According to the Central Limit Theorem, if we take simple random samples from the population and compute the mean for each of the samples, the distribution of sample means should be approximately normal.
Thus, we can calculate the mean of the distribution one thousand times and save its value in a data frame and the resulted distribution should be approximately normal.
```{r}
data <- replicate(1000, mean(rexp(n,lambda)))
data <- data.frame(value = c(t(data)), size = 40)
ggplot(data, aes(x=value)) +
geom_histogram(aes(y=..density..),binwidth=.25, col="black", fill="lightblue") +
labs(title= "Average of 40 random exponential distribution",
subtitle = "Replicated 1000 times",
caption="Produced by Carlos Hernández") +
xlab("x") +
ylab("mean")
```
## Sample Mean versus Theoretical Mean
We know that the theoretical mean is given by 1/lambda, so we can calculate it and compare it with the sample mean.
```{r warning=FALSE}
theoretical_mu <- 1/lambda # calculate theoretical mean
sample_mu <-mean(data$value) # calculate experimental mean
ggplot(data, aes(x=value)) +
stat_function(fun=dnorm,
color="black",
args=list(mean=mean(data$value),
sd=sd(data$value)))+
geom_vline(xintercept = theoretical_mu, colour="red") +
geom_text(aes(x=theoretical_mu-.25,
label="\nTheoretical mean", y=.2),
colour="red", angle=90, text=element_text(size=11)) +
geom_vline(xintercept = sample_mu, colour="green")+
geom_text(aes(x=sample_mu+.05, label="\nSample mean", y=.2),
colour="green", angle=90) +
labs(title= "Theoretical mean vs sample mean",
caption="Produced by Carlos Hernández") +
xlab("x") +
ylab("y")
```
The theoretical mean is: `r theoretical_mu` and the sample mean is: `r round(sample_mu,2)`.
## Sample Variance versus Theoretical Variance
Also, we know that the theoretical variance is given by 1/lambda^2, so we can calculate it and compare it with the sample variance.
```{r}
theoretical_variance <- 1/(n * lambda^2)
sample_variance <- round(var(data$value),3)
ggplot(data, aes(x=value)) +
stat_function(fun=dnorm, color="black", args=list(mean=mean(data$value), sd=sd(data$value)))+
geom_vline(xintercept = sample_mu, colour="gray", linetype="dashed")+
geom_vline(xintercept = theoretical_mu, colour="gray", linetype="dashed")+
geom_segment(aes(x = sample_mu, y = 0.36, xend =sample_mu +
sample_variance, yend = 0.36), colour="green") +
geom_segment(aes(x = theoretical_mu - theoretical_variance, y = 0.35,
xend =theoretical_mu, yend = 0.35), colour="red") +
labs(title= "Theoretical variance vs sample variance",
caption="Produced by Carlos Hernández") +
geom_text(aes(x=sample_mu+.55, label="\nSample variance", y=.42), colour="green") +
geom_text(aes(x=theoretical_mu-.65, label="\nTheoretical variance", y=.33), colour="red") +
xlab("x") +
ylab("y")
```
The theoretical variance is: `r theoretical_variance` and the sample variance is: `r sample_variance`.
## Distribution: the distribution is approximately normal
In order to show graphically that this distribution is normally distributed, we can normalize it by its mean and standard deviation. This transforms it into a Standard Normal Distribution, N(0,1). So let’s do that normalization, plot the histogram of the transformed distribution, and show how it compares to an exact Standard Normal Distribution, by overlaying it:
```{r}
ggplot(data, aes(x=value)) +
geom_histogram(aes(y=..density..),binwidth=.25, col="black", fill="lightblue") +
stat_function(fun=dnorm,
color="blue",
args=list(mean=mean(data$value),
sd=sd(data$value)))+
labs(title= "Average of 40 random exponential distribution",
subtitle = "Replicated 1000 times",
caption="Produced by Carlos Hernández") +
xlab("x") +
ylab("y")
```