

# Distributions and models {#SamplingDistributions}


```{r, child = if (knitr::is_html_output()) {'./introductions/17-Tools-DistributionsAndModels-HTML.Rmd'} else {'./introductions/17-Tools-DistributionsAndModels-LaTeX.Rmd'}}
```


## Introduction {#Chap17-Intro}

In the decision-making process used in statistics (Sect. \@ref(DecisionMaking)), an *assumption* is made about a parameter that describes the population.
Then, we observe just one of the many different samples that could be drawn from this population.
The sample statistic can vary, depending on which sample we observe.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Remember: Studying a sample leads to the following observations:
\vspace{-2ex}

* Every sample is likely to be different.
* Our sample is one of countless possible samples from the population.
* Every sample is likely to produce a different value for the sample statistic.
* Hence we only observe one of the many possible values for the sample statistic.
\vspace{-2ex}

Since many values for the sample statistic are possible, the possible values of the sample statistic vary (called *sampling variation*) and have a *distribution* (called a *sampling distribution*).
:::


Based on the assumption about the parameter, the values of the statistic that we could reasonably *expect* from all these possible samples can be described.
The challenge is that we only study *one* of these many possible samples.

The values of the statistic that we can reasonably *expect* from all possible samples can be described using a *distribution*: that is, by describing what values the statistic can take from all these samples, and how often.
For example, if I deal 15 cards, the *statistic* could be $\hat{p}$, 'the proportion of red cards in a hand of 15'.
Specifically, this is a *sampling distribution*, since the distribution is describing the distribution of a sample statistic (in this case, $\hat{p}$).
The *distribution* would describe what values of $\hat{p}$ are possible, and how likely each one is to be observed.
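
The idea of a sampling distribution for $\hat{p}$ can be explored by simulation.
The following is a minimal sketch only, assuming a standard 52-card deck (26 red, 26 black) and hands dealt without replacement; the function name `deal_p_hat()`, the seed and the number of simulated hands are illustrative choices, not part of the original example.

```r
set.seed(2024)   # arbitrary seed, used only so the sketch is reproducible

# A standard deck: 26 red cards and 26 black cards (an assumption)
deck <- rep(c("red", "black"), each = 26)

# Deal one hand (without replacement) and return the proportion of red cards
deal_p_hat <- function(hand.size = 15) {
  hand <- sample(deck, size = hand.size)
  mean(hand == "red")
}

# Each simulated hand gives one value of the sample statistic
p.hat <- replicate(5000, deal_p_hat())

# Which values of p-hat occurred, and how often:
# an approximation to the sampling distribution
round(prop.table(table(round(p.hat, 2))), 3)
```

Each run of `deal_p_hat()` plays the role of one sample; the table of simulated values approximates which values of $\hat{p}$ are possible, and how likely each is.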

Under certain circumstances, many sample statistics have a similar-shaped distribution: a *bell-shaped (or normal) distribution*.
We study this distribution, as it is often used to describe what values of the statistic are reasonable to observe.


## Distributions: an example


<div style="float:right; width: 222px; border: 1px; padding:10px">
<img src="Illustrations/measure-289399_640.jpg" width="200px"/>
</div>


Consider the heights and weights of *all* American adult females aged 18 or over (the populations).
Clearly, the heights and weights of *all* American adult females are unknown: no-one has ever measured, or could ever realistically measure, the heights and weights of every American adult female.
However, the *National Health and Nutrition Examination Survey* (NHANES) uses a 'nationally *representative* sample' of Americans, and so provides a large, representative *sample* of Americans [@data:NHANES3:Data; @data:NHANES3].
The NHANES data includes the heights and weights of those in the sample; the *distributions* are shown in Fig.\ \@ref(fig:F18HeightsWeights).


```{r, F18HeightsWeights, out.width = '90%', fig.width=7.5, fig.height=3, fig.align='center', fig.cap="The heights (left) and weights (right) of females in the USA aged 18 and over, from the 2009--2012 NHANES survey ($n = 3742$)"}
F18 <- subset( NHANES, 
               (Age > 18) & (Gender == "female") )

par( mfrow = c(1, 2)) 

hist(F18$Height, 
     main = "Heights of adult females\nin the USA",
     xlab = "Heights (in cm)",
     ylab = "Number",
     col = plot.colour,
     las = 1,
     breaks = seq(120, 200, by = 5))
hist(F18$Weight, 
     main = "Weights of adult females\nin the USA",
     xlab = "Weights (in kg)",
     ylab = "Number",
     col = plot.colour,
     las = 1,
     breaks = seq(20, 250, by = 15)) 
```


Since the sample is representative, and the sample size is large ($n = 3742$), the distributions of the heights and weights in the sample will probably resemble the distributions in the population.
However, *every sample of the same size is likely to be different*, so a different sample of Americans would produce a different (though probably similar) histogram.

However, we could *assume* a model for the heights and weights of adult American females in the *population* that gave rise to the sample histograms.
The *sample* heights look like they may have come from a *population* of heights that are roughly symmetrical, centred around approximately $162$cm.
The *sample* weights, though, do *not* look like they may have come from a roughly symmetric *population* of weights; they are right skewed.

Based on this, a distribution of the population heights can be proposed that may have produced the histogram of the sample.
This *model* for the distribution of heights of adult American women is roughly bell-shaped.
These types of distributions are called *normal distributions*.

A model is a *theoretical* idea that might be a useful description of the heights of American adult females in the *population*.
Suppose a *model* for the heights of American adult females is adopted that describes the heights as:

* having a bell-shaped (normal) distribution,
* with a *mean height* of $162$cm, and 
* a *standard deviation* of $7$cm. 

Then, the *distribution* of the heights of American adult females would look like Fig. \@ref(fig:HeightsModel) (left panel).
That is, most American adult females are between about $155$ and $169$cm, and very few are taller than $183$cm or shorter than $141$cm.


```{r HeightsModel, out.width='95%', results='hide', fig.cap="A model for the heights of American adult females, showing 1, 2 and 3 standard deviations either side of the mean (left); the model for heights of American adult females, plus the histogram from one specific sample of size $n = 200$ (right)", fig.align="center", fig.width=7.75, fig.height=2.5}
HT.mn <- round( mean( F18$Height, na.rm = TRUE) )
HT.sd <- round(sd( F18$Height, na.rm = TRUE))

par( mfrow = c(1, 2),
     mar = c(5, 0.25, 4, 0.25))

hide <- plotNormal(mu = HT.mn,  
                   sd = HT.sd, 
                   showZ = TRUE,
                   xlab = "Height (in cm)",
                   main = "Heights of American\nadult females")

set.seed(140430)

num.heights <- 200

x <- seq( HT.mn - 4 * HT.sd, 
          HT.mn + 4 * HT.sd, 
          length = 100)
y <- dnorm( x,  
            mean = HT.mn, 
            sd = HT.sd)

# HISTOGRAM
HT.data <- rnorm( num.heights, 
                  mean = HT.mn, 
                  sd = HT.sd)

out <- hist(HT.data, 
            breaks = seq(130, 190, by = 5),
            col = plot.colour,
            border = plot.colour,
            xlab = "Heights (in cm)",
            axes = FALSE,
            plot = TRUE,
            xlim = range(x),
            main = "Heights of American adult\nfemales: model, and one sample",
            sub = "for one sample of size 200",
            ylab = "")
axis(side = 1 )
y <- y / max(y) * max (out$counts)

lines( y ~ x,
       lwd = 2,
       col = "black")

# Vertical line at x = 0 (outside the plotted range of heights)
abline(v = 0,
       col = "grey",
       lwd = 2)
# Horizontal baseline (the x-axis) at y = 0
abline(h = 0,
       col = "grey",
       lwd = 2)
```

Since we do not know the heights of all American females, this model represents an idealised, or *assumed*, picture of the histogram of the heights of all American adult females in the *population*.
The sample of females in Fig.\ \@ref(fig:F18HeightsWeights) (left panel) could reasonably have come from this population.

If this model is accurate, the distribution of heights in any *sample* may be shaped a bit like this... but *sampling variation* exists, so every sample will be a bit different.

While any one sample will look a bit different than this model, the model captures the general feel of the histogram from many of these samples.
The model is like the 'average' of many sample histograms.
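
Sampling variation around the model can be illustrated numerically.
This is a rough sketch, assuming the model above (normal, mean $162$cm, standard deviation $7$cm) and samples of size $200$; the seed, the number of samples and the object names are illustrative assumptions.

```r
set.seed(2024)   # arbitrary seed, for reproducibility only

mu    <- 162   # model mean (in cm)
sigma <- 7     # model standard deviation (in cm)
n     <- 200   # size of each sample

# Draw five samples from the model
samples <- replicate(5,
                     rnorm(n, mean = mu, sd = sigma),
                     simplify = FALSE)

# Every sample gives a slightly different mean and standard deviation
sample.summary <- data.frame(sample = 1:5,
                             mean   = sapply(samples, mean),
                             sd     = sapply(samples, sd))
round(sample.summary, 1)
```

The sample means and standard deviations differ from sample to sample, but all are close to the values in the model.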


```{r HeightsModelMovie, animation.hook="gifski", dev=if (knitr::is_latex_output()){"pdf"}else{"png"}}
if (knitr::is_html_output()) {
  set.seed(14040)
  
  HT.mn <- round( mean( F18$Height, na.rm = TRUE) )
  HT.sd <- round(sd( F18$Height, na.rm = TRUE))

  num.heights <- 200
  
  x <- seq( HT.mn - 4 * HT.sd, 
            HT.mn + 4 * HT.sd, 
            length = 100)
  y <- dnorm( x,  
              mean = HT.mn, 
              sd = HT.sd)
  
  NumSampleHists <- 20
  
  for (i in (1:(NumSampleHists + 1))){
    # HISTOGRAMS
    if ( i > 1 ){
      HT.data <- rnorm( num.heights, 
                        mean = HT.mn, 
                        sd = HT.sd)
      
      out <- hist(HT.data, 
                  breaks = seq(130, 190, by = 5),
                  col = plot.colour,
                  border = plot.colour,
                  xlab = "Heights (in cm)",
                  axes = FALSE,
                  plot = TRUE,
                  xlim = range(x),
                  main = "Model for the heights of\nAmerican adult females",
                  sub = paste( "Sample number:", (i - 1)),
                  ylab = "")
      axis(side = 1)
    }
    
    
    # Plot the normal distribution
    if ( i == 1 ) {
      plot( range(x), c(0, 1),
            axes = FALSE,
            type = "n",
            main = "Model for the heights of\nAmerican adult females",
            xlab = "Height (in cm)",
            ylab = "")
      axis(side = 1)
      
      lines( (y/max(y)) ~ x,
             lwd = 2,
             col = "black")
    } else {
      y <- y/max(y) * max (out$counts)
      
      lines( y ~ x,
             lwd = 2,
             col = "black")
    }
    
    # Vertical line at x = 0 (outside the plotted range of heights)
    abline(v = 0,
           col = "grey",
           lwd = 2)
    # Horizontal baseline (the x-axis) at y = 0
    abline(h = 0,
           col = "grey",
           lwd = 2)
  }
}
```


This *bell-shaped distribution* is called a *normal distribution* or a *normal model*.
A normal distribution is a way of *modelling* the population.
A *model* is a theoretical or ideal concept.
A model skeleton isn't $100$% accurate, and is certainly not exactly like *your* skeleton, but it suitably approximates reality.
Probably none of us has a skeleton *exactly* like the model, but the model is still useful and helpful.

Likewise, no variable has *exactly* a normal distribution, but the model is still useful and helpful.
The model is a *theoretical* way of describing the distribution in the population; it does not represent any particular sample of data.

If this model turns out to be poor at describing what appears in samples, the *parameters* of the model (the values of $\mu$ and $\sigma$) can be adjusted so the model *does* describe the sample data well.
In fact, evidence suggests that the average height of Americans has been increasing (for example, [see this webpage](https://ncdrisc.org/data-downloads-height.html)) and so the mean of the model may need to be changed to  remain a good model.


The heights of adult American females are assumed to have a *normal distribution*.
Normal distributions play a large role in the chapters ahead, so we study normal distributions in this chapter.
All *normal distributions* have these properties:

* Normal distributions are symmetric about the mean.\tightlist
* No upper limit or lower limit exists, in theory, for the variable.
  Of course, the chance that some values occur is essentially zero (e.g., a female taller than $350$cm).

These properties are true for *all* normal distributions, whatever the mean $\mu$ and whatever the standard deviation $\sigma$.
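
Both properties can be checked numerically for the heights model.
The sketch below assumes the model with $\mu = 162$ and $\sigma = 7$; the distances in `d` are arbitrary illustrative values.

```r
mu    <- 162
sigma <- 7
d     <- c(1, 5, 10, 20)   # distances from the mean (in cm)

# Symmetry: the curve has the same height d cm above and d cm below the mean
above <- dnorm(mu + d, mean = mu, sd = sigma)
below <- dnorm(mu - d, mean = mu, sd = sigma)
all.equal(above, below)

# No upper limit exists in theory, but some values are essentially impossible:
# the model's probability of a height above 350 cm
p.350 <- pnorm(350, mean = mu, sd = sigma, lower.tail = FALSE)
p.350
```

The probability of a height above $350$cm is not exactly zero under the model, but is so small that it can be treated as zero for practical purposes.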


## The 68--95--99.7 (empirical) rule {#EmpiricalRule}

One of the most important properties of a normal distribution is given by the *68--95--99.7 rule*, also called the *empirical rule*.


:::{.definition #EmpiricalRule name="The 68--95--99.7 rule"}
For *any* bell-shaped distribution, *approximately*

* $68$% of observations lie within one standard deviation of the mean.
* $95$% of observations lie within two standard deviations of the mean.
* $99.7$% of observations lie within three standard deviations of the mean.
:::


:::{.example #EmpiricalAmericanFemales name="Using the 68--95--99.7 rule"}
The model proposed for the height of adult American females is a *normal* distribution, with a mean of $162$cm, and a *standard deviation* of $7$cm. 
Using this model:

* about $68$% of women have heights between $162 - 7 = 155$ and $162 + 7 = 169$cm.
* about $95$% of women have heights between $162 - (2\times 7) = 148$ and $162 + (2\times 7) = 176$cm.
* about $99.7$% of women have heights between $162 - (3\times 7) = 141$ and $162 + (3\times 7) = 183$cm.
:::
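
The percentages in the rule are rounded values of areas under the normal curve.
As a sketch (assuming R's built-in `pnorm()`, which gives the area to the left of a value under a normal curve), the areas and the corresponding height intervals for the model can be computed as follows; the object name `rule` is illustrative.

```r
mu    <- 162
sigma <- 7
k     <- 1:3   # number of standard deviations either side of the mean

# Area within k standard deviations of the mean
area <- pnorm(k) - pnorm(-k)

rule <- data.frame(k       = k,
                   lower   = mu - k * sigma,
                   upper   = mu + k * sigma,
                   percent = round(100 * area, 1))
rule
```

The computed percentages (about $68.3$%, $95.4$% and $99.7$%) show why the rule is described as *approximate*.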


## Standardising ($z$-scores) {#z-scores}

Since the 68--95--99.7 rule (Sect. \@ref(EmpiricalRule)) applies for all normal distributions, the percentages in the rule only depend on how many standard deviations ($\sigma$) a value ($x$) is from the mean ($\mu$).
This information can be used to learn more about how values are distributed.


::: {.example #HeightsExer1 name="The 68--95--99.7 rule"}
Suppose heights of American adult females have a mean of $\mu = 162$cm, and a standard deviation of $\sigma = 7$cm, and (approximately) follow a normal distribution.
Using this model, what proportion of American adult females are *taller* than $169$cm?
:::

From a picture of the situation (Fig. \@ref(fig:HtsExer1), left panel), $162 + 7 = 169$cm is one standard deviation *above* the mean.
Since $68$% of values are within one standard deviation of the mean, $32$% are outside that range, smaller or larger.
Hence, $16$% are taller than one standard deviation above the mean, so the answer is about $16$%.
(Another $16$% are more than one standard deviation *below* the mean, or shorter than $162 - 7 = 155$cm.)

Again, the percentages only depend on how many standard deviations ($\sigma$) the value ($x$) is from the mean ($\mu$), and not the actual values of $\mu$ and $\sigma$.


```{r HtsExer1, fig.cap="Left: What proportion of American adult females are taller than $169$cm? Right: What proportion of American adult females are shorter than $148$cm?", fig.align="center", fig.width=7.5, fig.height=3, out.width='90%'}

par( mfrow = c(1, 2),
     mar = c(5, 0.5, 5, 0.5))

out <- plotNormal(mu = HT.mn, 
                  sd = HT.sd, 
                  ylim = c(0, 0.075),
                  xlab = "Height (in cm)")
shadeNormal(out$x,
            out$y,
            lo = 0,
            hi = HT.mn - HT.sd,
            col = plot.colour)

shadeNormal(out$x,
            out$y,
            hi = 200,
            lo = HT.mn + HT.sd,
            col = plot.colour)

abline( v = c(HT.mn - HT.sd, 
              HT.mn + HT.sd),
        col = "grey")

arrows(x0 = HT.mn - HT.sd + 0.5,
       y0 = 0.06,
       x1 = HT.mn + HT.sd - 0.5,
       y1 = 0.06,
       code = 3, # Arrow both ends
       length = 0.10,
       angle = 15)
text(x = HT.mn, 
     y = 0.06, 
     "68%", 
     cex = 0.9,
     pos = 3)

arrows(x0 = HT.mn + HT.sd + 0.5,
       y0 = 0.06,
       x1 = 179,
       y1 = 0.06,
       code = 1,
       length = 0.10,
       angle = 15)
lines( x = c(179, 183),
       y = c(0.06, 0.06),
       lty = 2)
text(x = 148, 
     y = 0.06, 
     "16%", 
     cex = 0.9,
     pos = 3)

arrows(x0 = HT.mn - HT.sd - 0.5,
       y0 = 0.06,
       x1 = 145,
       y1 = 0.06,
       code = 1,
       length = 0.10,
       angle = 15)
lines( x = c(145, 141),
       y = c(0.06, 0.06),
       lty = 2)
text(x = 176, 
     y = 0.06, 
     "16%", 
     cex = 0.9,
     pos = 3)


out <- plotNormal(mu = HT.mn, 
                  sd = HT.sd, 
                  ylim = c(0, 0.075),
                  xlab = "Height (in cm)")
shadeNormal(out$x,
            out$y,
            lo = 0,
            hi = HT.mn - 2 * HT.sd,
            col = plot.colour)
shadeNormal(out$x,
            out$y,
            lo = HT.mn + 2 * HT.sd,
            hi = 400,
            col = plot.colour)


abline( v = c(HT.mn - 2 * HT.sd, 
              HT.mn + 2 * HT.sd),
        col = "grey")

arrows(x0 = HT.mn - 2 * HT.sd + 0.5,
       y0 = 0.06,
       x1 = HT.mn + 2 * HT.sd - 0.5,
       y1 = 0.06,
       code = 3, # Arrow both ends
       length = 0.10,
       angle = 15)
text(x = HT.mn, 
     y = 0.06, 
     "95%", 
     cex = 0.9,
     pos = 3)

arrows(x0 = HT.mn + 2 * HT.sd + 0.5,
       y0 = 0.06,
       x1 = 179,
       y1 = 0.06,
       code = 1,
       length = 0.10,
       angle = 15)
lines( x = c(179, 183),
       y = c(0.06, 0.06),
       lty = 2)
text(x = 179, 
     y = 0.06, 
     "2.5%", 
     cex = 0.9,
     pos = 3)

arrows(x0 = HT.mn - 2 * HT.sd - 0.5,
       y0 = 0.06,
       x1 = 145,
       y1 = 0.06,
       code = 1,
       length = 0.10,
       angle = 15)
lines( x = c(145, 141),
       y = c(0.06, 0.06),
       lty = 2)
text(x = 144, 
     y = 0.06, 
     "2.5%", 
     cex = 0.9,
     pos = 3)

```


::: {.example #HeightsExer2 name="The 68--95--99.7 rule"}
Consider again the heights of American adult females.
Using this model, what proportion are *shorter* than $148$cm?

Again, drawing the situation is helpful (Fig. \@ref(fig:HtsExer1), right panel).
Since $162 - (2\times 7) = 148$, then $148$cm is two standard deviations *below* the mean.
Since $95$% of values are within two standard deviations of the mean, $5$% are outside that range (half smaller, half larger; see Fig. \@ref(fig:HtsExer1), right panel), so that $2.5$% are *shorter* than $148$cm.
(Another $2.5$% are *taller* than $162 + 14 = 176$cm.)
:::
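
The two tail areas found using the 68--95--99.7 rule can be compared with areas computed by software.
This is a sketch only, assuming the same model ($\mu = 162$, $\sigma = 7$) and using R's `pnorm()`; the object names are illustrative.

```r
mu    <- 162
sigma <- 7

# Area to the right of 169 cm (one standard deviation above the mean)
p.taller.169 <- pnorm(169, mean = mu, sd = sigma, lower.tail = FALSE)

# Area to the left of 148 cm (two standard deviations below the mean)
p.shorter.148 <- pnorm(148, mean = mu, sd = sigma)

# As percentages
round(100 * c(taller.169 = p.taller.169, shorter.148 = p.shorter.148), 1)
```

The computed areas are close to, but not exactly, the $16$% and $2.5$% given by the rule, because the rule uses rounded percentages.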


Again, the percentages only depend on how many standard deviations ($\sigma$) the value ($x$) is from the mean ($\mu$).
The number of standard deviations that an observation is from the mean is called a *$z$-score*.
A $z$-score is computed using  
\[
   z = \frac{ x - \mu}{\sigma},
\]
where $\sigma$ is the standard deviation measuring the variation in the $x$-values.
Converting values to $z$-scores is called *standardising*.


::: {.definition #zScore name="z-score"}
A *$z$-score* measures how many standard deviations a value is from the mean.
In symbols:  
\begin{equation}
   z = \frac{x - \mu}{\sigma},
   (\#eq:zscores)
\end{equation}
where $x$ is the value, $\mu$ is the mean of the distribution, and $\sigma$ is the standard deviation of the distribution (measuring the variation in the $x$-values).
:::
 
The $z$-score is the *number of standard deviations the observation is away from the mean*, and is also called the *standardised value* or *standard score*, and is calculated using Equation \@ref(eq:zscores).
Note that:

* $z$-scores are negative for observations *below* the mean.
* $z$-scores are positive for observations *above* the mean.
* $z$-scores have no units (that is, not measured in kg, or cm, etc.).


::: {.example #HeightsExer3 name="$z$-scores"}
In Example \@ref(exm:HeightsExer1), the $z$-score for a height of $169$cm is  
\[
   z = \frac{x-\mu}{\sigma} = \frac{169 - 162}{7} = 1,
\]
one standard deviation *above* the mean.
In Example \@ref(exm:HeightsExer2), the $z$-score for a height of $148$cm is  
\[
   z = \frac{x-\mu}{\sigma} = \frac{148 - 162}{7} = -2,
\]
two standard deviations *below* the mean (the $z$-score is *negative*).
:::


::: {.example #EmpiricalRuleZ  name="The 68--95--99.7 rule"}
Consider the model for the heights of American adult females: a normal distribution, mean $\mu = 162$, standard deviation $\sigma = 7$ (Fig.\ \@ref(fig:HeightsModel)).
Using this model:

* A height of $162$cm is zero standard deviations from the mean: $z = 0$.
* $155$cm is one standard deviation *below* the mean: $z = -1$.
* $169$cm is one standard deviation *above* the mean: $z = 1$.
* $148$cm and $176$cm are two standard deviations from the mean: $z = -2$ and $z = 2$ respectively.
* $141$cm and $183$cm are three standard deviations from the mean: $z = -3$ and $z = 3$ respectively.
:::
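
Standardising is simple to do in software.
The helper `z_score()` below is a hypothetical function written for illustration (it is not part of any package), applying Equation \@ref(eq:zscores) to the heights used in the example.

```r
# z-score: the number of standard deviations a value x is from the mean
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

heights <- c(141, 148, 155, 162, 169, 176, 183)   # in cm
z <- z_score(heights, mu = 162, sigma = 7)

data.frame(height = heights, z = z)
```

Heights below the mean produce negative $z$-scores, and heights above the mean produce positive $z$-scores; the $z$-scores themselves have no units.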


## Approximating percentages using the 68--95--99.7 rule {#ApproxProbs}


As we have seen above, percentages under normal distributions can be *approximated* using the [68--95--99.7 rule](#def:EmpiricalRule).


::: {.example #Height160 name="Normal distribution areas"}
Suppose again that heights of American adult females have a mean of $\mu = 162$cm, and a standard deviation of $\sigma = 7$cm, and (approximately) follow a normal distribution (Fig.\ \@ref(fig:HtsEmpirical)).

To find the proportion of females *shorter* than $150$cm, first draw the situation (Fig.\ \@ref(fig:HtsExer3)).
:::


```{r HtsEmpirical, fig.cap="The empirical rule and heights of American adult females", fig.align="center", fig.width=6.0, fig.height=2.75, out.width='62.5%'}

out <- plotNormal(HT.mn,
                  HT.sd,
                  xlab = "Heights (in cm)")


mtext(expression( "("*italic(z)==0*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn)

mtext(expression( "("*italic(z)==1*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn + HT.sd)
mtext(expression( "("*italic(z)==2*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn + HT.sd*2)
mtext(expression( "("*italic(z)==3*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn + HT.sd*3)

mtext(expression( "("*italic(z)==-1*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn - HT.sd)
mtext(expression( "("*italic(z)==-2*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn - HT.sd*2)
mtext(expression( "("*italic(z)==-3*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn - HT.sd*3)
```


```{r HtsExer3, fig.cap="What proportion of American adult females are shorter than 150cm?", fig.align="center", fig.width=7.0, fig.height=3.00, out.width='70%'}


out <- plotNormal(HT.mn, 
                  HT.sd, 
                  ylim = c(0, 0.085),
                  xlab = "Heights (in cm)")
shadeNormal(out$x,
            out$y,
            lo = 0,
            hi = 150,
            col = plot.colour)

mtext(expression( "("*italic(z)==0*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn)

mtext(expression( "("*italic(z)==1*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn + HT.sd)
mtext(expression( "("*italic(z)==2*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn + HT.sd*2)
mtext(expression( "("*italic(z)==3*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn + HT.sd*3)

mtext(expression( "("*italic(z)==-1*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn - HT.sd)
mtext(expression( "("*italic(z)==-2*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn - HT.sd*2)
mtext(expression( "("*italic(z)==-3*")"),
      side = 1,
      line = 2,
      cex = 0.8,
      at = HT.mn - HT.sd*3)

# The height of interest, and its z-score
z <- -1.71
x <- 150
lines( x = c(x, x),
       y = c(0, max(out$y) * 0.7),
       col = "grey")
text(x = x, 
     y = max(out$y) * 0.7,
     pos = 2,
     cex = 0.9,
     labels = "150 cm")

# Two standard deviations either side of the mean
lo2 <- HT.mn - 2 * HT.sd
hi2 <- HT.mn + 2 * HT.sd

text(x = HT.mn, 
     y = 0.070,
     pos = 3,
     cex = 0.9,
     labels = "95%")
text(x = lo2 - 2, 
     y = 0.070,
     pos = 3,
     cex = 0.9,
     labels = "2.5%")
text(x = hi2 + 5, 
     y = 0.070,
     pos = 3,
     cex = 0.9,
     labels = "2.5%")

abline(v = c(lo2, hi2),
        col = "grey")

# Lower tail: arrow pointing in to the lower limit
arrows( x0 = lo2 - 6,
        x1 = lo2 - 0.5,
        y0 = 0.070,
        y1 = 0.070,
        length = 0.10,
        angle = 15,
        lwd = 1)
lines( x = c(lo2 - 11, lo2 - 6),
       y = c(0.070, 0.070),
       lwd = 1,
       lty = 2)
# Upper tail: arrow pointing in to the upper limit
arrows( x0 = hi2 + 6,
        x1 = hi2 + 0.5,
        y0 = 0.070,
        y1 = 0.070,
        length = 0.10,
        angle = 15,
        lwd = 1)
lines( x = c(hi2 + 6, hi2 + 11),
       y = c(0.070, 0.070),
       lwd = 1,
       lty = 2)

# Central 95%: arrow at both ends
arrows( x0 = lo2 + 0.5,
        x1 = hi2 - 0.5,
        y0 = 0.070,
        y1 = 0.070,
        code = 3,
        length = 0.10,
        angle = 15,
        lwd = 1)

```


Proceeding as before, we ask 'How many standard deviations below the mean is $150$cm?'
Using Equation \@ref(eq:zscores) to compute the $z$-score, $150$cm corresponds to a $z$-score of  
\begin{equation}
   z = \frac{150 - 162}{7} = -1.71;
   (\#eq:zscore214)
\end{equation}
that is, $1.71$ standard deviations *below* the mean.

What percentage of observations are less than this $z$-score?
This case is not covered by the [68--95--99.7 rule](#def:EmpiricalRule), though we can use the [68--95--99.7 rule](#def:EmpiricalRule) to make some *rough estimates*.

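About $2.5$% of observations are more than $2$ standard deviations below the mean (Example \@ref(exm:HeightsExer2)); that is, about $2.5$% of women are shorter than $148$cm.
Similarly, about $16$% of observations are more than one standard deviation below the mean (Example \@ref(exm:HeightsExer1)); that is, about $16$% of women are shorter than $155$cm.
Since $150$cm lies between $148$cm and $155$cm, the percentage of women shorter than $150$cm will be *more* than $2.5$% but *less* than $16$%.
While we don't know the percentage exactly, it will be between $2.5$% and $16$%.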

Estimates in this way are crude, but often serviceable.
However, better estimates of 'areas under the normal curve' are found using tables compiled for this very purpose.
These tables are in 
`r if ( knitr::is_html_output()) { 
   'Appendix\\ \\@ref(ZTablesOnline).'
} else {
   'Appendices\\ \\@ref(ZTablesNEG) and\\ \\@ref(ZTablesPOS).'
   }`
'Percentages' under a normal curve are also called 'areas' under the normal curve.
The *total area* under a normal curve is one (or $100$%), since it represents all possible values that could be observed: every height appears somewhere in Fig. \@ref(fig:HtsEmpirical).

We now learn how to use these tables for Example \@ref(exm:Height160).


## Exact areas from normal distributions {#ExactAreasUsingTables}

Areas under normal distributions can be found using *online* tables, or *hard copy* tables.
The online tables are easier to use,
`r if (knitr::is_latex_output()) {
   'but only the *hard-copy* tables are explained in this book (see the [online](https://bookdown.org/pkaldunn/SRM-Textbook/tables.html#ZTablesOnline) version of this book for how to use the online tables).'
} else {
   'but only the *online* tables are explained in this online book (see the hard-copy version for how to use the hard-copy tables).'
}`

```{r, child = if (knitr::is_latex_output()) './Tables/Ztables-Using-Hardcopy.Rmd'}
```

```{r, child = if (knitr::is_html_output()) './Tables/Ztables-Using-Online.Rmd'}
```

The hard-copy or online tables give an answer of $4.36$%.
This agrees with the approximate answer using the [68--95--99.7 rule](#def:EmpiricalRule): between $2.5$% and $16$%.


## Computing areas (probabilities)

The general approach to computing probabilities from normal distributions is:

* *Draw a diagram*, and mark on 150cm (Fig. \@ref(fig:HtsExer3)).
* *Shade* the required region of interest: 'less than 150cm tall' (Fig. \@ref(fig:HtsExer3)).
* *Compute* the $z$-score using Equation \@ref(eq:zscores).
* *Use* the $z$ tables in `r if ( knitr::is_html_output()) { 'Appendix \\@ref(ZTablesOnline).'} else {'Appendices \\@ref(ZTablesNEG) and \\@ref(ZTablesPOS).'}`
* *Compute* the answer.

The number of standard deviations that 150cm is from the mean was computed above (Eq. \@ref(eq:zscore214)): $z = -1.71$.
That is, 150cm is $1.71$ standard deviations *below* the mean, so use $z = -1.71$ in the tables (remembering that the tables give the probability (area) *less* than $z = -1.71$; Fig. \@ref(fig:HtsExer3)).

The probability of finding an American adult female less than 150cm tall is about $4.4$%.
The 68--95--99.7 rule can be used to give *approximate* probabilities, as a check that the answer found using tables seems reasonable.

More complicated questions can be asked too, as shown in the next section.

<iframe src="https://learningapps.org/watch?v=ppievv9gc22" style="border:0px;width:100%;height:800px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>


## Examples using $z$-scores {#Z-Score-Forestry}


::: {.example #NormalTrees name="Normal distributions"}
@data:Aedo1997:softwood simulated mechanized forest harvesting systems [@DevoreBerk2007].
In their study, they modelled the diameter of specific trees using

* a normal distribution; with
* a mean of $\mu = 8.8$ inches; and
* a standard deviation of $\sigma = 2.7$ inches.

Using this model, what is the probability that a randomly-chosen tree has a diameter *greater than* 6 inches?
:::


Follow the steps identified earlier:

* *Draw* a normal curve, and mark on 6 inches (Fig. \@ref(fig:ZDBH1), left panel).
* *Shade* the region corresponding to 'greater than 6 inches' (Fig. \@ref(fig:ZDBH1), right panel).
* *Compute* the $z$-score using Eq. \@ref(eq:zscores):
  $\displaystyle z = (6 - 8.8)/2.7 = -2.8/2.7 = -1.04$ to two decimal places.
* *Use* tables:
  The probability of a tree diameter *smaller* than 6 inches is $0.1492$. 
  (The tables always give area *less* than the value of $z$ that is looked up.)
* *Compute* the answer:
  Since the *total* area under the normal distribution is one, the probability of a tree diameter  *greater* than 6 inches is $1 - 0.1492 = 0.8508$, or about $85$%.


```{r ZDBH1, fig.cap="What proportion of tree diameters are greater than 6 inches?", fig.align="center", fig.width=9.5, fig.height=2.75, out.width='100%'}
DBH.mn <- 8.8
DBH.sd <- 2.7

par(mfrow = c(1, 2))

z <- seq( -3.5, 3.5, 
          length = 250)
zy <- dnorm( z, 
             mean = 0, 
	           sd = 1)

mu <- DBH.mn
sigma <- DBH.sd
x <- z * sigma + mu

out <- plotNormal(mu,
                  sigma,
                  xlab = "Tree diameters (in inches)",
                  main = "Draw",
                  round.dec = 1)
abline(v = 6,
       lwd = 2)


out <- plotNormal(mu,
                  sigma,
                  xlab = "Tree diameters (in inches)",
                  main = "Shade",
                  round.dec = 1)

shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 6,
            hi = 20)
abline(v = 6,
       lwd = 2)
```
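
The steps above can also be sketched in software rather than with tables.
This is an illustration only, assuming R's `pnorm()` (which, like the tables, gives the area to the *left* of a value); the object names are illustrative.

```r
mu    <- 8.8   # mean diameter (in inches)
sigma <- 2.7   # standard deviation (in inches)

# Compute the z-score, rounded to two decimal places as when using tables
z <- round((6 - mu) / sigma, 2)

# Area to the LEFT of z: the probability of a diameter less than 6 inches
p.less <- pnorm(z)

# Total area is one, so the area to the right is the complement
p.greater <- 1 - p.less
round(c(z = z, p.less = p.less, p.greater = p.greater), 4)

# Without rounding the z-score first, the answer differs slightly
p.exact <- pnorm(6, mean = mu, sd = sigma, lower.tail = FALSE)
round(p.exact, 4)
```

Rounding the $z$-score to two decimal places (as the tables require) changes the answer only slightly; either way, the probability is about $85$%.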


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The normal-distribution tables **always** provide area to the **left** of the $z$-score looked up.
Drawing a picture of the situation is important: it helps visualise how to get the answer from what the tables give us.
Remember: The *total* area under the normal distribution is one (or 100%).
:::


::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
Match the diagram in Fig. \@ref(fig:MatchDiagrams) with the meaning for the tree-diameter model (recall: $\mu = 8.8$ inches):\label{thinkBox:MatchForward}

1. Tree diameters greater than 11 inches.
2. Tree diameters *between* 6 and 11 inches.
3. Tree diameters less than 11 inches.
4. Tree diameters between 3 and 6 inches.

`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
**1:** matches B; **2:** matches C; **3:** matches D; **4:** matches A.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::


```{r MatchDiagrams, fig.cap="Match the diagram with the description",  fig.align="center", out.width="85%", fig.height=4.00, fig.width=8}
par( mfrow = c(2, 2))

par( mar = c(5, 1, 1.5, 2) + 0.1)

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram A",
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 3,
            hi = 6)


out <- plotNormal(mu,
                  sigma,
                  main = "Diagram B",
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 11,
            hi = 20)


out <- plotNormal(mu,
                  sigma,
                  main = "Diagram C",
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 6,
            hi = 11)


out <- plotNormal(mu,
                  sigma,
                  main = "Diagram D",
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 0,
            hi = 11)
```


::: {.example #NormalTrees2 name="Normal distributions"}
Using the model for tree diameters in Example \@ref(exm:NormalTrees), what is the probability that a tree has a diameter *between* 6 and 11 inches?
:::

First, **draw** the situation, and **shade** 'between 6 and 11 inches' (Fig. \@ref(fig:MatchDiagrams), Diagram C).
Then, **compute** the $z$-scores for *both* tree diameters:

* For 6 inches: $\quad  z = (6 - 8.8)/2.7 = -1.04$.
* For 11 inches: $\quad z = (11 - 8.8)/2.7 = 0.81$.

Table B can then be used to find the area to the *left* of $z = -1.04$, and also the area to the *left* of $z = 0.81$.
However, neither of these provides the area *between* $z = -1.04$ and $z = 0.81$ (Fig.&nbsp;\@ref(fig:ZDBH3)).


```{r ZDBH3, fig.cap="What proportion of tree diameters are between 6 and 11 inches? The two shaded areas are what we find by using the tables with $z = -1.04$ and $z = 0.81$, but neither gives us the area we are seeking.", fig.align="center", fig.width=9.5, fig.height = 2.75,out.width='100%'}

par( mfrow = c(1, 2))

out <- plotNormal(mu,
                  sigma,
                  cex.axis = 0.85,
                  main = expression(What~tables~give~"for"~italic(z)==-1.04),
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 0,
            hi = 6)

out <- plotNormal(mu,
                  sigma,
                  cex.axis = 0.85,
                  main = expression(What~tables~give~"for"~italic(z)==0.81),
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 0,
            hi = 11)


```


Looking carefully at the areas from the tables and the area sought, that area between the two $z$-scores is
`r if (knitr::is_latex_output()) {
   '$0.7910 - 0.1492 = 0.6418$ (the online version has an animation).'
} else {
   '$0.7910 - 0.1492 = 0.6418$; see the animation below.'
}`
The probability that a tree has a diameter between 6 and 11 inches is about $0.6418$, or about $64$%.
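
For readers with access to R, the table-based answer can be checked with `pnorm()`, which returns the area to the *left* of a $z$-score under the standard normal curve.
This is only a sketch for checking the working, not a replacement for the tables; the variable names are chosen here purely for illustration.

```r
# Model for tree diameters (values from Example NormalTrees)
mu    <- 8.8   # mean, in inches
sigma <- 2.7   # standard deviation, in inches

# z-scores, rounded to two decimal places (as when using the tables)
z_lo <- round((6 - mu) / sigma, 2)    # z-score for 6 inches
z_hi <- round((11 - mu) / sigma, 2)   # z-score for 11 inches

# Area between = (area to the left of z_hi) - (area to the left of z_lo)
area_between <- pnorm(z_hi) - pnorm(z_lo)
round(area_between, 2)
```

The result agrees with the tables to about two decimal places; small differences in later decimal places arise because the tables round each area before subtracting.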


```{r animation.hook="gifski", dev=if (knitr::is_latex_output()){"pdf"}else{"png"}}
  RT.mn <- 8.8
  RT.sd <- 2.7
  
  lower <- 6
  upper <- 11
  
if (knitr::is_html_output()){
  for (i in (1:4)){
    if ( i == 1 ){
      out <- plotNormal(RT.mn, 
                sd = RT.sd, 
                xlab = "Tree diameter (inches)",
                round.dec = 1,
                main = "Between 6 and 11 inches")	
      shadeNormal(out$x,
                  out$y,
                  col = "azure2",
                  lo = lower,
                  hi = upper)
      
    }  
    if ( i == 3 ){
      out <- plotNormal(RT.mn, 
                sd = RT.sd, 
                xlab = "Tree diameter (inches)",
                round.dec = 1,
                main = "Table: Less than 6 inches: 0.1492")
      shadeNormal(out$x,
                  out$y,
                  col = "azure2",
                  lo = 0,
                  hi = upper)
      shadeNormal(out$x,
                  out$y,
                  col = "blue",
                  lo = 0,
                  hi = lower)
    }  
    if ( i == 2 ){

      out <- plotNormal(RT.mn, 
                sd = RT.sd, 
                xlab = "Tree diameter (inches)",
                round.dec = 1,
                main = "Table: Less than 11 inches: 0.7910")	
      shadeNormal(out$x,
                  out$y,
                  col = "blue",
                  lo = 0,
                  hi = upper)
    }  
    if ( i == 4 ){
      out <- plotNormal(RT.mn, 
                sd = RT.sd, 
                xlab = "Tree diameter (inches)",
                round.dec = 1,
                main = "Between 6 and 11 inches: 0.6418")	
      shadeNormal(out$x,
                  out$y,
                  col = "azure2",
                  lo = lower,
                  hi = upper)
    }  
  }
}
```


<iframe src="https://learningapps.org/watch?v=p4jq6ujuj22" style="border:0px;width:100%;height:900px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>


## Unstandardising: Working backwards {#Unstandardising}

Using the model for tree diameters in Example \@ref(exm:NormalTrees) again, different types of questions can be asked too.

::: {.example #MNormalBackwards name="Normal distributions backwards"}
Consider again the trees study (Example \@ref(exm:NormalTrees)). 
Identify the diameters of the *smallest* 3% of trees.
:::


This is a different problem than before; previously, the *tree diameter* was known, so a $z$-score could be computed, and hence a probability (Fig.&nbsp;\@ref(fig:WorkingWithZ), top panel).

However, in Example \@ref(exm:MNormalBackwards), the *probability* is known, and a tree diameter is sought.
That is, working 'backwards' is needed (Fig. \@ref(fig:WorkingWithZ), bottom panel), so the $z$-tables need to be used 'backwards' too.


```{r WorkingWithZ, fig.cap="Working with $z$-scores. In the tables, the areas (probabilities) are in the body of the table, and the $z$-scores are in the margins of the table.", fig.align="center", out.width='80%', fig.height=2.25, fig.width=8.25}
par( mar = c(0.5, 0.5, 0.5, 0.5))

openplotmat()

boxY <- 0.075
boxX <- 0.120

pos <- diagram::coordinates(3)
pos[1, 1] <- pos[1, 1] - 0.0
pos[3, 1] <- pos[3, 1] + 0.0

pos[, 2] <- 0.725

text(0.5, 0.90, 
     "The usual way to work with z-scores", 
     font = 2)

straightarrow(from = pos[1,], 
            to = pos[2,])
straightarrow(from = pos[2,], 
            to = pos[3,])


textrect( pos[1, ], 
          lab = expression(Value~of~italic(x)~bold(known)), 
          box.col = plotSolid,
          lcol = plotSolid,
          shadow.size = 0,
          radx = boxX,
          rady = boxY)
textrect( pos[2, ], 
          lab = expression(Value~of~italic(z)~bold(computed)), 
          box.col = plotSolid,
          lcol = plotSolid,
          shadow.size = 0,
          radx = boxX,
          rady = boxY)
textrect( pos[3, ], 
          lab = expression(Area~from~bold(tables)), 
          box.col = plotSolid,
          lcol = plotSolid,
          shadow.size = 0,
          radx = boxX,
          rady = boxY)


###

pos <- diagram::coordinates(3)
pos[1, 1] <- pos[1, 1] - 0.0
pos[3, 1] <- pos[3, 1] + 0.0

pos[, 2] <- 0.1

text(0.5, 0.28, 
     "Working backwards with z-scores", 
     font = 2)

straightarrow(from = pos[2, ], 
            to = pos[1, ])
straightarrow(from = pos[3, ], 
            to = pos[2, ])

textrect( pos[1, ], 
          lab = expression(Value~of~italic(x)~bold(computed)), 
          box.col = plotSolid,
          lcol = plotSolid,
          shadow.size = 0,
          radx = boxX,
          rady = boxY)
textrect( pos[2, ], 
          lab = expression(Value~of~italic(z)~from~bold(tables)), 
          box.col = plotSolid,
          lcol = plotSolid,
          shadow.size = 0,
          radx = boxX,
          rady = boxY)
textrect( pos[3, ], 
          lab = expression(Area~bold(known)),
          box.col = plotSolid,
          lcol = plotSolid,
          shadow.size = 0,
          radx = boxX,
          rady = boxY)

```

Drawing a rough diagram of the situation again is very helpful (Fig. \@ref(fig:DBHBackwards)).
We can only mark the approximate location of the required score, but this is sufficient.
Then, tables must be used to determine the necessary $z$-score.


```{r DBHBackwards, fig.cap="Tree diameters: The smallest 3\\%. The approximate location of the required $z$-score is drawn.", fig.align="center", out.width='60%', fig.width=6.5, fig.height=3.0}

z <- -1.88
zguess <- -1.8
xguess <- zguess * sigma + mu

out <- plotNormal(mu,
                  sigma,
                  main = "Smallest 3% of trees",
                  xlab = "Tree diameters (in inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            hi = xguess,
            lo = 0)

text(xguess, 
     max(out$y) * 0.8,
     expression(italic(z)~near~here), 
     pos = 3, 
     cex = 1)
lines(x = c(xguess, xguess),
      y = c(0, max(out$y) * 0.775),
      col = "grey",
      lwd = 2)


arrows(x0 = 1.0, 
       y0 = max(out$y) * 0.5, 
       x1 = 2.0, 
       y1 = max(out$y) * 0.1, 
       angle = 15, 
       length = 0.15, 
       lwd = 2) 
text(1.2, 
     max(out$y) * 0.425, 
     "Approx.\n3%", 
     pos = 2)
```


As before (Sect. \@ref(ExactAreasUsingTables)), *online* tables, or *hard copy* tables can be used (and again the online tables are easier to use).
`r if (knitr::is_latex_output()) {
   'Only the *hard-copy* tables are explained in this book (see the online version for how to use the online tables).'
} else {
   'Only the *online* tables are explained in this online book (see the hard-copy version for how to use the hard-copy tables).'
}`

```{r, child = if (knitr::is_latex_output()) './Tables/Ztables-Using-Hardcopy-Tables-Backwards.Rmd'} 
```
```{r, child = if (knitr::is_html_output()) './Tables/Ztables-Using-Online-Tables-Backwards.Rmd'}
```


Using hard copy tables, the closest value in the *body* of the tables to $3$% (or $0.030$) is $0.0301$.
(Sometimes, the exact area can be found, but usually we take the value as close as possible.)
This corresponds to a $z$-score of $z = -1.88$.
Using the online tables (and entering an `Area.to.the.left` of $0.0300$), the $z$-score is $-1.881$ (a slightly more precise answer).


::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
To identify the diameters of the *smallest* $3$% of trees, the $z$-score that has an area to the *left* of $3$% (or $0.030$) needs to be found (or, at least, as close as possible to $0.03$).
:::


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The tables always give the area to the **left** of the $z$-score that is looked up.
:::


Using either the hard-copy or online tables, the appropriate $z$-value is $1.88$ standard deviations *below* the mean; that is, $z = -1.88$ (Fig.&nbsp;\@ref(fig:DBHBackwards)).
The $z$-score can be converted to an observation value $x$ using the *unstandardising* formula^[This is found by re-arranging Equation&nbsp;\@ref(eq:zscores).]:  
\[
	x = \mu + z\sigma.
\]
Using this unstandardising formula:  
\begin{align*}
	x &= \mu + (z\times\sigma) \\
		&= 8.8 + (-1.88 \times 2.7) = 3.724;
\end{align*}
that is, about $3$% of trees have diameters less than about $3.72$ inches.
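
For readers with access to R, the 'backwards' step can be checked with `qnorm()`, which takes an area to the *left* and returns the corresponding $z$-score.
The following is a minimal sketch, assuming the same tree-diameter model; the variable names are illustrative only.

```r
# Model for tree diameters (values from Example NormalTrees)
mu    <- 8.8   # mean, in inches
sigma <- 2.7   # standard deviation, in inches

# z-score with an area of 0.03 to its left (the 'backwards' table lookup)
z <- qnorm(0.03)

# Unstandardising formula: x = mu + z * sigma
x <- mu + z * sigma

round(z, 3)   # compare with the z-score from the online tables
round(x, 2)   # diameter (in inches) below which about 3% of trees fall
```

Because `qnorm()` does not round the $z$-score to two decimal places, the diameter may differ from the hard-copy table answer in the third decimal place.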


::: {.definition #UnstandardizingFormula name="Unstandardising formula"}
When the $z$-score is known, the corresponding value of the observation $x$ is
\begin{equation}
	x = \mu + z\sigma.
(\#eq:UnstandardisingFormula)
\end{equation}
This is called the *unstandardising formula*.
:::


::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
Ball bearings labelled as "50mm bearings" actually have diameters that follow a normal distribution with mean 50mm and standard deviation $0.1$mm.
The *smallest* $15$% of bearings are too small for sale.
What size bearings cannot be sold?\label{thinkBox:NormaBearings}

`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
The closest area from the tables is $0.1492$, corresponding to $z = -1.04$.
Using the unstandardising formula, $x = 50 + (-1.04\times 0.1) = 49.896$.

Bearings less than about $49.90$&nbsp;mm in diameter cannot be sold.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::


::: {.example #LargestPC name="Normal distributions backwards"}
Using the model for tree diameters in Example \@ref(exm:NormalTrees) again, suppose now the diameters of the *largest* $25$% of trees need to be identified.
What are these diameters?
:::


The tree diameters can be modelled with a normal distribution, with a mean of $\mu = 8.8$ inches and a standard deviation of $\sigma = 2.7$ inches.
Since an area is given, we need to work 'backwards' (Fig. \@ref(fig:WorkingWithZ), bottom panel), so the $z$-tables need to be used 'backwards' too.
The *largest* $25$% implies large trees, so the diameter sought is larger than the mean.

Using a diagram is important (Fig. \@ref(fig:DBHBackwards2)): the tables work with the area to the *left* of the value of interest, which is $75$%.
Using either the hard-copy or online tables, the appropriate $z$-value is $z = 0.674$.
Then, the $z$-score can be converted to an observation value $x$ using the [*unstandardising* formula](#def:UnstandardizingFormula):
\begin{align*}
	x &= \mu + (z\times\sigma) \\
		&= 8.8 + (0.674 \times 2.7) = 10.620.
\end{align*}
That is, about $25$% of trees have diameters larger than about $10.6$&nbsp;inches.
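
The key step above is converting 'the largest $25$%' into an area to the *left* ($75$%) before looking up the $z$-score.
For readers with access to R, a minimal sketch of the same reasoning follows; `qnorm()` is a standard R function, and the variable names are illustrative only.

```r
# Model for tree diameters (values from Example NormalTrees)
mu    <- 8.8   # mean, in inches
sigma <- 2.7   # standard deviation, in inches

# The largest 25% lies to the right, so the area to the LEFT is 1 - 0.25
area_left <- 1 - 0.25
z <- qnorm(area_left)

# Equivalent: ask directly for the z-score with 25% of the area to its right
z_upper <- qnorm(0.25, lower.tail = FALSE)

# Unstandardising formula
x <- mu + z * sigma
round(x, 1)   # diameter (in inches) exceeded by about 25% of trees
```

Both approaches give the same $z$-score; the first mirrors how the tables are used, since the tables always work with the area to the left.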


```{r DBHBackwards2, fig.cap="Tree diameters: The largest 25\\% is the same as the smallest 75\\%", fig.align="center", fig.width=6.5, out.width='60%', fig.height=2.75}

out <- plotNormal(mu,
                  sigma,
                  cex.axis = 0.85,
                  main = "Largest 25% of trees",
                  xlab = "Tree diameters (inches)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 0,
            hi = 10.62)
# shadeNormal(out$x,
#             out$y,
#             col = plot.colour2,
#             lo = 10.62,
#             hi = 20)

arrows(x0 = 16, 
       y0 = max(out$y) * 0.45, 
       x1 = 12.5, 
       y1 = max(out$y) * 0.1, 
       angle = 15, 
       length = 0.15, 
       lwd = 2) # Note: Locations in terms of z-scores
text(16, max(out$y) * 0.45, 
     "Largest 25%", 
     cex = 0.9,
     pos = 3)

text(10.5,
     max(out$y) * 0.97,
     expression(italic(z)~near~here), 
     pos = 4, 
     cex = 0.9)
abline(v = 10.6,
       col = "grey")


arrows(5, max(out$y) * 0.7,
       7, max(out$y) * 0.1, 
       angle = 15, 
       length = 0.15, 
       lwd = 2) # Note: Locations in terms of z-scores
text(5, max(out$y) * 0.7, 
     "Smallest 75%", 
     cex = 0.9,
     pos = 2)
```


<iframe src="https://learningapps.org/watch?v=poo3x05hn22" style="border:0px;width:100%;height:600px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>


## Example: methane production

A study of methane produced by animals [@huhtanen2016effects] modelled the retention time of food in sheep using a normal distribution, with the mean retention time as $\mu = 42.5$ hours, and the standard deviation of the retention time as $\sigma = 3.68$ hours.
We can draw this normal distribution (Fig. \@ref(fig:RetentionTime)), and then apply the 68--95--99.7 rule:

* About 68% of retention times are between 38.82 and 46.18 hrs;
* About 95% of retention times are between 35.14 and 49.86 hrs;
* About 99.7% of retention times are between 31.46 and 53.54 hrs.


```{r RetentionTime, fig.cap="Retention times of food in sheep", fig.align="center", fig.width=6.5, fig.height=2.75, out.width='60%'}

out <- plotNormal(42.5,
                  3.68,
                  xlab = "Retention times (in hours)",
                  main = "Retention times of food in sheep",
                  round.dec = 2)
```
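
The intervals given by the 68--95--99.7 rule are simply the mean plus or minus one, two and three standard deviations.
For readers with access to R, the arithmetic can be sketched as follows (the data frame and its column names are illustrative only):

```r
# Retention-time model for sheep (values from the text)
mu    <- 42.5   # mean, in hours
sigma <- 3.68   # standard deviation, in hours

k <- 1:3        # number of standard deviations either side of the mean

intervals <- data.frame(
  sds   = k,
  lower = mu - k * sigma,
  upper = mu + k * sigma
)
intervals

# Areas within 1, 2 and 3 standard deviations of the mean
# (roughly 68%, 95% and 99.7%)
pnorm(k) - pnorm(-k)
```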


::: {.example #Methane1 name="Working with the normal distribution"}
Using this model, what proportion of sheep have a retention time *less than* 40 hours?
:::

A retention time of 40 hours corresponds to a $z$-score of:  
\[
   z = \frac{40 - 42.5}{3.68} = -0.68.
\]
This is a *negative* number, since 40 hours is *below* the mean.
Using the normal distribution tables (that give the *area to the left* of the $z$-score), the area to the left of $z = -0.68$ is $0.2483$, or about $24.8$% (Fig. \@ref(fig:RetentionPlots), top left panel).
About $24.8$% of sheep have a retention time *less* than 40 hours.


::: {.example #Methane2 name="Working with the normal distribution"}
What proportion of sheep have a retention time *greater than* 48 hours (two days)?
:::

A retention time of 48 hours corresponds to a $z$-score of $1.49$.
Using the normal distribution tables, the area to the *left* of this $z$-score is $0.9319$, so the area to the *right* of this $z$-score is $1 - 0.9319 = 0.0681$ (Fig. \@ref(fig:RetentionPlots), top right panel).
About $6.8$% of sheep have a retention time *greater* than 48 hours.
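
For readers with access to R, both the 'less than' and 'greater than' areas can be checked with `pnorm()`.
This is a sketch only, with the $z$-scores rounded to two decimal places to mirror the tables; the variable names are illustrative.

```r
# Retention-time model for sheep (values from the text)
mu    <- 42.5   # mean, in hours
sigma <- 3.68   # standard deviation, in hours

# Less than 40 hours: the area to the LEFT of the z-score
z40 <- round((40 - mu) / sigma, 2)
p_less_40 <- pnorm(z40)

# Greater than 48 hours: one minus the area to the left of the z-score
z48 <- round((48 - mu) / sigma, 2)
p_more_48 <- 1 - pnorm(z48)

round(c(z40 = z40, z48 = z48), 2)
round(c(less_than_40 = p_less_40, more_than_48 = p_more_48), 4)
```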


::: {.example #Methane3 name="Working with the normal distribution"}
What proportion of sheep have a retention time *between* 40 and 48 hours?
:::


```{r RetentionPlots, fig.cap="Plots for retention times",  fig.align="center", out.width="80%", fig.height=4.00}
par( mfrow = c(2, 2))

par( mar = c(5, 1, 1.5, 2) + 0.1)

mu <- 42.5
sigma <- 3.68

out <- plotNormal(mu,
                  sigma,
                  main = "Less than 40 hours",
                  xlab = "Retention times (hours)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 25,
            hi = 40)

###

out <- plotNormal(mu,
                  sigma,
                  main = "Greater than 48 hours",
                  xlab = "Retention times (hours)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 48,
            hi = 60)

###


out <- plotNormal(mu,
                  sigma,
                  ylim = c(0, 0.140),
                  main = "Between 40 and 48 hours",
                  xlab = "Retention times (hours)")
shadeNormal(out$x,
            out$y,
            col = NA,
            density = 15,
            angle = 45,
            lo = 25,
            hi = 40)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 25,
            hi = 48)
arrows(x0 = 34,
       x1 = 40,
       y0 = 0.11,
       y1 = 0.11,
       length = 0.10,
       angle = 15,
       lwd = 1)
arrows(x0 = 34,
       x1 = 48,
       y0 = 0.130,
       y1 = 0.130,
       length = 0.10,
       angle = 15,
       lwd = 1)
arrows(x0 = 40,
       x1 = 48,
       y0 = 0.12,
       y1 = 0.12,
       code = 3,
       length = 0.10,
       angle = 15,
       lwd = 1)
lines( x = c(25, 40),
       y = c(0.11, 0.11),
       lty = 2)
lines( x = c(25, 40),
       y = c(0.130, 0.130),
       lty = 2)
abline( v = c(40, 48),
        col = "grey")

 
out <- plotNormal(mu,
                  sigma,
                  main = "Smallest 35%",
                  xlab = "Retention times (hours)")
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 25,
            hi = 41)
```


A retention time of 40 hours corresponds to $z = -0.68$, and, using the normal distribution tables, the area to the *left* of $z = -0.68$ is $0.2483$ (Fig. \@ref(fig:RetentionPlots), bottom left panel; hatched area).
But this is not the area that we are seeking...
From earlier, the area to the *left* of $z = 1.49$ is $0.9319$ (Fig. \@ref(fig:RetentionPlots), bottom left panel; coloured region).
But this is not the area that we are seeking either...

From the two areas that we know, we *can* find the area that we are seeking:

* 48 hours corresponds to $z = 1.49$. The area to the *left* of this $z$-score is $0.9319$.
* 40 hours corresponds to $z = -0.68$. The area to the *left* of this $z$-score is $0.2483$.
* The *difference* between these two *areas* is what we are seeking: $0.9319 - 0.2483 = 0.6836$.

So the proportion is about $0.684$ (or $68.4$%).
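
The subtraction of two 'area to the left' values is the same in every 'between' problem, so it can be wrapped in a small function.
The helper below, `area_between()`, is hypothetical (it is not part of this book's code or of any R package) and is shown only as a sketch for readers with access to R.

```r
# Hypothetical helper: area under a normal curve between two values,
# rounding z-scores to two decimal places to mirror the tables
area_between <- function(lo, hi, mu, sigma) {
  z_lo <- round((lo - mu) / sigma, 2)
  z_hi <- round((hi - mu) / sigma, 2)
  # (area to the left of hi) minus (area to the left of lo)
  pnorm(z_hi) - pnorm(z_lo)
}

# Retention times between 40 and 48 hours
p_between <- area_between(40, 48, mu = 42.5, sigma = 3.68)
round(p_between, 3)
```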


::: {.example #Methane4 name="Working with the normal distribution"}
Consider the 35% of sheep with the *shortest* retention times.
What are these retention times?
:::

The time we seek must be *smaller* than the mean if it defines the *shortest* 35% of retention times.
We don't know *exactly* where to draw the retention time that this corresponds to on the diagram; it's just somewhere to the left of the mean (Fig. \@ref(fig:RetentionPlots), bottom right panel).

This time, *we know the area to the left*, but we do not know the value (or $z$-score).
Previously, we knew the retention value (and hence the $z$-score), but not the area.
This is like a 'backwards problem', and we need to find the $z$-score 'backwards' (Sect. \@ref(Unstandardising)).
From the hard copy tables, a $z$-score of $z = -0.39$ has an area to the left of $0.3483$... which is as close as we can get.
From the online tables, $z = -0.385$.

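We know the $z$-score, so we can find the retention value, using the unstandardising formula: $x = \mu + (z \times \sigma)$.
Using $z = -0.39$ gives $x = 42.5 + (-0.39 \times 3.68) = 41.06$ hours: about $35$% of sheep have retention times shorter than about $41.1$ hours.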
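
For readers with access to R, `qnorm()` can perform the table lookup and the unstandardising in a single step when the mean and standard deviation are supplied.
The sketch below shows both routes; the variable names are illustrative only.

```r
# Retention-time model for sheep (values from the text)
mu    <- 42.5   # mean, in hours
sigma <- 3.68   # standard deviation, in hours

# Route 1: find the z-score, then unstandardise
z <- qnorm(0.35)            # z-score with 35% of the area to its left
x_two_step <- mu + z * sigma

# Route 2: supply the mean and standard deviation directly
x_one_step <- qnorm(0.35, mean = mu, sd = sigma)

round(z, 3)           # compare with the z-score from the online tables
round(x_one_step, 1)  # retention time (hours) below which about 35% fall
```

Both routes give the same retention time, since supplying `mean` and `sd` simply applies the unstandardising formula internally.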


## Summary {#Chap17-Summary}

A **model** is a way of theoretically describing the distribution of some quantitative variable in a population.
One common model is a **normal model** or **normal distribution**, which is a bell-shaped distribution with a theoretical mean $\mu$ and a theoretical standard deviation $\sigma$.
Probabilities can be computed from normal distributions using **$z$-scores**.


## Quick revision questions {#Chap17-QuickReview}

::: {.webex-check .webex-box}
Consider again the model for tree diameters in Example \@ref(exm:NormalTrees) [@data:Aedo1997:softwood]: a normal distribution with $\mu = 8.8$ inches, and $\sigma = 2.7$ inches.


```{r}
mean.tree <- 8.8
sd.tree <- 2.7

if( knitr::is_latex_output() ) {
  set.seed(111111110) # So printed books have *same* question
} 
diam1 <- round(runif(1, 6, 11), 1)
z1 <- round2( (diam1 - mean.tree)/sd.tree, 2)
prob1 <- pnorm(z1)

diam2 <- round(runif(1, 5, 14), 1)
z2 <- round2( (diam2 - mean.tree)/sd.tree, 2)
prob2 <- pnorm(z2)
```

1. A tree diameter of `r diam1` inches corresponds to a $z$-score (to two decimal places) of:\tightlist  
  `r if( knitr::is_html_output() ) {
	fitb(z1,
 		num = TRUE,
		tol = 0.01)}` 
2. The probability that a tree has a diameter *less* than `r diam1` inches is (as a *decimal value*):  
  `r if( knitr::is_html_output() ) {
	fitb( prob1,
		num = TRUE,
		tol = 0.01)}` 
3. The probability that a tree has a diameter *greater* than `r diam1` inches is (as a *decimal value*):  
  `r if( knitr::is_html_output() ) {
	fitb( 1 - prob1,
		num = TRUE,
		tol = 0.01)}` 
4. A tree diameter of `r diam2` inches corresponds to a $z$-score (to two decimal places) of:  
  `r if( knitr::is_html_output() ) {
	fitb(z2,
 		num = TRUE,
		tol = 0.01)}` 
5. The probability that a tree has a diameter *less* than `r diam2` inches is (as a *decimal value*):  
  `r if( knitr::is_html_output() ) {
	fitb( prob2,
		num = TRUE,
		tol = 0.01)}`
6. The probability that a tree has a diameter *greater* than `r diam2` inches is (as a *decimal value*):  
  `r if( knitr::is_html_output() ) {
	fitb( 1 - prob2,
		num = TRUE,
		tol = 0.01)}`
:::


## Exercises {#SamplingDistributionsExercises}

Selected answers are available in Sect. \@ref(SamplingDistributionsAnswer).


::: {.exercise #Statements}
Are the following statements **true** or **false**?

1. The unstandardising formula can be used to compute probabilities.\tightlist
   `r if( knitr::is_html_output() ) {
	 mcq( c("True", answer = "False"))}`
2. About 68% of observations are within two standard deviations of the mean.  
   `r if( knitr::is_html_output() ) {
	 mcq( c("True", answer = "False"))}`
3. Positive $z$-scores correspond to values larger than the mean.  
   `r if( knitr::is_html_output() ) {
	 mcq( c(answer = "True", "False"))}`
4. A $z$-score tells us how many standard deviations a value is away from the mean.  
   `r if( knitr::is_html_output() ) {
	 mcq( c(answer = "True", "False"))}`
5. A $z$-score larger than 4 is impossible.  
   `r if( knitr::is_html_output() ) {
	 mcq( c("True", answer = "False"))}`
6. A $z$-score of zero is located at the mean value.  
   `r if( knitr::is_html_output() ) {
	 mcq( c(answer = "True", "False"))}`
7. About 5% of observations are more than two standard deviations below the mean.  
   `r if( knitr::is_html_output() ) {
	 mcq( c("True", answer = "False"))}`
:::


::: {.exercise #CornSeeds}
In a simulation of methods to coat corn seeds (with fertilizer and crop protection chemicals, etc.), @pasha2016effect modelled the seed diameter as having a normal distribution, with mean 7.5mm and standard deviation of 0.225mm.

1. What is the probability that a seed has a diameter of more than 8mm?\tightlist  
  `r if( knitr::is_html_output() ) {
	longmcq( c(
	   "About 2.22",
	   answer = "About 1.3%",
	   "About 98.7%"))}`
2. What is the probability that a seed has a diameter less than 7.1mm?  
  `r if( knitr::is_html_output() ) {
	longmcq( c(
	   "About 96.3%",
	   answer = "About 3.8%",
	   "About -1.78"))}`
3. What is the probability that a seed has a diameter between 7.5 and 8mm?  
  `r if( knitr::is_html_output() ) {
	longmcq( c(
	   "About 0.89",
	   answer = "About 48.7%",
	   "About 2.22",
	   "About 50%",
	   "About 98.7%"))}`
4. What is the diameter of the smallest 30% of seeds?  
  `r if( knitr::is_html_output() ) {
	longmcq( c(
	   "Smaller than about 7.62mm",
	   "Larger than about 7.38mm",
	   "About -0.524",
	   answer = "Smaller than about 7.38mm"))}`
5. What is the diameter of the largest 90% of the seeds?  
  `r if( knitr::is_html_output() ) {
	longmcq( c(
	   "Less than about 7.79mm",
	   "Larger than about 7.79mm",
	   "Less than about 7.21mm",
	   answer = "**Larger** than about 7.21mm",
	   "About -1.28"))}`
:::


::: {.exercise #SamplingDistributionsTrees}
Consider again the study by @data:Aedo1997:softwood, who studied the diameter of trees in certain forests.
The tree diameters can be modelled as having a normal distribution, with a mean of $\mu = 8.8$ inches, and a standard deviation of $\sigma = 2.7$ inches.
For these trees:

1. What is the probability that a tree will have a diameter *less than* 8 inches?
1. What is the probability that a tree will have a diameter *greater than* 9 inches?
1. What is the probability that a tree will have a diameter *between* 7 and 10 inches?
1. The largest 15% of trees have what diameters?
1. The smallest 25% of trees have what diameters?
:::


::: {.exercise #SamplingDistributionsGestationLength}
In a study [@snowden2018causal] to understand factors influencing preterm births, the gestation length of healthy babies was modelled with a normal distribution, having a mean of 40 weeks, and a standard deviation of 1.64 weeks.
Using this model:

1. What proportion of births are *longer* than 39 weeks (that is, nine months)?
1. In Australia, 
`r if (knitr::is_latex_output()) {
   'a premature birth is defined as a birth occurring before 37 weeks.'
} else {
   '[a premature birth is defined as a birth occurring before 37 weeks](https://www.pregnancybirthbaby.org.au/premature-baby).'
}`
   What proportion of births are expected to be premature?
1. According to
`r if (knitr::is_latex_output()) {
   'Health Direct,'
} else {
   '[Health Direct](https://www.pregnancybirthbaby.org.au/premature-baby),'
}`
   'Babies born between 32 and 37 weeks may need care in a special care nursery'.
   What proportion of healthy births would be expected to be born between 32 and 37 weeks gestation? 
1. How long is the gestation length for the *longest* 5% of pregnancies?
1. How long is the gestation length for the *shortest* 5% of pregnancies?
:::


::: {.exercise #SamplingDistributionsIQs}
IQ scores are
`r if (knitr::is_latex_output()) {
   'designed to have'
} else {
   '[designed to have](https://en.wikipedia.org/wiki/IQ_classification)'
}`
a mean of 100 and a standard deviation of 15.
`r if (knitr::is_latex_output()) {
   'Mensa'
} else {
   '[Mensa](https://www.mensa.org/)'
}`
is a society for people with a high IQ; specifically, for people who have 'attained a score within the upper two percent of the general population' (Mensa webpage (https://www.mensa.org/)).

What IQ score is needed to join Mensa?
:::


::: {.exercise #SamplingDistributionsIQsMilitary}
IQ scores are
`r if (knitr::is_latex_output()) {
   'designed to have'
} else {
   '[designed to have](https://en.wikipedia.org/wiki/IQ_classification)'
}`
a mean of 100 and a standard deviation of 15.
@data:Zagorsky2016:Blondes reports that the US Military must "reject all military recruits whose IQ is in the bottom 10% of the population" [@data:Zagorsky2016:Blondes, p. 403].

What IQ scores lead to a rejection from the US military?
:::


::: {.exercise #SamplingDistributionsIQForwards}
IQ scores are
`r if (knitr::is_latex_output()) {
   'designed to have'
} else {
   '[designed to have](https://en.wikipedia.org/wiki/IQ_classification)'
}`
a mean of 100 and a standard deviation of 15.
Match the diagram in Fig. \@ref(fig:IQMatchDiagramsForwards) with the meaning.

:::::: {.cols data-latex=""}

:::: {.col data-latex="{0.4\textwidth}"}
1. IQs greater than 110.
2. IQs between 90 and 115.

::::

:::: {.col data-latex="{0.05\textwidth}"}
\ 
<!-- an empty Div (with a white space), serving as
a column separator -->
::::

:::: {.col data-latex="{0.5\textwidth}"}

3. IQs less than 110.
4. IQs greater than 85.
::::
::::::
:::


```{r IQMatchDiagramsForwards, fig.cap="Match the diagram with the description", fig.align="center", out.width="80%", fig.height=3.75, fig.width=6.5}
par( mfrow = c(2, 2),
     mar = c(4, 1, 2, 2) + 0.1)

mu <- 100 
sigma <- 15

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram A",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 90,
            hi = 115)

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram B",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 0,
            hi = 110)

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram C",
                  xlab = "IQ scores", 
                  round.dec = 0)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 110,
            hi = 200)

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram D",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = 85,
            hi = 200)
```


::: {.exercise #SamplingDistributionsIQBackwards}
IQ scores are
`r if (knitr::is_latex_output()) {
   'designed to have'
} else {
   '[designed to have](https://en.wikipedia.org/wiki/IQ_classification)'
}`
a mean of 100 and a standard deviation of 15.
Match the diagram in Fig. \@ref(fig:IQMatchDiagramsBackwards) with the meaning.

:::::: {.cols data-latex=""}

:::: {.col data-latex="{0.4\textwidth}"}
1. The *largest* 25% of IQ scores.
2. The *smallest* 10% of IQ scores.

::::

:::: {.col data-latex="{0.05\textwidth}"}
\ 
<!-- an empty Div (with a white space), serving as
a column separator -->
::::

:::: {.col data-latex="{0.5\textwidth}"}

3. The *largest* 70% of IQ scores.
4. The *smallest* 60% of IQ scores.
::::
::::::

:::


```{r IQMatchDiagramsBackwards, fig.cap="Match the diagram with the description", fig.align="center", out.width="80%", fig.height=3.75, fig.width=6.5}
par( mfrow = c(2, 2),
     mar = c(4, 1, 2, 2) + 0.1)

mu <- 100
sigma <- 15

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram A",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = qnorm(0.75, mean = 100, sd = sigma),
            hi = 200)

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram B",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x,
            out$y,
            col = plot.colour,
            lo = qnorm(0.30, mean = 100, sd = sigma),
            hi = 200)

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram C",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x,
            out$y, 
            col = plot.colour,
            hi = qnorm(0.10, mean = 100, sd = sigma),
            lo = 0)  

out <- plotNormal(mu,
                  sigma,
                  main = "Diagram D",
                  xlab = "IQ scores",
                  round.dec = 0)
shadeNormal(out$x, 
            out$y,
            col = plot.colour,
            hi = qnorm(0.60, mean = 100, sd = sigma),
            lo = 0)
```


::: {.exercise #SamplingDistributionsChargingEVs}
A study of the impact of charging electric vehicles (EVs) on electricity demands [@affonso2018probabilistic] modelled the *time* at which people began charging their EVs at home.
Based on a survey [@us20112009], they modelled the time at which EVs began charging using a normal distribution with a mean of 5:30pm and a standard deviation of 2.28 hrs.
For this model:

1. What is the probability that an EV will begin charging after 9pm?
1. What is the probability that an EV will begin charging before 5pm?
1. What is the probability that an EV will begin charging between 5pm and 6pm?
1. 30% of the EVs begin charging after what time?
1. The earliest 15% of charging begins when?
  
**Hint:** This question is *much* easier if you convert times into 'minutes after midnight'.
:::
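
For readers with access to R, the conversion suggested in the hint above can be sketched as follows.
The helper functions `to_minutes()` and `from_minutes()` are hypothetical (written here for illustration only), and times are assumed to be given on a 24-hour clock.

```r
# Hypothetical helper: clock time (24-hour) to minutes after midnight
to_minutes <- function(hours, minutes = 0) {
  hours * 60 + minutes
}

# Hypothetical helper: minutes after midnight back to an "HH:MM" string
from_minutes <- function(m) {
  m <- round(m)   # work to the nearest minute
  sprintf("%02d:%02d", as.integer(m %/% 60), as.integer(m %% 60))
}

mean_min <- to_minutes(17, 30)   # 5:30pm, in minutes after midnight
sd_min   <- 2.28 * 60            # 2.28 hours, in minutes

mean_min
from_minutes(mean_min)
```

With the mean and standard deviation both in minutes, $z$-scores for any clock time can be computed in the usual way, and any time found by unstandardising can be converted back to a clock time.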


<!-- QUICK REVIEW ANSWERS -->
`r if (knitr::is_html_output()) '<!--'`
::: {.EOCanswerBox .EOCanswer data-latex="{iconmonstr-check-mark-14-240.png}"}
**Answers to in-chapter questions:**

- Sect. \ref{thinkBox:MatchForward}: **1:** matches B; **2:** matches C; **3:** matches D; **4:** matches A.
- Sect. \ref{thinkBox:NormaBearings}: The closest area from the tables is $0.1492$, corresponding to $z = -1.04$.
Using the unstandardising formula, $x = 50 + (-1.04\times 0.1) = 49.896$.
Bearings less than about $49.90$mm in diameter cannot be sold.
- \textbf{\textit{Quick Revision} questions:}
**1a.** `r z1`. 
**1b.** `r prob1`.
**1c.** `r 1 - prob1`.
**1d.** `r z2`.
**1e.** `r prob2`.
**1f.** `r 1 - prob2`.
**2a.** About 1.3%.
**2b.** About 3.8%.
**2c.** About 48.7%.
**2d.** *Smaller* than about 7.38mm.
**2e.** *Larger* than about 7.21mm.
**3.** False; False; True; True; False; True; False.
:::
`r if (knitr::is_html_output()) '-->'`