<!-- 28-Testing-OneProportion.Rmd; forked from PeterKDunn/SRM-Textbook -->
# Tests for one proportion {#TestOneProportion}
<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/28-Testing-OneProportion-HTML.Rmd'} else {'./introductions/28-Testing-OneProportion-LaTeX.Rmd'}}
```
## Introduction: rolling dice {#ProportionTestIntro}
<div style="float:right; width: 222px; border: 1px; padding:10px"><img src="Illustrations/LoadedDice.png" width="200px"/></div>
`r if (knitr::is_html_output()) '<!--'`
\begin{wrapfigure}{R}{.25\textwidth}
\begin{center}
\includegraphics[width=.20\textwidth]{Illustrations/LoadedDice.png}
\end{center}
\end{wrapfigure}
`r if (knitr::is_html_output()) '-->'`
In a toy store one day (for my children, of course...), I saw 'loaded dice' for sale.
The packaging claimed 'One loaded \& one normal'.
I bought two packs!
But how could I determine *which* die was the 'loaded' die, and which was the 'normal' die?
I guess I had to roll the dice...
Using classical probability (Sect. \@ref(ProbClassical)), the *population* proportion of times a `r include_graphics("Dice/die1.png", dpi=1500)` is rolled on a fair die is $1/6$.
If I rolled the *fair* die, I'd expect that each face would appear *approximately* (but not exactly) one-sixth of the time.
So, I could roll one of the dice, and see how often a `r include_graphics("Dice/die1.png", dpi=1500)` (for example) actually appeared.
Using the [decision-making process](#DecisionMaking) discussed earlier, I could decide if that die was the fair die.
## Statistical hypotheses and notation
If the die was fair, I would expect about one-sixth of rolls to produce a `r include_graphics("Dice/die1.png", dpi=1500)`, but not necessarily *exactly* one-sixth of the rolls, due to *sampling variation*.
However, by initially assuming the population proportion of ones would be $1/6$, the possible values of the *sample* proportion from all possible rolls of the fair die could be determined.
This is the beginning of the [decision-making process](#DecisionMaking).
More formally, the initial assumption about the population is that the die is fair (I have no evidence against this), and hence that the *population* proportion of rolling a `r include_graphics("Dice/die1.png", dpi=1500)` is $p = 1/6$, or approximately $p = 0.16667$.
Then, the values of the sample proportion that are reasonable to expect from all possible samples are described, and compared to the observed value of $\hat{p}$ from just one of those possible samples.
If the sample proportion of rolls that are `r include_graphics("Dice/die1.png", dpi=1500)` is not *exactly* $1/6$, two possibilities exist:
* The *population* proportion *is* $1/6$, and the *sample* proportion is not exactly $1/6$ due to sampling variation; or
* The *population* proportion *is not* $1/6$; that is, the *sample* proportion is not exactly $1/6$ because the die is not fair.
These two possible explanations are called *statistical hypotheses*.
Formally, the two statistical hypotheses above are written:
* $H_0$: $p = 1/6$, the *null hypothesis*; and
* $H_1$: $p \ne 1/6$, the *alternative hypothesis*.
The null hypothesis is always the 'sampling variation' explanation.
The alternative hypothesis can take different forms, depending on the research question.
Here, the alternative hypothesis is open to the value of $p$ being smaller *or* larger than $1/6$; that is, two possibilities are considered (since we are interested in finding whether the die is loaded in any way).
For this reason, this alternative hypothesis is called a *two-tailed* alternative hypothesis.
An alternative hypothesis like $p > 1/6$ or $p < 1/6$ is a *one-tailed* hypothesis.
## Describing the sampling distribution {#OnePropTestSamplingDist}
When the proportion of rolls that show a `r include_graphics("Dice/die1.png", dpi=1500)` really is $p = 1/6$, what values of the *sample* proportion are reasonable to expect from all possible samples, given sampling variation?
The answer depends on the sample size.
In *one* roll of a die, rolling a `r include_graphics("Dice/die1.png", dpi=1500)`, and hence finding a sample proportion of $\hat{p} = 1$, is not unreasonable.
However, in 20,000 rolls, a sample proportion of $\hat{p} = 1$ would be *incredibly* unlikely for a fair die.
Earlier (Sect. \@ref(SamplingDistributionKnownp)), the sampling distribution of a sample proportion (Def. \@ref(def:SamplingDistProp)) was given.
For an assumed value of $p$, the sample proportion $\hat{p}$ across all possible samples is expected to vary, described by
* an approximate normal distribution;
* centred around a sampling mean whose value is the population proportion $p$;
* with a standard deviation (called the *standard error* of $\hat{p}$) of
\begin{equation}
\text{s.e.}(\hat{p})
= \sqrt{\frac{p \times (1 - p)}{n}},
(\#eq:StdErrorPknownTest)
\end{equation}
when certain conditions are met (Sect. \@ref(ValidityProportionsTest)), where $n$ is the size of the sample.
This is the *sampling distribution of the sample proportion*.
The *mean* of this distribution is the *mean* of all possible values of $\hat{p}$; the value of that mean just happens to be the value of $p$.
Similarly, the standard deviation of this distribution is denoted $\text{s.e.}(\hat{p})$, to remind us that it is the standard deviation of all possible values of the statistic $\hat{p}$; that is, a standard error.
So we write that the sample proportions have a normal distribution, with mean $\mu_{\hat{p}} = p$ and standard deviation $\text{s.e.}(\hat{p})$ as given in Eq. \@ref(eq:StdErrorPknownTest).
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The notation $\text{s.e.}(\hat{p})$ is the *standard error of the sample proportion*, and denotes 'the standard deviation of the proportions computed from all the possible samples'.
:::
I decided to use 100 rolls.
So, if $p$ really was $1/6$, and if certain conditions are met (Sect. \@ref(ValidityProportionsTest)), the possible values of the sample proportion that could be expected across all possible samples of size $100$ would be described using:
* An approximate normal distribution;
* With mean $\mu_{\hat{p}} = 1/6$;
* With a standard deviation of
$\displaystyle
\text{s.e.}(\hat{p})
= \sqrt{\frac{\frac{1}{6} \times(1 - \frac{1}{6})}{100}} = 0.037267$.
This is the standard deviation of all possible sample proportions when $n = 100$.
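As a check, this standard error can be computed directly; a quick sketch in R, using only base R:

```{r}
p <- 1/6                          # population proportion, assuming a fair die
n <- 100                          # number of rolls
se.p <- sqrt( p * (1 - p) / n )   # standard error of the sample proportion
se.p                              # approximately 0.037267
```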
```{r NotationOnePropHT}
OneProportionNotation <- array( dim = c(4, 2))
OneProportionNotation[1, ] <- c("Individual values in the population",
"Proportion of successes $p$")
OneProportionNotation[2, ] <- c("Individual values in a sample",
"Proportion of successes $\\hat{p}$")
OneProportionNotation[3, ] <- c("Sample proportions ($\\hat{p}$) across",
"Vary with approx. normal distribution (under certain conditions)")
OneProportionNotation[4, ] <- c("all possible samples",
"with mean $\\mu_{\\hat{p}}$ and standard deviation $\\text{s.e.}(\\hat{p})$")
if( knitr::is_latex_output() ) {
kable( OneProportionNotation,
format = "latex",
booktabs = TRUE,
longtable = FALSE,
escape = FALSE,
caption = "The notation used for describing proportions, and the sampling distribution of the sample proportions",
align = c("r", "l"),
linesep = c("\\addlinespace",
"\\addlinespace",
""),
col.names = c("Quantity",
"Description") ) %>%
row_spec(0, bold = TRUE) %>%
kable_styling(font_size = 10)
} else {
OneProportionNotation[3, 1] <- paste(OneProportionNotation[3, 1],
OneProportionNotation[4, 1])
OneProportionNotation[3, 2] <- paste(OneProportionNotation[3, 2],
OneProportionNotation[4, 2])
OneProportionNotation[4, ] <- NA
kable( OneProportionNotation,
format = "html",
booktabs = TRUE,
longtable = FALSE,
escape = FALSE,
caption = "The notation used for describing proportions, and the sampling distribution of the sample proportions",
align = c("r", "l"),
linesep = c("\\addlinespace",
"\\addlinespace",
""),
col.names = c("Quantity",
"Description") ) %>%
row_spec(0, bold = TRUE)
}
```
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
When computing the standard error for a proportion, take care!
* The formula for a confidence interval uses the **sample proportion** $\hat{p}$ (see Eq. \@ref(eq:StdErrorCI)), since we only have sample information to work with when forming a confidence interval.
* The formula for a hypothesis test uses the **population proportion** $p$ from the null hypothesis (see Eq. \@ref(eq:StdErrorPknownTest)), since hypothesis testing *assumes that the null hypothesis is true*, and hence the value of $p$ is known.
* In both cases, make sure you are using a *proportion* in the formula, not a *percentage* (i.e., using 0.16666 rather than 16.666%).
Also: Don't forget to take the square root!
:::
A picture of this sampling distribution (Fig. \@ref(fig:RollsSixesSD)) shows how the *sample* proportion varies when $n = 100$ across all possible samples, simply due to sampling variation, when $p = 0.1666...$.
A value of $\hat{p}$ larger than 0.25 looks unlikely when $n = 100$; a value less than 0.10 also looks quite unlikely, but not impossible.
A value above 0.3, or lower than 0.05, looks almost impossible.
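These informal judgements can be checked numerically; a sketch in R, using the normal approximation described above:

```{r}
p  <- 1/6                            # population proportion, assuming a fair die
se <- sqrt( p * (1 - p) / 100 )      # standard error when n = 100
1 - pnorm(0.25, mean = p, sd = se)   # P(p-hat greater than 0.25): about 1.3%
pnorm(0.10, mean = p, sd = se)       # P(p-hat less than 0.10): about 3.7%
1 - pnorm(0.30, mean = p, sd = se)   # P(p-hat greater than 0.30): well under 0.1%
```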
```{r, RollsSixesSD, fig.height=3.00, fig.width=9, out.width='90%', fig.align="center", fig.cap="The sampling distribution, showing the distribution of the sample proportion of 1s when the population proportion is 0.1666..., in 100 die rolls"}
p <- 1/6
n <- 100
mu <- p
sep <- sqrt( p * (1 - p) / n )
x <- 0.41
z <- (x - mu)/sep
out <- plotNormal(mu = mu,
sd = sep,
round.dec = 3,
xlim.hi = 0.45,
showX = seq(-3, 7, by = 1) * sep + p, # The tick marks
xlab = "Values of the sample proportion",
main = "Sampling distribution of the sample proportion of ones\nin 100 rolls ")
arrows( y0 = 0.75 * max(out$y),
x0 = x,
y1 = 0,
x1 = x,
angle = 15,
length = 0.1)
#text(y = 0.75 * max(out$y),
# x = x,
# cex = 0.8,
# pos = 3,
# labels = expression(italic(z)==6.53) )
text(y = 0.75 * max(out$y),
x = x,
cex = 0.8,
pos = 3,
labels = expression(41~ones~"in"~100~rolls) )
text(y = 0.6 * max(out$y),
x = mu,
cex = 0.8,
pos = 3,
labels = expression( mu[hat(italic(p))]==italic(p) ) )
mtext( expression( group( "(", italic(z)==-2, ")" ) ) ,
side = 1,
cex = 0.8,
at = mu - 2*sep,
line = 2)
mtext( expression( group( "(", italic(z)==0, ")" ) ) ,
side = 1,
cex = 0.8,
at = mu + 0*sep,
line = 2)
mtext( expression( group( "(", italic(z)==2, ")" ) ) ,
side = 1,
cex = 0.8,
at = mu + 2*sep,
line = 2)
mtext( expression( group( "(", italic(z)==4, ")" ) ) ,
side = 1,
cex = 0.8,
at = mu + 4*sep,
line = 2)
mtext( expression( group( "(", italic(z)==6, ")" ) ) ,
side = 1,
cex = 0.8,
at = mu + 6*sep,
line = 2)
```
In my 100 rolls of one die, I observed 41 that showed a `r include_graphics("Dice/die1.png", dpi=1500)`, a sample proportion of $\hat{p} = 41/100 = 0.41$.
From Fig. \@ref(fig:RollsSixesSD)---which displays the values of $\hat{p}$ from all possible samples---this is practically impossible *if the die was fair*.
What I observed was almost impossible... but I really did observe it.
A reasonable conclusion is that the assumption I was making---that the die is fair---is not tenable.
## Computing the test statistic and $z$-scores {#OnePropTestStatistic}
One way to measure how far the sample proportion $\hat{p} = 0.41$ is from the population proportion $p = 1/6$ in 100 rolls is to use a $z$-score, since the sampling distribution (Fig. \@ref(fig:RollsSixesSD)) has an approximate normal distribution, with mean $p$ and standard deviation of $\text{s.e.}(\hat{p})$.
The $z$-score is
\begin{align*}
z
&= \frac{\text{sample statistic} - \text{mean of the distribution}}{\text{standard deviation of the distribution}}\\
&= \frac{\hat{p} - p }{\text{s.e.}(\hat{p})} \\
&= \frac{0.41 - 0.1666...}{0.037267} = 6.53.
\end{align*}
(Remember that the standard deviation of the distribution in Fig. \@ref(fig:RollsSixesSD) is the standard error: the amount of variation in the sample proportions.)
The observed sample proportion is more than six standard deviations from the mean, which is *highly unusual* according to the [68--95--99.7 rule](#def:EmpiricalRule).
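The same calculation in R (a sketch, mirroring the working above):

```{r}
p.hat <- 41/100                    # observed sample proportion
p     <- 1/6                       # population proportion under the null hypothesis
se    <- sqrt( p * (1 - p) / 100 ) # standard error of the sample proportion
z     <- (p.hat - p) / se          # the test statistic
z                                  # approximately 6.53
```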
## Determining $P$-values {#OnePropTestP}
The value of the $z$-score shows that the value of $\hat{p}$ is highly unusual... but *how* unusual?
This can be quantified more precisely using a $P$-value, which is used widely in scientific research. The $P$-value is a way of measuring how unusual an observation is, when $H_0$ is assumed to be true.
### Approximating $P$-values: the 68--95--99.7 rule {#OnePropTestP6895997}
$P$-values can be approximated using the 68--95--99.7 rule with a diagram (Sect. \@ref(ApproxProbs)), or more precisely using the $z$-tables
`r if (knitr::is_latex_output()) {
'(Appendices \\@ref(ZTablesNEG) and \\@ref(ZTablesPOS);'
} else {
'(App. \\@ref(ZTablesOnline);'
}`
see Sect. \@ref(Z-Score-Forestry)).
For many hypothesis tests, $P$-values are found using software.
$P$-values refer to the area **more extreme** than the calculated $z$-score in the normal distribution; that is, in the *tails* of the distribution.
For *two-tailed* $P$-values, the $P$-value is the combined area in the two tails; for *one-tailed* $P$-values, the $P$-value is the area in one tail only.
For example:
* *If* the calculated $z$-score was $z = 1$, the two-tailed $P$-value would be the shaded area in Fig. \@ref(fig:OnePropTestP) (left panel):
About 32%, based on the 68--95--99.7 rule.
The $P$-value would be the same if $z = -1$.
The *one-tailed* $P$-value would be the area in one tail:
About 16%, based on the 68--95--99.7 rule.
* *If* the calculated $z$-score was $z = 2$, the two-tailed $P$-value would be the shaded area shown in Fig. \@ref(fig:OnePropTestP) (right panel):
About 5%, based on the 68--95--99.7 rule.
The $P$-value would be the same if $z = -2$.
The *one-tailed* $P$-value would be the area in one tail:
About 2.5%, based on the 68--95--99.7 rule.
If the $z$-score is a little *larger* than $z = 1$, say $z = 1.2$, then the tail area will be a little *smaller* than the tail area when $z = 1$ (Fig \@ref(fig:OnePropTestP2), left panel).
The two-tailed $P$-value is a little *smaller* than $0.32$.
Similarly, when the $z$-score is a bit *less* than $z = 2$, say $z = 1.9$, the tail area will be a little *larger* than the tail area when $z = 2$ (Fig. \@ref(fig:OnePropTestP2), right panel).
The two-tailed $P$-value is a little *larger* than $0.05$.
```{r, OnePropTestP, fig.cap="The two-tailed P-value is the combined area in the two tails of the distribution; left panel: if $z = 1$ (or $z = -1$); right panel: if $z = 2$ (or $z = -2$)", fig.width=10, fig.height=3, out.width='90%', fig.align="center"}
par(mfrow = c(1, 2),
mar = c(4, 1, 4, 1) + 0.1)
out <- plotNormal(mu = 0,
sd = 1,
main = expression(The~italic(P)*"-value"~"if"~italic(z)==1),
xlab = expression(italic(z)*"-score")
)
shadeNormal(out$x, out$y,
lo = -5,
hi = -1,
col = plot.colour)
shadeNormal(out$x, out$y,
lo = 1,
hi = 5,
col = plot.colour)
polygon(x = c(-0.9, -0.9, 0.9, 0.9), # White-ish background for above text
y = c(0.05, 0.14, 0.14, 0.05),
border = NA,
col = "white")
arrows(x0 = -1,
x1 = 1,
y0 = 0.04,
y1 = 0.04,
angle = 15,
length = 0.15,
code = 3) # BOTH ENDS
text(0,
y = 0.07,
label = "Area: 68%")
out <- plotNormal(mu = 0,
sd = 1,
main = expression(The~italic(P)*"-value"~"if"~italic(z)==2),
xlab = expression(italic(z)*"-score")
)
shadeNormal(out$x, out$y,
lo = -5,
hi = -2,
col = plot.colour)
shadeNormal(out$x, out$y,
lo = 2,
hi = 5,
col = plot.colour)
polygon(x = c(-1.4, -1.4, 1.4, 1.4), # White-ish background for above text
y = c(0.05, 0.14, 0.14, 0.05),
border = NA,
col = "white")
arrows(x0 = -2,
x1 = 2,
y0 = 0.04,
y1 = 0.04,
angle = 15,
length = 0.15,
code = 3) # BOTH ENDS
text(0,
y = 0.07,
label = "Area: 95%")
```
```{r OnePropTestP2, fig.cap="The two-tailed P-value is the combined area in the two tails of the distribution; left panel: when $z = 1.2$ (or $z = -1.2$); right panel: when $z = 1.9$ (or $z = -1.9$)", fig.align="center", fig.width=10, fig.height=3, out.width='90%'}
par( mfrow = c(1, 2))
out <- plotNormal(mu = 0,
sd = 1,
main = expression(The~italic(P)*"-value"~when~italic(z)==1.2),
xlab = expression(italic(z)*"-score")
)
shadeNormal(out$x, out$y,
lo = -5,
hi = -1.2,
col = plot.colour)
shadeNormal(out$x, out$y,
lo = 1.2,
hi = 5,
col = plot.colour)
lines( x = c(-1, -1),
y = c(0, 1.37 * dnorm(-1)),
lwd = 2)
lines( x = c(1, 1),
y = c(0, 1.37 * dnorm(1)),
lwd = 2)
text(x = -1,
y = 1.37 * dnorm(-1),
pos = 3,
label = expression(italic(z) == -1~" "))
text(x = 1,
y = 1.37 * dnorm(1),
pos = 3,
label = expression(" "~italic(z) == 1))
out <- plotNormal(mu = 0,
sd = 1,
main = expression(The~italic(P)*"-value"~when~italic(z)==1.9),
xlab = expression(italic(z)*"-score")
)
shadeNormal(out$x, out$y,
lo = -5,
hi = -1.9,
col = plot.colour)
shadeNormal(out$x, out$y,
lo = 1.9,
hi = 5,
col = plot.colour)
lines( x = c(-2, -2),
y = c(0, 2.5 * dnorm(-2)),
lwd = 2)
lines( x = c(2, 2),
y = c(0, 2.5 * dnorm(2)),
lwd = 2)
text(x = -2,
y = 2.5 * dnorm(-2),
pos = 3,
label = expression(italic(z) == -2))
text(x = 2,
y = 2.5 * dnorm(2),
pos = 3,
label = expression(italic(z) == 2))
```
### Exact $P$-values: using tables {#OnePropTestPTables}
Using the tables of areas under normal distributions (`r if ( knitr::is_html_output()) { 'Appendix \\@ref(ZTablesOnline)'} else {'Appendices \\@ref(ZTablesNEG) and \\@ref(ZTablesPOS)'}`), we can be more precise when computing the $P$-values, using the ideas from Sect. \@ref(ExactAreasUsingTables).
For instance (see Fig. \@ref(fig:OnePropTestP2)):
* For $z = 1.2$: the area to the *left* of $z = -1.2$ is $0.1151$, and the area to the *right* of $z = 1.2$ is $0.1151$, so the *two-tailed* $P$-value is $0.1151 + 0.1151 = 0.2302$.
This is a little smaller than $0.32$, as estimated above.
* For $z = 1.9$: the area to the *left* of $z = -1.9$ is $0.0287$, and the area to the *right* of $z = 1.9$ is $0.0287$, so the *two-tailed* $P$-value is $0.0287 + 0.0287 = 0.0574$.
This is a little larger than $0.05$, as estimated above.
In this die-rolling example, where the $z$-score is 6.53, the tail area is *very* small (using `r if ( knitr::is_html_output()) { 'Appendix \\@ref(ZTablesOnline)'} else {'Appendices \\@ref(ZTablesNEG) and \\@ref(ZTablesPOS)'}`),
and zero to four decimal places (Fig. \@ref(fig:RollsSixesSD)).
Clearly, from what the $P$-value means, a $P$-value is always between 0 and 1.
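In R, these tail areas come from `pnorm()`, which returns the area to the *left* of a given $z$-score; doubling the one-tail area gives the two-tailed $P$-value:

```{r}
2 * pnorm(-1.2)    # two-tailed P-value for z = 1.2: about 0.230
2 * pnorm(-1.9)    # two-tailed P-value for z = 1.9: about 0.057
2 * pnorm(-6.53)   # two-tailed P-value for z = 6.53: zero to many decimal places
```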
## Making decisions with $P$-values {#OnePropTestDecisions}
$P$-values tell us the probability of observing the sample statistic (or something even more extreme), assuming the null hypothesis is true.
In this context, the $P$-value tells us the probability of observing the value of $\hat{p}$ (or something more extreme), just through sampling variation (chance) if $p = 0.1666\dots$
So, the $P$-value is a probability, albeit a probability of something quite specific, and hence is always a value between 0 and 1.
Then `r if( knitr::is_html_output() ) {
"(see Fig. \\@ref(fig:PvaluesAnimation)):"
}`
`r if( knitr::is_latex_output() ) {
"(see Fig. \\@ref(fig:PvaluesBigSmall)):"
}`
* 'Big' $P$-values mean that the sample statistic (i.e., $\hat{p}$) could reasonably have occurred through sampling variation in one of the many possible samples, if the assumption made about the parameter (stated in $H_0$) was true:
The data *do not* contradict the assumption in $H_0$.
* 'Small' $P$-values mean that the sample statistic (i.e., $\hat{p}$) is unlikely to have occurred through sampling variation in one of the many possible samples, if the assumption made about the parameter (stated in $H_0$) was true:
The data *do* contradict the assumption.
What is meant by 'small' and 'big'?
This is *arbitrary*: no definitive rules exist.
A $P$-value smaller than 1% (that is, smaller than 0.01) is usually considered 'small', and a $P$-value larger than 10% (that is, larger than 0.10) is usually considered 'big'.
Between the values of 1% and 10% is often a 'grey area', though a $P$-value less than 0.05 is often considered 'small'.
In this die-rolling example, where the $P$-value is *very* small, the data contradict the null hypothesis (that $p = 1/6$), suggesting that the die may not be fair.
```{r PvaluesAnimation, animation.hook="gifski", interval=0.20, fig.cap="The strength of evidence: P-values. As the $z$-score becomes larger, the $P$-value becomes smaller, and the evidence is greater to support the alternative hypothesis.", fig.height = 2.75, fig.align="center", dev=if (is_latex_output()){"pdf"}else{"png"}}
if (knitr::is_html_output()) {
par( mar = c(0.1, 0.1, 0.1, 0.1) ) # Number of margin lines on each side
zList <- c( seq(0.5,
1,
by = 0.1),
seq(1, 3.5,
by = 0.05) )
pMeaning <- function(pValue){
if (pValue > 0.10) Meaning <- "Insufficient"
if ( (pValue >= 0.05) & (pValue < 0.10)) Meaning <- "Slight"
if ( (pValue >= 0.01) & (pValue < 0.05)) Meaning <- "Moderate"
if ( (pValue >= 0.001) & (pValue < 0.01)) Meaning <- "Strong"
if (pValue < 0.001) Meaning <- "Very strong"
Meaning
}
pColours <- viridis( length(zList),
begin = 0.5 ,
end = 1,
option = "H")
for (i in (1:length(zList))){
zScore <- zList[i]
pValue <- pnorm( -zScore )
pValue2 <- ifelse( pValue < 0.001,
"< 0.001",
round(pValue, 4) )
out <- plotNormal(mu = 0,
sd = 1,
xlab = expression(italic(z)~"-score"),
main = paste("Evidence to support alternative hypothesis:\n",
pMeaning(pValue)),
round.dec = 0)
shadeNormal(out$x,
out$y,
col = pColours[i],
lo = zScore,
hi = 6)
shadeNormal(out$x,
out$y,
col = pColours[i],
lo = -zScore,
hi = -6)
abline(v = zScore,
col = "grey")
abline(v = -zScore,
col = "grey")
polygon(x = c(-1.4, -1.4, 1.4, 1.4), # White-ish background for above text
y = c(0.02, 0.10, 0.10, 0.02),
border = NA,
col = "white")
text(0,
y = 0.06,
label = paste("Two-tailed P-value:", pValue2 ) )
}
}
```
```{r PvaluesBigSmall, fig.cap="The strength of evidence: P-values. As the $z$-score becomes larger, the $P$-value becomes smaller, and the evidence is greater to support the alternative hypothesis.", fig.height = 2.75, fig.width=10, out.width='100%', fig.align="center", dev=if (is_latex_output()){"pdf"}else{"png"}}
if (knitr::is_latex_output()) {
par(mfrow = c(1, 2) )
# par( mar = c(0.1, 0.1, 0.1, 0.1) ) # Number of margin lines on each side
zList <- c( 1.5, # Two-tailed P-value: 10% -1.645
2.4 ) # Two-tailed P-value: 1% -2.576
pMeaning <- function(pValue){
if (pValue > 0.10) Meaning <- "Insufficient"
if ( (pValue >= 0.05) & (pValue < 0.10)) Meaning <- "Slight"
if ( (pValue >= 0.01) & (pValue < 0.05)) Meaning <- "Moderate"
if ( (pValue >= 0.001) & (pValue < 0.01)) Meaning <- "Strong"
if (pValue < 0.001) Meaning <- "Very strong"
Meaning
}
pColours <- viridis( length(zList),
begin = 0.5 ,
end = 1,
option = "H")
for (i in (1:length(zList))){
zScore <- zList[i]
pValue <- pnorm( -zScore )
pValue2 <- ifelse( pValue < 0.001,
"< 0.001",
round(pValue, 4) )
out <- plotNormal(mu = 0,
sd = 1,
xlab = expression(italic(z)*"-score"),
round.dec = 0,
main = paste("Evidence to support alternative\nhypothesis:",
pMeaning(pValue))
)
shadeNormal(out$x,
out$y,
col = pColours[i],
lo = zScore,
hi = 10)
shadeNormal(out$x,
out$y,
col = pColours[i],
lo = -zScore,
hi = -10)
abline(v = zScore,
col = "grey")
abline(v = -zScore,
col = "grey")
polygon(x = c(-2.3, -2.3, 2.3, 2.3), # White-ish background for the above text
y = c(0.11, 0.21, 0.21, 0.11),
border = NA,
col = rgb(255, 255, 255, max = 255, alpha = 200) ) # Translucent white
text(0,
y = 0.16,
label = paste("Two-tailed P-value:", pValue2 ) )
}
}
```
## Writing conclusions {#OnePropTestCommunicate}
In general, to communicate the results of any hypothesis test, report:
* An answer to the RQ.
Since the null hypothesis is assumed to be true, the onus is on the evidence to support the alternative hypothesis.
Hence, conclusions are worded in terms of how much evidence exists to support the *alternative* hypothesis.
* A summary of the evidence used to reach that conclusion (such as the $z$-score and $P$-value, including if the $P$-value is one- or two-tailed).
* Sample summary information, including a CI, summarising the data used to make the decision.
So for the die-rolling example, write:
> The sample provides very strong evidence ($z = 6.53$; two-tailed $P < 0.001$) that the population proportion of ones is not $1/6$ ($n = 100$ rolls; 41 ones).
The components are:
* An answer to the RQ: 'The sample provides very strong evidence... that the population proportion is not $1/6$'; notice the wording states how much evidence exists in the sample to support the *alternative* hypothesis.
* The evidence used to reach the conclusion: '$z = 6.53$; two-tailed $P < 0.001$'.
* Some sample summary information (including a CI).
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Since the *null* hypothesis is initially assumed to be true, *the onus is on the evidence to refute the null hypothesis*.
Hence, conclusions are worded in terms of how strongly the evidence (i.e., sample data) support the alternative hypothesis.
In fact, the alternative hypothesis *may* or *may not* be true... but the evidence (data) available supports the alternative hypothesis.
:::
## Summary {#OnePropTestSummary}
Let's recap the decision-making process, in this context about rolling a `r include_graphics("Dice/die1.png", dpi=1500)`:
1. **Assumption**:
Write the *null hypothesis* and *alternative hypothesis* about the *parameter* (based on the RQ):
* $H_0$: $p = 0.1666...$, and
* $H_1$: $p \ne 0.1666...$ (this is a two-tailed alternative hypothesis).
2. **Expectation**:
The sampling distribution describes what values of the sample statistic are reasonable to expect across all possible samples, *if* the null hypothesis is true.
Under certain circumstances, the sample proportions will vary with an approximate normal distribution around a mean of $p = 0.1666...$ with a standard deviation of $\text{s.e.}(\hat{p}) = 0.0372678$.
3. **Observation**:
Compute the $z$-score ($z = 6.53$) to measure the distance between the assumed population value and the observed sample value.
4. **Consistency?**:
Determine if the data are consistent with the assumption, by computing the $P$-value.
Here, the $P$-value is (much) less than $0.001$.
The $P$-value can be computed by software, or approximated using the 68--95--99.7 rule.
The **conclusion** is that very strong evidence exists that $p$ is *not* $0.1666...$.
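In R, the whole test can be run with `prop.test()`. Note that `prop.test()` reports a chi-squared statistic; with `correct = FALSE` (no continuity correction), this statistic is the *square* of the $z$-score used above:

```{r}
out <- prop.test(x = 41, n = 100, p = 1/6, correct = FALSE)
sqrt( out$statistic )   # approximately 6.53: the z-score
out$p.value             # the two-tailed P-value: essentially zero
```

(By default, `prop.test()` applies a continuity correction, which changes the statistic slightly.)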
::: {.example #POnePropTestMeasles name="One sample proportion test"}
A study of the measles-rubella vaccination in Korea [@kim2004sero] compared the proportion of children with measles antibodies to the World Health Organization (WHO) target proportion (for children aged 5 to 9 years old: 10%).
In the study, 55 children out of 972 had the antibody present; that is, $\hat{p} = 55/972 = 0.056584...$.
Of course, every sample of 972 children would produce a different sample proportion (depending on which children were selected to be in the sample), so the difference between this sample proportion and the target proportion (of 10%, or $p = 0.10$) could be due to sampling variation.
The aim of the study was to test if the proportion of Korean children with the measles antibody in the *population* was 10% or better (lower); the hypotheses are:
* $H_0$: $p = 0.10$ (assume the target is met, and the difference between $p$ and $\hat{p}$ is due to sampling variation); and
* $H_1$: $p < 0.10$ (one-tailed, since the RQ is whether the target is 10% or *lower*).
The *standard error* for the sample proportion is
\[
\text{s.e.}(\hat{p})
= \sqrt{\frac{p (1 - p)}{n}}
= 0.0096225...
\]
The *test statistic* is:
\[
z
= \frac{\hat{p} - p}{\text{s.e.}(\hat{p})}
= \frac{0.056584 - 0.10}{0.0096225} = -4.51.
\]
This is a *very* large (and *negative*) $z$-score, so expect a *very* small $P$-value from using the 68--95--99.7 rule or using tables: there is very strong evidence to support the alternative hypothesis.
We write:
> Very strong evidence exists in the sample ($z = -4.51$; one-tailed $P < 0.001$) that the population proportion is less than the target of $p = 0.10$ (Korean sample proportion: $\hat{p} = 0.0566$; $n = 972$; approximate 95% CI from $0.042$ to $0.071$).
:::
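The arithmetic in the example above can be checked numerically. Below is a minimal sketch in Python (standard library only; the `normal_cdf` helper and the variable names are illustrative choices, not part of the study):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function, via math.erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p0 = 972, 0.10                   # sample size; hypothesised proportion
phat = 55 / n                       # sample proportion: 0.056584...

se = math.sqrt(p0 * (1 - p0) / n)   # s.e. of phat under H0: 0.0096225...
z = (phat - p0) / se                # test statistic: -4.51
p_value = normal_cdf(z)             # one-tailed P-value (H1: p < 0.10)

print(round(z, 2), p_value < 0.001)   # -4.51 True
```

In R, `prop.test(55, 972, p = 0.10, alternative = "less")` gives a comparable (chi-squared based) result.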
## Statistical validity conditions {#ValidityProportionsTest}
All inference procedures have underlying [conditions to be met](#exm:StatisticalValidityAnalogy) so that the results are statistically valid; that is, the $P$-values can be found accurately because the sampling distribution is an approximate normal distribution.
For a hypothesis test for one proportion, these conditions are similar to those for the [CI for one proportion](#ValidityProportions).
The *statistical validity conditions* for a test for a single proportion are that the *expected* number of individuals in the group of interest (i.e., $n\times p$) and in the group *not* of interest (i.e., $n\times (1 - p)$) both exceed five; that is:
* $n\times p > 5$, *and* $n\times (1 - p) > 5$.
The value of 5 is a rough figure, and some books give other values (such as 10 or 15).
This condition ensures that the *distribution of the sample proportions has an approximate normal distribution* (so that, for example, the [68--95--99.7 rule](#def:EmpiricalRule) can be used).
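This check is simple to automate; the sketch below (in Python; the function name is an illustrative choice) reports whether both expected counts exceed the threshold:

```python
def valid_for_normal_approx(n, p, threshold=5):
    """True when both expected counts, n*p and n*(1 - p),
    exceed the threshold (5, as used in the text)."""
    return n * p > threshold and n * (1 - p) > threshold

print(valid_for_normal_approx(100, 1/6))   # dice example: True
print(valid_for_normal_approx(20, 0.10))   # n*p = 2, too small: False
```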
::: {.example #StatisticalValidityDice name="Statistical validity"}
The hypothesis test regarding the dice is statistically valid, since $n\times p = 100 \times (1/6) = 16.666\dots$ and $n\times (1 - p) = 83.333\dots$, so *both* comfortably exceed five.
:::
::: {.example #StatisticalValidityMeasles name="Statistical validity"}
The hypothesis test regarding measles in Korea (Example \@ref(exm:POnePropTestMeasles)) is statistically valid, since $n\times p = 972 \times 0.10 = 97.2$ and $n\times (1 - p) = 874.8$, so *both* easily exceed five.
:::
## Example: dominance of birds
A study [@barve2017elevational] compared two types of birds (male green-backed tits; male cinereous tits) to see which was more behaviourally dominant over winter.
If the species were equally dominant, then about 50% of the interactions would be won by each species (i.e., $p = 0.50$).
However, in the 45 interactions observed between the two species, green-backed tits won 37 of these interactions (i.e., $\hat{p} = 0.82222$).
Of course, every sample of 45 interactions would produce a different sample proportion, so the difference between this sample proportion and $p = 0.5$ could be due to sampling variation.
To test whether the interactions were won equally by the two species, the hypotheses are:
\[
\text{$H_0$: } p = 0.5\quad\text{and}\quad\text{$H_1$: } p \ne 0.5 \text{ (two-tailed)}.
\]
The test will be statistically valid, since $n\times p = 45\times 0.5 = 22.5$ and $n\times (1 - p) = 22.5$ both exceed five.
The *standard error* for the sample proportion is
\[
\text{s.e.}(\hat{p})
= \sqrt{\frac{p (1 - p)}{n}}
= \sqrt{\frac{0.50 \times (1 - 0.50)}{45}}
= 0.0745356...
\]
Then, the *test statistic* is:
\[
z
= \frac{\hat{p} - p}{\text{s.e.}(\hat{p})}
= \frac{0.82222 - 0.50}{0.0745356}
= 4.32.
\]
This is a *very* large $z$-score, so expect a very small $P$-value from using the 68--95--99.7 rule or tables.
The 95% CI for the proportion requires the standard error computed using the *sample* proportion:
\[
\text{s.e.}(\hat{p})
= \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}
= \sqrt{\frac{0.82222 \times (1 - 0.82222)}{45}}
= 0.056999...
\]
So the approximate 95% CI is $0.82222 \pm(2 \times 0.056999...)$, or from 0.708 to 0.936.
We write:
> There is *very* strong evidence in the sample ($P < 0.001$; $z = 4.32$) that the interactions were not won equally between the two species ($\hat{p} = 0.8222$ won by green-backed tits; $n = 45$; approximate 95% CI: 0.708 to 0.936) in the population.
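The two standard errors, the test statistic and the approximate CI above can be reproduced with a short script (a Python sketch; the variable names are illustrative):

```python
import math

n, p0 = 45, 0.50
phat = 37 / n                              # 0.82222...

se_test = math.sqrt(p0 * (1 - p0) / n)     # uses p (for the test): 0.0745...
z = (phat - p0) / se_test                  # 4.32

se_ci = math.sqrt(phat * (1 - phat) / n)   # uses phat (for the CI): 0.0570...
ci = (phat - 2 * se_ci, phat + 2 * se_ci)  # approximate 95% CI

print(round(z, 2), round(ci[0], 3), round(ci[1], 3))   # 4.32 0.708 0.936
```

Note the design of the calculation: the *hypothesised* proportion is used for the test (since $H_0$ is assumed true), but the *sample* proportion is used for the CI.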
## Example: obesity
@kolanska2010high compared the rate of obesity in $n = 143$ Polish patients with adrenal tumours to that of the general population of Poland ($p = 0.125$), to test if those with adrenal tumours were *more likely* to be obese than the general population.
The hypotheses are:
\[
\text{$H_0$: } p = 0.125\quad\text{and}\quad\text{$H_1$: } p > 0.125\text{ (one-tailed)}.
\]
Assuming the null hypothesis is true, the standard error is (remembering to use $p$):
\[
\text{s.e.}(\hat{p})
= \sqrt{\frac{p (1 - p)}{n}}
= \sqrt{\frac{0.125 \times (1 - 0.125)}{143}}
= 0.027656...
\]
In their sample, 57 were obese, so $\hat{p} = 57/143 = 0.3986...$.
Then, the *test statistic* is:
\[
z
= \frac{\hat{p} - p}{\text{s.e.}(\hat{p})}
= \frac{0.3986 - 0.125}{0.027656}
= 9.89.
\]
This is an *extremely* large $z$-score, so expect a very small $P$-value using the 68--95--99.7 rule.
The 95% CI for the proportion requires the standard error computed from the *sample* proportion:
\[
\text{s.e.}(\hat{p})
= \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}
= \sqrt{\frac{0.3986 \times (1 - 0.3986)}{143}}
= 0.040943...
\]
The approximate 95% CI is $0.3986 \pm(2 \times 0.040943...)$, or from 0.317 to 0.480.
We write:
> *Very* strong evidence exists in the sample (one-tailed $P < 0.001$; $z = 9.89$) that the rate of obesity in patients with adrenal tumours ($\hat{p} = 0.3986$; $n = 143$; approximate 95% CI: 0.317 to 0.480) is higher than the general Polish population.
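As a numerical check of this example (a Python sketch; the variable names are illustrative):

```python
import math

n, p0 = 143, 0.125
phat = 57 / n                              # 0.3986...

se_test = math.sqrt(p0 * (1 - p0) / n)     # s.e. under H0: 0.027656...
z = (phat - p0) / se_test                  # 9.89

se_ci = math.sqrt(phat * (1 - phat) / n)   # s.e. from the sample: 0.040943...
ci = (phat - 2 * se_ci, phat + 2 * se_ci)  # approximate 95% CI

print(round(z, 2), round(ci[0], 3), round(ci[1], 3))   # 9.89 0.317 0.48
```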
## Summary {#Chapxx-Summary}
To test a hypothesis about a population proportion $p$:
* Initially *assume* the value of $p$ in the null hypothesis to be true.
* Then, describe the *sampling distribution*, which describes what to *expect* from the sample statistic across all possible samples, based on this assumption: under certain statistical validity conditions, the sample proportion varies with:
* an approximate normal distribution,
* centered around the hypothesised value of $p$,
* with a standard deviation of $\displaystyle \text{s.e.}(\hat{p}) = \sqrt{\frac{p (1 - p)}{n}}$.
* The *observations* are then summarised, and *test statistic* computed:
\[
z = \frac{ \hat{p} - p}{\text{s.e.}(\hat{p})},
\]
where $p$ is the hypothesised value given in the null hypothesis.
An approximate *$P$-value* can be estimated using the [68--95--99.7 rule](#def:EmpiricalRule), or using tables.
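The steps above can be collected into a single function; a sketch in Python using only the standard library (the function name `one_proportion_z_test` and the returned tuple are illustrative choices):

```python
import math

def one_proportion_z_test(x, n, p0, tails=2):
    """z-test of H0: p = p0, given x individuals of interest out of n.
    Returns (phat, z, p_value); tails is 1 or 2."""
    phat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)       # s.e. of phat, assuming H0
    z = (phat - p0) / se
    # one tail of the standard normal beyond |z|, via math.erf
    upper_tail = 0.5 * (1 - math.erf(abs(z) / math.sqrt(2)))
    return phat, z, tails * upper_tail

# the measles example: 55 of 972 children, H1: p < 0.10 (one-tailed)
phat, z, p_value = one_proportion_z_test(55, 972, 0.10, tails=1)
print(round(phat, 4), round(z, 2), p_value < 0.001)   # 0.0566 -4.51 True
```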
## Quick review questions {#Chapxx-QuickReview}
::: {.webex-check .webex-box}
A study of diseases in native Americans [@kizer2006digestive] found 381 obese or overweight patients among 449 patients.
In the USA general population, 65% of people are obese or overweight.
The researchers wanted to determine whether the rate of obesity/overweight among native Americans was *greater* than that of the general population.
1. True or false: The *population* proportion of overweight/obese native Americans is 0.65.\tightlist
`r if( knitr::is_html_output() ) {torf(answer=TRUE)}`
1. True or false: The sample size is $n = 381$.
`r if( knitr::is_html_output() ) {torf(answer=FALSE)}`
1. The *sample* proportion $\hat{p}$ is (to *four* decimal places):
`r if( knitr::is_html_output() ) {
fitb(num=TRUE, tol=0.0001, answer=0.84855)
} else {
"________________"
}`
1. True or false: The *null* hypothesis is $H_0$: $p = 0.65$.
`r if( knitr::is_html_output() ) {torf(answer=TRUE)}`
1. True or false: The *alternative* hypothesis is *one*-tailed.
`r if( knitr::is_html_output() ) {torf(answer=TRUE)}`
1. True or false: To compute the standard error for the sample proportion, $\text{s.e.}(\hat{p})$, we use $\hat{p}$ in the formula.
`r if( knitr::is_html_output() ) {torf(answer=FALSE)}`
1. True or false: In a one-sample test of proportion, the $z$-score is always large.
`r if( knitr::is_html_output() ) {torf(answer=FALSE)}`
1. For this test, the computed $z$-score is (to *two* decimal places):
`r if( knitr::is_html_output() ) {
fitb(num=TRUE, tol=0.005, answer=8.82079)
} else {
"________________"
}`
1. True or false? We always accept the *null* hypothesis.
`r if( knitr::is_html_output() ) {torf(answer=FALSE)}`
:::
## Exercises {#OneProportionTestExercises}
Selected answers are available in Sect. \@ref(TestOneProportionAnswer).
::: {.exercise #OneProportionTestExercisesPlacebos}
The study of herbal medicines is complicated because *blinding* subjects is difficult: placebos are often easily identifiable by eye, by taste, or by smell.
One study [@loyeung2018experimental] examined if subjects could identify potential placebos, performing *better* than just guessing.
The 81 subjects were each presented with a choice of five different supplements, and asked to select which one was the legitimate herbal supplement based on the *taste*.
Of these, 50 correctly selected the true herbal supplement.
1. If the subjects were selecting the true herbal supplement randomly, what proportion of subjects would be expected to select the correct supplement as the true herbal medicine?
2. Write the hypotheses for addressing the aims of the study.
3. Is this a one- or two-tailed test?
Explain.
4. Sketch the *sampling distribution* of the sample proportion, assuming the null hypothesis is correct.
5. Is there evidence to support the idea that people can identify the true supplement by taste?
:::
::: {.exercise #OneProportionTestExercisesEPL}
In the 2019/2020 English Premier League (EPL), at full-time the home team had won 91 out of 208 games, while the away team won 67.
(50 games were draws.)
(Data from: https://sports-statistics.com/sports-data/soccer-datasets/)
*Ignoring draws*, is there evidence of a home-side advantage; that is, that the home-side winning percentage is greater than 50%?
:::
::: {.exercise #OneProportionTestExercisesPedalMachines}
In a study to increase activity in library users [@maeda2013introducing], pedal machines were introduced on the first floor of Joyner Library at East Carolina University, where 60.2% of all students were female.
Students were observed using the machines on 589 occasions, of which 295 were by females.
Is there evidence that the proportion of female users of the machines was lower than the overall female proportion at the university?
What would you conclude?
:::
::: {.exercise #OneProportionTestExercisesCasinos}
In a 1995 study of 357 visitors to Las Vegas casinos, 88 were smokers.
At the time, 25.5% of the general U.S. population were smokers (based on data from the U.S. National Center for Health Statistics).
Are casino-goers just as likely to be smokers as the general U.S. population?
:::
:::{.exercise #OneProportionBreadfruitPasta}
Researchers developed a gluten-free pasta made from breadfruit [@nochera2019development].
In the study sample, 57 of the 71 participants stated that they liked the pasta.
Do the researchers have sufficient evidence to claim that the 'majority of people like breadfruit pasta'?
:::
::: {.exercise #OneProportionTestExercisesIguanas}
A study of black spiny-tailed iguanas in Florida (an invasive species) compared the snout-vent length (SVL) for iguanas of various sizes [@avery2014invasive].
275 iguanas with a SVL between 100 and 149mm were found in the study, of which 146 were female.
Assuming female and male iguanas were equally present in the population, is there evidence that female and male iguanas were equally likely to be found with SVL in this range?
:::
::: {.exercise #OneProportionTestExercisesCTS}
Carpal Tunnel Syndrome (CTS) is a painful condition in the wrists.
A study [@boltuch2020palmaris] was interested in whether 'a relationship exists between the palmaris tendon [and] carpal tunnel syndrome (CTS)' (@boltuch2020palmaris, p. 493).
The palmaris longus (PL) tendon is visually absent in about 15% of the population.
The researchers found PL was visually absent in 33 of 516 CTS wrists in their sample.
Is there evidence to suggest that the rate of PL absence is different in CTS cases?
:::
::: {.exercise #OneProportionTestExercisesBorers}
In a study of resistance of some commercial corn varieties to the European corn borer [@siegfried2014estimating], borers were collected from corn in Iowa and Nebraska.
Researchers aimed to estimate the frequency of resistance to the toxin in the corn.
By mating borers collected from the field with various resistant laboratory individuals, they could determine what proportion of resistant individuals to expect in the second generation offspring.
In one study of $n = 172$ second-generation individuals, 24 were found to be resistant.
The expectation was that 1-in-16 would be resistant if the field borers were resistant.
Perform a hypothesis test to determine whether the data are consistent with the expectation that the population proportion of resistant borers is $1/16$.
:::
::: {.exercise #OneProportionTestExercisesLEDlights}
In a study of streetlight preferences of drivers [@davidovic2019drivers], drivers were asked to conduct a series of manoeuvres under 3000K LED light and then under 4000K LED lights.
They were then asked to decide which streetlight they preferred.
Out of the 52 subjects, 29 preferred the 3000K LED lights.
Is there evidence that the choice between the two streetlights is random, or is there evidence of a preference for one over the other?
:::
::: {.exercise #OneProportionTestExercisesPenguins}
A study of Magellanic penguins [@vanstreels2013female] examined 73 adult penguins found dead or stranded on the southern Brazilian coast.
Of these, 47 were female.
Assuming female and male penguins were equally present in the population, we would expect about half the dead or stranded penguins to be female.
Is this what the data suggest?
:::
<!-- QUICK REVIEW ANSWERS -->
`r if (knitr::is_html_output()) '<!--'`
::: {.EOCanswerBox .EOCanswer data-latex="{iconmonstr-check-mark-14-240.png}"}
\textbf{Answers to \textit{Quick Revision} questions:}
**1.** True.
**2.** False.
**3.** 0.84855.
**4.** True.
**5.** True.
**6.** False.
**7.** False.
**8.** $z = 8.82079$.
**9.** False.
:::
`r if (knitr::is_html_output()) '-->'`