# Tests for one mean {#TestOneMean}
<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/29-Testing-OneMean-HTML.Rmd'} else {'./introductions/29-Testing-OneMean-LaTeX.Rmd'}}
```
## Introduction: body temperatures {#BodyTemperature}
```{r}
data(BodyTemp)
```
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-anna-shvets-3987141.jpg" width="200px"/>
</div>
The average internal body temperature is commonly believed to be $37.0^\circ\text{C}$, based on data over 150 years old [@data:Wunderlich:BodyTemp].
More recently, researchers wanted to re-examine this claim [@data:mackowiak:bodytemp] to see if this benchmark is still appropriate.
That is, a decision is sought about the value of the *population* mean body temperature $\mu$.
The value of $\mu$ will never be known: the internal body temperature of every person alive would need to be measured... and even those not yet born.
The parameter is $\mu$, the population mean internal body temperature (in degrees C).
However, a *sample* of people can be taken to determine whether or not there is evidence that the *population* mean internal body temperature is still $37.0^\circ\text{C}$.
To make this decision, the [decision-making process](#DecisionMaking) (Sect. \@ref(DecisionMaking)) is used.
Begin by **assuming** that $\mu = 37.0$ (as there is no evidence that this accepted standard is wrong), and then determine if the evidence supports this claim or not.
The RQ could be stated as:
> Is the *population* mean internal body temperature $37.0^\circ\text{C}$?
## Statistical hypotheses and notation
The [decision making](#DecisionMaking) process begins by **assuming** that the population mean internal body temperature is $37.0^\circ\text{C}$.
Since the sample mean $\bar{x}$ is likely to be different for every sample even if $\mu = 37$ (*sampling variation*), the *sampling distribution* of $\bar{x}$ across all possible samples needs to be described.
Because every sample is different, $\bar{x}$ will vary, and the *sample* mean $\bar{x}$ probably won't be exactly $37.0^\circ\text{C}$, *even if* the population mean $\mu$ is $37.0^\circ\text{C}$.
Two broad reasons could explain why:
1. The *population* mean body temperature *is* $37.0^\circ\text{C}$, but $\bar{x}$ isn't exactly $37.0$ due to sampling variation;
or
2. The *population* mean body temperature *is not* $37.0$, and the *sample* mean body temperature reflects this.
Formally, the two statistical hypotheses above are:
1. The *null hypothesis* ($H_0$): $\mu = 37.0^\circ\text{C}$; and
2. The *alternative hypothesis* ($H_1$): $\mu \ne 37.0^\circ\text{C}$.
The alternative hypothesis proposes that $\mu$ is not $37.0$: the value of $\mu$ may be either smaller or larger than $37.0$.
Two possibilities are considered: for this reason, this alternative hypothesis is called a *two-tailed* alternative hypothesis.
## Describing the sampling distribution
Data to answer this RQ are available from an American study [@data:Shoemaker1996:Temperature]
`r if (knitr::is_latex_output()) {
'(Table \\@ref(tab:DataBodyTemp)).'
} else {
'(data below).'
}`
Summarising the data is important, because the data are used to answer the RQ (Fig. \@ref(fig:BodyTempHist)); the internal body temperature of individuals varies from person to person.
A numerical summary, from using software, shows that:
* The *sample* mean is $\bar{x} = 36.8051^\circ$C;
* The *sample* standard deviation is $s = 0.40732^\circ$C;
* The sample size is $n = 130$.
The sample mean $\bar{x}$ is *less* than the assumed value of $\mu = 37^\circ\text{C}$...
The question is *why*: can the difference reasonably be explained by sampling variation because every sample produces a different value of $\bar{x}$, or not?
A 95% CI can be computed (using software or manually): the 95% CI for $\mu$ is from $36.73^\circ$ to $36.88^\circ$C.
This CI is narrow, implying that $\mu$ has been estimated with precision, so detecting even small deviations of $\mu$ from $37^\circ$ should be possible.
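These summary statistics can be reproduced directly in R (the language used to build this book). A minimal sketch, assuming the `BodyTemp` data frame loaded above (with the column `BodyTempC` used elsewhere in this chapter):

```{r, eval = FALSE}
xbar <- mean(BodyTemp$BodyTempC)     # sample mean: 36.8051
s    <- sd(BodyTemp$BodyTempC)       # sample standard deviation: 0.40732
n    <- length(BodyTemp$BodyTempC)   # sample size: 130
se   <- s / sqrt(n)                  # standard error of the mean: 0.035724
xbar + c(-2, 2) * se                 # approximate 95% CI: 36.73 to 36.88
```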
```{r NotationOneMeanHT}
OneMeanNotation <- array( dim = c(4, 2))
OneMeanNotation[1, ] <- c("Describe individual in the population",
"Vary with mean $\\mu$ and standard deviation $\\sigma$")
OneMeanNotation[2, ] <- c("Describe individual in a sample",
"Vary with mean $\\bar{x}$ and standard deviation $s$")
OneMeanNotation[3, ] <- c("Describe sample means ($\\bar{x}$)",
"Vary with approx. normal distribution (under certain conditions):")
OneMeanNotation[4, ] <- c("across all possible samples",
"sampling mean $\\mu$; standard deviation $\\text{s.e.}(\\bar{x})$")
if( knitr::is_latex_output() ) {
kable( OneMeanNotation,
format = "latex",
booktabs = TRUE,
longtable = FALSE,
escape = FALSE,
caption = "The notation used for describing means, and the sampling distribution of the sample means",
align = c("r", "l"),
linesep = c("\\addlinespace",
"\\addlinespace",
""),
col.names = c("Quantity",
"Description") ) %>%
row_spec(0, bold = TRUE) %>%
kable_styling(font_size = 10)
} else {
OneMeanNotation[3, 1] <- paste(OneMeanNotation[3, 1],
OneMeanNotation[4, 1])
OneMeanNotation[3, 2] <- paste(OneMeanNotation[3, 2],
OneMeanNotation[4, 2])
OneMeanNotation[4, ] <- NA
kable( OneMeanNotation,
format = "html",
booktabs = TRUE,
longtable = FALSE,
escape = FALSE,
caption = "The notation used for describing means, and the sampling distribution of the sample means",
align = c("r", "l"),
linesep = c("\\addlinespace",
"\\addlinespace",
""),
col.names = c("Quantity",
"Description") ) %>%
row_spec(0, bold = TRUE)
}
```
```{r DataBodyTemp}
subBodyT <- round(cbind(BodyTemp$BodyTempC[1:10],
BodyTemp$BodyTempC[121:130] ), 2)
if( knitr::is_latex_output() ) {
T1 <- kable( subBodyT[1:10, 1],
format = "latex",
booktabs = TRUE,
longtable = FALSE,
linesep = "",
align = "c",
col.names = "Body temp (deg. C)") %>%
row_spec(0, bold = TRUE)
T2 <- kable( subBodyT[1:10, 2],
format = "latex",
booktabs = TRUE,
escape = FALSE,
longtable = FALSE,
linesep = "",
align = "c",
col.names = "Body temp (deg. C)") %>%
row_spec(0, bold = TRUE)
out <- knitr::kables(list(T1, T2),
format = "latex",
label = "DataBodyTemp",
caption = "The body temperature data: The lowest ten and the highest ten of the 130 observations") %>%
kable_styling(font_size = 10)
prepareSideBySideTable(out)
}
if( knitr::is_html_output() ) {
DT::datatable( BodyTemp,
#fillContainer=FALSE, # Make more room, so we don't just have ten values
#filter="top",
#selection="multiple",
#escape=FALSE,
options = list(searching = FALSE), # Remove searching: See: https://stackoverflow.com/questions/35624413/remove-search-option-but-leave-search-columns-option
caption = "The body temperature data")
}
```
```{r BodyTempHist, fig.cap="The histogram of the body temperature data", fig.align="center", fig.height=3.0, fig.width=4.5}
hist( BodyTemp$BodyTempC,
xlab = "Body temperature, in degrees C",
ylab = "Frequency",
las = 1,
main = "",
breaks = seq(35, 39, by = 0.5),
col = plot.colour)
box()
```
The sampling distribution of $\bar{x}$ was discussed in Sect. \@ref(SamplingDistSampleMean) (and Def. \@ref(def:DEFSamplingDistributionXbar) specifically).
The notation for describing the sampling distribution is shown in Table \@ref(tab:NotationOneMeanHT).
From this, if $\mu$ really was $37.0^\circ$C and if [certain conditions are true](#ValiditySampleMeanTest), the possible values of the sample means across all possible samples can be described using:
* An approximate normal distribution;
* With a sampling mean whose value is $\mu = 37.0^\circ\text{C}$ (from $H_0$);
* With standard deviation of $\displaystyle \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{0.40732}{\sqrt{130}} = 0.035724$.
This is the *standard error* of the sample means.
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The notation $\text{s.e.}(\bar{x})$ is the *standard error of the sample mean*, and denotes 'the standard deviation of the means computed from all the possible samples'.
:::
A picture of this sampling distribution (Fig. \@ref(fig:BodyTempSamplingDist)) shows *how the sample mean varies when $n = 130$ across all possible samples, simply due to sampling variation, when $\mu = 37^\circ\text{C}$*.
This enables questions to be asked about the likely values of $\bar{x}$ that would be found in the sample, when the population mean is $\mu = 37^\circ\text{C}$.
```{r BodyTempSamplingDist, fig.cap="The distribution of sample mean body temperatures, if the population mean is $37^\\circ$C and $n = 130$. The grey vertical lines are 1, 2 and 3 standard deviations from the mean.", fig.align="center", fig.height=2.75, fig.width=9, out.width='90%'}
mn <- 37.0
# These taken from Shoemaker's JSE data file
s <- 0.40732
n <- 130
se <- s/sqrt(n)
out <- plotNormal(mn,
s / sqrt(n),
xlab = "Sample means from sample of size 130 (deg C)",
round.dec = 3,
cex.tickmarks = 0.9)
```
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
Given the sampling distribution (Fig. \@ref(fig:BodyTempSamplingDist)), use the [68--95--99.7 rule](#def:EmpiricalRule) to determine how often $\bar{x}$ will be **larger** than 37.036 degrees C just because of sampling variation, if $\mu$ really is $37^\circ$C.
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
About 16% of the time.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
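The answer can also be checked numerically. A minimal sketch, using the fact that $37.036$ is about one standard error above the mean of the sampling distribution:

```{r, eval = FALSE}
se <- 0.40732 / sqrt(130)     # standard error: 0.035724
pnorm(37.036,
      mean = 37,
      sd   = se,
      lower.tail = FALSE)     # about 0.16, as the 68--95--99.7 rule suggests
```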
<iframe src="https://learningapps.org/watch?v=p28qr801322" style="border:0px;width:100%;height:600px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>
## Computing the test statistic and $t$-scores {#Tscores}
The sampling distribution describes how the sample mean varies; that is, what to *expect* from the sample mean, after *assuming* $\mu = 37.0^\circ\text{C}$.
The value of $\bar{x}$ that is *observed*, however, is $\bar{x} = 36.8051^\circ$.
How likely is it that such a value could occur in our sample by chance (by sampling variation)?
The value of the observed sample mean can be located on the picture of the [sampling distribution](#fig:BodyTempSamplingDist) (Fig. \@ref(fig:BodyTempSamplingDistT)).
The value $\bar{x} = 36.8051^\circ\text{C}$ is unusually small: observing a sample mean this low is very unlikely from a sample of $n = 130$ if $\mu$ really was $37$.
About how many standard deviations is $\bar{x}$ away from $\mu = 37$?
A lot...
```{r BodyTempSamplingDistT, fig.cap="The sample mean of $\\bar{x} = 36.8051^\\circ$C is very unlikely to have been observed if the population mean really was $37^\\circ$C, and $n=130$. The standard deviation of this normal distribution is $\\text{s.e.}(\\bar{x}) = 0.035724$.", fig.align="center", fig.width=6, fig.height=2.75}
mn <- 37.0
#These taken from Shoemaker's JSE data file
s <- 0.40732
n <- 130
xbar <- 36.8
se <- s/sqrt(n)
z36 <- (36.8 - mn)/(s / sqrt(n))
out <- plotNormal(mn,
s / sqrt(n),
xlab = "Sample mean temperatures (in deg C)",
round.dec = 3,
showX = seq(-6, 3, by = 1) * s/sqrt(n) + mn,
xlim.hi = mn + 3.5 * se,
xlim.lo = mn - 6.5 * se,
cex.tickmarks = 0.9)
arrows(xbar,
0.8 * max(out$y),
xbar,
0,
length = 0.15,
angle = 15)
text(xbar,
0.8 * max(out$y),
"36.8 deg.",
pos = 4)
```
Relatively speaking, the *distance* that the observed sample mean (of $\bar{x} = 36.8051$) is from the mean of the sampling distribution (Fig. \@ref(fig:BodyTempSamplingDistT)) is found by computing *how many* standard deviations the value of $\bar{x}$ is from the mean of the distribution; that is, computing something like a $z$-score.
Remember that Fig. \@ref(fig:BodyTempSamplingDistT) displays the values of $\bar{x}$ that could be expected if $\mu = 37$.
The mean of this distribution is $\mu_{\bar{x}} = 37$ and the standard deviation of this normal distribution is the *standard error*: the amount of variation in the sample means across all possible samples.
The number of standard deviations that $\bar{x} = 36.8051$ is from the mean is
\[
\frac{36.8051 - 37.0}{0.035724} = -5.453.
\]
This value is *like* a $z$-score, but is actually called a $t$-score.
(The reason is that *the population standard deviation is unknown*, so the best estimate (the *sample* standard deviation) is used when $\text{s.e.}(\bar{x})$ was computed.)
Both $t$ and $z$ scores measure *the number of standard deviations that an observation is from the mean*.
::: {.tipBox .tip data-latex="{iconmonstr-info-6-240.png}"}
Like $z$-scores, $t$-scores measure the number of standard deviations that a value is from the mean.
When the value of interest is a sample statistic, both measure the number of *standard errors* that the sample statistic is from the mean.
The difference is:
* $z$-scores are calculated if computing the standard error uses known values (as with a test for proportions; Chap. \@ref(TestOneProportion));
* $t$-scores are calculated if computing the standard error uses sample estimates (as in this chapter).
$t$-scores are more commonly used than $z$-scores, because almost always the population standard deviation is unknown, and so the sample standard deviation is used to compute the standard error.
In this book, it is sufficient to think of $z$-scores and $t$-scores as approximately the same.
Unless sample sizes are small, this is a reasonable approximation.
:::
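This closeness can be illustrated numerically; a minimal sketch comparing the same tail area under the $z$- and $t$-distributions:

```{r, eval = FALSE}
pnorm(-2)           # normal (z) tail area: 0.0228
pt(-2, df = 129)    # t tail area when n = 130: 0.0238 (very close)
pt(-2, df = 9)      # t tail area when n = 10: 0.0382 (small samples differ more)
```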
The calculation is:
\[
t = \frac{36.8051 - 37.0}{0.035724} = -5.453;
\]
the observed sample mean is *more than five standard deviations below the population mean*, which is *highly* unusual based on the [68--95--99.7 rule](#def:EmpiricalRule) (Fig. \@ref(fig:BodyTempSamplingDistT)).
In general, a $t$-score in hypothesis testing is
\begin{equation}
t
=
\frac{\text{sample statistic} - \text{mean of the sample statistic}}
{\text{standard error of the sample statistic}}
=
\frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})}.
(\#eq:tscore)
\end{equation}
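For the body-temperature data, this calculation takes only a few lines of R; a minimal sketch using the summary statistics above:

```{r, eval = FALSE}
xbar <- 36.8051        # observed sample mean
mu0  <- 37.0           # population mean, assuming the null hypothesis is true
s    <- 0.40732        # sample standard deviation
n    <- 130            # sample size
se   <- s / sqrt(n)    # standard error: 0.035724
(xbar - mu0) / se      # t-score: about -5.45
```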
<iframe src="https://learningapps.org/watch?v=pi8jnzhu322" style="border:0px;width:100%;height:500px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>
## Determining $P$-values {#Pvalues}
As seen in Sect. \@ref(OnePropTestP), a $P$-value is used to quantify how unusual the computed $t$-score (or $z$-score) is, after assuming the null hypothesis is true.
The $P$-value can be *approximated* using the [68--95--99.7 rule](#def:EmpiricalRule) and a diagram (Sect. \@ref(ApproxP)), or using tables (`r if ( knitr::is_html_output()) { 'Appendix \\@ref(ZTablesOnline)'} else {'Appendices \\@ref(ZTablesNEG) and \\@ref(ZTablesPOS)'}`).
Commonly, software is used for tests involving one mean to compute the $P$-value (Sect. \@ref(SoftwareP)).
### Approximate $P$-values {#ApproxP}
Since $t$-scores are similar to $z$-scores, the ideas in Sect. \@ref(OnePropTestP) can be used to *approximate* a $P$-value for a $t$-score: either the [68--95--99.7 rule](#def:EmpiricalRule) and a diagram (Sect. \@ref(OnePropTestP6895997)), or tables of $z$-scores (`r if ( knitr::is_html_output()) { 'Appendix \\@ref(ZTablesOnline)'} else {'Appendices \\@ref(ZTablesNEG) and \\@ref(ZTablesPOS)'}`; Sect. \@ref(OnePropTestPTables)).
Both methods produce only approximate $P$-values (since they are based on $z$-scores, not $t$-scores); usually, software is used to determine exact $P$-values for $t$-scores.
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
What do you think the two-tailed $P$-value will be when $t = -5.45$ (Fig. \@ref(fig:BodyTempSamplingDistT))?
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
Based on the [68--95--99.7 rule](#def:EmpiricalRule), the two-tailed $P$-value will be *extremely* small.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
### Exact $P$-values using software {#SoftwareP}
Software computes the $t$-score and a precise $P$-value (jamovi: Fig. \@ref(fig:BodyTempTestjamovi); SPSS: Fig. \@ref(fig:BodyTempTestSPSS)).
The output (in jamovi, under the heading `p`; in SPSS, under the heading `Sig. (2-tailed)`) shows that the $P$-value is indeed very small.
Although SPSS reports the $P$-value as 0.000, $P$-values can never be *exactly* zero, so we interpret this as 'zero to three decimal places', or that $P$ is less than 0.001 (written as $P < 0.001$, as jamovi reports).
::: {.tipBox .tip data-latex="{iconmonstr-info-6-240.png}"}
When software reports a $P$-value of `0.000`, it really means (and we should write) $P < 0.001$: That is, the $P$-value is *smaller* than 0.001.
:::
This $P$-value means that, assuming $\mu = 37.0^\circ$C, observing a sample mean as low as $36.8051^\circ$C just through sampling variation (from a sample size of $n = 130$) is almost *impossible*.
And yet... we did.
Using the [decision-making process](#DecisionMaking), this implies that the initial assumption (the null hypothesis) is contradicted by the data: The evidence suggests that the *population* mean body temperature is *not* $37.0^\circ\text{C}$.
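Since this book is built with R, the $P$-value can also be sketched there; `t.test()` reports the $t$-score and an exact two-tailed $P$-value (assuming the `BodyTemp` data frame used above):

```{r, eval = FALSE}
t.test(BodyTemp$BodyTempC, mu = 37)   # two-tailed by default
2 * pt(-5.45, df = 130 - 1)           # or from the t-score: well below 0.001
```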
```{r BodyTempTestjamovi, fig.cap="jamovi output for conducting the $t$-test for the body temperature data", fig.align="center", out.width="60%"}
knitr::include_graphics("jamovi/BodyTemp/BodyTempTtest.png")
```
```{r BodyTempTestSPSS, fig.cap="SPSS output for conducting the $t$-test for the body temperature data", fig.align="center", out.width="70%"}
knitr::include_graphics("SPSS/BodyTemp/BodyTempTtest.png")
```
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
For *one-tailed tests*, the $P$-value is *half* the value of the two-tailed $P$-value.
:::
::: {.softwareBox .software data-latex="{iconmonstr-laptop-4-240.png}"}
SPSS always produces **two-tailed** $P$-values, calls them *Significance values*, and labels them as `Sig.`
jamovi can produce one- or two-tailed $P$-values.
:::
### Making decisions with $P$-values
As seen in Sect. \@ref(OnePropTestDecisions), $P$-values measure the likelihood of observing the sample statistic (or something more extreme), based on the **assumption** about the population parameter being true.
In this context, the $P$-value tells us the likelihood of observing the value of $\bar{x}$ (or something more extreme), just through sampling variation if $\mu = 37$.
For the body-temperature data then, where $P < 0.001$, the $P$-value is *very* small, so there is *very strong evidence* that the population mean body temperature is not $37.0^\circ\text{C}$.
## Writing conclusions
In general, to communicate the results of any hypothesis test, report an *answer to the RQ*, a summary of the *evidence* used to reach that conclusion (such as the $t$-score and $P$-value---including if it is a one- or two-tailed $P$-value), and some *sample summary information* (including a CI, summarising the data used to make the decision).
So for the body-temperature example, write:
> The sample provides very strong evidence ($t = -5.45$; two-tailed $P<0.001$) that the population mean body temperature is *not* $37.0^\circ\text{C}$ ($\bar{x} = 36.81$; $n = 130$; 95% CI\ from 36.73$^\circ$C to 36.88$^\circ$C).
The components are:
* The *answer to the RQ*.
'The sample provides very strong evidence... that the population mean body temperature is not $37.0^\circ\text{C}$'.
Since the alternative hypothesis was two-tailed, the conclusion is worded in terms of the population mean body temperature *not* being $37.0^\circ\text{C}$.
* The *evidence* used to reach the conclusion: '$t = -5.45$; two-tailed $P<0.001$'.
* Some *sample summary information* (including a CI): '$\bar{x} = 36.81$; $n=130$; 95% CI from 36.73$^\circ$C to 36.88$^\circ$C'.
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Since the *null* hypothesis is initially assumed to be true, the onus is on the evidence to refute the null hypothesis.
Hence, conclusions are worded in terms of how strongly the evidence (i.e., sample data) support the alternative hypothesis.
In fact, the alternative hypothesis *may* or *may not* be true... but the evidence (data) available here strongly supports the alternative hypothesis.
:::
## Summary {#TestSummary}
Let's recap the [decision-making process](#DecisionMaking), in this context about body temperatures:
* **Step 1: Assumption**:
Write the *null hypothesis* about the parameter (based on the RQ) $H_0$: $\mu = 37.0^\circ\text{C}$.
In addition, write the *alternative hypothesis* $H_1$: $\mu \ne 37.0^\circ\text{C}$.
(This alternative hypothesis is two-tailed.)
* **Step 2: Expectation**:
The *sampling distribution* describes what to expect from the sample statistic *if* the null hypothesis is true: [under certain circumstances](#ValiditySampleMeanTest), the sample means will vary with an approximate normal distribution around a mean of $\mu = 37.0^\circ\text{C}$ with a standard deviation of $\text{s.e.}(\bar{x}) = 0.03572$ (Fig. \@ref(fig:BodyTempSamplingDistT)).
* **Step 3: Observation**:
Compute the $t$-score: $t = -5.45$.
The $t$-score can be computed by software, or using the general equation \@ref(eq:tscore).
* **Step 4: Consistency?**:
Determine if the data are *consistent* with the assumption, by computing the $P$-value.
Here, the $P$-value is much smaller than 0.001.
The $P$-value can be computed by software, or approximated using the [68--95--99.7 rule](#def:EmpiricalRule).
The **conclusion** is that there is very strong evidence that $\mu$ is not $37.0^\circ\text{C}$.
::: {.example #OneTSpeeds name="Mean driving speeds"}
A study of driving speeds in Malaysia [@azwari2021evaluating] recorded the speeds of vehicles on various roads.
One RQ of interest was whether the mean speed of cars on one road was the posted speed limit of 90 km.h^-1^, or whether it was *higher*.
The *parameter* of interest is $\mu$, the mean speed in the *population*.
The statistical hypotheses are:
\[
\text{$H_0$: } \mu = 90\quad\text{and}\quad\text{$H_1$: }\mu > 90.
\]
The alternative hypothesis is *one-tailed*, since the researchers were interested in whether the mean speed was *higher* than the posted speed limit.
(In the case of a one-tailed alternative hypothesis, writing $\text{$H_0$: } \mu \le 90$ is also correct (and equivalent), though the test still proceeds just as if $\mu = 90$.)
The researchers recorded the speed of $n = 400$ vehicles on this road, and found the mean and standard deviation of the speeds of individual vehicles were $\bar{x} = 96.56$ and $s = 13.874$.
Of course, the sample mean is likely to vary from sample to sample (with a normal distribution).
For samples of size $n = 400$, the mean of this normal distribution that describes the sample means is $\mu_{\bar{x}} = 90$, and the standard deviation is
\[
\text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{13.874}{\sqrt{400}} = 0.6937.
\]
Hence, the test statistic is
\[
t = \frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})} = \frac{96.56 - 90}{0.6937} = 9.46,
\]
where (as usual) the value of $\mu_{\bar{x}}$ is taken from the null hypothesis (which we always assume to be true initially).
This $t$-score is a *huge* value, suggesting that the (one-tailed) $P$-value is very small.
We write (remembering the alternative hypothesis is one-tailed):
> There is very strong evidence ($t = 9.46$; one-tailed $P < 0.001$) that the mean speed of vehicles on this road (sample mean: 96.56; standard deviation: 13.874) is greater than 90 km.h^-1^.
Of course, since this statement refers to the *mean* speed, some individual vehicles may be travelling below the speed limit.
:::
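A minimal sketch of these calculations in R, from the summary statistics alone:

```{r, eval = FALSE}
xbar <- 96.56; s <- 13.874; n <- 400   # summary statistics from the study
se <- s / sqrt(n)                      # standard error: 0.6937
t  <- (xbar - 90) / se                 # t-score: 9.46
pt(t, df = n - 1,
   lower.tail = FALSE)                 # one-tailed P-value: far below 0.001
```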
## Statistical validity conditions {#ValiditySampleMeanTest}
As with any inference procedure, the underlying mathematics requires [certain conditions to be met](#exm:StatisticalValidityAnalogy) so that the results are statistically valid.
For a hypothesis test for one mean, these conditions are the same as for the CI for one mean (Sect. \@ref(ValiditySampleMean)).
The test will be statistically valid if *one* of these is true:
1. The sample size is at least 25, *or*
2. The sample size is smaller than 25 *and* the *population* data has an approximate normal distribution.
The sample size of 25 is a rough figure here, and some books give other values (such as 30).
This condition ensures that the *distribution of the sample means has an approximate normal distribution* (so that, for example, the [68--95--99.7 rule](#def:EmpiricalRule) can be used).
Provided the sample size is larger than about 25, this will be approximately true *even if* the distribution of the individuals in the population does not have a normal distribution.
That is, when $n > 25$ the sample means generally have an approximate normal distribution, even if the data themselves do not have a normal distribution.
::: {.example #StatisticalValidityTemps name="Statistical validity"}
The hypothesis test regarding body temperature is statistically valid since the sample size is large ($n = 130$).
Since the sample size is large, we *do not* require the data to come from a population with a normal distribution.
:::
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
Suppose we are performing a one-sample $t$-test about a mean.
The random sample is of size $n = 45$, and the histogram of the data is skewed right.
Is the test likely to be **statistically** valid?
`r if( knitr::is_html_output() ) {
mcq( c(
"Yes, because the sample is a random sample",
answer = "Yes, because the sample size is large",
"No, because the sample size is too small",
"We don't know; we need to know if the data has a normal distribution",
"No, because the histogram is skewed right"
))}`
:::
::: {.example #OneTSpeedsValid name="Driving speeds"}
In Example \@ref(exm:OneTSpeeds) about mean driving speeds,
the sample size was 400, much larger than 25.
The test is statistically valid.
:::
## Example: recovery times {#RecoveryTimes}
```{r}
RTime <- structure(list(RecTime = c(14, 9, 18, 26, 12, 0, 10, 4, 8, 21,
28, 24, 24, 2, 3, 14, 9)), .Names = "RecTime", row.names = c(NA,
-17L), class = "data.frame", variable.labels = structure(character(0), .Names = character(0)), codepage = 28591L)
```
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-anna-shvets-3846157.jpg" width="200px"/>
</div>
Seventeen patients were treated for medial collateral ligament (MCL) and anterior cruciate ligament (ACL) tears using a new treatment method [@data:Nakamua2000:ACL; @data:Altman1991:PracticalStats].
The existing treatment has an average recovery time of 15 days.
The RQ is:
> For patients with this type of injury, does the new treatment method lead to *shorter* mean recovery times?
The parameter is $\mu$, the population mean recovery time.
The statistical hypotheses (**Step 1: Assumption**) about the parameter are, from the RQ:
- $H_0$: $\mu = 15$ (the population mean is $15$, but $\bar{x}$ is not $15$ due to sampling variation):
This is the initial **assumption**.
- $H_1$: $\mu < 15$ ($\mu$ is not $15$; it really does produce *shorter* recovery times, on average).
This test is *one-tailed*: the RQ only asks if the new method produces *shorter* recovery times.
The evidence (Table \@ref(tab:RecoveryData)) can be summarised numerically, using software or (since the dataset is small) a calculator.
Either way, $\bar{x} = 13.29$ and $s = 8.887$.
```{r RecoveryData}
RTimeArray <- array( c(RTime$RecTime, NA),
dim = c(2, 9))
options(knitr.kable.NA = '')
if( knitr::is_latex_output() ) {
kable(RTimeArray,
format = "latex",
#col.names = rep(" ", 9),
booktabs = TRUE,
longtable = FALSE,
caption = "The recovery times for a new treatment") %>%
kable_styling(font_size = 10) %>%
add_header_above(header = c("Recovery times (in days)" = 9),
bold = TRUE,
align = "c")
}
if( knitr::is_html_output() ) {
kable(RTimeArray,
format = "html",
col.names = rep(" ", 9),
booktabs = TRUE,
longtable = FALSE,
caption = "The recovery times (in days) for a new treatment")
}
```
If the null hypothesis is true (and $\mu = 15$), the values of the sample mean that are likely to occur through sampling variation can be described (**Step 2: Expectation**).
The sample means are likely to vary with an approximate normal distribution ([under certain assumptions](#ValiditySampleMeanTest)), with mean $\mu_{\bar{x}} = 15$ and a standard deviation of
\[
\text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{8.887}{\sqrt{17}} = 2.155.
\]
This describes what values of $\bar{x}$ we should *expect* in the sample if the mean recovery time $\mu$ really was 15 days (Fig. \@ref(fig:RecoveryTimesDSamplingDistribution)).
The sample mean is $\bar{x} = 13.29$, so the $t$-score to determine where the sample mean is located (**Step 3: Observation**), relative to what is expected, is
\[
t = \frac{\bar{x} - \mu_{\bar{x}}}{\text{s.e.}(\bar{x})}
= \frac{13.29 - 15}{2.155}
= -0.79.
\]
Software could also be used (jamovi: Fig. \@ref(fig:RecoveryTimestestjamovi); SPSS: Fig. \@ref(fig:RecoveryTimestestSPSS)); in either case, $t = -0.79$.
A $z$-score of $-0.79$ is not unusual, and (since $t$-scores are like $z$-scores) this $t$-score is not unusual either (**Step 4: Consistency**).
The $P$-value in the [jamovi output](#fig:RecoveryTimestestjamovi) or [SPSS output](#fig:RecoveryTimestestSPSS) confirms this: the *two*-tailed $P$-value is $0.440$, so the *one*-tailed $P$-value is $0.440\div 2 = 0.220$.
```{r RecoveryTimestestjamovi, fig.cap="jamovi output for the $t$-test for the recovery-times data", fig.align="center", out.width="60%"}
knitr::include_graphics("jamovi/RecoveryTimes/RecoveryTimeTTest.png")
```
```{r RecoveryTimestestSPSS, fig.cap="SPSS output for the $t$-test for the recovery-times data", fig.align="center", out.width="80%"}
knitr::include_graphics("SPSS/RecoveryTimes/RecoveryTimeOneSampleTOutput.png")
```
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Recall: For *one-tailed tests*, the $P$-value is *half* the value of the two-tailed $P$-value.
:::
This 'large' $P$-value suggests that a sample mean of 13.29 could reasonably have been observed just through sampling variation: there is no evidence to support the alternative hypothesis $H_1$.
If $\mu$ really was 15, then $\bar{x}$ would be less than 13.29 in about 22% of samples, just because of sampling variation.
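The one-tailed test can also be sketched in R, using the `RTime` data frame defined earlier in this section; setting `alternative = "less"` matches the one-tailed $H_1$:

```{r, eval = FALSE}
t.test(RTime$RecTime,
       mu = 15,
       alternative = "less")   # reports t = -0.79; one-tailed P about 0.22
```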
```{r RecoveryTimesDSamplingDistribution, fig.cap="The sampling distribution for the recovery-times data", fig.align="center", fig.width=10, fig.height=3, out.width='90%'}
par( mfrow = c(1,2) )
mn.RT <- 15
sd.RT <- sd(RTime$RecTime)
n.RT <- length(RTime$RecTime)
se.RT <- sd.RT/sqrt(n.RT)
xbar <- 13.29
z <- (xbar - mn.RT)/se.RT
out <- plotNormal(mn.RT,
se.RT,
xlab = "Sample mean values (in days)",
las = 2,
round.dec = 2)
shadeNormal(out$x,
out$y,
lo = 8,
hi = xbar,
col = plot.colour)
lines(x = c(xbar, xbar),
y = c(0, 0.9 * max(out$y)),
lwd = 3,
col = "black" )
text( xbar,
0.9 * max(out$y),
expression( paste(bar(italic(x)) == "13.29")),
pos = 2)
out <- plotNormal(0,
1,
xlab = expression(italic(t)-scores),
round.dec = 0)
shadeNormal(out$x,
out$y,
lo = -5,
hi = z,
col = plot.colour)
lines( x = c(z, z),
y = c(0, 0.9 * max(out$y) ),
lwd = 3,
col = "black")
text( z,
0.9 * max(out$y),
expression(Observed~italic(t)~score),
pos = 2)
```
To summarise:
* **Step 1 (Assumption)**:
- $H_0$: $\mu = 15$, the initial assumption (or, $\mu \ge 15$);
- $H_1$: $\mu < 15$ (note: *one*-tailed).
* **Step 2 (Expectation)**:
The sample means will vary, and this sampling variation is described by an approximate normal distribution with mean $\mu_{\bar{x}} = 15$ and standard deviation $\text{s.e.}(\bar{x}) = 2.155$.
* **Step 3 (Observation)**:
$t = -0.791$.
* **Step 4 (Consistency?)**:
The *one-tailed* $P$-value is $0.220$:
The data are consistent with $H_0$, so there is no evidence to support the alternative hypothesis.
To write a conclusion, include an *answer* to the question, *evidence* leading to the conclusion, and some *sample summary information*:
> No evidence exists in the sample (one sample $t = -0.79$; one-tailed $P = 0.220$) that the population mean recovery time is less than 15 days (mean 13.29 days; $n=17$; 95% CI from 8.73 to 17.86 days) using the new treatment method.
(The CI is found using the ideas in Sect. \@ref(OneMeanCI).)
Notice the wording: The new method *may* be better, but no evidence exists of this in the sample.
The onus is on the new method to demonstrate that it is better than the current method.
The sample size is small ($n = 17$), so the test may not be statistically valid (but the $P$-value is so large that it probably won't affect the conclusions).
## Example: student IQs {#IQstudents}
Standard IQ scores are designed to have a mean of 100 in the general population.
A study of $n = 224$ students at Griffith University [@reilly2022gender] found that the sample IQ scores were approximately normally distributed, with a mean of 111.19 and a standard deviation of 14.21.
Is this evidence that students at Griffith University (GU) have a higher mean IQ than the general population?
The RQ is:
> For students at Griffith University, is the mean IQ higher than 100?
The parameter is $\mu$, the population mean IQ for students at GU.
The statistical hypotheses (**Step 1: Assumption**) about the parameters are, from the RQ:
- $H_0$: $\mu = 100$ (the population mean is $100$, but $\bar{x}$ is not $100$ due to sampling variation): This is the initial **assumption**.
- $H_1$: $\mu > 100$ ($\mu$ is *greater than* $100$; GU students really do have a higher mean IQ than the general population).
This test is *one-tailed*: the RQ only asks if the IQ of GU students is *greater* than 100.
We do not have the original data, but the summary data are sufficient: $\bar{x} = 111.19$ with a standard deviation of $s = 14.21$ from a sample of size $n = 224$.
The *sample* mean is higher than 100, but the sample mean varies for each sample.
The sample means vary with a normal distribution having a mean of $\mu_{\bar{x}} = 100$ and a standard deviation of
\[
\text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{14.21}{\sqrt{224}} = 0.9494456.
\]
The $t$-score is
\[
t = \frac{\bar{x} - \mu_{\bar{x}}}{\text{s.e.}(\bar{x})} = \frac{111.19 - 100}{0.9494456} = 11.78.
\]
This is a *huge* $t$-score, which means that a sample mean as large as 111.19 would be highly unlikely to occur simply by sampling variation in a sample of size $n = 224$ if the population mean really was 100.
The $P$-value will be extremely small.
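A minimal sketch of these calculations in R, from the summary statistics (no raw data are needed):

```{r, eval = FALSE}
xbar <- 111.19; s <- 14.21; n <- 224   # summary statistics from the study
se <- s / sqrt(n)                      # standard error: 0.9494456
(xbar - 100) / se                      # t-score: 11.78
xbar + c(-2, 2) * se                   # approximate 95% CI: 109.29 to 113.09
```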
To conclude (where the CI is found using the ideas in Sect. \@ref(OneMeanCI), or manually):
> Very strong evidence exists in the sample (one sample $t = 11.78$; one-tailed $P < 0.001$) that the population mean IQ in students at Griffith University
> is greater than 100 (mean 111.19; $n = 224$; 95% CI from 109.29 to 113.09).
(**Note**: IQ scores do not have units of measurement.)
Since the sample size is much larger than 25, this conclusion is *statistically valid* (i.e., the mathematics behind the computations will be sound).
The sample is not a true random sample from the population of all GU students (the students are mostly first-year students, and most were enrolled in an undergraduate psychological science degree).
However, these students may be somewhat representative of all GU students; that is, the sample *may* be externally valid.
The difference between the general population IQ of 100 and the sample mean IQ of GU students is only small: about 11 IQ units.
Possibly, this difference has very little practical significance, even though the statistical evidence suggests that the difference cannot be explained by chance.
## Summary {#Chap27-Summary}
To test a hypothesis about a population mean $\mu$:
* Initially **assume** the value of $\mu$ in the null hypothesis to be true.
* Then, describe the **sampling distribution**, which describes what to **expect** from the sample mean based on this assumption: under certain statistical validity conditions, the sample mean varies with:
* an approximate normal distribution,
* with a sampling mean whose value is $\mu$ (from $H_0$), and
* having a standard deviation of $\displaystyle \text{s.e.}(\bar{x}) =\frac{s}{\sqrt{n}}$.
* The **observations** are then summarised, and *test statistic* computed:
\[
t = \frac{ \bar{x} - \mu}{\text{s.e.}(\bar{x})},
\]
where $\mu$ is the hypothesised value given in the null hypothesis.
* The $t$-value is like a $z$-score, and so an approximate **$P$-value** can be estimated using the [68--95--99.7 rule](#def:EmpiricalRule), or found using software.
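These steps can be collected into a small helper function; a minimal sketch (the function name `tscore_one_mean()` is ours, for illustration only):

```{r, eval = FALSE}
tscore_one_mean <- function(xbar, mu0, s, n) {
  se <- s / sqrt(n)    # standard error of the sample mean
  (xbar - mu0) / se    # how many standard errors xbar is from mu0
}
tscore_one_mean(xbar = 36.8051, mu0 = 37, s = 0.40732, n = 130)   # about -5.45
tscore_one_mean(xbar = 96.56,   mu0 = 90, s = 13.874,  n = 400)   # about 9.46
```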
`r if (knitr::is_html_output()){
'The following short video may help explain some of these concepts:'
}`
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZbJ58Ag22Mw" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture"></iframe>
## Quick review questions {#Chap27-QuickReview}
::: {.webex-check .webex-box}
In traffic, the usual engineering recommendation is that the safe gap between travelling vehicles (called a 'headway') is *at least* 1.9s (often conveniently rounded to 2 seconds in practice).
One study of $28$ streams of traffic in Birmingham, Alabama [@majeed2014field] found the mean headway was 1.1915s, with a standard deviation of 0.231s.
The researchers wanted to test if the mean headway in Birmingham was at least 1.9s.
1. True or false? The test is *one-tailed*.\tightlist
`r if( knitr::is_html_output() ) {torf(answer=TRUE)}`
1. The standard error of the mean (to five decimal places) is
`r if( knitr::is_html_output() ) {fitb(num=TRUE, tol=0.00001, answer=0.04365)} else {"________________"}`
1. The null hypothesis is:
`r if( knitr::is_html_output() ) {longmcq( c(
"The sample mean headway is 1.9s",
"The sample mean is 1.1915s",
answer = "The population mean is 1.9s",
"The population mean is 1.1915s") )} else {"________________"}`
1. The test statistic (to two decimal places) is
`r if( knitr::is_html_output() ) {fitb(num=TRUE, tol=0.01, answer=-16.23)} else {"________________"}`
1. The one-tailed $P$-value is
`r if( knitr::is_html_output() ) {mcq( c(
answer = "Small",
"Big",
"0.05",
"0.34") )} else {"________________"}`
1. True or false? The sample provides no evidence that the mean headway is at least 1.9s.
`r if( knitr::is_html_output() ) {torf(answer=TRUE)}`
:::
<!-- ::: {.webex-check .webex-box} -->
<!-- A study [@imtiaz2017assessment] compared the nutritional intake of $n = 50$ anaemic infants in Lahore (Pakistan) with the recommended daily intake (of 13g). -->
<!-- The mean daily protein intake in the sample was 14g, with a standard deviation of 3g. -->
<!-- The researchers wanted to see if the mean intake met (or exceeded) the recommendation, or not. -->
<!-- ::: -->
## Exercises {#TestOneMeanAnswerExercises}
Selected answers are available in Sect. \@ref(TestOneMeanAnswer).
::: {.exercise #TestOneMeanExercisesSleep}
A study of Taiwanese pre-school children [@lin2021sleep] compared their average sleep-times to the recommendation (of *at least* 10 hours per night).
The summary of the data for weekend sleep-times is shown in Table \@ref(tab:SleepingSummary), for both females and males.
On average, do female pre-school children get *at least* 10 hours of sleep per night?
Do male pre-school children?
:::
```{r SleepingSummary}
SleepArray <- array( dim = c(2, 3))
SleepArray[1, ] <- c(47,
8.50,
0.48)
SleepArray[2, ] <- c(39,
8.64,
0.37)
if( knitr::is_latex_output() ) {
kable(SleepArray,
format = "latex",
col.names = c("Sample size",
"Sample mean",
"Sample std. dev."),
booktabs = TRUE,
caption = "Summary information for the Taiwanese pre-schoolers sleep times (in hours)") %>%
row_spec(0, bold = TRUE) %>%
kable_styling(font_size = 10)
}
if( knitr::is_html_output() ) {
kable(SleepArray,
format = "html",
col.names = c("Sample size",
"Sample mean",
"Sample std. dev."),
booktabs = TRUE,
caption = "Summary information for the Taiwanese pre-schoolers sleep times (in hours)") %>%
row_spec(0, bold = TRUE)
}
```
::: {.exercise #TestOneMeanExercisesToothbrushing}
Most dental associations^[Such as the *American Dental Association* and the *Australian Dental Association*.] recommend brushing teeth for *at least* two minutes.
A study [@data:Macgregor1979:BrushingDurationKids] of the brushing time for 85 uninstructed school children from England (11 to 13 years old) found the mean brushing time was 60.3 seconds, with a standard deviation of 23.8 seconds.
1. Is there evidence that the mean brushing time for schoolchildren from England is at least two minutes (as recommended)?
1. Sketch the sampling distribution of the sample mean.
:::
::: {.exercise #TestOneMeanExercisesAutomatedVehicles}
A study [@data:greenlee2018:vehicles] of human-automation interaction with automated vehicles aimed to
> ... determine whether monitoring the roadway for hazards during automated driving results in a vigilance decrement [i.e., a reduction].
>
> --- @data:greenlee2018:vehicles, p. 465
That is, they were interested in whether the average mental demand of 'drivers' of automated vehicles was *higher* than the average mental demand for ordinary tasks.
In the study, the $n = 22$ participants 'drove' (in a simulator) an automated vehicle for 40 minutes.
While driving, the drivers monitored the road for hazards.
The researchers assessed the 'mental demand' placed on these drivers, where scores of 50 or over 'typically indicate substantial levels of workload' (p. 471).
For the sample, the mean score was 84.00 with a standard deviation of 22.05.
Is there evidence of a 'substantial workload' associated with monitoring roadways while 'driving' automated vehicles?
:::
::: {.exercise #TestOneMeanExercisesQualityOfLife}
A study explored the quality of life of patients receiving cavopulmonary shunts [@data:Steele2016:Shunt].
'Quality of life' was assessed using a 36-question health survey, where the scale is standardised so that the mean of the general population is 50.
For the 14 patients in the study, the sample mean for the 'Physical component' of the survey was 47.2 (with a standard deviation of 8.2).
The sample mean for the 'Mental component' of the survey was 52.7 (with a standard deviation of 5.6).
Is there evidence that the patients are different, on average, to the general population on the basis of the results?
:::
::: {.exercise #TestOneMeanExercisesCherryRipes}
A *Cherry Ripe* is a popular chocolate bar in Australia.
In 2017 and 2018, I sampled some *Cherry Ripe* Fun Size bars.
The packaging claimed that the Fun Size bars weigh 12g (on average).
Use the SPSS summary of the data (Fig. \@ref(fig:CherryRipes20172018)) to perform a hypothesis test to determine if the mean weight really is 12g or not.
:::
```{r CherryRipes20172018, fig.cap="SPSS output for the Cherry Ripes data", fig.align="center", out.width="75%"}
knitr::include_graphics( "SPSS/CherryRipes/CherryRipes20172018.png")
```
::: {.exercise #TestOneMeanBloodLoss}
(This study was also seen in Exercise \@ref(exr:CIOneMeanBloodLoss).)
A study of paramedics [@data:Williams2007:BloodLoss] asked participants ($n = 199$) to estimate the amount of blood on four different surfaces.
When the actual amount of blood spilt on concrete was 1000 ml, the mean guess was 846.4 ml (with a standard deviation of 651.1 ml).
Is there evidence that the mean guess really is 1000 ml (the true amount)?
Is this test likely to be valid?
:::
```{r}
QC <- structure(list(High1 = c(61.63, 63.11, 66.88, 62.56, 66.12, 65.34,
64.83, 64.22, 65.54, 65.33, 67.83, 65.97, 65.01, 64.51, 64.3,
64.38, 64.75, 63.12, 62.93, 63.59, 67.04, 64.38, 63.41, 64.19,
64.86, 63.7, 63.14, 66, 63.43, 64.25, 63.67, 64.78, 58.02, 63.46,
65.2, 63.74), Mid1 = c(18.36, 18.77, 18.98, 17.97, 19.69, 19.63,
19.5, 19.39, 19.87, 19.66, 20.41, 18.37, 19.09, 18.98, 19.22,
18.99, 19.58, 19.7, 19, 19.11, 19.97, 19.52, 19.39, 17.2, 19.75,
19.35, 19.95, 19.38, 19.33, 19, 19.05, 19.24, 19.29, 19.29, 19.18,
19.35), High2 = c(62.64, 64.36, 66.06, 65.39, 66.85, 65.56, 66.6,
66.9, 65.5, 65.92, 65.51, 65.03, 64.22, 64.19, 64.95, 64.95,
64.45, 64.77, 65.58, 64.87, 63.52, 65.02, 66.05, 63.82, 64.58,
65.09, 64.86, 64.85, 66.75, 65.37, 64.29, 65.5, 62.85, 64, 63.72,
64.3), Mid2 = c(19.12, 19.07, 19.58, 19.35, 19.65, 19.13, 19.55,
19.85, 19.45, 19.87, 19.73, 19.6, 20.76, 18.98, 19.93, 19.15,
19.23, 19.2, 19.52, 19.38, 19.08, 18.89, 19.57, 19.35, 19.13,
19.44, 19.27, 19.58, 19.71, 19.64, 19.14, 19.28, 19.68, 18.27,
18.75, 19.57)), .Names = c("High1", "Mid1", "High2", "Mid2"), class = "data.frame", row.names = c(NA,
-36L))
DataAndTargets <- array(dim = c(3, 4))
DataAndTargets[1, ]<- c(64.31, 19.24, 64.97, 19.40)
DataAndTargets[2, ]<- c(1.700, 0.588, 1.029, 0.413)
DataAndTargets[3, ]<- c(64.22, 19.01, 65.05, 19.45)
rownames(DataAndTargets) <- c("Mean of data",
"Std. dev. of data",
"Pre-determined target")
colnames(DataAndTargets) <- c("High level",
"Mid level",
"High level",
"Mid level")
```
::: {.exercise #TestOneMeanQualityControl}
A quality-control study [@feng2017application] assessed the accuracy of two instruments from a clinical laboratory, by comparing the reported luteotropic hormone (LH) concentrations to known pre-determined values (Table \@ref(tab:QualityControlDataSummary)).
Perform a series of tests to determine how well the two instruments perform, for both high- and mid-level LH concentrations (from the data in
`r if (knitr::is_latex_output()) {
'Table \\@ref(tab:QualityControlData)).'
} else {
'below.'
}`
:::
```{r QualityControlData}
if( knitr::is_latex_output() ) {
T1 <- kable( rbind( head(QC[1:2], 5),
c("$\\vdots$", "$\\vdots$") ),
format = "latex",
align = "c",
linesep = "",
col.names = c("High level",