# External validity: sampling {#Sampling}
<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/06-ResearchDesign-Sampling-HTML.Rmd'} else {'./introductions/06-ResearchDesign-Sampling-LaTeX.Rmd'}}
```
<!-- Define colours as appropriate -->
```{r, child = if (knitr::is_html_output()) {'./children/coloursHTML.Rmd'} else {'./children/coloursLaTeX.Rmd'}}
```
## Introduction {#IntroExternalValidity}
\index{External validity}\index{Research design!external validity|(}
In a well-designed study, the researchers learn about the *population* by studying just one of the countless possible *samples*.
Ideally the sample that is studied is representative of the population, so the results from the sample generalise to the population.
This is called *external validity*.\index{External validity}
*External validity* does *not* mean that the results apply more widely than the intended population.
<!-- ::: {.definition #ExternalValidity name="External validity"} -->
<!-- *External validity* refers to the ability to generalise the results to the rest of the population, beyond just those in the sample studied. -->
<!-- ::: -->
<div style="float:right; width: 75px; padding:10px">
<img src="Pics/iconmonstr-share-11-240.png" width="50px"/>
</div>
::: {.example #ExternalValidPop name="External validity"}
Suppose the *population* in a study is *Californian home-owners*.
The sample comprises the Californian home-owners studied by the researchers.
If the study is externally valid, the sample is representative of all Californian home-owners.
The results will not necessarily apply to home-owners outside of California (though they may), or to all Californian residents.
However, this *is irrelevant for external validity*.
External validity concerns how the *sample* represents the intended population in the RQ, which is *Californian home-owners*.
:::
## The idea of sampling {#IdeaOfSampling}
\index{Sampling}
Studying every member of a population is very rare due to cost, time, ethics, logistics and/or practicality.
Instead, a subset of the population (a *sample*) is studied, and *many* different samples are possible.\index{Sample}
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The challenge of research is learning about a population from studying just one of the countless possible samples.
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-gerd-altmann-23180.jpg" width="200px"/>
</div>
::: {.example #SamplesAspirin name="Samples"}
A study of the effectiveness of aspirin in treating headaches cannot possibly study every single human who may one day take aspirin.
Not only would this be prohibitively expensive, time-consuming, and impractical, but the study would not even study those not yet born who might use aspirin.
Using the whole target population is *impossible*, and a *sample* must be used.
:::
Only studying one sample out of countless possible samples raises questions:
* *Which* individuals should be included in the sample to be studied?
* *How many* individuals should be included in the sample to be studied?
The first issue is studied in this chapter.
The second issue is studied later (Chap.\ \@ref(EstimatingSampleSize)), after learning about the implications of studying samples rather than populations.
Many samples are possible, and *every sample is likely to be different*.
Hence, the results of studying a sample are likely to vary, depending on which individuals are in the studied sample.
The differences between the samples, and differences in the results from each sample, are called *sampling variation*.\index{Sampling variation}
That is, each sample has different individuals, produces different data, and may even suggest different answers to the RQ.
:::{.example name="Number of samples"}
In a 'population' of just\ $100$, the number of possible samples of size\ $25$ is about\ $2.4\times 10^{23}$: vastly more than the number of people currently living on earth.
:::
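The number of possible samples in this example can be checked with the base R function `choose()`, which counts the ways of selecting a given number of items from a larger set when order does not matter.
This is a minimal sketch; the world population figure (about $8$\ billion) is a rough assumption used only for comparison.

```r
# Number of distinct samples of size 25 from a 'population' of 100
n_samples <- choose(100, 25)
n_samples

# Rough world population: an assumption, for comparison only
world_pop <- 8e9

# How many possible samples there are for every person on earth
n_samples / world_pop
```

Even for this tiny 'population', listing every possible sample is out of the question; a study only ever sees one of them.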
This is the challenge of research: *making decisions about populations, using just one of the many possible samples*.
A lot can be learnt about the population if selecting a sample is approached correctly.\index{Decision making}
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Almost always, researchers study *samples*, not *populations*.
Many samples are possible, and *every sample is likely to be different*, and hence the *results from every sample are likely to be different*.
This is called *sampling variation*.
As a result, *conclusions from a sample are never certainties*, though special techniques allow us to make decisions about the *population* from a *sample*.
:::
:::{.example #SamplingVarInCards name="Samples vary"}
`r if (knitr::is_latex_output()) {
'Consider a fair pack of cards (a *population*), where $50$% of cards are red. The percentage of red cards is not the same in every hand (every *sample*) of ten cards.'
} else {
'The animation below shows how samples vary. We know that\ $50$% of cards in a fair, shuffled pack (a *population*) are red, but each hand (every *sample*) of ten cards can produce a different percentage of red cards (and not always\ $50$%).'
}`
This is a simple example of *sampling variation*.\index{Sampling variation}
:::
```{r DealCards, animation.hook="gifski", fig.height = 4, fig.width=7, out.width='60%', interval=0.20, dev=if (is_latex_output()){"pdf"}else{"png"}}
source("R/showCardSampling.R")
numSamples <- 5
numCardsPerSample <- 10
### CARDS ARE OF SIZE 500 x 726
if (knitr::is_html_output()) {
totalCards <- numSamples * numCardsPerSample
for(iCard in 1:totalCards){
showCardSampling(iCard,
numSamples = numSamples,
numCardsPerSample = numCardsPerSample)
# Now pad, so the image appears to "stop"
for (iteration in 1:20){
points(-2.1, -2.25 + iteration/1000,
col = "white",
cex = 0.5) # To keep getting a new plot, so the final plot holds to be looked at
}
gc() # Try garbage collection to prevent "Cannot allocate vector of size" error
}
# }
}
```
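The card example can also be imitated with a few lines of R.
The sketch below is not the code used to produce the animation (that code is in `R/showCardSampling.R`, which is not shown here); it simply assumes a standard pack of\ $52$ cards of which\ $26$ are red, and deals five separate hands of ten cards from the full pack.
The seed value and the helper name `deal_hand()` are arbitrary choices for illustration.

```r
set.seed(2024)  # Arbitrary seed, so the example is reproducible

# The 'population': a pack of 52 cards, half of them red
pack <- rep(c("red", "black"), each = 26)

# Deal one hand (a 'sample') and report the percentage of red cards
deal_hand <- function(pack, hand_size = 10) {
  hand <- sample(pack, size = hand_size, replace = FALSE)
  100 * mean(hand == "red")
}

# Five hands, each dealt from the full, reshuffled pack
pct_red <- replicate(5, deal_hand(pack))
pct_red
```

Each hand is a different sample from the same population, so the percentages of red cards will usually differ from hand to hand and from\ $50$%.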
## Precision and accuracy {#PrecisionAccuracy}
Two questions concerning sampling in Sect.\ \@ref(IdeaOfSampling) were: *which* individuals should be in the sample, and *how many* individuals should be in the sample.
The first question addresses the *accuracy*\index{Accuracy} of using a sample value to estimate a population value.
The second addresses the *precision*\index{Precision} with which a population value is estimated using a sample.
An estimate that is not accurate is called *biased* (Def.\ \@ref(def:Bias)).
::: {.definition name="Accuracy"}
\index{Accuracy}
*Accuracy* refers to how close a *sample* estimate is likely to be to the *population* value, on average.
:::
::: {.definition name="Precision"}
\index{Precision}
*Precision* refers to how similar the sample estimates from different samples are likely to be to each other (that is, how much variation is likely in the sample estimates).
:::
Using this language:
* The sampling *method* (i.e., *how* the sample is selected) impacts the *accuracy* of the sample estimate (i.e., the *external validity* of the study).
* The *size* of the sample impacts the *precision* of the sample estimate (i.e., the *internal* validity).
Large samples are more likely to produce *precise* estimates, but they may or may not be accurate estimates.
Similarly, random samples are likely to produce *accurate* estimates, but they may or may not be *precise*.
As an analogy, consider an archer aiming at a target.
The shots can be accurate, or precise, or (ideally) both (Fig.\ \@ref(fig:PrecisionAccuracy)).
```{r PrecisionAccuracy, fig.align="center", fig.cap="Precision and accuracy: each dot indicates where a shot lands, and is like a sample estimate of the population value (shown by the central $\\times$).", fig.width=5.0, out.width='55%'}
par( mfrow = c(2, 2),
mar = c(1, 1, 1, 1)/4,
oma = c(0.5, 2.5, 2.5, 0.5) )
source("R/drawTargets.R")
### PRECISE + ACCURATE
drawTarget()
#title(main = "Precise and accurate")
shots <- addShots(x = 0,
y = 0,
radius = 0.15,
n = 20,
seed = 121314)
drawBullsEye()
### PRECISE + INACCURATE
drawTarget()
#title(main = "Precise and inaccurate")
shots <- addShots(x = 0.25,
y = 0.3,
radius = 0.15,
n = 20,
seed = 12131415)
drawBullsEye()
### IMPRECISE + ACCURATE
drawTarget()
#title(main = "Imprecise but accurate")
shots <- addShots(x = 0,
y = 0,
radius = 0.5,
n = 20,
seed = 1213141516)
drawBullsEye()
### IMPRECISE + INACCURATE
drawTarget()
#title(main = "Imprecise and inaccurate")
shots <- addShots(x = 0.4,
y = 0.3,
radius = 0.5,
n = 20,
seed = 985421)
drawBullsEye()
### LABELS
mtext("Imprecise",
side = 2,
line = 1,
font = 2,
cex = 1.25,
at = 0.25,
outer = TRUE)
mtext("Precise",
side = 2,
line = 1,
font = 2,
cex = 1.25,
at = 0.75,
outer = TRUE)
mtext("Accurate",
side = 3,
line = 1,
font = 2,
cex = 1.25,
at = 0.25,
outer = TRUE)
mtext("Inaccurate",
side = 3,
line = 1,
font = 2,
cex = 1.25,
at = 0.75,
outer = TRUE)
```
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-iqwan-alif-1206101.jpg" width="200px"/>
</div>
::: {.example #PrecisionAccuracyQld name="Precision and accuracy"}
To estimate the average age of *all Canadians*, $9\,000$ Canadian school children could be sampled.
The answer obtained from the sample will be *inaccurate* because the sample is not representative of *all* Canadians.
Since the sample is large, it will give a *precise* answer, but to a *different* question: 'What is the average age of Canadian school children?'
:::
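The difference between precision and accuracy can be explored with a small simulation.
The sketch below uses an *invented* population of ages (not real Canadian data): $20\,000$ school-aged people and $80\,000$ adults.
It compares many large non-random samples (school-aged people only) with many small simple random samples from the whole population.

```r
set.seed(1)  # Arbitrary seed, so the example is reproducible

# Hypothetical population of ages: invented for illustration only
population <- c(sample(5:17,  20000, replace = TRUE),  # school-aged
                sample(18:90, 80000, replace = TRUE))  # adults
is_school_age <- population <= 17

# 200 large, non-random samples: 9000 school-aged people each
biased_means <- replicate(200,
                          mean(sample(population[is_school_age], 9000)))

# 200 small simple random samples: 50 people each
random_means <- replicate(200,
                          mean(sample(population, 50)))

# Compare the centre and spread of the sample means
c(population_mean = mean(population),
  biased_centre   = mean(biased_means),
  biased_spread   = sd(biased_means),
  random_centre   = mean(random_means),
  random_spread   = sd(random_means))
```

Under these assumptions, the means from the large non-random samples vary very little (they are *precise*) but sit far from the population mean (they are *inaccurate*); the means from the small random samples vary more (they are less precise) but are centred near the population mean (they are accurate).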
<iframe src="https://learningapps.org/watch?v=prpojnfzj22" style="border:0px;width:100%;height:500px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>
## Types of sampling
\index{Sampling}
One key to obtaining accurate estimates about the population from the sample is to ensure that the sample faithfully represents the population.
So, *how* is such a sample selected from the population?
The individuals selected for the sample can be chosen using either *random sampling* or *non-random sampling*.
The word *random* here has a specific meaning that is different from its everyday use.
It does *not* mean 'haphazard' or 'picking individuals as "randomly" as I can'.
::: {.definition #Random name="Random"}
*Random* means determined completely by impersonal chance.
:::
### Random sampling {#RandomSamples}
\index{Sampling!random}
In a *random sample*:
1. each individual in the population can be selected; and
2. each individual is chosen on the basis of impersonal chance (such as using a random number generator, or a table of random numbers).
Some examples of random sampling methods appear in Table\ \@ref(tab:TypesOfRandomSampling), and are explained further in Sect.\ \@ref(RandomSamplingMethods).
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The results obtained from a random sample are likely to generalise to the population from which the sample is drawn; that is, *random samples* are likely to produce *externally valid* and *accurate* studies.
:::
```{r TypesOfRandomSampling}
SampleTypes <- array( dim = c(5, 5) )
colnames(SampleTypes) <- c("Type",
"Stage 1",
"Stage 2",
"Ref.",
"")
if( knitr::is_latex_output() ) {
SampleTypes[1, ] <- c("Simple random",
"Individuals chosen at \\emph{random}",
"",
"\\S \\ref{SRS}",
"")
SampleTypes[2, ] <- c("Systematic",
"Start at a \\emph{random} location",
"Take every $n$th element thereafter",
"\\S \\ref{SystematicSampling}",
"")
SampleTypes[3, ] <- c("Stratified",
"Split into a few large groups (`strata') of similar individuals",
"Select a \\emph{simple random sample} from \\emph{every} stratum",
"\\S \\ref{StratifiedSampling}",
"")
SampleTypes[4, ] <- c("Cluster",
"Split into many small groups (`clusters'); select a \\emph{simple random sample} of clusters",
"Select \\emph{all} individuals in the chosen clusters",
"\\S \\ref{ClusterSampling}",
"")
SampleTypes[5, ] <- c("Multi-stage",
"Select a \\emph{simple random sample} from the larger collection of units",
"Select a \\emph{simple random sample} from those chosen in Stage 1; etc.",
"\\S \\ref{MultistageSampling}",
"")
kable(SampleTypes[, 1:4],
format = "latex",
longtable = FALSE,
booktabs = TRUE,
escape = FALSE, # For latex to work in \rightarrow
linesep = c( "\\addlinespace"), # Add a bit of space between all rows.
caption = "Comparing five types of random sampling.",
align = c("r", "l", "l", "r")) %>%
kable_styling(full_width = FALSE, font_size = 8) %>%
row_spec(0, bold = TRUE) %>% # Columns headings in bold
column_spec(column = 1,
width = "20mm") %>%
column_spec(column = 2,
width = "45mm") %>%
column_spec(column = 3,
width = "45mm") %>%
column_spec(column = 4,
width = "10mm")
}
if( knitr::is_html_output() ) {
SampleTypes[1, ] <- c("Simple random",
"Individuals chosen at *random*",
"",
"Sect. \\@ref(SRS)",
"")
SampleTypes[2, ] <- c("Systematic",
"Start at a *random* location",
"Take every $n$th element thereafter",
"Sect. \\@ref(SystematicSampling)",
"")
SampleTypes[3, ] <- c("Stratified",
"Split into a few large groups ('strata') of similar individuals",
"Select a *simple random sample* from *every* stratum",
"Sect. \\@ref(StratifiedSampling)",
"")
SampleTypes[4, ] <- c("Cluster",
"Split into many small groups ('clusters'); select *simple random sample* of clusters",
"Select *all* individuals in the chosen clusters",
"Sect. \\@ref(ClusterSampling)",
"")
SampleTypes[5, ] <- c("Multi-stage",
"Select *simple random sample* from the larger collection of units",
"Select *simple random sample* from those chosen in Stage 1; etc.",
"Sect. \\@ref(MultistageSampling)",
"")
SampleTypes[1, 5] <- "![](./Pics/iconmonstr-chart-22-240.png){#id .class height=90px width=90px}"
SampleTypes[2, 5] <- "![](./Pics/iconmonstr-layer-17-240.png){#id .class height=90px width=90px}"
SampleTypes[3, 5] <- "![](./Pics/iconmonstr-view-5-240.png){#id .class height=90px width=90px}"
SampleTypes[4, 5] <- "![](./Pics/iconmonstr-view-4-240.png){#id .class height=90px width=90px}"
kable(SampleTypes[, c(5, 1:4)], # Move icons to the front
format = "html",
align = c("c", "r", "l", "l", "r"),
longtable = FALSE,
caption = "Comparing five types of random sampling.",
booktabs = TRUE)
}
```
A pot of soup can be tested randomly or non-randomly.
If the soup is stirred (randomised), the whole pot of soup need not be tasted to obtain an overall impression.
However, an *overall* impression is not obtained from a non-random sample (from a non-stirred pot of soup).
### Non-random sampling {#NonRandomSamples}
\index{Sampling!non-random}
A *non-random* sample is selected using personal input from the researchers.
Examples include:
* *Judgement samples*:\index{Sampling!non-random!judgement}
Individuals are selected based on the researchers' judgement (possibly unconsciously), perhaps because the individuals are (or may appear) agreeable, supportive, easily accessible, or helpful.
For example, researchers may select rats that are less aggressive, or plants that are accessible.
* *Convenience samples*:\index{Sampling!non-random!convenience}
Individuals are selected because they are convenient for the researcher.
For example, researchers may study beaches that are nearby, or use their friends for a study.
* *Voluntary response (self-selecting) samples*:\index{Sampling!non-random!voluntary}
Individuals participate if they wish to.
For example, researchers may ask people to volunteer to take a survey.
* *Cherry-picking*:\index{Sampling!non-random!cherry-picking}
Individuals are specifically chosen to reach the conclusion that the researchers want.
In non-random sampling, the individuals *in* the study are probably different from those *not in* the study.
That is, *non-random samples are not likely to be externally valid*.\index{External validity}
Researchers may use a non-random sample intentionally (e.g., to deceive), which is unethical, or unintentionally (e.g., accidentally, or for practical reasons such as meeting budgets).
Ethically, a random (or near-random) sample should be used when possible.
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Using a non-random sample means that the results probably do not generalise to the intended population: they probably do not produce externally valid or accurate studies.
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-andrea-piacquadio-3807629.jpg" width="200px"/>
</div>
<!-- ::: {.example #COVIDsampling name="Different ways to sample"} -->
<!-- During the COVID-19 pandemic in 2020, [a Facebook poll](https://www.abc.net.au/news/2020-07-03/coronacheck-anti-vaxxers-flood-online-poll-pete-evans/12415860) asked the question: "Do you think a Coronavirus vaccine should be compulsory?" -->
<!-- The result was reported as '79 per cent of Australians oppose a compulsory vaccination', from a sample of over 53,000 responses. -->
<!-- This sample was a *voluntary response sample*, not a random sample, so the results may not be *accurate*. -->
<!-- For example, the poll could have been completed multiple times by individuals, and by non-Australians as well as Australians. -->
<!-- A different study [@smith2020majority] asked Australians: -->
<!-- > The Federal Government's 'No Jab, No Pay' policy withholds certain benefits and payments from families who don't fully vaccinate their children. -->
<!-- > Do you agree with this policy? -->
<!-- In the sample of 1809 respondents, 83.7% either agreed or strongly agreed with this statement. -->
<!-- While this sample was not a *truly* random sample of Australians, the sample intentionally included individuals representing a wide range of demographics (e.g., age, gender, location, income, and so on (@smith2020majority, p. 194). -->
<!-- Furthermore, 'respondents were paid small token sum for their participation in the study' to encourage *all* selected respondents to provide an answer (and avoid voluntary responses). -->
<!-- ::: -->
## Methods of random sampling {#RandomSamplingMethods}
\index{Sampling!random}
### Simple random sampling {#SRS}
\index{Sampling!random!simple random}
The most straightforward idea for obtaining a random sample is to use a *simple random sample*.
::: {.definition #SamplingSRS name="Simple random sample"}
In a *simple random sample*, *every* possible sample of a given size has the *same* chance of being selected.
:::
Selecting a simple random sample requires a list of all members of the population, called the *sampling frame*, from which to select a sample.
Obtaining the sampling frame is often difficult or impossible, and so finding a simple random sample is also difficult.
For example, finding a simple random sample of wombats would require having a list and location of all wombats.
This is absurd; other random sampling methods, like
`r if (knitr::is_html_output()){
'[special ecological sampling methods](http://www.countrysideinfo.co.uk/what_method.htm) (e.g., @manly2014introduction),'
} else {
'special ecological sampling methods (e.g., @manly2014introduction),'
}`
would be used instead.
::: {.definition #SamplingFrame name="Sampling frame"}
The *sampling frame* is a list of *all* the individuals in the population.\index{Sampling frame}
:::
Selecting a simple random sample from the *sampling frame* can be performed using *random numbers* (e.g., using random number tables, or
`r if (knitr::is_html_output()){
'websites like https://www.random.org). A smaller version of this webpage, which generates one number at a time, is below; just press *Generate*. The numbers generated by this widget come from the true random number generator at [RANDOM.ORG](https://www.random.org). (The webpage generates as many numbers as you want all at the same time.)'
} else {
'websites like https://www.random.org).'
}`
Other random sampling methods avoid the need for a sampling frame, but still use randomness rather than human choice.
<div style="text-align:center;">
<iframe src="https://www.random.org/widgets/integers/iframe.php?title=True+Random+Number+Generator&buttontxt=Generate&width=160&height=200&border=on&bgcolor=%23FFFFFF&txtcolor=%23777777&altbgcolor=%23CCCCFF&alttxtcolor=%23000000&defaultmin=&defaultmax=&fixed=off" frameborder="0" width="160" height="200" scrolling="no" longdesc="https://www.random.org/integers/">
</iframe>
</div>
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
This book assumes simple random samples, unless otherwise noted.
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-startup-stock-photos-7357.jpg" width="200px"/>
</div>
::: {.example #TypingSRS name="Simple random sampling"}
Consider the letter-typing RQ again (Example\ \@ref(exm:Typing)`r ifelse(knitr::is_latex_output(), ", p.\\ \\pageref{exm:Typing}", "")`):
> For students in a large university course, is the average typing speed (in words per minute) the same for those aged under\ $25$ ('younger') and $25$\ or over ('older')?
Suppose budget and time constraints mean only\ $40$ students (out of\ $441$) can be selected for the study.
The *sampling frame* is the list of all students enrolled in the course.
Obtaining the sampling frame is feasible here; instructors have access to this information for grading.
A simple random sample could be found using the course enrolment list, by first placing all $441$\ student names into rows of a spreadsheet (ordered by name, student\ ID, or any way).
Then, using random numbers, $40$\ rows are selected at random (without repeating numbers) between\ $1$ and\ $441$ inclusive.
For instance, when I used
`r if (knitr::is_latex_output()) {
'\\texttt{https://random.org/integers},'
} else {
'[random.org](https://www.random.org/integers/?num=40&min=1&max=441&col=5&base=10&format=html&rnd=new),'
}`
the first few random numbers were: `410`, `215`, `384`, `158`, `296`.
Every student chosen using this method becomes part of the study.
If a student could not be contacted, more students could be chosen at random to ensure $40$ students participated
`r if (knitr::is_latex_output()) {
'(Fig.\\ \\@ref(fig:SamplesA), left panel).'
} else {
'(see animation below).'
}`
By chance, the sample comprises\ $15$ younger students and $25$\ older students.
:::
```{r animation.hook="gifski", interval=0.4, fig.height = 6.5, dev=if (is_latex_output()){"pdf"}else{"png"}}
# SRS
if (knitr::is_html_output()){
source("R/showSampleSRS.R")
showSampleSRS(static = FALSE)
}
```
```{r SamplesA, fig.align="center", fig.width=8, fig.height=5, out.width='100%', fig.cap="A simple random sample (left) and a systematic random sample (right) for obtaining a random sample of size\\ $40$ from a class of\\ $441$. Triangles $\\bigtriangledown$ represent younger students (there are\\ $294$), circles $\\bigcirc$ represent older students (there are\\ $147$), and filled shapes represent those individuals selected in the sample. In the right panel, the boxed individual in the bottom row is the initial, randomly-chosen person." }
if (knitr::is_latex_output()){
par( mfrow = c(1, 2),
mar = c(2, 0.5, 3, 0.5))
source("R/showSampleSRS.R")
source("R/showSampleSystematic.R")
showSampleSRS(static = TRUE)
showSampleSystematic(static = TRUE,
start = 9)
}
```
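The selection of rows in Example\ \@ref(exm:TypingSRS) can also be sketched in R, using the base function `sample()`.
Note that `sample()` uses R's pseudo-random number generator rather than the generator at random.org, so this is an alternative to the approach in the example, not a reproduction of it; the seed is an arbitrary choice, and the rows selected will not match the numbers quoted in the example.

```r
set.seed(410)  # Arbitrary seed, so the example is reproducible

n_students <- 441  # Size of the sampling frame (rows in the spreadsheet)
n_sample   <- 40   # Number of students to select

# Select 40 different row numbers between 1 and 441 inclusive;
# replace = FALSE ensures no row is chosen twice
selected_rows <- sort(sample(seq_len(n_students),
                             size = n_sample,
                             replace = FALSE))
selected_rows
```

If some selected students could not be contacted, further rows could be drawn in the same way from the rows not yet selected.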
### Systematic sampling {#SystematicSampling}
\index{Sampling!random!systematic}
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-chart-22-240.png" width="50px"/>
</div>
In *systematic sampling*, the first case is *randomly* selected; then, more individuals are selected at regular intervals thereafter.
In general, we say that every\ $n$th individual is selected after the initial random selection.
::: {.example #SystematicCourse name="Systematic sampling"}
For the study in Example\ \@ref(exm:Typing), a sample of $40$\ students in a course of $441$ is needed.
To find a systematic random sample, select a random number between\ $1$ and\ $441/40$ (approximately\ $11$) as a starting point; suppose the random number selected is\ $9$ (as in
`r if (knitr::is_latex_output()) {
'Fig.\\ \\@ref(fig:SamplesA), right panel).'
} else {
'the animation below).'
}`
The first student selected is the $9$th\ person in the student list (which may be ordered alphabetically, by student\ ID, or other means).
Thereafter, every\ $441/40$th person, or\ $11$th person, in the list is selected: people in rows $9$,\ $20$,\ $31$, $42$,\ and so on
`r if (knitr::is_latex_output()) {
'(Fig.\\ \\@ref(fig:SamplesA), right panel).'
} else {
'(see animation below).'
}`
By chance, the sample comprises\ $17$ younger students and $23$\ older students.
:::
```{r animation.hook="gifski", fig.height = 6.5, interval=0.4, dev=if (is_latex_output()){"pdf"}else{"png"}}
# Systematic
if (knitr::is_html_output()){
source("R/showSampleSystematic.R")
showSampleSystematic(static = FALSE)
}
```
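The row numbers for a systematic sample are easily generated.
The sketch below follows the steps in Example\ \@ref(exm:SystematicCourse): a random starting row between\ $1$ and\ $11$ is chosen, and then every\ $11$th row is taken.
The variable names and the seed are arbitrary; the last two lines reproduce the rows in the example, where the starting row was\ $9$.

```r
set.seed(9)  # Arbitrary seed, so the example is reproducible

n_students <- 441
n_sample   <- 40

# Sampling interval: 441/40, rounded down to 11
step <- floor(n_students / n_sample)

# Random starting row between 1 and 11
start <- sample(seq_len(step), size = 1)

# Every 11th row thereafter, until 40 rows are selected
selected_rows <- seq(from = start, by = step, length.out = n_sample)

# The rows used in the example, where the random start was 9
example_rows <- seq(from = 9, by = 11, length.out = 40)
head(example_rows, 4)  # 9 20 31 42
```

Rounding the interval *down* guarantees that the final row selected is no larger than\ $441$, whatever the starting row.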
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Care needs to be taken when using systematic samples to ensure a pattern is not hidden.
Consider taking a systematic sample of every $10$th residence on a long street.
In many countries, odd numbers are usually on one side of the street, and even numbers usually on the other side.
Selecting every\ $10$th house (for example) would include houses all on the same side of the street, and hence with similar exposure to the sun, traffic, etc.
:::
:::{.example #SystematicQuebec name="Systematic sampling"}
@alary1991risk studied households in Quebec to determine if their hot water systems kept their water sufficiently hot to avoid Legionellae bacteria.
They used a systematic random sample to select households to study (p.\ $2\,361$):
> The first house was selected by using a random-number table.
> Thereafter, each fifth house that satisfied the [...] criteria was eligible for the study.
:::
### Stratified sampling {#StratifiedSampling}
\index{Sampling!random!stratified}
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-layer-17-240.png" width="50px"/>
</div>
In *stratified sampling*, the population is split into a *small* number of *large* (usually similar) groups called *strata*, then cases are selected using a *simple random sample* from *each* stratum.
Every individual in the population must be in one, and only one, stratum.
::: {.example #StratifiedUni name="Stratified sampling"}
For the typing study in Example\ \@ref(exm:Typing), $20$\ younger and $20$\ older students could be selected to obtain a sample of size\ $40$.
The sample is stratified by *age group* of the person
`r if (knitr::is_latex_output()) {
'(Fig.\\ \\@ref(fig:SamplesStrat), left panel).'
} else {
'(see animation below).'
}`
Since $67$% of the students are younger in the population, the sample could be selected so that two-thirds of the sample of size\ $40$ (i.e., $27$\ students) were younger students
`r if (knitr::is_latex_output()) {
'(Fig.\\ \\@ref(fig:SamplesStrat), right panel).'
} else {
'(see animation below).'
}`
This is a *proportional* stratified sample.\index{Sampling!proportional}
:::
```{r SamplesStrat, fig.align="center", fig.width=8, fig.height=5, out.width='100%', fig.cap="Two stratified sampling methods for taking a random sample of size\\ $40$ from a class of\\ $441$. Left: equal numbers of younger and older students. Right: proportional numbers of younger and older students. Triangles $\\bigtriangledown$ represent younger students, circles $\\bigcirc$ represent older students, and filled shapes represent those individuals selected in the sample." }
if (knitr::is_latex_output()){
par( mfrow = c(1, 2),
mar = c(2, 0.5, 3, 0.5))
source("R/showSampleStratified.R")
showSampleStratified(static = TRUE,
proportionA = 2/3, # POPULATION proportion younger
sampleA = 1/2, # Equal age groups
main = "Stratified: equal size sample groups")
showSampleStratified(static = TRUE,
proportionA = 2/3, # POPULATION proportion younger
sampleA = 2/3, # SAMPLE proportion younger: same as population
main = "Stratified: proportional sample groups")
}
```
```{r, animation.hook="gifski", interval=0.2, fig.height = 6.5, dev=if (is_latex_output()){"pdf"}else{"png"}}
# STRATIFIED, by age group: 20:20
if (knitr::is_html_output()){
source("R/showSampleStratified.R")
showSampleStratified(static = FALSE,
proportionA = 2/3,
sampleA = 1/2, # Equal age groups
main = "Stratified: equal age groups in sample")
}
```
`r if (knitr::is_html_output()) {
'Similarly, the second animation below shows how a stratified random sample of size\\ $40$ might be selected, by randomly selecting $27$\\ younger and $13$\\ older students.'
}`
```{r, animation.hook="gifski", interval=0.2, fig.height = 6.5, dev=if (is_latex_output()){"pdf"}else{"png"}}
# STRATIFIED, by age group: 27:13
if (knitr::is_html_output()){
source("R/showSampleStratified.R")
showSampleStratified(static = FALSE,
proportionA = 2/3,
sampleA = 2/3, # Same as population
main = "Stratified: proportional age groups in sample")
}
```
### Cluster sampling {#ClusterSampling}
\index{Sampling!random!cluster}
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-view-5-240.png" width="50px"/>
</div>
In *cluster sampling*, the population is split into a *large* number of *small* groups called *clusters*.
Then, a *simple random sample* of clusters is selected, and *every* member of the chosen clusters becomes part of the sample.
Every individual in the population must be in one, and only one, cluster.
::: {.example name="Cluster sampling"}
For the study in Example\ \@ref(exm:Typing), a simple random sample of (say) three of the many small-group classes for the course could be selected, and *every* student enrolled in those selected classes is included in the sample
`r if (knitr::is_latex_output()) {
'(Fig.\\ \\@ref(fig:SamplesB), left panel).'
} else {
'(see animation below).'
}`
By chance, the chosen classes produce a sample size of $n = 47$ ($31$\ younger; $16$\ older).
:::
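Cluster sampling can be sketched in R as follows. The sketch is illustrative only: the number of classes ($30$), the random allocation of students to classes, and the object names are all assumptions, since the actual class lists are not given here.

```r
# Hypothetical population: 441 students allocated to 30 small-group classes
set.seed(2)  # For reproducibility only
population <- data.frame(
  id       = 1:441,
  ageGroup = sample(rep(c("Younger", "Older"), times = c(294, 147))),
  class    = sample(rep(1:30, length.out = 441))
)

# Select a simple random sample of three classes (the clusters)...
chosenClasses <- sample(unique(population$class), size = 3)

# ...then include EVERY student enrolled in the chosen classes
clusterSample <- population[population$class %in% chosenClasses, ]

nrow(clusterSample)            # The sample size depends on which classes are chosen
table(clusterSample$ageGroup)
```

Unlike stratified sampling, the final sample size is not fixed in advance: it depends on the sizes of the clusters that happen to be chosen.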
```{r animation.hook="gifski", fig.height = 6.5, interval=0.75, dev=if (is_latex_output()){"pdf"}else{"png"}}
if (knitr::is_html_output()) {
source("R/showSampleCluster.R")
showSampleCluster(static = FALSE)
}
```
```{r SamplesB, fig.align="center", fig.width=8, fig.height=5, out.width='100%', fig.cap="Cluster sampling (left) and multi-stage sampling (right) for taking a random sample of size approximately\\ $40$. Classes shown bold and shaded represent classes selected to be in the sample in the first stage. Triangles $\\bigtriangledown$ represent younger students, circles $\\bigcirc$ represent older students, and filled shapes represent those individuals selected in the sample." }
if (knitr::is_latex_output()){
par( mfrow = c(1, 2),
mar = c(2, 3.5, 2, 0.5))
source("R/showSampleCluster.R")
source("R/showSampleMultistage.R")
showSampleCluster()
showSampleMultistage()
}
```
### Multi-stage sampling {#MultistageSampling}
\index{Sampling!random!multi-stage}
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-view-4-240.png" width="50px"/>
</div>
In *multi-stage sampling*, larger collections of individuals are selected using a *simple random sample*, then smaller collections of individuals *within* those large collections are selected using a *simple random sample*.
The simple random sampling continues for as many levels as necessary, until individuals are being selected (at random).
::: {.example name="Multi-stage sampling"}
For the study in Example\ \@ref(exm:Typing), a *simple random sample* of $10$ of the many small-group classes could be selected (Stage\ 1), and then four students could be *randomly* selected from each of these $10$ selected classes (Stage\ 2)
`r if (knitr::is_latex_output()) {
'(Fig.\\ \\@ref(fig:SamplesB), right panel).'
} else {
'(see animation below).'
}`
The sample size is $10\times 4 = 40$, comprising (by chance) $24$\ younger students and $16$\ older students.
:::
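The two stages can be sketched in R. As before, this is a minimal illustration under assumptions: the $30$ classes, the allocation of students to classes, and the object names are hypothetical.

```r
# Hypothetical population: 441 students allocated to 30 small-group classes
set.seed(3)  # For reproducibility only
population <- data.frame(
  id    = 1:441,
  class = sample(rep(1:30, length.out = 441))
)

# Stage 1: a simple random sample of 10 classes
stage1Classes <- sample(unique(population$class), size = 10)

# Stage 2: a simple random sample of 4 students WITHIN each chosen class
stage2Rows <- unlist(lapply(stage1Classes, function(cl) {
  inClass <- which(population$class == cl)
  inClass[sample.int(length(inClass), 4)]
}))
multistageSample <- population[stage2Rows, ]

nrow(multistageSample)         # 10 classes x 4 students = 40
table(multistageSample$class)  # Four students from each chosen class
```

Random selection is used at every stage, but only some students in each chosen class are selected; in a cluster sample, every student in each chosen class is selected.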
```{r animation.hook="gifski", interval=0.75, fig.height = 6.5, dev=if (is_latex_output()){"pdf"}else{"png"}}
if (knitr::is_html_output()){
source("R/showSampleMultistage.R")
showSampleMultistage(static = FALSE)
}
```
::: {.example name="Multi-stage sampling"}
Multi-stage sampling is often used by national statistical agencies.
For example, to obtain a multi-stage random sample from a country:
* *Stage\ 1*: Randomly select some cities in the nation.
* *Stage\ 2*: Randomly select some suburbs in these chosen cities.
* *Stage\ 3*: Randomly select some streets in these chosen suburbs.
* *Stage\ 4*: Randomly select some houses in these chosen streets.
This is cheaper than simple random sampling, as data collectors can be deployed in a small number of cities (only those chosen in Stage\ 1).
:::
### Comparing the samples
The different random sampling methods produce different samples, with different proportions of younger and older students (Table\ \@ref(tab:samplesSummaryTable)).
Of course, repeating the random sampling processes would produce different samples each time.
In all cases, we end up studying *one* of the countless possible samples.
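The variation from sample to sample can be seen by repeating a sampling process a few times. The sketch below repeats simple random sampling five times, assuming a hypothetical population of $294$\ younger and $147$\ older students; the counts and object names are assumptions for illustration only.

```r
# Hypothetical population: 294 younger and 147 older students (assumed counts)
set.seed(4)  # For reproducibility only
ageGroup <- rep(c("Younger", "Older"), times = c(294, 147))

# Take five simple random samples of size 40 (without replacement),
# recording the percentage of younger students in each sample
percentYounger <- replicate(5, {
  chosen <- sample(ageGroup, size = 40)
  100 * mean(chosen == "Younger")
})

percentYounger  # The percentage usually differs from sample to sample
```

Each value is the percentage of younger students in *one* possible sample; a study only ever observes one of these.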
```{r samplesSummaryTable}
samplingTable <- array( dim = c(6, 4) )
colnames(samplingTable) <- c("Younger",
"Older",
"Total",
"Percentage younger")
rownames(samplingTable) <- c("Simple random sample",
"Systematic sample",
"Stratified sample: equal",
"Stratified sample: proportional",
"Cluster sample",
"Multi-stage sample")
samplingTable[1, ] <- c(26, 14, 40, 65.0)
samplingTable[2, ] <- c(31, 9, 40, 77.5)
samplingTable[3, ] <- c(20, 20, 40, 50.0)
samplingTable[4, ] <- c(27, 13, 40, 67.5)
samplingTable[5, ] <- c(31, 16, 47, 66.0)
samplingTable[6, ] <- c(24, 16, 40, 60.0)
if (knitr::is_latex_output()){
knitr::kable(pad( samplingTable,
surroundMaths = TRUE,
targetLength = c(2, 2, 2, 3),
decDigits = c(0, 0, 0, 1) ),
format = "latex",
booktabs = TRUE,
escape = FALSE,
caption = "A summary of the various random samples selected using different random sampling methods. In the population, $66.7$\\% of students are younger students.",
linesep = c("", "\\addlinespace", "", "\\addlinespace", "", ""),
align = c("c", "c", "c", "c") ) %>%
kableExtra::kable_styling(font_size = 8) %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, bold = TRUE) %>%
add_header_above( c(" " = 1,
"Number of students selected" = 3,
" " = 1),
bold = TRUE,
line = TRUE)
} else {
knitr::kable(pad( samplingTable,
surroundMaths = TRUE,
targetLength = c(2, 2, 2, 3),
decDigits = c(0, 0, 0, 1) ),
format = "html",
booktabs = TRUE,
escape = FALSE,
caption = "A summary of the various random samples selected using different random sampling methods. In the population, $66.7$% of students are younger students.",
linesep = c("", "\\addlinespace", "", "\\addlinespace", "", ""),
align = c("c", "c", "c", "c") ) %>%
kableExtra::kable_styling(font_size = 8) %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, bold = TRUE) %>%
add_header_above( c(" " = 1,
"Number of students selected" = 3,
" " = 1),
bold = TRUE,
line = TRUE)
}
```
## Representative sampling {#Representative-samples}
\index{Sampling!representative}
Obtaining a truly random sample is usually hard or impossible in practice.
Sometimes the best compromise is to select a sample sufficiently diverse so that it is likely to be *somewhat representative* of the diversity in the population.
That is, those *in* the sample are not likely to be different from those *not in* the sample, at least for the variables of interest.
This is often the only practical way to sample.
As always, the results from any non-random sample *may not generalise* to the intended population (but will generalise to the population which the sample *does* represent).
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-thisisengineering-3912979.jpg" width="200px"/>
</div>
::: {.example name="Representative sample"}
Suppose we wish to evaluate the functionality of two types of hand prosthetics.
If a randomly chosen group of Alaska and Texas residents is asked for their feedback, their views would probably (but not certainly) be similar to those of all Americans.
No obvious reason exists why residents of Alaska and Texas would be very different from residents in the rest of the United States, regarding their views of hand-prosthetic functionality.
Even though the sample is not a random sample of all Americans, the results *may* generalise to all Americans (though we cannot be sure).
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-juergen-striewski-301048.jpg" width="200px"/>
</div>
::: {.example #AirConUse name="Non-representative samples"}
Suppose we wish to determine the average time per day that American households use their air-conditioners for *cooling* in summer.
A sample of Texas residents would not be expected to represent all Americans: it would *over*-estimate the average number of hours air-conditioners are used for *cooling* in summer.
In this case, those *in* the sample are very different to those *not in* the sample, regarding their air-conditioner usage for cooling in summer.
In contrast, suppose a sample of Alaskans was asked the same question.
This sample would not represent all Americans either (it would *under*-estimate the average).
Again, those *in* the sample are likely to be very different to those *not in* the sample, regarding their air-conditioner usage for *cooling* in summer.
:::
Sometimes, a *combination* of sampling methods is used.\index{Sampling!combination of methods}
::: {.example name="A combination of sampling methods"}
In a study of pathogens present on magazines in doctors' surgeries in Dublin, some suburbs can be selected at *random*, and then (within each suburb) all surgeries are contacted, and some surgeries *volunteer* to be part of the study.
:::
:::{.exampleExtra data-latex=""}
In a study of diets of children at child-care centres, researchers used samples in\ 2010 and\ 2016, described as follows [@larson2019staff, p.\ 336]:
> In\ 2010, a stratified random sampling procedure was used to select representative cross-sections of providers working in licensed center-based programs and licensed providers of family home-based care from publicly available lists.
> [...] Additional participants were also recruited in\ 2016 using a combination of stratified random and open, convenience-based sampling.
:::
Sometimes, practicalities dictate how the sample is obtained, which may result in a non-random sample.
Even so, the impact of using a non-random sample on the conclusions should be discussed (Chap.\ \@ref(Interpretation)).
Sometimes, simple steps can be taken to obtain a sample that is *more likely* to be representative.
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Random samples are often difficult to obtain, and sometimes *representative* samples are the best that can be achieved.
In a representative sample, those *in* the sample are not obviously different from those *not in* the sample.
Try to ensure that a broad cross-section of the target population appears in the sample.
:::
Even if a random or representative sample cannot be obtained, the study can still be useful.
The results still apply to the population represented by the sample.
If individuals in the sample are unlikely to be different from individuals *not* in the sample, for the variables important to the study, the results are likely to approximately apply to the population.
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-lina-kivaka-3395280.jpg" width="200px"/>
</div>
::: {.example #RepresentativeUni name="Representative sample"}
For the typing study in Example\ \@ref(exm:Typing), selecting only students who attend the gym, or only students who are at a certain `r readr::parse_character( c("Café"), locale = locale(encoding = "UTF-8"))`, is unlikely to produce a sample that is somewhat representative of the student population.
Instead, the researchers could approach:
* Students at the `r readr::parse_character( c("Café"), locale = locale(encoding = "UTF-8"))` on Monday at\ $8$am;
* Students at the gym on Tuesday at\ $11$:$30$am; and
* Students entering the Library on Thursday at\ $2$pm.
*This is not a random sample*, but does contain a variety of students.
Ideally, *students would not be included more than once in our sample*, though this is often difficult to ensure.
The students *in* the sample are probably somewhat similar to those *not* in the sample in terms of average typing speeds (there is no obvious reason why they would not be), but we cannot be sure.
:::
`r if (knitr::is_latex_output()) '<!--'`
<iframe src='https://www.ferendum.com/en/embeded.php?pregunta_ID=1252845&sec_digit=1251027117&embeded_digit=219259060' style='width:100%; height:500px; overflow: auto; background: #badaff33;' frameBorder='0'></iframe><BR>
<A href='https://www.ferendum.com' target='_blank'>Free Online Poll Maker</A>
`r webexercises::hide()`
The researchers take a random sample from *each* of the large groups (cases).
This is a **stratified sample**.
`r webexercises::unhide()`