
# Summarising qualitative data {#SummariseQualData}
\index{Qualitative data!summarising|(}


<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/12-SummaryQual-HTML.Rmd'} else {'./introductions/12-SummaryQual-LaTeX.Rmd'}}
```


<!-- Define colours as appropriate -->
```{r, child = if (knitr::is_html_output()) {'./children/coloursHTML.Rmd'} else {'./children/coloursLaTeX.Rmd'}}
```


## Introduction {#SummaryQual-Intro}

Many quantitative research studies involve qualitative variables.
Except for very small amounts of data, understanding the data is difficult without a summary.
As with quantitative data, qualitative data can be understood by knowing how often values of the variables appear.
This is called the *distribution* of the data (Def.\ \@ref(def:Distribution)).\index{Distribution!qualitative data}

The distribution can be displayed using a frequency table (Sect.\ \@ref(QualitativeTables)) or a graph (Sect.\ \@ref(QualitativeGraphs)).
Qualitative data can be summarised by finding modes or, for ordinal qualitative data, using medians (Sect.\ \@ref(SummariseDataQualitative)).
The distribution of qualitative data can be summarised numerically by computing proportions, percentages (Sect.\ \@ref(QualitativeProportionsPercentages)) or odds (Sect.\ \@ref(QualOdds)).


## Frequency tables for qualitative data {#QualitativeTables}
\index{Qualitative data!frequency table}

Qualitative data are typically collated in a *frequency table*.\index{Frequency table!qualitative data}
The rows (or the columns) should list the *levels* of the variable, and these should be *exhaustive* (cover all levels) and *mutually exclusive* (observations belong to only one level).\index{Levels}
The number of observations or the percentage of observations (or both) are then given for each level.

For *nominal* data, the levels of the variable can be displayed in alphabetical order, in order of size, by personal preference, or in some other way: use the order most likely to be useful to readers.
For *ordinal* data, the natural order of the levels should almost always be used.


::: {.example #AVstudy name="Opinions of autonomous vehicles"}
@pyrialakou2020perceptions surveyed $400$\ residents of Phoenix (Arizona) about their opinions of autonomous vehicles (AVs).
Demographic information (Table\ \@ref(tab:AVtable1)) and respondents' opinions of sharing roads with AVs (Table\ \@ref(tab:AVtable2)) were recorded.

The gender of the respondent is *nominal* (two levels), while the age group is *ordinal* (six levels).
The levels are shown in the rows.
The three questions about safety (Table\ \@ref(tab:AVtable2)) all yield *ordinal* responses (five levels, in columns). 
:::
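
As a minimal sketch (not the code used to produce the tables in this chapter), the age-group counts reported for the AV study can be collated into a simple frequency table in R.
The object names `ageCounts`, `agePercent` and `ageTable` are illustrative only.

```r
# Age-group counts as reported for the AV study (n = 400),
# listed in their natural (ordinal) order
ageCounts <- c("18 to 24" = 52, "25 to 34" = 76, "35 to 44" = 76,
               "45 to 54" = 72, "55 to 64" = 56, "65+" = 68)

# Percentage of respondents in each level
agePercent <- 100 * ageCounts / sum(ageCounts)

# Collate the counts and percentages; one row per level
ageTable <- data.frame(Number = ageCounts,
                       Percentage = round(agePercent, 0))
ageTable
```

Because the levels are exhaustive and mutually exclusive, the counts sum to the sample size ($400$) and the percentages sum to $100$%.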

```{r}
AVtable1 <- array( dim = c(8, 2) )

rownames(AVtable1) <- c("Female",
                        "Male",
                        "$18$ to $24$",
                        "$25$ to $34$",
                        "$35$ to $44$",
                        "$45$ to $54$",
                        "$55$ to $64$",
                        "$65+$")
colnames(AVtable1) <- c("Number",
                        "Percentage")
AVtable1[1, ] <- c(204, 51)
AVtable1[2, ] <- c(196, 49)
AVtable1[3, ] <- c(52, 13)
AVtable1[4, ] <- c(76, 19)
AVtable1[5, ] <- c(76, 19)
AVtable1[6, ] <- c(72, 18)
AVtable1[7, ] <- c(56, 14)
AVtable1[8, ] <- c(68, 17)


###########

AVtable2 <- array( dim = c(10, 4) )

colnames(AVtable2) <- c("",
                        "Driving near an AV",
                        "Cycling near an AV",
                        "Walking near an AV")
#rownames(AVtable2) <- c("Unsafe",
#                        "Somewhat unsafe",
#                        "Neutral",
#                        "Somewhat safe",
#                        "Safe")

rownames(AVtable2) <- c("Unsafe", "",
                   "Somewhat unsafe", "",
                   "Neutral", "",
                   "Somewhat safe", "",
                   "Safe", "")

AVtable2[ c(1, 3, 5, 7, 9), 2] <- c(58, 79, 96, 97, 70)
AVtable2[ c(1, 3, 5, 7, 9), 3] <- c(77, 104, 87, 76, 56)
AVtable2[ c(1, 3, 5, 7, 9), 4] <- c(63, 86, 103, 82, 66)

AVtable2[ c(2, 4, 6, 8, 10), 2] <- round( AVtable2[ c(1, 3, 5, 7, 9), 2] / 400 * 100, 0)
AVtable2[ c(2, 4, 6, 8, 10), 3] <- round( AVtable2[ c(1, 3, 5, 7, 9), 3] / 400 * 100, 0)
AVtable2[ c(2, 4, 6, 8, 10), 4] <- round( AVtable2[ c(1, 3, 5, 7, 9), 4] / 400 * 100, 0)

AVtable2[, 1] <- c("$n$", "Percent",
                        "$n$", "Percent",
                        "$n$", "Percent",
                        "$n$", "Percent",
                        "$n$", "Percent")

AVdemographics <- AVtable1[, 1]

AVquestions <- as.numeric( AVtable2[ c(1, 3, 5, 7, 9), 2:4] )
dim(AVquestions) <- c(5, 3)
rownames(AVquestions) <- rownames(AVtable2)[ c(1, 3, 5, 7, 9)]
colnames(AVquestions) <- colnames(AVtable2)[2:4]


```

```{r AVtable1}
if (knitr::is_html_output()) {
   knitr::kable(pad(AVtable1,
                   surroundMaths = TRUE,
                   targetLength = c(3, 2),
                   decDigits = 0),
               format = "html",
               longtable = FALSE,
               escape = FALSE,
               align = c("c", "c"),
               booktabs = TRUE,
               caption = "Demographic information for the AV data for $400$ respondents.") %>%
    row_spec(0, bold = TRUE) %>%
    pack_rows( "Gender",
               start_row = 1,
               end_row = 2) %>%
    pack_rows( "Age group",
               start_row = 3,
               end_row = 8)
} else {
  knitr::kable(pad(AVtable1,
                   surroundMaths = TRUE,
                   targetLength = c(3, 2),
                   decDigits = 0),
               format = "latex",
               longtable = FALSE,
               escape = FALSE,
               align = c("c", "c"),
               booktabs = TRUE,
               caption = "Demographic information for the AV data for $400$ respondents.") %>%
    kable_styling(font_size = 8) %>%
    row_spec(0, bold = TRUE) %>%
    pack_rows( "Gender ($n = 400$)",
               start_row = 1,
               escape = FALSE,
               end_row = 2) %>%
    pack_rows( "Age group ($n = 400$)",
               start_row = 3,
               escape = FALSE,
               end_row = 8) 
}
```


```{r AVtable2}
AVtable2T <- t( AVtable2)
colnames(AVtable2T) <- AVtable2T[1, ]
AVtable2T <- AVtable2T[-1, ]
colnames(AVtable2T) <- rep( c("$n$", "\\%"), 5)


if (knitr::is_html_output()) {
   knitr::kable(pad( AVtable2T,
                     surroundMaths = TRUE,
                     targetLength = c(2, 2, 3, 2, 3, 2, 2, 2, 2, 2),
                     decDigits = 0),
               format = "html",
               longtable = FALSE,
               escape = FALSE,
               align = "c",  # Otherwise adds a space after five lines... 
               booktabs = TRUE,
               caption = "Responses to three scenarios for the AV data for $400$ respondents (rows sum to $n = 400$).") %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(1, bold = TRUE) %>%
    add_header_above( c(" " = 1,
                        "Unsafe" = 2,
                        "unsafe" = 2,
                        "Neutral" = 2,
                        "safe" = 2,
                        "Safe" = 2),
                      align = "c",
                      line = TRUE,
                      bold = TRUE)  %>%
    add_header_above( c(" " = 1,
                        " " = 2,
                        "Somewhat" = 2,
                        " " = 2,
                        "Somewhat" = 2,
                        " " = 2),
                      align = "c",
                      line = FALSE,
                      bold = TRUE)
} else {
  
  knitr::kable( pad( AVtable2T,
                     surroundMaths = TRUE,
                     targetLength = c(2, 2, 3, 2, 3, 2, 2, 2, 2, 2),
                     decDigits = 0),
                format = "latex",
                booktabs = TRUE,
                caption = "Responses to three scenarios for the AV data for $400$ respondents (rows sum to $n = 400$).",
                align = "c",
                longtable = FALSE,
                escape = FALSE) %>%
    kable_styling(font_size = 8) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(1, bold = TRUE) %>%
    add_header_above( c(" " = 1,
                        "Unsafe" = 2,
                        "unsafe" = 2,
                        "Neutral" = 2,
                        "safe" = 2,
                        "Safe" = 2),
                      align = "c",
                      line = TRUE,
                      bold = TRUE)  %>%
    add_header_above( c(" " = 1,
                        " " = 2,
                        "Somewhat" = 2,
                        " " = 2,
                        "Somewhat" = 2,
                        " " = 2),
                      align = "c",
                      line = FALSE,
                      bold = TRUE)
}
```          


## Graphs for qualitative data {#QualitativeGraphs}
\index{Qualitative data!graphs}\index{Graphs!qualitative data}\index{Software output!graphs}


Three options for graphing qualitative data include:

* *Dot charts* (Sect.\ \@ref(DotChartsOneQual)):
  usually a good choice.
* *Bar charts* (Sect.\ \@ref(BarCharts)):
  usually a good choice.
* *Pie charts* (Sect.\ \@ref(PieCharts)):
  only useful in special circumstances, and can be hard to interpret.

Sometimes these graphs are used for *discrete* quantitative data with a small number of possible options.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.
:::


### Dot charts (qualitative data) {#DotChartsOneQual}
\index{Graphs!dot chart!one qualitative variable}

Dot charts indicate the counts (or corresponding percentages) in each level using dots (or some other symbol).
The levels can be on the horizontal or vertical axis, and the counts or percentages on the other.
Placing the levels on the vertical axis often makes for easier reading, and leaves space for long labels.


::: {.example #DotPlotsQual name="Dot plots"}
For the AV study in Example\ \@ref(exm:AVstudy), a dot chart of the age group of respondents is shown in Fig.\ \@ref(fig:AVDotBarPie) (top left panel).
:::

```{r, AVDotBarPie, fig.align='center', fig.cap="The age group of respondents in the AV study. All graphs present the same data.", out.width='100%', fig.width=9, fig.height=5.25, fig.show='hold'}

par(mfrow = c(2, 2))
par( mar = c(6, 4, 4, 2) + 0.1 )

if( knitr::is_latex_output()) {
  cols <- grey.colors(n = 6,
                      start = 0.15,
                      end = 0.9)
} else {
  cols <- viridis(9)[3:8]
}

AVdemographics2 <- AVdemographics
AVdemographics2.names <- names(AVdemographics2)

AVdemographics2.names <- gsub(pattern = "\\$",
                              replacement = "",
                              x = AVdemographics2.names)
names(AVdemographics2) <- AVdemographics2.names


dotchart(AVdemographics2[3:8],
         pch = 21,
         labels = rep("                               ", 6), # Otherwise, coloured and hard to read; reinstate after
         xlim = c(0, 80),
         xlab = "Numbers of respondents",
         main = "Age group of respondents\nin the AV study",
         bg = cols)
text(labels = AVdemographics2.names[3],
     x = -5,
     y = 1,
     adj = 1)
text(labels = AVdemographics2.names[4],
     x = -5,
     y = 2,
     adj = 1)
text(labels = AVdemographics2.names[5],
     x = -5,
     y = 3,
     adj = 1)
text(labels = AVdemographics2.names[6],
     x = -5,
     y = 4,
     adj = 1)
text(labels = AVdemographics2.names[7],
     x = -5,
     y = 5,
     adj = 1)
text(labels = AVdemographics2.names[8],
     x = -5,
     y = 6,
     adj = 1)
###

barplot(AVdemographics2[3:8],
        las = 2,
        ylim = c(0, 80),
        ylab = "Numbers of respondents",
        main = "Age group of respondents\nin the AV study",
        col = cols)

###
par( mar = c(0.4, 0.4, 3.4, 0.4) + 0.1 )

pie(AVdemographics2[3:8],
    main = "Age group of respondents\nin the AV study",
    col = cols)

###

par( mar = c(0.1, 0.1, 3.4, 0.1) + 0.1 )

# Add some spaces so the labels do not overlap pie chart
names(AVdemographics2)[3] <- paste0("       ", names(AVdemographics2)[3])
names(AVdemographics2)[5] <- paste0(names(AVdemographics2)[5], "         ")
names(AVdemographics2)[6] <- paste0(names(AVdemographics2)[6], "         ")
names(AVdemographics2)[8] <- paste0("   ", names(AVdemographics2)[8] )
plotrix::pie3D( AVdemographics2[3:8],
                main = "Age group of respondents\nin the AV study",
                theta = pi/4,
                labels = names(AVdemographics2[3:8]),
                labelcex = 0.9,
                col = cols)
```

For dot charts:

* place the qualitative variable on the horizontal or vertical axis (and label with the levels of the variable).
* use counts or percentages on the other axis.
* for nominal data, *think about the most helpful order* for the levels.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The axis displaying the counts (or percentages) should *start from zero*, since the distance of the dots from the axis visually implies the frequency of those observations (see Example\ \@ref(exm:VerticalTruncation)).
:::


### Bar charts {#BarCharts}
\index{Graphs!bar chart}

Bar charts use bars to represent the number (or percentage) of observations in each level.
As with dot charts, the levels can be on the horizontal or vertical axis, but placing the level names on the vertical axis often makes for easier reading, and leaves room for long labels.


::: {.example #BarchartQual name="Bar plots"}
For the AV study in Example\ \@ref(exm:AVstudy), a bar chart of the age group of respondents is shown in Fig.\ \@ref(fig:AVDotBarPie) (top right panel).
:::

For bar charts:

* place the qualitative variable on the horizontal or vertical axis (and label with the levels of the variable).
* use counts or percentages on the other axis.
* for nominal data, levels can be ordered any way: *think about the most helpful order*.
* bars have gaps between them, as the bars represent distinct categories.

In contrast to bar charts, the bars in histograms are butted together (except when an interval has a count of zero), as the variable-axis usually represents a continuous numerical scale.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The axis displaying the counts (or percentages) should *start from zero*, since the height of the bars visually implies the frequency of those observations (see Example\ \@ref(exm:VerticalTruncation)).
:::


### Pie charts {#PieCharts}
\index{Graphs!pie chart}

In pie charts, a circle is divided into segments proportional to the number in each level of the qualitative variable.


::: {.example #PieChartsQual name="Pie charts"}
For the AV study in Example\ \@ref(exm:AVstudy), a pie chart of the age group of respondents is shown in Fig.\ \@ref(fig:AVDotBarPie) (bottom left panel).
:::


Using pie charts may present challenges:

* Pie charts only work when graphing parts of a whole.
* Pie charts only work when *all* options are present ('exhaustive').
* Pie charts are difficult to use with levels having zero or small counts (see Fig.\ \@ref(fig:PieSmallCounts)).
* Pie charts are difficult to interpret when many categories are present.
* Pie charts are hard to read: humans compare *lengths* (bar and dot charts) better than *angles* (pie charts) [@data:Friel:Graphs].


::: {.example #PieUnsuitable name="Pie chart unsuitable"}
Consider studying the percentage of people who use Firefox, Chrome, and Safari as web browsers.
A pie chart is *not suitable* for displaying the data: the options are not *mutually exclusive* (i.e., people can use more than one of these browsers) and are not *exhaustive* (i.e., other browsers exist).
:::


### Comparing dot, bar and pie charts {#CompareBarPie}
\index{Graphs!bar chart!compared to other graphs}\index{Graphs!pie chart!compared to other graphs}\index{Graphs!dot chart!compared to other graphs}

Consider the pie chart in Fig.\ \@ref(fig:AVDotBarPie) (bottom left panel).
Determining *which* age groups have the fewest and the most respondents is hard.
The equivalent bar chart or dot chart makes the comparison easy: the youngest age group has the fewest respondents, while the\ $25$ to\ $34$ and\ $35$ to\ $44$ age groups have the most.
The *tilted* pie chart makes this comparison even harder (Fig.\ \@ref(fig:AVDotBarPie), bottom right panel).

Recall that the *purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data*.
A pie chart often makes the message hard to see [@siegrist1996use].\index{Graphs!pie chart!warnings}


<iframe src="https://learningapps.org/watch?v=pf4om4k5t22" style="border:0px;width:100%;height:500px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>


## Numerical summary: proportions and percentages {#QualitativeProportionsPercentages}
\index{Proportions}\index{Percentages}

Qualitative data can be summarised numerically by using the *proportion* or *percentage* of individuals in each level.
These can be given instead of, or with, the counts (Tables\ \@ref(tab:AVtable1) and\ \@ref(tab:AVtable2)).


::: {.definition #Proportion name="Proportion"}
A *proportion* is a fraction out of a total, and is a number between\ $0$ and\ $1$.
:::


::: {.definition #Percentage name="Percentages"}
A *percentage* is a proportion, multiplied by\ $100$.
In this context, percentages are numbers between\ $0$% and\ $100$%.
:::


*Population* proportions are almost always unknown. 
Instead, the *population* proportion (the parameter), denoted\ $p$, is estimated by a *sample* proportion (a statistic), denoted by\ $\hat{p}$.\index{Estimate}


::: {.pronounceBox .pronounce data-latex="{iconmonstr-microphone-7-240.png}"}
The symbol\ $\hat{p}$ is pronounced 'pee-hat', and refers to a *sample* proportion.
The caret above the\ $p$ is called a 'hat'.
:::


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
As always, only one possible sample is studied.
*Statistics* are estimates of *parameters*, and the value of the *statistic* is not the same for every possible *sample*.
:::


::: {.example #AVProportionsPercentages name="Proportions and percentages"}
Consider the AV data in Table\ \@ref(tab:AVtable1), summarising results from a sample of $n = 400$ respondents.
The *sample proportion* of respondents aged\ $25$ to\ $34$ is $76\div 400$, or\ $0.19$.
The *sample percentage* of respondents aged\ $25$ to\ $34$ is $0.19 \times 100$, or\ $19$%, as in the table.
:::
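
The same arithmetic can be sketched in R, assuming only the counts reported in Table\ \@ref(tab:AVtable1); the object names (`n`, `count25to34`, `pHat`) are illustrative.

```r
n <- 400                  # sample size reported for the AV study
count25to34 <- 76         # number of respondents aged 25 to 34

pHat <- count25to34 / n   # sample proportion: a number between 0 and 1
pctHat <- pHat * 100      # sample percentage: a number between 0 and 100

c(proportion = pHat, percentage = pctHat)
```

Both values describe the sample only; a different sample of $400$ respondents would probably give a different value of\ $\hat{p}$.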


## Numerical summary: odds {#QualOdds}
\index{Odds}

For the AV data in Table\ \@ref(tab:AVtable1), the number of females is slightly larger than the number of males.
More specifically, the *ratio* of females to males is $204\div 196 = 1.04$; that is, there are\ $1.04$ *times* as many females as males.
This value of\ $1.04$ is the *odds* that a respondent is female.
An alternative interpretation is that there are $1.04\times 100 = 104$ females for every\ $100$ males.

While proportions and percentages are computed as the number of results of interest divided by the *total number*, the *odds* are computed as the number of results of interest divided by *the remaining number*.


::: {.definition #Odds name="Odds"}
The *odds* are the number (or proportion, or percentage) of results of interest, divided by the remaining number (or proportion, or percentage) of results:
$$
         \text{Odds} = \frac{\text{Number of results of interest}}{\text{Remaining number of results}}
$$
or (equivalently)
$$
         \text{Odds} 
          = 
            \frac{\text{Proportion of results of interest}}
                 {\text{Remaining proportion of results}}
          = 
            \frac{\text{Percentage of results of interest}}
                 {\text{Remaining percentage of results}}.
$$
The *odds* are how many *times* the result of interest *occurs* compared to the number of times the result of interest does *not occur*.
:::
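
As a worked check that these forms agree, consider the odds that a respondent is female, using the gender counts in Table\ \@ref(tab:AVtable1).
Dividing both the numerator and the denominator by the total ($400$) does not change the ratio:
$$
         \text{Odds (female)}
          = \frac{204}{196}
          = \frac{204/400}{196/400}
          = \frac{0.51}{0.49}
          = \frac{51\%}{49\%}
          \approx 1.04.
$$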


::: {.example #AVOddsMale name="Interpreting odds"}
The AV data (Table\ \@ref(tab:AVtable1)) includes\ $204$ females and $196$\ males.
The *odds* that a respondent is female is\ $1.04$.
The odds are greater than one, as there are more females than males.
Alternatively, there are\ $104$ females for every\ $100$ males.

The *odds* that a respondent is male is $196/204 = 0.96$; there are $0.96$\ *times* as many males as females.
The odds are less than one, as there are fewer males than females.
Alternatively, there are $96$\ males for every\ $100$ females.
:::


When interpreting odds:
  
* odds *greater* than\ $1$ mean the event is *more* likely to happen than not.
* odds *equal to*\ $1$ mean the event is *equally likely* to happen as not.
* odds *less* than\ $1$ mean the event is *less* likely to happen than not.


::: {.example #AVOdds name="Odds and percentages"}
Consider the AV data in Table\ \@ref(tab:AVtable1), summarising results from a sample of $n = 400$ respondents.

The percentage of respondents aged\ $18$ to\ $24$ is $52/400\times 100 = 13$%.
The *odds* that a respondent is aged\ $18$ to\ $24$ is $52/(400 - 52) = 0.15$.
This means that the number of respondents aged\ $18$ to\ $24$ is $0.15$\ times (i.e., less than) the number of respondents aged over\ $24$.

The *odds* that a respondent is aged\ $18$ to\ $54$ is $(52 + 76 + 76 + 72)/(56 + 68) = 2.23$.
This means that the number of respondents aged\ $18$ to\ $54$ is\ $2.23$ times (i.e., greater than) the number of respondents aged\ $55$ or over.
:::
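
These calculations can be sketched in R.
The helper `oddsOf()` below is hypothetical (it is not a function from any package, nor from this book's own code); it divides the count of interest by the remaining count, following Def.\ \@ref(def:Odds).

```r
# Age-group counts as reported for the AV study (n = 400)
ageCounts <- c("18 to 24" = 52, "25 to 34" = 76, "35 to 44" = 76,
               "45 to 54" = 72, "55 to 64" = 56, "65+" = 68)

# Hypothetical helper: odds for one level, or for several levels combined
oddsOf <- function(counts, levels) {
  interest <- sum(counts[levels])            # number of results of interest
  interest / (sum(counts) - interest)        # divided by the remaining number
}

odds18to24 <- oddsOf(ageCounts, "18 to 24")
odds18to54 <- oddsOf(ageCounts,
                     c("18 to 24", "25 to 34", "35 to 44", "45 to 54"))

round(c(odds18to24, odds18to54), 2)

# The same odds, found from the sample proportion instead of the counts
pHat <- 52 / 400
pHat / (1 - pHat)
```

Computing the odds from the counts or from the proportion gives the same value, since the total cancels from the ratio.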


*Population* odds are almost always unknown. 
Instead, the *population* odds (the parameter) is estimated by a *sample* odds (a statistic).
No symbol is commonly used to denote odds.

Take care: proportions and odds are similar, but are different ways of numerically summarising qualitative data (Fig.\ \@ref(fig:PropOdds)).


```{r PropOdds, fig.align="center", out.width='90%', fig.width=9, fig.height=2.25, fig.cap="Proportions (left) are the number of interest divided by the total number. Odds (right) are the number of interest divided by the rest."}
source("R/showPropOdds.R")    
showPropOdds()  
```


## Describing the distribution: modes and medians {#SummariseDataQualitative}
\index{Qualitative data!distribution}

Graphs are constructed to help readers understand the data, so any important features in the graph should be described.
One simple way is to identify the level (or levels) with the *most* observations.
This is called the *mode*.\index{Mode}


::: {.definition #Mode name="Mode"}
A *mode* is the level (or levels) of a qualitative variable with the most observations.
:::


::: {.example #OrdinalModes name="Modes" }
Consider the data in Tables\ \@ref(tab:AVtable1) and\ \@ref(tab:AVtable2):

* The *mode* for gender is 'Female' (with\ $204$ respondents, or\ $51$%).
* The *modal* age groups are $25$ to\ $34$ and $35$ to\ $44$ (each with $76$\ respondents, or\ $19$%).
* The *modal* response to the question about *driving* near AVs is 'Somewhat safe'.
* The *modal* response to the question about *cycling* near AVs is 'Somewhat unsafe'.
* The *modal* response to the question about *walking* near AVs is 'Neutral'.
:::
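
One way to find modes from a set of counts is sketched below in R.
The helper `findModes()` is hypothetical, written only for this illustration; it returns every level that attains the largest count, so ties are reported as more than one mode.

```r
# Counts as reported for the AV study (n = 400)
ageCounts <- c("18 to 24" = 52, "25 to 34" = 76, "35 to 44" = 76,
               "45 to 54" = 72, "55 to 64" = 56, "65+" = 68)
drivingCounts <- c("Unsafe" = 58, "Somewhat unsafe" = 79, "Neutral" = 96,
                   "Somewhat safe" = 97, "Safe" = 70)

# Hypothetical helper: the level (or levels) with the most observations
findModes <- function(counts) {
  names(counts)[counts == max(counts)]
}

findModes(ageCounts)       # two levels share the largest count
findModes(drivingCounts)   # a single mode
```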


*Medians*\index{Median!qualitative ordinal data} can be found for *ordinal* data (but *not* nominal data), since ordinal data have levels with a natural order.
The *median* is the location of the middle response, when the levels from all individuals are placed in order.
The sample median estimates the unknown *population* median.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
Medians can be used to summarise *quantitative data* and *ordinal* data, but *never* nominal data.
:::


::: {.example #OrdinalMedians name="Medians"}
Consider the data in Tables\ \@ref(tab:AVtable1) and\ \@ref(tab:AVtable2).
'Gender' is *nominal* qualitative, so medians are not appropriate.
However, the other variables are *ordinal*, so medians could be used to describe each variable.
Since $n = 400$, the median response will be halfway between the location of the\ $200$th and\ $201$st response when ordered:

* the *median* age group is $35$ to\ $44$.
* the *median* response to the driving-near-AVs question is 'Neutral'.
* the *median* response to the cycling-near-AVs question is 'Neutral'.
* the *median* response to the walking-near-AVs question is 'Neutral'.

For each variable, ordered observations\ $200$ and\ $201$ fall into the indicated level.
:::
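
The reasoning above (locating the middle ordered observations using cumulative counts) can be sketched in R.
The helper `findMedianLevel()` is hypothetical, and assumes the counts are supplied in the natural order of the levels.

```r
# Counts as reported for the AV study (n = 400), in their natural order
ageCounts <- c("18 to 24" = 52, "25 to 34" = 76, "35 to 44" = 76,
               "45 to 54" = 72, "55 to 64" = 56, "65+" = 68)
cyclingCounts <- c("Unsafe" = 77, "Somewhat unsafe" = 104, "Neutral" = 87,
                   "Somewhat safe" = 76, "Safe" = 56)

# Hypothetical helper: the level(s) containing the middle ordered observation(s)
findMedianLevel <- function(counts) {
  n <- sum(counts)
  cumCounts <- cumsum(counts)   # position of the last observation in each level
  # One middle position if n is odd; two if n is even (e.g., 200 and 201 when n = 400)
  middle <- unique(c(floor((n + 1) / 2), ceiling((n + 1) / 2)))
  # The first level whose cumulative count reaches each middle position
  idx <- sapply(middle, function(pos) which(cumCounts >= pos)[1])
  unique(names(counts)[idx])
}

findMedianLevel(ageCounts)
findMedianLevel(cyclingCounts)
```

If the two middle observations fell into different levels, this sketch would return both levels rather than choosing between them.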


Importantly, all these numerical quantities are computed from a sample (i.e., are statistics; Def.\ \@ref(def:Statistic)), even though the whole population is of interest (i.e., the parameter; Def.\ \@ref(def:Parameter)).

Means (Sect.\ \@ref(Mean)) are generally not suitable for numerically summarising qualitative data.
However, *ordinal* data *may be* numerically summarised like quantitative data in *rare and very special circumstances*: when

* the levels are considered equally spaced; *and*
* assigning a number to each level is appropriate (for example, using a mid-point for numerical age groups).

We will not consider means for ordinal data further.


## Numerical summary tables {#QualSummaryTable}
\index{Qualitative data!summary tables}

Qualitative variables should be summarised in a table.
The table should include, as a minimum, numbers and percentages for each level.
While useful in other contexts (see Chap.\ \@ref(CompareQualData)), odds are usually not given in a summary table.
Examples are shown in Tables\ \@ref(tab:AVtable1) and\ \@ref(tab:AVtable2), and in the next section.


## Example: water access {#WaterAccessQual}

@lopez2022farmers recorded data about access to water for three rural communities in Cameroon (see Sect.\ \@ref(WaterAccessQuant)).
Numerous qualitative variables are recorded; some are displayed in Fig.\ \@ref(fig:WaterAcessQual), and summarised in Table\ \@ref(tab:WaterAccessQual).
Notice that the levels of the two ordinal variables are displayed in their natural order.

The distance to the nearest water source is usually less than\ $1\kms$, and the wait is often over\ $15\mins$.
The most common water source is a bore ($68.6$%).


```{r WaterAccessQual}
data(WaterAccess)

WaterAccessTable2A <- array( dim = c(6, 3) )

 
findOdds <- function(x){
  tablex <- table(x)
  oddsTab <- array( dim = length(tablex))

  for (i in 1:length(tablex)){
    oddsTab[i] <- tablex[i] / sum(tablex[-i])
  }
  oddsTab
}

# Summarise one qualitative variable:
# the count, percentage and odds for each level, formatted for display
qualSummary <- function(x){
  tab <- table(x)

  out <- cbind( pad( t(t(tab)),
                     targetLength = 2,
                     decDigits = 0,
                     surroundMaths = TRUE),
                pad( t( t( tab ) / sum(tab) * 100),
                     targetLength = 4,
                     decDigits = 1,
                     surroundMaths = TRUE),
                pad( t( t(findOdds(x)) ),
                     decDigits = 2,
                     targetLength = 4,
                     surroundMaths = TRUE )
  )

  out
}


WaterAccessTable2A <- rbind(qualSummary(WaterAccess$SourceDistance), 
                            qualSummary(WaterAccess$SourceQueueTime) )

rownames(WaterAccessTable2A) <- c("Under $100\\ms$",
                                  "$100\\ms$\\ to $1000\\ms$",
                                  "Over $1000\\ms$",
                                  "Under $5\\mins$",
                                  "$5$ to $15\\mins$",
                                  "Over $15\\mins$") 
 

colnames(WaterAccessTable2A) <- c("Number",
                                  "\\%",
                                  "Odds")


###

WaterAccessTable2B <- array( dim = c(4, 3) )

rownames(WaterAccessTable2B) <- c( levels(WaterAccess$WaterSource) )
 

WaterAccessTable2B <- qualSummary(WaterAccess$WaterSource)

colnames(WaterAccessTable2B) <- c("Number",
                                  "\\%",
                                  "Odds")


if( knitr::is_latex_output() ) {
  T1 <- kable( WaterAccessTable2A,
        format = "latex",
        longtable = FALSE,
        escape = FALSE,
        col.names = colnames(WaterAccessTable2A),
        valign = 't',
        booktabs = TRUE,
        align = c("c")) %>%
    row_spec(0, bold = TRUE) %>%
    pack_rows("Distance to water source ($n = 121$)",
              start_row = 1,
              end_row = 3,
              escape = FALSE,
              bold = TRUE) %>%
    pack_rows("Wait time at water source ($n = 120$)",
              start_row = 4,
              end_row = 6,
              escape = FALSE,
              bold = TRUE)
  
  T2 <- kable( WaterAccessTable2B,
        format = "latex",
        longtable = FALSE,
        escape = FALSE,
        col.names = colnames(WaterAccessTable2B),
        valign = 't',
        booktabs = TRUE,
        align = c("c")) %>%
    row_spec(0, bold = TRUE) %>%
    pack_rows("Water source ($n = 121$)",
              start_row = 1,
              end_row = 4,
              escape = FALSE,
              bold = TRUE)
  
    out <- knitr::kables(list(T1, T2),
                       format = "latex",
                       label = "WaterAccessQual",
                       caption = "Summarising some qualitative data in the water-access study. Left: the ordinal variables. Right: the nominal variable.") %>% 
    kable_styling(font_size = 8)
  out2 <- prepareSideBySideTable(out, 
                                 gap = "\\qquad\\qquad") 
  out2
}
if( knitr::is_html_output() ) {
 kable( rbind(WaterAccessTable2A, 
              WaterAccessTable2B),
               format = "html",
               longtable = FALSE,
               col.names = colnames(WaterAccessTable2A),
               booktabs = TRUE,
               align = c("r"),
               caption = "Summarising some qualitative data in the water-access study: the two ordinal variables, followed by the nominal variable.") %>%
   kable_styling(full_width = FALSE) %>%
    pack_rows("Distance to water source",
              start_row = 1,
              end_row = 3,
              bold = TRUE) %>%
    pack_rows("Wait time at water source",
              start_row = 4,
              end_row = 6,
              bold = TRUE) %>%
    pack_rows("Water source",
              start_row = 7,
              end_row = 10,
              bold = TRUE)
}
```

```{r WaterAcessQual, fig.align="center", fig.cap="The distance to the water source (left), the wait time at the water source (centre), and the water sources (right) for the water-access study. (Some data are missing.)", out.width='100%', fig.width=6.25, fig.height=1.7}

par(mfrow = c(1, 3),
    mar = c(3.75, 0.2, 4, 1.75) )

dotchart( as.numeric(table(WaterAccess$SourceDistance)),
          labels = names(table(WaterAccess$SourceDistance)),
     xlab = "Number",
     ylab = "",
     main = "Distance to\nwater source",
     xlim = c(0, 60),
     las = 2,
     pch = 19)

dotchart( as.numeric(table(WaterAccess$SourceQueueTime)),
          labels = names(table(WaterAccess$SourceQueueTime)),
     xlab = "Number",
     ylab = "",
     main = "Queue wait at\nwater source",
     xlim = c(0, 55),
     las = 2,
     pch = 19)

dotchart( as.numeric(table(WaterAccess$WaterSource)),
          labels = names(table(WaterAccess$WaterSource)),
     xlab = "Number",
     ylab = "",
     main = "Water sources",
     xlim = c(0, 90),
     las = 2,
     pch = 19)
```


\index{Qualitative data!summarising|)}

## Chapter summary {#SummaryQual-Summary}

Qualitative data can be graphed with a dot chart, bar chart or pie chart (in special circumstances).
Qualitative data can be described using the mode or (for *ordinal* data only) a median.
Qualitative data can be numerically summarised using *proportions*, *percentages* or *odds*.
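
These numerical summaries can be computed directly from a table of counts.
The following is a minimal sketch only, using *hypothetical* counts for a nominal variable (the counts and the object names are assumptions for illustration, not data from this chapter):

```r
# Hypothetical counts for a nominal variable (illustration only)
counts <- c(Tap = 12, Well = 30, River = 8)

# Mode: the level (or levels) with the largest count
modeLevel <- names(counts)[counts == max(counts)]

# Proportions (between 0 and 1) and percentages (between 0 and 100)
props <- counts / sum(counts)
percents <- 100 * props

# Odds of 'Well': how often it occurs, divided by how often it does not
oddsWell <- counts[["Well"]] / (sum(counts) - counts[["Well"]])

modeLevel
round(percents, 1)
oddsWell
```

For these hypothetical counts, the odds are $30/20 = 1.5$: 'Well' appears $1.5$\ times as often as all other levels combined.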


## Quick review questions {#SummaryQual-QuickReview}

Are the following statements *true* or *false*?

::: {.webex-check .webex-box}
1. Nominal data can be summarised using a median. \tightlist
`r if( knitr::is_html_output() ) {torf(answer = FALSE)}`
1. Ordinal data can be summarised using a mode.
`r if( knitr::is_html_output() ) {torf(answer = TRUE)}`
1. Odds are the ratio of how often a result of interest occurs, to how often it does *not* occur.
`r if( knitr::is_html_output() ) {torf(answer = TRUE)}`
1. Proportions and percentages are the same.
`r if( knitr::is_html_output() ) {torf(answer = FALSE)}`
:::
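
A median requires levels with a natural order, which is why it applies to ordinal data but not nominal data.
As a rough sketch, assuming *hypothetical* counts for a three-level ordinal variable (not data from this chapter), the median level can be located from the cumulative counts:

```r
# Hypothetical counts for an ordinal variable, listed in level order
counts <- c(Low = 7, Medium = 10, High = 4)
n <- sum(counts)                      # 21 observations

# Position of the middle observation (n is odd here)
middle <- (n + 1) / 2                 # the 11th ordered observation

# The median level is the first level whose cumulative count reaches 'middle'
cumCounts <- cumsum(counts)           # Low = 7, Medium = 17, High = 21
medianLevel <- names(counts)[ which(cumCounts >= middle)[1] ]
medianLevel
```

When $n$ is even, the two middle observations may fall in different levels, so the median may not be a single level.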


## Exercises {#SummariseQualDataExercises}

[Answers to odd-numbered exercises] are given at the end of the book. 

`r if( knitr::is_latex_output() ) "\\captionsetup{font=small}"`

::: {.exercise #SpiderMonkeys}
A study of spider monkeys [@data:Chapman1990:SpiderMonkeys] examined the types of social groups present
`r if (knitr::is_latex_output()) {
   '(Table\\ \\@ref(tab:SpiderMonkeysLATEX)).'
} else {
   '(Table\\ \\@ref(tab:SpiderMonkeysHTML)).'
}`
Construct a suitable plot, and explain what the data reveal.
:::


```{r SpiderMonkeysData}
namesSM <- c("Solitary", 
             "All males", 
             "Female + no young", 
             "Mixed + young", 
             "Mixed + no young", 
             "One female + offspring", 
             "Many females + offspring",
             NA)

SpiderMonkeyTable <- array( dim = c(8, 2) )
SpiderMonkeyTable[, 2] <- c(8, 3, 2, 15, 1, 23, 48, NA)
SpiderMonkeyTable[, 1] <- namesSM

rownames(SpiderMonkeyTable) <- namesSM
```

```{r}
if( knitr::is_latex_output() ) {
   T1 <- kable( pad( SpiderMonkeyTable[1:4,],
                     surroundMaths = TRUE,
                     targetLength = 2,
                     decDigits = 0), 
       format = "latex",
       longtable = FALSE,
       booktabs = TRUE,
       escape = FALSE,
       digits = 1,
       align = c("r", "c"),
       row.names = FALSE,
       col.names = c("Social group",
                     "Number")) %>%
  row_spec(row = 0, 
           bold = TRUE) 

   
  T2 <- kable( pad( SpiderMonkeyTable[5:8,],
                    surroundMaths = TRUE,
                     targetLength = 2,
                     decDigits = 0), 
       format = "latex",
       longtable = FALSE,
       booktabs = TRUE,
       escape = FALSE,
       digits = 1,
       align = c("r", "c"),
       row.names = FALSE,
       col.names = c("Social group",
                     "Number")) %>%
  row_spec(row = 0, 
           bold = TRUE)   
  
  
  out <- knitr::kables(list(T1, T2),
                       format = "latex",
                       label = "SpiderMonkeysLATEX",
                       caption = "Social groups in spider monkeys.") %>% 
    kable_styling(font_size = 8)
  out2 <- prepareSideBySideTable(out, 
                                 gap = "\\qquad") 
  out2
}
```


```{r SpiderMonkeysHTML}
if( knitr::is_html_output() ) {

kable( pad( SpiderMonkeyTable,
                     surroundMaths = TRUE,
                     targetLength = 2,
                     decDigits = 0), 
       format = "html",
       longtable = FALSE,
       booktabs = TRUE,
       escape = FALSE,
       digits = 1,
       align = c("r", "c"),
       row.names = FALSE,
       col.names = c("Social group",
                     "Number"),
       caption = "Social groups in spider monkeys.") %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(row = 0, 
           bold = TRUE) 
}
```


::: {.exercise #QualSummary}
@czarniecka2021consumer studied how Poles prepared and consumed coffee using a sample of $1\,500$\ Poles.
Some data are shown in Table\ \@ref(tab:CoffeePoles).

1. Classify the variables as quantitative, nominal or ordinal.
2. Sketch appropriate graphs for the three variables.
3. Summarise the three variables.
:::

```{r}
Coffee1 <- array(dim = c(5, 2))
Coffee1[, 2] <- c(1432, 687, 922, 994, 1196)
Coffee1[, 1] <- c("Home",
                 "Canteen",
                 "Cafe",
                 "Others' homes",
                 "Work")
T1 <- kable( pad(Coffee1,
                 surroundMaths = TRUE,
                 targetLength = 4,
                 decDigits = 0),
             format = "latex",
             longtable = FALSE,
             booktabs = TRUE,
             escape = FALSE,
             valign = 't',
             col.names = c("Where consumed", "$n$"),
             align = c("r", "c") ) %>%
    row_spec(0, bold = TRUE)


Coffee2 <- array( dim = c(4, 2))
Coffee2[, 2] <- c(748, 269, 453, 30)
Coffee2[, 1] <- c("\\text{$100^\\circ$C}",
                  "\\text{$98^\\circ$C}",
                  "\\text{$93^\\circ$C}",
                  "Unknown")

T2 <- kable( pad(Coffee2,
                 surroundMaths = TRUE,
                 targetLength = 3,
                 decDigits = 0),
             format = "latex",
             longtable = FALSE,
             booktabs = TRUE,
             escape = FALSE,
             valign = 't',
             linesep = c("", "", "\\addlinespace"), 
             col.names = c("Brew temp.", "$n$"),
             align = c("r", "c") ) %>%
    row_spec(0, bold = TRUE)


Coffee3 <- array( dim = c(6, 2) )
Coffee3[, 2] <- c(226, 267, 114, 82, 30, 781)
Coffee3[, 1] <- c("Under $3$ mins",
                 "$3$ mins",
                 "$4$ mins",
                 "$5$ mins",
                 "$6$ mins",
                 "Unknown"
                 )
T3 <- kable( pad(Coffee3,
                 surroundMaths = TRUE,
                 targetLength = 3,
                 decDigits = 0),
             format = "latex",
             longtable = FALSE,
             booktabs = TRUE,
             escape = FALSE,
             valign = 't',
             col.names = c("Brew time", "$n$"),
             align = c("r", "c") ) %>%
    row_spec(0, bold = TRUE)


if( knitr::is_latex_output() ) {
out <- knitr::kables( list(T1, T2, T3),
                        format = "latex",
                        label = "CoffeePoles",
                        caption = "Location of coffee consumption, brewing temperature and brewing time, from $1\\,500$ Poles.") %>% 
    kable_styling(font_size = 8)
  out2 <- prepareSideBySideTable(out) 
  out2
} else { # HTML
  knitr::kable( rbind(Coffee1,
                      Coffee2,
                      Coffee3),
                      format = "html",
                      label = "CoffeePoles",
                      caption = "Location of coffee consumption, brewing temperature and brewing time, from $1\\,500$ Poles.") %>%
    pack_rows("Where consumed",
              start_row = 1,
              end_row = 5,
              bold = TRUE) %>%
    pack_rows("Brewing temperature",
              start_row = 6,
              end_row = 9,
              bold = TRUE) %>%
    pack_rows("Brewing time",
              start_row = 10,
              end_row = 15,
              bold = TRUE)
}
```


::: {.exercise #GraphsCars}
@henderson1981building recorded the number of cylinders in many models of cars: eleven cars had four cylinders, seven cars had six cylinders, and fourteen cars had eight cylinders.
The *number* of cylinders is quantitative discrete, but with so few different values, the data could be plotted with a graph used for qualitative data.
For these data:

:::::: {.cols data-latex=""}
::: {.col data-latex="{0.45\textwidth}"}

1. Produce a dot chart.
2. Produce a histogram.

:::

::: {.col data-latex="{0.05\textwidth}"}
\ 
<!-- an empty Div (with a white space), serving as
a column separator -->
:::

::: {.col data-latex="{0.45\textwidth}"}

3. Produce a bar chart.
4. Produce a pie chart.

:::
::::::

\null\smallskip
What graph do you think is best?
Why?
:::


::: {.exercise #GraphSurveyData}
A 
`r if (knitr::is_latex_output()) {
   'survey of voice assistants'
} else {
   '[survey of voice assistants](https://www.nielsen.com/us/en/insights/news/2018/smart-speaking-my-language-despite-their-vast-capabilities-smart-speakers-all-about-the-music.print.html)'
}`
(such as Amazon Echo; Google Home; etc.) conducted by
`r if (knitr::is_latex_output()) {
   'Nielsen'
} else {
   '[Nielsen](https://www.nielsen.com/au/en.html)'
}`
asked respondents to indicate how they used their voice assistant.
The options were:

:::::: {.cols data-latex=""}
::: {.col data-latex="{0.33\textwidth}"}

* Listen to music;
* Listen to news;
* Use alarms, timers;

:::

::: {.col data-latex="{0.03\textwidth}"}
\ 
<!-- an empty Div (with a white space), serving as
a column separator -->
:::

::: {.col data-latex="{0.60\textwidth}"}

* Search for real-time information (e.g., traffic; weather);
* Search for factual information (e.g., trivia; history);
* Chat with voice assistant for fun.

:::
::::::

Respondents could select all options that applied.
What would be the best graph for displaying respondents' answers?
Would a pie chart be suitable? Explain your answer.
:::


::: {.exercise #OrdinalMedians}
@gkebski2019impact studied the taste of bread with varying salt and fibre content, and recorded information from $300$ subjects, including the subjects' responses to the statement 'Rolls with lower salt content taste worse than regular ones' (on a five-point ordinal scale from 'Strongly agree' to 'Strongly disagree'); see Table\ \@ref(tab:Bread).

1. Identify the variables, then classify them as nominal or ordinal.
1. For which variables is a mode an appropriate summary (if any)?
1. For which variables is a median an appropriate summary (if any)?
1. Compute the above statistics where appropriate.
1. Compute and interpret the odds of a respondent coming from a city background.
1. Compute and interpret the odds of a respondent agreeing *or* strongly agreeing with the statement.
1. Compute and interpret the odds of a respondent being male.
:::


```{r Bread}
BreadTable  <- array( dim = c(11, 3))

colnames(BreadTable) <- c("",
                          "Number",
                          "Percentage")

BreadTable[, 1] <- c("Female",
                     "Male",
                     "Rural",
                     "City up to $20\\, 000$ residents",
                     "City $20\\, 000$ to $100\\, 000$ residents",
                     "City $> 100\\, 000$ residents",
                     "Strongly agree",
                     "Agree",
                     "Neutral",
                     "Disagree",
                     "Strongly disagree"
                     )
BreadTable[, 2] <- c(150,
                     150,
                     49,
                     38,
                     83,
                     130,
                     30,
                     84,
                     78,
                     66,
                     42)
BreadTable[, 3] <- c(50,
                     50,
                     16,
                     13,
                     28,
                     43,
                     10,
                     28,
                     26,
                     22,
                     14)

if( knitr::is_latex_output() ) {
  knitr::kable(pad(BreadTable,
                   surroundMaths = TRUE,
                   targetLength = c(0, 0, 0),
                   decDigits = 0),
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        escape = FALSE,
        align = c("r", "c", "c"),
        linesep = c("", "\\addlinespace", "", "", "", "\\addlinespace", "", "", "", ""), 
        caption = "The bread-tasting data ($n = 300$).") %>%
   kable_styling(font_size = 8) %>%
   pack_rows("Gender", 1, 2) %>%
   pack_rows("Place of residence", 3, 6) %>%
   pack_rows("Response to statement", 7, 11) %>%
   row_spec(0, bold = TRUE)
}

if( knitr::is_html_output() ) {
  BreadTable[6, 1] <- "City more than $100\\, 000$ residents" 
  
  knitr::kable(pad(BreadTable,
                   surroundMaths = TRUE,
                   targetLength = c(0, 0, 0),
                   decDigits = 0),
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        align = c("r", "c", "c"),
        caption = "The bread-tasting data ($n = 300$).") %>%
   #kable_styling(font_size = 10) %>%
   pack_rows("Gender", 1, 2) %>%
   pack_rows("Place of residence", 3, 6) %>%
   pack_rows("Response to statement", 7, 11) %>%
   row_spec(0, bold = TRUE)
}
```


::: {.exercise #ReclaimedWater}
@lopez2022farmers asked $231$\ farmers what they considered to be the advantages and disadvantages of using reclaimed water on the farm.
The responses are shown in Table\ \@ref(tab:ReclaimedWater) (not all farmers responded).

1. Produce two bar charts to display the data.
1. Produce two dot charts to display the data.
1. Produce two pie charts to display the data.
1. Determine the mode for both the advantages and disadvantages.
1. Compute the percentages for the advantages and disadvantages.
1. Compute the odds of a farmer stating 'high price' as a disadvantage, among *all* farmers.
1. Compute the odds of a farmer stating 'high price' as a disadvantage, among farmers who listed a disadvantage.
:::

```{r}
Advantages <- c(15, 27, 16)
namesAdvantages <- c("Water reutilization",
                     "Availability",
                     "Sustainability")
Disadvantages <- c(40, 12, 21)
namesDisadvantages <- c("High price",
                        "Growing conductivity",
                        "Lack of proper filtering")

AdTable <- DisadTable <- array( dim = c(3, 2))

AdTable[, 1] <- namesAdvantages
AdTable[, 2] <- Advantages
DisadTable[, 1] <- namesDisadvantages
DisadTable[, 2] <- Disadvantages

if( knitr::is_latex_output() ) {
  
  T1 <-  knitr::kable(pad(AdTable,
                          surroundMaths = TRUE,
                          targetLength = c(0, 2),
                          decDigits = 0),
                      format = "latex",
                      valign = 't',
                      align = c("r", "c"),
                      linesep = "",
                      col.names = c("Advantage", "No. farmers"),
                      row.names = FALSE,
                      escape = FALSE,
                      booktabs = TRUE) %>%
    row_spec(0, bold = TRUE)
  
  
  T2 <-  knitr::kable(pad(DisadTable,
                          surroundMaths = TRUE,
                          targetLength = c(0, 2),
                          decDigits = 0),
                      format = "latex",
                      valign = 't',
                      align = c("r", "c"),
                      linesep = "",
                      col.names = c("Disadvantage", "No. farmers"),
                      row.names = FALSE,
                      escape = FALSE,
                      booktabs = TRUE) %>%
    row_spec(0, bold = TRUE)
  
  
  out <- knitr::kables( list(T1, T2),
                        format = "latex",
                        label = "ReclaimedWater",
                        caption = "The advantages and disadvantages of using reclaimed water, reported by $231$ farmers. (Not all farmers responded.)") %>% 
    kable_styling(font_size = 8)
  out2 <- prepareSideBySideTable(out) 
  out2
}
if( knitr::is_html_output() ) {
  
  T1 <- knitr::kable(pad(AdTable,
                          surroundMaths = TRUE,
                          targetLength = c(0, 2),
                          decDigits = 0),
                      format = "html",
                      valign = 't',
                      align = "c",
                      linesep = "",
                      col.names = c("Advantage", "No. farmers"),
                      row.names = FALSE,
                      escape = FALSE,
                      booktabs = TRUE) %>%
    row_spec(0, bold = TRUE)
  
  
  T2 <-  knitr::kable(pad(DisadTable,
                          surroundMaths = TRUE,
                          targetLength = c(0, 2),
                          decDigits = 0),
                      format = "html",
                      valign = 't',
                      align = "c",
                      linesep = "",
                      col.names = c("Disadvantage", "No. farmers"),
                      row.names = FALSE,
                      escape = FALSE,
                      booktabs = TRUE) %>%
    row_spec(0, bold = TRUE)
  
  
  knitr::kables( list(T1, T2),
                        format = "html",
                        label = "ReclaimedWater",
                        caption = "The advantages and disadvantages of using reclaimed water, reported by $231$ farmers. (Not all farmers responded.)") 
}
```


::: {.exercise #StudentTransport}
@henning2020modelling studied $284$ university students in Joinville, Brazil, tabulating how students got to campus (Table\ \@ref(tab:TransportTable); each student could select one option only).

1. What is the mode type of active transport? 
   What about motorised transport?
1. What is the mode type of transport overall?
1. Are medians appropriate?
   If so, compute the median for active transport types, and motorised transport types.
1. Compute the percentages for each option, out of the total sample.
1. Compute the odds that a randomly-chosen student uses motorised transport to get to campus.
   Explain what this means.
1. Compute the odds that a student walks to campus.
   Explain what this means.
1. Construct appropriate plots to display the data.
:::

```{r TransportTable}
TransportTable <- array( dim = c(5, 2) )

rownames(TransportTable) <- c("Bicycle",
                              "Walking",
                              "Car",
                              "Bus",
                              "Other")
colnames(TransportTable) <- c("Number",
                              "Percentage")
TransportTable[, 1] <- c(29, 35, 70, 117, 33)
TransportTable[, 2] <- c(10.2, 12.3, 24.6, 41.2, 11.6) # 70/284 = 24.6% (rounded)

if( knitr::is_latex_output() ) {
  T1 <- knitr::kable(pad( as.data.frame(TransportTable[1:2, 1]),
                   surroundMaths = TRUE,
                   targetLength = c(2),
                   decDigits = c(0)),
               format = "latex",
               escape = FALSE,
               valign = 't',
               align = "c",
               row.names = TRUE,
               col.names = "Number: active methods",
               booktabs = TRUE) %>%
    row_spec(0, bold = TRUE)

  T2 <- knitr::kable(pad( as.data.frame(TransportTable[3:5, 1]),
                   surroundMaths = TRUE,
                   targetLength = c(3),
                   decDigits = c(0)),
               format = "latex",
               escape = FALSE,
               valign = 't',
               align = "c",
               row.names = TRUE,
               col.names = "Number: motorised methods",
               booktabs = TRUE) %>%
    row_spec(0, bold = TRUE)
    
  out <- knitr::kables( list(T1, T2),
                        format = "latex",
                        label = "TransportTable",
                        caption = "Modes of transport for students getting to campus.") %>% 
    kable_styling(font_size = 8)
  out2 <- prepareSideBySideTable(out) 
  out2
 }
if( knitr::is_html_output() ) {
  
  knitr::kable(pad(as.matrix( TransportTable[, 1] ),
                   surroundMaths = TRUE,
                   targetLength = c(3),
                   decDigits = c(0)),
               format = "html",
               col.names = "Number",
               longtable = FALSE,
               escape = FALSE,
               caption = "Modes of transport for students getting to campus.",
               align = "c",
               booktabs = TRUE) %>%
    row_spec(0, bold = TRUE) %>%
    pack_rows("Active", 
              start_row = 1,
              end_row = 2) %>%
    pack_rows("Motorised", 
              start_row = 3,
              end_row = 5)
}
```


::: {.exercise #QualSumBabyBoom}
[*Dataset*: `BabyBoom`]
The data in
`r if( knitr::is_html_output() ) {
  'Fig.\\ \\@ref(fig:BabyBoomDataHTML)'
}  else {
  'Table\\ \\@ref(tab:BabyBoomDataLATEX)'
}`
give the sex of $44$\ babies born in a hospital on one day [@mypapers:Dunn:dataset:1999; @data:Steele:BabyBoom].
The data are given in the order in which the births occurred.

1. What is the mode sex?
1. If appropriate, compute the median sex.
1. Compute the percentages for each sex.
1. Compute the odds that a randomly-chosen baby from the sample is female.
   Explain what this means.
1. Construct appropriate plots to display sex of the baby.
:::


::: {.exercise #FEVplots}
[*Dataset*: `LungCap`]
@data:Tager:FEV studied the lung volume of $654$\ children in East Boston in the\ 1970s (Table\ \@ref(tab:LungCapTab)).

1. Construct suitable plots for all variables.
2. For each qualitative variable, determine the mode.
3. For each qualitative variable, compute the percentage and odds of one of the levels occurring in the data.
4. Compute appropriate statistics for each quantitative variable.
:::

```{r LungCapTab}
data(LungCap) ### Exercise

# Recode Smoking
LungCap$Smoke[LungCap$Smoke == 0] <- "No"
LungCap$Smoke[LungCap$Smoke == 1] <- "Yes"

if( knitr::is_latex_output() ) {
  kable( pad( head(LungCap),
              surroundMaths = TRUE,
              targetLength = c(1, 5, 2, 1, 1),
              decDigits = c(0, 3, 0, 0, 0) ),
         format = "latex",
         longtable = FALSE,
         booktabs = TRUE,
         escape = FALSE,
                 linesep = c("", "", "\\addlinespace"), 
         caption = "The lung volume (FEV) for youth in East Boston in the 1970s; the first six observations in the dataset ($n = 654$).",
         col.names = c("Age",
                       "FEV", 
                       "Height",
                       "Gender",
                       "Smoking"),
         align = c("c") ) %>%
    row_spec(0, bold = TRUE) %>% 
    kable_styling(font_size = 8)
} else { #HTML
  kable( pad( head(LungCap),
              surroundMaths = TRUE,
              targetLength = c(1, 5, 2, 1, 1),
              decDigits = c(0, 3, 0, 0, 0) ),
         format = "html",
         longtable = FALSE,
         booktabs = TRUE,
         escape = FALSE,
         caption = "The lung volume (FEV) for youth in East Boston in the 1970s; the first six observations in the dataset ($n = 654$).",
         col.names = c("Age",
                       "FEV", 
                       "Height",
                       "Gender",
                       "Smoking"),
         align = c("r", "c") ) %>%
    row_spec(0, bold = TRUE) 
}
```


::: {.exercise #SummariseUniOrthoses}
@swinnen2018influence studied the influence of using ankle-foot orthoses in children with cerebral palsy.
Table\ \@ref(tab:DescribeAnkleFoot) gives the data for the $15$\ subjects.
(GMFCS is the 
`r if (knitr::is_latex_output()) {
   'Gross Motor Function Classification System'
} else {
   '[Gross Motor Function Classification System](https://en.wikipedia.org/wiki/Gross_Motor_Function_Classification_System)'
}`
used to describe the impact of cerebral palsy on motor function, where *lower* levels mean *better* functionality.)

1. Construct suitable plots for all variables.
2. For each qualitative variable, determine the mode.
3. For each qualitative variable, compute the percentage and odds of one of the levels occurring in the data.
4. Compute appropriate statistics for each quantitative variable.
:::


::: {.exercise #PLHomeAway}
[*Dataset*: `PremierL`]
In the 2019/2020 Premier League season, Chelsea had $4$\ wins from $10$\ games at home, and $7$\ wins from $11$\ games away from home.
What is the odds ratio of a win (comparing home games and away games)?
:::
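
An odds ratio divides the odds in one group by the odds in another group.
The following is a sketch only, using *hypothetical* counts for two groups (the counts and object names are assumptions, and deliberately not the Chelsea data):

```r
# Hypothetical counts (illustration only): successes and totals in two groups
successA <- 6;  totalA <- 10
successB <- 3;  totalB <- 12

# Odds in each group: successes divided by non-successes
oddsA <- successA / (totalA - successA)   # 6 / 4
oddsB <- successB / (totalB - successB)   # 3 / 9

# Odds ratio, comparing group A to group B
oddsRatio <- oddsA / oddsB
oddsRatio
```

Which group's odds go on the top of the ratio should always be stated, since swapping the groups gives the reciprocal.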


`r if( knitr::is_latex_output() ) "\\captionsetup{font=normalsize}"`


<!-- QUICK REVIEW ANSWERS -->
`r if (knitr::is_html_output()) '<!--'`
::: {.EOCanswerBox data-latex="{iconmonstr-check-mark-14-240.png}"}
**Answers to *Quick review* questions:**
**1.** False.
**2.** True.
**3.** True.
**4.** False. Percentages are proportions multiplied by\ $100$, so the two are similar but not the same.
:::
`r if (knitr::is_html_output()) '-->'`