---
title: "Brief demo of tidy evaluation"
author: "Brittany Barker & John Smith"
date: "3/18/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(palmerpenguins)
library(gt)
```
Tidy evaluation is a framework for controlling how expressions and variables in your code are evaluated by tidyverse functions, as described in this [R-bloggers blog](https://www.r-bloggers.com/2019/12/practical-tidy-evaluation/#:~:text=Tidy%20evaluation%20is%20a%20framework,more%20efficient%20and%20elegant%20code).
The Tidy evaluation [chapter](https://tidyeval.tidyverse.org/index.html) of the `tidyverse` guide book recommends reading:
- The new [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) vignette.
- The [Using ggplot2 in packages](https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html) vignette.
As described in the dplyr vignette, most dplyr verbs use **tidy evaluation** in some way. The two basic forms of tidy evaluation in dplyr are those that use:
- **data masking**, so that you can use data variables as if they were variables in the environment (i.e., you write `my_variable`, not `df$my_variable`). These include `arrange()`, `count()`, `filter()`, `group_by()`, `mutate()`, and `summarise()`.
- **tidy selection**, so that you can easily choose variables based on their position, name, or type (e.g., `starts_with("x")` or `is.numeric`). These include `across()`, `relocate()`, `rename()`, `select()`, and `pull()`.
In this demo, we'll cover some examples of both forms of tidy evaluation using the `penguins` dataset.
## Data-variables vs. env-variables
**Data masking** blurs the line between two different meanings of the word "variable."
- **env-variables** are "programming" variables that live in the environment (e.g., the `penguins` data frame).
- **data-variables** are "statistical" variables that live in a data frame (e.g., the 8 variables in the `penguins` data frame).
```{r}
str(penguins)
head(penguins)
```
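The distinction matters most when an env-variable and a data-variable could be confused with each other. The chunk below is a minimal sketch, assuming a cutoff stored in an env-variable called `min_mass` (a hypothetical name chosen for this example). The `.data` and `.env` pronouns make explicit which kind of variable each name refers to.

```{r}
library(dplyr)
library(palmerpenguins)

# Hypothetical env-variable: it lives in the environment, not in the data frame
min_mass <- 6000

heavy <- penguins %>%
  # .data$body_mass_g is a data-variable; .env$min_mass is an env-variable
  filter(.data$body_mass_g >= .env$min_mass)

heavy
```

Without the pronouns, `filter(body_mass_g >= min_mass)` would give the same result here, because data masking looks in the data frame first and then in the environment.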
When a data-variable is passed to your own function as an argument (so the argument is an env-variable that stands for a data-variable), you need to **embrace** the argument by surrounding it in double braces, as in `{{ var }}`. The following function uses embracing to create a wrapper around `summarise()` that calls `min()` and `max()`. Note that `summarise()` and `summarize()` can be used interchangeably.
```{r}
# `var` is embraced so that the caller can pass a bare column name
var_min_max <- function(df, var) {
  df %>%
    summarize(
      min = min({{ var }}, na.rm = TRUE),
      max = max({{ var }}, na.rm = TRUE)
    )
}

penguins %>%
  group_by(species) %>%
  var_min_max(bill_length_mm)
```
```
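Embracing can also be used on the left-hand side to build output column names from the argument. The chunk below is a sketch of that idea, assuming a hypothetical variant of the wrapper called `var_min_max_named`; it relies on the glue-style `"{{ var }}_suffix" :=` syntax that dplyr supports for naming results.

```{r}
library(dplyr)
library(palmerpenguins)

# Hypothetical variant of var_min_max(): output names include the column name
var_min_max_named <- function(df, var) {
  df %>%
    summarize(
      "{{ var }}_min" := min({{ var }}, na.rm = TRUE),
      "{{ var }}_max" := max({{ var }}, na.rm = TRUE)
    )
}

bill_range <- penguins %>%
  group_by(species) %>%
  var_min_max_named(bill_length_mm)

names(bill_range)
```

This is handy when the wrapper is called for several columns and the results are later joined, since the default `min` and `max` names would otherwise collide.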
When you have an env-variable that is a character vector, you need to index into the `.data` pronoun with `[[`, like `summarise(penguins, mean(.data[[var]]))`.
```{r}
var_list <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
for (var in var_list) {
  penguins %>%
    summarise(mean = mean(.data[[var]], na.rm = TRUE)) %>%
    print()
}
```
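When the goal is simply to apply one function to several columns named in a character vector, a loop is not strictly needed. As a sketch of an alternative (not a replacement for the `.data` pronoun, which remains the right tool inside data-masking arguments), `across()` combined with the tidyselect helper `all_of()` produces all four means in a single row.

```{r}
library(dplyr)
library(palmerpenguins)

var_list <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")

# all_of() turns the character vector into a tidy selection
means <- penguins %>%
  summarise(across(all_of(var_list), ~ mean(.x, na.rm = TRUE)))

means
```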
## Tidy selection
Tidy selection is a complementary tool to data masking that makes it easy to work with the columns of a data frame. Underneath all functions that use tidy selection is the `tidyselect` package. It provides a miniature domain-specific language that makes it easy to select columns by name, position, or type. Here are some examples.
Select the first column (`last_col()` selects the last column).
```{r}
head(select(penguins, 1))
```
Select columns by name.
```{r}
head(select(penguins, c(species, island, body_mass_g)))
```
Select all columns whose name starts with "bill."
```{r}
head(select(penguins, starts_with("bill")))
```
Select all columns whose name ends with "mm."
```{r}
head(select(penguins, ends_with("mm")))
```
Select all numeric columns.
```{r}
head(select(penguins, where(is.numeric)))
```
Use negation inside `select()` to remove columns.
```{r}
head(select(penguins, -c(year, sex)))
```
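The selection helpers above can also be combined with the Boolean operators `&`, `|`, and `!`. The chunk below is a short sketch of a few combinations; the object names (`bill_mm`, `numeric_no_year`, `factors_or_year`) are chosen only for illustration.

```{r}
library(dplyr)
library(palmerpenguins)

# Columns whose name starts with "bill" AND ends with "mm"
bill_mm <- select(penguins, starts_with("bill") & ends_with("mm"))

# Numeric columns, except year
numeric_no_year <- select(penguins, where(is.numeric) & !year)

# Factor columns OR the year column
factors_or_year <- select(penguins, where(is.factor) | year)

names(bill_mm)
names(numeric_no_year)
names(factors_or_year)
```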
## Some more summarize examples
Usually with summarize, you want to group data first. First let's do some analyses in which we'll group the `penguins` data by species. There are three species (Adelie, Chinstrap, and Gentoo).
```{r}
penguins %>%
  group_by(species) %>%
  distinct(species)
```
Group data by island and calculate the number of species on each island.
```{r}
penguins %>%
  group_by(island) %>%
  distinct(species, .keep_all = TRUE) %>%
  summarize(n())
```
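The same count can be obtained without the intermediate `distinct()` step. As a sketch of an alternative, `n_distinct()` counts the unique species within each island directly; the column name `n_species` is chosen here for readability.

```{r}
library(dplyr)
library(palmerpenguins)

species_per_island <- penguins %>%
  group_by(island) %>%
  summarize(n_species = n_distinct(species))

species_per_island
```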
Summarize the bill and flipper measurements (columns whose names end in "mm") using the `mean` function. This involves grouping data by species, rounding results, renaming columns, and creating a table of results using the `gt` package. Notice that the `across` function is used twice: once to apply `mean` to the selected columns and once to round all numeric columns.
```{r}
mean_bill_flip <- penguins %>%
  group_by(species) %>%
  # Anonymous functions are used because passing extra arguments
  # through across() is deprecated as of dplyr 1.1.0
  summarize(across(ends_with("mm"), ~ mean(.x, na.rm = TRUE))) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  setNames(c("Species", "Bill length (mm)", "Bill depth (mm)", "Flipper length (mm)"))

gt(mean_bill_flip) %>%
  cols_align(align = "left") %>%
  tab_header(
    title = md("**Trait measurements for three penguin species**"),
    subtitle = "Averages for bill and flipper lengths"
  )
```
The `tidyselect` grammar can also be used in some places in the construction of a `gt` table. For example, we can just show 1 decimal place for the numeric variables. Other handy format functions that are `tidyselect`-aware are `fmt_currency`, `fmt_date`, `fmt_time`, `fmt_datetime`, `fmt_percent`, and `fmt_markdown`.
```{r}
gt(mean_bill_flip) %>%
  cols_align(align = "left") %>%
  fmt_number(where(is.numeric), decimals = 1) %>%
  tab_header(
    title = md("**Trait measurements for three penguin species**"),
    subtitle = "Averages for bill and flipper lengths"
  )
```
Are there differences in flipper length between species? To answer this question, we summarize flipper length using the `quantile` function, which by default returns the 0% (min), 25% (lower quartile), 50% (median), 75% (upper quartile), and 100% (max) quantiles. This involves grouping the data by species, removing rows with missing data (note that `na.omit()` drops a row if any of its columns is `NA`), and calculating quantiles.
```{r}
quants <- penguins %>%
  group_by(species) %>%
  na.omit() %>%
  summarize(flipper = list(quantile(flipper_length_mm)))
```
Show the characteristics of the output:
```{r}
str(quants)
quants
```
```{r}
quants_flipper <- quants %>%
  unnest_wider(flipper) # have a look at unnest_longer(flipper), too.
quants_flipper
```
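For comparison, here is a sketch of the long form that `unnest_longer()` produces: one row per species and quantile rather than one column per quantile. Two choices in this chunk are assumptions made for the example. It drops only rows with a missing flipper length (instead of calling `na.omit()` on the whole data frame), and it names the index column `quantile` through the `indices_to` argument.

```{r}
library(dplyr)
library(tidyr)
library(palmerpenguins)

quants_long <- penguins %>%
  filter(!is.na(flipper_length_mm)) %>%
  group_by(species) %>%
  summarize(flipper = list(quantile(flipper_length_mm))) %>%
  # The names of each quantile vector ("0%", "25%", ...) go into `quantile`
  unnest_longer(flipper, indices_to = "quantile")

quants_long
```

The long form is usually easier to filter or to map to plot aesthetics, while the wide form is more convenient for tables.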
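Next, the results are plotted using `geom_boxplot` in `ggplot2`. Because the quantiles are already computed, `stat = "identity"` tells `geom_boxplot` to draw them as given. The `ggplot2` package has support for tidy evaluation.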
```{r}
ggplot(quants_flipper, aes(x = species)) +
  geom_boxplot(
    aes(ymin = `0%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `100%`),
    stat = "identity"
  ) +
  xlab("Species") +
  ylab("Flipper length (mm)")
```
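To illustrate the tidy evaluation support in `ggplot2`, the chunk below is a minimal sketch of a plotting wrapper that embraces its arguments inside `aes()`. The helper name `box_by` is hypothetical. Unlike the plot above, it lets `geom_boxplot` compute its own statistics from the raw data, so the whiskers follow the default 1.5 × IQR rule rather than the minimum and maximum.

```{r}
library(ggplot2)
library(palmerpenguins)

# Hypothetical helper: `x` and `y` are data-variables supplied by the caller
box_by <- function(df, x, y) {
  ggplot(df, aes(x = {{ x }}, y = {{ y }})) +
    geom_boxplot(na.rm = TRUE)
}

p <- box_by(penguins, species, flipper_length_mm)
p
```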
Use `where(is.numeric)` and its negation to format the table:
```{r}
quants_flipper %>%
  gt() %>%
  fmt_number(where(is.numeric), decimals = 1) %>%
  cols_align(!where(is.numeric), align = "right")
```
## Conditional summarize functions
Here are some examples of how to conditionally summarize data using the scoped verbs `summarize_if`, `summarize_at`, and `summarize_all`. These verbs still work, but dplyr has superseded them in favor of `across()`.
Here `summarize_if` is used to calculate the mean of the numeric columns. We exclude year using `select`.
```{r}
penguins %>%
  select(-year) %>%
  group_by(species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
```
Here `summarize_at` is used to calculate the mean of only those columns whose names contain the string "mm" (the bill and flipper measurements).
```{r}
penguins %>%
  group_by(species) %>%
  summarise_at(vars(contains("mm")), mean, na.rm = TRUE)
```
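The same two summaries can be written with `summarise()` and `across()`, reusing the tidy selection helpers from earlier in this demo. The chunk below is a sketch of the equivalents; the object names `means_numeric` and `means_mm` are chosen only for illustration.

```{r}
library(dplyr)
library(palmerpenguins)

# Counterpart of summarise_if(is.numeric, mean, na.rm = TRUE)
means_numeric <- penguins %>%
  select(-year) %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Counterpart of summarise_at(vars(contains("mm")), mean, na.rm = TRUE)
means_mm <- penguins %>%
  group_by(species) %>%
  summarise(across(contains("mm"), ~ mean(.x, na.rm = TRUE)))

means_numeric
means_mm
```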
Here `summarize_all` is used to summarize data in all columns. However, notice how we get a warning message because we're asking the function to summarize `sex` and `island`, which are factors (the other columns are integer or numeric variables). Also, it doesn't really make sense to calculate the mean year.
```{r}
penguins %>%
  group_by(species) %>%
  summarise_all(~mean(., na.rm = TRUE))
```
Thus, we'd either not want to use `summarize_all` in this context, or we could remove the problematic columns before we apply the function. This can be accomplished using the negation operator in the `select` function.
```{r}
penguins %>%
  group_by(species) %>%
  select(-year, -island, -sex) %>%
  summarise_all(~mean(., na.rm = TRUE))
```
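With `across()`, the exclusion can be expressed in the selection itself, so no separate `select()` step is needed. The chunk below is a sketch of that approach; it relies on the fact that `across()` skips grouping columns (here `species`) automatically.

```{r}
library(dplyr)
library(palmerpenguins)

means_all <- penguins %>%
  group_by(species) %>%
  # Every non-grouping column except year, island, and sex
  summarise(across(!c(year, island, sex), ~ mean(.x, na.rm = TRUE)))

means_all
```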