penguins-summarize.Rmd

---
title: "Brief demo of tidy evaluation"
author: "Brittany Barker & John Smith"
date: "3/18/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(palmerpenguins)
library(gt)
```

Tidy evaluation is a framework for controlling how expressions and variables in your code are evaluated by tidyverse functions, as described in this [R-bloggers blog](https://www.r-bloggers.com/2019/12/practical-tidy-evaluation/#:~:text=Tidy%20evaluation%20is%20a%20framework,more%20efficient%20and%20elegant%20code).

The Tidy evaluation [chapter](https://tidyeval.tidyverse.org/index.html) of the `tidyverse` guide book recommends reading:  
- The new [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) vignette.  
- The [Using ggplot2 in packages](https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html) vignette.  

As described in the dplyr vignette, most dplyr verbs use **tidy evaluation** in some way. The two basic forms of tidy evaluation in dplyr are those that use:  
- **data masking** so that you can use data variables as they were variables in the environment (i.e. you write `my_variable` not `df$myvariable`). These include `arrange()`, `count()`, `filter()`, `group_by()`, `mutate()`, and `summarise()`.  
- **tidy selection** so you can easily choose variables based on their position, name, or type (e.g. `starts_with("x")` or `is.numeric`. These include `across()`, `relocate()`, `rename()`, `select()`, and `pull()`.    

In this demo, we'll cover some examples of both forms of tidy evaluation using the `penguins` dataset.  

## Data-variables vs. env-variables 

**Data masking** blurs the line between two different meanings of the word "variable."    
- **env-variables** are "programming" variables that live in the environment (e.g., the `penguins` data frame).   
- **data-variables** are "statistical" variables that live in a data frame (e.g., the 8 variables in the `penguins` data frame).  

```{r}
str(penguins)
head(penguins)
```

When you want to get the data-variable from an environmental variable without typing the variable's name, you need to **embrace** the argument surrounding it in double braces. The following function uses embracing to create a wrapper around `summarise` (and other functions). Note that `summarise()` and `summarize()` can be used interchangeably.
```{r}
var_min_max <- function(df, var){
  df %>% 
    summarize(min = min({{ var }}, na.rm = TRUE), max = max({{ var }}, na.rm = TRUE))
}

penguins %>%
  group_by(species) %>%
  var_min_max(bill_length_mm)
```
When you have an env-variable that is a character vector, you need to index into the `.data` pronoun with `[[`, like `summarise(penguins, mean(.data[[var]]))`.
```{r}
var_list <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
for (var in var_list){
  penguins %>% 
    summarise(mean = mean(.data[[var]], na.rm = TRUE)) %>% print()
}
```
## Tidy selection

Tidy selection is a complementary tool to data masking that makes it easy to work with the columns of a data frame. Underneath all functions that use tidy selection is the `tidyselect` package. It provides a miniature domain specific language that makes it easy to select columns by name, position, or type. Here are some examples.

Selects the first column (`last_col()` selects the last column).
```{r}
head(select(penguins, 1)) 
```

Select columns by name.
```{r}
head(select(penguins, c(species, island, body_mass_g)))
```

Selects all columns whose name starts with “bill."
```{r}
head(select(penguins, starts_with("bill")))
```

Select all columns whose name ends with "mm."
```{r}
head(select(penguins, ends_with("mm")))
```

Select all numeric columns.
```{r}
head(select(penguins, where(is.numeric)))
```

Use the negation of select to remove columns.
```{r}
head(select(penguins, -c(year, sex)))
```

## Some more summarize examples

Usually with summarize, you want to group data first. First let's do some analyses in which we'll group the `penguins` data by species. There are three species (Adelie, Chinstrap, and Gentoo).
```{r}
penguins %>%
  group_by(species) %>%
  distinct(species)
```

Group data by island and calculate the number of species on each island.
```{r}
penguins %>%
  group_by(island) %>%
  distinct(species, .keep_all = TRUE) %>%
  summarize(n())
```

Summarize bill and flipper length measurements (columns contain "mm" string) using the `mean` function. This involves grouping data by species, rounding results, renaming columns, and creating a table of results using the `gt` package. Notice that the `across` function is used to round all numeric columns.
```{r}
mean_bill_flip <- penguins %>%
  group_by(species) %>%
  summarize(across(ends_with("mm"), mean, na.rm = TRUE)) %>%
  mutate(across(where(is.numeric), round, 2)) %>%
  setNames(c("Species", "Bill length (mm)", "Bill depth (mm)", "Flipper length (mm)"))

gt(mean_bill_flip) %>%
  cols_align(align = "left") %>%
  tab_header(title = md("**Trait measurements for three penguin species**"),
             subtitle = "Averages for bill and flipper lengths")
```

The `tidyselect` grammar can also be used in some places in the construction of a `gt` table.  For example, we can just show 1 decimal place for the numeric variables.  Other handy format functions that are `tidyselect`-aware are `fmt_currency`, `fmt_date`, `fmt_time`, `fmt_datetime`, `fmt_percent`, and `fmt_markdown`.

```{r}
gt(mean_bill_flip) %>%
  cols_align(align = "left") %>%
  fmt_number(where(is.numeric), decimals = 1) %>% 
  tab_header(title = md("**Trait measurements for three penguin species**"),
             subtitle = "Averages for bill and flipper lengths") 
```


Are there differences in flipper length between species? To answer this question, we summarize flipper length using the `quantile` function, which by default returns data divided into 0% (min), 25% (lower), 50% (median), 75% (upper), and 100% (max) subsets. This involves grouping flipper length data by species, removing missing data, and calculating quantiles. 

```{r}
quants <- penguins %>%
  group_by(species) %>%
  na.omit() %>%
  summarize(flipper = list(quantile(flipper_length_mm)))
```

Show the characteristics of the output:

```{r}
str(quants)
quants
```

```{r}
quants_flipper <- quants %>%
  unnest_wider(flipper) # have a look at unnest_longer(flipper), too.

quants_flipper
```

Next, the results are plotted using `geom_boxplot` in `ggplot2`. The `ggplot2` pacakge has support for tidy evaluation.
```{r}
ggplot(quants_flipper, aes(x = species)) +
  geom_boxplot(
    aes(ymin = `0%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `100%`),
    stat = "identity"
    ) +
  xlab("Species") +
  ylab("Flipper length (mm)")
```

Use `where(is.numeric)` and its negation to format the table:

```{r}
quants_flipper %>% 
  gt() %>% 
  fmt_number(where(is.numeric), decimals = 1) %>% 
  cols_align(!where(is.numeric), align = "right")
```


## Conditional summarize functions
Here are some examples of how to conditionally summarize data using `summarize_if`, `summarize_at`, and `summarize_all`.    

Here `summarize_if` is used to calculate mean of numeric columns. We exclude year using `select`.
```{r}
penguins %>%
  select(-year) %>%
  group_by(species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
```

Here `summarize_at` is used to calculate mean of columns only containing the string "mm" (corresponding to bill and flipper lengths).
```{r}
penguins %>%
  group_by(species) %>%
  summarise_at(vars(contains("mm")), mean, na.rm = TRUE)
```

Here `summarize_all` is used to summarize data in all columns. However, notice how we get a warning message because we're asking the function to summarize `sex` and `island`, which are factors (the other columns are integer or numeric variables). Also, it doesn't really make sense to calculate the mean year.
```{r}
penguins %>%
  group_by(species) %>%
  summarise_all(~mean(., na.rm = TRUE))
```

Thus, we'd either not want to use `summarize_all` in this context, or we could remove the problematic columns before we apply the function. This can be accomplished using the negation operator in the `select` function.
```{r}
penguins %>%
  group_by(species) %>%
  select(-year, -island, -sex) %>%
  summarise_all(~mean(., na.rm = TRUE))
```