group_by() and summarize()
Data often contain groupings (e.g., by geography, time period, or experimental group). The group_by() function tells R to group observations so that later calculations are performed within each group. The summarize() function is often used together with group_by() to calculate one summary measure per group (mean, median, variance, etc.).
Continuing with the penguins data, one goal may be to calculate a variable that shows the mean body mass of penguins on each island. First, feed the grouping variable into group_by():
penguins %>%
  group_by(island)
## # A tibble: 344 × 8
## # Groups: island [3]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
The above code did not change any of the values in the dataset; the group_by() function only changes the metadata of the dataset, as the "Groups: island [3]" line in the output shows. (You can think of this as R getting the dataset ready for a further transformation to be applied to it, with the rule of “anything that happens, happens within these groupings”.)
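To make the "metadata only" point concrete, here is a minimal sketch using a small toy tibble rather than the penguins data. The object names toy and grouped and the values in them are hypothetical; group_vars() and n_groups() are dplyr helpers for inspecting the grouping.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the penguins dataset)
toy <- tibble(
  island = c("A", "A", "B"),
  body_mass_g = c(3000, 3200, 4100)
)

grouped <- toy %>%
  group_by(island)

# The grouping is stored as metadata on the tibble
group_vars(grouped)  # "island"
n_groups(grouped)    # 2

# The rows and values themselves are unchanged
nrow(grouped) == nrow(toy)
identical(grouped$body_mass_g, toy$body_mass_g)
```

Both comparisons at the end should return TRUE, since grouping only records which rows belong together.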
With the grouping defined, apply the summarize() function to calculate mean body mass by island.
penguins %>%
  group_by(island) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 3 × 2
## island mean_body_mass_g
## <fct> <dbl>
## 1 Biscoe 4716.
## 2 Dream 3713.
## 3 Torgersen 3706.
Missing values (NA) are an important consideration in data analysis. In the above code, the na.rm = TRUE argument removes any missing values in body_mass_g before averaging, so R can calculate the mean of the body mass values that are present. Without it, mean() returns NA for any group that contains a missing value. (To experiment with the effect of na.rm = TRUE, run the above code without that argument.)
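The effect of na.rm can be seen side by side in a small sketch. The tibble toy, the result by_island, and the column names mean_with_na, mean_no_na, and n_missing are hypothetical and chosen for illustration only; the values are not from the penguins data.

```r
library(dplyr)

# Toy data with one missing body mass on island "A" (hypothetical values)
toy <- tibble(
  island = c("A", "A", "B", "B"),
  body_mass_g = c(3000, NA, 4000, 4200)
)

by_island <- toy %>%
  group_by(island) %>%
  summarize(
    mean_with_na = mean(body_mass_g),                # NA if the group has any NA
    mean_no_na   = mean(body_mass_g, na.rm = TRUE),  # NAs dropped before averaging
    n_missing    = sum(is.na(body_mass_g))           # how many values were missing
  )

by_island
```

Counting the missing values alongside the mean is one way to keep track of how much data each group's summary actually rests on.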
You can extend the above approach to group by both island and year:
penguins %>%
  group_by(island, year) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 9 × 3
## # Groups: island [3]
## island year mean_body_mass_g
## <fct> <int> <dbl>
## 1 Biscoe 2007 4741.
## 2 Biscoe 2008 4628.
## 3 Biscoe 2009 4793.
## 4 Dream 2007 3684.
## 5 Dream 2008 3779.
## 6 Dream 2009 3691.
## 7 Torgersen 2007 3763.
## 8 Torgersen 2008 3856.
## 9 Torgersen 2009 3489.
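Note the "Groups: island [3]" line in the output above: after summarizing by two variables, the result is still grouped by island. The sketch below, which assumes dplyr 1.0.0 or later (where summarize() has a .groups argument), shows one way to check this and to return an ungrouped result instead. The tibble toy and the objects still_grouped and ungrouped are hypothetical names with made-up values.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the penguins dataset)
toy <- tibble(
  island = c("A", "A", "B", "B"),
  year = c(2007L, 2008L, 2007L, 2007L),
  body_mass_g = c(3000, 3200, 4000, 4200)
)

# Default behavior: the last grouping variable (year) is dropped,
# so the result stays grouped by island (dplyr prints a message saying so)
still_grouped <- toy %>%
  group_by(island, year) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

group_vars(still_grouped)  # "island"

# .groups = "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(island, year) %>%
  summarize(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  )

group_vars(ungrouped)  # character(0)
```

Dropping the grouping can help avoid surprises when later steps in a pipeline are meant to apply to the whole table rather than within each island.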