Summarizing data with group_by() and summarize()

Data often contain groupings (e.g., by geography, time period, or experimental group). The group_by() function tells R to group observations so that subsequent calculations are performed within each group. The summarize() function pairs naturally with group_by(): it collapses each group to a single row of summary measures (mean, median, variance, etc.).

Continuing with the penguins data, suppose the goal is to calculate the mean body mass of penguins on each island. First, pass the grouping variable to group_by():

penguins %>%
  group_by(island)
## # A tibble: 344 × 8
## # Groups:   island [3]
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

The above code did not change any of the values in the dataset; group_by() changes only the dataset's metadata, as the new "Groups: island [3]" line in the printed output shows. (You can think of this as R getting the dataset ready for a further transformation, with the rule that “anything that happens, happens within these groupings”.)
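To confirm that only the grouping metadata changed, one option is to inspect the grouped object directly. The sketch below assumes the penguins data come from the palmerpenguins package (this section does not state where the data are loaded from) and uses dplyr's is_grouped_df(), group_vars(), and n_groups() helpers; by_island is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins is loaded from this package

# Store the grouped data under an illustrative name
by_island <- penguins %>%
  group_by(island)

is_grouped_df(by_island)  # TRUE: the tibble now carries grouping metadata
group_vars(by_island)     # "island": the grouping variable
n_groups(by_island)       # 3: one group per island

# The rows and columns themselves are untouched
dim(by_island)            # same dimensions as dim(penguins)

# ungroup() removes the grouping metadata again
is_grouped_df(ungroup(by_island))  # FALSE
```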

With the grouping defined, apply the summarize() function to calculate mean body mass by island.

penguins %>%
  group_by(island) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 3 × 2
##   island    mean_body_mass_g
##   <fct>                <dbl>
## 1 Biscoe               4716.
## 2 Dream                3713.
## 3 Torgersen            3706.

Missing values (NA) are an important consideration in data analysis. In the above code, na.rm = TRUE removes any missing values of body_mass_g before the mean is calculated, so the mean is based only on the values that were actually recorded. Without it, mean() returns NA for any group that contains a missing value. (To see the effect of na.rm = TRUE, run the above code without that argument.)
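A small toy vector makes the behavior of na.rm concrete. The values in toy_mass are made up for illustration and are not taken from the penguins data. The second part of the sketch, which assumes penguins comes from the palmerpenguins package, counts the missing body mass values on each island so you can see which groups are affected; missing_by_island is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins is loaded from this package

# Hypothetical body masses with one missing value
toy_mass <- c(3000, NA, 4000)

mean(toy_mass)                # NA: one missing value makes the mean missing
mean(toy_mass, na.rm = TRUE)  # 3500: mean of the two observed values

# Count the missing body mass values within each island
missing_by_island <- penguins %>%
  group_by(island) %>%
  summarize(n_missing = sum(is.na(body_mass_g)))

missing_by_island
```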

You can extend the above approach to group by more than one variable. Grouping by both island and year yields one row for each island-year combination:

penguins %>%
  group_by(island, year) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 9 × 3
## # Groups:   island [3]
##   island     year mean_body_mass_g
##   <fct>     <int>            <dbl>
## 1 Biscoe     2007            4741.
## 2 Biscoe     2008            4628.
## 3 Biscoe     2009            4793.
## 4 Dream      2007            3684.
## 5 Dream      2008            3779.
## 6 Dream      2009            3691.
## 7 Torgersen  2007            3763.
## 8 Torgersen  2008            3856.
## 9 Torgersen  2009            3489.
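
Note the "Groups: island [3]" line in the output above. When the data are grouped by more than one variable, summarize() by default drops only the last grouping variable (here, year), so the result is still grouped by island. If an ungrouped result is preferred, one option is the .groups argument, available in dplyr 1.0.0 and later. A minimal sketch, again assuming penguins comes from the palmerpenguins package; mass_by_island_year is an illustrative name:

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins is loaded from this package

mass_by_island_year <- penguins %>%
  group_by(island, year) %>%
  summarize(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble
  )

is_grouped_df(mass_by_island_year)  # FALSE: no grouping metadata remains
nrow(mass_by_island_year)           # 9: one row per island-year combination
```

Calling ungroup() after summarize() should give the same ungrouped result.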