Creating new columns with mutate()

Another common data wrangling task is to create a new variable, using the mutate() function. When creating a new variable, you provide a name for the new column and an expression for calculating its values.

Continuing with the penguins data from palmerpenguins, the code below creates a new column containing each penguin's body mass in kilograms:

penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000)
## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   body_mass_kg <dbl>

The syntax follows the pattern mutate(new_column_name = expression), where expression is an R expression that is usually computed from values in existing columns. In the above example, new_column_name is body_mass_kg, and expression is body_mass_g / 1000.
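A single mutate() call can also take several name = expression pairs separated by commas, and a later expression can use a column created earlier in the same call. The following is a minimal sketch of that pattern; the toy data frame and the body_mass_lb column are hypothetical and used only for illustration, not taken from the penguins examples.

```r
library(dplyr)

# Hypothetical toy data standing in for a few rows of penguins
toy <- data.frame(body_mass_g = c(3750, 3800, NA, 4675))

toy_converted <- toy %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,      # new column from an existing one
    body_mass_lb = body_mass_kg * 2.20462   # uses the column created just above
  )

toy_converted
```

Missing values carry through the arithmetic, so the row with NA in body_mass_g has NA in both new columns.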

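Perhaps you realized that all flipper measurements were 4 mm short of the true length; you could use mutate() to adjust the data. Because the name on the left of the = already exists, mutate() overwrites that column instead of adding a new one: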

penguins %>%
  mutate(flipper_length_mm = flipper_length_mm + 4)
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <dbl>       <int>
##  1 Adelie  Torgersen           39.1          18.7               185        3750
##  2 Adelie  Torgersen           39.5          17.4               190        3800
##  3 Adelie  Torgersen           40.3          18                 199        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               197        3450
##  6 Adelie  Torgersen           39.3          20.6               194        3650
##  7 Adelie  Torgersen           38.9          17.8               185        3625
##  8 Adelie  Torgersen           39.2          19.6               199        4675
##  9 Adelie  Torgersen           34.1          18.1               197        3475
## 10 Adelie  Torgersen           42            20.2               194        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
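Note that mutate() returns a modified copy and leaves the original data untouched, so the correction is lost unless the result is assigned to a name. The sketch below illustrates this with a hypothetical toy data frame (toy and toy_corrected are illustrative names). It also shows why the column type in the output above changed from <int> to <dbl>: in R, 4 is a double, while 4L is an integer.

```r
library(dplyr)

# Hypothetical toy data with an integer column, like flipper_length_mm
toy <- data.frame(flipper_length_mm = c(181L, 186L, NA, 195L))

# Without assignment the result is only printed; toy itself is unchanged
toy %>%
  mutate(flipper_length_mm = flipper_length_mm + 4)

# Assign the result to keep the corrected values
toy_corrected <- toy %>%
  mutate(flipper_length_mm = flipper_length_mm + 4)

# Adding 4L instead of 4 keeps the column as an integer
toy_corrected_int <- toy %>%
  mutate(flipper_length_mm = flipper_length_mm + 4L)
```

Assigning to a new name rather than back to the original keeps the uncorrected data available for comparison.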

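You can also combine mutate() with other functions. The code below calculates the total body mass of all penguins on each island. Because mutate() keeps every row, each penguin's row receives the total for its island, and the result remains grouped by island.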

penguins %>%
  group_by(island) %>%
  mutate(island_penguin_mass = sum(body_mass_g, na.rm = TRUE))
## # A tibble: 344 × 9
## # Groups:   island [3]
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   island_penguin_mass <int>
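A per-group total like this is often an intermediate step. As a minimal sketch, the code below uses a hypothetical toy data frame to compute each penguin's share of its island's total mass (mass_share is an illustrative column name), then calls ungroup() so that later steps are not silently computed per island.

```r
library(dplyr)

# Hypothetical toy data: two islands, one missing body mass
toy <- data.frame(
  island = c("Biscoe", "Biscoe", "Dream", "Dream", "Dream"),
  body_mass_g = c(4000, 6000, 3000, NA, 5000)
)

toy_share <- toy %>%
  group_by(island) %>%
  mutate(
    island_penguin_mass = sum(body_mass_g, na.rm = TRUE),  # one total per island
    mass_share = body_mass_g / island_penguin_mass         # each row's share of it
  ) %>%
  ungroup()                                                # drop the grouping

toy_share
```

The number of rows is unchanged: grouping only changes what sum() sees, not how many rows mutate() returns.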

It can be useful to give R rules for creating new variables. For example, the code below divides the penguins into flipper length categories relative to the mean flipper length in the dataset (about 201 mm), using the case_when() function. You can think of case_when() as a multilevel if statement. In the code below, case_when() is saying: "for each observation (row), when the variable flipper_length_mm meets a certain condition (greater than, equal to, or less than 201 mm), the new column should contain the respective category: 'long', 'average', or 'short'." Rows with a missing flipper length match none of the conditions, so their category is NA.

penguins %>%
  mutate(flipper_category = case_when(
    flipper_length_mm > 201 ~ "long",
    flipper_length_mm == 201 ~ "average",
    flipper_length_mm < 201 ~ "short"
  ))
## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   flipper_category <chr>
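case_when() checks its conditions in order and uses the first one that is TRUE for each row. A row where no condition is TRUE, such as a penguin with a missing flipper length, gets NA. The sketch below shows one possible way to handle this, using a hypothetical toy data frame: an explicit is.na() rule comes first, and a final TRUE ~ ... rule acts as a fallback. The "unknown" label is an assumption made for illustration.

```r
library(dplyr)

# Hypothetical toy data, including a missing measurement
toy <- data.frame(flipper_length_mm = c(181L, 201L, NA, 210L))

toy_categories <- toy %>%
  mutate(flipper_category = case_when(
    is.na(flipper_length_mm) ~ "unknown",   # checked first, catches missing values
    flipper_length_mm > 201 ~ "long",
    flipper_length_mm == 201 ~ "average",
    TRUE ~ "short"                          # fallback for every remaining row
  ))

toy_categories
```

The order matters here: if the TRUE fallback came before the is.na() rule, rows with missing values would be labelled "short".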