Tidying example

The “tidy” format of a dataset can depend on how you plan to use the data. For examples of tidy data in action, return to the penguins dataset. (For instructions on loading penguins, see Getting started.)

This summarized version of the penguins data, called penguins_sum, gives the mean body mass for all penguins on a given island in a given year.

## # A tibble: 9 × 3
## # Groups:   island [3]
##   island     year mean_body_mass_g
##   <fct>     <int>            <dbl>
## 1 Biscoe     2007            4741.
## 2 Biscoe     2008            4628.
## 3 Biscoe     2009            4793.
## 4 Dream      2007            3684.
## 5 Dream      2008            3779.
## 6 Dream      2009            3691.
## 7 Torgersen  2007            3763.
## 8 Torgersen  2008            3856.
## 9 Torgersen  2009            3489.
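
A summary like the one above could be produced along the following lines. This is only a sketch under a few assumptions: that `penguins` comes from the palmerpenguins package, that dplyr is installed, and that missing body masses are dropped with `na.rm = TRUE`. The code that actually produced the table is not shown in this section and may differ.

```r
# Assumption: penguins is provided by the palmerpenguins package.
library(palmerpenguins)
library(dplyr)

# One row per island-year combination, holding the mean body mass
# of all penguins measured on that island in that year.
penguins_sum <- penguins %>%
  group_by(island, year) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

penguins_sum
```

With dplyr's default behavior, `summarize()` drops only the last grouping level, so the result remains grouped by `island`, consistent with the `Groups: island [3]` line in the printed output.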

The resulting data is “tidy”: each variable has its own column, each observation (one island in one year) has its own row, and each value has its own cell.

Compare that with this representation of the same data:

## # A tibble: 3 × 4
## # Groups:   island [3]
##   island    `2007` `2008` `2009`
##   <fct>      <dbl>  <dbl>  <dbl>
## 1 Biscoe     4741.  4628.  4793.
## 2 Dream      3684.  3779.  3691.
## 3 Torgersen  3763.  3856.  3489.

While the above data contains the same information, in this version the mean body mass is spread across three columns, one for each year. This representation is usually called “wide” data, while the first is referred to as “long.” Per tidy data rules, mean body mass is a single variable and should therefore have a single column. Similarly, year should have its own column rather than being encoded in the column names. tidyr will help us “pivot” from one to the other.
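
As a minimal sketch of that pivot, the example below rebuilds the long table by hand from the rounded values printed above (so it does not depend on how `penguins` was loaded) and converts it in both directions with tidyr's `pivot_wider()` and `pivot_longer()`. The object names `penguins_wide` and `penguins_long` are illustrative, not taken from the original.

```r
library(tidyr)

# Long (tidy) form, typed in from the rounded values shown above.
penguins_sum <- data.frame(
  island = rep(c("Biscoe", "Dream", "Torgersen"), each = 3),
  year = rep(c(2007L, 2008L, 2009L), times = 3),
  mean_body_mass_g = c(4741, 4628, 4793,
                       3684, 3779, 3691,
                       3763, 3856, 3489)
)

# Long -> wide: each year becomes its own column of mean body masses.
penguins_wide <- pivot_wider(
  penguins_sum,
  names_from = year,
  values_from = mean_body_mass_g
)

# Wide -> long: gather the year columns back into a year/value pair.
# names_transform converts the former column names back to integers.
penguins_long <- pivot_longer(
  penguins_wide,
  cols = -island,
  names_to = "year",
  values_to = "mean_body_mass_g",
  names_transform = list(year = as.integer)
)
```

Here `penguins_wide` has one row per island and one column per year, while `penguins_long` returns to one row per island-year combination.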