The “tidy” format of a dataset can depend on how you plan to use the data. For examples of tidy data in action, return to the penguins dataset. (For instructions on loading penguins, see Getting started.)
This summarized version of the penguins data, called penguins_sum, gives the mean body mass for all penguins on a given island in a given year.
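A summary like this could be produced along the following lines. This is a minimal sketch, not necessarily the code used here: it assumes that penguins comes from the palmerpenguins package and that missing body mass values are dropped before averaging.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

penguins_sum <- penguins %>%
  group_by(island, year) %>%
  # na.rm = TRUE is an assumption: it skips penguins with no recorded body mass
  summarize(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop_last"  # keep the result grouped by island only
  )

penguins_sum
```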
## # A tibble: 9 × 3
## # Groups: island [3]
## island year mean_body_mass_g
## <fct> <int> <dbl>
## 1 Biscoe 2007 4741.
## 2 Biscoe 2008 4628.
## 3 Biscoe 2009 4793.
## 4 Dream 2007 3684.
## 5 Dream 2008 3779.
## 6 Dream 2009 3691.
## 7 Torgersen 2007 3763.
## 8 Torgersen 2008 3856.
## 9 Torgersen 2009 3489.
The resulting data is “tidy”: each variable has its own column, each observation (here, one island in one year) is a row, and each value has its own cell.
Compare that with this representation of the same data:
## # A tibble: 3 × 4
## # Groups: island [3]
## island `2007` `2008` `2009`
## <fct> <dbl> <dbl> <dbl>
## 1 Biscoe 4741. 4628. 4793.
## 2 Dream 3684. 3779. 3691.
## 3 Torgersen 3763. 3856. 3489.
While this table contains the same information, the mean body mass is now spread across three columns, one for each year. This representation is usually called “wide” data, while the first option is referred to as “long.” Per tidy data rules, year and body mass are both variables and should therefore each have their own column; in the wide version, the values of year appear as column names instead. tidyr will help us “pivot” from one to the other.
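As a rough sketch of what that pivoting can look like, the example below rebuilds the long table by hand from the rounded values printed above, then converts it to wide form and back with tidyr's pivot_wider() and pivot_longer(). The object names penguins_long, penguins_wide, and penguins_back are made up for illustration, and island is stored as plain text here rather than as a factor.

```r
library(tibble)
library(tidyr)

# Long ("tidy") form, typed in from the rounded values shown above
penguins_long <- tribble(
  ~island,     ~year, ~mean_body_mass_g,
  "Biscoe",    2007L, 4741,
  "Biscoe",    2008L, 4628,
  "Biscoe",    2009L, 4793,
  "Dream",     2007L, 3684,
  "Dream",     2008L, 3779,
  "Dream",     2009L, 3691,
  "Torgersen", 2007L, 3763,
  "Torgersen", 2008L, 3856,
  "Torgersen", 2009L, 3489
)

# Long to wide: each year becomes its own column
penguins_wide <- pivot_wider(
  penguins_long,
  names_from = year,
  values_from = mean_body_mass_g
)

# Wide to long: gather the year columns back into year / value pairs
penguins_back <- pivot_longer(
  penguins_wide,
  cols = -island,
  names_to = "year",
  values_to = "mean_body_mass_g",
  names_transform = list(year = as.integer)  # column names are text by default
)
```

Because column names are always character strings, the names_transform argument is what turns year back into an integer on the return trip.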