Combining data

During data analysis, you might have two separate datasets that you need to combine, or join.

For example, suppose you also have some data about the islands that are home to the Palmer penguins, and you want to include information about the islands in your analyses of penguin characteristics.

The palmer_islands dataset printed below holds the mean temperature and mean elevation of each island.
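
One way to build a tibble that matches the printout that follows, as a minimal sketch: it assumes the tibble package is installed, and the values are copied directly from that printed output.

```r
library(tibble)

# tribble() lays the data out row by row, which mirrors the printed table.
# Numeric literals are stored as doubles, matching the <dbl> columns shown.
palmer_islands <- tribble(
  ~island,     ~mean_temperature_c, ~mean_elevation_m,
  "Torgersen",                  -3,                17,
  "Biscoe",                     -6,                 8,
  "Dream",                      -1,                10
)
```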

palmer_islands
## # A tibble: 3 × 3
##   island    mean_temperature_c mean_elevation_m
##   <chr>                  <dbl>            <dbl>
## 1 Torgersen                 -3               17
## 2 Biscoe                    -6                8
## 3 Dream                     -1               10

The penguins and palmer_islands datasets share a column, island, that can be used to combine them. To join these two datasets, use the full_join() function from dplyr.

full_join(x = penguins, y = palmer_islands, by = "island") %>%
  select(species, island)
## # A tibble: 344 × 2
##    species island   
##    <fct>   <chr>    
##  1 Adelie  Torgersen
##  2 Adelie  Torgersen
##  3 Adelie  Torgersen
##  4 Adelie  Torgersen
##  5 Adelie  Torgersen
##  6 Adelie  Torgersen
##  7 Adelie  Torgersen
##  8 Adelie  Torgersen
##  9 Adelie  Torgersen
## 10 Adelie  Torgersen
## # … with 334 more rows

“Full” joins are one of many types of joins. In a full join, any rows that are included in either x or y (the first and second datasets, respectively) remain in the final dataset. Since the palmer_islands dataset has only one row per island, its rows are repeated as many times as needed so that each row in penguins has the corresponding data (in this case, the island's mean temperature and mean elevation).
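
To see the repetition directly, here is a minimal sketch that joins a small stand-in table to the island data. The penguins_small table is illustrative only (a few species and island pairs typed by hand, not the real penguins data), and the example assumes the dplyr and tibble packages.

```r
library(dplyr)
library(tibble)

# Illustrative stand-in for penguins: hand-typed rows, not the real dataset.
penguins_small <- tribble(
  ~species,    ~island,
  "Adelie",    "Torgersen",
  "Adelie",    "Biscoe",
  "Gentoo",    "Biscoe",
  "Chinstrap", "Dream"
)

# Island data, with values copied from the printed palmer_islands tibble.
palmer_islands <- tribble(
  ~island,     ~mean_temperature_c, ~mean_elevation_m,
  "Torgersen",                  -3,                17,
  "Biscoe",                     -6,                 8,
  "Dream",                      -1,                10
)

joined <- full_join(x = penguins_small, y = palmer_islands, by = "island")

# Biscoe appears twice in penguins_small, so the single Biscoe row from
# palmer_islands is attached to both of those rows.
joined
```

The result keeps one row per row of penguins_small and gains the two island columns.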

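Other join types include inner, left, and right joins. These join types differ in which rows they keep when a row in one dataset has no match in the other. If x and y are the first and second datasets, respectively, an inner join keeps only the rows whose key appears in both x and y, a left join keeps every row of x, and a right join keeps every row of y. In left, right, and full joins, rows without a match get NA in the columns that come from the other dataset.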
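
The difference between join types only shows up when some keys appear in just one dataset. The sketch below uses two small made-up tables to make that visible; x_small and y_small are illustrative names and hand-typed rows, not part of the original data, and the example assumes the dplyr and tibble packages.

```r
library(dplyr)
library(tibble)

# x_small has no row for Dream; y_small has no row for Torgersen.
x_small <- tribble(
  ~species, ~island,
  "Adelie", "Torgersen",
  "Gentoo", "Biscoe"
)

y_small <- tribble(
  ~island,  ~mean_temperature_c,
  "Biscoe",                  -6,
  "Dream",                   -1
)

# Keeps only keys present in both tables (Biscoe).
inner <- inner_join(x_small, y_small, by = "island")

# Keeps every row of x_small; Torgersen gets NA for mean_temperature_c.
left <- left_join(x_small, y_small, by = "island")

# Keeps every row of y_small; Dream gets NA for species.
right <- right_join(x_small, y_small, by = "island")

# Keeps rows from either table (Torgersen, Biscoe, Dream).
full <- full_join(x_small, y_small, by = "island")

# Compare how many rows each join type returns.
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```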

To read more about combining data, we recommend the Relational Data chapter of R for Data Science.