During data analysis, you might have two separate datasets that you need to combine, or join.
For example, suppose you also have some data about the islands that are home to the Palmer penguins, and you want to include information about the islands in your analyses of penguin characteristics.
The palmer_islands dataset below contains the mean temperature and mean elevation of each island.
palmer_islands
## # A tibble: 3 × 3
## island mean_temperature_c mean_elevation_m
## <chr> <dbl> <dbl>
## 1 Torgersen -3 17
## 2 Biscoe -6 8
## 3 Dream -1 10
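The code that constructs palmer_islands is not shown here. As a minimal sketch, one way to build a tibble matching the printed output is with tribble() from the tibble package. The choice of tribble() is an assumption; the values are copied from the output above.

```r
library(tibble)

# Rebuild the island-level data shown in the printed tibble above.
# tribble() is an assumed choice; it lets each row be written on its own line.
palmer_islands <- tribble(
  ~island,     ~mean_temperature_c, ~mean_elevation_m,
  "Torgersen", -3,                  17,
  "Biscoe",    -6,                   8,
  "Dream",     -1,                  10
)

palmer_islands
```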
penguins and palmer_islands share a column, island, that can be used to combine the two datasets. To join these two datasets, use the full_join() function from dplyr.
full_join(x = penguins, y = palmer_islands, by = "island") %>%
  select(species, island)
## # A tibble: 344 × 2
## species island
## <fct> <chr>
## 1 Adelie Torgersen
## 2 Adelie Torgersen
## 3 Adelie Torgersen
## 4 Adelie Torgersen
## 5 Adelie Torgersen
## 6 Adelie Torgersen
## 7 Adelie Torgersen
## 8 Adelie Torgersen
## 9 Adelie Torgersen
## 10 Adelie Torgersen
## # … with 334 more rows
A “full” join is one of several types of join. In a full join, any rows that are included in either x or y (the first and second datasets, respectively) remain in the final dataset. Since the palmer_islands dataset has only one row per island, its rows are repeated as many times as needed so that each row in penguins has the corresponding island data (in this case, mean temperature and mean elevation).
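To see the repetition described above, the sketch below keeps every column of the joined result and counts how many rows carry each island's values. It assumes that penguins comes from the palmerpenguins package and rebuilds palmer_islands from the values printed earlier. The name penguins_with_islands is introduced here only for illustration.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins dataset

# Island-level data, copied from the printed palmer_islands tibble.
palmer_islands <- tibble::tribble(
  ~island,     ~mean_temperature_c, ~mean_elevation_m,
  "Torgersen", -3,                  17,
  "Biscoe",    -6,                   8,
  "Dream",     -1,                  10
)

# Keep all columns this time so the island measurements are visible.
penguins_with_islands <- full_join(x = penguins, y = palmer_islands, by = "island")

# Same number of rows as penguins, with two extra columns from palmer_islands.
dim(penguins_with_islands)

# Each island's single row of measurements is repeated once per penguin on that island.
penguins_with_islands %>%
  count(island, mean_temperature_c, mean_elevation_m)
```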
Other join types include inner, left, and right joins. These join types differ in how they match rows between the two datasets. If x and y are the first and second datasets, respectively:
- inner_join() includes only the rows in x and y that match
- left_join() includes all rows in x (and any from y that match)
- right_join() includes all rows in y (and any from x that match)
- full_join() includes all rows in x or y, even if they do not match something in the other dataset

To read more about combining data, we recommend the Relational Data chapter of R for Data Science.
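The four join types only give different results when some rows fail to match. As a rough sketch, the code below compares their row counts using a deliberately incomplete island table. Both islands_partial and "Example Island" are hypothetical and exist only for this illustration. It again assumes that penguins comes from the palmerpenguins package.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins dataset

# Hypothetical island table: Dream is left out, and a made-up island with
# no penguins ("Example Island", measurements unknown) is added.
islands_partial <- tibble::tribble(
  ~island,          ~mean_temperature_c, ~mean_elevation_m,
  "Torgersen",      -3,                  17,
  "Biscoe",         -6,                   8,
  "Example Island", NA_real_,            NA_real_
)

joins <- list(
  # Only penguins on Torgersen or Biscoe; Dream and Example Island are dropped.
  inner = inner_join(penguins, islands_partial, by = "island"),
  # Every penguin; Dream penguins get NA for the island measurements.
  left  = left_join(penguins, islands_partial, by = "island"),
  # Every island in islands_partial; Example Island gets NA for penguin columns.
  right = right_join(penguins, islands_partial, by = "island"),
  # Every penguin and every island, matched or not.
  full  = full_join(penguins, islands_partial, by = "island")
)

# Number of rows produced by each join type.
sapply(joins, nrow)
```

Comparing the counts shows which rows each join keeps: the left join has exactly as many rows as penguins, and the full join has one more (the unmatched hypothetical island).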