filter()
and select()
Perhaps you wish to subset your data, retaining only observations (rows) that meet given criteria. To subset Rows, use function filter()
(both words contain an “r”, which may help you remember which function to use). The below code subsets only female penguins:
penguins %>%
filter(sex == "female")
## # A tibble: 165 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.5 17.4 186 3800
## 2 Adelie Torgersen 40.3 18 195 3250
## 3 Adelie Torgersen 36.7 19.3 193 3450
## 4 Adelie Torgersen 38.9 17.8 181 3625
## 5 Adelie Torgersen 41.1 17.6 182 3200
## 6 Adelie Torgersen 36.6 17.8 185 3700
## 7 Adelie Torgersen 38.7 19 195 3450
## 8 Adelie Torgersen 34.4 18.4 184 3325
## 9 Adelie Biscoe 37.8 18.3 174 3400
## 10 Adelie Biscoe 35.9 19.2 189 3800
## # … with 155 more rows, and 2 more variables: sex <fct>, year <int>
While this code subsets penguins with a bill length of less than 40mm:
penguins %>%
filter(bill_length_mm < 40)
## # A tibble: 100 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 36.7 19.3 193 3450
## 4 Adelie Torgersen 39.3 20.6 190 3650
## 5 Adelie Torgersen 38.9 17.8 181 3625
## 6 Adelie Torgersen 39.2 19.6 195 4675
## 7 Adelie Torgersen 34.1 18.1 193 3475
## 8 Adelie Torgersen 37.8 17.1 186 3300
## 9 Adelie Torgersen 37.8 17.3 180 3700
## 10 Adelie Torgersen 38.6 21.2 191 3800
## # … with 90 more rows, and 2 more variables: sex <fct>, year <int>
You can also combine requirements for your data (“arguements” in the function), for example subsetting penguins from Torgersen Island that were observed in 2008 with a body mass greater than 3500 grams:
penguins %>%
filter(island == "Torgersen" & year == 2008 & body_mass_g > 3500 )
## # A tibble: 14 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 41.8 19.4 198 4450
## 2 Adelie Torgersen 33.5 19 190 3600
## 3 Adelie Torgersen 39.7 18.4 190 3900
## 4 Adelie Torgersen 39.6 17.2 196 3550
## 5 Adelie Torgersen 45.8 18.9 197 4150
## 6 Adelie Torgersen 35.5 17.5 190 3700
## 7 Adelie Torgersen 42.8 18.5 195 4250
## 8 Adelie Torgersen 40.9 16.8 191 3700
## 9 Adelie Torgersen 37.2 19.4 184 3900
## 10 Adelie Torgersen 36.2 16.1 187 3550
## 11 Adelie Torgersen 42.1 19.1 195 4000
## 12 Adelie Torgersen 42.9 17.6 196 4700
## 13 Adelie Torgersen 36.7 18.8 187 3800
## 14 Adelie Torgersen 35.1 19.4 193 4200
## # … with 2 more variables: sex <fct>, year <int>
In working with the penguins data, you may have noticed missing values (NA
) throughout the dataset. Missing data can complicate data analysis, and you should be intentional in how you approach your NA
s.
You can use filter()
combined with !is.na()
(“is not missing”) to remove missing observations for a given variable. For example, creating a dataset that is only penguins of known sex:
penguins %>%
filter(!is.na(sex))
## # A tibble: 333 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>
Conversely, you may wish to subset observations with missing values, using filter()
in combination with is.na()
(“is missing”).
penguins %>%
filter(is.na(sex))
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # … with 2 more variables: sex <fct>, year <int>
Both logical and Boolean operators can be used with filter()
. The two pieces of code below (?base::Logic
and ?Comparison
) give more information on the different operators and their uses.
?base::Logic
?Comparison
Another way to subset data is by subsetting columns, or variables. A dataset may have more variables than you need, so you can use the select()
function to subset data to only variables of interest. (Both seleCt and columns contain “c”s, which may help you remember which function to use.)
To subset only species information from your dataset, you can use the following code:
penguins %>%
select(species)
## # A tibble: 344 × 1
## species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## 4 Adelie
## 5 Adelie
## 6 Adelie
## 7 Adelie
## 8 Adelie
## 9 Adelie
## 10 Adelie
## # … with 334 more rows
You can use a negative selection to subset everything except species information from your data:
penguins %>%
select(-species)
## # A tibble: 344 × 7
## island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Torger… 39.1 18.7 181 3750 male 2007
## 2 Torger… 39.5 17.4 186 3800 fema… 2007
## 3 Torger… 40.3 18 195 3250 fema… 2007
## 4 Torger… NA NA NA NA <NA> 2007
## 5 Torger… 36.7 19.3 193 3450 fema… 2007
## 6 Torger… 39.3 20.6 190 3650 male 2007
## 7 Torger… 38.9 17.8 181 3625 fema… 2007
## 8 Torger… 39.2 19.6 195 4675 male 2007
## 9 Torger… 34.1 18.1 193 3475 <NA> 2007
## 10 Torger… 42 20.2 190 4250 <NA> 2007
## # … with 334 more rows
You can also list which variables you would like to subset, for example species
, island
, and sex
:
penguins %>%
select(species, island, sex)
## # A tibble: 344 × 3
## species island sex
## <fct> <fct> <fct>
## 1 Adelie Torgersen male
## 2 Adelie Torgersen female
## 3 Adelie Torgersen female
## 4 Adelie Torgersen <NA>
## 5 Adelie Torgersen female
## 6 Adelie Torgersen male
## 7 Adelie Torgersen female
## 8 Adelie Torgersen male
## 9 Adelie Torgersen <NA>
## 10 Adelie Torgersen <NA>
## # … with 334 more rows
(To recap, subsetting columns is done with select
and subsetting rows is done with filter
; both will be useful as you work with data.)