Subsetting data with filter() and select()

Perhaps you wish to subset your data, retaining only observations (rows) that meet given criteria. To subset Rows, use function filter() (both words contain an “r”, which may help you remember which function to use). The below code subsets only female penguins:

penguins %>%
  filter(sex == "female")
## # A tibble: 165 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.5          17.4               186        3800
##  2 Adelie  Torgersen           40.3          18                 195        3250
##  3 Adelie  Torgersen           36.7          19.3               193        3450
##  4 Adelie  Torgersen           38.9          17.8               181        3625
##  5 Adelie  Torgersen           41.1          17.6               182        3200
##  6 Adelie  Torgersen           36.6          17.8               185        3700
##  7 Adelie  Torgersen           38.7          19                 195        3450
##  8 Adelie  Torgersen           34.4          18.4               184        3325
##  9 Adelie  Biscoe              37.8          18.3               174        3400
## 10 Adelie  Biscoe              35.9          19.2               189        3800
## # … with 155 more rows, and 2 more variables: sex <fct>, year <int>

While this code subsets penguins with a bill length of less than 40mm:

penguins %>%
  filter(bill_length_mm < 40)
## # A tibble: 100 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           36.7          19.3               193        3450
##  4 Adelie  Torgersen           39.3          20.6               190        3650
##  5 Adelie  Torgersen           38.9          17.8               181        3625
##  6 Adelie  Torgersen           39.2          19.6               195        4675
##  7 Adelie  Torgersen           34.1          18.1               193        3475
##  8 Adelie  Torgersen           37.8          17.1               186        3300
##  9 Adelie  Torgersen           37.8          17.3               180        3700
## 10 Adelie  Torgersen           38.6          21.2               191        3800
## # … with 90 more rows, and 2 more variables: sex <fct>, year <int>

You can also combine requirements for your data (“arguements” in the function), for example subsetting penguins from Torgersen Island that were observed in 2008 with a body mass greater than 3500 grams:

penguins %>%
  filter(island == "Torgersen" & year == 2008 & body_mass_g > 3500 )
## # A tibble: 14 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           41.8          19.4               198        4450
##  2 Adelie  Torgersen           33.5          19                 190        3600
##  3 Adelie  Torgersen           39.7          18.4               190        3900
##  4 Adelie  Torgersen           39.6          17.2               196        3550
##  5 Adelie  Torgersen           45.8          18.9               197        4150
##  6 Adelie  Torgersen           35.5          17.5               190        3700
##  7 Adelie  Torgersen           42.8          18.5               195        4250
##  8 Adelie  Torgersen           40.9          16.8               191        3700
##  9 Adelie  Torgersen           37.2          19.4               184        3900
## 10 Adelie  Torgersen           36.2          16.1               187        3550
## 11 Adelie  Torgersen           42.1          19.1               195        4000
## 12 Adelie  Torgersen           42.9          17.6               196        4700
## 13 Adelie  Torgersen           36.7          18.8               187        3800
## 14 Adelie  Torgersen           35.1          19.4               193        4200
## # … with 2 more variables: sex <fct>, year <int>

In working with the penguins data, you may have noticed missing values (NA) throughout the dataset. Missing data can complicate data analysis, and you should be intentional in how you approach your NAs.

You can use filter() combined with !is.na() (“is not missing”) to remove missing observations for a given variable. For example, creating a dataset that is only penguins of known sex:

penguins %>%
  filter(!is.na(sex))
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>

Conversely, you may wish to subset observations with missing values, using filter() in combination with is.na() (“is missing”).

penguins %>%
  filter(is.na(sex))
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # … with 2 more variables: sex <fct>, year <int>

Both logical and Boolean operators can be used with filter(). The two pieces of code below (?base::Logic and ?Comparison) give more information on the different operators and their uses.

?base::Logic

?Comparison

Another way to subset data is by subsetting columns, or variables. A dataset may have more variables than you need, so you can use the select() function to subset data to only variables of interest. (Both seleCt and columns contain “c”s, which may help you remember which function to use.)

To subset only species information from your dataset, you can use the following code:

penguins %>%
  select(species)
## # A tibble: 344 × 1
##    species
##    <fct>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # … with 334 more rows

You can use a negative selection to subset everything except species information from your data:

penguins %>%
  select(-species)
## # A tibble: 344 × 7
##    island  bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Torger…           39.1          18.7              181        3750 male   2007
##  2 Torger…           39.5          17.4              186        3800 fema…  2007
##  3 Torger…           40.3          18                195        3250 fema…  2007
##  4 Torger…           NA            NA                 NA          NA <NA>   2007
##  5 Torger…           36.7          19.3              193        3450 fema…  2007
##  6 Torger…           39.3          20.6              190        3650 male   2007
##  7 Torger…           38.9          17.8              181        3625 fema…  2007
##  8 Torger…           39.2          19.6              195        4675 male   2007
##  9 Torger…           34.1          18.1              193        3475 <NA>   2007
## 10 Torger…           42            20.2              190        4250 <NA>   2007
## # … with 334 more rows

You can also list which variables you would like to subset, for example species, island, and sex:

penguins %>%
  select(species, island, sex)
## # A tibble: 344 × 3
##    species island    sex   
##    <fct>   <fct>     <fct> 
##  1 Adelie  Torgersen male  
##  2 Adelie  Torgersen female
##  3 Adelie  Torgersen female
##  4 Adelie  Torgersen <NA>  
##  5 Adelie  Torgersen female
##  6 Adelie  Torgersen male  
##  7 Adelie  Torgersen female
##  8 Adelie  Torgersen male  
##  9 Adelie  Torgersen <NA>  
## 10 Adelie  Torgersen <NA>  
## # … with 334 more rows

(To recap, subsetting columns is done with select and subsetting rows is done with filter; both will be useful as you work with data.)