Working with text data with stringr

Many of the previous examples have worked with numeric data; you may also work with text, or “string” data. String data can include numbers, letters, spaces, and special characters. Strings are most often represented as "text surrounded in quotes" in R.

The stringr package is a useful tool for working with string data. Instead of going through a step-by-step tutorial for each stringr function (there are many!), the below code presents concrete examples using the penguins dataset to demonstrate some of the capabilities of stringr.

Most functions you will want to use from the stringr package begin with str_. If you forget the name of the function you need, try typing str_ and scrolling through the auto-completed list of functions exported from stringr.

String manipulations are carried out with the use of patterns, specified with a language called regular expressions (or “regex”, for short). Regular expressions are shorthand ways of specifying general patterns that you wish to match in data.

Some regex that may be useful for getting started:

  1. ^x : means ‘begins with x’
  2. x$ : means ‘ends with x’
  3. x{n} : means ‘exactly n occurrences of x in a row’
  4. x{n,} : means ‘n or more occurrences of x in a row’
  5. x{n,m} : means ‘between n and m occurrences of x in a row’
  6. [xyz] : means ‘any one of the characters x, y, or z’
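
As a quick illustration of the patterns listed above, the following sketch applies each one with str_detect() to a small vector of made-up strings. The vector x and its values are assumptions chosen for illustration and are not part of the penguins data.

```r
library(stringr)

# Hypothetical example strings (not from the penguins dataset)
x <- c("apple", "banana", "aab", "kiwi")

begins_a  <- str_detect(x, "^a")      # TRUE FALSE  TRUE FALSE
ends_a    <- str_detect(x, "a$")      # FALSE  TRUE FALSE FALSE
double_a  <- str_detect(x, "a{2}")    # FALSE FALSE  TRUE FALSE
some_a    <- str_detect(x, "a{1,}")   # TRUE  TRUE  TRUE FALSE
one_two_p <- str_detect(x, "p{1,2}")  # TRUE FALSE FALSE FALSE
has_k_b   <- str_detect(x, "[kb]")    # FALSE  TRUE  TRUE  TRUE
```

Note that an unanchored pattern such as "a{2}" matches anywhere in the string; combine it with ^ or $ to tie the match to the start or the end.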

Several of these patterns are shown in context below.

The below code subsets the data (using filter()) to penguins whose species name starts with the letter ‘A’ and whose island name ends with the letter ‘m’.

penguins %>%
  filter(str_detect(species, "^A") & str_detect(island, "m$"))
## # A tibble: 56 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Dream            39.5          16.7               178        3250
##  2 Adelie  Dream            37.2          18.1               178        3900
##  3 Adelie  Dream            39.5          17.8               188        3300
##  4 Adelie  Dream            40.9          18.9               184        3900
##  5 Adelie  Dream            36.4          17                 195        3325
##  6 Adelie  Dream            39.2          21.1               196        4150
##  7 Adelie  Dream            38.8          20                 190        3950
##  8 Adelie  Dream            42.2          18.5               180        3550
##  9 Adelie  Dream            37.6          19.3               181        3300
## 10 Adelie  Dream            39.8          19.1               184        4650
## # … with 46 more rows, and 2 more variables: sex <fct>, year <int>

As is often true in coding, there are other ways to arrive at this same result (example: filter(species == "Adelie" & island == "Dream")); the above code is meant to provide an extensible example of str_detect().
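
To see why str_detect() works inside filter(), it may help to look at what it returns: a logical vector with one TRUE or FALSE per input string. The sketch below uses two short made-up vectors (species and island here are hypothetical stand-ins for the dataset columns, not the real data).

```r
library(stringr)

# Hypothetical stand-ins for the species and island columns
species <- c("Adelie", "Adelie", "Gentoo", "Chinstrap")
island  <- c("Dream", "Torgersen", "Biscoe", "Dream")

starts_with_a <- str_detect(species, "^A")  # TRUE  TRUE FALSE FALSE
ends_with_m   <- str_detect(island, "m$")   # TRUE FALSE FALSE  TRUE

# Element-wise AND: TRUE only where both conditions hold
keep <- starts_with_a & ends_with_m         # TRUE FALSE FALSE FALSE

# Keep only the matching elements, much as filter() keeps matching rows
species[keep]                               # "Adelie"
```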

This same approach can be extended to look for observations that start or end with any given combination of characters. The below will filter to any penguins with a value of species beginning with “Adeli”.

penguins %>%
  filter(str_detect(species, "^Adeli"))
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 142 more rows, and 2 more variables: sex <fct>, year <int>

The below code filters to species that end in one (1) or two (2) “o’s”; after that subsetting, each species name is capitalized (for example, Gentoo becomes GENTOO) using the str_to_upper() function.

penguins_edited <- penguins %>%
  filter(str_detect(species, "o{1,2}$")) %>%
  mutate(species = str_to_upper(species))

penguins_edited
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 GENTOO  Biscoe           46.1          13.2               211        4500
##  2 GENTOO  Biscoe           50            16.3               230        5700
##  3 GENTOO  Biscoe           48.7          14.1               210        4450
##  4 GENTOO  Biscoe           50            15.2               218        5700
##  5 GENTOO  Biscoe           47.6          14.5               215        5400
##  6 GENTOO  Biscoe           46.5          13.5               210        4550
##  7 GENTOO  Biscoe           45.4          14.6               211        4800
##  8 GENTOO  Biscoe           46.7          15.3               219        5200
##  9 GENTOO  Biscoe           43.3          13.4               209        4400
## 10 GENTOO  Biscoe           46.8          15.4               215        5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>

(If you have reason to have your data in the style of bell hooks, there is also a corresponding str_to_lower() function.)
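
The case-conversion functions can be tried on a plain character vector, as in the minimal sketch below. The vector species is a made-up stand-in for the dataset column; str_to_title() is a third stringr function, not used elsewhere in this section, that capitalizes the first letter of each word.

```r
library(stringr)

# Hypothetical stand-in for the species column
species <- c("Adelie", "Chinstrap", "Gentoo")

upper <- str_to_upper(species)  # "ADELIE" "CHINSTRAP" "GENTOO"
lower <- str_to_lower(species)  # "adelie" "chinstrap" "gentoo"
title <- str_to_title(lower)    # "Adelie" "Chinstrap" "Gentoo"
```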

In addition to looking for a pattern (str_detect()) or adjusting the case of characters (str_to_upper()/str_to_lower()), you can also remove characters based on a pattern-matching rule.

The below code uses str_remove() to drop two “O’s” from the end of each value of species. (This can be especially useful when you have observations that have a prefix or a suffix that you do not want.)

penguins_edited %>%
  mutate(species = str_remove(species, "O{2}$"))
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 GENT    Biscoe           46.1          13.2               211        4500
##  2 GENT    Biscoe           50            16.3               230        5700
##  3 GENT    Biscoe           48.7          14.1               210        4450
##  4 GENT    Biscoe           50            15.2               218        5700
##  5 GENT    Biscoe           47.6          14.5               215        5400
##  6 GENT    Biscoe           46.5          13.5               210        4550
##  7 GENT    Biscoe           45.4          14.6               211        4800
##  8 GENT    Biscoe           46.7          15.3               219        5200
##  9 GENT    Biscoe           43.3          13.4               209        4400
## 10 GENT    Biscoe           46.8          15.4               215        5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>

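(Note that the pattern is case-sensitive: it matches here only because species was converted to upper case in the previous step. The original values, such as “Gentoo”, end in lowercase “o’s” and would be left unchanged. If a value of species ended in, for example, three “O’s”, the result would retain only one.)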
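
The behavior of an anchored pattern such as "O{2}$" can be checked on a few made-up strings. In the sketch below, the vector x is hypothetical. The second call uses stringr's regex() helper with ignore_case = TRUE, which is one way to make the match case-insensitive.

```r
library(stringr)

# Hypothetical values: upper case, mixed case, and three trailing O's
x <- c("GENTOO", "Gentoo", "GENTOOO")

# Case-sensitive: only upper-case "OO" at the end is removed
strict <- str_remove(x, "O{2}$")
# "GENT" "Gentoo" "GENTO"

# Case-insensitive: "oo" at the end is removed as well
relaxed <- str_remove(x, regex("O{2}$", ignore_case = TRUE))
# "GENT" "Gent" "GENTO"
```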

Another common task when working with string data is to replace a pattern, achieved with str_replace().

The below code replaces the “e” at the end of “Biscoe” with “tti”, turning “Biscoe” into “Biscotti”.

penguins %>%
  filter(island == "Biscoe") %>%
  mutate(island = str_replace(island, "e", "tti"))
## # A tibble: 168 × 7
##    species island   bill_length_mm bill_depth_mm flipper_length_mm sex     year
##    <fct>   <chr>             <dbl>         <dbl>             <int> <fct>  <int>
##  1 Adelie  Biscotti           37.8          18.3               174 female  2007
##  2 Adelie  Biscotti           37.7          18.7               180 male    2007
##  3 Adelie  Biscotti           35.9          19.2               189 female  2007
##  4 Adelie  Biscotti           38.2          18.1               185 male    2007
##  5 Adelie  Biscotti           38.8          17.2               180 male    2007
##  6 Adelie  Biscotti           35.3          18.9               187 female  2007
##  7 Adelie  Biscotti           40.6          18.6               183 male    2007
##  8 Adelie  Biscotti           40.5          17.9               187 female  2007
##  9 Adelie  Biscotti           37.9          18.6               172 female  2007
## 10 Adelie  Biscotti           40.5          18.9               180 male    2007
## # … with 158 more rows

Transforming islands into biscuits notwithstanding, knowing how to make replacements such as these will likely be important for your data work.

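str_replace() changes only the first match in each string. To replace every match within each string, use str_replace_all(). The below code replaces all lowercase vowels in species with “x” (the capital “A” in “Adelie” is left alone because the pattern is case-sensitive).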

penguins %>%
  mutate(species = str_replace_all(species, "[aeiou]", "x"))
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adxlxx  Torgersen           39.1          18.7               181        3750
##  2 Adxlxx  Torgersen           39.5          17.4               186        3800
##  3 Adxlxx  Torgersen           40.3          18                 195        3250
##  4 Adxlxx  Torgersen           NA            NA                  NA          NA
##  5 Adxlxx  Torgersen           36.7          19.3               193        3450
##  6 Adxlxx  Torgersen           39.3          20.6               190        3650
##  7 Adxlxx  Torgersen           38.9          17.8               181        3625
##  8 Adxlxx  Torgersen           39.2          19.6               195        4675
##  9 Adxlxx  Torgersen           34.1          18.1               193        3475
## 10 Adxlxx  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
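
The difference between str_replace() and str_replace_all() is easiest to see on a single string. The following is a minimal sketch on literal values rather than the dataset; the "e$" pattern in the last line is an illustrative variation that ties the replacement to the end of the string.

```r
library(stringr)

x <- "Adelie"

first_only <- str_replace(x, "e", "x")             # "Adxlie"
every_e    <- str_replace_all(x, "e", "x")         # "Adxlix"
vowels     <- str_replace_all(x, "[aeiou]", "x")   # "Adxlxx"

# Case-insensitive version also replaces the capital "A"
any_case <- str_replace_all(x, regex("[aeiou]", ignore_case = TRUE), "x")
# "xdxlxx"

# Anchoring with $ replaces only an "e" at the very end
biscotti <- str_replace("Biscoe", "e$", "tti")     # "Biscotti"
```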

The stringr package contains many tools beyond those profiled above, and regex can be challenging to navigate. Some additional resources for this work can be found in the stringr cheatsheet (credit: Lise Vaudor). To learn more about stringr, see the package’s website, which contains a more extensive overview of the package and its functions. For further background, visit the Strings chapter of the R for Data Science book.