stringr
Many of the previous examples have worked with numeric data; you may also work with text, or “string”, data. String data can include numbers, letters, spaces, and special characters. In R, strings are most often represented as "text surrounded by quotes".
The stringr package is a useful tool for working with string data. Instead of going through a step-by-step tutorial for each stringr function (there are many!), the code below presents concrete examples using the penguins dataset to demonstrate some of the capabilities of stringr.
Most functions you will want to use from the stringr package begin with str_. If you forget the name of the function you need, try typing str_ and scrolling through the auto-completed list of functions exported from stringr.
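Outside an editor with auto-completion, a similar list can be produced from the console. The following is a minimal sketch; the variable name str_functions is only illustrative.

```r
library(stringr)

# List every object attached by stringr whose name starts with "str_".
str_functions <- ls("package:stringr", pattern = "^str_")

# Peek at the first few names and count how many there are.
head(str_functions)
length(str_functions)
```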
String manipulations are carried out with the use of patterns, specified with a language called regular expressions (or “regex”, for short). Regular expressions are shorthand ways of specifying general patterns that you wish to match in data.
Some regex that may be useful for getting started:
^x: means ‘begins with x’
x$: means ‘ends with x’
x{n}: means exactly n occurrences of x
x{n,}: means n or more occurrences of x
x{n,m}: means between n and m occurrences of x
[xyz]: means ‘includes x or y or z’
To see each of these in context, see below.
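As a quick check before moving to the penguins data, each pattern above can be tried on a small character vector with str_detect(), which returns TRUE wherever the pattern matches. The vector words is made up for illustration. Note that, unless it is anchored with ^ or $, a pattern such as o{2} can match anywhere in the string.

```r
library(stringr)

# A made-up vector, used only to illustrate the patterns.
words <- c("apple", "banana", "boo", "xyz")

starts_a  <- str_detect(words, "^a")      # begins with "a"
ends_a    <- str_detect(words, "a$")      # ends with "a"
two_o     <- str_detect(words, "o{2}")    # two "o"s in a row
two_p     <- str_detect(words, "p{2,}")   # two or more "p"s in a row
one_two_n <- str_detect(words, "n{1,2}")  # one or two "n"s in a row
any_xyz   <- str_detect(words, "[xyz]")   # contains "x", "y", or "z"

words[starts_a]   # "apple"
words[ends_a]     # "banana"
words[two_o]      # "boo"
words[two_p]      # "apple"
words[one_two_n]  # "banana"
words[any_xyz]    # "xyz"
```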
The code below uses filter() to subset the data to penguins whose species name starts with the letter ‘A’ and whose island name ends with the letter ‘m’.
penguins %>%
filter(str_detect(species, "^A") & str_detect(island, "m$"))
## # A tibble: 56 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 39.5 16.7 178 3250
## 2 Adelie Dream 37.2 18.1 178 3900
## 3 Adelie Dream 39.5 17.8 188 3300
## 4 Adelie Dream 40.9 18.9 184 3900
## 5 Adelie Dream 36.4 17 195 3325
## 6 Adelie Dream 39.2 21.1 196 4150
## 7 Adelie Dream 38.8 20 190 3950
## 8 Adelie Dream 42.2 18.5 180 3550
## 9 Adelie Dream 37.6 19.3 181 3300
## 10 Adelie Dream 39.8 19.1 184 4650
## # … with 46 more rows, and 2 more variables: sex <fct>, year <int>
As is often true in coding, there are other ways to arrive at this same result (for example, filter(species == "Adelie" & island == "Dream")); the above code is meant to provide an extensible example of str_detect().
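It may help to see what filter() receives. str_detect() returns one TRUE or FALSE per input value, and filter() keeps the rows where the combined condition is TRUE. A minimal sketch on two short, made-up vectors (species and island here are stand-ins for the columns, not the real data):

```r
library(stringr)

# Hypothetical values standing in for the species and island columns.
species <- c("Adelie", "Gentoo", "Chinstrap", "Adelie")
island  <- c("Dream", "Biscoe", "Dream", "Torgersen")

# TRUE only where species starts with "A" AND island ends with "m".
keep <- str_detect(species, "^A") & str_detect(island, "m$")
keep  # TRUE FALSE FALSE FALSE
```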
This same approach can be extended to look for observations that start or end with any given combination of characters. The code below filters to penguins whose value of species begins with “Adeli”.
penguins %>%
filter(str_detect(species, "^Adeli"))
## # A tibble: 152 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 142 more rows, and 2 more variables: sex <fct>, year <int>
The code below filters to species whose names end in one or two “o”s; after that subsetting, each species name is converted to upper case (for example, Gentoo becomes GENTOO) using the str_to_upper() function.
penguins_edited <- penguins %>%
  filter(str_detect(species, "o{1,2}$")) %>%
  mutate(species = str_to_upper(species))
penguins_edited
## # A tibble: 124 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <fct> <dbl> <dbl> <int> <int>
## 1 GENTOO Biscoe 46.1 13.2 211 4500
## 2 GENTOO Biscoe 50 16.3 230 5700
## 3 GENTOO Biscoe 48.7 14.1 210 4450
## 4 GENTOO Biscoe 50 15.2 218 5700
## 5 GENTOO Biscoe 47.6 14.5 215 5400
## 6 GENTOO Biscoe 46.5 13.5 210 4550
## 7 GENTOO Biscoe 45.4 14.6 211 4800
## 8 GENTOO Biscoe 46.7 15.3 219 5200
## 9 GENTOO Biscoe 43.3 13.4 209 4400
## 10 GENTOO Biscoe 46.8 15.4 215 5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>
(If you have reason to have your data in the style of bell hooks, there is also a corresponding str_to_lower() function.)
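The two case functions can be compared on a plain character vector. The sketch below uses a made-up species vector rather than the real data, and also shows that regex matching is case-sensitive: once the values are upper case, a lowercase pattern such as o$ no longer matches.

```r
library(stringr)

# Hypothetical values standing in for the species column.
species <- c("Adelie", "Gentoo", "Chinstrap")

upper <- str_to_upper(species)  # "ADELIE" "GENTOO" "CHINSTRAP"
lower <- str_to_lower(species)  # "adelie" "gentoo" "chinstrap"

# Patterns are case-sensitive: "o$" finds nothing in the upper-case values,
# while "O$" matches "GENTOO".
lower_o <- str_detect(upper, "o$")  # FALSE FALSE FALSE
upper_o <- str_detect(upper, "O$")  # FALSE TRUE FALSE
```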
In addition to looking for a pattern (str_detect()) or adjusting the case of characters (str_to_upper()/str_to_lower()), you can also remove characters based on a pattern-matching rule.
The code below uses str_remove() to drop two “O”s from the end of each value of species. (This can be especially useful when you have observations that have a prefix or a suffix that you do not want.)
penguins_edited %>%
  mutate(species = str_remove(species, "O{2}$"))
## # A tibble: 124 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <fct> <dbl> <dbl> <int> <int>
## 1 GENT Biscoe 46.1 13.2 211 4500
## 2 GENT Biscoe 50 16.3 230 5700
## 3 GENT Biscoe 48.7 14.1 210 4450
## 4 GENT Biscoe 50 15.2 218 5700
## 5 GENT Biscoe 47.6 14.5 215 5400
## 6 GENT Biscoe 46.5 13.5 210 4550
## 7 GENT Biscoe 45.4 14.6 211 4800
## 8 GENT Biscoe 46.7 15.3 219 5200
## 9 GENT Biscoe 43.3 13.4 209 4400
## 10 GENT Biscoe 46.8 15.4 215 5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>
(Note that every value of species in penguins_edited ends in exactly two “O”s, so each GENTOO becomes GENT. A value that did not end in two “O”s would be left unchanged, and a value ending in three “O”s would retain one.)
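These cases can be checked directly on a small vector. In the sketch below, the values vector is invented for illustration; "GENTOOO" is not a real species name.

```r
library(stringr)

# Made-up values: two trailing O's, three trailing O's, and none.
values <- c("GENTOO", "GENTOOO", "ADELIE")

# Remove exactly two "O"s anchored at the end of each string.
trimmed <- str_remove(values, "O{2}$")
trimmed  # "GENT" "GENTO" "ADELIE"
```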
Another common task when working with string data is to replace a pattern, achieved with str_replace().
The code below replaces the first (and only) “e” in “Biscoe” with “tti”.
penguin %>%
filter(island == "Biscoe") %>%
mutate(island = str_replace(island, "e", "tti"))
## # A tibble: 168 × 7
## species island bill_length_mm bill_depth_mm flipper_length_mm sex year
## <fct> <chr> <dbl> <dbl> <int> <fct> <int>
## 1 Adelie Biscotti 37.8 18.3 174 female 2007
## 2 Adelie Biscotti 37.7 18.7 180 male 2007
## 3 Adelie Biscotti 35.9 19.2 189 female 2007
## 4 Adelie Biscotti 38.2 18.1 185 male 2007
## 5 Adelie Biscotti 38.8 17.2 180 male 2007
## 6 Adelie Biscotti 35.3 18.9 187 female 2007
## 7 Adelie Biscotti 40.6 18.6 183 male 2007
## 8 Adelie Biscotti 40.5 17.9 187 female 2007
## 9 Adelie Biscotti 37.9 18.6 172 female 2007
## 10 Adelie Biscotti 40.5 18.9 180 male 2007
## # … with 158 more rows
Transforming islands into biscuits notwithstanding, knowing how to make replacements such as these will likely be important for your data work.
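Because str_replace() acts on the first match in each string, an unanchored pattern like "e" can change values in unexpected places. The sketch below, on a made-up islands vector, compares the unanchored pattern with one anchored to the end of the string using $.

```r
library(stringr)

# Hypothetical values standing in for the island column.
islands <- c("Biscoe", "Dream", "Torgersen")

# Unanchored: the first "e" in every string is replaced.
first_e <- str_replace(islands, "e", "tti")
first_e  # "Biscotti" "Drttiam" "Torgttirsen"

# Anchored: only an "e" at the very end of the string is replaced.
final_e <- str_replace(islands, "e$", "tti")
final_e  # "Biscotti" "Dream" "Torgersen"
```

Anchoring the pattern is one way to make the same replacement safe to apply without first filtering to a single island.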
To replace every match of a pattern within each string, rather than only the first, use str_replace_all(). The code below replaces all lowercase vowels in species with “x”.
penguins %>%
mutate(species = str_replace_all(species, "[aeiou]", "x"))
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <fct> <dbl> <dbl> <int> <int>
## 1 Adxlxx Torgersen 39.1 18.7 181 3750
## 2 Adxlxx Torgersen 39.5 17.4 186 3800
## 3 Adxlxx Torgersen 40.3 18 195 3250
## 4 Adxlxx Torgersen NA NA NA NA
## 5 Adxlxx Torgersen 36.7 19.3 193 3450
## 6 Adxlxx Torgersen 39.3 20.6 190 3650
## 7 Adxlxx Torgersen 38.9 17.8 181 3625
## 8 Adxlxx Torgersen 39.2 19.6 195 4675
## 9 Adxlxx Torgersen 34.1 18.1 193 3475
## 10 Adxlxx Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
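The difference between str_replace() and str_replace_all() is easiest to see side by side. The sketch below uses a made-up species vector; it also shows why the leading “A” in “Adelie” survives above: the character class [aeiou] lists only lowercase vowels.

```r
library(stringr)

# Hypothetical values standing in for the species column.
species <- c("Adelie", "Gentoo", "Chinstrap")

# Only the first lowercase vowel in each string is replaced.
first_vowel <- str_replace(species, "[aeiou]", "x")
first_vowel  # "Adxlie" "Gxntoo" "Chxnstrap"

# Every lowercase vowel is replaced.
all_vowels <- str_replace_all(species, "[aeiou]", "x")
all_vowels  # "Adxlxx" "Gxntxx" "Chxnstrxp"

# Adding the uppercase vowels to the class catches the leading "A" too.
any_case <- str_replace_all(species, "[aeiouAEIOU]", "x")
any_case  # "xdxlxx" "Gxntxx" "Chxnstrxp"
```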
The stringr package contains many tools beyond those profiled above, and regex can be challenging to navigate. Some additional resources for this work can be found in the stringr cheatsheet (credit: Lise Vaudor). To learn more about stringr, the package’s website contains a more extensive overview of the package and its functions. For further background, visit the Strings chapter of the R for Data Science book.