Strings
Regular expressions
Using regular expressions to extract data from text: an example

(NB: This has its roots in a workshop co-written and delivered by Matt Denny and myself.)

For text manipulation in R, I recommend the stringr package, which is part of the “tidyverse.” There is a fantastic “cheat sheet” available here. All stringr functions start with str_.

(A little bit in the weeds … First, a lot of this can be done in base R, but it can be less straightforward or intuitive. Second, almost all of stringr is a “wrapper” for commands in the package stringi, providing a more intuitive and consistent syntax, especially if you are also working with other elements of the tidyverse. So, not only can you do most of the following with stringi, you actually are doing most of this with stringi “under the hood.”)

Install stringr if you need to, and load it:

#install.packages("stringr", dependencies = TRUE)
# the following is necessary for the str_view command to work 
#install.packages("htmlwidgets", dependencies = TRUE)
library(stringr)

Strings

The basic things we want to manipulate are “strings.” These are “character” objects in R, and can be specified using double quotes (") or single quotes ('):

a_string <- "Example STRING, with numbers (12, 15 and also 10.2)?!"
a_string

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?!"

It’s really a matter of style or convenience, but you might use single quotes if your string actually contains double quotes:

my_single_quoted_string <- 'He asked, "Why would you use single quotes?"'
my_single_quoted_string

## [1] "He asked, \"Why would you use single quotes?\""

R always displays strings in double-quotes. That \ tells R to “escape” the next character. In this case, the \" is saying, " is part of the string, not the end of the string.

You can specify the string that way if you want.

my_string_with_double_quotes <- "She answered, \"Convenience, but you never really have to.\""
my_string_with_double_quotes

## [1] "She answered, \"Convenience, but you never really have to.\""

If you ever want to see how your string with escape characters displays when printed or (typically) in an editor, use writeLines.

writeLines(my_single_quoted_string)

## He asked, "Why would you use single quotes?"

writeLines(my_string_with_double_quotes)

## She answered, "Convenience, but you never really have to."

This can get a little bit confusing. For example, since the backslash character tells R to escape, to indicate an actual backslash character you have to backslash your backslashes:

a_string_with_backslashes <- "To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\."
a_string_with_backslashes

## [1] "To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\."

writeLines(a_string_with_backslashes)

## To indicate a backslash, \, you have to type two: \\. Just there, to indicate two backslashes, I had to type four: \\\\.
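
If the doubled backslashes get hard to read, R 4.0.0 and later also accept raw string literals, in which a backslash is taken literally. A minimal sketch (the variable names are only for illustration, and this will not parse on older versions of R):

```r
# Raw strings need R >= 4.0.0: everything between r"( and )" is taken literally.
escaped <- "One backslash: \\"
raw     <- r"(One backslash: \)"

identical(escaped, raw)  # TRUE: both hold the same characters
writeLines(raw)          # shows the string with a single backslash
```

Raw strings are handy for regular expressions later on, where nearly every pattern contains backslashes.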

There are a number of special escape characters that are used to represent things like “control characters.” The most common are two that you’re already used to tapping a keyboard key for without expecting a character to appear on your screen: \t (tab) and \n (newline).

test_string <- "abc ABC 123\t.!?\\(){}\n"
test_string

## [1] "abc ABC 123\t.!?\\(){}\n"

writeLines(test_string)

## abc ABC 123  .!?\(){}

As with pretty much everything in R, you can have a vector of strings (a “character vector”).

a_vector_of_strings <- c("abcde", "123", "chicken of the sea")
a_vector_of_strings

## [1] "abcde"              "123"                "chicken of the sea"

Base R comes with a few built in string vectors – letters, LETTERS, month.abb, and month.name. Loading stringr also loads a few more: fruit, words, and sentences. We’ll use these for a few examples, so let’s look at them. The last two are long, so we’ll just look at the first few entries of each of those.

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

LETTERS

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"

month.abb

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

month.name

##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"

fruit

##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberry"         
##  [7] "blackberry"        "blackcurrant"      "blood orange"     
## [10] "blueberry"         "boysenberry"       "breadfruit"       
## [13] "canary melon"      "cantaloupe"        "cherimoya"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberry"        "coconut"           "cranberry"        
## [22] "cucumber"          "currant"           "damson"           
## [25] "date"              "dragonfruit"       "durian"           
## [28] "eggplant"          "elderberry"        "feijoa"           
## [31] "fig"               "goji berry"        "gooseberry"       
## [34] "grape"             "grapefruit"        "guava"            
## [37] "honeydew"          "huckleberry"       "jackfruit"        
## [40] "jambul"            "jujube"            "kiwi fruit"       
## [43] "kumquat"           "lemon"             "lime"             
## [46] "loquat"            "lychee"            "mandarine"        
## [49] "mango"             "mulberry"          "nectarine"        
## [52] "nut"               "olive"             "orange"           
## [55] "pamelo"            "papaya"            "passionfruit"     
## [58] "peach"             "pear"              "persimmon"        
## [61] "physalis"          "pineapple"         "plum"             
## [64] "pomegranate"       "pomelo"            "purple mangosteen"
## [67] "quince"            "raisin"            "rambutan"         
## [70] "raspberry"         "redcurrant"        "rock melon"       
## [73] "salal berry"       "satsuma"           "star fruit"       
## [76] "strawberry"        "tamarillo"         "tangerine"        
## [79] "ugli fruit"        "watermelon"

length(words)

## [1] 980

words[1:5]

## [1] "a"        "able"     "about"    "absolute" "accept"

length(sentences)

## [1] 720

sentences[1:5]

## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."

Basic string operations

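You can combine, or “concatenate”, two strings with the stringr command str_c or the similar base R command paste. (Their defaults differ: str_c joins with no separator unless you supply sep, as paste0 does, while paste defaults to sep = " ".)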

second_string <- "Wow, two sentences."
combined_string <- str_c(a_string,second_string,sep = " ")
combined_string

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences."

paste(a_string,second_string,sep = " ")

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences."
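
The two commands agree above only because sep is given explicitly. A short sketch of where their defaults part ways, using throwaway inputs:

```r
library(stringr)

# Default separators differ: str_c() and paste0() use "", paste() uses " ".
str_c("a", "b")    # "ab"
paste0("a", "b")   # "ab"
paste("a", "b")    # "a b"

# Missing values differ too: str_c() propagates NA, paste() turns it into text.
str_c("a", NA)     # NA
paste("a", NA)     # "a NA"
```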

You can zip together vectors of strings.

str_c(month.abb, month.name, sep=" stands for ")

##  [1] "Jan stands for January"   "Feb stands for February" 
##  [3] "Mar stands for March"     "Apr stands for April"    
##  [5] "May stands for May"       "Jun stands for June"     
##  [7] "Jul stands for July"      "Aug stands for August"   
##  [9] "Sep stands for September" "Oct stands for October"  
## [11] "Nov stands for November"  "Dec stands for December"

You can concatenate all the strings in a vector together with the collapse parameter.

str_c(month.name, collapse=" then ")

## [1] "January then February then March then April then May then June then July then August then September then October then November then December"

Or both.

str_c(letters,LETTERS, sep="", collapse=",")

## [1] "aA,bB,cC,dD,eE,fF,gG,hH,iI,jJ,kK,lL,mM,nN,oO,pP,qQ,rR,sS,tT,uU,vV,wW,xX,yY,zZ"

Or more than two vectors or constants.

str_c(month.name," (", month.abb, ")", sep="", collapse=" then ")

## [1] "January (Jan) then February (Feb) then March (Mar) then April (Apr) then May (May) then June (Jun) then July (Jul) then August (Aug) then September (Sep) then October (Oct) then November (Nov) then December (Dec)"

You can split a string up using str_split. The similar base R command is strsplit.

str_split(combined_string,"!") # split on !

## [[1]]
## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] " Wow, two sentences."

str_split(combined_string,",") # split on ,

## [[1]]
## [1] "Example STRING"           " with numbers (12"       
## [3] " 15 and also 10.2)?! Wow" " two sentences."

strsplit(combined_string,"!") # base R

## [[1]]
## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] " Wow, two sentences."

str_split returns a list of character vectors. With simplify = TRUE, it returns a character matrix (with one row and two columns).

str_split(combined_string,"!",simplify=TRUE)

##      [,1]                                                  
## [1,] "Example STRING, with numbers (12, 15 and also 10.2)?"
##      [,2]                  
## [1,] " Wow, two sentences."

So, in this case you could get that as one vector a number of ways.

str_split(combined_string,"!")[[1]] #give first element of list

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] " Wow, two sentences."

str_split(combined_string,"!",simplify=TRUE)[1,] # give first row of matrix

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] " Wow, two sentences."

unlist(str_split(combined_string,"!")) #turn list into vector

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] " Wow, two sentences."
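
The list structure makes more sense once there is more than one input string. A small sketch with made-up inputs (parts and parts_matrix are illustrative names):

```r
library(stringr)

# One list element per input string, each holding that string's pieces
parts <- str_split(c("a-b", "c-d-e"), "-")
length(parts)      # 2
lengths(parts)     # 2 3

# simplify = TRUE returns one matrix: one row per input string,
# with as many columns as the longest split needs
parts_matrix <- str_split(c("a-b", "c-d-e"), "-", simplify = TRUE)
dim(parts_matrix)  # 2 3
```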

Substrings

You can use the str_sub command to identify a substring at a known position in a string.

str_sub(fruit,2,4) # 2nd through 4th character of each fruit

##  [1] "ppl" "pri" "voc" "ana" "ell" "ilb" "lac" "lac" "loo" "lue" "oys" "rea"
## [13] "ana" "ant" "her" "her" "hil" "lem" "lou" "oco" "ran" "ucu" "urr" "ams"
## [25] "ate" "rag" "uri" "ggp" "lde" "eij" "ig"  "oji" "oos" "rap" "rap" "uav"
## [37] "one" "uck" "ack" "amb" "uju" "iwi" "umq" "emo" "ime" "oqu" "ych" "and"
## [49] "ang" "ulb" "ect" "ut"  "liv" "ran" "ame" "apa" "ass" "eac" "ear" "ers"
## [61] "hys" "ine" "lum" "ome" "ome" "urp" "uin" "ais" "amb" "asp" "edc" "ock"
## [73] "ala" "ats" "tar" "tra" "ama" "ang" "gli" "ate"

str_sub(fruit,-2,-1) # Last two characters of each fruit

##  [1] "le" "ot" "do" "na" "er" "ry" "ry" "nt" "ge" "ry" "ry" "it" "on" "pe" "ya"
## [16] "ry" "er" "ne" "ry" "ut" "ry" "er" "nt" "on" "te" "it" "an" "nt" "ry" "oa"
## [31] "ig" "ry" "ry" "pe" "it" "va" "ew" "ry" "it" "ul" "be" "it" "at" "on" "me"
## [46] "at" "ee" "ne" "go" "ry" "ne" "ut" "ve" "ge" "lo" "ya" "it" "ch" "ar" "on"
## [61] "is" "le" "um" "te" "lo" "en" "ce" "in" "an" "ry" "nt" "on" "ry" "ma" "it"
## [76] "ry" "lo" "ne" "it" "on"

some_dates <- c("1999/01/01","1998/12/15","2001/09/03")
str_sub(some_dates,1,4)

## [1] "1999" "1998" "2001"

str_sub(some_dates,6,7)

## [1] "01" "12" "09"
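
Because str_sub returns character vectors, the pieces usually need converting before they can be used as numbers. A minimal sketch, assuming every date has the same fixed-width layout (date_parts is an illustrative name):

```r
library(stringr)

some_dates <- c("1999/01/01", "1998/12/15", "2001/09/03")

# Fixed-width fields: year in 1-4, month in 6-7, day in 9-10
date_parts <- data.frame(
  year  = as.integer(str_sub(some_dates, 1, 4)),
  month = as.integer(str_sub(some_dates, 6, 7)),
  day   = as.integer(str_sub(some_dates, 9, 10))
)
date_parts
```

This only works when positions are known in advance; the regular expressions below handle fields of varying width.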

You can use the str_sub command to change a substring at a known position in a string.

zebra_fruit <- fruit
str_sub(zebra_fruit,2,3) <- "--ZEBRA!!--"
zebra_fruit[1:10]

##  [1] "a--ZEBRA!!--le"        "a--ZEBRA!!--icot"      "a--ZEBRA!!--cado"     
##  [4] "b--ZEBRA!!--ana"       "b--ZEBRA!!--l pepper"  "b--ZEBRA!!--berry"    
##  [7] "b--ZEBRA!!--ckberry"   "b--ZEBRA!!--ckcurrant" "b--ZEBRA!!--od orange"
## [10] "b--ZEBRA!!--eberry"

Cleaning and normalizing strings

You can do case-folding with str_to_lower (or str_to_upper). The base R commands are tolower and toupper.

str_to_lower(combined_string)

## [1] "example string, with numbers (12, 15 and also 10.2)?! wow, two sentences."

You can trim excess whitespace off the ends of strings with str_trim.

a_string_vector <- str_split(combined_string,"!")[[1]]
a_string_vector

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] " Wow, two sentences."

str_trim(a_string_vector)

## [1] "Example STRING, with numbers (12, 15 and also 10.2)?"
## [2] "Wow, two sentences."
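
str_trim only touches the ends of a string. As a sketch of the related options, the side argument restricts trimming to one end, and str_squish also collapses runs of whitespace inside the string (messy is a made-up example):

```r
library(stringr)

messy <- "  too   many   spaces  "

str_trim(messy)                 # "too   many   spaces"
str_trim(messy, side = "left")  # "too   many   spaces  "
str_squish(messy)               # "too many spaces"
```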

Searching for a pattern

What are the locations of the fruits with “berry” in their name?

str_which(fruit,"berry")

##  [1]  6  7 10 11 19 21 29 32 33 38 50 70 73 76

We could get the same answer in this instance from base R’s grep, but the syntax is different.

grep("berry", fruit)

##  [1]  6  7 10 11 19 21 29 32 33 38 50 70 73 76

For each fruit, does it contain “berry”?

str_detect(fruit,"berry")

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [37] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [73]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

We could get the same answer in this instance from base R’s grepl, but the syntax is like that of grep.

grepl("berry", fruit)

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [37] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [73]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
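
A logical vector like this is most useful as input to something else. A brief sketch on a small made-up vector (some_fruit and has_berry are illustrative names):

```r
library(stringr)

some_fruit <- c("blueberry", "apple", "cranberry", "fig")
has_berry  <- str_detect(some_fruit, "berry")

sum(has_berry)          # 2: how many strings match
mean(has_berry)         # 0.5: the share of strings that match
some_fruit[!has_berry]  # "apple" "fig": keep the non-matches
```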

How many matches of “berry” does each fruit have?

str_count(fruit,"berry")

##  [1] 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1
## [39] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1
## [77] 0 0 0 0

str_count(fruit,"a")

##  [1] 1 1 2 3 0 0 1 2 1 0 0 1 2 2 1 0 0 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 1 1 2 0 0
## [39] 1 1 0 0 1 0 0 1 0 2 1 0 1 0 0 1 1 3 1 1 1 0 1 1 0 2 0 1 0 1 2 1 1 0 2 2 1 1
## [77] 2 1 0 1

Where is the substring “berry” located in each fruit string?

str_locate(fruit[1:10],"berry")

##       start end
##  [1,]    NA  NA
##  [2,]    NA  NA
##  [3,]    NA  NA
##  [4,]    NA  NA
##  [5,]    NA  NA
##  [6,]     4   8
##  [7,]     6  10
##  [8,]    NA  NA
##  [9,]    NA  NA
## [10,]     5   9

(We can get equivalent information, in a very different format, from the base R command regexpr. It’s nowhere near as intuitive, though: it returns a vector of start positions, with the lengths of the matches stored in an attribute.)

regexpr_obj <-  regexpr("berry",fruit[1:10]) 
regexpr_obj  # The full object

##  [1] -1 -1 -1 -1 -1  4  6 -1 -1  5
## attr(,"match.length")
##  [1] -1 -1 -1 -1 -1  5  5 -1 -1  5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

regexpr_obj[1:10] # The values of the object itself, the starting positions

##  [1] -1 -1 -1 -1 -1  4  6 -1 -1  5

attr(regexpr_obj,"match.length") # The match.length attribute

##  [1] -1 -1 -1 -1 -1  5  5 -1 -1  5

List fruits that have “berry” in their name.

str_subset(fruit,"berry")

##  [1] "bilberry"    "blackberry"  "blueberry"   "boysenberry" "cloudberry" 
##  [6] "cranberry"   "elderberry"  "goji berry"  "gooseberry"  "huckleberry"
## [11] "mulberry"    "raspberry"   "salal berry" "strawberry"

The base R equivalent is grep with value=TRUE:

grep("berry",fruit, value=TRUE)

##  [1] "bilberry"    "blackberry"  "blueberry"   "boysenberry" "cloudberry" 
##  [6] "cranberry"   "elderberry"  "goji berry"  "gooseberry"  "huckleberry"
## [11] "mulberry"    "raspberry"   "salal berry" "strawberry"

For each fruit, give me the first substring that matches “berry.”

str_extract(fruit,"berry")

##  [1] NA      NA      NA      NA      NA      "berry" "berry" NA      NA     
## [10] "berry" "berry" NA      NA      NA      NA      NA      NA      NA     
## [19] "berry" NA      "berry" NA      NA      NA      NA      NA      NA     
## [28] NA      "berry" NA      NA      "berry" "berry" NA      NA      NA     
## [37] NA      "berry" NA      NA      NA      NA      NA      NA      NA     
## [46] NA      NA      NA      NA      "berry" NA      NA      NA      NA     
## [55] NA      NA      NA      NA      NA      NA      NA      NA      NA     
## [64] NA      NA      NA      NA      NA      NA      "berry" NA      NA     
## [73] "berry" NA      NA      "berry" NA      NA      NA      NA

In this instance, we get the same answer in matrix form from str_match:

str_match(fruit[1:10],"berry")

##       [,1]   
##  [1,] NA     
##  [2,] NA     
##  [3,] NA     
##  [4,] NA     
##  [5,] NA     
##  [6,] "berry"
##  [7,] "berry"
##  [8,] NA     
##  [9,] NA     
## [10,] "berry"

str_match is mainly helpful when we want to match multiple things or use a larger pattern to isolate smaller pieces. We’ll see examples below.

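To get a visual for where your matches are occurring, you can use str_view_all. (You will see something in the RStudio Viewer. In stringr 1.5.0 and later, str_view_all is deprecated: str_view highlights every match and prints to the console by default.)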

str_view_all(fruit,"berry")

For every fruit with “berry” in the name, change “berry” to “fish”.

str_replace(fruit[1:10],"berry", "fish")

##  [1] "apple"        "apricot"      "avocado"      "banana"       "bell pepper" 
##  [6] "bilfish"      "blackfish"    "blackcurrant" "blood orange" "bluefish"

str_replace replaces the first pattern match in each string; str_replace_all replaces all pattern matches in each string.

str_replace(fruit[1:10],"a", "ZZ")

##  [1] "ZZpple"        "ZZpricot"      "ZZvocado"      "bZZnana"      
##  [5] "bell pepper"   "bilberry"      "blZZckberry"   "blZZckcurrant"
##  [9] "blood orZZnge" "blueberry"

str_replace_all(fruit[1:10],"a", "ZZ")

##  [1] "ZZpple"         "ZZpricot"       "ZZvocZZdo"      "bZZnZZnZZ"     
##  [5] "bell pepper"    "bilberry"       "blZZckberry"    "blZZckcurrZZnt"
##  [9] "blood orZZnge"  "blueberry"
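
When there are several substitutions to make, str_replace_all also accepts a named character vector, with names as patterns and values as replacements. A small sketch with invented replacements:

```r
library(stringr)

# names are the patterns, values are what replaces them
replacements <- c("berry" = "fish", "apple" = "pear")

swapped <- str_replace_all(c("blueberry", "apple", "pineapple"), replacements)
swapped  # "bluefish" "pear" "pinepear"
```

Note that "pineapple" is changed too: the pattern matches anywhere in the string, not only whole words.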

Regular expressions

So far, I’ve only searched for patterns made up of alphabetic characters, like "berry". But we can make much more elaborate and flexible patterns using regular expressions.

Regular expressions come in a variety of flavors and R has a somewhat unusual one. I recommend you reference the cheat sheet and the online regex tool https://regex101.com in parallel.

This or that, not this or that, this or that or anything in between

Square brackets for “or” (disjunction) of characters

Match “any one of” the characters in the square brackets.

str_subset(sentences, ' [bhp]eat ')

## [1] "The heart beat strongly and with firm strokes."
## [2] "Burn peat after the logs give out."            
## [3] "Feel the heat of the weak dying flame."        
## [4] "A speedy man can beat this track mark."        
## [5] "Even the worst will beat his low score."       
## [6] "It takes heat to bring out the odor."

Square brackets with `^` for negation.

Match “anything but one of” the characters in the square brackets.

str_subset(sentences, ' [^bhp]eat ')

## [1] "Pack the records in a neat thin case."
## [2] "A clean neck means a neat collar."

Square brackets with `-` for “or” over a range of characters

str_subset(sentences, ' [b-p]eat ')

## [1] "The heart beat strongly and with firm strokes."
## [2] "Burn peat after the logs give out."            
## [3] "Feel the heat of the weak dying flame."        
## [4] "A speedy man can beat this track mark."        
## [5] "Even the worst will beat his low score."       
## [6] "Pack the records in a neat thin case."         
## [7] "It takes heat to bring out the odor."          
## [8] "A clean neck means a neat collar."

Parentheses and pipe operator for multi-character patterns

When we need an “or” over multi-character patterns, we can use the “pipe” operator, using parentheses as necessary to identify what’s with what.

str_subset(fruit, '(black|blue|red)(currant|berry)')

## [1] "blackberry"   "blackcurrant" "blueberry"    "redcurrant"
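
The parentheses matter because the pipe otherwise splits the entire pattern in two. A quick sketch of the difference, using invented test strings:

```r
library(stringr)

candidates <- c("gray", "grey", "gra", "ey")

# "gr", then "a" or "e", then "y"
grouped <- str_detect(candidates, "gr(a|e)y")
grouped    # TRUE TRUE FALSE FALSE

# without parentheses: either "gra" or "ey", anywhere in the string
ungrouped <- str_detect(candidates, "gra|ey")
ungrouped  # TRUE TRUE TRUE TRUE
```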

The parentheses also define a “capture group”, a concept we’ll explain below.

Special characters and escaping

In addition to the backslash, there are at least 16 characters that have special meaning in R regexes, and (may) have to be escaped in order to match the literal character. They are ^ $ . * + | ! ? ( ) [ ] { } < >.

For example, the period – “.” – means “any character but a newline.” It’s a wildcard. We get different results when we escape or don’t escape.

str_extract_all(combined_string,".")    # any single character

## [[1]]
##  [1] "E" "x" "a" "m" "p" "l" "e" " " "S" "T" "R" "I" "N" "G" "," " " "w" "i" "t"
## [20] "h" " " "n" "u" "m" "b" "e" "r" "s" " " "(" "1" "2" "," " " "1" "5" " " "a"
## [39] "n" "d" " " "a" "l" "s" "o" " " "1" "0" "." "2" ")" "?" "!" " " "W" "o" "w"
## [58] "," " " "t" "w" "o" " " "s" "e" "n" "t" "e" "n" "c" "e" "s" "."

str_extract_all(combined_string,"\\.")  # a period

## [[1]]
## [1] "." "."

str_extract_all(combined_string,"a.")   # "a" followed by any single character

## [[1]]
## [1] "am" "an" "al"

str_extract_all(combined_string,"a\\.") # "a" followed by a period (no match)

## [[1]]
## character(0)

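Some of these are only special characters in certain contexts and don’t have to be escaped to be recognized when not in those contexts. But in stringr patterns they can be escaped in all circumstances, and I recommend that rather than trying to figure out the exact rules. (One caveat if you carry the habit over to base R functions such as grep: with the default engine there, \< and \> match the beginning and end of a word rather than literal angle brackets.)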

The exclamation point is such a character.

str_extract_all(combined_string,"!")

## [[1]]
## [1] "!"

str_extract_all(combined_string,"\\!")

## [[1]]
## [1] "!"
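
If the goal is simply to match literal text, another option is to wrap the pattern in stringr's fixed(), which turns off regular expression interpretation altogether. A short sketch with made-up strings:

```r
library(stringr)

x <- c("a.b", "axb")

str_detect(x, ".")         # TRUE TRUE: wildcard, matches any character
str_detect(x, "\\.")       # TRUE FALSE: escaped, a literal period
str_detect(x, fixed("."))  # TRUE FALSE: fixed() treats the pattern as plain text
```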

Class shorthands: \w \W \s \S \d \D and POSIX classes

Conversely, there are a number of characters that have special meaning only when escaped. The main ones for now are “\w” (any “word” character: a letter, digit, or underscore), “\s” (any whitespace character), and “\d” (any digit). The capitalized versions of these are used to mean “anything but” that class.

str_extract_all(combined_string,"\\w") # any "word" character - letter, digit, or underscore

## [[1]]
##  [1] "E" "x" "a" "m" "p" "l" "e" "S" "T" "R" "I" "N" "G" "w" "i" "t" "h" "n" "u"
## [20] "m" "b" "e" "r" "s" "1" "2" "1" "5" "a" "n" "d" "a" "l" "s" "o" "1" "0" "2"
## [39] "W" "o" "w" "t" "w" "o" "s" "e" "n" "t" "e" "n" "c" "e" "s"

str_extract_all(combined_string,"\\W") # any nonword character

## [[1]]
##  [1] " " "," " " " " " " "(" "," " " " " " " " " "." ")" "?" "!" " " "," " " " "
## [20] "."

str_extract_all(combined_string,"\\s") # any whitespace character

## [[1]]
##  [1] " " " " " " " " " " " " " " " " " " " " " "

str_extract_all(combined_string,"\\S") # any nonspace character

## [[1]]
##  [1] "E" "x" "a" "m" "p" "l" "e" "S" "T" "R" "I" "N" "G" "," "w" "i" "t" "h" "n"
## [20] "u" "m" "b" "e" "r" "s" "(" "1" "2" "," "1" "5" "a" "n" "d" "a" "l" "s" "o"
## [39] "1" "0" "." "2" ")" "?" "!" "W" "o" "w" "," "t" "w" "o" "s" "e" "n" "t" "e"
## [58] "n" "c" "e" "s" "."

str_extract_all(combined_string,"\\d") # any digit

## [[1]]
## [1] "1" "2" "1" "5" "1" "0" "2"

str_extract_all(combined_string,"\\D") # any nondigit character

## [[1]]
##  [1] "E" "x" "a" "m" "p" "l" "e" " " "S" "T" "R" "I" "N" "G" "," " " "w" "i" "t"
## [20] "h" " " "n" "u" "m" "b" "e" "r" "s" " " "(" "," " " " " "a" "n" "d" " " "a"
## [39] "l" "s" "o" " " "." ")" "?" "!" " " "W" "o" "w" "," " " "t" "w" "o" " " "s"
## [58] "e" "n" "t" "e" "n" "c" "e" "s" "."
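
One detail that is easy to miss: the underscore counts as a word character, while the hyphen does not. A small sketch with invented identifiers:

```r
library(stringr)

word_runs <- str_extract_all("snake_case var-name", "\\w+")[[1]]
word_runs   # "snake_case" "var" "name"

# letters and digits only: the underscore now splits the match
alnum_runs <- str_extract_all("snake_case", "[[:alnum:]]+")[[1]]
alnum_runs  # "snake" "case"
```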

“POSIX” classes

There are other predefined classes in a computing standard called “POSIX” that some regex engines recognize. The ones for R are listed on the cheat sheet. These can mostly be mimicked with the shorthand listed above. The main one I find handy is “[:punct:]” for “any punctuation character.”

str_extract_all(combined_string,"[:punct:]") # any "punctuation" character

## [[1]]
## [1] "," "(" "," "." ")" "?" "!" "," "."

Note that when characters stray beyond the limited ASCII character set – other languages, specialized characters like emojis – there’s not complete consistency in what may be considered an alphanumeric character or punctuation.
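
If you move between stringr and base R, the doubled-bracket form [[:punct:]] is the safer habit, since base R only recognizes POSIX classes inside a bracket expression. A minimal sketch on a made-up string:

```r
library(stringr)

greeting <- "Hi, there!"

str_remove_all(greeting, "[[:punct:]]")  # "Hi there"
gsub("[[:punct:]]", "", greeting)        # "Hi there" (base R equivalent)
```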

Quantifiers: * + {} ?

Quantifiers: * (zero or more of the previous)

This is also known as the “Kleene star” (pronounced KLAY-nee), after the logician Stephen Kleene, who introduced the notation.

str_extract_all(combined_string,"\\d*") #

## [[1]]
##  [1] ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""  
## [16] ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""  
## [31] "12" ""   ""   "15" ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   "10"
## [46] ""   "2"  ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""  
## [61] ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""

Quantifiers: + (one or more of the previous)

This is also known as the “Kleene plus.”

str_extract_all(combined_string,"\\d+") #

## [[1]]
## [1] "12" "15" "10" "2"

Quantifiers: {n}, {n,m}, {n,}

{n} = “exactly n” of the previous
{n,m} = “between n and m” of the previous
{n,} = “n or more” of the previous

str_extract_all("x xx xxx xxxx xxxxx","x{3}") #

## [[1]]
## [1] "xxx" "xxx" "xxx"

str_extract_all("x xx xxx xxxx xxxxx","x{3,4}") #

## [[1]]
## [1] "xxx"  "xxxx" "xxxx"

str_extract_all("x xx xxx xxxx xxxxx","x{3,}") #

## [[1]]
## [1] "xxx"   "xxxx"  "xxxxx"

Were all of those what you expected? Use str_view_all to see what’s happening.

str_view_all("x xx xxx xxxx xxxxx","x{3}") #
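
The same quantifiers apply to whatever element comes before them, not just a literal character. As a sketch, here they are applied to \d on a line borrowed from the example text used later (header is an illustrative name):

```r
library(stringr)

header <- "SEC. 101. FISCAL YEAR 2017."

str_extract_all(header, "\\d{4}")[[1]]    # "2017"
str_extract_all(header, "\\d{3,4}")[[1]]  # "101" "2017"
str_extract_all(header, "\\d{2}")[[1]]    # "10" "20" "17"
```

The last result has the same cause as the "xxx" surprises above: matches are found left to right and do not overlap, so longer runs get chopped into pieces of exactly the requested length.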

Question Mark as Quantifier (zero or one of the previous)

str_extract_all(combined_string,"\\d?") # 0 or 1 digit

## [[1]]
##  [1] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [20] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "1" "2" ""  ""  "1" "5" ""  "" 
## [39] ""  ""  ""  ""  ""  ""  ""  ""  "1" "0" ""  "2" ""  ""  ""  ""  ""  ""  "" 
## [58] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""

str_subset(sentences," [bp]?eat")

## [1] "The heart beat strongly and with firm strokes."
## [2] "Burn peat after the logs give out."            
## [3] "A speedy man can beat this track mark."        
## [4] "Even the worst will beat his low score."       
## [5] "Quench your thirst, then eat the crackers."    
## [6] "Oats are a food eaten by horse and man."

Greedy vs. non-greedy matching

Question Mark as Nongreedy Modifier to Quantifier - smallest possible match of the previous

str_extract_all("(First bracketed statement) Other text (Second bracketed statement)","\\(.+\\)") # greedy - captures from first ( to last )

## [[1]]
## [1] "(First bracketed statement) Other text (Second bracketed statement)"

str_extract_all("(First bracketed statement) Other text (Second bracketed statement)","\\(.+?\\)") # nongreedy - finds two smaller matches

## [[1]]
## [1] "(First bracketed statement)"  "(Second bracketed statement)"

str_extract_all("x xx xxx xxxx xxxxx","x.+x") # defaults to greedy - largest match

## [[1]]
## [1] "x xx xxx xxxx xxxxx"

str_extract_all("x xx xxx xxxx xxxxx","x.+?x") # nongreedy - not what you expect - why?

## [[1]]
## [1] "x x"  "x x"  "xx x" "xxx"  "xxx"
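
One way to work out the answer is to look at where each match starts and ends. The sketch below uses str_locate_all for that (x_runs and lazy_hits are illustrative names):

```r
library(stringr)

x_runs <- "x xx xxx xxxx xxxxx"

# start and end position of every nongreedy match
lazy_hits <- str_locate_all(x_runs, "x.+?x")[[1]]
lazy_hits
# starts: 1 4 7 11 15
# ends:   3 6 10 13 17
```

Two things are going on. The "." happily matches a space, so the first match runs from the lone "x" across the space into the next group. And matches cannot overlap, so each new search begins just after the previous match ended, often in the middle of a run of x's.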

Anchors and word boundaries: ^ $ \b

Anchors: ^ (beginning of string), $ (end of string)

str_extract_all(combined_string,"\\w+") # sequences of alphanumeric characters

## [[1]]
##  [1] "Example"   "STRING"    "with"      "numbers"   "12"        "15"       
##  [7] "and"       "also"      "10"        "2"         "Wow"       "two"      
## [13] "sentences"

str_extract_all(combined_string,"^\\w+") # sequences at beginning of string

## [[1]]
## [1] "Example"

str_extract_all(combined_string,"\\w+$") # sequences at end of string # none - it ends in punctuation

## [[1]]
## character(0)
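
Anchors are just as useful for filtering as for extracting. A short sketch on a made-up vector:

```r
library(stringr)

x <- c("apple", "pineapple", "apricot")

str_subset(x, "^ap")      # "apple" "apricot": starts with "ap"
str_subset(x, "le$")      # "apple" "pineapple": ends with "le"
str_subset(x, "^apple$")  # "apple": the whole string must be "apple"
```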

Word boundaries: `\b`

Similarly, we can identify “word boundaries.” This solves the greedy/nongreedy problem we had with the “x” sequences above. It still treats the decimal point in 10.2 as a word boundary, though.

str_extract_all("x xx xxx xxxx xxxxx","\\bx.*?\\b") #

## [[1]]
## [1] "x"     "xx"    "xxx"   "xxxx"  "xxxxx"

str_extract_all(combined_string,"\\b\\w+?\\b") #

## [[1]]
##  [1] "Example"   "STRING"    "with"      "numbers"   "12"        "15"       
##  [7] "and"       "also"      "10"        "2"         "Wow"       "two"      
## [13] "sentences"
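
If the goal is to keep 10.2 together, one option is to describe a number directly: digits, optionally followed by a period and more digits. This is a sketch of one possible pattern, not the only way to do it:

```r
library(stringr)

a_string <- "Example STRING, with numbers (12, 15 and also 10.2)?!"

# one or more digits, then optionally a literal period and more digits
numbers <- str_extract_all(a_string, "\\d+(\\.\\d+)?")[[1]]
numbers  # "12" "15" "10.2"
```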

Capture groups

We’ve seen parentheses used with the pipe operator. They are also used to indicate smaller parts of the pattern that we want to “capture.” str_match will give us a matrix in which the first column is the match to the entire pattern, what we’ve seen before. Each subsequent column holds the part of the match in each pair of parentheses.

str_match(fruit[1:15],"^(.+?)(berry|fruit)$")

##       [,1]          [,2]     [,3]   
##  [1,] NA            NA       NA     
##  [2,] NA            NA       NA     
##  [3,] NA            NA       NA     
##  [4,] NA            NA       NA     
##  [5,] NA            NA       NA     
##  [6,] "bilberry"    "bil"    "berry"
##  [7,] "blackberry"  "black"  "berry"
##  [8,] NA            NA       NA     
##  [9,] NA            NA       NA     
## [10,] "blueberry"   "blue"   "berry"
## [11,] "boysenberry" "boysen" "berry"
## [12,] "breadfruit"  "bread"  "fruit"
## [13,] NA            NA       NA     
## [14,] NA            NA       NA     
## [15,] NA            NA       NA
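
The same idea pulls apart structured strings such as dates without relying on fixed positions. A minimal sketch, assuming year/month/day input (the third element is deliberately not a date, and m is an illustrative name):

```r
library(stringr)

some_dates <- c("1999/01/01", "1998/12/15", "not a date")

# column 1 is the full match; columns 2-4 are year, month, day
m <- str_match(some_dates, "^(\\d{4})/(\\d{2})/(\\d{2})$")
m

as.integer(m[, 2])  # 1999 1998 NA
```

Strings that do not fit the pattern produce a row of NA, which makes malformed input easy to spot.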

We also can use \\1, \\2, etc. to refer to these capture groups later in the same command.

Here’s an actual regular expression I use in cleaning the Mood of the Nation poll answers. I later will let punctuation like ' indicate a word boundary, so first, I want to collapse contractions across the ' to keep them together. This, for example, collapses any n't contractions.

motn <- "i can't stand don'trump supporters shouting 'build that wall'!"
newmotn <- str_replace_all(motn,"(n't)($|[[:punct:]]|\\s)","nt\\2") #dont, cant, wont, wasnt, werent, didnt, couldnt, wouldnt, shouldnt, havent
newmotn

## [1] "i cant stand don'trump supporters shouting 'build that wall'!"

That looks for the (1) n't pattern followed by the (2) end of the string, another punctuation mark, or a whitespace. It then replaces that with nt followed by whatever the following character was. This avoids replacing other accidental instances of the n't pattern that aren’t clearly contractions.
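
Two smaller sketches of the same mechanism, on invented inputs: a backreference in the replacement reorders the captured pieces, and a backreference inside the pattern itself matches a repeat of whatever the group captured.

```r
library(stringr)

# in the replacement: rearrange year/month/day into month-day-year
reordered <- str_replace("1999/01/15", "(\\d{4})/(\\d{2})/(\\d{2})", "\\2-\\3-\\1")
reordered  # "01-15-1999"

# in the pattern: any character followed immediately by itself
doubled <- str_subset(c("apple", "banana", "cherry"), "(.)\\1")
doubled    # "apple" "cherry"
```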

Using regular expressions to extract data from text: an example

Let’s start with some example text:

text <- "SEC. 101. FISCAL YEAR 2017.
(a) In General.--There are authorized to be appropriated to NASA
for fiscal year 2017 $19,508,000,000, as follows:
(1) For Exploration, $4,330,000,000.
(2) For Space Operations, $5,023,000,000.
(3) For Science, $5,500,000,000.
(4) For Aeronautics, $640,000,000.
(5) For Space Technology, $686,000,000.
(6) For Education, $115,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(8) For Construction and Environmental Compliance and
Restoration, $388,000,000.
(9) For Inspector General, $37,400,000.
(b) Exception.--In addition to the amounts authorized to be
appropriated for each account under subsection (a), there are
authorized to be appropriated additional funds for each such account,
but only if the authorized amounts for all such accounts are fully
provided for in annual appropriation Acts, consistent with the
discretionary spending limits in section 251(c) of the Balanced Budget
and Emergency Deficit Control Act of 1985."

Wait … that’s just one variable holding one string? Yep.

text

## [1] "SEC. 101. FISCAL YEAR 2017.\n(a) In General.--There are authorized to be appropriated to NASA\nfor fiscal year 2017 $19,508,000,000, as follows:\n(1) For Exploration, $4,330,000,000.\n(2) For Space Operations, $5,023,000,000.\n(3) For Science, $5,500,000,000.\n(4) For Aeronautics, $640,000,000.\n(5) For Space Technology, $686,000,000.\n(6) For Education, $115,000,000.\n(7) For Safety, Security, and Mission Services,\n$2,788,600,000.\n(8) For Construction and Environmental Compliance and\nRestoration, $388,000,000.\n(9) For Inspector General, $37,400,000.\n(b) Exception.--In addition to the amounts authorized to be\nappropriated for each account under subsection (a), there are\nauthorized to be appropriated additional funds for each such account,\nbut only if the authorized amounts for all such accounts are fully\nprovided for in annual appropriation Acts, consistent with the\ndiscretionary spending limits in section 251(c) of the Balanced Budget\nand Emergency Deficit Control Act of 1985."

All those \ns there indicate new lines.
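To see the difference between the escaped form R prints and the text it represents, here is a minimal sketch using a made-up two-line string (`two_lines` is just an illustrative name, not part of the example above):

```r
# "\n" is a single newline character, not a backslash followed by "n".
two_lines <- "SEC. 101.\n(a) In General."

print(two_lines)       # displays the escaped form, with \n visible
writeLines(two_lines)  # writes the text itself, so the newline becomes a line break

# The newline counts as exactly one character.
n_chars <- nchar(two_lines)
n_chars
```

`writeLines()` (or `cat()`) is usually the quickest way to check what a string with escapes actually looks like.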

We’re going to try to use regular expressions to make data out of the appropriations dollars and purposes in bullets 1-9.

Let’s play around with a few things. First, extract all contiguous sequences of one or more digits.

stringr::str_extract_all(text,"[0-9]+")[[1]]

##  [1] "101"  "2017" "2017" "19"   "508"  "000"  "000"  "1"    "4"    "330" 
## [11] "000"  "000"  "2"    "5"    "023"  "000"  "000"  "3"    "5"    "500" 
## [21] "000"  "000"  "4"    "640"  "000"  "000"  "5"    "686"  "000"  "000" 
## [31] "6"    "115"  "000"  "000"  "7"    "2"    "788"  "600"  "000"  "8"   
## [41] "388"  "000"  "000"  "9"    "37"   "400"  "000"  "251"  "1985"

That does two things we don’t like: it splits numbers at the thousands-separator commas, and it picks up numbers (“101”, “2017”, etc.) that aren’t dollar amounts. So, let’s try getting everything that starts with a “$” (which needs to be escaped), followed by one or more commas or digits.

stringr::str_extract_all(text,"\\$[,0-9]+")[[1]] # must start with $

##  [1] "$19,508,000,000," "$4,330,000,000"   "$5,023,000,000"   "$5,500,000,000"  
##  [5] "$640,000,000"     "$686,000,000"     "$115,000,000"     "$2,788,600,000"  
##  [9] "$388,000,000"     "$37,400,000"

Almost … we don’t like that extra comma on the first number. Add the requirement that the match ends with a digit.

stringr::str_extract_all(text,"\\$[,0-9]+[0-9]")[[1]]

##  [1] "$19,508,000,000" "$4,330,000,000"  "$5,023,000,000"  "$5,500,000,000" 
##  [5] "$640,000,000"    "$686,000,000"    "$115,000,000"    "$2,788,600,000" 
##  [9] "$388,000,000"    "$37,400,000"
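The pattern above is permissive: it would also accept oddities like “$,,,5”. If that matters for your text, one possible tightening (a sketch, not the approach the rest of this example uses) is to require groups of exactly three digits after each comma. The name `strict_amount` is just for illustration.

```r
library(stringr)

line <- "for fiscal year 2017 $19,508,000,000, as follows:"

# "$", then 1-3 digits, then zero or more groups of (comma + exactly three digits).
strict_amount <- "\\$[0-9]{1,3}(,[0-9]{3})*"

strict_hit <- str_extract(line, strict_amount)
strict_hit

# A malformed amount matches the permissive pattern but not the strict one.
loose_odd  <- str_extract("$,,,5", "\\$[,0-9]+[0-9]")
strict_odd <- str_extract("$,,,5", strict_amount)
```

The trade-off is that the stricter pattern silently skips amounts written without separators beyond three digits (say, “$5000”), matching only their first three digits.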

We could use quantifiers to get amounts of $1 billion or more:

stringr::str_extract_all(text,"\\$[,0-9]{12,}[0-9]")[[1]]

## [1] "$19,508,000,000" "$4,330,000,000"  "$5,023,000,000"  "$5,500,000,000" 
## [5] "$2,788,600,000"

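That asks for a match that starts with a “$”, is followed by 12 or more commas and digits, and ends with a digit. It works because “1,000,000,000” is 13 characters long (10 digits and 3 commas), so any amount with at least that many characters after the “$” is $1 billion or more.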
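The character counting behind that quantifier can be checked directly. A small sketch, using three made-up amounts on either side of the threshold:

```r
library(stringr)

amounts <- c("$999,999,999", "$1,000,000,000", "$19,508,000,000")

# Characters after the "$": digits plus commas.
chars_after_dollar <- nchar(amounts) - 1
chars_after_dollar

# 12 or more commas/digits plus a final digit means at least 13 characters.
is_billion_plus <- str_detect(amounts, "\\$[,0-9]{12,}[0-9]")
is_billion_plus
```

Note that this relies on the amounts being consistently comma-formatted; it counts characters, it does not compare values.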

Now let’s try to get the bullet numbers enclosed in parentheses:

stringr::str_extract_all(text,"\\([0-9]\\)")[[1]]

## [1] "(1)" "(2)" "(3)" "(4)" "(5)" "(6)" "(7)" "(8)" "(9)"
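That pattern only allows a single digit between the parentheses, which is enough here because the list stops at (9). For a longer list you would probably want a `+` quantifier. A sketch on a made-up string:

```r
library(stringr)

x <- "items (1), (10), (a) and section 251(c)"

# Exactly one digit inside the parentheses: "(10)" is missed.
one_digit <- str_extract_all(x, "\\([0-9]\\)")[[1]]
one_digit

# One or more digits: picks up "(10)" too; "(a)" and "(c)" are still excluded.
any_digits <- str_extract_all(x, "\\([0-9]+\\)")[[1]]
any_digits
```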

Say we only want to match lines that start with a particular set of characters … First let’s split it into lines:

text_split <- stringr::str_split(text,"\\n")[[1]]
text_split

##  [1] "SEC. 101. FISCAL YEAR 2017."                                           
##  [2] "(a) In General.--There are authorized to be appropriated to NASA"      
##  [3] "for fiscal year 2017 $19,508,000,000, as follows:"                     
##  [4] "(1) For Exploration, $4,330,000,000."                                  
##  [5] "(2) For Space Operations, $5,023,000,000."                             
##  [6] "(3) For Science, $5,500,000,000."                                      
##  [7] "(4) For Aeronautics, $640,000,000."                                    
##  [8] "(5) For Space Technology, $686,000,000."                               
##  [9] "(6) For Education, $115,000,000."                                      
## [10] "(7) For Safety, Security, and Mission Services,"                       
## [11] "$2,788,600,000."                                                       
## [12] "(8) For Construction and Environmental Compliance and"                 
## [13] "Restoration, $388,000,000."                                            
## [14] "(9) For Inspector General, $37,400,000."                               
## [15] "(b) Exception.--In addition to the amounts authorized to be"           
## [16] "appropriated for each account under subsection (a), there are"         
## [17] "authorized to be appropriated additional funds for each such account," 
## [18] "but only if the authorized amounts for all such accounts are fully"    
## [19] "provided for in annual appropriation Acts, consistent with the"        
## [20] "discretionary spending limits in section 251(c) of the Balanced Budget"
## [21] "and Emergency Deficit Control Act of 1985."

Now match on the beginning-of-string anchor followed by an open parenthesis.

stringr::str_extract_all(text_split,"^\\(.*")

## [[1]]
## character(0)
## 
## [[2]]
## [1] "(a) In General.--There are authorized to be appropriated to NASA"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "(1) For Exploration, $4,330,000,000."
## 
## [[5]]
## [1] "(2) For Space Operations, $5,023,000,000."
## 
## [[6]]
## [1] "(3) For Science, $5,500,000,000."
## 
## [[7]]
## [1] "(4) For Aeronautics, $640,000,000."
## 
## [[8]]
## [1] "(5) For Space Technology, $686,000,000."
## 
## [[9]]
## [1] "(6) For Education, $115,000,000."
## 
## [[10]]
## [1] "(7) For Safety, Security, and Mission Services,"
## 
## [[11]]
## character(0)
## 
## [[12]]
## [1] "(8) For Construction and Environmental Compliance and"
## 
## [[13]]
## character(0)
## 
## [[14]]
## [1] "(9) For Inspector General, $37,400,000."
## 
## [[15]]
## [1] "(b) Exception.--In addition to the amounts authorized to be"
## 
## [[16]]
## character(0)
## 
## [[17]]
## character(0)
## 
## [[18]]
## character(0)
## 
## [[19]]
## character(0)
## 
## [[20]]
## character(0)
## 
## [[21]]
## character(0)

That returned a list, and we’d probably rather have a vector. In this case, we can just wrap this in an unlist() statement:

unlist(stringr::str_extract_all(text_split,"^\\(.*"))

##  [1] "(a) In General.--There are authorized to be appropriated to NASA"
##  [2] "(1) For Exploration, $4,330,000,000."                            
##  [3] "(2) For Space Operations, $5,023,000,000."                       
##  [4] "(3) For Science, $5,500,000,000."                                
##  [5] "(4) For Aeronautics, $640,000,000."                              
##  [6] "(5) For Space Technology, $686,000,000."                         
##  [7] "(6) For Education, $115,000,000."                                
##  [8] "(7) For Safety, Security, and Mission Services,"                 
##  [9] "(8) For Construction and Environmental Compliance and"           
## [10] "(9) For Inspector General, $37,400,000."                         
## [11] "(b) Exception.--In addition to the amounts authorized to be"
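If all you want is the subset of lines that match, `str_subset()` gets there without the intermediate list. A sketch on a shortened, hand-copied version of the lines (the vector `some_lines` is illustrative):

```r
library(stringr)

some_lines <- c(
  "SEC. 101. FISCAL YEAR 2017.",
  "(1) For Exploration, $4,330,000,000.",
  "Restoration, $388,000,000.",
  "(b) Exception.--In addition to the amounts authorized to be"
)

# Keep whole elements that start with an open parenthesis.
paren_lines <- str_subset(some_lines, "^\\(")
paren_lines

# Narrow further to numbered items only, e.g. "(1)" but not "(b)".
numbered_lines <- str_subset(some_lines, "^\\(\\d\\)")
numbered_lines
```

`str_subset()` returns the entire matching element, so the trailing `.*` is not needed.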

So, now let’s try to put everything we’ve learned together and make a little dataset out of items (1) to (9), with the dollar amounts and what they’re for.

Let’s see what we have from that last command. We have some extra lines, “(a)” and “(b)”, and we’re missing the dollar amounts from items (7) and (8), which are on the following lines.

Let’s go back to the original and get rid of the newlines:

one_line <- stringr::str_replace_all(text,"\\n"," ")[[1]]
one_line

## [1] "SEC. 101. FISCAL YEAR 2017. (a) In General.--There are authorized to be appropriated to NASA for fiscal year 2017 $19,508,000,000, as follows: (1) For Exploration, $4,330,000,000. (2) For Space Operations, $5,023,000,000. (3) For Science, $5,500,000,000. (4) For Aeronautics, $640,000,000. (5) For Space Technology, $686,000,000. (6) For Education, $115,000,000. (7) For Safety, Security, and Mission Services, $2,788,600,000. (8) For Construction and Environmental Compliance and Restoration, $388,000,000. (9) For Inspector General, $37,400,000. (b) Exception.--In addition to the amounts authorized to be appropriated for each account under subsection (a), there are authorized to be appropriated additional funds for each such account, but only if the authorized amounts for all such accounts are fully provided for in annual appropriation Acts, consistent with the discretionary spending limits in section 251(c) of the Balanced Budget and Emergency Deficit Control Act of 1985."

Now find all the matches from “(number)” to a period, lazily rather than greedily:

item_strings <- stringr::str_extract_all(one_line,"\\(\\d\\).+?\\.")[[1]]
item_strings

## [1] "(1) For Exploration, $4,330,000,000."                                            
## [2] "(2) For Space Operations, $5,023,000,000."                                       
## [3] "(3) For Science, $5,500,000,000."                                                
## [4] "(4) For Aeronautics, $640,000,000."                                              
## [5] "(5) For Space Technology, $686,000,000."                                         
## [6] "(6) For Education, $115,000,000."                                                
## [7] "(7) For Safety, Security, and Mission Services, $2,788,600,000."                 
## [8] "(8) For Construction and Environmental Compliance and Restoration, $388,000,000."
## [9] "(9) For Inspector General, $37,400,000."
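To see why the lazy `?` matters there, compare the two versions on a short made-up string with two items:

```r
library(stringr)

x <- "(1) For Science, $5. (2) For Education, $1."

# Greedy: .+ runs to the LAST period, so both items come back as one match.
greedy <- str_extract_all(x, "\\(\\d\\).+\\.")[[1]]
greedy

# Lazy: .+? stops at the FIRST period it can, giving one match per item.
lazy <- str_extract_all(x, "\\(\\d\\).+?\\.")[[1]]
lazy

# Caveat: "first period" includes decimal points, so an amount like $10.5 is cut short.
cut_short <- str_extract("(1) For Science, $10.5.", "\\(\\d\\).+?\\.")
cut_short
```

The lazy pattern works for this bill only because none of the item amounts contain a decimal point.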

We can use str_match, with parentheses marking capture groups, to pull out just the parts we want:

for_strings <- stringr::str_match(item_strings,"For (.+), \\$")
for_strings

##       [,1]                                                              
##  [1,] "For Exploration, $"                                              
##  [2,] "For Space Operations, $"                                         
##  [3,] "For Science, $"                                                  
##  [4,] "For Aeronautics, $"                                              
##  [5,] "For Space Technology, $"                                         
##  [6,] "For Education, $"                                                
##  [7,] "For Safety, Security, and Mission Services, $"                   
##  [8,] "For Construction and Environmental Compliance and Restoration, $"
##  [9,] "For Inspector General, $"                                        
##       [,2]                                                       
##  [1,] "Exploration"                                              
##  [2,] "Space Operations"                                         
##  [3,] "Science"                                                  
##  [4,] "Aeronautics"                                              
##  [5,] "Space Technology"                                         
##  [6,] "Education"                                                
##  [7,] "Safety, Security, and Mission Services"                   
##  [8,] "Construction and Environmental Compliance and Restoration"
##  [9,] "Inspector General"

The second column contains our list of the “for what”s.

for_strings <- for_strings[,2]
for_strings

## [1] "Exploration"                                              
## [2] "Space Operations"                                         
## [3] "Science"                                                  
## [4] "Aeronautics"                                              
## [5] "Space Technology"                                         
## [6] "Education"                                                
## [7] "Safety, Security, and Mission Services"                   
## [8] "Construction and Environmental Compliance and Restoration"
## [9] "Inspector General"

Do something similar for the money:

money_strings <- stringr::str_match(item_strings,"\\$([,\\d]+)")[,2]
money_strings

## [1] "4,330,000,000" "5,023,000,000" "5,500,000,000" "640,000,000"  
## [5] "686,000,000"   "115,000,000"   "2,788,600,000" "388,000,000"  
## [9] "37,400,000"
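As an aside, the two `str_match` calls could be folded into one pattern with several capture groups, one column per group. This is only a sketch of that alternative, run on two of the item strings copied from above; `items` and `parts` are illustrative names.

```r
library(stringr)

items <- c(
  "(1) For Exploration, $4,330,000,000.",
  "(7) For Safety, Security, and Mission Services, $2,788,600,000."
)

# Group 1: item number; group 2: purpose; group 3: amount (without the "$").
parts <- str_match(items, "^\\((\\d)\\) For (.+), \\$([,\\d]+)\\.$")

item_number <- parts[, 2]
purpose     <- parts[, 3]
amount_text <- parts[, 4]

item_number
purpose
amount_text
```

Column 1 of the result is always the full match; the capture groups start at column 2.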

Get rid of the punctuation (the capture group already dropped the “$”, so only the commas remain):

money_strings <- stringr::str_replace_all(money_strings,"[\\$,]","")
money_strings

## [1] "4330000000" "5023000000" "5500000000" "640000000"  "686000000" 
## [6] "115000000"  "2788600000" "388000000"  "37400000"

Turn them into numeric data rather than strings.

money <- as.numeric(money_strings)
money

## [1] 4330000000 5023000000 5500000000  640000000  686000000  115000000 2788600000
## [8]  388000000   37400000
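Stripping the commas first is not optional: `as.numeric()` does not understand thousands separators. A quick base-R sketch with one of the amounts:

```r
with_commas <- "4,330,000,000"

# Conversion fails on the commas: the result is NA (with a coercion warning,
# suppressed here so the example runs quietly).
bad <- suppressWarnings(as.numeric(with_commas))
bad

# Remove the commas, then convert.
good <- as.numeric(gsub(",", "", with_commas, fixed = TRUE))
good
```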

Now let’s make it data:

appropriations_data <- data.frame(purpose = for_strings,amount = money)
appropriations_data
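One simple sanity check on the extraction: the nine amounts should add up to the $19,508,000,000 total stated in subsection (a). A sketch with the values typed in by hand from the output above:

```r
# Amounts for items (1) to (9), copied from the extracted values.
amount <- c(4330000000, 5023000000, 5500000000, 640000000, 686000000,
            115000000, 2788600000, 388000000, 37400000)

# Total authorized in subsection (a) of the bill text.
total_authorized <- 19508000000

amounts_add_up <- sum(amount) == total_authorized
amounts_add_up
```

If a pattern had dropped or truncated an item, this comparison would be one of the first places it showed up.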

Other languages

Remember … other programming languages handle regular expressions slightly differently. In particular, Python patterns are usually written as raw strings (for example, r"\$[,0-9]+"), so they do not need the “double escape” idiom.
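For what it’s worth, R itself has had a raw-string syntax since version 4.0.0, which lets you write patterns with single backslashes. A minimal sketch, assuming R 4.0.0 or later:

```r
library(stringr)

# r"( ... )" is a raw string: backslashes inside are taken literally.
raw_pattern     <- r"(\$[,0-9]+[0-9])"
escaped_pattern <- "\\$[,0-9]+[0-9]"

# Both spell exactly the same regular expression.
same_pattern <- identical(raw_pattern, escaped_pattern)
same_pattern

raw_hit <- str_extract("(3) For Science, $5,500,000,000.", raw_pattern)
raw_hit
```

The double-escaped form remains the one you will see in most existing R code, including everything above.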

An Introduction to String Manipulation and Regular Expressions in R

Penn State and Essex courses in “Text as Data”

Burt L. Monroe

Strings

Basic string operations

Substrings

Cleaning and normalizing strings

Searching for a pattern

Regular expressions

This or that, not this or that, this or that or anything in between

Square brackets for “or” (disjunction) of characters

Square brackets with `^` for negation.

Square brackets with `-` for “or” over a range of characters

Parentheses and pipe operator for multi-character patterns

Special characters and escaping

Class shorthands: \w \W \s \S \d \D and POSIX classes

“POSIX” classes

Quantifiers: * . ?

Quantifiers: * (zero or more of the previous)

Quantifiers: + (one or more of the previous)

Quantifiers:

Question Mark as Quantifier (zero or one of the previous)

Greedy vs. non-greedy matching

Question Mark as Nongreedy Modifier to Quantifier - smallest match of previous possible

Anchors and word boundaries: ^ $ \b

Anchors: ^ (beginning of string), $ (end of string)

Word boundaries: `\b`

Capture groups

Using regular expressions to extract data from text: an example

Other languages
