Text as Data Tutorial -- Intro to String Manipulation and Regular Expressions in Python

Penn State, Text as Data (PLSC 597), Fall 2019

Burt L. Monroe

This tutorial parallels a similar tutorial in R. In most cases, the task achieved is identical or very similar in both.

Many string manipulation methods below are provided in base Python. When we get to regular expressions, you'll need the "re" library.

The basic things we want to manipulate are strings. These can be specified using double quotes (") or single quotes ('):

In [1]:
a_string = 'Example STRING, with numbers (12, 15 and also 10.2)?!'
a_string
Out[1]:
'Example STRING, with numbers (12, 15 and also 10.2)?!'

It’s really a matter of style or convenience, but you might use one if your string actually contains the other:

In [2]:
my_double_quoted_string = "He asked, 'Why would you use double quotes?'"
my_double_quoted_string
Out[2]:
"He asked, 'Why would you use double quotes?'"

You can still use either one if you like, using \ (backslash) to tell Python to “escape” the next character. In the example below, the \" says that the " is part of the string, not the end of the string.

In [3]:
my_string_with_double_quotes = "She answered, \"Convenience, but you never really have to.\""
my_string_with_double_quotes
Out[3]:
'She answered, "Convenience, but you never really have to."'

If you ever want to see how your string with escape characters displays when printed or (typically) in an editor, use print.

In [4]:
print(my_double_quoted_string)
He asked, 'Why would you use double quotes?'
In [5]:
print(my_string_with_double_quotes)
She answered, "Convenience, but you never really have to."

This can get a little bit confusing. For example, since the backslash character tells Python to escape, to indicate an actual backslash character you have to backslash your backslashes:

In [6]:
a_string_with_backslashes = "To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\."
a_string_with_backslashes
Out[6]:
'To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\.'
In [7]:
print(a_string_with_backslashes)
To indicate a backslash, \, you have to type two: \\. Just there, to indicate two backslashes, I had to type four: \\\\.
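
One way to cut down on the backslash counting is a "raw" string: put an r before the opening quote and Python leaves the backslashes alone. That is why the regular expressions later in this tutorial are written as r'...'. A minimal sketch (the path-like string is just a made-up example):

```python
# A regular string needs doubled backslashes; a raw string (r prefix) does not.
escaped = "C:\\temp\\new_file.txt"
raw = r"C:\temp\new_file.txt"

print(escaped == raw)          # True: both literals describe the same string

# In a raw string, \n is two characters (backslash, n), not a newline.
print(len("\n"), len(r"\n"))   # 1 2
```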

There are a number of special escape characters that are used to represent things like “control characters.” The most common are two that you’re already used to tapping a keyboard key for without expecting a character to appear on your screen: \t (tab) and \n (newline).

In [8]:
test_string = "abc ABC 123\t.!?\\(){}\n  \nthird line"
test_string
Out[8]:
'abc ABC 123\t.!?\\(){}\n  \nthird line'
In [9]:
print(test_string)
abc ABC 123	.!?\(){}
  
third line

If you want to define a multiline string without using escaped newline characters, use triple quotation marks:

In [10]:
test_string2 = """abc ABC 123\t.!?\\(){}
  
third line"""
test_string2
Out[10]:
'abc ABC 123\t.!?\\(){}\n  \nthird line'

As with pretty much everything in Python, you can have a list of strings.

In [11]:
a_list_of_strings = ["abcde", "123", "chicken of the sea"]
a_list_of_strings
Out[11]:
['abcde', '123', 'chicken of the sea']

In the R tutorial, we made use of a few collections of strings provided in base R or stringr. To make this comparable, we'll create or load these here.

The letters of the alphabet are available in the string module.

In [12]:
import string
letters_string = string.ascii_lowercase
letters_string
Out[12]:
'abcdefghijklmnopqrstuvwxyz'
In [13]:
letters_list = list(letters_string)
letters_list
Out[13]:
['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']
In [14]:
LETTERS_string = string.ascii_uppercase
LETTERS_string
Out[14]:
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [15]:
LETTERS_list = list(LETTERS_string)
LETTERS_list
Out[15]:
['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z']

We'll just make the month lists.

In [16]:
month_abb = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_abb
Out[16]:
['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']
In [17]:
month_name = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
month_name
Out[17]:
['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']
In [18]:
# REQUIRES file "fruit.txt" be in the same directory
with open("fruit.txt", "r") as fruitfile:  # "with" closes the file when the block ends
    fruit = fruitfile.read().splitlines()
print(fruit)
['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilberry', 'blackberry', 'blackcurrant', 'blood orange', 'blueberry', 'boysenberry', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudberry', 'coconut', 'cranberry', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderberry', 'feijoa', 'fig', 'goji berry', 'gooseberry', 'grape', 'grapefruit', 'guava', 'honeydew', 'huckleberry', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulberry', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspberry', 'redcurrant', 'rock melon', 'salal berry', 'satsuma', 'star fruit', 'strawberry', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']
In [19]:
# REQUIRES file "words.txt" be in the same directory
with open("words.txt", "r") as wordsfile:  # "with" closes the file when the block ends
    words = wordsfile.read().splitlines()
len(words)
Out[19]:
980

The word list is long, so let's just look at the top 5.

(A couple of things for R users here. First, Python starts counting at 0, not 1. So, in R, words[0] returns an empty vector rather than an element, words[1] is "a", and words[5] is "accept"; in Python, words[0] is "a", words[1] is "able", and words[5] is "account". Second, the "slicing" notation in Python is also weird if you're used to R. In R, to get the first five members, you ask for item 1 through item 5: words[1:5]. In Python, you might imagine the list members sitting on a number line that puts the first list member between 0 and 1, the second between 1 and 2, and so on. Then, to get the first 5 members of the list, we ask for the slice that starts at 0, just to the "left" of the stuff we want, and ends at 5, just to the "right" of it: words[0:5].)

In [20]:
words[0:5]
Out[20]:
['a', 'able', 'about', 'absolute', 'accept']
In [21]:
words[5]
Out[21]:
'account'
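
The number-line picture can be checked on a small list. This is only a sketch; some_words is a made-up stand-in for the first few entries of the word list, so it runs without words.txt.

```python
# A small made-up list standing in for the tutorial's word list.
some_words = ["a", "able", "about", "absolute", "accept", "account", "achieve"]

first_five = some_words[0:5]   # positions 0, 1, 2, 3, 4 (stops before 5)
same_thing = some_words[:5]    # a missing start defaults to 0
the_rest = some_words[5:]      # a missing end runs to the end of the list

print(first_five)              # ['a', 'able', 'about', 'absolute', 'accept']
print(the_rest)                # ['account', 'achieve']
print(len(first_five))         # 5, i.e. end minus start
```

Because the end position is excluded, some_words[:5] + some_words[5:] gives back the whole list with nothing repeated or dropped.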

The sentences list is also long.

In [22]:
# REQUIRES file "sentences.txt" be in the same directory
with open("sentences.txt", "r") as sentencesfile:  # "with" closes the file when the block ends
    sentences = sentencesfile.read().splitlines()
len(sentences)
Out[22]:
720
In [23]:
sentences[0:5]
Out[23]:
['The birch canoe slid on the smooth planks.',
 'Glue the sheet to the dark blue background.',
 "It's easy to tell the depth of a well.",
 'These days a chicken leg is a rare dish.',
 'Rice is often served in round bowls.']

Manipulating strings

You can combine, or “concatenate”, strings very naturally using the "+" sign.

In [24]:
second_string = "Wow, two sentences."
combined_string = a_string + " " + second_string
combined_string
Out[24]:
'Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.'

You can also combine a list of strings with a separator using the "join" method. To join the two strings above with a space between them, put the strings in a list (using square brackets), write the separator as a string, and use the syntax sep.join(list):

In [25]:
" ".join([a_string,second_string]) 
Out[25]:
'Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.'

Note that "join" takes a list of strings of any length and concatenates all the strings together with the separator.

In [26]:
" then ".join(month_name)
Out[26]:
'January then February then March then April then May then June then July then August then September then October then November then December'
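
One thing to watch for: join only works on strings. If the list holds numbers, convert them first. A small sketch, using a made-up list of the numbers that appear in a_string:

```python
numbers = [12, 15, 10.2]

# join only accepts strings, so convert each element with str() first.
joined = ", ".join(str(n) for n in numbers)
print(joined)   # 12, 15, 10.2

# Passing the numbers directly raises a TypeError.
try:
    ", ".join(numbers)
except TypeError as err:
    print("TypeError:", err)
```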

In the R notebook, we next created a new vector of strings concatenating the month vectors into strings like "Jan stands for January". Python doesn't have the element-by-element "vectorized" syntax R does, so we have to more explicitly iterate over the elements to do this here. There are several ways to do this.

The most straightforward to understand, but not very "Pythonic", way is to do this in a for loop:

In [27]:
month_explanations = []
for i in range(12):
    new_string = " stands for ".join([month_abb[i], month_name[i]])
    month_explanations.append(new_string)
month_explanations
Out[27]:
['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']

There are more compact, Pythonic ways to do this. One is to use the "zip" function which creates an "iterator" of "tuples" element by element and then iterate over that. Let's look inside the zip function first by making a list of the zipped elements:

In [28]:
list(zip(month_abb, month_name))
Out[28]:
[('Jan', 'January'),
 ('Feb', 'February'),
 ('Mar', 'March'),
 ('Apr', 'April'),
 ('May', 'May'),
 ('Jun', 'June'),
 ('Jul', 'July'),
 ('Aug', 'August'),
 ('Sep', 'September'),
 ('Oct', 'October'),
 ('Nov', 'November'),
 ('Dec', 'December')]

Not quite what we want, but we can join those tuples as we iterate over the zip object using a "list comprehension":

In [29]:
[" stands for ".join([abbrev,name]) for abbrev,name in zip(month_abb, month_name)]
Out[29]:
['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']

The "list comprehension" is defined by those square brackets on the outside (making it a list) and the "for loop"-like instruction inside.

There are many ways to do the same thing. For example, we can change the string manipulation operation that gets repeated from the join method to the format method:

In [30]:
["{} stands for {}".format(abbrev,name) for abbrev,name in zip(month_abb, month_name)]
Out[30]:
['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']
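
Yet another way, assuming Python 3.6 or later, is an "f-string", which lets you put the variable names directly inside the curly braces. A sketch on shortened, made-up copies of the month lists:

```python
month_abb_demo = ["Jan", "Feb", "Mar"]
month_name_demo = ["January", "February", "March"]

# The f prefix tells Python to fill in {abbrev} and {name} from the variables.
explanations = [f"{abbrev} stands for {name}"
                for abbrev, name in zip(month_abb_demo, month_name_demo)]
print(explanations)
# ['Jan stands for January', 'Feb stands for February', 'Mar stands for March']
```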

The join/zip idiom works for the letters example in the other notebook as well:

In [31]:
letterpairs = ["".join([lower,upper]) for lower, upper in zip(letters_list, LETTERS_list)]
print(letterpairs)
['aA', 'bB', 'cC', 'dD', 'eE', 'fF', 'gG', 'hH', 'iI', 'jJ', 'kK', 'lL', 'mM', 'nN', 'oO', 'pP', 'qQ', 'rR', 'sS', 'tT', 'uU', 'vV', 'wW', 'xX', 'yY', 'zZ']

You can zip two lists together, concatenate those element by element, and then join them by a separator.

In [32]:
" then ".join(["{} ({})".format(name,abbrev) for name,abbrev in zip(month_name,month_abb)])
Out[32]:
'January (Jan) then February (Feb) then March (Mar) then April (Apr) then May (May) then June (Jun) then July (Jul) then August (Aug) then September (Sep) then October (Oct) then November (Nov) then December (Dec)'

You can split up a string into pieces, based on a separator string, with the "split" method.

In [33]:
combined_string.split("!")
Out[33]:
['Example STRING, with numbers (12, 15 and also 10.2)?',
 ' Wow, two sentences.']
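
A few variations on split are worth knowing. The sketch below uses made-up strings to show the difference between splitting with no argument and splitting on an explicit space, plus the optional limit on the number of splits:

```python
spacey = "one  two   three"

on_whitespace = spacey.split()         # no argument: split on runs of whitespace
on_each_space = spacey.split(" ")      # explicit " ": every single space separates
date_parts = "2001/09/03".split("/")
first_only = "a,b,c,d".split(",", 1)   # second argument limits the number of splits

print(on_whitespace)   # ['one', 'two', 'three']
print(on_each_space)   # ['one', '', 'two', '', '', 'three']
print(date_parts)      # ['2001', '09', '03']
print(first_only)      # ['a', 'b,c,d']
```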

Substrings (Slices)

Substrings are just slices in Python. To get a list of the second through fourth characters in each fruit name:

In [34]:
substringfromfruit = [eachfruit[1:4] for eachfruit in fruit]
print(substringfromfruit)
['ppl', 'pri', 'voc', 'ana', 'ell', 'ilb', 'lac', 'lac', 'loo', 'lue', 'oys', 'rea', 'ana', 'ant', 'her', 'her', 'hil', 'lem', 'lou', 'oco', 'ran', 'ucu', 'urr', 'ams', 'ate', 'rag', 'uri', 'ggp', 'lde', 'eij', 'ig', 'oji', 'oos', 'rap', 'rap', 'uav', 'one', 'uck', 'ack', 'amb', 'uju', 'iwi', 'umq', 'emo', 'ime', 'oqu', 'ych', 'and', 'ang', 'ulb', 'ect', 'ut', 'liv', 'ran', 'ame', 'apa', 'ass', 'eac', 'ear', 'ers', 'hys', 'ine', 'lum', 'ome', 'ome', 'urp', 'uin', 'ais', 'amb', 'asp', 'edc', 'ock', 'ala', 'ats', 'tar', 'tra', 'ama', 'ang', 'gli', 'ate']

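Substrings from the end of the string can be accessed by slices using negative numbers. Here, [-3:-1] takes the third-to-last and second-to-last characters; as always, the end position of the slice is excluded.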

In [35]:
subfromend = [eachfruit[-3:-1] for eachfruit in fruit]
print(subfromend)
['pl', 'co', 'ad', 'an', 'pe', 'rr', 'rr', 'an', 'ng', 'rr', 'rr', 'ui', 'lo', 'up', 'oy', 'rr', 'pe', 'in', 'rr', 'nu', 'rr', 'be', 'an', 'so', 'at', 'ui', 'ia', 'an', 'rr', 'jo', 'fi', 'rr', 'rr', 'ap', 'ui', 'av', 'de', 'rr', 'ui', 'bu', 'ub', 'ui', 'ua', 'mo', 'im', 'ua', 'he', 'in', 'ng', 'rr', 'in', 'nu', 'iv', 'ng', 'el', 'ay', 'ui', 'ac', 'ea', 'mo', 'li', 'pl', 'lu', 'at', 'el', 'ee', 'nc', 'si', 'ta', 'rr', 'an', 'lo', 'rr', 'um', 'ui', 'rr', 'll', 'in', 'ui', 'lo']
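
If the goal is the last few characters including the final one, leave the end of the slice empty. A quick sketch on a single made-up example string:

```python
fruit_demo = "strawberry"

print(fruit_demo[-3:-1])   # 'rr'    third-to-last and second-to-last characters
print(fruit_demo[-3:])     # 'rry'   the last three characters
print(fruit_demo[-1])      # 'y'     the last character only
print(fruit_demo[:-5])     # 'straw' everything except the last five characters
```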

You can use slicing to extract data from strings:

In [36]:
some_dates = ["1999/01/01","1998/12/15","2001/09/03"]
years = [date[0:4] for date in some_dates]
print(years)
['1999', '1998', '2001']
In [37]:
months = [date[5:7] for date in some_dates]
print(months)
['01', '12', '09']

Getting a copy of a string with specific positions replaced is also a matter of slicing:

In [38]:
apple = "apple"
zebra = "--!ZEBRA!--"
zebraapple = apple[0:1] + zebra + apple[3:]
zebraapple
Out[38]:
'a--!ZEBRA!--le'

Replicating the R result over the whole list can be done by putting it within a list comprehension.

In [39]:
zebrafruit = [fr[0:1] + zebra + fr[3:] for fr in fruit]
print(zebrafruit)
['a--!ZEBRA!--le', 'a--!ZEBRA!--icot', 'a--!ZEBRA!--cado', 'b--!ZEBRA!--ana', 'b--!ZEBRA!--l pepper', 'b--!ZEBRA!--berry', 'b--!ZEBRA!--ckberry', 'b--!ZEBRA!--ckcurrant', 'b--!ZEBRA!--od orange', 'b--!ZEBRA!--eberry', 'b--!ZEBRA!--senberry', 'b--!ZEBRA!--adfruit', 'c--!ZEBRA!--ary melon', 'c--!ZEBRA!--taloupe', 'c--!ZEBRA!--rimoya', 'c--!ZEBRA!--rry', 'c--!ZEBRA!--li pepper', 'c--!ZEBRA!--mentine', 'c--!ZEBRA!--udberry', 'c--!ZEBRA!--onut', 'c--!ZEBRA!--nberry', 'c--!ZEBRA!--umber', 'c--!ZEBRA!--rant', 'd--!ZEBRA!--son', 'd--!ZEBRA!--e', 'd--!ZEBRA!--gonfruit', 'd--!ZEBRA!--ian', 'e--!ZEBRA!--plant', 'e--!ZEBRA!--erberry', 'f--!ZEBRA!--joa', 'f--!ZEBRA!--', 'g--!ZEBRA!--i berry', 'g--!ZEBRA!--seberry', 'g--!ZEBRA!--pe', 'g--!ZEBRA!--pefruit', 'g--!ZEBRA!--va', 'h--!ZEBRA!--eydew', 'h--!ZEBRA!--kleberry', 'j--!ZEBRA!--kfruit', 'j--!ZEBRA!--bul', 'j--!ZEBRA!--ube', 'k--!ZEBRA!--i fruit', 'k--!ZEBRA!--quat', 'l--!ZEBRA!--on', 'l--!ZEBRA!--e', 'l--!ZEBRA!--uat', 'l--!ZEBRA!--hee', 'm--!ZEBRA!--darine', 'm--!ZEBRA!--go', 'm--!ZEBRA!--berry', 'n--!ZEBRA!--tarine', 'n--!ZEBRA!--', 'o--!ZEBRA!--ve', 'o--!ZEBRA!--nge', 'p--!ZEBRA!--elo', 'p--!ZEBRA!--aya', 'p--!ZEBRA!--sionfruit', 'p--!ZEBRA!--ch', 'p--!ZEBRA!--r', 'p--!ZEBRA!--simmon', 'p--!ZEBRA!--salis', 'p--!ZEBRA!--eapple', 'p--!ZEBRA!--m', 'p--!ZEBRA!--egranate', 'p--!ZEBRA!--elo', 'p--!ZEBRA!--ple mangosteen', 'q--!ZEBRA!--nce', 'r--!ZEBRA!--sin', 'r--!ZEBRA!--butan', 'r--!ZEBRA!--pberry', 'r--!ZEBRA!--currant', 'r--!ZEBRA!--k melon', 's--!ZEBRA!--al berry', 's--!ZEBRA!--suma', 's--!ZEBRA!--r fruit', 's--!ZEBRA!--awberry', 't--!ZEBRA!--arillo', 't--!ZEBRA!--gerine', 'u--!ZEBRA!--i fruit', 'w--!ZEBRA!--ermelon']

Strings have simple case-conversion methods that can be applied:

In [40]:
combined_string.lower()
Out[40]:
'example string, with numbers (12, 15 and also 10.2)?! wow, two sentences.'
In [41]:
combined_string.upper()
Out[41]:
'EXAMPLE STRING, WITH NUMBERS (12, 15 AND ALSO 10.2)?! WOW, TWO SENTENCES.'

There are also several methods to trim excess white space off the ends of strings:

In [42]:
lotsofspace = '   Why   so much  space?   '
lotsofspace.strip()
Out[42]:
'Why   so much  space?'
In [43]:
lotsofspace.lstrip()
Out[43]:
'Why   so much  space?   '
In [44]:
lotsofspace.rstrip()
Out[44]:
'   Why   so much  space?'
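
None of the strip methods touch the extra spaces in the middle of the string. One common idiom for that, sketched here, is to split on whitespace and join the pieces back with single spaces. The strip methods can also take the set of characters to remove.

```python
lots_of_space = "   Why   so much  space?   "

# split() with no argument breaks on runs of whitespace and ignores the ends;
# joining with a single space puts the pieces back together.
squished = " ".join(lots_of_space.split())
print(squished)   # Why so much space?

# strip can take the characters to remove instead of whitespace.
trimmed = "--!ZEBRA!--".strip("-!")
print(trimmed)    # ZEBRA
```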

Matching substrings

If we're looking for specific substrings, there are string methods to do that.

In [45]:
"strawberry".find("berry")
Out[45]:
5

That returns the position of the first match. If there is no match, find returns a value of -1.

In [46]:
"apple".find("berry")
Out[46]:
-1

If there are multiple matches, find returns the position of the first match.

In [47]:
"berryberryboberrybananafanafoferrymemymomerry berry".find("berry")
Out[47]:
0

We can use this in a list comprehension, with the addition of an "if" condition, to extract a list of all matching fruits.

In [48]:
[fr for fr in fruit if fr.find("berry")> -1]
Out[48]:
['bilberry',
 'blackberry',
 'blueberry',
 'boysenberry',
 'cloudberry',
 'cranberry',
 'elderberry',
 'goji berry',
 'gooseberry',
 'huckleberry',
 'mulberry',
 'raspberry',
 'salal berry',
 'strawberry']
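
When you only need to know whether a substring is present, not where, the "in" operator reads a little more naturally than comparing find to -1. The startswith and endswith methods cover the ends of the string. A sketch on a short made-up list:

```python
names_demo = ["apple", "blueberry", "goji berry", "berry-free bar", "cherry"]

contains = [nm for nm in names_demo if "berry" in nm]
ends_with = [nm for nm in names_demo if nm.endswith("berry")]
starts_with = [nm for nm in names_demo if nm.startswith("berry")]

print(contains)      # ['blueberry', 'goji berry', 'berry-free bar']
print(ends_with)     # ['blueberry', 'goji berry']
print(starts_with)   # ['berry-free bar']
```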

We can get a copy of the string with the substring replaced with something else:

In [49]:
"strawberry".replace("berry","fish")
Out[49]:
'strawfish'
In [50]:
fishfruit = [fr.replace("berry","fish") for fr in fruit]
print(fishfruit)
['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilfish', 'blackfish', 'blackcurrant', 'blood orange', 'bluefish', 'boysenfish', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudfish', 'coconut', 'cranfish', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderfish', 'feijoa', 'fig', 'goji fish', 'goosefish', 'grape', 'grapefruit', 'guava', 'honeydew', 'hucklefish', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulfish', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspfish', 'redcurrant', 'rock melon', 'salal fish', 'satsuma', 'star fruit', 'strawfish', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']

Searching for patterns with regular expressions

So far, I’ve only searched for literal patterns made of alphabetic characters, like "berry". But we can make much more elaborate and flexible patterns using regular expressions. For this we need to import the "re" module.

I recommend you reference the cheat sheet and the online regex tool https://regex101.com in parallel.

Just for comparison's sake, let's start with a search for the same pattern as above: "berry".

In [51]:
import re
mo = re.search(r'berry', 'strawberry')
mo
Out[51]:
<re.Match object; span=(5, 10), match='berry'>

The start and end positions of the match are returned by the "span" method:

In [52]:
mo.span()
Out[52]:
(5, 10)

The matched text itself is returned by the "group" method, which I'll explain below.

In [53]:
mo.group()
Out[53]:
'berry'

If there is no match, the match object is null-valued ("None"). You can, more or less, use match objects in conditional statements, with null equalling "False" and any match resulting in "True".

In [54]:
mo_miss = re.search(r'berry','apple')
mo_miss
In [55]:
print(mo_miss)
None
In [56]:
if mo:
    print("Strawberry is a berry!")
else:
    print("Strawberry is not a berry.")
Strawberry is a berry!
In [57]:
if mo_miss:
    print("Apple is a berry")
else:
    print("Apple is not a berry.")
Apple is not a berry.
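
A related point that trips people up: the re module also has "match" and "fullmatch" functions, and they are stricter than "search" about where the pattern has to occur. A short sketch of the difference:

```python
import re

anywhere = re.search(r'berry', 'strawberry')      # found anywhere in the string
at_start = re.match(r'berry', 'strawberry')       # None: must match at the start
straw_start = re.match(r'straw', 'strawberry')    # matches at the start
whole = re.fullmatch(r'straw', 'strawberry')      # None: must match the whole string

print(anywhere, at_start, straw_start, whole, sep="\n")

# An explicit comparison with None says the same thing as "if anywhere:".
found = anywhere is not None
print(found)   # True
```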

This, again, can be put in a list comprehension to get a list of all berries:

In [58]:
berries = [itsaberry for itsaberry in fruit if re.search(r'berry',itsaberry)]
print(berries)
['bilberry', 'blackberry', 'blueberry', 'boysenberry', 'cloudberry', 'cranberry', 'elderberry', 'goji berry', 'gooseberry', 'huckleberry', 'mulberry', 'raspberry', 'salal berry', 'strawberry']

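As a sidebar, this hands the pattern string to re.search every time through the loop. The re module caches recently compiled patterns, so the pattern is not truly recompiled each time, but there is still a lookup on every call. It's a bit more efficient, and often clearer, to compile the pattern once before the loop using a slightly different syntax: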

In [59]:
reo = re.compile(r'berry') # compile the pattern into a regular expression object
berries = [itsaberry for itsaberry in fruit if reo.search(itsaberry)]
print(berries)
['bilberry', 'blackberry', 'blueberry', 'boysenberry', 'cloudberry', 'cranberry', 'elderberry', 'goji berry', 'gooseberry', 'huckleberry', 'mulberry', 'raspberry', 'salal berry', 'strawberry']

The "search" method will return a single object describing only the first match in the string.

In [60]:
mo_many = re.search(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
mo_many
Out[60]:
<re.Match object; span=(0, 5), match='berry'>

The findall method returns a list of all matching strings.

In [61]:
mo_many2 = re.findall(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
mo_many2
Out[61]:
['berry', 'berry', 'berry', 'berry']

The "finditer" method returns an "iterator" (thing, like a list, over which you can, um, iterate) containing match objects for every match.

In [62]:
mo_iter = re.finditer(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
for moi in mo_iter:
    print(moi)
<re.Match object; span=(0, 5), match='berry'>
<re.Match object; span=(5, 10), match='berry'>
<re.Match object; span=(12, 17), match='berry'>
<re.Match object; span=(46, 51), match='berry'>
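
Since each item from finditer is a match object, you can pull out just the pieces you need, for example the positions. A sketch on a shorter made-up string:

```python
import re

jumble = "berryberry bo berry"

# Each match object knows where it starts and its full (start, end) span.
starts = [m.start() for m in re.finditer(r'berry', jumble)]
spans = [m.span() for m in re.finditer(r'berry', jumble)]

print(starts)   # [0, 5, 14]
print(spans)    # [(0, 5), (5, 10), (14, 19)]
```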

Now let's use regex to look for more complex patterns than just substrings.

Square brackets for “or” (disjunction) of characters.

Match “any one of” the characters in the square brackets.

In [63]:
reodemo = re.compile(r' [bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
Out[63]:
['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'Feel the heat of the weak dying flame.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'It takes heat to bring out the odor.']

Square brackets with ^ for negation.

Match “anything but one of” the characters in the square brackets.

(Be careful ... the caret ... ^ ... means something else in a different context.)

In [64]:
reodemo = re.compile(r' [^bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
Out[64]:
['Pack the records in a neat thin case.', 'A clean neck means a neat collar.']

Square brackets for “or” over a range of characters

In [65]:
reodemo = re.compile(r' [b-p]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
Out[65]:
['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'Feel the heat of the weak dying flame.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'Pack the records in a neat thin case.',
 'It takes heat to bring out the odor.',
 'A clean neck means a neat collar.']
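
The three kinds of square-bracket class can be compared side by side without the sentences file. This sketch uses a made-up string of rhyming words and drops the surrounding spaces from the patterns:

```python
import re

rhymes = "beat heat neat peat seat"

any_of = re.findall(r'[bhp]eat', rhymes)     # b, h, or p
none_of = re.findall(r'[^bhp]eat', rhymes)   # anything except b, h, or p
in_range = re.findall(r'[b-p]eat', rhymes)   # any letter from b through p

print(any_of)     # ['beat', 'heat', 'peat']
print(none_of)    # ['neat', 'seat']
print(in_range)   # ['beat', 'heat', 'neat', 'peat']
```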

Pipe operator for "or" over multi-character patterns

When we need an “or” over multi-character patterns, we can use the “pipe” operator, using parentheses as necessary to identify what’s with what.

In [66]:
reodemo = re.compile(r'(black|blue|red)(currant|berry)')
matches = [itsamatch for itsamatch in fruit if reodemo.search(itsamatch)]
matches
Out[66]:
['blackberry', 'blackcurrant', 'blueberry', 'redcurrant']
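
A caution if you use this pattern with findall instead of search: when a pattern contains parentheses, findall returns the parenthesized pieces rather than the whole match. Writing the parentheses as (?:...) keeps the grouping but turns off that behavior. A sketch on a made-up string:

```python
import re

text_demo = "blackberry jam and redcurrant jelly"

# Plain parentheses: findall returns a tuple of the pieces for each match.
with_capture = re.findall(r'(black|blue|red)(currant|berry)', text_demo)

# (?:...) groups without capturing, so findall returns the whole match.
without_capture = re.findall(r'(?:black|blue|red)(?:currant|berry)', text_demo)

print(with_capture)      # [('black', 'berry'), ('red', 'currant')]
print(without_capture)   # ['blackberry', 'redcurrant']
```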

Special characters and the backslash

In addition to the backslash itself, there are several characters that have special meaning in Python regexes, and (may) have to be escaped in order to match the literal character. The core list is this: ^ $ . * + | ? ( ) [ ] { }. A few others, such as ! < >, are special only inside constructs that begin with (?, like the lookahead (?!...).

For example, the period – “.” – means “any character but a newline.” It’s a wildcard. We get different results when we escape or don’t escape it.

In [67]:
allchars = re.findall(r'.',combined_string)
print(allchars)
['E', 'x', 'a', 'm', 'p', 'l', 'e', ' ', 'S', 'T', 'R', 'I', 'N', 'G', ',', ' ', 'w', 'i', 't', 'h', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', '(', '1', '2', ',', ' ', '1', '5', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 's', 'o', ' ', '1', '0', '.', '2', ')', '?', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']
In [68]:
allperiods = re.findall(r'\.',combined_string)
print(allperiods)
['.', '.']
In [69]:
matches = re.findall(r'a.',combined_string)
print(matches)
['am', 'an', 'al']
In [70]:
matches = re.findall(r'a\.',combined_string)
print(matches)
[]

Some of these are only special characters in certain contexts and don’t have to be escaped to be recognized when not in those contexts. But they can be escaped in all circumstances and I recommend that rather than trying to figure out the exact rules.

The exclamation point is such a character.

In [71]:
matches = re.findall(r'\!',combined_string)
print(matches)
['!']
In [72]:
matches = re.findall(r'!',combined_string) # Not special char in this context, so still finds it
print(matches)
['!']
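
If you want to search for a literal string that happens to be full of special characters, you don't have to escape it by hand: re.escape builds the escaped pattern for you. A sketch, using a chunk of the example string as the literal to find:

```python
import re

literal = "(12, 15 and also 10.2)?!"
sentence = "Example STRING, with numbers (12, 15 and also 10.2)?! Wow."

# re.escape adds backslashes wherever the regex engine would need them.
pattern = re.escape(literal)
found = re.findall(pattern, sentence)

print(pattern)   # exact escaping varies a little across Python versions
print(found)     # ['(12, 15 and also 10.2)?!']
```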

Class shorthands

Conversely, there are a number of characters that have special meaning only when escaped. The main ones for now are “\w” (any alphanumeric character or underscore), “\s” (any whitespace character), and “\d” (any numeric digit). The capitalized versions of these are used to mean “anything but” that class.

In [73]:
matches = re.findall(r'\w',combined_string) # any alphanumeric character
print(matches)
['E', 'x', 'a', 'm', 'p', 'l', 'e', 'S', 'T', 'R', 'I', 'N', 'G', 'w', 'i', 't', 'h', 'n', 'u', 'm', 'b', 'e', 'r', 's', '1', '2', '1', '5', 'a', 'n', 'd', 'a', 'l', 's', 'o', '1', '0', '2', 'W', 'o', 'w', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's']
In [74]:
matches = re.findall(r'\W',combined_string) # any non-alphanumeric character
print(matches)
[' ', ',', ' ', ' ', ' ', '(', ',', ' ', ' ', ' ', ' ', '.', ')', '?', '!', ' ', ',', ' ', ' ', '.']
In [75]:
matches = re.findall(r'\s',combined_string) # any whitespace character
print(matches)
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
In [76]:
matches = re.findall(r'\S',combined_string) # any non-whitespace character
print(matches)
['E', 'x', 'a', 'm', 'p', 'l', 'e', 'S', 'T', 'R', 'I', 'N', 'G', ',', 'w', 'i', 't', 'h', 'n', 'u', 'm', 'b', 'e', 'r', 's', '(', '1', '2', ',', '1', '5', 'a', 'n', 'd', 'a', 'l', 's', 'o', '1', '0', '.', '2', ')', '?', '!', 'W', 'o', 'w', ',', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']
In [77]:
matches = re.findall(r'\d',combined_string) # any digit character
print(matches)
['1', '2', '1', '5', '1', '0', '2']
In [78]:
matches = re.findall(r'\D',combined_string) # any non-digit character
print(matches)
['E', 'x', 'a', 'm', 'p', 'l', 'e', ' ', 'S', 'T', 'R', 'I', 'N', 'G', ',', ' ', 'w', 'i', 't', 'h', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', '(', ',', ' ', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 's', 'o', ' ', '.', ')', '?', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']

The Python re module does not directly support "POSIX" classes.
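
You can usually build a rough equivalent yourself. As a sketch, here is one way to approximate the POSIX punctuation class for ASCII text, by putting the constant string.punctuation inside square brackets. The name punct_class is my own, not something the re module provides.

```python
import re
import string

# string.punctuation holds the ASCII punctuation characters; re.escape makes
# them safe to place inside a square-bracket class.
punct_class = "[" + re.escape(string.punctuation) + "]"

punct = re.findall(punct_class, "Wow, two (2) sentences.")
print(punct)   # [',', '(', ')', '.']
```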

Quantifiers: * (zero or more of the previous)

This is also known as the “Kleene star” (usually pronounced clean-ee, though Kleene himself said clay-nee), after Stephen Kleene, who introduced the notation in formal logic.

In [79]:
matches = re.findall(r'\d*',combined_string) # any string of zero or more digits
print(matches)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '12', '', '', '15', '', '', '', '', '', '', '', '', '', '', '10', '', '2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Note that the "zero or more" led it to identify every position of the string as a match, many of them empty (containing no characters).

Quantifiers: + (one or more of the previous)

This is also known as the “Kleene plus.”

In [80]:
matches = re.findall(r'\d+',combined_string) # any string of one or more digits
print(matches)
['12', '15', '10', '2']

Quantifiers {n} {n,m} and {n,}

{n} = “exactly n” of the previous
{n,m} = “between n and m” of the previous
{n,} = “n or more” of the previous

In [81]:
matches = re.findall(r'x{3}','x xx xxx xxxx xxxxx') # 3 x's
print(matches)
['xxx', 'xxx', 'xxx']
In [82]:
matches = re.findall(r'x{3,4}','x xx xxx xxxx xxxxx') # 3 or 4 x's
print(matches)
['xxx', 'xxxx', 'xxxx']
In [83]:
matches = re.findall(r'x{3,}','x xx xxx xxxx xxxxx') # 3 or more x's
print(matches)
['xxx', 'xxxx', 'xxxxx']

Were any of those unexpected? (Probably ... how many strings of 3 x's are in that string?) Use your regex viewer to see what's going on.
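
If you don't have a regex viewer handy, the spans of the matches tell the same story. In this sketch, the engine scans left to right, takes the first three x's it finds in each run, and then carries on from where that match ended. Adding word boundaries (introduced further below) is one way to insist on runs of exactly three.

```python
import re

xs = 'x xx xxx xxxx xxxxx'

# Where did each 'xxx' come from? One from each run of three or more x's.
spans = [m.span() for m in re.finditer(r'x{3}', xs)]
print(spans)   # [(5, 8), (9, 12), (14, 17)]

# With word boundaries on both sides, only the run of exactly three matches.
exact = re.findall(r'\bx{3}\b', xs)
print(exact)   # ['xxx']
```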

Quantifier ? (zero or one of the previous)

In [84]:
matches = re.findall(r'\d?', combined_string) # any string of zero or one digits
print(matches)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '2', '', '', '1', '5', '', '', '', '', '', '', '', '', '', '', '1', '0', '', '2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
In [85]:
reodemo = re.compile(r' [bp]?eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
Out[85]:
['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'Quench your thirst, then eat the crackers.']

Question Mark as Nongreedy Modifier to Quantifier (smallest match of previous possible)

In [86]:
# greedy - roughly, longest match
matches = re.findall(r'\(.+\)','(First bracketed statement) Other text (Second bracketed statement)')
print(matches)
['(First bracketed statement) Other text (Second bracketed statement)']
In [87]:
# nongreedy - roughly, smallest matches
matches = re.findall(r'\(.+?\)','(First bracketed statement) Other text (Second bracketed statement)')
print(matches)
['(First bracketed statement)', '(Second bracketed statement)']
In [88]:
# greedy - matches whole string
matches = re.findall(r'x.+x','x xx xxx xxxx xxxxx')
print(matches)
['x xx xxx xxxx xxxxx']
In [89]:
# nongreedy - minimal match as placeholder moves across string
matches = re.findall(r'x.+?x','x xx xxx xxxx xxxxx')
print(matches)
['x x', 'x x', 'xx x', 'xxx', 'xxx']
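
For the bracketed-statement case, an alternative to the nongreedy modifier is to say exactly what may appear between the parentheses: anything except a closing parenthesis. A sketch of that approach:

```python
import re

bracketed = '(First bracketed statement) Other text (Second bracketed statement)'

# [^)]+ means "one or more characters that are not a closing parenthesis",
# so the match cannot run past the first ")" it meets.
matches = re.findall(r'\([^)]+\)', bracketed)
print(matches)   # ['(First bracketed statement)', '(Second bracketed statement)']
```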

Anchors at beginning and end of string

In [90]:
matches = re.findall(r'^\w+',combined_string) # ^ is beginning of string
print(matches)
['Example']
In [91]:
matches = re.findall(r'\w+$',combined_string) # $ is end of string
print(matches)
[]
In [92]:
matches = re.findall(r'\W+$',combined_string) # $ is end of string
print(matches)
['.']
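
By default these anchors refer to the whole string, even if it contains newlines. With the re.MULTILINE flag, ^ and $ also match at the start and end of each line. A sketch on a made-up two-line string:

```python
import re

two_lines = "first line\nsecond line"

starts_default = re.findall(r'^\w+', two_lines)
starts_multi = re.findall(r'^\w+', two_lines, flags=re.MULTILINE)
ends_default = re.findall(r'\w+$', two_lines)
ends_multi = re.findall(r'\w+$', two_lines, flags=re.MULTILINE)

print(starts_default)   # ['first']
print(starts_multi)     # ['first', 'second']
print(ends_default)     # ['line']
print(ends_multi)       # ['line', 'line']
```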

Anchors at word boundaries

Similarly, we can identify "word boundaries" with \b. This solves the greedy/nongreedy problem we had with the "x" sequences above. It still thinks the decimal point in 10.2 is a word boundary, though.

In [93]:
matches = re.findall(r'\bx.*?\b','x xx xxx xxxx xxxxx')
print(matches)
['x', 'xx', 'xxx', 'xxxx', 'xxxxx']
In [94]:
matches = re.findall(r'\b\w+?\b',combined_string) # still a little dumb
print(matches)
['Example', 'STRING', 'with', 'numbers', '12', '15', 'and', 'also', '10', '2', 'Wow', 'two', 'sentences']
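
If the numbers are what you're after, one way around the decimal point problem is to describe a number directly: digits, optionally followed by a period and more digits. This is a sketch that assumes plain unsigned numbers with no thousands separators; a_string_demo is a copy of the example string from the top of the tutorial.

```python
import re

a_string_demo = 'Example STRING, with numbers (12, 15 and also 10.2)?!'

# \d+ then an optional (non-capturing) group of a literal period and more digits.
numbers = re.findall(r'\d+(?:\.\d+)?', a_string_demo)
values = [float(n) for n in numbers]

print(numbers)   # ['12', '15', '10.2']
print(values)    # [12.0, 15.0, 10.2]
```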

Capture groups

When we use parentheses, it tells the regex engine to capture the part of the match enclosed in parentheses. Each set of parentheses defines its own "capture group" and these are retrieved with the group() method of the match object. Whether there are parentheses or not, the entire match is held in group(0). Smaller parts are in group(1), group(2), etc.

In [95]:
matches = [re.search(r'^(.+?)(berry|fruit)$',fr) for fr in fruit]
for match in matches:
    if match:
        print(match.group(0), match.group(1), match.group(2))
bilberry bil berry
blackberry black berry
blueberry blue berry
boysenberry boysen berry
breadfruit bread fruit
cloudberry cloud berry
cranberry cran berry
dragonfruit dragon fruit
elderberry elder berry
goji berry goji  berry
gooseberry goose berry
grapefruit grape fruit
huckleberry huckle berry
jackfruit jack fruit
kiwi fruit kiwi  fruit
mulberry mul berry
passionfruit passion fruit
raspberry rasp berry
salal berry salal  berry
star fruit star  fruit
strawberry straw berry
ugli fruit ugli  fruit
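
Notice that for two-word names like "goji berry" the first group keeps the trailing space. Here is a hedged variation that absorbs optional whitespace outside the groups and gives the groups names with (?P<name>...). The short list some_fruit is a hypothetical subset made up for this sketch, not the fruit list used above.

```python
import re

# A few fruit names in the style of the list used above (hypothetical subset).
some_fruit = ['blueberry', 'goji berry', 'breadfruit', 'banana']

# (?P<name>...) names a capture group; \s* keeps the space out of the stem.
pattern = re.compile(r'^(?P<stem>.+?)\s*(?P<kind>berry|fruit)$')

parsed = []
for fr in some_fruit:
    m = pattern.search(fr)
    if m:  # 'banana' does not match, so search returns None
        parsed.append((m.group('stem'), m.group('kind')))
print(parsed)
```

Named groups can still be reached by number, so m.group(1) and m.group('stem') return the same thing.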

An example

In [96]:
text = """SEC. 101. FISCAL YEAR 2017.
(a) In General.--There are authorized to be appropriated to NASA
for fiscal year 2017 $19,508,000,000, as follows:
(1) For Exploration, $4,330,000,000.
(2) For Space Operations, $5,023,000,000.
(3) For Science, $5,500,000,000.
(4) For Aeronautics, $640,000,000.
(5) For Space Technology, $686,000,000.
(6) For Education, $115,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(8) For Construction and Environmental Compliance and
Restoration, $388,000,000.
(9) For Inspector General, $37,400,000.
(b) Exception.--In addition to the amounts authorized to be
appropriated for each account under subsection (a), there are
authorized to be appropriated additional funds for each such account,
but only if the authorized amounts for all such accounts are fully
provided for in annual appropriation Acts, consistent with the
discretionary spending limits in section 251(c) of the Balanced Budget
and Emergency Deficit Control Act of 1985."""
In [97]:
text
Out[97]:
'SEC. 101. FISCAL YEAR 2017.\n(a) In General.--There are authorized to be appropriated to NASA\nfor fiscal year 2017 $19,508,000,000, as follows:\n(1) For Exploration, $4,330,000,000.\n(2) For Space Operations, $5,023,000,000.\n(3) For Science, $5,500,000,000.\n(4) For Aeronautics, $640,000,000.\n(5) For Space Technology, $686,000,000.\n(6) For Education, $115,000,000.\n(7) For Safety, Security, and Mission Services,\n$2,788,600,000.\n(8) For Construction and Environmental Compliance and\nRestoration, $388,000,000.\n(9) For Inspector General, $37,400,000.\n(b) Exception.--In addition to the amounts authorized to be\nappropriated for each account under subsection (a), there are\nauthorized to be appropriated additional funds for each such account,\nbut only if the authorized amounts for all such accounts are fully\nprovided for in annual appropriation Acts, consistent with the\ndiscretionary spending limits in section 251(c) of the Balanced Budget\nand Emergency Deficit Control Act of 1985.'

We're going to try to use regular expressions to make data out of the appropriations dollars and purposes in bullets 1-9.

Let's play around with a few things. Extract all contiguous sequences of one or more digits.

In [98]:
digitmatches = re.findall(r'[0-9]+',text) # one or more consecutive digits
print(digitmatches)
['101', '2017', '2017', '19', '508', '000', '000', '1', '4', '330', '000', '000', '2', '5', '023', '000', '000', '3', '5', '500', '000', '000', '4', '640', '000', '000', '5', '686', '000', '000', '6', '115', '000', '000', '7', '2', '788', '600', '000', '8', '388', '000', '000', '9', '37', '400', '000', '251', '1985']

That does two things we don't like ... it splits numbers at the thousands-separator commas, and it picks up numbers ("101", "2017", etc.) that aren't dollar amounts. So, let's try getting everything that:

  • Starts with a "$" (which needs to be escaped)
  • Is followed by one or more commas or digits.
In [99]:
dollarmatches = re.findall(r'\$[,0-9]+',text) # $ followed by one or more digits or commas
print(dollarmatches)
['$19,508,000,000,', '$4,330,000,000', '$5,023,000,000', '$5,500,000,000', '$640,000,000', '$686,000,000', '$115,000,000', '$2,788,600,000', '$388,000,000', '$37,400,000']

Almost ... we don't like that extra comma on the first number. Let's require the match to end with a digit.

In [100]:
dollarmatches2 = re.findall(r'\$[,0-9]+[0-9]',text) # $ followed by one or more digits or commas AND ENDS IN A NUMBER
print(dollarmatches2)
['$19,508,000,000', '$4,330,000,000', '$5,023,000,000', '$5,500,000,000', '$640,000,000', '$686,000,000', '$115,000,000', '$2,788,600,000', '$388,000,000', '$37,400,000']
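
The pattern above accepts any mix of digits and commas after the "$". If we are willing to assume that amounts are always written with comma-separated groups of three digits (an assumption; an unformatted amount like "$5000" would be cut short), a stricter sketch looks like this. The snippet string is a shortened piece of the text above, used only to keep the example self-contained.

```python
import re

snippet = ('for fiscal year 2017 $19,508,000,000, as follows: '
           '(1) For Exploration, $4,330,000,000.')

# \$            literal dollar sign
# \d{1,3}       one to three leading digits
# (?:,\d{3})*   zero or more groups of "comma plus exactly three digits"
strict_matches = re.findall(r'\$\d{1,3}(?:,\d{3})*', snippet)
print(strict_matches)
```

Since a comma is only consumed when three digits follow it, the trailing comma after the first amount is left out without needing a separate "ends in a digit" rule.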

The things we want are demarcated by numbered items in parentheses. Let's see if we can extract those:

In [101]:
bulletmatches = re.findall(r'\([0-9]\)',text) # ( followed by a digit followed by )
print(bulletmatches)
['(1)', '(2)', '(3)', '(4)', '(5)', '(6)', '(7)', '(8)', '(9)']

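Let's go back to the original and get rid of the newlines. Note that the str.replace() method only replaces literal substrings; it doesn't accept regular expressions, so for pattern-based replacement you need re.sub(). (A literal newline is simple enough that either would work here.)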

In [102]:
one_line = re.sub('\n',' ',text)
one_line
Out[102]:
'SEC. 101. FISCAL YEAR 2017. (a) In General.--There are authorized to be appropriated to NASA for fiscal year 2017 $19,508,000,000, as follows: (1) For Exploration, $4,330,000,000. (2) For Space Operations, $5,023,000,000. (3) For Science, $5,500,000,000. (4) For Aeronautics, $640,000,000. (5) For Space Technology, $686,000,000. (6) For Education, $115,000,000. (7) For Safety, Security, and Mission Services, $2,788,600,000. (8) For Construction and Environmental Compliance and Restoration, $388,000,000. (9) For Inspector General, $37,400,000. (b) Exception.--In addition to the amounts authorized to be appropriated for each account under subsection (a), there are authorized to be appropriated additional funds for each such account, but only if the authorized amounts for all such accounts are fully provided for in annual appropriation Acts, consistent with the discretionary spending limits in section 251(c) of the Balanced Budget and Emergency Deficit Control Act of 1985.'
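
To make the difference between the two approaches concrete, here is a small sketch with a made-up string that mixes several kinds of whitespace. The literal replacement only touches "\n", while the regular expression \s+ collapses any run of whitespace.

```python
import re

messy = "line one\nline two\r\n\tline three"

literal = messy.replace("\n", " ")     # literal substring only
collapsed = re.sub(r'\s+', ' ', messy) # any run of whitespace becomes one space

print(repr(literal))
print(repr(collapsed))
```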

Now find all the matches from "(digit)" to the next period, lazily rather than greedily:

In [103]:
item_strings = re.findall(r'\(\d\).+?\.', one_line) # raw string, so the backslashes reach the regex engine unchanged
print(item_strings)
['(1) For Exploration, $4,330,000,000.', '(2) For Space Operations, $5,023,000,000.', '(3) For Science, $5,500,000,000.', '(4) For Aeronautics, $640,000,000.', '(5) For Space Technology, $686,000,000.', '(6) For Education, $115,000,000.', '(7) For Safety, Security, and Mission Services, $2,788,600,000.', '(8) For Construction and Environmental Compliance and Restoration, $388,000,000.', '(9) For Inspector General, $37,400,000.']

We can use a capture group to gather just the "for what" data ...

In [104]:
for_matches = [re.search(r'For (.+), \$', item_string) for item_string in item_strings]
for_strings = [for_match.group(1) for for_match in for_matches if for_match]
for_strings
Out[104]:
['Exploration',
 'Space Operations',
 'Science',
 'Aeronautics',
 'Space Technology',
 'Education',
 'Safety, Security, and Mission Services',
 'Construction and Environmental Compliance and Restoration',
 'Inspector General']

We can also use a capture group just for the money data:

In [105]:
money_matches = [re.search(r'\$([,\d]+)', item_string) for item_string in item_strings]
money_strings = [money_match.group(1) for money_match in money_matches if money_match]
money_strings
Out[105]:
['4,330,000,000',
 '5,023,000,000',
 '5,500,000,000',
 '640,000,000',
 '686,000,000',
 '115,000,000',
 '2,788,600,000',
 '388,000,000',
 '37,400,000']

We'll probably want those just to be numbers, so we'll strip the commas (and any $ sign, although the capture group above already left it out):

In [106]:
money_strings_clean = [re.sub(r'[$,]','',moneystring) for moneystring in money_strings] # inside [...] the $ is literal and needs no escape
money_strings_clean
Out[106]:
['4330000000',
 '5023000000',
 '5500000000',
 '640000000',
 '686000000',
 '115000000',
 '2788600000',
 '388000000',
 '37400000']

Finally, we can format the data. We'll just print to screen here, but we could write this out to a file or put it in a pandas DataFrame for later processing.

In [107]:
datalines = ['\t'.join([moneystring,forstring]) for moneystring, forstring in zip(money_strings_clean, for_strings)]
for dataline in datalines:
    print(dataline)
4330000000	Exploration
5023000000	Space Operations
5500000000	Science
640000000	Aeronautics
686000000	Space Technology
115000000	Education
2788600000	Safety, Security, and Mission Services
388000000	Construction and Environmental Compliance and Restoration
37400000	Inspector General
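
The steps above can also be folded into a single pattern with two capture groups. The following is a sketch rather than the one right way to do it: the excerpt is a shortened copy of the bill text, and the pattern assumes every item reads "(digit) For purpose, $amount". When a pattern has more than one group, re.findall() returns a list of tuples, one element per group.

```python
import re

# Shortened excerpt of the bill text above, kept only to make the sketch self-contained.
excerpt = """(1) For Exploration, $4,330,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(9) For Inspector General, $37,400,000."""

one_line = " ".join(excerpt.split())  # collapse newlines and repeated spaces

# group 1: purpose (lazy, up to the comma right before the dollar sign)
# group 2: amount (digits and commas, ending in a digit)
pattern = re.compile(r'\(\d\) For (.+?), \$([\d,]*\d)')

# Convert each amount to an int by dropping the commas.
rows = [(int(amount.replace(",", "")), purpose)
        for purpose, amount in pattern.findall(one_line)]

for amount, purpose in rows:
    print(amount, purpose, sep="\t")
```

Storing the amounts as integers rather than strings means they can be summed or compared directly, for example to check the items against the stated total.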