This tutorial parallels a similar tutorial in R. In most cases, the task achieved is identical or very similar in both.
Many string manipulation methods below are provided in base Python. When we get to regular expressions, you'll need the "re" library.
The basic things we want to manipulate are strings. These can be specified using double quotes (") or single quotes ('):
a_string = 'Example STRING, with numbers (12, 15 and also 10.2)?!'
a_string
It’s really a matter of style or convenience, but you might use one if your string actually contains the other:
my_double_quoted_string = "He asked, 'Why would you use double quotes?'"
my_double_quoted_string
You can still use either one if you like, using \ (backslash) to tell Python to “escape” the next character. In the example below, the \" is saying, " is part of the string, not the end of the string.
my_string_with_double_quotes = "She answered, \"Convenience, but you never really have to.\""
my_string_with_double_quotes
If you ever want to see how your string with escape characters displays when printed or (typically) in an editor, use print.
print(my_double_quoted_string)
print(my_string_with_double_quotes)
This can get a little bit confusing. For example, since the backslash character tells Python to escape, to indicate an actual backslash character you have to backslash your backslashes:
a_string_with_backslashes = "To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\."
a_string_with_backslashes
print(a_string_with_backslashes)
There are a number of special escape characters that are used to represent things like “control characters.” The most common are two that you’re already used to tapping a keyboard key for without expecting a character to appear on your screen: \t (tab) and \n (newline).
test_string = "abc ABC 123\t.!?\\(){}\n \nthird line"
test_string
print(test_string)
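A quick way to check what an escape sequence actually stores is to look at the length of the string. The sketch below does that, and also shows the r prefix (a "raw" string), which the regular expression examples later in this tutorial use. The variable names are made up for illustration.

```python
# Each escape sequence is stored as a single character.
tab_string = "a\tb"          # a, TAB, b
backslash_string = "a\\b"    # a, one backslash, b
print(len(tab_string))
print(len(backslash_string))

# In a raw string (r prefix), a backslash is just a backslash.
raw_string = r"a\tb"         # a, backslash, t, b
print(len(raw_string))
print(raw_string == "a\\tb")
```

Raw strings are convenient for regular expressions because the pattern reaches the regex engine with its backslashes intact.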
If you want to define a multiline string without using escaped newline characters, use triple quotation marks:
test_string2 = """abc ABC 123\t.!?\\(){}
third line"""
test_string2
As with pretty much everything in Python, you can have a list of strings.
a_list_of_strings = ["abcde", "123", "chicken of the sea"]
a_list_of_strings
In the R tutorial, we made use of a few collections of strings provided in base R or stringr. To make this comparable, we'll create or load these here.
The letters of the alphabet are available in the string module.
import string
letters_string = string.ascii_lowercase
letters_string
letters_list = list(letters_string)
letters_list
LETTERS_string = string.ascii_uppercase
LETTERS_string
LETTERS_list = list(LETTERS_string)
LETTERS_list
We'll just make the month lists.
month_abb = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_abb
month_name = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
month_name
# REQUIRES file "fruit.txt" be in the same directory
with open("fruit.txt", "r") as fruitfile:  # the with block closes the file for us
    fruit = fruitfile.read().splitlines()
print(fruit)
# REQUIRES file "words.txt" be in the same directory
with open("words.txt", "r") as wordsfile:  # the with block closes the file for us
    words = wordsfile.read().splitlines()
len(words)
The word list is long, so let's just look at the top 5.
(A couple of things for R users here. First, Python starts counting at 0, not 1. So, in R, words[0] returns an empty vector, words[1] is "a", and words[5] is "accept"; in Python, words[0] is "a", words[1] is "able", and words[5] is "account". Second, the "slicing" notation in Python is also weird if you're used to R. In R, to get the first five members, you ask for item 1 through item 5: words[1:5]. In Python, you might imagine the list members sitting on a number line that puts the first list member between 0 and 1, the second between 1 and 2, and so on. Then to get the first 5 members of the list, we need to ask for the slice between "0", just to the "left" of the stuff we want, and "5", at the "right" of the stuff we want: words[0:5].)
words[0:5]
words[5]
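To make the number-line picture concrete, here is a small sketch that uses a made-up list (not the real word list), so it runs without words.txt. The names demo, first_five, and so on are just for illustration.

```python
# Hypothetical stand-in for the word list; the labels show each position.
demo = ["w0", "w1", "w2", "w3", "w4", "w5", "w6"]

first_five = demo[0:5]        # positions 0, 1, 2, 3, 4 (position 5 is excluded)
also_first_five = demo[:5]    # leaving out the start means "from the beginning"
sixth = demo[5]               # position 5 is the SIXTH item

print(first_five)
print(sixth)
print(len(demo[2:5]))         # a slice holds stop - start items
```

One handy consequence of this convention is that the length of a slice is always the end position minus the start position.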
The sentences list is also long.
# REQUIRES file "sentences.txt" be in the same directory
with open("sentences.txt", "r") as sentencesfile:  # the with block closes the file for us
    sentences = sentencesfile.read().splitlines()
len(sentences)
sentences[0:5]
You can combine, or “concatenate”, strings very naturally using the "+" sign.
second_string = "Wow, two sentences."
combined_string = a_string + " " + second_string
combined_string
You can also combine lists of strings by a separator using the "join" method. To again join the two strings above separated by a space, place the strings to be joined in a list by using square brackets, and the separator in a string, and use the syntax sep.join(list):
" ".join([a_string,second_string])
Note that "join" takes a list of strings of any length and concatenates all the strings together with the separator.
" then ".join(month_name)
In the R notebook, we next created a new vector of strings concatenating the month vectors into strings like "Jan stands for January". Python doesn't have the element-by-element "vectorized" syntax R does, so we have to iterate over the elements more explicitly to do this here. There are several ways to do this.
The most straightforward to understand, but not very "Pythonic", way is to do this in a for loop:
month_explanations = []
for i in range(12):
    new_string = " stands for ".join([month_abb[i], month_name[i]])
    month_explanations.append(new_string)
month_explanations
There are more compact, Pythonic ways to do this. One is to use the "zip" function which creates an "iterator" of "tuples" element by element and then iterate over that. Let's look inside the zip function first by making a list of the zipped elements:
list(zip(month_abb, month_name))
Not quite what we want, but we can join those tuples as we iterate over the zip object using a "list comprehension":
[" stands for ".join([abbrev,name]) for abbrev,name in zip(month_abb, month_name)]
The "list comprehension" is defined by those square brackets on the outside (making it a list) and the "for loop"-like instruction inside.
There are many ways to do the same thing. For example, we can change the string manipulation operation that gets repeated from the join method to the format method:
["{} stands for {}".format(abbrev,name) for abbrev,name in zip(month_abb, month_name)]
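Yet another way to write the same thing, assuming Python 3.6 or later, is an "f-string", which puts the variable names directly inside the braces. The sketch below uses shortened copies of the month lists (the _demo names are made up) so that it runs on its own.

```python
# Shortened, made-up copies of the month lists so this runs on its own.
month_abb_demo = ["Jan", "Feb", "Mar"]
month_name_demo = ["January", "February", "March"]

# The f prefix lets us put expressions inside the braces.
explanations = [f"{abbrev} stands for {name}"
                for abbrev, name in zip(month_abb_demo, month_name_demo)]
print(explanations)
```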
The join/zip idiom works for the letters example in the other notebook as well:
letterpairs = ["".join([lower,upper]) for lower, upper in zip(letters_list, LETTERS_list)]
print(letterpairs)
You can zip two lists together, concatenate those element by element, and then join them by a separator.
" then ".join(["{} ({})".format(name,abbrev) for name,abbrev in zip(month_name,month_abb)])
You can split up a string into pieces, based on a literal separator string, with the "split" method.
combined_string.split("!")
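The "split" method has a couple of variations worth knowing. The sketch below, on a made-up string, shows splitting on a literal separator, splitting on whitespace by giving no argument, and limiting the number of splits with the second (maxsplit) argument.

```python
demo_sentence = "one, two,  three"   # note the double space before "three"

by_comma = demo_sentence.split(", ")        # literal separator, not a regex
by_whitespace = demo_sentence.split()       # no argument: split on runs of whitespace
first_only = demo_sentence.split(", ", 1)   # split at most once

print(by_comma)
print(by_whitespace)
print(first_only)
```

Because the separator is matched literally, the extra space stays attached to the last piece in the first result.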
Substrings are just slices in Python. To get a list of the second through fourth characters in each fruit name:
substringfromfruit = [eachfruit[1:4] for eachfruit in fruit]
print(substringfromfruit)
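Substrings from the end of the string can be accessed by slices using negative numbers. As with any slice, the end position is excluded, so [-3:-1] gives the third-from-last and second-from-last characters.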
subfromend = [eachfruit[-3:-1] for eachfruit in fruit]
print(subfromend)
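Negative slices are easy to get off by one, so here is a small sketch on a single made-up string showing what each form returns.

```python
demo_fruit = "banana"

two_before_last = demo_fruit[-3:-1]   # third-from-last and second-from-last
last_three = demo_fruit[-3:]          # leave out the end to run to the end of the string
last_char = demo_fruit[-1]            # a single negative index, not a slice

print(two_before_last)
print(last_three)
print(last_char)
```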
You can use slicing to extract data from strings:
some_dates = ["1999/01/01","1998/12/15","2001/09/03"]
years = [date[0:4] for date in some_dates]
print(years)
months = [date[5:7] for date in some_dates]
print(months)
Getting a copy of a string with specific positions replaced is also a matter of slicing:
apple = "apple"
zebra = "--!ZEBRA!--"
zebraapple = apple[0:1] + zebra + apple[3:]
zebraapple
Replicating the R result over the whole list can be done by putting the same expression inside a list comprehension.
zebrafruit = [fr[0:1] + zebra + fr[3:] for fr in fruit]
print(zebrafruit)
Strings have simple methods for converting case:
combined_string.lower()
combined_string.upper()
There are also several methods to trim excess white space off the ends of strings:
lotsofspace = ' Why so much space? '
lotsofspace.strip()
lotsofspace.lstrip()
lotsofspace.rstrip()
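The strip methods can also remove characters other than white space if you pass them a string of characters to remove. A minimal sketch, using a made-up string:

```python
padded = "--!ZEBRA!--"

no_dashes = padded.strip("-")        # remove dashes from both ends
no_dashes_or_bangs = padded.strip("-!")   # the argument is a SET of characters
left_only = padded.lstrip("-!")      # only trim the left end

print(no_dashes)
print(no_dashes_or_bangs)
print(left_only)
```

Note that the argument is treated as a set of individual characters to trim, not as a prefix or suffix to match as a whole.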
If we're looking for specific substrings, there are string methods to do that.
"strawberry".find("berry")
That returns the position of the first match. If there is no match, find returns a value of -1.
"apple".find("berry")
If there are multiple matches, find returns the position of the first match.
"berryberryboberrybananafanafoferrymemymomerry berry".find("berry")
We can use this in a list comprehension, with the addition of an "if" condition, to extract a list of all matching fruits.
[fr for fr in fruit if fr.find("berry")> -1]
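The comparison with -1 matters. The sketch below, on a made-up list, shows what goes wrong if you use the result of find directly as a condition, and shows the "in" operator as a simpler alternative when you only need a yes or no answer.

```python
demo_fruits = ["berry pie", "strawberry", "apple"]   # made-up list

# Pitfall: a match at the very start returns 0 (treated as False),
# and no match returns -1 (treated as True).
wrong = [fr for fr in demo_fruits if fr.find("berry")]

right = [fr for fr in demo_fruits if fr.find("berry") > -1]
with_in = [fr for fr in demo_fruits if "berry" in fr]   # same result, easier to read

print(wrong)
print(right)
print(with_in)
```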
We can get a copy of the string with the substring replaced with something else:
"strawberry".replace("berry","fish")
fishfruit = [fr.replace("berry","fish") for fr in fruit]
print(fishfruit)
So far, I’ve only searched for patterns that are only alphabetic characters like "berry". But we can make much more elaborate and flexible patterns using regular expressions. For this we need to import the "re" module.
I recommend you reference the cheat sheet and the online regex tool https://regex101.com in parallel.
Just for comparison's sake, let's start with a search for the same pattern as above: "berry".
import re
mo = re.search(r'berry', 'strawberry')
mo
The start and end positions of the match are returned by the "span" method of the match object:
mo.span()
The matched text itself is returned by the "group" method, which I'll explain below.
mo.group()
If there is no match, the search returns the null value ("None") instead of a match object. You can, more or less, use the result in conditional statements, with "None" counting as "False" and any match object counting as "True".
mo_miss = re.search(r'berry','apple')
mo_miss
print(mo_miss)
if mo:
    print("Strawberry is a berry!")
else:
    print("Strawberry is not a berry.")
if mo_miss:
    print("Apple is a berry")
else:
    print("Apple is not a berry.")
This, again, can be put in a list comprehension to get a list of all berries:
berries = [itsaberry for itsaberry in fruit if re.search(r'berry',itsaberry)]
print(berries)
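As a sidebar, this hands the pattern string to the re module every time through the loop. The module caches recently compiled patterns, so the cost is small, but it's cleaner and a bit more efficient to compile the pattern once before the loop using a slightly different syntax: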
reo = re.compile(r'berry') # compile the pattern into a regular expression object
berries = [itsaberry for itsaberry in fruit if reo.search(itsaberry)]
print(berries)
The "search" method will return a single object describing only the first match in the string.
mo_many = re.search(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
mo_many
The findall method returns a list of all matching strings.
mo_many2 = re.findall(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
mo_many2
The "finditer" method returns an "iterator" (thing, like a list, over which you can, um, iterate) containing match objects for every match.
mo_iter = re.finditer(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
for moi in mo_iter:
    print(moi)
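Since each item is a full match object, you can pull out just the pieces you need. Here is a minimal sketch, on a shortened made-up string, that collects the positions of every match.

```python
import re

demo_text = "berryberryboberry"   # shortened, made-up example

# start() and span() are methods of each match object.
starts = [m.start() for m in re.finditer(r'berry', demo_text)]
spans = [m.span() for m in re.finditer(r'berry', demo_text)]

print(starts)
print(spans)
```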
Now let's use regex to look for more complex patterns than just substrings.
Match “any one of” the characters in the square brackets.
reodemo = re.compile(r' [bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
Match “anything but one of” the characters in the square brackets.
(Be careful ... the caret ... ^ ... means something else in other contexts.)
reodemo = re.compile(r' [^bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
Match any one character in a range, here any single letter from b through p.
reodemo = re.compile(r' [b-p]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
When we need an “or” over multi-character patterns, we can use the “pipe” operator, using parentheses as necessary to identify what’s with what.
reodemo = re.compile(r'(black|blue|red)(currant|berry)')
matches = [itsamatch for itsamatch in fruit if reodemo.search(itsamatch)]
matches
In addition to the backslash itself, there are several characters that have special meaning in Python regexes, and (may) have to be escaped in order to match the literal character. I think the full list is this: ^ $ . * + | ! ? ( ) [ ] { } < >.
For example, the period – “.” – means “any character but a newline.” It’s a wildcard. We get different results when we escape or don’t escape it.
allchars = re.findall(r'.',combined_string)
print(allchars)
allperiods = re.findall(r'\.',combined_string)
print(allperiods)
matches = re.findall(r'a.',combined_string)
print(matches)
matches = re.findall(r'a\.',combined_string)
print(matches)
Some of these are only special characters in certain contexts and don’t have to be escaped to be recognized when not in those contexts. But they can be escaped in all circumstances and I recommend that rather than trying to figure out the exact rules.
The exclamation point is such a character.
matches = re.findall(r'\!',combined_string)
print(matches)
matches = re.findall(r'!',combined_string) # Not special char in this context, so still finds it
print(matches)
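If the text you want to match literally comes from somewhere else (say, a variable), you don't have to escape it by hand. The re module provides re.escape for this. A minimal sketch, with made-up strings:

```python
import re

literal = "10.2"
haystack = "10.2 and 1082"      # made-up text to search

escaped = re.escape(literal)    # puts a backslash in front of the period

unescaped_hits = re.findall(literal, haystack)   # "." acts as a wildcard here
escaped_hits = re.findall(escaped, haystack)     # "." must be a real period

print(escaped)
print(unescaped_hits)
print(escaped_hits)
```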
Conversely, there are a number of characters that have special meaning only when escaped. The main ones for now are “\w” (any alphanumeric character or underscore), “\s” (any whitespace character), and “\d” (any numeric digit). The capitalized versions of these are used to mean “anything but” that class.
matches = re.findall(r'\w',combined_string) # any alphanumeric character
print(matches)
matches = re.findall(r'\W',combined_string) # any non-alphanumeric character
print(matches)
matches = re.findall(r'\s',combined_string) # any whitespace character
print(matches)
matches = re.findall(r'\S',combined_string) # any non-whitespace character
print(matches)
matches = re.findall(r'\d',combined_string) # any digit character
print(matches)
matches = re.findall(r'\D',combined_string) # any non-digit character
print(matches)
The Python re module does not directly support "POSIX" classes.
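You can usually build an equivalent character class yourself. As one possible workaround (a sketch, not the only way), the code below approximates the POSIX punctuation class for ASCII text by combining string.punctuation from the standard library with re.escape. The names punct_class and punct_re are made up.

```python
import re
import string

# Build a character class out of the ASCII punctuation characters.
# re.escape protects the ones that are special inside square brackets.
punct_class = "[" + re.escape(string.punctuation) + "]"
punct_re = re.compile(punct_class)

found = punct_re.findall("Wow, two sentences.")
print(found)
```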
The * quantifier means “zero or more” of the previous item. It is also known as the “Kleene star” (pronounced clay-nee), after the logician Stephen Kleene, who introduced the notation.
matches = re.findall(r'\d*',combined_string) # any string of zero or more digits
print(matches)
Note that "zero or more" led it to identify every position of the string as a match, many of them empty (containing no characters).
The + quantifier means “one or more” of the previous item. It is also known as the “Kleene plus.”
matches = re.findall(r'\d+',combined_string) # any string of one or more digits
print(matches)
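To see the two quantifiers side by side, here is a minimal sketch on a short made-up string. The exact number of empty matches that * produces has varied between Python versions, so the sketch only relies on the non-empty ones.

```python
import re

demo = "a1b22"   # made-up string

star = re.findall(r'\d*', demo)   # zero or more digits: includes empty matches
plus = re.findall(r'\d+', demo)   # one or more digits: no empty matches

# Dropping the empty strings from the * result leaves the + result.
star_nonempty = [s for s in star if s != ""]

print(star)
print(plus)
print(star_nonempty == plus)
```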
{n} = “exactly n” of the previous
{n,m} = “between n and m” of the previous
{n,} = “n or more” of the previous
matches = re.findall(r'x{3}','x xx xxx xxxx xxxxx') # 3 x's
print(matches)
matches = re.findall(r'x{3,4}','x xx xxx xxxx xxxxx') # 3 or 4 x's
print(matches)
matches = re.findall(r'x{3,}','x xx xxx xxxx xxxxx') # 3 or more x's
print(matches)
Were any of those unexpected? (Probably ... how many strings of 3 x's are in that string?) Use your regex viewer to see what's going on.
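One way to investigate without leaving Python is to print where each match starts and ends. The sketch below does that for the first pattern, and then shows one possible fix using the word boundary marker \b, which is explained further down.

```python
import re

xs = 'x xx xxx xxxx xxxxx'

# Where does each match of exactly three x's sit in the string?
spans = [m.span() for m in re.finditer(r'x{3}', xs)]
print(spans)

# Requiring a word boundary on both sides keeps only the stand-alone "xxx".
exact = re.findall(r'\bx{3}\b', xs)
print(exact)
```

The first pattern is happy to match three x's at the front of a longer run, which is why it finds more than one match.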
? = “zero or one” of the previous
matches = re.findall(r'\d?', combined_string) # any string of zero or one digits
print(matches)
reodemo = re.compile(r' [bp]?eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches
# greedy - roughly, longest match
matches = re.findall(r'\(.+\)','(First bracketed statement) Other text (Second bracketed statement)')
print(matches)
# nongreedy - roughly, smallest matches
matches = re.findall(r'\(.+?\)','(First bracketed statement) Other text (Second bracketed statement)')
print(matches)
# greedy - matches whole string
matches = re.findall(r'x.+x','x xx xxx xxxx xxxxx')
print(matches)
# nongreedy - minimal match as placeholder moves across string
matches = re.findall(r'x.+?x','x xx xxx xxxx xxxxx')
print(matches)
matches = re.findall(r'^\w+',combined_string) # ^ is beginning of string
print(matches)
matches = re.findall(r'\w+$',combined_string) # $ is end of string
print(matches)
matches = re.findall(r'\W+$',combined_string) # $ is end of string
print(matches)
Similarly, we can identify "word boundaries" with \b. This solves the greedy/nongreedy problem we had with the "x" sequences above. It still thinks the decimal point in 10.2 is a word boundary, though.
matches = re.findall(r'\bx.*?\b','x xx xxx xxxx xxxxx')
print(matches)
matches = re.findall(r'\b\w+?\b',combined_string) # still a little dumb
print(matches)
When we use parentheses, it tells the regex engine to capture the part of the match enclosed in parentheses. Each set of parentheses defines its own "capture group" and these are accessed through the group() method of the match object. Whether there are parentheses or not, the entire match is held in group(0). Smaller parts are in group(1), group(2), etc.
matches = [re.search(r'^(.+?)(berry|fruit)$',fr) for fr in fruit]
for match in matches:
    if match:
        print(match.group(0), match.group(1), match.group(2))
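Two related behaviors are worth a quick sketch, using a made-up list of fruit names. First, when a pattern contains capture groups, findall returns the groups (as tuples) rather than the whole match. Second, groups can be given names with the (?P<name>...) syntax; the names stem and kind below are made up for illustration.

```python
import re

demo_fruits = ["blueberry", "breadfruit", "apple"]   # made-up list

# With two capture groups, findall returns a list of 2-tuples.
pattern = re.compile(r'^(.+?)(berry|fruit)$')
pairs = [pattern.findall(fr) for fr in demo_fruits]
print(pairs)

# Named groups can be read back by name instead of by number.
named = re.compile(r'^(?P<stem>.+?)(?P<kind>berry|fruit)$')
m = named.search("blueberry")
print(m.group("stem"), m.group("kind"))
print(m.groupdict())
```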
text = """SEC. 101. FISCAL YEAR 2017.
(a) In General.--There are authorized to be appropriated to NASA
for fiscal year 2017 $19,508,000,000, as follows:
(1) For Exploration, $4,330,000,000.
(2) For Space Operations, $5,023,000,000.
(3) For Science, $5,500,000,000.
(4) For Aeronautics, $640,000,000.
(5) For Space Technology, $686,000,000.
(6) For Education, $115,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(8) For Construction and Environmental Compliance and
Restoration, $388,000,000.
(9) For Inspector General, $37,400,000.
(b) Exception.--In addition to the amounts authorized to be
appropriated for each account under subsection (a), there are
authorized to be appropriated additional funds for each such account,
but only if the authorized amounts for all such accounts are fully
provided for in annual appropriation Acts, consistent with the
discretionary spending limits in section 251(c) of the Balanced Budget
and Emergency Deficit Control Act of 1985."""
text
We're going to try to use regular expressions to make data out of the appropriations dollars and purposes in bullets 1-9.
Let's play around with a few things. Extract all contiguous sequences of one or more digits.
digitmatches = re.findall(r'[0-9]+',text) # one or more consecutive digits
print(digitmatches)
That does two things we don't like ... separates numbers at the thousands-separator comma and gets numbers ("101", "2017", etc.) that aren't dollar amounts. So, let's try getting everything that starts with a dollar sign and continues with digits or commas:
dollarmatches = re.findall(r'\$[,0-9]+',text) # $ followed by one or more digits or commas
print(dollarmatches)
Almost ... don't like that extra comma on the first number. Let's require it to end with a number.
dollarmatches2 = re.findall(r'\$[,0-9]+[0-9]',text) # $ followed by one or more digits or commas AND ENDS IN A NUMBER
print(dollarmatches2)
The things we want are demarcated by numbered items in parentheses. Let's see if we can extract those:
bulletmatches = re.findall(r'\([0-9]\)',text) # ( followed by a digit followed by )
print(bulletmatches)
Let's go back to the original and get rid of the newlines. Note that the string replace() method doesn't accept regular expressions, so for pattern-based substitution you need to use re.sub().
one_line = re.sub('\n',' ',text)
one_line
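Where a regular expression really earns its keep is when the line breaks come with indentation. The sketch below uses a made-up two-line snippet, shaped like one of the bullets, to compare a plain replace with a re.sub that collapses any run of whitespace into a single space.

```python
import re

# Made-up snippet with a line break followed by indentation.
demo_text = "(1) For Exploration,\n    $4,330,000,000."

replaced = demo_text.replace("\n", " ")     # leaves the extra spaces behind
collapsed = re.sub(r'\s+', ' ', demo_text)  # any run of whitespace -> one space

print(repr(replaced))
print(repr(collapsed))
```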
and find all the matches from "(number)" to a period, lazily rather than greedily:
item_strings = re.findall(r'\(\d\).+?\.', one_line)
print(item_strings)
We can use a capture group to gather just the "for what" data ...
for_matches = [re.search(r'For (.+), \$', item_string) for item_string in item_strings]
for_strings = [for_match.group(1) for for_match in for_matches if for_match]
for_strings
We can also use a capture group just for the money data:
money_matches = [re.search(r'\$([,\d]+)', item_string) for item_string in item_strings]
money_strings = [money_match.group(1) for money_match in money_matches if money_match]
money_strings
We'll probably want those just to be numbers, so we'll strip the $ sign and commas:
money_strings_clean = [re.sub(r'[\$,]','',moneystring) for moneystring in money_strings]
money_strings_clean
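At this point the values are still strings. If you want to do arithmetic with them, one more step converts them to integers. A minimal sketch, using two of the amounts from the text above in a made-up list so that it runs on its own:

```python
import re

money_demo = ["4,330,000,000", "640,000,000"]   # two amounts copied from the text

cleaned = [re.sub(r'[\$,]', '', s) for s in money_demo]   # drop $ and commas
as_ints = [int(s) for s in cleaned]                       # strings -> integers

print(as_ints)
print(sum(as_ints))
```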
Finally, we can format the data. We'll just print to screen here, but we could write this out to a file or put it in a pandas dataframe for later processing.
datalines = ['\t'.join([moneystring,forstring]) for moneystring, forstring in zip(money_strings_clean, for_strings)]
for dataline in datalines:
    print(dataline)
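As a sketch of the "write this out to a file" option, the standard library's csv module can produce the same tab-separated layout. To keep the example self-contained, it writes to an in-memory buffer (io.StringIO) instead of a real file, and uses two made-up rows shaped like the cleaned data; the column names are assumptions.

```python
import csv
import io

# Made-up rows in the same (amount, purpose) shape as the data above.
rows = [("4330000000", "Exploration"), ("5023000000", "Space Operations")]

buffer = io.StringIO()   # in-memory stand-in for an open file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["amount", "purpose"])   # hypothetical header row
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

To write a real file instead, you could pass a file object opened with open(filename, "w", newline="") in place of the buffer.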