Regular Expressions

What is a Regular Expression?

Regular expressions are a mini-language for specifying text patterns.

Example: Find a Phone Number in a Message (without using regular expressions)

Writing pattern-matching code without regular expressions is a huge pain. The following example checks whether a message contains a phone number.

In [1]:
def isPhoneNumber(text):
  if len(text) != 12:
    return False # not phone number-sized
  for i in range(0, 3):
    if not text[i].isdecimal():
      return False # no area code
  if text[3] != '-':
    return False # missing dash
  for i in range(4, 7):
    if not text[i].isdecimal():
      return False # no first 3 digits
  if text[7] != '-':
    return False # missing second dash
  for i in range(8, 12):
    if not text[i].isdecimal():
      return False # missing last 4 digits
  return True

message = 'Call me 415-555-1101 tomorrow or 415-555-9999'
foundNumber = False

for i in range(len(message)):
  chunk = message[i:i+12]
  if isPhoneNumber(chunk):
    print('Phone number found: ' + chunk)
    foundNumber = True

if not foundNumber:
  print('Could not find any phone numbers')
Phone number found: 415-555-1101
Phone number found: 415-555-9999

The Regular Expression (re) Module

Regex strings often contain backslashes (as in \d), so they are usually written as raw strings: r'\d'. \d is the regex for a single numeric digit character.
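As a small illustration of why raw strings are convenient, the sketch below compares a raw string with an ordinary string that escapes the backslash by hand. The variable names are made up for this example.

```python
# A raw string keeps the backslash as an ordinary character,
# which is what the regex engine expects to receive.
plain = '\\d'   # ordinary string: the backslash must be doubled
raw = r'\d'     # raw string: same two characters, easier to read

same_text = (plain == raw)      # both are a backslash followed by 'd'
raw_newline_len = len(r'\n')    # backslash and 'n': two characters
real_newline_len = len('\n')    # a single newline character

print(same_text, raw_newline_len, real_newline_len)
```

This should print True 2 1.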

To use regular expressions, first import the re module. You then usually pass a raw string to the re.compile() function, which returns a regex object.

Call the regex object's search() method to get a match object (mo), then call the match object's group() method to get the matched string.

In [2]:
import re
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search(message)
print(mo.group())
415-555-1101

The findall() Method

Calling the regex object's findall() method returns a list of all strings that match the pattern.

In [3]:
import re
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(phoneNumRegex.findall(message))
['415-555-1101', '415-555-9999']

Regex Groups

Groups are created in regex strings with parentheses ( ). The first set of parentheses is group 1, the second is 2, and so on. Calling group() or group(0) returns the full matching string, group(1) returns group 1's matching string, and so on.

In [4]:
import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is 415-555-4242')

mo.group()
Out[4]:
'415-555-4242'
In [5]:
mo.group(0)
Out[5]:
'415-555-4242'
In [6]:
mo.group(1)
Out[6]:
'415'
In [7]:
mo.group(2)
Out[7]:
'555-4242'

Use \( and \) to match literal parentheses ( ) in a string.

In [8]:
import re
phoneNumRegex = re.compile(r'\(\d\d\d\) \d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My phone number is (415) 555-4242')
mo.group()
Out[8]:
'(415) 555-4242'

Pipe Character

The pipe | matches one of several alternative patterns.

Searching for 'Batman', 'Batmobile', 'Batcopter' or 'Batbat'

In [9]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
In [10]:
mo.group()
Out[10]:
'Batmobile'
In [11]:
mo.group(0)
Out[11]:
'Batmobile'
In [12]:
mo.group(1)
Out[12]:
'mobile'
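The pipe also works without parentheses, in which case it separates whole alternatives. The sketch below (variable names are illustrative) suggests that search() returns whichever alternative occurs first in the text, not the one listed first in the pattern.

```python
import re

# Two complete alternatives separated by a pipe, with no grouping.
eitherRegex = re.compile(r'Batcopter|Batmobile')

text = 'Batmobile and Batcopter'
first = eitherRegex.search(text).group()   # leftmost match in the text
every = eitherRegex.findall(text)          # all matches, in text order

print(first)
print(every)
```

Here first should be 'Batmobile', even though 'Batcopter' is written first in the pattern.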

search() returns None if the pattern is not found in the searched string, so the match object mo is None.

In [13]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmotorcycle lost a wheel')

mo is None
Out[13]:
True
In [14]:
# mo.group() # AttributeError
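Because calling group() on None raises AttributeError, it is usually safer to check the result of search() first. A minimal sketch, assuming a hypothetical helper named first_bat_word:

```python
import re

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')

def first_bat_word(text):
    """Return the matched text, or None when the pattern is absent."""
    mo = batRegex.search(text)
    if mo is None:       # search() found nothing, so there is no group()
        return None
    return mo.group()

found = first_bat_word('Batmobile lost a wheel')
missing = first_bat_word('Batmotorcycle lost a wheel')
print(found, missing)
```

The first call should return 'Batmobile' and the second None, with no exception raised.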

Repetition in Regex

? (zero or one)

The ? says the preceding group matches zero or one times

The wo group appears zero times

In [15]:
import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batman')
mo.group()
Out[15]:
'Batman'

The wo group appears once

In [16]:
import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batwoman')
mo.group()
Out[16]:
'Batwoman'
In [17]:
import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batwowowowoman')
mo is None
Out[17]:
True
In [18]:
# mo.group() # AttributeError

Match a phone number with or without an area code

In [19]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
phoneRegex.search('My phone number is 415-555-1234')
Out[19]:
<_sre.SRE_Match object; span=(19, 31), match='415-555-1234'>
In [20]:
phoneRegex.search('My phone number is 555-1234')
Out[20]:
<_sre.SRE_Match object; span=(19, 27), match='555-1234'>

To match a literal question mark, escape it with a backslash. For example, if you wanted a regex object for the text "dinner?" (with the question mark), you would call:

In [21]:
re.compile(r'dinner\?') # note the backslash in \?
Out[21]:
re.compile(r'dinner\?', re.UNICODE)

In the above case, nothing is optional: the regex looks for a literal question mark, as in "dinner?"
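To make the difference concrete, the sketch below searches with both the escaped and the unescaped form. The variable names are chosen for this example only.

```python
import re

literalRegex = re.compile(r'dinner\?')   # literal question mark required
optionalRegex = re.compile(r'dinner?')   # unescaped: the final 'r' becomes optional

literal_hit = literalRegex.search('What is for dinner?')
literal_miss = literalRegex.search('What is for dinner')
optional_hit = optionalRegex.search('What is for dinner')

print(literal_hit.group())    # includes the question mark
print(literal_miss)           # no question mark in the text, so no match
print(optional_hit.group())   # matches without any question mark
```

The escaped pattern should match only when the text really contains "dinner?".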

* (zero or more)

The * says the group matches zero or more times

In [22]:
import re
batRegex = re.compile(r'Bat(wo)*man')
batRegex.search('The Adventures of Batman')
Out[22]:
<_sre.SRE_Match object; span=(18, 24), match='Batman'>
In [23]:
batRegex.search('The Adventures of Batwoman')
Out[23]:
<_sre.SRE_Match object; span=(18, 26), match='Batwoman'>
In [24]:
batRegex.search('The Adventures of Batwowowowoman')
Out[24]:
<_sre.SRE_Match object; span=(18, 32), match='Batwowowowoman'>

+ (one or more)

The + says the group matches one or more times

In [25]:
import re
batRegex = re.compile(r'Bat(wo)+man')
batRegex.search('The Adventures of Batman') is None
Out[25]:
True
In [26]:
batRegex.search('The Adventures of Batwoman')
Out[26]:
<_sre.SRE_Match object; span=(18, 26), match='Batwoman'>
In [27]:
batRegex.search('The Adventures of Batwowowowoman')
Out[27]:
<_sre.SRE_Match object; span=(18, 32), match='Batwowowowoman'>

Escaping the Symbols ?, * and +

In [28]:
regex = re.compile(r'\+\*\?')
regex.search('I learnt about +*? regex syntax')
Out[28]:
<_sre.SRE_Match object; span=(15, 18), match='+*?'>
In [29]:
regex = re.compile(r'(\+\*\?)+')
regex.search('I learnt about +*? regex syntax')
Out[29]:
<_sre.SRE_Match object; span=(15, 18), match='+*?'>

{x} (exactly x times)

The curly braces {} match the preceding group a specific number of times

In [30]:
haRegex = re.compile(r'(Ha){3}')
haRegex.search('HaHaHa') # found
Out[30]:
<_sre.SRE_Match object; span=(0, 6), match='HaHaHa'>
In [31]:
haRegex = re.compile(r'Ha{3}')
haRegex.search('HaHaHa') # not found
In [32]:
haRegex.search('Haaa') # found
Out[32]:
<_sre.SRE_Match object; span=(0, 4), match='Haaa'>

Match three phone numbers in a row; each may omit the area code and may be followed by a comma or a space

In [33]:
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,| )?){3}')
phoneRegex.search('555-1234,555-4242,212-555-0000') # found
Out[33]:
<_sre.SRE_Match object; span=(0, 30), match='555-1234,555-4242,212-555-0000'>
In [34]:
phoneRegex.search('415-555-1234 555-4242,212-555-0000') # found
Out[34]:
<_sre.SRE_Match object; span=(0, 34), match='415-555-1234 555-4242,212-555-0000'>

{x,y} (at least x, at most y)

The curly braces {} with two numbers match a minimum and maximum number of times. Leaving out the first or second number means there is no minimum or no maximum, e.g. {,5} is the same as {0,5}, and {3,} means 3 or more. Do not put a space after the comma: Python's re module treats {3, 5} as literal text rather than a repetition.

In [35]:
haRegex = re.compile(r'(Ha){3,5}')
haRegex.search('He said "HaHaHa"') # found
Out[35]:
<_sre.SRE_Match object; span=(9, 15), match='HaHaHa'>
In [36]:
haRegex.search('He said "HaHaHaHaHa"') # found
Out[36]:
<_sre.SRE_Match object; span=(9, 19), match='HaHaHaHaHa'>
In [37]:
haRegex.search('He said "HaHaHaHaHaHaHa"') # found
Out[37]:
<_sre.SRE_Match object; span=(9, 19), match='HaHaHaHaHa'>
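The open-ended forms can be tried the same way. The following sketch (names are illustrative) uses {3,} and {,2}, and also shows that a space inside the braces stops them from acting as a repetition in Python's re module.

```python
import re

atLeastThree = re.compile(r'(Ha){3,}')   # 3 or more, no upper limit
atMostTwo = re.compile(r'(Ha){,2}')      # same as {0,2}
withSpace = re.compile(r'(Ha){3, 5}')    # space inside: braces are literal text

long_hit = atLeastThree.search('HaHaHaHaHaHaHa').group()
short_miss = atLeastThree.search('HaHa')
capped_hit = atMostTwo.search('HaHaHa').group()
spaced_miss = withSpace.search('HaHaHa')

print(long_hit, short_miss, capped_hit, spaced_miss)
```

With no upper limit, the greedy match takes all seven repetitions, while {,2} stops after two.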

Greedy/Non-Greedy Match

Greedy matching matches the longest string possible; non-greedy matching matches the shortest string possible. By default, regular expressions do a greedy match. Putting a question mark ? after the curly braces {} makes the match non-greedy.

Greedy match of a string with 3 to 5 digits

In [38]:
digitRegex = re.compile(r'(\d){3,5}')
digitRegex.search('1234567890')
Out[38]:
<_sre.SRE_Match object; span=(0, 5), match='12345'>

Non-greedy match with ?

In [39]:
digitRegex = re.compile(r'(\d){3,5}?')
digitRegex.search('1234567890')
Out[39]:
<_sre.SRE_Match object; span=(0, 3), match='123'>

Regex Character Class & the findall() Method

The findall() Method

The regex method findall() is passed a string and returns all matches in it, not just the first match.

search() returns a match object; findall() returns a list of strings.

In [40]:
import re
phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
Out[40]:
['415-789-3564', '416-782-4654']

If the regular expression has zero or one group, findall() returns a list of strings. With no groups, each string is the full text that matched the pattern; with one group, each string is the text matched by that group.

In [41]:
import re
phoneRegex = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
Out[41]:
['415', '416']

If the regular expression has two or more groups, for example one for the area code and one for the main number, findall() returns a list of tuples of strings.

In [42]:
import re
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
Out[42]:
[('415', '789-3564'), ('416', '782-4654')]

With 3 groups

In [43]:
import re
phoneRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
Out[43]:
[('415-789-3564', '415', '789-3564'), ('416-782-4654', '416', '782-4654')]

Character Classes

\d is a shorthand character class that matches digits. \w matches word characters, and \s matches whitespace characters. The uppercase shorthand character classes \D, \W and \S match characters that are NOT digits, word characters, and whitespace, respectively.
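As a rough illustration of the uppercase classes, the sketch below runs \D+, \W+ and \S+ over one made-up sample string.

```python
import re

text = 'Room 42, floor_3!'   # sample text invented for this example

non_digits = re.findall(r'\D+', text)   # runs of characters that are not digits
non_words = re.findall(r'\W+', text)    # runs that are not letters, digits or _
non_spaces = re.findall(r'\S+', text)   # runs that are not whitespace

print(non_digits)
print(non_words)
print(non_spaces)
```

Note that the underscore counts as a word character, so \W+ does not split 'floor_3'.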

In [44]:
digitRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')
digitRegex = re.compile(r'\d') # Same as above

Shorthand Codes for Common Character Classes

Shorthand   Represents
\d          Any numeric digit from 0 to 9
\D          Any character that is not a numeric digit from 0 to 9
\w          Any letter, numeric digit, or the underscore character (think of this as matching "word" characters)
\W          Any character that is not a letter, numeric digit, or the underscore character
\s          Any space, tab, or newline character (think of this as matching "space" characters)
\S          Any character that is not a space, tab, or newline

12 Days of Christmas Example

The regular expression string consists of one or more digits \d+, then a whitespace character \s, then one or more word characters \w+

In [45]:
import re
lyrics = '12 Drummers Drumming, 11 Pipers Piping, 10 Lords a Leaping, 9 Ladies Dancing, 8 Maids a Milking, 7 Swans a Swimming, 6 Geese a Laying, 5 Golden Rings, 4 Calling Birds, 3 French Hens, 2 Turtle Doves, and a Partridge in a Pear Tree'
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall(lyrics)
Out[45]:
['12 Drummers',
 '11 Pipers',
 '10 Lords',
 '9 Ladies',
 '8 Maids',
 '7 Swans',
 '6 Geese',
 '5 Golden',
 '4 Calling',
 '3 French',
 '2 Turtle']

Making Your Own Character Classes

You can make your own character class with square brackets []. Inside the brackets, most regex symbols (such as . * ? + and parentheses) are treated literally, so you don't need to escape them. See also re.IGNORECASE below.

In [46]:
vowelRegex = re.compile(r'[aeiouAEIOU]') # r'(a|e|i|o|u|A|E|I|O|U)'
vowelRegex.findall('Robocop eats baby food.')
Out[46]:
['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']
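To illustrate the point about escaping, the sketch below puts several regex symbols inside square brackets without backslashes. The pattern name and sample text are made up for this example.

```python
import re

# Inside [], the characters . ? * + are ordinary characters, not operators.
symbolRegex = re.compile(r'[.?*+]')

symbols = symbolRegex.findall('Is 2+2*1 = 4? Yes.')
print(symbols)
```

Each symbol in the text is matched as a plain character.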

Match two vowels in a row

In [47]:
vowelRegex = re.compile(r'[aeiouAEIOU]{2}')
vowelRegex.findall('Robocop eats baby food.')
Out[47]:
['ea', 'oo']
In [48]:
atofRegex = re.compile(r'[a-fA-F]') # a to f, either case

Negative Character Classes

A caret symbol ^ placed right after the opening bracket makes it a negative character class, matching anything NOT in the brackets

In [49]:
consonantsRegex = re.compile(r'[^aeiouAEIOU]')
consonantsRegex.findall('Robocop eats baby food.')
Out[49]:
['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']

Regex Dot-Star and the Caret/Dollar Characters

Matching the Start ^ and End $

^ means the string must start with the pattern

In [50]:
import re
beginWithHelloRegex = re.compile(r'^Hello')
beginWithHelloRegex.search('Hello there')
Out[50]:
<_sre.SRE_Match object; span=(0, 5), match='Hello'>
In [51]:
beginWithHelloRegex.search('He said "Hello"')

$ means the string must end with the pattern.

In [52]:
import re
endWithWorldRegex = re.compile(r'world$')
endWithWorldRegex.search('Hello world')
Out[52]:
<_sre.SRE_Match object; span=(6, 11), match='world'>
In [53]:
endWithWorldRegex.search('Hello world again')

Using both ^ and $ means the entire string must match the pattern.

In [54]:
import re
allDigitsRegex = re.compile(r'^\d+$')
allDigitsRegex.search('5789798754')
Out[54]:
<_sre.SRE_Match object; span=(0, 10), match='5789798754'>
In [55]:
allDigitsRegex.search('5789x98754')

. (anything except newline)

The . is a wildcard; it matches any single character except a newline \n

In [56]:
import re
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat')
Out[56]:
['cat', 'hat', 'sat', 'lat', 'mat']

Anything with 1 or 2 characters followed by 'at'. The result now includes 'flat', but the other matches also pick up a leading whitespace character

In [57]:
import re
atRegex = re.compile(r'.{1,2}at')
atRegex.findall('The cat in the hat sat on the flat mat')
Out[57]:
[' cat', ' hat', ' sat', 'flat', ' mat']
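One possible way to avoid the leading whitespace, assuming only word characters are wanted before 'at', is to replace the dot with \w. This is a sketch of that variation, not part of the original example.

```python
import re

# \w{1,2} allows one or two word characters, so a space cannot be matched.
wordAtRegex = re.compile(r'\w{1,2}at')

words = wordAtRegex.findall('The cat in the hat sat on the flat mat')
print(words)
```

The result should contain 'flat' in full and the other words without a leading space.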

Dot-Star to match anything

Pull out the first name and last name

In [58]:
import re
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
nameRegex.findall('First Name: Al Last Name: Sweigart')
Out[58]:
[('Al', 'Sweigart')]

Greedy and Non-Greedy

Anything enclosed by angle brackets - Non-greedy version (.*?)

In [59]:
import re
serve = '<To serve human> for dinner>'
nongreedy = re.compile(r'<(.*?)>')
nongreedy.findall(serve)
Out[59]:
['To serve human']

Anything enclosed by angle brackets - Greedy version (.*)

In [60]:
import re
serve = '<To serve human> for dinner>'
greedy = re.compile(r'<(.*)>')
greedy.findall(serve)
Out[60]:
['To serve human> for dinner']

Making Dot Match Newlines Too (with re.DOTALL)

Pass re.DOTALL as the second argument to re.compile() to make the . match newlines \n as well

Without re.DOTALL

In [61]:
import re
prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'
dotStarRegex = re.compile(r'.*')
dotStarRegex.search(prime)
Out[61]:
<_sre.SRE_Match object; span=(0, 23), match='Serve the public trust.'>

With re.DOTALL

In [62]:
import re
prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'
dotStarRegex = re.compile(r'.*', re.DOTALL)
dotStarRegex.search(prime)
Out[62]:
<_sre.SRE_Match object; span=(0, 59), match='Serve the public trust.\nProtect the innocent\nUp>

re.IGNORECASE or re.I

Pass re.IGNORECASE or re.I as the second argument to re.compile() to make the matching case-insensitive

Without re.IGNORECASE or re.I

In [63]:
import re
vowelRegex = re.compile(r'[aeiou]')
vowelRegex.findall('Al, why robocop again?')
Out[63]:
['o', 'o', 'o', 'a', 'a', 'i']

With re.IGNORECASE or re.I

In [64]:
import re
vowelRegex = re.compile(r'[aeiou]', re.IGNORECASE)
vowelRegex.findall('Al, why robocop again?')
Out[64]:
['A', 'o', 'o', 'o', 'a', 'a', 'i']

Regex sub() Method and Verbose Mode

The sub() Method

The sub() regex method will substitute matches with some other text

Using findall() Method

In [65]:
import re
nameRegex = re.compile(r'Agent \w+')
nameRegex.findall('Agent Alice gave the secret doc to Agent Bob')
Out[65]:
['Agent Alice', 'Agent Bob']

Using sub() Method

In [66]:
import re
nameRegex = re.compile(r'Agent \w+')
nameRegex.sub('REDACTED','Agent Alice gave the secret doc to Agent Bob')
Out[66]:
'REDACTED gave the secret doc to REDACTED'

Using \1, \2, etc. in sub()

Using \1, \2 and so on in the replacement string inserts the text matched by group 1, 2, etc. of the regex pattern

In [67]:
import re
nameRegex = re.compile(r'Agent (\w)\w*')
nameRegex.findall('Agent Alice gave the secret doc to Agent Bob')
Out[67]:
['A', 'B']
In [68]:
nameRegex.sub(r'Agent \1****','Agent Alice gave the secret doc to Agent Bob')
Out[68]:
'Agent A**** gave the secret doc to Agent B****'
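The same idea extends to more than one group. The sketch below reuses the two-group phone pattern from earlier and reformats each number with \1 and \2; the sample text is made up for this example.

```python
import re

# Group 1 is the area code, group 2 is the rest of the number.
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

text = 'Call 415-555-4242 or 212-555-0000'
reformatted = phoneRegex.sub(r'(\1) \2', text)   # wrap the area code in parens
print(reformatted)
```

The replacement string should be a raw string too, so that \1 and \2 reach the regex engine unchanged.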

Verbose Mode with re.VERBOSE

Passing re.VERBOSE lets you add whitespace and comments to the regex string passed to re.compile

In [69]:
re.compile(r'''
(
  (\d\d\d-) |       # area code (without parens, with dash)
  (\(\d\d\d\)\ )    # -or- area code with parens and no dash (\  is a literal space)
)
\d\d\d        # first 3 digits
-             # dash
\d\d\d\d      # last 4 digits
\sx\d{2,4}    # extension, like x1234''', re.VERBOSE);
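To see a verbose pattern in use, the sketch below compiles a variant of the phone-number pattern and runs it over a made-up sentence. Two details are assumptions of this sketch: the two area-code forms are wrapped in a single group so the pipe applies only to them, and the space after the closing paren is escaped, because unescaped whitespace is ignored in verbose mode. finditer() is used so that each full match can be read with group().

```python
import re

phoneRegex = re.compile(r'''
(
    \d\d\d-         # area code followed by a dash
  | \(\d\d\d\)\     # -or- area code in parens, then an escaped space
)
\d\d\d              # first 3 digits
-                   # dash
\d\d\d\d            # last 4 digits
\sx\d{2,4}          # extension, like x1234
''', re.VERBOSE)

text = 'Reach me at 415-555-1234 x1234 or (212) 555-0000 x99.'
matches = [mo.group() for mo in phoneRegex.finditer(text)]
print(matches)
```

Both number formats should be found, each with its extension.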

Using Multiple Options (re.I, re.DOTALL, re.VERBOSE)

If you want to pass multiple options (re.I, re.DOTALL, re.VERBOSE), combine them with the bitwise OR operator |

In [70]:
re.compile(r'''
(
  (\d\d\d-) |       # area code (without parens, with dash)
  (\(\d\d\d\)\ )    # -or- area code with parens and no dash (\  is a literal space)
)
\d\d\d        # first 3 digits
-             # dash
\d\d\d\d      # last 4 digits
\sx\d{2,4}    # extension, like x1234''',
re.IGNORECASE | re.DOTALL | re.VERBOSE);
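A smaller sketch may make the effect of combining options easier to see. Here the same made-up pattern is compiled once with two options joined by | and once with only one of them.

```python
import re

text = 'HELLO\nWorld'   # mixed case, with a newline in the middle

bothFlags = re.compile(r'hello.world', re.IGNORECASE | re.DOTALL)
onlyIgnoreCase = re.compile(r'hello.world', re.IGNORECASE)

both_result = bothFlags.search(text)        # . may match the newline
single_result = onlyIgnoreCase.search(text) # . cannot match the newline

print(both_result is not None, single_result is None)
```

Only the pattern compiled with both options matches, since it needs case-insensitivity and a dot that crosses the newline.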

Regex Example Program: A Phone and Email Scraper