Regular expressions are a mini-language for specifying text patterns.
Writing code to do pattern matching without regular expressions is a huge pain. The following example checks whether a message contains a phone number.
def isPhoneNumber(text):
    if len(text) != 12:
        return False  # not phone number-sized
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # no area code
    if text[3] != '-':
        return False  # missing dash
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False  # not first 3 digits
    if text[7] != '-':
        return False  # missing second dash
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False  # missing last 4 digits
    return True
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'
foundNumber = False
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
        foundNumber = True
if not foundNumber:
    print('Could not find any phone numbers')
Regex strings often use backslashes (like \d), so they are usually written as raw strings: r'\d'. \d is the regex for a numeric digit character.
To use regular expressions, you have to import the re module first. Then you will usually pass a raw string to the re.compile() function, which returns a regex object.
Call the regex object's search() method to return a match object (mo). Call the match object's group() method to get the matched string.
import re
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search(message)
print(mo.group())
Calling the regex object's findall() method returns a list of all the strings that match the pattern.
import re
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(phoneNumRegex.findall(message))
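The difference between search() and findall() is easiest to see side by side. The following is a minimal sketch using the same message as above; the variable names first and every are only illustrative.

```python
import re

message = 'Call me 415-555-1101 tomorrow or 415-555-9999'
phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# search() stops at the first match and returns a match object
first = phone_num_regex.search(message).group()
# findall() returns every non-overlapping match as a list of strings
every = phone_num_regex.findall(message)

print(first)  # 415-555-1101
print(every)  # ['415-555-1101', '415-555-9999']
```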
Groups are created in regex strings with parentheses ( ). The first set of parentheses is group 1, the second is group 2, and so on. Calling group() or group(0) returns the full matching string, group(1) returns group 1's matching string, and so on.
import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is 415-555-4242')
mo.group()
mo.group(0)
mo.group(1)
mo.group(2)
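As a sketch of what those calls return, the snippet below prints each group. It also uses the match object's groups() method, which is not mentioned above; it returns all numbered groups at once as a tuple.

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My phone number is 415-555-4242')

whole = mo.group()          # same as mo.group(0): the full match
area_code = mo.group(1)     # text matched by the first parentheses
main_number = mo.group(2)   # text matched by the second parentheses
all_groups = mo.groups()    # every numbered group as a tuple

print(whole)        # 415-555-4242
print(area_code)    # 415
print(main_number)  # 555-4242
print(all_groups)   # ('415', '555-4242')
```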
Use \( and \) to match literal parentheses ( ) in a string.
import re
phoneNumRegex = re.compile(r'\(\d\d\d\) \d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My phone number is (415) 555-4242')
mo.group()
The pipe | matches one of several alternatives.
Searching for 'Batman', 'Batmobile', 'Batcopter', or 'Batbat'
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()
mo.group(0)
mo.group(1)
The search() method returns None instead of a match object if the pattern is not found in the searched string.
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmotorcycle lost a wheel')
mo is None
# mo.group() # AttributeError
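One way to avoid the AttributeError is to test for None before calling group(). This is a minimal sketch; describe is a hypothetical helper name used only for illustration.

```python
import re

bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')

def describe(text):
    """Hypothetical helper: report which Bat-word, if any, appears in text."""
    mo = bat_regex.search(text)
    if mo is None:
        # No match: calling mo.group() here would raise AttributeError
        return 'no match'
    return mo.group() + ' (suffix: ' + mo.group(1) + ')'

print(describe('Batmobile lost a wheel'))      # Batmobile (suffix: mobile)
print(describe('Batmotorcycle lost a wheel'))  # no match
```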
The ? says the group matches zero or one time. Here the (wo) group appears zero times
import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batman')
mo.group()
Here the (wo) group appears one time
import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batwoman')
mo.group()
Here the (wo) group appears more than once, so the pattern does not match
import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batwowowowoman')
mo is None
# mo.group() # AttributeError
Match a phone number with or without an area code
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
phoneRegex.search('My phone number is 415-555-1234')
phoneRegex.search('My phone number is 555-1234')
To match an actual question mark, escape it with a backslash. For example, if you wanted a regex object for the text "dinner?" (with the question mark), you would call:
re.compile(r'dinner\?') # note the backslash in \?
In the above case, dinner is not optional; the pattern looks for a literal question mark: "dinner?"
The * says the group matches zero or more times
import re
batRegex = re.compile(r'Bat(wo)*man')
batRegex.search('The Adventures of Batman')
batRegex.search('The Adventures of Batwoman')
batRegex.search('The Adventures of Batwowowowoman')
The + says the group matches one or more times
import re
batRegex = re.compile(r'Bat(wo)+man')
batRegex.search('The Adventures of Batman') is None
batRegex.search('The Adventures of Batwoman')
batRegex.search('The Adventures of Batwowowowoman')
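The three quantifiers ?, * and + differ only in how many repetitions of the group they accept. The sketch below runs all three against the same titles and records which ones match; the names titles, patterns and results are illustrative.

```python
import re

titles = ['Batman', 'Batwoman', 'Batwowowowoman']
patterns = {
    '?': re.compile(r'Bat(wo)?man'),  # zero or one (wo)
    '*': re.compile(r'Bat(wo)*man'),  # zero or more (wo)
    '+': re.compile(r'Bat(wo)+man'),  # one or more (wo)
}

# For each quantifier, collect the titles the pattern matches
results = {
    symbol: [title for title in titles if regex.search(title) is not None]
    for symbol, regex in patterns.items()
}

for symbol, matched in results.items():
    print(symbol, matched)
# ? ['Batman', 'Batwoman']
# * ['Batman', 'Batwoman', 'Batwowowowoman']
# + ['Batwoman', 'Batwowowowoman']
```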
Use \+, \* and \? to match the literal characters +, * and ?
regex = re.compile(r'\+\*\?')
regex.search('I learnt about +*? regex syntax')
regex = re.compile(r'(\+\*\?)+')
regex.search('I learnt about +*? regex syntax')
Curly braces {} match a specific number of repetitions
haRegex = re.compile(r'(Ha){3}')
haRegex.search('HaHaHa') # found
haRegex = re.compile(r'Ha{3}')
haRegex.search('HaHaHa') # not found
haRegex.search('Haaa') # found
Match 3 phone numbers in a row; each may lack an area code and may be followed by a comma or a space
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,| )?){3}')
phoneRegex.search('555-1234,555-4242,212-555-0000') # found
phoneRegex.search('415-555-1234 555-4242,212-555-0000') # found
Curly braces {} with two numbers match a minimum and maximum number of times.
Leaving out the first or second number says there is no minimum or no maximum, e.g. {,5} is the same as {0,5}, and {3,} means 3 or more.
haRegex = re.compile(r'(Ha){3,5}')
haRegex.search('He said "HaHaHa"') # found
haRegex.search('He said "HaHaHaHaHa"') # found
haRegex.search('He said "HaHaHaHaHaHaHa"') # found (matches only the first five Ha)
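To make the bounds concrete, here is a small sketch that checks a string below the minimum and one above the maximum. The sample strings are made up for illustration.

```python
import re

ha_regex = re.compile(r'(Ha){3,5}')

# Only 2 repetitions: below the minimum of 3, so there is no match
two = ha_regex.search('He said "HaHa"')
# 7 repetitions: the match stops at the maximum of 5
seven = ha_regex.search('He said "HaHaHaHaHaHaHa"')

print(two)            # None
print(seven.group())  # HaHaHaHaHa
```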
Greedy matching matches the longest string possible; non-greedy matching matches the shortest string possible. By default, regular expressions do a greedy match. Putting a question mark ? after the curly braces {} makes the match non-greedy.
Greedy match of a string with 3 to 5 digits
digitRegex = re.compile(r'(\d){3,5}')
digitRegex.search('1234567890')
Non-greedy match with ?
digitRegex = re.compile(r'(\d){3,5}?')
digitRegex.search('1234567890')
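A short sketch of what the two searches above return: the greedy pattern takes as many digits as it is allowed, the non-greedy one as few.

```python
import re

greedy = re.compile(r'(\d){3,5}').search('1234567890').group()
non_greedy = re.compile(r'(\d){3,5}?').search('1234567890').group()

print(greedy)      # 12345 (the maximum of 5 digits)
print(non_greedy)  # 123   (the minimum of 3 digits)
```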
import re
phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
If the regular expression has zero or one group, the findall() method returns a list of strings. With no groups, each string is the full text that matched the pattern; with one group, each string is the text matched by that group.
import re
phoneRegex = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
If the regular expression has two or more groups, for example one for the area code and one for the main number, the findall() method returns a list of tuples of strings.
import re
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
With 3 groups
import re
phoneRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')
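The shape of the findall() result depends on the number of groups. The sketch below compares the cases on one input; the variable names and the sample text are illustrative.

```python
import re

text = 'The numbers are 415-789-3564 and 416-782-4654'

no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d').findall(text)
one_group = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d').findall(text)
two_groups = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)').findall(text)

print(no_groups)   # ['415-789-3564', '416-782-4654']
print(one_group)   # ['415', '416']
print(two_groups)  # [('415', '789-3564'), ('416', '782-4654')]
```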
\d is a shorthand character class that matches digits. \w matches word characters, and \s matches whitespace characters. The uppercase shorthand character classes \D, \W and \S match characters that are NOT digits, word characters, and whitespace, respectively.
digitRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')
digitRegex = re.compile(r'\d') # Same as above
Shorthand Codes for Common Character Classes | Represents |
---|---|
\d | Any numeric digit from 0 to 9 |
\D | Any character that is not a numeric digit from 0 to 9 |
\w | Any letter, numeric digit, or the underscore character (Think of this as matching "word" characters) |
\W | Any character that is not a letter, numeric digit, or the underscore character |
\s | Any space, tab or newline character (Think of this as matching "space" character) |
\S | Any character that is not a space, tab, or newline |
The regular expression string below consists of one or more digits \d+, then a whitespace character \s, then one or more word characters \w+
import re
lyrics = '12 Drummers Drumming, 11 Pipers Piping, 10 Lords a Leaping, 9 Ladies Dancing, 8 Maids a Milking, 7 Swans a Swimming, 6 Geese a Laying, 5 Golden Rings, 4 Calling Birds, 3 French Hens, 2 Turtle Doves, and a Partridge in a Pear Tree'
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall(lyrics)
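To show what this pattern picks out, here is a sketch on a shortened version of the lyrics. The shorter string is only there to keep the output readable.

```python
import re

short_lyrics = '12 Drummers Drumming, 11 Pipers Piping, 10 Lords a Leaping'
xmas_regex = re.compile(r'\d+\s\w+')

# Each match is a number, one whitespace character, and the next word
gifts = xmas_regex.findall(short_lyrics)
print(gifts)  # ['12 Drummers', '11 Pipers', '10 Lords']
```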
You can make your own character class with square brackets []. Most regex special characters (such as . * ? +) do not need to be escaped inside a character class.
See also re.IGNORECASE below.
vowelRegex = re.compile(r'[aeiouAEIOU]') # r'(a|e|i|o|u|A|E|I|O|U)'
vowelRegex.findall('Robocop eats baby food.')
Match two vowels in a row
vowelRegex = re.compile(r'[aeiouAEIOU]{2}')
vowelRegex.findall('Robocop eats baby food.')
atofRegex = re.compile(r'[a-fA-F]') # a to f in either case
A caret ^ at the start of a character class makes it a negative character class, matching any character NOT in the brackets
consonantsRegex = re.compile(r'[^aeiouAEIOU]')
consonantsRegex.findall('Robocop eats baby food.')
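Note that a negative character class matches every character outside the set, including spaces and punctuation, not only consonants. The sketch below illustrates this on the same sentence; the variable names are illustrative.

```python
import re

text = 'Robocop eats baby food.'

vowels = re.compile(r'[aeiouAEIOU]').findall(text)
double_vowels = re.compile(r'[aeiouAEIOU]{2}').findall(text)
# [^...] matches anything that is not a vowel: consonants, spaces, the period
non_vowels = ''.join(re.compile(r'[^aeiouAEIOU]').findall(text))

print(vowels)         # ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']
print(double_vowels)  # ['ea', 'oo']
print(non_vowels)     # Rbcp ts bby fd.
```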
^ at the start of a regex means the string must start with the pattern
import re
beginWithHelloRegex = re.compile(r'^Hello')
beginWithHelloRegex.search('Hello there')
beginWithHelloRegex.search('He said "Hello"')
$ at the end of a regex means the string must end with the pattern.
import re
endWithWorldRegex = re.compile(r'world$')
endWithWorldRegex.search('Hello world')
endWithWorldRegex.search('Hello world again')
Using both ^ and $ means the entire string must match the pattern.
import re
allDigitsRegex = re.compile(r'^\d+$')
allDigitsRegex.search('5789798754')
allDigitsRegex.search('5789x98754')
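The anchors are easiest to check by testing whether each search returns a match object or None. A minimal sketch using the strings from the examples above:

```python
import re

begins_with_hello = re.compile(r'^Hello')
ends_with_world = re.compile(r'world$')
all_digits = re.compile(r'^\d+$')

checks = {
    'Hello there': begins_with_hello.search('Hello there') is not None,
    'He said "Hello"': begins_with_hello.search('He said "Hello"') is not None,
    'Hello world': ends_with_world.search('Hello world') is not None,
    'Hello world again': ends_with_world.search('Hello world again') is not None,
    '5789798754': all_digits.search('5789798754') is not None,
    '5789x98754': all_digits.search('5789x98754') is not None,
}

for text, matched in checks.items():
    print(repr(text), matched)
```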
The . is a wildcard; it matches any single character except a newline \n
import re
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat')
Anything with 1 or 2 characters followed by 'at'. The result includes 'flat' but also includes whitespace characters
import re
atRegex = re.compile(r'.{1,2}at')
atRegex.findall('The cat in the hat sat on the flat mat')
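A sketch of what the two wildcard patterns return on the same sentence. The single dot cuts 'flat' down to 'lat', while .{1,2} keeps 'flat' but picks up the leading space before the shorter words.

```python
import re

sentence = 'The cat in the hat sat on the flat mat'

one_char = re.compile(r'.at').findall(sentence)
one_or_two = re.compile(r'.{1,2}at').findall(sentence)

print(one_char)    # ['cat', 'hat', 'sat', 'lat', 'mat']
print(one_or_two)  # [' cat', ' hat', ' sat', 'flat', ' mat']
```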
.* matches anything: zero or more of any character except a newline
import re
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
nameRegex.findall('First Name: Al Last Name: Sweigart')
Anything enclosed by angle brackets - Non-greedy version (.*?)
import re
serve = '<To serve human> for dinner>'
nongreedy = re.compile(r'<(.*?)>')
nongreedy.findall(serve)
Anything enclosed by angle brackets - Greedy version (.*)
import re
serve = '<To serve human> for dinner>'
greedy = re.compile(r'<(.*)>')
greedy.findall(serve)
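The two versions differ in where the match stops. In this sketch the non-greedy pattern ends at the first closing bracket and the greedy one at the last.

```python
import re

serve = '<To serve human> for dinner>'

non_greedy_result = re.compile(r'<(.*?)>').findall(serve)
greedy_result = re.compile(r'<(.*)>').findall(serve)

print(non_greedy_result)  # ['To serve human']
print(greedy_result)      # ['To serve human> for dinner']
```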
Pass re.DOTALL as the second argument to re.compile() to make the . match newlines \n as well.
Without re.DOTALL
import re
prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'
dotStarRegex = re.compile(r'.*')
dotStarRegex.search(prime)
With re.DOTALL
import re
prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'
dotStarRegex = re.compile(r'.*', re.DOTALL)
dotStarRegex.search(prime)
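A sketch comparing the two results: without re.DOTALL the match stops at the first newline, with it the match covers the whole string.

```python
import re

prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'

without_dotall = re.compile(r'.*').search(prime).group()
with_dotall = re.compile(r'.*', re.DOTALL).search(prime).group()

print(repr(without_dotall))  # 'Serve the public trust.'
print(with_dotall == prime)  # True
```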
Pass re.IGNORECASE or re.I as the second argument to re.compile() to make the matching case-insensitive.
Without re.IGNORECASE or re.I
import re
vowelRegex = re.compile(r'[aeiou]')
vowelRegex.findall('Al, why robocop again?')
With re.IGNORECASE or re.I
import re
vowelRegex = re.compile(r'[aeiou]', re.IGNORECASE)
vowelRegex.findall('Al, why robocop again?')
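The only difference in the output is the uppercase 'A', which the case-sensitive pattern skips. A minimal sketch:

```python
import re

question = 'Al, why robocop again?'

case_sensitive = re.compile(r'[aeiou]').findall(question)
case_insensitive = re.compile(r'[aeiou]', re.IGNORECASE).findall(question)

print(case_sensitive)    # ['o', 'o', 'o', 'a', 'a', 'i']
print(case_insensitive)  # ['A', 'o', 'o', 'o', 'a', 'a', 'i']
```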
import re
nameRegex = re.compile(r'Agent \w+')
nameRegex.findall('Agent Alice gave the secret doc to Agent Bob')
Using the sub() method
import re
nameRegex = re.compile(r'Agent \w+')
nameRegex.sub('REDACTED','Agent Alice gave the secret doc to Agent Bob')
Using \1, \2 and so on in the replacement string substitutes the text of group 1, 2, etc. from the regex pattern
import re
nameRegex = re.compile(r'Agent (\w)\w*')
nameRegex.findall('Agent Alice gave the secret doc to Agent Bob')
nameRegex.sub(r'Agent \1****','Agent Alice gave the secret doc to Agent Bob')
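A sketch of what the two kinds of substitution produce on the same sentence. The first replaces each whole match with fixed text; the second keeps the first letter of the name through the \1 back-reference.

```python
import re

report = 'Agent Alice gave the secret doc to Agent Bob'

# Replace every match with fixed text
redacted = re.compile(r'Agent \w+').sub('REDACTED', report)
# Group 1 captures the first letter of the name; \1 puts it back
censored = re.compile(r'Agent (\w)\w*').sub(r'Agent \1****', report)

print(redacted)  # REDACTED gave the secret doc to REDACTED
print(censored)  # Agent A**** gave the secret doc to Agent B****
```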
Passing re.VERBOSE lets you add whitespace and comments to the regex string passed to re.compile(). Whitespace inside the pattern is then ignored, so a literal space must be escaped.
re.compile(r'''
( # area code, one of:
(\d\d\d-)| # area code (without parens, with dash)
(\(\d\d\d\)\ ) # -or- area code with parens and no dash
)
\d\d\d # first 3 digits
- # second dash
\d\d\d\d # last 4 digits
\sx\d{2,4} # extension, like x1234''', re.VERBOSE)
If you want to pass multiple flags (re.I, re.DOTALL, re.VERBOSE), combine them with the bitwise OR operator |
re.compile(r'''
( # area code, one of:
(\d\d\d-)| # area code (without parens, with dash)
(\(\d\d\d\)\ ) # -or- area code with parens and no dash
)
\d\d\d # first 3 digits
- # second dash
\d\d\d\d # last 4 digits
\sx\d{2,4} # extension, like x1234''',
re.IGNORECASE | re.DOTALL | re.VERBOSE)
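The following sketch shows how a verbose pattern of this kind might be exercised. It assumes the two area-code forms are wrapped in one group and the literal space is escaped, because re.VERBOSE ignores unescaped whitespace in the pattern. The sample strings are made up for illustration.

```python
import re

phone_regex = re.compile(r'''
    (
        \d\d\d-          # area code without parens, with dash
        |
        \(\d\d\d\)\      # -or- area code with parens and an escaped space
    )
    \d\d\d               # first 3 digits
    -                    # second dash
    \d\d\d\d             # last 4 digits
    \sx\d{2,4}           # extension, like x1234
    ''', re.IGNORECASE | re.VERBOSE)

dashed = phone_regex.search('Call 415-555-1234 x1234 now')
# re.IGNORECASE lets the uppercase X match the x in the pattern
parens = phone_regex.search('Call (415) 555-1234 X99 now')
# No area code and no extension, so the pattern does not match
missing = phone_regex.search('Call 555-1234 now')

print(dashed.group())  # 415-555-1234 x1234
print(parens.group())  # (415) 555-1234 X99
print(missing)         # None
```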