def isPhoneNumber(text):
  if len(text) != 12:
    return False # not phone number-sized
  for i in range(0, 3):
    if not text[i].isdecimal():
      return False # no area code
  if text[3] != '-':
    return False # missing dash
  for i in range(4, 7):
    if not text[i].isdecimal():
      return False # no first 3 digits
  if text[7] != '-':
    return False # missing second dash
  for i in range(8, 12):
    if not text[i].isdecimal():
      return False # missing last 4 digits
  return True

message = 'Call me 415-555-1101 tomorrow or 415-555-9999'
foundNumber = False

for i in range(len(message)):
  chunk = message[i:i+12]
  if isPhoneNumber(chunk):
    print('Phone number found: ' + chunk)
    foundNumber = True

if not foundNumber:
  print('Could not find any phone numbers')

Phone number found: 415-555-1101
Phone number found: 415-555-9999

The Regular Expression (re) Module¶

Regex strings often use backslashes (as in \d), so they are usually written as raw strings: r'\d'. \d is the regex for a numeric digit character.

To use regular expressions, import the re module first. Then pass a raw string to the re.compile() function, which returns a regex object.

Call the regex object's search() method to get a match object (mo), then call the match object's group() method to get the matched string.

import re
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search(message)
print(mo.group())

415-555-1101

The findall() Method¶

Calling the regex object's findall() method returns a list of all the strings that match the pattern.

import re
message = 'Call me 415-555-1101 tomorrow or 415-555-9999'

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(phoneNumRegex.findall(message))

['415-555-1101', '415-555-9999']

Regex Groups¶

Groups are created in regex strings with parentheses ( ). The first set of parentheses is group 1, the second is 2, and so on. Calling group() or group(0) returns the full matching string, group(1) returns group 1's matching string, and so on.

import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is 415-555-4242')

mo.group()

'415-555-4242'

mo.group(0)

'415-555-4242'

mo.group(1)

'415'

mo.group(2)

'555-4242'
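
The match object also has a groups() method that returns every group at once as a tuple, which is handy for unpacking. A minimal sketch using the same pattern; the variable names areaCode and mainNumber are illustrative and not part of the original example:

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is 415-555-4242')

# groups() returns all captured groups as a tuple, in order
areaCode, mainNumber = mo.groups()
print(areaCode)    # group 1
print(mainNumber)  # group 2
```

This prints 415 and then 555-4242, the same strings that group(1) and group(2) return.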

Use \( and \) to match literal parentheses ( ) in a string

import re
phoneNumRegex = re.compile(r'\(\d\d\d\) \d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My phone number is (415) 555-4242')
mo.group()

'(415) 555-4242'

Pipe Character¶

The pipe | can match one of many possible groups.

Searching for 'Batman', 'Batmobile', 'Batcopter', 'Batbat'

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')

mo.group()

'Batmobile'

mo.group(0)

'Batmobile'

mo.group(1)

'mobile'

search() returns None instead of a match object if the pattern is not found in the search string

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmotorcycle lost a wheel')

mo == None

True

# mo.group() # AttributeError
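
Because search() can return None, it is usually safer to check the result before calling group(). A small sketch of that guard; the helper name describe and the fallback text 'no match' are assumptions made for illustration:

```python
import re

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')

def describe(text):
    """Return the matched word, or a fallback string when search() finds nothing."""
    mo = batRegex.search(text)
    if mo is None:          # no match, so there is no group() to call
        return 'no match'
    return mo.group()

print(describe('Batmobile lost a wheel'))
print(describe('Batmotorcycle lost a wheel'))
```

The first call prints Batmobile; the second prints no match instead of raising AttributeError.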

Repetition in Regex¶

? (zero or one)¶

The ? says the group matches zero or one time

The wo group appears zero times

import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batman')
mo.group()

'Batman'

The wo group appears one time

import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batwoman')
mo.group()

'Batwoman'

import re
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batwowowowoman')
mo == None

True

# mo.group() # AttributeError

Match a phone number with or without area code

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
phoneRegex.search('My phone number is 415-555-1234')

<_sre.SRE_Match object; span=(19, 31), match='415-555-1234'>

phoneRegex.search('My phone number is 555-1234')

<_sre.SRE_Match object; span=(19, 27), match='555-1234'>

To match a literal question mark, escape it with a backslash. For example, if you wanted a regex object for the text "dinner?" (with the question mark), you would call:

re.compile(r'dinner\?') # note the backslash in \?

re.compile(r'dinner\?', re.UNICODE)

In the above case, dinner is not optional; we are literally looking for a question mark: "dinner?"

* (zero or more)¶

The * says the group matches zero or more times

import re
batRegex = re.compile(r'Bat(wo)*man')
batRegex.search('The Adventures of Batman')

<_sre.SRE_Match object; span=(18, 24), match='Batman'>

batRegex.search('The Adventures of Batwoman')

<_sre.SRE_Match object; span=(18, 26), match='Batwoman'>

batRegex.search('The Adventures of Batwowowowoman')

<_sre.SRE_Match object; span=(18, 32), match='Batwowowowoman'>

+ (one or more)¶

The + says the group matches one or more times

import re
batRegex = re.compile(r'Bat(wo)+man')
batRegex.search('The Adventures of Batman') == None

True

batRegex.search('The Adventures of Batwoman')

<_sre.SRE_Match object; span=(18, 26), match='Batwoman'>

batRegex.search('The Adventures of Batwowowowoman')

<_sre.SRE_Match object; span=(18, 32), match='Batwowowowoman'>

Escaping Symbols ?*+

regex = re.compile(r'\+\*\?')
regex.search('I learnt about +*? regex syntax')

<_sre.SRE_Match object; span=(15, 18), match='+*?'>

regex = re.compile(r'(\+\*\?)+')
regex.search('I learnt about +*? regex syntax')

<_sre.SRE_Match object; span=(15, 18), match='+*?'>
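
When the literal text is held in a variable, the standard library's re.escape() function can add the backslashes for you. A brief sketch, assuming the text to find is stored in a variable named literal (a name chosen here for illustration):

```python
import re

literal = '+*?'
# re.escape() puts a backslash in front of each regex metacharacter
regex = re.compile(re.escape(literal))
mo = regex.search('I learnt about +*? regex syntax')
print(mo.group())
print(mo.span())
```

This finds the same '+*?' text at span (15, 18) as the hand-escaped pattern.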

{x} (exactly x times)¶

The curly braces {} can match a specific number of times

haRegex = re.compile(r'(Ha){3}')
haRegex.search('HaHaHa') # found

<_sre.SRE_Match object; span=(0, 6), match='HaHaHa'>

haRegex = re.compile(r'Ha{3}')
haRegex.search('HaHaHa') # not found

haRegex.search('Haaa') # found

<_sre.SRE_Match object; span=(0, 4), match='Haaa'>

Match 3 phone numbers in a row; each may omit the area code and may be followed by a comma or a space

phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,| )?){3}')
phoneRegex.search('555-1234,555-4242,212-555-0000') # found

<_sre.SRE_Match object; span=(0, 30), match='555-1234,555-4242,212-555-0000'>

phoneRegex.search('415-555-1234 555-4242,212-555-0000') # found

<_sre.SRE_Match object; span=(0, 34), match='415-555-1234 555-4242,212-555-0000'>

{x,y} (at least x, at most y)¶

The curly braces {} with two numbers match a minimum and maximum number of times. Leaving out the first or second number says there is no minimum or no maximum: {,5} is the same as {0,5}, and {3,} means 3 or more. Write the quantifier without spaces, as in the examples.

haRegex = re.compile(r'(Ha){3,5}')
haRegex.search('He said "HaHaHa"') # found

<_sre.SRE_Match object; span=(9, 15), match='HaHaHa'>

haRegex.search('He said "HaHaHaHaHa"') # found

<_sre.SRE_Match object; span=(9, 19), match='HaHaHaHaHa'>

haRegex.search('He said "HaHaHaHaHaHaHa"') # found

<_sre.SRE_Match object; span=(9, 19), match='HaHaHaHaHa'>
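
The open-ended forms can be tried the same way. A short sketch, with the regex names atLeastThree and atMostTwo invented for this illustration:

```python
import re

atLeastThree = re.compile(r'(Ha){3,}')   # 3 or more, no upper bound
atMostTwo = re.compile(r'(Ha){,2}')      # 0 to 2 repetitions

print(atLeastThree.search('HaHaHaHaHaHaHa').group())  # takes all seven
print(atLeastThree.search('HaHa') is None)            # only two, so no match
print(atMostTwo.search('HaHaHaHa').group())           # stops after two
```

With no maximum, the greedy match takes all seven repetitions; with a maximum of two, it stops at 'HaHa'.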

Greedy/ Non-Greedy Match¶

Greedy matching matches the longest string possible; non-greedy matching matches the shortest string possible. By default, regular expressions do a greedy match. Putting a question mark ? after the curly braces {} makes the match non-greedy.

Greedy match of a string with 3 to 5 digits

digitRegex = re.compile(r'(\d){3,5}')
digitRegex.search('1234567890')

<_sre.SRE_Match object; span=(0, 5), match='12345'>

Non-greedy match with ?

digitRegex = re.compile(r'(\d){3,5}?')
digitRegex.search('1234567890')

<_sre.SRE_Match object; span=(0, 3), match='123'>

Regex Character Class & the findall() Method¶

The findall() Method¶

The regex method findall() is passed a string and returns all matches in it, not just the first match.

search() returns a match object; findall() returns a list of strings.

import re
phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')

['415-789-3564', '416-782-4654']

If the regular expression has zero or one group, the findall() method returns a list of strings. With no groups, each string is the full text that matched the pattern; with one group, each string is the text matched by that group.

import re
phoneRegex = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')

['415', '416']

If the regular expression has two or more groups, for example one for the area code and one for the main number, the findall() method returns a list of tuples of strings

import re
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')

[('415', '789-3564'), ('416', '782-4654')]

With 3 groups

import re
phoneRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
phoneRegex.findall('The number are 415-789-3564 and 416-782-4654')

[('415-789-3564', '415', '789-3564'), ('416-782-4654', '416', '782-4654')]
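
If you want the full match text while keeping the groups, one alternative is the regex object's finditer() method, which yields a match object for each match. A minimal sketch; the list names fullNumbers and areaCodes are illustrative:

```python
import re

phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
text = 'The number are 415-789-3564 and 416-782-4654'

# finditer() yields one match object per match, so group() is still available
fullNumbers = [mo.group() for mo in phoneRegex.finditer(text)]
areaCodes = [mo.group(1) for mo in phoneRegex.finditer(text)]
print(fullNumbers)
print(areaCodes)
```

This avoids wrapping the whole pattern in an extra pair of parentheses just to recover the complete number.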

Character Classes¶

\d is a shorthand character class that matches digits. \w matches word characters, \s matches whitespace characters. The uppercase shorthand character classes \D, \W, \S match characters that are NOT digits, word characters, and whitespace characters, respectively.

digitRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')
digitRegex = re.compile(r'\d') # Same as above

Shorthand Codes for Common Character Classes	Represents
\d	Any numeric digit from 0 to 9
\D	Any character that is not a numeric digit from 0 to 9
\w	Any letter, numeric digit, or the underscore character (Think of this as matching "word" characters)
\W	Any character that is not a letter, numeric digit, or the underscore character
\s	Any space, tab or newline character (Think of this as matching "space" character)
\S	Any character that is not a space, tab, or newline
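
A quick way to get a feel for the table is to run each shorthand against one sample string. The sketch below uses the module-level re.findall() function as a shortcut for compile-then-findall; the sample text is made up for illustration:

```python
import re

sample = 'Room 42, floor_3\tEast wing'   # contains spaces and one tab

print(re.findall(r'\d+', sample))    # runs of digits
print(re.findall(r'\w+', sample))    # runs of word characters (underscore included)
print(re.findall(r'\s', sample))     # individual whitespace characters
print(re.findall(r'\D+', 'a1b22c'))  # runs of non-digits
```

Note that floor_3 comes back as a single \w+ match because the underscore counts as a word character.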

12 Days of Christmas Example¶

The regular expression string consists of one or more digits \d+, then a whitespace character \s, then one or more word characters \w+

import re
lyrics = '12 Drummers Drumming, 11 Pipers Piping, 10 Lords a Leaping, 9 Ladies Dancing, 8 Maids a Milking, 7 Swans a Swimming, 6 Geese a Laying, 5 Golden Rings, 4 Calling Birds, 3 French Hens, 2 Turtle Doves, and a Partridge in a Pear Tree'
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall(lyrics)

['12 Drummers',
 '11 Pipers',
 '10 Lords',
 '9 Ladies',
 '8 Maids',
 '7 Swans',
 '6 Geese',
 '5 Golden',
 '4 Calling',
 '3 French',
 '2 Turtle']

Making Your Own Character Classes¶

You can make your own character class with square brackets []. Most regex symbols, such as . * ? ( ), do not need to be escaped inside a character class. See also re.IGNORECASE below.

vowelRegex = re.compile(r'[aeiouAEIOU]') # r'(a|e|i|o|u|A|E|I|O|U)'
vowelRegex.findall('Robocop eats baby food.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']

Match two vowels in a row

vowelRegex = re.compile(r'[aeiouAEIOU]{2}')
vowelRegex.findall('Robocop eats baby food.')

['ea', 'oo']

atofRegex = re.compile(r'[a-fA-F]') # all cases a to f
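
Ranges inside square brackets follow character codes, so the end points matter. The sketch below, with the hypothetical names hexRegex and wideRegex, shows a hexadecimal-digit class and what happens when a range runs from an uppercase letter to a lowercase one:

```python
import re

hexRegex = re.compile(r'[0-9a-fA-F]+')   # digits plus a-f in either case
print(hexRegex.fullmatch('ff00AA') is not None)
print(hexRegex.fullmatch('zz12') is None)

# A-f spans every character code from 'A' to 'f',
# which also covers G-Z and punctuation such as the underscore
wideRegex = re.compile(r'[A-f]')
print(wideRegex.search('_') is not None)
print(wideRegex.search('Z') is not None)
```

All four lines print True, which shows that a range like A-f accepts far more than the letters a to f.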

Negative Character Classes¶

A caret symbol ^ at the start of the brackets makes it a negative character class, matching anything NOT in the brackets

consonantsRegex = re.compile(r'[^aeiouAEIOU]')
consonantsRegex.findall('Robocop eats baby food.')

['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']

Regex Dot-Star and the Caret/Dollar Characters¶

Matching the ^Start and End $

^ means the string must start with the pattern

import re
beginWithHelloRegex = re.compile(r'^Hello')
beginWithHelloRegex.search('Hello there')

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

beginWithHelloRegex.search('He said "Hello"') # not found, returns None

$ means the string must end with the pattern.

import re
endWithWorldRegex = re.compile(r'world$')
endWithWorldRegex.search('Hello world')

<_sre.SRE_Match object; span=(6, 11), match='world'>

endWithWorldRegex.search('Hello world again') # not found, returns None

Both ^ $ means the entire string must match the pattern.

import re
allDigitsRegex = re.compile(r'^\d+$')
allDigitsRegex.search('5789798754')

<_sre.SRE_Match object; span=(0, 10), match='5789798754'>

allDigitsRegex.search('5789x98754') # not found, returns None
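
An alternative to writing both ^ and $ is the regex object's fullmatch() method, which only succeeds when the whole string matches. A minimal sketch, using a pattern named digitsRegex that deliberately leaves the anchors out:

```python
import re

digitsRegex = re.compile(r'\d+')   # note: no ^ or $ in the pattern

print(digitsRegex.search('5789x98754').group())     # search() accepts a partial match
print(digitsRegex.fullmatch('5789x98754') is None)  # fullmatch() needs the whole string
print(digitsRegex.fullmatch('5789798754').group())
```

Here search() returns '5789', while fullmatch() rejects the string containing x and accepts the all-digit one.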

. (anything except newline)¶

The . is a wildcard; it matches any single character except a newline \n

import re
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat')

['cat', 'hat', 'sat', 'lat', 'mat']

Anything with 1 or 2 characters followed by 'at'. The result now includes 'flat', but the other matches pick up the leading space character

import re
atRegex = re.compile(r'.{1,2}at')
atRegex.findall('The cat in the hat sat on the flat mat')

[' cat', ' hat', ' sat', 'flat', ' mat']

Dot-Star to match anything

Pull out the first name and last name

import re
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
nameRegex.findall('First Name: Al Last Name: Sweigart')

[('Al', 'Sweigart')]

Greedy and Non-Greedy¶

Anything enclosed by angle brackets - Non-greedy version (.*?)

import re
serve = '<To serve human> for dinner>'
nongreedy = re.compile(r'<(.*?)>')
nongreedy.findall(serve)

['To serve human']

Anything enclosed by angle brackets - Greedy version (.*)

import re
serve = '<To serve human> for dinner>'
greedy = re.compile(r'<(.*)>')
greedy.findall(serve)

['To serve human> for dinner']

Making Dot Match Newlines Too (with re.DOTALL)¶

Pass re.DOTALL as the second argument to re.compile() to make the . match newlines \n as well

Without re.DOTALL

import re
prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'
dotStarRegex = re.compile(r'.*')
dotStarRegex.search(prime)

<_sre.SRE_Match object; span=(0, 23), match='Serve the public trust.'>

With re.DOTALL

import re
prime = 'Serve the public trust.\nProtect the innocent\nUphold the law'
dotStarRegex = re.compile(r'.*', re.DOTALL)
dotStarRegex.search(prime)

<_sre.SRE_Match object; span=(0, 59), match='Serve the public trust.\nProtect the innocent\nUp>

re.IGNORECASE or re.I¶

Pass re.IGNORECASE or re.I as the second argument to re.compile() to make the matching case-insensitive

Without re.IGNORECASE or re.I

import re
vowelRegex = re.compile(r'[aeiou]')
vowelRegex.findall('Al, why robocop again?')

['o', 'o', 'o', 'a', 'a', 'i']

With re.IGNORECASE or re.I

import re
vowelRegex = re.compile(r'[aeiou]', re.IGNORECASE)
vowelRegex.findall('Al, why robocop again?')

['A', 'o', 'o', 'o', 'a', 'a', 'i']

Regex sub() Method and Verbose Mode¶

The sub() Method¶

The sub() regex method will substitute matches with some other text

Using findall() Method

import re
nameRegex = re.compile(r'Agent \w+')
nameRegex.findall('Agent Alice gave the secret doc to Agent Bob')

['Agent Alice', 'Agent Bob']

Using sub() Method

import re
nameRegex = re.compile(r'Agent \w+')
nameRegex.sub('REDACTED','Agent Alice gave the secret doc to Agent Bob')

'REDACTED gave the secret doc to REDACTED'

Using \1, \2, etc in sub()¶

Using \1, \2 and so on in the replacement string substitutes the text of group 1, 2, etc. from the regex pattern

import re
nameRegex = re.compile(r'Agent (\w)\w*')
nameRegex.findall('Agent Alice gave the secret doc to Agent Bob')

['A', 'B']

nameRegex.sub(r'Agent \1****','Agent Alice gave the secret doc to Agent Bob')

'Agent A**** gave the secret doc to Agent B****'
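
The replacement passed to sub() can also be a function that receives each match object and returns the replacement text. The sketch below uses a hypothetical helper named mask to hide exactly as many letters as each name contains; the second group in the pattern is added here for that purpose:

```python
import re

nameRegex = re.compile(r'Agent (\w)(\w*)')

def mask(mo):
    # keep the first letter, replace each remaining letter with one asterisk
    return 'Agent ' + mo.group(1) + '*' * len(mo.group(2))

result = nameRegex.sub(mask, 'Agent Alice gave the secret doc to Agent Bob')
print(result)
```

This prints 'Agent A**** gave the secret doc to Agent B**', so the mask length follows the name instead of being fixed at four asterisks.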

Verbose Mode with re.VERBOSE¶

Passing re.VERBOSE lets you add whitespace and comments to the regex string passed to re.compile

re.compile(r'''
(
(\d\d\d-)|      # area code (without parens, with dash)
(\(\d\d\d\)\ )  # -or- area code with parens and no dash (space escaped)
)               # the outer group limits | to the area code
\d\d\d          # first 3 digits
-               # second dash
\d\d\d\d        # last 4 digits
\sx\d{2,4}      # extension, like x1234''', re.VERBOSE)

Using Multiple Options (re.I, re.DOTALL, re.VERBOSE)¶

If you want to pass multiple options (re.I, re.DOTALL, re.VERBOSE), combine them with the bitwise OR operator |

re.compile(r'''
(
(\d\d\d-)|      # area code (without parens, with dash)
(\(\d\d\d\)\ )  # -or- area code with parens and no dash (space escaped)
)               # the outer group limits | to the area code
\d\d\d          # first 3 digits
-               # second dash
\d\d\d\d        # last 4 digits
\sx\d{2,4}      # extension, like x1234''',
re.IGNORECASE | re.DOTALL | re.VERBOSE)
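
To see a verbose pattern with combined options at work, here is a runnable sketch. It makes a few assumptions beyond the text above: the extension is made optional with ?, the area-code alternatives are wrapped in one outer group, and the literal space is escaped because verbose mode ignores unescaped whitespace. The sample strings are invented for the test.

```python
import re

phoneRegex = re.compile(r'''
(
  (\d\d\d-)          # area code without parens, with dash
  | (\(\d\d\d\)\ )   # -or- area code with parens, then an escaped space
)
\d\d\d               # first 3 digits
-                    # dash
\d\d\d\d             # last 4 digits
(\sx\d{2,4})?        # optional extension, like x1234
''', re.IGNORECASE | re.VERBOSE)

for text in ('415-555-1234 x99', '(415) 555-1234', '555-1234'):
    mo = phoneRegex.search(text)
    print(text, '->', mo.group() if mo else None)
```

The first two strings match in full; the third has no area code, so search() returns None for it.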

Table of Contents

Regular Expressions ¶

What is Regular Expression?¶

Example: Find Phone Number in a Message( without using regular expression)¶

The Regular Expression (re) Module¶

The findall() Method¶

Regex Groups¶

Pipe Character¶

Repetition in Regex¶

? (zero or one)¶

* (zero or more)¶

+ (one or more)¶

Escaping Symbols ?*+

{x} (exactly x times)¶

{x,y} (at least x, at most y)¶

Greedy/ Non-Greedy Match¶

Regex Character Class & the findall() Method¶

The findall() Method¶

Character Classes¶

12 Days of Christmas Example¶

Making Your Own Character Classes¶

Negative Character Classes¶

Regex Dot-Star and the Caret/Dollar Characters¶

Matching the ^Start and End $

. (anything except newline)¶

Dot-Star to match anything

Greedy and Non-Greedy¶

Making Dot Match Newlines Too (with re.DOTALL)¶

re.IGNORECASE or re.I¶

Regex sub() Method and Verbose Mode¶

The sub() Method¶

Using \1, \2, etc in sub()¶

Verbose Mode with re.VERBOSE¶

Using Multiple Options (re.I, re.DOTALL, re.VERBOSE)¶

Regex Example Program: A Phone and Email Scraper ¶