Web Scraping

The webbrowser Module

In [1]:
import webbrowser
webbrowser.open('https://automatetheboringstuff.com')
Out[1]:
True

Exercise: Open an address with Google Maps

Use Google Maps to show the location of an address typed on the command line or copied to the clipboard (mapit.py)

In [2]:
import webbrowser, sys, pyperclip

# sys is covered in lesson 20
sys.argv # ['mapit.py', '870', 'Valencia', 'St.']

# Check if command line arguments were passed
if len(sys.argv) > 1:
  # ['mapit.py', '870', 'Valencia', 'St.'] -> '870 Valencia St.'
  address = ' '.join(sys.argv[1:])
else:
  address = pyperclip.paste()

# https://www.google.com/maps/place/<ADDRESS>
webbrowser.open('https://www.google.com/maps/place/' + address)
Out[2]:
True

Save the above code in a file named "mapit.py". Create a batch file "mapit.bat" containing the following line:

@python Z:\IT\Python\MyPythonScripts\mapit.py %*

To run the program, either:

  • Copy the address with Ctrl+C, press Win+R, type mapit, and hit Enter, or
  • Press Win+R, type mapit followed by the address, and hit Enter

Downloading from the Web with the Requests Module

The requests module is a third-party module for downloading web pages and files. Run pip install requests to install it.

requests.get() returns a Response object. The raise_for_status() Response method will raise an exception if the download failed.

You can learn more about the other features of the requests module at https://requests.readthedocs.org

In [3]:
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.status_code
Out[3]:
200
In [4]:
len(res.text)
Out[4]:
174130
In [5]:
print(res.text[:500])
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Romeo and Juliet

Author: William Shakespeare

Posting Date: May 25, 2012 [EBook #1112]
Release Date: November, 1997  [Etext #1112]

Language: English


*** S
In [6]:
res.raise_for_status()
In [8]:
import requests
badRes = requests.get('https://automatetheboringstuff.com/files/fail')
# badRes.raise_for_status() # 404
In [9]:
badRes.status_code
Out[9]:
404
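
If you prefer the program not to crash on a failed download, you can wrap raise_for_status() in a try/except block. A minimal sketch, re-requesting the same failing URL as above:

import requests

badRes = requests.get('https://automatetheboringstuff.com/files/fail')
try:
  badRes.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as exc:
  print('There was a problem: %s' % exc)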

Write-binary mode

You can save the downloaded file to your hard drive by calling the Response object's iter_content() method in a loop. Pass 'wb' (write binary) as the second argument to open() in order to maintain the Unicode encoding of the text.

In [13]:
playFile = open('RomeoAndJuliet.txt', 'wb')

for chunk in res.iter_content(100000):
  playFile.write(chunk)

playFile.close()
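
An equivalent sketch using a with statement, which closes the file automatically even if an error occurs (an alternative to the cell above, assuming res still holds the Response object downloaded earlier):

with open('RomeoAndJuliet.txt', 'wb') as playFile:
  for chunk in res.iter_content(100000):
    playFile.write(chunk)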

The Unicode and bytes data types are explained on the page All about Python & Unicode. You can also watch the video there.
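
As a quick illustration of the two types on a Response object: res.text is the decoded Unicode str, while res.content is the raw bytes written to the 'wb' file above.

print(type(res.text))     # <class 'str'>, decoded Unicode text
print(type(res.content))  # <class 'bytes'>, raw bytes as written to the file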

Parsing HTML with the Beautiful Soup Module

Web pages are plain-text files formatted as HTML, which can be parsed with the BeautifulSoup module. To install BeautifulSoup:

In [14]:
# pip install beautifulsoup4

Exercise: Get the price of a book from Amazon

Import the bs4 (BeautifulSoup) & requests modules. BeautifulSoup is imported under the name bs4

In [15]:
import bs4, requests

Pass the URL of the web page to requests.get() to get a Response object

In [16]:
res = requests.get('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
res.raise_for_status()

Pass the string containing the HTML to the bs4.BeautifulSoup() function to get a BeautifulSoup object

In [17]:
soup = bs4.BeautifulSoup(res.text, "html.parser")

The BeautifulSoup object has a select() method that can be passed a CSS selector string for an HTML tag. You can get a CSS selector string from the browser's developer tools by right-clicking the element and copying its CSS selector (the exact steps are listed below)

To open browser's developer tools:

  • Chrome, IE & Firefox: F12
  • Safari: Cmd+Opt+I

To locate the element:

  • Chrome: right-click the element on the web page & select Inspect
  • Firefox: right-click the element on the web page & select Inspect Element

To get the CSS selector string:

  • Chrome: right-click the element in the developer tools & select Copy > Copy selector
  • Firefox: right-click the element in the developer tools & select Copy > CSS Selector
In [18]:
elems = soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last > span.a-size-medium.a-color-price.header-price')

The select() method returns a list of matching Tag objects. Each of these objects has a text attribute containing that element's inner text. You can trim the surrounding whitespace with the strip() string method

In [19]:
elems[0].text.strip()
Out[19]:
'$23.96'

The Full Script: amazonPrice.py

In [20]:
import bs4, requests

def getAmazonPrice(productUrl):
  res = requests.get(productUrl)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'html.parser')
  elems = soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last > span.a-size-medium.a-color-price.header-price')
  return elems[0].text.strip()

price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
print('The price is ' + price)
The price is $23.96
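
Note that Amazon changes its page layout frequently, so the CSS selector above can stop matching, and requests without a browser-like User-Agent header are sometimes blocked. The variant below is a hedged sketch, not part of the original lesson; the header string is only a placeholder, and it returns None instead of raising an IndexError when the selector matches nothing.

import bs4, requests

def getAmazonPrice(productUrl):
  # Placeholder User-Agent; any recent browser string should work.
  headers = {'User-Agent': 'Mozilla/5.0'}
  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'html.parser')
  # Same selector as above; update it if Amazon changes its markup.
  elems = soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last > span.a-size-medium.a-color-price.header-price')
  return elems[0].text.strip() if elems else None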

Controlling the Browser with the Selenium Module

Selenium is a suite of tools (Selenium + WebDriver) for automating web browsers across many platforms. To install the Selenium module, run pip install selenium. You also need to download the latest geckodriver executable from here to drive the latest Firefox with Selenium. The WebDriver executable must be on your PATH environment variable.

To import Selenium

In [22]:
from selenium import webdriver

To open the browser, run

In [23]:
browser = webdriver.Firefox()
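
If geckodriver is not on your PATH, Selenium 3.x (the API version these notes use) also accepts an explicit path. A sketch, where the path is only a placeholder:

from selenium import webdriver

# Placeholder path; point it at wherever you saved geckodriver.
browser = webdriver.Firefox(executable_path='C:\\path\\to\\geckodriver.exe')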

To send the browser to a URL

In [24]:
browser.get('https://automatetheboringstuff.com')

The browser.find_element_by_css_selector() method returns a single WebElement object, while browser.find_elements_by_css_selector() returns a list of WebElement objects

In [25]:
elem = browser.find_element_by_css_selector('.main > div:nth-child(1) > ul:nth-child(18) > li:nth-child(1) > a:nth-child(1)')

elems = browser.find_elements_by_css_selector('p')
len(elems)
Out[25]:
8

To get the CSS selector string:

  • Chrome: right-click the element in the developer tools & select Copy > Copy selector
  • Firefox: right-click the element in the developer tools & select Copy > CSS Selector

The click() method will click on an element in a browser

In [26]:
elem.click()

Selenium’s WebDriver Methods for Finding Element[s]

Each method below returns a WebElement object (or a list of them, for the plural form):

  • browser.find_element[s]_by_class_name(name): Elements that use the CSS class name
  • browser.find_element[s]_by_css_selector(selector): Elements that match the CSS selector
  • browser.find_element[s]_by_id(id): Elements with a matching id attribute
  • browser.find_element[s]_by_link_text(text): <a> elements that completely match the text provided
  • browser.find_element[s]_by_partial_link_text(text): <a> elements that contain the text provided
  • browser.find_element[s]_by_name(name): Elements with a matching name attribute
  • browser.find_element[s]_by_tag_name(name): Elements with a matching tag name (case insensitive; an <a> element is matched by 'a' and 'A')
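
If nothing matches, the singular find_element_* methods raise a NoSuchElementException, while the plural find_elements_* methods return an empty list. A minimal sketch of handling the exception (the link text is only an example; substitute a link that actually exists on the page):

from selenium.common.exceptions import NoSuchElementException

try:
  elem = browser.find_element_by_link_text('Read the free online version')  # example link text
except NoSuchElementException:
  print('Could not find an element with that link text.')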

The send_keys() method will type into a specific element in the browser, usually an input field for search, login, etc.

In [29]:
# searchElem = browser.find_element_by_css_selector('.search-field')
# searchElem.send_keys('zophie')

The submit() method will simulate clicking on the Submit button of a form

In [30]:
# searchElem.submit()
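
Putting send_keys() and submit() together, a minimal sketch of a search interaction, assuming the site still exposes a '.search-field' input (the selector from the commented cells above):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://automatetheboringstuff.com')
searchElem = browser.find_element_by_css_selector('.search-field')  # the search box
searchElem.send_keys('zophie')  # type the search term
searchElem.submit()             # submit the enclosing form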

The browser can also be controlled with these commands:

In [31]:
browser.back()
browser.forward()
browser.refresh()
browser.quit()

Example: Grab a paragraph

In [32]:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://automatetheboringstuff.com')
elem = browser.find_element_by_css_selector('.main > div:nth-child(1) > p:nth-child(7)')
elem.text
Out[32]:
"In Automate the Boring Stuff with Python, you'll learn how to use Python to write programs that do in minutes what would take you hours to do by hand-no prior programming experience required. Once you've mastered the basics of programming, you'll create Python programs that effortlessly perform useful and impressive feats of automation to:"

Example: Grab the entire web page

In [34]:
# elem = browser.find_element_by_css_selector('html')
# elem.text

To learn more about Selenium, read this doc or the rest of Chapter 11 of the book