Excel, Words and PDF Documents

Reading Excel Spreadsheets

The OpenPyXL third-party module handles Excel spreadsheets (.xlsx files). The versiion of OpenPyXL was 2.1.4. You can install this by running pip install openpyxl==2.1.4

Free Alternatives to Excel

This Example spreadsheet (XLSX) will be used throughout the session

To install OpenPyXL

In [1]:
# pip install openpyxl

openpyxl.load_workbook(filename) returns a Workbook object

In [2]:
import openpyxl, os
os.chdir('z:\\it\\python\\General Python\\snippets')

workbook = openpyxl.load_workbook('example.xlsx')
type(workbook)
Out[2]:
openpyxl.workbook.workbook.Workbook

Use wb[sheetname] and workbook.sheetnames to get Worksheet objects

In [3]:
sheet = workbook['Sheet1']
type(sheet)
Out[3]:
openpyxl.worksheet.worksheet.Worksheet
In [4]:
workbook.sheetnames
Out[4]:
['Sheet1', 'Sheet2', 'Sheet3']

The square brakcets in sheet['a1'] get Cell objects

In [5]:
sheet['A1']
Out[5]:
<Cell 'Sheet1'.A1>
In [6]:
sheet['a1']
Out[6]:
<Cell 'Sheet1'.A1>

Cell objects have a .value member variable with the content of that cell

In [7]:
cell = sheet['a1']
cell.value
Out[7]:
datetime.datetime(2015, 4, 5, 13, 34, 2)
In [8]:
str(cell.value)
Out[8]:
'2015-04-05 13:34:02'
In [9]:
str(sheet['a1'].value)
Out[9]:
'2015-04-05 13:34:02'

The cell() method also returns a Cell object from a sheet

In [10]:
sheet.cell(row=1, column=2)
Out[10]:
<Cell 'Sheet1'.B1>
In [11]:
for i in range(1, 8):
      print(i, sheet.cell(row=i, column=2).value)
1 Apples
2 Cherries
3 Pears
4 Oranges
5 Apples
6 Bananas
7 Strawberries

Create and Edit Excel Spreadsheets

Changing a cell's value is done using the square brackets [], just like changing a value in a list or dictionary

In [12]:
import openpyxl
wb = openpyxl.Workbook()
wb
Out[12]:
<openpyxl.workbook.workbook.Workbook at 0x811ae70>
In [13]:
wb.sheetnames
Out[13]:
['Sheet']
In [14]:
sheet = wb['Sheet']
sheet
Out[14]:
<Worksheet "Sheet">
In [15]:
sheet['a1'].value == None
Out[15]:
True
In [16]:
sheet['a1'] = 42
sheet['a2'] = 'Hello'

Changes you make to the workbook object can be saved with the save() method

In [17]:
import os
os.chdir('z:\\it\\python\\General Python\\snippets')
wb.save('example1.xlsx')

You can create a new sheet with wb_create_sheet()

In [18]:
sheet2 = wb.create_sheet()
wb.sheetnames
Out[18]:
['Sheet', 'Sheet1']

You can view and rename a sheet's name with its title member veriable

In [19]:
sheet2.title
Out[19]:
'Sheet1'
In [20]:
sheet2.title = 'My New Sheet Name'
wb.sheetnames
Out[20]:
['Sheet', 'My New Sheet Name']
In [21]:
wb.save('example2.xlsx')

Create sheet and put it in front

In [22]:
wb.create_sheet(index=0, title='My Other Sheet')
Out[22]:
<Worksheet "My Other Sheet">
In [23]:
wb.save('example3.xlsx')

Reading and Editing PDF

PDF files are binary files and it's much more complex than plaintext files. The PyPDF2 module can read and write PDFs.

To install PyPDF2, be sure to get the capitalization correct pip install PyPDF2

PDF file 1 used in this section PDF file 2 used in this section

SumatraPDF

In [24]:
import PyPDF2, os
os.chdir('z:\\it\\python\\General Python\\snippets')

Opening a PDF is done by calling open() and passing the File object to PdfFileReader(). Be sure to use the read binary mode rb.

In [25]:
pdfFile = open('meetingminutes1.pdf', 'rb')
reader = PyPDF2.PdfFileReader(pdfFile)

You can find the number of pages in a PDF focument by looking at the numPages member variable

In [26]:
reader.numPages
Out[26]:
19

A Page object can be obtained from the PDF object with the getPage() method

In [27]:
page = reader.getPage(0)
page.extractText()
Out[27]:
'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of \nMarch 7\n, 2014\n        \n     The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n  '

The text from a Page object is obtained with the extractText() method, which can be imperfect To print all pages in a PDF

In [28]:
# for pageNum in range(reader.numPages):
  # print(reader.getPage(pageNum).extractText())

Combines PDF documents into a single document

New PDFs can be made from PdfFileWriter(). New pages can be appended to a Writer object with the addPage() method. Call the write() method to save its changes. Be sure to use the write binary mode wb.

In [29]:
import PyPDF2
pdfFile1 = open('meetingminutes1.pdf', 'rb')
pdfFile2 = open('meetingminutes2.pdf', 'rb')
reader1 = PyPDF2.PdfFileReader(pdfFile1)
reader2 = PyPDF2.PdfFileReader(pdfFile2)
writer = PyPDF2.PdfFileWriter()
for pageNum in range(reader1.numPages):
    page = reader1.getPage(pageNum)
    writer.addPage(page)

for pageNum in range(reader2.numPages):
    page = reader2.getPage(pageNum)
    writer.addPage(page)

outputFile = open('combinedMinutes.pdf', 'wb')
writer.write(outputFile)
outputFile.close()
pdfFile1.close()
pdfFile2.close()

Reading and Editing Word Documents

Python-Docx can read and write .docx Word files.

To install Python-Docx: pip install python-docx. The file used in this session: http://autbor.com/demo.docx

To import Python-Docx

In [30]:
import docx

Document object contains Paragraph objects. Paragraph objects contain Run objects.

Open a Word file with docx.Document()

In [31]:
d = docx.Document('z:\\it\\python\\General Python\\snippets\\demo.docx')

Access one of the Paragraph Objects from the paragraphs member variable, which is a list of Paragraph objects.

In [32]:
d.paragraphs
Out[32]:
[<docx.text.paragraph.Paragraph at 0x83dd8b0>,
 <docx.text.paragraph.Paragraph at 0x83dda50>,
 <docx.text.paragraph.Paragraph at 0x83dd8f0>,
 <docx.text.paragraph.Paragraph at 0x83dd910>,
 <docx.text.paragraph.Paragraph at 0x83dd950>,
 <docx.text.paragraph.Paragraph at 0x83dd9d0>,
 <docx.text.paragraph.Paragraph at 0x83dda10>]

Paragraph objects have a text member variable containing the text as a string value

In [33]:
d.paragraphs[0].text
Out[33]:
'Document Title'
In [34]:
d.paragraphs[1].text
Out[34]:
'A plain paragraph having some bold and some italic.'
In [35]:
p = d.paragraphs[1]

Paragraphs are composed of **"runs". The runs member variable of a Paragraph object contains a list of Run objects

In [36]:
p.runs # a new run when change of style
Out[36]:
[<docx.text.run.Run at 0x83ddc90>,
 <docx.text.run.Run at 0x83dd890>,
 <docx.text.run.Run at 0x83ddf30>,
 <docx.text.run.Run at 0x83ddab0>]

Run objects also have a text member variable

In [37]:
p.runs[0].text
Out[37]:
'A plain paragraph having some '
In [38]:
p.runs[1].text
Out[38]:
'bold'
In [39]:
p.runs[2].text
Out[39]:
' and some '
In [40]:
p.runs[3].text
Out[40]:
'italic.'

Run objects have a bold, italic, and underline member variables which can be set to True or False

In [41]:
p.runs[1].bold
Out[41]:
True
In [42]:
p.runs[0].bold == None
Out[42]:
True
In [43]:
p.runs[3].italic
Out[43]:
True
In [44]:
p.runs[3].underline = True
In [45]:
p.runs[3].text = 'italic and underline.'
In [46]:
d.save('z:\\it\\python\\General Python\\snippets\\demo2.docx')

Built-in Style

Paragraph and run objects have a style member variable that can be set to one of Word's built-in styles. To bring up the Styles Window in Word: Ctrl+Alt+Shift+S, click a section in the doc to see what style is used.

To change the style

In [47]:
p.style
Out[47]:
_ParagraphStyle('Normal') id: 138208016
In [48]:
p.style = 'Title'
d.save('z:\\it\\python\\General Python\\snippets\\demo3.docx')

Word files can be created by calling add_paragraph()

In [49]:
d = docx.Document()
d.add_paragraph('Hello this is a paragraph')
Out[49]:
<docx.text.paragraph.Paragraph at 0x83ddd50>
In [50]:
d.add_paragraph('Hello this is another paragraph')
Out[50]:
<docx.text.paragraph.Paragraph at 0x8403050>
In [51]:
d.save('z:\\it\\python\\General Python\\snippets\\demo4.docx')

Also, Word files can be created by calling add_run() to append text content

In [52]:
p = d.paragraphs[0]
p.add_run('This is a new run')
Out[52]:
<docx.text.run.Run at 0x84033b0>
In [53]:
p.runs[1].bold = True
d.save('z:\\it\\python\\General Python\\snippets\\demo5.docx')

Unfortunately, there is no way of inserting paragraph and run objects anywhere except at the end.

Script: Get all text inside a document

getTextfromWord.py

In [54]:
import docx

def getText(filename):
  doc = docx.Document(filename)
  fullText = []

  for para in doc.paragraphs:
    fullText.append(para.text)

  return '\n'.join(fullText)

print(getText('z:\\it\\python\\General Python\\snippets\\demo.docx'))
Document Title
A plain paragraph having some bold and some italic.
Heading, level 1
Intense quote
first item in unordered list
first item in ordered list