'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of \nMarch 7\n, 2014\n        \n     The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n  '

The text of a Page object is obtained with the extractText() method. The extraction can be imperfect, as the output above shows. To print the text of every page in a PDF:

# for pageNum in range(reader.numPages):
#     print(reader.getPage(pageNum).extractText())
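The loop can also be wrapped in a small helper that collects the text instead of printing it. This is a minimal sketch, assuming the legacy PyPDF2 interface used in this section (numPages, getPage(), extractText()). The name extract_all_text is hypothetical and not part of PyPDF2.

```python
def extract_all_text(reader):
    """Return the extracted text of every page, in page order.

    `reader` is assumed to be a PyPDF2.PdfFileReader (legacy API), or any
    object exposing numPages and getPage(). extract_all_text is a
    hypothetical helper name, not a PyPDF2 function.
    """
    pages = []
    for page_num in range(reader.numPages):
        # extractText() may return garbled or empty text for some PDFs.
        pages.append(reader.getPage(page_num).extractText())
    return pages
```

Joining the result, for example '\n'.join(extract_all_text(reader)), gives the whole document as one string.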

Combining PDF documents into a single document¶

A new PDF is built with a PdfFileWriter object. Pages are appended to the writer with its addPage() method, and its write() method saves the result to a file object. That file must be opened in write-binary mode ('wb').

import PyPDF2
pdfFile1 = open('meetingminutes1.pdf', 'rb')
pdfFile2 = open('meetingminutes2.pdf', 'rb')
reader1 = PyPDF2.PdfFileReader(pdfFile1)
reader2 = PyPDF2.PdfFileReader(pdfFile2)
writer = PyPDF2.PdfFileWriter()
for pageNum in range(reader1.numPages):
    page = reader1.getPage(pageNum)
    writer.addPage(page)

for pageNum in range(reader2.numPages):
    page = reader2.getPage(pageNum)
    writer.addPage(page)

outputFile = open('combinedMinutes.pdf', 'wb')
writer.write(outputFile)
outputFile.close()
pdfFile1.close()
pdfFile2.close()
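The same steps generalize to any number of input files. The sketch below assumes the legacy PyPDF2 interface used above. The names append_all_pages and merge_pdfs are hypothetical helpers, not PyPDF2 functions.

```python
def append_all_pages(reader, writer):
    """Copy every page of `reader` to the end of `writer`; return the count."""
    for page_num in range(reader.numPages):
        writer.addPage(reader.getPage(page_num))
    return reader.numPages


def merge_pdfs(input_paths, output_path):
    """Concatenate the PDFs in `input_paths`, in order, into `output_path`."""
    import PyPDF2  # imported here so append_all_pages can be used on its own

    writer = PyPDF2.PdfFileWriter()
    open_files = []
    try:
        for path in input_paths:
            pdf_file = open(path, 'rb')
            open_files.append(pdf_file)
            append_all_pages(PyPDF2.PdfFileReader(pdf_file), writer)
        # The source files stay open until write() has finished.
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)
    finally:
        for pdf_file in open_files:
            pdf_file.close()
```

With the two files from this section, the call would be merge_pdfs(['meetingminutes1.pdf', 'meetingminutes2.pdf'], 'combinedMinutes.pdf').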

Reading and Editing Word Documents ¶

Python-Docx can read and write .docx Word files.

To install Python-Docx: pip install python-docx. The file used in this session: http://autbor.com/demo.docx

To import Python-Docx:

import docx

A Document object contains Paragraph objects, and each Paragraph object contains Run objects.

Open a Word file with docx.Document()

d = docx.Document('z:\\it\\python\\General Python\\snippets\\demo.docx')

Individual Paragraph objects are accessed through the paragraphs member variable, which is a list of Paragraph objects.

d.paragraphs

[<docx.text.paragraph.Paragraph at 0x83dd8b0>,
 <docx.text.paragraph.Paragraph at 0x83dda50>,
 <docx.text.paragraph.Paragraph at 0x83dd8f0>,
 <docx.text.paragraph.Paragraph at 0x83dd910>,
 <docx.text.paragraph.Paragraph at 0x83dd950>,
 <docx.text.paragraph.Paragraph at 0x83dd9d0>,
 <docx.text.paragraph.Paragraph at 0x83dda10>]

Paragraph objects have a text member variable containing the paragraph's text as a string value.

d.paragraphs[0].text

'Document Title'

d.paragraphs[1].text

'A plain paragraph having some bold and some italic.'

p = d.paragraphs[1]

Paragraphs are composed of "runs". The runs member variable of a Paragraph object contains a list of Run objects.

p.runs  # a new run starts wherever the style changes

[<docx.text.run.Run at 0x83ddc90>,
 <docx.text.run.Run at 0x83dd890>,
 <docx.text.run.Run at 0x83ddf30>,
 <docx.text.run.Run at 0x83ddab0>]

Run objects also have a text member variable

p.runs[0].text

'A plain paragraph having some '

p.runs[1].text

'bold'

p.runs[2].text

' and some '

p.runs[3].text

'italic.'

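Run objects have bold, italic, and underline member variables, which can be set to True or False. A value of None means the setting is not specified on the run itself and is inherited from the style.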

p.runs[1].bold

True

p.runs[0].bold is None

True

p.runs[3].italic

True

p.runs[3].underline = True

p.runs[3].text = 'italic and underline.'

d.save('z:\\it\\python\\General Python\\snippets\\demo2.docx')
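To see at a glance which runs of a paragraph carry formatting of their own, the attributes can be checked run by run. This is a minimal sketch. The name explicit_formatting is hypothetical, and it only reports attributes that are exactly True, so values of None (nothing set on the run) and any other underline value are ignored.

```python
def explicit_formatting(paragraph):
    """List (text, flags) for each run of `paragraph`.

    `flags` names the attributes among bold, italic, and underline that
    are exactly True on the run. Attributes that are False or None are
    left out. explicit_formatting is a hypothetical helper name.
    """
    result = []
    for run in paragraph.runs:
        flags = [name for name in ('bold', 'italic', 'underline')
                 if getattr(run, name) is True]
        result.append((run.text, flags))
    return result
```

Calling explicit_formatting(p) on the paragraph above returns one tuple per run, in order.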

Built-in Style¶

Paragraph and Run objects have a style member variable that can be set to one of Word's built-in styles. To see which style a passage uses, open the Styles window in Word (Ctrl+Alt+Shift+S) and click in that passage.

To inspect and then change the style of a paragraph:

p.style

_ParagraphStyle('Normal') id: 138208016

p.style = 'Title'
d.save('z:\\it\\python\\General Python\\snippets\\demo3.docx')
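The style can also be used to pick out parts of a document, such as all headings of one level. A minimal sketch, assuming each paragraph's style object exposes its name through a name attribute. The helper name paragraphs_by_style is hypothetical.

```python
def paragraphs_by_style(doc, style_name):
    """Return the text of every paragraph whose style is named `style_name`.

    `doc` is assumed to be a docx Document (or any object with a
    paragraphs list). paragraphs_by_style is a hypothetical helper name.
    """
    return [para.text for para in doc.paragraphs
            if para.style.name == style_name]
```

For example, paragraphs_by_style(d, 'Title') lists the text of the paragraphs that use the Title style.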

A new Word file is created by calling docx.Document() with no arguments. Paragraphs are then added with add_paragraph().

d = docx.Document()
d.add_paragraph('Hello this is a paragraph')

<docx.text.paragraph.Paragraph at 0x83ddd50>

d.add_paragraph('Hello this is another paragraph')

<docx.text.paragraph.Paragraph at 0x8403050>

d.save('z:\\it\\python\\General Python\\snippets\\demo4.docx')

Text can also be appended to an existing paragraph by calling add_run().

p = d.paragraphs[0]
p.add_run('This is a new run')

<docx.text.run.Run at 0x84033b0>

p.runs[1].bold = True
d.save('z:\\it\\python\\General Python\\snippets\\demo5.docx')
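The two calls can be combined to build a paragraph with mixed formatting in one step. The following is a sketch only. The name add_mixed_paragraph is hypothetical, and it assumes add_paragraph() may be called without text to create an empty paragraph.

```python
def add_mixed_paragraph(doc, pieces):
    """Append one paragraph built from (text, bold) pairs, one run per pair.

    `doc` is assumed to be a docx Document. add_mixed_paragraph is a
    hypothetical helper name, not part of python-docx.
    """
    para = doc.add_paragraph()
    for text, bold in pieces:
        run = para.add_run(text)
        run.bold = bold  # True, False, or None to leave it unspecified
    return para
```

For example, add_mixed_paragraph(d, [('Plain, then ', None), ('bold', True)]) appends a paragraph with two runs, the second of them bold.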

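Note that add_paragraph() and add_run() only append at the end of the document or paragraph. To place a paragraph earlier, Paragraph objects provide the insert_paragraph_before() method. There is no equivalent method for inserting a run in the middle of a paragraph.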

Script: Get all text inside a document¶

getTextfromWord.py

import docx

def getText(filename):
    """Return the text of every paragraph in a .docx file, one per line."""
    doc = docx.Document(filename)
    fullText = []

    for para in doc.paragraphs:
        fullText.append(para.text)

    return '\n'.join(fullText)

print(getText('z:\\it\\python\\General Python\\snippets\\demo.docx'))

Document Title
A plain paragraph having some bold and some italic.
Heading, level 1
Intense quote
first item in unordered list
first item in ordered list
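The paragraphs list does not include text that sits inside tables. If a document contains tables, a variant along the following lines could collect that text as well. It is a sketch under stated assumptions: the name get_text_with_tables is hypothetical, it takes an already opened Document object, and it appends the table text after the paragraph text instead of keeping the original document order.

```python
def get_text_with_tables(doc):
    """Return paragraph text followed by the text of every table row.

    `doc` is assumed to be a docx Document, e.g. docx.Document(filename).
    get_text_with_tables is a hypothetical helper name.
    """
    lines = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # One line per row, with the cell texts separated by tabs.
            lines.append('\t'.join(cell.text for cell in row.cells))
    return '\n'.join(lines)
```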
