The OpenPyXL third-party module handles Excel spreadsheets (.xlsx files). The versiion of OpenPyXL was 2.1.4. You can install this by running pip install openpyxl==2.1.4
Free Alternatives to Excel
This Example spreadsheet (XLSX) will be used throughout the session
To install OpenPyXL
# pip install openpyxl
openpyxl.load_workbook(filename) returns a Workbook object
import openpyxl, os
os.chdir('z:\\it\\python\\General Python\\snippets')
workbook = openpyxl.load_workbook('example.xlsx')
type(workbook)
Use wb[sheetname] and workbook.sheetnames to get Worksheet objects
sheet = workbook['Sheet1']
type(sheet)
workbook.sheetnames
The square brakcets in sheet['a1'] get Cell objects
sheet['A1']
sheet['a1']
Cell objects have a .value member variable with the content of that cell
cell = sheet['a1']
cell.value
str(cell.value)
str(sheet['a1'].value)
The cell() method also returns a Cell object from a sheet
sheet.cell(row=1, column=2)
for i in range(1, 8):
print(i, sheet.cell(row=i, column=2).value)
Changing a cell's value is done using the square brackets [], just like changing a value in a list or dictionary
import openpyxl
wb = openpyxl.Workbook()
wb
wb.sheetnames
sheet = wb['Sheet']
sheet
sheet['a1'].value == None
sheet['a1'] = 42
sheet['a2'] = 'Hello'
Changes you make to the workbook object can be saved with the save() method
import os
os.chdir('z:\\it\\python\\General Python\\snippets')
wb.save('example1.xlsx')
You can create a new sheet with wb_create_sheet()
sheet2 = wb.create_sheet()
wb.sheetnames
You can view and rename a sheet's name with its title member veriable
sheet2.title
sheet2.title = 'My New Sheet Name'
wb.sheetnames
wb.save('example2.xlsx')
Create sheet and put it in front
wb.create_sheet(index=0, title='My Other Sheet')
wb.save('example3.xlsx')
PDF files are binary files and it's much more complex than plaintext files. The PyPDF2 module can read and write PDFs.
To install PyPDF2, be sure to get the capitalization correct
pip install PyPDF2
PDF file 1 used in this section PDF file 2 used in this section
SumatraPDF
import PyPDF2, os
os.chdir('z:\\it\\python\\General Python\\snippets')
Opening a PDF is done by calling open() and passing the File object to PdfFileReader(). Be sure to use the read binary mode rb.
pdfFile = open('meetingminutes1.pdf', 'rb')
reader = PyPDF2.PdfFileReader(pdfFile)
You can find the number of pages in a PDF focument by looking at the numPages member variable
reader.numPages
A Page object can be obtained from the PDF object with the getPage() method
page = reader.getPage(0)
page.extractText()
The text from a Page object is obtained with the extractText() method, which can be imperfect
To print all pages in a PDF
# for pageNum in range(reader.numPages):
# print(reader.getPage(pageNum).extractText())
New PDFs can be made from PdfFileWriter(). New pages can be appended to a Writer object with the addPage() method. Call the write() method to save its changes. Be sure to use the write binary mode wb.
import PyPDF2
pdfFile1 = open('meetingminutes1.pdf', 'rb')
pdfFile2 = open('meetingminutes2.pdf', 'rb')
reader1 = PyPDF2.PdfFileReader(pdfFile1)
reader2 = PyPDF2.PdfFileReader(pdfFile2)
writer = PyPDF2.PdfFileWriter()
for pageNum in range(reader1.numPages):
page = reader1.getPage(pageNum)
writer.addPage(page)
for pageNum in range(reader2.numPages):
page = reader2.getPage(pageNum)
writer.addPage(page)
outputFile = open('combinedMinutes.pdf', 'wb')
writer.write(outputFile)
outputFile.close()
pdfFile1.close()
pdfFile2.close()
Python-Docx can read and write .docx Word files.
To install Python-Docx: pip install python-docx. The file used in this session: http://autbor.com/demo.docx
To import Python-Docx
import docx
Document object contains Paragraph objects. Paragraph objects contain Run objects.
Open a Word file with docx.Document()
d = docx.Document('z:\\it\\python\\General Python\\snippets\\demo.docx')
Access one of the Paragraph Objects from the paragraphs member variable, which is a list of Paragraph objects.
d.paragraphs
Paragraph objects have a text member variable containing the text as a string value
d.paragraphs[0].text
d.paragraphs[1].text
p = d.paragraphs[1]
Paragraphs are composed of **"runs". The runs member variable of a Paragraph object contains a list of Run objects
p.runs # a new run when change of style
Run objects also have a text member variable
p.runs[0].text
p.runs[1].text
p.runs[2].text
p.runs[3].text
Run objects have a bold, italic, and underline member variables which can be set to True or False
p.runs[1].bold
p.runs[0].bold == None
p.runs[3].italic
p.runs[3].underline = True
p.runs[3].text = 'italic and underline.'
d.save('z:\\it\\python\\General Python\\snippets\\demo2.docx')
Paragraph and run objects have a style member variable that can be set to one of Word's built-in styles. To bring up the Styles Window in Word: Ctrl+Alt+Shift+S, click a section in the doc to see what style is used.
To change the style
p.style
p.style = 'Title'
d.save('z:\\it\\python\\General Python\\snippets\\demo3.docx')
Word files can be created by calling add_paragraph()
d = docx.Document()
d.add_paragraph('Hello this is a paragraph')
d.add_paragraph('Hello this is another paragraph')
d.save('z:\\it\\python\\General Python\\snippets\\demo4.docx')
Also, Word files can be created by calling add_run() to append text content
p = d.paragraphs[0]
p.add_run('This is a new run')
p.runs[1].bold = True
d.save('z:\\it\\python\\General Python\\snippets\\demo5.docx')
Unfortunately, there is no way of inserting paragraph and run objects anywhere except at the end.
getTextfromWord.py
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
print(getText('z:\\it\\python\\General Python\\snippets\\demo.docx'))