The OpenPyXL
third-party module handles Excel spreadsheets (.xlsx files). The versiion of OpenPyXL was 2.1.4. You can install this by running pip install openpyxl==2.1.4
Free Alternatives to Excel
This Example spreadsheet (XLSX) will be used throughout the session
To install OpenPyXL
# pip install openpyxl
openpyxl.load_workbook(filename)
returns a Workbook object
import openpyxl, os
os.chdir('z:\\it\\python\\General Python\\snippets')
workbook = openpyxl.load_workbook('example.xlsx')
type(workbook)
Use wb[sheetname]
and workbook.sheetnames
to get Worksheet objects
sheet = workbook['Sheet1']
type(sheet)
workbook.sheetnames
The square brakcets in sheet['a1']
get Cell objects
sheet['A1']
sheet['a1']
Cell objects have a .value
member variable with the content of that cell
cell = sheet['a1']
cell.value
str(cell.value)
str(sheet['a1'].value)
The cell()
method also returns a Cell object from a sheet
sheet.cell(row=1, column=2)
for i in range(1, 8):
print(i, sheet.cell(row=i, column=2).value)
Changing a cell's value is done using the square brackets []
, just like changing a value in a list or dictionary
import openpyxl
wb = openpyxl.Workbook()
wb
wb.sheetnames
sheet = wb['Sheet']
sheet
sheet['a1'].value == None
sheet['a1'] = 42
sheet['a2'] = 'Hello'
Changes you make to the workbook object can be saved with the save()
method
import os
os.chdir('z:\\it\\python\\General Python\\snippets')
wb.save('example1.xlsx')
You can create a new sheet with wb_create_sheet()
sheet2 = wb.create_sheet()
wb.sheetnames
You can view and rename a sheet's name with its title
member veriable
sheet2.title
sheet2.title = 'My New Sheet Name'
wb.sheetnames
wb.save('example2.xlsx')
Create sheet and put it in front
wb.create_sheet(index=0, title='My Other Sheet')
wb.save('example3.xlsx')
PDF files are binary files and it's much more complex than plaintext files. The PyPDF2 module
can read and write PDFs.
To install PyPDF2
, be sure to get the capitalization correct
pip install PyPDF2
PDF file 1 used in this section PDF file 2 used in this section
SumatraPDF
import PyPDF2, os
os.chdir('z:\\it\\python\\General Python\\snippets')
Opening a PDF is done by calling open()
and passing the File object to PdfFileReader()
. Be sure to use the read binary mode rb
.
pdfFile = open('meetingminutes1.pdf', 'rb')
reader = PyPDF2.PdfFileReader(pdfFile)
You can find the number of pages in a PDF focument by looking at the numPages
member variable
reader.numPages
A Page object can be obtained from the PDF object with the getPage()
method
page = reader.getPage(0)
page.extractText()
The text from a Page object is obtained with the extractText()
method, which can be imperfect
To print all pages in a PDF
# for pageNum in range(reader.numPages):
# print(reader.getPage(pageNum).extractText())
New PDFs can be made from PdfFileWriter()
. New pages can be appended to a Writer object with the addPage()
method. Call the write()
method to save its changes. Be sure to use the write binary mode wb
.
import PyPDF2
pdfFile1 = open('meetingminutes1.pdf', 'rb')
pdfFile2 = open('meetingminutes2.pdf', 'rb')
reader1 = PyPDF2.PdfFileReader(pdfFile1)
reader2 = PyPDF2.PdfFileReader(pdfFile2)
writer = PyPDF2.PdfFileWriter()
for pageNum in range(reader1.numPages):
page = reader1.getPage(pageNum)
writer.addPage(page)
for pageNum in range(reader2.numPages):
page = reader2.getPage(pageNum)
writer.addPage(page)
outputFile = open('combinedMinutes.pdf', 'wb')
writer.write(outputFile)
outputFile.close()
pdfFile1.close()
pdfFile2.close()
Python-Docx can read and write .docx Word files.
To install Python-Docx: pip install python-docx
. The file used in this session: http://autbor.com/demo.docx
To import Python-Docx
import docx
Document object contains Paragraph objects. Paragraph objects contain Run objects.
Open a Word file with docx.Document()
d = docx.Document('z:\\it\\python\\General Python\\snippets\\demo.docx')
Access one of the Paragraph Objects from the paragraphs
member variable, which is a list of Paragraph objects.
d.paragraphs
Paragraph objects have a text
member variable containing the text as a string value
d.paragraphs[0].text
d.paragraphs[1].text
p = d.paragraphs[1]
Paragraphs are composed of **"runs". The runs
member variable of a Paragraph object contains a list of Run objects
p.runs # a new run when change of style
Run objects also have a text
member variable
p.runs[0].text
p.runs[1].text
p.runs[2].text
p.runs[3].text
Run objects have a bold
, italic
, and underline
member variables which can be set to True
or False
p.runs[1].bold
p.runs[0].bold == None
p.runs[3].italic
p.runs[3].underline = True
p.runs[3].text = 'italic and underline.'
d.save('z:\\it\\python\\General Python\\snippets\\demo2.docx')
Paragraph and run objects have a style
member variable that can be set to one of Word's built-in styles. To bring up the Styles Window in Word: Ctrl+Alt+Shift+S
, click a section in the doc to see what style is used.
To change the style
p.style
p.style = 'Title'
d.save('z:\\it\\python\\General Python\\snippets\\demo3.docx')
Word files can be created by calling add_paragraph()
d = docx.Document()
d.add_paragraph('Hello this is a paragraph')
d.add_paragraph('Hello this is another paragraph')
d.save('z:\\it\\python\\General Python\\snippets\\demo4.docx')
Also, Word files can be created by calling add_run()
to append text content
p = d.paragraphs[0]
p.add_run('This is a new run')
p.runs[1].bold = True
d.save('z:\\it\\python\\General Python\\snippets\\demo5.docx')
Unfortunately, there is no way of inserting paragraph and run objects anywhere except at the end.
getTextfromWord.py
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
print(getText('z:\\it\\python\\General Python\\snippets\\demo.docx'))