In this lecture we will learn about pandas' built-in capabilities for data visualization! They are built on top of matplotlib, but baked into pandas for easier usage!
Let's take a look!
import numpy as np
import pandas as pd
%matplotlib inline
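Your graphs will look a little bit nicer if you import seaborn and use its style:
import matplotlib.pyplot as plt  # plt must be imported before plt.style.use() is called
import seaborn as sns
plt.style.use("seaborn")  # this style is named "seaborn-v0_8" in matplotlib 3.6 and later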
There are some fake data csv files you can read in as dataframes:
df1 is a time series dataset.
df1 = pd.read_csv('df1',index_col=0)
df2 is a non-time-series dataset.
df2 = pd.read_csv('df2')
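If the CSV files are not at hand, the rest of the lecture can still be followed with synthetic stand-ins. The sketch below is only an assumption about what the data looks like: the column names (A to D, a to d), the row counts, and the daily date index are hypothetical choices made to match how df1 and df2 are used later on.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the lecture's CSV files; shapes and column
# names are assumptions, not the contents of the real files.
rng = np.random.default_rng(0)

# Time-series frame: a daily DatetimeIndex with four numeric columns.
dates = pd.date_range("2000-01-01", periods=1000, freq="D")
df1 = pd.DataFrame(rng.standard_normal((1000, 4)), index=dates, columns=list("ABCD"))

# Non-time-series frame: default integer index, values between 0 and 1.
df2 = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

print(df1.head())
print(df2.head())
```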
Matplotlib has style sheets you can use to make your plots look a little nicer. These style sheets include "classic", "bmh", "fivethirtyeight", "ggplot", "seaborn-white" (recommended) and more. They basically create a set of style rules that your plots follow. I recommend using them: they give all your plots the same look and make them feel more professional. You can even create your own if you want your company's plots to all have the same look (it is a bit tedious to create one, though).
Here is how to use them.
Before using style
df1['A'].hist()
After using style
Use plt.style.use() to apply the ggplot style:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Now your plots look like this:
df1['A'].hist()
Style: bmh
plt.style.use('bmh')
df1['A'].hist()
Style: dark_background
plt.style.use('dark_background')
df1['A'].hist()
Style: fivethirtyeight
plt.style.use('fivethirtyeight')
df1['A'].hist()
Style: ggplot
plt.style.use('ggplot')
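Two related matplotlib features are worth knowing here: plt.style.available lists the style names your installed version provides, and plt.style.context applies a style to a single block of code without changing the global setting. The sketch below uses synthetic data, and the variable names are just for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# List the style sheet names available in the installed matplotlib version.
available_styles = sorted(plt.style.available)
print(available_styles)

# Synthetic data so the sketch runs on its own.
values = pd.Series(np.random.default_rng(0).standard_normal(500), name="A")

# Apply a style to this one plot only; the previous settings are restored
# when the with-block exits.
with plt.style.context("ggplot"):
    ax = values.hist()
```

Because the exact names vary between matplotlib versions, checking plt.style.available first is a safe way to avoid an error from an unknown style name.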
Let's stick with the ggplot style and actually show you how to utilize pandas' built-in plotting capabilities!
There are two ways to create a plot in pandas. You can call a specific method on the plot accessor:
df.plot.hist()
which is equivalent to calling the general df.plot() method with a kind argument:
df.plot(kind='hist')
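There are several plot types built into pandas, most of them statistical plots by nature: area, bar, barh, box, density, hexbin, hist, kde, line, pie, and scatter. Each one is available as a method on df.plot (for example df.plot.area()).
You can also just call df.plot(kind='hist') and replace the kind argument with any of the names listed above (e.g. 'box', 'barh', etc.).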
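To see that the two call styles really do the same thing, here is a minimal sketch on synthetic data (the frame and its column name are assumptions for illustration). Both calls return a matplotlib Axes, and the bars they draw have identical heights.

```python
import numpy as np
import pandas as pd

# Synthetic data; the column name is an assumption for illustration.
frame = pd.DataFrame({"A": np.random.default_rng(1).standard_normal(200)})

# Accessor-method form.
ax_method = frame.plot.hist(bins=20)

# Generic form with the kind argument; it draws the same histogram.
ax_kind = frame.plot(kind="hist", bins=20)

# Compare the bar heights drawn by the two calls.
heights_method = [bar.get_height() for bar in ax_method.patches]
heights_kind = [bar.get_height() for bar in ax_kind.patches]
print(heights_method == heights_kind)
```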
Let's start going through them!
df2.plot.area(alpha=0.4)
df2.head()
df2.plot.bar()
df2.plot.bar(stacked=True)
You can pass the bins argument to specify the number of bins you want:
df1['A'].plot.hist(bins=50)
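The effect of bins is easiest to judge side by side. The sketch below draws the same synthetic data with 10 and 50 bins on two matplotlib subplots, using the ax argument to tell pandas where to draw; the data and variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data so the sketch runs on its own.
values = pd.Series(np.random.default_rng(5).standard_normal(1000), name="A")

# Two subplots in one row; pandas draws into whichever Axes is passed as ax.
fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(10, 3))
values.plot.hist(bins=10, ax=ax_coarse, title="bins=10")
values.plot.hist(bins=50, ax=ax_fine, title="bins=50")
```

More bins show finer detail but also more noise, so the right number depends on how many observations you have.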
df1.plot.line(y='B',figsize=(12,3),lw=1)  # x defaults to the index; passing x=df1.index raises an error in recent pandas versions
df1.plot.scatter(x='A',y='B')
You can use c to color the points based on another column's values, and cmap to indicate which colormap to use. For all the colormaps, check out: http://matplotlib.org/users/colormaps.html
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm')
Or use s to set the marker size based on another column. Here s is passed as an array of sizes rather than a column name, which also lets you scale the values (recent pandas versions accept a column name as well):
df1.plot.scatter(x='A',y='B',s=df1['C']*200)
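One practical caveat: marker sizes should be positive, and a column of roughly normal data contains negative values. The sketch below shows one possible way to handle that, rescaling the column linearly into a positive range before passing it as s. The data is synthetic, and the 10 to 200 size range is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for df1; column names mirror the lecture.
rng = np.random.default_rng(2)
points = pd.DataFrame(rng.standard_normal((100, 3)), columns=["A", "B", "C"])

# Rescale column C linearly into a positive range of marker sizes (10 to 200).
c_min, c_max = points["C"].min(), points["C"].max()
sizes = 10 + 190 * (points["C"] - c_min) / (c_max - c_min)

# Size and color both encode column C here.
ax = points.plot.scatter(x="A", y="B", s=sizes, c="C", cmap="coolwarm")
```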
df2.plot.box() # Can also pass a by= argument for groupby
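As a sketch of what grouping looks like, the example below uses the related DataFrame.boxplot method, which has accepted by= for a long time. The frame, its "value" and "group" columns, and the group labels are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical data: one numeric column plus a categorical column to group by.
scores = pd.DataFrame({
    "value": rng.standard_normal(60),
    "group": np.repeat(["x", "y", "z"], 20),
})

# One box per distinct value of "group".
scores.boxplot(column="value", by="group")

# The same grouping expressed numerically, for comparison with the boxes.
group_medians = scores.groupby("group")["value"].median()
print(group_medians)
```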
Hexagonal bin plots are useful for bivariate data, as an alternative to a scatter plot:
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a',y='b',gridsize=25,cmap='Oranges')
df2['a'].plot.kde()
df2.plot.density()
That's it! Hopefully you can see why this method of plotting will be a lot easier to use than full-on matplotlib: it balances ease of use with control over the figure. A lot of the plot calls also accept the additional arguments of their underlying matplotlib plt calls.
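Here is a small sketch of that last point, on synthetic data with assumed names. Keyword arguments such as lw, linestyle, and color are passed through to matplotlib, and the call returns a matplotlib Axes that you can keep customizing.

```python
import numpy as np
import pandas as pd

# Synthetic time series; the column name mirrors the lecture's df1.
dates = pd.date_range("2000-01-01", periods=100, freq="D")
series_frame = pd.DataFrame(
    {"B": np.random.default_rng(4).standard_normal(100)}, index=dates
)

# lw, linestyle and color are matplotlib line properties passed through by pandas.
ax = series_frame.plot.line(y="B", figsize=(12, 3), lw=1, linestyle="--", color="tab:blue")

# The returned object is a regular matplotlib Axes.
ax.set_title("Column B over time")
ax.set_ylabel("B")
```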