Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

Grab the First/Last n Rows with head() or tail()

Use head(n=5) to find the first n rows in the DataFrame. Use tail(n=5) to get the last n rows in the DataFrame. The default is 5 rows (n=5)

In [2]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
Out[2]:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Finding Info on Unique Values

The unique() Method

Use the unique() method to find unique values in a DataFrame

In [3]:
df['col2'].unique()
Out[3]:
array([444, 555, 666], dtype=int64)

The nunique() Method

Use the nunique() method to find the count of unique values in a DataFrame

In [4]:
df['col2'].nunique()
Out[4]:
3

The value_counts() Method

The value_counts() method gives you a table of unique values and how many times these values show up

In [5]:
df['col2'].value_counts()
Out[5]:
444    2
555    1
666    1
Name: col2, dtype: int64

Selecting Data

Pass the conditional selection statement to the DataFrame. The conditional selection statements composed of a list of boolean values [False, False, True, ..., True]

Select from DataFrame using criteria from multiple columns

In [6]:
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
Out[6]:
col1 col2 col3
3 4 444 xyz

The apply() Method

The apply() method enables you to apply your own custom functions or built-in functions to a DataFrame

Applying custom function

In [7]:
def times2(x):
    return x*2

This will broadcast the function to column 1

In [8]:
df['col1'].apply(times2)
Out[8]:
0    2
1    4
2    6
3    8
Name: col1, dtype: int64

Alternatively, you can apply a lambda function

In [14]:
df['col2'].apply(lambda x:x*2)
Out[14]:
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64

Applying built-in function

In [60]:
df['col3'].apply(len)
Out[60]:
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
In [61]:
df['col1'].sum()
Out[61]:
10

Permanently Removing a Column

df.drop('col1', axis=1, inplace=True)
In [62]:
del df['col1']
In [63]:
df
Out[63]:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Get Column Names and Index Info

Use the member variable .columns to get the column names

In [64]:
df.columns
Out[64]:
Index(['col2', 'col3'], dtype='object')

Use the member variable .index to get the start, stop and step size of an index

In [65]:
df.index
Out[65]:
RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame

In [66]:
df
Out[66]:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Use sort_values() to sort by column or by row. Note that inplace=False by default.

In [67]:
df.sort_values(by='col2')
Out[67]:
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi

Alternatively,

In [10]:
df.sort_values('col2')
Out[10]:
col1 col2 col3
0 1 444 abc
3 4 444 xyz
1 2 555 def
2 3 666 ghi

The sorting order ascending by default Ascending=True. Use Ascending=False to sort in descending order

In [11]:
df.sort_values('col2', ascending=False)
Out[11]:
col1 col2 col3
2 3 666 ghi
1 2 555 def
0 1 444 abc
3 4 444 xyz

Find Null Values or Check for Null Values

Find Null Values

To find null/missing values in a DataFrame, use isnull() which returns boolean values.

In [68]:
df.isnull()
Out[68]:
col2 col3
0 False False
1 False False
2 False False
3 False False

Drop rows with NaN Values

In [69]:
df.dropna()
Out[69]:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Filling in NaN values with something else:

In [71]:
import numpy as np
In [72]:
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()
Out[72]:
col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
In [75]:
df.fillna('FILL')
Out[75]:
col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz

Create a Pivot Table

In [89]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
In [90]:
df
Out[90]:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1

Use pivot_table() to create a pivot table. A pivot table with multi-level index

In [91]:
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
Out[91]:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN