DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
Standard normal distribution
Make sure we get the same random number with seed()
from numpy.random import randn
np.random.seed(101)
DataFrame(data, index, columns)
We can create a DataFrame object with DataFrame()
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
Each of these columns is actually a panda Series i.e. W, X, Y, Z and they all share a common index
Use df[]
to grap a single column
df['W']
Pass a list of column names
Pass a list of column names
df[['W','Z']]
SQL Syntax (NOT RECOMMENDED!)
df.W
DataFrame Columns are just Series
type(df['W'])
By using existing columns
df['new'] = df['W'] + df['Y']
df
Use df.drop()
to remove columns.
df.drop('new',axis=1)
Not inplace
unless specified! The inplace
argument is default to False
df
Set inplace
argument to True
to commit
df.drop('new',axis=1,inplace=True)
df
Can also drop rows this way:
df.drop('E',axis=0)
df.loc['A']
Or select based on position instead of label
df.iloc[2]
df.loc['B','Y']
df.loc[['A','B'],['W','Y']]
An important feature of pandas is conditional selection using bracket notation, very similar to numpy
:
df
Get the DataFrame back with boolean values
df>0
If you pass in the whole DataFrame of boolean values, you will get values that are True and NaN for values that are False
df[df>0]
If you pass in a Series of boolean values, such as a column with a comparison operator, you will get the rows of the DataFrame where Series happens to be True
df[df['W']>0]
df[df['W']>0]['Y']
df[df['W']>0][['Y','X']]
The built-in and
or
operators can't handle comparison on Series of Boolean values
df[(df['W']>0) & (df['Y'] > 1)] #ValueError
For two conditions you can use |
and &
with parenthesis ()
:
df[(df['W']>0) & (df['Y'] > 1)]
Let's discuss some more features of indexing, including resetting the index or setting it something else. We'll also talk about index hierarchy!
df
Use reset_index()
to reset to default 0,1...n index. The reset_index()
method is not inplace
by default and you need to pass inplace=True
to commit. If inplace
, the old index A,B,...E will become a new column with column name index
df.reset_index()
Set a new index based on a column
Create a new list
split()
as a nice quick way to create a new listnewind = 'CA NY WY OR CO'.split()
Add the list to a column
df['States'] = newind
df
Set a Column as a New Index
Use set_index()
to set a column as a new index. This will over-write your old index A,B,...E.
df.set_index('States')
df
The set_index()
method is not inplace
by default and you need to set argument inplace=True
to make changes permanent.
df.set_index('States',inplace=True)
df
Let us go over how to work with Multi-Index, first we'll create a quick example of what a Multi-Indexed DataFrame would look like:
First we have two lists as Index Levels:
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
Use zip()
along with list()
to make a list of tuple pair
list(zip(outside, inside))
Then pass the list of tuple pair to MultiIndex.from_tuples()
to create a multi-index
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
The index have two levels: ['G1', 'G2'] is one level and [1, 2, 3] is another level
hier_index
Pass the hier_index to DataFrame()
to create a DataFrame with Multi-Index aka Index Hierarchy
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
Now let's show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:
Grab the ouside Index
df.loc['G1']
Grab the inside index
df.loc['G1'].loc[1]
Grab a piece of data
df.loc['G2'].loc[2]['B']
Use member variable names
to view the names of the index. Here, the index have no name
df.index.names
Name the index by passing in a list of names
df.index.names = ['Group','Num']
df
The xs()
methods returns a cross-section (rows or columns) from the Series/DataFrame
Grab the section G1. This is eqivalent to loc['G1']
df.xs('G1')
df.xs(['G1',1])
The advantage of xs()
over loc()
is xs()
can grab a specifc part of BOTH section groups.
Grab data Series of G1 G2 with the same inside index "num"
df.xs(1,level='Num')