Create a DataFrame
- Syntax
Selection and Indexing
Selecting Column(s)
- Select Single Column
- Select Multiple Columns
Creating a new column
Removing Columns
Removing Rows
Selecting Rows
- .loc[]
- .iloc[]
Selecting subset of rows and columns
- Conditional Selection
  - Select a Single Column
  - Select Multiple Columns
- Multiple Conditional Selection
More Index Details
- Reset Index
- Set a New Index
Multi-Index and Index Hierarchy

DataFrames¶

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

import pandas as pd
import numpy as np

randn() draws samples from the standard normal distribution.

Seed the random number generator with seed() so that we get the same random numbers on every run.

from numpy.random import randn
np.random.seed(101)

Create a DataFrame¶

Syntax¶

DataFrame(data, index, columns)

We can create a DataFrame object with DataFrame()

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

df

Each of these columns (W, X, Y, Z) is actually a pandas Series, and they all share a common index.
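To make the "Series sharing one index" idea concrete, here is a minimal sketch that builds a DataFrame from two Series. The names w, x and frame and the values are made up for illustration and are not part of the notebook's df.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values).
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'], name='W')
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'], name='X')

# Combining them column-wise gives a DataFrame with one shared index.
frame = pd.DataFrame({'W': w, 'X': x})

# A column pulled back out is a Series that carries the same index.
col_w = frame['W']
same_index = col_w.index.equals(frame.index)

print(frame)
print(type(col_w).__name__, same_index)
```

Building the frame from a dict of Series is only one of several ways to call DataFrame(); the randn-based call above achieves the same structure from a 2-D array.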

Selection and Indexing¶

Selecting Column(s)¶

Let's learn the various methods to grab data from a DataFrame

Select Single Column¶

Use df[] to grab a single column

df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

Select Multiple Columns¶

Pass a list of column names

# Pass a list of column names
df[['W','Z']]

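Attribute access with dot notation, SQL-style (NOT RECOMMENDED!)

A column name can clash with a DataFrame method or attribute of the same name, so it is easy to confuse the two.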

df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
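The risk with dot access can be shown with a small sketch. The frame sales and its column names count and shape are hypothetical, chosen because they collide with existing DataFrame members.

```python
import pandas as pd

# Hypothetical frame whose column names collide with DataFrame members.
sales = pd.DataFrame({'count': [3, 5, 2],
                      'shape': ['round', 'square', 'round']})

# Bracket access always returns the column.
count_col = sales['count']

# Dot access resolves to the DataFrame member, not the column.
dot_count = sales.count   # the bound method DataFrame.count
dot_shape = sales.shape   # the (rows, columns) tuple

print(type(count_col).__name__)
print(callable(dot_count), dot_shape)
```

Bracket notation also works for column names that contain spaces or are not valid Python identifiers, which dot access cannot express at all.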

DataFrame Columns are just Series

type(df['W'])

pandas.core.series.Series

Creating a new column¶

By using existing columns

df['new'] = df['W'] + df['Y']

df

Removing Columns¶

Use df.drop() to remove columns.

df.drop('new',axis=1)

Not inplace unless specified! The inplace argument defaults to False

df

Set the inplace argument to True to commit the change

df.drop('new',axis=1,inplace=True)

df
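As an alternative to inplace=True, the result of drop() can simply be assigned back. The sketch below uses a small made-up frame rather than the notebook's df, and relies on the columns= keyword of drop(), which is equivalent to passing the label with axis=1.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'Y': [3, 4]}, index=['A', 'B'])
frame['new'] = frame['W'] + frame['Y']

# drop() returns a new DataFrame; the original still has the column.
dropped = frame.drop('new', axis=1)
still_there = 'new' in frame.columns

# Reassigning the result makes the change stick without inplace=True.
frame = frame.drop(columns='new')

print(still_there, list(frame.columns))
```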

Removing Rows¶

Can also drop rows this way:

df.drop('E',axis=0)

Selecting Rows¶

.loc[]¶

Use loc[] to select rows based on label

df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

.iloc[]¶

Or select based on position instead of label

df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
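The difference between label-based and position-based selection also shows up when slicing. The following sketch uses an illustrative frame (not the random df above): a loc[] slice includes its end label, while an iloc[] slice excludes its end position.

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [1, 2, 3, 4], 'X': [10, 20, 30, 40]},
    index=['A', 'B', 'C', 'D'],
)

row_by_label = frame.loc['C']      # row whose index label is 'C'
row_by_position = frame.iloc[2]    # third row, counting from 0

# Label slices include the end label; position slices exclude the end.
label_slice = frame.loc['A':'C']   # rows A, B, C
position_slice = frame.iloc[0:2]   # rows A, B

print(list(label_slice.index), list(position_slice.index))
```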

Selecting subset of rows and columns¶

df.loc['B','Y']

-0.84807698340363147

df.loc[['A','B'],['W','Y']]

Conditional Selection¶

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

df

Get the DataFrame back with boolean values

df>0

If you pass in the whole DataFrame of boolean values, you get the original values where the condition is True and NaN where it is False

df[df>0]
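A small sketch of this masking behaviour, using made-up values so the result is easy to check by eye. It also shows that the same result can be obtained with DataFrame.where(), which is mentioned here only as a comparison.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'W': [1.0, -2.0], 'X': [-3.0, 4.0]}, index=['A', 'B'])

# Values that fail the condition are replaced by NaN; the shape is unchanged.
masked = frame[frame > 0]

# where() keeps values where the condition holds and fills NaN elsewhere.
same_as_where = masked.equals(frame.where(frame > 0))

print(masked)
print(same_as_where)
```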

If you pass in a Series of boolean values, such as a column compared against a value, you get only the rows of the DataFrame where the Series is True

df[df['W']>0]

Select a Single Column¶

df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

Select Multiple Columns¶

df[df['W']>0][['Y','X']]
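The two-step form above filters rows and then picks columns. As a sketch of an alternative, loc[] accepts the boolean Series and the column list together. The frame and the variable names mask, chained and single are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [2.0, -1.0, 0.5], 'X': [1, 2, 3], 'Y': [7, 8, 9]},
    index=['A', 'B', 'C'],
)

mask = frame['W'] > 0                  # boolean Series, one value per row

chained = frame[mask][['Y', 'X']]      # two steps: filter rows, then pick columns
single = frame.loc[mask, ['Y', 'X']]   # one step: rows and columns together

print(single)
```

Both forms return the same data for reading. The single loc[] call is usually the safer choice when the selection is later used for assignment, since the chained form assigns into an intermediate object.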

Multiple Conditional Selection¶

Python's built-in and / or operators can't compare Series of Boolean values element by element

df[(df['W']>0) and (df['Y'] > 1)] #ValueError

For two conditions you can use | and & with parentheses ():

df[(df['W']>0) & (df['Y'] > 1)]
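The following sketch puts the failing and working forms side by side on a small illustrative frame. The ~ operator (element-wise NOT) is not covered in the text above and is included only for completeness.

```python
import pandas as pd

frame = pd.DataFrame({'W': [2.0, -1.0, 0.5], 'Y': [3.0, 4.0, 0.2]},
                     index=['A', 'B', 'C'])

# `and` asks each Series for a single truth value, which is ambiguous.
try:
    frame[(frame['W'] > 0) and (frame['Y'] > 1)]
    raised = False
except ValueError:
    raised = True

both = frame[(frame['W'] > 0) & (frame['Y'] > 1)]        # element-wise AND
either = frame[(frame['W'] > 0) | (frame['Y'] > 1)]      # element-wise OR
neither = frame[~((frame['W'] > 0) | (frame['Y'] > 1))]  # element-wise NOT

print(raised, list(both.index), list(either.index), len(neither))
```

The parentheses matter because & and | bind more tightly than comparison operators such as >.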

More Index Details¶

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

df

Reset Index¶

Use reset_index() to reset to the default 0,1,...,n-1 index. The old index A,B,...E becomes a new column named index. The reset_index() method is not inplace by default; pass inplace=True to commit the change.

df.reset_index()
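A minimal sketch of the two common ways to call reset_index(), on an illustrative frame. The drop=True option, which discards the old labels instead of keeping them as a column, is not discussed in the text above.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

kept = frame.reset_index()                 # old labels move into a column named 'index'
discarded = frame.reset_index(drop=True)   # old labels are thrown away

print(list(kept.columns))
print(list(discarded.columns))
print(list(frame.index))                   # frame itself is unchanged
```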

Set a New Index¶

Set a new index based on a column

Create a new list

Use split() as a nice quick way to create a new list

newind = 'CA NY WY OR CO'.split()

Add the list to a column

df['States'] = newind

df

Set a Column as a New Index

Use set_index() to set a column as the new index. This will overwrite your old index A,B,...E.

df.set_index('States')

df

The set_index() method is not inplace by default; set the argument inplace=True to make the change permanent.

df.set_index('States',inplace=True)

df
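Because set_index() replaces the old labels, one way to keep them is to move them into a column first. The sketch below assumes a small illustrative frame, and the names by_state and kept_old are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])
frame['States'] = 'CA NY WY'.split()

# set_index() returns a new DataFrame; the old labels A, B, C are not in it.
by_state = frame.set_index('States')

# Resetting first moves the old labels into an 'index' column, so they survive.
kept_old = frame.reset_index().set_index('States')

print(list(by_state.columns))
print(list(kept_old.columns))
```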

Multi-Index and Index Hierarchy¶

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

First we have two lists as Index Levels:

outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]

Use zip() along with list() to make a list of tuple pairs

list(zip(outside, inside))

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

Then pass the list of tuple pairs to MultiIndex.from_tuples() to create a multi-index

hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

The index has two levels: ['G1', 'G2'] is one level and [1, 2, 3] is another level

hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

Pass the hier_index to DataFrame() to create a DataFrame with Multi-Index aka Index Hierarchy

df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
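When the index is every combination of the outer and inner labels, as it is here, MultiIndex.from_product() is a shorter way to build it. The sketch below compares it with the from_tuples() route used above. The variable names are illustrative.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]

# Route used in the text: zip the two lists into (outer, inner) tuples.
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# from_product() builds every (outer, inner) combination directly.
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])

same = from_tuples.equals(from_product)
print(same, from_tuples.nlevels, len(from_product))
```

from_tuples() remains the more general choice, since it also handles indexes where not every combination is present.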

Selection and Indexing¶

Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation df[]. Selecting one label of the outer level returns the sub-DataFrame:

Grab the outside index

df.loc['G1']

Grab the inside index

df.loc['G1'].loc[1]

A    1.114024
B    0.597304
Name: 1, dtype: float64

Grab a piece of data

df.loc['G2'].loc[2]['B']

0.7364394404800374
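The chained lookups above can also be written as a single loc[] call that takes an (outer, inner) tuple for the row and a column label. This is a sketch on a frame with known integer values, not the random df from the notebook.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
frame = pd.DataFrame({'A': list(range(6)), 'B': list(range(10, 16))},
                     index=index)

chained = frame.loc['G2'].loc[2]['B']   # three lookups in a row
direct = frame.loc[('G2', 2), 'B']      # one lookup: (outer, inner) row key, then column

print(chained, direct)
```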

Name the Index¶

Use the names attribute to view the names of the index levels. Here, the index levels have no names

df.index.names

FrozenList([None, None])

Name the index by passing in a list of names

df.index.names = ['Group','Num']

df

Cross Section Method¶

The xs() method returns a cross-section (rows or columns) from the Series/DataFrame

Grab the section G1. This is equivalent to loc['G1']

df.xs('G1')

df.xs(('G1',1))

A    0.153661
B    0.167638
Name: (G1, 1), dtype: float64

The advantage of xs() over loc[] is that xs() can select on an inner index level across ALL outer groups.

Grab the rows from both G1 and G2 that share the same inside index value (Num equal to 1)

df.xs(1,level='Num')
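Here is a sketch of xs() with the level argument on a frame with known values, so the selected rows are easy to verify. The drop_level=False option, which keeps the selected level in the result, is not covered in the text above.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Group', 'Num'])
frame = pd.DataFrame({'A': list(range(6)), 'B': list(range(10, 16))},
                     index=index)

# Rows with Num == 1 from every group; the Num level is dropped from the result.
num_one = frame.xs(1, level='Num')

# drop_level=False keeps both index levels in the result.
num_one_full = frame.xs(1, level='Num', drop_level=False)

print(num_one)
print(num_one_full)
```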

Output of df (multi-index DataFrame):

		A	B
G1	1	1.114024	0.597304
	2	-1.125510	0.314178
	3	1.731880	-0.611851
G2	1	-0.325227	0.525790
	2	-0.052716	0.736439
	3	0.505291	1.659062

Output of df.loc['G1']:

	A	B
1	1.114024	0.597304
2	-1.125510	0.314178
3	1.731880	-0.611851

Output of df after naming the index levels:

		A	B
Group	Num
G1	1	0.153661	0.167638
	2	-0.765930	0.962299
	3	0.902826	-0.537909
G2	1	-1.549671	0.435253
	2	1.259904	-0.447898
	3	0.266207	0.412580

Output of df.xs('G1'):

	A	B
Num
1	0.153661	0.167638
2	-0.765930	0.962299
3	0.902826	-0.537909

Output of df (original DataFrame):

	W	X	Y	Z
A	2.706850	0.628133	0.907969	0.503826
B	0.651118	-0.319318	-0.848077	0.605965
C	-2.018168	0.740122	0.528813	-0.589001
D	0.188695	-0.758872	-0.933237	0.955057
E	0.190794	1.978757	2.605967	0.683509

Output of df>0:

	W	X	Y	Z
A	True	True	True	True
B	True	False	False	True
C	False	True	True	False
D	True	False	False	True
E	True	True	True	True

Output of df after set_index('States',inplace=True):

	W	X	Y	Z
States
CA	2.706850	0.628133	0.907969	0.503826
NY	0.651118	-0.319318	-0.848077	0.605965
WY	-2.018168	0.740122	0.528813	-0.589001
OR	0.188695	-0.758872	-0.933237	0.955057
CO	0.190794	1.978757	2.605967	0.683509
