Missing Data

Here are a few convenient methods for dealing with missing data in pandas:

In [1]:
import numpy as np
import pandas as pd

Create a DataFrame from a Dictionary

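The keys of the dictionary become the column names. Use np.nan to mark a missing (null) value. Because np.nan is a float, any column that contains it is stored as float64, which is why A and B display as 1.0 and 5.0 while C stays an integer column.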

In [10]:
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

df
Out[10]:
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
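
Before dropping or filling anything, it can help to see where the gaps are. The following is a minimal sketch that uses isna(), a method the walkthrough here does not otherwise cover, on the same DataFrame; the variable names missing_mask and missing_per_column are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Boolean DataFrame: True wherever a value is missing
missing_mask = df.isna()

# Summing the mask column by column counts the missing values per column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Column A has one missing value, B has two, and C has none, which matches the table above.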

The dropna() Method

Use dropna() to remove every ROW that contains at least one null/missing value. Only row 0 has no missing values, so it is the only row left.

In [12]:
df.dropna()
Out[12]:
     A    B  C
0  1.0  5.0  1
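
One detail worth checking for yourself: dropna() returns a new DataFrame and leaves the original untouched unless you assign the result. A short sketch, where the name cleaned is only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# dropna() builds a new DataFrame; df itself is not modified
cleaned = df.dropna()

print(len(df))       # number of rows in the original
print(len(cleaned))  # number of rows after dropping
```

To keep the result, assign it to a name as done here, or back to df itself.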

Use dropna(axis=1) to remove every COLUMN that contains at least one null/missing value. Only column C is complete, so it is the only column left.

In [13]:
df.dropna(axis=1)
Out[13]:
   C
0  1
1  2
2  3
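
Dropping every column with any gap can be too aggressive. As a sketch of a gentler option, the how='all' argument of dropna() removes only columns that are entirely missing. The extra column D below is hypothetical and added purely to illustrate the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Hypothetical column D that contains nothing but missing values
df_d = df.assign(D=np.nan)

# Default how='any': drop a column if it has at least one missing value
cols_any = df_d.dropna(axis=1)

# how='all': drop a column only if every value in it is missing
cols_all = df_d.dropna(axis=1, how='all')

print(list(cols_any.columns))
print(list(cols_all.columns))
```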

Use the thresh argument to set the minimum number of non-NA values a row must have. A row is kept if its count of non-NA values is >= thresh. With thresh=2, rows 0 and 1 are kept (3 and 2 non-NA values) and row 2 is dropped (only 1 non-NA value).

In [14]:
df.dropna(thresh=2)
Out[14]:
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
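
To make the thresh rule concrete, the sketch below counts the non-NA values in each row by hand and filters on that count. It is meant as an illustration of the rule, not as a description of how pandas implements it; the names non_na_counts and kept are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Count the non-missing values in each row
non_na_counts = df.notna().sum(axis=1)
print(non_na_counts)

# Keep the rows with at least 2 non-missing values
kept = df[non_na_counts >= 2]

# Same rows as the built-in threshold
print(kept.equals(df.dropna(thresh=2)))
```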

The fillna() Method

Missing values are displayed as NaN. Use fillna() to replace them with a value of your choice.

In [15]:
df.fillna(value='FILL VALUE')
Out[15]:
            A           B  C
0           1           5  1
1           2  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3
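
A single fill value is not always appropriate for every column. As a sketch, fillna() also accepts a dictionary that maps column names to fill values; the values 0 and -1 below are arbitrary placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Fill each column with its own replacement value
filled = df.fillna(value={'A': 0, 'B': -1})
print(filled)
```

Because the fill values are numeric, columns A and B stay numeric instead of becoming mixed-type columns.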

Set the fill value to the mean of the column. mean() skips NaN by default, so the mean of A is (1 + 2) / 2 = 1.5.

In [17]:
df['A'].fillna(value=df['A'].mean())
Out[17]:
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
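
The same idea extends to the whole DataFrame. As a sketch, passing the Series returned by df.mean() to fillna() fills each column with its own mean; the names column_means and filled are illustrative. Like dropna(), fillna() returns a new object, so the original df still contains its NaN values afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Per-column means; NaN values are skipped when averaging
column_means = df.mean()

# Each column's gaps are filled with that column's mean
filled = df.fillna(column_means)
print(filled)

# To keep the result for a single column, assign it back explicitly, e.g.
# df['A'] = df['A'].fillna(df['A'].mean())
```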