There are lots of operations in pandas that will be really useful to you but don't fall into any distinct category. Let's go through them in this lecture:
Use head(n=5) to get the first n rows of the DataFrame and tail(n=5) to get the last n rows. The default for both is 5 rows (n=5).
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
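As a small sketch of passing n explicitly, the example below asks for two rows from each end. The name demo is a stand-in chosen for this illustration so that the lecture's df is left untouched.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame with the same data as the lecture's df
demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [444, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

first_two = demo.head(2)    # first 2 rows (index labels 0 and 1)
last_two = demo.tail(n=2)   # last 2 rows (index labels 2 and 3)
print(first_two)
print(last_two)
```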
Use the unique() method to get the unique values in a column
df['col2'].unique()
Use the nunique() method to count the unique values in a column
df['col2'].nunique()
The value_counts() method gives you a table of the unique values and how many times each one shows up
df['col2'].value_counts()
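To see how these three methods relate, here is a minimal sketch that runs them on the same values as col2. The standalone Series named col2 is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical standalone Series holding the same values as df['col2']
col2 = pd.Series([444, 555, 666, 444], name='col2')

uniques = col2.unique()        # NumPy array of distinct values, in order of first appearance
n_unique = col2.nunique()      # number of distinct values (NaN is excluded by default)
counts = col2.value_counts()   # Series mapping value -> occurrences, most frequent first

# The three results agree with each other
assert len(uniques) == n_unique == len(counts)
print(counts)
```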
Pass a conditional selection statement to the DataFrame. The condition evaluates to a Series of boolean values [False, False, True, ..., True], and only the rows marked True are kept.
Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
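The sketch below makes the boolean mask visible and combines conditions with & (and), | (or), and ~ (not). The names demo and mask are chosen for this illustration. Each comparison needs its own parentheses because & and | bind more tightly than > and ==.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame for this sketch
demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [444, 555, 666, 444]})

mask = demo['col1'] > 2                        # boolean Series: False, False, True, True
both = demo[mask & (demo['col2'] == 444)]      # AND: keeps index 3
either = demo[mask | (demo['col2'] == 444)]    # OR: keeps index 0, 2, 3
neither = demo[~mask]                          # NOT: keeps index 0, 1
print(mask)
```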
The apply() method lets you apply your own custom functions or built-in functions to a DataFrame or to a single column
Applying custom function
def times2(x):
    return x*2
This will apply the function to every value in col1
df['col1'].apply(times2)
Alternatively, you can apply a lambda function
df['col2'].apply(lambda x:x*2)
Applying built-in function
df['col3'].apply(len)
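The calls above all work on a single column. As a rough sketch of apply() on a whole DataFrame, the function receives one column at a time by default, or one row at a time with axis=1. The frame demo and the lambdas are assumptions made for this illustration.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame for this sketch
demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [444, 555, 666, 444]})

# axis=0 (default): the function receives each column as a Series
col_ranges = demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_sums = demo.apply(lambda row: row['col1'] + row['col2'], axis=1)

print(col_ranges)
print(row_sums)
```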
Use sum() to add up the values in a column
df['col1'].sum()
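Permanently remove a column with drop() and inplace=True
df.drop('col1', axis=1, inplace=True)
Alternatively, del df['col1'] does the same thing. Run only one of the two: once col1 is gone, the other raises a KeyError.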
df
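Here is a minimal sketch of the difference between dropping in place and getting a new DataFrame back, using a stand-in frame called demo. It also shows errors='ignore', which skips labels that are no longer present instead of raising a KeyError.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame for this sketch
demo = pd.DataFrame({'col1': [1, 2], 'col2': [444, 555], 'col3': ['abc', 'def']})

trimmed = demo.drop(columns='col1')        # returns a new DataFrame; demo is unchanged
assert 'col1' in demo.columns

demo.drop('col2', axis=1, inplace=True)    # modifies demo itself and returns None

# col2 is already gone; errors='ignore' avoids the KeyError a second drop would raise
demo = demo.drop(columns='col2', errors='ignore')
print(demo)
```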
Use the .columns attribute to get the column names
df.columns
Use the .index attribute to get the index. For the default RangeIndex, this shows its start, stop, and step size
df.index
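As a short sketch, both attributes can be read programmatically as well as displayed. The frame demo is a stand-in, and the start, stop, and step attributes used here apply to the default RangeIndex that pandas creates when no index is given.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame for this sketch
demo = pd.DataFrame({'col2': [444, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

column_names = list(demo.columns)   # column labels as a plain Python list
idx = demo.index                    # a RangeIndex, since no index was specified
print(column_names)
print(idx.start, idx.stop, idx.step)
```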
df
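Use sort_values() to sort by the values in a column (or, with axis=1, by the values in a row). Note that inplace=False by default, so a sorted copy is returned and df itself is unchanged.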
df.sort_values(by='col2')
Alternatively,
df.sort_values('col2')
The sort order is ascending by default (ascending=True). Use ascending=False to sort in descending order
df.sort_values('col2', ascending=False)
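The sketch below shows that sort_values() leaves the original frame alone unless told otherwise. It also shows that by and ascending accept lists, so ties in one column can be broken by another. The frame demo and its values are made up for this illustration.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame for this sketch
demo = pd.DataFrame({'col2': [444, 555, 666, 444],
                     'col3': ['xyz', 'def', 'ghi', 'abc']})

# col2 descending, then col3 ascending to break the tie between the two 444 rows
by_two = demo.sort_values(by=['col2', 'col3'], ascending=[False, True])

print(by_two)                   # rows come back in index order 2, 1, 3, 0
print(demo.index.tolist())      # original order is untouched: [0, 1, 2, 3]
```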
Find Null Values
To find null/missing values in a DataFrame, use isnull(), which returns a DataFrame of boolean values (True wherever a value is missing).
df.isnull()
Drop rows with NaN Values
df.dropna()
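These two methods are easier to follow on a frame that actually contains missing values. The sketch below uses a stand-in frame called demo with one NaN in each numeric column and shows dropping by row and by column.

```python
import numpy as np
import pandas as pd

# `demo` is a hypothetical stand-in frame with one NaN in col1 and one in col2
demo = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                     'col2': [np.nan, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

missing_per_column = demo.isnull().sum()   # number of NaN values in each column
complete_rows = demo.dropna()              # keeps rows without any NaN: index 1 and 2
complete_cols = demo.dropna(axis=1)        # keeps columns without any NaN: col3 only

print(missing_per_column)
```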
Filling in NaN values with something else:
import numpy as np
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()
df.fillna('FILL')
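A single fill value is not always what you want. As a sketch, fillna() also accepts a dict that maps column names to fill values, so each column can get its own replacement. The frame demo and the choice of the column mean and 0 as fill values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# `demo` is a hypothetical stand-in frame with one NaN in col1 and one in col2
demo = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                     'col2': [np.nan, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

# Fill col1 with its mean (NaN is skipped when computing it) and col2 with 0
filled = demo.fillna({'col1': demo['col1'].mean(), 'col2': 0})
print(filled)
```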
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
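Use pivot_table() to create a pivot table. The call below builds one with a multi-level index (A, B), one column for each value of C, and the values taken from D (aggregated with the mean by default). Combinations that never occur in the data show up as NaN.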
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
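As a rough sketch of two commonly used options, aggfunc changes how values that land in the same cell are combined, and fill_value replaces the NaN of combinations that never occur. The frame demo repeats the data above, and the choice of 'sum' and 0 is an assumption made for this illustration.

```python
import pandas as pd

# `demo` is a hypothetical stand-in frame holding the same data as above
demo = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                     'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                     'C': ['x', 'y', 'x', 'y', 'x', 'y'],
                     'D': [1, 3, 2, 5, 4, 1]})

# Default behaviour: mean aggregation, missing combinations become NaN
default = demo.pivot_table(values='D', index=['A', 'B'], columns='C')

# Sum instead of mean, and show 0 where a combination has no data
table = demo.pivot_table(values='D', index=['A', 'B'], columns='C',
                         aggfunc='sum', fill_value=0)

print(table)
print(table.loc[('bar', 'one'), 'x'])   # look up one cell by (A, B) row and C column
```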