We will usually get data that has a DateTime Index on a smaller time scale (every day, every hour, etc…). However, it is often a good idea to aggregate the data based off some frequency (monthly, quarterly, etc…)
While using groupby
could solve part of this issue, a simple groupby
isn’t “smart” enough to understand things like business quarters or business year starts. Luckily, pandas has frequency sampling tools to help with this!
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
Use read_csv()
to read CSV file
df = pd.read_csv('time_data/walmart_stock.csv')
If your have a standard date format %y%m%d, you can use this method. Simply pass arguments index_col='Date'
and parse_dates=True
to the .read_csv()
method
df = pd.read_csv('time_data/walmart_stock.csv',index_col='Date', parse_dates=True)
The index_col='Date'
argument will set the Date column as index but the column is still of data type String. The parse_dates=True
argument will convert Date column to DateTime object
df.index
If you use this method, you can jump to section "Resample" to continue.
df.head()
Notice that there is a "Date" column and this column is what we want as a Date Index
Check out DataFrame info
df.info()
Create a date index from the date column
Method 1
to_datetime()
method and pass in the entire date Series. The to_datetime()
method will re-format the Series to a datetime object. Set the DateTime object to 'Date' column in the DataFrame# df['Date'] = pd.to_datetime(df['Date'])
Method 2
.apply()
to apply pd.to.datetime
to all elements in the Date Series# df['Date'] = df['Date'].apply(pd.to_datetime)
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 7 columns):
Date 1258 non-null datetime64[ns]
Open 1258 non-null float64
High 1258 non-null float64
Low 1258 non-null float64
Close 1258 non-null float64
Volume 1258 non-null int64
Adj Close 1258 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 68.8 KB
Notice that the Date column changed to DateTime object
If the date is not in standard format, use the format
argument in .to_datetime() method to specify the format e.g. "%d/%m/%Y"
(%y: 2-digit year)
As we have a DateTime object now, we can pass it to the set_index()
method with argument inplace=True
to set it as an index
df.set_index('Date',inplace=True)
df.head()
A common operation with time series data is resamplling based on the time series index. The resample()
method is essentially acting as a groupby method specifc to time series data that has a DateTime object.
First of all, make sure that we have a DateTime object as index
Alias | Description |
---|---|
B | business day frequency |
C | custom business day frequency (experimental) |
D | calendar day frequency |
W | weekly frequency |
M | month end frequency |
SM | semi-month end frequency (15th and end of month) |
BM | business month end frequency |
CBM | custom business month end frequency |
MS | month start frequency |
SMS | semi-month start frequency (1st and 15th) |
BMS | business month start frequency |
CBMS | custom business month start frequency |
Q | quarter end frequency |
BQ | business quarter endfrequency |
QS | quarter start frequency |
BQS | business quarter start frequency |
A | year end frequency |
BA | business year end frequency |
AS | year start frequency |
BAS | business year start frequency |
BH | business hour frequency |
H | hourly frequency |
T, min | minutely frequency |
S | secondly frequency |
L, ms | milliseconds |
U, us | microseconds |
N | nanoseconds |
# Our index
df.index
You need to call resample with the rule parameter, then you need to call some sort of aggregation function. This is because due to resampling, we need some sort of mathematical rule to join the rows by (mean,sum,count,etc...)
Yearly Means
rule='A'
, year end frequency, we can expect the last day of the year as our resamplingdf.resample(rule='A').mean()
You could technically also create your own custom resampling function:
def first_day(entry):
"""
Returns the first instance of the period, regardless of samplling rate.
"""
return entry[0]
Note that we PASS firstday
in (without parentheses), NOT calling it
df.resample(rule='A').apply(first_day)
df['Close'].resample('A').mean().plot(kind='bar')
plt.title('Yearly Mean Close Price for Walmart')
Reample by Month
df['Open'].resample('M').max().plot(kind='bar',figsize=(16,6))
plt.title('Monthly Max Opening Price for Walmart')
That is it! Up next we'll learn about time shifts!