Time Resampling

We will usually get data that has a DateTime Index on a smaller time scale (every day, every hour, etc…). However, it is often a good idea to aggregate the data based off some frequency (monthly, quarterly, etc…)

While using groupby could solve part of this issue, a simple groupby isn’t “smart” enough to understand things like business quarters or business year starts. Luckily, pandas has frequency sampling tools to help with this!

Imports

In [3]:
import numpy as np
import pandas as pd
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

Read CSV File

Use read_csv() to read CSV file

In [11]:
df = pd.read_csv('time_data/walmart_stock.csv')

Faster alternative

If your have a standard date format %y%m%d, you can use this method. Simply pass arguments index_col='Date' and parse_dates=True to the .read_csv() method

In [15]:
df = pd.read_csv('time_data/walmart_stock.csv',index_col='Date', parse_dates=True)

The index_col='Date' argument will set the Date column as index but the column is still of data type String. The parse_dates=True argument will convert Date column to DateTime object

In [10]:
df.index
Out[10]:
DatetimeIndex(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06',
               '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12',
               '2012-01-13', '2012-01-17',
               ...
               '2016-12-16', '2016-12-19', '2016-12-20', '2016-12-21',
               '2016-12-22', '2016-12-23', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', name='Date', length=1258, freq=None)

If you use this method, you can jump to section "Resample" to continue.


In [3]:
df.head()
Out[3]:
Date Open High Low Close Volume Adj Close
0 2012-01-03 59.970001 61.060001 59.869999 60.330002 12668800 52.619235
1 2012-01-04 60.209999 60.349998 59.470001 59.709999 9593300 52.078475
2 2012-01-05 59.349998 59.619999 58.369999 59.419998 12768200 51.825539
3 2012-01-06 59.419998 59.450001 58.869999 59.000000 8069400 51.459220
4 2012-01-09 59.029999 59.549999 58.919998 59.180000 6679300 51.616215

Notice that there is a "Date" column and this column is what we want as a Date Index

Check out DataFrame info

  • Notice that the "Date" column is of data type "non-null object". This indicates the "Date" column is a string.
In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 7 columns):
Date         1258 non-null object
Open         1258 non-null float64
High         1258 non-null float64
Low          1258 non-null float64
Close        1258 non-null float64
Volume       1258 non-null int64
Adj Close    1258 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 63.9+ KB

Create a DateTime Object .to_datetime()

Create a date index from the date column

Method 1

  • Use to_datetime() method and pass in the entire date Series. The to_datetime() method will re-format the Series to a datetime object. Set the DateTime object to 'Date' column in the DataFrame
In [13]:
# df['Date'] = pd.to_datetime(df['Date'])

Method 2

  • Use .apply() to apply pd.to.datetime to all elements in the Date Series
In [5]:
# df['Date'] = df['Date'].apply(pd.to_datetime)
df.head()
Out[5]:
Date Open High Low Close Volume Adj Close
0 2012-01-03 59.970001 61.060001 59.869999 60.330002 12668800 52.619235
1 2012-01-04 60.209999 60.349998 59.470001 59.709999 9593300 52.078475
2 2012-01-05 59.349998 59.619999 58.369999 59.419998 12768200 51.825539
3 2012-01-06 59.419998 59.450001 58.869999 59.000000 8069400 51.459220
4 2012-01-09 59.029999 59.549999 58.919998 59.180000 6679300 51.616215
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 7 columns):
Date         1258 non-null datetime64[ns]
Open         1258 non-null float64
High         1258 non-null float64
Low          1258 non-null float64
Close        1258 non-null float64
Volume       1258 non-null int64
Adj Close    1258 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 68.8 KB

Notice that the Date column changed to DateTime object

Format Argument

If the date is not in standard format, use the format argument in .to_datetime() method to specify the format e.g. "%d/%m/%Y" (%y: 2-digit year)

Set DateTime Index .set_index()

As we have a DateTime object now, we can pass it to the set_index() method with argument inplace=True to set it as an index

In [8]:
df.set_index('Date',inplace=True)
df.head()
Out[8]:
Open High Low Close Volume Adj Close
Date
2012-01-03 59.970001 61.060001 59.869999 60.330002 12668800 52.619235
2012-01-04 60.209999 60.349998 59.470001 59.709999 9593300 52.078475
2012-01-05 59.349998 59.619999 58.369999 59.419998 12768200 51.825539
2012-01-06 59.419998 59.450001 58.869999 59.000000 8069400 51.459220
2012-01-09 59.029999 59.549999 58.919998 59.180000 6679300 51.616215

Resampling (Grouping) .resample()

A common operation with time series data is resamplling based on the time series index. The resample() method is essentially acting as a groupby method specifc to time series data that has a DateTime object.

First of all, make sure that we have a DateTime object as index

Table: Time Series Offest strings

Alias Description
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter endfrequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
In [14]:
# Our index
df.index
Out[14]:
DatetimeIndex(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06',
               '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12',
               '2012-01-13', '2012-01-17',
               ...
               '2016-12-16', '2016-12-19', '2016-12-20', '2016-12-21',
               '2016-12-22', '2016-12-23', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', name='Date', length=1258, freq=None)

You need to call resample with the rule parameter, then you need to call some sort of aggregation function. This is because due to resampling, we need some sort of mathematical rule to join the rows by (mean,sum,count,etc...)

Yearly Means

  • Use rule='A', year end frequency, we can expect the last day of the year as our resampling
In [17]:
df.resample(rule='A').mean()
Out[17]:
Open High Low Close Volume Adj Close
Date
2012-12-31 67.158680 67.602120 66.786520 67.215120 9239015 59.389349
2013-12-31 75.264048 75.729405 74.843055 75.320516 6951496 68.147179
2014-12-31 77.274524 77.740040 76.864405 77.327381 6515612 71.709712
2015-12-31 72.569405 73.064167 72.034802 72.491111 9040769 68.831426
2016-12-31 69.481349 70.019643 69.023492 69.547063 9371645 68.054229

Custom Resampling

You could technically also create your own custom resampling function:

In [23]:
def first_day(entry):
    """
    Returns the first instance of the period, regardless of samplling rate.
    """
    return entry[0]

Note that we PASS firstday in (without parentheses), NOT calling it

In [25]:
df.resample(rule='A').apply(first_day)
Out[25]:
Open High Low Close Volume Adj Close
Date
2012-12-31 59.970001 61.060001 59.869999 60.330002 12668800 52.619235
2013-12-31 68.930000 69.239998 68.449997 69.239998 10390800 61.879708
2014-12-31 78.720001 79.470001 78.500000 78.910004 6878000 72.254228
2015-12-31 86.269997 86.720001 85.550003 85.900002 4501800 80.624861
2016-12-31 60.500000 61.490002 60.360001 61.459999 11989200 59.289713

Visualization with Resampling

Reample by Year

Use bar plot as dataset is not continuous

In [38]:
df['Close'].resample('A').mean().plot(kind='bar')
plt.title('Yearly Mean Close Price for Walmart')
Out[38]:
<matplotlib.text.Text at 0x11d782080>

Reample by Month

In [42]:
df['Open'].resample('M').max().plot(kind='bar',figsize=(16,6))
plt.title('Monthly Max Opening Price for Walmart')
Out[42]:
<matplotlib.text.Text at 0x11e8ddda0>

That is it! Up next we'll learn about time shifts!