Aggregations on time-series data with Pandas
Introduction
Working with time-series data is often a challenge on its own. It is a special kind of data, where data points depend on each other across time. When analyzing it, your productivity in gaining insights depends to a large extent on your ability to juggle the time dimension.
Very often, time-series data are collected over long periods, especially when they come from hardware devices or represent sequences of, for example, financial transactions. Furthermore, even when no field in the dataset is null, the data may still be problematic if the timestamps are irregularly spaced, shifted, missing, or inconsistent in any other way.
One of the key skills that help to learn useful information from time-dependent data is performing aggregations efficiently. Not only does it greatly reduce the total volume of the data, but it also helps to spot interesting facts faster.
In this article, I would like to present a few ways in which Pandas, the most popular Python library for data analysis, can help you perform these aggregations, and what is special about working with time. In addition, I will also provide the equivalent SQL syntax for reference.
The data
For demonstration, I use the credit card transaction dataset from Kaggle. However, for simplicity, I focus on the "Amount" column and filter it by a single user, although the aggregations can always be extended to include more criteria. Information about time is spread across the "Year", "Month", "Day", and "Time" columns, so it makes sense to represent it using a single column instead.
Since the whole dataset weighs around 2.35 GB, let’s transform the data on the fly using smaller batches.
```python
import pandas as pd
from tqdm import tqdm
from pathlib import Path

SRC = Path("data/credit_card_transactions-ibm_v2.csv")
DST = Path("data/transactions.csv")
USER = 0


def load(filepath=SRC):
    # read the CSV lazily, in chunks of 10k rows, to keep memory usage low
    data = pd.read_csv(
        filepath,
        iterator=True,
        chunksize=10000,
        usecols=["User", "Year", "Month", "Day", "Time", "Amount"],
    )
    for df in tqdm(data):
        yield df


def process(df):
    # keep only the transactions of a single user
    _df = df.query("User == @USER").copy()

    # concatenate the four time-related columns into one parsable string
    ts = _df.apply(
        lambda x: f"{x['Year']}{x['Month']:02d}{x['Day']:02d} {x['Time']}",
        axis=1,
    )
    _df["timestamp"] = pd.to_datetime(ts, format="%Y%m%d %H:%M")

    # strip the dollar sign and convert the amount to a number
    _df["amount"] = _df["Amount"].str.strip("$").astype(float)
    return _df.get(["timestamp", "amount"])


def main():
    for i, df in enumerate(load()):
        df = process(df)
        df.to_csv(
            DST,
            mode="a" if i else "w",  # overwrite on the first chunk, append afterwards
            header=(i == 0),         # write the header only once
            index=False,
        )


if __name__ == "__main__":
    main()
```
timestamp | amount |
---|---|
2002-09-01 06:21:00 | 134.09 |
2002-09-01 06:42:00 | 38.48 |
2002-09-02 06:22:00 | 120.34 |
2002-09-02 17:45:00 | 128.95 |
2002-09-03 06:23:00 | 104.71 |
The “head” of this frame gives us the table above. For a single user (here USER = 0), we have almost 20k timestamps that mark transactions between 2002 and 2020 with a one-minute resolution. Thanks to pd.to_datetime in the process function, the time information concatenated from four columns is converted and stored as np.datetime64, which describes time using a single, unified data type.
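Throughout the rest of the article, the snippets assume that this preprocessed file has been loaded back with timestamp set as the index. A minimal sketch of that step (the path matches the script above) could look like this:

```python
import pandas as pd

# load the preprocessed transactions, parse `timestamp`
# and make it the index, which the later snippets rely on
df = pd.read_csv(
    "data/transactions.csv",
    parse_dates=["timestamp"],
    index_col="timestamp",
)
```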
What is np.datetime64?
The np.datetime64 (doc) type is a numpy'ed version of the pythonic datetime.datetime object. It is vectorized, which makes it possible to perform operations over entire arrays quickly. At the same time, the object recognizes the typical datetime methods (doc) that facilitate manipulating the values naturally.
On the pandas side, the relevant objects are Timestamp, Timedelta, and Period (with the corresponding DatetimeIndex, TimedeltaIndex, and PeriodIndex), which describe moments in time, time shifts, and time spans, respectively. Underneath, however, there are still np.datetime64s (and similar np.timedelta64s) with their handy properties.
Converting time-related values to these objects is the best starting point for any time-series analysis. It is convenient, and it is fast.
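To illustrate (a small, standalone sketch, not taken from the transaction data), both the numpy and the pandas objects can be created from plain strings and support vectorized, datetime-aware arithmetic:

```python
import numpy as np
import pandas as pd

# numpy: vectorized datetime64 arithmetic over an entire array
days = np.arange("2002-09-01", "2002-09-06", dtype="datetime64[D]")
print(days + np.timedelta64(12, "h"))    # shift every element by 12 hours

# pandas: Timestamp and Timedelta wrap datetime64 and expose datetime-like methods
ts = pd.Timestamp("2002-09-01 06:21:00")
print(ts.day_name())                     # 'Sunday'
print(ts + pd.Timedelta("90min"))        # 2002-09-01 07:51:00
```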
Basic resampling
The simplest form of a time-series aggregation is to feed values into evenly spaced bins using an aggregating function. It helps to adjust the resolution and the volume of data.
The following snippet shows an example of resampling to days using two functions: sum and count:
```sql
SELECT
    SUM(amount),
    COUNT(amount),
    DATE(timestamp) AS dt
FROM transactions
GROUP BY dt;
```
Pandas provides at least two ways to achieve the same result:
```python
# option 1
df["amount"].resample("D").agg(["sum", "count"])

# option 2
df["amount"].groupby(pd.Grouper(level=0, freq="D")).agg(["sum", "count"])
```
Both options are equivalent. The first one is simpler and relies on the fact that the timestamp column has been set as the dataframe's index, although it is also possible to use the optional on argument to point to a particular column. The second uses a more generic aggregation object, pd.Grouper, in combination with the .groupby method. It is highly customizable through many optional arguments. Here, I am using level as opposed to key, because timestamp is an index. Also, freq="D" stands for days. There are other frequency codes too, although an analogous SQL statement may be more complicated; a couple of variations are sketched below.
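For example (a sketch of the same aggregation, assuming timestamp is kept as a regular column rather than the index, and using weekly instead of daily bins):

```python
# resample on a regular column instead of the index, with weekly bins
df.reset_index().resample("W", on="timestamp")["amount"].agg(["sum", "count"])

# the same with pd.Grouper, using `key` instead of `level`
df.reset_index().groupby(pd.Grouper(key="timestamp", freq="W"))["amount"].agg(["sum", "count"])
```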
Aggregations over several time spans
Say you want to aggregate data over multiple parts of the timestamp, such as (year, week) or (month, day-of-week, hour). Since timestamp is of the np.datetime64 type, it is possible to refer to its methods using the so-called .dt accessor and use them as aggregation instructions.
In SQL, you would do:
```sql
SELECT
    AVG(amount),
    STRFTIME('%Y %W', timestamp) AS yearweek
FROM transactions
GROUP BY yearweek;
```
Here are two ways to do it in Pandas:
```python
# if we want `timestamp` to be a column
df = df.reset_index()
df["amount"].groupby(by=[
    df["timestamp"].dt.year,
    df["timestamp"].dt.isocalendar().week,
]).mean()

# if we want `timestamp` to be the index
df = df.set_index("timestamp")
df["amount"].groupby(by=[
    df.index.year,
    df.index.isocalendar().week,
]).mean()
```
They do the same thing.
(year, week) | amount |
---|---|
(2002, 1) | 40.7375 |
(2002, 35) | 86.285 |
(2002, 36) | 82.3733 |
(2002, 37) | 72.2048 |
(2002, 38) | 91.8647 |
It is also worth mentioning that the .groupby method does not enforce using an aggregating function. All it does is slice the frame into a series of sub-frames. You may just as well take the individual “sub-frames” and perform some transformations on them directly. If that is the case, just iterate:
```python
for key, group in df.groupby(by=[df.index.year, df.index.isocalendar().week]):
    pass
```
Here, key will be a tuple of (year, week) and group will be a sub-frame.
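As a hypothetical example of such a transformation (a sketch, not part of the original analysis), each week's transactions could be expressed as a share of that week's total spending and the pieces stitched back together:

```python
pieces = []
for key, group in df.groupby(by=[df.index.year, df.index.isocalendar().week]):
    # express each transaction relative to that (year, week)'s total spending
    pieces.append(group["amount"] / group["amount"].sum())

weekly_share = pd.concat(pieces)
```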
Remark
It is important to mention that the boundaries of the time windows may be defined differently in different flavors of SQL and Pandas. When using SQLite for comparison, each gave a slightly different result.
SQL:
```sql
SELECT
    STRFTIME('%Y %W %w', timestamp),
    timestamp
FROM transactions
LIMIT 5;
```
timestamp | year | week | day |
---|---|---|---|
2002-09-01 06:21:00 | 2002 | 34 | 0 |
2002-09-01 06:42:00 | 2002 | 34 | 0 |
2002-09-02 06:22:00 | 2002 | 35 | 1 |
2002-09-02 17:45:00 | 2002 | 35 | 1 |
2002-09-03 06:23:00 | 2002 | 35 | 2 |
Pandas:
```python
df.index.isocalendar().head()
```
timestamp | year | week | day |
---|---|---|---|
2002-09-01 06:21:00 | 2002 | 35 | 7 |
2002-09-01 06:42:00 | 2002 | 35 | 7 |
2002-09-02 06:22:00 | 2002 | 36 | 1 |
2002-09-02 17:45:00 | 2002 | 36 | 1 |
2002-09-03 06:23:00 | 2002 | 36 | 2 |
The concept is the same, but the reference is different: isocalendar follows the ISO 8601 convention (weeks start on Monday and days are numbered from 1 to 7), whereas SQLite's STRFTIME uses the C-style numbering, where %w counts days from 0 for Sunday.
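If matching the SQLite output were important, one possible way (a sketch, not part of the original comparison) would be to format the index with strftime, which follows the same C-style conventions:

```python
# C-style week (%W) and weekday (%w, Sunday == 0) numbers, matching SQLite
df.index.strftime("%Y %W %w")
```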
Window functions
The last type of aggregation that is commonly used for time data is the rolling window. As opposed to grouping rows by the values of specific columns, this method defines an interval of rows, picks the resulting sub-table, shifts the window, and repeats.
Let’s see an example of calculating a moving average of five consecutive rows (the current plus four into the past). In SQL, the syntax is the following:
```sql
SELECT
    timestamp,
    AVG(amount) OVER (
        ORDER BY timestamp
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) rolling_avg
FROM transactions;
```
Pandas offers a much simpler syntax:
```python
# applying mean immediately
df["amount"].rolling(5).mean()

# accessing the chunks directly
for chunk in df["amount"].rolling(5):
    pass
```
Again, in Pandas, different adjustments can be made using optional arguments. The size of the window is dictated by the window argument, which in SQL is expressed by the ROWS BETWEEN 4 PRECEDING AND CURRENT ROW clause. In addition, we may want to center the window, use a different window type, e.g. for weighted averaging, or perform optional data cleaning. However, the usage of the Rolling object returned by the .rolling method is, in a sense, similar to that of the DataFrameGroupBy objects; a few variations are sketched below.
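For instance (a sketch of a few such variations, under the assumption that df still has the sorted timestamp index):

```python
# centered five-row window instead of a trailing one
df["amount"].rolling(5, center=True).mean()

# time-based window: average over the transactions of the last 7 days
df["amount"].rolling("7D").mean()

# allow partial windows at the beginning of the series
df["amount"].rolling(5, min_periods=1).mean()
```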
Conclusions
Here, I presented three types of aggregations I frequently use when working with time-series data. Although not all data that contains information about time is a time series, for time series it is almost always beneficial to convert the time information into pd.Timestamp or other similar objects that build on numpy's np.datetime64 underneath. As shown, it makes aggregating across different time properties very convenient, intuitive, and fun.