
# Categorical sequences with Pandas for household expense control

## Introduction

Dealing with household expenses is never pleasant. Although banks try to make it fun these days, their user interfaces seldom help you gain real insight into how much you actually spend and where "it all goes". After all, the more you spend, the more likely you are to apply for a loan, correct?

Anyway, most banks offer their customers an option to download their transaction history. While the export options usually include CSV, the resulting lists are anything but easy to analyse.

In this post, we show how to translate such historical data into an easier-to-analyse format using the pandas library, one of the "must-knows" for any data scientist.

## Initial format

The initial format is typical for historical data. It contains a date or a time-stamp, a transferee and an amount that was transferred:
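For illustration, a hypothetical export might look like the snippet below (the column names DATE, CATEGORY and VALUE, and all the values, are made up for this sketch; real exports differ in naming and delimiters):

```python
import io
import pandas as pd

# A hypothetical bank export; note the row with an empty CATEGORY.
raw = io.StringIO(
    "DATE,CATEGORY,VALUE\n"
    "2018-01-02,Groceries,-45.10\n"
    "2018-01-02,,-12.00\n"
    "2018-01-05,Transport,-30.50\n"
)
df = pd.read_csv(raw)
print(df)
```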

It is definitely not analytics-friendly, and we would like to present the values on a timeline, split into categories. To that end, we will build a table whose column names are taken from the values of the CATEGORY column. Then we will accumulate the data within time periods, making it easy to track the expenses in each category over time.

## Processing

### To categories

As the first step, we need to access the data, which can be done either by manually downloading the file or by requesting the data through the bank's open API (if available).

Once we have the initial table loaded as a pandas DataFrame, we can pick the new columns' names from the values of CATEGORY.

The first line replaces all empty values (appearing as NaNs) with a common category, "Other". The second line uses the CATEGORY values to create a list of new column names.
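The original code is not shown here, but the two lines described above can be sketched as follows (the DataFrame contents are hypothetical stand-ins for the loaded bank export):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the loaded bank export.
df = pd.DataFrame({
    'DATE': ['2018-01-02', '2018-01-02', '2018-01-05'],
    'CATEGORY': ['Groceries', np.nan, 'Transport'],
    'VALUE': [-45.10, -12.00, -30.50],
})

# empty categories (NaN) fall into the common "Other" bucket
df['CATEGORY'] = df['CATEGORY'].fillna('Other')
# the distinct CATEGORY values become the new column names
dims = df['CATEGORY'].unique().tolist()
```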

In the loop, we use the df[df['CATEGORY'].str.match(dim)] construction to pick all rows that belong to the given category. At the same time, .loc[:, df.columns != 'CATEGORY'] drops the CATEGORY column, which is now completely redundant. We then limit the sub-table to the DATE and VALUE columns and replace the name "VALUE" with the one matching our category (dim). Finally, we concatenate all the sub-tables and sort them with respect to the timestamp.
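A minimal sketch of that loop, using the same hypothetical sample data as above (the real frame comes from the bank's CSV export):

```python
import pandas as pd

# Hypothetical input standing in for the loaded bank export.
df = pd.DataFrame({
    'DATE': ['2018-01-02', '2018-01-02', '2018-01-05'],
    'CATEGORY': ['Groceries', 'Other', 'Transport'],
    'VALUE': [-45.10, -12.00, -30.50],
})

dims = df['CATEGORY'].unique().tolist()

frames = []
for dim in dims:
    # pick the rows belonging to the given category
    # (note: str.match treats dim as a regular expression)
    sub = df[df['CATEGORY'].str.match(dim)]
    # drop the now-redundant CATEGORY column
    sub = sub.loc[:, df.columns != 'CATEGORY']
    # keep DATE and VALUE, renaming VALUE to the category name
    sub = sub[['DATE', 'VALUE']].rename(columns={'VALUE': dim})
    frames.append(sub)

# stitch the sub-tables together and sort by timestamp
result = pd.concat(frames, sort=False).sort_values('DATE')
```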

Our data is now of the following form:

### To sequences

So far so good. We now have one table that contains all values on a per-category basis. Unfortunately, it contains a lot of "holes" (NaNs), which still hinder a nice overview. Furthermore, in this example we only have dates, so our time resolution is one day, although we may have several entries within a single day. In many situations, data may be time-stamped all the way down to milliseconds, and there would be no point in keeping such high resolution.

Let's say we are only interested in our weekly spending within each of the categories listed. Pandas offers a lot of tools for handling sequential data.

Here, we use pd.to_datetime() to batch-convert our timestamps from strings to date-time objects. Once done, we make them our actual index. Finally, given that our index is now a date-time index, we can resample it into 7-day time bins and sum all the expenses within each bin.
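These three steps can be sketched as below, continuing from the per-category table (the data here is a hypothetical stand-in):

```python
import pandas as pd

# hypothetical per-category table, as produced by the previous step
result = pd.DataFrame({
    'DATE': ['2018-01-02', '2018-01-05', '2018-01-10'],
    'Groceries': [-45.10, None, -20.00],
    'Transport': [None, -30.50, None],
})

# strings -> date-time objects, then use them as the index
result['DATE'] = pd.to_datetime(result['DATE'])
result = result.set_index('DATE')

# 7-day bins; sum() skips NaNs, effectively treating them as zeros
weekly = result.resample('7D').sum()
```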

Our data is now in the following form:

This operation applies automatically to all columns, treating all NaNs as zeros along the way. As a result, we obtain a multi-dimensional, time-continuous table that contains our weekly spending grouped by category.

Now, let’s use the derived table to plot the data.
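One way to do that is with the DataFrame's built-in plotting (which requires matplotlib); the weekly table below is a hypothetical stand-in for the resampled result:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripting; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical weekly table, shaped like the resample().sum() output
weekly = pd.DataFrame(
    {'Groceries': [-45.1, -20.0], 'Transport': [-30.5, 0.0]},
    index=pd.to_datetime(['2018-01-02', '2018-01-09']),
)

ax = weekly.plot()               # one line per category, time on the x-axis
ax.set_ylabel('weekly spending')
plt.tight_layout()
plt.savefig('expenses.png')      # or plt.show() interactively
```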

As a result, we get the following plot.

Isn’t that nice?

#### Hey! Do you mind helping me out?

It's been 4 years since I launched this blog. Now, I would like to bring it to the next level. I want to record some screencast tutorial videos on the very topics that brought you here!

If you want more of this stuff, you will help me greatly by filling out a survey I have prepared for you. By clicking below, you will be redirected to Google Forms with a few questions. Please answer them; it won't take more than 5 minutes, and I do not collect any personal data.

Thank you! I appreciate it.