Sunrise in Varanasi, India, 2012.

Useful file formats for working with machine learning and Python.

A word of context

For a couple of years already, Big Data and machine learning have offered us new tools and opportunities, but they have also begun to shape a new way of thinking about problems. From the R&D perspective, where approaching new problems is a daily routine, they add a new flavour to this work, making it even more exciting. Still, like almost anything commonly described as "R&D", there is a substantial void when it comes to standards. For machine learning in particular, one of those gaps is file formats.

Now, a lot of R&D success comes down to choosing the right tools for the job. Although there already exist a few very efficient machine learning libraries that help to construct diverse neural network architectures, the ultimate goal of the process is to rule out all the bad networks and arrive at the one which will hopefully give the most accurate predictions. This cannot be achieved with the libraries alone. For that, one has to build a chain of procedures: from the data source, through feature definitions and network training, up to the analysis of results. This chain is commonly referred to as a machine learning pipeline. Best if automated, best if flexible and best if... quick.

In this post, I would like to present four commonly used file formats, and discuss their strengths and weaknesses from a machine learning/Big Data perspective:

  • CSV
  • MAT
  • H5
  • JSON
None of these formats was deliberately created for machine learning purposes. Yet, with a bit of planning ahead, they can save you a lot of time when it comes to developing your pipeline and maintaining it in the future.

Comma-Separated Values

Let's start with "good old" csv files. These are plain text files that store tabular data. They are not standardized and, depending on the convention, use different delimiters to separate values. These "dinosaurs" date back to the '70s, but I guess they are still going to be around as long as people keep using Excel or other spreadsheet programs.

Why would they be interesting for machine learning, then? For exactly this reason. People do use spreadsheets to collect data, surveys, summaries, measurement outcomes, etc. Even though it is far from optimal, it is often easiest to simply save a spreadsheet in csv format.

Getting csv's into the pipeline

Getting csv input into your machine learning "ecosystem" is very easy, especially when the delimiter is consistent between different files. In this case, reading these files can be done in the following manner:

import csv
with open('your_data.csv', 'r', newline='') as f:
    content = csv.reader(f, delimiter=",")
    for row in content:
        print(row)
If it is not, the csv module possesses a range of additional tools that can help with inconsistencies.
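
For instance, the csv.Sniffer class can try to guess the dialect (including the delimiter) from a short sample of the file. A minimal sketch, assuming a hypothetical, inconsistently formatted mixed_delimiters.csv:

import csv
# csv.Sniffer guesses the dialect (delimiter, quoting) from a sample;
# 'mixed_delimiters.csv' is a hypothetical example file.
with open('mixed_delimiters.csv', 'r', newline='') as f:
    sample = f.read(1024)
    dialect = csv.Sniffer().sniff(sample)
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)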

Writing is also easy, but... should not be done. In fact, you would probably like the pipeline to work and process your data automatically. When data gets big and your datasets become multidimensional, simple csv files might keep your system from scaling up. Therefore, it may be wiser to organize the data in a format that is better suited for numerical calculations.

Mat stands for Matlab

For a very long time, Matlab has been a research standard for numerical computation. With its native mat files, it was easy to pass multidimensional datasets and variables between sessions. With the advent of Python, these files also became accessible through the scipy.io module. Their advantage over csv's is that data can be easily organized within their structure and multiple dimensions are not a problem.

Writing and reading

An example of writing a simple mat file goes as follows:

from scipy.io import loadmat, savemat
import numpy as np
x_values = np.array([1, 3, 9, -3, 12])
y_values = np.array([0, 2, 9, 11, -9])
data = {'x': x_values, 'y': y_values}
savemat('my_data.mat', data)
As you can see, we can organize our datasets by passing the arrays into a dictionary and storing it with a single function call; the data can later be recovered using loadmat. Note that the datasets need not be of equal size or dimensionality.

Problems...

Of course, life would not be real without problems. In fact, if you load the data now, you will notice that the returned variable (a dictionary) contains a bit more than what we saved.

from scipy.io import loadmat
data = loadmat('my_data.mat')
print(data.keys())
Alongside the x and y keywords there are also others, containing some contextual information. This information is usually just irrelevant, and since what we put into the file is not exactly what we get back, it is already pretty annoying, because we need to keep track of our keywords (dataset names). We cannot just blindly iterate over the dictionary recovered from the file.
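
One way around it is to filter the dictionary down to the datasets we actually saved. A minimal sketch, relying on the fact that scipy stores its metadata under keys wrapped in double underscores:

from scipy.io import loadmat
# Keep only our own datasets by dropping scipy's metadata entries
# (their keys start with '__').
data = loadmat('my_data.mat')
datasets = {key: value for key, value in data.items() if not key.startswith('__')}
print(datasets.keys())   # dict_keys(['x', 'y'])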

Another downside is that, even when we do keep track of the keywords, again, we do not receive the exact same thing back.

print(data['x'])

is not going to return us array([...]), but rather array([[...]]).

Again, this is annoying, since it would be natural to iterate over the elements once we get the dataset back, as data['x'][index]. Instead, we have to remember to put a zero first and iterate over the next "dimension" like this: data['x'][0][index]. And since when did we have that extra dimension to begin with?

Now, you can imagine that the more dimensions our datasets have (not to mention nested dictionaries, which can be saved too), the more complicated our work becomes, and smooth development is likely to get stranded on those petty things.
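
A partial remedy, at least for the extra dimensions, is the squeeze_me argument of loadmat, which drops unit dimensions on loading. A minimal sketch:

from scipy.io import loadmat
# squeeze_me=True removes the unit dimensions added by the
# Matlab-style storage, so data['x'][index] works as expected.
data = loadmat('my_data.mat', squeeze_me=True)
print(data['x'])   # array([ 1,  3,  9, -3, 12])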

Hierarchical Data Format

Alright, once we remind ourselves that mat files were created by Matlab, where everything (including your cats) is a matrix, these zeros become understandable. Still, there is a better way of handling data: h5 files. Python interfaces them through, for example, the h5py module. The file format lets you organize data in structures (just like mat files were supposed to do), but it also has a designated place for metadata in the form of attributes. This is pretty close to what you need in research... have it all in one place, nicely organized, etc.
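
To illustrate the attributes, here is a minimal sketch using h5py (the dataset name and attribute values are made-up examples):

import h5py
import numpy as np
# Store a dataset together with its metadata in one file;
# 'measurements', 'units' and 'sampling_rate_hz' are example names.
with h5py.File('example.h5', 'w') as f:
    dset = f.create_dataset('measurements', data=np.random.rand(100, 3))
    dset.attrs['units'] = 'mV'
    dset.attrs['sampling_rate_hz'] = 250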

With Big Data and Python there is, however, one more advantage. The files can be accessed through subsets of datasets, which is "lifesaving" when running out of RAM. Let's face it: Big Data is indeed something that should be handled in the cloud, but how many times would you like to test your ideas on your local machine first?

Saving a dataset

Saving a dataset takes more than one line of code. For that we will use the tables module (PyTables), which handles the operations in a somewhat friendlier fashion.

import numpy as np
import tables as tbl
DATA_SIZE, DIMENSION = 1000, 8                      # example dataset shape
data = np.random.rand(DATA_SIZE, DIMENSION).astype(np.float16)
try:
    f = tbl.open_file('my_data.h5', mode='w')
    atom = tbl.Float16Atom()                        # value format of the stored entries
    content = f.create_earray(f.root, 'data', atom, (0, DIMENSION))
    for idx in range(DATA_SIZE):
        f.root.data.append(data[idx:idx + 1, :])    # append one row at a time
finally:
    f.close()

What happened here is that we essentially created/opened a new file, specified the value format we would use (the atom), and created an extendible array to which we were appending chunks of data in the loop. The (0, DIMENSION) tuple specifies the shape of the array (again, it need not be a 2D table!), with 0 indicating the dimension that is meant to be extendible.

When reading from the file, we do not need to dump its entire content into RAM. We can, as said before, use chunks of it, depending on which part we need to access.

import tables as tbl
try:
    f = tbl.open_file('my_data.h5', mode='r')
    # LEFT:RIGHT and DOWN:UP are placeholder slice bounds; only this
    # chunk of the dataset is read into memory.
    chunk = f.root.data[LEFT:RIGHT, DOWN:UP]
finally:
    f.close()
This is especially useful when working with images or sequences.

Hyperparameters

The last format I would like to mention is JSON. JSON stands for JavaScript Object Notation, but its syntax is also valid dictionary syntax in Python. This is very fortunate, since JSONs are extremely popular in web development and databases, which makes them great candidates for storing hyperparameters. In fact, even neural network model architectures can be expressed as JSONs. With everything standardized like this, your pipeline can be built with a certain level of future compatibility right from the start.

The way to approach the problem is to define one walk through the pipeline as a session, which can be uniquely described by a bunch of hyperparameters, nicely organized as a JSON. The JSONs can then be stored in a database, recording which hyperparameter constraints have already been tested. Whenever you decide to add or remove a certain parameter in the future, it will not be a problem, since JSONs are very flexible.
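
As an illustration, saving one session's hyperparameters could look as follows (a minimal sketch; the parameter names and values are made-up examples):

import json
# Made-up example of the hyperparameters describing one session.
hyperparameters = {'learning_rate': 0.001,
                   'batch_size': 64,
                   'n_hidden_layers': 3,
                   'activation': 'relu'}
with open('hyperparameters.json', 'w') as f:
    json.dump(hyperparameters, f, indent=2)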

As you can see, Python interfaces JSONs through, again, a dedicated module. Loading the file back can be done in the following way:

import json
hyperparameters = {}
try:
    with open('hyperparameters.json') as f:
        p = json.load(f)
except (ValueError, IOError):
    print("Failed to load hyperparameters.")
else:
    for key in p:
        hyperparameters[key] = p[key]
Of course, you may build a class dedicated solely to handling your hyperparameters. The bottom line is that whenever various streams of data pass through your pipeline, it is always useful to shield the hyperparameters, so you do not get confused between variables, names, dimensions and datasets. JSON is a neat way to ensure this.
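
A minimal sketch of what such a class might look like (the class name and default handling are just an assumption):

import json

class Hyperparameters:
    """A small sketch of a container that shields hyperparameters."""

    def __init__(self, defaults=None):
        self.params = dict(defaults or {})

    def load(self, path):
        # Overwrite the defaults with whatever the JSON file provides.
        try:
            with open(path) as f:
                self.params.update(json.load(f))
        except (ValueError, IOError):
            print("Failed to load hyperparameters from", path)
        return self

    def __getitem__(self, key):
        return self.params[key]

# Usage: params = Hyperparameters({'batch_size': 32}).load('hyperparameters.json')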

Summary

In this post, we discussed four file formats that are relevant from a machine learning and Big Data point of view. Starting out with the simple csv, often used for inputting data, we moved on to mat and h5 files, which come in handy when dealing with datasets. The latter especially, although it requires more code, is a more generic way of organizing datasets. Finally, we touched upon JSONs and outlined why they are useful for storing hyperparameters.