Useful file formats for working with machine learning and Python
A word of context
For a couple of years already, Big Data and machine learning have offered us new tools and opportunities, but they have also begun to shape a new way of thinking about problems. From the R&D perspective, where approaching new problems is a daily routine, they add a new flavour to this work, making it even more exciting. Still, as with almost anything commonly described as "R&D", there is a substantial void when it comes to standards. For machine learning in particular, one such void concerns file formats.
Now, a lot of R&D success is about choosing the right tools for the job. Although there already exist a few very efficient machine learning libraries that help to construct diverse neural network architectures, the ultimate goal of the process is to rule out all the bad networks and arrive at the one that will hopefully give the most accurate predictions. This cannot be achieved with the libraries alone. For that, one has to build a chain of procedures: from the data source, through feature definitions and network training, up to the analysis of results. This chain is commonly referred to as a machine learning pipeline. Best if automated, best if flexible and best if... quick.
In this post, I would like to present four commonly used file formats (csv, mat, h5 and JSON) and discuss their strengths and weaknesses from a machine learning/Big Data perspective.
Let's start with "good old" csv files. These are mainly plain text files that store tabular data. They are not standardized, and the delimiter that separates values depends on the convention. These "dinosaurs" date back to the '70s, but I guess they are going to be around for as long as people use Excel or other spreadsheet programs.
Why would they be interesting for machine learning, then? Exactly for this reason. People do use spreadsheets to collect data, surveys, summaries, measurement outcomes, etc. Even though it is far from optimal, it is often easiest to simply save a spreadsheet in csv format.
Getting csv's into the pipeline
Getting csv input into your machine learning "ecosystem" is very easy, especially when the delimiter is consistent between different files. In this case, reading these files takes only a few lines with Python's csv module. If it is not, the csv module offers a range of additional tools that can help with inconsistencies.
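To make the reading step concrete, here is a minimal sketch using the csv module. The column names, the sample values and the read_csv_rows helper are my own assumptions for illustration; with a real file you would pass open("your_file.csv", newline="") instead of the in-memory string.

```python
import csv
import io

# Hypothetical sample standing in for a spreadsheet export.
SAMPLE = "time,temperature\n0,20.5\n1,21.0\n2,21.7\n"


def read_csv_rows(handle, delimiter=","):
    """Return the header and the data rows (converted to floats)."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # the first line holds the column names
    rows = [[float(value) for value in row] for row in reader]
    return header, rows


header, rows = read_csv_rows(io.StringIO(SAMPLE))
print(header)
print(rows)
```

If the delimiter differs between files, passing it explicitly (for example delimiter=";") is the simplest fix; csv.Sniffer is one of the tools that can try to guess it, although it is only a heuristic.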
Writing is also easy, but... should not be done. After all, you would probably like the pipeline to work and process your data automatically. When data gets big and your datasets become multidimensional, simple csv files might keep your system from scaling up. Therefore, it may be wiser to organize the data in a format that is more suitable for numerical calculations.
Mat stands for Matlab
For a very long time, Matlab has been a research standard for numerical computation. With its native mat files, it was easy to pass multidimensional datasets and variables between sessions. With the advent of Python, these files also became accessible through the scipy.io module. Their advantage over csv files is that data can be easily organized within their structure, and multiple dimensions are not a problem.
Writing and reading
An example of writing a simple mat file goes as follows:
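This is a minimal sketch of the writing step, assuming numpy and scipy are installed. The dataset names x and y, their values and the file name are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.io import savemat

# Hypothetical datasets; the dictionary keys become the variable names
# stored in the mat file.
data = {
    "x": np.array([1.0, 2.0, 3.0]),
    "y": np.array([[1, 2], [3, 4], [5, 6], [7, 8]]),  # a different size is fine
}

# A temporary directory keeps the example from cluttering the working directory.
path = os.path.join(tempfile.mkdtemp(), "datasets.mat")
savemat(path, data)
print(os.path.exists(path))
```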
As you can see, we can organize our datasets by passing the numbers into a dictionary and storing it with a simple function; the data can later be recovered using loadmat. Note that the datasets need not be of equal size.
Of course, life would not be real without problems. In fact, if you load the data now, you will notice that the returned variable (a dictionary) contains a bit more than we stored: besides our x and y keywords, there are also others containing some contextual information. This information is usually irrelevant, and since what we put into the file is not exactly what we get from the file, it is already pretty annoying, because we need to keep track of our keywords (dataset names). We cannot just blindly iterate over a dictionary that is recovered from the file.
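One possible workaround, sketched below, is to filter out the extra entries right after loading. It relies on the fact that scipy names its contextual entries with a leading double underscore (for example __header__); the file and dataset names are again hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Write a small hypothetical file first so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "datasets.mat")
savemat(path, {"x": np.array([1.0, 2.0, 3.0]), "y": np.array([4.0, 5.0])})

loaded = loadmat(path)
print(sorted(loaded.keys()))  # our keys plus the double-underscore entries

# Keep only our own datasets by dropping the contextual entries.
datasets = {key: value for key, value in loaded.items()
            if not key.startswith("__")}
print(sorted(datasets.keys()))
```

With the filtered dictionary, iterating over the datasets no longer requires a separate list of names.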
Another downside is that, even when we do keep track of the keywords, we again do not receive the exact same thing back. Accessing data['x'] is not going to return array([...]), but rather array([[...]]).
Again, this is annoying, since it would be natural to iterate over the elements once we get the dataset back as data['x'][index]. Instead, we have to remember to put a zero first and iterate over the next "dimension" like this: data['x'][0][index]. But since when did we have that dimension to begin with?
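The extra dimension can be seen, and removed, with a short sketch like the one below. It assumes scipy's default behaviour of storing one-dimensional arrays as row vectors; squeeze_me is an existing loadmat option that drops dimensions of length one, and the file and dataset names are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

path = os.path.join(tempfile.mkdtemp(), "datasets.mat")
savemat(path, {"x": np.array([1.0, 2.0, 3.0])})

# By default the 1-D array comes back as a 2-D row vector.
loaded = loadmat(path)
print(loaded["x"].shape)
print(loaded["x"][0][1])  # note the extra leading zero index

# Asking loadmat to squeeze unit dimensions restores the 1-D shape.
squeezed = loadmat(path, squeeze_me=True)
print(squeezed["x"].shape)
print(squeezed["x"][1])
```

Note that squeezing affects every variable in the file, so a dataset that is legitimately of shape (1, N) would be flattened as well.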
Now, you can imagine that the more dimensions our datasets have, not to mention nested dictionaries that can be saved too, the more complicated our work becomes, and smooth development is likely to get stranded on those petty things.
Hierarchical Data Format
Alright, once we remind ourselves that mat files were created by Matlab, where everything (including your cats) is a matrix, these zeros become understandable. Still, there is a better way of handling data: h5 files. Python interfaces with them through the h5py module. The file format lets you organize data in structures (just like mat files were supposed to do), but it also provides a designated place for metadata through attributes. This is pretty close to what you need in research... have it all in one place, nicely organized, etc.
With Big Data and Python there is, however, one more advantage. The files can be accessed in subsets of datasets, which is "lifesaving" when running out of RAM. Let's face it: Big Data is indeed something that should be handled in the cloud, but how many times would you like to test your ideas on your local machine?
Saving a dataset
Saving a dataset takes more than one line of code. For that we will use the tables module (PyTables), which handles the operations in a somewhat friendlier fashion.
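A minimal sketch of such a saving routine could look as follows, assuming numpy and PyTables are installed. The values of DIMENSION, CHUNK_ROWS and N_CHUNKS, the random data, the file name and the node name "data" are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import tables

DIMENSION = 4     # hypothetical number of values per row
CHUNK_ROWS = 100  # hypothetical number of rows appended per iteration
N_CHUNKS = 5

path = os.path.join(tempfile.mkdtemp(), "dataset.h5")

with tables.open_file(path, mode="w") as h5file:
    atom = tables.Float64Atom()  # the value format of a single element
    # The 0 marks the axis along which the array can grow.
    array = h5file.create_earray(h5file.root, "data",
                                 atom=atom, shape=(0, DIMENSION))
    for _ in range(N_CHUNKS):
        chunk = np.random.rand(CHUNK_ROWS, DIMENSION)
        array.append(chunk)  # only one chunk needs to be in RAM at a time
    saved_shape = tuple(array.shape)

print(saved_shape)
```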
What happened here is that we essentially created/opened a new file, specified what value format we would use (atom), and created an extendable array to which we were saving chunks of data in the loop. The (0, DIMENSION) tuple is used to specify the size of the table (which, again, need not be 2D!), with 0 indicating the dimension that is meant to be extendable.
When reading from the file, we do not need to dump its entire content to RAM. We can, as said before, use chunks of it, depending on what part we need to access. This is especially useful when working with images or sequences.
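Reading a subset can be sketched with ordinary slicing on the stored array. The example below first writes a small hypothetical file so that it runs on its own; the node name "data" and the slice boundaries are assumptions.

```python
import os
import tempfile

import numpy as np
import tables

# Prepare a small hypothetical file: 50 rows of 3 values each.
full = np.arange(150, dtype=np.float64).reshape(50, 3)
path = os.path.join(tempfile.mkdtemp(), "dataset.h5")
with tables.open_file(path, mode="w") as h5file:
    array = h5file.create_earray(h5file.root, "data",
                                 atom=tables.Float64Atom(), shape=(0, 3))
    array.append(full)

# Read back only rows 10..19; the rest of the file stays on disk.
with tables.open_file(path, mode="r") as h5file:
    node = h5file.root.data
    total_rows = node.shape[0]
    subset = node[10:20, :]  # slicing returns a numpy array

print(total_rows, subset.shape)
```

Looping over such slices lets you process a dataset that is larger than the available RAM one piece at a time.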
The way to approach the problem is to define one walk through the pipeline as a session, which can be uniquely described by a bunch of hyperparameters, nicely organized as JSON. The JSONs can then be stored in a database, recording which hyperparameter constraints have already been tested. Whenever you decide to add or remove a certain parameter in the future, it will not be a problem, since JSONs are very flexible.
Python interfaces with JSON through, again, a dedicated module: json. Loading a file takes a single call to json.load. Of course, you may build a class dedicated solely to handling your hyperparameters. The bottom line is that whenever various streams of data pass through your pipeline, it is always useful to shield the hyperparameters, so you do not get confused with variables, names, dimensions and datasets. JSON is a neat way to ensure this.
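As a minimal sketch, a session's hyperparameters can be written to and read from a JSON file as shown below. The parameter names, their values and the file name are purely illustrative assumptions.

```python
import json
import os
import tempfile

# Hypothetical hyperparameters describing one session of the pipeline.
hyperparameters = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "hidden_layers": [128, 64],
    "activation": "relu",
}

path = os.path.join(tempfile.mkdtemp(), "session.json")

# Store the session description...
with open(path, "w") as handle:
    json.dump(hyperparameters, handle, indent=2)

# ...and load it back as an ordinary dictionary.
with open(path) as handle:
    session = json.load(handle)

print(session["learning_rate"], session["hidden_layers"])
```

Adding or removing a parameter later only changes the dictionary; the loading code stays the same.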
In this post, we discussed four file formats that are relevant from the machine learning and Big Data point of view. Starting out with the simple csv, often used for inputting data, we moved on to mat and h5 files, which come in handy when dealing with datasets. The latter in particular, although it requires more code, is a more generic way of organizing datasets. Finally, we touched upon JSONs and outlined why they are useful for storing hyperparameters.