JSON in the context of machine learning
Introduction
In the previous post, we considered JSON a useful candidate for storing and passing hyperparameters. Since there is a lot to it, let's take a closer look at how JSON documents can be stored and passed between parts of the system.
Why play with JSON?
Before moving on, let's quickly explain why it is worth considering JSON for passing hyperparameters in the first place.
First of all, a neural network pipeline is likely to be composed of several modules or stages that do completely different things (e.g. interfacing databases, feature extraction, grid search, etc.). Obviously, it is possible to pass all parameters (including hyperparameters) as ordinary function or object arguments. However, working with Big Data involves plenty of complex operations of different kinds, so to keep the code understandable one has to agree early on a specific naming convention for the parameters. Not only may it be implausible to settle on such a convention at an early stage, especially while still experimenting, but it also makes the code cluttered and difficult to maintain.
JSON offers a consistent way for formatting the hyperparameter input, which is both flexible and simple to use. More specifically:
- Being essentially a string, it can be stored as a whole in a file or as a single database entry.
- It can be sent over a URL when the system is distributed.
- Python can easily convert it to a dictionary, where parameters can be accessed by keyword.
- Just like a dictionary, its structure allows nesting, which can greatly help to organize the hyperparameters.
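The last two points are one-liners in practice. A minimal sketch (the key names here are made up for illustration):

```python
import json

# A nested JSON string grouping hyperparameters by pipeline stage
raw = '{"training": {"BATCH_SIZE": 100, "LEARNING_RATE": 0.001}, "search": {"GRID_SEARCH": false}}'

h = json.loads(raw)  # parse the string into a nested dictionary

# Parameters can now be accessed by keyword, respecting the nesting
print(h["training"]["BATCH_SIZE"])  # 100
print(h["search"]["GRID_SEARCH"])   # False
```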
Loading hyperparameters in practice
As you might have noticed, I emphasize the word hyper, but the approach extends perfectly well to parameters that are not strictly considered hyperparameters in the machine learning sense. They can, for example, hold a particular SQL query through which the pipeline requests a specific subset of sample data.
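To make the SQL example concrete, such a non-hyperparameter entry can sit right next to the hyperparameters in the same JSON document (the key names and the query below are hypothetical):

```python
import json

# Hypothetical parameter set: "SQL_QUERY" is not a hyperparameter in the
# strict sense, but it configures which samples the pipeline pulls
params = {
    "BATCH_SIZE": 100,
    "SQL_QUERY": "SELECT * FROM samples WHERE label IS NOT NULL LIMIT 10000",
}

packet = json.dumps(params)     # serialize for storage or transmission
restored = json.loads(packet)   # the round trip preserves everything
print(restored["SQL_QUERY"])
```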
In practice, it is useful to create a dedicated class to deal with the parameters.
import json

class Param:
    """
    Auxiliary class for supplying hyperparameters.
    """
    def __init__(self, filepath='./params/hyper.json'):
        # Sensible defaults, used when the file is missing or unreadable
        self.h = {
            "BATCH_SIZE": 100,
            "LEARNING_RATE": 0.001,
            "GRID_SEARCH": "False",
        }
        try:
            p = {}
            with open(filepath) as f:
                p = json.load(f)
        except IOError:
            print("File not found.")
        except ValueError:
            print("File could not be read.")
        else:
            for key in p:
                self.h[key] = p[key]  # override each default with the file value
            print("Hyperparameters loaded OK.")
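The loop at the end merges the file's values over the defaults key by key, so a partial hyper.json only overrides what it mentions. The same effect in isolation:

```python
import json

defaults = {"BATCH_SIZE": 100, "LEARNING_RATE": 0.001, "GRID_SEARCH": "False"}
overrides = json.loads('{"BATCH_SIZE": 256}')  # what a partial hyper.json might contain

h = dict(defaults)
for key in overrides:
    h[key] = overrides[key]  # file values win, missing keys keep their defaults

print(h["BATCH_SIZE"], h["LEARNING_RATE"])  # 256 0.001
```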
Shipping JSON over a network
Another likely scenario involves supplying the hyperparameters over a network. Assume a situation in which we run the computation on one machine, but the way these computations are executed depends on some configuration. In this case, we can either update the existing file, or take advantage of networking and send the whole configuration to a URL.
If the server expects the configuration to be sent to some specific URL,
we can define a very simple way of sending the hyperparameters.
import json
from httplib2 import Http

param = Param()  # the Param class defined earlier
config = param.h
packet = json.dumps(config)

h = Http()
try:
    dest_IP = "127.0.0.1"            # specify the correct address
    dest_port = 1234                 # specify the correct port
    dest_url = 'accept/hyperparams'  # specify the correct url
    destination = "http://{}:{}/{}".format(dest_IP, dest_port, dest_url)
    resp, content = h.request(destination, "POST", packet)
except ConnectionRefusedError:
    print("Connection refused!")
else:
    if content == b"success":        # httplib2 returns the body as bytes
        print("Parameters accepted.")  # do something if accepted
    else:
        print("Parameters rejected.")  # do something else if rejected
The hyperparameters are loaded through the Param class from before and reside in self.h as a dictionary. After converting it back to JSON, we use the httplib2 module to ship it to a designated URL. If the server "knows" what to do with it, this is a quick way of configuring the calculations.
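The post does not show the server side. A minimal sketch of what could sit behind accept/hyperparams, using only the standard library (the handler logic and the "success"/"failure" replies are assumptions chosen to match the client above):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HyperHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client declared
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            hyper = json.loads(body)   # the dictionary sent by the client
        except ValueError:
            reply = b"failure"
        else:
            print("Received:", hyper)  # hand the dictionary to the pipeline here
            reply = b"success"
        self.send_response(200)
        self.end_headers()
        self.wfile.write(reply)

server = HTTPServer(("127.0.0.1", 1234), HyperHandler)
# server.serve_forever()  # uncomment to start listening
```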
...with attachment
An even more "sophisticated" situation may arise in which we are concerned not only with sending the parameters alone, but would also like to send, for example, a sample file. Doing this is straightforward and requires only one more step: encoding.
In order to transfer a file alongside the parameters, we need to somehow convert it to text
that can be embedded in the JSON packet itself. For that we can use the well-known base64
encoding.
import base64
import json

with open('somefile.dat', 'rb') as f:
    file_content = f.read()

byte_content = base64.b64encode(file_content)  # raw bytes -> base64 bytes
str_content = byte_content.decode('utf-8')     # base64 bytes -> ASCII string

param.h['somefile'] = str_content              # attach the file to the parameters
packet = json.dumps(param.h)
Note the .decode call, which converts the base64 bytes to an ASCII string so that the content can be embedded in JSON.
Decoding it is not difficult either: we simply apply the reverse operations in reverse order.
packet_as_dict = json.loads(packet)
retrieved_str = packet_as_dict['somefile']
retrieved_bytes = bytes(retrieved_str, 'utf-8')    # ASCII string -> base64 bytes
decoded_bytes = base64.b64decode(retrieved_bytes)  # base64 bytes -> original bytes

with open('somefile.dat', 'wb') as f:
    f.write(decoded_bytes)
    print("File saved.")
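A quick way to convince yourself that the two snippets are exact inverses is to round-trip some arbitrary bytes in memory, without touching the filesystem:

```python
import base64
import json

original = b"\x00\x01binary payload\xff"  # arbitrary bytes, not valid UTF-8

# encode side
as_text = base64.b64encode(original).decode("utf-8")
packet = json.dumps({"somefile": as_text})

# decode side
restored = base64.b64decode(bytes(json.loads(packet)["somefile"], "utf-8"))
assert restored == original
```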
Final remark
Since we have touched upon networking, it is worth adding that just sending data like this (especially files) poses the risk of someone sending harmful content. Therefore, depending on the actual implementation, it may be a good idea to encrypt or sign the transmitted packets. A simple trick to keep the pipeline safe.
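As a concrete illustration of the signing idea, both ends could share a secret key and attach an HMAC of the payload. This is only a sketch: the key, field names, and helper functions below are made up, and real deployments should manage the secret properly.

```python
import hashlib
import hmac
import json

SECRET = b"shared-secret-key"  # hypothetical key known to both ends

def sign(payload: str) -> str:
    """Wrap the JSON payload together with its HMAC-SHA256 signature."""
    sig = hmac.new(SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return json.dumps({"payload": payload, "signature": sig})

def verify(signed: str) -> bool:
    """Recompute the signature and compare in constant time."""
    msg = json.loads(signed)
    expected = hmac.new(SECRET, msg["payload"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["signature"])

signed = sign('{"BATCH_SIZE": 100}')
print(verify(signed))  # True
```

The receiver rejects any packet whose signature does not match, so a tampered payload is caught before the parameters ever reach the pipeline.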