ukraine

How to write better scientific code in Python?

Introduction

A large part of any scientific effort lies down in writing code. Be it typical machine-learning modeling, an analysis, or contributing to a data project, a significant part of the time goes into prototyping new functionalities. Being exploratory, it is expected that several parts of the program will be replaced or modified often beyond what was planned initially.

Unlike for “consumer software”, changes are often induced not by customer’s requirements, but rather by outcomes of the process along the way. For this reason, it is extremely valuable to design it in ways, which will not require a “total reconstruction”, if experimental evidence suggests a different path.

Writing science code comes with two (additional) specific challenges: The first one relates to mathematical errors. Mistakes in computations are often hard to trace, especially when the code is semantically correct. No bugs are found. No exception is raised. All looks good, but the (numerical) result is wrong. Especially, with implementing probabilistic models, the results may sometimes look good, depending on some initial conditions or a random factor.

The second comes from the fact described earlier. There will always be experimental parts, and they change all the time. Therefore, the key is to design each component, so that most of the work can stay and serve as a rock-solid foundation for the next stage of development.

In this article, we focus on patterns that can make the code more robust, understandable, and overall easier to handle. You will see how simple improvements can lead to more reusable code and fewer bugs.

Toy task

Our task, for the sake of demonstration, is to calculate the expected value of outcomes of a random process. Mathematically, it comes down to this formula:

where is a probability mass function (PMC), or for a continuous variable:

where is a probability density function (PDF).

As you may wonder, the challenge here is that you want it to work with any distribution: continuous or discrete. Or if not plausible, you want at least to recognize the nature of the problem, so you can adapt the code smoothly.

Bad code - the starting point

Let’s start with a not-so-good code, to begin with. Say you want to roll six-sided dice. With each outcome equally probable, it comes down to calculating the mean value over sampled results.

1
2
3
4
5
6
7
8
9
10
import random
import statistics


def die(sides = 6):
    return random.randint(1, sides)

def expected_value(n_samples):
    samples = [die() for _ in range(n_samples)]
    return statistics.mean(samples)

What’s wrong with this code? Several things…

First of all, the die function returns one sample at a time. It needs to be called N times to get N samples, which is slow.

Secondly, the expected_value function strongly depends on the die function that produces samples. It is evident, once you consider using a different die, say a 12-sided one. With this design, you need to “open” the expected_value to accept an additional parameter sides, only to pass it on to die to extend it for the more general case. While this would work, it makes the interface of expected_value unintuitive, but the solution still relies on using die as the source of samples thus making it hard to consider other distributions.

Remedies…

Let’s consider the options you have:

Idea #1

You can use an external variable to store samples:

1
2
3
4
5
6
def expected_value(samples):
    return statistics.mean(samples)

samples = [die(12) for _ in range(n_samples)]

ev = expected_value(samples)

This is pretty obvious, but you just move the problem elsewhere… Now, samples become a new entity, storage for data (even very large), and it is quite anonymous. The expected_value function expects to receive it but to prepare it is on you.

Idea #2

Another way is to keep the die inside, by passing it to expected_value as an object.

1
2
3
4
5
6
7
8
9
10
from functools import partial

twelve_sided_die = partial(die, 12)

def expected_values(die, n_samples):
    samples = [die() for _ in range(n_samples)]
    return statistics.mean(samples)


ev = expected_values(twelve_sided_die, 10000)

The idea uses a prepared “version” of die and makes expected_value use it as the source of samples. However, a new problem arises: The expected_value is only compatible with die. It can’t compute the result with any other “sample generator”, or at least it is not guaranteed to do it correctly.

Idea #3

The third idea is to recognize the problem at a more abstract level and design better interfaces.

At the abstract level, we have two concepts:

  • There exist a probability distribution from which we can sample. (It can be a die, a coin, normal distribution - doesn’t matter).
  • There is a mathematical operation that consumes and transforms data. (E.g. calculating mean, variance, etc.).

Let’s give is more attention to how we can build the right abstractions and how to control their mutual compatibility.

Distribution (Data)

Mathematically, a probability distribution can be function - continuous or discrete, finite or infinite, from which we can draw samples. The “prescription” for this function can be very different depending on the problem. We may use an “existing” formula such as Gaussian or Poisson distribution, but it can also be a “custom” creation derived from e.g. histogram.

With this in mind, let’s establish the following abstraction:

1
2
3
4
5
6
7
from abc import ABC, abstractmethod

class Distribution(ABC):
    
    @abstractmethod
    def sample(self):
        ...

Implementation

Due to @abstractmethod, our distribution enforces that we implement sample method to any child class that derives from this abstraction. For our die, this can be:

1
2
3
4
5
6
7
8
9
import numpy as np


class Die(Distribution):
    def __init__(self, sides = 6):
        self.sides = sides

    def sample(self, n_samples):
        return np.random.randint(1, self.sides + 1, size=n_samples)

Here, the delivery of samples happens by invoking the method sample that is specific to tossing a fair die: Die(12).sample(10000). Moreover, thanks to numpy, we can create large amounts of data very quickly by replacing the list comprehension with np.ndarray.

In fact, things can be improved even further. Currently, calling Die() returns something like this <__main__.Die at 0x7f43f4448400>, which is not imformative. Aslo Die() == Die() evaluates to False, as for python they are two different object instances of the same class. To fix it, we need to implement two more methods:

1
2
3
4
5
6
7
    def __repr__(self):
        return f"{self.__class__.__name__}(sides={self.sides})"

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.sides == other.sides
        return False

The __repr__ method makes the rendering of the object nice, and __eq__ will only return True if the dice are “equally sided”.

Dataclasses

Implementing the four methods each time can be tedious. Besides, the current implementation of Die does not prevent altering of an object, even accidentally, by assigning the attribute to an existing object like this die.sides = 20.

Therefore, we can redefine the Die class using python’s dataclasses.

1
2
3
4
5
6
7
8
9
from dataclasses import dataclass


@dataclass(frozen=True)
class Die(Distribution):
    sides: int = 6

    def sample(self, n):
        return np.random.randint(1, self.sides + 1, size=n)

The behavior of this example is the same as the one before. Additionally, setting frozen=True, assigning a new value to die.sides will raise an exception. If we want a new die, we should create a new object.

Now, our expected_value function will likely take die as a distribution object and do the math by calling its sample method.

1
2
def expected_value(distribution, n):
    return distribution.sample(n=n).mean()

Typings

The above example is neat. We know exactly what expected_value does, and it is easy to test. However, an n-sided die is not the only distribution we may want to calculate the expected value of. For example, for the tossing of a coin, the outcomes are not numeric (unless we establish a convention and stick to it). Naturally, it makes sense to provide some hints as to what interfaces can be used together and how.

For a dynamically typed language like python, you are not forced to stick to variables’ types. However, with various IDEs and tools such as mypy, typing can help you spot potential points of failure and make the code more transparent.

Let’s redefine our classes using typing and create two more specific distributions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from typing import Generic, Sequence, TypeVar


D = TypeVar("D")


class Distribution(ABC, Generic[D]):
    
    @abstractmethod
    def sample(self, n: int) -> Sequence[D]:
        ...


@dataclass(frozen=True)
class Die(Distribution[int]):
    sides: int = 6

    def sample(self, n: int) -> Sequence[int]:
        return np.random.randint(1, self.sides + 1, size=n)


@dataclass(frozen=True)
class Coin(Distribution[str]):
    outcomes: tuple = ("H", "T")
    fairness: float = 0.5

    def sample(self, n: int) -> Sequence[str]:
        p = (self.fairness, 1.0 - self.fairness)
        return np.random.choice(self.outcomes, size=n, p=p)


@dataclass(frozen=True)
class Gaussian(Distribution[float]):
    mu: float = 0.0
    sigma: float = 1.0

    def sample(self, n: int) -> Sequence[float]:
        np.random.normal(loc=self.mu, scale=self.sigma, size=n)

Several things that happen here. Thanks to D = TypeVar("D"), we can now define a new variable type, by which we parametrize each distribution’s type. You can notice that Distribution inherits not only after the abstract base class but also after Generic[D], which turns it also into a new (parametrized) type. Now, it becomes a sort of identity, constituting a new data type.

Each version of sample is expected to return a sequence of a particular type that makes sense to each individual distribution’s context. This way, we have a unified interface that is a also parametrized. We can use this to ensure the correct behavior of expected_value:

1
2
def expected_value(distribution: Distribution[float | int], n: int) -> float:
    return distribution.sample(n=n).mean()

While passing e.g. die = Die() or gaussian = Gaussian() to expected_value will work (as both int and float are numeric), passing coin = Coin() will be flagged out by e.g. mypy, stating that

> error: Argument 1 to "expected_value" has incompatible type "Coin";
> expected "Distribution[Union[float, int]]"

This can give us an early warning before we even run the code.

Boosting up the purity

As you can see, designing interfaces using typing helps to formalize the intensions and catch bugs early. You can even bring it to the next level by leveraging numpy’s dtype. This way, you will not only ensure that the different elements fit together, but also be more conscious regarding the memory footprint of your data.

For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import numpy.typing as npt


class Distribution(ABC, Generic[D]):
    
    @abstractmethod
    def sample(self, n: int) -> np.NDArray[np.generic]:
        ...


class Die(Distribution[int]):
    sides: int = 6

    def sample(self, n: int) -> npt.NDArray[np.uint8]:
        return np.random.randint(
            1, self.sides + 1, size=n, dtype=np.uint8
        )

This way, you will even be informed if the die.sample method returns numbers that are different than strictly unsigned 8-bit integers. The question is whether you want to go that deep? It’s something to think about.

Calculations

Let’s come back to designing the calculation part. So far, we have expected_value that is prepared to work with numerical distributions. Naturally, we can compute the expected value for Die and Gaussian, but not for Coin. Not with the current design.

To fix this, have two choices:

  • We can create a proxy distribution by mapping e.g. ("H", "T") -> (0, 1), or
  • We incorporate a mapping within expected_value, giving a possible “adapter”.

The first approach creates an artificial body, whose idea relies on convention. It doesn’t prevent anyone from defining another proxy with ("H", "T") -> (1, 0), leading to a hard to detect the bug.

Instead, we can modify expected_value and give it a possibility to use custom adapters.

1
2
3
4
5
6
def expected_value(
    d: Distribution[D],
    f: Callable[[D], Any] = lambda x: x,
    n: int = 1000
) -> float:
    return np.mean(np.apply_along_axis(f, axis=0, arr=d.sample(n)))

The second argument to expected_value is a callable (a function), that we can optionally use to translate outcomes of e.g. Coin() distribution. By default, however, it will leave the outcomes intact.

1
2
3
4
5
6
7
8
9
die = Die(12)
expected_values(die, n=10000)

gaussian = Gaussian(mu=4.0, sigma=2.0)
expected_value(gaussian, n=100000)

# but
coin = Coin(fairness=0.75)
expected_value(coin, f=lambda x: np.where(x == "H", 1.0, 0.0))

Here, we not only avoided creating a proxy distribution but also managed to avoid tying expected_value to any specific way of converting the data. The expected_value function does only what it promises to do: computes the expected value. If any adaptation or conversion is required, it is provided externally. Note that also here we have an option: we can define a named function (e.g. coin_conversion) in case we plan to reuse it or stick to lambda when a separate definition does not add value.

Composition and iteration

Abstracting away mathematical calculations proves itself useful especially with designing iterative algorithms. Oftentimes, aside from the main computations, we have to keep an eye on some collateral results such as convergence, early stopping (max iterations), metrics, and so on.

Let’s take the constant for example. Mathematically, we can get its value by taking the following limit:

The higher is, the more precise the approximation.

How can we solve it adequately?

Not the best way…

Let’s start with a poor man’s approach with a loop.

1
2
3
4
5
6
7
8
9
10
11
12
13
def approximate_e(
    initial_guess: float,
    max_iter: int = 10,
    epsilon: float = 0.01
) -> float:
    e_value = initial_guess
    for n in range(1, max_iter + 1):
        new_e_value = (1.0 + 1.0 / n) ** n
        if abs(new_e_value - e_value) < epsilon:
            return new_e_value
        e_value = new_e_value

    return new_e_value

Again, what’s wrong with this approach?

First of all, the function does three things instead of just one. Line 8. is the absolute essence of the calculation. Yet, with the early stopping and convergence conditions, we are left with a lot of code overhead, which is tightly coupled with the actual calculations. Although the two conditions look like something more generic, if we choose to replace the subject of the approximation (to a square root for example), we will have to copy-paste this additional code and make sure it doesn’t break the new algorithm.

Secondly, our only options regarding parametrizing the two conditions are to either hard-code the values for max_iter and epsilon or allow users to provide them as arguments. It spoils the interface and makes testing more difficult.

Finally, the algorithm generates the data “eagerly”. Instead of focusing on the math and providing values “when asked”, it throws the data at you. For large amounts of data, this can cause memory issues.

Abstraction over calculations

Now, let’s address all three problems at once by separating the responsibility between the different parts. We have three things:

  • something that we expect to return us correct numbers (the actual calculations),
  • something to stop the process if enough number of iterations pass (early stopping),
  • something to stop the process if the values no longer get noticeably improved (convergence).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
from typing import Iterator
import itertools


def approximate_e() -> Iterator[float]:
    n = 1
    while True:
        yield (1.0 + 1.0 / n) ** n
        n += 1

def early_stop(
    values: Iterator[float], max_iter: int = 100
) -> Iterator[float]:
    return itertools.islice(values, max_iter)

def convergence(
    values: Iterator[float], epsilon: float = 0.01
) -> Iterator[float]:
    for a, b in itertools.pairwise(values):
        yield a

        if (a - b) < epsilon:
            break

This design uses iterators, which implement “lazy” loading. The data items are only returned one by one when requested (hence the keyword yield). Thanks to that, we (almost) don’t have to worry about the memory.

Furthermore, each of the three functions can exist separately. They have their specific interface and can be unit-tested in isolation.

Finally (and most importantly), the final result can be obtaied by chaining them together!

1
2
3
4
5
6
values = approximate_e()
values = early_stop(values, max_iter=50)
values = convergence(values, epsilon=0.001)

for value in values:
    print("e becomes:", value)

Older python

For python older than 3.10, the itertools.pairwise can be written:

1
2
3
4
5
6
7
def pairwise(values: Iterator[D]) -> Iterator[Tuple[D, D]]:
    a = next(values, None)
    if a is None:
        return
    for b in values:
        yield a, b
        a = b

Conclusions

Scientific programming comes with additional challenges. It is an exciting discipline, but the number of pitfalls seems to grow at least quadratically with problem complexity.

In this article, we have discussed two main components that seem to occur again and again in every scientific programming task: data and calculations. Hopefully, by abstracting them in the right way, you can not only increase the efficiency of your code and reduce bugs but also make coding a much better experience. For you and your team!