# How to write better scientific code in Python?

## Introduction

A large part of any scientific effort lies down in writing code. Be it typical machine-learning modeling, an analysis, or contributing to a data project, a significant part of the time goes into prototyping new functionalities. Being exploratory, it is expected that several parts of the program will be replaced or modified often beyond what was planned initially.

Unlike for “consumer software”, changes are often induced not by customer’s requirements, but rather by outcomes of the process along the way. For this reason, it is extremely valuable to design it in ways, which will not require a “total reconstruction”, if experimental evidence suggests a different path.

Writing science code comes with two (additional) specific challenges: The first one relates to mathematical errors. Mistakes in computations are often hard to trace, especially when the code is semantically correct. No bugs are found. No exception is raised. All looks good, but the (numerical) result is wrong. Especially, with implementing probabilistic models, the results may sometimes look good, depending on some initial conditions or a random factor.

The second comes from the fact described earlier. There will always be experimental parts, and they change all the time. Therefore, the key is to design each component, so that most of the work can stay and serve as a rock-solid foundation for the next stage of development.

In this article, we focus on patterns that can make the code more robust, understandable, and overall easier to handle. You will see how simple improvements can lead to more reusable code and fewer bugs.

### Toy task

Our task, for the sake of demonstration, is to calculate the **expected value**
of outcomes of a random process.
Mathematically, it comes down to this formula:

where is a probability mass function (PMC), or for a continuous variable:

where is a probability density function (PDF).

As you may wonder, the challenge here is that you want it to work with *any* distribution:
continuous or discrete.
Or if not plausible, you want at least to recognize the nature of the problem,
so you can adapt the code smoothly.

## Bad code - the starting point

Let’s start with a not-so-good code, to begin with. Say you want to roll six-sided dice. With each outcome equally probable, it comes down to calculating the mean value over sampled results.

1
2
3
4
5
6
7
8
9
10

import random
import statistics
def die(sides = 6):
return random.randint(1, sides)
def expected_value(n_samples):
samples = [die() for _ in range(n_samples)]
return statistics.mean(samples)

What’s wrong with this code? Several things…

First of all, the `die`

function returns one sample at a time.
It needs to be called `N`

times to get `N`

samples, which is slow.

Secondly, the `expected_value`

function strongly depends on the `die`

function
that produces samples.
It is evident, once you consider using a different die, say a 12-sided one.
With this design, you need to “open” the `expected_value`

to accept an additional
parameter `sides`

, only to pass it on to `die`

to extend it for the more general case.
While this would work, it makes the *interface* of `expected_value`

unintuitive,
but the solution still relies on using `die`

as the source of samples
thus making it hard to consider other distributions.

### Remedies…

Let’s consider the options you have:

#### Idea #1

You can use an external variable to store samples:

1
2
3
4
5
6

def expected_value(samples):
return statistics.mean(samples)
samples = [die(12) for _ in range(n_samples)]
ev = expected_value(samples)

This is pretty obvious, but you just move the problem elsewhere…
Now, `samples`

become a new entity, storage for data (even very large),
and it is quite anonymous.
The `expected_value`

function expects to receive it but to prepare it is on you.

#### Idea #2

Another way is to keep the `die`

inside, by passing it to `expected_value`

as an object.

1
2
3
4
5
6
7
8
9
10

from functools import partial
twelve_sided_die = partial(die, 12)
def expected_values(die, n_samples):
samples = [die() for _ in range(n_samples)]
return statistics.mean(samples)
ev = expected_values(twelve_sided_die, 10000)

The idea uses a prepared “version” of `die`

and makes `expected_value`

use it as the source
of samples.
However, a new problem arises:
The `expected_value`

is only compatible with `die`

.
It can’t compute the result with any other “sample generator”, or at least it is not
guaranteed to do it correctly.

#### Idea #3

The third idea is to recognize the problem at a more abstract level and design better interfaces.

At the abstract level, we have two concepts:

- There exist a
**probability distribution**from which we can sample. (It can be a die, a coin, normal distribution - doesn’t matter). - There is a
**mathematical operation**that consumes and transforms data. (E.g. calculating mean, variance, etc.).

Let’s give is more attention to how we can build the right abstractions and how to control their mutual compatibility.

## Distribution (Data)

Mathematically, a probability distribution can be function - continuous or discrete, finite or infinite, from which we can *draw samples.*
The “prescription” for this function can be very different depending on the problem.
We may use an “existing” formula such as Gaussian or Poisson distribution, but it can also be a “custom” creation derived from e.g. histogram.

With this in mind, let’s establish the following abstraction:

1
2
3
4
5
6
7

from abc import ABC, abstractmethod
class Distribution(ABC):
@abstractmethod
def sample(self):
...

### Implementation

Due to `@abstractmethod`

, our distribution enforces that we implement `sample`

method to any child class that derives from this abstraction.
For our die, this can be:

1
2
3
4
5
6
7
8
9

import numpy as np
class Die(Distribution):
def __init__(self, sides = 6):
self.sides = sides
def sample(self, n_samples):
return np.random.randint(1, self.sides + 1, size=n_samples)

Here, the delivery of samples happens by invoking the method `sample`

that is specific
to tossing a fair die: `Die(12).sample(10000)`

.
Moreover, thanks to `numpy`

, we can create large amounts of data very quickly by
replacing the list comprehension with `np.ndarray`

.

In fact, things can be improved even further.
Currently, calling `Die()`

returns something like this `<__main__.Die at 0x7f43f4448400>`

,
which is not imformative. Aslo `Die() == Die()`

evaluates to `False`

, as for python
they are two different object instances of the same class.
To fix it, we need to implement two more methods:

1
2
3
4
5
6
7

def __repr__(self):
return f"{self.__class__.__name__}(sides={self.sides})"
def __eq__(self, other):
if isinstance(other, self.__class__):
return self.sides == other.sides
return False

The `__repr__`

method makes the rendering of the object nice,
and `__eq__`

will only return `True`

if the dice are “equally sided”.

### Dataclasses

Implementing the four methods each time can be tedious.
Besides, the current implementation of `Die`

does not prevent altering of an object, even accidentally, by assigning the attribute to an existing object like this `die.sides = 20`

.

Therefore, we can redefine the `Die`

class using python’s `dataclasses`

.

1
2
3
4
5
6
7
8
9

from dataclasses import dataclass
@dataclass(frozen=True)
class Die(Distribution):
sides: int = 6
def sample(self, n):
return np.random.randint(1, self.sides + 1, size=n)

The behavior of this example is the same as the one before.
Additionally, setting `frozen=True`

, assigning a new value to `die.sides`

will raise an exception.
If we want a new die, we should create a new object.

Now, our `expected_value`

function will likely take `die`

as a distribution object
and do the math by calling its `sample`

method.

1
2

def expected_value(distribution, n):
return distribution.sample(n=n).mean()

### Typings

The above example is neat.
We know exactly what `expected_value`

does, and it is easy to test.
However, an n-sided die is not the only distribution we may want to calculate the expected
value of.
For example, for the tossing of a coin, the outcomes are not numeric (unless we establish a convention and stick to it).
Naturally, it makes sense to provide some hints as to what interfaces can be used
together and how.

For a *dynamically typed* language like python, you are not forced to stick to variables’ types.
However, with various IDEs and tools such as `mypy`

, typing can help you spot
potential points of failure and make the code more transparent.

Let’s redefine our classes using `typing`

and create two more specific distributions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38

from typing import Generic, Sequence, TypeVar
D = TypeVar("D")
class Distribution(ABC, Generic[D]):
@abstractmethod
def sample(self, n: int) -> Sequence[D]:
...
@dataclass(frozen=True)
class Die(Distribution[int]):
sides: int = 6
def sample(self, n: int) -> Sequence[int]:
return np.random.randint(1, self.sides + 1, size=n)
@dataclass(frozen=True)
class Coin(Distribution[str]):
outcomes: tuple = ("H", "T")
fairness: float = 0.5
def sample(self, n: int) -> Sequence[str]:
p = (self.fairness, 1.0 - self.fairness)
return np.random.choice(self.outcomes, size=n, p=p)
@dataclass(frozen=True)
class Gaussian(Distribution[float]):
mu: float = 0.0
sigma: float = 1.0
def sample(self, n: int) -> Sequence[float]:
np.random.normal(loc=self.mu, scale=self.sigma, size=n)

Several things that happen here.
Thanks to `D = TypeVar("D")`

, we can now define a new variable type, by which
we parametrize each distribution’s type.
You can notice that `Distribution`

inherits not only after the abstract base class
but also after `Generic[D]`

, which turns it also into a new (parametrized) type.
Now, it becomes a sort of identity, constituting a new data type.

Each version of `sample`

is expected to return a sequence of a particular type
that makes sense to each individual distribution’s context.
This way, we have a unified interface that is a also parametrized.
We can use this to ensure the correct behavior of `expected_value`

:

1
2

def expected_value(distribution: Distribution[float | int], n: int) -> float:
return distribution.sample(n=n).mean()

While passing e.g. `die = Die()`

or `gaussian = Gaussian()`

to `expected_value`

will work
(as both `int`

and `float`

are numeric), passing `coin = Coin()`

will be flagged out
by e.g. `mypy`

, stating that

```
> error: Argument 1 to "expected_value" has incompatible type "Coin";
> expected "Distribution[Union[float, int]]"
```

This can give us an early warning before we even run the code.

### Boosting up the purity

As you can see, designing interfaces using `typing`

helps to formalize the intensions
and catch bugs early.
You can even bring it to the next level by leveraging
`numpy`

’s `dtype`

.
This way, you will not only ensure that the different elements fit together, but also
be more conscious regarding the memory footprint of your data.

For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

import numpy.typing as npt
class Distribution(ABC, Generic[D]):
@abstractmethod
def sample(self, n: int) -> np.NDArray[np.generic]:
...
class Die(Distribution[int]):
sides: int = 6
def sample(self, n: int) -> npt.NDArray[np.uint8]:
return np.random.randint(
1, self.sides + 1, size=n, dtype=np.uint8
)

This way, you will even be informed if the `die.sample`

method returns numbers that
are different than strictly unsigned 8-bit integers.
The question is whether you want to go that deep?
It’s something to think about.

## Calculations

Let’s come back to designing the calculation part.
So far, we have `expected_value`

that is prepared to work with numerical distributions.
Naturally, we can compute the expected value for `Die`

and `Gaussian`

, but not for `Coin`

.
Not with the current design.

To fix this, have two choices:

- We can create a
*proxy*distribution by mapping e.g.`("H", "T") -> (0, 1)`

, or - We incorporate a mapping within
`expected_value`

, giving a possible “adapter”.

The first approach creates an artificial body, whose idea relies on convention.
It doesn’t prevent anyone from defining another proxy with `("H", "T") -> (1, 0)`

,
leading to a hard to detect the bug.

Instead, we can modify `expected_value`

and give it a possibility to use custom adapters.

1
2
3
4
5
6

def expected_value(
d: Distribution[D],
f: Callable[[D], Any] = lambda x: x,
n: int = 1000
) -> float:
return np.mean(np.apply_along_axis(f, axis=0, arr=d.sample(n)))

The second argument to `expected_value`

is a callable (a function), that we can optionally
use to translate outcomes of e.g. `Coin()`

distribution.
By default, however, it will leave the outcomes intact.

1
2
3
4
5
6
7
8
9

die = Die(12)
expected_values(die, n=10000)
gaussian = Gaussian(mu=4.0, sigma=2.0)
expected_value(gaussian, n=100000)
# but
coin = Coin(fairness=0.75)
expected_value(coin, f=lambda x: np.where(x == "H", 1.0, 0.0))

Here, we not only avoided creating a proxy distribution but also managed to avoid
tying `expected_value`

to any specific way of converting the data.
The `expected_value`

function does only what it promises to do: computes the expected value.
If any adaptation or conversion is required, it is provided externally.
Note that also here we have an option: we can define a named function (e.g. `coin_conversion`

)
in case we plan to reuse it or stick to `lambda`

when a separate definition does
not add value.

### Composition and iteration

Abstracting away mathematical calculations proves itself useful especially with
designing *iterative* algorithms.
Oftentimes, aside from the main computations, we have to keep an eye on some
collateral results such as convergence, early stopping (max iterations), metrics, and so on.

Let’s take the constant for example. Mathematically, we can get its value by taking the following limit:

The higher is, the more precise the approximation.

How can we solve it adequately?

#### Not the best way…

Let’s start with a poor man’s approach with a loop.

1
2
3
4
5
6
7
8
9
10
11
12
13

def approximate_e(
initial_guess: float,
max_iter: int = 10,
epsilon: float = 0.01
) -> float:
e_value = initial_guess
for n in range(1, max_iter + 1):
new_e_value = (1.0 + 1.0 / n) ** n
if abs(new_e_value - e_value) < epsilon:
return new_e_value
e_value = new_e_value
return new_e_value

Again, what’s wrong with this approach?

First of all, the function does three things instead of just one.
Line `8.`

is the absolute essence of the calculation.
Yet, with the early stopping and convergence conditions, we are left with a lot of code
overhead, which is tightly coupled with the actual calculations.
Although the two conditions look like something more generic, if we choose to replace
the subject of the approximation (to a square root for example), we will have to
copy-paste this additional code and make sure it doesn’t break the new algorithm.

Secondly, our only options regarding parametrizing the two conditions are to either
hard-code the values for `max_iter`

and `epsilon`

or allow users to provide them as
arguments.
It spoils the interface and makes testing more difficult.

Finally, the algorithm generates the data “eagerly”. Instead of focusing on the math and providing values “when asked”, it throws the data at you. For large amounts of data, this can cause memory issues.

### Abstraction over calculations

Now, let’s address all three problems at once by separating the responsibility between the different parts. We have three things:

- something that we expect to return us correct numbers (the actual calculations),
- something to stop the process if enough number of iterations pass (early stopping),
- something to stop the process if the values no longer get noticeably improved (convergence).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

from typing import Iterator
import itertools
def approximate_e() -> Iterator[float]:
n = 1
while True:
yield (1.0 + 1.0 / n) ** n
n += 1
def early_stop(
values: Iterator[float], max_iter: int = 100
) -> Iterator[float]:
return itertools.islice(values, max_iter)
def convergence(
values: Iterator[float], epsilon: float = 0.01
) -> Iterator[float]:
for a, b in itertools.pairwise(values):
yield a
if (a - b) < epsilon:
break

This design uses iterators, which implement “lazy” loading.
The data items are only returned one by one when requested (hence the keyword `yield`

).
Thanks to that, we (almost) don’t have to worry about the memory.

Furthermore, each of the three functions can exist separately. They have their specific interface and can be unit-tested in isolation.

Finally (and most importantly), the final result can be obtaied by *chaining* them together!

1
2
3
4
5
6

values = approximate_e()
values = early_stop(values, max_iter=50)
values = convergence(values, epsilon=0.001)
for value in values:
print("e becomes:", value)

#### Older python

For python older than 3.10, the `itertools.pairwise`

can be written:

1
2
3
4
5
6
7

def pairwise(values: Iterator[D]) -> Iterator[Tuple[D, D]]:
a = next(values, None)
if a is None:
return
for b in values:
yield a, b
a = b

## Conclusions

Scientific programming comes with additional challenges. It is an exciting discipline, but the number of pitfalls seems to grow at least quadratically with problem complexity.

In this article, we have discussed two main components that seem to occur again and again in every scientific programming task: data and calculations. Hopefully, by abstracting them in the right way, you can not only increase the efficiency of your code and reduce bugs but also make coding a much better experience. For you and your team!