How to write better scientific code in Python?
Introduction
A large part of any scientific effort lies in writing code. Be it typical machine-learning modeling, an analysis, or contributing to a data project, a significant part of the time goes into prototyping new functionalities. Because the work is exploratory, it is expected that several parts of the program will be replaced or modified, often beyond what was planned initially.
Unlike for “consumer software”, changes are often induced not by customers’ requirements, but rather by the outcomes of the process along the way. For this reason, it is extremely valuable to design the code in a way that will not require a “total reconstruction” if experimental evidence suggests a different path.
Writing scientific code comes with two additional, specific challenges. The first relates to mathematical errors. Mistakes in computations are often hard to trace, especially when the code is semantically correct. No bugs are found. No exception is raised. Everything looks good, but the (numerical) result is wrong. Especially when implementing probabilistic models, the results may sometimes look good merely by chance, depending on initial conditions or a random factor.
The second challenge follows from the exploratory nature described earlier: there will always be experimental parts, and they change all the time. Therefore, the key is to design each component so that most of the work can stay and serve as a rock-solid foundation for the next stage of development.
In this article, we focus on patterns that can make the code more robust, understandable, and overall easier to handle. You will see how simple improvements can lead to more reusable code and fewer bugs.
Toy task
Our task, for the sake of demonstration, is to calculate the expected value of the outcomes of a random process. Mathematically, it comes down to this formula:

$$E[X] = \sum_{x} x \, p(x)$$

where $p(x)$ is a probability mass function (PMF), or, for a continuous variable:

$$E[X] = \int x \, f(x) \, dx$$

where $f(x)$ is a probability density function (PDF).
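Since the code below estimates this quantity by sampling and averaging, the working formula is the usual sample mean:

$$E[X] \approx \frac{1}{N} \sum_{i=1}^{N} x_i$$

where $x_1, \dots, x_N$ are samples drawn from the distribution.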
As you may suspect, the challenge here is that you want this to work with any distribution: continuous or discrete. And if that is not feasible, you at least want the code to recognize the nature of the problem, so you can adapt it smoothly.
Bad code - the starting point
Let’s start with some not-so-good code. Say you want to roll a six-sided die. With each outcome equally probable, the task comes down to calculating the mean value over the sampled results.
import random
import statistics


def die(sides=6):
    return random.randint(1, sides)


def expected_value(n_samples):
    samples = [die() for _ in range(n_samples)]
    return statistics.mean(samples)
What’s wrong with this code? Several things…
First of all, the die function returns one sample at a time. It needs to be called N times to get N samples, which is slow.

Secondly, the expected_value function strongly depends on the die function that produces the samples. This becomes evident once you consider using a different die, say a 12-sided one. With this design, you need to “open up” expected_value and add a sides parameter, only to pass it on to die and extend it to the more general case. While this would work, it makes the interface of expected_value unintuitive, and the solution still relies on die as the sole source of samples, thus making it hard to consider other distributions.
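To make the coupling concrete, here is a minimal sketch of what that “opening up” would look like (the extra sides parameter is purely illustrative):

def expected_value(n_samples, sides=6):
    # the extra parameter exists only to be forwarded to die()
    samples = [die(sides) for _ in range(n_samples)]
    return statistics.mean(samples)

Every new detail of the sampling process would have to be threaded through expected_value in the same way.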
Remedies…
Let’s consider the options you have:
Idea #1
You can use an external variable to store samples:
def expected_value(samples):
    return statistics.mean(samples)


n_samples = 10_000
samples = [die(12) for _ in range(n_samples)]
ev = expected_value(samples)
This is pretty obvious, but it just moves the problem elsewhere… Now, samples becomes a new entity: a (potentially very large) piece of data storage, and a rather anonymous one. The expected_value function expects to receive it, but preparing it correctly is entirely on you.
Idea #2
Another way is to keep the die “inside”, by passing it to expected_value as an object.
from functools import partial

twelve_sided_die = partial(die, 12)


def expected_value(die, n_samples):
    samples = [die() for _ in range(n_samples)]
    return statistics.mean(samples)


ev = expected_value(twelve_sided_die, 10000)
The idea uses a prepared “version” of die and makes expected_value use it as the source of samples. However, a new problem arises: the expected_value function is only compatible with die. It can’t compute the result with any other “sample generator”, or at least it is not guaranteed to do it correctly.
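For instance, a hypothetical coin “sampler” returning strings would slip through this interface and only fail at computation time:

def coin():
    # hypothetical sampler with non-numeric outcomes
    return random.choice(["H", "T"])

# expected_value(coin, 10000)  # fails: statistics.mean() cannot average strings (TypeError)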
Idea #3
The third idea is to recognize the problem at a more abstract level and design better interfaces.
At the abstract level, we have two concepts:
- There exists a probability distribution from which we can sample. (It can be a die, a coin, a normal distribution - it doesn’t matter.)
- There is a mathematical operation that consumes and transforms data. (E.g. calculating the mean, the variance, etc.)
Let’s give more attention to how we can build the right abstractions and how to control their mutual compatibility.
Distribution (Data)
Mathematically, a probability distribution is a function - continuous or discrete, finite or infinite - from which we can draw samples. The “prescription” for this function can differ a lot depending on the problem. We may use an “existing” formula such as the Gaussian or Poisson distribution, but it can also be a “custom” creation derived from, e.g., a histogram.
With this in mind, let’s establish the following abstraction:
from abc import ABC, abstractmethod


class Distribution(ABC):
    @abstractmethod
    def sample(self):
        ...
Implementation
Thanks to @abstractmethod, any child class that derives from this abstraction is forced to implement the sample method.
For our die, this can be:
import numpy as np


class Die(Distribution):
    def __init__(self, sides=6):
        self.sides = sides

    def sample(self, n_samples):
        return np.random.randint(1, self.sides + 1, size=n_samples)
Here, the delivery of samples happens by invoking the method sample that is specific to tossing a fair die: Die(12).sample(10000). Moreover, thanks to numpy, we can create large amounts of data very quickly by replacing the list comprehension with np.ndarray.
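As a rough illustration (just a sketch; the absolute numbers will vary by machine), you can compare the two approaches with timeit:

import timeit

n = 1_000_000
t_list = timeit.timeit(lambda: [die() for _ in range(n)], number=1)
t_numpy = timeit.timeit(lambda: Die().sample(n), number=1)
print(f"list comprehension: {t_list:.3f} s, numpy: {t_numpy:.3f} s")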
In fact, things can be improved even further. Currently, calling Die() returns something like <__main__.Die at 0x7f43f4448400>, which is not informative. Also, Die() == Die() evaluates to False, because for Python they are two different object instances of the same class.
To fix it, we need to implement two more methods:
    # added inside the Die class
    def __repr__(self):
        return f"{self.__class__.__name__}(sides={self.sides})"

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.sides == other.sides
        return False
The __repr__ method makes the rendering of the object nice, and __eq__ will only return True if the dice are “equally sided”.
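With these two methods in place, a quick check behaves as expected:

die = Die(12)
print(die)             # Die(sides=12)
print(die == Die(12))  # True
print(die == Die(6))   # False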
Dataclasses
Implementing these methods by hand each time can be tedious. Besides, the current implementation of Die does not prevent the object from being altered, even accidentally, e.g. by assigning a new value to an attribute: die.sides = 20. Therefore, we can redefine the Die class using Python’s dataclasses.
from dataclasses import dataclass


@dataclass(frozen=True)
class Die(Distribution):
    sides: int = 6

    def sample(self, n):
        return np.random.randint(1, self.sides + 1, size=n)
The behavior of this example is the same as before. Additionally, with frozen=True, assigning a new value to die.sides will raise an exception. If we want a new die, we should simply create a new object.
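For illustration, an attempt to mutate a frozen die now fails loudly (a minimal sketch):

import dataclasses

die = Die(12)
try:
    die.sides = 20
except dataclasses.FrozenInstanceError as err:
    print("cannot modify a frozen Die:", err)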
Our expected_value function can now take die as a distribution object and do the math by calling its sample method.
def expected_value(distribution, n):
    return distribution.sample(n=n).mean()
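Used like this (for a fair 12-sided die, the result hovers around the analytical value of 6.5):

die = Die(12)
print(expected_value(die, n=10_000))  # ~6.5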
Typings
The above example is neat. We know exactly what expected_value does, and it is easy to test. However, an n-sided die is not the only distribution whose expected value we may want to calculate. For example, when tossing a coin, the outcomes are not numeric (unless we establish a convention and stick to it). Naturally, it makes sense to provide some hints as to which interfaces can be used together, and how. In a dynamically typed language like Python, you are not forced to stick to variables’ types. However, with various IDEs and tools such as mypy, typing can help you spot potential points of failure and make the code more transparent.

Let’s redefine our classes using typing and create two more specific distributions.
from typing import Generic, Sequence, TypeVar

D = TypeVar("D")


class Distribution(ABC, Generic[D]):
    @abstractmethod
    def sample(self, n: int) -> Sequence[D]:
        ...


@dataclass(frozen=True)
class Die(Distribution[int]):
    sides: int = 6

    def sample(self, n: int) -> Sequence[int]:
        return np.random.randint(1, self.sides + 1, size=n)


@dataclass(frozen=True)
class Coin(Distribution[str]):
    outcomes: tuple = ("H", "T")
    fairness: float = 0.5

    def sample(self, n: int) -> Sequence[str]:
        p = (self.fairness, 1.0 - self.fairness)
        return np.random.choice(self.outcomes, size=n, p=p)


@dataclass(frozen=True)
class Gaussian(Distribution[float]):
    mu: float = 0.0
    sigma: float = 1.0

    def sample(self, n: int) -> Sequence[float]:
        return np.random.normal(loc=self.mu, scale=self.sigma, size=n)
Several things happen here. Thanks to D = TypeVar("D"), we can define a new type variable, by which we parametrize each distribution’s type. Notice that Distribution now inherits not only from the abstract base class but also from Generic[D], which turns it into a new (parametrized) type - a sort of identity constituting a new data type. Each version of sample is expected to return a sequence of the particular type that makes sense in that distribution’s context. This way, we have a unified interface that is also parametrized.

We can use this to ensure the correct behavior of expected_value:
def expected_value(distribution: Distribution[float | int], n: int) -> float:
    return distribution.sample(n=n).mean()
While passing e.g. die = Die() or gaussian = Gaussian() to expected_value will work (as both int and float are numeric), passing coin = Coin() will be flagged by e.g. mypy, stating that:

> error: Argument 1 to "expected_value" has incompatible type "Coin";
> expected "Distribution[Union[float, int]]"

This can give us an early warning before we even run the code.
Boosting up the purity
As you can see, designing interfaces using typing helps to formalize the intentions and catch bugs early. You can even take it to the next level by leveraging numpy’s dtype. This way, you will not only ensure that the different elements fit together, but also be more conscious of the memory footprint of your data. For example:
import numpy.typing as npt


class Distribution(ABC, Generic[D]):
    @abstractmethod
    def sample(self, n: int) -> npt.NDArray[np.generic]:
        ...


@dataclass(frozen=True)
class Die(Distribution[int]):
    sides: int = 6

    def sample(self, n: int) -> npt.NDArray[np.uint8]:
        return np.random.randint(
            1, self.sides + 1, size=n, dtype=np.uint8
        )
This way, you will even be informed if the die.sample method returns numbers other than strictly unsigned 8-bit integers. The question is whether you want to go that deep. It’s something to think about.
Calculations
Let’s come back to designing the calculation part.
So far, we have expected_value that is prepared to work with numerical distributions. Naturally, we can compute the expected value for Die and Gaussian, but not for Coin. Not with the current design.
To fix this, we have two choices:
- We can create a proxy distribution by mapping e.g. ("H", "T") -> (0, 1), or
- We can incorporate a mapping within expected_value, providing a possible “adapter”.
The first approach creates an artificial entity whose meaning relies purely on convention. It doesn’t prevent anyone from defining another proxy with ("H", "T") -> (1, 0), leading to a hard-to-detect bug. Instead, we can modify expected_value and give it the possibility to use custom adapters.
from typing import Any, Callable


def expected_value(
    d: Distribution[D],
    f: Callable[[D], Any] = lambda x: x,
    n: int = 1000
) -> float:
    return np.mean(np.apply_along_axis(f, axis=0, arr=d.sample(n)))
The second argument to expected_value is a callable (a function) that we can optionally use to translate the outcomes of e.g. the Coin() distribution. By default, however, it will leave the outcomes intact.
die = Die(12)
expected_value(die, n=10000)

gaussian = Gaussian(mu=4.0, sigma=2.0)
expected_value(gaussian, n=100000)

# but
coin = Coin(fairness=0.75)
expected_value(coin, f=lambda x: np.where(x == "H", 1.0, 0.0))
Here, we not only avoided creating a proxy distribution, but also managed to avoid tying expected_value to any specific way of converting the data. The expected_value function does only what it promises to do: it computes the expected value. If any adaptation or conversion is required, it is provided externally. Note that here, too, we have an option: we can define a named function (e.g. coin_conversion) in case we plan to reuse it, or stick to a lambda when a separate definition does not add value, as sketched below.
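A minimal sketch of the named-function variant (the name coin_conversion and the "H" -> 1.0 convention are only illustrative):

def coin_conversion(outcomes):
    # vectorized mapping mirroring the lambda above: "H" -> 1.0, "T" -> 0.0
    return np.where(outcomes == "H", 1.0, 0.0)


expected_value(coin, f=coin_conversion)  # ~0.75 for fairness=0.75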
Composition and iteration
Abstracting away mathematical calculations proves especially useful when designing iterative algorithms. Oftentimes, aside from the main computations, we have to keep an eye on some collateral results such as convergence, early stopping (max iterations), metrics, and so on.
Let’s take the constant $e$, for example. Mathematically, we can get its value by taking the following limit:

$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$$

The higher $n$ is, the more precise the approximation.
How can we solve it adequately?
Not the best way…
Let’s start with a poor man’s approach with a loop.
def approximate_e(
    initial_guess: float,
    max_iter: int = 10,
    epsilon: float = 0.01
) -> float:
    e_value = initial_guess
    for n in range(1, max_iter + 1):
        new_e_value = (1.0 + 1.0 / n) ** n
        if abs(new_e_value - e_value) < epsilon:
            return new_e_value
        e_value = new_e_value
    return new_e_value
Again, what’s wrong with this approach?

First of all, the function does three things instead of just one. The line new_e_value = (1.0 + 1.0 / n) ** n is the absolute essence of the calculation. Yet, with the early stopping and convergence conditions, we are left with a lot of code overhead that is tightly coupled with the actual calculation. Although the two conditions look like something more generic, if we choose to replace the subject of the approximation (with a square root, for example), we will have to copy-paste this additional code and make sure it doesn’t break the new algorithm.
Secondly, our only options regarding parametrizing the two conditions are to either hard-code the values for max_iter and epsilon or allow users to provide them as arguments. It spoils the interface and makes testing more difficult.
Finally, the algorithm generates the data “eagerly”. Instead of focusing on the math and providing values “when asked”, it throws the data at you. For large amounts of data, this can cause memory issues.
Abstraction over calculations
Now, let’s address all three problems at once by separating the responsibilities between the different parts. We have three things:
- something that we expect to return correct numbers (the actual calculations),
- something to stop the process once a sufficient number of iterations have passed (early stopping),
- something to stop the process once the values no longer improve noticeably (convergence).
from typing import Iterator
import itertools


def approximate_e() -> Iterator[float]:
    n = 1
    while True:
        yield (1.0 + 1.0 / n) ** n
        n += 1


def early_stop(
    values: Iterator[float], max_iter: int = 100
) -> Iterator[float]:
    return itertools.islice(values, max_iter)


def convergence(
    values: Iterator[float], epsilon: float = 0.01
) -> Iterator[float]:
    for a, b in itertools.pairwise(values):
        yield a
        # compare the distance between consecutive values
        if abs(a - b) < epsilon:
            break
This design uses iterators, which implement “lazy” loading. The data items are returned one by one, only when requested (hence the yield keyword). Thanks to that, we (almost) don’t have to worry about memory.

Furthermore, each of the three functions can exist separately. They have their specific interfaces and can be unit-tested in isolation.

Finally (and most importantly), the final result can be obtained by chaining them together!
values = approximate_e()
values = early_stop(values, max_iter=50)
values = convergence(values, epsilon=0.001)

for value in values:
    print("e becomes:", value)
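If only the final approximation matters, a small sketch reusing the same chain can drain the iterator and keep just the last element:

import collections

values = convergence(early_stop(approximate_e(), max_iter=50), epsilon=0.001)

# a deque with maxlen=1 consumes the iterator, keeping only the most recent item
final = collections.deque(values, maxlen=1).pop()
print("final approximation of e:", final)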
Older python
For Python older than 3.10, itertools.pairwise can be replaced with the following helper (then use pairwise(values) inside convergence):
from typing import Iterator, Tuple


def pairwise(values: Iterator[D]) -> Iterator[Tuple[D, D]]:
    a = next(values, None)
    if a is None:
        return
    for b in values:
        yield a, b
        a = b
Conclusions
Scientific programming comes with additional challenges. It is an exciting discipline, but the number of pitfalls seems to grow at least quadratically with problem complexity.
In this article, we have discussed two main components that seem to occur again and again in every scientific programming task: data and calculations. Hopefully, by abstracting them in the right way, you can not only increase the efficiency of your code and reduce bugs but also make coding a much better experience. For you and your team!