
# Hidden Markov Model - Implemented from scratch

## Introduction

The Internet is full of good articles that explain the theory behind the Hidden Markov Model (HMM) well (e.g. 1, 2, 3 and 4). However, many of these works contain a fair amount of rather advanced mathematical equations. While equations are necessary if one wants to explain the theory, we decided to take it to the next level and create a gentle, step-by-step practical implementation to complement the good work of others.

In this short series of two articles, we will focus on translating all of the complicated mathematics into code. Our starting point is the document written by Mark Stamp. We will use this paper to define our code (this article) and then use a somewhat peculiar example of “Morning Insanity” to demonstrate its performance in practice.

### Notation

Before we begin, let’s revisit the notation we will be using. By the way, don’t worry if some of that is unclear to you. We will hold your hand.

• $T$ - length of the observation sequence.
• $N$ - number of latent (hidden) states.
• $M$ - number of observables.
• $Q = \{q_0, q_1, ..., q_{N-1}\}$ - hidden states.
• $V = \{0, 1, ..., M-1\}$ - set of possible observations.
• $\mathbf{A}$ - state transition matrix.
• $\mathbf{B}$ - emission probability matrix.
• $\vec{\pi}$ - initial state probability distribution.
• $\mathcal{O} = (\mathcal{O}_0, \mathcal{O}_1, ..., \mathcal{O}_{T-1})$ - observation sequence.
• $X = (x_0, x_1, ..., x_{T-1}), \ x_t \in Q$ - hidden state sequence.

Having that set defined, we can calculate the probability of any state and observation using the matrices:

• $\mathbf{A} = \{a_{ij}\}$ - being an $N \times N$ transition matrix.
• $\mathbf{B} = \{b_j(k)\}$ - being an $N \times M$ emission matrix.

The probabilities associated with transition and observation (emission) are:

• $a_{ij} = p( q_j^{(t + 1)} \mid q_i^{(t)} )$,
• $b_j(k) = p(\mathcal{O}_k^{(t)} \mid q_j^{(t)})$.

The model is therefore defined as a collection: $\lambda = (\mathbf{A}, \mathbf{B}, \vec{\pi})$.

## Fundamental definitions

Since HMM is based on probability vectors and matrices, let’s first define objects that will represent the fundamental concepts. To be useful, the objects must enforce certain properties. For example, all elements of a probability vector must be numbers $0 \le x \le 1$ and they must sum up to 1. Therefore, let’s design the objects in such a way that they inherently safeguard these mathematical properties.

The most natural way to initialize this object is to use a dictionary as it associates values with unique keys. Dictionaries, unfortunately, do not provide any assertion mechanisms that put any constraints on the values. Consequently, we build our custom ProbabilityVector object to ensure that our values behave correctly. Most importantly, we enforce the following:

• The number of values must equal the number of the keys (names of our states). Although this is not a problem when initializing the object from a dictionary, we will use other ways later.
• All names of the states must be unique (the same arguments apply).
• The probabilities must sum up to 1 (up to a certain tolerance).
• All probabilities must be $0 \le p \le 1$.

Having ensured that, we also provide two alternative ways to instantiate ProbabilityVector objects (decorated with @classmethod).

1. We instantiate the objects randomly - it will be useful when training.
2. We instantiate the objects from ready-made numpy arrays, using the values therein and only providing the names for the states.

For convenience and debugging, we provide two additional methods for requesting the values. Decorated with @property, they return the content of the PV object as a dictionary or a pandas dataframe.

The PV objects need to satisfy the following mathematical operations (for the purpose of constructing of HMM):

1. comparison (__eq__) - to know if any two PV’s are equal,
2. element-wise multiplication of two PV’s or multiplication with a scalar (__mul__ and __rmul__),
3. dot product (__matmul__) - to perform vector-matrix multiplication,
4. division by a number (__truediv__),
5. argmax to find for which state the probability is the highest,
6. __getitem__ to enable selecting a value by the key.

Note that when e.g. multiplying a PV with a scalar, the returned structure is a resulting numpy array, not another PV. This is because multiplying by anything other than 1 would violate the integrity of the PV itself.

Internally, the values are stored as a numpy array of size $(1 \times N)$.
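To make this concrete, here is a minimal sketch of how such a ProbabilityVector could look. This is a sketch under assumptions: the tolerance, the constructor signatures and the subset of methods shown are illustrative, not the article’s exact code.

```python
import numpy as np

class ProbabilityVector:
    """Sketch of a probability vector that safeguards its own constraints."""

    def __init__(self, probabilities: dict):
        states = list(probabilities.keys())
        probs = np.array(list(probabilities.values()), dtype=float)
        # The four constraints listed above (some are trivial for dict input,
        # but the alternative constructors below route through here too).
        assert len(states) == len(probs), "Counts of states and probabilities must match."
        assert len(states) == len(set(states)), "State names must be unique."
        assert abs(probs.sum() - 1.0) < 1e-12, "Probabilities must sum up to 1."
        assert ((probs >= 0) & (probs <= 1)).all(), "Probabilities must be 0 <= p <= 1."
        self.states = states
        self.values = probs.reshape(1, -1)  # stored as a (1 x N) numpy array

    @classmethod
    def initialize(cls, states: list):
        # Random instantiation -- useful when training.
        rand = np.random.rand(len(states))
        rand /= rand.sum()
        return cls(dict(zip(states, rand)))

    @classmethod
    def from_numpy(cls, array: np.ndarray, states: list):
        # Ready-made values; only the state names are supplied.
        return cls(dict(zip(states, array)))

    @property
    def dict(self):
        return {k: float(v) for k, v in zip(self.states, self.values.flatten())}

    def __eq__(self, other):
        return (self.states == other.states) and (self.values == other.values).all()

    def __getitem__(self, state: str) -> float:
        if state not in self.states:
            raise ValueError("Requesting unknown probability state from vector.")
        return float(self.values[0, self.states.index(state)])

    def __mul__(self, other):
        # Returns a plain numpy array, not another PV (see the note above).
        if isinstance(other, ProbabilityVector):
            return self.values * other.values
        return self.values * other

    __rmul__ = __mul__

    def argmax(self):
        return self.states[self.values.argmax()]
```

Initializing from a dictionary triggers all the checks; since the two @classmethod constructors route through the same `__init__`, they are guarded as well.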

### Probability Matrix

Another object is a Probability Matrix, which is a core part of the HMM definition. Formally, the $\mathbf{A}$ and $\mathbf{B}$ matrices must be row-stochastic, meaning that the values of every row must sum up to 1. We can, therefore, define our PM by stacking several PV’s, which we have constructed in a way to guarantee this constraint.

Here, the way we instantiate PM’s is by supplying a dictionary of PV’s to the constructor of the class. By doing this, we not only ensure that every row of PM is stochastic, but also supply the names for every observable.

Our PM can, therefore, give an array of coefficients for any observable.

Mathematically, the PM is an $(N \times M)$ matrix.

The other methods are implemented in a similar way to PV.
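For illustration, here is a matching sketch of the ProbabilityMatrix idea, built by stacking row-stochastic rows. Again, the names and internals are assumptions rather than the article’s exact code.

```python
import numpy as np

class ProbabilityMatrix:
    """Sketch of a row-stochastic matrix built from per-state rows."""

    def __init__(self, prob_vec_dict: dict):
        # Keys name the rows (hidden states); each value maps column names
        # (observables, or states for the A matrix) to probabilities.
        self.states = list(prob_vec_dict)
        self.observables = list(next(iter(prob_vec_dict.values())))
        rows = []
        for state in self.states:
            row = np.array([prob_vec_dict[state][o] for o in self.observables])
            assert abs(row.sum() - 1.0) < 1e-12, "Every row must be stochastic."
            rows.append(row)
        self.values = np.stack(rows)  # shape (N x M)

    @classmethod
    def initialize(cls, states: list, observables: list):
        # Random row-stochastic matrix -- useful when training.
        rand = np.random.rand(len(states), len(observables))
        rand /= rand.sum(axis=1, keepdims=True)
        return cls({s: dict(zip(observables, row)) for s, row in zip(states, rand)})

    def __getitem__(self, observable: str) -> np.ndarray:
        # An array of coefficients for any observable (one per hidden state).
        if observable not in self.observables:
            raise ValueError("Requesting unknown probability observable from matrix.")
        return self.values[:, self.observables.index(observable)].reshape(1, -1)
```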

## Implementing Hidden Markov Chain

Before we proceed with calculating the score, let’s use our PV and PM definitions to implement the Hidden Markov Chain.

Again, we will do so as a class, calling it HiddenMarkovChain. It will collate $\mathbf{A}$, $\mathbf{B}$ and $\vec{\pi}$. Later on, we will implement more methods that are applicable to this class.

### Computing score

Computing the score means to find what is the probability of a particular chain of observations $\mathcal{O}$ given our (known) model $\lambda = (\mathbf{A}, \mathbf{B}, \vec{\pi})$. In other words, we are interested in finding $p(\mathcal{O}\mid\lambda)$.

We can find $p(\mathcal{O}\mid\lambda)$ by marginalizing over all possible chains of the hidden variables $X$, where $X = \{x_0,x_1,...,x_{T-1}\}$:

$$p(\mathcal{O}\mid\lambda) = \sum_X p(\mathcal{O}\mid X, \lambda)\, p(X\mid\lambda).$$

Since $p(\mathcal{O}\mid X,\lambda) = \prod_{t=0}^{T-1} b_{x_t}(\mathcal{O}_t)$ (the product of all probabilities related to the observables) and $p(X\mid\lambda) = \pi_{x_0} \prod_{t=1}^{T-1} a_{x_{t-1},x_{t}}$ (the product of all probabilities of transitioning from $x_{t-1}$ to $x_{t}$), the probability we are looking for (the score) is:

$$p(\mathcal{O}\mid\lambda) = \sum_X \pi_{x_0}\, b_{x_0}(\mathcal{O}_0) \prod_{t=1}^{T-1} a_{x_{t-1},x_t}\, b_{x_t}(\mathcal{O}_t).$$

This is a naive way of computing the score, since we need to calculate the probability for every possible chain $X$. Either way, let’s implement it in Python:
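As a standalone sketch (plain numpy with index-coded states and observables, rather than the article’s PV/PM classes), the naive score could look like this:

```python
import numpy as np
from itertools import product

def score_naive(observations, pi, A, B):
    """Naive p(O | lambda): marginalize over every possible hidden chain X.

    pi: (N,) initial distribution, A: (N, N) transitions, B: (N, M) emissions;
    observations is a sequence of column indices into B. An illustrative
    function, not the article's exact class method."""
    N, T = A.shape[0], len(observations)
    total = 0.0
    for chain in product(range(N), repeat=T):      # every possible X
        p_x = pi[chain[0]]                          # p(X | lambda)
        for t in range(T - 1):
            p_x *= A[chain[t], chain[t + 1]]
        p_o_given_x = 1.0                           # p(O | X, lambda)
        for t in range(T):
            p_o_given_x *= B[chain[t], observations[t]]
        total += p_x * p_o_given_x
    return total
```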

#### Example

If our implementation is correct, then all score values for all possible observation chains, for a given model $\lambda$, should add up to one. Namely:

$$\sum_{\mathcal{O}} p(\mathcal{O}\mid\lambda) = 1.$$

Indeed.

### Score with forward-pass

Computing the score the way we did above is kind of naive. In order to find the number for a particular observation chain $\mathcal{O}$, we have to compute the score for all possible latent variable sequences $X$. That requires $2TN^T$ multiplications, which even for small numbers takes time.

Another way to do it is to calculate the probability of partial observation of the sequence up to time $t$.

For $i \in \{0, 1, ..., N-1\}$ and $t \in \{0, 1, ..., T-1\}$, we define:

$$\alpha_t(i) = p(\mathcal{O}_0, \mathcal{O}_1, ..., \mathcal{O}_t, x_t = q_i \mid \lambda).$$

Consequently,

$$\alpha_0(i) = \pi_i\, b_i(\mathcal{O}_0),$$

and, for $t = 1, ..., T-1$,

$$\alpha_t(i) = \left( \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{j,i} \right) b_i(\mathcal{O}_t).$$

Then

$$p(\mathcal{O}\mid\lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i).$$

Note that $\vec{\alpha}_t$ is a vector of length $N$. The sum of the product $\alpha_{t-1}(j)a_{j,i}$ can, in fact, be written as a dot product. Therefore:

$$\vec{\alpha}_t = \left( \vec{\alpha}_{t-1} \cdot \mathbf{A} \right) \star \vec{b}(\mathcal{O}_t),$$

where by $\star$ we denote an element-wise multiplication.

With this implementation, we reduce the number of multiplications to $N^2T$ and can take advantage of vectorization.
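A standalone sketch of the forward pass, in the same index-coded convention as before (the function name is illustrative):

```python
import numpy as np

def score_forward(observations, pi, A, B):
    """Forward-pass p(O | lambda) in O(N^2 T) multiplications (a sketch).

    pi: (N,) initial distribution, A: (N, N) transitions, B: (N, M) emissions;
    observations is a sequence of column indices into B."""
    alpha = pi * B[:, observations[0]]              # alpha_0 = pi * b(O_0)
    for t in range(1, len(observations)):
        # alpha_t = (alpha_{t-1} . A) * b(O_t), '*' being element-wise
        alpha = (alpha @ A) * B[:, observations[t]]
    return alpha.sum()                              # sum_i alpha_{T-1}(i)
```

For any model, this returns the same number as the naive marginalization, just far faster.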

…yup.

### Simulation and convergence

Let’s test one more thing. Basically, let’s take our $\lambda = (\mathbf{A}, \mathbf{B}, \vec{\pi})$ and use it to generate a sequence of random observables, starting from some initial state probability $\vec{\pi}$.

If the desired length $T$ is “large enough”, we would expect the system to converge on a sequence that, on average, gives the same number of events as we would expect from the $\mathbf{A}$ and $\mathbf{B}$ matrices directly. In other words, the transition and the emission matrices “decide”, with a certain probability, what the next state will be and what observation we will get at every step, respectively. Therefore, what may initially look like random events, on average should reflect the coefficients of the matrices themselves. Let’s check that as well.
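A sketch of what such a sampling routine might look like (a stand-in for a .run method; indices code the states and observables, and the function name is illustrative):

```python
import numpy as np

def run(T, pi, A, B, seed=None):
    """Generate T steps of (hidden state, observation) from lambda = (A, B, pi).

    Each step: emit an observation via the current state's row of B,
    then transition via the current state's row of A."""
    rng = np.random.default_rng(seed)
    states, observations = [], []
    state = rng.choice(len(pi), p=pi)               # draw x_0 from pi
    for _ in range(T):
        states.append(int(state))
        observations.append(int(rng.choice(B.shape[1], p=B[state])))
        state = rng.choice(A.shape[0], p=A[state])  # transition via A
    return states, observations
```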

### Latent states

The state matrix $\mathbf{A}$ is given by the following coefficients:

Consequently, the probability of “being” in the state $\textsf{1H}$ at $t + 1$, regardless of the previous state, is equal to:

$$p\big(q^{(t+1)} = \textsf{1H}\big) = \sum_{i} p\big(q^{(t)} = q_i\big)\, a_{i,\textsf{1H}}.$$

If we assume that the prior probabilities of being at some state at $t$ are totally random, then $p(\textsf{1H}) = 1.1$ and $p(\textsf{2C}) = 0.9$, which after renormalizing give 0.55 and 0.45, respectively.
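The renormalization can be checked numerically. For illustration we assume a hypothetical transition matrix whose $\textsf{1H}$ column sums to 1.1 and whose $\textsf{2C}$ column sums to 0.9, consistent with the numbers quoted above:

```python
import numpy as np

# Hypothetical transition matrix (rows: from 1H, 2C; columns: to 1H, 2C),
# chosen so that the column sums are 1.1 and 0.9 as in the text.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
prior = np.array([0.5, 0.5])              # totally random prior at time t
posterior = prior @ A                     # p(1H), p(2C) at time t + 1
posterior = posterior / posterior.sum()   # renormalize -> [0.55, 0.45]
```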

If we count the number of occurrences of each state and divide it by the number of elements in our sequence, we would get closer and closer to these numbers as the length of the sequence grows.

## Uncovering hidden variables

Let’s take our HiddenMarkovChain class to the next level and supplement it with more methods. The methods will help us to discover the most probable sequence of hidden variables behind the observation sequence.

### Expanding the class

We have defined $\vec{\alpha}$ to be the probability of partial observation of the sequence up to time $t$.

Now, let’s define the “opposite” probability: the probability of observing the remaining part of the sequence, from $t+1$ up to $T-1$, given the state at time $t$.

For $t = 0, 1, ..., T-1$ and $i = 0, 1, ..., N-1$, we define:

$$\beta_t(i) = p(\mathcal{O}_{t+1}, \mathcal{O}_{t+2}, ..., \mathcal{O}_{T-1} \mid x_t = q_i, \lambda).$$

As before, we can calculate $\beta_t(i)$ recursively. Starting from the end,

$$\beta_{T-1}(i) = 1.$$

Then for $t \ne T-1$:

$$\beta_t(i) = \sum_{j=0}^{N-1} a_{i,j}\, b_j(\mathcal{O}_{t+1})\, \beta_{t+1}(j),$$

which, in vectorized form, will be:

$$\vec{\beta}_t = \mathbf{A} \cdot \left( \vec{b}(\mathcal{O}_{t+1}) \star \vec{\beta}_{t+1} \right).$$

Finally, we also define a new quantity $\gamma_t(i)$: the probability of being in state $q_i$ at time $t$, given the full observation sequence (combining the forward and backward passes):

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{p(\mathcal{O}\mid\lambda)}.$$

Consequently, for any step $t = 0, 1, ..., T-1$, the state of the maximum likelihood can be found using:

$$x_t^* = \operatorname{argmax}_i\, \gamma_t(i).$$
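Combining both passes, the uncovering step can be sketched in plain numpy (a per-step argmax over $\gamma$ as described above, i.e. posterior decoding, not the Viterbi path; the function name is illustrative):

```python
import numpy as np

def uncover(observations, pi, A, B):
    """Most likely state at each t via gamma_t(i) ~ alpha_t(i) * beta_t(i).

    Returns index-coded states; observations are column indices into B."""
    N, T = A.shape[0], len(observations)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    # Forward pass.
    alpha[0] = pi * B[:, observations[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, observations[t]]
    # Backward pass, starting from beta_{T-1}(i) = 1.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    gamma = alpha * beta           # proportional to p(x_t = q_i | O, lambda)
    return gamma.argmax(axis=1)    # argmax_i gamma_t(i) for each t
```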

### Validation

To validate, let’s generate some observable sequence $\mathcal{O}$. For that, we can use our model’s .run method. Then, we will use the .uncover method to find the most likely latent variable sequence.

#### Example

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| observed sequence | 3L | 2M | 1S | 3L | 3L | 3L |
| latent sequence | 1H | 2C | 1H | 1H | 2C | 1H |
| uncovered sequence | 1H | 1H | 2C | 1H | 1H | 1H |

As we can see, the most likely latent state chain (according to the algorithm) is not the same as the one that actually caused the observations. This is to be expected. After all, each observation sequence can only be manifested with a certain probability, dependent on the latent sequence.

The code below evaluates the likelihood of different latent sequences resulting in our observation sequence.

| index | 0 | 1 | 2 | 3 | 4 | 5 | score |
|---|---|---|---|---|---|---|---|
| 8 | 1H | 1H | 2C | 1H | 1H | 1H | 3.1 |
| 24 | 1H | 2C | 2C | 1H | 1H | 1H | 2.9 |
| 40 | 2C | 1H | 2C | 1H | 1H | 1H | 2.7 |
| 12 | 1H | 1H | 2C | 2C | 1H | 1H | 2.7 |
| 10 | 1H | 1H | 2C | 1H | 2C | 1H | 2.7 |
| 9 | 1H | 1H | 2C | 1H | 1H | 2C | 2.7 |
| 25 | 1H | 2C | 2C | 1H | 1H | 2C | 2.5 |
| 0 | 1H | 1H | 1H | 1H | 1H | 1H | 2.5 |
| 26 | 1H | 2C | 2C | 1H | 2C | 1H | 2.5 |
| 28 | 1H | 2C | 2C | 2C | 1H | 1H | 2.5 |

The result above shows the sorted table of the latent sequences, given the observation sequence. The actual latent sequence (the one that caused the observations) places itself at the 35th position (counting the index from zero).

| index | 0 | 1 | 2 | 3 | 4 | 5 | score |
|---|---|---|---|---|---|---|---|
| 18 | 1H | 2C | 1H | 1H | 2C | 1H | 1.9 |

## Training the model

The time has come to show the training procedure. Formally, we are interested in finding $\lambda = (\mathbf{A}, \mathbf{B}, \vec{\pi})$ such that given a desired observation sequence $\mathcal{O}$, our model $\lambda$ would give the best fit.

### Expanding the class

Here, our starting point will be the HiddenMarkovModel_Uncover that we have defined earlier. We will add new methods to train it.

Knowing our latent states $Q$ and possible observation states $\mathcal{O}$, we automatically know the sizes of the matrices $\mathbf{A}$ and $\mathbf{B}$, hence $N$ and $M$. However, we need to determine $\{a_{i,j}\}, \{b_j(k)\}$ and $\vec{\pi}$.

For $t = 0, 1, ..., T - 2$ and $i, j \in \{0, 1, ..., N - 1\}$, we define “di-gammas”:

$$\gamma_t(i, j) = p(x_t = q_i, x_{t+1} = q_j \mid \mathcal{O}, \lambda).$$

$\gamma_t(i, j)$ is the probability of transitioning from state $q_i$ at time $t$ to state $q_j$ at time $t + 1$. Writing it in terms of $\alpha, \beta, \mathbf{A}, \mathbf{B}$, we have:

$$\gamma_t(i, j) = \frac{\alpha_t(i)\, a_{i,j}\, b_j(\mathcal{O}_{t+1})\, \beta_{t+1}(j)}{p(\mathcal{O}\mid\lambda)}.$$

Now, thinking in terms of implementation, we want to avoid looping over $i$, $j$ and $t$ at the same time, as that would be painfully slow. Fortunately, for every $t$ we can vectorize the equation (keeping in mind that $\alpha, a$ are indexed with $t$ while $\beta, b$ are indexed with $t + 1$).

Having the equation for $\gamma_t(i, j)$, we can calculate

$$\gamma_t(i) = \sum_{j=0}^{N-1} \gamma_t(i, j).$$

To find $\lambda = (\mathbf{A}, \mathbf{B}, \vec{\pi})$, we update:

• For $i = 0, 1, ..., N - 1$:

$$\pi_i = \gamma_0(i),$$

or, in vectorized form, $\vec{\pi} = \vec{\gamma}_0$.

• For $i, j = 0, 1, ..., N - 1$:

$$a_{i,j} = \frac{\sum_{t=0}^{T-2} \gamma_t(i, j)}{\sum_{t=0}^{T-2} \gamma_t(i)}.$$

• For $j = 0, 1, ..., N - 1$ and $k = 0, 1, ..., M - 1$:

$$b_j(k) = \frac{\sum_{t:\, \mathcal{O}_t = k} \gamma_t(j)}{\sum_{t=0}^{T-1} \gamma_t(j)}.$$

Having the “layer” supplemented with the ._digammas method, we should be able to perform all the necessary calculations. However, it makes sense to delegate the “management” of the layer to another class. In fact, the model training can be summarized as follows:

1. Initialize $\mathbf{A}, \mathbf{B}$ and $\vec{\pi}$.
2. Calculate $\gamma_t(i, j)$.
3. Update the model’s $\mathbf{A}, \mathbf{B}$ and $\vec{\pi}$.
4. Repeat steps 2 and 3 until the score $p(\mathcal{O}\mid\lambda)$ no longer increases.
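The four steps can be condensed into a compact, index-coded sketch (an illustrative function, not the article’s layer/model classes; the recursions match the $\alpha$, $\beta$ and di-gamma formulas above):

```python
import numpy as np

def baum_welch(observations, N, M, n_iter=30, seed=0):
    """Sketch of the training loop: init, di-gammas, update, repeat.

    observations: index-coded sequence; N hidden states, M observables.
    Returns the trained (pi, A, B)."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(observations)
    T = len(obs)
    # 1. Initialize A, B and pi as random, row-stochastic arrays.
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N); pi /= pi.sum()
    for _ in range(n_iter):
        # Forward and backward passes.
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        score = alpha[T - 1].sum()
        # 2. Di-gammas: digamma[t, i, j] = p(x_t = q_i, x_{t+1} = q_j | O).
        digamma = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            digamma[t] = (alpha[t][:, None] * A
                          * B[:, obs[t + 1]] * beta[t + 1]) / score
        gamma = np.vstack([digamma.sum(axis=2),
                           alpha[T - 1:] * beta[T - 1:] / score])
        # 3. Update pi, A and B.
        pi = gamma[0]
        A = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```

The updates preserve row-stochasticity by construction, which is a handy sanity check after training.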

### Verification

Let’s look at the generated sequences. The “demanded” sequence is:

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 3L | 2M | 1S | 3L | 3L | 3L |

The table below summarizes simulated runs based on 100000 attempts (see above), with the frequency of occurrence and number of matching observations.

The bottom line is that if we have truly trained the model, we should see a strong tendency for it to generate us sequences that resemble the one we require. Let’s see if it happens.

| | counts | 0 | 1 | 2 | 3 | 4 | 5 | matched |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.907 | 3L | 3L | 3L | 3L | 3L | 3L | 4 |
| 1 | 4.422 | 3L | 2M | 3L | 3L | 3L | 3L | 5 |
| 2 | 4.286 | 1S | 3L | 3L | 3L | 3L | 3L | 3 |
| 3 | 4.284 | 3L | 3L | 3L | 3L | 3L | 2M | 3 |
| 4 | 4.278 | 3L | 3L | 3L | 2M | 3L | 3L | 3 |
| 5 | 4.227 | 3L | 3L | 1S | 3L | 3L | 3L | 5 |
| 6 | 4.179 | 3L | 3L | 3L | 3L | 1S | 3L | 3 |
| 7 | 2.179 | 3L | 2M | 3L | 2M | 3L | 3L | 4 |
| 8 | 2.173 | 3L | 2M | 3L | 3L | 1S | 3L | 4 |
| 9 | 2.165 | 1S | 3L | 1S | 3L | 3L | 3L | 4 |
| 10 | 2.147 | 3L | 2M | 3L | 3L | 3L | 2M | 4 |
| 11 | 2.136 | 3L | 3L | 3L | 2M | 3L | 2M | 2 |
| 12 | 2.121 | 3L | 2M | 1S | 3L | 3L | 3L | 6 |
| 13 | 2.111 | 1S | 3L | 3L | 2M | 3L | 3L | 2 |
| 14 | 2.1 | 1S | 2M | 3L | 3L | 3L | 3L | 4 |
| 15 | 2.075 | 3L | 3L | 3L | 2M | 1S | 3L | 2 |
| 16 | 2.05 | 1S | 3L | 3L | 3L | 3L | 2M | 2 |
| 17 | 2.04 | 3L | 3L | 1S | 3L | 3L | 2M | 4 |
| 18 | 2.038 | 3L | 3L | 1S | 2M | 3L | 3L | 4 |
| 19 | 2.022 | 3L | 3L | 1S | 3L | 1S | 3L | 4 |
| 20 | 2.008 | 1S | 3L | 3L | 3L | 1S | 3L | 2 |
| 21 | 1.955 | 3L | 3L | 3L | 3L | 1S | 2M | 2 |
| 22 | 1.079 | 1S | 2M | 3L | 2M | 3L | 3L | 3 |
| 23 | 1.077 | 1S | 2M | 3L | 3L | 3L | 2M | 3 |
| 24 | 1.075 | 3L | 2M | 1S | 2M | 3L | 3L | 5 |
| 25 | 1.064 | 1S | 2M | 1S | 3L | 3L | 3L | 5 |
| 26 | 1.052 | 1S | 2M | 3L | 3L | 1S | 3L | 3 |
| 27 | 1.048 | 3L | 2M | 3L | 2M | 1S | 3L | 3 |
| 28 | 1.032 | 1S | 3L | 1S | 2M | 3L | 3L | 3 |
| 29 | 1.024 | 1S | 3L | 1S | 3L | 1S | 3L | 3 |

And here are the sequences that we don’t want the model to create.

| | counts | 0 | 1 | 2 | 3 | 4 | 5 | matched |
|---|---|---|---|---|---|---|---|---|
| 266 | 0.001 | 1S | 1S | 3L | 3L | 2M | 2M | 1 |
| 267 | 0.001 | 1S | 2M | 2M | 3L | 2M | 2M | 2 |
| 268 | 0.001 | 3L | 1S | 1S | 3L | 1S | 1S | 3 |
| 269 | 0.001 | 3L | 3L | 3L | 1S | 2M | 2M | 1 |
| 270 | 0.001 | 3L | 1S | 3L | 1S | 1S | 3L | 2 |
| 271 | 0.001 | 1S | 3L | 2M | 1S | 1S | 3L | 1 |
| 272 | 0.001 | 3L | 2M | 2M | 3L | 3L | 1S | 4 |
| 273 | 0.001 | 1S | 3L | 3L | 1S | 1S | 1S | 0 |
| 274 | 0.001 | 3L | 1S | 2M | 2M | 1S | 2M | 1 |
| 275 | 0.001 | 3L | 3L | 2M | 1S | 3L | 2M | 2 |

As we can see, there is a tendency for our model to generate sequences that resemble the one we require, although the exact one (the one that matches 6/6) places itself only at the 10th position! On the other hand, according to the table, the top 10 sequences are still the ones that are somewhat similar to the one we request.

To ultimately verify the quality of our model, let’s plot the outcomes together with the frequency of occurrence and compare it against a freshly initialized model, which is supposed to give us completely random sequences.

It seems we have successfully implemented the training procedure. If we look at the curves, the initialized-only model generates observation sequences with almost equal probability; it is completely random. However, the trained model gives sequences that are highly similar to the one we desire, with much higher frequency. Although the genuine sequence gets created in only 2% of total runs, the other similar sequences get generated approximately as often.

## Conclusion

In this article, we have presented a step-by-step implementation of the Hidden Markov Model. We have created the code following a first-principles approach. More specifically, we have shown how the probabilistic concepts that are expressed through equations can be implemented as objects and methods. Finally, we demonstrated the usage of the model by finding the score, uncovering the latent variable chain and applying the training procedure.

### Do you want more?

• Check out my “Python Data Science Algorithm Gallery” project on GitHub.
• The code for this project is also stored as “Markeasy” library.
• Take a look at the “Morning Insanity” article, where we are putting this code into practice for higher good ;).