Polynomial Regression - which python package to use?

Jul 18, 2020

Introduction

Polynomial regression is one of the most fundamental concepts used in data analysis and prediction. Not only can any (infinitely differentiable) function be expressed as a polynomial through Taylor series at least within a certain interval, it is also one of the first problems that a beginner in machine-learning is confronted with. It is used across various disciplines such as financial analysis, signal processing, medical statistics, and more.

While polynomial regression is a relatively simple concept, it became a sort of “hello world” problem in machine-learning as it touches upon many core concepts. Still, some parts may sound confusing, especially since:

It introduces several things at once, although only some are specific to polynomial regression.
It is often implemented in many different ways.

In this blog post, we are going to construct it from the beginning using both equations and express them through the code. Then, we will compare our implementation against four python libraries that are the most widely used in data science:

and discuss differences between them, while pointing to similarities at the fundamental level.

The problem

Let’s discuss the mathematical foundations first.

Mathematically, the problem of regression is an attempt to model a relationship between an independent variable $x$ and a dependent variable $y$ . Assuming $y$ ’s dependent on $x$ is expressed in the following form:

$y = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 + \varepsilon,$

we speak of polynomial regression (with $\varepsilon$ denoting a noise term). Naturally, if the maximum $n = 1$ , the problem becomes linear regression.

To solve the problem means to determine the values of all $a_n$ ’s to represent the data well. It boils down to the elements of the following steps:

A “guess” regarding the form of $y$ . It is called a hypothesis function $h_\theta(x)$ .
A way of measuring how “bad” is the hypothesis given the true data values. This “badness” is quantified using a so-called loss function $J$ .
An idea of how we could potentially “tune” the hypothesis such that we are better at approximating the data points. In other words, we need to have some way of minimizing $J$ .

Choosing the hypothesis

When speaking of polynomial regression, the very first thing we need to assume is the degree of the polynomial we will use as the hypothesis function. If we choose $n$ to be the degree, the hypothesis will take the following form:

$h_\theta(x) = \theta_n x^n + \theta_{n - 1}x^{n - 1} + \cdots + \theta_0 = \sum_{j = 0}^n \theta_j x^j.$

The cost function and mean square error

Next, we choose the metric by which we express the cost function. For regression problems, it makes sense to choose the so-called mean square error (MSE), which measures the average distance between the individual points and our predictions brought to the second power,

$J(\vec{\theta}) = \frac{1}{2m}\sum_{i = 1}^m \left(h_\theta(x_i) - y_i\right)^2.$

Here $m$ relates to the number of data points and $1/2$ factor is introduced for convenience (see later). We also denote all parameters as a vector of parameters $\vec{\theta} = [\theta_0, \theta_1, \cdots, \theta_n]$ .

The MSE metrics is only a choice, but it is reasonable due to the following reasons:

MSE is agnostic concerning the direction of the error (e.g. unlike mean error that is proportional to just $h - y$ in the first power).
MSE heavily punishes large differences (outliers) due to the second power (unlike e.g. mean absolute error that is proportional to $\text{abs}(h - y)$ ).
For MSE it is easier to obtain the gradient (e.g. unlike root mean square error that is $\sqrt{\text{MSE}}$ ).

Gradient descent

Once we have established the cost function and set some random initial $\vec{\theta}$ , we want to minimize $J$ . One of the possible options is the gradient descent optimization algorithm. As the name suggests, it requires us to calculate $\nabla_\theta J$ and subsequently update all $\theta$ ’s accordingly.

In other words, we need to compute the derivatives of $J$ concerning every $\theta_j$ , and use them to change the values of $\theta$ :

$\theta_j \leftarrow \theta_j - \alpha \frac{dJ}{d\theta_j}$

thus making the hypothesis better and better at representing the data.

Fortunately, $dJ/d\theta_j$ is easy to calcuate:

$\frac{d}{d\theta_j} J(\vec{\theta}) = \frac{1}{2m}\sum_{i = 1}^m \frac{d}{d\theta_j} \left(h_\theta (x_i) - y_i\right)^2 = \frac{1}{m}\sum_{i = 1}^m (h_\theta(x_i) - y_i) \cdot \frac{d}{d\theta_j}h_\theta(x_i).$

The last term is the derivative of the hypothesis function with respect to $\theta_j$ . Because we assumed $h_\theta$ to be a polynomial:

$\frac{d}{d\theta_j}h_\theta{(x_i)} = \sum_{j = 0}^n \frac{d}{d\theta_j} \theta_j x_i^j = x_i^j,$

and the whole gradient becomes:

$\nabla_\theta J = \frac{1}{m} \sum_{i = 1}^m (h_\theta(x_i) - y_i) \cdot x_i^j.$

If we organize all $x_i$ into a vector of examples, we can express the formula above using vector-matrix multiplication:

$\begin{bmatrix} \frac{dJ}{d\theta_0} \\ \frac{dJ}{d\theta_1} \\ \vdots \\ \frac{dJ}{d\theta_n} \end{bmatrix} = \frac{1}{m} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_m \\ \vdots & \vdots & \ddots & \vdots \\ x_1^n & x_2^n & \cdots & x_m^n \end{bmatrix} \cdot \left( \begin{bmatrix} h_\theta(x_1) \\ h_\theta(x_2) \\ \vdots \\ h_\theta(x_n) \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \right),$

or in a more compact form

$\nabla_\theta J = \frac{1}{m} \mathbf{\tilde{X}} \cdot (h_\theta(\mathbf{x}) - \mathbf{y}).$

Here, $\mathbf{\tilde{X}}$ expresses a matrix composed of example vectors raised to the consecutive powers $j = 0, 1, \cdots, n$ .

Basic implementation (naive)

We call this implementation naive because it is in no way optimized for performance. Instead, the focus was to emphasize the correspondence between the code and the underlying equations.

To start, we will use our custom implementation of a symbolic polynomial (see gist). The Polynomial class defines a callable object based on the polynomial expression (the first equation). It implements many algebraic methods similar to our quaternion example), but for now the most important part is the initialization.

We begin with creating of a Polynomial object with random coefficients $\theta$ .

1
2
3
4
5
6
7
import numpy as np
from nppoly import Polynomial

np.random.seed(42)

def hypothesis(degree):
    return Polynomial(*np.random.rand(degree + 1))

This method accepts $n$ and returns a polynomial with random coefficients. Next, we define our MAE cost function:

1
2
def cost(h, X, y):
    return 0.5 / len(y) * ((h(X) - y) ** 2).sum()

Here len(y) is equivalent to $m$ and h, X and y are to be a hypothesis, then $\mathbf{x}$ and $\mathbf{y}$ – the vectors of arguments and values. Note that capitalizing X is more of a tradition to emphasize its matrix-like character. In our case, however, X is associated with one-dimensional data only.

Next, we need to express the gradient of $J$ . Thanks to the derivation, we already know the expression:

1
2
3
4
5
6
7
8
def grad_theta(h, X, y):
    diff = (h(X) - y).reshape(-1, 1)

    X = X.reashape(1, -1)
    X = list(map(lambda i: x ** i, reversed(range(h.shape))))
    X = np.concatenate(X)

    return 1 / len(y) * (X @ (diff)).reshape(1, -1)

In this function, we need to perform a couple of reshaping to ensure we can multiply the matrix by a vector. Lines 4-6 are responsible for constructing $\mathbf{\tilde{X}}$ , and reversed function is used to comply with the convention that $\theta_n$ stands before $\theta_{n-1}$ .

Finally, the optimization routine is done with the following function:

1
2
3
4
5
6
def optimize(h, X, y, epochs=5000, lr=0.01):
    theta = h._coeffs.reshape(1, -1)
    for epoch in range(epochs):
        theta -= lr * grad_theta(h, X, y)
        h = Polynomial(*theta.flatten())
    return h

Here, line 2 is just an initialization of the hypothesis. Then, inside the for loop, we perform $\vec{\theta} \leftarrow \vec{\theta} - \alpha \nabla_\theta J(\vec{\theta})$ , where $\alpha$ is the so-called learning rate lr and the repetition cycles are traditionally called “epochs”.

The optimization can be verified by executing a small script:

1
2
3
4
5
6
7
8
9
10
# fake data
X = np.linspace(0, 10, num=5)
y = 4 * X - 2 + np.random.randn(len(X))

h = hypothesis(1)
h = optimize(h, X, y)

# prediction
X_test = np.linspace(-5, 25, num=31)
y_pred = h(X_test)

Python libraries

Numpy

The first library that implements polynomial regression is numpy. It does so using numpy.polyfit function, which given the data (X and y) as well as the degree $n$ performs the procedure and returns an array of the coefficients $\theta$ .

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
from numpy import polyfit


# fake data
X = np.linspace(0, 10, num=5)
y = 4 * X - 2 + np.random.randn(X)

u = polyfit(X, y, deg=1)

# prediction
X_test = np.linspace(-5, 25, num=31)
y_pred = u[0] * X_test + u[1]

The function offers additional diagnostics if full is set to True, giving us information related to uncertainties.

Scipy

Another “recipe” for solving the polynomial regression problem is curve_fit included in scipy. This function is more generic comparing to polyfit as it does not require our “model” to assume a polynomial form. Its interface is also different.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import numpy as np
from scipy.optimize import curve_fit


def linear_function(x, a, b):
    return a * x + b

def quadratic_function(x, a, b, c):
    return a * x**2 + b * x + c


# fake data
X = np.linspace(0, 10, num=5)
y1 = 4 * X - 2 + np.random.randn(len(X))
y2 = 5 * X ** 2 - 3 * X + 10 + np.random.randn(len(X))

u = curve_fit(linear_function, X, y1)[0]
v = curve_fit(quadratic_function, X, y2)[0]

# prediction
X_test = np.linspace(-5, 25, num=31)
y_pred_1 = linear_function(X_test, u[0], u[1])
y_pred_2 = quadratic_function(X_test, v[0], v[1], v[2])

As opposed to polyfit, this function requires a model-function to passed in as an argument in the first place. It can be any parametrized mathematical formula, however, curve_fit imposes one condition: the model-function itself accepts data x as its first argument. Finally, it returns the optimized coefficients similarly to polyfit, although it also returns the diagnostic information (hence the [0] at the end to suppress it).

Scikit-learn

Scikit-learn (or sklearn) is a “go-to” library when it comes it machine-learning. It focuses heavily on interface consistency, meaning that it tries to unify access to different features and algorithms using the same methods such as .fit, .transform, .fit_transform,and .predict. Solving a linear regression solution is fairly easy, however, the polynomial regression case requires a bit of thinking.

Let’s start with the linear regression.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
from sklearn.linear_model import LinearRegression


# fake data
X = np.linspace(0, 10, num=5).reshape(-1, 1)
y = 4 * X - 2 + np.random.randn(len(X))

linreg = LinearRegression()
linreg.fit(X, y)

# prediction
X_test = np.linspace(-5, 25, num=31).reshape(-1, 1)
y_pred = linreg.predict(X_test)

Here, the solution is realized through the LinearRegression object. Following the scikit-learn’s logic, we first adjust the object to our data using the .fit method and then use .predict to render the results. Note that X needs to be reshaped to an (m, 1) vector column.

To address the more generic polynomial regression case, we need to combine LinearRegression with PolynomialFeatures object as there is no direct solution available. Indeed, sklearn was constructed with the intention to handle data-related problems rather than be an equation solver. More on the differences can be found in this post.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# fake data
X = np.linspace(0, 10, num=5).reshape(-1, 1)
y = 5 * X ** 2 - 3 * X + 10 + np.random.randn(len(X))

polyreg = make_pipeline(
        PolynomialFeatures(degree=2),
        LinearRegression()
        )
polyreg.fit(X, y)

# prediction
X_test = np.linspace(-5, 25, num=31).reshape(-1, 1)
y_pred = polyreg.predict(X_test)

The starting point is the PolynomialFeatures class, which creates “mixed” terms such as $a^nb^m$ , based on generic variables $a$ and $b$ . Here, we apply this transformation to $\mathbb{x}$ , this obtaining a feature vector (matrix) $(1, \mathbf{x}, \mathbf{x}^2)$ , as $n = 2$ , and $\mathbf{x}$ is one-dimensional. Then, using sklearn’s pipeline, we combine $\mathbf{x}$ ’s with linear coefficients $\theta$ , basically treating each $\mathbf{x}$ as a separate variable. Finally, we solve it as if we faced the standard linear regression problem, obtaining $\theta$ .

We can see that the approach taken here is quite different from both numpy and scipy. As sklearn “sees” the problem more in terms of consecutive adjustments (fit), transformations of data (not needed here), and predictions (predict). Either way, the result is the same.

Tensorflow

The last from the packages that we cover here is Tensorflow 2.0. As there have been substantial changes to the API introduced in version 2.0, we are not going to present the solution offered in 1.x. However, if you are interested, please refer to this excellent post by Trần Ngọc Minh.

The appearance of the process looks quite similar to our original implementation. Under the hood, it is, in fact, very different. Let’s take a look at the following snippet first:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
import numpy as np
import tensorflow as tf


LEARNING_RATE = 0.0001
N_EPOCHS = 100000
DEGREE = 2


X = np.linspace(0, 10, num=5).reshape(-1, 1)
y = 5*X**2 - 3*X + 10 + np.random.randn(len(X))

theta = tf.Variable([0.0] * (DEGREE + 1), name='parameters')


def hypothesis(x):
    return sum([theta[i] * x**i for i in range(DEGREE + 1)])


def cost(y_pred, y):
    return tf.reduce_mean(tf.square(y_pred - y))


for epoch in range(N_EPOCHS):
    with tf.GradientTape() as tape:
        y_predicted = h(X)
        cost_value = cost(y_predicted, y2)
    gradients = tape.gradient(cost_value, theta)
    theta.assign_sub(gradients * LEARNING_RATE)


X_test = np.linspace(-5, 25, num=31)
y_test = hypothesis(X_test).numpy()

The logic related to the hypothesis and the cost function is the same. Also, the training is performed using the same method, only the function names are different. What happens, however, is the fact that tensorflow attempts to solve the problem symbolically, by building the so-called graph. The graph is a representation of all mathematical dependencies just so that derivatives can easily be calculated. For example, the theta.assign_sub(...) translates to updating $\vec{\theta}$ with the gradient, while tf. prefixed functions are the tensorflow symbolic counterparts of the functions known from numpy.

Conclusions

In this article, we have revisited the polynomial regression problem in both mathematical and programmatic terms. Furthermore, we have compared its implementation using four python libraries that are frequently used in data projects. We also compared them against a custom, demonstration oriented implementation, discussing both differences and similarities.

In case you still wonder which one to choose, here is our advice:

Unless you want to fit a non-polynomial function, go for numpy. Your project probably requires it anyway.
If you do have a more exotic function or function that you won’t easily convert to a polynomial, use scipy. Its interface is very clear and the fit is pretty fast.
In case you work on a bigger machine-learning project with sklearn and one of your steps requires some sort of polynomial regression, there is a solution here too. Although, there is little sense in installing the library just for doing the regression.
The same applies to tensorflow. Unless it is not a requirement - do not overcomplicate things.
Finally, speaking of the complexity, don’t waste time on your implementation - unless you are learning, of course.

If this post has helped you grasp a bigger picture and deepen your understanding, it is good. It means its primary goal has been achieved. Either way, please feel welcome to comment or share it!