Polynomial Regression - which python package to use?
Polynomial regression is one of the most fundamental concepts used in data analysis and prediction. Not only can any (infinitely differentiable) function be expressed as a polynomial through Taylor series at least within a certain interval, it is also one of the first problems that a beginner in machine-learning is confronted with. It is used across various disciplines such as financial analysis, signal processing, medical statistics, and more.
While polynomial regression is a relatively simple concept, it became a sort of “hello world” problem in machine-learning as it touches upon many core concepts. Still, some parts may sound confusing, especially since:
- It introduces several things at once, although only some are specific to polynomial regression.
- It is often implemented in many different ways.
In this blog post, we are going to construct it from the beginning using both equations and express them through the code. Then, we will compare our implementation against four python libraries that are the most widely used in data science:
and discuss differences between them, while pointing to similarities at the fundamental level.
Let’s discuss the mathematical foundations first.
Mathematically, the problem of regression is an attempt to model a relationship between an independent variable and a dependent variable . Assuming ’s dependent on is expressed in the following form:
we speak of polynomial regression (with denoting a noise term). Naturally, if the maximum , the problem becomes linear regression.
To solve the problem means to determine the values of all ’s to represent the data well. It boils down to the elements of the following steps:
- A “guess” regarding the form of . It is called a hypothesis function .
- A way of measuring how “bad” is the hypothesis given the true data values. This “badness” is quantified using a so-called loss function .
- An idea of how we could potentially “tune” the hypothesis such that we are better at approximating the data points. In other words, we need to have some way of minimizing .
Choosing the hypothesis
When speaking of polynomial regression, the very first thing we need to assume is the degree of the polynomial we will use as the hypothesis function. If we choose to be the degree, the hypothesis will take the following form:
The cost function and mean square error
Next, we choose the metric by which we express the cost function. For regression problems, it makes sense to choose the so-called mean square error (MSE), which measures the average distance between the individual points and our predictions brought to the second power,
Here relates to the number of data points and factor is introduced for convenience (see later). We also denote all parameters as a vector of parameters .
The MSE metrics is only a choice, but it is reasonable due to the following reasons:
- MSE is agnostic concerning the direction of the error (e.g. unlike mean error that is proportional to just in the first power).
- MSE heavily punishes large differences (outliers) due to the second power (unlike e.g. mean absolute error that is proportional to ).
- For MSE it is easier to obtain the gradient (e.g. unlike root mean square error that is ).
Once we have established the cost function and set some random initial , we want to minimize . One of the possible options is the gradient descent optimization algorithm. As the name suggests, it requires us to calculate and subsequently update all ’s accordingly.
In other words, we need to compute the derivatives of concerning every , and use them to change the values of :
thus making the hypothesis better and better at representing the data.
Fortunately, is easy to calcuate:
The last term is the derivative of the hypothesis function with respect to . Because we assumed to be a polynomial:
and the whole gradient becomes:
If we organize all into a vector of examples, we can express the formula above using vector-matrix multiplication:
or in a more compact form
Here, expresses a matrix composed of example vectors raised to the consecutive powers .
Basic implementation (naive)
We call this implementation naive because it is in no way optimized for performance. Instead, the focus was to emphasize the correspondence between the code and the underlying equations.
To start, we will use our custom implementation of a symbolic polynomial (see gist).
Polynomial class defines a callable object based on the polynomial expression (the first equation).
It implements many algebraic methods similar to our quaternion example), but for now
the most important part is the initialization.
We begin with creating of a
Polynomial object with random coefficients .
1 2 3 4 5 6 7 import numpy as np from nppoly import Polynomial np.random.seed(42) def hypothesis(degree): return Polynomial(*np.random.rand(degree + 1))
This method accepts and returns a polynomial with random coefficients. Next, we define our MAE cost function:
1 2 def cost(h, X, y): return 0.5 / len(y) * ((h(X) - y) ** 2).sum()
len(y) is equivalent to and
y are to be a hypothesis,
then and – the vectors of arguments and values.
Note that capitalizing
X is more of a tradition to emphasize its matrix-like character.
In our case, however,
X is associated with one-dimensional data only.
Next, we need to express the gradient of . Thanks to the derivation, we already know the expression:
1 2 3 4 5 6 7 8 def grad_theta(h, X, y): diff = (h(X) - y).reshape(-1, 1) X = X.reashape(1, -1) X = list(map(lambda i: x ** i, reversed(range(h.shape)))) X = np.concatenate(X) return 1 / len(y) * (X @ (diff)).reshape(1, -1)
In this function, we need to perform a couple of reshaping to ensure
we can multiply the matrix by a vector.
Lines 4-6 are responsible for constructing , and
reversed function is used to comply with the convention that stands before .
Finally, the optimization routine is done with the following function:
1 2 3 4 5 6 def optimize(h, X, y, epochs=5000, lr=0.01): theta = h._coeffs.reshape(1, -1) for epoch in range(epochs): theta -= lr * grad_theta(h, X, y) h = Polynomial(*theta.flatten()) return h
Here, line 2 is just an initialization of the hypothesis.
Then, inside the for loop, we perform , where is the so-called learning rate
lr and the repetition cycles are traditionally called “epochs”.
The optimization can be verified by executing a small script:
1 2 3 4 5 6 7 8 9 10 # fake data X = np.linspace(0, 10, num=5) y = 4 * X - 2 + np.random.randn(len(X)) h = hypothesis(1) h = optimize(h, X, y) # prediction X_test = np.linspace(-5, 25, num=31) y_pred = h(X_test)
The first library that implements polynomial regression is numpy.
It does so using
numpy.polyfit function, which given the data (
as well as the degree performs the procedure and returns an array
of the coefficients .
1 2 3 4 5 6 7 8 9 10 11 12 13 import numpy as np from numpy import polyfit # fake data X = np.linspace(0, 10, num=5) y = 4 * X - 2 + np.random.randn(X) u = polyfit(X, y, deg=1) # prediction X_test = np.linspace(-5, 25, num=31) y_pred = u * X_test + u
The function offers additional diagnostics if
full is set to
True, giving us information related to uncertainties.
Another “recipe” for solving the polynomial regression problem is
curve_fit included in
This function is more generic comparing to
polyfit as it does not require
our “model” to assume a polynomial form.
Its interface is also different.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 import numpy as np from scipy.optimize import curve_fit def linear_function(x, a, b): return a * x + b def quadratic_function(x, a, b, c): return a * x**2 + b * x + c # fake data X = np.linspace(0, 10, num=5) y1 = 4 * X - 2 + np.random.randn(len(X)) y2 = 5 * X ** 2 - 3 * X + 10 + np.random.randn(len(X)) u = curve_fit(linear_function, X, y1) v = curve_fit(quadratic_function, X, y2) # prediction X_test = np.linspace(-5, 25, num=31) y_pred_1 = linear_function(X_test, u, u) y_pred_2 = quadratic_function(X_test, v, v, v)
As opposed to
polyfit, this function requires a model-function to passed
in as an argument in the first place.
It can be any parametrized mathematical formula,
curve_fit imposes one condition: the model-function itself accepts
x as its first argument.
Finally, it returns the optimized coefficients similarly to
although it also returns the diagnostic information (hence the
 at the end to
Scikit-learn (or sklearn) is a “go-to” library when it comes it machine-learning.
It focuses heavily on interface consistency, meaning that it tries to
unify access to different features and algorithms using the same methods such
Solving a linear regression solution is fairly easy, however,
the polynomial regression case requires a bit of thinking.
Let’s start with the linear regression.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 import numpy as np from sklearn.linear_model import LinearRegression # fake data X = np.linspace(0, 10, num=5).reshape(-1, 1) y = 4 * X - 2 + np.random.randn(len(X)) linreg = LinearRegression() linreg.fit(X, y) # prediction X_test = np.linspace(-5, 25, num=31).reshape(-1, 1) y_pred = linreg.predict(X_test)
Here, the solution is realized through the
Following the scikit-learn’s logic, we first adjust the object to our data using
.fit method and then use
.predict to render the results.
X needs to be reshaped to an
(m, 1) vector column.
To address the more generic polynomial regression case, we need to combine
PolynomialFeatures object as there is no direct
Indeed, sklearn was constructed with the intention to handle data-related problems
rather than be an equation solver.
More on the differences can be found in this post.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 import numpy as np from sklearn.preprocessing import PolynomialFeatures from sklearn.linear_model import LinearRegression from sklearn.pipeline import make_pipeline # fake data X = np.linspace(0, 10, num=5).reshape(-1, 1) y = 5 * X ** 2 - 3 * X + 10 + np.random.randn(len(X)) polyreg = make_pipeline( PolynomialFeatures(degree=2), LinearRegression() ) polyreg.fit(X, y) # prediction X_test = np.linspace(-5, 25, num=31).reshape(-1, 1) y_pred = polyreg.predict(X_test)
The starting point is the PolynomialFeatures class, which creates “mixed” terms such as , based on generic variables and . Here, we apply this transformation to , this obtaining a feature vector (matrix) , as , and is one-dimensional. Then, using sklearn’s pipeline, we combine ’s with linear coefficients , basically treating each as a separate variable. Finally, we solve it as if we faced the standard linear regression problem, obtaining .
We can see that the approach taken here is quite different from both numpy and scipy.
As sklearn “sees” the problem more in terms of consecutive adjustments (
fit), transformations of data (not needed here), and predictions (
Either way, the result is the same.
The last from the packages that we cover here is Tensorflow 2.0. As there have been substantial changes to the API introduced in version 2.0, we are not going to present the solution offered in 1.x. However, if you are interested, please refer to this excellent post by Trần Ngọc Minh.
The appearance of the process looks quite similar to our original implementation. Under the hood, it is, in fact, very different. Let’s take a look at the following snippet first:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 import numpy as np import tensorflow as tf LEARNING_RATE = 0.0001 N_EPOCHS = 100000 DEGREE = 2 X = np.linspace(0, 10, num=5).reshape(-1, 1) y = 5*X**2 - 3*X + 10 + np.random.randn(len(X)) theta = tf.Variable([0.0] * (DEGREE + 1), name='parameters') def hypothesis(x): return sum([theta[i] * x**i for i in range(DEGREE + 1)]) def cost(y_pred, y): return tf.reduce_mean(tf.square(y_pred - y)) for epoch in range(N_EPOCHS): with tf.GradientTape() as tape: y_predicted = h(X) cost_value = cost(y_predicted, y2) gradients = tape.gradient(cost_value, theta) theta.assign_sub(gradients * LEARNING_RATE) X_test = np.linspace(-5, 25, num=31) y_test = hypothesis(X_test).numpy()
The logic related to the hypothesis and the cost function is the same.
Also, the training is performed using the same method, only the function names are different.
What happens, however, is the fact that tensorflow attempts to solve the problem symbolically, by building the so-called graph.
The graph is a representation of all mathematical dependencies just so that derivatives can easily be calculated.
For example, the
theta.assign_sub(...) translates to updating with the gradient, while
tf. prefixed functions are the tensorflow symbolic counterparts of the functions known from numpy.
In this article, we have revisited the polynomial regression problem in both mathematical and programmatic terms. Furthermore, we have compared its implementation using four python libraries that are frequently used in data projects. We also compared them against a custom, demonstration oriented implementation, discussing both differences and similarities.
In case you still wonder which one to choose, here is our advice:
- Unless you want to fit a non-polynomial function, go for numpy. Your project probably requires it anyway.
- If you do have a more exotic function or function that you won’t easily convert to a polynomial, use scipy. Its interface is very clear and the fit is pretty fast.
- In case you work on a bigger machine-learning project with sklearn and one of your steps requires some sort of polynomial regression, there is a solution here too. Although, there is little sense in installing the library just for doing the regression.
- The same applies to tensorflow. Unless it is not a requirement - do not overcomplicate things.
- Finally, speaking of the complexity, don’t waste time on your implementation - unless you are learning, of course.
If this post has helped you grasp a bigger picture and deepen your understanding, it is good. It means its primary goal has been achieved. Either way, please feel welcome to comment or share it!