10 Ways to Perform Regression in Machine Learning

January 03, 2022

Regression is one of the primary tools of data science. It is a type of predictive model which uses supervised machine learning to fit a function to data. The most popular form of regression is linear regression, but other forms such as nonlinear regression also exist. 

Regression is one of the most useful models in the real world, as regression models don’t usually require a lot of data to train and are easy to implement. It is important to note that regression models are not classifiers; instead, they estimate a continuous-valued function from the available data points. Regression models have many different use cases, such as pricing a given good or commodity, predicting home values, and even understanding risks for certain diseases such as heart disease and cancer.

There are many types of regression machine learning algorithms, all of them based on supervised learning. These are 10 of the most popular.

1. Use Ordinary Least Squares

For those who aren’t familiar, regression refers to the process of fitting a function to data. Often, regression is used more specifically to refer to linear regression, which involves drawing either a line or linear hyperplane of best fit to a set of data points. The method used to fit this line is called ordinary least squares, and it refers to the process of minimizing the sum of squared error, defined as:

$$L = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$$

where f is the linear function we are fitting to the data. At each iteration of the algorithm, we compute predictions using the current model weights, subtract them from the true (ground-truth) values, square the resulting differences, and sum over all input data points. This gives us the loss, which we can then use to perform backpropagation and gradient descent to update the model weights. There is also a closed-form solution to this problem (in the linear case), and it’s implemented in many different statistical packages and languages, including R, Python (scikit-learn), and MATLAB. The OLS method is a great choice when your dataset exhibits a strongly linear dependence between its independent and dependent variables.
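
To make this concrete, here is a minimal sketch of ordinary least squares with scikit-learn; the synthetic two-feature dataset and the coefficient values are illustrative assumptions, not part of the original example.

```python
# A minimal OLS sketch, assuming a small synthetic dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))                       # 200 points, 2 features
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.5, size=200)

model = LinearRegression()                                  # solves OLS in closed form
model.fit(X, y)
print(model.coef_, model.intercept_)                        # roughly [4, -2] and 1
print(model.predict([[1.0, 2.0]]))                          # prediction for a new point
```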

2. Use Polynomial Regression

If your data is a bit more curvy (we’re talking quadratic, cubic, perhaps even quartic?), then you might want to try something like polynomial regression. Believe it or not, polynomial regression can be reformulated as linear regression, as long as you add additional features to the model. For example, we can fit a cubic polynomial to our data vector x by fitting a linear regression model to the expanded feature vector $(1, x, x^2, x^3)$. This gives us lots of options in terms of what functions we can use to model our data, while still sticking to the same basic modeling premise.
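
As a sketch of this idea, the snippet below uses scikit-learn’s PolynomialFeatures to expand x into $(1, x, x^2, x^3)$ and then fits ordinary linear regression on the expanded features; the cubic target function is an illustrative assumption.

```python
# Polynomial regression as linear regression on expanded features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(150, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.2, size=150)  # assumed cubic target

# PolynomialFeatures builds (1, x, x^2, x^3); LinearRegression then fits OLS on it.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))
```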

3. Use a Regression Tree or Random Forest

Regression trees, or decision trees as they’re more colloquially known, are another excellent model for performing regression. They work in a somewhat different way: instead of directly fitting a line to the data, they operate by recursively splitting the input space into regions based on thresholds on the input features.

If you do this many times successively, the resulting piecewise prediction approximates the underlying function, and the tree effectively performs regression. Regression trees can approximate a linear function, polynomial functions, or really any continuous function. They are highly favored for their ease of use, scalability, and ease of interpretation.

Because they split the input space one feature at a time, it’s easy to trace the function being computed across each feature of the input. For example, if we were performing regression to estimate a patient’s risk of a heart attack, a patient may lie far toward the high-risk end of the spectrum if their blood pressure is greater than 170, their weight is greater than 250, their weekly exercise time is less than 1 hour, and so on. In this way, it’s easy to determine the splits along which the decision tree makes decisions and to communicate them to those who are using the model.

Random forests also perform regression. These are simply ensembles of many regression trees, trained in much the same way. They are often more accurate and preferred over a single tree: while single trees work well, they can exhibit high variance, so it’s usually better to average the predictions of many different regression trees when deriving a prediction.
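
The sketch below compares a single regression tree with a random forest ensemble on the same data; the sine-wave dataset, tree depth, and number of estimators are illustrative assumptions.

```python
# A single regression tree versus a random forest (an averaged ensemble of trees).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)          # one tree: higher variance
forest = RandomForestRegressor(n_estimators=200).fit(X, y)   # averaging reduces variance

print(tree.predict([[2.0]]), forest.predict([[2.0]]))        # both roughly sin(2.0)
```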

4. Use Ridge Regression

Ridge regression is a form of regularized regression. One problem that regression models can exhibit is that they might overfit the training data, meaning they fit the relationship in that dataset too closely to the detriment of generalizability. This is bad because it leads to worse predictions from the model when it’s presented with new, unseen data. Ridge regression addresses this by adding an L2 penalty on the model’s weights to the loss function. 

In effect, this prevents the model’s weights from growing too large, which is what usually happens when the model overfits. This is what the loss function for ridge regression looks like:

$$L(W) = \sum_{i=1}^{n} \big(y_i - f(x_i; W)\big)^2 + \lambda \lVert W \rVert_2^2$$

The regression function, f, here is parameterized by a matrix of weights, W. As you can see, there is an additional term involving lambda and the L2 norm of W. This regularization term adds the sum of the squares of W’s individual components, and thereby penalizes large weight values. The regularization coefficient lambda controls how strong the regularization effect is: larger values of lambda lead to greater regularization.
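
Here is a hedged sketch using scikit-learn’s Ridge estimator; the dataset and the alpha value (which plays the role of lambda in the loss above) are illustrative assumptions.

```python
# Ridge regression: OLS plus an L2 penalty on the weights.
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0)      # alpha is the regularization strength (lambda)
ridge.fit(X, y)
print(ridge.coef_[:5])        # weights are shrunk toward zero relative to plain OLS
```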

5. Use Lasso Regression

Lasso regression sounds like something straight out of the Wild West, but unfortunately it’s not that exciting. It’s simply another form of regularized regression. In contrast to L2-regularized (ridge) regression, it has the following loss function:

$$L(W) = \sum_{i=1}^{n} \big(y_i - f(x_i; W)\big)^2 + \lambda \lVert W \rVert_1$$

The only difference here is that the regularization term involves the L1 norm of W, which is simply the sum of the absolute values of all of W’s individual components. This has the same regularizing effect on the solutions to the regression equation as ridge regression. However, the models fit by the lasso tend to have sparse weights, i.e. weight vectors in which many components are close to zero and a few take larger values. This is useful for modeling situations where sparsity is expected, or as a way of performing approximate feature selection automatically while fitting the model.
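
The sparsity effect is easy to see in a short sketch with scikit-learn’s Lasso; the dataset, with only a few informative features, and the alpha value are illustrative assumptions.

```python
# Lasso regression: many fitted weights end up at (or very near) zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Only 5 of the 30 features actually influence the target.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ != 0))   # typically close to the 5 informative features
```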

6. Use Bayesian Regression

Bayesian regression fits a regression model using Bayes’ rule, given as:

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}$$

In other words, the method estimates the probability of the model weights w given the training data D, computed in terms of the likelihood (the probability of the data given the weights), the prior probability of the weights p(w), and the marginal likelihood of the data p(D). In effect, it places a distribution over the set of weights that can be fit by the model, which allows one to reason about the set of solutions to the regression in a more statistically sound way. This is the key advantage that Bayesian linear regression provides: instead of giving a single maximum likelihood estimate of the regression solution, it gives a distribution over solutions.
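
One practical way to try this is scikit-learn’s BayesianRidge, which places a Gaussian prior over the weights; the dataset below is an illustrative assumption, and BayesianRidge is just one implementation of Bayesian regression.

```python
# Bayesian regression: predictions come with uncertainty, not a single point estimate.
from sklearn.linear_model import BayesianRidge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

model = BayesianRidge().fit(X, y)

# return_std=True returns a predictive standard deviation for each point,
# reflecting the distribution over solutions rather than one fixed answer.
mean, std = model.predict(X[:3], return_std=True)
print(mean, std)
```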

7. Use Gaussian Process Regression

Gaussian processes are another type of Bayesian model. Like Bayesian regression, they give a distribution over solutions to the regression equation. They model the posterior distribution of the regression function as a multivariate normal distribution:

$$f \sim \mathcal{N}(\mu, K)$$

The normal distribution here is parameterized by a mean vector mu and a positive definite kernel matrix K, which are the quantities the Gaussian process model fits during training. The Gaussian process is so named because it is a stochastic process in which every finite collection of its random variables has a multivariate normal distribution. It is a great model in that it gives you the full posterior distribution over functions and can fit arbitrarily complex functions to data.
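
A minimal sketch with scikit-learn’s GaussianProcessRegressor is shown below; the RBF kernel, the noise level alpha, and the sine-wave data are illustrative assumptions.

```python
# Gaussian process regression: posterior mean plus uncertainty at each point.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X[:, 0])

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gpr.fit(X, y)

# The GP returns both a posterior mean and a standard deviation (uncertainty).
mean, std = gpr.predict([[2.5]], return_std=True)
print(mean, std)
```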

8. Use Support Vector Regression

Much like the SVM for classification, the SVR algorithm uses a support vector machine to fit a regression function to data (linear by default, or nonlinear when a kernel is used). It is trained to solve the following optimization problem:

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad \lvert y_i - (w^\top x_i + b)\rvert \le \varepsilon \quad \text{for all } i$$
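
Here, epsilon defines a tube around the fitted function within which errors are not penalized. A short sketch with scikit-learn’s SVR follows; the RBF kernel, C, epsilon, and the quadratic synthetic data are illustrative assumptions.

```python
# Support vector regression with an epsilon-insensitive loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# epsilon sets the width of the tube within which errors are ignored;
# the RBF kernel lets the SVR fit a nonlinear function.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.predict([[1.0]]))   # roughly 1.0 for this quadratic target
```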

9. Use a Neural Network

In its simplest form, a neural network can be trained to perform regression. A linear, feedforward layer in a neural network is defined to perform the operation:

$$y = Wx + b$$

When the network with a single such linear layer is trained to fit the weights, it performs linear regression! Of course, most neural networks apply a nonlinear activation function such as a sigmoid on top of this layer, i.e.

$$y = \sigma(Wx + b)$$

In that case, the model is ultimately trained to perform nonlinear regression. Neural networks are a great choice when you have very large datasets, as they can be trained efficiently on GPUs using gradient descent and backpropagation via a neural network library such as PyTorch or TensorFlow.
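
The sketch below trains a small PyTorch network to fit a noisy sine curve; the architecture, learning rate, and epoch count are illustrative assumptions.

```python
# Nonlinear regression with a tiny PyTorch network.
import torch
import torch.nn as nn

X = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(X) + 0.1 * torch.randn_like(X)

# One hidden layer with a nonlinear activation -> nonlinear regression.
# Using nn.Linear(1, 1) alone would recover plain linear regression.
model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()        # backpropagation
    optimizer.step()       # gradient descent update

print(model(torch.tensor([[1.0]])).item())   # roughly sin(1.0)
```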

10. Use Elastic Net Regression

Elastic net regression is a combination of L1- and L2-regularized regression. It uses two hyperparameters, alpha and beta, which control the contribution of each type of regularization to the model training process.
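
In scikit-learn the two knobs are expressed as alpha (overall penalty strength) and l1_ratio (the L1/L2 blend), which together play the role of the alpha and beta described above; the dataset and values below are illustrative assumptions.

```python
# Elastic net: a weighted mix of L1 and L2 regularization.
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 50/50 L1-L2 blend
print(enet.coef_[:10])                                  # some coefficients shrink to zero
```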

Training and Validation of a Regression ML Model

Artificial intelligence and ML algorithms would be useless without validation. Validation allows us to ensure that our predictions are reasonable and generalizable to new datasets. This is vital for preventing training artifacts such as overfitting when training machine learning models. 

The first step in training and validating any machine learning pipeline is the preparation and preprocessing of the training dataset. Next, the dataset needs to be partitioned into a train split and a validation split. The train split will be used by the ML model for training, and the validation split will be used for assessing metrics such as mean squared error, mean absolute error, or R².

Typically, one will also have access to a held-out test set for evaluating the final generalization capacity and performance of the model. However, it’s important to note that evaluation on the test set should not take place until after the model has been fully trained; doing so beforehand amounts to data leakage.

It can also be useful to use visualization when training and validating a regression model. Depending on the type of data, we can plot metrics such as the training loss and validation error as training proceeds. In this way, we can prevent overfitting by stopping model training early, before it occurs.
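
The sketch below ties these steps together: a train/validation/test split, per-epoch monitoring of validation error, simple early stopping, and a final test evaluation only after training ends. The choice of SGDRegressor, split sizes, epoch budget, and patience value are all illustrative assumptions.

```python
# A train/validation/test workflow with early stopping for a regression model.
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Hold out a test set first, then carve a validation split from the remainder.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = SGDRegressor(random_state=0)
best_val, patience = float("inf"), 0
for epoch in range(100):
    model.partial_fit(X_tr, y_tr)                          # one pass of gradient descent
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val:
        best_val, patience = val_mse, 0
    else:
        patience += 1
        if patience >= 5:                                  # simple early stopping rule
            break

# Only evaluate on the test set once training is finished, to avoid leakage.
print(mean_squared_error(y_test, model.predict(X_test)))
```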

Conclusion

Regression is an invaluable step in the machine learning journey. Identifying where it might be useful requires a level of expertise. After annotating close to a million terabytes of data, iMerit has mastered the art of applying and conducting regression for clients across numerous industries and products. If regression is a challenge, then speak with an iMerit expert today.