Three Common Machine Learning Pitfalls and How to Avoid Them

November 05, 2021

There’s no need to reinvent the wheel. While this cliche is often overused, it especially applies to the world of machine learning. With so many successful and cutting-edge examples of ML model building, there is certainly no shortage of valuable wisdom to tap into. If you’re building an ML model and need some best practices to keep in mind, we’ve already done the research for you.

In this piece we will be focusing on the following topics:

  1. Preventing test data leakage.
  2. Determining the best model for your use case.
  3. Optimizing hyperparameters.

Prevent Test Data Leakage 

It is surprisingly common for test data to leak into training data, leading to unrealistically good performance in the test phase and poor performance on real-world data. Test data typically leaks into training data through user error, which usually falls into one of the following three buckets:

  • Data processing before separation: If you process data before separating test from training data, the parameters of the pre-processing model will be fitted with knowledge of the test set. To prevent this, first separate the test and training data, fit any pre-processing on the training data only, and then apply the fitted transformation to the test data.
  • Duplication: Sometimes, real-world data may produce a few sets of identical or near-identical data. A very clear example of this is a dataset of user messages on a messaging platform. A spam message can be sent multiple times to multiple users and will be observed as different data points, but contain the same content. If these entries are distributed across the test and training data, the algorithm will have exact knowledge of these messages, resulting in unrealistic performance.
  • Temporal data: When using temporal data, you may unreliably test your model by giving it knowledge about the future. To exemplify this, imagine three chronologically consecutive data points A, B, and C. If you are training the model on points A and C, and then testing it on point B, the ML algorithm will have more knowledge about the future of point B than realistically possible. To prevent this, it’s important to split temporal data across time, such that all training sets are chronologically earlier than all testing sets.
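The three safeguards above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production recipe: the record fields (`timestamp`, `text`, `length`), the helper names, and the whitespace and case normalization used to detect near-duplicates are all assumptions made for the example.

```python
import statistics


def deduplicate(records):
    """Keep only the first record for each normalized message text."""
    seen = set()
    unique = []
    for rec in records:
        # Assumed notion of "near-identical": same text ignoring case and spacing.
        key = " ".join(rec["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def temporal_split(records, test_fraction=0.25):
    """Sort by timestamp and hold out the most recent records for testing."""
    ordered = sorted(records, key=lambda rec: rec["timestamp"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]


def fit_scaler(train_values):
    """Estimate the scaling parameters from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # avoid division by zero
    return mean, std


def apply_scaler(values, mean, std):
    """Apply previously fitted scaling parameters to any set of values."""
    return [(v - mean) / std for v in values]


# Hypothetical toy dataset of messages.
records = [
    {"timestamp": 1, "text": "Win a FREE prize", "length": 16.0},
    {"timestamp": 2, "text": "win a free   prize", "length": 18.0},
    {"timestamp": 3, "text": "Lunch at noon?", "length": 14.0},
    {"timestamp": 4, "text": "See you tomorrow", "length": 16.0},
    {"timestamp": 5, "text": "Meeting moved to 3pm", "length": 20.0},
]

unique = deduplicate(records)          # 1. remove duplicates before splitting
train, test = temporal_split(unique)   # 2. train on the past, test on the future
mean, std = fit_scaler([r["length"] for r in train])  # 3. fit on training data only
train_scaled = apply_scaler([r["length"] for r in train], mean, std)
test_scaled = apply_scaler([r["length"] for r in test], mean, std)
```

The order of operations is the point of the sketch: duplicates are removed before the split, the split respects time, and the scaling parameters never see the test records.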

Determine the Best Model for Your Use Case

Results such as the No Free Lunch (NFL) theorem can help us reevaluate any assumptions about which ML model to use. Roughly speaking, the NFL theorem states that no ML approach is better than any other when performance is averaged over every possible problem.

Fortunately, when we only want to solve a single problem, we can leverage the existing body of knowledge to get pointed in the right direction. Even if no amount of research can exactly prescribe the best ML model for a specific use case, it can still provide some explicit no-gos including:

  • Using a deep neural network in instances where data is limited.
  • Using models that expect categorical features on a dataset that consists of numeric features.
  • Using a model that assumes no dependencies between variables on time series data.

It’s also important not to discriminate against models for any reason other than performance. We must consider human biases and scrutinize any tendency we have to discard older or non-proprietary models. New isn’t always better, so if an old, mature model fits your use case better than a more advanced, newer model, it should not be viewed unfavorably.

Similarly, the ‘not invented here syndrome’ may deter some researchers from using models that were not created in-house, potentially painting the researchers into a corner.

Perhaps the most encouraging aspect of developing ML models today is the convenience of modern libraries. You can find out which ML model works best for your use case by trying out as many models as possible, within reason. Modern libraries in Python (scikit-learn), R (caret), and Julia (MLJ) lower the implementation barrier and make it easy to scope out the best models with little change to the underlying code.
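As a rough illustration of how little code it takes to compare models, the sketch below scores three scikit-learn classifiers with 5-fold cross-validation. The synthetic dataset, the choice of candidate models, and their settings are assumptions made for the example; in practice you would substitute your own data and shortlist.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate models share the same estimator interface, so swapping is trivial.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Mean accuracy across 5 cross-validation folds.
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because every candidate is evaluated through the same loop, adding or removing a model is a one-line change.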

Hyperparameter Optimization

To create efficient ML algorithms, you need to optimize any hyperparameters to get the best result for your use case. Examples of hyperparameters include the number of trees in a random forest and the architecture of a neural network. Referring to similar use cases or existing research may point you in the right direction, but these hyperparameters should be fitted to your particular dataset to achieve the best performance.

However, it’s worth noting that we need to approach hyperparameter optimization strategically rather than trying and testing different configurations to see how they turn out. Some of the most well-known optimization approaches include:

  • Grid search: Also known as a parameter sweep, grid search is an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. It is typically guided by some performance metric.
  • Random search: Unlike grid search, random search doesn’t exhaust all possible configurations. Instead, it evaluates a fixed number of configurations sampled at random from the hyperparameter space.
  • Bayesian optimization: Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum.
  • Evolutionary optimization: This approach uses evolutionary algorithms to search the space of hyperparameters for a given algorithm. Inspired by the biological concept of evolution, it creates an initial population of random solutions, evaluates the hyperparameter tuples to obtain their fitness, ranks them by relative fitness, and replaces the worst-performing tuples with new ones generated through crossover and mutation.
  • Population-based training: This approach eliminates manual hypertuning by having multiple learning processes operating independently using different hyperparameters. Poorly performing models are replaced with models that adopt modified hyperparameter values from the better performers.
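To make the difference between the first two approaches concrete, here is a small sketch of grid search and random search in plain Python. The `validation_score` function is a hypothetical stand-in for training a model and scoring it on a validation set, and the hyperparameter names and ranges are likewise assumptions chosen for illustration.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    """Hypothetical objective standing in for 'train, then score on validation data'."""
    return -((learning_rate - 0.1) ** 2) * 100 - ((depth - 6) ** 2) * 0.01


def grid_search(grid):
    """Evaluate every combination in a manually specified grid."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def random_search(sample, n_trials, seed=0):
    """Evaluate a fixed number of randomly sampled configurations."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = sample(rng)
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def sample_params(rng):
    """Draw one configuration from the assumed search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-2, 0),  # log-uniform in [0.01, 1]
        "depth": rng.randint(2, 8),
    }


# Grid search tries all 3 x 4 = 12 combinations.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]}
grid_score, grid_params = grid_search(grid)

# Random search gets the same budget of 12 evaluations.
random_score, random_params = random_search(sample_params, n_trials=12)
```

Grid search can only return points on the grid, whereas random search spends the same budget on values the grid would never try. Which one does better depends on the objective and on how well the grid was chosen.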

It’s worth keeping in mind that optimizing hyperparameters and selecting features should be part of model training, and not an activity carried out before model training. A particularly common error is to do feature selection on the whole dataset before model training begins, but this will result in information leaking from the test set into the training process. 
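One way to keep feature selection and hyperparameter tuning inside the training process is to wrap them in a scikit-learn `Pipeline` and tune the whole pipeline with cross-validation. The sketch below assumes a synthetic dataset, univariate feature selection with `SelectKBest`, and a logistic regression model; these are illustrative choices, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Hold out the test set before any fitting takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and feature selection are pipeline steps, so they are refitted
# on the training portion of every cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# The number of selected features is tuned like any other hyperparameter.
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# The test set is touched exactly once, after all tuning is finished.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy: {test_accuracy:.3f}")
```

Since the selector lives inside the pipeline, neither the held-out test set nor the validation fold of any split influences which features are chosen.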


Building robust models requires careful consideration throughout the process. While the practices above are undoubtedly useful, there are many more considerations to weigh even after you’ve finished building the model. For more AI and ML insights, download iMerit’s solutions brief to learn how to create the best ML models possible.