Data isn’t born for AI/ML models. While quantifiable, data is still a chaotic mess that needs deciphering before it can be made useful. In the world of machine learning, poor-performing ML and AI models are all too common, and data is typically the culprit.
Problems with a model usually stem from inadequate or insufficient input data, an incorrectly trained neural network that cannot produce accurate results for given inputs, or bugs in the code used to train it.
In this piece, we’ll outline troubleshooting techniques and key considerations for evaluating your AI/ML model.
Most Common Data Challenges
In our time annotating for our customers, we’ve found that poorly performing models are typically caused by input data exhibiting at least one of the following traits:
- Corrupt data: When data is mismanaged, improperly formatted, or combined with incompatible data, the result is data corruption. It would shock you how common this is, despite the brilliant minds typically found working in AI.
- Incomplete or insufficient data: Ensuring data is complete before commencing an AI/ML model rollout is a crucial step. If any percentage of values is missing from a given dataset, then it is deemed incomplete. Skipping this check will result in models performing unpredictably, and troubleshooting why is what’ll really cost you.
- Overfitting: Overfitting occurs when a model learns its training data too precisely, noise included. It results in modeling errors that happen when a function is too closely fit to a limited set of data points. More on this later.
- Underfitting: Conversely, underfitting happens when the model is too simple, or the data too limited, for it to learn the underlying pattern, resulting in a model that hasn’t learned enough. The model and algorithm therefore do not fit the data adequately, resulting in poor output. More on this later, too.
While each of these has its own protocol for identification and resolution, they’re typically the first things data scientists look for when a model isn’t performing well. The next few challenges, however, are often overlooked.
Proper preprocessing of data is essential to the function of any AI/ML model. Although it is a necessary part of preparing data, it’s still a commonly overlooked step, and skipping it costs data scientists far more than it saves them.
Here is a checklist to follow when preprocessing your data:
1. Handle missing data and values: When there are features with missing data and values, be sure to either remove or replace them. Here’s a sample dataset with some missing values:
In the above dataset, the data entry with ID value ‘4’ can be removed because two of its features are missing, so imputing the data might not be the right choice here. The data entry with ID value ‘3’, which is only missing a value for the weight feature, can be imputed with the mean, median, or mode of that feature.
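The remove-or-impute decision can be sketched in pandas. This is a minimal illustration on a small hypothetical table (the column names and values are assumptions, not the dataset referred to above): rows missing more than one feature are dropped, and the remaining gaps are filled with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical data: id 3 lacks weight, id 4 lacks two features.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "height": [170.0, 165.0, 180.0, np.nan, 175.0],
    "weight": [70.0, 60.0, np.nan, np.nan, 80.0],
})

features = ["height", "weight"]

# Drop rows that are missing more than one feature value.
missing_per_row = df[features].isna().sum(axis=1)
cleaned = df[missing_per_row <= 1].copy()

# Impute the remaining gaps with the column median (mean or mode also work).
cleaned[features] = cleaned[features].fillna(cleaned[features].median())
```

The threshold of one missing value per row is an arbitrary choice for this sketch; the right cutoff depends on how many features the dataset has.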
2. Ensure the data is balanced. Data is imbalanced when it is unequally distributed or skewed towards one target class.
In the above dataset, 90% of the data belongs to the positive class and only 10% belongs to the negative class. Therefore, a model trained on it is very likely to predict the positive class for new data. Imbalanced data like this must be handled appropriately, by either resampling or augmenting the data.
As a side note, be careful about unbalanced datasets in edge cases, e.g. a few fire hazards amidst hundreds of images of safe situations.
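One simple form of resampling is random oversampling of the minority class. The sketch below uses only the Python standard library and a hypothetical 90/10 labeled set; the helper name `oversample_minority` is made up for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical labeled data: 90 positive (1) and 10 negative (0) examples.
samples = ([(f"sample_{i}", 1) for i in range(90)]
           + [(f"sample_{i}", 0) for i in range(90, 100)])

def oversample_minority(data):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    by_class = {}
    for row in data:
        by_class.setdefault(row[1], []).append(row)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw extra rows with replacement to reach the majority-class size.
        balanced.extend(random.choices(rows, k=target - len(rows)))
    return balanced

balanced = oversample_minority(samples)
counts = Counter(label for _, label in balanced)
```

Resampling is normally applied to the training split only, so that the test set keeps the real-world class distribution.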
3. Outliers: These are values that do not fit within a dataset or that stand out distinctly. Box plots are a good way to detect which features contain outliers.
The above box plot shows outliers of Petal Width for the Setosa species of the Iris dataset. Once you have confirmed they are errors rather than genuine rare cases, you can remove these outliers to smooth out the data.
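A box plot conventionally flags points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically; the sketch below uses NumPy on hypothetical measurements, not the Iris values from the plot.

```python
import numpy as np

# Hypothetical measurements for one feature; 6.0 sits far from the rest.
values = np.array([1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 6.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside the whisker range, then keep only the inliers.
is_outlier = (values < lower) | (values > upper)
filtered = values[~is_outlier]
```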
4. Feature Normalization or Standardization: These are scaling techniques used to bring features onto the same scale. Dataset features are not always on the same scale and can vary tremendously in magnitude, units, and range. Many ML models, especially distance-based and gradient-based ones, perform well only when all features are on a similar scale.
This sample dataset displays ‘Width’ and ‘Height’. Here, ‘Width’ and ‘Height’ are on different scales, so chances are that more weight will be given to the ‘Width’ feature while modeling. Normalization is therefore used to bring both features onto the same scale, as shown in the above image.
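Both scaling techniques are short enough to write directly in NumPy. The values below are hypothetical stand-ins for two features on very different scales.

```python
import numpy as np

# Hypothetical features on very different scales: columns are width, height.
X = np.array([
    [1200.0, 1.5],
    [1500.0, 1.8],
    [1800.0, 1.2],
    [2100.0, 2.0],
])

# Min-max normalization: rescale each column to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: shift each column to zero mean and unit variance.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)
```

In practice the minimum, maximum, mean, and standard deviation are computed on the training data and then reused to transform validation and test data.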
Feature engineering involves modifying existing features or creating new ones to improve modeling results. Some feature engineering methods are:
1. Converting text data features into vectors
As machine learning is based on linear algebra, textual data features must be converted into vectors. Techniques used to convert text data to features include Bag of Words (BOW), TF-IDF (Term Frequency-Inverse Document Frequency), Avg Word2Vec, and TF-IDF Word2Vec.
2. Modifying features
Some features need to be modified using feature binning, mathematical transform, feature slicing, and indicator variable techniques for best results.
3. Creating new features
New features can also be created from existing ones, for example by featurizing categorical data using one-hot encoding.
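Feature binning and one-hot encoding can both be sketched with pandas. The column names, bin edges, and labels below are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical records with one numeric and one categorical feature.
df = pd.DataFrame({
    "age": [22, 35, 47, 61],
    "color": ["red", "green", "red", "blue"],
})

# Feature binning: turn a numeric column into ordered ranges.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(df, columns=["color"])
```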
Now that you’ve audited your data…
After you’ve audited your data using the above methods, the data should be as correct as possible. Notice that you always start with fixing the data first.
It is now time to fix the model itself. We can say the model is performing well when bias and variance are well balanced and accuracy is high. Below is a four-step walkthrough to help you troubleshoot your model.
Step One: Feature Selection
Input data can sometimes contain hundreds of input features. However, not all input features contribute to the output, which means some are not useful for the model. Ensuring the correct features are chosen is, therefore, the first step toward optimal model performance. Selecting fewer, more relevant features can not only improve model performance but also reduce training time.
There are many ways to choose useful features:
1. Univariate and Bivariate Selection
Univariate and bivariate selection help uncover the relationship between input features and output variables. Select the features that are strongly related to the output variable. Statistics-based tools such as scatter plots, correlation, and the ANOVA F-value method are used for this. The scikit-learn SelectKBest class can also be used to find the best features.
In the above code, features 0, 2, and 3 show high scores. These features can be selected for modeling.
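A minimal sketch of `SelectKBest` with the ANOVA F-value score function follows. It runs on synthetic data from `make_classification`, so the selected indices will not necessarily match the ones quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 6 features, of which 3 carry signal.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# ANOVA F-score per feature, and the indices of the k highest-scoring ones.
scores = selector.scores_
selected_idx = selector.get_support(indices=True)
```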
2. Principal Component Analysis (PCA)
PCA is an algorithm used for dimensionality reduction. The logic behind PCA is to project the data onto new axes (principal components) that capture the most variance, since high-variance directions contain more information. The scikit-learn PCA API is used for its implementation.
In the above sample code for PCA, the data is reduced from 784 dimensions to 2-D.
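A reduction from 784 dimensions to 2 with scikit-learn's `PCA` might look like the sketch below. Random data stands in for the original inputs; treating the 784 features as flattened 28x28 images is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for 784-dimensional inputs (assumed: flattened 28x28 images).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance captured by each of the two components.
explained = pca.explained_variance_ratio_
```

On real data, `explained_variance_ratio_` is a useful check on how much information the reduced representation retains.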
3. Feature Importance
The above code identifies features 0, 2, 3, 4, and 5 as high-importance features. These can be selected for the model.
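The source does not say which estimator produced these importances. One common option, assumed here, is the impurity-based `feature_importances_` attribute of a tree ensemble such as scikit-learn's `RandomForestClassifier`. The sketch uses synthetic data, so its ranking will differ from the indices above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 6 features, of which 3 carry signal.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first
```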
Step Two: Model Selection
The next important step is model selection. Not every algorithm works for every dataset. To predict numerical values, try regression algorithms. When trying to predict categorical data, try classification algorithms. If the task is to find dataset structure, then try clustering algorithms.
For best results, try running the dataset through different model types. At a minimum, compare two models, or ensemble multiple models using techniques like boosting, bagging, stacking, or cascading. For complex and larger datasets, consider neural networks.
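Comparing candidates can be as simple as scoring each one the same way on the same data. The sketch below pits a linear model against a bagging-style ensemble on the Iris dataset; both the dataset and the two candidates are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),  # bagging-style ensemble
}

# Mean 5-fold accuracy for each candidate on the same data.
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}
best_name = max(results, key=results.get)
```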
Step Three: Hyperparameter Tuning
Most algorithms have hyperparameters. Hyperparameter tuning involves adjusting these hyperparameters to find the best-performing model. Different hyperparameter values are tried while running a learning algorithm over a training dataset, and the values that generalize best to new data are kept.
For example, in the k-nearest neighbors (KNN) algorithm, k is a hyperparameter. Finding the best value for k, such as whether KNN works best with three nearest neighbors, five nearest neighbors, and so on, is key to optimal model performance.
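A search over k can be sketched with scikit-learn's `GridSearchCV`. The Iris dataset and the candidate values of k are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of k and keep the one with the best cross-validated accuracy.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
best_score = search.best_score_
```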
Step Four: Cross-Validation
Cross-validation is the technique used to select the best model. The selection of the model is based on a bias-variance tradeoff.
In k-fold cross-validation, the data is divided into k equal subsets, where one subset is used as a test/validation set while the remaining k-1 subsets are used as training data. This process is repeated k times, such that each time a different subset is used for testing while the other subsets are used for training the model.
The scores from all the folds are then averaged to estimate how the model will perform on new data. Cross-validation thereby helps you choose a final model that performs optimally with new data, without overfitting or underfitting.
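The k-fold procedure can be written out as an explicit loop with scikit-learn's `KFold`, with the per-fold scores averaged into a single estimate. The dataset, the model, and k = 5 are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = KNeighborsClassifier(n_neighbors=5)
    # Train on the k-1 training folds.
    model.fit(X[train_idx], y[train_idx])
    # Score on the single held-out fold.
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average accuracy across the five held-out folds.
mean_score = float(np.mean(fold_scores))
```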
Overfitting occurs when the model fits too closely to its training data, while underfitting means the model doesn’t fit its data closely enough. Overfitting results in a low-bias, high-variance model, while underfitting results in a high-bias, low-variance model.
As displayed in the above cross-validation graph, as model complexity increases, so does variance, while bias decreases. Therefore, the simpler a given model is, the lower the variance but the higher the bias. Choosing the right model requires a bias-variance tradeoff, which means a model with balanced bias and variance.
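One way to see this tradeoff numerically is to compare training and validation scores as complexity grows. The sketch below uses scikit-learn's `validation_curve` with decision-tree depth as the complexity knob; the synthetic data and the depth values are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Average over folds: one training score and one validation score per depth.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean  # a large gap suggests overfitting (high variance)
```

Low scores on both curves point to underfitting, while a training score far above the validation score points to overfitting.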
Troubleshooting a poorly performing AI or ML model is no small task. It can be dizzying, and knowing where to start is key. If your model is still struggling to perform after following these steps and considerations, then consider consulting one of iMerit’s annotation experts.