In a previous blog, we discussed using mini-batches to make your training more efficient. Something else that can speed up (or slow down) training is the particular optimizer your ML system uses to adjust its parameters. Many ML optimizers have been developed over the years, and no single optimizer works best in all applications. Consequently, ML development environments such as TensorFlow let you choose among a number of optimizers, and many more variations appear in the ML literature. How is one to choose?
Fortunately, a recent study sheds some light on how to select a good optimizer for your application.
Comparing ML Optimizers
The objective of the study was to systematically compare optimizers by benchmarking their training performance. The authors selected the most commonly used optimizers by identifying the 14 most frequently mentioned in the online archive arXiv. The top 14 are shown below, together with the ‘Other’ category, which accounted for less than 4% of the counted mentions on arXiv.
The study evaluated training results by assessing ML system test-set performance after a budgeted number of training iterations. A variety of applications were benchmarked, using ML system architectures including multi-layer perceptrons, convolutional neural networks, recurrent networks, and variational autoencoders. Although there was no clear winner, ADAM was consistently among the best-performing optimizers. In cases where another optimizer did better than ADAM, it was usually RMSProp or NAG.
The study also explored ‘tuning’ each optimizer by evaluating various optimizer settings (hyperparameters). Optimizers generally have recommended default settings, and the authors wanted to see how much could be gained by searching for better settings. They concluded that (1) the default settings for ADAM, RMSProp, and NAG perform well without further tuning, and (2) if better results are needed, trying one of the other optimizers in this group (with its default settings) is about as effective as trying to tune the optimizer you started with.
So, the bottom-line conclusion for optimizer selection is that out of the many alternatives, picking ADAM with default parameters is a good place to start.
You may be able to improve your results by tuning ADAM or by trying RMSProp or NAG. This will require some trial training runs to compare alternatives, and the benchmarking study offers insights on how best to do these comparisons.
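If you are working in TensorFlow, "default settings" is easy to express in code. The sketch below is illustrative only: it uses the Keras optimizer API, and note that NAG has no dedicated class there; it is obtained as SGD with Nesterov momentum, and the momentum value shown is a common choice rather than a library default.

```python
import tensorflow as tf

# ADAM with its default hyperparameters
# (learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7).
adam = tf.keras.optimizers.Adam()

# RMSProp with its default hyperparameters.
rmsprop = tf.keras.optimizers.RMSprop()

# NAG: SGD with Nesterov momentum enabled. The momentum value is a
# common choice, not a library default (SGD defaults to momentum=0.0).
nag = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True)

# Any of these can be passed to model.compile(optimizer=...).
```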
Lessons from Benchmarking
The study’s benchmarks were set up to account for two factors that are good to keep in mind when making your own optimizer comparisons:
- Although optimizers work by minimizing loss on the training samples, the ultimate goal of an ML system is to perform well in its operational environment. That goal is best reflected by performance on held-out test data, so test-data performance should be used to compare optimizers.
- ML systems typically train with randomly selected mini-batches and randomly initialized parameters. It has been shown that ‘unlucky’ random draws can make a good optimizer look bad on a particular run. Therefore, it is best to compare optimizers across multiple runs using different random number generator seeds, as sketched below.
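To make these two points concrete, here is a minimal sketch of a seeded benchmark run in TensorFlow/Keras. The model, synthetic data, and training budget are placeholder assumptions standing in for your own setup; only the structure matters: fix the seeds, train under a fixed budget, and score on test data.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Placeholder architecture; substitute your own model.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

def benchmark_run(seed, x_train, y_train, x_test, y_test,
                  batch_size=128, epochs=10):
    # Fix the seeds so parameter initialization and mini-batch
    # shuffling vary only through `seed`.
    np.random.seed(seed)
    tf.random.set_seed(seed)

    model = build_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=batch_size,
              epochs=epochs, verbose=0)

    # Score on *test* data, not training loss.
    _, test_acc = model.evaluate(x_test, y_test, verbose=0)
    return test_acc

# Synthetic stand-in data so the sketch runs end to end;
# replace with your real training and test sets.
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 2, 1000)
x_test, y_test = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)

scores = [benchmark_run(seed, x_train, y_train, x_test, y_test)
          for seed in range(5)]
print(f"test accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the spread across seeds (not just the mean) makes it easier to tell whether one optimizer is genuinely better or just got a lucky run.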
Takeaway: A Practical Optimizer Selection Approach
The results of the study suggest the following strategy for selecting the best optimizer for your application:
- Select your baseline optimizer. Unless you have experience that suggests otherwise, start with ADAM using its default hyperparameters.
- Set up an optimizer benchmark based on training and test data sets, a training budget (mini-batch size and number of iterations), and a test-data performance metric.
- Benchmark your baseline optimizer using multiple runs with different randomly selected mini-batches and initial parameters.
- Use the benchmark to compare your baseline optimizer against the other top-performing optimizers (ADAM, RMSProp, or NAG), each with default hyperparameters. If one of them performs better, make it your new baseline.
- Optionally, tune your baseline optimizer by performing a random search over its hyperparameters; a sketch of these last two steps follows the list.
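Putting the last two steps together, here is a hedged sketch of the head-to-head comparison and the random search. As before, the model, synthetic data, training budget, number of seeds, and the hyperparameter ranges searched are illustrative assumptions, not values taken from the study.

```python
import numpy as np
import tensorflow as tf

def make_model():
    # Placeholder architecture; substitute your own model.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# Synthetic stand-in data; replace with your real data sets.
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 2, 1000)
x_test, y_test = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)

def mean_test_accuracy(make_optimizer, seeds=range(3),
                       batch_size=128, epochs=10):
    # Average test accuracy over several seeded runs under a fixed budget.
    scores = []
    for seed in seeds:
        np.random.seed(seed)
        tf.random.set_seed(seed)
        model = make_model()
        model.compile(optimizer=make_optimizer(),
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(x_train, y_train, batch_size=batch_size,
                  epochs=epochs, verbose=0)
        scores.append(model.evaluate(x_test, y_test, verbose=0)[1])
    return float(np.mean(scores))

# Compare the three strong optimizers, each with default settings.
candidates = {
    "adam": lambda: tf.keras.optimizers.Adam(),
    "rmsprop": lambda: tf.keras.optimizers.RMSprop(),
    "nag": lambda: tf.keras.optimizers.SGD(momentum=0.9, nesterov=True),
}
results = {name: mean_test_accuracy(make) for name, make in candidates.items()}
baseline = max(results, key=results.get)
print("default-setting results:", results, "-> baseline:", baseline)

# Optional: random search over ADAM hyperparameters (shown for ADAM
# as an example; apply it to whichever baseline won above).
best_acc = results["adam"]
best_cfg = {"learning_rate": 1e-3, "beta_1": 0.9}
for _ in range(10):
    cfg = {
        # Learning rate sampled log-uniformly, beta_1 uniformly;
        # these ranges are illustrative assumptions.
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "beta_1": rng.uniform(0.85, 0.95),
    }
    acc = mean_test_accuracy(lambda: tf.keras.optimizers.Adam(**cfg))
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg
print("best ADAM config:", best_cfg, "accuracy:", best_acc)
```

The key design choice is that every candidate (each optimizer and each hyperparameter setting) is scored with the same seeded, budgeted benchmark, so differences in test performance reflect the optimizer rather than lucky randomness or extra training.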