Training data, carefully selected and accurately annotated, is a prerequisite for machine learning. Generally, the more training data the better. However, an ML system can take advantage of this data only if training can be accomplished in a practical amount of time. This has become more challenging as ML systems have grown to millions of parameters and millions of training samples. For example, the development of a large natural language processing system in 2018 was estimated to have required the equivalent of a single fast graphics processing unit running continuously for 27 years.
Larger, faster computer systems have helped make training large ML systems practical. However, careful structuring can make training more efficient, allowing more training to be done with less hardware. One way to do this is to use mini-batch gradient descent, which reduces training time through efficient parameter search and good use of training hardware. Let’s look at how it works.
Speeding Up ML Training Computations
Typically, ML training uses multiple iterations of a two-cycle process: (1) forward propagation, where the ML system processes training samples to see if it gets correct results, and (2) back propagation, which adjusts parameters to get better results. Both forward and back propagation involve lots of calculations: for an ML system with n units in an average layer, the work per layer is roughly proportional to the square of n, for both forward and back propagation (back propagation typically costs a small constant factor more than forward propagation).
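The forward/back propagation cycle can be sketched in a few lines. The following is a minimal illustration (no particular framework; the layer size and learning rate are arbitrary choices) of one cycle for a single fully connected layer: the n-by-n weight matrix is why the work per layer grows with the square of n.

```python
# A minimal sketch of one forward/back propagation cycle for a single
# fully connected layer with n inputs and n outputs. The n x n weight
# matrix dominates the cost of both passes.
import numpy as np

rng = np.random.default_rng(0)
n = 4                        # units in the layer (illustrative size)
W = rng.normal(size=(n, n))  # the n*n weights dominate the computation
b = np.zeros(n)

def forward(x):
    return np.tanh(W @ x + b)             # O(n^2) multiply-adds

def backward(x, y, target, lr=0.01):
    """One gradient step on squared error for a single sample."""
    global W, b
    err = y - target                      # dCost/dy
    dz = err * (1.0 - y ** 2)             # chain rule through tanh
    W -= lr * np.outer(dz, x)             # O(n^2) again
    b -= lr * dz
    return 0.5 * float(err @ err)         # cost before the update

x = rng.normal(size=n)
target = np.zeros(n)
cost_before = backward(x, forward(x), target)
cost_after = 0.5 * float(np.sum((forward(x) - target) ** 2))
```

Running the cycle once adjusts the parameters so that the cost for this sample decreases, which is exactly the 'downhill' step discussed below.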
When a forward and back propagation cycle is performed for each individual training sample, the process is called stochastic gradient descent. This is a workable approach to training, and it can eventually find good parameter sets. However, because vector and matrix processing are particularly fast on modern computer hardware and software, it is usually more efficient to perform the forward and back propagation cycle using multiple training samples all at once.
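The efficiency of processing multiple samples at once comes from stacking samples into a matrix, so that one matrix multiply replaces many separate vector operations. A small check (the shapes here are illustrative):

```python
# Stacking samples as rows of X lets one matrix multiply process a
# whole batch; the result matches running each sample separately.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))        # layer: 5 inputs -> 3 outputs
X = rng.normal(size=(8, 5))        # a batch of 8 samples

batched = X @ W.T                          # one call, all samples
one_by_one = np.stack([W @ x for x in X])  # eight separate calls
```

The two computations are mathematically identical, but the batched form maps far better onto modern vector/matrix hardware.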
If the entire training set is used for each forward/back propagation cycle, it is called batch gradient descent. If subsets of the training set are used, it is called mini-batch gradient descent.
Mini-batch gradient descent is usually the preferred approach, with the mini-batch size chosen so that each batch fits into processor memory. This requirement is often satisfied by making mini-batch sizes a power of two; 32, 64, 128, and 256 are typical mini-batch sizes. Trial runs may be required to select the most efficient size. Both batch and mini-batch gradient descent reduce training computation time by taking advantage of efficient vector/matrix computation, but batch gradient descent becomes inefficient when the training set is too large to fit into processor memory.
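Carving a shuffled training set into fixed-size mini-batches is straightforward. The sketch below (the size-64 batch and array shapes are illustrative choices, not values from any specific system) reshuffles on each pass so that every epoch sees the batches in a different composition:

```python
# A minimal sketch of splitting a shuffled training set into
# power-of-two mini-batches; the last batch holds the remainder.
import numpy as np

def minibatches(X, y, batch_size=64, rng=None):
    """Yield (X_batch, y_batch) pairs, reshuffling for each pass."""
    if rng is None:
        rng = np.random.default_rng()
    order = rng.permutation(len(X))     # fresh shuffle per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

X = np.arange(200, dtype=float).reshape(100, 2)   # 100 samples, 2 features
y = np.zeros(100)
sizes = [len(xb) for xb, _ in minibatches(X, y)]
```

With 100 samples and a batch size of 64, the generator yields one full batch of 64 and a final partial batch of 36.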
But beyond enabling more efficient computation, mini-batch gradient descent offers a second advantage: compared to stochastic gradient descent, it travels a shorter path in its search for a good parameter set.
Shortening the Path to a Good Parameter Set
All types of gradient descent attempt to adjust parameter sets, after each back propagation, toward better ML system performance. Mathematically, these steps go ‘downhill’ in a cost terrain, which is a high dimensional space where ‘altitude’ is a value of cost and ‘location’ is associated with a particular set of parameter values. Cost is a mathematical function defined so that lower cost means better ML system performance.
The figure at right depicts a very simple cost terrain. This ML system has two parameters, with possible values that span the plane of this page. The concentric ellipses represent contours of constant cost value, with higher outer values sloping down to the lowest value, the red dot in the center.
The segmented lines contrast the step-by-step paths taken by batch, mini-batch, and stochastic gradient descent.
- Batch gradient descent takes the most direct path. This is because the ultimate goal of training is to minimize the cost averaged across all the training samples. Because batch gradient descent uses all the training samples during each back propagation step, each parameter adjustment makes progress toward the ultimate goal. Unfortunately, as explained above, the advantage of taking fewer training steps is usually outweighed by decreased computational efficiency.
- Stochastic gradient descent takes the least direct path, because each back propagation step uses only a single training sample. Each parameter adjustment reduces the cost for that one sample, but does not necessarily improve the average for the entire training set. The result is a path that may entail many reversals.
- Mini-batch gradient descent is a big improvement over stochastic gradient descent, because it adjusts parameters to improve the average cost for the mini-batch. The average cost for a mini-batch can be a reasonable estimate of the average for the entire training set, particularly if mini-batches are reshuffled on each pass through the training set. Therefore, while mini-batch gradient descent follows a less direct path than batch gradient descent, its path is much more direct than that of stochastic gradient descent. Combined with its computational advantage over batch gradient descent, this makes it usually the preferred approach.
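The contrast among the three paths can be illustrated numerically. The toy example below (illustrative values, not taken from the figure) runs all three variants on a two-parameter least-squares cost, measuring the total length of each path and the final cost reached:

```python
# A toy comparison of stochastic (batch size 1), mini-batch (size 8),
# and batch (full set) gradient descent on a two-parameter cost.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))        # 64 samples, 2 features
true_w = np.array([2.0, -1.0])
y = X @ true_w                      # noiseless targets

def cost(w):
    return 0.5 * float(np.mean((X @ w - y) ** 2))

def descend(batch_size, lr=0.1, epochs=100, seed=2):
    """Run gradient descent; return (total path length, final cost)."""
    r = np.random.default_rng(seed)
    w = np.zeros(2)
    path_len = 0.0
    for _ in range(epochs):
        order = r.permutation(len(X))          # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            step = lr * grad
            path_len += float(np.linalg.norm(step))  # distance travelled
            w -= step
    return path_len, cost(w)

results = {name: descend(bs) for name, bs in
           [("stochastic", 1), ("mini-batch", 8), ("batch", 64)]}
for name, (length, final) in results.items():
    print(f"{name:10s}  path length {length:6.2f}  final cost {final:.2e}")
```

All three variants reach a low cost on this easy problem, but the batch path is the shortest and the single-sample path wanders the most, matching the figure's qualitative picture.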
The table below summarizes the characteristics of the three types of gradient descent.
| Gradient descent type | Training samples processed together | Computational efficiency | Path to low cost |
| --- | --- | --- | --- |
| Stochastic | One sample at a time | Low without vector/matrix processing | Longest |
| Mini-batch | Subset of the training set | High | Short |
| Batch | Entire training set | Low when the training set exceeds processor memory | Shortest |
Takeaway: Use Mini-Batches to Reduce Training Time
Your comprehensive collection of accurately labeled training data must be complemented by a process that allows training to be performed in a practical amount of time. Mini-batch gradient descent allows you to get the most from your training hardware by taking advantage of efficient vector/matrix processing, and by taking fewer steps to find a good parameter set.