Variance Reduction in Deep Learning: More Momentum is All You Need.
Abstract
Variance reduction (VR) techniques have contributed significantly to accelerating learning with massive datasets in the smooth and strongly convex setting (schmidt2017minimizing; johnson2013accelerating; roux2012stochastic). However, such techniques have not yet met the same success in the realm of large-scale deep learning due to various factors such as the use of data augmentation or regularization methods like dropout (defazio2019ineffectiveness). This challenge has recently motivated the design of novel variance reduction techniques tailored explicitly for deep learning (arnold2019reducing; ma2018quasi). This work is an additional step in this direction. In particular, we exploit the ubiquitous clustering structure of rich datasets used in deep learning to design a family of scalable variance-reduced optimization procedures by combining existing optimizers (e.g., SGD with Momentum, Quasi-Hyperbolic Momentum, Implicit Gradient Transport) with a multi-momentum strategy (yuan2019cover). Our proposal leads to faster convergence than vanilla methods on standard benchmark datasets (e.g., CIFAR and ImageNet). It is robust to label noise and amenable to distributed optimization. We provide a parallel implementation in JAX.
1 Introduction
Given a data-generating distribution $\mathcal{D}$, a parameterized function $f_w$ (e.g., a deep network) and a loss function $\ell$, we consider the traditional risk minimization problem representative of most settings in ML (shalev2014understanding):

$$\min_{w \in \mathbb{R}^d} Q(w) := \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(f_w(x))\right] \qquad (1)$$
The celebrated optimization algorithm for solving this problem in large-scale machine learning is Stochastic Gradient Descent (SGD) (Robbins2007ASA; bottou2018optimization). In favorable cases (e.g., smooth and strongly convex functions), SGD converges to a good solution by iteratively applying the first-order update rule $w_{t+1} = w_t - \mu\,\nabla Q(w_t; x_t)$, where $x_t$ is an instance drawn from $\mathcal{D}$, $\mu$ is the learning rate and $\nabla Q(w_t; x_t)$ is the (approximate) gradient. SGD enjoys several desirable properties. Indeed, it is fast: when one has access to the actual gradient, its convergence rate is $O(\rho^t)$ for some $\rho \in (0,1)$. Also, it is memory efficient, robust, and amenable to distributed optimization (bottou2018optimization; nemirovski2009robust; dean2012large). However, we generally do not know the data-generating distribution $\mathcal{D}$; consequently, we do not have access to the actual gradient. Instead, we resort to empirical risk minimization and rely on a stochastic approximation of the gradient at a given example. Unfortunately, the gradient noise due to using the approximate instead of the exact gradient slows down the convergence speed (which becomes $O(1/t)$ instead of $O(\rho^t)$) (bottou2018optimization). It also makes the algorithm harder to tune. Variance Reduction (VR) techniques allow us to mitigate the impact of using noisy gradients in stochastic optimization (roux2012stochastic; defazio2014saga; johnson2013accelerating; mairal2013optimization).
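As a toy illustration of the SGD update rule above, here is a minimal Python sketch on a one-dimensional least-squares problem (the function names and the toy loss are ours, for illustration only):

```python
def sgd_step(w, grad_fn, x, lr=0.1):
    """One SGD step: follow the (possibly noisy) gradient at example x."""
    return w - lr * grad_fn(w, x)

# Toy problem: minimize 0.5 * (w * x - 1)^2 for a fixed example x = 1;
# its gradient with respect to w is (w * x - 1) * x.
grad = lambda w, x: (w * x - 1.0) * x
w = 0.0
for _ in range(100):
    w = sgd_step(w, grad, 1.0, lr=0.5)
# w approaches the minimizer w* = 1
```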
Variance reduction methods have been successfully applied to convex learning problems (roux2012stochastic; defazio2014saga; shalev2013stochastic; johnson2013accelerating). However, this success does not readily translate to large-scale non-convex settings such as deep learning, due to colossal memory requirements (roux2012stochastic; defazio2014saga; shalev2013stochastic) and the use of normalization and data augmentation procedures (defazio2019ineffectiveness). This has motivated the design of novel VR strategies that are better suited for deep learning (cutkosky2019momentum; arnold2019reducing; ma2018quasi). In the next section, we present in more detail Implicit Gradient Transport (arnold2019reducing) and Quasi-Hyperbolic Momentum (ma2018quasi), two instances of successful application of VR to deep learning.
Large-scale datasets used in deep learning problems come with a rich clustering structure (deng2009imagenet). For example, the data can result from the aggregation of several datasets collected by different individuals or organizations (e.g., this is typical in the federated learning setting (li2019federated)). When there is an underlying clustering structure in the data, the variance due to gradient noise decomposes into an in-cluster variance and a between-cluster variance. yuan2019cover exploit this insight to reduce the between-cluster variance in the convex setting by considering every data point as a separate cluster. Unfortunately, such an approach requires storing as many models as data points and therefore has prohibitive memory requirements. Also, it does not consider the use of mini-batches during training. Consequently, it is not directly applicable to deep learning. In this work, we make the following contributions:

We leverage the idea of using multiple momentum terms and introduce a family of variance-reduced optimizers for deep learning. These new approaches (termed Discover, for Deep SCalable Online Variance Reduction) improve upon widely used methods such as SGD with Momentum, IGT, and QHM. They are theoretically well-motivated and can exploit the clustering structures ubiquitous in deep learning.

Using simple clustering structures, we empirically validate that Discover optimizers lead to faster convergence and sometimes (significantly) improved generalization on standard benchmark datasets. We provide parallel implementations of our algorithms in JAX (https://github.com/google/jax).
In the next section, we provide a brief overview of variance reduction methods with an emphasis on Implicit Gradient Transport (arnold2019reducing) and Quasi-Hyperbolic Momentum (ma2018quasi), two VR optimizers specifically designed for deep learning. We then discuss (Section 3) the clustering structure present in most deep learning problems and how we can leverage it to design scalable variance reduction strategies. The experiments section demonstrates our proposals' effectiveness and provides several insights regarding their behavior.
2 Variance Reduction in Deep Learning
We consider Variance Reduction (VR) in the realm of large-scale deep learning (Goodfellowetal2016), i.e., when we are in the presence of significant amounts of (streaming) data and the parameterized function is a massive deep neural network. Most existing approaches do not naturally scale to this setting. Indeed, dual methods for VR, such as SDCA (shalev2013stochastic), have prohibitive memory requirements due to the necessity of storing large dual iterates. Primal methods akin to SAGA (roux2012stochastic) are also memory inefficient because they need to store past gradients. Other approaches like SARAH (nguyen2017sarah) are computationally demanding; they entail a full snapshot gradient evaluation at each step and two mini-batch evaluations. Some recent approaches, such as stochastic MISO (bietti2017stochastic), can handle infinite data in theory; however, their memory requirement scales linearly with the number of examples. In addition to these limitations, Defazio & Bottou (defazio2019ineffectiveness) have shown that several other factors such as data augmentation (krizhevsky2012imagenet), batch normalization (ioffe2015batch), or dropout (srivastava2014dropout) impede the success of variance reduction methods in deep learning. Despite these negative results, some recent approaches such as Quasi-Hyperbolic Momentum (ma2018quasi) and Implicit Gradient Transport with tail averaging (IGT) (arnold2019reducing) have shown promising variance reduction results in the context of large-scale deep learning. In the sequel, we present them in more detail. We first present the Momentum or Heavy Ball method, since the above approaches and our proposal rely on it.
The Momentum or Heavy Ball (polyak1964some; sutskever2013importance) method uses the following update rule:

$$m_{t+1} = \beta m_t + \nabla Q(w_t; x_t), \qquad w_{t+1} = w_t - \mu\, m_{t+1},$$

where $m_t$ is called the momentum buffer and $\beta$ balances the buffer and the approximate gradient computed at iteration $t$. SGD with Momentum is widely used in deep learning due to its simplicity and its robustness to variations of hyperparameters (sutskever2013importance; zhang2017yellowfin). It is an effective way of alleviating slow convergence due to curvature (goh2017momentum) and has recently been shown to also perform variance reduction (roux2018online).
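For concreteness, the Heavy Ball update can be sketched in a few lines of Python (a scalar toy setting with our own naming, not the paper's implementation):

```python
def momentum_step(w, m, g, lr=0.1, beta=0.9):
    """Heavy Ball: the buffer m accumulates past gradients, and the
    parameters follow the buffer rather than the raw gradient g."""
    m = beta * m + g      # update the momentum buffer
    w = w - lr * m        # take a step along the buffer
    return w, m
```

On a toy quadratic (gradient g = w), iterating this update drives w to the minimizer 0.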
The Quasi-Hyperbolic Momentum (QHM) (ma2018quasi) method uses the following update rule:

$$m_{t+1} = \beta m_t + (1-\beta)\,\nabla Q(w_t; x_t), \qquad w_{t+1} = w_t - \mu\left[(1-\nu)\,\nabla Q(w_t; x_t) + \nu\, m_{t+1}\right].$$

QHM extends Momentum by using in the update a combination of the approximate gradient and the momentum buffer. When $\nu = 1$ (resp. $\nu = 0$), it reduces to a normalized variant of Momentum (resp. SGD). QHM inherits the efficiency of Momentum and also performs VR. In addition, it alleviates the potential staleness of the momentum buffer (when choosing $\nu < 1$).
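A corresponding Python sketch of the QHM step (again a scalar toy version under our own naming):

```python
def qhm_step(w, m, g, lr=0.1, beta=0.9, nu=0.7):
    """Quasi-Hyperbolic Momentum: step along a convex combination of the
    raw gradient g and the normalized momentum buffer m."""
    m = beta * m + (1.0 - beta) * g
    w = w - lr * ((1.0 - nu) * g + nu * m)
    return w, m
```

With `nu=0` the step coincides with plain SGD; with `nu=1` it follows the buffer only.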
The Implicit Gradient Transport (IGT) (arnold2019reducing) method uses the following update rule:

$$g_t = \gamma_t\, g_{t-1} + (1-\gamma_t)\,\nabla Q\!\left(w_t + \tfrac{\gamma_t}{1-\gamma_t}\,(w_t - w_{t-1});\, x_t\right), \quad \gamma_t = \tfrac{t}{t+1}, \qquad v_t = \beta v_{t-1} - \mu\, g_t, \quad w_{t+1} = w_t + v_t.$$

IGT effectively achieves variance reduction by combining two strategies: maintaining a buffer of past gradients (using iterative tail averaging) and transporting them to equivalent gradients at the current point. The resulting update rule can also be combined with Momentum, as shown above.
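The transport-plus-tail-averaging mechanism can be sketched as follows (a scalar Python illustration under our own naming; `gamma = t / (t + 1)` is the tail-averaging weight):

```python
def igt_step(w, w_prev, g_buf, v, grad_fn, x, t, lr=0.1, beta=0.9):
    """Heavy-ball IGT sketch: evaluate the gradient at an extrapolated
    point and fold it into the tail-averaged buffer g_buf."""
    gamma = t / (t + 1.0)
    # Transported evaluation point: w + (gamma / (1 - gamma)) * (w - w_prev),
    # which equals w + t * (w - w_prev).
    w_shift = w + (gamma / (1.0 - gamma)) * (w - w_prev)
    g_buf = gamma * g_buf + (1.0 - gamma) * grad_fn(w_shift, x)
    v = beta * v - lr * g_buf  # heavy-ball velocity
    return w + v, w, g_buf, v
```

On a quadratic, the tail-averaged buffer equals the exact gradient at the current iterate, which is the key identity IGT exploits.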
3 Exploiting Clusters in Variance Reduction
The Ubiquitous Clustering Structure
Large-scale machine learning datasets come with a rich clustering structure that arises in different ways depending on the setting. For example, when training deep neural networks, we combine various data augmentation strategies, resulting in a mixture with clusters defined by the transformations (zhang2017mixup; yun2019cutmix; krizhevsky2012imagenet). In multi-class classification problems, one can consider each class as a separate cluster. Therefore, the overall dataset becomes a mixture defined by the classes. In all these cases, the data is not independent and identically distributed across clusters. Indeed, different data augmentation strategies lead to different training data distributions. To make the clustering structure apparent, we can rewrite the minimization problem in equation 1 as a combination of risks on the different clusters. To this end, we denote by $\mathcal{D}_c$ the data distribution corresponding to the $c$-th cluster and by $\pi_c$ the probability with which we sample from that cluster, with $\sum_{c=1}^{C} \pi_c = 1$. We also denote by $x_c$ the realization of a data point belonging to cluster $c$ for clarity. The risk minimization problem 1 becomes:

$$\min_{w \in \mathbb{R}^d} Q(w) = \sum_{c=1}^{C} \pi_c\, \mathbb{E}_{x_c \sim \mathcal{D}_c}\left[\ell(f_w(x_c))\right] \qquad (2)$$
While one can pool all the data to apply traditional empirical risk minimization 1, this would ignore valuable prior clustering information. In the next section, we show how, when solving the equivalent problem 2, we can leverage the clustering information to speed up the learning procedure by reducing the variance due to gradient noise. We start with stochastic gradient descent and present a decomposition of this variance that takes the clustering structure into account.
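To make the mixture view concrete, here is a small NumPy sketch of sampling from a clustered data distribution (the two-cluster Gaussian setup and all names are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-cluster mixture: cluster probabilities pi_c and
# per-cluster data distributions D_c (Gaussians with different means).
pi = np.array([0.3, 0.7])
means = np.array([-2.0, 2.0])

def sample():
    """Draw (c, x): first a cluster c ~ pi, then an example x ~ D_c."""
    c = rng.choice(2, p=pi)
    return c, rng.normal(means[c], 1.0)

# The overall risk is then the pi-weighted sum of per-cluster risks, as in Eq. 2.
```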
Gradient noise for SGD with clusters
Here, we re-derive the variance of the gradient noise for SGD in the presence of clustered data (sayed2014adaptive; yuan2019cover). To this end, we introduce the filtration $\mathcal{F}_t$ and denote $\tilde{w}_t = w^\star - w_t$ for convenience. We assume the gradient of the loss is Lipschitz with respect to $w$ and the clustered risk is strongly convex; that is, for any $w_1, w_2$ it holds:

$$\|\nabla Q(w_1) - \nabla Q(w_2)\| \le \delta\,\|w_1 - w_2\|, \qquad \langle \nabla Q(w_1) - \nabla Q(w_2),\, w_1 - w_2\rangle \ge \nu\,\|w_1 - w_2\|^2.$$
For a given example $x$, the update rule for stochastic gradient descent is $w_{t+1} = w_t - \mu\,\nabla Q(w_t; x)$. The gradient noise resulting from this update depends both on the probability distribution on the clusters and on the data distribution of the considered cluster. For a given example, the gradient noise writes $s_t(w) = \nabla Q(w; x) - \nabla Q(w)$, and the gradient noise within cluster $c$ is $s_t^c(w) = \nabla Q(w; x_c) - \nabla Q_c(w)$. The following result bounds the first and second moments of the within-cluster gradient noise for SGD.
Lemma 1
The first and second order moments of the within-cluster gradient noise satisfy $\mathbb{E}\left[s_t^c(w) \mid \mathcal{F}_{t-1}\right] = 0$ and $\mathbb{E}\left[\|s_t^c(w)\|^2 \mid \mathcal{F}_{t-1}\right] \le \beta_c^2\,\|\tilde{w}_{t-1}\|^2 + \sigma_c^2$, where $\beta_c^2$ and $\sigma_c^2$ are nonnegative constants determined by the cluster distribution and the Lipschitz constant.
We can now decompose the variance of the gradient noise into a sum of within-cluster and between-cluster variance.
Lemma 2
Assuming the gradient is unbiased and the variance of the gradient noise within each cluster is bounded as shown in Lemma 1, the following inequality holds:

$$\mathbb{E}\left[\|s_t(w)\|^2 \mid \mathcal{F}_{t-1}\right] \le \sum_{c=1}^{C} \pi_c\, \sigma_c^2 + \sum_{c=1}^{C} \pi_c\, \left\|\nabla Q_c(w) - \nabla Q(w)\right\|^2 \qquad (3)$$
Lemma 2 captures the structure of the problem. The LHS represents the variance of the gradient noise. The first term of the RHS is the within-cluster variance $\sigma_{\mathrm{in}}^2$, and the second term is the between-cluster variance $\sigma_{\mathrm{btw}}^2$. In the limit, the mean square deviation at steady state depends on the in-cluster and between-cluster variances as $O\!\left(\mu\,(\sigma_{\mathrm{in}}^2 + \sigma_{\mathrm{btw}}^2)\right)$. Therefore, when the clustering structure is known and $\sigma_{\mathrm{btw}}^2 \gg \sigma_{\mathrm{in}}^2$, which is our working assumption, we can significantly reduce the overall variance by reducing the between-cluster variance. In the next section, we present an update rule exploiting this fact and generalizing the approach presented in (yuan2019cover) to the mini-batch setting (it is worth noting that yuan2019cover assumed the results in Lemma 1 and stated Lemma 2 without proof; we provide full proofs in the appendix). We also prove its gradient noise properties and convergence (the proof assumes a smooth and strongly convex setting; though we consider non-convex problems in the experiments, this assumption may be valid locally in a basin of attraction of the loss landscape of a deep neural network).
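The within/between decomposition is an instance of the law of total variance, which a short NumPy computation illustrates (the three-cluster numbers are made up for the demonstration):

```python
import numpy as np

# Toy 1-D "gradient" distribution with three clusters.
pi = np.array([0.2, 0.3, 0.5])              # cluster probabilities
cluster_means = np.array([1.0, -1.0, 3.0])  # per-cluster mean gradients
cluster_stds = np.array([0.1, 0.2, 0.1])    # in-cluster noise levels

overall_mean = pi @ cluster_means
within = pi @ cluster_stds**2                     # in-cluster variance
between = pi @ (cluster_means - overall_mean)**2  # between-cluster variance
total = within + between  # law of total variance: Var[g] = E[Var[g|c]] + Var[E[g|c]]
# Here the between-cluster term dominates, the regime in which reducing it pays off.
```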
4 Discover Algorithms
To reiterate, we assume we know the clustering structure and the probability $\pi_c$ of observing data from a given cluster $c$, such that $\sum_{c=1}^{C} \pi_c = 1$. As we will show in the experiments, this is a realistic assumption, and straightforward design choices such as using labels or data augmentation strategies as clusters lead to improved results. We do not have access to the data distribution $\mathcal{D}_c$ of a given cluster $c$. We consider a learning setting where at each round $t$, we observe a batch $B_t$ of $b$ examples coming from the different groups and sampled according to the mixture distribution induced by the clustering structure. We propose to achieve between-cluster variance reduction in mini-batch stochastic gradient descent by recursively applying the following update rule:

$$w_{t+1} = w_t - \frac{\mu}{b} \sum_{x_c \in B_t} \left( \nabla Q(w_t; x_c) - g_c^{(t)} + \sum_{j=1}^{C} \pi_j\, g_j^{(t)} \right), \qquad g_c^{(t+1)} = (1-\theta)\, g_c^{(t)} + \theta\, \nabla Q(w_t; x_c) \qquad (4)$$
The update rule stems from the traditional use of control variates for variance reduction (Fish96), with the additional trick of using one control variate for each cluster, given that the clustering structure is known. In Equation 4, each $g_c$ is an approximation of the actual gradient $\nabla Q_c$ of the cluster to which the example belongs, and $\bar{g} = \sum_{c} \pi_c\, g_c$ is the average cluster gradient. The gradient noise resulting from the update rule 4 depends both on the probability distribution on the clusters and the data distribution of each considered cluster. It can be written as follows:

$$s_t(w) = \frac{1}{b} \sum_{x_c \in B_t} \left( \nabla Q(w; x_c) - g_c + \bar{g} \right) - \nabla Q(w).$$
In the sequel, we show how a recursive application of the above update rule ultimately leads to a reduced between-cluster variance of the gradient noise. Before that, we first state a lemma exposing the properties of the gradient noise $s_t(w)$. Similarly to Lemma 2, this result highlights how the variance of the gradient noise decomposes into an in-cluster and a between-cluster variance. We provide a proof of the lemma in appendix Section 9.1.
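A minimal NumPy sketch of one step of the update rule 4, as described above (scalar parameters, per-cluster scalar buffers, and buffer mixing rate theta; the function shape is our own, not the paper's implementation):

```python
import numpy as np

def discover_step(w, bufs, pi, grads, clusters, lr=0.1, theta=0.1):
    """One Discover step: bufs[c] approximates the gradient of cluster c,
    pi holds the cluster probabilities, and grads/clusters are the
    per-example gradients and cluster ids of the current mini-batch."""
    g_bar = pi @ bufs  # average cluster gradient
    # Control-variate corrected mini-batch gradient.
    corrected = np.mean([g - bufs[c] + g_bar for g, c in zip(grads, clusters)])
    w = w - lr * corrected
    for g, c in zip(grads, clusters):  # refresh the observed cluster buffers
        bufs[c] = (1.0 - theta) * bufs[c] + theta * g
    return w, bufs
```

With all buffers at zero, the corrected gradient collapses to the plain mini-batch gradient; as each `bufs[c]` tracks its cluster gradient, the between-cluster component of the noise is cancelled.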
Lemma 3
(gradient noise properties) Under the same assumptions as in Section 3, for a batch of size $b$ the gradient is unbiased: $\mathbb{E}\left[s_t(w) \mid \mathcal{F}_{t-1}\right] = 0$. Moreover, the variance of the gradient noise is bounded as:

$$\mathbb{E}\left[\|s_t(w)\|^2 \mid \mathcal{F}_{t-1}\right] \le \frac{\beta^2}{b}\,\|\tilde{w}_{t-1}\|^2 + \frac{1}{b}\sum_{c=1}^{C} \pi_c\,\sigma_c^2 + \frac{1}{b}\sum_{c=1}^{C} \pi_c\,\left\|g_c^{(t)} - \nabla Q_c(w)\right\|^2 \qquad (5)$$
Remark
The second and the last terms of the right-hand side of this bound are respectively the within-cluster and the between-cluster variance. If each approximate cluster gradient converges to the actual one, $g_c \to \nabla Q_c(w)$, the between-cluster variance vanishes. It is worth noting that when we use the same approximate gradient buffer $g$ for all the clusters, the update rule 4 reduces to that of SGD with Momentum (polyak1964some; goh2017momentum). Consequently, the between-cluster variance term in the above bound becomes $\|g^{(t)} - \nabla Q(w)\|^2$ and may vanish as $g^{(t)} \to \nabla Q(w)$. Therefore, Momentum can also perform between-cluster variance reduction and can be seen as a special case of the method proposed here, albeit operating at a coarser level (maintaining one general approximate cluster gradient buffer instead of one for each cluster). Momentum has mainly been considered a method for fighting curvature (goh2017momentum). Recent work has shown its variance reduction capabilities in specific cases (roux2018online). We show in our experiments that Momentum indeed performs between-cluster variance reduction. We now show that recursive applications of the update rule 4 indeed lead to vanishing between-cluster variance. The full proof of Theorem 1 is provided in appendix Section 9.3.
Theorem 1
Under the same assumptions as in Section 3, we denote by $b$ the batch size and by $\sigma_{\mathrm{in}}^2$ the in-cluster variance. For any sufficiently small step size $\mu$, the iterates of Discover (Algorithm 1) converge in expectation to the solution $w^\star$ with a contraction factor $\lambda \in (0, 1)$; it holds:
(6) 
(7) 
Theorem 1 shows that when the step size is small, Discover eliminates the between-cluster variance in the limit, hence reducing the overall variance of the gradient noise to the in-cluster variance. Therefore, when the latter is significantly smaller than the between-cluster variance, Discover can be an effective variance reduction strategy. We show in the experiments section that when the clustering structure to exploit is carefully chosen, Discover leads to faster convergence and sometimes results in improved generalization thanks to the cluster information. We now show how to leverage the idea of using multiple momentum terms based on a given clustering to improve Implicit Gradient Transport (arnold2019reducing) and Quasi-Hyperbolic Momentum (ma2018quasi), both of which rely on a single momentum term in their vanilla versions. The update rules for these extensions (respectively called Discover-IGT and Discover-QHM) follow. Our experiments will demonstrate the effectiveness of these methods.
The Discover-IGT (D-IGT) update rule applies the IGT transport and tail-averaging steps while maintaining one momentum buffer per cluster; the Discover-QHM (D-QHM) method similarly blends the raw mini-batch gradient with the multi-momentum corrected gradient. In both cases, $B_t$ is the batch at iteration $t$, $C_t$ denotes the set of cluster indexes present in the batch $B_t$, and the buffers $g_c$ are updated in parallel for $c \in C_t$. D-IGT combines gradient transport and the multi-momentum strategy. D-QHM extends QHM and Discover by using in the update a combination of the approximate gradient and the momentum buffer, together with a multi-momentum approach similar to Discover. When $\nu = 1$, it reduces to Discover; with a single cluster, it reduces to QHM; with a single cluster and $\nu = 1$, it reduces to Momentum. Choosing $\nu < 1$ alleviates the potential staleness of the buffer.
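Combining the earlier sketches, a D-QHM step might look as follows (an illustrative NumPy sketch under our own naming, blending the raw mini-batch gradient with the multi-momentum corrected one; not the paper's actual implementation):

```python
import numpy as np

def dqhm_step(w, bufs, pi, grads, clusters, lr=0.1, nu=0.7, theta=0.1):
    """Discover-QHM sketch: QHM-style blend of the raw mini-batch gradient
    and the per-cluster control-variate corrected gradient."""
    g_bar = pi @ bufs
    raw = np.mean(grads)
    corrected = np.mean([g - bufs[c] + g_bar for g, c in zip(grads, clusters)])
    w = w - lr * ((1.0 - nu) * raw + nu * corrected)
    for g, c in zip(grads, clusters):
        bufs[c] = (1.0 - theta) * bufs[c] + theta * g
    return w, bufs
```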
5 Experiments
We validate that the Discover algorithms, coupled with a careful clustering structure, lead to faster convergence compared to their vanilla counterparts (SGD with Momentum, IGT, and QHM; we restrict QHM so that it does not reduce to SGD or SGD with Momentum). We provide additional insights and show that the runtime is comparable to competing methods thanks to a careful parallel implementation.
We implement all the methods used in our experiments using JAX (jax2018github) and FLAX (flax2020github) and perform our experiments on Google Cloud TPU (google_tpu) devices. We compare Discover with the widely used mini-batch SGD with Momentum and Adam (kingma2014adam), as well as with the more recently introduced IGT and QHM, on the ImageNet (deng2009imagenet) and CIFAR-10 (krizhevsky2009learning) datasets. Discover is amenable to parallelization. For example, at each round, we can update the cluster gradients in parallel, as shown in Algorithm 1. Our implementation exploits this feature and groups clusters by cores to enable parallel updates; that is, the examples on each core belong to the same cluster. In this section, we treat $\theta$ as an independent hyperparameter instead of using an exact equation, to simplify the parallel implementation. We provide more details about the practical implementation in appendix Section 7.1. We confirm that using the same strategy does not make a practical difference for all the other optimizers we consider. For each optimizer, we select the hyperparameters either as suggested by its authors or by running a sweep to obtain the highest accuracy at the end of training across five random seeds. The details of the training setup, HP search, and the best values found for each optimizer are given in appendix Section 7.
ImageNet: Data augmentation methods as clusters
We first consider image classification and train a ResNet-v1-50 (He_2016_CVPR) model on the ImageNet (deng2009imagenet; ILSVRC15) dataset for 31200 steps (about 100 epochs). We use a cosine learning rate schedule with a batch size of 4096, weight decay regularization, group normalization (Wu_2018_ECCV), and weight standardization (qiao2019weight). We use a standard preprocessing pipeline consisting of cropping (googlenet) with size 224x224, pixel value scaling, and random horizontal flipping.
We train our ResNet-50 using three data augmentation methods: random flipping, Mixup (zhang2017mixup), and CutMix (Yun_2019_ICCV). For each image, we apply all transformations separately, producing three differently augmented examples (as shown in Figure 5). Consequently, each of the augmentation methods induces a different cluster, and the probability of each is $1/3$. We compare the multi-momentum strategies (Discover, Discover-IGT, and Discover-QHM), exploiting this clustering information, with their vanilla counterparts (SGD with Momentum, IGT, and QHM). For the sake of completeness, we also add Adam to the comparison as a baseline. Figure 1 shows the results for the different methods. Using all these data augmentation methods can make the learning problem more difficult because the resulting cluster data distributions can differ significantly. In this setting, we observe that the Discover variants initially converge faster than their corresponding vanilla optimizers. Discover and Discover-IGT reach a final performance similar to SGD with Momentum and IGT, respectively, while Discover-QHM outperforms the vanilla QHM.
CIFAR: Classes as clusters.
Next, we consider CIFAR-10 (krizhevsky2009learning) classification and use the classes as the clusters, therefore having ten different clusters. We train a WideResNet-26-10 (zagoruyko2016wide) on CIFAR-10 for 400 epochs using a cosine learning rate schedule, a batch size of 256, group normalization, L2 regularization of 0.0005, and a dropout rate of 0.3. The preprocessing consists of 4-pixel zero-padding, a random crop of size 32x32, scaling of the pixel values, and a random horizontal flip.
Figure 2 shows the results for the different optimizers. CIFAR-10 is an easy task, and all the methods eventually converge to high accuracy. IGT converges faster than Discover-IGT and reaches a higher final accuracy. Discover (resp. Discover-QHM) converges similarly to SGD with Momentum (resp. QHM). These single-momentum strategies also reach a similar (for QHM) or higher (for Momentum) final performance. To highlight the importance of carefully choosing the clustering structure, we also performed an experiment on CIFAR-10 where we assigned each point to 1 of 10 clusters uniformly at random and trained the model using Discover with this clustering. The resulting test accuracy is substantially lower than when using classes as clusters. It is therefore essential to use a good clustering structure.
Learning with noisy labels
We now consider the more challenging ImageNet and CIFAR-10 classification tasks in the presence of label noise: at each round, the label of the considered example is flipped with probability $p$ (the noise level) and assigned to a different class selected uniformly at random among the other classes. In this setting, the variance of the gradient noise is increased due to the presence of label noise (see appendix Section 8). We compare Discover, Discover-QHM, and Discover-IGT with their vanilla counterparts (SGD with Momentum, QHM, IGT) and Adam. We use data augmentation methods as clusters for ImageNet and classes as clusters for CIFAR-10. We use the same setting as in the previous ImageNet and CIFAR-10 experiments. Still, we perform a hyperparameter (HP) search de novo to determine the best learning rate for each method in the presence of noisy labels. The details of the HP search and the best values found for each optimizer are given in appendix Section 7.
When the label noise level is low, all the methods achieve high accuracy on the non-corrupted training examples and the test examples, with only a slight deterioration compared to the clean setting (exact results are given in appendix Section 7.5). Figure 3 and Figure 4 show the results of this experiment in the high-noise setting on ImageNet and CIFAR-10, respectively. On CIFAR-10, the multi-momentum optimizers' loss curves are almost superimposed with the curves of SGD with Momentum, QHM, and IGT. However, the former generalize significantly better at each time step and reach a higher final performance than all their single-momentum counterparts. It is worth noting that Discover outperforms all the multi-momentum optimizers by a large margin on validation accuracy. The superiority of multi-momentum optimizers over the single-momentum ones translates to ImageNet, suggesting that our Discover algorithms find better local optima even in challenging settings. The other interesting finding is that while Adam converges faster in training loss, it generalizes poorly.
| Optimizer | ImageNet (64 TPU-v3) | CIFAR-10 (4 TPU-v2) |
| --- | --- | --- |
| SGD | | |
| Adam | | |
| Momentum | | |
| IGT | | |
| QHM | | |
| Discover | | |
| Discover-QHM | | |
| Discover-IGT | | |
Impact of clustering structure.
We also performed an experiment on CIFAR-10 where we compare three clustering structures: (1) classes as clusters, the setting we have considered so far for CIFAR-10; (2) transformations as clusters; (3) random clusters, assigning each point to 1 of 10 clusters uniformly at random. In each case, we train the model using Discover and the chosen clustering. Selecting the clusters uniformly at random yields the lowest test accuracy; using transformations as clusters improves it; and the best accuracy with Discover on CIFAR-10 is achieved when using classes as clusters. However, using this clustering structure comes at the cost of a higher memory footprint.
BetweenCluster Variance Reduction
In this experiment, we assess the between-cluster variance reduction capabilities of Discover compared to SGD and SGD with Momentum on CIFAR-10 (with classes as clusters), both in the clean and noisy settings. We measure at each step of the optimization an approximation of the between-cluster variance, obtained by substituting the momentum buffers $g_c$ for the exact cluster gradients in the between-cluster variance formula. For SGD we use Equation 3; for Discover and SGD with Momentum, Equation 5 (where, for SGD with Momentum, one global gradient buffer is used for all the clusters). The resulting measurements show that in both the clean and the noisy setting, Discover reduces the between-cluster variance faster. They also highlight the between-cluster variance reduction capabilities of SGD with Momentum, which, though not as fast as Discover, also leads to quickly vanishing variance. This explains the good performance of Momentum in our previous experiments.
6 Conclusion and Perspectives
We introduced a set of scalable VR algorithms exploiting the ubiquitous clustering structure in the data and relying on a multi-momentum strategy. We demonstrated that simple choices of clustering structure (e.g., transformations) can lead to improved results. Our framework gives the designer significant flexibility, allowing them to leverage prior information to select a good clustering structure. The experiments demonstrated that the proposed strategy often leads to faster convergence and sometimes to improved generalization, especially in challenging settings (e.g., label noise).
Table 1 shows that there are only minor runtime differences between the multi-momentum strategies and their vanilla counterparts, thanks to our efficient implementation that exploits the parallel structure of the algorithms.
7 Appendix A: Experimental details
7.1 Implementation
All models, training and evaluation pipelines are implemented using JAX (jax2018github) and FLAX (flax2020github). The data loading and preprocessing are implemented in TensorFlow v2 (tensorflow2015whitepaper) and TFDS (TFDS). The SGD, Momentum and Adam optimizers are taken from the FLAX libraries, QHM is implemented following (ma2018quasi), and IGT is written in FLAX following the example at https://github.com/google-research/google-research/tree/master/igt_optimizer (arnold2019reducing). The Discover optimizers are implemented with single-program multiple-data training in mind. Our implementation expects the training data processed during a single training step by a given device core to contain examples from the same cluster. After shuffling the dataset, our input pipeline selects the next examples from the same randomly picked cluster to satisfy this condition. This translates into a single cluster per batch in one-host, one-core training (e.g., a workstation with one CPU), but easily scales to more clusters by using multiple cores even within a single-host setting. For example, the CIFAR experiments use one host machine with 4 Google Cloud TPU-v2 (google_tpu) devices, thus having 8 cores in total, so each batch can contain examples from up to 8 different clusters. The Discover optimizer averages across devices to perform identical updates locally on each device and performs local updates in parallel, synchronizing across devices at the end of each training step (see Algorithm 1 for notation). This averaging makes the effective $\theta$ value depend on the distributed training setup, and to account for it we chose to tune $\theta$ as a direct hyperparameter instead of trying to incorporate the training setup into the formula from Algorithm 1. This implementation is well suited to a good balance of clusters within a single global batch, which is the usual case for a small number of clusters and a randomly shuffled dataset.
We have not tested the performance when the cluster distribution in a global batch is significantly skewed.
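The cluster-per-core batching constraint described above can be sketched as follows (a NumPy illustration; the function and its arguments are our own invention, not the actual pipeline code):

```python
import numpy as np

def shard_by_cluster(examples, clusters, num_cores, per_core, seed=0):
    """Build a global batch in which each core's shard contains examples
    from a single, randomly picked cluster."""
    rng = np.random.default_rng(seed)
    shards, ids = [], []
    for _ in range(num_cores):
        c = rng.choice(np.unique(clusters))   # pick a cluster for this core
        pool = np.flatnonzero(clusters == c)  # indices of that cluster's examples
        take = rng.choice(pool, size=per_core, replace=True)
        shards.append(examples[take])
        ids.append(int(c))
    return np.stack(shards), ids  # shape [num_cores, per_core, ...]
```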
7.2 Datasets
The datasets used in this work are ImageNet1000 (deng2009imagenet; ILSVRC15) and CIFAR10 (krizhevsky2009learning). The datasets were downloaded from https://www.tensorflow.org/datasets/catalog/imagenet2012 and https://www.tensorflow.org/datasets/catalog/cifar10 respectively. The detailed information about the size and format of the train and validation splits are available at the corresponding web pages.
7.3 ImageNet (augmentations as clusters)
We trained a ResNet-v1-50 (He_2016_CVPR) model on the ImageNet (deng2009imagenet; ILSVRC15) dataset for 31200 steps, corresponding to 100 epochs on the unaugmented dataset. We used:

Cosine learning rate schedule with 5 warmup steps.

Batch size of 4096.

Weight decay regularization. We used decoupled weight decay inspired by (loshchilov2018decoupled): at each step we update all network parameters by $w \leftarrow w - \mu\,\lambda\, w$, where $\mu$ is the learning rate and $\lambda$ the weight decay coefficient.

Group normalization with 32 groups (Wu_2018_ECCV).

Weight standardization.
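The decoupled weight decay step from the list above amounts to a direct multiplicative shrinkage of the weights, sketched below (the `wd` value is an assumed placeholder, not the value used in the paper's sweep):

```python
def decoupled_weight_decay(w, lr, wd=1e-4):
    """Shrink the weights directly, separately from the gradient step,
    in the style of loshchilov2018decoupled; wd is an assumed value."""
    return w * (1.0 - lr * wd)
```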
The preprocessing consists of Inception-style cropping (googlenet) with size 224x224, scaling of the pixel values, and a random horizontal flip. In this setting, the model achieves 0.76 accuracy with the Momentum (polyak1964some; sutskever2013importance) optimizer. We extend the setting to use Mixup (zhang2017mixup) and CutMix (Yun_2019_ICCV) augmentations in addition to random flipping, choosing the same mixing ratio for all examples in the single local batch of each distributed training host. We correspondingly assign all examples augmented with a random horizontal flip to cluster 0, with Mixup to cluster 1, and with CutMix to cluster 2. The clusters are used for the Discover optimizer.
The hyperparameters are selected from a hyperparameter sweep to obtain the highest mean accuracy at 31200 steps across 5 random seeds. The learning rate is chosen from the set {0.03, 0.1, 0.3} for all optimizers. The remaining hyperparameters are selected from the following sets:

Discover:

Momentum:

QHM: (as reported default for ImageNet in (ma2018quasi))

IGT: (as reported to be selected for ImageNet in (arnold2019reducing))

Adam:

DiscoverQHM: , ,

DiscoverIGT: , , tail_fraction
The exact hyperparameters selected for each optimizer from the hyperparameter sweep are:

Discover:

Momentum:

QHM:

IGT:

Adam:

DiscoverQHM:

DiscoverIGT:
The training is set up in a multi-device data-parallel fashion on a pod of 8x8 Google Cloud TPU-v3 (google_tpu). There are 16 host machines, each having access to 4 TPU-v3 devices. The ImageNet (deng2009imagenet; ILSVRC15) training and validation splits obtained from https://www.tensorflow.org/datasets/catalog/imagenet2012 are divided into equal parts, and each host has access to its own separate part. A few examples are excluded this way to ensure the equal division. The training examples inside each batch are arranged such that each device core receives only examples from a single cluster (see Section 7.1). We confirmed that this does not make a practical difference for any other optimizer examined: the loss and accuracy curves look identical, and the final accuracy is the same independently of whether this technique is applied or not.
7.4 CIFAR (classes as clusters)
We trained a WideResNet-26-10 (zagoruyko2016wide) model on the CIFAR10 dataset (krizhevsky2009learning) for 400 epochs. We used:

Cosine learning rate schedule with 5 warmup steps.

Batch size of 256.

L2-regularization (l2_reg) set to 0.0005.

Dropout (dropout) rate of 0.3.

Group normalization (Wu_2018_ECCV) with 16 groups in the first layer in each block and 32 groups in the rest of the layers.
The preprocessing consists of 4-pixel zero-padding followed by a random crop of size 32x32, scaling the pixels to range , and a random horizontal flip. We correspondingly assign all examples with the same class to the same cluster (for the Discover optimizer).
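A runnable sketch of this preprocessing pipeline (our own code, not the paper's; the [0, 1] target pixel range is an assumption, since the exact range is elided above):

```python
import numpy as np

def preprocess(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Zero-pad a 32x32x3 image by 4 pixels per side, take a random
    32x32 crop, scale pixels to [0, 1] (assumed range), and flip
    left-right with probability 0.5."""
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))   # -> 40x40x3
    y, x = rng.integers(0, 9, size=2)                # 40 - 32 + 1 = 9 offsets
    crop = padded[y:y + 32, x:x + 32]
    crop = crop.astype(np.float32) / 255.0
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                         # random horizontal flip
    return crop
```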
The hyperparameters are selected from a hyperparameter sweep to obtain the highest mean accuracy at 400 epochs across 5 runs with different random seeds. The learning rate is chosen from the set {0.001, 0.01, 0.03, 0.1, 0.175} for all optimizers. The remaining hyperparameters are selected from the following sets:

Discover:

Momentum: (as used for WideResNet on CIFAR10 in (zagoruyko2016wide))

QHM:

IGT: (as determined to be best for CIFAR10 in (arnold2019reducing))

Adam: (the good default settings reported in (kingma2014adam))

DiscoverQHM: , ,

DiscoverIGT: , tail_fraction
The exact hyperparameters selected for each optimizer from the hyperparameter sweep are:

SGD:

Discover:

Momentum:

QHM:

IGT:

Adam:

DiscoverQHM:

DiscoverIGT:
The training setup is the same as for the ImageNet experiments (see appendix Section 7.3), but using only one host machine with 4 Google Cloud TPUv2 devices (google_tpu). The CIFAR10 (krizhevsky2009learning) training and test splits are obtained from https://www.tensorflow.org/datasets/catalog/cifar10.
7.5 CIFAR (noisy labels)
We also run the above-mentioned CIFAR10 (krizhevsky2009learning) experiments with partially corrupted labels, where the label of each image is flipped to a random different class, independently with probability , every time the example is seen during training; e.g., the same example might get different labels in different training epochs. We tuned the hyperparameters anew, in the same way as for the clean CIFAR10 experiment of Section 7.4, for each value of . The results for the low-noise setting () and the high-noise setting () are reported in Figure 6 and Figure 4 respectively.
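The per-visit label corruption described above can be sketched as follows (an illustrative helper of our own; note that the flip is resampled on every epoch, never fixed per example):

```python
import random

def corrupt_label(label: int, num_classes: int, p: float,
                  rng: random.Random) -> int:
    """With probability p, replace the label with a uniformly random
    *different* class. Applied independently each time the example is
    seen, so the same image may get different labels across epochs."""
    if rng.random() < p:
        other = rng.randrange(num_classes - 1)
        # Skip the original label so the flipped class is always different.
        return other if other < label else other + 1
    return label
```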
The exact hyperparameters selected for setting are:

SGD:

Discover:

Momentum:

QHM:

IGT:

Adam:

DiscoverQHM:

DiscoverIGT:
The exact hyperparameters selected for setting are:

SGD:

Discover:

Momentum:

QHM:

IGT:

Adam:

DiscoverQHM:

DiscoverIGT:
7.6 ImageNet (noisy labels)
We also run the above-mentioned ImageNet (deng2009imagenet; ILSVRC15) experiments with partially corrupted labels, where the label of each image is flipped to a random different class, independently with probability , every time the example is seen during training; e.g., the same example might get different labels in different training epochs. We used the same hyperparameters as selected for the clean ImageNet experiment, except for the learning rate, which was selected anew for each optimizer and each value of as described in Section 7.3. The best values were exactly those chosen in Section 7.3.
7.7 CIFAR (augmentations as clusters)
We also run the above-mentioned CIFAR10 experiments using Mixup, Cutmix and left-right-flip augmented examples as clusters, mirroring the setting used for the ImageNet experiments above. We did not tune any hyperparameters anew; instead, for both the clean and noisy label settings, we used the same hyperparameters as for the experiments above (where classes were used as clusters). With clean labels, the results for the Discover modifications, shown in Figure 7, were very similar to those of the vanilla optimizers, as expected. With noisy labels (), the Discover modifications performed significantly worse than the vanilla optimizers, as shown in Figure 8. These results highlight the importance of a careful choice of clustering structure, and suggest that Discover hyperparameters chosen for one clustering structure do not necessarily transfer to a different clustering choice in the case of noisy labels.
8 Variance due to label noise
Suppose that we optimize the loss with gradient equal to . Denote by the gradient of the loss where is the true labeling function. On an instance , we can write it as: , with and .
The first term is the traditional variance of the gradient noise, and the second term is due to the label noise: the larger the noise, the larger this term.
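A plausible reconstruction of the elided decomposition, with notation that is our assumption (true label $y^* = f^*(x)$, noisy label $\tilde y$, per-sample gradient $g$, risk gradient $\nabla R(\theta)$):

```latex
g(\theta; x, \tilde y)
  = \nabla R(\theta)
  + \underbrace{\big( g(\theta; x, y^*) - \nabla R(\theta) \big)}_{\varepsilon_1:\ \text{gradient noise}}
  + \underbrace{\big( g(\theta; x, \tilde y) - g(\theta; x, y^*) \big)}_{\varepsilon_2:\ \text{label noise}} ,
```

so that the second moment of the total noise splits into the traditional term $\mathbb{E}\|\varepsilon_1\|^2$ and a term driven by $\varepsilon_2$, which grows with the label noise.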
9 Appendix B: Proofs
Proof 1
In this part we give a proof of Lemma 1. Before proceeding, we denote and , where the superscript indicates the cluster index it arises from. We want to show that the first- and second-order moments of the gradient noise satisfy:
(8)  
(9) 
where , and :
where is referred to as the magnitude of the gradient noise.
In the case where all the clusters have the same probability, that is , we have that:
which corresponds to the magnitude of the gradient noise when the clustering structure is not taken into account.
We recall that the gradient noise for SGD is given by: .
where we use the fact that:
Using the following Jensen's inequality:
It holds that:
Using the fact that, for any random variable , and , we have:
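The two facts invoked in this derivation are presumably the standard norm identities for a random vector $X$ (our reconstruction, consistent with the Jensen step above):

```latex
\big\| \mathbb{E}[X] \big\|^2 \;\le\; \mathbb{E}\big[ \|X\|^2 \big]
\qquad \text{(Jensen)},
\qquad
\mathbb{E}\big[ \| X - \mathbb{E}[X] \|^2 \big]
  \;=\; \mathbb{E}\big[ \|X\|^2 \big] - \big\| \mathbb{E}[X] \big\|^2 .
```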
We recall that the gradient noise within the cluster is given by: