Authors:
(1) Athanasios Angelakis, Amsterdam University Medical Center, University of Amsterdam - Data Science Center, Amsterdam Public Health Research Institute, Amsterdam, Netherlands
(2) Andrey Rass, Den Haag, Netherlands.
In this section, we formalize a “bare minimum” procedure for adopting and replicating an experiment framework that assesses the tradeoff between overall model performance, DA intensity and class-specific bias, by outlining the specifics of our experiments' practical implementation. The goal is to serve as a guideline for applying the findings of Balestriero, Bottou, and LeCun (2022) in a more efficient manner that is better suited to practical or “business” environments, as well as to lay the groundwork for procuring the results discussed in further sections of this chapter.
We propose the following procedure, further referred to as “Data Augmentation Robustness Scouting”: First, a set of computer vision architectures is selected for a given dataset and DA regimen. Each selected model is then trained on a subset of the data in several training runs, such that each run features an increasing intensity of augmentation (represented as a function of α). The test set performance (per-class and overall) is then measured for every value of α, so that dynamics in performance can be observed as α is gradually increased. This procedure is then repeated from start to end, with the performance averaged out for every value of α to smooth out fluctuations resulting from the stochastic nature of the training process. The granularity (expressed through the number and range of α steps and the number of runs per value of α) should be kept to the bare minimum needed to observe trends in performance dynamics over different α values with the desired clarity. If “ranges of interest” in α are established, it is recommended to repeat the described procedure with higher granularity within those ranges.
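The scouting loop itself can be summarised in a short sketch. The sketch below is illustrative only: `build_model`, `make_augmented_ds` and the granularity constants are hypothetical placeholders standing in for the architecture construction, DA pipeline and settings described in the remainder of this chapter.

```python
import numpy as np

# Illustrative granularity values; the text argues these should be kept to the
# minimum needed to make trends over alpha visible.
ALPHAS = np.round(np.arange(0.0, 0.41, 0.04), 2)  # DA intensity grid
RUNS_PER_ALPHA = 4                                 # repeats averaged per alpha

def per_class_accuracy(y_true, y_pred, num_classes):
    """Test accuracy computed separately for each class label."""
    return np.array([np.mean(y_pred[y_true == c] == c) for c in range(num_classes)])

def scout_da_robustness(build_model, make_augmented_ds,
                        x_test, y_test, num_classes, epochs=30):
    """Train at each augmentation intensity and record averaged test accuracy."""
    results = {}
    for alpha in ALPHAS:
        overall_runs, class_runs = [], []
        for _ in range(RUNS_PER_ALPHA):
            model = build_model()                # fresh base model, no added regularization
            train_ds = make_augmented_ds(alpha)  # training data augmented at intensity alpha
            model.fit(train_ds, epochs=epochs, verbose=0)
            y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)
            overall_runs.append(float(np.mean(y_pred == y_test)))
            class_runs.append(per_class_accuracy(y_test, y_pred, num_classes))
        # Averaging over runs smooths fluctuations from the stochastic training process.
        results[float(alpha)] = (np.mean(overall_runs), np.mean(class_runs, axis=0))
    return results
```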
The experiments detailed in further sections of this chapter featured the DA Robustness Scouting procedure performed on three datasets - Fashion-MNIST, CIFAR-10 and CIFAR-100 (Xiao, Rasul, and Vollgraf 2017; Krizhevsky, Hinton et al. 2009). The Random Crop DA was evaluated, along with Random Horizontal Flip as a fixed supplementary augmentation. For our purposes, the preliminary model tuning and experiments were implemented using the TensorFlow and Keras libraries for Python. Our approach was based on best optimization practices for training convolutional neural networks as described in Goodfellow, Bengio, and Courville (2016d). Before an experiment can be run, a given model is repeatedly trained from scratch on a given dataset with different permutations of learning rate, maximum epoch count and batch size. This was done in order to obtain a “best case” base model - one that did not use any overt regularization (so as not to obscure the impact of regularization during the experiment), provided the best possible performance on the test set, and was the least overfitted. As batch size tuning can be considered a form of regularization, only minimal tuning was applied to batch size, just enough to ensure sufficient model stability and performance. A validation subset (10% of the training set, a commonly used value) was used to evaluate overfitting. It is important to note that the definition of “sufficient” is very dataset-specific, as some datasets lend themselves much more easily to being “solved” by models such as ResNet50 - both in terms of training error and generalization. Typically, regularization is used for the very purpose of improving generalization, but in this study doing so would risk obscuring the phenomenon being observed. As such, a decision was made to accept that the “best” base performance on some datasets would be subpar ceteris paribus.
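As a rough illustration of this tuning step, the sketch below evaluates a candidate model over a small grid of learning rates, epoch counts and batch sizes on a 10% validation split; the grid values and the `build_model` helper are placeholders, not the settings actually used in this chapter.

```python
import itertools
import tensorflow as tf

def tune_base_model(build_model, x_train, y_train,
                    learning_rates=(1e-3, 1e-4),   # placeholder grid values
                    epoch_counts=(20, 40),
                    batch_sizes=(64, 128)):
    """Grid-search for a 'best case' base model without overt regularization."""
    best = None
    for lr, epochs, batch in itertools.product(learning_rates, epoch_counts, batch_sizes):
        model = build_model()
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["sparse_categorical_accuracy"])
        history = model.fit(x_train, y_train,
                            validation_split=0.1,   # 10% validation subset to watch overfitting
                            epochs=epochs, batch_size=batch, verbose=0)
        val_acc = history.history["val_sparse_categorical_accuracy"][-1]
        train_acc = history.history["sparse_categorical_accuracy"][-1]
        gap = train_acc - val_acc                    # crude overfitting indicator
        if best is None or val_acc > best["val_acc"]:
            best = {"lr": lr, "epochs": epochs, "batch": batch,
                    "val_acc": val_acc, "overfit_gap": gap}
    return best
```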
The ResNet50 architecture was downloaded directly from TensorFlow's computer vision model zoo without the ImageNet weights, as the models would be trained from scratch. Balestriero, Bottou, and LeCun (2022) make no mention of the optimizer used, so the Adam optimizer (Kingma and Ba 2014) was used in all tuning and experiment steps. Each combination of model and dataset was tuned across a set of learning rates, epoch counts and batch sizes selected with the goal of obtaining the best validation and test set accuracies (using the sparse categorical accuracy metric from Keras) while being mindful of the models' tendency towards overfitting.
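As an illustration, such a base model might be instantiated roughly as follows; the Keras applications entry point, the 32x32 input shape and the sparse categorical accuracy metric follow the description above, while the specific learning rate is a placeholder rather than a tuned value.

```python
import tensorflow as tf

def build_resnet50(num_classes, input_shape=(32, 32, 3), learning_rate=1e-3):
    """ResNet50 trained from scratch (weights=None), with no added regularization."""
    model = tf.keras.applications.ResNet50(
        weights=None,              # no ImageNet weights: train from scratch
        input_shape=input_shape,   # 32x32 px inputs for the datasets used here
        classes=num_classes,
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["sparse_categorical_accuracy"],
    )
    return model
```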
The “small” EfficientNetV2S architecture (Tan and Le 2021) was selected for its respective experiment as a modern and time-efficient member of the EfficientNet family of models. The model was downloaded, tuned and optimized as per the above procedure.
The Swin Transformer architecture was implemented based on existing Keras documentation, itself inspired by the original approach of Liu et al. (2021), processing image data using 2x2 patches (about 1/16 of the image width per patch, similar to Dosovitskiy et al. (2020a)) and using an attention window size of 2 and a window shift size of 1. However, this implementation originally included built-in regularization methods that deviated from our base experiment structure: random cropping and flipping built into the model, label smoothing, and an AdamW optimizer (Loshchilov and Hutter 2017), which features decoupled weight decay. These were all omitted from the base model used in tuning and further experiments, while the core architecture was maintained.
Although Random Crop is a widespread and commonly used data augmentation technique, the particular procedure to be applied is not well-defined in the text of Balestriero, Bottou, and LeCun (2022). As such, for the purposes of this study, the Random Crop data augmentation was defined as applying the “random crop” transformation from TensorFlow's image processing library to the training dataset at every iteration, with the resulting image height and width calculated using the following formula:
new_image_size = round(image_size ∗ (1 − α))
where α is a percentage or fraction representing the portion of the image to be omitted and round() is Python's default rounding function. No padding was used. After being cropped, the images were upscaled back to their original size so that they would still match the 32x32 px input layer requirements of the models trained on them. In addition, these dimensions mean that the test images do not need to be cropped down to the new size, which reflects the more likely scenario in practice.
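The text identifies only TensorFlow's image processing library; a sketch of this augmentation step, assuming `tf.image.random_crop` and `tf.image.resize` as the concrete operations and including the fixed supplementary Random Horizontal Flip, might look as follows:

```python
import tensorflow as tf

IMAGE_SIZE = 32  # input resolution of the 32x32 px datasets used here

def random_crop_and_resize(image, alpha):
    """Crop away an alpha-sized portion of the image, then upscale back to 32x32."""
    # new_image_size = round(image_size * (1 - alpha)); no padding is applied.
    new_size = int(round(IMAGE_SIZE * (1.0 - alpha)))
    cropped = tf.image.random_crop(image, size=[new_size, new_size, 3])
    # Upscale back to the original size so the model's input layer is unchanged
    # and the test images need no cropping.
    return tf.image.resize(cropped, [IMAGE_SIZE, IMAGE_SIZE])

def augment(image, alpha):
    """Fixed supplementary Random Horizontal Flip followed by the Random Crop."""
    image = tf.image.random_flip_left_right(image)
    return random_crop_and_resize(image, alpha)

def make_train_dataset(x_train, y_train, alpha, batch_size=128):
    """Apply the augmentation at every iteration over the training set."""
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    ds = ds.shuffle(len(x_train))
    ds = ds.map(lambda img, label: (augment(img, alpha), label),
                num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```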
Finally, in order to accommodate existing limitations in available computing, the granularity of the experiments was reduced from 20 models per Random Crop α to 4, and α was adjusted in steps of 3-4% rather than evaluated at every 1%. To ensure the results maintained their integrity, the exact figures for this granularity reduction were motivated by finding the minimum number of runs beyond which each additional run produced only insignificant marginal deviations in the resulting mean test accuracy, preserving the expected trend. The new granularity of the augmentation α was determined based on the observation that the trends in accuracy highlighted in the original paper would still persist if this smoothing were applied to them, and might even become easier to detect by virtue of eliminating the existing fluctuations. In addition, finer steps of Random Crop α would produce the same post-crop training image dimensions due to the rounding involved and the images' small size, leading to multiple practically identical iterations and thus redundancy.
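The rounding-induced redundancy can be checked directly: for 32x32 px inputs, neighbouring 1% steps of α frequently map to the same post-crop size, as the small check below illustrates.

```python
# For 32x32 px inputs, fine-grained alpha steps collapse to the same crop size.
IMAGE_SIZE = 32
for alpha in [round(0.01 * k, 2) for k in range(0, 11)]:
    print(alpha, round(IMAGE_SIZE * (1 - alpha)))
# alpha = 0.00 and 0.01 both yield 32, and 0.02 through 0.04 all yield 31,
# so steps finer than roughly 3% produce identical training image dimensions.
```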
Additionally, in all conducted experiments, the Keras callback implementation of Early Stopping was used with a 10% validation subset separated from the main training subset. Early Stopping is a popular deep learning technique that monitors some metric after each training epoch (typically the loss or a selected accuracy metric on the validation subset) and terminates training if it fails to improve over a previous evaluation within a number of epochs described by the “patience” parameter. The algorithm also commonly features an optional restoration of model weights to the best-performing epoch. Early Stopping is commonly recognized as a regularization technique - however, it is considered unobtrusive, as it requires almost no change in the underlying training procedure, and can also be thought of as a “very efficient hyperparameter selection algorithm” (Goodfellow, Bengio, and Courville 2016d). As this work concerns the effects of regularization, a decision was made that Early Stopping must be used cautiously and to minimal effect, despite its unobtrusive nature.
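A sketch of how this callback is typically wired into Keras training, assuming validation-loss monitoring and best-weights restoration as described above (the specific patience value is a placeholder), is shown below.

```python
import tensorflow as tf

def fit_with_early_stopping(model, x_train, y_train, max_epochs, batch_size, patience=5):
    """Train with Keras Early Stopping on a 10% validation split (patience is a placeholder)."""
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",            # metric checked after each training epoch
        patience=patience,             # epochs without improvement before stopping
        restore_best_weights=True,     # roll back to the best-performing epoch
    )
    return model.fit(
        x_train, y_train,
        validation_split=0.1,          # 10% validation subset from the training set
        epochs=max_epochs,
        batch_size=batch_size,
        callbacks=[early_stopping],
    )
```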