Four Defenses Against Adversarial Attacks

November 07, 2022

Recent work in artificial intelligence has resulted in deep learning algorithms that masterfully find patterns in huge amounts of data. And the machine learning models running these algorithms keep getting larger and more complex. State-of-the-art language models such as OpenAI’s GPT-3 have hundreds of billions of parameters and training sets of hundreds of billions of text data examples. Google Brain’s ViT-G/14 vision model has two billion parameters trained on three billion images. These gigantic deep neural networks (DNNs) are achieving remarkable results. They are the product of millions of experiments carried out by a growing research community, supported by optimistic investors and governments.

There is no question that DNNs are proving themselves very capable. However, despite being the leading edge of today’s artificial intelligence research, these systems are ‘intelligent’ in ways quite different from us humans. While today’s DNNs have their roots in early neural networks inspired by the human brain and nervous system, neural networks have always been based on artificial ‘neurons’ that are known to be far simpler than their biological counterparts. Even with simple neurons, though, early work showed that these networks could learn complicated patterns if they were large enough. DNNs have taken this idea further than anyone could have imagined, using the brute force of their huge networks to vanquish challenge after challenge.

No theory told us it was feasible to build and train DNNs as large as billions of units. We’ve discovered this only through experimentation. As a consequence, how and why DNNs work is a bit of a mystery. This means that research often turns up surprises. One such surprise is that DNNs are vulnerable to adversarial attacks.

An adversarial attack deliberately crafts input data to cause a DNN to make mistakes. Researchers have discovered that imperceptible changes to input data can successfully attack a DNN, causing it to misclassify patterns it previously recognized easily. In the example above, an image of a bus has been modified to make a DNN classify it as an ostrich, even though it still looks like a bus to us.

These kinds of attacks cause concern because they show a vulnerability that puts DNN reliability into question. They also open the door to criminal activities that might lead to, for example, crashing a self-driving car, bypassing content filters, or faking identity.

In this article we discuss how and why DNNs are vulnerable to adversarial attacks, and how to make machine learning models less vulnerable to these attacks.

The Hyperdimensional World of a DNN 

DNNs ‘learn’ about their environment by adjusting their model parameters during multiple training iterations or epochs. These adjustments can be thought of as moving from point to point in a loss terrain. A point in the loss terrain represents the value of the DNN’s loss function, typically cross-entropy loss, for a particular set of parameter values. The job of training is to find parameters that result in the DNN performing well, which is reflected by a small value of the loss function.

Gradient descent, used throughout deep learning, is an efficient procedure that searches the loss terrain to find a good set of parameters. The example below illustrates how gradient descent navigates the loss terrain for a very simple neural network, one with only two parameters, θ0 and θ1. The loss terrain is depicted as a colored mathematical surface showing the value of the network’s loss function J(θ0, θ1) for every possible value of the two parameters. The arrow in the diagram indicates where training starts, the location associated with initialized parameter values. Since in this example this point is on a ‘hill’, the loss is high. Gradient descent uses multiple iterations to incrementally adjust the parameters, mathematically ‘rolling’ down the hill to lower the loss. During this process the size of the downhill step is adaptively adjusted, making a tradeoff between taking a step that’s too small, stalling the search in a flat area, and taking a step that’s too large, missing a narrow canyon with a low loss value.
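The downhill search described above can be sketched in a few lines of Python. This is only an illustration: the bowl-shaped loss function, the starting point, and the fixed step size (`learning_rate`) are all assumptions chosen to keep the example short, whereas practical training adapts the step size as described above.

```python
def loss(theta0, theta1):
    # Hypothetical bowl-shaped loss terrain with its minimum at (3, -1).
    return (theta0 - 3.0) ** 2 + 2.0 * (theta1 + 1.0) ** 2

def gradient(theta0, theta1):
    # Partial derivatives of the loss above with respect to each parameter.
    return 2.0 * (theta0 - 3.0), 4.0 * (theta1 + 1.0)

def gradient_descent(theta0, theta1, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient, i.e. downhill in the loss terrain.
    history = [loss(theta0, theta1)]
    for _ in range(steps):
        g0, g1 = gradient(theta0, theta1)
        theta0 -= learning_rate * g0
        theta1 -= learning_rate * g1
        history.append(loss(theta0, theta1))
    return theta0, theta1, history

# Start on the 'hill' at (0, 0) and roll down toward the minimum.
theta0, theta1, history = gradient_descent(0.0, 0.0)
print(theta0, theta1, history[0], history[-1])
```

The `history` list records the loss after every step, which makes it easy to check that the loss keeps falling as the parameters approach the minimum.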

We can visualize the loss terrain for this very simple example but trying to grasp the 1,000,000,001-dimensional loss terrain for a billion parameter DNN boggles the mind! Twenty years ago, navigating a loss terrain with this many dimensions seemed impossible. A well-known mathematical principle called the ‘curse of dimensionality’ seemed to doom DNNs to training sets that grew exponentially with the number of network parameters. Some die-hard researchers persisted, however, and discovered that we can train these very large DNNs after all.

We’re still figuring out exactly why we’ve been able to do this, but it is believed that one of the reasons is the manifold hypothesis. This hypothesis speculates that DNNs need to explore only a tiny fraction of the loss terrain, because good parameter sets show up along narrow, low-dimensional paths. In addition, it is believed that the loss terrain has many local minima, all close to the optimal loss. As soon as any one of those locations is reached, the search can stop.

It’s a lucky thing that near-optimal DNN parameter sets lie along narrow pathways in the loss terrain. But what happens if a DNN encounters input data that takes it off the beaten path?

Pushing Data Points off the Track

A 2014 paper by Szegedy et al. (available on arXiv) explored what happens when special input data is crafted that takes a trained DNN to unfamiliar areas of the loss terrain. They ran image classification experiments using the ImageNet visual database and the 60-million-parameter AlexNet deep convolutional neural network. They discovered they could fool the DNN by making very small changes to images that had been previously correctly classified. The trick was to make these small changes in the direction of increasing loss, i.e., ‘uphill’ in the loss terrain. Later work by Goodfellow et al. developed this approach further, demonstrating a particularly efficient way to create adversarial examples. The example below shows how a correctly classified image of a panda was imperceptibly perturbed to become an image that the DNN misclassified as a gibbon.

The 224x224x3 color input image of a panda was made into an adversarial image by adding a modifier image of very small values, magnified here 143 times so we can better see its structure. The values of the pixels in the perturbing image were selected so that its 224x224x3 vector representation points in the direction of increasing loss. Even though no component of this perturbing vector exceeds 0.007 in magnitude, the perturbations accumulated over the 150,528 dimensions of the input image vector push the DNN output far enough to cause a misclassification.
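A little arithmetic shows how such tiny changes can add up. The sketch below assumes a purely linear score (a weighted sum of the input values) and an average weight magnitude of 0.01. Both are illustrative assumptions, not properties of the network used in the panda example.

```python
import math

dims = 224 * 224 * 3       # number of values in the input image vector
epsilon = 0.007            # size of the change applied to each value

# Assumption for illustration: a linear score w.x in which every weight
# has magnitude mean_abs_weight.
mean_abs_weight = 0.01

# Nudging every input value by epsilon in the direction sign(w_i) shifts the
# score by epsilon * sum(|w_i|), which grows with the number of dimensions.
score_shift = epsilon * mean_abs_weight * dims

# Euclidean length of a perturbation whose components are all +/- epsilon.
l2_norm = epsilon * math.sqrt(dims)

print(dims, score_shift, l2_norm)
```

Under these assumptions a change of 0.007 per value moves the score by more than 10, even though each individual change is far too small to see.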

Szegedy et al. also discovered another surprising fact: the adversarial examples they generated for one DNN also worked as adversarial examples for other DNNs. This was true even if the DNNs had different hyperparameters or different training data sets. This transferability property of adversarial examples indicates that even with different architectures, DNNs share the way they represent objects in images. Apparently, they also suffer from shared delusions!

The adversarial perturbation of the panda image is an example of a white box adversarial attack. This type of attack requires access to the target DNN’s inner workings, visible, as it were, through a white box. This is illustrated in the diagram below. The adversarial attack has access to the gradients of the DNN’s loss function, which are used to perturb input data to create adversarial examples.

The transferability property discovered by Szegedy et al. creates the opportunity for another type of attack, a black box adversarial attack. This is an attack made without any information about a DNN’s internal structure. As illustrated in the diagram below, black box attacks rely instead on using the target system to generate training samples for a ‘substitute’ system created by the attacker. The substitute system is trained using these samples to mimic the classification results of the target system. The attacker then creates adversarial examples for the substitute system using the white box approach. Because of the transferability property, these adversarial examples can be used to mount a black box attack on the target DNN system.
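The substitute-model recipe can be sketched with a toy example. Everything here is hypothetical: the ‘target’ is a hidden two-input linear classifier that the attacker can only query for labels, the substitute is a small logistic regression, and the grid of queries and the value of `epsilon` are chosen purely for illustration.

```python
import math

# Hidden target model (hypothetical). The attacker only calls target_label().
_HIDDEN_W, _HIDDEN_B = (2.0, -2.0), 0.5

def target_label(x):
    z = _HIDDEN_W[0] * x[0] + _HIDDEN_W[1] * x[1] + _HIDDEN_B
    return 1 if z > 0 else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_substitute(queries, labels, lr=0.5, steps=300):
    # Full-batch gradient descent on a logistic-regression substitute.
    w, b = [0.0, 0.0], 0.0
    n = len(queries)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in zip(queries, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

def sign(v):
    return (v > 0) - (v < 0)

def perturb_with_substitute(w, b, x, y, epsilon):
    # White-box step against the substitute: move each input value by epsilon
    # along the sign of the substitute's loss gradient, which is (p - y) * w.
    err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
    return [xi + epsilon * sign(err * wi) for xi, wi in zip(x, w)]

# 1. Query the target on a grid of inputs to obtain training labels.
grid = [i * 0.5 for i in range(-2, 3)]
queries = [(gx, gy) for gx in grid for gy in grid]
labels = [target_label(q) for q in queries]

# 2. Train the substitute to mimic the target's answers.
sub_w, sub_b = train_substitute(queries, labels)

# 3. Craft an adversarial example on the substitute, then replay it on the target.
x = (0.5, 0.0)
y = target_label(x)
x_adv = perturb_with_substitute(sub_w, sub_b, x, y, epsilon=0.5)
fooled = target_label(x_adv) != y
print(sub_w, sub_b, x_adv, fooled)
```

The attacker never reads `_HIDDEN_W` or `_HIDDEN_B`. The example transfers because the substitute ends up with a decision boundary oriented like the target’s.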

Papernot et al. demonstrated a black box attack on a publicly available online image classification system. This system allows users to simply provide training data and labels, and the online system automatically selects and trains a DNN. Users can then input images to the trained DNN, and it will provide classifications. The user does not have any visibility into the DNN’s architecture.

Papernot et al. started by training the online system with a standard set of hand-drawn numerals. They then developed a set of training samples for their substitute model, using the labels obtained by presenting these samples to the trained online system. They continued training until they were confident that the decision boundaries of the substitute DNN reasonably approximated those of the target DNN. They tried to minimize the number of times they sent samples to the online system by sending only those needed to better map the decision boundaries.

Once the substitute DNN was trained, they created adversarial examples using a white-box approach. These adversarial examples were then used to attack the target model, the online system. They found that the online system misclassified 80% – 90% of the adversarial examples, evidence of a successful black-box attack.

At the 2017 International Conference on Learning Representations (ICLR), Liu et al. described work to improve both non-targeted and targeted black box adversarial attacks. (For non-targeted attacks the attacker doesn’t care about the particular class of the mistaken output, just that an error is made. A targeted attack seeks to consistently make misclassifications into particular classes.) Liu et al. observed that while it was relatively easy to come up with non-targeted adversarial examples using previously developed approaches, it was difficult to create targeted examples, particularly ones that transfer well across multiple target DNNs. They addressed this challenge by developing an approach that uses an ensemble of different DNNs to generate adversarial examples. They ran several experiments using 5 convolutional DNNs. For each experiment they used 4 of the DNNs to create adversarial examples for a black box attack on the fifth DNN. The attacks in each case were successful, resulting in error rates from 94% – 100%.

Impacts of Adversarial Attacks

The discovery of adversarial examples has a couple of important implications. First, it exposes a ‘brittleness’ associated with DNNs. When a DNN is trained to recognize pandas, we hope and expect that it will correctly classify anything that looks like a panda. Adversarial examples make it clear that DNNs don’t ‘understand’ pandas the way we do. This makes it difficult to trust that a DNN won’t unexpectedly fail because of some insignificant change in its operational environment.

A second implication is that bad actors could use adversarial attacks to intentionally mislead DNNs. Researchers have explored this vulnerability by creating a variety of interesting attacks. Here we review three examples: an adversarial attack that could crash a self-driving car; one that makes a man invisible to security cameras; and one that creates false transcripts of voice recordings.

To demonstrate how an adversarial attack might fool a self-driving car, Eykholt et al. developed a technique to design artificial ‘graffiti’ that, when applied to a traffic sign, can make a DNN misinterpret the sign. The picture above shows how they modified a stop sign with stickers that look like inconsequential graffiti. In both static and drive-by tests, a computer vision DNN typical of self-driving cars consistently misinterpreted the stop sign as a ‘Speed Limit 45’ sign.

DNNs are increasingly used to visually monitor the security of indoor and outdoor areas. A popular and very capable tool for this is the YOLOv2 (You Only Look Once version 2) convolutional neural network. This DNN predicts the identities and bounding boxes of all the objects in an image, using a single pass through the model. Thys et al. developed ‘adversarial patches’ that can be used to fool the YOLOv2. In this example, an adversarial patch prevents the DNN from detecting one of the men in the image.

Cisse et al. demonstrated a black box attack on the Google Voice voice-to-text smart phone application. First, they developed a substitute voice recognition DNN, and they used it to create audio adversarial examples. Using rigorous statistical testing, the authors established that human subjects were unable to distinguish between the ground truth audio and the adversarial example.

Then they used Google Voice to create transcriptions of the original and adversarial audio samples. One of their experiments is shown in the figure below. The Groundtruth Transcription is of the original audio sample used to train the substitute system. Google Voice makes a few mistakes transcribing the original audio but retains its meaning. It totally loses the meaning in its transcription of the adversarial example.

Defending Against Adversarial Attacks

The discovery of adversarial examples has triggered extensive research on adversarial training methods that make DNNs more robust and less vulnerable to adversarial attack. In this section we discuss four of these methods:

  • Training Data Augmentation – generating adversarial examples and adding them to a DNN’s training data
  • Regularization – adding terms in the training loss function to steer gradient descent toward parameters more resistant to adversarial attacks
  • Optimization – changing the training objective from minimizing loss to minimizing the maximum loss achievable by an adversarial attack
  • Distillation – training a DNN to have smaller gradients in its loss terrain so it’s less impacted by adversarial examples.

Training Data Augmentation

Data augmentation expands a DNN’s training data by adding to it modified versions of its original training samples. For example, image data sets can be augmented with rotated and translated versions of their original training images.
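As a minimal sketch of this idea, the helpers below rotate and shift tiny images stored as nested Python lists. The function names and the 2x2 ‘image’ are hypothetical. A real pipeline would use an image library and would only apply transformations that leave the label unchanged.

```python
def rotate90(image):
    # Rotate a 2-D list of pixel rows 90 degrees clockwise.
    return [list(row) for row in zip(*image[::-1])]

def translate_right(image, shift, fill=0):
    # Shift every row right by `shift` pixels, padding the left with `fill`.
    return [[fill] * shift + row[:len(row) - shift] for row in image]

def augment(samples):
    # Keep each original sample and add a rotated and a translated copy,
    # all carrying the original label.
    augmented = []
    for image, label in samples:
        augmented.append((image, label))
        augmented.append((rotate90(image), label))
        augmented.append((translate_right(image, 1), label))
    return augmented

samples = [([[1, 2], [3, 4]], "example")]
augmented = augment(samples)
print(len(augmented), augmented[1][0], augmented[2][0])
```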

To use augmentation for adversarial training, adversarial examples are added to the training data. This stretches the DNN’s decision boundaries, so that they embrace the areas in its feature space where small adversarial changes can lead to misclassification. As Madry et al. point out, handling adversarial examples generally requires more complex decision boundaries:

In the figure above, (a) shows a linear decision boundary easily separating green from blue samples. In (b), adding adversarial examples expands the range of training data, shown by the green and blue rectangles. The augmented data is no longer linearly separable. In (c) we see the more complex decision boundary needed to separate the augmented samples. This illustrates Madry et al.’s general observation that adversarial robustness requires an increase in neural network capacity, i.e., more layers or units, in order to implement more complex decision boundaries.

Several approaches have been developed to generate adversarial examples. Goodfellow et al. identified a very efficient method called the Fast Gradient Sign Method (FGSM). This approach creates an adversarial example by adding a small vector to the ‘natural’ example in the original training set. The magnitude of this adversarial perturbation is specified by an adjustable parameter, and the direction is given by the sign of the gradient of the DNN’s loss function with respect to the input, evaluated at the natural example. The panda/gibbon example above was generated using this method, with the magnitude parameter set to 0.007, a very small number which corresponded to the magnitude of the least significant bit in the encoded pixels.
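Here is a minimal sketch of FGSM. To stay self-contained it attacks a three-input logistic-regression classifier rather than a DNN, and the weights, input, and `epsilon` are hypothetical values chosen so the effect is easy to see. For this simple model the gradient of the cross-entropy loss with respect to the input is (p - y) * w. A real attack would obtain the gradient from the DNN by backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    # Probability that x belongs to class 1.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def cross_entropy(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(weights, bias, x, y):
    # Gradient of the cross-entropy loss with respect to the input x.
    p = predict(weights, bias, x)
    return [(p - y) * w for w in weights]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(weights, bias, x, y, epsilon):
    # Move every input value by epsilon in the direction that increases loss.
    grad = input_gradient(weights, bias, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

weights, bias = [2.0, -3.0, 1.0], 0.0   # hypothetical trained model
x, y = [0.2, -0.1, 0.1], 1              # a correctly classified example
x_adv = fgsm(weights, bias, x, y, epsilon=0.2)

p_natural = predict(weights, bias, x)
p_adversarial = predict(weights, bias, x_adv)
print(p_natural, p_adversarial)
```

In this toy setting the natural example is assigned to class 1 with probability above 0.5, while the perturbed example falls below 0.5 and is misclassified.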

Madry et al. identify Projected Gradient Descent (PGD) as a more powerful alternative to FGSM. PGD essentially performs a sequence of FGSM-type steps that better approximate the direction that results in the greatest increase in loss.
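A hedged sketch of PGD follows, again using a small logistic-regression classifier as a stand-in for a DNN. The names `epsilon`, `step_size`, and `steps` are assumptions for illustration, and the random starting point often used in practice is omitted. Each iteration takes an FGSM-type step and then projects the result back into the allowed range around the original input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(weights, bias, x, y):
    # Cross-entropy loss of a logistic-regression classifier at input x.
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(weights, bias, x, y):
    # For this model the loss gradient with respect to x is (p - y) * w.
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return [(p - y) * w for w in weights]

def sign(v):
    return (v > 0) - (v < 0)

def pgd(weights, bias, x, y, epsilon, step_size, steps):
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(weights, bias, x_adv, y)
        # FGSM-type step uphill in the loss ...
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # ... then project back so no value moves more than epsilon from x.
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon)
                 for xa, xi in zip(x_adv, x)]
    return x_adv

weights, bias = [2.0, -3.0, 1.0], 0.0   # hypothetical trained model
x, y = [0.2, -0.1, 0.1], 1
x_adv = pgd(weights, bias, x, y, epsilon=0.2, step_size=0.05, steps=10)
print(loss(weights, bias, x, y), loss(weights, bias, x_adv, y))
```

For a linear model like this one, the gradient direction never changes, so PGD ends up at the same point a single FGSM step would reach. The extra iterations matter for DNNs, whose loss terrain curves.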

The Regularization Defense

Networks that overfit training data do not generalize well – they’ve ‘memorized’ their training data at the expense of learning the essential characteristics of the input data. Regularization is a set of techniques widely used in deep learning to prevent DNNs from overfitting. For example, L2 regularization adds a term to the network’s cross-entropy loss function that constrains the size of the weights learned during training. Smaller weights generally lead to simpler networks and smoother decision boundaries that are less sensitive to unimportant nuances in the training data. This results in better generalization and better performance in the operational environment.

Adversarial defense can be viewed as a different sort of regularization problem. A DNN needs to generalize its decision boundaries to properly incorporate adversarial examples. This can be accomplished using the standard regularization approach of adding terms to the loss function.

Goodfellow et al. propose modifying the training loss function to contain two terms: the DNN’s standard loss evaluated at a regular training sample, plus the loss evaluated at the sample perturbed by FGSM. Their general formulation uses a weighted combination of the two terms; in practice, Goodfellow et al. found good results by equally weighting the two loss components.
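Written as a formula, the two-term loss looks roughly like the following. The symbols are notation introduced here for illustration: θ for the DNN’s parameters, x and y for a training sample and its label, J for the standard loss, ε for the FGSM magnitude parameter, and α for the weight, with α = 0.5 giving the equal weighting mentioned above.

```latex
\tilde{J}(\theta, x, y) =
  \alpha \, J(\theta, x, y)
  + (1 - \alpha) \, J\big(\theta,\; x + \epsilon \,\mathrm{sign}\!\left(\nabla_x J(\theta, x, y)\right),\; y\big)
```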

This regularization approach offers an advantage compared to simply using FGSM to augment training data. Since DNN parameters are being continually updated during the training process, the loss function uses new gradients and new FGSM examples created during every pass through the training data. This is like adding more adversarial examples to the training set, with every pass through the training data.

The Optimization Defense

Defense against adversarial attacks can be formulated as a minimax optimization problem: we want to minimize the maximum damage that can be done by the adversary. This can be visualized as seeking a saddle point, shown by the red dot in the figure below. Through the minimax lens, the standard DNN training formulation – minimizing its loss function – becomes an outer minimization combined with a concurrent inner maximization associated with worst-case adversarial attacks.
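In symbols, the minimax objective can be written along the following lines. The notation is introduced here for illustration: θ stands for the DNN’s parameters, (x, y) for a training sample and label drawn from the data distribution D, L for the loss, and δ for an adversarial perturbation restricted to an allowed set S, for example all perturbations whose components are no larger than some small ε.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \Big[ \max_{\delta \in \mathcal{S}} L(\theta,\, x + \delta,\, y) \Big]
```

The inner max is the adversary’s search for the most damaging perturbation of each sample. The outer min is ordinary training, carried out against that worst case.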

Researchers initially thought that approaching adversarial training as a minimax problem would not be practical. Standard training is complicated enough; adding inner maximization to the outer minimization greatly adds to the complication.

However, Madry et al. discovered two properties of DNN loss terrains that make it practical to take a minimax approach after all:

  • Projected Gradient Descent (PGD) consistently produces adversarial examples that yield loss values that are close to the global maximum
  • From Danskin’s theorem, you can optimize the minimax by just performing the outer minimization at the loss values produced by the PGD-perturbed examples.

These properties allow for a great simplification. Minimax training becomes a two-step process: (1) replace the original training data samples by their PGD-perturbed counterparts; then (2) train the DNN with the perturbed training data.
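The two steps can be sketched as a training loop. This toy version trains a two-input logistic-regression classifier instead of a DNN, and it regenerates each perturbed sample with the current parameters at every update, which is an assumption about how the two steps interleave. The data set, `epsilon`, learning rate, and epoch count are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, epsilon=0.3, step_size=0.1, steps=5):
    # Inner maximization: climb the loss while staying within epsilon of x.
    x_adv = list(x)
    for _ in range(steps):
        err = predict(w, b, x_adv) - y   # loss gradient w.r.t. x is err * w
        x_adv = [xa + step_size * sign(err * wi) for xa, wi in zip(x_adv, w)]
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon)
                 for xa, xi in zip(x_adv, x)]
    return x_adv

def adversarial_train(data, lr=0.5, epochs=100):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd(w, b, x, y)          # step 1: perturb the sample
            err = predict(w, b, x_adv) - y   # step 2: ordinary gradient step
            w = [wi - lr * err * xa for wi, xa in zip(w, x_adv)]
            b -= lr * err
    return w, b

data = [([1.0, 1.2], 1), ([0.8, 1.0], 1), ([1.2, 0.7], 1),
        ([-1.0, -0.9], 0), ([-0.7, -1.1], 0), ([-1.2, -0.8], 0)]
w, b = adversarial_train(data)

# Check the trained model against fresh PGD attacks on every training sample.
robust = all((predict(w, b, pgd(w, b, x, y)) > 0.5) == (y == 1)
             for x, y in data)
print(w, b, robust)
```

The only change from ordinary training is the call to `pgd` before each update. The outer loop is still plain gradient descent on the loss.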

Training a DNN to optimize minimax will make a DNN more resilient to adversarial attacks. But how resilient? How well will it perform with adversarial examples? And will it still perform well with unperturbed data?

Madry et al. explored these questions through image classification experiments using the MNIST (handwritten numerals 0-9) and CIFAR10 (color images in 10 classes, airplanes, cars, etc.) data sets. Examples of both datasets are shown below. They started with classifiers that performed very well on the natural, unperturbed data: 100% accuracy on the MNIST data and 95% on CIFAR10. These standard classifiers were then subjected to adversarial attacks by feeding them PGD-perturbed input data. The classifiers performed very poorly on the data from this PGD attack – accuracy was reduced to less than 5%.

When the classifiers were trained using PGD-perturbed data, they performed much better. The MNIST classifier was able to achieve over 90% accuracy on the adversarial examples, while retaining 100% accuracy on the natural data. However, for the more complex CIFAR10 data, the improvement was not as great. The CIFAR10 classifier was only able to achieve 46% accuracy on the adversarial examples, and its accuracy on the natural data dropped to 87%. While the 46% is quite an improvement over the 3% before adversarial training, these experiments indicate that improvements are still required in dealing with adversarial attacks.

Madry et al. also explored the impact of DNN capacity on adversarial training. Results for the MNIST data set are summarized in the following three grids:

Each grid shows the performance of MNIST classifiers with five increasing levels of capacity. The baseline classifier, ‘1’ on the Capacity scale, had a convolutional layer with 2 filters, followed by another convolutional layer with 4 filters and a fully connected layer with 64 units (i.e., a 2-4-64 network). For each step up the capacity scale, the number of filters and units was doubled, so that at 16 on the Capacity scale, the classifier was a 32-64-1024 network.
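The layer sizes at each point on the capacity scale follow directly from the doubling rule. The short sketch below simply multiplies the baseline sizes by the scale factor. The names `BASELINE` and `network_at` are ours, and the intermediate scale values 2, 4, and 8 are inferred from the doubling described above.

```python
# Baseline sizes: first conv layer filters, second conv layer filters,
# fully connected units.
BASELINE = (2, 4, 64)

def network_at(scale):
    # Every layer size is the baseline size multiplied by the capacity scale.
    return tuple(scale * n for n in BASELINE)

networks = {scale: network_at(scale) for scale in (1, 2, 4, 8, 16)}
for scale, sizes in networks.items():
    print(scale, "-".join(str(n) for n in sizes))
```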

Three colored lines on each grid show the performance of the classifiers on three different test data sets: blue line: performance with the original or ‘natural’ data, red line: the original data perturbed using FGSM, and green line: the original data perturbed using PGD.

Finally, each individual grid shows results representing a particular training data set. The classifiers in the first grid were trained only on the natural (unperturbed) data. As shown, even in their highest capacity versions, these classifiers perform poorly on the adversarial examples produced using either FGSM or PGD.

The second and third grids show the performance of classifiers trained using adversarial examples. The FGSM-trained classifiers always do well on the FGSM-perturbed data, even in their lowest-capacity version. However, their performance on the natural data (blue line) is degraded and doesn’t approach the performance of the classifiers trained on natural data until capacity is increased to the highest levels. Even in their highest-capacity version, the FGSM-trained classifiers do not perform well with the PGD-perturbed data.

The third grid shows the performance of the classifiers trained using PGD perturbed data. At Capacity scale 4 and higher, these classifiers perform well on all three data sets. This demonstrates two points: (1) PGD training gives resistance to both PGD and FGSM attacks; and (2) good performance with PGD perturbed data can only be obtained with a relatively high-capacity classifier.

Madry et al. also compared networks of two different capacities for classifying the CIFAR10 data, one with ten times the number of units compared to the other. They found a relatively small difference in the performance of the two networks, and the best result on adversarial samples was only 46%, well below the MNIST results. The authors conclude that further increases in capacity may improve performance on CIFAR10 data, as it did with MNIST. They invite anyone to test the robustness of their adversarial-trained DNNs, making them available on GitHub.

At the 2020 International Conference on Machine Learning (ICML) Maini et al. extended the PGD-based approach of Madry et al. to pick the worst-case perturbation at a training sample from a set of multiple perturbation models. They were able to show this approach can outperform using only PGD by 15%.

The Distillation Defense

Distillation is a technique originally developed to transfer learning from a large DNN to a smaller one. For example, this would allow a very large image classification DNN trained on a large server farm to transfer its learning to a DNN that is small enough to run efficiently on a cell phone.

The learning transfer happens through the SoftMax outputs of the larger model. For every input sample, these outputs give a probability for each possible classification of the input sample. These probabilities therefore reflect what the large DNN has learned about relationships between the various output classes. Distillation trains the smaller model by replacing the usual class labels in its training data with the SoftMax probabilities of the larger model. This trains the small model to mimic the outputs of the larger one, rather than simply maximizing the SoftMax value for the correct class.
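The difference between training on class labels and training on SoftMax probabilities shows up in the loss. The sketch below uses made-up probabilities for a three-class problem and a hypothetical helper, `soft_cross_entropy`, which scores a student’s output against a full probability vector instead of a single correct class.

```python
import math

def soft_cross_entropy(student_probs, targets):
    # Cross-entropy of the student's output against a target distribution.
    return -sum(t * math.log(s) for t, s in zip(targets, student_probs))

teacher_probs = [0.7, 0.2, 0.1]    # hypothetical SoftMax output of the large model
hard_label = [1.0, 0.0, 0.0]       # the usual class label for the same sample

mimic = [0.7, 0.2, 0.1]            # student that reproduces the teacher's output
confident = [0.98, 0.01, 0.01]     # student that only maximizes the correct class

# Against the hard label, the overconfident student has the lower loss ...
hard_mimic = soft_cross_entropy(mimic, hard_label)
hard_confident = soft_cross_entropy(confident, hard_label)

# ... but against the teacher's probabilities, the mimic wins.
soft_mimic = soft_cross_entropy(mimic, teacher_probs)
soft_confident = soft_cross_entropy(confident, teacher_probs)
print(hard_mimic, hard_confident, soft_mimic, soft_confident)
```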

A paper presented at the 2016 IEEE Symposium on Security and Privacy by Papernot et al. adapted distillation for use in adversarial training. Instead of transferring learning from a large model to a smaller one, their proposed method, distillation defense, creates a second model the same size as the first, but with better resilience to adversarial attack. Their approach is summarized in the diagram below:

The two-step process first trains a DNN in the standard way using inputs x with corresponding labels y. Then a second DNN is trained using the same inputs, but instead of using the original labels y, the training process uses the SoftMax output probabilities, F(x), produced for each input by the fully trained first DNN. As with regular distillation, these output probabilities reflect what the first DNN has learned about how the output classes relate to each other.

The key to the adaptation of distillation to adversarial defense is a special parameter, the temperature T, that controls the variability of the SoftMax outputs. Higher values of T reduce the value of the maximum SoftMax probabilities produced by the DNNs and smooth the decision surface of the trained model. This smoother loss terrain substantially reduces gradients around the natural samples, so that small adversarial perturbations are less likely to change the classification. The authors found that a value of T = 20 worked well for distillation defense. After the distilled network is trained, T is set to 1.0 for classification.
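The effect of the temperature is easy to see numerically. In the sketch below, which assumes the common formulation where the SoftMax inputs (logits) are divided by T before exponentiation, the logit values are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide the logits by T, then apply the usual SoftMax. Subtracting the
    # maximum before exponentiating only improves numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.0]                # hypothetical outputs for three classes
sharp = softmax(logits, temperature=1.0)
soft = softmax(logits, temperature=20.0)
print(sharp)
print(soft)
```

At T = 1 almost all of the probability goes to the first class, while at T = 20 the three probabilities are much closer together, which is the smoothing effect described above.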

The authors evaluated distilled networks using adversarial examples from the MNIST and CIFAR10 data sets. They evaluated their results using an adversarial effectiveness metric and showed that the metric could be reduced from 95% to less than 0.5% in their experiments.


Deep neural networks, some with billions of parameters, power ML systems processing the unprecedented quantities of text, images, and audio that permeate our modern environment. These systems learn patterns by navigating hyper-dimensional loss terrains, guided by millions or billions of training samples.

While DNNs have enabled the most impressive artificial intelligence systems yet created, the distinctive statistical approach used by deep learning models makes them vulnerable to mistakes that are surprising and troubling to us humans. This raises questions about DNN reliability, and the possibility of adversarial attacks that could crash cars, thwart security, or perpetrate fraud.

Researchers are beginning to better understand DNN vulnerabilities and to develop strategies to give these systems more adversarial robustness. We know how to generate good adversarial examples, and how to incorporate this knowledge into our DNN training regimens. The results so far show promise, and better training methods and higher-capacity DNNs may help us improve even further.

As your training data partner, iMerit stands ready to apply our extensive experience to help you develop a training pipeline to maximize the resilience of your ML system.

iMerit collaborates to deploy AI and Machine Learning in Autonomous Technology, Geospatial Technology, Medical AI, and other industries. Our solutions labeled, annotated, enriched, and segmented over 100 million images and videos that power Computer Vision algorithms.

If you’d like to learn how iMerit can augment your machine learning projects, please contact us to talk to an expert.