1 Introduction

Deep neural networks have been used in many applications and have achieved impressive results in various domains, especially in computer vision tasks such as image classification, autonomous driving, and face recognition (Lin et al., 2018; Wang & Deng, 2021). However, serious security concerns remain, as it has been demonstrated that these networks are not sufficiently robust against different types of adversarial examples (Akhtar & Mian, 2018; Szegedy et al., 2014). Adversarial examples are data points that are slightly perturbed and optimized to deceive a network. Numerous defenses have been developed against adversarial examples, but the issue is far from solved, as the proposed methods yield robustness only against a limited number of threat models under specific conditions.

Variants of adversarial training (AT) (Madry et al., 2018) are currently the state of the art among empirical defenses (Bai et al., 2021; Kireev et al., 2021). AT minimizes the loss of the network f parameterized by \(\theta\) on adversarial examples \(x_i + \delta _i\), where each perturbation \(\delta _i\) is obtained by maximizing the network loss at \(x_i+\delta _i\). To keep the perturbation imperceptible, \(\delta _i\) is typically norm bounded. In brief, AT solves:

$$\begin{aligned} \min _\theta \sum _i \max _{\delta \in \Delta } \ell (f_\theta (x_i + \delta ), y_i), \end{aligned}$$
(1)

where \(\ell (\cdot )\) is the loss function, and \(\Delta\) is the set of feasible perturbations. The outer minimization can be performed with stochastic gradient descent, but the challenging part is solving the inner maximization.
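
As a point of reference for the rest of the paper, the following minimal PyTorch-style sketch shows how Eq. (1) is commonly approximated in practice with an \(\ell _\infty\) PGD inner solver; the function names, hyperparameter values, and the training-loop structure are illustrative assumptions rather than the exact setup of any cited work.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization of Eq. (1) under an l_inf bound with PGD."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascent step on the loss, then projection back onto the l_inf ball of radius eps.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach()

def at_epoch(model, loader, optimizer, device="cuda"):
    """Outer minimization of Eq. (1): SGD on adversarially perturbed batches."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        delta = pgd_linf(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        loss.backward()
        optimizer.step()
```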

It has been empirically observed that a model trained with AT is robust mainly against perturbations within \(\Delta\), but largely fails under other threat models (Kang et al., 2019; Ilyas et al., 2019). In other words, the model is mainly robust against the threat models used during training, and presumably has lower accuracy against other threat models than models trained with those threat models; in effect, the model overfits to the training attack (Rice et al., 2020). By other threat models, we mean any bounded or unbounded attacks that were not used in training.

Two strategies are considered in the literature to tackle this challenge: training against a union of perturbation sets (Maini et al., 2020; Tramèr & Boneh, 2019), or using a more perceptually aligned metric to define the perturbation set \(\Delta\) (Laidlaw et al., 2021). Along the first strategy, an attack was proposed that is updated in each iteration based on the gradient of the worst-case error among \(\{\ell _1, \ell _2, \ell _\infty \}\) attacks (Maini et al., 2020). Such methods still do not generalize to imperceptible attacks that are not norm bounded (Laidlaw et al., 2021). As an alternative, it was proposed to broaden \(\Delta\) to contain more widespread attacks. For instance, it was suggested to bound the perturbation through the distance of the perturbed input to the original one in the embedding of a pre-trained deep neural network (Laidlaw et al., 2021). This is called the LPIPS distance (Zhang et al., 2018) and is believed to be a more perceptual metric than the \(\ell _p\) norms usually used to bound the perturbation. However, the constrained version of their algorithm is computationally demanding, so they proposed a Lagrangian penalty in designing the attack:

$$\begin{aligned} \max _{\delta } \ell (f_\theta (x_i + \delta ), y_i) - \lambda ~ \text {LPIPS}(x_i, x_i + \delta ), \end{aligned}$$
(2)

where \(\text {LPIPS}(\cdot ,\cdot )\) is the LPIPS distance. This idea led to promising results on unseen attacks.

We will build upon this work and show that the Lagrangian nature of the loss plays a more important role in unseen attack generalization than the LPIPS distance, since it leads to adaptive generation of training perturbations. Adaptive generation of perturbations means that the perturbation magnitude depends on the model output for each sample, so it grows large when the model is confident and shrinks otherwise. The rationale is simple: the perturbation set \(\Delta\) should be as large as possible to ensure unseen attack generalization, but it has to shrink for difficult samples so as not to harm the model's clean accuracy.

To highlight the importance of the Lagrangian nature of the attack in unseen threat model generalization, we observed that the results improve further if the LPIPS in the Lagrangian setting is replaced with the \(\ell _2\) distance! We will give some theoretical insights on a possible explanation for this observation. It has also been pointed out that, due to the adversarial fragility of deep neural networks, the LPIPS metric has certain shortcomings in modeling perceptual distances (Kettunen et al., 2019). We also note that, in the ideal case, the distance metric should not have any a priori bias towards specific \(\delta\)'s. This brings out an issue with the LPIPS distance, since LPIPS measures the distance between \(x_i\) and \(x_i+\delta\) via the \(\ell _2\) difference between outputs of internal convolutional layers of a pre-trained network. These internal activations change more with perturbations in the foreground, i.e. parts of the image that contain the target object, than with perturbations in the background. This means that the attack is penalized less if it perturbs the background rather than the foreground, encouraging the attacker to make changes in the background, which introduces a natural bias into the attack. This bias is discussed in more detail in Sect. 6.1.

For these reasons, we suggest \(\left\| \delta \right\| _2\) as the distance metric in the Lagrangian setting, since it does not impose such biases on the attack. Closely related to this attack is the C&W attack (Carlini & Wagner, 2016), which also tries to find the minimal perturbation that can fool the classifier. This is achieved by adding a second term to the loss function to avoid large perturbations. However, the main difference of our work is in adjusting the weight \(\lambda\) that multiplies the perturbation \(\ell _2\) norm in the loss. For instance, the C&W attack initializes \(\lambda\) with a large value and gradually decreases it until it reaches a successful adversarial example. This leads to the minimal successful perturbation, which sounds reasonable at test time, but it has two disadvantages at training time. First, searching over \(\lambda\) makes this attack infeasible for large datasets. In addition, under a reasonable threat model, other attacks are not necessarily restricted to generating minimal perturbations, so C&W does not make the model robust against such attacks.

Therefore, we propose to use a fixed schedule of \(\lambda\) for all samples to avoid these problems. More importantly, we argue that, based on the envelope theorem, a fixed schedule of \(\lambda\) places an upper bound on the gradient of the loss with respect to the perturbation budget \(\epsilon\), while also preventing \(\delta\) from becoming overly small on average. The former effect yields model generalization under a slightly different perturbation budget \(\epsilon\), which is investigated in the theoretical section, while the latter helps ensure a sufficiently large overall perturbation size, which promotes robustness, since a fixed, properly chosen \(\lambda\) controls the trade-off between model loss and perturbation size without forcing samples to be minimally perturbed. Therefore, we hypothesize that using the same \(\lambda\) schedule for all instances prevents the loss function from changing drastically across slightly different perturbation sets and results in improved unforeseen attack generalization.

In summary, we propose to use a Lagrangian objective function with a fixed schedule for the multiplier to achieve better average accuracy against unforeseen attacks. Next, some empirical and theoretical insights are provided to support the use of such attacks in AT. It is also noted that the mentioned attack is feasible for large datasets. Finally, our method is compared with state-of-the-art methods against attacks that were not used during training, and it is demonstrated that our approach outperforms them.

2 Related work

Adversarial attacks and defenses, as significant machine learning safety issues, have attracted a lot of attention in recent years (Qian et al., 2022; Mohseni et al., 2023). New attacks are continually proposed while defenses try to counter them, in an ongoing loop (Uesato et al., 2018; Athalye et al., 2018). Here, some popular adversarial attacks and empirical defenses are reviewed.

2.1 Adversarial attacks

One of the first successful attacks against deep neural networks was the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), which used input gradient signs to craft the adversarial perturbation. Projected Gradient Descent (PGD) (Madry et al., 2018) performed FGSM iteratively, inspired by Kurakin et al. (2017), optimizing the perturbation by iterative moves along the gradient direction while bounding it using \(\ell _p\) norms. MI-FGSM (Dong et al., 2017) used momentum to modify the gradient direction in each step of designing the perturbation. Similar to the C&W attack (Carlini & Wagner, 2016), ALMA (Rony et al., 2020) used the Lagrange method to generate minimal perturbations, except that it modifies the Lagrange weight in each iteration so that its custom distance metric is better satisfied. The DeepFool attack assumed that the classifier is linear in a neighborhood of each example and tried to reduce the model's accuracy by moving the example toward the nearest linearized decision boundary (Moosavi-Dezfooli et al., 2015). \(\ell _0\) attacks such as PGD\(_0\) and CornerSearch deceive networks by changing a minimal number of pixels (Croce & Hein, 2019; Modas et al., 2019). As an alternative to \(\ell _p\) norm bounded attacks, the Wasserstein distance was introduced as a better distance for imperceptible perturbations (Wong et al., 2019). However, the proposed Sinkhorn iterations for projecting onto the Wasserstein ball are computationally expensive on large datasets.

Perceptual metrics such as SSIM (Wang et al., 2004) and LPIPS (Zhang et al., 2018) are other alternatives to the standard \(\ell _p\) norms for designing the perturbation set. For instance, the LPA and PPGD attacks (Laidlaw et al., 2021) used the LPIPS distance for crafting perturbations that are hard for humans to perceive. As the perturbation set is broader in such attacks compared to traditional \(\ell _p\) attacks, one would expect robustness against unseen attacks once the model is trained using them. To evaluate accuracy against unseen attacks, the JPEG, Fog, Snow, and Gabor attacks (Kang et al., 2019) have been proposed; we use them to compare our final model with the rest. StAdv (Xiao et al., 2018) and RecolorAdv (Laidlaw & Feizi, 2019) have recently been proposed as further alternative adversarial attacks. StAdv applies an adversarial local spatial transformation to each input pixel, and RecolorAdv maps each original pixel color to a perceptually indistinguishable color to fool the classifier. Our model will also be evaluated against these attacks.

2.2 Empirical defenses

Machine learning models typically exhibit poor generalization in the presence of a distribution shift in the test set (Taori et al., 2020; Gulrajani & Lopez-Paz, 2021). There is a thread of work that tries to improve the generalization of models under such distribution shifts, e.g., some works use random and diverse data augmentations to achieve comprehensive robustness against multiple types of unseen distributional shifts (Hendrycks et al., 2022; Wang et al., 2021). While these methods are effective against small distribution shifts, they are not robust against distribution shifts caused by adversarial examples (Ding et al., 2019).

Several defense methods have been proposed to achieve models robust against adversarial samples, but many of these defenses suffer from phenomena called "gradient masking" and "gradient obfuscation" (Athalye et al., 2018). Perhaps the most effective defenses are variants of adversarial training (AT) (Madry et al., 2018; Wu et al., 2020; Zhang et al., 2020; Allen-Zhu & Li, 2022; Wang et al., 2022), which includes training with PGD adversarial examples. Our work builds on the same concept and investigates adversarial training against adversarial examples that are unseen during training.

Adversarial training on multiple perturbations (e.g., MSD (Maini et al., 2020)) or wider types of adversarial examples (e.g., PAT (Laidlaw et al., 2021)) has been suggested as a way to reach robustness against multiple threat models, but our work is more general and aims for robustness against all unseen attacks. Another line of defenses tries to provide a theoretical lower bound on the adversarial accuracy against any bounded attack, which is called certified defense (Cohen et al., 2019; Zhai et al., 2020; Zhang et al., 2022). While these defenses could in principle address robustness against unseen attacks, they are not yet practically useful due to the large gap between certified accuracy and the empirical accuracy of adversarially trained models against existing attacks. Some recent methods use an instance-based perturbation budget in training (e.g., IAAT (Balaji et al., 2019), MMA (Ding et al., 2020), and DDN (Rony et al., 2019)), which is a potential solution for robustness against unseen attacks, and we compare our method with them in detail.

Adversarial training can also be combined with self-supervised learning methods (Kim et al., 2020; Chen et al., 2020), or data augmentation and data generation methods (Rebuffi et al., 2021; Sehwag et al., 2022; Wang et al., 2023) to improve robustness, but in this work, the effort is to improve the base adversarial training, which could lead to the improvement of its variants. Finally, note that there could be a trade-off between standard accuracy and robust accuracy (Tsipras et al., 2019; Xing et al., 2021; Javanmard et al., 2020; Rade et al., 2022). Therefore, to have a fair evaluation, one has to compare the adversarial accuracy of different methods under the same or similar clean accuracy.

3 Method

3.1 Proposed approach

A contribution of our work is proposing an attack that is specifically designed to be used in adversarial training to achieve a model that is robust even against unforeseen attacks. In our proposed attack, we attempt to maximize \(\ell (f_\theta (x_i + \delta ), y_i)\) similar to other white-box attacks, where \(f_\theta\) is the network parameterized by \(\theta\) that classifies input samples. The true labels are denoted as \(y_i\), and the margin loss (Carlini & Wagner, 2016) is used as \({\ell (\cdot )}\). Next, it is necessary to prevent the perturbation size from getting too large; otherwise, the clean accuracy would drop significantly due to semantic changes during training. In order to force the perturbation \(\delta\) to be small, we use the Lagrangian formulation and add a penalty term \(\Vert \delta \Vert _2\) to the loss function instead of the usual approach of bounding the perturbation with a distance metric such as an \(\ell _p\) norm. The advantages of controlling the perturbation norm with the Lagrangian formulation are discussed in detail in Sects. 3.2 and 3.3. We will empirically show that this modification boosts the generalization of the trained model to new, unforeseen threat models.

As mentioned in the introduction, the penalty term should preferably be unbiased against penalizing specific perturbations at training time. Otherwise, it will cause overfitting to the training attack and prohibit the classifier's generalization to unforeseen threat models. To deal with this issue, we suggest \({\Vert \delta \Vert }_2\) as the penalty term, since it treats all pixels similarly and avoids such biases. We could also use other \(\ell _p\) norms, but among {\(\ell _0, \ell _1, \ell _2,\) and \(\ell _\infty\)}, which are mainly used in the literature to bound perturbations, the \(\ell _2\) norm is the most aligned with human perception (Sen et al., 2019). Furthermore, the \(\ell _0\), \(\ell _1\), and \(\ell _\infty\) metrics are not fully differentiable, which makes the \(\ell _2\) distance a better choice. So our final objective function for generating perturbations is:

$$\begin{aligned} \max _\delta \ell (f_\theta (x_i + \delta ), y_i)-\lambda || \delta ||_2, \end{aligned}$$
(3)

where \(\lambda\) is the Lagrange multiplier that determines the compromise between the first and second terms. The penalty term acts as a "perturbation decay" in the gradient-based iterations of the attack and prevents the perturbation from over-enlarging in each iteration.

We empirically found that at least five steps are required to perform this optimization numerically. This can also be justified intuitively: as opposed to AT, the loss is maximized and the perturbation norm is minimized simultaneously. With five steps, the optimization can start with a smaller \(\lambda\) to move the sample toward a high-loss region without worrying about the perturbation norm being large. In the following steps, \(\lambda\) is increased to make the \(\ell _2\) distance smaller. Simultaneously, the step size (\(\alpha\)) is reduced to ensure that only minor changes to the perturbation are made in later steps and the sample remains adversarial. To ensure that decreasing \(\alpha\) actually reduces the rate of change, the gradient scale in different steps should be the same; therefore, we normalize the input gradients in each step.

Our attack is summarized in Algorithm 1 and is called the Lag. attack. It is noteworthy that the PAT method (Laidlaw et al., 2021) also uses a Lagrangian formulation for training, with LPIPS as the penalty term. The authors of PAT attribute the robustness against unseen attacks to LPIPS as a perceptual metric, while we show that the reason is the Lagrangian formulation, and our formulation in the Lag. attack can outperform PAT without using the LPIPS metric. Hence, our contribution compared to PAT is showing the significance of the Lagrangian formulation in training, in addition to outperforming its results by replacing LPIPS with the \(\ell _2\) norm. Other prior works, such as C&W, use the Lagrangian for attacking the model at test time by minimally perturbing samples. This is done by initializing \(\lambda\) with a large value and gradually decreasing it until an adversarial example is found. In Sect. 3.3, we discuss why minimally perturbing samples is not as effective as our method. Moreover, an empirical comparison with C&W is performed in Sect. 4.6.

Algorithm 1 Lag. Attack
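
The following PyTorch-style sketch illustrates one plausible implementation of the procedure described above. The margin loss, per-sample gradient normalization, and the opposite schedules for \(\lambda\) (increasing) and \(\alpha\) (decreasing) follow the text, and the default values (\(\alpha = 0.1\), \(\lambda = 2.0\), \(N = 5\), \(c = 0.1\), \(\sigma ^2 = 0.01\)) are taken from Sect. 5.1; however, the exact multiplicative form of the schedules is an assumption, so this is a sketch rather than a transcription of Algorithm 1.

```python
import torch
import torch.nn.functional as F

def margin_loss(logits, y):
    """C&W-style margin (without the confidence clamp): highest wrong-class logit
    minus true-class logit, averaged over the batch (larger = more adversarial)."""
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    mask = F.one_hot(y, num_classes=logits.size(1)).bool()
    other = logits.masked_fill(mask, float("-inf")).max(dim=1).values
    return (other - true).mean()

def lag_attack(model, x, y, n_steps=5, alpha=0.1, lam=2.0, c=0.1, sigma2=0.01):
    """Maximize margin(x + delta) - lam * ||delta||_2 (Eq. (3)) with normalized
    gradient steps, a decaying step size, and a growing Lagrange multiplier."""
    delta = torch.randn_like(x) * (sigma2 ** 0.5)   # small random initialization
    delta.requires_grad_(True)
    for _ in range(n_steps):
        logits = model((x + delta).clamp(0, 1))
        l2 = delta.flatten(1).norm(dim=1).mean()
        obj = margin_loss(logits, y) - lam * l2      # objective of Eq. (3)
        grad, = torch.autograd.grad(obj, delta)
        # Normalize the gradient per sample so that alpha alone controls the step length.
        gnorm = grad.flatten(1).norm(dim=1).clamp_min(1e-12)
        grad = grad / gnorm.view(-1, *([1] * (x.dim() - 1)))
        delta = (delta + alpha * grad).detach().requires_grad_(True)
        alpha *= (1 - c)   # assumed multiplicative decay of the step size
        lam *= (1 + c)     # assumed multiplicative growth of the multiplier
    return delta.detach()
```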

3.2 Lag. Attack adaptive performance

Here, we discuss the connection between the proposed Lagrange method and some relevant earlier work such as MART (Wang et al., 2020) and instance adaptive methods (Balaji et al., 2019; Ding et al., 2020; Rony et al., 2019). MART investigated the effect of misclassified and correctly classified examples on the robustness of adversarially trained models and suggested that these two groups should be treated differently during training. Instance adaptive methods try to find adversarial examples with minimum \(\ell _2/\ell _\infty\) norm for training. IAAT (Balaji et al., 2019) checks in each epoch whether the samples are adversarial, decreases the perturbation budget for the adversarial ones, and increases it for the others. MMA (Ding et al., 2020) attacks the samples with an initial \(\epsilon _0\) and uses the bisection method to find the minimum \(\epsilon\) for each sample to become adversarial. DDN (Rony et al., 2019) is an iterative attack that modifies the perturbation budget during the attack, decreasing it if the sample is adversarial and increasing it otherwise.

Our work is a generalization of these methods with a more straightforward approach. While other works place a constraint on the loss or the perturbation norm and optimize the other with complex algorithms, we optimize both simultaneously. In the following, we explain that our method also acts as an instance adaptive method, similar to IAAT, MMA, and DDN; the differences between our approach and these methods are then discussed in more detail in the next section.

Our proposed attack uses a schedule for the Lagrange multiplier that is constant across samples. This formulation yields larger perturbations for confidently classified examples and smaller perturbations for examples that are confidently misclassified, which we call adaptive behavior. To empirically demonstrate this point, we plotted \(\Vert \delta \Vert _2\) against the probability of the correct class given by the classifier for each sample in the CIFAR-10 test dataset. Here, \(\delta\) is obtained using PGD (\(\ell _2\), \(\ell _\infty\)) (Madry et al., 2018), \(\delta _{\text {Lag.}}\) (ours), and DDN (Rony et al., 2019) as an adaptive method. Figure 1a shows the plot. Based on this figure, the \(\ell _2\) norm of the PGD perturbations is approximately constant, whereas in our attack, which uses the Lagrangian formulation, it is directly related to the classifier's confidence in the correct class, similar to the DDN attack.
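
A hedged sketch of how the data behind Fig. 1a can be collected (the loop and names are illustrative; `attack` is any callable returning a perturbation, e.g. a Lagrange, PGD, or DDN implementation):

```python
import torch
import torch.nn.functional as F

def confidence_vs_norm(model, loader, attack, device="cuda"):
    """Per test sample: clean correct-class confidence and l2 norm of the attack's perturbation."""
    model.eval()
    confs, norms = [], []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            probs = F.softmax(model(x), dim=1)
            confs.append(probs.gather(1, y.unsqueeze(1)).squeeze(1).cpu())
        delta = attack(model, x, y)                 # the attack itself needs gradients
        norms.append(delta.flatten(1).norm(dim=1).cpu())
    return torch.cat(confs), torch.cat(norms)       # scatter-plot these two vectors (Fig. 1a)
```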

The advantage of this adaptive perturbation magnitude is discussed in prior work (Balaji et al., 2019). Accordingly, poor generalization of adversarial training is a consequence of training with a uniform perturbation radius around every training sample. Enforcing large margins around samples near the decision boundary produces poor decision boundaries that generalize badly. On the other hand, an adaptive perturbation method lets the model observe and learn a wider range of adversarial perturbations, which can improve the model's generalization ability (Rice et al., 2020).

Fig. 1 \(\ell _2\) norm of perturbations generated using the attack with Lagrangian formulation (our attack), PGD (\(\ell _2\), \(\ell _\infty\)) and DDN (as an adaptive attack) vs. confidence of the model in the correct class for the corresponding clean samples in the CIFAR-10 test dataset. Plot (a) is the scatter plot, where each point is a sample. It shows that the Lagrange attack generates perturbations with an adaptive size based on the model's classification confidence. This is similar to DDN as an adaptive attack, but with some differences. Plot (b) shows the average \(\ell _2\) norm of Lagrange and DDN perturbations for these samples, which shows that the Lagrange attack, unlike DDN, slightly perturbs misclassified samples and does not perturb correctly classified ones as much as DDN

3.3 Lagrange method versus instance adaptive methods

It has been demonstrated that there is a trade-off between clean and adversarial accuracy (Tsipras et al., 2019; Zhang et al., 2019). Therefore, an essential goal in adversarial training is to achieve high clean accuracy along with adversarial accuracy. To this end, instance adaptive methods suggest training the model with the minimum perturbation that places each sample on the standard (non-adversarial) decision boundary, so as to preserve both clean and adversarial accuracy. As discussed in the previous section, this causes adaptive behavior during training, but we will argue that it is not the best way to preserve clean and robust accuracy against unseen attacks.

Perturbing images can reduce the clean accuracy of the model. Therefore, perturbations should be applied wisely and only where they are beneficial. In this sense, the attacker and the generated perturbations can be viewed as a teacher for the model. The teacher's goal is to teach the student more robust features by increasing the loss of the model via perturbations added to the samples. Nevertheless, the teacher should consider that large perturbations can lower clean accuracy. Accordingly, it should only increase the perturbation norm if doing so produces a meaningful increase in the model's loss. Otherwise, it only causes errors on clean data without teaching the model better features against adversarial examples.

The Lagrange method considers this trade-off and selects the perturbation norm of each image according to the loss of the model, regardless of whether the image is adversarial or not. This can be seen in Fig. 1b, where the Lag. method slightly perturbs confidently misclassified samples but does not perturb confidently classified ones as much as DDN. This is the main difference between the Lagrange method and instance adaptive methods, since instance adaptive methods perturb images only until they become adversarial (minimally perturbing samples). We believe that our perturbation strategy is better for training because slightly perturbing misclassified samples can be effective even though they are already adversarial. We think this does not harm clean accuracy, since it does not change the semantics of the image, but it clearly can help robustness. On the other hand, perturbing confidently classified samples until they become adversarial can harm clean accuracy due to semantic changes in the image, which should be avoided (Tramèr et al., 2020).

To discuss this difference more precisely and show its effects in training, we measured the \(\ell _2\) norm of our attack against the \(\ell _2\) norm of the DDN attack during the training of a model with PGD-\(\ell _2\). As mentioned above, DDN attacks the samples with the minimum possible \(\ell _2\) norm. Also note that DDN finds the minimal perturbation more accurately than other methods such as AN-PGD, which is used in MMA.

Fig. 2 (a–c) \(\ell _2\) norm of perturbations from the DDN attack versus \(\ell _2\) norm of perturbations from the Lagrange attack during training of a model with PGD-\(\ell _2\) on CIFAR-10. After model initialization, model warm-up, and training of the model, the Lagrangian and DDN attacks are performed separately on each data point, and the \(\ell _2\) norms of the resulting perturbations are plotted against each other

Figure 2 shows the results. Based on Fig. 2a, after initialization of the model, the Lagrange attack, unlike DDN, does not perturb images very much. Since most (\(\sim\) 90%) of the samples are initially misclassified, DDN does not perturb them at all, while the Lagrange method perturbs them slightly. This corresponds to the small but dense line segment at the bottom of Fig. 2a. In contrast, for the remaining 10% that are correctly, but randomly, classified, DDN applies a huge perturbation, which could hurt clean accuracy by changing the semantics of the image. Moreover, note that the correct classification of such samples at initialization is by chance, and the model has not learned them yet. Therefore, intuitively, perturbing them heavily does not help the model learn robust features. This highlights the importance of choosing the adaptive perturbation size according to the loss rather than the classification outcome.

In addition, Fig. 2b shows the same plot after the warm-up period, in which the model is trained on clean examples. In this case, the Lagrange attack perturbs images more than the DDN attack. At this stage, the model has high clean accuracy but is fragile against adversarial perturbations. DDN exploits this fragility and perturbs images with a smaller perturbation, while the Lagrange attack perturbs them with a larger norm and tries to teach the model more robust features. Note that the classifier has learned much more at this stage than at initialization, so applying larger training perturbations with the Lagrange attack helps the classifier solidify what it has learned. Finally, the same plot after completion of the adversarial training is shown in Fig. 2c. At this stage, the model has learned and solidified its knowledge of certain samples. Therefore, to balance clean and adversarial accuracy, one should lower the perturbation norm, which is exactly what the Lagrange attack does.

To conclude, the criterion used to adjust the perturbation norm is different in our attack compared to previously proposed instance adaptive attacks. Our criterion is based on the classification loss, while previous methods rely on the classification outcome, with the consequences discussed above. We next discuss why our criterion is better suited to unseen attack generalization than these methods.

3.4 Theoretical insights

Here, some theoretical insights are provided that support the Lagrangian formulation. Our goal is to show that adversarially training the model using an attack with the Lagrangian formulation results in a bounded loss against all other attacks whose perturbations have a slightly different \(\ell _2\) norm than the training attack, which is a goal in unseen attack generalization. To this end, the definitions and proofs are provided in the following.

Definition 1

The adversarial loss at the input x for a custom threat model \(\Delta\) can be defined as:

$$\begin{aligned} U_\Delta : = \max _{\delta \in \Delta } \ell (f_\theta (x + \delta ), y), \end{aligned}$$
(4)

and the perturbation that results in \(U_\Delta\) is denoted as \(\delta _\Delta\) with the \(\ell _2\) norm of \({\Vert \delta _\Delta \Vert }_2\). Note that the ultimate goal in unseen attack generalization is bounding \(U_\Delta\). Also, let

$$\begin{aligned} L_2^\epsilon : = \max _\delta \ \ell (f_\theta (x+\delta ), y) \ \ \ s.t. \ \ \ {\Vert \delta \Vert }_2 \le \epsilon , \end{aligned}$$
(5)

which is the adversarial loss under \(\ell _2\) norm bounded perturbations. Moreover, we use the Lagrangian form in Eq. (3) to perform the maximization that gives us \(\delta _{\text {Lag.}}^\star\) as:

$$\begin{aligned} \delta _{\text {Lag.}}^\star : = \mathop {\mathrm {arg\,max}}\limits _\delta \ \ell (f_\theta (x+\delta ), y) -\lambda {\Vert \delta \Vert }_2. \end{aligned}$$
(6)

Lemma 1

The adversarial loss at the input x with custom perturbation \(\delta _\Delta\) is less than or equal to \(L_2^{{\Vert \delta _{\Delta } \Vert }_2}\), which can be summarized as:

$$\begin{aligned} U_\Delta \le L_2^{{\Vert \delta _{\Delta } \Vert }_2}. \end{aligned}$$
(7)

Proof

This can be deduced from the definitions. Accordingly, \(L_2^{{\Vert \delta _{\Delta } \Vert }_2}\) is the maximum loss that can be achieved with \({\Vert \delta _{\Delta } \Vert }_2\) perturbation budget. So, the loss that \(\delta _{\Delta }\) causes can only be equal to or less than this loss. \(\square\)

Lemma 2

The derivative of \(L_2^{\epsilon }\) with respect to \(\epsilon\), evaluated at \(\epsilon = {\Vert \delta ^{\star }_{\text {Lag.}} \Vert }_2\), is equal to \(\lambda\), which can be summarized as:

$$\begin{aligned} \dfrac{\partial L^\epsilon _2}{\partial \epsilon }\bigg |_{\epsilon = {\Vert \delta ^{\star }_{\text {Lag.}} \Vert }_2} = \lambda . \end{aligned}$$
(8)

Proof

This can be deduced from Eq. (5) using the envelope theorem (Milgrom & Segal, 2002). In other words, it is only necessary to rewrite Eq. (5) in Lagrangian form and take the derivative with respect to \(\epsilon\) at the optimal point. \(\square\)
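
For concreteness, the envelope argument can be written out in one line (a sketch; \(\mathcal {L}\) denotes the Lagrangian of Eq. (5) and \(\delta ^\star (\epsilon )\) its maximizer):

$$\begin{aligned} \mathcal {L}(\delta , \lambda ; \epsilon ) = \ell (f_\theta (x+\delta ), y) - \lambda \left( {\Vert \delta \Vert }_2 - \epsilon \right) , \qquad \dfrac{\partial L^\epsilon _2}{\partial \epsilon } = \dfrac{\partial \mathcal {L}}{\partial \epsilon }\bigg |_{\delta = \delta ^\star (\epsilon )} = \lambda , \end{aligned}$$

where \(\lambda\) is the optimal multiplier at budget \(\epsilon\); at \(\epsilon = {\Vert \delta ^{\star }_{\text {Lag.}} \Vert }_2\) this coincides with the fixed \(\lambda\) of Eq. (3), which gives Eq. (8).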

Lemma 3

If \(\Delta \epsilon\) is defined as:

$$\begin{aligned} \Delta \epsilon : = |{\Vert \delta _\Delta \Vert }_2 -{\Vert \delta _{\text {Lag.}}^\star \Vert }_2 |, \end{aligned}$$
(9)

we can place an upper bound on the \(L_2^{{\Vert \delta _{\Delta } \Vert }_2 }\) as:

$$\begin{aligned} L_2^{{\Vert \delta _{\Delta } \Vert }_2 } \le L_2^{{\Vert \delta ^\star _{\text {Lag.}} \Vert }_2 } + \lambda .(\Delta \epsilon ) + o(\Delta \epsilon ). \end{aligned}$$
(10)

Proof

By expanding the \(L_2^{{\Vert \delta _{\Delta } \Vert }_2 }\) around the \(\epsilon = {\Vert \delta ^\star _{\text {Lag.}} \Vert }_2\) using the Taylor series, we would have:

$$\begin{aligned} L_2^{{\Vert \delta _{\Delta } \Vert }_2 } \le L_2^{{\Vert \delta ^\star _{\text {Lag.}} \Vert }_2 } + \dfrac{\partial L^\epsilon _2}{\partial \epsilon }\bigg |_{\epsilon = {\Vert \delta ^{\star }_{\text {Lag.}} \Vert }_2}.(\Delta \epsilon ) + o(\Delta \epsilon ). \end{aligned}$$
(11)

Next, Eq. (10) is reached by replacing the derivative with the result of Lemma 2; the inequality in Eq. (11) holds because \(\lambda \ge 0\) and \(\Delta \epsilon\) is the absolute difference defined in Eq. (9), so the signed first-order Taylor term is at most \(\lambda\) times \(\Delta \epsilon\). \(\square\)

Theorem 1

An upper bound on the adversarial loss at the input x for a custom threat model \(\Delta\) can be set as:

$$\begin{aligned} U_\Delta \le L_2^{{\Vert \delta ^\star _{\text {Lag.}} \Vert }_2 } + \lambda .(\Delta \epsilon ) + o(\Delta \epsilon ). \end{aligned}$$
(12)

Proof

According to Lemma 1, \(U_\Delta\) is less than or equal to \(L_2^{{\Vert \delta _{\Delta } \Vert }_2}\). Substituting this into the result of Lemma 3 yields Eq. (12). \(\square\)

Remark 1

(Main result) Equation (12) is exactly what we were looking for. If \(\Delta \epsilon\) is small, then, knowing that \(\lambda\) has a fixed schedule, bounding \(L_2^{{\Vert \delta _{\text {Lag.}}^\star \Vert }_2 }\) via adversarial training with our proposed attack effectively bounds \(U_\Delta\), the loss under a custom threat model \(\Delta\) whose perturbations have a slightly different \(\ell _2\) norm from the one considered in training; this yields generalization against unseen attacks. This upper bound on \(U_\Delta\) is investigated empirically in Sect. 6.2. We also empirically validate that such improvements hold even for large \(\Delta \epsilon\).

Remark 2

The parameter \(\lambda\) is simply the derivative of adversarial loss under \(\ell _2\) attacks with respect to \(\epsilon\), which could potentially be large under other attacks. Hence training under the Lagrange attack would better suit the unseen attack generalization.

Remark 3

Equation (8) demonstrates that a fixed schedule of \(\lambda\) bounds the derivative of the loss with respect to the perturbation budget, so the model's loss will not grow abruptly with changes in \(\epsilon\). Note that reducing \(\lambda\) increases \({{\Vert \delta _{\text {Lag.}}^\star \Vert }_2 }\), and thus \(L_2^{{\Vert \delta _{\text {Lag.}}^\star \Vert }_2 }\) also increases. As a result, based on Eq. (12), a smaller \(\lambda\) does not automatically reduce the upper bound on \(U_\Delta\); it can tighten the bound and help attack generalization only if adversarial training with the corresponding \(\lambda\) can still bound \(L_2^{{\Vert \delta _{\text {Lag.}}^\star \Vert }_2 }\) properly.

3.5 Computational efficiency

Our proposed attack is also computationally efficient. Since comparisons of execution time depend on the exact GPU/CPU used for training, reporting absolute training times is of limited use; comparing the order of execution time across methods is more informative.

Each iteration of our attack approximately equals one iteration in a PGD attack in terms of the running time. Considering that we use only 5 iterations, the execution time of our attack is almost equivalent to the PGD-5 attack.

It is noteworthy that PGD is usually used with 10 or more iterations at training time to achieve robustness against perturbations with larger \(\ell _p\)-radii (Andriushchenko & Flammarion, 2020), so our attack takes about half the time required by conventional PGD. The Fast-LPA attack used in PAT (Laidlaw et al., 2021) generates the perturbation in 10 iterations, while we use only 5; it also computes the LPIPS distance in each iteration, which makes each iteration more time-consuming, about 1.2 times for Fast-LPA with AlexNet. So our attack is also much faster than Fast-LPA. Other attacks, such as JPEG (Kang et al., 2019), are also computationally more expensive than ours, since they use more iterations or each of their iterations takes more time. This makes our method scalable for training large models on large datasets.

4 Experiment

4.1 Setup

We proposed a new adversarial attack for training whose goal is a model robust against unseen threat models. To evaluate our method, a ResNet-18 model is adversarially trained using our attack on CIFAR-10 (Krizhevsky & Hinton, 2009), and a ResNet-34 on the ImageNet-100 dataset (a 100-class subset of ImageNet (Russakovsky et al., 2015) containing every 10th class by WordNet ID). We compare the robust accuracy of our method with other state-of-the-art methods against attacks that are used during training and several unseen attacks for each of these datasets. For training, standard techniques such as learning rate decay and warm-up are used to speed up convergence. More specifically, following Laidlaw et al. (2021), the learning rate for weight updates in SGD is initially set to 0.1 and reduced by a factor of 0.1 at epochs 70 and 90 for CIFAR-10. For ImageNet-100, it is initially set to 0.1 and lowered by a factor of 0.1 at epochs 30, 60, and 80. The models are trained for 100 and 90 epochs on CIFAR-10 and ImageNet-100, respectively. The warm-up step pre-trains the networks on clean data samples for three epochs. The same techniques are applied when training the rest of the models to ensure a fair comparison. To make the robust accuracies of different models comparable, the attack hyperparameters of each method are tuned to obtain similar clean accuracy. This is achieved by changing the bounds on the perturbation size of each method so that all models reach a similar standard accuracy.
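
The following sketch shows how the CIFAR-10 schedule above can be realized (a minimal PyTorch-style illustration; the momentum and weight-decay values are assumptions not specified in the text, and for ImageNet-100 the milestones would be 30, 60, and 80 over 90 epochs):

```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

def train_cifar10(model, loader, attack, epochs=100, warmup_epochs=3, device="cuda"):
    """SGD with lr 0.1 decayed by 0.1 at epochs 70 and 90, plus a 3-epoch clean warm-up."""
    model.to(device)
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    scheduler = MultiStepLR(optimizer, milestones=[70, 90], gamma=0.1)
    for epoch in range(epochs):
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            if epoch < warmup_epochs:
                x_in = x                                      # warm-up: clean samples only
            else:
                x_in = (x + attack(model, x, y)).clamp(0, 1)  # training attack, e.g. the Lag. attack
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_in), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
```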

4.2 Baseline methods

We compare our method with the best methods that have reached acceptable and competitive robust accuracies. For the CIFAR-10 dataset, we use TRADES \((\ell _\infty )\) (Zhang et al., 2019), a method that considers the trade-off between standard and robust accuracy; MART (Wang et al., 2020), since it treats misclassified examples differently; and adversarial training (AT) with PGD \((\ell _\infty , \ell _2)\) (Madry et al., 2018), with or without early stopping (ES). We also compare with instance adaptive methods such as IAAT (Balaji et al., 2019), MMA (Ding et al., 2020), and DDN (Rony et al., 2019). In addition, MSD (Maini et al., 2020) and PAT (Laidlaw et al., 2021) aim at robustness against multiple perturbations and perceptual attacks, and we compare against them as well. Among the methods with a Lagrangian form, we use ALMA (Rony et al., 2020) for training, since it is faster than methods such as C&W and is feasible for training. Training with the APGD attack (Croce & Hein, 2020) is also considered, as it is a stronger attack than PGD. All methods are used with the settings introduced in their original works. Only in some cases, for a fairer comparison, the training perturbation budget has been increased to achieve better robust accuracy at the cost of reduced clean accuracy. For example, the AT-\(\ell _2\) model is trained with \(\epsilon = 1.3\) and AT-\(\ell _\infty\) with \(\epsilon = 10/255\).

Among the mentioned baselines, only adversarial training with PGD and Fast-LPA (the PAT method) have been examined on the ImageNet-100 dataset, due to a shortage of computational resources. For a better comparison, we also consider adversarial training with the JPEG attack (Kang et al., 2019). Prior work (Kang et al., 2019; Laidlaw et al., 2021) has demonstrated that models trained using the JPEG attack, as opposed to other attacks such as Elastic, Fog, Gabor, Snow, RecolorAdv, and StAdv, generalize better against unseen attacks. Also, training with a union of attacks is not considered, for two reasons. First, the PAT method is more robust than a model optimized with a random attack from the union of attacks, or with their average or worst-case loss (Laidlaw et al., 2021). Second, it is inconsistent with our goal of examining the final model against unseen attacks, since this strategy assumes that all types of possible attacks are known.

Also, to examine whether the instance adaptive perturbation norm has played a role in improving the results, as demonstrated in Fig. 1a and discussed in Sect. 3.2, one model is trained using PGD-\(\ell _\infty\) and another using PGD-\(\ell _2\) with perturbation bounds selected based on the probability assigned to the correct class for the clean examples. Specifically, the thresholds {0.1, 0.25, 0.5} divide the [0, 1] interval of correct-class probability into four sub-intervals, and the maximum allowable perturbation norm for samples in these sub-intervals is set to 0.03\(\epsilon\), 0.3\(\epsilon\), 0.55\(\epsilon\), and \(\epsilon\), respectively (a sketch of this mapping is given below). We name these methods Threshold-\(\ell _\infty\)/\(\ell _2\); they are only evaluated on the CIFAR-10 dataset, for the sake of an ablation study.
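
A minimal sketch of this confidence-to-budget mapping follows; it assumes a PGD implementation that accepts a per-sample bound, which is an implementation detail not specified in the text.

```python
import torch

def threshold_budget(p_correct, eps):
    """Map the clean correct-class probability of each sample to a fraction of the
    nominal budget eps: [0, 0.1) -> 0.03*eps, [0.1, 0.25) -> 0.3*eps,
    [0.25, 0.5) -> 0.55*eps, [0.5, 1] -> eps."""
    frac = torch.full_like(p_correct, 0.03)
    frac[p_correct >= 0.10] = 0.30
    frac[p_correct >= 0.25] = 0.55
    frac[p_correct >= 0.50] = 1.00
    return frac * eps   # per-sample perturbation bound for PGD
```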

4.3 Unforeseen attacks

To evaluate models trained on CIFAR-10, we use the attacks that are employed at training time (PGD\((\ell _\infty , \ell _2)\), LPA) and unseen attacks that are widely used in earlier works. Kang et al. (2019) suggested 5 different types of attacks, including Elastic, JPEG, Fog, Gabor, and Snow, as unseen attacks and calibrated the maximum distortion size for some of them on CIFAR-10 and for all of them on the ImageNet-100 dataset. At test time, we use the calibrated ones with a maximum distortion size of \(\epsilon _3\) for CIFAR-10 and \(\epsilon _2\) for ImageNet-100, based on the calibration of Kang et al. (2019). We also use StAdv (Xiao et al., 2018), RecolorAdv (Laidlaw & Feizi, 2019), PGD\(_0\) (Croce & Hein, 2019), Gaussian noise, and Gaussian blurring to make the set of unseen attacks more comprehensive. These attacks are selected to cover a diversity of corruptions. For the ImageNet-100 dataset, we use the same attacks mentioned above except PGD\(_0\), since it has not been examined on ImageNet. Note that the attack parameters follow those widely used in earlier work or the instructions of the papers that proposed the attacks. Detailed information about the attacks is given in the captions of Tables 1 and 2.

Table 1 Test robust accuracy of our model along with other models against training attacks and unseen attacks in the CIFAR-10 dataset
Table 2 Test robust accuracy of our model along with other models against training attacks and unseen attacks in the ImageNet-100 dataset

4.4 Adversarial accuracy

Results of evaluating our method and the other mentioned methods against training and unseen attacks are listed in Table 1 for CIFAR-10 and Table 2 for the ImageNet-100 dataset. These results demonstrate that, on average, our method is more robust than the other baselines against unseen attacks and against the union of unseen and training attacks. On CIFAR-10, our model has 5.9% and 4.7% higher average robust accuracy than the others against unseen attacks and the union of attacks, respectively. On ImageNet-100, our model also reaches 3.2% and 3.1% higher robust accuracy against unseen attacks and the union of attacks, respectively. Accordingly, our method outperforms the baselines on average on both datasets by a significant margin. Moreover, based on the results in Tables 1 and 2, the proposed method is the best or second-best method against most of the attacks compared to other training methods. Thus, it performs well not only on average against unseen attacks but also under most individual attacks.

We believe that better or worse results relative to other trained models on a specific attack may be due to biases in those models, which are not easy to detect. For instance, the AT-\(\ell _\infty\) model outperforms our method against the \(\ell _\infty\) attack since it is trained with the same attack. As another example, the MSD method outperforms others against PGD\(_0\) as an \(\ell _0\) attack. According to the claims in the MSD paper, this is caused by using an \(\ell _1\) adversary in training, which is strictly stronger than an \(\ell _0\) adversary with the same radius and thus yields robustness against \(\ell _0\) attacks. As yet another example, IAAT outperforms others against Gaussian noise added to the image. This may be caused by the large \(\ell _2\) norm of the employed Gaussian noise (see Table 6): IAAT uses an instance-based perturbation budget for training that allows large perturbations for some of the training samples, so it may be more robust against attacks with large \(\ell _2\) norms.

Based on the results in Table 1, it should also be noted that Threshold-\(\ell _\infty\) and Threshold-\(\ell _2\) have higher robustness than PGD-\(\ell _\infty\) and PGD-\(\ell _2\), respectively, which demonstrates the effectiveness of a lower perturbation budget for misclassified examples for unseen attack generalization. Furthermore, this also lets us increase the perturbation size for some samples without losing clean accuracy. This is also why our model performs better than AT-\(\ell _2\) against the \(\ell _2\) attack.

4.5 Flowers recognition

As demonstrated earlier, our method performs better than the others against unseen attacks on CIFAR-10 and ImageNet-100 datasets. Nevertheless, a big concern is that a lot of recently proposed classification methods have overfitted to these two datasets. In other words, these two datasets have mainly been used for the evaluations, and as a result, satisfactory performance of a method may be limited to these datasets. To address this concern, we examine our method against PGD attacks on another dataset. We set a high bar by comparing our method against PGD-\(\ell _\infty\) and PGD-\(\ell _2\) trained models on \(\ell _\infty\) and \(\ell _2\) balls, respectively.

It is preferable to select a dataset whose images differ in type from the mentioned datasets, are roughly class-balanced, and have an appropriate resolution. For this purpose, the Flower dataset (see Footnote 1) is used, which contains five different types of flowers (Daisy, Dandelion, Rose, Sunflower, and Tulip). Each class contains about 800 images, and the image resolution is about 320 \(\times\) 240. This dataset includes some irrelevant images, which are removed for a more accurate evaluation.

We trained a ResNet-18 model to classify the flowers using the same training setting as in ImageNet-100. The models are trained with appropriate \(\epsilon\)'s (and other hyper-parameters) so that their clean accuracies are comparable. The models are then evaluated against PGD attacks with various bounds. Results are shown in Fig. 3 and demonstrate that our method performs better than models trained with the very attacks used at test time. It should be noted that outperforming such baselines is more challenging than comparing with other methods against unseen attacks.

Fig. 3 Evaluating adversarially trained models on the Flower dataset against (a) PGD-\(\ell _\infty\) and (b) PGD-\(\ell _2\) with various bounds. The allowable \(\ell _\infty /\ell _2\) norm of the perturbation is gradually increased

4.6 C&W attack for adversarial training

C&W proposes an attack with an objective function that maximizes the loss while minimizing the perturbation size. The main difference between its objective and that of the Lag. attack is how the Lagrange multiplier \(\lambda\) is adjusted: C&W tunes it so that a minimal perturbation is achieved, by initializing \(\lambda\) with a large value and gradually decreasing it until a successful adversarial example is reached.

As our proposed attack is similar to the C&W attack, we also investigate adversarially training a model with the C&W attack. This attack aims to find a perturbation with minimum size that can fool the classifier. In this respect, training with this attack is comparable to training with instance adaptive methods, since they also seek the attack with the minimum perturbation budget during training. As a result, all the analyses made for instance adaptive methods in Sects. 3.2 and 3.3 also hold for training a model with the C&W attack. The same is true for the ALMA attack, since it also tries to find minimal perturbations.

To empirically investigate C&W, we use only the first three classes of the CIFAR-10 dataset, since C&W is extremely time-consuming. Both the \(\ell _2\) and \(\ell _\infty\) versions of the C&W attack are used for training models, with an initial Lagrange multiplier of \(c = 0.001\), 7 binary-search steps over c, and 1000 optimization steps within each binary-search step. The resulting models are then evaluated against PGD-\(\ell _2\) with various bounds. Again, we set the training attack hyper-parameters so as to fix the clean accuracy across models. Based on Fig. 4, the robust accuracy of the C&W-trained models decreases faster than that of the other models as the allowable perturbation range increases, because the models are trained with minimally perturbed examples.

Fig. 4 Comparison of robust accuracy for the C&W-trained models with three other methods against PGD-\(\ell _2\) on the first three classes of the CIFAR-10 dataset. The PGD-\(\ell _2\) bound is gradually increased from 0.1 to 2

Also, our method is competitive with the AT-\(\ell _2\) method, despite the fact that our method is trained to withstand a wide range of attacks while AT-\(\ell _2\) is trained only against the PGD-\(\ell _2\) attack.

5 Ablation study

In the following, some additional experiments are conducted to investigate different aspects of our solution for the unforeseen attack generalization.

5.1 Main attack parameters

\(\lambda\) and \(\alpha\): The main parameters that make essential changes in our attack are the initial step size \(\alpha\) and the initial Lagrange multiplier \(\lambda\). \(\alpha\) determines the amount of movement in the gradient direction in each step: the larger it is, the more significant the change to the perturbation in each step, especially in the first steps before it has decayed. Therefore, increasing \(\alpha\) raises the perturbation norm, which can lead to a larger adversarial loss. \(\lambda\) determines the compromise between the loss and the perturbation norm: a larger \(\lambda\) decreases both the perturbation norm and the adversarial loss.

To clarify the change in loss and perturbation norm during our proposed attack and the impact of these parameters, we have empirically plotted the adversarial loss and \(\ell _2\) norm of perturbation in each step for various parameters in Fig. 5 (\(\lambda\) is fixed and \(\alpha\) is changed), and Fig. 6 (\(\alpha\) is fixed and \(\lambda\) is changed). These figures confirm our claims about the impacts of \(\alpha\) and \(\lambda\) on the generated perturbation. Moreover, they demonstrate that if these parameters are appropriately selected, an adversarial sample is generated in the first iterations, and in the following iterations, an attempt is made to reduce the perturbation norm as described in Sect. 3.

Fig. 5 Effect of the step size \(\alpha\) in our proposed attack. The C&W loss and \(\ell _2\) norm of perturbations generated by our attack are plotted at each step for various \(\alpha\)'s with a fixed \(\lambda\)

Fig. 6 Effect of \(\lambda\) in our proposed attack. The C&W loss and \(\ell _2\) norm of perturbations generated by our attack are plotted at each step for various \(\lambda\)'s with a fixed \(\alpha\)

Moreover, these two parameters let us control the compromise between clean and adversarial accuracy by changing the perturbation size and its corresponding loss in training. To show this empirically, the same setting as in Table 1 is used to report the "unseen and union mean" accuracy for a variety of parameters. Results in Table 3 demonstrate that a larger \(\alpha\) or a smaller \(\lambda\) leads to a more robust model at the cost of a drop in clean accuracy, and vice versa.

Table 3 Effect of the Lagrange attack hyper-parameters \(\alpha\) and \(\lambda\) on the clean and adversarial accuracy of the model trained on the CIFAR-10 dataset

Other parameters: In addition to \(\lambda\) and \(\alpha\), the Lag. attack has three other parameters: N (number of iterations), \(\sigma ^2\) (perturbation initialization variance), and c (decay parameter). To study the sensitivity to these parameters, the "unseen and union mean" accuracy is measured in Table 4 for values other than their defaults, using a setup similar to Table 1. Based on this table, a larger N causes a marginal improvement in adversarial accuracy at the cost of a marginal loss of clean accuracy. The results also do not seem sensitive to small changes in the other two parameters.

Table 4 Effect of the Lagrange attack hyper-parameters N, c, and \(\sigma ^2\) on the clean and adversarial accuracy of the model trained on the CIFAR-10 dataset. \({\alpha = 0.1}\), \(\lambda = 2.0\), \(N = 5\), \(c = 0.1\), and \(\sigma ^2 = 0.01\) are assumed as default values unless otherwise stated in the table

5.2 Evaluation against stronger attacks

We attempted to make the evaluations as comprehensive as possible, using various attacks to obtain the robust accuracy of different methods against both unforeseen attacks and attacks used during training. Since our focus is on evaluating methods against unseen attacks, only PGD-\(\ell _\infty /\ell _2\) are used in the experiments and stronger variants of PGD are not considered. The conclusions would not change even if stronger attacks such as AutoAttack (Croce & Hein, 2020) were used. To illustrate this point, the accuracy of our method and the PAT method (as the state of the art) is reported against standard AutoAttack on the CIFAR-10 and ImageNet-100 datasets. The results, shown in Table 5, demonstrate a trend similar to that of the PGD attacks. Hence, similar conclusions can be drawn with AutoAttack.

Table 5 Evaluation with AutoAttack (AA) in addition to the PGD Attack (perturbation budgets are similar to Tables 1 and 2). Both attacks show that our model is more robust

As another ablation study, in order to assess whether the selected unseen attacks span a variety of strengths, the average \(\ell _2\) norms of the attacks are examined on our CIFAR-10 model. The results in Table 6 show the diversity of the chosen attacks' \(\ell _2\) norms. Furthermore, the slightly larger \(\ell _2\) norms of these attacks compared to our training attack are further evidence of the fairness of the selection. This also supports the claim that our contribution includes removing the LPIPS bias.

Table 6 Average \(\ell _2\) norm of perturbations generated in evaluations of our model in Table 1

Another point about the selected attacks is their perturbation budget. Although the perturbation bounds in the experiments are chosen according to prior work, it is worth examining the effect of the test-time perturbation budget on the results presented in Tables 1 and 2. For this purpose, a number of methods from Table 1 are evaluated against PGD attacks with different bounds. Results are shown in Table 7. They show that if one method is more resistant than the others to an attack with a specific bound, it is also more resistant to that attack with other bounds. Therefore, this parameter does not have much effect on the final conclusions.

Table 7 Evaluating five different methods trained on CIFAR-10 against PGD attacks with three different bounds

5.3 Robust and clean accuracy trade-off

Based on Figs. 1b and 2a, DDN, as an adaptive method, does not perturb misclassified examples at all, while the Lagrange attack perturbs them slightly. Moreover, the Lagrange attack perturbs confidently classified examples less than DDN. Here we demonstrate that the adaptive perturbation norm in the Lagrange attack improves the trade-off between robust and clean accuracy. To this end, a model trained on clean examples is compared, in terms of clean and adversarial accuracy, with a training scheme that perturbs only the misclassified examples using PGD-\(\ell _2\) with \(\epsilon = 0.1\), in order to isolate the effect of slightly perturbing misclassified examples. Also, to evaluate the effect of the perturbation budget for confidently classified examples, two other training schemes are examined that perturb only the 10% of examples with the highest classification confidence, using PGD-\(\ell _2\) with \(\epsilon = 1.25\) and \(\epsilon = 1.5\) (a sketch of these schemes is given below). All the mentioned values of \(\epsilon\) are selected in accordance with the \(\ell _2\) perturbation norms adopted by the Lagrange attack for samples with various correct-class confidences in Fig. 1.
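
A hedged sketch of these training schemes follows (per-batch selection approximates the per-dataset 10% split, and `pgd_l2` is assumed to accept a scalar budget `eps`):

```python
import torch
import torch.nn.functional as F

def perturb_subset(model, x, y, pgd_l2, scheme="misclassified"):
    """Perturb only a chosen subset of the batch, as in the ablation schemes above."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
        conf_correct = probs.gather(1, y.unsqueeze(1)).squeeze(1)
        pred = probs.argmax(dim=1)
    if scheme == "misclassified":
        mask, eps = pred != y, 0.1                 # slightly perturb only misclassified samples
    else:                                          # "top-confident": 10% most confident samples
        k = max(1, int(0.1 * x.size(0)))
        mask = torch.zeros(x.size(0), dtype=torch.bool, device=x.device)
        mask[conf_correct.topk(k).indices] = True
        eps = 1.25                                 # or 1.5 in the second variant
    x_adv = x.clone()
    if mask.any():
        delta = pgd_l2(model, x[mask], y[mask], eps=eps)
        x_adv[mask] = (x[mask] + delta).clamp(0, 1)
    return x_adv
```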

Based on the results in Table 8, slightly perturbing the misclassified examples leads to roughly a 15% increase in robust accuracy at the cost of less than 1% of clean accuracy compared to perturbing none of the samples, which seems well worth it. Furthermore, decreasing the perturbation budget on confidently classified examples increases clean accuracy by about 2.5% at the cost of losing the same amount of robust accuracy, which appears reasonable given the importance of clean accuracy.

Table 8 Evaluating the effect of perturbing only specific samples in the training of the model with CIFAR-10 based on the clean and robust accuracies of the model against PGD-\(\ell _2\) (\(\epsilon\) = 0.5) and PGD-\(\ell _\infty\) (\(\epsilon\) = 4/255)

These results indicate that the Lagrange attack handles the trade-off between clean and robust accuracy better than instance-adaptive methods.

6 Remarks

6.1 LPA bias towards the background

For a model to withstand unseen threat models at test time, it should be exposed to all types of adversarial examples during training. To this end, the training attack should not have any a priori bias toward a specific kind of perturbation; otherwise, it would fail to generate certain perturbations, and the model would never learn to resist them. Following this argument, we observe that the LPA attack is biased toward perturbing pixels in the background of the input image, because it uses LPIPS as its distance function. LPIPS relies on the outputs of internal convolutional layers of a pre-trained network, which change more with perturbations in the foreground than in the background. As a result, the attack is penalized less for perturbing the background than the foreground, which creates a bias toward background perturbations. The attack still perturbs the foreground, since it carries more information than the background, but the bias remains.

Perturbing the background is not an issue in itself, but biasing the training attack toward mostly perturbing the background can limit what the model learns. For example, the model may have already learned to withstand perturbations in the background. In this case, the training attack should naturally spend all of its budget perturbing the image foreground, but the mentioned bias prevents this.

Here, we practically demonstrate that using LPIPS instead of \({\Vert \delta \Vert }_2\) makes the perturbations focus more on the background of the image. To validate this claim, a basic foreground/background segmentation is used to separate the background and foreground of the input image, and we then calculate the ratio of the total perturbation in the foreground to that in the background. For this purpose, we used deep networks with the same architecture as Lehman (2019) for CIFAR-10, and Lehman (2019) for ImageNet-100, to separate background and foreground in clean images. This gives an estimate of the probability of each pixel belonging to the main object. Using the segmentation results, we define F2B as the ratio of the total perturbation in the foreground to that in the background:

$$\begin{aligned} F2B(\delta ,\ p): = \dfrac{\sum _i (|\delta _i|\times p_i)/\sum _i p_i}{\sum _i(|\delta _i|\times (1-p_i))/\sum _i (1-p_i)}, \end{aligned}$$
(13)

where \(p_i\) is the estimated probability of pixel i belonging to the foreground, and \(\delta _i\) is the perturbation value at the same pixel. For CIFAR-10, the average F2B is 1.42 for LPIPS, while it is 1.69 for the Lagrange attack. For ImageNet-100, the values are 1.06 and 1.16, respectively. This confirms that LPA concentrates its perturbations on the background more than the Lagrange attack does. To make this clearer, the perturbations produced by these attacks are shown in Fig. 7 for some sample images from the CIFAR-10 dataset. It is evident that the perturbations generated with LPIPS as the distance metric change the background more than those generated with the \(\ell _2\) norm.
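For reference, a minimal NumPy sketch of Eq. (13) is given below. It assumes the per-pixel foreground probabilities \(p\) have already been produced by the segmentation network; the function name is illustrative.

```python
import numpy as np

def f2b(delta, p, eps=1e-8):
    """Foreground-to-background perturbation ratio as in Eq. (13).

    delta : per-pixel perturbation (any shape);
    p     : per-pixel foreground probability in [0, 1], same shape as delta.
    """
    delta = np.abs(np.asarray(delta, dtype=float)).ravel()
    p = np.asarray(p, dtype=float).ravel()
    fg = (delta * p).sum() / (p.sum() + eps)              # weighted mean |delta| in the foreground
    bg = (delta * (1 - p)).sum() / ((1 - p).sum() + eps)  # weighted mean |delta| in the background
    return fg / (bg + eps)
```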

Fig. 7 Comparison of the perturbations generated using the LPIPS distance and \({\Vert \delta \Vert }_2\) as the penalty term. The first row shows sample images from the CIFAR-10 dataset, the second row shows the perturbations generated for these images using \({\Vert \delta \Vert }_2\) as the distance metric, and the third row shows the perturbations generated using the LPIPS distance

6.2 Loss upper bound

In Sect. 3.4, an upper bound was provided on the loss that an unseen attack can induce in a given model. To examine this upper bound empirically, its value can be approximated and compared across several models. For this purpose, it is assumed that \(\Delta \epsilon\) in Eq. (12) is small, so the upper bound can be approximated by its first two terms. This is a reasonable assumption, since attacking adversarially trained models with an \(\epsilon\) much larger than that of the training attack greatly reduces the model's accuracy and makes comparing this upper bound meaningless. Therefore, the perturbation bound of the unseen attack is assumed to be 1.2 times the training attack perturbation norm, which makes it only slightly larger than the training bound and keeps \(\Delta \epsilon\) small.

Using Eq. (8) and the assumption that \(\Delta \epsilon\) is small, \(\lambda\) for each method can also be approximated, based on the envelope theorem, as:

$$\begin{aligned} \lambda = \dfrac{L_2^{{\Vert \delta ^{\star } \Vert }_2+\eta }-L_2^{{\Vert \delta ^{\star } \Vert }_2}}{\eta }, \end{aligned}$$
(14)

where \(\delta ^{\star }\) is the training attack perturbation, \(\eta\) is a small constant (0.05 in our experiments), and \(L_2^\epsilon\) is estimated using PGD-\(\ell _2\) with a perturbation bound equal to \(\epsilon\).

Now, approximating Eq. (12) with its first two terms and plugging in Eq. (14), the upper bound on the loss of an unseen attack becomes:

$$\begin{aligned} U_\Delta \le L_2^{{\Vert \delta ^\star \Vert }_2 } + \dfrac{L_2^{{\Vert \delta ^{\star } \Vert }_2+\eta }-L_2^{{\Vert \delta ^{\star } \Vert }_2}}{\eta } (\Delta \epsilon ). \end{aligned}$$
(15)

Using Eq. (15), the upper bound on the loss of an unseen attack can be computed for each sample individually. For comparison, this upper bound is calculated on CIFAR-10 samples for the models adversarially trained with the Lagrange attack, PGD-\(\ell _2\), PGD-\(\ell _\infty\), and Fast-LPA (the PAT model). Since a sample-by-sample comparison is hard to present, the mean, standard deviation, and median of the results for each method are reported in Fig. 8.
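The following is a minimal sketch of this per-sample estimate, combining Eqs. (14) and (15). It assumes a hypothetical `pgd_l2_loss` helper that returns the per-sample loss after a PGD-\(\ell _2\) attack with a given radius; it is an illustration of the computation, not our exact implementation.

```python
# Hypothetical sketch of the per-sample upper-bound estimate of Eqs. (14)-(15).
# `pgd_l2_loss(model, x, y, eps)` is assumed to return the per-sample loss after
# a PGD-ell_2 attack with radius eps; names and signatures are illustrative.

def loss_upper_bound(model, x, y, pgd_l2_loss, eps_train, eta=0.05, ratio=1.2):
    L_eps = pgd_l2_loss(model, x, y, eps=eps_train)            # L_2^{||delta*||_2}
    L_eps_eta = pgd_l2_loss(model, x, y, eps=eps_train + eta)  # L_2^{||delta*||_2 + eta}
    lam = (L_eps_eta - L_eps) / eta                            # Eq. (14): finite-difference lambda
    delta_eps = (ratio - 1.0) * eps_train                      # unseen budget = 1.2 x training budget
    return L_eps + lam * delta_eps                             # Eq. (15): first-order upper bound
```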

Fig. 8 The mean, standard deviation, and median of the upper bounds on the loss of an unseen attack for several models adversarially trained on CIFAR-10

The results show that this upper bound is lower for our model than for the others. Note that this is only an upper bound on the loss of an unseen attack; the exact accuracies of these methods were investigated in Sect. 4.4.

It is noteworthy that the model adversarially trained with PGD-\(\ell _2\) uses the same attack that is used for estimating \(L_2^\epsilon\), which may bias the first term of the approximated upper bound toward a lower value for this method. Still, our method has a smaller upper bound than PGD-\(\ell _2\) on average.

6.3 Signed versus pure input gradients

We note that using pure input gradients, as opposed to signed gradients, in the iterative updates helps craft better perturbations even in the \(\ell _\infty\) threat model. To demonstrate this, the \(\ell _2\) and \(\ell _\infty\) adversarially trained models are tested against PGD-40 (\(\ell _\infty\)) and a modified version of PGD-40 (MPGD-40) that does not take the sign of the gradients but instead normalizes them to the range \([-1, 1]\), bringing gradients from different samples to the same scale. The results listed in Table 9 show that this small change makes the attack stronger, confirming our claim. It also justifies the use of \(\ell _2\) attacks for training, where normalized pure gradients are used.
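As an illustration, the sketch below contrasts the two update rules inside a single \(\ell _\infty\) ascent step. The function name and interface are hypothetical, and the usual projection onto the \(\epsilon\)-ball and clipping to the valid pixel range would follow as in standard PGD.

```python
import torch

def linf_step(x_adv, grad, alpha, use_sign=True):
    """One ell_inf ascent step: signed gradient (PGD) versus per-sample
    max-normalized pure gradient (MPGD), which preserves relative magnitudes."""
    if use_sign:
        step = grad.sign()
    else:
        # Normalize each sample's gradient to [-1, 1] by its maximum absolute value.
        flat = grad.view(grad.size(0), -1)
        scale = flat.abs().max(dim=1).values.clamp_min(1e-12)
        step = grad / scale.view(-1, *([1] * (grad.dim() - 1)))
    return x_adv + alpha * step
```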

Table 9 Testing \(\ell _2\) and \(\ell _\infty\) adversarially trained models against \(\ell _\infty\) attacks with small and large bounds, using 40 iterations to optimize the perturbations

7 Conclusion

Previous works generally examine robustness only against a limited number of existing attacks and ignore other threat models. This creates a rivalry between defenses and attacks: for every new defense a new successful attack emerges, and vice versa. We addressed this problem by proposing a method that makes the classifier more resistant to unforeseen attacks. To this end, we introduced a new training attack that does not suffer from the limitations and biases of previous attacks. This attack uses the Lagrangian formulation to determine the perturbation margin of each sample according to the sample itself. The proposed approach is also much faster than alternatives, which allows its use on large datasets. Finally, we demonstrated the effectiveness of our method in improving generalization on CIFAR-10 and ImageNet-100 against many unseen attacks. Our work shows that using attacks that perform well at test time is insufficient for reaching a model that is robust to unforeseen attacks. The training attack should be designed to teach the classifier a general class of possible perturbations, and it should consider the trade-off between clean and adversarial accuracies when perturbing the training samples.