Benchmark
Deterministic Verification Approaches (for DNNs)
Category | Name | Implementation |
---|---|---|
Complete > Solver-Based | Bounded MILP | Reimplementation |
Complete > Branch-and-Bound | AI2 | From ERAN |
Incomplete > Linear Relaxation > Linear Programming | LP-Full [1,2] | From CNN-Cert |
Incomplete > Linear Relaxation > Linear Inequality > Interval | IBP | Reimplementation |
Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | Fast-Lin | From CNN-Cert |
Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | CROWN | From CROWN-IBP |
Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | CNN-Cert | From CNN-Cert |
Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | CROWN-IBP | From CROWN-IBP |
Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | DeepPoly | From ERAN |
Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | RefineZono | From ERAN |
Incomplete > Linear Relaxation > Linear Inequality > Duality | WK [1,2] | From convex_adversarial |
Incomplete > Linear Relaxation > Multi-Neuron Relaxation | K-ReLU | From ERAN |
Incomplete > SDP | SDPVerify | Reimplementation |
Incomplete > SDP | LMIVerify | Reimplementation |
Incomplete > Lipschitz > General Lipschitz | Op-Norm | From RecurJac-and-CROWN |
Incomplete > Lipschitz > General Lipschitz | FastLip | From RecurJac-and-CROWN |
Incomplete > Lipschitz > General Lipschitz | RecurJac | From RecurJac-and-CROWN |
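For intuition about the simplest relaxation above, interval bound propagation (IBP) pushes elementwise lower and upper bounds on the input through the network layer by layer. Below is a minimal NumPy sketch for one affine layer followed by ReLU; it illustrates the general technique, not the specific reimplementation benchmarked here.

```python
import numpy as np

def ibp_affine(l, u, W, b):
    """Propagate elementwise bounds l <= x <= u through x -> W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    # Positive weights attain the minimum at l, negative weights at u (and vice versa).
    return W_pos @ l + W_neg @ u + b, W_pos @ u + W_neg @ l + b

def ibp_relu(l, u):
    """ReLU is monotone, so interval bounds pass through elementwise."""
    return np.maximum(l, 0.0), np.maximum(u, 0.0)
```

Starting from \(l = x - \epsilon\) and \(u = x + \epsilon\) and chaining these functions through all layers bounds the logits; the input is verified robust if the true class's lower bound exceeds every other class's upper bound.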
Architectures
We choose 7 network architectures for evaluation, including 3 fully-connected neural networks (FCNNa, FCNNb, FCNNc) and 4 convolutional neural networks (CNNa, CNNb, CNNc, CNNd).
These architectures are picked from the literature and adapted for each evaluated dataset, MNIST and CIFAR-10.
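For concreteness, FCNNa instantiated for MNIST could look as follows in PyTorch (a sketch assuming ReLU activations, which the relaxation-based verifiers above target); the \(2 \times [20]\) hidden layers reproduce the 50 neurons and 16,330 parameters listed in the table below.

```python
import torch.nn as nn

# FCNNa on MNIST: Flatten -> 3 FC layers with two hidden layers of 20 units.
# Neurons: 20 + 20 + 10 = 50; parameters: 16,330, matching the table below.
fcnn_a = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 10),
)
```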
| | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
|---|---|---|---|---|---|---|---|
| Size on MNIST | 50 neurons, 16,330 params | 310 neurons, 99,710 params | 7,178 neurons, 7,111,690 params | 4,814 neurons, 166,406 params | 24,042 neurons, 833,786 params | 48,074 neurons, 1,974,762 params | 176,138 neurons, 13,314,634 params |
| Size on CIFAR-10 | 50 neurons, 62,090 params | 310 neurons, 328,510 params | 7,178 neurons, 9,454,602 params | 6,254 neurons, 214,918 params | 31,242 neurons, 1,079,834 params | 62,474 neurons, 2,466,858 params | 229,898 neurons, 17,247,946 params |
| Structure | Flatten -> 3 FCs | Flatten -> 4 FCs | Flatten -> 8 FCs | 2 Convs -> Flatten -> 2 FCs | 4 Convs -> Flatten -> 2 FCs | 4 Convs -> Flatten -> 3 FCs | 5 Convs -> Flatten -> 2 FCs |
| Source | \(2 \times [20]\) from [1,2] | \(3 \times [100]\) enlarged from [1,2] | \(7 \times [1024]\) from [1,3] | Conv-Small in [4,5] | Half-sized Conv-Large in [5] | Conv-Large in [5] | Double-sized Conv-Large in [5] |
Training Methods
On MNIST, each of the 7 architectures is trained with 5 approaches, yielding 5 concrete models:

- `clean`: standard training with cross-entropy loss
- `adv1`: PGD adversarial training with \(\epsilon=0.1\) under the \(\ell_\infty\) adversary
- `adv3`: PGD adversarial training with \(\epsilon=0.3\) under the \(\ell_\infty\) adversary
- `cadv1`: CROWN-IBP training with \(\epsilon=0.1\) under the \(\ell_\infty\) adversary
- `cadv3`: CROWN-IBP training with \(\epsilon=0.3\) under the \(\ell_\infty\) adversary
On CIFAR-10, each of the 7 architectures is likewise trained with 5 approaches, yielding 5 concrete models:

- `clean`: standard training with cross-entropy loss
- `adv2`: PGD adversarial training with \(\epsilon=2/255\) under the \(\ell_\infty\) adversary
- `adv8`: PGD adversarial training with \(\epsilon=8/255\) under the \(\ell_\infty\) adversary
- `cadv2`: CROWN-IBP training with \(\epsilon=2/255\) under the \(\ell_\infty\) adversary
- `cadv8`: CROWN-IBP training with \(\epsilon=8/255\) under the \(\ell_\infty\) adversary
We choose these training configurations to reflect three common types of models on which verification approaches are used: vanilla (undefended) models, empirically defended models, and models trained for certification. All models are trained until they reach their expected robustness.
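To make the PGD training setup concrete, here is a minimal PyTorch sketch of the inner \(\ell_\infty\) attack used to craft training examples; the step count and step size here are illustrative assumptions, not the recorded training configuration.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, steps=40, step_size=None):
    """Craft an l_inf-bounded adversarial example via projected gradient descent."""
    step_size = step_size if step_size is not None else 2.5 * eps / steps
    delta = torch.empty_like(x).uniform_(-eps, eps)  # random start inside the ball
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0.0, 1.0)), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend the loss, then project back into the l_inf ball of radius eps.
        delta = (delta.detach() + step_size * grad.sign()).clamp(-eps, eps)
    return torch.clamp(x + delta, 0.0, 1.0).detach()

# One adversarial training step trains on the perturbed batch instead of the clean one:
#   loss = F.cross_entropy(model(pgd_attack(model, x, y, eps=0.1)), y)
```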
Clean and Empirical Robust Accuracies
We report the clean accuracies and empirical robust accuracies of all these models.

Clean accuracy: accuracy on the whole original test set with no perturbation.

Empirical robust accuracy: accuracy under a 100-step PGD attack on the whole original test set, where the attack is under the \(\ell_\infty\) norm and the perturbation radius \(\epsilon\) is bounded by the corresponding training radius (e.g., \(0.1\) for `adv1` and `cadv1`, \(0.3\) for `adv3` and `cadv3`). The step size is \(\epsilon/50\), following the literature. Due to their low robustness, we do not report empirical robust accuracy for `clean` models.
Clean Accuracy (MNIST)

| | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
|---|---|---|---|---|---|---|---|
| `clean` | 93.63% | 96.12% | 95.05% | 98.48% | 98.85% | 98.85% | 99.20% |
| `adv1` | 93.36% | 97.12% | 98.00% | 99.01% | 99.33% | 99.24% | 99.37% |
| `cadv1` | 88.35% | 95.23% | 96.89% | 98.52% | 98.87% | 98.83% | 99.13% |
| `adv3` | 76.77% | 89.96% | 83.27% | 98.20% | 99.01% | 99.16% | 99.38% |
| `cadv3` | 45.79% | 76.21% | 35.87% | 96.26% | 98.14% | 98.11% | 98.58% |
Empirical Robust Accuracy (MNIST)

| | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
|---|---|---|---|---|---|---|---|
| `clean` | / | / | / | / | / | / | / |
| `adv1` (\(\epsilon=0.1\)) | 78.39% | 85.18% | 87.40% | 95.48% | 96.09% | 96.35% | 97.64% |
| `cadv1` (\(\epsilon=0.1\)) | 80.99% | 90.67% | 93.43% | 97.18% | 98.12% | 98.41% | 98.50% |
| `adv3` (\(\epsilon=0.3\)) | 31.67% | 33.30% | 33.31% | 86.06% | 92.84% | 93.47% | 95.09% |
| `cadv3` (\(\epsilon=0.3\)) | 41.35% | 69.90% | 34.08% | 93.06% | 95.88% | 96.10% | 96.72% |
Clean Accuracy (CIFAR-10)

| | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
|---|---|---|---|---|---|---|---|
| `clean` | 38.46% | 41.76% | 46.37% | 59.45% | 65.65% | 58.60% | 83.53% |
| `adv2` | 41.17% | 44.31% | 36.19% | 60.40% | 68.66% | 65.59% | 83.65% |
| `cadv2` | 38.97% | 44.18% | 46.34% | 54.78% | 58.81% | 59.46% | 60.74% |
| `adv8` | 34.13% | 37.68% | 25.21% | 49.42% | 55.15% | 54.01% | 72.25% |
| `cadv8` | 30.59% | 32.05% | 30.30% | 40.51% | 40.05% | 40.11% | 40.61% |
Empirical Robust Accuracy (CIFAR-10)

| | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
|---|---|---|---|---|---|---|---|
| `clean` | / | / | / | / | / | / | / |
| `adv2` (\(\epsilon=2/255\)) | 39.49% | 42.99% | 35.87% | 57.39% | 64.01% | 62.05% | 76.41% |
| `cadv2` (\(\epsilon=2/255\)) | 38.71% | 43.83% | 45.42% | 54.17% | 58.76% | 59.25% | 59.88% |
| `adv8` (\(\epsilon=8/255\)) | 31.36% | 34.01% | 25.87% | 41.87% | 45.35% | 45.91% | 53.86% |
| `cadv8` (\(\epsilon=8/255\)) | 30.36% | 31.40% | 29.59% | 39.93% | 39.36% | 39.50% | 39.94% |
Datasets
We evaluate on two datasets: MNIST and CIFAR-10 (in the SoK paper we only present results on CIFAR-10). We do not consider larger datasets like TinyImageNet, because most verification approaches cannot handle models at that scale, or can only certify low robust accuracy (e.g., \(20\%\) against an \(\ell_\infty\) adversary with \(1/255\) attack radius).
MNIST is a set of \(28 \times 28\) gray-scale images; CIFAR-10 is a set of \(3 \times 32 \times 32\) images.
Adversary
We focus on the \(\ell_\infty\) adversary since, to the best of our knowledge, it is supported by the largest number of verification approaches. On MNIST, we evaluate under attack radii \(0.02\), \(0.1\), and \(0.3\); on CIFAR-10, under attack radii \(0.5/255\), \(2/255\), and \(8/255\).
Metrics
On each dataset, we uniformly sample 100 test set samples as the fixed set for evaluation. We evaluate with two metrics: certified accuracy and average certified robustness radius (in the SoK paper we only report certified accuracy due to the space limit).

- Certified Accuracy: As described on the front page, certified accuracy = (# samples verified to be robust) / (# of all evaluated samples) under the given \((\ell_\infty, \epsilon)\)-adversary. We also report the robust accuracy under empirical attack (PGD), which upper-bounds the robust accuracy, so we can estimate the gap between certified robust accuracy and the accuracy under existing attacks. The time limit is 60 s per instance, and we count timeout instances as `non-robust`.
- Average Certified Robustness Radius: We also evaluate the verification approaches by their average certified robustness radius \(\bar r\): the average \(\ell_\infty\) radius the verification approach can verify, measured on the same uniformly sampled sets as in the certified accuracy evaluation. To determine the best certified radius of each approach, we run a binary search (see the sketch after this list), which is valid because verification outcomes are monotone in the radius. We search the interval \([0, 0.5]\), since \(0.5\) is the largest possible radius for \([0, 1]\)-bounded MNIST and CIFAR-10 inputs: if the current radius \(mid\) is verified robust, we record \(mid\) as the current best and set \(l \gets mid\); if it cannot be verified, we set \(r \gets mid\); we stop on reaching precision \(0.01\) on MNIST or \(0.001\) on CIFAR-10, or when time runs out. Since the binary search entails multiple verifications per input, we set the time limit to 120 s per input and record the largest radius verified before the time is used up. We also report the smallest radius of adversarial samples found by empirical attack (PGD), which gives an upper bound on the certified robustness radius and lets us estimate the gap.
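A minimal sketch of this binary search, where `verify` is a hypothetical callable wrapping any of the benchmarked approaches and returning whether robustness is certified at a given radius:

```python
def best_certified_radius(verify, x, lo=0.0, hi=0.5, precision=0.01):
    """Binary-search the largest l_inf radius certifiable around input x.

    Relies on monotonicity: if a radius is certified, so is every smaller one.
    """
    best = 0.0
    while hi - lo > precision:
        mid = (lo + hi) / 2.0
        if verify(x, mid):    # certified at mid: try larger radii
            best, lo = mid, mid
        else:                 # not certified (or too loose, or timed out): try smaller
            hi = mid
    return best
```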
Experimental Environment
For MNIST experiments, we run the evaluation on a 24-core Intel Xeon E5-2650 CPU at 2.20 GHz with a single NVIDIA GeForce GTX 1080 Ti GPU. For CIFAR-10 experiments, we run the evaluation on a 24-core Intel Xeon Platinum 8259CL CPU at 2.50 GHz with a single NVIDIA Tesla T4 GPU. The CIFAR-10 experiments use a slightly faster environment because of their larger scale.
Observations
- On relatively small models (e.g., FCNNa and FCNNb), complete verification approaches can verify robustness effectively, so they are the best choice.
- On larger models (e.g., CNNb, CNNc, CNNd), linear relaxation based approaches usually perform best: complete verification approaches are too slow, and the other approaches are too loose, yielding almost 0% certified accuracy. However, linear relaxation based verification still cannot handle large DNNs, and it remains loose compared with the upper bound given by PGD evaluation.
- On robustly trained models, if the robust training approach is CROWN-IBP, which is tailored to IBP and CROWN, these two verification approaches achieve high certified accuracy (e.g., 25%-28% on `cadv8` for CNNd on CIFAR-10) while the other approaches fail to verify. Indeed, robust training approaches can usually boost certified accuracy, but the models must be verified with their corresponding verification approaches; similar observations appear in the literature.
- SDP approaches usually take too long and are thus less practical.
- Generally, the tendency on MNIST is the same as on CIFAR-10, despite the different input sizes.
- From the additional evaluations of average certified robustness radius, we find that the average radius distinguishes the approaches' performance more precisely than robust accuracy does. For example, on small models such as FCNNa and FCNNb, measured by robust accuracy, the complete verification approaches and some linear relaxation based approaches have almost the same precision (compare Bounded MILP and AI2 with CROWN and CNN-Cert); measured by average certified robustness radius, however, the complete verification approaches certify much larger radii, since they use no relaxations and these models are small enough to be certified efficiently by complete verification.
- From the running time statistics, we observe that a main cause of failing cases (i.e., 0% robust accuracy or zero certified radius) is excessive verification time. In particular, since evaluating the average certified robustness radius requires verifying each instance multiple times, inefficient approaches perform relatively worse under this metric.
Probabilistic Verification Approaches (for smoothed DNNs)
When evaluating certified robustness for smoothed DNNs, we evaluate the impact from the following three aspects:
- Verification Approaches: we evaluate 4 robustness verification approaches.
Verification | Evaluated Adversary Type |
---|---|
Differential Privacy Based [1] | \(\ell_1\), \(\ell_2\) |
Neyman-Pearson [2, 3, 4, 5] | \(\ell_1\), \(\ell_2\), \(\ell_\infty\) |
\(f\)-Divergence [6] | \(\ell_2\) |
Renyi Divergence [7] | \(\ell_1\) |
- Robust Training Approaches: we evaluate 5 robust training approaches.
Training | Evaluated Adversary Type |
---|---|
Data Augmentation [2, 3] | \(\ell_1\), \(\ell_2\), \(\ell_\infty\) |
Adversarial Training [8] | \(\ell_2\) |
Adversarial + Pretraining [8, 9] | \(\ell_2\) |
MACER [10] | \(\ell_2\) |
ADRE [11] | \(\ell_2\) |
- Smoothing Distributions: we compare the Gaussian, Laplace, and Uniform distributions (a sketch of smoothed prediction follows this list).
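As background for the evaluation below, a randomized-smoothing prediction is a (sampled) majority vote of the base classifier under the chosen noise distribution. A minimal NumPy sketch, where `base_classifier` is a hypothetical function mapping an input array to a class index:

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.5, n=1000, num_classes=10, seed=0):
    """Approximate the smoothed classifier by majority vote over n Gaussian
    perturbations of x (Laplace or Uniform noise is used analogously)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes, dtype=np.int64)
    for _ in range(n):
        counts[base_classifier(x + sigma * rng.standard_normal(x.shape))] += 1
    return int(counts.argmax()), counts
```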
Datasets
We evaluate on CIFAR-10 and ImageNet. CIFAR-10 is a set of \(3 \times 32 \times 32\) images, and ImageNet is a set of images rescaled to \(3 \times 224 \times 224\).
Model Architectures
Following common practice in literature [2, 3], we use ResNet-110 and Wide ResNet 40-2 on CIFAR-10; and ResNet-50 on ImageNet.
Evaluation Protocol
Following common practice, on both datasets \(n = 1{,}000\) samples are used for selecting the top label and \(N = 10^5\) samples are used for certification; the failure probability is set to \(\alpha = 0.001\). For both datasets, we evaluate on a subset of 500 samples uniformly drawn from the test set. We report and compare the certified accuracy at a given attack radius \(\epsilon\) under \(\ell_p\) adversaries for \(p = 1, 2, \infty\). To evaluate \(\ell_\infty\) robustness, we use the following norm conversion rule: given input dimension \(d\), a model certified robust with radius \(\epsilon\) in \(\ell_2\) norm at point \(x_0\) is also certified robust with radius \(\epsilon/\sqrt d\) in \(\ell_\infty\) norm. This conversion derives \(\ell_\infty\) certificates from existing \(\ell_2\)-based certification, which is empirically shown to achieve the highest certified robustness for probabilistic approaches under the \(\ell_\infty\) adversary.
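For the Gaussian case, the Neyman-Pearson certification and the norm conversion above can be sketched as follows (assuming SciPy; `k` counts the noisy samples assigned to the top label out of `n`):

```python
from scipy.stats import beta, norm

def certified_l2_radius(k, n, sigma, alpha=0.001):
    """Neyman-Pearson l_2 radius for Gaussian smoothing: sigma * Phi^{-1}(p_lower),
    where p_lower is a one-sided Clopper-Pearson lower confidence bound on the
    top-class probability, holding with confidence 1 - alpha."""
    if k == 0:
        return 0.0
    p_lower = beta.ppf(alpha, k, n - k + 1)
    return sigma * norm.ppf(p_lower) if p_lower > 0.5 else 0.0  # abstain below 1/2

def linf_radius_from_l2(r_l2, d):
    """l_inf radius implied by an l_2 certificate in dimension d: the l_inf ball
    of radius r / sqrt(d) lies inside the l_2 ball of radius r."""
    return r_l2 / d ** 0.5

# e.g., linf_radius_from_l2(certified_l2_radius(99_500, 100_000, sigma=0.5), 3 * 32 * 32)
```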
We tune the hyper-parameters to achieve the best performance for each approach.
Results on CIFAR-10
Certified accuracy under attack radius \(\epsilon\), per adversary:

Under the \(\ell_2\) adversary:

| Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon=0.25\) | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 |
|---|---|---|---|---|---|---|---|---|---|
| Wide ResNet 40-2 | Differential Privacy Based | Data Augmentation | Gaussian | 34.2% | 14.8% | 6.8% | 2.2% | 0.0% | 0.0% |
| Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Gaussian | 68.8% | 46.8% | 36.0% | 25.4% | 19.8% | 15.6% |
| Wide ResNet 40-2 | \(f\)-Divergence | Data Augmentation | Gaussian | 62.2% | 41.8% | 27.2% | 19.2% | 14.2% | 11.4% |
| ResNet-110 | Neyman-Pearson | Data Augmentation | Gaussian | 61.2% | 43.2% | 32.0% | 22.4% | 17.2% | 14.0% |
| ResNet-110 | Neyman-Pearson | Adversarial Training | Gaussian | 73.0% | 57.8% | 48.2% | 37.2% | 33.6% | 28.2% |
| ResNet-110 | Neyman-Pearson | Adversarial + Pretraining | Gaussian | 81.8% | 62.6% | 52.4% | 37.2% | 34.0% | 30.2% |
| ResNet-110 | Neyman-Pearson | MACER | Gaussian | 68.8% | 52.6% | 40.4% | 33.0% | 27.8% | 25.0% |
| ResNet-110 | Neyman-Pearson | ADRE | Gaussian | 68.0% | 50.2% | 37.8% | 30.2% | 23.0% | 17.0% |

Under the \(\ell_1\) adversary:

| Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon=0.5\) | 1.0 | 1.5 | 2.0 | 3.0 | 4.0 |
|---|---|---|---|---|---|---|---|---|---|
| Wide ResNet 40-2 | Differential Privacy Based | Data Augmentation | Laplace | 43.0% | 20.8% | 12.2% | 7.2% | 1.4% | 0.0% |
| Wide ResNet 40-2 | Renyi Divergence | Data Augmentation | Laplace | 58.2% | 39.4% | 27.0% | 16.8% | 9.2% | 4.0% |
| Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Laplace | 58.4% | 39.6% | 27.0% | 17.2% | 9.2% | 4.2% |
| Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Uniform | 69.2% | 56.6% | 48.0% | 39.4% | 26.0% | 20.4% |

Under the \(\ell_\infty\) adversary:

| Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon=1/255\) | 2/255 | 4/255 | 8/255 |
|---|---|---|---|---|---|---|---|
| Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Gaussian | 71.4% | 52.0% | 29.0% | 12.8% |
| Wide ResNet 40-2 | Neyman-Pearson | Adversarial Training | Gaussian | 83.2% | 65.0% | 49.6% | 25.4% |
Results on ImageNet
Certified accuracy under attack radius \(\epsilon\), per adversary:

Under the \(\ell_2\) adversary:

| Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon=0.5\) | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | Differential Privacy Based | Data Augmentation | Gaussian | 26.0% | 12.2% | 4.4% | 0.0% | 0.0% | 0.0% |
| ResNet-50 | Neyman-Pearson | Data Augmentation | Gaussian | 49.2% | 37.4% | 29.0% | 19.2% | 14.8% | 12.0% |
| ResNet-50 | \(f\)-Divergence | Data Augmentation | Gaussian | 43.4% | 30.4% | 13.6% | 3.2% | 0.0% | 0.0% |
| ResNet-50 | Neyman-Pearson | Adversarial Training | Gaussian | 56.4% | 44.8% | 38.2% | 28.0% | 25.6% | 20.0% |
| ResNet-50 | Neyman-Pearson | MACER | Gaussian | 57.0% | 43.2% | 31.4% | 24.8% | 17.6% | 14.0% |
| ResNet-50 | Neyman-Pearson | ADRE | Gaussian | 57.0% | 41.8% | 30.0% | 23.6% | 17.8% | 14.2% |

Under the \(\ell_1\) adversary:

| Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon=0.5\) | 1.0 | 1.5 | 2.0 | 3.0 | 4.0 |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | Differential Privacy Based | Data Augmentation | Laplace | 39.0% | 26.2% | 17.8% | 13.0% | 6.8% | 0.0% |
| ResNet-50 | Renyi Divergence | Data Augmentation | Laplace | 48.2% | 40.4% | 31.0% | 25.8% | 19.0% | 13.6% |
| ResNet-50 | Neyman-Pearson | Data Augmentation | Laplace | 49.0% | 40.8% | 31.2% | 26.0% | 19.0% | 13.6% |
| ResNet-50 | Neyman-Pearson | Data Augmentation | Uniform | 55.2% | 49.0% | 45.6% | 42.0% | 32.8% | 24.8% |

Under the \(\ell_\infty\) adversary:

| Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon=1/255\) | 2/255 | 4/255 | 8/255 |
|---|---|---|---|---|---|---|---|
| ResNet-50 | Neyman-Pearson | Data Augmentation | Gaussian | 28.2% | 11.4% | 0.0% | 0.0% |
| ResNet-50 | Neyman-Pearson | Adversarial Training | Gaussian | 38.2% | 20.4% | 4.6% | 0.0% |
- For both \(\ell_1\) and \(\ell_2\) adversaries, Neyman-Pearson based verification achieves the tightest results among the evaluated approaches.
- The robust training approaches effectively enhance the models' certified robustness. Among them, adversarial training usually achieves the best certified accuracy, and pretraining further improves it.
- The choice of smoothing distribution can greatly affect the certified accuracy of smoothed models; notably, under the \(\ell_1\) adversary, the best results are achieved with the Uniform smoothing distribution.
- For probabilistic verification approaches, certifying robustness under the \(\ell_\infty\) adversary is challenging, and becomes more challenging as the data dimension increases (ImageNet has higher dimension than CIFAR-10, hence the relatively lower certified accuracy), consistent with the \(\epsilon/\sqrt d\) conversion rule above.