Deterministic Verification Approaches (for DNNs)

We evaluate the following 17 approaches.

| Category | Name | Implementation |
| --- | --- | --- |
| Complete > Solver-Based | Bounded MILP | Reimplementation |
| Complete > Branch-and-Bound | AI2 | From ERAN |
| Incomplete > Linear Relaxation > Linear Programming | LP-Full [1,2] | From CNN-Cert |
| Incomplete > Linear Relaxation > Linear Inequality > Interval | IBP | Reimplementation |
| Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | Fast-Lin | From CNN-Cert |
| Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | CROWN | From CROWN-IBP |
| Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | CNN-Cert | From CNN-Cert |
| Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | CROWN-IBP | From CROWN-IBP |
| Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | DeepPoly | From ERAN |
| Incomplete > Linear Relaxation > Linear Inequality > Polyhedra | RefineZono | From ERAN |
| Incomplete > Linear Relaxation > Linear Inequality > Duality | WK [1,2] | From convex_adversarial |
| Incomplete > Linear Relaxation > Multi-Neuron Relaxation | K-ReLU | From ERAN |
| Incomplete > SDP | SDPVerify | Reimplementation |
| Incomplete > SDP | LMIVerify | Reimplementation |
| Incomplete > Lipschitz > General Lipschitz | Op-Norm | From RecurJac-and-CROWN |
| Incomplete > Lipschitz > General Lipschitz | FastLip | From RecurJac-and-CROWN |
| Incomplete > Lipschitz > General Lipschitz | RecurJac | From RecurJac-and-CROWN |
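
To give a flavor of the simplest relaxation in this taxonomy, the sketch below propagates interval (box) bounds through one affine layer followed by ReLU, the core step of IBP. It is a minimal NumPy illustration of ours, not code from any of the listed tools.

```python
import numpy as np

def ibp_affine_relu(l, u, W, b):
    """Propagate the box [l, u] through y = relu(W @ x + b).

    Interval arithmetic: positive weights carry lower bounds to lower
    bounds; negative weights swap them. ReLU is monotone, so applying it
    elementwise to both bounds stays sound.
    """
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    out_l = W_pos @ l + W_neg @ u + b
    out_u = W_pos @ u + W_neg @ l + b
    return np.maximum(out_l, 0.0), np.maximum(out_u, 0.0)

# Toy usage: output bounds for an l_inf ball of radius 0.1 around x.
W, b = np.array([[1.0, -2.0], [0.5, 0.3]]), np.zeros(2)
x, eps = np.array([0.2, 0.7]), 0.1
print(ibp_affine_relu(x - eps, x + eps, W, b))
```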

Architectures

We choose 7 network architectures for evaluation: 3 fully-connected neural networks (FCNNa, FCNNb, FCNNc) and 4 convolutional neural networks (CNNa, CNNb, CNNc, CNNd). These architectures are taken from the literature and adapted for each evaluated dataset (MNIST and CIFAR-10).

|   | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Size on MNIST | 50 neurons, 16,330 params | 310 neurons, 99,710 params | 7,178 neurons, 7,111,690 params | 4,814 neurons, 166,406 params | 24,042 neurons, 833,786 params | 48,074 neurons, 1,974,762 params | 176,138 neurons, 13,314,634 params |
| Size on CIFAR-10 | 50 neurons, 62,090 params | 310 neurons, 328,510 params | 7,178 neurons, 9,454,602 params | 6,254 neurons, 214,918 params | 31,242 neurons, 1,079,834 params | 62,474 neurons, 2,466,858 params | 229,898 neurons, 17,247,946 params |
| Structure | Flatten -> 3 FCs | Flatten -> 4 FCs | Flatten -> 8 FCs | 2 Convs -> Flatten -> 2 FCs | 4 Convs -> Flatten -> 2 FCs | 4 Convs -> Flatten -> 3 FCs | 5 Convs -> Flatten -> 2 FCs |
| Source | \(2 \times [20]\) from [1,2] | \(3 \times [100]\), enlarged from [1,2] | \(7 \times [1024]\) from [1,3] | Conv-Small in [4,5] | Half-sized Conv-Large in [5] | Conv-Large in [5] | Double-sized Conv-Large in [5] |
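
As a concrete example, FCNNa (Flatten -> 3 FCs, \(2 \times [20]\) hidden units) can be written in a few lines of PyTorch. This is our own sketch; ReLU activations are assumed, consistent with the stated neuron and parameter counts.

```python
import torch.nn as nn

# FCNNa on MNIST: 784 -> 20 -> 20 -> 10.
# Neurons: 20 + 20 + 10 = 50; params: (784*20+20) + (20*20+20) + (20*10+10)
# = 16,330, matching the table. For CIFAR-10 the input size becomes
# 3*32*32 = 3072, giving the listed 62,090 params.
fcnn_a = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 10),
)
```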

Training Methods

For each of the 7 architectures, the model is trained by 5 approaches, yielding 5 concrete models: regular training on clean data (clean), adversarial training at a small and a large radius (adv1/adv3 on MNIST, adv2/adv8 on CIFAR-10, where the suffix encodes the training radius, e.g., \(\epsilon=0.1\) for adv1 and \(\epsilon=2/255\) for adv2), and CROWN-IBP certified training at the same radii (cadv1/cadv3 on MNIST, cadv2/cadv8 on CIFAR-10).

We choose these training configurations to reflect the three common types of models on which verification approaches are used: vanilla (undefended) models, empirically defended models, and certifiably trained models. All models are trained until they reach their expected robustness.
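
For intuition on how certified training differs from adversarial training, the sketch below shows the IBP-style robust loss: worst-case logits are built from sound output bounds and fed to cross-entropy. This is a simplified illustration under our own naming; the actual CROWN-IBP objective additionally mixes in tighter CROWN-style bounds, which we omit here.

```python
import torch
import torch.nn.functional as F

def worst_case_logits(logit_l, logit_u, y):
    """Adversary-pessimal logits consistent with sound bounds: the true
    class gets its lower bound, every other class its upper bound."""
    true_mask = F.one_hot(y, logit_l.shape[-1]).bool()
    return torch.where(true_mask, logit_l, logit_u)

def certified_training_loss(logits, logit_l, logit_u, y, kappa=0.5):
    """Weighted mix of natural and robust cross-entropy, as in IBP-style
    certified training; logit_l/logit_u come from bound propagation over
    the l_inf training ball (e.g., the IBP step sketched earlier)."""
    natural = F.cross_entropy(logits, y)
    robust = F.cross_entropy(worst_case_logits(logit_l, logit_u, y), y)
    return kappa * natural + (1.0 - kappa) * robust
```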

Clean and Empirical Robust Accuracies

We report the clean accuracies and empirical robust accuracies of all these models.

Clean accuracy: accuracy on the whole original test set with no perturbation.

Empirical robust accuracy: accuracy under a 100-step PGD attack on the whole original test set, where the attack is under the \(\ell_\infty\) norm and the perturbation radius \(\epsilon\) equals the corresponding training radius (e.g., \(0.1\) for adv1 and cadv1, \(0.3\) for adv3 and cadv3). The step size is \(\epsilon/50\), following the literature. Since clean models have low robustness, we do not report their empirical robust accuracy.
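
A minimal PyTorch sketch of the PGD attack as configured above (100 steps, step size \(\epsilon/50\), \(\ell_\infty\) projection); the function name is ours, and an untargeted cross-entropy objective is assumed.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, steps=100):
    """Untargeted l_inf PGD with step size eps/50, as in the evaluation
    above. Random initialization is omitted for brevity."""
    alpha = eps / 50
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()           # gradient ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto l_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)                          # stay in valid pixel range
    return x_adv.detach()
```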


Clean Accuracy (MNIST)

|   | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| clean | 93.63% | 96.12% | 95.05% | 98.48% | 98.85% | 98.85% | 99.20% |
| adv1 | 93.36% | 97.12% | 98.00% | 99.01% | 99.33% | 99.24% | 99.37% |
| cadv1 | 88.35% | 95.23% | 96.89% | 98.52% | 98.87% | 98.83% | 99.13% |
| adv3 | 76.77% | 89.96% | 83.27% | 98.20% | 99.01% | 99.16% | 99.38% |
| cadv3 | 45.79% | 76.21% | 35.87% | 96.26% | 98.14% | 98.11% | 98.58% |

Empirical Robust Accuracy (MNIST)

|   | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| clean | / | / | / | / | / | / | / |
| adv1 (\(\epsilon=0.1\)) | 78.39% | 85.18% | 87.40% | 95.48% | 96.09% | 96.35% | 97.64% |
| cadv1 (\(\epsilon=0.1\)) | 80.99% | 90.67% | 93.43% | 97.18% | 98.12% | 98.41% | 98.50% |
| adv3 (\(\epsilon=0.3\)) | 31.67% | 33.30% | 33.31% | 86.06% | 92.84% | 93.47% | 95.09% |
| cadv3 (\(\epsilon=0.3\)) | 41.35% | 69.90% | 34.08% | 93.06% | 95.88% | 96.10% | 96.72% |


Clean Accuracy (CIFAR-10)

|   | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| clean | 38.46% | 41.76% | 46.37% | 59.45% | 65.65% | 58.60% | 83.53% |
| adv2 | 41.17% | 44.31% | 36.19% | 60.40% | 68.66% | 65.59% | 83.65% |
| cadv2 | 38.97% | 44.18% | 46.34% | 54.78% | 58.81% | 59.46% | 60.74% |
| adv8 | 34.13% | 37.68% | 25.21% | 49.42% | 55.15% | 54.01% | 72.25% |
| cadv8 | 30.59% | 32.05% | 30.30% | 40.51% | 40.05% | 40.11% | 40.61% |

Empirical Robust Accuracy (CIFAR-10)

|   | FCNNa | FCNNb | FCNNc | CNNa | CNNb | CNNc | CNNd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| clean | / | / | / | / | / | / | / |
| adv2 (\(\epsilon=2/255\)) | 39.49% | 42.99% | 35.87% | 57.39% | 64.01% | 62.05% | 76.41% |
| cadv2 (\(\epsilon=2/255\)) | 38.71% | 43.83% | 45.42% | 54.17% | 58.76% | 59.25% | 59.88% |
| adv8 (\(\epsilon=8/255\)) | 31.36% | 34.01% | 25.87% | 41.87% | 45.35% | 45.91% | 53.86% |
| cadv8 (\(\epsilon=8/255\)) | 30.36% | 31.40% | 29.59% | 39.93% | 39.36% | 39.50% | 39.94% |

Datasets

We evaluate on two datasets: MNIST and CIFAR-10 (in the SoK paper we only present results on CIFAR-10). We do not consider larger datasets like TinyImageNet, because most verification approaches either cannot handle models at that scale or can only certify low robust accuracy (e.g., \(20\%\) against an \(\ell_\infty\) adversary with a \(1/255\) attack radius).

MNIST is a set of \(28 \times 28\) gray-scale images.

CIFAR-10 is a set of \(3 \times 32 \times 32\) images.

Adversary

We focus on the \(\ell_\infty\) adversary since, to the best of our knowledge, it is supported by the largest number of verification approaches. On MNIST, we evaluate under attack radii \(0.02\), \(0.1\), and \(0.3\). On CIFAR-10, we evaluate under attack radii \(0.5/255\), \(2/255\), and \(8/255\).

Metrics

On each dataset, we uniformly sample 100 test set samples as the fixed evaluation set. We evaluate with two metrics: certified accuracy and average certified robustness radius (in the SoK paper we only report certified accuracy due to the space limit).

  • Certified Accuracy: As described on the front page, certified accuracy = (# samples verified to be robust) / (# all evaluated samples) under the given \((\ell_\infty, \epsilon)\)-adversary. We also report the robust accuracy under empirical attack (PGD), which upper-bounds the certified robust accuracy and lets us estimate the gap between certification and existing attacks. The time limit is 60 s per instance, and we count timeout instances as non-robust.

  • Average Certified Robustness Radius: We also evaluate the verification approaches by their average certified robustness radius \(\bar r\), the average \(\ell_\infty\) radius a verification approach can certify on the given subset of test samples (the same uniformly sampled set as in the certified accuracy evaluation). Since certifiability is monotone in the radius, we determine each approach's best certified radius by binary search on the interval \([0, 0.5]\), the largest possible radius for \([0, 1]\)-bounded MNIST and CIFAR-10 inputs: if the current radius \(mid\) is verified to be robust, we update the current best to \(mid\) and set \(l \gets mid\); otherwise we set \(r \gets mid\); we stop upon reaching a precision of \(0.01\) on MNIST or \(0.001\) on CIFAR-10, or when time runs out. Because the binary search requires multiple verification runs per input, we set the time limit to 120 s per input and record the largest radius verified before the time is used up (see the sketch below). We also report the smallest radius of adversarial examples found by empirical attack (PGD), which upper-bounds the certified robustness radius and lets us estimate the gap.
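
A compact sketch of this radius search in Python; the `verify` callable stands in for whichever verification approach is being evaluated, and is assumed to soundly answer whether the model is robust at radius r.

```python
import time

def certified_radius(verify, x, precision=0.01, budget_s=120.0):
    """Binary-search the largest certifiable l_inf radius at input x.

    Soundness of `verify(x, r) -> bool` makes certifiability monotone
    in r, which justifies the binary search. The 120 s budget mirrors
    the per-input time limit described above.
    """
    lo, hi, best = 0.0, 0.5, 0.0    # 0.5 is the max radius for [0,1] inputs
    deadline = time.monotonic() + budget_s
    while hi - lo > precision and time.monotonic() < deadline:
        mid = (lo + hi) / 2
        if verify(x, mid):
            best, lo = mid, mid     # certified: try larger radii
        else:
            hi = mid                # not certified (or unknown): shrink
    return best
```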

Experimental Environment

For MNIST experiments, we run the evaluation on a 24-core Intel Xeon E5-2650 CPU running at 2.20 GHz with a single NVIDIA GeForce GTX 1080 Ti GPU. For CIFAR-10 experiments, we run the evaluation on a 24-core Intel Xeon Platinum 8259CL CPU running at 2.50 GHz with a single NVIDIA Tesla T4 GPU. The CIFAR-10 experiments use a slightly faster environment because of their larger model scale.

Observations:

  • On relatively small models (e.g., FCNNa and FCNNb), complete verification approaches can verify robustness effectively, so they are the best choice.
  • On larger models (e.g., CNNb, CNNc, CNNd), linear relaxation based verification approaches usually perform the best: complete verification is too slow, and the other approaches are too loose, yielding almost 0% certified accuracy. However, linear relaxation based verification still cannot handle large DNNs, and it remains loose compared with the upper bound provided by PGD evaluation.
  • On robustly trained models, if the robust training approach is CROWN-IBP, which is tailored to IBP and CROWN, those two verification approaches certify high accuracy (e.g., 25%-28% on cadv8 for CNNd on CIFAR-10) while other approaches fail to verify anything. In general, robust training can boost certified accuracy, but the models must be verified with their corresponding verification approaches. Similar observations appear in the literature.
  • SDP approaches usually take too long and are thus less practical.
  • Generally, the tendency on MNIST mirrors that on CIFAR-10, despite the different input sizes.
  • From the additional evaluations of average certified robustness radius, we find that the average radius distinguishes the approaches' precision more finely than certified accuracy does. For example, on small models such as FCNNa and FCNNb, measured by certified accuracy, complete verification approaches and some linear relaxation based approaches look almost equally precise (compare Bounded MILP and AI2 with CROWN and CNN-Cert). Measured by average certified robustness radius, however, complete verification approaches certify a much larger radius, since they use no relaxation and these models are small enough to be certified completely.
  • From the running time statistics, we observe that a main cause of the failing cases (i.e., 0% certified accuracy or zero certified radius) is excessive verification time. In particular, since the average certified robustness radius requires multiple verification runs per instance, inefficient approaches perform relatively worse under this metric.

Probabilistic Verification Approaches (for smoothed DNNs)

When evaluating certified robustness for smoothed DNNs, we evaluate the impact of the following three aspects:

  • Verification Approaches: we evaluate 4 robustness verification approaches.

| Verification | Evaluated Adversary Type |
| --- | --- |
| Differential Privacy Based [1] | \(\ell_1\), \(\ell_2\) |
| Neyman-Pearson [2, 3, 4, 5] | \(\ell_1\), \(\ell_2\), \(\ell_\infty\) |
| \(f\)-Divergence [6] | \(\ell_2\) |
| Renyi Divergence [7] | \(\ell_1\) |
  • Robust Training Approaches: we evaluate 5 robust training approaches.

| Training | Evaluated Adversary Type |
| --- | --- |
| Data Augmentation [2, 3] | \(\ell_1\), \(\ell_2\), \(\ell_\infty\) |
| Adversarial Training [8] | \(\ell_2\) |
| Adversarial + Pretraining [8, 9] | \(\ell_2\) |
| MACER [10] | \(\ell_2\) |
| ADRE [11] | \(\ell_2\) |
  • Smoothing Distributions: we compare the Gaussian, Laplace, and uniform distributions (see the sketch below).
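
All of these approaches reason about a smoothed classifier \(g(x) = \arg\max_c \mathbb{P}_{\delta}(f(x+\delta) = c)\) under the chosen noise distribution. Below is a minimal Monte-Carlo sketch of smoothed prediction (our own naming; swap the noise sampler to switch distributions):

```python
import numpy as np

def smoothed_predict(f, x, sigma, n=1000, rng=None):
    """Estimate the smoothed classifier's top label by majority vote.

    f maps a single input to a class id; Gaussian noise with std sigma
    is used here. n = 1,000 matches the label-selection stage of the
    evaluation protocol below; replace the sampler with Laplace or
    uniform noise for the other smoothing distributions.
    """
    rng = rng or np.random.default_rng()
    counts = {}
    for _ in range(n):
        label = f(x + rng.normal(0.0, sigma, size=x.shape))
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)
```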

Datasets

We evaluate on CIFAR-10 and ImageNet. CIFAR-10 is a set of \(3 \times 32 \times 32\) images, and ImageNet is a set of images rescaled to \(3 \times 224 \times 224\).

Model Architectures

Following common practice in the literature [2, 3], we use ResNet-110 and Wide ResNet 40-2 on CIFAR-10, and ResNet-50 on ImageNet.

Evaluation Protocol

Following common practice, on both datasets, \(n = 1{,}000\) samples are used for selecting the top label and \(N = 10^5\) samples are used for certification. The failure probability is set to \(\alpha = 0.001\). For both datasets, we evaluate on a subset of 500 samples uniformly drawn from the test set. We report and compare the certified accuracy at a given attack radius \(\epsilon\) under \(\ell_p\) adversaries for \(p = 1, 2, \infty\). To evaluate \(\ell_\infty\) robustness, we use the following norm conversion rule: given input dimension \(d\), a model certified robust with radius \(\epsilon\) in \(\ell_2\) norm at point \(x_0\) is also certified robust with radius \(\epsilon/\sqrt{d}\) in \(\ell_\infty\) norm. This conversion derives \(\ell_\infty\) certificates from existing \(\ell_2\)-based certification, which is empirically shown to achieve the highest certified robustness among probabilistic approaches under the \(\ell_\infty\) adversary.
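
A sketch of this protocol for the Neyman-Pearson (Gaussian) case, including the \(\ell_\infty\) conversion. Function names are ours, the Clopper-Pearson bound comes from SciPy (>= 1.7), and details such as batching are omitted; this is an illustration of the standard two-stage procedure, not the exact evaluated implementation.

```python
import numpy as np
from collections import Counter
from scipy.stats import norm, binomtest

def certify_l2(f, x, sigma, n0=1000, n=100000, alpha=0.001, rng=None):
    """Certified l_2 radius of a Gaussian-smoothed classifier at x.

    Stage 1 (n0 samples) selects the candidate label; stage 2 (n samples)
    gives a one-sided (1 - alpha) Clopper-Pearson lower bound pA on its
    probability. The certified radius is then sigma * Phi^{-1}(pA).
    """
    rng = rng or np.random.default_rng()
    draw = lambda m: Counter(f(x + rng.normal(0.0, sigma, size=x.shape)) for _ in range(m))
    top = draw(n0).most_common(1)[0][0]
    k = draw(n)[top]
    # Lower end of a two-sided (1 - 2*alpha) exact CI = one-sided (1 - alpha) bound.
    p_a = binomtest(k, n).proportion_ci(confidence_level=1 - 2 * alpha, method="exact").low
    if p_a <= 0.5:
        return None                  # abstain: cannot certify
    return sigma * norm.ppf(p_a)

def linf_from_l2(radius_l2, d):
    """Norm conversion used above, e.g. d = 3 * 224 * 224 for ImageNet."""
    return radius_l2 / np.sqrt(d)
```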

We tune the hyper-parameters to achieve the best performance for each approach.

Certified accuracy under \(\ell_2\) adversary on CIFAR-10:

| Adversary | Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon\)=0.25 | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(\ell_2\) | Wide ResNet 40-2 | Differential Privacy Based | Data Augmentation | Gaussian | 34.2% | 14.8% | 6.8% | 2.2% | 0.0% | 0.0% |
| \(\ell_2\) | Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Gaussian | 68.8% | 46.8% | 36.0% | 25.4% | 19.8% | 15.6% |
| \(\ell_2\) | Wide ResNet 40-2 | \(f\)-Divergence | Data Augmentation | Gaussian | 62.2% | 41.8% | 27.2% | 19.2% | 14.2% | 11.4% |
| \(\ell_2\) | ResNet-110 | Neyman-Pearson | Data Augmentation | Gaussian | 61.2% | 43.2% | 32.0% | 22.4% | 17.2% | 14.0% |
| \(\ell_2\) | ResNet-110 | Neyman-Pearson | Adversarial Training | Gaussian | 73.0% | 57.8% | 48.2% | 37.2% | 33.6% | 28.2% |
| \(\ell_2\) | ResNet-110 | Neyman-Pearson | Adversarial + Pretraining | Gaussian | 81.8% | 62.6% | 52.4% | 37.2% | 34.0% | 30.2% |
| \(\ell_2\) | ResNet-110 | Neyman-Pearson | MACER | Gaussian | 68.8% | 52.6% | 40.4% | 33.0% | 27.8% | 25.0% |
| \(\ell_2\) | ResNet-110 | Neyman-Pearson | ADRE | Gaussian | 68.0% | 50.2% | 37.8% | 30.2% | 23.0% | 17.0% |

Certified accuracy under \(\ell_1\) adversary on CIFAR-10:

| Adversary | Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon\)=0.5 | 1.0 | 1.5 | 2.0 | 3.0 | 4.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(\ell_1\) | Wide ResNet 40-2 | Differential Privacy Based | Data Augmentation | Laplace | 43.0% | 20.8% | 12.2% | 7.2% | 1.4% | 0.0% |
| \(\ell_1\) | Wide ResNet 40-2 | Renyi Divergence | Data Augmentation | Laplace | 58.2% | 39.4% | 27.0% | 16.8% | 9.2% | 4.0% |
| \(\ell_1\) | Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Laplace | 58.4% | 39.6% | 27.0% | 17.2% | 9.2% | 4.2% |
| \(\ell_1\) | Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Uniform | 69.2% | 56.6% | 48.0% | 39.4% | 26.0% | 20.4% |

Certified accuracy under \(\ell_\infty\) adversary on CIFAR-10:

| Adversary | Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon\)=1/255 | 2/255 | 4/255 | 8/255 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \(\ell_\infty\) | Wide ResNet 40-2 | Neyman-Pearson | Data Augmentation | Gaussian | 71.4% | 52.0% | 29.0% | 12.8% |
| \(\ell_\infty\) | Wide ResNet 40-2 | Neyman-Pearson | Adversarial Training | Gaussian | 83.2% | 65.0% | 49.6% | 25.4% |

Certified accuracy under \(\ell_2\) adversary on ImageNet:

| Adversary | Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon\)=0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(\ell_2\) | ResNet-50 | Differential Privacy Based | Data Augmentation | Gaussian | 26.0% | 12.2% | 4.4% | 0.0% | 0.0% | 0.0% |
| \(\ell_2\) | ResNet-50 | Neyman-Pearson | Data Augmentation | Gaussian | 49.2% | 37.4% | 29.0% | 19.2% | 14.8% | 12.0% |
| \(\ell_2\) | ResNet-50 | \(f\)-Divergence | Data Augmentation | Gaussian | 43.4% | 30.4% | 13.6% | 3.2% | 0.0% | 0.0% |
| \(\ell_2\) | ResNet-50 | Neyman-Pearson | Adversarial Training | Gaussian | 56.4% | 44.8% | 38.2% | 28.0% | 25.6% | 20.0% |
| \(\ell_2\) | ResNet-50 | Neyman-Pearson | MACER | Gaussian | 57.0% | 43.2% | 31.4% | 24.8% | 17.6% | 14.0% |
| \(\ell_2\) | ResNet-50 | Neyman-Pearson | ADRE | Gaussian | 57.0% | 41.8% | 30.0% | 23.6% | 17.8% | 14.2% |

Certified accuracy under \(\ell_1\) adversary on ImageNet:

| Adversary | Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon\)=0.5 | 1.0 | 1.5 | 2.0 | 3.0 | 4.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(\ell_1\) | ResNet-50 | Differential Privacy Based | Data Augmentation | Laplace | 39.0% | 26.2% | 17.8% | 13.0% | 6.8% | 0.0% |
| \(\ell_1\) | ResNet-50 | Renyi Divergence | Data Augmentation | Laplace | 48.2% | 40.4% | 31.0% | 25.8% | 19.0% | 13.6% |
| \(\ell_1\) | ResNet-50 | Neyman-Pearson | Data Augmentation | Laplace | 49.0% | 40.8% | 31.2% | 26.0% | 19.0% | 13.6% |
| \(\ell_1\) | ResNet-50 | Neyman-Pearson | Data Augmentation | Uniform | 55.2% | 49.0% | 45.6% | 42.0% | 32.8% | 24.8% |

Certified accuracy under \(\ell_\infty\) adversary on ImageNet:

| Adversary | Model Architecture | Verification | Training | Smoothing Distribution | \(\epsilon\)=1/255 | 2/255 | 4/255 | 8/255 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \(\ell_\infty\) | ResNet-50 | Neyman-Pearson | Data Augmentation | Gaussian | 28.2% | 11.4% | 0.0% | 0.0% |
| \(\ell_\infty\) | ResNet-50 | Neyman-Pearson | Adversarial Training | Gaussian | 38.2% | 20.4% | 4.6% | 0.0% |
On the original web page, neighboring rows sharing a background color (light yellow or light blue) form a variable-controlled comparison group, i.e., rows differing in only one factor such as the verification approach, the training approach, or the smoothing distribution; the highest certified accuracies within each group are bolded, and baseline rows may appear in more than one group.

  • For both \(\ell_1\) and \(\ell_2\) adversaries, Neyman-Pearson based verification achieves the tightest results among the evaluated approaches.
  • Robust training approaches effectively enhance the models' certified robustness. Among the existing robust training approaches, adversarial training usually achieves the best certified accuracy, and pretraining further improves it.
  • The choice of smoothing distribution can greatly affect the certified accuracy of smoothed models. Notably, under the \(\ell_1\) adversary, the best results are achieved with the uniform smoothing distribution.
  • For probabilistic verification approaches, certifying robustness under the \(\ell_\infty\) adversary is challenging, and becomes more so as the data dimension increases (ImageNet inputs have a higher dimension than CIFAR-10 inputs, hence the relatively lower certified accuracy).