From Nov 8, 2021 to Nov 10, 2021, the Simons Institute conducted the workshop Average-Case Complexity: From Cryptography to Statistical Learning, in the context of their semester program Computational Complexity of Statistical Inference. This post summarizes and extends parts of [Rag21W].
Statistical robustness is concerned with how estimators react to shifts in data distributions. One of the questions it tries to answer is the following: if one has a good estimator for a given parametric family, but it is fitted on data sampled from a distribution “slightly outside” the family, does the estimator still behave well? In practical applications the model is almost never correct, so it is of great interest to design estimators which are robust in this sense. This is sometimes called robustness to poisoning (i.e. to possibly malicious modification of the training set).
In the same spirit, “robustness” in supervised learning refers to the ability to preserve predictive power under different scenarios. In some critical applications, such as autonomous vehicles or intrusion detection, neural networks need to be robust against data specially crafted to fool them at test time. Here we focus on this case, namely evaluation on a distribution slightly different from the training one, sometimes known as adversarial robustness.
Several methods exist to improve test-time robustness, with adversarial training forming a family of the most popular and successful ones.1 1 Since the seminal works [Sze13I] on adversarial attacks in 2013 and [Goo15E] on adversarial training in 2014, thousands of papers have appeared in the field. For a recent review and taxonomy of the field, we refer to [Bai21R]. However, despite their success and that of other heuristics, these methods typically lack theoretical guarantees. As a matter of fact, for every adversarial training technique published in recent years, a counter-attack beating it has quickly been developed (see the references in [Rag21W]).
1 Bounding the maximal error
A sensible path is therefore to avoid a never-ending game of cat and mouse of devising heuristic attacks and counter-attacks, and instead to try to bound or reduce the maximal expected loss when perturbations of the test data within a certain class are allowed.2 2 This usually means allowing perturbations of test data up to a magnitude $\varepsilon > 0$ in some norm. For vision tasks, it is natural and convenient to take $\ell_{\infty}$: an $\varepsilon$-ball around $x$ contains all images whose pixels differ by at most $\varepsilon$ from those of $x$.
Bounds on this error can be added as a penalty term to the loss, which directly reduces the maximal error that a class of attacks can achieve. As we will see below, this technique also provides upper bounds on the error during training, cf. the end of Section 3. The additional term, however, comes at the cost of the final predictive power of the model (and, interestingly, it usually also increases sparsity in the weights).
It is therefore advantageous to instead compute bounds for arbitrary, powerful networks (known as “verification-agnostic” networks because they haven’t been specifically trained to reduce this error). However, the methods to do so typically suffer from problems of scalability and tightness, losing power in particular when large modifications to the inputs are allowed.
There are three main lines of work pursuing these two goals.
Exact verification falls in the second category: it tries to bound the exact error of arbitrary (albeit small) networks. In order to do this, it performs an exhaustive search over the class of allowed perturbations for each input. This is usually done with Satisfiability Modulo Theories or Mixed Integer Programming formulations, and despite several optimizations in the literature, their cost is typically prohibitive for modern neural networks.3 3 These are on par with some convex relaxations, such as SDPs solved with interior point methods; see below. For this reason, we won’t cover any such techniques in this post.
Convex relaxations of the bound on the maximal error can be used in either of the two forms mentioned: for post-training bounding or during training. The idea is to express the maximal error that an attacker can cause for any given test datum $(x, y)$ as a Quadratically Constrained Quadratic Program (QCQP), typically a maximization over a set of allowed perturbations of $x$ with $y$ fixed. Then one uses linear or semidefinite relaxations to convexify the objective. Linear Programming (LP) methods scale well but provide certifications which are typically too loose, and some research shows that there is a fundamental gap that cannot be closed with linear relaxations, see e.g. [Rag18S] (Proposition 1) or [Sal20C]. On the other hand, Semi-Definite Programs (SDPs) provide tighter bounds but scale poorly when using off-the-shelf interior point methods.4 4 Specifically, for $n$ nodes, $\mathcal{O} (n^6)$ in runtime and $\mathcal{O} (n^4)$ in memory, see refs [41, 60] in [Dat20E]. Recent work leverages dual formulations of the SDPs, achieving linear memory cost and running times while still providing reasonably good bounds, see Section 5.
Finally, randomized smoothing belongs in a different category, since it is based on ideas in differential privacy and randomizes the outputs to improve robustness. We will not cover any such approach here.
We focus on convex relaxations, and review a series of papers using SDPs both for certification of pre-trained networks and penalised training.
2 The setting
For simplicity, we discuss binary classification of images, but the case of multiple classes is handled similarly and other data domains are also possible.
We allow a class of data manipulations of the form “change each pixel of an image by at most 10/255ths”. More precisely, an $\ell _{\infty}$-admissible attack is of the following form: given a data domain $\mathbb{R}^d$ and labels $\mathcal{Y}= \lbrace 0, 1 \rbrace$, we allow an attacker $A :\mathbb{R}^d \times \mathcal{Y} \rightarrow \mathbb{R}^d$ to modify any coordinate of $x \in \mathbb{R}^d$ by at most a fixed amount $\varepsilon > 0$, i.e. we allow any $\tilde{x} = A (x, y)$ with $| x - \tilde{x} |_{\infty} \leqslant \varepsilon$. The goal of the attacker is to make a classifier $h :\mathbb{R}^d \rightarrow [0, 1]^2$ produce an output $h (\tilde{x})$ which maximises the margin:5 5 This cumbersome choice of subindices subsumes the two possible situations for misclassification into one when subindices start at 1 and $\mathcal{Y}= \lbrace 0, 1 \rbrace$. While in the two-dimensional case the margin can always be reduced to an expression solely dependent on the confidence in the correct class, for higher-dimensional certification it is common to define the margin as a function of two selected classes. This notation then generalizes to such cases.
\begin{equation} m_{y} (\tilde{x}) := h_{2 - y} (\tilde{x}) - h_{1 + y} (\tilde{x}), \label{eq-margin}\tag{1} \end{equation}
where the two outputs $h_{1}, h_{2}$ are the confidences for classes 0 and 1 respectively. A positive margin means a successful attack. The attacker has access to all information about the network and data distribution (this full-information attack model is sometimes called a white-box attack).
We work with 2-layer networks of the form
\begin{equation} h (x) := V \sigma (Wx), \label{eq-two-layer-network}\tag{2} \end{equation}
with $V \in \mathbb{R}^{2 \times d_{1}}, W \in \mathbb{R}^{d_{1} \times d}$ and non-linearity $\sigma$.6 6 The limitation to two layers is only for simplicity. All but the first of the papers reviewed generalize to an arbitrary number of them.
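To fix ideas, here is a minimal numpy sketch of the network (2) and the margin (1). All names are illustrative and indexing is 0-based:

```python
import numpy as np

def h(x, V, W):
    """Two-layer network (2): confidences for classes 0 and 1."""
    return V @ np.maximum(W @ x, 0.0)   # V: (2, d1), W: (d1, d), x: (d,)

def margin(x, y, V, W):
    """Margin (1): positive iff the attack succeeds on (x, y)."""
    out = h(x, V, W)
    return out[1 - y] - out[y]          # h_{2-y} - h_{1+y} with 0-based indices
```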
Now, given a sample $(x, y)$ consider the worst possible modification of $x$ within $\varepsilon$ distance, leaving the label untouched:
\begin{equation} \delta _{y}^{\star} (x) := \underset{| x - \tilde{x} |_{\infty} \leqslant \varepsilon}{\operatorname{argmax}} m_{y} (\tilde{x}) . \label{eq-worst-modification}\tag{3} \end{equation}
Note that this maximization depends on $h$ and is typically intractable. The robust error is the maximal expected loss over data within an $\varepsilon$-ball around “true” data:
\[ E_{\operatorname{robust}} (h) := \mathbb{E}_{X, Y} [l (h (\delta _{Y}^{\star} (X)), Y)], \]
with $l (\hat{y}, y)$ being the 0-1 loss $\mathbb{I} (\hat{y} \neq y)$ in our case. The corresponding sample quantity, which we also call “robust error”, is the maximal proportion of misclassified test data under $\ell _{\infty}$-attacks of magnitude $\varepsilon > 0$:
\[ \hat{E}_{\operatorname{robust}} (h) := \frac{1}{n} \sum_{i = 1}^n l (h (\delta _{y_{i}}^{\star} (x_{i})), y_{i}) . \]
With the 0-1 loss, the $\operatorname{argmax}$ in $\delta _{y_{i}}^{\star}$ is not necessary to compute this error, since only the sign of the margin matters: if the quantity
\begin{equation} \operatorname{opt} (x, y) := \underset{| x - \tilde{x} |_{\infty} \leqslant \varepsilon}{\max} m_{y} (\tilde{x}) \label{eq-certification-opt}\tag{4} \end{equation}
is negative, the network assigns the highest confidence to the correct class in the whole $\varepsilon$-neighbourhood around $x$, and $l (h (\delta _{y}^{\star} (x)), y) = 0$ for this sample. In other words, one can count the number of points at which $\operatorname{opt}$ is positive to compute the robust error:
\begin{equation} \hat{E}_{\operatorname{robust}} (h) = \frac{1}{n} | \lbrace x_{i} :\operatorname{opt} (x_{i}, y_{i}) > 0 \rbrace | . \label{eq-empirical-robust-error}\tag{5} \end{equation}
If $\operatorname{opt} (x, y) \leqslant 0$ one says that $(x, y)$ is certified (or rather $(\ell _{\infty}, \varepsilon)$-certified): no admissible attack $A$ can deterministically fool the classifier for this sample. We call $m$ a certification function.7 7 In Section 5 we will see a method that generalizes to arbitrary quadratic certification functions.
The main difficulty for certification is the maximization (4), which in the works reviewed here is tackled with convex relaxations. One problem arising from this approach is evaluating the tightness of the bound because the exact maximization is typically intractable. As a proxy, a lower bound on $\operatorname{opt} (x, y)$ is obtained, e.g. with an adversarial attack like Projected Gradient Descent (PGD, [Mad18D]).8 8 The more powerful the attack, the tighter this lower bound will be.
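For the two-layer ReLU network above, such a PGD lower bound can be sketched in a few lines. This is a simplified variant of [Mad18D], not the exact implementation; it reuses `margin` from the sketch above, and the step size and iteration count are illustrative:

```python
import numpy as np

def pgd_lower_bound(x, y, V, W, eps, steps=40, lr=0.01):
    """Sign-gradient ascent on the margin within the eps-ball:
    returns a lower bound on opt(x, y) from (4)."""
    rng = np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    v = V[1 - y] - V[y]                            # V_{2-y} - V_{1+y}, 0-based
    for _ in range(steps):
        # gradient of m_y at x_adv: W^T (sigma'(W x_adv) * v), with ReLU sigma
        grad = W.T @ ((W @ x_adv > 0).astype(float) * v)
        x_adv = x_adv + lr * np.sign(grad)         # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
    return margin(x_adv, y, V, W)
```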
A word on the allowed threat model
Working within the framework of $\ell _{\infty}$ or, more generally, $\ell _{p}$ attacks is both mathematically convenient and of practical interest, since many attacks try to modify inputs in ways imperceptible to the human eye, a characteristic well modeled by small $\ell _{p}$ balls around samples. But it is important to note that this framework ignores more natural and often very effective manipulations of the empirical data distribution which can be used to force misclassifications.
One could for instance sample new images from the same source as the training data, obtaining qualitatively similar data on which the classifier fails often due to the network having overfitted to particulars of the specific training set.9 9 The field of “domain generalization” tries to devise methods that learn from multiple datasets pertaining to the same task but potentially obtained independently.
The class of $\ell _{p}$-bounded attacks also does not cover realistic attack scenarios like adversarial patches, which preserve the semantics of an image with alterations which, despite being obvious to the eye, might not stand out as an attack to the untrained observer.10 10 Of course, one can design countermeasures against these, but, as mentioned above, certification against broad classes of attacks is in general more desirable.
3 Certified robustness via gradient bounds
[Rag18C] studies the two different goals described above:
- To compute bounds on the maximal error that an attacker can cause for fixed weights $V, W$ of $h$ (see (2)).
- To learn new weights $V, W$ for $h$ which reduce this maximal error.
The authors circumvent the problem of computing (5) with a tractable upper bound
\[ U (m, x, y) \geqslant l (h (\delta _{y}^{\star} (x)), y), \]
which avoids the maximization (3). They define the $U$-certified error to be11 11 The corresponding sample quantity is then the fraction of points that can be potentially attacked.
\[ E_{\operatorname{cert}} := \mathbb{E}_{X, Y} [\mathbb{I} (U (m, X, Y) > 0)] . \]
By construction, no $\ell _{\infty}$-attack can attain an error higher than the certified error. Also, because it is a pointwise bound, wherever $U = 0$, the integrand in the robust error must be 0 as well.
The function $U$ is chosen to be a convex relaxation of (4) called MaxGrad. The name comes from the following bound on the margin (1):
\begin{eqnarray*} m_{y} (\tilde{x}) & = & m_{y} (x) + \int_{0}^1 \nabla m_{y} (t \tilde{x} + (1 - t) x)^{\top} (\tilde{x} - x) \mathrm{d} t\\\ & \leqslant & m_{y} (x) + \varepsilon \underset{| \tilde{x} - x |_{\infty} \leqslant \varepsilon}{\max} | \nabla m_{y} (\tilde{x}) |_{1}\\\ & \leqslant & m_{y} (x) + \varepsilon \underset{t \in [- 1, 1]^d}{\underset{s \in [- 1, 1]^{d_{1}}}{\max}} \frac{1}{2} t^{\top} W^{\top} ((V_{1} - V_{2}) \odot \left( \textbf{1} + s \right)), \end{eqnarray*}
where $\odot$ denotes the Hadamard product and the last inequality is valid in the case of two-layer networks with ReLU activations (see [Rag18C] (Eqs. (5) to (7)) for the details).12 12 The restriction to two layers (one hidden and a linear mapping for the outputs) is required for the argument in MaxGrad, but the activations can be swapped for any other with uniformly bounded derivative. The final right hand side is a non-convex quadratic maximization problem, which is relaxed to an SDP roughly as follows. First introduce new variables $a = (1, t, s)^{\top}$ and $M$, a matrix which is a function of $V_{1} - V_{2}$ and $W$. In the resulting problem, the new variable $a$ appears only through the rank-one matrix $P := aa^{\top} \in \mathbb{R}^{(1 + d + d_{1}) \times (1 + d + d_{1})}$. The rank-one condition is relaxed by allowing $P = AA^{\top}$ for an arbitrary $A$, or equivalently, $P \succcurlyeq 0$ (see [Rag18C] (Eqs (6) to (10)) for details). One obtains the SDP:
\begin{equation} m_{y} (x) + \frac{\varepsilon}{4} \underset{\operatorname{diag} (P) \preccurlyeq 1}{\underset{P \succcurlyeq 0}{\max}} \langle M, P \rangle, \tag{6} \end{equation}
which can be solved with off-the-shelf tools to obtain an upper bound denoted $\operatorname{MaxGrad} (m, x, y)$.
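Concretely, (6) is a standard SDP that such tools accept directly. Here is a minimal cvxpy sketch, taking the matrix $M$ as given (its assembly from $V_{1} - V_{2}$ and $W$ follows the derivation in [Rag18C]):

```python
import cvxpy as cp

def maxgrad_bound(m_x, M, eps):
    """Upper bound (6): m_y(x) + (eps/4) * max <M, P> over P >= 0, diag(P) <= 1."""
    n = M.shape[0]                        # n = 1 + d + d1, for a = (1, t, s)
    P = cp.Variable((n, n), PSD=True)
    prob = cp.Problem(cp.Maximize(cp.trace(M @ P)),   # <M, P> for symmetric M
                      [cp.diag(P) <= 1])
    prob.solve()                          # any off-the-shelf SDP solver
    return m_x + eps / 4 * prob.value
```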
Limitations
The MaxGrad method has three major disadvantages:
- It is limited to 2-layer networks, a problem solved later in [Rag18S], see Section 4.
- It has a very high computational cost, manageable only for very small networks, see footnote 4.
- It can yield very loose bounds. As a matter of fact, MaxGrad is effectively vacuous on adversarially trained networks: even for a very small $\varepsilon$, the quantity $\operatorname{MaxGrad}(m, x, y)$ is non-positive for less than 10% of the data.13 13 In the literature this is often phrased in reverse as the certified error being greater than 90%, which, as defined above, is the fraction of the points where $\operatorname{MaxGrad}(m, x, y) > 0$.
Certified robustness via a new objective
One way to mitigate the third problem is to add MaxGrad to the training objective, i.e. to set
\[ (\hat{V}, \hat{W}) \in \underset{V, W}{\operatorname{argmin}} \sum_{i = 1}^n l (h (x_{i}), y_{i}) + \lambda \underset{\operatorname{diag} (P) \preccurlyeq 1}{\underset{P \succcurlyeq 0}{\max}} \langle M, P \rangle, \]
with $\lambda > 0$ a regularization hyperparameter. Because computing the gradient of the above requires the expensive operation of solving the inner maximization, the authors use its dual formulation, which provides a cheaper upper bound on the primal, [Rag18C] (Eq. (15)).
Importantly, because every solution of the dual bounds the optimum of the primal from above, at every iteration of the optimization one has an upper bound on the certified error.
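Schematically, the penalised objective could look as follows. This is a sketch in jax, where `train_loss` and `dual_bound` are assumed helpers, the latter standing in for the dual of the inner maximization ([Rag18C] (Eq. (15))):

```python
import jax
import jax.numpy as jnp

def objective(params, xs, ys, lam=0.1, eps=0.1):
    V, W = params
    data_term = jnp.mean(train_loss(V, W, xs, ys))      # surrogate of the 0-1 loss
    # any dual-feasible value upper-bounds the inner max and is differentiable:
    cert_term = jnp.mean(dual_bound(V, W, xs, ys, eps))
    return data_term + lam * cert_term

objective_grad = jax.grad(objective)  # plug into any first-order optimizer
```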
By penalising the loss, what was a vacuous upper bound becomes non-vacuous in practice and a general strategy emerges: in order to obtain certified defenses, it is possible to train networks by minimizing a loose but tractable upper bound for the worst-case loss.
Subsequent work building upon this paper improves the certified error for some examples. However, the final real-world performance of these networks on unaltered test samples remains subpar, which motivates an interest in certifiers that operate on verification-agnostic networks.
4 Certified robustness with SDP
The previous technique changes the objective and obtains new parameters which are certified to have at most some given error. But what about arbitrary, or “foreign”, networks (i.e. not robustly trained)? We have seen that the gradient bound is mostly ineffective. Is it possible to improve it?
[Rag18S] does so by directly relaxing the maximization of the margin instead of bounding it with a first-order approximation. The method also generalizes to any number of layers (ignoring computational cost), but we leave this out for simplicity.14 14 One aspect of the generalization is the need to add bounds on the output of each layer analogous to those for the attack. The authors use very rough ones based on interval arithmetic, but suggest that they could be made tighter for added performance.
Fix some input $(x, y)$ and let $\tilde{x} = A (x, y)$ be an attack. Recall that we want to bound the margin
\begin{eqnarray*} m_{y} (\tilde{x}) & := & h_{2 - y} (\tilde{x}) - h_{1 + y} (\tilde{x})\\\ & = & (V_{2 - y} - V_{1 + y})^{\top} x^1, \end{eqnarray*}
where $x^1 = \sigma (W \tilde{x})$, and $V$ is the final linear layer of the network, i.e. we want, cf. (4):
\begin{eqnarray*} \operatorname{opt} (x, y) & = & \underset{\tilde{x} \in \mathbb{R}^d}{\max} (V_{2 - y} - V_{1 + y})^{\top} x^1\\\ & \text{s.t.} & x^1 = \sigma (W \tilde{x}), \tag{ReLU}\\\ & \operatorname{and} & | x - \tilde{x} |_{\infty} \leqslant \varepsilon . \tag{Attack} \end{eqnarray*}
Note how the non-linear activation function enters the problem as a constraint. The idea to make the maximization tractable is to write it as a quadratic problem, then relax the non-convex constraints into an SDP.
The key insight enabling this is that each ReLU activation $z_{j} = \max ((Wx )_{j}, 0)$ in the network can be expressed as the quadratic constraint:15 15 Later work has derived similar quadratic formulations for the sigmoid, tanh, leaky ReLU and other activations [Faz20S].
\[ z_{j} \geqslant 0, z_{j} \geqslant (Wx)_{j}, z_{j} (z_{j} - (Wx)_{j}) = 0, \quad \forall j. \]
The same applies to the $\ell _{\infty}$ bounds on the perturbed inputs: given $l_{j}, u_{j} \in \mathbb{R}$, the constraints $l_{j} \leqslant x_{j} \leqslant u_{j}$ for all $j$ are equivalent to $(x_{j} - l_{j}) (x_{j} - u_{j}) \leqslant 0$ for every $j$, or
\[ x_{j}^2 \leqslant (l_{j} + u_{j}) x_{j} - l_{j} u_{j}, \quad \forall j. \]
With this, one can rewrite the above as the following quadratic, non-convex problem, writing $v := V_{2 - y} - V_{1 + y}$ for brevity:16 16 Comparison of vectors is done coordinate-wise.
\begin{eqnarray*} \operatorname{opt} (x, y) & = & \underset{x \in \mathbb{R}^d, z \in \mathbb{R}^{d_{1}}}{\max} v^{\top} z \label{eq-problem-opt}\tag{7}\\\ & \text{s.t.} & z \geqslant 0, z \geqslant Wx, z \odot z = z \odot Wx, \tag{ReLU}\\\ & \operatorname{and} & x \odot x \leqslant (l + u) \odot x - l \odot u. \tag{Attack} \end{eqnarray*}
The trick to rewrite this as an SDP is to introduce new variables $a = (1, x, z)^{\top}$ and $P := aa^{\top}$. With “symbolic indexing” $P [\cdot]$ one can write
\[ P = \left(\begin{array}{ccc} P [1] & P [x^{\top}] & P [z^{\top}]\\\ P [x] & P [xx^{\top}] & P [xz^{\top}]\\\ P [z] & P [zx^{\top}] & P [zz^{\top}] \end{array}\right), \]
and relax the problem above (adding the normalization $P [1] = 1$ inherited from $a = (1, x, z)^{\top}$) as:17 17 Note that, as before, the key modification is the convexification of the rank-one constraint $P = aa^{\top}$ via $P = AA^{\top}$ for an arbitrary $A$, or equivalently, $P \succcurlyeq 0$.
\begin{eqnarray*} \widetilde{\operatorname{opt}} (x, y) & = & \underset{P \succcurlyeq 0}{\max} v^{\top} P [z] \label{eq-SDP}\tag{SDP}\\\ & \text{s.t.} & P [1] = 1,\\\ & & P [z] \geqslant 0,\\\ & & P [z] \geqslant WP [x],\\\ & & \operatorname{diag} (P [zz^{\top}]) =\operatorname{diag} (WP [xz^{\top}]),\\\ & \operatorname{and} & \operatorname{diag} (P [xx^{\top}]) \leqslant (l + u) \odot P [x] - l \odot u. \end{eqnarray*}
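In code, (SDP) maps almost one-to-one onto a modeling language such as cvxpy. A minimal sketch for the two-layer case, with the block indexing of $P$ following the matrix above and illustrative names:

```python
import cvxpy as cp

def sdp_cert(v, W, l, u):
    """Upper bound on opt(x, y) via (SDP); v = V_{2-y} - V_{1+y},
    l, u are the coordinatewise bounds x - eps and x + eps."""
    d1, d = W.shape
    n = 1 + d + d1                                   # blocks of a = (1, x, z)
    P = cp.Variable((n, n), PSD=True)
    Px, Pz = P[0, 1:1 + d], P[0, 1 + d:]             # P[x], P[z]
    Pxx = P[1:1 + d, 1:1 + d]                        # P[x x^T]
    Pxz = P[1:1 + d, 1 + d:]                         # P[x z^T]
    Pzz = P[1 + d:, 1 + d:]                          # P[z z^T]
    constraints = [
        P[0, 0] == 1,                                # P[1] = 1
        Pz >= 0,                                     # (ReLU)
        Pz >= W @ Px,
        cp.diag(Pzz) == cp.diag(W @ Pxz),
        cp.diag(Pxx) <= cp.multiply(l + u, Px) - cp.multiply(l, u),  # (Attack)
    ]
    prob = cp.Problem(cp.Maximize(v @ Pz), constraints)
    prob.solve()
    return prob.value
```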
Using off-the-shelf solvers, the resulting SDP certifier, SDP-Cert, yields non-vacuous bounds for several networks trained with standard adversarial methods, in contrast to MaxGrad (albeit only for small networks for which the optimization is feasible). The intuition behind this is that MaxGrad is a linear approximation bounding the gradient uniformly, whereas SDP-Cert is a relaxation of the true quadratic problem of computing the worst possible margin $\operatorname{opt} (x, y)$.
SDP-Cert is also shown to provide better bounds than methods using linear programs, both empirically and theoretically. The latter statement is the content of [Rag18S] (Proposition 1), which proves a dimension-dependent gap between LP and SDP bounds of order $\sqrt{d}$ for random networks.18 18 Roughly, for networks of $m$ hidden nodes and dimension $d$, with high probability $\tilde{\delta}^{\star}_{y, \operatorname{LP}} = \Theta (md)$ and $\tilde{\delta}^{\star}_{y, \operatorname{SDP}} = \Theta \left( m \sqrt{d} + d \sqrt{m} \right) .$
Limitations
Like MaxGrad, SDP-Cert has several drawbacks. First, the bounds are loose: the strongest known attacks attain errors which are significantly lower. This is due to points where the attack fails but so does the certification. Upon close inspection it can be seen that the distribution of margins for these points is close to 0, i.e. the attacks were “almost” successful, see Figure 1.
Second, the use of an SDP and off-the-shelf interior point methods limits applications to networks of at most a few thousand nodes. The next paper we study addresses this issue by taking ideas from the SDP literature, achieving linear running times.
5 Efficient SDP certification
To overcome the problem of scalability, [Dat20E] work with the dual of the SDP formulation and express it as a maximum eigenvalue problem with linear constraints. In doing so, they generalize the work of Section 4 without compromising the tightness of the bound and can extend their method to any quadratically constrained quadratic program.
An important contribution is that the optimization can be done with first-order methods whose subgradient computations can be carried out as forward and backward passes on the network, so that the per-iteration complexity is that of a constant number of such passes, and the method can be implemented using standard frameworks like TensorFlow, PyTorch or Jax. This allows scaling up the size of the networks by one order of magnitude, to 20K nodes and 2M parameters, which is certainly far from the size of state-of-the-art models in vision and language, but can still easily be encountered in applications.
As before we want to compute $\operatorname{opt} (x, y)$, but after deriving the convex relaxation (SDP), we move on to the dual problem, and express it as a maximum eigenvalue problem.
To this end, we first write the Lagrangian for problem (7).19 19 As in Section 4, the method is presented for an arbitrary number of layers, but we stay in a simpler setting for clarity. Recall that the linear and quadratic constraints for a ReLU activation are ${z \geqslant 0}$, ${z \geqslant Wx}$ and $z \odot (z - Wx) \leqslant 0$ (where we write the last one as an inequality in view of the Lagrangian) and the quadratic constraints for an $\ell _{\infty}$ attack, $x \odot x \leqslant (l + u) \odot x - l \odot u$, for given bounds $l, u$. The Lagrangian of the constraints for this layer is then
\begin{eqnarray*} \tilde{\mathcal{L}} (x, z, \lambda) & := & - z^{\top} \lambda _{a}\\\ & & + (Wx - z)^{\top} \lambda _{b}\\\ & & + (z \odot (z - Wx))^{\top} \lambda _{c}\\\ & & + (x \odot x - (l + u) \odot x + l \odot u)^{\top} \lambda _{d} . \end{eqnarray*}
Rearranging terms, adding the objective $(V_{2 - y} - V_{1 + y})^{\top} x^1$ and renaming all network variables $x, z$ as $x$, one obtains a full Lagrangian
\[ \mathcal{L} (x, \lambda) := c (\lambda) + x^{\top} g (\lambda) + \tfrac{1}{2} x^{\top} H (\lambda) x, \]
where all the coefficients $c, g, H$ are affine functions of $\lambda$, [Dat20E] (Eqs. (2) to (3)). Interestingly, because $g$ and $H$ are the gradient and Hessian of $\mathcal{L}$ in $x$, and all they involve is forward passes and element-wise operations, they can be computed with automatic differentiation in all common ML frameworks.
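Since $\mathcal{L}$ is quadratic in $x$, its coefficients can be read off automatically. A small jax sketch, assuming a function `lagrangian(x, lam)` implemented with `jnp` operations as above, and writing `N` for the total number of variables:

```python
import jax
import jax.numpy as jnp

x0 = jnp.zeros(N)                                # evaluate at x = 0
c = lagrangian(x0, lam)                          # c(lam) = L(0, lam)
g = jax.grad(lagrangian, argnums=0)(x0, lam)     # g(lam) = grad_x L at x = 0
H = jax.hessian(lagrangian, argnums=0)(x0, lam)  # H(lam): constant, L quadratic in x
```

In practice one would avoid materializing $H$ and rely on Hessian-vector products (a composition of forward and backward passes), in line with the per-iteration cost quoted above.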
A standard duality argument provides:
\[ \operatorname{opt} (x, y) \leqslant \underset{\lambda \geqslant 0}{\min} \underset{l \leqslant x \leqslant u}{\max} \mathcal{L} (x, \lambda), \]
which in [Dat20E] (Proposition 1, p. 5) is transformed into the minimization
\begin{equation} \underset{\lambda \geqslant 0, \kappa \geqslant 0}{\min} f (\lambda, \kappa), \label{eq-sdp-fo}\tag{SDP-FO} \end{equation}
with the Lagrange multipliers collected in a vector $\lambda$ and $\kappa \in \mathbb{R}^{1 + N}$, $N$ being the sum of the dimensions of all layers (input and hidden in our setting), and with the objective given by
\[ f (\lambda, \kappa) := c (\lambda) + \frac{1}{2} \boldsymbol{1}^{\top} [\kappa - \lambda _{\min}^{-} (Z) \boldsymbol{1}]^+, \]
where $\lambda _{\min}^{-}$ denotes the negative part of the smallest eigenvalue of a certain matrix $Z = Z (\lambda, \kappa)$ assembled from $g$ and $H$.20 20 See [Dat20E] (Appendix A.4) for the proof that optimizing this new problem is equivalent to SDP-Cert.
The problem (SDP-FO) can be solved efficiently with a first-order projected subgradient method thanks to automatic differentiation and iterative computation of the eigendecomposition of $Z$. Good choices for initialization, regularization and learning rates improve convergence rates [Dat20E] (Section 5.3).
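A schematic solver in jax, assuming `f` is implemented with `jnp` operations (e.g. `jnp.linalg.eigvalsh` for the eigenvalue term, for which jax provides gradients); for scalability, [Dat20E] replace the exact eigendecomposition with iterative methods:

```python
import jax
import jax.numpy as jnp

df = jax.grad(f, argnums=(0, 1))      # (sub)gradient of the dual objective

def solve_sdp_fo(lam, kappa, steps=1000, lr=1e-3):
    for _ in range(steps):
        g_lam, g_kappa = df(lam, kappa)
        lam = jnp.maximum(lam - lr * g_lam, 0.0)        # project onto lam >= 0
        kappa = jnp.maximum(kappa - lr * g_kappa, 0.0)  # project onto kappa >= 0
    return f(lam, kappa)  # any feasible (lam, kappa) upper-bounds opt(x, y)
```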
The authors test SDP-FO on networks both with and without a verification penalty term in the loss, but their main result is its effectiveness on the latter. For these verification-agnostic networks, the dual form provides a bound which is reasonably tight in many cases (wrt. a proxy lower bound obtained with PGD, as described in Section 2), while enabling verification of larger models.
An additional positive aspect of the method is that it can handle quadratic certification functions. The margin (1) for adversarial robustness is linear, but the authors show an example of robustness of VAEs to perturbations in latent space where the certification function is given by the (quadratic) reconstruction error [Dat20E] (Eq. (7)). Such quadratic objectives cannot be directly plugged into LP or exact solvers without relaxing them first.
Limitations
Despite the linear running time, the bounds provided by SDP-FO lose tightness as the network size increases. This can be seen in Figure 2, which compares the accuracy under a Projected Gradient Descent attack with the certified accuracy from the SDP-FO upper bound. The wider the gap, the worse the performance of the certifier.
6 Open questions in relaxed certification
Besides the obvious problem of scalability, several issues remain untackled in the literature on convex relaxations of the maximal expected error.
It is necessary to understand why networks trained with the certification penalty are in general less performant, as well as the role that architecture and initialization play in this matter. There is no characterization of the effective hypothesis class that one minimises over with certified training. Interestingly, these networks exhibit higher sparsity, a phenomenon for which there is no explanation yet. We will discuss this “robustness tradeoff” in a future post.
From a technical point of view, tighter relaxations are required at larger network sizes. A clear candidate for improvement is the computation of better bounds for the intermediate layers in the case of more than two layers.21 21 This refers to a set of constraints required for each additional layer. Finally, it would be interesting to see the adoption of techniques from the SDP community beyond the first-order dual method of Section 5.