A couple months back, I wrote about Pulina and Tacchella’s 2010 paper on verification of safety properties of neural networks. That paper is the earliest that I’ve found in the literature that makes a serious attempt at verification of neural networks using an SMT solver, and I found it really interesting.
In that post, I also mentioned a number of limitations of Pulina and Tacchella’s approach. First, the network used in the paper’s case study was tiny, and it wasn’t clear that the technique would scale to realistically sized neural networks. Second, the property being verified (ensuring that the network’s output fell within a specified range) seemed a bit boring; I was interested in whether it was possible to verify more interesting properties. Third, the function the network was learning — the position of the end of a robot arm, given its joint angles — seemed like something that could be worked out analytically, without the need for a neural network (or any machine learning at all). And finally, it wasn’t clear that the proposed verification approach would work for neural networks that did classification tasks, rather than regression tasks like the one in the case study. A 2011 follow-up paper by the authors addressed some of those concerns, but not all of them.
I was very excited, then, to find a new paper, posted on arXiv just a few weeks after my blog post went up, that takes a big step forward in addressing the limitations of the Pulina and Tacchella work.1 This paper, “Safety Verification of Deep Neural Networks” by Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu, presents a new technique for verifying safety properties of neural networks that is capable of handling much larger, deeper networks than those the Pulina and Tacchella approach had been shown to handle. The trick is to deal with the network one layer at a time, rather than all at once. This layer-by-layer approach, it seems, is the key innovation that makes the problem tractable for an SMT solver. Furthermore, unlike Pulina and Tacchella, Huang et al. consider networks that do classification, and the kinds of classification tasks they are doing certainly justify the use of neural networks.
All this is very promising, but the reason I was most excited about Huang et al.’s work was that the particular safety property that they set out to verify has to do with robustness to adversarial inputs, a topic I’ve been digging into with a lot of enthusiasm recently. Robustness to adversarial inputs definitely counts as an “interesting” property to me!
Unfortunately, Huang et al.’s verification framework does not allow one to verify (or find a counterexample for) safety properties like “Classifier $C$ is robust to adversarial perturbations smaller than $p$”, where $p$ is some measure of how large the perturbation is. Instead, their setup only allows one to ask questions about a particular input $x$ to the classifier, and whether there are any adversarial inputs in a particular “region” around that input.
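To spell out the difference in my own notation (which is not the paper’s), the global property one might hope to verify is something like

$$\forall x.\ \forall x'.\ \|x' - x\| \le p \implies C(x') = C(x),$$

quantifying over every possible input $x$. What the framework actually lets you check is the local version, for one fixed input $x$ and a region $\eta(x)$ around it:

$$\forall x' \in \eta(x).\ C(x') = C(x).$$

Verifying the local property for one $x$ tells you nothing about any input outside $\eta(x)$.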
The region is assumed to contain only inputs that would be classified the same way as $x$ by a human observer (see Assumption 1 in section 3 of the paper), and so any input within the region that is classified differently from $x$ is considered adversarial. As mentioned, though, Huang et al.’s approach deals with the network layer by layer, and we can speak of “regions” around each vector of activations in the network, not just around the input vector. So, one thing I don’t yet understand about the paper is what this assumption means for deeper layers in the network. After all, it would be pretty hard for a human observer to make a decision about whether a vector of activations beyond the input layer is “close” to what it’s supposed to be. Or pretty hard for most human observers, anyway. (“All I see is blonde, brunette, redhead…”)
If there is an adversarial input in the region, the verification process will find it, and so the process can be used to rule out the presence of adversarial inputs in the region. This would be intractable if Huang et al. didn’t also make certain assumptions about the kinds of manipulations that it is possible to make to an input. They introduce some mathematical machinery in section 3 of the paper that, as far as I can tell, makes it possible to check the entire region for adversarial manipulations without having to exhaustively check each point in it. I think the idea is something like “pick a finite set of manipulations that span the space of possible manipulations” — see the material about “ladders” in section 3.
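To give a rough flavor of that idea, here’s a toy sketch of searching a region using only a finite set of manipulations. To be clear, this is my own illustration, not the paper’s algorithm: the real method works layer by layer with an SMT solver, and every name below (the stand-in `classify` function, `search_region`, the step/radius parameters) is something I made up for the example.

```python
import itertools

def classify(x):
    # Toy stand-in for a trained classifier: a fixed linear
    # decision rule on 2-dimensional inputs.
    return 1 if x[0] + x[1] > 1.0 else 0

def search_region(x, step, radius):
    """Apply a finite set of manipulations -- every combination of
    multiples of +/- step per dimension, out to `radius` -- and
    return the first manipulated point whose label differs from x's.
    Returns None if no manipulation in the set changes the label."""
    label = classify(x)
    n_steps = int(radius / step)
    deltas = [i * step for i in range(-n_steps, n_steps + 1)]
    for d in itertools.product(deltas, repeat=len(x)):
        x_prime = [xi + di for xi, di in zip(x, d)]
        if classify(x_prime) != label:
            return x_prime  # an adversarial point in the region
    return None

# A point near the decision boundary has an adversarial neighbor
# within radius 0.2; a point far from the boundary does not.
print(search_region([0.45, 0.45], step=0.1, radius=0.2))  # a misclassified neighbor
print(search_region([0.0, 0.0], step=0.1, radius=0.2))    # None
```

The interesting part of the paper is, of course, the argument that a well-chosen finite set of manipulations suffices to cover the whole region — which a naive grid like this one doesn’t establish.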
I’m excited to see the research on this topic continue to progress. For the particular problem of verifying robustness to adversarial inputs, the fact that you can only verify robustness with respect to a particular input and region around it seems like a pretty big limitation. The approach also seems to rely on a number of assumptions that make it a bit less exciting; for instance, it seems to me that the assumptions around what kind of manipulations can be done to the input could result in supposedly “adversarial” inputs where the perturbation is pretty obvious to a human (as in Figure 10 of the paper, for instance). However, I’d be happy to be proven wrong about that. Speaking as a non-expert, I think this paper is an exciting step forward over the previous state of the art, especially the fact that it is possible to handle relatively large networks with the layer-by-layer approach.
Huang et al. say that the region might be infinite, but I don’t actually understand this; any actual representation on a computer is going to be finite. In any case, even a finite region could easily be too large to search exhaustively. ↩