Proving that safety-critical neural networks do what they're supposed to: where we are, where we're going (part 1 of 2)

Neural networks are turning up everywhere these days, including in safety-critical systems, such as autonomous driving and flight control systems. When these systems fail, human lives are at risk. But it’s hard to provide formal guarantees about the behavior of neural networks — how can we know for sure that they won’t steer us the wrong way?

A few months ago, my group at Intel Labs (including, among others, Tatiana Shpeisman, Justin Gottschlich, and me) began sponsoring and collaborating with a team of researchers at Stanford who are extending the capabilities of automated verification tools, such as SMT solvers, to formally verify properties of deep neural networks used in safety-critical systems. The team, which includes Guy Katz, Kyle Julian, and faculty members Clark Barrett and Mykel Kochenderfer, has developed a prototype SMT solver tool that can verify safety properties of a neural network that gives aircraft collision avoidance advice. For instance, we can use it to show that if another plane is near our plane and approaching from the left, the network will always advise us to turn right. The Stanford team’s first paper on this project, “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, will appear at CAV 2017 this July.

In this two-part series of posts, I’ll give a 30,000-foot (…so to speak) view of the aircraft collision avoidance system we want to verify properties of, how the verification tool works, some of the interesting insights that went into developing that tool, and where I’d like to see this project go next. This first post will cover what it means to verify properties of safety-critical neural networks and some detail about the collision avoidance system in question. In the second post, we’ll dig into SMT solving and the verification process itself.

What are safety-critical systems, and what does it mean to verify them?

To begin with, what is safety-critical software? One way to define it is software “whose failure may cause injury or death to human beings”. An example could be the software that’s part of the collision avoidance system on an airplane.

We always care about software correctness, or at least we do most of the time, but in safety-critical settings we especially care about rigorously showing that certain (bad) things can’t happen (that is, safety properties), or that certain (good) things will happen eventually (that is, liveness properties). We might even be required by law to go through a process of verifying that certain properties hold for a piece of software before we can run it on an airplane, for instance.

“Verification” is a word that gets thrown around a lot and means different things depending on the context. When I talk about software verification (or certification), I specifically mean using formal methods to rigorously prove that certain properties hold for a piece of software.1 What properties we prove will depend on the situation, but typically, we want to show that if the input to a program falls into a certain class of inputs, then the output from that program will fall into a certain class of outputs. In the case of an aircraft collision avoidance system, for instance, the input might be the state of the environment (including information about our plane, how fast it’s going, what direction it’s headed, and what other planes are nearby), and we might want to show that if the environment is in a certain state (say, there’s another plane in front of us and headed toward us) then the system is guaranteed to react in some way (say, by turning left or right).

When we write a test for a function, we’re testing a particular point in the input space for that function, and checking that it corresponds to a particular point in the output space. With safety-critical software, we usually want to guarantee that for a whole class of inputs, then some property is true for the whole class of corresponding outputs. This is sometimes impossible to do with testing alone — if the input space is infinite, then there’s no finite number of tests that will guarantee that the property holds, and even if the space is finite, it might be infeasible to write all the tests we’d need. So, for safety-critical systems we often look to verification rather than testing.2

Safety-critical software is already using deep neural networks

Today we have some very good tools for formally verifying properties of software systems. Some of those tools are based on SMT solver technology, which I’ll say more about in the second post in this series. Unfortunately, SMT-based verification tools are not neccessarily keeping pace with how some safety-critical software systems are now being built. In particular, safety-critical software is increasingly using deep neural networks, and though some progress has been made on verification of neural networks with SMT solvers in the last several years, automatically verifying interesting properties of larger (“deep”) networks is still an open problem.

As our motivating example of a safety-critical software system that uses deep neural networks, we’ll look at a paper that some of the Stanford team and their collaborators published last year, “Policy Compression for Aircraft Collision Avoidance Systems”. This research addressed the question of how to decrease the memory requirements of an aircraft collision avoidance system so that it could fit in memory on the hardware on board an aircraft. The system they’re concerned with involves a gigantic lookup table that enumerates tens of millions of possible environment states and maps each state to an advisory saying what action the aircraft should take in that state. The contents of this table amount to the collision avoidance policy. In its original form, the table takes up hundreds of gigabytes. Even a downsampled version of the table is still quite large, weighing in at about two gigabytes. In order to be used in practice on board an aircraft, it needs to be smaller still — hence “policy compression”.

The authors showed that as an alternative to storing the table itself, they can train a neural network on the table’s contents. The resulting trained model takes up only a few megabytes — about a thousand times less storage space than even the downsampled table. Furthermore, they found that according to certain metrics (which are discussed more in the paper), a controller with a neural-network-based policy representation actually performs better than the original table-based one because, unlike the table, the network is able to smoothly interpolate between inputs. However, the neural-network-based policy representation is only an experiment at this point, and the paper acknowledges that “there are significant certification concerns with neural network representations”. Those concerns are what the Stanford team is now working to address.

The ACAS X system

Let’s dig more deeply into the details of the flight collision avoidance system in question. The system is called ACAS X, which stands for Airborne Collision Avoidance System X.3 ACAS X has been under development for several years, and the intent is for it to be adopted for use on aircraft in a few years, replacing the existing collision avoidance system, TCAS, which has been in place for decades. There are, in fact, several variants of ACAS X; the variant we’ll be discussing here is what’s known as the “Xu” variant, where the “u” stands for unpiloted aircraft.

ACAS X runs on board an aircraft during flight. Once per second, it collects sensor measurements, which include information about where any intruder aircraft might be. Based on those measurements, it produces one of five of what are called horizontal resolution advisories.4 The most common advisory is COC, meaning “clear of conflict”, which means the plane doesn’t need to do anything special. There are four other advisories: weak right turn, strong right turn, weak left turn, or strong left turn. Weak turn advisories say to turn 1.5 degrees per second; strong turn advisories say to turn 3 degrees per second.

There’s some amount of uncertainty in the sensor measurements, and this uncertainty is modeled by what’s called a partially observable Markov decision process, or POMDP. From the POMDP, the developers of ACAS X generated a table that models about 120 million possible states, called the score table because it maps each state to a score for each of the five possible horizontal resolution advisories. The resolution advisory that the system produces is the result of looking up a state in the table and choosing the advisory with the best score.

Because the table is too big to fit in memory on board the aircraft, the aforementioned “Policy Compression” paper explores the possibility of training a neural network on the table and then storing that trained model instead. So, let’s take a look at how that works.

Using a DNN to represent the ACAS Xu collision avoidance policy

In ACAS X terminology, your plane is called the ownship, and the other plane is called the intruder. A state consists of seven pieces of information that describe the relationship between the ownship and the intruder. In particular:

Components of a state.  Image from "Policy Compression for Aircraft Collision Avoidance Systems" by Julian et al.

  • $\rho$ (rho) is the distance from the ownship to the intruder;
  • $\theta$ (theta) is the angle from the ownship to the intruder;
  • $\psi$ (psi) is the heading angle of the intruder, relative to the ownship heading angle;
  • $v_{\mathit{own}}$ is the speed of the ownship;
  • $v_{\mathit{int}}$ is the speed of the intruder;
  • $\tau$ (tau) is time until loss of vertical separation; and finally,
  • $a_{\mathit{prev}}$ is whatever the previous advisory was. (This allows the previous advisory to be taken into account when determining what the next advisory will be; for instance, if the previous advisory was “strong left”, it might be a good idea to disincentivize the system from advising “strong right”.)

The input to the trained network is a state — that is, a seven-dimensional vector $(\rho, \theta, \psi, v_{\mathit{own}}, v_{\mathit{int}}, \tau, a_{\mathit{prev}})$ — and the output is a five-dimensional vector giving a score to each of the five possible advisories, where lowest is best.

Neural networks are made up of layers of processing units. Each unit takes some number of inputs and produces an output that gets passed on to units in the next layer. At each unit, the inputs are combined using weights, learned during the training process, that determine how each input contributes to the result. The sum of the weighted inputs (plus a bias term that’s also learned during training) then passes through what’s called an activation function to produce the final output of the unit. To represent the ACAS Xu collision avoidance policy, we use a network with nine fully-connected layers — “fully-connected” meaning that the output of every unit in a layer becomes one of the inputs to every unit in the following layer — that was designed in such a way that there would be about 600,000 weights in the trained network. With nine layers and 600,000 weights, our network is big enough to qualify as “deep”, but the trained model would still be small enough to meet the memory constraints of running the system on board an aircraft, were this DNN representation to be adopted in practice.

A DNN representing the ACAS Xu collision avoidance policy.  Image is a modified version of one from "Policy Compression for Aircraft Collision Avoidance Systems" by Julian et al.

All of the activation functions in the ACAS Xu network are ReLUs. The ReLU function just returns the maximum of its input and 0. So, each unit in the network is taking as input the outputs from the previous layer, multiplying each one by the appropriate learned weight, taking the sum of those plus the bias term, and running the result through the ReLU activation function to return either itself or 0, whichever is larger.

I won’t go into any detail here about how the network is trained, because we’re interested not so much in training as we are in verifying that certain properties hold for the already trained network. For the curious, there’s more information about training to be found in the “Policy Compression” paper.5

Could we trust this DNN to control aircraft?

So, we have a DNN trained on the ACAS Xu score table that represents our collision avoidance policy. But the function computed by the network is only an approximation of the one described by the table, as the following picture illustrates:

Some differences in advisories given by the table-based and DNN-based ACAS Xu controllers.  Image is a modified version of one from "Policy Compression for Aircraft Collision Avoidance Systems" by Julian et al.

These plots show top-down views of an encounter between the ownship and an intruder. The ownship is centered at the origin and flying from left to right, and the intruder is flying from right to left. The color at each point in the plot shows the advisory that would be issued if the intruder were at that location. The plot on the left is what the original table’s advisories say to do, and the plot on the right is what the DNN advises.6

Clearly, they aren’t the same. However, that may or may not be a problem; the question we want to answer is whether or not the DNN-based system is (at least) as safe to use as the original table-based system. There are a number of properties that we might want to be able to ensure hold. For instance, some of those properties might include:

  • If the intruder is directly ahead and moving towards us, the score for the clear-of-conflict advisory will not be minimal. (Recall that the lowest score is best, so in this case, we don’t want the lowest score to tell us that we’re clear of conflict; we want the network to tell us to turn one way or another.)
  • If the intruder is near and approaching from the left, the network will advise “strong right”.

Our Stanford collaborators came up with a list of ten such properties that they wanted to verify; that list is in Appendix VI of the Reluplex paper. This list of ten properties is by no means an exhaustive list of all the things that we might want to verify about such a system. In fact, just figuring out what properties are important to verify is a challenge in itself.7 Assuming we know what properties we’re interested in showing, though, the next question is: how do we build verification tools that can prove that those properties hold, or that they don’t hold? That’s the question we’ll look at in my next post.

Thanks to Clark Barrett, Justin Gottschlich, Guy Katz, Mykel Kochenderfer, and Harold Treen for their feedback on drafts of this post.

  1. One irritating thing about the way people use the word “verification” is that we often speak of “verifying” (or “certifying”) some system without saying what it is that we’re verifying about that system. When I say “verification of neural networks”, it’s really shorthand for “verification of certain properties of neural networks”. “Proof of a program” is likewise shorthand for “proof of a particular property of a program” – for instance, that the program implements a particular specification. Sometimes, the property being proved or verified is clear from context, but without that context, it’s meaningless to talk about “verified software” or “verified hardware”, or to ask questions like “Has this software been verified?” Nevertheless, people (including me, in this post) do use those phrases all the time.

  2. Property-based testing, in the tradition of QuickCheck, lets you write down a property that you want to have be true of the output of a function, given a certain class of inputs. A property-based testing system will then generate a whole bunch of tests that cover different points in the function’s input space. It may have a clever search strategy so that if the property isn’t true, it’ll be likely to find a case where the property fails to hold. Although property-based testing isn’t the same thing as verification, it can be an important step toward verification, because it requires you to specify properties, and it will often help you quickly establish whether those properties don’t hold.

  3. The X doesn’t seem to stand for anything in particular; I think they just thought X sounded cool.

  4. ACAS Xu can also produce vertical resolution advisories, which say to make vertical maneuvers, but here we’ll just consider horizontal ones, which turn out to be the more interesting case.

  5. One detail about training that I do think is worth mentioning is that they put a lot of consideration into what loss function would make sense for this problem. If your network gives a good score to an advisory that tells the plane to turn at a time when it doesn’t really have to turn, then that’s bad, but not as bad as not telling the plane to turn at a time when it does really have to. So, for training, they used an asymmetric loss function that penalizes false negatives more heavily than it penalizes false positives. More details are in the paper.

  6. In this example, although it isn’t obvious from the top-down view, there also happens to be a large vertical separation between the ownship and the intruder. That’s why the area right in front of the ownship in the top-down view is white, meaning COC, because the intruder is too far away vertically to close the gap between them and create a collision risk. However, if the intruder is a bit further away horizontally, then there’s time for the intruder to make up the vertical gap. In other words, if the vertical separation is large, then a smaller horizontal separation is actually better for collision avoidance than a larger horizontal separation would be!

  7. I’m not a security person, but I imagine that deciding what properties to verify about a system is hard in the the same way that security threat modeling is hard. You can’t just ask, “Is this piece of software secure?” The right response to that question is, “What is your threat model?” By the same token, you can’t just ask, “Is this neural network safe to use?” You have to make decisions about what properties of the network would constitute safety, and only then can you (hopefully!) prove that those properties hold.