
Fail-Safe AI and Self-Driving Cars

By Lance Eliot, the AI Trends Insider

Fail-safe. If you’ve ever done any kind of real-time systems development, and especially involving systems that upon experiencing a failure or fault could harm someone, you are likely aware of the importance of designing and building the system to be fail-safe.

For many AI developers that have cut their teeth developing AI systems in university research labs, the need to have fail-safe AI real-time systems has not been a particularly high priority. Often, the AI system being built is considered relatively experimental and intended to try out new ways of advancing AI techniques and Machine Learning (ML) algorithms. There is not much need or concern in those systems about ensuring a fail-safe AI capability.

For AI self-driving cars, the auto makers and tech firms are at times behind the eight ball in terms of devising AI that is fail-safe. With the rush towards getting AI self-driving cars onto the roadways, it has been assumed that the AI is going to work properly, and if it doesn’t work properly that a human back-up driver in the vehicle will take over for the AI.

I’ve repeatedly cautioned and forewarned that using a human back-up driver as a form of “fail-safe” operation for the AI is just not sufficient. Human back-up drivers are apt to lose attention toward the driving task and not be fully engaged when needed. Also, human back-up drivers might not be aware that the AI is failing or faltering and therefore not realize they should be taking over the driving controls. Even if the human back-up driver somehow miraculously realizes or is alerted to take over the driving task, the human reaction time can be ineffective and the time for evasive action might already be past.

There’s another reason too to be concerned about the need for fail-safe AI in self-driving cars, namely that the hope and desire is that AI self-driving cars will ultimately need no human intervention whatsoever in the driving task. As such, if there is not going to be any expectation of a human jumping in to save the day, the AI has to be the one that can entirely on its own save the day. This means that even if the AI itself falters or fails, it somehow has to stand itself back up and continue driving the car. I’ll say more about this in a moment.

At the Cybernetic AI Self-Driving Car Institute, we are developing AI software for self-driving cars and are doing so with a proverbial “from the ground-up” fail-safe AI core tenet. It’s a key foundational aspect, in our view.

I’d like to clarify and introduce the notion that there are varying levels of AI self-driving cars. The topmost level is considered Level 5. A Level 5 self-driving car is one that is being driven by the AI and there is no human driver involved. For the design of Level 5 self-driving cars, the auto makers are even removing the gas pedal, brake pedal, and steering wheel, since those are contraptions used by human drivers. The Level 5 self-driving car is not being driven by a human, nor is there an expectation that a human driver will be present in the self-driving car. It’s all on the shoulders of the AI to drive the car.

For self-driving cars less than a Level 5, there must be a human driver present in the car. The human driver is currently considered the responsible party for the acts of the car. The AI and the human driver are co-sharing the driving task. In spite of this co-sharing, the human is supposed to remain fully immersed into the driving task and be ready at all times to perform the driving task. As suggested herein earlier, I’ve repeatedly warned about the dangers of this co-sharing and/or back-up driver arrangement and predicted it will produce many untoward results.

Let’s focus herein on the true Level 5 self-driving car. Many of the comments apply to the less than Level 5 self-driving cars too, but the fully autonomous AI self-driving car will receive the most attention in this discussion.

Here are the usual steps involved in the AI driving task (a brief code sketch of this cycle appears after the list):

• Sensor data collection and interpretation
• Sensor fusion
• Virtual world model updating
• AI action planning
• Car controls command issuance
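To make that cycle concrete, here is a minimal sketch of how the five stages might be strung together in code. It is purely illustrative: every class and method name in it is a hypothetical placeholder, not anything from an actual self-driving stack.

```python
# Minimal, purely illustrative sketch of the five driving-task stages.
# All class and method names are hypothetical placeholders.

class DrivingCycle:
    def __init__(self, sensors, fusion, world_model, planner, controls):
        self.sensors = sensors          # cameras, radar, LIDAR, and so on
        self.fusion = fusion            # reconciles the separate sensor streams
        self.world_model = world_model  # virtual model of the car's surroundings
        self.planner = planner          # decides what the car should do next
        self.controls = controls        # issues steering, braking, throttle commands

    def step(self):
        readings = self.sensors.collect_and_interpret()  # sensor data collection and interpretation
        fused = self.fusion.fuse(readings)               # sensor fusion
        self.world_model.update(fused)                   # virtual world model updating
        plan = self.planner.plan(self.world_model)       # AI action planning
        self.controls.issue(plan)                        # car controls command issuance
```

In a real-time setting this loop would run many times per second, which is part of why a fault anywhere in the chain matters so much.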

Another key aspect of AI self-driving cars is that they will be driving on our roadways in the midst of human driven cars too. There are some pundits of AI self-driving cars that continually refer to a Utopian world in which there are only AI self-driving cars on the public roads. Currently there are about 250+ million conventional cars in the United States alone, and those cars are not going to magically disappear or become true Level 5 AI self-driving cars overnight.

Indeed, the use of human driven cars will last for many years, likely many decades, and the advent of AI self-driving cars will occur while there are still human driven cars on the roads. This is a crucial point since this means that the AI of self-driving cars needs to be able to contend with not just other AI self-driving cars, but also contend with human driven cars. It is easy to envision a simplistic and rather unrealistic world in which all AI self-driving cars are politely interacting with each other and being civil about roadway interactions. That’s not what is going to be happening for the foreseeable future. AI self-driving cars and human driven cars will need to be able to cope with each other.

Fail-Safe Further Explained

Returning to the topic of fail-safe AI, let’s consider the nature of what it means to be considered “fail-safe” and then take a look at how it can be achieved for AI systems. There is often confusion among laypeople as to the meaning of fail-safe overall. Admittedly, it is a rather loaded term and one that carries various connotations and baggage associated with it.

Let’s start with the major misnomer about being fail-safe. Some falsely assume that a fail-safe system is one that will never fail in any respects whatsoever. Though you can indeed try to use that as a definition, it is not quite so applicable in the real-world of real-time systems.

If we were to take the term “fail-safe” as meaning that something can never ever fail, no matter what, it would suggest that it must be strictly and always impossible for failure to ever occur, of any kind and in any respects. I dare say that it is hard to imagine any kind of relatively complex and useful system that you might build that could reach to such a lofty goal. Never failing is a pretty high bar.

I realize it will be disappointing to those of you who harbor such a definition when I say that fail-safe, as more commonly considered, encompasses an allowance for failure in some respects, provided that the system is able to handle the failure in a manner that reduces, covers for, or averts its adverse consequences. For those of you who are perfectionists and insist on never allowing any kind of failure, it’s a hard pill to swallow that I am allowing “fail-safe” to mean that failure can occur. Sorry about that.

The classic line is that fail-safe means it is safe to fail. You design and construct the system with an eye towards failures happening and try to put in place ways to best cope with the failings.

I realize that it might now seem like we are opening the floodgates by allowing that a fail-safe system can experience failure.

If we are going to allow failure as an option, we next need to consider the range and depth of possible failures. Suppose that a teeny tiny failure occurs in some sub-component and yet it is not particularly material to anything else that the system is doing. As such, yes, a failure occurred, but it is buried so deeply and has so little bearing on the final outcome of the system that, in a sense, it never saw the light of day. On the other hand, we might have a massive single failure or a collection of a multitude of severe errors, all of which produce a final outcome of disastrous and human-harming results.

The notion underlying a fail-safe system is that it takes into account the potential for failure, including failures big and small, and including failures having adverse outcomes and those not, and as a core element of the design and how the system is constructed, it anticipates those potential failures along with ways to mitigate those failures.

Sometimes at conferences where I talk about fail-safe AI, I get the question that if one is designing an AI system based on expected failures then aren’t you really engendering a mindset of a willingness to allow failure to occur within the system?

My answer is not really. Sure, you could claim that by talking about potential ways of failure that it will creep into the minds of those designing and building the system, and perhaps they then carry that over into how they design and build it, committing into the AI system various failures that otherwise would have never occurred to them.

But, I don’t think any well managed AI development would get itself into that bind. Instead, I would argue that the opposite is actually worse, namely that if you don’t bring up potential failures beforehand then the AI developers will blindly and inadvertently possibly fall into the trap of silently and unknowingly designing failures and building failures into the system. I am a proponent that fore-knowledge is better than not having thought things through.

Another question that I sometimes get involves whether or not you can truly anticipate all possible failures, and even if so, the effort to design and build the AI fail-safe system to cope with all such failures would seem to cause the design and development to mushroom beyond any manageable size.

First, I’ll absolutely concur that an AI system designed and built with a fail-safe mindset is likely to take longer to design and build, potentially costing more too, in comparison to a slap-it-together mindset that does not include fail-safe and related safety considerations.

Right now, most of the AI self-driving cars at a more advanced level are being pushed into experimental trials on our roadways and the lack of fail-safe aspects can be somewhat masked or hidden from view. If you have a team of AI developers that each day are tending to an AI self-driving car, along with the human back-up drivers, it can in a sense keep at bay any apparent AI system failures. A day-to-day problem in the AI system is quickly detected and dealt with. This is not what’s going to happen once such AI self-driving cars are truly allowed into the wild and become abundant on our roadways.

I’d suggest that you need to consider the overall total cost over the lifetime of the AI self-driving car and weigh that versus the near-term get-it-out-the-door approach. An AI self-driving car that has weak or insufficient fail-safe AI will likely end up causing more harm to humans, via say crashing into a wall or ramming into other cars, and will up the ante on the overall cost of having deployed such a system. I’ve predicted over and over that there are going to be product liability lawsuits against AI self-driving car auto makers and tech firms. The question will certainly be posed as to what they did to incorporate fail-safe AI into their self-driving car AI.

Overall, I’m saying that it indeed is likely to take longer at the front-end to include fail-safe AI, but that over time this will be worth it. Not only due to the lives saved, and not only to avoid product liability losses, but also as a means to try and gain public trust that AI self-driving cars are essentially safe. Those auto makers that go for the first-mover advantage will potentially put a bad apple into the barrel that will then have the chance of curtailing or disrupting all efforts towards achieving AI self-driving cars.

Now, in terms of the other part of the point about possibly elongating an AI development effort by wanting to incorporate fail-safe aspects, the second part of that question involves the implication that by trying to identify in-advance all possible points of failure you are taking on a nearly impossible task. It would certainly seem daunting to have to consider for an AI system composed of thousands upon thousands of interacting parts all the places that a failure might occur. The number of combinations and permutations of potential failures would become astronomical and it would seem unlikely you could consider them all.

Yes, indeed, you need to be reasonable and thoughtful when trying to address the entire population of potential failures. There are ways to categorize failures and assign probabilities to them. You need to have some fail-safe features to tackle the failures of the highest chances and/or the highest consequences. You then would have other fail-safe features to serve as a catch-all, aiming to cover things for failures that you either cannot necessarily well-anticipate or that might fall into the bucket of less risky or less severe kinds of failures.
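As a rough illustration of that triage, here is a hedged sketch of categorizing failure modes by likelihood and consequence, splitting them between dedicated fail-safe features and a catch-all. The failure entries, thresholds, and severity scale are all made-up assumptions for the example, not real data.

```python
# Hedged sketch of triaging potential failures by likelihood and consequence.
# The failure entries, thresholds, and severity scale are made-up examples.

from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    likelihood: float   # estimated probability of occurrence, 0.0 to 1.0
    severity: int       # consequence rating, e.g., 1 (minor) to 5 (catastrophic)

    @property
    def risk(self) -> float:
        return self.likelihood * self.severity

def triage(failure_modes, risk_threshold=0.5, severity_threshold=4):
    """Failures with high overall risk or high consequence get dedicated
    fail-safe features; everything else falls to catch-all handling."""
    dedicated = [f for f in failure_modes
                 if f.risk >= risk_threshold or f.severity >= severity_threshold]
    catch_all = [f for f in failure_modes if f not in dedicated]
    return dedicated, catch_all

modes = [
    FailureMode("GPS dropout", likelihood=0.30, severity=2),
    FailureMode("Main processor fault", likelihood=0.05, severity=5),
    FailureMode("Camera lens smudge", likelihood=0.40, severity=1),
]
dedicated, catch_all = triage(modes)  # first two get dedicated handling
```

The point of the sketch is simply that prioritization is explicit: high-likelihood and high-consequence failures get targeted fail-safe mechanisms, while the long tail is handled by a catch-all.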

You might be thinking that perhaps no AI self-driving car should be allowed onto the public roadways until it is fully tested in a lab.

This is a nice thought, but one that seems less viable in the real-world. One argument is that until an AI self-driving car is on the public roadways, it cannot be fully tested. There is only so much testing you can do via simulation. There is only so much testing that you can do via closed tracks that are built specifically for AI self-driving car testing. It is argued that the complexity of driving on public roadways can only in-the-end be ascertained by actually having the AI drive a car on public roadways.

Self-Driving Car AI Will Keep Learning Over Time

There’s another underlying aspect that adds to this dilemma. For many of the AI self-driving cars, the auto makers and tech firms are establishing the AI to be able to learn over time. Using various Machine Learning aspects, the AI self-driving car is intended to learn as it goes. This is rather different than most non-AI non-ML systems that pretty much do the same thing once they are released into production and actual use. If you have a system that does the same things over and over, you can test beforehand quite extensively. If you have a system that is changing as it is in use, the ability to test the system becomes more problematic.

Indeed, one aspect of a fail-safe AI system is that, assuming the AI incorporates Machine Learning, you have a moving target of what the failures might be. You could have perhaps anticipated certain kinds of failures when the AI system was at state A, prior to having “learned” and progressed to state B. Now, at state B, it could be that failures of new kinds might emerge. It is as though you have a creature that can morph itself over time. It had two heads and a tail before being released, and now it has three heads, two tails, and five tentacles. Trying to anticipate failures for something beyond the initial states that you knew about can be quite challenging.

Let’s next shift our attention toward the design and building of a fail-safe AI system.

Take a look at Figure 1.

In the diagram, I’ve divided the AI from the hardware and indicated that the AI is software and that the AI then runs on various hardware. This is an overall representation. I say that because I realize that some of you will start howling that the AI itself could be considered hardware and not just software. Yes, there are more and more AI systems that are incorporating specialized hardware that does chunks of the AI work, such as Artificial Neural Networks (ANN) that are embedded into GPUs (Graphics Processing Units).

To keep this discussion simple, I’m going to go with the preponderance of things and say that the AI is software and it is running on hardware, admittedly some of which will be optimized for running the AI software. This point might not placate everyone, but I think it is a reasonable assumption for this discussion, thanks.

In Figure 1a, you can see that we have an AI system and it is running on some kind of hardware. The hardware is shown as one box on the diagram. By this depiction, I am saying that it is perhaps one or more processors and they are collectively bound toward running the AI software. Though it shows just one box on the diagram, it could be many processors bound together.

Whenever you are designing and building a real-time system, you ought to be looking around for any potential Single Point of Failure (SPOF) that your design or construction is engendering. For the moment, let’s assume that the hardware that is running the AI system is absent of any redundancy. Thus, the AI software is dependent upon the hardware that is running the software, and if the hardware burps then it will cause the AI system to be disrupted.

As shown in Figure 1b, the easiest step that most auto makers and tech firms are taking to deal with fail-safe AI is to add more hardware to the matter. We might have one or more hardware bundles, each of which is independently able to run the AI software. The AI could be running on all of them at once, and if any of the hardware bundles faults or falters, we would opt to make use of the other hardware.

In some less-than real-time systems, you might have hardware redundancy like this that consists of hot spares and cold spares. A cold spare implies that it is hardware that is not running the active software at the time and would need to spin-up to be engaged to run the AI software. A hot spare implies that the hardware is actively running the AI system and can be shifted to immediately. Overall, this arrangement of the hardware, however you opt to do it, will certainly aid in eliminating a SPOF and will provide redundancy that should be able to keep the AI running in spite of more limited hardware failures.
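Here is a simplified sketch of how a failover across hot and cold spares might look in code, assuming hot spares can take over immediately while cold spares must spin up first. The bundle names and the spin-up delay are illustrative assumptions only.

```python
# Simplified sketch of failing over across hot and cold hardware spares.
# Bundle names and the spin-up delay are illustrative assumptions only.

import time

class HardwareBundle:
    def __init__(self, name: str, hot: bool):
        self.name = name
        self.hot = hot        # hot spare: already running the AI software
        self.healthy = True

    def activate(self) -> str:
        if not self.hot:
            time.sleep(2.0)   # a cold spare must spin up before it can take over
        return self.name

def fail_over(bundles):
    """Switch to the first healthy bundle, preferring hot spares for a faster hand-off."""
    for bundle in sorted(bundles, key=lambda b: not b.hot):
        if bundle.healthy:
            return bundle.activate()
    raise RuntimeError("No healthy hardware bundle available")
```

The preference for hot spares in the sketch reflects the real-time concern discussed next: the switchover delay itself can be a hazard.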

This does not though eliminate any dangers or consequences that can arise from hardware failures. It could be that the switching time to have the AI shift into running on another available hardware bundle might cause a problem for the AI system during a real-time activity. Imagine an AI self-driving car that is moving along on the freeway at 80 miles per hour and the AI is in the midst of trying to contend with difficult and complex traffic circumstances. Even a short delay in switching to another piece of hardware could be problematic.

Furthermore, suppose that the hardware that is on-board the self-driving car gets mushed all at once. If the AI self-driving car gets hit from behind by a drunken human driver, and imagine that we’ve secreted all the on-board processors, including the redundant ones, into the trunk of the self-driving car, it could wipe out the hardware entirely. The AI software is not going to have much chance of doing anything once that hardware in-total goes out.

Generally, let’s go with the idea that by adding the redundant hardware, we are aiding progress toward having fail-safe AI. We’ve got hardware that will generally be able to withstand more common kinds of physical failures such as a processor that goes bad or that its memory has issues, etc.

Take a look next at Figure 2.

Though we are feeling good so far about the hardware redundancy, it still leaves us with the problem that the AI software itself is now the Single Point of Failure (SPOF). We need to do something about that hole.

Primary and Secondary Self-Driving Car AI

As shown in Figure 2a, what we might do is set up another AI system and have it serve as a secondary for the primary AI system (I’ll respectively refer to them as Primary AI and Secondary AI). We’ll keep the redundant hardware underneath it all, and you should just assume that’s the case for the rest of this discussion.

The Primary AI will periodically send a message to the Secondary AI and tell it that things are OK and there is not a need for the Secondary AI to step into the running of the AI self-driving car. The Secondary AI responds by acknowledging that it got the message from the Primary AI and that there is no need for the Secondary AI to engage actively into the driving of the self-driving car.

Per Figure 2b, at some point, the Primary AI might realize that things are amiss within itself and opt to instruct the Secondary AI to take over the driving of the self-driving car. I’ll say more about this in a moment.

Generally, we would need to decide how often this messaging will take place between the Primary AI and the Secondary AI overall. It has to be frequent enough to ensure that the Secondary AI can step into the Primary AI role when needed and as soon as needed, but also not so frequent that it becomes a nuisance or consumes resources beyond its worth.

Also, there would likely be an aspect that if the Primary AI fails to message to the Secondary AI that things are OK on the prescribed schedule, the Secondary AI might either try to make contact with the Primary AI or it might just opt to step into the driving task and take it over from the Primary AI (this would need to be carefully considered as to the downsides and upsides of such an “unrequested” step-in).
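Pulling those pieces together, here is a hedged sketch of the heartbeat arrangement as seen from the Secondary AI's side, covering both the requested hand-off and the "unrequested" step-in when the Primary goes silent. The interval and missed-heartbeat threshold are arbitrary illustrative values, and the class and method names are assumptions.

```python
# Hedged sketch of the heartbeat arrangement, seen from the Secondary AI's side.
# The interval and missed-heartbeat limit are arbitrary illustrative values.

import time

HEARTBEAT_INTERVAL_S = 0.1   # how often the Primary is expected to report in
MISSED_LIMIT = 3             # missed heartbeats tolerated before stepping in

class SecondaryAI:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self, primary_ok: bool):
        """The Primary periodically reports its status; acknowledge and record it."""
        self.last_heartbeat = time.monotonic()
        if not primary_ok and not self.active:
            self.take_over(reason="Primary requested a hand-off")

    def check_for_silence(self):
        """Run on a timer: an 'unrequested' step-in if the Primary has gone quiet."""
        silent_for = time.monotonic() - self.last_heartbeat
        if silent_for > MISSED_LIMIT * HEARTBEAT_INTERVAL_S and not self.active:
            self.take_over(reason="Primary is non-communicative")

    def take_over(self, reason: str):
        self.active = True   # from here on, the driving task rests on the Secondary AI
```

Note that the sketch deliberately leaves open what "take over" entails, since that is exactly the design decision being weighed in the surrounding discussion.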

One key to this approach will be whether or not the Primary AI can realize that it is not OK. If the Primary AI assumes it is OK, but it really is not OK, it’s not going to ask the Secondary to take over. The Primary AI should already have lots of designed and built-in capabilities for detecting its own status and be “self-aware” of what it is doing. Once it realizes that something is amiss, it would then presumably have the presence of “mind” that it should ask the Secondary to step-in.

Once the Secondary does step-in, whether by being asked or on its own (as mentioned earlier, if the Primary AI is non-communicative), we are now down to the bare bones. We now have a Single Point of Failure since we would assume that the Secondary AI is not going to ask the Primary AI to take over from it, and as such the whole ball-of-wax is now riding on the Secondary AI.

But, at least we do have the Secondary AI that is ready and willing and able to go. This does though bring up the aspect that the Secondary AI might have its own problems and not be able or willing to take over from the Primary AI. Let’s consider this a bit further.

If the Secondary AI is say running on the hardware secreted in the trunk and the Primary AI is too, and if the rear-ender bashes up the hardware, perhaps the Primary AI would realize it is now degraded and so it asks the Secondary to take over. But, the Secondary might also now be degraded and thus refuse the request for it to take over. In that case, we might stick with the Primary AI and not resort to using the Secondary AI.

There’s another variation of this matter that entails how the Primary AI and the Secondary AI were designed and written. Suppose that the Primary AI was designed and written such that we merely made a second copy and that became our Secondary AI. The question then arises that if the Primary AI has a fault that causes it to ask the Secondary AI to take over, will the Secondary AI potentially have the same exact fault since it is merely a carbon copy of the Primary AI? If that’s the case, we aren’t going to gain much in terms of having the Secondary AI (though, this can be somewhat ameliorated because it could be that the Secondary AI won’t necessarily experience the same fault due to other exigencies such as the underlying hardware and other factors).

Take a look at Figure 3.

When we had just a Primary AI and a Secondary AI, it was earlier pointed out that once the Secondary AI takes over that it has no further recourse as to switching to anything else. It has become the sole cook in the kitchen.

As shown in Figure 3a, the Primary AI could potentially have a multitude of alternates, rather than being limited to just the singular Secondary AI. In that case of multiple Alternates, the Primary AI could potentially switch to one of them, which, if it then faulted, would allow switching to the next, and so on. Or, we could have some other means by which the Alternates are chosen to be used.

Via Figure 3b, I show that it is also important to consider whether the AI alternates will be the same as the AI of the Primary, or whether you might have different AI systems that were designed and built separately from the Primary AI and from the other alternates. If you are using AI’s that are the same, they are considered homogeneous alternates. If you are using AI’s that differ from each other, they are considered heterogeneous alternates.

For the case of homogeneous AI’s, you are really just copying the same AI and having multiple copies running or available. This is somewhat “easy” to do in that you aren’t separately designing each of those AI systems, nor building them separately. This means that the odds of an in-common failure point are relatively high, since they were designed and built identically.

N-Version Programming Explained

For the case of the heterogeneous AI’s, you are separately designing and building the AI systems, doing so based on presumably the same set of requirements (otherwise, they are in a sense completely different AI systems). In the parlance of software engineering, this is often referred to as N-Version Programming (NVP).

N-Version Programming is a big step to take when designing and building a system, AI or otherwise.

NVP means that you are significantly increasing the effort and likely cost of designing and building the system overall. You are essentially doing multiple complete development efforts. You typically only see this kind of approach in very expensive and very complex systems, such as spacecraft, which you know will need to be self-sufficient on long journeys, often without human control being readily feasible due to the distances involved.

There are various thorny questions to be considered with N-Version Programming.

Should you use the same teams for each of the N versions? But, if so, you are likely to get the same outcomes. Should you have separate teams but allow them to share some of the code (such as if they are using open source that neither of the teams built themselves)? But, if so, you are likely carrying into each such “separate” system the same potential failure points. And so on.

This kind of approach is not for the faint of heart. You’ve got to be willing to go to bat for something that will be costlier and likely longer than otherwise to undertake, but at the same time hopefully provide a greater margin of fail-safe AI. You can argue about how much of a fail-safe you would get, and the trade-off between the added cost and effort versus the presumed added safety.

Next, take a look at Figure 4.

As shown in 4a, the Primary AI has some N number of Alternates. So far, we’ve been assuming that the Secondary or the Alternates would consist of the same set of capabilities as the Primary AI, but that does not necessarily always need to be the case. Instead, it could be that the Secondary or Alternates are some subset of the Primary AI or some other aspect that pertains to the matter at hand.

In the case of an AI self-driving car, it could be that the Secondary or Alternates are only usable with respect to a fallback capability. If the Primary AI is having troubles, it would potentially be the case that the AI would want to perform a fallback operation and place the AI self-driving car into a minimum risk condition (such as parking safely on the side of the road). The Primary AI itself would likely have its own internal fallback capability, but if the internal fallback capability is faulty or the Primary AI otherwise becomes suspect, it might turn instead to one of the Alternates. Thus, it might be that the redundant AI is only with respect to the fallback operation.

The Primary AI might invoke an Alternate to undertake the fallback and if the chosen Alternate is unable to undertake the fallback it might then be handed to another Alternate. This would continue with each successive Alternate until hopefully one of them is able to actually perform the fallback. To try and avoid the chance that each of the fallback Alternates might have some in-common fault, they might be heterogeneous AI’s that each were designed and developed independently.
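A minimal sketch of that hand-off chain might look as follows, assuming each Alternate exposes some way to attempt the fallback; the execute_fallback() interface shown is hypothetical.

```python
# Minimal sketch of handing the fallback from one Alternate to the next.
# The execute_fallback() method on each Alternate is a hypothetical interface.

def perform_fallback(alternates):
    """Ask each Alternate in turn to bring the car to a minimum risk condition,
    such as parking safely on the side of the road."""
    for alt in alternates:
        try:
            if alt.execute_fallback():   # returns True if the fallback succeeded
                return alt
        except Exception:
            pass                         # this Alternate faulted; try the next one
    raise RuntimeError("No Alternate was able to perform the fallback")
```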

When the redundant AI’s are not equivalent in capability to the Primary AI, it is considered a thin form of redundancy since the redundancy only covers a portion of the full capabilities of the Primary AI. This also means that the Primary AI cannot hope to use the redundancy for its own full range of capabilities and instead would only switch over to a redundant AI when the circumstances were relevant to do so.

As shown in Figure 4b, the format of suggesting that there are alternates to the Primary AI is indicated via the overlapping series of boxes behind the Primary AI, similar to the format being shown for the hardware redundancy.

In Figure 5, the 5a again indicates a Primary AI with its alternates.

A variant of the approach so far of having a Primary AI consists of not having any Primary AI per se and instead having a set of essentially all alternates. In that case, the question arises as to who is calling the shots, so to speak, if there isn’t a Primary AI doing so.

As shown in 5b, a system element labeled as the Chooser would be selecting among the alternates and determining which one is actively performing the driving task. This then would likely have all of the redundant AI’s actively performing the driving task in a virtual sense, and the Chooser opting to select whichever one would seem best at the moment to be undertaking the actual driving task.

The Chooser would likely be relatively small and simple in its operation and mainly be trying to ascertain whether the Alternates are all doing OK or whether any are at a fault condition of one kind or another.

To help decide which, if any, of the alternates might be suffering a fault, the Chooser might not have any of its own logic per se to gauge the efficacy of what the alternates are producing, and so instead might rely upon comparing the alternates against each other.

For example, one sometimes usable approach is the M Out Of N (MooN) technique. If we have N number of alternates, we might beforehand determine that if M of them agree then those M are considered correct and the other N-M alternates are incorrect. This could be a simple majority and thus referred to as a Majority Redundancy.
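Here is a small sketch of an M-out-of-N comparison a Chooser might run over the alternates' proposed actions. In a real system the "agreement" test would be far richer than simple equality, so treat this as a toy illustration with made-up outputs.

```python
# Toy sketch of an M-out-of-N (MooN) comparison over the alternates' outputs.
# Real agreement checks would be far richer than simple equality.

from collections import Counter

def moon_vote(outputs, m=None):
    """Return the output agreed upon by at least M of the N alternates.
    With no M given, require a simple majority (Majority Redundancy)."""
    n = len(outputs)
    if m is None:
        m = n // 2 + 1
    winner, count = Counter(outputs).most_common(1)[0]
    return winner if count >= m else None   # None signals insufficient agreement

# Example: three alternates propose actions; 2-out-of-3 agreement picks "brake".
choice = moon_vote(["brake", "brake", "accelerate"])
```

A design question the sketch leaves open is what the Chooser does when no M-way agreement exists, which is itself a fail-safe decision point.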

In Figure 6, some of the subsystems associated with the Primary AI (or what might be a set of alternate AI’s) are shown in 6a.

At the higher level of the AI system, the Primary AI is continually trying to gauge whether any of the subsystems might be encountering a fault or failure. If so, this might either be handled internally within that AI instance or it might inspire the AI instance to then make use of an alternate.
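As a hedged sketch of that self-monitoring, the following assumes each subsystem exposes some kind of health check and that an Alternate can be asked to take over; all of the interfaces shown are hypothetical.

```python
# Hedged sketch of the Primary AI polling its subsystems and deciding whether a
# fault can be handled internally or warrants handing off to an Alternate.
# The health_check(), attempt_recovery(), and take_over() interfaces are assumed.

def monitor_subsystems(primary, alternate):
    """Return whichever AI instance should be driving after this check."""
    for name, subsystem in primary.subsystems.items():
        status = subsystem.health_check()      # hypothetical per-subsystem self-test
        if status.ok:
            continue
        if status.recoverable:
            subsystem.attempt_recovery()       # fault handled within this AI instance
        else:
            alternate.take_over(reason=f"{name} subsystem failure")
            return alternate
    return primary
```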

Conclusion

Today’s AI systems overall are generally quite primitive when it comes to being considered fail-safe or even at least fault tolerant. This is partially due to the aspect that many of these emerging AI systems were developed in research environments that did not require the kind of resiliency that is needed for real-world real-time systems.

The danger for many of today’s AI systems is that they are relatively fracture-critical, meaning that they are brittle and prone to falling apart, so to speak, upon the occurrence of even minor problems that might arise. Unfortunately, even a modest fault or failure will tend to get the AI discombobulated and have it produce an incorrect action or otherwise bring about an untoward result.

For AI self-driving cars, an incorrect action or untoward result can lead to a car that becomes a harm to humans in it and those near to it. By considering the various means of using redundancy and other fail-safe AI techniques, an auto maker or tech firm that is designing and building the AI for a self-driving car has better odds at having robust and resilient AI. Robust and resilient AI can more readily deal with faults and either overcome those failures or at least be able to resolve a driving task with greater safety and aplomb.

Copyright 2018 Dr. Lance Eliot

This content is originally posted on AI Trends.