System level failure statistics needed


Now that we have support for fault recovery in VirtuosoNext, we have been wondering how extensive the coverage could be in real-life systems. The issue is that data on failure root causes is either considered confidential or focuses narrowly on specific elements (e.g. hardware reliability). We cannot really find statistical data for these system-level failures. Do you know of any such data?

In a real system, we have layers and we make some assumptions. Firstly, today's hardware can be considered highly reliable, assuming of course that the design rules were followed. If hardware fails, it will most often be because faults are introduced from the outside (bit flips, power supply spikes, I/O issues, ...). Secondly, software can be correct (e.g. when formally developed and proven), but will likely still contain residual errors. These can be due to incomplete specifications, numerical instability, compiler errors, memory access violations, etc. To simplify things, we also have to assume that the hardware provides some support in detecting such faults. Memory management circuits can detect memory access violations, illegal instructions or data errors can generate an exception interrupt and, at a coarser grain level, time-outs can signal that a complete unit is no longer responding. Everything else might require redundancy in the architecture.

The RTOS kernel of VirtuosoNext handles faults detected by the CPU as exceptions:

Memory access violations: can be triggered by bit flips, but more likely by software errors or security breaches.

Numerical exceptions: can be triggered by I/O not being clamped, but also by software errors and algorithmic instability.

Illegal instructions: pointer errors, bit flips, security breaches, …

The above support aims at providing continuity of real-time embedded applications even when such faults occur. The development environment assists in fine-grain space and time partitioning but also allows defining automatic "clean-up and recovery" actions. The code generators can be extended to automatically generate temporal and spatial redundancy (because VirtuosoNext is MP-transparent).
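To make the idea of fine-grain "clean-up and recovery" concrete, here is a minimal sketch in Python. It is not VirtuosoNext code and does not use its API: the names `Task`, `cleanup` and `run_with_recovery` are hypothetical, and Python exceptions merely stand in for CPU exceptions. It only illustrates recovering a single faulting task instead of restarting everything.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    """Hypothetical task descriptor (an assumption, not a VirtuosoNext type)."""
    name: str
    body: Callable[[], float]      # the task's work; may raise on a fault
    cleanup: Callable[[], None]    # hypothetical clean-up and recovery hook
    faults: List[str] = field(default_factory=list)


def run_with_recovery(task: Task, max_retries: int = 1) -> Optional[float]:
    """Run one task; on a fault, clean up and retry only this task."""
    for _ in range(max_retries + 1):
        try:
            return task.body()
        except (ArithmeticError, MemoryError) as exc:
            # ArithmeticError covers ZeroDivisionError and OverflowError,
            # used here as stand-ins for numerical exceptions.
            task.faults.append(type(exc).__name__)
            task.cleanup()
    return None  # retries exhausted; the caller must fall back to a safe state


# Usage: a bad divisor causes one fault, the clean-up repairs the state.
state = {"divisor": 0}
task = Task("filter",
            body=lambda: 10.0 / state["divisor"],
            cleanup=lambda: state.update(divisor=1))
result = run_with_recovery(task)
```

The recorded fault names are the kind of per-cause counts that the statistics asked for in this post would be built from.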

Much of this fault recovery support can of course be programmed manually, but ideally it is automated, based on a trade-off analysis. Note that today's practice is often coarse grain: if a fault occurs, the whole application or even the complete processing system is rebooted. Even if such an event has a low probability, in many cases it can be catastrophic. Boot times can be relatively short for small programs (the code must be read from e.g. flash and the system re-initialised), but if the time constraints are too short (read: micro- or milliseconds) and the code is relatively large, this is not a real option. Hence, the system should avoid ending up with a reboot as the only option left.
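The reboot argument can be put into rough numbers. The sketch below is only a back-of-the-envelope model under stated assumptions: reboot time is taken as the time to read the code image from flash plus a fixed re-initialisation time, and all figures (image sizes, flash bandwidth, init times, deadline) are made-up illustrations, not measurements.

```python
def reboot_time_s(code_bytes: int, flash_bytes_per_s: float, init_s: float) -> float:
    """Estimated reboot time: read the image from flash, then re-initialise."""
    return code_bytes / flash_bytes_per_s + init_s


def reboot_meets_deadline(code_bytes: int, flash_bytes_per_s: float,
                          init_s: float, deadline_s: float) -> bool:
    """True if a full reboot would still fit within the time constraint."""
    return reboot_time_s(code_bytes, flash_bytes_per_s, init_s) <= deadline_s


# Illustrative, assumed figures: 20 MB/s flash read rate, 1 ms deadline.
FLASH_RATE = 20e6
DEADLINE = 1e-3

small_ok = reboot_meets_deadline(16 * 1024, FLASH_RATE, 100e-6, DEADLINE)
large_ok = reboot_meets_deadline(2 * 1024 * 1024, FLASH_RATE, 5e-3, DEADLINE)
```

With these assumed values, the 16 KiB image fits within the 1 ms deadline while the 2 MiB image takes roughly 0.11 s, two orders of magnitude too long.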

In order to provide such support in a meaningful (and economical) way, we need to know more about the residual probabilities of failures and errors in a real (embedded) system. We cannot really find statistical data for these system-level failures. Do you know of any such data? We are aware that this might not be trivial, but your help will be greatly appreciated. Contact me at eric.verhulst (at) altreonic.com

Robert Mahar

So much depends on the specifics of the hardware, OS, and application constellation. If you are interested in a conversation rather than an answer per se, I am game. The short answer, IMO, is that hardware is a very unlikely cause of recoverable failures.

Eric Verhulst

I fully agree that HW (when correctly designed) is the least likely to fail, at least the processor. Memory and power subsystems do fail (because designers are optimistic). Nevertheless, the responses I have had until now mainly point to HW reliability data. It's my belief that most system-level failures (think embedded, safety/mission critical) are due to programming errors, I/O issues (related to data representation and range), and power supply glitches. One can more or less attribute these further to incomplete specifications. E.g. in VirtuosoNext we now catch processor exceptions and recover very fast, but the exception has an origin, typically a memory violation (pointer errors), non-converging algorithms, numerical issues like overflow, etc. The issue is that in order to improve the situation and develop optimal fault tolerance strategies, one needs data. What is the probability of these happening? It looks like nobody, not even the big avionics players, has such data. Of course, the scope is potentially very wide, as many issues are process related and ultimately education related. Still, I am surprised how large organisations keep repeating the same "mistakes". Many catastrophic failures in embedded systems, for example at ESA, are related to overflows of sensor data. It's easy to fix by clamping the values and by modeling such a situation during design.
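As a small illustration of the clamping mentioned above, here is a sketch in Python of a saturating conversion from a floating-point sensor reading to a 16-bit signed integer. The 16-bit target width and the function name `to_int16_saturating` are assumptions chosen for the example; on a real target this would typically be done in C or with saturating hardware instructions.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_saturating(value: float) -> int:
    """Convert a sensor reading to int16, clamping instead of overflowing."""
    if math.isnan(value):
        # NaN has no meaningful clamped value; surface it explicitly.
        raise ValueError("sensor value is NaN")
    if value >= INT16_MAX:
        return INT16_MAX   # also covers +infinity
    if value <= INT16_MIN:
        return INT16_MIN   # also covers -infinity
    return int(value)      # in range: truncate toward zero


# Usage: out-of-range readings saturate at the limits.
readings = [123.9, 40000.0, -1e9, float("inf")]
clamped = [to_int16_saturating(r) for r in readings]
```

Whether saturation is the right response is itself a design decision: a clamped value keeps the system running, but the event should usually also be logged or flagged so that the out-of-range input does not go unnoticed.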
