Now that we have support for fault recovery in VirtuosoNext, we have been wondering how extensive the coverage could be in real-life systems. The issue is that data on failure root causes is either considered confidential or focuses narrowly on specific elements (e.g. hardware reliability). We cannot really find statistical data for these system-level failures. Do you know of any such data?

In a real system, we have layers and we make some assumptions. Firstly, today's hardware can be considered highly reliable, provided that the design rules were followed. If hardware fails, it will most often be because faults are introduced from the outside (bit flips, power supply spikes, I/O issues, etc.). Secondly, software can be correct (e.g. when formally developed and proven), but will likely still contain residual errors. These can be due to incomplete specifications, numerical instability, compiler errors, memory access violations, etc. To simplify things, we also have to assume that the hardware provides some support for detecting such faults. Memory management circuits can detect memory access violations, illegal instructions or data errors can generate an exception interrupt and, at a coarser grain level, time-outs can signal that a complete unit is no longer responding. Everything else might require redundancy in the architecture.

The RTOS kernel of VirtuosoNext handles faults detected by the CPU as exceptions:

Memory access violations: can be triggered by bit flips, but most likely by software errors or security breaches.

Numerical exceptions: can be triggered by I/O not being clamped, but also by software errors and algorithmic instability.

Illegal instructions: pointer errors, bit flips, security breaches, …
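As an illustration only, the three exception classes above could be mapped to recovery actions through a small policy table. This is a minimal sketch in C and not the VirtuosoNext API: every name in it (fault_kind_t, recovery_action_t, select_recovery) is hypothetical, and both the choice of action per fault class and the escalation to a reboot after repeated faults are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fault classes, mirroring the three exception types listed above. */
typedef enum {
    FAULT_MEMORY_ACCESS,
    FAULT_NUMERICAL,
    FAULT_ILLEGAL_INSTRUCTION,
    FAULT_KIND_COUNT
} fault_kind_t;

/* Hypothetical recovery actions, ordered from fine grain to coarse grain. */
typedef enum {
    RECOVER_CLAMP_AND_CONTINUE, /* substitute a bounded value and carry on */
    RECOVER_RESTART_TASK,       /* clean up and restart only the faulty task */
    RECOVER_REBOOT              /* last resort: restart the whole system */
} recovery_action_t;

/* Assumed policy: one fine-grain action per fault class. */
static const recovery_action_t recovery_policy[] = {
    [FAULT_MEMORY_ACCESS]       = RECOVER_RESTART_TASK,
    [FAULT_NUMERICAL]           = RECOVER_CLAMP_AND_CONTINUE,
    [FAULT_ILLEGAL_INSTRUCTION] = RECOVER_RESTART_TASK,
};

/* Fails to compile if a fault class is added without a policy entry. */
static_assert(sizeof recovery_policy / sizeof recovery_policy[0] == FAULT_KIND_COUNT,
              "every fault kind needs a recovery policy entry");

/*
 * Pick a recovery action for a detected fault.
 * prior_faults: number of faults already recovered for this task.
 * max_retries:  assumed budget after which fine-grain recovery is abandoned.
 */
recovery_action_t select_recovery(fault_kind_t kind,
                                  unsigned prior_faults,
                                  unsigned max_retries)
{
    if ((unsigned)kind >= (unsigned)FAULT_KIND_COUNT) {
        return RECOVER_REBOOT; /* unknown fault class: be conservative */
    }
    if (prior_faults >= max_retries) {
        return RECOVER_REBOOT; /* recovery budget exhausted */
    }
    return recovery_policy[kind];
}
```

With this policy, a first numerical exception would be handled by clamping, whereas a task that has already used up its retry budget would escalate to a reboot. The point of the sketch is that the reboot becomes the last entry in the policy rather than the only one.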

The above support aims at providing continuity of real-time embedded applications even when such faults occur. The development environment assists in fine-grain space and time partitioning, but also allows the developer to define automatic "clean-up and recovery" actions. The code generators can be extended to automatically generate temporal and spatial redundancy (because VirtuosoNext is MP-transparent).

Much of this fault recovery support can of course be programmed manually, but the ideal case is that it is automated. The automation should be based on a trade-off analysis. Note that today's practice is often coarse grain: if a fault occurs, the whole application or even the complete processing system is rebooted. Even if such an event has a low probability, in many cases it can be catastrophic. Boot times might be relatively short for small programs (the code must be read from e.g. flash and the system re-initialised), but if the time constraints are too tight (read: micro- or milliseconds) and the code is relatively large, this is not a real option. Hence, the system should ensure that a reboot is not the only option available.
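To make the reboot argument concrete, here is a rough back-of-the-envelope sketch in C. It assumes a very simple model in which the reboot time is the time needed to read the code image from flash plus a fixed re-initialisation time. The function names and all the numbers used below are hypothetical and are not measurements of any real system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Estimated reboot time in microseconds under the assumed model:
 *   reboot time = image size / flash read bandwidth + re-initialisation time
 */
uint64_t reboot_time_us(uint64_t image_bytes,
                        uint64_t flash_bytes_per_s,
                        uint64_t init_us)
{
    assert(flash_bytes_per_s > 0); /* avoid division by zero */
    /* Round the load time up so the estimate is never optimistic. */
    uint64_t load_us = (image_bytes * 1000000ULL + flash_bytes_per_s - 1)
                       / flash_bytes_per_s;
    return load_us + init_us;
}

/* True if a full reboot would still fit within the given deadline. */
bool reboot_meets_deadline(uint64_t image_bytes,
                           uint64_t flash_bytes_per_s,
                           uint64_t init_us,
                           uint64_t deadline_us)
{
    return reboot_time_us(image_bytes, flash_bytes_per_s, init_us) <= deadline_us;
}
```

For example, with an assumed 64 KiB image, a flash read rate of 10 MB/s and 500 microseconds of re-initialisation, the estimate is 7054 microseconds. That fits a 10 ms deadline but not a 1 ms one, and under the same assumptions a 4 MiB image would take roughly 420 ms.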

In order to provide such support in a meaningful (and economical) way, we need to know more about the residual probabilities of failures and errors in a real (embedded) system. We cannot really find statistical data for these system-level failures. Do you know of any such data? We are aware that this might not be trivial, but your help will be greatly appreciated. Contact me at eric.verhulst (at) altreonic.com.
