Systems Engineering for Safety-Critical Applications
Part 1 - Gross Dependability Analysis
This series of articles covers basic methods for assessing the dependability of engineered systems during the design phase. Read on if you want an introduction to some of the tools used when designing dependable systems for safety-critical applications. The content and structure of the series are derived from Chris Hobbs' book, Embedded Software Development for Safety-Critical Systems. I will use a simple system given in the book to demonstrate in detail a series of analyses that progressively deepen in fidelity, as is done in industry. The posts are organised into three parts:
- Gross Dependability Analysis with Markov Chains. A gross dependability analysis to derive a coarse quantitative estimate of system dependability.
- Failure Analysis with Binary Fault Trees. An extended Binary Fault Tree to calculate the smallest sets of nodes that, by failing, will bring the system down.
- Redundancy as a Tool for Dependable System Design. Using the fault tree modelling tool from part 2 to compare, quantitatively, different redundancy architectures imposed on a simple automotive braking system model.
All code, derivations, tools and files used to generate the analyses are made available to the reader at github.com/hgrw/safety-critical-systems-blog.git.
What is an Ultra-Reliable System?
A system is said to be ultra-reliable when it has a mean time between failures (MTBF) on the order of 1E9 hours. Ultra-reliability is required for safety-critical systems in commercial production: with a large fleet of units in the field, even a modest per-unit failure rate translates into frequent failures across the fleet, so a very large MTBF is needed. 1E9 hours works out to roughly 114,000 years. To put this in perspective, consider the Pantheon in Rome, which was completed under emperor Hadrian some time around 120 AD. This is the largest non-reinforced concrete dome in the world and is about 16,652,773 hours old at the time of writing.
Fig.1 - Cross-section of the Pantheon showing how a 43.3-metre diameter sphere fits under its dome.
Its age is about 1.7% of the 1E9-hour MTBF required of the highest-integrity systems.
Here's another interesting example. Let's say you bought an ASIL-D MCU that has a clock speed of 300MHz and a FIT (failures in time, i.e. failures per 1E9 device-hours) score of 10, equivalent to an MTBF of 1E8 hours. In aggregate, you could expect 1.08E20 ticks before failure. This number of ticks is far more than the number of stars in our galaxy (roughly 1E11), and within a few orders of magnitude of the estimated number of stars in the observable universe (on the order of 1E23).
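That tick count is just the clock rate multiplied by the MTBF expressed in seconds. A quick sanity check of the arithmetic, using the figures quoted above:

```python
# Sanity check of the tick count quoted above.
clock_hz = 300e6                  # 300 MHz clock
fit = 10                          # FIT = failures per 1E9 device-hours
mtbf_hours = 1e9 / fit            # FIT of 10 => MTBF of 1E8 hours
ticks = clock_hz * mtbf_hours * 3600
print(f"{ticks:.2e}")             # ~1.08e+20 ticks before failure, on average
```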
Fig.2 - Image of the night sky above Paranal, Chile on 21 July 2007, taken by ESO astronomer Yuri Beletsky.
What is Gross Dependability Analysis?
As far as I can tell, the term gross dependability analysis was coined by Chris Hobbs, who defines it as
"a quick-and-dirty calculation, carried out during the initial design cycle, to determine whether or not a proposed design could possibly meet the system's dependability requirements".
Using a Markov chain is a straightforward way to quickly assess the dependability of a system architecture. The model consists of a set of system states, chained together to represent the various ways in which the system can fail. The price paid for the simplicity of the Markov chain is the set of assumptions that must hold for the model to be solvable. The Markovian assumptions are:
- When the system is in a particular state, the history of how it arrived in that state is irrelevant for determining what will happen next.
- The interarrival times of failure events must be negatively exponentially distributed, i.e. failures form a Poisson process (a short sampling sketch follows this list).
- Only one state change may occur at any time.
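To make the second assumption concrete, here is a minimal sketch (using numpy; the rate of one failure every three years matches the component failure rate used in the example below) of sampling negatively exponentially distributed interarrival times. The sample mean approaches the reciprocal of the rate:

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 1.0 / 3.0                                        # failures per year (one every 3 years on average)
gaps = rng.exponential(scale=1.0 / rate, size=100_000)  # years between successive failures
print(gaps.mean())                                      # ~3.0 years, as expected for a Poisson process
```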
Example System Under Test
We analyse the simple system shown below:
Fig.3 - A simple system. Credit: Embedded Software Development for Safety-Critical Systems, Chris Hobbs
Figure 3 illustrates a simple system with 7 components that may fail: 1A, 1B, 2A, 2B, 2C, 2D, and the power supply. In order to continue operating, the system needs the following to be working (the success logic is sketched in code after the list):
- the power supply, and
- subsystem 1, which will only work if both components 1A and 1B are operational, or
- subsystem 2, which will only operate if 2C and 2D are operational and either 2A or 2B (or both) is operational.
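As a concrete reading of that list, here is a minimal sketch of the success logic in Python. The function name and the representation of failed components are mine, not taken from the book or the repository:

```python
def system_operational(failed):
    """Return True if the system is still running, given a set of failed components."""
    failed = set(failed)
    if "PSU" in failed:
        return False                                # the power supply is a single point of failure
    subsystem_1 = not ({"1A", "1B"} & failed)       # needs both 1A and 1B
    subsystem_2 = (not ({"2C", "2D"} & failed)      # needs both 2C and 2D...
                   and not {"2A", "2B"} <= failed)  # ...and at least one of 2A, 2B
    return subsystem_1 or subsystem_2

print(system_operational({"2C"}))         # True: subsystem 1 still carries the load
print(system_operational({"1A", "2D"}))   # False: both subsystems are down
```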
We will take our time unit to be a year and will take the mean failure rate of all the components except the power supply to be 0.333 per year (i.e., a failure on average every 3 years). The mean failure rate of the power supply is 0.2 per year. The other values that we will need are repair times: When a component failure occurs, how long does it take for a repair to be completed? We will be making the unrealistic assumption that all failures, even those that do not immediately bring the entire system down, will be detected. Note that it is failing to detect component failures that normally contributes most to the failure rate of the complete system.
For the purposes of this exercise, we will assume that the mean repair time for a failed component when the system is still running is 2 weeks (i.e., a rate of 26 per year), and the mean repair time for a component when the system has failed is 1 day (i.e., 365.25 per year). When a repair technician arrives on site, all the failed components are fixed. Note that all the failure and repair rates are means; the actual times to failure or repair will be negatively exponentially distributed with those means. This example is analysed further in Chapter 12 of the book, where a fault tree is drawn for it.
Identifying System States
Figure 4 illustrates the states between which the system may move.
Fig.4 - An example of a Markov model. Credit: Embedded Software Development for Safety-Critical Systems, Chris Hobbs
Note that, as indicated in the figure, some arrows have been omitted to reduce clutter. Given those omissions, figure 4 can be read as follows.
Assume that the system is in state 5 with component 2C or 2D failed. From this state it can move to state 12 (system failed) if either 1A or 1B fails; it can move to state 9 if 2A fails; it can move to state 10 if 2B fails; and it can move to state 1 if the repair is completed (arrow not shown).
In accordance with the Markovian assumptions, only one state change may occur at a time. It is therefore unnecessary to model, for example, the condition that, when in state 1, components 1A and 2A fail simultaneously: only one event can occur at each moment in time.
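Using the failure and repair rates quoted earlier, the outgoing transitions from state 5 in that example can be written down directly. The following is just a sketch of the bookkeeping (state numbers follow the figure; the rate values are my reading of the quoted figures), not the full twelve-state model:

```python
# Outgoing transition rates (per year) from state 5, in which 2C or 2D has failed.
FAIL = 1.0 / 3.0        # failure rate of each component other than the power supply
REPAIR_RUNNING = 26.0   # repair rate while the system is still running

transitions_from_state_5 = {
    12: 2 * FAIL,          # 1A or 1B fails -> system failed
    9:  FAIL,              # 2A fails
    10: FAIL,              # 2B fails
    1:  REPAIR_RUNNING,    # repair completed (arrow omitted from the figure)
}
```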
Building the Equations
Given the rates at which each of these transitions occurs, a set of simultaneous linear equations can be constructed to reflect these state transitions.
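The specific equations depend on the transition rates read off the figure, but they all share the same steady-state balance form: for each state, the probability flow into the state must equal the probability flow out of it. Writing $p_i$ for the long-run proportion of time spent in state $i$ and $\lambda_{ij}$ for the transition rate from state $i$ to state $j$, each of the twelve states contributes one equation:

$$\sum_{j \neq i} p_j \, \lambda_{ji} \;=\; p_i \sum_{j \neq i} \lambda_{ij}, \qquad i = 1, \dots, 12$$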
Given that there are twelve unknowns and one balance equation per state, it might be thought that twelve equations would be sufficient. However, they are not linearly independent, so one of them must be discarded and a thirteenth equation added to the set: the normalisation condition that the state probabilities sum to 1.
Solving the Equations
The resulting twelve simultaneous linear equations can be solved using any convenient method (e.g., LU-decomposition). When selecting a method, bear in mind that there can be a wide disparity in the magnitudes of the numbers that occur in this type of analysis. Even in this trivial problem, we have numbers as large as 365.25 (the repair rate following system failure) and as small as 0.2 (the failure rate of the power supply). In a more complex system, failures may occur only every few years, while repairs (particularly in a software system) may occur within milliseconds. Unless care is taken in solving the equations and the frequency unit is chosen appropriately, numerical precision can easily be lost.
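As a sketch of the mechanics, the following solves for the steady-state probabilities by replacing one of the dependent balance equations with the normalisation condition. It uses numpy rather than hand-rolled LU-decomposition, and a hypothetical three-state chain (healthy, degraded, failed) built from the rates quoted above rather than the twelve-state model, so the printed numbers are illustrative only:

```python
import numpy as np

def steady_state(Q):
    """Steady-state probabilities of a continuous-time Markov chain.

    Q is the generator matrix: Q[i, j] is the transition rate from state i to
    state j for i != j, and each row sums to zero. The steady state satisfies
    p @ Q = 0 with sum(p) = 1, so we replace one row of the transposed system
    with the normalisation equation and solve.
    """
    A = Q.T.copy()
    A[-1, :] = 1.0                  # swap one (dependent) equation for sum(p) = 1
    b = np.zeros(Q.shape[0])
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Hypothetical 3-state example: 0 = healthy, 1 = one component failed, 2 = system failed.
fail, repair_up, repair_down = 1.0 / 3.0, 26.0, 365.25   # rates per year, as quoted above
Q = np.array([
    [-fail,                     fail,  0.0        ],   # healthy: a component fails
    [repair_up,  -(repair_up + fail),  fail       ],   # degraded: repaired, or a second failure
    [repair_down,                0.0, -repair_down],   # failed: repair restores the whole system
])
print(steady_state(Q))   # long-run proportion of time spent in each state
```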
Result
The plot below was generated with a simple Python implementation of our toy model. We have used the results given in the book to validate that our approach is correct. The blue bars show the proportion of time spent in each state; the orange bars show the solution given in the book.
Fig.5 - Results and comparison with data from the book: Embedded Software Development for Safety-Critical Systems, Chris Hobbs