Dependability
- How you decide whether a system is operating properly
- Infrastructure providers offer service level agreements (SLAs) to guarantee a level of dependability
- Service accomplishment is when the service is delivered as specified
- Service interruption is when the delivered service differs from what was specified
- In a system where every component must work, the reliability of the whole system is no higher than that of its least reliable component, and usually lower
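A minimal sketch of the last point, assuming the components fail independently and the system needs every one of them to work (a series arrangement). Both assumptions and the reliability values are illustrative, not taken from the notes.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Probability that all components work, assuming independent failures."""
    return prod(component_reliabilities)

# Hypothetical reliabilities of three components over the same time window.
components = [0.99, 0.98, 0.95]
system = series_reliability(components)

# The whole system is never more reliable than its weakest component.
assert system <= min(components)
print(f"system reliability: {system:.4f}")
```

Each extra component multiplies in another factor of at most 1, so the product can only stay the same or shrink.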
Metrics
- Mean time to failure (MTTF)
- Failures in time (FIT)
- The rate of failures
- 1/MTTF
- FIT for a system = (number of components * failure rate of that component) + …, where the failure rate of a component is 1/MTTF of that component
- Traditionally reported as failures per billion hours of operation
- Mean time to repair (MTTR)
- Measures service interruption
- Mean time between failures (MTBF)
- MTTF + MTTR
- Module availability
- MTTF / (MTTF + MTTR)
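The metrics above can be sketched as a few small functions. This assumes independent components with constant failure rates, so that the system failure rate is the sum of count * (1/MTTF) over component types. The part counts and MTTF values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
HOURS_PER_BILLION = 1e9  # FIT is reported per billion hours of operation

def failure_rate(mttf_hours):
    """Failures per hour for one component: 1 / MTTF."""
    return 1.0 / mttf_hours

def system_fit(parts):
    """parts: list of (count, mttf_hours) pairs, one per component type."""
    per_hour = sum(count * failure_rate(mttf) for count, mttf in parts)
    return per_hour * HOURS_PER_BILLION

def system_mttf(parts):
    """System MTTF in hours is the inverse of the system failure rate."""
    return HOURS_PER_BILLION / system_fit(parts)

def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: time to fail plus time to repair."""
    return mttf_hours + mttr_hours

def availability(mttf_hours, mttr_hours):
    """Fraction of time the module delivers service."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical system: 10 parts with 1,000,000-hour MTTF,
# 1 part with 500,000-hour MTTF, 1 part with 200,000-hour MTTF.
parts = [(10, 1_000_000), (1, 500_000), (1, 200_000)]
fit = system_fit(parts)
mttf = system_mttf(parts)
print(f"FIT: {fit:.0f}, MTTF: {mttf:.0f} hours")
print(f"availability with a 24-hour MTTR: {availability(mttf, 24):.5f}")
```

Note that ten parts with a long MTTF contribute more to the system failure rate here than the single least reliable part, because their rates add up.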
Improving
- Fault Avoidance
- Prevent fault occurrence by construction
- Fault Tolerance
- Using redundancy to allow the service to comply with the service specification despite the occurrence of faults
- Fault Prediction
- Predict faults and replace the component before it fails
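As an illustration of fault tolerance through redundancy, the sketch below uses majority voting over three replicas (often called triple modular redundancy). The notes do not name a specific technique, so this is one assumed example, and the replica outputs are made up.

```python
from collections import Counter

def majority_vote(results):
    """Return the value a strict majority of replicas agree on.

    Raises RuntimeError if no value has a strict majority.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value

# Hypothetical outputs of three replicas; the third one is faulty.
replica_outputs = [42, 42, 17]
voted = majority_vote(replica_outputs)
print(f"voted result: {voted}")
```

With three replicas the vote masks one faulty result; two simultaneous faults that disagree with each other leave no majority, and the function reports that instead of guessing.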
Terms
- Fault
- A component behaves incorrectly
- Ex: An alpha particle hits a memory cell and flips its content
- Error
- The fault is activated: the faulty state is actually used
- Does not have to impact the outcome
- Ex: memory is modified, but A is still less than B in an if statement
- Ex: the modified memory content is read and used
- Failure
- The error changes the outcome, so the delivered service is incorrect
- Ex: modified memory causes the body of an if statement to be skipped
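The three terms can be traced with a simulated bit flip. This is a small sketch that follows the examples above; the values of a and b and the choice of flipped bits are hypothetical.

```python
def flip_bit(value, bit):
    """Simulate a fault: invert one bit of a stored integer."""
    return value ^ (1 << bit)

def branch_taken(a, b):
    """The if statement from the examples: is A less than B?"""
    return a < b

a, b = 5, 100
expected = branch_taken(a, b)  # fault-free outcome

# Fault in a low bit: 5 becomes 7. Reading the value is an error,
# but 7 is still less than 100, so the outcome is unchanged.
masked = flip_bit(a, 1)
masked_is_failure = branch_taken(masked, b) != expected

# Fault in a high bit: 5 becomes 133. Now the comparison goes the
# other way and the body of the if statement is skipped: a failure.
visible = flip_bit(a, 7)
visible_is_failure = branch_taken(visible, b) != expected

print(f"low-bit flip:  a={masked}, failure={masked_is_failure}")
print(f"high-bit flip: a={visible}, failure={visible_is_failure}")
```

If the corrupted cell were never read at all, the fault would stay latent and there would be no error either.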