Dependability

  • How you decide whether a system is operating properly
    • Infrastructure providers offer service level agreements (SLAs) to guarantee a level of dependability
    • Service accomplishment is the state in which the service is delivered as specified
    • Service interruption is the state in which the delivered service differs from what was specified
  • In a system where every component must work, the reliability of the whole system is no higher than that of its least reliable component, and is lower whenever more than one component can fail
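
The point about system reliability can be checked with a small sketch. It assumes a series system (every component must work) and independent failures, so the system reliability is the product of the component reliabilities. The function name and the reliability values are hypothetical, chosen only for illustration.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Probability that all components work, assuming independent failures."""
    return prod(component_reliabilities)

# Hypothetical per-component reliabilities over the same time window.
parts = [0.99, 0.98, 0.95]
system = series_reliability(parts)

print(f"least reliable component: {min(parts):.4f}")
print(f"whole system:             {system:.4f}")
```

Under these assumptions the product can never exceed the smallest factor, which is why adding components to a series system can only lower its reliability.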

Metrics

  • Mean time to failure (MTTF)
    • Measures reliability
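  • Failures in time (FIT)
    • The rate of failures
    • Failure rate = 1/MTTF
    • Failure rate for a system = (number of components of one type * (1 / MTTF of that component)) + …
      • Assumes component failures are independent and component lifetimes are exponentially distributed
    • Traditionally reported as failures per billion hours of operation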
  • Mean time to repair (MTTR)
    • Measures service interruption
  • Mean time between failures (MTBF)
    • MTTF + MTTR
  • Module availability
    • MTTF / (MTTF + MTTR)
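
A minimal sketch of how these metrics fit together, assuming independent component failures and exponentially distributed lifetimes so that failure rates simply add. The component inventory, the MTTF values, and the 24-hour MTTR are hypothetical numbers, not measurements.

```python
HOURS_PER_BILLION = 1e9

def system_failure_rate(components):
    """Sum of count / MTTF over all component types, in failures per hour."""
    return sum(count / mttf_hours for count, mttf_hours in components)

def availability(mttf_hours, mttr_hours):
    """Module availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical inventory: (number of components, MTTF of one component in hours)
components = [
    (10, 1_000_000),  # disks
    (1, 500_000),     # controller
    (1, 200_000),     # power supply
]

rate = system_failure_rate(components)  # failures per hour
fit = rate * HOURS_PER_BILLION          # failures per billion hours
system_mttf = 1 / rate                  # hours
mttr = 24                               # hours, assumed repair time
mtbf = system_mttf + mttr               # hours
avail = availability(system_mttf, mttr)

print(f"FIT          = {fit:.0f}")
print(f"System MTTF  = {system_mttf:.0f} hours")
print(f"MTBF         = {mtbf:.0f} hours")
print(f"Availability = {avail:.6f}")
```

Note that the system MTTF comes out far lower than the MTTF of any single component, which matches the earlier point about system reliability.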

Improving

  • Fault Avoidance
    • Prevent fault occurrence by construction
  • Fault Tolerance
    • Using redundancy to allow the service to comply with the service specification despite the occurrence of faults
  • Fault Prediction
    • Predict the presence and creation of faults so that the component can be replaced before it fails
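
As one illustration of fault tolerance through redundancy, the sketch below runs the same computation on three replicas and takes a majority vote, so a single faulty replica does not change the delivered result. This is only one possible redundancy scheme, picked for the example; the replica functions and the voter are hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value

def replica(x):
    """Hypothetical correct computation."""
    return x * x

def faulty_replica(x):
    """Hypothetical replica with a fault: result is off by one."""
    return x * x + 1

# Three redundant copies, one of which is faulty.
results = [f(7) for f in (replica, faulty_replica, replica)]
voted = majority_vote(results)

print("replica outputs:", results)
print("voted output:   ", voted)
```

The voter masks one faulty replica out of three. If two replicas disagreed with the third in different ways, there would be no majority and the sketch raises instead of guessing.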

Terms

  • Fault
    • A component behaves incorrectly
    • Ex: An alpha particle hits a memory cell and flips its content
  • Error
    • The faulty state is actually used
    • Ex: the modified memory content is read
    • Does not have to impact the outcome
      • Ex: memory is modified, but A is still less than B in an if statement
  • Failure
    • The error changes the outcome, so the delivered result is incorrect
    • Ex: the modified memory causes the body of an if statement to be skipped
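
The three terms can be told apart with a small sketch built on the memory example above. It simulates a single bit flip in A and then checks whether the comparison A < B still gives the same answer. The function names and the values of A and B are hypothetical.

```python
def flip_bit(value, bit):
    """Simulate a fault: invert one bit of a stored integer."""
    return value ^ (1 << bit)

def classify(a, b, bit, a_is_read=True):
    """Classify what a bit flip in `a` does to the test `a < b`."""
    expected = a < b
    corrupted = flip_bit(a, bit)  # the fault: stored content is now wrong
    if not a_is_read:
        return "fault only"       # corrupted value is never used
    observed = corrupted < b      # the error: corrupted value is used
    if observed == expected:
        return "error, no failure"
    return "failure"              # the outcome of the if statement changed

a, b = 4, 100
print(classify(a, b, bit=0, a_is_read=False))  # value never read
print(classify(a, b, bit=0))                   # 5 < 100 still holds
print(classify(a, b, bit=7))                   # 132 < 100 no longer holds
```

Flipping a low-order bit leaves A below B, so the error is harmless here, while flipping bit 7 pushes A above B and changes which branch runs.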