From Traditional Fault Tolerance to Blockchain. Wenbing Zhao
are often used interchangeably.
Figure 1.1 An example of a chain of threats with two levels of recursion.
Of course, not all failures can be analyzed with the above chain of threats. For example, a power outage of the entire system would immediately cause the failure of the system.
Faults can be classified according to different criteria. The most common classifications include:
◾ Based on the source of the faults, faults can be classified as:
  – Hardware faults, if the faults are caused by the failure of hardware components, such as power outages, hard drive failures, bad memory chips, etc.
  – Software faults, if the faults are caused by software bugs, such as race conditions and missing bounds checks on arrays.
  – Operator faults, if the faults are caused by the operator of the system, for example, misconfiguration, wrong upgrade procedures, etc.
◾ Based on the intent of the faults, faults can be classified as:
  – Non-malicious faults, if the faults are not caused by a person with malicious intent. For example, naturally occurring hardware faults and some remnant software bugs, such as race conditions, are non-malicious faults.
  – Malicious faults, if the faults are caused by a person with intent to harm the system, for example, to deny services to legitimate clients or to compromise the integrity of the service. Malicious faults are often referred to as commission faults, or Byzantine faults [5].
◾ Based on the duration of the faults, faults can be classified as:
  – Transient faults, if the fault is activated momentarily and then becomes dormant again. For example, a race condition often shows up as a transient fault because, once the threads stop accessing the shared variable concurrently, the fault appears to have disappeared.
  – Permanent faults, if the fault, once activated, stays activated until the faulty component is repaired or the source of the fault is addressed. For example, a power outage is considered a permanent fault because a computer system will remain powered off until the power is restored. A specific permanent fault is the (process) crash fault. For example, a segmentation fault could result in the crash of a process.
◾ Based on how a fault in a component reveals itself to other components in the system, faults can be classified as:
  – Content faults, if the values passed on to other components are wrong due to the faults. A faulty component may always pass on the same wrong values to other components, or it may return different values to different components that it interacts with. The latter is specifically modeled as Byzantine faults [5].
  – Timing faults, if the faulty component returns a reply either too early or too late after receiving a request from another component. An extreme case is when the faulty component stops responding at all (i.e., it takes an infinite amount of time to return a reply), e.g., when the component crashes, or hangs due to an infinite loop or a deadlock.
◾ Based on whether or not a fault is reproducible or deterministic, faults (primarily software faults) can be classified as:
  – Reproducible/deterministic faults. The fault happens deterministically and can be easily reproduced. Accessing a null pointer is an example of a deterministic fault, which often leads to the crash of the system. This type of fault can be easily identified and repaired.
  – Nondeterministic faults. The fault appears to happen nondeterministically and is hard to reproduce. For example, if a fault is caused by a specific interleaving of several threads when they access some shared variable, it is going to be hard to reproduce such a fault. Software faults of this type are also referred to as Heisenbugs to highlight their uncertainty.
◾ Given a number of faults within a system, we can classify them based on their relationship:
  – Independent faults, if there is no causal relationship between the faults, e.g., given fault A and fault B, B is not caused by A, and A is not caused by B.
  – Correlated faults, if the faults are causally related, e.g., given fault A and fault B, either B is caused by A, or A is caused by B. If multiple components fail due to a common reason, the failures are referred to as common mode failures.
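Because the criteria above are separate ways of looking at the same fault, a single fault can carry one label per criterion. The following is a minimal sketch of that idea for three of the criteria (source, intent, and duration). The type and field names (`Source`, `Intent`, `Duration`, `Fault`) are hypothetical and chosen only for illustration, and the labels assigned to the power outage's intent are an assumption rather than something the text states.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"
    OPERATOR = "operator"


class Intent(Enum):
    NON_MALICIOUS = "non-malicious"
    MALICIOUS = "malicious"


class Duration(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


@dataclass(frozen=True)
class Fault:
    """One fault, labeled along three of the classification criteria."""
    description: str
    source: Source
    intent: Intent
    duration: Duration


# The race condition labels follow the examples given in the text.
race_condition = Fault(
    "race condition on a shared variable",
    Source.SOFTWARE, Intent.NON_MALICIOUS, Duration.TRANSIENT,
)

# Assumption: the outage is accidental, so it is labeled non-malicious.
power_outage = Fault(
    "power outage of the entire system",
    Source.HARDWARE, Intent.NON_MALICIOUS, Duration.PERMANENT,
)

faults = [race_condition, power_outage]

# Select the faults that stay active until something is repaired.
permanent_faults = [f for f in faults if f.duration is Duration.PERMANENT]
```

Keeping each criterion as its own enumeration makes it easy to add the remaining criteria (manifestation, reproducibility, relationship) as further fields without touching the existing ones.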
When a system fails, it is desirable to avoid catastrophic consequences, such as the loss of life. The consequences of a failure can be alleviated by incorporating dependability mechanisms into the system such that, when it fails, it stops responding to requests (such systems are referred to as fail-stop systems). If this is impossible, the system should at least return consistent wrong values, instead of inconsistent values, to all components that it may interact with. If the failure of a system does not cause great harm either to human life or to the environment, we call such a system a fail-safe system. Usually, a fail-safe system defines a set of safe states. When a fail-safe system can no longer operate according to its specification due to faults, it transitions to one of the predefined safe states. For example, the computer system that is used to control a nuclear power plant must be a fail-safe system.
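The idea of a predefined safe state can be illustrated with a small controller. This is only a sketch under stated assumptions: the class `FailSafeController`, the state names, the temperature threshold, and the `read_temperature` callback are all hypothetical, and a real safety-critical system would involve far more than a state flag.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    OPERATING = "operating"
    SAFE_SHUTDOWN = "safe shutdown"  # the single predefined safe state


class FailSafeController:
    """Hypothetical controller that enters a safe state on any fault."""

    def __init__(self, max_temperature: float) -> None:
        self.max_temperature = max_temperature
        self.state = State.OPERATING

    def step(self, read_temperature: Callable[[], float]) -> State:
        """Run one control cycle and return the resulting state."""
        if self.state is State.SAFE_SHUTDOWN:
            # Once in the safe state, stay there until repaired externally.
            return self.state
        try:
            temperature = read_temperature()
            if temperature > self.max_temperature:
                raise RuntimeError("temperature out of range")
        except Exception:
            # Sensor failure or out-of-range reading: the controller can no
            # longer meet its specification, so it moves to the safe state.
            self.state = State.SAFE_SHUTDOWN
        return self.state


def broken_sensor() -> float:
    """Simulates a sensor whose read operation fails."""
    raise IOError("sensor unreachable")


controller = FailSafeController(max_temperature=100.0)
first = controller.step(lambda: 50.0)    # normal reading
second = controller.step(broken_sensor)  # fault detected
third = controller.step(lambda: 50.0)    # still in the safe state
```

The controller deliberately does not return to `OPERATING` on its own; leaving the safe state is treated here as a repair action outside the sketch.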
Perhaps counterintuitively, it is often desirable for a system to halt its operation immediately when it is in an error state or encounters an unexpected condition. The software engineering practice that ensures such behavior is called fail fast [9]. The benefit of the fail-fast practice is that it enables early detection and diagnosis of software faults. Once a fault has propagated to many other components, it is much harder to pinpoint the source of the problem.
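In Python, one common way to approximate the fail-fast practice is to validate inputs at the point of use and raise an exception immediately, rather than continuing with a bad value. The function below is a hypothetical example (the name `withdraw` and its rules are assumptions made for illustration); it is a sketch of the practice, not a prescribed implementation.

```python
def withdraw(balance: int, amount: int) -> int:
    """Return the balance after a withdrawal, failing fast on bad input."""
    # Reject invalid requests at once, so the error is reported where the
    # bad value first appears instead of surfacing later in another component.
    if amount <= 0:
        raise ValueError(f"amount must be positive, got {amount}")
    if amount > balance:
        raise ValueError(f"amount {amount} exceeds balance {balance}")
    return balance - amount


remaining = withdraw(balance=100, amount=30)

try:
    withdraw(balance=100, amount=-5)
    failed_fast = False
except ValueError:
    failed_fast = True
```

The alternative, silently clamping or ignoring the negative amount, would let the fault propagate into the stored balance, which is exactly the situation in which the source of the problem becomes hard to pinpoint.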
1.1.3 Dependability Attributes and Evaluation Metrics
A dependable system has a number of desirable attributes, and some of these attributes can be used as evaluation metrics for the system. We classify these attributes into two categories: (1) those that are fundamental to, and of immediate concern to, all distributed systems, including availability, reliability, and integrity; and (2) those that are secondary and may not be of immediate concern to, or applicable to, all systems, such as maintainability and safety.
The availability and reliability of a system can be used as evaluation metrics. Other attributes are normally not used as evaluation metrics because it is difficult to quantify the integrity, maintainability, and safety of a distributed system.
1.1.3.1 Availability
Availability is a measure of the readiness of a dependable system at a point in time, i.e., the probability that the system is there to provide a service when a client needs to use it. The availability of a system is determined by two factors:
◾ Mean time to failure (MTTF). It characterizes how long the system can run without a failure.
◾ Mean time to repair (MTTR). It characterizes how long it takes to repair the system and recover it to be fully functional again.
Availability is defined to be MTTF/(MTTF + MTTR). Hence, the larger the MTTF, the higher the availability of the system. Similarly, the smaller the MTTR, the higher the availability of the system.
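The definition MTTF/(MTTF + MTTR) can be evaluated directly. The sketch below assumes both quantities are given in the same time unit (hours here); the function name `availability` and the sample figures are hypothetical.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), both in the same time unit."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR must not be negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 999 hours, repaired in 1 hour.
baseline = availability(mttf_hours=999.0, mttr_hours=1.0)

# Doubling the MTTF or halving the MTTR both raise the availability.
longer_mttf = availability(mttf_hours=1998.0, mttr_hours=1.0)
shorter_mttr = availability(mttf_hours=999.0, mttr_hours=0.5)
```

With these sample figures the baseline works out to 999/1000, and both variations give a strictly higher value, matching the two observations in the text.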
The availability of a system is typically expressed in terms of the number of 9s. For example, if a system is claimed to offer five 9s availability, the system will be available with a probability of 99.999%, i.e., the system has a 10^-5 probability of being unavailable when a client wants to access the service offered by the system at any time, which means that the system may have at most 5.256 minutes of downtime a year.
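The 5.256-minute figure follows from multiplying the unavailability by the number of minutes in a year. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def max_downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability with the given number of 9s."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. five 9s -> 1e-5
    return unavailability * MINUTES_PER_YEAR


# Downtime budgets for three, four, and five 9s.
budgets = {n: max_downtime_minutes_per_year(n) for n in (3, 4, 5)}
for n, minutes in budgets.items():
    print(f"{n} nines: {minutes:.3f} minutes per year")
```

Each additional 9 divides the yearly downtime budget by ten, which is why moving from four 9s to five 9s is a much harder engineering target than the small change in the percentage suggests.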
1.1.3.2 Reliability
Reliability is a measure of the system’s capability of providing correct services continuously for a period of time. It is often represented as the probability for the system to do so for a given period of time t, i.e., Reliability = R(t). The larger the t, the lower the reliability value. The reliability of a system is proportional to MTTF. The relationship between reliability and availability can be represented as