◾ Based on how a fault in a component reveals itself to other components in the system, faults can be classified as:
– Content faults, if the values passed on to other components are wrong due to the fault. A faulty component may always pass on the same wrong values to other components, or it may return different values to the different components it interacts with. The latter case is specifically modeled as Byzantine faults [5].
– Timing faults, if the faulty component returns a reply either too early or too late after receiving a request from another component. An extreme case is when the faulty component stops responding altogether (i.e., it takes an infinite amount of time to return a reply), e.g., when the component crashes, or hangs due to an infinite loop or a deadlock.
◾ Based on whether or not a fault is reproducible, faults (primarily software faults) can be classified as:
– Reproducible/deterministic faults. The fault happens deterministically and can be easily reproduced. Dereferencing a null pointer is an example of a deterministic fault, which often leads to a crash of the system. This type of fault can be easily identified and repaired.
– Nondeterministic faults. The fault appears to happen nondeterministically and is hard to reproduce. For example, if a fault is caused by a specific interleaving of several threads accessing a shared variable, reproducing it will be hard (see the sketch after this list). Such software faults are also referred to as Heisenbugs to highlight their elusive nature.
◾ Given a number of faults within a system, we can classify them based on their relationship:
– Independent faults, if there is no causal relationship between the faults, e.g., given fault A and fault B, B is not caused by A, and A is not caused by B.
– Correlated faults, if the faults are causally related, e.g., given fault A and fault B, either B is caused by A, or A is caused by B. If multiple components fail due to a common reason, the failures are referred to as common-mode failures.
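To make the nondeterminism concrete, below is a minimal Python sketch of a Heisenbug (the counter and the iteration count are our own illustration, not from the text): two threads perform an unsynchronized read-modify-write on a shared variable, so updates are occasionally lost depending on how the scheduler happens to interleave them.

```python
import threading

counter = 0

def increment(times):
    """Increment the shared counter with an unprotected read-modify-write."""
    global counter
    for _ in range(times):
        tmp = counter   # read
        tmp += 1        # modify
        counter = tmp   # write; the update is lost if another thread
                        # wrote to counter after our read

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 200000, but the actual value varies from run to run when the
# race manifests, which is precisely what makes such faults hard to reproduce.
print(counter)
```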
When a system fails, it is desirable to avoid catastrophic consequences, such as the loss of life. The consequences of a system failure can be alleviated by incorporating dependability mechanisms into the system such that, when it fails, it stops responding to requests (such systems are referred to as fail-stop systems) or, if this is not possible, it returns consistent wrong values, instead of inconsistent values, to all components it interacts with. If the failure of a system does not cause great harm either to human life or to the environment, we call such a system a fail-safe system. Usually, a fail-safe system defines a set of safe states. When a fail-safe system can no longer operate according to its specification due to faults, it transitions to one of the predefined safe states. For example, the computer system that is used to control a nuclear power plant must be a fail-safe system.
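The fail-safe pattern can be sketched in a few lines; the controller class, its states, and the sensor range below are hypothetical, meant only to illustrate transitioning to a predefined safe state when the system can no longer operate according to its specification.

```python
class FailSafeController:
    """A toy fail-safe controller with a predefined set of safe states."""

    SAFE_STATES = {"shutdown", "standby"}

    def __init__(self):
        self.state = "operating"

    def step(self, sensor_reading):
        try:
            # Hypothetical specification: readings must fall in [0, 100].
            if not 0.0 <= sensor_reading <= 100.0:
                raise ValueError("sensor reading out of specified range")
            # ... normal control logic would go here ...
        except Exception:
            # Cannot operate per the specification: move to a safe state.
            self.enter_safe_state("shutdown")

    def enter_safe_state(self, target):
        assert target in self.SAFE_STATES
        self.state = target  # e.g., de-energize actuators, close valves

controller = FailSafeController()
controller.step(250.0)   # the out-of-range input triggers the transition
print(controller.state)  # -> shutdown
```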
Perhaps counterintuitively, it is often desirable for a system to halt its operation immediately when it enters an error state or encounters an unexpected condition. The software engineering practice that ensures such behavior is called fail fast [9]. The benefit of the fail-fast practice is that it enables early detection and diagnosis of software faults: once a fault has propagated to many other components, it is much harder to pinpoint the source of the problem.
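As a minimal sketch of the fail-fast practice (the function and its checks are hypothetical), the component below validates its arguments at the point of entry and raises immediately, rather than letting a bad value propagate to other components:

```python
def transfer(balance, amount):
    """Return the balance after withdrawing amount, failing fast on bad input."""
    # Fail fast: reject impossible arguments before any state is touched.
    if amount <= 0:
        raise ValueError(f"transfer amount must be positive, got {amount}")
    if amount > balance:
        raise ValueError("transfer amount exceeds available balance")
    return balance - amount

print(transfer(100.0, 30.0))  # -> 70.0
transfer(100.0, -5.0)         # raises ValueError right at the boundary
```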
1.1.3 Dependability Attributes and Evaluation Metrics
A dependable system has a number of desirable attributes, and some of them can be used as evaluation metrics for the system. We classify these attributes into two categories: (1) those that are fundamental to, and of immediate concern to, all distributed systems, including availability, reliability, and integrity; and (2) those that are secondary and may not be of immediate concern to, or applicable to, all systems, such as maintainability and safety.
The availability and reliability of a system can be used as evaluation metrics. Other attributes are normally not used as evaluation metrics because it is difficult to quantify the integrity, maintainability, and safety of a distributed system.
1.1.3.1 Availability
Availability is a measure of the readiness of a dependable system at any point in time, i.e., the probability that, when a client needs to use a service provided by the system, the system is there to provide the service. The availability of a system is determined by two factors:
◾ Mean time to failure (MTTF). It characterizes how long, on average, the system can run without a failure.
◾ Mean time to repair (MTTR). It characterizes how long, on average, it takes to repair the system and restore it to full function after a failure.
Availability is defined as MTTF/(MTTF + MTTR). Hence, the larger the MTTF, the higher the availability of a system. Similarly, the smaller the MTTR, the higher the availability of the system.
The availability of a system is typically expressed in terms of the number of 9s. For example, if a system is claimed to offer five 9s availability, it means that the system is available with a probability of 99.999%, i.e., there is a probability of 10⁻⁵ that the system is unavailable when a client wants to access its service at any given time, which means that the system may have at most 5.256 minutes of downtime a year.
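The arithmetic can be checked with a short Python sketch (the function and variable names are ours):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def availability(mttf, mttr):
    """Availability = MTTF / (MTTF + MTTR); both must use the same time unit."""
    return mttf / (mttf + mttr)

# A system that fails every 1,000 hours and takes 2 hours to repair:
print(availability(mttf=1000.0, mttr=2.0))   # ~0.998

# Five 9s: an unavailability of 10**-5 allows about 5.256 minutes
# of downtime per year.
print((1 - 0.99999) * MINUTES_PER_YEAR)      # ~5.256
```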
1.1.3.2 Reliability
Reliability is a measure of the system’s capability of providing correct services continuously for a period of time. It is often represented as the probability that the system does so for a given period of time t, i.e., Reliability = R(t). The larger the t, the lower the reliability value. The reliability of a system is directly related to MTTF; in fact, MTTF = ∫₀^∞ R(t) dt. Hence, the relationship between reliability and availability can be represented as Availability = ∫₀^∞ R(t) dt / (∫₀^∞ R(t) dt + MTTR).
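Under the common constant failure-rate assumption (our assumption here, not stated in the text), the reliability function takes the exponential form R(t) = e^(−t/MTTF), which the following sketch evaluates:

```python
import math

def reliability(t, mttf):
    """R(t) = e**(-t/MTTF): probability of correct service for duration t,
    assuming a constant failure rate of 1/MTTF."""
    return math.exp(-t / mttf)

# The larger t is, the lower the reliability; a larger MTTF raises it.
print(reliability(t=100.0, mttf=1000.0))   # ~0.905
print(reliability(t=1000.0, mttf=1000.0))  # ~0.368
```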
1.1.3.3 Integrity
Integrity refers to the capability of a system to protect its state from being compromised under various threats. In dependable computing research, integrity is typically translated into the consistency of server replicas, if redundancy is employed. As long as the number of faulty replicas does not exceed a pre-defined threshold, the consistency of the remaining replicas would naturally imply the integrity of the system.
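As a sketch of how replica consistency implies integrity, the hypothetical voter below (not from the text) returns a reply once at least f + 1 replicas agree, assuming 2f + 1 replicas of a deterministic service of which at most f are faulty:

```python
from collections import Counter

def vote(replies, f):
    """Return the majority reply, tolerating up to f faulty replicas.

    Requires len(replies) >= 2*f + 1, so that the correct replicas (which
    all return the same value for a deterministic service) form a majority.
    """
    assert len(replies) >= 2 * f + 1
    value, count = Counter(replies).most_common(1)[0]
    if count >= f + 1:
        return value
    raise RuntimeError("no reply matched by at least f + 1 replicas")

# Three replicas, tolerating one faulty replica (f = 1):
print(vote(["ok", "ok", "corrupted"], f=1))  # -> ok
```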
1.1.3.4 Maintainability
Maintainability refers to the capability of a system to evolve after it is deployed. Once a software fault is detected, it is desirable to be able to apply a patch that repairs the system without having to uninstall the existing system and then reinstall an updated one. The same patching/software-update mechanism may be used to add new features or improve the performance of the existing system. Ideally, we want to be able to perform the software update without having to shut down the running system (often referred to as live upgrade or live update), which is already a standard feature of many operating systems for patching non-kernel components. Live upgrade has also been achieved via replication in some distributed systems [12].
1.1.3.5 Safety
Safety