From Traditional Fault Tolerance to Blockchain. Wenbing Zhao

Чтение книги онлайн.

Читать онлайн книгу From Traditional Fault Tolerance to Blockchain - Wenbing Zhao страница 12

From Traditional Fault Tolerance to Blockchain - Wenbing Zhao

Скачать книгу

online shopping), civil infrastructure (e.g., electric power grid and traffic control), entertainment (e.g., online gaming and multimedia streaming), and personal data storage (e.g., various cloud services such as Dropbox, Google Drive, and OneDrive). Behind these cloud services is distributed computing, which addresses many critical issues in making the services dependable and trustworthy. The most important of all is to build consensus in its internal operations that span many different computing nodes.

      Distributed consensus has been studied for several decades, at least starting in 1970s. The reason why distributed consensus is important is that a distributed system would span over many computing nodes, and these nodes must maintain a common view on the system state so that each can operate as planned towards the objectives of the system. Prolonged inconsistency among different components of the system would damage the integrity of the system and ultimately would result in system-level failures that are visible to end users.

      The cost of system failures is enormous. If a data center is brought down by a system failure, the average cost for downtime may range from $42,000 to about $300,000 per hour [2, 6]. The cost can be estimated by summing up the wasted expenses and the loss of revenue. While the labor cost of downtime may be estimated relatively easily (i.e., roughly, wasted expenses per hour = number of employees × average salary per hour) [13], it is much harder to estimate the loss of revenue, especially due to the damages on the reputation of the business and the loyalty of its potential customers [2].

      All classical consensus algorithms rely on a concept referred to as membership, that is, every node would know how many nodes are in the current membership, the logical role of each node, and how to reach other nodes. Another important construct is voting via the sending of messages to each other. Typically, one of the members would assume a special role, which is referred to as the primary or the coordinator. The coordinator might fail or become compromised, in which case, a new coordinator would be elected through voting. As such, classical distributed consensus algorithms are expensive, highly complex, and not scalable due to the heavy use of multiple rounds of message exchanges among the members.

      In January 2009, the launch of the first practical cryptocurrency, Bitcoin [12], has completely changed the picture. The most essential prerequisite for a cryptocurrency is the assurance that it is virtually impossible for anyone to double-spend the money (i.e., cryptocurrency) one has. Bitcoin addressed this requirement by introducing an immutable distributed ledger in the form of a chain of blocks where each block aggregates hundreds or even thousands of transactions. This distributed ledger is often referred to as the blockchain. The immutability of the blockchain is achieved by several means: (1) cryptographic protection of the blockchain, such as digital signature, one-way hash function, and chaining of the blocks; (2) massive degree of replication of the blockchain across many nodes in the Bitcoin network; and (3) a novel probabilistic consensus algorithm that is completely different from classical consensus algorithms.

      This book contains two parts. The first part consists of the first 7 chapters and it covers the most essential techniques for building dependable distributed systems. The last 3 chapters form the second part, which covers the blockchain technology.

      Chapter 1 introduces the basic concepts and terminologies of dependable distributed computing, as well as the primary means to achieve dependability.

      Chapter 2 describes the checkpointing and logging mechanisms, which are widely used in practice to achieve some form of fault tolerance. Checkpointing and logging enable the recoverability of the system but do not prevent service disruption. These mechanisms are relatively simple to implement and understand, and they incur minimum runtime overhead while demanding very moderate extra resources (only stable storage). Furthermore, checkpointing and logging also serve as the foundation for more sophisticated dependability techniques.

      Chapter 3 covers research works on recovery-oriented computing, including fault detection and diagnosis, microreboot, and system-level undo and redo. Recovery-oriented computing aims to facilitate faster recovery after a system failure and thereby improving the availability of the system. Similar to checkpointing and logging, the mechanisms for recovery-oriented computing do not prevent service disruption, hence, it is a promising approach for many e-commerce application, but not suitable for applications that require high reliability.

      Chapter 4 outlines the replication technique for data and service fault tolerance. This is the fundamental technique to ensure high reliability. Through active replication (i.e., the use of multiple redundant copies of the application processes), the system would be able to mask the failure of a replica and continue to process clients’ requests (this is actually not entirely true, as we will show in later chapters, some failures may cause extended period of unavailability of the system). With replication comes the complexity of consistency issue. Ideally, the replicas should always maintain consistency with each other. However, doing so might not incur too much runtime overhead to be acceptable for some applications, or may cause extended period of system unavailability. Hence, strict consistency may have to be compromised either for better performance [15] or for better availability [19].

      Chapter 6 discusses the consensus problem and describes several Paxos algorithms, including the Classic Paxos, Dynamic Paxos, Cheap Paxos, and Fast Paxos. While it is easy for a group of processes to agree on the same value if all processes can communicate with each other promptly and if none of them fails, distributed consensus is an incredibly hard problem when processes might fail and there might be extended delay to send or receive a message. The classical Paxos algorithm solves the consensus problem

Скачать книгу