
Fault Tolerance in Distributed Systems

By Pankaj Jalote
Published Apr 6, 1994 by Pearson.

Book

Sorry, this book is no longer in print.



About

Features

Provides a comprehensive treatment of fault tolerance in software and distributed systems.

Treats fault-tolerant distributed systems as consisting of levels of abstraction, providing different fault-tolerant services.

Considers the lowest levels, which support the abstractions of Byzantine agreement, fail-stop processors, stable storage, reliable communication, synchronized clocks, and failure detection.

Discusses the higher levels, which support the abstractions of reliable and atomic broadcast, consistent state recovery, atomic actions, data resiliency, process resiliency, and fault-tolerant software.

For each abstraction, surveys the important methods for supporting it.

Emphasizes techniques and algorithms rather than formalisms.



Description

Copyright 1994
Dimensions: 7" x 9-1/4"
Pages: 448
Edition: 1st

Book
ISBN-10: 0-13-301367-7
ISBN-13: 978-0-13-301367-2

Fault tolerance is an approach by which the reliability of a computer system can be increased beyond what can be achieved by traditional methods. While hardware-supported fault tolerance has been well documented, the newer, software-supported fault tolerance techniques have remained scattered throughout the literature. Comprehensive and self-contained, this book organizes that body of knowledge with a focus on fault tolerance in distributed systems. (The uniprocess case is treated as a special case of distributed systems.)

Key topics: Treats fault-tolerant distributed systems as consisting of levels of abstraction, providing different fault-tolerant services.

Market: For researchers and practitioners working in the area of fault tolerance.



Sample Content

1. Introduction.

Basic Concepts and Definitions. Phases in Fault Tolerance. Overview of Hardware Fault Tolerance. Reliability and Availability. Summary.

2. Distributed Systems.

System Model. Interprocess Communication. Ordering of Events and Logical Clocks. Execution Model and System State. Summary.

3. Basic Building Blocks.

Byzantine Agreement. Synchronized Clocks. Stable Storage. Fail Stop Processors. Failure Detection and Fault Diagnosis. Reliable Message Delivery. Summary.

4. Reliable, Atomic, and Causal Broadcast.

Reliable Broadcast. Atomic Broadcast. Causal Broadcast.

5. Recovering a Consistent State.

Asynchronous Checkpointing and Rollback. Distributed Checkpointing. Summary.

6. Atomic Actions.

Atomic Actions and Serializability. Atomic Actions in a Centralized System. Commit Protocols. Atomic Actions on Decentralized Data. Summary.

7. Data Replication and Resiliency.

Optimistic Approaches. Primary Site Approach. Resiliency with Active Replicas. Voting. Degree of Replication. Summary.

8. Process Resiliency.

Resilient Remote Procedure Call. Resiliency with Asynchronous Communication. Resiliency with Synchronous Message Passing. Total Failure and Last Process to Fail. Summary.

9. Software Design Faults.

Approaches for Uniprocess Software. Backward Recovery in Concurrent Systems. Forward Recovery in Concurrent Systems. Summary.

Bibliography.


