skip to main content
article
Free Access

Implementing fault-tolerant services using the state machine approach: a tutorial

Published:01 December 1990Publication History
Skip Abstract Section

Abstract

The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models—Byzantine and fail stop. Systems reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

References

  1. AIZIKOWITZ, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040. Google ScholarGoogle Scholar
  2. BERNSTEIN, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.Google ScholarGoogle Scholar
  3. BIRMAN, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the lOth A CM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), A CM, pp. 79-86. Google ScholarGoogle Scholar
  4. BIRMAN, K. P., AND JOSEPH, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76. Google ScholarGoogle Scholar
  5. CRISTIAN, F., AGHILI, H., STRONG, H. R., AND DOLEV, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.Google ScholarGoogle Scholar
  6. DIJKSTRA, E. W. 1974. Self stabilization in spite of distributed control. Commun. A CM I7, 11 (Nov.), 643-644. Google ScholarGoogle Scholar
  7. FISCHER, M., LYNCH, N., AND PATERSON, M. 1985. Impossibility of distributed consensus with one faulty process, d. ACM 32, 2 (Apr. 1986), 374-382. Google ScholarGoogle Scholar
  8. GARCIA-MOLINA, H., PITTELLI, F., AND DAVIDSON, S. 1986. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47. Google ScholarGoogle Scholar
  9. GOPAL, A., STRONG, R., TOUEG, S., AND CRISTIAN, F., 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990). Google ScholarGoogle Scholar
  10. GRAY, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science. Vol. 60. Springer- Verlag, New York, pp. 393-481. Google ScholarGoogle Scholar
  11. HALPERN, J., SIMONS, B., STRONG, R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102. Google ScholarGoogle Scholar
  12. HUTCHINSON, N., AND PETERSON, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88--Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75. Google ScholarGoogle Scholar
  13. LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565. Google ScholarGoogle Scholar
  14. LAMPORT, L. 1979b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.Google ScholarGoogle Scholar
  15. LAMPORT, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280. Google ScholarGoogle Scholar
  16. LAMPORT, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.Google ScholarGoogle Scholar
  17. LAMPORT, L., AND MELLIAR-SMITH, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), 68-74. Google ScholarGoogle Scholar
  18. LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401. Google ScholarGoogle Scholar
  19. LISKOV, B., AND LADIN, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th A CM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39. Google ScholarGoogle Scholar
  20. MANCINI, L., AND PAPPALARDO, G. 1988. Towards a theory of replicated processing. Formal Techniques in Real-Time and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192. Google ScholarGoogle Scholar
  21. MARZULLO, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.Google ScholarGoogle Scholar
  22. MARZULLO, K., AND SCHMUCK, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, CA, June), IEEE Computer Society, pp. 447-455.Google ScholarGoogle Scholar
  23. PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHT- ING, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246. Google ScholarGoogle Scholar
  24. PITTELLI, F. M., AND GARCIA-MOLINA, S. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60. Google ScholarGoogle Scholar
  25. SCHLICHTING, R. D., AND SCHNEIDER, F. B. 1983. Fail-Stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS I, 3 (Aug.), 222-238. Google ScholarGoogle Scholar
  26. SCHNEIDER, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.Google ScholarGoogle Scholar
  27. SCHNEIDER, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195. Google ScholarGoogle Scholar
  28. SCHNEIDER, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154. Google ScholarGoogle Scholar
  29. SCHNEIDER, F. B. 1985. Paradigms for distributed programs. Distributed Systems. Methods and Tools for Specification. Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430. Google ScholarGoogle Scholar
  30. SCHNEIDER, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.Google ScholarGoogle Scholar
  31. SCHNEIDER, F. B., GRIES, D., AND SCHLICHTING, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15. Google ScholarGoogle Scholar
  32. SIEWIOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.Google ScholarGoogle Scholar
  33. SKEEN, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.Google ScholarGoogle Scholar
  34. STRONG, H. R., AND DOLEV, D. 1983. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers. (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.Google ScholarGoogle Scholar
  35. WENSLEY, J., WENSKY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P. M., SHOSTAK, R. E., and WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.Google ScholarGoogle Scholar

Recommendations

Reviews

Valentin Cristea

Distributed software structured in terms of clients and servers is considered. Replicas of a single server are executed on separate processors of a distributed system, and protocols coordinate client interactions with these replicas. The paper describes how a system can be viewed in terms of a state machine, clients, and output devices. In this context, Schneider considers two representative classes of faulty behavior: Byzantine failures and fail-stop failures. The core sections of the paper present algorithms that cope with these failures. An important class of optimizations and the dynamic reconfiguration are also tackled. A separate section discusses related work . The paper is intended for people working in the domain of distributed systems and real-time systems. It systematically presents protocols that involve replication of components using the state machine approach, although few of these protocols were obtained in this manner. The paper was received in November 1987 and the final revision was accepted in January 1990. Unfortunately, this long delay is easily perceived by the reader.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM Computing Surveys
    ACM Computing Surveys  Volume 22, Issue 4
    Dec. 1990
    109 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/98163
    Issue’s Table of Contents

    Copyright © 1990 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 December 1990
    Published in csur Volume 22, Issue 4

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader