article

Free Access

Implementing fault-tolerant services using the state machine approach: a tutorial

Author:
Fred B. Schneider

Cornell Univ., Ithaca, NY

Cornell Univ., Ithaca, NY
View Profile

Authors Info & Claims

ACM Computing Surveys Volume 22 Issue 4pp 299–319https://doi.org/10.1145/98163.98167

Published:01 December 1990Publication History

ACM Computing Surveys

Abstract

The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models—Byzantine and fail stop. Systems reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

References

AIZIKOWITZ, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040. Google Scholar
BERNSTEIN, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.Google Scholar
BIRMAN, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the lOth A CM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), A CM, pp. 79-86. Google Scholar
BIRMAN, K. P., AND JOSEPH, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76. Google Scholar
CRISTIAN, F., AGHILI, H., STRONG, H. R., AND DOLEV, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.Google Scholar
DIJKSTRA, E. W. 1974. Self stabilization in spite of distributed control. Commun. A CM I7, 11 (Nov.), 643-644. Google Scholar
FISCHER, M., LYNCH, N., AND PATERSON, M. 1985. Impossibility of distributed consensus with one faulty process, d. ACM 32, 2 (Apr. 1986), 374-382. Google Scholar
GARCIA-MOLINA, H., PITTELLI, F., AND DAVIDSON, S. 1986. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47. Google Scholar
GOPAL, A., STRONG, R., TOUEG, S., AND CRISTIAN, F., 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990). Google Scholar
GRAY, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science. Vol. 60. Springer- Verlag, New York, pp. 393-481. Google Scholar
HALPERN, J., SIMONS, B., STRONG, R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102. Google Scholar
HUTCHINSON, N., AND PETERSON, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88--Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75. Google Scholar
LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565. Google Scholar
LAMPORT, L. 1979b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.Google Scholar
LAMPORT, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280. Google Scholar
LAMPORT, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.Google Scholar
LAMPORT, L., AND MELLIAR-SMITH, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), 68-74. Google Scholar
LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401. Google Scholar
LISKOV, B., AND LADIN, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th A CM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39. Google Scholar
MANCINI, L., AND PAPPALARDO, G. 1988. Towards a theory of replicated processing. Formal Techniques in Real-Time and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192. Google Scholar
MARZULLO, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.Google Scholar
MARZULLO, K., AND SCHMUCK, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, CA, June), IEEE Computer Society, pp. 447-455.Google Scholar
PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHT- ING, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246. Google Scholar
PITTELLI, F. M., AND GARCIA-MOLINA, S. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60. Google Scholar
SCHLICHTING, R. D., AND SCHNEIDER, F. B. 1983. Fail-Stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS I, 3 (Aug.), 222-238. Google Scholar
SCHNEIDER, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.Google Scholar
SCHNEIDER, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195. Google Scholar
SCHNEIDER, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154. Google Scholar
SCHNEIDER, F. B. 1985. Paradigms for distributed programs. Distributed Systems. Methods and Tools for Specification. Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430. Google Scholar
SCHNEIDER, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.Google Scholar
SCHNEIDER, F. B., GRIES, D., AND SCHLICHTING, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15. Google Scholar
SIEWIOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.Google Scholar
SKEEN, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.Google Scholar
STRONG, H. R., AND DOLEV, D. 1983. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers. (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.Google Scholar
WENSLEY, J., WENSKY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P. M., SHOSTAK, R. E., and WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.Google Scholar

Recommendations

Implementing fault-tolerant services using state machines: beyond replication
DISC'10: Proceedings of the 24th international conference on Distributed computing

This paper describes a method to implement fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines uses a combination of coding theory and replication to ensure efficiency as well as ...
Read More
Byzantine Fault Tolerant State Machine Replication in Any Programming Language
PODC '19: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing

State machine replication is a fundamental primitive in fault tolerant distributed computing, but few production tools exist to support the replication of arbitrary state machines. The tools that do exist, like Apache Zookeeper, CoreOS's etcd, and ...
Read More
Fault tolerant control using a fuzzy predictive approach

This paper proposes the application of fault-tolerant control (FTC) using fuzzy predictive control. The FTC approach is based on two steps, fault detection and isolation (FDI) and fault accommodation. The fault detection is performed by a model-based ...
Read More

Reviews

Reviewer: Valentin Cristea

Distributed software structured in terms of clients and servers is considered. Replicas of a single server are executed on separate processors of a distributed system, and protocols coordinate client interactions with these replicas. The paper describes how a system can be viewed in terms of a state machine, clients, and output devices. In this context, Schneider considers two representative classes of faulty behavior: Byzantine failures and fail-stop failures. The core sections of the paper present algorithms that cope with these failures. An important class of optimizations and the dynamic reconfiguration are also tackled. A separate section discusses related work . The paper is intended for people working in the domain of distributed systems and real-time systems. It systematically presents protocols that involve replication of components using the state machine approach, although few of these protocols were obtained in this manner. The paper was received in November 1987 and the final revision was accepted in January 1990. Unfortunately, this long delay is easily perceived by the reader.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Computing Surveys Volume 22, Issue 4
Dec. 1990
109 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/98163
Issue’s Table of Contents

Copyright © 1990 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 1990
Published in csur Volume 22, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1,622
  Total Citations
  View Citations
- 11,975
  Total Downloads
- Downloads (Last 12 months)1,227
- Downloads (Last 6 weeks)169
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Implementing fault-tolerant services using the state machine approach: a tutorial

ACM Computing Surveys

Abstract

References

Cited By

Recommendations

Implementing fault-tolerant services using state machines: beyond replication

Byzantine Fault Tolerant State Machine Replication in Any Programming Language

Fault tolerant control using a fuzzy predictive approach

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Implementing fault-tolerant services using the state machine approach: a tutorial

ACM Computing Surveys

Abstract

References

Cited By

Recommendations

Implementing fault-tolerant services using state machines: beyond replication

Byzantine Fault Tolerant State Machine Replication in Any Programming Language

Fault tolerant control using a fuzzy predictive approach

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media