Software implemented fault tolerance

Software implemented fault tolerance – 1973

NASA’s mission in the early 1970s included promoting technologies that would make both aircraft and spacecraft more reliable. It was clear by then that computers would find their way into aircraft; the question was whether they could be designed and built to be continuously available. In 1973, NASA asked SRI to apply everything it knew about fault-tolerant computing to build an experimental computer that could control the safety-critical functions of airplanes. For that role, an ultra-dependable controller was vital.

In response, SRI designed a formal specification for an ultra-high-reliability commercial flight control system that required continuous computer control in flight. SRI’s state machine approach, Software Implemented Fault Tolerance (SIFT), met the most stringent reliability requirements of any computer at that time, including uncovering “Byzantine” faults (those in which a faulty component shows different symptoms to different observers). The system ran for years at NASA’s Langley Research Center.

The SIFT system combined richly redundant hardware with highly accurate, resilient software. This integrated approach was a beachhead in an emerging field that came to be known as distributed computing.
