Tuesday, September 22, 2009

BA Ch8 : Guardian - A fault-tolerant System

Guardian was Tandem's operating system, dating from the 1970s. It was designed to be fault-tolerant and was sold as a very reliable system. The core idea was that every component should be duplicated, so that if one went down the other could take over. The T/16 machines had at least two processors, two buses, (often) two disks, etc.

Each process would be duplicated on two processors. On one processor it would be active, and on the other it would be passive, waiting for the first one to die or give up control. In the reliability world there are basically three ways to recover from a failure: checkpointing, job replication, or attempting to repair the state of the execution. For Guardian they chose application-controlled checkpointing. As such, each program was responsible for checkpointing its own state at various intervals, and if a processor went down, the copy on the other processor would resume from the last checkpoint.
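To make the checkpointing idea concrete for myself, here is a minimal sketch in Python. It is only a single-program simulation of the active/passive pair described above, not Guardian's actual interface: the names `Backup`, `CpuFailure`, `sum_items`, and `run_pair` are all made up for illustration, and the "failure" is just an exception raised at a chosen step.

```python
import copy


class CpuFailure(Exception):
    """Hypothetical stand-in for the active processor going down."""


class Backup:
    """Passive half of the pair: it only remembers the latest checkpoint."""

    def __init__(self):
        self.checkpoint = None

    def receive(self, state):
        # Store a private copy so later changes on the active side
        # cannot leak into the saved checkpoint.
        self.checkpoint = copy.deepcopy(state)


def sum_items(items, state, backup, checkpoint_every=2, fail_at=None):
    """Add items into state["total"], starting at index state["next"].

    The application decides when to checkpoint: here, after every
    `checkpoint_every` completed items.
    """
    while state["next"] < len(items):
        i = state["next"]
        if fail_at is not None and i == fail_at:
            raise CpuFailure(f"active copy died before item {i}")
        state["total"] += items[i]
        state["next"] = i + 1
        if state["next"] % checkpoint_every == 0:
            backup.receive(state)
    return state["total"]


def run_pair(items, fail_at=None):
    """Run the active copy; on failure, the backup takes over."""
    backup = Backup()
    state = {"next": 0, "total": 0}
    backup.receive(state)  # initial checkpoint before any work is done
    try:
        return sum_items(items, state, backup, fail_at=fail_at)
    except CpuFailure:
        # Takeover: resume from the last checkpoint. Work done after
        # that checkpoint is lost and gets redone by the backup.
        resumed = copy.deepcopy(backup.checkpoint)
        return sum_items(items, resumed, Backup())


if __name__ == "__main__":
    print(run_pair([1, 2, 3, 4, 5], fail_at=3))
```

With `fail_at=3`, the last checkpoint was taken after two items, so the backup redoes the third item before carrying on. The trade-off is visible even in this toy: checkpointing more often means less work to redo after a takeover, but more state sent to the backup during normal operation.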

In my opinion, fault-tolerant software is really hard to achieve, but it is highly desirable in critical applications such as aircraft or military systems. I don't have any experience in this area, but I liked reading this chapter, and it's interesting to see how they prioritized fault tolerance over other design goals. I think they did well, considering the system was developed in the '70s.
