Challenges in Scalable Fault Tolerance


P. Lincoln, “Challenges in scalable fault tolerance,” 2009 IEEE/ACM International Symposium on Nanoscale Architectures, 2009, pp. 13-14, doi: 10.1109/NANOARCH.2009.5226360.


The continued scaling of device dimensions is leading toward devices in which only a handful of dopant atoms or charges, by enhancing or depleting channel conduction, can make the difference between a one and a zero in the state of a represented bit. Very minor static imperfections in dopant distribution, dielectric properties, or device geometry, and dynamic conditions associated with heat, radiation, or aging, can therefore perturb a device out of specification and cause electrical errors. We might thus expect to soon see devices with millions of static defects and thousands of soft errors in very short time periods. Both static defects and dynamic faults at these scales present huge challenges. Classical methods for fault or defect tolerance at a higher level of architecture (e.g., N-modular redundancy) can be impractically expensive, and some approaches to diagnosis and reconfiguration require immense reliable memories or impractical test and reconfiguration times. Efficient and effective means are needed that exploit structure inherent in one layer of architecture to provide key properties that enable reliable execution at other levels.
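As a rough illustration of why N-modular redundancy becomes expensive, the sketch below models majority voting over n replicas. It is not taken from the paper: the function names `majority_vote` and `nmr_failure_probability` are hypothetical, and the model assumes independent replica failures with a common probability p and a perfectly reliable voter, which real nanoscale hardware may not satisfy.

```python
from math import comb


def majority_vote(bits):
    """Return the majority value of an odd-length list of 0/1 replica outputs."""
    if len(bits) % 2 == 0:
        raise ValueError("use an odd number of replicas so a majority always exists")
    return 1 if sum(bits) > len(bits) // 2 else 0


def nmr_failure_probability(n, p):
    """Probability that a majority of n replicas are wrong.

    Assumptions (illustrative only): each replica fails independently with
    probability p, and the voter itself never fails.
    """
    k_min = n // 2 + 1  # smallest number of faulty replicas that outvotes the rest
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    # Compare the modeled failure probability as the hardware cost multiplies.
    for n in (1, 3, 5, 7):
        print(n, nmr_failure_probability(n, 0.1))
```

Under these assumptions, triplication at p = 0.1 lowers the modeled failure probability from 0.1 to 0.028 while tripling the hardware, and at p = 0.5 adding replicas gives no improvement at all. This is one way to see why voting alone scales poorly when per-device fault rates are high.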
