Nian-Feng Tzeng
Center for Advanced Computer Studies
University of Louisiana at Lafayette
A large parallel system demands high reliability, but the probability of faults occurring in a system grows as its size increases. Reconfiguring a faulty system to get around the faults is a viable approach to reliability enhancement, since the system may continue operation after reconfiguration. Reconfiguration in faulty hypercubes and meshes, as well as recovery point selection, have been pursued. One key issue in parallel systems is the fault-tolerant organization and architecture design of the systems. Various fault-tolerant architectures for parallel machine construction have been considered.
Reconfiguring a faulty hypercube with an effort to keep as many fault-free nodes as possible has been treated for the first time by us. This often leads to an incomplete hypercube, whose size can be arbitrary and is often much larger than a reconfigured complete subcube. Different approaches to this maximum reconfiguration have been proposed. One approach is to identify maximum collections of complete subcubes of distinct sizes such that elements in each collection together form an incomplete cube. Another is able to recognize all maximum incomplete subcubes existing in a faulty hypercube, based on boolean expression manipulation. It is confirmed by fault simulation that thess proposed approaches indeed give rise to significantly larger reconfigured systems. Maximum reconfiguration of faulty meshes by retaining as many healthy nodes as possible has been considered as well. Such a reconfigured subsystem is convex so that a simple, deadlock-free routing algorithm exists in it. This reconfiguration is shown to require a channel width of two only, keeping hardware overhead low.
Rollback and recovery strategies have been widely used in several software
systems to improve reliability through fault tolerance.
Efficient solutions to the problem of selecting recovery points optimally
have been developed, with time complexity of O(N*N),
where N is the number of tasks
(while previously reported procedures have exponential time requirements).
These solutions are aimed at computation models in which task precedence
has a tree structure and a task may fail due to the presence of faults.
The butterfly distributed-memory system has a regular and simple interconnection
pattern, making it suitable for VLSI/WSI implementation.
Effective fault-tolerant techniques for butterfly distributed-memory systems
have been introduced to ensure their rigid full butterfly structures
in the presence of failures.
The reliability and layout of these proposed designs are evaluated analytically,
and they are shown to be superior to those of prior fault-tolerant approaches.
The Gamma interconnection network (GIN) is composed of 3 x 3 switches. We considered modifications to the GIN by altering the interconnection patterns between stages so as to create disjoint paths between every input-output pair, arriving at high terminal reliability. A modified GIN maintains the same routing complexity but enjoys a lower layout area and pin count, leading to potential cost reduction. It is shown that a modified GIN presents dramatic reliability improvement and possesses virtually identical communication performance as the original GIN.
Send e-mail to:
tzeng@cacs.louisiana.edu