Fault-tolerant computing

Model
Digital Document
Publisher
Florida Atlantic University
Description
Interprocessor communication plays an important role in the performance of multicomputer systems, such as hypercube multicomputers. In this thesis, we consider the multicast problem for a hypercube system in the presence of faulty components. Two types of algorithms are proposed. Type 1 algorithms, which are developed based on local network information, can tolerate both node failures and link failures. Type 2 algorithms, which are developed based on limited global network information, ensure that each destination receives message through the shortest path. Simulation results show that type 2 algorithms achieve very good results on both time and traffic steps, two main criteria in measuring the performance of interprocessor communication.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Research is under way to fabricate large-area thin-film transistor arrays produced on a thin polyimide substrate. The polyimide substrate is available in long thirty centimeter wide rolls of tape, and lithography hardware is being developed to expose hundreds of meters of this tape with electrically addressable light modulators which can resolve 2 $\mu$m features. A fault-tolerant memory architecture is proposed that is capable of storing one hour of D-1 component digital video (almost 10^12 bits) in real-time, on eight two-hundred meter long tapes. Appropriate error correcting codes and error concealment are proposed to compensate for drop-outs resulting from manufacturing defects so as to yield video images with error rates low enough to survive several generations of copies.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In the last few years, it has become profound to achieve higher performance of computers by solely upgrading logic technology. This required a move to a parallel processing system or a multiprocessor system in order to build faster computer systems. The importance of multiprocessor systems is increasing due to many reasons, one of which is reliability. In a multiprocessor system, a number of tasks may concurrently exist. To operate the system efficiently, one must carefully schedule the tasks. This thesis proposes a set of algorithms to schedule these tasks exploiting the inherent redundancy of processors in a multiprocessor system. Also discussed are some reliability issues and application to different networks with some examples.
Model
Digital Document
Publisher
Florida Atlantic University
Description
This thesis considers the design of ultrareliable multicomputers for control applications. The fault tolerance problem is divided into three subproblems: software, processing node, and communication fault tolerance. Design is performed using layers of abstraction, with fault tolerance implemented by dedicated layers. For software fault tolerance, new constructs for concurrent n-version programming are introduced. For processing node fault tolerance, the distributed fault tolerance (DFT) concept of Chen and Chen is extended to allow for arbitrary failures. Communication fault tolerance is achieved with multicasting on a fault-tolerant graph (FG) network. Reliability models are developed for each of the layers, and a performance model is developed for the communication layer. An example flight control system is compared to currently existing architectures.
Model
Digital Document
Publisher
Florida Atlantic University
Description
This thesis analyzes how the architecture of the Intel 80286 microprocessor may be used to implement fault tolerant software structures. The Multi-Micro Programming Line, MML, and the Intel 80286 kernel, K286, are used as tools to illustrate the implementation of software fault tolerance in an 80286 environment. The recovery metaprogram approach is supported by software layers which use the privilege levels in the 80286. Implementation of recovery blocks, N-version programming, exceptions, and conversations using a recovery metaprogram are covered. While the details are specific to the 80286 architecture, the general results apply to any architecture with three or more rings of privilege and a segmented virtual memory using descriptors.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Since computer systems are applied to many critical areas, fault-tolerance is a necessary requirement for their operation. Many techniques for dealing with hardware faults have been developed. Fault-tolerant software has had a much slower progress. Concurrent software adds an additional dimension to the problem of fault-tolerant software. This thesis uses an intermediate structure between two major schemes, conversation and programmer transparent coordination. The scheme proposed here accelerates conversations by using a special process or superprocess, which is executed on the same system level as the run-time system, and that by having access to the history of all interprocess communications can allow a process that passes its acceptance test to proceed conditionally. If the process does not pass its acceptance test all processes recover immediately without waiting to get to their acceptance tests. This work presents a set of algorithms to implement these ideas.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The hypercube has become one of the most popular architectures for a wide variety of parallel processing applications and has been used in several commercial and research multiprocessors. Its topological and reliability properties have been studied extensively and several techniques have been proposed for enhancing its reliability. We first present a survey of the techniques that have been used for analyzing and enhancing the reliability of the hypercube and propose a classification framework in which the surveyed reliability analysis techniques can be critically evaluated. Invariably, the techniques for enhancing the fault tolerance of the hypercube require modification of the processing nodes to include redundant elements, or alternatively, degrade the hypercube to a lower dimension cube when faults occur. We propose a technique using unmodified processing elements that takes advantage of the dataflow patterns of a specific class of parallel algorithms belonging to the divide-and-conquer paradigm. It is shown that by incorporating shuffles and exchanges, the execution of the divide-and-conquer class of algorithms on the hypercube can be made fault- tolerant. We develop this technique into a fault-tolerant computing scheme for execution of divide-and-conquer class of parallel algorithms, which we call Shuffle Exchange Hypercube (SEH). We propose a new recursively defined interconnection architecture for parallel computation called Cube-Connected Cubes (CCCubes). It is shown that the CCCubes architecture can emulate both the hypercube and the Cube-Connected Cycles (CCC) architectures. The CCCubes architecture is recursively extended into the kth order Generalized Cube-Connected Cubes (GCCCubes) architecture. We propose several classes of CCCubes and GCCCubes architectures and study their topological and reliability properties. A comparison of the reliability and topological properties of the proposed architectures with those of the hypercube is provided and it is shown that the CCCubes and GCCCubes architectures present practical alternatives to the hypercube. Finally, some areas worthy of further pursuit are presented, which include the problem of determining a switch route schedule for SEH, extension of shuffles and exchanges to CCCubes and GCCCubes, and the determination of a VLSI layout for the proposed CCCubes and GCCCubes architectures.
Model
Digital Document
Publisher
Florida Atlantic University
Description
We propose the balanced hypercube (BH), which is a variant of the standard hypercube (Q), as a multicomputer topological structure. An n-dimensional balanced hypercube BHn has the same desirable topological properties of the 2n-dimensional standard hypercube Q2n such as size (2^2n nodes and n2^2n edges), regularity and symmetry, connectivity (2n node-disjoint pathes between any pair of nodes), and diameter (2n when n = 1 or n is even). Moreover, BHn has smaller diameter (2n-1) than Qn's (2n) when n is odd other than 1. In addition, BHn is load balanced, i.e., for every node v of BHn, there exists another node v', called v's matching node, such that v and v' share the same adjacent node set. Therefore, BHn has a desirable fault tolerance feature: when a node v fails, we can simply shift the job execution on v to its matching node v' and the communication pattern between jobs remains the same. In this dissertation, we study the topological properties of BHn and explore its fault tolerance feature. Other design issues are considered, such as communication primitives, capability of simulating other multicomputer systems through graph embedding, resource placement. and VLSI/WSI layout. Finally, the use of BHn is illustrated by an application.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The growing demand for high availability of computer systems has led to a wide application range of fault-tolerant systems. In some real-time applications ultrareliable computer systems are required. Such computer systems should be capable of tolerating failures of not only their hardware components but also of their software components. This dissertation discusses three aspects of designing an ultrareliable system: (a) a hierarchical ultrareliable system structure; (b) a set of unified methods to tolerate both software and hardware faults in combination; and (c) formal specifications in the system structure. The proposed hierarchical structure has four layers: Application, Software Fault Tolerance, Combined Fault Tolerance and Configuration. The Application Layer defines the structure of the application software in terms of the modular structure using a module interconnection language. The failure semantics of the service provided by the system is also defined at this layer. At the Software Fault Tolerance Layer each module can use software fault tolerance methods. The implementation of the software and hardware fault tolerance is achieved at the Combined Fault Tolerance Layer which utilizes the combined software/hardware fault tolerance methods. The Configuration Layer performs actual software and hardware resource management for the requests of fault identification and recovery from the Combined Fault Tolerance Layer. A combined software and hardware fault model is used as the system fault model. This model uses the concepts of fault pattern and fault set to abstract the various occurrences of software and hardware faults. We also discuss extended comparison models that consider faulty software as well. The combined software/hardware fault tolerance methods are based on recovery blocks, N-version programming, extended comparison methods and both forward and backward recovery methods. Formal specifications and verifications are used in the system design process and the system structure to show that the design and implementation of a fault-tolerant system satisfy the functional and non-functional requirements. Brief discussions and examples of using formal specifications in the hierarchical structure are given.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Distributed shared memory (DSM) implements a shared-memory programming interface on message-passing hardware. The shared-memory programming paradigm offers several advantages over the message-passing paradigm. DSM is recognized as an important technology for massively parallel computing. However, as the number of processors in a system increases, the probability of a failure increases. To be widely useful, the DSM must be able to tolerate failures. This dissertation presents a method of implementing fault-tolerant DSM (FTDSM) that is based on the idea of a snooper. The snooper monitors DSM protocol messages and keeps a backup of the current state of the DSM. The snooper can respond on behalf of failed processors. The snooper-based FTDSM is an improvement over existing FTDSMs because it is based on the efficient dynamic distributed manager DSM algorithm, does not require the repair of a failed processor in access the DSM, and does not query all nodes to rebuild the state of the DSM. Three snooper-based FTDSM systems are developed. The single-snooper (SS) FTDSM has one snooper and is restricted to a broadcast network. Additional snoopers are added in the multiple-snooper (MS) FTDSM to improve performance. Two-phase commit (2PC) protocols are developed to coordinate the activities of the snoopers, and a special data structure is used to store causality information to reduce the amount of snooper activity. Snooping is integrated with each processor in the integrated snooper (IS) FTDSM. The IS FTDSM is scalable because it is not restricted to a broadcast network. The concept of dynamic snooping is introduced for the IS FTDSM and several snooper migration algorithms are studied. Several recovery algorithms are developed to allow failed processors to rejoin the system. The properties of data structures used to locate owners and snoopers are studied and used to prove that the system can tolerate any single fault. A flexible method of integrating application-level recovery with the FTDSM is presented, and a reliability analysis is conducted using a Markov-chain modeling tool to show that the snooper-based FTDSM is a cost effective way to improve the reliability of DSM.