Fault-tolerant computing

Model
Digital Document
Publisher
Florida Atlantic University
Description
Multistage interconnection networks (MINs) have become an important class of interconnection networks used for communication between processors and memory modules in large-scale multiprocessor systems. Unfortunately, unique-path MINs lack fault tolerance. In this dissertation, a novel scheme for constructing fault-tolerant MINs is presented. We first partition the given MINs into even-sized partitions and show some fault-tolerant properties of the partitioned MINs. Using three stages of multiplexers/demultiplexers, an augmenting scheme that takes advantage of locality in program execution is then proposed to further improve the fault tolerance and performance of the partitioned MINs. The topological characteristics of augmented partitioned multistage interconnection networks (APMINs) are analyzed. Based on a switch fault model, simulations have been carried out to evaluate the full access and dynamic full access capabilities of APMINs. The results show that the proposed scheme significantly improves the fault-tolerant capability of MINs. The cost effectiveness of this new scheme in terms of cost, full access, dynamic full access, locality, and average path length has also been evaluated. It has been shown that this new scheme is more cost effective for high switch failure rates and/or large networks. Analytical modeling techniques have been developed to evaluate the performance of the AP-Omega network and AP-Omega network-based multiprocessor systems. The performance of Omega, modified Omega, and AP-Omega networks in terms of processor utilization and processor waiting time has been compared, and the results show that the new scheme indeed improves performance at both the network level and the system level.
Finally, based on the reliability of serial/parallel network components, models for evaluating the terminal reliability and the network reliability of the AP-Omega network using upper and lower bound measures have also been proposed, and the results show that applying locality improves the reliability of APMINs.
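To illustrate why a unique-path MIN lacks fault tolerance, the following minimal sketch models a plain Omega network (not the AP-Omega construction from the dissertation) with standard destination-tag routing. The function names `omega_route` and `blocked_pairs` are hypothetical, and the switch fault is modeled simply as a switch that passes no traffic.

```python
def omega_route(src: int, dst: int, n: int) -> list[int]:
    """Return the line position of a message after each of the n stages
    of an Omega network with 2**n inputs (destination-tag routing)."""
    mask = (1 << n) - 1
    pos = src
    path = []
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit line number left by one bit.
        pos = ((pos << 1) | (pos >> (n - 1))) & mask
        # 2x2 switch: the destination bit (most significant first) selects
        # the upper (0) or lower (1) output of the switch.
        bit = (dst >> (n - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def blocked_pairs(n: int, faulty_stage: int, faulty_switch: int) -> list[tuple[int, int]]:
    """List the (src, dst) pairs whose only path crosses the faulty switch."""
    size = 1 << n
    blocked = []
    for src in range(size):
        for dst in range(size):
            path = omega_route(src, dst, n)
            # Lines 2*i and 2*i + 1 belong to switch i of a stage.
            if path[faulty_stage] >> 1 == faulty_switch:
                blocked.append((src, dst))
    return blocked


if __name__ == "__main__":
    print(omega_route(5, 2, 3))
    print(len(blocked_pairs(3, 1, 0)), "pairs lose their only path")
```

Because each source-destination pair has exactly one path, every pair listed by `blocked_pairs` becomes unreachable after a single switch failure; redundant-path schemes aim to remove this weakness.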
Model
Digital Document
Publisher
Florida Atlantic University
Description
A method for on-line monitoring of AUV on-board systems is described. The algorithm detects deviations from normal operating conditions using a damage level calculated from recursive least squares system identification of the system under consideration, followed by a gradient detection technique that extracts significant changes in the identified model parameters. System damage types are characterized together with likely system responses to such failures. The algorithm is tested extensively on several simulated AUV on-board systems undergoing different types of failures while carrying out different mission scenarios.
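The identification step can be sketched as follows. This is a minimal illustration, not the thesis algorithm: it applies a textbook recursive least squares update with a forgetting factor to a hypothetical first-order system, and the simple windowed parameter-change check stands in for the gradient detection technique, whose details the abstract does not give. All system constants, thresholds, and function names are assumptions.

```python
def rls_update(theta, P, phi, y, lam):
    """One recursive least squares step for y ~ phi . theta with forgetting factor lam."""
    n = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    err = y - sum(phi[i] * theta[i] for i in range(n))
    theta = [theta[i] + gain[i] * err for i in range(n)]
    # P is symmetric, so the row vector phi^T P has the same entries as P_phi.
    P = [[(P[i][j] - gain[i] * P_phi[j]) / lam for j in range(n)] for i in range(n)]
    return theta, P


def detect_changes(history, window, threshold, warmup):
    """Return indices where any parameter moved more than threshold over window samples.
    The first warmup estimates are ignored so initial convergence is not flagged."""
    flags = []
    for k in range(warmup + window, len(history)):
        if any(abs(a - b) > threshold for a, b in zip(history[k], history[k - window])):
            flags.append(k)
    return flags


def simulate(steps=400, fault_at=200, lam=0.9):
    """Hypothetical system y[k] = a*y[k-1] + b*u[k-1]; the gain b degrades at fault_at."""
    a, b_ok, b_bad = 0.8, 0.5, 0.2
    theta = [0.0, 0.0]
    P = [[1000.0, 0.0], [0.0, 1000.0]]
    y_prev, history = 0.0, []
    for k in range(1, steps):
        u_prev = 1.0 if ((k - 1) // 5) % 2 == 0 else -1.0  # square-wave excitation
        b = b_ok if k < fault_at else b_bad
        y = a * y_prev + b * u_prev
        theta, P = rls_update(theta, P, [y_prev, u_prev], y, lam)
        history.append(theta)  # history[i] is the estimate after sample k = i + 1
        y_prev = y
    flags = detect_changes(history, window=10, threshold=0.05, warmup=50)
    return theta, flags


if __name__ == "__main__":
    theta, flags = simulate()
    print("final estimate:", theta)
    print("first flagged sample index:", flags[0] if flags else None)
```

The forgetting factor trades tracking speed against noise sensitivity: a smaller value adapts faster after a failure but makes the estimates less smooth when measurements are noisy.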
Model
Digital Document
Publisher
Florida Atlantic University
Description
We have developed reliability models for a variety of fault-tolerant software constructs, including those based on two well-known methodologies, recovery block and N-version programming, and their variations. We also developed models for the conversation scheme, which provides fault tolerance for concurrent software, and for a newly proposed system architecture, the recovery metaprogram, which attempts to unify most of the existing fault-tolerant strategies. Each model is evaluated using either GSPN, a software package based on Generalized Stochastic Petri Nets, or Sharpe, an evaluation tool for Markov models. The numerical results are then analyzed and compared. Major results derived from this process include the identification of critical parameters for each model, comparisons of relative performance among different software constructs, the justification of a preliminary approach to the modeling of complex conversations, and the justification of the recovery metaprogram with regard to improved reliability.
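For intuition about how the two base constructs differ, here is a minimal combinatorial sketch. It is far simpler than the Petri net and Markov models evaluated in the dissertation and assumes statistically independent versions, a perfect voter, and an acceptance test that never rejects a correct result; the function names and the `coverage` parameter are hypothetical.

```python
from math import comb


def nvp_reliability(p: float, n: int = 3) -> float:
    """N-version programming: probability that a majority of n independent
    versions, each correct with probability p, agree on the correct result."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def recovery_block_reliability(p_alternates: list[float], coverage: float = 1.0) -> float:
    """Recovery block: alternates are tried in order. coverage is the assumed
    probability that the acceptance test rejects a wrong result; an accepted
    wrong result counts as a failure."""
    reliability = 0.0
    reach = 1.0  # probability that execution reaches the current alternate
    for p in p_alternates:
        reliability += reach * p
        reach *= (1 - p) * coverage
    return reliability


if __name__ == "__main__":
    print(nvp_reliability(0.9, 3))
    print(recovery_block_reliability([0.9, 0.9]))
    print(recovery_block_reliability([0.9, 0.9], coverage=0.5))
```

Under these assumptions, the acceptance-test coverage plays the role of a critical parameter for the recovery block, while correlated failures among versions, which this sketch ignores, would be the corresponding concern for N-version programming.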
Model
Digital Document
Publisher
Florida Atlantic University
Description
This dissertation describes the effect of the collection and distribution of fault information on routing capacity in grid-connected networks with faults occurring during the routing process. Grid-connected networks, such as hypercubes, 2-D meshes, and 3-D meshes, are among the simplest and least expensive structures for building a system with hundreds or even thousands of processors. In such a system, efficient communication among the processors is critical to performance. Hence, the routing of messages is an important issue that needs to be addressed. As the number of nodes in a network increases, the chance of failure also increases. The complex nature of networks also makes them vulnerable to disturbances. Therefore, the ability to route messages efficiently in the presence of faulty components, especially those that might occur during the routing process, is becoming increasingly important. A central issue in designing a fault-tolerant routing algorithm is the way fault information is collected and used. The safety level model is a special coded fault information model for hypercubes that is more cost effective and more efficient than other information models. In this model, each node is associated with an integer, called its safety level, which is an approximate measure of the number and distribution of faulty nodes in the neighborhood. The safety level of each node in an n-dimensional hypercube can be easily calculated through (n - 1) rounds of information exchange among neighboring nodes. A k-safe node indicates the existence of at least one Hamming distance path (also called an optimal path or minimal path) from this node to any node at Hamming distance k. We focus on routing capacity using safety levels in a dynamic system. In this case, the update of safety levels and the routing process proceed hand in hand. During the converging period, the routing process may experience extra hops because it relies on unstable (inconsistent) information.
Under the assumption that the total number of faults is less than n, we provide an upper bound on the number of extra hops and show its accuracy and effectiveness. After that, we extend the results to meshes. Our simulation results show the effectiveness of our information model and the scalability of our fault-information-based routing in grid-connected networks with dynamic faults. Because our information is easy to update and maintain and optimality is still preserved, the model is more cost effective than other information models.
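The safety level computation for a static fault set can be sketched as follows. The abstract does not state the update rule, so this sketch assumes the commonly cited definition: a faulty node has level 0, and a nonfaulty node whose neighbors' levels, sorted in non-decreasing order, are S_0, ..., S_(n-1) has level k, where k is the first index with S_k < k (or n if no such index exists). The function name `safety_levels` is hypothetical, and dynamic faults during routing are not modeled.

```python
def safety_levels(n: int, faulty: set[int]) -> list[int]:
    """Compute safety levels in an n-dimensional hypercube by (n - 1)
    synchronous rounds of neighbor exchange (assumed update rule)."""
    size = 1 << n
    # Faulty nodes are 0-safe; every nonfaulty node starts as n-safe.
    levels = [0 if v in faulty else n for v in range(size)]
    for _ in range(n - 1):
        new_levels = levels[:]
        for v in range(size):
            if v in faulty:
                continue
            # Neighbors differ from v in exactly one bit.
            nbrs = sorted(levels[v ^ (1 << d)] for d in range(n))
            level = n
            for i, s in enumerate(nbrs):
                if s < i:
                    level = i
                    break
            new_levels[v] = level
        levels = new_levels
    return levels


if __name__ == "__main__":
    # 3-cube with faulty nodes 000 and 011.
    print(safety_levels(3, {0b000, 0b011}))
```

In the 3-cube example, nodes 001 and 010 each have two faulty neighbors and end up 1-safe: they are guaranteed a minimal path only to nodes one hop away, which matches the fact that both minimal paths between 001 and 010 pass through a faulty node.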
Model
Digital Document
Publisher
Florida Atlantic University
Description
Reliability is a key system characteristic that is an increasing concern for current systems. Greater reliability is necessary due to the new ways in which services are delivered to the public. Services are used by many industries, including health care, government, telecommunications, tools, and products. We have defined an approach to incorporate reliability along the stages of system development. We first did a survey of existing dependability patterns to evaluate their possible use in this methodology. We have defined a systematic methodology that helps the designer apply reliability in all steps of the development life cycle in the form of patterns. A systematic failure enumeration process to define corresponding countermeasures was proposed as a guideline to define where reliability is needed. We introduced the idea of failure patterns which show how failures manifest and propagate in a system. We also looked at how to combine reliability and security. Finally, we defined an approach to certify the level of reliability of an implemented web service. All these steps lead towards a complete methodology.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Achieving dependability in critical infrastructures has become indispensable for government and commercial enterprises. This need has become more pressing with the proliferation of malicious attacks on critical systems, such as healthcare, aerospace, and airline applications. Additionally, due to the widespread use of web services in critical systems, ensuring their reliability is paramount. We believe that patterns can be used to achieve dependability. We conducted a survey of fault tolerance, reliability, and web service products and patterns to better understand them. One objective of our survey was to evaluate the state of these patterns and to investigate which standards are being used in products and what tool support they have. Our survey found that these patterns are insufficient and that many web services products do not use them. In light of this, we wrote some fault tolerance and web services reliability patterns and present an analysis of them.