Fault-tolerant computing

Model
Digital Document
Publisher
Florida Atlantic University
Description
Fault tolerant programming methods improve software reliability using the principles of design diversity and redundancy. Design diversity and redundancy, on the other hand, escalate the cost of the software design and development. In this thesis, we study the reliability of hybrid fault tolerant systems. Probability models based on fault trees are developed for the recovery block (RB), N-version programming (NVP) and hybrid schemes which are the combinations of RB and NVP. Two heuristic methods are developed to construct hybrid fault tolerant systems with total cost constraints. The algorithms provide a systematic approach to the design of hybrid fault tolerant systems.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In recent years multiprocessor systems are becoming increasingly important in critical applications. In particular, their fault tolerance properties are of great importance for their ability to be used in these type of applications. We have developed a multiprocessor simulator that can be used to test different fault detection algorithms. The processors must have four communication links. This simulator operates by passing messages between processors. An algorithm was developed for routing the messages among the processors. The simulator can also be used to try different reconfiguration strategies. In particular we have tested Malek's comparison algorithm using different multiprocessor configurations. We also developed a program which determines the configuration of an unknown network of transputers.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Mesh-connected multicomputers are one of the simplest and least expensive structures to build a system using hundreds and even thousands of processors. The nodes communicate with each other by sending and receiving messages. As the system gets larger and larger, it not only requires the routing algorithms be efficient but also fault-tolerant. The fault model we use in 2-D meshes is a faulty block while in 3-D meshes, the fault model is a faculty cube. In order to route messages through feasible minimum paths, the extended safety level is used to determine the existence of a minimal path and faulty block (cube) information is used to guide the routing. This dissertation presents an in-depth study of fault-tolerant minimal routing in 2-D tori, 3-D meshes, and tree-based fault-tolerant multicasting in 2-D and 3-D meshes using extended safety levels. Also path-based fault-tolerant deadlock-free multicasting in 2-D and 3-D meshes is studied. In fault-tolerant minimal routing in 2-D meshes, if no faulty block is encountered, any adaptive minimal routing can be used until the message encounters a faulty block. The next step is guided by the faulty block information until the message gets away from the faulty block. After that, any minimal adaptive routing can be used again. The minimal routing in 2-D tori is similar to that in 2-D meshes if at the beginning of the routing a conversion is made from a 2-D torus to a 2-D mesh. The fault-tolerant minimal routing in 3-D meshes can be done in a similar way. In the tree-based multicasting in 2-D and 3-D meshes, a time-step optimal and traffic-step suboptimal algorithm is proposed. Several heuristic strategies are presented to resolve a conflict, which are compared by simulations. A path-based fault-tolerant deadlock-free multicast algorithm in 2-D meshes with inter-block distance of at least three is presented to solve the deadlock problem in tree-based multicast algorithms. The approach is then extended to 3-D meshes and to inter-block distance of at least two in 2-D meshes. The path is Hamiltonian that is only updated locally in the neighborhood of a faulty block when a faulty block is encountered. Two virtual channels are used to prevent deadlock in 2-D and 3-D meshes with inter-block (inter-cube) distance of at least three and two more virtual channels are added if the inter-block distance is at least two.
Model
Digital Document
Publisher
Florida Atlantic University
Description
With the increase in the applications of computer technology, there are more and more demands for the use of computer systems in the area of real-time applications and critical systems. Reliability and performance are fundamental design requirements for these applications. In this dissertation, we develop some specific aspects of a fault-tolerant decentralized system architecture. This system can execute concurrent processes and it is composed of processing elements that have only local memories with point-to-point communication. A model using hierarchical layers describes this system. Fault tolerance techniques are discussed for the applications, software, operating system, and hardware layers of the model. Scheduling of communicating tasks to increase performance is also addressed. Some special problems such as the Byzantine Generals problem are considered. We have shown that, by combining reliable techniques on different layers and with consideration of system performance, one can provide a system with a very high level reliability as well as performance.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In this thesis, a low interprocessor communication overhead and high performance data parallelism parallel application model in a network of workstations (NOWs) is proposed. Checkpointing and rollback technologies are used in this model for performance enhancement purpose. The proposed model is analyzed both theoretically and numerically. The simulation results show that a high performance of the parallel application model is expected. As a case study, the proposed model is used to the parallel Everglades Landscape Fire Model (ELFM) code which was developed by South Florida Water Management District (SFWMD). The parallel programming environment is Message-Passing Interface (MPI). A synchronous checkpointing and rollback mechanism is used to handle the spread of fire which is a dynamic and irregular component of the model. Results show that the performance of the parallel ELFM using MPI is significantly enhanced by the application of checkpointing and rollback.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Many methodologies for software modeling and design include some form of static and dynamic modeling to describe the structural and behavioral views respectively. Modeling and design of complex real-time software systems requires notations for describing concurrency, asynchronous event handling, communication between independent machines, timing properties, and accessing real time. Function-oriented structured analysis methodologies such as Ward and Mellor's SA/RT and Harel's Statecharts have provided extensions for real-time system modeling. Dynamic modeling of real time systems using object-oriented methodologies also requires extensions to the traditional state machine notations in order to convey the real time system characteristics and constraints. Shaw's Communicating Real Time State Machines (CRSM's), Harel's O-Chart notations, and the Octopus methodology provide methods for modeling real-time systems consistent with object-oriented methods. This thesis proposes an object-oriented analysis and design methodology that augments the traditional Object Modeling Technique (OMT) dynamic model with real-time extensions based on high-level parallel machines and communication notations from CRSM. An example of the proposed methodology is provided using a realistic but hypothetical example of an automated passenger train system. A design refinement step is included for fault tolerant considerations. An evaluation of the proposed methodology with its extended notations is provided.
Model
Digital Document
Publisher
Florida Atlantic University
Description
We study the embedding of binomial trees with variable roots in faulty hypercubes. Based on novel embedding strategies, we propose three embedding algorithms with variable nodes as the root. The first algorithm can tolerate up to n - 1 faulty links, but the execution can be done within log2(n - 1) subcube splits. The second one can tolerate up to [(3(n - 1))\2] faulty links. The last one can tolerate up to [(3(n - 4))\2] faulty nodes.
Model
Digital Document
Publisher
Florida Atlantic University
Description
We propose the enhanced Fibonacci cube (EFC), which is defined based on the sequence Fn = 2(n-2) + 2F(n-4). We study its topological properties, embeddings, applications, routings, VLSI/WSI implementations, and its extensions. Our results show that EFC retains many properties of the hypercube. It contains the Fibonacci cube (FC) and extended Fibonacci cube of the same order as subgraphs and maintains virtually all the desirable properties of FC. EFC is even better in some structural properties, embeddings, applications and VLSI designs than FC or hypercube. With EFC, there are more cubes with various structures and sizes for selection, and more backup cubes into which faulty hypercubes can be reconfigured, which alleviates the size limitation of the hypercube and results in a higher level of fault tolerance.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Fault simulators can be used for various purposes, such as the determination of the Fault coverage, the Automatic test pattern generation and the preparation of the Fault dictionaries. As the size of the digital circuits increases, the number of gates present increases and the time taken for fault simulation also increases. In order to reduce the fault simulation time, massively parallel computers are being used. We have developed a fault simulator on MASPAR, a massively parallel Single Instruction Multiple Data machine, based on the principles of parallel pattern parallel fault simulation. In order to eliminate the limitation of limited memory on MASPAR, we have designed an algorithm which reduces the amount of memory required for storing the circuit. We have implemented these algorithms in two different ways. These algorithms were tested on ISCAS85 benchmark circuits. The results have shown an improvement over other parallel algorithms.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The utilization of a multiprocessor system is enhanced when idle time of processors is reduced. Allocation of processes from overloaded processors to idle processors can balance the load on multiprocessor systems and increase system throughput by reducing the process execution time. This thesis presents a study of parameters, issues and existing algorithms related to load balancing. The performance of load balancing on hypercubes using three new algorithms is explored and analyzed. A new algorithm to balance load on hypercubes in the presence of link faults is presented and analyzed here. Another algorithm to balance load on hypercube systems containing faulty processors is proposed and studied. The applicability of load balancing to real life problems is demonstrated by showing that the execution of branch and bound problem on hypercubes speeds up when load balancing is used.