FAULT TOLERANCE IN THE MOBILE ENVIRONMENT

DANIEL C. DOOLAN; SABIN TABIRCA; LAURENCE T. YANG

Authors

DANIEL C. DOOLAN, School of Computing, Robert Gordon University, Aberdeen, AB25 1HG, United Kingdom
SABIN TABIRCA, Department of Computer Science, University College Cork, Cork, Ireland
LAURENCE T. YANG, Department of Computer Science, St. Francis Xavier University, Antigonish, NS B2G 2W5, Canada

Keywords:

Fault Tolerance, Mobile Message Passing, Bluetooth

Abstract

In general it is assumed that a parallel program will execute on reliable hardware. A fault tolerant program and its underlying infrastructure should be capable of surviving failures such as system crashes and network failures. At the highest level, the application should recover automatically from a set of faults without any change to the apparent behaviour of the program. Checkpointing allows a program to save its state to persistent storage, abort, and later restart from the checkpoint. Several fault tolerant MPI implementations exist; MPICH-V, for example, is considered one of the most complete, featuring checkpointing and message logs that allow aborted processes to be replaced. No matter how sophisticated a fault tolerant system may be, it can never be relied upon completely, as there is always the possibility of a complete system failure. It is one thing to develop fault tolerant applications on high end dedicated clusters and supercomputers. Applying fault tolerance to mobile parallel computing presents an entirely new series of challenges that are inextricably linked with the unpredictable nature of wireless communication systems. Two differing strategies for fault tolerance in the mobile Bluetooth wireless environment are presented and compared to determine which is preferable.
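
The checkpoint and restart cycle described in the abstract can be illustrated with a minimal sketch. It is not the mechanism used by MPICH-V or by the strategies the paper compares; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Persist state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "partial_sum": 0}


def run(path, values, checkpoint_every=10, fail_at=None):
    """Sum values, checkpointing periodically; fail_at simulates a crash (hypothetical)."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(values)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["partial_sum"] += values[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["partial_sum"]
```

If a run is interrupted, calling `run` again with the same checkpoint path resumes from the last saved index instead of starting over; work done since the last checkpoint is repeated, which is the usual trade-off when choosing the checkpoint interval.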
