Skip to content

Proactive Fault Tolerance Framework

Summary: Proactive fault tolerance is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from components that are about to fail. The proactive fault tolerance framework consists of a number of individual proof-of-concept prototypes, including process and virtual machine migration, scalable system monitoring, and online/offline system health analysis.

In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation high-performance computing (HPC) systems. The notion of proactive fault tolerance emerged in recent years. It is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from nodes that are about to fail. Pre-fault indicators, such as a significant increase in heat, can be used to avoid an imminent failure through anticipation and reconfiguration. As computation is migrated away, application failures are avoided and application mean-time to failure (AMTTF) is extended beyond system mean-time to failure (SMTTF). Since avoiding a failure through preemptive migration is significantly more efficient than recovery from failure via traditional reactive fault tolerance mechanisms, such as checkpoint/restart, HPC system utilization becomes more efficient.

The proactive fault tolerance framework (Figure 1) consists of a number of individual proof-of-concept prototypes, including process and virtual machine migration, scalable system monitoring (Figure 2), and online/offline system health analysis. The novel process-level live migration mechanism (Figure 3) supports continued execution of applications during much of process migration. This scheme is integrated into an Message Passing Interface (MPI) execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive fault tolerance by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The scalable health monitoring system utilizes a tree-based overlay network to classify and aggregate monitoring metrics based on individual needs. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of 56 in comparison to the Ganglia distributed monitoring system. The online/offline system health analysis uses statistical methods, such as clustering and temporal analysis, to identify pre-fault indicators in the collected health monitoring data and in traditional system logs.


Figure 1: Proactive fault tolerance control loop

Figure 2: Scalable system monitoring architecture

Figure 3: Process migration overhead of NAS PB

Research Projects

Peer-reviewed Journal Publications

  1. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, February 1, 2012. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2011.10.009. Abstract Publication BibTeX Citation

Peer-reviewed Conference Publications

  1. Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. DOI 10.1109/HPCC.2010.32. Acceptance rate 19.1% (58/304). Abstract Publication Presentation BibTeX Citation
  2. Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. DOI 10.2316/P.2010.676-024. Abstract Publication Presentation BibTeX Citation
  3. Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. DOI 10.1109/ARES.2009.105. Acceptance rate 25.0% (40/160). Abstract Publication BibTeX Citation
  4. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138). Abstract Publication Presentation BibTeX Citation
  5. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277). Abstract Publication Presentation BibTeX Citation
  6. Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.171. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation

Peer-reviewed Workshop Publications

  1. George Ostrouchov, Thomas Naughton, Christian Engelmann, Geoffroy R. Vallée, and Stephen L. Scott. Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience. In Proceedings of the 5th IEEE International Conference on e-Science (e-Science) 2009: Workshop on Computational Science, pages 80-85, Oxford, UK, December 9-11, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-5946-9. DOI 10.1109/ESCIW.2009.5407992. Abstract Publication Presentation BibTeX Citation

Peer-reviewed Conference Posters

  1. Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Abstract Publication BibTeX Citation
  2. Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA, February 14-18, 2009. Abstract Publication BibTeX Citation

White Papers

  1. Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann, and Bill Harrod. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. White paper for the U.S. National Science Foundation's High-end Computing Program, December 1, 2009. Publication BibTeX Citation

Technical Reports

  1. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Technical Report, ORNL/TM-2010/161, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2010. Abstract Publication BibTeX Citation

Talks and Lectures

  1. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016. Abstract Presentation BibTeX Citation
  2. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015. Abstract Presentation BibTeX Citation
  3. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013. Abstract Presentation BibTeX Citation
  4. Christian Engelmann. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development. Invited talk at the Argonne National Laboratory (ANL) Institute of Computing in Science (ICiS) Summer Workshop Week on Addressing Failures in Exascale Computing, Park City, UT, USA, August 4-11, 2012. Abstract Presentation BibTeX Citation
  5. Christian Engelmann. Resilient Software for ExaScale Computing. Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 17, 2011. Abstract Presentation BibTeX Citation
  6. Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011. Abstract Presentation BibTeX Citation
  7. Christian Engelmann. Scalable HPC System Monitoring. Invited talk at the 3rd HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2010, in conjunction with the 3rd Los Alamos Computer Science Symposium (LACSS) 2010, Santa Fe, NM, USA, October 13, 2010. Abstract Presentation BibTeX Citation
  8. Christian Engelmann. Beyond Application-Level Checkpoint/Restart – Advanced Software Approaches for Fault Resilience. Talk at the 39th SPEEDUP Workshop on High Performance Computing, Zurich, Switzerland, September 6, 2010. Presentation BibTeX Citation
  9. Christian Engelmann and Stephen L. Scott. Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond. Talk at the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Workshop, in conjunction with the USENIX Federated Conferences Week (USENIX) 2010, Boston MA, USA, June 22, 2010. Abstract Presentation BibTeX Citation
  10. Christian Engelmann. Resilience Challenges at the Exascale. Talk at the 14th Workshop on Distributed Supercomputing (SOS) 2010, Savannah, GA, USA, March 8-11, 2010. Abstract Presentation BibTeX Citation
  11. Christian Engelmann and Stephen L. Scott. HPC System Software Research at Oak Ridge National Laboratory. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 22, 2010. Abstract Presentation BibTeX Citation
  12. Christian Engelmann. Modeling Techniques Towards Resilience. Invited talk at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Presentation BibTeX Citation
  13. Christian Engelmann. System Resilience Research at ORNL in the Context of HPC. Invited talk at the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France, May 15, 2009. Abstract Presentation BibTeX Citation
  14. Christian Engelmann. Proactive Fault Tolerance Using Preemptive Migration. Invited talk at the 3rd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2009, Cancun, Mexico, April 22-24, 2009. Abstract Presentation BibTeX Citation
  15. Christian Engelmann. Resiliency. Panel at the 13th Workshop on Distributed Supercomputing (SOS) 2009, Hilton Head, SC, USA, March 9-12, 2009. BibTeX Citation
  16. Christian Engelmann. High-Performance Computing Research at Oak Ridge National Laboratory. Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom, December 8, 2008. Abstract Presentation BibTeX Citation
  17. Christian Engelmann. Resiliency for High-Performance Computing. Invited talk at the 2nd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008, Cancun, Mexico, April 10-12, 2008. Abstract Presentation BibTeX Citation
  18. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d'Analyse et d’Architecture des Systémes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008. Abstract Presentation BibTeX Citation

Co-advised Theses

  1. Swen Böhm. Development of a RAS Framework for HPC Environments: Realtime Data Reduction of Monitoring Data. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  2. Antonina Litvinova. RAS Framework Engine Prototype. Master’s thesis, Department of Computer Science, University of Reading, UK, September 22, 2009. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation

Symbols: Abstract Abstract, Publication Publication, Presentation Presentation, BibTeX Citation BibTeX Citation