Proactive Fault Tolerance Framework – Christian Engelmann, Ph.D.

Summary: Proactive fault tolerance is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from components that are about to fail. The proactive fault tolerance framework consists of a number of individual proof-of-concept prototypes, including process and virtual machine migration, scalable system monitoring, and online/offline system health analysis.

To address anticipated high failure rates, resiliency has become an urgent priority for next-generation high-performance computing (HPC) systems. Proactive fault tolerance, a notion that emerged in recent years, prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from nodes that are about to fail. Pre-fault indicators, such as a significant increase in heat, can be used to anticipate an imminent failure and avoid its effects through reconfiguration. As computation is migrated away, application failures are avoided and the application mean time to failure (AMTTF) is extended beyond the system mean time to failure (SMTTF). Since avoiding a failure through preemptive migration is significantly more efficient than recovering from it via traditional reactive fault tolerance mechanisms, such as checkpoint/restart, HPC systems are utilized more efficiently.
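The anticipate-and-reconfigure idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework's implementation: the `Node` class, the `TEMP_LIMIT_C` threshold, and the "least-loaded healthy node" placement policy are all hypothetical, and "migration" is reduced to moving a process name between lists.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; a real system would derive it from health analysis.
TEMP_LIMIT_C = 85.0


@dataclass
class Node:
    """Hypothetical compute node with one health metric and its processes."""
    name: str
    temperature_c: float
    processes: list = field(default_factory=list)


def nodes_at_risk(nodes, limit=TEMP_LIMIT_C):
    """Return the nodes whose temperature reading exceeds the limit."""
    return [n for n in nodes if n.temperature_c > limit]


def migrate_away(nodes, limit=TEMP_LIMIT_C):
    """Move processes off at-risk nodes onto the least-loaded healthy node.

    Returns a list of (process, source, target) tuples describing the moves.
    """
    at_risk = nodes_at_risk(nodes, limit)
    healthy = [n for n in nodes if n.temperature_c <= limit]
    moves = []
    if not healthy:
        # No safe target: reactive fault tolerance would have to take over.
        return moves
    for src in at_risk:
        while src.processes:
            dst = min(healthy, key=lambda n: len(n.processes))
            proc = src.processes.pop()
            dst.processes.append(proc)
            moves.append((proc, src.name, dst.name))
    return moves


if __name__ == "__main__":
    cluster = [
        Node("n0", 60.0, ["p0"]),
        Node("n1", 92.0, ["p1", "p2"]),  # overheating node
        Node("n2", 55.0, []),
    ]
    for proc, src, dst in migrate_away(cluster):
        print(f"migrate {proc}: {src} -> {dst}")
```

A single fixed threshold is the simplest possible pre-fault indicator; it stands in here for whatever indicator the health analysis identifies.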

The proactive fault tolerance framework (Figure 1) consists of a number of individual proof-of-concept prototypes, including process and virtual machine migration, scalable system monitoring (Figure 2), and online/offline system health analysis. The novel process-level live migration mechanism (Figure 3) supports continued execution of applications during much of the process migration. This scheme is integrated into a Message Passing Interface (MPI) execution environment to transparently sustain health-related node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, whereas similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive fault tolerance by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The scalable health monitoring system utilizes a tree-based overlay network to classify and aggregate monitoring metrics based on individual needs. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of 56 in comparison to the Ganglia distributed monitoring system. The online/offline system health analysis uses statistical methods, such as clustering and temporal analysis, to identify pre-fault indicators in the collected health monitoring data and in traditional system logs.
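One way to see why handling 70% of the faults proactively could nearly halve the number of checkpoints is Young's first-order approximation of the optimal checkpoint interval, sqrt(2 * C * MTTF), where C is the cost of writing one checkpoint. The sketch below assumes that model, assumes that only the remaining 30% of faults reach the reactive mechanism (so its effective MTTF grows by a factor of 1/0.3), and assumes that the checkpoint count is inversely proportional to the interval. The function names and example values are illustrative and are not taken from the cited publications.

```python
import math


def young_interval(checkpoint_cost_s, mttf_s):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTTF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def checkpoint_ratio(proactive_fraction):
    """Checkpoints needed with proactive handling, relative to without.

    Assumes the faults left for reactive handling arrive with an effective
    MTTF of MTTF / (1 - proactive_fraction), and that the number of
    checkpoints is inversely proportional to the checkpoint interval.
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


if __name__ == "__main__":
    cost_s, mttf_s = 300.0, 86400.0           # illustrative values only
    base = young_interval(cost_s, mttf_s)
    improved = young_interval(cost_s, mttf_s / (1.0 - 0.7))
    print(f"interval without proactive handling: {base:.0f} s")
    print(f"interval with 70% handled proactively: {improved:.0f} s")
    print(f"relative number of checkpoints: {checkpoint_ratio(0.7):.2f}")
```

Under these assumptions the ratio is sqrt(0.3), roughly 0.55, which is consistent with "nearly in half"; the ratio does not depend on the checkpoint cost or the baseline MTTF.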

Figure 1: Proactive fault tolerance control loop

Figure 2: Scalable system monitoring architecture

Figure 3: Process migration overhead for the NAS Parallel Benchmarks (NPB)
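The data reduction achieved by aggregating metrics in a tree-based overlay network can be illustrated with a small, self-contained sketch. It does not use the MRNet API; the `Summary` and `TreeNode` classes are hypothetical, and only aggregation (not classification) is shown. Each inner node merges the summaries of its children, so the root stores one fixed-size summary instead of one reading per compute node.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import List, Optional


@dataclass
class Summary:
    """Fixed-size summary of any number of readings."""
    count: int
    minimum: float
    maximum: float
    total: float

    @property
    def mean(self):
        return self.total / self.count


def merge(a, b):
    """Combine two summaries without access to the raw readings."""
    return Summary(
        a.count + b.count,
        min(a.minimum, b.minimum),
        max(a.maximum, b.maximum),
        a.total + b.total,
    )


@dataclass
class TreeNode:
    """Leaf nodes carry a reading; inner nodes only carry children."""
    reading: Optional[float] = None
    children: List["TreeNode"] = field(default_factory=list)

    def aggregate(self):
        """Return the summary of all readings in this subtree."""
        if not self.children:
            if self.reading is None:
                raise ValueError("leaf node without a reading")
            return Summary(1, self.reading, self.reading, self.reading)
        return reduce(merge, (child.aggregate() for child in self.children))


if __name__ == "__main__":
    # Two inner nodes, each collecting temperatures from two compute nodes.
    root = TreeNode(children=[
        TreeNode(children=[TreeNode(50.0), TreeNode(60.0)]),
        TreeNode(children=[TreeNode(70.0), TreeNode(80.0)]),
    ])
    s = root.aggregate()
    print(s.count, s.minimum, s.maximum, s.mean)
```

Because `merge` is associative, the shape of the tree does not change the result, only how the merging work is distributed across the overlay.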

Research Projects

2008-11: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

Funding Sources

Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy

Participating Institutions

Peer-reviewed Journal Publications

Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, February 1, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2011.10.009.

Peer-reviewed Conference Publications

Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. DOI 10.1109/HPCC.2010.32. Acceptance rate 19.1% (58/304).
Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. DOI 10.2316/P.2010.676-024.
Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. DOI 10.1109/ARES.2009.105. Acceptance rate 25.0% (40/160).
Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138).
Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277).
Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.171. Acceptance rate 21.1% (40/190).

Peer-reviewed Workshop Publications

George Ostrouchov, Thomas Naughton, Christian Engelmann, Geoffroy R. Vallée, and Stephen L. Scott. Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience. In Proceedings of the 5th IEEE International Conference on e-Science (e-Science) 2009: Workshop on Computational Science, pages 80-85, Oxford, UK, December 9-11, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-5946-9. DOI 10.1109/ESCIW.2009.5407992.

Peer-reviewed Conference Posters

Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009.
Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA, February 14-18, 2009.

White Papers

Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann, and Bill Harrod. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. White paper for the U.S. National Science Foundation's High-end Computing Program, December 1, 2009.

Technical Reports

Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Technical Report, ORNL/TM-2010/161, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2010.

Talks and Lectures

Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016.
Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015.
Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013.
Christian Engelmann. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development. Invited talk at the Argonne National Laboratory (ANL) Institute of Computing in Science (ICiS) Summer Workshop Week on Addressing Failures in Exascale Computing, Park City, UT, USA, August 4-11, 2012.
Christian Engelmann. Resilient Software for ExaScale Computing. Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 17, 2011.
Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011.
Christian Engelmann. Scalable HPC System Monitoring. Invited talk at the 3rd HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2010, in conjunction with the 3rd Los Alamos Computer Science Symposium (LACSS) 2010, Santa Fe, NM, USA, October 13, 2010.
Christian Engelmann. Beyond Application-Level Checkpoint/Restart – Advanced Software Approaches for Fault Resilience. Talk at the 39th SPEEDUP Workshop on High Performance Computing, Zurich, Switzerland, September 6, 2010.
Christian Engelmann and Stephen L. Scott. Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond. Talk at the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Workshop, in conjunction with the USENIX Federated Conferences Week (USENIX) 2010, Boston, MA, USA, June 22, 2010.
Christian Engelmann. Resilience Challenges at the Exascale. Talk at the 14th Workshop on Distributed Supercomputing (SOS) 2010, Savannah, GA, USA, March 8-11, 2010.
Christian Engelmann and Stephen L. Scott. HPC System Software Research at Oak Ridge National Laboratory. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 22, 2010.
Christian Engelmann. Modeling Techniques Towards Resilience. Invited talk at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009.
Christian Engelmann. System Resilience Research at ORNL in the Context of HPC. Invited talk at the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France, May 15, 2009.
Christian Engelmann. Proactive Fault Tolerance Using Preemptive Migration. Invited talk at the 3rd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2009, Cancun, Mexico, April 22-24, 2009.
Christian Engelmann. Resiliency. Panel at the 13th Workshop on Distributed Supercomputing (SOS) 2009, Hilton Head, SC, USA, March 9-12, 2009.
Christian Engelmann. High-Performance Computing Research at Oak Ridge National Laboratory. Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom, December 8, 2008.
Christian Engelmann. Resiliency for High-Performance Computing. Invited talk at the 2nd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008, Cancun, Mexico, April 10-12, 2008.
Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d'Analyse et d'Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008.

Co-advised Theses

Swen Böhm. Development of a RAS Framework for HPC Environments: Realtime Data Reduction of Monitoring Data. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville).
Antonina Litvinova. RAS Framework Engine Prototype. Master’s thesis, Department of Computer Science, University of Reading, UK, September 22, 2009. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville).
