This project aims at scalable technologies for providing high-level reliability, availability, and serviceability (RAS) for next-generation petascale scientific high-end computing (HEC) resources and beyond, as outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities. Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide non-stop scientific computing on a 24×7 basis.
The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale HEC systems. This effort targets: (1) reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring component and system reliability, (2) proactive fault tolerance technology based on preemptive migration away from components that are about to fail, (3) reactive fault tolerance enhancements, such as adapting checkpoint interval and placement to actual and predicted system health threats, and (4) holistic fault tolerance through a combination of adaptive proactive and reactive fault tolerance.
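As an illustration of item (3), the following minimal sketch shows one way a checkpoint interval could be adapted to predicted system health. It is not the project's implementation: it assumes Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF), and a hypothetical `health_factor` in (0, 1] that a monitoring component would supply to scale the nominal mean time between failures (MTBF). The function names and parameters are illustrative only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Returns sqrt(2 * checkpoint cost * MTBF), all values in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def adapted_interval(checkpoint_cost_s: float,
                     nominal_mtbf_s: float,
                     health_factor: float) -> float:
    """Shorten the interval when predicted reliability drops.

    health_factor is a hypothetical input in (0, 1]: 1.0 means nominal
    health, smaller values mean a shorter expected time to failure.
    """
    if not 0.0 < health_factor <= 1.0:
        raise ValueError("health_factor must be in (0, 1]")
    return young_interval(checkpoint_cost_s, nominal_mtbf_s * health_factor)


if __name__ == "__main__":
    cost = 50.0       # assumed time to write one checkpoint, in seconds
    mtbf = 10_000.0   # assumed nominal system MTBF, in seconds
    print(young_interval(cost, mtbf))
    print(adapted_interval(cost, mtbf, 0.25))
```

With these assumed numbers, a health factor of 0.25 halves the interval, because the interval scales with the square root of the MTBF. A real system would derive the health estimate from monitoring data and failure prediction instead of a fixed constant.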
Prominent Solutions
- Characterization of Faults, Errors, and Failures in Extreme-Scale Systems
- Proactive Fault Tolerance Framework
- Hybrid Full/Incremental System-level Checkpointing
Funding Sources
- Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Program, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Participating Institutions
- Oak Ridge National Laboratory
- North Carolina State University
- Louisiana Tech University
- Tennessee Tech University
Peer-reviewed Journal Publications
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, February 1, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2011.10.009.
Peer-reviewed Conference Publications
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. DOI 10.1109/ICPADS.2010.48. Acceptance rate 29.6% (77/188).
- Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. DOI 10.1109/HPCC.2010.32. Acceptance rate 19.1% (58/304).
- Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. DOI 10.2316/P.2010.676-024.
- Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. DOI 10.1109/ARES.2009.105. Acceptance rate 25.0% (40/160).
- Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Evaluating the Shared Root File System Approach for Diskless High-Performance Computing Systems. In Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (LCI) 2009, Boulder, CO, USA, March 9-12, 2009.
- Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138).
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277).
- Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.171. Acceptance rate 21.1% (40/190).
Peer-reviewed Workshop Publications
- Thomas Naughton, Swen Böhm, Christian Engelmann, and Geoffroy Vallée. Using Performance Tools to Support Experiments in HPC Resilience. In Lecture Notes in Computer Science: Proceedings of the 19th European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops: 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 727-736, Aachen, Germany, August 26, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-642-54419-4. ISSN 0302-9743. DOI 10.1007/978-3-642-54420-0_71. Acceptance rate 87.5% (7/8).
- Thomas Naughton, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. A Case for Virtual Machine based Fault Injection in a High-Performance Computing Environment. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011: 5th Workshop on System-level Virtualization for High Performance Computing (HPCVirt), pages 234-243, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29737. DOI 10.1007/978-3-642-29737-3_27.
- George Ostrouchov, Thomas Naughton, Christian Engelmann, Geoffroy R. Vallée, and Stephen L. Scott. Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience. In Proceedings of the 5th IEEE International Conference on e-Science (e-Science) 2009: Workshop on Computational Science, pages 80-85, Oxford, UK, December 9-11, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-5946-9. DOI 10.1109/ESCIW.2009.5407992.
- Thomas Naughton, Wesley Bland, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. Fault Injection Framework for System Resilience Evaluation – Fake Faults for Finding Future Failures. In Proceedings of the 18th International Symposium on High Performance Distributed Computing (HPDC) 2009: 2nd Workshop on Resiliency in High Performance Computing (Resilience) 2009, pages 23-28, Munich, Germany, June 9, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-587-1. DOI 10.1145/1552526.1552530.
Peer-reviewed Conference Posters
- Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009.
- Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA, February 14-18, 2009.
White Papers
- Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Frank Mueller. Dynamic Self-Aware Runtime Software for Exascale Systems. White paper for the U.S. Department of Energy's Exascale Operating Systems and Runtime Technical Council, July 1, 2012.
- Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann, and Bill Harrod. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. White paper for the U.S. National Science Foundation's High-end Computing Program, December 1, 2009.
Technical Reports
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Full/Incremental Checkpoint/Restart for MPI Jobs in HPC Environments. Technical Report, ORNL/TM-2010/162, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2010.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Technical Report, ORNL/TM-2010/161, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2010.
Talks and Lectures
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013.
- Christian Engelmann. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development. Invited talk at the Argonne National Laboratory (ANL) Institute of Computing in Science (ICiS) Summer Workshop Week on Addressing Failures in Exascale Computing, Park City, UT, USA, August 4-11, 2012.
- Christian Engelmann. Resilient Software for ExaScale Computing. Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 17, 2011.
- Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011.
- Christian Engelmann. Scalable HPC System Monitoring. Invited talk at the 3rd HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2010, in conjunction with the 3rd Los Alamos Computer Science Symposium (LACSS) 2010, Santa Fe, NM, USA, October 13, 2010.
- Christian Engelmann. Beyond Application-Level Checkpoint/Restart – Advanced Software Approaches for Fault Resilience. Talk at the 39th SPEEDUP Workshop on High Performance Computing, Zurich, Switzerland, September 6, 2010.
- Christian Engelmann and Stephen L. Scott. Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond. Talk at the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Workshop, in conjunction with the USENIX Federated Conferences Week (USENIX) 2010, Boston, MA, USA, June 22, 2010.
- Christian Engelmann. Resilience Challenges at the Exascale. Talk at the 14th Workshop on Distributed Supercomputing (SOS) 2010, Savannah, GA, USA, March 8-11, 2010.
- Christian Engelmann and Stephen L. Scott. HPC System Software Research at Oak Ridge National Laboratory. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 22, 2010.
- Christian Engelmann. High-Performance Computing Research Internship and Appointment Opportunities at Oak Ridge National Laboratory. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, December 14, 2009.
- Christian Engelmann. Modeling Techniques Towards Resilience. Invited talk at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009.
- Christian Engelmann. System Resilience Research at ORNL in the Context of HPC. Invited talk at the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France, May 15, 2009.
- Christian Engelmann. High-Performance Computing Research and MSc Internship Opportunities at Oak Ridge National Laboratory. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, May 11, 2009.
- Christian Engelmann. Proactive Fault Tolerance Using Preemptive Migration. Invited talk at the 3rd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2009, Cancun, Mexico, April 22-24, 2009.
- Christian Engelmann. Resiliency. Panel at the 13th Workshop on Distributed Supercomputing (SOS) 2009, Hilton Head, SC, USA, March 9-12, 2009.
- Christian Engelmann. High-Performance Computing Research at Oak Ridge National Laboratory. Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom, December 8, 2008.
- Christian Engelmann. Modular Redundancy in HPC Systems: Why, Where, When and How? Invited talk at the 1st HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2008, in conjunction with the 1st Los Alamos Computer Science Symposium (LACSS) 2008, Santa Fe, NM, USA, October 15, 2008.
- Christian Engelmann. Resiliency for High-Performance Computing. Invited talk at the 2nd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008, Cancun, Mexico, April 10-12, 2008.
- Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d'Analyse et d'Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008.
Co-advised Theses
- Swen Böhm. Development of a RAS Framework for HPC Environments: Realtime Data Reduction of Monitoring Data. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville).
- Antonina Litvinova. RAS Framework Engine Prototype. Master’s thesis, Department of Computer Science, University of Reading, UK, September 22, 2009. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville).