Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale supercomputing. Extreme heterogeneity, i.e., using multiple, and potentially configurable, types of processors, accelerators and memory/storage in a single computing platform, is adding a significant amount of complexity to the supercomputer hardware/software ecosystem. Errors and failures reported by such heterogeneous hardware will need to be handled by the appropriate software component to enable efficient masking, recovery, and avoidance with little burden on the user.
This project takes a first step toward resilience in leadership-class supercomputers with extreme heterogeneity. It performs research to enable fine-grain resilience for systems accelerated by graphics processing units (GPUs), such as Oak Ridge National Laboratory’s Summit supercomputer, with the aim of being more efficient than traditional application-level checkpoint/restart. The approach centers on a novel concept for Quality of Service (QoS) and corresponding extensions for the OpenMP parallel programming model. This project develops (1) error and failure models, (2) software resilience strategies and protection domains, (3) OpenMP QoS language extensions for resilience, (4) OpenMP QoS runtime extensions and policies for resilience, and (5) a proof-of-concept prototype demonstrating these capabilities on the Summit supercomputer at Oak Ridge National Laboratory.
The ultimate goal is to make fault resilience an integral part of the supercomputer hardware/software ecosystem, such that the burden for providing it is on the system by design and not on the user as an afterthought.
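As a rough illustration of how fine-grain resilience differs from application-level checkpoint/restart, the sketch below re-executes only a failed region rather than restarting the whole application. It is a conceptual model in Python, not the project's OpenMP extensions: the names `QoSPolicy`, `SoftError`, `run_protected`, and `make_flaky_kernel` are hypothetical and chosen for this example only, and the device error is simulated.

```python
from dataclasses import dataclass


class SoftError(Exception):
    """Hypothetical stand-in for an error reported by an accelerator."""


@dataclass
class QoSPolicy:
    # Hypothetical QoS parameter: how many re-executions the region may use.
    max_retries: int = 2


def run_protected(region, data, policy):
    """Run `region` inside a simple protection domain.

    The inputs are copied before every attempt so the region can be
    re-executed from a clean state. Returns (result, attempts_used).
    """
    for attempt in range(1, policy.max_retries + 2):
        snapshot = list(data)  # preserve the original inputs for a retry
        try:
            return region(snapshot), attempt
        except SoftError:
            continue  # only this region is retried, not the whole program
    raise RuntimeError("region failed after all permitted retries")


def make_flaky_kernel(failures):
    """Build a kernel that raises SoftError `failures` times, then succeeds."""
    remaining = {"n": failures}

    def kernel(xs):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise SoftError("simulated device error")
        return [2 * x for x in xs]

    return kernel


if __name__ == "__main__":
    result, attempts = run_protected(
        make_flaky_kernel(1), [1, 2, 3], QoSPolicy(max_retries=2)
    )
    print(result, attempts)
```

In this toy model the retry budget plays the role of a QoS policy attached to a protection domain. How errors are actually detected and which recovery strategies are available on real hardware are outside what the sketch covers.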
Figure: Compile-time workflow and run-time interactions of the rOpenMP prototype using LLVM 7
Funding Sources
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
Participants
- Christian Engelmann (PI), Geoffroy Vallée, and Swaroop Pophale — Oak Ridge National Laboratory
Peer-reviewed Workshop Publications
- Christian Engelmann, Geoffroy R. Vallée, and Swaroop Pophale. Concepts for OpenMP Target Offload Resilience. In Lecture Notes in Computer Science: Proceedings of the 15th International Workshop on OpenMP (IWOMP) 2019, pages 78-93, Auckland, New Zealand, September 11-13, 2019. Springer Verlag, Berlin, Germany. ISBN 978-3-030-28595-1. DOI 10.1007/978-3-030-28596-8_6.
Talks and Lectures
- Christian Engelmann. Resilience in Parallel Programming Environments. Invited talk at the 8th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Tokyo, Japan, October 30-31, 2019.