The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several high-performance computing (HPC) resilience technologies have been developed. However, there are currently no tools, methods, or metrics to compare them and to identify the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption.
This work focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate interfaces/responsibilities of individual hardware/software components. The primary goal of this project is to provide the tools and data needed by HPC vendors to decide on future architectures and to enable direct feedback to HPC vendors on emerging resilience threats.
Prominent Solutions
Funding Sources
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
Participating Institutions
Peer-reviewed Journal Publications
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, August 1, 2016. John Wiley & Sons, Inc. ISSN 1532-0634. DOI 10.1002/cpe.3805.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, January 1, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2013.04.014.
Peer-reviewed Conference Publications
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Benchmark Generation and Simulation at Extreme Scale. In Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, pages 9-18, London, UK, September 21-23, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-3506-9. ISSN 1550-6525. DOI 10.1109/DS-RT.2016.18. Acceptance rate 42.0% (21/50). Best paper candidate.
- Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0. DOI 10.2316/P.2016.834-005.
- Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2. DOI 10.2316/P.2015.826-043.
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. DOI 10.1109/DS-RT.2014.32. Best paper candidate.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. DOI 10.1109/PDP.2014.74. Acceptance rate 32.6% (73/224).
Peer-reviewed Workshop Publications
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918. DOI 10.1109/ICPP.2013.114.
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Tools for Simulation and Benchmark Generation at Exascale. In Proceedings of the 7th Parallel Tools Workshop, pages 19-24, Dresden, Germany, September 3-4, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-319-08143-4. DOI 10.1007/978-3-319-08144-1_2.
White Papers
- Christian Engelmann and Thomas Naughton. A Hardware/Software Performance/Resilience/Power Co-Design Tool for Extreme-scale Computing. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2013, September 18-19, 2013.
- Christian Engelmann and Thomas Naughton. A Performance/Resilience/Power Co-design Tool for Extreme-scale High-Performance Computing. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2012, August 9-10, 2012.
Talks and Lectures
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015.
- Christian Engelmann. xSim: The Extreme-scale Simulator. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 23, 2015.
- Christian Engelmann. Supporting the Development of Resilient Message Passing Applications using Simulation. Invited talk at the Dagstuhl Seminar on Resilience in Exascale Computing, Schloss Dagstuhl, Wadern, Germany, September 28 – October 1, 2014.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013.
- Christian Engelmann. Fault Tolerance Session. Invited talk at the ExaChallenge Symposium, Dublin, Ireland, October 16-17, 2012.