Summary: The Extreme-scale Simulator (xSim) runs native high-performance computing applications with millions of concurrent execution threads in a controlled environment, while observing performance and resilience in a simulated extreme-scale system for application-architecture co-design. It is able to simulate a supercomputer with 134,217,728 (2^27) processes using a 960-core Linux cluster. It also offers process failure and soft error injection to study propagation, detection, notification and mitigation.
The path to exascale high-performance computing (HPC) poses several challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Investigating the performance and resilience of parallel applications at scale on future architectures, and the performance and resilience impact of different architecture choices, is an important component of HPC hardware/software co-design. Without access to future architectures at scale, simulation provides an alternative for estimating parallel application performance and resilience on potential architecture choices. As highly accurate simulations are extremely slow and scale poorly, different solution paths exist that trade off simulation accuracy to gain simulation performance and scalability.
The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running native HPC applications or proxy applications in a controlled environment with millions of concurrent execution threads, while observing application performance and resilience in a simulated extreme-scale system for hardware/software co-design. Using a lightweight parallel discrete event simulation (PDES), xSim executes a Message Passing Interface (MPI) application on a much smaller system in a highly oversubscribed fashion with a virtual wall clock time, such that performance data can be extracted based on a processor and a network model with an appropriate simulation scalability/accuracy trade-off (Figure 1). xSim is designed like a traditional performance tool, as an interposition library that sits between the MPI application and the MPI library, using the MPI profiling interface (Figure 2). It has been run with up to 134,217,728 (2^27) communicating MPI ranks using a 960-core Linux cluster (see Figure 3 for 2^24 results).
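The virtual wall clock idea described above can be illustrated with a minimal Python sketch. It is not xSim's implementation: the class names, the `speedup` scaling factor, and the latency/bandwidth network model are assumptions chosen for illustration. Each simulated rank carries its own virtual clock, compute phases are scaled by a processor model, and a message advances the receiver's clock to the modeled arrival time.

```python
from dataclasses import dataclass


@dataclass
class ProcessorModel:
    # Hypothetical model: the simulated core is `speedup` times as fast as the host core.
    speedup: float

    def virtual_compute_time(self, native_seconds: float) -> float:
        return native_seconds / self.speedup


@dataclass
class NetworkModel:
    # Hypothetical model: fixed per-message latency plus a bandwidth term.
    latency: float    # seconds per message
    bandwidth: float  # bytes per second

    def transfer_time(self, nbytes: int) -> float:
        return self.latency + nbytes / self.bandwidth


class VirtualRank:
    """One simulated MPI rank with its own virtual clock (in seconds)."""

    def __init__(self, rank: int):
        self.rank = rank
        self.clock = 0.0


def compute(rank: VirtualRank, native_seconds: float, cpu: ProcessorModel) -> None:
    # Time measured on the host is scaled into simulated time.
    rank.clock += cpu.virtual_compute_time(native_seconds)


def send_recv(sender: VirtualRank, receiver: VirtualRank, nbytes: int,
              net: NetworkModel) -> None:
    # A blocking receive completes no earlier than the modeled message arrival.
    arrival = sender.clock + net.transfer_time(nbytes)
    receiver.clock = max(receiver.clock, arrival)


def ranks_per_core(simulated_ranks: int, physical_cores: int) -> float:
    # Degree of oversubscription of the host system.
    return simulated_ranks / physical_cores


if __name__ == "__main__":
    cpu = ProcessorModel(speedup=2.0)
    net = NetworkModel(latency=1e-6, bandwidth=1e9)
    r0, r1 = VirtualRank(0), VirtualRank(1)
    compute(r0, 0.004, cpu)        # 4 ms on the host, 2 ms in simulated time
    send_recv(r0, r1, 1000, net)   # 1000-byte message from rank 0 to rank 1
    print(r0.clock, r1.clock)
    print(ranks_per_core(2**27, 960))
```

For the configuration quoted above, 2^27 simulated ranks on 960 cores corresponds to roughly 140,000 simulated ranks per physical core, which is why simulated time has to be decoupled from the host's wall clock.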
xSim also permits the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling within the simulation using application-level checkpoint/restart. These capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique. Another feature provides user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). This permits investigating performance under failure and failure handling of ABFT solutions using the fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim is the very first performance tool that supports ULFM and ABFT.
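As a rough illustration of what observing application-level checkpoint/restart under injected failures involves, the following Python sketch computes the simulated completion time of a job that checkpoints at fixed intervals. It is a toy model, not xSim's interface: the function name, its parameters, and the assumption that a failure rolls the application back to its last completed checkpoint are all hypothetical, and the ULFM/ABFT path is not covered.

```python
def run_with_checkpoint_restart(total_work, interval, checkpoint_cost,
                                restart_cost, failure_times):
    """Return (virtual completion time, number of restarts).

    total_work      -- failure-free compute time of the application
    interval        -- compute time between two checkpoints
    checkpoint_cost -- virtual time needed to write one checkpoint
    restart_cost    -- virtual time needed to restart after a failure
    failure_times   -- virtual times at which a process failure is injected
    """
    pending = sorted(failure_times)
    clock = 0.0    # virtual wall clock
    saved = 0.0    # work captured by the last completed checkpoint
    restarts = 0
    while saved < total_work:
        work = min(interval, total_work - saved)
        is_last = saved + work >= total_work
        # No checkpoint is written after the final segment.
        segment_end = clock + work + (0.0 if is_last else checkpoint_cost)
        # Failures scheduled before this segment starts (e.g. during a restart) are ignored.
        while pending and pending[0] < clock:
            pending.pop(0)
        if pending and pending[0] < segment_end:
            # Failure inside the segment: unsaved work is lost, then the job restarts.
            clock = pending.pop(0) + restart_cost
            restarts += 1
        else:
            clock = segment_end
            saved += work
    return clock, restarts


if __name__ == "__main__":
    print(run_with_checkpoint_restart(100.0, 25.0, 1.0, 5.0, []))
    print(run_with_checkpoint_restart(100.0, 25.0, 1.0, 5.0, [40.0]))
```

In this model a single failure at virtual time 40 costs the work done since the last checkpoint plus the restart time. Overheads of this kind are what a simulation with failure injection is meant to expose before the target system exists.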
xSim also permits the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. As radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems, xSim enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions using this bit flip fault injection feature. xSim is the very first simulation-based MPI performance tool that supports the injection of both process failures and bit flip faults.
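To make the notion of a bit flip at a specific injection location concrete, here is a minimal Python sketch that flips one bit in the IEEE 754 double-precision representation of a value. The helper `flip_bit` is hypothetical and says nothing about how xSim selects locations or activation times. It only shows why the affected bit position matters for application correctness.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant mantissa bit, 63 = sign bit) of a 64-bit float."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, flip the bit, convert back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    raw ^= 1 << bit
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw))
    return flipped


if __name__ == "__main__":
    x = 1.0
    print(flip_bit(x, 0))   # lowest mantissa bit: tiny relative error
    print(flip_bit(x, 52))  # lowest exponent bit: value is halved (0.5)
    print(flip_bit(x, 63))  # sign bit: value becomes -1.0
```

A flip in a low mantissa bit may be absorbed by the numerical tolerance of an algorithm, while a flip in an exponent or sign bit changes the magnitude or sign of the value outright.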
Figure 1: xSim Architecture
Figure 2: xSim Design
Figure 3: Scaling a Monte Carlo code to 2^24 MPI ranks
Research Projects
- 2013-16: MCREX: Monte Carlo Resilient Exascale Solvers
- 2012-14: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing
- 2011-12: Extreme-scale Algorithms and Software Institute
- 2008-11: Scalable Algorithms for Petascale Systems with Multicore Architectures
Peer-reviewed Journal Publications
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, August 1, 2016. John Wiley & Sons, Inc. ISSN 1532-0634. DOI 10.1002/cpe.3805.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, January 1, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2013.04.014.
Peer-reviewed Conference Publications
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Benchmark Generation and Simulation at Extreme Scale. In Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, pages 9-18, London, UK, September 21-23, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-3506-9. ISSN 1550-6525. DOI 10.1109/DS-RT.2016.18. Acceptance rate 42.0% (21/50). Best paper candidate.
- Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0. DOI 10.2316/P.2016.834-005.
- Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2. DOI 10.2316/P.2015.826-043.
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. DOI 10.1109/DS-RT.2014.32. Best paper candidate.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. DOI 10.1109/PDP.2014.74. Acceptance rate 32.6% (73/224).
- Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1. DOI 10.2316/P.2013.795-010.
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. DOI 10.1109/HPCSim.2011.5999835. Acceptance rate 28.1% (48/171).
Peer-reviewed Workshop Publications
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918. DOI 10.1109/ICPP.2013.114.
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Tools for Simulation and Benchmark Generation at Exascale. In Proceedings of the 7th Parallel Tools Workshop, pages 19-24, Dresden, Germany, September 3-4, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-319-08143-4. DOI 10.1007/978-3-319-08144-1_2.
- Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4511-0. ISSN 1530-2016. DOI 10.1109/ICPPW.2011.44.
- Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2. DOI 10.1109/CLUSTERWKSP.2010.5613113.
White Papers
- Christian Engelmann and Thomas Naughton. A Hardware/Software Performance/Resilience/Power Co-Design Tool for Extreme-scale Computing. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2013, September 18-19, 2013.
- Christian Engelmann and Thomas Naughton. A Performance/Resilience/Power Co-design Tool for Extreme-scale High-Performance Computing. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2012, August 9-10, 2012.
Talks and Lectures
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015.
- Christian Engelmann. xSim: The Extreme-scale Simulator. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 23, 2015.
- Christian Engelmann. Supporting the Development of Resilient Message Passing Applications using Simulation. Invited talk at the Dagstuhl Seminar on Resilience in Exascale Computing, Schloss Dagstuhl, Wadern, Germany, September 28 – October 1, 2014.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013.
- Christian Engelmann. Fault Tolerance Session. Invited talk at the ExaChallenge Symposium, Dublin, Ireland, October 16-17, 2012.
- Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011.
- Christian Engelmann. Beyond Application-Level Checkpoint/Restart – Advanced Software Approaches for Fault Resilience. Talk at the 39th SPEEDUP Workshop on High Performance Computing, Zurich, Switzerland, September 6, 2010.
- Christian Engelmann and Stephen L. Scott. HPC System Software Research at Oak Ridge National Laboratory. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 22, 2010.
- Christian Engelmann. JCAS – IAA Simulation Efforts at Oak Ridge National Laboratory. Invited talk at the IAA Workshop on HPC Architectural Simulation (HPCAS), Boulder, CO, USA, September 1-2, 2009.
Co-advised Theses
- Ian S. Jones. Simulation of Large Scale Architectures on High Performance Computers. Master’s thesis, Department of Computer Science, University of Reading, UK, October 22, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville).
- Frank Lauer. Simulation of Advanced Large-Scale HPC Architectures. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville).