This project delivers an operating system and runtime (OS/R) environment for extreme-scale scientific computing. With application composition as the fundamental driving force, we develop the OS/R interfaces and low-level system services required to support the isolation and sharing needed to design and implement applications, as well as performance and correctness tools. Our approach also supports complex simulation and analysis workflows, whose components will likely span a wide range of parallel codes with different OS/R requirements. For example, a relatively complicated multi-physics workflow may incorporate data from three types of legacy codes, using Message Passing Interface (MPI) only, Partitioned Global Address Space (PGAS) languages, and MPI with threading, while also requiring components for analytics, visualization, uncertainty quantification, memory profiling, and performance analysis.
Instead of a single unified OS/R that supports every conceivable requirement, we offer a lightweight OS/R system with the flexibility to custom-build runtimes for any particular purpose. Each component executes in its own enclave with a specialized runtime and isolation properties. A global runtime system provides the software required to compose applications out of a collection of enclaves, join them through secure and low-latency communication, and schedule them to avoid contention and maximize resource utilization. The benefits gained from lightweight and customizable runtimes include predictable and consistent memory and network patterns, manageable resilience properties, and measurable power and energy characteristics. These benefits simplify algorithm design and development at large scale.
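To make the composition model concrete, the following sketch in C shows how a global runtime might provision two enclaves with different specialized runtimes, join them through a low-latency channel, and launch a component in each. All identifiers here (enclave_t, enclave_create, enclave_connect, enclave_launch) are hypothetical stand-ins for illustration, not the project's actual interfaces; the stubs merely trace the steps such a runtime might perform.

```c
/*
 * Minimal, hypothetical sketch of enclave-based application composition.
 * None of these names come from the project's actual API; the stubs
 * below only print what a global runtime might do at each step.
 */
#include <stdio.h>

typedef int enclave_t;  /* handle to an isolated OS/R partition (hypothetical) */
typedef int channel_t;  /* handle to an inter-enclave channel (hypothetical)   */

static int next_id = 0;

/* Provision an enclave whose runtime is specialized for one component. */
static enclave_t enclave_create(const char *runtime, int ncores)
{
    printf("enclave %d: '%s' runtime on %d cores\n", next_id, runtime, ncores);
    return next_id++;
}

/* Join two enclaves through a secure, low-latency communication channel. */
static channel_t enclave_connect(enclave_t a, enclave_t b)
{
    printf("channel: enclave %d <-> enclave %d\n", a, b);
    return a * 100 + b;
}

/* Start a component binary inside its enclave. */
static void enclave_launch(enclave_t e, const char *binary)
{
    printf("enclave %d: launching %s\n", e, binary);
}

int main(void)
{
    /* Each workflow component gets its own enclave and runtime. */
    enclave_t sim = enclave_create("MPI+threads", 1024); /* multi-physics code */
    enclave_t ana = enclave_create("analytics", 64);     /* in-situ analysis   */

    /* The global runtime composes the application by joining enclaves. */
    channel_t ch = enclave_connect(sim, ana);
    (void)ch;  /* a real runtime would hand this channel to both components */

    enclave_launch(sim, "./multiphysics");
    enclave_launch(ana, "./analyzer");
    return 0;
}
```

Isolating the analysis component in its own enclave means its memory and scheduling behavior cannot perturb the simulation, which is the property the paragraph above attributes to enclave-based composition.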
Project deliverables are: (1) an OS/R stack based on the Kitten OS and the Palacios virtual machine monitor and (2) high-value, high-risk research that leverages the architecture of the base OS/R to explore issues of specific interest to exascale computing, e.g., virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools.
Funding Sources
- Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Participating Institutions
- Sandia National Laboratories
- Georgia Institute of Technology
- Indiana University
- Los Alamos National Laboratory
- Lawrence Berkeley National Laboratory
- North Carolina State University
- Northwestern University
- Oak Ridge National Laboratory
- University of Pittsburgh
- University of Arizona
- University of California, Berkeley
- University of New Mexico
- University of Texas at El Paso
- University of Tennessee, Knoxville
Peer-reviewed Journal Publications
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Epidemic Failure Detection and Consensus for Extreme Parallelism. International Journal of High Performance Computing Applications (IJHPCA), volume 32, number 5, pages 729-743, September 1, 2018. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342017690910.
Peer-reviewed Conference Publications
- David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. DOI 10.1145/2925426.2926295. Acceptance rate 24.2% (43/178).
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Scalable and Fault Tolerant Failure Detection and Consensus. In Proceedings of the 22nd European MPI Users' Group Meeting (EuroMPI) 2015, pages 13:1-13:9, Bordeaux, France, September 21-24, 2015. ACM Press, New York, NY, USA. ISBN 978-1-4503-3795-3. DOI 10.1145/2802658.2802660. Acceptance rate 48.3% (14/29).
Peer-reviewed Workshop Publications
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. A Cooperative Approach to Virtual Machine Based Fault Injection. In Lecture Notes in Computer Science: Proceedings of the 22nd European Conference on Parallel and Distributed Computing (Euro-Par) 2016 Workshops: 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 671-682, Grenoble, France, August 23, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-58943-5. ISSN 0302-9743. DOI 10.1007/978-3-319-58943-5_54. Acceptance rate 55.6% (5/9).
- Zachary Parchman, Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and David E. Bernholdt. Adding Fault Tolerance to NPB Benchmarks Using ULFM. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016, pages 19-26, Kyoto, Japan, May 31 – June 4, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4349-7. DOI 10.1145/2909428.2909429. Acceptance rate 85.7% (6/7).
- Thomas Naughton, Garry Smith, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. What is the right balance for performance and isolation with virtualization in HPC? In Lecture Notes in Computer Science: Proceedings of the 20th European Conference on Parallel and Distributed Computing (Euro-Par) 2014 Workshops: 7th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 570-581, Porto, Portugal, August 25, 2014. Springer Verlag, Berlin, Germany. ISBN 978-3-319-14325-5. ISSN 0302-9743. DOI 10.1007/978-3-319-14325-5_49. Acceptance rate 60.0% (6/10).
White Papers
- Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and David E. Bernholdt. Unified Execution Environment. White paper for the U.S. Department of Energy's Exascale Operating Systems and Runtime Technical Council, July 1, 2012.