- 2021-23: An Open Federated Architecture for the Laboratory of the Future: This project creates a federated hardware/software architecture for the research laboratory of the future, connecting scientific instruments, robot-controlled laboratories, and edge/center computing/data resources to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence-driven design, discovery, and evaluation.
- 2018-19: rOpenMP: A Resilient Parallel Programming Model for Heterogeneous Systems: The rOpenMP project performs research to enable fine-grain resilience for supercomputers with accelerators that is more efficient than traditional application-level checkpoint/restart. The approach centers on a novel concept for quality of service and corresponding extensions to the OpenMP parallel programming model.
- 2015-21: Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale: This project increases the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software.
- 2015-19: Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems: This project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems.
- 2013-16: Hobbes: OS and Runtime Support for Application Composition: This project develops an operating system and runtime (OS/R) environment for extreme-scale scientific computing based on the Kitten OS and the Palacios virtual machine monitor, including high-value, high-risk research exploring virtualization, analytics, networking, energy/power, scheduling/parallelism, architecture, resilience, programming models, and tools.
- 2013-16: MCREX: Monte Carlo Resilient Exascale Solvers: This project develops resilient Monte Carlo solvers with natural fault tolerance to hard and soft failures for efficiently executing next-generation computational science applications on exascale high-performance computing (HPC) systems. It extends initial work in Monte Carlo Synthetic Acceleration (MCSA) and evaluates the developed solvers in a simulated extreme-scale system with realistic synthetic faults.
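The "natural fault tolerance" of Monte Carlo solvers comes from statistical independence: each random walk is a standalone estimate, so walks lost to a node failure merely raise the variance of the result instead of corrupting it. Below is a minimal sketch of that idea (not the project's MCSA code) using a plain Neumann-series Monte Carlo solver for Ax = b; the function name `mc_solve` and all parameters are illustrative, and it assumes the spectral radius of H = I - A is below one:

```python
import numpy as np

def mc_solve(A, b, n_walks=5000, max_steps=20, rng=None):
    # Solve A x = b via the Neumann series x = sum_k H^k b with H = I - A.
    # Each random walk yields an independent estimate of one component of x,
    # so walks lost to hard or soft failures only add variance -- the
    # surviving walks still produce an unbiased answer.
    rng = np.random.default_rng(rng)
    n = len(b)
    H = np.eye(n) - A
    P = np.abs(H)                  # transition weights from |H|
    row = P.sum(axis=1)            # row sums normalize to probabilities
    x = np.zeros(n)
    for i in range(n):
        est = 0.0
        for _ in range(n_walks):
            state, w, total = i, 1.0, b[i]
            for _ in range(max_steps):
                if row[state] == 0.0:          # absorbing state: walk ends
                    break
                probs = P[state] / row[state]
                nxt = rng.choice(n, p=probs)
                w *= H[state, nxt] / probs[nxt]  # importance-weight update
                state = nxt
                total += w * b[state]          # accumulate H^k b contribution
            est += total
        x[i] = est / n_walks
    return x
```

Truncating each walk at `max_steps` introduces a small bias that shrinks geometrically with the norm of H; production solvers instead terminate walks probabilistically.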
- 2012-14: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing: This work focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate interfaces/responsibilities of individual hardware/software components. The primary goal of this project is to provide the tools and data needed by HPC vendors to decide on future architectures and to enable direct feedback to HPC vendors on emerging resilience threats.
- 2011-12: Extreme-scale Algorithms and Software Institute: The Extreme-scale Algorithms and Software Institute (EASI) focuses on closing the “application-architecture performance gap” through architecture-aware algorithms and libraries, and the supporting runtime capabilities to achieve scalable performance and resilience on heterogeneous architectures. The developed HPC hardware/software co-design toolkit evaluates the performance of algorithms on future HPC architectures at extreme scale with up to 134,217,728 (2^27) processor cores.
- 2009-11: Soft-Error Resilience for Future-Generation High-Performance Computing Systems: This project aims to develop a soft-error resilience strategy for future-generation high-performance computing (HPC) systems. It pursues two different solutions to alleviate the issue of soft errors in large-scale HPC systems: (1) checkpoint storage virtualization to significantly improve checkpoint/restart times, and (2) software dual-modular redundancy (DMR) to eliminate rollback/recovery in HPC.
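The detection idea behind software DMR can be illustrated in a few lines (this is an illustrative sketch, not the project's implementation): execute the computation twice, compare the results, and on disagreement re-execute locally instead of rolling the whole application back to a checkpoint. The function name `dmr` and the retry policy are assumptions for the example:

```python
def dmr(compute, *args, max_retries=3):
    # Software dual-modular redundancy: run `compute` twice and compare.
    # A disagreement indicates a soft error (silent data corruption) in one
    # replica; recovery is local re-execution rather than global rollback.
    for _ in range(max_retries):
        r1 = compute(*args)
        r2 = compute(*args)
        if r1 == r2:              # replicas agree: accept the result
            return r1
    raise RuntimeError("persistent replica disagreement")
```

DMR alone can only detect a soft error; correcting one without re-execution requires a third replica (triple-modular redundancy) and majority voting.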
- 2008-11: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond: This effort develops scalable technologies for providing high-level RAS for next-generation petascale scientific high-end computing (HEC) resources and beyond, including scalable HPC system monitoring, reliability analysis of components and full systems, fault prediction, proactive fault tolerance using prediction-triggered migration (process and virtual machine), incremental checkpoint/restart, and holistic fault tolerance (checkpoint/restart + migration).
- 2008-11: Scalable Algorithms for Petascale Systems with Multicore Architectures: This Institute for Advanced Architecture and Algorithms (IAA) project focuses on the development of architecture-aware algorithms and the supporting runtime features needed by these algorithms to solve general sparse linear systems common in many scientific applications. It further focuses on evaluating the algorithmic impact of future architecture choices and determining what architecture changes would have the highest impact.
- 2006-09: Harness Workbench: Unified and Adaptive Access to Diverse HPC Platforms: The Harness Workbench enhances the overall productivity of application science on diverse high-performance computing (HPC) platforms with two innovative software environments: a virtualized command toolkit for application building and execution that provides a common view across diverse HPC systems, and a next-generation runtime environment that provides a flexible, adaptive framework using plug-ins.
- 2006-08: Virtualized System Environments for Petascale Computing and Beyond: This research project addresses scalability, manageability, and ease-of-use challenges in petascale system software and application runtime environments through the development of a virtual system environment (VSE), which enables plug-and-play supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor virtualization technologies.
- 2004-07: MOLAR: Modular Linux and Adaptive Runtime Support for High-End Computing: This project targets adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing on the next generation of supercomputers by advancing computer reliability, availability and serviceability (RAS) management systems and by providing advanced monitoring and adaptation mechanisms for improved application performance and predictability.
- 2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing: This project produces proof-of-concept solutions that enable the removal of numerous single points of failure in large systems while improving scalability and access to systems and data. The research effort focuses on efficient redundancy strategies for head and service nodes as well as on a distributed storage infrastructure.
- 2002-04: Super-Scalable Algorithms for Next-Generation High-Performance Cellular Architectures: This research in cellular architectures is part of a Cooperative Research and Development Agreement (CRADA) between IBM and Oak Ridge National Laboratory to develop algorithms for the next-generation of supercomputers. The team at ORNL develops a simulator to emulate up to 500,000 virtual processors on a cluster with 5 real processors, solving simple equations and performing advanced collective communication primitives.
- 2000-05: Harness: Heterogeneous Distributed Computing: The heterogeneous adaptable reconfigurable networked systems (Harness) research project focuses on the design and development of a pluggable lightweight heterogeneous Distributed Virtual Machine (DVM) environment, where clusters of PCs, workstations, and “big iron” supercomputers can be aggregated to form one giant DVM (in the spirit of its widely-used predecessor, Parallel Virtual Machine (PVM)).