@article{agullo22resiliency, author = "Emmanuel Agullo and Mirco Altenbernd and Hartwig Anzt and Leonardo Bautista-Gomez and Tommaso Benacchio and Luca Bonaventura and Hans-Joachim Bungartz and Sanjay Chatterjee and Florina M. Ciorba and Nathan DeBardeleben and Daniel Drzisga and Sebastian Eibl and Christian Engelmann and Wilfried N. Gansterer and Luc Giraud and Dominik G{\"o}ddeke and Marco Heisig and Fabienne J{\'e}z{\'e}quel and Nils Kohl and Xiaoye Sherry Li and Romain Lion and Miriam Mehl and Paul Mycek and Michael Obersteiner and Enrique S. Quintana-Ort{\'i} and Francesco Rizzi and Ulrich R{\"u}de and Martin Schulz and Fred Fung and Robert Speck and Linda Stals and Keita Teranishi and Samuel Thibault and Dominik Th{\"o}nnes and Andreas Wagner and Barbara Wohlmuth", title = "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations", journal = "\href{http://hpc.sagepub.com}{International Journal of High Performance Computing Applications (IJHPCA)}", volume = "36", number = "2", pages = "251--285", month = mar, year = "2022", publisher = "\href{http://www.sagepub.com}{SAGE Publications}", issn = "1094-3420", doi = "10.1177/10943420211055188", url = "http://www.christian-engelmann.info/publications/agullo22resiliency.pdf", abstract = "This work is based on the seminar titled `Resiliency in Numerical Algorithm Design for Extreme Scale Simulations' held March 1-6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 hours on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing $10^{23}$ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. 
Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications, and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.", pts = "169116" }
@article{kumar21study, author = "Mohit Kumar and Saurabh Gupta and Tirthak Patel and Michael Wilder and Weisong Shi and Song Fu and Christian Engelmann and Devesh Tiwari", title = "Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale {HPC} System", journal = "\href{http://www.elsevier.com/locate/jpdc}{Journal of Parallel and Distributed Computing (JPDC)}", volume = "153", pages = "29--43", month = jul, year = "2021", publisher = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The Netherlands}", issn = "0743-7315", doi = "10.1016/j.jpdc.2021.03.001", url = "http://www.christian-engelmann.info/publications/kumar21study.pdf", abstract = "Today's High Performance Computing (HPC) systems contain thousands of nodes that work together to provide performance on the order of petaflops. The performance of these systems depends on various components like processors, memory, and interconnect. Among these, the interconnect plays a major role as it glues together all the hardware components in an HPC system. A slow interconnect can severely impact a scientific application running on multiple processes, as these processes rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores different interconnect errors, congestion events, and application characteristics on a large-scale HPC system. In our previous work, we processed and analyzed interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion to predict applications encountering congestion with more than 90\% accuracy.", pts = "153615" }
@article{katti18epidemic, author = "Amogh Katti and Giuseppe Di Fatta and Thomas Naughton and Christian Engelmann", title = "Epidemic Failure Detection and Consensus for Extreme Parallelism", journal = "\href{http://hpc.sagepub.com}{International Journal of High Performance Computing Applications (IJHPCA)}", volume = "32", number = "5", pages = "729--743", month = sep, year = "2018", publisher = "\href{http://www.sagepub.com}{SAGE Publications}", issn = "1094-3420", doi = "10.1177/1094342017690910", url = "http://www.christian-engelmann.info/publications/katti17epidemic.pdf", abstract = "Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI\_Comm\_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI\_Comm\_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using gossiping. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in all algorithms the number of gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.", pts = "72175" }
@article{hukerikar17resilience, author = "Saurabh Hukerikar and Christian Engelmann", title = "Resilience Design Patterns: {A} Structured Approach to Resilience at Extreme Scale", journal = "\href{http://superfri.org/superfri}{Journal of Supercomputing Frontiers and Innovations (JSFI)}", volume = "4", number = "3", pages = "4--42", month = oct, year = "2017", publisher = "\href{http://www.susu.ru/en}{South Ural State University, Chelyabinsk, Russia}", issn = "2409-6008", doi = "10.14529/jsfi170301", url = "http://www.christian-engelmann.info/publications/hukerikar17resilience.pdf", abstract = "Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore, the resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in HPC systems, future systems are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. As a result, the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance \& power efficiency across the system stack. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this paper, we develop a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems. The complete catalog of resilience design patterns provides designers with reusable design elements. We also define a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also supports optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. 
The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types.", pts = "102201" }
@article{engelmann16new, author = "Christian Engelmann and Thomas Naughton", title = "A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator", journal = "\href{http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1532-0634} {Concurrency and Computation: Practice and Experience}", volume = "28", number = "12", pages = "3369--3389", month = aug, year = "2016", publisher = "\href{http://www.wiley.com}{John Wiley \& Sons, Inc.}", issn = "1532-0634", doi = "10.1002/cpe.3805", url = "http://www.christian-engelmann.info/publications/engelmann16new.pdf", abstract = "Investigating the performance of parallel applications at scale on future high-performance computing~(HPC) architectures and the performance impact of different HPC architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The xSim toolkit strives to limit simulation overheads in order to maintain performance and productivity criteria. This paper documents two improvements to xSim: (1)~a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead, and (2)~a new simulated MPI message matching algorithm to reduce the oversubscription management cost. These enhancements resulted in significant performance improvements. The simulation overhead for running the NAS Parallel Benchmark suite dropped from 1,020\% to 238\% for the conjugate gradient (CG) benchmark and from 102\% to 0\% for the embarrassingly parallel~(EP) benchmark. Additionally, the improvements were beneficial for reducing overheads in the highly accurate simulation mode of xSim, which is useful for resilience investigation studies for tracking intentional MPI process failures. In the highly accurate mode, the simulation overhead was reduced from 37,511\% to 13,808\% for CG and from 3,332\% to 204\% for EP.", pts = "58541" }
@article{snir14addressing, author = "Marc Snir and Robert W. Wisniewski and Jacob A. Abraham and Sarita V. Adve and Saurabh Bagchi and Pavan Balaji and Jim Belak and Pradip Bose and Franck Cappello and Bill Carlson and Andrew A. Chien and Paul Coteus and Nathan A. Debardeleben and Pedro Diniz and Christian Engelmann and Mattan Erez and Saverio Fazzari and Al Geist and Rinku Gupta and Fred Johnson and Sriram Krishnamoorthy and Sven Leyffer and Dean Liberty and Subhasish Mitra and Todd Munson and Rob Schreiber and Jon Stearley and Eric Van Hensbergen", title = "Addressing Failures in Exascale Computing", journal = "\href{http://hpc.sagepub.com}{International Journal of High Performance Computing Applications (IJHPCA)}", volume = "28", number = "2", pages = "127--171", month = may, year = "2014", publisher = "\href{http://www.sagepub.com}{SAGE Publications}", issn = "1094-3420", doi = "10.1177/1094342014522573", url = "http://www.christian-engelmann.info/publications/snir14addressing.pdf", abstract = "We present here a report produced by a workshop on ``Addressing failures in exascale computing'' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.", pts = "49208" }
@article{engelmann13scaling, author = "Christian Engelmann", title = "Scaling To A Million Cores And Beyond: {Using} Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale", journal = "\href{http://www.elsevier.com/locate/fgcs}{Future Generation Computer Systems (FGCS)}", volume = "30", number = "0", pages = "59--65", month = jan, year = "2014", publisher = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The Netherlands}", issn = "0167-739X", doi = "10.1016/j.future.2013.04.014", url = "http://www.christian-engelmann.info/publications/engelmann13scaling.pdf", abstract = "As supercomputers scale to 1,000 PFlop/s over the next decade, investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices for high-performance computing (HPC) hardware/software co-design is crucial. This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance investigation toolkit, such as its scalability to $2^{27}$ simulated Message Passing Interface (MPI) ranks on 960 real processor cores, the capability to evaluate the performance of different MPI collective communication algorithms, and the ability to evaluate the performance of a basic Monte Carlo application with different architectural parameters.", pts = "42452" }
@article{wang12proactive, author = "Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Proactive Process-Level Live Migration and Back Migration in {HPC} Environments", journal = "\href{http://www.elsevier.com/locate/jpdc}{Journal of Parallel and Distributed Computing (JPDC)}", volume = "72", number = "2", pages = "254--267", month = feb, year = "2012", publisher = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The Netherlands}", issn = "0743-7315", doi = "10.1016/j.jpdc.2011.10.009", url = "http://www.christian-engelmann.info/publications/wang12proactive.pdf", abstract = "As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70\% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate the larger the amount of outstanding execution, the higher the benefit due to back migration.", pts = "35627" }
@article{scott10system, author = "Stephen L. Scott and Geoffroy R. Vall{\'e}e and Thomas Naughton and Anand Tikotekar and Christian Engelmann and Hong H. Ong", title = "System-Level Virtualization Research at {Oak Ridge National Laboratory}", journal = "\href{http://www.elsevier.com/locate/fgcs}{Future Generation Computer Systems (FGCS)}", volume = "26", number = "3", pages = "304--307", month = mar, year = "2010", publisher = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The Netherlands}", issn = "0167-739X", doi = "10.1016/j.future.2009.07.001", url = "http://www.christian-engelmann.info/publications/scott09system.pdf", abstract = "System-level virtualization, originally a technique to effectively share what were then considered large computing resources, subsequently faded from the spotlight as individual workstations gained in popularity with a one machine -- one user approach; today it is enjoying a rebirth. One reason for this resurgence is that the simple workstation has grown in capability to rival that of anything available in the past. Thus, computing centers are again looking at the price/performance benefit of sharing that single computing box via server consolidation. However, industry is only concentrating on the benefits of using virtualization for server consolidation (enterprise computing) whereas our interest is in leveraging virtualization to advance high-performance computing (HPC). While these two interests may appear to be orthogonal, one consolidating multiple applications and users on a single machine while the other requires all the power from many machines to be dedicated solely to its purpose, we propose that virtualization does provide attractive capabilities that may be exploited to the benefit of HPC interests. This raises two fundamental questions: is the concept of virtualization (a machine sharing technology) really suitable for HPC, and if so, how does one go about leveraging these virtualization capabilities for the benefit of HPC? To address these questions, this document presents ongoing studies on the usage of system-level virtualization in an HPC context. These studies include an analysis of the benefits of system-level virtualization for HPC, a presentation of research efforts based on virtualization for system availability, and a presentation of research efforts for the management of virtual systems. The basis for this document was material presented by Stephen L. Scott at the Collaborative and Grid Computing Technologies meeting held in Cancun, Mexico on April 12-14, 2007.", pts = "35628" }
@article{he09symmetric, author = "Xubin (Ben) He and Li Ou and Christian Engelmann and Xin Chen and Stephen L. Scott", title = "Symmetric Active/Active Metadata Service for High Availability Parallel File Systems", journal = "\href{http://www.elsevier.com/locate/jpdc}{Journal of Parallel and Distributed Computing (JPDC)}", volume = "69", number = "12", pages = "961-973", month = dec, year = "2009", publisher = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The Netherlands}", issn = "0743-7315", doi = "10.1016/j.jpdc.2009.08.004", url = "http://www.christian-engelmann.info/publications/he09symmetric.pdf", abstract = "High availability data storage systems are critical for many applications as research and business become more data-driven. Since metadata management is essential to system availability, multiple metadata services are used to improve the availability of distributed storage systems. Past research focused on the active/standby model, where each active service has at least one redundant idle backup. However, interruption of service and even some loss of service state may occur during a fail-over depending on the used replication technique. In addition, the replication overhead for multiple metadata services can be very high. The research in this paper targets the symmetric active/active replication model, which uses multiple redundant service nodes running in virtual synchrony. In this model, service node failures do not cause a fail-over to a backup and there is no disruption of service or loss of service state. We further discuss a fast delivery protocol to reduce the latency of the needed total order broadcast. Our prototype implementation shows that metadata service high availability can be achieved with an acceptable performance trade-off using our symmetric active/active metadata service solution.", pts = "21240" }
@article{he07unified, author = "Xubin (Ben) He and Li Ou and Martha J. Kosa and Stephen L. Scott and Christian Engelmann", title = "A Unified Multiple-Level Cache for High Performance Cluster Storage Systems", journal = "\href{http://www.inderscience.com/browse/index.php?journalcode=ijhpcn} {International Journal of High Performance Computing and Networking (IJHPCN)}", volume = "5", number = "1-2", pages = "97--109", month = nov # "~14, ", year = "2007", publisher = "\href{http://www.inderscience.com}{Inderscience Publishers, Geneve, Switzerland}", issn = "1740-0562", doi = "10.1504/IJHPCN.2007.015768", url = "http://www.christian-engelmann.info/publications/he07unified.pdf", abstract = "Highly available data storage for high-performance computing is becoming increasingly critical as high-end computing systems scale up in size and storage systems are developed around network-centered architectures. A promising solution is to harness the collective storage potential of individual workstations, much as we harness idle CPU cycles, due to the excellent price/performance ratio and low storage usage of most commodity workstations. For such a storage system, metadata consistency is a key issue in assuring storage system availability as well as data reliability. In this paper, we present a decentralized metadata management scheme that improves storage availability without sacrificing performance.", pts = "1907" }
@article{engelmann06symmetric, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "Symmetric Active/Active High Availability for High-Performance Computing System Services", journal = "\href{http://www.jcomputers.us}{Journal of Computers (JCP)}", volume = "1", number = "8", pages = "43--54", month = dec, year = "2006", publisher = "\href{http://www.jcomputers.us}{Academy Publisher, Oulu, Finland}", issn = "1796-203X", doi = "10.4304/jcp.1.8.43-54", url = "http://www.christian-engelmann.info/publications/engelmann06symmetric.pdf", abstract = "This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system as they render it inaccessible and unmanageable in case of a failure until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.", pts = "4583" }
@article{engelmann06molar, author = "Christian Engelmann and Stephen L. Scott and David E. Bernholdt and Narasimha R. Gottumukkala and Chokchai (Box) Leangsuksun and Jyothish Varma and Chao Wang and Frank Mueller and Aniruddha G. Shet and Ponnuswamy (Saday) Sadayappan", title = "{MOLAR}: {A}daptive Runtime Support for High-End Computing Operating and Runtime Systems", journal = "\href{http://www.sigops.org/osr.html}{ACM SIGOPS Operating Systems Review (OSR)}", volume = "40", number = "2", pages = "63--72", month = apr, year = "2006", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", issn = "0163-5980", doi = "10.1145/1131322.1131337", url = "http://www.christian-engelmann.info/publications/engelmann06molar.pdf", abstract = "MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale, high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in FAST-OS (forum to address scalable technology for runtime and operating systems) and HECRTF (high-end computing revitalization task force) activities by exploring the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions, and by advancing computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.", pts = "1905" }
@conference{oles24understanding, author = "Vladyslav Oles and Anna Schmedding and George Ostrouchov and Woong Shin and Evgenia Smirni and Christian Engelmann", title = "Understanding {GPU} Memory Corruption at Extreme Scale: The Summit Case Study", booktitle = "Proceedings of the \href{https://ics2024.github.io/} {$38^{th}$ ACM International Conference on Supercomputing (ICS) 2024}", pages = "188--200", month = jun # "~4-7, ", year = "2024", address = "Kyoto, Japan", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "979-8-4007-0610-3", doi = "10.1145/3650200.3656615", url = "http://www.christian-engelmann.info/publications/oles24understanding.pdf", url2 = "http://www.christian-engelmann.info/publications/oles24understanding.ppt.pdf", abstract = "GPU memory corruption and in particular double-bit errors (DBEs) remain one of the least understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to job termination and can potentially cost thousands of node-hours, either from wasted computations or as the overhead from regular checkpointing needed to minimize the losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge. We examine DBEs using system telemetry data and logs collected from the Summit supercomputer, equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We find that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. On the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no significant correlation with higher temperatures. We also show that workload type can be a factor in GPU memory's propensity to corruption.", pts = "212442" }
@conference{engelmann23science, author = "Christian Engelmann and Suhas Somnath", title = "Science Use Case Design Patterns for Autonomous Experiments", booktitle = "Proceedings of the \href{http://europlop.net} {$28^{th}$ European Conference on Pattern Languages of Programs (EuroPLoP) 2023}", pages = "1--14", month = jul # "~5-9, ", year = "2023", address = "Kloster Irsee, Germany", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "979-8-4007-0040-8", doi = "10.1145/3628034.3628060", url = "http://www.christian-engelmann.info/publications/engelmann23science.pdf", abstract = "Connecting scientific instruments and robot-controlled laboratories with computing and data resources at the edge, the Cloud or the high-performance computing (HPC) center enables autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. The Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) Open Architecture enables science breakthroughs using intelligent networked systems, instruments and facilities with a federated hardware/software architecture for the laboratory of the future. It relies on a novel approach, consisting of (1) science use case design patterns, (2) a system of systems architecture, and (3) a microservice architecture. This paper introduces the science use case design patterns of the INTERSECT Architecture. It describes the overall background, the involved terminology and concepts, and the pattern format and classification. It further offers an overview of the 12 defined patterns and 4 examples of patterns of 2 different pattern classes. It also provides insight into building solutions from these patterns. The target audience is computer, computational, instrument and domain science experts working in the field of autonomous experiments.", pts = "200749" }
@conference{engelmann22intersect, author = "Christian Engelmann and Olga Kuchar and Swen Boehm and Michael J. Brim and Thomas Naughton and Suhas Somnath and Scott Atchley and Jack Lange and Ben Mintz and Elke Arenholz", title = "The {INTERSECT} Open Federated Architecture for the Laboratory of the Future", booktitle = "Communications in Computer and Information Science (CCIS): Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. \href{https://smc.ornl.gov}{$18^{th}$ Smoky Mountains Computational Sciences \& Engineering Conference (SMC) 2022}", volume = "1690", pages = "173--190", month = aug # "~24-25, ", year = "2022", publisher = "\href{http://www.springer.com}{Springer, Cham}", isbn = "978-3-031-23605-1", doi = "10.1007/978-3-031-23606-8_11", url = "http://www.christian-engelmann.info/publications/engelmann22intersect.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann22intersect.ppt.pdf", abstract = "A federated instrument-to-edge-to-center architecture is needed to autonomously collect, transfer, store, process, curate, and archive scientific data and reduce human-in-the-loop needs with (a) common interfaces to leverage community and custom software, (b) pluggability to permit adaptable solutions, reuse, and digital twins, and (c) an open standard to enable adoption by science facilities world-wide. The INTERSECT Open Architecture enables science breakthroughs using intelligent networked systems, instruments and facilities with autonomous experiments, ``self-driving'' laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. It creates an open federated architecture for the laboratory of the future using a novel approach, consisting of (1) science use case design patterns, (2) a system of systems architecture, and (3) a microservice architecture.", pts = "182854" }
@conference{hukerikar20plexus, author = "Saurabh Hukerikar and Christian Engelmann", title = "{PLEXUS}: {A} Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems", booktitle = "Proceedings of the \href{http://prdc.dependability.org/PRDC2020} {$25^{th}$ IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020}", pages = "31--39", month = dec # "~1-4, ", year = "2020", address = "Perth, Australia", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "1555-094X", isbn = "978-1-7281-8004-5", doi = "10.1109/PRDC50213.2020.00014", url = "http://www.christian-engelmann.info/publications/hukerikar20plexus.pdf", abstract = "For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and system software design. Among these challenges, providing resiliency and stability to the scientific applications in the presence of high fault rates requires new approaches to software architecture and design. As HPC systems become increasingly complex, they require intricate solutions for detection and mitigation for various modes of faults and errors that occur in these large-scale systems, as well as solutions for failure recovery. These resiliency solutions often interact with and affect other system properties, including application scalability, power and energy efficiency. Therefore, resilience solutions for HPC systems must be thoughtfully engineered and deployed. In previous work, we developed the concept of resilience design patterns, which consist of templated solutions based on well-established techniques for detection, mitigation and recovery. In this paper, we use these patterns as the foundation to propose new approaches to designing runtime systems for HPC systems. The instantiation of these patterns within a runtime system enables flexible and adaptable end-to-end resiliency solutions for HPC environments. The paper describes the architecture of the runtime system, named Plexus, and the strategies for dynamically composing and adapting pattern instances under runtime control. This runtime-based approach enables actively balancing the cost-benefit trade-off between performance overhead and protection coverage of the resilience solutions. Based on a prototype implementation of PLEXUS, we demonstrate the resiliency and performance gains achieved by the pattern-based runtime system for a parallel linear solver application.", pts = "147029" }
@conference{ostrouchov20gpu, author = "George Ostrouchov and Don Maxwell and Rizwan Ashraf and Christian Engelmann and Mallikarjun Shankar and James Rogers", title = "{GPU} Lifetimes on {Titan} Supercomputer: {Survival} Analysis and Reliability", booktitle = "Proceedings of the \href{http://sc20.supercomputing.org}{$33^{rd}$ IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020}", pages = "41:1--14", month = nov # "~15-20, ", year = "2020", address = "Atlanta, GA, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "9781728199986", doi = "10.1109/SC41405.2020.00045", url = "http://www.christian-engelmann.info/publications/ostrouchov20gpu.pdf", url2 = "http://www.christian-engelmann.info/publications/ostrouchov20gpu.ppt.pdf", abstract = "The Cray XK7 Titan was the top supercomputer system in the world for a very long time and remained critically important throughout its nearly seven year life. It was also a very interesting machine from a reliability viewpoint as most of its power came from 18,688 GPUs whose operation was forced to execute three very significant rework cycles, two on the GPU mechanical assembly and one on the GPU circuitboards. We write about the last rework cycle and a reliability analysis of over 100,000 operation years in the GPU lifetimes, which correspond to Titan's 6 year long productive period after an initial break-in period. Using time between failures analysis and statistical survival analysis techniques, we find that GPU reliability is dependent on heat dissipation to an extent that strongly correlates with detailed nuances of the system cooling architecture and job scheduling. In addition to describing some of the system history, the data collection, data cleaning, and our analysis of the data, we provide reliability recommendations for designing future state of the art supercomputing systems and their operation. We make the data and our analysis codes publicly available.", pts = "144470" }
@conference{jeong203d, author = "Haewon Jeong and Yaoqing Yang and Christian Engelmann and Vipul Gupta and Tze Meng Low and Pulkit Grover and Viveck Cadambe and Kannan Ramchandran", title = "{3D} Coded {SUMMA}: {C}ommunication-Efficient and Robust Parallel Matrix Multiplication", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{https://www.euro-par.org}{$26^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2020}", volume = "12247", pages = "392--407", month = aug # "~24-28, ", year = "2020", address = "Warsaw, Poland", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-030-57674-5", doi = "10.1007/978-3-030-57675-2_25", url = "http://www.christian-engelmann.info/publications/jeong203d.pdf", url2 = "http://www.christian-engelmann.info/publications/jeong203d.ppt.pdf", abstract = "In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that is communication efficient and achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for distributed matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires 50\% less redundancy than replication, while the overhead in execution time is only about 5-10\%.", pts = "140756" }
@conference{kumar18understanding, author = "Mohit Kumar and Saurabh Gupta and Tirthak Patel and Michael Wilder and Weisong Shi and Song Fu and Christian Engelmann and Devesh Tiwari", title = "Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale {HPC} System", booktitle = "Proceedings of the \href{http://www.dsn.org} {$48^{th}$ IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018}", pages = "107--114", month = jun # "~25-28, ", year = "2018", address = "Luxembourg City, Luxembourg", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "2158-3927", isbn = "978-1-5386-5596-2", doi = "10.1109/DSN.2018.00023", url = "http://www.christian-engelmann.info/publications/kumar18understanding.pdf", abstract = "Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops due to the fast computing devices, network interconnect, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on the overall interconnect and application performance. This is especially true for scientific applications running multiple processes on different compute nodes as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.", pts = "110648" }
@conference{nie18machine, author = "Bin Nie and Ji Xue and Saurabh Gupta and Tirthak Patel and Christian Engelmann and Evgenia Smirni and Devesh Tiwari", title = "Machine Learning Models for {GPU} Error Prediction in a Large Scale {HPC} System", booktitle = "Proceedings of the \href{http://www.dsn.org} {$48^{th}$ IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018}", pages = "95--106", month = jun # "~25-28, ", year = "2018", address = "Luxembourg City, Luxembourg", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "2158-3927", isbn = "978-1-5386-5596-2", doi = "10.1109/DSN.2018.00022", url = "http://www.christian-engelmann.info/publications/nie18machine.pdf", abstract = "Recently, GPUs have been widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative. Therefore, in this paper, we firstly study the conditions that trigger GPU errors with six-month trace data collected from a large-scale operational HPC system. Then, we resort to machine learning techniques to predict the occurrence of GPU errors, by taking advantage of the temporal and spatial dependency of the collected data. As discussed in the evaluation section, the prediction framework is robust and accurate under different workloads.", pts = "110650" }
@conference{ashraf18pattern-based, author = "Rizwan Ashraf and Saurabh Hukerikar and Christian Engelmann", title = "Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing", booktitle = "Proceedings of the \href{http://icpe2018.spec.org}{$9^{th}$ ACM/SPEC International Conference on Performance Engineering (ICPE) 2018}", pages = "80--87", month = apr # "~9-13, ", year = "2018", address = "Berlin, Germany", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-5095-2", doi = "10.1145/3184407.3184421", url = "http://www.christian-engelmann.info/publications/ashraf18pattern-based.pdf", url2 = "http://www.christian-engelmann.info/publications/ashraf18pattern-based.ppt.pdf", abstract = "Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle different types of errors and recover from failures. In this paper, we propose a pattern-based approach to constructing multiresilience solutions. Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniques for transient errors that cause silent data corruptions and techniques for fail-stop errors that result in process failures. We demonstrate the design and implementation of the resilience techniques across multiple layers of the system stack such that they are integrated to work together to achieve resiliency to different error types in a highly performance-efficient manner.", pts = "109667" }
@conference{ashraf18shrink, author = "Rizwan Ashraf and Saurabh Hukerikar and Christian Engelmann", title = "Shrink or Substitute: {H}andling Process Failures in {HPC} Systems using In-situ Recovery", booktitle = "Proceedings of the \href{http://www.pdp2018.org}{$26^{th}$ Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP) 2018}", pages = "178--185", month = mar # "~21-23, ", year = "2018", address = "Cambridge, UK", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "2377-5750", isbn = "978-1-5386-4975-6", doi = "10.1109/PDP2018.2018.00032", url = "http://www.christian-engelmann.info/publications/ashraf18shrink.pdf", url2 = "http://www.christian-engelmann.info/publications/ashraf18shrink.ppt.pdf", abstract = "Efficient utilization of today's high-performance computing (HPC) systems with many complex software and hardware components requires that HPC applications are designed to tolerate process failures at runtime. With the low mean-time-to-failure (MTTF) of current and future HPC systems, long-running simulations on these systems require capabilities for gracefully handling process failures by the applications themselves. In this paper, we explore the use of fault tolerance extensions to the Message Passing Interface (MPI) called user-level failure mitigation (ULFM) for handling process failures without the need to discard the progress made by the application. We explore two alternative recovery strategies, which use ULFM along with application-driven in-memory checkpointing. In the first case, the application is recovered with only the surviving processes, and in the second case, spares are used to replace the failed processes, such that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation is a viable alternative for recovery in environments where spares may not be available.", pts = "107422" }
@conference{gupta17failures, author = "Saurabh Gupta and Tirthak Patel and Christian Engelmann and Devesh Tiwari", title = "Failures in Large Scale Systems: {L}ong-term Measurement, Analysis, and Implications", booktitle = "Proceedings of the \href{http://sc17.supercomputing.org}{$30^{th}$ IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017}", pages = "44:1--44:12", month = nov # "~12-17, ", year = "2017", address = "Denver, CO, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-5114-0", doi = "10.1145/3126908.3126937", url = "http://www.christian-engelmann.info/publications/gupta17failures.pdf", url2 = "http://www.christian-engelmann.info/publications/gupta17failures.ppt.pdf", abstract = "Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Unfortunately, field-data based reliability studies are few and far between and not exhaustive. Most HPC researchers and system practitioners still rely on outdated studies to understand HPC reliability characteristics and plan for future HPC systems. While the complexity of managing system reliability has increased, the public knowledge sharing about lessons learned from HPC centers has not increased in the same proportion. To bridge this gap, in this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems, and discuss new takeaways and confirm previous findings which continue to be valid.", pts = "100355" }
@conference{nie17characterizing, author = "Bin Nie and Ji Xue and Saurabh Gupta and Christian Engelmann and Evgenia Smirni and Devesh Tiwari", title = "Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: {I}nsights, Challenges, and Opportunities", booktitle = "Proceedings of the \href{http://mascots2017.cs.ucalgary.ca} {$25^{th}$ IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017}", pages = "22--31", month = sep # "~20-22, ", year = "2017", address = "Banff, AB, Canada", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "2375-0227", isbn = "978-1-5386-2764-8", doi = "10.1109/MASCOTS.2017.12", url = "http://www.christian-engelmann.info/publications/nie17characterizing.pdf", url2 = "", abstract = "GPUs have become part of mainstream high-performance computing facilities that increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, power/cooling cost, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, provides insights into the implications of such understanding, and shows how to exploit these insights toward predicting GPU errors using neural networks.", pts = "100351" }
@conference{hukerikar17pattern, author = "Saurabh Hukerikar and Christian Engelmann", title = "A Pattern Language for High-Performance Computing Resilience", booktitle = "Proceedings of the \href{http://europlop.net} {$22^{nd}$ European Conference on Pattern Languages of Programs (EuroPLoP) 2017}", pages = "12:1--12:16", month = jul # "~12-16, ", year = "2017", address = "Kloster Irsee, Germany", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-4848-5", doi = "10.1145/3147704.3147718", url = "http://www.christian-engelmann.info/publications/hukerikar17pattern.pdf", abstract = "High-performance computing (HPC) systems provide powerful capabilities for modeling and simulation, and data analytics in a broad class of computational problems in a variety of scientific and engineering domains. HPC designs are undergoing rapid changes in the hardware architectures and the software environment as the community pursues increasingly capable HPC systems. Among the key challenges for future generations of HPC systems is ensuring efficient and correct operation despite the occurrence of faults or defects in system components that can cause errors and failures in an HPC system. Such events affect the correctness of the scientific applications, or may lead to their untimely termination. Future generations of HPC systems will consist of millions of compute, memory and storage components and the growing complexity of these computing behemoths increases the chances that a single fault event will cascade across the machine and bring down the entire system. Design patterns capture the essential techniques that are employed to solve recurring problems in the design of resilient computing systems. However, the complexity of modern HPC systems as well as the various challenges of future generations of systems requires consideration of numerous aspects and optimization principles, such as the impact of a resilience solution on the performance and power consumption. We present a pattern language for engineering resilience solutions. The language is targeted at hardware and software designers as well as the users and operators of HPC systems. The patterns are intended to develop complete resilience solutions that have different efficiency and complexity characteristics, which may be deployed at design time or runtime to ensure that HPC systems are able to deal with various types of faults, errors and failures.", pts = "102869" }
@conference{lagadapati16benchmark, author = "Mahesh Lagadapati and Frank Mueller and Christian Engelmann", title = "Benchmark Generation and Simulation at Extreme Scale", booktitle = "Proceedings of the \href{http://ds-rt.com/2016}{$20^{th}$ IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016}", pages = "9--18", month = sep # "~21-23, ", year = "2016", address = "London, UK", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "1550-6525", isbn = "978-1-5090-3506-9", doi = "10.1109/DS-RT.2016.18", url = "http://www.christian-engelmann.info/publications/lagadapati16benchmark.pdf", url2 = "http://www.christian-engelmann.info/publications/lagadapati16benchmark.ppt.pdf", abstract = "The path to extreme scale high-performance computing (HPC) poses several challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Investigating the performance of parallel applications at scale on future architectures and the performance impact of different architectural choices is an important component of HPC hardware/software co-design. Simulations using models of future HPC systems and communication traces from applications running on existing HPC systems can offer an insight into the performance of future architectures. This work targets technology developed for scalable application tracing of communication events. It focuses on extreme-scale simulation of HPC applications and their communication behavior via lightweight parallel discrete event simulation for performance estimation and evaluation. Instead of simply replaying a trace within a simulator, this work promotes the generation of a benchmark from traces. This benchmark is subsequently exposed to simulation using models to reflect the performance characteristics of future-generation HPC systems. This technique provides a number of benefits, such as eliminating the data intensive trace replay and enabling simulations at different scales. The presented work features novel software co-design aspects, combining the ScalaTrace tool to generate scalable trace files, the ScalaBenchGen tool to generate the benchmark, and the xSim tool to assess the benchmark characteristics within a simulator.", pts = "68383" }
@conference{hukerikar16havens, author = "Saurabh Hukerikar and Christian Engelmann", title = "{Havens}: {Explicit} Reliable Memory Regions for {HPC} Applications", booktitle = "Proceedings of the \href{http://ieee-hpec.org} {$20^{th}$ IEEE High Performance Extreme Computing Conference (HPEC) 2016}", pages = "1--6", month = sep # "~13-15, ", year = "2016", address = "Waltham, MA, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", doi = "10.1109/HPEC.2016.7761593", url = "http://www.christian-engelmann.info/publications/hukerikar16havens.pdf", url2 = "http://www.christian-engelmann.info/publications/hukerikar16havens.ppt.pdf", abstract = "Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, the scientific simulations are expected to experience more interruptions caused by soft errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate in the presence of high memory fault rates. In this paper we propose a partial memory protection scheme using region-based memory management. We define regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical application code and variables to be placed in these havens. The fault coverage of our approach is application agnostic unlike algorithm-based fault tolerance techniques.", pts = "69230" }
@conference{tang16power-capping, author = "Kun Tang and Devesh Tiwari and Saurabh Gupta and Ping Huang and QiQi Lu and Christian Engelmann and Xubin He", title = "Power-Capping Aware Checkpointing: {On} the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy", booktitle = "Proceedings of the \href{http://www.dsn.org} {$46^{th}$ IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016}", pages = "311--322", month = jun # "~28 - " # jul # "~1, ", year = "2016", address = "Toulouse, France", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "2158-3927", doi = "10.1109/DSN.2016.36", url = "http://www.christian-engelmann.info/publications/tang16power-aware.pdf", url2 = "", abstract = "Checkpoint and restart mechanisms have been widely used in large scientific simulation applications to make forward progress in case of failures. However, none of the prior works has considered the interaction of power constraints with temperature, reliability, performance, and checkpointing interval. It is not clear how power-capping may affect the optimal checkpointing interval. What are the involved reliability, performance, and energy trade-offs? In this paper, we develop a deep understanding of the interaction between power-capping and scientific applications using checkpoint/restart as a resilience mechanism, and propose a new model for the optimal checkpointing interval (OCI) under power-capping. Our study reveals several interesting, and previously unknown, insights about how power-capping affects reliability, energy consumption, and performance.", pts = "62738" }
@conference{fiala16mini-ckpts, author = "David Fiala and Frank Mueller and Kurt Ferreira and Christian Engelmann", title = "{Mini-Ckpts}: Surviving {OS} Failures in Persistent Memory", booktitle = "Proceedings of the \href{http://ics16.bilkent.edu.tr} {$30^{th}$ ACM International Conference on Supercomputing (ICS) 2016}", pages = "7:1--7:14", month = jun # "~1-3, ", year = "2016", address = "Istanbul, Turkey", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-4361-9", doi = "10.1145/2925426.2926295", url = "http://www.christian-engelmann.info/publications/fiala16mini-ckpts.pdf", url2 = "http://www.christian-engelmann.info/publications/fiala16mini-ckpts.ppt.pdf", abstract = "Concern is growing in the high-performance computing (HPC) community over the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory are more likely. The OS is critical to the correct and efficient operation of the node and the processes it governs --- and, in HPC, also to any other nodes a parallelized application runs on and communicates with: due to tight communication in HPC, any single node failure generally forces all processes of the application to terminate. Therefore, the OS itself must be capable of tolerating failures. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. Mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of 3-5\% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime system can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current, coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional fault scenarios.", pts = "67816" }
@conference{bautista-gomez16reducing, author = "Leonardo Bautista-Gomez and Ana Gainaru and Swann Perarnau and Devesh Tiwari and Saurabh Gupta and Franck Cappello and Christian Engelmann and Marc Snir", title = "Reducing Waste in Extreme Scale Systems Through Introspective Analysis", booktitle = "Proceedings of the \href{http://www.ipdps.org} {$30^{th}$ IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016}", pages = "212--221", month = may # "~23-27, ", year = "2016", address = "Chicago, IL, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "1530-2075", doi = "10.1109/IPDPS.2016.100", url = "http://www.christian-engelmann.info/publications/bautista-gomez16reducing.pdf", url2 = "http://www.christian-engelmann.info/publications/bautista-gomez16reducing.ppt.pdf", abstract = "Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that periods of higher failure density occur with failure rates of up to three times the average. We design a monitoring system that listens to hardware events and forwards important events to the runtime to detect those regime changes. We implement a runtime capable of receiving notifications and adapting dynamically. In addition, we build an analytical model to predict the gains that such a dynamic approach could achieve. We demonstrate that, in some systems, our approach can reduce the wasted time.", pts = "62159" }
@conference{engelmann16supporting, author = "Christian Engelmann and Thomas Naughton", title = "Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-795.html} {$13^{th}$ IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016}", month = feb # "~15-16, ", year = "2016", address = "Innsbruck, Austria", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-979-0", doi = "10.2316/P.2016.834-005", url = "http://www.christian-engelmann.info/publications/engelmann16supporting.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann16supporting.ppt.pdf", abstract = "Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is $\sim$2,325\% for serial execution and $\sim$1,730\% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips in the input and output matrices of a matrix-matrix multiply application, revealing the vulnerability of data structures, masking, and error propagation. xSim is the very first simulation-based MPI performance tool that supports both the injection of process failures and bit flip faults.", pts = "60888" }
@conference{katti15scalable, author = "Amogh Katti and Giuseppe Di Fatta and Thomas Naughton and Christian Engelmann", title = "Scalable and Fault Tolerant Failure Detection and Consensus", booktitle = "Proceedings of the \href{https://eurompi2015.bordeaux.inria.fr}{$22^{nd}$ European MPI Users' Group Meeting (EuroMPI) 2015}", pages = "13:1--13:9", month = sep # "~21-24, ", year = "2015", address = "Bordeaux, France", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-3795-3", doi = "10.1145/2802658.2802660", url = "http://www.christian-engelmann.info/publications/katti15scalable.pdf", url2 = "http://www.christian-engelmann.info/publications/katti15scalable.ppt.pdf", abstract = "Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation (MPI\_Comm\_shrink) to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. The MPI\_Comm\_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms to support this operation. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory usage and network bandwidth costs and a perfect synchronization in achieving global consensus.", pts = "57940" }
@conference{engelmann15network, author = "Christian Engelmann and Thomas Naughton", title = "A Network Contention Model for the Extreme-scale Simulator", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-826.html} {$34^{th}$ IASTED International Conference on Modelling, Identification and Control (MIC) 2015}", month = feb # "~17-18, ", year = "2015", address = "Innsbruck, Austria", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-975-2", doi = "10.2316/P.2015.826-043", url = "http://www.christian-engelmann.info/publications/engelmann15network.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann15network.ppt.pdf", abstract = "The Extreme-scale Simulator (xSim) is a performance investigation toolkit for high-performance computing (HPC) hardware/software co-design. It permits running an HPC application with millions of concurrent execution threads, while observing its performance in a simulated extreme-scale system. This paper details a newly developed network modeling feature for xSim, eliminating the shortcomings of the existing network modeling capabilities. The approach takes a different path for implementing network contention and bandwidth capacity modeling, using a less synchronous, yet sufficiently accurate, model design. With the new network modeling feature, xSim is able to simulate on-chip and on-node networks with reasonable accuracy and overheads.", pts = "53873" }
@conference{engelmann14improving, author = "Christian Engelmann and Thomas Naughton", title = "Improving the Performance of the Extreme-scale Simulator", booktitle = "Proceedings of the \href{http://ds-rt.com/2014}{$18^{th}$ IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014}", pages = "198--207", month = oct # "~1-3, ", year = "2014", address = "Toulouse, France", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "1550-6525", isbn = "978-1-4799-6143-6", doi = "10.1109/DS-RT.2014.32", url = "http://www.christian-engelmann.info/publications/engelmann14improving.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann14improving.ppt.pdf", abstract = "Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures and the performance impact of different architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation-based toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The overhead introduced by a simulation tool is an important performance and productivity aspect. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation management overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management overhead. The results clearly show a significant performance improvement, such as by reducing the simulation overhead for running the NAS Parallel Benchmark suite inside the simulator from 1,020\% to 238\% for the conjugate gradient (CG) benchmark and from 102\% to 0\% for the embarrassingly parallel (EP) benchmark, as well as from 37,511\% to 13,808\% for CG and from 3,332\% to 204\% for EP with accurate process failure simulation.", pts = "50654" }
@conference{naughton14supporting, author = "Thomas Naughton and Christian Engelmann and Geoffroy Vall{\'e}e and Swen B{\"o}hm", title = "Supporting the Development of Resilient Message Passing Applications using Simulation", booktitle = "Proceedings of the \href{http://www.pdp2014.org}{$22^{nd}$ Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014}", pages = "271--278", month = feb # "~12-14, ", year = "2014", address = "Turin, Italy", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", issn = "1066-6192", doi = "10.1109/PDP.2014.74", url = "http://www.christian-engelmann.info/publications/naughton14supporting.pdf", url2 = "http://www.christian-engelmann.info/publications/naughton14supporting.ppt.pdf", abstract = "An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The work in this paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of message passing interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim permits running MPI applications with millions of concurrent MPI ranks, while observing application performance in a simulated extreme-scale system using a lightweight parallel discrete event simulation. The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating performance under failure and failure handling of ABFT solutions. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT.", pts = "49204" }
@conference{vallee13runtime, author = "Geoffroy Vall{\'e}e and Thomas Naughton and Swen B{\"o}hm and Christian Engelmann", title = "A Runtime Environment for Supporting Research in Resilient {HPC} System Software \& Tools", booktitle = "Proceedings of the \href{http://is-candar.org} {$1^{st}$ International Symposium on Computing and Networking - Across Practical Development and Theoretical Research - (CANDAR) 2013}", pages = "213--219", month = dec # "~4-6, ", year = "2013", address = "Matsuyama, Japan", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-4799-2795-1", doi = "10.1109/CANDAR.2013.38", url = "http://www.christian-engelmann.info/publications/vallee13runtime.pdf", url2 = "http://www.christian-engelmann.info/publications/vallee13runtime.ppt.pdf", abstract = "The high-performance computing~(HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The runtime environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in parallel on the machine. The deployment of applications and tools on large-scale HPC computing systems requires the RTE to manage process creation in a scalable manner, support sparse connectivity, and provide fault tolerance. We have developed a new RTE that provides a basis for building distributed execution environments and developing tools for HPC to aid research in system software and resilience. This paper describes the software architecture of the Scalable runTime Component Infrastructure~(STCI), which is intended to provide a complete infrastructure for scalable start-up and management of many processes in large-scale HPC systems. We highlight features of the current implementation, which is provided as a system library that allows developers to easily use and integrate STCI in their tools and/or applications. The motivation for this work has been to support ongoing research activities in fault-tolerance for large-scale systems. We discuss the advantages of the modular framework employed and describe two use cases that demonstrate its capabilities: (i) an alternate runtime for a Message Passing Interface (MPI) stack, and (ii) a distributed control and communication substrate for a fault-injection tool.", pts = "45674" }
@conference{engelmann13investigating, author = "Christian Engelmann", title = "Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-795.html} {$11^{th}$ IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013}", month = feb # "~11-13, ", year = "2013", address = "Innsbruck, Austria", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-943-1", doi = "10.2316/P.2013.795-010", url = "http://www.christian-engelmann.info/publications/engelmann12investigating.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann12investigating.ppt.pdf", abstract = "Hardware/software co-design for future-generation high-performance computing (HPC) systems aims at closing the gap between the peak capabilities of the hardware and the performance realized by applications (application-architecture performance gap). Performance profiling of architectures and applications is a crucial part of this iterative process. The work in this paper focuses on operating system (OS) noise as an additional factor to be considered for co-design. It represents the first step in including OS noise in HPC hardware/software co-design by adding a noise injection feature to an existing simulation-based co-design toolkit. It reuses an existing abstraction for OS noise with frequency (periodic recurrence) and period (duration of each occurrence) to enhance the processor model of the Extreme-scale Simulator (xSim) with synchronized and random OS noise simulation. The results demonstrate this capability by evaluating the impact of OS noise on MPI\_Bcast() and MPI\_Reduce() in a simulated future-generation HPC system with 2,097,152 compute nodes.", pts = "40576" }
@conference{fiala12detection2, author = "David Fiala and Frank Mueller and Christian Engelmann and Kurt Ferreira and Ron Brightwell and Rolf Riesen", title = "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing", booktitle = "Proceedings of the \href{http://sc12.supercomputing.org}{$25^{th}$ IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012}", pages = "78:1--78:12", month = nov # "~10-16, ", year = "2012", address = "Salt Lake City, UT, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4673-0804-5", doi = "10.1109/SC.2012.49", url = "http://www.christian-engelmann.info/publications/fiala12detection2.pdf", url2 = "http://www.christian-engelmann.info/publications/fiala12detection2.ppt.pdf", abstract = "Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI applications while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best-suited protocols for detecting and correcting MPI data that is the result of corruption. To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy. Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0\% and 30\% as compared to unprotected applications without redundancy. Using our fault injector within RedMPI, we observe that even a single soft error can have profound effects on running applications, causing a cascading pattern of corruption that in most cases spreads to all other processes. RedMPI's protection has been shown to successfully mitigate the effects of soft errors while allowing applications to complete with correct results even in the face of errors.", pts = "38306" }
@conference{elliott12combining, author = "James Elliott and Kishor Kharbas and David Fiala and Frank Mueller and Kurt Ferreira and Christian Engelmann", title = "Combining Partial Redundancy and Checkpointing for {HPC}", booktitle = "Proceedings of the \href{http://icdcs-2012.org/} {$32^{nd}$ International Conference on Distributed Computing Systems (ICDCS) 2012}", pages = "615--626", month = jun # "~18-21, ", year = "2012", address = "Macau, SAR, China", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-4685-8", issn = "1063-6927", doi = "10.1109/ICDCS.2012.56", url = "http://www.christian-engelmann.info/publications/elliott12combining.pdf", url2 = "http://www.christian-engelmann.info/publications/elliott12combining.ppt.pdf", abstract = "Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years. However, reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50\% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is an additional overhead on communication and synchronization. This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resource utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy - but not of partial redundancy - and show a close fit to the model. At 80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds 770,000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availabilities.", pts = "35629" }
@conference{wang12nvmalloc, author = "Chao Wang and Sudharshan S. Vazhkudai and Xiaosong Ma and Fei Meng and Youngjae Kim and Christian Engelmann", title = "{NVMalloc}: Exposing an Aggregate {SSD} Store as a Memory Partition in Extreme-Scale Machines", booktitle = "Proceedings of the \href{http://www.ipdps.org} {$26^{th}$ IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012}", pages = "957--968", month = may # "~21-25, ", year = "2012", address = "Shanghai, China", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-4675-9", doi = "10.1109/IPDPS.2012.90", url = "http://www.christian-engelmann.info/publications/wang12nvmalloc.pdf", url2 = "http://www.christian-engelmann.info/publications/wang12nvmalloc.ppt.pdf", abstract = "DRAM is a precious resource in extreme-scale machines and is increasingly becoming scarce, mainly due to the growing number of cores per node. On future multi-petaflop and exaflop machines, the memory pressure is likely to be so severe that we need to rethink our memory usage models. Fortunately, the advent of non-volatile memory (NVM) offers a unique opportunity in this space. Current NVM offerings possess several desirable properties, such as low cost and power efficiency, but also suffer from high latency and lifetime issues. We need rich techniques to be able to use them alongside DRAM. In this paper, we propose a novel approach to exploiting NVM as a secondary memory partition so that applications can explicitly allocate and manipulate memory regions therein. More specifically, we propose an NVMalloc library with a suite of services that enables applications to access a distributed NVM storage system. We have devised ways within NVMalloc so that the storage system, built from compute node-local NVM devices, can be accessed in a byte-addressable fashion using the memory mapped I/O interface. Our approach has the potential to re-energize out-of-core computations on large-scale machines by having applications allocate certain variables through NVMalloc, thereby increasing the overall memory available for the application. Our evaluation on a 128-core cluster shows that NVMalloc enables applications to compute problem sizes larger than the physical memory in a cost-effective manner. It can achieve better performance with increased computation time between NVM memory accesses or increased data access locality. In addition, our results suggest that while NVMalloc enables transparent access to NVM-resident variables, the explicit control it provides is crucial to optimize application performance.", pts = "35603" }
@conference{boehm12file, author = "Swen B{\"o}hm and Christian Engelmann", title = "File {I/O} for {MPI} Applications in Redundant Execution Scenarios", booktitle = "Proceedings of the \href{http://www.pdp2012.org}{$20^{th}$ Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012}", pages = "112--119", month = feb # "~15-17, ", year = "2012", address = "Garching, Germany", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-4633-9", issn = "1066-6192", doi = "10.1109/PDP.2012.22", url = "http://www.christian-engelmann.info/publications/boehm12file.pdf", url2 = "http://www.christian-engelmann.info/publications/boehm12file.ppt.pdf", abstract = "As multi-petascale and exascale high-performance computing (HPC) systems inevitably have to deal with a number of resilience challenges, such as a significant growth in component count and smaller circuit sizes with lower circuit voltages, redundancy may offer an acceptable level of resilience that traditional fault tolerance techniques, such as checkpoint/restart, do not. Although redundancy in HPC is quite controversial due to the associated cost for redundant components, the constantly increasing number of cores-per-processor is tilting this cost calculation toward a system design where computation, such as for redundancy, is much cheaper and communication, needed for checkpoint/restart, is much more expensive. Recent research and development activities in redundancy for Message Passing Interface (MPI) applications focused on availability/reliability models and replication algorithms. This paper takes a first step toward solving an open research problem associated with running a parallel application redundantly, which is file I/O under redundancy. The approach intercepts file I/O calls made by a redundant application to employ coordination protocols that execute file I/O operations in a redundancy-oblivious fashion when accessing a node-local file system, or in a redundancy-aware fashion when accessing a shared networked file system. A proof-of-concept prototype is presented and a number of coordination protocols are described and evaluated. The results show the performance impact for redundantly accessing a shared networked file system, but also demonstrate the capability to regain performance by utilizing MPI communication between replicas and parallel file I/O.", pts = "33577" }
@conference{boehm11xsim, author = "Swen B{\"o}hm and Christian Engelmann", title = "{xSim}: {The} Extreme-Scale Simulator", booktitle = "Proceedings of the \href{http://hpcs11.cisedu.info}{International Conference on High Performance Computing and Simulation (HPCS) 2011}", pages = "280--286", month = jul # "~4-8, ", year = "2011", address = "Istanbul, Turkey", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-61284-383-4", doi = "10.1109/HPCSim.2011.5999835", url = "http://www.christian-engelmann.info/publications/boehm11xsim.pdf", url2 = "http://www.christian-engelmann.info/publications/boehm11xsim.ppt.pdf", abstract = "Investigating parallel application performance properties at scale is becoming an important part of high-performance computing (HPC) application development and deployment. The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running an application in a controlled environment at extreme scale without the need for a respective extreme-scale HPC system. Using a lightweight parallel discrete event simulation, xSim executes a parallel application with a virtual wall clock time, such that performance data can be extracted based on a processor model and a network model. This paper presents significant enhancements to the xSim toolkit prototype that provide a more complete Message Passing Interface (MPI) support and improve its versatility. These enhancements include full virtual MPI group, communicator and collective communication support, and global variables support. The new capabilities are demonstrated by executing the entire NAS Parallel Benchmark suite in a simulated HPC environment.", pts = "29960" }
@conference{engelmann11redundant, author = "Christian Engelmann and Swen B{\"o}hm", title = "Redundant Execution of {HPC} Applications with {MR-MPI}", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-719.html} {$10^{th}$ IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011}", pages = "31--38", month = feb # "~15-17, ", year = "2011", address = "Innsbruck, Austria", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-864-9", doi = "10.2316/P.2011.719-031", url = "http://www.christian-engelmann.info/publications/engelmann11redundant.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann11redundant.ppt.pdf", abstract = "This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI, for transparently executing high-performance computing (HPC) applications in a redundant fashion. The presented work addresses the deficiencies of recovery-oriented HPC, i.e., checkpoint/restart to/from a parallel file system, at extreme scale by adding the redundancy approach to the HPC resilience portfolio. It utilizes the MPI performance tool interface, PMPI, to transparently intercept MPI calls from an application and to hide all redundancy-related mechanisms. A redundantly executed application runs with $r*m$ native MPI processes, where $r$ is the number of MPI ranks visible to the application and $m$ is the replication degree. Messages between redundant nodes are replicated. Partial replication for tunable resilience is supported. The performance results clearly show the negative impact of the O(m^2) messages between replicas. For low-level, point-to-point benchmarks, the impact can be as high as the replication degree. For applications, performance highly depends on the actual communication types and counts. On single-core systems, the overhead can be 0\% for embarrassingly parallel applications independent of the employed redundancy configuration or up to 70-90\% for communication-intensive applications in a dual-redundant configuration. On multi-core systems, the overhead can be significantly higher due to the additional communication contention.", pts = "27623" }
@conference{wang10hybrid2, author = "Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Hybrid Checkpointing for {MPI} Jobs in {HPC} Environments", booktitle = "Proceedings of the \href{http://grid.sjtu.edu.cn/icpads10}{$16^{th}$ IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010}", pages = "524--533", month = dec # "~8-10, ", year = "2010", address = "Shanghai, China", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-4307-9", doi = "10.1109/ICPADS.2010.48", url = "http://www.christian-engelmann.info/publications/wang10hybrid2.pdf", url2 = "http://www.christian-engelmann.info/publications/wang10hybrid2.ppt.pdf", abstract = "As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: At incremental checkpoints, only data changed since the last checkpoint is captured. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints are an order of magnitude larger than overheads on restarts. We further derive qualitative results indicating an optimal balance between full/incremental checkpoints of our novel approach at a ratio of 1:9, which outperforms both always-full and always-incremental checkpointing.", pts = "25447" }
@conference{li10functional, author = "Min Li and Sudharshan S. Vazhkudai and Ali R. Butt and Fei Meng and Xiaosong Ma and Youngjae Kim and Christian Engelmann and Galen Shipman", title = "Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures", booktitle = "Proceedings of the \href{http://sc10.supercomputing.org}{$23^{rd}$ IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010}", pages = "1-12", month = nov # "~13-19, ", year = "2010", address = "New Orleans, LA, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4244-7559-9", doi = "10.1109/SC.2010.28", url = "http://www.christian-engelmann.info/publications/li10functional.pdf", url2 = "http://www.christian-engelmann.info/publications/li10functional.ppt.pdf", abstract = "Scaling computations on emerging massive-core supercomputers is a daunting task, which coupled with the significantly lagging system I/O capabilities exacerbates applications' end-to-end performance. The I/O bottleneck often negates potential performance benefits of assigning additional compute cores to an application. In this paper, we address this issue via a novel functional partitioning (FP) runtime environment that allocates cores to specific application tasks - checkpointing, de-duplication, and scientific data format transformation - so that the deluge of cores can be brought to bear on the entire gamut of application activities. The focus is on utilizing the extra cores to support HPC application I/O activities and also leverage solid-state disks in this context. For example, our evaluation shows that dedicating 1 core on an oct-core machine for checkpointing and its assist tasks using FP can improve overall execution time of a FLASH benchmark on 80 and 160 cores by 43.95\% and 41.34\%, respectively.", pts = "24996" }
@conference{boehm10aggregation, author = "Swen B{\"o}hm and Christian Engelmann and Stephen L. Scott", title = "Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments", booktitle = "Proceedings of the \href{http://www.anss.org.au/hpcc2010} {$12^{th}$ IEEE International Conference on High Performance Computing and Communications (HPCC) 2010}", pages = "72--78", month = sep # "~1-3, ", year = "2010", address = "Melbourne, Australia", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-4214-0", doi = "10.1109/HPCC.2010.32", url = "http://www.christian-engelmann.info/publications/boehm10aggregation.pdf", url2 = "http://www.christian-engelmann.info/publications/boehm10aggregation.ppt.pdf", abstract = "We present a monitoring system for large-scale parallel and distributed computing environments that allows trading off accuracy in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual monitoring metrics using a tree-based overlay network. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of $\sim$56 in comparison to the Ganglia distributed monitoring system. A simple scaling study reveals, however, that further efforts are needed in reducing the amount of data to monitor future-generation extreme-scale systems with up to 1,000,000 nodes. The implemented solution did not have a measurable performance impact, as the 32-node test system did not produce enough monitoring data to interfere with running applications.", pts = "24907" }
@conference{litvinova10proactive, author = "Antonina Litvinova and Christian Engelmann and Stephen L. Scott", title = "A Proactive Fault Tolerance Framework for High-Performance Computing", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-676.html} {$9^{th}$ IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010}", pages = "", month = feb # "~16-18, ", year = "2010", address = "Innsbruck, Austria", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-783-3", doi = "10.2316/P.2010.676-024", url = "http://www.christian-engelmann.info/publications/litvinova10proactive.pdf", url2 = "http://www.christian-engelmann.info/publications/litvinova10proactive.ppt.pdf", abstract = "As high-performance computing (HPC) systems continue to increase in scale, their mean-time to interrupt decreases accordingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not increase proportionally, it is becoming less efficient. Proactive FT avoids experiencing failures through preventative measures, such as by migrating application parts away from nodes that are about to fail. This paper presents a proactive FT framework that performs environmental monitoring, event logging, parallel job monitoring and resource monitoring to analyze HPC system reliability and to perform FT through such preventative actions.", pts = "13674" }
@conference{taerat09blue, author = "Narate Taerat and Nichamon Naksinehaboon and Clayton Chandler and James Elliott and Chokchai (Box) Leangsuksun and George Ostrouchov and Stephen L. Scott and Christian Engelmann", title = "{Blue Gene/L} Log Analysis and Time to Interrupt Estimation", booktitle = "Proceedings of the \href{http://www.ares-conference.eu/ares2009}{$4^{th}$ International Conference on Availability, Reliability and Security (ARES) 2009}", pages = "173--180", month = mar # "~16-19, ", year = "2009", address = "Fukuoka, Japan", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-4244-3572-2", doi = "10.1109/ARES.2009.105", url = "http://www.christian-engelmann.info/publications/taerat09blue.pdf", url2 = "", abstract = "System- and application-level failures could be characterized by analyzing relevant log files. The resulting data might then be used in numerous studies on and future developments for the mission-critical and large scale computational architecture, including fields such as failure prediction, reliability modeling, performance modeling and power awareness. In this paper, system logs covering a six month period of the Blue Gene/L supercomputer were obtained and subsequently analyzed. Temporal filtering was applied to remove duplicated log messages. Optimistic and pessimistic perspectives were exerted on filtered log information to observe failure behavior within the system. Further, various time to repair factors were applied to obtain application time to interrupt, which will be exploited in further resilience modeling research." }
@conference{engelmann09evaluating, author = "Christian Engelmann and Hong H. Ong and Stephen L. Scott", title = "Evaluating the Shared Root File System Approach for Diskless High-Performance Computing Systems", booktitle = "Proceedings of the \href{http://www.linuxclustersinstitute.org/conferences} {$10^{th}$ LCI International Conference on High-Performance Clustered Computing (LCI) 2009}", month = mar # "~9-12, ", year = "2009", address = "Boulder, CO, USA", url = "http://www.christian-engelmann.info/publications/engelmann09evaluating.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann09evaluating.ppt.pdf", abstract = "Diskless high-performance computing (HPC) systems utilizing networked storage have become popular in the last several years. Removing disk drives significantly increases compute node reliability as they are known to be a major source of failures. Furthermore, networked storage solutions utilizing parallel I/O and replication are able to provide increased scalability and availability. Reducing a compute node to processor(s), memory and network interface(s) greatly reduces its physical size, which in turn allows for large-scale dense HPC solutions. However, one major obstacle is the requirement by certain operating systems (OSs), such as Linux, for a root file system. While one solution is to remove this requirement from the OS, another is to share the root file system over the networked storage. This paper evaluates three networked file system solutions, NFSv4, Lustre and PVFS2, with respect to their performance, scalability, and availability features for servicing a common root file system in a diskless HPC configuration. Our findings indicate that Lustre is a viable solution as it meets both scaling and performance requirements. However, certain availability issues regarding single points of failure and control need to be considered.", pts = "14025" }
@conference{engelmann09proactive, author = "Christian Engelmann and Geoffroy R. Vall\'ee and Thomas Naughton and Stephen L. Scott", title = "Proactive Fault Tolerance Using Preemptive Migration", booktitle = "Proceedings of the \href{http://www.pdp2009.org}{$17^{th}$ Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009}", pages = "252--257", month = feb # "~18-20, ", year = "2009", address = "Weimar, Germany", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3544-9", issn = "1066-6192", doi = "10.1109/PDP.2009.31", url = "http://www.christian-engelmann.info/publications/engelmann09proactive.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann09proactive.ppt.pdf", abstract = "Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.", pts = "13674" }
@conference{valentini09high, author = "Alessandro Valentini and Christian Di Biagio and Fabrizio Batino and Guido Pennella and Fabrizio Palma and Christian Engelmann", title = "High Performance Computing with {Harness} over {InfiniBand}", booktitle = "Proceedings of the \href{http://www.pdp2009.org}{$17^{th}$ Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009}", pages = "151--154", month = feb # "~18-20, ", year = "2009", address = "Weimar, Germany", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3544-9", issn = "1066-6192", doi = "10.1109/PDP.2009.64", url = "http://www.christian-engelmann.info/publications/valentini09high.pdf", abstract = "Harness is an adaptable and plug-in-based middleware framework able to support distributed parallel computing. To date, it is based on the Ethernet protocol, which cannot guarantee high-performance throughput or real-time (deterministic) performance. In recent years, both research and industry have developed new network architectures (InfiniBand, Myrinet, iWARP, etc.) to overcome these limits. This paper concerns the integration of Harness with InfiniBand, focusing on two solutions: IP over InfiniBand (IPoIB) and the Sockets Direct Protocol (SDP). These allow the Harness middleware to take advantage of the enhanced features provided by InfiniBand.", pts = "14107" }
@conference{engelmann09case, author = "Christian Engelmann and Hong H. Ong and Stephen L. Scott", title = "The Case for Modular Redundancy in Large-Scale High Performance Computing Systems", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-641.html} {$8^{th}$ IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009}", pages = "189--194", month = feb # "~16-18, ", year = "2009", address = "Innsbruck, Austria", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-784-0", doi = "", url = "http://www.christian-engelmann.info/publications/engelmann09case.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann09case.ppt.pdf", abstract = "Recent investigations into resilience of large-scale high-performance computing (HPC) systems showed a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean-time to failure (MTTF) and a higher mean-time to recover (MTTR) than their predecessors. Modular redundancy is being used in many mission critical systems today to provide for resilience, such as for aerospace and command \& control systems. The primary argument against modular redundancy for resilience in HPC has always been that the capability of an HPC system, and respective return on investment, would be significantly reduced. We argue that modular redundancy can significantly increase compute node availability as it removes the impact of scale from single compute node MTTR. We further argue that single compute nodes can be much less reliable, and therefore less expensive, and still be highly available, if their MTTR/MTTF ratio is maintained.", pts = "13981" }
@conference{wang08proactive, author = "Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Proactive Process-Level Live Migration in {HPC} Environments", booktitle = "Proceedings of the \href{http://sc08.supercomputing.org} {$21^{st}$ IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008}", pages = "1--12", month = nov # "~15-21, ", year = "2008", address = "Austin, TX, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4244-2835-9", doi = "10.1145/1413370.1413414", url = "http://www.christian-engelmann.info/publications/wang08proactive.pdf", url2 = "http://www.christian-engelmann.info/publications/wang08proactive.ppt.pdf", abstract = "As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70\% of the faults are handled proactively.", pts = "12052" }
@conference{engelmann08symmetric, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "Symmetric Active/Active Replication for Dependent Services", booktitle = "Proceedings of the \href{http://www.ares-conference.eu/ares2008}{$3^{rd}$ International Conference on Availability, Reliability and Security (ARES) 2008}", pages = "260--267", month = mar # "~4-7, ", year = "2008", address = "Barcelona, Spain", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3102-1", doi = "10.1109/ARES.2008.64", url = "http://www.christian-engelmann.info/publications/engelmann08symmetric.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann08symmetric.ppt.pdf", abstract = "During the last several years, we have established the symmetric active/active replication model for service-level high availability and implemented several proof-of-concept prototypes. One major deficiency of our model is its inability to deal with dependent services, since its original architecture is based on the client-service model. This paper extends our model to dependent services using its already existing mechanisms and features. The presented concept is based on the idea that a service may also be a client of another service, and multiple services may be clients of each other. A high-level abstraction is used to illustrate dependencies between clients and services, and to decompose dependencies between services into respective client-service dependencies. This abstraction may be used for providing high availability in distributed computing systems with complex service-oriented architectures.", pts = "9456" }
@conference{vallee08framework, author = "Geoffroy R. Vall\'ee and Kulathep Charoenpornwattana and Christian Engelmann and Anand Tikotekar and Chokchai (Box) Leangsuksun and Thomas Naughton and Stephen L. Scott", title = "A Framework For Proactive Fault Tolerance", booktitle = "Proceedings of the \href{http://www.ares-conference.eu/ares2008}{$3^{rd}$ International Conference on Availability, Reliability and Security (ARES) 2008}", pages = "659--664", month = mar # "~4-7, ", year = "2008", address = "Barcelona, Spain", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3102-1", doi = "10.1109/ARES.2008.171", url = "http://www.christian-engelmann.info/publications/vallee08framework.pdf", url2 = "http://www.christian-engelmann.info/publications/vallee08framework.ppt.pdf", abstract = "Fault tolerance is a major concern for guaranteeing the availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/restart and duplication. However, it is also possible to anticipate failures and proactively take action before failures occur in order to minimize failure impact on the system and application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e., migration and pause/unpause. The framework also allows the implementation of new proactive fault tolerance policies thanks to a modular architecture. A first proactive fault tolerance policy has been implemented, and preliminary experiments based on system-level virtualization have been performed and compared with results obtained by simulation." }
@conference{koenning08virtualized, author = "Bj{\"o}rn K{\"o}nning and Christian Engelmann and Stephen L. Scott and George A. (Al) Geist", title = "Virtualized Environments for the {Harness} High Performance Computing Workbench", booktitle = "Proceedings of the \href{http://www.pdp2008.org}{$16^{th}$ Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008}", pages = "133--140", month = feb # "~13-15, ", year = "2008", address = "Toulouse, France", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3089-5", doi = "10.1109/PDP.2008.14", url = "http://www.christian-engelmann.info/publications/koenning08virtualized.pdf", url2 = "http://www.christian-engelmann.info/publications/koenning08virtualized.ppt.pdf", abstract = "This paper describes recent accomplishments in providing a virtualized environment concept and prototype for scientific application development and deployment as part of the Harness High Performance Computing (HPC) Workbench research effort. The presented work focuses on tools and mechanisms that simplify scientific application development and deployment tasks, such that only minimal adaptation is needed when moving from one HPC system to another or after HPC system upgrades. The overall technical approach focuses on the concept of adapting the HPC system environment to the actual needs of individual scientific applications instead of the traditional scheme of adapting scientific applications to individual HPC system environment properties. The presented prototype implementation is based on the mature and lightweight chroot virtualization approach for Unix-type systems with a focus on virtualized file system structure and virtualized shell environment variables utilizing virtualized environment configuration descriptions in Extensible Markup Language (XML) format. The presented work can be easily extended to other virtualization technologies, such as system-level virtualization solutions using hypervisors.", pts = "11532" }
@conference{vallee08system, author = "Geoffroy R. Vall\'ee and Thomas Naughton and Christian Engelmann and Hong H. Ong and Stephen L. Scott", title = "System-level Virtualization for High Performance Computing", booktitle = "Proceedings of the \href{http://www.pdp2008.org}{$16^{th}$ Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008}", pages = "636--643", month = feb # "~13-15, ", year = "2008", address = "Toulouse, France", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3089-5", doi = "10.1109/PDP.2008.85", url = "http://www.christian-engelmann.info/publications/vallee08system.pdf", url2 = "http://www.christian-engelmann.info/publications/vallee08system.ppt.pdf", abstract = "System-level virtualization has been a research topic since the 1970s but regained popularity during the past few years because of the availability of efficient solutions such as Xen and the implementation of hardware support in commodity processors (e.g. Intel-VT, AMD-V). However, a majority of system-level virtualization projects are guided by the server consolidation market. As a result, current virtualization solutions appear to not be suitable for high performance computing (HPC), which is typically based on large-scale systems. On the other hand, there is significant interest in exploiting virtual machines (VMs) within HPC for a number of other reasons. By virtualizing the machine, one is able to run a variety of operating systems and environments as needed by the applications. Virtualization allows users to isolate workloads, improving security and reliability. It is also possible to support non-native environments and/or legacy operating environments through virtualization. In addition, it is possible to balance workloads, use migration techniques to relocate applications from failing machines, and isolate faulty systems for repair. This document presents the challenges for the implementation of a system-level virtualization solution for HPC. It also presents a brief survey of the different approaches and techniques to address these challenges.", pts = "11137" }
@conference{ou07symmetric, author = "Li Ou and Christian Engelmann and Xubin (Ben) He and Xin Chen and Stephen L. Scott", title = "Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems", booktitle = "Proceedings of the \href{http://www.iasted.org/conferences/home-590.html} {$19^{th}$ IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007}", pages = "", month = nov # "~19-21, ", year = "2007", address = "Cambridge, MA, USA", publisher = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB, Canada}", isbn = "978-0-88986-703-1", doi = "", url = "http://www.christian-engelmann.info/publications/ou07symmetric.pdf", url2 = "http://www.christian-engelmann.info/publications/ou07symmetric.ppt.pdf", abstract = "In a typical distributed storage system, metadata is stored and managed by dedicated metadata servers. One way to improve the availability of distributed storage systems is to deploy multiple metadata servers. Past research focused on the active/standby model, where each active server has at least one redundant idle backup. However, interruption of service and loss of service state may occur during a fail-over depending on the used replication technique. The research in this paper targets the symmetric active/active replication model using multiple redundant service nodes running in virtual synchrony. In this model, service node failures do not cause a fail-over to a backup and there is no disruption of service or loss of service state. We propose a fast delivery protocol to reduce the latency of total order broadcast. Our prototype implementation shows that high availability of metadata servers can be achieved with an acceptable performance trade-off using the active/active metadata server solution.", pts = "8335" }
@conference{disaverio07distributed, author = "Emanuele Di Saverio and Marco Cesati and Christian Di Biagio and Guido Pennella and Christian Engelmann", title = "Distributed Real-Time Computing with {Harness}", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://pvmmpi07.lri.fr}{$14^{th}$ European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2007}", pages = "281--288", volume = "4757", month = sep # "~30 - " # oct # "~3, ", year = "2007", address = "Paris, France", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-540-75415-2", issn = "0302-9743", doi = "10.1007/978-3-540-75416-9_39", url = "http://www.christian-engelmann.info/publications/disaverio07distributed.pdf", url2 = "http://www.christian-engelmann.info/publications/disaverio07distributed.ppt.pdf", abstract = "Modern parallel and distributed computing solutions are often built on a middleware software layer providing a higher and common level of service between computational nodes. Harness is an adaptable, plugin-based middleware framework for parallel and distributed computing. This paper reports recent research and development results of using Harness for real-time distributed computing applications in the context of an industrial environment with the need to perform several safety-critical tasks. The presented work exploits the modular architecture of Harness in conjunction with a lightweight threaded implementation to resolve several real-time issues by adding three new Harness plug-ins to provide a prioritized lightweight execution environment, low latency communication facilities, and local timestamped event logging.", pts = "7023" }
@conference{ou07fast, author = "Li Ou and Xubin (Ben) He and Christian Engelmann and Stephen L. Scott", title = "A Fast Delivery Protocol for Total Order Broadcasting", booktitle = "Proceedings of the \href{http://www.icccn.org/icccn07} {$16^{th}$ IEEE International Conference on Computer Communications and Networks (ICCCN) 2007}", pages = "730--734", month = aug # "~13-16, ", year = "2007", address = "Honolulu, HI, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-42441-251-8", issn = "1095-2055", doi = "10.1109/ICCCN.2007.4317904", url = "http://www.christian-engelmann.info/publications/ou07fast.pdf", url2 = "http://www.christian-engelmann.info/publications/ou07fast.ppt.pdf", abstract = "Sequencer, privilege-based, and communication history algorithms are popular approaches to implement total ordering, where communication history algorithms are most suitable for parallel computing systems because they provide the best performance under heavy workload. Unfortunately, the post-transmission delay of communication history algorithms is most apparent when a system is idle. In this paper, we propose a fast delivery protocol to reduce the latency of message ordering. The protocol optimizes the total ordering process by waiting for messages only from a subset of the machines in the group, and by fast acknowledging messages on behalf of other machines. Our test results indicate that the fast delivery protocol is suitable for both idle and heavily loaded systems, while reducing the latency of message ordering.", pts = "6926" }
@conference{nagarajan07proactive, author = "Arun B. Nagarajan and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Proactive Fault Tolerance for {HPC} with {Xen} Virtualization", booktitle = "Proceedings of the \href{http://ics07.ac.upc.edu}{$21^{st}$ ACM International Conference on Supercomputing (ICS) 2007}", pages = "23--32", month = jun # "~16-20, ", year = "2007", address = "Seattle, WA, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-59593-768-1", doi = "10.1145/1274971.1274978", url = "http://www.christian-engelmann.info/publications/nagarajan07proactive.pdf", url2 = "http://www.christian-engelmann.info/publications/nagarajan07proactive.ppt.pdf", abstract = "Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from unhealthy nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.", pts = "6489" }
@conference{engelmann07programming, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "On Programming Models for Service-Level High Availability", booktitle = "Proceedings of the \href{http://www.ares-conference.eu/ares2007}{$2^{nd}$ International Conference on Availability, Reliability and Security (ARES) 2007}", pages = "999--1006", month = apr # "~10-13, ", year = "2007", address = "Vienna, Austria", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "0-7695-2775-2", doi = "10.1109/ARES.2007.109", url = "http://www.christian-engelmann.info/publications/engelmann07programming.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann07programming.ppt.pdf", abstract = "This paper provides an overview of existing programming models for service-level high availability and investigates their differences, similarities, advantages, and disadvantages. Its goal is to help to improve reuse of code and to allow adaptation to quality of service requirements by using a uniform programming model description. It further aims at encouraging a discussion about these programming models and their provided quality of service, such as availability, performance, serviceability, usability, and applicability. Within this context, the presented research focuses on providing high availability for services running on head and service nodes of high-performance computing systems.", pts = "5078" }
@conference{wang07job, author = "Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "A Job Pause Service under {LAM/MPI+BLCR} for Transparent Fault Tolerance", booktitle = "Proceedings of the \href{http://www.ipdps.org/ipdps2007} {$21^{st}$ IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007}", pages = "1-10", month = mar # "~26-30, ", year = "2007", address = "Long Beach, CA, USA", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-59593-768-1", doi = "10.1109/IPDPS.2007.370307", url = "http://www.christian-engelmann.info/publications/wang07job.pdf", url2 = "http://www.christian-engelmann.info/publications/wang07job.ppt.pdf", abstract = "Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6\% is incurred only in case migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings of our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, \textit{i.e.}, the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.", pts = "4944" }
@conference{uhlemann06joshua, author = "Kai Uhlemann and Christian Engelmann and Stephen L. Scott", title = "{JOSHUA}: {S}ymmetric Active/Active Replication for Highly Available {HPC} Job and Resource Management", booktitle = "Proceedings of the \href{http://cluster2006.org}{$8^{th}$ IEEE International Conference on Cluster Computing (Cluster) 2006}", pages = "1-10", month = sep # "~25-28, ", year = "2006", address = "Barcelona, Spain", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "1-4244-0328-6", issn = "1552-5244", doi = "10.1109/CLUSTR.2006.311855", url = "http://www.christian-engelmann.info/publications/uhlemann06joshua.pdf", url2 = "http://www.christian-engelmann.info/publications/uhlemann06joshua.ppt.pdf", abstract = "Most of today's HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it is also a single point of control as it disables an entire HPC system until repair. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to. They have to be restarted once the head node is up and running again. With this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as availability analysis of our proof-of-concept prototype implementation show that continuous availability can be provided by JOSHUA with an acceptable performance trade-off.", pts = "2631" }
@conference{baumann06parallel, author = "Ronald Baumann and Christian Engelmann and George A. (Al) Geist", title = "A Parallel Plug-in Programming Paradigm", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://hpcc06.lrr.in.tum.de}{$7^{th}$ International Conference on High Performance Computing and Communications (HPCC) 2006}", volume = "4208", pages = "823--832", month = sep # "~13-15, ", year = "2006", address = "Munich, Germany", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-540-39368-9", issn = "0302-9743", doi = "10.1007/11847366_85", url = "http://www.christian-engelmann.info/publications/baumann06parallel.pdf", url2 = "http://www.christian-engelmann.info/publications/baumann06parallel.ppt.pdf", abstract = "Software component architectures allow assembly of applications from individual software modules based on clearly defined programming interfaces, thus improving the reuse of existing solutions and simplifying application development. Furthermore, the plug-in programming paradigm additionally enables runtime reconfigurability, making it possible to adapt to changing application needs, such as different application phases, and system properties, like resource availability, by loading/unloading appropriate software modules. Similar to parallel programs, parallel plug-ins are an abstraction for a set of cooperating individual plug-ins within a parallel application utilizing a software component architecture. Parallel programming paradigms apply to parallel plug-ins in the same way they apply to parallel programs. The research presented in this paper targets the clear definition of parallel plug-ins and the development of a parallel plug-in programming paradigm.", pts = "2413" }
@conference{varma06scalable, author = "Jyothish Varma and Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Scalable, Fault-Tolerant Membership for {MPI} Tasks on {HPC} Systems", booktitle = "Proceedings of the \href{http://www.ics-conference.org/2006} {$20^{th}$ ACM International Conference on Supercomputing (ICS) 2006}", pages = "219--228", month = jun # "~28-30, ", year = "2006", address = "Cairns, Australia", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", doi = "10.1145/1183401.1183433", isbn = "1-59593-282-8", url = "http://www.christian-engelmann.info/publications/varma06scalable.pdf", url2 = "http://www.christian-engelmann.info/publications/varma06scalable.ppt.pdf", abstract = "Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.", pts = "2105" }
@conference{okunbor06exploring, author = "Daniel I. Okunbor and Christian Engelmann and Stephen L. Scott", title = "Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems", booktitle = "Proceedings of the \href{http://www.atiner.gr/docs/2006AAAPROGRAM_COMP.htm} {$2^{nd}$ International Conference on Computer Science and Information Systems 2006}", month = jun # "~19-21, ", year = "2006", address = "Athens, Greece", url = "http://www.christian-engelmann.info/publications/okunbor06exploring.pdf", abstract = "This paper presents various aspects of reliability, availability and serviceability (RAS) systems as they relate to group communication service, including reliable and total order multicast/broadcast, virtual synchrony, and failure detection. While the issue of availability, particularly high availability using replication-based architectures, has recently received an upsurge of research interest, much still has to be done in understanding the basic underlying concepts for achieving RAS systems, especially in high-end and high performance computing (HPC) communities. Various attributes of group communication service and the prototype of symmetric active replication following ideas utilized in the Newtop protocol will be discussed. We explore the application of group communication service for RAS HPC, laying the groundwork for its integrated model.", pts = "3778" }
@conference{limaye05jobsite, author = "Kshitij Limaye and Chokchai (Box) Leangsuksun and Zeno Greenwood and Stephen L. Scott and Christian Engelmann and Richard M. Libby and Kasidit Chanchio", title = "Job-Site Level Fault Tolerance for Cluster and {Grid} Environments", booktitle = "Proceedings of the \href{http://cluster2005.org}{$7^{th}$ IEEE International Conference on Cluster Computing (Cluster) 2005}", pages = "1--9", month = sep # "~26-30, ", year = "2005", address = "Boston, MA, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "0-7803-9486-0", issn = "1552-5244", doi = "10.1109/CLUSTR.2005.347043", url = "http://www.christian-engelmann.info/publications/limaye05job-site.pdf", abstract = "In order to adopt high performance clusters and Grid computing for mission critical applications, fault tolerance is a necessity. Common fault tolerance techniques in distributed systems are normally achieved with checkpoint-recovery and job replication on alternative resources in case of a system outage. The first approach depends on the system's MTTR, while the latter approach depends on the availability of alternative sites to run replicas. There is a need for complementing these approaches by proactively handling failures at a job-site level, ensuring high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up a failed system by seeking alternative resources. Our results suggest a sizable aggregate performance improvement with an implementation of our method in Globus-enabled HA-OSCAR. The technique, called Smart Failover, provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and in critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state." }
@conference{song05umlbased, author = "Hertong Song and Chokchai (Box) Leangsuksun and Raja Nassar and Yudan Liu and Christian Engelmann and Stephen L. Scott", title = "{UML-based} {Beowulf} Cluster Availability Modeling", booktitle = "\href{http://www.world-academy-of-science.org/IMCSE2005/ws/SERP} {International Conference on Software Engineering Research and Practice (SERP) 2005}", pages = "161--167", month = jun # "~27-30, ", year = "2005", address = "Las Vegas, NV, USA", publisher = "CSREA Press", isbn = "1-932415-49-1" }
@conference{engelmann05superscalable, author = "Christian Engelmann and George A. (Al) Geist", title = "Super-Scalable Algorithms for Computing on 100,000 Processors", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://www.iccs-meeting.org/iccs2005}{$5^{th}$ International Conference on Computational Science (ICCS) 2005}, Part I", volume = "3514", pages = "313--320", month = may # "~22-25, ", year = "2005", address = "Atlanta, GA, USA", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-540-26032-5", issn = "0302-9743", doi = "10.1007/11428831_39", url = "http://www.christian-engelmann.info/publications/engelmann05superscalable.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann05superscalable.ppt.pdf", abstract = "In the next five years, the number of processors in high-end systems for scientific computing is expected to rise to tens and even hundreds of thousands. For example, the IBM Blue Gene/L can have up to 128,000 processors and the delivery of the first system is scheduled for 2005. Existing deficiencies in scalability and fault-tolerance of scientific applications need to be addressed soon. If the number of processors grows by an order of magnitude and efficiency drops by an order of magnitude, the overall effective computing performance stays the same. Furthermore, the mean time to interrupt of high-end computer systems decreases with scale and complexity. In a 100,000-processor system, failures may occur every couple of minutes and traditional checkpointing may no longer be feasible. With this paper, we summarize our recent research in super-scalable algorithms for computing on 100,000 processors. We introduce the algorithm properties of scale invariance and natural fault tolerance, and discuss how they can be applied to two different classes of algorithms. We also describe a super-scalable diskless checkpointing algorithm for problems that cannot be transformed into a super-scalable variant, or where other solutions are more efficient. Finally, a 100,000-processor simulator is presented as a platform for testing and experimentation." }
@conference{brim24microservices, author = "Michael J. Brim and Lance Drane and Marshall McDonnell and Christian Engelmann and Addi Malviya Thakur", title = "A Microservices Architecture Toolkit for Interconnected Science Ecosystems", booktitle = "Proceedings of the \href{http://sc24.supercomputing.org} {$37^{th}$ International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2024}: \href{https://works-workshop.org/} {$19^{th}$ Workshop on Workflows in Support of Large-Scale Science (WORKS) 2024}", pages = "", month = nov # "~18, ", year = "2024", address = "Atlanta, GA, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "", doi = "", url = "", url2 = "", abstract = "Microservices architecture is a promising approach for developing reusable scientific workflow capabilities for integrating diverse resources, such as experimental and observational instruments and advanced computational and data management systems, across many distributed organizations and facilities. In this paper, we describe how the INTERSECT Open Architecture leverages federated systems of microservices to construct interconnected science ecosystems, review how the INTERSECT software development kit eases microservice capability development, and demonstrate the use of such capabilities for deploying an example multi-facility INTERSECT ecosystem.", pts = "", note = "To appear" }
@conference{kumar21rdpm, author = "Mohit Kumar and Christian Engelmann", title = "{RDPM}: An Extensible Tool for Resilience Design Patterns Modeling", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{https://2021.euro-par.org}{$27^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2021 Workshops}: \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2021} {$14^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "13098", pages = "283--297", month = aug # "~30, ", year = "2021", address = "Lisbon, Portugal", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-031-06155-4", doi = "10.1007/978-3-031-06156-1_23", url = "http://www.christian-engelmann.info/publications/kumar21rdpm.pdf", url2 = "", abstract = "Resilience to faults, errors, and failures in extreme-scale HPC systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. While prior work focused on developing performance, reliability, and availability models for resilience design patterns, this paper extends it by providing a Resilience Design Patterns Modeling (RDPM) tool which allows (1) exploring performance, reliability, and availability of each resilience design pattern, (2) offering customization of parameters to optimize performance, reliability, and availability, and (3) allowing investigation of trade-off models for combining multiple patterns for practical resilience solutions.", pts = "161085" }
@conference{kumar20models, author = "Mohit Kumar and Christian Engelmann", title = "Models for Resilience Design Patterns", booktitle = "Proceedings of the \href{http://sc20.supercomputing.org} {$33^{rd}$ International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2020}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2020} {$10^{th}$ Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2020}", pages = "21-30", month = nov # "~11, ", year = "2020", address = "Atlanta, GA, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7381-1080-6", doi = "10.1109/FTXS51974.2020.00008", url = "http://www.christian-engelmann.info/publications/kumar20models.pdf", url2 = "http://www.christian-engelmann.info/publications/kumar20models.ppt.pdf", abstract = "Resilience plays an important role in supercomputers by providing correct and efficient operation in case of faults, errors, and failures. Resilience design patterns offer blueprints for effectively applying resilience technologies. Prior work focused on developing initial efficiency and performance models for resilience design patterns. This paper extends it by (1) describing performance, reliability, and availability models for all structural resilience design patterns, (2) providing more detailed models that include flowcharts and state diagrams, and (3) introducing the Resilience Design Pattern Modeling (RDPM) tool that calculates and plots the performance, reliability, and availability metrics of individual patterns and pattern combinations.", pts = "148010" }
@conference{sao19self-stabilizing, author = "Piyush Sao and Christian Engelmann and Srinivas Eswar and Oded Green and Richard Vuduc", title = "Self-stabilizing Connected Components", booktitle = "Proceedings of the \href{http://sc19.supercomputing.org} {$32^{nd}$ International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2019}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2019} {$9^{th}$ Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019}", pages = "50--59", month = nov # "~22, ", year = "2019", address = "Denver, CO, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-7281-6013-9", doi = "10.1109/FTXS49593.2019.00011", url = "http://www.christian-engelmann.info/publications/sao19self-stabilizing.pdf", url2 = "http://www.christian-engelmann.info/publications/sao19self-stabilizing.ppt.pdf", abstract = "For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of \emph{self-stabilization}. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore, on a machine subject to a transient fault, a self-stabilizing algorithm could recover if that fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct the invalid state. The self-stabilizing label-propagation algorithm performs $\bigo{V \log V}$ additional computation and requires $\bigo{V}$ additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over the conventional algorithm). When run against a battery of simulated fault injection tests, the self-stabilizing label propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80\% of cases. From a performance perspective, it also outperforms TMR as it requires fewer iterations in total. Beyond the fault-tolerance properties of self-stabilizing label propagation, we believe they are useful from a theoretical perspective and may have other use cases.", pts = "135067" }
@conference{engelmann19concepts, author = "Christian Engelmann and Geoffroy R. Vall\'ee and Swaroop Pophale", title = "Concepts for {OpenMP} Target Offload Resilience", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://parallel.auckland.ac.nz/iwomp2019} {$15^{th}$ International Workshop on OpenMP (IWOMP) 2019}", volume = "11718", pages = "78--93", month = sep # "~11-13, ", year = "2019", address = "Auckland, New Zealand", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-030-28595-1", doi = "10.1007/978-3-030-28596-8_6", url = "http://www.christian-engelmann.info/publications/engelmann19concepts.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann19concepts.ppt.pdf", abstract = "Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory, demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the general-purpose computing on graphics processing units errors and failures experienced in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing systems.", pts = "127338" }
@conference{hui18comprehensive2, author = "Yawei Hui and Byung Hoon (Hoony) Park and Christian Engelmann", title = "A Comprehensive Informative Metric for Analyzing {HPC} System Status using the {LogSCAN} Platform", booktitle = "Proceedings of the \href{http://sc18.supercomputing.org} {$31^{st}$ International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2018} {$8^{th}$ Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018}", pages = "29--38", month = nov # "~16, ", year = "2018", address = "Dallas, TX, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-7281-0222-1", doi = "10.1109/FTXS.2018.00007", url = "http://www.christian-engelmann.info/publications/hui18comprehensive2.pdf", url2 = "http://www.christian-engelmann.info/publications/hui18comprehensive2.ppt.pdf", abstract = "Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation and computation. One major challenge is to effectively summarize the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although there is plenty of operational and maintenance information collected and stored in real time, which may yield insights about short- and long-term system status, it is difficult to present this information in a comprehensive form. In this work, we present system information entropy (SIE), a newly developed metric that leverages the powers of traditional machine learning techniques and information theory. By compressing the multi-variant multi-dimensional event information recorded during the operation of the targeted system into a single time series of SIE, we demonstrate that the historical system status can be sensitively represented concisely and comprehensively. Given a sharp indicator as SIE, we argue that follow-up analytics based on SIE will reveal in-depth knowledge about system status using other sophisticated approaches, such as pattern recognition in the temporal domain or causality analysis incorporating extra independent metrics of the system.", pts = "119248" }
@conference{ashraf18analyzing, author = "Rizwan Ashraf and Christian Engelmann", title = "Analyzing the Impact of System Reliability Events on Applications in the {Titan} Supercomputer", booktitle = "Proceedings of the \href{http://sc18.supercomputing.org} {$31^{st}$ International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2018} {$8^{th}$ Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018}", pages = "39--48", month = nov # "~16, ", year = "2018", address = "Dallas, TX, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-7281-0222-1", doi = "10.1109/FTXS.2018.00008", url = "http://www.christian-engelmann.info/publications/ashraf18analyzing.pdf", url2 = "http://www.christian-engelmann.info/publications/ashraf18analyzing.ppt.pdf", abstract = "Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database, in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. In this paper, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible since we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, system and user software issues. Our work establishes the importance of providing feedback to system users for increasing operational efficiency of extreme-scale systems.", pts = "119070" }
@conference{park18big, author = "Byung Hoon (Hoony) Park and Yawei Hui and Swen Boehm and Rizwan Ashraf and Christian Engelmann and Christopher Layton", title = "A {Big Data} Analytics Framework for {HPC} Log Data: {Three} Case Studies Using the {Titan} Supercomputer Log", booktitle = "Proceedings of the \href{https://cluster2018.github.io} {$19^{th}$ IEEE International Conference on Cluster Computing (Cluster) 2018}: \href{https://sites.google.com/site/hpcmaspa2018} {$5^{th}$ Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2018}", pages = "571--579", month = sep # "~10, ", year = "2018", address = "Belfast, UK", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-5386-8319-4", issn = "2168-9253", doi = "10.1109/CLUSTER.2018.00073", url = "http://www.christian-engelmann.info/publications/park18big.pdf", url2 = "http://www.christian-engelmann.info/publications/park18big.ppt.pdf", abstract = "Reliability, availability and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated from multiple logging systems and sensors that cover many components of the system. The analysis of these data for finding persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult and the unstructured nature and unique properties of log data produced by each subsystem adds another dimension of difficulty in identifying implicit correlation among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail their workflows using three years of log data collected from ORNL's Titan supercomputer.", pts = "112964" }
@conference{ashraf18performance, author = "Rizwan Ashraf and Christian Engelmann", title = "Performance Efficient Multiresilience using Checkpoint Recovery in Iterative Algorithms", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{https://europar2018.org}{$24^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2018 Workshops}: \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2018} {$11^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "11339", pages = "813--825", month = aug # "~28, ", year = "2018", address = "Turin, Italy", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-030-10549-5", doi = "10.1007/978-3-030-10549-5_63", url = "http://www.christian-engelmann.info/publications/ashraf18performance.pdf", url2 = "http://www.christian-engelmann.info/publications/ashraf18performance.ppt.pdf", abstract = "In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate and maintain forward progress in the presence of both soft errors and process failures. We address the challenge by proposing performance models which are useful to design performance efficient and resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application with two distinct kinds of soft error detectors: one detector is high overhead and high accuracy, whereas the second is low overhead and low accuracy. We show how both can be leveraged for verifying the integrity of checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of employing the low overhead detector with high frequency within the checkpoint interval, so that timely soft error recovery can take place, resulting in less re-computed work.", pts = "112980" }
@conference{park17big, author = "Byung Hoon (Hoony) Park and Saurabh Hukerikar and Christian Engelmann and Ryan Adamson", title = "Big Data Meets {HPC} Log Analytics: {Scalable} Approach to Understanding Systems at Extreme Scale", booktitle = "Proceedings of the \href{https://cluster17.github.io} {$18^{th}$ IEEE International Conference on Cluster Computing (Cluster) 2017}: \href{https://sites.google.com/site/hpcmaspa2017} {$4^{th}$ Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2017}", pages = "758--765", month = sep # "~5, ", year = "2017", address = "Honolulu, HI, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-5386-2327-5", issn = "2168-9253", doi = "10.1109/CLUSTER.2017.113", url = "http://www.christian-engelmann.info/publications/park17big.pdf", url2 = "http://www.christian-engelmann.info/publications/park17big.ppt.pdf", abstract = "Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and the resource usage of user applications. These logs, once fully analyzed and correlated, can produce detailed information about system health and the root causes of failures, and reveal an application's interactions with the system, providing invaluable insights to domain scientists and system administrators. However, processing HPC logs requires a deep understanding of hardware and software components at multiple layers of the system stack. Moreover, most log data is unstructured and voluminous, making it more difficult for scientists and engineers to analyze the data. With rapid increases in the scale and complexity of HPC systems, log data processing is becoming a big data challenge. This paper introduces an HPC log data analytics framework that is based on a distributed NoSQL database technology, which provides scalability and high availability, and on Apache Spark for rapid in-memory processing of log data. The framework enables the extraction of a range of information about the system so that system administrators and end users alike can obtain necessary insights for their specific needs. We describe our experience with using this framework to glean insights from the log data derived from the Titan supercomputer at the Oak Ridge National Laboratory.", pts = "100681" }
@conference{hukerikar17pattern-based, author = "Saurabh Hukerikar and Christian Engelmann", title = "Pattern-based Modeling of High-Performance Computing Resilience", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://europar2017.usc.es}{$23^{rd}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops}: \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2017} {$10^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "10659", pages = "557--568", month = aug # "~29, ", year = "2017", address = "Santiago de Compostela, Spain", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-319-75177-1", doi = "10.1007/978-3-319-75178-8_45", url = "http://www.christian-engelmann.info/publications/hukerikar17pattern-based.pdf", url2 = "http://www.christian-engelmann.info/publications/hukerikar17pattern-based.ppt.pdf", abstract = "The design of supercomputing systems and their applications must consider resilience and power consumption as key design parameters when designing for higher performance. In previous work, we established a structured methodology for developing resilience solutions based on the concept of design patterns. In this paper, we discuss analytical models for the design patterns to support quantitative analysis of their performance and reliability characteristics.", pts = "102871" }
@conference{hukerikar17towards, author = "Saurabh Hukerikar and Rizwan Ashraf and Christian Engelmann", title = "Towards New Metrics for High-Performance Computing Resilience", booktitle = "Proceedings of the \href{http://www.hpdc.org/2017} {$26^{th}$ ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2017}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2017} {$7^{th}$ Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2017}", pages = "23--30", month = jun # "~26-30, ", year = "2017", address = "Washington, D.C.", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-5001-3", doi = "10.1145/3086157.3086163", url = "http://www.christian-engelmann.info/publications/hukerikar17towards.pdf", url2 = "http://www.christian-engelmann.info/publications/hukerikar17towards.ppt.pdf", abstract = "Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presence of various modes of faults.", pts = "74843" }
@conference{hukerikar16language, author = "Saurabh Hukerikar and Christian Engelmann", title = "Language Support for Reliable Memory Regions", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{https://lcpc2016.wordpress.com}{$29^{th}$ International Workshop on Languages and Compilers for Parallel Computing}", volume = "10136", pages = "73--87", month = sep # "~28-30, ", year = "2016", address = "Rochester, NY, USA", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-319-52708-6", issn = "0302-9743", doi = "10.1007/978-3-319-52709-3_6", url = "http://www.christian-engelmann.info/publications/hukerikar16language.pdf", url2 = "http://www.christian-engelmann.info/publications/hukerikar16language.ppt.pdf", abstract = "The path to exascale computational capabilities in high-performance computing (HPC) systems is challenged by the evolution of the architectures of supercomputing systems. The constraints of power have driven designs that include increasingly heterogeneous architectures and complex memory hierarchies. These systems are also expected to experience an increased rate of errors, such that applications will no longer be able to assume correct behavior of the underlying machine. To enable the scientific community to succeed in scaling their applications and harness the capabilities of exascale systems, we need software strategies that provide mechanisms for explicit management of locality and resilience to errors in the system. In prior work, we introduced the concept of explicitly reliable memory regions, called havens. Memory management using havens supports selective reliability through a region-based approach to memory allocation. Havens enable the creation of explicit software-enabled robust memory containers for which resilient behavior is guaranteed. In this paper, we propose language support for havens through type annotations that make the structure of a program's havens more explicit. We describe how the extended haven-based memory management model is implemented and the impact on the resiliency of a conjugate gradient application.", pts = "69644" }
@conference{naughton16cooperative, author = "Thomas Naughton and Christian Engelmann and Geoffroy Vall{\'e}e and Ferrol Aderholdt and Stephen L. Scott", title = "A Cooperative Approach to Virtual Machine Based Fault Injection", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{https://europar2016.inria.fr}{$22^{nd}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2016 Workshops}: \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2016} {$9^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "10104", pages = "671--682", month = aug # "~23, ", year = "2016", address = "Grenoble, France", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-319-58943-5", issn = "0302-9743", doi = "10.1007/978-3-319-58943-5_54", url = "http://www.christian-engelmann.info/publications/naughton16cooperative.pdf", url2 = "http://www.christian-engelmann.info/publications/naughton16cooperative.ppt.pdf", abstract = "Resilience investigations often employ fault injection (FI) tools to study the effects of simulated errors on a target system. It is important to keep the target system under test (SUT) isolated from the controlling environment in order to maintain control of the experiment. Virtual machines (VMs) have been used to aid these investigations due to the strong isolation properties of system-level virtualization. A key challenge in fault injection tools is to gain proper insight and context about the SUT. In VM-based FI tools, this challenge of target context is increased due to the separation between host and guest (VM). We discuss an approach to VM-based FI that leverages virtual machine introspection (VMI) methods to gain insight into the target's context running within the VM. The key to this environment is the ability to provide basic information to the FI system that can be used to create a map of the target environment. We describe a proof-of-concept implementation and a demonstration of its use to introduce simulated soft errors into an iterative solver benchmark running in user-space of a guest VM.", pts = "69232" }
@conference{parchman16adding, author = "Zachary Parchman and Geoffroy R. Vall\'ee and Thomas Naughton and Christian Engelmann and David E. Bernholdt", title = "Adding Fault Tolerance to {NPB} Benchmarks Using {ULFM}", booktitle = "Proceedings of the \href{http://www.hpdc.org/2016} {$25^{th}$ ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2016} {$6^{th}$ Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016}", pages = "19--26", month = may # "~31 - " # jun # "~4, ", year = "2016", address = "Kyoto, Japan", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-4503-4349-7", doi = "10.1145/2909428.2909429", url = "http://www.christian-engelmann.info/publications/parchman16adding.pdf", url2 = "http://www.christian-engelmann.info/publications/parchman16adding.ppt.pdf", abstract = "In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerance is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a modification of some of the benchmarks of the NAS parallel benchmark (NPB) to include support of the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to ``checkpoint'' data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary experiments that show the impact of such fault tolerance strategies on the application execution.", pts = "62557" }
@conference{naughton14what, author = "Thomas Naughton and Garry Smith and Christian Engelmann and Geoffroy Vall{\'e}e and Ferrol Aderholdt and Stephen L. Scott", title = "What is the right balance for performance and isolation with virtualization in {HPC}?", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://europar2014.dcc.fc.up.pt}{$20^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2014 Workshops}: \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2014} {$7^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "8805", pages = "570--581", month = aug # "~25, ", year = "2014", address = "Porto, Portugal", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-319-14325-5", issn = "0302-9743", doi = "10.1007/978-3-319-14325-5_49", url = "http://www.christian-engelmann.info/publications/naughton14what.pdf", url2 = "http://www.christian-engelmann.info/publications/naughton14what.ppt.pdf", abstract = "The use of virtualization in high-performance computing (HPC) has been suggested as a means to provide tailored services and added functionality that many users expect from full-featured Linux cluster environments. While the use of virtual machines in HPC can offer several benefits, maintaining performance is a crucial factor. In some instances, performance criteria are placed above isolation properties, and selective relaxation of isolation for performance is an important characteristic when considering resilience for HPC environments employing virtualization. In this paper we consider some of the factors associated with balancing performance and isolation in configurations that employ virtual machines. In this context, we propose a classification of errors based on the concept of ``error zones'', as well as a detailed analysis of the trade-offs between resilience and performance based on the level of isolation provided by virtualization solutions. Finally, the results from a set of experiments that use different virtualization solutions are presented, allowing further elucidation of the topic.", pts = "51548" }
@conference{engelmann13toward, author = "Christian Engelmann and Thomas Naughton", title = "Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems", booktitle = "Proceedings of the \href{http://icpp2013.ens-lyon.fr}{$42^{nd}$ International Conference on Parallel Processing (ICPP) 2013}: \href{http://www.psti-workshop.org} {$4^{th}$ International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI)}", pages = "962-971", month = oct # "~2, ", year = "2013", address = "Lyon, France", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-5117-3", issn = "0190-3918", doi = "10.1109/ICPP.2013.114", url = "http://www.christian-engelmann.info/publications/engelmann13toward.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann13toward.ppt.pdf", abstract = "xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.", pts = "44445" }
@conference{lagadapati13tools, author = "Mahesh Lagadapati and Frank Mueller and Christian Engelmann", title = "Tools for Simulation and Benchmark Generation at Exascale", booktitle = "Proceedings of the \href{http://tools.zih.tu-dresden.de/2013/} {$7^{th}$ Parallel Tools Workshop}", pages = "19--24", month = sep # "~3-4, ", year = "2013", address = "Dresden, Germany", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-319-08143-4", doi = "10.1007/978-3-319-08144-1_2", url = "http://www.christian-engelmann.info/publications/lagadapati13tools.pdf", url2 = "http://www.christian-engelmann.info/publications/lagadapati13tools.ppt.pdf", abstract = "The path to exascale high-performance computing (HPC) poses several challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices is an important component of HPC hardware/software co-design. Simulations using models of future HPC systems and communication traces from applications running on existing HPC systems can offer an insight into the performance of future architectures. This work targets technology developed for scalable application tracing of communication events and memory profiles, but can be extended to other areas, such as I/O, control flow, and data flow. It further focuses on extreme-scale simulation of millions of Message Passing Interface (MPI) ranks using a lightweight parallel discrete event simulation (PDES) toolkit for performance evaluation. Instead of simply replaying a trace within a simulation, the approach is to generate a benchmark from it and to run this benchmark within a simulation using models to reflect the performance characteristics of future-generation HPC systems. This provides a number of benefits, such as eliminating the data intensive trace replay and enabling simulations at different scales. The presented work utilizes the ScalaTrace tool to generate scalable trace files, the ScalaBenchGen tool to generate the benchmark, and the xSim tool to run the benchmark within a simulation.", pts = "48783" }
@conference{naughton13using, author = "Thomas Naughton and Swen B{\"o}hm and Christian Engelmann and Geoffroy Vall{\'e}e", title = "Using Performance Tools to Support Experiments in {HPC} Resilience", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://www.europar2013.org/}{$19^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops}: \href{http://xcr.cenit.latech.edu/resilience2013}{$6^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "8374", pages = "727--736", month = aug # "~26, ", year = "2013", address = "Aachen, Germany", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-642-54419-4", issn = "0302-9743", doi = "10.1007/978-3-642-54420-0_71", url = "http://www.christian-engelmann.info/publications/naughton13using.pdf", url2 = "http://www.christian-engelmann.info/publications/naughton13using.ppt.pdf", abstract = "The high performance computing~(HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between ``performance tools'' and ``resilience tools''. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community. In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnosis procedures help to provide context about the system state when errors (failures) occur. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.", pts = "45676" }
@conference{jones11simulation, author = "Ian S. Jones and Christian Engelmann", title = "Simulation of Large-Scale {HPC} Architectures", booktitle = "Proceedings of the \href{http://icpp2011.org}{$40^{th}$ International Conference on Parallel Processing (ICPP) 2011}: \href{http://www.psti-workshop.org} {$2^{nd}$ International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI)}", pages = "447--456", month = sep # "~13-19, ", year = "2011", address = "Taipei, Taiwan", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-4511-0", issn = "1530-2016", doi = "10.1109/ICPPW.2011.44", url = "http://www.christian-engelmann.info/publications/jones11simulation.pdf", url2 = "http://www.christian-engelmann.info/publications/jones11simulation.ppt.pdf", abstract = "The Extreme-scale Simulator (xSim) is a recently developed performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads. It allows observing parallel application performance properties in a simulated extreme-scale HPC system to further assist in HPC hardware and application software co-design on the road toward multi-petascale and exascale computing. This paper presents a newly implemented network model for the xSim performance investigation toolkit that is capable of providing simulation support for a variety of HPC network architectures with the appropriate trade-off between simulation scalability and accuracy. The approach taken focuses on a scalable distributed solution with latency and bandwidth restrictions for the simulated network. Different network architectures, such as star, ring, mesh, torus, twisted torus and tree, as well as hierarchical combinations, such as to simulate network-on-chip and network-on-node, are supported. Network traffic congestion modeling is omitted to gain simulation scalability by reducing simulation accuracy.", pts = "31901" }
@conference{fiala11tunable, author = "David Fiala and Kurt Ferreira and Frank Mueller and Christian Engelmann", title = "A Tunable, Software-based {DRAM} Error Detection and Correction Library for {HPC}", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://europar2011.bordeaux.inria.fr/}{$17^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II}: \href{http://xcr.cenit.latech.edu/resilience2011}{$4^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}", volume = "7156", pages = "251--261", month = aug # "~29 - " # sep # "~2, ", year = "2011", address = "Bordeaux, France", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-642-29740-3", doi = "10.1007/978-3-642-29740-3_29", url = "http://www.christian-engelmann.info/publications/fiala11tunable.pdf", url2 = "", abstract = "Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that once tuned, LIBSDC is able to achieve SDC protection with a 50\% resource overhead, less than the 100\% needed for double modular redundancy.", pts = "35631" }
@conference{naughton11case, author = "Thomas Naughton and Geoffroy R. Vall\'ee and Christian Engelmann and Stephen L. Scott", title = "A Case for Virtual Machine based Fault Injection in a High-Performance Computing Environment", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://europar2011.bordeaux.inria.fr/}{$17^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2011}: \href{http://www.csm.ornl.gov/srt/conferences/hpcvirt2011} {$5^{th}$ Workshop on System-level Virtualization for High Performance Computing (HPCVirt)}", volume = "7155", pages = "234-243", month = aug # "~29 - " # sep # "~2, ", year = "2011", address = "Bordeaux, France", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-642-29737-3", doi = "10.1007/978-3-642-29737-3_27", url = "http://www.christian-engelmann.info/publications/naughton11case.pdf", url2 = "http://www.christian-engelmann.info/publications/naughton11case.ppt.pdf", abstract = "Large-scale computing platforms provide tremendous capabilities for scientific discovery. These systems have hundreds of thousands of computing cores, hundreds of terabytes of memory, and enormous high-performance interconnection networks. These systems are facing enormous challenges to achieve performance at such scale. Failures are an Achilles heel of these enormous systems. As applications and system software scale up to multi-petaflop and beyond to exascale platforms, the occurrence of failures will be much more common. This has given rise to a push in fault-tolerance and resilience research for HPC systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques. While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chain for these systems is often tailored to the platform, and while the majority of systems on the Top500 Supercomputer list run Linux, these operating environments typically contain many site/machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption. The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high performance systems can be realized [1, 2]. Virtualization also provides a clean abstraction for building experimental tools for investigation into the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context, and outline an approach that can leverage virtualization.", pts = "32309" }
@conference{engelmann10facilitating, author = "Christian Engelmann and Frank Lauer", title = "Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation", booktitle = "Proceedings of the \href{http://www.cluster2010.org}{$12^{th}$ IEEE International Conference on Cluster Computing (Cluster) 2010}: \href{http://www2.wmin.ac.uk/getovv/aacec10.html} {$1^{st}$ Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC)}", pages = "1-8", month = sep # "~20-24, ", year = "2010", address = "Hersonissos, Crete, Greece", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-4244-8395-2", doi = "10.1109/CLUSTERWKSP.2010.5613113", url = "http://www.christian-engelmann.info/publications/engelmann10facilitating.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann10facilitating.ppt.pdf", abstract = "This work focuses on tools for investigating algorithm performance at extreme scale with millions of concurrent threads and for evaluating the impact of future architecture choices to facilitate the co-design of high-performance computing (HPC) architectures and applications. The approach focuses on lightweight simulation of extreme-scale HPC systems with the needed amount of accuracy. The prototype presented in this paper is able to provide this capability using a parallel discrete event simulation (PDES), such that a Message Passing Interface (MPI) application can be executed at extreme scale, and its performance properties can be evaluated. The results of an initial prototype are encouraging as a simple hello world MPI program could be scaled up to 1,048,576 virtual MPI processes on a four-node cluster, and the performance properties of two MPI programs could be evaluated at up to 1,024 and 16,384 virtual MPI processes on the same system.", pts = "25331" }
@conference{ostrouchov09nonparametric, author = "George Ostrouchov and Thomas Naughton and Christian Engelmann and Geoffroy R. Vall\'ee and Stephen L. Scott", title = "Nonparametric Multivariate Anomaly Analysis in Support of {HPC} Resilience", booktitle = "Proceedings of the \href{http://www.oerc.ox.ac.uk/ieee} {$5^{th}$ IEEE International Conference on e-Science (e-Science) 2009}: \href{http://www.oerc.ox.ac.uk/ieee/workshops/workshops/computational-science} {Workshop on Computational Science}", pages = "80--85", month = dec # "~9-11, ", year = "2009", address = "Oxford, UK", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-1-4244-5946-9", doi = "10.1109/ESCIW.2009.5407992", url = "http://www.christian-engelmann.info/publications/ostrouchov09nonparametric.pdf", url2 = "http://www.christian-engelmann.info/publications/ostrouchov09nonparametric.ppt.pdf", abstract = "Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens-of-thousands of nodes and hundreds-of-thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work in the identification of anomalies in monitoring data and system logs to provide further insights into machine status, runtime behavior, failure modes and failure root causes. It discusses the details of an initial prototype that gathers the data and uses statistical techniques for analysis.", pts = "26081" }
@conference{naughton09fault, author = "Thomas Naughton and Wesley Bland and Geoffroy R. Vall\'ee and Christian Engelmann and Stephen L. Scott", title = "Fault Injection Framework for System Resilience Evaluation -- {F}ake Faults for Finding Future Failures", booktitle = "Proceedings of the \href{http://www.lrz-muenchen.de/hpdc2009}{$18^{th}$ International Symposium on High Performance Distributed Computing (HPDC) 2009}: \href{http://xcr.cenit.latech.edu/resilience2009}{$2^{nd}$ Workshop on Resiliency in High Performance Computing (Resilience) 2009}", pages = "23--28", month = jun # "~9, ", year = "2009", address = "Munich, Germany", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-60558-587-1", doi = "10.1145/1552526.1552530", url = "http://www.christian-engelmann.info/publications/naughton09fault.pdf", url2 = "http://www.christian-engelmann.info/publications/naughton09fault.ppt.pdf", abstract = "As high-performance computing (HPC) systems increase in size and complexity they become more difficult to manage. The enormous component counts associated with these large systems lead to significant challenges in system reliability and availability. This in turn is driving research into the resilience of large scale systems, which seeks to curb the effects of increased failures at large scales by masking the inevitable faults in these systems. The basic premise is that failure must be accepted as a reality of large-scale systems and coped with accordingly through system resilience. A key component in the development and evaluation of system resilience techniques is having a means to conduct controlled experiments. A common method for performing such experiments is to generate synthetic faults and study the resulting effects. In this paper we discuss the motivation and our initial use of software fault injection to support the evaluation of resilience for HPC systems. We mention background and related work in the area and discuss the design of a tool to aid in fault injection experiments for both user-space (application-level) and system-level failures." }
@conference{tikotekar09performance, author = "Anand Tikotekar and Hong H. Ong and Sadaf Alam and Geoffroy R. Vall\'ee and Thomas Naughton and Christian Engelmann and Stephen L. Scott", title = "Performance Comparison of Two Virtual Machine Scenarios Using an {HPC} Application -- {A} Case Study Using Molecular Dynamics Simulations", booktitle = "Proceedings of the \href{http://www.csm.ornl.gov/srt/hpcvirt09}{$3^{rd}$ Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2009}, in conjunction with the \href{http://www.eurosys.org/2009}{$4^{th}$ ACM SIGOPS European Conference on Computer Systems (EuroSys) 2009}", pages = "33--40", month = mar # "~30, ", year = "2009", address = "Nuremberg, Germany", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-60558-465-2", doi = "10.1145/1519138.1519143", url = "http://www.christian-engelmann.info/publications/tikotekar09performance.pdf", url2 = "http://www.christian-engelmann.info/publications/tikotekar09performance.ppt.pdf", abstract = "Obtaining a high flexibility to performance-loss ratio is a key challenge of today's HPC virtual environment landscape. While extensive research has been targeted at extracting more performance from virtual machines, the question of whether novel virtual machine usage scenarios could lead to a better flexibility vs. performance trade-off has received less attention. In this paper, we take a step forward by studying and comparing the performance implications of running the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) application on two virtual machine configurations. The first configuration consists of two virtual machines per node with one application process per virtual machine. The second configuration consists of one virtual machine per node with two processes per virtual machine. Xen has been used as the hypervisor and standard Linux as the guest virtual machine. Our results show that the difference in overall performance impact on LAMMPS between the two virtual machine configurations described above is around 3\%. We also study the difference in performance impact in terms of each configuration's individual metrics such as CPU, I/O, memory, and interrupt/context switches." }
@conference{vallee08virtual, author = "Geoffroy R. Vall\'ee and Thomas Naughton and Hong H. Ong and Anand Tikotekar and Christian Engelmann and Wesley Bland and Ferrol Aderholdt and Stephen L. Scott", title = "Virtual System Environments", booktitle = "Communications in Computer and Information Science: Proceedings of the \href{http://www.dmtf.org/svm08}{$2^{nd}$ DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and New Technologies (SVM) 2008}", volume = "18", pages = "72--83", month = oct # "~21-22, ", year = "2008", address = "Munich, Germany", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-540-88707-2", issn = "1865-0929", doi = "10.1007/978-3-540-88708-9_7", url = "http://www.christian-engelmann.info/publications/vallee08virtual.pdf", url2 = "", abstract = "Distributed and parallel systems are typically managed with static settings: the operating system (OS) and the runtime environment (RTE) are specified at a given time and cannot be changed to fit an application's needs. This means that every time application developers want to use their application on a new execution platform, the application has to be ported to this new environment, which may be expensive in terms of application modifications and developer time. However, the science resides in the applications and not in the OS or the RTE. Therefore, it should be beneficial to adapt the OS and the RTE to the application instead of adapting the applications to the OS and the RTE. This document presents the concept of Virtual System Environments (VSE), which enables application developers to specify and create a virtual environment that properly fits their application's needs. For that, four challenges have to be addressed: (i) definition of the VSE itself by the application developers, (ii) deployment of the VSE, (iii) system administration for the platform, and (iv) protection of the platform from the running VSE. We therefore present an integrated tool for the definition and deployment of VSEs on top of traditional and virtual (i.e., using system-level virtualization) execution platforms. This tool provides the capability to choose the degree of delegation for system administration tasks and the degree of protection from the application (e.g., using virtual machines). To summarize, the VSE concept enables the customization of the OS/RTE used for the execution of applications by users without compromising local system administration rules and execution platform protection constraints.", pts = "28239" }
@conference{tikotekar08analysis, author = "Anand Tikotekar and Geoffroy Vall\'ee and Thomas Naughton and Hong H. Ong and Christian Engelmann and Stephen L. Scott", title = "An Analysis of {HPC} Benchmark Applications in Virtual Machine Environments", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://europar2008.caos.uab.es}{$14^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2008}: \href{http://scilytics.com/vhpc}{$3^{rd}$ Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC) 2008}", volume = "5415", pages = "63--71", month = aug # "~26-29, ", year = "2008", address = "Las Palmas de Gran Canaria, Spain", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "978-3-642-00954-9", doi = "10.1007/978-3-642-00955-6", url = "http://www.christian-engelmann.info/publications/tikotekar08analysis.pdf", url2 = "http://www.christian-engelmann.info/publications/tikotekar08analysis.ppt.pdf", abstract = "Virtualization technology has been gaining acceptance in the scientific community due to its overall flexibility in running HPC applications. It has been reported that a specific class of applications is better suited to a particular type of virtualization scheme or implementation. For example, Xen has been shown to perform with little overhead for compute-bound applications. Such a study, although useful, does not allow us to generalize conclusions beyond the performance analysis of the application that is explicitly executed. One explanation of why the generalization described above is difficult may be the versatility of applications, which leads to different overheads in virtual environments. For example, two similar applications may spend a disproportionate amount of time in their respective library code when run in virtual environments. In this paper, we aim to study such potential causes by investigating the behavior and identifying patterns of various overheads for HPC benchmark applications. Based on the investigation of the overhead profiles for different benchmarks, we aim to address questions such as: Are the overhead profiles for a particular type of benchmark (such as compute-bound) similar, or are there grounds to conclude otherwise?" }
@conference{engelmann08symmetric2, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations", booktitle = "Proceedings of the \href{http://www.ens-lyon.fr/LIP/RESO/ccgrid2008}{$8^{th}$ IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008}: \href{http://xcr.cenit.latech.edu/resilience2008}{Workshop on Resiliency in High Performance Computing (Resilience) 2008}", pages = "813--818", month = may # "~19-22, ", year = "2008", address = "Lyon, France", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "978-0-7695-3156-4", doi = "10.1109/CCGRID.2008.78", url = "http://www.christian-engelmann.info/publications/engelmann08symmetric2.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann08symmetric2.pdf", abstract = "This paper summarizes our efforts over the last 3-4 years in providing symmetric active/active high availability for high-performance computing (HPC) system services. This work paves the way for high-level reliability, availability and serviceability in extreme-scale HPC systems by focusing on the most critical components, head and service nodes, and by reinforcing them with appropriate high availability solutions. This paper presents our accomplishments in the form of concepts and respective prototypes, discusses existing limitations, outlines possible future work, and describes the relevance of this research to other, planned efforts.", pts = "9996" }
@conference{chen08online, author = "Xin Chen and Benjamin Eckart and Xubin (Ben) He and Christian Engelmann and Stephen L. Scott", title = "An Online Controller Towards Self-Adaptive File System Availability and Performance", booktitle = "Proceedings of the \href{http://xcr.cenit.latech.edu/hapcw2008}{$5^{th}$ High Availability and Performance Workshop (HAPCW) 2008}, in conjunction with the \href{http://www.hpcsw.org}{$1^{st}$ High-Performance Computer Science Week (HPCSW) 2008}", month = apr # "~3-4, ", year = "2008", address = "Denver, CO, USA", url = "http://www.christian-engelmann.info/publications/chen08online.pdf", url2 = "http://www.christian-engelmann.info/publications/chen08online.ppt.pdf", abstract = "At the present time, it can be a significant challenge to build a large-scale distributed file system that simultaneously maintains both high availability and high performance. Although many fault tolerance technologies have been proposed and used in both commercial and academic distributed file systems to achieve high availability, most of them typically sacrifice performance for higher system availability. Additionally, recent studies show that system availability and performance are related to the system workload. In this paper, we analyze the correlations among availability, performance, and workloads based on a replication strategy, and we discuss the trade-off between availability and performance with different workloads. Our analysis leads to the design of an online controller that can dynamically achieve optimal performance and availability by tuning the system replication policy." }
@conference{tikotekar08effects, author = "Anand Tikotekar and Geoffroy Vall\'ee and Thomas Naughton and Hong H. Ong and Christian Engelmann and Stephen L. Scott and Anthony M. Filippi", title = "Effects of Virtualization on a Scientific Application -- {R}unning a Hyperspectral Radiative Transfer Code on Virtual Machines", booktitle = "Proceedings of the \href{http://www.csm.ornl.gov/srt/hpcvirt08}{$2^{nd}$ Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2008}, in conjunction with the \href{http://www.eurosys.org/2008}{$3^{rd}$ ACM SIGOPS European Conference on Computer Systems (EuroSys) 2008}", pages = "16--23", month = mar # "~31, ", year = "2008", address = "Glasgow, UK", publisher = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}", isbn = "978-1-60558-120-0", doi = "10.1145/1435452.1435455", url = "http://www.christian-engelmann.info/publications/tikotekar08effects.pdf", url2 = "http://www.christian-engelmann.info/publications/tikotekar08effects.ppt.pdf", abstract = "The topic of system-level virtualization has recently begun to receive interest for high performance computing (HPC). This is in part due to the isolation and encapsulation offered by the virtual machine. These traits enable applications to customize their environments and maintain consistent software configurations in their virtual domains. Additionally, there are mechanisms that can be used for fault tolerance like live virtual machine migration. Given these attractive benefits of virtualization, a fundamental question arises: how does this affect my scientific application? We use this as the premise for our paper and observe a real-world scientific code running on a Xen virtual machine. We studied the effects of running a radiative transfer simulation, Hydrolight, on a virtual machine. We discuss our methodology and report observations regarding the usage of virtualization with this application." }
@conference{engelmann07middleware, author = "Christian Engelmann and Hong H. Ong and Stephen L. Scott", title = "Middleware in Modern High Performance Computing System Architectures", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://www.iccs-meeting.org/iccs2007}{$7^{th}$ International Conference on Computational Science (ICCS) 2007}, Part II: \href{http://www.gup.uni-linz.ac.at/cce2007} {$4^{th}$ Special Session on Collaborative and Cooperative Environments (CCE) 2007}", volume = "4488", pages = "784--791", month = may # "~27-30, ", year = "2007", address = "Beijing, China", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "3-5407-2585-5", issn = "0302-9743", doi = "10.1007/978-3-540-72586-2_111", url = "http://www.christian-engelmann.info/publications/engelmann07middleware.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann07middleware.ppt.pdf", abstract = "A recent trend in modern high performance computing (HPC) system architectures employs lean compute nodes running a lightweight operating system (OS). Certain parts of the OS as well as other system software services are moved to service nodes in order to increase performance and scalability. This paper examines the impact of this HPC system architecture trend on HPC middleware software solutions, which traditionally equip HPC systems with advanced features, such as parallel and distributed programming models, appropriate system resource management mechanisms, remote application steering and user interaction techniques. Since the approach of keeping the compute node software stack small and simple is orthogonal to the middleware concept of adding missing OS features between OS and application, the role and architecture of middleware in modern HPC systems needs to be revisited. The result is a paradigm shift in HPC middleware design, where single middleware services are moved to service nodes, while runtime environments (RTEs) continue to reside on compute nodes.", pts = "5260" }
@conference{engelmann07transparent, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "Transparent Symmetric Active/Active Replication for Service-Level High Availability", booktitle = "Proceedings of the \href{http://ccgrid07.lncc.br}{$7^{th}$ IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007}: \href{http://www.lri.fr/~fedak/gp2pc-07} {$7^{th}$ International Workshop on Global and Peer-to-Peer Computing (GP2PC) 2007}", pages = "755--760", month = may # "~14-17, ", year = "2007", address = "Rio de Janeiro, Brazil", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "0-7695-2833-3", doi = "10.1109/CCGRID.2007.116", url = "http://www.christian-engelmann.info/publications/engelmann07transparent.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann07transparent.ppt.pdf", abstract = "As service-oriented architectures become more important in parallel and distributed computing systems, individual service instance reliability as well as appropriate service redundancy becomes an essential necessity in order to increase overall system availability. This paper focuses on providing redundancy strategies using service-level replication techniques. Based on previous research using symmetric active/active replication, this paper proposes a transparent symmetric active/active replication approach that allows for more reuse of code between individual service-level replication implementations by using a virtual communication layer. Service- and client-side interceptors are utilized in order to provide total transparency. Clients and servers are unaware of the replication infrastructure as it provides all necessary mechanisms internally.", pts = "5259" }
@conference{engelmann07configurable, author = "Christian Engelmann and Stephen L. Scott and Hong H. Ong and Geoffroy R. Vall\'ee and Thomas Naughton", title = "Configurable Virtualized System Environments for High Performance Computing", booktitle = "Proceedings of the \href{http://www.csm.ornl.gov/srt/hpcvirt07}{$1^{st}$ Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007}, in conjunction with the \href{http://www.eurosys.org/2008}{$2^{nd}$ ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007}", month = mar # "~20, ", year = "2007", address = "Lisbon, Portugal", url = "http://www.christian-engelmann.info/publications/engelmann07configurable.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann07configurable.ppt.pdf", abstract = "Existing challenges for current terascale high performance computing (HPC) systems are increasingly hampering the development and deployment efforts of system software and scientific applications for next-generation petascale systems. The expected rapid system upgrade interval toward petascale scientific computing demands an incremental strategy for the development and deployment of legacy and new large-scale scientific applications that avoids excessive porting. Furthermore, system software developers as well as scientific application developers require access to large-scale testbed environments in order to test individual solutions at scale. This paper proposes to address these issues at the system software level through the development of a virtualized system environment (VSE) for scientific computing. The proposed VSE approach enables plug-and-play supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor virtualization technologies. This paper describes the VSE system architecture in detail, discusses needed tools for VSE system management and configuration, and presents respective VSE use case scenarios.", pts = "5703" }
@conference{engelmann06towards, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "Towards High Availability for High-Performance Computing System Services: {A}ccomplishments and Limitations", booktitle = "Proceedings of the \href{http://xcr.cenit.latech.edu/hapcw2006}{$4^{th}$ High Availability and Performance Workshop (HAPCW) 2006}, in conjunction with the \href{http://lacsi.krellinst.org} {$7^{th}$ Los Alamos Computer Science Institute (LACSI) Symposium 2006}", month = oct # "~17, ", year = "2006", address = "Santa Fe, NM, USA", url = "http://www.christian-engelmann.info/publications/engelmann06towards.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann06towards.ppt.pdf", abstract = "During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University focused on efficient redundancy strategies for head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, like job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment.", pts = "3736" }
@conference{ou06achieving, author = "Li Ou and Xin Chen and Xubin (Ben) He and Christian Engelmann and Stephen L. Scott", title = "Achieving Computational {I/O} Efficiency in a High Performance Cluster Using Multicore Processors", booktitle = "Proceedings of the \href{http://xcr.cenit.latech.edu/hapcw2006}{$4^{th}$ High Availability and Performance Workshop (HAPCW) 2006}, in conjunction with the \href{http://lacsi.krellinst.org} {$7^{th}$ Los Alamos Computer Science Institute (LACSI) Symposium 2006}", month = oct # "~17, ", year = "2006", address = "Santa Fe, NM, USA", url = "http://www.christian-engelmann.info/publications/ou06achieving.pdf", url2 = "http://www.christian-engelmann.info/publications/ou06achieving.ppt.pdf", abstract = "Cluster computing has become one of the most popular platforms for high-performance computing today. The recent popularity of multicore processors provides a flexible way to increase the computational capability of clusters. Although the system performance may improve with multicore processors in a cluster, I/O requests initiated by multiple cores may saturate the I/O bus, and furthermore increase the latency by issuing multiple non-contiguous disk accesses. In this paper, we propose an asymmetric collective I/O for multicore processors to improve multiple non-contiguous accesses. In our configuration, one core in each multicore processor is designated as the coordinator, and others serve as computing cores. The coordinator is responsible for aggregating I/O operations from computing cores and submitting a contiguous request. The coordinator allocates contiguous memory buffers on behalf of other cores to avoid redundant data copies.", pts = "4222" }
@conference{engelmann06rmix, author = "Christian Engelmann and George A. (Al) Geist", title = "{RMIX}: {A} Dynamic, Heterogeneous, Reconfigurable Communication Framework", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://www.iccs-meeting.org/iccs2006}{$6^{th}$ International Conference on Computational Science (ICCS) 2006}, Part II: \href{http://www.gup.uni-linz.ac.at/cce2006} {$3^{rd}$ Special Session on Collaborative and Cooperative Environments (CCE) 2006}", volume = "3992", pages = "573--580", month = may # "~28-31, ", year = "2006", address = "Reading, UK", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "3-540-34381-4", issn = "0302-9743", doi = "10.1007/11758525_77", url = "http://www.christian-engelmann.info/publications/engelmann06rmix.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann06rmix.ppt.pdf", abstract = "RMIX is a dynamic, heterogeneous, reconfigurable communication framework that allows software components to communicate using various RMI/RPC protocols, such as ONC RPC, Java RMI and SOAP, by facilitating dynamically loadable provider plug-ins to supply different protocol stacks. With this paper, we present a native (C-based), flexible, adaptable, multi-protocol RMI/RPC communication framework that complements the Java-based RMIX variant previously developed by our partner team at Emory University. Our approach offers the same multi-protocol RMI/RPC services and advanced invocation semantics via a C-based interface that does not require an object-oriented programming language. This paper provides a detailed description of our RMIX framework architecture and some of its features. It describes the general use case of the RMIX framework and its integration into the Harness metacomputing environment in the form of a plug-in.", pts = "1490" }
@conference{engelmann06active, author = "Christian Engelmann and Stephen L. Scott and Chokchai (Box) Leangsuksun and Xubin (Ben) He", title = "Active/Active Replication for Highly Available {HPC} System Services", booktitle = "Proceedings of the \href{http://www.ares-conference.eu/ares2006}{$1^{st}$ International Conference on Availability, Reliability and Security (ARES) 2006}: $1^{st}$ International Workshop on Frontiers in Availability, Reliability and Security (FARES) 2006", pages = "639--645", month = apr # "~20-22, ", year = "2006", address = "Vienna, Austria", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "0-7695-2567-9", doi = "10.1109/ARES.2006.23", url = "http://www.christian-engelmann.info/publications/engelmann06active.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann06active.ppt.pdf", abstract = "Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system as they render it inaccessible and unmanageable in case of a failure until repair, causing a significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease-of-use and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research.", pts = "1485" }
@conference{engelmann05concepts, author = "Christian Engelmann and Stephen L. Scott", title = "Concepts for High Availability in Scientific High-End Computing", booktitle = "Proceedings of the \href{http://xcr.cenit.latech.edu/hapcw2005}{$3^{rd}$ High Availability and Performance Workshop (HAPCW) 2005}, in conjunction with the \href{http://lacsi.rice.edu/symposium/agenda_2005}{$6^{th}$ Los Alamos Computer Science Institute (LACSI) Symposium 2005}", month = oct # "~11, ", year = "2005", address = "Santa Fe, NM, USA", url = "http://www.christian-engelmann.info/publications/engelmann05concepts.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann05concepts.ppt.pdf", abstract = "Scientific high-end computing (HEC) has become an important tool for scientists world-wide to understand problems, such as in nuclear fusion, human genomics and nanotechnology. Every year, new HEC systems emerge on the market with better performance and higher scale. With only very few exceptions, the overall availability of recently installed systems has been lower in comparison to the same deployment phase of their predecessors. In contrast to the experienced loss of availability, the demand for continuous availability has risen dramatically due to the recent trend towards capability computing. In this paper, we analyze the existing deficiencies of current HEC systems and present several high availability concepts to counter the experienced loss of availability and to alleviate the expected impact on next-generation systems. We explain the application of these concepts to current and future HEC systems and list past and ongoing related research. This paper closes with a short summary of the presented work and a brief discussion of future efforts.", pts = "3777" }
@conference{engelmann05high, author = "Christian Engelmann and Stephen L. Scott", title = "High Availability for Ultra-Scale High-End Scientific Computing", booktitle = "Proceedings of the \href{http://coset.irisa.fr}{$2^{nd}$ International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005}, in conjunction with the \href{http://ics05.csail.mit.edu}{$19^{th}$ ACM International Conference on Supercomputing (ICS) 2005}", month = jun # "~19, ", year = "2005", address = "Cambridge, MA, USA", url = "http://www.christian-engelmann.info/publications/engelmann05high.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann05high.ppt.pdf", abstract = "Ultra-scale architectures for scientific high-end computing with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, suffer from availability deficiencies, which impact the efficiency of running computational jobs by forcing frequent checkpointing of applications. Most systems are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services, such as the job scheduler or MPI, or even of the entire machine. In this paper, we present a flexible, pluggable and component-based high availability framework that expands today's effort in high availability computing of keeping a single server alive to include all machines cooperating in a high-end scientific computing environment, while allowing adaptation to system properties and application needs." }
@conference{leangsuksun05asymmetric, author = "Chokchai (Box) Leangsuksun and Venkata K. Munganuru and Tong Liu and Stephen L. Scott and Christian Engelmann", title = "Asymmetric Active-Active High Availability for High-end Computing", booktitle = "Proceedings of the \href{http://coset.irisa.fr}{$2^{nd}$ International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005}, in conjunction with the \href{http://ics05.csail.mit.edu}{$19^{th}$ ACM International Conference on Supercomputing (ICS) 2005}", month = jun # "~19, ", year = "2005", address = "Cambridge, MA, USA", url = "http://www.christian-engelmann.info/publications/leangsuksun05asymmetric.pdf", url2 = "http://www.christian-engelmann.info/publications/leangsuksun05asymmetric.ppt.pdf", abstract = "Linux clusters have become very popular for scientific computing at research institutions world-wide, because they can be easily deployed at a fairly low cost. However, the most pressing issues of today's cluster solutions are availability and serviceability. The conventional Beowulf cluster architecture has a single head node connected to a group of compute nodes. This head node is a typical single point of failure and control, which severely limits availability and serviceability by effectively cutting off healthy compute nodes from the outside world upon overload or failure. In this paper, we describe a paradigm that addresses this issue using asymmetric active-active high availability. Our framework comprises n + 1 head nodes, where n head nodes are active in the sense that they provide services to simultaneously incoming user requests. One standby server monitors all active servers and performs a fail-over in case of a detected outage. We present a prototype implementation based on a 2 + 1 solution and discuss initial results." }
@conference{engelmann05lightweight, author = "Christian Engelmann and George A. (Al) Geist", title = "A Lightweight Kernel for the Harness Metacomputing Framework", booktitle = "Proceedings of the \href{http://www.ipdps.org/ipdps2005}{$19^{th}$ IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2005}: \href{http://www.cs.umass.edu/~rsnbrg/hcw2005} {$14^{th}$ Heterogeneous Computing Workshop (HCW) 2005}", month = apr # "~4, ", year = "2005", address = "Denver, CO, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "0-7695-2312-9", issn = "1530-2075", doi = "10.1109/IPDPS.2005.34", url = "http://www.christian-engelmann.info/publications/engelmann05lightweight.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann05lightweight.ppt.pdf", abstract = "Harness is a pluggable heterogeneous Distributed Virtual Machine (DVM) environment for parallel and distributed scientific computing. This paper describes recent improvements in the Harness kernel design. By using a lightweight approach and moving previously integrated system services into software modules, the software becomes more versatile and adaptable. This paper outlines these changes and explains the major Harness kernel components in more detail. A short overview is given of ongoing efforts in integrating RMIX, a dynamic heterogeneous reconfigurable communication framework, into the Harness environment as a new plug-in software module. We describe the overall impact of these changes and how they relate to other ongoing work." }
@conference{engelmann04high, author = "Christian Engelmann and Stephen L. Scott and George A. (Al) Geist", title = "High Availability through Distributed Control", booktitle = "Proceedings of the \href{http://xcr.cenit.latech.edu/hapcw2004}{$2^{nd}$ High Availability and Performance Workshop (HAPCW) 2004}, in conjunction with the \href{http://lacsi.rice.edu/symposium/agenda_2004}{$5^{th}$ Los Alamos Computer Science Institute (LACSI) Symposium 2004}", month = oct # "~12, ", year = "2004", address = "Santa Fe, NM, USA", url = "http://www.christian-engelmann.info/publications/engelmann04high.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann04high.ppt.pdf", abstract = "Cost-effective, flexible and efficient scientific simulations in cutting-edge research areas utilize huge high-end computing resources with thousands of processors. In the next five to ten years the number of processors in such computer systems will rise to tens of thousands, while scientific application running times are expected to increase further beyond the Mean-Time-To-Interrupt (MTTI) of hardware and system software components. This paper describes the ongoing research in heterogeneous adaptable reconfigurable networked systems (Harness) and its recent achievements in the area of high availability distributed virtual machine environments for parallel and distributed scientific computing. It shows how a distributed control algorithm is able to steer a distributed virtual machine process in virtual synchrony while maintaining consistent replication for high availability. It briefly illustrates ongoing work in heterogeneous reconfigurable communication frameworks and security mechanisms. The paper continues with a short overview of similar research in reliable group communication frameworks, fault-tolerant process groups and highly available distributed virtual processes. It closes with a brief discussion of possible future research directions." }
@conference{he04highly, author = "Xubin (Ben) He and Li Ou and Stephen L. Scott and Christian Engelmann", title = "A Highly Available Cluster Storage System using Scavenging", booktitle = "Proceedings of the \href{http://xcr.cenit.latech.edu/hapcw2004}{$2^{nd}$ High Availability and Performance Workshop (HAPCW) 2004}, in conjunction with the \href{http://lacsi.rice.edu/symposium/agenda_2004}{$5^{th}$ Los Alamos Computer Science Institute (LACSI) Symposium 2004}", month = oct # "~12, ", year = "2004", address = "Santa Fe, NM, USA", url = "http://www.christian-engelmann.info/publications/he04highly.pdf", url2 = "http://www.christian-engelmann.info/publications/he04highly.ppt.pdf", abstract = "Highly available data storage for high-performance computing is becoming increasingly more critical as high-end computing systems scale up in size and storage systems are developed around network-centered architectures. A promising solution is to harness the collective storage potential of individual workstations much as we harness idle CPU cycles due to the excellent price/performance ratio and low storage usage of most commodity workstations. For such a storage system, metadata consistency is a key issue in assuring storage system availability as well as data reliability. In this paper, we present a decentralized metadata management scheme that improves storage availability without sacrificing performance." }
@conference{engelmann03diskless, author = "Christian Engelmann and George A. (Al) Geist", title = "A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform", booktitle = "Proceedings of the \href{http://www.cs.msstate.edu/~clade2003}{Challenges of Large Applications in Distributed Environments Workshop (CLADE) 2003}, in conjunction with the \href{http://csag.ucsd.edu/HPDC-12}{$12^{th}$ IEEE International Symposium on High Performance Distributed Computing (HPDC) 2003}", pages = "47", month = jun # "~21, ", year = "2003", address = "Seattle, WA, USA", publisher = "\href{http://www.computer.org}{IEEE Computer Society, Los Alamitos, CA, USA}", isbn = "0-7695-1984-9", doi = "xpls/abs_all.jsp?arnumber=4159902", url = "http://www.christian-engelmann.info/publications/engelmann03diskless.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann03diskless.ppt.pdf", abstract = "This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM Blue Gene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from occurring failures more efficiently. In this paper, we adapt the present technique of diskless checkpointing to such huge distributed systems in order to equip existing scientific algorithms with super-scalable fault-tolerance. First, we discuss the method of diskless checkpointing, then we adapt this technique to super-scale architectures and finally we present results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault-tolerance." }
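As an illustration only, and not the paper's exact super-scale scheme, the Python sketch below shows the common XOR-parity form of diskless checkpointing that engelmann03diskless builds on: each process keeps its checkpoint in memory, a group computes one parity block, and any single lost checkpoint in the group can be rebuilt without disk I/O. The group size, state layout, and toy driver are assumptions made for this example.

import numpy as np

def encode_parity(checkpoints):
    """XOR-combine the in-memory checkpoints of one process group into a parity block."""
    parity = np.zeros_like(checkpoints[0])
    for state in checkpoints:
        parity ^= state          # bitwise XOR over integer-typed arrays
    return parity

def recover(surviving_states, parity):
    """Rebuild the single missing checkpoint by XOR-ing the parity with all survivors."""
    rebuilt = parity.copy()
    for state in surviving_states:
        rebuilt ^= state
    return rebuilt

# toy usage: a group of 4 processes, each with a small integer state vector
states = [np.random.randint(0, 2**16, size=8).astype(np.uint32) for _ in range(4)]
parity = encode_parity(states)
survivors = [s for rank, s in enumerate(states) if rank != 2]   # process 2 "fails"
assert np.array_equal(recover(survivors, parity), states[2])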
@conference{engelmann02distributed, author = "Christian Engelmann and Stephen L. Scott and George A. (Al) Geist", title = "Distributed Peer-to-Peer Control in {Harness}", booktitle = "Lecture Notes in Computer Science: Proceedings of the \href{http://www.science.uva.nl/events/ICCS2002}{$2^{nd}$ International Conference on Computational Science (ICCS) 2002}, Part II: Workshop on Global and Collaborative Computing", volume = "2330", pages = "720--727", month = apr # "~21-24, ", year = "2002", address = "Amsterdam, The Netherlands", publisher = "\href{http://www.springer.com}{Springer Verlag, Berlin, Germany}", isbn = "3-540-43593-X", issn = "0302-9743", doi = "content/l537ujfwt8yta2dp", url = "http://www.christian-engelmann.info/publications/engelmann02distributed.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann02distributed.ppt.pdf", abstract = "Harness is an adaptable fault-tolerant virtual machine environment for next-generation heterogeneous distributed computing developed as a follow on to PVM. It additionally enables the assembly of applications from plug-ins and provides fault-tolerance. This work describes the distributed control, which manages global state replication to ensure a high-availability of service. Group communication services achieve an agreement on an initial global state and a linear history of global state changes at all members of the distributed virtual machine. This global state is replicated to all members to easily recover from single, multiple and cascaded faults. A peer-to-peer ring network architecture and tunable multi-point failure conditions provide heterogeneity and scalability. Finally, the integration of the distributed control into the multi-threaded kernel architecture of Harness offers a fault-tolerant global state database service for plug-ins and applications." }
@misc{engelmann23intersect, author = "Christian Engelmann and Swen Boehm and Michael Brim and Jack Lange and Thomas Naughton and Patrick Widener and Ben Mintz and Rohit Srivastava", title = "INTERSECT: The Open Federated Architecture for the Laboratory of the Future", month = aug # "~7-10, ", year = "2023", howpublished = "{Poster at the \href{https://icpp23.sci.utah.edu/} {52nd International Conference on Parallel Processing (ICPP) 2023}, Salt Lake City, UT, USA}", url = "http://www.christian-engelmann.info/publications/engelmann23intersect.ppt.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann23intersect.pdf", abstract = "The open Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) architecture connects scientific instruments and robot-controlled laboratories with computing and data resources at the edge, the Cloud or the high-performance computing center to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence driven design, discovery and evaluation. Its novel approach consists of science use case design patterns, a system of systems architecture, and a microservice architecture." }
@misc{engelmann22resilience, author = "Christian Engelmann and Mohit Kumar", title = "Resilience Design Patterns: A Structured Modeling Approach of Resilience in Computing Systems", month = aug # "~10-12, ", year = "2022", howpublished = "{Poster at the \href{https://www.bnl.gov/modsim2022} {Workshop on Modeling and Simulation of Systems and Applications (ModSim) 2022}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann22resilience.ppt.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann22resilience.pdf", abstract = "Resilience to faults, errors, and failures in extreme-scale high-performance computing (HPC) systems is a critical challenge. Resilience design patterns (Figure 1) offer a new, structured hardware/software design approach for improving resilience by identifying and evaluating repeatedly occurring resilience problems and coordinating corresponding solutions. Initial work identified and formalized these patterns and developed a proof-of-concept prototype to demonstrate portable resilience. This recent work created performance, reliability, and availability models for each of the identified 15 structural resilience design patterns and a modeling tool that allows (1) exploring the performance, reliability, and availability of each pattern, and (2) investigating the trade-offs between patterns and pattern combinations." }
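The per-pattern performance, reliability, and availability models mentioned in engelmann22resilience are not reproduced in this entry. As a hedged illustration of the kind of quantities such a modeling tool evaluates, the Python sketch below computes textbook steady-state availability from MTTF and MTTR and Young's first-order approximation of the optimal checkpoint interval; the numeric inputs are invented for the example.

import math

def availability(mttf_hours, mttr_hours):
    """Steady-state fraction of time a component is up, assuming alternating up/down periods."""
    return mttf_hours / (mttf_hours + mttr_hours)

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal interval between checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# example inputs: 5 h MTTF, 10 min MTTR, 60 s checkpoint cost
print(availability(5.0, 10.0 / 60.0))        # roughly 0.97
print(young_interval(60.0, 5.0 * 3600.0))    # roughly 1470 s between checkpoints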
@misc{hui18realtime, author = "Yawei Hui and Rizwan Ashraf and Byung Hoon (Hoony) Park and Christian Engelmann", title = "Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing", month = dec # "~10-13, ", year = "2018", howpublished = "{Poster at the \href{http://cci.drexel.edu/bigdata/bigdata2018} {$6^{th}$ IEEE International Conference on Big Data (BigData) 2018}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/hui18realtime.pdf", abstract = "Supercomputers are complex systems used to simulate, understand and solve real-world problems. In order to operate these systems efficiently and for the purpose of their maintainability, an accurate, concise, and timely determination of system status is crucial for its users and operators. However, this determination is challenging due to intricately connected heterogeneous software and hardware components, and due to the sheer scale of such machines. In this poster, we demonstrate work-in-progress towards the realization of a real-time monitoring framework for the 18,688-node Titan supercomputer at Oak Ridge Leadership Computing Facility (OLCF). Toward this end, we discuss the use of metrics which present a one-dimensional view of the system generating various types of information from 1000s of components and utilization statistics from 100s of user applications in near real-time. We demonstrate the efficacy of these metrics to understand and visualize raw log data generated by the system which may otherwise comprise 1000s of dimensions. We also demonstrate the architecture of the proposed real-time stream processing framework which integrates, processes, analyzes, visualizes and stores system log data from an array of system components." }
@misc{hui18comprehensive, author = "Yawei Hui and Byung Hoon (Hoony) Park and Christian Engelmann", title = "A Comprehensive Informative Metric for Summarizing {HPC} System Status", month = oct # "~21, ", year = "2018", howpublished = "{Poster at the \href{http://ldav.org} {$8^{th}$ IEEE Symposium on Large Data Analysis and Visualization} in conjunction with the \href{http://ieeevis.org/year/2018}{$8^{th}$ IEEE Vis 2018}, Berlin, Germany}", url = "http://www.christian-engelmann.info/publications/hui18comprehensive.pdf", abstract = "It remains a major challenge to effectively summarize and visualize in a comprehensive form the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). In the ongoing research highlighted in this poster, we present system information entropy (SIE), a newly developed system metric that leverages the power of traditional machine learning techniques and information theory. By compressing the multi-variant multi-dimensional event information recorded during the operation of the targeted system into a single time series of SIE, we demonstrate that the historical system status can be sensitively summarized in the form of SIE and visualized concisely and comprehensively." }
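The exact definition of the system information entropy (SIE) metric is not given in the hui18comprehensive entry above, so the Python sketch below only illustrates the general idea of collapsing windowed event logs into a single time series, using plain Shannon entropy over event-type frequencies per time window; the window contents and event names are invented for the example.

import math
from collections import Counter

def window_entropy(event_types):
    """Shannon entropy (in bits) of the event-type distribution within one time window."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

log_windows = [
    ["heartbeat", "heartbeat", "heartbeat"],               # quiet window
    ["heartbeat", "ecc_error", "link_flap", "ecc_error"],  # noisier window
]
sie_like_series = [window_entropy(w) for w in log_windows]
print(sie_like_series)   # low entropy for the quiet window, higher for the noisy one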
@misc{engelmann18modeling2, author = "Christian Engelmann and Rizwan Ashraf", title = "Modeling and Simulation of Extreme-Scale Systems for Resilience by Design", month = aug # "~15-17, ", year = "2018", howpublished = "{Poster at the \href{https://www.bnl.gov/modsim2018} {Workshop on Modeling and Simulation of Systems and Applications}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann18modeling2.pdf", abstract = "Resilience is a serious concern for extreme-scale high-performance computing (HPC). While the HPC community has developed various resilience solutions, the solution space remains fragmented. We created a structured approach to the design, evaluation and optimization of HPC resilience using the concept of design patterns. A design pattern describes a generalized solution to a repeatedly occurring problem. We identified the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each well-known solution that addresses a specific resilience challenge is described in the form of a design pattern. We developed a resilience design pattern specification, language and catalog, which can be used by system architects, system software and library developers, application programmers, as well as users and operators as essential building blocks when designing and deploying resilience solutions. The resilience design pattern approach provides a unique opportunity for design space exploration. As each resilience solution is abstracted as a pattern and each solution's properties are defined by pattern parameters, vertical and horizontal pattern compositions can describe the resilience capabilities of an entire HPC system. This permits the investigation of beneficial or counterproductive interactions between patterns and of the performance, resilience, and power consumption trade-off between different pattern parameters and compositions. The ultimate goal is to make resilience an integral part of the HPC hardware/software ecosystem by coordinating the various existing resilience solutions in a design space exploration process, such that the burden for providing resilience is on the system by design and not on the user as an afterthought. We are in the early stages of developing a novel design space exploration tool that enables this investigation using modeling and simulation. We developed performance and resilience models for each resilience design pattern. We also leverage results from the Catalog project, a collaborative effort between Oak Ridge National Laboratory, Argonne National Laboratory and Lawrence Livermore National Laboratory that developed models of the faults, errors and failures in today's HPC systems. We also leverage recent results from the same project by Lawrence Livermore National Laboratory in application reliability patterns. The planned research extends and combines this work to model the performance, resilience, and power consumption of an entire HPC system, initially at node-level granularity, and to simulate the dynamic interactions between deployed resilience solutions and the rest of the system. In the next iteration, finer-grain modeling and simulation, such as at the computational unit level, is used to increase accuracy. This work leverages the experience of the investigators in parallel discrete event simulation of extreme-scale systems, such as the Extreme-scale Simulator (xSim). The current state of the art in resilience modeling and simulation is fragmented as well. 
There is currently no such design space exploration tool. Instead, each resilience solution is typically investigated separately. There is only a small amount of work on multi-resilience solutions, including by the investigators. While there is work in investigating the performance/resilience trade-off space, there is almost no work in including power consumption." }
@misc{patil17exploring, author = "Onkar Patil and Saurabh Hukerikar and Frank Mueller and Christian Engelmann", title = "Exploring Use Cases for Non-Volatile Memories in Support of HPC Resilience", month = nov # "~12-17, ", year = "2017", howpublished = "{Poster at the \href{http://sc17.supercomputing.org} {30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017}, Denver, CO, USA}", url = "http://www.christian-engelmann.info/publications/patil17exploring.pdf", url2 = "http://www.christian-engelmann.info/publications/patil17exploring.ppt.pdf", abstract = "Improving resilience and creating resilient architectures is one of the major goals of exascale computing. With the advent of Non-volatile memory technologies, memory architectures with persistent memory regions will be a significant part of future architectures. There is potential to use them in more than one way to benefit different applications. We look to take advantage of this technology to enable a more fine-grained and novel methodology that will improve resilience and efficiency of exascale applications. We have developed three modes of memory usage for persistent memory to enable efficient checkpointing in HPC applications. We have developed a simple API that is evaluated with the DGEMM benchmark on a 16-node cluster with independent SSDs on every node. Our aim is to build on this work and enable static and dynamic runtime systems that will inherently make the HPC applications more fault-tolerant and resistant to errors." }
@misc{fiala11detection, author = "David Fiala and Frank Mueller and Christian Engelmann and Rolf Riesen and Kurt Ferreira", title = "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing", month = nov # "~12-18, ", year = "2011", howpublished = "{Poster at the \href{http://sc11.supercomputing.org} {24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011}, Seattle, WA, USA}", url = "", abstract = "Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults will not be detected, manifesting themselves as silent errors that will corrupt memory while applications continue to operate and report incorrect results. This poster introduces RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source. By providing redundancy, RedMPI is capable of transparently detecting corrupt messages from MPI processes that become faulted during execution. Furthermore, with triple redundancy RedMPI additionally ``votes'' out MPI messages of a faulted process by replacing corrupted results with corrected results from unfaulted processes. We present an experimental evaluation of RedMPI on an assortment of applications to demonstrate the effectiveness of this approach." }
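RedMPI itself lives in the MPI profiling layer and is not reproduced here; the Python sketch below is only a minimal, stand-alone illustration of the majority-voting idea described in the fiala11detection abstract, applied to redundant copies of a message payload.

def vote(replicas):
    """Return the majority payload among redundant message copies, or None if there is no majority."""
    for candidate in replicas:
        if sum(1 for r in replicas if r == candidate) * 2 > len(replicas):
            return candidate
    return None   # disagreement with no majority: corruption detected but not correctable

good = b"\x00\x01\x02\x03"
corrupt = b"\x00\x81\x02\x03"      # one replica with a single flipped bit
assert vote([good, good, corrupt]) == good         # two matching copies outvote the corrupted one
assert vote([good, corrupt, b"\xff"]) is None      # three-way disagreement is only detectable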
@misc{fiala11tunable2, author = "David Fiala and Kurt Ferreira and Frank Mueller and Christian Engelmann", title = "A Tunable, Software-based {DRAM} Error Detection and Correction Library for {HPC}", month = nov # "~12-18, ", year = "2011", howpublished = "{Poster at the \href{http://sc11.supercomputing.org} {24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011}, Seattle, WA, USA}", url = "", abstract = "Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error frequency rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification by utilizing the MMU. Experimental benchmarks with Mantevo HPCCG show that once tuned, LIBSDC is able to achieve SDC protection with less than 100\% overhead of resources." }
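LIBSDC uses the MMU for on-demand page integrity verification; the Python sketch below is only a loose, hedged analogy with no MMU or page-protection involvement. It keeps a CRC32 per fixed-size chunk of a buffer and re-checks the chunks later; the chunk size and the injected bit flip are assumptions made for this example.

import zlib

PAGE = 4096

def checksum_pages(buf):
    """Record a CRC32 for each PAGE-sized chunk of a bytes-like buffer."""
    return [zlib.crc32(buf[i:i + PAGE]) for i in range(0, len(buf), PAGE)]

def verify(buf, crcs):
    """Return the indices of chunks whose contents no longer match their recorded CRC."""
    return [n for n, i in enumerate(range(0, len(buf), PAGE))
            if zlib.crc32(buf[i:i + PAGE]) != crcs[n]]

data = bytearray(3 * PAGE)
crcs = checksum_pages(data)
data[5000] ^= 0x40                  # inject a single bit flip into chunk 1
print(verify(data, crcs))           # -> [1]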
@misc{scott09tunable2, author = "Stephen L. Scott and Christian Engelmann and Geoffroy R. Vall\'ee and Thomas Naughton and Anand Tikotekar and George Ostrouchov and Chokchai (Box) Leangsuksun and Nichamon Naksinehaboon and Raja Nassar and Mihaela Paun and Frank Mueller and Chao Wang and Arun B. Nagarajan and Jyothish Varma", title = "A Tunable Holistic Resiliency Approach for High-Performance Computing Systems", month = aug # "~12-14, ", year = "2009", howpublished = "{Poster at the \href{http://institute.lanl.gov/resilience/conferences/2009} {National HPC Workshop on Resilience 2009}, Arlington, VA, USA}", url = "http://www.christian-engelmann.info/publications/scott09tunable2.pdf", abstract = "In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework." }
@misc{scott09systemlevel, author = "Stephen L. Scott and Geoffroy R. Vall\'ee and Thomas Naughton and Anand Tikotekar and Christian Engelmann and Hong H. Ong", title = "System-level Virtualization for High-Performance Computing", month = aug # "~12-14, ", year = "2009", howpublished = "{Poster at the \href{http://institute.lanl.gov/resilience/conferences/2009} {National HPC Workshop on Resilience 2009}, Arlington, VA, USA}", url = "http://www.christian-engelmann.info/publications/scott09systemlevel.pdf", abstract = "This poster summarizes our past and ongoing research and development efforts in novel system software solutions for providing a virtual system environment (VSE) for next-generation extreme-scale high-performance computing (HPC) systems and beyond. The poster showcases results of developed proof-of-concept implementations and performed theoretical analyses, outlines planned research and development activities, and presents respective initial results." }
@misc{scott09tunable, author = "Stephen L. Scott and Christian Engelmann and Geoffroy R. Vall\'ee and Thomas Naughton and Anand Tikotekar and George Ostrouchov and Chokchai (Box) Leangsuksun and Nichamon Naksinehaboon and Raja Nassar and Mihaela Paun and Frank Mueller and Chao Wang and Arun B. Nagarajan and Jyothish Varma", title = "A Tunable Holistic Resiliency Approach for High-Performance Computing Systems", month = feb # "~14-18, ", year = "2009", howpublished = "{Poster at the \href{http://ppopp09.rice.edu}{$14^{th}$ ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009}, Raleigh, NC, USA}", url = "http://www.christian-engelmann.info/publications/scott09tunable.pdf", abstract = "In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework." }
@misc{geist08harness, author = "George A. (Al) Geist and Christian Engelmann and Jack J. Dongarra and George Bosilca and Magdalena M. S\l{}awi\'nska and Jaros\l{}aw K. S\l{}awi\'nski", title = "The {Harness} Workbench: {U}nified and Adaptive Access to Diverse High-Performance Computing Platforms", month = mar # "~30 - " # apr # "~5, ", year = "2008", howpublished = "{Poster at the \href{http://www.hpcsw.org}{$1^{st}$ High-Performance Computer Science Week (HPCSW) 2008}, Denver, CO, USA}", url = "http://www.christian-engelmann.info/publications/geist08harness.pdf", abstract = "This poster summarizes our past and ongoing research and development efforts in novel software solutions for providing unified and adaptive access to diverse high-performance computing (HPC) platforms. The poster showcases developed proof-of-concept implementations of tools and mechanisms that simplify scientific application development and deployment tasks, such that only minimal adaptation is needed when moving from one HPC system to another or after HPC system upgrades." }
@misc{scott08resiliency, author = "Stephen L. Scott and Christian Engelmann and Hong H. Ong and Geoffroy R. Vall\'ee and Thomas Naughton and Anand Tikotekar and George Ostrouchov and Chokchai (Box) Leangsuksun and Nichamon Naksinehaboon and Raja Nassar and Mihaela Paun and Frank Mueller and Chao Wang and Arun B. Nagarajan and Jyothish Varma and Xubin (Ben) He and Li Ou and Xin Chen", title = "Resiliency for High-Performance Computing Systems", month = mar # "~30 - " # apr # "~5, ", year = "2008", howpublished = "{Poster at the \href{http://www.hpcsw.org}{$1^{st}$ High-Performance Computer Science Week (HPCSW) 2008}, Denver, CO, USA}", url = "http://www.christian-engelmann.info/publications/scott08resiliency.pdf", abstract = "This poster summarizes our past and ongoing research and development efforts in novel system software solutions for providing high-level reliability, availability and serviceability (RAS) for next-generation extreme-scale high-performance computing (HPC) systems and beyond. The poster showcases results of developed proof-of-concept implementations and performed theoretical analyses, outlines planned research and development activities, and presents respective initial results." }
@misc{scott08systemlevel, author = "Stephen L. Scott and Geoffroy R. Vall\'ee and Thomas Naughton and Anand Tikotekar and Christian Engelmann and Hong H. Ong", title = "System-level Virtualization for High-Performance Computing", month = mar # "~30 - " # apr # "~5, ", year = "2008", howpublished = "{Poster at the \href{http://www.hpcsw.org}{$1^{st}$ High-Performance Computer Science Week (HPCSW) 2008}, Denver, CO, USA}", url = "http://www.christian-engelmann.info/publications/scott08systemlevel.pdf", abstract = "This poster summarizes our past and ongoing research and development efforts in novel system software solutions for providing a virtual system environment (VSE) for next-generation extreme-scale high-performance computing (HPC) systems and beyond. The poster showcases results of developed proof-of-concept implementations and performed theoretical analyses, outlines planned research and development activities, and presents respective initial results." }
@misc{adamson21cybersecurity, author = "Ryan Adamson and Christian Engelmann", title = "Cybersecurity and Privacy for Instrument-to-Edge-to-Center Scientific Computing Ecosystems", howpublished = "White paper accepted at the U.S. Department of Energy's \href{https://www.orau.gov/2021ascr-cybersecurity} {ASCR Workshop on Cybersecurity and Privacy for Scientific Computing Ecosystems}", month = nov # "~3-5, ", year = "2021", url = "http://www.christian-engelmann.info/publications/adamson21cybersecurity.pdf", abstract = "The DOE's Artificial Intelligence (AI) for Science report outlines the need for intelligent systems, instruments, and facilities to enable science breakthroughs with autonomous experiments, 'self-driving' laboratories, smart manufacturing, and AI-driven design, discovery and evaluation. The DOE's Computational Facilities Research Workshop report identifies intelligent systems/facilities as a challenge with enabling automation and eliminating human-in-the-loop needs as a cross-cutting theme. Autonomous experiments, 'self-driving' laboratories and smart manufacturing employ machine-in-the-loop intelligence for decision-making. Human-in-the-loop needs are reduced by an autonomous online control that collects experiment data, analyzes it, and takes appropriate operational actions in real time to steer an ongoing or plan the next experiment. DOE laboratories are currently in the process of developing and deploying federated hardware/software architectures for connecting instruments with edge and center computing resources to autonomously collect, transfer, store, process, curate, and archive scientific data. These new instrument-to-edge-to-center scientific ecosystems face several cybersecurity and privacy challenges." }
@misc{li21toward, author = "Mingyan Li and Robert A. Bridges and Pablo Moriano and Christian Engelmann and Feiyi Wang and Ryan Adamson", title = "Toward Effective Security/Reliability Situational Awareness via Concurrent Security-or-Fault Analytics", howpublished = "White paper accepted at the U.S. Department of Energy's \href{https://www.orau.gov/2021ascr-cybersecurity} {ASCR Workshop on Cybersecurity and Privacy for Scientific Computing Ecosystems}", month = nov # "~3-5, ", year = "2021", url = "http://www.christian-engelmann.info/publications/li21toward.pdf", abstract = "Modern critical infrastructures (CI) and scientific computing ecosystems (SCE) are complex and vulnerable. The complexity of CI/SCE, such as the distributed workload found across ASCR scientific computing facilities, does not allow for easy differentiation between emerging cyber security and reliability threats. It is also not easy to correctly identify the misbehaving systems. Sometimes, system failures are just caused by unintentional user misbehavior or actual hardware/software reliability issues, but it may take a significant amount of time and effort to develop that understanding through root-cause analysis. On the security front, CI/SCE are vital assets. They are prime targets of, and are vulnerable to, malicious cyber-attacks. With increasing inter-disciplinary and cross-facility collaboration within DOE (e.g., the ORNL INTERSECT initiative, next-gen supercomputing OLCF6), the traditional perimeter-based defense and the demarcation line between malicious cyber-attacks and non-malicious system faults are blurring. Amidst realistic reliability and security threats, the ability to effectively distinguish between non-malicious faults and malicious attacks is critical not only in root cause identification but also in countermeasures generation." }
@misc{finkel21research2, author = "Hal Finkel and Pete Beckman and Christian Engelmann and Shantenu Jha and Jack Lange", title = "Research Opportunities in Operating Systems for Scientific Edge Computing", howpublished = "White paper by the U.S. Department of Energy's \href{https://www.orau.gov/OSRoundtable2021} {ASCR Roundtable Discussions on Operating-Systems Research 2021}", month = jan # "~25, ", year = "2021", url = "http://www.christian-engelmann.info/publications/finkel21research2.pdf", abstract = "As scientific experiments generate ever-increasing amounts of data, and grow in operational complexity, modern experimental science demands unprecedented computational capabilities at the edge -- physically proximate to each experiment. While some requirements on these computational capabilities are shared with high-performance-computing (HPC) systems, scientific edge computing has a number of unique challenges. In the following, we survey current trends in system software and edge systems for scientific computing, associated research challenges and open questions, infrastructure requirements for operating-systems research, communities who should be involved in that research, and the anticipated benefits of success." }
@misc{finkel21research, author = "Hal Finkel and Pete Beckman and Ron Brightwell and Rudi Eigenmann and Christian Engelmann and Roberto Gioiosa and Kamil Iskra and Shantenu Jha and Jack Lange and Tapasya Patki and Kevin Pedretti", title = "Research Opportunities in Operating Systems for High-Performance Scientific Computing", howpublished = "White paper by the U.S. Department of Energy's \href{https://www.orau.gov/OSRoundtable2021} {ASCR Roundtable Discussions on Operating-Systems Research 2021}", month = jan # "~25, ", year = "2021", url = "http://www.christian-engelmann.info/publications/finkel21research.pdf", abstract = "As high-performance-computing (HPC) systems continue to evolve, with increasingly diverse and heterogeneous hardware, increasingly-complex requirements for security and multi-tenancy, and increasingly-demanding requirements for resiliency and monitoring, research in operating systems must continue to seed innovation to meet future needs. In the following, we survey current trends in system software and HPC systems for scientific computing, associated research challenges and open questions, infrastructure requirements for operating-systems research, communities who should be involved in that research, and the anticipated benefits of success." }
@misc{engelmann21resilience2, author = "Christian Engelmann", title = "Resilience by Codesign (and not as an Afterthought)", howpublished = "White paper accepted at the U.S. Department of Energy's \href{https://web.cvent.com/event/f64a4f28-b473-4808-924c-c8c3d9a2af63/} {Workshop on Reimagining Codesign 2021}", month = mar # "~16-18, ", year = "2021", url = "http://www.christian-engelmann.info/publications/engelmann21resilience2.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann21resilience2.ppt.pdf", abstract = "Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale high-performance computing (HPC). Extreme heterogeneity, i.e., using multiple, and potentially configurable, types of processors, accelerators and memory/storage in a single computing platform, will add a significant amount of complexity to the HPC hardware/software ecosystem. Hardware/software HPC codesign for resilience is mostly nonexistent at this point! Resilience needs to become an integral part of the HPC hardware/software ecosystem through codesign, such that the burden for resilience is on the system by design and not on the operator or user as an afterthought. Simply put, if resilience by design is not done now, in the early stages of extreme heterogeneity, the current state of practice for HPC resilience, global application-level checkpoint/restart, will remain the same for decades to come due to the high costs of adoption of alternatives later on." }
@misc{radojkovic20towards, author = "Petar Radojkovic and Manolis Marazakis and Paul Carpenter and Reiley Jeyapaul and Dimitris Gizopoulos and Martin Schulz and Adria Armejach and Eduard Ayguade and Fran\c{c}ois Bodin and Ramon Canal and Franck Cappello and Fabien Chaix and Guillaume Colin de Verdiere and Said Derradji and Stefano Di Carlo and Christian Engelmann and Ignacio Laguna and Miquel Moreto and Onur Mutlu and Lazaros Papadopoulos and Olly Perks and Manolis Ploumidis and Bezhad Salami and Yanos Sazeides and Dimitrios Soudris and Yiannis Sourdis and Per Stenstrom and Samuel Thibault and Will Toms and Osman Unsal", title = "Towards Resilient {EU} {HPC} Systems: {A} Blueprint", howpublished = "White paper by the \href{https://resilienthpc.eu} {European HPC resilience initiative}", month = apr # "~9, ", year = "2020", url = "http://www.christian-engelmann.info/publications/radojkovic20towards.pdf", abstract = "This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focussed on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally. This document is the first output of the ongoing European HPC resilience initiative and it covers individual nodes in HPC systems, encompassing CPU, memory, intra-node interconnect and emerging FPGA-based hardware accelerators. With community support and feedback on this initial document, we will update the analysis and expand the scope to include other types of accelerators, as well as networks and storage.", pts = "140761" }
@misc{engelmann18extreme, author = "Christian Engelmann and Rizwan Ashraf and Saurabh Hukerikar", title = "Extreme Heterogeneity with Resilience by Design (and not as an Afterthought)", howpublished = "White paper accepted at the U.S. Department of Energy's \href{https://orau.gov/exheterogeneity2018/}{Extreme Heterogeneity Virtual Workshop 2018}", month = jan # "~23-24, ", year = "2018", address = "Washington, DC, USA", url = "http://www.christian-engelmann.info/publications/engelmann18extreme.pdf", abstract = "Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale high-performance computing (HPC). Extreme heterogeneity, i.e., using multiple, and potentially configurable, types of processors, accelerators and memory/storage in a single computing platform, will add a significant amount of complexity to the HPC hardware/software ecosystem. The notion of correct computation and program state assumed by users and application developers today, which has been based on binary bit-level correctness, will no longer hold for processing elements based on quantum qubits and analog circuits that model spiking neurons in neuromorphic computing elements. The diverse set of compute and memory components in future heterogeneous systems will require novel hardware and software resilience solutions. Errors and failures reported by such heterogeneous hardware will need to be handled by the appropriate software component to enable efficient masking, recovery, and avoidance with little burden on the user. Similarly, errors and failures reported by the software running on such heterogeneous hardware need to be equally efficiently handled with little burden on the user. This requires a new approach, where resilience is holistically provided by the HPC hardware/software ecosystem. The key challenges are to design and to operate extreme heterogeneous HPC systems with (1) wide-ranging resilience capabilities in system software, programming models, libraries, and applications, (2) interfaces and mechanisms for coordinating resilience capabilities across diverse hardware and software components, (3) appropriate metrics and tools for assessing performance, resilience, and energy, and (4) an understanding of the performance, resilience and energy trade-off that eventually results in well-informed HPC system design choices and runtime decisions." }
@misc{tiwari16lightweight, author = "Devesh Tiwari and Saurabh Gupta and Christian Engelmann", title = "Lightweight, Actionable Analytical Tools Based on Statistical Learning for Efficient System Operations", howpublished = "White paper accepted at the U.S. Department of Energy's \href{http://hpc.pnl.gov/modsim/2016}{Workshop on Modeling & Simulation of Systems & Applications (ModSim) 2016}", month = aug # "~10-12, ", year = "2016", address = "Seattle, WA, USA", url = "http://www.christian-engelmann.info/publications/tiwari16lightweight.pdf", url2 = "http://www.christian-engelmann.info/publications/tiwari16lightweight.ppt.pdf", abstract = "The modeling and simulation community has always relied on accurate and meaningful system data and parameters to drive analytical models and simulators. HPC systems continuously generate huge amounts of system-event-related data (e.g., system log, resource consumption log, RAS logs, power consumption logs), but meaningful interpretation and accuracy verification of such data is quite challenging. This talk offers a unique perspective and experience in demonstrating how modeling and simulation based research can actually be translated into production systems. We will discuss the short-term opportunities for the modeling and simulation community to increase the impact and effectiveness of our analytical tools, ``dos and don'ts'', long-term challenges and opportunities.", pts = "69458" }
@misc{engelmann13hardware, author = "Christian Engelmann and Thomas Naughton", title = "A Hardware/Software Performance/Resilience/Power Co-Design Tool for Extreme-scale Computing", howpublished = "White paper accepted at the U.S. Department of Energy's \href{http://hpc.pnl.gov/modsim/2013}{Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2013}", month = sep # "~18-19, ", year = "2013", address = "Seattle, WA, USA", url = "http://www.christian-engelmann.info/publications/engelmann13hardware.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann13hardware.ppt.pdf", abstract = "xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. The newly added features also offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating performance under failure and failure handling of checkpoint/restart and ABFT solutions. The newly enhanced xSim is the very first performance tool that supports these capabilities." }
@misc{snir13addressing, author = "Marc Snir and Robert W. Wisniewski and Jacob A. Abraham and Sarita V. Adve and Saurabh Bagchi and Pavan Balaji and Bill Carlson and Andrew A. Chien and Pedro Diniz and Christian Engelmann and Rinku Gupta and Fred Johnson and Jim Belak and Pradip Bose and Franck Cappello and Paul Coteus and Nathan A. Debardeleben and Mattan Erez and Saverio Fazzari and Al Geist and Sriram Krishnamoorthy and Sven Leyffer and Dean Liberty and Subhasish Mitra and Todd Munson and Rob Schreiber and Jon Stearley and Eric Van Hensbergen", title = "Addressing Failures in Exascale Computing", howpublished = "Workshop report", month = aug # "~4-11, ", year = "2013", address = "Park City, UT, USA", url = "http://www.christian-engelmann.info/publications/snir13addressing.pdf" }
@misc{geist12department, author = "Al Geist and Bob Lucas and Marc Snir and Shekhar Borkar and Eric Roman and Mootaz Elnozahy and Bert Still and Andrew Chien and Robert Clay and John Wu and Christian Engelmann and Nathan DeBardeleben and Rob Ross and Larry Kaplan and Martin Schulz and Mike Heroux and Sriram Krishnamoorthy and Lucy Nowell and Abhinav Vishnu and Lee-Ann Talley", title = "{U.S. Department of Energy} Fault Management Workshop", howpublished = "Workshop report for the U.S. Department of Energy", month = jun # "~6, ", year = "2012", address = "Baltimore, MD, USA", url = "http://www.christian-engelmann.info/publications/geist12department.pdf", abstract = "A Department of Energy (DOE) Fault Management Workshop was held on June 6, 2012 at the BWI Airport Marriott hotel in Maryland. The goals of this workshop were to: 1. Describe the required HPC resilience for critical DOE mission needs; 2. Detail what HPC resilience research is already being done at the DOE national laboratories and is expected to be done by industry or other groups; 3. Determine what fault management research is a priority for DOE's Office of Science and National Nuclear Security Administration (NNSA) over the next five years; 4. Develop a roadmap for getting the necessary research accomplished in the timeframe when it will be needed by the large computing facilities across DOE." }
@misc{engelmann12performance, author = "Christian Engelmann and Thomas Naughton", title = "A Performance/Resilience/Power Co-design Tool for Extreme-scale High-Performance Computing", howpublished = "White paper accepted at the U.S. Department of Energy's \href{http://hpc.pnl.gov/modsim/2012}{Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2012}", month = aug # "~9-10, ", year = "2012", address = "Seattle, WA, USA", url = "http://www.christian-engelmann.info/publications/engelmann12performance.pdf", abstract = "Performance, resilience and power consumption are key HPC system design factors that are highly interdependent. To enable extreme-scale computing it is essential to perform HPC hardware/software co-design that identifies the cost/benefit trade-off between these design factors for potential future architecture choices. The proposed research and development aims at developing an HPC hardware/software co-design toolkit for evaluating the resilience/power/performance cost/benefit trade-off of future architecture choices. The approach focuses on extending a simulation-based performance investigation toolkit with advanced resilience and power modeling and simulation features, such as (i) fault injection mechanisms, (ii) fault propagation, isolation, and detection models, (iii) fault avoidance, masking, and recovery simulation, and (iv) power consumption models." }
@misc{engelmann12dynamic, author = "Christian Engelmann and Geoffroy R. Vall\'ee and Thomas Naughton and Frank Mueller", title = "Dynamic Self-Aware Runtime Software for Exascale Systems", howpublished = "White paper for the U.S. Department of Energy's \href{https://collab.cels.anl.gov/display/exaosr/Position+Papers} {Exascale Operating Systems and Runtime Technical Council}", month = jul, year = "2012", url = "http://www.christian-engelmann.info/publications/engelmann12dynamic.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann12dynamic.ppt.pdf", abstract = "At exascale, the power consumption, resilience, and load balancing constraints, especially their dynamic nature and interdependence, and the scale of the system require a radical change in future high-performance computing (HPC) operating systems and runtimes (OS/Rs). In contrast to the existing static OS/R solutions, an exascale OS/R is needed that is aware of the dynamically changing resources, constraints, and application needs, and that is able to autonomously coordinate (sometimes conflicting) responses to different changes in the system, simultaneously and at scale. To provide awareness and autonomic management, a novel, scalable and self-aware OS/R is needed that becomes the brains of the entire X-stack. It dynamically analyzes past, current, and future system status and application needs. It optimizes system usage by scheduling, migrating, and restarting tasks within and across nodes as needed to deal with multi-dimensional constraints, such as power consumption, permanent and transient faults, resource degradation, heterogeneity, data locality, and load balance." }
@misc{vallee12unified, author = "Geoffroy R. Vall\'ee and Thomas Naughton and Christian Engelmann and David E. Bernholdt", title = "Unified Execution Environment", howpublished = "White paper for the U.S. Department of Energy's \href{https://collab.cels.anl.gov/display/exaosr/Position+Papers} {Exascale Operating Systems and Runtime Technical Council}", month = jul, year = "2012", url = "http://www.christian-engelmann.info/publications/vallee12unified.pdf", abstract = "The design and development of new system software for HPC (both operating systems and runtimes) face multiple challenges, including scalability (high level of parallelism), efficiency, resiliency, and dynamicity. Guided by these fundamental design principles, we advocate for a unified execution environment, which aims at being scalable, asynchronous, dynamic, resource efficient, and reusable. The proposed solution is based on the following core building blocks: (i) events, (ii) agents, and (iii) enclaves. We use these building blocks to support composable environments that may be tailored to combine appropriate system services as well as user jobs. Additionally, for resilience and scalability, the proposed design encourages localized or regional operations to foster autonomy of execution contexts. We advocate this approach for exascale systems, which include a massive number of heterogeneous computing resources, since it enables architecturally informed structures (topologies) as well as encouraging efficient grouping of functionality/services." }
@misc{debardeleben09high-end, author = "Nathan DeBardeleben and James Laros and John T. Daly and Stephen L. Scott and Christian Engelmann and Bill Harrod", title = "High-End Computing Resilience: {Analysis} of Issues Facing the {HEC} Community and Path-Forward for Research and Development", howpublished = "White paper for the U.S. National Science Foundation's High-end Computing Program", month = dec, year = "2009", url = "http://www.christian-engelmann.info/publications/debardeleben09high-end.pdf" }
@techreport{brim23microservice, author = "Michael Brim and Christian Engelmann", title = "INTERSECT Architecture Specification: Microservice Architecture (Version 0.9)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2023/3171", address = "Oak Ridge, TN, USA", month = sep, year = "2023", doi = "10.2172/2333815", url = "http://www.christian-engelmann.info/publications/brim23microservice.pdf", abstract = "Oak Ridge National Laboratory (ORNL)'s Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) architecture project, titled ``An Open Federated Architecture for the Laboratory of the Future'', creates an open federated hardware/software architecture for the laboratory of the future using a novel system of systems (SoS) and microservice architecture approach, connecting scientific instruments, robot-controlled laboratories and edge/center computing/data resources to enable autonomous experiments, ``self-driving'' laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. The project describes science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work and data flow. It creates a SoS architecture of the federated hardware/software ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. It further designs a federated microservice architecture, mapping science use case design patterns to the SoS architecture with loosely coupled microservices, standardized interfaces and multi programming language support. The primary deliverable of this project is an INTERSECT Open Architecture Specification, containing the science use case design pattern catalog, the federated SoS architecture specification and the federated microservice architecture specification. This document represents the microservice architecture of the INTERSECT Open Architecture Specification.", pts = "204232" }
@techreport{engelmann23use, author = "Christian Engelmann and Suhas Somnath", title = "INTERSECT Architecture Specification: Use Case Design Patterns (Version 0.9)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2023/3133", address = "Oak Ridge, TN, USA", month = sep, year = "2023", doi = "10.2172/2229218", url = "http://www.christian-engelmann.info/publications/engelmann23use.pdf", abstract = "Connecting scientific instruments and robot-controlled laboratories with computing and data resources at the edge, the Cloud or the high-performance computing (HPC) center enables autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. The Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) Open Architecture enables science breakthroughs using intelligent networked systems, instruments and facilities with a federated hardware/software architecture for the laboratory of the future. It relies on a novel approach, consisting of (1) science use case design patterns, (2) a system of systems architecture, and (3) a microservice architecture. This document introduces the science use case design patterns of the INTERSECT Architecture. It describes the overall background, the involved terminology and concepts, and the pattern format and classification. It further details the 12 defined patterns and provides insight into building solutions from these patterns. The document also describes the application of these patterns in the context of several INTERSECT autonomous laboratories. The target audience are computer, computational, instrument and domain science experts working in the field of autonomous experiments.", pts = "203995" }
@techreport{engelmann22rdp-20, author = "Christian Engelmann and Rizwan Ashraf and Saurabh Hukerikar and Mohit Kumar and Piyush Sao", title = "Resilience Design Patterns: {A} Structured Approach to Resilience at Extreme Scale (Version 2.0)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2022/2809", address = "Oak Ridge, TN, USA", month = aug, year = "2022", doi = "10.2172/1922296", url = "http://www.christian-engelmann.info/publications/engelmann22rdp-20.pdf", abstract = "Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore, the resilience challenge for extreme-scale HPC systems requires coordination between various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in future HPC systems, they are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. Therefore, the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of such solutions in terms of their protection coverage, and their performance & power efficiency characteristics. Additionally, few implementations of current resilience solutions are portable to newer architectures and software environments that will be deployed on future systems. We developed a new structured approach to the management of HPC resilience using the concept of resilience-based design patterns. In general, a design pattern is a repeatable solution to a commonly occurring problem. We identified the well-known solutions that are commonly used to deal with faults, errors and failures in HPC systems. In the initial design patterns specification (version 1.0), we described the various solutions, which address specific problems in the design of resilient HPC environments, in the form of patterns. Each pattern describes a problem caused by a fault, error or failure event in an HPC environment, and then describes the core of the solution of the problem in such a way that this solution may be adapted to different systems and implemented at different layers of the system stack. The catalog of these resilience design patterns provides designers with a collection of design elements. To construct complete resilience solutions using combinations of various patterns, we defined a framework that enhances HPC designers' understanding of the important constraints and the opportunities for the design patterns to be implemented and deployed at various layers of the system stack. 
The design framework is also useful for establishing interfaces and mechanisms to coordinate flexible fault management across hardware and software components, as well as to consider the trade-off between performance, resilience, and power consumption when constructing a solution. The resilience design patterns specification version 1.1 included more detailed explanations of the pattern solutions, the context in which the patterns are applicable, and the implications for hardware or software design. It also provided several additional examples and detailed case studies to demonstrate the use of patterns to build realistic solutions. Version 1.2 of the specification document improved the pattern descriptions, including graphical representations of the pattern components. These improvements were largely based on critical comments, feedback and suggestions received from pattern experts and readers of the previous versions of the specification. The pattern classification was modified to further clarify the relationships between pattern categories. Version 1.2 also introduced a pattern language for resilience design patterns. The pattern language presents the patterns in the catalog as a network, revealing the relations among the resilience patterns. The language provides designers with the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language also enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack. The overall goal of this work is to provide hardware and software designers, as well as the users and operators of HPC systems, a systematic methodology for the design and evaluation of resilience technologies in HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types. Version 2.0 expands the resilience design pattern classification and catalog to include self-stabilization patterns and reliability, availability and performance models for each structural pattern.", pts = "189180" }
@techreport{brim22microservice, author = "Michael Brim and Christian Engelmann", title = "INTERSECT Architecture Specification: Microservice Architecture (Version 0.5)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2022/2715", address = "Oak Ridge, TN, USA", month = sep, year = "2022", doi = "10.2172/1902805", url = "http://www.christian-engelmann.info/publications/brim22microservice.pdf", abstract = "Oak Ridge National Laboratory (ORNL)'s Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) architecture project, titled ``An Open Federated Architecture for the Laboratory of the Future'', creates an open federated hardware/software architecture for the laboratory of the future using a novel system of systems (SoS) and microservice architecture approach, connecting scientific instruments, robot-controlled laboratories and edge/center computing/data resources to enable autonomous experiments, ``self-driving'' laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. The project describes science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work and data flow. It creates a SoS architecture of the federated hardware/software ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. It further designs a federated microservice architecture, mapping science use case design patterns to the SoS architecture with loosely coupled microservices, standardized interfaces and multi programming language support. The primary deliverable of this project is an INTERSECT Open Architecture Specification, containing the science use case design pattern catalog, the federated SoS architecture specification and the federated microservice architecture specification. This document represents the microservice architecture of the INTERSECT Open Architecture Specification.", pts = "186195" }
@techreport{engelmann22use, author = "Christian Engelmann and Suhas Somnath", title = "INTERSECT Architecture Specification: Use Case Design Patterns (Version 0.5)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2022/2681", address = "Oak Ridge, TN, USA", month = sep, year = "2022", doi = "10.2172/1896984", url = "http://www.christian-engelmann.info/publications/engelmann22use.pdf", abstract = "Oak Ridge National Laboratory (ORNL)'s Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) architecture project, titled ``An Open Federated Architecture for the Laboratory of the Future'', creates an open federated hardware/software architecture for the laboratory of the future using a novel system of systems (SoS) and microservice architecture approach, connecting scientific instruments, robot-controlled laboratories and edge/center computing/data resources to enable autonomous experiments, ``self-driving'' laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. The project describes science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work and data flow. It creates a SoS architecture of the federated hardware/software ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. It further designs a federated microservice architecture, mapping science use case design patterns to the SoS architecture with loosely coupled microservices, standardized interfaces and multi programming language support. The primary deliverable of this project is an INTERSECT Open Architecture Specification, containing the science use case design pattern catalog, the federated SoS architecture specification and the federated microservice architecture specification. This document represents the science use case design pattern catalog of the INTERSECT Open Architecture Specification.", pts = "185612" }
@techreport{hukerikar17rdp-12, author = "Saurabh Hukerikar and Christian Engelmann", title = "Resilience Design Patterns: {A} Structured Approach to Resilience at Extreme Scale (Version 1.2)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2017/745", address = "Oak Ridge, TN, USA", month = aug, year = "2017", doi = "10.2172/1436045", url = "http://www.christian-engelmann.info/publications/hukerikar17rdp-12.pdf", abstract = "Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore, the resilience challenge for extreme-scale HPC systems requires coordination between various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in future HPC systems, they are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. Therefore, the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of such solutions in terms of their protection coverage, and their performance & power efficiency characteristics. Additionally, few implementations of current resilience solutions are portable to newer architectures and software environments that will be deployed on future systems. We developed a new structured approach to the management of HPC resilience using the concept of resilience-based design patterns. In general, a design pattern is a repeatable solution to a commonly occurring problem. We identified the well-known solutions that are commonly used to deal with faults, errors and failures in HPC systems. In the initial design patterns specification (version 1.0), we described the various solutions, which address specific problems in the design of resilient HPC environments, in the form of patterns. Each pattern describes a problem caused by a fault, error or failure event in an HPC environment, and then describes the core of the solution of the problem in such a way that this solution may be adapted to different systems and implemented at different layers of the system stack. The catalog of these resilience design patterns provides designers with a collection of design elements. To construct complete resilience solutions using combinations of various patterns, we defined a framework that enhances HPC designers' understanding of the important constraints and the opportunities for the design patterns to be implemented and deployed at various layers of the system stack. 
The design framework is also useful for establishing interfaces and mechanisms to coordinate flexible fault management across hardware and software components, as well as to consider the trade-off between performance, resilience, and power consumption when constructing a solution. The resilience design patterns specification version 1.1 included more detailed explanations of the pattern solutions, the context in which the patterns are applicable, and the implications for hardware or software design. It also provided several additional examples and detailed case studies to demonstrate the use of patterns to build realistic solutions. In this version 1.2 of the specification document, we have improved the pattern descriptions, including graphical representations of the pattern components. These improvements are largely based on critical comments, feedback and suggestions received from pattern experts and readers of the previous versions of the specification. The pattern classification has been modified to further clarify the relationships between pattern categories. This version of the specification also introduces a pattern language for resilience design patterns. The pattern language presents the patterns in the catalog as a network, revealing the relations among the resilience patterns. The language provides designers with the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language also enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack. The overall goal of this work is to provide hardware and software designers, as well as the users and operators of HPC systems, a systematic methodology for the design and evaluation of resilience technologies in HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types.", pts = "106427" }
@techreport{hukerikar16rdp-11, author = "Saurabh Hukerikar and Christian Engelmann", title = "Resilience Design Patterns: {A} Structured Approach to Resilience at Extreme Scale (Version 1.1)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2016/767", address = "Oak Ridge, TN, USA", month = dec, year = "2016", doi = "10.2172/1345793", url = "http://www.christian-engelmann.info/publications/hukerikar16rdp-11.pdf", abstract = "Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore, the resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in HPC systems, future systems are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. As a result, the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency across the system stack. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems. The complete catalog of resilience design patterns provides designers with reusable design elements. We also define a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also supports optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.", pts = "72341" }
@techreport{hukerikar16rdp-10, author = "Saurabh Hukerikar and Christian Engelmann", title = "Resilience Design Patterns: {A} Structured Approach to Resilience at Extreme Scale (Version 1.0)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2016/687", address = "Oak Ridge, TN, USA", month = oct, year = "2016", doi = "10.2172/1338552", url = "http://www.christian-engelmann.info/publications/hukerikar16rdp-10.pdf", abstract = "Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Practical limits on power consumption in HPC systems will require future systems to embrace innovative architectures, increasing the levels of hardware and software complexities. The resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. These techniques must seek to improve resilience at reasonable overheads to power consumption and performance. While the HPC community has developed various solutions, application-level as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency across the system stack. Additionally, few of the current approaches are portable to newer architectures and software ecosystems, which are expected to be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. The catalog of resilience design patterns provides designers with reusable design elements. We define a design framework that enhances our understanding of the important constraints and opportunities for solutions deployed at various layers of the system stack. The framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.", pts = "71756" }
@techreport{fiala12detection, author = "David Fiala and Frank Mueller and Christian Engelmann and Kurt Ferreira and Ron Brightwell and Rolf Riesen", title = "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2012/227", address = "Oak Ridge, TN, USA", month = jun, year = "2012", url = "http://www.christian-engelmann.info/publications/fiala12detection.pdf", abstract = "Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI applications while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best-suited protocols for detecting and correcting MPI data that is the result of corruption. To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy. Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0\% and 30\% as compared to unprotected applications without redundancy. Using our fault injector within RedMPI, we observe that even a single soft error can have profound effects on running applications, causing, in most cases, a cascading pattern of corruption that spreads to all other processes. RedMPI's protection has been shown to successfully mitigate the effects of soft errors while allowing applications to complete with correct results even in the face of errors." }
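% A minimal Python sketch of the replica-comparison idea summarized in the fiala12detection abstract, assuming (hypothetically) that each logical MPI message is available as a raw byte string from every replica. It is not RedMPI code: with double redundancy a mismatch can only be detected, while triple redundancy allows majority-vote correction.
%
% def detect_and_correct(replica_msgs):
%     """Compare the same logical message as seen by 2 or 3 replicas.
%     Returns (payload, corrected_flag) or raises if corruption cannot be fixed."""
%     unique = set(replica_msgs)
%     if len(unique) == 1:
%         return replica_msgs[0], False              # all replicas agree
%     if len(replica_msgs) == 2:
%         raise RuntimeError("SDC detected; dual redundancy cannot correct")
%     winner = max(unique, key=replica_msgs.count)   # majority vote (triple redundancy)
%     if replica_msgs.count(winner) < 2:
%         raise RuntimeError("SDC detected; no majority among replicas")
%     return winner, True                            # detected and corrected
%
% # Example: replica 1 suffered a bit flip in its payload.
% print(detect_and_correct([b"\x01\x02\x03", b"\x01\x12\x03", b"\x01\x02\x03"]))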
@techreport{wang10hybrid, author = "Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Hybrid Full/Incremental Checkpoint/Restart for {MPI} Jobs in {HPC} Environments", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2010/162", address = "Oak Ridge, TN, USA", month = aug, year = "2010", url = "http://www.christian-engelmann.info/publications/wang10hybrid.pdf", abstract = "As the number of cores in high-performance computing environments keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a high-performance hybrid disk-based full/incremental checkpointing technique for MPI tasks to capture only data changed since the last checkpoint. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints significantly outweigh the loss on restart operations. Experiments in a cluster with the NAS Parallel Benchmark suite and mpiBLAST indicate that savings due to replacing full checkpoints with incremental ones average 16.64 seconds while restore overhead amounts to just 1.17 seconds. These savings increase with the frequency of incremental checkpoints. Overall, our novel hybrid full/incremental checkpointing is superior to prior non-hybrid techniques." }
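% A back-of-the-envelope check of the numbers quoted in the wang10hybrid abstract, under the simplifying (assumed) model that every incremental checkpoint saves about 16.64 s relative to a full checkpoint and every restart costs about 1.17 s extra:
%
% PER_CHECKPOINT_SAVING_S = 16.64   # from the abstract
% EXTRA_RESTART_COST_S = 1.17       # from the abstract
%
% def net_benefit_s(incremental_checkpoints, restarts=1):
%     """Net time saved for a run with the given number of incremental
%     checkpoints and restarts (positive means the hybrid scheme wins)."""
%     return incremental_checkpoints * PER_CHECKPOINT_SAVING_S - restarts * EXTRA_RESTART_COST_S
%
% print(net_benefit_s(1))    # ~15.47 s: one incremental checkpoint already pays for one restart
% print(net_benefit_s(10))   # ~165.23 s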
@techreport{wang10proactive, author = "Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott", title = "Proactive Process-Level Live Migration and Back Migration in {HPC} Environments", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2010/161", address = "Oak Ridge, TN, USA", month = aug, year = "2010", url = "http://www.christian-engelmann.info/publications/wang10proactive.pdf", abstract = "As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70\% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit due to back migration will be." }
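% An illustrative sketch, not the paper's implementation, of the kind of threshold-based health trigger described in the wang10proactive abstract; the metric, threshold, and function names are hypothetical, and the only number taken from the abstract is the 1-6.5 s of warning reported as sufficient to trigger live process migration.
%
% LEAD_TIME_NEEDED_S = 6.5    # upper bound on the warning time reported in the abstract
%
% def should_migrate(health_samples, threshold=80.0):
%     """Trigger when a (hypothetical) 0-100 node health score stays below
%     the threshold for three consecutive samples."""
%     recent = health_samples[-3:]
%     return len(recent) == 3 and all(s < threshold for s in recent)
%
% def on_health_sample(node, samples, predicted_time_to_failure_s):
%     if should_migrate(samples) and predicted_time_to_failure_s > LEAD_TIME_NEEDED_S:
%         print(f"live-migrating processes off {node} with "
%               f"{predicted_time_to_failure_s:.1f} s of lead time")
%
% on_health_sample("node042", [95.0, 78.0, 75.0, 71.0], predicted_time_to_failure_s=20.0)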
@dataset{shin23olcf, author = "Woong Shin and Vladyslav Oles and Anna Schmedding and George Ostrouchov and Evgenia Smirni and Christian Engelmann and Feiyi Wang", title = "{OLCF Summit} Supercomputer {GPU} Snapshots During Double-Bit Errors and Normal Operations", month = apr # "~20, ", year = "2023", doi = "10.13139/OLCF/1970187", url = "https://doi.ccs.ornl.gov/ui/doi/429", abstract = "As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly larger. Due to the harsh power and thermal conditions in the system, components are exposed to extreme operating conditions. Operation of such modern HPC systems requires deep insights into long-term system behavior to maintain their efficiency as well as their longevity. To help the HPC community gain such insights, we provide GPU snapshots captured during double-bit errors and normal operations, based on system telemetry data and logs collected from the Summit supercomputer, which is equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). The dataset relies on Nvidia XID records internally collected by GPU firmware at the time of failure occurrence, on the reboot-time logs of each Summit node, on node-level job scheduler records collected after each job termination, and on 1 Hz telemetry from the baseboard management controllers (BMCs) of each Summit compute node collected using the OpenBMC event subscription protocol." }
@misc{engelmann23interconnected4, author = "Christian Engelmann", title = "The Interconnected Science Ecosystem (INTERSECT)", month = oct # "~4, ", year = "2023", howpublished = "{Invited talk at the \href{https://www.hartree.stfc.ac.uk} {Hartree Centre, Science and Technology Facilities Council, Daresbury, UK}}", url = "http://www.christian-engelmann.info/publications/engelmann23interconnected4.ppt.pdf", abstract = "The Interconnected Science Ecosystem (INTERSECT) Initiative at Oak Ridge National Laboratory is in the process of creating an open federated hardware/software architecture for the laboratory of the future, connecting scientific instruments, robot-controlled laboratories, and edge/center computing/data resources to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence driven design, discovery, and evaluation. Its novel approach describes science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work, and data flow. It creates a system-of-systems architecture of the federated hardware/software ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. It further designs a federated microservice architecture, mapping science use case design patterns to the system-of-systems architecture with loosely coupled microservices and standardized interfaces. The INTERSECT Open Architecture Specification contains a use case design pattern catalog, a federated system-of-systems architecture specification, and a federated microservice architecture specification. It is currently being used to prototype and deploy autonomous experiments and self-driving laboratories at Oak Ridge National Laboratory in the following science areas: (1) automation for electric grid interconnected-laboratory emulation/simulation, (2) autonomous additive manufacturing, (3) autonomous continuous flow reactor synthesis, (4) autonomous electron microscopy, (5) autonomous robotic-controlled chemistry laboratory, and (6) integrating an ion trap quantum computing resource." }
@misc{engelmann23interconnected3, author = "Christian Engelmann", title = "The Interconnected Science Ecosystem (INTERSECT) Architecture", month = aug # "~21-23, ", year = "2023", howpublished = "{Invited talk at the \href{https://smc2023.ornl.gov} {$20^{th}$ Smoky Mountains Computational Sciences & Engineering Conference (SMC)}, Knoxville, TN, USA}", url = "http://www.christian-engelmann.info/publications/engelmann23interconnected3.ppt.pdf", abstract = "The Interconnected Science Ecosystem (INTERSECT) Initiative at Oak Ridge National Laboratory is in the process of creating an open federated hardware/software architecture for the laboratory of the future, connecting scientific instruments, robot-controlled laboratories, and edge/center computing/data resources to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence driven design, discovery, and evaluation. Its novel approach describes science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work, and data flow. It creates a system-of-systems architecture of the federated hardware/software ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. It further designs a federated microservice architecture, mapping science use case design patterns to the system-of-systems architecture with loosely coupled microservices and standardized interfaces. The INTERSECT Open Architecture Specification contains a use case design pattern catalog, a federated system-of-systems architecture specification, and a federated microservice architecture specification. It is currently being used to prototype and deploy autonomous experiments and self-driving laboratories at Oak Ridge National Laboratory in the following science areas: (1) automation for electric grid interconnected-laboratory emulation/simulation, (2) autonomous additive manufacturing, (3) autonomous continuous flow reactor synthesis, (4) autonomous electron microscopy, (5) autonomous robotic-controlled chemistry laboratory, and (6) integrating an ion trap quantum computing resource." }
@misc{engelmann23interconnected2, author = "Christian Engelmann", title = "The Interconnected Science Ecosystem (INTERSECT) Architecture", month = jul # "~10, ", year = "2023", howpublished = "{Seminar at the \href{http://www.lrz-muenchen.de}{Leibniz Rechenzentrum (LRZ)}, Garching, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann23interconnected2.ppt.pdf", abstract = "The Interconnected Science Ecosystem (INTERSECT) Initiative at Oak Ridge National Laboratory is in the process of creating an open federated hardware/software architecture for the laboratory of the future, connecting scientific instruments, robot-controlled laboratories, and edge/center computing/data resources to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence driven design, discovery, and evaluation. Its novel approach describes science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work, and data flow. It creates a system-of-systems architecture of the federated hardware/software ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. It further designs a federated microservice architecture, mapping science use case design patterns to the system-of-systems architecture with loosely coupled microservices and standardized interfaces. The INTERSECT Open Architecture Specification contains a use case design pattern catalog, a federated system-of-systems architecture specification, and a federated microservice architecture specification. It is currently being used to prototype and deploy autonomous experiments and self-driving laboratories at Oak Ridge National Laboratory in the following science areas: (1) automation for electric grid interconnected-laboratory emulation/simulation, (2) autonomous additive manufacturing, (3) autonomous continuous flow reactor synthesis, (4) autonomous electron microscopy, (5) autonomous robotic-controlled chemistry laboratory, and (6) integrating an ion trap quantum computing resource." }
@misc{engelmann23interconnected, author = "Christian Engelmann", title = "The Interconnected Science Ecosystem (INTERSECT) Architecture", month = may # "~25, ", year = "2023", howpublished = "{Invited talk at the \href{https://esailworkshop.ornl.gov} {$1^{st}$ Ecosystems for Smart Autonomous Interconnected Labs (E-SAIL) Workshop}, held in conjunction with the \href{https://www.isc-hpc.com}{$38^{th}$ ISC High Performance (ISC) 2023}, Hamburg, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann23interconnected.ppt.pdf", abstract = "The open Interconnected Science Ecosystem (INTERSECT) architecture connects scientific instruments and robot-controlled laboratories with computing and data resources at the edge, the Cloud or the high-performance computing center to enable autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence driven design, discovery and evaluation. Its novel approach consists of science use case design patterns, a system-of-systems architecture, and a microservice architecture." }
@misc{engelmann22designing, author = "Christian Engelmann", title = "Designing Smart and Resilient Extreme-Scale Systems", month = feb # "~23-26, ", year = "2022", howpublished = "{Invited talk at the \href{https://www.siam.org/conferences/cm/conference/pp22} {$20^{th}$ SIAM Conference on Parallel Processing for Scientific Computing (PP) 2022}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann22designing.ppt.pdf", abstract = "Resilience is one of the critical challenges of extreme-scale high-performance computing (HPC) systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of recent achievements in developing a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in current supercomputers and in extrapolating this knowledge to future-generation systems. It also describes the path forward in machine-in-the-loop operational intelligence for smart computing systems, leveraging operational data analytics in a loop control that maximizes productivity and minimizes costs through adaptive autonomous operation for resilience." }
@misc{mintz21enabling, author = "Ben Mintz and Christian Engelmann and Elke Arenholz and Ryan Coffee", title = "Enabling Self-Driven Experiments for Science through an Interconnected Science Ecosystem (INTERSECT)", month = oct # "~20, ", year = "2021", howpublished = "{Panel at the \href{https://smc2021.ornl.gov}{$17^{th}$ Smoky Mountains Computational Sciences & Engineering Conference (SMC)}}", abstract = "The process of operating scientific instruments, conducting experiments, and executing scientific workflows in general is time-consuming and labor-intensive. Computer control of instruments and the rapid rise in simulation and modeling have led to a significant increase in both the quantity and quality of data, but scientists are still contributing to many low-level process steps in data acquisition, processing, and interpretation to produce scientific results. These issues led to the integration of automation and autonomy to decrease process bottlenecks and increase efficiencies. While automation incorporates tools that perform well-defined, systematic processes with limited human intervention, autonomy introduces smart decision-making techniques, such as artificial intelligence (AI) and machine learning (ML). Combining these advances to automate entire scientific workflows and controlling them with AI/ML will bring about revolutionary efficiencies and research outcomes. This kind of autonomous control of processes, experiments, and laboratories will fundamentally change the way scientists work, allowing us to explore high-dimensional problems previously considered impossible and discover new subtle correlations. To enable the interoperability of existing and future self-driven experiments, the scientific community needs a common Interconnected Science Ecosystem (INTERSECT) that consistently incorporates data management software, data analysis workflow tools, and experiment management/steering software as well as AI/ML capabilities. The development of INTERSECT requires tight collaboration between computer scientists, software engineers, data scientists, and domain scientists. This panel will introduce INTERSECT and discuss opportunities, challenges, and business goals for this type of ecosystem including scalability, interoperability, and solution/software transferability/reusability." }
@misc{engelmann21faults, author = "Christian Engelmann", title = "Faults, Errors and Failures in Extreme-Scale Supercomputers", month = aug # "~30, ", year = "2021", howpublished = "{Keynote talk at the \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2021}{$14^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}, held in conjunction with the \href{http://europar2014.dcc.fc.up.pt} {$27^{th}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2021}, Lisbon, Portugal}", url = "http://www.christian-engelmann.info/publications/engelmann21faults.ppt.pdf", abstract = "Resilience is one of the critical challenges of extreme-scale high-performance computing systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of reliability experiences with some of the largest supercomputers in the world and recent achievements in developing a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in these systems." }
@misc{engelmann21resilience, author = "Christian Engelmann", title = "The Resilience Problem in Extreme Scale Computing: Experiences and the Path Forward", month = mar # "~1-5, ", year = "2021", howpublished = "{Invited talk at the \href{https://www.siam.org/conferences/cm/conference/cse21} {SIAM Conference on Computational Science and Engineering (CSE) 2021}, Fort Worth, TX, USA}", url = "http://www.christian-engelmann.info/publications/engelmann21resilience.ppt.pdf", abstract = "Resilience is one of the critical challenges of extreme-scale high-performance computing systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of reliability experiences with some of the largest supercomputers in the world and recent achievements in developing a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in these systems." }
@misc{engelmann21smart, author = "Christian Engelmann", title = "Smart and Resilient Extreme-Scale Systems", month = jan # "~19, ", year = "2021", howpublished = "{Invited talk at the \href{https://www.hipeac.net/2021/spring-virtual/#/program/sessions/7854/} {Workshop on Resilience in High Performance Computing (RESILIENTHPC)}, held in conjunction with the \href{https://www.hipeac.net/2021} {European Network on High-performance Embedded Architecture and Compilation (HiPEAC) Conference 2021}, Budapest, Hungary}", url = "http://www.christian-engelmann.info/publications/engelmann21smart.ppt.pdf", abstract = "Resilience is one of the critical challenges of extreme-scale high-performance computing (HPC) systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of recent achievements in developing a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in current supercomputers and in extrapolating this knowledge to future-generation systems. It also describes the path forward in machine-in-the-loop operational intelligence for smart computing systems, leveraging operational data analytics in a loop control that maximizes productivity and minimizes costs through adaptive autonomous operation for resilience." }
@misc{engelmann20resilience, author = "Christian Engelmann", title = "The Resilience Problem in Extreme Scale Computing", month = feb # "~12-15, ", year = "2020", howpublished = "{Invited talk at the \href{https://www.siam.org/conferences/cm/conference/pp20} {$19^{th}$ SIAM Conference on Parallel Processing for Scientific Computing (PP) 2020}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann20resilience.ppt.pdf", abstract = "Resilience is one of the critical challenges of extreme-scale high-performance computing (HPC) systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of recent achievements in developing a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in current supercomputers and in extrapolating this knowledge to future-generation systems." }
@misc{engelmann19resilience3, author = "Christian Engelmann", title = "Resilience in Parallel Programming Environments", month = oct # "~30-31, ", year = "2019", howpublished = "{Invited talk at the \href{https://iadac.github.io/events/adac8}{$8^{th}$ Accelerated Data Analytics and Computing (ADAC) Institute Workshop}, Tokyo, Japan}", url = "http://www.christian-engelmann.info/publications/engelmann19resilience.ppt.pdf", abstract = "Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory, demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. The presented work takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, this work describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the experienced general-purpose computing on graphics processing units errors and failures in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing systems." }
@misc{engelmann19resilience2, author = "Christian Engelmann", title = "Resilience by Design (and not as an Afterthought)", month = mar # "~26-29, ", year = "2019", howpublished = "{Invited talk at the \href{https://sos23.ornl.gov/}{$23^{rd}$ Workshop on Distributed Supercomputing (SOS) 2019}, Asheville, NC, USA}", url = "http://www.christian-engelmann.info/publications/engelmann19resilience2.ppt.pdf", abstract = "Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale high-performance computing (HPC). The challenge is to build a reliable HPC system within a given cost budget that achieves the expected performance. Every generation of supercomputers deployed at Oak Ridge National Laboratory (ORNL) had to deal with expected and unexpected faults, errors and failures. While these supercomputers are designed to deal with expected issues, unexpected reliability problems can lead to severe degradation in operational capabilities. For example, ORNL's Titan supercomputer experienced an unexpected increase in general-purpose graphics processing unit (GPGPU) failures between 2015 and 2017. At the peak of the problem, Titan was losing an average of 12 GPGPUs (and corresponding compute nodes) per day. Over 50\% of its 18,688 GPGPUs had to be replaced. The system and the applications using it were never designed to handle such a high failure rate in an efficient manner. Other past unexpected reliability issues with supercomputers at US Department of Energy HPC centers were caused by early wear-out, dirty power, bad solder, other manufacturing issues, design errors in hardware, design errors in software and user errors. With the expected decrease in reliability due to component count increases, process technology challenges, hardware heterogeneity and software complexity, risk mitigation against unexpected issues is becoming paramount to ensure the success of future extreme-scale HPC systems. Resilience needs to be holistically provided by the HPC hardware/software ecosystem. The key challenges are to design and to operate extreme HPC systems with (1) wide-ranging resilience capabilities in hardware, system software, programming models, libraries, and applications, (2) interfaces and mechanisms for coordinating resilience capabilities across diverse hardware and software components, (3) appropriate metrics and tools for assessing performance, resilience, and energy, and (4) an understanding of the performance, resilience and energy trade-off that eventually results in well-informed HPC system design choices and runtime decisions." }
@misc{engelmann19resilience, author = "Christian Engelmann", title = "Resilience for Extreme Scale Systems: Understanding the Problem", month = feb # "~25 - " # mar # "~1, ", year = "2019", howpublished = "{Invited talk at the \href{https://www.siam.org/meetings/cse19/}{SIAM Conference on Computational Science and Engineering (CSE) 2019}, Spokane, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann19resilience.ppt.pdf", abstract = "Resilience is one of the critical challenges of extreme-scale high-performance computing (HPC) systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of the Catalog project, which develops a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, this project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory." }
@misc{engelmann18modeling, author = "Christian Engelmann and Rizwan Ashraf", title = "Modeling and Simulation of Extreme-Scale Systems for Resilience by Design", month = aug # "~15-17, ", year = "2018", howpublished = "{Invited talk at the \href{https://www.bnl.gov/modsim2018} {Workshop on Modeling and Simulation of Systems and Applications}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann18modeling.ppt.pdf", abstract = "Resilience is a serious concern for extreme-scale high-performance computing (HPC). While the HPC community has developed various resilience solutions, the solution space remains fragmented. We created a structured approach to the design, evaluation and optimization of HPC resilience using the concept of design patterns. A design pattern describes a generalized solution to a repeatedly occurring problem. We identified the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each well-known solution that addresses a specific resilience challenge is described in the form of a design pattern. We developed a resilience design pattern specification, language and catalog, which can be used by system architects, system software and library developers, application programmers, as well as users and operators as essential building blocks when designing and deploying resilience solutions. The resilience design pattern approach provides a unique opportunity for design space exploration. As each resilience solution is abstracted as a pattern and each solution's properties are defined by pattern parameters, vertical and horizontal pattern compositions can describe the resilience capabilities of an entire HPC system. This permits the investigation of beneficial or counterproductive interactions between patterns and of the performance, resilience, and power consumption trade-off between different pattern parameters and compositions. The ultimate goal is to make resilience an integral part of the HPC hardware/software ecosystem by coordinating the various existing resilience solutions in a design space exploration process, such that the burden for providing resilience is on the system by design and not on the user as an afterthought. We are in the early stages of developing a novel design space exploration tool that enables this investigation using modeling and simulation. We developed performance and resilience models for each resilience design pattern. We also leverage results from the Catalog project, a collaborative effort between Oak Ridge National Laboratory, Argonne National Laboratory and Lawrence Livermore National Laboratory that developed models of the faults, errors and failures in today's HPC systems. We also leverage recent results from the same project by Lawrence Livermore National Laboratory in application reliability patterns. The planned research extends and combines this work to model the performance, resilience, and power consumption of an entire HPC system, initially at node-level granularity, and to simulate the dynamic interactions between deployed resilience solutions and the rest of the system. In the next iteration, finer-grain modeling and simulation, such as at the computational unit level, is used to increase accuracy. This work leverages the experience of the investigators in parallel discrete event simulation of extreme-scale systems, such as the Extreme-scale Simulator (xSim). The current state of the art in resilience modeling and simulation is fragmented as well. 
There is currently no such design space exploration tool. Instead, each resilience solution is typically investigated separately. There is only a small amount of work on multi-resilience solutions, including by the investigators. While there is some work investigating the performance/resilience trade-off space, almost none of it includes power consumption." }
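% The engelmann18modeling abstract mentions performance and resilience models for each design pattern; as one hypothetical example of such a model, the sketch below uses Young's classic first-order approximation of the optimal checkpoint interval, tau = sqrt(2 * C * MTBF), which the abstract does not name but which is a common starting point for a checkpoint/restart pattern model. All parameter values are made up for illustration.
%
% import math
%
% def young_optimal_interval_s(checkpoint_cost_s, mtbf_s):
%     """Young's approximation of the optimal compute time between checkpoints."""
%     return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
%
% def overhead_fraction(checkpoint_cost_s, interval_s, mtbf_s, restart_cost_s):
%     """Rough expected overhead per interval: checkpoint cost plus the expected
%     rework (half an interval) and restart cost after a failure."""
%     expected_failures = interval_s / mtbf_s
%     rework = interval_s / 2.0 + restart_cost_s
%     return (checkpoint_cost_s + expected_failures * rework) / interval_s
%
% tau = young_optimal_interval_s(checkpoint_cost_s=600.0, mtbf_s=24 * 3600.0)
% print(f"interval ~ {tau / 60:.0f} min, overhead ~ "
%       f"{overhead_fraction(600.0, tau, 24 * 3600.0, 900.0):.1%}")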
@misc{engelmann18characterizing2, author = "Christian Engelmann", title = "Characterizing Faults, Errors, and Failures in Extreme-Scale Systems", month = jul # "~2-4, ", year = "2018", howpublished = "{Invited talk at the \href{https://pasc18.pasc-conference.org}{Platform for Advanced Scientific Computing (PASC) Conference 2018}, Basel, Switzerland}", url = "http://www.christian-engelmann.info/publications/engelmann18characterizing2.ppt.pdf", abstract = "Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned." }
@misc{engelmann18characterizing, author = "Christian Engelmann", title = "Characterizing Faults, Errors, and Failures in Extreme-Scale Systems", month = jun # "~20-21, ", year = "2018", howpublished = "{Invited talk at the \href{https://iadac.github.io/adac6}{$6^{th}$ Accelerated Data Analytics and Computing (ADAC) Institute Workshop}, Zurich, Switzerland}", url = "http://www.christian-engelmann.info/publications/engelmann18characterizing.ppt.pdf", abstract = "Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned." }
@misc{engelmann18pattern-based, author = "Christian Engelmann", title = "Pattern-based Modeling of Fail-stop and Soft-error Resilience for Iterative Linear Solvers", month = mar # "~7-10, ", year = "2018", howpublished = "{Invited talk at the \href{https://www.siam.org/meetings/pp18/}{$18^{th}$ SIAM Conference on Parallel Processing for Scientific Computing (PP) 2018}, Tokyo, Japan}", url = "http://www.christian-engelmann.info/publications/engelmann18resilience.ppt.pdf", abstract = "Reliability is a serious concern for future extreme-scale high-performance computing (HPC). While the HPC community has developed various resilience solutions, the solution space remains fragmented. With this work, we develop a structured approach to the design, evaluation and optimization of HPC resilience using the concept of design patterns. We identify the problems caused by faults, errors and failures in HPC systems and the techniques used to deal with these events. Each well-known solution that addresses a specific resilience challenge is described in the form of a pattern. We develop a catalog of such resilience design patterns, which may be used by system architects, system software and tools developers, application programmers, as well as users and operators as essential building blocks when designing and deploying resilience solutions. We also develop a design framework that enhances a designer's understanding of the opportunities for integrating multiple patterns across layers of the system stack and of the important constraints during implementation of the individual patterns. It is also useful for designing mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The resilience patterns and the design framework also enable exploration and evaluation of design alternatives and support optimization of the cost-benefit trade-offs among performance, protection coverage, and power consumption of resilience solutions." }
@misc{engelmann18resilience, author = "Christian Engelmann", title = "Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale", month = mar # "~7-10, ", year = "2018", howpublished = "{Invited talk at the \href{https://www.siam.org/meetings/pp18/}{$18^{th}$ SIAM Conference on Parallel Processing for Scientific Computing (PP) 2018}, Tokyo, Japan}", url = "http://www.christian-engelmann.info/publications/pattern-based.ppt.pdf", abstract = "The reliability of high-performance computing (HPC) platforms is among the most critical challenges as systems continue to increase component counts, while the individual component reliability decreases and software complexity increases. While most resilience solutions are designed to address a specific fault model, HPC applications must contend with extremely high rates of faults from various sources with different levels of severity. Therefore, resilience for extreme-scale HPC systems and their applications requires an integrated approach, which leverages detection, containment and mitigation capabilities from different layers of the HPC environment. With this work, we propose an approach based on design patterns to explore a multi-level resilience solution that addresses silent data corruptions and process failures. The structured approach enables evaluation of the key components of a multi-level resilience solution using pattern performance models and systematically integrating the patterns into a complete solution by assessing the interplay between the patterns. We describe the design steps to develop a multi-level resilience solution for an iterative linear solver application that combines algorithmic resilience features of the solver with the fault tolerance primitives provided by ULFM MPI. Our results demonstrate the viability of designing HPC applications capable of surviving simultaneous injection of hard and soft errors in a performance efficient manner." }
@misc{engelmann17catalog2, author = "Christian Engelmann", title = "A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems", month = jul # "~10-14, ", year = "2017", howpublished = "{Invited talk at the \href{http://www.siam.org/meetings/an17/}{SIAM Annual Meeting (AM) 2017}, Pittsburgh, PA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann17catalog2.ppt.pdf", abstract = "Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned." }
@misc{engelmann17characterizing, author = "Christian Engelmann", title = "Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems", month = jun # "~16-22, ", year = "2017", howpublished = "{Invited talk at the \href{http://www.isc-hpc.com} {International Supercomputing Conference (ISC) 2017}, Frankfurt am Main, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann17characterizing.ppt.pdf", abstract = "Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned." }
@misc{engelmann17catalog, author = "Christian Engelmann", title = "A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems", month = may # "~24-26, ", year = "2017", howpublished = "{Invited talk at the \href{http://icl.cs.utk.edu/workshops/scheduling2017/} {$12^{th}$ Scheduling for Large Scale Systems Workshop (SLSSW) 2017}, Knoxville, TN, USA}", url = "http://www.christian-engelmann.info/publications/engelmann17catalog.ppt.pdf", abstract = "Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned." }
@misc{engelmann16missing, author = "Christian Engelmann", title = "The Missing High-Performance Computing Fault Model", month = apr # "~12-15, ", year = "2016", howpublished = "{Invited talk at the \href{http://www.siam.org/meetings/pp16/}{$17^{th}$ SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016}, Paris, France}", url = "http://www.christian-engelmann.info/publications/engelmann16missing.ppt.pdf", abstract = "The path to exascale computing poses several research challenges. Resilience is one of the most important challenges. This talk will present recent work in developing the missing high-performance computing (HPC) fault model. This effort identifies, categorizes and models the fault, error and failure properties of today's HPC systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolates this knowledge to exascale HPC systems." }
@misc{engelmann16resilience2, author = "Christian Engelmann", title = "Resilience Challenges and Solutions for Extreme-Scale Supercomputing", month = feb # "~18, ", year = "2016", howpublished = "{Invited talk at the \href{http://www.usna.edu}{United States Naval Academy}, Annapolis, MD, USA}", url = "http://www.christian-engelmann.info/publications/engelmann16resilience2.ppt.pdf", abstract = "The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2022) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2022). This talk provides an overview of recent and ongoing resilience research and development activities at Oak Ridge National Laboratory in advanced checkpoint storage architectures, process-level incremental checkpoint/restart, proactive fault tolerance using prediction-triggered process or virtual machine migration, MPI process-level software redundancy, and soft-error injection tools to study the vulnerability of science applications." }
@misc{engelmann15toward, author = "Christian Engelmann", title = "Toward A Fault Model And Resilience Design Patterns For Extreme Scale Systems", month = aug # "~24-28, ", year = "2015", howpublished = "{Keynote talk at the \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2015}{$8^{th}$ Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids}, held in conjunction with the \href{http://europar2014.dcc.fc.up.pt} {$21^{st}$ European Conference on Parallel and Distributed Computing (Euro-Par) 2015}, Vienna, Austria}", url = "http://www.christian-engelmann.info/publications/engelmann15toward.ppt.pdf", abstract = "The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2022) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2022). This talk provides an overview of two recently funded projects. The Characterizing Faults, Errors, and Failures in Extreme-Scale Systems project identifies, categorizes and models the fault, error and failure properties of US Department of Energy high-performance computing (HPC) systems. It develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale HPC systems. The Resilience Design Patterns project will increase the ability of scientific applications to reach accurate solutions in a timely and efficient manner. Using a novel design pattern concept, it identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout high-performance computing hardware and software." }
@misc{engelmann15resilience, author = "Christian Engelmann", title = "Resilience Challenges and Solutions for Extreme-Scale Supercomputing", month = mar # "~2-5, ", year = "2015", howpublished = "{Invited talk at the $19^{th}$ Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA}", url = "http://www.christian-engelmann.info/publications/engelmann15resilience.ppt.pdf", abstract = "The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2022) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2022). This talk provides an overview of recent and ongoing resilience research and development activities at Oak Ridge National Laboratory in advanced checkpoint storage architectures, process-level incremental checkpoint/restart, proactive fault tolerance using prediction-triggered process or virtual machine migration, MPI process-level software redundancy, and soft-error injection tools to study the vulnerability of science applications." }
@misc{engelmann15xsim, author = "Christian Engelmann", title = "xSim: {T}he Extreme-scale Simulator", month = feb # "~23, ", year = "2015", howpublished = "{Seminar at the \href{http://www.lrz-muenchen.de}{Leibniz Rechenzentrum (LRZ)}, Garching, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann15xsim.ppt.pdf", abstract = "The path to exascale high-performance computing (HPC) poses several challenges related to power, performance, and resilience. Investigating the performance and resilience of parallel applications at scale on future architectures and the performance and resilience impact of different architecture choices is an important component of HPC hardware/software co-design. Without having access to future architectures at scale, simulation provides an alternative. The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running applications in a controlled environment with millions of concurrent execution threads, while observing performance and resilience in a simulated extreme-scale system. Using a lightweight parallel discrete event simulation, xSim executes a Message Passing Interface (MPI) application on a much smaller system in a highly oversubscribed fashion with a virtual wall clock time, such that performance data can be extracted based on a processor and a network model. xSim is designed like a traditional performance tool, as an interposition library that sits between the MPI application and the MPI library, using the MPI profiling interface. It has been run with up to 134,217,728 ($2^{27}$) MPI ranks on a 960-core Linux cluster. xSim also permits the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling within the simulation using application-level checkpoint/restart. Another feature provides user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). xSim is the very first performance tool that supports ULFM and ABFT." }
@misc{engelmann14supporting, author = "Christian Engelmann", title = "Supporting the Development of Resilient Message Passing Applications using Simulation", month = sep # "~28 - " # oct # "~1, ", year = "2014", howpublished = "Invited talk at the \href{http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=14402} {Dagstuhl Seminar on Resilience in Exascale Computing}, Schloss Dagstuhl, Wadern, Germany", url = "http://www.christian-engelmann.info/publications/engelmann14supporting.ppt.pdf", abstract = "An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The presented work extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of message passing interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim permits running MPI applications with millions of concurrent MPI ranks, while observing application performance in a simulated extreme-scale system using a lightweight parallel discrete event simulation. The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating performance under failure and failure handling of ABFT solutions. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT." }
@misc{engelmann13resilience, author = "Christian Engelmann", title = "Resilience Challenges and Solutions for Extreme-Scale Supercomputing", month = sep # "~3, ", year = "2013", howpublished = "{Invited talk at the Technical University of Dresden, Dresden, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann13resilience.ppt.pdf", abstract = "With the recent deployment of the 18 PFlop/s Titan supercomputer and the exascale roadmap targeting 100, 300, and eventually 1,000 PFlop/s by 2022, Oak Ridge National Laboratory is at the forefront of scientific capability computing. The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2022) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2022). This talk provides an overview of recent and ongoing resilience research and development activities at Oak Ridge National Laboratory in advanced checkpoint storage architectures, process-level incremental checkpoint/restart, proactive fault tolerance using prediction-triggered process or virtual machine migration, MPI process-level software redundancy, and soft-error injection tools to study the vulnerability of science applications and of CMOS logic in processors and memory." }
@misc{engelmann12fault, author = "Christian Engelmann", title = "Fault Tolerance Session", month = oct # "~16-17, ", year = "2012", howpublished = "{Invited talk at the \href{http://www.aanmelder.nl/exachallenge} {The ExaChallenge Symposium}, Dublin, Ireland}", url = "http://www.christian-engelmann.info/publications/engelmann12fault.ppt.pdf" }
@misc{engelmann12high-end, author = "Christian Engelmann", title = "High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development", month = aug # "~4-11, ", year = "2012", howpublished = "{Invited talk at the Argonne National Laboratory (ANL) Institute of Computing in Science (ICiS) \href{http://www.icis.anl.gov/programs/summer2012-4b} {Summer Workshop Week on Addressing Failures in Exascale Computing}, Park City, UT, USA}", url = "http://www.christian-engelmann.info/publications/engelmann12high-end.ppt.pdf", abstract = "The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). To provide input for a discussion of future needs in resilience research, development, and standards work, this talk gives a brief summary of the outcomes from the National HPC Workshop on Resilience, held in Arlington, VA, USA on August 12-14, 2009." }
@misc{engelmann12resilience, author = "Christian Engelmann", title = "Resilience for Permanent, Transient, and Undetected Errors", month = mar # "~12-15, ", year = "2012", howpublished = "{Invited talk at the \href{http://www.cs.sandia.gov/Conferences/SOS16} {$16^{th}$ Workshop on Distributed Supercomputing (SOS) 2012}, Santa Barbara, CA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann12resilience.ppt.pdf", abstract = "With the ongoing deployment of 10-20 PFlop/s supercomputers and the exascale roadmap targeting 100, 300, and eventually 1,000 PFlop/s by 2020, the path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). This talk provides an overview of recent and ongoing resilience research and development activities at Oak Ridge National Laboratory, and of future needs in resilience research, development, and standards work." }
@misc{engelmann12scaling, author = "Christian Engelmann", title = "Scaling To A Million Cores And Beyond: A Basic Understanding Of The Challenges Ahead On The Road To Exascale", month = jan # "~24, ", year = "2012", howpublished = "{Invited talk at the \href{https://researcher.ibm.com/researcher/view_page.php?id=2580} {$1^{st}$ International Workshop on Extreme Scale Parallel Architectures and Systems (ESPAS) 2012}, in conjunction with the \href{http://www.hipeac.net/conference/paris}{$7^{th}$ International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) 2012}, Paris, France}", url = "http://www.christian-engelmann.info/publications/engelmann12scaling.ppt.pdf", abstract = "On the road toward multi-petascale and exascale HPC, the trend in architecture goes clearly in only one direction. HPC systems will dramatically scale up in compute node and processor core counts. By 2020, an exascale system may have up to 1,000,000 compute nodes with 1,000 cores per node. The substantial growth in concurrency causes parallel application scalability issues due to sequential application parts, synchronizing communication, and other bottlenecks. Investigating parallel algorithm performance properties at this scale and with these architectural properties for HPC hardware/software co-design is crucial to enable extreme-scale computing. The presented work utilizes the Extreme-scale Simulator (xSim) performance investigation toolkit to identify the scaling characteristics of a simple Monte Carlo algorithm from 1 to 16 million MPI processes on different multi-core architecture choices. The results show the limitations of strong scaling and the negative impact of employing more but less powerful cores for energy savings." }
@misc{engelmann11resilient, author = "Christian Engelmann", title = "Resilient Software for ExaScale Computing", month = nov # "~17, ", year = "2011", howpublished = "{Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the \href{http://sc11.supercomputing.org} {$24^{th}$ IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011}, Seattle, WA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann11resilient.ppt.pdf", abstract = "ExaScale computing systems will likely consist of millions of cores executing applications with billions of threads, based on 14nm or less CMOS technology, according to the ITRS roadmap. Processing elements built on this technology, coupled with dynamic power management, will exhibit high variability in performance, between cores and across different runs. Even worse, preliminary figures indicate that on average about every couple of minutes - at least - something in the system will break. Traditional checkpointing strategies are unlikely to work, given the time it will take to save the huge quantities of data combined with the fact that they will need to be restored frequently. This BoF wants to investigate resilient software: software that is able to survive failing hardware and continue to run with minimal performance impact. Furthermore, we may also discuss tradeoffs between rerunning the application and the cost of instrumentation to deal with resilience." }
@misc{engelmann11resilience, author = "Christian Engelmann", title = "Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing", month = jul # "~27, ", year = "2011", howpublished = "{Seminar at the \href{http://www.bsc.es}{Barcelona Supercomputing Center}, Barcelona, Spain}", url = "http://www.christian-engelmann.info/publications/engelmann11resilience.ppt.pdf", abstract = "Oak Ridge National Laboratory (ORNL) provides the most powerful high-performance computing (HPC) resources in the world for open scientific research. Jaguar, a 224,162-core Cray XT5 with a LINPACK performance of 1.759 PFlop/s, for example, is the world's 3rd fastest supercomputer. 80\% of its resources are allocated through a reviewed process to address the most challenging scientific problems in climate modeling, renewable energy, materials science, fusion and other areas. ORNL's Computer Science and Mathematics Division performs computer science and mathematics research to increase supercomputer efficiency and application scientist productivity while accelerating time to solution for scientific breakthroughs. This talk details recent research advancements at ORNL in two areas: (1) resilience and (2) hardware/software co-design for extreme-scale supercomputing. Both are essential on the road toward exa-scale HPC systems with millions-to-billions of cores. Due to the expected drastic increase in scale, the corresponding decrease in system mean-time to interrupt warrants a rethinking of the traditional checkpoint/restart approach for HPC resilience. New concepts discussed in this talk range from preventative measures, such as task migration based on fault prediction, to more aggressive fault masking, such as various levels of redundancy. Further, the expected drastic increase in task parallelism requires redesigning algorithms to avoid the consequences of Amdahl's law at extreme scale. As million-way task parallel systems don't exist yet, this talk discusses a lightweight system simulation approach for performance estimation of algorithms at scale." }
@misc{engelmann10scalable, author = "Christian Engelmann", title = "Scalable HPC System Monitoring", month = oct # "~13, ", year = "2010", howpublished = "{Invited talk at the $3^{rd}$ HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2010, in conjunction with the \href{http://www.lanl.gov/conferences/lacss/2010}{$3^{rd}$ Los Alamos Computer Science Symposium (LACSS) 2010}, Santa Fe, NM, USA}", url = "http://www.christian-engelmann.info/publications/engelmann10scalable.ppt.pdf", abstract = "We present a monitoring system for large-scale parallel and distributed computing environments that allows trading off accuracy in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual monitoring metrics using a tree-based overlay network. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of ~56 in comparison to the Ganglia distributed monitoring system. A simple scaling study reveals, however, that further efforts are needed in reducing the amount of data to monitor future-generation extreme-scale systems with up to 1,000,000 nodes. The implemented solution did not have a measurable performance impact, as the 32-node test system did not produce enough monitoring data to interfere with running applications." }
@misc{engelmann10beyond, author = "Christian Engelmann", title = "Beyond Application-Level Checkpoint/Restart - {Advanced} Software Approaches for Fault Resilience", month = sep # "~6, ", year = "2010", howpublished = "{Talk at the \href{http://www.speedup.ch/workshops/w39_2010.html} {$39^{th}$ SPEEDUP Workshop on High Performance Computing}, Zurich, Switzerland}", url = "http://www.christian-engelmann.info/publications/engelmann10beyond.ppt.pdf" }
@misc{engelmann10reliability, author = "Christian Engelmann and Stephen L. Scott", title = "Reliability, Availability, and Serviceability ({RAS}) for Petascale High-End Computing and Beyond", month = jun # "~22, ", year = "2010", howpublished = "{Talk at the \href{http://www.usenix.org/events/fastos10} {Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Workshop}, in conjunction with the \href{http://www.usenix.org/events/confweek10}{USENIX Federated Conferences Week (USENIX) 2010}, Boston, MA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann10reliability.ppt.pdf", abstract = "This project aims at scalable technologies for providing high-level RAS for next-generation petascale scientific high-performance computing (HPC) resources and beyond as outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities. Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide for non-stop scientific computing on a 24x7 basis without interruption. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme scale HPC systems. This effort targets: (1) reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring component and system reliability, (2) proactive fault tolerance technology based on preemptive migration away from components that are about to fail, (3) reactive fault tolerance enhancements, such as checkpoint interval and placement adaptation to actual and predicted system health threats, and (4) holistic fault tolerance through combination of adaptive proactive and reactive fault tolerance." }
@misc{engelmann10resilience, author = "Christian Engelmann", title = "Resilience Challenges at the Exascale", month = mar # "~8-11, ", year = "2010", howpublished = "{Talk at the \href{http://www.csm.ornl.gov/workshops/SOS14}{$14^{th}$ Workshop on Distributed Supercomputing (SOS) 2010}, Savannah, GA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann10resilience.ppt.pdf", abstract = "The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count and component reliability decreases. This talk discusses the future needs in resilience research, development, and standards work based on the outcomes from the National HPC Workshop on Resilience, held in Arlington, VA, USA on August 12-14, 2009." }
@misc{engelmann10hpc, author = "Christian Engelmann and Stephen L. Scott", title = "{HPC} System Software Research at {Oak Ridge National Laboratory}", month = feb # "~22, ", year = "2010", howpublished = "{Seminar at the \href{http://www.lrz-muenchen.de}{Leibniz Rechenzentrum (LRZ)}, Garching, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann10hpc.ppt.pdf", abstract = "Oak Ridge National Laboratory (ORNL) is the largest energy laboratory in the United States. Its National Center for Computational Sciences (NCCS) provides the most powerful computing resources in the world for open scientific research. Jaguar, a Cray XT5 system at NCCS, is the fastest supercomputer in the world. It recently ranked #1 in the Top 500 List of Supercomputer Sites with a maximal LINPACK benchmark performance of 1.759 PFlop/s and a theoretical peak performance of 2.331 PFlop/s, where 1 PFlop/s is $10^{15}$ Floating Point Operations Per Second. Annually, 80 percent of Jaguar's resources are allocated through the U.S. Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, a competitively selected, peer reviewed process open to researchers from universities, industry, government and non-profit organizations. These allocations address some of the most challenging scientific problems in areas such as climate modeling, renewable energy, materials science, fusion and combustion. In conjunction with NCCS, the Computer Science and Mathematics Division at ORNL performs basic and applied research in HPC, mathematics, and intelligent systems. This talk gives a summary of the HPC research and development in system software performed at ORNL, including resilience at extreme scale and virtualization technologies in HPC. Specifically, this talk will focus on advanced resilience technologies, such as migration of computation away from components that are about to fail and on management and customization of virtualized environments." }
@misc{engelmann09high2, author = "Christian Engelmann", title = "High-Performance Computing Research Internship and Appointment Opportunities at {Oak Ridge National Laboratory}", month = dec # "~14, ", year = "2009", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann09high2.ppt.pdf", abstract = "Oak Ridge National Laboratory (ORNL) is the largest energy laboratory in the United States. Its National Center for Computational Sciences (NCCS) provides the most powerful computing resources in the world for open scientific research. Jaguar, a Cray XT5 system at NCCS, is the fastest supercomputer in the world. It recently ranked #1 in the Top 500 List of Supercomputer Sites with a maximal LINPACK benchmark performance of 1.759 PFlop/s and a theoretical peak performance of 2.331 PFlop/s, where 1 PFlop/s is $10^{15}$ Floating Point Operations Per Second. Annually, 80 percent of Jaguar's resources are allocated through the U.S. Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, a competitively selected, peer reviewed process open to researchers from universities, industry, government and non-profit organizations. These allocations address some of the most challenging scientific problems in areas such as climate modeling, renewable energy, materials science, fusion and combustion. In conjunction with NCCS, the Computer Science and Mathematics Division at ORNL performs basic and applied research in HPC, mathematics, and intelligent systems. This talk gives a summary of the HPC research performed at ORNL. It provides details about the Jaguar peta-scale computing resource, an overview of the computational science research carried out using ORNL's computing resources, and a description of various computer science efforts targeting solutions for next-generation HPC systems. This talk also provides information about internship opportunities for MSc students and research appointment opportunities for recent graduates." }
@misc{engelmann09jcas, author = "Christian Engelmann", title = "{JCAS} - {IAA} Simulation Efforts at {Oak Ridge National Laboratory}", month = sep # "~1-2, ", year = "2009", howpublished = "{Invited talk at the \href{http://www.cs.sandia.gov/CSRI/Workshops/2009/IAA} {IAA Workshop on HPC Architectural Simulation (HPCAS)}, Boulder, CO, USA}", url = "http://www.christian-engelmann.info/publications/engelmann09jcas.ppt.pdf" }
@misc{engelmann09modeling, author = "Christian Engelmann", title = "Modeling Techniques Towards Resilience", month = aug # "~12-14, ", year = "2009", howpublished = "{Invited talk at the \href{http://institute.lanl.gov/resilience/conferences/2009} {National HPC Workshop on Resilience 2009}, Arlington, VA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann09modeling.ppt.pdf" }
@misc{engelmann09system, author = "Christian Engelmann", title = "System Resilience Research at {ORNL} in the Context of {HPC}", month = may # "~15, ", year = "2009", howpublished = "{Invited talk at the \href{http://www.inria.fr/inria/organigramme/fiche_ur-ren.fr.html} {Institut National de Recherche en Informatique et en Automatique (INRIA)}, Rennes, France}", url = "http://www.christian-engelmann.info/publications/engelmann09system.pdf", abstract = "The continuing growth in high performance computing (HPC) system scale poses a challenge for system software and scientific applications with respect to reliability, availability and serviceability (RAS). With only very few exceptions, the availability of recently installed systems has been lower in comparison to the same deployment phase of their predecessors. As a result, sites lower allowable job run times in order to force applications to store intermediate results (checkpoints) as insurance against lost computation time. However, checkpoints themselves waste valuable computation time and resources. In contrast to the experienced loss of availability, the demand for continuous availability has risen dramatically with the trend towards capability computing, which drives the race for scientific discovery by running applications on the fastest machines available while desiring significant amounts of time (weeks and months) without interruption. These machines must be able to run in the event of frequent interrupts in such a manner that the capability is not severely degraded. Thus, research and development of scalable RAS technologies is paramount to the success of future extreme-scale systems. This talk summarizes our accomplishments in the area of high-level RAS for HPC, such as developed concepts and implemented proof-of-concept prototypes." }
@misc{engelmann09high, author = "Christian Engelmann", title = "High-Performance Computing Research and {MSc} Internship Opportunities at {Oak Ridge National Laboratory}", month = may # "~11, ", year = "2009", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann09high.pdf", abstract = "Oak Ridge National Laboratory (ORNL) is the largest energy laboratory in the United States. Its National Center for Computational Sciences (NCCS) provides the most powerful computing resources in the world for open scientific research. Jaguar, a Cray XT5 system at NCCS, is the second HPC system to exceed 1 PFlop/s ($10^{15}$ Floating Point Operations Per Second), and the fastest open science supercomputer in the world. It recently ranked #2 in the Top 500 List of Supercomputer Sites with a maximal LINPACK benchmark performance of 1.059 PFlop/s and a theoretical peak performance of 1.3814 PFlop/s. Annually, 80 percent of Jaguar's resources are allocated through the U.S. Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, a competitively selected, peer reviewed process open to researchers from universities, industry, government and non-profit organizations. These allocations address some of the most challenging scientific problems in areas such as climate modeling, renewable energy, materials science, fusion and combustion. In conjunction with NCCS, the Computer Science and Mathematics Division at ORNL performs basic and applied research in HPC, mathematics, and intelligent systems. This talk gives a summary of the HPC research performed at ORNL. It provides details about the Jaguar peta-scale computing resource, an overview of the computational science research carried out using ORNL's computing resources, and a description of various computer science efforts targeting solutions for next-generation HPC systems. This talk also provides information about internship opportunities for MSc students." }
@misc{engelmann09modular, author = "Christian Engelmann", title = "Modular Redundancy for Soft-Error Resilience in Large-Scale {HPC} Systems", month = may # "~3-8, ", year = "2009", howpublished = "{Invited talk at the \href{http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=09191} {Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids}, Schloss Dagstuhl, Wadern, Germany}", url = "http://www.christian-engelmann.info/publications/engelmann09modular.pdf", abstract = "Recent investigations into resilience of large-scale high-performance computing (HPC) systems showed a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean-time to failure (MTTF) and a higher mean-time to recover (MTTR) than their predecessors. Modular redundancy is being used in many mission critical systems today to provide for resilience, such as for aerospace and command \& control systems. The primary argument against modular redundancy for resilience in HPC has always been that the capability of an HPC system, and respective return on investment, would be significantly reduced. We argue that modular redundancy can significantly increase compute node availability as it removes the impact of scale from single compute node MTTR. We further argue that single compute nodes can be much less reliable, and therefore less expensive, and still be highly available, if their MTTR/MTTF ratio is maintained." }
@misc{engelmann09proactive2, author = "Christian Engelmann", title = "Proactive Fault Tolerance Using Preemptive Migration", month = apr # "~22-24, ", year = "2009", howpublished = "{Invited talk at the \href{http://acet.rdg.ac.uk/events/details/cancun.php} {$3^{rd}$ Collaborative and Grid Computing Technologies Workshop (CGCTW) 2009}, Cancun, Mexico}", url = "http://www.christian-engelmann.info/publications/engelmann09proactive2.pdf", abstract = "The continuing growth in high-performance computing (HPC) system scale poses a challenge for system software and scientific applications with respect to reliability, availability and serviceability (RAS). In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation HPC systems. The concept of proactive fault tolerance prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This talk presents our past and ongoing efforts in proactive fault resilience for HPC. Presented work includes proactive fault resilience techniques, transparent process- and virtual-machine-level migration, system and application reliability models and analyses, failure prediction, and trade-off models for combining preemptive migration with checkpoint/restart. All these individual technologies are put into context with a proposed holistic HPC fault resilience framework." }
@misc{engelmann09resiliency, author = "Christian Engelmann", title = "Resiliency", month = mar # "~9-12, ", year = "2009", howpublished = "{Panel at the \href{http://www.cs.sandia.gov/Conferences/SOS13}{$13^{th}$ Workshop on Distributed Supercomputing (SOS) 2009}, Hilton Head, SC, USA}" }
@misc{engelmann08high, author = "Christian Engelmann", title = "High-Performance Computing Research at {Oak Ridge National Laboratory}", month = dec # "~8, ", year = "2008", howpublished = "{Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann08high.pdf", abstract = "Oak Ridge National Laboratory (ORNL) is the largest energy laboratory in the United States. Its National Center for Computational Sciences (NCCS) provides the most powerful computing resources in the world for open scientific research. Jaguar, a Cray XT5 system at NCCS, is the second HPC system to exceed 1 PFlop/s ($10^{15}$ Floating Point Operations Per Second), and the fastest open science supercomputer in the world. It recently ranked #2 in the Top 500 List of Supercomputer Sites with a maximal LINPACK benchmark performance of 1.059 PFlop/s and a theoretical peak performance of 1.3814 PFlop/s. Annually, 80 percent of Jaguar's resources are allocated through the U.S. Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, a competitively selected, peer reviewed process open to researchers from universities, industry, government and non-profit organizations. These allocations address some of the most challenging scientific problems in areas such as climate modeling, renewable energy, materials science, fusion and combustion. In conjunction with NCCS, the Computer Science and Mathematics Division at ORNL performs basic and applied research in HPC, mathematics, and intelligent systems. This talk gives a summary of the HPC research performed at ORNL. It provides details about the Jaguar peta-scale computing resource, an overview of the computational science research carried out using ORNL's computing resources, and a description of various computer science efforts targeting solutions for next-generation HPC systems." }
@misc{engelmann08modular, author = "Christian Engelmann", title = "Modular Redundancy in {HPC} Systems: {W}hy, Where, When and How?", month = oct # "~15, ", year = "2008", howpublished = "{Invited talk at the $1^{st}$ HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2008, in conjunction with the \href{http://www.lanl.gov/conferences/lacss/2008}{$1^{st}$ Los Alamos Computer Science Symposium (LACSS) 2008}, Santa Fe, NM, USA}", url = "http://www.christian-engelmann.info/publications/engelmann08modular.ppt.pdf", abstract = "The continuing growth in high-performance computing (HPC) system scale poses a challenge for system software and scientific applications with respect to reliability, availability and serviceability (RAS). With only very few exceptions, the availability of recently installed systems has been lower in comparison to the same deployment phase of their predecessors. As a result, sites lower allowable job run times in order to force applications to store intermediate results (checkpoints) as insurance against lost computation time. However, checkpoints themselves waste valuable computation time and resources. In contrast to the experienced loss of availability, the demand for continuous availability has risen dramatically with the trend towards capability computing, which drives the race for scientific discovery by running applications on the fastest machines available while desiring significant amounts of time (weeks and months) without interruption. These machines must be able to run in the event of frequent interrupts in such a manner that the capability is not severely degraded. Thus, research and development of scalable RAS technologies is paramount to the success of future extreme-scale systems. This talk summarizes our past accomplishments, ongoing work, and future plans in the area of high-level RAS for HPC." }
@misc{engelmann08resiliency, author = "Christian Engelmann", title = "Resiliency for High-Performance Computing", month = apr # "~10-12, ", year = "2008", howpublished = "{Invited talk at the \href{http://acet.rdg.ac.uk/events/details/cancun.php} {$2^{nd}$ Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008}, Cancun, Mexico}", url = "http://www.christian-engelmann.info/publications/engelmann08resiliency.ppt.pdf", abstract = "In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation high-performance computing (HPC) systems. One major source of concern is non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic. The probability of such errors not only grows with system size, but also with increasing architectural vulnerability caused by employing accelerators and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to associated overheads, while proactive resiliency technologies, such as preemptive migration, simply fail as random soft errors can't be predicted. This talk proposes a new, bold direction in resiliency for HPC as it targets resiliency for next-generation extreme-scale HPC systems at the system software level through computational redundancy strategies, i.e., dual- and triple-modular redundancy." }
@misc{engelmann08advanced, author = "Christian Engelmann", title = "Advanced Fault Tolerance Solutions for High Performance Computing", month = feb # "~11, ", year = "2008", howpublished = "{Seminar at the \href{http://www.laas.fr}{Laboratoire d'Analyse et d'Architecture des Syst\`emes}, \href{http://www.cnrs.fr}{Centre National de la Recherche Scientifique}, Toulouse, France}", url = "http://www.christian-engelmann.info/publications/engelmann08advanced.ppt.pdf", abstract = "The continuing growth in high performance computing (HPC) system scale poses a challenge for system software and scientific applications with respect to reliability, availability and serviceability (RAS). With only very few exceptions, the availability of recently installed systems has been lower in comparison to the same deployment phase of their predecessors. As a result, sites lower allowable job run times in order to force applications to store intermediate results (checkpoints) as insurance against lost computation time. However, checkpoints themselves waste valuable computation time and resources. In contrast to the experienced loss of availability, the demand for continuous availability has risen dramatically with the trend towards capability computing, which drives the race for scientific discovery by running applications on the fastest machines available while desiring significant amounts of time (weeks and months) without interruption. These machines must be able to run in the event of frequent interrupts in such a manner that the capability is not severely degraded. Thus, research and development of scalable RAS technologies is paramount to the success of future extreme-scale systems. This talk summarizes our accomplishments in the area of high-level RAS for HPC, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment." }
@misc{engelmann07service, author = "Christian Engelmann", title = "Service-Level High Availability in Parallel and Distributed Systems", month = oct # "~10, ", year = "2007", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann07service.pdf", abstract = "As service-oriented architectures become more important in parallel and distributed computing systems, individual service instance reliability as well as appropriate service redundancy are essential to increase overall system availability. This talk focuses on redundancy strategies using service-level replication techniques. An overview of existing programming models for service-level high availability is presented and their differences, similarities, advantages, and disadvantages are discussed. Recent advances in providing service-level symmetric active/active high availability are discussed. While the primary target of the presented research is high availability for service nodes in tightly-coupled extreme-scale high-performance computing (HPC) systems, it is also applicable to loosely-coupled distributed computing scenarios." }
@misc{engelmann07advanced2, author = "Christian Engelmann", title = "Advanced Fault Tolerance Solutions for High Performance Computing", month = jun # "~8, ", year = "2007", howpublished = "{Invited talk at the \href{http://www.thaigrid.or.th/wttc2007}{Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007}, Khon Kean, Thailand}", url = "http://www.christian-engelmann.info/publications/engelmann07advanced2.ppt.pdf", abstract = "The continuing growth in high performance computing (HPC) system scale poses a challenge for system software and scientific applications with respect to reliability, availability and serviceability (RAS). With only very few exceptions, the availability of recently installed systems has been lower in comparison to the same deployment phase of their predecessors. As a result, sites lower allowable job run times in order to force applications to store intermediate results (checkpoints) as insurance against lost computation time. However, checkpoints themselves waste valuable computation time and resources. In contrast to the experienced loss of availability, the demand for continuous availability has risen dramatically with the trend towards capability computing, which drives the race for scientific discovery by running applications on the fastest machines available while desiring significant amounts of time (weeks and months) without interruption. These machines must be able to run in the event of frequent interrupts in such a manner that the capability is not severely degraded. Thus, research and development of scalable RAS technologies is paramount to the success of future extreme-scale systems. This talk summarizes our accomplishments in the area of high-level RAS for HPC, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment." }
@misc{engelmann07advanced, author = "Christian Engelmann", title = "Advanced Fault Tolerance Solutions for High Performance Computing", month = jun # "~4-5, ", year = "2007", howpublished = "{Invited talk at the \href{http://www.thaigrid.or.th/wttc2007}{Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007}, Bangkok, Thailand}", url = "http://www.christian-engelmann.info/publications/engelmann07advanced.ppt.pdf", abstract = "The continuing growth in high performance computing (HPC) system scale poses a challenge for system software and scientific applications with respect to reliability, availability and serviceability (RAS). With only very few exceptions, the availability of recently installed systems has been lower in comparison to the same deployment phase of their predecessors. As a result, sites lower allowable job run times in order to force applications to store intermediate results (checkpoints) as insurance against lost computation time. However, checkpoints themselves waste valuable computation time and resources. In contrast to the experienced loss of availability, the demand for continuous availability has risen dramatically with the trend towards capability computing, which drives the race for scientific discovery by running applications on the fastest machines available while desiring significant amounts of time (weeks and months) without interruption. These machines must be able to run in the event of frequent interrupts in such a manner that the capability is not severely degraded. Thus, research and development of scalable RAS technologies is paramount to the success of future extreme-scale systems. This talk summarizes our accomplishments in the area of high-level RAS for HPC, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment." }
@misc{engelmann07operating, author = "Christian Engelmann", title = "Operating System Research at {ORNL}: {S}ystem-level Virtualization", month = apr # "~10, ", year = "2007", howpublished = "{Seminar at the \href{http://www.gup.uni-linz.ac.at} {Institute of Graphics and Parallel Processing}, \href{http://www.uni-linz.ac.at}{Johannes Kepler University}, Linz, Austria}", url = "http://www.christian-engelmann.info/publications/engelmann07operating.ppt.pdf", abstract = "The emergence of virtualization-enabled hardware, such as the latest generation AMD and Intel processors, has raised significant interest in the High Performance Computing (HPC) community. In particular, system-level virtualization provides an opportunity to advance the design and development of operating systems, programming environments, administration practices, and resource management tools. This leads to some potential research topics for HPC, such as failure tolerance, system management, and solutions for application porting to new HPC platforms. This talk will present an overview of the research in System-level Virtualization performed by the Systems Research Team in the Computer Science Research Group at Oak Ridge National Laboratory." }
@misc{engelmann07towards, author = "Christian Engelmann", title = "Towards High Availability for High-Performance Computing System Services: {A}ccomplishments and Limitations", month = mar # "~14, ", year = "2007", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann07towards.pdf", abstract = "During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University focused on efficient redundancy strategies for head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, like job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This talk summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment." }
@misc{engelmann06high, author = "Christian Engelmann", title = "High Availability for Ultra-Scale High-End Scientific Computing", month = jun # "~9, ", year = "2006", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann06high.ppt.pdf", abstract = "A major concern in exploiting ultra-scale architectures for scientific high-end computing (HEC) with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, is the potential inability to identify problems and take preemptive action before a failure impacts a running job. In fact, in systems of this scale, predictions estimate the mean time to interrupt in terms of hours. Current solutions for fault-tolerance in HEC focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services (e.g. MPI) or even of the entire machine. High availability (HA) computing strives to avoid the problems of unexpected failures through preemptive measures. There are various techniques to implement high availability. In contrast to active/hot-standby high availability with its fail-over model, active/active high availability with its virtual synchrony model is superior in many areas including scalability, throughput, availability and responsiveness. However, it is significantly more complex. The overall goal of our research is to expand today's effort in HA for HEC, so that systems that have the ability to hot-swap hardware components can be kept alive by an OS runtime environment that understands the concept of dynamic system configuration. This talk will present an overview of recent research at Oak Ridge National Laboratory in high availability solutions for ultra-scale scientific high-end computing." }
@misc{scott06advancing, author = "Stephen L. Scott and Christian Engelmann", title = "Advancing Reliability, Availability and Serviceability for High-Performance Computing", month = apr # "~19, ", year = "2006", howpublished = "{Seminar at the \href{http://www.gup.uni-linz.ac.at} {Institute of Graphics and Parallel Processing}, \href{http://www.uni-linz.ac.at}{Johannes Kepler University}, Linz, Austria}", url = "http://www.christian-engelmann.info/publications/scott06advancing.ppt.pdf", abstract = "Today’s high performance computing systems have several reliability deficiencies resulting in noticeable availability and serviceability issues. For example, head and service nodes represent a single point of failure and control for an entire system as they render it inaccessible and unmanageable in case of a failure until repair, causing a significant downtime. Furthermore, current solutions for fault-tolerance focus on dealing with the result of a failure. However, most are unable to transparently mask runtime system configuration changes caused by failures and require a complete restart of essential system services, such as MPI, in case of a failure. High availability computing strives to avoid the problems of unexpected failures through preemptive measures. The overall goal of our research is to expand today’s effort in high availability for high-performance computing, so that systems can be kept alive by an OS runtime environment that understands the concepts of dynamic system configuration and degraded operation mode. This talk will present an overview of recent research performed at Oak Ridge National Laboratory in collaboration with Louisiana Tech University, North Carolina State University and the University of Reading in developing core technologies and proof-of-concept prototypes that improve the overall reliability, availability and serviceability of high-performance computing systems." }
@misc{engelmann05high4, author = "Christian Engelmann", title = "High Availability for Ultra-Scale High-End Scientific Computing", month = oct # "~18, ", year = "2005", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann05high4.ppt.pdf", abstract = "A major concern in exploiting ultra-scale architectures for scientific high-end computing (HEC) with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, is the potential inability to identify problems and take preemptive action before a failure impacts a running job. In fact, in systems of this scale, predictions estimate the mean time to interrupt in terms of hours. Current solutions for fault-tolerance in HEC focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services (e.g. MPI) or even of the entire machine. High availability (HA) computing strives to avoid the problems of unexpected failures through preemptive measures. There are various techniques to implement high availability. In contrast to active/hot-standby high availability with its fail-over model, active/active high availability with its virtual synchrony model is superior in many areas including scalability, throughput, availability and responsiveness. However, it is significantly more complex. The overall goal of our research is to expand today's effort in HA for HEC, so that systems that have the ability to hot-swap hardware components can be kept alive by an OS runtime environment that understands the concept of dynamic system configuration. This talk will present an overview of recent research at Oak Ridge National Laboratory in high availability solutions for ultra-scale scientific high-end computing." }
@misc{engelmann05high3, author = "Christian Engelmann", title = "High Availability for Ultra-Scale High-End Scientific Computing", month = sep # "~26, ", year = "2005", howpublished = "{Seminar at the \href{http://www.uncfsu.edu/macsc}{Department of Mathematics and Computer Science}, \href{http://www.uncfsu.edu}{Fayetteville State University}, Fayetteville, NC, USA}", url = "http://www.christian-engelmann.info/publications/engelmann05high3.ppt.pdf", abstract = "A major concern in exploiting ultra-scale architectures for scientific high-end computing (HEC) with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, is the potential inability to identify problems and take preemptive action before a failure impacts a running job. In fact, in systems of this scale, predictions estimate the mean time to interrupt in terms of hours. Current solutions for fault-tolerance in HEC focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services (e.g. MPI) or even of the entire machine. High availability (HA) computing strives to avoid the problems of unexpected failures through preemptive measures. There are various techniques to implement high availability. In contrast to active/hot-standby high availability with its fail-over model, active/active high availability with its virtual synchrony model is superior in many areas including scalability, throughput, availability and responsiveness. However, it is significantly more complex. The overall goal of our research is to expand today’s effort in HA for HEC, so that systems that have the ability to hot-swap hardware components can be kept alive by an OS runtime environment that understands the concept of dynamic system configuration. This talk will present an overview of recent research at Oak Ridge National Laboratory in fault tolerance and high availability solutions for ultra-scale scientific high-end computing." }
@misc{engelmann05high2, author = "Christian Engelmann", title = "High Availability for Ultra-Scale High-End Scientific Computing", month = may # "~13, ", year = "2005", howpublished = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk} {University of Reading}, Reading, United Kingdom}", url = "http://www.christian-engelmann.info/publications/engelmann05high2.ppt.pdf", abstract = "A major concern in exploiting ultra-scale architectures for scientific high-end computing (HEC) with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, is the potential inability to identify problems and take preemptive action before a failure impacts a running job. In fact, in systems of this scale, predictions estimate the mean time to interrupt in terms of hours. Current solutions for fault-tolerance in HEC focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services (e.g. MPI) or even of the entire machine. High availability (HA) computing strives to avoid the problems of unexpected failures through preemptive measures. There are various techniques to implement high availability. In contrast to active/hot-standby high availability with its fail-over model, active/active high availability with its virtual synchrony model is superior in many areas including scalability, throughput, availability and responsiveness. However, it is significantly more complex. The overall goal of our research is to expand today’s effort in HA for HEC, so that systems that have the ability to hot-swap hardware components can be kept alive by an OS runtime environment that understands the concept of dynamic system configuration. This talk will present an overview of recent research at Oak Ridge National Laboratory in fault-tolerant heterogeneous metacomputing, advanced super-scalable algorithms and high availability system software for ultra-scale scientific high-end computing." }
@misc{engelmann05high1, author = "Christian Engelmann", title = "High Availability for Ultra-Scale High-End Scientific Computing", month = apr # "~15, ", year = "2005", howpublished = "{Seminar at the \href{http://cenit.latech.edu}{Center for Entrepreneurship and Information Technology}, \href{http://www.latech.edu}{Louisiana Tech University}, Ruston, LA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann05high1.ppt.pdf", abstract = "A major concern in exploiting ultra-scale architectures for scientific high-end computing (HEC) with tens to hundreds of thousands of processors is the potential inability to identify problems and take preemptive action before a failure impacts a running job. In fact, in systems of this scale, predictions estimate the mean time to interrupt in terms of hours. Current solutions for fault-tolerance in HEC focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services (e.g. MPI) or even of the entire machine. High availability (HA) computing strives to avoid the problems of unexpected failures through preemptive measures. There are various techniques to implement high availability. In contrast to active/hot-standby high availability with its fail-over model, active/active high availability with its virtual synchrony model is superior in many areas including scalability, throughput, availability and responsiveness. However, it is significantly more complex. The overall goal of this research is to expand today’s effort in HA for HEC, so that systems that have the ability to hot-swap hardware components can be kept alive by an OS runtime environment that understands the concept of dynamic system configuration. With the aim of addressing the future challenges of high availability in ultra-scale HEC, this project intends to develop a proof-of-concept implementation of an active/active high availability system software framework." }
@misc{engelmann04diskless, author = "Christian Engelmann", title = "Diskless Checkpointing on Super-scale Architectures -- {A}pplied to the Fast Fourier Transform", month = feb # "~25, ", year = "2004", howpublished = "{Invited talk at the \href{http://www.siam.org/meetings/pp04} {$11^{th}$ SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP) 2004}, San Francisco, CA, USA}", url = "http://www.christian-engelmann.info/publications/engelmann04diskless.ppt.pdf", abstract = "This talk discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM Blue Gene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from failures more efficiently. In this work, we adapt the present technique of diskless checkpointing to such huge distributed systems in order to equip existing scientific algorithms with super-scalable fault-tolerance. First, we discuss the method of diskless checkpointing, then we adapt this technique to super-scale architectures and finally we present results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault-tolerance." }
@misc{engelmann04superscalable, author = "Christian Engelmann", title = "Super-scalable Algorithms -- {N}ext Generation Supercomputing on 100,000 and more Processors", month = jan # "~29, ", year = "2004", howpublished = "{Seminar at the \href{http://www.csm.ornl.gov}{Computer Science and Mathematics Division}, \href{http://www.ornl.gov} {Oak Ridge National Laboratory}, Oak Ridge, TN, USA}", url = "http://www.christian-engelmann.info/publications/engelmann04superscalable.ppt.pdf", abstract = "This talk discusses recent research into the issues and potential problems of algorithm scalability and fault-tolerance on next-generation high-performance computer systems with tens and even hundreds of thousands of processors. Such massively parallel computers, like the IBM Blue Gene/L, are going to be deployed in the next five to ten years and existing deficiencies in scalability and fault-tolerance need to be addressed soon. Scientific algorithms have shown poor scalability on 10,000-processor systems that exist today. Furthermore, future systems will be less reliable due to the large number of components. Super-scalable algorithms, which have the properties of scale invariance and natural fault-tolerance, are able to get the correct answer despite multiple task failures and without checkpointing. We will show that such algorithms exist for a wide variety of problems, such as finite difference, finite element, multigrid and global maximum. Despite these findings, traditional algorithms may still be preferred due to their known behavior, or simply because a super-scalable algorithm does not exist or is hard to find for a particular problem. In this case, we propose a peer-to-peer diskless checkpointing algorithm that can provide scale invariant fault-tolerance." }
@misc{engelmann03distributed, author = "Christian Engelmann", title = "Distributed Peer-to-Peer Control for {Harness}", month = feb # "~11, ", year = "2004", howpublished = "{Seminar at the \href{http://www.csc.ncsu.edu}{Department of Computer Science}, \href{http://www.ncsu.edu}{North Carolina State University}, Raleigh, NC, USA}", url = "http://www.christian-engelmann.info/publications/engelmann03distributed.ppt.pdf", abstract = "Harness is an adaptable fault-tolerant virtual machine environment for next-generation heterogeneous distributed computing developed as a follow-on to PVM. It additionally enables the assembly of applications from plug-ins and provides fault-tolerance. This work describes the distributed control, which manages global state replication to ensure high availability of the service. Group communication services achieve an agreement on an initial global state and a linear history of global state changes at all members of the distributed virtual machine. This global state is replicated to all members to easily recover from single, multiple and cascaded faults. A peer-to-peer ring network architecture and tunable multi-point failure conditions provide heterogeneity and scalability. Finally, the integration of the distributed control into the multi-threaded kernel architecture of Harness offers a fault-tolerant global state database service for plug-ins and applications." }
@mastersthesis{jones10simulation, author = "Ian S. Jones", title = "Simulation of Large Scale Architectures on High Performance Computers", month = oct # "~22, ", year = "2010", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville)", url = "http://www.christian-engelmann.info/publications/jones10simulation.pdf", url2 = "http://www.christian-engelmann.info/publications/jones10simulation.ppt.pdf", abstract = "Powerful supercomputers often need to be simulated for the purposes of testing the scalability of various applications. This thesis endeavours to further develop the existing simulator, XSIM, and implement the functionality to simulate real-world networks and the latency which might be encountered by messages travelling through that network. The upgraded simulator will then be tested at the Oak Ridge National Laboratory. The work completed herein should provide a solid foundation for further improvements to XSIM; it simulates a variety of basic network topologies, calculates the shortest path for any given message, and generates a transmission time." }
@mastersthesis{boehm10development, author = "Swen B{\"o}hm", title = "Development of a {RAS} Framework for {HPC} Environments: {Realtime} Data Reduction of Monitoring Data", month = mar # "~12, ", year = "2010", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville)", url = "http://www.christian-engelmann.info/publications/boehm10development.pdf", url2 = "http://www.christian-engelmann.info/publications/boehm10development.ppt.pdf", abstract = "The advancements of high-performance computing (HPC) systems in the last decades have led to more and more complex systems containing thousands or tens of thousands of computing systems that work together. While the computational performance of these systems has increased dramatically in recent years, the I/O subsystems have not seen a comparable improvement. With increasing numbers of hardware components in next-generation HPC systems, maintaining the reliability of such systems becomes more and more difficult, since the probability of hardware failures increases with the number of components. The capacities of traditional reactive fault tolerance technologies are exceeded by the development of next-generation systems, and alternatives have to be found. This thesis discusses a monitoring system that uses data reduction techniques to decrease the amount of collected data. The system is part of a proactive fault tolerance system that may address the reliability problems of exascale HPC systems." }
@mastersthesis{lauer10simulation, author = "Frank Lauer", title = "Simulation of Advanced Large-Scale {HPC} Architectures", month = mar # "~12, ", year = "2010", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville)", url = "http://www.christian-engelmann.info/publications/lauer10simulation.pdf", url2 = "http://www.christian-engelmann.info/publications/lauer10simulation.ppt.pdf", abstract = "The rapid development of massively parallel systems in the high-performance computing (HPC) area requires efficient scalability of applications. The design of next-generation supercomputers is not yet certain in terms of their computational, memory and I/O capabilities. However, it is almost certain that they will become even more parallel. Getting the most performance from these machines is not only a matter of hardware; it is also an issue of programming design. Therefore, it has to be a co-development. However, how can algorithms be tested on machines that do not exist today? To address the programming issues in terms of scalability and fault tolerance for the next generation, this project's aim is to design and develop a simulator based on parallel discrete event simulation (PDES) for applications using MPI communication. Some of the fastest supercomputers in the world already interconnect $10^5$ cores; to keep up, the simulator will be able to simulate at least $10^7$ virtual processes." }
@mastersthesis{litvinova09ras, author = "Antonina Litvinova", title = "{RAS} Framework Engine Prototype", month = sep # "~22, ", year = "2009", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville)", url = "http://www.christian-engelmann.info/publications/litvinova09ras.pdf", url2 = "http://www.christian-engelmann.info/publications/litvinova09ras.ppt.pdf", abstract = "Extreme high-performance computing (HPC) systems constantly increase in scale, from a few thousand processor cores to hundreds of thousands of processor cores and beyond. However, their system mean time to interrupt decreases accordingly. The current approach to fault tolerance in HPC is checkpoint/restart, i.e., a method based on recovery from experienced failures. However, checkpoint/restart can no longer deal with errors as efficiently because of how HPC systems are changing, for example, increasing error rates, increasing aggregate memory, and input/output capabilities that do not increase proportionally. A recently introduced concept is proactive fault tolerance, which avoids failures through preventative measures. Proactive fault tolerance uses migration, an emerging technology that prevents failures on HPC systems by migrating applications or application parts away from a deteriorating node to a spare node. This thesis discusses work conducted at ORNL to develop a Proactive Fault Tolerance Framework Engine Prototype for HPC systems with high reliability, availability and serviceability. The prototype performs environmental system monitoring, system event logging, parallel job monitoring and system resource monitoring in order to analyse HPC system reliability and to perform fault avoidance through migration." }
@mastersthesis{koenning07virtualized, author = "Bj{\"o}rn K{\"o}nning", title = "Virtualized Environments for the {Harness Workbench}", month = mar # "~14, ", year = "2007", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory)", url = "http://www.christian-engelmann.info/publications/koenning07virtualized.pdf", url2 = "http://www.christian-engelmann.info/publications/koenning07virtualized.ppt.pdf", abstract = "The expanded use of computational sciences today leads to a significant need for high performance computing systems. High performance computing is currently undergoing a vigorous revival, and multiple efforts are underway to develop much faster computing systems in the near future. New software tools are required for the efficient use of petascale computing systems. With the new Harness Workbench Project the Oak Ridge National Laboratory intends to develop an appropriate development and runtime environment for high performance computing platforms. This dissertation project is part of the Harness Workbench Project, and deals with the development of a concept for virtualised environments and various approaches to create and describe them. The developed virtualisation approach is based on the \verb|chroot| mechanism and uses platform-independent environment descriptions. File structures and environment variables are emulated to provide portability of computational software across diverse high performance computing platforms. Security measures and sandbox characteristics can be integrated." }
@mastersthesis{weber07high, author = "Matthias Weber", title = "High Availability for the {Lustre} File System", month = mar # "~14, ", year = "2007", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the \href{http://www.f1.fhtw-berlin.de}{Department of Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical College for Engineering and Economics (FHTW) Berlin}, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory)", url = "http://www.christian-engelmann.info/publications/weber07high.pdf", url2 = "http://www.christian-engelmann.info/publications/weber07high.ppt.pdf", abstract = "With the growing importance of high performance computing and, more importantly, the fast growing size of sophisticated high performance computing systems, research in the area of high availability is essential to meet the needs to sustain the current growth. This Master thesis project aims to improve the availability of Lustre. The major concern of this project is the metadata server of the file system. The metadata server of Lustre represents the last single point of failure in the file system. To overcome this single point of failure an active/active high availability approach is introduced. The new file system design with multiple MDS nodes running in virtual synchrony leads to a significant increase of availability. Two prototype implementations aim to show how the proposed system design and its new realized form of symmetric active/active high availability can be accomplished in practice. The results of this work point out the difficulties in adapting the file system to the active/active high availability design. Tests identify functionality that was not achieved and show performance problems of the proposed solution. The findings of this dissertation may be used for further work on high availability for distributed file systems." }
@mastersthesis{baumann06design, author = "Ronald Baumann", title = "Design and Development of Prototype Components for the {Harness} High-Performance Computing Workbench", month = mar # "~6, ", year = "2006", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the \href{http://www.f1.fhtw-berlin.de}{Department of Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical College for Engineering and Economics (FHTW) Berlin}, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory)", url = "http://www.christian-engelmann.info/publications/baumann06design.pdf", url2 = "http://www.christian-engelmann.info/publications/baumann06design.ppt.pdf", abstract = "This master thesis examines plug-in technology, especially the new field of parallel plug-ins. Plug-ins are popular because they extend the capabilities of software packages such as browsers and Photoshop, and allow an individual user to add new functionality. Parallel plug-ins also provide the above capabilities to a distributed set of resources, i.e., a plug-in now becomes a set of coordinating plug-ins. Second, the set of plug-ins may be heterogeneous either in function or because the underlying resources are heterogeneous. This new dimension of complexity provides a rich research space which is explored in this thesis. Experiences are collected and presented as parallel plug-in paradigms and concepts. The Harness framework was used in this project, in particular the plugin manager and available communication capabilities. Plug-ins provide methods for users to extend Harness according to their requirements. The result of this thesis is a parallel plug-in paradigm and template for Harness. Users of the Harness environment will be able to design and implement their applications in the form of parallel plug-ins more easily and quickly by using the paradigm resulting from this project. Prototypes were implemented which handle different aspects of parallel plug-ins. Parallel plug-in configurations were tested on an appropriate number of Harness kernels, including available communication and error-handling capabilities. Furthermore, research was done in the area of fault tolerance while parallel plug-ins are (un)loaded, as well as while a task is performed." }
@mastersthesis{uhlemann06high, author = "Kai Uhlemann", title = "High Availability for High-End Scientific Computing", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", month = mar # "~6, ", year = "2006", note = "Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the \href{http://www.f1.fhtw-berlin.de}{Department of Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical College for Engineering and Economics (FHTW) Berlin}, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory)", url = "http://www.christian-engelmann.info/publications/uhlemann06high.pdf", url2 = "http://www.christian-engelmann.info/publications/uhlemann06high.ppt.pdf", abstract = "With the growing interest in and popularity of high performance cluster computing and, more importantly, the fast growing size of compute clusters, research in the area of high availability is essential to meet the needs to sustain the current growth. This Master thesis project introduces a new approach for high availability focusing on the head node of a cluster system. This project's focus is on providing high availability to the job scheduler service, which is the most vital part of the traditional Beowulf-style cluster architecture. This research seeks to add high availability to the job scheduler service and resource management system, typically running on the head node, leading to a significant increase in availability for cluster computing. Also, this software project takes advantage of the virtual synchrony paradigm to achieve active/active replication, the highest form of high availability. A proof-of-concept implementation shows how high availability can be designed in software and what results can be expected of such a system. The results may be reused for future or existing projects to further improve and extend the high availability of compute clusters." }
@phdthesis{engelmann08symmetric3, author = "Christian Engelmann", title = "Symmetric Active/Active High Availability for High-Performance Computing System Services", month = dec # "~8, ", year = "2008", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Advisor: Prof. Vassil N. Alexandrov (University of Reading)", url = "http://www.christian-engelmann.info/publications/engelmann08symmetric3.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann08symmetric3.ppt.pdf", abstract = "In order to address anticipated high failure rates, reliability, availability and serviceability have become an urgent priority for next-generation high-performance computing (HPC) systems. This thesis aims to pave the way for highly available HPC systems by focusing on their most critical components and by reinforcing them with appropriate high availability solutions. Service components, such as head and service nodes, are the Achilles heel of a HPC system. A failure typically results in a complete system-wide outage. This thesis targets efficient software state replication mechanisms for service component redundancy to achieve high availability as well as high performance. Its methodology relies on defining a modern theoretical foundation for providing service-level high availability, identifying availability deficiencies of HPC systems, and comparing various service-level high availability methods. This thesis showcases several developed proof-of-concept prototypes providing high availability for services running on HPC head and service nodes using the symmetric active/active replication method, i.e., state-machine replication, to complement prior work in this area using active/standby and asymmetric active/active configurations. Presented contributions include a generic taxonomy for service high availability, an insight into availability deficiencies of HPC systems, and a unified definition of service-level high availability methods. Further contributions encompass a fully functional symmetric active/active high availability prototype for a HPC job and resource management service that does not require modification of the service, a fully functional symmetric active/active high availability prototype for a HPC parallel file system metadata service that offers high performance, and two preliminary prototypes for a transparent symmetric active/active replication software framework for client-service and dependent service scenarios that hide the replication infrastructure from clients and services. Assuming a mean-time to failure of 5,000 hours for a head or service node, all presented prototypes improve service availability from 99.285\% to 99.995\% in a two-node system, and to 99.99996\% with three nodes." }
@mastersthesis{engelmann01distributed, author = "Christian Engelmann", title = "Distributed Peer-to-Peer Control for {Harness}", month = jul # "~7, ", year = "2001", school = "\href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK", note = "Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the \href{http://www.f1.fhtw-berlin.de}{Department of Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical College for Engineering and Economics (FHTW) Berlin}, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist (Oak Ridge National Laboratory)", url = "http://www.christian-engelmann.info/publications/engelmann01distributed.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann01distributed.ppt.pdf", abstract = "Parallel processing, the method of cutting down a large computational problem into many small tasks which are solved in parallel, is a field of increasing importance in science. Cost-effective, flexible and efficient simulations of mathematical models of physical, chemical or biological real-world problems are replacing the traditional experimental research. Current software solutions for parallel and scientific computation, like Parallel Virtual Machine and Message Passing Interface, have limitations in handling faults and failures, in utilizing heterogeneous and dynamically changing communication structures, and in enabling migrating or cooperative applications. The current research in heterogeneous adaptable reconfigurable networked systems (Harness) aims to produce the next generation of software solutions for distributed computing. A highly available and lightweight distributed virtual machine service provides an encapsulation of a few hundred to a few thousand physical machines in a virtual heterogeneous large-scale cluster. High availability of a service in distributed systems can be achieved by replication of the service state on multiple server processes. If one or more server processes fail, the surviving ones continue to provide the service because they know the state. Since every member of a distributed virtual machine is part of the distributed virtual machine service state and is able to change this state, a distributed control is needed to replicate the state and maintain its consistency. This distributed control manages state changes as well as the state-replication and the detection of and recovery from faults and failures of server processes. This work analyzes system architectures currently used in heterogeneous distributed computing by defining terms, conditions and assumptions. It shows that such systems are asynchronous and may use partially synchronous communication to detect and to distinguish different classes of faults and failures. It describes how high availability of a large-scale distributed service on a huge number of servers residing in different geographical locations can be realized. Asynchronous group communication services, such as Reliable Broadcast, Atomic Broadcast, Distributed Agreement and Membership, are analyzed to develop linearly scalable algorithms in a unidirectionally and in a bidirectionally connected asynchronous peer-to-peer ring architecture. A Transaction Control group communication service is introduced as a state-replication service. 
The system analysis distinguishes different types of distributed systems, where active transactions execute state changes using non-replicated data of one or more servers and inactive transactions report state changes using replicated data only. It is applicable to passive fault-tolerant distributed databases as well as to active fault-tolerant distributed control mechanisms. No control token is used and time stamps are avoided, so that all members of a server group have equal responsibilities and are independent of the system time. A prototype that implements the most complicated Transaction Control algorithm is realized, due to the complexity of the distributed system and the early development stage of the introduced algorithms. The prototype is used to obtain practical experience with the state-replication algorithm." }
@mastersthesis{engelmann01distributed2, author = "Christian Engelmann", title = "Distributed Peer-to-Peer Control for {Harness}", month = feb # "~23, ", year = "2001", school = "\href{http://www.f1.fhtw-berlin.de}{Department of Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical College for Engineering and Economics (FHTW) Berlin}, Germany", note = "Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the \href{http://www.cs.reading.ac.uk}{Department of Computer Science}, \href{http://www.reading.ac.uk}{University of Reading}, UK. Advisors: Prof. Uwe Metzler (Technical College for Engineering and Economics (FHTW) Berlin); George A. (Al) Geist (Oak Ridge National Laboratory)", url = "http://www.christian-engelmann.info/publications/engelmann01distributed2.pdf", url2 = "http://www.christian-engelmann.info/publications/engelmann01distributed2.ppt.pdf", abstract = "Parallel processing, the method of cutting down a large computational problem into many small tasks which are solved in parallel, is a field of increasing importance in science. Cost-effective, flexible and efficient simulations of mathematical models of physical, chemical or biological real-world problems are replacing the traditional experimental research. Current software solutions for parallel and scientific computation, like Parallel Virtual Machine and Message Passing Interface, have limitations in handling faults and failures, in utilizing heterogeneous and dynamically changing communication structures, and in enabling migrating or cooperative applications. The current research in heterogeneous adaptable reconfigurable networked systems (Harness) aims to produce the next generation of software solutions for distributed computing. A highly available and lightweight distributed virtual machine service provides an encapsulation of a few hundred to a few thousand physical machines in a virtual heterogeneous large-scale cluster. High availability of a service in distributed systems can be achieved by replication of the service state on multiple server processes. If one or more server processes fail, the surviving ones continue to provide the service because they know the state. Since every member of a distributed virtual machine is part of the distributed virtual machine service state and is able to change this state, a distributed control is needed to replicate the state and maintain its consistency. This distributed control manages state changes as well as the state-replication and the detection of and recovery from faults and failures of server processes. This work analyzes system architectures currently used in heterogeneous distributed computing by defining terms, conditions and assumptions. It shows that such systems are asynchronous and may use partially synchronous communication to detect and to distinguish different classes of faults and failures. It describes how high availability of a large-scale distributed service on a huge number of servers residing in different geographical locations can be realized. Asynchronous group communication services, such as Reliable Broadcast, Atomic Broadcast, Distributed Agreement and Membership, are analyzed to develop linearly scalable algorithms in a unidirectionally and in a bidirectionally connected asynchronous peer-to-peer ring architecture. A Transaction Control group communication service is introduced as a state-replication service. 
The system analysis distinguishes different types of distributed systems, where active transactions execute state changes using non-replicated data of one or more servers and inactive transactions report state changes using replicated data only. It is applicable to passive fault-tolerant distributed databases as well as to active fault-tolerant distributed control mechanisms. No control token is used and time stamps are avoided, so that all members of a server group have equal responsibilities and are independent of the system time. A prototype that implements the most complicated Transaction Control algorithm is realized, due to the complexity of the distributed system and the early development stage of the introduced algorithms. The prototype is used to obtain practical experience with the state-replication algorithm." }
@techreport{kuchar22system, author = "Olga A. Kuchar and Swen Boehm and Thomas Naughton and Suhas Somnath and Ben Mintz and Jack Lange and Scott Atchley and Rohit Srivastava and Patrick Widener", title = "INTERSECT Architecture Specification: System-of-systems Architecture (Version 0.5)", institution = "Oak Ridge National Laboratory", number = "ORNL/TM-2022/2717", address = "Oak Ridge, TN, USA", month = sep, year = "2022", doi = "10.2172/1968700", url = "http://www.christian-engelmann.info/publications/kuchar22system.pdf", abstract = "Oak Ridge National Laboratory (ORNL)'s Self-driven Experiments for Science / Interconnected Science Ecosystem (INTERSECT) architecture project, titled ``An Open Federated Architecture for the Laboratory of the Future'', creates an open federated hardware/software architecture for the laboratory of the future using a novel system of systems (SoS) and microservice architecture approach, connecting scientific instruments, robot-controlled laboratories and edge/center computing/data resources to enable autonomous experiments, ``self-driving'' laboratories, smart manufacturing, and artificial intelligence (AI)-driven design, discovery and evaluation. The architecture project is divided into three focus areas: design patterns, SoS architecture, and microservice architecture. The design patterns area focuses on describing science use cases as design patterns that identify and abstract the involved hardware/software components and their interactions in terms of control, work and data flow. The SoS architecture area focuses on an open architecture specification for the federated ecosystem that clarifies terms, architectural elements, the interactions between them and compliance. The microservice architecture area describes blueprints for loosely coupled microservices, standardized interfaces, and multi-programming-language support. This document is the SoS Architecture specification only, and captures the system of systems architecture design for the INTERSECT Initiative and its components. It is intended to provide a deep analysis and specification of how the INTERSECT platform will be designed, and to link the scientific needs identified across disciplines with the technical needs involved in the support, development, and evolution of a science ecosystem.", pts = "186209" }