BibTeX Citations

@article{agullo22resiliency,
  author        = "Emmanuel Agullo
                   and Mirco Altenbernd
                   and Hartwig Anzt
                   and Leonardo Bautista-Gomez
                   and Tommaso Benacchio
                   and Luca Bonaventura
                   and Hans-Joachim Bungartz
                   and Sanjay Chatterjee
                   and Florina M. Ciorba
                   and Nathan DeBardeleben
                   and Daniel Drzisga
                   and Sebastian Eibl
                   and Christian Engelmann
                   and Wilfried N. Gansterer
                   and Luc Giraud
                   and Dominik G{\"o}ddeke
                   and Marco Heisig
                   and Fabienne J{\'e}z{\'e}quel
                   and Nils Kohl
                   and Xiaoye Sherry Li
                   and Romain Lion
                   and Miriam Mehl
                   and Paul Mycek
                   and Michael Obersteiner
                   and Enrique S. Quintana-Ort{\'i}
                   and Francesco Rizzi
                   and Ulrich R{\"u}de
                   and Martin Schulz
                   and Fred Fung
                   and Robert Speck
                   and Linda Stals
                   and Keita Teranishi
                   and Samuel Thibault
                   and Dominik Th{\"o}nnes
                   and Andreas Wagner
                   and Barbara Wohlmuth",
  title         = "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations",
  journal       = "\href{http://hpc.sagepub.com}{International Journal of High
                   Performance Computing Applications (IJHPCA)}",
  volume        = "36",
  number        = "2",
  pages         = "251--285",
  month         = mar,
  year          = "2022",
  publisher     = "\href{http://www.sagepub.com}{SAGE Publications}",
  issn          = "1094-3420",
  doi           = "10.1177/10943420211055188",
  url           = "http://www.christian-engelmann.info/publications/agullo22resiliency.pdf",
  abstract      = "This work is based on the seminar titled 'Resiliency in
                   Numerical Algorithm Design for Extreme Scale Simulations'
                   held March 1-6, 2020 at Schloss Dagstuhl, that was attended
                   by all the writers. Advanced supercomputing is characterized
                   by very high computation speeds at the cost of involving an
                   enormous amount of resources and costs. A typical large-scale
                   computation running for 48 hours on a system consuming 20 MW,
                   as predicted for exascale systems, would consume a million
                   kWh, corresponding to about 100k Euro in energy cost for
                    executing $10^{23}$ floating-point operations. It is clearly
                   unacceptable to lose the whole computation if any of the
                   several million parallel processes fails during the execution.
                   Moreover, if a single operation suffers from a bit-flip error,
                   should the whole computation be declared invalid? What about
                   the notion of reproducibility itself: should this core
                   paradigm of science be revised and refined for results that
                   are obtained by large scale simulation? Naive versions of
                   conventional resilience techniques will not scale to the
                   exascale regime: with a main memory footprint of tens of
                   Petabytes, synchronously writing checkpoint data all the way
                   to background storage at frequent intervals will create
                   intolerable overheads in runtime and energy consumption.
                   Forecasts show that the mean time between failures could be
                   lower than the time to recover from such a checkpoint, so
                   that large calculations at scale might not make any progress
                   if robust alternatives are not investigated.
                   More advanced resilience techniques must be devised. The key
                   may lie in exploiting both advanced system features as well
                   as specific application knowledge. Research will face two
                   essential questions: (1) what are the reliability
                   requirements for a particular computation and (2) how do we
                   best design the algorithms and software to meet these
                   requirements? While the analysis of use cases can help
                   understand the particular reliability requirements, the
                   construction of remedies is currently wide open. One avenue
                   would be to refine and improve on system- or application-level
                   checkpointing and rollback strategies in the case an error
                   is detected. Developers might use fault notification
                   interfaces and flexible runtime systems to respond to node
                   failures in an application-dependent fashion. Novel numerical
                   algorithms or more stochastic computational approaches may
                   be required to meet accuracy requirements in the face of
                   undetectable soft errors. These ideas constituted an
                   essential topic of the seminar.
                   The goal of this Dagstuhl Seminar was to bring together a
                   diverse group of scientists with expertise in exascale
                   computing to discuss novel ways to make applications
                   resilient against detected and undetected faults. In
                   particular, participants explored the role that algorithms
                   and applications play in the holistic approach needed to
                   tackle this challenge. This article gathers a broad range
                   of perspectives on the role of algorithms, applications,
                   and systems in achieving resilience for extreme scale
                   simulations. The ultimate goal is to spark novel ideas and
                   encourage the development of concrete solutions for achieving
                   such resilience holistically.",
  pts           = "169116"
}
@article{kumar21study,
  author        = "Mohit Kumar
                   and Saurabh Gupta
                   and Tirthak Patel
                   and Michael Wilder
                   and Weisong Shi
                   and Song Fu
                   and Christian Engelmann
                   and Devesh Tiwari",
  title         = "Study of Interconnect Errors, Network Congestion, and
                   Applications Characteristics for Throttle Prediction on a
                   Large Scale {HPC} System",
  journal       = "\href{http://www.elsevier.com/locate/jpdc}{Journal of
                   Parallel and Distributed Computing (JPDC)}",
  volume        = "153",
  pages         = "29--43",
  month         = jul,
  year          = "2021",
  publisher     = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The
                   Netherlands}",
  issn          = "0743-7315",
  doi           = "10.1016/j.jpdc.2021.03.001",
  url           = "http://www.christian-engelmann.info/publications/kumar21study.pdf",
  abstract      = "Today's High Performance Computing (HPC) systems contain
                   thousand of nodes which work together to provide performance
                   in the order of peta ops. The performance of these systems
                   depends on various components like processors, memory, and
                   interconnect. Among  all, interconnect plays a major role as
                   it glues together all the hardware components in an HPC
                    system. A slow interconnect can severely impact a scientific
                    application running on multiple processes, as they rely on
                    fast network messages to communicate and synchronize
                   frequently. Unfortunately, the HPC community lacks a study
                   that explores different interconnect errors, congestion
                   events and applications characteristics on a large-scale HPC
                    system. In our previous work, we processed and analyzed
                    interconnect data of the Titan supercomputer to develop a
                    thorough understanding of interconnect faults, errors,
                   and congestion events. In this work, we first show how
                   congestion events can impact application performance. We
                    then investigate the interaction of application characteristics
                    with interconnect errors and network congestion to predict
                   applications encountering congestion with more than 90\%
                   accuracy",
  pts           = "153615"
}
@article{katti18epidemic,
  author        = "Amogh Katti
                   and Giuseppe Di Fatta
                   and Thomas Naughton
                   and Christian Engelmann",
  title         = "Epidemic Failure Detection and Consensus for Extreme
                   Parallelism",
  journal       = "\href{http://hpc.sagepub.com}{International Journal of High
                   Performance Computing Applications (IJHPCA)}",
  volume        = "32",
  number        = "5",
  pages         = "729--743",
  month         = sep,
  year          = "2018",
  publisher     = "\href{http://www.sagepub.com}{SAGE Publications}",
  issn          = "1094-3420",
  doi           = "10.1177/1094342017690910",
  url           = "http://www.christian-engelmann.info/publications/katti17epidemic.pdf",
  abstract      = "Future extreme-scale high-performance computing systems will
                   be required to work under frequent component failures. The
                   MPI Forum's User Level Failure Mitigation proposal has
                    introduced an operation, MPI\_Comm\_shrink, to synchronize the
                   alive processes on the list of failed processes, so that
                   applications can continue to execute even in the presence of
                   failures by adopting algorithm-based fault tolerance
                   techniques. This MPI Comm shrink operation requires a failure
                   detection and consensus algorithm. This paper presents three
                   novel failure detection and consensus algorithms using
                   Gossiping. The proposed algorithms were implemented and
                   tested using the Extreme-scale Simulator. The results show
                   that in all algorithms the number of Gossip cycles to achieve
                   global consensus scales logarithmically with system size. The
                   second algorithm also shows better scalability in terms of
                   memory and network bandwidth usage and a perfect
                   synchronization in achieving global consensus. The third
                   approach is a three-phase distributed failure detection and
                   consensus algorithm and provides consistency guarantees even
                   in very large and extreme-scale systems while at the same
                   time being memory and bandwidth efficient.",
  pts           = "72175"
}
@article{hukerikar17resilience,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Resilience Design Patterns: {A} Structured Approach to
                   Resilience at Extreme Scale",
  journal       = "\href{http://superfri.org/superfri}{Journal of
                   Supercomputing Frontiers and Innovations (JSFI)}",
  volume        = "4",
  number        = "3",
  pages         = "4--42",
  month         = oct,
  year          = "2017",
  publisher     = "\href{http://www.susu.ru/en}{South Ural State University
                   Chelyabinsk, Russia}",
  issn          = "2409-6008",
  doi           = "10.14529/jsfi170301",
  url           = "http://www.christian-engelmann.info/publications/hukerikar17resilience.pdf",
  abstract      = "Reliability is a serious concern for future extreme-scale
                   high-performance computing (HPC) systems. Projections based
                   on the current generation of HPC systems and technology
                   roadmaps suggest the prevalence of very high fault rates in
                   future systems. The errors resulting from these faults will
                   propagate and generate various kinds of failures, which may
                   result in outcomes ranging from result corruptions to
                   catastrophic application crashes. Therefore, the resilience
                   challenge for extreme-scale HPC systems requires management
                   of various hardware and software technologies that are
                   capable of handling a broad set of fault models at
                   accelerated fault rates. Also, due to practical limits on
                   power consumption in HPC systems future systems are likely
                   to embrace innovative architectures, increasing the levels
                   of hardware and software complexities. As a result, the
                   techniques that seek to improve resilience must navigate
                   the complex trade-off space between resilience and the
                   overheads to power consumption and performance. While the
                   HPC community has developed various resilience solutions,
                   application-level techniques as well as system-based
                   solutions, the solution space of HPC resilience
                   techniques remains fragmented. There are no formal methods
                   and metrics to investigate and evaluate resilience
                   holistically in HPC systems that consider impact scope,
                    handling coverage, and performance \& power efficiency
                   across the system stack. Additionally, few of the current
                   approaches are portable to newer architectures and software
                   environments that will be deployed on future systems.
                   In this paper, we develop a structured approach to the
                   management of HPC resilience using the concept of
                   resilience-based design patterns. A design pattern is a
                   general repeatable solution to a commonly occurring problem.
                   We identify the commonly occurring problems and solutions
                   used to deal with faults, errors and failures in HPC systems.
                   Each established solution is described in the form of a
                   pattern that addresses concrete problems in the design of
                   resilient systems. The complete catalog of resilience design
                   patterns provides designers with reusable design elements.
                   We also define a framework that enhances a designer's
                   understanding of the important constraints and opportunities
                   for the design patterns to be implemented and deployed at
                   various layers of the system stack. This design framework
                   may be used to establish mechanisms and interfaces to
                   coordinate flexible fault management across hardware and
                   software components. The framework also supports
                   optimization of the cost-benefit trade-offs among
                   performance, resilience, and power consumption. The overall
                   goal of this work is to enable a systematic methodology for
                   the design and evaluation of resilience technologies in
                   extreme-scale HPC systems that keep scientific applications
                   running to a correct solution in a timely and cost-efficient
                   manner despite frequent faults, errors, and failures of
                   various types.",
  pts           = "102201"
}
@article{engelmann16new,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "A New Deadlock Resolution Protocol and Message Matching
                   Algorithm for the Extreme-scale Simulator",
  journal       = "\href{http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1532-0634}
                   {Concurrency and Computation: Practice and Experience}",
  volume        = "28",
  number        = "12",
  pages         = "3369--3389",
  month         = aug,
  year          = "2016",
  publisher     = "\href{http://www.wiley.com}{John Wiley & Sons, Inc.}",
  issn          = "1532-0634",
  doi           = "10.1002/cpe.3805",
  url           = "http://www.christian-engelmann.info/publications/engelmann16new.pdf",
  abstract      = "Investigating the performance of parallel applications at
                   scale on future high-performance computing~(HPC) architectures
                   and the performance impact of different HPC architecture
                   choices is an important component of HPC hardware/software
                   co-design. The Extreme-scale Simulator (xSim) is a simulation
                   toolkit for investigating the performance of parallel
                   applications at scale. xSim scales to millions of simulated
                   Message Passing Interface (MPI) processes. The xSim toolkit
                   strives to limit simulation overheads in order to maintain
                   performance and productivity criteria. This paper documents
                   two improvements to xSim: (1)~a new deadlock resolution
                   protocol to reduce the parallel discrete event simulation
                   overhead, and (2)~a new simulated MPI message matching
                   algorithm to reduce the oversubscription management cost.
                   These enhancements resulted in significant performance
                   improvements. The simulation overhead for running the NAS
                   Parallel Benchmark suite dropped from 1,020\% to 238\% for
                   the conjugate gradient (CG) benchmark and 102\% to 0\% for
                   the embarrassingly parallel~(EP) benchmark. Additionally, the
                   improvements were beneficial for reducing overheads in the
                   highly accurate simulation mode of xSim, which is useful for
                   resilience investigation studies for tracking intentional MPI
                   process failures. In the highly accurate mode, the simulation
                   overhead was reduced from 37,511\% to 13,808\% for CG and
                   from 3,332\% to 204\% for EP.",
  pts           = "58541"
}
@article{snir14addressing,
  author        = "Marc Snir
                   and Robert W. Wisniewski
                   and Jacob A. Abraham
                   and Sarita V. Adve
                   and Saurabh Bagchi
                   and Pavan Balaji
                   and Jim Belak
                   and Pradip Bose
                   and Franck Cappello
                   and Bill Carlson
                   and Andrew A. Chien
                   and Paul Coteus
                   and Nathan A. Debardeleben
                   and Pedro Diniz
                   and Christian Engelmann
                   and Mattan Erez
                   and Saverio Fazzari
                   and Al Geist
                   and Rinku Gupta
                   and Fred Johnson
                   and Sriram Krishnamoorthy
                   and Sven Leyffer
                   and Dean Liberty
                   and Subhasish Mitra
                   and Todd Munson
                   and Rob Schreiber
                   and Jon Stearley
                   and Eric Van Hensbergen",
  title         = "Addressing Failures in Exascale Computing",
  journal       = "\href{http://hpc.sagepub.com}{International Journal of High
                   Performance Computing Applications (IJHPCA)}",
  volume        = "28",
  number        = "2",
  pages         = "127--171",
  month         = may,
  year          = "2014",
  publisher     = "\href{http://www.sagepub.com}{SAGE Publications}",
  issn          = "1094-3420",
  doi           = "10.1177/1094342014522573",
  url           = "http://www.christian-engelmann.info/publications/snir14addressing.pdf",
  abstract      = "We present here a report produced by a workshop on 
                   Addressing failures in exascale computing' held in Park City, 
                   Utah, 4-11 August 2012. The charter of this workshop was to 
                   establish a common taxonomy about resilience across all the 
                   levels in a computing system, discuss existing knowledge on 
                   resilience across the various hardware and software layers 
                   of an exascale system, and build on those results, examining 
                   potential solutions from both a hardware and software 
                   perspective and focusing on a combined approach.
                   The workshop brought together participants with expertise in 
                   applications, system software, and hardware; they came from 
                   industry, government, and academia, and their interests ranged 
                   from theory to implementation. The combination allowed broad 
                   and comprehensive discussions and led to this document, which 
                   summarizes and builds on those discussions.",
  pts           = "49208"
}
@article{engelmann13scaling,
  author        = "Christian Engelmann",
  title         = "Scaling To A Million Cores And Beyond: {Using} Light-Weight
                   Simulation to Understand The Challenges Ahead On The Road To
                   Exascale",
  journal       = "\href{http://www.elsevier.com/locate/fgcs}{Future Generation
                   Computer Systems (FGCS)}",
  volume        = "30",
  number        = "0",
  pages         = "59--65",
  month         = jan,
  year          = "2014",
  publisher     = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The
                   Netherlands}",
  issn          = "0167-739X",
  doi           = "10.1016/j.future.2013.04.014",
  url           = "http://www.christian-engelmann.info/publications/engelmann13scaling.pdf",
  abstract      = "As supercomputers scale to 1,000 PFlop/s over the next
                   decade, investigating the performance of parallel
                   applications at scale on future architectures and the
                   performance impact of different architecture choices for
                   high-performance computing (HPC) hardware/software co-design
                   is crucial. This paper summarizes recent efforts in designing
                   and implementing a novel HPC hardware/software co-design
                   toolkit. The presented Extreme-scale Simulator (xSim) permits
                   running an HPC application in a controlled environment with
                   millions of concurrent execution threads while observing its
                   performance in a simulated extreme-scale HPC system using
                   architectural models and virtual timing. This paper
                   demonstrates the capabilities and usefulness of the xSim
                   performance investigation toolkit, such as its scalability
                   to $2^{27}$ simulated Message Passing Interface (MPI) ranks
                   on 960 real processor cores, the capability to evaluate the
                   performance of different MPI collective communication
                   algorithms, and the ability to evaluate the performance of
                   a basic Monte Carlo application with different architectural
                   parameters.",
  pts           = "42452"
}
@article{wang12proactive,
  author        = "Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Proactive Process-Level Live Migration and Back Migration in
                   {HPC} Environments",
  journal       = "\href{http://www.elsevier.com/locate/jpdc}{Journal of
                   Parallel and Distributed Computing (JPDC)}",
  volume        = "72",
  number        = "2",
  pages         = "254--267",
  month         = feb,
  year          = "2012",
  publisher     = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The
                   Netherlands}",
  issn          = "0743-7315",
  doi           = "10.1016/j.jpdc.2011.10.009",
  url           = "http://www.christian-engelmann.info/publications/wang12proactive.pdf",
  abstract      = "As the number of nodes in high-performance computing
                   environments keeps increasing, faults are becoming common
                   place. Reactive fault tolerance (FT) often does not scale
                   due to massive I/O requirements and relies on manual job
                   resubmission.
                   This work complements reactive with proactive FT at the
                   process level. Through health monitoring, a subset of node
                   failures can be anticipated when one's health deteriorates.
                   A novel process-level live migration mechanism supports
                   continued execution of applications during much of process
                   migration. This scheme is integrated into an MPI execution
                   environment to transparently sustain health-inflicted node
                   failures, which eradicates the need to restart and requeue
                   MPI jobs. Experiments indicate that 1-6.5 s of prior warning
                   are required to successfully trigger live process migration
                   while similar operating system virtualization mechanisms
                   require 13-24 s. This self-healing approach complements
                   reactive FT by nearly cutting the number of checkpoints in
                   half when 70\% of the faults are handled proactively. The
                   work also provides a novel back migration approach to
                   eliminate load imbalance or bottlenecks caused by migrated
                   tasks. Experiments indicate the larger the amount of
                   outstanding execution, the higher the benefit due to back
                   migration.",
  pts           = "35627"
}
@article{scott10system,
  author        = "Stephen L. Scott
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Anand Tikotekar
                   and Christian Engelmann
                   and Hong H. Ong",
  title         = "System-Level Virtualization Research at {Oak Ridge National
                   Laboratory}",
  journal       = "\href{http://www.elsevier.com/locate/fgcs}{Future Generation
                   Computer Systems (FGCS)}",
  volume        = "26",
  number        = "3",
  pages         = "304--307",
  month         = mar,
  year          = "2010",
  publisher     = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The
                   Netherlands}",
  issn          = "0167-739X",
  doi           = "10.1016/j.future.2009.07.001",
  url           = "http://www.christian-engelmann.info/publications/scott09system.pdf",
  abstract      = "System-level virtualization is today enjoying a rebirth as a
                   technique to effectively share what were then considered
                   large computing resources to subsequently fade from the
                   spotlight as individual workstations gained in popularity
                   with a one machine -- one user approach. One reason for
                   this resurgence is that the simple workstation has grown in
                   capability to rival that of anything available in the past.
                   Thus, computing centers are again looking at the
                   price/performance benefit of sharing that single computing
                   box via server consolidation. However, industry is only
                   concentrating on the benefits of using virtualization for
                   server consolidation (enterprise computing) whereas our
                   interest is in leveraging virtualization to advance
                   high-performance computing (HPC). While these two interests
                   may appear to be orthogonal, one consolidating multiple
                   applications and users on a single machine while the other
                   requires all the power from many machines to be dedicated
                   solely to its purpose, we propose that virtualization does
                   provide attractive capabilities that may be exploited to the
                   benefit of HPC interests. This does raise the two fundamental
                   questions of: is the concept of virtualization (a machine
                   sharing technology) really suitable for HPC and if so,
                   how does one go about leveraging these virtualization
                   capabilities for the benefit of HPC. To address these
                   questions, this document presents ongoing studies on the
                   usage of system-level virtualization in a HPC context. These
                   studies include an analysis of the benefits of system-level
                   virtualization for HPC, a presentation of research efforts
                   based on virtualization for system availability, and a
                   presentation of research efforts for the management of
                   virtual systems. The basis for this document was material
                   presented by Stephen L. Scott at the Collaborative and Grid
                   Computing Technologies meeting held in Cancun, Mexico on
                   April 12-14, 2007.",
  pts           = "35628"
}
@article{he09symmetric,
  author        = "Xubin (Ben) He
                   and Li Ou
                   and Christian Engelmann
                   and Xin Chen
                   and Stephen L. Scott",
  title         = "Symmetric Active/Active Metadata Service for High
                   Availability Parallel File Systems",
  journal       = "\href{http://www.elsevier.com/locate/jpdc}{Journal of
                   Parallel and Distributed Computing (JPDC)}",
  volume        = "69",
  number        = "12",
  pages         = "961-973",
  month         = dec,
  year          = "2009",
  publisher     = "\href{http://www.elsevier.com}{Elsevier B.V, Amsterdam, The
                   Netherlands}",
  issn          = "0743-7315",
  doi           = "10.1016/j.jpdc.2009.08.004",
  url           = "http://www.christian-engelmann.info/publications/he09symmetric.pdf",
  abstract      = "High availability data storage systems are critical for many
                   applications as research and business become more
                   data-driven. Since metadata management is essential to
                   system availability, multiple metadata services are used to
                   improve the availability of distributed storage systems.
                   Past research focused on the active/standby model, where
                   each active service has at least one redundant idle backup.
                   However, interruption of service and even some loss of
                   service state may occur during a fail-over depending on the
                   used replication technique. In addition, the replication
                   overhead for multiple metadata services can be very high.
                   The research in this paper targets the symmetric
                   active/active replication model, which uses multiple
                   redundant service nodes running in virtual synchrony. In
                   this model, service node failures do not cause a fail-over
                   to a backup and there is no disruption of service or loss
                   of service state. We further discuss a fast delivery
                   protocol to reduce the latency of the needed total order
                   broadcast. Our prototype implementation shows that
                   metadata service high availability can be achieved with
                   an acceptable performance trade-off using our symmetric
                   active/active metadata service solution.",
  pts           = "21240"
}
@article{he07unified,
  author        = "Xubin (Ben) He
                   and Li Ou
                   and Martha J. Kosa
                   and Stephen L. Scott
                   and Christian Engelmann",
  title         = "A Unified Multiple-Level Cache for High Performance Cluster
                   Storage Systems",
  journal       = "\href{http://www.inderscience.com/browse/index.php?journalcode=ijhpcn}
                   {International Journal of High Performance Computing and
                   Networking (IJHPCN)}",
  volume        = "5",
  number        = "1-2",
  pages         = "97--109",
  month         = nov # "~14, ",
  year          = "2007",
  publisher     = "\href{http://www.inderscience.com}{Inderscience Publishers,
                   Geneve, Switzerland}",
  issn          = "1740-0562",
  doi           = "10.1504/IJHPCN.2007.015768",
  url           = "http://www.christian-engelmann.info/publications/he07unified.pdf",
  abstract      = "Highly available data storage for high-performance computing
                   is becoming increasingly more critical as high-end computing
                   systems scale up in size and storage systems are developed
                   around network-centered architectures. A promising solution
                   is to harness the collective storage potential of individual
                   workstations much as we harness idle CPU cycles due to the
                   excellent price/performance ratio and low storage usage of
                   most commodity workstations. For such a storage system,
                   metadata consistency is a key issue assuring storage system
                   availability as well as data reliability. In this paper, we
                   present a decentralized metadata management scheme that
                   improves storage availability without sacrificing
                   performance.",
  pts           = "1907"
}
@article{engelmann06symmetric,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "Symmetric Active/Active High Availability for
                   High-Performance Computing System Services",
  journal       = "\href{http://www.jcomputers.us}{Journal of Computers (JCP)}",
  volume        = "1",
  number        = "8",
  pages         = "43--54",
  month         = dec,
  year          = "2006",
  publisher     = "\href{http://www.jcomputers.us}{Academy Publisher,
                   Oulu, Finland}",
  issn          = "1796-203X",
  doi           = "10.4304/jcp.1.8.43-54",
  url           = "http://www.christian-engelmann.info/publications/engelmann06symmetric.pdf",
  abstract      = "This work aims to pave the way for high availability in
                   high-performance computing (HPC) by focusing on efficient
                   redundancy strategies for head and service nodes. These nodes
                   represent single points of failure and control for an entire
                   HPC system as they render it inaccessible and unmanageable in
                   case of a failure until repair. The presented approach
                   introduces two distinct replication methods, internal and
                   external, for providing symmetric active/active high
                   availability for multiple redundant head and service nodes
                   running in virtual synchrony utilizing an existing process
                   group communication system for service group membership
                   management and reliable, totally ordered message delivery.
                    Presented results of a prototype implementation that offers
                   symmetric active/active replication for HPC job and resource
                   management using external replication show that the highest
                   level of availability can be provided with an acceptable
                   performance trade-off.",
  pts           = "4583"
}
@article{engelmann06molar,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and David E. Bernholdt
                   and Narasimha R. Gottumukkala
                   and Chokchai (Box) Leangsuksun
                   and Jyothish Varma
                   and Chao Wang
                   and Frank Mueller
                   and Aniruddha G. Shet
                   and Ponnuswamy (Saday) Sadayappan",
  title         = "{MOLAR}: {A}daptive Runtime Support for High-End Computing
                   Operating and Runtime Systems",
  journal       = "\href{http://www.sigops.org/osr.html}{ACM SIGOPS Operating
                   Systems Review (OSR)}",
  volume        = "40",
  number        = "2",
  pages         = "63--72",
  month         = apr,
  year          = "2006",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  issn          = "0163-5980",
  doi           = "10.1145/1131322.1131337",
  url           = "http://www.christian-engelmann.info/publications/engelmann06molar.pdf",
  abstract      = "MOLAR is a multi-institutional research effort that
                   concentrates on adaptive, reliable, and efficient operating
                   and runtime system (OS/R) solutions for ultra-scale,
                   high-end scientific computing on the next generation of
                   supercomputers. This research addresses the challenges
                   outlined in FAST-OS (forum to address scalable technology for
                   runtime and operating systems) and HECRTF (high-end computing
                   revitalization task force) activities by exploring the use of
                   advanced monitoring and adaptation to improve application
                   performance and predictability of system interruptions, and
                   by advancing computer reliability, availability and
                   serviceability (RAS) management systems to work cooperatively
                   with the OS/R to identify and preemptively resolve system
                   issues. This paper describes recent research of the MOLAR
                   team in advancing RAS for high-end computing OS/Rs.",
  pts           = "1905"
}
@conference{oles24understanding,
  author        = "Vladyslav Oles
                   and Anna Schmedding
                   and George Ostrouchov
                    and Woong Shin
                   and Evgenia Smirni
                   and Christian Engelmann",
  title         = "Understanding {GPU} Memory Corruption at Extreme Scale: The
                   Summit Case Study",
  booktitle     = "Proceedings of the \href{https://ics2024.github.io/}
                   {$38^{th}$ ACM International Conference on Supercomputing
                    (ICS) 2024}",
  pages         = "188-200",
  month         = jun # "~4-7, ",
  year          = "2024",
  address       = "Kyoto, Japan",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "979-8-4007-0610-3",
  doi           = "10.1145/3650200.3656615",
  url           = "http://www.christian-engelmann.info/publications/oles24understanding.pdf",
  url2          = "http://www.christian-engelmann.info/publications/oles24understanding.ppt.pdf",
  abstract      = "GPU memory corruption and in particular double- bit errors
                  (DBEs) remain one of the least understood aspects of HPC
                  system reliability. Albeit rare, their occurrences always
                  lead to job termination and can potentially cost thousands of
                  node-hours, either from wasted computations or as the
                  overhead from regular checkpointing needed to minimize the
                  losses. As supercomputers and their components simultaneously
                  grow in scale, density, failure rates, and environmental
                  footprint, the efficiency of HPC operations becomes both an
                  imperative and a challenge.
                  We examine DBEs using system telemetry data and logs collected
                  from the Summit supercomputer, equipped with 27,648 Tesla V100
                  GPUs with 2nd-generation high-bandwidth memory (HBM2). Using
                  exploratory data analysis and statistical learning, we extract
                  several insights about memory reliability in such GPUs. We
                  find that GPUs with prior DBE occurrences are prone to
                  experience them again due to otherwise harmless factors,
                  correlate this phenomenon with GPU placement, and suggest
                  manufacturing variability as a factor. On the general
                  population of GPUs, we link DBEs to short- and long-term high
                  power consumption modes while finding no significant
                  correlation with higher temperatures. We also show that
                  workload type can be a factor in GPU memory’s propensity to
                  corruption.",
  pts           = "212442"
}
@conference{engelmann23science,
  author        = "Christian Engelmann
                   and Suhas Somnath",
  title         = "Science Use Case Design Patterns for Autonomous Experiments",
  booktitle     = "Proceedings of the \href{http://europlop.net}
                   {$28^{th}$ European Conference on Pattern Languages of
                   Programs (EuroPLoP) 2023}",
  pages         = "1-14",
  month         = jul # "~5-9, ",
  year          = "2023",
  address       = "Kloster Irsee, Germany",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "979-8-4007-0040-8",
  doi           = "10.1145/3628034.3628060",
  url           = "http://www.christian-engelmann.info/publications/engelmann23science.pdf",
  abstract      = "Connecting scientific instruments and robot-controlled
                   laboratories with computing and data resources at the edge,
                   the Cloud or the high-performance computing (HPC) center
                   enables autonomous experiments, self-driving laboratories,
                   smart manufacturing, and artificial intelligence (AI)-driven
                   design, discovery and evaluation. The Self-driven Experiments
                   for Science / Interconnected Science Ecosystem (INTERSECT)
                    Open Architecture enables science breakthroughs using
                   intelligent networked systems, instruments and facilities
                   with a federated hardware/software architecture for the
                   laboratory of the future. It relies on a novel approach,
                   consisting of (1) science use case design patterns, (2) a
                   system of systems architecture, and (3) a microservice
                   architecture. This paper introduces the science use case
                   design patterns of the INTERSECT Architecture. It describes
                   the overall background, the involved terminology and concepts,
                   and the pattern format and classification. It further offers
                   an overview of the 12 defined patterns and 4 examples of
                   patterns of 2 different pattern classes. It also provides
                   insight into building solutions from these patterns. The
                   target audience are computer, computational, instrument and
                   domain science experts working in the field of autonomous
                   experiments.",
  pts           = "200749"
}
@conference{engelmann22intersect,
  author        = "Christian Engelmann
                   and Olga Kuchar
                   and Swen Boehm
                   and Michael J. Brim
                   and Thomas Naughton
                   and Suhas Somnath
                   and Scott Atchley
                   and Jack Lange
                   and Ben Mintz
                   and Elke Arenholz",
  title         = "The {INTERSECT} Open Federated Architecture for the
                   Laboratory of the Future",
  booktitle     = "Communications in Computer and Information Science (CCIS):
                   Accelerating Science and Engineering Discoveries Through
                   Integrated Research Infrastructure for Experiment, Big Data,
                   Modeling and Simulation.
                   \href{https://smc.ornl.gov}{$18^{th}$ Smoky Mountains
                    Computational Sciences \& Engineering Conference (SMC)
                   2022}",
  volume        = "1690",
  pages         = "173--190",
  month         = aug # "~24-25, ",
  year          = "2022",
  publisher     = "\href{http://www.springer.com}{Springer, Cham}",
  isbn          = "978-3-031-23605-1",
  doi           = "10.1007/978-3-031-23606-8_11",
  url           = "http://www.christian-engelmann.info/publications/engelmann22intersect.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann22intersect.ppt.pdf",
  abstract      = "A federated instrument-to-edge-to-center architecture is
                   needed to autonomously collect, transfer, store, process,
                   curate, and archive scientific data and reduce
                   human-in-the-loop needs with (a) common interfaces to
                   leverage community and custom software, (b) pluggability to
                   permit adaptable solutions, reuse, and digital twins, and (c)
                   an open standard to enable adoption by science facilities
                   world-wide. The INTERSECT Open Architecture enables science
                   breakthroughs using intelligent networked systems,
                   instruments and facilities with autonomous experiments,
                   ``self-driving'' laboratories, smart manufacturing and
                    artificial intelligence (AI)-driven design, discovery and
                    evaluation. It creates
                   an open federated architecture for the laboratory of the
                   future using a novel approach, consisting of (1) science use
                   case design patterns, (2) a system of systems architecture,
                   and (3) a microservice architecture.",
  pts           = "182854"
}
@conference{hukerikar20plexus,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "{PLEXUS}: {A} Pattern-Oriented Runtime System Architecture
                   for Resilient Extreme-Scale High-Performance Computing
                   Systems",
  booktitle     = "Proceedings of the
                   \href{http://prdc.dependability.org/PRDC2020}
                   {$25^{th}$ IEEE Pacific Rim International Symposium on
                    Dependable Computing (PRDC) 2020}",
  pages         = "31--39",
  month         = dec # "~1-4, ",
  year          = "2020",
  address       = "Perth, Australia",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "1555-094X",
  isbn          = "978-1-7281-8004-5",
  doi           = "10.1109/PRDC50213.2020.00014",
  url           = "http://www.christian-engelmann.info/publications/hukerikar20plexus.pdf",
  abstract      = "For high-performance computing (HPC) system designers and
                   users, meeting the myriad challenges of next-generation
                   exascale supercomputing systems requires rethinking their
                   approach to application and system software design. Among
                   these challenges, providing resiliency and stability to the
                   scientific applications in the presence of high fault rates
                   requires new approaches to software architecture and design.
                   As HPC systems become increasingly complex, they require
                   intricate solutions for detection and mitigation for various
                   modes of faults and errors that occur in these large-scale
                   systems, as well as solutions for failure recovery. These
                   resiliency solutions often interact with and affect other
                   system properties, including application scalability, power
                   and energy efficiency. Therefore, resilience solutions for
                   HPC systems must be thoughtfully engineered and deployed.
                   In previous work, we developed the concept of resilience
                   design patterns, which consist of templated solutions based
                   on well-established techniques for detection, mitigation
                   and recovery. In this paper, we use these patterns as the
                   foundation to propose new approaches to designing runtime
                   systems for HPC systems. The instantiation of these
                   patterns within a runtime system enables flexible and
                   adaptable end-to-end resiliency solutions for HPC
                   environments. The paper describes the architecture of the
                   runtime system, named Plexus, and the strategies for
                   dynamically composing and adapting pattern instances under
                   runtime control. This runtime-based approach enables
                   actively balancing the cost-benefit trade-off between
                   performance overhead and protection coverage of the
                   resilience solutions. Based on a prototype implementation
                   of PLEXUS, we demonstrate the resiliency and performance
                   gains achieved by the pattern-based runtime system for a
                   parallel linear solver application.",
  pts           = "147029"
}
@conference{ostrouchov20gpu,
  author        = "George Ostrouchov
                   and Don Maxwell
                   and Rizwan Ashraf
                   and Christian Engelmann
                   and Mallikarjun Shankar
                   and James Rogers",
  title         = "{GPU} Lifetimes on {Titan} Supercomputer: {Survival} Analysis
                   and Reliability",
  booktitle     = "Proceedings of the
                   \href{http://sc20.supercomputing.org}{$33^{rd}$ IEEE/ACM
                   International Conference on High Performance Computing,
                   Networking, Storage and Analysis (SC) 2020}",
  pages         = "41:1--14",
  month         = nov # "~15-20, ",
  year          = "2020",
  address       = "Atlanta, GA, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "9781728199986",
  doi           = "10.1109/SC41405.2020.00045",
  url           = "http://www.christian-engelmann.info/publications/ostrouchov20gpu.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ostrouchov20gpu.ppt.pdf",
  abstract      = "The Cray XK7 Titan was the top supercomputer system in the
                   world for a very long time and remained critically important
                   throughout its nearly seven year life. It was also a very
                   interesting machine from a reliability viewpoint as most of
                   its power came from 18,688 GPUs whose operation was forced to
                   execute three very significant rework cycles, two on the GPU
                   mechanical assembly and one on the GPU circuitboards. We
                   write about the last rework cycle and a reliability analysis
                   of over 100,000 operation years in the GPU lifetimes, which
                   correspond to Titan's 6 year long productive period after an
                   initial break-in period. Using time between failures analysis
                   and statistical survival analysis techniques, we find that
                   GPU reliability is dependent on heat dissipation to an extent
                   that strongly correlates with detailed nuances of the system
                   cooling architecture and job scheduling. In addition to
                   describing some of the system history, the data collection,
                   data cleaning, and our analysis of the data, we provide
                   reliability recommendations for designing future state of the
                   art supercomputing systems and their operation. We make the
                   data and our analysis codes publicly available.",
  pts           = "144470"
}
@conference{jeong203d,
  author        = "Haewon Jeong
                   and Yaoqing Yang
                   and Christian Engelmann
                   and Vipul Gupta
                   and Tze Meng Low
                   and Pulkit Grover
                   and Viveck Cadambe
                   and Kannan Ramchandran",
  title         = "{3D} Coded {SUMMA}: {C}ommunication-Efficient and Robust
                   Parallel Matrix Multiplication",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{https://www.euro-par.org}{$26^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2020}",
  volume        = "12247",
  pages         = "392--407",
  month         = aug # "~24-28, ",
  year          = "2020",
  address       = "Warsaw, Poland",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-030-57674-5",
  doi           = "10.1007/978-3-030-57675-2_25",
  url           = "http://www.christian-engelmann.info/publications/jeong203d.pdf",
  url2          = "http://www.christian-engelmann.info/publications/jeong203d.ppt.pdf",
  abstract      = "In this paper, we propose a novel fault-tolerant parallel
                   matrix multiplication algorithm called 3D Coded SUMMA that is
                   communication efficient and achieves higher failure-tolerance
                   than replication-based schemes for the same amount of
                   redundancy. This work bridges the gap between recent
                   developments in coded computing and fault-tolerance in
                   high-performance computing (HPC). The core idea of coded
                   computing is the same as algorithm-based fault-tolerance
                   (ABFT), which is weaving redundancy in the computation using
                   error-correcting codes. In particular, we show that MatDot
                   codes, an innovative code construction for distributed matrix
                   multiplications, can be integrated into three-dimensional
                   SUMMA (Scalable Universal Matrix Multiplication Algorithm) in
                   a communication-avoiding manner. To tolerate any two node
                   failures, the proposed 3D Coded SUMMA requires 50\% less
                   redundancy than replication, while the overhead in execution
                   time is only about 5-10\%.",
  pts           = "140756"
}
@conference{kumar18understanding,
  author        = "Mohit Kumar
                   and Saurabh Gupta
                   and Tirthak Patel
                   and Michael Wilder
                   and Weisong Shi
                   and Song Fu
                   and Christian Engelmann
                   and Devesh Tiwari",
  title         = "Understanding and Analyzing Interconnect Errors and Network
                   Congestion on a Large Scale {HPC} System",
  booktitle     = "Proceedings of the \href{http://www.dsn.org}
                   {$48^{th}$ IEEE/IFIP International Conference on Dependable
                    Systems and Networks (DSN) 2018}",
  pages         = "107--114",
  month         = jun # "~25-28, ",
  year          = "2018",
  address       = "Luxembourg City, Luxembourg",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "2158-3927",
  isbn          = "978-1-5386-5596-2",
  doi           = "10.1109/DSN.2018.00023",
  url           = "http://www.christian-engelmann.info/publications/kumar18understanding.pdf",
  abstract      = "Today's High Performance Computing (HPC) systems are capable
                   of delivering performance on the order of petaflops due to
                   fast computing devices, network interconnect, and
                   back-end storage systems. In particular, interconnect
                   resilience and congestion resolution methods have a major
                   impact on the overall interconnect and application
                   performance. This is especially true for scientific
                   applications running multiple processes on different compute
                   nodes as they rely on fast network messages to communicate
                   and synchronize frequently. Unfortunately, the HPC community
                   lacks state-of-practice experience reports that detail how
                   different interconnect errors and congestion events occur
                   on large-scale HPC systems. Therefore, in this paper, we
                   process and analyze interconnect data of the Titan
                   supercomputer to develop a thorough understanding of
                   interconnect faults, errors, and congestion events. We also
                   study the interaction between interconnect errors, network
                   congestion, and application characteristics.",
  pts           = "110648"
}
@conference{nie18machine,
  author        = "Bin Nie
                   and Ji Xue
                   and Saurabh Gupta
                   and Tirthak Patel
                   and Christian Engelmann
                   and Evgenia Smirni
                   and Devesh Tiwari",
  title         = "Machine Learning Models for {GPU} Error Prediction in a Large
                   Scale {HPC} System",
  booktitle     = "Proceedings of the \href{http://www.dsn.org}
                   {$48^{th}$ IEEE/IFIP International Conference on Dependable
                    Systems and Networks (DSN) 2018}",
  pages         = "95--106",
  month         = jun # "~25-28, ",
  year          = "2018",
  address       = "Luxembourg City, Luxembourg",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "2158-3927",
  isbn          = "978-1-5386-5596-2",
  doi           = "10.1109/DSN.2018.00022",
  url           = "http://www.christian-engelmann.info/publications/nie18machine.pdf",
  abstract      = "Recently, GPUs have been widely deployed on large-scale HPC
                   systems to provide powerful computational capability for
                   scientific applications from various domains. As those
                   applications are normally long-running, investigating the
                   characteristics of GPU errors becomes imperative. Therefore,
                   in this paper, we first study the conditions that trigger
                   GPU errors with six-month trace data collected from a
                   large-scale operational HPC system. Then, we resort to
                   machine learning techniques to predict the occurrence of
                   GPU errors, by taking advantage of the temporal and spatial
                   dependency of the collected data. As discussed in the
                   evaluation section, the prediction framework is robust and
                   accurate under different workloads.",
  pts           = "110650"
}
@conference{ashraf18pattern-based,
  author        = "Rizwan Ashraf
                   and Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Pattern-based Modeling of Multiresilience Solutions for
                   High-Performance Computing",
  booktitle     = "Proceedings of the \href{http://icpe2018.spec.org}{$9^{th}$
                   ACM/SPEC International Conference on Performance Engineering
                   (ICPE) 2018}",
  pages         = "80--87",
  month         = apr # "~9-13, ",
  year          = "2018",
  address       = "Berlin, Germany",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-5095-2",
  doi           = "10.1145/3184407.3184421",
  url           = "http://www.christian-engelmann.info/publications/ashraf18pattern-based.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ashraf18pattern-based.ppt.pdf",
  abstract      = "Resiliency is the ability of large-scale high-performance
                   computing (HPC) applications to gracefully handle different
                   types of errors, and recover from failures. In this paper, we
                   propose a pattern-based approach to constructing
                   multiresilience solutions. Using resilience patterns, we
                   evaluate the performance and reliability characteristics of
                   detection, containment and mitigation techniques for transient
                   errors that cause silent data corruptions and techniques for
                   fail-stop errors that result in process failures. We
                   demonstrate the design and implementation of the resilience
                   techniques across multiple layers of the system stack such
                   that they are integrated to work together to achieve
                   resiliency to different error types in a highly
                   performance-efficient manner.",
  pts           = "109667"
}
@conference{ashraf18shrink,
  author        = "Rizwan Ashraf
                   and Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Shrink or Substitute: {H}andling Process Failures in {HPC}
                   Systems using In-situ Recovery",
  booktitle     = "Proceedings of the \href{http://www.pdp2018.org}{$26^{th}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2018}",
  pages         = "178--185",
  month         = mar # "~21-23, ",
  year          = "2018",
  address       = "Cambridge, UK",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "2377-5750",
  isbn          = "978-1-5386-4975-6",
  doi           = "10.1109/PDP2018.2018.00032",
  url           = "http://www.christian-engelmann.info/publications/ashraf18shrink.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ashraf18shrink.ppt.pdf",
  abstract      = "Efficient utilization of today's high-performance computing
                   (HPC) systems with many, complex software and hardware
                   components requires that the HPC applications are designed to
                   tolerate process failures at runtime. With low
                   mean-time-to-failure (MTTF) of current and future HPC
                   systems, long-running simulations on these systems require
                   capabilities for gracefully handling process failures by the
                   applications themselves. In this paper, we explore the use of
                   fault tolerance extensions to Message Passing Interface (MPI)
                   called user-level failure mitigation (ULFM) for handling
                   process failures without the need to discard the progress
                   made by the application. We explore two alternative recovery
                   strategies, which use ULFM along with application-driven
                   in-memory checkpointing. In the first case, the application
                   is recovered with only the surviving processes, and in the
                   second case, spares are used to replace the failed processes,
                   such that the original configuration of the application is
                   restored. Our experimental results demonstrate that graceful
                   degradation is a viable alternative for recovery in
                   environments where spares may not be available.",
  pts           = "107422"
}
@conference{gupta17failures,
  author        = "Saurabh Gupta
                   and Tirthak Patel
                   and Christian Engelmann
                   and Devesh Tiwari",
  title         = "Failures in Large Scale Systems: {L}ong-term Measurement,
                   Analysis, and Implications",
  booktitle     = "Proceedings of the
                   \href{http://sc17.supercomputing.org}{$30^{th}$ IEEE/ACM
                   International Conference on High Performance Computing,
                   Networking, Storage and Analysis (SC) 2017}",
  pages         = "44:1--44:12",
  month         = nov # "~12-17, ",
  year          = "2017",
  address       = "Denver, CO, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-5114-0",
  doi           = "10.1145/3126908.3126937",
  url           = "http://www.christian-engelmann.info/publications/gupta17failures.pdf",
  url2          = "http://www.christian-engelmann.info/publications/gupta17failures.ppt.pdf",
  abstract      = "Resilience is one of the key challenges in maintaining high
                   efficiency of future extreme scale supercomputers.
                   Unfortunately, field-data-based reliability studies are few
                   and far between and not exhaustive. Most HPC researchers and
                   system practitioners still rely on outdated studies to
                   understand HPC reliability characteristics and plan for
                   future HPC systems. While the complexity of managing system
                   reliability has increased, the public knowledge sharing about
                   lessons learned from HPC centers has not increased in the
                   same proportion. To bridge this gap, in this work, we compare
                   and contrast the reliability characteristics of multiple
                   large-scale HPC production systems, and discuss new
                   take-aways and confirm previous findings which continue to be
                   valid.",
  pts           = "100355"
}
@conference{nie17characterizing,
  author        = "Bin Nie
                   and Ji Xue
                   and Saurabh Gupta
                   and Christian Engelmann
                   and Evgenia Smirni
                   and Devesh Tiwari",
  title         = "Characterizing Temperature, Power, and Soft-Error Behaviors
                   in Data Center Systems: {I}nsights, Challenges, and
                   Opportunities",
  booktitle     = "Proceedings of the \href{http://mascots2017.cs.ucalgary.ca}
                   {$25^{th}$ IEEE International Symposium on the Modeling,
                   Analysis, and Simulation of Computer and Telecommunication
                   Systems (MASCOTS) 2017}",
  pages         = "22--31",
  month         = sep # "~20-22, ",
  year          = "2017",
  address       = "Banff, AB, Canada",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "2375-0227",
  isbn          = "978-1-5386-2764-8",
  doi           = "10.1109/MASCOTS.2017.12",
  url           = "http://www.christian-engelmann.info/publications/nie17characterizing.pdf",
  url2          = "",
  abstract      = "GPUs have become part of the mainstream high performance
                   computing facilities that increasingly require more
                   computational power to simulate physical phenomena quickly
                   and accurately. However, GPU nodes also consume significantly
                   more power than traditional CPU nodes, and high power
                   consumption introduces new system operation challenges,
                   including increased temperature, power/cooling cost, and
                   lower system reliability. This paper explores how power
                   consumption and temperature characteristics affect
                   reliability, provides insights into what are the implications
                   of such understanding, and how to exploit these insights
                   toward predicting GPU errors using neural networks.",
  pts           = "100351"
}
@conference{hukerikar17pattern,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "A Pattern Language for High-Performance Computing
                   Resilience",
  booktitle     = "Proceedings of the \href{http://europlop.net}
                   {$22^{nd}$ European Conference on Pattern Languages of
                   Programs (EuroPLoP) 2017}",
  pages         = "12:1--12:16",
  month         = jul # "~12-16, ",
  year          = "2017",
  address       = "Kloster Irsee, Germany",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-4848-5",
  doi           = "10.1145/3147704.3147718",
  url           = "http://www.christian-engelmann.info/publications/hukerikar17pattern.pdf",
  abstract      = "High-performance computing (HPC) systems provide powerful
                   capabilities for modeling and simulation, and data analytics
                   in a broad class of computational problems in a variety of
                   scientific and engineering domains. HPC designs are
                   undergoing rapid changes in the hardware architectures and
                   the software environment as the community pursues
                   increasingly capable HPC systems. Among the key challenges
                   for future generations of HPC systems is ensuring
                   efficient and correct operation despite the occurrence of
                   faults or defects in system components that can cause errors
                   and failures in an HPC system. Such events affect the
                   correctness of the scientific applications, or may lead to
                   their untimely termination. Future generations of HPC systems
                   will consist of millions of compute, memory and storage
                   components and the growing complexity of these computing
                   behemoths increases the chances that a single fault event will
                   cascade across the machine and bring down the entire system.
                   Design patterns capture the essential techniques that are
                   employed to solve recurring problems in the design of
                   resilient computing systems. However, the complexity of
                   modern HPC systems as well as the various challenges of
                   future generations of systems requires consideration of
                   numerous aspects and optimization principles, such as the
                   impact of a resilience solution on the performance and
                   power consumption. We present a pattern language for
                   engineering resilience solutions. The language is targeted
                   at hardware and software designers as well as the users and
                   operators of HPC systems. The patterns are intended to
                   support the development of complete resilience solutions with
                   different efficiency and complexity characteristics, which may be
                   deployed at design time or runtime to ensure that HPC systems
                   are able to deal with various types of faults, errors and
                   failures.",
  pts           = "102869"
}
@conference{lagadapati16benchmark,
  author        = "Mahesh Lagadapati
                   and Frank Mueller
                   and Christian Engelmann",
  title         = "Benchmark Generation and Simulation at Extreme Scale",
  booktitle     = "Proceedings of the \href{http://ds-rt.com/2016}{$20^{th}$
                   IEEE/ACM International Symposium on Distributed Simulation
                   and Real Time Applications (DS-RT) 2016}",
  pages         = "9--18",
  month         = sep # "~21-23, ",
  year          = "2016",
  address       = "London, UK",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "1550-6525",
  isbn          = "978-1-5090-3506-9",
  doi           = "10.1109/DS-RT.2016.18",
  url           = "http://www.christian-engelmann.info/publications/lagadapati16benchmark.pdf",
  url2          = "http://www.christian-engelmann.info/publications/lagadapati16benchmark.ppt.pdf",
  abstract      = "The path to extreme scale high-performance computing (HPC)
                   poses several challenges related to power, performance,
                   resilience, productivity, programmability, data movement, and
                   data management. Investigating the performance of parallel
                   applications at scale on future architectures and the
                   performance impact of different architectural choices is an
                   important component of HPC hardware/software co-design.
                   Simulations using models of future HPC systems and
                   communication traces from applications running on existing
                   HPC systems can offer an insight into the performance of
                   future architectures. This work targets technology developed
                   for scalable application tracing of communication events. It
                   focuses on extreme-scale simulation of HPC applications and
                   their communication behavior via lightweight parallel
                   discrete event simulation for performance estimation and
                   evaluation. Instead of simply replaying a trace within a
                   simulator, this work promotes the generation of a benchmark
                   from traces. This benchmark is subsequently exposed to
                   simulation using models to reflect the performance
                   characteristics of future-generation HPC systems. This
                   technique provides a number of benefits, such as eliminating
                   the data intensive trace replay and enabling simulations at
                   different scales. The presented work features novel software
                   co-design aspects, combining the ScalaTrace tool to generate
                   scalable trace files, the ScalaBenchGen tool to generate the
                   benchmark, and the xSim tool to assess the benchmark
                   characteristics within a simulator.",
  pts           = "68383"
}
@conference{hukerikar16havens,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "{Havens}: {Explicit} Reliable Memory Regions for {HPC}
                   Applications",
  booktitle     = "Proceedings of the \href{http://ieee-hpec.org}
                   {$20^{th}$ IEEE High Performance Extreme Computing
                   Conference (HPEC) 2016}",
  pages         = "1--6",
  month         = sep # "~13-15, ",
  year          = "2016",
  address       = "Waltham, MA, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  doi           = "10.1109/HPEC.2016.7761593",
  url           = "http://www.christian-engelmann.info/publications/hukerikar16havens.pdf",
  url2          = "http://www.christian-engelmann.info/publications/hukerikar16havens.ppt.pdf",
  abstract      = "Supporting error resilience in future exascale-class
                   supercomputing systems is a critical challenge. Due to
                   transistor scaling trends and increasing memory density,
                   scientific simulations are expected to experience more
                   interruptions caused by soft errors in the system memory.
                   Existing hardware-based detection and recovery techniques
                   will be inadequate in the presence of high memory fault
                   rates.
                   In this paper we propose a partial memory protection scheme
                   using region-based memory management. We define regions
                   called havens that provide fault protection for program
                   objects. We provide reliability for the regions through a
                   software-based parity protection mechanism. Our approach
                   enables critical application code and variables to be placed
                   in these havens. The fault coverage of our approach is
                   application agnostic unlike algorithm-based fault tolerance
                   techniques.",
  pts           = "69230"
}
@conference{tang16power-capping,
  author        = "Kun Tang
                   and Devesh Tiwari
                   and Saurabh Gupta
                   and Ping Huang
                   and QiQi Lu
                   and Christian Engelmann
                   and Xubin He",
  title         = "Power-Capping Aware Checkpointing: {On} the Interplay Among
                   Power-Capping, Temperature, Reliability, Performance, and
                   Energy",
  booktitle     = "Proceedings of the \href{http://www.dsn.org}
                   {$46^{th}$ IEEE/IFIP International Conference on Dependable
                    Systems and Networks (DSN) 2016}",
  pages         = "311--322",
  month         = jun # "~28 - " # jul # "~1, ",
  year          = "2016",
  address       = "Toulouse, France",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "2158-3927",
  doi           = "10.1109/DSN.2016.36",
  url           = "http://www.christian-engelmann.info/publications/tang16power-aware.pdf",
  url2          = "",
  abstract      = "Checkpoint and restart mechanisms have been widely used in
                   large scientific simulation applications to make forward
                   progress in case of failures. However, none of the prior
                   works have considered the interaction of power constraints
                   with temperature, reliability, performance, and checkpointing
                   interval. It is not clear how power-capping may affect the
                   optimal checkpointing interval. What are the involved
                   reliability, performance, and energy trade-offs? In this
                   paper, we develop a deep understanding about the interaction
                   between power-capping and scientific applications using
                   checkpoint/restart as resilience mechanism, and propose a
                   new model for the optimal checkpointing interval (OCI) under
                   power-capping. Our study reveals several interesting, and
                   previously unknown, insights about how power-capping affects
                   reliability, energy consumption, and performance.",
  pts           = "62738"
}
@conference{fiala16mini-ckpts,
  author        = "David Fiala
                   and Frank Mueller
                   and Kurt Ferreira
                   and Christian Engelmann",
  title         = "{Mini-Ckpts}: Surviving {OS} Failures in Persistent Memory",
  booktitle     = "Proceedings of the \href{http://ics16.bilkent.edu.tr}
                   {$30^{th}$ ACM International Conference on Supercomputing
                    (ICS) 2016}",
  pages         = "7:1--7:14",
  month         = jun # "~1-3, ",
  year          = "2016",
  address       = "Istanbul, Turkey",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-4361-9",
  doi           = "10.1145/2925426.2926295",
  url           = "http://www.christian-engelmann.info/publications/fiala16mini-ckpts.pdf",
  url2          = "http://www.christian-engelmann.info/publications/fiala16mini-ckpts.ppt.pdf",
  abstract      = "Concern is growing in the high-performance computing (HPC)
                   community on the reliability of future extreme-scale systems.
                   Current efforts have focused on application fault-tolerance
                   rather than the operating system (OS), despite the fact that
                   recent studies have suggested that failures in OS memory are
                   more likely. The OS is critical to the correct and
                   efficient operation of the node and the processes it
                   governs --- and in HPC also of any other nodes a parallelized
                   application runs on and communicates with: Any single node
                   failure generally forces all processes of this application
                   to terminate due to tight communication in HPC. Therefore,
                   the OS itself must be capable of tolerating failures. In
                   this work, we introduce mini-ckpts, a framework which
                   enables application survival despite the occurrence of a
                   fatal OS failure or crash. Mini-ckpts achieves this
                   tolerance by ensuring that the critical data describing a
                   process is preserved in persistent memory prior to the
                   failure. Following the failure, the OS is rejuvenated via
                   a warm reboot and the application continues execution
                   effectively making the failure and restart transparent. The
                   mini-ckpts rejuvenation and recovery process is measured to
                   take between three and six seconds and has a failure-free
                   overhead of 3-5\% for a number of key HPC workloads.
                   In contrast to current fault-tolerance methods, this work
                   ensures that the operating and runtime system can continue in
                   the presence of faults. This is a much finer-grained and
                   dynamic method of fault-tolerance than the current,
                   coarse-grained, application-centric methods. Handling faults
                   at this level has the potential to greatly reduce overheads
                   and enables mitigation of additional fault scenarios.",
  pts           = "67816"
}
@conference{bautista-gomez16reducing,
  author        = "Leonardo Bautista-Gomez
                   and Ana Gainaru
                   and Swann Perarnau
                   and Devesh Tiwari
                   and Saurabh Gupta
                   and Franck Cappello
                   and Christian Engelmann
                   and Marc Snir",
  title         = "Reducing Waste in Extreme Scale Systems Through Introspective
                   Analysis",
  booktitle     = "Proceedings of the \href{http://www.ipdps.org}
                   {$30^{th}$ IEEE International Parallel and Distributed
                   Processing Symposium (IPDPS) 2016}",
  pages         = "212--221",
  month         = may # "~23-27, ",
  year          = "2016",
  address       = "Chicago, IL, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "1530-2075",
  doi           = "10.1109/IPDPS.2016.100",
  url           = "http://www.christian-engelmann.info/publications/bautista-gomez16reducing.pdf",
  url2          = "http://www.christian-engelmann.info/publications/bautista-gomez16reducing.ppt.pdf",
  abstract      = "Resilience is an important challenge for extreme-scale 
                   supercomputers. Today, failures in supercomputers are 
                   assumed to be uniformly distributed in time. However, recent 
                   studies show that failures in high-performance computing 
                   systems are partially correlated in time, generating periods 
                   of higher failure density. Our study of the failure logs of 
                   multiple supercomputers show that periods of higher failure 
                   density occur with up to three times more than the average. 
                   We design a monitoring system that listens to hardware 
                   events and forwards important events to the runtime to 
                   detect those regime changes. We implement a runtime capable 
                   of receiving notifications and adapting dynamically. In
                   addition, we build an analytical model to predict the gains
                   that such a dynamic approach could achieve. We demonstrate that
                   in some systems, our approach can reduce the wasted time.",
  pts           = "62159"
}
@conference{engelmann16supporting,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "Supporting the Development of Soft-Error Resilient Message
                   Passing Applications using Simulation",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-795.html}
                   {$13^{th}$ IASTED International Conference on Parallel and
                   Distributed Computing and Networks (PDCN) 2016}",
  month         = feb # "~15-16, ",
  year          = "2016",
  address       = "Innsbruck, Austria",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-979-0",
  doi           = "10.2316/P.2016.834-005",
  url           = "http://www.christian-engelmann.info/publications/engelmann16supporting.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann16supporting.ppt.pdf",
  abstract      = "Radiation-induced bit flip faults are of particular concern
                   in extreme-scale high-performance computing systems. This
                   paper presents a simulation-based tool that enables the
                   development of soft-error resilient message passing
                   applications by permitting the investigation of their
                   correctness and performance under various fault conditions.
                   The documented extensions to the Extreme-scale Simulator
                   (xSim) enable the injection of bit flip faults at specific
                   injection location(s) and fault activation time(s),
                   while supporting a significant degree of configurability of
                   the fault type. Experiments show that the simulation
                   overhead with the new feature is $\sim$2,325\% for serial
                   execution and $\sim$1,730\% at 128 MPI processes, both with
                   very fine-grain fault injection. Fault injection experiments
                   demonstrate the usefulness of the new feature by injecting
                   bit flips in the input and output matrices of a matrix-matrix
                   multiply application, revealing vulnerability of data
                   structures, masking and error propagation. xSim is the very
                   first simulation-based MPI performance tool that supports
                   both the injection of process failures and bit flip faults.",
  pts           = "60888"
}
@conference{katti15scalable,
  author        = "Amogh Katti
                   and Giuseppe Di Fatta
                   and Thomas Naughton
                   and Christian Engelmann",
  title         = "Scalable and Fault Tolerant Failure Detection and Consensus",
  booktitle     = "Proceedings of the
                   \href{https://eurompi2015.bordeaux.inria.fr}{$22^{nd}$
                   European MPI Users' Group Meeting (EuroMPI) 2015}",
  pages         = "13:1--13:9",
  month         = sep # "~21-24, ",
  year          = "2015",
  address       = "Bordeaux, France",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-3795-3",
  doi           = "10.1145/2802658.2802660",
  url           = "http://www.christian-engelmann.info/publications/katti15scalable.pdf",
  url2          = "http://www.christian-engelmann.info/publications/katti15scalable.ppt.pdf",
  abstract      = "Future extreme-scale high-performance computing systems will
                   be required to work under frequent component failures. The
                   MPI Forum's User Level Failure Mitigation proposal has
                   introduced an operation (MPI\_Comm\_shrink) to synchronize
                   the alive processes on the list of failed processes, so that
                   applications can continue to execute even in the presence of
                   failures by adopting algorithm-based fault tolerance
                   techniques. The MPI\_Comm\_shrink operation requires a
                   fault tolerant failure detection and consensus algorithm.
                   This paper presents and compares two novel failure detection
                   and consensus algorithms to support this operation. The
                   proposed algorithms are based on Gossip protocols and are
                   inherently fault-tolerant and scalable. The proposed
                   algorithms were implemented and tested using the
                   Extreme-scale Simulator. The results show that in both
                   algorithms the number of Gossip cycles to achieve global
                   consensus scales logarithmically with system size. The second
                   algorithm also shows better scalability in terms of memory
                   usage and network bandwidth costs and perfect
                   synchronization in achieving global consensus.",
  pts           = "57940"
}
@conference{engelmann15network,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "A Network Contention Model for the Extreme-scale Simulator",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-826.html}
                   {$34^{th}$ IASTED International Conference on Modelling,
                   Identification and Control (MIC) 2015}",
  month         = feb # "~17-18, ",
  year          = "2015",
  address       = "Innsbruck, Austria",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-975-2",
  doi           = "10.2316/P.2015.826-043",
  url           = "http://www.christian-engelmann.info/publications/engelmann15network.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann15network.ppt.pdf",
  abstract      = "The Extreme-scale Simulator (xSim) is a performance
                   investigation toolkit for high-performance computing (HPC)
                   hardware/software co-design. It permits running a HPC
                   application with millions of concurrent execution threads,
                   while observing its performance in a simulated extreme-scale
                   system. This paper details a newly developed network modeling
                   feature for xSim, eliminating the shortcomings of the
                   existing network modeling capabilities. The approach takes a
                   different path for implementing network contention and
                   bandwidth capacity modeling using a less synchronous yet
                   sufficiently accurate model design. With the new network modeling
                   feature, xSim is able to simulate on-chip and on-node
                   networks with reasonable accuracy and overheads.",
  pts           = "53873"
}
@conference{engelmann14improving,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "Improving the Performance of the Extreme-scale Simulator",
  booktitle     = "Proceedings of the \href{http://ds-rt.com/2014}{$18^{th}$
                   IEEE/ACM International Symposium on Distributed Simulation
                   and Real Time Applications (DS-RT) 2014}",
  pages         = "198--207",
  month         = oct # "~1-3, ",
  year          = "2014",
  address       = "Toulouse, France",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "1550-6525",
  isbn          = "978-1-4799-6143-6",
  doi           = "10.1109/DS-RT.2014.32",
  url           = "http://www.christian-engelmann.info/publications/engelmann14improving.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann14improving.ppt.pdf",
  abstract      = "Investigating the performance of parallel applications
                   at scale on future high-performance computing (HPC)
                   architectures and the performance impact of different
                   architecture choices is an important component of HPC
                   hardware/software co-design.
                   The Extreme-scale Simulator (xSim) is a simulation-based
                   toolkit for investigating the performance of parallel
                   applications at scale. xSim scales to millions of simulated
                   Message Passing Interface (MPI) processes. The overhead
                   introduced by a simulation tool is an important performance
                   and productivity aspect. This paper documents two
                   improvements to xSim: (1) a new deadlock resolution protocol
                   to reduce the parallel discrete event simulation management
                   overhead and (2) a new simulated MPI message matching
                   algorithm to reduce the oversubscription management overhead.
                   The results clearly show a significant performance
                   improvement, such as by reducing the simulation overhead for
                   running the NAS Parallel Benchmark suite inside the simulator 
                   from 1,020\% to 238\% for the conjugate gradient (CG)
                   benchmark and from 102\% to 0\% for the embarrassingly
                   parallel (EP) benchmark, as well as from 37,511\% to
                   13,808\% for CG and from 3,332\% to 204\% for EP with
                   accurate process failure simulation.",
  pts           = "50654"
}
@conference{naughton14supporting,
  author        = "Thomas Naughton
                   and Christian Engelmann
                   and Geoffroy Vall{\'e}e
                   and Swen B{\"o}hm",
  title         = "Supporting the Development of Resilient Message Passing
                   Applications using Simulation",
  booktitle     = "Proceedings of the \href{http://www.pdp2014.org}{$22^{nd}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2014}",
  pages         = "271--278",
  month         = feb # "~12-14, ",
  year          = "2014",
  address       = "Turin, Italy",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  issn          = "1066-6192",
  doi           = "10.1109/PDP.2014.74",
  url           = "http://www.christian-engelmann.info/publications/naughton14supporting.pdf",
  url2          = "http://www.christian-engelmann.info/publications/naughton14supporting.ppt.pdf",
  abstract      = "An emerging aspect of high-performance computing (HPC)
                   hardware/software co-design is investigating performance
                   under failure. The work in this paper extends the
                   Extreme-scale Simulator (xSim), which was designed for
                   evaluating the performance of message passing interface
                   (MPI) applications on future HPC architectures, with
                   fault-tolerant MPI extensions proposed by the MPI Fault
                   Tolerance Working Group. xSim permits running MPI
                   applications with millions of concurrent MPI ranks, while
                   observing application performance in a simulated
                   extreme-scale system using a lightweight parallel discrete
                   event simulation. The newly added features offer user-level
                   failure mitigation (ULFM) extensions at the simulated MPI
                   layer to support algorithm-based fault tolerance (ABFT).
                   The presented solution permits investigating performance
                   under failure and failure handling of ABFT solutions.
                   The newly enhanced xSim is the very first performance tool
                   that supports ULFM and ABFT.",
  pts           = "49204"
}
@conference{vallee13runtime,
  author        = "Geoffroy Vall{\'e}e
                   and Thomas Naughton
                   and Swen B{\"o}hm
                   and Christian Engelmann",
  title         = "A Runtime Environment for Supporting Research in Resilient
                   {HPC} System Software \& Tools",
  booktitle     = "Proceedings of the \href{http://is-candar.org}
                   {$1^{st}$ International Symposium on Computing and
                   Networking - Across Practical Development and Theoretical
                   Research - (CANDAR) 2013}",
  pages         = "213--219",
  month         = dec # "~4-6, ",
  year          = "2013",
  address       = "Matsuyama, Japan",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-4799-2795-1",
  doi           = "10.1109/CANDAR.2013.38",
  url           = "http://www.christian-engelmann.info/publications/vallee13runtime.pdf",
  url2          = "http://www.christian-engelmann.info/publications/vallee13runtime.ppt.pdf",
  abstract      = "The high-performance computing~(HPC) community continues to
                   increase the size and complexity of hardware platforms that
                   support advanced scientific workloads. The runtime
                   environment (RTE) is a crucial layer in the software
                   stack for these large-scale systems. The RTE manages the
                   interface between the operating system and the application
                   running in parallel on the machine. The deployment of
                   applications and tools on large-scale HPC computing systems
                   requires the RTE to manage process creation in a scalable
                   manner, support sparse connectivity, and provide fault
                   tolerance. We have developed a new RTE that provides a basis
                   for building distributed execution environments and
                   developing tools for HPC to aid research in system software
                   and resilience. This paper describes the software
                   architecture of the Scalable runTime Component
                   Infrastructure~(STCI), which is intended to provide a
                   complete infrastructure for scalable start-up and
                   management of many processes in large-scale HPC systems. We
                   highlight features of the current implementation, which is
                   provided as a system library that allows developers to easily
                   use and integrate STCI in their tools and/or applications.
                   The motivation for this work has been to support ongoing
                   research activities in fault-tolerance for large-scale
                   systems. We discuss the advantages of the modular framework
                   employed and describe two use cases that demonstrate its
                   capabilities: (i) an alternate runtime for a Message Passing
                   Interface (MPI) stack, and (ii) a distributed control and
                   communication substrate for a fault-injection tool.",
  pts           = "45674"
}
@conference{engelmann13investigating,
  author        = "Christian Engelmann",
  title         = "Investigating Operating System Noise in Extreme-Scale
                   High-Performance Computing Systems using Simulation",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-795.html}
                   {$11^{th}$ IASTED International Conference on Parallel and
                   Distributed Computing and Networks (PDCN) 2013}",
  month         = feb # "~11-13, ",
  year          = "2013",
  address       = "Innsbruck, Austria",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-943-1",
  doi           = "10.2316/P.2013.795-010",
  url           = "http://www.christian-engelmann.info/publications/engelmann12investigating.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann12investigating.ppt.pdf",
  abstract      = "Hardware/software co-design for future-generation
                   high-performance computing (HPC) systems aims at closing
                   the gap between the peak capabilities of the hardware
                   and the performance realized by applications
                   (application-architecture performance gap). Performance
                   profiling of architectures and applications is a crucial
                   part of this iterative process. The work in this paper
                   focuses on operating system (OS) noise as an additional
                   factor to be considered for co-design. It represents the
                   first step in including OS noise in HPC hardware/software
                   co-design by adding a noise injection feature to an existing
                   simulation-based co-design toolkit. It reuses an existing
                   abstraction for OS noise with frequency (periodic recurrence)
                   and period (duration of each occurrence) to enhance the
                   processor model of the Extreme-scale Simulator (xSim) with
                   synchronized and random OS noise simulation. The results
                   demonstrate this capability by evaluating the impact of OS
                   noise on MPI\_Bcast() and MPI\_Reduce() in a simulated
                   future-generation HPC system with 2,097,152 compute nodes.",
  pts           = "40576"
}
@conference{fiala12detection2,
  author        = "David Fiala
                   and Frank Mueller
                   and Christian Engelmann
                   and Kurt Ferreira
                   and Ron Brightwell
                   and Rolf Riesen",
  title         = "Detection and Correction of Silent Data Corruption for
                   Large-Scale High-Performance Computing",
  booktitle     = "Proceedings of the
                   \href{http://sc12.supercomputing.org}{$25^{th}$ IEEE/ACM
                   International Conference on High Performance Computing,
                   Networking, Storage and Analysis (SC) 2012}",
  pages         = "78:1--78:12",
  month         = nov # "~10-16, ",
  year          = "2012",
  address       = "Salt Lake City, UT, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4673-0804-5",
  doi           = "10.1109/SC.2012.49",
  url           = "http://www.christian-engelmann.info/publications/fiala12detection2.pdf",
  url2          = "http://www.christian-engelmann.info/publications/fiala12detection2.ppt.pdf",
  abstract      = "Faults have become the norm rather than the exception for
                   high-end computing on clusters with 10s/100s of thousands of
                   cores. Exacerbating this situation, some of these faults
                   remain undetected, manifesting themselves as silent errors
                   that corrupt memory while applications continue to operate
                   and report incorrect results.
                   This paper studies the potential for redundancy to both
                   detect and correct soft errors in MPI message-passing
                   applications. Our study investigates the challenges inherent
                   to detecting soft errors within MPI application while
                   providing transparent MPI redundancy. By assuming a model
                   wherein corruption in application data manifests itself by
                   producing differing MPI message data between replicas, we
                   study the best suited protocols for detecting and correcting
                   MPI data that is the result of corruption.
                   To experimentally validate our proposed detection and
                   correction protocols, we introduce RedMPI, an MPI library
                   which resides in the MPI profiling layer. RedMPI is capable
                   of both online detection and correction of soft errors that
                   occur in MPI applications without requiring any
                   modifications to the application source by utilizing either
                   double or triple redundancy.
                   Our results indicate that our most efficient consistency
                   protocol can successfully protect applications experiencing
                   even high rates of silent data corruption with runtime
                   overheads between 0\% and 30\% as compared to unprotected
                   applications without redundancy.
                   Using our fault injector within RedMPI, we observe that even
                   a single soft error can have profound effects on running
                   applications, causing a cascading pattern of corruption that
                   in most cases spreads to all other processes.
                   RedMPI's protection has been shown to successfully mitigate
                   the effects of soft errors while allowing applications to
                   complete with correct results even in the face of errors.",
  pts           = "38306"
}
@conference{elliott12combining,
  author        = "James Elliott
                   and Kishor Kharbas
                   and David Fiala
                   and Frank Mueller
                   and Kurt Ferreira
                   and Christian Engelmann",
  title         = "Combining Partial Redundancy and Checkpointing for {HPC}",
  booktitle     = "Proceedings of the \href{http://icdcs-2012.org/}
                   {$32^{nd}$ International Conference on Distributed
                   Computing Systems (ICDCS) 2012}",
  pages         = "615--626",
  month         = jun # "~18-21, ",
  year          = "2012",
  address       = "Macau, SAR, China",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-4685-8",
  issn          = "1063-6927",
  doi           = "10.1109/ICDCS.2012.56",
  url           = "http://www.christian-engelmann.info/publications/elliott12combining.pdf",
  url2          = "http://www.christian-engelmann.info/publications/elliott12combining.ppt.pdf",
  abstract      = "Today's largest High Performance Computing (HPC) systems
                   exceed one petaflop ($10^{15}$ floating point operations per
                   second) and exascale systems are projected within seven
                   years. But reliability is becoming one of the major
                   challenges faced by exascale computing. With billion-core
                   parallelism, the mean time to failure is projected to be in
                   the range of minutes or hours instead of days. Failures are
                   becoming the norm rather than the exception during execution
                   of HPC applications. Current fault tolerance techniques in
                   HPC focus on reactive ways to mitigate faults, namely via
                   checkpoint and restart (C/R). Apart from storage overheads,
                   C/R-based fault recovery comes at an additional cost in
                   terms of application performance because normal execution
                   is disrupted when checkpoints are taken. Studies have shown
                   that applications running at a large scale spend more than
                   50\% of their total time saving checkpoints, restarting and
                   redoing lost work. Redundancy is another fault tolerance
                   technique, which employs redundant processes performing the
                   same task. If a process fails, a replica of it can take over
                   its execution. Thus, redundant copies can decrease the
                   overall failure rate. The downside of redundancy is that
                   extra resources are required and there is an additional
                   overhead on communication and synchronization. This work
                   contributes a model and analyzes the benefit of C/R in
                   coordination with redundancy at different degrees to
                   minimize the total wallclock time and resources utilization
                   of HPC applications. We further conduct experiments with an
                   implementation of redundancy within the MPI layer on a
                   cluster. Our experimental results confirm the benefit of dual
                   and triple redundancy - but not for partial redundancy - and
                   show a close fit to the model. At 80,000 processes, dual
                   redundancy requires twice the number of processing resources
                   for an application but allows two jobs of 128 hours wallclock
                   time to finish within the time of just one job without
                   redundancy. For narrow ranges of processor counts, partial
                   redundancy results in the lowest time. Once the count exceeds
                   770,000, triple redundancy has the lowest overall cost.
                   Thus, redundancy allows one to trade-off additional resource
                   requirements against wallclock time, which provides a tuning
                   knob for users to adapt to resource availabilities.",
  pts           = "35629"
}
@conference{wang12nvmalloc,
  author        = "Chao Wang
                   and Sudharshan S. Vazhkudai
                   and Xiaosong Ma
                   and Fei Meng
                   and Youngjae Kim
                   and Christian Engelmann",
  title         = "{NVMalloc}: Exposing an Aggregate {SSD} Store as a Memory
                   Partition in Extreme-Scale Machines",
  booktitle     = "Proceedings of the \href{http://www.ipdps.org}
                   {$26^{th}$ IEEE International Parallel and Distributed
                   Processing Symposium (IPDPS) 2012}",
  pages         = "957--968",
  month         = may # "~21-25, ",
  year          = "2012",
  address       = "Shanghai, China",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-4675-9",
  doi           = "10.1109/IPDPS.2012.90",
  url           = "http://www.christian-engelmann.info/publications/wang12nvmalloc.pdf",
  url2          = "http://www.christian-engelmann.info/publications/wang12nvmalloc.ppt.pdf",
  abstract      = "DRAM is a precious resource in extreme-scale machines and is
                   increasingly becoming scarce, mainly due to the growing
                   number of cores per node. On future multi-petaflop and
                   exaflop machines, the memory pressure is likely to be so
                   severe that we need to rethink our memory usage models.
                   Fortunately, the advent of non-volatile memory (NVM) offers
                   a unique opportunity in this space. Current NVM offerings
                   possess several desirable properties, such as low cost and
                   power efficiency, but also suffer from high latency and
                   lifetime issues. We need rich techniques to be able to use
                   them alongside DRAM. In this paper, we propose a novel
                   approach to exploiting NVM as a secondary memory partition
                   so that applications can explicitly allocate and manipulate
                   memory regions therein. More specifically, we propose an
                   NVMalloc library with a suite of services that enables
                   applications to access a distributed NVM storage system.
                   We have devised ways within NVMalloc so that the storage
                   system, built from compute node-local NVM devices, can be
                   accessed in a byte-addressable fashion using the memory
                   mapped I/O interface. Our approach has the potential to
                   re-energize out-of-core computations on large-scale machines
                   by having applications allocate certain variables through
                   NVMalloc, thereby increasing the overall memory available
                   for the application. Our evaluation on a 128-core cluster
                   shows that NVMalloc enables applications to compute problem
                   sizes larger than the physical memory in a cost-effective
                   manner. It can achieve better performance with increased
                   computation time between NVM memory accesses or increased
                   data access locality. In addition, our results suggest that
                   while NVMalloc enables transparent access to NVM-resident
                   variables, the explicit control it provides is crucial to
                   optimize application performance.",
  pts           = "35603"
}
@conference{boehm12file,
  author        = "Swen B{\"o}hm and
                   Christian Engelmann",
  title         = "File {I/O} for {MPI} Applications in Redundant Execution
                   Scenarios",
  booktitle     = "Proceedings of the \href{http://www.pdp2012.org}{$20^{th}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2012}",
  pages         = "112--119",
  month         = feb # "~15-17, ",
  year          = "2012",
  address       = "Garching, Germany",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-4633-9",
  issn          = "1066-6192",
  doi           = "10.1109/PDP.2012.22",
  url           = "http://www.christian-engelmann.info/publications/boehm12file.pdf",
  url2          = "http://www.christian-engelmann.info/publications/boehm12file.ppt.pdf",
  abstract      = "As multi-petascale and exa-scale high-performance computing
                   (HPC) systems inevitably have to deal with a number of
                   resilience challenges, such as a significant growth in
                   component count and smaller circuit sizes with lower circuit
                   voltages, redundancy may offer an acceptable level of
                   resilience that traditional fault tolerance techniques, such
                   as checkpoint/restart, do not. Although redundancy in HPC is
                   quite controversial due to the associated cost for redundant
                   components, the constantly increasing number of
                   cores-per-processor is tilting this cost calculation toward
                   a system design where computation, such as for redundancy,
                   is much cheaper and communication, needed for
                   checkpoint/restart, is much more expensive. Recent research
                   and development activities in redundancy for Message Passing
                   Interface (MPI) applications focused on
                   availability/reliability models and replication algorithms.
                   This paper takes a first step toward solving an open research
                   problem associated with running a parallel application
                   redundantly, which is file I/O under redundancy. The
                   approach intercepts file I/O calls made by a redundant
                   application to employ coordination protocols that execute
                   file I/O operations in a redundancy-oblivious fashion when
                   accessing a node-local file system, or in a redundancy-aware
                   fashion when accessing a shared networked file system.
                   A proof-of-concept prototype is presented and a number of
                   coordination protocols are described and evaluated. The
                   results show the performance impact for redundantly accessing
                   a shared networked file system, but also demonstrate the
                   capability to regain performance by utilizing MPI
                   communication between replicas and parallel file I/O.",
  pts           = "33577"
}
@conference{boehm11xsim,
  author        = "Swen B{\"o}hm
                   and Christian Engelmann",
  title         = "{xSim}: {The} Extreme-Scale Simulator",
  booktitle     = "Proceedings of the
                   \href{http://hpcs11.cisedu.info}{International Conference on
                   High Performance Computing and Simulation (HPCS) 2011}",
  pages         = "280--286",
  month         = jul # "~4-8, ",
  year          = "2011",
  address       = "Istanbul, Turkey",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-61284-383-4",
  doi           = "10.1109/HPCSim.2011.5999835",
  url           = "http://www.christian-engelmann.info/publications/boehm11xsim.pdf",
  url2          = "http://www.christian-engelmann.info/publications/boehm11xsim.ppt.pdf",
  abstract      = "Investigating parallel application performance properties at
                   scale is becoming an important part of high-performance
                   computing (HPC) application development and deployment. The
                   Extreme-scale Simulator (xSim) is a performance investigation
                   toolkit that permits running an application in a controlled
                   environment at extreme scale without the need for a
                   respective extreme-scale HPC system. Using a lightweight
                   parallel discrete event simulation, xSim executes a parallel
                   application with a virtual wall clock time, such that
                   performance data can be extracted based on a processor model
                   and a network model. This paper presents significant
                   enhancements to the xSim toolkit prototype that provide
                   more complete Message Passing Interface (MPI) support and
                   improve its versatility. These enhancements include full
                   virtual MPI group, communicator and collective communication
                   support, and global variables support. The new capabilities
                   are demonstrated by executing the entire NAS Parallel
                   Benchmark suite in a simulated HPC environment.",
  pts           = "29960"
}
@conference{engelmann11redundant,
  author        = "Christian Engelmann
                   and Swen B{\"o}hm",
  title         = "Redundant Execution of {HPC} Applications with {MR-MPI}",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-719.html}
                   {$10^{th}$ IASTED International Conference on Parallel and
                   Distributed Computing and Networks (PDCN) 2011}",
  pages         = "31--38",
  month         = feb # "~15-17, ",
  year          = "2011",
  address       = "Innsbruck, Austria",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-864-9",
  doi           = "10.2316/P.2011.719-031",
  url           = "http://www.christian-engelmann.info/publications/engelmann11redundant.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann11redundant.ppt.pdf",
  abstract      = "This paper presents a modular-redundant Message Passing
                   Interface (MPI) solution, MR-MPI, for transparently executing 
                   high-performance computing (HPC) applications in a redundant
                   fashion. The presented work addresses the deficiencies of
                   recovery-oriented HPC, i.e., checkpoint/restart to/from a
                   parallel file system, at extreme scale by adding the
                   redundancy approach to the HPC resilience portfolio. It
                   utilizes the MPI performance tool interface, PMPI, to
                   transparently intercept MPI calls from an application and to
                   hide all redundancy-related mechanisms. A redundantly
                   executed application runs with $r*m$ native MPI processes,
                   where $r$ is the number of MPI ranks visible to the
                   application and $m$ is the replication degree. Messages
                   between redundant nodes are replicated. Partial replication
                   for tunable resilience is supported. The performance results
                   clearly show the negative impact of the $O(m^2)$ messages
                   between replicas. For low-level, point-to-point benchmarks,
                   the impact can be as high as the replication degree. For
                   applications, performance highly depends on the actual
                   communication types and counts. On single-core systems, the
                   overhead can be 0\% for embarrassingly parallel applications
                   independent of the employed redundancy configuration or up
                   to 70-90\% for communication-intensive applications in a
                   dual-redundant configuration. On multi-core systems, the
                   overhead can be significantly higher due to the additional
                   communication contention.",
  pts           = "27623"
}
@conference{wang10hybrid2,
  author        = "Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Hybrid Checkpointing for {MPI} Jobs in {HPC} Environments",
  booktitle     = "Proceedings of the
                   \href{http://grid.sjtu.edu.cn/icpads10}{$16^{th}$ IEEE
                   International Conference on Parallel and Distributed Systems
                   (ICPADS) 2010}",
  pages         = "524--533",
  month         = dec # "~8-10, ",
  year          = "2010",
  address       = "Shanghai, China",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-4307-9",
  doi           = "10.1109/ICPADS.2010.48",
  url           = "http://www.christian-engelmann.info/publications/wang10hybrid2.pdf",
  url2          = "http://www.christian-engelmann.info/publications/wang10hybrid2.ppt.pdf",
  abstract      = "As the core count in high-performance computing systems keeps
                   increasing, faults are becoming commonplace. Checkpointing
                   addresses such faults but captures full process images even
                   though only a subset of the process image changes between
                   checkpoints. We have designed a hybrid checkpointing
                   technique for MPI tasks of high-performance applications.
                   This technique alternates between full and incremental
                   checkpoints: At incremental checkpoints, only data changed
                   since the last checkpoint is captured. Our implementation
                   integrates new BLCR and LAM/MPI features that complement
                   traditional full checkpoints. This results in significantly
                   reduced checkpoint sizes and overheads with only moderate
                   increases in restart overhead. After accounting for cost and
                   savings, benefits due to incremental checkpoints are an order
                   of magnitude larger than overheads on restarts. We further
                   derive qualitative results indicating an optimal balance
                   between full/incremental checkpoints of our novel approach at
                   a ratio of 1:9, which outperforms both always-full and
                   always-incremental checkpointing.",
  pts           = "25447"
}
@conference{li10functional,
  author        = "Min Li
                   and Sudharshan S. Vazhkudai
                   and Ali R. Butt
                   and Fei Meng
                   and Xiaosong Ma
                   and Youngjae Kim
                   and Christian Engelmann
                   and Galen Shipman",
  title         = "Functional Partitioning to Optimize End-to-End Performance on
                   Many-Core Architectures",
  booktitle     = "Proceedings of the
                   \href{http://sc10.supercomputing.org}{$23^{rd}$ IEEE/ACM
                   International Conference on High Performance Computing,
                   Networking, Storage and Analysis (SC) 2010}",
  pages         = "1--12",
  month         = nov # "~13-19, ",
  year          = "2010",
  address       = "New Orleans, LA, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4244-7559-9",
  doi           = "10.1109/SC.2010.28",
  url           = "http://www.christian-engelmann.info/publications/li10functional.pdf",
  url2          = "http://www.christian-engelmann.info/publications/li10functional.ppt.pdf",
  abstract      = "Scaling computations on emerging massive-core supercomputers
                   is a daunting task that, coupled with the significantly
                   lagging system I/O capabilities, exacerbates applications'
                   end-to-end performance. The I/O bottleneck often negates
                   potential performance benefits of assigning additional
                   compute cores to an application. In this paper, we address
                   this issue via a novel functional partitioning (FP) runtime
                   environment that allocates cores to specific application
                   tasks - checkpointing, de-duplication, and scientific data
                   format transformation - so that the deluge of cores can be
                   brought to bear on the entire gamut of application
                   activities. The focus is on utilizing the extra cores to
                   support HPC application I/O activities and also leverage
                   solid-state disks in this context. For example, our
                   evaluation shows that dedicating 1 core on an oct-core
                   machine for checkpointing and its assist tasks using FP can
                   improve overall execution time of a FLASH benchmark on 80 and 
                   160 cores by 43.95\% and 41.34\%, respectively.",
  pts           = "24996"
}
@conference{boehm10aggregation,
  author        = "Swen B{\"o}hm
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Aggregation of Real-Time System Monitoring Data for Analyzing
                   Large-Scale Parallel and Distributed Computing Environments",
  booktitle     = "Proceedings of the \href{http://www.anss.org.au/hpcc2010}
                   {$12^{th}$ IEEE International Conference on High Performance
                   Computing and Communications (HPCC) 2010}",
  pages         = "72--78",
  month         = sep # "~1-3, ",
  year          = "2010",
  address       = "Melbourne, Australia",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-4214-0",
  doi           = "10.1109/HPCC.2010.32",
  url           = "http://www.christian-engelmann.info/publications/boehm10aggregation.pdf",
  url2          = "http://www.christian-engelmann.info/publications/boehm10aggregation.ppt.pdf",
  abstract      = "We present a monitoring system for large-scale parallel and
                   distributed computing environments that allows one to trade off
                   accuracy in a tunable fashion to gain scalability without
                   compromising fidelity. The approach relies on classifying
                   each gathered monitoring metric based on individual needs
                   and on aggregating messages containing classes of individual
                   monitoring metrics using a tree-based overlay network. The
                   MRNet-based prototype is able to significantly reduce the
                   amount of gathered and stored monitoring data, e.g., by a
                   factor of ~56 in comparison to the Ganglia distributed
                   monitoring system. A simple scaling study reveals, however,
                   that further efforts are needed in reducing the amount of
                   data to monitor future-generation extreme-scale systems with
                   up to 1,000,000 nodes. The implemented solution did not have
                   a measurable performance impact as the 32-node test system
                   did not produce enough monitoring data to interfere with
                   running applications.",
  pts           = "24907"
}
@conference{litvinova10proactive,
  author        = "Antonina Litvinova
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "A Proactive Fault Tolerance Framework for High-Performance
                   Computing",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-676.html}
                   {$9^{th}$ IASTED International Conference on Parallel and
                   Distributed Computing and Networks (PDCN) 2010}",
  pages         = "",
  month         = feb # "~16-18, ",
  year          = "2010",
  address       = "Innsbruck, Austria",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-783-3",
  doi           = "10.2316/P.2010.676-024",
  url           = "http://www.christian-engelmann.info/publications/litvinova10proactive.pdf",
  url2          = "http://www.christian-engelmann.info/publications/litvinova10proactive.ppt.pdf",
  abstract      = "As high-performance computing (HPC) systems continue to
                   increase in scale, their mean-time to interrupt decreases
                   accordingly. The current state of practice for fault
                   tolerance (FT) is checkpoint/restart. However, with
                   increasing error rates, increasing aggregate memory and not
                   proportionally increasing I/O capabilities, it is becoming
                   less efficient. Proactive FT avoids experiencing failures
                   through preventative measures, such as by migrating
                   application parts away from nodes that are about to fail.
                   This paper presents a proactive FT framework that performs
                   environmental monitoring, event logging, parallel job
                   monitoring and resource monitoring to analyze HPC system
                   reliability and to perform FT through such preventative
                   actions.",
  pts           = "13674"
}
@conference{taerat09blue,
  author        = "Narate Taerat
                   and Nichamon Naksinehaboon
                   and Clayton Chandler
                   and James Elliott
                   and Chokchai (Box) Leangsuksun
                   and George Ostrouchov
                   and Stephen L. Scott
                   and Christian Engelmann",
  title         = "{Blue Gene/L} Log Analysis and Time to Interrupt Estimation",
  booktitle     = "Proceedings of the
                   \href{http://www.ares-conference.eu/ares2009}{$4^{th}$
                   International Conference on Availability, Reliability and
                   Security (ARES) 2009}",
  pages         = "173--180",
  month         = mar # "~16-19, ",
  year          = "2009",
  address       = "Fukuoka, Japan",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-4244-3572-2",
  doi           = "10.1109/ARES.2009.105",
  url           = "http://www.christian-engelmann.info/publications/taerat09blue.pdf",
  url2          = "",
  abstract      = "System- and application-level failures could be characterized
                   by analyzing relevant log files. The resulting data might
                   then be used in numerous studies on and future developments
                   for the mission-critical and large scale computational
                   architecture, including fields such as failure prediction,
                   reliability modeling, performance modeling and power
                   awareness. In this paper, system logs covering a six-month
                   period of the Blue Gene/L supercomputer were obtained and
                   subsequently analyzed. Temporal filtering was applied to
                   remove duplicated log messages. Optimistic and pessimistic
                   perspectives were applied to the filtered log information to
                   observe failure behavior within the system. Further, various
                   time to repair factors were applied to obtain application
                   time to interrupt, which will be exploited in further
                   resilience modeling research."
}
@conference{engelmann09evaluating,
  author        = "Christian Engelmann
                   and Hong H. Ong
                   and Stephen L. Scott",
  title         = "Evaluating the Shared Root File System Approach for Diskless
                   High-Performance Computing Systems",
  booktitle     = "Proceedings of the
                   \href{http://www.linuxclustersinstitute.org/conferences}
                   {$10^{th}$ LCI International Conference on High-Performance
                   Clustered Computing (LCI) 2009}",
  month         = mar # "~9-12, ",
  year          = "2009",
  address       = "Boulder, CO, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann09evaluating.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann09evaluating.ppt.pdf",
  abstract      = "Diskless high-performance computing (HPC) systems utilizing
                   networked storage have become popular in the last several
                   years. Removing disk drives significantly increases compute
                   node reliability as they are known to be a major source of
                   failures. Furthermore, networked storage solutions utilizing
                   parallel I/O and replication are able to provide increased
                   scalability and availability. Reducing a compute node to
                   processor(s), memory and network interface(s) greatly reduces
                   its physical size, which in turn allows for large-scale dense
                   HPC solutions. However, one major obstacle is the requirement
                   by certain operating systems (OSs), such as Linux, for a root
                   file system. While one solution is to remove this requirement
                   from the OS, another is to share the root file system over
                   the networked storage. This paper evaluates three networked
                   file system solutions, NFSv4, Lustre and PVFS2, with respect
                   to their performance, scalability, and availability features
                   for servicing a common root file system in a diskless HPC
                   configuration. Our findings indicate that Lustre is a viable
                   solution as it meets both scaling and performance
                   requirements. However, certain availability issues regarding
                   single points of failure and control need to be considered.",
  pts           = "14025"
}
@conference{engelmann09proactive,
  author        = "Christian Engelmann
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Stephen L. Scott",
  title         = "Proactive Fault Tolerance Using Preemptive Migration",
  booktitle     = "Proceedings of the \href{http://www.pdp2009.org}{$17^{th}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2009}",
  pages         = "252--257",
  month         = feb # "~18-20, ",
  year          = "2009",
  address       = "Weimar, Germany",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3544-9",
  issn          = "1066-6192",
  doi           = "10.1109/PDP.2009.31",
  url           = "http://www.christian-engelmann.info/publications/engelmann09proactive.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann09proactive.ppt.pdf",
  abstract      = "Proactive fault tolerance (FT) in high-performance computing
                   is a concept that prevents compute node failures from
                   impacting running parallel applications by preemptively
                   migrating application parts away from nodes that are about
                   to fail. This paper provides a foundation for proactive FT by
                   defining its architecture and classifying implementation
                   options. This paper further relates prior work to the
                   presented architecture and classification, and discusses the
                   challenges ahead for needed supporting technologies.",
  pts           = "13674"
}
@conference{valentini09high,
  author        = "Alessandro Valentini
                   and Christian Di Biagio
                   and Fabrizio Batino
                   and Guido Pennella
                   and Fabrizio Palma
                   and Christian Engelmann",
  title         = "High Performance Computing with {Harness} over {InfiniBand}",
  booktitle     = "Proceedings of the \href{http://www.pdp2009.org}{$17^{th}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2009}",
  pages         = "151--154",
  month         = feb # "~18-20, ",
  year          = "2009",
  address       = "Weimar, Germany",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3544-9",
  issn          = "1066-6192",
  doi           = "10.1109/PDP.2009.64",
  url           = "http://www.christian-engelmann.info/publications/valentini09high.pdf",
  abstract      = "Harness is an adaptable and plug-in-based middleware
                   framework able to support distributed parallel computing. Until
                   now, it has been based on the Ethernet protocol, which cannot
                   guarantee high-performance throughput or real-time
                   (deterministic) performance. In recent years, both the
                   research and industry communities have developed new
                   network architectures (InfiniBand, Myrinet, iWARP, etc.) to
                   overcome those limits. This paper concerns the integration
                   between Harness and InfiniBand, focusing on two solutions: IP
                   over InfiniBand (IPoIB) and the Socket Direct Protocol (SDP).
                   These allow the Harness middleware to take advantage
                   of the enhanced features provided by InfiniBand.",
  pts           = "14107"
}
@conference{engelmann09case,
  author        = "Christian Engelmann
                   and Hong H. Ong
                   and Stephen L. Scott",
  title         = "The Case for Modular Redundancy in Large-Scale High
                   Performance Computing Systems",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-641.html}
                   {$8^{th}$ IASTED International Conference on Parallel and
                   Distributed Computing and Networks (PDCN) 2009}",
  pages         = "189--194",
  month         = feb # "~16-18, ",
  year          = "2009",
  address       = "Innsbruck, Austria",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-784-0",
  doi           = "",
  url           = "http://www.christian-engelmann.info/publications/engelmann09case.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann09case.ppt.pdf",
  abstract      = "Recent investigations into resilience of large-scale
                   high-performance computing (HPC) systems showed a continuous
                   trend of decreasing reliability and availability. Newly
                   installed systems have a lower mean-time to failure (MTTF)
                   and a higher mean-time to recover (MTTR) than their
                   predecessors. Modular redundancy is being used in many
                   mission critical systems today to provide for resilience,
                   such as for aerospace and command \& control systems. The
                   primary argument against modular redundancy for resilience
                   in HPC has always been that the capability of a HPC system,
                   and respective return on investment, would be significantly
                   reduced. We argue that modular redundancy can significantly
                   increase compute node availability as it removes the impact
                   of scale from single compute node MTTR. We further argue that
                   single compute nodes can be much less reliable, and therefore
                   less expensive, and still be highly available, if their
                   MTTR/MTTF ratio is maintained.",
  pts           = "13981"
}
@conference{wang08proactive,
  author        = "Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Proactive Process-Level Live Migration in {HPC}
                   Environments",
  booktitle     = "Proceedings of the \href{http://sc08.supercomputing.org}
                   {$21^{st}$ IEEE/ACM International Conference on High
                   Performance Computing, Networking, Storage and Analysis (SC)
                   2008}",
  pages         = "1--12",
  month         = nov # "~15-21, ",
  year          = "2008",
  address       = "Austin, TX, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4244-2835-9",
  doi           = "10.1145/1413370.1413414",
  url           = "http://www.christian-engelmann.info/publications/wang08proactive.pdf",
  url2          = "http://www.christian-engelmann.info/publications/wang08proactive.ppt.pdf",
  abstract      = "As the number of nodes in high-performance computing
                   environments keeps increasing, faults are becoming
                   commonplace. Reactive fault tolerance (FT) often does not scale due
                   to massive I/O requirements and relies on manual job
                   resubmission. This work complements reactive with proactive
                   FT at the process level. Through health monitoring, a subset
                   of node failures can be anticipated when a node's health
                   deteriorates. A novel process-level live migration mechanism
                   supports continued execution of applications during much of
                   the process migration. This scheme is integrated into an MPI
                   execution environment to transparently sustain
                   health-inflicted node failures, which eradicates the need to
                   restart and requeue MPI jobs. Experiments indicate that 1-6.5
                   seconds of prior warning are required to successfully trigger
                   live process migration while similar operating system
                   virtualization mechanisms require 13-24 seconds. This
                   self-healing approach complements reactive FT by nearly
                   cutting the number of checkpoints in half when 70\% of the
                   faults are handled proactively.",
  pts           = "12052"
}
@conference{engelmann08symmetric,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "Symmetric Active/Active Replication for Dependent Services",
  booktitle     = "Proceedings of the
                   \href{http://www.ares-conference.eu/ares2008}{$3^{rd}$
                   International Conference on Availability, Reliability and
                   Security (ARES) 2008}",
  pages         = "260--267",
  month         = mar # "~4-7, ",
  year          = "2008",
  address       = "Barcelona, Spain",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3102-1",
  doi           = "10.1109/ARES.2008.64",
  url           = "http://www.christian-engelmann.info/publications/engelmann08symmetric.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann08symmetric.ppt.pdf",
  abstract      = "During the last several years, we have established the
                   symmetric active/active replication model for service-level
                   high availability and implemented several proof-of-concept
                   prototypes. One major deficiency of our model is its
                   inability to deal with dependent services, since its original
                   architecture is based on the client-service model. This paper
                   extends our model to dependent services using its already
                   existing mechanisms and features. The presented concept is
                   based on the idea that a service may also be a client of
                   another service, and multiple services may be clients of each
                   other. A high-level abstraction is used to illustrate
                   dependencies between clients and services, and to decompose
                   dependencies between services into respective client-service
                   dependencies. This abstraction may be used for providing
                   high availability in distributed computing systems with
                   complex service-oriented architectures.",
  pts           = "9456"
}
@conference{vallee08framework,
  author        = "Geoffroy R. Vall\'ee
                   and Kulathep Charoenpornwattana
                   and Christian Engelmann
                   and Anand Tikotekar
                   and Chokchai (Box) Leangsuksun
                   and Thomas Naughton
                   and Stephen L. Scott",
  title         = "A Framework For Proactive Fault Tolerance",
  booktitle     = "Proceedings of the
                   \href{http://www.ares-conference.eu/ares2008}{$3^{rd}$
                   International Conference on Availability, Reliability and
                   Security (ARES) 2008}",
  pages         = "659--664",
  month         = mar # "~4-7, ",
  year          = "2008",
  address       = "Barcelona, Spain",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3102-1",
  doi           = "10.1109/ARES.2008.171",
  url           = "http://www.christian-engelmann.info/publications/vallee08framework.pdf",
  url2          = "http://www.christian-engelmann.info/publications/vallee08framework.ppt.pdf",
  abstract      = "Fault tolerance is a major concern to guarantee availability
                   of critical services as well as application execution.
                   Traditional approaches for fault tolerance include
                   checkpoint/restart or duplication. However, it is also
                   possible to anticipate failures and proactively take action
                   before failures occur in order to minimize failure impact on
                   the system and application execution. This document presents
                   a proactive fault tolerance framework. This framework can use
                   different proactive fault tolerance mechanisms, i.e.
                   migration and pause/unpause. The framework also allows the
                   implementation of new proactive fault tolerance policies
                   thanks to a modular architecture. A first proactive fault
                   tolerance policy has been implemented and preliminary
                   experiments have been carried out based on system-level
                   virtualization and compared with results obtained by
                   simulation."
}
@conference{koenning08virtualized,
  author        = "Bj{\"o}rn K{\"o}nning
                   and Christian Engelmann
                   and Stephen L. Scott
                   and George A. (Al) Geist",
  title         = "Virtualized Environments for the {Harness} High Performance
                   Computing Workbench",
  booktitle     = "Proceedings of the \href{http://www.pdp2008.org}{$16^{th}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2008}",
  pages         = "133--140",
  month         = feb # "~13-15, ",
  year          = "2008",
  address       = "Toulouse, France",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3089-5",
  doi           = "10.1109/PDP.2008.14",
  url           = "http://www.christian-engelmann.info/publications/koenning08virtualized.pdf",
  url2          = "http://www.christian-engelmann.info/publications/koenning08virtualized.ppt.pdf",
  abstract      = "This paper describes recent accomplishments in providing a
                   virtualized environment concept and prototype for scientific
                   application development and deployment as part of the Harness
                   High Performance Computing (HPC) Workbench research effort.
                   The presented work focuses on tools and mechanisms that
                   simplify scientific application development and deployment
                   tasks, such that only minimal adaptation is needed when
                   moving from one HPC system to another or after HPC system
                   upgrades. The overall technical approach focuses on the
                   concept of adapting the HPC system environment to the actual
                   needs of individual scientific applications instead of the
                   traditional scheme of adapting scientific applications to
                   individual HPC system environment properties. The presented
                   prototype implementation is based on the mature and
                   lightweight chroot virtualization approach for Unix-type
                   systems with a focus on virtualized file system structure
                   and virtualized shell environment variables utilizing
                   virtualized environment configuration descriptions in
                   Extensible Markup Language (XML) format. The presented work
                   can be easily extended to other virtualization technologies,
                   such as system-level virtualization solutions using
                   hypervisors.",
  pts           = "11532"
}
@conference{vallee08system,
  author        = "Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Christian Engelmann
                   and Hong H. Ong
                   and Stephen L. Scott",
  title         = "System-level Virtualization for High Performance Computing",
  booktitle     = "Proceedings of the \href{http://www.pdp2008.org}{$16^{th}$
                   Euromicro International Conference on Parallel, Distributed,
                   and network-based Processing (PDP) 2008}",
  pages         = "636--643",
  month         = feb # "~13-15, ",
  year          = "2008",
  address       = "Toulouse, France",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3089-5",
  doi           = "10.1109/PDP.2008.85",
  url           = "http://www.christian-engelmann.info/publications/vallee08system.pdf",
  url2          = "http://www.christian-engelmann.info/publications/vallee08system.ppt.pdf",
  abstract      = "System-level virtualization has been a research topic since
                   the 1970s but regained popularity during the past few years
                   because of the availability of efficient solutions such as Xen
                   and the implementation of hardware support in commodity
                   processors (e.g. Intel-VT, AMD-V). However, a majority of
                   system-level virtualization projects is guided by the server
                   consolidation market. As a result, current virtualization
                   solutions appear not to be suitable for high-performance
                   computing (HPC), which is typically based on large-scale
                   systems. On the other hand, there is significant interest in
                   exploiting virtual machines (VMs) within HPC for a number of
                   other reasons. By virtualizing the machine, one is able to
                   run a variety of operating systems and environments as needed
                   by the applications. Virtualization allows users to isolate
                   workloads, improving security and reliability. It is also
                   possible to support non-native environments and/or legacy
                   operating environments through virtualization. In addition,
                   it is possible to balance workloads, use migration
                   techniques to relocate applications from failing machines,
                   and isolate faulty systems for repair. This document presents
                   the challenges for the implementation of a system-level
                   virtualization solution for HPC. It also presents a brief
                   survey of the different approaches and techniques to address
                   these challenges.",
  pts           = "11137"
}
@conference{ou07symmetric,
  author        = "Li Ou
                   and Christian Engelmann
                   and Xubin (Ben) He
                   and Xin Chen
                   and Stephen L. Scott",
  title         = "Symmetric Active/Active Metadata Service for Highly Available
                   Cluster Storage Systems",
  booktitle     = "Proceedings of the
                   \href{http://www.iasted.org/conferences/home-590.html}
                   {$19^{th}$ IASTED International Conference on Parallel and
                   Distributed Computing and Systems (PDCS) 2007}",
  pages         = "",
  month         = nov # "~19-21, ",
  year          = "2007",
  address       = "Cambridge, MA, USA",
  publisher     = "\href{http://www.actapress.com}{ACTA Press, Calgary, AB,
                   Canada}",
  isbn          = "978-0-88986-703-1",
  doi           = "",
  url           = "http://www.christian-engelmann.info/publications/ou07symmetric.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ou07symmetric.ppt.pdf",
  abstract      = "In a typical distributed storage system, metadata is stored
                   and managed by dedicated metadata servers. One way to improve
                   the availability of distributed storage systems is to deploy
                   multiple metadata servers. Past research focused on the
                   active/standby model, where each active server has at least
                   one redundant idle backup. However, interruption of service
                   and loss of service state may occur during a fail-over
                   depending on the replication technique used. The research in
                   this paper targets the symmetric active/active replication
                   model using multiple redundant service nodes running in
                   virtual synchrony. In this model, service node failures do
                   not cause a fail-over to a backup and there is no disruption
                   of service or loss of service state. We propose a fast
                   delivery protocol to reduce the latency of total order
                   broadcast. Our prototype implementation shows that high
                   availability of metadata servers can be achieved with an
                   acceptable performance trade-off using the active/active
                   metadata server solution.",
  pts           = "8335"
}
@conference{disaverio07distributed,
  author        = "Emanuele Di Saverio
                   and Marco Cesati
                   and Christian Di Biagio
                   and Guido Pennella
                   and Christian Engelmann",
  title         = "Distributed Real-Time Computing with {Harness}",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://pvmmpi07.lri.fr}{$14^{th}$ European PVM/MPI
                   Users' Group Meeting (EuroPVM/MPI) 2007}",
  pages         = "281--288",
  volume        = "4757",
  month         = sep # "~30 - " # oct # "~3, ",
  year          = "2007",
  address       = "Paris, France",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-540-75415-2",
  issn          = "0302-9743",
  doi           = "10.1007/978-3-540-75416-9_39",
  url           = "http://www.christian-engelmann.info/publications/disaverio07distributed.pdf",
  url2          = "http://www.christian-engelmann.info/publications/disaverio07distributed.ppt.pdf",
  abstract      = "Modern parallel and distributed computing solutions are often
                   built on a middleware software layer providing a higher
                   and common level of service between computational nodes.
                   Harness is an adaptable, plugin-based middleware framework
                   for parallel and distributed computing. This paper reports
                   recent research and development results of using Harness for
                   real-time distributed computing applications in the context
                   of an industrial environment with the need to perform
                   several safety-critical tasks. The presented work exploits
                   the modular architecture of Harness in conjunction with a
                   lightweight threaded implementation to resolve several
                   real-time issues by adding three new Harness plug-ins to
                   provide a prioritized lightweight execution environment, low
                   latency communication facilities, and local timestamped event
                   logging.",
  pts           = "7023"
}
@conference{ou07fast,
  author        = "Li Ou
                   and Xubin (Ben) He
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "A Fast Delivery Protocol for Total Order Broadcasting",
  booktitle     = "Proceedings of the \href{http://www.icccn.org/icccn07}
                   {$16^{th}$ IEEE International Conference on Computer
                   Communications and Networks (ICCCN) 2007}",
  pages         = "730--734",
  month         = aug # "~13-16, ",
  year          = "2007",
  address       = "Honolulu, HI, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-42441-251-8",
  issn          = "1095-2055",
  doi           = "10.1109/ICCCN.2007.4317904",
  url           = "http://www.christian-engelmann.info/publications/ou07fast.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ou07fast.ppt.pdf",
  abstract      = "Sequencer, privilege-based, and communication history
                   algorithms are popular approaches to implement total
                   ordering, where communication history algorithms are most
                   suitable for parallel computing systems, because they provide
                   the best performance under heavy workload. Unfortunately,
                   post-transmission delay of communication history algorithms
                   is most apparent when a system is idle. In this paper, we
                   propose a fast delivery protocol to reduce the latency of
                   message ordering. The protocol optimizes the total ordering
                   process by waiting for messages only from a subset of the
                   machines in the group, and by fast acknowledging messages on
                   behalf of other machines. Our test results indicate that the
                   fast delivery protocol is suitable for both idle and heavy
                   load systems, while reducing the latency of message
                   ordering.",
  pts           = "6926"
}
@conference{nagarajan07proactive,
  author        = "Arun B. Nagarajan
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Proactive Fault Tolerance for {HPC} with {Xen}
                   Virtualization",
  booktitle     = "Proceedings of the \href{http://ics07.ac.upc.edu}{$21^{st}$
                   ACM International Conference on Supercomputing (ICS) 2007}",
  pages         = "23--32",
  month         = jun # "~16-20, ",
  year          = "2007",
  address       = "Seattle, WA, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-59593-768-1",
  doi           = "10.1145/1274971.1274978",
  url           = "http://www.christian-engelmann.info/publications/nagarajan07proactive.pdf",
  url2          = "http://www.christian-engelmann.info/publications/nagarajan07proactive.ppt.pdf",
  abstract      = "Large-scale parallel computing is relying increasingly on
                   clusters with thousands of processors. At such large counts
                   of compute nodes, faults are becoming common place. Current
                   techniques to tolerate faults focus on reactive schemes to
                   recover from faults and generally rely on a
                   checkpoint/restart mechanism. Yet, in today's systems, node
                   failures can often be anticipated by detecting a
                   deteriorating health status. Instead of a reactive scheme for
                   fault tolerance (FT), we are promoting a proactive one where
                   processes automatically migrate from unhealthy nodes to
                   healthy ones. Our approach relies on operating system
                   virtualization techniques exemplified by but not limited to
                   Xen. This paper contributes an automatic and transparent
                   mechanism for proactive FT for arbitrary MPI applications.
                   It leverages virtualization techniques combined with health
                   monitoring and load-based migration. We exploit Xen's live
                   migration mechanism for a guest operating system (OS) to
                   migrate an MPI task from a health-deteriorating node to a
                   healthy one without stopping the MPI task during most of the
                   migration. Our proactive FT daemon orchestrates the tasks of
                   health monitoring, load determination and initiation of guest
                   OS migration. Experimental results demonstrate that live
                   migration hides migration costs and limits the overhead to
                   only a few seconds making it an attractive approach to
                   realize FT in HPC systems. Overall, our enhancements make
                   proactive FT a valuable asset for long-running MPI
                   applications that is complementary to reactive FT using full
                   checkpoint/restart schemes since checkpoint frequencies can
                   be reduced as fewer unanticipated failures are encountered.
                   In the context of OS virtualization, we believe that this is
                   the first comprehensive study of proactive fault tolerance
                   where live migration is actually triggered by health
                   monitoring.",
  pts           = "6489"
}
@conference{engelmann07programming,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "On Programming Models for Service-Level High Availability",
  booktitle     = "Proceedings of the
                   \href{http://www.ares-conference.eu/ares2007}{$2^{nd}$
                   International Conference on Availability, Reliability and
                   Security (ARES) 2007}",
  pages         = "999--1006",
  month         = apr # "~10-13, ",
  year          = "2007",
  address       = "Vienna, Austria",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "0-7695-2775-2",
  doi           = "10.1109/ARES.2007.109",
  url           = "http://www.christian-engelmann.info/publications/engelmann07programming.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann07programming.ppt.pdf",
  abstract      = "This paper provides an overview of existing programming
                   models for service-level high availability and investigates
                   their differences, similarities, advantages, and
                   disadvantages. Its goal is to help improve reuse of code
                   and to allow adaptation to quality of service requirements by
                   using a uniform programming model description. It further
                   aims at encouraging a discussion about these programming
                   models and their provided quality of service, such as
                   availability, performance, serviceability, usability, and
                   applicability. Within this context, the presented research
                   focuses on providing high availability for services running
                   on head and service nodes of high-performance computing
                   systems.",
  pts           = "5078"
}
@conference{wang07job,
  author        = "Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "A Job Pause Service under {LAM/MPI+BLCR} for Transparent
                   Fault Tolerance",
  booktitle     = "Proceedings of the \href{http://www.ipdps.org/ipdps2007}
                   {$21^{st}$ IEEE International Parallel and Distributed
                   Processing Symposium (IPDPS) 2007}",
  pages         = "1--10",
  month         = mar # "~26-30, ",
  year          = "2007",
  address       = "Long Beach, CA, USA",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-59593-768-1",
  doi           = "10.1109/IPDPS.2007.370307",
  url           = "http://www.christian-engelmann.info/publications/wang07job.pdf",
  url2          = "http://www.christian-engelmann.info/publications/wang07job.ppt.pdf",
  abstract      = "Checkpoint/restart (C/R) has become a requirement for
                   long-running jobs in large-scale clusters due to a
                   mean-time-to-failure (MTTF) on the order of hours. After a
                   failure, C/R mechanisms generally require a complete restart
                   of an MPI job from the last checkpoint. A complete restart,
                   however, is unnecessary since all but one node are typically
                   still alive. Furthermore, a restart may result in lengthy job
                   requeuing even though the original job had not exceeded its
                   time quantum. In this paper, we overcome these shortcomings.
                   Instead of job restart, we have developed a transparent
                   mechanism for job pause within LAM/MPI+BLCR. This mechanism
                   allows live nodes to remain active and roll back to the last
                   checkpoint while failed nodes are dynamically replaced by
                   spares before resuming from the last checkpoint. Our
                   methodology includes LAM/MPI enhancements in support of
                   scalable group communication with a fluctuating number of
                   nodes, reuse of network connections, transparent coordinated
                   checkpoint scheduling and a BLCR enhancement for job pause.
                   Experiments in a cluster with the NAS Parallel Benchmark
                   suite show that our overhead for job pause is comparable to
                   that of a complete job restart. A minimal overhead of 5.6\%
                   is incurred only when migration takes place, while the
                   regular checkpoint overhead remains unchanged. Yet, our
                   approach avoids the need to reboot the LAM run-time
                   environment, which accounts for considerable overhead and
                   results in net savings for our scheme in the experiments.
                   Our solution further provides full transparency and
                   automation with the additional benefit of reusing existing
                   resources. Execution continues after failures within the
                   scheduled job, \textit{i.e.}, the application staging
                   overhead is not incurred again, in contrast to a restart.
                   Our scheme offers additional potential for savings through
                   incremental checkpointing and proactive diskless live
                   migration, which we are currently working on.",
  pts           = "4944"
}
@conference{uhlemann06joshua,
  author        = "Kai Uhlemann
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "{JOSHUA}: {S}ymmetric Active/Active Replication for Highly
                   Available {HPC} Job and Resource Management",
  booktitle     = "Proceedings of the \href{http://cluster2006.org}{$8^{th}$
                   IEEE International Conference on Cluster Computing (Cluster)
                   2006}",
  pages         = "1-10",
  month         = sep # "~25-28, ",
  year          = "2006",
  address       = "Barcelona, Spain",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "1-4244-0328-6",
  issn          = "1552-5244",
  doi           = "10.1109/CLUSTR.2006.311855",
  url           = "http://www.christian-engelmann.info/publications/uhlemann06joshua.pdf",
  url2          = "http://www.christian-engelmann.info/publications/uhlemann06joshua.ppt.pdf",
  abstract      = "Most of today`s HPC systems employ a single head node for
                   control, which represents a single point of failure as it
                   interrupts an entire HPC system upon failure. Furthermore, it
                   is also a single point of control as it disables an entire
                   HPC system until repair. One of the most important HPC system
                   services running on the head node is job and resource
                   management. If it goes down, all currently running jobs lose
                   the service they report back to. They have to be restarted
                   once the head node is up and running again. With this paper,
                   we present a generic approach for providing symmetric
                   active/active replication for highly available HPC job and
                   resource management. The JOSHUA solution provides a virtually
                   synchronous environment for continuous availability without
                   any interruption of service and without any loss of state.
                   Replication is performed externally via the PBS service
                   interface without the need to modify any service code. Test
                   results as well as availability analysis of our
                   proof-of-concept prototype implementation show that
                   continuous availability can be provided by JOSHUA with an
                   acceptable performance trade-off.",
  pts           = "2631"
}
@conference{baumann06parallel,
  author        = "Ronald Baumann
                   and Christian Engelmann
                   and George A. (Al) Geist",
  title         = "A Parallel Plug-in Programming Paradigm",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://hpcc06.lrr.in.tum.de}{$7^{th}$ International
                   Conference on High Performance Computing and Communications
                   (HPCC) 2006}",
  volume        = "4208",
  pages         = "823--832",
  month         = sep # "~13-15, ",
  year          = "2006",
  address       = "Munich, Germany",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-540-39368-9",
  issn          = "0302-9743",
  doi           = "10.1007/11847366_85",
  url           = "http://www.christian-engelmann.info/publications/baumann06parallel.pdf",
  url2          = "http://www.christian-engelmann.info/publications/baumann06parallel.ppt.pdf",
  abstract      = "Software component architectures allow assembly of
                   applications from individual software modules based on
                   clearly defined programming interfaces, thus improving the
                   reuse of existing solutions and simplifying application
                   development. Furthermore, the plug-in programming paradigm
                   enables runtime reconfigurability, making it
                   possible to adapt to changing application needs, such as
                   different application phases, and system properties, like
                   resource availability, by loading/unloading appropriate
                   software modules. Similar to parallel programs, parallel
                   plug-ins are an abstraction for a set of cooperating
                   individual plug-ins within a parallel application utilizing
                   a software component architecture. Parallel programming
                   paradigms apply to parallel plug-ins in the same way they
                   apply to parallel programs. The research presented in this
                   paper targets the clear definition of parallel plug-ins and
                   the development of a parallel plug-in programming paradigm.",
  pts           = "2413"
}
@conference{varma06scalable,
  author        = "Jyothish Varma
                   and Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Scalable, Fault-Tolerant Membership for {MPI} Tasks on {HPC}
                   Systems",
  booktitle     = "Proceedings of the \href{http://www.ics-conference.org/2006}
                   {$20^{th}$ ACM International Conference on Supercomputing
                   (ICS) 2006}",
  pages         = "219--228",
  month         = jun # "~28-30, ",
  year          = "2006",
  address       = "Cairns, Australia",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  doi           = "10.1145/1183401.1183433",
  isbn          = "1-59593-282-8",
  url           = "http://www.christian-engelmann.info/publications/varma06scalable.pdf",
  url2          = "http://www.christian-engelmann.info/publications/varma06scalable.ppt.pdf",
  abstract      = "Reliability is increasingly becoming a challenge for
                   high-performance computing (HPC) systems with thousands of
                   nodes, such as IBM's Blue Gene/L. A shorter
                   mean-time-to-failure can be addressed by adding fault
                   tolerance to reconfigure working nodes to ensure that
                   communication and computation can progress. However, existing
                   approaches fall short in providing scalability and small
                   reconfiguration overhead within the fault-tolerant layer.
                   This paper contributes a scalable approach to reconfigure the
                   communication infrastructure after node failures. We propose
                   a decentralized (peer-to-peer) protocol that maintains a
                   consistent view of active nodes in the presence of faults.
                   Our protocol shows response times on the order of hundreds of
                   microseconds and single-digit milliseconds for 
                   reconfiguration using MPI over Blue Gene/L and TCP over 
                   Gigabit, respectively. The protocol can be adapted to match
                   the network topology to further increase performance. We also
                   verify experimental results against a performance model,
                   which demonstrates the scalability of the approach. Hence,
                   the membership service is suitable for deployment in the
                   communication layer of MPI runtime systems, and we have
                   integrated an early version into LAM/MPI.",
  pts           = "2105"
}
@conference{okunbor06exploring,
  author        = "Daniel I. Okunbor
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Exploring Process Groups for Reliability, Availability and
                   Serviceability of Terascale Computing Systems",
  booktitle     = "Proceedings of the
                   \href{http://www.atiner.gr/docs/2006AAAPROGRAM_COMP.htm}
                   {$2^{nd}$ International Conference on Computer Science and
                   Information Systems 2006}",
  month         = jun # "~19-21, ",
  year          = "2006",
  address       = "Athens, Greece",
  url           = "http://www.christian-engelmann.info/publications/okunbor06exploring.pdf",
  abstract      = "This paper presents various aspects of reliability,
                   availability and serviceability (RAS) systems as they relate
                   to group communication service, including reliable and total
                   order multicast/broadcast, virtual synchrony, and failure
                   detection. While the issue of availability, particularly
                   high availability using replication-based architectures, has
                   recently received an upsurge of research interest, much still has
                   to be done in understanding the basic underlying concepts for
                   achieving RAS systems, especially in high-end and high
                   performance computing (HPC) communities. Various attributes
                   of group communication service and the prototype of symmetric
                   active replication following ideas utilized in the Newtop
                   protocol will be discussed. We explore the application of
                   group communication service for RAS HPC, laying the
                   groundwork for its integrated model.",
  pts           = "3778"
}
@conference{limaye05jobsite,
  author        = "Kshitij Limaye
                   and Chokchai (Box) Leangsuksun
                   and Zeno Greenwood
                   and Stephen L. Scott
                   and Christian Engelmann
                   and Richard M. Libby
                   and Kasidit Chanchio",
  title         = "Job-Site Level Fault Tolerance for Cluster and {Grid}
                   Environments",
  booktitle     = "Proceedings of the \href{http://cluster2005.org}{$7^{th}$
                   IEEE International Conference on Cluster Computing (Cluster)
                   2005}",
  pages         = "1--9",
  month         = sep # "~26-30, ",
  year          = "2005",
  address       = "Boston, MA, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "0-7803-9486-0",
  issn          = "1552-5244",
  doi           = "10.1109/CLUSTR.2005.347043",
  url           = "http://www.christian-engelmann.info/publications/limaye05job-site.pdf",
  abstract      = "In order to adopt high performance clusters and Grid
                   computing for mission critical applications, fault tolerance
                   is a necessity. Common fault tolerance techniques in
                   distributed systems are normally achieved with
                   checkpoint-recovery and job replication on alternative
                   resources, in cases of a system outage. The first approach
                   depends on the system's MTTR, while the latter approach
                   depends on the availability of alternative sites to run
                   replicas. There is a need for complementing these approaches
                   by proactively handling failures at a job-site level,
                   ensuring high system availability with no loss of
                   user-submitted jobs. This paper discusses a novel fault tolerance
                   technique that enables job-site recovery in Beowulf
                   cluster-based grid environments, whereas existing techniques
                   give up a failed system by seeking alternative resources.
                   Our results suggest a sizable aggregate performance improvement
                   from an implementation of our method in Globus-enabled
                   HA-OSCAR. The technique called Smart Failover provides a
                   transparent and graceful recovery mechanism that saves job
                   states in a local job-manager queue and transfers those
                   states to the backup server periodically, and in critical
                   system events. Thus whenever a failover occurs, the backup
                   server is able to restart the jobs from their last saved
                   state."
}
@conference{song05umlbased,
  author        = "Hertong Song
                   and Chokchai (Box) Leangsuksun
                   and Raja Nassar
                   and Yudan Liu
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "{UML-based} {Beowulf} Cluster Availability Modeling",
  booktitle     = "\href{http://www.world-academy-of-science.org/IMCSE2005/ws/SERP}
                   {International Conference on Software Engineering Research
                   and Practice (SERP) 2005}",
  pages         = "161--167",
  month         = jun # "~27-30, ",
  year          = "2005",
  address       = "Las Vegas, NV, USA",
  publisher     = "CSREA Press",
  isbn          = "1-932415-49-1"
}
@conference{engelmann05superscalable,
  author        = "Christian Engelmann
                   and George A. (Al) Geist",
  title         = "Super-Scalable Algorithms for Computing on 100,000
                   Processors",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://www.iccs-meeting.org/iccs2005}{$5^{th}$
                   International Conference on Computational Science (ICCS)
                   2005}, Part I",
  volume        = "3514",
  pages         = "313--320",
  month         = may # "~22-25, ",
  year          = "2005",
  address       = "Atlanta, GA, USA",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-540-26032-5",
  issn          = "0302-9743",
  doi           = "10.1007/11428831_39",
  url           = "http://www.christian-engelmann.info/publications/engelmann05superscalable.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann05superscalable.ppt.pdf",
  abstract      = "In the next five years, the number of processors in high-end
                   systems for scientific computing is expected to rise to tens
                   and even hundreds of thousands. For example, the IBM Blue
                   Gene/L can have up to 128,000 processors and the delivery of
                   the first system is scheduled for 2005. Existing deficiencies
                   in scalability and fault-tolerance of scientific applications
                   need to be addressed soon. If the number of processors grows
                   by an order of magnitude and efficiency drops by an order of
                   magnitude, the overall effective computing performance stays
                   the same.
                   Furthermore, the mean time to interrupt of high-end computer
                   systems decreases with scale and complexity. In a
                   100,000-processor system, failures may occur every couple of
                   minutes and traditional checkpointing may no longer be
                   feasible. With this paper, we summarize our recent research
                   in super-scalable algorithms for computing on 100,000
                   processors. We introduce the algorithm properties of scale
                   invariance and natural fault tolerance, and discuss how they
                   can be applied to two different classes of algorithms. We
                   also describe a super-scalable diskless checkpointing
                   algorithm for problems that cannot be transformed into a
                   super-scalable variant, or where other solutions are more
                   efficient. Finally, a 100,000-processor simulator is
                   presented as a platform for testing and experimentation."
}
@conference{brim24microservices,
  author        = "Michael J. Brim
                   and Lance Drane
                   and Marshall McDonnell
                   and Christian Engelmann
                   and Addi Malviya Thakur",
  title         = "A Microservices Architecture Toolkit for Interconnected
                   Science Ecosystems",
  booktitle     = "Proceedings of the \href{http://sc24.supercomputing.org}
                   {$37^{th}$ International Conference on High Performance
                   Computing, Networking, Storage and Analysis (SC) Workshops
                   2024}: \href{https://works-workshop.org/}
                   {$19^{th}$ Workshop on Workflows in Support of Large-Scale
                    Science (WORKS) 2024}",
  pages         = "",
  month         = nov # "~18, ",
  year          = "2024",
  address       = "Atlanta, GA, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "",
  doi           = "",
  url           = "",
  url2          = "",
  abstract      = "Microservices architecture is a promising approach for
                   developing reusable scientific workflow capabilities for
                   integrating diverse resources, such as experimental and
                   observational instruments and advanced computational and data
                   management systems, across many distributed organizations and
                   facilities. In this paper, we describe how the INTERSECT Open
                   Architecture leverages federated systems of microservices to
                   construct interconnected science ecosystems, review how the
                   INTERSECT software development kit eases microservice
                   capability development, and demonstrate the use of such
                   capabilities for deploying an example multi-facility
                   INTERSECT ecosystem.",
  pts           = "",
  note          = "To appear"
}
@conference{kumar21rdpm,
  author        = "Mohit Kumar
                   and Christian Engelmann",
  title         = "{RDPM}: An Extensible Tool for Resilience Design Patterns
                   Modeling",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{https://2021.euro-par.org}{$27^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2021 Workshops}:
                   \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2021}
                   {$14^{th}$ Workshop on Resiliency in High Performance
                   Computing (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "13098",
  pages         = "283--297",
  month         = aug # "~30, ",
  year          = "2021",
  address       = "Lisbon, Portugal",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-031-06155-4",
  doi           = "10.1007/978-3-031-06156-1_23",
  url           = "http://www.christian-engelmann.info/publications/kumar21rdpm.pdf",
  url2          = "",
  abstract      = "Resilience to faults, errors, and failures in extreme-scale
                   HPC systems is a critical challenge. Resilience design
                   patterns offer a new, structured hardware and software design
                   approach for improving resilience. While prior work focused
                   on developing performance, reliability, and availability
                   models for resilience design patterns, this paper extends it
                   by providing a Resilience Design Patterns Modeling (RDPM)
                   tool which allows (1) exploring performance, reliability,
                   and availability of each resilience design pattern, (2)
                   offering customization of parameters to optimize performance,
                   reliability, and availability, and (3) allowing
                   investigation of trade-off models for combining multiple
                   patterns for practical resilience solutions.",
  pts           = "161085"
}
@conference{kumar20models,
  author        = "Mohit Kumar
                   and Christian Engelmann",
  title         = "Models for Resilience Design Patterns",
  booktitle     = "Proceedings of the \href{http://sc20.supercomputing.org}
                   {$33^{rd}$ International Conference on High Performance
                   Computing, Networking, Storage and Analysis (SC) Workshops
                   2020}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2020}
                   {$10^{th}$ Workshop on Fault Tolerance for HPC at eXtreme
                   Scale (FTXS) 2020}",
  pages         = "21-30",
  month         = nov # "~11, ",
  year          = "2020",
  address       = "Atlanta, GA, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7381-1080-6",
  doi           = "10.1109/FTXS51974.2020.00008",
  url           = "http://www.christian-engelmann.info/publications/kumar20models.pdf",
  url2          = "http://www.christian-engelmann.info/publications/kumar20models.ppt.pdf",
  abstract      = "Resilience plays an important role in supercomputers by
                   providing correct and efficient operation in case of faults,
                   errors, and failures. Resilience design patterns offer
                   blueprints for effectively applying resilience technologies.
                   Prior work focused on developing initial efficiency and
                   performance models for resilience design patterns. This paper
                   extends it by (1) describing performance, reliability, and
                   availability models for all structural resilience design
                   patterns, (2) providing more detailed models that include
                   flowcharts and state diagrams, and (3) introducing the
                   Resilience Design Pattern Modeling (RDPM) tool that
                   calculates and plots the performance, reliability, and
                   availability metrics of individual patterns and pattern
                   combinations.",
  pts           = "148010"
}
@conference{sao19self-stabilizing,
  author        = "Piyush Sao
                   and Christian Engelmann
                   and Srinivas Eswar
                   and Oded Green
                   and Richard Vuduc",
  title         = "Self-stabilizing Connected Components",
  booktitle     = "Proceedings of the \href{http://sc19.supercomputing.org}
                   {$32^{nd}$ International Conference on High Performance
                   Computing, Networking, Storage and Analysis (SC) Workshops
                   2019}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2019}
                   {$9^{th}$ Workshop on Fault Tolerance for HPC at eXtreme
                   Scale (FTXS) 2019}",
  pages         = "50--59",
  month         = nov # "~22, ",
  year          = "2019",
  address       = "Denver, CO, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-7281-6013-9",
  doi           = "10.1109/FTXS49593.2019.00011",
  url           = "http://www.christian-engelmann.info/publications/sao19self-stabilizing.pdf",
  url2          = "http://www.christian-engelmann.info/publications/sao19self-stabilizing.ppt.pdf",
  abstract      = "For the problem of computing the connected components of a
                   graph, this paper considers the design of algorithms that are
                   resilient to transient hardware faults, like bit flips. More
                   specifically, it applies the technique of
                   \emph{self-stabilization}. A system is self-stabilizing if,
                   when starting from a valid or invalid state, it is guaranteed
                   to reach a valid state after a finite number of steps.
                   Therefore on a machine subject to a transient fault, a
                   self-stabilizing algorithm could recover if that fault caused
                   the system to enter an invalid state.
                   We give a comprehensive analysis of the valid and invalid
                   states during label propagation and derive algorithms to
                   verify and correct the invalid state. The self-stabilizing
                   label-propagation algorithm performs $O(V \log V)$
                   additional computation and requires $O(V)$ additional
                   storage over its conventional counterpart (and, as such,
                   does not increase the asymptotic complexity).
                   When run against a battery of simulated fault injection
                   tests, the self-stabilizing label propagation algorithm
                   exhibits more resilient behavior than a triple modular
                   redundancy (TMR) based fault-tolerant algorithm in 80\% of
                   cases. From a performance perspective, it also outperforms
                   TMR as it requires fewer iterations in total. Beyond the
                   fault-tolerance properties of self-stabilizing
                   label-propagation, we believe it is useful from a
                   theoretical perspective and may have other use cases.",
  pts           = "135067"
}
@conference{engelmann19concepts,
  author        = "Christian Engelmann
                   and Geoffroy R. Vall\'ee
                   and Swaroop Pophale",
  title         = "Concepts for {OpenMP} Target Offload Resilience",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://parallel.auckland.ac.nz/iwomp2019}
                   {$15^{th}$ International Workshop on OpenMP (IWOMP) 2019}",
  volume        = "11718",
  pages         = "78--93",
  month         = sep # "~11-13, ",
  year          = "2019",
  address       = "Auckland, New Zealand",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-030-28595-1",
  doi           = "10.1007/978-3-030-28596-8_6",
  url           = "http://www.christian-engelmann.info/publications/engelmann19concepts.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann19concepts.ppt.pdf",
  abstract      = "Recent reliability issues with one of the fastest
                   supercomputers in the world, Titan at Oak Ridge National
                   Laboratory, demonstrated the need for resilience in
                   large-scale heterogeneous computing. OpenMP currently does
                   not address error and failure behavior. This paper takes a
                   first step toward resilience for heterogeneous systems by
                   providing the concepts for resilient OpenMP offload to
                   devices. Using real-world error and failure observations,
                   the paper describes the concepts and terminology for
                   resilient OpenMP target offload, including error and failure
                   classes and resilience strategies. It details the
                   general-purpose computing on graphics processing units
                   errors and failures experienced in Titan. It further proposes
                   improvements in OpenMP, including a preliminary prototype
                   design, to support resilient offload to devices for
                   efficient handling of errors and failures in heterogeneous
                   high-performance computing systems.",
  pts           = "127338"
}
@conference{hui18comprehensive2,
  author        = "Yawei Hui
                   and Byung Hoon (Hoony) Park
                   and Christian Engelmann",
  title         = "A Comprehensive Informative Metric for Analyzing {HPC} System
                   Status using the {LogSCAN} Platform",
  booktitle     = "Proceedings of the \href{http://sc18.supercomputing.org}
                   {$31^{st}$ International Conference on High Performance
                   Computing, Networking, Storage and Analysis (SC) Workshops
                   2018}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2018}
                   {$8^{th}$ Workshop on Fault Tolerance for HPC at eXtreme
                   Scale (FTXS) 2018}",
  pages         = "29--38",
  month         = nov # "~16, ",
  year          = "2018",
  address       = "Dallas, TX, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-7281-0222-1",
  doi           = "10.1109/FTXS.2018.00007",
  url           = "http://www.christian-engelmann.info/publications/hui18comprehensive2.pdf",
  url2          = "http://www.christian-engelmann.info/publications/hui18comprehensive2.ppt.pdf",
  abstract      = "Log processing by Spark and Cassandra-based ANalytics
                   (LogSCAN) is a newly developed analytical platform that
                   provides flexible and scalable data gathering, transformation
                   and computation. One major challenge is to effectively
                   summarize the status of a complex computer system, such as
                   the Titan supercomputer at the Oak Ridge Leadership Computing
                   Facility (OLCF). Although there is plenty of operational and
                   maintenance information collected and stored in real time,
                   which may yield insights about short- and long-term system
                   status, it is difficult to present this information in a
                   comprehensive form. In this work, we present system
                   information entropy (SIE), a newly developed metric that
                   leverages the power of traditional machine learning
                   techniques and information theory. By compressing the
                   multi-variant multi-dimensional event information recorded
                   during the operation of the targeted system into a single
                   time series of SIE, we demonstrate that the historical
                   system status can be represented sensitively, concisely, and
                   comprehensively. With SIE as a sharp indicator, we argue
                   that follow-up analytics based on SIE will reveal in-depth
                   knowledge about system status using other sophisticated
                   approaches, such as pattern recognition in the temporal
                   domain or causality analysis incorporating extra
                   independent metrics of the system.",
  pts           = "119248"
}
@conference{ashraf18analyzing,
  author        = "Rizwan Ashraf
                   and Christian Engelmann",
  title         = "Analyzing the Impact of System Reliability Events on
                   Applications in the {Titan} Supercomputer",
  booktitle     = "Proceedings of the \href{http://sc18.supercomputing.org}
                   {$31^{st}$ International Conference on High Performance
                   Computing, Networking, Storage and Analysis (SC) Workshops
                   2018}: \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2018}
                   {$8^{th}$ Workshop on Fault Tolerance for HPC at eXtreme
                   Scale (FTXS) 2018}",
  pages         = "39--48",
  month         = nov # "~16, ",
  year          = "2018",
  address       = "Dallas, TX, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-7281-0222-1",
  doi           = "10.1109/FTXS.2018.00008",
  url           = "http://www.christian-engelmann.info/publications/ashraf18analyzing.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ashraf18analyzing.ppt.pdf",
  abstract      = "Extreme-scale computing systems employ Reliability,
                   Availability and Serviceability (RAS) mechanisms and
                   infrastructure to log events from multiple system components.
                   In this paper, we analyze RAS logs in conjunction with the 
                   application placement and scheduling database, in order to 
                   understand the impact of common RAS events on application
                   performance. This study, conducted on the records of about 2
                   million applications executed on the Titan supercomputer,
                   provides important insights for system users, operators and
                   computer science researchers. In this paper, we investigate
                   the impact of RAS events on application performance and its
                   variability by comparing cases where events are recorded
                   with corresponding cases where no events are recorded. Such
                   a statistical investigation is possible since we observed
                   that system users tend to execute their applications
                   multiple times. Our analysis reveals that most RAS events
                   do impact application performance, although not always. We
                   also find that different system components affect
                   application performance differently. In particular, our
                   investigation includes the following components: parallel
                   file system, processor, memory, graphics processing units,
                   system and user software issues. Our work establishes the
                   importance of providing feedback to system users for
                   increasing operational efficiency of extreme-scale systems.",
  pts           = "119070"
}
@conference{park18big,
  author        = "Byung Hoon (Hoony) Park
                   and Yawei Hui
                   and Swen Boehm
                   and Rizwan Ashraf
                   and Christian Engelmann
                   and Christopher Layton",
  title         = "A {Big Data} Analytics Framework for {HPC} Log Data: {Three}
                   Case Studies Using the {Titan} Supercomputer Log",
  booktitle     = "Proceedings of the \href{https://cluster2018.github.io}
                   {$19^{th}$ IEEE International Conference on Cluster Computing
                   (Cluster) 2018}:
                   \href{https://sites.google.com/site/hpcmaspa2018}
                   {$5^{th}$ Workshop on Monitoring and Analysis for High
                   Performance Systems Plus Applications (HPCMASPA) 2018}",
  pages         = "571--579",
  month         = sep # "~10, ",
  year          = "2018",
  address       = "Belfast, UK",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-5386-8319-4",
  issn          = "2168-9253",
  doi           = "10.1109/CLUSTER.2018.00073",
  url           = "http://www.christian-engelmann.info/publications/park18big.pdf",
  url2          = "http://www.christian-engelmann.info/publications/park18big.ppt.pdf",
  abstract      = "Reliability, availability and serviceability (RAS) logs of
                   high performance computing (HPC) resources, when closely
                   investigated in spatial and temporal dimensions, can provide
                   invaluable information regarding system status, performance,
                   and resource utilization. These data are often generated from
                   multiple logging systems and sensors that cover many
                   components of the system. The analysis of these data for
                   finding persistent temporal and spatial insights faces two
                   main difficulties: the volume of RAS logs makes manual
                   inspection difficult and the unstructured nature and unique
                   properties of log data produced by each subsystem adds
                   another dimension of difficulty in identifying implicit
                   correlation among recorded events. To address these issues,
                   we recently developed a multi-user Big Data analytics
                   framework for HPC log data at Oak Ridge National Laboratory
                   (ORNL). This paper introduces three in-progress data
                   analytics projects that leverage this framework to assess
                   system status, mine event patterns, and study correlations
                   between user applications and system events. We describe the
                   motivation of each project and detail their workflows using
                   three years of log data collected from ORNL's Titan
                   supercomputer.",
  pts           = "112964"
}
@conference{ashraf18performance,
  author        = "Rizwan Ashraf
                   and Christian Engelmann",
  title         = "Performance Efficient Multiresilience using Checkpoint
                   Recovery in Iterative Algorithms",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{https://europar2018.org}{$24^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2018 Workshops}:
                   \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2018}
                   {$11^{th}$ Workshop on Resiliency in High Performance
                   Computing (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "11339",
  pages         = "813--825",
  month         = aug # "~28, ",
  year          = "2018",
  address       = "Turin, Italy",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-030-10549-5",
  doi           = "10.1007/978-3-030-10549-5_63",
  url           = "http://www.christian-engelmann.info/publications/ashraf18performance.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ashraf18performance.ppt.pdf",
  abstract      = "In this paper, we address the design challenge of building
                   multiresilient iterative high-performance computing (HPC)
                   applications. Multiresilience in HPC applications is the
                   ability to tolerate and maintain forward progress in the
                   presence of both soft errors and process failures. We address
                   the challenge by proposing performance models which are
                   useful to design performance efficient and resilient
                   iterative applications. The models consider the interaction
                   between soft error and process failure resilience solutions.
                   We experimented with a linear solver application with two
                   distinct kinds of soft error detectors: one detector is high
                   overhead and high accuracy, whereas the second is low
                   overhead and low accuracy. We show how both can be leveraged
                   for verifying the integrity of checkpointed state used to
                   recover from both soft errors and process failures. Our
                   results show the performance efficiency and resiliency
                   benefit of employing the low overhead detector with high
                   frequency within the checkpoint interval, so that timely
                   soft error recovery can take place, resulting in less
                   re-computed work.",
  pts           = "112980"
}
@conference{park17big,
  author        = "Byung Hoon (Hoony) Park
                   and Saurabh Hukerikar
                   and Christian Engelmann
                   and Ryan Adamson",
  title         = "Big Data Meets {HPC} Log Analytics: {Scalable} Approach to
                   Understanding Systems at Extreme Scale",
  booktitle     = "Proceedings of the \href{https://cluster17.github.io}
                   {$18^{th}$ IEEE International Conference on Cluster Computing
                   (Cluster) 2017}:
                   \href{https://sites.google.com/site/hpcmaspa2017}
                   {$4^{th}$ Workshop on Monitoring and Analysis for High
                   Performance Systems Plus Applications (HPCMASPA) 2017}",
  pages         = "758--765",
  month         = sep # "~5, ",
  year          = "2017",
  address       = "Honolulu, HI, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-5386-2327-5",
  issn          = "2168-9253",
  doi           = "10.1109/CLUSTER.2017.113",
  url           = "http://www.christian-engelmann.info/publications/park17big.pdf",
  url2          = "http://www.christian-engelmann.info/publications/park17big.ppt.pdf",
  abstract      = "Today's high-performance computing (HPC) systems are heavily
                   instrumented generating logs containing information about
                   abnormal events, such as critical conditions, faults, errors
                   and failures, system resource utilization, and about the
                   resource usage of user applications. These logs, once fully
                   analyzed and correlated, can produce detailed information
                   about the system health, root causes of failures, and
                   an application's interactions with the system,
                   providing invaluable insights to domain scientists and
                   system administrators. However, processing HPC logs
                   requires deep understanding of hardware and software
                   components at multiple layers of the system stack.
                   Moreover, most log data is unstructured and voluminous,
                   making it more difficult for scientists and engineers to
                   analyze the data. With rapid increases in the scale and
                   complexity of HPC systems, log data processing is becoming
                   a big data challenge. This paper introduces an HPC log data
                   analytics framework that is based on a distributed NoSQL
                   database technology, which provides scalability and high
                   availability, and Apache Spark for rapid in-memory
                   processing of log data. The framework enables the
                   extraction of a range of information about the system so
                   that system administrators and end users alike can obtain
                   necessary insights for their specific needs. We describe
                   our experience with using this framework to glean insights
                   from the log data derived from the Titan supercomputer at
                   the Oak Ridge National Laboratory.",
  pts           = "100681"
}
@conference{hukerikar17pattern-based,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Pattern-based Modeling of High-Performance Computing
                   Resilience",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://europar2017.usc.es}{$23^{rd}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2017 Workshops}:
                   \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2017}
                   {$10^{th}$ Workshop on Resiliency in High Performance
                   Computing (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "10659",
  pages         = "557--568",
  month         = aug # "~29, ",
  year          = "2017",
  address       = "Santiago de Compostela, Spain",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-319-75177-1",
  doi           = "10.1007/978-3-319-75178-8_45",
  url           = "http://www.christian-engelmann.info/publications/hukerikar17pattern-based.pdf",
  url2          = "http://www.christian-engelmann.info/publications/hukerikar17pattern-based.ppt.pdf",
  abstract      = "The design of supercomputing systems and their applications
                   must consider resilience and power consumption as key
                   design parameters when aiming to achieve higher
                   performance. In previous work, we established a structured
                   methodology for developing resilience solutions based on the
                   concept of design patterns. In this paper we discuss
                   analytical models for the design patterns to support
                   quantitative analysis of their performance and reliability
                   characteristics.",
  pts           = "102871"
}
@conference{hukerikar17towards,
  author        = "Saurabh Hukerikar
                   and Rizwan Ashraf
                   and Christian Engelmann",
  title         = "Towards New Metrics for High-Performance Computing
                   Resilience",
  booktitle     = "Proceedings of the \href{http://www.hpdc.org/2017}
                   {$26^{th}$ ACM International Symposium on High-Performance
                   Parallel and Distributed Computing (HPDC) 2017}:
                   \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2017}
                   {$7^{th}$ Workshop on Fault Tolerance for HPC at eXtreme
                   Scale (FTXS) 2017}",
  pages         = "23--30",
  month         = jun # "~26-30, ",
  year          = "2017",
  address       = "Washington, D.C.",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-5001-3",
  doi           = "10.1145/3086157.3086163",
  url           = "http://www.christian-engelmann.info/publications/hukerikar17towards.pdf",
  url2          = "http://www.christian-engelmann.info/publications/hukerikar17towards.ppt.pdf",
  abstract      = "Ensuring the reliability of applications is becoming an
                   increasingly important challenge as high-performance
                   computing (HPC) systems experience an ever-growing number of
                   faults, errors and failures. While the HPC community has made
                   substantial progress in developing various resilience
                   solutions, it continues to rely on platform-based metrics to
                   quantify application resiliency improvements. The resilience
                   of an HPC application is concerned with the reliability of
                   the application outcome as well as the fault handling
                   efficiency. To understand the scope of impact, effective
                   coverage and performance efficiency of existing and emerging
                   resilience solutions, there is a need for new metrics. In
                   this paper, we develop new ways to quantify resilience that
                   consider both the reliability and the performance
                   characteristics of the solutions from the perspective of HPC
                   applications. As HPC systems continue to evolve in terms of
                   scale and complexity, it is expected that applications will
                   experience various types of faults, errors and failures,
                   which will require applications to apply multiple resilience
                   solutions across the system stack. The proposed metrics are
                   intended to be useful for understanding the combined impact
                   of these solutions on an application's ability to produce
                   correct results and to evaluate their overall impact on an
                   application's performance in the presence of various modes
                   of faults.",
  pts           = "74843"
}
@conference{hukerikar16language,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Language Support for Reliable Memory Regions",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{https://lcpc2016.wordpress.com}{$29^{th}$
                   International Workshop on Languages and Compilers for
                   Parallel Computing}",
  volume        = "10136",
  pages         = "73--87",
  month         = sep # "~28-30, ",
  year          = "2016",
  address       = "Rochester, NY, USA",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-319-52708-6",
  issn          = "0302-9743",
  doi           = "10.1007/978-3-319-52709-3_6",
  url           = "http://www.christian-engelmann.info/publications/hukerikar16language.pdf",
  url2          = "http://www.christian-engelmann.info/publications/hukerikar16language.ppt.pdf",
  abstract      = "The path to exascale computational capabilities in
                   high-performance computing (HPC) systems is challenged by the
                   evolution of the architectures of supercomputing systems. The
                   constraints of power have driven designs that include
                   increasingly heterogeneous architectures and complex memory
                   hierarchies. These systems are also expected to experience
                   an increased rate of errors, such that the applications will
                   no longer be able to assume correct behavior of the
                   underlying machine. To enable the scientific community to
                   succeed in scaling their applications and harness the
                   capabilities of exascale systems, we need software strategies
                   that provide mechanisms for explicit management of locality
                   and resilience to errors in the system.
                   In prior work, we introduced the concept of explicitly
                   reliable memory regions, called havens. Memory management
                   using havens supports selective reliability through a
                   region-based approach to memory allocation. Havens enable the
                   creation of explicit software-enabled robust memory
                   containers for which resilient behavior is guaranteed. In
                   this paper, we propose language support for havens through
                   type annotations that make the structure of a program's
                   havens more explicit. We describe how the extended
                   haven-based memory management model is implemented and the
                   impact on the resiliency of a conjugate gradient
                   application.",
  pts           = "69644"
}
@conference{naughton16cooperative,
  author        = "Thomas Naughton
                   and Christian Engelmann
                   and Geoffroy Vall{\'e}e
                   and Ferrol Aderholdt
                   and Stephen L. Scott",
  title         = "A Cooperative Approach to Virtual Machine Based Fault
                   Injection",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{https://europar2016.inria.fr}{$22^{nd}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2016 Workshops}:
                   \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2016}
                   {$9^{th}$ Workshop on Resiliency in High Performance
                   Computing (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "10104",
  pages         = "671--682",
  month         = aug # "~23, ",
  year          = "2016",
  address       = "Grenoble, France",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-319-58943-5",
  issn          = "0302-9743",
  doi           = "10.1007/978-3-319-58943-5_54",
  url           = "http://www.christian-engelmann.info/publications/naughton16cooperative.pdf",
  url2          = "http://www.christian-engelmann.info/publications/naughton16cooperative.ppt.pdf",
  abstract      = "Resilience investigations often employ fault injection (FI)
                   tools to study the effects of simulated errors on a target
                   system. It is important to keep the target system under test
                   (SUT) isolated from the controlling environment in order to
                   maintain control of the experiment. Virtual machines (VMs)
                   have been used to aid these investigations due to the strong
                   isolation properties of system-level virtualization. A key
                   challenge in fault injection tools is to gain proper insight
                   and context about the SUT. In VM-based FI tools, this
                   challenge of target context is increased due to the
                   separation between host and guest (VM). We discuss an
                   approach to VM-based FI that leverages virtual machine
                   introspection (VMI) methods to gain insight into the target's
                   context running within the VM. The key to this environment is
                   the ability to provide basic information to the FI system
                   that can be used to create a map of the target environment.
                   We describe a proof- of-concept implementation and a
                   demonstration of its use to introduce simulated soft errors
                   into an iterative solver benchmark running in user-space of
                   a guest VM.",
  pts           = "69232"
}
@conference{parchman16adding,
  author        = "Zachary Parchman
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Christian Engelmann
                   and David E. Bernholdt",
  title         = "Adding Fault Tolerance to {NPB} Benchmarks Using {ULFM}",
  booktitle     = "Proceedings of the \href{http://www.hpdc.org/2016}
                   {$25^{th}$ ACM International Symposium on High-Performance
                   Parallel and Distributed Computing (HPDC) 2016}:
                   \href{https://sites.google.com/site/ftxsworkshop/home/ftxs-2016}
                   {$6^{th}$ Workshop on Fault Tolerance for HPC at eXtreme
                   Scale (FTXS) 2016}",
  pages         = "19--26",
  month         = may # "~31 - " # jun # "~4, ",
  year          = "2016",
  address       = "Kyoto, Japan",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-4503-4349-7",
  doi           = "10.1145/2909428.2909429",
  url           = "http://www.christian-engelmann.info/publications/parchman16adding.pdf",
  url2          = "http://www.christian-engelmann.info/publications/parchman16adding.ppt.pdf",
  abstract      = "In the world of high-performance computing, fault tolerance
                   and application resilience are becoming some of the primary
                   concerns because of increasing hardware failures and memory
                   corruptions. While the research community has been
                   investigating various options, from system-level solutions to
                   application-level solutions, standards such as the Message
                   Passing Interface (MPI) are also starting to include such
                   capabilities. The current proposal for MPI fault tolerance is
                   centered around the User-Level Failure Mitigation (ULFM)
                   concept, which provides means for fault detection and
                   recovery of the MPI layer. This approach does not address
                   application-level recovery, which is current left to
                   application developers. In this work, we present a
                   modification of some of the benchmarks of the NAS parallel
                   benchmark (NPB) to include support of the ULFM capabilities
                   as well as application- level strategies and mechanisms for
                   application-level failure recovery. As such, we present:
                   (i) an application-level library to ``checkpoint'' data,
                   (ii) extensions of NPB benchmarks for fault tolerance based
                   on different strategies, (iii) a fault injection tool, and
                   (iv) some preliminary experiments that shows the impact of
                   such fault tolerant strategies on the application
                   execution.",
  pts           = "62557"
}
@conference{naughton14what,
  author        = "Thomas Naughton
                   and Garry Smith
                   and Christian Engelmann
                   and Geoffroy Vall{\'e}e
                   and Ferrol Aderholdt
                   and Stephen L. Scott",
  title         = "What is the right balance for performance and isolation with
                   virtualization in {HPC}?",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://europar2014.dcc.fc.up.pt}{$20^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2014 Workshops}:
                   \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2014}
                   {$7^{th}$ Workshop on Resiliency in High Performance
                   Computing (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "8805",
  pages         = "570--581",
  month         = aug # "~25, ",
  year          = "2014",
  address       = "Porto, Portugal",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-319-14325-5",
  issn          = "0302-9743",
  doi           = "10.1007/978-3-319-14325-5_49",
  url           = "http://www.christian-engelmann.info/publications/naughton14what.pdf",
  url2          = "http://www.christian-engelmann.info/publications/naughton14what.ppt.pdf",
  abstract      = "The use of virtualization in high-performance computing (HPC)
                   has been suggested as a means to provide tailored services
                   and added functionality that many users expect from
                   full-featured Linux cluster environments. While the use of
                   virtual machines in HPC can offer several benefits,
                   maintaining performance is a crucial factor. In some
                   instances, performance criteria are placed above isolation
                   properties, and selective relaxation of isolation for
                   performance is an important characteristic when considering
                   resilience for HPC environments employing virtualization.
                   In this paper we consider some of the factors associated with
                   balancing performance and isolation in configurations that
                   employ virtual machines. In this context, we propose a
                   classification of errors based on the concept of ``error
                   zones'', as well as a detailed analysis of the trade-offs
                   between resilience and performance based on the level of
                   isolation provided by virtualization solutions. Finally,
                   results from a set of experiments using different
                   virtualization solutions are presented, allowing further
                   elucidation of the topic.",
  pts           = "51548"
}
@conference{engelmann13toward,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "Toward a Performance/Resilience Tool for Hardware/Software
                   Co-Design of High-Performance Computing Systems",
  booktitle     = "Proceedings of the
                   \href{http://icpp2013.ens-lyon.fr}{$42^{nd}$ International
                   Conference on Parallel Processing (ICPP) 2013}:
                   \href{http://www.psti-workshop.org} {$4^{th}$ International
                   Workshop on Parallel Software Tools and Tool Infrastructures
                   (PSTI)}",
  pages         = "962-971",
  month         = oct # "~2, ",
  year          = "2013",
  address       = "Lyon, France",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-5117-3",
  issn          = "0190-3918",
  doi           = "10.1109/ICPP.2013.114",
  url           = "http://www.christian-engelmann.info/publications/engelmann13toward.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann13toward.ppt.pdf",
  abstract      = "xSim is a simulation-based performance investigation toolkit
                   that permits running high-performance computing (HPC)
                   applications in a controlled environment with millions of
                   concurrent execution threads, while observing application
                   performance in a simulated extreme-scale system for
                   hardware/software co-design. The presented work details newly
                   developed features for xSim that permit the injection of MPI
                   process failures, the propagation/detection/notification of
                   such failures within the simulation, and their handling using
                   application-level checkpoint/restart. These new capabilities
                   enable the observation of application behavior and
                   performance under failure within a simulated
                   future-generation HPC system using the most common fault
                   handling technique.",
  pts           = "44445"
}
@conference{lagadapati13tools,
  author        = "Mahesh Lagadapati
                   and Frank Mueller
                   and Christian Engelmann",
  title         = "Tools for Simulation and Benchmark Generation at Exascale",
  booktitle     = "Proceedings of the \href{http://tools.zih.tu-dresden.de/2013/}
                   {$7^{th}$ Parallel Tools Workshop}",
  pages         = "19--24",
  month         = sep # "~3-4, ",
  year          = "2013",
  address       = "Dresden, Germany",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-319-08143-4",
  doi           = "10.1007/978-3-319-08144-1_2",
  url           = "http://www.christian-engelmann.info/publications/lagadapati13tools.pdf",
  url2          = "http://www.christian-engelmann.info/publications/lagadapati13tools.ppt.pdf",
  abstract      = "The path to exascale high-performance computing (HPC) poses several
                   challenges related to power, performance, resilience, productivity,
                   programmability, data movement, and data management. Investigating the
                   performance of parallel applications at scale on future architectures
                   and the performance impact of different architecture choices is an
                   important component of HPC hardware/software co-design. Simulations
                   using models of future HPC systems and communication traces from
                   applications running on existing HPC systems can offer an insight into
                   the performance of future architectures. This work targets technology
                   developed for scalable application tracing of communication events and
                   memory profiles, but can be extended to other areas, such as I/O,
                   control flow, and data flow. It further focuses on extreme-scale
                   simulation of millions of Message Passing Interface (MPI) ranks using
                   a lightweight parallel discrete event simulation (PDES) toolkit for
                   performance evaluation. Instead of simply replaying a trace within a
                   simulation, the approach is to generate a benchmark from it and to run
                   this benchmark within a simulation using models to reflect the
                   performance characteristics of future-generation HPC systems. This
                   provides a number of benefits, such as eliminating the data intensive
                   trace replay and enabling simulations at different scales. The
                   presented work utilizes the ScalaTrace tool to generate scalable trace
                   files, the ScalaBenchGen tool to generate the benchmark, and the xSim
                   tool to run the benchmark within a simulation.",
  pts           = "48783"
}
@conference{naughton13using,
  author        = "Thomas Naughton
                   and Swen B{\"o}hm
                   and Christian Engelmann
                   and Geoffroy Vall{\'e}e",
  title         = "Using Performance Tools to Support Experiments in {HPC}
                   Resilience",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://www.europar2013.org/}{$19^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2013 Workshops}:
                   \href{http://xcr.cenit.latech.edu/resilience2013}{$6^{th}$
                   Workshop on Resiliency in High Performance Computing
                   (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "8374",
  pages         = "727--736",
  month         = aug # "~26, ",
  year          = "2013",
  address       = "Aachen, Germany",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-642-54419-4",
  issn          = "0302-9743",
  doi           = "10.1007/978-3-642-54420-0_71",
  url           = "http://www.christian-engelmann.info/publications/naughton13using.pdf",
  url2          = "http://www.christian-engelmann.info/publications/naughton13using.ppt.pdf",
  abstract      = "The high performance computing~(HPC) community is working to
                   address fault tolerance and resilience concerns for current
                   and future large scale computing platforms. This is driving
                   enhancements in the programming environments, specifically
                   research on enhancing message passing libraries to support
                   fault tolerant computing capabilities. The community has
                   also recognized that tools for resilience experimentation
                   are greatly lacking. However, we argue that there are
                   several parallels between ``performance tools'' and
                   ``resilience tools''. As such, we believe the rich set of
                   HPC performance-focused tools can be extended (repurposed)
                   to benefit the resilience community. In this paper, we
                   describe the initial motivation to leverage standard HPC
                   performance analysis techniques to aid in developing
                   diagnostic tools to assist fault tolerance experiments for
                   HPC applications. These diagnosis procedures help to provide
                   context for the system when the errors (failures) occurred.
                   We describe our initial work in leveraging an MPI
                   performance trace tool to assist in providing global context
                   during fault injection experiments. Such tools will assist
                   the HPC resilience community as they extend existing and new
                   application codes to support fault tolerance.",
  pts           = "45676"
}
@conference{jones11simulation,
  author        = "Ian S. Jones
                   and Christian Engelmann",
  title         = "Simulation of Large-Scale {HPC} Architectures",
  booktitle     = "Proceedings of the
                   \href{http://icpp2011.org}{$40^{th}$ International Conference
                   on Parallel Processing (ICPP) 2011}:
                   \href{http://www.psti-workshop.org} {$2^{nd}$ International
                   Workshop on Parallel Software Tools and Tool Infrastructures
                   (PSTI)}",
  pages         = "447-456",
  month         = sep # "~13-19, ",
  year          = "2011",
  address       = "Taipei, Taiwan",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-4511-0",
  issn          = "1530-2016",
  doi           = "10.1109/ICPPW.2011.44",
  url           = "http://www.christian-engelmann.info/publications/jones11simulation.pdf",
  url2          = "http://www.christian-engelmann.info/publications/jones11simulation.ppt.pdf",
  abstract      = "The Extreme-scale Simulator (xSim) is a recently developed
                   performance investigation toolkit that permits running
                   high-performance computing (HPC) applications in a controlled
                   environment with millions of concurrent execution threads. It
                   allows observing parallel application performance properties
                   in a simulated extreme-scale HPC system to further assist in
                   HPC hardware and application software co-design on the road
                   toward multi-petascale and exascale computing. This paper
                   presents a newly implemented network model for the xSim
                   performance investigation toolkit that is capable of
                   providing simulation support for a variety of HPC network
                   architectures with the appropriate trade-off between
                   simulation scalability and accuracy. The approach taken
                   focuses on a scalable distributed solution with latency and
                   bandwidth restrictions for the simulated network. Different
                   network architectures, such as star, ring, mesh, torus,
                   twisted torus and tree, as well as hierarchical combinations,
                   such as to simulate network-on-chip and network-on-node, are
                   supported. Network traffic congestion modeling is omitted to
                   gain simulation scalability by reducing simulation accuracy.",
  pts           = "31901"
}
@conference{fiala11tunable,
  author        = "David Fiala
                   and Kurt Ferreira
                   and Frank Mueller
                   and Christian Engelmann",
  title         = "A Tunable, Software-based {DRAM} Error Detection and
                   Correction Library for {HPC}",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://europar2011.bordeaux.inria.fr/}{$17^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2011 Workshops, Part II}:
                   \href{http://xcr.cenit.latech.edu/resilience2011}{$4^{th}$
                   Workshop on Resiliency in High Performance Computing
                   (Resilience) in Clusters, Clouds, and Grids}",
  volume        = "7156",
  pages         = "251-261",
  month         = aug # "~29 - " # sep # "~2, ",
  year          = "2011",
  address       = "Bordeaux, France",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-642-29740-3",
  doi           = "10.1007/978-3-642-29740-3_29",
  url           = "http://www.christian-engelmann.info/publications/fiala11tunable.pdf",
  url2          = "",
  abstract      = "Proposed exascale systems will present a number of
                   considerable resiliency challenges. In particular, DRAM
                   soft-errors, or bit-flips, are expected to greatly increase
                   due to the increased memory density of these systems.
                   Current hardware-based fault-tolerance methods will be
                   unsuitable for addressing the expected soft error rate. As
                   a result, additional software will be needed to
                   address this challenge. In this paper we introduce LIBSDC,
                   a tunable, transparent silent data corruption detection and
                   correction library for HPC applications. LIBSDC provides
                   comprehensive SDC protection for program memory by
                   implementing on-demand page integrity verification.
                   Experimental benchmarks with Mantevo HPCCG show that once
                   tuned, LIBSDC is able to achieve SDC protection with 50\%
                   resource overhead, less than the 100\% needed for double
                   modular redundancy.",
  pts           = "35631"
}
@conference{naughton11case,
  author        = "Thomas Naughton
                   and Geoffroy R. Vall\'ee
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "A Case for Virtual Machine based Fault Injection in a
                   High-Performance Computing Environment",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://europar2011.bordeaux.inria.fr/}{$17^{th}$
                   European Conference on Parallel and Distributed Computing
                   (Euro-Par) 2011}:
                   \href{http://www.csm.ornl.gov/srt/conferences/hpcvirt2011}
                   {$5^{th}$ Workshop on System-level Virtualization for High
                   Performance Computing (HPCVirt)}",
  volume        = "7155",
  pages         = "234-243",
  month         = aug # "~29 - " # sep # "~2, ",
  year          = "2011",
  address       = "Bordeaux, France",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-642-29737",
  doi           = "10.1007/978-3-642-29737-3_27",
  url           = "http://www.christian-engelmann.info/publications/naughton11case.pdf",
  url2          = "http://www.christian-engelmann.info/publications/naughton11case.ppt.pdf",
  abstract      = "Large-scale computing platforms provide tremendous
                   capabilities for scientific discovery. These systems have
                   hundreds of thousands of computing cores, hundreds of
                   terabytes of memory, and enormous high-performance
                   interconnection networks. These systems are facing enormous
                   challenges to achieve performance at such scale. Failures
                   are an Achilles heel of these enormous systems. As
                   applications and system software scale up to multi-petaflop
                   and beyond to exascale platforms, the occurrence of failure
                   will be much more common. This has given rise to a push in
                   fault-tolerance and resilience research for HPC systems.
                   This includes work on log analysis to identify types of
                   failures, enhancements to the Message Passing Interface
                   (MPI) to incorporate fault awareness, and a variety of
                   fault tolerance mechanisms that span redundant computation,
                   algorithm-based fault tolerance, and advanced
                   checkpoint/restart techniques. While there is much work to
                   be done on the FT/Resilience mechanisms for such
                   large-scale systems, there is also a profound gap in the
                   tools for experimentation. This gap is compounded by the
                   fact that HPC environments have stringent performance
                   requirements and are often highly customized. The tool
                   chain for these systems is often tailored for the platform,
                   and while the majority of systems on the Top500
                   Supercomputer list run Linux, these operating environments
                   typically contain many site/machine specific enhancements.
                   Therefore, it is desirable to maintain a consistent
                   execution environment to minimize end-user (scientist)
                   interruption. The work on system-level virtualization for
                   HPC systems offers a unique opportunity to maintain a
                   consistent execution environment via a virtual machine
                   (VM). Recent work on virtualization for HPC has shown that
                   low-overhead, high-performance systems can be realized
                   [1, 2]. Virtualization also provides a clean abstraction
                   for building experimental tools for investigation into the
                   effects of failures in HPC and the related research on
                   FT/Resilience mechanisms and policies. In this paper we discuss
                   the motivation for tools to perform fault injection in an HPC
                   context, and outline an approach that can leverage
                   virtualization.",
  pts           = "32309"
}
@conference{engelmann10facilitating,
  author        = "Christian Engelmann
                   and Frank Lauer",
  title         = "Facilitating Co-Design for Extreme-Scale Systems Through
                   Lightweight Simulation",
  booktitle     = "Proceedings of the
                   \href{http://www.cluster2010.org}{$12^{th}$ IEEE
                   International Conference on Cluster Computing (Cluster)
                   2010}: \href{http://www2.wmin.ac.uk/getovv/aacec10.html}
                   {$1^{st}$ Workshop on Application/Architecture Co-design for
                   Extreme-scale Computing (AACEC)}",
  pages         = "1-8",
  month         = sep # "~20-24, ",
  year          = "2010",
  address       = "Hersonissos, Crete, Greece",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-4244-8395-2",
  doi           = "10.1109/CLUSTERWKSP.2010.5613113",
  url           = "http://www.christian-engelmann.info/publications/engelmann10facilitating.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann10facilitating.ppt.pdf",
  abstract      = "This work focuses on tools for investigating algorithm
                   performance at extreme scale with millions of concurrent
                   threads and for evaluating the impact of future architecture
                   choices to facilitate the co-design of high-performance
                   computing (HPC) architectures and applications. The approach
                   focuses on lightweight simulation of extreme-scale HPC
                   systems with the needed amount of accuracy. The prototype
                   presented in this paper is able to provide this capability
                   using a parallel discrete event simulation (PDES), such that
                   a Message Passing Interface (MPI) application can be executed
                   at extreme scale, and its performance properties can be
                   evaluated. The results of an initial prototype are
                   encouraging as a simple hello world MPI program could be
                   scaled up to 1,048,576 virtual MPI processes on a four-node
                   cluster, and the performance properties of two MPI programs
                   could be evaluated at up to 1,024 and 16,384 virtual MPI
                   processes on the same system.",
  pts           = "25331"
}
@conference{ostrouchov09nonparametric,
  author        = "George Ostrouchov
                   and Thomas Naughton
                   and Christian Engelmann
                   and Geoffroy R. Vall\'ee
                   and Stephen L. Scott",
  title         = "Nonparametric Multivariate Anomaly Analysis in Support of
                   {HPC} Resilience",
  booktitle     = "Proceedings of the \href{http://www.oerc.ox.ac.uk/ieee}
                   {$5^{th}$ IEEE International Conference on e-Science
                   (e-Science) 2009}:
                   \href{http://www.oerc.ox.ac.uk/ieee/workshops/workshops/computational-science}
                   {Workshop on Computational Science}",
  pages         = "80-85",
  month         = dec # "~9-11, ",
  year          = "2009",
  address       = "Oxford, UK",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-1-4244-5946-9",
  doi           = "10.1109/ESCIW.2009.5407992",
  url           = "http://www.christian-engelmann.info/publications/ostrouchov09nonparametric.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ostrouchov09nonparametric.ppt.pdf",
  abstract      = "Large-scale computing systems provide great potential for
                   scientific exploration. However, the complexity that
                   accompanies these enormous machines raises challenges for
                   both users and operators. The effective use of such systems
                   is often hampered by failures encountered when running
                   applications on systems containing tens-of-thousands of nodes
                   and hundreds-of-thousands of compute cores capable of
                   yielding petaflops of performance. In systems of this size
                   failure detection is complicated and root-cause diagnosis
                   difficult. This paper describes our recent work in the
                   identification of anomalies in monitoring data and system
                   logs to provide further insights into machine status, runtime
                   behavior, failure modes and failure root causes. It discusses
                   the details of an initial prototype that gathers the data and
                   uses statistical techniques for analysis.",
  pts           = "26081"
}
@conference{naughton09fault,
  author        = "Thomas Naughton
                   and Wesley Bland
                   and Geoffroy R. Vall\'ee
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Fault Injection Framework for System Resilience Evaluation --
                   {F}ake Faults for Finding Future Failures",
  booktitle     = "Proceedings of the
                   \href{http://www.lrz-muenchen.de/hpdc2009}{$18^{th}$
                   International Symposium on High Performance Distributed
                   Computing (HPDC) 2009}:
                   \href{http://xcr.cenit.latech.edu/resilience2009}{$2^{nd}$
                   Workshop on Resiliency in High Performance Computing
                   (Resilience) 2009}",
  pages         = "23--28",
  month         = jun # "~9, ",
  year          = "2009",
  address       = "Munich, Germany",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-60558-587-1",
  doi           = "10.1145/1552526.1552530",
  url           = "http://www.christian-engelmann.info/publications/naughton09fault.pdf",
  url2          = "http://www.christian-engelmann.info/publications/naughton09fault.ppt.pdf",
  abstract      = "As high-performance computing (HPC) systems increase in size
                   and complexity they become more difficult to manage. The
                   enormous component counts associated with these large systems
                   lead to significant challenges in system reliability and
                   availability. This in turn is driving research into the
                   resilience of large scale systems, which seeks to curb the
                   effects of increased failures at large scales by masking the
                   inevitable faults in these systems. The basic premise is
                   that failure must be accepted as a reality of large-scale
                   systems and coped with accordingly through system
                   resilience.
                   A key component in the development and evaluation of system
                   resilience techniques is having a means to conduct controlled
                   experiments. A common method for performing such experiments
                   is to generate synthetic faults and study the resulting
                   effects. In this paper we discuss the motivation and our
                   initial use of software fault injection to support the
                   evaluation of resilience for HPC systems. We mention
                   background and related work in the area and discuss the
                   design of a tool to aid in fault injection experiments for
                   both user-space (application-level) and system-level
                   failures."
}
@conference{tikotekar09performance,
  author        = "Anand Tikotekar
                   and Hong H. Ong
                   and Sadaf Alam
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Performance Comparison of Two Virtual Machine Scenarios Using
                   an {HPC} Application -- {A} Case study Using Molecular
                   Dynamics Simulations",
  booktitle     = "Proceedings of the
                   \href{http://www.csm.ornl.gov/srt/hpcvirt09}{$3^{rd}$
                   Workshop on System-level Virtualization for High Performance
                   Computing (HPCVirt) 2009}, in conjunction with the
                   \href{http://www.eurosys.org/2009}{$4^{th}$ ACM SIGOPS
                   European Conference on Computer Systems (EuroSys) 2009}",
  pages         = "33--40",
  month         = mar # "~30, ",
  year          = "2009",
  address       = "Nuremberg, Germany",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-60558-465-2",
  doi           = "10.1145/1519138.1519143",
  url           = "http://www.christian-engelmann.info/publications/tikotekar09performance.pdf",
  url2          = "http://www.christian-engelmann.info/publications/tikotekar09performance.ppt.pdf",
  abstract      = "Obtaining high flexibility to performance-loss ratio is a
                   key challenge of today's HPC virtual environment landscape.
                   And while extensive research has been targeted at extracting
                   more performance from virtual machines, the idea that whether
                   novel virtual machine usage scenarios could lead to high
                   flexibility Vs performance trade-off has received less
                   attention. We, in this paper, take a step forward by studying
                   and comparing the performance implications of running the
                   Large-scale Atomic/Molecular Massively Parallel Simulator
                   (LAMMPS) application on two virtual machine configurations.
                   First configuration consists of two virtual machines per node
                   with 1 application process per virtual machine. The second
                   configuration consists of 1 virtual machine per node with 2
                   processes per virtual machine. Xen has been used as an
                   hypervisor and standard Linux as a guest virtual machine. Our
                   results show that the difference in overall performance
                   impact on LAMMPS between the two virtual machine
                   configurations described above is around 3\%. We also study
                   the difference in performance impact in terms of each
                   configuration's individual metrics such as CPU, I/O, Memory,
                   and interrupt/context switches."
}
@conference{vallee08virtual,
  author        = "Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Hong H. Ong
                   and Anand Tikotekar
                   and Christian Engelmann
                   and Wesley Bland
                   and Ferrol Aderholdt
                   and Stephen L. Scott",
  title         = "Virtual System Environments",
  booktitle     = "Communications in Computer and Information Science:
                   Proceedings of the \href{http://www.dmtf.org/svm08}{$2^{nd}$
                   DMTF Academic Alliance Workshop on Systems and Virtualization
                   Management: Standards and New Technologies (SVM) 2008}",
  volume        = "18",
  pages         = "72--83",
  month         = oct # "~21-22, ",
  year          = "2008",
  address       = "Munich, Germany",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-540-88707-2",
  issn          = "1865-0929",
  doi           = "10.1007/978-3-540-88708-9_7",
  url           = "http://www.christian-engelmann.info/publications/vallee08virtual.pdf",
  url2          = "",
  abstract      = "Distributed and parallel systems are typically managed with
                   static settings: the operating system (OS) and the runtime
                   environment (RTE) are specified at a given time and cannot be
                   changed to fit an application's needs. This means that every
                   time application developers want to use their application on
                   a new execution platform, the application has to be ported to
                   this new environment, which may be expensive in terms of
                   application modifications and developer time. However, the
                   science resides in the applications and not in the OS or the
                   RTE. Therefore, it should be beneficial to adapt the OS and
                   the RTE to the application instead of adapting the
                   applications to the OS and the RTE. This document presents
                   the concept of Virtual System Environments (VSE), which
                   enables application developers to specify and create a
                   virtual environment that properly fits their application's
                   needs. For that, four challenges have to be addressed: (i)
                   definition of the VSE itself by the application developers,
                   (ii) deployment of the VSE, (iii) system administration for
                   the platform, and (iv) protection of the platform from the
                   running VSE. We therefore present an integrated tool for the
                   definition and deployment of VSEs on top of traditional and
                   virtual (i.e., using system-level virtualization) execution
                   platforms. This tool provides the capability to choose the
                   degree of delegation for system administration tasks and the
                   degree of protection from the application (e.g., using
                   virtual machines). To summarize, the VSE concept enables the
                   customization of the OS/RTE used for the execution of
                   applications by users without compromising local system
                   administration rules and execution platform protection
                   constraints.",
  pts           = "28239"

}
@conference{tikotekar08analysis,
  author        = "Anand Tikotekar
                   and Geoffroy Vall\'ee
                   and Thomas Naughton
                   and Hong H. Ong
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "An Analysis of {HPC} Benchmark Applications in Virtual
                   Machine Environments",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://europar2008.caos.uab.es}{$14^{th}$ European
                   Conference on Parallel and Distributed Computing (Euro-Par)
                   2008}: \href{http://scilytics.com/vhpc}{$3^{rd}$ Workshop on
                   Virtualization in High-Performance Cluster and Grid Computing
                   (VHPC) 2008}",
  volume        = "5415",
  pages         = "63--71",
  month         = aug # "~26-29, ",
  year          = "2008",
  address       = "Las Palmas de Gran Canaria, Spain",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "978-3-642-00954-9",
  doi           = "10.1007/978-3-642-00955-6",
  url           = "http://www.christian-engelmann.info/publications/tikotekar08analysis.pdf",
  url2          = "http://www.christian-engelmann.info/publications/tikotekar08analysis.ppt.pdf",
  abstract      = "Virtualization technology has been gaining acceptance in the
                   scientific community due to its overall flexibility in
                   running HPC applications. It has been reported that a
                   specific class of applications is better suited to a
                   particular type of virtualization scheme or implementation.
                   For example, Xen has been shown to perform with little
                   overhead for compute-bound applications. Such a study,
                   although useful, does not allow us to generalize conclusions
                   beyond the performance analysis of that application which is
                   explicitly executed. One explanation of why the
                   generalization described above is difficult may be the
                   versatility of applications, which leads to different
                   overheads in virtual environments. For example, two similar
                   applications may spend a disproportionate amount of time in
                   their respective library code when run in virtual
                   environments. In this paper, we aim to study such potential
                   causes by investigating the behavior and identifying
                   patterns of various overheads for HPC benchmark
                   applications. Based on the investigation of the overhead
                   profiles for different benchmarks, we aim to address
                   questions such as: Are the overhead profiles for a
                   particular type of benchmark (such as compute-bound)
                   similar, or are there grounds to conclude otherwise?"
}
@conference{engelmann08symmetric2,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "Symmetric Active/Active High Availability for
                   High-Performance Computing System Services: Accomplishments
                   and Limitations",
  booktitle     = "Proceedings of the
                   \href{http://www.ens-lyon.fr/LIP/RESO/ccgrid2008}{$8^{th}$
                   IEEE International Symposium on Cluster Computing and the
                   Grid (CCGrid) 2008}:
                   \href{http://xcr.cenit.latech.edu/resilience2008}{Workshop on
                   Resiliency in High Performance Computing (Resilience) 2008}",
  pages         = "813--818",
  month         = may # "~19-22, ",
  year          = "2008",
  address       = "Lyon, France",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "978-0-7695-3156-4",
  doi           = "10.1109/CCGRID.2008.78",
  url           = "http://www.christian-engelmann.info/publications/engelmann08symmetric2.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann08symmetric2.pdf",
  abstract      = "This paper summarizes our efforts over the last 3-4 years in
                   providing symmetric active/active high availability for
                   high-performance computing (HPC) system services. This work
                   paves the way for high-level reliability, availability and
                   serviceability in extreme-scale HPC systems by focusing on
                   the most critical components, head and service nodes, and by
                   reinforcing them with appropriate high availability
                   solutions. This paper presents our accomplishments in the
                   form of concepts and respective prototypes, discusses
                   existing limitations, outlines possible future work, and
                   describes the relevance of this research to other, planned
                   efforts.",
  pts           = "9996"
}
@conference{chen08online,
  author        = "Xin Chen
                   and Benjamin Eckart
                   and Xubin (Ben) He
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "An Online Controller Towards Self-Adaptive File System
                   Availability and Performance",
  booktitle     = "Proceedings of the
                   \href{http://xcr.cenit.latech.edu/hapcw2008}{$5^{th}$ High
                   Availability and Performance Workshop (HAPCW) 2008}, in
                   conjunction with the \href{http://www.hpcsw.org}{$1^{st}$
                   High-Performance Computer Science Week (HPCSW) 2008}",
  month         = apr # "~3-4, ",
  year          = "2008",
  address       = "Denver, CO, USA",
  url           = "http://www.christian-engelmann.info/publications/chen08online.pdf",
  url2          = "http://www.christian-engelmann.info/publications/chen08online.ppt.pdf",
  abstract      = "At the present time, it can be a significant challenge to
                   build a large-scale distributed file system that
                   simultaneously maintains both high availability and high
                   performance. Although many fault tolerance technologies have
                   been proposed and used in both commercial and academic
                   distributed file systems to achieve high availability, most
                   of them typically sacrifice performance for higher system
                   availability. Additionally, recent studies show that system
                   availability and performance are related to the system
                   workload. In this paper, we analyze the correlations among
                   availability, performance, and workloads based on a
                   replication strategy, and we discuss the trade-off between
                   availability and performance with different workloads. Our
                   analysis leads to the design of an online controller that can
                   dynamically achieve optimal performance and availability by
                   tuning the system replication policy."
}
@conference{tikotekar08effects,
  author        = "Anand Tikotekar
                   and Geoffroy Vall\'ee
                   and Thomas Naughton
                   and Hong H. Ong
                   and Christian Engelmann
                   and Stephen L. Scott
                   and Anthony M. Filippi",
  title         = "Effects of Virtualization on a Scientific Application --
                   {R}unning a Hyperspectral Radiative Transfer Code on Virtual
                   Machines",
  booktitle     = "Proceedings of the
                   \href{http://www.csm.ornl.gov/srt/hpcvirt08}{$2^{nd}$
                   Workshop on System-level Virtualization for High Performance
                   Computing (HPCVirt) 2008}, in conjunction with the
                   \href{http://www.eurosys.org/2008}{$3^{rd}$ ACM SIGOPS
                   European Conference on Computer Systems (EuroSys) 2008}",
  pages         = "16--23",
  month         = mar # "~31, ",
  year          = "2008",
  address       = "Glasgow, UK",
  publisher     = "\href{http://www.acm.org}{ACM Press, New York, NY, USA}",
  isbn          = "978-1-60558-120-0",
  doi           = "10.1145/1435452.1435455",
  url           = "http://www.christian-engelmann.info/publications/tikotekar08effects.pdf",
  url2          = "http://www.christian-engelmann.info/publications/tikotekar08effects.ppt.pdf",
  abstract      = "The topic of system-level virtualization has recently begun
                   to receive interest for high performance computing (HPC).
                   This is in part due to the isolation and encapsulation
                   offered by the virtual machine. These traits enable
                   applications to customize their environments and maintain
                   consistent software configurations in their virtual domains.
                   Additionally, there are mechanisms that can be used for fault
                   tolerance like live virtual machine migration. Given these
                   attractive benefits of virtualization, a fundamental question
                   arises: how does this affect my scientific application? We
                   use this as the premise for our paper and observe a
                   real-world scientific code running on a Xen virtual machine.
                   We studied the effects of running a radiative transfer
                   simulation, Hydrolight, on a virtual machine. We discuss our
                   methodology and report observations regarding the usage of
                   virtualization with this application."
}
@conference{engelmann07middleware,
  author        = "Christian Engelmann
                   and Hong H. Ong
                   and Stephen L. Scott",
  title         = "Middleware in Modern High Performance Computing System
                   Architectures",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://www.iccs-meeting.org/iccs2007}{$7^{th}$
                   International Conference on Computational Science (ICCS)
                   2007}, Part II: \href{http://www.gup.uni-linz.ac.at/cce2007}
                   {$4^{th}$ Special Session on Collaborative and Cooperative
                   Environments (CCE) 2007}",
  volume        = "4488",
  pages         = "784--791",
  month         = may # "~27-30, ",
  year          = "2007",
  address       = "Beijing, China",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "3-5407-2585-5",
  issn          = "0302-9743",
  doi           = "10.1007/978-3-540-72586-2_111",
  url           = "http://www.christian-engelmann.info/publications/engelmann07middleware.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann07middleware.ppt.pdf",
  abstract      = "A recent trend in modern high performance computing (HPC)
                   system architectures employs lean compute nodes running a
                   lightweight operating system (OS). Certain parts of the OS
                   as well as other system software services are moved to
                   service
                   nodes in order to increase performance and scalability. This
                   paper examines the impact of this HPC system architecture
                   trend on HPC middleware software solutions, which
                   traditionally equip HPC systems with advanced features, such
                   as parallel and distributed programming models, appropriate
                   system resource management mechanisms, remote application
                   steering and user interaction techniques. Since the approach
                   of keeping the compute node software stack small and simple
                   is orthogonal to the middleware concept of adding missing OS
                   features between OS and application, the role and
                   architecture of middleware in modern HPC systems needs to be
                   revisited. The result is a paradigm shift in HPC middleware
                   design, where single middleware services are moved to service
                   nodes, while runtime environments (RTEs) continue to reside
                   on compute nodes.",
  pts           = "5260"
}
@conference{engelmann07transparent,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "Transparent Symmetric Active/Active Replication for
                   Service-Level High Availability",
  booktitle     = "Proceedings of the \href{http://ccgrid07.lncc.br}{$7^{th}$
                   IEEE International Symposium on Cluster Computing and the
                   Grid (CCGrid) 2007}: \href{http://www.lri.fr/~fedak/gp2pc-07}
                   {$7^{th}$ International Workshop on Global and Peer-to-Peer
                   Computing (GP2PC) 2007}",
  pages         = "755--760",
  month         = may # "~14-17, ",
  year          = "2007",
  address       = "Rio de Janeiro, Brazil",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "0-7695-2833-3",
  doi           = "10.1109/CCGRID.2007.116",
  url           = "http://www.christian-engelmann.info/publications/engelmann07transparent.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann07transparent.ppt.pdf",
  abstract      = "As service-oriented architectures become more important in
                   parallel and distributed computing systems, individual
                   service instance reliability as well as appropriate service
                   redundancy becomes an essential necessity in order to
                   increase overall system availability. This paper focuses on
                   providing redundancy strategies using service-level
                   replication techniques. Based on previous research using
                   symmetric active/active replication, this paper proposes a
                   transparent symmetric active/active replication approach that
                   allows for more reuse of code between individual
                   service-level replication implementations by using a virtual
                   communication layer. Service- and client-side interceptors
                   are utilized in order to provide total transparency. Clients
                   and servers are unaware of the replication infrastructure as
                   it provides all necessary mechanisms internally.",
  pts           = "5259"
}
@conference{engelmann07configurable,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Hong H. Ong
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton",
  title         = "Configurable Virtualized System Environments for High
                   Performance Computing",
  booktitle     = "Proceedings of the
                   \href{http://www.csm.ornl.gov/srt/hpcvirt07}{$1^{st}$
                   Workshop on System-level Virtualization for High Performance
                   Computing (HPCVirt) 2007}, in conjunction with the
                   \href{http://www.eurosys.org/2008}{$2^{nd}$ ACM SIGOPS
                   European Conference on Computer Systems (EuroSys) 2007}",
  month         = mar # "~20, ",
  year          = "2007",
  address       = "Lisbon, Portugal",
  url           = "http://www.christian-engelmann.info/publications/engelmann07configurable.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann07configurable.ppt.pdf",
  abstract      = "Existing challenges for current terascale high performance
                   computing (HPC) systems are increasingly hampering the
                   development and deployment efforts of system software and
                   scientific applications for next-generation petascale
                   systems. The expected rapid system upgrade interval toward
                   petascale scientific computing demands an incremental
                   strategy for the development and deployment of legacy and new
                   large-scale scientific applications that avoids excessive
                   porting. Furthermore, system software developers as well as
                   scientific application developers require access to
                   large-scale testbed environments in order to test individual
                   solutions at scale. This paper proposes to address these
                   issues at the system software level through the development
                   of a virtualized system environment (VSE) for scientific
                   computing. The proposed VSE approach enables
                   plug-and-play supercomputing through
                   desktop-to-cluster-to-petaflop computer system-level
                   virtualization based on recent advances in hypervisor
                   virtualization technologies. This paper describes the VSE
                   system architecture in detail, discusses needed tools for
                   VSE system management and configuration, and presents
                   respective VSE use case scenarios.",
  pts           = "5703"
}
@conference{engelmann06towards,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "Towards High Availability for High-Performance Computing
                   System Services: {A}ccomplishments and Limitations",
  booktitle     = "Proceedings of the
                   \href{http://xcr.cenit.latech.edu/hapcw2006}{$4^{th}$ High
                   Availability and Performance Workshop (HAPCW) 2006}, in
                   conjunction with the \href{http://lacsi.krellinst.org}
                   {$7^{th}$ Los Alamos Computer Science Institute (LACSI)
                   Symposium 2006}",
  month         = oct # "~17, ",
  year          = "2006",
  address       = "Santa Fe, NM, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann06towards.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann06towards.ppt.pdf",
  abstract      = "During the last several years, our teams at Oak Ridge
                   National Laboratory, Louisiana Tech University, and Tennessee
                   Technological University focused on efficient redundancy
                   strategies for head and service nodes of high-performance
                   computing (HPC) systems in order to pave the way for high
                   availability (HA) in HPC. These nodes typically run critical
                   HPC system services, like job and resource management, and
                   represent single points of failure and control for an entire
                   HPC system. The overarching goal of our research is to
                   provide high-level reliability, availability, and
                   serviceability (RAS) for HPC systems by combining HA and HPC
                   technology. This paper summarizes our accomplishments, such
                   as developed concepts and implemented proof-of-concept
                   prototypes, and describes existing limitations, such as
                   performance issues, which need to be dealt with for
                   production-type deployment.",
  pts           = "3736"
}
@conference{ou06achieving,
  author        = "Li Ou
                   and Xin Chen
                   and Xubin (Ben) He
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Achieving Computational {I/O} Effciency in a High Performance
                   Cluster Using Multicore Processors",
  booktitle     = "Proceedings of the
                   \href{http://xcr.cenit.latech.edu/hapcw2006}{$4^{th}$ High
                   Availability and Performance Workshop (HAPCW) 2006}, in
                   conjunction with the \href{http://lacsi.krellinst.org}
                   {$7^{th}$ Los Alamos Computer Science Institute (LACSI)
                   Symposium 2006}",
  month         = oct # "~17, ",
  year          = "2006",
  address       = "Santa Fe, NM, USA",
  url           = "http://www.christian-engelmann.info/publications/ou06achieving.pdf",
  url2          = "http://www.christian-engelmann.info/publications/ou06achieving.ppt.pdf",
  abstract      = "Cluster computing has become one of the most popular
                   platforms for high-performance computing today. The recent
                   popularity of multicore processors provides a flexible way to
                   increase the computational capability of clusters. Although
                   the system performance may improve with multicore processors
                   in a cluster, I/O requests initiated by multiple cores may
                   saturate the I/O bus, and furthermore increase the latency by
                   issuing multiple non-contiguous disk accesses. In this
                   paper, we propose an asymmetric collective I/O scheme for
                   multicore processors to improve the handling of multiple
                   non-contiguous accesses. In
                   our configuration, one core in each multicore processor is
                   designated as the coordinator, and others serve as computing
                   cores. The coordinator is responsible for aggregating I/O
                   operations from computing cores and submitting a contiguous
                   request. The coordinator allocates contiguous memory buffers
                   on behalf of other cores to avoid redundant data copies.",
  pts           = "4222"
}
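
A coordinator-based aggregation of the kind summarized in the ou06achieving abstract can be illustrated with a short sketch. This is a minimal illustration only, assuming a simple (offset, bytes) request format and a seekable file object standing in for the I/O subsystem; it is not the paper's implementation or API.

# Illustrative sketch only: the coordinator core collects non-contiguous I/O
# requests from the computing cores, merges them, and issues one contiguous
# write. The request format and gap-filling policy are assumptions.
import io

def aggregate_and_write(requests, fileobj):
    """Merge per-core (offset, data) requests and issue one contiguous write."""
    # Sort by offset so adjacent fragments can be written back-to-back.
    requests = sorted(requests, key=lambda r: r[0])
    base = requests[0][0]
    buf = bytearray()
    expected = base
    for offset, data in requests:
        if offset != expected:            # fill any gap so one write suffices
            buf.extend(b"\x00" * (offset - expected))
            expected = offset
        buf.extend(data)
        expected += len(data)
    fileobj.seek(base)
    fileobj.write(bytes(buf))             # single contiguous request
    return base, len(buf)

if __name__ == "__main__":
    # Three "computing cores" each contribute a non-contiguous fragment.
    reqs = [(0, b"AAAA"), (8, b"CCCC"), (4, b"BBBB")]
    f = io.BytesIO()
    print(aggregate_and_write(reqs, f))   # -> (0, 12)
    print(f.getvalue())                   # -> b"AAAABBBBCCCC"
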
@conference{engelmann06rmix,
  author        = "Christian Engelmann
                   and George A. (Al) Geist",
  title         = "{RMIX}: {A} Dynamic, Heterogeneous, Reconfigurable
                   Communication Framework",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://www.iccs-meeting.org/iccs2006}{$6^{th}$
                   International Conference on Computational Science (ICCS)
                   2006}, Part II: \href{http://www.gup.uni-linz.ac.at/cce2006}
                   {$3^{rd}$ Special Session on Collaborative and Cooperative
                   Environments (CCE) 2006}",
  volume        = "3992",
  pages         = "573--580",
  month         = may # "~28-31, ",
  year          = "2006",
  address       = "Reading, UK",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "3-540-34381-4",
  issn          = "0302-9743",
  doi           = "10.1007/11758525_77",
  url           = "http://www.christian-engelmann.info/publications/engelmann06rmix.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann06rmix.ppt.pdf",
  abstract      = "RMIX is a dynamic, heterogeneous, reconfigurable
                   communication framework that allows software components to
                   communicate using various RMI/RPC protocols, such as ONC RPC,
                   Java RMI and SOAP, by facilitating dynamically loadable
                   provider plug-ins to supply different protocol stacks. With
                   this paper, we present a native (C-based), flexible,
                   adaptable, multi-protocol RMI/RPC communication framework
                   that complements the Java-based RMIX variant previously
                   developed by our partner team at Emory University. Our
                   approach offers the same multi-protocol RMI/RPC services
                   and advanced invocation semantics via a C-based interface
                   that does not require an object-oriented programming
                   language. This paper provides a detailed description of our
                   RMIX framework architecture and some of its features. It
                   describes the general use case of the RMIX framework and its
                   integration into the Harness metacomputing environment in the
                   form of a plug-in.",
  pts           = "1490"
}
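
The multi-protocol dispatch idea behind RMIX, as described in engelmann06rmix, can be sketched as a registry of protocol provider plug-ins. The registry, provider names, and call signature below are illustrative assumptions, not the RMIX C API.

# Tiny sketch of provider-based multi-protocol dispatch: each plug-in
# registers the RMI/RPC protocol stack it implements, and a call is routed
# through whichever provider the caller selects. Names are hypothetical.
from typing import Callable, Dict

PROVIDERS: Dict[str, Callable[[str, tuple], str]] = {}

def register_provider(protocol: str, invoke: Callable[[str, tuple], str]) -> None:
    """A plug-in announces which RMI/RPC protocol it implements."""
    PROVIDERS[protocol] = invoke

def remote_call(protocol: str, method: str, args: tuple) -> str:
    """Route one invocation through the selected provider plug-in."""
    return PROVIDERS[protocol](method, args)

if __name__ == "__main__":
    register_provider("onc-rpc", lambda m, a: f"onc-rpc:{m}{a}")
    register_provider("soap",    lambda m, a: f"soap:{m}{a}")
    print(remote_call("soap", "getTemperature", ("node42",)))
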
@conference{engelmann06active,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and Chokchai (Box) Leangsuksun
                   and Xubin (Ben) He",
  title         = "Active/Active Replication for Highly Available {HPC} System
                   Services",
  booktitle     = "Proceedings of the
                   \href{http://www.ares-conference.eu/ares2006}{$1^{st}$
                   International Conference on Availability, Reliability and
                   Security (ARES) 2006}: $1^{st}$ International Workshop on
                   Frontiers in Availability, Reliability and Security (FARES)
                   2006",
  pages         = "639-645",
  month         = apr # "~20-22, ",
  year          = "2006",
  address       = "Vienna, Austria",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "0-7695-2567-9",
  doi           = "10.1109/ARES.2006.23",
  url           = "http://www.christian-engelmann.info/publications/engelmann06active.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann06active.ppt.pdf",
  abstract      = "Today's high performance computing systems have several
                   reliability deficiencies resulting in availability and
                   serviceability issues. Head and service nodes represent a
                   single point of failure and control for an entire system as
                   they render it inaccessible and unmanageable in case of a
                   failure until repair, causing a significant downtime. This
                   paper introduces two distinct replication methods (internal
                   and external) for providing symmetric active/active high
                   availability for multiple head and service nodes running in
                   virtual synchrony. It presents a comparison of both methods
                   in terms of expected correctness, ease-of-use and performance
                   based on early results from ongoing work in providing
                   symmetric active/active high availability for two HPC system
                   services (TORQUE and PVFS metadata server). It continues with
                   a short description of a distributed mutual exclusion
                   algorithm and a brief statement regarding the handling of
                   Byzantine failures. This paper concludes with an overview of
                   past and ongoing work, and a short summary of the presented
                   research.",
  pts           = "1485"
}
@conference{engelmann05concepts,
  author        = "Christian Engelmann
                   and Stephen L. Scott",
  title         = "Concepts for High Availability in Scientific High-End
                   Computing",
  booktitle     = "Proceedings of the
                   \href{http://xcr.cenit.latech.edu/hapcw2005}{$3^{rd}$ High
                   Availability and Performance Workshop (HAPCW) 2005}, in
                   conjunction with the
                   \href{http://lacsi.rice.edu/symposium/agenda_2005}{$6^{th}$
                   Los Alamos Computer Science Institute (LACSI) Symposium
                   2005}",
  month         = oct # "~11, ",
  year          = "2005",
  address       = "Santa Fe, NM, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann05concepts.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann05concepts.ppt.pdf",
  abstract      = "Scientific high-end computing (HEC) has become an important
                   tool for scientists world-wide to understand problems, such
                   as in nuclear fusion, human genomics and nanotechnology.
                   Every year, new HEC systems emerge on the market with better
                   performance and higher scale. With only very few exceptions,
                   the overall availability of recently installed systems has
                   been lower in comparison to the same deployment phase of
                   their predecessors. In contrast to the experienced loss of
                   availability, the demand for continuous availability has
                   risen dramatically due to the recent trend towards capability
                   computing. In this paper, we analyze the existing
                   deficiencies of current HEC systems and present several high
                   availability concepts to counter the experienced loss of
                   availability and to alleviate the expected impact on
                   next-generation systems. We explain the application of these
                   concepts to current and future HEC systems and list past and
                   ongoing related research. This paper closes with a short
                   summary of the presented work and a brief discussion of
                   future efforts.",
  pts           = "3777"
}
@conference{engelmann05high,
  author        = "Christian Engelmann
                   and Stephen L. Scott",
  title         = "High Availability for Ultra-Scale High-End Scientific
                   Computing",
  booktitle     = "Proceedings of the \href{http://coset.irisa.fr}{$2^{nd}$
                   International Workshop on Operating Systems, Programming
                   Environments and Management Tools for High-Performance
                   Computing on Clusters (COSET-2) 2005}, in conjunction with
                   the \href{http://ics05.csail.mit.edu}{$19^{th}$ ACM
                   International Conference on Supercomputing (ICS) 2005}",
  month         = jun # "~19, ",
  year          = "2005",
  address       = "Cambridge, MA, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann05high.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann05high.ppt.pdf",
  abstract      = "Ultra-scale architectures for scientific high-end computing
                   with tens to hundreds of thousands of processors, such as the
                   IBM Blue Gene/L and the Cray X1, suffer from availability
                   deficiencies, which impact the efficiency of running
                   computational jobs by forcing frequent checkpointing of
                   applications. Most systems are unable to handle runtime
                   system configuration changes caused by failures and require
                   a complete restart of essential system services, such as the
                   job scheduler or MPI, or even of the entire machine. In this
                   paper, we present a flexible, pluggable and component-based
                   high availability framework that expands today's effort in
                   high availability computing of keeping a single server alive
                   to include all machines cooperating in a high-end scientific
                   computing environment, while allowing adaptation to system
                   properties and application needs."
}
@conference{leangsuksun05asymmetric,
  author        = "Chokchai (Box) Leangsuksun
                   and Venkata K. Munganuru
                   and Tong Liu
                   and Stephen L. Scott
                   and Christian Engelmann",
  title         = "Asymmetric Active-Active High Availability for High-end
                   Computing",
  booktitle     = "Proceedings of the \href{http://coset.irisa.fr}{$2^{nd}$
                   International Workshop on Operating Systems, Programming
                   Environments and Management Tools for High-Performance
                   Computing on Clusters (COSET-2) 2005}, in conjunction with
                   the \href{http://ics05.csail.mit.edu}{$19^{th}$ ACM
                   International Conference on Supercomputing (ICS) 2005}",
  month         = jun # "~19, ",
  year          = "2005",
  address       = "Cambridge, MA, USA",
  url           = "http://www.christian-engelmann.info/publications/leangsuksun05asymmetric.pdf",
  url2          = "http://www.christian-engelmann.info/publications/leangsuksun05asymmetric.ppt.pdf",
  abstract      = "Linux clusters have become very popular for scientific
                   computing at research institutions world-wide, because they
                   can be easily deployed at a fairly low cost. However, the
                   most pressing issues of today's cluster solutions are
                   availability and serviceability. The conventional Beowulf
                   cluster architecture has a single head node connected to a
                   group of compute nodes. This head node is a typical single
                   point of failure and control, which severely limits
                   availability and serviceability by effectively cutting off
                   healthy compute nodes from the outside world upon overload
                   or failure. In this paper, we describe a paradigm that
                   addresses this issue using asymmetric active-active high
                   availability. Our framework comprises n + 1 head nodes,
                   where n head nodes are active in the sense that they provide
                   services to simultaneously incoming user requests. One
                   standby server monitors all active servers and performs a
                   fail-over in case of a detected outage. We present a
                   prototype implementation based on a 2 + 1 solution and
                   discuss initial results."
}
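
The n + 1 fail-over behavior summarized in leangsuksun05asymmetric can be sketched as a standby node that watches heartbeats from the active head nodes and takes over any head whose heartbeat goes stale. The heartbeat transport (a plain dictionary) and the timeout value are illustrative assumptions, not the prototype's mechanism.

# Minimal standby-monitor sketch of asymmetric active-active fail-over.
import time

HEARTBEAT_TIMEOUT = 3.0   # seconds without a heartbeat before fail-over

class StandbyMonitor:
    def __init__(self, active_heads):
        self.last_seen = {head: time.monotonic() for head in active_heads}
        self.taken_over = set()

    def record_heartbeat(self, head):
        self.last_seen[head] = time.monotonic()

    def check(self):
        """Return the heads failed over during this scan."""
        now = time.monotonic()
        failed = []
        for head, seen in self.last_seen.items():
            if head in self.taken_over:
                continue
            if now - seen > HEARTBEAT_TIMEOUT:
                self.taken_over.add(head)   # standby assumes this head's role
                failed.append(head)
        return failed

if __name__ == "__main__":
    monitor = StandbyMonitor(["head-a", "head-b"])
    monitor.record_heartbeat("head-a")       # head-b stops sending heartbeats
    time.sleep(0.1)
    monitor.last_seen["head-b"] -= 10        # simulate a stale heartbeat
    print(monitor.check())                   # -> ['head-b']
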
@conference{engelmann05lightweight,
  author        = "Christian Engelmann
                   and George A. (Al) Geist",
  title         = "A Lightweight Kernel for the Harness Metacomputing
                   Framework",
  booktitle     = "Proceedings of the
                   \href{http://www.ipdps.org/ipdps2005}{$19^{th}$ IEEE
                   International Parallel and Distributed Processing Symposium
                   (IPDPS) 2005}: \href{http://www.cs.umass.edu/~rsnbrg/hcw2005}
                   {$14^{th}$ Heterogeneous Computing Workshop (HCW) 2005}",
  month         = apr # "~4, ",
  year          = "2005",
  address       = "Denver, CO, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "0-7695-2312-9",
  issn          = "1530-2075",
  doi           = "10.1109/IPDPS.2005.34",
  url           = "http://www.christian-engelmann.info/publications/engelmann05lightweight.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann05lightweight.ppt.pdf",
  abstract      = "Harness is a pluggable heterogeneous Distributed Virtual
                   Machine (DVM) environment for parallel and distributed
                   scientific computing. This paper describes recent
                   improvements in the Harness kernel design. By using a
                   lightweight approach and moving previously integrated system
                   services into software modules, the software becomes more
                   versatile and adaptable. This paper outlines these changes
                   and explains the major Harness kernel components in more
                   detail. A short overview is given of ongoing efforts in
                   integrating RMIX, a dynamic heterogeneous reconfigurable
                   communication framework, into the Harness environment as a
                   new plug-in software module. We describe the overall impact
                   of these changes and how they relate to other ongoing work."
}
@conference{engelmann04high,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and George A. (Al) Geist",
  title         = "High Availability through Distributed Control",
  booktitle     = "Proceedings of the
                   \href{http://xcr.cenit.latech.edu/hapcw2004}{$2^{nd}$ High
                   Availability and Performance Workshop (HAPCW) 2004}, in
                   conjunction with the
                   \href{http://lacsi.rice.edu/symposium/agenda_2004}{$5^{th}$
                   Los Alamos Computer Science Institute (LACSI) Symposium
                   2004}",
  month         = oct # "~12, ",
  year          = "2004",
  address       = "Santa Fe, NM, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann04high.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann04high.ppt.pdf",
  abstract      = "Cost-effective, flexible and efficient scientific simulations
                   in cutting-edge research areas utilize huge high-end
                   computing resources with thousands of processors. In the next
                   five to ten years the number of processors in such computer
                   systems will rise to tens of thousands, while scientific
                   application running times are expected to increase further
                   beyond the Mean-Time-To-Interrupt (MTTI) of hardware and
                   system software components. This paper describes the ongoing
                   research in heterogeneous adaptable reconfigurable networked
                   systems (Harness) and its recent achievements in the area of
                   high availability distributed virtual machine environments
                   for parallel and distributed scientific computing. It shows
                   how a distributed control algorithm is able to steer a
                   distributed virtual machine process in virtual synchrony
                   while maintaining consistent replication for high
                   availability. It briefly illustrates ongoing work in
                   heterogeneous reconfigurable communication frameworks and
                   security mechanisms. The paper continues with a short
                   overview of similar research in reliable group communication
                   frameworks, fault-tolerant process groups and highly
                   available distributed virtual processes. It closes with a
                   brief discussion of possible future research directions."
}
@conference{he04highly,
  author        = "Xubin (Ben) He
                   and Li Ou
                   and Stephen L. Scott
                   and Christian Engelmann",
  title         = "A Highly Available Cluster Storage System using Scavenging",
  booktitle     = "Proceedings of the
                   \href{http://xcr.cenit.latech.edu/hapcw2004}{$2^{nd}$ High
                   Availability and Performance Workshop (HAPCW) 2004}, in
                   conjunction with the
                   \href{http://lacsi.rice.edu/symposium/agenda_2004}{$5^{th}$
                   Los Alamos Computer Science Institute (LACSI) Symposium
                   2004}",
  month         = oct # "~12, ",
  year          = "2004",
  address       = "Santa Fe, NM, USA",
  url           = "http://www.christian-engelmann.info/publications/he04highly.pdf",
  url2          = "http://www.christian-engelmann.info/publications/he04highly.ppt.pdf",
  abstract      = "Highly available data storage for high-performance computing
                   is becoming increasingly more critical as high-end computing
                   systems scale up in size and storage systems are developed
                   around network-centered architectures. A promising solution
                   is to harness the collective storage potential of individual
                   workstations much as we harness idle CPU cycles due to the
                   excellent price/performance ratio and low storage usage of
                   most commodity workstations. For such a storage system,
                   metadata consistency is a key issue assuring storage system
                   availability as well as data reliability. In this paper, we
                   present a decentralized metadata management scheme that
                   improves storage availability without sacrificing
                   performance."
}
@conference{engelmann03diskless,
  author        = "Christian Engelmann
                   and George A. (Al) Geist",
  title         = "A Diskless Checkpointing Algorithm for Super-scale
                   Architectures Applied to the Fast Fourier Transform",
  booktitle     = "Proceedings of the
                   \href{http://www.cs.msstate.edu/~clade2003}{Challenges of
                   Large Applications in Distributed Environments Workshop
                   (CLADE) 2003}, in conjunction with the
                   \href{http://csag.ucsd.edu/HPDC-12}{$12^{th}$ IEEE
                   International Symposium on High Performance Distributed
                   Computing (HPDC) 2003}",
  pages         = "47",
  month         = jun # "~21, ",
  year          = "2003",
  address       = "Seattle, WA, USA",
  publisher     = "\href{http://www.computer.org}{IEEE Computer Society, Los
                   Alamitos, CA, USA}",
  isbn          = "0-7695-1984-9",
  doi           = "xpls/abs_all.jsp?arnumber=4159902",
  url           = "http://www.christian-engelmann.info/publications/engelmann03diskless.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann03diskless.ppt.pdf",
  abstract      = "This paper discusses the issue of fault-tolerance in
                   distributed computer systems with tens or hundreds of
                   thousands of diskless processor units. Such systems, like the
                   IBM Blue Gene/L, are predicted to be deployed in the next
                   five to ten years. Since a 100,000-processor system is going
                   to be less reliable, scientific applications need to be able
                   to recover from occurring failures more efficiently. In this
                   paper, we adapt the present technique of diskless
                   checkpointing to such huge distributed systems in order to
                   equip existing scientific algorithms with super-scalable
                   fault-tolerance. First, we discuss the method of diskless
                   checkpointing, then we adapt this technique to super-scale
                   architectures and finally we present results from an
                   implementation of the Fast Fourier Transform that uses the
                   adapted technique to achieve super-scale fault-tolerance."
}
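
The diskless checkpointing idea in engelmann03diskless can be illustrated with a small parity sketch: each process keeps its checkpoint in memory and a parity process holds the XOR of all of them, so any single lost checkpoint can be rebuilt. The equal-size byte blocks and single parity process are assumptions; the paper's FFT-specific encoding is not reproduced.

# Parity-based diskless checkpointing sketch (illustration only).
def xor_blocks(blocks):
    """XOR equally sized byte blocks together (the parity computation)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(lost_rank, surviving, parity):
    """Rebuild the checkpoint of `lost_rank` from the survivors and parity."""
    return xor_blocks([parity] + [ckpt for rank, ckpt in surviving.items()
                                  if rank != lost_rank])

if __name__ == "__main__":
    checkpoints = {0: b"\x01\x02\x03\x04",
                   1: b"\x10\x20\x30\x40",
                   2: b"\xaa\xbb\xcc\xdd"}
    parity = xor_blocks(list(checkpoints.values()))   # held by a parity process
    survivors = {r: c for r, c in checkpoints.items() if r != 1}  # rank 1 fails
    print(recover(1, survivors, parity) == checkpoints[1])        # -> True
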
@conference{engelmann02distributed,
  author        = "Christian Engelmann
                   and Stephen L. Scott
                   and George A. (Al) Geist",
  title         = "Distributed Peer-to-Peer Control in {Harness}",
  booktitle     = "Lecture Notes in Computer Science: Proceedings of the
                   \href{http://www.science.uva.nl/events/ICCS2002}{$2^{nd}$
                   International Conference on Computational Science (ICCS)
                   2002}, Part II: Workshop on Global and Collaborative
                   Computing",
  volume        = "2330",
  pages         = "720--727",
  month         = apr # "~21-24, ",
  year          = "2002",
  address       = "Amsterdam, The Netherlands",
  publisher     = "\href{http://www.springer.com}{Springer Verlag, Berlin,
                   Germany}",
  isbn          = "3-540-43593-X",
  issn          = "0302-9743",
  doi           = "content/l537ujfwt8yta2dp",
  url           = "http://www.christian-engelmann.info/publications/engelmann02distributed.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann02distributed.ppt.pdf",
  abstract      = "Harness is an adaptable fault-tolerant virtual machine
                   environment for next-generation heterogeneous distributed
                   computing developed as a follow on to PVM. It additionally
                   enables the assembly of applications from plug-ins and
                   provides fault-tolerance. This work describes the distributed
                   control, which manages global state replication to ensure a
                   high-availability of service. Group communication services
                   achieve an agreement on an initial global state and a linear
                   history of global state changes at all members of the
                   distributed virtual machine. This global state is replicated
                   to all members to easily recover from single, multiple and
                   cascaded faults. A peer-to-peer ring network architecture and
                   tunable multi-point failure conditions provide heterogeneity
                   and scalability. Finally, the integration of the distributed
                   control into the multi-threaded kernel architecture of
                   Harness offers a fault-tolerant global state database service
                   for plug-ins and applications."
}
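
The replicated-global-state idea in engelmann02distributed boils down to every member applying the same agreed, totally ordered history of state changes and therefore holding the same global state. The dict-based state and (key, value) change format below are assumptions for illustration only.

# Sketch: apply an agreed linear history of state changes on every replica.
def apply_history(initial_state, history):
    """Apply an agreed, totally ordered list of (key, value) changes."""
    state = dict(initial_state)
    for key, value in history:
        state[key] = value
    return state

if __name__ == "__main__":
    agreed_history = [("node3", "joined"), ("plugin:rmix", "loaded"),
                      ("node3", "failed")]
    # Every member starts from the same agreed initial state ...
    replicas = [apply_history({"node1": "joined"}, agreed_history)
                for _ in range(4)]
    # ... and ends up with identical replicas of the global state.
    print(all(r == replicas[0] for r in replicas))    # -> True
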
@misc{engelmann23intersect,
  author        = "Christian Engelmann
                   and Swen Boehm
                   and Michael Brim
                   and Jack Lange
                   and Thomas Naughton
                   and Patrick Widener
                   and Ben Mintz
                   and Rohit Srivastava",
  title         = "INTERSECT: The Open Federated Architecture for the
                   Laboratory of the Future",
  month         = aug # "~7-10, ",
  year          = "2023",
  howpublished  = "{Poster at the \href{https://icpp23.sci.utah.edu/}
                   {52nd International Conference on Parallel Processing (ICPP)
                   2023}, Salt Lake City, UT, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann23intersect.ppt.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann23intersect.pdf",
  abstract      = "The open Self-driven Experiments for Science / Interconnected
                   Science Ecosystem (INTERSECT) architecture connects
                   scientific instruments and robot-controlled laboratories with
                   computing and data resources at the edge, the Cloud or the
                   high-performance computing center to enable autonomous
                   experiments, self-driving laboratories, smart manufacturing,
                   and artificial intelligence driven design, discovery and
                   evaluation. Its novel approach consists of science use case
                   design patterns, a system of systems architecture, and a
                   microservice architecture."
}
@misc{engelmann22resilience,
  author        = "Christian Engelmann and Mohit Kumar",
  title         = "Resilience Design Patterns: A Structured Modeling Approach of
                   Resilience in Computing Systems",
  month         = aug # "~10-12, ",
  year          = "2022",
  howpublished  = "{Poster at the \href{https://www.bnl.gov/modsim2022}
                   {Workshop on Modeling and Simulation of Systems and
                   Applications (ModSim) 2022}, Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann22resilience.ppt.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann22resilience.pdf",
  abstract      = "Resilience to faults, errors, and failures in extreme-scale
                   high-performance computing (HPC) systems is a critical
                   challenge. Resilience design patterns (Figure 1) offer a new,
                   structured hardware/software design approach for improving
                   resilience by identifying and evaluating repeatedly occurring
                   resilience problems and coordinating corresponding solutions.
                   Initial work identified and formalized these patterns and
                   developed a proof-of-concept prototype to demonstrate
                   portable resilience. This recent work created performance,
                   reliability, and availability models for each of the
                   identified 15 structural resilience design patterns and a
                   modeling tool that allows (1) exploring the performance,
                   reliability, and availability of each pattern, and (2)
                   investigating the trade-offs between patterns and pattern
                   combinations."
}
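
As an illustration of the kind of model such a tool can evaluate for a checkpoint/restart-style pattern, the sketch below uses the textbook steady-state availability formula and Young's optimal checkpoint interval. These stand-in formulas are assumptions for illustration and are not taken from the poster's 15 pattern models.

# Hedged sketch of a simple performance/availability model evaluation.
import math

def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up, given mean time to failure/repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

def young_optimal_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval minimizing waste."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

if __name__ == "__main__":
    print(round(steady_state_availability(mttf_hours=24.0, mttr_hours=1.0), 4))
    print(round(young_optimal_interval(checkpoint_cost_s=60.0,
                                       mtbf_s=24 * 3600.0), 1))
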
@misc{hui18realtime,
  author        = "Yawei Hui
                   and Rizwan Ashraf
                   and Byung Hoon (Hoony) Park
                   and Christian Engelmann",
  title         = "Real-Time Assessment of Supercomputer Status by a
                   Comprehensive Informative Metric through Streaming
                   Processing",
  month         = dec # "~10-13, ",
  year          = "2018",
  howpublished  = "{Poster at the
                    \href{http://cci.drexel.edu/bigdata/bigdata2018}
                    {$6^{th}$ IEEE International Conference on Big Data (BigData) 2018},
                    Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/hui18realtime.pdf",
  abstract      = "Supercomputers are complex systems used to simulate,
                   understand and solve real-world problems. In order to operate
                   these systems efficiently and for the purpose of their
                   maintainability, an accurate, concise, and timely
                   determination of system status is crucial for its users and
                   operators. However, this determination is challenging due to
                   intricately connected heterogeneous software and hardware
                   components, and due to sheer scale of such machines. In this
                   poster, we demonstrate work-in-progress towards realization
                   of a real-time monitoring framework for the 18,688-node Titan
                   supercomputer at Oak Ridge Leadership Computing Facility
                   (OLCF). Toward this end, we discuss the use of metrics which
                   present a one-dimensional view of the system generating
                   various types of information from 1000s of components and
                   utilization statistics from 100s of user applications in near
                   real-time. We demonstrate the efficacy of these metrics to
                   understand and visualize raw log data generated by the
                   system which otherwise may comprise 1000s of dimensions.
                   We also demonstrate the architecture of proposed real-time
                   stream processing framework which integrates, processes,
                   analyzes, visualizes and stores system log data from an array
                   of system components."
}
@misc{hui18comprehensive,
  author        = "Yawei Hui
                   and Byung Hoon (Hoony) Park
                   and Christian Engelmann",
  title         = "A Comprehensive Informative Metric for Summarizing {HPC}
                   System Status",
  month         = oct # "~21, ",
  year          = "2018",
  howpublished  = "{Poster at the \href{http://ldav.org}
                    {$8^{th}$ IEEE Symposium on Large Data Analysis and
                     Visualization} in conjunction with the 
                     \href{http://ieeevis.org/year/2018}{IEEE VIS 2018},
                    Berlin, Germany}",
  url           = "http://www.christian-engelmann.info/publications/hui18comprehensive.pdf",
  abstract      = "It remains a major challenge to effectively summarize and
                   visualize in a comprehensive form the status of a complex
                   computer system, such as the Titan supercomputer at the Oak
                   Ridge Leadership Computing Facility (OLCF). In the ongoing
                   research highlighted in this poster, we present system
                   information entropy (SIE), a newly developed system metric
                   that leverages the powers of traditional machine learning
                   techniques and information theory. By compressing the
                   multi-variant multi-dimensional event information recorded
                   during the operation of the targeted system into a single
                   time series of SIE, we demonstrate that the historical
                   system status can be sensitively summarized in the form of SIE
                   and visualized concisely and comprehensively."
}
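
A minimal sketch of an entropy-based summary metric in the spirit of SIE is shown below: it computes the Shannon entropy of the event-type distribution within one log window. The event names and window contents are made up, and the poster's exact SIE definition may differ.

# Illustrative Shannon-entropy summary of one window of system log events.
import math
from collections import Counter

def window_entropy(events):
    """Shannon entropy (bits) of the event-type distribution in a window."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    quiet_window = ["heartbeat"] * 98 + ["ecc_error"] * 2
    noisy_window = ["heartbeat"] * 40 + ["ecc_error"] * 30 + ["node_down"] * 30
    print(round(window_entropy(quiet_window), 3))   # low entropy: stable system
    print(round(window_entropy(noisy_window), 3))   # higher entropy: anomalies
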
@misc{engelmann18modeling2,
  author        = "Christian Engelmann and Rizwan Ashraf",
  title         = "Modeling and Simulation of Extreme-Scale Systems for
                   Resilience by Design",
  month         = aug # "~15-17, ",
  year          = "2018",
  howpublished  = "{Poster at the \href{https://www.bnl.gov/modsim2018}
                   {Workshop on Modeling and Simulation of Systems and
                   Applications}, Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann18modeling2.pdf",
  abstract      = "Resilience is a serious concern for extreme-scale
                   high-performance computing (HPC). While the HPC community has
                   developed various resilience solutions, the solution space
                   remains fragmented. We created a structured approach to the
                   design, evaluation and optimization of HPC resilience using
                   the concept of design patterns. A design pattern describes a
                   generalized solution to a repeatedly occurring problem. We
                   identified the commonly occurring problems and solutions used
                   to deal with faults, errors and failures in HPC systems. Each
                   well-known solution that addresses a specific resilience
                   challenge is described in the form of a design pattern. We
                   developed a resilience design pattern specification, language
                   and catalog, which can be used by system architects, system
                   software and library developers, application programmers, as
                   well as users and operators as essential building blocks when
                   designing and deploying resilience solutions.
                   The resilience design pattern approach provides a unique
                   opportunity for design space exploration. As each resilience
                   solution is abstracted as a pattern and each solution's
                   properties are defined by pattern parameters, vertical and
                   horizontal pattern compositions can describe the resilience
                   capabilities of an entire HPC system. This permits the
                   investigation of beneficial or counterproductive interactions
                   between patterns and of the performance, resilience, and
                   power consumption trade-off between different pattern
                   parameters and compositions. The ultimate goal is to make
                   resilience an integral part of the HPC hardware/software
                   ecosystem by coordinating the various existing resilience
                   solutions in a design space exploration process, such that
                   the burden for providing resilience is on the system by
                   design and not on the user as an afterthought.
                   We are in the early stages of developing a novel design space
                   exploration tool that enables this investigation using
                   modeling and simulation. We developed performance and
                   resilience models for each resilience design pattern. We also
                   leverage results from the Catalog project, a collaborative
                   effort between Oak Ridge National Laboratory, Argonne
                   National Laboratory and Lawrence Livermore National
                   Laboratory that developed models of the faults, errors and
                   failures in today's HPC systems. We also leverage recent
                   results from the same project by Lawrence Livermore National
                   Laboratory in application reliability patterns. The planned
                   research extends and combines this work to model the
                   performance, resilience, and power consumption of an entire
                   HPC system, initially at node-level granularity, and to
                   simulate the dynamic interactions between deployed
                   resilience solutions and the rest of the system. In the next
                   iteration, finer-grain modeling and simulation, such as at
                   the computational unit level, is used to increase accuracy.
                   This work leverages the experience of the investigators in
                   parallel discrete event simulation of extreme-scale systems,
                   such as the Extreme-scale Simulator (xSim).
                   The current state of the art in resilience modeling and
                   simulation is fragmented as well. There is currently no such
                   design space exploration tool. Instead, each resilience
                   solution is typically investigated separately. There is only
                   a small amount of work on multi-resilience solutions,
                   including by the investigators. While there is work in
                   investigating the performance/resilience trade-off space,
                   there is almost no work in including power consumption."
}
@misc{patil17exploring,
  author        = "Onkar Patil
                   and Saurabh Hukerikar
                   and Frank Mueller
                   and Christian Engelmann",
  title         = "Exploring Use Cases for Non-Volatile Memories in Support of
                   {HPC} Resilience",
  month         = nov # "~12-17, ",
  year          = "2017",
  howpublished  = "{Poster at the \href{http://sc17.supercomputing.org}
                   {30th IEEE/ACM International Conference on High Performance
                    Computing, Networking, Storage and Analysis (SC) 2017},
                   Denver, CO, USA}",
  url           = "http://www.christian-engelmann.info/publications/patil17exploring.pdf",
  url2          = "http://www.christian-engelmann.info/publications/patil17exploring.ppt.pdf",
  abstract      = "Improving resilience and creating resilient architectures is
                   one of the major goals of exascale computing. With the advent
                   of Non-volatile memory technologies, memory architectures
                   with persistent memory regions will be a significant part of
                   future architectures. There is potential to use them in more
                   than one way to benefit different applications. We look to
                   take advantage of this technology to enable more fine-grained
                   and novel methodology that will improve resilience and
                   efficiency of exascale applications. We have developed three
                   modes of memory usage for persistent memory to enable
                   efficient checkpointing in HPC applications. We have
                   developed a simple API that is evaluated with the DGEMM
                   benchmark on a 16-node cluster with independent SSDs on every
                   node. Our aim is to build on this work and enable static and
                   dynamic runtime systems that will inherently make the HPC
                   applications more fault-tolerant and resistant to errors."
}
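
A hypothetical sketch of a tiny checkpoint/restore interface over a persistent memory region, in the spirit of the simple API mentioned in patil17exploring, is given below. The function names, the file-backed stand-in for the persistent region, and the single usage mode shown are assumptions; the paper's actual API and its three usage modes are not reproduced.

# Hypothetical persistent-memory checkpoint interface (illustration only).
import os

PMEM_PATH = "/tmp/pmem_region.ckpt"    # stand-in for an SSD/NVM-backed region

def pmem_checkpoint(data: bytes, path: str = PMEM_PATH) -> None:
    """Persist application state and make it durable before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())           # durability point of the checkpoint

def pmem_restore(path: str = PMEM_PATH) -> bytes:
    """Read the most recent checkpoint back after a failure."""
    with open(path, "rb") as f:
        return f.read()

if __name__ == "__main__":
    pmem_checkpoint(b"iteration=42;residual=1e-6")
    print(pmem_restore())              # -> b'iteration=42;residual=1e-6'
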
@misc{fiala11detection,
  author        = "David Fiala
                   and Frank Mueller
                   and Christian Engelmann
                   and Rolf Riesen
                   and Kurt Ferreira",
  title         = "Detection and Correction of Silent Data Corruption for
                   Large-Scale High-Performance Computing",
  month         = nov # "~12-18, ",
  year          = "2011",
  howpublished  = "{Poster at the \href{http://sc11.supercomputing.org}
                   {24th IEEE/ACM International Conference on High Performance
                    Computing, Networking, Storage and Analysis (SC) 2011},
                   Seattle, WA, USA}",
  url           = "",
  abstract      = "Faults have become the norm rather than the exception for
                   high-end computing on clusters with 10s/100s of thousands of
                   cores. Exacerbating this situation, some of these faults will
                   not be detected, manifesting themselves as silent errors that
                   will corrupt memory while applications continue to operate and
                   report incorrect results. This poster introduces RedMPI, an
                   MPI library which resides in the MPI profiling layer. RedMPI
                   is capable of both online detection and correction of soft
                   errors that occur in MPI applications without requiring any
                   modifications to the application source. By providing
                   redundancy, RedMPI is capable of transparently detecting
                   corrupt messages from MPI processes that become faulted during
                   execution. Furthermore, with triple redundancy RedMPI
                   additionally ``votes'' out MPI messages of a faulted process
                   by replacing corrupted results with corrected results from
                   unfaulted processes. We present an experimental evaluation of
                   RedMPI on an assortment of applications to demonstrate the
                   effectiveness of this approach."
}
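
The triple-redundancy voting described for RedMPI can be illustrated as a byte-for-byte majority vote over redundant copies of the same logical message. This is a sketch of the voting concept only, not RedMPI's MPI profiling-layer interposition.

# Majority voting over redundant message copies (illustration only).
from collections import Counter

def vote(replica_messages):
    """Return the majority message among redundant copies."""
    counts = Counter(replica_messages)
    winner, votes = counts.most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: silent corruption cannot be masked")
    return winner

if __name__ == "__main__":
    healthy   = b"\x00\x01\x02\x03"
    corrupted = b"\x00\x01\x82\x03"          # one replica suffered a bit flip
    print(vote([healthy, corrupted, healthy]) == healthy)   # -> True
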
@misc{fiala11tunable2,
  author        = "David Fiala
                   and Kurt Ferreira
                   and Frank Mueller
                   and Christian Engelmann",
  title         = "A Tunable, Software-based {DRAM} Error Detection and Correction
                   Library for {HPC}",
  month         = nov # "~12-18, ",
  year          = "2011",
  howpublished  = "{Poster at the \href{http://sc11.supercomputing.org}
                   {24th IEEE/ACM International Conference on High Performance
                    Computing, Networking, Storage and Analysis (SC) 2011},
                   Seattle, WA, USA}",
  url           = "",
  abstract      = "Proposed exascale systems will present a number of
                   considerable resiliency challenges. In particular, DRAM
                   soft-errors, or bit-flips, are expected to greatly increase
                   due to the increased memory density of these systems. Current
                   hardware-based fault-tolerance methods will be unsuitable for
                   addressing the expected soft error frequency rate. As a
                   result, additional software will be needed to address this
                   challenge. In this paper we introduce LIBSDC, a tunable,
                   transparent silent data corruption detection and correction
                   library for HPC applications. LIBSDC provides comprehensive
                   SDC protection for program memory by implementing on-demand
                   page integrity verification using the MMU. Experimental
                   benchmarks with Mantevo HPCCG show that once tuned, LIBSDC is
                   able to achieve SDC protection with less than 100\%
                   resource overhead."
}
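
A conceptual sketch of on-demand page integrity verification follows: checksum a page when it is registered and re-verify it before its contents are trusted again. LIBSDC does this transparently through the MMU in C; the hash choice and the explicit verify() call below are simplifications for illustration only.

# Page-level checksum verification concept for silent-data-corruption detection.
import hashlib

PAGE_SIZE = 4096

class PageGuard:
    def __init__(self):
        self._sums = {}

    def protect(self, page_id: int, page: bytes) -> None:
        """Record a checksum for a page that should stay unchanged."""
        self._sums[page_id] = hashlib.sha256(page).digest()

    def verify(self, page_id: int, page: bytes) -> bool:
        """Return False if the page no longer matches its recorded checksum."""
        return hashlib.sha256(page).digest() == self._sums[page_id]

if __name__ == "__main__":
    guard = PageGuard()
    page = bytearray(PAGE_SIZE)
    guard.protect(0, bytes(page))
    page[123] ^= 0x04                      # simulate a DRAM bit flip
    print(guard.verify(0, bytes(page)))    # -> False: corruption detected
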
@misc{scott09tunable2,
  author        = "Stephen L. Scott
                   and Christian Engelmann
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Anand Tikotekar
                   and George Ostrouchov
                   and Chokchai (Box) Leangsuksun
                   and Nichamon Naksinehaboon
                   and Raja Nassar
                   and Mihaela Paun
                   and Frank Mueller
                   and Chao Wang
                   and Arun B. Nagarajan
                   and Jyothish Varma",
  title         = "A Tunable Holistic Resiliency Approach for High-Performance
                   Computing Systems",
  month         = aug # "~12-14, ",
  year          = "2009",
  howpublished  = "{Poster at the
                   \href{http://institute.lanl.gov/resilience/conferences/2009}
                   {National HPC Workshop on Resilience 2009}, Arlington, VA,
                   USA}",
  url           = "http://www.christian-engelmann.info/publications/scott09tunable2.pdf",
  abstract      = "In order to address anticipated high failure rates,
                   resiliency characteristics have become an urgent priority for
                   next-generation extreme-scale high-performance computing
                   (HPC) systems. This poster describes our past and ongoing
                   efforts in novel fault resilience technologies for HPC.
                   Presented work includes proactive fault resilience
                   techniques, system and application reliability models and
                   analyses, failure prediction, transparent process- and
                   virtual-machine-level migration, and trade-off models for
                   combining preemptive migration with checkpoint/restart. This
                   poster summarizes our work and puts all individual
                   technologies into context with a proposed holistic fault
                   resilience framework."
}
@misc{scott09systemlevel,
  author        = "Stephen L. Scott
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Anand Tikotekar
                   and Christian Engelmann
                   and Hong H. Ong",
  title         = "System-level Virtualization for High-Performance
                   Computing",
  month         = aug # "~12-14, ",
  year          = "2009",
  howpublished  = "{Poster at the
                   \href{http://institute.lanl.gov/resilience/conferences/2009}
                   {National HPC Workshop on Resilience 2009}, Arlington, VA,
                   USA}",
  url           = "http://www.christian-engelmann.info/publications/scott09systemlevel.pdf",
  abstract      = "This poster summarizes our past and ongoing research and
                   development efforts in novel system software solutions for
                   providing a virtual system environment (VSE) for
                   next-generation extreme-scale high-performance computing
                   (HPC) systems and beyond. The poster showcases results of
                   developed proof-of-concept implementations and performed
                   theoretical analyses, outlines planned research and
                   development activities, and presents respective initial
                   results."
}
@misc{scott09tunable,
  author        = "Stephen L. Scott
                   and Christian Engelmann
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Anand Tikotekar
                   and George Ostrouchov
                   and Chokchai (Box) Leangsuksun
                   and Nichamon Naksinehaboon
                   and Raja Nassar
                   and Mihaela Paun
                   and Frank Mueller
                   and Chao Wang
                   and Arun B. Nagarajan
                   and Jyothish Varma",
  title         = "A Tunable Holistic Resiliency Approach for High-Performance
                   Computing Systems",
  month         = feb # "~14-18, ",
  year          = "2009",
  howpublished  = "{Poster at the \href{http://ppopp09.rice.edu}{$14^{th}$ ACM
                   SIGPLAN Symposium on Principles and Practice of Parallel
                   Programming (PPoPP) 2009}, Raleigh, NC, USA}",
  url           = "http://www.christian-engelmann.info/publications/scott09tunable.pdf",
  abstract      = "In order to address anticipated high failure rates,
                   resiliency characteristics have become an urgent priority for
                   next-generation extreme-scale high-performance computing
                   (HPC) systems. This poster describes our past and ongoing
                   efforts in novel fault resilience technologies for HPC.
                   Presented work includes proactive fault resilience
                   techniques, system and application reliability models and
                   analyses, failure prediction, transparent process- and
                   virtual-machine-level migration, and trade-off models for
                   combining preemptive migration with checkpoint/restart. This
                   poster summarizes our work and puts all individual
                   technologies into context with a proposed holistic fault
                   resilience framework."
}
@misc{geist08harness,
  author        = "George A. (Al) Geist
                   and Christian Engelmann
                   and Jack J. Dongarra
                   and George Bosilca
                   and Magdalena M. S\l{}awi\'nska
                   and Jaros\l{}aw K. S\l{}awi\'nski",
  title         = "The {Harness} Workbench: {U}nified and Adaptive Access to
                   Diverse High-Performance Computing Platforms",
  month         = mar # "~30 - " # apr # "~5, ",
  year          = "2008",
  howpublished  = "{Poster at the \href{http://www.hpcsw.org}{$1^{st}$
                   High-Performance Computer Science Week (HPCSW) 2008}, Denver,
                   CO, USA}",
  url           = "http://www.christian-engelmann.info/publications/geist08harness.pdf",
  abstract      = "This poster summarizes our past and ongoing research and
                   development efforts in novel software solutions for providing
                   unified and adaptive access to diverse high-performance
                   computing (HPC) platforms. The poster showcases developed
                   proof-of-concept implementations of tools and mechanisms that
                   simplify scientific application development and deployment
                   tasks, such that only minimal adaptation is needed when
                   moving from one HPC system to another or after HPC system
                   upgrades."
}
@misc{scott08resiliency,
  author        = "Stephen L. Scott
                   and Christian Engelmann
                   and Hong H. Ong
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Anand Tikotekar
                   and George Ostrouchov
                   and Chokchai (Box) Leangsuksun
                   and Nichamon Naksinehaboon
                   and Raja Nassar
                   and Mihaela Paun
                   and Frank Mueller
                   and Chao Wang
                   and Arun B. Nagarajan
                   and Jyothish Varma
                   and Xubin (Ben) He
                   and Li Ou
                   and Xin Chen",
  title         = "Resiliency for High-Performance Computing Systems",
  month         = mar # "~30 - " # apr # "~5, ",
  year          = "2008",
  howpublished  = "{Poster at the \href{http://www.hpcsw.org}{$1^{st}$
                   High-Performance Computer Science Week (HPCSW) 2008}, Denver,
                   CO, USA}",
  url           = "http://www.christian-engelmann.info/publications/scott08resiliency.pdf",
  abstract      = "This poster summarizes our past and ongoing research and
                   development efforts in novel system software solutions for
                   providing high-level reliability, availability and
                   serviceability (RAS) for next-generation extreme-scale
                   high-performance computing (HPC) systems and beyond. The
                   poster showcases results of developed proof-of-concept
                   implementations and performed theoretical analyses, outlines
                   planned research and development activities, and presents
                   respective initial results."
}
@misc{scott08systemlevel,
  author        = "Stephen L. Scott
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Anand Tikotekar
                   and Christian Engelmann
                   and Hong H. Ong",
  title         = "System-level Virtualization for High-Performance
                   Computing",
  month         = mar # "~30 - " # apr # "~5, ",
  year          = "2008",
  howpublished  = "{Poster at the \href{http://www.hpcsw.org}{$1^{st}$
                   High-Performance Computer Science Week (HPCSW) 2008}, Denver,
                   CO, USA}",
  url           = "http://www.christian-engelmann.info/publications/scott08systemlevel.pdf",
  abstract      = "This poster summarizes our past and ongoing research and
                   development efforts in novel system software solutions for
                   providing a virtual system environment (VSE) for
                   next-generation extreme-scale high-performance computing
                   (HPC) systems and beyond. The poster showcases results of
                   developed proof-of-concept implementations and performed
                   theoretical analyses, outlines planned research and
                   development activities, and presents respective initial
                   results."
}
@misc{adamson21cybersecurity,
  author        = "Ryan Adamson
                   and Christian Engelmann",
  title         = "Cybersecurity and Privacy for Instrument-to-Edge-to-Center
                   Scientific Computing Ecosystems",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{https://www.orau.gov/2021ascr-cybersecurity}
                   {ASCR Workshop on Cybersecurity and Privacy for Scientific
                    Computing Ecosystems}",
  month         = nov # "~3-5, ",
  year          = "2021",
  url           = "http://www.christian-engelmann.info/publications/adamson21cybersecurity.pdf",
  abstract      = "The DOE's Artificial Intelligence (AI) for Science report
                   outlines the need for intelligent systems, instruments, and
                   facilities to enable science breakthroughs with autonomous
                   experiments, 'self-driving' laboratories, smart manufacturing,
                   and AI-driven design, discovery and evaluation. The DOE's
                   Computational Facilities Research Workshop report identifies
                   intelligent systems/facilities as a challenge with enabling
                   automation and eliminating human-in-the-loop needs as a
                   cross-cutting theme. Autonomous experiments, 'self-driving'
                   laboratories and smart manufacturing employ
                   machine-in-the-loop intelligence for decision-making.
                   Human-in-the-loop needs are reduced by an autonomous online
                   control that collects experiment data, analyzes it, and
                   takes appropriate operational actions in real time to steer
                   an ongoing experiment or plan the next one. DOE laboratories are
                   currently in the process of developing and deploying
                   federated hardware/software architectures for connecting
                   instruments with edge and center computing resources to
                   autonomously collect, transfer, store, process, curate, and
                   archive scientific data. These new
                   instrument-to-edge-to-center scientific ecosystems face
                   several cybersecurity and privacy challenges."
}
@misc{li21toward,
  author        = "Mingyan Li
                   and Robert A. Bridges
                   and Pablo Moriano
                   and Christian Engelmann
                   and Feiyi Wang
                   and Ryan Adamson",
  title         = "Toward Effective Security/Reliability Situational Awareness
                   via Concurrent Security-or-Fault Analytics",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{https://www.orau.gov/2021ascr-cybersecurity}
                   {ASCR Workshop on Cybersecurity and Privacy for Scientific
                    Computing Ecosystems}",
  month         = nov # "~3-5, ",
  year          = "2021",
  url           = "http://www.christian-engelmann.info/publications/li21toward.pdf",
  abstract      = "Modern critical infrastructures (CI) and scientific computing
                   ecosystems (SCE) are complex and vulnerable. The complexity
                   of CI/SCE, such as the distributed workload found across ASCR
                   scientific computing facilities, does not allow for easy
                   differentiation between emerging cyber security and
                   reliability threats. It is also not easy to correctly
                   identify misbehaving systems. Sometimes, system failures
                   are simply caused by unintentional user misbehavior or actual
                   hardware/software reliability issues, but it may take a
                   significant amount of time and effort to develop that
                   understanding through root-cause analysis. On the security
                   front, CI/SCE are vital assets. They are prime targets of,
                   and are vulnerable to, malicious cyber-attacks. Within DOE,
                   with inter-disciplinary and cross-facility collaboration
                   (e.g., the ORNL INTERSECT initiative and next-gen
                   supercomputing such as OLCF6), the traditional
                   perimeter-based defense and the demarcation line between
                   malicious cyber-attacks and non-malicious system faults are
                   blurring. Amidst realistic reliability and
                   security threats, the ability to effectively distinguish
                   between non-malicious faults and malicious attacks is
                   critical not only in root cause identification but also in
                   countermeasure generation."
}
@misc{finkel21research2,
  author        = "Hal Finkel
                   and Pete Beckman
                   and Christian Engelmann
                   and Shantenu Jha
                   and Jack Lange",
  title         = "Research Opportunities in Operating Systems for Scientific Edge Computing",
  howpublished  = "White paper by the U.S. Department of Energy's
                   \href{https://www.orau.gov/OSRoundtable2021}
                   {ASCR Roundtable Discussions on Operating-Systems Research 2021}",
  month         = jan # "~25, ",
  year          = "2021",
  url           = "http://www.christian-engelmann.info/publications/finkel21research2.pdf",
  abstract      = "As scientific experiments generate ever-increasing amounts of
                   data, and grow in operational complexity, modern experimental
                   science demands unprecedented computational capabilities at
                   the edge -- physically proximate to each experiment. While
                   some requirements on these computational capabilities are
                   shared with high-performance-computing (HPC) systems,
                   scientific edge computing has a number of unique challenges.
                   In the following, we survey current trends in system
                   software and edge systems for scientific computing,
                   associated research challenges and open questions,
                   infrastructure requirements for operating-systems research,
                   communities who should be involved in that research, and the
                   anticipated benefits of success."
}
@misc{finkel21research,
  author        = "Hal Finkel
                   and Pete Beckman
                   and Ron Brightwell
                   and Rudi Eigenmann
                   and Christian Engelmann
                   and Roberto Gioiosa
                   and Kamil Iskra
                   and Shantenu Jha
                   and Jack Lange
                   and Tapasya Patki
                   and Kevin Pedretti",
  title         = "Research Opportunities in Operating Systems for High-Performance Scientific Computing",
  howpublished  = "White paper by the U.S. Department of Energy's
                   \href{https://www.orau.gov/OSRoundtable2021}
                   {ASCR Roundtable Discussions on Operating-Systems Research 2021}",
  month         = jan # "~25, ",
  year          = "2021",
  url           = "http://www.christian-engelmann.info/publications/finkel21research.pdf",
  abstract      = "As high-performance-computing (HPC) systems continue to
                   evolve, with increasingly diverse and heterogeneous hardware,
                   increasingly-complex requirements for security and
                   multi-tenancy, and increasingly-demanding requirements for
                   resiliency and monitoring, research in operating systems must
                   continue to seed innovation to meet future needs. In the
                   following, we survey current trends in system software and
                   HPC systems for scientific computing, associated research
                   challenges and open questions, infrastructure requirements
                   for operating-systems research, communities who should be
                   involved in that research, and the anticipated benefits of
                   success."
}
@misc{engelmann21resilience2,
  author        = "Christian Engelmann",
  title         = "Resilience by Codesign (and not as an Afterthought)",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{https://web.cvent.com/event/f64a4f28-b473-4808-924c-c8c3d9a2af63/}
                   {Workshop on Reimagining Codesign 2021}",
  month         = mar # "~16-18, ",
  year          = "2021",
  url           = "http://www.christian-engelmann.info/publications/engelmann21resilience2.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann21resilience2.ppt.pdf",
  abstract      = "Resilience, i.e., obtaining a correct solution in a timely
                   and efficient manner, is one of the key challenges in
                   extreme-scale high-performance computing (HPC). Extreme
                   heterogeneity, i.e., using multiple, and potentially
                   configurable, types of processors, accelerators and
                   memory/storage in a single computing platform, will add a
                   significant amount of complexity to the HPC hardware/software
                   eco-system. Hardware/software HPC codesign for resilience is
                   mostly nonexistent at this point! Resilience needs to become
                   an integral part of the HPC hardware/software ecosystem
                   through codesign, such that the burden for resilience is on
                   the system by design and not on the operator or user as an
                   afterthought. Simply put, if resilience by design is not
                   done now, in the early stages of extreme heterogeneity, the
                   current state of practice for HPC resilience, global
                   application-level checkpoint/restart, will remain the same
                   for decades to come due to the high costs of adopting
                   alternatives later on."
}
@misc{radojkovic20towards,
  author        = "Petar Radojkovic
                   and Manolis Marazakis
                   and Paul Carpenter
                   and Reiley Jeyapaul
                   and Dimitris Gizopoulos
                   and Martin Schulz
                   and Adria Armejach
                   and Eduard Ayguade
                   and Fran\c{c}ois Bodin
                   and Ramon Canal
                   and Franck Cappello
                   and Fabien Chaix
                   and Guillaume Colin de Verdiere
                   and Said Derradji
                   and Stefano Di Carlo
                   and Christian Engelmann
                   and Ignacio Laguna
                   and Miquel Moreto
                   and Onur Mutlu
                   and Lazaros Papadopoulos
                   and Olly Perks
                   and Manolis Ploumidis
                   and Bezhad Salami
                   and Yanos Sazeides
                   and Dimitrios Soudris
                   and Yiannis Sourdis
                   and Per Stenstrom
                   and Samuel Thibault
                   and Will Toms
                   and Osman Unsal",
  title         = "Towards Resilient {EU} {HPC} Systems: {A} Blueprint",
  howpublished  = "White paper by the
                   \href{https://resilienthpc.eu}
                   {European HPC resilience initiative}",
  month         = apr # "~9, ",
  year          = "2020",
  url           = "http://www.christian-engelmann.info/publications/radojkovic20towards.pdf",
  abstract      = "This document aims to spearhead a Europe-wide discussion on
                   HPC system resilience and to help the European HPC community
                   define best practices for resilience. We analyse a wide range
                   of state-of-the-art resilience mechanisms and recommend the
                   most effective approaches to employ in large-scale HPC
                   systems. Our guidelines will be useful in the allocation of
                   available resources, as well as guiding researchers and
                   research funding towards the enhancement of resilience
                   approaches with the highest priority and utility. Although
                   our work is focussed on the needs of next generation HPC
                   systems in Europe, the principles and evaluations are
                   applicable globally.
                   This document is the first output of the ongoing European HPC
                   resilience initiative and it covers individual nodes in HPC
                   systems, encompassing CPU, memory, intra-node interconnect
                   and emerging FPGA-based hardware accelerators. With community
                   support and feedback on this initial document, we will update
                   the analysis and expand the scope to include other types of
                   accelerators, as well as networks and storage.",
  pts           = "140761"
}
@misc{engelmann18extreme,
  author        = "Christian Engelmann
                   and Rizwan Ashraf
                   and Saurabh Hukerikar",
  title         = "Extreme Heterogeneity with Resilience by Design (and not as
                   an Afterthought)",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{https://orau.gov/exheterogeneity2018/}{Extreme
                   Heterogeneity Virtual Workshop 2018}",
  month         = jan # "~23-24, ",
  year          = "2018",
  address       = "Washington, DC, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann18extreme.pdf",
  abstract      = "Resilience, i.e., obtaining a correct solution in a timely
                   and efficient manner, is one of the key challenges in
                   extreme-scale high-performance computing (HPC). Extreme
                   heterogeneity, i.e., using multiple, and potentially
                   configurable, types of processors, accelerators and 
                   memory/storage in a single computing platform, will add a
                   significant amount of complexity to the HPC hardware/software
                   ecosystem. The notion of correct computation and program
                   state assumed by users and application developers today,
                   which has been based on binary bit-level correctness, will
                   no longer hold for processing elements based on quantum
                   qubits and analog circuits that model spiking neurons in
                   neuromorphic computing elements. The diverse set of compute
                   and memory components in future heterogeneous systems will
                   require novel hardware and software resilience solutions.
                   Errors and failures reported by such heterogeneous hardware
                   will need to be handled by the appropriate software
                   component to enable efficient masking, recovery, and
                   avoidance with little burden on the user. Similarly, errors
                   and failures reported by the software running on such
                   heterogeneous hardware need to be equally efficiently
                   handled with little burden on the user. This requires a new
                   approach, where resilience is holistically provided by the
                   HPC hardware/software ecosystem. The key challenges are to
                   design and to operate extreme heterogeneous HPC systems
                   with (1) wide-ranging resilience capabilities in system
                   software, programming models, libraries, and applications,
                   (2) interfaces and mechanisms for coordinating resilience
                   capabilities across diverse hardware and software
                   components, (3) appropriate metrics and tools for assessing
                   performance, resilience, and energy, and (4) an
                   understanding of the performance, resilience and energy
                   trade-off that eventually results in well-informed HPC
                   system design choices and runtime decisions."
}
@misc{tiwari16lightweight,
  author        = "Devesh Tiwari
                   and Saurabh Gupta
                   and Christian Engelmann",
  title         = "Lightweight, Actionable Analytical Tools Based on Statistical
                   Learning for Efficient System Operations",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{http://hpc.pnl.gov/modsim/2016}{Workshop on Modeling
                   & Simulation of Systems & Applications (ModSim)
                   2016}",
  month         = aug # "~10-12, ",
  year          = "2016",
  address       = "Seattle, WA, USA",
  url           = "http://www.christian-engelmann.info/publications/tiwari16lightweight.pdf",
  url2          = "http://www.christian-engelmann.info/publications/tiwari16lightweight.ppt.pdf",
  abstract      = "The modeling and simulation community has always relied on
                   accurate and meaningful system data and parameters to drive
                   analytical models and simulators. HPC systems continuously
                   generate a huge amount of system-event-related data (e.g.,
                   system logs, resource consumption logs, RAS logs, power
                   consumption logs), but meaningful interpretation and
                   accuracy verification of such data are quite challenging.
                   This talk offers a unique perspective and experience in
                   demonstrating how modeling and simulation based research can
                   actually be translated into production systems. We will
                   discuss the short-term opportunities for the modeling and
                   simulation community to increase the impact and
                   effectiveness of our analytical tools, ``dos and don'ts'',
                   and long-term challenges and opportunities.",
  pts           = "69458"
}
@misc{engelmann13hardware,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "A Hardware/Software Performance/Resilience/Power Co-Design
                   Tool for Extreme-scale Computing",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{http://hpc.pnl.gov/modsim/2013}{Workshop on Modeling
                   & Simulation of Exascale Systems & Applications (ModSim)
                   2013}",
  month         = sep # "~18-19, ",
  year          = "2013",
  address       = "Seattle, WA, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann13hardware.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann13hardware.ppt.pdf",
  abstract      = "xSim is a simulation-based performance investigation toolkit
                   that permits running high-performance computing (HPC)
                   applications in a controlled environment with millions of
                   concurrent execution threads, while observing application
                   performance in a simulated extreme-scale system for
                   hardware/software co-design. The presented work details newly
                   developed features for xSim that permit the injection of MPI
                   process failures, the propagation/detection/notification of
                   such failures within the simulation, and their handling using
                   application-level checkpoint/restart. The newly added features
                   also offer user-level failure mitigation (ULFM) extensions
                   at the simulated MPI layer to support algorithm-based fault
                   tolerance (ABFT). The presented solution permits investigating
                   performance under failure and failure handling of
                   checkpoint/restart and ABFT solutions. The newly enhanced xSim
                   is the very first performance tool that supports these
                   capabilities."
}
@misc{snir13addressing,
  author        = "Marc Snir
                   and Robert W. Wisniewski
                   and Jacob A. Abraham
                   and Sarita V. Adve
                   and Saurabh Bagchi
                   and Pavan Balaji
                   and Bill Carlson
                   and Andrew A. Chien
                   and Pedro Diniz
                   and Christian Engelmann
                   and Rinku Gupta
                   and Fred Johnson
                   and Jim Belak
                   and Pradip Bose
                   and Franck Cappello
                   and Paul Coteus
                   and Nathan A. Debardeleben
                   and Mattan Erez
                   and Saverio Fazzari
                   and Al Geist
                   and Sriram Krishnamoorthy
                   and Sven Leyffer
                   and Dean Liberty
                   and Subhasish Mitra
                   and Todd Munson
                   and Rob Schreiber
                   and Jon Stearley
                   and Eric Van Hensbergen",
  title         = "Addressing Failures in Exascale Computing",
  howpublished  = "Workshop report",
  month         = aug # "~4-11, ",
  year          = "2013",
  address       = "Park City, UT, USA",
  url           = "http://www.christian-engelmann.info/publications/snir13addressing.pdf"
}
@misc{geist12department,
  author        = "Al Geist
                   and Bob Lucas
                   and Marc Snir
                   and Shekhar Borkar
                   and Eric Roman
                   and Mootaz Elnozahy
                   and Bert Still
                   and Andrew Chien
                   and Robert Clay
                   and John Wu
                   and Christian Engelmann
                   and Nathan DeBardeleben
                   and Rob Ross
                   and Larry Kaplan
                   and Martin Schulz
                   and Mike Heroux
                   and Sriram Krishnamoorthy
                   and Lucy Nowell
                   and Abhinav Vishnu
                   and Lee-Ann Talley",
  title         = "{U.S. Department of Energy} Fault Management Workshop",
  howpublished  = "Workshop report for the U.S. Department of Energy",
  month         = jun # "~6, ",
  year          = "2012",
  address       = "Baltimore, MD, USA",
  url           = "http://www.christian-engelmann.info/publications/geist12department.pdf",
  abstract      = "A Department of Energy (DOE) Fault Management Workshop was
                   held on June 6, 2012 at the BWI Airport Marriott hotel in
                   Maryland. The goals of this workshop were to: 1. Describe
                   the required HPC resilience for critical DOE mission needs;
                   2. Detail what HPC resilience research is already being done
                   at the DOE national laboratories and is expected to be done
                   by industry or other groups; 3. Determine what fault
                   management research is a priority for DOE's Office of
                   Science and National Nuclear Security Administration
                   (NNSA) over the next five years; 4. Develop a roadmap for
                   getting the necessary research accomplished in the timeframe
                   when it will be needed by the large computing facilities
                   across DOE."
}
@misc{engelmann12performance,
  author        = "Christian Engelmann
                   and Thomas Naughton",
  title         = "A Performance/Resilience/Power Co-design Tool for
                   Extreme-scale High-Performance Computing",
  howpublished  = "White paper accepted at the U.S. Department of Energy's
                   \href{http://hpc.pnl.gov/modsim/2012}{Workshop on Modeling
                   & Simulation of Exascale Systems & Applications (ModSim)
                   2012}",
  month         = aug # "~9-10, ",
  year          = "2012",
  address       = "Seattle, WA, USA",
  url           = "http://www.christian-engelmann.info/publications/engelmann12performance.pdf",
  abstract      = "Performance, resilience and power consumption are key HPC
                   system design factors that are highly interdependent. To
                   enable extreme-scale computing it is essential to perform
                   HPC hardware/software co-design that identifies the
                   cost/benefit trade-off between these design factors for
                   potential future architecture choices. The proposed research
                   and development aims at developing an HPC hardware/software
                   co-design toolkit for evaluating the
                   resilience/power/performance cost/benefit trade-off of
                   future architecture choices. The approach focuses on
                   extending a simulation-based performance investigation
                   toolkit with advanced resilience and power modeling and
                   simulation features, such as (i) fault injection mechanisms,
                   (ii) fault propagation, isolation, and detection models, (iii)
                   fault avoidance, masking, and recovery simulation, and (iv)
                   power consumption models."
}
@misc{engelmann12dynamic,
  author        = "Christian Engelmann
                   and Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Frank Mueller",
  title         = "Dynamic Self-Aware Runtime Software for Exascale Systems",
  howpublished  = "White paper for the U.S. Department of Energy's
                   \href{https://collab.cels.anl.gov/display/exaosr/Position+Papers}
                   {Exascale Operating Systems and Runtime Technical Council}",
  month         = jul,
  year          = "2012",
  url           = "http://www.christian-engelmann.info/publications/engelmann12dynamic.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann12dynamic.ppt.pdf",
  abstract      = "At exascale, the power consumption, resilience, and load
                   balancing constraints, especially their dynamic nature and
                   interdependence, and the scale of the system require a
                   radical change in future high-performance computing (HPC)
                   operating systems and runtimes (OS/Rs). In contrast to the
                   existing static OS/R solutions, an exascale OS/R is needed
                   that is aware of the dynamically changing resources,
                   constraints, and application needs, and that is able to
                   autonomously coordinate (sometimes conflicting) responses
                   to different changes in the system, simultaneously and at
                   scale. To provide awareness and autonomic management, a
                   novel, scalable and self-aware OS/R is needed that becomes
                   the brains of the entire X-stack. It dynamically analyzes
                   past, current, and future system status and application
                   needs. It optimizes system usage by scheduling, migrating,
                   and restarting tasks within and across nodes as needed to
                   deal with multi-dimensional constraints, such as power
                   consumption, permanent and transient faults, resource
                   degradation, heterogeneity, data locality, and load balance."
}
@misc{vallee12unified,
  author        = "Geoffroy R. Vall\'ee
                   and Thomas Naughton
                   and Christian Engelmann
                   and David E. Bernholdt",
  title         = "Unified Execution Environment",
  howpublished  = "White paper for the U.S. Department of Energy's
                   \href{https://collab.cels.anl.gov/display/exaosr/Position+Papers}
                   {Exascale Operating Systems and Runtime Technical Council}",
  month         = jul,
  year          = "2012",
  url           = "http://www.christian-engelmann.info/publications/vallee12unified.pdf",
  abstract      = "The design and development of new system software for HPC
                   (both operating systems and run-times) face multiple
                   challenges, including scalability (high level of parallelism),
                   efficiency, resiliency, and dynamicity. Guided by these
                   fundamental design principles, we advocate for a unified
                   execution environment, which aims at being scalable,
                   asynchronous, dynamic, resource efficient, and reusable. The
                   proposed solution is based on the following core building
                   blocks: (i) events, (ii) agents, and (iii) enclaves. We use
                   these building blocks to support composable environments that
                   may be tailored to combine appropriate system services as well
                   as user jobs. Additionally, for resilience and scalability,
                   the proposed design encourages localized or regional
                   operations to foster autonomy of execution contexts. We
                   advocate this approach for exascale systems, which include a
                   massive number of heterogeneous computing resources, since it
                   enables architecturally informed structures (topologies) as
                   well as encouraging efficient grouping of
                   functionality/services."
}
@misc{debardeleben09high-end,
  author        = "Nathan DeBardeleben
                   and James Laros
                   and John T. Daly
                   and Stephen L. Scott
                   and Christian Engelmann
                   and Bill Harrod",
  title         = "High-End Computing Resilience: {Analysis} of Issues
                   Facing the {HEC} Community and Path-Forward for
                   Research and Development",
  howpublished  = "White paper for the U.S. National Science Foundation's High-end Computing Program",
  month         = dec,
  year          = "2009",
  url           = "http://www.christian-engelmann.info/publications/debardeleben09high-end.pdf"
}
@techreport{brim23microservice,
  author        = "Michael Brim
                   and Christian Engelmann",
  title         = "INTERSECT Architecture Specification:
                   Microservice Architecture (Version 0.9)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2023/3171",
  address       = "Oak Ridge, TN, USA",
  month         = sep,
  year          = "2023",
  doi           = "10.2172/2333815",
  url           = "http://www.christian-engelmann.info/publications/brim23microservice.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL)'s Self-driven
                   Experiments for Science / Interconnected Science Ecosystem
                   (INTERSECT) architecture project, titled ``An Open Federated
                   Architecture for the Laboratory of the Future'', creates an
                   open federated hardware/software architecture for the
                   laboratory of the future using a novel system of systems
                   (SoS) and microservice architecture approach, connecting
                   scientific instruments, robot-controlled laboratories and
                   edge/center computing/data resources to enable autonomous
                   experiments, ``self-driving'' laboratories, smart
                   manufacturing, and artificial intelligence (AI)-driven
                   design, discovery and evaluation. The project describes
                   science use cases as design patterns that identify and
                   abstract the involved hardware/software components and their
                   interactions in terms of control, work and data flow. It
                   creates a SoS architecture of the federated
                   hardware/software ecosystem that clarifies terms,
                   architectural elements, the interactions between them and
                   compliance. It further designs a federated microservice
                   architecture, mapping science use case design patterns to the
                   SoS architecture with loosely coupled microservices,
                   standardized interfaces and multi-programming-language
                   support. The primary deliverable of this project is an
                   INTERSECT Open Architecture Specification, containing the
                   science use case design pattern catalog, the federated SoS
                   architecture specification and the federated microservice
                   architecture specification. This document represents the
                   microservice architecture of the INTERSECT Open Architecture
                   Specification.",
  pts           = "204232"
}
@techreport{engelmann23use,
  author        = "Christian Engelmann
                   and Suhas Somnath",
  title         = "INTERSECT Architecture Specification: Use Case Design
                   Patterns (Version 0.9)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2023/3133",
  address       = "Oak Ridge, TN, USA",
  month         = sep,
  year          = "2023",
  doi           = "10.2172/2229218",
  url           = "http://www.christian-engelmann.info/publications/engelmann23use.pdf",
  abstract      = "Connecting scientific instruments and robot-controlled
                   laboratories with computing and data resources at the edge,
                   the Cloud or the high-performance computing (HPC) center
                   enables autonomous experiments, self-driving laboratories,
                   smart manufacturing, and artificial intelligence (AI)-driven
                   design, discovery and evaluation. The Self-driven Experiments
                   for Science / Interconnected Science Ecosystem (INTERSECT)
                   Open Architecture enables science breakthroughs using
                   intelligent networked systems, instruments and facilities
                   with a federated hardware/software architecture for the
                   laboratory of the future. It relies on a novel approach,
                   consisting of (1) science use case design patterns, (2) a
                   system of systems architecture, and (3) a microservice
                   architecture. This document introduces the science use case
                   design patterns of the INTERSECT Architecture. It describes
                   the overall background, the involved terminology and
                   concepts, and the pattern format and classification. It
                   further details the 12 defined patterns and provides insight
                   into building solutions from these patterns. The document
                   also describes the application of these patterns in the
                   context of several INTERSECT autonomous laboratories. The
                   target audience are computer, computational, instrument and
                   domain science experts working in the field of autonomous
                   experiments.",
  pts           = "203995"
}
@techreport{engelmann22rdp-20,
  author        = "Christian Engelmann
                   and Rizwan Ashraf
                   and Saurabh Hukerikar
                   and Mohit Kumar
                   and Piyush Sao",
  title         = "Resilience Design Patterns: {A} Structured Approach to
                   Resilience at Extreme Scale (Version 2.0)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2022/2809",
  address       = "Oak Ridge, TN, USA",
  month         = aug,
  year          = "2022",
  doi           = "10.2172/1922296",
  url           = "http://www.christian-engelmann.info/publications/engelmann22rdp-20.pdf",
  abstract      = "Reliability is a serious concern for future extreme-scale
                   high-performance computing (HPC) systems. Projections based
                   on the current generation of HPC systems and technology
                   roadmaps suggest the prevalence of very high fault rates in
                   future systems. The errors resulting from these faults will
                   propagate and generate various kinds of failures, which may
                   result in outcomes ranging from result corruptions to
                   catastrophic application crashes. Therefore, the resilience
                   challenge for extreme-scale HPC systems requires coordination
                   between various hardware and software technologies that are
                   capable of handling a broad set of fault models at
                   accelerated fault rates. Also, due to practical limits on
                   power consumption in future HPC systems, they are likely to
                   embrace innovative architectures, increasing the levels of
                   hardware and software complexities. Therefore, the
                   techniques that seek to improve resilience must navigate the
                   complex trade-off space between resilience and the overheads
                   to power consumption and performance. While the HPC community
                   has developed various resilience solutions, application-level
                   techniques as well as system-based solutions, the solution
                   space of HPC resilience techniques remains fragmented. There
                   are no formal methods to integrate the various HPC resilience
                   techniques into composite solutions, nor are there methods to
                   holistically evaluate the adequacy and efficacy of such
                   solutions in terms of their protection coverage, and their
                   performance & power efficiency characteristics. Additionally,
                   few implementations of current resilience solutions are
                   portable to newer architectures and software environments that
                   will be deployed on future systems.
                   We developed a new structured approach to the management of
                   HPC resilience using the concept of resilience-based design
                   patterns. In general, a design pattern is a repeatable
                   solution to a commonly occurring problem. We identified the
                   well-known solutions that are commonly used to deal with
                   faults, errors and failures in HPC systems. In the initial
                   design patterns specification (version 1.0), we described
                   the various solutions, which address specific problems in
                   the design of resilient HPC environments, in the form of
                   patterns. Each pattern describes a problem caused by a fault,
                   error or failure event in an HPC environment, and then
                   describes the core of the solution of the problem in such a
                   way that this solution may be adapted to different systems
                   and implemented at different layers of the system stack. The
                   catalog of these resilience design patterns provides
                   designers with a collection of design elements. To construct
                   complete resilience solutions using combinations of various
                   patterns, we defined a framework that enhances HPC designers'
                   understanding of the important constraints and the
                   opportunities for the design patterns to be implemented and
                   deployed at various layers of the system stack. The design
                   framework is also useful for establishing interfaces and
                   mechanisms to coordinate flexible fault management across
                   hardware and software components, as well as to consider the
                   trade-off between performance, resilience, and power
                   consumption when constructing a solution. The resilience
                   design patterns specification version 1.1 included more
                   detailed explanations of the pattern solutions, the context
                   in which the patterns are applicable, and the implications
                   for hardware or software design. It also provided several
                   additional examples and detailed case studies to demonstrate
                   the use of patterns to build realistic solutions.
                   In this version 1.2 of the specification document, we have
                   improved the pattern descriptions, including graphical
                   representations of the pattern components. These
                   improvements are largely based on critical comments,
                   feedback and suggestions received from pattern experts and
                   readers of the previous versions of the specification. The
                   pattern classification has been modified to further clarify
                   the relationships between pattern categories. This version
                   of the specification also introduces a pattern language for
                   resilience design patterns. The pattern language presents
                   the patterns in the catalog as a network, revealing the
                   relations among the resilience patterns. The language
                   provides designers with the means to explore alternative
                   techniques for handling a specific fault model that may have
                   different efficiency and complexity characteristics. Using
                   the pattern language also enables the design and
                   implementation of comprehensive resilience solutions as a
                   set of interconnected resilience patterns that can be
                   instantiated across layers of the system stack. The overall
                   goal of this work is to provide hardware and software
                   designers, as well as the users and operators of HPC systems,
                   a systematic methodology for the design and evaluation of
                   resilience technologies in HPC systems that keep scientific
                   applications running to a correct solution in a timely and
                   cost-efficient manner despite frequent faults, errors, and
                   failures of various types.
                   Version 2.0 expands the resilience design pattern
                   classification and catalog to include self-stabilization
                   patterns and reliability, availability and performance models
                   for each structural pattern.",
  pts           = "189180"
}
@techreport{brim22microservice,
  author        = "Michael Brim
                   and Christian Engelmann",
  title         = "INTERSECT Architecture Specification:
                   Microservice Architecture (Version 0.5)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2022/2715",
  address       = "Oak Ridge, TN, USA",
  month         = sep,
  year          = "2022",
  doi           = "10.2172/1902805",
  url           = "http://www.christian-engelmann.info/publications/brim22microservice.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL)'s Self-driven
                   Experiments for Science / Interconnected Science Ecosystem
                   (INTERSECT) architecture project, titled ``An Open Federated
                   Architecture for the Laboratory of the Future'', creates an
                   open federated hardware/software architecture for the
                   laboratory of the future using a novel system of systems
                   (SoS) and microservice architecture approach, connecting
                   scientific instruments, robot-controlled laboratories and
                   edge/center computing/data resources to enable autonomous
                   experiments, ``self-driving'' laboratories, smart
                   manufacturing, and artificial intelligence (AI)-driven
                   design, discovery and evaluation. The project describes
                   science use cases as design patterns that identify and
                   abstract the involved hardware/software components and their
                   interactions in terms of control, work and data flow. It
                   creates a SoS architecture of the federated
                   hardware/software ecosystem that clarifies terms,
                   architectural elements, the interactions between them and
                   compliance. It further designs a federated microservice
                   architecture, mapping science use case design patterns to the
                   SoS architecture with loosely coupled microservices,
                   standardized interfaces and multi-programming-language
                   support. The primary deliverable of this project is an
                   INTERSECT Open Architecture Specification, containing the
                   science use case design pattern catalog, the federated SoS
                   architecture specification and the federated microservice
                   architecture specification. This document represents the
                   microservice architecture of the INTERSECT Open Architecture
                   Specification.",
  pts           = "186195"
}
@techreport{engelmann22use,
  author        = "Christian Engelmann
                   and Suhas Somnath",
  title         = "INTERSECT Architecture Specification: Use Case Design
                   Patterns (Version 0.5)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2022/2681",
  address       = "Oak Ridge, TN, USA",
  month         = sep,
  year          = "2022",
  doi           = "10.2172/1896984",
  url           = "http://www.christian-engelmann.info/publications/engelmann22use.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL)'s Self-driven
                   Experiments for Science / Interconnected Science Ecosystem
                   (INTERSECT) architecture project, titled ``An Open Federated
                   Architecture for the Laboratory of the Future'', creates an
                   open federated hardware/software architecture for the
                   laboratory of the future using a novel system of systems
                   (SoS) and microservice architecture approach, connecting
                   scientific instruments, robot-controlled laboratories and
                   edge/center computing/data resources to enable autonomous
                   experiments, ``self-driving'' laboratories, smart
                   manufacturing, and artificial intelligence (AI)-driven
                   design, discovery and evaluation. The project describes
                   science use cases as design patterns that identify and
                   abstract the involved hardware/software components and their
                   interactions in terms of control, work and data flow. It
                   creates a SoS architecture of the federated
                   hardware/software ecosystem that clarifies terms,
                   architectural elements, the interactions between them and
                   compliance. It further designs a federated microservice
                   architecture, mapping science use case design patterns to the
                   SoS architecture with loosely coupled microservices,
                   standardized interfaces and multi-programming-language
                   support. The primary deliverable of this project is an
                   INTERSECT Open Architecture Specification, containing the
                   science use case design pattern catalog, the federated SoS
                   architecture specification and the federated microservice
                   architecture specification. This document represents the
                   science use case design pattern catalog of the INTERSECT Open
                   Architecture Specification.",
  pts           = "185612"
}
@techreport{hukerikar17rdp-12,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Resilience Design Patterns: {A} Structured Approach to
                   Resilience at Extreme Scale (Version 1.2)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2017/745",
  address       = "Oak Ridge, TN, USA",
  month         = aug,
  year          = "2017",
  doi           = "10.2172/1436045",
  url           = "http://www.christian-engelmann.info/publications/hukerikar17rdp-12.pdf",
  abstract      = "Reliability is a serious concern for future extreme-scale
                   high-performance computing (HPC) systems. Projections based
                   on the current generation of HPC systems and technology
                   roadmaps suggest the prevalence of very high fault rates in
                   future systems. The errors resulting from these faults will
                   propagate and generate various kinds of failures, which may
                   result in outcomes ranging from result corruptions to
                   catastrophic application crashes. Therefore, the resilience
                   challenge for extreme-scale HPC systems requires coordination
                   between various hardware and software technologies that are
                   capable of handling a broad set of fault models at
                   accelerated fault rates. Also, due to practical limits on
                   power consumption in future HPC systems, they are likely to
                   embrace innovative architectures, increasing the levels of
                   hardware and software complexities. Therefore, the
                   techniques that seek to improve resilience must navigate the
                   complex trade-off space between resilience and the overheads
                   to power consumption and performance. While the HPC community
                   has developed various resilience solutions, application-level
                   techniques as well as system-based solutions, the solution
                   space of HPC resilience techniques remains fragmented. There
                   are no formal methods to integrate the various HPC resilience
                   techniques into composite solutions, nor are there methods to
                   holistically evaluate the adequacy and efficacy of such
                   solutions in terms of their protection coverage, and their
                   performance & power efficiency characteristics. Additionally,
                   few implementations of current resilience solutions are
                   portable to newer architectures and software environments that
                   will be deployed on future systems.
                   We developed a new structured approach to the management of
                   HPC resilience using the concept of resilience-based design
                   patterns. In general, a design pattern is a repeatable
                   solution to a commonly occurring problem. We identified the
                   well-known solutions that are commonly used to deal with
                   faults, errors and failures in HPC systems. In the initial
                   design patterns specification (version 1.0), we described
                   the various solutions, which address specific problems in
                   the design of resilient HPC environments, in the form of
                   patterns. Each pattern describes a problem caused by a fault,
                   error or failure event in an HPC environment, and then
                   describes the core of the solution of the problem in such a
                   way that this solution may be adapted to different systems
                   and implemented at different layers of the system stack. The
                   catalog of these resilience design patterns provides
                   designers with a collection of design elements. To construct
                   complete resilience solutions using combinations of various
                   patterns, we defined a framework that enhances HPC designers'
                   understanding of the important constraints and the
                   opportunities for the design patterns to be implemented and
                   deployed at various layers of the system stack. The design
                   framework is also useful for establishing interfaces and
                   mechanisms to coordinate flexible fault management across
                   hardware and software components, as well as to consider the
                   trade-off between performance, resilience, and power
                   consumption when constructing a solution. The resilience
                   design patterns specification version 1.1 included more
                   detailed explanations of the pattern solutions, the context
                   in which the patterns are applicable, and the implications
                   for hardware or software design. It also provided several
                   additional examples and detailed case studies to demonstrate
                   the use of patterns to build realistic solutions.
                   In this version 1.2 of the specification document, we have
                   improved the pattern descriptions, including graphical
                   representations of the pattern components. These
                   improvements are largely based on critical comments,
                   feedback and suggestions received from pattern experts and
                   readers of the previous versions of the specification. The
                   pattern classification has been modified to further clarify
                   the relationships between pattern categories. This version
                   of the specification also introduces a pattern language for
                   resilience design patterns. The pattern language presents
                   the patterns in the catalog as a network, revealing the
                   relations among the resilience patterns. The language
                   provides designers with the means to explore alternative
                   techniques for handling a specific fault model that may have
                   different efficiency and complexity characteristics. Using
                   the pattern language also enables the design and
                   implementation of comprehensive resilience solutions as a
                   set of interconnected resilience patterns that can be
                   instantiated across layers of the system stack. The overall
                   goal of this work is to provide hardware and software
                   designers, as well as the users and operators of HPC systems,
                   a systematic methodology for the design and evaluation of
                   resilience technologies in HPC systems that keep scientific
                   applications running to a correct solution in a timely and
                   cost-efficient manner despite frequent faults, errors, and
                   failures of various types.",
  pts           = "106427"
}
@techreport{hukerikar16rdp-11,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Resilience Design Patterns: {A} Structured Approach to
                   Resilience at Extreme Scale (Version 1.1)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2016/767",
  address       = "Oak Ridge, TN, USA",
  month         = dec,
  year          = "2016",
  doi           = "10.2172/1345793",
  url           = "http://www.christian-engelmann.info/publications/hukerikar16rdp-11.pdf",
  abstract      = "Reliability is a serious concern for future extreme-scale
                   high-performance computing (HPC) systems. Projections based
                   on the current generation of HPC systems and technology
                   roadmaps suggest the prevalence of very high fault rates in
                   future systems. The errors resulting from these faults will
                   propagate and generate various kinds of failures, which may
                   result in outcomes ranging from result corruptions to
                   catastrophic application crashes. Therefore the resilience
                   challenge for extreme-scale HPC systems requires management
                   of various hardware and software technologies that are
                   capable of handling a broad set of fault models at
                   accelerated fault rates. Also, due to practical limits on
                   power consumption in HPC systems, future systems are likely
                   to embrace innovative architectures, increasing the levels
                   of hardware and software complexities. As a result the
                   techniques that seek to improve resilience must navigate
                   the complex trade-off space between resilience and the
                   overheads to power consumption and performance. While the
                   HPC community has developed various resilience solutions,
                   application-level techniques as well as system-based
                   solutions, the solution space of HPC resilience techniques
                   remains fragmented. There are no formal methods and metrics
                   to investigate and evaluate resilience holistically in HPC
                   systems that consider impact scope, handling coverage, and
                   performance & power efficiency across the system stack.
                   Additionally, few of the current approaches are portable to
                   newer architectures and software environments that will be
                   deployed on future systems.
                   In this document, we develop a structured approach to the
                   management of HPC resilience using the concept of
                   resilience-based design patterns. A design pattern is a
                   general repeatable solution to a commonly occurring problem.
                   We identify the commonly occurring problems and solutions
                   used to deal with faults, errors and failures in HPC systems.
                   Each established solution is described in the form of a
                   pattern that addresses concrete problems in the design of
                   resilient systems. The complete catalog of resilience design
                   patterns provides designers with reusable design elements. We
                   also define a framework that enhances a designer's
                   understanding of the important constraints and opportunities
                   for the design patterns to be implemented and deployed at
                   various layers of the system stack. This design framework may
                   be used to establish mechanisms and interfaces to coordinate
                   flexible fault management across hardware and software
                   components. The framework also supports optimization of the
                   cost-benefit trade-offs among performance, resilience, and
                   power consumption. The overall goal of this work is to enable
                   a systematic methodology for the design and evaluation of
                   resilience technologies in extreme-scale HPC systems that
                   keep scientific applications running to a correct solution
                   in a timely and cost-efficient manner in spite of frequent
                   faults, errors, and failures of various types.",
  pts           = "72341"
}
@techreport{hukerikar16rdp-10,
  author        = "Saurabh Hukerikar
                   and Christian Engelmann",
  title         = "Resilience Design Patterns: {A} Structured Approach to
                   Resilience at Extreme Scale (Version 1.0)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2016/687",
  address       = "Oak Ridge, TN, USA",
  month         = oct,
  year          = "2016",
  doi           = "10.2172/1338552",
  url           = "http://www.christian-engelmann.info/publications/hukerikar16rdp-10.pdf",
  abstract      = "Reliability is a serious concern for future extreme-scale
                   high-performance computing (HPC) systems. Projections based
                   on the current generation of HPC systems and technology
                   roadmaps suggest very high fault rates in future
                   systems. The errors resulting from these faults will
                   propagate and generate various kinds of failures, which may
                   result in outcomes ranging from result corruptions to
                   catastrophic application crashes. Practical limits on power
                   consumption in HPC systems will require future systems to
                   embrace innovative architectures, increasing the levels of
                   hardware and software complexities.
                   The resilience challenge for extreme-scale HPC systems
                   requires management of various hardware and software
                   technologies that are capable of handling a broad set of
                   fault models at accelerated fault rates. These techniques
                   must seek to improve resilience at reasonable overheads to
                   power consumption and performance. While the HPC community
                   has developed various solutions, application-level as well
                   as system-based solutions, the solution space of HPC
                   resilience techniques remains fragmented. There are no formal
                   methods and metrics to investigate and evaluate resilience
                   holistically in HPC systems that consider impact scope,
                   handling coverage, and performance & power efficiency across
                   the system stack. Additionally, few of the current approaches
                   are portable to newer architectures and software ecosystems,
                   which are expected to be deployed on future systems.
                   In this document, we develop a structured approach to the
                   management of HPC resilience based on the concept of
                   resilience-based design patterns. A design pattern is a
                   general repeatable solution to a commonly occurring problem.
                   We identify the commonly occurring problems and solutions
                   used to deal with faults, errors and failures in HPC systems.
                   The catalog of resilience design patterns provides designers
                   with reusable design elements. We define a design framework
                   that enhances our understanding of the important constraints
                   and opportunities for solutions deployed at various layers of
                   the system stack. The framework may be used to establish
                   mechanisms and interfaces to coordinate flexible fault
                   management across hardware and software components. The
                   framework also enables optimization of the cost-benefit
                   trade-offs among performance, resilience, and power
                   consumption. The overall goal of this work is to enable a
                   systematic methodology for the design and evaluation of
                   resilience technologies in extreme-scale HPC systems that
                   keep scientific applications running to a correct solution
                   in a timely and cost-efficient manner in spite of frequent
                   faults, errors, and failures of various types.",
  pts           = "71756"
}
@techreport{fiala12detection,
  author        = "David Fiala
                   and Frank Mueller
                   and Christian Engelmann
                   and Kurt Ferreira
                   and Ron Brightwell
                   and Rolf Riesen",
  title         = "Detection and Correction of Silent Data Corruption for
                   Large-Scale High-Performance Computing",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2012/227",
  address       = "Oak Ridge, TN, USA",
  month         = jun,
  year          = "2012",
  url           = "http://www.christian-engelmann.info/publications/fiala12detection.pdf",
  abstract      = "Faults have become the norm rather than the exception for
                   high-end computing on clusters with 10s/100s of thousands
                   of cores. Exacerbating this situation, some of these faults
                   remain undetected, manifesting themselves as silent errors
                   that corrupt memory while applications continue to operate
                   and report incorrect results.
                   This paper studies the potential for redundancy to both
                   detect and correct soft errors in MPI message-passing
                   applications. Our study investigates the challenges inherent
                   to detecting soft errors within MPI applications while
                   providing transparent MPI redundancy. By assuming a model
                   wherein corruption in application data manifests itself by
                   producing differing MPI message data between replicas, we
                   study the protocols best suited for detecting and
                   correcting corrupted MPI data.
                   To experimentally validate our proposed detection and
                   correction protocols, we introduce RedMPI, an MPI library
                   which resides in the MPI profiling layer. RedMPI is capable
                   of both online detection and correction of soft errors that
                   occur in MPI applications without requiring any modifications
                   to the application source by utilizing either double or
                   triple redundancy.
                   Our results indicate that our most efficient consistency
                   protocol can successfully protect applications experiencing
                   even high rates of silent data corruption with runtime
                   overheads between 0\% and 30\% as compared to unprotected
                   applications without redundancy.
                   Using our fault injector within RedMPI, we observe that even
                   a single soft error can have profound effects on running
                   applications, causing a cascading pattern of corruption
                   that, in most cases, spreads to all other processes.
                   RedMPI's protection has been shown to successfully mitigate
                   the effects of soft errors while allowing applications to
                   complete with correct results even in the face of errors."
}
@techreport{wang10hybrid,
  author        = "Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Hybrid Full/Incremental Checkpoint/Restart for {MPI} Jobs in
                   {HPC} Environments",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2010/162",
  address       = "Oak Ridge, TN, USA",
  month         = aug,
  year          = "2010",
  url           = "http://www.christian-engelmann.info/publications/wang10hybrid.pdf",
  abstract      = "As the number of cores in high-performance computing
                   environments keeps increasing, faults are becoming
                   commonplace. Checkpointing addresses such faults but captures
                   full process images even though only a subset of the
                   process image changes between checkpoints.
                   We have designed a high-performance hybrid disk-based
                   full/incremental checkpointing technique for MPI tasks
                   to capture only data changed since the last checkpoint.
                   Our implementation integrates new BLCR and LAM/MPI
                   features that complement traditional full checkpoints.
                   This results in significantly reduced checkpoint sizes
                   and overheads with only moderate increases in restart
                   overhead. After accounting for cost and savings, benefits
                   due to incremental checkpoints significantly outweigh the
                   loss on restart operations.
                   Experiments in a cluster with the NAS Parallel Benchmark
                   suite and mpiBLAST indicate that savings due to replacing
                   full checkpoints with incremental ones average 16.64
                   seconds while restore overhead amounts to just 1.17
                   seconds. These savings increase with the frequency of
                   incremental checkpoints. Overall, our novel hybrid
                   full/incremental checkpointing is superior to prior
                   non-hybrid techniques."
}
@techreport{wang10proactive,
  author        = "Chao Wang
                   and Frank Mueller
                   and Christian Engelmann
                   and Stephen L. Scott",
  title         = "Proactive Process-Level Live Migration and Back Migration in
                   {HPC} Environments",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2010/161",
  address       = "Oak Ridge, TN, USA",
  month         = aug,
  year          = "2010",
  url           = "http://www.christian-engelmann.info/publications/wang10proactive.pdf",
  abstract      = "As the number of nodes in high-performance computing
                   environments keeps increasing, faults are becoming
                   commonplace. Reactive fault tolerance (FT) often does not
                   scale due to massive I/O requirements and relies on manual
                   job resubmission. This work complements reactive with
                   proactive FT at the process level. Through health monitoring,
                   a subset of node failures can be anticipated when a node's
                   health deteriorates. A novel process-level live migration
                   mechanism supports continued execution of applications during
                   much of the process migration. This scheme is integrated into
                   an MPI execution environment to transparently sustain
                   health-inflicted node failures, which eradicates the need to
                   restart and requeue MPI jobs. Experiments indicate that 1-6.5
                   seconds of prior warning are required to successfully trigger
                   live process migration while similar operating system
                   virtualization mechanisms require 13-24 seconds. This
                   self-healing approach complements reactive FT by nearly
                   cutting the number of checkpoints in half when 70\% of
                   the faults are handled proactively. The work also provides
                   a novel back migration approach to eliminate load imbalance
                   or bottlenecks caused by migrated tasks. Experiments indicate
                   that the larger the amount of outstanding execution, the
                   higher the benefit due to back migration will be."
}
@dataset{shin23olcf,
  author        = "Woong Shin
                   and Vladyslav Oles
                   and Anna Schmedding
                   and George Ostrouchov
                   and Evgenia Smirni
                   and Christian Engelmann
                   and Feiyi Wang",
  title         = "{OLCF Summit} Supercomputer {GPU} Snapshots During
                   Double-Bit Errors and Normal Operations",
  month         = apr # "~20, ",
  year          = "2023",
  doi           = "10.13139/OLCF/1970187",
  url           = "https://doi.ccs.ornl.gov/ui/doi/429",
  abstract      = "As we move into the exascale era, the power and energy
                   footprints of high-performance computing (HPC) systems have
                   grown significantly larger. Due to the harsh power and
                   thermal conditions in the system, components are exposed to
                   extreme operating conditions. Operation of such modern HPC
                   systems requires deep insights into long-term system behavior
                   to maintain their efficiency as well as their longevity. To
                   help the HPC community gain such insights, we provide GPU
                   snapshots during double-bit errors and normal operations,
                   based on system telemetry data and logs collected from the
                   Summit supercomputer, equipped with 27,648
                   Tesla V100 GPUs with 2nd-generation high-bandwidth memory
                   (HBM2). The dataset relies on Nvidia XID records internally
                   collected by GPU firmware at the time of failure occurrence,
                   on the reboot-time logs of each Summit node, on node-level
                   job scheduler records collected after each job termination,
                   and on a 1Hz data rate from the baseboard management
                   controllers (BMCs) of each Summit compute node using the
                   OpenBMC event subscription protocol."
}
@misc{engelmann23interconnected4,
  author        = "Christian Engelmann",
  title         = "The Interconnected Science Ecosystem (INTERSECT)",
  month         = oct # "~4, ",
  year          = "2023",
  howpublished  = "{Invited talk at the \href{https://www.hartree.stfc.ac.uk}
                   {Hartree Centre, Science and Technology Facilities Council,
                   Daresbury, UK}}",
  url           = "http://www.christian-engelmann.info/publications/engelmann23interconnected4.ppt.pdf",
  abstract      = "The Interconnected Science Ecosystem (INTERSECT) Initiative
                   at Oak Ridge National Laboratory is in the process of
                   creating an open federated hardware/software architecture for
                   the laboratory of the future, connecting scientific
                   instruments, robot-controlled laboratories, and edge/center
                   computing/data resources to enable autonomous experiments,
                   self-driving laboratories, smart manufacturing, and
                   artificial intelligence driven design, discovery, and
                   evaluation. Its novel approach describes science use cases as
                   design patterns that identify and abstract the involved
                   hardware/software components and their interactions in terms
                   of control, work, and data flow. It creates a
                   system-of-systems architecture of the federated
                   hardware/software ecosystem that clarifies terms,
                   architectural elements, the interactions between them and
                   compliance. It further designs a federated microservice
                   architecture, mapping science use case design patterns to the
                   system-of-systems architecture with loosely coupled
                   microservices and standardized interfaces. The INTERSECT Open
                   Architecture Specification contains a use case design pattern
                   catalog, a federated system-of-systems architecture
                   specification, and a federated microservice architecture
                   specification. It is currently being used to prototype and
                   deploy autonomous experiments and self-driving laboratories at
                   Oak Ridge National Laboratory in the following science areas:
                   (1) automation for electric grid interconnected-laboratory
                   emulation/simulation, (2) autonomous additive manufacturing,
                   (3) autonomous continuous flow reactor synthesis, (4)
                   autonomous electron microscopy, (5) autonomous
                   robotic-controlled chemistry laboratory, and (6) integrating
                   an ion trap quantum computing resource."
}
@misc{engelmann23interconnected3,
  author        = "Christian Engelmann",
  title         = "The Interconnected Science Ecosystem (INTERSECT)
                   Architecture",
  month         = aug # "~21-23, ",
  year          = "2023",
  howpublished  = "{Invited talk at the \href{https://smc2023.ornl.gov}
                   {$20^{th}$ Smoky Mountains Computational Sciences &
                   Engineering Conference (SMC)}, Knoxville, TN, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann23interconnected3.ppt.pdf",
  abstract      = "The Interconnected Science Ecosystem (INTERSECT) Initiative
                   at Oak Ridge National Laboratory is in the process of
                   creating an open federated hardware/software architecture for
                   the laboratory of the future, connecting scientific
                   instruments, robot-controlled laboratories, and edge/center
                   computing/data resources to enable autonomous experiments,
                   self-driving laboratories, smart manufacturing, and artificial
                   intelligence driven design, discovery, and evaluation. Its
                   novel approach describes science use cases as design patterns
                   that identify and abstract the involved hardware/software
                   components and their interactions in terms of control, work,
                   and data flow. It creates a system-of-systems architecture of
                   the federated hardware/software ecosystem that clarifies
                   terms, architectural elements, the interactions between them
                   and compliance. It further designs a federated microservice
                   architecture, mapping science use case design patterns to the
                   system-of-systems architecture with loosely coupled
                   microservices and standardized interfaces. The INTERSECT Open
                   Architecture Specification contains a use case design pattern
                   catalog, a federated system-of-systems architecture
                   specification, and a federated microservice architecture
                   specification. It is currently being used to prototype and
                   deploy autonomous experiments and self-driving laboratories
                   at Oak Ridge National Laboratory in the following science
                   areas: (1) automation for electric grid
                   interconnected-laboratory emulation/simulation, (2) autonomous
                   additive manufacturing, (3) autonomous continuous flow reactor
                   synthesis, (4) autonomous electron microscopy, (5) autonomous
                   robotic-controlled chemistry laboratory, and (6) integrating
                   an ion trap quantum computing resource."
}
@misc{engelmann23interconnected2,
  author        = "Christian Engelmann",
  title         = "The Interconnected Science Ecosystem (INTERSECT)
                   Architecture",
  month         = jul # "~10, ",
  year          = "2023",
  howpublished  = "{Seminar at the \href{http://www.lrz-muenchen.de}{Leibniz
                   Rechenzentrum (LRZ)}, Garching, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann23interconnected2.ppt.pdf",
  abstract      = "The Interconnected Science Ecosystem (INTERSECT) Initiative
                   at Oak Ridge National Laboratory is in the process of
                   creating an open federated hardware/software architecture for
                   the laboratory of the future, connecting scientific
                   instruments, robot-controlled laboratories, and edge/center
                   computing/data resources to enable autonomous experiments,
                   self-driving laboratories, smart manufacturing, and artificial
                   intelligence driven design, discovery, and evaluation. Its
                   novel approach describes science use cases as design patterns
                   that identify and abstract the involved hardware/software
                   components and their interactions in terms of control, work,
                   and data flow. It creates a system-of-systems architecture of
                   the federated hardware/software ecosystem that clarifies
                   terms, architectural elements, the interactions between them
                   and compliance. It further designs a federated microservice
                   architecture, mapping science use case design patterns to the
                   system-of-systems architecture with loosely coupled
                   microservices and standardized interfaces. The INTERSECT Open
                   Architecture Specification contains a use case design pattern
                   catalog, a federated system-of-systems architecture
                   specification, and a federated microservice architecture
                   specification. It is currently being used to prototype and
                   deploy autonomous experiments and self-driving laboratories
                   at Oak Ridge National Laboratory in the following science
                   areas: (1) automation for electric grid
                   interconnected-laboratory emulation/simulation, (2) autonomous
                   additive manufacturing, (3) autonomous continuous flow reactor
                   synthesis, (4) autonomous electron microscopy, (5) autonomous
                   robotic-controlled chemistry laboratory, and (6) integrating
                   an ion trap quantum computing resource."
}
@misc{engelmann23interconnected,
  author        = "Christian Engelmann",
  title         = "The Interconnected Science Ecosystem (INTERSECT)
                   Architecture",
  month         = may # "~25, ",
  year          = "2023",
  howpublished  = "{Invited talk at the \href{https://esailworkshop.ornl.gov}
                    {$1^{st}$ Ecosystems for Smart Autonomous Interconnected
                    Labs (E-SAIL) Workshop}, held in conjunction with the
                    \href{https://www.isc-hpc.com}{$38^{th}$ ISC High
                    Performance (ISC) 2023}, Hamburg, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann23interconnected.ppt.pdf",
  abstract      = "The open Interconnected Science Ecosystem (INTERSECT)
                   architecture connects scientific instruments and
                   robot-controlled laboratories with computing and data
                   resources at the edge, the Cloud or the high-performance
                   computing center to enable autonomous experiments,
                   self-driving laboratories, smart manufacturing, and
                   artificial intelligence driven design, discovery and
                   evaluation. Its novel approach consists of science use
                   case design patterns, a system of systems architecture, and
                   a microservice architecture."
}
@misc{engelmann22designing,
  author        = "Christian Engelmann",
  title         = "Designing Smart and Resilient Extreme-Scale Systems",
  month         = feb # "~23-26, ",
  year          = "2022",
  howpublished  = "{Invited talk at the
                   \href{https://www.siam.org/conferences/cm/conference/pp22}
                   {$20^{th}$ SIAM Conference on Parallel Processing for
                   Scientific Computing (PP) 2022}, Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann22designing.ppt.pdf",
  abstract      = "Resilience is one of the critical challenges of extreme-scale
                   high-performance computing (HPC) systems, as component counts
                   increase, individual component reliability decreases, and
                   software complexity increases. Building a reliable
                   supercomputer that achieves the expected performance within a
                   given cost budget and providing efficiency and correctness
                   during operation in the presence of faults, errors, and
                   failures requires a full understanding of the resilience
                   problem. This talk provides an overview of recent achievements
                   in developing a taxonomy, catalog and models that capture the
                   observed and inferred fault, error, and failure conditions in
                   current supercomputers and in extrapolating this knowledge to
                   future-generation systems. It also describes the path forward
                   in machine-in-the-loop operational intelligence for smart
                   computing systems, leveraging operational data analytics in a
                   loop control that maximizes productivity and minimizes costs
                   through adaptive autonomous operation for resilience."
}
@misc{mintz21enabling,
  author        = "Ben Mintz
                   and Christian Engelmann
                   and Elke Arenholz
                   and Ryan Coffee",
  title         = "Enabling Self-Driven Experiments for Science through an
                   Interconnected Science Ecosystem (INTERSECT)",
  month         = oct # "~20, ",
  year          = "2021",
  howpublished  = "{Panel at the \href{https://smc2021.ornl.gov}{$17^{th}$ Smoky
                    Mountains Computational Sciences & Engineering Conference
                    (SMC)}}",
  abstract      = "The process of operating scientific instruments, conducting
                   experiments, and executing scientific workflows in general is
                   time-consuming and labor-intensive. Computer control of
                   instruments and the rapid rise in simulation and modeling has
                   led to a significant increase in both the quantity and
                   quality of data, but scientists are still contributing to
                   many low-level process steps in data acquisition, processing,
                   and interpretation to produce scientific results. These
                   issues led to the integration of automation and autonomy to
                   decrease process bottlenecks and increase efficiencies.
                   While automation incorporates tools that perform
                   well-defined, systematic processes with limited human
                   intervention, autonomy introduces smart decision-making
                   techniques, such as artificial intelligence (AI) and machine
                   learning (ML). Combining these advances to automate entire
                   scientific workflows and controlling them with AI/ML will
                   bring about revolutionary efficiencies and research outcomes.
                   This kind of autonomous control of processes, experiments,
                   and laboratories will fundamentally change the way scientists
                   work, allowing us to explore high-dimensional problems
                   previously considered impossible and discover new subtle
                   correlations.

                   To enable the interoperability of existing and future
                   self-driven experiments, the scientific community needs a
                   common Interconnected Science Ecosystem (INTERSECT) that
                   consistently incorporates data management software, data
                   analysis workflow tools, and experiment management/steering
                   software as well as AI/ML capabilities. The development of
                   INTERSECT requires tight collaboration between computer
                   scientists, software engineers, data scientists, and domain
                   scientists. This panel will introduce INTERSECT and discuss
                   opportunities, challenges, and business goals for this type
                   of ecosystem including scalability, interoperability, and
                   solution/software transferability/reusability."
}
@misc{engelmann21faults,
  author        = "Christian Engelmann",
  title         = "Faults, Errors and Failures in Extreme-Scale Supercomputers",
  month         = aug # "~30, ",
  year          = "2021",
  howpublished  = "{Keynote talk at the
                    \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2021}{$14^{th}$
                    Workshop on Resiliency in High Performance Computing
                    (Resilience) in Clusters, Clouds, and Grids}, held in
                    conjunction with the \href{http://europar2014.dcc.fc.up.pt}
                    {$27^{th}$ European Conference on Parallel and Distributed
                    Computing (Euro-Par) 2021}, Lisbon, Portugal}",
  url           = "http://www.christian-engelmann.info/publications/engelmann21faults.ppt.pdf",
  abstract      = "Resilience is one of the critical challenges of extreme-scale
                   high-performance computing systems, as component counts
                   increase, individual component reliability decreases, and
                   software complexity increases. Building a reliable
                   supercomputer that achieves the expected performance within a
                   given cost budget and providing efficiency and correctness
                   during operation in the presence of faults, errors, and
                   failures requires a full understanding of the resilience
                   problem. This talk provides an overview of reliability
                   experiences with some of the largest supercomputers in the
                   world and recent achievements in developing a taxonomy,
                   catalog and models that capture the observed and inferred
                   fault, error, and failure conditions in these systems."
}
@misc{engelmann21resilience,
  author        = "Christian Engelmann",
  title         = "The Resilience Problem in Extreme Scale Computing:
                   Experiences and the Path Forward",
  month         = mar # "~1-5, ",
  year          = "2021",
  howpublished  = "{Invited talk at the
                   \href{https://www.siam.org/conferences/cm/conference/cse21}
                   {SIAM Conference on Computational Science and Engineering
                   (CSE) 2021}, Fort Worth, TX, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann21resilience.ppt.pdf",
  abstract      = "Resilience is one of the critical challenges of extreme-scale
                   high-performance computing systems, as component counts
                   increase, individual component reliability decreases, and
                   software complexity increases. Building a reliable
                   supercomputer that achieves the expected performance within a
                   given cost budget and providing efficiency and correctness
                   during operation in the presence of faults, errors, and
                   failures requires a full understanding of the resilience
                   problem. This talk provides an overview of reliability
                   experiences with some of the largest supercomputers in the
                   world and recent achievements in developing a taxonomy,
                   catalog and models that capture the observed and inferred
                   fault, error, and failure conditions in these systems."
}
@misc{engelmann21smart,
  author        = "Christian Engelmann",
  title         = "Smart and Resilient Extreme-Scale Systems",
  month         = jan # "~19, ",
  year          = "2021",
  howpublished  = "{Invited talk at the
                    \href{https://www.hipeac.net/2021/spring-virtual/#/program/sessions/7854/}
                    {Workshop on Resilience in High Performance Computing
                    (RESILIENTHPC)}, held in conjunction with the
                    \href{https://www.hipeac.net/2021}
                    {European Network on High-performance Embedded Architecture
                     and Compilation (HiPEAC) Conference 2021}, Budapest, Hungary}",
  url           = "http://www.christian-engelmann.info/publications/engelmann21smart.ppt.pdf",
  abstract      = "Resilience is one of the critical challenges of extreme-scale
                   high-performance computing (HPC) systems, as component counts
                   increase, individual component reliability decreases, and
                   software complexity increases. Building a reliable
                   supercomputer that achieves the expected performance within a
                   given cost budget and providing efficiency and correctness
                   during operation in the presence of faults, errors, and
                   failures requires a full understanding of the resilience
                   problem. This talk provides an overview of recent
                   achievements in developing a taxonomy, catalog and models
                   that capture the observed and inferred fault, error, and
                   failure conditions in current supercomputers and in
                   extrapolating this knowledge to future-generation systems.
                   It also describes the path forward in machine-in-the-loop
                   operational intelligence for smart computing systems,
                   leveraging operational data analytics in a loop control that
                   maximizes productivity and minimizes costs through adaptive
                   autonomous operation for resilience."
}
@misc{engelmann20resilience,
  author        = "Christian Engelmann",
  title         = "The Resilience Problem in Extreme Scale Computing",
  month         = feb # "~12-15, ",
  year          = "2020",
  howpublished  = "{Invited talk at the
                   \href{https://www.siam.org/conferences/cm/conference/pp20}
                   {$19^{th}$ SIAM Conference on Parallel Processing for
                   Scientific Computing (PP) 2020}, Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann20resilience.ppt.pdf",
  abstract      = "Resilience is one of the critical challenges of extreme-scale
                   high-performance computing (HPC) systems, as component counts
                   increase, individual component reliability decreases, and
                   software complexity increases. Building a reliable
                   supercomputer that achieves the expected performance within
                   a given cost budget and providing efficiency and correctness
                   during operation in the presence of faults, errors, and
                   failures requires a full understanding of the resilience
                   problem. This talk provides an overview of recent
                   achievements in developing a taxonomy, catalog and models
                   that capture the observed and inferred fault, error, and
                   failure conditions in current supercomputers and in
                   extrapolating this knowledge to future-generation systems."
}
@misc{engelmann19resilience3,
  author        = "Christian Engelmann",
  title         = "Resilience in Parallel Programming Environments",
  month         = oct # "~30-31, ",
  year          = "2019",
  howpublished  = "{Invited talk at the
                   \href{https://iadac.github.io/events/adac8}{$8^{th}$
                   Accelerated Data Analytics and Computing (ADAC) Institute
                   Workshop}, Tokyo, Japan}",
  url           = "http://www.christian-engelmann.info/publications/engelmann19resilience.ppt.pdf",
  abstract      = "Recent reliability issues with one of the fastest
                   supercomputers in the world, Titan at Oak Ridge National
                   Laboratory, demonstrated the need for resilience in
                   large-scale heterogeneous computing. OpenMP currently does
                   not address error and failure behavior. The presented work
                   takes a first step toward resilience for heterogeneous
                   systems by providing the concepts for resilient OpenMP
                   offload to devices. Using real-world error and failure
                   observations, this work describes the concepts and
                   terminology for resilient OpenMP target offload, including
                   error and failure classes and resilience strategies. It
                   details the errors and failures experienced with
                   general-purpose computing on graphics processing units
                   (GPGPUs) in Titan. It
                   further proposes improvements in OpenMP, including a
                   preliminary prototype design, to support resilient offload
                   to devices for efficient handling of errors and failures in
                   heterogeneous high-performance computing systems."
}
@misc{engelmann19resilience2,
  author        = "Christian Engelmann",
  title         = "Resilience by Design (and not as an Afterthought)",
  month         = mar # "~26-29, ",
  year          = "2019",
  howpublished  = "{Invited talk at the \href{https://sos23.ornl.gov/}{$23^{rd}$
                   Workshop on Distributed Supercomputing (SOS) 2019}, Asheville,
                   NC, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann19resilience2.ppt.pdf",
  abstract      = "Resilience, i.e., obtaining a correct solution in a timely
                   and efficient manner, is one of the key challenges in
                   extreme-scale high-performance computing (HPC). The challenge
                   is to build a reliable HPC system within a given cost budget
                   that achieves the expected performance. Every generation of
                   supercomputers deployed at Oak Ridge National Laboratory
                   (ORNL) had to deal with expected and unexpected faults,
                   errors and failures. While these supercomputers are designed
                   to deal with expected issues, unexpected reliability problems
                   can lead to severe degradation in operational capabilities.
                   For example, ORNL's Titan supercomputer experienced an
                   unexpected increase in general-purpose graphics processing
                   unit (GPGPU) failures between 2015 and 2017. At the peak of
                   the problem, Titan was losing an average of 12 GPGPUs (and
                   corresponding compute nodes) per day. Over 50\% of its 18,688
                   GPGPUs had to be replaced. The system and the applications
                   using it were never designed to handle such a high failure
                   rate in an efficient manner. Other past unexpected
                   reliability issues with supercomputers at US Department of
                   Energy HPC centers were caused by early wear-out, dirty
                   power, bad solder, other manufacturing issues, design errors
                   in hardware, design errors in software and user errors. With
                   the expected decrease in reliability due to component count
                   increases, process technology challenges, hardware
                   heterogeneity and software complexity, risk mitigation
                   against unexpected issues is becoming paramount to ensure
                   the success of future extreme-scale HPC systems. Resilience
                   needs to be holistically provided by the HPC
                   hardware/software ecosystem. The key challenges are to
                   design and to operate extreme HPC systems with (1)
                   wide-ranging resilience capabilities in hardware, system
                   software, programming models, libraries, and applications,
                   (2) interfaces and mechanisms for coordinating resilience
                   capabilities across diverse hardware and software
                   components, (3) appropriate metrics and tools for assessing
                   performance, resilience, and energy, and (4) an
                   understanding of the performance, resilience and energy
                   trade-off that eventually results in well-informed HPC
                   system design choices and runtime decisions."
}
@misc{engelmann19resilience,
  author        = "Christian Engelmann",
  title         = "Resilience for Extreme Scale Systems: Understanding the
                   Problem",
  month         = feb # "~25 - " # mar # "~1, ",
  year          = "2019",
  howpublished  = "{Invited talk at the
                   \href{https://www.siam.org/meetings/cse19/}{SIAM Conference
                   on Computational Science and Engineering (CSE) 2019},
                   Spokane, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann19resilience.ppt.pdf",
  abstract      = "Resilience is one of the critical challenges of extreme-scale
                   high-performance computing (HPC) systems, as component counts
                   increase, individual component reliability decreases, and
                   software complexity increases. Building a reliable
                   supercomputer that achieves the expected performance within a
                   given cost budget and providing efficiency and correctness
                   during operation in the presence of faults, errors, and
                   failures requires a full understanding of the resilience
                   problem. This talk provides an overview of the Catalog
                   project, which develops a taxonomy, catalog and models that
                   capture the observed and inferred fault, error, and failure
                   conditions in current supercomputers and extrapolates this
                   knowledge to future-generation systems. To date, this
                   project has analyzed billions of node hours of system logs
                   from supercomputers at Oak Ridge National Laboratory and
                   Argonne National Laboratory."
}
@misc{engelmann18modeling,
  author        = "Christian Engelmann and Rizwan Ashraf",
  title         = "Modeling and Simulation of Extreme-Scale Systems for
                   Resilience by Design",
  month         = aug # "~15-17, ",
  year          = "2018",
  howpublished  = "{Invited talk at the \href{https://www.bnl.gov/modsim2018}
                   {Workshop on Modeling and Simulation of Systems and
                   Applications}, Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann18modeling.ppt.pdf",
  abstract      = "Resilience is a serious concern for extreme-scale
                   high-performance computing (HPC). While the HPC community has
                   developed various resilience solutions, the solution space
                   remains fragmented. We created a structured approach to the
                   design, evaluation and optimization of HPC resilience using
                   the concept of design patterns. A design pattern describes a
                   generalized solution to a repeatedly occurring problem. We
                   identified the commonly occurring problems and solutions used
                   to deal with faults, errors and failures in HPC systems. Each
                   well-known solution that addresses a specific resilience
                   challenge is described in the form of a design pattern. We
                   developed a resilience design pattern specification, language
                   and catalog, which can be used by system architects, system
                   software and library developers, application programmers, as
                   well as users and operators as essential building blocks when
                   designing and deploying resilience solutions.
                   The resilience design pattern approach provides a unique
                   opportunity for design space exploration. As each resilience
                   solution is abstracted as a pattern and each solution's
                   properties are defined by pattern parameters, vertical and
                   horizontal pattern compositions can describe the resilience
                   capabilities of an entire HPC system. This permits the
                   investigation of beneficial or counterproductive interactions
                   between patterns and of the performance, resilience, and
                   power consumption trade-off between different pattern
                   parameters and compositions. The ultimate goal is to make
                   resilience an integral part of the HPC hardware/software
                   ecosystem by coordinating the various existing resilience
                   solutions in a design space exploration process, such that
                   the burden for providing resilience is on the system by
                   design and not on the user as an afterthought.
                   We are in the early stages of developing a novel design space
                   exploration tool that enables this investigation using
                   modeling and simulation. We developed performance and
                   resilience models for each resilience design pattern. We also
                   leverage results from the Catalog project, a collaborative
                   effort between Oak Ridge National Laboratory, Argonne
                   National Laboratory and Lawrence Livermore National
                   Laboratory that developed models of the faults, errors and
                   failures in today's HPC systems. We also leverage recent
                   results from the same project by Lawrence Livermore National
                   Laboratory in application reliability patterns. The planned
                   research extends and combines this work to model the
                   performance, resilience, and power consumption of an entire
                   HPC system, initially at node-level granularity, and to
                   simulate the dynamic interactions between deployed
                   resilience solutions and the rest of the system. In the next
                   iteration, finer-grain modeling and simulation, such as at
                   the computational unit level, will be used to increase accuracy.
                   This work leverages the experience of the investigators in
                   parallel discrete event simulation of extreme-scale systems,
                   such as the Extreme-scale Simulator (xSim).
                   The current state of the art in resilience modeling and
                   simulation is fragmented as well. There is currently no such
                   design space exploration tool. Instead, each resilience
                   solution is typically investigated separately. There is only
                   a small amount of work on multi-resilience solutions,
                   including by the investigators. While there is work in
                   investigating the performance/resilience trade-off space,
                   there is almost no work in including power consumption."
}
@misc{engelmann18characterizing2,
  author        = "Christian Engelmann",
  title         = "Characterizing Faults, Errors, and Failures in Extreme-Scale
                   Systems",
  month         = jul # "~2-4, ",
  year          = "2018",
  howpublished  = "{Invited talk at the
                   \href{https://pasc18.pasc-conference.org}{Platform for
                   Advanced Scientific Computing (PASC) Conference 2018},
                   Basel, Switzerland}",
  url           = "http://www.christian-engelmann.info/publications/engelmann18characterizing2.ppt.pdf",
  abstract      = "Building a reliable supercomputer that achieves the expected
                   performance within a given cost budget and providing
                   efficiency and correctness during operation in the presence
                   of faults, errors, and failures requires a full understanding
                   of the resilience problem. The Catalog project develops a
                   fault taxonomy, catalog and models that capture the observed
                   and inferred conditions in current supercomputers and
                   extrapolates this knowledge to future-generation systems. To
                   date, the Catalog project has analyzed billions of node hours
                   of system logs from supercomputers at Oak Ridge National
                   Laboratory and Argonne National Laboratory. This talk
                   provides an overview of our findings and lessons learned."
}
@misc{engelmann18characterizing,
  author        = "Christian Engelmann",
  title         = "Characterizing Faults, Errors, and Failures in Extreme-Scale
                   Systems",
  month         = jun # "~20-21, ",
  year          = "2018",
  howpublished  = "{Invited talk at the
                   \href{https://iadac.github.io/adac6}{$6^{th}$
                   Accelerated Data Analytics and Computing (ADAC) Institute
                   Workshop}, Zurich, Switzerland}",
  url           = "http://www.christian-engelmann.info/publications/engelmann18characterizing.ppt.pdf",
  abstract      = "Building a reliable supercomputer that achieves the expected
                   performance within a given cost budget and providing
                   efficiency and correctness during operation in the presence
                   of faults, errors, and failures requires a full understanding
                   of the resilience problem. The Catalog project develops a
                   fault taxonomy, catalog and models that capture the observed
                   and inferred conditions in current supercomputers and
                   extrapolates this knowledge to future-generation systems. To
                   date, the Catalog project has analyzed billions of node hours
                   of system logs from supercomputers at Oak Ridge National
                   Laboratory and Argonne National Laboratory. This talk
                   provides an overview of our findings and lessons learned."
}
@misc{engelmann18pattern-based,
  author        = "Christian Engelmann",
  title         = "Pattern-based Modeling of Fail-stop and Soft-error Resilience
                   for Iterative Linear Solvers",
  month         = mar # "~7-10, ",
  year          = "2018",
  howpublished  = "{Invited talk at the
                   \href{https://www.siam.org/meetings/pp18/}{$18^{th}$ SIAM
                   Conference on Parallel Processing for Scientific Computing
                   (PP) 2018}, Tokyo, Japan}",
  url           = "http://www.christian-engelmann.info/publications/engelmann18resilience.ppt.pdf",
  abstract      = "Reliability is a serious concern for future extreme-scale
                   high-performance computing (HPC). While the HPC community has
                   developed various resilience solutions, the solution space
                   remains fragmented. With this work, we develop a structured
                   approach to the design, evaluation and optimization of HPC
                   resilience using the concept of design patterns. We identify
                   the problems caused by faults, errors and failures in HPC
                   systems and the techniques used to deal with these events.
                   Each well-known solution that addresses a specific resilience
                   challenge is described in the form of a pattern. We develop a
                   catalog of such resilience design patterns, which may be used
                   by system architects, system software and tools developers,
                   application programmers, as well as users and operators as
                   essential building blocks when designing and deploying
                   resilience solutions. We also develop a design framework that
                   enhances a designer's understanding the opportunities for
                   integrating multiple patterns across layers of the system
                   stack and the important constraints during implementation of
                   the individual patterns. It is also useful for designing
                   mechanisms and interfaces to coordinate flexible fault
                   management across hardware and software components. The
                   resilience patterns and the design framework also enable
                   exploration and evaluation of design alternatives and
                   support optimization of the cost-benefit trade-offs among
                   performance, protection coverage, and power consumption of
                   resilience solutions."
}
@misc{engelmann18resilience,
  author        = "Christian Engelmann",
  title         = "Resilience Design Patterns: A Structured Approach to
                   Resilience at Extreme Scale",
  month         = mar # "~7-10, ",
  year          = "2018",
  howpublished  = "{Invited talk at the
                   \href{https://www.siam.org/meetings/pp18/}{$18^{th}$ SIAM
                   Conference on Parallel Processing for Scientific Computing
                   (PP) 2018}, Tokyo, Japan}",
  url           = "http://www.christian-engelmann.info/publications/pattern-based.ppt.pdf",
  abstract      = "The reliability of high-performance computing (HPC) platforms
                   is among the most critical challenges as systems continue to
                   increase component counts, while the individual component
                   reliability decreases and software complexity increases.
                   While most resilience solutions are designed to address a
                   specific fault model, HPC applications must contend with
                   extremely high rates of faults from various sources with
                   different levels of severity. Therefore, resilience for
                   extreme-scale HPC systems and their applications requires an
                   integrated approach, which leverages detection, containment
                   and mitigation capabilities from different layers of the HPC
                   environment. With this work, we propose an approach based on
                   design patterns to explore a multi-level resilience solution
                   that addresses silent data corruptions and process failures.
                   The structured approach enables evaluation of the key
                   components of a multi-level resilience solution using pattern
                   performance models and systematically integrating the
                   patterns into a complete solution by assessing the interplay
                   between the patterns. We describe the design steps to develop
                   a multi-level resilience solution for an iterative linear
                   solver application that combines algorithmic resilience
                   features of the solver with the fault tolerance primitives
                   provided by ULFM MPI. Our results demonstrate the viability
                   of designing HPC applications capable of surviving
                   simultaneous injection of hard and soft errors in a
                   performance efficient manner."
}
@misc{engelmann17catalog2,
  author        = "Christian Engelmann",
  title         = "A Catalog of Faults, Errors, and Failures in Extreme-Scale
                   Systems",
  month         = jul # "~10-14, ",
  year          = "2017",
  howpublished  = "{Invited talk at the
                   \href{http://www.siam.org/meetings/an17/}{SIAM Annual
                   Meeting (AM) 2017}, Pittsburgh, PA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann17catalog2.ppt.pdf",
  abstract      = "Building a reliable supercomputer that achieves the expected
                   performance within a given cost budget and providing
                   efficiency and correctness during operation in the presence
                   of faults, errors, and failures requires a full understanding
                   of the resilience problem. The Catalog project develops a
                   fault taxonomy, catalog and models that capture the observed
                   and inferred conditions in current supercomputers and
                   extrapolates this knowledge to future-generation systems. To
                   date, the Catalog project has analyzed billions of node hours
                   of system logs from supercomputers at Oak Ridge National
                   Laboratory and Argonne National Laboratory. This talk
                   provides an overview of our findings and lessons learned."
}
@misc{engelmann17characterizing,
  author        = "Christian Engelmann",
  title         = "Characterizing Faults, Errors and Failures in Extreme-Scale
                   Computing Systems",
  month         = jun # "~16-22, ",
  year          = "2017",
  howpublished  = "{Invited talk at the
                   \href{http://www.isc-hpc.com}
                   {International Supercomputing Conference (ISC) 2017},
                   Frankfurt am Main, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann17characterizing.ppt.pdf",
  abstract      = "Building a reliable supercomputer that achieves the expected
                   performance within a given cost budget and providing
                   efficiency and correctness during operation in the presence
                   of faults, errors, and failures requires a full understanding
                   of the resilience problem. The Catalog project develops a
                   fault taxonomy, catalog and models that capture the observed
                   and inferred conditions in current supercomputers and
                   extrapolates this knowledge to future-generation systems. To
                   date, the Catalog project has analyzed billions of node hours
                   of system logs from supercomputers at Oak Ridge National
                   Laboratory and Argonne National Laboratory. This talk
                   provides an overview of our findings and lessons learned."
}
@misc{engelmann17catalog,
  author        = "Christian Engelmann",
  title         = "A Catalog of Faults, Errors, and Failures in Extreme-Scale
                   Systems",
  month         = may # "~24-26, ",
  year          = "2017",
  howpublished  = "{Invited talk at the
                   \href{http://icl.cs.utk.edu/workshops/scheduling2017/}
                   {$12^{th}$ Scheduling for Large Scale Systems Workshop
                   (SLSSW) 2017}, Knoxville, TN, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann17catalog.ppt.pdf",
  abstract      = "Building a reliable supercomputer that achieves the expected
                   performance within a given cost budget and providing
                   efficiency and correctness during operation in the presence
                   of faults, errors, and failures requires a full understanding
                   of the resilience problem. The Catalog project develops a
                   fault taxonomy, catalog and models that capture the observed
                   and inferred conditions in current supercomputers and
                   extrapolates this knowledge to future-generation systems. To
                   date, the Catalog project has analyzed billions of node hours
                   of system logs from supercomputers at Oak Ridge National
                   Laboratory and Argonne National Laboratory. This talk
                   provides an overview of our findings and lessons learned."
}
@misc{engelmann16missing,
  author        = "Christian Engelmann",
  title         = "The Missing High-Performance Computing Fault Model",
  month         = apr # "~12-15, ",
  year          = "2016",
  howpublished  = "{Invited talk at the
                   \href{http://www.siam.org/meetings/pp16/}{$17^{th}$ SIAM
                   Conference on Parallel Processing for Scientific Computing
                   (PP) 2016}, Paris, France}",
  url           = "http://www.christian-engelmann.info/publications/engelmann16missing.ppt.pdf",
  abstract      = "The path to exascale computing poses several research
                   challenges. Resilience is one of the most important
                   challenges. This talk will present recent work in
                   developing the missing high-performance computing (HPC)
                   fault model. This effort identifies, categorizes and
                   models the fault, error and failure properties of
                   today's HPC systems. It develops a fault taxonomy,
                   catalog and models that capture the observed and inferred
                   conditions in current systems and extrapolates this
                   knowledge to exascale HPC systems."
}
@misc{engelmann16resilience2,
  author        = "Christian Engelmann",
  title         = "Resilience Challenges and Solutions for Extreme-Scale
                   Supercomputing",
  month         = feb # "~18, ",
  year          = "2016",
  howpublished  = "{Invited talk at the \href{http://www.usna.edu}{United
                    States Naval Academy}, Annapolis, MD, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann16resilience2.ppt.pdf",
  abstract      = "The path to exascale computing poses several research
                   challenges related to power, performance, resilience,
                   productivity, programmability, data movement, and data
                   management. Resilience, i.e., providing efficiency and
                   correctness in the presence of faults, is one of the most
                   important exascale computer science challenges as systems
                   scale up in component count (100,000-1,000,000 nodes with
                   1,000-10,000 cores per node by 2022) and component
                   reliability decreases (7 nm technology with near-threshold
                   voltage operation by 2022). This talk provides an overview
                   of recent and ongoing resilience research and development
                   activities at Oak Ridge National Laboratory in advanced
                   checkpoint storage architectures, process-level incremental
                   checkpoint/restart, proactive fault tolerance using
                   prediction-triggered process or virtual machine migration,
                   MPI process-level software redundancy, and soft-error
                   injection tools to study the vulnerability of science
                   applications."
}
@misc{engelmann15toward,
  author        = "Christian Engelmann",
  title         = "Toward A Fault Model And Resilience Design Patterns For
                   Extreme Scale Systems",
  month         = aug # "~24-28, ",
  year          = "2015",
  howpublished  = "{Keynote talk at the
                    \href{http://www.csm.ornl.gov/srt/conferences/Resilience/2015}{$8^{th}$
                    Workshop on Resiliency in High Performance Computing
                    (Resilience) in Clusters, Clouds, and Grids}, held in
                    conjunction with the \href{http://europar2014.dcc.fc.up.pt}
                    {$21^{st}$ European Conference on Parallel and Distributed
                    Computing (Euro-Par) 2015}, Vienna, Austria}",
  url           = "http://www.christian-engelmann.info/publications/engelmann15toward.ppt.pdf",
  abstract      = "The path to exascale computing poses several research
                   challenges related to power, performance, resilience,
                   productivity, programmability, data movement, and data
                   management. Resilience, i.e., providing efficiency and
                   correctness in the presence of faults, is one of the most
                   important exascale computer science challenges as systems
                   scale up in component count (100,000-1,000,000 nodes with
                   1,000-10,000 cores per node by 2022) and component
                   reliability decreases (7 nm technology with near-threshold
                   voltage operation by 2022). This talk provides an overview
                   of two recently funded projects.
                   The Characterizing Faults, Errors, and Failures in
                   Extreme-Scale Systems project identifies, categorizes and
                   models the fault, error and failure properties of US
                   Department of Energy high-performance computing (HPC)
                   systems. It develops a fault taxonomy, catalog and models
                   that capture the observed and inferred conditions in current
                   systems and extrapolate this knowledge to exascale HPC
                   systems.
                   The Resilience Design Patterns project will increase the
                   ability of scientific applications to reach accurate
                   solutions in a timely and efficient manner. Using a novel
                   design pattern concept, it identifies and evaluates
                   repeatedly occurring resilience problems and coordinates
                   solutions throughout high-performance computing hardware
                   and software."
}
@misc{engelmann15resilience,
  author        = "Christian Engelmann",
  title         = "Resilience Challenges and Solutions for Extreme-Scale
                   Supercomputing",
  month         = mar # "~2-5, ",
  year          = "2015",
  howpublished  = "{Invited talk at the
                    $19^{th}$ Workshop on Distributed Supercomputing (SOS)
                     2015, Park City, UT, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann15resilience.ppt.pdf",
  abstract      = "The path to exascale computing poses several research
                   challenges related to power, performance, resilience,
                   productivity, programmability, data movement, and data
                   management. Resilience, i.e., providing efficiency and
                   correctness in the presence of faults, is one of the most
                   important exascale computer science challenges as systems
                   scale up in component count (100,000-1,000,000 nodes with
                   1,000-10,000 cores per node by 2022) and component
                   reliability decreases (7 nm technology with near-threshold
                   voltage operation by 2022). This talk provides an overview
                   of recent and ongoing resilience research and development
                   activities at Oak Ridge National Laboratory in advanced
                   checkpoint storage architectures, process-level incremental
                   checkpoint/restart, proactive fault tolerance using
                   prediction-triggered process or virtual machine migration,
                   MPI process-level software redundancy, and soft-error
                   injection tools to study the vulnerability of science
                   applications."
}
@misc{engelmann15xsim,
  author        = "Christian Engelmann",
  title         = "xSim: {T}he Extreme-scale Simulator",
  month         = feb # "~23, ",
  year          = "2015",
  howpublished  = "{Seminar at the \href{http://www.lrz-muenchen.de}{Leibniz
                   Rechenzentrum (LRZ)}, Garching, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann15xsim.ppt.pdf",
  abstract      = "The path to exascale high-performance computing (HPC) poses
                   several challenges related to power, performance, and
                   resilience. Investigating the performance and resilience of
                   parallel applications at scale on future architectures and
                   the performance and resilience impact of different
                   architecture choices is an important component of HPC
                   hardware/software co-design. Without having access to future
                   architectures at scale, simulation provides an alternative.
                   The Extreme-scale Simulator (xSim) is a performance
                   investigation toolkit that permits running applications in
                   a controlled environment with millions of concurrent
                   execution threads, while observing performance and
                   resilience in a simulated extreme-scale system. Using a
                   lightweight parallel discrete event simulation, xSim executes
                   a Message Passing Interface (MPI) application on a much
                   smaller system in a highly oversubscribed fashion with a
                   virtual wall clock time, such that performance data can be
                   extracted based on a processor and a network model. xSim is
                   designed like a traditional performance tool, as an
                   interposition library that sits between the MPI application
                   and the MPI library, using the MPI profiling interface. It
                   has been run with up to 134,217,728 ($2^{27}$) MPI ranks using a
                   960-core Linux cluster. xSim also permits the injection of
                   MPI process failures, the propagation/detection/notification
                   of such failures within the simulation, and their handling
                   within the simulation using application-level
                   checkpoint/restart. Another feature provides user-level
                   failure mitigation (ULFM) extensions at the simulated MPI
                   layer to support algorithm-based fault tolerance (ABFT).
                   xSim is the very first performance tool that supports ULFM
                   and ABFT."
}
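The xSim entry above describes the general technique of an interposition library that sits between the MPI application and the MPI library via the MPI profiling interface (PMPI). Below is a minimal sketch of that general technique only, not xSim itself; the byte counter is a made-up example.

/* pmpi_sketch.c - minimal PMPI interposition layer (illustrative only).
 * The tool defines MPI_* entry points and forwards to the PMPI_* symbols
 * of the underlying MPI library; it is linked in front of the application. */
#include <mpi.h>
#include <stdio.h>

static long long sent_bytes = 0;   /* hypothetical per-rank statistic */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(datatype, &size);            /* account before forwarding */
    sent_bytes += (long long)count * size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);   /* real send */
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d sent %lld bytes\n", rank, sent_bytes);
    return PMPI_Finalize();
}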
@misc{engelmann14supporting,
  author        = "Christian Engelmann",
  title         = "Supporting the Development of Resilient Message Passing
                   Applications using Simulation",
  month         = sep # "~28 - " # oct # "~1, ",
  year          = "2014",
  howpublished  = "Invited talk at the
                   \href{http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=14402}
                   {Dagstuhl Seminar on Resilience in Exascale Computing},
                   Schloss Dagstuhl, Wadern, Germany",
  url           = "http://www.christian-engelmann.info/publications/engelmann14supporting.ppt.pdf",
  abstract      = "An emerging aspect of high-performance computing (HPC)
                   hardware/software co-design is investigating performance
                   under failure. The presented work extends the Extreme-scale
                   Simulator (xSim), which was designed for evaluating the
                   performance of message passing interface (MPI) applications
                   on future HPC architectures, with fault-tolerant MPI
                   extensions proposed by the MPI Fault Tolerance Working Group.
                   xSim permits running MPI applications with millions of
                   concurrent MPI ranks, while observing application performance
                   in a simulated extreme-scale system using a lightweight
                   parallel discrete event simulation. The newly added features
                   offer user-level failure mitigation (ULFM) extensions at the
                   simulated MPI layer to support algorithm-based fault tolerance
                   (ABFT). The presented solution permits investigating
                   performance under failure and failure handling of ABFT
                   solutions. The newly enhanced xSim is the very first
                   performance tool that supports ULFM and ABFT."
}
@misc{engelmann13resilience,
  author        = "Christian Engelmann",
  title         = "Resilience Challenges and Solutions for Extreme-Scale
                   Supercomputing",
  month         = sep # "~3, ",
  year          = "2013",
  howpublished  = "{Invited talk at the Technical University of Dresden,
                    Dresden, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann13resilience.ppt.pdf",
  abstract      = "With the recent deployment of the 18 PFlop/s Titan
                   supercomputer and the exascale roadmap targeting 100, 300,
                   and eventually 1,000 PFlop/s by 2022, Oak Ridge National
                   Laboratory is at the forefront of scientific capability
                   computing. The path to exascale computing poses several
                   research challenges related to power, performance,
                   resilience, productivity, programmability, data movement,
                   and data management. Resilience, i.e., providing efficiency
                   and correctness in the presence of faults, is one of the
                   most important exascale computer science challenges as
                   systems scale up in component count (100,000-1,000,000
                   nodes with 1,000-10,000 cores per node by 2022) and
                   component reliability decreases (7 nm technology with
                   near-threshold voltage operation by 2022). This talk
                   provides an overview of recent and ongoing resilience
                   research and development activities at Oak Ridge National
                   Laboratory in advanced checkpoint storage architectures,
                   process-level incremental checkpoint/restart, proactive
                   fault tolerance using prediction-triggered process or
                   virtual machine migration, MPI process-level software
                   redundancy, and soft-error injection tools to study the
                   vulnerability of science applications and of CMOS logic
                   in processors and memory."
}
@misc{engelmann12fault,
  author        = "Christian Engelmann",
  title         = "Fault Tolerance Session",
  month         = oct # "~16-17, ",
  year          = "2012",
  howpublished  = "{Invited talk at the
                    \href{http://www.aanmelder.nl/exachallenge}
                    {The ExaChallenge Symposium}, Dublin, Ireland}",
  url           = "http://www.christian-engelmann.info/publications/engelmann12fault.ppt.pdf"
}
@misc{engelmann12high-end,
  author        = "Christian Engelmann",
  title         = "High-End Computing Resilience: Analysis of Issues Facing the
                   HEC Community and Path Forward for Research and Development",
  month         = aug # "~4-11, ",
  year          = "2012",
  howpublished  = "{Invited talk at the Argonne National Laboratory (ANL)
                    Institute of Computing in Science (ICiS)
                    \href{http://www.icis.anl.gov/programs/summer2012-4b}
                    {Summer Workshop Week on Addressing Failures in Exascale
                     Computing}, Park City, UT, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann12high-end.ppt.pdf",
  abstract      = "The path to exascale computing poses several research
                   challenges related to power, performance, resilience,
                   productivity, programmability, data movement, and data
                   management. Resilience, i.e., providing efficiency and
                   correctness in the presence of faults, is one of the most
                   important exascale computer science challenges as systems
                   scale up in component count (100,000-1,000,000 nodes with
                   1,000-10,000 cores per node by 2020) and component
                   reliability decreases (7 nm technology with near-threshold
                   voltage operation by 2020). To provide input for a
                   discussion of future needs in resilience research,
                   development, and standards work, this talk gives a brief
                   summary of the outcomes from the National HPC Workshop on
                   Resilience, held in Arlington, VA, USA on August 12-14,
                   2009."
}
@misc{engelmann12resilience,
  author        = "Christian Engelmann",
  title         = "Resilience for Permanent, Transient, and Undetected Errors",
  month         = mar # "~12-15, ",
  year          = "2012",
  howpublished  = "{Invited talk at the
                    \href{http://www.cs.sandia.gov/Conferences/SOS16}
                    {$16^{th}$ Workshop on Distributed Supercomputing (SOS)
                     2012}, Santa Barbara, CA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann12resilience.ppt.pdf",
  abstract      = "With the ongoing deployment of 10-20 PFlop/s supercomputers
                   and the exascale roadmap targeting 100, 300, and eventually
                   1,000 PFlop/s by 2020, the path to exascale computing poses
                   several research challenges related to power, performance,
                   resilience, productivity, programmability, data movement,
                   and data management. Resilience, i.e., providing efficiency
                   and correctness in the presence of faults, is one of the
                   most important exascale computer science challenges as
                   systems scale up in component count (100,000-1,000,000
                   nodes with 1,000-10,000 cores per node by 2020) and
                   component reliability decreases (7 nm technology with
                   near-threshold voltage operation by 2020). This talk
                   provides an overview of recent and ongoing resilience
                   research and development activities at Oak Ridge National
                   Laboratory, and of future needs in resilience research,
                   development, and standards work."
}
@misc{engelmann12scaling,
  author        = "Christian Engelmann",
  title         = "Scaling To A Million Cores And Beyond: A Basic Understanding
                   Of The Challenges Ahead On The Road To Exascale",
  month         = jan # "~24, ",
  year          = "2012",
  howpublished  = "{Invited talk at the
                   \href{https://researcher.ibm.com/researcher/view_page.php?id=2580}
                   {$1^{st}$ International Workshop on Extreme Scale Parallel
                   Architectures and Systems (ESPAS) 2012}, in conjunction with
                   the \href{http://www.hipeac.net/conference/paris}{$7^{th}$
                   International Conference on High-Performance and Embedded
                   Architectures and Compilers (HiPEAC) 2012}, Paris, France}",
  url           = "http://www.christian-engelmann.info/publications/engelmann12scaling.ppt.pdf",
  abstract      = "On the road toward multi-petascale and exascale HPC, the
                   trend in architecture goes clearly in only one direction.
                   HPC systems will dramatically scale up in compute node and
                   processor core counts. By 2020, an exascale system may have
                   up to 1,000,000 compute nodes with 1,000 cores per node. The
                   substantial growth in concurrency causes parallel application
                   scalability issues due to sequential application parts,
                   synchronizing communication, and other bottlenecks.
                   Investigating parallel algorithm performance properties at
                   this scale and with these architectural properties for HPC
                   hardware/software co-design is crucial to enable
                   extreme-scale computing. The presented work utilizes the
                   Extreme-scale Simulator (xSim) performance investigation
                   toolkit to identify the scaling characteristics of a simple
                   Monte Carlo algorithm from 1 to 16 million MPI processes on
                   different multi-core architecture choices. The results show
                   the limitations of strong scaling and the negative impact of
                   employing more but less powerful cores for energy savings."
}
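The strong-scaling limitation mentioned in the abstract above is the one captured by Amdahl's law; a short worked example, where the serial fraction f is an illustrative assumption and not a number from the talk:

\[
  S(P) \;=\; \frac{1}{\,f + \frac{1-f}{P}\,},
  \qquad
  \lim_{P \to \infty} S(P) \;=\; \frac{1}{f}.
\]
% Illustrative assumption: with f = 0.001 (0.1% serial work), even
% P = 16,000,000 MPI processes give S(P) ~ 999.9, i.e., strong scaling
% saturates near a factor of 1000 regardless of core count.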
@misc{engelmann11resilient,
  author        = "Christian Engelmann",
  title         = "Resilient Software for ExaScale Computing",
  month         = nov # "~17, ",
  year          = "2011",
  howpublished  = "{Invited talk at the Birds of a Feather Session on Resilient
                   Software for ExaScale Computing at the
                   \href{http://sc11.supercomputing.org}
                   {24th IEEE/ACM International Conference on High Performance
                    Computing, Networking, Storage and Analysis (SC) 2011},
                   Seattle, WA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann11resilient.ppt.pdf",
  abstract      = "ExaScale computing systems will likely consist of millions
                   of cores executing applications with billions of threads,
                   based on 14nm or less CMOS technology, according to the
                   ITRS roadmap. Processing elements built on this technology,
                   coupled with dynamic power management will exhibit high
                   variability in performance, between cores and across
                   different runs. Even worse, preliminary figures indicate
                   that on average about every couple of minutes - at least -
                   something in the system will break. Traditional
                   checkpointing strategies are unlikely to work, given the
                   time it will take to save the huge quantities of data
                   combined with the fact that they will need to be restored
                   frequently. This BoF wants to investigate resilient
                   software: software that is able to survive failing
                   hardware and continue to run, with minimal performance
                   impact. Furthermore, we may also discuss tradeoffs between
                   rerunning the application and the cost of instrumentation
                   to deal with resilience."
}
@misc{engelmann11resilience,
  author        = "Christian Engelmann",
  title         = "Resilience and Hardware/Software Co-design for Extreme-Scale
                   Supercomputing",
  month         = jul # "~27, ",
  year          = "2011",
  howpublished  = "{Seminar at the \href{http://www.bsc.es}{Barcelona
                   Supercomputing Center}, Barcelona, Spain}",
  url           = "http://www.christian-engelmann.info/publications/engelmann11resilience.ppt.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL) provides the most
                   powerful high-performance computing (HPC) resources in the
                   world for open scientific research. Jaguar, a 224,162-core
                   Cray XT5 with a LINPACK performance of 1.759 PFlop/s, for
                   example, is the world's 3rd fastest supercomputer. 80\% of
                   its resources are allocated through a reviewed process to
                   address the most challenging scientific problems in climate
                   modeling, renewable energy, materials science, fusion and
                   other areas. ORNL's Computer Science and Mathematics Division
                   performs computer science and mathematics research to
                   increase supercomputer efficiency and application scientist
                   productivity while accelerating time to solution for
                   scientific breakthroughs. This talk details recent research
                   advancements at ORNL in two areas: (1) resilience and (2)
                   hardware/software co-design for extreme-scale supercomputing.
                   Both are essential on the road toward exa-scale HPC systems
                   with millions-to-billions of cores. Due to the expected
                   drastic increase in scale, the corresponding decrease in
                   system mean-time to interrupt warrants a rethinking of the
                   traditional checkpoint/restart approach for HPC resilience.
                   New concepts discussed in this talk range from preventative
                   measures, such as task migration based on fault prediction,
                   to more aggressive fault masking, such as various levels of
                   redundancy. Further, the expected drastic increase in task
                   parallelism requires redesigning algorithms to avoid the
                   consequences of Amdahl's law at extreme scale. As million-way
                   task parallel systems don't exist yet, this talk discusses a
                   lightweight system simulation approach for performance
                   estimation of algorithms at scale."
}
@misc{engelmann10scalable,
  author        = "Christian Engelmann",
  title         = "Scalable HPC System Monitoring",
  month         = oct # "~13, ",
  year          = "2010",
  howpublished  = "{Invited talk at the $3^{rd}$ HPC Resiliency Summit: Workshop
                   on Resiliency for Petascale HPC 2010, in conjunction with the
                   \href{http://www.lanl.gov/conferences/lacss/2010}{$3^{rd}$
                   Los Alamos Computer Science Symposium (LACSS) 2010}, Santa
                   Fe, NM, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann10scalable.ppt.pdf",
  abstract      = "We present a monitoring system for large-scale parallel and
                   distributed computing environments that allows trading off
                   accuracy in a tunable fashion to gain scalability without
                   compromising fidelity. The approach relies on classifying
                   each gathered monitoring metric based on individual needs
                   and on aggregating messages containing classes of individual
                   monitoring metrics using a tree-based overlay network. The
                   MRNet-based prototype is able to significantly reduce the
                   amount of gathered and stored monitoring data, e.g., by a
                   factor of $\sim$56 in comparison to the Ganglia distributed
                   monitoring system. A simple scaling study reveals, however,
                   that further efforts are needed in reducing the amount of
                   data to monitor future-generation extreme-scale systems with
                   up to 1,000,000 nodes. The implemented solution did not have
                   a measurable performance impact as the 32-node test system
                   did not produce enough monitoring data to interfere with
                   running applications."
}
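A rough sketch of the classify-then-aggregate idea described above. MPI_Reduce stands in here for the tree-based MRNet overlay used by the prototype, and the metric and thresholds are invented for illustration:

/* monitor_sketch.c - classify a raw metric into a coarse class and
 * aggregate per-class counts instead of shipping every raw sample.
 * MPI_Reduce is a stand-in for the MRNet tree-based overlay network. */
#include <mpi.h>
#include <stdio.h>

enum { CLASS_OK, CLASS_WARN, CLASS_CRIT, NUM_CLASSES };

static int classify(double temp_c)        /* hypothetical thresholds */
{
    if (temp_c < 70.0) return CLASS_OK;
    if (temp_c < 85.0) return CLASS_WARN;
    return CLASS_CRIT;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double temp_c = 60.0 + (rank % 40);    /* stand-in for a sampled metric */
    int local[NUM_CLASSES] = {0}, global[NUM_CLASSES] = {0};
    local[classify(temp_c)] = 1;

    /* Aggregate class counts rather than raw samples. */
    MPI_Reduce(local, global, NUM_CLASSES, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ok=%d warn=%d crit=%d\n",
               global[CLASS_OK], global[CLASS_WARN], global[CLASS_CRIT]);

    MPI_Finalize();
    return 0;
}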
@misc{engelmann10beyond,
  author        = "Christian Engelmann",
  title         = "Beyond Application-Level Checkpoint/Restart - {Advanced}
                   Software Approaches for Fault Resilience",
  month         = sep # "~6, ",
  year          = "2010",
  howpublished  = "{Talk at the
                   \href{http://www.speedup.ch/workshops/w39_2010.html}
                   {$39^{th}$ SPEEDUP Workshop on High Performance Computing},
                   Zurich, Switzerland}",
  url           = "http://www.christian-engelmann.info/publications/engelmann10beyond.ppt.pdf"
}
@misc{engelmann10reliability,
  author        = "Christian Engelmann and
                   Stephen L. Scott",
  title         = "Reliability, Availability, and Serviceability ({RAS}) for
                   Petascale High-End Computing and Beyond",
  month         = jun # "~22, ",
  year          = "2010",
  howpublished  = "{Talk at the \href{http://www.usenix.org/events/fastos10}
                   {Forum to Address Scalable Technology for Runtime and
                   Operating Systems (FAST-OS) Workshop}, in conjunction with
                   the \href{http://www.usenix.org/events/confweek10}{USENIX
                   Federated Conferences Week (USENIX) 2010}, Boston MA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann10reliability.ppt.pdf",
  abstract      = "This project aims at scalable technologies for providing
                   high-level RAS for next-generation petascale scientific
                   high-performance computing (HPC) resources and beyond as
                   outlined by the U.S. Department of Energy (DOE) Forum to
                   Address Scalable Technology for Runtime and Operating
                   Systems (FAST-OS) and the U.S. National Coordination Office
                   for Networking and Information Technology Research and
                   Development (NCO/NITRD) High-End Computing Revitalization
                   Task Force (HECRTF) activities. Based on virtualized
                   adaptation, reconfiguration, and preemptive measures, the
                   ultimate goal is to provide for non-stop scientific computing
                   on a 24x7 basis without interruption. The technical approach
                   taken leverages system-level virtualization technology to
                   enable transparent proactive and reactive fault tolerance
                   mechanisms on extreme scale HPC systems. This effort targets:
                   (1) reliability analysis for identifying pre-fault
                   indicators, predicting failures, and modeling and monitoring
                   component and system reliability, (2) proactive fault
                   tolerance technology based on preemptive migration away from
                   components that are about to fail, (3) reactive fault
                   tolerance enhancements, such as checkpoint interval and
                   placement adaptation to actual and predicted system health
                   threats, and (4) holistic fault tolerance through combination
                   of adaptive proactive and reactive fault tolerance."
}
@misc{engelmann10resilience,
  author        = "Christian Engelmann",
  title         = "Resilience Challenges at the Exascale",
  month         = mar # "~8-11, ",
  year          = "2010",
  howpublished  = "{Talk at the
                   \href{http://www.csm.ornl.gov/workshops/SOS14}{$14^{th}$
                   Workshop on Distributed Supercomputing (SOS) 2010}, Savannah,
                   GA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann10resilience.ppt.pdf",
  abstract      = "The path to exascale computing poses several research
                   challenges related to power, performance, resilience,
                   productivity, programmability, data movement, and data
                   management. Resilience, i.e., providing efficiency and
                   correctness in the presence of faults, is one of the most
                   important exascale computer science challenges as systems
                   scale up in component count and component reliability
                   decreases. This talk discusses the future needs in
                   resilience research, development, and standards work
                   based on the outcomes from the National HPC Workshop on
                   Resilience, held in Arlington, VA, USA on August 12-14,
                   2009."
}
@misc{engelmann10hpc,
  author        = "Christian Engelmann
                   and Stephen L. Scott",
  title         = "{HPC} System Software Research at {Oak Ridge National
                   Laboratory}",
  month         = feb # "~22, ",
  year          = "2010",
  howpublished  = "{Seminar at the \href{http://www.lrz-muenchen.de}{Leibniz
                   Rechenzentrum (LRZ)}, Garching, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann10hpc.ppt.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL) is the largest energy
                   laboratory in the United States. Its National Center for
                   Computational Sciences (NCCS) provides the most powerful
                   computing resources in the world for open scientific
                   research. Jaguar, a Cray XT5 system at NCCS, is the fastest
                   supercomputer in the world. It recently ranked \#1 in the Top
                   500 List of Supercomputer Sites with a maximal LINPACK
                   benchmark performance of 1.759 PFlop/s and a theoretical peak
                   performance of 2.331 PFlop/s, where 1 PFlop/s is $10^{15}$
                   Floating Point Operations Per Second. Annually, 80 percent of
                   Jaguar's resources are allocated through the U.S. Department
                   of Energy's Innovative and Novel Computational Impact on
                   Theory and Experiment (INCITE) program, a competitively
                   selected, peer reviewed process open to researchers from
                   universities, industry, government and non-profit
                   organizations. These allocations address some of the most
                   challenging scientific problems in areas such as climate
                   modeling, renewable energy, materials science, fusion and
                   combustion. In conjunction with NCCS, the Computer Science
                   and Mathematics Division at ORNL performs basic and applied
                   research in HPC, mathematics, and intelligent systems. This
                   talk gives a summary of the HPC research and development in
                   system software performed at ORNL, including resilience at
                   extreme scale and virtualization technologies in HPC.
                   Specifically, this talk will focus on advanced resilience
                   technologies, such as migration of computation away from
                   components that are about to fail and on management and
                   customization of virtualized environments."
}
@misc{engelmann09high2,
  author        = "Christian Engelmann",
  title         = "High-Performance Computing Research Internship and Appointment
                   Opportunities at {Oak Ridge National Laboratory}",
  month         = dec # "~14, ",
  year          = "2009",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09high2.ppt.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL) is the largest energy
                   laboratory in the United States. Its National Center for
                   Computational Sciences (NCCS) provides the most powerful
                   computing resources in the world for open scientific
                   research. Jaguar, a Cray XT5 system at NCCS, is the fastest
                   supercomputer in the world. It recently ranked \#1 in the Top
                   500 List of Supercomputer Sites with a maximal LINPACK
                   benchmark performance of 1.759 PFlop/s and a theoretical peak
                   performance of 2.331 PFlop/s, where 1 PFlop/s is $10^{15}$
                   Floating Point Operations Per Second. Annually, 80 percent of
                   Jaguar's resources are allocated through the U.S. Department
                   of Energy's Innovative and Novel Computational Impact on 
                   Theory and Experiment (INCITE) program, a competitively
                   selected, peer reviewed process open to researchers from
                   universities, industry, government and non-profit
                   organizations. These allocations address some of the most
                   challenging scientific problems in areas such as climate
                   modeling, renewable energy, materials science, fusion and
                   combustion. In conjunction with NCCS, the Computer Science
                   and Mathematics Division at ORNL performs basic and applied
                   research in HPC, mathematics, and intelligent systems. This
                   talk gives a summary of the HPC research performed at ORNL.
                   It provides details about the Jaguar peta-scale computing
                   resource, an overview of the computational science research
                   carried out using ORNL's computing resources, and a
                   description of various computer science efforts targeting
                   solutions for next-generation HPC systems. This talk also
                   provides information about internship opportunities for MSc
                   students and research appointment opportunities for recent
                   graduates."
}
@misc{engelmann09jcas,
  author        = "Christian Engelmann",
  title         = "{JCAS} - {IAA} Simulation Efforts at {Oak Ridge National
                   Laboratory}",
  month         = sep # "~1-2, ",
  year          = "2009",
  howpublished  = "{Invited talk at the
                   \href{http://www.cs.sandia.gov/CSRI/Workshops/2009/IAA}
                   {IAA Workshop on HPC Architectural Simulation (HPCAS)},
                   Boulder, CO, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09jcas.ppt.pdf"
}
@misc{engelmann09modeling,
  author        = "Christian Engelmann",
  title         = "Modeling Techniques Towards Resilience",
  month         = aug # "~12-14, ",
  year          = "2009",
  howpublished  = "{Invited talk at the
                   \href{http://institute.lanl.gov/resilience/conferences/2009}
                   {National HPC Workshop on Resilience 2009}, Arlington, VA,
                   USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09modeling.ppt.pdf"
}
@misc{engelmann09system,
  author        = "Christian Engelmann",
  title         = "System Resilience Research at {ORNL} in the Context of
                   {HPC}",
  month         = may # "~15, ",
  year          = "2009",
  howpublished  = "{Invited talk at the \href{http://www.inria.fr/inria/organigramme/fiche_ur-ren.fr.html}
                   {Institut National de Recherche en Informatique et en
                   Automatique (INRIA)}, Rennes, France}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09system.pdf",
  abstract      = "The continuing growth in high performance computing (HPC)
                   system scale poses a challenge for system software and
                   scientific applications with respect to reliability,
                   availability and serviceability (RAS). With only very few
                   exceptions, the availability of recently installed systems
                   has been lower in comparison to the same deployment phase of
                   their predecessors. As a result, sites lower allowable job
                   run times in order to force applications to store
                   intermediate results (checkpoints) as insurance against lost
                   computation time. However, checkpoints themselves waste
                   valuable computation time and resources. In contrast to the
                   experienced loss of availability, the demand for continuous
                   availability has risen dramatically with the trend towards
                   capability computing, which drives the race for scientific
                   discovery by running applications on the fastest machines
                   available while desiring significant amounts of time (weeks
                   and months) without interruption. These machines must be able
                   to run in the event of frequent interrupts in such a manner
                   that the capability is not severely degraded. Thus, research
                   and development of scalable RAS technologies is paramount to
                   the success of future extreme-scale systems. This talk
                   summarizes our accomplishments in the area of high-level RAS
                   for HPC, such as developed concepts and implemented
                   proof-of-concept prototypes."
}
@misc{engelmann09high,
  author        = "Christian Engelmann",
  title         = "High-Performance Computing Research and {MSc} Internship
                   Opportunities at {Oak Ridge National Laboratory}",
  month         = may # "~11, ",
  year          = "2009",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09high.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL) is the largest energy
                   laboratory in the United States. Its National Center for
                   Computational Sciences (NCCS) provides the most powerful
                   computing resources in the world for open scientific
                   research. Jaguar, a Cray XT5 system at NCCS, is the second
                   HPC system to exceed 1 PFlop/s ($10^{15}$ Floating Point
                   Operations Per Second), and the fastest open science
                   supercomputer in the world. It recently ranked \#2 in the Top
                   500 List of Supercomputer Sites with a maximal LINPACK
                   benchmark performance of 1.059 PFlop/s and a theoretical peak
                   performance of 1.3814 PFlop/s. Annually, 80 percent of
                   Jaguar's resources are allocated through the U.S. Department
                   of Energy's Innovative and Novel Computational Impact on
                   Theory and Experiment (INCITE) program, a competitively
                   selected, peer reviewed process open to researchers from
                   universities, industry, government and non-profit
                   organizations. These allocations address some of the most
                   challenging scientific problems in areas such as climate
                   modeling, renewable energy, materials science, fusion and
                   combustion. In conjunction with NCCS, the Computer Science
                   and Mathematics Division at ORNL performs basic and applied
                   research in HPC, mathematics, and intelligent systems. This
                   talk gives a summary of the HPC research performed at ORNL.
                   It provides details about the Jaguar peta-scale computing
                   resource, an overview of the computational science research
                   carried out using ORNL's computing resources, and a
                   description of various computer science efforts targeting
                   solutions for next-generation HPC systems. This talk also
                   provides information about internship opportunities for MSc
                   students."
}
@misc{engelmann09modular,
  author        = "Christian Engelmann",
  title         = "Modular Redundancy for Soft-Error Resilience in Large-Scale
                   {HPC} Systems",
  month         = may # "~3-8, ",
  year          = "2009",
  howpublished  = "{Invited talk at the \href{http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=09191}
                   {Dagstuhl Seminar on Fault Tolerance in High-Performance
                   Computing and Grids}, Schloss Dagstuhl, Wadern, Germany}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09modular.pdf",
  abstract      = "Recent investigations into resilience of large-scale
                   high-performance computing (HPC) systems showed a continuous
                   trend of decreasing reliability and availability. Newly
                   installed systems have a lower mean-time to failure (MTTF)
                   and a higher mean-time to recover (MTTR) than their
                   predecessors. Modular redundancy is being used in many
                   mission critical systems today to provide for resilience,
                   such as for aerospace and command \& control systems. The
                   primary argument against modular redundancy for resilience
                   in HPC has always been that the capability of a HPC system,
                   and respective return on investment, would be significantly
                   reduced. We argue that modular redundancy can significantly
                   increase compute node availability as it removes the impact
                   of scale from single compute node MTTR. We further argue that
                   single compute nodes can be much less reliable, and therefore
                   less expensive, and still be highly available, if their
                   MTTR/MTTF ratio is maintained."
}
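The MTTR/MTTF argument above can be written as the standard steady-state availability of a single node and of a duplicated node pair; the numbers below are illustrative assumptions (independent failures, one surviving replica keeps the node available), not figures from the talk:

\[
  A \;=\; \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}},
  \qquad
  A_{\mathrm{dual}} \;=\; 1 - (1-A)^2 .
\]
% Illustrative assumption: MTTF = 50 h and MTTR = 1 h give A ~ 0.980;
% a dual-modular node pair then reaches A_dual ~ 0.99962, i.e., the
% unavailability (1-A) drops quadratically, so cheaper, less reliable
% nodes can still be highly available if MTTR/MTTF is kept small.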
@misc{engelmann09proactive2,
  author        = "Christian Engelmann",
  title         = "Proactive Fault Tolerance Using Preemptive Migration",
  month         = apr # "~22-24, ",
  year          = "2009",
  howpublished  = "{Invited talk at the
                   \href{http://acet.rdg.ac.uk/events/details/cancun.php}
                   {$3^{rd}$ Collaborative and Grid Computing Technologies
                   Workshop (CGCTW) 2009}, Cancun, Mexico}",
  url           = "http://www.christian-engelmann.info/publications/engelmann09proactive2.pdf",
  abstract      = "The continuing growth in high-performance computing (HPC)
                   system scale poses a challenge for system software and
                   scientific applications with respect to reliability,
                   availability and serviceability (RAS). In order to address
                   anticipated high failure rates, resiliency characteristics
                   have become an urgent priority for next-generation HPC
                   systems. The concept of proactive fault tolerance prevents
                   compute node failures from impacting running parallel
                   applications by preemptively migrating application parts
                   away from nodes that are about to fail. This talk presents
                   our past and ongoing efforts in proactive fault resilience
                   for HPC. Presented work includes proactive fault resilience
                   techniques, transparent process- and virtual-machine-level
                   migration, system and application reliability models and
                   analyses, failure prediction, and trade-off models for
                   combining preemptive migration with checkpoint/restart. All
                   these individual technologies are put into context with a
                   proposed holistic HPC fault resilience framework."
}
@misc{engelmann09resiliency,
  author        = "Christian Engelmann",
  title         = "Resiliency",
  month         = mar # "~9-12, ",
  year          = "2009",
  howpublished  = "{Panel at the
                   \href{http://www.cs.sandia.gov/Conferences/SOS13}{$13^{th}$
                   Workshop on Distributed Supercomputing (SOS) 2009}, Hilton
                   Head, SC, USA}"
}
@misc{engelmann08high,
  author        = "Christian Engelmann",
  title         = "High-Performance Computing Research at {Oak Ridge National
                   Laboratory}",
  month         = dec # "~8, ",
  year          = "2008",
  howpublished  = "{Invited talk at the Reading Annual Computational Science
                    Workshop, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann08high.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL) is the largest energy
                   laboratory in the United States. Its National Center for
                   Computational Sciences (NCCS) provides the most powerful
                   computing resources in the world for open scientific
                   research. Jaguar, a Cray XT5 system at NCCS, is the second
                   HPC system to exceed 1 PFlop/s ($10^{15}$ Floating Point
                   Operations Per Second), and the fastest open science
                   supercomputer in the world. It recently ranked \#2 in the Top
                   500 List of Supercomputer Sites with a maximal LINPACK
                   benchmark performance of 1.059 PFlop/s and a theoretical peak
                   performance of 1.3814 PFlop/s. Annually, 80 percent of
                   Jaguar's resources are allocated through the U.S. Department
                   of Energy's Innovative and Novel Computational Impact on
                   Theory and Experiment (INCITE) program, a competitively
                   selected, peer reviewed process open to researchers from
                   universities, industry, government and non-profit
                   organizations. These allocations address some of the most
                   challenging scientific problems in areas such as climate
                   modeling, renewable energy, materials science, fusion and
                   combustion. In conjunction with NCCS, the Computer Science
                   and Mathematics Division at ORNL performs basic and applied
                   research in HPC, mathematics, and intelligent systems. This
                   talk gives a summary of the HPC research performed at ORNL.
                   It provides details about the Jaguar peta-scale computing
                   resource, an overview of the computational science research
                   carried out using ORNL's computing resources, and a
                   description of various computer science efforts targeting
                   solutions for next-generation HPC systems."
}
@misc{engelmann08modular,
  author        = "Christian Engelmann",
  title         = "Modular Redundancy in {HPC} Systems: {W}hy, Where, When and How?",
  month         = oct # "~15, ",
  year          = "2008",
  howpublished  = "{Invited talk at the $1^{st}$ HPC Resiliency Summit: Workshop
                   on Resiliency for Petascale HPC 2008, in conjunction with the
                   \href{http://www.lanl.gov/conferences/lacss/2008}{$1^{st}$
                   Los Alamos Computer Science Symposium (LACSS) 2008}, Santa
                   Fe, NM, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann08modular.ppt.pdf",
  abstract      = "The continuing growth in high-performance computing (HPC)
                   system scale poses a challenge for system software and
                   scientific applications with respect to reliability,
                   availability and serviceability (RAS). With only very few
                   exceptions, the availability of recently installed systems
                   has been lower in comparison to the same deployment phase of
                   their predecessors. As a result, sites lower allowable job
                   run times in order to force applications to store
                   intermediate results (checkpoints) as insurance against lost
                   computation time. However, checkpoints themselves waste
                   valuable computation time and resources. In contrast to the
                   experienced loss of availability, the demand for continuous
                   availability has risen dramatically with the trend towards
                   capability computing, which drives the race for scientific
                   discovery by running applications on the fastest machines
                   available while desiring significant amounts of time (weeks
                   and months) without interruption. These machines must be able
                   to run in the event of frequent interrupts in such a manner
                   that the capability is not severely degraded. Thus, research
                   and development of scalable RAS technologies is paramount to
                   the success of future extreme-scale systems. This talk
                   summarizes our past accomplishments, ongoing work, and future
                   plans in the area of high-level RAS for HPC."
}
@misc{engelmann08resiliency,
  author        = "Christian Engelmann",
  title         = "Resiliency for High-Performance Computing",
  month         = apr # "~10-12, ",
  year          = "2008",
  howpublished  = "{Invited talk at the
                   \href{http://acet.rdg.ac.uk/events/details/cancun.php}
                   {$2^{nd}$ Collaborative and Grid Computing Technologies
                   Workshop (CGCTW) 2008}, Cancun, Mexico}",
  url           = "http://www.christian-engelmann.info/publications/engelmann08resiliency.ppt.pdf",
  abstract      = "In order to address anticipated high failure rates,
                   resiliency characteristics have become an urgent priority for
                   next-generation high-performance computing (HPC) systems. One
                   major source of concern is non-recoverable soft errors,
                   i.e., bit flips in memory, cache, registers, and logic. The
                   probability of such errors not only grows with system size,
                   but also with increasing architectural vulnerability caused
                   by employing accelerators and by shrinking nanometer
                   technology. Reactive fault tolerance technologies, such as
                   checkpoint/restart, are unable to handle high failure rates
                   due to associated overheads, while proactive resiliency
                   technologies, such as preemptive migration, simply fail as
                   random soft errors can't be predicted. This talk proposes a
                   new, bold direction in resiliency for HPC as it targets
                   resiliency for next-generation extreme-scale HPC systems at
                   the system software level through computational redundancy
                   strategies, i.e., dual- and triple-modular redundancy."
}
@misc{engelmann08advanced,
  author        = "Christian Engelmann",
  title         = "Advanced Fault Tolerance Solutions for High Performance
                   Computing",
  month         = feb # "~11, ",
  year          = "2008",
  howpublished  = "{Seminar at the \href{http://www.laas.fr}{Laboratoire
                   d'Analyse et d'Architecture des Syst\`emes},
                   \href{http://www.cnrs.fr}{Centre National de la Recherche
                   Scientifique}, Toulouse, France}",
  url           = "http://www.christian-engelmann.info/publications/engelmann08advanced.ppt.pdf",
  abstract      = "The continuing growth in high performance computing (HPC)
                   system scale poses a challenge for system software and
                   scientific applications with respect to reliability,
                   availability and serviceability (RAS). With only very few
                   exceptions, the availability of recently installed systems
                   has been lower in comparison to the same deployment phase of
                   their predecessors. As a result, sites lower allowable job
                   run times in order to force applications to store
                   intermediate results (checkpoints) as insurance against lost
                   computation time. However, checkpoints themselves waste
                   valuable computation time and resources. In contrast to the
                   experienced loss of availability, the demand for continuous
                   availability has risen dramatically with the trend towards
                   capability computing, which drives the race for scientific
                   discovery by running applications on the fastest machines
                   available while desiring significant amounts of time (weeks
                   and months) without interruption. These machines must be able
                   to run in the event of frequent interrupts in such a manner
                   that the capability is not severely degraded. Thus, research
                   and development of scalable RAS technologies is paramount to
                   the success of future extreme-scale systems. This talk
                   summarizes our accomplishments in the area of high-level RAS
                   for HPC, such as developed concepts and implemented
                   proof-of-concept prototypes, and describes existing
                   limitations, such as performance issues, which need to be
                   dealt with for production-type deployment."
}
@misc{engelmann07service,
  author        = "Christian Engelmann",
  title         = "Service-Level High Availability in Parallel and Distributed
                   Systems",
  month         = oct # "~10, ",
  year          = "2007",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann07service.pdf",
  abstract      = "As service-oriented architectures become more important in
                   parallel and distributed computing systems, individual
                   service instance reliability as well as appropriate service
                   redundancy are essential to increase overall system
                   availability. This talk focuses on redundancy strategies
                   using service-level replication techniques. An overview of
                   existing programming models for service-level high
                   availability is presented and their differences,
                   similarities, advantages, and disadvantages are discussed.
                   Recent advances in providing service-level symmetric
                   active/active high availability are discussed. While the
                   primary target of the presented research is high availability
                   for service nodes in tightly-coupled extreme-scale
                   high-performance computing (HPC) systems, it is also
                   applicable to loosely-coupled distributed computing
                   scenarios."
}
@misc{engelmann07advanced2,
  author        = "Christian Engelmann",
  title         = "Advanced Fault Tolerance Solutions for High Performance
                   Computing",
  month         = jun # "~8, ",
  year          = "2007",
  howpublished  = "{Invited talk at the
                   \href{http://www.thaigrid.or.th/wttc2007}{Workshop on Trends,
                   Technologies and Collaborative Opportunities in High
                   Performance and Grid Computing (WTTC) 2007}, Khon Kaen,
                   Thailand}",
  url           = "http://www.christian-engelmann.info/publications/engelmann07advanced2.ppt.pdf",
  abstract      = "The continuing growth in high performance computing (HPC)
                   system scale poses a challenge for system software and
                   scientific applications with respect to reliability,
                   availability and serviceability (RAS). With only very few
                   exceptions, the availability of recently installed systems
                   has been lower in comparison to the same deployment phase of
                   their predecessors. As a result, sites lower allowable job
                   run times in order to force applications to store
                   intermediate results (checkpoints) as insurance against lost
                   computation time. However, checkpoints themselves waste
                   valuable computation time and resources. In contrast to the
                   experienced loss of availability, the demand for continuous
                   availability has risen dramatically with the trend towards
                   capability computing, which drives the race for scientific
                   discovery by running applications on the fastest machines
                   available while desiring significant amounts of time (weeks
                   and months) without interruption. These machines must be able
                   to run in the event of frequent interrupts in such a manner
                   that the capability is not severely degraded. Thus, research
                   and development of scalable RAS technologies is paramount to
                   the success of future extreme-scale systems. This talk
                   summarizes our accomplishments in the area of high-level RAS
                   for HPC, such as developed concepts and implemented
                   proof-of-concept prototypes, and describes existing
                   limitations, such as performance issues, which need to be
                   dealt with for production-type deployment."
}
@misc{engelmann07advanced,
  author        = "Christian Engelmann",
  title         = "Advanced Fault Tolerance Solutions for High Performance
                   Computing",
  month         = jun # "~4-5, ",
  year          = "2007",
  howpublished  = "{Invited talk at the
                   \href{http://www.thaigrid.or.th/wttc2007}{Workshop on Trends,
                   Technologies and Collaborative Opportunities in High
                   Performance and Grid Computing (WTTC) 2007}, Bangkok,
                   Thailand}",
  url           = "http://www.christian-engelmann.info/publications/engelmann07advanced.ppt.pdf",
  abstract      = "The continuing growth in high performance computing (HPC)
                   system scale poses a challenge for system software and
                   scientific applications with respect to reliability,
                   availability and serviceability (RAS). With only very few
                   exceptions, the availability of recently installed systems
                   has been lower in comparison to the same deployment phase of
                   their predecessors. As a result, sites lower allowable job
                   run times in order to force applications to store
                   intermediate results (checkpoints) as insurance against lost
                   computation time. However, checkpoints themselves waste
                   valuable computation time and resources. In contrast to the
                   experienced loss of availability, the demand for continuous
                   availability has risen dramatically with the trend towards
                   capability computing, which drives the race for scientific
                   discovery by running applications on the fastest machines
                   available while desiring significant amounts of time (weeks
                   and months) without interruption. These machines must be
                   able to run in the event of frequent interrupts in such a
                   manner that the capability is not severely degraded. Thus,
                   research and development of scalable RAS technologies is
                   paramount to the success of future extreme-scale systems.
                   This talk summarizes our accomplishments in the area of
                   high-level RAS for HPC, such as developed concepts and
                   implemented proof-of-concept prototypes, and describes
                   existing limitations, such as performance issues, which
                   need to be dealt with for production-type deployment."
}
@misc{engelmann07operating,
  author        = "Christian Engelmann",
  title         = "Operating System Research at {ORNL}: {S}ystem-level
                   Virtualization",
  month         = apr # "~10, ",
  year          = "2007",
  howpublished  = "{Seminar at the \href{http://www.gup.uni-linz.ac.at}
                   {Institute of Graphics and Parallel Processing},
                   \href{http://www.uni-linz.ac.at}{Johannes Kepler University},
                   Linz, Austria}",
  url           = "http://www.christian-engelmann.info/publications/engelmann07operating.ppt.pdf",
  abstract      = "The emergence of virtualization enabled hardware, such as the
                   latest generation AMD and Intel processors, has raised
                   significant interest in the High Performance Computing (HPC)
                   community. In particular, system-level virtualization
                   provides an opportunity to advance the design and development
                   of operating systems, programming environments,
                   administration practices, and resource management tools. This
                   leads to some potential research topics for HPC, such as
                   failure tolerance, system management, and solutions for
                   application porting to new HPC platforms. This talk will
                   present an overview of the research in System-level
                   Virtualization being conducted by the Systems Research Team in
                   the Computer Science Research Group at Oak Ridge National
                   Laboratory."
}
@misc{engelmann07towards,
  author        = "Christian Engelmann",
  title         = "Towards High Availability for High-Performance Computing
                   System Services: {A}ccomplishments and Limitations",
  month         = mar # "~14, ",
  year          = "2007",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann07towards.pdf",
  abstract      = "During the last several years, our teams at Oak Ridge
                   National Laboratory, Louisiana Tech University, and Tennessee
                   Technological University focused on efficient redundancy
                   strategies for head and service nodes of high-performance
                   computing (HPC) systems in order to pave the way for high
                   availability (HA) in HPC. These nodes typically run critical
                   HPC system services, like job and resource management, and
                   represent single points of failure and control for an entire
                   HPC system. The overarching goal of our research is to
                   provide high-level reliability, availability, and
                   serviceability (RAS) for HPC systems by combining HA and HPC
                   technology. This talk summarizes our accomplishments, such as
                   developed concepts and implemented proof-of-concept
                   prototypes, and describes existing limitations, such as
                   performance issues, which need to be dealt with for
                   production-type deployment."
}
@misc{engelmann06high,
  author        = "Christian Engelmann",
  title         = "High Availability for Ultra-Scale High-End Scientific
                   Computing",
  month         = jun # "~9, ",
  year          = "2006",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann06high.ppt.pdf",
  abstract      = "A major concern in exploiting ultra-scale architectures for
                   scientific high-end computing (HEC) with tens to hundreds of
                   thousands of processors, such as the IBM Blue Gene/L and the
                   Cray X1, is the potential inability to identify problems and
                   take preemptive action before a failure impacts a running
                   job. In fact, in systems of this scale, predictions estimate
                   the mean time to interrupt in terms of hours. Current
                   solutions for fault-tolerance in HEC focus on dealing with
                   the result of a failure. However, most are unable to handle
                   runtime system configuration changes caused by failures and
                   require a complete restart of essential system services
                   (e.g. MPI) or even of the entire machine. High availability
                   (HA) computing strives to avoid the problems of unexpected
                   failures through preemptive measures. There are various
                   techniques to implement high availability. In contrast to
                   active/hot-standby high availability with its fail-over
                   model, active/active high availability with its virtual
                   synchrony model is superior in many areas including
                   scalability, throughput, availability and responsiveness.
                   However, it is significantly more complex. The overall goal
                   of our research is to expand today's effort in HA for HEC,
                   so that systems that have the ability to hot-swap hardware
                   components can be kept alive by an OS runtime environment
                   that understands the concept of dynamic system configuration.
                   This talk will present an overview of recent research at Oak
                   Ridge National Laboratory in high availability solutions for
                   ultra-scale scientific high-end computing."
}
@misc{scott06advancing,
  author        = "Stephen L. Scott
                   and Christian Engelmann",
  title         = "Advancing Reliability, Availability and Serviceability for
                   High-Performance Computing",
  month         = apr # "~19, ",
  year          = "2006",
  howpublished  = "{Seminar at the \href{http://www.gup.uni-linz.ac.at}
                   {Institute of Graphics and Parallel Processing},
                   \href{http://www.uni-linz.ac.at}{Johannes Kepler University},
                   Linz, Austria}",
  url           = "http://www.christian-engelmann.info/publications/scott06advancing.ppt.pdf",
  abstract      = "Today’s high performance computing systems have several
                   reliability deficiencies resulting in noticeable availability
                   and serviceability issues. For example, head and service
                   nodes represent a single point of failure and control for an
                   entire system as they render it inaccessible and unmanageable
                   in case of a failure until repair, causing a significant
                   downtime. Furthermore, current solutions for fault-tolerance
                   focus on dealing with the result of a failure. However, most
                   are unable to transparently mask runtime system configuration
                   changes caused by failures and require a complete restart of
                   essential system services, such as MPI, in case of a failure.
                   High availability computing strives to avoid the problems of
                   unexpected failures through preemptive measures. The overall
                   goal of our research is to expand today’s effort in high
                   availability for high-performance computing, so that systems
                   can be kept alive by an OS runtime environment that
                   understands the concepts of dynamic system configuration and
                   degraded operation mode. This talk will present an overview
                   of recent research performed at Oak Ridge National Laboratory
                   in collaboration with Louisiana Tech University, North
                   Carolina State University and the University of Reading in
                   developing core technologies and proof-of-concept prototypes
                   that improve the overall reliability, availability and
                   serviceability of high-performance computing systems."
}
@misc{engelmann05high4,
  author        = "Christian Engelmann",
  title         = "High Availability for Ultra-Scale High-End Scientific
                   Computing",
  month         = oct # "~18, ",
  year          = "2005",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann05high4.ppt.pdf",
  abstract      = "A major concern in exploiting ultra-scale architectures for
                   scientific high-end computing (HEC) with tens to hundreds of
                   thousands of processors, such as the IBM Blue Gene/L and the
                   Cray X1, is the potential inability to identify problems and
                   take preemptive action before a failure impacts a running
                   job. In fact, in systems of this scale, predictions estimate
                   the mean time to interrupt in terms of hours. Current
                   solutions for fault-tolerance in HEC focus on dealing with
                   the result of a failure. However, most are unable to handle
                   runtime system configuration changes caused by failures and
                   require a complete restart of essential system services (e.g.
                   MPI) or even of the entire machine. High availability (HA)
                   computing strives to avoid the problems of unexpected
                   failures through preemptive measures. There are various
                   techniques to implement high availability. In contrast to
                   active/hot-standby high availability with its fail-over
                   model, active/active high availability with its virtual
                   synchrony model is superior in many areas including
                   scalability, throughput, availability and responsiveness.
                   However, it is significantly more complex. The overall goal
                   of our research is to expand today's effort in HA for HEC, so
                   that systems that have the ability to hot-swap hardware
                   components can be kept alive by an OS runtime environment
                   that understands the concept of dynamic system configuration.
                   This talk will present an overview of recent research at Oak
                   Ridge National Laboratory in high availability solutions for
                   ultra-scale scientific high-end computing."
}
@misc{engelmann05high3,
  author        = "Christian Engelmann",
  title         = "High Availability for Ultra-Scale High-End Scientific
                   Computing",
  month         = sep # "~26, ",
  year          = "2005",
  howpublished  = "{Seminar at the \href{http://www.uncfsu.edu/macsc}{Department
                   of Mathematics and Computer Science},
                   \href{http://www.uncfsu.edu}{Fayetteville State University},
                   Fayetteville, NC, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann05high3.ppt.pdf",
  abstract      = "A major concern in exploiting ultra-scale architectures for
                   scientific high-end computing (HEC) with tens to hundreds of
                   thousands of processors, such as the IBM Blue Gene/L and the
                   Cray X1, is the potential inability to identify problems and
                   take preemptive action before a failure impacts a running
                   job. In fact, in systems of this scale, predictions estimate
                   the mean time to interrupt in terms of hours. Current
                   solutions for fault-tolerance in HEC focus on dealing with
                   the result of a failure. However, most are unable to handle
                   runtime system configuration changes caused by failures and
                   require a complete restart of essential system services (e.g.
                   MPI) or even of the entire machine. High availability (HA)
                   computing strives to avoid the problems of unexpected
                   failures through preemptive measures. There are various
                   techniques to implement high availability. In contrast to
                   active/hot-standby high availability with its fail-over
                   model, active/active high availability with its virtual
                   synchrony model is superior in many areas including
                   scalability, throughput, availability and responsiveness.
                   However, it is significantly more complex. The overall goal
                   of our research is to expand today’s effort in HA for HEC, so
                   that systems that have the ability to hot-swap hardware
                   components can be kept alive by an OS runtime environment
                   that understands the concept of dynamic system configuration.
                   This talk will present an overview of recent research at Oak
                   Ridge National Laboratory in fault tolerance and high
                   availability solutions for ultra-scale scientific high-end
                   computing."
}
@misc{engelmann05high2,
  author        = "Christian Engelmann",
  title         = "High Availability for Ultra-Scale High-End Scientific
                   Computing",
  month         = may # "~13, ",
  year          = "2005",
  howpublished  = "{Seminar at the \href{http://www.cs.reading.ac.uk}{Department
                   of Computer Science}, \href{http://www.reading.ac.uk}
                   {University of Reading}, Reading, United Kingdom}",
  url           = "http://www.christian-engelmann.info/publications/engelmann05high2.ppt.pdf",
  abstract      = "A major concern in exploiting ultra-scale architectures for
                   scientific high-end computing (HEC) with tens to hundreds of
                   thousands of processors, such as the IBM Blue Gene/L and the
                   Cray X1, is the potential inability to identify problems and
                   take preemptive action before a failure impacts a running
                   job. In fact, in systems of this scale, predictions estimate
                   the mean time to interrupt in terms of hours. Current
                   solutions for fault-tolerance in HEC focus on dealing with
                   the result of a failure. However, most are unable to handle
                   runtime system configuration changes caused by failures and
                   require a complete restart of essential system services (e.g.
                   MPI) or even of the entire machine. High availability (HA)
                   computing strives to avoid the problems of unexpected
                   failures through preemptive measures. There are various
                   techniques to implement high availability. In contrast to
                   active/hot-standby high availability with its fail-over
                   model, active/active high availability with its virtual
                   synchrony model is superior in many areas including
                   scalability, throughput, availability and responsiveness.
                   However, it is significantly more complex. The overall goal
                   of our research is to expand today’s effort in HA for HEC,
                   so that systems that have the ability to hot-swap hardware
                   components can be kept alive by an OS runtime environment
                   that understands the concept of dynamic system configuration.
                   This talk will present an overview of recent research at Oak
                   Ridge National Laboratory in fault-tolerant heterogeneous
                   metacomputing, advanced super-scalable algorithms and high
                   availability system software for ultra-scale scientific
                   high-end computing."
}
@misc{engelmann05high1,
  author        = "Christian Engelmann",
  title         = "High Availability for Ultra-Scale High-End Scientific
                   Computing",
  month         = apr # "~15, ",
  year          = "2005",
  howpublished  = "{Seminar at the \href{http://cenit.latech.edu}{Center for
                   Entrepreneurship and Information Technology},
                   \href{http://www.latech.edu}{Louisiana Tech University},
                   Ruston, LA, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann05high1.ppt.pdf",
  abstract      = "A major concern in exploiting ultra-scale architectures for
                   scientific high-end computing (HEC) with tens to hundreds of
                   thousands of processors is the potential inability to
                   identify problems and take preemptive action before a failure
                   impacts a running job. In fact, in systems of this scale,
                   predictions estimate the mean time to interrupt in terms of
                   hours. Current solutions for fault-tolerance in HEC focus on
                   dealing with the result of a failure. However, most are
                   unable to handle runtime system configuration changes caused
                   by failures and require a complete restart of essential
                   system services (e.g. MPI) or even of the entire machine.
                   High availability (HA) computing strives to avoid the
                   problems of unexpected failures through preemptive measures.
                   There are various techniques to implement high availability.
                   In contrast to active/hot-standby high availability with its
                   fail-over model, active/active high availability with its
                   virtual synchrony model is superior in many areas including
                   scalability, throughput, availability and responsiveness.
                   However, it is significantly more complex. The overall goal
                   of this research is to expand today’s effort in HA for HEC,
                   so that systems that have the ability to hot-swap hardware
                   components can be kept alive by an OS runtime environment
                   that understands the concept of dynamic system configuration.
                   With the aim of addressing the future challenges of high
                   availability in ultra-scale HEC, this project intends to
                   develop a proof-of-concept implementation of an active/active
                   high availability system software framework."
}
@misc{engelmann04diskless,
  author        = "Christian Engelmann",
  title         = "Diskless Checkpointing on Super-scale Architectures --
                   {A}pplied to the Fast Fourier Transform",
  month         = feb # "~25, ",
  year          = "2004",
  howpublished  = "{Invited talk at the \href{http://www.siam.org/meetings/pp04}
                   {$11^{th}$ SIAM Conference on Parallel Processing for
                   Scientific Computing (SIAM PP) 2004}, San Francisco, CA,
                   USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann04diskless.ppt.pdf",
  abstract      = "This talk discusses the issue of fault-tolerance in
                   distributed computer systems with tens or hundreds of
                   thousands of diskless processor units. Such systems, like the
                   IBM Blue Gene/L, are predicted to be deployed in the next
                   five to ten years. Since a 100,000-processor system is going
                   to be less reliable, scientific applications need to be able
                   to recover from occurring failures more efficiently. In this
                   paper, we adapt the present technique of diskless
                   checkpointing to such huge distributed systems in order to
                   equip existing scientific algorithms with super-scalable
                   fault-tolerance. First, we discuss the method of diskless
                   checkpointing, then we adapt this technique to super-scale
                   architectures and finally we present results from an
                   implementation of the Fast Fourier Transform that uses the
                   adapted technique to achieve super-scale fault-tolerance."
}
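
Editorial aside (not part of the cited abstract): diskless checkpointing, as
referenced above, is commonly realized by keeping an XOR parity of the peers'
in-memory checkpoints on an additional process, so that a single lost state
can be reconstructed without touching disk. A minimal, self-contained Python
sketch of that idea, using hypothetical names and a simulated process group in
place of real MPI ranks:

    # XOR-parity diskless checkpointing -- illustrative sketch only.
    # A dict of byte strings simulates the in-memory checkpoint of each rank.
    import functools

    def xor_blocks(blocks):
        # Bytewise XOR of equal-length byte strings.
        return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    states = {rank: bytes([rank]) * 8 for rank in range(4)}  # toy local checkpoints
    parity = xor_blocks(list(states.values()))               # held by a parity peer

    failed = 2                                               # simulate one failed rank
    survivors = [s for r, s in states.items() if r != failed]
    recovered = xor_blocks(survivors + [parity])             # survivors XOR parity
    assert recovered == states[failed]

This basic scheme tolerates one failure per parity group; multiple simultaneous
failures require several parity groups or stronger erasure codes.
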
@misc{engelmann04superscalable,
  author        = "Christian Engelmann",
  title         = "Super-scalable Algorithms -- {N}ext Generation Supercomputing
                   on 100,000 and more Processors",
  month         = jan # "~29, ",
  year          = "2004",
  howpublished  = "{Seminar at the \href{http://www.csm.ornl.gov}{Computer
                   Science and Mathematics Division}, \href{http://www.ornl.gov}
                   {Oak Ridge National Laboratory}, Oak Ridge, TN, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann04superscalable.ppt.pdf",
  abstract      = "This talk discusses recent research into the issues and
                   potential problems of algorithm scalability and
                   fault-tolerance on next-generation high-performance computer
                   systems with tens and even hundreds of thousands of
                   processors. Such massively parallel computers, like the IBM
                   Blue Gene/L, are going to be deployed in the next five to ten
                   years and existing deficiencies in scalability and
                   fault-tolerance need to be addressed soon. Scientific
                   algorithms have shown poor scalability on 10,000-processor
                   systems that exist today. Furthermore, future systems will be
                   less reliable due to the large number of components.
                   Super-scalable algorithms, which have the properties of scale
                   invariance and natural fault-tolerance, are able to get the
                   correct answer despite multiple task failures and without
                   checkpointing. We will show that such algorithms exist for a
                   wide variety of problems, such as finite difference, finite
                   element, multigrid and global maximum. Despite these
                   findings, traditional algorithms may still be preferred due
                   to their known behavior, or simply because a super-scalable
                   algorithm does not exist or is hard to find for a particular
                   problem. In this case, we propose a peer-to-peer diskless
                   checkpointing algorithm that can provide scale invariant
                   fault-tolerance."
}
@misc{engelmann03distributed,
  author        = "Christian Engelmann",
  title         = "Distributed Peer-to-Peer Control for {Harness}",
  month         = feb # "~11, ",
  year          = "2004",
  howpublished  = "{Seminar at the \href{http://www.csc.ncsu.edu}{Department of
                   Computer Science}, \href{http://www.ncsu.edu}{North Carolina
                   State University}, Raleigh, NC, USA}",
  url           = "http://www.christian-engelmann.info/publications/engelmann03distributed.ppt.pdf",
  abstract      = "Harness is an adaptable fault-tolerant virtual machine
                   environment for next-generation heterogeneous distributed
                   computing, developed as a follow-on to PVM. It additionally
                   enables the assembly of applications from plug-ins and
                   provides fault-tolerance. This work describes the distributed
                   control, which manages global state replication to ensure a
                   high-availability of service. Group communication services
                   achieve an agreement on an initial global state and a linear
                   history of global state changes at all members of the
                   distributed virtual machine. This global state is replicated
                   to all members to easily recover from single, multiple and
                   cascaded faults. A peer-to-peer ring network architecture and
                   tunable multi-point failure conditions provide heterogeneity
                   and scalability. Finally, the integration of the distributed
                   control into the multi-threaded kernel architecture of
                   Harness offers a fault-tolerant global state database service
                   for plug-ins and applications."
}
@mastersthesis{jones10simulation,
  author        = "Ian S. Jones",
  title         = "Simulation of Large Scale Architectures on High Performance
                   Computers",
  month         = oct # "~22, ",
  year          = "2010",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Advisors: Prof. Vassil N. Alexandrov (University of Reading);
                   Christian Engelmann (Oak Ridge National Laboratory);
                   George Bosilca (University of Tennessee, Knoxville)",
  url           = "http://www.christian-engelmann.info/publications/jones10simulation.pdf",
  url2          = "http://www.christian-engelmann.info/publications/jones10simulation.ppt.pdf",
  abstract      = "Powerful supercomputers often need to be simulated for the
                   purposes of testing the scalability of various applications.
                   This thesis endeavours to further develop the existing
                   simulator, XSIM, and implement the functionality to simulate
                   real-world networks and the latency which might be encountered
                   by messages travelling through that network. The upgraded
                   simulator will then be tested at the Oak Ridge National
                   Laboratory. The work completed herein should provide a solid
                   foundation for further improvements to XSIM; it simulates a
                   variety of basic network topologies, calculating the shortest
                   path for any given message and generating a transmission time."
}
@mastersthesis{boehm10development,
  author        = "Swen B{\"o}hm",
  title         = "Development of a {RAS} Framework for {HPC} Environments:
                   {Realtime} Data Reduction of Monitoring Data",
  month         = mar # "~12, ",
  year          = "2010",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Advisors: Prof. Vassil N. Alexandrov (University of Reading);
                   Christian Engelmann (Oak Ridge National Laboratory);
                   George Bosilca (University of Tennessee, Knoxville)",
  url           = "http://www.christian-engelmann.info/publications/boehm10development.pdf",
  url2          = "http://www.christian-engelmann.info/publications/boehm10development.ppt.pdf",
  abstract      = "The advancements of high-performance computing (HPC) systems
                   in the last decades have led to more and more complex
                   systems containing thousands or tens of thousands of
                   computing systems that work together. While the
                   computational performance of these systems has increased
                   dramatically in recent years, the I/O subsystems have not
                   seen a comparable improvement. With increasing numbers of
                   hardware components in next-generation HPC systems,
                   maintaining the reliability of such systems becomes more
                   and more difficult, since the probability of hardware
                   failures increases with the number of components. The
                   capacities of traditional reactive fault tolerance
                   technologies are exceeded by the development of
                   next-generation systems, and alternatives have to be found.
                   This paper discusses a monitoring system that uses data
                   reduction techniques to decrease the amount of collected
                   data. The system is part of a proactive fault tolerance
                   system that may address the reliability problems of
                   exascale HPC systems."
}
@mastersthesis{lauer10simulation,
  author        = "Frank Lauer",
  title         = "Simulation of Advanced Large-Scale {HPC} Architectures",
  month         = mar # "~12, ",
  year          = "2010",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Advisors: Prof. Vassil N. Alexandrov (University of Reading);
                   Christian Engelmann (Oak Ridge National Laboratory);
                   George Bosilca (University of Tennessee, Knoxville)",
  url           = "http://www.christian-engelmann.info/publications/lauer10simulation.pdf",
  url2          = "http://www.christian-engelmann.info/publications/lauer10simulation.ppt.pdf",
  abstract      = "The rapid development of massive parallel systems in the high-
                   performance computing (HPC) area requires efficient
                   scalability of applications. The next generation's design of
                   supercomputers is today not certain in terms of what will be
                   the computational, memory and I/O capabilities. However it is
                   most certain that they become even more parallel. Getting
                   the most performance from these machines in not only a matter
                   of hardware, it is also an issue of programming design.
                   Therefore, it has to be a co-development. However, how to test
                   algorithm's on machines which are not existing today. To
                   address the programming issues in terms of scalability and
                   fault tolerance for the next generation, this projects aim is
                   to design and develop a simulator based on parallel discrete
                   event simulation (PDES) for applications using MPI
                   communication. Some of the fastest supercomputers in the world
                   already interconnecting $10^5$ cores together to catch up the
                   simulator will be able to simulate at least $10^7$ virtual
                   processes."
}
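
Editorial aside (not part of the cited abstract): at its core, a discrete event
simulator of the kind described above advances a virtual clock by processing
time-stamped events in order; the parallel (PDES) variant additionally
synchronizes the clocks of many simulator processes. A minimal sequential
sketch in Python, with hypothetical names and a fixed link latency standing in
for a real network model:

    # Minimal sequential discrete-event core -- illustrative only, not XSIM.
    import heapq

    LATENCY = 0.5                    # assumed per-message link latency

    def simulate(events):
        # events: iterable of (timestamp, description) tuples
        queue = list(events)
        heapq.heapify(queue)         # pop events in timestamp order
        while queue:
            clock, msg = heapq.heappop(queue)
            print(f"t={clock:6.2f}  {msg}")

    simulate([(0.0, "rank 0 sends to rank 1"),
              (0.0 + LATENCY, "rank 1 receives from rank 0"),
              (1.0, "rank 1 sends to rank 0"),
              (1.0 + LATENCY, "rank 0 receives from rank 1")])
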
@mastersthesis{litvinova09ras,
  author        = "Antonina Litvinova",
  title         = "{RAS} Framework Engine Prototype",
  month         = sep # "~22, ",
  year          = "2009",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Advisors: Prof. Vassil N. Alexandrov (University of Reading);
                   Christian Engelmann (Oak Ridge National Laboratory);
                   George Bosilca (University of Tennessee, Knoxville)",
  url           = "http://www.christian-engelmann.info/publications/litvinova09ras.pdf",
  url2          = "http://www.christian-engelmann.info/publications/litvinova09ras.ppt.pdf",
  abstract      = "Extreme high performance computing (HPC) systems constantly
                   increase in scale from a few thousand processor cores to
                   thousands of thousands of processor cores and beyond.
                   However, their system mean-time to interrupt decreases
                   accordingly. The current approach to fault tolerance in HPC
                   is checkpoint/restart, i.e., a method based on recovery from
                   experienced failures. However, checkpoint/restart can no
                   longer deal with errors in the same efficient way because of
                   changes in HPC systems, for example, increasing error rates,
                   increasing aggregate memory, and input/output capabilities
                   that do not increase proportionally. A recently introduced
                   concept is proactive fault tolerance, which avoids failures
                   through preventative measures. Proactive fault tolerance
                   uses migration, an emerging technology that prevents
                   failures on HPC systems by migrating applications or
                   application parts away from a node that is deteriorating to
                   a spare node. This thesis discusses work conducted at ORNL
                   to develop a Proactive Fault Tolerance Framework Engine
                   Prototype for HPC systems with high reliability,
                   availability and serviceability. The prototype performs
                   environmental system monitoring, system event logging,
                   parallel job monitoring and system resource monitoring in
                   order to analyse HPC system reliability and to perform
                   fault avoidance through migration."
}
@mastersthesis{koenning07virtualized,
  author        = "Bj{\"o}rn K{\"o}nning",
  title         = "Virtualized Environments for the {Harness Workbench}",
  month         = mar # "~14, ",
  year          = "2007",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Advisors: Prof. Vassil N. Alexandrov (University of Reading);
                   Christian Engelmann (Oak Ridge National Laboratory)",
  url           = "http://www.christian-engelmann.info/publications/koenning07virtualized.pdf",
  url2          = "http://www.christian-engelmann.info/publications/koenning07virtualized.ppt.pdf",
  abstract      = "The expanded use of computational sciences today leads to a
                   significant need for high performance computing systems. High
                   performance computing is currently undergoing vigorous
                   revival, and multiple efforts are underway to develop much
                   faster computing systems in the near future. New software
                   tools are required for the efficient use of petascale
                   computing systems. With the new Harness Workbench Project
                   the Oak Ridge National Laboratory intends to develop an
                   appropriate development and runtime environment for high
                   performance computing platforms. This dissertation project
                   is part of the Harness Workbench Project, and deals with the
                   development of a concept for virtualised environments and
                   various approaches to create and describe them. The developed
                   virtualisation approach is based on the \verb|chroot|
                   mechanism and uses platform-independent environment
                   descriptions. File structures and environment variables are
                   emulated to provide the portability of computational software
                   over diverse high performance computing platforms. Security
                   measures and sandbox characteristics can be integrated."
}
@mastersthesis{weber07high,
  author        = "Matthias Weber",
  title         = "High Availability for the {Lustre} File System",
  month         = mar # "~14, ",
  year          = "2007",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Double diploma in conjunction with the
                   \href{http://www.f1.fhtw-berlin.de}{Department of
                   Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical
                   College for Engineering and Economics (FHTW) Berlin},
                   Germany. Advisors: Prof. Vassil N. Alexandrov (University of
                   Reading); Christian Engelmann (Oak Ridge National
                   Laboratory)",
  url           = "http://www.christian-engelmann.info/publications/weber07high.pdf",
  url2          = "http://www.christian-engelmann.info/publications/weber07high.ppt.pdf",
  abstract      = "With the growing importance of high performance computing
                   and, more importantly, the fast growing size of sophisticated
                   high performance computing systems, research in the area of
                   high availability is essential to meet the needs to sustain
                   the current growth. This Master thesis project aims to
                   improve the availability of Lustre. The major concern of
                   this project is the metadata server of the file system. The
                   metadata server of Lustre represents the last single point
                   of failure in the file system. To overcome this single point
                   of failure, an active/active high availability approach is
                   introduced. The new file system design with multiple MDS
                   nodes running in virtual synchrony leads to a significant
                   increase of availability. Two prototype implementations aim
                   to show how the proposed system design and its new realized
                   form of symmetric active/active high availability can be
                   accomplished in practice. The results of this work point out
                   the difficulties in adapting the file system to the
                   active/active high availability design. Tests identify
                   functionality that was not achieved and show performance
                   problems of the
                   proposed solution. The findings of this dissertation may be
                   used for further work on high availability for distributed
                   file systems."
}
@mastersthesis{baumann06design,
  author        = "Ronald Baumann",
  title         = "Design and Development of Prototype Components for the
                   {Harness} High-Performance Computing Workbench",
  month         = mar # "~6, ",
  year          = "2006",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Double diploma in conjunction with the
                   \href{http://www.f1.fhtw-berlin.de}{Department of
                   Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical
                   College for Engineering and Economics (FHTW) Berlin},
                   Germany. Advisors: Prof. Vassil N. Alexandrov (University of
                   Reading); George A. (Al) Geist and Christian Engelmann (Oak
                   Ridge National Laboratory)",
  url           = "http://www.christian-engelmann.info/publications/baumann06design.pdf",
  url2          = "http://www.christian-engelmann.info/publications/baumann06design.ppt.pdf",
  abstract      = "This master thesis examines plug-in technology, especially
                   the new field of parallel plug-ins. Plug-ins are popular
                   because they extend the capabilities of software packages
                   such as browsers and Photoshop, and allow an individual user
                   to add new functionality. Parallel plug-ins also provide the
                   above capabilities to a distributed set of resources, i.e.,
                   a plug-in now becomes a set of coordinating plug-ins. Second,
                   the set of plug-ins may be heterogeneous either in function or
                   because the underlying resources are heterogeneous. This new
                   dimension of complexity provides a rich research space which
                   is explored in this thesis. Experiences are collected and
                   presented as parallel plug-in paradigms and concepts. The
                   Harness framework was used in this project, in particular the
                   plug-in manager and available communication capabilities.
                   Plug-ins provide methods for users to extend Harness
                   according to their requirements. The result of this thesis is
                   a parallel plug-in paradigm and template for Harness. Users
                   of the Harness environment will be able to design and
                   implement their applications in the form of parallel plug-ins
                   easier and faster by using the paradigm resulting from this
                   project. Prototypes were implemented which handle different
                   aspects of parallel plug-ins. Parallel plug-in configurations
                   were tested on an appropriate number of Harness kernels,
                   including available communication and error-handling
                   capabilities. Furthermore, research was done in the area of
                   fault tolerance while parallel plug-ins are (un)loaded, as
                   well as while a task is performed."
}
@mastersthesis{uhlemann06high,
  author        = "Kai Uhlemann",
  title         = "High Availability for High-End Scientific Computing",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  month         = mar # "~6, ",
  year          = "2006",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Double diploma in conjunction with the
                   \href{http://www.f1.fhtw-berlin.de}{Department of
                   Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical
                   College for Engineering and Economics (FHTW) Berlin},
                   Germany. Advisors: Prof. Vassil N. Alexandrov (University of
                   Reading); George A. (Al) Geist and  Christian Engelmann (Oak
                   Ridge National Laboratory)",
  url           = "http://www.christian-engelmann.info/publications/uhlemann06high.pdf",
  url2          = "http://www.christian-engelmann.info/publications/uhlemann06high.ppt.pdf",
  abstract      = "With the growing interest and popularity in high performance
                   cluster computing and, more importantly, the fast growing
                   size of compute clusters, research in the area of high
                   availability is essential to meet the needs to sustain the
                   current growth. This Master thesis project introduces a new
                   approach for high availability focusing on the head node of a
                   cluster system. This project's focus is on providing high
                   availability to the job scheduler service, which is the most
                   vital part of the traditional Beowulf-style cluster
                   architecture. This research seeks to add high availability to
                   the job scheduler service and resource management system,
                   typically running on the head node, leading to a significant
                   increase of availability for cluster computing. Also, this
                   software project takes advantage of the virtual synchrony
                   paradigm to achieve active/active replication, the highest
                   form of high availability. A proof-of-concept implementation
                   shows how high availability can be designed in software and
                   what results can be expected of such a system. The results
                   may be reused for future or existing projects to further
                   improve and extend the high availability of compute
                   clusters."
}
@phdthesis{engelmann08symmetric3,
  author        = "Christian Engelmann",
  title         = "Symmetric Active/Active High Availability for
                   High-Performance Computing System Services",
  month         = dec # "~8, ",
  year          = "2008",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Advisor: Prof. Vassil N. Alexandrov (University of Reading)",
  url           = "http://www.christian-engelmann.info/publications/engelmann08symmetric3.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann08symmetric3.ppt.pdf",
  abstract      = "In order to address anticipated high failure rates,
                   reliability, availability and serviceability have become an
                   urgent priority for next-generation high-performance
                   computing (HPC) systems. This thesis aims to pave the way for
                   highly available HPC systems by focusing on their most
                   critical components and by reinforcing them with appropriate
                   high availability solutions. Service components, such as head
                   and service nodes, are the Achilles' heel of an HPC system.
                   A failure typically results in a complete system-wide outage.
                   This thesis targets efficient software state replication
                   mechanisms for service component redundancy to achieve high
                   availability as well as high performance. Its methodology
                   relies on defining a modern theoretical foundation for
                   providing service-level high availability, identifying
                   availability deficiencies of HPC systems, and comparing
                   various service-level high availability methods. This thesis
                   showcases several developed proof-of-concept prototypes
                   providing high availability for services running on HPC head
                   and service nodes using the symmetric active/active
                   replication method, i.e., state-machine replication, to
                   complement prior work in this area using active/standby and
                   asymmetric active/active configurations. Presented
                   contributions include a generic taxonomy for service high
                   availability, an insight into availability deficiencies of
                   HPC systems, and a unified definition of service-level high
                   availability methods. Further contributions encompass a fully
                   functional symmetric active/active high availability
                   prototype for an HPC job and resource management service
                   that does not require modification of the service, a fully
                   functional symmetric active/active high availability
                   prototype for an HPC
                   parallel file system metadata service that offers high
                   performance, and two preliminary prototypes for a transparent
                   symmetric active/active replication software framework for
                   client-service and dependent service scenarios that hide the
                   replication infrastructure from clients and services.
                   Assuming a mean time to failure of 5,000 hours for a head or
                   service node, all presented prototypes improve service
                   availability from 99.285\% to 99.995\% in a two-node system,
                   and to 99.99996\% with three nodes."
}
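
Note: the availability figures quoted in the abstract above follow from the
standard steady-state model A = MTTF / (MTTF + MTTR) for a single node and
A_n = 1 - (1 - A)^n for n independently failing active/active replicas. The
short calculation below reproduces the quoted numbers; the mean time to
repair of roughly 36 hours is not stated in the abstract and is inferred here
only so that the single-node availability comes out at 99.285%.

    # Reproducing the quoted availability numbers, assuming the usual
    # steady-state model A = MTTF / (MTTF + MTTR) per node and independent
    # failures across active/active replicas: A_n = 1 - (1 - A)**n.

    mttf = 5000.0                        # hours, as stated in the abstract
    single = 0.99285                     # quoted single-node availability
    mttr = mttf * (1 - single) / single  # implied repair time, about 36 hours

    def availability(n, a=single):
        """Availability of n active/active replicas, each with availability a."""
        return 1 - (1 - a) ** n

    print(f"implied MTTR : {mttr:.1f} h")
    print(f"1 node  : {availability(1):.5%}")   # 99.28500%
    print(f"2 nodes : {availability(2):.3%}")   # ~99.995%
    print(f"3 nodes : {availability(3):.5%}")   # ~99.99996%
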
@mastersthesis{engelmann01distributed,
  author        = "Christian Engelmann",
  title         = "Distributed Peer-to-Peer Control for {Harness}",
  month         = jul # "~7, ",
  year          = "2001",
  school        = "\href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Double diploma in conjunction with the
                   \href{http://www.f1.fhtw-berlin.de}{Department of
                   Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical
                   College for Engineering and Economics (FHTW) Berlin},
                   Germany. Advisors: Prof. Vassil N. Alexandrov (University of
                   Reading); George A. (Al) Geist (Oak Ridge National
                   Laboratory)",
  url           = "http://www.christian-engelmann.info/publications/engelmann01distributed.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann01distributed.ppt.pdf",
  abstract      = "Parallel processing, the method of cutting down a large
                   computational problem into many small tasks which are solved
                   in parallel, is a field of increasing importance in science.
                   Cost-effective, flexible and efficient simulations of
                   mathematical models of physical, chemical or biological
                   real-world problems are replacing the traditional
                   experimental research. Current software solutions for
                   parallel and scientific computation, like Parallel Virtual
                   Machine and Message Passing Interface, have limitations in
                   handling faults and failures, in utilizing heterogeneous and
                   dynamically changing communication structures, and in
                   enabling migrating or cooperative applications. The current
                   research in heterogeneous adaptable reconfigurable networked
                   systems (Harness) aims to produce the next generation of
                   software solutions for distributed computing. A highly
                   available and lightweight distributed virtual machine
                   service provides an encapsulation of a few hundred to a few
                   thousand physical machines in a virtual heterogeneous
                   large-scale cluster. High availability of a service in
                   distributed systems can be achieved by replication of the
                   service state on multiple server processes. If one or more
                   server processes fail, the
                   surviving ones continue to provide the service because they
                   know the state. Since every member of a distributed virtual
                   machine is part of the distributed virtual machine service
                   state and is able to change this state, a distributed control
                   is needed to replicate the state and maintain its
                   consistency. This distributed control manages state changes
                   as well as the state-replication and the detection of and
                   recovery from faults and failures of server processes. This
                   work analyzes system architectures currently used in
                   heterogeneous distributed computing by defining terms,
                   conditions and assumptions. It shows that such systems are
                   asynchronous and may use partially synchronous communication
                   to detect and to distinguish different classes of faults and
                   failures. It describes how high availability of a
                   large-scale distributed service on a large number of
                   servers residing in different geographical locations can be
                   realized. Asynchronous group communication services, such
                   as Reliable Broadcast, Atomic Broadcast, Distributed
                   Agreement and Membership, are analyzed to develop linearly
                   scalable algorithms in a unidirectional and in a
                   bidirectional connected asynchronous peer-to-peer ring
                   architecture.
                   A Transaction Control group communication service is
                   introduced as state-replication service. The system analysis
                   distinguishes different types of distributed systems, where
                   active transactions execute state changes using
                   non-replicated data of one or more servers and inactive
                   transactions report state changes using replicated data only.
                   It is applicable to passive fault-tolerant distributed
                   databases as well as to active fault-tolerant distributed
                   control mechanisms. No control token is used and time
                   stamps are avoided, so that all members of a server group
                   have equal responsibilities and are independent of the
                   system time. Due to the complexity of the distributed
                   system and the early development stage of the introduced
                   algorithms, a prototype implementing the most complicated
                   Transaction Control algorithm is realized. The prototype is
                   used to obtain practical experience with the
                   state-replication algorithm."
}
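
Note: one building block mentioned in the abstract above, a linearly scalable
broadcast over a unidirectional peer-to-peer ring, can be sketched in a few
lines: each member forwards a message to its successor, and the originator
knows every member has seen it once the message has travelled all the way
around. The simulated, single-process Python sketch below (names and
structure invented for illustration, not the thesis algorithms) conveys only
the ring-forwarding idea, not failure detection or the Transaction Control
service.

    # Simulated reliable broadcast on a unidirectional ring. The loop walks
    # the ring successor by successor, standing in for each member forwarding
    # the message to its neighbor; after n hops the message is back at the
    # originator, so every member has delivered it. Work grows linearly with n.

    def ring_broadcast(members, origin, payload):
        """Deliver payload to every member by walking the ring from origin."""
        delivered = {}
        n = len(members)
        start = members.index(origin)
        for hop in range(n):
            node = members[(start + hop) % n]
            delivered[node] = payload   # local delivery at this member
        return delivered

    members = ["node0", "node1", "node2", "node3"]
    result = ring_broadcast(members, "node1", {"type": "state-change", "seq": 7})
    assert set(result) == set(members)   # every member received the broadcast
    print(result)
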
@mastersthesis{engelmann01distributed2,
  author        = "Christian Engelmann",
  title         = "Distributed Peer-to-Peer Control for {Harness}",
  month         = feb # "~23, ",
  year          = "2001",
  school        = "\href{http://www.f1.fhtw-berlin.de}{Department of
                   Engineering~I}, \href{http://www.f1.fhtw-berlin.de}{Technical
                   College for Engineering and Economics (FHTW) Berlin},
                   Germany",
  note          = "Thesis research performed at Oak Ridge National Laboratory.
                   Double diploma in conjunction with the
                   \href{http://www.cs.reading.ac.uk}{Department of Computer
                   Science}, \href{http://www.reading.ac.uk}{University of
                   Reading}, UK. Advisors: Prof. Uwe Metzler (Technical College
                   for Engineering and Economics (FHTW) Berlin); George A. (Al)
                   Geist (Oak Ridge National Laboratory)",
  url           = "http://www.christian-engelmann.info/publications/engelmann01distributed2.pdf",
  url2          = "http://www.christian-engelmann.info/publications/engelmann01distributed2.ppt.pdf",
  abstract      = "Parallel processing, the method of cutting down a large
                   computational problem into many small tasks which are solved
                   in parallel, is a field of increasing importance in science.
                   Cost-effective, flexible and efficient simulations of
                   mathematical models of physical, chemical or biological
                   real-world problems are replacing the traditional
                   experimental research. Current software solutions for
                   parallel and scientific computation, like Parallel Virtual
                   Machine and Message Passing Interface, have limitations in
                   handling faults and failures, in utilizing heterogeneous and
                   dynamically changing communication structures, and in
                   enabling migrating or cooperative applications. The current
                   research in heterogeneous adaptable reconfigurable networked
                   systems (Harness) aims to produce the next generation of
                   software solutions for distributed computing. A highly
                   available and lightweight distributed virtual machine
                   service provides an encapsulation of a few hundred to a few
                   thousand physical machines in a virtual heterogeneous
                   large-scale cluster. High availability of a service in
                   distributed systems can be achieved by replication of the
                   service state on multiple server processes. If one or more
                   server processes fail, the
                   surviving ones continue to provide the service because they
                   know the state. Since every member of a distributed virtual
                   machine is part of the distributed virtual machine service
                   state and is able to change this state, a distributed control
                   is needed to replicate the state and maintain its
                   consistency. This distributed control manages state changes
                   as well as the state-replication and the detection of and
                   recovery from faults and failures of server processes. This
                   work analyzes system architectures currently used in
                   heterogeneous distributed computing by defining terms,
                   conditions and assumptions. It shows that such systems are
                   asynchronous and may use partially synchronous communication
                   to detect and to distinguish different classes of faults and
                   failures. It describes how high availability of a
                   large-scale distributed service on a large number of
                   servers residing in different geographical locations can be
                   realized. Asynchronous group communication services, such
                   as Reliable Broadcast, Atomic Broadcast, Distributed
                   Agreement and Membership, are analyzed to develop linearly
                   scalable algorithms in a unidirectional and in a
                   bidirectional connected asynchronous peer-to-peer ring
                   architecture.
                   A Transaction Control group communication service is
                   introduced as state-replication service. The system analysis
                   distinguishes different types of distributed systems, where
                   active transactions execute state changes using
                   non-replicated data of one or more servers and inactive
                   transactions report state changes using replicated data only.
                   It is applicable to passive fault-tolerant distributed
                   databases as well as to active fault-tolerant distributed
                   control mechanisms. No control token is used and time
                   stamps are avoided, so that all members of a server group
                   have equal responsibilities and are independent of the
                   system time. Due to the complexity of the distributed
                   system and the early development stage of the introduced
                   algorithms, a prototype implementing the most complicated
                   Transaction Control algorithm is realized. The prototype is
                   used to obtain practical experience with the
                   state-replication algorithm."
}
@techreport{kuchar22system,
  author        = "Olga A. Kuchar
                   and Swen Boehm
                   and Thomas Naughton
                   and Suhas Somnath
                   and Ben Mintz
                   and Jack Lange
                   and Scott Atchley
                   and Rohit Srivastava
                   and Patrick Widener",
  title         = "INTERSECT Architecture Specification:
                   System-of-systems Architecture (Version 0.5)",
  institution   = "Oak Ridge National Laboratory",
  number        = "ORNL/TM-2022/2717",
  address       = "Oak Ridge, TN, USA",
  month         = sep,
  year          = "2022",
  doi           = "10.2172/1968700",
  url           = "http://www.christian-engelmann.info/publications/kuchar22system.pdf",
  abstract      = "Oak Ridge National Laboratory (ORNL)'s Self-driven
                   Experiments for Science / Interconnected Science Ecosystem
                   (INTERSECT) architecture project, titled ``An Open Federated
                   Architecture for the Laboratory of the Future'', creates an
                   open federated hardware/software architecture for the
                   laboratory of the future using a novel system of systems
                   (SoS) and microservice architecture approach, connecting
                   scientific instruments, robot-controlled laboratories and
                   edge/center computing/data resources to enable autonomous
                   experiments, ``self-driving'' laboratories, smart
                   manufacturing, and artificial intelligence (AI)-driven
                   design, discovery and evaluation.
                   The architecture project is divided into three focus areas:
                   design patterns, SoS architecture, and microservice
                   architecture. The design patterns area focuses on describing
                   science use cases as design patterns that identify and
                   abstract the involved hardware/software components and their
                   interactions in terms of control, work and data flow. The
                   SoS architecture area focuses on an open architecture
                   specification for the federated ecosystem that clarifies
                   terms, architectural elements, the interactions between
                   them and compliance. The microservice architecture
                   describes blueprints for loosely coupled microservices,
                   standardized interfaces, and multi-programming language
                   support. This document is the SoS Architecture specification
                   only, and captures the system of systems architecture
                   design for the INTERSECT Initiative and its components. It
                   is intended to provide a deep analysis and specification of
                   how the INTERSECT platform will be designed, and to link the
                   scientific needs identified across disciplines with the
                   technical needs involved in the support, development, and
                   evolution of a science ecosystem.",
  pts           = "186209"
}