Peer-reviewed Journal Papers
- Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner, and Barbara Wohlmuth. Resiliency in Numerical Algorithm Design for Extreme Scale Simulations. International Journal of High Performance Computing Applications (IJHPCA), volume 36, number 2, pages 251-285, March 1, 2022. SAGE Publications. ISSN 1094-3420. DOI 10.1177/10943420211055188.
- Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale HPC System. Journal of Parallel and Distributed Computing (JPDC), volume 153, pages 29-43, July 1, 2021. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2021.03.001.
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Epidemic Failure Detection and Consensus for Extreme Parallelism. International Journal of High Performance Computing Applications (IJHPCA), volume 32, number 5, pages 729-743, September 1, 2018. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342017690910.
- Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Journal of Supercomputing Frontiers and Innovations (JSFI), volume 4, number 3, pages 4-42, October 1, 2017. South Ural State University Chelyabinsk, Russia. ISSN 2409-6008. DOI 10.14529/jsfi170301.
- Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, August 1, 2016. John Wiley & Sons, Inc.. ISSN 1532-0634. DOI 10.1002/cpe.3805.
- Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, May 1, 2014. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342014522573.
- Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, January 1, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2013.04.014.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, February 1, 2012. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2011.10.009.
- Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-Level Virtualization Research at Oak Ridge National Laboratory. Future Generation Computer Systems (FGCS), volume 26, number 3, pages 304-307, March 1, 2010. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2009.07.001.
- Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, December 1, 2009. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2009.08.004.
- Xubin (Ben) He, Li Ou, Martha J. Kosa, Stephen L. Scott, and Christian Engelmann. A Unified Multiple-Level Cache for High Performance Cluster Storage Systems. International Journal of High Performance Computing and Networking (IJHPCN), volume 5, number 1-2, pages 97-109, November 14, 2007. Inderscience Publishers, Geneve, Switzerland. ISSN 1740-0562. DOI 10.1504/IJHPCN.2007.015768.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, December 1, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X. DOI 10.4304/jcp.1.8.43-54.
- Christian Engelmann, Stephen L. Scott, David E. Bernholdt, Narasimha R. Gottumukkala, Chokchai (Box) Leangsuksun, Jyothish Varma, Chao Wang, Frank Mueller, Aniruddha G. Shet, and Ponnuswamy (Saday) Sadayappan. MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems. ACM SIGOPS Operating Systems Review (OSR), volume 40, number 2, pages 63-72, April 1, 2006. ACM Press, New York, NY, USA. ISSN 0163-5980. DOI 10.1145/1131322.1131337.
Peer-reviewed Conference Papers
- Vladyslav Oles, Anna Schmedding, George Ostrouchov, Woong Shi, Evgenia Smirni, and Christian Engelmann. Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS) 2024, pages 188-200, Kyoto, Japan, June 4-7, 2024. ACM Press, New York, NY, USA. ISBN 979-8-4007-0610-3. DOI 10.1145/3650200.3656615. Acceptance rate 36.0% (45/125).
- Christian Engelmann and Suhas Somnath. Science Use Case Design Patterns for Autonomous Experiments. In Proceedings of the 28th European Conference on Pattern Languages of Programs (EuroPLoP) 2023, pages 1-14, Kloster Irsee, Germany, July 5-9, 2023. ACM Press, New York, NY, USA. ISBN 979-8-4007-0040-8. DOI 10.1145/3628034.3628060.
- Christian Engelmann, Olga Kuchar, Swen Boehm, Michael J. Brim, Thomas Naughton, Suhas Somnath, Scott Atchley, Jack Lange, Ben Mintz, and Elke Arenholz. The INTERSECT Open Federated Architecture for the Laboratory of the Future. In Communications in Computer and Information Science (CCIS): Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. 18th Smoky Mountains Computational Sciences & Engineering Conference (SMC) 2022, pages 173-190, August 24-25, 2022. Springer, Cham. ISBN 978-3-031-23605-1. DOI 10.1007/978-3-031-23606-8_11. Acceptance rate 32.4% (24/74).
- Saurabh Hukerikar and Christian Engelmann. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems. In Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, pages 31-39, Perth, Australia, December 1-4, 2020. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-8004-5. ISSN 1555-094X. DOI 10.1109/PRDC50213.2020.00014. Acceptance rate 40.9% (18/44).
- George Ostrouchov, Don Maxwell, Rizwan Ashraf, Christian Engelmann, Mallikarjun Shankar, and James Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the 33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020, pages 41:1-14, Atlanta, GA, USA, November 15-20, 2020. ACM Press, New York, NY, USA. ISBN 9781728199986. DOI 10.1109/SC41405.2020.00045. Acceptance rate 25.1% (95/378).
- Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, pages 392-407, Warsaw, Poland, August 24-28, 2020. Springer Verlag, Berlin, Germany. ISBN 978-3-030-57674-5. DOI 10.1007/978-3-030-57675-2_25. Acceptance rate 24.5% (39/159).
- Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228).
- Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 95-106, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00022. Acceptance rate 27.2% (62/228).
- Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing. In Proceedings of the 9th ACM/SPEC International Conference on Performance Engineering (ICPE) 2018, pages 80-87, Berlin, Germany, April 9-13, 2018. ACM Press, New York, NY, USA. ISBN 978-1-4503-5095-2. DOI 10.1145/3184407.3184421. Acceptance rate 23.7% (14/59).
- Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery. In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2018, pages 178-185, Cambridge, UK, March 21-23, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-4975-6. ISSN 2377-5750. DOI 10.1109/PDP2018.2018.00032. Acceptance rate 29.3% (27/92).
- Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327).
- Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, pages 22-31, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2764-8. ISSN 2375-0227. DOI 10.1109/MASCOTS.2017.12. Acceptance rate 30.95% (26/84).
- Saurabh Hukerikar and Christian Engelmann. A Pattern Language for High-Performance Computing Resilience. In Proceedings of the 22nd European Conference on Pattern Languages of Programs (EuroPLoP) 2017, pages 12:1-12:16, Kloster Irsee, Germany, July 12-16, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-4848-5. DOI 10.1145/3147704.3147718.
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Benchmark Generation and Simulation at Extreme Scale. In Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, pages 9-18, London, UK, September 21-23, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-3506-9. ISSN 1550-6525. DOI 10.1109/DS-RT.2016.18. Acceptance rate 42.0% (21/50). Best paper candidate.
- Saurabh Hukerikar and Christian Engelmann. Havens: Explicit Reliable Memory Regions for HPC Applications. In Proceedings of the 20th IEEE High Performance Extreme Computing Conference (HPEC) 2016, pages 1-6, Waltham, MA, USA, September 13-15, 2016. IEEE Computer Society, Los Alamitos, CA, USA. DOI 10.1109/HPEC.2016.7761593.
- Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016, pages 311-322, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. DOI 10.1109/DSN.2016.36. Acceptance rate 22.4% (58/259).
- David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. DOI 10.1145/2925426.2926295. Acceptance rate 24.2% (43/178).
- Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Extreme Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. DOI 10.1109/IPDPS.2016.100. Acceptance rate 23.0% (114/496).
- Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0. DOI 10.2316/P.2016.834-005.
- Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Scalable and Fault Tolerant Failure Detection and Consensus. In Proceedings of the 22nd European MPI Users` Group Meeting (EuroMPI) 2015, pages 13:1-13:9, Bordeaux, France, September 21-24, 2015. ACM Press, New York, NY, USA. ISBN 978-1-4503-3795-3. DOI 10.1145/2802658.2802660. Acceptance rate 48.3% (14/29).
- Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2. DOI 10.2316/P.2015.826-043.
- Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. DOI 10.1109/DS-RT.2014.32. Best paper candidate.
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. DOI 10.1109/PDP.2014.74. Acceptance rate 32.6% (73/224).
- Geoffroy Vallée, Thomas Naughton, Swen Böhm, and Christian Engelmann. A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools. In Proceedings of the 1st International Symposium on Computing and Networking – Across Practical Development and Theoretical Research – (CANDAR) 2013, pages 213-219, Matsuyama, Japan, December 4-6, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-2795-1. DOI 10.1109/CANDAR.2013.38. Acceptance rate 35.8% (28/78).
- Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1. DOI 10.2316/P.2013.795-010.
- David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472).
- James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515).
- Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. DOI 10.1109/IPDPS.2012.90. Acceptance rate 20.7% (118/569).
- Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192. DOI 10.1109/PDP.2012.22.
- Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. DOI 10.1109/HPCSim.2011.5999835. Acceptance rate 28.1% (48/171).
- Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9. DOI 10.2316/P.2011.719-031.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. DOI 10.1109/ICPADS.2010.48. Acceptance rate 29.6% (77/188).
- Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. DOI 10.1109/SC.2010.28. Acceptance rate 19.8% (50/253).
- Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. DOI 10.1109/HPCC.2010.32. Acceptance rate 19.1% (58/304).
- Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. DOI 10.2316/P.2010.676-024.
- Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. DOI 10.1109/ARES.2009.105. Acceptance rate 25.0% (40/160).
- Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Evaluating the Shared Root File System Approach for Diskless High-Performance Computing Systems. In Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (LCI) 2009, Boulder, CO, USA, March 9-12, 2009.
- Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138).
- Alessandro Valentini, Christian Di Biagio, Fabrizio Batino, Guido Pennella, Fabrizio Palma, and Christian Engelmann. High Performance Computing with Harness over InfiniBand. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 151-154, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.64. Acceptance rate 42.0% (58/138).
- Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0.
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277).
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active Replication for Dependent Services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.64. Acceptance rate 21.1% (40/190).
- Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.171. Acceptance rate 21.1% (40/190).
- Björn Könning, Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Virtualized Environments for the Harness High Performance Computing Workbench. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 133-140, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. DOI 10.1109/PDP.2008.14. Acceptance rate 40% (83/207).
- Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, Hong H. Ong, and Stephen L. Scott. System-level Virtualization for High Performance Computing. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 636-643, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. DOI 10.1109/PDP.2008.85. Acceptance rate 40% (83/207).
- Li Ou, Christian Engelmann, Xubin (Ben) He, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-703-1. Acceptance rate 49%.
- Emanuele Di Saverio, Marco Cesati, Christian Di Biagio, Guido Pennella, and Christian Engelmann. Distributed Real-Time Computing with Harness. In Lecture Notes in Computer Science: Proceedings of the 14th European PVM/MPI Users` Group Meeting (EuroPVM/MPI) 2007, pages 281-288, Paris, France, September 30 – October 3, 2007. Springer Verlag, Berlin, Germany. ISBN 978-3-540-75415-2. ISSN 0302-9743. DOI 10.1007/978-3-540-75416-9_39.
- Li Ou, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. A Fast Delivery Protocol for Total Order Broadcasting. In Proceedings of the 16th IEEE International Conference on Computer Communications and Networks (ICCCN) 2007, pages 730-734, Honolulu, HI, USA, August 13-16, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-42441-251-8. ISSN 1095-2055. DOI 10.1109/ICCCN.2007.4317904. Acceptance rate 29.1% (160/550).
- Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1145/1274971.1274978. Acceptance rate 23.6% (29/123).
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. On Programming Models for Service-Level High Availability. In Proceedings of the 2nd International Conference on Availability, Reliability and Security (ARES) 2007, pages 999-1006, Vienna, Austria, April 10-13, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2775-2. DOI 10.1109/ARES.2007.109. Acceptance rate 28.3% (60/212).
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1109/IPDPS.2007.370307. Acceptance rate 26% (109/419).
- Kai Uhlemann, Christian Engelmann, and Stephen L. Scott. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. In Proceedings of the 8th IEEE International Conference on Cluster Computing (Cluster) 2006, pages 1-10, Barcelona, Spain, September 25-28, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 1-4244-0328-6. ISSN 1552-5244. DOI 10.1109/CLUSTR.2006.311855. Acceptance rate 33.1% (42/127).
- Ronald Baumann, Christian Engelmann, and George A. (Al) Geist. A Parallel Plug-in Programming Paradigm. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on High Performance Computing and Communications (HPCC) 2006, pages 823-832, Munich, Germany, September 13-15, 2006. Springer Verlag, Berlin, Germany. ISBN 978-3-540-39368-9. ISSN 0302-9743. DOI 10.1007/11847366_85.
- Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8. DOI 10.1145/1183401.1183433. Acceptance rate 26.2% (37/141).
- Daniel I. Okunbor, Christian Engelmann, and Stephen L. Scott. Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems. In Proceedings of the 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006.
- Kshitij Limaye, Chokchai (Box) Leangsuksun, Zeno Greenwood, Stephen L. Scott, Christian Engelmann, Richard M. Libby, and Kasidit Chanchio. Job-Site Level Fault Tolerance for Cluster and Grid Environments. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster) 2005, pages 1-9, Boston, MA, USA, September 26-30, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7803-9486-0. ISSN 1552-5244. DOI 10.1109/CLUSTR.2005.347043. Acceptance rate 39.6% (45/138).
- Hertong Song, Chokchai (Box) Leangsuksun, Raja Nassar, Yudan Liu, Christian Engelmann, and Stephen L. Scott. UML-based Beowulf Cluster Availability Modeling. In International Conference on Software Engineering Research and Practice (SERP) 2005, pages 161-167, Las Vegas, NV, USA, June 27-30, 2005. CSREA Press. ISBN 1-932415-49-1.
- Christian Engelmann and George A. (Al) Geist. Super-Scalable Algorithms for Computing on 100,000 Processors. In Lecture Notes in Computer Science: Proceedings of the 5th International Conference on Computational Science (ICCS) 2005, Part I, pages 313-320, Atlanta, GA, USA, May 22-25, 2005. Springer Verlag, Berlin, Germany. ISBN 978-3-540-26032-5. ISSN 0302-9743. DOI 10.1007/11428831_39. Acceptance rate 35%.
Peer-reviewed Workshop Papers
- Michael J. Brim, Lance Drane, Marshall McDonnell, Christian Engelmann, and Addi Malviya Thakur. A Microservices Architecture Toolkit for Interconnected Science Ecosystems. In Proceedings of the 37th International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2024: 19th Workshop on Workflows in Support of Large-Scale Science (WORKS) 2024, Atlanta, GA, USA, November 18, 2024. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
- Mohit Kumar and Christian Engelmann. RDPM: An Extensible Tool for Resilience Design Patterns Modeling. In Lecture Notes in Computer Science: Proceedings of the 27th European Conference on Parallel and Distributed Computing (Euro-Par) 2021 Workshops: 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 283-297, Lisbon, Portugal, August 30, 2021. Springer Verlag, Berlin, Germany. ISBN 978-3-031-06155-4. DOI 10.1007/978-3-031-06156-1_23. Acceptance rate 66.7% (4/6).
- Mohit Kumar and Christian Engelmann. Models for Resilience Design Patterns. In Proceedings of the 33rd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2020: 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2020, pages 21-30, Atlanta, GA, USA, November 11, 2020. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7381-1080-6. DOI 10.1109/FTXS51974.2020.00008. Acceptance rate 66.7% (6/9).
- Piyush Sao, Christian Engelmann, Srinivas Eswar, Oded Green, and Richard Vuduc. Self-stabilizing Connected Components. In Proceedings of the 32nd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2019: 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019, pages 50-59, Denver, CO, USA, November 22, 2019. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-6013-9. DOI 10.1109/FTXS49593.2019.00011. Acceptance rate 60.0% (6/10).
- Christian Engelmann, Geoffroy R. Vallée, and Swaroop Pophale. Concepts for OpenMP Target Offload Resilience. In Lecture Notes in Computer Science: Proceedings of the 15th International Workshop on OpenMP (IWOMP) 2019, pages 78-93, Auckland, New Zealand, September 11-13, 2019. Springer Verlag, Berlin, Germany. ISBN 978-3-030-28595-1. DOI 10.1007/978-3-030-28596-8_6.
- Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 29-38, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00007. Acceptance rate 45.0% (9/20).
- Rizwan Ashraf and Christian Engelmann. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 39-48, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00008. Acceptance rate 45.0% (9/20).
- Byung Hoon (Hoony) Park, Yawei Hui, Swen Boehm, Rizwan Ashraf, Christian Engelmann, and Christopher Layton. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log. In Proceedings of the 19th IEEE International Conference on Cluster Computing (Cluster) 2018: 5th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2018, pages 571-579, Belfast, UK, September 10, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-8319-4. ISSN 2168-9253. DOI 10.1109/CLUSTER.2018.00073.
- Rizwan Ashraf and Christian Engelmann. Performance Efficient Multiresilience using Checkpoint Recovery in Iterative Algorithms. In Lecture Notes in Computer Science: Proceedings of the 24th European Conference on Parallel and Distributed Computing (Euro-Par) 2018 Workshops: 11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 813-825, Turin, Italy, August 28, 2018. Springer Verlag, Berlin, Germany. ISBN 978-3-030-10549-5. DOI 10.1007/978-3-030-10549-5_63. Acceptance rate 50.0% (4/8).
- Byung Hoon (Hoony) Park, Saurabh Hukerikar, Christian Engelmann, and Ryan Adamson. Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In Proceedings of the 18th IEEE International Conference on Cluster Computing (Cluster) 2017: 4th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2017, pages 758-765, Honolulu, HI, USA, September 5, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2327-5. ISSN 2168-9253. DOI 10.1109/CLUSTER.2017.113.
- Saurabh Hukerikar and Christian Engelmann. Pattern-based Modeling of High-Performance Computing Resilience. In Lecture Notes in Computer Science: Proceedings of the 23rd European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops: 10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 557-568, Santiago de Compostela, Spain, August 29, 2017. Springer Verlag, Berlin, Germany. ISBN 978-3-319-75177-1. DOI 10.1007/978-3-319-75178-8_45. Acceptance rate 66.7% (4/6).
- Saurabh Hukerikar, Rizwan Ashraf, and Christian Engelmann. Towards New Metrics for High-Performance Computing Resilience. In Proceedings of the 26th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2017: 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2017, pages 23-30, Washington, D.C., June 26-30, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5001-3. DOI 10.1145/3086157.3086163. Acceptance rate 83.3% (5/6).
- Saurabh Hukerikar and Christian Engelmann. Language Support for Reliable Memory Regions. In Lecture Notes in Computer Science: Proceedings of the 29th International Workshop on Languages and Compilers for Parallel Computing, pages 73-87, Rochester, NY, USA, September 28-30, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-52708-6. ISSN 0302-9743. DOI 10.1007/978-3-319-52709-3_6. Acceptance rate 76.9% (20/26).
- Thomas Naughton, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. A Cooperative Approach to Virtual Machine Based Fault Injection. In Lecture Notes in Computer Science: Proceedings of the 22nd European Conference on Parallel and Distributed Computing (Euro-Par) 2016 Workshops: 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 671-682, Grenoble, France, August 23, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-58943-5. ISSN 0302-9743. DOI 10.1007/978-3-319-58943-5_54. Acceptance rate 55.6% (5/9).
- Zachary Parchman, Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and David E. Bernholdt. Adding Fault Tolerance to NPB Benchmarks Using ULFM. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016, pages 19-26, Kyoto, Japan, May 31 – June 4, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4349-7. DOI 10.1145/2909428.2909429. Acceptance rate 85.7% (6/7).
- Thomas Naughton, Garry Smith, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. What is the right balance for performance and isolation with virtualization in HPC?. In Lecture Notes in Computer Science: Proceedings of the 20th European Conference on Parallel and Distributed Computing (Euro-Par) 2014 Workshops: 7th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 570-581, Porto, Portugal, August 25, 2014. Springer Verlag, Berlin, Germany. ISBN 978-3-319-14325-5. ISSN 0302-9743. DOI 10.1007/978-3-319-14325-5_49. Acceptance rate 60.0% (6/10).
- Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918. DOI 10.1109/ICPP.2013.114.
- Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Tools for Simulation and Benchmark Generation at Exascale. In Proceedings of the 7th Parallel Tools Workshop, pages 19-24, Dresden, Germany, September 3-4, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-319-08143-4. DOI 10.1007/978-3-319-08144-1_2.
- Thomas Naughton, Swen Böhm, Christian Engelmann, and Geoffroy Vallée. Using Performance Tools to Support Experiments in HPC Resilience. In Lecture Notes in Computer Science: Proceedings of the 19th European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops: 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 727-736, Aachen, Germany, August 26, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-642-54419-4. ISSN 0302-9743. DOI 10.1007/978-3-642-54420-0_71. Acceptance rate 87.5% (7/8).
- Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4511-0. ISSN 1530-2016. DOI 10.1109/ICPPW.2011.44.
- David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II: 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 251-261, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29740-3. DOI 10.1007/978-3-642-29740-3_29. Acceptance rate 60.0% (12/20).
- Thomas Naughton, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. A Case for Virtual Machine based Fault Injection in a High-Performance Computing Environment. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011: 5th Workshop on System-level Virtualization for High Performance Computing (HPCVirt), pages 234-243, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29737. DOI 10.1007/978-3-642-29737-3_27.
- Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2. DOI 10.1109/CLUSTERWKSP.2010.5613113.
- George Ostrouchov, Thomas Naughton, Christian Engelmann, Geoffroy R. Vallée, and Stephen L. Scott. Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience. In Proceedings of the 5th IEEE International Conference on e-Science (e-Science) 2009: Workshop on Computational Science, pages 80-85, Oxford, UK, December 9-11, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-5946-9. DOI 10.1109/ESCIW.2009.5407992.
- Thomas Naughton, Wesley Bland, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. Fault Injection Framework for System Resilience Evaluation – Fake Faults for Finding Future Failures. In Proceedings of the 18th International Symposium on High Performance Distributed Computing (HPDC) 2009: 2nd Workshop on Resiliency in High Performance Computing (Resilience) 2009, pages 23-28, Munich, Germany, June 9, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-587-1. DOI 10.1145/1552526.1552530.
- Anand Tikotekar, Hong H. Ong, Sadaf Alam, Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and Stephen L. Scott. Performance Comparison of Two Virtual Machine Scenarios Using an HPC Application – A Case study Using Molecular Dynamics Simulations. In Proceedings of the 3rd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2009, in conjunction with the 4th ACM SIGOPS European Conference on Computer Systems (EuroSys) 2009, pages 33-40, Nuremberg, Germany, March 30, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-465-2. DOI 10.1145/1519138.1519143.
- Geoffroy R. Vallée, Thomas Naughton, Hong H. Ong, Anand Tikotekar, Christian Engelmann, Wesley Bland, Ferrol Aderholt, and Stephen L. Scott. Virtual System Environments. In Communications in Computer and Information Science: Proceedings of the 2nd DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and New Technologies (SVM) 2008, pages 72-83, Munich, Germany, October 21-22, 2008. Springer Verlag, Berlin, Germany. ISBN 978-3-540-88707-2. ISSN 1865-0929. DOI 10.1007/978-3-540-88708-9_7.
- Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong H. Ong, Christian Engelmann, and Stephen L. Scott. An Analysis of HPC Benchmark Applications in Virtual Machine Environments. In Lecture Notes in Computer Science: Proceedings of the 14th European Conference on Parallel and Distributed Computing (Euro-Par) 2008: 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC) 2008, pages 63-71, Las Palmas de Gran Canaria, Spain, August 26-29, 2008. Springer Verlag, Berlin, Germany. ISBN 978-3-642-00954-9. DOI 10.1007/978-3-642-00955-6.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008: Workshop on Resiliency in High Performance Computing (Resilience) 2008, pages 813-818, Lyon, France, May 19-22, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3156-4. DOI 10.1109/CCGRID.2008.78.
- Xin Chen, Benjamin Eckart, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. An Online Controller Towards Self-Adaptive File System Availability and Performance. In Proceedings of the 5th High Availability and Performance Workshop (HAPCW) 2008, in conjunction with the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, April 3-4, 2008.
- Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong H. Ong, Christian Engelmann, Stephen L. Scott, and Anthony M. Filippi. Effects of Virtualization on a Scientific Application – Running a Hyperspectral Radiative Transfer Code on Virtual Machines. In Proceedings of the 2nd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2008, in conjunction with the 3rd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2008, pages 16-23, Glasgow, UK, March 31, 2008. ACM Press, New York, NY, USA. ISBN 978-1-60558-120-0. DOI 10.1145/1435452.1435455.
- Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Middleware in Modern High Performance Computing System Architectures. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on Computational Science (ICCS) 2007, Part II: 4th Special Session on Collaborative and Cooperative Environments (CCE) 2007, pages 784-791, Beijing, China, May 27-30, 2007. Springer Verlag, Berlin, Germany. ISBN 3-5407-2585-5. ISSN 0302-9743. DOI 10.1007/978-3-540-72586-2_111.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Transparent Symmetric Active/Active Replication for Service-Level High Availability. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007: 7th International Workshop on Global and Peer-to-Peer Computing (GP2PC) 2007, pages 755-760, Rio de Janeiro, Brazil, May 14-17, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2833-3. DOI 10.1109/CCGRID.2007.116.
- Christian Engelmann, Stephen L. Scott, Hong H. Ong, Geoffroy R. Vallée, and Thomas Naughton. Configurable Virtualized System Environments for High Performance Computing. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, March 20, 2007.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
- Li Ou, Xin Chen, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. Achieving Computational I/O Effciency in a High Performance Cluster Using Multicore Processors. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
- Christian Engelmann and George A. (Al) Geist. RMIX: A Dynamic, Heterogeneous, Reconfigurable Communication Framework. In Lecture Notes in Computer Science: Proceedings of the 6th International Conference on Computational Science (ICCS) 2006, Part II: 3rd Special Session on Collaborative and Cooperative Environments (CCE) 2006, pages 573-580, Reading, UK, May 28-31, 2006. Springer Verlag, Berlin, Germany. ISBN 3-540-34381-4. ISSN 0302-9743. DOI 10.1007/11758525_77.
- Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Active/Active Replication for Highly Available HPC System Services. In Proceedings of the 1st International Conference on Availability, Reliability and Security (ARES) 2006: 1st International Workshop on Frontiers in Availability, Reliability and Security (FARES) 2006, pages 639-645, Vienna, Austria, April 20-22, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2567-9. DOI 10.1109/ARES.2006.23.
- Christian Engelmann and Stephen L. Scott. Concepts for High Availability in Scientific High-End Computing. In Proceedings of the 3rd High Availability and Performance Workshop (HAPCW) 2005, in conjunction with the 6th Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11, 2005.
- Christian Engelmann and Stephen L. Scott. High Availability for Ultra-Scale High-End Scientific Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005.
- Chokchai (Box) Leangsuksun, Venkata K. Munganuru, Tong Liu, Stephen L. Scott, and Christian Engelmann. Asymmetric Active-Active High Availability for High-end Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005.
- Christian Engelmann and George A. (Al) Geist. A Lightweight Kernel for the Harness Metacomputing Framework. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2005: 14th Heterogeneous Computing Workshop (HCW) 2005, Denver, CO, USA, April 4, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2312-9. ISSN 1530-2075. DOI 10.1109/IPDPS.2005.34.
- Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. High Availability through Distributed Control. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004.
- Xubin (Ben) He, Li Ou, Stephen L. Scott, and Christian Engelmann. A Highly Available Cluster Storage System using Scavenging. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004.
- Christian Engelmann and George A. (Al) Geist. A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform. In Proceedings of the Challenges of Large Applications in Distributed Environments Workshop (CLADE) 2003, in conjunction with the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC) 2003, pages 47, Seattle, WA, USA, June 21, 2003. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-1984-9. DOI xpls/abs_all.jsp?arnumber=4159902.
- Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Distributed Peer-to-Peer Control in Harness. In Lecture Notes in Computer Science: Proceedings of the 2nd International Conference on Computational Science (ICCS) 2002, Part II: Workshop on Global and Collaborative Computing, pages 720-727, Amsterdam, The Netherlands, April 21-24, 2002. Springer Verlag, Berlin, Germany. ISBN 3-540-43593-X. ISSN 0302-9743. DOI content/l537ujfwt8yta2dp.
Peer-reviewed Conference Posters
- Christian Engelmann, Swen Boehm, Michael Brim, Jack Lange, Thomas Naughton, Patrick Widener, Ben Mintz, and Rohit Srivastava. INTERSECT: The Open Federated Architecture for the Laboratory of the Future. Poster at the 52nd International Conference on Parallel Processing (ICPP) 2023, Salt Lake City, UT, USA, August 7-10, 2023.
- Christian Engelmann and Mohit Kumar. Resilience Design Patterns: A Structured Modeling Approach of Resilience in Computing Systems. Poster at the Workshop on Modeling and Simulation of Systems and Applications (ModSim) 2022, Seattle, WA, USA, August 10-12, 2022.
- Yawei Hui, Rizwan Ashraf, Byung Hoon (Hoony) Park, and Christian Engelmann. Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing. Poster at the 6th IEEE International Conference on Big Data (BigData) 2018, Seattle, WA, USA, December 10-13, 2018.
- Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Summarizing HPC System Status. Poster at the 8th IEEE Symposium on Large Data Analysis and Visualization in conjunction with the 8th IEEE Vis 2018, Berlin, Germany, October 21, 2018.
- Christian Engelmann and Rizwan Ashraf. Modeling and Simulation of Extreme-Scale Systems for Resilience by Design. Poster at the Workshop on Modeling and Simulation of Systems and Applications, Seattle, WA, USA, August 15-17, 2018.
- Onkar Patil, Saurabh Hukerikar, Frank Mueller, and Christian Engelmann. Exploring Use Cases for Non-Volatile Memories in Support of HPC Resilience. Poster at the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, Denver, CO, USA, November 12-17, 2017.
- David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, and Kurt Ferreira. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011.
- David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011.
- Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009.
- Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-level Virtualization for for High-Performance Computing. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009.
- Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA, February 14-18, 2009.
- George A. (Al) Geist, Christian Engelmann, Jack J. Dongarra, George Bosilca, Magdalena M. Sławińska, and Jarosław K. Sławiński. The Harness Workbench: Unified and Adaptive Access to Diverse High-Performance Computing Platforms. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008.
- Stephen L. Scott, Christian Engelmann, Hong H. Ong, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, Jyothish Varma, Xubin (Ben) He, Li Ou, and Xin Chen. Resiliency for High-Performance Computing Systems. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008.
- Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H.