About Me

Extreme-Scale Computing | Fault Resilience | HW/SW Co-Design Tools | Computing Continuum | Autonomous Experiments

Dr. Christian Engelmann is a Senior Computer Scientist and the Intelligent Systems and Facilities Research Group Leader at Oak Ridge National Laboratory (ORNL), the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $2.7 billion and 6,000+ staff. He has more than 23 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems. Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, and interoperability.

Dr. Engelmann’s primary expertise is in HPC resilience, i.e., efficiency and correctness in the presence of faults, errors, and failures. He is a leading HPC resilience expert and was a member of the DOE Technical Council on HPC Resilience 2013-15. He received the 2015 DOE Early Career Award for research in resilience design patterns. Dr. Engelmann’s secondary expertise is in system software for the instrument-to-edge-to-Cloud-to-center computing continuum, enabling science breakthroughs with autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence (AI) driven design, discovery and evaluation. He further has expertise in lightweight simulation of future-generation extreme-scale supercomputers, studying the impact of hardware/software properties on performance and resilience for application-architecture co-design. Dr. Engelmann is also an expert in operating system and runtime software for parallel and distributed systems.

Dr. Engelmann earned a Dipl.-Ing. (FH) in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, and an M.Sc. in Computer Science from the University of Reading, UK, both in 2001 as conjoint degrees, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). He is also a Member of the Society for Industrial and Applied Mathematics (SIAM) and the Advanced Computing Systems Association (USENIX).

View Christian Engelmann's profile on LinkedIn | View Christian Engelmann's profile on Google Scholar | DBLP: Christian Engelmann | Scopus ID: 18037364000 | ORCID iD: orcid.org/0000-0003-4365-6416

Contact: engelmannc@computer.org | 2-page biography: Publication | Resume: Available upon request

Ongoing Projects

2021-…: The Open Federated Architecture for the Laboratory of the Future project connects scientific instruments, robot-controlled laboratories and edge/center computing/data resources to enable autonomous experiments, self-driving laboratories, smart manufacturing, and AI-driven design, discovery and evaluation.

Recently In the News

2023-08-24: ORNL News. INTERSECT launches autonomous ‘labs of the future’.

2021-03-30: DOE Advanced Scientific Computing Research. New Approach to Fault Tolerance Means More Efficient High-Performance Computers.
2021-01-04: HPCwire. What’s New in HPC Research: GPU Lifetimes, the Square Kilometre Array, Support Tickets & More.

Latest Peer-Reviewed Publications

  1. M. J. Brim, L. Drane, M. McDonnell, C. Engelmann, and A. M. Thakur. A Microservices Architecture Toolkit for Interconnected Science Ecosystems. In Proceedings of the 37th International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2024: 19th Workshop on Workflows in Support of Large-Scale Science (WORKS) 2024, November, 2024. To appear. Abstract BibTeX Citation
  2. V. Oles, A. Schmedding, G. Ostrouchov, W. Shi, E. Smirni, and C. Engelmann. Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS) 2024, June, 2024. DOI 10.1145/3650200.3656615. Accept. rate 36.0% (45/125). Publication Presentation BibTeX Citation
  3. C. Engelmann and S. Somnath. Science Use Case Design Patterns for Autonomous Experiments. In Proceedings of the 28th European Conference on Pattern Languages of Programs (EuroPLoP) 2023, July, 2023. DOI 10.1145/3628034.3628060. Abstract Publication BibTeX Citation
  4. C. Engelmann, O. Kuchar, S. Boehm, M. J. Brim, T. Naughton, S. Somnath, S. Atchley, J. Lange, B. Mintz, and E. Arenholz. The INTERSECT Open Federated Architecture for the Laboratory of the Future. In Communications in Computer and Information Science (CCIS): Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. 18th Smoky Mountains Computational Sciences & Engineering Conference (SMC) 2022, August, 2022. DOI 10.1007/978-3-031-23606-8_11. Accept. rate 32.4% (24/74). Abstract Publication Presentation BibTeX Citation
  5. E. Agullo, M. Altenbernd, H. Anzt, L. Bautista-Gomez, T. Benacchio, L. Bonaventura, H. Bungartz, S. Chatterjee, F. M. Ciorba, N. DeBardeleben, D. Drzisga, S. Eibl, C. Engelmann, W. N. Gansterer, L. Giraud, D. Göddeke, M. Heisig, F. Jézéquel, N. Kohl, X. S. Li, R. Lion, M. Mehl, P. Mycek, M. Obersteiner, E. S. Quintana-Ortí, F. Rizzi, U. Rüde, M. Schulz, F. Fung, R. Speck, L. Stals, K. Teranishi, S. Thibault, D. Thönnes, A. Wagner, and B. Wohlmuth. Resiliency in Numerical Algorithm Design for Extreme Scale Simulations. International Journal of High Performance Computing Applications (IJHPCA), volume 36, number 2, March, 2022. DOI 10.1177/10943420211055188. Abstract Publication BibTeX Citation

Highly Cited Peer-Reviewed Publications

  1. A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, June, 2007. DOI 10.1145/1274971.1274978. Accept. rate 23.6% (29/123). 527 citations. Abstract Publication Presentation BibTeX Citation
  2. M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. Debardeleben, P. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, May, 2014. DOI 10.1177/1094342014522573. 526 citations. Abstract Publication BibTeX Citation
  3. D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, and R. Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, November, 2012. DOI 10.1109/SC.2012.49. Accept. rate 21.2% (100/472). 386 citations. Abstract Publication Presentation BibTeX Citation
  4. C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, November, 2008. DOI 10.1145/1413370.1413414. Accept. rate 21.3% (59/277). 250 citations. Abstract Publication Presentation BibTeX Citation
  5. J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, June, 2012. DOI 10.1109/ICDCS.2012.56. Accept. rate 13.8% (71/515). 203 citations. Abstract Publication Presentation BibTeX Citation

Other Significant Publications

  1. M. Kumar, S. Gupta, T. Patel, M. Wilder, W. Shi, S. Fu, C. Engelmann, and D. Tiwari. Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale HPC System. Journal of Parallel and Distributed Computing (JPDC), volume 153, July, 2021. DOI 10.1016/j.jpdc.2021.03.001. Abstract Publication BibTeX Citation
  2. G. Ostrouchov, D. Maxwell, R. Ashraf, C. Engelmann, M. Shankar, and J. Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the 33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020, November, 2020. DOI 10.1109/SC41405.2020.00045. Accept. rate 25.1% (95/378). Abstract Publication Presentation BibTeX Citation
  3. H. Jeong, Y. Yang, C. Engelmann, V. Gupta, T. M. Low, P. Grover, V. Cadambe, and K. Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, August, 2020. DOI 10.1007/978-3-030-57675-2_25. Accept. rate 24.5% (39/159). Abstract Publication Presentation BibTeX Citation
  4. D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, June, 2016. DOI 10.1145/2925426.2926295. Accept. rate 24.2% (43/178). Abstract Publication Presentation BibTeX Citation
  5. C. Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, January, 2014. DOI 10.1016/j.future.2013.04.014. 70 citations. Abstract Publication BibTeX Citation

Symbols: Abstract, Publication, Presentation, BibTeX Citation