Peer-Reviewed Conference Papers

  1. Vladyslav Oles, Anna Schmedding, George Ostrouchov, Woong Shin, Evgenia Smirni, and Christian Engelmann. Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS) 2024, pages 188-200, Kyoto, Japan, June 4-7, 2024. ACM Press, New York, NY, USA. ISBN 979-8-4007-0610-3. DOI 10.1145/3650200.3656615. Acceptance rate 36.0% (45/125). Publication Presentation BibTeX Citation
  2. Christian Engelmann and Suhas Somnath. Science Use Case Design Patterns for Autonomous Experiments. In Proceedings of the 28th European Conference on Pattern Languages of Programs (EuroPLoP) 2023, pages 1-14, Kloster Irsee, Germany, July 5-9, 2023. ACM Press, New York, NY, USA. ISBN 979-8-4007-0040-8. DOI 10.1145/3628034.3628060. Abstract Publication BibTeX Citation
  3. Christian Engelmann, Olga Kuchar, Swen Boehm, Michael J. Brim, Thomas Naughton, Suhas Somnath, Scott Atchley, Jack Lange, Ben Mintz, and Elke Arenholz. The INTERSECT Open Federated Architecture for the Laboratory of the Future. In Communications in Computer and Information Science (CCIS): Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. 18th Smoky Mountains Computational Sciences & Engineering Conference (SMC) 2022, pages 173-190, August 24-25, 2022. Springer, Cham. ISBN 978-3-031-23605-1. DOI 10.1007/978-3-031-23606-8_11. Acceptance rate 32.4% (24/74). Abstract Publication Presentation BibTeX Citation
  4. Saurabh Hukerikar and Christian Engelmann. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems. In Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, pages 31-39, Perth, Australia, December 1-4, 2020. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-8004-5. ISSN 1555-094X. DOI 10.1109/PRDC50213.2020.00014. Acceptance rate 40.9% (18/44). Abstract Publication BibTeX Citation
  5. George Ostrouchov, Don Maxwell, Rizwan Ashraf, Christian Engelmann, Mallikarjun Shankar, and James Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the 33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020, pages 41:1-14, Atlanta, GA, USA, November 15-20, 2020. ACM Press, New York, NY, USA. ISBN 9781728199986. DOI 10.1109/SC41405.2020.00045. Acceptance rate 25.1% (95/378). Abstract Publication Presentation BibTeX Citation
  6. Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, pages 392-407, Warsaw, Poland, August 24-28, 2020. Springer Verlag, Berlin, Germany. ISBN 978-3-030-57674-5. DOI 10.1007/978-3-030-57675-2_25. Acceptance rate 24.5% (39/159). Abstract Publication Presentation BibTeX Citation
  7. Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228). Abstract Publication BibTeX Citation
  8. Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 95-106, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00022. Acceptance rate 27.2% (62/228). Abstract Publication BibTeX Citation
  9. Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing. In Proceedings of the 9th ACM/SPEC International Conference on Performance Engineering (ICPE) 2018, pages 80-87, Berlin, Germany, April 9-13, 2018. ACM Press, New York, NY, USA. ISBN 978-1-4503-5095-2. DOI 10.1145/3184407.3184421. Acceptance rate 23.7% (14/59). Abstract Publication Presentation BibTeX Citation
  10. Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery. In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2018, pages 178-185, Cambridge, UK, March 21-23, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-4975-6. ISSN 2377-5750. DOI 10.1109/PDP2018.2018.00032. Acceptance rate 29.3% (27/92). Abstract Publication Presentation BibTeX Citation
  11. Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327). Abstract Publication Presentation BibTeX Citation
  12. Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, pages 22-31, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2764-8. ISSN 2375-0227. DOI 10.1109/MASCOTS.2017.12. Acceptance rate 30.95% (26/84). Abstract Publication BibTeX Citation
  13. Saurabh Hukerikar and Christian Engelmann. A Pattern Language for High-Performance Computing Resilience. In Proceedings of the 22nd European Conference on Pattern Languages of Programs (EuroPLoP) 2017, pages 12:1-12:16, Kloster Irsee, Germany, July 12-16, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-4848-5. DOI 10.1145/3147704.3147718. Abstract Publication BibTeX Citation
  14. Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Benchmark Generation and Simulation at Extreme Scale. In Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, pages 9-18, London, UK, September 21-23, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-3506-9. ISSN 1550-6525. DOI 10.1109/DS-RT.2016.18. Acceptance rate 42.0% (21/50). Best paper candidate. Abstract Publication Presentation BibTeX Citation
  15. Saurabh Hukerikar and Christian Engelmann. Havens: Explicit Reliable Memory Regions for HPC Applications. In Proceedings of the 20th IEEE High Performance Extreme Computing Conference (HPEC) 2016, pages 1-6, Waltham, MA, USA, September 13-15, 2016. IEEE Computer Society, Los Alamitos, CA, USA. DOI 10.1109/HPEC.2016.7761593. Abstract Publication Presentation BibTeX Citation
  16. Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016, pages 311-322, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. DOI 10.1109/DSN.2016.36. Acceptance rate 22.4% (58/259). Abstract Publication BibTeX Citation
  17. David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. DOI 10.1145/2925426.2926295. Acceptance rate 24.2% (43/178). Abstract Publication Presentation BibTeX Citation
  18. Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Extreme Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. DOI 10.1109/IPDPS.2016.100. Acceptance rate 23.0% (114/496). Abstract Publication Presentation BibTeX Citation
  19. Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0. DOI 10.2316/P.2016.834-005. Abstract Publication Presentation BibTeX Citation
  20. Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Scalable and Fault Tolerant Failure Detection and Consensus. In Proceedings of the 22nd European MPI Users' Group Meeting (EuroMPI) 2015, pages 13:1-13:9, Bordeaux, France, September 21-24, 2015. ACM Press, New York, NY, USA. ISBN 978-1-4503-3795-3. DOI 10.1145/2802658.2802660. Acceptance rate 48.3% (14/29). Abstract Publication Presentation BibTeX Citation
  21. Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2. DOI 10.2316/P.2015.826-043. Abstract Publication Presentation BibTeX Citation
  22. Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. DOI 10.1109/DS-RT.2014.32. Best paper candidate. Abstract Publication Presentation BibTeX Citation
  23. Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. DOI 10.1109/PDP.2014.74. Acceptance rate 32.6% (73/224). Abstract Publication Presentation BibTeX Citation
  24. Geoffroy Vallée, Thomas Naughton, Swen Böhm, and Christian Engelmann. A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools. In Proceedings of the 1st International Symposium on Computing and Networking – Across Practical Development and Theoretical Research – (CANDAR) 2013, pages 213-219, Matsuyama, Japan, December 4-6, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-2795-1. DOI 10.1109/CANDAR.2013.38. Acceptance rate 35.8% (28/78). Abstract Publication Presentation BibTeX Citation
  25. Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1. DOI 10.2316/P.2013.795-010. Abstract Publication Presentation BibTeX Citation
  26. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472). Abstract Publication Presentation BibTeX Citation
  27. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515). Abstract Publication Presentation BibTeX Citation
  28. Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. DOI 10.1109/IPDPS.2012.90. Acceptance rate 20.7% (118/569). Abstract Publication Presentation BibTeX Citation
  29. Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192. DOI 10.1109/PDP.2012.22. Abstract Publication Presentation BibTeX Citation
  30. Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. DOI 10.1109/HPCSim.2011.5999835. Acceptance rate 28.1% (48/171). Abstract Publication Presentation BibTeX Citation
  31. Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9. DOI 10.2316/P.2011.719-031. Abstract Publication Presentation BibTeX Citation
  32. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. DOI 10.1109/ICPADS.2010.48. Acceptance rate 29.6% (77/188). Abstract Publication Presentation BibTeX Citation
  33. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. DOI 10.1109/SC.2010.28. Acceptance rate 19.8% (50/253). Abstract Publication Presentation BibTeX Citation
  34. Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. DOI 10.1109/HPCC.2010.32. Acceptance rate 19.1% (58/304). Abstract Publication Presentation BibTeX Citation
  35. Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. DOI 10.2316/P.2010.676-024. Abstract Publication Presentation BibTeX Citation
  36. Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. DOI 10.1109/ARES.2009.105. Acceptance rate 25.0% (40/160). Abstract Publication BibTeX Citation
  37. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Evaluating the Shared Root File System Approach for Diskless High-Performance Computing Systems. In Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (LCI) 2009, Boulder, CO, USA, March 9-12, 2009. Abstract Publication Presentation BibTeX Citation
  38. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138). Abstract Publication Presentation BibTeX Citation
  39. Alessandro Valentini, Christian Di Biagio, Fabrizio Batino, Guido Pennella, Fabrizio Palma, and Christian Engelmann. High Performance Computing with Harness over InfiniBand. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 151-154, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.64. Acceptance rate 42.0% (58/138). Abstract Publication BibTeX Citation
  40. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0. Abstract Publication Presentation BibTeX Citation
  41. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277). Abstract Publication Presentation BibTeX Citation
  42. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active Replication for Dependent Services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.64. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation
  43. Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.171. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation
  44. Björn Könning, Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Virtualized Environments for the Harness High Performance Computing Workbench. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 133-140, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. DOI 10.1109/PDP.2008.14. Acceptance rate 40% (83/207). Abstract Publication Presentation BibTeX Citation
  45. Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, Hong H. Ong, and Stephen L. Scott. System-level Virtualization for High Performance Computing. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 636-643, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. DOI 10.1109/PDP.2008.85. Acceptance rate 40% (83/207). Abstract Publication Presentation BibTeX Citation
  46. Li Ou, Christian Engelmann, Xubin (Ben) He, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-703-1. Acceptance rate 49%. Abstract Publication Presentation BibTeX Citation
  47. Emanuele Di Saverio, Marco Cesati, Christian Di Biagio, Guido Pennella, and Christian Engelmann. Distributed Real-Time Computing with Harness. In Lecture Notes in Computer Science: Proceedings of the 14th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2007, pages 281-288, Paris, France, September 30 – October 3, 2007. Springer Verlag, Berlin, Germany. ISBN 978-3-540-75415-2. ISSN 0302-9743. DOI 10.1007/978-3-540-75416-9_39. Abstract Publication Presentation BibTeX Citation
  48. Li Ou, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. A Fast Delivery Protocol for Total Order Broadcasting. In Proceedings of the 16th IEEE International Conference on Computer Communications and Networks (ICCCN) 2007, pages 730-734, Honolulu, HI, USA, August 13-16, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-42441-251-8. ISSN 1095-2055. DOI 10.1109/ICCCN.2007.4317904. Acceptance rate 29.1% (160/550). Abstract Publication Presentation BibTeX Citation
  49. Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1145/1274971.1274978. Acceptance rate 23.6% (29/123). Abstract Publication Presentation BibTeX Citation
  50. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. On Programming Models for Service-Level High Availability. In Proceedings of the 2nd International Conference on Availability, Reliability and Security (ARES) 2007, pages 999-1006, Vienna, Austria, April 10-13, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2775-2. DOI 10.1109/ARES.2007.109. Acceptance rate 28.3% (60/212). Abstract Publication Presentation BibTeX Citation
  51. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1109/IPDPS.2007.370307. Acceptance rate 26% (109/419). Abstract Publication Presentation BibTeX Citation
  52. Kai Uhlemann, Christian Engelmann, and Stephen L. Scott. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. In Proceedings of the 8th IEEE International Conference on Cluster Computing (Cluster) 2006, pages 1-10, Barcelona, Spain, September 25-28, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 1-4244-0328-6. ISSN 1552-5244. DOI 10.1109/CLUSTR.2006.311855. Acceptance rate 33.1% (42/127). Abstract Publication Presentation BibTeX Citation
  53. Ronald Baumann, Christian Engelmann, and George A. (Al) Geist. A Parallel Plug-in Programming Paradigm. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on High Performance Computing and Communications (HPCC) 2006, pages 823-832, Munich, Germany, September 13-15, 2006. Springer Verlag, Berlin, Germany. ISBN 978-3-540-39368-9. ISSN 0302-9743. DOI 10.1007/11847366_85. Abstract Publication Presentation BibTeX Citation
  54. Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8. DOI 10.1145/1183401.1183433. Acceptance rate 26.2% (37/141). Abstract Publication Presentation BibTeX Citation
  55. Daniel I. Okunbor, Christian Engelmann, and Stephen L. Scott. Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems. In Proceedings of the 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006. Abstract Publication BibTeX Citation
  56. Kshitij Limaye, Chokchai (Box) Leangsuksun, Zeno Greenwood, Stephen L. Scott, Christian Engelmann, Richard M. Libby, and Kasidit Chanchio. Job-Site Level Fault Tolerance for Cluster and Grid Environments. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster) 2005, pages 1-9, Boston, MA, USA, September 26-30, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7803-9486-0. ISSN 1552-5244. DOI 10.1109/CLUSTR.2005.347043. Acceptance rate 39.6% (45/138). Abstract Publication BibTeX Citation
  57. Hertong Song, Chokchai (Box) Leangsuksun, Raja Nassar, Yudan Liu, Christian Engelmann, and Stephen L. Scott. UML-based Beowulf Cluster Availability Modeling. In International Conference on Software Engineering Research and Practice (SERP) 2005, pages 161-167, Las Vegas, NV, USA, June 27-30, 2005. CSREA Press. ISBN 1-932415-49-1. BibTeX Citation
  58. Christian Engelmann and George A. (Al) Geist. Super-Scalable Algorithms for Computing on 100,000 Processors. In Lecture Notes in Computer Science: Proceedings of the 5th International Conference on Computational Science (ICCS) 2005, Part I, pages 313-320, Atlanta, GA, USA, May 22-25, 2005. Springer Verlag, Berlin, Germany. ISBN 978-3-540-26032-5. ISSN 0302-9743. DOI 10.1007/11428831_39. Acceptance rate 35%. Abstract Publication Presentation BibTeX Citation

Symbols: Abstract, Publication, Presentation, BibTeX Citation