This research in cellular architectures is part of a Cooperative Research and Development Agreement (CRADA) between IBM and Oak Ridge National Laboratory (ORNL) to develop algorithms for the next generation of supercomputers. It focuses on the development of algorithms that are able to use a 100,000-processor machine efficiently and are capable of adapting to or simply surviving faults. Such huge computer systems, like the IBM Blue Gene/L, need to address existing problems in algorithm scalability and fault tolerance, which become more severe as system scale increases.
In a first step, the team at ORNL develops a simulator to emulate up to 5,000 virtual processors on a single real processor, solving a simple equation at the scale of 5,000 virtual processors. In a second step, the emulation is extended to 500,000 virtual processors on a cluster with 5 real processors, solving simple equations and performing advanced collective communication primitives.
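The idea of emulating many virtual processors inside one real process can be illustrated with a minimal sketch. It is not the ORNL simulator, whose design and choice of primitives are not described here. It assumes a power-of-two number of virtual processors and uses a recursive-doubling sum as a stand-in for a collective communication primitive; the function name `simulate_allreduce_sum` and the processor count in the usage lines are hypothetical.

```python
def simulate_allreduce_sum(values):
    """Emulate a recursive-doubling sum across len(values) virtual processors.

    Each list slot stands for the local memory of one virtual processor, and
    a "message" is modeled as a read from the partner's slot (an assumption
    made for illustration, not the simulator's actual mechanism).
    Returns the per-processor results and the number of communication rounds.
    """
    current = list(values)
    n = len(current)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two processor count")

    rounds = 0
    distance = 1
    while distance < n:
        # In each round, virtual processor i exchanges its partial sum with
        # partner i XOR distance, so the covered block doubles in size.
        current = [current[i] + current[i ^ distance] for i in range(n)]
        distance *= 2
        rounds += 1
    return current, rounds


if __name__ == "__main__":
    n_virtual = 4096  # hypothetical count of emulated processors
    result, rounds = simulate_allreduce_sum(range(n_virtual))
    # Every virtual processor should now hold the same global sum.
    print(result[0], rounds)
```

In this sketch all virtual processors hold the global sum after log2(n) rounds, which is the kind of scaling behavior such an emulation makes it possible to observe without access to the full machine.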
Funding Sources
- Laboratory Directed Research and Development, Oak Ridge National Laboratory
Participating Institutions
Peer-reviewed Conference Publications
- Christian Engelmann and George A. (Al) Geist. Super-Scalable Algorithms for Computing on 100,000 Processors. In Lecture Notes in Computer Science: Proceedings of the 5th International Conference on Computational Science (ICCS) 2005, Part I, pages 313-320, Atlanta, GA, USA, May 22-25, 2005. Springer Verlag, Berlin, Germany. ISBN 978-3-540-26032-5. ISSN 0302-9743. DOI 10.1007/11428831_39. Acceptance rate 35%.
Peer-reviewed Workshop Publications
- Christian Engelmann and George A. (Al) Geist. A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform. In Proceedings of the Challenges of Large Applications in Distributed Environments Workshop (CLADE) 2003, in conjunction with the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC) 2003, page 47, Seattle, WA, USA, June 21, 2003. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-1984-9. DOI xpls/abs_all.jsp?arnumber=4159902.
Talks and Lectures
- Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d'Analyse et d'Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008.
- Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Khon Kaen, Thailand, June 8, 2007.
- Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Bangkok, Thailand, June 4-5, 2007.
- Christian Engelmann. Diskless Checkpointing on Super-scale Architectures – Applied to the Fast Fourier Transform. Invited talk at the 11th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP) 2004, San Francisco, CA, USA, February 25, 2004.
- Christian Engelmann. Super-scalable Algorithms – Next Generation Supercomputing on 100,000 and more Processors. Seminar at the Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA, January 29, 2004.