Abstract
In parallel with R&D efforts in the USA and Europe, China’s National High-tech R&D Program has set the goal of developing petaflops computers. Researchers and engineers worldwide are looking for appropriate methods and technologies to build petaflops computer systems. Based on a discussion of important design issues in developing a petaflops computer, this paper raises the major technological challenges, including the memory wall, low-power system design, interconnects, and programming support. Current efforts to address some of these challenges and to pursue possible solutions for developing petaflops systems are presented. Several existing systems are briefly introduced as examples, including Roadrunner, Cray XT5 Jaguar, Dawning 5000A/6000, and Lenovo DeepComp 7000. Architectures proposed by Chinese researchers for implementing a petaflops computer are also introduced, and their advantages as well as the difficulties in implementing them are discussed. Finally, future research directions in the development of high-productivity computing systems are discussed.
Cite this article
Qian, D., Zhu, D. Challenges and possible approaches: towards the petaflops computers. Front. Comput. Sci. China 3, 273–289 (2009). https://doi.org/10.1007/s11704-009-0022-6
DOI: https://doi.org/10.1007/s11704-009-0022-6