Challenges and possible approaches: towards the petaflops computers

Depei Qian¹,² & Danfeng Zhu²

Abstract

In parallel with R&D efforts in the USA and Europe, China’s National High-tech R&D Program has set the goal of developing petaflops computers. Researchers and engineers worldwide are looking for appropriate methods and technologies to build a petaflops computer system. Based on a discussion of important design issues in developing a petaflops computer, this paper identifies the major technological challenges, including the memory wall, low-power system design, interconnects, and programming support. Current efforts to address some of these challenges and to pursue possible solutions for developing petaflops systems are presented. Several existing systems are briefly introduced as examples, including Roadrunner, Cray XT5 Jaguar, Dawning 5000A/6000, and Lenovo DeepComp 7000. Architectures proposed by Chinese researchers for implementing a petaflops computer are also introduced, and the advantages of these architectures as well as the difficulties in their implementation are discussed. Finally, future research directions in the development of high-productivity computing systems are discussed.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Depei Qian
School of Computer Science and Technology, Beihang University, Beijing, 100191, China
Depei Qian & Danfeng Zhu

Authors

Depei Qian
Danfeng Zhu

Corresponding author

Correspondence to Depei Qian.

About this article

Cite this article

Qian, D., Zhu, D. Challenges and possible approaches: towards the petaflops computers. Front. Comput. Sci. China 3, 273–289 (2009). https://doi.org/10.1007/s11704-009-0022-6

Received: 25 January 2009
Accepted: 20 June 2009
Published: 29 July 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s11704-009-0022-6
