DOI: 10.1145/2148600.2148606
poster

Poster: scalable infrastructure to support supercomputer resiliency-aware applications and load balancing

Published: 12 November 2011

Abstract

High-performance computing systems are growing in complexity and component count. This trend exposes weaknesses in the underlying clustering infrastructure needed for continuous availability, high utilization, and efficient administration of such systems. To mitigate the problem, we present a highly scalable clustering infrastructure, based on peer-to-peer technologies, that supports resiliency-aware applications as well as efficient monitoring and load balancing. Supported services include membership, publish-subscribe messaging, convergecast, attribute replication, and a distributed hash table (DHT). We present a preliminary evaluation on an IBM BlueGene/P, demonstrating scalability up to ~256K nodes.
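
As an illustration of one of the listed services, convergecast can be thought of as combining a value from every node along a spanning tree toward a single root, for example to compute the total or maximum load for monitoring. The Python sketch below shows only that idea on a single machine. The tree layout, the function name `convergecast`, and the sample load values are assumptions made for illustration; they are not taken from the poster, whose infrastructure is distributed and peer-to-peer.

```python
from collections import defaultdict


def convergecast(parent, values, combine):
    """Aggregate per-node values toward the root of a spanning tree.

    parent:  dict mapping each node to its parent (the root maps to None)
    values:  dict mapping each node to its local value
    combine: associative, commutative function of two values
    """
    # Build the child lists from the parent pointers and locate the root.
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    def aggregate(node):
        # A node's result is its own value combined with its subtrees' results.
        result = values[node]
        for child in children[node]:
            result = combine(result, aggregate(child))
        return result

    return aggregate(root)


# Hypothetical 7-node tree rooted at node 0, with made-up load values.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
load = {0: 3, 1: 5, 2: 2, 3: 7, 4: 1, 5: 4, 6: 6}

total_load = convergecast(parent, load, lambda a, b: a + b)
max_load = convergecast(parent, load, max)
```

Because the combining function is associative and commutative, each subtree can be reduced independently before its result moves one level up, which is what lets this pattern scale in a real distributed setting.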

Supplementary Material

PDF File (post128.pdf)


Cited By

  • (2013) Manifesto of edge ICT fabric. 2013 17th International Conference on Intelligence in Next Generation Networks (ICIN), pages 9-15. DOI: 10.1109/ICIN.2013.6670888. Online publication date: Oct-2013
  • (2013) Design and implementation of a scalable membership service for supercomputer resiliency-aware runtime. Proceedings of the 19th international conference on Parallel Processing, pages 354-366. DOI: 10.1007/978-3-642-40047-6_37. Online publication date: 26-Aug-2013

Published In

SC '11 Companion: Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
November 2011
166 pages
ISBN:9781450310307
DOI:10.1145/2148600

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. membership
  3. middleware
  4. peer-to-peer
  5. pub/sub systems
  6. scalability

Qualifiers

  • Poster

Conference

SC '11

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
