DOI: 10.1145/232973.232981

COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors

Published: 01 May 1996

Abstract

Because of their growing number of components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly if they are to be used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMAs) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMAs by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. The proposed fault-tolerant COMA is evaluated with execution-driven simulations using some of the Splash applications. We show that, for the simulated architecture, the performance degradation caused by the fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, the results show that the proposed scheme preserves the scalability of the architecture and that the memory overhead remains low for parallel applications using mostly shared data.
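
The general idea of backward error recovery built on replicated recovery data can be illustrated with a small toy model. The sketch below is not the protocol of the paper: the class name `ToyRecoverableMemory`, its methods, the round-robin placement of copies, and the choice of exactly two recovery copies per item are all assumptions made for illustration. It only shows the principle that a recovery point survives a single node failure when each recovery copy lives on two distinct nodes, so that a rollback can restore the last checkpointed state.

```python
class ToyRecoverableMemory:
    """Toy model (hypothetical, not the paper's protocol): each checkpointed
    item keeps recovery copies on two distinct nodes."""

    def __init__(self, n_nodes: int) -> None:
        if n_nodes < 2:
            raise ValueError("need at least two nodes to hold two recovery copies")
        self.current = [dict() for _ in range(n_nodes)]   # working data per node
        self.recovery = [dict() for _ in range(n_nodes)]  # recovery copies per node
        self.alive = [True] * n_nodes

    def _live_nodes(self) -> list:
        return [i for i, ok in enumerate(self.alive) if ok]

    def write(self, node: int, addr: int, value) -> None:
        """The writing node becomes the only holder of the current copy."""
        if not self.alive[node]:
            raise RuntimeError(f"node {node} has failed")
        for mem in self.current:
            mem.pop(addr, None)          # invalidate other current copies
        self.current[node][addr] = value

    def read(self, addr: int):
        for i in self._live_nodes():
            if addr in self.current[i]:
                return self.current[i][addr]
        raise KeyError(addr)

    def checkpoint(self) -> None:
        """Establish a recovery point: two recovery copies of every current item."""
        live = self._live_nodes()
        if len(live) < 2:
            raise RuntimeError("fewer than two live nodes: cannot replicate")
        snapshot = {}
        for i in live:
            snapshot.update(self.current[i])
        for mem in self.recovery:
            mem.clear()                  # discard the previous recovery point
        for k, (addr, value) in enumerate(sorted(snapshot.items())):
            first = live[k % len(live)]
            second = live[(k + 1) % len(live)]  # always differs from `first`
            self.recovery[first][addr] = value
            self.recovery[second][addr] = value

    def fail(self, node: int) -> None:
        """Simulate a node failure: everything stored on the node is lost."""
        self.alive[node] = False
        self.current[node].clear()
        self.recovery[node].clear()

    def rollback(self) -> None:
        """Restore the last recovery point from surviving recovery copies."""
        live = self._live_nodes()
        restored = {}
        for i in live:
            restored.update(self.recovery[i])
        for mem in self.current:
            mem.clear()                  # modifications since the checkpoint are dropped
        for k, (addr, value) in enumerate(sorted(restored.items())):
            self.current[live[k % len(live)]][addr] = value
        self.checkpoint()                # re-create two copies on the surviving nodes


mem = ToyRecoverableMemory(3)
mem.write(0, 0x10, "a")
mem.write(1, 0x20, "b")
mem.checkpoint()
mem.write(1, 0x10, "dirty")   # modification made after the recovery point
mem.fail(1)                   # node 1 is lost together with its copies
mem.rollback()
print(mem.read(0x10), mem.read(0x20))  # prints "a b"
```

In this toy model the recovery copies are stored in a separate dictionary per node. The abstract instead describes reusing the standard replication mechanisms of the COMA and extending the coherence protocol to manage recovery data, which the sketch does not attempt to reproduce.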

Published In

ISCA '96: Proceedings of the 23rd annual international symposium on Computer architecture
May 1996
318 pages
ISBN:0897917863
DOI:10.1145/232973
  • ACM SIGARCH Computer Architecture News, Volume 24, Issue 2
    Special Issue: Proceedings of the 23rd annual international symposium on Computer architecture (ISCA '96)
    May 1996
    303 pages
    ISSN:0163-5964
    DOI:10.1145/232974

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Scalable Shared Memory Multiprocessors
  2. backward error recovery
  3. coherence protocol
  4. fault-tolerance

Qualifiers

  • Article

Conference

ISCA96: International Conference on Computer Architecture
May 22 - 24, 1996
Philadelphia, Pennsylvania, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 138
  • Downloads (Last 6 weeks): 22
Reflects downloads up to 18 Nov 2024

Cited By

  • (2020) On Providing OS Support to Allow Transparent Use of Traditional Programming Models for Persistent Memory. ACM Journal on Emerging Technologies in Computing Systems 16:3 (1-24). DOI: 10.1145/3388637. Online publication date: 23-Jun-2020
  • (2020) ACR: Amnesic Checkpointing and Recovery. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (30-43). DOI: 10.1109/HPCA47549.2020.00013. Online publication date: Feb-2020
  • (2015) ThyNVM. Proceedings of the 48th International Symposium on Microarchitecture (672-685). DOI: 10.1145/2830772.2830802. Online publication date: 5-Dec-2015
  • (2011) Rebound. ACM SIGARCH Computer Architecture News 39:3 (153-164). DOI: 10.1145/2024723.2000083. Online publication date: 4-Jun-2011
  • (2011) Rebound. Proceedings of the 38th annual international symposium on Computer architecture (153-164). DOI: 10.1145/2000064.2000083. Online publication date: 4-Jun-2011
  • (2006) SWICH. IEEE Micro 26:5 (28-40). DOI: 10.1109/MM.2006.100. Online publication date: 1-Sep-2006
  • (2006) ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers. The Twelfth International Symposium on High-Performance Computer Architecture, 2006 (203-214). DOI: 10.1109/HPCA.2006.1598129. Online publication date: 2006
  • (2003) Modeling and evaluating the time overhead induced by BER in COMA multiprocessors. Journal of Systems Architecture: the EUROMICRO Journal 48:13-15 (377-385). DOI: 10.1016/S1383-7621(03)00024-9. Online publication date: 1-May-2003
  • (2002) ReVive. Proceedings of the 29th annual international symposium on Computer architecture (111-122). DOI: 10.5555/545215.545228. Online publication date: 25-May-2002
  • (2002) ReVive. ACM SIGARCH Computer Architecture News 30:2 (111-122). DOI: 10.1145/545214.545228. Online publication date: 1-May-2002
