The performance advantages of integrating block data transfer in cache-coherent multiprocessors

Authors:

Steven Cameron Woo,

Jaswinder Pal Singh,

John L. Hennessy

ASPLOS VI: Proceedings of the sixth international conference on Architectural support for programming languages and operating systems

Pages 219 - 229

https://doi.org/10.1145/195473.195547

Published: 01 November 1994

Abstract

Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to be good candidates for using block transfer. Our conclusion is that the benefits of block transfer are not substantial for hardware cache-coherent multiprocessors. The main reasons for this are (i) the relatively modest fraction of time applications spend in communication amenable to block transfer, (ii) the difficulty of finding enough independent computation to overlap with the communication latency that remains after block transfer, and (iii) long cache lines often capture many of the benefits of block transfer in efficient cache-coherent machines. In the cases where block transfer improves performance, prefetching can often provide comparable, if not superior, performance benefits. We also examine the impact of varying important communication parameters and processor speed on the effectiveness of block transfer, and comment on useful features that a block transfer facility should support for real applications.


Published In

ASPLOS VI: Proceedings of the sixth international conference on Architectural support for programming languages and operating systems

November 1994

341 pages

ISBN:0897916603

DOI:10.1145/195473

Chairmen: Forest Baskett (Silicon Graphics), Douglas Clark (Princeton Univ.)

ACM SIGPLAN Notices, Volume 29, Issue 11, Nov. 1994, 323 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/195470. Editor: Richard L. Wexelblat (Washington D.C.)

ACM SIGOPS Operating Systems Review, Volume 28, Issue 5, Dec. 1994, 323 pages. ISSN: 0163-5980. DOI: 10.1145/381792. Chairman: Henry M. Levy (Univ. of Washington, Seattle)

Copyright © 1994 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS94

ASPLOS94: 6th Conference on Architectural Support of Programming Languages & Operating Systems

October 5 - 7, 1994

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
