DOI: 10.1145/195473.195547

The performance advantages of integrating block data transfer in cache-coherent multiprocessors

Published: 01 November 1994

Abstract

Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to be good candidates for using block transfer. Our conclusion is that the benefits of block transfer are not substantial for hardware cache-coherent multiprocessors. The main reasons for this are (i) the relatively modest fraction of time applications spend in communication amenable to block transfer, (ii) the difficulty of finding enough independent computation to overlap with the communication latency that remains after block transfer, and (iii) long cache lines often capture many of the benefits of block transfer in efficient cache-coherent machines. In the cases where block transfer improves performance, prefetching can often provide comparable, if not superior, performance benefits. We also examine the impact of varying important communication parameters and processor speed on the effectiveness of block transfer, and comment on useful features that a block transfer facility should support for real applications.
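Reason (i) can be read through Amdahl's law: if only a modest fraction of run time is communication that block transfer can accelerate, the overall gain is bounded no matter how fast the transfer becomes. This framing is ours, not a model stated in the abstract. The sketch below is a minimal illustration under that assumption; the function names, the fractions, and the acceleration factors are hypothetical and are not measurements from the paper.

```python
import math


def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of run time is accelerated by `factor`.

    `fraction` stands in for the share of execution time spent in
    communication amenable to block transfer (an assumed input).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def speedup_limit(fraction: float) -> float:
    """Upper bound on speedup as the acceleration factor grows without limit."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # Hypothetical fractions of time spent in block-transferable communication.
    for fraction in (0.05, 0.10, 0.20):
        print(
            f"fraction={fraction:.2f} "
            f"4x faster transfer: {amdahl_speedup(fraction, 4.0):.3f}x overall, "
            f"limit: {speedup_limit(fraction):.3f}x"
        )
```

The bound covers reason (i) only. It does not model overlap of communication with computation (reason ii) or the effect of long cache lines (reason iii).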

Published In

ASPLOS VI: Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
November 1994
341 pages
ISBN: 0897916603
DOI: 10.1145/195473
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS94
Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
