research-article

Provably good multicore cache performance for divide-and-conquer algorithms

Authors:

Guy E. Blelloch,

Rezaul A. Chowdhury,

Phillip B. Gibbons,

Vijaya Ramachandran,

Michael Kozuch

SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms

Pages 501 - 510

Published: 20 January 2008

Abstract

This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L₁) caches and a large shared (L₂) cache on chip. We consider a broad class of parallel divide-and-conquer algorithms and present a new on-line scheduler, CONTROLLED-PDF, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicore-cache model under our new scheduler is within a constant factor of the sequential cache complexity for both L₁ and L₂, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptotically-optimal results for any multicore model. Finally, we show that a separator-based algorithm for sparse-matrix-dense-vector-multiply achieves provably good cache performance in the multicore-cache model, as well as in the well-studied sequential cache-oblivious model.
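To make the two-level structure described in the abstract concrete, the following is a minimal sketch of a simulator with p private L₁ caches and one shared L₂ cache that counts misses at each level. It is an illustration only, not the paper's formal model and not the CONTROLLED-PDF scheduler: the class names `LRUCache` and `MulticoreCache`, the LRU replacement policy, the single block size shared by both levels, and the rule that L₂ is consulted only on an L₁ miss are all assumptions introduced here.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of `capacity` blocks. LRU replacement is an assumption."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block id -> None, ordered oldest to newest
        self.misses = 0

    def access(self, block):
        """Touch `block`; return True on a hit, False on a miss."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        self.misses += 1
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return False


class MulticoreCache:
    """Hypothetical model: p private L1 caches in front of one shared L2 cache."""

    def __init__(self, p, l1_blocks, l2_blocks, block_size):
        self.block_size = block_size
        self.l1 = [LRUCache(l1_blocks) for _ in range(p)]
        self.l2 = LRUCache(l2_blocks)

    def access(self, proc, addr):
        """Processor `proc` reads `addr`; L2 is consulted only on an L1 miss."""
        block = addr // self.block_size
        if not self.l1[proc].access(block):
            self.l2.access(block)


# Two processors each scan the same 16 addresses (4 blocks of 4 words).
sim = MulticoreCache(p=2, l1_blocks=2, l2_blocks=8, block_size=4)
for proc in range(2):
    for addr in range(16):
        sim.access(proc, addr)

print([c.misses for c in sim.l1], sim.l2.misses)
```

In this toy run each processor misses once per block in its own L₁, while the second processor's L₁ misses are served by blocks the first processor already brought into the shared L₂. That is the kind of constructive sharing a scheduler for this setting would aim to preserve.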


Published In

SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms

January 2008, 1289 pages

Program Chair: Shang-Hua Teng, Boston University and Akamai Technologies, Inc.

Sponsors: SIAM Activity Group on Discrete Mathematics; SIGACT: ACM Special Interest Group on Algorithms and Computation Theory

Publisher: Society for Industrial and Applied Mathematics, United States

Conference: SODA08: 19th ACM-SIAM Symposium on Discrete Algorithms, January 20 - 22, 2008, San Francisco, California

Acceptance Rates: Overall Acceptance Rate 411 of 1,322 submissions, 31%
