The Cache Complexity of Multithreaded Cache Oblivious Algorithms

Matteo Frigo¹ &
Volker Strumpen¹

281 Accesses
31 Citations
Explore all metrics

Abstract

We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs $O(n^{3}/\sqrt{ Z}+(Pn)^{1/3}n^{2})$ cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring $O(n^{2}/Z+n+\sqrt{ Pn^{3+\epsilon}})$ cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring $O(n^{2}/Z+\sqrt{ Pn^{3.58}})$ cache misses with high probability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Guiding the Optimization of Parallel Codes on Multicores Using an Analytical Cache Model

Task Scheduling on Manycore Processors with Home Caches

Fast, Dynamically-Sized Concurrent Hash Table

References

Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. Theory Comput. Syst. 35(3), 321–347 (2002)
Article MATH MathSciNet Google Scholar
Belady, L.A.: A study of replacement algorithms for virtual storage computers. IBM Syst. J. 5(2), 78–101 (1966)
Article Google Scholar
Bender, M.A., Fineman, J.T., Gilbert, S., Kuszmaul, B.C.: Concurrent cache-oblivious B-trees. In: 17th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 228–237, Las Vegas, NV, July 2005
Bilardi, G., Preparata, F.P.: Upper bounds to processor-time tradeoffs under bounded-speed message propagation. In: 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 185–194, Santa Barbara, CA, July 1995
Blelloch, G.E., Gibbons, P.B.: Effectively sharing a cache among threads. In: 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 235–244, Barcelona, Spain, June 2004
Blelloch, G.E., Gibbons, P.B., Matias, Y.: Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46(2), 281–321 (1999)
Article MATH MathSciNet Google Scholar
Blumofe, R.D., Frigo, M., Joerg, C.F., Leiserson, C.E., Randall, K.H.: An analysis of dag-consistent distributed shared-memory algorithms. In: 8th Annunal ACM Symposium on Parallel Algorithms and Architectures, pp. 297–308, Padua, Italy, June 1996
Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. In: 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 207–216, Santa Barbara, CA, July 1995
Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46(5), 720–748 (1999)
Article MATH MathSciNet Google Scholar
Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM 21(2), 201–206 (1974)
Article MATH MathSciNet Google Scholar
Chowdhury, R.A., Ramachandran, V.: Cache-oblivious dynamic programming. In: 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 591–600, Miami, FL, 2006
Chowdhury, R.A., Ramachandran, V.: The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’07), pp. 71–80. Assoc. Comput. Mach., New York (2007)
Chapter Google Scholar
Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press, Cambridge (1990)
MATH Google Scholar
Culler, D.E., Karp, R.M., Patterson, D.A., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: towards a realistic model of parallel computation. In: 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1–12, San Diego, CA, May 1993
Fortune, S.J., Wyllie, J.: Parallelism in random access machines. In: 10th Annual ACM Symposium on Theory of Computing, pp. 114–118, San Diego, CA, May 1978
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache oblivious algorithms. In: 40th Annual Symposium on Foundations of Computer Science, pp. 285–298, New York, USA, October 1999
Frigo, M., Randall, K.H., Leiserson, C.E.: The implementation of the Cilk-5 multithreaded language. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998
Frigo, M., Strumpen, V.: Cache oblivious stencil computations. In: International Conference on Supercomputing, pp. 361–366, Boston, MA, June 2005
Gibbons, P.B., Matias, Y., Ramachandran, V.: Can a shared-memory model serve as a bridging model for parallel computation? Theory Comput. Syst. 32(3), 327–359 (1999)
Article MATH MathSciNet Google Scholar
Golub, G.H., van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins Press, Baltimore (1996)
MATH Google Scholar
Graham, R.L.: Bounds for certain multiprocessing anomalies. Bell Syst. Tech. J. 45, 1563–1581 (1966)
Google Scholar
Graham, R.L., Knuth, D.E., Patashnik, O.: Concrete Mathematics. Addison-Wesley, Reading (1994)
MATH Google Scholar
Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequences. Commun. ACM 18(6), 341–343 (1975)
Article MATH MathSciNet Google Scholar
Hong, J.-W., Kung, H.T.: I/O complexity: the red-blue pebbling game. In: 13th Annual ACM Symposium on Theory of Computing, pp. 326–333, Milwaukee, Wisconsin, May 1981
Leiserson, C.E.: Minicourse on multithreaded programming in Cilk, lecture 2: analysis of Cilk algorithms. Summer School on Language-Based Techniques for Concurrent and Distributed Software at the University of Oregon (2006)
Narlikar, G.J., Blelloch, G.E.: Space-efficient scheduling of nested parallelism. ACM Trans. Program. Lang. Syst. 21(1), 138–173 (1999)
Article Google Scholar
Sleator, D.D., Tarjan, R.E.: Amortized efficiency of list update and paging rules. Commun. ACM 28(2), 202–208 (1985)
Article MathSciNet Google Scholar
Strumpen, V., Frigo, M.: Software engineering aspects of cache oblivious stencil computations. Technical Report RC24035, IBM Research, August 2006
Toledo, S.: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18(4), 1065–1081 (1997)
Article MATH MathSciNet Google Scholar
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Article Google Scholar

Download references

Author information

Authors and Affiliations

IBM Austin Research Laboratory, 11501 Burnet Road, Austin, TX, 78758, USA
Matteo Frigo & Volker Strumpen

Authors

Matteo Frigo
View author publications
You can also search for this author in PubMed Google Scholar
Volker Strumpen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Matteo Frigo.

Additional information

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Frigo, M., Strumpen, V. The Cache Complexity of Multithreaded Cache Oblivious Algorithms. Theory Comput Syst 45, 203–233 (2009). https://doi.org/10.1007/s00224-007-9098-2

Download citation

Received: 29 September 2006
Accepted: 22 November 2007
Published: 14 December 2007
Issue Date: August 2009
DOI: https://doi.org/10.1007/s00224-007-9098-2

The Cache Complexity of Multithreaded Cache Oblivious Algorithms

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Guiding the Optimization of Parallel Codes on Multicores Using an Analytical Cache Model

Task Scheduling on Manycore Processors with Home Caches

Fast, Dynamically-Sized Concurrent Hash Table

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

The Cache Complexity of Multithreaded Cache Oblivious Algorithms

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Guiding the Optimization of Parallel Codes on Multicores Using an Analytical Cache Model

Task Scheduling on Manycore Processors with Home Caches

Fast, Dynamically-Sized Concurrent Hash Table

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation