research-article

Public Access

Locality analysis through static parallel sampling

Authors:

Sreepathi PaiAuthors Info & Claims

PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 557 - 570

https://doi.org/10.1145/3192366.3192402

Published: 11 June 2018 Publication History

Abstract

Locality analysis is important since accessing memory is much slower than computing. Compile-time locality analysis can provide detailed program-level feedback for compilers or runtime systems faster than trace-based locality analysis.

In this paper, we describe a new approach to locality analysis based on static parallel sampling. A compiler analyzes loop-based code and generates sampler code which is run to measure locality. Our approach can predict precise cache line granularity miss ratio curves for complex loops with non-linear array references and even branches. The precision and overhead of static sampling are evaluated using PolyBench and a bit-reversal loop. Our result shows that by randomly sampling 2% of loop iterations, a compiler can construct almost exact miss ratio curves as trace based analysis. Sampling 0.5% and 1% iterations can achieve good precision and efficiency with an average 0.6% to 1% the time of tracing respectively. Our analysis can also be parallelized. The analysis may assist program optimization techniques such as tiling, program co-location, cache hint selection and help to analyze write locality and parallel locality.

Supplementary Material

WEBM File (p557-chen.webm)

Download
97.10 MB

References

[1]

Randy Allen and Ken Kennedy. 2001. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers.

[2]

Alexander I Barvinok. 1994. A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed. Mathematics of Operations Research 19, 4 (1994), 769–779.

Digital Library

[3]

Mohamed-Walid Benabderrahmane, Louis-Noël Pouchet, Albert Cohen, and Cédric Bastoul. 2010. The Polyhedral Model Is More Widely Applicable Than You Think. CC 6011 (2010), 283–303.

Digital Library

[4]

Kristof Beyls and Erik H. D’Hollander. 2005. Generating cache hints for improved program efficiency. Journal of Systems Architecture 51, 4 (2005), 223–250.

Digital Library

[5]

Kristof Beyls and Erik H. D’Hollander. 2006. Discovery of localityimproving refactoring by reuse path analysis. In Proceedings of High Performance Computing and Communications. Springer. Lecture Notes in Computer Science, Vol. 4208. 220–229.

Digital Library

[6]

Paul Bourke. 1993. DFT (Discrete Fourier Transform) FFT (Fast Fourier Transform). Internet, http://astronomy. swin. edu. au/˜ pbourke/analysis/dft (1993).

[7]

Jacob Brock, Chencheng Ye, Chen Ding, Yechen Li, Xiaolin Wang, and Yingwei Luo. 2015. Optimal cache partition-sharing. In Parallel Processing (ICPP), 2015 44th International Conference on. IEEE, 749–758.

Digital Library

[8]

Carlos Carvalho. 2002. The gap between processor and memory speeds. In Proc. of IEEE International Conference on Control and Automation.

[9]

Calin Cascaval, Evelyn Duesterwald, Peter F. Sweeney, and Robert W. Wisniewski. 2005. Multiple Page Size Modeling and Optimization. In Proceedings of PACT. 339–349.

Digital Library

[10]

Calin Cascaval and David A. Padua. 2003. Estimating cache misses and locality using stack distances. In Proceedings of ICS. 150–159.

Digital Library

[11]

Sanjay Chatterjee, Nick Vrvilo, Zoran Budimlic, Kathleen Knobe, and Vivek Sarkar. 2016. Declarative tuning for locality in parallel programs. In Parallel Processing (ICPP), 2016 45th International Conference on. IEEE, 452–457.

[12]

Arun Chauhan and Chun-Yu Shei. 2010. Static reuse distances for locality-based optimizations in MATLAB. In Proceedings of ICS. 295– 304.

Digital Library

[13]

Dong Chen, Fangzhou Liu, Chen Ding, and Chucheow Lim. 2017. POSTER: Static Reuse Time Analysis Using Dependence Distance. In International Workshop on Languages and Compilers for Parallel Computing. Springer.

[14]

Dong Chen, Chencheng Ye, and Chen Ding. 2016. Write Locality and Optimization for Persistent Memory. In Proceedings of the Second International Symposium on Memory Systems. ACM, 77–87.

Digital Library

[15]

Philippe Clauss. 2014. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In ACM International Conference on Supercomputing 25th Anniversary Volume. ACM, 237–244.

Digital Library

[16]

Stephanie Coleman and Kathryn S. McKinley. 1995. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of PLDI. 279–290.

Digital Library

[17]

George B Dantzig and B Curtis Eaves. 1973. Fourier-Motzkin elimination and its dual. Journal of Combinatorial Theory, Series A 14, 3 (1973), 288–297.

[18]

C. Ding and K. Kennedy. 2004. Improving effective bandwidth through compiler enhancement of global cache reuse. J. Parallel and Distrib. Comput. 64, 1 (2004), 108–134.

Digital Library

[19]

Johannes Doerfert, Clemens Hammacher, Kevin Streit, and Sebastian Hack. 2013. Spolly: speculative optimizations in the polyhedral model. IMPACT 2013 (2013), 55.

[20]

David Eklov and Erik Hagersten. 2010. StatStack: Efficient modeling of LRU caches. In Proceedings of ISPASS. 55–65.

[21]

Karim Esseghir. 1993. Improving data locality for caches. Ph.D. Dissertation. Rice University.

[22]

Naznin Fauzia. 2015. Characterization of Data Locality Potential of CPU and GPU Applications through Dynamic Analysis. Ph.D. Dissertation. The Ohio State University.

[23]

Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1997. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 11th international conference on Supercomputing. ACM, 317–324.

Digital Library

[24]

Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012. IEEE, 1–10.

[25]

Xiameng Hu, Xiaolin Wang, Yechen Li, Yingwei Luo, Chen Ding, and Zhenlin Wang. 2017. Optimal Symbiosis and Fair Scheduling in Shared Cache. IEEE TPDS 28, 4 (2017), 1134–1148.

Digital Library

[26]

Xiameng Hu, Xiaolin Wang, Lan Zhou, Yingwei Luo, Chen Ding, and Zhenlin Wang. 2016. Kinetic modeling of data eviction in cache. In 2016 USENIX Annual Technical Conference (USENIX ATC 16). USENIX Association, 351–364.

Digital Library

[27]

Ken Kennedy and Kathryn S McKinley. 1992. Optimizing for parallelism and data locality. In ACM International Conference on Supercomputing 25th Anniversary Volume. ACM, 151–162.

Digital Library

[28]

David J Kuck, Yoichi Muraoka, and Shyh-Ching Chen. 1972. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Trans. Comput. 100, 12 (1972), 1293–1310.

Digital Library

[29]

Leslie Lamport. 1974. The parallel execution of DO loops. Commun. ACM 17, 2 (1974), 83–93.

Digital Library

[30]

Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04). Palo Alto, California.

Digital Library

[31]

Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, and Xipeng Shen. 2016. Data-centric combinatorial optimization of parallel code. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 38.

Digital Library

[32]

Hao Luo, Pengcheng Li, and Chen Ding. 2017. Thread data sharing in cache: Theory and measurement. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 103–115.

Digital Library

[33]

R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. 1970. Evaluation techniques for storage hierarchies. IBM System Journal 9, 2 (1970), 78–117.

Digital Library

[34]

Miquel Moreto, Francisco J Cazorla, Alex Ramirez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In International Conference on High-Performance Embedded Architectures and Compilers. Springer, 337–352.

Digital Library

[35]

Preeti Ranjan Panda, Hiroshi Nakamura, Nikil D Dutt, and Alexandru Nicolau. 1999. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers 48, 2 (1999), 142–149.

Digital Library

[36]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6 (2013), 519–530.

Digital Library

[37]

Derek L. Schuff, Milind Kulkarni, and Vijay S. Pai. 2010. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of PACT. 53–64.

Digital Library

[38]

David K. Tam, Reza Azimi, Livio Soares, and Michael Stumm. 2009. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations. In Proceedings of ASPLOS. 121–132.

Digital Library

[39]

Xavier Vera, Josep Llosa, Antonio Gonzalez, and Carlos Ciuraneta. 2000. A fast implementation of cache miss equations. In Procs. of the 8th. Int. Workshop on Compilers for Parallel Computers. 319–326.

[40]

Xavier Vera and Jingling Xue. 2002. Let’s study whole-program cache behaviour analytically. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on. IEEE, 175–186.

Digital Library

[41]

Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2004. Analytical computation of Ehrhart polynomials: Enabling more compiler analyses and optimizations. In Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems. ACM, 248–258.

Digital Library

[42]

Xiaolin Wang, Yechen Li, Yingwei Luo, Xiameng Hu, Jacob Brock, Chen Ding, and Zhenlin Wang. 2015. Optimal footprint symbiosis in shared cache. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on. IEEE, 412–422.

Digital Library

[43]

Jake Wires, Stephen Ingram, Zachary Drudi, Nicholas JA Harvey, Andrew Warfield, and Coho Data. 2014. Characterizing storage workloads with counter stacks. In Proceedings of OSDI. USENIX Association, 335– 349.

Digital Library

[44]

Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In Proceedings of PLDI. 30–44.

Digital Library

[45]

Xiaoya Xiang, Bin Bao, Tongxin Bai, Chen Ding, and Trishul M. Chilimbi. 2011. All-window profiling and composable models of cache sharing. In Proceedings of PPoPP. 91–102.

Digital Library

[46]

Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. 2013. HOTL: a higher order theory of locality. In Proceedings of ASPLOS. 343–356.

Digital Library

[47]

Chencheng Ye, Jacob Brock, Chen Ding, and Hai Jin. 2017. Rochester elastic cache utility (recu): Unequal cache sharing is good economics. International Journal of Parallel Programming 45, 1 (2017), 30–44.

Digital Library

[48]

Chencheng Ye, Chen Ding, Hao Luo, Jacob Brock, Dong Chen, and Hai Jin. 2017. Cache Exclusivity and Sharing: Theory and Optimization. ACM Trans. on Arch. and Code Opt. 14, 4, 34:1–34:26.

Digital Library

[49]

Yutao Zhong and Wentao Chang. 2008. Sampling-based program locality approximation. In Proceedings of ISMM. 91–100.

Digital Library

Cited By

Pitchanathan AGrover KGrosser T(2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656452
Liu FZhu YSun SDing CSmith WHosseini K(2024)Parallel Loop Locality Analysis for Symbolic Thread CountsProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques10.1145/3656019.3676948(219-232)Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3676948
Ding CReber BPatru DKneipp AFigorito M(2023)E=MC^2: Efficient Mobility Centric CachingProceedings of the International Symposium on Memory Systems10.1145/3631882.3631892(1-5)Online publication date: 2-Oct-2023
https://dl.acm.org/doi/10.1145/3631882.3631892
Show More Cited By

Index Terms

Locality analysis through static parallel sampling
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Recommendations

Locality analysis through static parallel sampling
PLDI '18

Locality analysis is important since accessing memory is much slower than computing. Compile-time locality analysis can provide detailed program-level feedback for compilers or runtime systems faster than trace-based locality analysis.

In this paper, ...
Locality-Driven Dynamic GPU Cache Bypassing
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of ...
Coordinated Multilevel Buffer Cache Management with Consistent Access Locality Quantification

This paper proposes a protocol for effective coordinated buffer cache management in a multilevel cache hierarchy typical of a client/server system. Currently, such cache hierarchies are managed suboptimally—decisions about block placement and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2018

825 pages

ISBN:9781450356985

DOI:10.1145/3192366

General Chair:
Jeffrey S. Foster
University of Maryland at College Park, USA
,
Program Chair:
Dan Grossman
University of Washington, USA

ACM SIGPLAN Notices Volume 53, Issue 4
PLDI '18
April 2018
834 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3296979
Editor:
Matthew Fluet
Rodchester Institude of Technology
Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Author Tags

Qualifiers

Research-article

Funding Sources

National Science Foundation
Chinese Scholarship Council

Conference

PLDI '18

Sponsor:

SIGPLAN

PLDI '18: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 18 - 22, 2018

PA, Philadelphia, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

22
Total Citations
View Citations
1,099
Total Downloads

Downloads (Last 12 months)326
Downloads (Last 6 weeks)50

Reflects downloads up to 10 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Pitchanathan AGrover KGrosser T(2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656452
Liu FZhu YSun SDing CSmith WHosseini K(2024)Parallel Loop Locality Analysis for Symbolic Thread CountsProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques10.1145/3656019.3676948(219-232)Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3676948
Ding CReber BPatru DKneipp AFigorito M(2023)E=MC^2: Efficient Mobility Centric CachingProceedings of the International Symposium on Memory Systems10.1145/3631882.3631892(1-5)Online publication date: 2-Oct-2023
https://dl.acm.org/doi/10.1145/3631882.3631892
Jadhav HAgrawal NJoshi SShenoi V(2023)Hardware Counter-based Performance Analysis of ANUGA Flood SimulatorProceedings of the 2023 Fifteenth International Conference on Contemporary Computing10.1145/3607947.3608041(412-418)Online publication date: 3-Aug-2023
https://dl.acm.org/doi/10.1145/3607947.3608041
Reber BGould MKneipp ALiu FPrechtl IDing CChen LPatru D(2023)Cache Programming for Scientific Loops Using LeasesACM Transactions on Architecture and Code Optimization10.1145/360009020:3(1-25)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3600090
Morelli CReineke JJhala RDillig I(2022)Warping cache simulation of polyhedral programsProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523714(316-331)Online publication date: 9-Jun-2022
https://dl.acm.org/doi/10.1145/3519939.3523714
Ding CChen DLiu FReber BSmith W(2022)CARL: Compiler Assigned Reference LeasingACM Transactions on Architecture and Code Optimization10.1145/349873019:1(1-28)Online publication date: 17-Mar-2022
https://dl.acm.org/doi/10.1145/3498730
Smith WByrne DDing C(2021)Writeback Modeling: Theory and Application to Zipfian WorkloadsProceedings of the International Symposium on Memory Systems10.1145/3488423.3519331(1-14)Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3488423.3519331
Tang XKandemir MKarakoy M(2021)Mix and Match: Reorganizing Tasks for Enhancing Data LocalityProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/34600875:2(1-24)Online publication date: 4-Jun-2021
https://dl.acm.org/doi/10.1145/3460087
Prechtl IReber BDing CPatru DChen D(2020)CLAM: Compiler Lease of Cache MemoryProceedings of the International Symposium on Memory Systems10.1145/3422575.3422800(281-296)Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1145/3422575.3422800
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents