Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3192366.3192402acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article
Public Access

Locality analysis through static parallel sampling

Published: 11 June 2018 Publication History

Abstract

Locality analysis is important since accessing memory is much slower than computing. Compile-time locality analysis can provide detailed program-level feedback for compilers or runtime systems faster than trace-based locality analysis.
In this paper, we describe a new approach to locality analysis based on static parallel sampling. A compiler analyzes loop-based code and generates sampler code which is run to measure locality. Our approach can predict precise cache line granularity miss ratio curves for complex loops with non-linear array references and even branches. The precision and overhead of static sampling are evaluated using PolyBench and a bit-reversal loop. Our result shows that by randomly sampling 2% of loop iterations, a compiler can construct almost exact miss ratio curves as trace based analysis. Sampling 0.5% and 1% iterations can achieve good precision and efficiency with an average 0.6% to 1% the time of tracing respectively. Our analysis can also be parallelized. The analysis may assist program optimization techniques such as tiling, program co-location, cache hint selection and help to analyze write locality and parallel locality.

Supplementary Material

WEBM File (p557-chen.webm)

References

[1]
Randy Allen and Ken Kennedy. 2001. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers.
[2]
Alexander I Barvinok. 1994. A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed. Mathematics of Operations Research 19, 4 (1994), 769–779.
[3]
Mohamed-Walid Benabderrahmane, Louis-Noël Pouchet, Albert Cohen, and Cédric Bastoul. 2010. The Polyhedral Model Is More Widely Applicable Than You Think. CC 6011 (2010), 283–303.
[4]
Kristof Beyls and Erik H. D’Hollander. 2005. Generating cache hints for improved program efficiency. Journal of Systems Architecture 51, 4 (2005), 223–250.
[5]
Kristof Beyls and Erik H. D’Hollander. 2006. Discovery of localityimproving refactoring by reuse path analysis. In Proceedings of High Performance Computing and Communications. Springer. Lecture Notes in Computer Science, Vol. 4208. 220–229.
[6]
Paul Bourke. 1993. DFT (Discrete Fourier Transform) FFT (Fast Fourier Transform). Internet, http://astronomy. swin. edu. au/˜ pbourke/analysis/dft (1993).
[7]
Jacob Brock, Chencheng Ye, Chen Ding, Yechen Li, Xiaolin Wang, and Yingwei Luo. 2015. Optimal cache partition-sharing. In Parallel Processing (ICPP), 2015 44th International Conference on. IEEE, 749–758.
[8]
Carlos Carvalho. 2002. The gap between processor and memory speeds. In Proc. of IEEE International Conference on Control and Automation.
[9]
Calin Cascaval, Evelyn Duesterwald, Peter F. Sweeney, and Robert W. Wisniewski. 2005. Multiple Page Size Modeling and Optimization. In Proceedings of PACT. 339–349.
[10]
Calin Cascaval and David A. Padua. 2003. Estimating cache misses and locality using stack distances. In Proceedings of ICS. 150–159.
[11]
Sanjay Chatterjee, Nick Vrvilo, Zoran Budimlic, Kathleen Knobe, and Vivek Sarkar. 2016. Declarative tuning for locality in parallel programs. In Parallel Processing (ICPP), 2016 45th International Conference on. IEEE, 452–457.
[12]
Arun Chauhan and Chun-Yu Shei. 2010. Static reuse distances for locality-based optimizations in MATLAB. In Proceedings of ICS. 295– 304.
[13]
Dong Chen, Fangzhou Liu, Chen Ding, and Chucheow Lim. 2017. POSTER: Static Reuse Time Analysis Using Dependence Distance. In International Workshop on Languages and Compilers for Parallel Computing. Springer.
[14]
Dong Chen, Chencheng Ye, and Chen Ding. 2016. Write Locality and Optimization for Persistent Memory. In Proceedings of the Second International Symposium on Memory Systems. ACM, 77–87.
[15]
Philippe Clauss. 2014. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In ACM International Conference on Supercomputing 25th Anniversary Volume. ACM, 237–244.
[16]
Stephanie Coleman and Kathryn S. McKinley. 1995. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of PLDI. 279–290.
[17]
George B Dantzig and B Curtis Eaves. 1973. Fourier-Motzkin elimination and its dual. Journal of Combinatorial Theory, Series A 14, 3 (1973), 288–297.
[18]
C. Ding and K. Kennedy. 2004. Improving effective bandwidth through compiler enhancement of global cache reuse. J. Parallel and Distrib. Comput. 64, 1 (2004), 108–134.
[19]
Johannes Doerfert, Clemens Hammacher, Kevin Streit, and Sebastian Hack. 2013. Spolly: speculative optimizations in the polyhedral model. IMPACT 2013 (2013), 55.
[20]
David Eklov and Erik Hagersten. 2010. StatStack: Efficient modeling of LRU caches. In Proceedings of ISPASS. 55–65.
[21]
Karim Esseghir. 1993. Improving data locality for caches. Ph.D. Dissertation. Rice University.
[22]
Naznin Fauzia. 2015. Characterization of Data Locality Potential of CPU and GPU Applications through Dynamic Analysis. Ph.D. Dissertation. The Ohio State University.
[23]
Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1997. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 11th international conference on Supercomputing. ACM, 317–324.
[24]
Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012. IEEE, 1–10.
[25]
Xiameng Hu, Xiaolin Wang, Yechen Li, Yingwei Luo, Chen Ding, and Zhenlin Wang. 2017. Optimal Symbiosis and Fair Scheduling in Shared Cache. IEEE TPDS 28, 4 (2017), 1134–1148.
[26]
Xiameng Hu, Xiaolin Wang, Lan Zhou, Yingwei Luo, Chen Ding, and Zhenlin Wang. 2016. Kinetic modeling of data eviction in cache. In 2016 USENIX Annual Technical Conference (USENIX ATC 16). USENIX Association, 351–364.
[27]
Ken Kennedy and Kathryn S McKinley. 1992. Optimizing for parallelism and data locality. In ACM International Conference on Supercomputing 25th Anniversary Volume. ACM, 151–162.
[28]
David J Kuck, Yoichi Muraoka, and Shyh-Ching Chen. 1972. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Trans. Comput. 100, 12 (1972), 1293–1310.
[29]
Leslie Lamport. 1974. The parallel execution of DO loops. Commun. ACM 17, 2 (1974), 83–93.
[30]
Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04). Palo Alto, California.
[31]
Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, and Xipeng Shen. 2016. Data-centric combinatorial optimization of parallel code. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 38.
[32]
Hao Luo, Pengcheng Li, and Chen Ding. 2017. Thread data sharing in cache: Theory and measurement. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 103–115.
[33]
R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. 1970. Evaluation techniques for storage hierarchies. IBM System Journal 9, 2 (1970), 78–117.
[34]
Miquel Moreto, Francisco J Cazorla, Alex Ramirez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In International Conference on High-Performance Embedded Architectures and Compilers. Springer, 337–352.
[35]
Preeti Ranjan Panda, Hiroshi Nakamura, Nikil D Dutt, and Alexandru Nicolau. 1999. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers 48, 2 (1999), 142–149.
[36]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6 (2013), 519–530.
[37]
Derek L. Schuff, Milind Kulkarni, and Vijay S. Pai. 2010. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of PACT. 53–64.
[38]
David K. Tam, Reza Azimi, Livio Soares, and Michael Stumm. 2009. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations. In Proceedings of ASPLOS. 121–132.
[39]
Xavier Vera, Josep Llosa, Antonio Gonzalez, and Carlos Ciuraneta. 2000. A fast implementation of cache miss equations. In Procs. of the 8th. Int. Workshop on Compilers for Parallel Computers. 319–326.
[40]
Xavier Vera and Jingling Xue. 2002. Let’s study whole-program cache behaviour analytically. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on. IEEE, 175–186.
[41]
Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2004. Analytical computation of Ehrhart polynomials: Enabling more compiler analyses and optimizations. In Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems. ACM, 248–258.
[42]
Xiaolin Wang, Yechen Li, Yingwei Luo, Xiameng Hu, Jacob Brock, Chen Ding, and Zhenlin Wang. 2015. Optimal footprint symbiosis in shared cache. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on. IEEE, 412–422.
[43]
Jake Wires, Stephen Ingram, Zachary Drudi, Nicholas JA Harvey, Andrew Warfield, and Coho Data. 2014. Characterizing storage workloads with counter stacks. In Proceedings of OSDI. USENIX Association, 335– 349.
[44]
Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In Proceedings of PLDI. 30–44.
[45]
Xiaoya Xiang, Bin Bao, Tongxin Bai, Chen Ding, and Trishul M. Chilimbi. 2011. All-window profiling and composable models of cache sharing. In Proceedings of PPoPP. 91–102.
[46]
Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. 2013. HOTL: a higher order theory of locality. In Proceedings of ASPLOS. 343–356.
[47]
Chencheng Ye, Jacob Brock, Chen Ding, and Hai Jin. 2017. Rochester elastic cache utility (recu): Unequal cache sharing is good economics. International Journal of Parallel Programming 45, 1 (2017), 30–44.
[48]
Chencheng Ye, Chen Ding, Hao Luo, Jacob Brock, Dong Chen, and Hai Jin. 2017. Cache Exclusivity and Sharing: Theory and Optimization. ACM Trans. on Arch. and Code Opt. 14, 4, 34:1–34:26.
[49]
Yutao Zhong and Wentao Chang. 2008. Sampling-based program locality approximation. In Proceedings of ISMM. 91–100.

Cited By

View all
  • (2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
  • (2024)Parallel Loop Locality Analysis for Symbolic Thread CountsProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques10.1145/3656019.3676948(219-232)Online publication date: 14-Oct-2024
  • (2023)E=MC^2: Efficient Mobility Centric CachingProceedings of the International Symposium on Memory Systems10.1145/3631882.3631892(1-5)Online publication date: 2-Oct-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2018
825 pages
ISBN:9781450356985
DOI:10.1145/3192366
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2018

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. Static analysis
  2. locality
  3. program specialization

Qualifiers

  • Research-article

Funding Sources

Conference

PLDI '18
Sponsor:

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)326
  • Downloads (Last 6 weeks)50
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
  • (2024)Parallel Loop Locality Analysis for Symbolic Thread CountsProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques10.1145/3656019.3676948(219-232)Online publication date: 14-Oct-2024
  • (2023)E=MC^2: Efficient Mobility Centric CachingProceedings of the International Symposium on Memory Systems10.1145/3631882.3631892(1-5)Online publication date: 2-Oct-2023
  • (2023)Hardware Counter-based Performance Analysis of ANUGA Flood SimulatorProceedings of the 2023 Fifteenth International Conference on Contemporary Computing10.1145/3607947.3608041(412-418)Online publication date: 3-Aug-2023
  • (2023)Cache Programming for Scientific Loops Using LeasesACM Transactions on Architecture and Code Optimization10.1145/360009020:3(1-25)Online publication date: 19-Jul-2023
  • (2022)Warping cache simulation of polyhedral programsProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523714(316-331)Online publication date: 9-Jun-2022
  • (2022)CARL: Compiler Assigned Reference LeasingACM Transactions on Architecture and Code Optimization10.1145/349873019:1(1-28)Online publication date: 17-Mar-2022
  • (2021)Writeback Modeling: Theory and Application to Zipfian WorkloadsProceedings of the International Symposium on Memory Systems10.1145/3488423.3519331(1-14)Online publication date: 27-Sep-2021
  • (2021)Mix and Match: Reorganizing Tasks for Enhancing Data LocalityProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/34600875:2(1-24)Online publication date: 4-Jun-2021
  • (2020)CLAM: Compiler Lease of Cache MemoryProceedings of the International Symposium on Memory Systems10.1145/3422575.3422800(281-296)Online publication date: 28-Sep-2020
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media