Evaluation of Hardware Data Prefetchers on Server Processors

Published: 18 June 2019

Abstract

Data prefetching, i.e., the act of predicting an application’s future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. Its value is evident to both industry and academia: almost every high-performance processor now incorporates a few data prefetchers to capture the various access patterns of applications, and the research literature contains a myriad of data prefetching proposals, each of which improves the efficiency of prefetching in a specific way.
In this survey, we evaluate the effectiveness of data prefetching in the context of server applications and shed light on its design trade-offs. To do so, we choose a target architecture based on a contemporary server processor and stack various state-of-the-art data prefetchers on top of it. We analyze the prefetchers in terms of their ability to predict memory accesses and enhance overall system performance, as well as their imposed overheads. Finally, by comparing the state-of-the-art prefetchers with impractical ideal prefetchers, we motivate further work on improving data prefetching techniques.

Published In

ACM Computing Surveys  Volume 52, Issue 3
May 2020
734 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3341324
  • Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2019
Accepted: 01 February 2019
Revised: 01 December 2018
Received: 01 August 2017
Published in CSUR Volume 52, Issue 3

Author Tags

  1. Data prefetching
  2. spatio-temporal correlation
  3. scale-out workloads
  4. server processors

Qualifiers

  • Survey
  • Research
  • Refereed

Funding Sources

  • Iran National Science Foundation (INSF)

Article Metrics

  • Downloads (Last 12 months): 189
  • Downloads (Last 6 weeks): 20
Reflects downloads up to 24 Nov 2024

Cited By

  • (2024) LSTM-CRP: Algorithm-Hardware Co-Design and Implementation of Cache Replacement Policy Using Long Short-Term Memory. Big Data and Cognitive Computing 8:10 (140). DOI: 10.3390/bdcc8100140. Online publication date: 21-Oct-2024
  • (2023) Optimizing performance and energy across problem sizes through a search space exploration and machine learning. Journal of Parallel and Distributed Computing 180 (104720). DOI: 10.1016/j.jpdc.2023.104720. Online publication date: Oct-2023
  • (2022) ReSemble. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.5555/3571885.3571992. Online publication date: 13-Nov-2022
  • (2022) BIOS-Based Server Intelligent Optimization. Sensors 22:18 (6730). DOI: 10.3390/s22186730. Online publication date: 6-Sep-2022
  • (2022) SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing (1-12). DOI: 10.1145/3545008.3545044. Online publication date: 29-Aug-2022
  • (2022) ReSemble: Reinforced Ensemble Framework for Data Prefetching. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.1109/SC41404.2022.00086. Online publication date: Nov-2022
  • (2021) Highly Concurrent Latency-tolerant Register Files for GPUs. ACM Transactions on Computer Systems 37:1-4 (1-36). DOI: 10.1145/3419973. Online publication date: 4-Jan-2021
  • (2021) Computer Aided Framework and Data Evaluation for Ceramic Product Design. 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV) (846-850). DOI: 10.1109/ICICV50876.2021.9388640. Online publication date: 4-Feb-2021
  • (2021) HyperData: A Data Transfer Accelerator for Software Data Planes Based on Targeted Prefetching. 2021 IEEE 39th International Conference on Computer Design (ICCD) (326-334). DOI: 10.1109/ICCD53106.2021.00059. Online publication date: Oct-2021
  • (2020) BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (996-1008). DOI: 10.1109/MICRO50266.2020.00084. Online publication date: Oct-2020
