Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Fast Key-Value Lookups with Node Tracker

Published: 08 June 2021 Publication History

Abstract

Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard-to-predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for eliminating the branch mispredictions only a small fraction of the performance benefit is achieved through prefetching alone. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.

References

[1]
Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous memory access chaining. Proc. VLDB Endow. 9, 4 (Dec. 2015), 252–263.
[2]
Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry. 2007. Improving hash join performance through prefetching. ACM Trans. Database Syst. 32, 3 (Aug. 2007), Article 17.
[3]
Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, and Pradeep Dubey. 2010. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In Proceedings of the Special Interest Group on Management of Data Conference (SIGMOD’10).
[4]
Changhee Jung, Daeseob Lim, Jaejin Lee, and Y. Solihin. 2006. Helper thread prefetching for loosely-coupled multiprocessor systems. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium.
[5]
S. Ainsworth and T. M. Jones. An event-triggered programmable prefetcher for irregular workloads. ACM SIGPLAN Not. 53, 2 (Mar. 2018), 578--592.
[6]
J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA’15), 105–117.
[7]
AMD. 2017. Software Optimization Guide for AMD Family 17h Processors, 2.8.1.6. https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf.
[8]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News 39, 2 (Aug. 2011), 1–7.
[9]
WikiChip Fuse. 2019. Intel Sunny Cove Core to Deliver a Major Improvement in Single-Thread Performance, Bigger Improvements to Follow. Retrieved from https://fuse.wikichip.org/news/2371/intel-sunny-cove-core-to-deliver-a-major-improvement-in-single-thread-performance-bigger-improvements-to-follow/.
[10]
AMD. 2019. Retrieved from https://www.amd.com/en/products/epyc-server.
[11]
Mustafa Cavus, Resit Sendag, and Joshua J. Yi. 2018. Array tracking prefetcher for indirect accesses. In Proceedings of the IEEE International Conference on Computer Design (ICCD’18). 132--139.
[12]
M. Hashemi, O. Mutlu, and Y. N. Patt. 2016. Continuous runahead: Transparent hardware acceleration for memory intensive workloads. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.
[13]
Mingxing Tan, Xianhua Liu, Tong Tong, and Xu Cheng. 2012. CVP: An energy-efficient indirect branch prediction with compiler-guided value pattern. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 111–120.
[14]
A. Mukkara, N. Beckmann, M. Abeydeera, X. Ma, and D. Sanchez. 2018. Exploiting locality in graph analytics through hardware-accelerated traversal scheduling. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–14.
[15]
Andre Seznec. 2016. TAGE-SC-L Branch Predictors Again. In Proceedings of the 5th Championship on Branch Prediction.
[16]
P. Michaud. 2016. Best-offset hardware prefetching. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA’16). 469–480.
[17]
J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti. 2016. Path confidence based lookahead prefetching. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12.
[18]
A. Jain and C. Lin. 2013. Linearizing irregular memory accesses for improved correlated prefetching. Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 247--259.
[19]
William Pugh. 1990. Concurrent Maintenance of Skip Lists. Technical Report. University of Maryland at College Park, College Park, MD, 1990.
[20]
S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 469–480.
[21]
Sparsh Mittal. 2016. A survey of recent prefetching techniques for processor caches. ACM Comput. Surv. 49, 2, Article 35 (Aug. 2016), 35 pages.
[22]
Y. Ishii, M. Inaba, and K. Hiraki. 2009. Access map pattern matching for data cache prefetch. In Proceedings of the International Conference on Supercomputing (ICS’09). 499–500.
[23]
M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti. 2015. Efficiently prefetching complex address patterns. In Proceedings of the International Symposium on Microarchitecture (MICRO). 141–152.
[24]
M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad. 2019. Bingo spatial data prefetcher. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA’19), 399–411.
[25]
S. Bakalapati and B. Panda. 2019. Bouquet of instruction pointers: Instruction pointer classifier based hardware prefetching. InThird Data Prefetching Competition. ISCA 2019. IEEE Press, 118--131.
[26]
Retrieved from https://dpc3.compas.cs.stonybrook.edu.
[27]
David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul Gratz, and Daniel Jimenez. 2014. B-Fetch: Branch prediction directed prefetching for chip-multiprocessors. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Los Alamitos, CA, 623–634.
[28]
Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When prefetching works, when it doesn’t, and why. ACM Trans. Archit. Code Optim. 9, 1 (Mar. 2012), Article 2.
[29]
M. Lipasti, W. Schmidt, S. Kunkel, and R. Roediger. 1995. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 232–236.
[30]
Luk and Mowry. 1996. Compiler-based prefetching for recursive data structures. ACM SIGOPS Rev. 30, 5 (1996), 222–233.
[31]
David Callahan, Ken Kennedy, and Allan Porterfield. 1991. Software prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). ACM, New York, NY, 40–52.
[32]
Mowry. 1994. Tolerating Latency through Software-controlled Data Prefetching. Ph.D. Dissertation, Stanford University, 1994.
[33]
Youfeng Wu, Mauricio Serrano, Rakesh Krishnaiyer, Wei Li, and Jesse Fang1. 2002. Value profile guided stride prefetching for irregular code. In Compiler Construction. Springer, 307–324.
[34]
Ainsworth and Jones. 2017. Software prefetching for indirect memory accesses. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’17). IEEE Press, 305--317.
[35]
Kim and D. Yeung. 2002. Design and evaluation of compiler algorithms for pre-execution. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’02). 159--170.
[36]
E. Lau, J. E. Miller, I. Choi, D. Yeung, S. Amarasinghe, and A. Agarwal. 2011. Multicore performance optimization using partner cores. Proceedings of the USENIX Conference on Hot Topics in Parallelism (HotPar’11).
[37]
T. J. Ham, J. L. Aragón, and M. Martonosi. 2015. DeSC: Decoupled supply-compute communication management for heterogeneous architectures. Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’15). 191--203.
[38]
I. Ganusov and M. Burtscher. 2006. Efficient emulation of hardware prefetchers via event-driven helper threading. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’06).
[39]
C.-H. Ho, S. J. Kim, and K. Sankaralingam. 2015. Efficient execution of memory access phases using datafow specialization. In Proceedings of the International Symposium on Computer Architecture (ISCA’15). 118--130.
[40]
S. Kumar, A. Shriraman, V. Srinivasan, D. Lin, and J. Phillips. 2014. Sqrl: Hardware accelerator for collecting software data structures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 475--476.
[41]
S. Kumar, N. Vedula, A. Shriraman, and V. Srinivasan. 2015. Dasx: Hardware accelerator for software data structures. In Proceedings of the International Conference on Supercomputing (ICS’15). 361--372.
[42]
O. Kocberber, B. Grot, J. Picorel, B. Falsaf., K. Lim, and P. Ranganathan. 2013. Meet the walkers: Accelerating index traversals for inmemory databases. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO’13).
[43]
E. Ebrahimi, O. Mutlu, and Y. N. Patt. 2009. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In Proceedings of the 15th International Symposium on High Performance Computing Architecture (HPCA’09), 7–17.
[44]
A. Roth, A. Moshovos, and G. S. Sohi. 1998. Dependence based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectual Support for Programming Languages and Operating Systems (ASPLOS’98), 115–126.
[45]
C.-L. Yang and A. R. Lebeck. 2000. Push vs. pull: Data movement for linked data structures. In Proceedings of the 14th International Conference on Supercomputing (ICS’00). 176–186.
[46]
J. Collins, S. Sair, B. Calder, and D. Tullsen. 2002. Pointer cache assisted prefetching. In Proceedings of the 35th International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 62–73.
[47]
Amir Roth and Gurindar S. Sohi. 1999. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA’99). IEEE Computer Society, Los Alamitos, CA, 111–121.
[48]
Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. 2003. Guided region prefetching: A cooperative hardware/software approach. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03). ACM, New York, NY, 388–398.
[49]
Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect memory prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, 178—190.
[50]
Daniel Jimenez. 2016. Multiperspective perceptron predictor. Championship Branch Prediction Competition (2016).
[51]
H. Gao et al. 2008. Address-branch correlation: A novel locality for long-latency hard-to-predict branches. Proceedings of the International Symposium on High Performance Computer Architecture (HPCA-14). 74--85.
[52]
Muawya Al-Otoom, Elliott Forbes, and Eric Rotenberg. 2010. EXACT: Explicit dynamic-branch prediction with active updates. In Proceedings of the 7th ACM International Conference on Computing Frontiers (CF’10). ACM, New York, NY, 165–176.
[53]
M. U. Farooq, Khubaib, and L. K. John. 2013. Store-load-branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA’13), 59–70.
[54]
Lloyd, G. Scott and Maya Gokhale. 2017. Near memory key/value lookup acceleration. Proceedings of the International Symposium on Memory Systems (MEMSYS’17). 26--33.
[55]
Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M. Tamer Özsu. 2014. Main-memory hash joins on modern processor architectures. IEEE Trans. Knowl. Data Eng. 27 (2014), 1–1. 10.1109/TKDE.2014.2313874.
[56]
C. Balkesen, J. Teubner, G. Alonso and M. T. Özsu. 2013. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE’13). 362–373.
[57]
Karthik Sundaramoorthy, Zachary Purser, and Eric Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. Sigplan Not. 35 (2000), 257–268. 10.1145/356989.357013.
[58]
R. Sheikh, J. Tuck, and E. Rotenberg. 2012. Control-flow decoupling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. 329–340.
[59]
James David Dundas and Trevor Mudge. 1998. Improving Processor Performance by Dynamically Pre-processing the Instruction Stream. Ph.D. Dissertation. University of Michigan, USA.
[60]
C. Lattner and V. Adve. 2004. LLVM: a compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization 2004 (2004), 75--86.
[61]
Cheng K. Chen, Chih-Chieh Lee, and T. N. Mudge. 1997. Instruction prefetching using branch prediction information. In Proceedings of the International Conference on Computer Design VLSI in Computers and Processors. 593–601.
[62]
Michael Ferdman, Cansu Kaynak, and Babak Falsafi. 2011. Proactive instruction fetch. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). Association for Computing Machinery, New York, NY, 152–162.
[63]
Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2016. Accelerating dependent cache misses with an enhanced memory controller. SIGARCH Comput. Archit. News 44, 3 (Jun. 2016), 444–455.
[64]
Mustafa Cavus, Resit Sendag, and Joshua J. Yi. 2020. Informed prefetching for indirect memory accesses. ACM Trans. Archit. Code Optim. 17, 1, Article 4 (Mar. 2020), 29 pages.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 18, Issue 3
September 2021
370 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3460978
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2021
Accepted: 01 February 2021
Revised: 01 February 2021
Received: 01 August 2020
Published in TACO Volume 18, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Hardware and software prefetch
  2. branch prediction
  3. in-memory database applications
  4. pre-execution

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 500
    Total Downloads
  • Downloads (Last 12 months)190
  • Downloads (Last 6 weeks)34
Reflects downloads up to 21 Sep 2024

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media