research-article

Open access

Fast Key-Value Lookups with Node Tracker

Authors:

Mohammed Shatnawi,

Augustus K. UhtAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 3

Article No.: 34, Pages 1 - 26

https://doi.org/10.1145/3452099

Published: 08 June 2021 Publication History

All formats PDF

Abstract

Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard-to-predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for eliminating the branch mispredictions only a small fraction of the performance benefit is achieved through prefetching alone. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.

References

[1]

Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous memory access chaining. Proc. VLDB Endow. 9, 4 (Dec. 2015), 252–263.

Digital Library

[2]

Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry. 2007. Improving hash join performance through prefetching. ACM Trans. Database Syst. 32, 3 (Aug. 2007), Article 17.

Digital Library

[3]

Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, and Pradeep Dubey. 2010. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In Proceedings of the Special Interest Group on Management of Data Conference (SIGMOD’10).

Digital Library

[4]

Changhee Jung, Daeseob Lim, Jaejin Lee, and Y. Solihin. 2006. Helper thread prefetching for loosely-coupled multiprocessor systems. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium.

[5]

S. Ainsworth and T. M. Jones. An event-triggered programmable prefetcher for irregular workloads. ACM SIGPLAN Not. 53, 2 (Mar. 2018), 578--592.

[6]

J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA’15), 105–117.

[7]

AMD. 2017. Software Optimization Guide for AMD Family 17h Processors, 2.8.1.6. https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf.

[8]

Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News 39, 2 (Aug. 2011), 1–7.

Digital Library

[9]

WikiChip Fuse. 2019. Intel Sunny Cove Core to Deliver a Major Improvement in Single-Thread Performance, Bigger Improvements to Follow. Retrieved from https://fuse.wikichip.org/news/2371/intel-sunny-cove-core-to-deliver-a-major-improvement-in-single-thread-performance-bigger-improvements-to-follow/.

[10]

AMD. 2019. Retrieved from https://www.amd.com/en/products/epyc-server.

[11]

Mustafa Cavus, Resit Sendag, and Joshua J. Yi. 2018. Array tracking prefetcher for indirect accesses. In Proceedings of the IEEE International Conference on Computer Design (ICCD’18). 132--139.

[12]

M. Hashemi, O. Mutlu, and Y. N. Patt. 2016. Continuous runahead: Transparent hardware acceleration for memory intensive workloads. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.

[13]

Mingxing Tan, Xianhua Liu, Tong Tong, and Xu Cheng. 2012. CVP: An energy-efficient indirect branch prediction with compiler-guided value pattern. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 111–120.

Digital Library

[14]

A. Mukkara, N. Beckmann, M. Abeydeera, X. Ma, and D. Sanchez. 2018. Exploiting locality in graph analytics through hardware-accelerated traversal scheduling. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–14.

[15]

Andre Seznec. 2016. TAGE-SC-L Branch Predictors Again. In Proceedings of the 5th Championship on Branch Prediction.

[16]

P. Michaud. 2016. Best-offset hardware prefetching. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA’16). 469–480.

[17]

J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti. 2016. Path confidence based lookahead prefetching. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12.

[18]

A. Jain and C. Lin. 2013. Linearizing irregular memory accesses for improved correlated prefetching. Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 247--259.

[19]

William Pugh. 1990. Concurrent Maintenance of Skip Lists. Technical Report. University of Maryland at College Park, College Park, MD, 1990.

[20]

S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 469–480.

[21]

Sparsh Mittal. 2016. A survey of recent prefetching techniques for processor caches. ACM Comput. Surv. 49, 2, Article 35 (Aug. 2016), 35 pages.

[22]

Y. Ishii, M. Inaba, and K. Hiraki. 2009. Access map pattern matching for data cache prefetch. In Proceedings of the International Conference on Supercomputing (ICS’09). 499–500.

[23]

M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti. 2015. Efficiently prefetching complex address patterns. In Proceedings of the International Symposium on Microarchitecture (MICRO). 141–152.

[24]

M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad. 2019. Bingo spatial data prefetcher. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA’19), 399–411.

[25]

S. Bakalapati and B. Panda. 2019. Bouquet of instruction pointers: Instruction pointer classifier based hardware prefetching. InThird Data Prefetching Competition. ISCA 2019. IEEE Press, 118--131.

[26]

Retrieved from https://dpc3.compas.cs.stonybrook.edu.

[27]

David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul Gratz, and Daniel Jimenez. 2014. B-Fetch: Branch prediction directed prefetching for chip-multiprocessors. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Los Alamitos, CA, 623–634.

Digital Library

[28]

Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When prefetching works, when it doesn’t, and why. ACM Trans. Archit. Code Optim. 9, 1 (Mar. 2012), Article 2.

Digital Library

[29]

M. Lipasti, W. Schmidt, S. Kunkel, and R. Roediger. 1995. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 232–236.

Digital Library

[30]

Luk and Mowry. 1996. Compiler-based prefetching for recursive data structures. ACM SIGOPS Rev. 30, 5 (1996), 222–233.

Digital Library

[31]

David Callahan, Ken Kennedy, and Allan Porterfield. 1991. Software prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). ACM, New York, NY, 40–52.

Digital Library

[32]

Mowry. 1994. Tolerating Latency through Software-controlled Data Prefetching. Ph.D. Dissertation, Stanford University, 1994.

[33]

Youfeng Wu, Mauricio Serrano, Rakesh Krishnaiyer, Wei Li, and Jesse Fang1. 2002. Value profile guided stride prefetching for irregular code. In Compiler Construction. Springer, 307–324.

[34]

Ainsworth and Jones. 2017. Software prefetching for indirect memory accesses. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’17). IEEE Press, 305--317.

[35]

Kim and D. Yeung. 2002. Design and evaluation of compiler algorithms for pre-execution. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’02). 159--170.

[36]

E. Lau, J. E. Miller, I. Choi, D. Yeung, S. Amarasinghe, and A. Agarwal. 2011. Multicore performance optimization using partner cores. Proceedings of the USENIX Conference on Hot Topics in Parallelism (HotPar’11).

[37]

T. J. Ham, J. L. Aragón, and M. Martonosi. 2015. DeSC: Decoupled supply-compute communication management for heterogeneous architectures. Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’15). 191--203.

Digital Library

[38]

I. Ganusov and M. Burtscher. 2006. Efficient emulation of hardware prefetchers via event-driven helper threading. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’06).

[39]

C.-H. Ho, S. J. Kim, and K. Sankaralingam. 2015. Efficient execution of memory access phases using datafow specialization. In Proceedings of the International Symposium on Computer Architecture (ISCA’15). 118--130.

Digital Library

[40]

S. Kumar, A. Shriraman, V. Srinivasan, D. Lin, and J. Phillips. 2014. Sqrl: Hardware accelerator for collecting software data structures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 475--476.

Digital Library

[41]

S. Kumar, N. Vedula, A. Shriraman, and V. Srinivasan. 2015. Dasx: Hardware accelerator for software data structures. In Proceedings of the International Conference on Supercomputing (ICS’15). 361--372.

[42]

O. Kocberber, B. Grot, J. Picorel, B. Falsaf., K. Lim, and P. Ranganathan. 2013. Meet the walkers: Accelerating index traversals for inmemory databases. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO’13).

[43]

E. Ebrahimi, O. Mutlu, and Y. N. Patt. 2009. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In Proceedings of the 15th International Symposium on High Performance Computing Architecture (HPCA’09), 7–17.

[44]

A. Roth, A. Moshovos, and G. S. Sohi. 1998. Dependence based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectual Support for Programming Languages and Operating Systems (ASPLOS’98), 115–126.

[45]

C.-L. Yang and A. R. Lebeck. 2000. Push vs. pull: Data movement for linked data structures. In Proceedings of the 14th International Conference on Supercomputing (ICS’00). 176–186.

[46]

J. Collins, S. Sair, B. Calder, and D. Tullsen. 2002. Pointer cache assisted prefetching. In Proceedings of the 35th International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 62–73.

Digital Library

[47]

Amir Roth and Gurindar S. Sohi. 1999. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA’99). IEEE Computer Society, Los Alamitos, CA, 111–121.

[48]

Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. 2003. Guided region prefetching: A cooperative hardware/software approach. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03). ACM, New York, NY, 388–398.

[49]

Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect memory prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, 178—190.

Digital Library

[50]

Daniel Jimenez. 2016. Multiperspective perceptron predictor. Championship Branch Prediction Competition (2016).

[51]

H. Gao et al. 2008. Address-branch correlation: A novel locality for long-latency hard-to-predict branches. Proceedings of the International Symposium on High Performance Computer Architecture (HPCA-14). 74--85.

[52]

Muawya Al-Otoom, Elliott Forbes, and Eric Rotenberg. 2010. EXACT: Explicit dynamic-branch prediction with active updates. In Proceedings of the 7th ACM International Conference on Computing Frontiers (CF’10). ACM, New York, NY, 165–176.

Digital Library

[53]

M. U. Farooq, Khubaib, and L. K. John. 2013. Store-load-branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA’13), 59–70.

[54]

Lloyd, G. Scott and Maya Gokhale. 2017. Near memory key/value lookup acceleration. Proceedings of the International Symposium on Memory Systems (MEMSYS’17). 26--33.

[55]

Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M. Tamer Özsu. 2014. Main-memory hash joins on modern processor architectures. IEEE Trans. Knowl. Data Eng. 27 (2014), 1–1. 10.1109/TKDE.2014.2313874.

[56]

C. Balkesen, J. Teubner, G. Alonso and M. T. Özsu. 2013. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE’13). 362–373.

Digital Library

[57]

Karthik Sundaramoorthy, Zachary Purser, and Eric Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. Sigplan Not. 35 (2000), 257–268. 10.1145/356989.357013.

Digital Library

[58]

R. Sheikh, J. Tuck, and E. Rotenberg. 2012. Control-flow decoupling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. 329–340.

Digital Library

[59]

James David Dundas and Trevor Mudge. 1998. Improving Processor Performance by Dynamically Pre-processing the Instruction Stream. Ph.D. Dissertation. University of Michigan, USA.

[60]

C. Lattner and V. Adve. 2004. LLVM: a compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization 2004 (2004), 75--86.

[61]

Cheng K. Chen, Chih-Chieh Lee, and T. N. Mudge. 1997. Instruction prefetching using branch prediction information. In Proceedings of the International Conference on Computer Design VLSI in Computers and Processors. 593–601.

[62]

Michael Ferdman, Cansu Kaynak, and Babak Falsafi. 2011. Proactive instruction fetch. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). Association for Computing Machinery, New York, NY, 152–162.

Digital Library

[63]

Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2016. Accelerating dependent cache misses with an enhanced memory controller. SIGARCH Comput. Archit. News 44, 3 (Jun. 2016), 444–455.

Digital Library

[64]

Mustafa Cavus, Resit Sendag, and Joshua J. Yi. 2020. Informed prefetching for indirect memory accesses. ACM Trans. Archit. Code Optim. 17, 1, Article 4 (Mar. 2020), 29 pages.

Digital Library

Index Terms

Fast Key-Value Lookups with Node Tracker
1. Computer systems organization
  1. Architectures
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Control-Flow Decoupling
MICRO-45: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Mobile and PC/server class processor companies continue to roll out flagship core micro architectures that are faster than their predecessors. Meanwhile placing more cores on a chip coupled with constant supply voltage puts per-core energy consumption ...
Filtering directory lookups in CMPs

Coherence protocols consume an important fraction of power to determine which coherence action to perform. Specifically, on CMPs with shared cache and directory-based coherence protocol implemented as a duplicate of local caches tags, we have observed ...
Slipstream processors revisited: exploiting branch sets
ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture

Delinquent branches and loads remain key performance limiters in some applications. One approach to mitigate them is pre-execution. Broadly, there are two classes of pre-execution: one class repeatedly forks small helper threads, each targeting an ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 18, Issue 3

September 2021

370 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3460978

Editor:
David Kaeli
Northeastern University, USA

Issue’s Table of Contents

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2021

Accepted: 01 February 2021

Revised: 01 February 2021

Received: 01 August 2020

Published in TACO Volume 18, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF (National Science Foundation)

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
500
Total Downloads

Downloads (Last 12 months)190
Downloads (Last 6 weeks)34

Reflects downloads up to 21 Sep 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents