research-article

Open access

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

Authors:

Jingling XueAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 4

Article No.: 46, Pages 1 - 28

https://doi.org/10.1145/3280850

Published: 16 November 2018 Publication History

All formats PDF

Abstract

We introduce Poker, a permutation-based approach for vectorizing multiple queries over B⁺-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, Poker represents a general-purpose solution for vectorizing the queries over indexing trees on multi-core processors equipped with SIMD units. For a set of five representative benchmarks evaluated with 24 configurations each, Poker outperforms the state of the art by 2.11x with one single thread and 2.28x with eight threads on an Intel Broadwell processor that supports 256-bit AVX2, on average. In addition, strip-mining queries will further improve Poker’s performance by 1.21x (with one single thread) and 1.31x (with eight threads), on average.

References

[1]

1997. Principles and Recommendations for Population and Housing Censuses. United Nations.

[2]

2016. Intel® 64 and IA-32 architectures optimization reference manual. Intel Corporation.

[3]

Andrew Anderson, Avinash Malik, and David Gregg. 2015. Automatic vectorization of interleaved data revisited. ACM Trans. Archit. Code Optim. 12, 4 (2015), 50:1--50:25.

Digital Library

[4]

Amittai F. Aviram. 2008. Interactive B<sup;>+</sup;>-tree. http://www.amittai.com/prose/bplustree.html. (2008).

[5]

Timo Bingmann. 2013. Stx B<sup;>+</sup;>-tree C++ template classes. http://panthema.net/2007/stx-btree. (2013).

[6]

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE, 44--54.

Digital Library

[7]

S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and K. Skadron. 2010. A characterization of the rodinia benchmark suite with comparison to contemporary CMP workloads. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE, 1--11.

Digital Library

[8]

Linchuan Chen, Peng Jiang, and Gagan Agrawal. 2016. Exploiting recent SIMD architectural advances for irregular applications. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 47--58.

Digital Library

[9]

Mayank Daga and Mark Nutter. 2012. Exploiting coarse-grained parallelism in B+ tree searches on an APU. In Proceedings of the SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 240--247.

Digital Library

[10]

Franz Franchetti and Markus Püschel. 2008. Generating SIMD vectorized permutations. In Proceedings of the International Conference on Compiler Construction. Springer, 116--131.

Digital Library

[11]

Michael Goldfarb, Youngjoon Jo, and Milind Kulkarni. 2013. General transformations for GPU execution of tree traversals. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 10:1--10:12.

Digital Library

[12]

T. Hayes, O. Palomar, O. Unsal, A. Cristal, and M. Valero. 2016. Future vector microprocessor extensions for data aggregations. In Proceedings of the 43rd Annual International Symposium on Computer Architecture. IEEE, 418--430.

Digital Library

[13]

Peng Jiang and Gagan Agrawal. 2017. Efficient SIMD and MIMD parallelization of hash-based aggregation by conflict mitigation. In Proceedings of the International Conference on Supercomputing. ACM, 24:1--24:11.

Digital Library

[14]

Youngjoon Jo, Michael Goldfarb, and Milind Kulkarni. 2013. Automatic vectorization of tree traversals. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. IEEE, 363--374.

Digital Library

[15]

Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, and Pradeep Dubey. 2010. FAST: Fast architecture sensitive tree search on modern CPUs and gpus. In Proceedings of the International Conference on Management of Data. ACM, 339--350.

Digital Library

[16]

D. E. Knuth. 1971. Optimum binary search trees. Acta Inform 1, 1 (1971), 14–25.

Digital Library

[17]

Jianqiao Liu, Nikhil Hegde, and Milind Kulkarni. 2016. Hybrid CPU-GPU scheduling and execution of tree traversals. In Proceedings of the International Conference on Supercomputing. ACM, 2:1--2:12.

Digital Library

[18]

Daniel S. McFarlin, Volodymyr Arbatov, Franz Franchetti, and Markus Püschel. 2011. Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets. In Proceedings of the International Conference on Supercomputing.

Digital Library

[19]

Aravind Natarajan and Neeraj Mittal. 2014. Fast concurrent lock-free binary search trees. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 317--328.

Digital Library

[20]

Jakob Nielsen. 1993. Usability Engineering. Morgan Kaufmann.

Digital Library

[21]

Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of interleaved data for SIMD. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 132--143.

Digital Library

[22]

Bin Ren, Youngjoon Jo, Sriram Krishnamoorthy, Kunal Agrawal, and Milind Kulkarni. 2015. Efficient execution of recursive programs on commodity vector hardware. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 509--520.

Digital Library

[23]

Bin Ren, Tomi Poutanen, Todd Mytkowicz, Wolfram Schulte, Gagan Agrawal, and James R. Larus. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization. IEEE, 1--10.

Digital Library

[24]

Minsoo Rhu and Mattan Erez. 2013. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. In Proceedings of the 40th Annual International Symposium on Computer Architecture. ACM, 356--367.

Digital Library

[25]

Robert Sedgewick and Kevin Wayne. 2011. Algorithms, 4th Edition. Addison-Wesley Professional.

Digital Library

[26]

Amirhesam Shahvarani and Hans-Arno Jacobsen. 2016. A hybrid B+-tree as solution for in-memory indexing on CPU-GPU heterogeneous computing platforms. In Proceedings of the International Conference on Management of Data. ACM, 1523--1538.

Digital Library

[27]

Karthik Sonti. 2016. Building event-driven batch analytics on AWS. https://aws.amazon.com/cn/blogs/bigdata/building-event-driven-batch-analytics-on-aws/.

[28]

Nigel Stephens. 2016. ARMv8-A next-generation vector architecture for HPC. (2016).

[29]

H. R. Strong, G. Markowsky, and A. K. Chandra. 1979. Search within a page. Journal of the ACM 26, 3 (1979), 457–482.

Digital Library

[30]

Tomas Karnagel, Roman Dementiev, Ravi Rajwar Konrad Lai Thomas Legler Benjamin Schlegel, and Wolfgang Lehner. 2014. Improving in-memory database index performance with Intel transactional synchronization extensions. In Proceedings of the 20th International Symposium on High Performance Computer Architecture. IEEE, 476--487.

[31]

Xin Wang, Weihua Zhang, Zhaoguo Wang, Ziyun Wei, Haibo Chen, and Wenyun Zhao. 2017. Eunomia: Scaling concurrent search trees under contention using HTM. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 385--399.

Digital Library

[32]

Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2, 1 (2009), 385–394.

Digital Library

[33]

Jingling Xue. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers.

Digital Library

[34]

Steffen Zeuch, Johann-Christoph Freytag, and Frank Huber. 2014. Adapting tree structures for processing with SIMD instructions. In Proceedings of the 17th International Conference on Extending Database Technology. OpenProceedings.org, 97--108.

[35]

Feng Zhang, Peng Di, Hao Zhou, Xiangke Liao, and Jingling Xue. 2016. RegTT: Accelerating tree traversals on GPUs by exploiting regularities. In Proceedings of the 45th International Conference on Parallel Processing. IEEE, 562--571.

[36]

Feng Zhang and Jingling Xue. 2018. Poker: Permutation-based SIMD execution of intensive tree search by path encoding. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 87--99.

Digital Library

[37]

Hao Zhou and Jingling Xue. 2016. A compiler approach for exploiting partial SIMD parallelism. ACM Transactions on Architecture and Code Optimization (2016), 11:1--11:26.

Digital Library

[38]

Hao Zhou and Jingling Xue. 2016. Exploiting mixed SIMD parallelism by reducing data reorganization overhead. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 59--69.

Digital Library

Cited By

Qayyum ARazzak ITanveer MMazher M(2022)Spontaneous Facial Behavior Analysis Using Deep Transformer-based Framework for Child–computer InteractionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/353957720:2(1-17)Online publication date: 26-May-2022
https://dl.acm.org/doi/10.1145/3539577
Chen LLiu YMan Y(2021)Spatial-temporal channel-wise attention network for action recognitionMultimedia Tools and Applications10.1007/s11042-021-10752-z80:14(21789-21808)Online publication date: 1-Jun-2021
https://dl.acm.org/doi/10.1007/s11042-021-10752-z

Index Terms

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data

Recommendations

Poker: permutation-based SIMD execution of intensive tree search by path encoding
CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

We propose POKER, a permutation-based vectorization approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key ...
GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs
CF'17: Proceedings of the Computing Frontiers Conference

Spatial blocking is a critical memory-access optimization to efficiently exploit the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly ...
Boundary element quadrature schemes for multi- and many-core architectures

In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 15, Issue 4

December 2018

706 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3284745

Editor:
Koen De Bosschere
Ghent University

Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 November 2018

Accepted: 01 September 2018

Revised: 01 August 2018

Received: 01 May 2018

Published in TACO Volume 15, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Australian Research Council

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
753
Total Downloads

Downloads (Last 12 months)154
Downloads (Last 6 weeks)14

Reflects downloads up to 24 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Qayyum ARazzak ITanveer MMazher M(2022)Spontaneous Facial Behavior Analysis Using Deep Transformer-based Framework for Child–computer InteractionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/353957720:2(1-17)Online publication date: 26-May-2022
https://dl.acm.org/doi/10.1145/3539577
Chen LLiu YMan Y(2021)Spatial-temporal channel-wise attention network for action recognitionMultimedia Tools and Applications10.1007/s11042-021-10752-z80:14(21789-21808)Online publication date: 1-Jun-2021
https://dl.acm.org/doi/10.1007/s11042-021-10752-z

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents