Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

Published: 16 November 2018 Publication History

Abstract

We introduce Poker, a permutation-based approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, Poker represents a general-purpose solution for vectorizing the queries over indexing trees on multi-core processors equipped with SIMD units. For a set of five representative benchmarks evaluated with 24 configurations each, Poker outperforms the state of the art by 2.11x with one single thread and 2.28x with eight threads on an Intel Broadwell processor that supports 256-bit AVX2, on average. In addition, strip-mining queries will further improve Poker’s performance by 1.21x (with one single thread) and 1.31x (with eight threads), on average.

References

[1]
1997. Principles and Recommendations for Population and Housing Censuses. United Nations.
[2]
2016. Intel® 64 and IA-32 architectures optimization reference manual. Intel Corporation.
[3]
Andrew Anderson, Avinash Malik, and David Gregg. 2015. Automatic vectorization of interleaved data revisited. ACM Trans. Archit. Code Optim. 12, 4 (2015), 50:1--50:25.
[4]
Amittai F. Aviram. 2008. Interactive B<sup;>+</sup;>-tree. http://www.amittai.com/prose/bplustree.html. (2008).
[5]
Timo Bingmann. 2013. Stx B<sup;>+</sup;>-tree C++ template classes. http://panthema.net/2007/stx-btree. (2013).
[6]
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE, 44--54.
[7]
S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and K. Skadron. 2010. A characterization of the rodinia benchmark suite with comparison to contemporary CMP workloads. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE, 1--11.
[8]
Linchuan Chen, Peng Jiang, and Gagan Agrawal. 2016. Exploiting recent SIMD architectural advances for irregular applications. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 47--58.
[9]
Mayank Daga and Mark Nutter. 2012. Exploiting coarse-grained parallelism in B+ tree searches on an APU. In Proceedings of the SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 240--247.
[10]
Franz Franchetti and Markus Püschel. 2008. Generating SIMD vectorized permutations. In Proceedings of the International Conference on Compiler Construction. Springer, 116--131.
[11]
Michael Goldfarb, Youngjoon Jo, and Milind Kulkarni. 2013. General transformations for GPU execution of tree traversals. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 10:1--10:12.
[12]
T. Hayes, O. Palomar, O. Unsal, A. Cristal, and M. Valero. 2016. Future vector microprocessor extensions for data aggregations. In Proceedings of the 43rd Annual International Symposium on Computer Architecture. IEEE, 418--430.
[13]
Peng Jiang and Gagan Agrawal. 2017. Efficient SIMD and MIMD parallelization of hash-based aggregation by conflict mitigation. In Proceedings of the International Conference on Supercomputing. ACM, 24:1--24:11.
[14]
Youngjoon Jo, Michael Goldfarb, and Milind Kulkarni. 2013. Automatic vectorization of tree traversals. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. IEEE, 363--374.
[15]
Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, and Pradeep Dubey. 2010. FAST: Fast architecture sensitive tree search on modern CPUs and gpus. In Proceedings of the International Conference on Management of Data. ACM, 339--350.
[16]
D. E. Knuth. 1971. Optimum binary search trees. Acta Inform 1, 1 (1971), 14–25.
[17]
Jianqiao Liu, Nikhil Hegde, and Milind Kulkarni. 2016. Hybrid CPU-GPU scheduling and execution of tree traversals. In Proceedings of the International Conference on Supercomputing. ACM, 2:1--2:12.
[18]
Daniel S. McFarlin, Volodymyr Arbatov, Franz Franchetti, and Markus Püschel. 2011. Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets. In Proceedings of the International Conference on Supercomputing.
[19]
Aravind Natarajan and Neeraj Mittal. 2014. Fast concurrent lock-free binary search trees. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 317--328.
[20]
Jakob Nielsen. 1993. Usability Engineering. Morgan Kaufmann.
[21]
Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of interleaved data for SIMD. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 132--143.
[22]
Bin Ren, Youngjoon Jo, Sriram Krishnamoorthy, Kunal Agrawal, and Milind Kulkarni. 2015. Efficient execution of recursive programs on commodity vector hardware. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 509--520.
[23]
Bin Ren, Tomi Poutanen, Todd Mytkowicz, Wolfram Schulte, Gagan Agrawal, and James R. Larus. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization. IEEE, 1--10.
[24]
Minsoo Rhu and Mattan Erez. 2013. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. In Proceedings of the 40th Annual International Symposium on Computer Architecture. ACM, 356--367.
[25]
Robert Sedgewick and Kevin Wayne. 2011. Algorithms, 4th Edition. Addison-Wesley Professional.
[26]
Amirhesam Shahvarani and Hans-Arno Jacobsen. 2016. A hybrid B+-tree as solution for in-memory indexing on CPU-GPU heterogeneous computing platforms. In Proceedings of the International Conference on Management of Data. ACM, 1523--1538.
[27]
Karthik Sonti. 2016. Building event-driven batch analytics on AWS. https://aws.amazon.com/cn/blogs/bigdata/building-event-driven-batch-analytics-on-aws/.
[28]
Nigel Stephens. 2016. ARMv8-A next-generation vector architecture for HPC. (2016).
[29]
H. R. Strong, G. Markowsky, and A. K. Chandra. 1979. Search within a page. Journal of the ACM 26, 3 (1979), 457–482.
[30]
Tomas Karnagel, Roman Dementiev, Ravi Rajwar Konrad Lai Thomas Legler Benjamin Schlegel, and Wolfgang Lehner. 2014. Improving in-memory database index performance with Intel transactional synchronization extensions. In Proceedings of the 20th International Symposium on High Performance Computer Architecture. IEEE, 476--487.
[31]
Xin Wang, Weihua Zhang, Zhaoguo Wang, Ziyun Wei, Haibo Chen, and Wenyun Zhao. 2017. Eunomia: Scaling concurrent search trees under contention using HTM. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 385--399.
[32]
Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2, 1 (2009), 385–394.
[33]
Jingling Xue. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers.
[34]
Steffen Zeuch, Johann-Christoph Freytag, and Frank Huber. 2014. Adapting tree structures for processing with SIMD instructions. In Proceedings of the 17th International Conference on Extending Database Technology. OpenProceedings.org, 97--108.
[35]
Feng Zhang, Peng Di, Hao Zhou, Xiangke Liao, and Jingling Xue. 2016. RegTT: Accelerating tree traversals on GPUs by exploiting regularities. In Proceedings of the 45th International Conference on Parallel Processing. IEEE, 562--571.
[36]
Feng Zhang and Jingling Xue. 2018. Poker: Permutation-based SIMD execution of intensive tree search by path encoding. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 87--99.
[37]
Hao Zhou and Jingling Xue. 2016. A compiler approach for exploiting partial SIMD parallelism. ACM Transactions on Architecture and Code Optimization (2016), 11:1--11:26.
[38]
Hao Zhou and Jingling Xue. 2016. Exploiting mixed SIMD parallelism by reducing data reorganization overhead. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 59--69.

Cited By

View all
  • (2022)Spontaneous Facial Behavior Analysis Using Deep Transformer-based Framework for Child–computer InteractionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/353957720:2(1-17)Online publication date: 26-May-2022
  • (2021)Spatial-temporal channel-wise attention network for action recognitionMultimedia Tools and Applications10.1007/s11042-021-10752-z80:14(21789-21808)Online publication date: 1-Jun-2021

Index Terms

  1. Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 4
    December 2018
    706 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3284745
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 November 2018
    Accepted: 01 September 2018
    Revised: 01 August 2018
    Received: 01 May 2018
    Published in TACO Volume 15, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. SIMD
    2. intensive tree search
    3. permute

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)154
    • Downloads (Last 6 weeks)14
    Reflects downloads up to 24 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Spontaneous Facial Behavior Analysis Using Deep Transformer-based Framework for Child–computer InteractionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/353957720:2(1-17)Online publication date: 26-May-2022
    • (2021)Spatial-temporal channel-wise attention network for action recognitionMultimedia Tools and Applications10.1007/s11042-021-10752-z80:14(21789-21808)Online publication date: 1-Jun-2021

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media