Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3318464.3389752acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

The Case for a Learned Sorting Algorithm

Published: 31 May 2020 Publication History

Abstract

Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.

Supplementary Material

MP4 File (3318464.3389752.mp4)
Presentation Video

References

[1]
Apple. 2018. Apple/Swift: standard library sort. (2018). https://github.com/apple/swift/blob/master/test/Prototypes/IntroSort.swift
[2]
Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. 2017. In-Place Parallel Super Scalar Samplesort (IPSSSSo). In 25th Annual European Symposium on Algorithms (ESA 2017) (Leibniz International Proceedings in Informatics (LIPIcs)), Kirk Pruhs and Christian Sohler (Eds.), Vol. 87. Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 9:1--9:14. https://doi.org/10.4230/LIPIcs.ESA.2017.9
[3]
Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M Tamer Özsu. 2013. Multi-core, main-memory joins: Sort vs. hash revisited. Proceedings of the VLDB Endowment, Vol. 7, 1 (2013), 85--96.
[4]
Paul E. Black. 2019. Histogram Sort. (2019). https://www.nist.gov/dads/HTML/histogramSort.html
[5]
Berenger Bramas. 2017. Fast sorting algorithms using AVX-512 on Intel Knights Landing. arXiv preprint arXiv:1704.08579, Vol. 305 (2017), 315.
[6]
Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. 2015. PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, Vol. 8, 12 (2015), 1518--1529.
[7]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2007. Introduction to algorithms 2 ed.). MIT Press, Cambridge, MA.
[8]
Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. 1992. An optimal algorithm for generating minimal perfect hash functions. Inform. Process. Lett., Vol. 43, 5 (1992), 257 -- 264. https://doi.org/10.1016/0020-0190(92)90220-P
[9]
T. Dachraoui and L. Narayanan. 1996. Fast deterministic sorting on large parallel machines. In Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing. IEEE, New Orleans, LA, 273--280. https://doi.org/10.1109/SPDP.1996.570344
[10]
Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm Meyer Auf Der Heide, Hans Rohnert, and Robert E Tarjan. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., Vol. 23, 4 (1994), 738--761.
[11]
Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David Lomet. 2019. ALEX: An Updatable Adaptive Learned Index. (2019). arxiv: cs.DB/1905.08898
[12]
A. Dvoretzky, J. Kiefer, and J. Wolfowitz. 1956. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. Ann. Math. Statist., Vol. 27, 3 (09 1956), 642--669. https://doi.org/10.1214/aoms/1177728174
[13]
Paul Embrechts and Marius Hofert. 2013. A note on generalized inverses. Mathematical Methods of Operations Research, Vol. 77, 3 (2013), 423--432.
[14]
Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski. 2007. Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. ACM, San Diego, CA, 348--357.
[15]
Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD `19). Association for Computing Machinery, New York, NY, USA, 1189--1206. https://doi.org/10.1145/3299869.3319860
[16]
GNU. 2009. C+: STL sort. (2009). https://gcc.gnu.org/onlinedocs/libstdc+/libstdc+-html-USERS-4.4/a01347.html
[17]
Go. 2009. Source file src/sort/sort.go. (2009). https://golang.org/src/sort/sort.go
[18]
Fuji Goro and Morwenn. 2019. Open-source C+ implementation of Timsort. (2019). https://github.com/gfx/cpp-TimSort
[19]
Jim Gray, Chris Nyberg, Mehul Shah, and Naga Govindaraju. 2017. The SortBenchmark dataset. (2017). http://sortbenchmark.org/
[20]
Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. International Conference on Learning Representations. ICLR, New Orleans, LA. https://openreview.net/forum?id=r1lohoCqY7
[21]
Hiroshi Inoue and Kenjiro Taura. 2015. SIMD-and cache-friendly algorithm for sorting an array of structures. Proceedings of the VLDB Endowment, Vol. 8, 11 (2015), 1274--1285.
[22]
Intel. 2020. Intel Performance Primitive library for x86 architectures. (2020). http://software.intel.com/en-us/intel-ipp/
[23]
Java. 2017. Java 9: List.sort. (2017). https://docs.oracle.com/javase/9/docs/api/java/util/List.html#sort-java.util.Comparator-
[24]
Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, and Aviv Tamar. 2018. Internet Congestion Control via Deep Reinforcement Learning. (2018). arxiv: cs.NI/1810.03259
[25]
Johan Ludwig William Valdemar Jensen et al. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, Vol. 30 (1906), 175--193.
[26]
Jie Jiang, Lixiong Zheng, Junfeng Pu, Xiong Cheng, Chongqing Zhao, Mark R Nutter, and Jeremy D Schaub. 2017. Tencent Sort. (2017).
[27]
Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. (2018). arxiv: cs.DB/1809.00677
[28]
Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13--16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf
[29]
Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489--504.
[30]
Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).
[31]
Maciej Kurant, Minas Gjoka, Carter T. Butts, and Athina Markopoulou. 2011. Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks. In Proceedings of ACM SIGMETRICS '11. San Jose, CA.
[32]
Bohdan S Majewski, Nicholas C Wormald, George Havas, and Zbigniew J Czech. 1996. A family of perfect hashing methods. Comput. J., Vol. 39, 6 (1996), 547--554.
[33]
Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM '19). ACM, New York, NY, USA, 270--288. https://doi.org/10.1145/3341302.3342080
[34]
Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. Proc. VLDB Endow., Vol. 12, 11 (July 2019), 1705--1718. https://doi.org/10.14778/3342263.3342644
[35]
Ryan Marcus and Olga Papaemmanouil. 2018. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. ACM, 3.
[36]
Peter McIlroy. 1993. Optimistic sorting and information theoretic complexity. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 467--474.
[37]
MongoDB. 2018. MongoDB: sorter.cpp. (2018). https://github.com/mongodb/mongo/blob/master/src/mongo/db/sorter/sorter.cpp
[38]
David R Musser. 1997. Introspective sorting and selection algorithms. Software: Practice and Experience, Vol. 27, 8 (1997), 983--993.
[39]
MySQL. 2000. MySQL: filesort.cc. (2000). https://github.com/mysql/mysql-server/blob/8.0/sql/filesort.cc
[40]
Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. 2019. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting. In The 31st ACM on Symposium on Parallelism in Algorithms and Architectures. ACM, 213--224.
[41]
OpenAddresses. 2020. The OpenAddresses - Northeast dataset. (2020). https://data.openaddresses.io/openaddr-collected-us_northeast.zip
[42]
OpenStreetMap contributors. 2017. Planet dump retrieved from https://planet.osm.org. (2017). https://www.openstreetmap.org
[43]
Emanuel Parzen. 1962. On estimation of a probability density function and mode. The annals of mathematical statistics, Vol. 33, 3 (1962), 1065--1076.
[44]
Orson R. L. Peters. 2020. The Pattern-Defeating Quicksort Algorithm. (2020). https://github.com/orlp/pdqsort
[45]
Tim Peters. 2002. Python: list.sort. (2002). https://github.com/python/cpython/blob/master/Objects/listsort.txt
[46]
Postgres. 1996. Postgres: tuplesort.c. (1996). https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c
[47]
Murray Rosenblatt. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics (1956), 832--837.
[48]
Peter Sanders and Sebastian Winkel. 2004. Super scalar sample sort. In European Symposium on Algorithms. Springer, 784--796.
[49]
Michael Axtmann Sascha Witt. 2020. Open-source C+ implementation of the IPS4o algorithm. (2020). https://github.com/SaschaWitt/ips4o
[50]
Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D Nguyen, Victor W Lee, Daehyun Kim, and Pradeep Dubey. 2010. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 351--362.
[51]
Andrew Schein. 2009. Open-source C+ implementation of Radix Sort for double-precision floating points. (2009). https://bitbucket.org/ais/usort/src/474cc2a19224/usort/f8_sort.c
[52]
SQLite. 2011. SQLite: vdbesort.c. (2011). https://github.com/mackyle/sqlite/blob/master/src/vdbesort.c
[53]
Simon Steele and Marius vZilénas. 2020. 479k English words for all your dictionary. (2020). https://github.com/dwyl/english-words
[54]
Michal Turvc an'ik and Martin Javurek. 2016. Hash function generation by neural network. In 2016 New Trends in Signal Processing (NTSP). IEEE, 1--5.
[55]
Peter Van Sandt, Yannis Chronis, and Jignesh M Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proceedings of the 2019 International Conference on Management of Data. ACM, 36--53.
[56]
Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927 (2014).
[57]
Jianfeng Wang, Jingdong Wang, Nenghai Yu, and Shipeng Li. 2013. Order preserving hashing for approximate nearest neighbor search. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 133--142.
[58]
Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment, Vol. 12, 3 (2018), 210--222.

Cited By

View all
  • (2024)LITS: An Optimized Learned Index for StringsProceedings of the VLDB Endowment10.14778/3681954.368201017:11(3415-3427)Online publication date: 1-Jul-2024
  • (2024)The Holon Approach for Simultaneously Tuning Multiple Components in a Self-Driving Database Management System with Machine Learning via Synthesized Proto-ActionsProceedings of the VLDB Endowment10.14778/3681954.368200717:11(3373-3387)Online publication date: 30-Aug-2024
  • (2024)Can Learned Indexes be Built Efficiently? A Deep Dive into Sampling Trade-offsProceedings of the ACM on Management of Data10.1145/36549192:3(1-25)Online publication date: 30-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. CDF
  2. ML for systems
  3. RMI
  4. learned algorithm
  5. linear interpolation
  6. linear models
  7. sorting
  8. sorting algorithm

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • DARPA

Conference

SIGMOD/PODS '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,660
  • Downloads (Last 6 weeks)141
Reflects downloads up to 20 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)LITS: An Optimized Learned Index for StringsProceedings of the VLDB Endowment10.14778/3681954.368201017:11(3415-3427)Online publication date: 1-Jul-2024
  • (2024)The Holon Approach for Simultaneously Tuning Multiple Components in a Self-Driving Database Management System with Machine Learning via Synthesized Proto-ActionsProceedings of the VLDB Endowment10.14778/3681954.368200717:11(3373-3387)Online publication date: 30-Aug-2024
  • (2024)Can Learned Indexes be Built Efficiently? A Deep Dive into Sampling Trade-offsProceedings of the ACM on Management of Data10.1145/36549192:3(1-25)Online publication date: 30-May-2024
  • (2024)G-Learned Index: Enabling Efficient Learned Index on GPUIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.338121435:6(950-967)Online publication date: 2-Apr-2024
  • (2024)LBSC: A Cost-Aware Caching Framework for Cloud Databases2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00373(4911-4924)Online publication date: 13-May-2024
  • (2024)Sum-Product Network-Based Cardinality Estimation ResearchArtificial Intelligence in China10.1007/978-981-99-7545-7_40(387-397)Online publication date: 23-Mar-2024
  • (2023)Sorting with predictionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667277(26563-26584)Online publication date: 10-Dec-2023
  • (2023)Intelligent Path-Selection-Aided Decoding of Polar CodesEntropy10.3390/e2502020025:2(200)Online publication date: 19-Jan-2023
  • (2023)Learned Sorted Table Search and Static Indexes in Small-Space Data ModelsData10.3390/data80300568:3(56)Online publication date: 3-Mar-2023
  • (2023)The Case for Learned In-Memory JoinsProceedings of the VLDB Endowment10.14778/3587136.358714816:7(1749-1762)Online publication date: 1-Mar-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media