research-article

Public Access

The Case for a Learned Sorting Algorithm

Authors:

Ugur Çetintemel,

Tim KraskaAuthors Info & Claims

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1001 - 1016

https://doi.org/10.1145/3318464.3389752

Published: 31 May 2020 Publication History

Abstract

Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.

Supplementary Material

MP4 File (3318464.3389752.mp4)

Presentation Video

Download
83.96 MB

References

[1]

Apple. 2018. Apple/Swift: standard library sort. (2018). https://github.com/apple/swift/blob/master/test/Prototypes/IntroSort.swift

[2]

Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. 2017. In-Place Parallel Super Scalar Samplesort (IPSSSSo). In 25th Annual European Symposium on Algorithms (ESA 2017) (Leibniz International Proceedings in Informatics (LIPIcs)), Kirk Pruhs and Christian Sohler (Eds.), Vol. 87. Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 9:1--9:14. https://doi.org/10.4230/LIPIcs.ESA.2017.9

[3]

Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M Tamer Özsu. 2013. Multi-core, main-memory joins: Sort vs. hash revisited. Proceedings of the VLDB Endowment, Vol. 7, 1 (2013), 85--96.

Digital Library

[4]

Paul E. Black. 2019. Histogram Sort. (2019). https://www.nist.gov/dads/HTML/histogramSort.html

[5]

Berenger Bramas. 2017. Fast sorting algorithms using AVX-512 on Intel Knights Landing. arXiv preprint arXiv:1704.08579, Vol. 305 (2017), 315.

[6]

Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. 2015. PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, Vol. 8, 12 (2015), 1518--1529.

Digital Library

[7]

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2007. Introduction to algorithms 2 ed.). MIT Press, Cambridge, MA.

[8]

Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. 1992. An optimal algorithm for generating minimal perfect hash functions. Inform. Process. Lett., Vol. 43, 5 (1992), 257 -- 264. https://doi.org/10.1016/0020-0190(92)90220-P

Digital Library

[9]

T. Dachraoui and L. Narayanan. 1996. Fast deterministic sorting on large parallel machines. In Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing. IEEE, New Orleans, LA, 273--280. https://doi.org/10.1109/SPDP.1996.570344

[10]

Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm Meyer Auf Der Heide, Hans Rohnert, and Robert E Tarjan. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., Vol. 23, 4 (1994), 738--761.

Digital Library

[11]

Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David Lomet. 2019. ALEX: An Updatable Adaptive Learned Index. (2019). arxiv: cs.DB/1905.08898

[12]

A. Dvoretzky, J. Kiefer, and J. Wolfowitz. 1956. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. Ann. Math. Statist., Vol. 27, 3 (09 1956), 642--669. https://doi.org/10.1214/aoms/1177728174

[13]

Paul Embrechts and Marius Hofert. 2013. A note on generalized inverses. Mathematical Methods of Operations Research, Vol. 77, 3 (2013), 423--432.

[14]

Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski. 2007. Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. ACM, San Diego, CA, 348--357.

Digital Library

[15]

Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD `19). Association for Computing Machinery, New York, NY, USA, 1189--1206. https://doi.org/10.1145/3299869.3319860

Digital Library

[16]

GNU. 2009. C+: STL sort. (2009). https://gcc.gnu.org/onlinedocs/libstdc+/libstdc+-html-USERS-4.4/a01347.html

[17]

Go. 2009. Source file src/sort/sort.go. (2009). https://golang.org/src/sort/sort.go

[18]

Fuji Goro and Morwenn. 2019. Open-source C+ implementation of Timsort. (2019). https://github.com/gfx/cpp-TimSort

[19]

Jim Gray, Chris Nyberg, Mehul Shah, and Naga Govindaraju. 2017. The SortBenchmark dataset. (2017). http://sortbenchmark.org/

[20]

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. International Conference on Learning Representations. ICLR, New Orleans, LA. https://openreview.net/forum?id=r1lohoCqY7

[21]

Hiroshi Inoue and Kenjiro Taura. 2015. SIMD-and cache-friendly algorithm for sorting an array of structures. Proceedings of the VLDB Endowment, Vol. 8, 11 (2015), 1274--1285.

Digital Library

[22]

Intel. 2020. Intel Performance Primitive library for x86 architectures. (2020). http://software.intel.com/en-us/intel-ipp/

[23]

Java. 2017. Java 9: List.sort. (2017). https://docs.oracle.com/javase/9/docs/api/java/util/List.html#sort-java.util.Comparator-

[24]

Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, and Aviv Tamar. 2018. Internet Congestion Control via Deep Reinforcement Learning. (2018). arxiv: cs.NI/1810.03259

[25]

Johan Ludwig William Valdemar Jensen et al. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, Vol. 30 (1906), 175--193.

[26]

Jie Jiang, Lixiong Zheng, Junfeng Pu, Xiong Cheng, Chongqing Zhao, Mark R Nutter, and Jeremy D Schaub. 2017. Tencent Sort. (2017).

[27]

Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. (2018). arxiv: cs.DB/1809.00677

[28]

Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13--16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf

[29]

Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489--504.

Digital Library

[30]

Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).

[31]

Maciej Kurant, Minas Gjoka, Carter T. Butts, and Athina Markopoulou. 2011. Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks. In Proceedings of ACM SIGMETRICS '11. San Jose, CA.

Digital Library

[32]

Bohdan S Majewski, Nicholas C Wormald, George Havas, and Zbigniew J Czech. 1996. A family of perfect hashing methods. Comput. J., Vol. 39, 6 (1996), 547--554.

[33]

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM '19). ACM, New York, NY, USA, 270--288. https://doi.org/10.1145/3341302.3342080

Digital Library

[34]

Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. Proc. VLDB Endow., Vol. 12, 11 (July 2019), 1705--1718. https://doi.org/10.14778/3342263.3342644

Digital Library

[35]

Ryan Marcus and Olga Papaemmanouil. 2018. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. ACM, 3.

Digital Library

[36]

Peter McIlroy. 1993. Optimistic sorting and information theoretic complexity. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 467--474.

Digital Library

[37]

MongoDB. 2018. MongoDB: sorter.cpp. (2018). https://github.com/mongodb/mongo/blob/master/src/mongo/db/sorter/sorter.cpp

[38]

David R Musser. 1997. Introspective sorting and selection algorithms. Software: Practice and Experience, Vol. 27, 8 (1997), 983--993.

Digital Library

[39]

MySQL. 2000. MySQL: filesort.cc. (2000). https://github.com/mysql/mysql-server/blob/8.0/sql/filesort.cc

[40]

Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. 2019. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting. In The 31st ACM on Symposium on Parallelism in Algorithms and Architectures. ACM, 213--224.

[41]

OpenAddresses. 2020. The OpenAddresses - Northeast dataset. (2020). https://data.openaddresses.io/openaddr-collected-us_northeast.zip

[42]

OpenStreetMap contributors. 2017. Planet dump retrieved from https://planet.osm.org. (2017). https://www.openstreetmap.org

[43]

Emanuel Parzen. 1962. On estimation of a probability density function and mode. The annals of mathematical statistics, Vol. 33, 3 (1962), 1065--1076.

[44]

Orson R. L. Peters. 2020. The Pattern-Defeating Quicksort Algorithm. (2020). https://github.com/orlp/pdqsort

[45]

Tim Peters. 2002. Python: list.sort. (2002). https://github.com/python/cpython/blob/master/Objects/listsort.txt

[46]

Postgres. 1996. Postgres: tuplesort.c. (1996). https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c

[47]

Murray Rosenblatt. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics (1956), 832--837.

[48]

Peter Sanders and Sebastian Winkel. 2004. Super scalar sample sort. In European Symposium on Algorithms. Springer, 784--796.

[49]

Michael Axtmann Sascha Witt. 2020. Open-source C+ implementation of the IPS4o algorithm. (2020). https://github.com/SaschaWitt/ips4o

[50]

Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D Nguyen, Victor W Lee, Daehyun Kim, and Pradeep Dubey. 2010. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 351--362.

Digital Library

[51]

Andrew Schein. 2009. Open-source C+ implementation of Radix Sort for double-precision floating points. (2009). https://bitbucket.org/ais/usort/src/474cc2a19224/usort/f8_sort.c

[52]

SQLite. 2011. SQLite: vdbesort.c. (2011). https://github.com/mackyle/sqlite/blob/master/src/vdbesort.c

[53]

Simon Steele and Marius vZilénas. 2020. 479k English words for all your dictionary. (2020). https://github.com/dwyl/english-words

[54]

Michal Turvc an'ik and Martin Javurek. 2016. Hash function generation by neural network. In 2016 New Trends in Signal Processing (NTSP). IEEE, 1--5.

[55]

Peter Van Sandt, Yannis Chronis, and Jignesh M Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proceedings of the 2019 International Conference on Management of Data. ACM, 36--53.

Digital Library

[56]

Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927 (2014).

[57]

Jianfeng Wang, Jingdong Wang, Nenghai Yu, and Shipeng Li. 2013. Order preserving hashing for approximate nearest neighbor search. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 133--142.

Digital Library

[58]

Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment, Vol. 12, 3 (2018), 210--222.

Digital Library

Cited By

Yang YChen S(2024)LITS: An Optimized Learned Index for StringsProceedings of the VLDB Endowment10.14778/3681954.368201017:11(3415-3427)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682010
Zhang WLim WButrovich MPavlo A(2024)The Holon Approach for Simultaneously Tuning Multiple Components in a Self-Driving Database Management System with Machine Learning via Synthesized Proto-ActionsProceedings of the VLDB Endowment10.14778/3681954.368200717:11(3373-3387)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3682007
Choi MYoo SChoi J(2024)Can Learned Indexes be Built Efficiently? A Deep Dive into Sampling Trade-offsProceedings of the ACM on Management of Data10.1145/36549192:3(1-25)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654919
Show More Cited By

Index Terms

The Case for a Learned Sorting Algorithm

Recommendations

A new parallel sorting algorithm based upon min-mid-max operations
Abstract
In this paper we propose a new parallel sorting algorithm which is based upon an operation which sorts three elements. This algorithm is similar to the parallel odd-even merge sorting algorithm proposed by Batcher, except in the latter, the basic ...
Best sorting algorithm for nearly sorted lists

Straight Insertion Sort, Shellsort, Straight Merge Sort, Quickersort, and Heapsort are compared on nearly sorted lists. The ratio of the minimum number of list elements which must be removed so that the remaining portion of the list is in order to the size ...
A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort
IPDPS '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium

Adaptive bitonic sort is a well known merge-based parallel sorting algorithm. It achieves optimal complexity using a complex tree-like data structure called a bitonic tree. Due to this, using adaptive bitonic sort together with other algorithms usually ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs:
David Maier
Portland State University, USA
,
Rachel Pottinger
University of British Columbia, Canada
,
Program Chairs:
AnHai Doan
University of Wisconsin, USA
,
Wang-Chiew Tan
Megagon Labs, USA
,
Publications Chairs:
Abdussalam Alawini
University of Illinois at Urbana-Champaign, USA
,
Hung Q. Ngo
RelationalAI, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2020

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

NSF
DARPA

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

OR, Portland, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

26
Total Citations
View Citations
6,824
Total Downloads

Downloads (Last 12 months)1,660
Downloads (Last 6 weeks)141

Reflects downloads up to 20 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yang YChen S(2024)LITS: An Optimized Learned Index for StringsProceedings of the VLDB Endowment10.14778/3681954.368201017:11(3415-3427)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682010
Zhang WLim WButrovich MPavlo A(2024)The Holon Approach for Simultaneously Tuning Multiple Components in a Self-Driving Database Management System with Machine Learning via Synthesized Proto-ActionsProceedings of the VLDB Endowment10.14778/3681954.368200717:11(3373-3387)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3682007
Choi MYoo SChoi J(2024)Can Learned Indexes be Built Efficiently? A Deep Dive into Sampling Trade-offsProceedings of the ACM on Management of Data10.1145/36549192:3(1-25)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654919
Liu JZhang FLu LQi CGuo XDeng DLi GZhang HZhai JZhang HChen YPan ADu X(2024)G-Learned Index: Enabling Efficient Learned Index on GPUIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.338121435:6(950-967)Online publication date: 2-Apr-2024
https://dl.acm.org/doi/10.1109/TPDS.2024.3381214
Ji ZXie ZWu YZhang M(2024)LBSC: A Cost-Aware Caching Framework for Cloud Databases2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00373(4911-4924)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00373
Tie JYang HCheng YSun CPan Z(2024)Sum-Product Network-Based Cardinality Estimation ResearchArtificial Intelligence in China10.1007/978-981-99-7545-7_40(387-397)Online publication date: 23-Mar-2024
https://doi.org/10.1007/978-981-99-7545-7_40
Bai XCoester COh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Sorting with predictionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667277(26563-26584)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3667277
Cui HNiu KZhong S(2023)Intelligent Path-Selection-Aided Decoding of Polar CodesEntropy10.3390/e2502020025:2(200)Online publication date: 19-Jan-2023
https://doi.org/10.3390/e25020200
Amato DGiancarlo RLo Bosco G(2023)Learned Sorted Table Search and Static Indexes in Small-Space Data ModelsData10.3390/data80300568:3(56)Online publication date: 3-Mar-2023
https://doi.org/10.3390/data8030056
Sabek IKraska T(2023)The Case for Learned In-Memory JoinsProceedings of the VLDB Endowment10.14778/3587136.358714816:7(1749-1762)Online publication date: 1-Mar-2023
https://dl.acm.org/doi/10.14778/3587136.3587148
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents