Created attachment 12978 [details] Evidence for quadratic worst case. The C++ standard mandates that `std::sort` has complexity O(N log(N)), but libc++'s implementation takes O(N^2) in the worst case. As proof I've attached a program that constructs a worst case input for several sizes. It also instruments the number of comparisons used to sort these worst cases and prints the results. The technique used is described in the 1999 paper "A Killer Adversary for Quicksort" by M. D. McIlroy. Compiling and running: $ clang++ -O2 -m64 -march=native -std=c++11 -stdlib=libc++ main.cpp -nodefaultlibs -lc++ -lc++abi -lm -lc -lgcc_s -lgcc && ./a.out N: comparisons 100: 2479 200: 10229 400: 40729 800: 161729 1600: 448698 3200: 1413898 6400: 5264298 This is clearly quadratic. To illustrate this defect more, these are the comparison counts given by compiling using libstdc++: $ clang++ -O2 -m64 -march=native -std=c++11 main.cpp && ./a.out N: comparisons 100: 1742 200: 4260 400: 10035 800: 22784 1600: 51049 3200: 112386 6400: 244850 Inspecting the source of <algorithm> shows the cause of the issue: there is no introsort implemented, only quicksort (albeit with non-trivial optimizations, but nothing preventing the worst case). I've written a patch that augments the current implementation to make it work as an introspective sort, switching to heapsort if the recursion depth exceeds 2*floor(log(N)). This post can only have one attachment, so I'll post it as an attachment to a comment. Having not contributed to libc++ before I'm not 100% familiar with all coding styles/(un)written rules. My changes are rather minimal though, so I've just followed what style could be found in context. And of course I knowingly submit my code under the libc++ licenses, the MIT license and the UIUC License.
Created attachment 12979 [details] A patch that fixes the quadratic behaviour by implementing introsort. This is the patch my bug report refers to.
Hi Orson, Please mail this patch to the cfe-commits list for review. Please make sure the patch is sent as an attachment to the e-mail (not inline text), and please make sure the repeat the background information you've included here in the e-mail itself. For more information on the development policy and workflow, please see: http://llvm.org/docs/DeveloperPolicy.html Thanks!
Hi Hal, I've sent the patch to the mailing list, where it is waiting for moderator approval since I'm not a member of the list. Greetings, Orson Peters
An update. EricWF is working on a set of benchmarks for std::sort and other algorithms/containers. He expects to finish this quite soon, and I'll be using that to make sure this doesn't introduce a performance regression - and will be adding this to the test suite. So I haven't forgotten about it.
(In reply to comment #4) > An update. EricWF is working on a set of benchmarks for std::sort and other > algorithms/containers. He expects to finish this quite soon, and I'll be > using that to make sure this doesn't introduce a performance regression - > and will be adding this to the test suite. > > So I haven't forgotten about it. What's the status on this?
I think std::sort() is still broken in trunk as of today. std::sort() should not call the compare function on the same element: it cannot be O(N log N) if it does. Related bug: https://llvm.org/bugs/show_bug.cgi?id=28551 Reduced testcase: $ cat a.cc #include <algorithm> #include <vector> #include <cassert> int main(int argc, char** argv) { int size = 100; std::vector<int> v(size); // Elements in v are unique. for (int i = 0; i < size; ++i) v[i] = i; std::sort(v.begin(), v.end(), [&](int x, int y) { assert(x != y); return x < y; }); return 0; } $ clang++ -std=c++11 -stdlib=libc++ a.cc -o a.out $ ./a.out a.out: a.cc:14: auto main(int, char **)::(anonymous class)::operator()(int, int) const: Assertion `x != y' failed. ./go.sh: line 5: 19447 Aborted (core dumped) ./a.out $ clang++ -std=c++11 -stdlib=libstdc++ a.cc -o a.out $ ./a.out Works fine with libstdc++.
*** Bug 28551 has been marked as a duplicate of this bug. ***
I would like to point out that our std::sort implementation is between 2x and 5x faster on already sorted inputs than libstdc++ is. Some benchmark results: Run on (4 X 4228.3 MHz CPU s) // libc++ // Benchmark Time CPU Iterations ----------------------------------------------------------------------------- BM_Sort/random_uint32/1024 170614 ns 174629 ns 4375 BM_Sort/sorted_ascending_uint32/1024 12902 ns 13276 ns 53030 BM_Sort/sorted_descending_uint32/1024 20919 ns 21712 ns 29661 BM_Sort/single_element_uint32/1024 13747 ns 14384 ns 60345 BM_Sort/random_strings/1024 1097028 ns 998514 ns 673 BM_Sort/sorted_ascending_strings/1024 233688 ns 208469 ns 3070 BM_Sort/sorted_descending_strings/1024 341642 ns 361934 ns 1923 BM_Sort/single_element_strings/1024 1081605 ns 1126143 ns 547 // libstdc++ // Benchmark Time CPU Iterations ----------------------------------------------------------------------------- BM_Sort/random_uint32/1024 156709 ns 161200 ns 4268 BM_Sort/sorted_ascending_uint32/1024 49431 ns 53600 ns 10000 BM_Sort/sorted_descending_uint32/1024 42401 ns 41201 ns 16990 BM_Sort/single_element_uint32/1024 32115 ns 30892 ns 20588 BM_Sort/random_strings/1024 878349 ns 796434 ns 673 BM_Sort/sorted_ascending_strings/1024 576253 ns 532000 ns 1000 BM_Sort/sorted_descending_strings/1024 506059 ns 368390 ns 1683 BM_Sort/single_element_strings/1024 2453953 ns 2542751 ns 269 The benchmarks used can be found here: https://reviews.llvm.org/D22240
Eric Fiselier, can you run that same benchmark with the 'block' branch of my pattern-defeating quicksort algorithm? https://github.com/orlp/pdqsort/tree/block I've been waiting for a while now on this benchmark suite to start discussion of integrating the algorithm to libc++, I wasn't aware you finished the benchmark.
(In reply to comment #8) > Benchmark Time CPU Iterations > ----------------------------------------------------------------------------- > BM_Sort/random_uint32/1024 170614 ns 174629 ns 4375 > BM_Sort/sorted_ascending_uint32/1024 12902 ns 13276 ns 53030 [...] Could you please measure the number of instructions executed, data reads, and data writes, instead of the number of nano seconds of execution? I think those are a more stable metrics than the execution time that may vary with the state of the machine, daemons, noise, etc. $ valgrind --tool=cachegrind ./a.out $ cg_annotate cachegrind.out
I double Orson's request about using pdqsort as internal algorithm for std::sort. You can see here, how pdqsort is fast (https://github.com/orlp/pdqsort).
Fixed with https://reviews.llvm.org/D113413, should land in llvm 14 If you have stability sorting guarantees that you cannot fix and need migration, use -D_LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY or -D_LIBCPP_DEBUG=1 from https://reviews.llvm.org/D96946 to randomize the input range before sorting