Discussion: tuple radix sort
First, a quick demonstration of what this PoC can do on 1 million
random not-NULL bigints:
set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
240ms
set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
140ms
Background: Peter Geoghegan recently mentioned to me off-list an
interesting set of techniques for sorting in the context of databases.
I'm not yet sure how to approach certain aspects of that architecture,
so I won't go into the full picture at this point. However, there is
one piece that already fits well within our existing architecture, and
that is using radix sort on datum1. The basic sequence is:
1. Partition tuples on first key NULL and not-NULL, according to NULLS
FIRST or NULLS LAST.
2. Do normal qsort on the NULL partition using the tiebreak comparator.
3. Create a "conditioned" or "normalized" datum that encodes datum1
such that unsigned comparison is order-preserving, accounting for ASC
/ DESC as well. I've reused space now unused during in-memory not-NULL
sorts:
typedef struct
{
  void     *tuple;      /* the tuple itself */
  Datum    datum1;      /* value of first key column */
  union
  {
    struct
    {
      bool    isnull1;    /* is first key column NULL? */
      int     srctape;    /* source tape number */
    };
    Datum    cond_datum1; /* sort key for radix sort */
  };
} SortTuple;
4. Radix sort on cond_datum1. For the PoC I've based it on the
implementation in "ska sort" [1] (C++, Boost license). For
medium-sized sorts it uses "American flag sort" (there is a paper [3]
co-authored by M. D. McIlroy, same as in the paper we reference for
quicksort). For larger sorts it's similar, but performs multiple
passes, which takes better advantage of modern CPUs. Upon recursion,
sorts on small partitions divert to quicksort. Any necessary tiebreaks
are handled by quicksort, either after the end of radix sort, or when
diverting to small quicksort.
5. Reset isnull1 to "false" before returning to the caller. This also
must be done when diverting to quicksort.
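As a concrete illustration of step 3, here is a minimal sketch of conditioning a signed 64-bit key so that plain unsigned comparison is order-preserving. This is my own sketch, not code from the patch, and `cond_int64` is a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of datum conditioning: XORing the sign bit maps
 * INT64_MIN..INT64_MAX monotonically onto 0..UINT64_MAX, so unsigned
 * comparison preserves signed order; inverting all bits additionally
 * reverses the order for DESC sorts.
 */
static inline uint64_t
cond_int64(int64_t v, bool desc)
{
	uint64_t	u = (uint64_t) v ^ (UINT64_C(1) << 63);

	return desc ? ~u : u;
}
```

With this encoding, NULL handling and sort direction are decided once up front, so the inner sort loop never branches on them.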
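To show the count-then-permute structure of step 4, here is a simplified, out-of-place MSD sketch. The actual patch is based on ska_sort's in-place American flag approach and diverts small buckets to quicksort; this sketch is only an assumption-laden illustration of the general shape:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simplified MSD radix sort on conditioned 8-byte keys: count byte
 * frequencies for the current byte, scatter into a scratch array,
 * copy back, then recurse into each bucket on the next byte.
 */
static void
msd_radix_sort(uint64_t *a, size_t n, int shift)
{
	size_t		count[256] = {0};
	size_t		offs[256];
	size_t		next[256];
	uint64_t   *tmp;
	size_t		i;
	size_t		off = 0;

	if (n < 2)
		return;

	for (i = 0; i < n; i++)
		count[(a[i] >> shift) & 0xFF]++;

	for (i = 0; i < 256; i++)
	{
		offs[i] = next[i] = off;
		off += count[i];
	}

	tmp = malloc(n * sizeof(uint64_t));
	for (i = 0; i < n; i++)
		tmp[next[(a[i] >> shift) & 0xFF]++] = a[i];
	memcpy(a, tmp, n * sizeof(uint64_t));
	free(tmp);

	/* recurse on the next byte; ties are resolved byte by byte */
	if (shift > 0)
		for (i = 0; i < 256; i++)
			if (count[i] > 1)
				msd_radix_sort(a + offs[i], count[i], shift - 8);
}
```

Calling `msd_radix_sort(a, n, 56)` sorts `n` conditioned keys starting from the most significant byte.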
Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:
drop table if exists test;
create unlogged table test (a bigint);
insert into test select
(1_000_000_000 * random())::bigint % 8 -- mod
-- (1_000_000_000 * random())::bigint  -- random, for the case at the top
from generate_series(1,1_000_000,1) i;
vacuum freeze test;
select pg_prewarm('test');
set work_mem = '64MB';
set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
95ms
set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
84ms
[1] https://github.com/skarupke/ska_sort/tree/master
[2] https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
[3] http://static.usenix.org/publications/compsystems/1993/win_mcilroy.pdf
--
John Naylor
Amazon Web Services
			
Attachments
> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>
> I suspect the challenge
> will be multikey sorts when the first key has low cardinality.

As you predicted, when the first key has very low cardinality, radix
is a little bit slower. I built a test that proves that:

```
evantest=# drop table if exists test_multi;
evantest=# create unlogged table test_multi (category int, name text);
— first column has only 4 distinct values
evantest=# insert into test_multi select (random() * 4)::int as category,
md5(random()::text) || md5(random()::text) as name
from generate_series(1, 1000000);
evantest=# vacuum freeze test_multi;
evantest=# select count(*) from test_multi;
evantest=# set work_mem = '64MB’;
evantest-# \timing on
Timing is on.
evantest=# set wip_radix_sort = 'off';
Time: 0.403 ms
evantest=# \o /dev/null
evantest=# select * from test_multi order by category, name;
Time: 5607.336 ms (00:05.607)
evantest=# select * from test_multi order by category, name;
Time: 5703.555 ms (00:05.704)
evantest=# select * from test_multi order by category, name;
Time: 5692.644 ms (00:05.693)
evantest=# set wip_radix_sort = 'on';
Time: 0.859 ms
evantest=# select * from test_multi order by category, name;
Time: 5822.979 ms (00:05.823)
evantest=# select * from test_multi order by category, name;
Time: 5881.256 ms (00:05.881)
evantest=# select * from test_multi order by category, name;
Time: 5976.351 ms (00:05.976)
```

Roughly 5% slower for this corner case.
However, when I recreate the test table with a high cardinality first
column, wip_radix_sort seems still slower:

```
evantest=# \o
evantest=# drop table if exists test_multi;
DROP TABLE
evantest=# create unlogged table test_multi (category int, name text);
CREATE TABLE
evantest=# insert into test_multi
evantest-# select (random() * 1000000)::int as category,
md5(random()::text) || md5(random()::text) as name
from generate_series(1,1000000);
INSERT 0 1000000
evantest=# vacuum freeze test_multi;
VACUUM
evantest=# select count(*) from test_multi;
  count
---------
 1000000
(1 row)
evantest=# select * from test_multi limit 5;
 category |                               name
----------+------------------------------------------------------------------
   607050 | c555126a5afea9f5ffe3880248c89944d211bc378f8c3b6d125b4360fe8619b7
   843579 | 69b5a1dba76f52ff238566a3f88315a7425116d5d271fb54701b6e49d4afd8ce
   106298 | a96e8674db219e12463ecdbb405b99c631767972e489093045c97976c17c6561
   621860 | 5e6739ea9f533f9cdb0b8db76e3d4ce63be6b2b612c8aff06c4b80451f8f2edc
   290110 | 56944320e5abd3a854fffdd185b969727e8d414448d440725a392cda4c6355c4
(5 rows)
evantest=# \timing on
Timing is on.
evantest=# \o /dev/null
evantest=# set wip_radix_sort = 'off';
Time: 0.904 ms
evantest=# select * from test_multi limit 5;
Time: 0.983 ms
evantest=# select * from test_multi order by category, name;
Time: 593.578 ms
evantest=# select * from test_multi order by category, name;
Time: 597.329 ms
evantest=# select * from test_multi order by category, name;
Time: 600.050 ms
evantest=# set wip_radix_sort = 'on';
Time: 0.737 ms
evantest=# select * from test_multi order by category, name;
Time: 611.604 ms
evantest=# select * from test_multi order by category, name;
Time: 613.115 ms
evantest=# select * from test_multi order by category, name;
Time: 615.003 ms
```

This seems like a real regression.
Then I tried to sort on only the first column; yes, now radix is faster:

```
evantest=# set wip_radix_sort = 'off’;
evantest=# select * from test_multi order by category;
Time: 445.498 ms
evantest=# select * from test_multi order by category;
Time: 451.834 ms
evantest=# select * from test_multi order by category;
Time: 454.531 ms
evantest=# set wip_radix_sort = 'on';
Time: 0.329 ms
evantest=# select * from test_multi order by category;
Time: 402.829 ms
evantest=# select * from test_multi order by category;
Time: 408.014 ms
evantest=# select * from test_multi order by category;
Time: 415.340 ms
evantest=# select * from test_multi order by category;
Time: 413.969 ms
```

Hope the test helps. (The test was run on a MacBook M4.)

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > I suspect the challenge
> > will be multikey sorts when the first key has low cardinality.
>
> As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves that:
>
> ```
> evantest=# drop table if exists test_multi;
> evantest=# create unlogged table test_multi (category int, name text);
> — first column has only 4 distinct values

Thanks for testing. Note it's actually 5 because of rounding. Your
text also seems to have em-dashes and unicode apostrophes where it
should have dashes / single quotes. That's not great if you expect
others to try to reproduce. I'm also not thrilled about having to
remove your psql prompt.

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 4)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

Anyway, because this table is larger than my first example, the input
no longer fits into 64MB of work_mem and it switches to an external
merge sort. Normally I set work_mem to 1GB for testing sorts so I
don't have to think about it, but neglected to in my first email. I
don't know if that explains the disparity, but I've made that change
for my quick tests below.

> evantest=# \o /dev/null
> evantest=# select * from test_multi order by category, name;
[...]
> Roughly 5% slower for this corner case.

Seems fine for me on this old Intel laptop (min of 5 runs):

set wip_radix_sort = 'off';
2368.369
set wip_radix_sort = 'on';
2290.654

It's close enough that I'll want to test more closely at a range of
low-cardinality inputs. I haven't done any rigorous scripted testing
yet, so take this with a grain of salt.
> However, when I recreate the test table with a high cardinality first column, wip_radix_sort seems still slower:

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 1000000)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

> evantest=# set wip_radix_sort = 'off';
> Time: 0.904 ms
> evantest=# select * from test_multi order by category, name;
> Time: 593.578 ms
> evantest=# select * from test_multi order by category, name;
> Time: 597.329 ms
> evantest=# select * from test_multi order by category, name;
> Time: 600.050 ms
>
> evantest=# set wip_radix_sort = 'on';
> Time: 0.737 ms
> evantest=# select * from test_multi order by category, name;
> Time: 611.604 ms
> evantest=# select * from test_multi order by category, name;
> Time: 613.115 ms
> evantest=# select * from test_multi order by category, name;
> Time: 615.003 ms
> ```
>
> This seems like a real regression.

It's better for me here (min of 5 again), although the time scanning
the table probably dominates:

off: 536.257
on: 471.345

--
John Naylor
Amazon Web Services
> On Oct 29, 2025, at 19:29, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>>>
>>> I suspect the challenge
>>> will be multikey sorts when the first key has low cardinality.
>>
>> As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves that:
>>
>> ```
>> evantest=# drop table if exists test_multi;
>> evantest=# create unlogged table test_multi (category int, name text);
>> — first column has only 4 distinct values
>
> Thanks for testing. Note it's actually 5 because of rounding.

Yes, 0-4, 5 in total.

> Your
> text also seems to have em-dashes and unicode apostrophes where it
> should have dashes / single quotes. That's not great if you expect
> others to try to reproduce.

I just copied the content from psql (running in iTerm). I did a Google
search, and found that was because of Mac Mail’s “smart quotes”
substitution. It looks like even if I manually type in a pair of
single quotes, it still does the substitution. I will try to see how
to disable that, but I don’t want to switch to another mail app.

> I'm also not thrilled about having to
> remove your psql prompt.
>

I just wanted to show my entire test process, so I simply copied all
contents from psql. In future, I will remove psql prompts from the
reproduce procedure.

> drop table if exists test_multi;
> create unlogged table test_multi (category int, name text);
> insert into test_multi select (random() * 4)::int as category,
> md5(random()::text) || md5(random()::text) as name from
> generate_series(1, 1000000);
> vacuum freeze test_multi;
>
> Anyway, because this table is larger than my first example, the input
> no longer fits into 64MB of work_mem and it switches to an external
> merge sort. Normally I set work_mem to 1GB for testing sorts so I
> don't have to think about it, but neglected to in my first email.
I changed work_mem to 1GB and reran the test. Since the high
cardinality data is still there, I first reran with that data:

```
evantest=# set work_mem = '1GB';
Time: 0.301 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 575.247 ms
evantest=# select * from test_multi order by category, name;
Time: 554.351 ms
evantest=# select * from test_multi order by category, name;
Time: 565.100 ms
evantest=#
evantest=# set wip_radix_sort = 'on';
Time: 0.752 ms
evantest=# select * from test_multi order by category, name;
Time: 558.057 ms
evantest=# select * from test_multi order by category, name;
Time: 565.542 ms
evantest=# select * from test_multi order by category, name;
Time: 559.973 ms
```

With radix_sort on and off, execution times are almost the same.

Then I restored the low cardinality data; off is still faster than on:

```
evantest=# set wip_radix_sort = ‘off';
Time: 0.549 ms
evantest=# select * from test_multi order by category, name;
Time: 5509.075 ms (00:05.509)
evantest=# select * from test_multi order by category, name;
Time: 5553.566 ms (00:05.554)
evantest=# select * from test_multi order by category, name;
Time: 5598.595 ms (00:05.599)
evantest=# set wip_radix_sort = ‘on';
Time: 0.786 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5770.964 ms (00:05.771)
evantest=# select * from test_multi order by category, name;
Time: 5779.755 ms (00:05.780)
evantest=# select * from test_multi order by category, name;
Time: 5851.134 ms (00:05.851)
evantest=#
evantest=# set work_mem = '2GB’; # increasing work_mem to 2GB doesn’t help
Time: 0.404 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5781.005 ms (00:05.781)
evantest=# select * from test_multi order by category, name;
Time: 5826.025 ms (00:05.826)
evantest=# select * from test_multi order by category, name;
Time: 5937.919 ms (00:05.938)
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>
> <v1-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch>

I just went quickly through the code change. I guess I need more time
to understand the entire logic, but I found a typo that might affect
the tests:

```
+ Assert(last = first);
```

“=“ should be “=="

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:

> I changed work_mem to 1GB and reran the test. Since the high cardinality data is still there, I first reran with that data:

> With radix_sort on and off, execution times are almost the same.

Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Speaking of tests, I forgot to mention that regression tests will fail
since in-place radix sort is an unstable sort, as qsort is as well,
but in a different way -- this is expected. In assert builds, the
patch verifies the sort after the fact with the standard comparator,
and will throw an error if it's wrong.

On Thu, Oct 30, 2025 at 9:19 AM Chao Li <li.evan.chao@gmail.com> wrote:

> I just went quickly through the code change. I guess I need more time to understand the entire logic, but I found a typo that might affect the tests:
>
> ```
> + Assert(last = first);
> ```
>
> “=“ should be “=="

Yes, you're quite right, thanks for spotting! I reran my stress test
that has randomly distributed NULLs and the assert still holds, so
nothing further to fix yet. The NULL partitioning part of the code
hasn't been well tested in its current form, and I may arrange things
so that that step and the datum conditioning step happen in a single
pass. I'm not yet sure if it's important enough to justify the
additional complexity.

--
John Naylor
Amazon Web Services
> On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
>> I changed work_mem to 1GB and reran the test. Since the high cardinality data is still there, I first reran with that data:
>
>> With radix_sort on and off, execution times are almost the same.
>
> Are you by chance running with asserts on? It's happened before, so I
> have to make sure. That makes a big difference here because I disabled
> diversion thresholds in assert builds so that regression tests (few
> cases with large inputs) cover the paths I want, in addition to my
> running a standalone stress test.
>

Yes, assert is always enabled in my sandbox. I can disable assert and
rerun the test later.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
> > Are you by chance running with asserts on? It's happened before, so I
> > have to make sure. That makes a big difference here because I disabled
> > diversion thresholds in assert builds so that regression tests (few
> > cases with large inputs) cover the paths I want, in addition to my
> > running a standalone stress test.
>
> Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Never expect anything meaningful to come from running performance
tests on Assert builds. You should always be rebuilding without
Asserts before doing performance tests.

David
> On Oct 30, 2025, at 13:01, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:
>>> On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
>>> Are you by chance running with asserts on? It's happened before, so I
>>> have to make sure. That makes a big difference here because I disabled
>>> diversion thresholds in assert builds so that regression tests (few
>>> cases with large inputs) cover the paths I want, in addition to my
>>> running a standalone stress test.
>>
>> Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.
>
> Never expect anything meaningful to come from running performance
> tests on Assert builds. You should always be rebuilding without
> Asserts before doing performance tests.
>
Sure, good to learn. Actually I am very new to PG development, so any guidance is greatly appreciated.
I just made a distclean, then ran configure without any parameters.
Now the overall execution time is reduced ~10% compared to the build
with asserts. With the low cardinality data, off and on are very close:
```
evantest=# set wip_radix_sort = 'off';
Time: 0.206 ms
evantest=# select * from test_multi order by category, name;
Time: 5070.277 ms (00:05.070)
evantest=# select * from test_multi order by category, name;
Time: 5158.748 ms (00:05.159)
evantest=# select * from test_multi order by category, name;
Time: 5072.708 ms (00:05.073)
evantest=# set wip_radix_sort = 'on';
Time: 0.177 ms
evantest=# select * from test_multi order by category, name;
Time: 4992.516 ms (00:04.993)
evantest=# select * from test_multi order by category, name;
Time: 5145.361 ms (00:05.145)
evantest=# select * from test_multi order by category, name;
Time: 5101.800 ms (00:05.102)
evantest=# \o
evantest=# show work_mem;
 work_mem
----------
 1GB
(1 row)
Time: 0.186 ms
evantest=# explain select * from test_multi order by category, name;
                                QUERY PLAN
---------------------------------------------------------------------------
 Sort  (cost=122003.84..124503.84 rows=1000000 width=69)
   Sort Key: category, name
   ->  Seq Scan on test_multi  (cost=0.00..22346.00 rows=1000000 width=69)
(3 rows)
```
And with high cardinality test data, on has a big win:
```
evantest=# set wip_radix_sort = 'off';
Time: 0.174 ms
evantest=# select * from test_multi order by category, name;
Time: 353.702 ms
evantest=# select * from test_multi order by category, name;
Time: 375.549 ms
evantest=# select * from test_multi order by category, name;
Time: 367.967 ms
evantest=# set wip_radix_sort = 'on';
Time: 0.147 ms
evantest=# select * from test_multi order by category, name;
Time: 279.537 ms
evantest=# select * from test_multi order by category, name;
Time: 278.114 ms
evantest=# select * from test_multi order by category, name;
Time: 284.273 ms
```
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
			
I wrote:
> The v1 patch
> has some optimizations, but in other ways things are simple and/or
> wasteful. Exactly how things fit together will be informed by what, if
> anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar
to how quicksort diverts to insertion sort), but qsort_tuple is
inefficient because the comparator is called via a function pointer. I
also thought having two different radix sorts was too complex, so I
wondered if it'd be better to get rid of the smaller radix sort (whose
control flow I find harder to understand, even ignoring the unsightly
goto) and have the larger sort divert to a new quicksort
specialization that compares on the conditioned datum. That allows
skipping branches for NULL comparisons and order reversal. I've done
this in v2. It makes sense to replace the three current
integer-comparison quicksorts with one.

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which is not
reachable by radix sort, requiring a pass-by-value datum that compares
like an integer. This is a bit of a risk, since it's possible a third
party fork could be doing something weird. Seems unlikely, but
something to keep in mind.

I used a standalone program (attached) to microbenchmark this new
fallback qsort vs. a pass of radix sort on one byte to get a decent
threshold value. This is not quite fair, since the quicksort will then
be finished, but the radix sort could still need to recurse to the
next byte(s), so these numbers could underestimate the threshold. This
is just to get an idea. The numbers are in RDTSC ticks per element
sorted.
cardinality: 256
number of elements: 100   qsort: 35.4   radix: 49.2
number of elements: 200   qsort: 34.9   radix: 38.1
number of elements: 400   qsort: 42.4   radix: 34.4
number of elements: 800   qsort: 95.0   radix: 29.2
number of elements: 1600  qsort: 115.0  radix: 22.4
number of elements: 3200  qsort: 125.5  radix: 19.4
number of elements: 6400  qsort: 128.1  radix: 17.6

With the highest cardinality possible on a single byte, radix sort is
actually not bad at low inputs. Notice that the time per element is
consistently going down with larger inputs. Smaller inputs have large
constant overheads, made worse by my unrolling the counting step.

cardinality: 2
number of elements: 100   qsort: 09.2   radix: 28.0
number of elements: 200   qsort: 09.1   radix: 19.5
number of elements: 400   qsort: 10.4   radix: 15.7
number of elements: 800   qsort: 10.1   radix: 14.5
number of elements: 1600  qsort: 10.4   radix: 13.7
number of elements: 3200  qsort: 15.8   radix: 13.6
number of elements: 6400  qsort: 22.2   radix: 13.8

This is an extreme best case for B&M quicksort, which is basically
O(n) -- the point at which the per-element time goes up seems purely
due to exceeding L1 cache. Radix sort takes a big input to catch up,
but it doesn't seem awful, either.

cardinality: 16
number of elements: 100   qsort: 19.5   radix: 34.5
number of elements: 200   qsort: 18.7   radix: 22.6
number of elements: 400   qsort: 18.5   radix: 17.2
number of elements: 800   qsort: 25.0   radix: 14.8
number of elements: 1600  qsort: 43.8   radix: 13.8
number of elements: 3200  qsort: 51.2   radix: 13.2
number of elements: 6400  qsort: 59.0   radix: 12.8

This is still low cardinality, but behaves more like the high
cardinality case.

I've set the threshold to 400 for now, but I'm not claiming that's the
end story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor.
Unrolling makes smaller inputs worse (which we can reach by recursing
from larger inputs), but unrolling seems important for large inputs
with low cardinality (a few percent, but I haven't shared numbers
yet). We've found that a large input with only 4-5 distinct values
just barely wins with radix sort. I'll be curious to see if unrolling
is actually needed to prevent regressions there.

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULL is not predictable, so
I'm thinking of turning part of that into a branch-free cyclic
permutation, similar to
https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

- The quicksort on the NULL partition still compares isnull1 -- the
branches are predictable but perhaps it's worth it to add a
specialization that skips that.

--
John Naylor
Amazon Web Services
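The quicksort specialization on the conditioned datum described in the message above boils down to a single unsigned comparison, since NULL ordering and ASC/DESC were already folded into the encoding. A hypothetical sketch (the name `cmp_cond_datum` is mine, not the patch's):

```c
#include <stdint.h>

/*
 * Hypothetical comparator for the conditioned datum: one unsigned
 * comparison, with no isnull1 check and no order-reversal branch,
 * because both were folded into the key during conditioning.
 */
static inline int
cmp_cond_datum(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}
```

The `(a > b) - (a < b)` idiom avoids the overflow hazard of subtracting the keys directly.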
Attachments
I wrote:
> I've set the threshold to 400 for now, but I'm not claiming that's the
> end story. In addition to the underestimation mentioned above,
> unrolling the counting step is a factor. Unrolling makes smaller
> inputs worse (which we can reach by recursing from larger inputs), but
> unrolling seems important for large inputs with low cardinality (a few
> percent, but I haven't shared numbers yet). We've found that a large
> input with only 4-5 distinct values just barely wins with radix sort.
> I'll be curious to see if unrolling is actually needed to prevent
> regressions there.

Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count, since we simply skip some useless
work.
(This is not as relevant when we have an abbreviated datum.)

Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:

cardinality: 256
number of elements: 100   qsort: 34.8   radix: 38.3
number of elements: 200   qsort: 34.9   radix: 29.7
number of elements: 400   qsort: 40.8   radix: 29.2

cardinality: 16
number of elements: 100   qsort: 19.3   radix: 26.2
number of elements: 200   qsort: 18.9   radix: 18.2
number of elements: 400   qsort: 18.5   radix: 14.5

cardinality: 2
number of elements: 100   qsort: 09.3   radix: 21.6
number of elements: 200   qsort: 08.9   radix: 15.4
number of elements: 400   qsort: 10.3   radix: 14.0

To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form to allow turning
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:

V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms
on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms

V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms
on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms

So v3 gets rid of loop unrolling and reduces the threshold to 200.
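The common prefix detection described above can be sketched as follows. This is my own illustration under GCC/Clang (it relies on `__builtin_clzll`); the patch may compute this differently, e.g. fused into the conditioning pass:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * OR together the XOR of every key with the first key; the leading
 * zero bytes of the result are shared by all keys, so the radix sort
 * can start at the first byte that actually differs.
 */
static int
common_prefix_bytes(const uint64_t *keys, size_t n)
{
	uint64_t	diff = 0;
	size_t		i;

	for (i = 1; i < n; i++)
		diff |= keys[i] ^ keys[0];

	if (diff == 0)
		return 8;				/* all keys identical */
	/* __builtin_clzll is undefined for 0, which we ruled out above */
	return __builtin_clzll(diff) / 8;
}
```

For typical small non-negative integers widened to 8 bytes, this reports a prefix of several bytes, letting the sort skip those passes entirely.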
[1] https://www.postgresql.org/message-id/flat/CAApHDvqkHZsT2gaAWFM7D%3D7qyQ%3DeKXQvvn%2BBkwCn4Rvj1w4EKQ%40mail.gmail.com#1c67321e09be15e031d3995d45a1b8fd

TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup

--
John Naylor
Amazon Web Services