Thread: tuple radix sort


tuple radix sort

From
John Naylor
Date:
First, a quick demonstration of what this PoC can do on 1 million
random not-NULL bigints:

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
240ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
140ms

Background: Peter Geoghegan recently mentioned to me off-list an
interesting set of techniques for sorting in the context of databases.
I'm not yet sure how to approach certain aspects of that architecture,
so I won't go into the full picture at this point. However, there is
one piece that already fits well within our existing architecture, and
that is using radix sort on datum1. The basic sequence is:

1. Partition tuples on first key NULL and not-NULL, according to NULLS
FIRST or NULLS LAST.
2. Do normal qsort on the NULL partition using the tiebreak comparator.
3. Create a "conditioned" or "normalized" datum that encodes datum1
such that unsigned comparison is order-preserving, accounting for ASC
/ DESC as well. I've reused space now unused during in-memory not-NULL
sorts:

typedef struct
{
  void     *tuple;      /* the tuple itself */
  Datum    datum1;      /* value of first key column */

  union
  {
    struct
    {
      bool    isnull1;    /* is first key column NULL? */
      int     srctape;    /* source tape number */
    };
    Datum    cond_datum1; /* sort key for radix sort */
  };
} SortTuple;


4. Radix sort on cond_datum1. For the PoC I've based it on the
implementation in "ska sort" [1] (C++, Boost license). For
medium-sized sorts it uses "American flag sort" (there is a paper [3]
co-authored by M. D. McIlroy, same as in the paper we reference for
quicksort). For larger sorts it's similar, but performs multiple
passes, which takes better advantage of modern CPUs. Upon recursion,
sorts on small partitions divert to quicksort. Any necessary tiebreaks
are handled by quicksort, either after the end of radix sort, or when
diverting to small quicksort.
5. Reset isnull1 to "false" before returning to the caller. This also
must be done when diverting to quicksort.
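To illustrate step 3, the conditioning for a signed 64-bit first key could look like the following (a hedged sketch under my own naming, not the patch's code): flipping the sign bit maps signed order onto unsigned order, and a bitwise NOT reverses it for DESC.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of step 3 for a signed 64-bit datum1 (the
 * function name is mine, not the patch's).  Flipping the sign bit
 * makes signed order match unsigned order; inverting all bits then
 * reverses the order for DESC.
 */
static inline uint64_t
normalize_int64(int64_t v, bool desc)
{
	uint64_t	u = (uint64_t) v ^ UINT64_C(0x8000000000000000);

	return desc ? ~u : u;
}
```

With keys conditioned this way, plain unsigned (and hence byte-wise) comparison yields the requested ASC or DESC order.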

Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:

drop table if exists test;
create unlogged table test (a bigint);
insert into test select
(1_000_000_000 * random())::bigint % 8 -- mod
-- (1_000_000_000 * random())::bigint  -- random, for the case at the top
from generate_series(1,1_000_000,1) i;
vacuum freeze test;

select pg_prewarm('test');
set work_mem = '64MB';

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
95ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
84ms


[1] https://github.com/skarupke/ska_sort/tree/master
[2] https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
[3] http://static.usenix.org/publications/compsystems/1993/win_mcilroy.pdf

--
John Naylor
Amazon Web Services

Attachments

Re: tuple radix sort

From
Chao Li
Date:

> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>
> I suspect the challenge
> will be multikey sorts when the first key has low cardinality.

As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves
that:

```
evantest=# drop table if exists test_multi;
evantest=# create unlogged table test_multi (category int, name text);
— first column has only 4 distinct values
evantest=# insert into test_multi select (random() * 4)::int as category, md5(random()::text) || md5(random()::text) as
name from generate_series(1, 1000000);
evantest=# vacuum freeze test_multi;
evantest=# select count(*) from test_multi;
evantest=# set work_mem = '64MB’;

evantest-# \timing on
Timing is on.
evantest=# set wip_radix_sort = 'off';
Time: 0.403 ms
evantest=# \o /dev/null
evantest=# select * from test_multi order by category, name;
Time: 5607.336 ms (00:05.607)
evantest=# select * from test_multi order by category, name;
Time: 5703.555 ms (00:05.704)
evantest=# select * from test_multi order by category, name;
Time: 5692.644 ms (00:05.693)

evantest=# set wip_radix_sort = 'on';
Time: 0.859 ms
evantest=# select * from test_multi order by category, name;
Time: 5822.979 ms (00:05.823)
evantest=# select * from test_multi order by category, name;
Time: 5881.256 ms (00:05.881)
evantest=# select * from test_multi order by category, name;
Time: 5976.351 ms (00:05.976)
```

Roughly 5% slower for this corner case.

However, when I recreated the test table with a high-cardinality first column, wip_radix_sort still seemed slower:

```
evantest=# \o
evantest=# drop table if exists test_multi;
DROP TABLE
evantest=# create unlogged table test_multi (category int, name text);
CREATE TABLE
evantest=# insert into test_multi
evantest-# select (random() * 1000000)::int as category,  md5(random()::text) || md5(random()::text) as name from
generate_series(1,1000000); 
INSERT 0 1000000
evantest=# vacuum freeze test_multi;
VACUUM
evantest=# select count(*) from test_multi;
  count
---------
 1000000
(1 row)

evantest=# select * from test_multi limit 5;
 category |                               name
----------+------------------------------------------------------------------
   607050 | c555126a5afea9f5ffe3880248c89944d211bc378f8c3b6d125b4360fe8619b7
   843579 | 69b5a1dba76f52ff238566a3f88315a7425116d5d271fb54701b6e49d4afd8ce
   106298 | a96e8674db219e12463ecdbb405b99c631767972e489093045c97976c17c6561
   621860 | 5e6739ea9f533f9cdb0b8db76e3d4ce63be6b2b612c8aff06c4b80451f8f2edc
   290110 | 56944320e5abd3a854fffdd185b969727e8d414448d440725a392cda4c6355c4
(5 rows)

evantest=# \timing on
Timing is on.

evantest=# \o /dev/null
evantest=# set wip_radix_sort = 'off';
Time: 0.904 ms
evantest=# select * from test_multi limit 5;
Time: 0.983 ms
evantest=# select * from test_multi order by category, name;
Time: 593.578 ms
evantest=# select * from test_multi order by category, name;
Time: 597.329 ms
evantest=# select * from test_multi order by category, name;
Time: 600.050 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.737 ms
evantest=# select * from test_multi order by category, name;
Time: 611.604 ms
evantest=# select * from test_multi order by category, name;
Time: 613.115 ms
evantest=# select * from test_multi order by category, name;
Time: 615.003 ms
```

This seems like a real regression.

Then I tried sorting only on the first column, and indeed radix is now faster:

```
evantest=# set wip_radix_sort = 'off’;
evantest=# select * from test_multi order by category;
Time: 445.498 ms
evantest=# select * from test_multi order by category;
Time: 451.834 ms
evantest=# select * from test_multi order by category;
Time: 454.531 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.329 ms
evantest=# select * from test_multi order by category;
Time: 402.829 ms
evantest=# select * from test_multi order by category;
Time: 408.014 ms
evantest=# select * from test_multi order by category;
Time: 415.340 ms
evantest=# select * from test_multi order by category;
Time: 413.969 ms
```

Hope the test helps. (The test was run on a MacBook M4.)

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: tuple radix sort

From
John Naylor
Date:
On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
> > On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > I suspect the challenge
> > will be multikey sorts when the first key has low cardinality.
>
> As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that
> proves that:
>
> ```
> evantest=# drop table if exists test_multi;
> evantest=# create unlogged table test_multi (category int, name text);
> — first column has only 4 distinct values

Thanks for testing. Note it's actually 5 because of rounding. Your
text also seems to have em-dashes and unicode apostrophes where it
should have dashes / single quotes. That's not great if you expect
others to try to reproduce. I'm also not thrilled about having to
remove your psql prompt.

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 4)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

Anyway, because this table is larger than my first example, the input
no longer fits into 64MB of work_mem and it switches to an external
merge sort. Normally I set work_mem to 1GB for testing sorts so I
don't have to think about it, but neglected to in my first email. I
don't know if that explains the disparity, but I've made that change
for my quick tests below.

> evantest=# \o /dev/null
> evantest=# select * from test_multi order by category, name;
[...]
> Roughly 5% slower for this corner case.

Seems fine for me on this old Intel laptop (min of 5 runs):

set wip_radix_sort = 'off';
2368.369

set wip_radix_sort = 'on';
2290.654

It's close enough that I'll want to test more closely at a range of
low-cardinality inputs. I haven't done any rigorous scripted testing
yet, so take this with a grain of salt.

> However, when I recreate the test table with high cardinality first column, wip_radix_sort seems still slower:

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 1000000)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

> evantest=# set wip_radix_sort = 'off';
> Time: 0.904 ms

> evantest=# select * from test_multi order by category, name;
> Time: 593.578 ms
> evantest=# select * from test_multi order by category, name;
> Time: 597.329 ms
> evantest=# select * from test_multi order by category, name;
> Time: 600.050 ms
>
> evantest=# set wip_radix_sort = 'on';
> Time: 0.737 ms
> evantest=# select * from test_multi order by category, name;
> Time: 611.604 ms
> evantest=# select * from test_multi order by category, name;
> Time: 613.115 ms
> evantest=# select * from test_multi order by category, name;
> Time: 615.003 ms
> ```
>
> This seems like a real regression.

It's better for me here (min of 5 again), although the time scanning
the table probably dominates:

off:
536.257

on:
471.345

--
John Naylor
Amazon Web Services



Re: tuple radix sort

From
Chao Li
Date:

> On Oct 29, 2025, at 19:29, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>>>
>>> I suspect the challenge
>>> will be multikey sorts when the first key has low cardinality.
>>
>> As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that
>> proves that:
>>
>> ```
>> evantest=# drop table if exists test_multi;
>> evantest=# create unlogged table test_multi (category int, name text);
>> — first column has only 4 distinct values
>
> Thanks for testing. Note it's actually 5 because of rounding.

Yes, 0-4, five values in total.

> Your
> text also seems to have em-dashes and unicode apostrophes where it
> should have dashes / single quotes. That's not great if you expect
> others to try to reproduce.

I just copied the content from psql (running in iTerm). I did a Google search and found that this was because of Mac Mail's
"smart quotes" substitution. It seems that even if I manually type a pair of single quotes, it still does the substitution.
I will try to find out how to disable that, but I don't want to switch to another mail app.

> I'm also not thrilled about having to
> remove your psql prompt.
>

I just wanted to show my entire test process, so I simply copied all the content from psql. In the future, I will remove
psql prompts from the reproduction steps.

> drop table if exists test_multi;
> create unlogged table test_multi (category int, name text);
> insert into test_multi select (random() * 4)::int as category,
> md5(random()::text) || md5(random()::text) as name from
> generate_series(1, 1000000);
> vacuum freeze test_multi;
>
> Anyway, because this table is larger than my first example, the input
> no longer fits into 64MB of work_mem and it switches to an external
> merge sort. Normally I set work_mem to 1GB for testing sorts so I
> don't have to think about it, but neglected to in my first email.

I changed work_mem to 1GB and reran the test. Since the high-cardinality data was still there, I first reran with that data:

```
evantest=# set work_mem = '1GB';
Time: 0.301 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 575.247 ms
evantest=# select * from test_multi order by category, name;
Time: 554.351 ms
evantest=# select * from test_multi order by category, name;
Time: 565.100 ms
evantest=#
evantest=# set wip_radix_sort = 'on';
Time: 0.752 ms
evantest=# select * from test_multi order by category, name;
Time: 558.057 ms
evantest=# select * from test_multi order by category, name;
Time: 565.542 ms
evantest=# select * from test_multi order by category, name;
Time: 559.973 ms
```

With radix_sort on and off, execution times are almost the same.

Then I restored the data to low cardinality; off is still faster than on:
```
evantest=# set wip_radix_sort = 'off';
Time: 0.549 ms
evantest=# select * from test_multi order by category, name;
Time: 5509.075 ms (00:05.509)
evantest=# select * from test_multi order by category, name;
Time: 5553.566 ms (00:05.554)
evantest=# select * from test_multi order by category, name;
Time: 5598.595 ms (00:05.599)
evantest=# set wip_radix_sort = 'on';
Time: 0.786 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5770.964 ms (00:05.771)
evantest=# select * from test_multi order by category, name;
Time: 5779.755 ms (00:05.780)
evantest=# select * from test_multi order by category, name;
Time: 5851.134 ms (00:05.851)
evantest=#
evantest=# set work_mem = '2GB'; -- increasing work_mem to 2GB doesn't help
Time: 0.404 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5781.005 ms (00:05.781)
evantest=# select * from test_multi order by category, name;
Time: 5826.025 ms (00:05.826)
evantest=# select * from test_multi order by category, name;
Time: 5937.919 ms (00:05.938)
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: tuple radix sort

From
Chao Li
Date:

> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>
> <v1-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch>

I just took a quick pass through the code change. I guess I need more time to understand the entire logic, but I found a typo
that might affect the tests:

```
+    Assert(last = first);
```

"=" should be "=="

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: tuple radix sort

From
John Naylor
Date:
On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:

> I changed work_mem to 1GB and reran the test. Since the high-cardinality data was still there, I first reran with
> that data:

> With radix_sort on and off, execution time are almost the same.

Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Speaking of tests, I forgot to mention that regression tests will fail
since in-place radix sort is an unstable sort, as qsort is as well,
but in a different way -- this is expected. In assert builds, the
patch verifies the sort after the fact with the standard comparator,
and will throw an error if it's wrong.

On Thu, Oct 30, 2025 at 9:19 AM Chao Li <li.evan.chao@gmail.com> wrote:
> I just took a quick pass through the code change. I guess I need more time to understand the entire logic, but I found a typo
> that might affect the tests:
>
> ```
> +       Assert(last = first);
> ```
>
> "=" should be "=="

Yes, you're quite right, thanks for spotting! I reran my stress test
that has randomly distributed NULLs and the assert still holds, so
nothing further to fix yet. The NULL partitioning part of the code
hasn't been well tested in its current form, and I may arrange things
so that that step and the datum conditioning step happen in a single
pass. I'm not yet sure if it's important enough to justify the
additional complexity.

--
John Naylor
Amazon Web Services



Re: tuple radix sort

From
Chao Li
Date:

> On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
>> I changed work_mem to 1GB and reran the test. Since the high-cardinality data was still there, I first reran with
>> that data:
>
>> With radix_sort on and off, execution time are almost the same.
>
> Are you by chance running with asserts on? It's happened before, so I
> have to make sure. That makes a big difference here because I disabled
> diversion thresholds in assert builds so that regression tests (few
> cases with large inputs) cover the paths I want, in addition to my
> running a standalone stress test.
>

Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






Re: tuple radix sort

From
David Rowley
Date:
On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:
> > On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
> > Are you by chance running with asserts on? It's happened before, so I
> > have to make sure. That makes a big difference here because I disabled
> > diversion thresholds in assert builds so that regression tests (few
> > cases with large inputs) cover the paths I want, in addition to my
> > running a standalone stress test.
>
> Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Never expect anything meaningful to come from running performance
tests on Assert builds. You should always be rebuilding without
Asserts before doing performance tests.

David



Re: tuple radix sort

From
Chao Li
Date:

> On Oct 30, 2025, at 13:01, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:
>>> On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
>>> Are you by chance running with asserts on? It's happened before, so I
>>> have to make sure. That makes a big difference here because I disabled
>>> diversion thresholds in assert builds so that regression tests (few
>>> cases with large inputs) cover the paths I want, in addition to my
>>> running a standalone stress test.
>>
>> Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.
>
> Never expect anything meaningful to come from running performance
> tests on Assert builds. You should always be rebuilding without
> Asserts before doing performance tests.
>

Sure, good to learn. Actually I am very new to PG development, so any guidance is greatly appreciated.

I just made a distclean, then configured without any parameters. Now the overall execution time is roughly 10% lower than with
asserts. With the low-cardinality data, off and on are very close:

```
evantest=# set wip_radix_sort = 'off';
Time: 0.206 ms
evantest=# select * from test_multi order by category, name;
Time: 5070.277 ms (00:05.070)
evantest=# select * from test_multi order by category, name;
Time: 5158.748 ms (00:05.159)
evantest=# select * from test_multi order by category, name;
Time: 5072.708 ms (00:05.073)

evantest=# set wip_radix_sort = 'on';
Time: 0.177 ms
evantest=# select * from test_multi order by category, name;
Time: 4992.516 ms (00:04.993)
evantest=# select * from test_multi order by category, name;
Time: 5145.361 ms (00:05.145)
evantest=# select * from test_multi order by category, name;
Time: 5101.800 ms (00:05.102)

evantest=# \o
evantest=# show work_mem;
 work_mem
----------
 1GB
(1 row)

Time: 0.186 ms
evantest=# explain select * from test_multi order by category, name;
                                QUERY PLAN
---------------------------------------------------------------------------
 Sort  (cost=122003.84..124503.84 rows=1000000 width=69)
   Sort Key: category, name
   ->  Seq Scan on test_multi  (cost=0.00..22346.00 rows=1000000 width=69)
(3 rows)
```

And with high cardinality test data, on has a big win:
```
evantest=# set wip_radix_sort = 'off';
Time: 0.174 ms
evantest=# select * from test_multi order by category, name;
Time: 353.702 ms
evantest=# select * from test_multi order by category, name;
Time: 375.549 ms
evantest=# select * from test_multi order by category, name;
Time: 367.967 ms
evantest=# set wip_radix_sort = 'on';
Time: 0.147 ms
evantest=# select * from test_multi order by category, name;
Time: 279.537 ms
evantest=# select * from test_multi order by category, name;
Time: 278.114 ms
evantest=# select * from test_multi order by category, name;
Time: 284.273 ms
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: tuple radix sort

From
John Naylor
Date:
I wrote:

> The v1 patch
> has some optimizations, but in other ways things are simple and/or
> wasteful. Exactly how things fit together will be informed by what, if
> anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar
to how quicksort diverts to insertion sort), but qsort_tuple is
inefficient because the comparator is called via a function pointer.

I also thought having two different radix sorts was too complex, so I
wondered if it'd be better to get rid of the smaller radix sort (whose
control flow I find harder to understand, even ignoring the unsightly
goto) and have the larger sort divert to a new quicksort
specialization that compares on the conditioned datum. That allows
skipping branches for NULL comparisons and order reversal. I've done
this in v2. It makes sense to replace the three current
integer-comparison quicksorts with one.
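For readers unfamiliar with the structure under discussion, the general shape of an in-place byte-at-a-time radix ("American flag") pass with small-partition diversion looks roughly like this. This is a textbook-style sketch over bare 64-bit normalized keys, not the patch's code, and the threshold is arbitrary:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative in-place "American flag" radix sort on 64-bit
 * normalized keys, most significant byte first.  Small partitions
 * divert to a comparison sort, as in the patch, though this sketch
 * uses libc qsort rather than a specialized quicksort.
 */
#define DIVERT_THRESHOLD 32

static int
cmp_u64(const void *a, const void *b)
{
	uint64_t	x = *(const uint64_t *) a;
	uint64_t	y = *(const uint64_t *) b;

	return (x > y) - (x < y);
}

static void
flag_sort(uint64_t *keys, size_t n, int level)
{
	size_t		count[256] = {0};
	size_t		start[256];
	size_t		end[256];
	int			shift = 56 - 8 * level;
	size_t		pos;

	if (n < DIVERT_THRESHOLD || level > 7)
	{
		qsort(keys, n, sizeof(uint64_t), cmp_u64);
		return;
	}

	/* count how many keys fall into each byte-valued bucket */
	for (size_t i = 0; i < n; i++)
		count[(keys[i] >> shift) & 0xFF]++;

	pos = 0;
	for (int b = 0; b < 256; b++)
	{
		start[b] = pos;
		pos += count[b];
		end[b] = pos;
	}

	/* "deal" step: cycle each key into its home bucket in place */
	for (int b = 0; b < 256; b++)
	{
		while (start[b] < end[b])
		{
			uint64_t	v = keys[start[b]];
			int			d;

			while ((d = (int) ((v >> shift) & 0xFF)) != b)
			{
				uint64_t	tmp = keys[start[d]];

				keys[start[d]++] = v;
				v = tmp;
			}
			keys[start[b]++] = v;
		}
	}

	/* recurse into each bucket on the next byte */
	pos = 0;
	for (int b = 0; b < 256; b++)
	{
		if (count[b] > 1)
			flag_sort(keys + pos, count[b], level + 1);
		pos += count[b];
	}
}
```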

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which is not
reachable by radix sort, requiring a pass-by-value datum that compares
like an integer. This is a bit of a risk, since it's possible a third
party fork could be doing something weird. Seems unlikely, but
something to keep in mind.

I used a standalone program (attached) to microbenchmark this new
fallback qsort vs. a pass of radix sort on one byte to get a decent
threshold value. This is not quite fair, since the quicksort will then
be finished, but the radix sort could still need to recurse to the
next byte(s), so these numbers could underestimate the threshold. This
is just to get an idea.

The numbers are in RDTSC ticks per element sorted.

cardinality: 256
number of elements:  100   qsort: 35.4 radix: 49.2
number of elements:  200   qsort: 34.9 radix: 38.1
number of elements:  400   qsort: 42.4 radix: 34.4
number of elements:  800   qsort: 95.0 radix: 29.2
number of elements: 1600   qsort: 115.0 radix: 22.4
number of elements: 3200   qsort: 125.5 radix: 19.4
number of elements: 6400   qsort: 128.1 radix: 17.6

With the highest cardinality possible on a single byte, radix sort is
actually not bad at low inputs. Notice that the time per element is
consistently going down with larger inputs. Smaller inputs have large
constant overheads, made worse by my unrolling the counting step.

cardinality: 2
number of elements:  100   qsort: 09.2 radix: 28.0
number of elements:  200   qsort: 09.1 radix: 19.5
number of elements:  400   qsort: 10.4 radix: 15.7
number of elements:  800   qsort: 10.1 radix: 14.5
number of elements: 1600   qsort: 10.4 radix: 13.7
number of elements: 3200   qsort: 15.8 radix: 13.6
number of elements: 6400   qsort: 22.2 radix: 13.8

This is an extreme best case for B&M quicksort, which is basically
O(n) -- the point at which the per-element time goes up seems purely
due to exceeding L1 cache. Radix sort takes a big input to catch up,
but it doesn't seem awful, either.

cardinality: 16
number of elements:  100   qsort: 19.5 radix: 34.5
number of elements:  200   qsort: 18.7 radix: 22.6
number of elements:  400   qsort: 18.5 radix: 17.2
number of elements:  800   qsort: 25.0 radix: 14.8
number of elements: 1600   qsort: 43.8 radix: 13.8
number of elements: 3200   qsort: 51.2 radix: 13.2
number of elements: 6400   qsort: 59.0 radix: 12.8

This is still low cardinality, but behaves more like the high cardinality case.

I've set the threshold to 400 for now, but I'm not claiming that's the
end story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor. Unrolling makes smaller
inputs worse (which we can reach by recursing from larger inputs), but
unrolling seems important for large inputs with low cardinality (a few
percent, but I haven't shared numbers yet). We've found that a large
input with only 4-5 distinct values just barely wins with radix sort.
I'll be curious to see if unrolling is actually needed to prevent
regressions there.

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULL is not predictable, so
I'm thinking of turning part of that into a branch-free cyclic
permutation, similar to
https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

- The quicksort on the NULL partition still compares isnull1 -- the
branches are predictable but perhaps it's worth it to add a
specialization that skips that.
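The branch-free idea in the first point can be sketched as follows. This is my own out-of-place simplification, not the cyclic permutation from the linked thread: writing every element to both cursors and advancing only the matching one removes the data-dependent branch from the loop body.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical stripped-down SortTuple for illustration */
typedef struct
{
	uint64_t	datum1;
	bool		isnull1;
} DemoTuple;

/*
 * Branch-free NULLS LAST partition sketch (mine, not the patch's).
 * Each element is written to both the front and back cursors, and
 * only the appropriate cursor advances, so there is no branch on
 * isnull1.  NULLs land in reverse input order, which is fine since
 * the NULL partition is qsorted afterwards anyway.
 */
static void
partition_nulls_last(const DemoTuple *in, DemoTuple *out, size_t n)
{
	size_t		lo = 0;
	size_t		hi;

	if (n == 0)
		return;
	hi = n - 1;
	for (size_t i = 0; i < n; i++)
	{
		out[lo] = in[i];
		out[hi] = in[i];
		lo += !in[i].isnull1;
		hi -= in[i].isnull1;
	}
}
```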

--
John Naylor
Amazon Web Services

Attachments

Re: tuple radix sort

From
John Naylor
Date:
I wrote:

> I've set the threshold to 400 for now, but I'm not claiming that's the
> end story. In addition to the underestimation mentioned above,
> unrolling the counting step is a factor. Unrolling makes smaller
> inputs worse (which we can reach by recursing from larger inputs), but
> unrolling seems important for large inputs with low cardinality (a few
> percent, but I haven't shared numbers yet). We've found that a large
> input with only 4-5 distinct values just barely wins with radix sort.
> I'll be curious to see if unrolling is actually needed to prevent
> regressions there.

Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count since we simply skip some useless
work. (This is not as relevant when we have an abbreviated datum.)
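The prefix detection can be sketched like this (my naming, and assuming GCC/Clang builtins): OR together each normalized key XORed with the first one, then count leading zero bytes to learn how many radix passes can be skipped.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of common-prefix detection over normalized keys (names are
 * mine).  Any byte position where some key differs from the first
 * sets a bit in "diff"; leading zero bytes of "diff" form a shared
 * prefix, so the corresponding radix passes do no useful work and
 * can be skipped.  __builtin_clzll assumes GCC/Clang and a nonzero
 * argument.
 */
static int
common_prefix_bytes(const uint64_t *keys, size_t n)
{
	uint64_t	diff = 0;

	for (size_t i = 1; i < n; i++)
		diff |= keys[i] ^ keys[0];

	if (diff == 0)
		return 8;				/* all keys identical */
	return __builtin_clzll(diff) / 8;
}
```

In practice this accumulation could piggyback on the pass that conditions the datums, so it costs little extra.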

Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:

cardinality: 256
number of elements:  100   qsort: 34.8 radix: 38.3
number of elements:  200   qsort: 34.9 radix: 29.7
number of elements:  400   qsort: 40.8 radix: 29.2

cardinality: 16
number of elements:  100   qsort: 19.3 radix: 26.2
number of elements:  200   qsort: 18.9 radix: 18.2
number of elements:  400   qsort: 18.5 radix: 14.5

cardinality: 2
number of elements:  100   qsort: 09.3 radix: 21.6
number of elements:  200   qsort: 08.9 radix: 15.4
number of elements:  400   qsort: 10.3 radix: 14.0

To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form that allows turning
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:

V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms

on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms

V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms

on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms

So v3 gets rid of loop unrolling, reduces the threshold to 200.

[1]
https://www.postgresql.org/message-id/flat/CAApHDvqkHZsT2gaAWFM7D%3D7qyQ%3DeKXQvvn%2BBkwCn4Rvj1w4EKQ%40mail.gmail.com#1c67321e09be15e031d3995d45a1b8fd

TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup

--
John Naylor
Amazon Web Services

Attachments

Re: tuple radix sort

From
John Naylor
Date:
On Mon, Nov 3, 2025 at 8:24 PM I wrote:

> v1 was careful to restore isnull1 to false when diverting to quicksort
> for the tiebreak. v2 doesn't bother, since the only tiebreak in core
> that looks at isnull1 is comparetup_datum_tiebreak, which is not
> reachable by radix sort, requiring a pass-by-value datum that compares
> like an integer. This is a bit of a risk, since it's possible a third
> party fork could be doing something weird. Seems unlikely, but
> something to keep in mind.

I decided I wasn't quite comfortable with the full normalized datum
sharing space in SortTuple with isnull1. There's too much of a
cognitive burden involved in deciding when we do or don't need to
reset isnull1, and there's a non-zero risk of difficult-to-detect
bugs. For v4 I've instead used one byte of padding space in SortTuple
to store only the byte used for the current pass. That means we must
compute the normalized datum on every pass. That's actually better
than it sounds, since that one byte can now be used directly during
the "deal" step, rather than having to extract the byte from the
normalized datum by shifting and masking. That extraction step might
add significant cycles in cases where a pass requires multiple
iterations through the "deal" loop. It doesn't seem to make much
difference in practice, performance-wise, even with the following
pessimization:

I had to scrap the qsort specialization on the normalized datum for
small sorts, since it's no longer stored. It could still be worth it
to compute the "next byte of the normalized datum" and perform a qsort
on that (falling back to the comparator function in the usual way),
but I haven't felt the need to resort to that yet. For v4, I just
divert to qsort_tuple in non-assert builds, with a threshold of 40.
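Under that scheme, recomputing the byte for the current pass during conditioning might look roughly like this (my own sketch with invented names, assuming a signed 64-bit datum1):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch (mine, not the patch's code): normalize datum1 on every pass
 * and keep only the byte for the current radix level, e.g. in a byte
 * of SortTuple padding, so the "deal" step can use it directly
 * without shifting and masking.
 */
static inline uint8_t
pass_byte(int64_t datum1, bool desc, int level)
{
	uint64_t	norm = (uint64_t) datum1 ^ UINT64_C(0x8000000000000000);

	if (desc)
		norm = ~norm;
	return (uint8_t) (norm >> (56 - 8 * level));
}
```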

> - I don't quite like how the NULL partitioning step looks, and it
> could be costly when the distribution of NULL is not predictable, so
> I'm thinking of turning part of that into a branch-free cyclic
> permutation, similar to
> https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

This is done. Even though the inner loop is mysterious at first
glance, it's really quite elegant.

I made an attempt at clean-up, but it's still under-commented. The
common prefix detection has moved to a separate patch (v4-0004).

I've been forcing all eligible sorts to use radix sort in assert
builds, even when small enough that qsort would be faster. Since both
qsort and in-place radix sort are unstable, it's expected that some
regression tests need adjustment (v4-0002). One thing surprised me,
however: The pg_upgrade TAP test that runs regression tests on the old
cluster showed additional failures that I can't explain. I haven't
seen this before, but it's possible I never ran TAP tests when testing
new sort algorithms previously. This doesn't happen if you change the
current insertion sort threshold, so I haven't been able to reproduce
it aside from this patch. For that reason I can't completely rule out
an actual bug, although I actually have more confidence in the
verification of correct sort order in v4, since isnull1 now never
changes, just as in master. I found that changing some tests to have
additional sort keys seems to fix it (v4-0003). I did this in a rather
quick and haphazard fashion. There's probably a longer conversation to
be had about making test output more deterministic while still
covering the intended executor paths.

Aside from that, this seems like a good place to settle down, so I'm
going to create a CF entry for this. I'll start more rigorous
performance testing in the near future.


--
John Naylor
Amazon Web Services

Attachments

Re: tuple radix sort

From
David Geier
Date:
On 29.10.2025 07:28, John Naylor wrote:
> 
> Next steps: Try to find regressions (help welcome here). The v1 patch
> has some optimizations, but in other ways things are simple and/or
> wasteful. Exactly how things fit together will be informed by what, if
> anything, has to be done to avoid regressions. I suspect the challenge
> will be multikey sorts when the first key has low cardinality. This is
> because tiebreaks are necessarily postponed rather than taken care of
> up front. I'm optimistic, since low cardinality cases can be even
> faster than our B&M qsort, so we have some headroom:

Hi John,

I've also been looking into radix sort over the last few days to
accelerate GIN index builds. Ordering and removing duplicates requires
a fast sort in generate_trgm(). My own implementation (likely slower than the
algorithms you used) also showed a decent speedup.

Beyond that there are many more places in the code base that could be
changed to use radix sort instead of qsort.

What would be great is if we could build a generic radix sort
implementation, similar to sort_template.h, that could be used in
other places. We would have to think a bit about the interface,
because instead of a comparator we would need some radix extraction
callback.

If you're open to that idea I could give abstracting the code a try.
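For illustration, the extraction callback might look roughly like this (a hypothetical sketch; nothing like it exists yet):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical shape of a generic radix interface: instead of a
 * comparator, the caller supplies a callback returning the radix digit
 * of an element at a given byte position (level 0 = most significant).
 */
typedef uint8_t (*radix_extract_fn) (const void *elem, int level, void *arg);

/* Example extractor for uint32 keys. */
static uint8_t
uint32_extract(const void *elem, int level, void *arg)
{
	uint32_t	v = *(const uint32_t *) elem;

	(void) arg;					/* unused in this example */
	return (uint8_t) (v >> ((3 - level) * 8));
}
```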

--
David Geier



Re: tuple radix sort

From
John Naylor
Date:
On Wed, Nov 12, 2025 at 9:28 PM David Geier <geidav.pg@gmail.com> wrote:
> I've also been looking into radix sort the last days to accelerate GIN
> index builds. Ordering and removing duplicates requires a fast sort in
> generate_trgm().

If that's the case then I suggest first seeing if dfd8e6c73ee made
things any worse. A simpler possible improvement is to use a similar
normalization step for the chars, if needed, then do the sort and
qunique with a specialization for unsigned chars. (We don't yet
specialize qunique, but that can be remedied). If you're interested,
please start a separate thread for that.

> What would be great is if we could build a generic radix sort
> implementation, similarly to sort_template.h that can be used in other
> places. We would have to think a bit about the interface because instead
> of a comparator we would require some radix extraction callback.

That's moving the goalposts too far IMO. I want to get to a place
where I feel comfortable with the decisions made, and that already
requires a lot of testing. Also, I don't want to risk introducing
abstractions that make future improvements to tuplesort more
cumbersome.

--
John Naylor
Amazon Web Services



Re: tuple radix sort

From
David Geier
Date:
Hi John!

On 13.11.2025 05:01, John Naylor wrote:
> If that's the case then I suggest first seeing if dfd8e6c73ee made
> things any worse. A simpler possible improvement is to use a similar
> normalization step for the chars, if needed, then do the sort and
> qunique with a specialization for unsigned chars. (We don't yet
> specialize qunique, but that can be remedied). If you're interested,
> please start a separate thread for that.

It did, but only a bit. I worked around it by having two sort
specializations, one for signed and one for unsigned. I also wanted to
try using a hash table to filter out duplicates and then sort only the
remaining unique trigrams, which are usually far fewer.

Generally speaking, the GIN code is death by a thousand cuts. I've got
a patch coming up that cuts CREATE INDEX runtime in half for columns
with relatively short strings, and yields even better results for
columns with longer strings. But that's not just a matter of changing
the sort; it requires changes in a couple of places. More details in
the upcoming thread.

I thought qunique() was already pretty close to optimal because it's
defined in a header file; I believe even the comparator gets inlined.
What would be useful, though, is if qunique() took an equality
comparator that only returns true/false instead of a full sort
comparator. In the GIN code this also shaved off a few percent. I'll
take a closer look at qunique() and open a thread with the findings /
ideas for changes.

Anyway, in this context GIN was just one example of where a generic
radix sort would be useful, and there are certainly more.

> 
> That's moving the goalposts too far IMO. I want to get to a place
> where I feel comfortable with the decisions made, and that already
> requires a lot of testing. Also, I don't want to risk introducing
> abstractions that make future improvements to tuplesort more
> cumbersome.

At a quick glance it looks like you didn't specialize much, so the
testing seems to be about whether the new algorithm introduces
regressions, not whether the abstraction would cause problems. It
should therefore be possible to extract the code fairly easily without
invalidating your existing benchmark results.

I understand that you want to make progress with the use case at hand,
but I feel like we're missing out on a lot of opportunities where the
introduced code would also be very beneficial. Beyond that, we could
nicely test the new sort code in the spirit of test_rbtree.c and
friends. Maybe you want to give it a second thought.

--
David Geier




Re: tuple radix sort

From
John Naylor
Date:
On Sat, Nov 15, 2025 at 1:05 AM David Geier <geidav.pg@gmail.com> wrote:
> I understand that you want to make progress with the use case at hand
> but I feel like we're missing out on a lot of opportunity where the
> introduced code would also be very beneficial.

The patch is independently beneficial, but it is also just a stepping
stone toward something larger, and I don't yet know exactly how that's
going to look. Premature abstractions are just going to get in the
way. I'd be open to hearing proposals for possible wider applications
after the dust settles, but that's not going to happen during the PG19
cycle.

--
John Naylor
Amazon Web Services




Re: tuple radix sort

From
David Geier
Date:
Hi John!

On 15.11.2025 03:47, John Naylor wrote:
> On Sat, Nov 15, 2025 at 1:05 AM David Geier <geidav.pg@gmail.com> wrote:
>> I understand that you want to make progress with the use case at hand
>> but I feel like we're missing out on a lot of opportunity where the
>> introduced code would also be very beneficial.
> 
> The patch is independently beneficial, but is also just a stepping
> stone toward something larger, and I don't yet know exactly how it's
> going to look. Premature abstractions are just going to get in the
> way. I'd be open to hear proposals for possible wider application
> after the dust settles, but that's not going to happen during the PG19
> cycle.
> 

That sounds like a good compromise. Let's see what else can profit from
the new sorting code once we've got the tuple sort in.

--
David Geier



Re: tuple radix sort

From
John Naylor
Date:
I wrote:
> Aside from that, this seems like a good place to settle down, so I'm
> going to create a CF entry for this. I'll start more rigorous
> performance testing in the near future.

Here are the first systematic test results, with scripts. Overall, I'm
very pleased. With extremely low cardinality, it's close enough to our
B&M quicksort that any difference (a hair slower or faster) can be
discarded as insignificant. It's massively faster with most other
inputs, so I'll just highlight the exceptions:

"ascending" - Our qsort runs a "presorted check" before every
partitioning step, and I hadn't done this for radix sort yet because I
wanted to see what the "natural" difference is. I'm inclined to put in
a single precheck at the beginning (people have come to expect that to
be there), but not one at every recursion because I don't think that's
useful. (Aside: that precheck at every recursion should be replaced by
something that detects ascending/descending runs at the very start,
but that's a separate thread)

"stagger" with multiplier = no. records / 2 - This seems to be a case
where the qsort's presorted check happens to get lucky. As I said
above, we should actually detect more sorted runs with something more
comprehensive.

"p5" - This is explicitly designed to favor the B&M qsort. The input
is 95% zeros, 2.5% negative numbers, and 2.5% positive numbers. The
first qsort pivot is pretty much guaranteed to be zero, and the first
partitioning step completes very quickly. Radix sort must do a lot
more work, but no more than the amount of work it does with other
input patterns -- it's much less sensitive to the input distribution
than qsort. In this case, there's a mix of negative and positive
bigints. That defeats common prefix detection, and the first iteration
deals into two piles: negative and non-negative. Then a few recursions
happen where there is only a single distinct byte, so no useful work
happens. I suppose I could try common prefix detection at every
recursion, but I don't think that's widely beneficial for integers.
Maybe the single-byte-plus comparator small qsort would help a little,
and I'm considering adding that anyway. In one sense this is the most
worrying, since there doesn't seem to be a widely-useful mitigation,
but in another sense it's the least worrying, since this case is
deliberately constructed to be at a disadvantage.
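To make the common-prefix point concrete, a hypothetical helper (not the patch code) that finds how many leading bytes all normalized keys share:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * OR together the bits where any key differs from the first, then count
 * the leading bytes shared by all normalized keys. With mixed-sign
 * bigints the (normalized) sign bit differs, so not even one leading
 * byte can be skipped.
 */
static int
common_prefix_bytes(const uint64_t *keys, size_t n)
{
	uint64_t	diff = 0;
	int			prefix = 0;

	for (size_t i = 1; i < n; i++)
		diff |= keys[0] ^ keys[i];
	while (prefix < 8 && ((diff >> (56 - prefix * 8)) & 0xFF) == 0)
		prefix++;
	return prefix;
}
```

For p5-style input the normalized keys for -1 and 0 are 0x7FFF...FF and 0x8000...00, so the accumulated difference already has the top bit set and the shared prefix is zero bytes.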

--
John Naylor
Amazon Web Services

Attachments

Re: tuple radix sort

From
Álvaro Herrera
Date:
On 2025-Nov-12, John Naylor wrote:

> +/*
> + * Based on implementation in https://github.com/skarupke/ska_sort (Boost license),
> + * with the following noncosmetic change:
> + *  - count sorted partitions in every pass, rather than maintaining a
> + *    list of unsorted partitions
> + */
> +static void
> +radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *state)

I think given https://www.boost.org/LICENSE_1_0.txt you should include
a copy of the Boost license in this comment, as well as the copyright
statement from the hpp file:

//          Copyright Malte Skarupke 2016.
// Distributed under the Boost Software License, Version 1.0.
//    (See http://www.boost.org/LICENSE_1_0.txt)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: tuple radix sort

From
John Naylor
Date:
On Thu, Nov 20, 2025 at 6:13 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
> I think given https://www.boost.org/LICENSE_1_0.txt you should include a
> copy of the Boost license in this comment, as well as the copyright
> statement from the hpp file,

Will do, next time I do some polishing.

(While thinking about it, I need to sprinkle in some
CHECK_FOR_INTERRUPTS(), too).

--
John Naylor
Amazon Web Services



Re: tuple radix sort

From
Chengpeng Yan
Date:
> On Nov 12, 2025, at 21:57, John Naylor <johncnaylorls@gmail.com> wrote:
> 
> I decided I wasn't quite comfortable with the full normalized datum
> sharing space in SortTuple with isnull1. There's too much of a
> cognitive burden involved in deciding when we do or don't need to
> reset isnull1, and there's a non-zero risk of difficult-to-detect
> bugs. For v4 I've instead used one byte of padding space in SortTuple
> to store only the byte used for the current pass. That means we must
> compute the normalized datum on every pass. That's actually better
> than it sounds, since that one byte can now be used directly during
> the "deal" step, rather than having to extract the byte from the
> normalized datum by shifting and masking. That extraction step might
> add significant cycles in cases where a pass requires multiple
> iterations through the "deal" loop. It doesn't seem to make much
> difference in practice, performance-wise, even with the following
> pessimization:
> 
> I had to scrap the qsort specialization on the normalized datum for
> small sorts, since it's no longer stored. It could still be worth it
> to compute the "next byte of the normalized datum" and perform a qsort
> on that (falling back to the comparator function in the usual way),
> but I haven't felt the need to resort to that yet. For v4, I just
> divert to qsort_tuple in non-assert builds, with a threshold of 40.
> […]
> 
> I made an attempt at clean-up, but it's still under-commented. The
> common prefix detection has moved to a separate patch (v4-0004).
> 
> I've been forcing all eligible sorts to use radix sort in assert
> builds, even when small enough that qsort would be faster. Since both
> qsort and in-place radix sort are unstable, it's expected that some
> regression tests need adjustment (v4-0002). One thing surprised me,
> however: The pg_upgrade TAP test that runs regression tests on the old
> cluster showed additional failures that I can't explain. I haven't
> seen this before, but it's possible I never ran TAP tests when testing
> new sort algorithms previously. This doesn't happen if you change the
> current insertion sort threshold, so I haven't been able to reproduce
> it aside from this patch. For that reason I can't completely rule out
> an actual bug, although I actually have more confidence in the
> verification of correct sort order in v4, since isnull1 now never
> changes, just as in master. I found that changing some tests to have
> additional sort keys seems to fix it (v4-0003). I did this in a rather
> quick and haphazard fashion. There's probably a longer conversation to
> be had about making test output more deterministic while still
> covering the intended executor paths.
> 
> Aside from that, this seems like a good place to settle down, so I'm
> going to create a CF entry for this. I'll start more rigorous
> performance testing in the near future.
> 

Hi John,

I have reviewed the v4 patch set. I applied the patches and ran
"make check" on macOS 14.2.1 (M1), and all regression tests passed.

Overall, the implementation looks solid and the feature is a great
addition. I have a few suggestions regarding code optimization and one
discussion point regarding the data structure design.

Minor Comments / Optimizations:

1. Optimization in radix_sort_tuple (v4-0001)

In radix_sort_tuple, there is a check:

```
if (part.offset == part.next_offset)
```

Since "part" is a local copy of the struct, this check might not
reflect the latest state updated inside the loop. It might be slightly
more efficient to check the array directly:

```
if (partitions[idx].offset == partitions[idx].next_offset)
```

This allows us to detect if a partition has been fully consumed/sorted
within the current pass, potentially saving an iteration of the
"while (num_remaining > 1)" loop.

2. Branchless calculation in Common Prefix (v4-0004)

In sort_byvalue_datum, when calculating the common bits:

```
if (this_common_bits > common_upper_bits)
common_upper_bits = this_common_bits;
```

Since we are looking for the leftmost bit difference, we could
accumulate the differences using bitwise OR. This avoids a conditional
branch inside the loop:

```
common_upper_bits |= this_common_bits;
```

3. Short-circuit for identical keys (v4-0004)

When calculating common_prefix, if common_upper_bits is 0, it implies
that all non-null keys are identical (for the bits we care about). In
this case, we might be able to skip the radix sort entirely or handle
it as a single partition. Currently, the code handles it by passing
"common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but
perhaps not the most optimal path for identical keys.

-----

Discussion: SortTuple Layout and normalize_datum

I have a concern regarding the addition of "uint8 current_byte" to the
SortTuple struct.

1. Struct Size and Abstraction:

SortTuple is the generic atom for the entire sorting module. Adding a
field specific to integer Radix Sort feels like a bit of a leaky
abstraction. While it might fit into padding on some 64-bit platforms
(keeping the size at 32 bytes), relying on padding is fragile. If the
struct size increases, it reduces the number of tuples work_mem can
hold, affecting all sort operations, not just integers.

2. Cache Invalidation (Read vs. Write):

Currently, radix_sort_tuple writes to tup->current_byte in every pass.
This turns what could be a read-only access to the tuple array into a
read-write access, potentially dirtying cache lines and causing
unnecessary memory traffic.

Proposal: On-the-fly Normalization

We can avoid storing current_byte by calculating the relevant byte on
the fly. While this means re-calculating the byte extraction, we can
optimize normalize_datum.

The normalization logic (mapping signed integers to unsigned space)
essentially flips the sign bit. Instead of doing a full 64-bit
normalization for every byte extraction, we can apply this
transformation only when extracting the most significant byte (MSB).

The logic could look like this:

```
/*
 * Extract the byte at 'level' from 'key'.
 * level 0 denotes the Most Significant Byte.
 */
static inline uint8_t
extract_raw_byte(Datum key, int level)
{
    return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF;
}

/*
 * Extract the "logically normalized" byte at 'level'.
 *
 * This effectively replaces the need to call normalize_datum() on
 * the full key.
 * - For non-MSB levels: return the raw byte.
 * - For the MSB level: flip the sign bit if the comparator
 * requires it.
 * - If sorting DESC: invert the result.
 */
static inline uint8_t
radix_extract_byte(Datum orig, int level, SortSupport ssup)
{
    uint8_t byte;

    /* 1. Extract raw byte first */
    byte = extract_raw_byte(orig, level);

    /*
     * 2. Apply normalization only to the Most Significant Byte.
     *
     * Mathematically, normalizing a signed integer for unsigned
     * comparison is equivalent to flipping the sign bit (adding
     * 2^(Width-1)).
     * In the byte domain, this means XORing the MSB with 0x80.
     */
    if (level == 0)
    {
        if (ssup->comparator == ssup_datum_signed_cmp ||
            ssup->comparator == ssup_datum_int32_cmp)
        {
            byte ^= 0x80;
        }
        else
        {
            Assert(ssup->comparator == ssup_datum_unsigned_cmp);
        }
    }

    /*
     * 3. Handle Reverse Sort
     * Instead of inverting the whole Datum, we just invert the
     * current byte.
     */
    if (ssup->ssup_reverse)
        byte = ~byte;

    return byte;
}
```

Benefits:

1. Keeps SortTuple minimal: No risk of struct bloat.
2. Read-only passes: The tuple array remains clean in cache during the
counting phase.
3. Reduced instruction count: We avoid the overhead of full
normalize_datum calls for lower bytes.

I'm happy to help verify this approach if you think it's worth
pursuing.

Best regards,
Chengpeng Yan