Re: tuple radix sort

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: tuple radix sort
Дата
Msg-id CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: tuple radix sort  (John Naylor <johncnaylorls@gmail.com>)
Список pgsql-hackers
I wrote:

> I've set the threshold to 400 for now, but I'm not claiming that's the
> end story. In addition to the underestimation mentioned above,
> unrolling the counting step is a factor. Unrolling makes smaller
> inputs worse (which we can reach by recursing from larger inputs), but
> unrolling seems important for large inputs with low cardinality (a few
> percent, but I haven't shared numbers yet). We've found that a large
> input with only 4-5 distinct values just barely wins with radix sort.
> I'll be curious to see if unrolling is actually needed to prevent
> regressions there.

Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count since we simply skip some useless
work. (This is not as relevant when we have an abbreviated datum)

Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:

cardinality: 256
number of elements:  100   qsort: 34.8 radix: 38.3
number of elements:  200   qsort: 34.9 radix: 29.7
number of elements:  400   qsort: 40.8 radix: 29.2

cardinality: 16
number of elements:  100   qsort: 19.3 radix: 26.2
number of elements:  200   qsort: 18.9 radix: 18.2
number of elements:  400   qsort: 18.5 radix: 14.5

cardinality: 2
number of elements:  100   qsort: 09.3 radix: 21.6
number of elements:  200   qsort: 08.9 radix: 15.4
number of elements:  400   qsort: 10.3 radix: 14.0

To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form to allow turing
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:

V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms

on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms

V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms

on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms

So v3 gets rid of loop unrolling, reduces the threshold to 200.

[1]
https://www.postgresql.org/message-id/flat/CAApHDvqkHZsT2gaAWFM7D%3D7qyQ%3DeKXQvvn%2BBkwCn4Rvj1w4EKQ%40mail.gmail.com#1c67321e09be15e031d3995d45a1b8fd

TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup

--
John Naylor
Amazon Web Services

Вложения

В списке pgsql-hackers по дате отправления: