Re: tuple radix sort
| От | John Naylor | 
|---|---|
| Тема | Re: tuple radix sort | 
| Дата | |
| Msg-id | CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com обсуждение исходный текст  | 
		
| Ответ на | Re: tuple radix sort (John Naylor <johncnaylorls@gmail.com>) | 
| Список | pgsql-hackers | 
I wrote: > I've set the threshold to 400 for now, but I'm not claiming that's the > end story. In addition to the underestimation mentioned above, > unrolling the counting step is a factor. Unrolling makes smaller > inputs worse (which we can reach by recursing from larger inputs), but > unrolling seems important for large inputs with low cardinality (a few > percent, but I haven't shared numbers yet). We've found that a large > input with only 4-5 distinct values just barely wins with radix sort. > I'll be curious to see if unrolling is actually needed to prevent > regressions there. Looking more closely at my development history, it turns out I added loop unrolling before adding common prefix detection. Most real-world non-negative integers have the upper bytes the same, especially since the datum is 8 bytes regardless of underlying type. For those bytes, the radix sort finds only one unique byte and continues on to the next byte. By detecting the common prefix as we condition the datums, it matters less how fast we can count since we simply skip some useless work. (This is not as relevant when we have an abbreviated datum) Repeating part of the microbenchmark from last time with no unrolling, a threshold of 200 works for all but the lowest cardinality inputs: cardinality: 256 number of elements: 100 qsort: 34.8 radix: 38.3 number of elements: 200 qsort: 34.9 radix: 29.7 number of elements: 400 qsort: 40.8 radix: 29.2 cardinality: 16 number of elements: 100 qsort: 19.3 radix: 26.2 number of elements: 200 qsort: 18.9 radix: 18.2 number of elements: 400 qsort: 18.5 radix: 14.5 cardinality: 2 number of elements: 100 qsort: 09.3 radix: 21.6 number of elements: 200 qsort: 08.9 radix: 15.4 number of elements: 400 qsort: 10.3 radix: 14.0 To test further, I dug up a test from [1] that stresses low cardinality on multiple sort keys (attached in a form to allow turing radix sort on and off via a command line argument), and found no regression with or without loop unrolling: V2: off: 4 ^ 8: latency average = 101.070 ms 5 ^ 8: latency average = 704.862 ms 6 ^ 8: latency average = 3651.015 ms 7 ^ 8: latency average = 15141.412 ms on: 4 ^ 8: latency average = 99.939 ms 5 ^ 8: latency average = 683.018 ms 6 ^ 8: latency average = 3545.626 ms 7 ^ 8: latency average = 14095.677 ms V3: off: 4 ^ 8: latency average = 99.486 ms 5 ^ 8: latency average = 693.434 ms 6 ^ 8: latency average = 3607.940 ms 7 ^ 8: latency average = 14602.325 ms on: 4 ^ 8: latency average = 97.664 ms 5 ^ 8: latency average = 678.752 ms 6 ^ 8: latency average = 3361.765 ms 7 ^ 8: latency average = 14121.190 ms So v3 gets rid of loop unrolling, reduces the threshold to 200. [1] https://www.postgresql.org/message-id/flat/CAApHDvqkHZsT2gaAWFM7D%3D7qyQ%3DeKXQvvn%2BBkwCn4Rvj1w4EKQ%40mail.gmail.com#1c67321e09be15e031d3995d45a1b8fd TODOs: - adding a "sorted pre-check" to keep up with our qsort for ascending inputs - further performance validation - possibly doing NULL partitioning differently - possibly specializing qsort on the NULL partition - code cleanup -- John Naylor Amazon Web Services
Вложения
В списке pgsql-hackers по дате отправления: