Re: Optimize mul_var() for var1ndigits >= 8

Поиск
Список
Период
Сортировка
От Dean Rasheed
Тема Re: Optimize mul_var() for var1ndigits >= 8
Дата
Msg-id CAEZATCW-NV9CkAUfbpEm-42DeN3si0nJKUAOub3aBFb=0HFKuQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Optimize mul_var() for var1ndigits >= 8  ("Joel Jacobson" <joel@compiler.org>)
Ответы Re: Optimize mul_var() for var1ndigits >= 8
Список pgsql-hackers
On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <joel@compiler.org> wrote:
>
> Thanks to v3-0002, they are all still significantly faster when both patches have been applied,
> but I wonder if it is expected or not, that v3-0001 temporarily made them a bit slower?
>

There's no obvious reason why 0001 would make those cases slower, but
the fact that, together with 0002, it's a significant net win, and the
gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Something I did notice in my tests was that if ndigits was a small
multiple of 8, the old code was disproportionately faster, which can
be explained by the fact that the computation fits exactly into a
whole number of XMM register operations, with no remaining digits to
process. For example, these results from above:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
       15 |       15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
       16 |       16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
       17 |       17 | 3.9258755e+06 |  4.935394e+06 | +25.71%

       23 |       23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
       24 |       24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
       25 |       25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%

       31 |       31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
       32 |       32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
       33 |       33 |   2.27133e+06 |  3.435459e+06 | +51.25%

(Note how 16x16 was much faster than 15x15, for example.)

The patched code seems to do a better job at levelling out and coping
with arbitrary-sized inputs, not just those that fit exactly into a
whole number of loops using SSE2 operations.

Something else I noticed was that the relative gains for large numbers
of digits were much higher with clang than with gcc:

gcc 13.3.0:

    16383 |    16383 |     21.629467 |      73.58552 | +240.21%

clang 15.0.7:

    16383 |    16383 |     11.562384 |      73.00517 | +531.40%

That seems to be because clang doesn't do a good job of generating
efficient SSE2 code in the old case of 16-bit x 16-bit
multiplications. Looking on godbolt.org, it generates
overly-complicated code using PMULUDQ, which actually does 32-bit x
32-bit multiplications. Gcc, on the other hand, generates a much more
compact loop, using PMULHW and PMULLW, which is much faster. With the
patch, they both generate the same SSE2 code, so the results are
pretty consistent.

Regards,
Dean



В списке pgsql-hackers по дате отправления: