Re: speed up verifying UTF-8

From: John Naylor
Subject: Re: speed up verifying UTF-8
Date:
Message-ID: CAFBsxsHR08mHEf06PvrMRstfcyPJLwF69g0r1pvRrxWD4GEVoQ@mail.gmail.com
In response to: Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses: Re: speed up verifying UTF-8
List: pgsql-hackers
Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop, using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.
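
To illustrate item 2, here is a minimal sketch of an ASCII check built on branch-free accumulators. It is not copied from the patch: the name sketch_is_valid_ascii, the 8-byte chunk size, and the rejection of zero bytes (on the assumption that the input may not contain embedded NULs) are all choices made for this example. The loop body has no branches; the verdict is only examined once, after the loop, so the same function works for any length that is a multiple of the chunk size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: return true if all "len" bytes are ASCII and
 * non-zero. "len" must be a multiple of 8 (an assumption of this sketch).
 */
static bool
sketch_is_valid_ascii(const unsigned char *s, int len)
{
	const unsigned char *const s_end = s + len;
	uint64_t	chunk;
	uint64_t	highbit_cum = 0;	/* ORs together every byte seen */
	uint64_t	zero_cum = UINT64_C(0x8080808080808080);

	assert(len % (int) sizeof(chunk) == 0);

	while (s < s_end)
	{
		/* memcpy avoids unaligned-access problems */
		memcpy(&chunk, s, sizeof(chunk));

		/*
		 * Adding 0x7f to a byte in 1..0x7f sets its high bit without
		 * carrying into the next byte; a zero byte leaves its high bit
		 * clear, which then clears the matching bit in zero_cum.
		 */
		zero_cum &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);

		/* Any byte >= 0x80 leaves a high bit set in highbit_cum. */
		highbit_cum |= chunk;

		s += sizeof(chunk);
	}

	/* Test the accumulators once, outside the loop. */
	if (highbit_cum & UINT64_C(0x8080808080808080))
		return false;
	return zero_cum == UINT64_C(0x8080808080808080);
}
```

A caller would invoke this once per stride (for example, on 16 bytes at a time) and fall back to the DFA only when it returns false.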

With #2 above in place, I wanted to try different strides for the DFA, so here are more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance. Other platforms may differ, but I don't expect it to make a huge amount of difference.

Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2":
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
      53: c4 e2 eb f7 d1               shrxq %rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, this is irrelevant in practice, since x86-64-v3 includes AVX2, and if we had that we would just use it with the SIMD algorithm.
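
For readers wondering why a variable shift sits on the critical path at all, here is a toy sketch of a shift-based DFA. It is deliberately simplified and is not the patch's table: it checks only the structure of UTF-8 (a lead byte followed by the right number of continuation bytes) and does not reject overlong encodings, surrogates, or code points above U+10FFFF. The state numbering and the names toy_row and toy_utf8_structure_ok are hypothetical. Each state is a bit offset into a 64-bit row, so a transition is one table load plus one shift by the current state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical state encoding: each state is a bit offset into a 64-bit
 * transition row. ERR is 0 so that unlisted transitions default to error.
 */
enum { ERR = 0, END = 6, CS1 = 12, CS2 = 18, CS3 = 24 };

/* Store next state "to" at the bit position of current state "from". */
#define TR(from, to) ((uint64_t) (to) << (from))

/* Transition row for one input byte (structure-only rules). */
static uint64_t
toy_row(unsigned char b)
{
	if (b < 0x80)				/* ASCII */
		return TR(END, END);
	if (b < 0xC0)				/* continuation byte */
		return TR(CS1, END) | TR(CS2, CS1) | TR(CS3, CS2);
	if (b < 0xE0)				/* 2-byte lead */
		return TR(END, CS1);
	if (b < 0xF0)				/* 3-byte lead */
		return TR(END, CS2);
	if (b < 0xF8)				/* 4-byte lead */
		return TR(END, CS3);
	return 0;					/* every state goes to ERR */
}

static bool
toy_utf8_structure_ok(const unsigned char *s, size_t len)
{
	static uint64_t table[256];
	static bool table_ready = false;
	uint64_t	state = END;

	assert(s != NULL || len == 0);

	/* Build the 256-entry table on first use. */
	if (!table_ready)
	{
		for (int i = 0; i < 256; i++)
			table[i] = toy_row((unsigned char) i);
		table_ready = true;
	}

	/*
	 * One load and one variable shift per byte. Only the low 6 bits of
	 * "state" are meaningful; the bits above are leftovers from the row.
	 */
	for (size_t i = 0; i < len; i++)
		state = table[s[i]] >> (state & 63);

	return (state & 63) == END;
}
```

Because each shift count depends on the result of the previous shift, the loop is one long dependency chain, which is why the choice of shift instruction and the amount of unrolling can show up in the measurements.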

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com