Re: [POC] verifying UTF-8 using SIMD instructions

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: [POC] verifying UTF-8 using SIMD instructions
Дата
Msg-id CAFBsxsEinN2YLsGj_skeGu+o3Lht6DMS651z2J2UXO6iARyrtA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [POC] verifying UTF-8 using SIMD instructions  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: [POC] verifying UTF-8 using SIMD instructions  (John Naylor <john.naylor@enterprisedb.com>)
Re: [POC] verifying UTF-8 using SIMD instructions  (John Naylor <john.naylor@enterprisedb.com>)
Список pgsql-hackers
On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>

Attached is the first attempt at using SSE4 to do the validation, but first I'll answer your questions about the fallback.

I should mention that v2 had a correctness bug for 4-byte characters that I found when I was writing regression tests. It shouldn't materially affect performance, however.

> I thought the "chinese" numbers above are pure multibyte input, and it
> seems to do well on that. Where does it regress? In multibyte encodings
> other than UTF-8?

Yes, the second set of measurements was intended to represent multibyte encodings other than UTF-8. But instead of using one of those encodings, I simulated non-UTF-8 by copying the pattern used for those: in the loop, check for ascii then either advance or verify one character. It was a quick way to use the same test.

> How bad is the regression?

I'll copy the measurements here together with master so it's easier to compare:

~= PG13 (revert 6c5576075b0f9 and b80e10638e3):

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94

As I mentioned upthread, pure multibyte is still faster than PG13. Reducing the ascii check to 8-bytes at time might alleviate the regression.

> I tested this on my first generation Raspberry Pi (chipmunk). I had to
> tweak it a bit to make it compile, since the SSE autodetection code was
> not finished yet. And I used generate_series(1, 1000) instead of
> generate_series(1, 10000) in the test script (mbverifystr-speed.sql)
> because this system is so slow.
>
> master:
>
>   mixed | ascii
> -------+-------
>    1310 |  1041
> (1 row)
>
> v2-add-portability-stub-and-new-fallback.patch:
>
>   mixed | ascii
> -------+-------
>    2979 |   910
> (1 row)
>
> I'm guessing that's because the unaligned access in check_ascii() is
> expensive on this platform.

Hmm, I used memcpy() as suggested. Is that still slow on that platform? That's 32-bit, right? Some possible remedies:

1) For the COPY FROM case, we should align the allocation on a cacheline -- we already have examples of that idiom elsewhere. I was actually going to suggest doing this anyway, since unaligned SIMD loads are often slower, too.

2) As the simdjson fallback was based on Fuchsia (the Lemire paper implies it was tested carefully on Arm and I have no reason to doubt that), I could try to follow that example more faithfully by computing the actual codepoints. It's more computation and just as many branches as far as I can tell, but it's not a lot of work. I can add that alternative fallback to the patch set. I have no Arm machines, but I can test on a POWER8 machine.

3) #ifdef out the ascii check for 32-bit platforms.

4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time. I'll probably try this first.

Now, I'm pleased to report that I got SSE4 working, and it seems to work. It still needs some stress testing to find any corner case bugs, but it shouldn't be too early to share some numbers on Clang 10 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
    1082 |   751 |   364

v3 with SSE4.1:

 chinese | mixed | ascii
---------+-------+-------
     127 |   128 |   126

Some caveats and notes:

- It takes almost no recognizable code from simdjson, but it does take the magic constants lookup tables almost verbatim. The main body of the code has no intrinsics at all (I think). They're all hidden inside static inline helper functions. I reused some cryptic variable names from simdjson. It's a bit messy but not terrible.

- It diffs against the noError conversion patch and adds additional tests.

- It's not smart enough to stop at the last valid character boundary -- it's either all-valid or it must start over with the fallback. That will have to change in order to work with the proposed noError conversions. It shouldn't be very hard, but needs thought as to the clearest and safest way to code it.

- There is no ascii fast-path yet. With this algorithm we have to be a bit more careful since a valid ascii chunk could be preceded by an incomplete sequence at the end of the previous chunk. Not too hard, just a bit more work.

- This is my first time hacking autoconf, and it still seems slightly broken, yet functional on my machine at least.

- It only needs SSE4.1, but I didn't want to create a whole new CFLAGS, so it just reuses SSE4.2 for the runtime check and the macro names. Also, it doesn't test for SSE2, it just insists on 64-bit for the runtime check. I imagine it would refuse to build on 32-bit machines if you passed it -msse42

- There is a placeholder for Windows support, but it's not developed.

- I had to add a large number of casts to get rid of warnings in the magic constants macros. That needs some polish.

I also attached a C file that visually demonstrates every step of the algorithm following the example found in Table 9 in the paper. That contains the skeleton coding I started with and got abandoned early, so it might differ from the actual patch. 

--
John Naylor
EDB: http://www.enterprisedb.com
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Yugo NAGATA
Дата:
Сообщение: Re: Implementing Incremental View Maintenance
Следующее
От: Craig Ringer
Дата:
Сообщение: Re: PATCH: Batch/pipelining support for libpq