Re: speed up verifying UTF-8

From	John Naylor
Subject	Re: speed up verifying UTF-8
Date	June 6, 2021, 22:21:51
Msg-id	CAFBsxsGSnBnHfJ7D6Vs5bzYK=syCXf75-e6zLOV93AQ7hTt9jg@mail.gmail.com
In response to	Re: speed up verifying UTF-8 (Heikki Linnakangas <hlinnaka@iki.fi>)
List	pgsql-hackers

On Thu, Jun 3, 2021 at 3:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 22:16, Heikki Linnakangas wrote:
> > On 03/06/2021 22:10, John Naylor wrote:
> >> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> > x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >> > x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >> >
> >> > /* then check that the high bit is set in each byte. */
> >> > x = (x1 | x2);
> >> > x &= UINT64CONST(0x8080808080808080);
> >> > if (x != UINT64CONST(0x8080808080808080))
> >> >     return 0;

> If you replace (x1 | x2) with (x1 & x2) above, I think it's correct.

After looking at it again with fresh eyes, I agree this is correct. I modified the regression tests to pad the input bytes with ASCII so that the code path that works on 16 bytes at a time is tested. I use both UTF-8 input tables for some of the additional tests. There is a de facto requirement that the descriptions be unique across both input tables. That could be done more elegantly, but I wanted to keep things simple for now.
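To make the corrected check concrete, here is a minimal standalone sketch of the 16-byte test with (x1 & x2) in place of (x1 | x2). It is not the code from the patch: the function name check_ascii16 is hypothetical, UINT64_C from <stdint.h> stands in for UINT64CONST, and the preceding high-bit test is an assumption about what must hold before the addition so that adding 0x7f cannot carry between bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not the function from the patch: returns 16 if the
 * 16 bytes at s are all non-zero ASCII, and 0 otherwise.
 */
static int
check_ascii16(const unsigned char *s)
{
	uint64_t	half1, half2, x1, x2, x;

	assert(s != NULL);

	/* Use memcpy so the loads are safe for unaligned input. */
	memcpy(&half1, s, sizeof(uint64_t));
	memcpy(&half2, s + sizeof(uint64_t), sizeof(uint64_t));

	/* Assumed precondition step: reject any byte with the high bit set. */
	if ((half1 | half2) & UINT64_C(0x8080808080808080))
		return 0;

	/*
	 * Every byte is now <= 0x7f, so adding 0x7f cannot carry into the
	 * neighboring byte.  A byte ends up with its high bit set if and only
	 * if it was non-zero.
	 */
	x1 = half1 + UINT64_C(0x7f7f7f7f7f7f7f7f);
	x2 = half2 + UINT64_C(0x7f7f7f7f7f7f7f7f);

	/* AND, not OR: the high bit must be set in each byte of both halves. */
	x = (x1 & x2) & UINT64_C(0x8080808080808080);
	if (x != UINT64_C(0x8080808080808080))
		return 0;

	return 16;
}
```

With OR, a zero byte at some offset in one half would be hidden whenever the byte at the same offset in the other half is non-zero, because the OR of the two high bits is still set. With AND, a zero byte in either half clears that bit and the chunk is rejected.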

v11-0001 is an improvement over v10:

clang 12.0.5 / macOS:

master:

 chinese | mixed | ascii
---------+-------+-------
     975 |   686 |   369

v10-0001:

 chinese | mixed | ascii
---------+-------+-------
     930 |   549 |   109

v11-0001:

 chinese | mixed | ascii
---------+-------+-------
     687 |   440 |    64

gcc 4.8.5 / Linux (older machine):

master:

 chinese | mixed | ascii
---------+-------+-------
    2559 |  1495 |   825

v10-0001:

 chinese | mixed | ascii
---------+-------+-------
    2966 |  1034 |   156

v11-0001:

 chinese | mixed | ascii
---------+-------+-------
    2242 |   824 |   140

Previous testing on POWER8 and Arm64 leads me to expect similar results there as well.

I also looked again at 0002 and decided I wasn't quite happy with the test coverage. Previously, the code padded out a short input with ASCII so that the 16-bytes-at-a-time code path was always exercised. However, that required some finicky complexity and still wasn't adequate. For v11, I ripped that out and put the responsibility on the regression tests to make sure the various code paths are exercised.

--
John Naylor
EDB: http://www.enterprisedb.com
