Re: speed up verifying UTF-8

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: speed up verifying UTF-8
Дата
Msg-id CAFBsxsHP556FGZZZi=gmYxZFXv0QjPXHWmbp1HdeFPYGVUQ-XQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Ответы Re: speed up verifying UTF-8
Список pgsql-hackers
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> > An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> >     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
> >         return s + len;
> >     }
> >     // slow path: the string is not valid, perform a slower analysis
> >     return s + ....;
> > }

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     809 |   472 |   223 |     558 |    805

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     379 |   181 |    94 |     219 |    376

Note that the branchy code's worst case (mixed8) is here the same speed as multibyte. With Vladimir's idea * , we call check_ascii only every 8 bytes of input, not every time we verify one multibyte character. Also, we only have to check the DFA state every time we loop over 8 bytes, not every time we step through the DFA. That means we have to walk backwards at the end to find the last leading byte, but the SSE code already knew how to do that, so I used that logic here in the caller, which will allow some simplification of how the SSE code returns.

The state check is likely why the ascii case is slightly slower than v14. We could go back to checking ascii 16 bytes at a time, since there's little penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: logical decoding and replication of sequences
Следующее
От: Zhihong Yu
Дата:
Сообщение: Re: Have I found an interval arithmetic bug?