Re: speed up verifying UTF-8

From Heikki Linnakangas
Subject Re: speed up verifying UTF-8
Msg-id c3200e58-bad2-4414-9289-62a8a3bb02b5@iki.fi
In response to Re: speed up verifying UTF-8  (Greg Stark <stark@mit.edu>)
Responses Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On 03/06/2021 17:33, Greg Stark wrote:
>> 3. It's probably cheaper to perform the HAS_ZERO check just once on
>> (half1 | half2). We have to compute (half1 | half2) anyway.
> 
> Wouldn't you have to check (half1 & half2) ?

Ah, you're right, of course. But & is not quite right either: it gives
false positives whenever the two bytes at the same position share no set
bits (for example, 0x01 & 0x02 = 0x00). That's OK from a correctness
point of view here, because we then fall back to checking byte by byte,
but I don't think it's a good tradeoff.
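
To make the two failure modes concrete, here is a small sketch. The helper
names are made up for illustration, and has_zero_byte() uses the classic
"subtract 0x01, mask with ~v" bit trick as a stand-in for the HAS_ZERO
check, assuming plain <stdint.h> types rather than PostgreSQL's own:

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if any byte of v is 0x00 (classic bit trick). */
static int
has_zero_byte(uint64_t v)
{
    return ((v - UINT64_C(0x0101010101010101)) & ~v &
            UINT64_C(0x8080808080808080)) != 0;
}

/*
 * Zero check on (half1 | half2): misses a zero byte in one half when the
 * other half is nonzero at the same position (false negative).
 */
static int
zero_check_or(uint64_t half1, uint64_t half2)
{
    return has_zero_byte(half1 | half2);
}

/*
 * Zero check on (half1 & half2): never misses a zero byte, but also fires
 * when two nonzero bytes share no set bits (false positive).
 */
static int
zero_check_and(uint64_t half1, uint64_t half2)
{
    return has_zero_byte(half1 & half2);
}
```

With half1 = 0x4141414141410041 and half2 = 0x4242424242424242, the |
variant does not see the zero byte in half1. With half1 = 0x0101010101010101
and half2 = 0x0202020202020202, the & variant reports a zero byte even
though neither half contains one.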

I think this works, however:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64      half1,
                half2,
                highbits_set;
    uint64      x1,
                x2;
    uint64      x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk.
         *
         * First, add 0x7f to each byte. This sets the high bit in each byte,
         * unless it was a zero. We already checked that none of the bytes had
         * the high bit set previously, so the max value each byte can have
         * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
         * worry about carrying over to the next byte.
         */
        x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
        x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

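        /*
         * Then check that the high bit is set in each byte of both halves.
         * Combining with & clears a high bit in x if it is clear in either
         * x1 or x2, so a zero byte in either half is caught.
         */
        x = (x1 & x2);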
        x &= UINT64CONST(0x8080808080808080);
        if (x != UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}
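
The add-0x7f step can also be tried in isolation on a single 8-byte word.
A minimal sketch, assuming plain <stdint.h> types in place of uint64 and
UINT64CONST; the function name is hypothetical and not part of the patch:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if all 8 bytes at s are in the range
 * 0x01..0x7f (ASCII and nonzero), else 0.
 */
static int
word_is_nonzero_ascii(const unsigned char *s)
{
    uint64_t    w;

    memcpy(&w, s, sizeof(w));

    /* Reject any byte with the high bit set. */
    if (w & UINT64_C(0x8080808080808080))
        return 0;

    /*
     * Every byte is now <= 0x7f, so adding 0x7f gives at most 0xfe and
     * cannot carry into the next byte. The high bit ends up set in every
     * byte except those that were zero.
     */
    w += UINT64_C(0x7f7f7f7f7f7f7f7f);

    return (w & UINT64_C(0x8080808080808080)) ==
        UINT64_C(0x8080808080808080);
}
```

For example, the eight bytes "abcdefgh" pass, while "abc\0defg" (embedded
zero) and a word containing the UTF-8 bytes 0xc3 0xa9 are both rejected.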

- Heikki


