Re: speed up verifying UTF-8
From | Heikki Linnakangas |
---|---|
Subject | Re: speed up verifying UTF-8 |
Date | |
Msg-id | c3200e58-bad2-4414-9289-62a8a3bb02b5@iki.fi |
In response to | Re: speed up verifying UTF-8 (Greg Stark <stark@mit.edu>) |
Responses | Re: speed up verifying UTF-8 |
List | pgsql-hackers |
On 03/06/2021 17:33, Greg Stark wrote:
>> 3. It's probably cheaper to perform the HAS_ZERO check just once on
>> (half1 | half2). We have to compute (half1 | half2) anyway.
>
> Wouldn't you have to check (half1 & half2) ?

Ah, you're right of course. But & is not quite right either, it will
give false positives. That's ok from a correctness point of view here,
because we then fall back to checking byte by byte, but I don't think
it's a good tradeoff.

I think this works, however:

    /* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
    static inline int
    check_ascii(const unsigned char *s, int len)
    {
        uint64      half1,
                    half2,
                    highbits_set;
        uint64      x1,
                    x2;
        uint64      x;

        if (len >= 2 * sizeof(uint64))
        {
            memcpy(&half1, s, sizeof(uint64));
            memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

            /* Check if any bytes in this chunk have the high bit set. */
            highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
            if (highbits_set)
                return 0;

            /*
             * Check if there are any zero bytes in this chunk.
             *
             * First, add 0x7f to each byte. This sets the high bit in each byte,
             * unless it was a zero. We already checked that none of the bytes had
             * the high bit set previously, so the max value each byte can have
             * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
             * worry about carrying over to the next byte.
             */
            x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
            x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

            /*
             * Then check that the high bit is set in each byte of both halves.
             * The halves are combined with AND so that a zero byte in either
             * half clears the corresponding high bit.
             */
            x = (x1 & x2);
            x &= UINT64CONST(0x8080808080808080);

            if (x != UINT64CONST(0x8080808080808080))
                return 0;

            return 2 * sizeof(uint64);
        }
        else
            return 0;
    }

- Heikki
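
For readers who want to try the zero-byte trick outside the PostgreSQL tree, here is a minimal standalone sketch. It is an illustration rather than the code from the patch: it uses the standard `uint64_t` and `UINT64_C` in place of PostgreSQL's `uint64` and `UINT64CONST`, and the helper name `chunk_is_nonzero_ascii` and the two macros are made up for this example. It checks one 16-byte chunk and combines the two halves with AND after the addition, so a zero byte in either half is caught.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)
#define ADD_7F    UINT64_C(0x7f7f7f7f7f7f7f7f)

/*
 * Hypothetical helper (not from the patch): returns 1 if all 16 bytes
 * starting at s are in the range 0x01..0x7f, and 0 otherwise.
 */
static int
chunk_is_nonzero_ascii(const unsigned char *s)
{
    uint64_t half1, half2;

    assert(s != NULL);

    /* memcpy avoids unaligned loads through a cast pointer. */
    memcpy(&half1, s, sizeof(half1));
    memcpy(&half2, s + sizeof(half1), sizeof(half2));

    /* A high bit set in any byte means the chunk is not pure ASCII. */
    if ((half1 | half2) & HIGH_BITS)
        return 0;

    /*
     * Every byte is now at most 0x7f, so adding 0x7f cannot carry into
     * the neighbouring byte. The addition sets the high bit of a byte
     * exactly when that byte was nonzero. ANDing the two sums keeps a
     * high bit only if the byte was nonzero in both halves.
     */
    return (((half1 + ADD_7F) & (half2 + ADD_7F)) & HIGH_BITS) == HIGH_BITS;
}
```

A caller would presumably advance 16 bytes whenever this returns 1 and fall back to byte-by-byte verification otherwise, as described in the message above.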