Re: speed up verifying UTF-8
От | John Naylor |
---|---|
Тема | Re: speed up verifying UTF-8 |
Дата | |
Msg-id | CAFBsxsHG=g6W8Mie+_NO8dV6O0pO2stxrnS=me5ZmGqk--fd5g@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: speed up verifying UTF-8 (Heikki Linnakangas <hlinnaka@iki.fi>) |
Ответы |
Re: speed up verifying UTF-8
(John Naylor <john.naylor@enterprisedb.com>)
|
Список | pgsql-hackers |
On Fri, Dec 10, 2021 at 2:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I had another look at this now. Looks good, just a few minor comments below: Thanks for reviewing! I've attached v25 to address your points. > This function assumes that the input len is a multiple of 8. There's an > assertion for that, but it would be good to also mention it in the > function comment. I took me a moment to realize that. Done. > Given that assumption, I wonder if "while (len >= 0)" would marginally > faster. Or compute "s_end = s + len" first, and check for "while (s < > s_end)", so that you don't need to update 'len' in the loop. With two chunks, gcc 4.8.5/11.2 and clang 12 will unroll the inner loop, so it doesn't matter: L51: mov rdx, QWORD PTR [rdi] mov rsi, QWORD PTR [rdi+8] lea rax, [rdx+rbx] lea rbp, [rsi+rbx] and rax, rbp and rax, r11 cmp rax, r11 jne .L66 or rdx, rsi test rdx, r11 jne .L66 sub r8d, 16 ; refers to "len" in the caller pg_utf8_verifystr() add rdi, 16 cmp r8d, 15 jg .L51 I *think* these are the same instructions as from your version from some time ago that handled two integers explicitly -- I rewrote it like this to test different chunk sizes. (Aside on 32-byte strides: Four chunks was within the noise level of two chunks on the platform I tested. With 32 bytes, that increases the chance that a mixed input would have non-ascii and defeat this optimization, so should be significantly faster to make up for that. Along those lines, in the future we could consider SSE2 (unrolled 2 x 16 bytes) for this path. Since it's part of the spec for x86-64, we wouldn't need a runtime check -- just #ifdef it inline. And we could piggy-back on the CRC SSE4.2 configure test for intrinsic support, so that would avoid adding a bunch of complexity.) That said, I think your suggestions are better on code clarity grounds. I'm on the fence about "while(s < s_end)", so I went with "while (len > 0)" because it matches the style in wchar.c. > Also would be good to mention what exactly the return value means. I.e > "returns false if the input contains any bytes with the high-bit set, or > zeros". Done. > > + /* > > + * Check if any high bits in the zero accumulator got cleared. > > + * > > + * XXX: As noted above, the zero check is only valid if the chunk had no > > + * high bits set. However, the compiler may perform these two checks in > > + * any order. That's okay because if any high bits were set, we would > > + * return false regardless, so invalid results from the zero check don't > > + * matter. > > + */ > > + if (zero_cum != UINT64CONST(0x8080808080808080)) > > + return false; > I don't understand the "the compiler may perform these checks in any > order" comment. We trust the compiler to do the right thing, and only > reorder things when it's safe to do so. What is special here, why is it > worth mentioning here? Ah, that's a good question, and now that you mention it, the comment is silly. When looking at the assembly output a while back, I was a bit astonished that it didn't match my mental model of what was happening, so I made this note. I've removed the whole XXX comment here and expanded the first comment in the loop to: /* * Capture any zero bytes in this chunk. * * First, add 0x7f to each byte. This sets the high bit in each byte, * unless it was a zero. If any resulting high bits are zero, the * corresponding high bits in the zero accumulator will be cleared. * * If none of the bytes in the chunk had the high bit set, the max * value each byte can have after the addition is 0x7f + 0x7f = 0xfe, * and we don't need to worry about carrying over to the next byte. If * any input bytes did have the high bit set, it doesn't matter * because we check for those separately. */ > > @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len) > > return s - start; > > } > > > > -static int > > +static pg_noinline int > > pg_utf8_verifychar(const unsigned char *s, int len) > > { > > int l; > > Why force it to not be inlined? Since the only direct caller is now only using it for small inputs, I thought about saving space, but it's not enough to matter, so I'll go ahead and leave it out. While at it, I removed the unnecessary "inline" declaration for utf8_advance(), since the compiler can do that anyway. > > + * In a shift-based DFA, the input byte is an index into array of integers > > + * whose bit pattern encodes the state transitions. To compute the current > > + * state, we simply right-shift the integer by the current state and apply a > > + * mask. In this scheme, the address of the transition only depends on the > > + * input byte, so there is better pipelining. > > Should be "To compute the *next* state, ...", I think. Fixed. > The way the state transition table works is pretty inscrutable. That's > understandable, because the values were found by an SMT solver, so I'm > not sure if anything can be done about it. Do you mean in general, or just the state values? Like any state machine, the code is simple, and the complexity is hidden in the data. Hopefully the first link I included in the comment is helpful. The SMT solver was only needed to allow 32-bit (instead of 64-bit) entries in the transition table, so it's not strictly necessary. A lookup table that fits in 1kB is nice from a cache perspective, however. With 64-bit, the state values are less weird-looking but they're still just arbitrary numbers. As long as ERR = 0 and the largest is at most 9, it doesn't matter what they are, so I'm not sure it's much less mysterious. You can see the difference between 32-bit and 64-bit in [1]. -- In addition to Heikki's. review points, I've made a couple small additional changes from v24: I rewrote this part, so we don't need these macros anymore: - if (!IS_HIGHBIT_SET(*s) || - IS_UTF8_2B_LEAD(*s) || - IS_UTF8_3B_LEAD(*s) || - IS_UTF8_4B_LEAD(*s)) + if (!IS_HIGHBIT_SET(*s) || pg_utf_mblen(s) > 1) And I moved is_valid_ascii() to pg_wchar.h so it can be used elsewhere. I'm not sure there's a better place to put it. I tried using this for text_position(), for which I'll start a new thread. [1] https://www.postgresql.org/message-id/attachment/125672/v22-addendum-32-bit-transitions.txt -- John Naylor EDB: http://www.enterprisedb.com -- John Naylor EDB: http://www.enterprisedb.com
Вложения
В списке pgsql-hackers по дате отправления:
Следующее
От: vignesh CДата:
Сообщение: Re: Failed transaction statistics to measure the logical replication progress