Обсуждение: use SSE2 for is_valid_ascii
new thread [was: WIP Patch: Add a function that returns binary JSONB as a bytea] > I wrote: > > We can also shave a > > few percent by having pg_utf8_verifystr use SSE2 for the ascii path. I > > can look into this. > > Here's a patch for that. If the input is mostly ascii, I'd expect that > part of the flame graph to shrink by 40-50% and give a small boost > overall. Here is an updated patch using the new USE_SSE2 symbol. The style is different from the last one in that each stanza has platform-specific code. I wanted to try it this way because is_valid_ascii() is already written in SIMD-ish style using general purpose registers and bit twiddling, so it seemed natural to see the two side-by-side. Sometimes they can share the same comment. If we think this is bad for readability, I can go back to one block each, but that way leads to duplication of code and it's difficult to see what's different for each platform, IMO. -- John Naylor EDB: http://www.enterprisedb.com
Вложения
On Wed, Aug 10, 2022 at 01:50:14PM +0700, John Naylor wrote: > Here is an updated patch using the new USE_SSE2 symbol. The style is > different from the last one in that each stanza has platform-specific > code. I wanted to try it this way because is_valid_ascii() is already > written in SIMD-ish style using general purpose registers and bit > twiddling, so it seemed natural to see the two side-by-side. Sometimes > they can share the same comment. If we think this is bad for > readability, I can go back to one block each, but that way leads to > duplication of code and it's difficult to see what's different for > each platform, IMO. This is a neat patch. I don't know that we need an entirely separate code block for the USE_SSE2 path, but I do think that a little bit of extra commentary would improve the readability. IMO the existing comment for the zero accumulator has the right amount of detail. + /* + * Set all bits in each lane of the error accumulator where input + * bytes are zero. + */ + error_cum = _mm_or_si128(error_cum, + _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); I wonder if reusing a zero vector (instead of creating a new one every time) has any noticeable effect on performance. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > This is a neat patch. I don't know that we need an entirely separate code > block for the USE_SSE2 path, but I do think that a little bit of extra > commentary would improve the readability. IMO the existing comment for the > zero accumulator has the right amount of detail. > > + /* > + * Set all bits in each lane of the error accumulator where input > + * bytes are zero. > + */ > + error_cum = _mm_or_si128(error_cum, > + _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); Okay, I will think about the comments, thanks for looking. > I wonder if reusing a zero vector (instead of creating a new one every > time) has any noticeable effect on performance. Creating a zeroed register is just FOO PXOR FOO, which should get hoisted out of the (unrolled in this case) loop, and which a recent CPU will just map to a hard-coded zero in the register file, in which case the execution latency is 0 cycles. :-) -- John Naylor EDB: http://www.enterprisedb.com
On Thu, Aug 11, 2022 at 11:10:34AM +0700, John Naylor wrote:
>> I wonder if reusing a zero vector (instead of creating a new one every
>> time) has any noticeable effect on performance.
> 
> Creating a zeroed register is just FOO PXOR FOO, which should get
> hoisted out of the (unrolled in this case) loop, and which a recent
> CPU will just map to a hard-coded zero in the register file, in which
> case the execution latency is 0 cycles. :-)
Ah, indeed.  At -O2, my compiler seems to zero out two registers before the
loop with either approach:
    pxor    %xmm0, %xmm0    ; accumulator
    pxor    %xmm2, %xmm2    ; always zeros
And within the loop, I see the following:
    movdqu  (%rdi), %xmm1
    movdqu  (%rdi), %xmm3
    addq    $16, %rdi
    pcmpeqb %xmm2, %xmm1    ; check for zeros
    por %xmm3, %xmm0        ; OR data into accumulator
    por %xmm1, %xmm0        ; OR zero check results into accumulator
    cmpq    %rdi, %rsi
So the call to _mm_setzero_si128() within the loop is fine.  Apologies for
the noise.
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		v3 applies on top of the v9 json_lex_string patch in [1] and adds a bit more to that, resulting in a simpler patch that is more amenable to additional SIMD-capable platforms. [1] https://www.postgresql.org/message-id/CAFBsxsFV4v802idV0-Bo%3DV7wLMHRbOZ4er0hgposhyGCikmVGA%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com
Вложения
On Thu, Aug 25, 2022 at 04:41:53PM +0700, John Naylor wrote: > v3 applies on top of the v9 json_lex_string patch in [1] and adds a > bit more to that, resulting in a simpler patch that is more amenable > to additional SIMD-capable platforms. LGTM -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Aug 26, 2022 at 10:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Aug 25, 2022 at 04:41:53PM +0700, John Naylor wrote: > > v3 applies on top of the v9 json_lex_string patch in [1] and adds a > > bit more to that, resulting in a simpler patch that is more amenable > > to additional SIMD-capable platforms. > > LGTM Thanks for looking, pushed with some rearrangements. -- John Naylor EDB: http://www.enterprisedb.com