Обсуждение: Use compiler intrinsics for bit ops in hash
Folks, The recent patch for distinct windowing aggregates contained a partial fix of the FIXME that didn't seem entirely right, so I extracted that part, changed it to use compiler intrinsics, and submit it here. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Вложения
Hi David, On Tue, Jan 14, 2020 at 9:36 AM David Fetter <david@fetter.org> wrote: > > Folks, > > The recent patch for distinct windowing aggregates contained a partial > fix of the FIXME that didn't seem entirely right, so I extracted that > part, changed it to use compiler intrinsics, and submit it here. The changes in hash AM and SIMPLEHASH do look like a net positive improvement. My biggest cringe might be in pg_bitutils: > diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h > index 498e532308..cc9338da25 100644 > --- a/src/include/port/pg_bitutils.h > +++ b/src/include/port/pg_bitutils.h > @@ -145,4 +145,32 @@ pg_rotate_right32(uint32 word, int n) > return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n)); > } > > +/* ceil(lg2(num)) */ > +static inline uint32 > +ceil_log2_32(uint32 num) > +{ > + return pg_leftmost_one_pos32(num-1) + 1; > +} > + > +static inline uint64 > +ceil_log2_64(uint64 num) > +{ > + return pg_leftmost_one_pos64(num-1) + 1; > +} > + > +/* calculate first power of 2 >= num > + * per https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2 > + * using BSR where available */ > +static inline uint32 > +next_power_of_2_32(uint32 num) > +{ > + return ((uint32) 1) << (pg_leftmost_one_pos32(num-1) + 1); > +} > + > +static inline uint64 > +next_power_of_2_64(uint64 num) > +{ > + return ((uint64) 1) << (pg_leftmost_one_pos64(num-1) + 1); > +} > + > #endif /* PG_BITUTILS_H */ > 1. Is ceil_log2_64 dead code? 2. The new utilities added here (ceil_log2_32 and company, next_power_of_2_32 and company) all require num > 1, but don't clearly Assert (or at the very least document) so. 3. A couple of the callers can actively pass in an argument of 1, e.g. from _hash_spareindex in hashutil.c, while some other callers are iffy at best (simplehash.h maybe?) 4. It seems like you *really* would like an operation like LZCNT in x86 (first appearing in Haswell) that is well defined on zero input. ISTM the alternatives are: a) Special case 1. That seems straightforward, but the branching cost on a seemingly unlikely condition seems to be a lot of performance loss b) Use architecture specific intrinsic (and possibly with CPUID shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ intrinsic elsewhere. The CLZ GCC intrinsic seems to map to instructions that are well defined on zero in most ISA's other than x86, so maybe we can get away with special-casing x86? Cheers, Jesse
On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote: > Hi David, > > On Tue, Jan 14, 2020 at 9:36 AM David Fetter <david@fetter.org> wrote: > > > > Folks, > > > > The recent patch for distinct windowing aggregates contained a partial > > fix of the FIXME that didn't seem entirely right, so I extracted that > > part, changed it to use compiler intrinsics, and submit it here. > > The changes in hash AM and SIMPLEHASH do look like a net positive > improvement. My biggest cringe might be in pg_bitutils: Thanks for looking at this! > > diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h > > index 498e532308..cc9338da25 100644 > > --- a/src/include/port/pg_bitutils.h > > +++ b/src/include/port/pg_bitutils.h > > @@ -145,4 +145,32 @@ pg_rotate_right32(uint32 word, int n) > > return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n)); > > } > > > > +/* ceil(lg2(num)) */ > > +static inline uint32 > > +ceil_log2_32(uint32 num) > > +{ > > + return pg_leftmost_one_pos32(num-1) + 1; > > +} > > + > > +static inline uint64 > > +ceil_log2_64(uint64 num) > > +{ > > + return pg_leftmost_one_pos64(num-1) + 1; > > +} > > + > > +/* calculate first power of 2 >= num > > + * per https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2 > > + * using BSR where available */ > > +static inline uint32 > > +next_power_of_2_32(uint32 num) > > +{ > > + return ((uint32) 1) << (pg_leftmost_one_pos32(num-1) + 1); > > +} > > + > > +static inline uint64 > > +next_power_of_2_64(uint64 num) > > +{ > > + return ((uint64) 1) << (pg_leftmost_one_pos64(num-1) + 1); > > +} > > + > > #endif /* PG_BITUTILS_H */ > > > > 1. Is ceil_log2_64 dead code? Let's call it nascent code. I suspect there are places it could go, if I look for them. Also, it seemed silly to have one without the other. > 2. The new utilities added here (ceil_log2_32 and company, > next_power_of_2_32 and company) all require num > 1, but don't clearly > Assert (or at the very least document) so. Assert()ed. > 3. A couple of the callers can actively pass in an argument of 1, e.g. > from _hash_spareindex in hashutil.c, while some other callers are iffy > at best (simplehash.h maybe?) What would you recommend be done about this? > 4. It seems like you *really* would like an operation like LZCNT in x86 > (first appearing in Haswell) that is well defined on zero input. ISTM > the alternatives are: > > a) Special case 1. That seems straightforward, but the branching cost > on a seemingly unlikely condition seems to be a lot of performance > loss > > b) Use architecture specific intrinsic (and possibly with CPUID > shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ > intrinsic elsewhere. The CLZ GCC intrinsic seems to map to > instructions that are well defined on zero in most ISA's other than > x86, so maybe we can get away with special-casing x86? b) seems much more attractive. Is there some way to tilt the tools so that this happens? What should I be reading up on? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Вложения
On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote: > > The changes in hash AM and SIMPLEHASH do look like a net positive > > improvement. My biggest cringe might be in pg_bitutils: > > > > 1. Is ceil_log2_64 dead code? > > Let's call it nascent code. I suspect there are places it could go, if > I look for them. Also, it seemed silly to have one without the other. > While not absolutely required, I'd like us to find at least one place and start using it. (Clang also nags at me when we have unused functions). > On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote: > > 4. It seems like you *really* would like an operation like LZCNT in x86 > > (first appearing in Haswell) that is well defined on zero input. ISTM > > the alternatives are: > > > > a) Special case 1. That seems straightforward, but the branching cost > > on a seemingly unlikely condition seems to be a lot of performance > > loss > > > > b) Use architecture specific intrinsic (and possibly with CPUID > > shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ > > intrinsic elsewhere. The CLZ GCC intrinsic seems to map to > > instructions that are well defined on zero in most ISA's other than > > x86, so maybe we can get away with special-casing x86? i. We can detect LZCNT instruction by checking one of the "extended feature" (EAX=80000001) bits using CPUID. Unlike the "basic features" (EAX=1), extended feature flags have been more vendor-specific, but fortunately it seems that the feature bit for LZCNT is the same [1][2]. ii. We'll most likely still need to provide a fallback implementation for processors that don't have LZCNT (either because they are from a different vendor, or an older Intel/AMD processor). I wonder if simply checking for 1 is "good enough". Maybe a micro benchmark is in order? > Is there some way to tilt the tools so that this happens? We have a couple options here: 1. Use a separate object (a la our SSE 4.2 implemenation of CRC). On Clang and GCC (I don't have MSVC at hand), -mabm or -mlzcnt should cause __builtin_clz to generate the LZCNT instruction, which is well defined on zero input. The default configuration would translate __builtin_clz to code that subtracts BSR from the width of the input, but BSR leaves the destination undefined on zero input. 2. (My least favorite) use inline asm (a la our popcount implementation). > b) seems much more attractive. Is there some way to tilt the tools so > that this happens? What should I be reading up on? The enclosed references hopefully are good places to start. Let me know if you have more ideas. Cheers, Jesse References: [1] "How to detect New Instruction support in the 4th generation Intel® Core™ processor family" https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family [2] "Bit Manipulation Instruction Sets" https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
On Wed, Jan 15, 2020 at 6:09 AM David Fetter <david@fetter.org> wrote: > [v2 patch] Hi David, I have a stylistic comment on this snippet: - for (i = _hash_log2(metap->hashm_bsize); i > 0; --i) - { - if ((1 << i) <= metap->hashm_bsize) - break; - } + i = pg_leftmost_one_pos32(metap->hashm_bsize); Assert(i > 0); metap->hashm_bmsize = 1 << i; metap->hashm_bmshift = i + BYTE_TO_BIT; Naming the variable "i" made sense when it was a loop counter, but it seems out of place now. Same with the Assert. Also, this + * using BSR where available */ is not directly tied to anything in this function, or even in the function it calls, and could get out of date easily. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 18, 2020 at 11:46:24AM +0800, John Naylor wrote: > On Wed, Jan 15, 2020 at 6:09 AM David Fetter <david@fetter.org> wrote: > > [v2 patch] > > Hi David, > > I have a stylistic comment on this snippet: > > - for (i = _hash_log2(metap->hashm_bsize); i > 0; --i) > - { > - if ((1 << i) <= metap->hashm_bsize) > - break; > - } > + i = pg_leftmost_one_pos32(metap->hashm_bsize); > Assert(i > 0); > metap->hashm_bmsize = 1 << i; > metap->hashm_bmshift = i + BYTE_TO_BIT; > > Naming the variable "i" made sense when it was a loop counter, but it > seems out of place now. Same with the Assert. Fixed by removing the variable entirely. > Also, this > > + * using BSR where available */ > > is not directly tied to anything in this function, or even in the > function it calls, and could get out of date easily. Removed. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Вложения
On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote: > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote: > > > The changes in hash AM and SIMPLEHASH do look like a net positive > > > improvement. My biggest cringe might be in pg_bitutils: > > > > > > 1. Is ceil_log2_64 dead code? > > > > Let's call it nascent code. I suspect there are places it could go, if > > I look for them. Also, it seemed silly to have one without the other. > > > > While not absolutely required, I'd like us to find at least one > place and start using it. (Clang also nags at me when we have > unused functions). Done in the expanded patches attached. > > On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote: > > > 4. It seems like you *really* would like an operation like LZCNT in x86 > > > (first appearing in Haswell) that is well defined on zero input. ISTM > > > the alternatives are: > > > > > > a) Special case 1. That seems straightforward, but the branching cost > > > on a seemingly unlikely condition seems to be a lot of performance > > > loss > > > > > > b) Use architecture specific intrinsic (and possibly with CPUID > > > shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ > > > intrinsic elsewhere. The CLZ GCC intrinsic seems to map to > > > instructions that are well defined on zero in most ISA's other than > > > x86, so maybe we can get away with special-casing x86? > > i. We can detect LZCNT instruction by checking one of the > "extended feature" (EAX=80000001) bits using CPUID. Unlike the > "basic features" (EAX=1), extended feature flags have been more > vendor-specific, but fortunately it seems that the feature bit > for LZCNT is the same [1][2]. > > ii. We'll most likely still need to provide a fallback > implementation for processors that don't have LZCNT (either > because they are from a different vendor, or an older Intel/AMD > processor). I wonder if simply checking for 1 is "good enough". > Maybe a micro benchmark is in order? I'm not sure how I'd run one on the architectures we support. What I've done here is generalize our implementation to be basically like LZCNT and TZCNT at the cost of a brief branch that might go away at runtime. > 2. (My least favorite) use inline asm (a la our popcount > implementation). Yeah, I'd like to fix that, but I kept the scope of this one relatively narrow. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Вложения
On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote: > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote: > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote: > > > > The changes in hash AM and SIMPLEHASH do look like a net positive > > > > improvement. My biggest cringe might be in pg_bitutils: > > > > > > > > 1. Is ceil_log2_64 dead code? > > > > > > Let's call it nascent code. I suspect there are places it could go, if > > > I look for them. Also, it seemed silly to have one without the other. > > > > > > > While not absolutely required, I'd like us to find at least one > > place and start using it. (Clang also nags at me when we have > > unused functions). > > Done in the expanded patches attached. These bit-rotted a little, so I've updated them. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Вложения
On Wed, Feb 26, 2020 at 09:12:24AM +0100, David Fetter wrote: > On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote: > > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote: > > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote: > > > > > The changes in hash AM and SIMPLEHASH do look like a net positive > > > > > improvement. My biggest cringe might be in pg_bitutils: > > > > > > > > > > 1. Is ceil_log2_64 dead code? > > > > > > > > Let's call it nascent code. I suspect there are places it could go, if > > > > I look for them. Also, it seemed silly to have one without the other. > > > > > > > > > > While not absolutely required, I'd like us to find at least one > > > place and start using it. (Clang also nags at me when we have > > > unused functions). > > > > Done in the expanded patches attached. > > These bit-rotted a little, so I've updated them. 05d8449e73694585b59f8b03aaa087f04cc4679a broke this patch set, so fix. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Вложения
On Thu, Feb 27, 2020 at 1:56 PM David Fetter <david@fetter.org> wrote: > [v6 set] Hi David, In 0002, the pg_bitutils functions have a test (input > 0), and the new callers ceil_log2_* and next_power_of_2_* have asserts. That seems backward to me. I imagine some callers of bitutils will already know the value > 0, and it's probably good to keep that branch out of the lowest level functions. What do you think? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 27, 2020 at 02:41:49PM +0800, John Naylor wrote: > On Thu, Feb 27, 2020 at 1:56 PM David Fetter <david@fetter.org> wrote: > > [v6 set] > > Hi David, > > In 0002, the pg_bitutils functions have a test (input > 0), and the > new callers ceil_log2_* and next_power_of_2_* have asserts. That seems > backward to me. To me, too, now that you mention it. My thinking was a little fuzzed by trying to accommodate platforms with intrinsics where clz is defined for 0 inputs. > I imagine some callers of bitutils will already know the value > 0, > and it's probably good to keep that branch out of the lowest level > functions. What do you think? I don't know quite how smart compilers and CPUs are these days, so it's unclear to me how often that branch would actually happen. Anyhow, I'll get a revised patch set out later today. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi David, On Wed, Feb 26, 2020 at 9:56 PM David Fetter <david@fetter.org> wrote: > > On Wed, Feb 26, 2020 at 09:12:24AM +0100, David Fetter wrote: > > On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote: > > > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote: > > > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote: > > > > > > The changes in hash AM and SIMPLEHASH do look like a net positive > > > > > > improvement. My biggest cringe might be in pg_bitutils: > > > > > > > > > > > > 1. Is ceil_log2_64 dead code? > > > > > > > > > > Let's call it nascent code. I suspect there are places it could go, if > > > > > I look for them. Also, it seemed silly to have one without the other. > > > > > > > > > > > > > While not absolutely required, I'd like us to find at least one > > > > place and start using it. (Clang also nags at me when we have > > > > unused functions). > > > > > > Done in the expanded patches attached. I see that you've found use of it in dynahash, thanks! The math in the new (from v4 to v6) patch is wrong: it yields ceil_log2(1) = 1 or next_power_of_2(1) = 2. I can see that you lifted the restriction of "num greater than one" for ceil_log2() in this patch set, but it's now _more_ problematic to base those functions on pg_leftmost_one_pos(). I'm not comfortable with your changes to pg_leftmost_one_pos() to remove the restriction on word being non-zero. Specifically pg_leftmost_one_pos() is made to return 0 on 0 input. While none of its current callers (in HEAD) is harmed, this introduces muddy semantics: 1. pg_leftmost_one_pos is semantically undefined on 0 input: scanning for a set bit in a zero word won't find it anywhere. 2. we can _try_ generalizing it to accommodate ceil_log2 by extrapolating based on the invariant that BSR + LZCNT = 31 (or 63). In that case, the extrapolation yields -1 for pg_leftmost_one_pos(0). I'm not convinced that others on the list will be comfortable with the generalization suggested in 2 above. I've quickly put together a PoC patch on top of yours, which re-implements ceil_log2 using LZCNT coupled with a CPUID check. Thoughts? Cheers, Jesse
Вложения
On Thu, Feb 27, 2020 at 1:56 PM David Fetter <david@fetter.org> wrote: > > [v6 patch set] Here I'm only looking at 0001. It needs rebasing, but it's trivial to see what it does. I noticed in some places, you've replaced "long" with uint64, but many are int64. I started making a list, but it got too long, and I had to stop and ask: Is there a reason to change from signed to unsigned for any of the ones that aren't directly related to hashing code? Is there some larger pattern I'm missing? -static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb); -static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum); +static uint64 gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb); +static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, uint64 blocknum); I believe these should actually use BlockNumber, if these refer to relation blocks as opposed to temp file blocks (I haven't read the code). -exec_execute_message(const char *portal_name, long max_rows) +exec_execute_message(const char *portal_name, uint64 max_rows) The only call site of this function uses an int32, which gets its value from pq_getmsgint, which returns uint32. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbjesse@gmail.com> wrote: > The math in the new (from v4 to v6) patch is wrong: it yields > ceil_log2(1) = 1 or next_power_of_2(1) = 2. I think you're right. > I can see that you lifted > the restriction of "num greater than one" for ceil_log2() in this patch > set, but it's now _more_ problematic to base those functions on > pg_leftmost_one_pos(). > I'm not comfortable with your changes to pg_leftmost_one_pos() to remove > the restriction on word being non-zero. Specifically > pg_leftmost_one_pos() is made to return 0 on 0 input. While none of its > current callers (in HEAD) is harmed, this introduces muddy semantics: > > 1. pg_leftmost_one_pos is semantically undefined on 0 input: scanning > for a set bit in a zero word won't find it anywhere. Right. > I've quickly put together a PoC patch on top of yours, which > re-implements ceil_log2 using LZCNT coupled with a CPUID check. > Thoughts? This patch seems to be making an assumption that an indirect function call is faster than taking a branch (in inlined code) that the CPU will almost always predict correctly. It would be nice to have some numbers to compare. (against pg_count_leading_zeros_* using the "slow" versions but statically inlined). Stylistically, "8 * sizeof(num)" is a bit overly formal, since the hard-coded number we want is in the name of the function. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 02, 2020 at 12:45:21PM -0800, Jesse Zhang wrote: > Hi David, > > On Wed, Feb 26, 2020 at 9:56 PM David Fetter <david@fetter.org> wrote: > > > > On Wed, Feb 26, 2020 at 09:12:24AM +0100, David Fetter wrote: > > > On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote: > > > > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote: > > > > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote: > > > > > > > The changes in hash AM and SIMPLEHASH do look like a net positive > > > > > > > improvement. My biggest cringe might be in pg_bitutils: > > > > > > > > > > > > > > 1. Is ceil_log2_64 dead code? > > > > > > > > > > > > Let's call it nascent code. I suspect there are places it could go, if > > > > > > I look for them. Also, it seemed silly to have one without the other. > > > > > > > > > > > > > > > > While not absolutely required, I'd like us to find at least one > > > > > place and start using it. (Clang also nags at me when we have > > > > > unused functions). > > > > > > > > Done in the expanded patches attached. > I see that you've found use of it in dynahash, thanks! > > The math in the new (from v4 to v6) patch is wrong: it yields > ceil_log2(1) = 1 or next_power_of_2(1) = 2. I can see that you lifted > the restriction of "num greater than one" for ceil_log2() in this patch > set, but it's now _more_ problematic to base those functions on > pg_leftmost_one_pos(). > > I'm not comfortable with your changes to pg_leftmost_one_pos() to remove > the restriction on word being non-zero. Specifically > pg_leftmost_one_pos() is made to return 0 on 0 input. While none of its > current callers (in HEAD) is harmed, this introduces muddy semantics: > > 1. pg_leftmost_one_pos is semantically undefined on 0 input: scanning > for a set bit in a zero word won't find it anywhere. > > 2. we can _try_ generalizing it to accommodate ceil_log2 by > extrapolating based on the invariant that BSR + LZCNT = 31 (or 63). In > that case, the extrapolation yields -1 for pg_leftmost_one_pos(0). > > I'm not convinced that others on the list will be comfortable with the > generalization suggested in 2 above. > > I've quickly put together a PoC patch on top of yours, which > re-implements ceil_log2 using LZCNT coupled with a CPUID check. > Thoughts? Per discussion on IRC with Andrew (RhodiumToad) Gierth: The runtime detection means there's always an indirect call overhead and no way to inline. This is counter to what using compiler intrinsics is supposed to do. It's better to rely on the compiler, because: (a) The compiler often knows whether the value can or can't be 0 and can therefore skip a conditional jump. (b) If you're targeting a recent microarchitecture, the compiler can just use the right instruction. (c) Even if the conditional branch is left in, it's not a big overhead. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi John, Oops this email has been sitting in my outbox for 3 days... On Wed, Mar 4, 2020 at 1:46 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbjesse@gmail.com> wrote: > > I've quickly put together a PoC patch on top of yours, which > > re-implements ceil_log2 using LZCNT coupled with a CPUID check. > > Thoughts? > > This patch seems to be making an assumption that an indirect function > call is faster than taking a branch (in inlined code) that the CPU > will almost always predict correctly. It would be nice to have some > numbers to compare. (against pg_count_leading_zeros_* using the "slow" > versions but statically inlined). > Ah, how could I forget that... I ran a quick benchmark on my laptop, and indeed, even though the GCC-generated code takes a hit on zero input (Clang generates slightly different code that gives indistinguishable runtime for zero and non-zero inputs), the inlined code (the function input in my benchmark is never a constant literal so the branch does get exercised at runtime) is still more than twice as fast as the function call. ------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------ BM_pfunc/0 1.57 ns 1.56 ns 447127265 BM_pfunc/1 1.56 ns 1.56 ns 449618696 BM_pfunc/8 1.57 ns 1.57 ns 443013856 BM_pfunc/64 1.57 ns 1.57 ns 448784369 BM_slow/0 0.602 ns 0.600 ns 1000000000 BM_slow/1 0.391 ns 0.390 ns 1000000000 BM_slow/8 0.392 ns 0.391 ns 1000000000 BM_slow/64 0.391 ns 0.390 ns 1000000000 BM_fast/0 1.47 ns 1.46 ns 477513921 BM_fast/1 1.47 ns 1.46 ns 473992040 BM_fast/8 1.46 ns 1.46 ns 474895755 BM_fast/64 1.47 ns 1.46 ns 477215268 For your amusement, I've attached the meat of the benchmark. To build the code you can grab the repository at https://github.com/d/glowing-chainsaw/tree/pfunc > Stylistically, "8 * sizeof(num)" is a bit overly formal, since the > hard-coded number we want is in the name of the function. Oh yeah, overly generic code is indicative of the remnants of my C++ brain, will fix. Cheers, Jesse
Вложения
Hi David, On Sun, Mar 8, 2020 at 11:34 AM David Fetter <david@fetter.org> wrote: > > On Mon, Mar 02, 2020 at 12:45:21PM -0800, Jesse Zhang wrote: > > Hi David, > > Per discussion on IRC with Andrew (RhodiumToad) Gierth: > > The runtime detection means there's always an indirect call overhead > and no way to inline. This is counter to what using compiler > intrinsics is supposed to do. > > It's better to rely on the compiler, because: > (a) The compiler often knows whether the value can or can't be 0 and > can therefore skip a conditional jump. Yes, the compiler would know to eliminate the branch if the inlined function is called with a literal argument, or it infers an invariant from the context (like nesting inside a conditional block, or a previous conditional "noreturn" path). > (b) If you're targeting a recent microarchitecture, the compiler can > just use the right instruction. I might be more conservative than you are on (b). The thought of building a binary that cannot run "somewhere" where the compiler supports by default still mortifies me. > (c) Even if the conditional branch is left in, it's not a big overhead. > I 100% agree with (c), see benchmarking results upthread. Cheers, Jesse
On Sat, 29 Feb 2020 at 04:13, David Fetter <david@fetter.org> wrote: > > On Thu, Feb 27, 2020 at 02:41:49PM +0800, John Naylor wrote: > > In 0002, the pg_bitutils functions have a test (input > 0), and the > > new callers ceil_log2_* and next_power_of_2_* have asserts. That seems > > backward to me. > > To me, too, now that you mention it. My thinking was a little fuzzed > by trying to accommodate platforms with intrinsics where clz is > defined for 0 inputs. Wouldn't it be better just to leave the existing definitions of the pg_leftmost_one_pos* function alone? It seems to me you're hacking away at those just so you can support passing 1 to the new functions, and that's giving you trouble now because you're doing num-1 to handle the case where the number is already a power of 2. Which is troublesome because 1-1 is 0, which you're trying to code around. Isn't it better just to put in a run-time check for numbers that are already a power of 2 and then get rid of the num - 1? Something like: /* * pg_nextpow2_32 * Returns the next highest power of 2 of 'num', or 'num', if it's already a * power of 2. 'num' mustn't be 0 or be above UINT_MAX / 2. */ static inline uint32 pg_nextpow2_32(uint32 num) { Assert(num > 0 && num <= UINT_MAX / 2); /* use some bitmasking tricks to see if only 1 bit is on */ return (num & (num - 1)) == 0 ? num : ((uint32) 1) << (pg_leftmost_one_pos32(num) + 1); } I think you'll also want to mention the issue about numbers greater than UINT_MAX / 2, as I've done above and also align your naming conversion to what else is in that file. I don't think Jesse's proposed solution is that great due to the additional function call overhead for pg_count_leading_zeros_32(). The (num & (num - 1)) == 0 I imagine will perform better, but I didn't test it. Also, wondering if you've looked at any of the other places where we do "*= 2;" or "<<= 1;" inside a loop? There's quite a number that look like candidates for using the new function.
On Thu, Mar 12, 2020 at 7:42 AM David Rowley <dgrowleyml@gmail.com> wrote: > > I don't think Jesse's proposed solution is that great due to the > additional function call overhead for pg_count_leading_zeros_32(). The > (num & (num - 1)) == 0 I imagine will perform better, but I didn't > test it. Right, I believe we've all landed on the same page about that. I see two ways of doing next_power_of_2_32 without an indirect function call, and leaving pg_leftmost_one_pos32 the same as it is now. I haven't measured either yet (or tested for that matter): static inline uint32 next_power_of_2_32(uint32 num) { Assert(num > 0 && num <= UINT_MAX / 2); /* use some bitmasking tricks to see if only 1 bit is on */ if (num & (num - 1)) == 0) return num; return ((uint32) 1) << (pg_leftmost_one_pos32(num) + 1) } OR { Assert(num > 0 && num <= UINT_MAX / 2); return ((uint32) 1) << ceil_log2_32(num); } static inline uint32 ceil_log2_32(uint32 num) { Assert(num > 0); if (num == 1) return 0; return pg_leftmost_one_pos32(num-1) + 1; } One naming thing I noticed: the name "next power of two" implies to me num *= 2 for a power of two, not the same as the input. The latter behavior is better called "ceil power of 2". > Also, wondering if you've looked at any of the other places where we > do "*= 2;" or "<<= 1;" inside a loop? There's quite a number that look > like candidates for using the new function. A brief look shows a few functions where this is done in a tight loop: nodes/list.c:new_list LWLockRegisterTranche ensure_record_cache_typmod_slot_exists pqCheckOutBufferSpace ExecChooseHashTableSize ExecHashBuildSkewHash choose_nelem_alloc init_htab hash_estimate_size hash_select_dirsize AllocSetAlloc -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 12 Mar 2020 at 22:59, John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Mar 12, 2020 at 7:42 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > I don't think Jesse's proposed solution is that great due to the > > additional function call overhead for pg_count_leading_zeros_32(). The > > (num & (num - 1)) == 0 I imagine will perform better, but I didn't > > test it. > > Right, I believe we've all landed on the same page about that. I see > two ways of doing next_power_of_2_32 without an indirect function > call, and leaving pg_leftmost_one_pos32 the same as it is now. I > haven't measured either yet (or tested for that matter): I've attached an updated patch. It includes the modifications mentioned above to pre-check for a power of 2 number with the bit masking hack mentioned above. I also renamed the functions to be more aligned to the other functions in pg_bitutils.h I'm not convinced pg_ceil_log2_* needs the word "ceil" in there. I dropped the part of the patch that was changing longs to ints of a known size. I went on and did some additional conversion in the 0003 patch. There are more laying around the code base, but I ended up finding a bit to fix up than i had thought I would. e.g. various places that repalloc() inside a loop that is multiplying the allocation size by 2 each time. The repalloc should be done at the end, not during the loop. I thought I might come back to those some time in the future. Is anyone able to have a look at this? David
Вложения
On Tue, Apr 7, 2020 at 7:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > I've attached an updated patch. It includes the modifications > mentioned above to pre-check for a power of 2 number with the bit > masking hack mentioned above. I also renamed the functions to be more > aligned to the other functions in pg_bitutils.h I'm not convinced > pg_ceil_log2_* needs the word "ceil" in there. > > I dropped the part of the patch that was changing longs to ints of a > known size. I went on and did some additional conversion in the 0003 > patch. There are more laying around the code base, but I ended up > finding a bit to fix up than i had thought I would. e.g. various > places that repalloc() inside a loop that is multiplying the > allocation size by 2 each time. The repalloc should be done at the > end, not during the loop. I thought I might come back to those some > time in the future. > > Is anyone able to have a look at this? > > David Hi David, Overall looks good to me. Just a couple things I see: It seems _hash_log2 is still in the tree, but has no callers? - max_size = 8; /* semi-arbitrary small power of 2 */ - while (max_size < min_size + LIST_HEADER_OVERHEAD) - max_size *= 2; + max_size = pg_nextpower2_32(Max(8, min_size + LIST_HEADER_OVERHEAD)); Minor nit: We might want to keep the comment that the number is "semi-arbitrary" here as well. - 'pg_waldump', 'scripts'); + 'pg_validatebackup', 'pg_waldump', 'scripts'); This seems like a separate concern? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi John, Thanks for having a look at this. On Wed, 8 Apr 2020 at 00:16, John Naylor <john.naylor@2ndquadrant.com> wrote: > Overall looks good to me. Just a couple things I see: > > It seems _hash_log2 is still in the tree, but has no callers? Yeah, I left it in there since it was an external function. Perhaps we could rip it out and write something in the commit message that it should be replaced with the newer functions. Thinking of extension authors here. > - max_size = 8; /* semi-arbitrary small power of 2 */ > - while (max_size < min_size + LIST_HEADER_OVERHEAD) > - max_size *= 2; > + max_size = pg_nextpower2_32(Max(8, min_size + LIST_HEADER_OVERHEAD)); > > Minor nit: We might want to keep the comment that the number is > "semi-arbitrary" here as well. I had dropped that as the 8 part was mentioned in the comment above: "The minimum allocation is 8 ListCell units". I can put it back, I had just thought it was overkill. > - 'pg_waldump', 'scripts'); > + 'pg_validatebackup', 'pg_waldump', 'scripts'); > > This seems like a separate concern? That's required due to the #include "lib/simplehash.h" in pg_validatebackup.c. I have to say, I didn't really take the time to understand all the Perl code there, but without that change, I was getting a link error when testing on Windows, and after I added pg_validatebackup to that array, it worked. David
On Tue, Apr 7, 2020 at 8:26 PM David Rowley <dgrowleyml@gmail.com> wrote: > > Hi John, > > Thanks for having a look at this. > > On Wed, 8 Apr 2020 at 00:16, John Naylor <john.naylor@2ndquadrant.com> wrote: > > Overall looks good to me. Just a couple things I see: > > > > It seems _hash_log2 is still in the tree, but has no callers? > > Yeah, I left it in there since it was an external function. Perhaps > we could rip it out and write something in the commit message that it > should be replaced with the newer functions. Thinking of extension > authors here. I'm not the best judge of where to draw the line for extensions, but this function does have a name beginning with an underscore, which to me is a red flag that it's internal in nature. > > Minor nit: We might want to keep the comment that the number is > > "semi-arbitrary" here as well. > > I had dropped that as the 8 part was mentioned in the comment above: > "The minimum allocation is 8 ListCell units". I can put it back, I had > just thought it was overkill. Oh I see now, nevermind. > > - 'pg_waldump', 'scripts'); > > + 'pg_validatebackup', 'pg_waldump', 'scripts'); > > > > This seems like a separate concern? > > That's required due to the #include "lib/simplehash.h" in > pg_validatebackup.c. I have to say, I didn't really take the time to > understand all the Perl code there, but without that change, I was > getting a link error when testing on Windows, and after I added > pg_validatebackup to that array, it worked. Hmm. Does pg_bitutils.h need something like this? #ifndef FRONTEND extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_number_of_ones[256]; #else extern const uint8 pg_leftmost_one_pos[256]; extern const uint8 pg_rightmost_one_pos[256]; extern const uint8 pg_number_of_ones[256]; #endif -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 8 Apr 2020 at 01:16, John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Tue, Apr 7, 2020 at 8:26 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > > Hi John, > > > > Thanks for having a look at this. > > > > On Wed, 8 Apr 2020 at 00:16, John Naylor <john.naylor@2ndquadrant.com> wrote: > > > Overall looks good to me. Just a couple things I see: > > > > > > It seems _hash_log2 is still in the tree, but has no callers? > > > > Yeah, I left it in there since it was an external function. Perhaps > > we could rip it out and write something in the commit message that it > > should be replaced with the newer functions. Thinking of extension > > authors here. > > I'm not the best judge of where to draw the line for extensions, but > this function does have a name beginning with an underscore, which to > me is a red flag that it's internal in nature. OK. I've removed that function now and stuck a note in the commit message to mention an alternative. > Hmm. Does pg_bitutils.h need something like this? > > #ifndef FRONTEND > extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256]; > extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256]; > extern PGDLLIMPORT const uint8 pg_number_of_ones[256]; > #else > extern const uint8 pg_leftmost_one_pos[256]; > extern const uint8 pg_rightmost_one_pos[256]; > extern const uint8 pg_number_of_ones[256]; > #endif Yeah, looking at keywords.h, we hit this before in c2d1eea9e75. Your proposed fix works and is the same as in keywords.h, so I've gone with that. I've attached v8 of the patchset. David
Вложения
On Wed, Apr 8, 2020 at 9:04 AM David Rowley <dgrowleyml@gmail.com> wrote: > [v8] Looks good to me, marked RFC. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 8 Apr 2020 at 15:06, John Naylor <john.naylor@2ndquadrant.com> wrote: > Looks good to me, marked RFC. Thanks a lot for reviewing those changes. I've now pushed all 3 of the patches. David