Обсуждение: optimize lookups in snapshot [sub]xip arrays
Hi hackers,
A few years ago, there was a proposal to create hash tables for long
[sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
I was curious whether this idea still showed measurable benefits, so I
revamped the patch and ran the same test as before [1].  Here are the
results for 60₋second runs on an r5d.24xlarge with the data directory on
the local NVMe storage:
     writers  HEAD  patch  diff
    ----------------------------
     16       659   664    +1%
     32       645   663    +3%
     64       659   692    +5%
     128      641   716    +12%
     256      619   610    -1%
     512      530   702    +32%
     768      469   582    +24%
     1000     367   577    +57%
As before, the hash table approach seems to provide a decent benefit at
higher client counts, so I felt it was worth reviving the idea.
The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table.  Also, I've bumped up the threshold for creating hash tables to
128 based on the results of my testing.  The attached patch waits until a
lookup of [sub]xip before generating the hash table, so we only need to
allocate enough space for the current elements in the [sub]xip array, and
we avoid allocating extra memory for workloads that do not need the hash
tables.  I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
Thoughts?
[0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
[1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		Вложения
On Wed, Jul 13, 2022 at 10:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Hi hackers, > > A few years ago, there was a proposal to create hash tables for long > [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out. > I was curious whether this idea still showed measurable benefits, so I > revamped the patch and ran the same test as before [1]. Here are the > results for 60₋second runs on an r5d.24xlarge with the data directory on > the local NVMe storage: > > writers HEAD patch diff > ---------------------------- > 16 659 664 +1% > 32 645 663 +3% > 64 659 692 +5% > 128 641 716 +12% > 256 619 610 -1% > 512 530 702 +32% > 768 469 582 +24% > 1000 367 577 +57% Impressive. > As before, the hash table approach seems to provide a decent benefit at > higher client counts, so I felt it was worth reviving the idea. > > The attached patch has some key differences from the previous proposal. > For example, the new patch uses simplehash instead of open-coding a new > hash table. Also, I've bumped up the threshold for creating hash tables to > 128 based on the results of my testing. The attached patch waits until a > lookup of [sub]xip before generating the hash table, so we only need to > allocate enough space for the current elements in the [sub]xip array, and > we avoid allocating extra memory for workloads that do not need the hash > tables. I'm slightly worried about increasing the number of memory > allocations in this code path, but the results above seemed encouraging on > that front. > > Thoughts? > > [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru > [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com Aren't these snapshot arrays always sorted? I see the following code: /* sort so we can bsearch() */ qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator); /* sort so we can bsearch() later */ qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator); If the ordering isn't an invariant of these snapshot arrays, can we also use the hash table mechanism for all of the snapshot arrays infrastructure rather than qsort+bsearch in a few places and hash table for others? + * The current value worked well in testing, but it's still mostly a guessed-at + * number that might need updating in the future. + */ +#define XIP_HASH_MIN_ELEMENTS (128) + Do you see a regression with a hash table for all the cases? Why can't we just build a hash table irrespective of these limits and use it for all the purposes instead of making it complex with different approaches if we don't have measurable differences in the performance or throughput? +static inline bool +XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt, + xip_hash_hash **xiph) + /* Make sure the hash table is built. */ + if (*xiph == NULL) + { + *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL); + + for (int i = 0; i < xcnt; i++) Why create a hash table on the first search? Why can't it be built while inserting or creating these snapshots? Basically, instead of the array, can these snapshot structures be hash tables by themselves? I know this requires a good amount of code refactoring, but worth considering IMO as it removes bsearch thus might improve the performance further. Regards, Bharath Rupireddy.
Hi Bharath,
Thanks for taking a look.
On Thu, Jul 14, 2022 at 03:10:56PM +0530, Bharath Rupireddy wrote:
> Aren't these snapshot arrays always sorted? I see the following code:
> 
> /* sort so we can bsearch() */
> qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
> 
> /* sort so we can bsearch() later */
> qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
AFAICT these arrays are sorted in limited cases, such as
pg_current_snapshot() and logical replication.  GetSnapshotData() does not
appear to sort them, so I don't think we can always assume they are sorted.
In the previous thread, Tomas analyzed simply sorting the arrays [0] and
found that it provided much less improvement compared to the hash table
approach, so I have not seriously considered it here.
> If the ordering isn't an invariant of these snapshot arrays, can we
> also use the hash table mechanism for all of the snapshot arrays
> infrastructure rather than qsort+bsearch in a few places and hash
> table for others?
Unless there is demonstrable benefit in doing so for the few places that
sort the arrays, I'm ѕkeptical it's worth the complexity.  This patch is
targeted to XidInMVCCSnapshot(), which we can demonstrate has clear impact
on TPS for some workloads.
> + * The current value worked well in testing, but it's still mostly a guessed-at
> + * number that might need updating in the future.
> + */
> +#define XIP_HASH_MIN_ELEMENTS (128)
> +
> 
> Do you see a regression with a hash table for all the cases? Why can't
> we just build a hash table irrespective of these limits and use it for
> all the purposes instead of making it complex with different
> approaches if we don't have measurable differences in the performance
> or throughput?
I performed the same tests as before with a variety of values.  Here are
the results:
     writers  HEAD  1   16  32  64  128
    ------------------------------------
     16       659   698 678 659 665 664
     32       645   661 688 657 649 663
     64       659   656 653 649 663 692
     128      641   636 639 679 643 716
     256      619   641 619 643 653 610
     512      530   609 582 602 605 702
     768      469   610 608 551 571 582
     1000     367   610 538 557 556 577
I was surpised to see that there really wasn't a regression at the low end,
but keep in mind that this is a rather large machine and a specialized
workload for generating snapshots with long [sub]xip arrays.  That being
said, there really wasn't any improvement at the low end, either.  If we
always built a hash table, we'd be introducing more overhead and memory
usage in return for approximately zero benefit.  My intent was to only take
on the overhead in cases where we believe it might have a positive impact,
which is why I picked the somewhat conservative value of 128.  If the
overhead isn't a concern, it might be feasible to always make [sub]xip a
hash table.
> +static inline bool
> +XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
> + xip_hash_hash **xiph)
> 
> + /* Make sure the hash table is built. */
> + if (*xiph == NULL)
> + {
> + *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
> +
> + for (int i = 0; i < xcnt; i++)
> 
> Why create a hash table on the first search? Why can't it be built
> while inserting or creating these snapshots? Basically, instead of the
> array, can these snapshot structures be hash tables by themselves? I
> know this requires a good amount of code refactoring, but worth
> considering IMO as it removes bsearch thus might improve the
> performance further.
The idea is to avoid the overhead unless something actually needs to
inspect these arrays.
[0] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		Hi, Sounds worth pursuing. On 2022-07-13 10:09:50 -0700, Nathan Bossart wrote: > The attached patch has some key differences from the previous proposal. > For example, the new patch uses simplehash instead of open-coding a new > hash table. +1 > The attached patch waits until a lookup of [sub]xip before generating the > hash table, so we only need to allocate enough space for the current > elements in the [sub]xip array, and we avoid allocating extra memory for > workloads that do not need the hash tables. Hm. Are there any contexts where we'd not want the potential for failing due to OOM? I wonder if we additionally / alternatively could use a faster method of searching the array linearly, e.g. using SIMD. Another thing that might be worth looking into is to sort the xip/subxip arrays into a binary-search optimized layout. That'd make the binary search faster, wouldn't require additional memory (a boolean indicating whether sorted somewhere, I guess), and would easily persist across copies of the snapshot. > I'm slightly worried about increasing the number of memory > allocations in this code path, but the results above seemed encouraging on > that front. ISMT that the test wouldn't be likely to show those issues. > These hash tables are regarded as ephemeral; they only live in > process-local memory and are never rewritten, copied, or > serialized. What does rewriting refer to here? Not convinced that's the right idea in case of copying. I think we often end up copying snapshots frequently, and building & allocating the hashed xids separately every time seems not great. > + snapshot->xiph = NULL; > + snapshot->subxiph = NULL; Do we need separate hashes for these? ISTM that if we're overflowed then we don't need ->subxip[h], and if not, then the action for an xid being in ->xip and ->subxiph is the same? Greetings, Andres Freund
Hi Andres,
Thanks for taking a look.
On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
> Hm. Are there any contexts where we'd not want the potential for failing due
> to OOM?
I'm not sure about this one.
> I wonder if we additionally / alternatively could use a faster method of
> searching the array linearly, e.g. using SIMD.
I looked into using SIMD.  The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review.  There may be a simpler/better way to implement the linear search,
but this seemed to work well.  Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning.  Here are the numbers:
    writers head simd hash
    256     663  632  694
    512     530  618  637
    768     489  544  573
    1024    364  508  562
    2048    185  306  485
    4096    146  197  441
While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering.  For one, there's really no overhead (i.e., you don't need to
sort the array or build a hash table), so we can avoid picking an arbitrary
threshold and just have one code path.  Also, a SIMD implementation for a
linear search through an array of integers could likely be easily reused
elsewhere.
> Another thing that might be worth looking into is to sort the xip/subxip
> arrays into a binary-search optimized layout. That'd make the binary search
> faster, wouldn't require additional memory (a boolean indicating whether
> sorted somewhere, I guess), and would easily persist across copies of the
> snapshot.
I spent some time looking into this, but I haven't attempted to implement
it.  IIUC the most difficult part of this is sorting the array in place to
the special layout.
>> These hash tables are regarded as ephemeral; they only live in
>> process-local memory and are never rewritten, copied, or
>> serialized.
> 
> What does rewriting refer to here?
I mean that a hash table created for one snapshot will not be cleared and
reused for another.
> Not convinced that's the right idea in case of copying. I think we often end
> up copying snapshots frequently, and building & allocating the hashed xids
> separately every time seems not great.
Right.  My concern with reusing the hash tables is that we'd need to
allocate much more space that would go largely unused in many cases.
>> +    snapshot->xiph = NULL;
>> +    snapshot->subxiph = NULL;
> 
> Do we need separate hashes for these? ISTM that if we're overflowed then we
> don't need ->subxip[h], and if not, then the action for an xid being in ->xip
> and ->subxiph is the same?
Do you mean that we can combine these into one hash table?  That might
work.
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		Вложения
Hi, all
>
>          if (!snapshot->suboverflowed)
>          {
>              /* we have full data, so search subxip */
> -            int32        j;
> -
> -            for (j = 0; j < snapshot->subxcnt; j++)
> -            {
> -                if (TransactionIdEquals(xid, snapshot->subxip[j]))
> -                    return true;
> -            }
> +            if (XidInXip(xid, snapshot->subxip, snapshot->subxcnt,
> +                         &snapshot->subxiph))
> +                return true;
>
>              /* not there, fall through to search xip[] */
>          }
If snaphost->suboverflowed is  false then the subxcnt must be less than PGPROC_MAX_CACHED_SUBXIDS which is 64 now.
And we won’t use hash if the xcnt is less than XIP_HASH_MIN_ELEMENTS which is 128 currently during discussion.
So that, subxid’s hash table will never be used, right?
Regards,
Zhang Mingli
> On Jul 14, 2022, at 01:09, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Hi hackers,
>
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60₋second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
>
>     writers  HEAD  patch  diff
>    ----------------------------
>     16       659   664    +1%
>     32       645   663    +3%
>     64       659   692    +5%
>     128      641   716    +12%
>     256      619   610    -1%
>     512      530   702    +32%
>     768      469   582    +24%
>     1000     367   577    +57%
>
> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
>
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
>
> Thoughts?
>
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
> <v1-0001-Optimize-lookups-in-snapshot-transactions-in-prog.patch>
			
		В Ср, 13/07/2022 в 10:09 -0700, Nathan Bossart пишет: > Hi hackers, > > A few years ago, there was a proposal to create hash tables for long > [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out. > I was curious whether this idea still showed measurable benefits, so I > revamped the patch and ran the same test as before [1]. Here are the > results for 60₋second runs on an r5d.24xlarge with the data directory on > the local NVMe storage: > > writers HEAD patch diff > ---------------------------- > 16 659 664 +1% > 32 645 663 +3% > 64 659 692 +5% > 128 641 716 +12% > 256 619 610 -1% > 512 530 702 +32% > 768 469 582 +24% > 1000 367 577 +57% > > As before, the hash table approach seems to provide a decent benefit at > higher client counts, so I felt it was worth reviving the idea. > > The attached patch has some key differences from the previous proposal. > For example, the new patch uses simplehash instead of open-coding a new > hash table. Also, I've bumped up the threshold for creating hash tables to > 128 based on the results of my testing. The attached patch waits until a > lookup of [sub]xip before generating the hash table, so we only need to > allocate enough space for the current elements in the [sub]xip array, and > we avoid allocating extra memory for workloads that do not need the hash > tables. I'm slightly worried about increasing the number of memory > allocations in this code path, but the results above seemed encouraging on > that front. > > Thoughts? > > [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru > [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com > I'm glad my idea has been reborn. Well, may be simplehash is not bad idea. While it certainly consumes more memory and CPU instructions. I'll try to review. regards, Yura Sokolov
On Sun, Jul 24, 2022 at 12:48:25PM +0800, Zhang Mingli wrote: > If snaphost->suboverflowed is false then the subxcnt must be less than PGPROC_MAX_CACHED_SUBXIDS which is 64 now. > > And we won’t use hash if the xcnt is less than XIP_HASH_MIN_ELEMENTS which is 128 currently during discussion. > > So that, subxid’s hash table will never be used, right? This array will store up to TOTAL_MAX_CACHED_SUBXIDS transactions, which will typically be much greater than 64. When there isn't any overflow, subxip stores all of the subxids for all of the entries in the procarray. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Got it, thanks. Regards, Zhang Mingli > On Jul 25, 2022, at 12:08, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Sun, Jul 24, 2022 at 12:48:25PM +0800, Zhang Mingli wrote: >> If snaphost->suboverflowed is false then the subxcnt must be less than PGPROC_MAX_CACHED_SUBXIDS which is 64 now. >> >> And we won’t use hash if the xcnt is less than XIP_HASH_MIN_ELEMENTS which is 128 currently during discussion. >> >> So that, subxid’s hash table will never be used, right? > > This array will store up to TOTAL_MAX_CACHED_SUBXIDS transactions, which > will typically be much greater than 64. When there isn't any overflow, > subxip stores all of the subxids for all of the entries in the procarray. > > -- > Nathan Bossart > Amazon Web Services: https://aws.amazon.com
On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote: > On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote: >> I wonder if we additionally / alternatively could use a faster method of >> searching the array linearly, e.g. using SIMD. > > I looked into using SIMD. The patch is attached, but it is only intended > for benchmarking purposes and isn't anywhere close to being worth serious > review. There may be a simpler/better way to implement the linear search, > but this seemed to work well. Overall, SIMD provided a decent improvement. > I had to increase the number of writers quite a bit in order to demonstrate > where the hash tables began winning. Here are the numbers: > > writers head simd hash > 256 663 632 694 > 512 530 618 637 > 768 489 544 573 > 1024 364 508 562 > 2048 185 306 485 > 4096 146 197 441 > > While it is unsurprising that the hash tables perform the best, there are a > couple of advantages to SIMD that might make that approach worth > considering. For one, there's really no overhead (i.e., you don't need to > sort the array or build a hash table), so we can avoid picking an arbitrary > threshold and just have one code path. Also, a SIMD implementation for a > linear search through an array of integers could likely be easily reused > elsewhere. From the discussion thus far, it seems there is interest in optimizing [sub]xip lookups, so I'd like to spend some time moving it forward. I think the biggest open question is which approach to take. Both the SIMD and hash table approaches seem viable, but I think I prefer the SIMD approach at the moment (see the last paragraph of quoted text for the reasons). What do folks think? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
> On Jul 26, 2022, at 03:04, Nathan Bossart <nathandbossart@gmail.com> wrote: >> > From the discussion thus far, it seems there is interest in optimizing > [sub]xip lookups, so I'd like to spend some time moving it forward. I > think the biggest open question is which approach to take. Both the SIMD > and hash table approaches seem viable, but I think I prefer the SIMD > approach at the moment (see the last paragraph of quoted text for the > reasons). What do folks think? > > -- > Nathan Bossart > Amazon Web Services: https://aws.amazon.com > > +1, I’m not familiar with SIMD, will try to review this patch. Regards, Zhang Mingli
On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote: > On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote: > > On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote: > >> I wonder if we additionally / alternatively could use a faster method of > >> searching the array linearly, e.g. using SIMD. > > > > I looked into using SIMD. The patch is attached, but it is only intended > > for benchmarking purposes and isn't anywhere close to being worth serious > > review. There may be a simpler/better way to implement the linear search, > > but this seemed to work well. Overall, SIMD provided a decent improvement. > > I had to increase the number of writers quite a bit in order to demonstrate > > where the hash tables began winning. Here are the numbers: > > > > writers head simd hash > > 256 663 632 694 > > 512 530 618 637 > > 768 489 544 573 > > 1024 364 508 562 > > 2048 185 306 485 > > 4096 146 197 441 > > > > While it is unsurprising that the hash tables perform the best, there are a > > couple of advantages to SIMD that might make that approach worth > > considering. For one, there's really no overhead (i.e., you don't need to > > sort the array or build a hash table), so we can avoid picking an arbitrary > > threshold and just have one code path. Also, a SIMD implementation for a > > linear search through an array of integers could likely be easily reused > > elsewhere. > > From the discussion thus far, it seems there is interest in optimizing > [sub]xip lookups, so I'd like to spend some time moving it forward. I > think the biggest open question is which approach to take. Both the SIMD > and hash table approaches seem viable, but I think I prefer the SIMD > approach at the moment (see the last paragraph of quoted text for the > reasons). What do folks think? Agreed on all points.
On Tue, Jul 26, 2022 at 11:19:06AM -0700, Andres Freund wrote: > On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote: >> From the discussion thus far, it seems there is interest in optimizing >> [sub]xip lookups, so I'd like to spend some time moving it forward. I >> think the biggest open question is which approach to take. Both the SIMD >> and hash table approaches seem viable, but I think I prefer the SIMD >> approach at the moment (see the last paragraph of quoted text for the >> reasons). What do folks think? > > Agreed on all points. Great! Here is a new patch. A couple notes: * I briefly looked into seeing whether auto-vectorization was viable and concluded it was not for these loops. * I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not sure whether this is committable, so I would welcome thoughts on the proper form. Given the comment says that SSE2 is supported by all x86-64 hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not enough to check for __x86_64__ and _M_AMD64? * I haven't looked into adding an ARM implementation yet. [0] https://postgr.es/m/CAFBsxsHko7yc8A-2PpjQ%3D2StomXF%2BT2jgKF%3DWaMFZWi8CvV7hA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
On Fri, Jul 29, 2022 at 4:34 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> * I briefly looked into seeing whether auto-vectorization was viable and
> concluded it was not for these loops.
>
> * I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not sure
> whether this is committable,
I'll be the first to say it's not committable and needs some thought. Since there are several recently proposed patches that take advantage of SSE2, it seems time for me to open a new thread and get that prerequisite settled. I'll do that next week.
> so I would welcome thoughts on the proper
> form. Given the comment says that SSE2 is supported by all x86-64
> hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not
> enough to check for __x86_64__ and _M_AMD64?
That's enough for emitting instructions that the target CPU can run, but says nothing (I think) about the host compiler's ability to understand the intrinsics and associated headers. The architecture is old enough that maybe zero compilers in the buildfarm that target AMD64 fail to understand SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check is not necessary, but it was sufficient and already present, so I borrowed it for the PoC.
--
John Naylor
EDB: http://www.enterprisedb.com
			
		> * I briefly looked into seeing whether auto-vectorization was viable and
> concluded it was not for these loops.
>
> * I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not sure
> whether this is committable,
I'll be the first to say it's not committable and needs some thought. Since there are several recently proposed patches that take advantage of SSE2, it seems time for me to open a new thread and get that prerequisite settled. I'll do that next week.
> so I would welcome thoughts on the proper
> form. Given the comment says that SSE2 is supported by all x86-64
> hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not
> enough to check for __x86_64__ and _M_AMD64?
That's enough for emitting instructions that the target CPU can run, but says nothing (I think) about the host compiler's ability to understand the intrinsics and associated headers. The architecture is old enough that maybe zero compilers in the buildfarm that target AMD64 fail to understand SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check is not necessary, but it was sufficient and already present, so I borrowed it for the PoC.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, Jul 30, 2022 at 12:02:02PM +0700, John Naylor wrote: > On Fri, Jul 29, 2022 at 4:34 AM Nathan Bossart <nathandbossart@gmail.com> > wrote: >> * I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not > sure >> whether this is committable, > > I'll be the first to say it's not committable and needs some thought. Since > there are several recently proposed patches that take advantage of SSE2, it > seems time for me to open a new thread and get that prerequisite settled. > I'll do that next week. Awesome. I will help test and review. >> so I would welcome thoughts on the proper >> form. Given the comment says that SSE2 is supported by all x86-64 >> hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not >> enough to check for __x86_64__ and _M_AMD64? > > That's enough for emitting instructions that the target CPU can run, but > says nothing (I think) about the host compiler's ability to understand the > intrinsics and associated headers. The architecture is old enough that > maybe zero compilers in the buildfarm that target AMD64 fail to understand > SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check > is not necessary, but it was sufficient and already present, so I borrowed > it for the PoC. Got it, makes sense. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Jul 29, 2022 at 10:38:11PM -0700, Nathan Bossart wrote: > On Sat, Jul 30, 2022 at 12:02:02PM +0700, John Naylor wrote: >> I'll be the first to say it's not committable and needs some thought. Since >> there are several recently proposed patches that take advantage of SSE2, it >> seems time for me to open a new thread and get that prerequisite settled. >> I'll do that next week. > > Awesome. I will help test and review. While this prerequisite is worked out [0], here is a new patch set. I've added an 0002 in which I've made use of the proposed SSE2 linear search function in several other areas. I haven't done any additional performance analysis, and it's likely I'm missing some eligible code locations, but at the very least, this demonstrates the reusability of the new function. [0] https://postgr.es/m/CAFBsxsE2G_H_5Wbw%2BNOPm70-BK4xxKf86-mRzY%3DL2sLoQqM%2B-Q%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
Hi, FWIW, I'd split the introduction of the helper and the use of it in snapmgr into separate patches. On 2022-08-02 15:13:01 -0700, Nathan Bossart wrote: > diff --git a/src/include/c.h b/src/include/c.h > index d35405f191..2c1a47bc28 100644 > --- a/src/include/c.h > +++ b/src/include/c.h > @@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void); > #endif > #endif > > +/* > + * Are SSE2 intrinsics available? > + */ > +#if (defined(__x86_64__) || defined(_M_AMD64)) > +#include <emmintrin.h> > +#define USE_SSE2 > +#endif > + It doesn't strike me as a good idea to include this in every single translation unit in pg. That header (+dependencies) isn't small. I'm on board with normalizing defines for SSE availability somewhere central though. > +/* > + * pg_linearsearch_uint32 > + * > + * Returns the address of the first element in 'base' that equals 'key', or > + * NULL if no match is found. > + */ > +#ifdef USE_SSE2 > +pg_attribute_no_sanitize_alignment() > +#endif What's the deal with this annotation? Needs a comment. > +static inline uint32 * > +pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem) Hm. I suspect this could be a bit faster if we didn't search for the offset, but just for presence in the array? Most users don't need the offset. Greetings, Andres Freund
On Tue, Aug 02, 2022 at 03:55:39PM -0700, Andres Freund wrote: > FWIW, I'd split the introduction of the helper and the use of it in snapmgr > into separate patches. Will do. > On 2022-08-02 15:13:01 -0700, Nathan Bossart wrote: >> diff --git a/src/include/c.h b/src/include/c.h >> index d35405f191..2c1a47bc28 100644 >> --- a/src/include/c.h >> +++ b/src/include/c.h >> @@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void); >> #endif >> #endif >> >> +/* >> + * Are SSE2 intrinsics available? >> + */ >> +#if (defined(__x86_64__) || defined(_M_AMD64)) >> +#include <emmintrin.h> >> +#define USE_SSE2 >> +#endif >> + > > It doesn't strike me as a good idea to include this in every single > translation unit in pg. That header (+dependencies) isn't small. > > I'm on board with normalizing defines for SSE availability somewhere central > though. Yeah, this is just a temporary hack for now. It'll go away once the defines for SSE2 availability are committed. >> +/* >> + * pg_linearsearch_uint32 >> + * >> + * Returns the address of the first element in 'base' that equals 'key', or >> + * NULL if no match is found. >> + */ >> +#ifdef USE_SSE2 >> +pg_attribute_no_sanitize_alignment() >> +#endif > > What's the deal with this annotation? Needs a comment. Will do. c.h suggests that this should only be used for x86-specific code. >> +static inline uint32 * >> +pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem) > > Hm. I suspect this could be a bit faster if we didn't search for the offset, > but just for presence in the array? Most users don't need the offset. Just under half of the callers in 0002 require the offset, but I don't know if any of those are worth optimizing in the first place. I'll change it for now. It's easy enough to add it back in the future if required. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Aug 3, 2022 at 6:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Just under half of the callers in 0002 require the offset, but I don't know
> if any of those are worth optimizing in the first place. I'll change it
> for now. It's easy enough to add it back in the future if required.
Yeah, some of those callers will rarely have more than several elements to search in the first place, or aren't performance-sensitive.
> Just under half of the callers in 0002 require the offset, but I don't know
> if any of those are worth optimizing in the first place. I'll change it
> for now. It's easy enough to add it back in the future if required.
Yeah, some of those callers will rarely have more than several elements to search in the first place, or aren't performance-sensitive.
Here is a new patch set. 0001 is the currently-proposed patch from the other thread [0] for determining SSE2 support. 0002 introduces the optimized linear search function. And 0003 makes use of the new function for the [sub]xip lookups in XidInMVCCSnapshot(). [0] https://postgr.es/m/CAFBsxsGktHL7%3DJXbgnKTi_uL0VRPcH4FSAqc6yK-3%2BJYfqPPjA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
Hi, On 2022-08-02 16:43:57 -0700, Nathan Bossart wrote: > >> +/* > >> + * pg_linearsearch_uint32 > >> + * > >> + * Returns the address of the first element in 'base' that equals 'key', or > >> + * NULL if no match is found. > >> + */ > >> +#ifdef USE_SSE2 > >> +pg_attribute_no_sanitize_alignment() > >> +#endif > > > > What's the deal with this annotation? Needs a comment. > > Will do. c.h suggests that this should only be used for x86-specific code. What I'm asking is why the annotation is needed at all? Greetings, Andres Freund
On Wed, Aug 03, 2022 at 11:06:58AM -0700, Andres Freund wrote: > On 2022-08-02 16:43:57 -0700, Nathan Bossart wrote: >> >> +#ifdef USE_SSE2 >> >> +pg_attribute_no_sanitize_alignment() >> >> +#endif >> > >> > What's the deal with this annotation? Needs a comment. >> >> Will do. c.h suggests that this should only be used for x86-specific code. > > What I'm asking is why the annotation is needed at all? Upon further inspection, I don't think this is needed. I originally borrowed it from the SSE version of the CRC code, but while it is trivial to produce alignment failures with the CRC code, I haven't been able to generate any with my patches. Looking at the code, I'm not sure why I was worried about this in the first place. Please pardon the brain fade. Here is a new patch set without the annotation. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
On Thu, Aug 4, 2022 at 3:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is a new patch set without the annotation.
Were you considering adding the new function to simd.h now that that's committed? It's a bit up in the air what should go in there, but this new function is low-level and generic enough to be a candidate...
I wonder if the "pg_" prefix is appropriate here, as that is most often used for things that hide specific details *and* where the base name would clash, like OS calls or C library functions. I'm not quite sure where the line is drawn, but I mean that "linearsearch" is a generic algorithm and not a specific API we are implementing, if that makes sense.
The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(), pg_popcount32, etc). We'll likely want to search bytes sometime, so something to keep in mind as far as naming ("_int" vs "_byte"?).
The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(), pg_popcount32, etc). We'll likely want to search bytes sometime, so something to keep in mind as far as naming ("_int" vs "_byte"?).
I'm not a fan of "its" as a variable name, and I'm curious what it's intended to convey.
All the __m128i vars could technically be declared const, although I think it doesn't matter -- it's just a hint to the reader.
Out of curiosity do we know how much we get by loading four registers rather than two?
--
John Naylor
EDB: http://www.enterprisedb.com
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> Were you considering adding the new function to simd.h now that that's
> committed? It's a bit up in the air what should go in there, but this new
> function is low-level and generic enough to be a candidate...
I don't have a strong opinion.  I went with a separate file because I
envisioned a variety of possible linear search functions (e.g., char,
uint16, uint32), and some might not use SIMD instructions.  Futhermore, it
seemed less obvious to look in simd.h for linear search functions.  That
being said, it might make sense to just add it here for now.
> I wonder if the "pg_" prefix is appropriate here, as that is most often
> used for things that hide specific details *and* where the base name would
> clash, like OS calls or C library functions. I'm not quite sure where the
> line is drawn, but I mean that "linearsearch" is a generic algorithm and
> not a specific API we are implementing, if that makes sense.
Yeah, I was concerned about clashing with lsearch() and lfind().  I will
drop the prefix.
> The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(),
> pg_popcount32, etc). We'll likely want to search bytes sometime, so
> something to keep in mind as far as naming ("_int" vs "_byte"?).
How about something like lsearch32 or linearsearch32?
> I'm not a fan of "its" as a variable name, and I'm curious what it's
> intended to convey.
It's short for "iterations."  I'll spell it out completely to avoid this
kind of confusion.
> All the __m128i vars could technically be declared const, although I think
> it doesn't matter -- it's just a hint to the reader.
Will do.
> Out of curiosity do we know how much we get by loading four registers
> rather than two?
The small program I've been using for testing takes about 40% more time
with the two register approach.  The majority of this test involves
searching for elements that either don't exist in the array or that live
near the end of the array, so this is probably close to the worst case.
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		On Fri, Aug 5, 2022 at 5:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> > Were you considering adding the new function to simd.h now that that's
> > committed? It's a bit up in the air what should go in there, but this new
> > function is low-level and generic enough to be a candidate...
>
> I don't have a strong opinion. I went with a separate file because I
> envisioned a variety of possible linear search functions (e.g., char,
> uint16, uint32), and some might not use SIMD instructions. Futhermore, it
> seemed less obvious to look in simd.h for linear search functions.
That is a good point. Maybe potential helpers in simd.h should only deal specifically with vector registers, with it's users providing C fallbacks. I don't have any good ideas of where to put the new function, though.
> > I wonder if the "pg_" prefix is appropriate here, as that is most often
> > used for things that hide specific details *and* where the base name would
> > clash, like OS calls or C library functions. I'm not quite sure where the
> > line is drawn, but I mean that "linearsearch" is a generic algorithm and
> > not a specific API we are implementing, if that makes sense.
>
> Yeah, I was concerned about clashing with lsearch() and lfind(). I will
> drop the prefix.
Hmm, I didn't know about those. lfind() is similar enough that it would make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at least for the v4 version that returns the pointer. We already declare bsearch_arg() in src/include/port.h and that's another kind of array search. Returning bool is different enough to have a different name. pg_lfind32_ispresent()? *_noptr()? Meh.
Having said all that, the man page under BUGS [1] says "The naming is unfortunate."
> > Out of curiosity do we know how much we get by loading four registers
> > rather than two?
>
> The small program I've been using for testing takes about 40% more time
> with the two register approach. The majority of this test involves
> searching for elements that either don't exist in the array or that live
> near the end of the array, so this is probably close to the worst case.
Ok, sounds good.
[1] https://man7.org/linux/man-pages/man3/lsearch.3.html#BUGS
--
John Naylor
EDB: http://www.enterprisedb.com
			
		>
> On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> > Were you considering adding the new function to simd.h now that that's
> > committed? It's a bit up in the air what should go in there, but this new
> > function is low-level and generic enough to be a candidate...
>
> I don't have a strong opinion. I went with a separate file because I
> envisioned a variety of possible linear search functions (e.g., char,
> uint16, uint32), and some might not use SIMD instructions. Futhermore, it
> seemed less obvious to look in simd.h for linear search functions.
That is a good point. Maybe potential helpers in simd.h should only deal specifically with vector registers, with it's users providing C fallbacks. I don't have any good ideas of where to put the new function, though.
> > I wonder if the "pg_" prefix is appropriate here, as that is most often
> > used for things that hide specific details *and* where the base name would
> > clash, like OS calls or C library functions. I'm not quite sure where the
> > line is drawn, but I mean that "linearsearch" is a generic algorithm and
> > not a specific API we are implementing, if that makes sense.
>
> Yeah, I was concerned about clashing with lsearch() and lfind(). I will
> drop the prefix.
Hmm, I didn't know about those. lfind() is similar enough that it would make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at least for the v4 version that returns the pointer. We already declare bsearch_arg() in src/include/port.h and that's another kind of array search. Returning bool is different enough to have a different name. pg_lfind32_ispresent()? *_noptr()? Meh.
Having said all that, the man page under BUGS [1] says "The naming is unfortunate."
> > Out of curiosity do we know how much we get by loading four registers
> > rather than two?
>
> The small program I've been using for testing takes about 40% more time
> with the two register approach. The majority of this test involves
> searching for elements that either don't exist in the array or that live
> near the end of the array, so this is probably close to the worst case.
Ok, sounds good.
[1] https://man7.org/linux/man-pages/man3/lsearch.3.html#BUGS
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Aug 05, 2022 at 11:02:15AM +0700, John Naylor wrote: > That is a good point. Maybe potential helpers in simd.h should only deal > specifically with vector registers, with it's users providing C fallbacks. > I don't have any good ideas of where to put the new function, though. I moved it to src/include/port for now since that's where files like pg_bswap.h live. > Hmm, I didn't know about those. lfind() is similar enough that it would > make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at > least for the v4 version that returns the pointer. We already declare > bsearch_arg() in src/include/port.h and that's another kind of array > search. Returning bool is different enough to have a different name. > pg_lfind32_ispresent()? *_noptr()? Meh. > > Having said all that, the man page under BUGS [1] says "The naming is > unfortunate." I went ahead and renamed it to pg_lfind32() and switched it back to returning the pointer. That felt the cleanest from the naming perspective, but as Andres noted, it might not be as fast as just looking for the presence of the element. I modified my small testing program to perform many searches on small arrays, and I wasn't able to identify any impact, so perhaps thіs is good enough. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
Hi, On 2022-08-05 13:25:10 -0700, Nathan Bossart wrote: > I went ahead and renamed it to pg_lfind32() and switched it back to > returning the pointer. That felt the cleanest from the naming perspective, > but as Andres noted, it might not be as fast as just looking for the > presence of the element. I modified my small testing program to perform > many searches on small arrays, and I wasn't able to identify any impact, so > perhaps thіs is good enough. Why on small arrays? I'd expect a difference mainly if it there's at least a few iterations. But mainly I'd expect to find a difference if the SIMD code were optimized a further on the basis of not needing to return the offset. E.g. by replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper. - Andres
On Fri, Aug 05, 2022 at 03:04:34PM -0700, Andres Freund wrote:
> But mainly I'd expect to find a difference if the SIMD code were optimized a
> further on the basis of not needing to return the offset. E.g. by
> replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper.
I haven't been able to find a significant difference between the two.  If
anything, the _mm_packs_epi* approach actually seems to be slightly faster
in some cases.  For something marginally more concrete, I compared the two
in perf-top and saw the following for the relevant instructions:
_mm_packs_epi*:
    0.19 │       packssdw   %xmm1,%xmm0
    0.62 │       packssdw   %xmm1,%xmm0
    7.14 │       packsswb   %xmm1,%xmm0
_mm_or_si128:
    1.52 │       por        %xmm1,%xmm0
    2.05 │       por        %xmm1,%xmm0
    5.66 │       por        %xmm1,%xmm0
I also tried a combined approach where I replaced _mm_packs_epi16 with
_mm_or_si128:
    1.16 │       packssdw   %xmm1,%xmm0
    1.47 │       packssdw   %xmm1,%xmm0
    8.17 │       por        %xmm1,%xmm0
Of course, this simplistic analysis leaves out the impact of the
surrounding instructions, but it seems to support the idea that the
_mm_packs_epi* approach might have a slight edge.
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		On Sat, Aug 06, 2022 at 11:13:26AM -0700, Nathan Bossart wrote: > On Fri, Aug 05, 2022 at 03:04:34PM -0700, Andres Freund wrote: >> But mainly I'd expect to find a difference if the SIMD code were optimized a >> further on the basis of not needing to return the offset. E.g. by >> replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper. > > I haven't been able to find a significant difference between the two. If > anything, the _mm_packs_epi* approach actually seems to be slightly faster > in some cases. For something marginally more concrete, I compared the two > in perf-top and saw the following for the relevant instructions: Nevermind, I'm wrong. When compiled with -O2, it uses more than just the xmm0 and xmm1 registers, and the _mm_or_si128 approach consistently shows a speedup of slightly more than 5%. Patches attached. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > [v8] Okay, I think it's basically in good shape. Since it should be a bit faster than a couple versions ago, would you be up for retesting with the original test having 8 to 512 writers? And also add the const markers we discussed upthread? Aside from that, I plan to commit this week unless there is further bikeshedding. -- John Naylor EDB: http://www.enterprisedb.com
On Mon, Aug 8, 2022 at 12:17 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > [v8] > > Okay, I think it's basically in good shape. Since it should be a bit > faster than a couple versions ago, would you be up for retesting with > the original test having 8 to 512 writers? And also add the const > markers we discussed upthread? Aside from that, I plan to commit this > week unless there is further bikeshedding. I quickly reviewed v8 patch set, few comments: 1) pg_lfind32 - why just uint32? If it's not possible to define functions for char, unsigned char, int16, uint16, int32, int64, uint64 and so on, can we add a few comments around that? Also, the comments can talk about if the base type or element data type of array or data type of key matters to use pg_lfind32. 2) I think this is not just for the remaining elements but also for non-USE_SSE2 cases. Also, please specify in which cases we reach here for USE_SSE2 cases. + /* Process the remaining elements the slow way. */ 3) Can pg_lfind32 return the index of the key found, for instance to use it for setting/resetting the found element in the array? + * pg_lfind32 + * + * Returns true if there is an element in 'base' that equals 'key'. Otherwise, + * returns false. + */ +static inline bool +pg_lfind32(uint32 key, uint32 *base, uint32 nelem) 4) Can we, right away, use this API to replace linear search, say, in SimpleLruReadPage_ReadOnly(), ATExecAttachPartitionIdx(), AfterTriggerSetState()? I'm sure I might be missing other places, but can we replace the possible found areas with the new function? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > > 1) pg_lfind32 - why just uint32? If it's not possible to define > functions for char, unsigned char, int16, uint16, int32, int64, uint64 > and so on, can we add a few comments around that? Also, the comments Future work, as far as I'm concerned. I'm interested in using a char version for json strings. > 3) Can pg_lfind32 return the index of the key found, for instance to > use it for setting/resetting the found element in the array? That was just discussed. It's slightly faster not to return an index. -- John Naylor EDB: http://www.enterprisedb.com
On Mon, Aug 8, 2022 at 2:30 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > 3) Can pg_lfind32 return the index of the key found, for instance to > > use it for setting/resetting the found element in the array? > > That was just discussed. It's slightly faster not to return an index. I haven't looked upthread, please share the difference. How about another version of the function that returns the index too? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> Okay, I think it's basically in good shape. Since it should be a bit
> faster than a couple versions ago, would you be up for retesting with
> the original test having 8 to 512 writers?
Sure thing.  The results are similar.  As before, the improvements are most
visible when the arrays are large.
    writers  head  patch
    8        672   680
    16       639   664
    32       701   689
    64       705   703
    128      628   653
    256      576   627
    512      530   584
    768      450   536
    1024     350   494
> And also add the const
> markers we discussed upthread?
Oops, sorry about that.  This is done in v9.
> Aside from that, I plan to commit this
> week unless there is further bikeshedding.
Great, thanks.
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
			
		Вложения
On Mon, Aug 08, 2022 at 12:56:28PM +0530, Bharath Rupireddy wrote: > 1) pg_lfind32 - why just uint32? If it's not possible to define > functions for char, unsigned char, int16, uint16, int32, int64, uint64 > and so on, can we add a few comments around that? Also, the comments > can talk about if the base type or element data type of array or data > type of key matters to use pg_lfind32. I figured that we'd add functions for other types when needed. I considered making the new function generic by adding an argument for the element size. Then, we could branch to optimized routines based on the element size (e.g., pg_lfind() would call pg_lfind32() if the element size was 4 bytes). However, that seemed like more complexity than is required, and it's probably nice to avoid the extra branching. > 2) I think this is not just for the remaining elements but also for > non-USE_SSE2 cases. Also, please specify in which cases we reach here > for USE_SSE2 cases. > + /* Process the remaining elements the slow way. */ Well, in the non-SSE2 case, all of the elements are remaining at this point. :) > 3) Can pg_lfind32 return the index of the key found, for instance to > use it for setting/resetting the found element in the array? As discussed upthread, only returning whether the element is present in the array is slightly faster. If we ever needed a version that returned the address of the matching element, we could reevaluate whether this small boost was worth creating a separate function or if we should just modify pg_lfind32() to be a tad slower. I don't think we need to address that now, though. > 4) Can we, right away, use this API to replace linear search, say, in > SimpleLruReadPage_ReadOnly(), ATExecAttachPartitionIdx(), > AfterTriggerSetState()? I'm sure I might be missing other places, but > can we replace the possible found areas with the new function? I had found a few eligible linear searches earlier [0], but I haven't done any performance analysis that proved such changes were worthwhile. While substituting linear searches with pg_lfind32() is probably an improvement in most cases, I think we ought to demonstrate the benefits for each one. [0] https://postgr.es/m/20220802221301.GA742739%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Aug 9, 2022 at 7:33 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote: > > Okay, I think it's basically in good shape. Since it should be a bit > > faster than a couple versions ago, would you be up for retesting with > > the original test having 8 to 512 writers? > > Sure thing. The results are similar. As before, the improvements are most > visible when the arrays are large. > > writers head patch > 8 672 680 > 16 639 664 > 32 701 689 > 64 705 703 > 128 628 653 > 256 576 627 > 512 530 584 > 768 450 536 > 1024 350 494 > > > And also add the const > > markers we discussed upthread? > > Oops, sorry about that. This is done in v9. > > > Aside from that, I plan to commit this > > week unless there is further bikeshedding. > > Great, thanks. The patch looks good to me. One minor point is: + * IDENTIFICATION + * src/port/pg_lfind.h The path doesn't match to the actual file path, src/include/port/pg_lfind.h. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Aug 09, 2022 at 10:57:44AM +0900, Masahiko Sawada wrote: > The patch looks good to me. One minor point is: Thanks for taking a look. > + * IDENTIFICATION > + * src/port/pg_lfind.h > > The path doesn't match to the actual file path, src/include/port/pg_lfind.h. Fixed in v10. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
On Tue, Aug 9, 2022 at 4:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Aug 08, 2022 at 12:56:28PM +0530, Bharath Rupireddy wrote: > > 1) pg_lfind32 - why just uint32? If it's not possible to define > > functions for char, unsigned char, int16, uint16, int32, int64, uint64 > > and so on, can we add a few comments around that? Also, the comments > > can talk about if the base type or element data type of array or data > > type of key matters to use pg_lfind32. > > I figured that we'd add functions for other types when needed. I > considered making the new function generic by adding an argument for the > element size. Then, we could branch to optimized routines based on the > element size (e.g., pg_lfind() would call pg_lfind32() if the element size > was 4 bytes). However, that seemed like more complexity than is required, > and it's probably nice to avoid the extra branching. > > > 3) Can pg_lfind32 return the index of the key found, for instance to > > use it for setting/resetting the found element in the array? > > As discussed upthread, only returning whether the element is present in the > array is slightly faster. If we ever needed a version that returned the > address of the matching element, we could reevaluate whether this small > boost was worth creating a separate function or if we should just modify > pg_lfind32() to be a tad slower. I don't think we need to address that > now, though. Isn't it a good idea to capture the above two points as comments in port/pg_lfind.h just to not lose track of it? I know these are present in the hackers thread, but having them in the form of comments helps developers who attempt to change or use the new function. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Tue, Aug 09, 2022 at 09:40:15AM +0530, Bharath Rupireddy wrote: > Isn't it a good idea to capture the above two points as comments in > port/pg_lfind.h just to not lose track of it? I know these are present > in the hackers thread, but having them in the form of comments helps > developers who attempt to change or use the new function. Hm. My first impression is that this is exactly the sort of information that is better captured on the lists. I'm not sure that the lack of such commentary really poses any threat for future changes, which would need to be judged on their own merit, anyway. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Aug 09, 2022 at 09:40:15AM +0530, Bharath Rupireddy wrote:
>> Isn't it a good idea to capture the above two points as comments in
>> port/pg_lfind.h just to not lose track of it? I know these are present
>> in the hackers thread, but having them in the form of comments helps
>> developers who attempt to change or use the new function.
> Hm.  My first impression is that this is exactly the sort of information
> that is better captured on the lists.  I'm not sure that the lack of such
> commentary really poses any threat for future changes, which would need to
> be judged on their own merit, anyway.
It's clearly unproductive (not to say impossible) to enumerate every
possible alternative design and say why you didn't choose it.  If
there's some particular "obvious" choice that you feel a need to
refute, then sure write a comment about that.  Notably, if we used
to do X and now do Y because X was found to be broken, then it's good
to have a comment trail discouraging future hackers from reinventing
X.  But that doesn't lead to needing comments about an unrelated
option Z.
            regards, tom lane
			
		On Tue, Aug 9, 2022 at 10:51 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Fixed in v10. I decided I wasn't quite comfortable changing snapshot handling without further guarantees. To this end, 0002 in the attached v11 is an addendum that adds assert checking (also pgindent and some comment-smithing). As I suspected, make check-world passes even with purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and I verified that sticking in the wrong answer will lead to a crash in assert-enabled builds in short order. I'd kind of like to throw this (or something else suitable) at the build farm first for that reason. It's simpler than the qsort/qunique/binary search that was there before, so that's nice, but I've not tried to test performance. -- John Naylor EDB: http://www.enterprisedb.com
Вложения
On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote: > I decided I wasn't quite comfortable changing snapshot handling > without further guarantees. To this end, 0002 in the attached v11 is > an addendum that adds assert checking (also pgindent and some > comment-smithing). As I suspected, make check-world passes even with > purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and > I verified that sticking in the wrong answer will lead to a crash in > assert-enabled builds in short order. I'd kind of like to throw this > (or something else suitable) at the build farm first for that reason. > It's simpler than the qsort/qunique/binary search that was there > before, so that's nice, but I've not tried to test performance. Your adjustments in 0002 seem reasonable to me. I think it makes sense to ensure there is test coverage for pg_lfind32(), but I don't know if that syscache code is the right choice. For non-USE_SSE2 builds, it might make these lookups more expensive. I'll look around to see if there are any other suitable candidates. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Aug 09, 2022 at 01:00:37PM -0700, Nathan Bossart wrote: > Your adjustments in 0002 seem reasonable to me. I think it makes sense to > ensure there is test coverage for pg_lfind32(), but I don't know if that > syscache code is the right choice. For non-USE_SSE2 builds, it might make > these lookups more expensive. I'll look around to see if there are any > other suitable candidates. One option might be to create a small test module for pg_lfind32(). Here is an example. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
On Wed, Aug 10, 2022 at 5:00 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote: > > I decided I wasn't quite comfortable changing snapshot handling > > without further guarantees. To this end, 0002 in the attached v11 is > > an addendum that adds assert checking (also pgindent and some > > comment-smithing). As I suspected, make check-world passes even with > > purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and > > I verified that sticking in the wrong answer will lead to a crash in > > assert-enabled builds in short order. I'd kind of like to throw this > > (or something else suitable) at the build farm first for that reason. > > It's simpler than the qsort/qunique/binary search that was there > > before, so that's nice, but I've not tried to test performance. > > Your adjustments in 0002 seem reasonable to me. I think it makes sense to > ensure there is test coverage for pg_lfind32(), but I don't know if that > syscache code is the right choice. For non-USE_SSE2 builds, it might make > these lookups more expensive. I think that for non-USE_SSE2 builds, there is no additional overhead as all assertion-related code in pg_lfind32 depends on USE_SSE2. > I'll look around to see if there are any > other suitable candidates. As you proposed, having a test module for that seems to be a good idea. We can add test codes for future optimizations that utilize SIMD operations. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Aug 10, 2022 at 7:13 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Aug 09, 2022 at 01:00:37PM -0700, Nathan Bossart wrote: > > Your adjustments in 0002 seem reasonable to me. I think it makes sense to > > ensure there is test coverage for pg_lfind32(), but I don't know if that > > syscache code is the right choice. For non-USE_SSE2 builds, it might make > > these lookups more expensive. Yeah. On Wed, Aug 10, 2022 at 9:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think that for non-USE_SSE2 builds, there is no additional overhead > as all assertion-related code in pg_lfind32 depends on USE_SSE2. Nathan is referring to RelationSupportsSysCache() and RelationHasSysCache(). They currently use binary search and using linear search on non-x86-64 platforms is probably slower. [Nathan again] > One option might be to create a small test module for pg_lfind32(). Here > is an example. LGTM, let's see what the buildfarm thinks of 0001. -- John Naylor EDB: http://www.enterprisedb.com
On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > LGTM, let's see what the buildfarm thinks of 0001. Thanks! I haven't noticed any related buildfarm failures yet. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Aug 11, 2022 at 4:46 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > > LGTM, let's see what the buildfarm thinks of 0001. > > Thanks! I haven't noticed any related buildfarm failures yet. I was waiting for all the Windows animals to report in, and it looks like they have, so I've pushed 0002. Thanks for picking this topic up again! -- John Naylor EDB: http://www.enterprisedb.com
On Thu, Aug 11, 2022 at 09:50:54AM +0700, John Naylor wrote: > I was waiting for all the Windows animals to report in, and it looks > like they have, so I've pushed 0002. Thanks for picking this topic up > again! Thanks for reviewing and committing. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com