Discussion: Optimizing ResourceOwner to speed up COPY
Hi,
While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed some of the COPY queries are spending a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf-record/report).
The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, with O(n^2) behavior with duplicates. This
happens with COPY, because COPY creates an array of 1000 tuple slots,
and each slot references the same tuple descriptor. And the descriptor
is added to ResourceOwner for each slot.
The trouble is that ResourceOwner uses a simple open-addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But each following addition needs to skip over the existing
entries - the n-th addition needs to skip over n-1 entries.
With N entries that means N * (N-1) / 2 probes in total. And then the
same thing happens for deletions, in reverse.
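To make the arithmetic concrete, here's a minimal standalone sketch of that effect (illustrative only, not the actual resowner.c code; the hash function, the Entry type and sketch_add() are made up for the demo):

```
/*
 * Sketch of why duplicates hurt linear probing: every duplicate hashes
 * to the same bucket, so the n-th insert has to step over the n-1
 * entries already stored, i.e. N*(N-1)/2 probes for N duplicates.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CAPACITY 2048                   /* power of two, so we can mask */

typedef struct
{
    uintptr_t   key;
    bool        in_use;
} Entry;

static Entry hash_table[CAPACITY];

static long
sketch_add(uintptr_t key)
{
    uint32_t    idx = (uint32_t) (key * 2654435761u) & (CAPACITY - 1);
    long        probes = 0;

    while (hash_table[idx].in_use)      /* walk the "train" of prior duplicates */
    {
        idx = (idx + 1) & (CAPACITY - 1);
        probes++;
    }
    hash_table[idx].key = key;
    hash_table[idx].in_use = true;
    return probes;
}

int
main(void)
{
    long        total = 0;

    for (int i = 0; i < 1000; i++)
        total += sketch_add(0xdeadbeef);    /* same key, like the shared tupdesc */

    printf("total probes for 1000 duplicates: %ld\n", total);   /* prints 499500 */
    return 0;
}
```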
This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...
There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuple slots, the descriptor will
have just one ResourceElem entry with count=1000.
This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.
The deduplication is "opportunistic" in the sense that you may still end
up with multiple entries for the same value/kind. If the probe finds an
empty slot before reaching the existing entry, it'll use the empty slot.
Also, only the hash table is deduplicated, not the initial array (which
is, however, quite small).
This is a reasonable trade-off because it means it does not get more
expensive for cases with no duplicates, while still improving cases with
many duplicate values.
To demonstrate the benefits, here's a throughput (tps) from a COPY of a
certain number of rows from a file into an UNLOGGED table, with 1, 4 and
8 clients.
   rows  |          master          |          patched
         |      1       4       8  |      1       4       8
---------|--------------------------|-------------------------
     10  | 112090  418421  594905  | 112871  419043  589646
    100  |  35265  121903  168238  |  42536  138578  183740
   1000  |   1450    5700   10725  |   5847   21594   30713
  10000  |    531    1943    2988  |    743    2619    3557
 100000  |     72     244     324  |     76     256     332
Or, as relative throughput:
              patched / master
   rows  |      1       4       8
---------|-------------------------
     10  |   101%    100%     99%
    100  |   121%    114%    109%
   1000  |   403%    379%    286%
  10000  |   140%    135%    119%
 100000  |   106%    105%    102%
Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY - the overhead
grows up to 1000 rows, and beyond that it amortizes over many batches.
It has very little impact for large COPY.
I've done the same test with a logged table. The results significantly
depend on the storage (how fast it can handle commits) - not surprising.
But with good storage the overall behavior is almost exactly the same.
I haven't yet done any benchmarking focused on cases with no/few
duplicates (although the COPY with 10 rows could qualify). But as I
explained earlier, the behavior should not change at all. There's an
extra uint32, but that's it.
regards
--
Tomas Vondra
Attachments
Tomas Vondra <tomas@vondra.me> writes:
> The reason is pretty simple - ResourceOwner tracks the resources in a
> very simple hash table, with O(n^2) behavior with duplicates. This
> happens with COPY, because COPY creates an array of 1000 tuple slots,
> and each slot references the same tuple descriptor. And the descriptor
> is added to ResourceOwner for each slot.
> ...
> There's an easy way to improve this by allowing a single hash entry to
> represent multiple references to the same resource. The attached patch
> adds a "count" to the ResourceElem, tracking how many times that
> resource was added. So if you add 1000 tuple slots, the descriptor will
> have just one ResourceElem entry with count=1000.
Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
that's negligible, or maybe it isn't. Can you find evidence of this
change being helpful for anything except this specific scenario in
COPY? Because we could probably find some way to avoid registering
all the doppelganger slots, if that's the only culprit.
regards, tom lane
On 10/16/25 20:12, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> The reason is pretty simple - ResourceOwner tracks the resources in a
>> very simple hash table, with O(n^2) behavior with duplicates. This
>> happens with COPY, because COPY creates an array of 1000 tuple slots,
>> and each slot references the same tuple descriptor. And the descriptor
>> is added to ResourceOwner for each slot.
>> ...
>> There's an easy way to improve this by allowing a single hash entry to
>> represent multiple references to the same resource. The attached patch
>> adds a "count" to the ResourceElem, tracking how many times that
>> resource was added. So if you add 1000 tuple slots, the descriptor will
>> have just one ResourceElem entry with count=1000.
>
> Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
> that's negligible, or maybe it isn't.

I agree the +50% is not great. My plan was to maybe use a smaller data
type, say uint8 or uint16, and reduce the overhead that way (while still
massively reducing the number of entries). But I didn't realize there
will always be extra padding, making this pointless.

Maybe we could make the struct __packed__, and put the counter last (and
maybe make it uint16)? But I don't see any other packed structs, so I
guess there's a reason not to have them.

I'm not sure if this is (or is not) negligible. In cases with duplicates
this actually saves a lot of memory, so it's a win. I'm not sure about
cases without duplicates - it'll use more memory, but I'm not sure how
many resource owners there usually are. Or how many entries they
contain. That'll probably determine if it's negligible.

My intuition was there wouldn't be that many, and that the typical
number of elements is fairly low (if not, why have the initial capacity
set to 64/128?). But maybe my intuition is wrong.

> Can you find evidence of this change being helpful for anything
> except this specific scenario in COPY?

I went through the ResourceOwnerRemember() calls, looking for other
cases that might create a lot of duplicates, similar to the tuple
descriptors, but I haven't found anything obvious. Other resources seem
to be either naturally unique or limited to very few duplicates.

> Because we could probably find some way to avoid registering all the
> doppelganger slots, if that's the only culprit.

I'm not against other approaches. I didn't want to rework the foundation
of the resource owner management, and this seemed simple, and with
minimal impact on cases without duplicates (except for the struct
getting larger).

thanks

--
Tomas Vondra
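To spell out the padding point, a quick standalone sketch (assuming a typical LP64 platform where Datum and pointers are 8 bytes; the Elem* type names and the scaffolding are made up, but the field layout mirrors the patch):

```
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;                    /* stand-in for the real typedef */
typedef struct ResourceOwnerDesc ResourceOwnerDesc;

typedef struct
{
    Datum       item;                       /* 8 bytes */
    const ResourceOwnerDesc *kind;          /* 8 bytes */
} ElemMaster;                               /* 16 bytes */

typedef struct
{
    Datum       item;                       /* 8 bytes */
    uint8_t     count;                      /* 1 byte ... */
    const ResourceOwnerDesc *kind;          /* ... but padded to the next 8-byte boundary */
} ElemUint8;                                /* still 24 bytes */

typedef struct
{
    Datum       item;
    uint32_t    count;
    const ResourceOwnerDesc *kind;
} ElemUint32;                               /* 24 bytes */

typedef struct
{
    Datum       item;
    int64_t     count;                      /* a wider counter costs nothing extra */
    const ResourceOwnerDesc *kind;
} ElemInt64;                                /* 24 bytes as well */

int
main(void)
{
    /* On a typical LP64 machine this prints: 16 24 24 24 */
    printf("%zu %zu %zu %zu\n", sizeof(ElemMaster), sizeof(ElemUint8),
           sizeof(ElemUint32), sizeof(ElemInt64));
    return 0;
}
```

So without packing, a uint8 or uint16 counter ends up costing the same 8 bytes per element as a uint32 (or even an int64, which comes up later in the thread).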
Tomas Vondra <tomas@vondra.me> writes:
> On 10/16/25 20:12, Tom Lane wrote:
>> Can you find evidence of this change being helpful for anything
>> except this specific scenario in COPY?
> I went through the ResourceOwnerRemember() calls, looking for other
> cases that might create a lot of duplicates, similar to the tuple
> descriptors, but I haven't found anything obvious. Other resources seem
> to be either naturally unique or limited to very few duplicates.
I was thinking of adding some temporary instrumentation, like
just elog'ing whenever the count goes above 1, and seeing where
you get hits during the regression tests. I'm prepared to believe
this is worth doing, but it'd be nice to have more examples
in mind.
regards, tom lane
On 10/16/25 21:28, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 10/16/25 20:12, Tom Lane wrote:
>>> Can you find evidence of this change being helpful for anything
>>> except this specific scenario in COPY?
>
>> I went through the ResourceOwnerRemember() calls, looking for other
>> cases that might create a lot of duplicates, similar to the tuple
>> descriptors, but I haven't found anything obvious. Other resources seem
>> to be either naturally unique or limited to very few duplicates.
>
> I was thinking of adding some temporary instrumentation, like
> just elog'ing whenever the count goes above 1, and seeing where
> you get hits during the regression tests. I'm prepared to believe
> this is worth doing, but it'd be nice to have more examples
> in mind.

I tried that, and that gives me ~30k log messages with (count > 1). But
that's a bit misleading, because a lot of that are the same "thing"
going from 1 to N, which produces N messages.

If I subtract all the COPY statements, loading data for regression
tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
3, mostly simple queries. Even a simple "\d t" produces a bunch of such
messages.

test=# \d t
WARNING: RESOURCEOWNER: snapshot reference 0x2e3787b0 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae1302fba8 resource owner Portal count 2
WARNING: RESOURCEOWNER: tupdesc reference 0x79ae1302fec8 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource owner Portal count 3
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034d88 resource owner Portal count 2
WARNING: RESOURCEOWNER: buffer pin 0x4a resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource owner Portal count 4
WARNING: RESOURCEOWNER: buffer pin 0xa resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae12dca6d0 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae1303aff8 resource owner Portal count 2

There are some more extreme ones too. For example

select infinite_recurse();

produces

WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner Portal count 1340

Another example is CREATE TABLE, which creates a batch of slots when
inserting attributes in InsertPgAttributeTuples, so that'll end up with
the count = number of attributes. Of course, those are not particularly
frequent operations. Most applications are not doing CREATE TABLE
nearly as often as DML.

But I had another idea - see how large the ResourceOwners get, which
would tell us how much "overhead" it really is. So I added logging into
ResourceOwnerDelete (without the patch), and with that regression tests
produce 113916 messages. And 113289 have the initial capacity 32, so
array only. From the remaining ~600, only 72 have capacity over 64.

So I guess the overhead should not be that bad. Actually, it would be
possible to completely eliminate the overhead for the array, because
that does not actually need the count at all.

regards

--
Tomas Vondra
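For reference, instrumentation of roughly this shape would produce WARNING lines like the ones above (a hypothetical sketch, not necessarily the exact code used; kind->name and owner->name are real fields in resowner.c, but the exact placement inside ResourceOwnerAddToHash() and the message wording are assumptions):

```
/*
 * Hypothetical instrumentation sketch: emit a WARNING whenever a hash
 * entry's count goes above 1, placed right after the counter is bumped
 * in ResourceOwnerAddToHash().
 */
if (owner->hash[idx].count > 1)
    elog(WARNING, "RESOURCEOWNER: %s %p resource owner %s count %u",
         kind->name,
         (void *) DatumGetPointer(value),
         owner->name,
         owner->hash[idx].count);
```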
Tomas Vondra <tomas@vondra.me> writes:
> On 10/16/25 21:28, Tom Lane wrote:
>> I was thinking of adding some temporary instrumentation, like
>> just elog'ing whenever the count goes above 1, and seeing where
>> you get hits during the regression tests. I'm prepared to believe
>> this is worth doing, but it'd be nice to have more examples
>> in mind.
> I tried that, and that gives me ~30k log messages with (count > 1). But
> that's a bit misleading, because a lot of that are the same "thing"
> going from 1 to N, which produces N messages.
> If I subtract all the COPY statements, loading data for regression
> tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
> 3, mostly simple queries. Even a simple "\d t" produces a bunch of such
> messages.
Okay, so we definitely do have other cases where the count will be
more than 1.
> There are some more extreme ones too. For example
> select infinite_recurse();
> produces
> WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
> Portal count 1340
Makes me wonder if we shouldn't make the count int64, just to remove
all worries about overflow. That'd be free on 64-bit machines ...
> But I had another idea - see how large the ResourceOwners get, which
> would tell us how much "overhead" it really is. So I added logging into
> ResourceOwnerDelete (without the patch), and with that regression tests
> produce 113916 messages. And 113289 have the initial capacity 32, so
> array only. From the remaining ~600, only 72 have capacity over 64.
Ah, that's fairly convincing. Seems like we can move ahead with this.
regards, tom lane
On 10/17/25 00:17, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 10/16/25 21:28, Tom Lane wrote:
>>> I was thinking of adding some temporary instrumentation, like
>>> just elog'ing whenever the count goes above 1, and seeing where
>>> you get hits during the regression tests. I'm prepared to believe
>>> this is worth doing, but it'd be nice to have more examples
>>> in mind.
>
>> I tried that, and that gives me ~30k log messages with (count > 1). But
>> that's a bit misleading, because a lot of that are the same "thing"
>> going from 1 to N, which produces N messages.
>
>> If I subtract all the COPY statements, loading data for regression
>> tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
>> 3, mostly simple queries. Even a simple "\d t" produces a bunch of such
>> messages.
>
> Okay, so we definitely do have other cases where the count will be
> more than 1.
>
>> There are some more extreme ones too. For example
>> select infinite_recurse();
>> produces
>> WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
>> Portal count 1340
>
> Makes me wonder if we shouldn't make the count int64, just to remove
> all worries about overflow. That'd be free on 64-bit machines ...

We may do that, if it's free. But I doubt there's any risk of overflow
in practice. If we really wanted to support that many entries, we
wouldn't be allocating the hash table using MemoryContextAllocZero().
That puts the maximum number of unique entries at ~50M.

>> But I had another idea - see how large the ResourceOwners get, which
>> would tell us how much "overhead" it really is. So I added logging into
>> ResourceOwnerDelete (without the patch), and with that regression tests
>> produce 113916 messages. And 113289 have the initial capacity 32, so
>> array only. From the remaining ~600, only 72 have capacity over 64.
>
> Ah, that's fairly convincing. Seems like we can move ahead with this.

Thanks for the feedback. I'll let it sit for a while, there's no rush to
get this committed.

regards

--
Tomas Vondra
> On Oct 17, 2025, at 01:46, Tomas Vondra <tomas@vondra.me> wrote:
>
> --
> Tomas Vondra<v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch>
Nice patch! I eyeball-reviewed the patch and have only a few small comments:
1
```
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /* found an exact match - just increment the counter */
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
```
I think this is the “trade-off” you mention in your email: if a free slot is found earlier, you still get duplicated
entries. I have no concern with this “trade-off”, but please add a comment here, otherwise future readers may be confused
and might consider it a bug, without reading your original design email.
2
```
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
```
Nit suggestion: people usually name this type of count “refcnt”, which looks more meaningful. But I don't have a
strong opinion here. I am absolutely fine if you don't want to change it.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 17/10/2025 06:13, Chao Li wrote:
> ```
> @@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
> idx = hash_resource_elem(value, kind) & mask;
> for (;;)
> {
> + /* found an exact match - just increment the counter */
> + if ((owner->hash[idx].kind == kind) &&
> + (owner->hash[idx].item == value))
> + {
> + owner->hash[idx].count += count;
> + return;
> + }
> +
> if (owner->hash[idx].kind == NULL)
> break; /* found a free slot */
> idx = (idx + 1) & mask;
> }
> ```
>
> I think this is the “trade-off” you mention in your email: if a free slot is found earlier, you still get duplicated
> entries. I have no concern with this “trade-off”, but please add a comment here, otherwise future readers may be
> confused and might consider it a bug, without reading your original design email.
+1 on such a comment. Here or maybe on the ResourceElem struct itself.
> typedef struct ResourceElem
> {
> Datum item;
> + uint32 count; /* number of occurrences */
> const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
> } ResourceElem;
Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.
> /*
> * Forget that an object is owned by a ResourceOwner
> *
> * Note: If same resource ID is associated with the ResourceOwner more than
> * once, one instance is removed.
> *
> * Note: Forgetting a resource does not guarantee that there is room to
> * remember a new resource. One exception is when you forget the most
> * recently remembered resource; that does make room for a new remember call.
> * Some code callers rely on that exception.
> */
> void
> ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
Does this break the exception noted in the above comment? I guess it
still holds because we don't deduplicate entries in the array. That's
very subtle, needs a comment at least.
typo: ocurrence -> occurrence
- Heikki
On 10/17/25 10:31, Heikki Linnakangas wrote:
> On 17/10/2025 06:13, Chao Li wrote:
>> ```
>> @@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner,
>> Datum value, const ResourceOwnerDesc
>> idx = hash_resource_elem(value, kind) & mask;
>> for (;;)
>> {
>> + /* found an exact match - just increment the counter */
>> + if ((owner->hash[idx].kind == kind) &&
>> + (owner->hash[idx].item == value))
>> + {
>> + owner->hash[idx].count += count;
>> + return;
>> + }
>> +
>> if (owner->hash[idx].kind == NULL)
>> break; /* found a free slot */
>> idx = (idx + 1) & mask;
>> }
>> ```
>>
>> I think this is the “trade-off” you mention in your email: if a free
>> slot found earlier, then still gets duplicated entries. I have no
>> concern to this “trade-off”, but please add a comment here, otherwise
>> future readers may be confused, and might potentially consider that
>> were a bug, without reading your original design email.
>
> +1 on such a comment. Here or maybe on the ResourceElem struct itself.
>
Will do. It's probably appropriate to mention this deduplication in the
comment at the beginning of the file.
>> typedef struct ResourceElem
>> {
>> Datum item;
>> + uint32 count; /* number of occurrences */
>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>> table slot */
>> } ResourceElem;
>
> Hmm, the 'count' is not used when the entry is stored in the array.
> Perhaps we should have a separate struct for array and hash elements
> now. Keeping the array small helps it to fit in CPU caches.
>
Agreed. I had the same idea yesterday, but I haven't done it yet.
>> /*
>> * Forget that an object is owned by a ResourceOwner
>> *
>> * Note: If same resource ID is associated with the ResourceOwner more
>> than
>> * once, one instance is removed.
>> *
>> * Note: Forgetting a resource does not guarantee that there is room to
>> * remember a new resource. One exception is when you forget the most
>> * recently remembered resource; that does make room for a new
>> remember call.
>> * Some code callers rely on that exception.
>> */
>> void
>> ResourceOwnerForget(ResourceOwner owner, Datum value, const
>> ResourceOwnerDesc *kind)
>
> Does this break the exception noted in the above comment? I guess it
> still holds because we don't deduplicate entries in the array. That's
> very subtle, needs a comment at least.
>
Right, it doesn't break the exception - as you say, it only applies to
the hash, not to the array. I'll add a comment.
Thanks
--
Tomas Vondra
On 10/17/25 12:32, Tomas Vondra wrote:
>
>
> On 10/17/25 10:31, Heikki Linnakangas wrote:
>> On 17/10/2025 06:13, Chao Li wrote:
>>> ```
>>> @@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner,
>>> Datum value, const ResourceOwnerDesc
>>> idx = hash_resource_elem(value, kind) & mask;
>>> for (;;)
>>> {
>>> + /* found an exact match - just increment the counter */
>>> + if ((owner->hash[idx].kind == kind) &&
>>> + (owner->hash[idx].item == value))
>>> + {
>>> + owner->hash[idx].count += count;
>>> + return;
>>> + }
>>> +
>>> if (owner->hash[idx].kind == NULL)
>>> break; /* found a free slot */
>>> idx = (idx + 1) & mask;
>>> }
>>> ```
>>>
>>> I think this is the “trade-off” you mention in your email: if a free
>>> slot found earlier, then still gets duplicated entries. I have no
>>> concern to this “trade-off”, but please add a comment here, otherwise
>>> future readers may be confused, and might potentially consider that
>>> were a bug, without reading your original design email.
>>
>> +1 on such a comment. Here or maybe on the ResourceElem struct itself.
>>
>
> Will do. It's probably appropriate to mention this deduplication in the
> comment at the beginning of the file.
>
I added a note to the comment at the top of the file, and clarified the
comment where the hash table entry is added. But I tried not to be too
verbose there.
>>> typedef struct ResourceElem
>>> {
>>> Datum item;
>>> + uint32 count; /* number of occurrences */
>>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>>> table slot */
>>> } ResourceElem;
>>
>> Hmm, the 'count' is not used when the entry is stored in the array.
>> Perhaps we should have a separate struct for array and hash elements
>> now. Keeping the array small helps it to fit in CPU caches.
>>
>
> Agreed. I had the same idea yesterday, but I haven't done it yet.
>
The attached v2 does that - it adds a separate ResourceHashElem for the
hash table, and it works. But I'm not sure I like it very much, because
two places relied on the array and the hash table using the same struct,
so both could be walked the same way.
For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
now duplicates most of the code. It's not terrible, but also not pretty.
I can't think of a better way, though.
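For readers following along, the split looks roughly like this (a sketch of the v2 shape; the attached patch is the authoritative version):

```
/* Array entries stay small - no counter, so the array keeps fitting in
 * CPU caches. */
typedef struct ResourceElem
{
    Datum       item;
    const ResourceOwnerDesc *kind;      /* NULL means a free slot */
} ResourceElem;

/* Hash table entries carry the reference count used for deduplication. */
typedef struct ResourceHashElem
{
    Datum       item;
    uint32      count;                  /* number of remembered references */
    const ResourceOwnerDesc *kind;      /* NULL means a free hash table slot */
} ResourceHashElem;
```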
>>> /*
>>> * Forget that an object is owned by a ResourceOwner
>>> *
>>> * Note: If same resource ID is associated with the ResourceOwner more
>>> than
>>> * once, one instance is removed.
>>> *
>>> * Note: Forgetting a resource does not guarantee that there is room to
>>> * remember a new resource. One exception is when you forget the most
>>> * recently remembered resource; that does make room for a new
>>> remember call.
>>> * Some code callers rely on that exception.
>>> */
>>> void
>>> ResourceOwnerForget(ResourceOwner owner, Datum value, const
>>> ResourceOwnerDesc *kind)
>>
>> Does this break the exception noted in the above comment? I guess it
>> still holds because we don't deduplicate entries in the array. That's
>> very subtle, needs a comment at least.
>>
>
> Right, it doesn't break the exception - as you say, it only applies to
> the hash, not to the array. I'll add a comment.
>
Comment added.
regards
--
Tomas Vondra
Attachments
On 18/10/2025 01:49, Tomas Vondra wrote:
> On 10/17/25 12:32, Tomas Vondra wrote:
>>
>>
>> On 10/17/25 10:31, Heikki Linnakangas wrote:
>>>> typedef struct ResourceElem
>>>> {
>>>> Datum item;
>>>> + uint32 count; /* number of occurrences */
>>>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>>>> table slot */
>>>> } ResourceElem;
>>>
>>> Hmm, the 'count' is not used when the entry is stored in the array.
>>> Perhaps we should have a separate struct for array and hash elements
>>> now. Keeping the array small helps it to fit in CPU caches.
>>
>> Agreed. I had the same idea yesterday, but I haven't done it yet.
>
> The attached v2 does that - it adds a separate ResourceHashElem for the
> hash table, and it works. But I'm not sure I like it very much, because
> there are two places that relied on both the array and hash table using
> the same struct to "walk" it the same way.
>
> For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
> now duplicates most of the code. It's not terrible, but also not pretty.
> I can't think of a better way, though.
Looks fine to me. The code duplication is not too bad IMO.
Here's a rebased version of the micro-benchmark I used when I was
working on the ResourceOwner refactoring
(https://www.postgresql.org/message-id/d746cead-a1ef-7efe-fb47-933311e876a3%40iki.fi).
I ran it again on my laptop. It's a different machine from the one I
used back then, so the results are not comparable with those from that
old thread.
Unpatched (commit 18d26140934):
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.6 | 11.1
0 | 5 | 12.1 | 13.1
0 | 10 | 12.3 | 13.5
0 | 60 | 14.6 | 19.4
0 | 70 | 16.0 | 18.1
0 | 100 | 16.7 | 18.0
0 | 1000 | 18.1 | 20.7
0 | 10000 | 21.9 | 29.5
9 | 10 | 11.0 | 11.1
9 | 100 | 14.9 | 20.0
9 | 1000 | 16.1 | 24.4
9 | 10000 | 21.9 | 25.7
65 | 70 | 11.7 | 12.5
65 | 100 | 13.9 | 14.8
65 | 1000 | 16.7 | 17.8
65 | 10000 | 22.5 | 27.8
(16 rows)
v2-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 10.8 | 10.6
0 | 5 | 11.5 | 12.3
0 | 10 | 12.1 | 13.0
0 | 60 | 13.9 | 19.4
0 | 70 | 15.9 | 18.7
0 | 100 | 16.0 | 18.5
0 | 1000 | 19.2 | 21.6
0 | 10000 | 22.4 | 29.0
9 | 10 | 11.2 | 11.3
9 | 100 | 14.4 | 19.9
9 | 1000 | 16.4 | 23.8
9 | 10000 | 22.4 | 25.7
65 | 70 | 11.4 | 12.1
65 | 100 | 14.8 | 17.0
65 | 1000 | 16.9 | 18.1
65 | 10000 | 22.5 | 28.5
(16 rows)
v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.3 | 11.1
0 | 5 | 12.3 | 13.0
0 | 10 | 13.0 | 14.1
0 | 60 | 14.7 | 20.5
0 | 70 | 16.3 | 19.0
0 | 100 | 16.5 | 18.4
0 | 1000 | 19.0 | 22.4
0 | 10000 | 23.2 | 29.6
9 | 10 | 11.2 | 11.1
9 | 100 | 14.8 | 20.5
9 | 1000 | 16.8 | 24.3
9 | 10000 | 23.3 | 26.5
65 | 70 | 12.4 | 13.0
65 | 100 | 15.2 | 16.6
65 | 1000 | 16.9 | 18.4
65 | 10000 | 23.4 | 29.3
(16 rows)
These are just a single run on my laptop, and the error bars on
individual numbers are significant. But it seems to me that v2 is maybe
a little faster when the entries fit in the array.
- Heikki
Attachments
On 10/21/25 09:10, Heikki Linnakangas wrote:
> On 18/10/2025 01:49, Tomas Vondra wrote:
>> On 10/17/25 12:32, Tomas Vondra wrote:
>>>
>>>
>>> On 10/17/25 10:31, Heikki Linnakangas wrote:
>>>>> typedef struct ResourceElem
>>>>> {
>>>>> Datum item;
>>>>> + uint32 count; /* number of occurrences */
>>>>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>>>>> table slot */
>>>>> } ResourceElem;
>>>>
>>>> Hmm, the 'count' is not used when the entry is stored in the array.
>>>> Perhaps we should have a separate struct for array and hash elements
>>>> now. Keeping the array small helps it to fit in CPU caches.
>>>
>>> Agreed. I had the same idea yesterday, but I haven't done it yet.
>>
>> The attached v2 does that - it adds a separate ResourceHashElem for the
>> hash table, and it works. But I'm not sure I like it very much, because
>> there are two places that relied on both the array and hash table using
>> the same struct to "walk" it the same way.
>>
>> For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
>> now duplicates most of the code. It's not terrible, but also not pretty.
>> I can't think of a better way, though.
>
> Looks fine to me. The code duplication is not too bad IMO.
>
> Here's a rebased version of the micro-benchmark I used when I was
> working on the ResourceOwner refactoring (https://www.postgresql.org/
> message-id/d746cead-a1ef-7efe-fb47-933311e876a3%40iki.fi).
>
> I ran it again on my laptop. Different from the one I used back then, so
> the results are not comparable with the results from that old thread.
>
> Unpatched (commit 18d26140934):
>
> postgres=# \i contrib/resownerbench/snaptest.sql
> numkeep | numsnaps | lifo_time_ns | fifo_time_ns
> ---------+----------+--------------+--------------
> 0 | 1 | 11.6 | 11.1
> 0 | 5 | 12.1 | 13.1
> 0 | 10 | 12.3 | 13.5
> 0 | 60 | 14.6 | 19.4
> 0 | 70 | 16.0 | 18.1
> 0 | 100 | 16.7 | 18.0
> 0 | 1000 | 18.1 | 20.7
> 0 | 10000 | 21.9 | 29.5
> 9 | 10 | 11.0 | 11.1
> 9 | 100 | 14.9 | 20.0
> 9 | 1000 | 16.1 | 24.4
> 9 | 10000 | 21.9 | 25.7
> 65 | 70 | 11.7 | 12.5
> 65 | 100 | 13.9 | 14.8
> 65 | 1000 | 16.7 | 17.8
> 65 | 10000 | 22.5 | 27.8
> (16 rows)
>
> v2-0001-Deduplicate-entries-in-ResourceOwner.patch:
>
> postgres=# \i contrib/resownerbench/snaptest.sql
> numkeep | numsnaps | lifo_time_ns | fifo_time_ns
> ---------+----------+--------------+--------------
> 0 | 1 | 10.8 | 10.6
> 0 | 5 | 11.5 | 12.3
> 0 | 10 | 12.1 | 13.0
> 0 | 60 | 13.9 | 19.4
> 0 | 70 | 15.9 | 18.7
> 0 | 100 | 16.0 | 18.5
> 0 | 1000 | 19.2 | 21.6
> 0 | 10000 | 22.4 | 29.0
> 9 | 10 | 11.2 | 11.3
> 9 | 100 | 14.4 | 19.9
> 9 | 1000 | 16.4 | 23.8
> 9 | 10000 | 22.4 | 25.7
> 65 | 70 | 11.4 | 12.1
> 65 | 100 | 14.8 | 17.0
> 65 | 1000 | 16.9 | 18.1
> 65 | 10000 | 22.5 | 28.5
> (16 rows)
>
> v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch:
>
> postgres=# \i contrib/resownerbench/snaptest.sql
> numkeep | numsnaps | lifo_time_ns | fifo_time_ns
> ---------+----------+--------------+--------------
> 0 | 1 | 11.3 | 11.1
> 0 | 5 | 12.3 | 13.0
> 0 | 10 | 13.0 | 14.1
> 0 | 60 | 14.7 | 20.5
> 0 | 70 | 16.3 | 19.0
> 0 | 100 | 16.5 | 18.4
> 0 | 1000 | 19.0 | 22.4
> 0 | 10000 | 23.2 | 29.6
> 9 | 10 | 11.2 | 11.1
> 9 | 100 | 14.8 | 20.5
> 9 | 1000 | 16.8 | 24.3
> 9 | 10000 | 23.3 | 26.5
> 65 | 70 | 12.4 | 13.0
> 65 | 100 | 15.2 | 16.6
> 65 | 1000 | 16.9 | 18.4
> 65 | 10000 | 23.4 | 29.3
> (16 rows)
>
> These are just a single run on my laptop, the error bars on individual
> numbers are significant. But it seems to me that V2 is maybe a little
> faster when the entries fit in the array.
>
Thanks. Attached is a version that adds a missing .sql file defining the
benchmark functions, and then two patches. The v2 is the patch already
shared on Saturday, v3 is a minor tweak (details in a minute).
I ran the benchmark on my test machine with v1, v2 and v3, with 10 runs.
There seems to be a small regression of ~5-10% compared to master (v0 in
the tables below). That's not terrible, but also not negligible.
Here's the "fifo" results:
keep snaps | v0 v1 v2 v3 | v1 v2 v3
==================================================================
0 1 | 10.57 10.70 10.73 10.5 | 101% 102% 99%
5 | 10.52 10.70 10.90 10.5 | 102% 104% 100%
10 | 11.41 11.90 11.94 11.5 | 104% 105% 101%
60 | 15.04 15.74 15.82 15.5 | 105% 105% 103%
70 | 13.73 14.42 14.80 14.3 | 105% 108% 104%
100 | 13.65 14.27 14.59 14.1 | 105% 107% 104%
1000 | 15.07 15.78 16.27 15.6 | 105% 108% 104%
10000 | 23.15 24.96 25.57 24.1 | 108% 110% 104%
------------------------------------------------------------------
9 10 | 10.8 10.94 10.86 10.6 | 101% 101% 98%
100 | 15.83 16.35 16.72 16.1 | 103% 106% 102%
1000 | 19.04 19.70 20.34 19.6 | 103% 107% 103%
10000 | 21.5 23.37 24.18 22.5 | 109% 112% 105%
------------------------------------------------------------------
65 70 | 10.58 10.95 11.18 10.6 | 103% 106% 100%
100 | 13.29 14.28 14.73 14.1 | 107% 111% 106%
1000 | 14.62 15.49 15.99 15.2 | 106% 109% 104%
10000 | 22.98 24.78 25.55 24.3 | 108% 111% 106%
and here's "lifo"
keep snaps | v0 v1 v2 v3 | v1 v2 v3
==================================================================
0 1 | 10.73 10.74 11.06 10.4 | 100% 103% 97%
5 | 10.44 10.45 10.62 10.2 | 100% 102% 98%
10 | 11.82 11.84 12.17 11.6 | 100% 103% 98%
60 | 12.15 12.81 12.94 12.4 | 105% 107% 102%
70 | 12.84 13.61 13.95 13.4 | 106% 109% 104%
100 | 12.98 13.73 14.06 13.5 | 106% 108% 104%
1000 | 13.71 14.56 14.80 14.2 | 106% 108% 104%
10000 | 17.68 19.51 20.38 19.0 | 110% 115% 108%
------------------------------------------------------------------
9 10 | 10.96 10.92 11.02 10.6 | 100% 101% 97%
100 | 12.58 13.11 13.26 12.8 | 104% 105% 102%
1000 | 13.67 14.38 14.73 14.1 | 105% 108% 103%
10000 | 17.67 19.79 20.20 19.2 | 112% 114% 109%
------------------------------------------------------------------
65 70 | 10.58 10.85 10.78 11.1 | 103% 102% 105%
100 | 13.03 14.12 14.27 13.5 | 108% 110% 103%
1000 | 13.65 14.56 14.88 14.1 | 107% 109% 104%
10000 | 17.62 19.61 20.30 19.0 | 111% 115% 108%
I was wondering where the regression comes from, and it occurred to me
that ResourceOwnerAddToHash() may be doing the checks in the wrong
order. It should check for an empty slot first, so I did that - that's v3.
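Concretely, the probe loop from the diff quoted earlier in the thread would look roughly like this with the v3 ordering (a sketch of the reordering, not the exact patch):

```
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
    /* check for a free slot first - the common case when there are
     * no duplicates - and only then compare against the occupant */
    if (owner->hash[idx].kind == NULL)
        break;                  /* found a free slot */

    /* exact match - just increment the counter instead of adding a
     * new entry */
    if ((owner->hash[idx].kind == kind) &&
        (owner->hash[idx].item == value))
    {
        owner->hash[idx].count += count;
        return;
    }

    idx = (idx + 1) & mask;
}
```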
There's still a regression, but it's about halved compared to v2, and
about equal to v1. So I tried doing the same tweak for v1, but that
didn't help much (if at all).
The results seem fairly stable, and the overall trend is clear. It'd be
great if there were no regressions, but considering how narrow this
microbenchmark is (and the benefits for practical COPY runs), I'd say
it's probably OK.
regards
--
Tomas Vondra