Discussion: Optimizing ResourceOwner to speed up COPY
Hi,
While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed some of the COPY queries are spending a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf-record/report).
The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, with O(n^2) behavior with duplicates. This
happens with COPY, because COPY creates an array of 1000 tuple slots,
and each slot references the same tuple descriptor. And the descriptor
is added to ResourceOwner for each slot.
The trouble is that ResourceOwner uses a simple open-addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But each following addition needs to skip over the existing
entries - the n-th addition needs to skip over n-1 entries.
With N entries that means N * (N-1) / 2 probes in total. And then the
same thing happens for deletions, in reverse.
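To make the arithmetic concrete, here's a minimal standalone sketch of that effect (illustrative only, not the actual resowner.c code; the hash function, the Entry type and sketch_add() are made up for the demo):

```
/*
 * Sketch of why duplicates hurt linear probing: every duplicate hashes
 * to the same bucket, so the n-th insert has to step over the n-1
 * entries already stored, i.e. N*(N-1)/2 probes for N duplicates.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CAPACITY 2048                   /* power of two, so we can mask */

typedef struct
{
    uintptr_t   key;
    bool        in_use;
} Entry;

static Entry hash_table[CAPACITY];

static long
sketch_add(uintptr_t key)
{
    uint32_t    idx = (uint32_t) (key * 2654435761u) & (CAPACITY - 1);
    long        probes = 0;

    while (hash_table[idx].in_use)      /* walk the "train" of prior duplicates */
    {
        idx = (idx + 1) & (CAPACITY - 1);
        probes++;
    }
    hash_table[idx].key = key;
    hash_table[idx].in_use = true;
    return probes;
}

int
main(void)
{
    long        total = 0;

    for (int i = 0; i < 1000; i++)
        total += sketch_add(0xdeadbeef);    /* same key, like the shared tupdesc */

    printf("total probes for 1000 duplicates: %ld\n", total);   /* prints 499500 */
    return 0;
}
```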
This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...
There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuple slots, the descriptor will
have just one ResourceElem entry with count=1000.
This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.
The deduplication is "opportunistic" in the sense that you may still end
up with multiple entries for the same value/kind. If the probe finds an
empty slot before reaching the existing entry, it'll use the empty slot.
Also, only the hash table is deduplicated, not the initial array (which
is, however, quite small).
This is a reasonable trade-off because it means it does not get more
expensive for cases with no duplicates, while still improving cases with
many duplicate values.
To demonstrate the benefits, here's a throughput (tps) from a COPY of a
certain number of rows from a file into an UNLOGGED table, with 1, 4 and
8 clients.
   rows  |          master          |          patched
         |      1       4       8  |      1       4       8
---------|--------------------------|-------------------------
     10  | 112090  418421  594905  | 112871  419043  589646
    100  |  35265  121903  168238  |  42536  138578  183740
   1000  |   1450    5700   10725  |   5847   21594   30713
  10000  |    531    1943    2988  |    743    2619    3557
 100000  |     72     244     324  |     76     256     332
Or, as relative throughput:
              patched / master
   rows  |      1       4       8
---------|-------------------------
     10  |   101%    100%     99%
    100  |   121%    114%    109%
   1000  |   403%    379%    286%
  10000  |   140%    135%    119%
 100000  |   106%    105%    102%
Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY - the overhead
grows up to 1000 rows, and beyond that it amortizes over many batches.
It has very little impact for large COPY.
I've done the same test with a logged table. The results significantly
depend on the storage (how fast it can handle commits) - not surprising.
But with good storage the overall behavior is almost exactly the same.
I haven't yet done any benchmarking focused on cases with no/few
duplicates (although the COPY with 10 rows could qualify). But as I
explained earlier, the behavior should not change at all. There's an
extra uint32, but that's it.
regards
--
Tomas Vondra
Attachments
Tomas Vondra <tomas@vondra.me> writes:
> The reason is pretty simple - ResourceOwner tracks the resources in a
> very simple hash table, with O(n^2) behavior with duplicates. This
> happens with COPY, because COPY creates an array of 1000 tuple slots,
> and each slot references the same tuple descriptor. And the descriptor
> is added to ResourceOwner for each slot.
> ...
> There's an easy way to improve this by allowing a single hash entry to
> represent multiple references to the same resource. The attached patch
> adds a "count" to the ResourceElem, tracking how many times that
> resource was added. So if you add 1000 tuple slots, the descriptor will
> have just one ResourceElem entry with count=1000.
Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
that's negligible, or maybe it isn't. Can you find evidence of this
change being helpful for anything except this specific scenario in
COPY? Because we could probably find some way to avoid registering
all the doppelganger slots, if that's the only culprit.
regards, tom lane
On 10/16/25 20:12, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> The reason is pretty simple - ResourceOwner tracks the resources in a
>> very simple hash table, with O(n^2) behavior with duplicates. This
>> happens with COPY, because COPY creates an array of 1000 tuple slots,
>> and each slot references the same tuple descriptor. And the descriptor
>> is added to ResourceOwner for each slot.
>> ...
>> There's an easy way to improve this by allowing a single hash entry to
>> represent multiple references to the same resource. The attached patch
>> adds a "count" to the ResourceElem, tracking how many times that
>> resource was added. So if you add 1000 tuple slots, the descriptor will
>> have just one ResourceElem entry with count=1000.
>
> Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
> that's negligible, or maybe it isn't.

I agree the +50% is not great. My plan was to maybe use a smaller data
type, say uint8 or uint16, and reduce the overhead that way (while still
massively reducing the number of entries). But I didn't realize there
will always be extra padding, making this pointless.

Maybe we could make the struct __packed__, and put the counter last (and
maybe make it uint16)? But I don't see any other packed structs, so I
guess there's a reason not to have them.

I'm not sure if this is (or is not) negligible. In cases with duplicates
this actually saves a lot of memory, so it's a win. I'm not sure about
cases without duplicates - it'll use more memory, but I'm not sure how
many resource owners there usually are. Or how many entries they
contain. That'll probably determine if it's negligible.

My intuition was there wouldn't be that many, and that the typical
number of elements is fairly low (if not, why have the initial capacity
set to 64/128?). But maybe my intuition is wrong.

> Can you find evidence of this change being helpful for anything
> except this specific scenario in COPY?

I went through the ResourceOwnerRemember() calls, looking for other
cases that might create a lot of duplicates, similar to the tuple
descriptors, but I haven't found anything obvious. Other resources seem
to be either naturally unique or limited to very few duplicates.

> Because we could probably find some way to avoid registering all the
> doppelganger slots, if that's the only culprit.

I'm not against other approaches. I didn't want to rework the foundation
of the resource owner management, and this seemed simple, and with
minimal impact on cases without duplicates (except for the struct
getting larger).

thanks

--
Tomas Vondra
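To spell out the padding point, a quick standalone sketch (assuming a typical LP64 platform where Datum and pointers are 8 bytes; the Elem* type names and the scaffolding are made up, but the field layout mirrors the patch):

```
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;                    /* stand-in for the real typedef */
typedef struct ResourceOwnerDesc ResourceOwnerDesc;

typedef struct
{
    Datum       item;                       /* 8 bytes */
    const ResourceOwnerDesc *kind;          /* 8 bytes */
} ElemMaster;                               /* 16 bytes */

typedef struct
{
    Datum       item;                       /* 8 bytes */
    uint8_t     count;                      /* 1 byte ... */
    const ResourceOwnerDesc *kind;          /* ... but padded to the next 8-byte boundary */
} ElemUint8;                                /* still 24 bytes */

typedef struct
{
    Datum       item;
    uint32_t    count;
    const ResourceOwnerDesc *kind;
} ElemUint32;                               /* 24 bytes */

typedef struct
{
    Datum       item;
    int64_t     count;                      /* a wider counter costs nothing extra */
    const ResourceOwnerDesc *kind;
} ElemInt64;                                /* 24 bytes as well */

int
main(void)
{
    /* On a typical LP64 machine this prints: 16 24 24 24 */
    printf("%zu %zu %zu %zu\n", sizeof(ElemMaster), sizeof(ElemUint8),
           sizeof(ElemUint32), sizeof(ElemInt64));
    return 0;
}
```

So without packing, a uint8 or uint16 counter ends up costing the same 8 bytes per element as a uint32 (or even an int64, which comes up later in the thread).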
Tomas Vondra <tomas@vondra.me> writes:
> On 10/16/25 20:12, Tom Lane wrote:
>> Can you find evidence of this change being helpful for anything
>> except this specific scenario in COPY?
> I went through the ResourceOwnerRemember() calls, looking for other
> cases that might create a lot of duplicates, similar to the tuple
> descriptors, but I haven't found anything obvious. Other resources seem
> to be either naturally unique or limited to very few duplicates.
I was thinking of adding some temporary instrumentation, like
just elog'ing whenever the count goes above 1, and seeing where
you get hits during the regression tests. I'm prepared to believe
this is worth doing, but it'd be nice to have more examples
in mind.
regards, tom lane
On 10/16/25 21:28, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 10/16/25 20:12, Tom Lane wrote:
>>> Can you find evidence of this change being helpful for anything
>>> except this specific scenario in COPY?
>
>> I went through the ResourceOwnerRemember() calls, looking for other
>> cases that might create a lot of duplicates, similar to the tuple
>> descriptors, but I haven't found anything obvious. Other resources seem
>> to be either naturally unique or limited to very few duplicates.
>
> I was thinking of adding some temporary instrumentation, like
> just elog'ing whenever the count goes above 1, and seeing where
> you get hits during the regression tests. I'm prepared to believe
> this is worth doing, but it'd be nice to have more examples
> in mind.

I tried that, and that gives me ~30k log messages with (count > 1). But
that's a bit misleading, because a lot of that are the same "thing"
going from 1 to N, which produces N messages.

If I subtract all the COPY statements, loading data for regression
tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
3, mostly simple queries. Even a simple "\d t" produces a bunch of such
messages.

test=# \d t
WARNING: RESOURCEOWNER: snapshot reference 0x2e3787b0 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae1302fba8 resource owner Portal count 2
WARNING: RESOURCEOWNER: tupdesc reference 0x79ae1302fec8 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource owner Portal count 3
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034d88 resource owner Portal count 2
WARNING: RESOURCEOWNER: buffer pin 0x4a resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource owner Portal count 4
WARNING: RESOURCEOWNER: buffer pin 0xa resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae12dca6d0 resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae1303aff8 resource owner Portal count 2

There are some more extreme ones too. For example

select infinite_recurse();

produces

WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner Portal count 1340

Another example is CREATE TABLE, which creates a batch of slots when
inserting attributes in InsertPgAttributeTuples, so that'll end up with
the count = number of attributes. Of course, those are not particularly
frequent operations. Most applications are not doing CREATE TABLE
nearly as often as DML.

But I had another idea - see how large the ResourceOwners get, which
would tell us how much "overhead" it really is. So I added logging into
ResourceOwnerDelete (without the patch), and with that regression tests
produce 113916 messages. And 113289 have the initial capacity 32, so
array only. From the remaining ~600, only 72 have capacity over 64.

So I guess the overhead should not be that bad. Actually, it would be
possible to completely eliminate the overhead for the array, because
that does not actually need the count at all.

regards

--
Tomas Vondra
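For reference, instrumentation of roughly this shape would produce WARNING lines like the ones above (a hypothetical sketch, not necessarily the exact code used; kind->name and owner->name are real fields in resowner.c, but the exact placement inside ResourceOwnerAddToHash() and the message wording are assumptions):

```
/*
 * Hypothetical instrumentation sketch: emit a WARNING whenever a hash
 * entry's count goes above 1, placed right after the counter is bumped
 * in ResourceOwnerAddToHash().
 */
if (owner->hash[idx].count > 1)
    elog(WARNING, "RESOURCEOWNER: %s %p resource owner %s count %u",
         kind->name,
         (void *) DatumGetPointer(value),
         owner->name,
         owner->hash[idx].count);
```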
Tomas Vondra <tomas@vondra.me> writes:
> On 10/16/25 21:28, Tom Lane wrote:
>> I was thinking of adding some temporary instrumentation, like
>> just elog'ing whenever the count goes above 1, and seeing where
>> you get hits during the regression tests. I'm prepared to believe
>> this is worth doing, but it'd be nice to have more examples
>> in mind.
> I tried that, and that gives me ~30k log messages with (count > 1). But
> that's a bit misleading, because a lot of that are the same "thing"
> going from 1 to N, which produces N messages.
> If I subtract all the COPY statements, loading data for regression
> tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
> 3, mostly simple queries. Even a simple "\d t" produces a bunch of such
> messages.
Okay, so we definitely do have other cases where the count will be
more than 1.
> There are some more extreme ones too. For example
> select infinite_recurse();
> produces
> WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
> Portal count 1340
Makes me wonder if we shouldn't make the count int64, just to remove
all worries about overflow. That'd be free on 64-bit machines ...
> But I had another idea - see how large the ResourceOwners get, which
> would tell us how much "overhead" it really is. So I added logging into
> ResourceOwnerDelete (without the patch), and with that regression tests
> produce 113916 messages. And 113289 have the initial capacity 32, so
> array only. From the remaining ~600, only 72 have capacity over 64.
Ah, that's fairly convincing. Seems like we can move ahead with this.
regards, tom lane
On 10/17/25 00:17, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 10/16/25 21:28, Tom Lane wrote:
>>> I was thinking of adding some temporary instrumentation, like
>>> just elog'ing whenever the count goes above 1, and seeing where
>>> you get hits during the regression tests. I'm prepared to believe
>>> this is worth doing, but it'd be nice to have more examples
>>> in mind.
>
>> I tried that, and that gives me ~30k log messages with (count > 1). But
>> that's a bit misleading, because a lot of that are the same "thing"
>> going from 1 to N, which produces N messages.
>
>> If I subtract all the COPY statements, loading data for regression
>> tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
>> 3, mostly simple queries. Even a simple "\d t" produces a bunch of such
>> messages.
>
> Okay, so we definitely do have other cases where the count will be
> more than 1.
>
>> There are some more extreme ones too. For example
>> select infinite_recurse();
>> produces
>> WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
>> Portal count 1340
>
> Makes me wonder if we shouldn't make the count int64, just to remove
> all worries about overflow. That'd be free on 64-bit machines ...

We may do that, if it's free. But I doubt there's any risk of overflow
in practice. If we really wanted to support that many entries, we
wouldn't be allocating the hash table using MemoryContextAllocZero().
That puts the maximum number of unique entries at ~50M.

>> But I had another idea - see how large the ResourceOwners get, which
>> would tell us how much "overhead" it really is. So I added logging into
>> ResourceOwnerDelete (without the patch), and with that regression tests
>> produce 113916 messages. And 113289 have the initial capacity 32, so
>> array only. From the remaining ~600, only 72 have capacity over 64.
>
> Ah, that's fairly convincing. Seems like we can move ahead with this.

Thanks for the feedback. I'll let it sit for a while, there's no rush to
get this committed.

regards

--
Tomas Vondra
> On Oct 17, 2025, at 01:46, Tomas Vondra <tomas@vondra.me> wrote:
>
> --
> Tomas Vondra<v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch>
Nice patch! I eyeball-reviewed the patch and have only a few small comments:
1
```
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /* found an exact match - just increment the counter */
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
```
I think this is the “trade-off” you mention in your email: if a free slot is found earlier, you still get duplicated
entries. I have no concern with this “trade-off”, but please add a comment here, otherwise future readers may be confused
and might consider it a bug, without reading your original design email.
2
```
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
```
Nit suggestion: people usually name this type of count “refcnt”, which looks more meaningful. But I don't have a
strong opinion here. I am absolutely fine if you don't want to change it.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 17/10/2025 06:13, Chao Li wrote:
> ```
> @@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
> idx = hash_resource_elem(value, kind) & mask;
> for (;;)
> {
> + /* found an exact match - just increment the counter */
> + if ((owner->hash[idx].kind == kind) &&
> + (owner->hash[idx].item == value))
> + {
> + owner->hash[idx].count += count;
> + return;
> + }
> +
> if (owner->hash[idx].kind == NULL)
> break; /* found a free slot */
> idx = (idx + 1) & mask;
> }
> ```
>
> I think this is the “trade-off” you mention in your email: if a free slot is found earlier, you still get duplicated
> entries. I have no concern with this “trade-off”, but please add a comment here, otherwise future readers may be
> confused and might consider it a bug, without reading your original design email.
+1 on such a comment. Here or maybe on the ResourceElem struct itself.
> typedef struct ResourceElem
> {
> Datum item;
> + uint32 count; /* number of occurrences */
> const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
> } ResourceElem;
Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.
> /*
> * Forget that an object is owned by a ResourceOwner
> *
> * Note: If same resource ID is associated with the ResourceOwner more than
> * once, one instance is removed.
> *
> * Note: Forgetting a resource does not guarantee that there is room to
> * remember a new resource. One exception is when you forget the most
> * recently remembered resource; that does make room for a new remember call.
> * Some code callers rely on that exception.
> */
> void
> ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
Does this break the exception noted in the above comment? I guess it
still holds because we don't deduplicate entries in the array. That's
very subtle, needs a comment at least.
typo: ocurrence -> occurrence
- Heikki
On 10/17/25 10:31, Heikki Linnakangas wrote:
> On 17/10/2025 06:13, Chao Li wrote:
>> ```
>> @@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner,
>> Datum value, const ResourceOwnerDesc
>> idx = hash_resource_elem(value, kind) & mask;
>> for (;;)
>> {
>> + /* found an exact match - just increment the counter */
>> + if ((owner->hash[idx].kind == kind) &&
>> + (owner->hash[idx].item == value))
>> + {
>> + owner->hash[idx].count += count;
>> + return;
>> + }
>> +
>> if (owner->hash[idx].kind == NULL)
>> break; /* found a free slot */
>> idx = (idx + 1) & mask;
>> }
>> ```
>>
>> I think this is the “trade-off” you mention in your email: if a free
>> slot found earlier, then still gets duplicated entries. I have no
>> concern to this “trade-off”, but please add a comment here, otherwise
>> future readers may be confused, and might potentially consider that
>> were a bug, without reading your original design email.
>
> +1 on such a comment. Here or maybe on the ResourceElem struct itself.
>
Will do. It's probably appropriate to mention this deduplication in the
comment at the beginning of the file.
>> typedef struct ResourceElem
>> {
>> Datum item;
>> + uint32 count; /* number of occurrences */
>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>> table slot */
>> } ResourceElem;
>
> Hmm, the 'count' is not used when the entry is stored in the array.
> Perhaps we should have a separate struct for array and hash elements
> now. Keeping the array small helps it to fit in CPU caches.
>
Agreed. I had the same idea yesterday, but I haven't done it yet.
>> /*
>> * Forget that an object is owned by a ResourceOwner
>> *
>> * Note: If same resource ID is associated with the ResourceOwner more
>> than
>> * once, one instance is removed.
>> *
>> * Note: Forgetting a resource does not guarantee that there is room to
>> * remember a new resource. One exception is when you forget the most
>> * recently remembered resource; that does make room for a new
>> remember call.
>> * Some code callers rely on that exception.
>> */
>> void
>> ResourceOwnerForget(ResourceOwner owner, Datum value, const
>> ResourceOwnerDesc *kind)
>
> Does this break the exception noted in the above comment? I guess it
> still holds because we don't deduplicate entries in the array. That's
> very subtle, needs a comment at least.
>
Right, it doesn't break the exception - as you say, it only applies to
the hash, not to the array. I'll add a comment.
Thanks
--
Tomas Vondra
On 10/17/25 12:32, Tomas Vondra wrote:
>
>
> On 10/17/25 10:31, Heikki Linnakangas wrote:
>> On 17/10/2025 06:13, Chao Li wrote:
>>> ```
>>> @@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner,
>>> Datum value, const ResourceOwnerDesc
>>> idx = hash_resource_elem(value, kind) & mask;
>>> for (;;)
>>> {
>>> + /* found an exact match - just increment the counter */
>>> + if ((owner->hash[idx].kind == kind) &&
>>> + (owner->hash[idx].item == value))
>>> + {
>>> + owner->hash[idx].count += count;
>>> + return;
>>> + }
>>> +
>>> if (owner->hash[idx].kind == NULL)
>>> break; /* found a free slot */
>>> idx = (idx + 1) & mask;
>>> }
>>> ```
>>>
>>> I think this is the “trade-off” you mention in your email: if a free
>>> slot found earlier, then still gets duplicated entries. I have no
>>> concern to this “trade-off”, but please add a comment here, otherwise
>>> future readers may be confused, and might potentially consider that
>>> were a bug, without reading your original design email.
>>
>> +1 on such a comment. Here or maybe on the ResourceElem struct itself.
>>
>
> Will do. It's probably appropriate to mention this deduplication in the
> comment at the beginning of the file.
>
I added a note to the comment at the top of the file, and clarified the
comment where the hash table entry is added. But I tried not to be too
verbose there.
>>> typedef struct ResourceElem
>>> {
>>> Datum item;
>>> + uint32 count; /* number of occurrences */
>>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>>> table slot */
>>> } ResourceElem;
>>
>> Hmm, the 'count' is not used when the entry is stored in the array.
>> Perhaps we should have a separate struct for array and hash elements
>> now. Keeping the array small helps it to fit in CPU caches.
>>
>
> Agreed. I had the same idea yesterday, but I haven't done it yet.
>
The attached v2 does that - it adds a separate ResourceHashElem for the
hash table, and it works. But I'm not sure I like it very much, because
two places relied on the array and the hash table using the same struct,
so both could be walked the same way.
For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
now duplicates most of the code. It's not terrible, but also not pretty.
I can't think of a better way, though.
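For readers following along, the split looks roughly like this (a sketch of the v2 shape; the attached patch is the authoritative version):

```
/* Array entries stay small - no counter, so the array keeps fitting in
 * CPU caches. */
typedef struct ResourceElem
{
    Datum       item;
    const ResourceOwnerDesc *kind;      /* NULL means a free slot */
} ResourceElem;

/* Hash table entries carry the reference count used for deduplication. */
typedef struct ResourceHashElem
{
    Datum       item;
    uint32      count;                  /* number of remembered references */
    const ResourceOwnerDesc *kind;      /* NULL means a free hash table slot */
} ResourceHashElem;
```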
>>> /*
>>> * Forget that an object is owned by a ResourceOwner
>>> *
>>> * Note: If same resource ID is associated with the ResourceOwner more
>>> than
>>> * once, one instance is removed.
>>> *
>>> * Note: Forgetting a resource does not guarantee that there is room to
>>> * remember a new resource. One exception is when you forget the most
>>> * recently remembered resource; that does make room for a new
>>> remember call.
>>> * Some code callers rely on that exception.
>>> */
>>> void
>>> ResourceOwnerForget(ResourceOwner owner, Datum value, const
>>> ResourceOwnerDesc *kind)
>>
>> Does this break the exception noted in the above comment? I guess it
>> still holds because we don't deduplicate entries in the array. That's
>> very subtle, needs a comment at least.
>>
>
> Right, it doesn't break the exception - as you say, it only applies to
> the hash, not to the array. I'll add a comment.
>
Comment added.
regards
--
Tomas Vondra
Attachments
On 18/10/2025 01:49, Tomas Vondra wrote:
> On 10/17/25 12:32, Tomas Vondra wrote:
>>
>>
>> On 10/17/25 10:31, Heikki Linnakangas wrote:
>>>> typedef struct ResourceElem
>>>> {
>>>> Datum item;
>>>> + uint32 count; /* number of occurrences */
>>>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>>>> table slot */
>>>> } ResourceElem;
>>>
>>> Hmm, the 'count' is not used when the entry is stored in the array.
>>> Perhaps we should have a separate struct for array and hash elements
>>> now. Keeping the array small helps it to fit in CPU caches.
>>
>> Agreed. I had the same idea yesterday, but I haven't done it yet.
>
> The attached v2 does that - it adds a separate ResourceHashElem for the
> hash table, and it works. But I'm not sure I like it very much, because
> there are two places that relied on both the array and hash table using
> the same struct to "walk" it the same way.
>
> For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
> now duplicates most of the code. It's not terrible, but also not pretty.
> I can't think of a better way, though.
Looks fine to me. The code duplication is not too bad IMO.
Here's a rebased version of the micro-benchmark I used when I was
working on the ResourceOwner refactoring
(https://www.postgresql.org/message-id/d746cead-a1ef-7efe-fb47-933311e876a3%40iki.fi).
I ran it again on my laptop. It's a different machine from the one I
used back then, so the results are not comparable with those from that
old thread.
Unpatched (commit 18d26140934):
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.6 | 11.1
0 | 5 | 12.1 | 13.1
0 | 10 | 12.3 | 13.5
0 | 60 | 14.6 | 19.4
0 | 70 | 16.0 | 18.1
0 | 100 | 16.7 | 18.0
0 | 1000 | 18.1 | 20.7
0 | 10000 | 21.9 | 29.5
9 | 10 | 11.0 | 11.1
9 | 100 | 14.9 | 20.0
9 | 1000 | 16.1 | 24.4
9 | 10000 | 21.9 | 25.7
65 | 70 | 11.7 | 12.5
65 | 100 | 13.9 | 14.8
65 | 1000 | 16.7 | 17.8
65 | 10000 | 22.5 | 27.8
(16 rows)
v2-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 10.8 | 10.6
0 | 5 | 11.5 | 12.3
0 | 10 | 12.1 | 13.0
0 | 60 | 13.9 | 19.4
0 | 70 | 15.9 | 18.7
0 | 100 | 16.0 | 18.5
0 | 1000 | 19.2 | 21.6
0 | 10000 | 22.4 | 29.0
9 | 10 | 11.2 | 11.3
9 | 100 | 14.4 | 19.9
9 | 1000 | 16.4 | 23.8
9 | 10000 | 22.4 | 25.7
65 | 70 | 11.4 | 12.1
65 | 100 | 14.8 | 17.0
65 | 1000 | 16.9 | 18.1
65 | 10000 | 22.5 | 28.5
(16 rows)
v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.3 | 11.1
0 | 5 | 12.3 | 13.0
0 | 10 | 13.0 | 14.1
0 | 60 | 14.7 | 20.5
0 | 70 | 16.3 | 19.0
0 | 100 | 16.5 | 18.4
0 | 1000 | 19.0 | 22.4
0 | 10000 | 23.2 | 29.6
9 | 10 | 11.2 | 11.1
9 | 100 | 14.8 | 20.5
9 | 1000 | 16.8 | 24.3
9 | 10000 | 23.3 | 26.5
65 | 70 | 12.4 | 13.0
65 | 100 | 15.2 | 16.6
65 | 1000 | 16.9 | 18.4
65 | 10000 | 23.4 | 29.3
(16 rows)
These are just a single run on my laptop, and the error bars on
individual numbers are significant. But it seems to me that v2 is maybe
a little faster when the entries fit in the array.
- Heikki
Attachments
On 10/21/25 09:10, Heikki Linnakangas wrote:
> On 18/10/2025 01:49, Tomas Vondra wrote:
>> On 10/17/25 12:32, Tomas Vondra wrote:
>>>
>>>
>>> On 10/17/25 10:31, Heikki Linnakangas wrote:
>>>>> typedef struct ResourceElem
>>>>> {
>>>>> Datum item;
>>>>> + uint32 count; /* number of occurrences */
>>>>> const ResourceOwnerDesc *kind; /* NULL indicates a free hash
>>>>> table slot */
>>>>> } ResourceElem;
>>>>
>>>> Hmm, the 'count' is not used when the entry is stored in the array.
>>>> Perhaps we should have a separate struct for array and hash elements
>>>> now. Keeping the array small helps it to fit in CPU caches.
>>>
>>> Agreed. I had the same idea yesterday, but I haven't done it yet.
>>
>> The attached v2 does that - it adds a separate ResourceHashElem for the
>> hash table, and it works. But I'm not sure I like it very much, because
>> there are two places that relied on both the array and hash table using
>> the same struct to "walk" it the same way.
>>
>> For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
>> now duplicates most of the code. It's not terrible, but also not pretty.
>> I can't think of a better way, though.
>
> Looks fine to me. The code duplication is not too bad IMO.
>
> Here's a rebased version of the micro-benchmark I used when I was
> working on the ResourceOwner refactoring (https://www.postgresql.org/
> message-id/d746cead-a1ef-7efe-fb47-933311e876a3%40iki.fi).
>
> I ran it again on my laptop. Different from the one I used back then, so
> the results are not comparable with the results from that old thread.
>
> Unpatched (commit 18d26140934):
>
> postgres=# \i contrib/resownerbench/snaptest.sql
> numkeep | numsnaps | lifo_time_ns | fifo_time_ns
> ---------+----------+--------------+--------------
> 0 | 1 | 11.6 | 11.1
> 0 | 5 | 12.1 | 13.1
> 0 | 10 | 12.3 | 13.5
> 0 | 60 | 14.6 | 19.4
> 0 | 70 | 16.0 | 18.1
> 0 | 100 | 16.7 | 18.0
> 0 | 1000 | 18.1 | 20.7
> 0 | 10000 | 21.9 | 29.5
> 9 | 10 | 11.0 | 11.1
> 9 | 100 | 14.9 | 20.0
> 9 | 1000 | 16.1 | 24.4
> 9 | 10000 | 21.9 | 25.7
> 65 | 70 | 11.7 | 12.5
> 65 | 100 | 13.9 | 14.8
> 65 | 1000 | 16.7 | 17.8
> 65 | 10000 | 22.5 | 27.8
> (16 rows)
>
> v2-0001-Deduplicate-entries-in-ResourceOwner.patch:
>
> postgres=# \i contrib/resownerbench/snaptest.sql
> numkeep | numsnaps | lifo_time_ns | fifo_time_ns
> ---------+----------+--------------+--------------
> 0 | 1 | 10.8 | 10.6
> 0 | 5 | 11.5 | 12.3
> 0 | 10 | 12.1 | 13.0
> 0 | 60 | 13.9 | 19.4
> 0 | 70 | 15.9 | 18.7
> 0 | 100 | 16.0 | 18.5
> 0 | 1000 | 19.2 | 21.6
> 0 | 10000 | 22.4 | 29.0
> 9 | 10 | 11.2 | 11.3
> 9 | 100 | 14.4 | 19.9
> 9 | 1000 | 16.4 | 23.8
> 9 | 10000 | 22.4 | 25.7
> 65 | 70 | 11.4 | 12.1
> 65 | 100 | 14.8 | 17.0
> 65 | 1000 | 16.9 | 18.1
> 65 | 10000 | 22.5 | 28.5
> (16 rows)
>
> v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch:
>
> postgres=# \i contrib/resownerbench/snaptest.sql
> numkeep | numsnaps | lifo_time_ns | fifo_time_ns
> ---------+----------+--------------+--------------
> 0 | 1 | 11.3 | 11.1
> 0 | 5 | 12.3 | 13.0
> 0 | 10 | 13.0 | 14.1
> 0 | 60 | 14.7 | 20.5
> 0 | 70 | 16.3 | 19.0
> 0 | 100 | 16.5 | 18.4
> 0 | 1000 | 19.0 | 22.4
> 0 | 10000 | 23.2 | 29.6
> 9 | 10 | 11.2 | 11.1
> 9 | 100 | 14.8 | 20.5
> 9 | 1000 | 16.8 | 24.3
> 9 | 10000 | 23.3 | 26.5
> 65 | 70 | 12.4 | 13.0
> 65 | 100 | 15.2 | 16.6
> 65 | 1000 | 16.9 | 18.4
> 65 | 10000 | 23.4 | 29.3
> (16 rows)
>
> These are just a single run on my laptop, the error bars on individual
> numbers are significant. But it seems to me that V2 is maybe a little
> faster when the entries fit in the array.
>
Thanks. Attached is a version that adds a missing .sql file defining the
benchmark functions, and then two patches. The v2 is the patch already
shared on Saturday, v3 is a minor tweak (details in a minute).
I ran the benchmark on my test machine with v1, v2 and v3, with 10 runs.
There seems to be a small regression of ~5-10% compared to master (v0 in
the tables below). That's not terrible, but also not negligible.
Here's the "fifo" results:
keep snaps | v0 v1 v2 v3 | v1 v2 v3
==================================================================
0 1 | 10.57 10.70 10.73 10.5 | 101% 102% 99%
5 | 10.52 10.70 10.90 10.5 | 102% 104% 100%
10 | 11.41 11.90 11.94 11.5 | 104% 105% 101%
60 | 15.04 15.74 15.82 15.5 | 105% 105% 103%
70 | 13.73 14.42 14.80 14.3 | 105% 108% 104%
100 | 13.65 14.27 14.59 14.1 | 105% 107% 104%
1000 | 15.07 15.78 16.27 15.6 | 105% 108% 104%
10000 | 23.15 24.96 25.57 24.1 | 108% 110% 104%
------------------------------------------------------------------
9 10 | 10.8 10.94 10.86 10.6 | 101% 101% 98%
100 | 15.83 16.35 16.72 16.1 | 103% 106% 102%
1000 | 19.04 19.70 20.34 19.6 | 103% 107% 103%
10000 | 21.5 23.37 24.18 22.5 | 109% 112% 105%
------------------------------------------------------------------
65 70 | 10.58 10.95 11.18 10.6 | 103% 106% 100%
100 | 13.29 14.28 14.73 14.1 | 107% 111% 106%
1000 | 14.62 15.49 15.99 15.2 | 106% 109% 104%
10000 | 22.98 24.78 25.55 24.3 | 108% 111% 106%
and here's "lifo"
keep snaps | v0 v1 v2 v3 | v1 v2 v3
==================================================================
0 1 | 10.73 10.74 11.06 10.4 | 100% 103% 97%
5 | 10.44 10.45 10.62 10.2 | 100% 102% 98%
10 | 11.82 11.84 12.17 11.6 | 100% 103% 98%
60 | 12.15 12.81 12.94 12.4 | 105% 107% 102%
70 | 12.84 13.61 13.95 13.4 | 106% 109% 104%
100 | 12.98 13.73 14.06 13.5 | 106% 108% 104%
1000 | 13.71 14.56 14.80 14.2 | 106% 108% 104%
10000 | 17.68 19.51 20.38 19.0 | 110% 115% 108%
------------------------------------------------------------------
9 10 | 10.96 10.92 11.02 10.6 | 100% 101% 97%
100 | 12.58 13.11 13.26 12.8 | 104% 105% 102%
1000 | 13.67 14.38 14.73 14.1 | 105% 108% 103%
10000 | 17.67 19.79 20.20 19.2 | 112% 114% 109%
------------------------------------------------------------------
65 70 | 10.58 10.85 10.78 11.1 | 103% 102% 105%
100 | 13.03 14.12 14.27 13.5 | 108% 110% 103%
1000 | 13.65 14.56 14.88 14.1 | 107% 109% 104%
10000 | 17.62 19.61 20.30 19.0 | 111% 115% 108%
I was wondering where the regression comes from, and it occurred to me
that ResourceOwnerAddToHash() may be doing the checks in the wrong
order. It should check for an empty slot first, so I did that - that's v3.
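Concretely, the probe loop from the diff quoted earlier in the thread would look roughly like this with the v3 ordering (a sketch of the reordering, not the exact patch):

```
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
    /* check for a free slot first - the common case when there are
     * no duplicates - and only then compare against the occupant */
    if (owner->hash[idx].kind == NULL)
        break;                  /* found a free slot */

    /* exact match - just increment the counter instead of adding a
     * new entry */
    if ((owner->hash[idx].kind == kind) &&
        (owner->hash[idx].item == value))
    {
        owner->hash[idx].count += count;
        return;
    }

    idx = (idx + 1) & mask;
}
```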
There's still a regression, but it's about halved compared to v2, and
about equal to v1. So I tried doing the same tweak for v1, but that
didn't help much (if at all).
The results seem fairly stable, and the overall trend is clear. It'd be
great if there were no regressions, but considering how narrow this
microbenchmark is (and the benefits for practical COPY runs), I'd say
it's probably OK.
regards
--
Tomas Vondra