Thread: Shared hash table allocations
I'm starting a new thread for this topic that we discussed on the "Better shared data structure management and resizable shared data structures" thread:

On 25/03/2026 20:37, Robert Haas wrote:
> On Sat, Mar 21, 2026 at 8:14 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I wonder if we should set aside a small amount of memory, like 10-20 kB,
>> in the fixed shmem segment for small structs like that. Currently, such
>> allocations will usually succeed because we leave some wiggle room, but
>> they can also be consumed by other things like locks. But we could
>> reserve a small amount of memory specifically for small allocations in
>> extensions like this.
>
> Yeah, I don't really understand why we let the lock table use up that
> space. I mean, I think it would be useful to have a way to let the
> lock table expand without a server restart, and I also suspect that we
> could come up with a less-silly data structure than the PROCLOCK hash,
> but also if the only thing keeping you from running out of lock space
> is the wiggle room, maybe you just need to bump up
> max_locks_per_transaction. Like, you could easily burn through the
> wiggle room, get an error anyway, and then later find that you also
> now can't load certain extensions without a server restart.

Attached patch set tightens up all shared memory hash tables so that they no longer use the "wiggle room". They are now truly fixed size: you specify the number of elements in ShmemInitHash(), and that's it.

This also addresses the accounting issue we currently have with hash tables: the size reported in pg_shmem_allocations only shows the size of the fixed header and the directory, not the actual hash buckets. They were previously all lumped together in the "<anonymous>" section. These patches fix that.

There was an earlier attempt at that last year [1], but it got reverted.
I hope my approach is less invasive: instead of changing dynahash.c to use a single allocation directly, I'm providing it an "alloc" callback that carves out the different parts it needs from the single contiguous shmem area, which in turn is allocated with ShmemInitStruct().

[1] https://www.postgresql.org/message-id/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7%2BhVAqaS5GQ%40mail.gmail.com

- Heikki
Attachments
- v1-0001-Use-ShmemInitStruct-to-allocate-shmem-for-semapho.patch
- v1-0002-Remove-HASH_SEGMENT-option.patch
- v1-0003-Change-the-signature-of-dynahash-s-alloc-function.patch
- v1-0004-Merge-init-and-max-size-options-on-shmem-hash-tab.patch
- v1-0005-Prevent-shared-memory-hash-tables-from-growing-be.patch
- v1-0006-Allocate-all-parts-of-shmem-hash-table-from-a-sin.patch
- v1-0007-Remove-HASH_DIRSIZE-always-use-the-default-algori.patch
- v1-0008-Make-ShmemIndex-visible-in-the-pg_shmem_allocatio.patch
On 3/27/26 23:37, Heikki Linnakangas wrote:
> I'm starting a new thread for this topic that we discussed on the
> "Better shared data structure management and resizable shared data
> structures" thread: [...]
>
> Attached patch set tightens up all shared memory hash tables so that
> they no longer use the "wiggle room". They are now truly fixed size. You
> specify the number of elements in ShmemInitHash(), and that's it.
>
> This also addresses the accounting issue we currently have with hash
> tables, that the size reported in pg_shmem_allocations only shows the
> size of fixed header and the directory, not the actual hash buckets.
> They were previously all lumped together in the "<anonymous>" section.
> These patches fix that.
>
> There was an earlier attempt at that last year [1], but it got
> reverted.
> I hope my approach is less invasive: instead of changing
> dynahash.c to use a single allocation directly, I'm providing it an
> "alloc" callback that carves out the different parts it needs from the
> single contiguous shmem area, which in turn is allocated with
> ShmemInitStruct().

Thanks! That earlier attempt was committed+reverted by me, and I certainly find this new approach much easier to understand (but I thought I understood the old patch too ...).

A couple of comments, based on a quick review:

* 0001 - seems fine

* 0002 - +1 to getting rid of HASH_SEGMENT, but I don't see the point of renaming DEF_SEGSIZE to HASH_SEGSIZE. Isn't that a bit unnecessary?

* 0003 - I'd probably rename CurrentDynaHashCxt to something that doesn't seem like a "global" variable, e.g. "dynahashCxt"

* 0004 - seems fine, +1 to get rid of unused pieces

* 0005 - seems fine

* 0006 - Doesn't this completely change the alignment? ShmemHashAlloc used to call ShmemAllocRaw, which is very careful to use CACHELINEALIGN. But now ShmemHashAlloc just does MAXALIGN, which ShmemAllocRaw claims is not enough on modern systems.

* 0007 - this left one comment referencing HASH_DIRSIZE in dynahash.c

* 0008 - makes sense

regards

--
Tomas Vondra
On 28/03/2026 02:14, Tomas Vondra wrote:
> * 0002 - +1 to getting rid of HASH_SEGMENT, but I don't see the point of
> renaming DEF_SEGSIZE to HASH_SEGSIZE. Isn't that a bit unnecessary?

DEF_SEGSIZE stands for "default segsize", but after this commit it's not merely the default, it's the same hard-coded constant for every hash table. That's why it seems prudent to rename it.

> * 0003 - I'd probably rename CurrentDynaHashCxt to something that
> doesn't seem like a "global" variable, e.g. "dynahashCxt"

Renamed it to "hcxt", as that's what the corresponding field in HTAB is called.

> * 0004 - seems fine, +1 to get rid of unused pieces

To be clear, the init_size/max_size are not completely unused at the moment: the lock manager sets max_size to 2 * init_size, and wait_event.c used constants 16 and 128. The point is that it doesn't give you a very wide range of scalability, and I think it's better not to be flexible in that fashion. I would call it sloppiness rather than flexibility.

> * 0005 - seems fine
>
> * 0006 - Doesn't this completely change the alignment? ShmemHashAlloc
> used to call ShmemAllocRaw, which is very careful to use CACHELINEALIGN.
> But now ShmemHashAlloc just does MAXALIGN, which ShmemAllocRaw claims is
> not enough on modern systems.

dynahash.c allocates multiple elements in each alloc() call, so even though ShmemAllocRaw() returns a cacheline-aligned block, the individual elements were not cacheline-aligned before this patch either. See element_alloc() and choose_nelem_alloc().

> * 0007 - this left one comment referencing HASH_DIRSIZE in dynahash.c

Fixed

- Heikki
Attachments
- v2-0001-Use-ShmemInitStruct-to-allocate-shmem-for-semapho.patch
- v2-0002-Remove-HASH_SEGMENT-option.patch
- v2-0003-Change-the-signature-of-dynahash-s-alloc-function.patch
- v2-0004-Merge-init-and-max-size-options-on-shmem-hash-tab.patch
- v2-0005-Prevent-shared-memory-hash-tables-from-growing-be.patch
- v2-0006-Allocate-all-parts-of-shmem-hash-table-from-a-sin.patch
- v2-0007-Remove-HASH_DIRSIZE-always-use-the-default-algori.patch
- v2-0008-Make-ShmemIndex-visible-in-the-pg_shmem_allocatio.patch
On 30/03/2026 18:28, Heikki Linnakangas wrote:
> To be clear, the init_size/max_size are not completely unused at the
> moment: the lock manager sets max_size to 2 * init_size, and ...
Huh, I only now realized the implications of the above. The formula in
the lock manager is actually:
#define NLOCKENTS() \
mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
...
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
So the initial size is only half of the maximum you get from
max_locks_per_xact * max_connections. That means that if something
consumes the "wiggle room" shared memory, you might get *fewer* locks
than you might expect based on the GUCs.
I've somehow always thought that it's the other way round, that you
might get *more* locks than max_locks_per_xact * max_connections if
you're lucky, but that max_connections * max_locks_per_xact was
guaranteed.
This will change with these patches.
I wonder if we should change the defaults somehow. In usual
configurations, people are currently getting much more lock space than
you'd expect based on max_connections and max_locks_per_transaction, and
after these patches, they'll get much fewer locks. It might be prudent
to bump up the default max_locks_per_transaction setting so that you'd
get roughly the same number of locks in the default configuration.
- Heikki
On 31/03/2026 01:02, Heikki Linnakangas wrote:
> I wonder if we should change the defaults somehow. [...] It might be
> prudent to bump up the default max_locks_per_transaction setting so that
> you'd get roughly the same number of locks in the default configuration.

I did some testing of the memory usage and of how removing the wiggle room affects the number of locks you can acquire. Attached are the test procedures I used, and proposed patches. The patches are new, designed to just change the parameters of the hash tables and shmem calculations with no other changes. They don't include the refactorings we've discussed so far in this thread. My plan is to commit these new patches first, and those other refactorings after that. Once these new patches are committed, the refactorings won't materially change the overall memory usage or how it's divided between different hash tables; all those effects are in these new patches.

master: With the default configuration on master, the attached test procedure can take 14927 locks before hitting the "out of shared memory" error. At that point, all the "wiggle room" is assigned to the LOCK hash table. A different scenario could make the PROCLOCK hash table consume all the wiggle room instead, but I believe running out of LOCK space is more common, and I don't think it changes the big picture anyway if you hit the ceiling with PROCLOCK instead.

0001: While looking at this, I noticed that we add a 10% "safety margin" to the shmem calculations in predicate.c, but we had already marked the predicate.c hash tables as HASH_FIXED_SIZE, so they were never able to make use of the safety margin. Oops. The extra memory was available for the lock.c hash tables, though. After removing that bogus 10% safety margin from predicate.c, memory usage was reduced by 200 kB, but the number of locks you could take went down from 14927 to 14159.

0002: As the next step, I also removed the 10% safety margin from lock.c. That reduced memory usage by another 320 kB, and the number of locks went down from 14159 to 12815.

0003: After those changes, there's only a little extra memory sloshing around that's not accounted to any data structure. ipci.c reserves a constant 100 kB, but that's pretty much it. However, there's still flexibility between the LOCK and the PROCLOCK hash tables. The PROCLOCK hash table is estimated to be 2x the size of the LOCK table, but when it's not, the space can get assigned to the LOCK table instead. In patch 0003 I removed that flexibility by marking them both with HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops the hash tables from using any of the other remaining wiggle room, making them truly fixed-size. This doesn't change the overall shared memory allocated, but the number of locks that the test procedure could acquire went down from 12815 to 8767, mostly because it cannot "steal" space from PROCLOCK anymore.

0004: To buy back that lock manager space in common out-of-the-box situations, I propose to bump up the default for max_locks_per_transaction from 64 to 128. That increases memory usage again by 3216 kB, making it 2696 kB higher than on master (remember that the previous changes reduced memory usage). The number of locks you can take after that is 17535, which is more than on master (14927).

Increasing the default won't affect users who have already set max_locks_per_transaction to a non-default value. They will see that the number of locks they can acquire with their existing configuration is reduced, again because of the lost wiggle room and flexibility between LOCK and PROCLOCK. Not sure if we could or should do something about that. Probably best to just document in the release notes that if you had raised max_locks_per_transaction, you might need to raise it further to be able to accommodate the same number of locks as before.

Here's all that in table form:

| Patch                                 | Shmem (kB) | Locks |
|---------------------------------------|------------|-------|
| master                                | 153560     | 14927 |
| 0001: remove 10% from predicate.c     | 153360     | 14159 |
| 0002: remove 10% from lock.c          | 153040     | 12815 |
| 0003: Make lock.c tables fixed size   | 153040     |  8767 |
| 0004: max_locks_per_transaction=128   | 156256     | 17535 |

This increase in memory usage is not great, but it's not that big in the grand scheme of things. I think it's well worth it, and better than the sloppy scheme we have today.

Any thoughts, objections?

- Heikki
Attachments
On Tue, 31 Mar 2026 at 23:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 31/03/2026 01:02, Heikki Linnakangas wrote:
> > I wonder if we should change the defaults somehow. In usual
> > configurations, people are currently getting much more lock space than
> > you'd expect based on max_connections and max_locks_per_transaction, and
> > after these patches, they'll get much fewer locks. It might be prudent
> > bump up the default max_locks_per_transaction setting so that you'd get
> > roughly the same amount of locks in the default configuration.
>
> master: With the default configuration on master, the attached test
> procedure can take 14927 locks before hitting "out of shared memory"
> error. At that point, all the "wiggle room" is assigned for the LOCK
> hash table. A different scenario could make the PROCLOCK hash table
> consume all the wiggle room instead, but I believe running out of LOCK
> space is more common, and I don't think it changes the big picture
> anyway if you hit the ceiling with PROCLOCK instead.
>
> 0001: [...]
LGTM
> 0002: As the next step, I also removed the 10% safety margin from
> lock.c. That reduced memory usage by another 320 kB, and the number of
> locks went down from 14159 to 12815.
LGTM
> 0003: In patch 0003 I removed that flexibility by marking them both with
> HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops
> the hash tables from using any of the other remaining wiggle room,
> making them truly fixed-size.
I think this patch finally gave me a good reason why PROCLOCK would've
needed to be allocated with double the size of LOCK:
LOCK is (was) initialized with only 50% of its max capacity. If
PROCLOCK was initialized with the same parameters and all spare shmem
is then allocated to other processes, then backends wouldn't be able
to safely use max_locks_per_transaction. To guarantee no OOMs when all
backends use max_locks_per_transaction, PROCLOCK's size must be
doubled to make sure PROCLOCK has sufficient space. (The same isn't
usually an issue for LOCK, because it's very likely backends will
operate on the same tables, and thus will be able to share most of the
LOCK structs.)
Now that LOCK is fully allocated, I think the size doubling can be
removed, or possibly parameterized for those that need it.
> 0004: To buy back that lock manager space in common out-of-the-box
> situations, I propose to bump up the default for
> max_locks_per_transaction from 64 to 128. [...]
> The number of locks you can
> take after that is 17535, which is more than on master (14927).
Note that this is for one backend; with current sizing you could lock
the same 17535 locks in at least one more backend.
Patch LGTM.
> Any thoughts, objections?
Overall, I'm +1 on this change. I do have some general comments
though, at least in part based on discussions in the hackers discord
last year[0]:
1.) We'll need to clearly advertise the changed, more strict behaviour
of the heavy-weight locking system in the release notes.
2.) (Related) We probably should make it easier for DBAs to monitor
lock counts now that we enforce the limit more strictly. This could
take the form of (optional) logging that alerts when a session exceeds
some threshold number of locks in a transaction (e.g. 100% and 200% of
max_locks_per_transaction), or as a metric in
pg_stat_{activity,databases} as total locks taken/max number of locks
taken in a transaction.
3.) (Related) We should probably parameterize the LOCK-to-PROCLOCK
ratio. LOCK is large, and especially on systems with high values of
max_connections (where the additional LOCKs will go unused) the
overhead of carrying all those additional LOCKs would go up to 50% of
the added memory usage (LOCK at 152+24=176B, PROCLOCK at
2*(64+24B)=176B). It'd be nice if we could avoid allocating that
memory.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
[0] starting at
https://discord.com/channels/1258108670710124574/1266090488415654032/1442879718285119518
On 02/04/2026 13:24, Matthias van de Meent wrote:
> I think this patch finally gave me a good reason why PROCLOCK would've
> needed to be allocated with double the size of LOCK:
>
> LOCK is (was) initialized with only 50% of its max capacity. If
> PROCLOCK was initialized with the same parameters and all spare shmem
> is then allocated to other processes, then backends wouldn't be able
> to safely use max_locks_per_transaction. To guarantee no OOMs when all
> backends use max_locks_per_transaction, PROCLOCK's size must be
> doubled to make sure PROCLOCK has sufficient space. (The same isn't
> usually an issue for LOCK, because it's very likely backends will
> operate on the same tables, and thus will be able to share most of the
> LOCK structs.)

Hmm, I don't know if that makes sense. It can happen that you have a lot of backends acquiring the same, smaller set of locks, growing PROCLOCK so that it uses up all the available wiggle room, while LOCK can never grow from its initial size, 1/2 * max_locks_per_transaction * MaxBackends. If the workload then changes so that every backend tries to acquire exactly max_locks_per_transaction locks, but this time each lock is on a different object, you will run out of shared memory at 1/2 the size of what you expected.

The opposite can't happen, because PROCLOCK is always at least as large as LOCK. It doesn't matter what you set PROCLOCK's initial size to: it will grow together with LOCK, and you will not run out of shared memory before PROCLOCK has grown up to max_locks_per_transaction * MaxBackends anyway.

> Now that LOCK is fully allocated, I think the size doubling can be
> removed, or possibly parameterized for those that need it.

I don't think that follows. The 2x factor is pretty arbitrary, but it's still a fair assumption that many backends will be acquiring locks on the same objects, so you need more space in PROCLOCK than in LOCK.

I don't know how true that assumption is. It feels right for OLTP applications. But the situation where I've hit max_locks_per_transaction is when I've tried to create one table with thousands of partitions. Or rather, when I try to *drop* that table. In that situation, there's just one transaction acquiring all the locks, so the PROCLOCK / LOCK ratio is 1.

We could parameterize it, but I feel that's probably overkill and exposing too much detail to users. At the end of the day, if you hit the limit, you just bump up max_locks_per_transaction. If there are two settings, it's more complicated: which one do you change? You probably don't mind wasting the few MB of memory that you could gain by carefully tuning the LOCK / PROCLOCK factor.

- Heikki
On Wed, Apr 1, 2026 at 2:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I did some testing of the memory usage and how removing the wiggle room
> affects the number of locks you can acquire. [...]
>
> This increase in memory usage is not great, but it's not that big in the
> grand scheme of things. I think it's well worth it, and better than the
> sloppy scheme we have today.
>
> Any thoughts, objections?

When we "allocate" shared memory, we are just allocating address space on systems which use mmap. The memory gets allocated only when it is touched. The wiggle room as a whole is never touched during initialization; those pages get allocated only when the wiggle room is used, i.e. when entries beyond the initial number are allocated. By allocating maximal hash tables, I was worried that we would allocate more memory than required. But that's not true, since a 4K memory page fits only 50-60 entries - far fewer than the default configuration permits. Most of the memory for the hash table will be allocated as the entries are used.

The second hazard of increasing the hash table size is that hash table access becomes slower as the table becomes sparse [1]. I don't think it will show up in performance, but it may be worth trying a trivial pgbench run, just to make sure that default performance doesn't regress.

The increase in memory usage is 3MB, which is usually fine. I mean, we didn't hear any complaints when we increased the default size of the shared buffer pool - this is much less than that.

But why do you want to double max_locks_per_transaction? I first thought it's because the hash table size is anyway a power of 2. But the size of the hash table is actually max_locks_per_transaction * (number of backends + number of prepared transactions). What we want is a default max_locks_per_transaction such that 14927 locks are allowed. Playing with max_locks_per_transaction using your script, 109 seems to be the number which gives us 14951 locks. It looks (and is) an odd number. If we are worried about the memory increase, that's the number we should use as the default, and then write a long paragraph about why we chose such an odd-looking number :D.

I think we should highlight the change in default in the release notes, though. Users on the default configuration will notice an increase in memory. If they are using a custom value, they will think of bumping it up. Can we give them some ballpark % by which they should increase their max_locks_per_transaction? E.g. double the number, or something?

I looked at the places where max_locks_per_transaction is used. I don't see any place that needs code updates other than the ones in the patch.

LGTM, overall.

[1] https://ashutoshpg.blogspot.com/2025/07/efficiency-of-sparse-hash-table.html

--
Best Wishes,
Ashutosh Bapat
On Thu, 2 Apr 2026 at 13:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 02/04/2026 13:24, Matthias van de Meent wrote: > > On Tue, 31 Mar 2026 at 23:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> > >> 0003: In patch 0003 I removed that flexibility by marking them both with > >> HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops > >> the hash tables from using any of the other remaining wiggle room, > >> making them truly fixed-size. > > > > I think this patch finally gave me a good reason why PROCLOCK would've > > needed to be allocated with double the sizes of LOCK: > > > > LOCK is (was) initialized with only 50% of its max capacity. If > > PROCLOCK was initialized with the same parameters and all spare shmem > > is then allocated to other processes, then backends wouldn't be able > > to safely use max_locks_per_transaction. To guarantee no OOMs when all > > backends use max_locks_per_transaction, PROCLOCK's size must be > > doubled to make sure PROCLOCK has sufficient space. (The same isn't > > usually an issue for LOCK, because it's very likely backends will > > operate on the same tables, and thus will be able to share most of the > > LOCK structs.) > > Hmm, I don't know if that makes sense. Code and mailing history indicate it's not the reason, but there is no other sane reason why PROCLOCK would *not* be sized to max_locks_per_transaction * MaxBackends. At least with this reasoning the minimum size is exactly that. > It can happen that you have a lot > of backends acquiring the same, smaller set of locks, growing PROCLOCK > so that it uses up all the available wiggle room, and LOCK can never > grow from its initial size, 1/2 * max_locks_per_transactions * > MaxBackends. If the workload then changes so that every backend tries to > acquire exactly max_locks_per_transactions locks, but this time each > lock is on a different object, you will run out of shared memory at 1/2 > the size of what you expected. 
> > The opposite can't happen, because PROCLOCK is always at least as large > as LOCK. It doesn't matter what you set PROCLOCK's initial size to, it > will grow together with LOCK, and you will not run out of shared memory > before PROCLOCK has grown up to max_locks_per_transactions * MaxBackends > anyway. > > > Now that LOCK is fully allocated, I think the size doubling can be > > removed, or possibly parameterized for those that need it. > > I don't think that follows. The 2x factor is pretty arbitrary, but it's > still a fair assumption that many backends will be acquiring locks on > the same objects so you need more space in PROCLOCK than in LOCK. I agree that we'll *probably* have more PROCLOCKs in use than LOCKs. But max_locks_per_transaction (MLPT) to me indicates that it is an indicator of the maximum number of locks taken by a transaction, and transaction locks have a 1:1 correspondence with PROCLOCKs (as long as we ignore fast-path locking). Adjusting that value by an arbitrary factor does not many any sense. The user configured a value X, so we should use that value X. Possibly there could be adjustments we need to make to give ourself some breathing room (it's not uncommon to overallocate by a constant factor to allow evict-after-insert patterns in caches), but I can't explain a blanket doubling of usage "because we have a hunch LOCK usage will be lower than PROCLOCK usage" when the user specified a value that would/should map 1:1 against PROCLOCKs scaling as anything other than plainly wasting memory. > I don't know how true that assumption is. It feels right for OLTP > applications. But the situation where I've hit max_locks_per_transaction > is when I've tried to create one table with thousands or partitions. Or > rather, when I try to *drop* that table. In that situation, there's just > one transaction acquiring all the locks, so the PROCLOCK / LOCK ratio is 1. 
> We could parameterize it, but I feel that's probably overkill and
> exposing too much detail to users. At the end of the day, if you hit the
> limit, you just bump up max_locks_per_transactions.

Or, if it's for DROP, you could use a phased dropping scheme, where you
spread the operation across many transactions by dropping a subset of
the partitions in each transaction. It takes more careful execution and
more time, but it allows you to avoid hitting the limits and starving
other backends of lock slots, and avoids requiring postmaster restarts.

> If there are two
> settings, it's more complicated; which one do you change? You probably
> don't mind wasting the few MB of memory that you could gain by carefully
> tuning the LOCK / PROCLOCK factor.

Yes, that would be more complicated, but we have similar factors
elsewhere (hash_mem_multiplier, various costs, weights). We wouldn't
even have to use a factor; we could just as well use a new, more direct
`max_unique_locks_per_transaction`, which we'd use to scale the LOCK
hash.

Note that with our current default settings we're spending 11kiB
(= 64 * (64+24)) per backend on what I would consider oversized
PROCLOCK allocations. With MLPT=128, that doubles to 22kiB per backend.
Every 50 max_backends, that'd be ~1.1MB of shared memory allocated in
excess of the user's requested configuration.

Kind regards,

Matthias van de Meent
On 02/04/2026 15:55, Ashutosh Bapat wrote:
> When we "allocate" shared memory, we are just allocating space on
> systems which use mmap. The memory gets allocated only when it is
> touched. The wiggle room as a whole is never touched during
> initialization. Those pages get allocated when wiggle room is used -
> i.e. when the entries beyond initial number are allocated. By
> allocating maximal hash tables, I was worried that we will allocate
> more memory than required. But that's not true since a 4K memory page
> fits only 50-60 entries - far less than the default configuration
> permits. Most of the memory for the hash table will be allocated as
> the entries as used.
Hmm, that's a good point about untouched memory not being allocated. I
think it's fine, though.
With small changes on top of the earlier refactorings from this
thread, we could stop pre-allocating all the elements when a shared
memory hash table is created, and have ShmemHashAlloc() allocate them on
the fly, but instead of doing them as anonymous allocations like we do
with ShmemAlloc() today, the allocations could come from the
pre-allocated region dedicated to the hash table. You'd still get the
same determinism and visibility in pg_shmem_allocations, but you could
avoid actually touching the pages until they're needed. Not sure it's
worth the trouble.
> The second hazard of increasing hash table size is the hash table
> access becomes slower as it becomes sparse [1]. I don't think it shows
> up in performance but maybe worth trying a trivial pgbench run, just
> to make sure that default performance doesn't regress.
Interesting, but yeah I don't think that's going to be measurable. I did
some quick testing with a test function that just locks and unlocks
relations:
PG_FUNCTION_INFO_V1(test_lock_bench);
Datum
test_lock_bench(PG_FUNCTION_ARGS)
{
	int32		num_distinct_locks = PG_GETARG_INT32(0);
	int32		num_acquires = PG_GETARG_INT32(1);

	LOCKMODE	lockmode = AccessExclusiveLock;

#define FIRST_RELID 1000000000

	for (int32 i = 0; i < num_acquires; i++)
	{
		Oid			relid = FIRST_RELID + i % num_distinct_locks;

		if (i >= num_distinct_locks)
			UnlockRelationOid(relid, lockmode);

		if (!ConditionalLockRelationOid(relid, lockmode))
		{
			elog(LOG, "could not acquire lock, iteration %d", i);
			break;
		}
	}

	PG_RETURN_VOID();
}
With test_lock_bench(1, 5000000), I don't see any meaningful difference,
i.e. it's within 1-2%, with anything from max_locks_per_transaction=10
to max_locks_per_transaction=128.
With more distinct locks involved, the caching effects might be bigger,
and maybe you'd see a difference because of more or less collisions.
Spot testing some values on my laptop, I don't see anything that would
worry me though.
> The increase in memory usage is 3MB, which is fine usually. I mean, we
> didn't hear any complaints when we increased the default size of the
> shared buffer pool - this is much less than that. But why do you want
> to double the max_locks_per_transaction? I first thought it's because
> the hash table size is anyway a power of 2. But then the size of the
> hash table is actually max_locks_per_transaction * (number of backends
> + number of prepared transactions). What we want is the default
> max_locks_per_transaction such that 14927 locks are allowed. Playing
> with max_locks_per_transaction using your script 109 seems to be the
> number which will give us 14951 locks. It looks (and is) an odd
> number. If we are worried about memory increase, that's the number we
> should use as default and then write a long paragraph about why we
> chose such an odd-looking number :D.
My first thought was actually to set max_locks_per_transaction=100,
making it a nice round number :-). But then the neighboring default of
max_pred_locks_per_transaction=64 looks weird. We could reduce it to
max_pred_locks_per_transaction=50 to make it fit in. But it feels a
little arbitrary to change just for aesthetic reasons.
> I think we should highlight the change in default in the release notes
> though. The users which use default configuration will notice an
> increase in the memory. If they are using a custom value, they will
> think of bumping it up. Can we give them some ballpark % by which they
> should increase their max_locks_per_transaction? E.g. double the
> number or something?
I don't think people who are using the defaults will notice. I'm worried
about the people who have set max_locks_per_transaction manually, and
now effectively get less lock space for the same setting. Yeah, doubling
the previous value is a good rule of thumb.
- Heikki
On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2026 15:55, Ashutosh Bapat wrote:
> > When we "allocate" shared memory, we are just allocating space on
> > systems which use mmap. The memory gets allocated only when it is
> > touched. The wiggle room as a whole is never touched during
> > initialization. Those pages get allocated when wiggle room is used -
> > i.e. when the entries beyond initial number are allocated. By
> > allocating maximal hash tables, I was worried that we will allocate
> > more memory than required. But that's not true since a 4K memory page
> > fits only 50-60 entries - far less than the default configuration
> > permits. Most of the memory for the hash table will be allocated as
> > the entries as used.
>
> Hmm, that's a good point about untouched memory not being allocated. I
> think it's fine, though.
>
> With small changes on top of the earlier refactorings from this
> thread, we could stop pre-allocating all the elements when a shared
> memory hash table is created, and have ShmemHashAlloc() allocate them on
> the fly, but instead of doing them as anonymous allocations like we do
> with ShmemAlloc() today, the allocations could come from the
> pre-allocated region dedicated to the hash table. You'd still get the
> same determinism and visibility in pg_shmem_allocations, but you could
> avoid actually touching the pages until they're needed. Not sure it's
> worth the trouble.
With the shared hash table refactoring + shared memory structure
refactoring + resizable structures, we should be able to get resizable
shared hash tables as well. But that's not required immediately. I feel
large hash tables like the buffer hash table and the lock hash tables
can benefit from this kind of thing.
>
> > The second hazard of increasing hash table size is the hash table
> > access becomes slower as it becomes sparse [1]. I don't think it shows
> > up in performance but maybe worth trying a trivial pgbench run, just
> > to make sure that default performance doesn't regress.
>
> Interesting, but yeah I don't think that's going to be measurable. I did
> some quick testing with a test function that just locks and unlocks
> relations:
>
> PG_FUNCTION_INFO_V1(test_lock_bench);
> Datum
> test_lock_bench(PG_FUNCTION_ARGS)
> {
>     int32       num_distinct_locks = PG_GETARG_INT32(0);
>     int32       num_acquires = PG_GETARG_INT32(1);
>
>     LOCKMODE    lockmode = AccessExclusiveLock;
>
> #define FIRST_RELID 1000000000
>
>     for (int32 i = 0; i < num_acquires; i++)
>     {
>         Oid         relid = FIRST_RELID + i % num_distinct_locks;
>
>         if (i >= num_distinct_locks)
>             UnlockRelationOid(relid, lockmode);
>
>         if (!ConditionalLockRelationOid(relid, lockmode))
>         {
>             elog(LOG, "could not acquire lock, iteration %d", i);
>             break;
>         }
>     }
>
>     PG_RETURN_VOID();
> }
>
> With test_lock_bench(1, 5000000), I don't see any meaningful difference,
> i.e. it's within 1-2 %, with anything from max_locks_per_transactions=10
> to max_locks_per_transactions=128.
>
> With more distinct locks involved, the caching effects might be bigger,
> and maybe you'd see a difference because of more or less collisions.
> Spot testing some values on my laptop, I don't see anything that would
> worry me though.
Great. This agrees with my experiments with the sparse buffer lookup table.
>
> > The increase in memory usage is 3MB, which is fine usually. I mean, we
> > didn't hear any complaints when we increased the default size of the
> > shared buffer pool - this is much less than that. But why do you want
> > to double the max_locks_per_transaction? I first thought it's because
> > the hash table size is anyway a power of 2. But then the size of the
> > hash table is actually max_locks_per_transaction * (number of backends
> > + number of prepared transactions). What we want is the default
> > max_locks_per_transaction such that 14927 locks are allowed. Playing
> > with max_locks_per_transaction using your script 109 seems to be the
> > number which will give us 14951 locks. It looks (and is) an odd
> > number. If we are worried about memory increase, that's the number we
> > should use as default and then write a long paragraph about why we
> > chose such an odd-looking number :D.
>
> My first thought was actually to set max_locks_per_transaction=100,
> making it a nice round number :-). But then the neighboring default of
> max_pred_locks_per_transaction=64 looks weird. We could reduce it
> max_pred_locks_per_transaction=50 to make it fit in. But it feels a
> little arbitrary to change just for aesthetic reasons.
+1. Let's keep it 128 and see if there are complaints. We can set it
to 100 or 109 if the complaints look serious.
--
Best Wishes,
Ashutosh Bapat
On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> > I think we should highlight the change in default in the release notes
> > though. The users which use default configuration will notice an
> > increase in the memory. If they are using a custom value, they will
> > think of bumping it up. Can we give them some ballpark % by which they
> > should increase their max_locks_per_transaction? E.g. double the
> > number or something?
>
> I don't think people who are using the defaults will notice. I'm worried
> about the people who have set max_locks_per_transactions manually, and
> now effectively get less lock space for the same setting. Yeah, doubling
> the previous value is a good rule of thumb.

Users who have set max_locks_per_transaction to a non-default value
but have tuned their application to respect those limits are safe even
after this change, since their lock hash table never used wiggle room.
Those users who weren't careful to respect those limits will need to
bump their setting. I think the release notes should "nudge" all the
users who use non-default max_locks_per_transaction to increase it if
they see "out of memory" errors. I don't think it should provide
blanket advice to double their locks.

--
Best Wishes,
Ashutosh Bapat
On 02/04/2026 19:13, Ashutosh Bapat wrote:
> On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>>> I think we should highlight the change in default in the release notes
>>> though. The users which use default configuration will notice an
>>> increase in the memory. If they are using a custom value, they will
>>> think of bumping it up. Can we give them some ballpark % by which they
>>> should increase their max_locks_per_transaction? E.g. double the
>>> number or something?
>>
>> I don't think people who are using the defaults will notice. I'm worried
>> about the people who have set max_locks_per_transactions manually, and
>> now effectively get less lock space for the same setting. Yeah, doubling
>> the previous value is a good rule of thumb.
>
> Users who have set max_locks_per_transaction to a non-default value
> but have tuned their application to respect those limits are safe even
> after this change, since their lock hash table never used wiggle room.
> Those users who weren't careful to respect those limits will need to
> bump their setting.

That's technically true, but in practice it's very hard for someone to
carefully tune the setting. It's difficult to know how many locks a
particular set of queries take. In practice what people do is they bump
up the setting if they get the "out of shared memory" error, until the
error goes away. If you do the tuning that way, it's quite possible that
you are relying on the "wiggle room" without realizing it.

> I think the release notes should "nudge" all the
> users who use non-default max_locks_per_transaction to increase it if
> they see "out of memory" errors. I don't think it should provide
> blanket advice to double their locks.

How about:

"If you had previously set max_locks_per_transaction, you might need to
set it to a higher value in v19 to avoid "out of shared memory" errors.
If you are unsure what to set it to and don't mind the increased memory
usage, you can double the value to ensure that you can acquire at least
as many locks as before"

TODO: do some more calculations and testing of how exactly the doubling
rule works with different values. Is it guaranteed to be enough in all
cases?

- Heikki
On Thu, Apr 2, 2026 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2026 19:13, Ashutosh Bapat wrote:
> > On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >>> I think we should highlight the change in default in the release notes
> >>> though. The users which use default configuration will notice an
> >>> increase in the memory. If they are using a custom value, they will
> >>> think of bumping it up. Can we give them some ballpark % by which they
> >>> should increase their max_locks_per_transaction? E.g. double the
> >>> number or something?
> >>
> >> I don't think people who are using the defaults will notice. I'm worried
> >> about the people who have set max_locks_per_transactions manually, and
> >> now effectively get less lock space for the same setting. Yeah, doubling
> >> the previous value is a good rule of thumb.
> >
> > Users who have set max_locks_per_transaction to a non-default value
> > but have tuned their application to respect those limits are safe even
> > after this change, since their lock hash table never used wiggle room.
> > Those users who weren't careful to respect those limits will need to
> > bump their setting.
>
> That's technically true, but in practice it's very hard for someone to
> carefully tune the setting. It's difficult to know how many locks a
> particular set of queries take. In practice what people do is they bump
> up the setting if they get the "out of shared memory" error, until the
> error goes away. If you do the tuning that way, it's quite possible that
> you are relying on the "wiggle room" without realizing it.

That's true.

> > I think the release notes should "nudge" all the
> > users who use non-default max_locks_per_transaction to increase it if
> > they see "out of memory" errors. I don't think it should provide
> > blanket advice to double their locks.
>
> How about:
>
> "If you had previously set max_locks_per_transaction, you might need to
> set it to a higher value in v19 to avoid "out of shared memory" errors.
> If you are unsure what to set it to and don't mind the increased memory
> usage, you can double the value to ensure that you can acquire at least
> as many locks as before"

The wiggle room is 100KB fixed + 10% of the other two structures, so the
value by which it should be increased is partly fixed and partly a
multiple of the current value. "Double the value" is the simplest advice
we can give. +1.

--
Best Wishes,
Ashutosh Bapat
On 03/04/2026 16:03, Ashutosh Bapat wrote:
> On Thu, Apr 2, 2026 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I think the release notes should "nudge" all the
>>> users who use non-default max_locks_per_transaction to increase it if
>>> they see "out of memory" errors. I don't think it should provide
>>> blanket advice to double their locks.
>>
>> How about:
>>
>> "If you had previously set max_locks_per_transaction, you might need to
>> set it to a higher value in v19 to avoid "out of shared memory" errors.
>> If you are unsure what to set it to and don't mind the increased memory
>> usage, you can double the value to ensure that you can acquire at least
>> as many locks as before"
>
> The wiggle room is 100KB fixed + 10% of the other two structures, so the
> value by which it should be increased is partly fixed and partly a
> multiple of the current value. "Double the value" is the simplest advice
> we can give. +1.

Ok, committed these patches to remove the safety margins, make LOCK and
PROCLOCK fixed-size, and change the default to
max_locks_per_transaction=128. I will do one final self-review of the
remaining earlier patches from this thread next; I believe they're ready
to be committed too.

Thanks for the review!

- Heikki
On 03/04/2026 20:32, Heikki Linnakangas wrote:
> On 03/04/2026 16:03, Ashutosh Bapat wrote:
>> On Thu, Apr 2, 2026 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>>>> I think the release notes should "nudge" all the
>>>> users who use non-default max_locks_per_transaction to increase it if
>>>> they see "out of memory" errors. I don't think it should provide
>>>> blanket advice to double their locks.
>>>
>>> How about:
>>>
>>> "If you had previously set max_locks_per_transaction, you might need to
>>> set it to a higher value in v19 to avoid "out of shared memory" errors.
>>> If you are unsure what to set it to and don't mind the increased memory
>>> usage, you can double the value to ensure that you can acquire at least
>>> as many locks as before"
>>
>> The wiggle room is 100KB fixed + 10% of the other two structures, so the
>> value by which it should be increased is partly fixed and partly a
>> multiple of the current value. "Double the value" is the simplest advice
>> we can give. +1.
>
> Ok, committed these patches to remove the safety margins, make LOCK and
> PROCLOCK fixed-size, and change the default to
> max_locks_per_transaction=128. I will do one final self-review of the
> remaining earlier patches from this thread next; I believe they're ready
> to be committed too.
>
> Thanks for the review!

And committed the rest of the patches from this thread now too, after
some small fixes and cleanups. Thanks again!

- Heikki