Thread: Shared hash table allocations
I'm starting a new thread for this topic that we discussed on the "Better shared data structure management and resizable shared data structures" thread:

On 25/03/2026 20:37, Robert Haas wrote:
> On Sat, Mar 21, 2026 at 8:14 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I wonder if we should set aside a small amount of memory, like 10-20 kB,
>> in the fixed shmem segment for small structs like that. Currently, such
>> allocations will usually succeed because we leave some wiggle room, but
>> they can also be consumed by other things like locks. But we could
>> reserve a small amount of memory specifically for small allocations in
>> extensions like this.
>
> Yeah, I don't really understand why we let the lock table use up that
> space. I mean, I think it would be useful to have a way to let the
> lock table expand without a server restart, and I also suspect that we
> could come up with a less-silly data structure than the PROCLOCK hash,
> but also if the only thing keeping you from running out of lock space
> is the wiggle room, maybe you just need to bump up
> max_locks_per_transaction. Like, you could easily burn through the
> wiggle room, get an error anyway, and then later find that you also
> now can't load certain extensions without a server restart.

Attached patch set tightens up all shared memory hash tables so that they no longer use the "wiggle room". They are now truly fixed size: you specify the number of elements in ShmemInitHash(), and that's it.

This also addresses the accounting issue we currently have with hash tables: the size reported in pg_shmem_allocations only shows the size of the fixed header and the directory, not the actual hash buckets. They were previously all lumped together in the "<anonymous>" section. These patches fix that.

There was an earlier attempt at that last year [1], but it got reverted.
I hope my approach is less invasive: instead of changing dynahash.c to use a single allocation directly, I'm providing it an "alloc" callback that carves out the different parts it needs from the single contiguous shmem area, which in turn is allocated with ShmemInitStruct().

[1] https://www.postgresql.org/message-id/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7%2BhVAqaS5GQ%40mail.gmail.com

- Heikki
Attachments
- v1-0001-Use-ShmemInitStruct-to-allocate-shmem-for-semapho.patch
- v1-0002-Remove-HASH_SEGMENT-option.patch
- v1-0003-Change-the-signature-of-dynahash-s-alloc-function.patch
- v1-0004-Merge-init-and-max-size-options-on-shmem-hash-tab.patch
- v1-0005-Prevent-shared-memory-hash-tables-from-growing-be.patch
- v1-0006-Allocate-all-parts-of-shmem-hash-table-from-a-sin.patch
- v1-0007-Remove-HASH_DIRSIZE-always-use-the-default-algori.patch
- v1-0008-Make-ShmemIndex-visible-in-the-pg_shmem_allocatio.patch
On 3/27/26 23:37, Heikki Linnakangas wrote:
> I'm starting a new thread for this topic that we discussed on the
> "Better shared data structure management and resizable shared data
> structures" thread: [...]
>
> Attached patch set tightens up all shared memory hash tables so that
> they no longer use the "wiggle room". They are now truly fixed size. You
> specify the number of elements in ShmemInitHash(), and that's it.
>
> This also addresses the accounting issue we currently have with hash
> tables, that the size reported in pg_shmem_allocations only shows the
> size of fixed header and the directory, not the actual hash buckets.
> They were previously all lumped together in the "<anonymous>" section.
> These patches fix that.
>
> There was an earlier attempt at that last year [1], but it got
> reverted.
> I hope my approach is less invasive: instead of changing
> dynahash.c to use a single allocation directly, I'm providing it an
> "alloc" callback that carves out the different parts it needs from the
> single contiguous shmem area, which in turn is allocated with
> ShmemInitStruct().

Thanks! That earlier attempt was committed+reverted by me, and I certainly find this new approach much easier to understand (but I thought I understood the old patch too ...).

A couple of comments, based on a quick review:

* 0001 - seems fine

* 0002 - +1 to getting rid of HASH_SEGMENT, but I don't see the point of renaming DEF_SEGSIZE to HASH_SEGSIZE. Isn't that a bit unnecessary?

* 0003 - I'd probably rename CurrentDynaHashCxt to something that doesn't seem like a "global" variable, e.g. "dynahashCxt"

* 0004 - seems fine, +1 to get rid of unused pieces

* 0005 - seems fine

* 0006 - Doesn't this completely change the alignment? ShmemHashAlloc used to call ShmemAllocRaw, which is very careful to use CACHELINEALIGN. But now ShmemHashAlloc just does MAXALIGN, which ShmemAllocRaw claims is not enough on modern systems.

* 0007 - this left one comment referencing HASH_DIRSIZE in dynahash.c

* 0008 - makes sense

regards

--
Tomas Vondra
On 28/03/2026 02:14, Tomas Vondra wrote:
> * 0002 - +1 to getting rid of HASH_SEGMENT, but I don't see the point of
> renaming DEF_SEGSIZE to HASH_SEGSIZE. Isn't that a bit unnecessary?

DEF_SEGSIZE stands for "default segsize", but after this commit it's not merely the default, it's the same hard-coded constant for every hash table. That's why it seems prudent to rename it.

> * 0003 - I'd probably rename CurrentDynaHashCxt to something that
> doesn't seem like a "global" variable, e.g. "dynahashCxt"

Renamed it to "hcxt", as that's what the corresponding field in HTAB is called.

> * 0004 - seems fine, +1 to get rid of unused pieces

To be clear, the init_size/max_size are not completely unused at the moment: the lock manager sets max_size to 2 * init_size, and wait_event.c used constants 16 and 128. The point is that it doesn't give you a very wide range of scalability, and I think it's better not to be flexible in that fashion. I would call it sloppiness rather than flexibility.

> * 0005 - seems fine
>
> * 0006 - Doesn't this completely change the alignment? ShmemHashAlloc
> used to call ShmemAllocRaw, which is very careful to use CACHELINEALIGN.
> But now ShmemHashAlloc just does MAXALIGN, which ShmemAllocRaw claims is
> not enough on modern systems.

dynahash.c allocates multiple elements in each alloc() call, so even though ShmemAllocRaw() returns a cacheline-aligned block, the individual elements were not cacheline-aligned before this patch either. See element_alloc() and choose_nelem_alloc().

> * 0007 - this left one comment referencing HASH_DIRSIZE in dynahash.c

Fixed

- Heikki
Attachments
- v2-0001-Use-ShmemInitStruct-to-allocate-shmem-for-semapho.patch
- v2-0002-Remove-HASH_SEGMENT-option.patch
- v2-0003-Change-the-signature-of-dynahash-s-alloc-function.patch
- v2-0004-Merge-init-and-max-size-options-on-shmem-hash-tab.patch
- v2-0005-Prevent-shared-memory-hash-tables-from-growing-be.patch
- v2-0006-Allocate-all-parts-of-shmem-hash-table-from-a-sin.patch
- v2-0007-Remove-HASH_DIRSIZE-always-use-the-default-algori.patch
- v2-0008-Make-ShmemIndex-visible-in-the-pg_shmem_allocatio.patch
On 30/03/2026 18:28, Heikki Linnakangas wrote:
> To be clear, the init_size/max_size are not completely unused at the
> moment: the lock manager sets max_size to 2 * init_size, and ...
Huh, I only now realized the implications of the above. The formula in
the lock manager is actually:
#define NLOCKENTS() \
mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
...
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
So the initial size is only half of the maximum you get from
max_locks_per_xact * max_connections. That means that if something
consumes the "wiggle room" shared memory, you might get *fewer* locks
than you might expect based on the GUCs.
I've somehow always thought that it's the other way round, that you
might get *more* locks than max_locks_per_xact * max_connections if
you're lucky, but that max_connections * max_locks_per_xact was
guaranteed.
This will change with these patches.
I wonder if we should change the defaults somehow. In usual
configurations, people are currently getting much more lock space than
you'd expect based on max_connections and max_locks_per_transaction, and
after these patches, they'll get much fewer locks. It might be prudent
to bump up the default max_locks_per_transaction setting so that you'd
get roughly the same number of locks in the default configuration.
- Heikki
On 31/03/2026 01:02, Heikki Linnakangas wrote:
> I wonder if we should change the defaults somehow. [...] It might be
> prudent to bump up the default max_locks_per_transaction setting so that
> you'd get roughly the same number of locks in the default configuration.

I did some testing of the memory usage and of how removing the wiggle room affects the number of locks you can acquire. Attached are the test procedures I used, and proposed patches. The patches are new, designed to just change the parameters of the hash tables and shmem calculations with no other changes. They don't include the refactorings we've discussed so far in this thread. My plan is to commit these new patches first, and those other refactorings after that. Once these new patches are committed, the refactorings won't materially change the overall memory usage or how it's divided between different hash tables; all those effects are in these new patches.

master: With the default configuration on master, the attached test procedure can take 14927 locks before hitting the "out of shared memory" error. At that point, all the "wiggle room" is assigned to the LOCK hash table. A different scenario could make the PROCLOCK hash table consume all the wiggle room instead, but I believe running out of LOCK space is more common, and I don't think it changes the big picture anyway if you hit the ceiling with PROCLOCK instead.

0001: While looking at this, I noticed that we add a 10% "safety margin" to the shmem calculations in predicate.c, but we had already marked the predicate.c hash tables as HASH_FIXED_SIZE, so they were never able to make use of the safety margin. Oops. The extra memory was available for the lock.c hash tables, though. After removing that bogus 10% safety margin from predicate.c, memory usage was reduced by 200 kB, but the number of locks you could take went down from 14927 to 14159.

0002: As the next step, I also removed the 10% safety margin from lock.c. That reduced memory usage by another 320 kB, and the number of locks went down from 14159 to 12815.

0003: After those changes, there's only a little extra memory sloshing around that's not accounted to any data structure. ipci.c reserves a constant 100 kB, but that's pretty much it. However, there's still flexibility between the LOCK and the PROCLOCK hash tables. The PROCLOCK hash table is estimated to be 2x the size of the LOCK table, but when it's not, the space can get assigned to the LOCK table instead. In patch 0003 I removed that flexibility by marking them both with HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops the hash tables from using any of the other remaining wiggle room, making them truly fixed-size. This doesn't change the overall shared memory allocated, but the number of locks that the test procedure could acquire went down from 12815 to 8767, mostly because it cannot "steal" space from PROCLOCK anymore.

0004: To buy back that lock manager space in common out-of-the-box situations, I propose to bump up the default for max_locks_per_transaction from 64 to 128. That increases memory usage again by 3216 kB, making it 2696 kB higher than on master (remember that the previous changes reduced memory usage). The number of locks you can take after that is 17535, which is more than on master (14927).

Increasing the default won't affect users who have already set max_locks_per_transaction to a non-default value. They will see that the number of locks they can acquire with their existing configuration is reduced, again because of the lost wiggle room and flexibility between LOCK and PROCLOCK. Not sure if we could or should do something about that. Probably best to just document in the release notes that if you had raised max_locks_per_transaction, you might need to raise it further to be able to accommodate the same number of locks as before.

Here's all that in table form:

| Patch                                 | Shmem (kB) | Locks |
|---------------------------------------|------------|-------|
| master                                | 153560     | 14927 |
| 0001: remove 10% from predicate.c     | 153360     | 14159 |
| 0002: remove 10% from lock.c          | 153040     | 12815 |
| 0003: Make lock.c tables fixed size   | 153040     |  8767 |
| 0004: max_locks_per_transaction=128   | 156256     | 17535 |

This increase in memory usage is not great, but it's not that big in the grand scheme of things. I think it's well worth it, and better than the sloppy scheme we have today.

Any thoughts, objections?

- Heikki
Attachments
On Tue, 31 Mar 2026 at 23:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 31/03/2026 01:02, Heikki Linnakangas wrote:
> > I wonder if we should change the defaults somehow. In usual
> > configurations, people are currently getting much more lock space than
> > you'd expect based on max_connections and max_locks_per_transaction, and
> > after these patches, they'll get much fewer locks. It might be prudent
> > bump up the default max_locks_per_transaction setting so that you'd get
> > roughly the same amount of locks in the default configuration.
>
> master: With the default configuration on master, the attached test
> procedure can take 14927 locks before hitting "out of shared memory"
> error. At that point, all the "wiggle room" is assigned for the LOCK
> hash table. A different scenario could make the PROCLOCK hash table
> consume all the wiggle room instead, but I believe running out of LOCK
> space is more common, and I don't think it changes the big picture
> anyway if you hit the ceiling with PROCLOCK instead.
>
> 0001: [...]
LGTM
> 0002: As the next step, I also removed the 10% safety margin from
> lock.c. That reduced memory usage by another 320 kB, and the number of
> locks went down from 14159 to 12815.
LGTM
> 0003: In patch 0003 I removed that flexibility by marking them both with
> HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops
> the hash tables from using any of the other remaining wiggle room,
> making them truly fixed-size.
I think this patch finally gave me a good reason why PROCLOCK would've
needed to be allocated with double the size of LOCK:
LOCK is (was) initialized with only 50% of its max capacity. If
PROCLOCK was initialized with the same parameters and all spare shmem
is then allocated to other processes, then backends wouldn't be able
to safely use max_locks_per_transaction. To guarantee no OOMs when all
backends use max_locks_per_transaction, PROCLOCK's size must be
doubled to make sure PROCLOCK has sufficient space. (The same isn't
usually an issue for LOCK, because it's very likely backends will
operate on the same tables, and thus will be able to share most of the
LOCK structs.)
Now that LOCK is fully allocated, I think the size doubling can be
removed, or possibly parameterized for those that need it.
> 0004: To buy back that lock manager space in common out-of-the-box
> situations, I propose to bump up the default for
> max_locks_per_transaction from 64 to 128. [...]
> The number of locks you can
> take after that is 17535, which is more than on master (14927).
Note that this is for one backend; with current sizing you could lock
the same 17535 locks in at least one more backend.
Patch LGTM.
> Any thoughts, objections?
Overall, I'm +1 on this change. I do have some general comments
though, at least in part based on discussions in the hackers discord
last year[0]:
1.) We'll need to clearly advertise the changed, more strict behaviour
of the heavy-weight locking system in the release notes.
2.) (Related) We probably should make it easier for DBAs to monitor
lock counts now that we enforce the limit more strictly. This could
take the form of (optional) logging that alerts when a session exceeds
some threshold number of locks in a transaction (e.g. 100% and 200% of
max_locks_per_transaction), or as a metric in
pg_stat_{activity,databases} as total locks taken/max number of locks
taken in a transaction.
3.) (Related) We should probably parameterize the LOCK-to-PROCLOCK
ratio. LOCK is large, and especially on systems with high values of
max_connections (where the additional LOCKs will go unused) the
overhead of carrying all those additional LOCKs would go up to 50% of
the added memory usage (LOCK at 152+24=176B, PROCLOCK at
2*(64+24B)=176B). It'd be nice if we could avoid allocating that
memory.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
[0] starting at
https://discord.com/channels/1258108670710124574/1266090488415654032/1442879718285119518
On 02/04/2026 13:24, Matthias van de Meent wrote:
> I think this patch finally gave me a good reason why PROCLOCK would've
> needed to be allocated with double the size of LOCK:
>
> LOCK is (was) initialized with only 50% of its max capacity. If
> PROCLOCK was initialized with the same parameters and all spare shmem
> is then allocated to other processes, then backends wouldn't be able
> to safely use max_locks_per_transaction. To guarantee no OOMs when all
> backends use max_locks_per_transaction, PROCLOCK's size must be
> doubled to make sure PROCLOCK has sufficient space. (The same isn't
> usually an issue for LOCK, because it's very likely backends will
> operate on the same tables, and thus will be able to share most of the
> LOCK structs.)

Hmm, I don't know if that makes sense. It can happen that you have a lot of backends acquiring the same, smaller set of locks, growing PROCLOCK so that it uses up all the available wiggle room, while LOCK can never grow from its initial size, 1/2 * max_locks_per_transaction * MaxBackends. If the workload then changes so that every backend tries to acquire exactly max_locks_per_transaction locks, but this time each lock is on a different object, you will run out of shared memory at 1/2 the size of what you expected.

The opposite can't happen, because PROCLOCK is always at least as large as LOCK. It doesn't matter what you set PROCLOCK's initial size to: it will grow together with LOCK, and you will not run out of shared memory before PROCLOCK has grown up to max_locks_per_transaction * MaxBackends anyway.

> Now that LOCK is fully allocated, I think the size doubling can be
> removed, or possibly parameterized for those that need it.

I don't think that follows. The 2x factor is pretty arbitrary, but it's still a fair assumption that many backends will be acquiring locks on the same objects, so you need more space in PROCLOCK than in LOCK.

I don't know how true that assumption is. It feels right for OLTP applications. But the situation where I've hit max_locks_per_transaction is when I've tried to create one table with thousands of partitions. Or rather, when I try to *drop* that table. In that situation, there's just one transaction acquiring all the locks, so the PROCLOCK / LOCK ratio is 1.

We could parameterize it, but I feel that's probably overkill and exposing too much detail to users. At the end of the day, if you hit the limit, you just bump up max_locks_per_transaction. If there are two settings, it's more complicated: which one do you change? You probably don't mind wasting the few MB of memory that you could gain by carefully tuning the LOCK / PROCLOCK factor.

- Heikki
On Wed, Apr 1, 2026 at 2:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I did some testing of the memory usage and how removing the wiggle room
> affects the number of locks you can acquire. [...]
>
> This increase in memory usage is not great, but it's not that big in the
> grand scheme of things. I think it's well worth it, and better than the
> sloppy scheme we have today.
>
> Any thoughts, objections?

When we "allocate" shared memory, we are just allocating address space on systems which use mmap. The memory gets allocated only when it is touched. The wiggle room as a whole is never touched during initialization; those pages get allocated only when the wiggle room is used, i.e. when entries beyond the initial number are allocated. By allocating maximal hash tables, I was worried that we would allocate more memory than required. But that's not true, since a 4K memory page fits only 50-60 entries - far fewer than the default configuration permits. Most of the memory for the hash table will be allocated as the entries are used.

The second hazard of increasing the hash table size is that hash table access becomes slower as the table becomes sparse [1]. I don't think it will show up in performance, but it may be worth trying a trivial pgbench run, just to make sure that default performance doesn't regress.

The increase in memory usage is 3MB, which is usually fine. I mean, we didn't hear any complaints when we increased the default size of the shared buffer pool - this is much less than that.

But why do you want to double max_locks_per_transaction? I first thought it's because the hash table size is anyway a power of 2. But the size of the hash table is actually max_locks_per_transaction * (number of backends + number of prepared transactions). What we want is a default max_locks_per_transaction such that 14927 locks are allowed. Playing with max_locks_per_transaction using your script, 109 seems to be the number which gives us 14951 locks. It looks (and is) an odd number. If we are worried about the memory increase, that's the number we should use as the default, and then write a long paragraph about why we chose such an odd-looking number :D.

I think we should highlight the change in default in the release notes, though. Users on the default configuration will notice an increase in memory. If they are using a custom value, they will think of bumping it up. Can we give them some ballpark % by which they should increase their max_locks_per_transaction? E.g. double the number, or something?

I looked at the places where max_locks_per_transaction is used. I don't see any place that needs code updates other than the ones in the patch.

LGTM, overall.

[1] https://ashutoshpg.blogspot.com/2025/07/efficiency-of-sparse-hash-table.html

--
Best Wishes,
Ashutosh Bapat
On Thu, 2 Apr 2026 at 13:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 02/04/2026 13:24, Matthias van de Meent wrote: > > On Tue, 31 Mar 2026 at 23:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> > >> 0003: In patch 0003 I removed that flexibility by marking them both with > >> HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops > >> the hash tables from using any of the other remaining wiggle room, > >> making them truly fixed-size. > > > > I think this patch finally gave me a good reason why PROCLOCK would've > > needed to be allocated with double the sizes of LOCK: > > > > LOCK is (was) initialized with only 50% of its max capacity. If > > PROCLOCK was initialized with the same parameters and all spare shmem > > is then allocated to other processes, then backends wouldn't be able > > to safely use max_locks_per_transaction. To guarantee no OOMs when all > > backends use max_locks_per_transaction, PROCLOCK's size must be > > doubled to make sure PROCLOCK has sufficient space. (The same isn't > > usually an issue for LOCK, because it's very likely backends will > > operate on the same tables, and thus will be able to share most of the > > LOCK structs.) > > Hmm, I don't know if that makes sense. Code and mailing history indicate it's not the reason, but there is no other sane reason why PROCLOCK would *not* be sized to max_locks_per_transaction * MaxBackends. At least with this reasoning the minimum size is exactly that. > It can happen that you have a lot > of backends acquiring the same, smaller set of locks, growing PROCLOCK > so that it uses up all the available wiggle room, and LOCK can never > grow from its initial size, 1/2 * max_locks_per_transactions * > MaxBackends. If the workload then changes so that every backend tries to > acquire exactly max_locks_per_transactions locks, but this time each > lock is on a different object, you will run out of shared memory at 1/2 > the size of what you expected. 
> > The opposite can't happen, because PROCLOCK is always at least as large > as LOCK. It doesn't matter what you set PROCLOCK's initial size to, it > will grow together with LOCK, and you will not run out of shared memory > before PROCLOCK has grown up to max_locks_per_transactions * MaxBackends > anyway. > > > Now that LOCK is fully allocated, I think the size doubling can be > > removed, or possibly parameterized for those that need it. > > I don't think that follows. The 2x factor is pretty arbitrary, but it's > still a fair assumption that many backends will be acquiring locks on > the same objects so you need more space in PROCLOCK than in LOCK. I agree that we'll *probably* have more PROCLOCKs in use than LOCKs. But max_locks_per_transaction (MLPT) to me indicates that it is an indicator of the maximum number of locks taken by a transaction, and transaction locks have a 1:1 correspondence with PROCLOCKs (as long as we ignore fast-path locking). Adjusting that value by an arbitrary factor does not many any sense. The user configured a value X, so we should use that value X. Possibly there could be adjustments we need to make to give ourself some breathing room (it's not uncommon to overallocate by a constant factor to allow evict-after-insert patterns in caches), but I can't explain a blanket doubling of usage "because we have a hunch LOCK usage will be lower than PROCLOCK usage" when the user specified a value that would/should map 1:1 against PROCLOCKs scaling as anything other than plainly wasting memory. > I don't know how true that assumption is. It feels right for OLTP > applications. But the situation where I've hit max_locks_per_transaction > is when I've tried to create one table with thousands or partitions. Or > rather, when I try to *drop* that table. In that situation, there's just > one transaction acquiring all the locks, so the PROCLOCK / LOCK ratio is 1. 
> We could parameterize it, but I feel that's probably overkill and
> exposing too much detail to users. At the end of the day, if you hit the
> limit, you just bump up max_locks_per_transactions.

Or, if it's for DROP, you could use a phased dropping scheme, where you
spread the operation across many transactions by dropping a subset of
the partitions in each transaction. It takes more careful execution and
more time, but it allows you to avoid hitting the limits and starving
other backends of lock slots, and avoids requiring postmaster restarts.

> If there are two
> settings, it's more complicated; which one do you change? You probably
> don't mind wasting the few MB of memory that you could gain by carefully
> tuning the LOCK / PROCLOCK factor.

Yes, that would be more complicated, but we have similar factors
elsewhere (hash_mem_multiplier, various costs, weights). We wouldn't
even have to use a factor; we could just as well use a new, more direct
`max_unique_locks_per_transaction`, which we'd use to scale the LOCK
hash.

Note that with our current default settings we're spending 11kiB
(= 64 * (64+24)) per backend on what I would consider oversized
PROCLOCK allocations. With MLPT=128, that doubles to 22kiB per backend.
Every 50 max_backends, that'd be ~1.1MB of shared memory allocated in
excess of the user's requested configuration.

Kind regards,

Matthias van de Meent
On 02/04/2026 15:55, Ashutosh Bapat wrote:
> When we "allocate" shared memory, we are just allocating space on
> systems which use mmap. The memory gets allocated only when it is
> touched. The wiggle room as a whole is never touched during
> initialization. Those pages get allocated when wiggle room is used -
> i.e. when the entries beyond initial number are allocated. By
> allocating maximal hash tables, I was worried that we will allocate
> more memory than required. But that's not true since a 4K memory page
> fits only 50-60 entries - far less than the default configuration
> permits. Most of the memory for the hash table will be allocated as
> the entries as used.
Hmm, that's a good point about untouched memory not being allocated. I
think it's fine, though.
With small changes on top of the earlier refactorings from this
thread, we could stop pre-allocating all the elements when a shared
memory hash table is created, and have ShmemHashAlloc() allocate them on
the fly, but instead of doing them as anonymous allocations like we do
with ShmemAlloc() today, the allocations could come from the
pre-allocated region dedicated to the hash table. You'd still get the
same determinism and visibility in pg_shmem_allocations, but you could
avoid actually touching the pages until they're needed. Not sure it's
worth the trouble.
> The second hazard of increasing hash table size is the hash table
> access becomes slower as it becomes sparse [1]. I don't think it shows
> up in performance but maybe worth trying a trivial pgbench run, just
> to make sure that default performance doesn't regress.
Interesting, but yeah I don't think that's going to be measurable. I did
some quick testing with a test function that just locks and unlocks
relations:
PG_FUNCTION_INFO_V1(test_lock_bench);
Datum
test_lock_bench(PG_FUNCTION_ARGS)
{
	int32		num_distinct_locks = PG_GETARG_INT32(0);
	int32		num_acquires = PG_GETARG_INT32(1);

	LOCKMODE	lockmode = AccessExclusiveLock;

#define FIRST_RELID 1000000000

	for (int32 i = 0; i < num_acquires; i++)
	{
		Oid			relid = FIRST_RELID + i % num_distinct_locks;

		if (i >= num_distinct_locks)
			UnlockRelationOid(relid, lockmode);

		if (!ConditionalLockRelationOid(relid, lockmode))
		{
			elog(LOG, "could not acquire lock, iteration %d", i);
			break;
		}
	}

	PG_RETURN_VOID();
}
With test_lock_bench(1, 5000000), I don't see any meaningful difference,
i.e. it's within 1-2%, with anything from max_locks_per_transaction=10
to max_locks_per_transaction=128.
With more distinct locks involved, the caching effects might be bigger,
and maybe you'd see a difference because of more or less collisions.
Spot testing some values on my laptop, I don't see anything that would
worry me though.
> The increase in memory usage is 3MB, which is fine usually. I mean, we
> didn't hear any complaints when we increased the default size of the
> shared buffer pool - this is much less than that. But why do you want
> to double the max_locks_per_transaction? I first thought it's because
> the hash table size is anyway a power of 2. But then the size of the
> hash table is actually max_locks_per_transaction * (number of backends
> + number of prepared transactions). What we want is the default
> max_locks_per_transaction such that 14927 locks are allowed. Playing
> with max_locks_per_transaction using your script 109 seems to be the
> number which will give us 14951 locks. It looks (and is) an odd
> number. If we are worried about memory increase, that's the number we
> should use as default and then write a long paragraph about why we
> chose such an odd-looking number :D.
My first thought was actually to set max_locks_per_transaction=100,
making it a nice round number :-). But then the neighboring default of
max_pred_locks_per_transaction=64 looks weird. We could reduce it to
max_pred_locks_per_transaction=50 to make it fit in. But it feels a
little arbitrary to change just for aesthetic reasons.
> I think we should highlight the change in default in the release notes
> though. The users which use default configuration will notice an
> increase in the memory. If they are using a custom value, they will
> think of bumping it up. Can we give them some ballpark % by which they
> should increase their max_locks_per_transaction? E.g. double the
> number or something?
I don't think people who are using the defaults will notice. I'm worried
about the people who have set max_locks_per_transaction manually, and
now effectively get less lock space for the same setting. Yeah, doubling
the previous value is a good rule of thumb.
- Heikki
On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2026 15:55, Ashutosh Bapat wrote:
> > When we "allocate" shared memory, we are just allocating space on
> > systems which use mmap. The memory gets allocated only when it is
> > touched. The wiggle room as a whole is never touched during
> > initialization. Those pages get allocated when wiggle room is used -
> > i.e. when the entries beyond initial number are allocated. By
> > allocating maximal hash tables, I was worried that we will allocate
> > more memory than required. But that's not true since a 4K memory page
> > fits only 50-60 entries - far less than the default configuration
> > permits. Most of the memory for the hash table will be allocated as
> > the entries as used.
>
> Hmm, that's a good point about untouched memory not being allocated. I
> think it's fine, though.
>
> With small changes on top of the earlier refactorings from this
> thread, we could stop pre-allocating all the elements when a shared
> memory hash table is created, and have ShmemHashAlloc() allocate them on
> the fly, but instead of doing them as anonymous allocations like we do
> with ShmemAlloc() today, the allocations could come from the
> pre-allocated region dedicated to the hash table. You'd still get the
> same determinism and visibility in pg_shmem_allocations, but you could
> avoid actually touching the pages until they're needed. Not sure it's
> worth the trouble.
With the shared hash table refactoring + shared memory structure
refactoring + resizable structures, we should be able to get resizable
shared hash tables as well. But that's not required immediately. I feel
large hash tables like the buffer hash table and the lock hash tables
can benefit from this kind of thing.
>
> > The second hazard of increasing hash table size is the hash table
> > access becomes slower as it becomes sparse [1]. I don't think it shows
> > up in performance but maybe worth trying a trivial pgbench run, just
> > to make sure that default performance doesn't regress.
>
> Interesting, but yeah I don't think that's going to be measurable. I did
> some quick testing with a test function that just locks and unlocks
> relations:
>
> PG_FUNCTION_INFO_V1(test_lock_bench);
> Datum
> test_lock_bench(PG_FUNCTION_ARGS)
> {
>     int32       num_distinct_locks = PG_GETARG_INT32(0);
>     int32       num_acquires = PG_GETARG_INT32(1);
>
>     LOCKMODE    lockmode = AccessExclusiveLock;
>
> #define FIRST_RELID 1000000000
>
>     for (int32 i = 0; i < num_acquires; i++)
>     {
>         Oid         relid = FIRST_RELID + i % num_distinct_locks;
>
>         if (i >= num_distinct_locks)
>             UnlockRelationOid(relid, lockmode);
>
>         if (!ConditionalLockRelationOid(relid, lockmode))
>         {
>             elog(LOG, "could not acquire lock, iteration %d", i);
>             break;
>         }
>     }
>
>     PG_RETURN_VOID();
> }
>
> With test_lock_bench(1, 5000000), I don't see any meaningful difference,
> i.e. it's within 1-2 %, with anything from max_locks_per_transactions=10
> to max_locks_per_transactions=128.
>
> With more distinct locks involved, the caching effects might be bigger,
> and maybe you'd see a difference because of more or less collisions.
> Spot testing some values on my laptop, I don't see anything that would
> worry me though.
Great. This agrees with my experiments with the sparse buffer lookup table.
>
> > The increase in memory usage is 3MB, which is fine usually. I mean, we
> > didn't hear any complaints when we increased the default size of the
> > shared buffer pool - this is much less than that. But why do you want
> > to double the max_locks_per_transaction? I first thought it's because
> > the hash table size is anyway a power of 2. But then the size of the
> > hash table is actually max_locks_per_transaction * (number of backends
> > + number of prepared transactions). What we want is the default
> > max_locks_per_transaction such that 14927 locks are allowed. Playing
> > with max_locks_per_transaction using your script 109 seems to be the
> > number which will give us 14951 locks. It looks (and is) an odd
> > number. If we are worried about memory increase, that's the number we
> > should use as default and then write a long paragraph about why we
> > chose such an odd-looking number :D.
>
> My first thought was actually to set max_locks_per_transaction=100,
> making it a nice round number :-). But then the neighboring default of
> max_pred_locks_per_transaction=64 looks weird. We could reduce it
> max_pred_locks_per_transaction=50 to make it fit in. But it feels a
> little arbitrary to change just for aesthetic reasons.
+1. Let's keep it 128 and see if there are complaints. We can set it
to 100 or 109 if the complaints look serious.
--
Best Wishes,
Ashutosh Bapat
On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> > I think we should highlight the change in default in the release notes
> > though. The users which use default configuration will notice an
> > increase in the memory. If they are using a custom value, they will
> > think of bumping it up. Can we give them some ballpark % by which they
> > should increase their max_locks_per_transaction? E.g. double the
> > number or something?
>
> I don't think people who are using the defaults will notice. I'm worried
> about the people who have set max_locks_per_transactions manually, and
> now effectively get less lock space for the same setting. Yeah, doubling
> the previous value is a good rule of thumb.

Users who have set max_locks_per_transaction to a non-default value
but have tuned their application to respect those limits are safe even
after this change, since their lock hash table never used wiggle room.
Those users who weren't careful to respect those limits will need to
bump their setting. I think the release notes should "nudge" all the
users who use non-default max_locks_per_transaction to increase it if
they see "out of memory" errors. I don't think it should provide
blanket advice to double their locks.

--
Best Wishes,
Ashutosh Bapat
On 02/04/2026 19:13, Ashutosh Bapat wrote:
> On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>>> I think we should highlight the change in default in the release notes
>>> though. The users which use default configuration will notice an
>>> increase in the memory. If they are using a custom value, they will
>>> think of bumping it up. Can we give them some ballpark % by which they
>>> should increase their max_locks_per_transaction? E.g. double the
>>> number or something?
>>
>> I don't think people who are using the defaults will notice. I'm worried
>> about the people who have set max_locks_per_transactions manually, and
>> now effectively get less lock space for the same setting. Yeah, doubling
>> the previous value is a good rule of thumb.
>
> Users who have set max_locks_per_transaction to a non-default value
> but have tuned their application to respect those limits are safe even
> after this change, since their lock hash table never used wiggle room.
> Those users who weren't careful to respect those limits will need to
> bump their setting.

That's technically true, but in practice it's very hard for someone to
carefully tune the setting. It's difficult to know how many locks a
particular set of queries take. In practice what people do is they bump
up the setting if they get the "out of shared memory" error, until the
error goes away. If you do the tuning that way, it's quite possible that
you are relying on the "wiggle room" without realizing it.

> I think the release notes should "nudge" all the
> users who use non-default max_locks_per_transaction to increase it if
> they see "out of memory" errors. I don't think it should provide
> blanket advice to double their locks.

How about:

"If you had previously set max_locks_per_transaction, you might need to
set it to a higher value in v19 to avoid "out of shared memory" errors.
If you are unsure what to set it to and don't mind the increased memory
usage, you can double the value to ensure that you can acquire at least
as many locks as before"

TODO: do some more calculations and testing of how exactly the doubling
rule works with different values. Is it guaranteed to be enough in all
cases?

- Heikki
On Thu, Apr 2, 2026 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2026 19:13, Ashutosh Bapat wrote:
> > On Thu, Apr 2, 2026 at 7:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >>> I think we should highlight the change in default in the release notes
> >>> though. The users which use default configuration will notice an
> >>> increase in the memory. If they are using a custom value, they will
> >>> think of bumping it up. Can we give them some ballpark % by which they
> >>> should increase their max_locks_per_transaction? E.g. double the
> >>> number or something?
> >>
> >> I don't think people who are using the defaults will notice. I'm worried
> >> about the people who have set max_locks_per_transactions manually, and
> >> now effectively get less lock space for the same setting. Yeah, doubling
> >> the previous value is a good rule of thumb.
> >
> > Users who have set max_locks_per_transaction to a non-default value
> > but have tuned their application to respect those limits are safe even
> > after this change, since their lock hash table never used wiggle room.
> > Those users who weren't careful to respect those limits will need to
> > bump their setting.
>
> That's technically true, but in practice it's very hard for someone to
> carefully tune the setting. It's difficult to know how many locks a
> particular set of queries take. In practice what people do is they bump
> up the setting if they get the "out of shared memory" error, until the
> error goes away. If you do the tuning that way, it's quite possible that
> you are relying on the "wiggle room" without realizing it.

That's true.

> > I think the release notes should "nudge" all the
> > users who use non-default max_locks_per_transaction to increase it if
> > they see "out of memory" errors. I don't think it should provide
> > blanket advice to double their locks.
>
> How about:
>
> "If you had previously set max_locks_per_transaction, you might need to
> set it to a higher value in v19 to avoid "out of shared memory" errors.
> If you are unsure what to set it to and don't mind the increased memory
> usage, you can double the value to ensure that you can acquire at least
> as many locks as before"

The wiggle room is 100KB fixed + 10% of the other two structures, so the
value by which it should be increased is partly fixed and partly a
multiple of the current value. "Double the value" is the simplest advice
we can give. +1.

--
Best Wishes,
Ashutosh Bapat
On 03/04/2026 16:03, Ashutosh Bapat wrote:
> On Thu, Apr 2, 2026 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I think the release notes should "nudge" all the
>>> users who use non-default max_locks_per_transaction to increase it if
>>> they see "out of memory" errors. I don't think it should provide
>>> blanket advice to double their locks.
>>
>> How about:
>>
>> "If you had previously set max_locks_per_transaction, you might need to
>> set it to a higher value in v19 to avoid "out of shared memory" errors.
>> If you are unsure what to set it to and don't mind the increased memory
>> usage, you can double the value to ensure that you can acquire at least
>> as many locks as before"
>
> The wiggle room is 100KB fixed + 10% of the other two structures, so the
> value by which it should be increased is partly fixed and partly a
> multiple of the current value. "Double the value" is the simplest advice
> we can give. +1.

Ok, committed these patches to remove the safety margins, make LOCK and
PROCLOCK fixed-size, and change the default to
max_locks_per_transaction=128. I will do one final self-review of the
remaining earlier patches from this thread next; I believe they're ready
to be committed too.

Thanks for the review!

- Heikki
On 03/04/2026 20:32, Heikki Linnakangas wrote:
> On 03/04/2026 16:03, Ashutosh Bapat wrote:
>> On Thu, Apr 2, 2026 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>>>> I think the release notes should "nudge" all the
>>>> users who use non-default max_locks_per_transaction to increase it if
>>>> they see "out of memory" errors. I don't think it should provide
>>>> blanket advice to double their locks.
>>>
>>> How about:
>>>
>>> "If you had previously set max_locks_per_transaction, you might need to
>>> set it to a higher value in v19 to avoid "out of shared memory" errors.
>>> If you are unsure what to set it to and don't mind the increased memory
>>> usage, you can double the value to ensure that you can acquire at least
>>> as many locks as before"
>>
>> The wiggle room is 100KB fixed + 10% of the other two structures, so the
>> value by which it should be increased is partly fixed and partly a
>> multiple of the current value. "Double the value" is the simplest advice
>> we can give. +1.
>
> Ok, committed these patches to remove the safety margins, make LOCK and
> PROCLOCK fixed-size, and change the default to
> max_locks_per_transaction=128. I will do one final self-review of the
> remaining earlier patches from this thread next; I believe they're ready
> to be committed too.
>
> Thanks for the review!

And committed the rest of the patches from this thread now too, after
some small fixes and cleanups. Thanks again!

- Heikki