Discussion: POC: make mxidoff 64 bits
Hi!
I've been trying to introduce 64-bit transaction identifiers to Postgres for quite a while [0]. All this implies,
of course, an enormous amount of change that will have to take place in various modules. As a result, the
patch set became too big to be committed "at once".
The obvious solution was to split the patch set into smaller ones. But here comes a new challenge:
not every one of these parts makes Postgres better on its own. Actually, even switching to a
FullTransactionId in PGPROC has the disadvantage of increasing WAL size [1].
In fact, I believe we're dealing with a chicken-and-egg problem here: we are not able to commit the full patch set,
since it is too big to handle, and not able to commit parts of it, since they only make sense all together and do not
improve Postgres on their own.
But it's not that bad. Since commit 4ed8f0913bfdb5f, added in [3], we are able to use 64-bit
indexing for SLRUs.
PROPOSAL
Make multixact offsets 64 bit.
RATIONALE
It is not very difficult to overflow a 32-bit mxidoff, since a new multixact is created for every unique combination of
transactions for each tuple, including XIDs and their respective flags. And when a transaction is added to a
specific multixact, it is rewritten with a new offset. In other words, it is possible to overflow the offsets of
multixacts without overflowing the multixacts themselves and/or ordinary transactions. I believe there
was something about this on the hackers mailing list, but I could not find the links now.
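For a rough sense of scale (my own back-of-the-envelope figures, not from the thread): every new multixact with n members advances the offset counter by n, so with an average of four members per multixact the 2^32 offset space is exhausted after only about 2^30 multixacts, a quarter of the 32-bit multixact ID space.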
PFA the patch. This is a WIP version; upgrade machinery should be added later.
As always, any opinions on the subject are very welcome!
--
Best regards,
Maxim Orlov.
Attachments
On 23/04/2024 11:23, Maxim Orlov wrote:
> PROPOSAL
> Make multixact offsets 64 bit.
+1, this is a good next step and useful regardless of 64-bit XIDs.
> @@ -156,7 +148,7 @@
> ((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
>
> /* page in which a member is to be found */
> -#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
> +#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
> #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
>
> /* Location (byte offset within page) of flag word for a given member */
This is really a bug fix. It didn't matter when TransactionId and MultiXactOffset were both typedefs of uint32, but it was always wrong. The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
--
Heikki Linnakangas
Neon (https://neon.tech)
> On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Make multixact offsets 64 bit.
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("multixact \"members\" limit exceeded"),
Personally, I'd be happy with this! We had some incidents where the only mitigation was vacuum settings tweaking.
BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?
Best regards, Andrey Borodin.
Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL size by a few bytes should have very little impact with today's disk performance. (The WAL volume from logical replication has also grown a lot, and logical replication is a milestone feature that the community keeps improving.) I believe removing PostgreSQL's troublesome transaction ID wraparound would also be a milestone feature; adding a few bytes is worth it!
Best regards
On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
Yeah, I always thought so too. I believe this is just copy-paste. Do you mean it is worth creating a separate CF
entry for these fixes?
On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?
Actually, the first versions of the 64xid patch set had such casts to the types TransactionId, MultiXactId and so on. But
after some discussions, we switched to the unsigned long long cast. Unfortunately, I could not find an exact link
to that discussion. On the other hand, such casting is already used throughout the code. So, just for the
sake of consistency, I would like to stay with these casts.
On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL size by a few bytes should have very little impact with today's disk performance. (The WAL volume from logical replication has also grown a lot, and logical replication is a milestone feature that the community keeps improving.) I believe removing PostgreSQL's troublesome transaction ID wraparound would also be a milestone feature; adding a few bytes is worth it!
I agree 100%. Maybe I should return to this approach and find some benefits of having FXIDs in WAL.
On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> Here is rebase. Apparently I'll have to do it often, since the CATALOG_VERSION_NO changed in the patch.
I don't think you need to maintain the CATALOG_VERSION_NO change in your patch, for the exact reason you have mentioned: the patch will get a conflict each time CATALOG_VERSION_NO is advanced. It's the responsibility of the committer to advance CATALOG_VERSION_NO when needed.
------
Regards,
Alexander Korotkov
Supabase
On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I don't think you need to maintain the CATALOG_VERSION_NO change in your
patch, for the exact reason you have mentioned: the patch will get a conflict
each time CATALOG_VERSION_NO is advanced. It's the responsibility of the
committer to advance CATALOG_VERSION_NO when needed.
OK, I got it. My intention here was to help with testing the patch: if someone wants to have a
look at the patch, they won't need to make changes in the code. In the next iteration, I'll
remove the CATALOG_VERSION_NO change.
Best regards,
Maxim Orlov.
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that holds when the offset becomes 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside lock.
There might be other places we do the same as well.
Regards,
Pavel Borisov
Supabase
On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> Hi, Maxim!
> Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that holds when the offset becomes 64-bit.
> E.g. GetNewMultiXactId():
> nextOffset = MultiXactState->nextOffset;
> is outside lock.
> There might be other places we do the same as well.
I think the replacement of plain assignments by pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.
(The same I think is needed for the patchset [1])
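For illustration, a minimal sketch of what that replacement could look like (the field re-declaration and the call sites are my assumption of the shape of the change, not code from the patchset):

    #include "port/atomics.h"

    /* in MultiXactStateData: was "MultiXactOffset nextOffset" */
    pg_atomic_uint64 nextOffset;

    /* at startup, while access is still single-threaded */
    pg_atomic_init_u64(&MultiXactState->nextOffset, 0);

    /* unlocked reader, e.g. in GetNewMultiXactId() */
    nextOffset = pg_atomic_read_u64(&MultiXactState->nextOffset);

    /* writer, still under MultiXactGenLock held exclusively */
    pg_atomic_write_u64(&MultiXactState->nextOffset, nextOffset + nmembers);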
Regards,
Pavel Borisov
On 2024-Sep-12, Pavel Borisov wrote:
> Hi, Maxim!
>
> Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that holds when the offset becomes 64-bit.
> E.g. GetNewMultiXactId():
>
> nextOffset = MultiXactState->nextOffset;
> is outside lock.
Good thought. But fortunately I think it's not a problem. The one you mention is with MultiXactGenLock held in shared mode -- and that works OK, as the assignment (in line 1263 at the bottom of the same routine) is done with the exclusive lock held.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 07/09/2024 07:36, Maxim Orlov wrote:
> Here is v3.
MultiXactMemberFreezeThreshold looks quite bogus now. Now that MaxMultiXactOffset == 2^64-1, you cannot get anywhere near the MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I guess it would still be useful to trigger autovacuum if multixact members grow large though, to release the disk space, even if you can't run out of members as such anymore. What should the logic for that look like?
I'd love to see some tests for the pg_upgrade code. Something like a little perl script to generate test clusters with different wraparound scenarios etc. using the old version, and a TAP test to run pg_upgrade on them and verify that queries on the upgraded cluster work correctly. We don't have tests like that in the repository today, and I don't know if we'd want to commit these permanently either, but it would be highly useful now as a one-off thing, to show that the code works.
On upgrade, are there really no changes required to pg_multixact/members? I imagined that the segment files would need to be renamed around wraparound, so that if you previously had files like this:
pg_multixact/members/FFFE
pg_multixact/members/FFFF
pg_multixact/members/0000
pg_multixact/members/0001
after upgrade you would need to have:
pg_multixact/members/00000000FFFE
pg_multixact/members/00000000FFFF
pg_multixact/members/000000010000
pg_multixact/members/000000010001
Thanks for working on this!
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi Maxim Orlov
> After a bit of thought, I've realized that to be conservative here is the way to go.
> We can reuse a maximum of existing logic. I mean, we can remove offset wraparound "error logic" and reuse "warning logic", but set the threshold for the "warning logic" to a much higher value. For now, I chose 2^32-1. In other words, the legit logic, in my view, would be to trigger autovacuum if the number of offsets (i.e. the difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
Good point, couldn't agree with you more. xid64 is the solution to the wraparound problem, so the previous error log is no longer meaningful. But we might want to refine the warning log output a little (for example, checking the underlying reasons why the age has been increasing), even though we no longer have to worry about xid wraparound.
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
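For context, this is how the threshold ends up being consumed (the check quoted later in this thread, from the end of SetOffsetVacuumLimit()):

    /*
     * Do we need autovacuum? If we're not sure, assume yes.
     */
    return !oldestOffsetKnown ||
        (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);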
Can we refine this comment a bit? For example:
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space and reduce table bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Thanks
On Wed, 23 Oct 2024 at 23:55, Maxim Orlov <orlovmg@gmail.com> wrote:
After a bit of thought, I've realized that to be conservative here is the way to go.
We can reuse a maximum of existing logic. I mean, we can remove offset wraparound "error logic" and reuse "warning logic", but set the threshold for the "warning logic" to a much higher value. For now, I chose 2^32-1. In other words, the legit logic, in my view, would be to trigger autovacuum if the number of offsets (i.e. the difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
--
Best regards,
Maxim Orlov.
On 13/11/2024 17:44, Maxim Orlov wrote:
> On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On a different note, I'm surprised you're rewriting member segments from
> scratch, parsing all the individual member groups and writing them out
> again. There's no change to the members file format, except for the
> numbering of the files, so you could just copy the files under the new
> names without paying attention to the contents. It's not wrong to parse
> them in detail, but I'd assume that it would be simpler not to.
>
> Yes, at the beginning I also thought that it would be possible to get by
> with simple copying. But in case of wraparound, we must "bypass" the invalid
> zero offset value. See, old 32-bit offsets are wrapped at 2^32, thus 0
> values appear in multixact.c. So, they must be handled. Bypassed, in fact.
> When we switch to the 64-bit offsets, we have two options:
> 1). Bypass every ((uint32) offset == 0) value in multixact.c;
> 2). Convert members and bypass the invalid value once.
>
> The first option seems too weird for me. So, we have to repack members
> and bypass the invalid value.
Hmm, so if I understand correctly, this is related to how we determine the length of the members array, by looking at the next multixid's offset. This is explained in GetMultiXactIdMembers:
> /*
> * Find out the offset at which we need to start reading MultiXactMembers
> * and the number of members in the multixact. We determine the latter as
> * the difference between this multixact's starting offset and the next
> * one's. However, there are some corner cases to worry about:
> *
> * 1. This multixact may be the latest one created, in which case there is
> * no next one to look at. In this case the nextOffset value we just
> * saved is the correct endpoint.
> *
> * 2. The next multixact may still be in process of being filled in: that
> * is, another process may have done GetNewMultiXactId but not yet written
> * the offset entry for that ID. In that scenario, it is guaranteed that
> * the offset entry for that multixact exists (because GetNewMultiXactId
> * won't release MultiXactGenLock until it does) but contains zero
> * (because we are careful to pre-zero offset pages). Because
> * GetNewMultiXactId will never return zero as the starting offset for a
> * multixact, when we read zero as the next multixact's offset, we know we
> * have this case. We handle this by sleeping on the condition variable
> * we have just for this; the process in charge will signal the CV as soon
> * as it has finished writing the multixact offset.
> *
> * 3. Because GetNewMultiXactId increments offset zero to offset one to
> * handle case #2, there is an ambiguity near the point of offset
> * wraparound. If we see next multixact's offset is one, is that our
> * multixact's actual endpoint, or did it end at zero with a subsequent
> * increment? We handle this using the knowledge that if the zero'th
> * member slot wasn't filled, it'll contain zero, and zero isn't a valid
> * transaction ID so it can't be a multixact member. Therefore, if we
> * read a zero from the members array, just ignore it.
> *
> * This is all pretty messy, but the mess occurs only in infrequent corner
> * cases, so it seems better than holding the MultiXactGenLock for a long
> * time on every multixact creation.
> */
With 64-bit offsets, can we assume that it never wraps around? We often treat 2^64 as "large enough that we'll never run out", e.g. LSNs are also assumed to never wrap around.
I think that would be a safe assumption here too. If we accept that, we don't need to worry about case 3 anymore. But if we upgrade wrapped-around members files by just renaming them, there could still be a members array where we had skipped offset 0, and reading that after the upgrade might get confused. We could continue to ignore a 0 XID in the members array like the comment says; I think that would be enough. But yeah, maybe it's better to bite the bullet in pg_upgrade and squeeze those out.
Does your upgrade test suite include case 3, where the next multixact's offset is 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other comments and checks that talk about binary-upgraded values too that we can hopefully clean up now.
If we are to parse the member segments in detail in upgrade anyway, I'd be tempted to make some further changes / optimizations:
- You could leave out all locking XID members in upgrade, because they're not relevant after upgrade any more (all the XIDs will be committed or aborted and have released the locks; we require prepared transactions to be completed before upgrading too). It'd be enough to include actual UPDATE/DELETE XIDs.
- The way we determine the length of the members array by looking at the next multixid's offset is a bit complicated. We could have one extra flag per XID in the members to indicate "this is the last member of this multixid". That could either replace the current mechanism of looking at the next offset, or be just an additional cross-check.
- Do we still like the "group" representation, with 4 bytes of flags followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per XID unaligned.
- A more radical idea: There can be only one updating XID in one multixid. We could store that directly in the offsets SLRU, and keep only the locking XIDs in members. That way, the members SLRU would become less critical; it could be safely reset on crash for example (except for prepared transactions, which could still be holding locks, but it'd still be less serious). Separating correctness-critical data from more ephemeral state is generally a good idea.
I'm not insisting on any of these changes, just some things that might be worth considering if we're rewriting the SLRUs on upgrade anyway.
--
Heikki Linnakangas
Neon (https://neon.tech)
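For reference on the "group" representation mentioned in that third point, this is the current layout as I read it in multixact.c on master (paraphrased; treat as a sketch):

    /* One member "group": 4 flag bytes followed by 4 TransactionIds. */
    #define MULTIXACT_FLAGBYTES_PER_GROUP       4
    #define MULTIXACT_MEMBERS_PER_MEMBERGROUP   4
    /* 4 + 4 * sizeof(TransactionId) = 20 bytes, i.e. 5 bytes per member */
    #define MULTIXACT_MEMBERGROUP_SIZE \
        (sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + \
         MULTIXACT_FLAGBYTES_PER_GROUP)

Note that this already averages 5 bytes per member, the same density as the unaligned alternative; the difference would be in alignment and flag packing, not in space.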
On Fri, 15 Nov 2024 at 14:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:
Correct.
If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.
Correct. I couldn't explain this better. I'm more for squeezing those out. Otherwise, we end up adding another hack in multixact, but one of the benefits of switching to 64 bits is that it should make the XID logic more straightforward. After all, mxact juggling in pg_upgrade is a one-time inconvenience.
Does your upgrade test suite include case 3, where the next multixact's
offset is 1?
Not exactly.
simple
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5927049
offset-wrap
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5591183
multi-wrap
Latest checkpoint's NextMultiXactId: 82006
Latest checkpoint's NextMultiOffset: 7408811
offset-multi-wrap
Latest checkpoint's NextMultiXactId: 52146
Latest checkpoint's NextMultiOffset: 5591183
Do you want a test case where NextMultiOffset will be 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.
Yes, technically we can. But this is kinda unrelated to the offsets and will make the patch set significantly more complicated, thus harder to review and less likely to be committed. Again, I'm not opposing the idea; I'm just not sure it is worth doing right now.
If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:
- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.
- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either to replace the current mechanism of looking
at the next offset, or be just an additional cross-check.
- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.
Not really. But I would leave it for the next iteration - switching multi to 64 bit. I already have some drafts for this. In any case, we'll have to make adjustments in pg_upgrade again. My goal is to move towards 64-bit XIDs, but with small steps, and I plan changes in the "group" representation in combination with switching multi to 64 bit. This seems a bit more appropriate in my view.
As for your optimization suggestions, I like them. I'm not against them, but I'm afraid of disrupting the clarity of thought, especially since the algorithm is not the simplest.
--
Best regards,
Maxim Orlov.
On 27/12/2024 19:09, Maxim Orlov wrote:
>
> On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi
> <mailto:hlinnaka@iki.fi>> wrote:
> Does the pg_upgrade code work though, if you have that buggy situation
> where oldestOffsetKnown == false ?
...
> >
> > if (!TransactionIdIsValid(*xactptr))
> > {
> > /* Corner case 3: we must be looking at
> unused slot zero */
> > Assert(offset == 0);
> > continue;
> > }
>
> After upgrade, this corner case 3 would *not* happen on offset == 0. So
> looks like we're still missing test coverage for this upgrade corner
> case.
>
> Am I understanding correctly that you want to have a test corresponding
> to the buggy 9.3 and 9.4 era versions?
No, those were two different things. I think there might be two things
wrong here:
1. I suspect pg_upgrade might not correctly handle the situation where
oldestOffsetKnown==false, and
2. The above assertion in "corner case 3" would not hold. It seems that
we don't have a test case for it, or it would've hit the assertion.
Now that I think about it, yes, a test case for 1. would be good too.
But I was talking about 2.
> Do you think we could imitate this scenario on a current master branch
> like that:
> 1) generate a couple of offsets segments for the first table;
> 2) generate more segments for a second table;
> 3) drop first table;
> 4) stop pg cluster;
> 5) remove pg_multixact/offsets/0000
> 6) upgrade?
I don't remember off the top of my head.
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, 2 Jan 2025 at 01:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
I agree. After all, we already have a ready-made solution in the form of vacuum, don't we?
If I understand all this multixact_old.c machinery correctly, in case of oldestOffsetKnown==false
we should fail with "could not open file", or the offset will be 0 in GetOldMultiXactIdSingleMember.
So, I suppose we can put an analogue of a SimpleLruDoesPhysicalPageExist call at the beginning
of GetOldMultiXactIdSingleMember. And if either SimpleLruDoesPhysicalPageExist returns false
or the corresponding offset is 0, we have to bail out with "oldest offset does not exist, consider
running vacuum before pg_upgrade" or something similar. Please correct me if I'm wrong.
Best regards,
Maxim Orlov.
Hi Maxim
> Looks like there is a bit of a pause in the discussion. Here is a small update. Consider v12.
> No major changes, rebase onto the actual master and a squash of multiple commits to make the
> patch set easier to review.
> AFAICS, we have reached a consensus on the core patch for switching to 64-bit offsets. The
> only concern is about more comprehensive test coverage for pg_upgrade, is it?
Agreed. When upgrading hits the extreme case (oldestOffsetKnown==false), just follow the solution mentioned by Heikki Linnakangas.
Thanks
On Thu, Jan 16, 2025 at 9:32 PM Maxim Orlov <orlovmg@gmail.com> wrote:
Looks like there is a bit of a pause in the discussion. Here is a small update. Consider v12.
No major changes, rebase onto the actual master and a squash of multiple commits to make the patch set easier to review.
AFAICS, we have reached a consensus on the core patch for switching to 64-bit offsets. The only concern is about more comprehensive test coverage for pg_upgrade, is it?
--
Best regards,
Maxim Orlov.
Here is a rebase, v14.
--
Best regards,
Maxim Orlov.
Attachments
- v14-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch.txt
- v14-0001-Use-64-bit-format-output-for-multixact-offsets.patch
- v14-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
- v14-0002-Use-64-bit-multixact-offsets.patch
- v14-0003-Make-pg_upgrade-convert-multixact-offsets.patch
- v14-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch.txt
- v14-0007-TEST-bump-catver.patch.txt
Hi Maxim Orlov, Heikki Linnakangas
Thank you for working on it. Code freeze is only a few days away. It seems like things have been quiet for a while, but I believe implementing xid64 is absolutely necessary. For instance, there's often concern about performance jitter caused by frequent freezes. If xid64 is implemented, the freeze process can be smoother.
Thanks
On Fri, Mar 7, 2025 at 7:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
Here is a rebase, v14.
--
Best regards,
Maxim Orlov.
On 07/03/2025 13:30, Maxim Orlov wrote:
> Here is a rebase, v14.
Thanks! I did some manual testing of this. I created a little helper function to consume multixids, to test the autovacuum behavior, and found one issue:
If you consume a lot of multixid members space, by creating lots of multixids with a huge number of members in each, you can end up with a very bloated members SLRU, and autovacuum is in no hurry to clean it up. Here's what I did:
1. Installed the attached test module
2. Ran "select consume_multixids(10000, 100000);" many times
3. Ran:
$ du -h data/pg_multixact/members/
26G data/pg_multixact/members/
When I run "vacuum freeze; select * from pg_database;", I can see that 'datminmxid' for the current database is advanced. However, autovacuum is in no hurry to vacuum 'template0' and 'template1', so pg_multixact/members/ does not get truncated. Eventually, when autovacuum_multixact_freeze_max_age is reached, it presumably will, but you will run out of disk space before that.
There is this check for members size at the end of SetOffsetVacuumLimit():
> /*
> * Do we need autovacuum? If we're not sure, assume yes.
> */
> return !oldestOffsetKnown ||
> (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
And the caller (SetMultiXactIdLimit()) will in fact signal the autovacuum launcher after "vacuum freeze" because of that. But the autovacuum launcher will look at the datminmxid / relminmxid values, see that they are well within autovacuum_multixact_freeze_max_age, and do nothing.
This is a very extreme case, but clearly the code to signal the autovacuum launcher, and the freeze age cutoff that autovacuum then uses, are not in sync. This patch removed MultiXactMemberFreezeThreshold(), per my suggestion, but we threw the baby out with the bathwater. We discussed that in this thread, but didn't come up with any solution. But ISTM we still need something like MultiXactMemberFreezeThreshold() to trigger autovacuum freezing if the members have grown too large.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 01/04/2025 21:25, Heikki Linnakangas wrote:
> On 07/03/2025 13:30, Maxim Orlov wrote:
>> Here is a rebase, v14.
>
> Thanks! I did some manual testing of this. I created a little helper
> function to consume multixids, to test the autovacuum behavior, and
> found one issue:
Forgot to attach the test function I used, here it is.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments
Hi Heikki
Please kindly help to create a task in https://commitfest.postgresql.org/53/ ; I cannot find this patch in
Commitfest 2025-07.
Thanks
On Wed, Apr 2, 2025 at 2:26 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 01/04/2025 21:25, Heikki Linnakangas wrote:
> On 07/03/2025 13:30, Maxim Orlov wrote:
>> Here is a rebase, v14.
>
> Thanks! I did some manual testing of this. I created a little helper
> function to consume multixids, to test the autovacuum behavior, and
> found one issue:
Forgot to attach the test function I used, here it is.
--
Heikki Linnakangas
Neon (https://neon.tech)
I moved the topic to the next commitfest.
--
Best regards,
Maxim Orlov.
Hi Maxim,
On Wed, Apr 16, 2025 at 1:42 PM Maxim Orlov <orlovmg@gmail.com> wrote:
I moved the topic to the next commitfest.
I am reviewing these patches.
I notice that transam/README does not mention multixacts except in one place, in the section "Transaction Emulation during Recovery". I expected it to document how pg_multixact/members and pg_multixact/offsets are used and what their layout is. It's not the responsibility of this patchset to document it, but it would be good if we added a section about multixacts to transam/README. It would make reviews easier.
Best Wishes,
Ashutosh Bapat
Yet another rebase @ f5a987c0e5
--
Best regards,
Maxim Orlov.
Attachments
- v16-0001-Use-64-bit-format-output-for-multixact-offsets.patch
- v16-0002-Use-64-bit-multixact-offsets.patch
- v16-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
- v16-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch
- v16-0003-Make-pg_upgrade-convert-multixact-offsets.patch
- v16-0007-TEST-bump-catver.patch
- v16-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch
Once again, @ 8191e0c16a
--
Best regards,
Maxim Orlov.
Attachments
- v17-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch
- v17-0003-Make-pg_upgrade-convert-multixact-offsets.patch
- v17-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
- v17-0002-Use-64-bit-multixact-offsets.patch
- v17-0001-Use-64-bit-format-output-for-multixact-offsets.patch
- v17-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch
- v17-0007-TEST-bump-catver.patch
Hello Maxim!
On Thu, Sep 11, 2025 at 11:58 AM Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Once again, @ 8191e0c16a
Thank you for your work on this subject. Multixact members can really grow much faster than multixact offsets, and avoiding wraparound just here might make sense. At the same time, making multixact offsets 64-bit is local and doesn't require changing the tuple xmin/xmax interpretation.
I went through the patchset. The shape does not look bad, but I have a concern about the size of the multixact offsets. As I understand, this patchset grows multixact offsets twice over; each multixact offset grows from 32 bits to 64 bits. This seems quite a price for avoiding the multixact members' wraparound.
We can try to squeeze multixact offsets, given it's an ascending sequence, each time increased by a multixact's size. But how many members can a multixact contain at maximum? Looking at MultiXactIdExpand(), I get that we're keeping locks from in-progress transactions, and committed non-lock transactions (I guess the latter could be only one). The number of transactions run by backends should fit MAX_BACKENDS (2^18 - 1), and the number of prepared transactions should also fit MAX_BACKENDS. So, I guess we can cap the total number of one multixact's members to 2^24.
Therefore, we can change from each 8 of 32-bit multixact offsets (takes 32 bytes) to one 64-bit offset + 7 of 24-bit offset increments (takes 29 bytes). The actual multixact offsets can be calculated on the fly; the overhead shouldn't be significant. What do you think?
------
Regards,
Alexander Korotkov
Supabase
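A minimal sketch of how such a chunk could be laid out and decoded (my illustration of the idea, with hypothetical names, assuming Postgres' uint64/uint8 typedefs; not code from any posted patch):

    #define MXOFF_CHUNK_ENTRIES 8

    typedef struct MxOffChunk
    {
        uint64      base;       /* full 64-bit offset of the first entry */
        uint8       inc[7][3];  /* 24-bit deltas to the previous entry, little-endian */
    } MxOffChunk;               /* 8 + 7*3 = 29 bytes for 8 offsets */

    /* Reconstruct the i-th offset (0..7) by summing the deltas. */
    static inline uint64
    mxoff_chunk_read(const MxOffChunk *c, int i)
    {
        uint64      off = c->base;

        for (int k = 0; k < i; k++)
            off += (uint64) c->inc[k][0] |
                   ((uint64) c->inc[k][1] << 8) |
                   ((uint64) c->inc[k][2] << 16);
        return off;
    }

Since each delta is a single multixact's member count, the 2^24 cap above guarantees it fits in 24 bits.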
On Sat, 13 Sept 2025 at 16:34, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Therefore, we can change from each 8 of 32-bit multixact offsets
(takes 32-bytes) to one 64-bit offset + 7 of 24-bit offset increments
(takes 29-bytes). The actual multixact offsets can be calculated at
the fly, overhead shouldn't be significant. What do you think?
Thank you for your review; I'm pleased to hear from you again.
Yes, because the maximum number of mxoff increments is limited by the number of
running transactions, we may do it that way.
However, it is a bit weird to have offsets with the 7-byte "base".
I believe we may take advantage of the 64XID patch's notion of putting an
8-byte base followed by 4-byte offsets for a particular page.
A 32kB page may then contain 2^13-2 offsets, each increment maxed at 2^18+1.
Therefore, the offset from the base will never overflow 2^31 and will always
fit in uint32.
It appears logical to me.
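(A quick check of that bound, my own arithmetic: (2^13 - 2) * (2^18 + 1) = 2^31 + 2^13 - 2^19 - 2 < 2^31, so the largest offset-from-base on a page indeed fits in a uint32.)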
--
Best regards,
Maxim Orlov.
Hi Maxim
Thanks for your continued efforts to get XID64 implemented.
> A 32kB page may then contain 2^13-2 offsets, each increment maxed at 2^18+1.
> Therefore, the offset from the base will never overflow 2^31 and will always
> fit in uint32.
> It appears logical to me.
Agree, +1. But I have a question: I remember the XID64 patch got split into a few threads. How are these threads related? The original one was seen as too big a change, so it was broken up after people raised concerns.
On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> Agree, +1. But I have a question: I remember the XID64 patch got split into a few threads. How are these threads related? The original one was seen as too big a change, so it was broken up after people raised concerns.
Yeah, you're absolutely correct. This thread is part of the overall
work on the transition to XID64 in Postgres. As far as I remember, no
one explicitly raised concerns, but it's clear to me that it won't be
committed as one big patch set.
Here is a WIP patch @ 8191e0c16a for the issue discussed above.
I also had to merge several patches from the previous set, since the
consensus is to use the PRI* format for outputting 64-bit values, and a
separate patch that only changed the format with casts and %llu lost
its meaning.
If this option suits everyone, the next step is to add the part related
to pg_upgrade.
Best regards,
Maxim Orlov.
Attachments
Hi Maxim
Thank you for your feedback. I remember that for this patch the primary challenges are in the upgrade section, where we're still debating how to address a few edge cases. Right now, this thread is missing input from core code contributors.
Thanks
Here is a new patch set @ 10b5bb3bffaee8
As previously stated, the patch set implements the concept of saving the
"difference" between page offsets in order to save disk space.
The second significant change is that I decided to modify the
pg_upgrade portion as suggested by Heikki above.
At the very least, the logic becomes much simpler, completely
replicates what is done in the multixact.c module, and provides some
optimization.
Perhaps this will be easier to comprehend and analyse than attempting
to correctly convert bytes from one format to another.
In the near future, I intend to focus on testing and implement a test
suite.
--
Best regards,
Maxim Orlov.
Attachments
On 27/10/2025 17:54, Maxim Orlov wrote:
> Here is a new patch set @ 10b5bb3bffaee8
>
> As previously stated, the patch set implements the concept of saving the
> "difference" between page offsets in order to save disk space.
Hmm, is that safe? We do the assignment of multixact and offset, in the GetNewMultiXactId() function, separately from updating the SLRU pages in the RecordNewMultiXact() function. I believe this can happen:
To keep the arithmetic simple, let's assume that multixid 100 is the first multixid on an offsets SLRU page. So the 'base' on the page is initialized when multixid 100 is written.
1. Backend A calls GetNewMultiXactId(), is assigned multixid 100, offset 1000
2. Backend B calls GetNewMultiXactId(), is assigned multixid 101, offset 1010
3. Backend B calls RecordNewMultiXact() and sets 'page->offsets[1] = 10'
4. Backend A calls RecordNewMultiXact() and sets 'page->base = 1000' and 'page->offsets[0] = 0'
If backend C looks up multixid 101 in between steps 3 and 4, it would read the offset incorrectly, because 'base' isn't set yet.
- Heikki
Unfortunately, I need to admit that I messed up a bit with v18.
I forgot to sync the pg_upgrade portion with the rest of the patch,
among other things. Here's a proper version with additional testing.
pg_upgrade/t/007_mxoff.pl is not meant to be committed, it's just
for test purposes. In order to run it, env var oldinstall must be set.
On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/10/2025 17:54, Maxim Orlov wrote:
If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.
Hmm, maybe I miss something? We set the page base on the first write of any
offset on the page, not only the first one. In other words, there
should never be a case when we read an offset without a previously
defined page base. Correct me if I'm wrong:
1. Backend A assigned mxact=100, offset=1000.
2. Backend B assigned mxact=101, offset=1010.
3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and
set page base=1010, offset plus 0^0x80000000 bit while
holding lock on the page.
4. Backend C looks up for the mxact=101 by calling MXOffsetRead()
and should get exactly what he's looking for:
base (1010) + offset (0) minus 0x80000000 bit.
5. Backend A calls RecordNewMultiXact() and sets his offset using
existing base from step 3.
--
Best regards,
Maxim Orlov.
Вложения
On 30/10/2025 08:13, Maxim Orlov wrote:
> On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/10/2025 17:54, Maxim Orlov wrote:
>
> If backend C looks up multixid 101 in between steps 3 and 4, it would
> read the offset incorrectly, because 'base' isn't set yet.
>
> Hmm, maybe I miss something? We set the page base on the first write of any
> offset on the page, not only the first one. In other words, there
> should never be a case when we read an offset without a previously
> defined page base. Correct me if I'm wrong:
> 1. Backend A assigned mxact=100, offset=1000.
> 2. Backend B assigned mxact=101, offset=1010.
> 3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and
> set page base=1010, offset plus 0^0x80000000 bit while
> holding lock on the page.
> 4. Backend C looks up for the mxact=101 by calling MXOffsetRead()
> and should get exactly what he's looking for:
> base (1010) + offset (0) minus 0x80000000 bit.
> 5. Backend A calls RecordNewMultiXact() and sets his offset using
> existing base from step 3.
Oh I see, the 'base' is not necessarily the base offset of the first multixact on the page, it's the base offset of the first multixid that is written to the page. And the (short) offsets can be negative. That's a frighteningly clever encoding scheme. One upshot of that is that WAL redo might construct the page with a different 'base'. I guess that works, but it scares me. Could we come up with a more deterministic scheme?
- Heikki
On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?
Definitely! The most stable approach is the one we had before, which
used actual 64-bit offsets in the SLRU. To be honest, I'm completely
happy with it. After all, what's most important for me is to have 64-bit
xids in Postgres, and this patch is a step towards that goal.
PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
segments.
Fortunately, now that I've separated reading and writing offsets into
different functions, switching from one implementation to another is
easy to do.
Here's a quick overview of the current state of the patch:
1) Access to the offset is placed to separate calls:
MXOffsetWrite/MXOffsetRead.
2) I abandoned byte juggling in pg_upgrade and moved to using logic that
replicates the work with offsets in multixact.c
3) As a result, the update issue came down to the correct implementation
of functions MXOffsetWrite/MXOffsetRead.
4) The only question that remains is the question of disk representation
of 64-bit offsets in SLRU segments.
My thoughts on point (4).
Using 32-bit offsets + some kind of packing:
Pros:
+ Reduce the total disk space used by the segments; ideally it is
almost the same as before.
Cons:
- Reduces reliability (losing a part will most likely result in losing
the entire page).
- Complicates code, especially considering that segments may be written
to the page in random order.
Using 64-bit offsets in SLRU:
Pros:
+ Easy to implement/transparent logic.
Cons:
- Increases the amount of disk space used.
In terms of speed, I'm not sure which will be faster. On the one hand,
64-bit eliminates the necessity for calculations and branching. On the
other hand, the amount of data used will increase.
I am not opposed to any of these options, as our primary goal is getting
64-bit offsets. However, I like the approach using full 64-bit offsets
in SLRU, because it is more clear and, should we say, robust. Yes, it
will increase the number of segments; however, this is not heap data
for a table. Under typical circumstances, there should not be too many
such segments.
Best regards,
Maxim Orlov.
Attachments
On Thu, Oct 30, 2025 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Oh I see, the 'base' is not necessarily the base offset of the first
>> multixact on the page, it's the base offset of the first multixid that
>> is written to the page. And the (short) offsets can be negative. That's
>> a frighteningly clever encoding scheme. One upshot of that is that WAL
>> redo might construct the page with a different 'base'. I guess that
>> works, but it scares me. Could we come up with a more deterministic scheme?
>>
> Definitely! The most stable approach is the one we had before, which
> used actual 64-bit offsets in the SLRU. To be honest, I'm completely
> happy with it. After all, what's most important for me is to have 64-bit
> xids in Postgres, and this patch is a step towards that goal.
Yes, but why can't we have an encoding scheme which would both be deterministic and provide compression? The attached is what I meant in [1]. It's based on v19 and provides deterministic conversion of each 8 of 64-bit offsets into chunks containing a 64-bit base and 7 of 24-bit increments. I didn't touch the pg_upgrade code though.
Links.
1. https://www.postgresql.org/message-id/CAPpHfdtPybyMYBj-x3-Z5%3D4bj_vhYk2R0nezfy%3DVjcz4QBMDgw%40mail.gmail.com
------
Regards,
Alexander Korotkov
Supabase
Attachments
On 30/10/2025 18:17, Maxim Orlov wrote:
> PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
> segments.
>
> Fortunately, now that I've separated reading and writing offsets into
> different functions, switching from one implementation to another is
> easy to do.
>
> Here's a quick overview of the current state of the patch:
> 1) Access to the offset is placed to separate calls:
> MXOffsetWrite/MXOffsetRead.
> 2) I abandoned byte juggling in pg_upgrade and moved to using logic that
> replicates the work with offsets in multixact.c
> 3) As a result, the update issue came down to the correct implementation
> of functions MXOffsetWrite/MXOffsetRead.
> 4) The only question that remains is the question of disk representation
> of 64-bit offsets in SLRU segments.
Here's another round of review and cleanup of this. I made a bunch of small changes, but haven't found any major problems. Looking pretty good.
Notable changes since v20:
- Changed MULTIXACT_MEMBER_AUTOVAC_THRESHOLD to 4000000000 instead of 0xFFFFFFFF. The 2^32 mark doesn't have any particular significance and using a round number makes that more clear. We should possibly expose that as a separate GUC, but that can be done in a followup patch.
- Removed the MXOffsetRead/Write functions again. They certainly make sense if we make them more complicated with some kind of a compression scheme, but I preferred to keep the code closer to 'master' for now.
- Removed more remnants of offset wraparound handling. There were still a few places that checked for wraparound and tried to deal with it, while other places just assumed that it doesn't happen. I added a check in GetNewMultiXactId() to error out if it would wrap around. It really should not happen in the real world, one reason being that we would run out of WAL before running out of 64-bit mxoffsets, but a sanity check won't hurt just in case someone e.g. abuses pg_resetwal to get into that situation.
- Removed MaybeExtendOffsetSlru(). It was only used to deal with binary upgrade from version 9.2 and below. Now that pg_upgrade rewrites the files, it's not needed anymore.
- Modified PerformMembersTruncation() to use SimpleLruTruncate()
Changes in pg_upgrade:
- Removed the nextMXact/nextOffset fields from MultiXactWriter. They were redundant with the next_multi and next_offset local variables in the caller.
Remaining issues:
- There's one more refactoring I'd like to do before merging this: Move the definitions that are now duplicated between src/bin/pg_upgrade/multixact_new.c and src/backend/access/transam/multixact.c into a new header file, multixact_internal.h. One complication with that is that it needs SLRU_PAGES_PER_SEGMENT from access/slru.h, but slru.h cannot currently be included in FRONTEND code. Perhaps we should move SLRU_PAGES_PER_SEGMENT to pg_config_manual.h, or if that feels too global, to a separate slru_config.h file.
- I saw Alexander's proposal for a new compression scheme but didn't incorporate that here. It might be a good idea, but I think we can do that as a followup patch before the release, if it seems worth it. I don't feel too bad about just making pg_multixact/offsets 2x larger either.
- Have you done any performance testing of the pg_upgrade code? How long does the conversion take if you have e.g. 1 billion multixids?
- Is the !oldestOffsetKnown case in the code still reachable? I left one FIXME comment about that. Needs a comment update at least.
- The new pg_upgrade test fails on my system with this error in the log:
> # Running: pg_dump --no-sync --restrict-key test -d port=22462 host=/tmp/5KdMvth1jk dbname='postgres' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql
> pg_dump: error: aborting because of server version mismatch
> pg_dump: detail: server version: 19devel; pg_dump version: 17.6
> could not read "/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql": No such file or directory at /home/heikki/git-sandbox/postgresql/src/bin/pg_upgrade/t/007_mxoff.pl line 242.
This turns out to be an issue with IPC::Run. Setting the IPCRUNDEBUG=basic env variable reveals that it has a built-in command cache:
> IPC::Run 0004 [#19(109223)]: ****** harnessing *****
> IPC::Run 0004 [#19(109223)]: parsing [ pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f '/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql']
> IPC::Run 0004 [#19(109223)]: ** starting
> IPC::Run 0004 [#19(109223)]: 'pg_dump' found in cache: '/home/heikki/pgsql.17stable/bin/pg_dump'
> IPC::Run 0004 [#19(111432) pg_dump]: execing /home/heikki/pgsql.17stable/bin/pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql
> IPC::Run 0004 [#19(109223)]: ** finishing
> pg_dump: error: aborting because of server version mismatch
> pg_dump: detail: server version: 19devel; pg_dump version: 17.6
The test calls pg_dump twice: first with the old version, then with the new version. But thanks to IPC::Run's command cache, the invocation of the new pg_dump version actually also calls the old version. I'm not sure how to fix that, but I was able to work around it by reversing the pg_dump calls so that the new version is called first. That way we use the new pg_dump against both server versions, which works.
- The new pg_upgrade test is very slow. I would love to include that test permanently in the test suite, but it's too slow for that currently.
- Heikki
Attachments
On Wed, 5 Nov 2025 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Remaining issues:
- There's one more refactoring I'd like to do before merging this: Move
the definitions that are now duplicated between
src/bin/pg_upgrade/multixact_new.c and
src/backend/access/transam/multixact.c into a new header file,
multixact_internal.h. One complication with that is that it needs
SLRU_PAGES_PER_SEGMENT ...
Done. Also put SLRU_PAGES_PER_SEGMENT in pg_config_manual.h.
In my opinion, this constant perfectly fits the description in the
file header. In any case, feel free to move it anywhere you like.
> - Have you done any performance testing of the pg_upgrade code? How long
> does the conversion take if you have e.g. 1 billion multixids?
Unfortunately, not yet. I'd like to do this soon. Currently, the
bulk of the testing time is spent generating multi-transactions.
> - Is the !oldestOffsetKnown case in the code still reachable? I left one
> FIXME comment about that. Needs a comment update at least.
Yep, no longer needed. A separate commit has been added.
> - The new pg_upgrade test fails on my system with this error in the log:
Unfortunately, I can't reproduce this issue. I think it can be fixed by
providing an explicit path to the utility.
> - The new pg_upgrade test is very slow. I would love to include that
> test permanently in the test suite, but it's too slow for that currently.
Yes, unfortunately. The majority of the time is spent on tests that
produce multiple segments. These are the cases numbered 4 and higher.
If we remove those, the testing should be relatively fast.
I also added a commit, "Handle wraparound of next new multi in
pg_upgrade", per BUG #18863 and BUG #18865.
The issue is that pg_upgrade neglects to handle the wraparound of
mxact/mxoff.
We'll obviously resolve the issue with mxoff wraparound by moving to
64 bits. And the mxact bug can be easily solved with two lines of code.
Or five if you count indents and comments. A test is also provided.
This commit is totally optional. If you think it deserves to be treated
as a different issue, feel free to discard it.
--
Best regards,
Maxim Orlov.
Attachments
- v23-0006-TEST-bump-catversion.patch.txt
- v23-0001-Use-64-bit-multixact-offsets.patch
- v23-0005-Handle-wraparound-of-next-new-multi-in-pg_upgrad.patch
- v23-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch
- v23-0003-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v23-0004-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v23-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch.txt
I noticed one minor issue after I had already sent the previous message.
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1034,7 +1034,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
if (nextOffset + nmembers < nextOffset)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- "MultiXact members would wrap around"));
+ errmsg("MultiXact members would wrap around")));
*offset = nextOffset;
$ $PGBINOLD/pg_controldata -D pgdata
pg_control version number: 1800
Catalog version number: 202510221
...
Latest checkpoint's NextMultiXactId: 10000000
Latest checkpoint's NextMultiOffset: 999995050
Latest checkpoint's oldestXID: 748
...
I tried finding out how long it would take to convert a big number of
segments. Unfortunately, I only have access to a very old machine right
now. It took me 7 hours to generate this much data on my old
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
Here are my rough measurements:
HDD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m59.459s
user 0m19.974s
sys 0m13.640s
SSD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m52.958s
user 0m19.826s
sys 0m13.624s
I aim to get access to more modern stuff and check it all out there.
Best regards,
Maxim Orlov.
On 07/11/2025 18:03, Maxim Orlov wrote:
> I tried finding out how long it would take to convert a big number of
> segments. Unfortunately, I only have access to a very old machine right
> now. It took me 7 hours to generate this much data on my old
> Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
>
> Here are my rough measurements:
>
> HDD
> $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
> $ time pg_upgrade
> ...
> real 4m59.459s
> user 0m19.974s
> sys 0m13.640s
>
> SSD
> $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
> $ time pg_upgrade
> ...
> real 4m52.958s
> user 0m19.826s
> sys 0m13.624s
>
> I aim to get access to more modern stuff and check it all out there.

Thanks, I also did some perf testing on my laptop. I wrote a little helper function to consume multixids, and used it to create a v17 cluster with 100 million multixids. See attached consume-mxids.patch.txt. I then ran pg_upgrade on that, and measured how long the pg_multixact conversion part of pg_upgrade took. It took about 1.2 s on my laptop. Extrapolating from that, converting 1 billion multixids would take 12 s. These were very simple multixacts with just one member each, though; realistic multixacts with more members would presumably take a little longer. In any case, I think we're in an acceptable ballpark here.

There's some very low-hanging fruit though: Profiling with 'linux-perf' suggested that a lot of CPU time was spent simply on the function call overhead of GetOldMultiXactIdSingleMember, SlruReadSwitchPage, RecordNewMultiXact, SlruWriteSwitchPage for each multixact. I added an inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to eliminate the function call overhead of those in the common case that no page switch is needed (see the sketch after this message). With that, the 100 million mxid test case I used went from 1.2 s to 0.9 s. We could optimize this further but I think this is good enough.

Some other changes since patch set v23:

- Rebased. I committed the wraparound bug fixes.

- I added an SlruFileName() helper function to slru_io.c, and support for reading SLRUs with long_segment_names==true. It's not needed currently, but it seemed like a weird omission. AllocSlruRead() actually left 'long_segment_names' uninitialized, which is error-prone. We could've just documented it, but it seems just as easy to support it.

- I split out the multixact_internal.h header into a separate commit, to make it more clear which changes are related to 64-bit offsets.

I kept all the new test cases for now. We need to decide which ones are worth keeping, and polish and speed up the ones we decide to keep.

I'm getting one failure from the pg_upgrade/008_mxoff test:

> [14:43:38.422](0.530s) not ok 26 - dump outputs from original and restored regression databases match
> [14:43:38.422](0.000s) # Failed test 'dump outputs from original and restored regression databases match'
> # at /home/heikki/git-sandbox/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 801.
> [14:43:38.422](0.000s) # got: '1'
> # expected: '0'
> === diff of /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted and /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted
> === stdout ===
> --- /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted 2025-11-12 14:43:38.030399957 +0200
> +++ /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted 2025-11-12 14:43:38.314399819 +0200
> @@ -2,8 +2,8 @@
>  -- PostgreSQL database dump
>  --
>  \restrict test
> --- Dumped from database version 17.6
> --- Dumped by pg_dump version 17.6
> +-- Dumped from database version 19devel
> +-- Dumped by pg_dump version 19devel
>  SET statement_timeout = 0;
>  SET lock_timeout = 0;
>  SET idle_in_transaction_session_timeout = 0;
> === stderr ===
> === EOF ===
> [14:43:38.425](0.004s) # >>> case #6

I ran the test with:

(rm -rf build/testrun/ build/tmp_install/; olddump=/tmp/olddump-regress.sql oldinstall=/home/heikki/pgsql.17stable/ meson test -C build --suite setup --suite pg_upgrade)

- Heikki
Attachments
- consume-mxids.patch.txt
- v24-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
- v24-0002-Use-64-bit-multixact-offsets.patch
- v24-0003-Add-pg_upgrade-for-64-bit-multixact-offsets.patch
- v24-0004-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v24-0005-TEST-bump-catversion.patch
- v24-0006-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v24-0007-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch
- v24-0008-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch
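[Editor's note] The inlined fast path mentioned in the message above might look roughly like the sketch below. Only the function name SlruReadSwitchPage and the general idea come from the mail; the struct layout and the slow-path helper are assumptions for illustration.

#include <stdint.h>

/* Hypothetical read-side state; the field names are illustrative. */
typedef struct SlruReadState
{
	int64_t		cur_pageno;		/* page currently held in the buffer */
	char	   *buffer;			/* one SLRU page worth of data */
} SlruReadState;

/* Out-of-line slow path: release the old page, load the requested one. */
extern char *SlruReadSwitchPageSlow(SlruReadState *state, int64_t pageno);

/*
 * Inline fast path: in the common case the requested entry is on the
 * page we already hold, so no function call happens at all; we only
 * call out of line when an actual page switch is needed.
 */
static inline char *
SlruReadSwitchPage(SlruReadState *state, int64_t pageno)
{
	if (pageno == state->cur_pageno)
		return state->buffer;	/* no switch needed */
	return SlruReadSwitchPageSlow(state, pageno);
}

The write side would presumably mirror this, with SlruWriteSwitchPage flushing the current page in its slow path before moving to the next one.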
I realized that this issue was still outstanding:

On 01/04/2025 21:25, Heikki Linnakangas wrote:
> Thanks! I did some manual testing of this. I created a little helper
> function to consume multixids, to test the autovacuum behavior, and
> found one issue:
>
> If you consume a lot of multixid members space, by creating lots of
> multixids with a huge number of members in each, you can end up with a
> very bloated members SLRU, and autovacuum is in no hurry to clean it up.
> Here's what I did:
>
> 1. Installed attached test module
> 2. Ran "select consume_multixids(10000, 100000);" many times
> 3. ran:
>
> $ du -h data/pg_multixact/members/
> 26G data/pg_multixact/members/
>
> When I run "vacuum freeze; select * from pg_database;", I can see that
> 'datminmxid' for the current database is advanced. However, autovacuum
> is in no hurry to vacuum 'template0' and 'template1', so
> pg_multixact/members/ does not get truncated. Eventually, when
> autovacuum_multixact_freeze_max_age is reached, it presumably will, but
> you will run out of disk space before that.
>
> There is this check for members size at the end of SetOffsetVacuumLimit():
>
>> /*
>> * Do we need autovacuum? If we're not sure, assume yes.
>> */
>> return !oldestOffsetKnown ||
>> (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
>
> And the caller (SetMultiXactIdLimit()) will in fact signal the
> autovacuum launcher after "vacuum freeze" because of that. But the
> autovacuum launcher will look at the datminmxid / relminmxid values, see
> that they are well within autovacuum_multixact_freeze_max_age, and do
> nothing.
>
> This is a very extreme case, but clearly the code to signal the autovacuum
> launcher, and the freeze age cutoff that autovacuum then uses, are not
> in sync.
>
> This patch removed MultiXactMemberFreezeThreshold(), per my suggestion,
> but we threw the baby out with the bathwater. We discussed that in this
> thread, but didn't come up with any solution. But ISTM we still need
> something like MultiXactMemberFreezeThreshold() to trigger autovacuum
> freezing if the members have grown too large.

Here's a new patch version that addresses the above issue. I resurrected MultiXactMemberFreezeThreshold(), using the same logic as before, just using pretty arbitrary thresholds of 1 and 2 billion offsets instead of the safe/danger thresholds derived from MaxMultiOffset. That gives roughly the same behavior wrt. calculating effective freeze age as before.

Another change is that I removed the offset-based emergency vacuum triggering. With 64-bit offsets, we never need to shut down the system to prevent offset wraparound, so even if the offsets SLRU grows large, it's not an "emergency" the same way that wraparound is. Consuming lots of disk space could be a problem, of course, but we can let autovacuum deal with that at the normal pace, like it deals with bloated tables.

The heuristics could surely be made better and/or more configurable, but I think this is good enough for now. I included these changes as a separate patch for review purposes, but it ought to be squashed with the main patch before committing.

- Heikki
Attachments
- v25-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
- v25-0002-Use-64-bit-multixact-offsets.patch
- v25-0003-Add-pg_upgrade-for-64-bit-multixact-offsets.patch
- v25-0004-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v25-0005-Reintroduce-MultiXactMemberFreezeThreshold.patch
- v25-0006-TEST-bump-catversion.patch
- v25-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v25-0008-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch
- v25-0009-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch
- v25-0010-TEST-add-consume_multixids-function.patch
On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I added an
> inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
> eliminate the function call overhead of those in the common case that no
> page switch is needed. With that, the 100 million mxid test case I used
> went from 1.2 s to 0.9 s. We could optimize this further but I think
> this is good enough.
I agree with you.
> - I added an SlruFileName() helper function to slru_io.c, and support
> for reading SLRUs with long_segment_names==true. It's not needed
> currently, but it seemed like a weird omission. AllocSlruRead() actually
> left 'long_segment_names' uninitialized which is error-prone. We
> could've just documented it, but it seems just as easy to support it.
Yeah, I didn't particularly like that place either. But then I decided
it was overkill to do it just for the sake of symmetry, and that it
would raise questions. It turned out much better this way.
> I kept all the new test cases for now. We need to decide which ones are
> worth keeping, and polish and speed up the ones we decide to keep.
I'm thinking of two cases here.
A) Upgrade from a "new cluster":
* create a cluster with mxoff just below the 32-bit overflow point
* consume around 2k mxacts (1k before the 32-bit overflow
and 1k after)
* run pg_upgrade
* check the upgraded cluster is working
* check the data invariant
B) Same as A), but for an "old cluster" with the oldinstall env.
On Thu, 13 Nov 2025 at 19:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a new patch version that addresses the above issue. I resurrected
> MultiXactMemberFreezeThreshold(), using the same logic as before, just
> using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
> the safe/danger thresholds derived from MaxMultiOffset. That gives
> roughly the same behavior wrt. calculating effective freeze age as before.
Yes, I think it's okay for now. This reflects the existing logic well.
I wonder what the alternative solution might be? Could we make a
"vacuum freeze" also truncate pg_multixact segments?
In any case, this can be discussed later.
--
Best regards,
Maxim Orlov.
On 14/11/2025 17:40, Maxim Orlov wrote:
> On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi>
>> I kept all the new test cases for now. We need to decide which
>> ones are worth keeping, and polish and speed up the ones we decide
>> to keep.

Attached is a new patch version, with more work on the tests.

The pg_upgrade patch (v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch) now includes a test case. I'm proposing to commit that test along with these patches. It's a heavily-modified version of the test cases you wrote. I tested that test using old installations, all the way down to version 9.4. That required a bunch of changes to the test perl modules, to make them work with such old versions. Without any extra changes, the test works down to v11.

Later patches in the patch set add more tests, labelled with the TEST: prefix. Those are the tests you posted earlier, with little to no modifications. I'm just carrying those around, so that I can easily run them now during development. But I don't think they're adding much value and I plan to leave them out of the final commit.

> I think of two cases here.
> A) Upgrade from "new cluster":
> * created cluster with pre 32-bit overflow mxoff
> * consume around of 2k of mxacts (1k before 32-bit overflow
> and 1k after)
> * run pg_upgrade
> * check upgraded cluster is working
> * check data invariant
> B) Same as A), but for an "old cluster" with oldinstall env.

Makes sense. The 007_multixact_conversion.pl test in the attached patches includes two test scenarios: a "basic" and a "wraparound" test. In the basic scenario there's no overflow or wraparound involved, but it can be run without an old installation, i.e. in a "new -> new" upgrade. The "wraparound" scenario is the same, but the old cluster is reset with pg_resetwal so that the mxoff wraps around. The "wraparound" scenario requires a pre-19 old installation, because the pg_resetwal logic requires the pre-v19 layout.

If we enhance the reset_mxoff() perl function in the test so that it also works with v19, we could run the "wraparound" scenario in new -> new upgrades too. That would essentially be case A) that you listed above. I think it's already pretty good as it is, though. I don't expect the point where we cross offset 2^32 in the new version to be very interesting now that we use 64-bit offsets everywhere.

- Heikki
Attachments
- v26-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
- v26-0002-Use-64-bit-multixact-offsets.patch
- v26-0003-TEST-bump-catversion.patch
- v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch
- v26-0005-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v26-0006-Reintroduce-MultiXactMemberFreezeThreshold.patch
- v26-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v26-0008-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch
- v26-0009-TEST-add-consume_multixids-function.patch
- v26-0010-TEST-Original-pg_upgrade-test-case.patch
Here's yet another patch version. I spent the day reviewing this in detail and doing little cleanups here and there. I squashed the commits and wrote a proper commit message.

One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to me at least) how upgrade from version 9.2 and below now works. It was actually broken when I tested it. Not sure if I had broken it earlier or if it never worked, but in any case it works now.

I also tested upgrading a cluster from an old minor version, < 9.3.5, where the control file has a bogus oldestMultiXid==1 value (see commit b6a3444fa6). As expected, you get a "could not open file" error:

> Performing Upgrade
> ------------------
> Setting locale and encoding for new cluster ok
> ...
> Deleting files from new pg_multixact/members ok
> Deleting files from new pg_multixact/offsets ok
> Converting pg_multixact files
> could not open file "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file or directory
> Failure, exiting

I don't think we need to support that case. I hope there are no clusters in that state still in the wild, and you can work around it by upgrading to 9.3.5 or above and letting autovacuum run. But I wonder if a pre-upgrade check with a better error message would still be worthwhile.

Ashutosh, you were interested in reviewing this earlier. Would you have a chance to review this now, before I commit it? Alexander, Alvaro, would you have a chance to take a final look too, please?

- Heikki
Attachments
Hi Heikki
> I don't think we need to support that case. I hope there are no clusters
> in that state still in the wild, and you can work around it by upgrading
> to 9.3.5 or above and letting autovacuum run. But I wonder if a
> pre-upgrade check with a better error message would still be worthwhile.
I believe it is now highly unlikely to find instances of version 9.3 in the wild; all users are advised to upgrade to the latest version first.
Thanks
One more small issue: The docs for pg_resetwal contain recipes for how to determine safe values to use:

> -m mxid,mxid
> --multixact-ids=mxid,mxid
>
> Manually set the next and oldest multitransaction ID.
>
> A safe value for the next multitransaction ID (first part) can be
> determined by looking for the numerically largest file name in the
> directory pg_multixact/offsets under the data directory, adding one,
> and then multiplying by 65536 (0x10000). Conversely, a safe value
> for the oldest multitransaction ID (second part of -m) can be
> determined by looking for the numerically smallest file name in the
> same directory and multiplying by 65536. The file names are in
> hexadecimal, so the easiest way to do this is to specify the option
> value in hexadecimal and append four zeroes.
>
> -O mxoff
> --multixact-offset=mxoff
>
> Manually set the next multitransaction offset.
>
> A safe value can be determined by looking for the numerically
> largest file name in the directory pg_multixact/members under the
> data directory, adding one, and then multiplying by 52352 (0xCC80).
> The file names are in hexadecimal. There is no simple recipe such as
> the ones for other options of appending zeroes.

I think those recipes need to be adjusted for 64-bit offsets.

- Heikki
On Wed, 19 Nov 2025 at 19:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think those recipes need to be adjusted for 64-bit offsets.
Sorry if this is too obvious, but with 32-bit offsets, we get:
SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MXOff) =
32 * 8192 / 4 = 65,536 mxoff per segment.
Now, with 64-bit offsets, we should have half as many.
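[Editor's note] Spelling out the same arithmetic as a self-contained check (assuming the default BLCKSZ of 8192 and SLRU_PAGES_PER_SEGMENT of 32; the variable names are illustrative):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ					8192
#define SLRU_PAGES_PER_SEGMENT	32

int
main(void)
{
	/* MultiXactOffset entries per pg_multixact/offsets segment */
	unsigned long per_seg_old = SLRU_PAGES_PER_SEGMENT * (BLCKSZ / sizeof(uint32_t));
	unsigned long per_seg_new = SLRU_PAGES_PER_SEGMENT * (BLCKSZ / sizeof(uint64_t));

	printf("32-bit offsets: %lu mxids per segment\n", per_seg_old);	/* 65536 */
	printf("64-bit offsets: %lu mxids per segment\n", per_seg_new);	/* 32768 */
	return 0;
}

So the multiplier in the -m recipe would presumably become 32768 (0x8000), while the -O recipe for pg_multixact/members should be unaffected, since the member representation itself is not widened by the patch.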
Best regards,
Maxim Orlov.
On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Ashutosh, you were interested in reviewing this earlier. Would you have
> a chance to review this now, before I commit it?
0002 seems to be fine. It's just moving code from one file to another.
However, the name multixact_internals.h seems to be misusing the term
"internal". I would expect an "internal.h" to expose things for use
within the module and not "externally" in other modules. But this
isn't the first time: buf_internal.h, toast_internal.h,
bgworker_internal.h and so on have their own meaning of what
"internal" means to them. But it might be better to use something more
meaningful than "internal" in this case. The file seems to contain
declarations related to how pg_multixact SLRU is structured. To that
effect multixact_slru.h or multixact_layout.h may be more appropriate.
There are two kinds of offsets that that file deals with: the offset
within pg_multixact/offsets (MultiXactOffset), and the member offset
(flag offset and xid offset) within pg_multixact/members. It's quite easy to get
confused between those when reading that code. For example, it's not
clear which offset MultiXactIdToOffset* functions are about. These
functions are calculating the page, entry (within the page) and
segment (of page) in pg_multixact/offsets where to find the
MultiXactOffset of the first member of a given mxid. Thus they return
an offset within "offsets". I feel they should have been named differently
when the code was written. But now that we are moving this code into a
separate file, we also have an opportunity to rename things better. I
think MXOffsetToMember* functions have better names. Using a similar
convention we could use MultiXactIdToOffsetOffset*, but that might
increase confusion. How about MultiXactIdToOffsetPos*? A separate .h
file also looks like a good place to document how "offsets" is laid
out and what it contains, and how "members" is laid out. The latter is
somewhat documented in terms of the macros and static functions; the
former is not documented well, I feel. This refactoring seems to be a
good opportunity to do that. If we do so, I think there will be some
value in committing the .h file as a separate commit.
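[Editor's note] For reference, the macros in question currently look like this in multixact.c (quoted from the sources, lightly reformatted; under the patch sizeof(MultiXactOffset) doubles, so MULTIXACT_OFFSETS_PER_PAGE halves, but the addressing scheme stays the same):

#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))

/* Locate the offsets-SLRU page, entry and segment for a MultiXactId */
#define MultiXactIdToOffsetPage(xid) \
	((xid) / (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) \
	((xid) % (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetSegment(xid) \
	(MultiXactIdToOffsetPage(xid) / SLRU_PAGES_PER_SEGMENT)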
The reason why this eliminates the need for wraparound handling is mentioned
somewhere in GetNewMultiXactId(), but probably it should be mentioned
at a more prominent place and also in the commit message. I expected
it to be in the prologue of GetNewMultiXactId(), and then a reference
to the prologue from where the comment is right now.
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("MultiXact members would wrap around")));
If a server ever reaches this point, there is no way but to create
another cluster, if the application requires multi-xact ids? We
should also provide this as an errhint().
if (nextOffset + nmembers < nextOffset)
:). I spent a few seconds trying to understand this. But I don't know
how to make it easy to understand.
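[Editor's note] The trick is that MultiXactOffset is unsigned, so the addition wraps modulo the type width, and a wrapped sum is necessarily smaller than either operand. A standalone demonstration (with uint64_t standing in for MultiXactOffset):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * Unsigned addition wraps modulo 2^64: if the sum comes out smaller
	 * than one of the operands, the addition must have overflowed.
	 */
	uint64_t	nextOffset = UINT64_MAX - 10;
	uint64_t	nmembers = 20;

	if (nextOffset + nmembers < nextOffset)
		printf("would wrap around\n");	/* taken: the sum wrapped to 9 */
	return 0;
}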
In ExtendMultiXactMember() the last comment mentions a wraparound
/*
* Advance to next page, taking care to properly handle the wraparound
* case. OK if nmembers goes negative.
*/
I thought this wraparound is about offset wraparound, which can not
happen now. Since you have left the comment intact, it's something
else. Is the wraparound of offset within the page? Maybe requires a
bit more clarification?
PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
newOldestOffset)
{
... snip ...
- segment += 1;
- }
+ SimpleLruTruncate(MultiXactMemberCtl,
+ MXOffsetToMemberPage(newOldestOffset));
}
Most of the callers of SimpleLruTruncate() call it directly. Why do we
want to keep this static wrapper? PerformOffsetsTruncation() has a
comment which seems to need the wrapper. But
PerformMembersTruncation() doesn't even have that.
MultiXactState->oldestMultiXactId = newOldestMulti;
MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+ MultiXactState->oldestOffset = newOldestOffset;
LWLockRelease(MultiXactGenLock);
Is this something we are missing in the current code? I cannot
understand the connection between this change and the fact that offset
wraparound is not possible with wider multixact offsets. Maybe I
missed some previous discussion.
I have reviewed patch 0002 and the multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.
--
Best Wishes,
Ashutosh Bapat
On 21/11/2025 14:15, Ashutosh Bapat wrote:
> On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Ashutosh, you were interested in reviewing this earlier. Would you have
>> a chance to review this now, before I commit it?
>
> 0002 seems to be fine. It's just moving code from one file to another.
> However, the name multixact_internals.h seems to be misusing the term
> "internal". I would expect an "internal.h" to expose things for use
> within the module and not "externally" in other modules. But this
> isn't the first time: buf_internal.h, toast_internal.h,
> bgworker_internal.h and so on have their own meaning of what
> "internal" means to them. But it might be better to use something more
> meaningful than "internal" in this case. The file seems to contain
> declarations related to how pg_multixact SLRU is structured. To that
> effect multixact_slru.h or multixact_layout.h may be more appropriate.
Yeah, I went with multixact_internal.h because of the precedent. It's
not great, but IMHO it's better to be consistent than to invent a new
naming scheme.
> There are two kinds of offsets that that file deals with: the offset
> within pg_multixact/offsets (MultiXactOffset), and the member offset
> (flag offset and xid offset) within pg_multixact/members. It's quite
> easy to get confused between those when reading that code.
Agreed, those are confusing. I'll think about that a little more.
> The reason why this eliminates the need for wraparound handling is mentioned
> somewhere in GetNewMultiXactId(), but probably it should be mentioned
> at a more prominent place and also in the commit message. I expected
> it to be in the prologue of GetNewMultiXactId(), and then a reference
> to prologue from where the comment is right now.
>
> ereport(ERROR,
> (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> errmsg("MultiXact members would wrap around")));
>
> If a server ever reaches this point, there is no way but to create
> another cluster, if the application requires multi-xact ids?
Pretty much. You can also vacuum freeze everything, so that no multixids
are in use, and then use pg_resetwal to reset nextOffset to a lower value.
That obviously sounds bad, but this is a "can't happen" situation. For
comparison, we don't provide any better way to recover from running out
of LSNs either.
> We should also provide this as an errhint().
I think no. You cannot run into this "organically" by just running the
system. If you do manage to hit it, it's a sign of some other trouble,
and I don't want to guess what that might be, or what might be the
appropriate way to recover.
> In ExtendMultiXactMember() the last comment mentions a wraparound
> /*
> * Advance to next page, taking care to properly handle the wraparound
> * case. OK if nmembers goes negative.
> */
> I thought this wraparound is about offset wraparound, which cannot
> happen now. Since you have left the comment intact, it's something
> else. Is the wraparound of offset within the page? Maybe requires a
> bit more clarification?
It was indeed about offset wraparound. I'll remove it.
> PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
> newOldestOffset)
> {
> ... snip ...
> - segment += 1;
> - }
> + SimpleLruTruncate(MultiXactMemberCtl,
> + MXOffsetToMemberPage(newOldestOffset));
> }
>
> Most of the callers of SimpleLruTruncate() call it directly. Why do we
> want to keep this static wrapper? PerformOffsetsTruncation() has a
> comment which seems to need the wrapper. But
> PerformMembersTruncation() doesn't even have that.
Hmm, yeah those wrappers are a bit vestigial now. I'm inclined to keep
them, because as you said, PerformOffsetsTruncation() provides a place
for the comment. And given that, it seems best to keep
PerformMembersTruncation(), for symmetry.
> MultiXactState->oldestMultiXactId = newOldestMulti;
> MultiXactState->oldestMultiXactDB = newOldestMultiDB;
> + MultiXactState->oldestOffset = newOldestOffset;
> LWLockRelease(MultiXactGenLock);
>
> Is this something we are missing in the current code? I cannot
> understand the connection between this change and the fact that offset
> wraparound is not possible with wider multixact offsets. Maybe I
> missed some previous discussion.
Good question. At first I intended to extract that to a separate commit,
before the main patch, because it seems like a nice improvement: We have
just calculated 'oldestOffset', so why not update the value in shared
memory while we have it? But looking closer, I'm not sure if it'd be
sane without the 64-bit offsets. Currently, oldestOffset is only updated
by SetOffsetVacuumLimit(), which also updates offsetStopLimit. We could
get into a state where oldestOffset is set, but offsetStopLimit is not.
With 64-bit offsets that's no longer a concern because it removes
offsetStopLimit altogether.
> I have reviewed patch 0002 and multxact.c changes in 0003. So far I
> have only these comments. I will review the pg_upgrade.c changes next.
Thanks for the review so far!
- Heikki
Looking at the upgrade code, in light of the "IPC/MultixactCreation on the Standby server" thread [1], I think we need to make it more tolerant. It's possible that there are 0 offsets in pg_multixact/offsets. That might or might not be a problem: it's OK as long as those multixids don't appear in any heap table, or you might actually have lost those multixids, which is bad, but the damage has already been done and upgrade should not get stuck on it.

GetOldMultiXactIdSingleMember() currently asserts that the offset is never zero, but it should try to do something sensible in that case instead of just failing.

[1] https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru

- Heikki
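[Editor's note] A sketch of what "something sensible" could look like in the conversion code; everything here is hypothetical except the zero-offset condition, which comes from the mail above:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactId;
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical sketch: instead of asserting offset != 0, treat a zero
 * offset as "this multixact has no members" and keep going, since any
 * damage already happened on the old cluster. All names here are
 * illustrative, not the actual patch.
 */
static int
old_multixact_nmembers(MultiXactId multi, MultiXactOffset offset,
					   MultiXactOffset next_offset)
{
	if (offset == 0)
	{
		fprintf(stderr,
				"warning: multixact %u has no offset, treating it as empty\n",
				multi);
		return 0;				/* emit an empty multixact, don't fail */
	}
	/* normal path: the member count is the gap to the next offset */
	return (int) (next_offset - offset);
}

int
main(void)
{
	return old_multixact_nmembers(42, 0, 0) == 0 ? 0 : 1;
}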