Discussion: POC: make mxidoff 64 bits
Hi!
I've been trying to introduce 64-bit transaction identifiers to Postgres for quite a while [0]. All this implies,
of course, an enormous amount of change that will have to take place in various modules. As a result, the
patch set became too big to be committed "at once".
The obvious solution was to split the patch set into smaller ones. But here comes a new challenge:
not every one of these parts makes Postgres better on its own. Actually, even switching to a
FullTransactionId in PGPROC has the disadvantage of increasing WAL size [1].
In fact, I believe we're dealing with a chicken-and-egg problem here: we are not able to commit the full patch set,
since it is too big to handle, and not able to commit parts of it, since they only make sense all together and do not
improve Postgres on their own.
But it's not that bad. Since commit 4ed8f0913bfdb5f, added in [3], we are able to use 64-bit
indexing for SLRUs.
PROPOSAL
Make multixact offsets 64 bit.
RATIONALE
It is not very difficult to overflow a 32-bit mxidoff, since a new multixact is created for every unique combination of
transactions for each tuple, including XIDs and their respective flags. And when a transaction is added to a
specific multixact, it is rewritten with a new offset. In other words, it is possible to overflow the offsets of
multixacts without overflowing the multixacts themselves and/or ordinary transactions. I believe there
was something about this on the hackers mailing list, but I could not find the links now.
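For a rough sense of scale (my own back-of-the-envelope figures, not from the thread): every new multixact with n members advances the offset counter by n, so with an average of four members per multixact the 2^32 offset space is exhausted after only about 2^30 multixacts, a quarter of the 32-bit multixact ID space.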
PFA the patch. This is a WIP version; upgrade machinery should be added later.
As always, any opinions on the subject are very welcome!
--
Best regards,
Maxim Orlov.
Attachments
On 23/04/2024 11:23, Maxim Orlov wrote:
> PROPOSAL
> Make multixact offsets 64 bit.
+1, this is a good next step and useful regardless of 64-bit XIDs.
> @@ -156,7 +148,7 @@
> ((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
>
> /* page in which a member is to be found */
> -#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
> +#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
> #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
>
> /* Location (byte offset within page) of flag word for a given member */
This is really a bug fix. It didn't matter when TransactionId and MultiXactOffset were both typedefs of uint32, but it was always wrong. The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
--
Heikki Linnakangas
Neon (https://neon.tech)
> On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Make multixact offsets 64 bit.
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("multixact \"members\" limit exceeded"),
Personally, I'd be happy with this! We had some incidents where the only mitigation was vacuum settings tweaking.
BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?
Best regards, Andrey Borodin.
Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL size by a few bytes should have very little impact with today's disk performance. (The WAL volume from logical replication has also grown a lot, and logical replication is a milestone feature that the community keeps improving.) I believe removing PostgreSQL's troublesome transaction ID wraparound would also be a milestone feature; adding a few bytes is worth it!
Best regards
On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
Yeah, I always thought so too. I believe this is just copy-paste. Do you mean it is worth creating a separate CF
entry for these fixes?
On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?
Actually, the first versions of the 64xid patch set had such casts to the types TransactionId, MultiXactId and so on. But
after some discussions, we switched to the unsigned long long cast. Unfortunately, I could not find an exact link
to that discussion. On the other hand, such casting is already used throughout the code. So, just for the
sake of consistency, I would like to stay with these casts.
On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL size by a few bytes should have very little impact with today's disk performance. (The WAL volume from logical replication has also grown a lot, and logical replication is a milestone feature that the community keeps improving.) I believe removing PostgreSQL's troublesome transaction ID wraparound would also be a milestone feature; adding a few bytes is worth it!
I agree 100%. Maybe I should return to this approach and find some benefits of having FXIDs in WAL.
On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> Here is rebase. Apparently I'll have to do it often, since the CATALOG_VERSION_NO changed in the patch.
I don't think you need to maintain the CATALOG_VERSION_NO change in your patch, for the exact reason you have mentioned: the patch will get a conflict each time CATALOG_VERSION_NO is advanced. It's the responsibility of the committer to advance CATALOG_VERSION_NO when needed.
------
Regards,
Alexander Korotkov
Supabase
On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I don't think you need to maintain the CATALOG_VERSION_NO change in your
patch, for the exact reason you have mentioned: the patch will get a conflict
each time CATALOG_VERSION_NO is advanced. It's the responsibility of the
committer to advance CATALOG_VERSION_NO when needed.
OK, I got it. My intention here was to help with testing the patch: if someone wants to have a
look at the patch, they won't need to make changes in the code. In the next iteration, I'll
remove the CATALOG_VERSION_NO change.
Best regards,
Maxim Orlov.
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that holds when the offset becomes 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside lock.
There might be other places we do the same as well.
Regards,
Pavel Borisov
Supabase
On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> Hi, Maxim!
> Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that holds when the offset becomes 64-bit.
> E.g. GetNewMultiXactId():
> nextOffset = MultiXactState->nextOffset;
> is outside lock.
> There might be other places we do the same as well.
I think the replacement of plain assignments by pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.
(The same I think is needed for the patchset [1])
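For illustration, a minimal sketch of what that replacement could look like (the field re-declaration and the call sites are my assumption of the shape of the change, not code from the patchset):

    #include "port/atomics.h"

    /* in MultiXactStateData: was "MultiXactOffset nextOffset" */
    pg_atomic_uint64 nextOffset;

    /* at startup, while access is still single-threaded */
    pg_atomic_init_u64(&MultiXactState->nextOffset, 0);

    /* unlocked reader, e.g. in GetNewMultiXactId() */
    nextOffset = pg_atomic_read_u64(&MultiXactState->nextOffset);

    /* writer, still under MultiXactGenLock held exclusively */
    pg_atomic_write_u64(&MultiXactState->nextOffset, nextOffset + nmembers);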
Regards,
Pavel Borisov
On 2024-Sep-12, Pavel Borisov wrote:
> Hi, Maxim!
>
> Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that holds when the offset becomes 64-bit.
> E.g. GetNewMultiXactId():
>
> nextOffset = MultiXactState->nextOffset;
> is outside lock.
Good thought. But fortunately I think it's not a problem. The one you mention is with MultiXactGenLock held in shared mode -- and that works OK, as the assignment (in line 1263 at the bottom of the same routine) is done with the exclusive lock held.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 07/09/2024 07:36, Maxim Orlov wrote:
> Here is v3.
MultiXactMemberFreezeThreshold looks quite bogus now. Now that MaxMultiXactOffset == 2^64-1, you cannot get anywhere near the MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I guess it would still be useful to trigger autovacuum if multixact members grow large though, to release the disk space, even if you can't run out of members as such anymore. What should the logic for that look like?
I'd love to see some tests for the pg_upgrade code. Something like a little perl script to generate test clusters with different wraparound scenarios etc. using the old version, and a TAP test to run pg_upgrade on them and verify that queries on the upgraded cluster work correctly. We don't have tests like that in the repository today, and I don't know if we'd want to commit these permanently either, but it would be highly useful now as a one-off thing, to show that the code works.
On upgrade, are there really no changes required to pg_multixact/members? I imagined that the segment files would need to be renamed around wraparound, so that if you previously had files like this:
pg_multixact/members/FFFE
pg_multixact/members/FFFF
pg_multixact/members/0000
pg_multixact/members/0001
after upgrade you would need to have:
pg_multixact/members/00000000FFFE
pg_multixact/members/00000000FFFF
pg_multixact/members/000000010000
pg_multixact/members/000000010001
Thanks for working on this!
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi Maxim Orlov
> After a bit of thought, I've realized that to be conservative here is the way to go.
> We can reuse a maximum of existing logic. I mean, we can remove offset wraparound "error logic" and reuse "warning logic", but set the threshold for the "warning logic" to a much higher value. For now, I chose 2^32-1. In other words, the legit logic, in my view, would be to trigger autovacuum if the number of offsets (i.e. the difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
Good point, couldn't agree with you more. xid64 is the solution to the wraparound problem, so the previous error log is no longer meaningful. But we might want to refine the warning log output a little (for example, checking the underlying reasons why the age has been increasing), even though we no longer have to worry about xid wraparound.
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
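For context, this is how the threshold ends up being consumed (the check quoted later in this thread, from the end of SetOffsetVacuumLimit()):

    /*
     * Do we need autovacuum? If we're not sure, assume yes.
     */
    return !oldestOffsetKnown ||
        (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);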
Can we refine this comment a bit? For example:
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value,
+ * we trigger autovacuum in order to release the disk space and reduce table bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Thanks
On Wed, 23 Oct 2024 at 23:55, Maxim Orlov <orlovmg@gmail.com> wrote:
After a bit of thought, I've realized that to be conservative here is the way to go.
We can reuse a maximum of existing logic. I mean, we can remove offset wraparound "error logic" and reuse "warning logic", but set the threshold for the "warning logic" to a much higher value. For now, I chose 2^32-1. In other words, the legit logic, in my view, would be to trigger autovacuum if the number of offsets (i.e. the difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
--
Best regards,
Maxim Orlov.
On 13/11/2024 17:44, Maxim Orlov wrote:
> On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On a different note, I'm surprised you're rewriting member segments from
> scratch, parsing all the individual member groups and writing them out
> again. There's no change to the members file format, except for the
> numbering of the files, so you could just copy the files under the new
> names without paying attention to the contents. It's not wrong to parse
> them in detail, but I'd assume that it would be simpler not to.
>
> Yes, at the beginning I also thought that it would be possible to get by
> with simple copying. But in case of wraparound, we must "bypass" the invalid
> zero offset value. See, old 32-bit offsets are wrapped at 2^32, thus 0
> values appear in multixact.c. So, they must be handled. Bypassed, in fact.
> When we switch to the 64-bit offsets, we have two options:
> 1). Bypass every ((uint32) offset == 0) value in multixact.c;
> 2). Convert members and bypass the invalid value once.
>
> The first option seems too weird for me. So, we have to repack members
> and bypass the invalid value.
Hmm, so if I understand correctly, this is related to how we determine the length of the members array, by looking at the next multixid's offset. This is explained in GetMultiXactIdMembers:
> /*
> * Find out the offset at which we need to start reading MultiXactMembers
> * and the number of members in the multixact. We determine the latter as
> * the difference between this multixact's starting offset and the next
> * one's. However, there are some corner cases to worry about:
> *
> * 1. This multixact may be the latest one created, in which case there is
> * no next one to look at. In this case the nextOffset value we just
> * saved is the correct endpoint.
> *
> * 2. The next multixact may still be in process of being filled in: that
> * is, another process may have done GetNewMultiXactId but not yet written
> * the offset entry for that ID. In that scenario, it is guaranteed that
> * the offset entry for that multixact exists (because GetNewMultiXactId
> * won't release MultiXactGenLock until it does) but contains zero
> * (because we are careful to pre-zero offset pages). Because
> * GetNewMultiXactId will never return zero as the starting offset for a
> * multixact, when we read zero as the next multixact's offset, we know we
> * have this case. We handle this by sleeping on the condition variable
> * we have just for this; the process in charge will signal the CV as soon
> * as it has finished writing the multixact offset.
> *
> * 3. Because GetNewMultiXactId increments offset zero to offset one to
> * handle case #2, there is an ambiguity near the point of offset
> * wraparound. If we see next multixact's offset is one, is that our
> * multixact's actual endpoint, or did it end at zero with a subsequent
> * increment? We handle this using the knowledge that if the zero'th
> * member slot wasn't filled, it'll contain zero, and zero isn't a valid
> * transaction ID so it can't be a multixact member. Therefore, if we
> * read a zero from the members array, just ignore it.
> *
> * This is all pretty messy, but the mess occurs only in infrequent corner
> * cases, so it seems better than holding the MultiXactGenLock for a long
> * time on every multixact creation.
> */
With 64-bit offsets, can we assume that it never wraps around? We often treat 2^64 as "large enough that we'll never run out", e.g. LSNs are also assumed to never wrap around.
I think that would be a safe assumption here too. If we accept that, we don't need to worry about case 3 anymore. But if we upgrade wrapped-around members files by just renaming them, there could still be a members array where we had skipped offset 0, and reading that after the upgrade might get confused. We could continue to ignore a 0 XID in the members array like the comment says; I think that would be enough. But yeah, maybe it's better to bite the bullet in pg_upgrade and squeeze those out.
Does your upgrade test suite include case 3, where the next multixact's offset is 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other comments and checks that talk about binary-upgraded values too that we can hopefully clean up now.
If we are to parse the member segments in detail in upgrade anyway, I'd be tempted to make some further changes / optimizations:
- You could leave out all locking XID members in upgrade, because they're not relevant after upgrade any more (all the XIDs will be committed or aborted and have released the locks; we require prepared transactions to be completed before upgrading too). It'd be enough to include actual UPDATE/DELETE XIDs.
- The way we determine the length of the members array by looking at the next multixid's offset is a bit complicated. We could have one extra flag per XID in the members to indicate "this is the last member of this multixid". That could either replace the current mechanism of looking at the next offset, or be just an additional cross-check.
- Do we still like the "group" representation, with 4 bytes of flags followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per XID unaligned.
- A more radical idea: There can be only one updating XID in one multixid. We could store that directly in the offsets SLRU, and keep only the locking XIDs in members. That way, the members SLRU would become less critical; it could be safely reset on crash for example (except for prepared transactions, which could still be holding locks, but it'd still be less serious). Separating correctness-critical data from more ephemeral state is generally a good idea.
I'm not insisting on any of these changes, just some things that might be worth considering if we're rewriting the SLRUs on upgrade anyway.
--
Heikki Linnakangas
Neon (https://neon.tech)
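For reference on the "group" representation mentioned in that third point, this is the current layout as I read it in multixact.c on master (paraphrased; treat as a sketch):

    /* One member "group": 4 flag bytes followed by 4 TransactionIds. */
    #define MULTIXACT_FLAGBYTES_PER_GROUP       4
    #define MULTIXACT_MEMBERS_PER_MEMBERGROUP   4
    /* 4 + 4 * sizeof(TransactionId) = 20 bytes, i.e. 5 bytes per member */
    #define MULTIXACT_MEMBERGROUP_SIZE \
        (sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + \
         MULTIXACT_FLAGBYTES_PER_GROUP)

Note that this already averages 5 bytes per member, the same density as the unaligned alternative; the difference would be in alignment and flag packing, not in space.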
On Fri, 15 Nov 2024 at 14:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:
Correct.
If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.
Correct. I couldn't explain this better. I'm more for squeezing those out. Otherwise, we end up adding another hack in multixact, but one of the benefits of switching to 64 bits is that it should make the XID logic more straightforward. After all, mxact juggling in pg_upgrade is a one-time inconvenience.
Does your upgrade test suite include case 3, where the next multixact's
offset is 1?
Not exactly.
simple
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5927049
offset-wrap
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5591183
multi-wrap
Latest checkpoint's NextMultiXactId: 82006
Latest checkpoint's NextMultiOffset: 7408811
offset-multi-wrap
Latest checkpoint's NextMultiXactId: 52146
Latest checkpoint's NextMultiOffset: 5591183
Do you want a test case where NextMultiOffset will be 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.
Yes, technically we can. But this is kinda unrelated to the offsets and will make the patch set significantly more complicated, thus harder to review and less likely to be committed. Again, I'm not opposing the idea; I'm just not sure it is worth doing right now.
If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:
- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.
- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either to replace the current mechanism of looking
at the next offset, or be just an additional cross-check.
- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.
Not really. But I would leave it for the next iteration - switching multi to 64 bit. I already have some drafts for this. In any case, we'll have to make adjustments in pg_upgrade again. My goal is to move towards 64-bit XIDs, but with small steps, and I plan changes in the "group" representation in combination with switching multi to 64 bit. This seems a bit more appropriate in my view.
As for your optimization suggestions, I like them. I'm not against them, but I'm afraid of disrupting the clarity of thought, especially since the algorithm is not the simplest.
--
Best regards,
Maxim Orlov.
On 27/12/2024 19:09, Maxim Orlov wrote:
>
> On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi
> <mailto:hlinnaka@iki.fi>> wrote:
> Does the pg_upgrade code work though, if you have that buggy situation
> where oldestOffsetKnown == false ?
...
> >
> > if (!TransactionIdIsValid(*xactptr))
> > {
> > /* Corner case 3: we must be looking at
> unused slot zero */
> > Assert(offset == 0);
> > continue;
> > }
>
> After upgrade, this corner case 3 would *not* happen on offset == 0. So
> looks like we're still missing test coverage for this upgrade corner
> case.
>
> Am I understanding correctly that you want to have a test corresponding
> to the buggy 9.3 and 9.4 era versions?
No, those were two different things. I think there might be two things
wrong here:
1. I suspect pg_upgrade might not correctly handle the situation where
oldestOffsetKnown==false, and
2. The above assertion in "corner case 3" would not hold. It seems that
we don't have a test case for it, or it would've hit the assertion.
Now that I think about it, yes, a test case for 1. would be good too.
But I was talking about 2.
> Do you think we could imitate this scenario on a current master branch
> like that:
> 1) generate a couple of offsets segments for the first table;
> 2) generate more segments for a second table;
> 3) drop first table;
> 4) stop pg cluster;
> 5) remove pg_multixact/offsets/0000
> 6) upgrade?
I don't remember off the top of my head.
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, 2 Jan 2025 at 01:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
I agree. After all, we already have a ready-made solution in the form of vacuum, don't we?
If I understand all this multixact_old.c machinery correctly, in case of oldestOffsetKnown==false
we should fail with "could not open file", or the offset will be 0 in GetOldMultiXactIdSingleMember.
So, I suppose we can put an analogue of a SimpleLruDoesPhysicalPageExist call at the beginning
of GetOldMultiXactIdSingleMember. And if either SimpleLruDoesPhysicalPageExist returns false
or the corresponding offset is 0, we have to bail out with "oldest offset does not exist, consider
running vacuum before pg_upgrade" or something similar. Please correct me if I'm wrong.
Best regards,
Maxim Orlov.
Hi Maxim
> Looks like there is a bit of a pause in the discussion. Here is a small update. Consider v12.
> No major changes, rebase onto the actual master and a squash of multiple commits to make the
> patch set easier to review.
> AFAICS, we have reached a consensus on the core patch for switching to 64-bit offsets. The
> only concern is about more comprehensive test coverage for pg_upgrade, is it?
Agreed. When upgrading hits the extreme case (oldestOffsetKnown==false), just follow the solution mentioned by Heikki Linnakangas.
Thanks
On Thu, Jan 16, 2025 at 9:32 PM Maxim Orlov <orlovmg@gmail.com> wrote:
Looks like there is a bit of a pause in the discussion. Here is a small update. Consider v12.
No major changes, rebase onto the actual master and a squash of multiple commits to make the patch set easier to review.
AFAICS, we have reached a consensus on the core patch for switching to 64-bit offsets. The only concern is about more comprehensive test coverage for pg_upgrade, is it?
--
Best regards,
Maxim Orlov.
Here is a rebase, v14.
--
Best regards,
Maxim Orlov.
Attachments
- v14-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch.txt
- v14-0001-Use-64-bit-format-output-for-multixact-offsets.patch
- v14-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
- v14-0002-Use-64-bit-multixact-offsets.patch
- v14-0003-Make-pg_upgrade-convert-multixact-offsets.patch
- v14-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch.txt
- v14-0007-TEST-bump-catver.patch.txt
Hi Maxim Orlov, Heikki Linnakangas
Thank you for working on it. Code freeze is only a few days away. It seems like things have been quiet for a while, but I believe implementing xid64 is absolutely necessary. For instance, there's often concern about performance jitter caused by frequent freezes. If xid64 is implemented, the freeze process can be smoother.
Thanks
On Fri, Mar 7, 2025 at 7:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
Here is a rebase, v14.
--
Best regards,
Maxim Orlov.
On 07/03/2025 13:30, Maxim Orlov wrote:
> Here is a rebase, v14.
Thanks! I did some manual testing of this. I created a little helper function to consume multixids, to test the autovacuum behavior, and found one issue:
If you consume a lot of multixid members space, by creating lots of multixids with a huge number of members in each, you can end up with a very bloated members SLRU, and autovacuum is in no hurry to clean it up. Here's what I did:
1. Installed the attached test module
2. Ran "select consume_multixids(10000, 100000);" many times
3. Ran:
$ du -h data/pg_multixact/members/
26G data/pg_multixact/members/
When I run "vacuum freeze; select * from pg_database;", I can see that 'datminmxid' for the current database is advanced. However, autovacuum is in no hurry to vacuum 'template0' and 'template1', so pg_multixact/members/ does not get truncated. Eventually, when autovacuum_multixact_freeze_max_age is reached, it presumably will, but you will run out of disk space before that.
There is this check for members size at the end of SetOffsetVacuumLimit():
> /*
> * Do we need autovacuum? If we're not sure, assume yes.
> */
> return !oldestOffsetKnown ||
> (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
And the caller (SetMultiXactIdLimit()) will in fact signal the autovacuum launcher after "vacuum freeze" because of that. But the autovacuum launcher will look at the datminmxid / relminmxid values, see that they are well within autovacuum_multixact_freeze_max_age, and do nothing.
This is a very extreme case, but clearly the code to signal the autovacuum launcher, and the freeze age cutoff that autovacuum then uses, are not in sync. This patch removed MultiXactMemberFreezeThreshold(), per my suggestion, but we threw the baby out with the bathwater. We discussed that in this thread, but didn't come up with any solution. But ISTM we still need something like MultiXactMemberFreezeThreshold() to trigger autovacuum freezing if the members have grown too large.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 01/04/2025 21:25, Heikki Linnakangas wrote:
> On 07/03/2025 13:30, Maxim Orlov wrote:
>> Here is a rebase, v14.
>
> Thanks! I did some manual testing of this. I created a little helper
> function to consume multixids, to test the autovacuum behavior, and
> found one issue:
Forgot to attach the test function I used, here it is.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments
Hi Heikki
Please kindly help to create a task in https://commitfest.postgresql.org/53/ ; I cannot find this patch in
Commitfest 2025-07.
Thanks
On Wed, Apr 2, 2025 at 2:26 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 01/04/2025 21:25, Heikki Linnakangas wrote:
> On 07/03/2025 13:30, Maxim Orlov wrote:
>> Here is a rebase, v14.
>
> Thanks! I did some manual testing of this. I created a little helper
> function to consume multixids, to test the autovacuum behavior, and
> found one issue:
Forgot to attach the test function I used, here it is.
--
Heikki Linnakangas
Neon (https://neon.tech)
I moved the topic to the next commitfest.
--
Best regards,
Maxim Orlov.
Hi Maxim,
On Wed, Apr 16, 2025 at 1:42 PM Maxim Orlov <orlovmg@gmail.com> wrote:
I moved the topic to the next commitfest.
I am reviewing these patches.
I notice that transam/README does not mention multixacts except in one place, in the section "Transaction Emulation during Recovery". I expected it to document how pg_multixact/members and pg_multixact/offsets are used and what their layout is. It's not the responsibility of this patchset to document it, but it would be good if we added a section about multixacts to transam/README. It would make reviews easier.
Best Wishes,
Ashutosh Bapat
Yet another rebase @ f5a987c0e5
--
Best regards,
Maxim Orlov.
Attachments
- v16-0001-Use-64-bit-format-output-for-multixact-offsets.patch
- v16-0002-Use-64-bit-multixact-offsets.patch
- v16-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
- v16-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch
- v16-0003-Make-pg_upgrade-convert-multixact-offsets.patch
- v16-0007-TEST-bump-catver.patch
- v16-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch
Once again, @ 8191e0c16a
--
Best regards,
Maxim Orlov.
Attachments
- v17-0005-TEST-initdb-option-to-initialize-cluster-with-no.patch
- v17-0003-Make-pg_upgrade-convert-multixact-offsets.patch
- v17-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
- v17-0002-Use-64-bit-multixact-offsets.patch
- v17-0001-Use-64-bit-format-output-for-multixact-offsets.patch
- v17-0006-TEST-add-src-bin-pg_upgrade-t-006_offset.pl.patch
- v17-0007-TEST-bump-catver.patch
Hello Maxim!
On Thu, Sep 11, 2025 at 11:58 AM Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Once again, @ 8191e0c16a
Thank you for your work on this subject. Multixact members can really grow much faster than multixact offsets, and avoiding wraparound just here might make sense. At the same time, making multixact offsets 64-bit is local and doesn't require changing the tuple xmin/xmax interpretation.
I went through the patchset. The shape does not look bad, but I have a concern about the size of the multixact offsets. As I understand, this patchset grows multixact offsets twice over; each multixact offset grows from 32 bits to 64 bits. This seems quite a price for avoiding the multixact members' wraparound.
We can try to squeeze multixact offsets, given it's an ascending sequence, each time increased by a multixact's size. But how many members can a multixact contain at maximum? Looking at MultiXactIdExpand(), I get that we're keeping locks from in-progress transactions, and committed non-lock transactions (I guess the latter could be only one). The number of transactions run by backends should fit MAX_BACKENDS (2^18 - 1), and the number of prepared transactions should also fit MAX_BACKENDS. So, I guess we can cap the total number of one multixact's members to 2^24.
Therefore, we can change from each 8 of 32-bit multixact offsets (takes 32 bytes) to one 64-bit offset + 7 of 24-bit offset increments (takes 29 bytes). The actual multixact offsets can be calculated on the fly; the overhead shouldn't be significant. What do you think?
------
Regards,
Alexander Korotkov
Supabase
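A minimal sketch of how such a chunk could be laid out and decoded (my illustration of the idea, with hypothetical names, assuming Postgres' uint64/uint8 typedefs; not code from any posted patch):

    #define MXOFF_CHUNK_ENTRIES 8

    typedef struct MxOffChunk
    {
        uint64      base;       /* full 64-bit offset of the first entry */
        uint8       inc[7][3];  /* 24-bit deltas to the previous entry, little-endian */
    } MxOffChunk;               /* 8 + 7*3 = 29 bytes for 8 offsets */

    /* Reconstruct the i-th offset (0..7) by summing the deltas. */
    static inline uint64
    mxoff_chunk_read(const MxOffChunk *c, int i)
    {
        uint64      off = c->base;

        for (int k = 0; k < i; k++)
            off += (uint64) c->inc[k][0] |
                   ((uint64) c->inc[k][1] << 8) |
                   ((uint64) c->inc[k][2] << 16);
        return off;
    }

Since each delta is a single multixact's member count, the 2^24 cap above guarantees it fits in 24 bits.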
On Sat, 13 Sept 2025 at 16:34, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Therefore, we can change from each 8 of 32-bit multixact offsets
(takes 32-bytes) to one 64-bit offset + 7 of 24-bit offset increments
(takes 29-bytes). The actual multixact offsets can be calculated at
the fly, overhead shouldn't be significant. What do you think?
Thank you for your review; I'm pleased to hear from you again.
Yes, because the maximum number of mxoff increments is limited by the number of
running transactions, we may do it that way.
However, it is a bit weird to have offsets with the 7-byte "base".
I believe we may take advantage of the 64XID patch's notion of putting an
8-byte base followed by 4-byte offsets for a particular page.
A 32kB page may then contain 2^13-2 offsets, each increment maxed at 2^18+1.
Therefore, the offset from the base will never overflow 2^31 and will always
fit in uint32.
It appears logical to me.
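(A quick check of that bound, my own arithmetic: (2^13 - 2) * (2^18 + 1) = 2^31 + 2^13 - 2^19 - 2 < 2^31, so the largest offset-from-base on a page indeed fits in a uint32.)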
--
Best regards,
Maxim Orlov.
Hi Maxim
Thanks for your continued efforts to get XID64 implemented.
> A 32kB page may then contain 2^13-2 offsets, each increment maxed at 2^18+1.
> Therefore, the offset from the base will never overflow 2^31 and will always
> fit in uint32.
> It appears logical to me.
Agree, +1. But I have a question: I remember the XID64 patch got split into a few threads. How are these threads related? The original one was seen as too big a change, so it was broken up after people raised concerns.
On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> Agree, +1. But I have a question: I remember the XID64 patch got split into a few threads. How are these threads related? The original one was seen as too big a change, so it was broken up after people raised concerns.
Yeah, you're absolutely correct. This thread is part of the overall
work on the transition to XID64 in Postgres. As far as I remember, no
one explicitly raised concerns, but it's clear to me that it won't be
committed as one big patch set.
Here is a WIP patch @ 8191e0c16a for the issue discussed above.
I also had to merge several patches from the previous set, since the
consensus is to use the PRI* format for outputting 64-bit values, and a
separate patch that only changed the format with casts and %llu lost
its meaning.
If this option suits everyone, the next step is to add the part related
to pg_upgrade.
Best regards,
Maxim Orlov.
Attachments
Hi Maxim
Thank you for your feedback. I remember that for this patch the primary challenges are in the upgrade section, where we're still debating how to address a few edge cases. Right now, this thread is missing input from core code contributors.
Thanks
Here is a new patch set @ 10b5bb3bffaee8
As previously stated, the patch set implements the concept of saving the
"difference" between page offsets in order to save disk space.
The second significant change is that I decided to modify the
pg_upgrade portion as suggested by Heikki above.
At the very least, the logic becomes much simpler, completely
replicates what is done in the multixact.c module, and provides some
optimization.
Perhaps this will be easier to comprehend and analyse than attempting
to correctly convert bytes from one format to another.
In the near future, I intend to focus on testing and implement a test
suite.
--
Best regards,
Maxim Orlov.
Attachments
On 27/10/2025 17:54, Maxim Orlov wrote:
> Here is a new patch set @ 10b5bb3bffaee8
>
> As previously stated, the patch set implements the concept of saving the
> "difference" between page offsets in order to save disk space.
Hmm, is that safe? We do the assignment of multixact and offset, in the GetNewMultiXactId() function, separately from updating the SLRU pages in the RecordNewMultiXact() function. I believe this can happen:
To keep the arithmetic simple, let's assume that multixid 100 is the first multixid on an offsets SLRU page. So the 'base' on the page is initialized when multixid 100 is written.
1. Backend A calls GetNewMultiXactId(), is assigned multixid 100, offset 1000
2. Backend B calls GetNewMultiXactId(), is assigned multixid 101, offset 1010
3. Backend B calls RecordNewMultiXact() and sets 'page->offsets[1] = 10'
4. Backend A calls RecordNewMultiXact() and sets 'page->base = 1000' and 'page->offsets[0] = 0'
If backend C looks up multixid 101 in between steps 3 and 4, it would read the offset incorrectly, because 'base' isn't set yet.
- Heikki
Unfortunately, I need to admit that I messed up a bit with v18.
I forgot to sync the pg_upgrade portion with the rest of the patch,
among other things. Here's a proper version with additional testing.
pg_upgrade/t/007_mxoff.pl is not meant to be committed, it's just
for test purposes. In order to run it, env var oldinstall must be set.
On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/10/2025 17:54, Maxim Orlov wrote:
If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.
Hmm, maybe I miss something? We set the page base on the first write of any
offset on the page, not only the first one. In other words, there
should never be a case when we read an offset without a previously
defined page base. Correct me if I'm wrong:
1. Backend A assigned mxact=100, offset=1000.
2. Backend B assigned mxact=101, offset=1010.
3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and
set page base=1010, offset plus 0^0x80000000 bit while
holding lock on the page.
4. Backend C looks up for the mxact=101 by calling MXOffsetRead()
and should get exactly what he's looking for:
base (1010) + offset (0) minus 0x80000000 bit.
5. Backend A calls RecordNewMultiXact() and sets his offset using
existing base from step 3.
--
Best regards,
Maxim Orlov.
Вложения
On 30/10/2025 08:13, Maxim Orlov wrote:
> On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/10/2025 17:54, Maxim Orlov wrote:
>
> If backend C looks up multixid 101 in between steps 3 and 4, it would
> read the offset incorrectly, because 'base' isn't set yet.
>
> Hmm, maybe I miss something? We set the page base on the first write of any
> offset on the page, not only the first one. In other words, there
> should never be a case when we read an offset without a previously
> defined page base. Correct me if I'm wrong:
> 1. Backend A assigned mxact=100, offset=1000.
> 2. Backend B assigned mxact=101, offset=1010.
> 3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and
> set page base=1010, offset plus 0^0x80000000 bit while
> holding lock on the page.
> 4. Backend C looks up for the mxact=101 by calling MXOffsetRead()
> and should get exactly what he's looking for:
> base (1010) + offset (0) minus 0x80000000 bit.
> 5. Backend A calls RecordNewMultiXact() and sets his offset using
> existing base from step 3.
Oh I see, the 'base' is not necessarily the base offset of the first multixact on the page, it's the base offset of the first multixid that is written to the page. And the (short) offsets can be negative. That's a frighteningly clever encoding scheme. One upshot of that is that WAL redo might construct the page with a different 'base'. I guess that works, but it scares me. Could we come up with a more deterministic scheme?
- Heikki
On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?
Definitely! The most stable approach is the one we had before, which
used actual 64-bit offsets in the SLRU. To be honest, I'm completely
happy with it. After all, what's most important for me is to have 64-bit
xids in Postgres, and this patch is a step towards that goal.
PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
segments.
Fortunately, now that I've separated reading and writing offsets into
different functions, switching from one implementation to another is
easy to do.
Here's a quick overview of the current state of the patch:
1) Access to the offset is placed to separate calls:
MXOffsetWrite/MXOffsetRead.
2) I abandoned byte juggling in pg_upgrade and moved to using logic that
replicates the work with offsets in multixact.c
3) As a result, the update issue came down to the correct implementation
of functions MXOffsetWrite/MXOffsetRead.
4) The only question that remains is the question of disk representation
of 64-bit offsets in SLRU segments.
My thoughts on point (4).
Using 32-bit offsets + some kind of packing:
Pros:
+ Reduce the total disk space used by the segments; ideally it is
almost the same as before.
Cons:
- Reduces reliability (losing a part will most likely result in losing
the entire page).
- Complicates code, especially considering that segments may be written
to the page in random order.
Using 64-bit offsets in SLRU:
Pros:
+ Easy to implement/transparent logic.
Cons:
- Increases the amount of disk space used.
In terms of speed, I'm not sure which will be faster. On the one hand,
64-bit eliminates the necessity for calculations and branching. On the
other hand, the amount of data used will increase.
I am not opposed to any of these options, as our primary goal is getting
64-bit offsets. However, I like the approach using full 64-bit offsets
in SLRU, because it is more clear and, should we say, robust. Yes, it
will increase the number of segments; however, this is not heap data
for a table. Under typical circumstances, there should not be too many
such segments.
Best regards,
Maxim Orlov.
Attachments
On Thu, Oct 30, 2025 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Oh I see, the 'base' is not necessarily the base offset of the first
>> multixact on the page, it's the base offset of the first multixid that
>> is written to the page. And the (short) offsets can be negative. That's
>> a frighteningly clever encoding scheme. One upshot of that is that WAL
>> redo might construct the page with a different 'base'. I guess that
>> works, but it scares me. Could we come up with a more deterministic scheme?
>>
> Definitely! The most stable approach is the one we had before, which
> used actual 64-bit offsets in the SLRU. To be honest, I'm completely
> happy with it. After all, what's most important for me is to have 64-bit
> xids in Postgres, and this patch is a step towards that goal.
Yes, but why can't we have an encoding scheme which would both be deterministic and provide compression? The attached is what I meant in [1]. It's based on v19 and provides deterministic conversion of each 8 of 64-bit offsets into chunks containing a 64-bit base and 7 of 24-bit increments. I didn't touch the pg_upgrade code though.
Links.
1. https://www.postgresql.org/message-id/CAPpHfdtPybyMYBj-x3-Z5%3D4bj_vhYk2R0nezfy%3DVjcz4QBMDgw%40mail.gmail.com
------
Regards,
Alexander Korotkov
Supabase
Attachments
On 30/10/2025 18:17, Maxim Orlov wrote:
> PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
> segments.
>
> Fortunately, now that I've separated reading and writing offsets into
> different functions, switching from one implementation to another is
> easy to do.
>
> Here's a quick overview of the current state of the patch:
> 1) Access to the offset is placed to separate calls:
> MXOffsetWrite/MXOffsetRead.
> 2) I abandoned byte juggling in pg_upgrade and moved to using logic that
> replicates the work with offsets in multixact.c
> 3) As a result, the update issue came down to the correct implementation
> of functions MXOffsetWrite/MXOffsetRead.
> 4) The only question that remains is the question of disk representation
> of 64-bit offsets in SLRU segments.
Here's another round of review and cleanup of this. I made a bunch of small changes, but haven't found any major problems. Looking pretty good.
Notable changes since v20:
- Changed MULTIXACT_MEMBER_AUTOVAC_THRESHOLD to 4000000000 instead of 0xFFFFFFFF. The 2^32 mark doesn't have any particular significance and using a round number makes that more clear. We should possibly expose that as a separate GUC, but that can be done in a followup patch.
- Removed the MXOffsetRead/Write functions again. They certainly make sense if we make them more complicated with some kind of a compression scheme, but I preferred to keep the code closer to 'master' for now.
- Removed more remnants of offset wraparound handling. There were still a few places that checked for wraparound and tried to deal with it, while other places just assumed that it doesn't happen. I added a check in GetNewMultiXactId() to error out if it would wrap around. It really should not happen in the real world, one reason being that we would run out of WAL before running out of 64-bit mxoffsets, but a sanity check won't hurt just in case someone e.g. abuses pg_resetwal to get into that situation.
- Removed MaybeExtendOffsetSlru(). It was only used to deal with binary upgrade from version 9.2 and below. Now that pg_upgrade rewrites the files, it's not needed anymore.
- Modified PerformMembersTruncation() to use SimpleLruTruncate()
Changes in pg_upgrade:
- Removed the nextMXact/nextOffset fields from MultiXactWriter. They were redundant with the next_multi and next_offset local variables in the caller.
Remaining issues:
- There's one more refactoring I'd like to do before merging this: Move the definitions that are now duplicated between src/bin/pg_upgrade/multixact_new.c and src/backend/access/transam/multixact.c into a new header file, multixact_internal.h. One complication with that is that it needs SLRU_PAGES_PER_SEGMENT from access/slru.h, but slru.h cannot currently be included in FRONTEND code. Perhaps we should move SLRU_PAGES_PER_SEGMENT to pg_config_manual.h, or if that feels too global, to a separate slru_config.h file.
- I saw Alexander's proposal for a new compression scheme but didn't incorporate that here. It might be a good idea, but I think we can do that as a followup patch before the release, if it seems worth it. I don't feel too bad about just making pg_multixact/offsets 2x larger either.
- Have you done any performance testing of the pg_upgrade code? How long does the conversion take if you have e.g. 1 billion multixids?
- Is the !oldestOffsetKnown case in the code still reachable? I left one FIXME comment about that. Needs a comment update at least.
- The new pg_upgrade test fails on my system with this error in the log:
> # Running: pg_dump --no-sync --restrict-key test -d port=22462 host=/tmp/5KdMvth1jk dbname='postgres' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql
> pg_dump: error: aborting because of server version mismatch
> pg_dump: detail: server version: 19devel; pg_dump version: 17.6
> could not read "/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql": No such file or directory at /home/heikki/git-sandbox/postgresql/src/bin/pg_upgrade/t/007_mxoff.pl line 242.
This turns out to be an issue with IPC::Run. Setting the IPCRUNDEBUG=basic env variable reveals that it has a built-in command cache:
> IPC::Run 0004 [#19(109223)]: ****** harnessing *****
> IPC::Run 0004 [#19(109223)]: parsing [ pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f '/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql']
> IPC::Run 0004 [#19(109223)]: ** starting
> IPC::Run 0004 [#19(109223)]: 'pg_dump' found in cache: '/home/heikki/pgsql.17stable/bin/pg_dump'
> IPC::Run 0004 [#19(111432) pg_dump]: execing /home/heikki/pgsql.17stable/bin/pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql
> IPC::Run 0004 [#19(109223)]: ** finishing
> pg_dump: error: aborting because of server version mismatch
> pg_dump: detail: server version: 19devel; pg_dump version: 17.6
The test calls pg_dump twice: first with the old version, then with the new version. But thanks to IPC::Run's command cache, the invocation of the new pg_dump version actually also calls the old version. I'm not sure how to fix that, but I was able to work around it by reversing the pg_dump calls so that the new version is called first. That way we use the new pg_dump against both server versions, which works.
- The new pg_upgrade test is very slow. I would love to include that test permanently in the test suite, but it's too slow for that currently.
- Heikki
Attachments
On Wed, 5 Nov 2025 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Remaining issues:
- There's one more refactoring I'd like to do before merging this: Move
the definitions that are now duplicated between
src/bin/pg_upgrade/multixact_new.c and
src/backend/access/transam/multixact.c into a new header file,
multixact_internal.h. One complication with that is that it needs
SLRU_PAGES_PER_SEGMENT ...
Done. Also put SLRU_PAGES_PER_SEGMENT in pg_config_manual.h.
In my opinion, this constant perfectly fits the description in the
file header. In any case, feel free to move it anywhere you like.
> - Have you done any performance testing of the pg_upgrade code? How long
> does the conversion take if you have e.g. 1 billion multixids?
Unfortunately, not yet. I'd like to do this soon. Currently, the
bulk of the testing time is spent generating multi-transactions.
> - Is the !oldestOffsetKnown case in the code still reachable? I left one
> FIXME comment about that. Needs a comment update at least.
Yep, no longer needed. A separate commit has been added.
> - The new pg_upgrade test fails on my system with this error in the log:
Unfortunately, I can't reproduce this issue. I think it can be fixed by
providing an explicit path to the utility.
> - The new pg_upgrade test is very slow. I would love to include that
> test permanently in the test suite, but it's too slow for that currently.
Yes, unfortunately. The majority of the time is spent on tests that
produce multiple segments. These are the cases numbered 4 and higher.
If we remove those, the testing should be relatively fast.
I also added a commit, "Handle wraparound of next new multi in
pg_upgrade", per BUG #18863 and BUG #18865.
The issue is that pg_upgrade neglects to handle the wraparound of
mxact/mxoff.
We'll obviously resolve the issue with mxoff wraparound by moving to
64 bits. And the mxact bug can be easily solved with two lines of code.
Or five if you count indents and comments. A test is also provided.
This commit is totally optional. If you think it deserves to be treated
as a different issue, feel free to discard it.
--
Best regards,
Maxim Orlov.
Attachments
- v23-0006-TEST-bump-catversion.patch.txt
- v23-0001-Use-64-bit-multixact-offsets.patch
- v23-0005-Handle-wraparound-of-next-new-multi-in-pg_upgrad.patch
- v23-0002-Add-pg_upgarde-for-64-bit-multixact-offsets.patch
- v23-0003-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v23-0004-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v23-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch.txt
I noticed one minor issue after I had already sent the previous message.
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1034,7 +1034,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
if (nextOffset + nmembers < nextOffset)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- "MultiXact members would wrap around"));
+ errmsg("MultiXact members would wrap around")));
*offset = nextOffset;
$ $PGBINOLD/pg_controldata -D pgdata
pg_control version number: 1800
Catalog version number: 202510221
...
Latest checkpoint's NextMultiXactId: 10000000
Latest checkpoint's NextMultiOffset: 999995050
Latest checkpoint's oldestXID: 748
...
I tried finding out how long it would take to convert a big number of
segments. Unfortunately, I only have access to a very old machine right
now. It took me 7 hours to generate this much data on my old
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
Here are my rough measurements:
HDD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m59.459s
user 0m19.974s
sys 0m13.640s
SSD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m52.958s
user 0m19.826s
sys 0m13.624s
I aim to get access to more modern stuff and check it all out there.
Best regards,
Maxim Orlov.
On 07/11/2025 18:03, Maxim Orlov wrote:
> I tried finding out how long it would take to convert a big number of
> segments. Unfortunately, I only have access to a very old machine right
> now. It took me 7 hours to generate this much data on my old
> Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
>
> Here are my rough measurements:
>
> HDD
> $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
> $ time pg_upgrade
> ...
> real 4m59.459s
> user 0m19.974s
> sys 0m13.640s
>
> SSD
> $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
> $ time pg_upgrade
> ...
> real 4m52.958s
> user 0m19.826s
> sys 0m13.624s
>
> I aim to get access to more modern stuff and check it all out there.

Thanks, I also did some perf testing on my laptop. I wrote a little helper function to consume multixids, and used it to create a v17 cluster with 100 million multixids. See attached consume-mxids.patch.txt. I then ran pg_upgrade on that, and measured how long the pg_multixact conversion part of pg_upgrade took. It took about 1.2 s on my laptop. Extrapolating from that, converting 1 billion multixids would take 12 s. These were very simple multixacts with just one member each, though; realistic multixacts with more members would presumably take a little longer. In any case, I think we're in an acceptable ballpark here.

There's some very low-hanging fruit though: Profiling with 'linux-perf' suggested that a lot of CPU time was spent simply on the function call overhead of GetOldMultiXactIdSingleMember, SlruReadSwitchPage, RecordNewMultiXact, SlruWriteSwitchPage for each multixact. I added an inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to eliminate the function call overhead of those in the common case that no page switch is needed (see the sketch after this message). With that, the 100 million mxid test case I used went from 1.2 s to 0.9 s. We could optimize this further but I think this is good enough.

Some other changes since patch set v23:

- Rebased. I committed the wraparound bug fixes.

- I added an SlruFileName() helper function to slru_io.c, and support for reading SLRUs with long_segment_names==true. It's not needed currently, but it seemed like a weird omission. AllocSlruRead() actually left 'long_segment_names' uninitialized, which is error-prone. We could've just documented it, but it seems just as easy to support it.

- I split out the multixact_internal.h header into a separate commit, to make it more clear which changes are related to 64-bit offsets.

I kept all the new test cases for now. We need to decide which ones are worth keeping, and polish and speed up the ones we decide to keep.

I'm getting one failure from the pg_upgrade/008_mxoff test:

> [14:43:38.422](0.530s) not ok 26 - dump outputs from original and restored regression databases match
> [14:43:38.422](0.000s) # Failed test 'dump outputs from original and restored regression databases match'
> # at /home/heikki/git-sandbox/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 801.
> [14:43:38.422](0.000s) # got: '1'
> # expected: '0'
> === diff of /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted and /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted
> === stdout ===
> --- /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted 2025-11-12 14:43:38.030399957 +0200
> +++ /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted 2025-11-12 14:43:38.314399819 +0200
> @@ -2,8 +2,8 @@
>  -- PostgreSQL database dump
>  --
>  \restrict test
> --- Dumped from database version 17.6
> --- Dumped by pg_dump version 17.6
> +-- Dumped from database version 19devel
> +-- Dumped by pg_dump version 19devel
>  SET statement_timeout = 0;
>  SET lock_timeout = 0;
>  SET idle_in_transaction_session_timeout = 0;
> === stderr ===
> === EOF ===
> [14:43:38.425](0.004s) # >>> case #6

I ran the test with:

(rm -rf build/testrun/ build/tmp_install/; olddump=/tmp/olddump-regress.sql oldinstall=/home/heikki/pgsql.17stable/ meson test -C build --suite setup --suite pg_upgrade)

- Heikki
Attachments
- consume-mxids.patch.txt
- v24-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
- v24-0002-Use-64-bit-multixact-offsets.patch
- v24-0003-Add-pg_upgrade-for-64-bit-multixact-offsets.patch
- v24-0004-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v24-0005-TEST-bump-catversion.patch
- v24-0006-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v24-0007-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch
- v24-0008-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch
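[Editor's note] The inlined fast path mentioned in the message above might look roughly like the sketch below. Only the function name SlruReadSwitchPage and the general idea come from the mail; the struct layout and the slow-path helper are assumptions for illustration.

#include <stdint.h>

/* Hypothetical read-side state; the field names are illustrative. */
typedef struct SlruReadState
{
	int64_t		cur_pageno;		/* page currently held in the buffer */
	char	   *buffer;			/* one SLRU page worth of data */
} SlruReadState;

/* Out-of-line slow path: release the old page, load the requested one. */
extern char *SlruReadSwitchPageSlow(SlruReadState *state, int64_t pageno);

/*
 * Inline fast path: in the common case the requested entry is on the
 * page we already hold, so no function call happens at all; we only
 * call out of line when an actual page switch is needed.
 */
static inline char *
SlruReadSwitchPage(SlruReadState *state, int64_t pageno)
{
	if (pageno == state->cur_pageno)
		return state->buffer;	/* no switch needed */
	return SlruReadSwitchPageSlow(state, pageno);
}

The write side would presumably mirror this, with SlruWriteSwitchPage flushing the current page in its slow path before moving to the next one.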
I realized that this issue was still outstanding:

On 01/04/2025 21:25, Heikki Linnakangas wrote:
> Thanks! I did some manual testing of this. I created a little helper
> function to consume multixids, to test the autovacuum behavior, and
> found one issue:
>
> If you consume a lot of multixid members space, by creating lots of
> multixids with a huge number of members in each, you can end up with a
> very bloated members SLRU, and autovacuum is in no hurry to clean it up.
> Here's what I did:
>
> 1. Installed attached test module
> 2. Ran "select consume_multixids(10000, 100000);" many times
> 3. ran:
>
> $ du -h data/pg_multixact/members/
> 26G data/pg_multixact/members/
>
> When I run "vacuum freeze; select * from pg_database;", I can see that
> 'datminmxid' for the current database is advanced. However, autovacuum
> is in no hurry to vacuum 'template0' and 'template1', so
> pg_multixact/members/ does not get truncated. Eventually, when
> autovacuum_multixact_freeze_max_age is reached, it presumably will, but
> you will run out of disk space before that.
>
> There is this check for members size at the end of SetOffsetVacuumLimit():
>
>> /*
>> * Do we need autovacuum? If we're not sure, assume yes.
>> */
>> return !oldestOffsetKnown ||
>> (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
>
> And the caller (SetMultiXactIdLimit()) will in fact signal the
> autovacuum launcher after "vacuum freeze" because of that. But the
> autovacuum launcher will look at the datminmxid / relminmxid values, see
> that they are well within autovacuum_multixact_freeze_max_age, and do
> nothing.
>
> This is a very extreme case, but clearly the code to signal the autovacuum
> launcher, and the freeze age cutoff that autovacuum then uses, are not
> in sync.
>
> This patch removed MultiXactMemberFreezeThreshold(), per my suggestion,
> but we threw the baby out with the bathwater. We discussed that in this
> thread, but didn't come up with any solution. But ISTM we still need
> something like MultiXactMemberFreezeThreshold() to trigger autovacuum
> freezing if the members have grown too large.

Here's a new patch version that addresses the above issue. I resurrected MultiXactMemberFreezeThreshold(), using the same logic as before, just using pretty arbitrary thresholds of 1 and 2 billion offsets instead of the safe/danger thresholds derived from MaxMultiOffset. That gives roughly the same behavior wrt. calculating effective freeze age as before.

Another change is that I removed the offset-based emergency vacuum triggering. With 64-bit offsets, we never need to shut down the system to prevent offset wraparound, so even if the offsets SLRU grows large, it's not an "emergency" the same way that wraparound is. Consuming lots of disk space could be a problem, of course, but we can let autovacuum deal with that at the normal pace, like it deals with bloated tables.

The heuristics could surely be made better and/or more configurable, but I think this is good enough for now. I included these changes as a separate patch for review purposes, but it ought to be squashed with the main patch before committing.

- Heikki
Attachments
- v25-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
- v25-0002-Use-64-bit-multixact-offsets.patch
- v25-0003-Add-pg_upgrade-for-64-bit-multixact-offsets.patch
- v25-0004-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v25-0005-Reintroduce-MultiXactMemberFreezeThreshold.patch
- v25-0006-TEST-bump-catversion.patch
- v25-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v25-0008-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch
- v25-0009-TEST-Add-test-for-64-bit-mxoff-in-pg_upgrade.patch
- v25-0010-TEST-add-consume_multixids-function.patch
On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I added an
> inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
> eliminate the function call overhead of those in the common case that no
> page switch is needed. With that, the 100 million mxid test case I used
> went from 1.2 s to 0.9 s. We could optimize this further but I think
> this is good enough.
I agree with you.
> - I added an SlruFileName() helper function to slru_io.c, and support
> for reading SLRUs with long_segment_names==true. It's not needed
> currently, but it seemed like a weird omission. AllocSlruRead() actually
> left 'long_segment_names' uninitialized which is error-prone. We
> could've just documented it, but it seems just as easy to support it.
Yeah, I didn't particularly like that place either. But then I decided
it was overkill to do it just for the sake of symmetry, and that it
would raise questions. It turned out much better this way.
> I kept all the new test cases for now. We need to decide which ones are
> worth keeping, and polish and speed up the ones we decide to keep.
I'm thinking of two cases here.
A) Upgrade from a "new cluster":
* create a cluster with mxoff just below the 32-bit overflow point
* consume around 2k mxacts (1k before the 32-bit overflow
and 1k after)
* run pg_upgrade
* check the upgraded cluster is working
* check the data invariant
B) Same as A), but for an "old cluster" with the oldinstall env.
On Thu, 13 Nov 2025 at 19:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a new patch version that addresses the above issue. I resurrected
> MultiXactMemberFreezeThreshold(), using the same logic as before, just
> using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
> the safe/danger thresholds derived from MaxMultiOffset. That gives
> roughly the same behavior wrt. calculating effective freeze age as before.
Yes, I think it's okay for now. This reflects the existing logic well.
I wonder what the alternative solution might be? Could we make a
"vacuum freeze" also truncate pg_multixact segments?
In any case, this can be discussed later.
--
Best regards,
Maxim Orlov.
On 14/11/2025 17:40, Maxim Orlov wrote:
> On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi>
>> I kept all the new test cases for now. We need to decide which
>> ones are worth keeping, and polish and speed up the ones we decide
>> to keep.

Attached is a new patch version, with more work on the tests.

The pg_upgrade patch (v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch) now includes a test case. I'm proposing to commit that test along with these patches. It's a heavily-modified version of the test cases you wrote. I tested that test using old installations, all the way down to version 9.4. That required a bunch of changes to the test perl modules, to make them work with such old versions. Without any extra changes, the test works down to v11.

Later patches in the patch set add more tests, labelled with the TEST: prefix. Those are the tests you posted earlier, with little to no modifications. I'm just carrying those around, so that I can easily run them now during development. But I don't think they're adding much value and I plan to leave them out of the final commit.

> I think of two cases here.
> A) Upgrade from "new cluster":
> * created cluster with pre 32-bit overflow mxoff
> * consume around of 2k of mxacts (1k before 32-bit overflow
> and 1k after)
> * run pg_upgrade
> * check upgraded cluster is working
> * check data invariant
> B) Same as A), but for an "old cluster" with oldinstall env.

Makes sense. The 007_multixact_conversion.pl test in the attached patches includes two test scenarios: a "basic" and a "wraparound" test. In the basic scenario there's no overflow or wraparound involved, but it can be run without an old installation, i.e. in a "new -> new" upgrade. The "wraparound" scenario is the same, but the old cluster is reset with pg_resetwal so that the mxoff wraps around. The "wraparound" scenario requires a pre-19 old installation, because the pg_resetwal logic requires the pre-v19 layout.

If we enhance the reset_mxoff() perl function in the test so that it also works with v19, we could run the "wraparound" scenario in new -> new upgrades too. That would essentially be case A) that you listed above. I think it's already pretty good as it is, though. I don't expect the point where we cross offset 2^32 in the new version to be very interesting now that we use 64-bit offsets everywhere.

- Heikki
Attachments
- v26-0001-Move-pg_multixact-SLRU-page-format-definitions-t.patch
- v26-0002-Use-64-bit-multixact-offsets.patch
- v26-0003-TEST-bump-catversion.patch
- v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch
- v26-0005-Remove-oldestOffset-oldestOffsetKnown-from-multi.patch
- v26-0006-Reintroduce-MultiXactMemberFreezeThreshold.patch
- v26-0007-TEST-Add-test-for-64-bit-mxoff-in-pg_resetwal.patch
- v26-0008-TEST-Add-test-for-wraparound-of-next-new-multi-i.patch
- v26-0009-TEST-add-consume_multixids-function.patch
- v26-0010-TEST-Original-pg_upgrade-test-case.patch
Here's yet another patch version. I spent the day reviewing this in detail and doing little cleanups here and there. I squashed the commits and wrote a proper commit message.

One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to me at least) how upgrade from version 9.2 and below now works. It was actually broken when I tested it. Not sure if I had broken it earlier or if it never worked, but in any case it works now.

I also tested upgrading a cluster from an old minor version, < 9.3.5, where the control file has a bogus oldestMultiXid==1 value (see commit b6a3444fa6). As expected, you get a "could not open file" error:

> Performing Upgrade
> ------------------
> Setting locale and encoding for new cluster ok
> ...
> Deleting files from new pg_multixact/members ok
> Deleting files from new pg_multixact/offsets ok
> Converting pg_multixact files
> could not open file "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file or directory
> Failure, exiting

I don't think we need to support that case. I hope there are no clusters in that state still in the wild, and you can work around it by upgrading to 9.3.5 or above and letting autovacuum run. But I wonder if a pre-upgrade check with a better error message would still be worthwhile.

Ashutosh, you were interested in reviewing this earlier. Would you have a chance to review this now, before I commit it? Alexander, Alvaro, would you have a chance to take a final look too, please?

- Heikki
Attachments
Hi Heikki
> I don't think we need to support that case. I hope there are no clusters
> in that state still in the wild, and you can work around it by upgrading
> to 9.3.5 or above and letting autovacuum run. But I wonder if a
> pre-upgrade check with a better error message would still be worthwhile.
I believe it is now highly unlikely to find instances of version 9.3 in the wild; all users are advised to upgrade to the latest version first.
Thanks
One more small issue: The docs for pg_resetwal contain recipes for how to determine safe values to use:

> -m mxid,mxid
> --multixact-ids=mxid,mxid
>
> Manually set the next and oldest multitransaction ID.
>
> A safe value for the next multitransaction ID (first part) can be
> determined by looking for the numerically largest file name in the
> directory pg_multixact/offsets under the data directory, adding one,
> and then multiplying by 65536 (0x10000). Conversely, a safe value
> for the oldest multitransaction ID (second part of -m) can be
> determined by looking for the numerically smallest file name in the
> same directory and multiplying by 65536. The file names are in
> hexadecimal, so the easiest way to do this is to specify the option
> value in hexadecimal and append four zeroes.
>
> -O mxoff
> --multixact-offset=mxoff
>
> Manually set the next multitransaction offset.
>
> A safe value can be determined by looking for the numerically
> largest file name in the directory pg_multixact/members under the
> data directory, adding one, and then multiplying by 52352 (0xCC80).
> The file names are in hexadecimal. There is no simple recipe such as
> the ones for other options of appending zeroes.

I think those recipes need to be adjusted for 64-bit offsets.

- Heikki
On Wed, 19 Nov 2025 at 19:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think those recipes need to be adjusted for 64-bit offsets.
Sorry if this is too obvious, but with 32-bit offsets, we get:
SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MXOff) =
32 * 8192 / 4 = 65,536 mxoff per segment.
Now, with 64-bit offsets, we should have half as many.
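[Editor's note] Spelling out the same arithmetic as a self-contained check (assuming the default BLCKSZ of 8192 and SLRU_PAGES_PER_SEGMENT of 32; the variable names are illustrative):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ					8192
#define SLRU_PAGES_PER_SEGMENT	32

int
main(void)
{
	/* MultiXactOffset entries per pg_multixact/offsets segment */
	unsigned long per_seg_old = SLRU_PAGES_PER_SEGMENT * (BLCKSZ / sizeof(uint32_t));
	unsigned long per_seg_new = SLRU_PAGES_PER_SEGMENT * (BLCKSZ / sizeof(uint64_t));

	printf("32-bit offsets: %lu mxids per segment\n", per_seg_old);	/* 65536 */
	printf("64-bit offsets: %lu mxids per segment\n", per_seg_new);	/* 32768 */
	return 0;
}

So the multiplier in the -m recipe would presumably become 32768 (0x8000), while the -O recipe for pg_multixact/members should be unaffected, since the member representation itself is not widened by the patch.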
Best regards,
Maxim Orlov.
On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Ashutosh, you were interested in reviewing this earlier. Would you have
> a chance to review this now, before I commit it?
0002 seems to be fine. It's just moving code from one file to another.
However, the name multixact_internals.h seems to be misusing the term
"internal". I would expect an "internal.h" to expose things for use
within the module and not "externally" in other modules. But this
isn't the first time: buf_internal.h, toast_internal.h,
bgworker_internal.h and so on have their own meaning of what
"internal" means to them. But it might be better to use something more
meaningful than "internal" in this case. The file seems to contain
declarations related to how pg_multixact SLRU is structured. To that
effect multixact_slru.h or multixact_layout.h may be more appropriate.
There are two kinds of offsets that that file deals with: the offset
within pg_multixact/offsets (MultiXactOffset), and the member offset
(flag offset and xid offset) within pg_multixact/members. It's quite easy to get
confused between those when reading that code. For example, it's not
clear which offset MultiXactIdToOffset* functions are about. These
functions are calculating the page, entry (within the page) and
segment (of page) in pg_multixact/offsets where to find the
MultiXactOffset of the first member of a given mxid. Thus they return
an offset within "offsets". I feel they should have been named differently
when the code was written. But now that we are moving this code into a
separate file, we also have an opportunity to rename things better. I
think MXOffsetToMember* functions have better names. Using a similar
convention we could use MultiXactIdToOffsetOffset*, but that might
increase confusion. How about MultiXactIdToOffsetPos*? A separate .h
file also looks like a good place to document how "offsets" is laid
out and what it contains, and how "members" is laid out. The latter is
somewhat documented in terms of the macros and static functions; the
former is not documented well, I feel. This refactoring seems to be a
good opportunity to do that. If we do so, I think there will be some
value in committing the .h file as a separate commit.
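[Editor's note] For reference, the macros in question currently look like this in multixact.c (quoted from the sources, lightly reformatted; under the patch sizeof(MultiXactOffset) doubles, so MULTIXACT_OFFSETS_PER_PAGE halves, but the addressing scheme stays the same):

#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))

/* Locate the offsets-SLRU page, entry and segment for a MultiXactId */
#define MultiXactIdToOffsetPage(xid) \
	((xid) / (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) \
	((xid) % (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetSegment(xid) \
	(MultiXactIdToOffsetPage(xid) / SLRU_PAGES_PER_SEGMENT)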
The reason why this eliminates the need for wraparound handling is mentioned
somewhere in GetNewMultiXactId(), but probably it should be mentioned
at a more prominent place and also in the commit message. I expected
it to be in the prologue of GetNewMultiXactId(), and then a reference
to the prologue from where the comment is right now.
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("MultiXact members would wrap around")));
If a server ever reaches this point, there is no way but to create
another cluster, if the application requires multi-xact ids? We
should also provide this as an errhint().
if (nextOffset + nmembers < nextOffset)
:). I spent a few seconds trying to understand this. But I don't know
how to make it easy to understand.
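[Editor's note] The trick is that MultiXactOffset is unsigned, so the addition wraps modulo the type width, and a wrapped sum is necessarily smaller than either operand. A standalone demonstration (with uint64_t standing in for MultiXactOffset):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * Unsigned addition wraps modulo 2^64: if the sum comes out smaller
	 * than one of the operands, the addition must have overflowed.
	 */
	uint64_t	nextOffset = UINT64_MAX - 10;
	uint64_t	nmembers = 20;

	if (nextOffset + nmembers < nextOffset)
		printf("would wrap around\n");	/* taken: the sum wrapped to 9 */
	return 0;
}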
In ExtendMultiXactMember() the last comment mentions a wraparound
/*
* Advance to next page, taking care to properly handle the wraparound
* case. OK if nmembers goes negative.
*/
I thought this wraparound is about offset wraparound, which can not
happen now. Since you have left the comment intact, it's something
else. Is the wraparound of offset within the page? Maybe requires a
bit more clarification?
PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
newOldestOffset)
{
... snip ...
- segment += 1;
- }
+ SimpleLruTruncate(MultiXactMemberCtl,
+ MXOffsetToMemberPage(newOldestOffset));
}
Most of the callers of SimpleLruTruncate() call it directly. Why do we
want to keep this static wrapper? PerformOffsetsTruncation() has a
comment which seems to need the wrapper. But
PerformMembersTruncation() doesn't even have that.
MultiXactState->oldestMultiXactId = newOldestMulti;
MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+ MultiXactState->oldestOffset = newOldestOffset;
LWLockRelease(MultiXactGenLock);
Is this something we are missing in the current code? I cannot
understand the connection between this change and the fact that offset
wraparound is not possible with wider multixact offsets. Maybe I
missed some previous discussion.
I have reviewed patch 0002 and the multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.
--
Best Wishes,
Ashutosh Bapat
On 21/11/2025 14:15, Ashutosh Bapat wrote:
> On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Ashutosh, you were interested in reviewing this earlier. Would you have
>> a chance to review this now, before I commit it?
>
> 0002 seems to be fine. It's just moving code from one file to another.
> However, the name multixact_internals.h seems to be misusing the term
> "internal". I would expect an "internal.h" to expose things for use
> within the module and not "externally" in other modules. But this
> isn't the first time: buf_internal.h, toast_internal.h,
> bgworker_internal.h and so on have their own meaning of what
> "internal" means to them. But it might be better to use something more
> meaningful than "internal" in this case. The file seems to contain
> declarations related to how pg_multixact SLRU is structured. To that
> effect multixact_slru.h or multixact_layout.h may be more appropriate.
Yeah, I went with multixact_internal.h because of the precedent. It's
not great, but IMHO it's better to be consistent than to invent a new
naming scheme.
> There are two kinds of offsets that that file deals with: the offset
> within pg_multixact/offsets (MultiXactOffset), and the member offset
> (flag offset and xid offset) within pg_multixact/members. It's quite
> easy to get confused between those when reading that code.
Agreed, those are confusing. I'll think about that a little more.
> The reason why this eliminates the need for wraparound handling is mentioned
> somewhere in GetNewMultiXactId(), but probably it should be mentioned
> at a more prominent place and also in the commit message. I expected
> it to be in the prologue of GetNewMultiXactId(), and then a reference
> to prologue from where the comment is right now.
>
> ereport(ERROR,
> (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> errmsg("MultiXact members would wrap around")));
>
> If a server ever reaches this point, there is no way but to create
> another cluster, if the application requires multi-xact ids?
Pretty much. You can also vacuum freeze everything, so that no multixids
are in use, and then use pg_resetwal to reset nextOffset to a lower value.
That obviously sounds bad, but this is a "can't happen" situation. For
comparison, we don't provide any better way to recover from running out
of LSNs either.
> We should also provide this as an errhint().
I think no. You cannot run into this "organically" by just running the
system. If you do manage to hit it, it's a sign of some other trouble,
and I don't want to guess what that might be, or what might be the
appropriate way to recover.
> In ExtendMultiXactMember() the last comment mentions a wraparound
> /*
> * Advance to next page, taking care to properly handle the wraparound
> * case. OK if nmembers goes negative.
> */
> I thought this wraparound is about offset wraparound, which cannot
> happen now. Since you have left the comment intact, it's something
> else. Is the wraparound of offset within the page? Maybe requires a
> bit more clarification?
It was indeed about offset wraparound. I'll remove it.
> PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
> newOldestOffset)
> {
> ... snip ...
> - segment += 1;
> - }
> + SimpleLruTruncate(MultiXactMemberCtl,
> + MXOffsetToMemberPage(newOldestOffset));
> }
>
> Most of the callers of SimpleLruTruncate() call it directly. Why do we
> want to keep this static wrapper? PerformOffsetsTruncation() has a
> comment which seems to need the wrapper. But
> PerformMembersTruncation() doesn't even have that.
Hmm, yeah those wrappers are a bit vestigial now. I'm inclined to keep
them, because as you said, PerformOffsetsTruncation() provides a place
for the comment. And given that, it seems best to keep
PerformMembersTruncation(), for symmetry.
> MultiXactState->oldestMultiXactId = newOldestMulti;
> MultiXactState->oldestMultiXactDB = newOldestMultiDB;
> + MultiXactState->oldestOffset = newOldestOffset;
> LWLockRelease(MultiXactGenLock);
>
> Is this something we are missing in the current code? I cannot
> understand the connection between this change and the fact that offset
> wraparound is not possible with wider multixact offsets. Maybe I
> missed some previous discussion.
Good question. At first I intended to extract that to a separate commit,
before the main patch, because it seems like a nice improvement: We have
just calculated 'oldestOffset', so why not update the value in shared
memory while we have it? But looking closer, I'm not sure if it'd be
sane without the 64-bit offsets. Currently, oldestOffset is only updated
by SetOffsetVacuumLimit(), which also updates offsetStopLimit. We could
get into a state where oldestOffset is set, but offsetStopLimit is not.
With 64-bit offsets that's no longer a concern because it removes
offsetStopLimit altogether.
> I have reviewed patch 0002 and multxact.c changes in 0003. So far I
> have only these comments. I will review the pg_upgrade.c changes next.
Thanks for the review so far!
- Heikki
Looking at the upgrade code, in light of the "IPC/MultixactCreation on the Standby server" thread [1], I think we need to make it more tolerant. It's possible that there are 0 offsets in pg_multixact/offsets. That might or might not be a problem: it's OK as long as those multixids don't appear in any heap table, or you might actually have lost those multixids, which is bad, but the damage has already been done and upgrade should not get stuck on it.

GetOldMultiXactIdSingleMember() currently asserts that the offset is never zero, but it should try to do something sensible in that case instead of just failing.

[1] https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru

- Heikki
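[Editor's note] A sketch of what "something sensible" could look like in the conversion code; everything here is hypothetical except the zero-offset condition, which comes from the mail above:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactId;
typedef uint64_t MultiXactOffset;

/*
 * Hypothetical sketch: instead of asserting offset != 0, treat a zero
 * offset as "this multixact has no members" and keep going, since any
 * damage already happened on the old cluster. All names here are
 * illustrative, not the actual patch.
 */
static int
old_multixact_nmembers(MultiXactId multi, MultiXactOffset offset,
					   MultiXactOffset next_offset)
{
	if (offset == 0)
	{
		fprintf(stderr,
				"warning: multixact %u has no offset, treating it as empty\n",
				multi);
		return 0;				/* emit an empty multixact, don't fail */
	}
	/* normal path: the member count is the gap to the next offset */
	return (int) (next_offset - offset);
}

int
main(void)
{
	return old_multixact_nmembers(42, 0, 0) == 0 ? 0 : 1;
}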