Обсуждение: Multixid SLRU truncation bugs at wraparound

Поиск
Список
Период
Сортировка

Multixid SLRU truncation bugs at wraparound

От
Heikki Linnakangas
Дата:
While working on the reported pg_upgrade failure at multixid wraparound 
[1], I bumped into another bug related to multixid wraparound. If you 
run vacuum freeze, and it advances oldestMultiXactId, and nextMulti has 
just wrapped around to 0, you get this in the log:

> LOG:  MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on
disk

Culprit: TruncateMultiXact does this:

     LWLockAcquire(MultiXactGenLock, LW_SHARED);
     nextMulti = MultiXactState->nextMXact;
     nextOffset = MultiXactState->nextOffset;
     oldestMulti = MultiXactState->oldestMultiXactId;
     LWLockRelease(MultiXactGenLock);
     Assert(MultiXactIdIsValid(oldestMulti));

     ...


     /*
      * First, compute the safe truncation point for MultiXactMember. 
This is
      * the starting offset of the oldest multixact.
      *
      * Hopefully, find_multixact_start will always work here, because we've
      * already checked that it doesn't precede the earliest MultiXact 
on disk.
      * But if it fails, don't truncate anything, and log a message.
      */
     if (oldestMulti == nextMulti)
     {
         /* there are NO MultiXacts */
         oldestOffset = nextOffset;
     }
     else if (!find_multixact_start(oldestMulti, &oldestOffset))
     {
         ereport(LOG,
                 (errmsg("oldest MultiXact %u not found, earliest 
MultiXact %u, skipping truncation",
                         oldestMulti, earliest)));
         LWLockRelease(MultiXactTruncationLock);
         return;
     }

Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0. 
We should take the "there are NO MultiXacts" codepath in that case, 
because we skip over 0 when assigning multixids. Instead, we call 
find_multixact_start with oldestMulti==1, which returns false because 
multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet. 
There's a similar bug in SetOffsetVacuumLimit().

Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get 
the log message. But I think there might be more serious variants of 
this. If the SLRU segment exists but the offset for multixid 1 hasn't 
been set yet, find_multixact_start() will return 0 instead, and we will 
proceed with the truncation based on incorrect oldestOffset==0 value, 
possibly removing SLRU segments that are still needed.

Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.

Scenario 3: I also noticed that the above code isn't prepared for the 
race condition that the offset corresponding to 'oldestMulti' hasn't 
been stored in the SLRU yet, even without wraparound. That could 
theoretically happen if the backend executing 
MultiXactIdCreateFromMembers() gets stuck for a long time between the 
calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're 
saved by the fact that we only create new multixids while holding a lock 
on a heap page, and a system-wide VACUUM FREEZE that would advance 
oldestMulti would need to lock the heap page too. It's scary though, 
because it could also lead to truncating away members SLRU segments that 
are still needed. The attached patch does *not* address this scenario.

[1] 
https://www.postgresql.org/message-id/CACG%3DezaApSMTjd%3DM2Sfn5Ucuggd3FG8Z8Qte8Xq9k5-%2BRQis-g@mail.gmail.com

- Heikki

Вложения

Re: Multixid SLRU truncation bugs at wraparound

От
Heikki Linnakangas
Дата:
On 07/11/2025 17:33, Heikki Linnakangas wrote:
> Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0. 
> We should take the "there are NO MultiXacts" codepath in that case, 
> because we skip over 0 when assigning multixids. Instead, we call 
> find_multixact_start with oldestMulti==1, which returns false because 
> multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet. 
> There's a similar bug in SetOffsetVacuumLimit().
> 
> Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get 
> the log message. But I think there might be more serious variants of 
> this. If the SLRU segment exists but the offset for multixid 1 hasn't 
> been set yet, find_multixact_start() will return 0 instead, and we will 
> proceed with the truncation based on incorrect oldestOffset==0 value, 
> possibly removing SLRU segments that are still needed.
> 
> Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.

These scenarios were incidentally fixed by commit 789d65364c "Set next 
multixid's offset when creating a new multixid". (On master, commit 
87a350e1f2 "Never store 0 as the nextMXact" would also have fixed it, 
but 789d65364c was committed first.)

> Scenario 3: I also noticed that the above code isn't prepared for the 
> race condition that the offset corresponding to 'oldestMulti' hasn't 
> been stored in the SLRU yet, even without wraparound. That could 
> theoretically happen if the backend executing 
> MultiXactIdCreateFromMembers() gets stuck for a long time between the 
> calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're 
> saved by the fact that we only create new multixids while holding a lock 
> on a heap page, and a system-wide VACUUM FREEZE that would advance 
> oldestMulti would need to lock the heap page too. It's scary though, 
> because it could also lead to truncating away members SLRU segments that 
> are still needed. The attached patch does *not* address this scenario.

I believe this is still a potential issue, although I'm still not sure 
if it's possible to hit in practice. I see two avenues:

1. A long delay between GetNewMultiXactId() and RecordNewMultiXact(), as 
I suggested above.

2. A crash between GetNewMultiXactId() and RecordNewMultiXact(), which 
leaves behind an multixid with 0 offset.

I'm not sure if TruncateMultiXact() can be called with such an invalid 
multixid. It cannot appear anywhere in the heap, so perhaps VACUUM will 
never pick it as the oldest multixid. But I can't rule it out either. In 
any case, we should put a safeguard in TruncateMultiXact() for this. The 
consequences are bad, and it's easy to check for.


While working on this, I noticed that the truncation code does a couple 
of unnecessary things:

>     /*
>      * Note we can't just plow ahead with the truncation; it's possible that
>      * there are no segments to truncate, which is a problem because we are
>      * going to attempt to read the offsets page to determine where to
>      * truncate the members SLRU.  So we first scan the directory to determine
>      * the earliest offsets page number that we can read without error.
>      *
>      * When nextMXact is less than one segment away from multiWrapLimit,
>      * SlruScanDirCbFindEarliest can find some early segment other than the
>      * actual earliest.  (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
>      * returns false, because not all pairs of entries have the same answer.)
>      * That can also arise when an earlier truncation attempt failed unlink()
>      * or returned early from this function.  The only consequence is
>      * returning early, which wastes space that we could have liberated.
>      *
>      * NB: It's also possible that the page that oldestMulti is on has already
>      * been truncated away, and we crashed before updating oldestMulti.
>      */
>     trunc.earliestExistingPage = -1;
>     SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
>     earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
>     if (earliest < FirstMultiXactId)
>         earliest = FirstMultiXactId;
> 
>     /* If there's nothing to remove, we can bail out early. */
>     if (MultiXactIdPrecedes(oldestMulti, earliest))
>     {
>         LWLockRelease(MultiXactTruncationLock);
>         return;
>     }
> 
>     /*
>      * First, compute the safe truncation point for MultiXactMember. This is
>      * the starting offset of the oldest multixact.
>      *
>      * Hopefully, find_multixact_start will always work here, because we've
>      * already checked that it doesn't precede the earliest MultiXact on disk.
>      * But if it fails, don't truncate anything, and log a message.
>      */
>     if (oldestMulti == nextMulti)
>     {
>         /* there are NO MultiXacts */
>         oldestOffset = nextOffset;
>     }
>     else if (!find_multixact_start(oldestMulti, &oldestOffset))
>     {
>         ereport(LOG,
>                 (errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
>                         oldestMulti, earliest)));
>         LWLockRelease(MultiXactTruncationLock);
>         return;
>     }

1. In the above, we scan the 'offsets' SLRU to find the earliest 
existing page. Per the comments, that's to avoid reading an offset that 
doesn't exist. But find_multixact_start() handles that gracefully; we 
check for that above and print the "oldest Multixact %u not found" log 
message if it happens. What's the point of scanning the SLRU then, why 
not just let find_multixact_start() detect the missing segment?

2. The above code looks up oldestOffset, which is then included in the 
WAL record and passed to PerformMembersTruncation(). But 
PerformMembersTruncation() doesn't use it for anything. It's not used 
during WAL replay either. This is new with the 64-bit offsets; on stable 
branches PerformMembersTruncation() does use it.

I'm inclined to leave this alone in stable branches, as it's not broken, 
just somewhat pointless. But for 'master', I propose the attached 
v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.

For all branches, I propose 
v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a 
check for oldestOffset == 0. That fixes the potential for catastrophic 
truncation with invalid offset 0.

- Heikki

Вложения

Re: Multixid SLRU truncation bugs at wraparound

От
Andrey Borodin
Дата:

> On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> For all branches, I propose v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a check for
oldestOffset== 0. That fixes the potential for catastrophic truncation with invalid offset 0. 

Multixid that is used in heap is WAL-logged. WAL-logged multixact has non-zero offset.
So in non-corrupted database such as condition is impossible.
However, I observed several incidents when AI recommended pg_resetwal to users.
Proposed safeguard might be useful to prevent sprawling corruption in database.



> On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> But for 'master', I propose the attached v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.

The patch simplifies the code while maintaining correctness.
The only issue I can think of is that clog, commit_ts and async are still using approach based on SlruScanDirectory().


Best regards, Andrey Borodin.