Обсуждение: Multixid SLRU truncation bugs at wraparound
While working on the reported pg_upgrade failure at multixid wraparound
[1], I bumped into another bug related to multixid wraparound. If you
run vacuum freeze, and it advances oldestMultiXactId, and nextMulti has
just wrapped around to 0, you get this in the log:
> LOG: MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on
disk
Culprit: TruncateMultiXact does this:
LWLockAcquire(MultiXactGenLock, LW_SHARED);
nextMulti = MultiXactState->nextMXact;
nextOffset = MultiXactState->nextOffset;
oldestMulti = MultiXactState->oldestMultiXactId;
LWLockRelease(MultiXactGenLock);
Assert(MultiXactIdIsValid(oldestMulti));
...
/*
* First, compute the safe truncation point for MultiXactMember.
This is
* the starting offset of the oldest multixact.
*
* Hopefully, find_multixact_start will always work here, because we've
* already checked that it doesn't precede the earliest MultiXact
on disk.
* But if it fails, don't truncate anything, and log a message.
*/
if (oldestMulti == nextMulti)
{
/* there are NO MultiXacts */
oldestOffset = nextOffset;
}
else if (!find_multixact_start(oldestMulti, &oldestOffset))
{
ereport(LOG,
(errmsg("oldest MultiXact %u not found, earliest
MultiXact %u, skipping truncation",
oldestMulti, earliest)));
LWLockRelease(MultiXactTruncationLock);
return;
}
Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0.
We should take the "there are NO MultiXacts" codepath in that case,
because we skip over 0 when assigning multixids. Instead, we call
find_multixact_start with oldestMulti==1, which returns false because
multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet.
There's a similar bug in SetOffsetVacuumLimit().
Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get
the log message. But I think there might be more serious variants of
this. If the SLRU segment exists but the offset for multixid 1 hasn't
been set yet, find_multixact_start() will return 0 instead, and we will
proceed with the truncation based on incorrect oldestOffset==0 value,
possibly removing SLRU segments that are still needed.
Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.
Scenario 3: I also noticed that the above code isn't prepared for the
race condition that the offset corresponding to 'oldestMulti' hasn't
been stored in the SLRU yet, even without wraparound. That could
theoretically happen if the backend executing
MultiXactIdCreateFromMembers() gets stuck for a long time between the
calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're
saved by the fact that we only create new multixids while holding a lock
on a heap page, and a system-wide VACUUM FREEZE that would advance
oldestMulti would need to lock the heap page too. It's scary though,
because it could also lead to truncating away members SLRU segments that
are still needed. The attached patch does *not* address this scenario.
[1]
https://www.postgresql.org/message-id/CACG%3DezaApSMTjd%3DM2Sfn5Ucuggd3FG8Z8Qte8Xq9k5-%2BRQis-g@mail.gmail.com
- Heikki
Вложения
On 07/11/2025 17:33, Heikki Linnakangas wrote:
> Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0.
> We should take the "there are NO MultiXacts" codepath in that case,
> because we skip over 0 when assigning multixids. Instead, we call
> find_multixact_start with oldestMulti==1, which returns false because
> multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet.
> There's a similar bug in SetOffsetVacuumLimit().
>
> Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get
> the log message. But I think there might be more serious variants of
> this. If the SLRU segment exists but the offset for multixid 1 hasn't
> been set yet, find_multixact_start() will return 0 instead, and we will
> proceed with the truncation based on incorrect oldestOffset==0 value,
> possibly removing SLRU segments that are still needed.
>
> Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.
These scenarios were incidentally fixed by commit 789d65364c "Set next
multixid's offset when creating a new multixid". (On master, commit
87a350e1f2 "Never store 0 as the nextMXact" would also have fixed it,
but 789d65364c was committed first.)
> Scenario 3: I also noticed that the above code isn't prepared for the
> race condition that the offset corresponding to 'oldestMulti' hasn't
> been stored in the SLRU yet, even without wraparound. That could
> theoretically happen if the backend executing
> MultiXactIdCreateFromMembers() gets stuck for a long time between the
> calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're
> saved by the fact that we only create new multixids while holding a lock
> on a heap page, and a system-wide VACUUM FREEZE that would advance
> oldestMulti would need to lock the heap page too. It's scary though,
> because it could also lead to truncating away members SLRU segments that
> are still needed. The attached patch does *not* address this scenario.
I believe this is still a potential issue, although I'm still not sure
if it's possible to hit in practice. I see two avenues:
1. A long delay between GetNewMultiXactId() and RecordNewMultiXact(), as
I suggested above.
2. A crash between GetNewMultiXactId() and RecordNewMultiXact(), which
leaves behind an multixid with 0 offset.
I'm not sure if TruncateMultiXact() can be called with such an invalid
multixid. It cannot appear anywhere in the heap, so perhaps VACUUM will
never pick it as the oldest multixid. But I can't rule it out either. In
any case, we should put a safeguard in TruncateMultiXact() for this. The
consequences are bad, and it's easy to check for.
While working on this, I noticed that the truncation code does a couple
of unnecessary things:
> /*
> * Note we can't just plow ahead with the truncation; it's possible that
> * there are no segments to truncate, which is a problem because we are
> * going to attempt to read the offsets page to determine where to
> * truncate the members SLRU. So we first scan the directory to determine
> * the earliest offsets page number that we can read without error.
> *
> * When nextMXact is less than one segment away from multiWrapLimit,
> * SlruScanDirCbFindEarliest can find some early segment other than the
> * actual earliest. (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
> * returns false, because not all pairs of entries have the same answer.)
> * That can also arise when an earlier truncation attempt failed unlink()
> * or returned early from this function. The only consequence is
> * returning early, which wastes space that we could have liberated.
> *
> * NB: It's also possible that the page that oldestMulti is on has already
> * been truncated away, and we crashed before updating oldestMulti.
> */
> trunc.earliestExistingPage = -1;
> SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> if (earliest < FirstMultiXactId)
> earliest = FirstMultiXactId;
>
> /* If there's nothing to remove, we can bail out early. */
> if (MultiXactIdPrecedes(oldestMulti, earliest))
> {
> LWLockRelease(MultiXactTruncationLock);
> return;
> }
>
> /*
> * First, compute the safe truncation point for MultiXactMember. This is
> * the starting offset of the oldest multixact.
> *
> * Hopefully, find_multixact_start will always work here, because we've
> * already checked that it doesn't precede the earliest MultiXact on disk.
> * But if it fails, don't truncate anything, and log a message.
> */
> if (oldestMulti == nextMulti)
> {
> /* there are NO MultiXacts */
> oldestOffset = nextOffset;
> }
> else if (!find_multixact_start(oldestMulti, &oldestOffset))
> {
> ereport(LOG,
> (errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
> oldestMulti, earliest)));
> LWLockRelease(MultiXactTruncationLock);
> return;
> }
1. In the above, we scan the 'offsets' SLRU to find the earliest
existing page. Per the comments, that's to avoid reading an offset that
doesn't exist. But find_multixact_start() handles that gracefully; we
check for that above and print the "oldest Multixact %u not found" log
message if it happens. What's the point of scanning the SLRU then, why
not just let find_multixact_start() detect the missing segment?
2. The above code looks up oldestOffset, which is then included in the
WAL record and passed to PerformMembersTruncation(). But
PerformMembersTruncation() doesn't use it for anything. It's not used
during WAL replay either. This is new with the 64-bit offsets; on stable
branches PerformMembersTruncation() does use it.
I'm inclined to leave this alone in stable branches, as it's not broken,
just somewhat pointless. But for 'master', I propose the attached
v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.
For all branches, I propose
v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a
check for oldestOffset == 0. That fixes the potential for catastrophic
truncation with invalid offset 0.
- Heikki
Вложения
> On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > For all branches, I propose v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a check for oldestOffset== 0. That fixes the potential for catastrophic truncation with invalid offset 0. Multixid that is used in heap is WAL-logged. WAL-logged multixact has non-zero offset. So in non-corrupted database such as condition is impossible. However, I observed several incidents when AI recommended pg_resetwal to users. Proposed safeguard might be useful to prevent sprawling corruption in database. > On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > But for 'master', I propose the attached v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch. The patch simplifies the code while maintaining correctness. The only issue I can think of is that clog, commit_ts and async are still using approach based on SlruScanDirectory(). Best regards, Andrey Borodin.
On 09/01/2026 19:58, Andrey Borodin wrote: >> On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >> For all branches, I propose v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a check for oldestOffset== 0. That fixes the potential for catastrophic truncation with invalid offset 0. > > Multixid that is used in heap is WAL-logged. WAL-logged multixact has non-zero offset. > So in non-corrupted database such as condition is impossible. > However, I observed several incidents when AI recommended pg_resetwal to users. > Proposed safeguard might be useful to prevent sprawling corruption in database. +1 >> On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >> But for 'master', I propose the attached v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch. > > The patch simplifies the code while maintaining correctness. > The only issue I can think of is that clog, commit_ts and async are still using approach based on SlruScanDirectory(). Clog, commit_ts and async actually all use SimpleLruTruncate for the truncation. Which in turn uses SlruScanDirectory(). There is one subtle difference between clog and commit_ts, and multixact. Before truncation, clog and commit_ts use SlruScanDirCbReportPresence to check if there are any files to remove, and only perform the truncation if there are. Multixact doesn't do that check, so it will write a truncation WAL record, even if there are no files to remove, while clog/commit_ts will not. That's OK, and isn't new with this patch anyway. Pushed, thanks for the review! - Heikki