On 07/11/2025 17:33, Heikki Linnakangas wrote:
> Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0.
> We should take the "there are NO MultiXacts" codepath in that case,
> because we skip over 0 when assigning multixids. Instead, we call
> find_multixact_start with oldestMulti==1, which returns false because
> multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet.
> There's a similar bug in SetOffsetVacuumLimit().
>
> Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get
> the log message. But I think there might be more serious variants of
> this. If the SLRU segment exists but the offset for multixid 1 hasn't
> been set yet, find_multixact_start() will return 0 instead, and we will
> proceed with the truncation based on incorrect oldestOffset==0 value,
> possibly removing SLRU segments that are still needed.
>
> Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.
These scenarios were incidentally fixed by commit 789d65364c "Set next
multixid's offset when creating a new multixid". (On master, commit
87a350e1f2 "Never store 0 as the nextMXact" would also have fixed it,
but 789d65364c was committed first.)
> Scenario 3: I also noticed that the above code isn't prepared for the
> race condition that the offset corresponding to 'oldestMulti' hasn't
> been stored in the SLRU yet, even without wraparound. That could
> theoretically happen if the backend executing
> MultiXactIdCreateFromMembers() gets stuck for a long time between the
> calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're
> saved by the fact that we only create new multixids while holding a lock
> on a heap page, and a system-wide VACUUM FREEZE that would advance
> oldestMulti would need to lock the heap page too. It's scary though,
> because it could also lead to truncating away members SLRU segments that
> are still needed. The attached patch does *not* address this scenario.
I believe this is still a potential issue, although I'm still not sure
if it's possible to hit in practice. I see two avenues:
1. A long delay between GetNewMultiXactId() and RecordNewMultiXact(), as
I suggested above.
2. A crash between GetNewMultiXactId() and RecordNewMultiXact(), which
leaves behind an multixid with 0 offset.
I'm not sure if TruncateMultiXact() can be called with such an invalid
multixid. It cannot appear anywhere in the heap, so perhaps VACUUM will
never pick it as the oldest multixid. But I can't rule it out either. In
any case, we should put a safeguard in TruncateMultiXact() for this. The
consequences are bad, and it's easy to check for.
While working on this, I noticed that the truncation code does a couple
of unnecessary things:
> /*
> * Note we can't just plow ahead with the truncation; it's possible that
> * there are no segments to truncate, which is a problem because we are
> * going to attempt to read the offsets page to determine where to
> * truncate the members SLRU. So we first scan the directory to determine
> * the earliest offsets page number that we can read without error.
> *
> * When nextMXact is less than one segment away from multiWrapLimit,
> * SlruScanDirCbFindEarliest can find some early segment other than the
> * actual earliest. (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
> * returns false, because not all pairs of entries have the same answer.)
> * That can also arise when an earlier truncation attempt failed unlink()
> * or returned early from this function. The only consequence is
> * returning early, which wastes space that we could have liberated.
> *
> * NB: It's also possible that the page that oldestMulti is on has already
> * been truncated away, and we crashed before updating oldestMulti.
> */
> trunc.earliestExistingPage = -1;
> SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> if (earliest < FirstMultiXactId)
> earliest = FirstMultiXactId;
>
> /* If there's nothing to remove, we can bail out early. */
> if (MultiXactIdPrecedes(oldestMulti, earliest))
> {
> LWLockRelease(MultiXactTruncationLock);
> return;
> }
>
> /*
> * First, compute the safe truncation point for MultiXactMember. This is
> * the starting offset of the oldest multixact.
> *
> * Hopefully, find_multixact_start will always work here, because we've
> * already checked that it doesn't precede the earliest MultiXact on disk.
> * But if it fails, don't truncate anything, and log a message.
> */
> if (oldestMulti == nextMulti)
> {
> /* there are NO MultiXacts */
> oldestOffset = nextOffset;
> }
> else if (!find_multixact_start(oldestMulti, &oldestOffset))
> {
> ereport(LOG,
> (errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
> oldestMulti, earliest)));
> LWLockRelease(MultiXactTruncationLock);
> return;
> }
1. In the above, we scan the 'offsets' SLRU to find the earliest
existing page. Per the comments, that's to avoid reading an offset that
doesn't exist. But find_multixact_start() handles that gracefully; we
check for that above and print the "oldest Multixact %u not found" log
message if it happens. What's the point of scanning the SLRU then, why
not just let find_multixact_start() detect the missing segment?
2. The above code looks up oldestOffset, which is then included in the
WAL record and passed to PerformMembersTruncation(). But
PerformMembersTruncation() doesn't use it for anything. It's not used
during WAL replay either. This is new with the 64-bit offsets; on stable
branches PerformMembersTruncation() does use it.
I'm inclined to leave this alone in stable branches, as it's not broken,
just somewhat pointless. But for 'master', I propose the attached
v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.
For all branches, I propose
v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a
check for oldestOffset == 0. That fixes the potential for catastrophic
truncation with invalid offset 0.
- Heikki