Discussion: Bug in MultiXact replay compat logic for older minor version after crash-recovery
Hi hackers,

This is related to two recent threads:

[0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
[1] https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com

I think there may be a bug in the compat logic introduced in RecordNewMultiXact() to fix [0]. The compat logic handles WAL generated by older minor versions, where the ZERO_OFF_PAGE record for the page holding multi N+1 may appear after CREATE_ID:N in the WAL stream. However, the condition used to detect whether the next offset page needs initialization seems to fail after a crash-restart, causing the standby to enter a FATAL crash loop.

We hit this in production: a PG 16.12 standby replaying WAL from a PG 16.8 primary entered a crash-restart loop with:

> FATAL: could not access status of transaction 1277952
> DETAIL: Could not read from file "pg_multixact/offsets/0013" at offset
> 131072: read too few bytes.
> CONTEXT: WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951
> offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh)

== Analysis ==

The compat logic condition is:

> if (InRecovery &&
>     next_pageno != pageno &&
>     MultiXactOffsetCtl->shared->latest_page_number == pageno)

The third condition (latest_page_number == pageno) assumes that if latest_page_number equals the current multixact's page, then the next page hasn't been initialized yet. This works during normal (non-crash) replay, because latest_page_number is incrementally maintained by SimpleLruZeroPage() calls.
But after a crash-restart, StartupMultiXact() resets latest_page_number to page(checkPoint.nextMulti):

> void
> StartupMultiXact(void)
> {
>     MultiXactId multi = MultiXactState->nextMXact;
>
>     pageno = MultiXactIdToOffsetPage(multi);
>     MultiXactOffsetCtl->shared->latest_page_number = pageno;
> }

When the checkpoint captured an already-advanced nextMXact (which happens when GetNewMultiXactId() incremented nextMXact before the backend wrote CREATE_ID to WAL), checkPoint.nextMulti >= N+1, so latest_page_number = page(N+1) = P+1. The compat check becomes:

> P+1 == P -> FALSE -> compat logic skipped

The subsequent SimpleLruReadPage(next_pageno=P+1) then fails because page P+1 doesn't exist on disk (ZERO_OFF_PAGE:P+1 hasn't been replayed yet -- it comes after CREATE_ID:N in the WAL due to the reordering that the compat logic is supposed to handle).

The fix for [1] addressed a related issue where TRUNCATE replay reset latest_page_number to a very old value (latest_page_number << pageno). That fix addresses the "too small" direction but not the "too large" direction described here.

== Timeline: the lock-free window ==

The root cause is a lock-free window in MultiXactIdCreateFromMembers(): GetNewMultiXactId() advances nextMXact under MultiXactGenLock, but the lock is released BEFORE XLogInsert(CREATE_ID). Any checkpoint that runs in this window captures the advanced nextMXact without the corresponding CREATE_ID in WAL.

Assume multi N is the last entry on offset page P (entry 2047), so multi N+1 falls on page P+1.

1. [Backend] GetNewMultiXactId():
   - LWLockAcquire(MultiXactGenLock)
   - result = nextMXact (= N)
   - nextMXact++ (now N+1)
   - LWLockRelease(MultiXactGenLock)
   - return N

   <<< lock-free window opens here >>>
   nextMXact=N+1 is globally visible, but CREATE_ID:N is not in WAL yet.

2. [Checkpointer] CHECKPOINT runs during the window:
   - reads nextMXact = N+1
   - writes checkpoint record with nextMulti=N+1

3. [Standby] replays the checkpoint, does a restartpoint (persists nextMulti=N+1 to disk)

4. [Standby] *** CRASH ***

   <<< lock-free window closes >>>

5. [Backend] XLogInsert(CREATE_ID:N) -- this WAL record lands AFTER the checkpoint record

6. [Backend] RecordNewMultiXact(N):
   - next_pageno = page(N+1) = P+1
   - SimpleLruReadPage(P+1) -- works fine on the primary

The resulting WAL stream on disk:

> ... CHECKPOINT(nextMulti=N+1) ... CREATE_ID:N ... ZERO_OFF_PAGE:P+1 ...
>     ^                             ^               ^
>     redo starts here              needs page P+1  page P+1 init
>     after crash-restart

After the standby's crash-restart:

7. [Standby] StartupMultiXact():
   - multi = nextMXact = N+1 (from checkpoint)
   - latest_page_number = page(N+1) = P+1

8. [Standby] Redo CREATE_ID:N -> RecordNewMultiXact(N):
   - pageno = page(N) = P, next_pageno = page(N+1) = P+1
   - Compat logic check:
       InRecovery = true
       next_pageno(P+1) != pageno(P)
       latest_page_number(P+1) == pageno(P) -- FAIL
     --> compat logic SKIPPED
   - SimpleLruReadPage(P+1): page P+1 not on disk yet (ZERO_OFF_PAGE:P+1 not replayed)
     --> FATAL: read too few bytes

== Reproduction ==

I wrote a deterministic reproducer using sleep injection. A sleep injection patch (sleep-injection-for-multixact-compat-logic-bug-repro.patch) adds a 60-second sleep in MultiXactIdCreateFromMembers(), after GetNewMultiXactId() returns but before XLogInsert(CREATE_ID). The sleep only triggers when the allocated multi is the last entry on an offset page (entry 2047) AND the file /tmp/mxact_sleep_enabled exists (so the batch phase that generates multixacts is not affected).

The setup:

- Primary: PG 16.11 with the sleep injection patch applied.
- Standby: PG 16.13 (contains the compat logic from [0] and [1]).

The test script (test_multixact_compat.sh) does the following:

1. Initialize the primary, create a test table.
2. Generate multixacts until nextMulti's entry is exactly 2047 (last entry on page P).
3. pg_basebackup to create the standby, start it with pg_new (16.13), checkpoint_timeout=30s.
4. Session A: BEGIN; SELECT * FROM t WHERE id=1 FOR SHARE; (dirties the heap buffer)
5. CHECKPOINT to flush the dirty buffer (*)
6. Enable sleep injection (touch /tmp/mxact_sleep_enabled).
7. Session B: SELECT * FROM t WHERE id=1 FOR SHARE; -> creates multixact N, GetNewMultiXactId() advances nextMXact to N+1, then sleeps 60s.
8. During the sleep: CHECKPOINT on the primary -> completes instantly (the buffer is clean) -> captures nextMulti=N+1 in the checkpoint record.
9. Wait for the standby to replay the checkpoint and do a restartpoint.
10. Crash the standby (pg_ctl -m immediate stop).
11. Wait for the sleep to end -> CREATE_ID:N is written to WAL.
12. Restart the standby -> FATAL.

(*) The CHECKPOINT in step 5 is essential. Without it, the checkpoint in step 8 blocks, because heap_lock_tuple holds the buffer content lock in EXCLUSIVE mode across the entire MultiXactIdCreateFromMembers() call (including the sleep). Session A's FOR SHARE dirtied the heap buffer, so BufferSync needs LW_SHARED on the content lock and blocks. The intermediate CHECKPOINT flushes the dirty buffer before Session B acquires the content lock.

Result:

> DEBUG: next MultiXactId: 114688; next MultiXactOffset: 688233
> FATAL: could not access status of transaction 114688
> DETAIL: Could not read from file "pg_multixact/offsets/0001"
> at offset 196608: read too few bytes.
> CONTEXT: WAL redo at 0/40426E0 for MultiXact/CREATE_ID: 114687
> offset 688231 nmembers 2: 116735 (sh) 116736 (sh)

== Fix ideas ==

I have two ideas for fixing this. Two independent patches are attached.

Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch):

Remove the "latest_page_number == pageno" condition entirely. During recovery, whenever we cross a page boundary in RecordNewMultiXact(), unconditionally ensure the next page is initialized via SimpleLruZeroPage().

Pro: Minimal change, easy to review, impossible to miss any edge case since we always initialize. Safe because SimpleLruZeroPage() is idempotent -- at this point the next page only contains zeros (CREATE_ID records for that page haven't been replayed yet), so re-zeroing loses nothing.
Con: When replaying WAL from a new-version primary (where ZERO_OFF_PAGE is emitted BEFORE CREATE_ID), the page has already been initialized by the ZERO_OFF_PAGE record, and this patch will redundantly zero and write it again. The cost is one extra SimpleLruZeroPage() + SimpleLruWritePage() per 2048 multixacts during recovery, which is negligible in practice, but not zero.

Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch):

Also remove the "latest_page_number == pageno" condition, but add a two-tier check: first, if latest_page_number == pageno (fast path), we know the next page needs initialization. Otherwise, scan the SLRU buffer slots to check whether the next page already exists (e.g. from a prior ZERO_OFF_PAGE replay). Only initialize if the page is not found in the buffer.

Pro: Avoids the redundant zero+write when replaying new-version WAL where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The SLRU buffer scan is O(n_buffers), which is cheap (default 8 buffers).

Con: More complex than Idea 1. The SLRU buffer check is a heuristic -- if the page was written out and evicted from the buffer pool before CREATE_ID replay, we'd zero it again (still safe, just redundant). It also introduces a dependency on SLRU buffer pool internals.

I'd appreciate it if someone could review whether my analysis of the bug is correct. If it is, I'd be happy to hear any thoughts on the patches or better approaches to fix this.

Regards,
Duan
Attachments
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Heikki Linnakangas
Date:
On 19/03/2026 20:02, 段坤仁(刻韧) wrote:
> Hi hackers,
>
> This is related to two recent threads:
>
> [0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
> [1] https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com
>
> I think there may be a bug in the compat logic introduced in
> RecordNewMultiXact() to fix [0]. The compat logic handles WAL
> generated by older minor versions where ZERO_OFF_PAGE N+1
> may appear after CREATE_ID:N in the WAL stream. However, the condition
> used to detect whether the next offset page needs initialization
> seems to fail after a crash-restart, causing the standby to enter a
> FATAL crash loop.
>
> We hit this in production: a PG 16.12 standby replaying WAL from a
> PG 16.8 primary entered a crash-restart loop with:
>
>> FATAL: could not access status of transaction 1277952
>> DETAIL: Could not read from file "pg_multixact/offsets/0013" at offset
>> 131072: read too few bytes.
>> CONTEXT: WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951
>> offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh)

:-(. This is a gift that keeps giving.

> == Analysis ==
>
> ...
>
> == Timeline: the lock-free window ==
>
> ...
>
> == Reproduction ==

Thanks for the thorough analysis and the repro!

> == Fix ideas ==
>
> I have two ideas for fixing this. Two independent patches are
> attached.
>
> Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch):
>
> Remove the "latest_page_number == pageno" condition entirely. During
> recovery, whenever we cross a page boundary in RecordNewMultiXact(),
> unconditionally ensure the next page is initialized via
> SimpleLruZeroPage().
>
> Pro: Minimal change, easy to review, impossible to miss any edge
> case since we always initialize. Safe because SimpleLruZeroPage is
> idempotent -- at this point the next page only contains zeros
> (CREATE_ID records for that page haven't been replayed yet), so
> re-zeroing loses nothing.
>
> Con: When replaying WAL from a new-version primary (where
> ZERO_OFF_PAGE is emitted BEFORE CREATE_ID), the page has already
> been initialized by the ZERO_OFF_PAGE record. This patch will
> redundantly zero+write it again. The cost is one extra
> SimpleLruZeroPage + SimpleLruWritePage per 2048 multixacts during
> recovery, which is negligible in practice, but not zero.

This fix is not correct. The CREATE_ID records can appear in the WAL out of order, e.g.:

CREATE_ID:N+2 -> CREATE_ID:N+3 -> CREATE_ID:N+1

With the patch, replaying the CREATE_ID:N+1 record would overwrite the changes made by the CREATE_ID:N+2 and CREATE_ID:N+3 records.

> Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch):
>
> Also remove the "latest_page_number == pageno" condition, but add a
> two-tier check: first, if latest_page_number == pageno (fast path),
> we know the next page needs initialization. Otherwise, scan the SLRU
> buffer slots to check if the next page already exists (e.g. from a
> prior ZERO_OFF_PAGE replay). Only initialize if the page is not
> found in the buffer.
>
> Pro: Avoids the redundant zero+write when replaying new-version WAL
> where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The
> SLRU buffer scan is O(n_buffers) which is cheap (default 8 buffers).
>
> Con: More complex than Idea 1. The SLRU buffer check is a heuristic
> -- if the page was written out and evicted from the buffer pool
> before CREATE_ID replay, we'd zero it again (still safe, just
> redundant). Also introduces a dependency on SLRU buffer pool
> internals.
>
> I'd appreciate it if someone could review whether my analysis of the
> bug is correct. If it is, I'd be happy to hear any thoughts on the
> patches or better approaches to fix this.
I think this has the same problem, although the extra conditions make it much harder to hit. Maybe even impossible, because the SLRU never evicts the latest page. But it seems fragile to rely on that, and the consequences are pretty bad if you get that wrong.

Idea 3:

I think a better fix is to accept that our tracking is a little imprecise and use SimpleLruDoesPhysicalPageExist() to check if the page exists. I suspect that's too expensive to do on every RecordNewMultiXact() call that crosses a page, but perhaps we could do it once at StartupMultiXact().

Or perhaps track the last-zeroed page separately from latest_page_number, and if we haven't seen any XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call SimpleLruDoesPhysicalPageExist() to determine if initialization is needed. Attached patch does that.

- Heikki
Attachments
On 19/03/2026 20:11, Heikki Linnakangas wrote:
> This fix is not correct. The CREATE_ID records can appear in the WAL
> out of order, e.g.:
>
> CREATE_ID:N+2 -> CREATE_ID:N+3 -> CREATE_ID:N+1
>
> With the patch, replaying the CREATE_ID:N+1 record would overwrite
> the changes made by the CREATE_ID:N+2 and CREATE_ID:N+3 records.

> Or perhaps track last-zeroed page separately from
> latest_page_number, and if we haven't seen any
> XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call
> SimpleLruDoesPhysicalPageExist() to determine if initialization
> is needed. Attached patch does that.

Thanks for the quick reply and the fix!

My two patches were both stuck in the trap of trying to fix latest_page_number itself, which as you pointed out is incorrect. Your approach of stepping back and tracking last_initialized_offsets_page separately, with SimpleLruDoesPhysicalPageExist() as the fallback, is much cleaner. I had considered using SimpleLruDoesPhysicalPageExist() myself but dismissed it as too expensive -- looking at it now, it's probably the necessary path for handling unknown state after a restart.

I've applied your patch on both REL_17_STABLE and REL_18_STABLE and verified it fixes the crash-restart scenario. The same code pattern exists on REL_14_STABLE through REL_16_STABLE, so it should apply there as well. It would be good to have another pair of eyes review the fix across all affected branches.

One minor observation on the patch: the comment for last_initialized_offsets_page currently says:

> last_initialized_offsets_page is the XLOG_MULTIXACT_ZERO_OFF_PAGE
> record that we saw during WAL replay

But the variable is also updated when the compat logic in RecordNewMultiXact() initializes a page itself. So it tracks the last page zeroed during replay from any source, not just from ZERO_OFF_PAGE records.
Perhaps something like:

/*
 * last_initialized_offsets_page tracks the last offsets page that
 * was zeroed during WAL replay, whether by replaying an
 * XLOG_MULTIXACT_ZERO_OFF_PAGE record or by the compat logic in
 * RecordNewMultiXact(). -1 if no page has been zeroed yet since
 * the last restart.
 */

== Maybe add a TAP test? ==

The compat logic in RecordNewMultiXact() has now been the source of three bugs in a short period: it was originally introduced as an intermediate step in [0], then broken by TRUNCATE replay [1], and now by the crash-restart scenario described in this thread. The code sits at a tricky intersection of WAL ordering, SLRU page management, and crash recovery -- future changes to SLRU initialization, checkpoint logic, or the multixact lifecycle could re-introduce a regression. A targeted regression test would help catch such issues early, especially since the failure mode (a standby crash loop) is severe and hard to diagnose in production.

The compat logic now has two branches:

> if (last_initialized_offsets_page == -1)
>     init_needed = !SimpleLruDoesPhysicalPageExist(..., next_pageno);
> else
>     init_needed = (last_initialized_offsets_page == pageno);

The first branch (the fallback path) fires after a crash-restart, when last_initialized_offsets_page is -1 because no ZERO_OFF_PAGE has been replayed yet. I wrote a TAP test that exercises this path end-to-end: crash-restart the standby, truncate the offset file to remove the next page, then verify that the compat logic detects the missing page via SimpleLruDoesPhysicalPageExist() and initializes it so the standby can continue replay. This ensures the fallback path works as a safety net when the tracking state is unknown.

The second branch (the fast path) relies on last_initialized_offsets_page advancing monotonically. To guard against future code that might modify this variable elsewhere and break that assumption, we could add Asserts at the two write sites:

/* at each write site: */
Assert(last_initialized_offsets_page == -1 ||
       MultiXactOffsetPagePrecedes(last_initialized_offsets_page,
                                   new_page));

This ensures new_page is always ahead of last_initialized_offsets_page, even across wraparound.

The test reproduces the fallback scenario by creating the crash-restart condition where last_initialized_offsets_page is -1 and the next offsets page is physically missing:

1. Use pg_resetwal -m 2047,1 to place nextMulti at the last entry on offset page 0, so the next allocation crosses a page boundary.
2. Use an injection point to pause after GetNewMultiXactId() but before XLogInsert(CREATE_ID), creating the window where CHECKPOINT captures the advanced nextMulti.
3. Trigger CHECKPOINT during the pause, then wait for the standby to replay it and do a restartpoint.
4. Crash the standby.
5. Truncate the standby's offset file to remove the next page, simulating the old-version condition where ZERO_OFF_PAGE was never written. This is necessary because on a new-version primary, ZERO_OFF_PAGE is always written before CREATE_ID, so the page would normally exist on disk.
6. Wake up the injection point so CREATE_ID gets written.
7. Restart the standby and verify that the compat logic initializes the missing page.

Without your fix applied, the test correctly detects the bug.

I've implemented this in the attached patches, one for REL_17_STABLE and one for REL_18_STABLE. I'm still looking for a way to add test coverage on REL_14_STABLE through REL_16_STABLE, where the injection point infrastructure is not available.

Would appreciate any comments on the test approach or the patches.

Regards,
Duan
Attachments
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Andrey Borodin
Date:
> On 19 Mar 2026, at 23:11, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> :-(. This is a gift that keeps giving.

Well, maybe we could have left that deadlock in place for some time...

> Idea 3:
>
> I think a better fix is to accept that our tracking is a little imprecise and use SimpleLruDoesPhysicalPageExist() to check if the page exists. I suspect that's too expensive to do on every RecordNewMultiXact() call that crosses a page, but perhaps we could do it once at StartupMultiXact().
>
> Or perhaps track last-zeroed page separately from latest_page_number, and if we haven't seen any XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call SimpleLruDoesPhysicalPageExist() to determine if initialization is needed. Attached patch does that.

SimpleLruDoesPhysicalPageExist() does not detect recently zeroed pages sitting in the buffers, because it goes directly to the FS. I tried this approach when implementing the deadlock fix, and it did not work for me.

Best regards, Andrey Borodin.
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Heikki Linnakangas
Date:
On 20/03/2026 13:55, Andrey Borodin wrote:
>> On 19 Mar 2026, at 23:11, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I think a better fix is to accept that our tracking is a little
>> imprecise and use SimpleLruDoesPhysicalPageExist() to check if the
>> page exists. I suspect that's too expensive to do on every
>> RecordNewMultiXact() call that crosses a page, but perhaps we
>> could do it once at StartupMultiXact().
>>
>> Or perhaps track last-zeroed page separately from
>> latest_page_number, and if we haven't seen any
>> XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call
>> SimpleLruDoesPhysicalPageExist() to determine if initialization is
>> needed. Attached patch does that.
>
> SimpleLruDoesPhysicalPageExist() does not detect recently zeroed
> pages via buffers, because it goes directly to FS. I tried this
> approach when implementing deadlock fix, it did not work for me.

Hmm, after startup, before we have zeroed any pages, it still works though. So I think my patch works, but it means that tracking the latest page we have zeroed is not merely an optimization to avoid excessive SimpleLruDoesPhysicalPageExist() calls, it's needed for correctness. Need to adjust the comments for that.

- Heikki
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Andrey Borodin
Date:
> On 20 Mar 2026, at 16:19, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Hmm, after startup, before we have zeroed any pages, it still works though. So I think my patch works, but it means that tracking the latest page we have zeroed is not merely an optimization to avoid excessive SimpleLruDoesPhysicalPageExist() calls, it's needed for correctness. Need to adjust the comments for that.

If we are sure the buffers do not contain this page, we can detect it via the FS. Otherwise... nothing bad can happen, actually. We might get a false positive and zero the page once more.

If we get init_needed == false, maybe cache it for this page and set last_initialized_offsets_page = pageno? Or perhaps XLOG_MULTIXACT_ZERO_OFF_PAGE will do it for us anyway, but a bit later.

Best regards, Andrey Borodin.
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Heikki Linnakangas
Date:
On 20/03/2026 15:39, Andrey Borodin wrote:
>> On 20 Mar 2026, at 16:19, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Hmm, after startup, before we have zeroed any pages, it still works though. So I think my patch works, but it means that tracking the latest page we have zeroed is not merely an optimization to avoid excessive SimpleLruDoesPhysicalPageExist() calls, it's needed for correctness. Need to adjust the comments for that.
>
> If we are sure buffers have no this page we can detect it via FS.
> Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.

Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.

> If we got init_needed==false, maybe cache it for this page and set last_initialized_offsets_page = pageno?
> Or, perhaps, XLOG_MULTIXACT_ZERO_OFF_PAGE will do it for us anyway, but a bit later.

My patch does set last_initialized_offsets_page = pageno if it initializes the page, so yeah, I think we're good there.

- Heikki
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Andrey Borodin
Date:
> On 20 Mar 2026, at 18:14, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.

I see only cases where it's not a problem: we zeroed the page, did not flush it (thus did not extend the file), crashed, tested the FS, zeroed the page once more, overwrote it again by replaying WAL -- no big deal. We should never zero a page with offsets that will not be replayed by WAL.

If the page was persisted, even partially, we will read it from disk without zeroing it out.

Best regards, Andrey Borodin.
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Heikki Linnakangas
Date:
On 20/03/2026 19:05, Andrey Borodin wrote:
>> On 20 Mar 2026, at 18:14, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.
>
> I see only cases when it's not a problem: we zeroed page, did not flush it, thus did not extend the file, crashed, tested FS, zeroed page once more, overwrote again by replaying WAL, no big deal.
> We should never zero a page with offsets, that will not be replayed by WAL.

I think we're in agreement, but I want to verify because this is important to get right. I was replying to this:

> If we are sure buffers have no this page we can detect it via FS.
> Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.

My point is that if we rely on SimpleLruDoesPhysicalPageExist(), and it ever returns false even though we had already initialized the page, you can lose data. It's *not* ok to zero a page again that was zeroed earlier already, because we might have already written some real data on it.

Let's consider this WAL stream, generated with an old minor version:

ZERO_PAGE:2048 -> CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047

2048 is the first multixid on the page. When WAL replay gets to the CREATE_ID:2047 record, it will enter the backwards-compatibility codepath and needs to determine if the page containing the next mxid (2048) already exists.

In this WAL sequence, the page already exists, because the ZERO_PAGE record was replayed earlier. But if we just call SimpleLruDoesPhysicalPageExist(), it will return 'false' because the page was not flushed to disk yet. If we believe that and zero the page again, we will lose data (the offset for mxid 2049).

The opposite cannot happen: if SimpleLruDoesPhysicalPageExist() returns true, then the page really does exist.
So indeed we can only trust SimpleLruDoesPhysicalPageExist() if we are sure that the page is not sitting in the buffers.

Attached is a new version. I updated the comment to explain that.

I also added another safety measure: before calling SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers. That way, SimpleLruDoesPhysicalPageExist() should definitely return the correct answer. That shouldn't be necessary, because the check with last_initialized_offsets_page should cover all the cases where a page that extended the file is sitting in the buffers, but better safe than sorry.

- Heikki
Attachments
Thanks for the v2 patch.
On 20/03/2026 16:19, Heikki Linnakangas wrote:
> it means that tracking the latest page we have zeroed is not merely
> an optimization to avoid excessive SimpleLruDoesPhysicalPageExist()
> calls, it's needed for correctness.
Agreed.
On 20/03/2026 18:14, Heikki Linnakangas wrote:
> I also added another safety measure: before calling
> SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers.
This is more robust than scanning the SLRU buffers first and only
calling SimpleLruDoesPhysicalPageExist() on a miss, which would
rely on the SLRU eviction invariant.
I walked through the scenarios I could think of. Let N be the last
multixid on offset page P, so N+1 falls on page P+1.
(a) Old-version WAL (CREATE_ID:N before ZERO_OFF_PAGE:P+1):
last_initialized_offsets_page = P from earlier ZERO_OFF_PAGE.
init_needed = (P == P) = true -> init P+1. Correct.
Later ZERO_OFF_PAGE:P+1 is skipped via last_initialized_offsets_page.
(b) Crash-restart, page P+1 not on disk (the original bug):
last_initialized_offsets_page = -1, fallback path fires.
SimpleLruDoesPhysicalPageExist(P+1) = false -> init. Correct.
(c) Crash-restart, page P+1 already on disk:
Same fallback, SimpleLruDoesPhysicalPageExist(P+1) = true -> skip.
last_initialized_offsets_page stays -1 until the next
ZERO_OFF_PAGE switches back to the fast path.
(d) Out-of-order CREATE_IDs (ZERO_PAGE:P+1 -> CREATE_ID:N+1 ->
CREATE_ID:N+2 -> CREATE_ID:N):
N+1 and N+2 don't cross a page boundary, compat logic not entered.
CREATE_ID:N: init_needed = (P+1 == P) = false -> skip.
Page P+1 is not re-zeroed, data from N+1/N+2 preserved.
(e) Consecutive page crossings (N on page P, later M on page P+1):
After init of P+1: last_initialized_offsets_page = P+1.
CREATE_ID:M: init_needed = (P+1 == P+1) = true -> init P+2.
Tracking advances monotonically across page boundaries.
The logic looks correct to me in all the cases above.
Regards,
Duan
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Kirill Reshke
Date:
Hi!

I can see that the back-branches commit was included in master [0]. I think this is good.

On Sun, 22 Mar 2026 at 16:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 20/03/2026 19:05, Andrey Borodin wrote:
> >> On 20 Mar 2026, at 18:14, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.
> >
> > I see only cases when it's not a problem: we zeroed page, did not flush it, thus did not extend the file, crashed, tested FS, zeroed page once more, overwrote again by replaying WAL, no big deal.
> > We should never zero a page with offsets, that will not be replayed by WAL.
>
> I think we're in agreement, but I want to verify because this is
> important to get right. I was replying to this:
>
> > If we are sure buffers have no this page we can detect it via FS.
> > Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.
>
> My point is that if we rely on SimpleLruDoesPhysicalPageExist(), and it
> ever returns false even though we had already initialized the page, you
> can lose data. It's *not* ok to zero a page again that was zeroed
> earlier already, because we might have already written some real data on it.

+1. Even if we manage to compose a "fix" that zeroes a page more than once, that "fix" will not be future-proof, and we will corrupt the database if anything goes even slightly wrong.

> Let's consider this wal stream, generated with old minor version:
>
> ZERO_PAGE:2048 -> CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047
>
> 2048 is the first multixid on the page.
> When WAL replay gets to the
> CREATE_ID:2047 record, it will enter the backwards-compatibility
> codepath and needs to determine if the page containing the next mxid
> (2048) already exists.
>
> In this WAL sequence, the page already exist because the ZERO_PAGE
> record was replayed earlier. But if we just call
> SimpleLruDoesPhysicalPageExist(), it will return 'false' because the
> page was not flushed to disk yet. If we believe that and zero the page
> again, we will lose data (the offset for mxid 2049).
>
> The opposite cannot happen: if SimpleLruDoesPhysicalPageExist() returns
> true, then it does really exist.
>
> So indeed we can only trust SimpleLruDoesPhysicalPageExist() if we are
> sure that the page is not sitting in the buffers.

+1

> Attached is a new version. I updated the comment to explain that.
>
> I also added another safety measure: before calling
> SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers. That way,
> SimpleLruDoesPhysicalPageExist() should definitely return the correct
> answer. That shouldn't be necessary because the check with
> last_initialized_offsets_page should cover all the cases where a page
> that extended the file is sitting in the buffers, but better safe than
> sorry.
>
> - Heikki

I played with v2 and was unable to fool it into corrupting the DB, so v2 looks good to me.

[0] https://git.postgresql.org/cgit/postgresql.git/commit/?id=516310ed4dba89bd300242df0d56b4782f33ed4d

--
Best regards,
Kirill Reshke
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Kirill Reshke
Date:
On Sun, 22 Mar 2026 at 19:15, Kirill Reshke <reshkekirill@gmail.com> wrote:
> I played with v2 and was unable to fool it into corrupting the db, so v2
> looks good to me.

Also, in the commit message:

> the backwards compatibility logic to tolerate WAL generated by older
> minor versions

Let's define "older" as pre-789d65364c, to be exact?

--
Best regards,
Kirill Reshke
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
From:
Heikki Linnakangas
Date:
On 22/03/2026 15:09, 段坤仁(刻韧) wrote:
> On 20/03/2026 16:19, Heikki Linnakangas wrote:
>> it means that tracking the latest page we have zeroed is not merely
>> an optimization to avoid excessive SimpleLruDoesPhysicalPageExist()
>> calls, it's needed for correctness.
>
> Agreed.
>
> On 20/03/2026 18:14, Heikki Linnakangas wrote:
>> I also added another safety measure: before calling
>> SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers.
>
> This is more robust than scanning the SLRU buffers first and only
> calling SimpleLruDoesPhysicalPageExist() on a miss, which would
> rely on the SLRU eviction invariant.
>
> I walked through the scenarios I could think of. Let N be the last
> multixid on offset page P, so N+1 falls on page P+1.
>
> (a) Old-version WAL (CREATE_ID:N before ZERO_OFF_PAGE:P+1):
>     last_initialized_offsets_page = P from earlier ZERO_OFF_PAGE.
>     init_needed = (P == P) = true -> init P+1. Correct.
>     Later ZERO_OFF_PAGE:P+1 is skipped via pre_initialized_offsets_page.
>
> (b) Crash-restart, page P+1 not on disk (the original bug):
>     last_initialized_offsets_page = -1, fallback path fires.
>     SimpleLruDoesPhysicalPageExist(P+1) = false -> init. Correct.
>
> (c) Crash-restart, page P+1 already on disk:
>     Same fallback, SimpleLruDoesPhysicalPageExist(P+1) = true -> skip.
>     last_initialized_offsets_page stays -1 until the next
>     ZERO_OFF_PAGE switches back to the fast path.
>
> (d) Out-of-order CREATE_IDs (ZERO_PAGE:P+1 -> CREATE_ID:N+1 ->
>     CREATE_ID:N+2 -> CREATE_ID:N):
>     N+1 and N+2 don't cross a page boundary, compat logic not entered.
>     CREATE_ID:N: init_needed = (P+1 == P) = false -> skip.
>     Page P+1 is not re-zeroed, data from N+1/N+2 preserved.
>
> (e) Consecutive page crossings (N on page P, later M on page P+1):
>     After init of P+1: last_initialized_offsets_page = P+1.
>     CREATE_ID:M: init_needed = (P+1 == P+1) = true -> init P+2.
>     Tracking advances monotonically across page boundaries.
> The logic looks correct to me in all the cases above.

Ok, committed. Thank you!

- Heikki