Discussion: Why clearing the VM doesn't require registering vm buffer in wal record
Today Andres and I noticed that heap_{update,insert,delete}() don't
register the VM buffer when they are clearing the VM. I was under the
impression that any buffers modified needed to be registered in the
WAL record. Without which, you'll never do an FPI. It seems like this
could cause checksum failures. What are we missing?
Hi,
On 2026-03-05 14:56:13 -0500, Melanie Plageman wrote:
> Today Andres and I noticed that heap_{update,insert,delete}() don't
> register the VM buffer when they are clearing the VM. I was under the
> impression that any buffers modified needed to be registered in the
> WAL record. Without which, you'll never do an FPI. It seems like this
> could cause checksum failures. What are we missing?
Besides checksum errors, I was also worried that this would break
pg_rewind. But it seems we're saved by pg_rewind only parsing the WAL for
changes to the main fork.
But it does seem like it could be a problem for incremental backup /
walsummarizer?
Greetings,
Andres Freund
Re: Why clearing the VM doesn't require registering vm buffer in wal record
From: Matthias van de Meent
On Thu, 5 Mar 2026 at 20:56, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Today Andres and I noticed that heap_{update,insert,delete}() don't
> register the VM buffer when they are clearing the VM. I was under the
> impression that any buffers modified needed to be registered in the
> WAL record. Without which, you'll never do an FPI. It seems like this
> could cause checksum failures. What are we missing?
Might it be, because zeroing a VM page (which you would do when you
encounter checksum failures) is an MVCC-safe operation?
I agree with you that we probably _should_ register the VM (and
possibly FSM) buffer, but that's a bit of a different story. Right
now, the VM does not use the standard page format (nor does the FSM),
and therefore every FPI would be the full 8KB, even when just a few
bits of the VM page are in use; which would be a rather large waste of
space.
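[Editor's note: the size argument above can be made concrete with a toy sketch. This is plain Python, not PostgreSQL code; BLCKSZ and the pd_lower/pd_upper names follow PostgreSQL conventions, but the function itself is invented for illustration. With the standard page layout, a full-page image can omit the unused "hole" between pd_lower and pd_upper; a non-standard page such as the VM has to be logged whole.]

```python
# Toy illustration: FPI size with and without standard-page hole compression.
BLCKSZ = 8192

def fpi_size(pd_lower, pd_upper, standard_layout):
    """Bytes of the page image that end up in the WAL record."""
    if standard_layout:
        hole = pd_upper - pd_lower   # unused middle of the page is skipped
        return BLCKSZ - hole
    return BLCKSZ                    # non-standard page: log all of it

# A VM page with only a few bits in use still costs a full 8KB image:
assert fpi_size(pd_lower=32, pd_upper=8192, standard_layout=False) == 8192
assert fpi_size(pd_lower=32, pd_upper=8192, standard_layout=True) == 32
```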
On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
>
> But it does seem like it could be a problem for incremental backup /
> walsummarizer?
I don't think it is, because that doesn't do calculations for non-main
forks, it considers those forks always changed and includes them in
full. Or at least, that was the response I got when I raised concerns
about the FSM back when the incremental backup feature was being
developed [0].
Kind regards,
Matthias van de Meent
Databricks (https://databricks.com)
[0] https://postgr.es/m/CA%2BTgmoaR8o%2BPBeWc_2Ge0XVgoM7xWKNyDmqXoTov%3DS6_J1gecQ%40mail.gmail.com
Hi,
On 2026-03-05 21:23:12 +0100, Matthias van de Meent wrote:
> On Thu, 5 Mar 2026 at 20:56, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > Today Andres and I noticed that heap_{update,insert,delete}() don't
> > register the VM buffer when they are clearing the VM. I was under the
> > impression that any buffers modified needed to be registered in the
> > WAL record. Without which, you'll never do an FPI. It seems like this
> > could cause checksum failures. What are we missing?
>
> Might it be, because zeroing a VM page (which you would do when you
> encounter checksum failures) is an MVCC-safe operation?
We are afaict not automatically zeroing corrupted VM pages (we are doing that
for FSM though). IMO it's a bug if a user ever has to turn on
zero_damaged_pages outside of the storage actually corrupting data.
> I agree with you that we probably _should_ register the VM (and possibly
> FSM) buffer, but that's a bit of a different story. Right now, the VM does
> not use the standard page format (nor does the FSM), and therefore every FPI
> would be the full 8KB, even when just a few bits of the VM page are in use;
> which would be a rather large waste of space.
If checksums are enabled, we are already emitting FPIs for the VM when
*setting* bits in the VM (c.f. log_heap_visible()). I don't see why the story
for clearing it should be different. And because there are so few VM pages
compared to heap pages, I wouldn't expect there to be a meaningful amount of
VM FPIs outside of very contrived workloads.
> On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
> >
> > But it does seem like it could be a problem for incremental backup /
> > walsummarizer?
>
> I don't think it is, because that doesn't do calculations for non-main
> forks, it considers those forks always changed and includes them in
> full. Or at least, that was the response I got when I raised concerns
> about the FSM back when the incremental backup feature was being
> developed [0].
There's explicit code for ignoring the FSM, but I don't see the same for the
VM. And that makes sense: VM changes are mostly WAL logged, just not
completely / generically (i.e. this complaint), whereas FSM changes are not
WAL logged at all.
Greetings,
Andres Freund
Hi,
On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
> > On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > But it does seem like it could be a problem for incremental backup /
> > > walsummarizer?
> >
> > I don't think it is, because that doesn't do calculations for non-main
> > forks, it considers those forks always changed and includes them in
> > full. Or at least, that was the response I got when I raised concerns
> > about the FSM back when the incremental backup feature was being
> > developed [0].
>
> There's explicit code for ignoring the FSM, but I don't see the same for the
> VM. And that makes sense: VM changes are mostly WAL logged, just not
> completely / generically (i.e. this complaint), whereas FSM changes are not
> WAL logged at all.
Unfortunately I can confirm that incremental backups end up with an outdated
VM.
An unfortunate kicker: It looks like verify_heapam() doesn't even notice :(.
But the next vacuum does complain, if we end up processing the page for some
reason:
2026-03-05 15:56:23.018 EST [2073843][client backend][0/5:0][psql] WARNING: page is not marked all-visible but visibilitymap bit is set in relation "s" page 0
2026-03-05 15:56:23.018 EST [2073843][client backend][0/5:0][psql] CONTEXT: while scanning block 0 of relation "public.s"
Repro:
1) CREATE TABLE s(id int); INSERT INTO s DEFAULT VALUES; SELECT pg_current_wal_insert_lsn(); VACUUM s;
2) create full base backup
rm -rf /tmp/backup-a && mkdir /tmp/backup-a && ./src/bin/pg_basebackup/pg_basebackup -c fast -D /tmp/backup-a
3) UPDATE s SET id = -id;
4) create incremental backup
rm -rf /tmp/backup-a-i && mkdir /tmp/backup-a-i && ./src/bin/pg_basebackup/pg_basebackup --incremental=/tmp/backup-a/backup_manifest -c fast -D /tmp/backup-a-i
5) combine full and incremental backup
rm -rf /tmp/backup-a-i-c && mkdir /tmp/backup-a-i-c && ./src/bin/pg_combinebackup/pg_combinebackup -d -o /tmp/backup-a-i-c --copy-file-range /tmp/backup-a /tmp/backup-a-i/
6) start PG on combined backup and use
SELECT * FROM pg_visibility_map_summary('s');
to see things are wrong
7) vacuum and see it complain (page is processed because we always look at the
last page, and there's just one)
Greetings,
Andres Freund
On Thu, Mar 5, 2026 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> If checksums are enabled, we are already emitting FPIs for the VM when
> *setting* bits in the VM (c.f. log_heap_visible()). I don't see why the story
> for clearing it should be different. And because there are so few VM pages
> compared to heap pages, I wouldn't expect there to be a meaningful amount of
> VM FPIs outside of very contrived workloads.

Yeah, that's how it seems to me, too, at least as of this moment.

Is there any indication in the code or comments that this was an
intentional omission because somebody thought we could get away without
doing it? Or is it just a straight-up goof?

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2026-03-05 18:11:43 -0500, Robert Haas wrote:
> On Thu, Mar 5, 2026 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> > If checksums are enabled, we are already emitting FPIs for the VM when
> > *setting* bits in the VM (c.f. log_heap_visible()). I don't see why the story
> > for clearing it should be different. And because there are so few VM pages
> > compared to heap pages, I wouldn't expect there to be a meaningful amount of
> > VM FPIs outside of very contrived workloads.
>
> Yeah, that's how it seems to me, too, at least as of this moment.
>
> Is there any indication in the code or comments that this was an
> intentional omission because somebody thought we could get away
> without doing it?
I couldn't find any evidence of that, which of course doesn't mean it doesn't
exist.
> Or is just a straight-up goof?
It kinda looks like a victim of multiple subsequent changes that each arguably
should have done something different:
commit 2c03216d831
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2014-11-20 17:56:26 +0200
Revamp the WAL record format.
Which should have included a reference to the VM page in the WAL record for
insert/update/delete.
and
commit 96ef3b8ff1c
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: 2013-03-22 13:54:07 +0000
Allow I/O reliability checks using 16-bit checksums
Which should have triggered an FPI when clearing the VM, as afaict there
otherwise is no guarantee the checksum will be correct after a crash.
and, although I am not sure it's true:
commit 503c7305a1e
Author: Robert Haas <rhaas@postgresql.org>
Date: 2011-06-21 23:04:40 -0400
Make the visibility map crash-safe.
Which perhaps also should have emitted an FPI when clearing a bit? But I'm
unsure that that was required at the time. OTOH, it did seem to generate an
FPI for setting a VM bit, so ...
Greetings,
Andres Freund
On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
> Which perhaps also should have emitted an FPI when clearing a bit? But I'm
> unsure that that was required at the time. OTOH, it did seem to generate an
> FPI for setting a VM bit, so ...

After booting up the part of my brain that worked on this back in 2011, I
think I reasoned that clearing a bit was idempotent and didn't depend on the
prior page state, so that we would be adequately protected without an FPI. I
had to add a WAL record to *set* the VM bit, because that wasn't logged at
all, but the actual clearing of the VM bit piggybacked on the heap change
that necessitated doing so, and having that record also emit an FPI for the
VM page didn't seem to me to add anything.

So I think technically it's the addition of checksums that turns this into a
real bug, because now torn pages are a checksum-failure hazard. However,
that would have been extremely hard to notice given that I seem never to
have actually documented that reasoning in a comment anywhere.

I don't think we should mess around with trying to make the behavior here
conditional on wal_level, checksums, etc. We should just do it all the time
to keep the logic simple.

--
Robert Haas
EDB: http://www.enterprisedb.com
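[Editor's note: the torn-page point can be illustrated with a toy model. This is plain Python with a CRC stand-in, not pg_checksum_page, and the 4KB atomic sector size is an assumption. Clearing a bit is idempotent, but if only half of the modified page reaches disk before a crash, the stored checksum no longer matches the page contents, which is exactly the hazard an FPI after a checkpoint removes.]

```python
# Toy model: why a torn write to a modified-but-not-FPI'd page is a
# checksum hazard once checksums are enabled.
import zlib

SECTOR = 4096           # assume the disk writes 4KB sectors atomically
PAGE = 8192

def with_checksum(payload: bytes) -> bytes:
    """Prefix a payload with a 4-byte checksum over it."""
    csum = zlib.crc32(payload).to_bytes(4, "little")
    return csum + payload

def verify(page: bytes) -> bool:
    """Check that the stored checksum matches the page contents."""
    return page[:4] == zlib.crc32(page[4:]).to_bytes(4, "little")

old = with_checksum(b"\xff" * (PAGE - 4))   # VM bits set, valid checksum
new = with_checksum(b"\x00" * (PAGE - 4))   # bits cleared, new checksum

# Crash after only the first sector of the new page reaches disk: the
# on-disk page mixes the new header/checksum with the old second half.
torn = new[:SECTOR] + old[SECTOR:]
assert verify(old) and verify(new)
assert not verify(torn)    # recovery now sees a checksum failure
```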
06.03.2026 00:01, Andres Freund wrote:
> Hi,
>
> On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
>>> On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
>>>>
>>>> But it does seem like it could be a problem for incremental backup /
>>>> walsummarizer?
>>>
>>> I don't think it is, because that doesn't do calculations for non-main
>>> forks, it considers those forks always changed and includes them in
>>> full. Or at least, that was the response I got when I raised concerns
>>> about the FSM back when the incremental backup feature was being
>>> developed [0].
>>
>> There's explicit code for ignoring the FSM, but I don't see the same for the
>> VM. And that makes sense: VM changes are mostly WAL logged, just not
>> completely / generically (i.e. this complaint), whereas FSM changes are not
>> WAL logged at all.
>
> Unfortunately I can confirm that incremental backups end up with an outdated
> VM.

That is why pg_probackup still archives the VM in full in its incremental
(WAL-parsing) backups. That is why WAL-G's incremental backup in WAL-parsing
mode is (was?) considered unstable.

I have known about this problem for a couple of years. Excuse me for not
writing about it.

I didn't realize the fix could be as simple as registering the VM buffers.
My bad. I feel so stupid :-(

It would be great if it were fixed in all supported versions.

--
regards
Yura Sokolov aka funny-falcon
12.03.2026 14:25, Yura Sokolov wrote:
> 06.03.2026 00:01, Andres Freund wrote:
>> Hi,
>>
>> On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
>>>> On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
>>>>>
>>>>> But it does seem like it could be a problem for incremental backup /
>>>>> walsummarizer?
>>>>
>>>> I don't think it is, because that doesn't do calculations for non-main
>>>> forks, it considers those forks always changed and includes them in
>>>> full. Or at least, that was the response I got when I raised concerns
>>>> about the FSM back when the incremental backup feature was being
>>>> developed [0].
>>>
>>> There's explicit code for ignoring the FSM, but I don't see the same for the
>>> VM. And that makes sense: VM changes are mostly WAL logged, just not
>>> completely / generically (i.e. this complaint), whereas FSM changes are not
>>> WAL logged at all.
>>
>> Unfortunately I can confirm that incremental backups end up with an outdated
>> VM.
>
> That is why pg_probackup still archives the VM in full in its incremental
> (WAL-parsing) backups. That is why WAL-G's incremental backup in WAL-parsing
> mode is (was?) considered unstable.
>
> I have known about this problem for a couple of years. Excuse me for not
> writing about it.
>
> I didn't realize the fix could be as simple as registering the VM buffers.
> My bad. I feel so stupid :-(

Should the LSN be properly set on the vm page as well? To the maximum of
the concurrent updates, of course.

--
regards
Yura Sokolov aka funny-falcon
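[Editor's note: the question above can be illustrated with a toy model. The helper below is invented for illustration; it is not PostgreSQL's PageSetLSN, which simply assigns under the buffer lock. The point is that when several WAL records touch the same VM page, the page's final LSN must cover the newest of them, so the WAL-before-data rule protects every change on the page.]

```python
# Toy model of the "set the LSN to the maximum of concurrent updates" idea.
def advance_page_lsn(page_lsn, record_lsn):
    """A page LSN may only move forward, never backward."""
    return max(page_lsn, record_lsn)

# Three records touch the same VM page; regardless of the order in which
# the updates land, the final page LSN covers the newest record, so
# flushing WAL through that LSN before writing the page out is safe.
lsn = 0
for record_lsn in (120, 150, 140):
    lsn = advance_page_lsn(lsn, record_lsn)
assert lsn == 150
```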
Hi,

On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
> > Which perhaps also should have emitted an FPI when clearing a bit? But I'm
> > unsure that that was required at the time. OTOH, it did seem to generate an
> > FPI for setting a VM bit, so ...
>
> After booting up the part of my brain that worked on this back in
> 2011, I think I reasoned that clearing a bit was idempotent and didn't
> depend on the prior page state, so that we would be adequately protected
> without an FPI. I had to add a WAL record to *set* the VM bit, because
> that wasn't logged at all, but the actual clearing of the VM bit
> piggybacked on the heap change that necessitated doing so, and having
> that record also emit an FPI for the VM page didn't seem to me to add
> anything.
>
> So I think technically it's the addition of checksums that turns this
> into a real bug, because now torn pages are a checksum-failure hazard.
> However, that would have been extremely hard to notice given that I
> seem never to have actually documented that reasoning in a comment
> anywhere.

And then it got a lot worse with incremental backup support...

> I don't think we should mess around with trying to make the behavior
> here conditional on wal_level, checksums, etc. We should just do it
> all the time to keep the logic simple.

Agreed.

I've been working on a fix for this (still pretty raw though). What a mess.

As part of validating that I added error checks that disallow reads in the
startup process for anything other than the FSM fork. Which triggered, due to
visibilitymap_prepare_truncate() reading a VM page.

Which in turn made me wonder: Is it actually OK for
visibilitymap_prepare_truncate() to do separate WAL logging from
XLOG_SMGR_TRUNCATE?

I suspect not, consider this scenario:

1) primary does XLOG_FPI for the VM page
2) checkpoint
3) primary does XLOG_SMGR_TRUNCATE
4) standby replays XLOG_FPI
5) standby performs a restartpoint
6) standby replays XLOG_SMGR_TRUNCATE, as part of that it again does
   visibilitymap_prepare_truncate, which dirties the VM page
7) standby crashes while writing out the VM page
8) standby does recovery and now finds a torn VM page, triggering a checksum
   failure

Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
wipe out that region of the VM?

Separately, is it actually sufficient that visibilitymap_prepare_truncate()
only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth
in my head about danger scenarios involving PITR or crashes in between
prepare_truncate() and the actual truncation.

ISTM that we should not do visibilitymap_prepare_truncate() during recovery
and always WAL log the prepare_truncate() on the primary. Does that sound
sane?

Greetings,

Andres
On 3/12/26 18:05, Andres Freund wrote:
> Hi,
>
> On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
>> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
>>> Which perhaps also should have emitted an FPI when clearing a bit? But I'm
>>> unsure that that was required at the time. OTOH, it did seem to generate an
>>> FPI for setting a VM bit, so ...
>>
>> After booting up the part of my brain that worked on this back in
>> 2011, I think I reasoned that clearing a bit was idempotent and didn't
>> depend on the prior page state, so that we would be adequately protected
>> without an FPI. I had to add a WAL record to *set* the VM bit, because
>> that wasn't logged at all, but the actual clearing of the VM bit
>> piggybacked on the heap change that necessitated doing so, and having
>> that record also emit an FPI for the VM page didn't seem to me to add
>> anything.
>>
>> So I think technically it's the addition of checksums that turns this
>> into a real bug, because now torn pages are a checksum-failure hazard.
>> However, that would have been extremely hard to notice given that I
>> seem never to have actually documented that reasoning in a comment
>> anywhere.
>
> And then it got a lot worse with incremental backup support...
>
>> I don't think we should mess around with trying to make the behavior
>> here conditional on wal_level, checksums, etc. We should just do it
>> all the time to keep the logic simple.
>
> Agreed.
>
> I've been working on a fix for this (still pretty raw though). What a mess.
>
> As part of validating that I added error checks that disallow reads in the
> startup process for anything other than the FSM fork. Which triggered, due to
> visibilitymap_prepare_truncate() reading a VM page.
>
> Which in turn made me wonder: Is it actually OK for
> visibilitymap_prepare_truncate() to do separate WAL logging from
> XLOG_SMGR_TRUNCATE?
>
> I suspect not, consider this scenario:
>
> 1) primary does XLOG_FPI for the VM page
> 2) checkpoint
> 3) primary does XLOG_SMGR_TRUNCATE
> 4) standby replays XLOG_FPI
> 5) standby performs a restartpoint
> 6) standby replays XLOG_SMGR_TRUNCATE, as part of that it again does
>    visibilitymap_prepare_truncate, which dirties the VM page
> 7) standby crashes while writing out the VM page
> 8) standby does recovery and now finds a torn VM page, triggering a checksum
>    failure
>
> Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
> wipe out that region of the VM?

Shouldn't it be possible to test this using injection points?

> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth
> in my head about danger scenarios involving PITR or crashes in between
> prepare_truncate() and the actual truncation.
>
> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> and always WAL log the prepare_truncate() on the primary. Does that sound
> sane?

I'm considering writing a stress test to try to shake out issues in this.
That strategy was pretty successful for the online checksums patch, where we
had a hunch it might not be quite correct, but it was hard to come up with a
reproducer. The stress test failures were very useful in that they gave us
proof of issues and some examples.

But the crucial part is an ability to verify correctness. With the checksums
it's easy enough - just verify checksums / look for checksum failures in the
server log. But what would that be here?

regards

--
Tomas Vondra
Hi,
On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
> > I've been working on a fix for this (still pretty raw though). What a mess.
> >
> >
> > As part of validating that I added error checks that disallow reads in the
> > startup process for anything other than the FSM fork. Which triggered, due to
> > visibilitymap_prepare_truncate() reading a VM page.
> >
> > Which in turn made me wonder: Is it actually OK for
> > visibilitymap_prepare_truncate() to do separate WAL logging from
> > XLOG_SMGR_TRUNCATE?
> >
> > I suspect not, consider this scenario:
> >
> > 1) primary does XLOG_FPI for the VM page
> > 2) checkpoint
> > 3) primary does XLOG_SMGR_TRUNCATE
> > 4) standby replays XLOG_FPI
> > 5) standby performs a restartpoint
> > 6) standby replays XLOG_SMGR_TRUNCATE, as part of that it again does
> > visibilitymap_prepare_truncate, which dirties the VM page
> > 7) standby crashes while writing out the VM page
> > 8) standby does recovery and now finds a torn VM page, triggering a checksum
> > failure
> >
> > Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
> > wipe out that region of the VM?
> >
>
> Shouldn't it be possible to test this using injection points?
Not trivially, I think? For one, you're not normally going to get a torn
page, even if you kill the server in the right moment, so you're rarely going
to be able to reproduce checksum errors without further infrastructure. But
also, since the failure behaviour is going to be to clear the VM, you'd not
really see bad consequences, I think?
I think we really need some logic during WAL replay that does more
verification of pages that are read in (i.e. do not have an FPI applied) by
the startup process.
- Direct uses of XLogReadBufferExtended() should only be allowed with
ZERO_ON_ERROR, as it has no provisions for applying FPIs.
- Unless FPWs are disabled or ZERO_ON_ERROR is used:
- the read-in page must have an LSN newer than the last REDO LSN, otherwise
there should have been an FPI
- the page may not have an LSN from the future (i.e. BLK_DONE
should never happen with FPWs)
- all FPIs in the record have been applied
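[Editor's note: the checks above can be sketched as follows. This is plain Python with invented names; nothing here corresponds to an existing PostgreSQL function, it just encodes the rules from the list.]

```python
# Hedged sketch of the proposed replay-time page validation rules.
def check_replayed_page(page_lsn, redo_lsn, replay_lsn,
                        fpw_enabled, zero_on_error, fpi_applied):
    """Return a list of rule violations for a page read during replay."""
    problems = []
    if not fpw_enabled or zero_on_error:
        return problems                 # the rules only apply with FPWs
    if not fpi_applied and page_lsn <= redo_lsn:
        # The page was last touched before the checkpoint, so the first
        # record referencing it after the REDO point should carry an FPI.
        problems.append("no FPI since last checkpoint")
    if page_lsn > replay_lsn:
        # BLK_DONE should never happen with FPWs enabled.
        problems.append("page LSN from the future (BLK_DONE)")
    return problems

# A page last touched before the checkpoint, read without an FPI:
assert check_replayed_page(page_lsn=90, redo_lsn=100, replay_lsn=110,
                           fpw_enabled=True, zero_on_error=False,
                           fpi_applied=False) == ["no FPI since last checkpoint"]
```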
A slightly different, but related, thing that we are missing is infrastructure
to detect pages that have been dirtied without associated WAL logging. I
wonder if we could use a bit in the buffer descriptor to track if a page has
been dirtied by anything other than a hint bit write. Or maybe we could
detect it when unlocking the page.
> > Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> > only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth
> > in my head about danger scenarios involving PITR or crashes in between
> > prepare_truncate() and the actual truncation.
> >
> >
> > ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> > and always WAL log the prepare_truncate() on the primary. Does that sound
> > sane?
> >
>
> I'm considering writing a stress test to try to shake out issues in
> this.
We could certainly use more of that.
> That strategy was pretty successful for the online checksums patch, where we
> had a hunch it might not be quite correct- but it was hard to come up with a
> reproducer. The stress test failures were very useful in that they gave us
> proof of issues and some examples.
> But the crucial part is an ability to verify correctness.
Unfortunately I don't think we have a particularly good infrastructure for
detecting problems right now :(
The verification ideas I posted above would, I think, help to detect some
issues, but I don't think they'd catch more complicated things.
We have wal_consistency_checking, but as it runs just after the redo routine,
it doesn't catch problems like not including things in the WAL record or
covering related actions in two WAL records that could be separated by a
checkpoint. We really should have a tool that compares a primary and a
replica after doing recovery using the masking infrastructure from
wal_consistency_checking.
> With the checksums it's easy enough - just verify checksums / look for
> checksum failures in the server log. But what would that be here?
Unfortunately that's not even something you can really rely on; it's
entirely possible to see checksum errors for the FSM without it being a bug,
as it's not WAL logged.
Greetings,
Andres Freund
On 3/12/26 22:26, Andres Freund wrote:
> Hi,
>
> On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
>>> I've been working on a fix for this (still pretty raw though). What a mess.
>>>
>>> As part of validating that I added error checks that disallow reads in the
>>> startup process for anything other than the FSM fork. Which triggered, due to
>>> visibilitymap_prepare_truncate() reading a VM page.
>>>
>>> Which in turn made me wonder: Is it actually OK for
>>> visibilitymap_prepare_truncate() to do separate WAL logging from
>>> XLOG_SMGR_TRUNCATE?
>>>
>>> I suspect not, consider this scenario:
>>>
>>> 1) primary does XLOG_FPI for the VM page
>>> 2) checkpoint
>>> 3) primary does XLOG_SMGR_TRUNCATE
>>> 4) standby replays XLOG_FPI
>>> 5) standby performs a restartpoint
>>> 6) standby replays XLOG_SMGR_TRUNCATE, as part of that it again does
>>>    visibilitymap_prepare_truncate, which dirties the VM page
>>> 7) standby crashes while writing out the VM page
>>> 8) standby does recovery and now finds a torn VM page, triggering a checksum
>>>    failure
>>>
>>> Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
>>> wipe out that region of the VM?
>>
>> Shouldn't it be possible to test this using injection points?
>
> Not trivially, I think? For one, you're not normally going to get a torn
> page, even if you kill the server in the right moment, so you're rarely going
> to be able to reproduce checksum errors without further infrastructure. But
> also, since the failure behaviour is going to be to clear the VM, you'd not
> really see bad consequences, I think?
>
> I think we really need some logic during WAL replay that does more
> verification of pages that are read in (i.e. do not have an FPI applied) by
> the startup process.
>
> - Direct uses of XLogReadBufferExtended() should only be allowed with
>   ZERO_ON_ERROR, as it has no provisions for applying FPIs.
>
> - Unless FPWs are disabled or ZERO_ON_ERROR is used:
>
>   - the read-in page must have an LSN newer than the last REDO LSN, otherwise
>     there should have been an FPI
>
>   - the page may not have an LSN from the future (i.e. BLK_DONE
>     should never happen with FPWs)
>
>   - all FPIs in the record have been applied
>
> A slightly different, but related, thing that we are missing is infrastructure
> to detect pages that have been dirtied without associated WAL logging. I
> wonder if we could use a bit in the buffer descriptor to track if a page has
> been dirtied by anything other than a hint bit write. Or maybe we could
> detect it when unlocking the page.
>
>>> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
>>> only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth
>>> in my head about danger scenarios involving PITR or crashes in between
>>> prepare_truncate() and the actual truncation.
>>>
>>> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
>>> and always WAL log the prepare_truncate() on the primary. Does that sound
>>> sane?
>>
>> I'm considering writing a stress test to try to shake out issues in
>> this.
>
> We could certainly use more of that.
>
>> That strategy was pretty successful for the online checksums patch, where we
>> had a hunch it might not be quite correct, but it was hard to come up with a
>> reproducer. The stress test failures were very useful in that they gave us
>> proof of issues and some examples.
>
>> But the crucial part is an ability to verify correctness.
>
> Unfortunately I don't think we have a particularly good infrastructure for
> detecting problems right now :(
>
> The verification ideas I posted above would, I think, help to detect some
> issues, but I don't think they'd catch more complicated things.
>
> We have wal_consistency_checking, but as it runs just after the redo routine,
> it doesn't catch problems like not including things in the WAL record or
> covering related actions in two WAL records that could be separated by a
> checkpoint. We really should have a tool that compares a primary and a
> replica after doing recovery using the masking infrastructure from
> wal_consistency_checking.

Silly idea - wouldn't it be possible to detect this particular issue
solely based on WAL? We could even have an off-line tool that reads WAL
and checks that we have all FPIs for all the changes.

>> With the checksums it's easy enough - just verify checksums / look for
>> checksum failures in the server log. But what would that be here?
>
> Unfortunately that's not even something you can really rely on; it's
> entirely possible to see checksum errors for the FSM without it being a bug,
> as it's not WAL logged.

I did not mean to imply that any checksum failure is necessarily an
outright bug. But I think this kind of issue being "normal" is a hint that
maybe not WAL-logging the FSM is not such a good design choice anymore.

regards

--
Tomas Vondra
Hi,

On 2026-03-12 22:54:25 +0100, Tomas Vondra wrote:
>>> But the crucial part is an ability to verify correctness.
>>
>> Unfortunately I don't think we have a particularly good infrastructure for
>> detecting problems right now :(
>>
>> The verification ideas I posted above would, I think, help to detect some
>> issues, but I don't think they'd catch more complicated things.
>>
>> We have wal_consistency_checking, but as it runs just after the redo routine,
>> it doesn't catch problems like not including things in the WAL record or
>> covering related actions in two WAL records that could be separated by a
>> checkpoint. We really should have a tool that compares a primary and a
>> replica after doing recovery using the masking infrastructure from
>> wal_consistency_checking.
>
> Silly idea - wouldn't it be possible to detect this particular issue
> solely based on WAL? We could even have an off-line tool that reads WAL
> and checks that we have all FPIs for all the changes.

Afaict the problem is that you can't easily see that from the WAL in case of
bugs. E.g. for the case at hand, we didn't include block references for the
cleared VM pages in the WAL record. So such a tool would need to know about
the XLH_INSERT_ALL_VISIBLE_CLEARED flag etc.

>>> With the checksums it's easy enough - just verify checksums / look for
>>> checksum failures in the server log. But what would that be here?
>>
>> Unfortunately that's not even something you can really rely on; it's
>> entirely possible to see checksum errors for the FSM without it being a bug,
>> as it's not WAL logged.
>
> I did not mean to imply that any checksum failure is necessarily an
> outright bug. But I think this kind of issue being "normal" is a hint that
> maybe not WAL-logging the FSM is not such a good design choice anymore.

I agree we'll eventually have to fix this, I just don't entirely know how.
Perhaps it'd be good enough to essentially treat the FSM as hint bits, i.e.
WAL log the first modification of an FSM page in a checkpoint, if
checksums/log-hints is on.

Greetings,

Andres Freund
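[Editor's note: the "FSM as hint bits" idea could look roughly like this. A toy sketch with invented names, not a real patch: remember, per page, the REDO LSN of the checkpoint cycle in which the page was last logged, and emit a full image only on the first modification in each cycle, and only when checksums or wal_log_hints require it.]

```python
# Hedged sketch: WAL-log only the first modification of an FSM page per
# checkpoint cycle, mirroring how hint-bit writes are handled.
class FsmPageState:
    def __init__(self):
        self.last_logged_redo = None   # REDO LSN active when last logged

def maybe_log_fsm_change(state, current_redo_lsn, checksums_enabled):
    """Return True if this modification must emit a WAL image."""
    if not checksums_enabled:
        return False                   # a torn page is harmless without checksums
    if state.last_logged_redo == current_redo_lsn:
        return False                   # already covered since this checkpoint
    state.last_logged_redo = current_redo_lsn
    return True

s = FsmPageState()
assert maybe_log_fsm_change(s, 100, True) is True    # first change: log it
assert maybe_log_fsm_change(s, 100, True) is False   # same cycle: skip
assert maybe_log_fsm_change(s, 200, True) is True    # new checkpoint: log again
```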
On Thu, Mar 12, 2026 at 1:05 PM Andres Freund <andres@anarazel.de> wrote:
> Which in turn made me wonder: Is it actually OK for
> visibilitymap_prepare_truncate() to do separate WAL logging from
> XLOG_SMGR_TRUNCATE?

I mean, I think this is just a blatant violation of the usual practice
around write-ahead logging. We're hoping that the change we're doing here
will be atomic with a later change in a separate critical section. I don't
see how that's ever going to work out. Functions with "prepare" in the name
should be limited to doing things that can be done ahead of the operation
for which they are preparing. The actual operation needs to all happen in
one critical section after all preparation is complete.

> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth
> in my head about danger scenarios involving PITR or crashes in between
> prepare_truncate() and the actual truncation.

Since we don't set the page LSN, the changes made in that critical section
can make it to disk even if we don't manage to replay the truncate record,
and fail to make it to disk even if we do. That can only be OK if those
changes are never critical. The only thing we're doing here is clearing VM
bits, so if that happens when it shouldn't, nothing should break, except
maybe synchronization between primaries and standbys. But if it doesn't
happen when it should, that's potentially data-corrupting: the relation can
be re-extended, and now you've got pages where the VM bit in the page is
clear and the VM bit in the map is set, which is never OK.

> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> and always WAL log the prepare_truncate() on the primary. Does that sound
> sane?

I think the real need here is, um, to just follow the regular WAL-logging
rules. All the things that need to happen as part of a single operation need
to be done in one critical section while holding one set of buffer locks all
at the same time, and the WAL record needs to be written in that same
critical section. Or alternatively several operations that each follow these
rules and together add up to the same thing. For example, it would be fine
to treat clearing the trailing bits of the last partial page as a separate
operation that happens before the actual truncation and that therefore has
its own WAL record, buffer locks, and critical section, provided that you
can keep any of those bits from being set again before the actual truncation
happens, e.g. by hanging onto the buffer lock. During replay, there can't be
any concurrent writers, so replay can be fully consecutive.

I feel like what's basically happened here is that an attempt was made to
hide too many of the details under the visibilitymap_* layer. That's nice
for abstraction, but it's not so great for correctness.

--
Robert Haas
EDB: http://www.enterprisedb.com