Thread: Why clearing the VM doesn't require registering vm buffer in wal record


Why clearing the VM doesn't require registering vm buffer in wal record

From
Melanie Plageman
Date:
Today Andres and I noticed that heap_{update,insert,delete}() don't
register the VM buffer when they are clearing the VM. I was under the
impression that any buffers modified needed to be registered in the
WAL record. Without which, you'll never do an FPI. It seems like this
could cause checksum failures. What are we missing?
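The torn-page hazard behind this question can be modeled abstractly: if a page is modified after the last checkpoint without an FPI, a crash mid-write can leave the two halves of the page from different versions, and the stored checksum no longer matches. A toy model in Python (CRC32 as a stand-in for PostgreSQL's actual FNV-based page checksum; `make_page` and `verify` are illustrative names, not PostgreSQL APIs):

```python
import zlib

PAGE_SIZE = 8192

def make_page(payload: bytes) -> bytes:
    """Build a page whose first 4 bytes hold a checksum of the payload."""
    assert len(payload) == PAGE_SIZE - 4
    return zlib.crc32(payload).to_bytes(4, "little") + payload

def verify(page: bytes) -> bool:
    stored = int.from_bytes(page[:4], "little")
    return stored == zlib.crc32(page[4:])

old = make_page(b"\x00" * (PAGE_SIZE - 4))  # on-disk state at the last checkpoint
new = make_page(b"\x01" * (PAGE_SIZE - 4))  # state after clearing a VM bit

# A crash mid-write can tear the page: the first 4 KB sector comes from
# the new write, while the rest still holds the old contents.
torn = new[:4096] + old[4096:]

assert verify(old) and verify(new)
assert not verify(torn)  # checksum failure: the hazard an FPI guards against
```

Replaying an FPI during recovery overwrites the torn image wholesale, which is why unregistered-but-modified buffers are a problem.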



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-05 14:56:13 -0500, Melanie Plageman wrote:
> Today Andres and I noticed that heap_{update,insert,delete}() don't
> register the VM buffer when they are clearing the VM. I was under the
> impression that any buffers modified needed to be registered in the
> WAL record. Without which, you'll never do an FPI. It seems like this
> could cause checksum failures. What are we missing?

Besides checksum errors, I was also worried that this would break
pg_rewind. But it seems we're saved by pg_rewind only parsing the WAL for
changes to the main fork.

But it does seem like it could be a problem for incremental backup /
walsummarizer?

Greetings,

Andres Freund



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Matthias van de Meent
Date:
On Thu, 5 Mar 2026 at 20:56, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Today Andres and I noticed that heap_{update,insert,delete}() don't
> register the VM buffer when they are clearing the VM. I was under the
> impression that any buffers modified needed to be registered in the
> WAL record. Without which, you'll never do an FPI. It seems like this
> could cause checksum failures. What are we missing?

Might it be because zeroing a VM page (which you would do when you
encounter checksum failures) is an MVCC-safe operation?

I agree with you that we probably _should_ register the VM (and
possibly FSM) buffer, but that's a bit of a different story. Right
now, the VM does not use the standard page format (nor does the FSM),
and therefore every FPI would be the full 8KB, even when just a few
bits of the VM page are in use; which would be a rather large waste of
space.


On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
>
> But it does seem like it could be a problem for incremental backup /
> walsummarizer?

I don't think it is, because that doesn't do calculations for non-main
forks, it considers those forks always changed and includes them in
full. Or at least, that was the response I got when I raised concerns
about the FSM back when the incremental backup feature was being
developed [0].


Kind regards,

Matthias van de Meent
Databricks (https://databricks.com)

[0] https://postgr.es/m/CA%2BTgmoaR8o%2BPBeWc_2Ge0XVgoM7xWKNyDmqXoTov%3DS6_J1gecQ%40mail.gmail.com



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-05 21:23:12 +0100, Matthias van de Meent wrote:
> On Thu, 5 Mar 2026 at 20:56, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > Today Andres and I noticed that heap_{update,insert,delete}() don't
> > register the VM buffer when they are clearing the VM. I was under the
> > impression that any buffers modified needed to be registered in the
> > WAL record. Without which, you'll never do an FPI. It seems like this
> > could cause checksum failures. What are we missing?
> 
> Might it be because zeroing a VM page (which you would do when you
> encounter checksum failures) is an MVCC-safe operation?

We are afaict not automatically zeroing corrupted VM pages (we are doing that
for FSM though). IMO it's a bug if a user ever has to turn on
zero_damaged_pages outside of the storage actually corrupting data.


> I agree with you that we probably _should_ register the VM (and possibly
> FSM) buffer, but that's a bit of a different story. Right now, the VM does
> not use the standard page format (nor does the FSM), and therefore every FPI
> would be the full 8KB, even when just a few bits of the VM page are in use;
> which would be a rather large waste of space.

If checksums are enabled, we are already emitting FPIs for the VM when
*setting* bits in the VM (c.f. log_heap_visible()). I don't see why the story
for clearing it should be different.  And because there are so few VM pages
compared to heap pages, I wouldn't expect there to be a meaningful amount of
VM FPIs outside of very contrived workloads.
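Some back-of-the-envelope arithmetic supports the "so few VM pages" point: with two VM bits (all-visible and all-frozen) per heap page, a single 8 KB VM page covers roughly a quarter gigabyte of heap. A sketch (the 24-byte page header used to derive the usable map size is a simplifying assumption):

```python
BLCKSZ = 8192
BITS_PER_HEAPBLOCK = 2   # all-visible + all-frozen
PAGE_HEADER = 24         # assumed header size; approximates the usable map area

usable_bits = (BLCKSZ - PAGE_HEADER) * 8
heap_pages_per_vm_page = usable_bits // BITS_PER_HEAPBLOCK
heap_bytes_covered = heap_pages_per_vm_page * BLCKSZ

print(heap_pages_per_vm_page)            # 32672 heap pages per VM page
print(heap_bytes_covered // (1024**2))   # ~255 MiB of heap per VM page
```

So even a write-heavy workload touches orders of magnitude fewer VM pages than heap pages, and the corresponding FPI volume should be negligible.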


> On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
> >
> > But it does seem like it could be a problem for incremental backup /
> > walsummarizer?
> 
> I don't think it is, because that doesn't do calculations for non-main
> forks, it considers those forks always changed and includes them in
> full. Or at least, that was the response I got when I raised concerns
> about the FSM back when the incremental backup feature was being
> developed [0].

There's explicit code for ignoring the FSM, but I don't see the same for the
VM. And that makes sense: VM changes are mostly WAL logged, just not
completely / generically (i.e. this complaint), whereas FSM changes are not
WAL logged at all.

Greetings,

Andres Freund



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
> > On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > But it does seem like it could be a problem for incremental backup /
> > > walsummarizer?
> >
> > I don't think it is, because that doesn't do calculations for non-main
> > forks, it considers those forks always changed and includes them in
> > full. Or at least, that was the response I got when I raised concerns
> > about the FSM back when the incremental backup feature was being
> > developed [0].
>
> There's explicit code for ignoring the FSM, but I don't see the same for the
> VM. And that makes sense: VM changes are mostly WAL logged, just not
> completely / generically (i.e. this complaint), whereas FSM changes are not
> WAL logged at all.

Unfortunately I can confirm that incremental backups end up with an outdated
VM.

An unfortunate kicker: It looks like verify_heapam() doesn't even notice :(.


But the next vacuum does complain, if we end up processing the page for some
reason:

2026-03-05 15:56:23.018 EST [2073843][client backend][0/5:0][psql] WARNING:  page is not marked all-visible but visibilitymap bit is set in relation "s" page 0
2026-03-05 15:56:23.018 EST [2073843][client backend][0/5:0][psql] CONTEXT:  while scanning block 0 of relation "public.s"


Repro:

1) CREATE TABLE s(id int); INSERT INTO s DEFAULT VALUES; SELECT pg_current_wal_insert_lsn(); VACUUM s;
2) create full base backup
   rm -rf /tmp/backup-a && mkdir /tmp/backup-a && ./src/bin/pg_basebackup/pg_basebackup -c fast -D /tmp/backup-a

3) UPDATE s SET id = -id;
4) create incremental backup
   rm -rf /tmp/backup-a-i && mkdir /tmp/backup-a-i && ./src/bin/pg_basebackup/pg_basebackup --incremental=/tmp/backup-a/backup_manifest -c fast -D /tmp/backup-a-i

5) combine full and incremental backup
   rm -rf /tmp/backup-a-i-c && mkdir /tmp/backup-a-i-c && ./src/bin/pg_combinebackup/pg_combinebackup -d -o /tmp/backup-a-i-c --copy-file-range /tmp/backup-a /tmp/backup-a-i/

6) start PG on combined backup and use
   SELECT * FROM pg_visibility_map_summary('s');
   to see things are wrong

7) vacuum and see it complain (page is processed because we always look at the
   last page, and there's just one)


Greetings,

Andres Freund



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Robert Haas
Date:
On Thu, Mar 5, 2026 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> If checksums are enabled, we are already emitting FPIs for the VM when
> *setting* bits in the VM (c.f. log_heap_visible()). I don't see why the story
> for clearing it should be different.  And because there are so few VM pages
> compared to heap pages, I wouldn't expect there to be a meaningful amount of
> VM FPIs outside of very contrived workloads.

Yeah, that's how it seems to me, too, at least as of this moment.

Is there any indication in the code or comments that this was an
intentional omission because somebody thought we could get away
without doing it? Or is just a straight-up goof?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-05 18:11:43 -0500, Robert Haas wrote:
> On Thu, Mar 5, 2026 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> > If checksums are enabled, we are already emitting FPIs for the VM when
> > *setting* bits in the VM (c.f. log_heap_visible()). I don't see why the story
> > for clearing it should be different.  And because there are so few VM pages
> > compared to heap pages, I wouldn't expect there to be a meaningful amount of
> > VM FPIs outside of very contrived workloads.
>
> Yeah, that's how it seems to me, too, at least as of this moment.
>
> Is there any indication in the code or comments that this was an
> intentional omission because somebody thought we could get away
> without doing it?

I couldn't find any evidence of that, which of course doesn't mean it doesn't
exist.


> Or is just a straight-up goof?

It kinda looks like a victim of multiple subsequent changes that each arguably
should have done something different:


commit 2c03216d831
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2014-11-20 17:56:26 +0200

    Revamp the WAL record format.


Which should have included a reference to the VM page in the WAL record for
insert/update/delete.


and

commit 96ef3b8ff1c
Author: Simon Riggs <simon@2ndQuadrant.com>
Date:   2013-03-22 13:54:07 +0000

    Allow I/O reliability checks using 16-bit checksums

Which should have triggered an FPI when clearing the VM, as afaict there
otherwise is no guarantee the checksum will be correct after a crash.


and, although I am not sure it's true:

commit 503c7305a1e
Author: Robert Haas <rhaas@postgresql.org>
Date:   2011-06-21 23:04:40 -0400

    Make the visibility map crash-safe.

Which perhaps also should have emitted an FPI when clearing a bit? But I'm
unsure that that was required at the time. OTOH, it did seem to generate an
FPI for setting a VM bit, so ...


Greetings,

Andres Freund



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Robert Haas
Date:
On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
> Which perhaps also should have emitted an FPI when clearing a bit? But I'm
> unsure that that was required at the time. OTOH, it did seem to generate an
> FPI for setting a VM bit, so ...

After booting up the part of my brain that worked on this back in
2011, I think I reasoned that clearing a bit was idempotent and didn't
depend on the prior page state, so that we would be adequately
protected without an FPI. I had to add a WAL record to *set* the VM
bit, because that wasn't logged at all, but the actual act of clearing
the VM bit
piggybacked on the heap change that necessitated doing so, and having
that record also emit an FPI for the VM page didn't seem to me to add
anything.
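Robert's idempotence argument can be made concrete with a toy model (a hypothetical helper, not PostgreSQL code): replaying "clear this bit" converges to the same page image from any prior state, which is why redo correctness alone never required an FPI here; only checksums on torn pages changed the calculus.

```python
def replay_clear(vm_byte: int, bit: int) -> int:
    """Redo of 'clear VM bit': the result is independent of the prior state."""
    return vm_byte & ~(1 << bit)

# Whatever the bit held before (torn write, stale copy, ...), applying
# the clear once or many times lands on the same page image.
for prior in range(256):
    once = replay_clear(prior, 3)
    assert replay_clear(once, 3) == once   # idempotent
    assert once & (1 << 3) == 0            # the bit always ends up clear
```

With checksums the problem is not the bit's logical value but the physical page image: a torn write leaves bytes the redo never repairs, so the checksum over the whole page can fail before redo even gets a chance to run.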

So I think technically it's the addition of checksums that turns this
into a real bug, because now torn pages are a checksum-failure hazard.
However, that would have been extremely hard to notice given that I
seem never to have actually documented that reasoning in a comment
anywhere.

I don't think we should mess around with trying to make the behavior
here conditional on wal_level, checksums, etc. We should just do it
all the time to keep the logic simple.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Yura Sokolov
Date:
On 06.03.2026 00:01, Andres Freund wrote:
> Hi,
> 
> On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
>>> On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
>>>>
>>>> But it does seem like it could be a problem for incremental backup /
>>>> walsummarizer?
>>>
>>> I don't think it is, because that doesn't do calculations for non-main
>>> forks, it considers those forks always changed and includes them in
>>> full. Or at least, that was the response I got when I raised concerns
>>> about the FSM back when the incremental backup feature was being
>>> developed [0].
>>
>> There's explicit code for ignoring the FSM, but I don't see the same for the
>> VM. And that makes sense: VM changes are mostly WAL logged, just not
>> completely / generically (i.e. this complaint), whereas FSM changes are not
>> WAL logged at all.
> 
> Unfortunately I can confirm that incremental backups end up with an outdated
> VM.

That is why pg_probackup still archives the VM in full in incremental
(WAL-parsing) backups.
That is why WAL-G's incremental backup in WAL-parsing mode is (was?)
considered unstable.

I have known about the problem for a couple of years. Excuse me for not
writing about it.

I didn't realize the fix could be as simple as registering the VM buffers.
My bad. I feel so stupid :-(

It would be great if this could be fixed in all supported versions.

-- 
regards
Yura Sokolov aka funny-falcon



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Yura Sokolov
Date:
On 12.03.2026 14:25, Yura Sokolov wrote:
> On 06.03.2026 00:01, Andres Freund wrote:
>> Hi,
>>
>> On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
>>>> On Thu, 5 Mar 2026 at 21:16, Andres Freund <andres@anarazel.de> wrote:
>>>>>
>>>>> But it does seem like it could be a problem for incremental backup /
>>>>> walsummarizer?
>>>>
>>>> I don't think it is, because that doesn't do calculations for non-main
>>>> forks, it considers those forks always changed and includes them in
>>>> full. Or at least, that was the response I got when I raised concerns
>>>> about the FSM back when the incremental backup feature was being
>>>> developed [0].
>>>
>>> There's explicit code for ignoring the FSM, but I don't see the same for the
>>> VM. And that makes sense: VM changes are mostly WAL logged, just not
>>> completely / generically (i.e. this complaint), whereas FSM changes are not
>>> WAL logged at all.
>>
>> Unfortunately I can confirm that incremental backups end up with an outdated
>> VM.
> 
> That is why pg_probackup still archives the VM in full in incremental
> (WAL-parsing) backups.
> That is why WAL-G's incremental backup in WAL-parsing mode is (was?)
> considered unstable.
> 
> I have known about the problem for a couple of years. Excuse me for not
> writing about it.
> 
> I didn't realize the fix could be as simple as registering the VM buffers.
> My bad. I feel so stupid :-(
Should the LSN be set properly on the VM page as well? To the maximum of
the concurrent updates, of course.

-- 
regards
Yura Sokolov aka funny-falcon



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
> > Which perhaps also should have emitted an FPI when clearing a bit? But I'm
> > unsure that that was required at the time. OTOH, it did seem to generate an
> > FPI for setting a VM bit, so ...
> 
> After booting up the part of my brain that worked on this back in
> 2011, I think I reasoned that clearing a bit was idempotent and didn't
> > depend on the prior page state, so that we would be adequately protected
> > without an FPI. I had to add a WAL record to *set* the VM bit, because
> > that wasn't logged at all, but the actual act of clearing the VM bit
> piggybacked on the heap change that necessitated doing so, and having
> that record also emit an FPI for the VM page didn't seem to me to add
> anything.
> 
> So I think technically it's the addition of checksums that turns this
> into a real bug, because now torn pages are a checksum-failure hazard.
> However, that would have been extremely hard to notice given that I
> seem never to have actually documented that reasoning in a comment
> anywhere.

And then it got a lot worse with incremental backup support...


> I don't think we should mess around with trying to make the behavior
> here conditional on wal_level, checksums, etc. We should just do it
> all the time to keep the logic simple.

Agreed.


I've been working on a fix for this (still pretty raw though).  What a mess.


As part of validating that I added error checks that disallow reads in the
startup process for anything other than the FSM fork.  Which triggered, due to
visibilitymap_prepare_truncate() reading a VM page.

Which in turn made me wonder: Is it actually OK for
visibilitymap_prepare_truncate() to do separate WAL logging from
XLOG_SMGR_TRUNCATE?

I suspect not, consider this scenario:

1) primary does XLOG_FPI for the VM page
2) checkpoint
3) primary does XLOG_SMGR_TRUNCATE
4) standby replays XLOG_FPI
5) primary performs a restartpoint
6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
   visibilitymap_prepare_truncate, which dirties the VM page
7) standby crashes while writing out the VM page
8) standby does recovery and now finds a torn VM page, triggering a checksum
   failure

Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
wipe out that region of the VM?


Separately, is it actually sufficient that visibilitymap_prepare_truncate()
only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
in my head about danger scenarios involving PITR or crashes inbetween
prepare_truncate() and the actual truncation.


ISTM that we should not do visibilitymap_prepare_truncate() during recovery
and always WAL log the prepare_truncate() on the primary.  Does that sound
sane?


Greetings,

Andres



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Tomas Vondra
Date:
On 3/12/26 18:05, Andres Freund wrote:
> Hi,
> 
> On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
>> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
>>> Which perhaps also should have emitted an FPI when clearing a bit? But I'm
>>> unsure that that was required at the time. OTOH, it did seem to generate an
>>> FPI for setting a VM bit, so ...
>>
>> After booting up the part of my brain that worked on this back in
>> 2011, I think I reasoned that clearing a bit was idempotent and didn't
>>> depend on the prior page state, so that we would be adequately protected
>>> without an FPI. I had to add a WAL record to *set* the VM bit, because
>>> that wasn't logged at all, but the actual act of clearing the VM bit
>> piggybacked on the heap change that necessitated doing so, and having
>> that record also emit an FPI for the VM page didn't seem to me to add
>> anything.
>>
>> So I think technically it's the addition of checksums that turns this
>> into a real bug, because now torn pages are a checksum-failure hazard.
>> However, that would have been extremely hard to notice given that I
>> seem never to have actually documented that reasoning in a comment
>> anywhere.
> 
> And then it got a lot worse with incremental backup support...
> 
> 
>> I don't think we should mess around with trying to make the behavior
>> here conditional on wal_level, checksums, etc. We should just do it
>> all the time to keep the logic simple.
> 
> Agreed.
> 
> 
> I've been working on a fix for this (still pretty raw though).  What a mess.
> 
> 
> As part of validating that I added error checks that disallow reads in the
> startup process for anything other than the FSM fork.  Which triggered, due to
> visibilitymap_prepare_truncate() reading a VM page.
> 
> Which in turn made me wonder: Is it actually OK for
> visibilitymap_prepare_truncate() to do separate WAL logging from
> XLOG_SMGR_TRUNCATE?
> 
> I suspect not, consider this scenario:
> 
> 1) primary does XLOG_FPI for the VM page
> 2) checkpoint
> 3) primary does XLOG_SMGR_TRUNCATE
> 4) standby replays XLOG_FPI
> 5) primary performs a restartpoint
> 6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
>    visibilitymap_prepare_truncate, which dirties the VM page
> 7) standby crashes while writing out the VM page
> 8) standby does recovery and now finds a torn VM page, triggering a checksum
>    failure
> 
> Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
> wipe out that region of the VM?
> 

Shouldn't it be possible to test this using injection points?

> 
> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
> in my head about danger scenarios involving PITR or crashes inbetween
> prepare_truncate() and the actual truncation.
> 
> 
> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> and always WAL log the prepare_truncate() on the primary.  Does that sound
> sane?
> 

I'm considering writing a stress test to try to shake out issues in
this. That strategy was pretty successful for the online checksums
patch, where we had a hunch it might not be quite correct - but it was
hard to come up with a reproducer. The stress test failures were very
useful in that they gave us proof of issues and some examples.

But the crucial part is an ability to verify correctness. With the
checksums it's easy enough - just verify checksums / look for checksum
failures in the server log. But what would that be here?

regards

-- 
Tomas Vondra




Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
> > I've been working on a fix for this (still pretty raw though).  What a mess.
> >
> >
> > As part of validating that I added error checks that disallow reads in the
> > startup process for anything other than the FSM fork.  Which triggered, due to
> > visibilitymap_prepare_truncate() reading a VM page.
> >
> > Which in turn made me wonder: Is it actually OK for
> > visibilitymap_prepare_truncate() to do separate WAL logging from
> > XLOG_SMGR_TRUNCATE?
> >
> > I suspect not, consider this scenario:
> >
> > 1) primary does XLOG_FPI for the VM page
> > 2) checkpoint
> > 3) primary does XLOG_SMGR_TRUNCATE
> > 4) standby replays XLOG_FPI
> > 5) primary performs a restartpoint
> > 6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
> >    visibilitymap_prepare_truncate, which dirties the VM page
> > 7) standby crashes while writing out the VM page
> > 8) standby does recovery and now finds a torn VM page, triggering a checksum
> >    failure
> >
> > Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
> > wipe out that region of the VM?
> >
>
> Shouldn't it be possible to test this using injection points?

Not trivially, I think?  For one, you're not normally going to get a torn
page, even if you kill the server in the right moment, so you're rarely going
to be able to reproduce checksum errors without further infrastructure.  But
also, since the failure behaviour is going to be to clear the VM, you'd not
really see bad consequences, I think?


I think we really need some logic during WAL replay that does more
verification of pages that are read in (i.e. do not have an FPI applied) by
the startup process.

- Direct uses of XLogReadBufferExtended() should only be allowed with
  ZERO_ON_ERROR, as it has no provisions for applying FPIs.

- Unless FPWs are disabled or ZERO_ON_ERROR is used:

   - the read-in page must have an LSN newer than the last REDO LSN, otherwise
     there should have been an FPI

   - the page may not have an LSN from the future (i.e. BLK_DONE
     should never happen with FPWs)

- all FPIs in the record have been applied
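The LSN-based invariants above could be sketched as a check applied whenever the startup process reads in a page without applying an FPI to it (a simplified model with illustrative names and plain-integer LSNs; real LSNs are 64-bit WAL positions):

```python
def check_replayed_page(page_lsn: int, redo_lsn: int, record_lsn: int,
                        fpw_enabled: bool, zero_on_error: bool) -> None:
    """Sanity checks for a page read during redo without an FPI applied."""
    if not fpw_enabled or zero_on_error:
        return  # no torn-page guarantees to enforce in these modes
    # If no FPI was applied, some earlier record since the REDO point must
    # already have carried one and stamped a newer LSN onto the page.
    if page_lsn <= redo_lsn:
        raise AssertionError("page not newer than REDO point, but no FPI seen")
    # A page LSN at or beyond the record being replayed means BLK_DONE,
    # which should be impossible when full-page writes are on.
    if page_lsn >= record_lsn:
        raise AssertionError("page LSN from the future (unexpected BLK_DONE)")
```

The point of such checks is to turn "we silently replayed onto a page that should have had an FPI" into a hard error during recovery instead of latent corruption.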



A slightly different, but related, thing that we are missing is infrastructure
to detect pages that have been dirtied without associated WAL logging.  I
wonder if we could use a bit in the buffer descriptor to track if a page has
been dirtied by anything other than a hint bit write.  Or maybe we could
detect it when unlocking the page.


> > Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> > only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
> > in my head about danger scenarios involving PITR or crashes inbetween
> > prepare_truncate() and the actual truncation.
> >
> >
> > ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> > and always WAL log the prepare_truncate() on the primary.  Does that sound
> > sane?
> >
>
> I'm considering writing a stress test to try to shake out issues in
> this.

We could certainly use more of that.


> That strategy was pretty successful for the online checksums patch, where we
> had a hunch it might not be quite correct - but it was hard to come up with a
> reproducer. The stress test failures were very useful in that they gave us
> proof of issues and some examples.


> But the crucial part is an ability to verify correctness.

Unfortunately I don't think we have a particularly good infrastructure for
detecting problems right now :(


The verification ideas I posted above would, I think, help to detect some
issues, but I don't think they'd catch more complicated things.


We have wal_consistency_checking, but as it runs just after the redo routine,
it doesn't catch problems like not including things in the WAL record or
covering related actions in two WAL records that could be separated by a
checkpoint.  We really should have a tool that compares a primary and a
replica after doing recovery using the masking infrastructure from
wal_consistency_checking.


> With the checksums it's easy enough - just verify checksums / look for
> checksum failures in the server log. But what would that be here?

Unfortunately that's not even something you can really rely on; it's
entirely possible to see checksum errors for the FSM without it being a bug,
as it's not WAL logged.

Greetings,

Andres Freund



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Tomas Vondra
Date:

On 3/12/26 22:26, Andres Freund wrote:
> Hi,
> 
> On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
>>> I've been working on a fix for this (still pretty raw though).  What a mess.
>>>
>>>
>>> As part of validating that I added error checks that disallow reads in the
>>> startup process for anything other than the FSM fork.  Which triggered, due to
>>> visibilitymap_prepare_truncate() reading a VM page.
>>>
>>> Which in turn made me wonder: Is it actually OK for
>>> visibilitymap_prepare_truncate() to do separate WAL logging from
>>> XLOG_SMGR_TRUNCATE?
>>>
>>> I suspect not, consider this scenario:
>>>
>>> 1) primary does XLOG_FPI for the VM page
>>> 2) checkpoint
>>> 3) primary does XLOG_SMGR_TRUNCATE
>>> 4) standby replays XLOG_FPI
>>> 5) primary performs a restartpoint
>>> 6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
>>>    visibilitymap_prepare_truncate, which dirties the VM page
>>> 7) standby crashes while writing out the VM page
>>> 8) standby does recovery and now finds a torn VM page, triggering a checksum
>>>    failure
>>>
>>> Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
>>> wipe out that region of the VM?
>>>
>>
>> Shouldn't it be possible to test this using injection points?
> 
> Not trivially, I think?  For one, you're not normally going to get a torn
> page, even if you kill the server in the right moment, so you're rarely going
> to be able to reproduce checksum errors without further infrastructure.  But
> also, since the failure behaviour is going to be to clear the VM, you'd not
> really see bad consequences, I think?
> 
> 
> I think we really need some logic during WAL replay that does more
> verification of pages that are read in (i.e. do not have an FPI applied) by
> the startup process.
> 
> - Direct uses of XLogReadBufferExtended() should only be allowed with
>   ZERO_ON_ERROR, as it has no provisions for applying FPIs.
> 
> - Unless FPWs are disabled or ZERO_ON_ERROR is used:
> 
>    - the read-in page must have an LSN newer than the last REDO LSN, otherwise
>      there should have been an FPI
> 
>    - the page may not have an LSN from the future (i.e. BLK_DONE
>      should never happen with FPWs)
> 
> - all FPIs in the record have been applied
> 
> 
> 
> A slightly different, but related, thing that we are missing is infrastructure
> to detect pages that have been dirtied without associated WAL logging.  I
> wonder if we could use a bit in the buffer descriptor to track if a page has
> been dirtied by anything other than a hint bit write.  Or maybe we could
> detect it when unlocking the page.
> 
> 
>>> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
>>> only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
>>> in my head about danger scenarios involving PITR or crashes inbetween
>>> prepare_truncate() and the actual truncation.
>>>
>>>
>>> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
>>> and always WAL log the prepare_truncate() on the primary.  Does that sound
>>> sane?
>>>
>>
>> I'm considering writing a stress test to try to shake out issues in
>> this.
> 
> We could certainly use more of that.
> 
> 
>> That strategy was pretty successful for the online checksums patch, where we
>> had a hunch it might not be quite correct- but it was hard to come up with a
>> reproducer. The stress test failures were very useful in that they gave us
>> proof of issues and some examples.
> 
> 
>> But the crucial part is an ability to verify correctness.
> 
> Unfortunately I don't think we have a particularly good infrastructure for
> detecting problems right now :(
> 
> 
> The verification ideas I posted above would, I think, help to detect some
> issues, but I don't think they'd catch more complicated things.
> 
> 
> We have wal_consistency_checking, but as it runs just after the redo routine,
> it doesn't catch problems like not including things in the WAL record or
> covering related actions in two WAL records that could be separated by a
> checkpoint.  We really should have a tool that compares a primary and a
> replica after doing recovery using the masking infrastructure from
> wal_consistency_checking.
> 

Silly idea - wouldn't it be possible to detect this particular issue
solely based on WAL? We could even have an off-line tool that reads WAL
and checks that we have all FPIs for all the changes.

> 
>> With the checksums it's easy enough - just verify checksums / look for
>> checksum failures in the server log. But what would that be here?
> 
> Unfortunately that's not even something you can really rely on; it's
> entirely possible to see checksum errors for the FSM without it being a bug,
> as it's not WAL logged.
> 

I did not mean to imply that any checksum failure is necessarily an
outright bug. But I think this kind of issue being "normal" is a hint that
maybe not WAL-logging the FSM is not such a good design choice anymore.

regards

-- 
Tomas Vondra




Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Andres Freund
Date:
Hi,

On 2026-03-12 22:54:25 +0100, Tomas Vondra wrote:
> >> But the crucial part is an ability to verify correctness.
> > 
> > Unfortunately I don't think we have a particularly good infrastructure for
> > detecting problems right now :(
> > 
> > 
> > The verification ideas I posted above would, I think, help to detect some
> > issues, but I don't think they'd catch more complicated things.
> > 
> > 
> > We have wal_consistency_checking, but as it runs just after the redo routine,
> > it doesn't catch problems like not including things in the WAL record or
> > covering related actions in two WAL records that could be separated by a
> > checkpoint.  We really should have a tool that compares a primary and a
> > replica after doing recovery using the masking infrastructure from
> > wal_consistency_checking.
> > 
> 
> Silly idea - wouldn't it be possible to detect this particular issue
> solely based on WAL? We could even have an off-line tool that reads WAL
> and checks that we have all FPIs for all the changes.

Afaict the problem is that you can't easily see that from the WAL in case of
bugs. E.g. for the case at hand, we didn't include block references for the
cleared VM pages in the WAL record. So such a tool would need to know about
the XLH_INSERT_ALL_VISIBLE_CLEARED flag etc.


> >> With the checksums it's easy enough - just verify checksums / look for
> >> checksum failures in the server log. But what would that be here?
> > 
> > Unfortunately that's not even something you can really rely on; it's
> > entirely possible to see checksum errors for the FSM without it being a bug,
> > as it's not WAL logged.
> > 
> 
> I did not mean to imply that any checksum failure is necessarily an
> outright bug. But I think this kind of issue being "normal" is a hint
> that not WAL-logging the FSM may no longer be such a good design choice.

I agree we'll eventually have to fix this, I just don't entirely know how.
Perhaps it'd be good enough to essentially treat the FSM as hint bits,
i.e. WAL log the first modification of an FSM page in a checkpoint, if
checksums/log-hints is on.
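As a toy model (not PostgreSQL code), the bookkeeping would look roughly like what MarkBufferDirtyHint() already does for hint bits: emit an image only for the first change to a page after each checkpoint, and only when checksums/wal_log_hints demand it. The class and method names below are illustrative, not proposed APIs:

```python
class FsmWalPolicy:
    """Toy model of hint-bit-style WAL logging for FSM pages."""

    def __init__(self, wal_log_hints):
        # Mirrors XLogHintBitIsNeeded(): only log if checksums/log-hints is on.
        self.wal_log_hints = wal_log_hints
        self.logged_since_checkpoint = set()

    def on_checkpoint(self):
        # A new checkpoint invalidates earlier images; pages must be
        # logged again on their next modification.
        self.logged_since_checkpoint.clear()

    def needs_fpi(self, fsm_page):
        """True if this modification must emit a full-page image."""
        if not self.wal_log_hints:
            return False
        if fsm_page in self.logged_since_checkpoint:
            return False
        self.logged_since_checkpoint.add(fsm_page)
        return True
```

The open question, as with hint bits, is whether a torn FSM page between the image and the crash can ever matter for correctness rather than just for checksums.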

Greetings,

Andres Freund



Re: Why clearing the VM doesn't require registering vm buffer in wal record

From
Robert Haas
Date:
On Thu, Mar 12, 2026 at 1:05 PM Andres Freund <andres@anarazel.de> wrote:
> Which in turn made me wonder: Is it actually OK for
> visibilitymap_prepare_truncate() to do separate WAL logging from
> XLOG_SMGR_TRUNCATE?

I mean, I think this is just a blatant violation of the usual practice
around write-ahead logging. We're hoping that the change we're doing
here will be atomic with a later change in a separate critical
section. I don't see how that's ever going to work out. Functions with
"prepare" in the name should be limited to doing things that can be
done ahead of the operation for which they are preparing. The actual
operation needs to all happen in one critical section after all
preparation is complete.

> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
> in my head about danger scenarios involving PITR or crashes inbetween
> prepare_truncate() and the actual truncation.

Since we don't set the page LSN, the changes made in that critical
section can make it to disk even if we don't manage to replay the
truncate record and fail to make it to disk even if we do. That can
only be OK if those changes are never critical. The only thing we're
doing here is clearing VM bits, so if that happens when it shouldn't,
nothing should break, except maybe synchronization between primaries
and standbys. But if it does happen when it shouldn't, that's
potentially data-corrupting: the relation can be re-extended, and now
you've got pages where the VM bit in the page is clear and the VM bit
in the map is set, which is never OK.

> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> and always WAL log the prepare_truncate() on the primary.  Does that sound
> sane?

I think the real need here is, um, to just follow the regular
WAL-logging rules. All the things that need to happen as part of a
single operation need to be done in one critical section while holding
one set of buffer locks all at the same time, and the WAL record needs
to be written in that same critical section. Or alternatively several
operations that each follow these rules and together add up to the
same thing. For example, it would be fine to treat clearing the
trailing bits of the last partial page as a separate operation that
happens before the actual truncation and that therefore has its own
WAL record, buffer locks, and critical section, provided that you can
keep any of those bits from being set again before the actual
truncation happens, e.g. by hanging onto the buffer lock. During
replay, there can't be any concurrent writers, so replay can be fully
consecutive.
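As a toy model (not PostgreSQL code) of why splitting one logical operation across two WAL records is unsafe: if replay stops between the two halves, you reconstruct only the first one, which can produce exactly the clear-flag/set-bit combination described above, whereas a single record covering both buffers replays atomically:

```python
def replay(wal, crash_after):
    """Replay WAL records up to a crash point and return the resulting state."""
    state = {'heap_all_visible': True, 'vm_bit': True}
    for i, record in enumerate(wal):
        if i >= crash_after:
            break  # records past the crash point never replay
        record(state)
    return state

# Two separately-logged halves of what should be one atomic operation:
wal_split = [
    lambda s: s.update(heap_all_visible=False),  # record 1: clear page flag
    lambda s: s.update(vm_bit=False),            # record 2: clear VM bit
]

# One record registering both buffers replays atomically:
wal_atomic = [
    lambda s: s.update(heap_all_visible=False, vm_bit=False),
]

# replay(wal_split, crash_after=1) leaves heap_all_visible=False, vm_bit=True:
# the page flag is clear but the VM bit is set, which is never OK.
# replay(wal_atomic, crash_after=N) is consistent for any crash point.
```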

I feel like what's basically happened here is that an attempt was made
to hide too many of the details under the visibilitymap_* layer.
That's nice for abstraction, but it's not so great for correctness.

--
Robert Haas
EDB: http://www.enterprisedb.com