Discussion: Two issues leading to discrepancies in FSM data on the standby server

Two issues leading to discrepancies in FSM data on the standby server

From

Alexey Makhmutov

Date:

20 March, 04:32:20

We’ve recently observed a significant increase in response time for
insert operations after switching over to a replica server. The
collected information pointed to a discrepancy in the FSM data on the
replica side, which became visible to the insert sessions once the
autovacuum process pulled incorrect data from leaf blocks into the FSM
root. The situation looked like the case discussed in
https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql and
which was supposed to be fixed by ‘ab7dbd681’ (which introduced an FSM
update during 'heap_xlog_visible' invocation). However, in our case and
in synthetic tests we were able to see data blocks marked as ‘all
visible’ but still having incorrect FSM records.
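
To illustrate why stale leaf values hurt inserts once they reach the FSM
root, here is a toy model in Python. It assumes the FSM can be pictured
as a max-tree over per-page free-space values and that an inserter
corrects an entry when it finds the page does not actually fit the
tuple. The names `build_fsm_tree` and `pages_visited_before_fit` are
hypothetical, and the real FSM (one-byte categories, multi-page tree) is
more involved than this sketch.

```python
def build_fsm_tree(leaves):
    """Build a binary max-tree over per-page free-space values.

    Returns a list of levels, root level first. Toy model only.
    """
    levels = [list(leaves)]
    while len(levels[0]) > 1:
        below = levels[0]
        # Each parent holds the maximum of its (up to two) children.
        above = [max(below[i:i + 2]) for i in range(0, len(below), 2)]
        levels.insert(0, above)
    return levels


def pages_visited_before_fit(fsm_leaves, actual_free, needed):
    """Count heap pages an inserter tries, trusting the FSM, until one fits.

    fsm_leaves:  what the FSM claims is free on each page (possibly stale).
    actual_free: the real free space on each page.
    Returns the number of pages tried, or None if the FSM offers no page.
    """
    fsm = list(fsm_leaves)
    visited = 0
    while True:
        root = build_fsm_tree(fsm)[0][0]
        if root < needed:
            return None  # FSM claims nothing fits; the table would be extended
        page = next(i for i, value in enumerate(fsm) if value >= needed)
        visited += 1
        if actual_free[page] >= needed:
            return visited
        fsm[page] = actual_free[page]  # inserter corrects the stale entry


# Three pages are really full, but the stale FSM still advertises them.
stale = pages_visited_before_fit([4000, 4000, 4000, 500], [0, 0, 0, 500], 200)
fresh = pages_visited_before_fit([0, 0, 0, 500], [0, 0, 0, 500], 200)
print(stale, fresh)
```

With an accurate FSM the inserter goes straight to the right page; with
stale leaves it pays one wasted page visit per obsolete entry.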

After analyzing the code I’ve noticed that during recovery FSM data is
updated in XLogRecordPageWithFreeSpace, which uses MarkBufferDirtyHint
to mark the FSM block as modified. However, if data checksums are
enabled, this call is a no-op during recovery: it exits immediately
without marking the block as dirty. The logic here is that no new WAL
can be generated during recovery, so changes to hints in a block should
not mark the block as dirty, to avoid the risk of torn pages being
written. This seems logical, but it is not well aligned with the FSM
case, as FSM blocks can simply be zeroed if a checksum mismatch is
detected. Currently, changes to an FSM block can be lost if changes to
that particular block occur rarely enough to allow its eviction from
the cache. To persist a change, the modification needs to be performed
while the FSM block is still kept in buffers and marked as dirty after
receiving its FPI. If the block was already cleaned, then the change
won’t be persisted and the stored FSM blocks may remain in an obsolete
state. In our case the table had its 'fillfactor' parameter set below
80, so during insert bursts each FSM block on the replica side was
modified only during the first access to the FSM block since the
checkpoint (with FPI) and then by processing the XLOG_HEAP2_VISIBLE
record for the data block once it was marked as ‘all visible’. This
leaves plenty of time for the buffer to be cleaned between these
moments, so the second change was just never written to disk. As a
result, a large number of blocks were left with incorrect data in FSM
leaf blocks, which caused the problem after switchover.
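
The sequence above can be sketched with a toy buffer model in Python.
This is not PostgreSQL code: `ToyFsmBuffer` and `replay_scenario` are
hypothetical names, and the only behaviour modelled is the one described
in this message, namely that a hint-style dirtying does nothing during
recovery when checksums are enabled, while an FPI replay or a regular
dirtying marks the buffer as dirty.

```python
class ToyFsmBuffer:
    """One FSM block: a value in shared buffers plus its on-disk copy."""

    def __init__(self, on_disk):
        self.on_disk = on_disk
        self.in_memory = on_disk
        self.dirty = False

    def apply_fpi(self, value):
        """Replay a full-page image: overwrite the page and dirty the buffer."""
        self.in_memory = value
        self.dirty = True

    def set_with_hint(self, value, in_recovery=True, checksums=True):
        """Change the page, then dirty it 'as a hint'.

        Assumption taken from the thread: in recovery with checksums
        enabled the dirty flag is left untouched.
        """
        self.in_memory = value
        if not (in_recovery and checksums):
            self.dirty = True

    def set_and_dirty(self, value):
        """Change the page and always mark the buffer as dirty."""
        self.in_memory = value
        self.dirty = True

    def flush_and_evict(self):
        """Write the page out if dirty, then drop it from the cache."""
        if self.dirty:
            self.on_disk = self.in_memory
            self.dirty = False
        self.in_memory = self.on_disk  # the next access re-reads the file


def replay_scenario(use_hint, evict_between):
    """Return the on-disk FSM value after two changes to the same block."""
    buf = ToyFsmBuffer(on_disk=0)
    buf.apply_fpi(200)           # first change since checkpoint, with FPI
    if evict_between:
        buf.flush_and_evict()    # buffer is cleaned before the next change
    if use_hint:
        buf.set_with_hint(50)    # e.g. replay of the 'all visible' record
    else:
        buf.set_and_dirty(50)
    buf.flush_and_evict()
    return buf.on_disk


print(replay_scenario(use_hint=True, evict_between=True))    # stale value kept
print(replay_scenario(use_hint=True, evict_between=False))   # change piggybacks
print(replay_scenario(use_hint=False, evict_between=True))   # change persisted
```

In this model the second change survives only if the buffer is still
dirty from the FPI, or if the buffer is dirtied unconditionally.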

Given that FSM is ready to handle torn page writes and 
XLogRecordPageWithFreeSpace is called only during the recovery there 
seems to be no reason to use MarkBufferDirtyHint here instead of a 
regular MarkBufferDirty call. The code is already trying to limit 
updates to the FSM (i.e. by updating it only after reaching 80% of used 
space for regular DML), so we probably want to ensure that these updates 
are actually persisted.
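
As a rough worked example of the 80% remark, the following Python
sketch compares the space reserved by 'fillfactor' with the threshold
below which redo of regular DML would touch the FSM. Both formulas are
assumptions made for illustration (a block size of 8192 bytes, a
threshold of one fifth of the block, and fillfactor reserving a plain
percentage of the block); the function names are hypothetical.

```python
BLCKSZ = 8192  # assumed default block size


def reserved_by_fillfactor(fillfactor, blcksz=BLCKSZ):
    """Bytes that regular inserts leave free on a page (assumed formula)."""
    return blcksz * (100 - fillfactor) // 100


def redo_would_update_fsm(free_space, blcksz=BLCKSZ):
    """True if less than 20% of the block is free (assumed threshold)."""
    return free_space < blcksz // 5


for fillfactor in (100, 90, 70):
    free = reserved_by_fillfactor(fillfactor)
    print(fillfactor, free, redo_would_update_fsm(free))
```

Under these assumptions a page filled up to a fillfactor of 70 still has
about 30% of the block free, so regular insert replay never crosses the
threshold and the FSM is only updated by the other paths described
above.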

The second issue I noticed (not related to our observed problem)
concerns ‘heap_xlog_visible’: this function calls ‘PageGetFreeSpace’
instead of ‘PageGetHeapFreeSpace’ to get the amount of free space in
regular heap blocks. This looks like a bug, as 'PageGetHeapFreeSpace'
is used in every other case where we need the free space of a heap
page. Using the wrong function can also cause incorrect data to be
written to the FSM on the replica: if a block still has free space but
has already reached the MaxHeapTuplesPerPage limit, it should be marked
as unavailable for new rows in the FSM; otherwise an inserter will need
to check the block and update its FSM data as well.
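
The difference between the two calls can be sketched in Python as
follows. This is a simplified model, not the PostgreSQL implementation:
the function names mirror the C ones only for readability, and the
constants (291 line pointers per 8 kB heap page, 4 bytes per line
pointer) are assumptions for a default build.

```python
MAX_HEAP_TUPLES_PER_PAGE = 291  # assumed value for 8 kB pages
LINE_POINTER_SIZE = 4           # assumed size of one line pointer, in bytes


def page_get_free_space(hole_bytes):
    """Generic free space: the hole minus room for one new line pointer."""
    if hole_bytes < LINE_POINTER_SIZE:
        return 0
    return hole_bytes - LINE_POINTER_SIZE


def page_get_heap_free_space(hole_bytes, line_pointers, has_unused_slot=False):
    """Heap-aware free space.

    Reports 0 when the page already holds the maximum number of line
    pointers and none of them can be reused, even if bytes remain.
    """
    space = page_get_free_space(hole_bytes)
    if space > 0 and line_pointers >= MAX_HEAP_TUPLES_PER_PAGE:
        if not has_unused_slot:
            return 0
    return space


# A page full of redirect slots: bytes are free, but no tuple can be added.
print(page_get_free_space(3000))
print(page_get_heap_free_space(3000, line_pointers=291))
```

For such a page the generic call advertises space that no inserter can
use, which is the kind of FSM entry the second test case is meant to
produce.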

Attached are separate patches which try to fix both of these problems:
calling ‘MarkBufferDirty’ instead of ‘MarkBufferDirtyHint’ in the first
case and replacing ‘PageGetFreeSpace’ with ‘PageGetHeapFreeSpace’ in
the second case.

Two synthetic test cases are also attached which simulate both of these
situations: ‘test_case1.zip’ simulates the problem with the lost FSM
update on the replica side, and ‘test_case2.zip’ simulates incorrect
FSM data on the standby server for blocks with a large number of
redirect slots. In both cases the ‘test_prepare.sh’ script can be
edited to specify the path to the PG installation and the port numbers.
Then invoke the ‘test_prepare.sh’ script to prepare two databases. For
the first case, the second script ‘test_run.sh’ needs to be invoked
after that to show the large number of blocks being visited for a
simple insert; for the second test case, the state of the FSM (for a
single block) is simply displayed at the end of ‘test_prepare.sh’.

Thanks,
Alexey

Re: Two issues leading to discrepancies in FSM data on the standby server

From

Andrey Borodin

Date:

25 March, 15:29:52

Hi!

Very interesting cases!

> On 20 Mar 2026, at 06:32, Alexey Makhmutov <a.makhmutov@postgrespro.ru> wrote:
>
> Attached are separate patches which try to fix both of these problems: calling ‘MarkBufferDirty’ instead of
> ‘MarkBufferDirtyHint’ in the first case and replacing ‘PageGetFreeSpace’ with ‘PageGetHeapFreeSpace’ in the second case.

Patch 0001 - MarkBufferDirty() instead of MarkBufferDirtyHint() in XLogRecordPageWithFreeSpace().
Yes, MarkBufferDirtyHint() is a no-op in recovery, and this is the only case I found of MarkBufferDirtyHint() being used
in redo.
Originally, e981653 used MarkBufferDirty(), but 96ef3b8 switched to MarkBufferDirtyHint().
Neither of these commits provided a comment on why this version was chosen. I think if we fix it we must comment
things.

Patch 0002 - PageGetHeapFreeSpace instead of PageGetFreeSpace in heap_xlog_visible.
This seems to be just an oversight in ab7dbd6. Every other call to XLogRecordPageWithFreeSpace() uses
PageGetHeapFreeSpace().
And this seems correct to me, but a bit odd. Why don't indexes update the FSM via this routine?

It seems indexes do not log free pages at all, relying on index vacuum to rebuild the FSM on the standby.

Nice catch!

Best regards, Andrey Borodin.

Re: Two issues leading to discrepancies in FSM data on the standby server

From

Alexey Makhmutov

Date:

06 April, 17:26:01

Hi Andrey!

Thank you for your attention to this patch!

> Originally, e981653 used MarkBufferDirty(), but 96ef3b8 switched to MarkBufferDirtyHint().
> Neither of these commits provided a comment on why this version was chosen. I think if we fix it we must comment
> things.

I think the reason for the change in 96ef3b8 (replacing
'MarkBufferDirty' with 'MarkBufferDirtyHint') may be described in the
next commit (9df56f6), in its README update:
> New WAL records cannot be written during recovery, so hint bits set
> during recovery must not dirty the page if the buffer is not already
> dirty, when checksums are enabled.  Systems in Hot-Standby mode may
> benefit from hint bits being set, but with checksums enabled, a page
> cannot be dirtied after setting a hint bit (due to the torn page risk).
> So, it must wait for full-page images containing the hint bit updates to
> arrive from the master.

So it seems logical that any changes to data not protected by WAL
(which includes the VM and FSM as well) should use MarkBufferDirtyHint,
which does not set the dirty flag during recovery. However, as FSM
blocks can simply be zeroed in case of a checksum mismatch, I think it's
perfectly fine to use a regular MarkBufferDirty here.

I've updated the first patch by adding a comment that explains the
reason for using MarkBufferDirty instead of MarkBufferDirtyHint here.

As for the second issue and its patch: it seems to be resolved in the
current master by a881cc9, which removed the entire 'heap_xlog_visible'
function, as all-visibility information is now sent with the
XLOG_HEAP2_PRUNE_VACUUM_CLEANUP record and its handler already uses
PageGetHeapFreeSpace. The problem is still relevant for the pre-19
versions, so I will probably move it to a separate thread in bugs.

Thanks,
Alexey

Attachments

0001-Mark-modified-FSM-buffer-as-dirty-during-recovery.patch

Re: Two issues leading to discrepancies in FSM data on the standby server

From

Melanie Plageman

Date:

14 April, 19:21:57

On Mon, Apr 6, 2026 at 10:26 AM Alexey Makhmutov
<a.makhmutov@postgrespro.ru> wrote:
>
> > Originally, e981653 used MarkBufferDirty(), but 96ef3b8 switched to MarkBufferDirtyHint().
> > Neither of these commits provided a comment on why this version was chosen. I think if we fix it we must comment
> > things.
>
> I think the reason for the change in 96ef3b8 (replacing
> 'MarkBufferDirty' with 'MarkBufferDirtyHint') may be described in the
> next commit (9df56f6), in its README update:
> > New WAL records cannot be written during recovery, so hint bits set
> > during recovery must not dirty the page if the buffer is not already
> > dirty, when checksums are enabled.  Systems in Hot-Standby mode may
> > benefit from hint bits being set, but with checksums enabled, a page
> > cannot be dirtied after setting a hint bit (due to the torn page risk).
> > So, it must wait for full-page images containing the hint bit updates to
> > arrive from the master.
>
> So it seems logical that any changes to data not protected by WAL
> (which includes the VM and FSM as well) should use MarkBufferDirtyHint,
> which does not set the dirty flag during recovery. However, as FSM
> blocks can simply be zeroed in case of a checksum mismatch, I think it's
> perfectly fine to use a regular MarkBufferDirty here.

Yea, I agree that this seems like simply an oversight in 96ef3b8. And
it seems safe to use MarkBufferDirty() here instead.

- Melanie
