Обсуждение: Can we change pg_rewind used without wal_log_hints and data_checksums

Поиск
Список
Период
Сортировка

Can we change pg_rewind used without wal_log_hints and data_checksums

От
lchch1990@sina.cn
Дата:
hi hackers,

I am thinking about why pg_rewind need wal_log_hints or data_checksums which significantly
limits its usability. I research somewhere can can only find it's for against data
corruption in code comment.

And i come up a case which may need depend on the page consistence after do pg_rewind:
1. We have primary A and standby B.
2. We have a transaction xact1 currently and xact1 modify some pages.
3. Do a checkpoint on A
4. standby B promote
5. xact1 committed on and do a query on all data modified by xact1
6. do pg_rewind on A

If on no page consistence mode, and hack pg_rewind code to force a rewind, then we
may see xact1 on A and can not see xact1 on B. And it's cause unconsistence.

Now I tell myself ,pg_rewind may can not handle this case so we must set wal_log_hints
on to avoid the case. If so we can modify pg_rewind to met this case. If not so, I want
to known the reason or some mail thread discuss that? Thanks.

Here i want to introduce a way to solve the case above:
We need record all transaction ID commited after diverge record and research more wal
before diverge record, we need to find a start lsn(lsn_s) which older than all the
transactions. And we should read from lsn_s to diverge lsn to collect influenced pages
by the transactions. So we can copy them at the rewind phase.

Best Regards,
Movead Li

Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Michael Paquier
Дата:
On Thu, Jan 15, 2026 at 09:47:39AM +0800, lchch1990@sina.cn wrote:
> I am thinking about why pg_rewind need wal_log_hints or data_checksums which significantly
> limits its usability. I research somewhere can can only find it's for against data
> corruption in code comment.

Hint bints can be set on a page, and we *have to* WAL log these pages
so as pg_rewind can track the modified blocks, or we would corrupt a
data folder after a rewind.  See around this thread about the original
description of the issue:
https://www.postgresql.org/message-id/519E5493.5060800@vmware.com

> Here i want to introduce a way to solve the case above:
> We need record all transaction ID commited after diverge record and research more wal
> before diverge record, we need to find a start lsn(lsn_s) which older than all the
> transactions. And we should read from lsn_s to diverge lsn to collect influenced pages
> by the transactions. So we can copy them at the rewind phase.

How is a method based on the tracking of transaction IDs and the
modified blocks not going to be more costly than the current method
where we are able to track the modified blocks directly in the WAL
records?
--
Michael

Вложения

Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Laurenz Albe
Дата:
On Thu, 2026-01-15 at 09:47 +0800, lchch1990@sina.cn wrote:
> I am thinking about why pg_rewind need wal_log_hints or data_checksums which significantly
> limits its usability. I research somewhere can can only find it's for against data
> corruption in code comment.

See https://postgr.es/m/CA%2BTgmoY4j%2Bp7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG%2BSPByg%40mail.gmail.com

In more detail:

1. there is a transaction open on the primary server (server A)

2. the transaction inserts a row

3. a checkpoint happens

4. the transaction commits

5. the session reads the row it just inserted, which sets hint bits on the row
   that mark it as generally visible

Now the standby (server B) promoted between steps 3 and 4, which means that on server B
(the new primary), the transaction didn't commit and the row is invisible.

Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
that were modified after the last common checkpoint (which happened in step 3 above).
If neither wal_log_hints = on nor checksums are enabled (which effectively forces
WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
fails to copy that block from server B.  The consequence is that after pg_rewind, the
row is *still* visible on server A because of the hint bits.  That is data corruption.

Therefore, the requirement cannot be relaxed.

Yours,
Laurenz Albe



Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
lchch1990@sina.cn
Дата:
>See https://postgr.es/m/CA%2BTgmoY4j%2Bp7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG%2BSPByg%40mail.gmail.com
>In more detail:
>1. there is a transaction open on the primary server (server A)
>2. the transaction inserts a row
>3. a checkpoint happens
>4. the transaction commits
>5. the session reads the row it just inserted, which sets hint bits on the row
>that mark it as generally visible
>Now the standby (server B) promoted between steps 3 and 4, which means that on server B
>(the new primary), the transaction didn't commit and the row is invisible.
>Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
>that were modified after the last common checkpoint (which happened in step 3 above).
>If neither wal_log_hints = on nor checksums are enabled (which effectively forces
>WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
>fails to copy that block from server B. The consequence is that after pg_rewind, the
>row is *still* visible on server A because of the hint bits. That is data corruption.
>Therefore, the requirement cannot be relaxed.


Yes I known the step and I have check the mail link. 
As described in the top mail we can find some way to solve the problem so that 
pg_rewind can run without wal_log_hints and data_checksums. 


Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more 
wal to cover whole releated transactions so any releated pages with copyed, and we never 
warried about hint bits issue. 


Anyway, I wish my mail in right format. Because my last mail reply to Michael out of order 
and  miss from this thread. 

---
Movead Li


Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
"lchch1990@sina.cn"
Дата:

>Here i want to introduce a way to solve the case above:
>We need record all transaction ID commited after diverge record and research more wal
>before diverge record, we need to find a start lsn(lsn_s) which older than all the
>transactions. And we should read from lsn_s to diverge lsn to collect influenced pages
>by the transactions. So we can copy them at the rewind phase.


Hi hackers,

I am sorry and I am tring how to send a in order mail.



----
Best Regards,
Movead Li



Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
"lchch1990@sina.cn"
Дата:
On Thu, 2026-01-15 at 13:47 +0800, laurenz.albe@cybertec.at wrote:
> Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
> that were modified after the last common checkpoint (which happened in step 3 above).
> If neither wal_log_hints = on nor checksums are enabled (which effectively forces
> WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
> fails to copy that block from server B.  The consequence is that after pg_rewind, the
> row is *still* visible on server A because of the hint bits.  That is data corruption.
> Therefore, the requirement cannot be relaxed.


Yes I known the step and I have check the mail link. As described in the top mail we can
find some way to solve the problem so that pg_rewind can run without wal_log_hints and
data_checksums.

Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more
wal to cover whole releated transactions so any releated pages with copyed, and we never
warried about hint bits issue.

Anyway, I wish this mail in order.



----
Best Regards,

Movead Li



 



Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Movead
Дата:
>> On Thu, 2026-01-15 at 14:14 +0800, laurenz.albe@cybertec.at wrote:
>> Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
>> that were modified after the last common checkpoint (which happened in step 3 above).
>> If neither wal_log_hints = on nor checksums are enabled (which effectively forces
>> WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
>> fails to copy that block from server B.  The consequence is that after pg_rewind, the
>> row is *still* visible on server A because of the hint bits.  That is data corruption.
>> Therefore, the requirement cannot be relaxed.



>Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more
>wal to cover whole releated transactions so any releated pages with copyed, and we never
>warried about hint bits issue.





Base on the discussion I write a patch and introduce it:

Currently pg_rewind search checkpoint start at divergerec and walk backward. Then it
collect change pages from checkpoint to divergerec forward.

We modify the second step and collect the minimal commited transaction id and named
min_commited_xid. And collect the 'first appeared' transaction id by XLOG_RUNNING_XACTS
wal record and named base_xid. If base_xid <= min_commited_xid we can work a safy
rewind.

How ever if we can not met 'base_xid <= min_commited_xid' then we read wal from
checkpoint and walk backward until we met the goal, ofcause we collect change pages during
the third step. If we can not met the goal at last, we report an error for can not finish.


The third step maybe slowly so I add a option(-d or --deep-dig), by default it stop if can not
met the goal at the second step. And user should add -d to run the third step.

Patch attached.


----
Best Regards,
Movead Li



 



 



 



 



 



 



Вложения

Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Laurenz Albe
Дата:
On Thu, 2026-01-15 at 14:14 +0800, lchch1990@sina.cn wrote:
> > See https://postgr.es/m/CA%2BTgmoY4j%2Bp7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG%2BSPByg%40mail.gmail.com
>
> Yes I known the step and I have check the mail link. 
> As described in the top mail we can find some way to solve the problem so that 
> pg_rewind can run without wal_log_hints and data_checksums. 
>
> Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more 
> wal to cover whole releated transactions so any releated pages with copyed, and we never 
> warried about hint bits issue. 

I apologize for my misunderstanding.

I had a brief look at the patch, and the gratuitous use of static variables didn't
appeal to me.  Can you briefly describe the algorithm?  You look at all commit
records *after* the fork, right?  Then how can you identify how far back you have
rewind?  How can you identify when a transaction started?

Yours,
Laurenz Albe



Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Movead
Дата:
On Thu, 2026-01-16 at 03:08 +0800, laurenz.albe@cybertec.at wrote:
>I had a brief look at the patch, and the gratuitous use of static variables didn't
>appeal to me. 
Sorry about that, and I can find a nice way if the design is good.

>Can you briefly describe the algorithm?
I think the algorithm is introduced in the patch mail, let me do a summary

During a forward WAL walk, the system collects the minimal commit transaction
ID. If it find when this transaction ID assigned, a safety rewind can be performed.


>You look at all commit records *after* the fork, right? 
It's all record after for fork, not only commited. I do not change the code logic.
Maybe commit record is enougn.

On the other hand, my patch collect all record *before* the fork, maybe
it will cause many meaningless copy and I can fix that.

>Then how can you identify how far back you have rewind? 
It's the point that we can known the stop point only by walk wal. So it maybe
take long time, and it's the reason I add a '-d, --deep-dig' option.

>How can you identify when a transaction started?
We find it by XLOG_RUNNING_XACTS wal record, this wal record  have a *nextXid*
which mean a unassigned transaction id when produce the wal record.

 



----
Best Regards,
Movead Li




 



Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Laurenz Albe
Дата:
On Fri, 2026-01-16 at 08:04 +0800, Movead wrote:
> > How can you identify when a transaction started?
>
> We find it by XLOG_RUNNING_XACTS wal record, this wal record  have a *nextXid*
> which mean a unassigned transaction id when produce the wal record.

I see; thanks for the explanation.

Yours,
Laurenz Albe



Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Movead
Дата:
As discussed on the other mail thread:
https://www.postgresql.org/message-id/696885c3db9be6.68269280.fb0eaf68%40m0.mail.sina.com.cn



I need to do two changes on the patch.
1. Find some way to handle static variables.
2. Do not collect useless pages on the third step.

For the first one because the static variables needed on several place and it seem worst if
deliver by function parameters, so I keep them.

For the second one I have tryied to keep a commited transaction ID list or bitmapset, however
the data structs seem hard to use in src/bin code. So I use min_commited_xid and
max_commited_xid instead to filter the useless pages.

In fact the min_commited_xid and max_commited_xid is the edge transaction commited after
diverge record, so it's enough.

Changed patch attached.



----
Best Regards,
Movead Li



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



 



Вложения

Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Movead
Дата:
I modify the patch for ninja build error which found by CI.





----
Best Regards,

Movead Li



Вложения

Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Neil Chen
Дата:
Hi Movead,

It's an interesting idea. 

While it’s impossible to predict exactly how much WAL we’ll need to backtrack through --
I assume it mainly depends on the duration of long-running transactions -- this approach 
seems to offer an opportunity using pg_rewind without enabling wal_log_hints.

On Fri, Jan 16, 2026 at 9:28 PM Movead <lchch1990@sina.cn> wrote:

In fact the min_commited_xid and max_commited_xid is the edge transaction commited after
diverge record, so it's enough.

 
Given the potential large gap between transaction IDs (especially when long-running transactions are involved), 
maintaining  a list/bitmap struct would be worthwhile.

A minor suggestion, for an operation that may fail, I suggest retrieving the first XLOG_RUNNING_XACTS record to obtain its base_xid 
before doing the deep-dig process. If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition isn’t met), 
we can throw an error immediately instead of waiting for all WAL records to be parsed.

Re: Can we change pg_rewind used without wal_log_hints and data_checksums

От
Movead
Дата:
>It's an interesting idea.
>While it’s impossible to predict exactly how much WAL we’ll need to backtrack through --
>I assume it mainly depends on the duration of long-running transactions -- this approach
>seems to offer an opportunity using pg_rewind without enabling wal_log_hints.

Hi Neil, thanks and I think it's meaningful.


>Given the potential large gap between transaction IDs (especially when long-running transactions are involved),
>maintaining  a list/bitmap struct would be worthwhile.

Yes I intend todo that but bitmapset can not use in src/bin and it seem no necessary to implement
one. And attention that min and max is what commited after diverge record, so it nomally small gap.
So it's the reason I give up.


>A minor suggestion, for an operation that may fail, I suggest retrieving the first XLOG_RUNNING_XACTS record to obtain
itsbase_xid  
>before doing the deep-dig process. If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition
isn’tmet),  
>we can throw an error immediately instead of waiting for all WAL records to be parsed.

Mainly we can not get the first wal segment, because if no enough wal it will fetch wal segment
by restore_command. Your suggestion is meanful only if no restore_command. Anyway let's see
hacker's opnions.

----
Best Regards,
Movead Li