Обсуждение: Can we change pg_rewind used without wal_log_hints and data_checksums
hi hackers,
I am thinking about why pg_rewind need wal_log_hints or data_checksums which significantly
limits its usability. I research somewhere can can only find it's for against data
corruption in code comment.
And i come up a case which may need depend on the page consistence after do pg_rewind:
1. We have primary A and standby B.
2. We have a transaction xact1 currently and xact1 modify some pages.
3. Do a checkpoint on A
4. standby B promote
5. xact1 committed on and do a query on all data modified by xact1
6. do pg_rewind on A
If on no page consistence mode, and hack pg_rewind code to force a rewind, then we
may see xact1 on A and can not see xact1 on B. And it's cause unconsistence.
Now I tell myself ,pg_rewind may can not handle this case so we must set wal_log_hints
on to avoid the case. If so we can modify pg_rewind to met this case. If not so, I want
to known the reason or some mail thread discuss that? Thanks.
Here i want to introduce a way to solve the case above:
We need record all transaction ID commited after diverge record and research more wal
before diverge record, we need to find a start lsn(lsn_s) which older than all the
transactions. And we should read from lsn_s to diverge lsn to collect influenced pages
by the transactions. So we can copy them at the rewind phase.
Best Regards,
Movead Li
Movead Li
On Thu, Jan 15, 2026 at 09:47:39AM +0800, lchch1990@sina.cn wrote: > I am thinking about why pg_rewind need wal_log_hints or data_checksums which significantly > limits its usability. I research somewhere can can only find it's for against data > corruption in code comment. Hint bints can be set on a page, and we *have to* WAL log these pages so as pg_rewind can track the modified blocks, or we would corrupt a data folder after a rewind. See around this thread about the original description of the issue: https://www.postgresql.org/message-id/519E5493.5060800@vmware.com > Here i want to introduce a way to solve the case above: > We need record all transaction ID commited after diverge record and research more wal > before diverge record, we need to find a start lsn(lsn_s) which older than all the > transactions. And we should read from lsn_s to diverge lsn to collect influenced pages > by the transactions. So we can copy them at the rewind phase. How is a method based on the tracking of transaction IDs and the modified blocks not going to be more costly than the current method where we are able to track the modified blocks directly in the WAL records? -- Michael
Вложения
On Thu, 2026-01-15 at 09:47 +0800, lchch1990@sina.cn wrote: > I am thinking about why pg_rewind need wal_log_hints or data_checksums which significantly > limits its usability. I research somewhere can can only find it's for against data > corruption in code comment. See https://postgr.es/m/CA%2BTgmoY4j%2Bp7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG%2BSPByg%40mail.gmail.com In more detail: 1. there is a transaction open on the primary server (server A) 2. the transaction inserts a row 3. a checkpoint happens 4. the transaction commits 5. the session reads the row it just inserted, which sets hint bits on the row that mark it as generally visible Now the standby (server B) promoted between steps 3 and 4, which means that on server B (the new primary), the transaction didn't commit and the row is invisible. Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks that were modified after the last common checkpoint (which happened in step 3 above). If neither wal_log_hints = on nor checksums are enabled (which effectively forces WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind fails to copy that block from server B. The consequence is that after pg_rewind, the row is *still* visible on server A because of the hint bits. That is data corruption. Therefore, the requirement cannot be relaxed. Yours, Laurenz Albe
Re: Can we change pg_rewind used without wal_log_hints and data_checksums
От
lchch1990@sina.cn
Дата:
>See https://postgr.es/m/CA%2BTgmoY4j%2Bp7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG%2BSPByg%40mail.gmail.com
>In more detail:
>1. there is a transaction open on the primary server (server A)
>2. the transaction inserts a row
>3. a checkpoint happens
>4. the transaction commits
>5. the session reads the row it just inserted, which sets hint bits on the row
>that mark it as generally visible
>Now the standby (server B) promoted between steps 3 and 4, which means that on server B
>(the new primary), the transaction didn't commit and the row is invisible.
>Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
>that were modified after the last common checkpoint (which happened in step 3 above).
>If neither wal_log_hints = on nor checksums are enabled (which effectively forces
>WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
>fails to copy that block from server B. The consequence is that after pg_rewind, the
>row is *still* visible on server A because of the hint bits. That is data corruption.
>In more detail:
>1. there is a transaction open on the primary server (server A)
>2. the transaction inserts a row
>3. a checkpoint happens
>4. the transaction commits
>5. the session reads the row it just inserted, which sets hint bits on the row
>that mark it as generally visible
>Now the standby (server B) promoted between steps 3 and 4, which means that on server B
>(the new primary), the transaction didn't commit and the row is invisible.
>Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
>that were modified after the last common checkpoint (which happened in step 3 above).
>If neither wal_log_hints = on nor checksums are enabled (which effectively forces
>WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
>fails to copy that block from server B. The consequence is that after pg_rewind, the
>row is *still* visible on server A because of the hint bits. That is data corruption.
>Therefore, the requirement cannot be relaxed.
Yes I known the step and I have check the mail link.
As described in the top mail we can find some way to solve the problem so that
pg_rewind can run without wal_log_hints and data_checksums.
Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more
wal to cover whole releated transactions so any releated pages with copyed, and we never
warried about hint bits issue.
Anyway, I wish my mail in right format. Because my last mail reply to Michael out of order
and miss from this thread.
---
Movead Li
Re: Can we change pg_rewind used without wal_log_hints and data_checksums
От
"lchch1990@sina.cn"
Дата:
>Here i want to introduce a way to solve the case above: >We need record all transaction ID commited after diverge record and research more wal >before diverge record, we need to find a start lsn(lsn_s) which older than all the >transactions. And we should read from lsn_s to diverge lsn to collect influenced pages >by the transactions. So we can copy them at the rewind phase. Hi hackers, I am sorry and I am tring how to send a in order mail. ---- Best Regards, Movead Li
Re: Can we change pg_rewind used without wal_log_hints and data_checksums
От
"lchch1990@sina.cn"
Дата:
On Thu, 2026-01-15 at 13:47 +0800, laurenz.albe@cybertec.at wrote: > Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks > that were modified after the last common checkpoint (which happened in step 3 above). > If neither wal_log_hints = on nor checksums are enabled (which effectively forces > WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind > fails to copy that block from server B. The consequence is that after pg_rewind, the > row is *still* visible on server A because of the hint bits. That is data corruption. > Therefore, the requirement cannot be relaxed. Yes I known the step and I have check the mail link. As described in the top mail we can find some way to solve the problem so that pg_rewind can run without wal_log_hints and data_checksums. Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more wal to cover whole releated transactions so any releated pages with copyed, and we never warried about hint bits issue. Anyway, I wish this mail in order. ---- Best Regards, Movead Li
>> On Thu, 2026-01-15 at 14:14 +0800, laurenz.albe@cybertec.at wrote: >> Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks >> that were modified after the last common checkpoint (which happened in step 3 above). >> If neither wal_log_hints = on nor checksums are enabled (which effectively forces >> WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind >> fails to copy that block from server B. The consequence is that after pg_rewind, the >> row is *still* visible on server A because of the hint bits. That is data corruption. >> Therefore, the requirement cannot be relaxed. >Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more >wal to cover whole releated transactions so any releated pages with copyed, and we never >warried about hint bits issue. Base on the discussion I write a patch and introduce it: Currently pg_rewind search checkpoint start at divergerec and walk backward. Then it collect change pages from checkpoint to divergerec forward. We modify the second step and collect the minimal commited transaction id and named min_commited_xid. And collect the 'first appeared' transaction id by XLOG_RUNNING_XACTS wal record and named base_xid. If base_xid <= min_commited_xid we can work a safy rewind. How ever if we can not met 'base_xid <= min_commited_xid' then we read wal from checkpoint and walk backward until we met the goal, ofcause we collect change pages during the third step. If we can not met the goal at last, we report an error for can not finish. The third step maybe slowly so I add a option(-d or --deep-dig), by default it stop if can not met the goal at the second step. And user should add -d to run the third step. Patch attached. ---- Best Regards, Movead Li
Вложения
On Thu, 2026-01-15 at 14:14 +0800, lchch1990@sina.cn wrote: > > See https://postgr.es/m/CA%2BTgmoY4j%2Bp7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG%2BSPByg%40mail.gmail.com > > Yes I known the step and I have check the mail link. > As described in the top mail we can find some way to solve the problem so that > pg_rewind can run without wal_log_hints and data_checksums. > > Currently pg_rewind search wal start at checkpoint lsn or redo lsn, I mean to search more > wal to cover whole releated transactions so any releated pages with copyed, and we never > warried about hint bits issue. I apologize for my misunderstanding. I had a brief look at the patch, and the gratuitous use of static variables didn't appeal to me. Can you briefly describe the algorithm? You look at all commit records *after* the fork, right? Then how can you identify how far back you have rewind? How can you identify when a transaction started? Yours, Laurenz Albe
On Thu, 2026-01-16 at 03:08 +0800, laurenz.albe@cybertec.at wrote: >I had a brief look at the patch, and the gratuitous use of static variables didn't >appeal to me. Sorry about that, and I can find a nice way if the design is good. >Can you briefly describe the algorithm? I think the algorithm is introduced in the patch mail, let me do a summary During a forward WAL walk, the system collects the minimal commit transaction ID. If it find when this transaction ID assigned, a safety rewind can be performed. >You look at all commit records *after* the fork, right? It's all record after for fork, not only commited. I do not change the code logic. Maybe commit record is enougn. On the other hand, my patch collect all record *before* the fork, maybe it will cause many meaningless copy and I can fix that. >Then how can you identify how far back you have rewind? It's the point that we can known the stop point only by walk wal. So it maybe take long time, and it's the reason I add a '-d, --deep-dig' option. >How can you identify when a transaction started? We find it by XLOG_RUNNING_XACTS wal record, this wal record have a *nextXid* which mean a unassigned transaction id when produce the wal record. ---- Best Regards, Movead Li
On Fri, 2026-01-16 at 08:04 +0800, Movead wrote: > > How can you identify when a transaction started? > > We find it by XLOG_RUNNING_XACTS wal record, this wal record have a *nextXid* > which mean a unassigned transaction id when produce the wal record. I see; thanks for the explanation. Yours, Laurenz Albe
As discussed on the other mail thread: https://www.postgresql.org/message-id/696885c3db9be6.68269280.fb0eaf68%40m0.mail.sina.com.cn I need to do two changes on the patch. 1. Find some way to handle static variables. 2. Do not collect useless pages on the third step. For the first one because the static variables needed on several place and it seem worst if deliver by function parameters, so I keep them. For the second one I have tryied to keep a commited transaction ID list or bitmapset, however the data structs seem hard to use in src/bin code. So I use min_commited_xid and max_commited_xid instead to filter the useless pages. In fact the min_commited_xid and max_commited_xid is the edge transaction commited after diverge record, so it's enough. Changed patch attached. ---- Best Regards, Movead Li
Вложения
I modify the patch for ninja build error which found by CI. ---- Best Regards, Movead Li
Вложения
Hi Movead,
It's an interesting idea.
While it’s impossible to predict exactly how much WAL we’ll need to backtrack through --
I assume it mainly depends on the duration of long-running transactions -- this approach
seems to offer an opportunity using pg_rewind without enabling wal_log_hints.
On Fri, Jan 16, 2026 at 9:28 PM Movead <lchch1990@sina.cn> wrote:
In fact the min_commited_xid and max_commited_xid is the edge transaction commited after
diverge record, so it's enough.
Given the potential large gap between transaction IDs (especially when long-running transactions are involved),
maintaining a list/bitmap struct would be worthwhile.
A minor suggestion, for an operation that may fail, I suggest retrieving the first XLOG_RUNNING_XACTS record to obtain its base_xid
before doing the deep-dig process. If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition isn’t met),
we can throw an error immediately instead of waiting for all WAL records to be parsed.
>It's an interesting idea. >While it’s impossible to predict exactly how much WAL we’ll need to backtrack through -- >I assume it mainly depends on the duration of long-running transactions -- this approach >seems to offer an opportunity using pg_rewind without enabling wal_log_hints. Hi Neil, thanks and I think it's meaningful. >Given the potential large gap between transaction IDs (especially when long-running transactions are involved), >maintaining a list/bitmap struct would be worthwhile. Yes I intend todo that but bitmapset can not use in src/bin and it seem no necessary to implement one. And attention that min and max is what commited after diverge record, so it nomally small gap. So it's the reason I give up. >A minor suggestion, for an operation that may fail, I suggest retrieving the first XLOG_RUNNING_XACTS record to obtain itsbase_xid >before doing the deep-dig process. If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition isn’tmet), >we can throw an error immediately instead of waiting for all WAL records to be parsed. Mainly we can not get the first wal segment, because if no enough wal it will fetch wal segment by restore_command. Your suggestion is meanful only if no restore_command. Anyway let's see hacker's opnions. ---- Best Regards, Movead Li