Обсуждение: Making pg_rewind faster
Hi Hackers,
I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
Thanks,
Vignesh
Hi everyone!
Here's the attached patch submission to optimize pg_rewind performance when many WAL files are retained on server. This patch avoids replaying (copying over) older WAL segment files that fall before the point of divergence between the source and target servers.
Thanks,
Justin
From: Justin Kwan <jkwan@cloudflare.com>
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster
Looping in my other email.
On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev> wrote:
Hi Hackers,I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.Thanks,Vignesh
Вложения
Hi everyone!
I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
Thanks,
Justin
From: Justin Kwan <jkwan@cloudflare.com>
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster
Looping in my other email.
On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev> wrote:
Hi Hackers,I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.Thanks,Vignesh
Вложения
Justin Kwan <justinpkwan@outlook.com> writes:
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets
versionPostgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
Hi Tom,
Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.
Justin
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Justin Kwan <justinpkwan@outlook.com> writes:
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
On Mon, Jul 18, 2022 at 05:14:00PM +0000, Justin Kwan wrote:
> Thank you for taking a look at this and that sounds good. I will
> send over a patch compatible with Postgres v16.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
Please note that you should not rely on the FS-level stats for
anything that touches the WAL segments. A rough guess about what you
could here to make sure that only the set of WAL segments you are
looking for is being copied over would be to either:
- Scan the logs produced by pg_rewind and see if the segments are
copied or not, depending on the divergence point (aka the last
checkpoint before WAL forked).
- Clean up pg_wal/ in the target node before running pg_rewind,
checking that only the segments you want are available once the
operation completes.
--
Michael
Вложения
Hi Michael,
Thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
Thanks,
Justin
From: Michael Paquier
Sent: Tuesday, July 19, 2022 1:36 AM
To: Justin Kwan
Cc: Tom Lane; pgsql-hackers; vignesh; jkwan@cloudflare.com; vignesh ravichandran; hlinnaka@iki.fi
Subject: Re: Making pg_rewind faster
On Mon, Jul 18, 2022 at 05:14:00PM +0000, Justin Kwan wrote:
> Thank you for taking a look at this and that sounds good. I will
> send over a patch compatible with Postgres v16.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
Please note that you should not rely on the FS-level stats for
anything that touches the WAL segments. A rough guess about what you
could here to make sure that only the set of WAL segments you are
looking for is being copied over would be to either:
- Scan the logs produced by pg_rewind and see if the segments are
copied or not, depending on the divergence point (aka the last
checkpoint before WAL forked).
- Clean up pg_wal/ in the target node before running pg_rewind,
checking that only the segments you want are available once the
operation completes.
--
Michael
> Thank you for taking a look at this and that sounds good. I will
> send over a patch compatible with Postgres v16.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
Please note that you should not rely on the FS-level stats for
anything that touches the WAL segments. A rough guess about what you
could here to make sure that only the set of WAL segments you are
looking for is being copied over would be to either:
- Scan the logs produced by pg_rewind and see if the segments are
copied or not, depending on the divergence point (aka the last
checkpoint before WAL forked).
- Clean up pg_wal/ in the target node before running pg_rewind,
checking that only the segments you want are available once the
operation completes.
--
Michael
Вложения
Hi Michael,
Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
Thanks,
From: Justin Kwan <justinpkwan@outlook.com>
Sent: July 18, 2022 1:14 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Sent: July 18, 2022 1:14 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Hi Tom,
Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.
Justin
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Justin Kwan <justinpkwan@outlook.com> writes:
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
Вложения
Hi, Justin! On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote: > Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanningthe logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied overto the target server. > > I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached. Thank you for the revision. I've taken a look at this patch. Overall it looks good to me. I also don't see any design objections in the thread. A couple of points from me: 1) I would prefer to evade hard-coded names for WAL segments in the tap tests. Could we calculate the names in the tap tests based on the diverge point, etc.? 2) Patch contains some indentation with spaces, which should be done in tabs. Please consider either manually fixing this or running pgindent over modified files. ------ Regards, Alexander Korotkov
Hi,
On 2022-09-13 20:50:20 +0300, Alexander Korotkov wrote:
> On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
> > Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions
byscanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being
copiedover to the target server.
> >
> > I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see
attached.
>
> Thank you for the revision.
>
> I've taken a look at this patch. Overall it looks good to me. I also
> don't see any design objections in the thread.
>
> A couple of points from me:
> 1) I would prefer to evade hard-coded names for WAL segments in the
> tap tests. Could we calculate the names in the tap tests based on the
> diverge point, etc.?
> 2) Patch contains some indentation with spaces, which should be done
> in tabs. Please consider either manually fixing this or running
> pgindent over modified files.
This patch currently fails because it hasn't been adjusted for
commit c47885bd8b6
Author: Andres Freund <andres@anarazel.de>
Date: 2022-09-19 18:03:17 -0700
Split TESTDIR into TESTLOGDIR and TESTDATADIR
The adjustment is trivial. Attached, together with also producing an error
message rather than just dying wordlessly.
It doesn't seem quite right to read pg_rewind's logs by reading
regress_log_001_basic. Too easy to confuse different runs of pg_rewind
etc. I'd suggest trying to redirect the log to a different file.
With regard to Alexander's point about whitespace:
.git/rebase-apply/patch:25: indent with spaces.
/* Handle WAL segment file. */
.git/rebase-apply/patch:26: indent with spaces.
const char *fname;
.git/rebase-apply/patch:27: indent with spaces.
char *slash;
.git/rebase-apply/patch:29: indent with spaces.
/* Split filepath into directory & filename. */
.git/rebase-apply/patch:30: indent with spaces.
slash = strrchr(path, '/');
warning: squelched 29 whitespace errors
warning: 34 lines add whitespace errors.
Greetings,
Andres Freund
Вложения
On Sun, Oct 02, 2022 at 10:44:25AM -0700, Andres Freund wrote:
> It doesn't seem quite right to read pg_rewind's logs by reading
> regress_log_001_basic. Too easy to confuse different runs of pg_rewind
> etc. I'd suggest trying to redirect the log to a different file.
Hardcoding log file names in the test increases the overall
maintenance, even if renaming these would be easy to track and fix if
the naming convention is changed. Anyway, I think that what this
patch should do is to use command_checks_all() in RewindTest.pm as it
is the test API able to check after a status and multiple expected
outputs, which is what the changes in 001 and 008 are doing.
RewindTest::run_pg_rewind() needs to be a bit tweaked to accept these
regex patterns in input.
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
There may be something I am missing here, but there is no need to care
about segments with a TLI older than lastcommontliIndex, no?
decide_wal_file_action() assumes that the WAL segment exists on the
target and the source. This looks bug-prone to me without at least an
assertion.
file_entry_t has an entry to track if a file is a relation file. I
think that it would be much cleaner to track if we are handling a WAL
segment when inserting an entry in insert_filehash_entry(), so
isrelfile could be replaced by an enum with three values: relation
file, WAL segment and the rest.
--
Michael
Вложения
On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote: > file_entry_t has an entry to track if a file is a relation file. I > think that it would be much cleaner to track if we are handling a WAL > segment when inserting an entry in insert_filehash_entry(), so > isrelfile could be replaced by an enum with three values: relation > file, WAL segment and the rest. This review has been done a few weeks ago, and there has been no update since, so I am marking this entry as returned with feedback in the CF app. -- Michael
Вложения
On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote: > Hi, > > I've attached an updated version of the patch against master with the changes > suggested. > > On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote: >> >> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote: >>> >>> There may be something I am missing here, but there is no need to care >>> about segments with a TLI older than lastcommontliIndex, no? > > Hard to say. pg_rewind is intended to make the same "copy" of the cluster which > implies pg_wal/ should look the same. There might be use cases around logical > replication where you would want these WAL files to still exist even > across promotions? > >>> decide_wal_file_action() assumes that the WAL segment exists on the >>> target and the source. This looks bug-prone to me without at least an >>> assertion. > > From previous refactors there is now an Assertion in filemap.c > decide_file_action that handles this. > >> Assert(entry->target_exists && entry->source_exists); > > decide_wal_file_action is called after the assertion. > Hi, John Thanks for updating the patch. 1. +/* Determine the type of file content (relation, WAL, or other) */ +static file_content_type_t +getFileType(const char *path) Considering the existence of file_type_t, would getFileContentType() be a suitable function for handling file content types? 2. Perhaps decide_wal_file_action() could be defined in filemap.c. While this is unrelated to WAL logging, it could also contribute to faster pg_rewind operations. Should we consider ignoring log files under PGDATA (e.g., those in the default log/ directory)? -- Regards, Japin Li
HI
> 2.
> Perhaps decide_wal_file_action() could be defined in filemap.c.
> While this is unrelated to WAL logging, it could also contribute to faster
> pg_rewind operations. Should we consider ignoring log files under PGDATA
> (e.g., those in the default log/ directory)?
> Perhaps decide_wal_file_action() could be defined in filemap.c.
> While this is unrelated to WAL logging, it could also contribute to faster
> pg_rewind operations. Should we consider ignoring log files under PGDATA
> (e.g., those in the default log/ directory)?
Agree ,Usually the log file directory also takes up a lot of space, and the number of log files is quite large
On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
> Hi,
>
> I've attached an updated version of the patch against master with the changes
> suggested.
>
> On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
>>>
>>> There may be something I am missing here, but there is no need to care
>>> about segments with a TLI older than lastcommontliIndex, no?
>
> Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
> implies pg_wal/ should look the same. There might be use cases around logical
> replication where you would want these WAL files to still exist even
> across promotions?
>
>>> decide_wal_file_action() assumes that the WAL segment exists on the
>>> target and the source. This looks bug-prone to me without at least an
>>> assertion.
>
> From previous refactors there is now an Assertion in filemap.c
> decide_file_action that handles this.
>
>> Assert(entry->target_exists && entry->source_exists);
>
> decide_wal_file_action is called after the assertion.
>
Hi, John
Thanks for updating the patch.
1.
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileType(const char *path)
Considering the existence of file_type_t, would getFileContentType() be a
suitable function for handling file content types?
2.
Perhaps decide_wal_file_action() could be defined in filemap.c.
While this is unrelated to WAL logging, it could also contribute to faster
pg_rewind operations. Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?
--
Regards,
Japin Li
On Wed, 02 Jul 2025 at 11:21, John H <johnhyvr@gmail.com> wrote: > Hi, > > Thanks for the quick review. > > On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote: >> > Perhaps decide_wal_file_action() could be defined in filemap.c. >> > > That's a good point. I updated the patch to reflect that. > Thanks for updating the patch. >> > While this is unrelated to WAL logging, it could also contribute to faster >> > pg_rewind operations. Should we consider ignoring log files under PGDATA >> > (e.g., those in the default log/ directory)? >> Agree ,Usually the log file directory also takes up a lot of space, and the number of log files is quite large >> > > Should we handle this use case? I do agree that for the more common > use-cases of pg_rewind which is synchronizing an old writer to the new > leader after failover, avoiding syncing the logging directory is > useful. > At the same time, pg_rewind is intended to make the same copy of the > source cluster as efficiently as possible which would include "all" > directories if they exist in $PGDATA. By default logging_collector is > off as well. I'd also think you would want to avoid putting the logs > in $PGDATA to have smaller backups if you are using tools like > pg_basebackup. > Splitting the logs from $PGDATA is definitely better. The question is whether it's worth implementing this directly in core or if a prominent note in the documentation would suffice. >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote: >>> >>> Hi, John >>> >>> Thanks for updating the patch. >>> >>> 1. >>> +/* Determine the type of file content (relation, WAL, or other) */ >>> +static file_content_type_t >>> +getFileType(const char *path) >>> >>> Considering the existence of file_type_t, would getFileContentType() be a >>> suitable function for handling file content types? > > Do you mean naming getFileType to getFileContentType? > Exactly! It's confusing that getFileType() returns file_content_type_t instead of file_type_t. For v5 patch: 1. We could simply use the global WalSegSz variable within decide_file_action(), eliminating the need to pass wal_segsz_bytes as an argument. 2. For last_common_segno, we could implement it similarly to WalSegSz, avoiding a signature change for decide_file_actions() and decide_file_action(). I'm not insisting on this approach, however. -- Regards, Japin Li
Hi, On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote: > > > > > Splitting the logs from $PGDATA is definitely better. The question is whether > it's worth implementing this directly in core or if a prominent note in the > documentation would suffice. > I can work on the documentation update as a separate patch if folks think this is worthwhile. > >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote: > > Exactly! It's confusing that getFileType() returns file_content_type_t > instead of file_type_t. > Ah yes that is confusing, updated in patch. > For v5 patch: > > 1. > We could simply use the global WalSegSz variable within decide_file_action(), > eliminating the need to pass wal_segsz_bytes as an argument. > Good point. > 2. > For last_common_segno, we could implement it similarly to WalSegSz, avoiding a > signature change for decide_file_actions() and decide_file_action(). I'm not > insisting on this approach, however. > I made it a global as well, and had to include access/xlog_internal.h in pg_rewind.h but I don't feel strongly about it either way. Thanks, -- John Hsu - Amazon Web Services
Вложения
On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote: > Hi, > > On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote: >> >> > >> >> Splitting the logs from $PGDATA is definitely better. The question is whether >> it's worth implementing this directly in core or if a prominent note in the >> documentation would suffice. >> > > I can work on the documentation update as a separate patch if folks > think this is worthwhile. > >> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote: >> >> Exactly! It's confusing that getFileType() returns file_content_type_t >> instead of file_type_t. >> > > Ah yes that is confusing, updated in patch. > >> For v5 patch: >> >> 1. >> We could simply use the global WalSegSz variable within decide_file_action(), >> eliminating the need to pass wal_segsz_bytes as an argument. >> > > Good point. > >> 2. >> For last_common_segno, we could implement it similarly to WalSegSz, avoiding a >> signature change for decide_file_actions() and decide_file_action(). I'm not >> insisting on this approach, however. >> > > I made it a global as well, and had to include access/xlog_internal.h > in pg_rewind.h but I don't feel strongly about it either way. > Thanks, LGTM. -- Regards, Japin Li
Hi
> Thanks, LGTM.
I think it's possible to register to https://commitfest.postgresql.org/54, https://commitfest.postgresql.org/53 will closed soon
Thanks
On Fri, Jul 4, 2025 at 10:50 AM Japin Li <japinli@hotmail.com> wrote:
On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
> Hi,
>
> On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
>>
>> >
>>
>> Splitting the logs from $PGDATA is definitely better. The question is whether
>> it's worth implementing this directly in core or if a prominent note in the
>> documentation would suffice.
>>
>
> I can work on the documentation update as a separate patch if folks
> think this is worthwhile.
>
>> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
>>
>> Exactly! It's confusing that getFileType() returns file_content_type_t
>> instead of file_type_t.
>>
>
> Ah yes that is confusing, updated in patch.
>
>> For v5 patch:
>>
>> 1.
>> We could simply use the global WalSegSz variable within decide_file_action(),
>> eliminating the need to pass wal_segsz_bytes as an argument.
>>
>
> Good point.
>
>> 2.
>> For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
>> signature change for decide_file_actions() and decide_file_action(). I'm not
>> insisting on this approach, however.
>>
>
> I made it a global as well, and had to include access/xlog_internal.h
> in pg_rewind.h but I don't feel strongly about it either way.
>
Thanks, LGTM.
--
Regards,
Japin Li
Hi, Thanks for the review. On Sun, Jul 6, 2025 at 7:12 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote: > > Hi > > Thanks, LGTM. > I think it's possible to register to https://commitfest.postgresql.org/54, https://commitfest.postgresql.org/53 will closedsoon > I've created the commitfest entry here: https://commitfest.postgresql.org/patch/5902/ I realized I wasn't testing properly in 0006. I've updated the latest one to move it into a separate test file to make the results cleaner. Thanks, -- John Hsu - Amazon Web Services
Вложения
On Mon, Jul 7, 2025 at 2:46 PM John H <johnhyvr@gmail.com> wrote: > I've created the commitfest entry here: > https://commitfest.postgresql.org/patch/5902/ > > I realized I wasn't testing properly in 0006. I've updated the latest > one to move it into a separate test file to make the results cleaner. Thanks to everyone who has worked on this patch. I think this is a real problem about which we ought to try to do something. I think you could even defend calling this a back-patchable bug fix, since copying a lot of files that we know have to be identical is a serious efficiency issue that can affect the usability of pg_upgrade. There are more aggressive things that we could try to do here that might be even better, e.g. we could copy just the WAL segment or segments that contain the last checkpoint record and let the server fetch the rest as needed after it starts up. However, that would be a definitional change, which we certainly wouldn't want to back-patch, and which might have downsides for some people, so perhaps it would need to be optional behavior. But the patch as proposed is just an optimization that shouldn't change the outcome for anyone. Perhaps it's still best to treat it as a master-only improvement, but calling it a fix for a performance bug isn't completely crazy either, IMHO. Moving on to the patch contents, the approach taken makes sense to me. Any WAL files that exist in both source and target from prior to the point of divergence should not need to be copied, because they should necessarily be identical. If anyone sees a flaw in this reasoning, please point it out! While I agree with the basic approach, I think that the patch could be hardened in two ways. First, isRelDataFile tests the complete pathname to decide whether it has a relation file, but getFileContentType only tests the final component of the pathname. I think it should test the whole pathname also, in case there's a stray file with a file name that looks like a WAL file in some other directory. In fact, I think that these two functions should be integrated. Let's rename the isRelDataFile() to getFileContentType() and put all the logic in that function, including appropriate tests of the path name. Second, in decide_file_action(), I think we should check that entry->target_size == entry->source_size and return FILE_ACTION_COPY if not. This case really shouldn't happen, but it's very cheap to check and might offer a bit of protection in some weird scenario. I am not a huge fan of the way that the patch stuffs last_common_segno into a global variable that is then accessed by decide_wal_file_action. I think it would be better to find a way to pass this information down through the call stack, either as a separate argument or as part of some struct we're already passing, if there's a reasonable and not-too-invasive way to do that. If it looks too ugly, maybe this is the best option, but I think we should try to find something better. It does not appear to me that the test case tests what it purports to test, and I don't think that what it purports to test is the right thing anyway. It claims that it is testing that a certain WAL file (which it erroneously calls a "WAL file entry") is not copied to the target, but I see no logic to actually check this. I also don't think it's the correct behavior. I think what the patch is doing and should do is ensure that we don't copy the file when it's already present. If the file isn't present in the target cluster, it should still be copied. Granted, I haven't verified that this is the actual behavior, but the "Handle cases where the file is missing from one of the systems" code in decide_file_actiomn() seems to be higher up than the code the patch changes, so I assume that logic should be applied first. If that's correct, I don't quite know what a test case here can usefully verify, since the only difference would be performance, but maybe I'm misunderstanding something? As an administrative note, the naming convention for the patches posted to the thread is not consistent and not correct. Use "git format-patch -v$VERSION" to generate patches. In a name like v7-0003-do-something.patch, "v7" is the version of the patch set as a whole, and "0003" identifies a specific patch within that version of the patch set. In other words, that would be the third patch in the series from the seventh time that the patch set was posted. This patch set so far contains only one patch, so the file name should be vX-0001-whatever.patch, where whatever is the subject of the patch and X is a number that increases by one every time it's reposted. Let git format-patch do the work for you, though. On a related note, something a committer would need to do if they wanted to commit this patch is figure out who should be credited with writing it. I see that a number of different people have posted versions of the patch; it would be nice if "Author:" or "Co-authored-by:" tags were added to the commit message to reflect whose code is present in a given version. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert, Thanks for taking a look. On Thu, Sep 25, 2025 at 11:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > ... > integrated. Let's rename the isRelDataFile() to getFileContentType() > and put all the logic in that function, including appropriate tests of > the path name. Second, in decide_file_action(), I think we should > check that entry->target_size == entry->source_size and return > FILE_ACTION_COPY if not. This case really shouldn't happen, but it's > very cheap to check and might offer a bit of protection in some weird > scenario. Updated the patch to reflect that. > I am not a huge fan of the way that the patch stuffs last_common_segno > into a global variable that is then accessed by > ... I forgot why I bothered with a global instead since it was straightforward to actually pass in the argument. Updated. > It does not appear to me that the test case tests what it purports to > test, and I don't think that what it purports to test is the right > thing anyway. It claims that it is testing that a certain WAL file > (which it erroneously calls a "WAL file entry") is not copied to the > target, but I see no logic to actually check this. > ... That's a fair point. The test does: 1. Write some data -> WAL 1 2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2 3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint Then it checks that "WAL 1 not copied over" was logged which isn't the best since if it was logged elsewhere then it would still pass. > ... > first. If that's correct, I don't quite know what a test case here can > usefully verify, since the only difference would be performance, but > maybe I'm misunderstanding something? I updated the test to run stat and get the modification time of the common WAL segment before and after pg_rewind and verify it is the same. In filemap.c in decide_wal_file_action, if you comment out > return FILE_ACTION_NONE; the test fails. > > As an administrative note, the naming convention for the patches > posted to the thread is not consistent and not correct. Use "git > format-patch -v$VERSION" to generate patches. In a name like > ... Fixed patch formatting. > ... > writing it. I see that a number of different people have posted > versions of the patch; it would be nice if "Author:" or > "Co-authored-by:" tags were added to the commit message to reflect I added Justin as the Co-Author in the commit message but I realized in their original patch they had "Author: Justin Kwan, Vignesh Ravichandran". Andres also had the latest patch series before mine so I'll leave it up to the committer how they want to handle this. Thanks, -- John Hsu - Amazon Web Services
Вложения
Hi,
just a second late :( i was about to post a patch addressing the refactors which Robert mentioned ,anyway will have a look at your latest patch John thanks :), curious about the tap test.
while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,because once we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can only happen with wal segments older than or equal to last_common_segno because once the promotion competes the filename of the WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 then based on the rules in XLogInitNewTimeline
1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
just a second late :( i was about to post a patch addressing the refactors which Robert mentioned ,anyway will have a look at your latest patch John thanks :), curious about the tap test.
while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,because once we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can only happen with wal segments older than or equal to last_common_segno because once the promotion competes the filename of the WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 then based on the rules in XLogInitNewTimeline
1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
in decide_wal_file_action
if (entry->target_size != entry->source_size)
{
pg_log_debug("WAL segment file \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
return FILE_ACTION_NONE;
thoughts?
Hey Srinath On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: > ... > 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with samesegno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003 > 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file forthe new timeline will be 000000020000000000000004 > > so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to targetin total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergencepoint by calculating and checking against last_common_segno which we are doing in our current approach , i thinkwe can just do > > ... That's a great point. I need to think about it some more but the reasoning makes sense to me. I think 'last_common_segno ' is only useful as another sanity check but we already have the size ones. Thanks, -- John Hsu - Amazon Web Services
Hi John , I have reviewed your latest patch(v8-0001).
On Thu, Oct 9, 2025 at 11:27 PM John H <johnhyvr@gmail.com> wrote:
> ...
> integrated. Let's rename the isRelDataFile() to getFileContentType()
> and put all the logic in that function, including appropriate tests of
> the path name. Second, in decide_file_action(), I think we should
> check that entry->target_size == entry->source_size and return
> FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
> very cheap to check and might offer a bit of protection in some weird
> scenario.
Updated the patch to reflect that.
in getFileContentType as you moved the logic of validating
whether they are xlog files or not using IsXLogFileName ,
i think it's better to validate it the kind of similar way as its
done for relfiles by parsing the path using sscanf this will
help to get the tli,logno,segno of the xlog file and using
XLogFromFileName which gives us logSegNo ,so that we
can rebuild the xlog file path again using these values with
XLogFilePath , then validate this new path with the given path
,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
maybe like this
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7;
+ XLogFromFileName(filename,&tli,logSegNo,WalSegSz);
+ XLogFilePath(check_path,tli,*logSegNo,WalSegSz);
+ if (strcmp(check_path, path) == 0)
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7;
+ XLogFromFileName(filename,&tli,logSegNo,WalSegSz);
+ XLogFilePath(check_path,tli,*logSegNo,WalSegSz);
+ if (strcmp(check_path, path) == 0)
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
> I am not a huge fan of the way that the patch stuffs last_common_segno
> into a global variable that is then accessed by
> ...
I forgot why I bothered with a global instead since it was straightforward
to actually pass in the argument. Updated.
instead of passing last_common_segno down the call stack directly,
we can have a struct maybe called "rewindState" which will have the common
information related to both clusters involved in rewind ,so that in future
if we want to pass more such information down we won't be increasing
the number of arguments that need to be passed down to all functions in stack.
the number of arguments that need to be passed down to all functions in stack.
> It does not appear to me that the test case tests what it purports to
> test, and I don't think that what it purports to test is the right
> thing anyway. It claims that it is testing that a certain WAL file
> (which it erroneously calls a "WAL file entry") is not copied to the
> target, but I see no logic to actually check this.
> ...
That's a fair point. The test does:
1. Write some data -> WAL 1
2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint
Then it checks that "WAL 1 not copied over" was logged which isn't
the best since if it was logged elsewhere then it would still pass.
we can improve the TAP test file header comment as
# Test the situation where we skip copying the same WAL files from source to target
# except if WAL file size differs.
let me put it this way
1) WALs are written to 000000010000000000000002 in primary.
2) then SELECT pg_switch_wal() now the current WAL is 000000010000000000000003
3) CHECKPOINT; this is the last common checkpoint and is written in
000000010000000000000003.
4) until this point we have the same WAL files in primary and standby.
5) promote standby ,this leads to timeline switch and lets say the divergence point
is at segment boundary , so the first WAL segment in the new timeline is 000000020000000000000004.
6) when we stop standby then primary , in primary we get checkpoint shutdown which
is greater than the divergence point ,means there is divergence so pg_rewind is needed.
7) so the last common WAL segment was 000000010000000000000003.
I have mentioned below about how this updated TAP test is testing whether
common WAL file is being copied or not and about one more test we need to cover
in this TAP test.
> ...
> first. If that's correct, I don't quite know what a test case here can
> usefully verify, since the only difference would be performance, but
> maybe I'm misunderstanding something?
I updated the test to run stat and get the modification time of the common
WAL segment before and after pg_rewind and verify it is the same.
In filemap.c in decide_wal_file_action, if you comment out
> return FILE_ACTION_NONE;
the test fails.
This test makes sense to me, by checking whether
last common segment(000000010000000000000003 in our case)
been copied or not using the stat->mtime after the pg_rewind, if its not
copied then stat->mtime will be the same ,else it will differ with the
one we noted before rewind, but we also need to test the case
where entry->target_size != entry->source_size and we do copy,
like if stat->mtime is changed (which means file has been copied)
and target_stat->size != soucrce_stat->size here no error expected.
where entry->target_size != entry->source_size and we do copy,
like if stat->mtime is changed (which means file has been copied)
and target_stat->size != soucrce_stat->size here no error expected.
Hi, On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: > > ... > XLogFilePath , then validate this new path with the given path > ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10. > ... Are you concerned that somehow these files, which are named like XLog files but actually aren't, are somehow created therefore we should sync them in case? I'm trying to understand how these files would be generated in the first place. > > instead of passing last_common_segno down the call stack directly, > we can have a struct maybe called "rewindState" which will have the common > information related to both clusters involved in rewind ,so that in future > if we want to pass more such information down we won't be increasing > the number of arguments that need to be passed down to all functions in stack. > I don't feel too strongly about this. Not sure how we view future-proofing things, as in do we proactively wrap around structs or wait until there's an actual need. This is the first time it's been touched since it was introduced in 14 for what it's worth. > > we can improve the TAP test file header comment as > # Test the situation where we skip copying the same WAL files from source to target > # except if WAL file size differs. > > let me put it this way > 1) WALs are written to 000000010000000000000002 in primary. > ... > 7) so the last common WAL segment was 000000010000000000000003. Updated comments. > > ... > where entry->target_size != entry->source_size and we do copy, > like if stat->mtime is changed (which means file has been copied) > and target_stat->size != soucrce_stat->size here no error expected. > Updated patch with one additional segment has a byte appended to it on target but not source. Thanks, John H
Вложения
Hi,
On Wed, Oct 15, 2025 at 4:49 AM John H <johnhyvr@gmail.com> wrote:
Hi,
On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
>
> ...
> XLogFilePath , then validate this new path with the given path
> ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
> ...
Are you concerned that somehow these files, which are named like XLog
files but actually
aren't, are somehow created therefore we should sync them in case?
I'm trying to understand how these files would be generated in the first place.
the problem is not when server generates them because
filename is properly calculated using the XLogRecPtr in
XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
,the main problem is when if someone manually places an invalid WAL file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier we can
catch such cases.
>
> instead of passing last_common_segno down the call stack directly,
> we can have a struct maybe called "rewindState" which will have the common
> information related to both clusters involved in rewind ,so that in future
> if we want to pass more such information down we won't be increasing
> the number of arguments that need to be passed down to all functions in stack.
>
I don't feel too strongly about this. Not sure how we view
future-proofing things,
as in do we proactively wrap around structs or wait until there's an
actual need.
This is the first time it's been touched since it was introduced in 14 for what
it's worth.
TBH first it seemed to me a good coding practice for future proofing,
then i checked if there's any place in postgres we are doing
something similar ,then found 1 in src/include/access/gist.h
typedef struct GISTDeletedPageContents
{
/* last xid which could see the page in a scan */
FullTransactionId deleteXid;
} GISTDeletedPageContents;
>
> we can improve the TAP test file header comment as
> # Test the situation where we skip copying the same WAL files from source to target
> # except if WAL file size differs.
>
> let me put it this way
> 1) WALs are written to 000000010000000000000002 in primary.
> ...
> 7) so the last common WAL segment was 000000010000000000000003.
Updated comments.
>
> ...
> where entry->target_size != entry->source_size and we do copy,
> like if stat->mtime is changed (which means file has been copied)
> and target_stat->size != soucrce_stat->size here no error expected.
>
Updated patch with one additional segment has a byte appended to it
on target but not source.
thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
to improve the validation that we copy the WAL file even
when it exists on both source and target but the size differs.
I also tried to rename some variables for readability,while
testing this out i found that usleep(1000); is not enough
to show that there has been a copy which changed the
stat->mtime of the file because i think the copy was very
fast (less than 1 millisecond) so the mtime of the
file after rewind's copy didn't change or filesystem precision
didn't caught the change, the copying of the file is confirmed
because the same file has different size before rewind and
same after rewind which these tests proved
ok($corrupt_wal_seg_stat_before_rewind->size != $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has different size in source vs target before rewind");
ok 11 - WAL segment 000000010000000000000004 has different size in source vs target before rewind
ok($corrupt_wal_seg_stat_after_rewind->size == $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was copied: file sizes are same between target and source after rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are same between target and source after rewind
and more over the pg_rewind confirmed this with a new debug
message which I added in this diff patch
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 69d7728ce44..743484f4833 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
*/
if (file_segno < last_common_segno && source_size == target_size)
{
pg_log_debug("WAL segment \"%s\" not copied to target", fname);
return FILE_ACTION_NONE;
}
+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\" is copied to target)/
also did this instead of using catfile
-my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_skipped);
+my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
seems consistent with other TAP tests.
Вложения
On Wed, Oct 15, 2025 at 7:56 PM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote:
thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
to improve the validation that we copy the WAL file even
when it exists on both source and target but the size differs.
I also tried to rename some variables for readability,while
testing this out i found that usleep(1000); is not enough
to show that there has been a copy which changed the
stat->mtime of the file because i think the copy was very
fast (less than 1 millisecond) so the mtime of the
file after rewind's copy didn't change or filesystem precision
didn't caught the change, the copying of the file is confirmed
because the same file has different size before rewind and
same after rewind which these tests proved
ok($corrupt_wal_seg_stat_before_rewind->size != $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has different size in source vs target before rewind");ok 11 - WAL segment 000000010000000000000004 has different size in source vs target before rewind
ok($corrupt_wal_seg_stat_after_rewind->size == $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was copied: file sizes are same between target and source after rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are same between target and source after rewind
and more over the pg_rewind confirmed this with a new debug
message which I added in this diff patch
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 69d7728ce44..743484f4833 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
*/
if (file_segno < last_common_segno && source_size == target_size)
{
pg_log_debug("WAL segment \"%s\" not copied to target", fname);
return FILE_ACTION_NONE;
}
+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\" is copied to target)/
btw i forgot to let you know that i have increased the
usleep(1000) to usleep(1000000) for the above reason
which helped to show the difference in mtime after rewind
adding the below extra validation that the file has been copied from
source to target ,this is already done in the v9-0001.diff patch.
ok($corrupt_wal_seg_stat_after_rewind->mtime > $corrupt_wal_seg_stat_before_rewind->mtime,"WAL segment $corrupt_wal_seg was copied: mtime after rewind > mtime before rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: mtime after rewind > mtime before rewind
it would be awesome to know ,if you have any update on this.
-- usleep(1000) to usleep(1000000) for the above reason
which helped to show the difference in mtime after rewind
adding the below extra validation that the file has been copied from
source to target ,this is already done in the v9-0001.diff patch.
ok($corrupt_wal_seg_stat_after_rewind->mtime > $corrupt_wal_seg_stat_before_rewind->mtime,"WAL segment $corrupt_wal_seg was copied: mtime after rewind > mtime before rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: mtime after rewind > mtime before rewind
On Fri, Oct 10, 2025 at 3:18 AM John H <johnhyvr@gmail.com> wrote:
Hey Srinath
On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> ...
> 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
> 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
>
> so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
>
> ...
That's a great point. I need to think about it some more but the
reasoning makes sense to me.
I think 'last_common_segno ' is only useful as another sanity check
but we already have the size
ones.
it would be awesome to know ,if you have any update on this.
On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: > just a second late :( i was about to post a patch addressing the refactors which Robert mentioned ,anyway will have alook at your latest patch John thanks :), curious about the tap test. > > while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,becauseonce we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can onlyhappen with wal segments older than or equal to last_common_segno because once the promotion competes the filename ofthe WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 thenbased on the rules in XLogInitNewTimeline > 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with samesegno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003 > 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file forthe new timeline will be 000000020000000000000004 > > so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to targetin total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergencepoint by calculating and checking against last_common_segno which we are doing in our current approach , i thinkwe can just do What makes me nervous about this is that it isn't necessarily the case that the servers were perfectly in sync at the time of the failure. Suppose that the primary was in the middle of writing 000000010000000000000003. The standby might also have this file, but it might contain less valid data than the one on the primary; therefore, if we don't copy the file, the two servers might not have an identical file. Maybe that wouldn't really matter, in the sense that the extra valid data that exists on the original primary shouldn't prevent it from replaying WAL on the new primary's timeline, which is probably all we really care about. But it feels dangerous to me. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: >> On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla >> <srinath2133@gmail.com> wrote: >> > XLogFilePath , then validate this new path with the given path >> > ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10. >> > ... >> >> Are you concerned that somehow these files, which are named like XLog >> files but actually >> aren't, are somehow created therefore we should sync them in case? >> I'm trying to understand how these files would be generated in the first place. > > the problem is not when server generates them because > filename is properly calculated using the XLogRecPtr in > XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath > ,the main problem is when if someone manually places an invalid WAL file > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will > consider it as valid ,so with the approach as i mentioned earlier we can > catch such cases. I think that parsing the file name may be a good idea so that we can do appropriate sanity checks on the values (e.g. checking that we're only skipping copying prior to last_common_segno), but I do not think we should worry too much about the user manually injecting invalid WAL files. I mean, I would prefer that if that does happen, it either works anyway or fails with a sensible error message, rather than emitting an incomprehensible error message or dumping core. But, it is in general true that if manual modifications are made to the data directory, things may go terribly wrong, and this code is not obliged to provide any more protection against such scenarios than we do in other cases. Ultimately, such modifications are user error. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,
On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
>
> TBH first it seemed to me a good coding practice for future proofing,
> then i checked if there's any place in postgres we are doing
> something similar ,then found 1 in src/include/access/gist.h
>
> typedef struct GISTDeletedPageContents
> {
> /* last xid which could see the page in a scan */
> FullTransactionId deleteXid;
> } GISTDeletedPageContents;
>
It makes more sense to me if we expect the struct or same set of arguments
to be reused in different places / want a stable API reference. I don't think
we want everything to be wrapped around a struct for functions just in case.
> ..
> thanks for updating, i have attached a diff patch on
> top of v9-0001 patch , where i tried to add more tests
> ...
Thank you for the diff! Your changes are a lot cleaner and I've included it in
the latest patch. One difference I've made is instead of additionally logging in
decide_wal_file_action
> + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
I realized we are already logging the WAL segments that are copied over in
print_filemap so I've updated the test to check for that instead
> qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
I also updated the error message for the last three checks.
Thanks,
--
John Hsu - Amazon Web Services
Вложения
On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla > > ,the main problem is when if someone manually places an invalid WAL file > > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will > > consider it as valid ,so with the approach as i mentioned earlier we can > > catch such cases. > > I think that parsing the file name may be a good idea so that we can > do appropriate sanity checks on the values (e.g. checking that we're > only skipping copying prior to last_common_segno), but I do not think > we should worry too much about the user manually injecting invalid WAL > files. I mean, I would prefer that if that does happen, it either > works anyway or fails with a sensible error message, rather than > emitting an incomprehensible error message or dumping core. But, it is > in general true that if manual modifications are made to the data > directory, things may go terribly wrong, and this code is not obliged > to provide any more protection against such scenarios than we do in > other cases. Ultimately, such modifications are user error. > It feels like there's a lot of things we could attempt to ensure "correctness" if we are concerned about scenarios when the user manually puts or modifies content unexpectedly in the pg_wal directory. For instance, one could make the argument that when considering to skip copying the common WAL segments, even though they are of the same size, it's possible the user has manipulated them directly. I don't think we need to run checksums on every WAL segment that is a valid candidate to ensure they match. -- John Hsu - Amazon Web Services
Thanks John, great to see this patch getting attention again.
This builds on my original version from 2022, glad to see it moving forward and happy with the direction it’s taking.
This builds on my original version from 2022, glad to see it moving forward and happy with the direction it’s taking.
—Justin
From: John H <johnhyvr@gmail.com>
Sent: October 16, 2025 6:55 PM
To: Robert Haas <robertmhaas@gmail.com>
Cc: Srinath Reddy Sadipiralla <srinath2133@gmail.com>; wenhui qiu <qiuwenhuifx@gmail.com>; Japin Li <japinli@hotmail.com>; Michael Paquier <michael@paquier.xyz>; Andres Freund <andres@anarazel.de>; Alexander Korotkov <aekorotkov@gmail.com>; Justin Kwan <justinpkwan@outlook.com>; Tom Lane <tgl@sss.pgh.pa.us>; pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>; jkwan@cloudflare.com <jkwan@cloudflare.com>
Subject: Re: Making pg_rewind faster
Sent: October 16, 2025 6:55 PM
To: Robert Haas <robertmhaas@gmail.com>
Cc: Srinath Reddy Sadipiralla <srinath2133@gmail.com>; wenhui qiu <qiuwenhuifx@gmail.com>; Japin Li <japinli@hotmail.com>; Michael Paquier <michael@paquier.xyz>; Andres Freund <andres@anarazel.de>; Alexander Korotkov <aekorotkov@gmail.com>; Justin Kwan <justinpkwan@outlook.com>; Tom Lane <tgl@sss.pgh.pa.us>; pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>; jkwan@cloudflare.com <jkwan@cloudflare.com>
Subject: Re: Making pg_rewind faster
On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
> > ,the main problem is when if someone manually places an invalid WAL file
> > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> > consider it as valid ,so with the approach as i mentioned earlier we can
> > catch such cases.
>
> I think that parsing the file name may be a good idea so that we can
> do appropriate sanity checks on the values (e.g. checking that we're
> only skipping copying prior to last_common_segno), but I do not think
> we should worry too much about the user manually injecting invalid WAL
> files. I mean, I would prefer that if that does happen, it either
> works anyway or fails with a sensible error message, rather than
> emitting an incomprehensible error message or dumping core. But, it is
> in general true that if manual modifications are made to the data
> directory, things may go terribly wrong, and this code is not obliged
> to provide any more protection against such scenarios than we do in
> other cases. Ultimately, such modifications are user error.
>
It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.
For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.
--
John Hsu - Amazon Web Services
>
> On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
> > ,the main problem is when if someone manually places an invalid WAL file
> > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> > consider it as valid ,so with the approach as i mentioned earlier we can
> > catch such cases.
>
> I think that parsing the file name may be a good idea so that we can
> do appropriate sanity checks on the values (e.g. checking that we're
> only skipping copying prior to last_common_segno), but I do not think
> we should worry too much about the user manually injecting invalid WAL
> files. I mean, I would prefer that if that does happen, it either
> works anyway or fails with a sensible error message, rather than
> emitting an incomprehensible error message or dumping core. But, it is
> in general true that if manual modifications are made to the data
> directory, things may go terribly wrong, and this code is not obliged
> to provide any more protection against such scenarios than we do in
> other cases. Ultimately, such modifications are user error.
>
It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.
For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.
--
John Hsu - Amazon Web Services
On Thu, Oct 16, 2025 at 11:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> just a second late :( i was about to post a patch addressing the refactors which Robert mentioned ,anyway will have a look at your latest patch John thanks :), curious about the tap test.
>
> while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,because once we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can only happen with wal segments older than or equal to last_common_segno because once the promotion competes the filename of the WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 then based on the rules in XLogInitNewTimeline
> 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
> 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
>
> so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
What makes me nervous about this is that it isn't necessarily the case
that the servers were perfectly in sync at the time of the failure.
Suppose that the primary was in the middle of writing
000000010000000000000003. The standby might also have this file, but
it might contain less valid data than the one on the primary;
therefore, if we don't copy the file, the two servers might not have
an identical file. Maybe that wouldn't really matter, in the sense
that the extra valid data that exists on the original primary
shouldn't prevent it from replaying WAL on the new primary's timeline,
which is probably all we really care about. But it feels dangerous to
me.
Thanks Robert ,I want to understand this point more , and will get back .
Hi,
On Fri, Oct 17, 2025 at 4:22 AM John H <johnhyvr@gmail.com> wrote:
Hi,
On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
>
> TBH first it seemed to me a good coding practice for future proofing,
> then i checked if there's any place in postgres we are doing
> something similar ,then found 1 in src/include/access/gist.h
>
> typedef struct GISTDeletedPageContents
> {
> /* last xid which could see the page in a scan */
> FullTransactionId deleteXid;
> } GISTDeletedPageContents;
>
It makes more sense to me if we expect the struct or same set of arguments
to be reused in different places / want a stable API reference. I don't think
we want everything to be wrapped around a struct for functions just in case.
> ..
> thanks for updating, i have attached a diff patch on
> top of v9-0001 patch , where i tried to add more tests
> ...
Thank you for the diff! Your changes are a lot cleaner and I've included it in
the latest patch. One difference I've made is instead of additionally logging in
decide_wal_file_action
> + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
I realized we are already logging the WAL segments that are copied over in
print_filemap so I've updated the test to check for that instead
> qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
I also updated the error message for the last three checks.
Thanks for updating the patch, I have reviewed it,
except these below concerns, patch LGTM.
1) regarding the parsing the WAL filename
On Fri, Oct 17, 2025 at 4:25 AM John H <johnhyvr@gmail.com> wrote:
On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
> > ,the main problem is when if someone manually places an invalid WAL file
> > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> > consider it as valid ,so with the approach as i mentioned earlier we can
> > catch such cases.
>
> I think that parsing the file name may be a good idea so that we can
> do appropriate sanity checks on the values (e.g. checking that we're
> only skipping copying prior to last_common_segno), but I do not think
> we should worry too much about the user manually injecting invalid WAL
> files. I mean, I would prefer that if that does happen, it either
> works anyway or fails with a sensible error message, rather than
> emitting an incomprehensible error message or dumping core. But, it is
> in general true that if manual modifications are made to the data
> directory, things may go terribly wrong, and this code is not obliged
> to provide any more protection against such scenarios than we do in
> other cases. Ultimately, such modifications are user error.
>
It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.
For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.
I agree with Robert here, and i think if we found
that the WAL filename is not valid after parsing
we can return something like FILE_CONTENT_INVALID_TYPE
in getFileContentType so that it won't end up going
through decide_wal_file_action,then XLogFromFileName
may assign file_segno invalid values greater than
last_common_segno which means we lead to copying it,even
though it's an invalid file, thoughts?
also add file_content_type_t typedef to typedefs.list
file.
3) maybe we can improve the error messages for the
last 2 checks in tap test, that because of this
reason hence proven that the copy from source to
target has been done.
4) We need to update the pg_rewind docs regarding
this new behaviour.
Hi Srinath, Thank you for the quick review! On Sat, Oct 18, 2025 at 9:05 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: > ... > I agree with Robert here, and i think if we found > that the WAL filename is not valid after parsing > we can return something like FILE_CONTENT_INVALID_TYPE > in getFileContentType so that it won't end up going > through decide_wal_file_action,then XLogFromFileName > may assign file_segno invalid values greater than > last_common_segno which means we lead to copying it,even > though it's an invalid file, thoughts? > My understanding of pg_rewind is that it aims to make the two clusters "identical" after rewind, similar to a basebackup. It does not try to make decisions on whether or not a file is necessary for database to start. I'd prefer the patch aims to keep parity with the existing behavior of pg_rewind/pg_basebackup as much as possible. If there is an unexpected/invalid segment and we can't make assumptions about the content, we should just sync it to keep with existing behavior. There's lots of potential optimizations pg_rewind can do if we only talk about what is "strictly" necessary for the database to start. We can exclude any directories that should not be there. For example, if there exists a directory $PGDATA/random_files on source and target, the whole directory gets synced right now. We could not include that directory and the database would most likely still start but that goes against the current way pg_rewind works. > 2) Please fix the indentation by running pgindent, > also add file_content_type_t typedef to typedefs.list > file. Will update. > 3) maybe we can improve the error messages for the > last 2 checks in tap test, that because of this > reason hence proven that the copy from source to > target has been done. What are you thinking of? Something like the below? + "Expected WAL segment $corrupt_wal_seg to have been modified due to rewind"); + "Expected WAL segment $corrupt_wal_seg to have been synced from source to target after rewind"); > 4) We need to update the pg_rewind docs regarding > this new behaviour. > Will update. -- John Hsu - Amazon Web Services
Hi John,
On Tue, Oct 21, 2025 at 3:34 AM John H <johnhyvr@gmail.com> wrote:
> 3) maybe we can improve the error messages for the
> last 2 checks in tap test, that because of this
> reason hence proven that the copy from source to
> target has been done.
What are you thinking of? Something like the below?
+ "Expected WAL segment $corrupt_wal_seg to have been modified
due to rewind");
+ "Expected WAL segment $corrupt_wal_seg to have been synced
from source to target after rewind");
maybe something like this
+ "Expected WAL segment $corrupt_wal_seg to have later mtime
on target than source after rewind as it was copied");
+ "Expected WAL segment $corrupt_wal_seg file sizes to be same
between target and source after rewind as it was copied");
between target and source after rewind as it was copied");
Hi Srinath, On Mon, Oct 20, 2025 at 9:36 PM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: > ... > maybe something like this > + "Expected WAL segment $corrupt_wal_seg to have later mtime > on target than source after rewind as it was copied"); > + "Expected WAL segment $corrupt_wal_seg file sizes to be same > between target and source after rewind as it was copied"); > Made the changes and included the documentation update. Thanks, -- John Hsu - Amazon Web Services
Вложения
On Thu, Sep 25, 2025 at 02:35:16PM -0400, Robert Haas wrote: > On a related note, something a committer would need to do if they > wanted to commit this patch is figure out who should be credited with > writing it. I see that a number of different people have posted > versions of the patch; it would be nice if "Author:" or > "Co-authored-by:" tags were added to the commit message to reflect > whose code is present in a given version. Looking at this patch was on my TODO list (more to that in a bit), and the simplest answer to this question is that everybody involved would be cited in the release notes. Justin Kwan has been the original author. John Hsu and Srinath have followed up in an equivalent fashion, including answers with your review comments posted on this previous message. So listing all three as equivalent authors feels normal to me (no need for co-authoring notes, IMO). -- Michael
Вложения
On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote: > Made the changes and included the documentation update. I have been looking at this patch, and the first part that stood out is that the refactoring related to isRelDataFile() can be done as an independent piece, cutting the total size of the main patch by 30% or so. So I have extracted this part, simplified it to make it more consistent and les noisy on HEAD, and applied it as 6ae08d9583e9. I am still looking at the rest, and attached is a rebased version of what I have for the moment. Note first that the new test was missing in meson.build, so the coverage provided by the CI was limited. + * However we check last_common_segno and file_size again for sanity. + */ + if (file_segno < last_common_segno && source_size == target_size) What if a segment has a size different than the one a cluster has been initialized with, still both have the same size on the target and the source? Unlikely, still incorrect if not cross-checked with what a control file has in store. :) +# Sleep to allow mtime to be different +usleep(1000000); [...] + [qr/WAL segment \"$wal_seg_skipped\" not copied to target/, + qr/pg_wal\/$corrupt_wal_seg \(COPY\)/ + ], FWIW, I am definitely not a fan of the test relying on the timestamp of the files based on their retrieved meta-data stats, with the mtime. I suspect complications on Windows. Worse thing, there is a forced sleep of 1s added to the test, to make sure that the timestamp of the files we compare have a different number. But do we really need that anyway if we have the debug information with the file map printed in it? Hence, I think that we should simplify and rework the test a bit, tweaking a bit how the filemap is printed: now that we can detect if an operation is happening on a WAL file thanks to file_content_type_t, let's always print the type of operation done for each WAL file and remove this "not copied to target" debug log which provides the same information, then let's compare the debug output with what we want. It seems to me that it would be good for the test to run some sanity checks on the rewound standby, as well. That would provide more validation for the case of the "corrupted" segment that gets copied anyway. -- Michael
Вложения
On Thu, Oct 23, 2025 at 4:22 AM Michael Paquier <michael@paquier.xyz> wrote: > + * However we check last_common_segno and file_size again for sanity. > + */ > + if (file_segno < last_common_segno && source_size == target_size) > > What if a segment has a size different than the one a cluster has been > initialized with, still both have the same size on the target and the > source? Unlikely, still incorrect if not cross-checked with what a > control file has in store. :) While I'm not against cross-checking against the control file, this sounds like an imaginary scenario to me. That is, it would only happen if somebody maliciously modified the contents of the data directory by hand with the express goal of breaking the tool. But we fundamentally cannot defend against a malicious user whose express goal is to break the tool, and I do not see any compelling reason to expend energy on it even in cases like this where we could theoretically detect it without much effort. If we go down that path, we'll end up not only complicating the code, but also obscuring our own goals: it will look like we've either done too much sanity checking (because we will have added checks that are unnecessary with a non-malicious user) or too little (because we will not have caught all the things a malicious user might do). -- Robert Haas EDB: http://www.enterprisedb.com
Hi Michael,
On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
> Made the changes and included the documentation update.
I have been looking at this patch, and the first part that stood out
is that the refactoring related to isRelDataFile() can be done as an
independent piece, cutting the total size of the main patch by 30% or
so. So I have extracted this part, simplified it to make it more
consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.
yeah ... that's cleaner, thanks for committing.
+# Sleep to allow mtime to be different
+usleep(1000000);
[...]
+ [qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
+ ],
FWIW, I am definitely not a fan of the test relying on the timestamp
of the files based on their retrieved meta-data stats, with the mtime.
I suspect complications on Windows. Worse thing, there is a forced
sleep of 1s added to the test, to make sure that the timestamp of the
files we compare have a different number. But do we really need that
anyway if we have the debug information with the file map printed in
it?
thought it would act like an extra sanity check ,to prove the point that the
pg_rewind is saying through the debug info that it has been really copied
or skipped.
Hence, I think that we should simplify and rework the test a bit,
tweaking a bit how the filemap is printed: now that we can detect if
an operation is happening on a WAL file thanks to file_content_type_t,
let's always print the type of operation done for each WAL file and
remove this "not copied to target" debug log which provides the same
information, then let's compare the debug output with what we want.
i guess this is what you want ,please let me know if it's otherwise,
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2d82edec060..2180495d337 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -546,7 +546,7 @@ print_filemap(filemap_t *filemap)
for (i = 0; i < filemap->nentries; i++)
{
entry = filemap->entries[i];
- if (entry->action != FILE_ACTION_NONE ||
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL || entry->action != FILE_ACTION_NONE ||
entry->target_pages_to_overwrite.bitmapsize > 0)
{
pg_log_debug("%s (%s)", entry->path,
@@ -733,7 +733,6 @@ decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
*/
if (file_segno < last_common_segno && source_size == target_size)
{
- pg_log_debug("WAL segment \"%s\" not copied to target", fname);
return FILE_ACTION_NONE;
}
diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/011_wal_copy.pl
index b9e24844654..3e12744dfa6 100644
--- a/src/bin/pg_rewind/t/011_wal_copy.pl
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -93,7 +93,7 @@ command_checks_all(
0,
[qr//],
[
- qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$wal_seg_skipped \(NONE\)/,
qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
],
'run pg_rewind');
It seems to me that it would be good for the test to run some sanity
checks on the rewound standby, as well. That would provide more
validation for the case of the "corrupted" segment that gets copied
anyway.
i think these checks help do the same
1) qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
this shows the corrupted segment is copied.
2) ok( $corrupt_wal_seg_stat_before_rewind->size !=
$corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg to have different size in
source vs target before rewind"
);
my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
ok( $corrupt_wal_seg_stat_after_rewind->size ==
$corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg file sizes to be same between
target and source after rewind as it was copied"
);
These checks show that before the corrupt segment's size on
rewound standby(target) was not the same as source but after
1) qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
this shows the corrupted segment is copied.
2) ok( $corrupt_wal_seg_stat_before_rewind->size !=
$corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg to have different size in
source vs target before rewind"
);
my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
ok( $corrupt_wal_seg_stat_after_rewind->size ==
$corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg file sizes to be same between
target and source after rewind as it was copied"
);
These checks show that before the corrupt segment's size on
rewound standby(target) was not the same as source but after
rewind it was the same as source,please let me know if i am
missing your point here.
Hi, On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: > On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote: >> >> FWIW, I am definitely not a fan of the test relying on the timestamp >> of the files based on their retrieved meta-data stats, with the mtime. >> I suspect complications on Windows. Worse thing, there is a forced >> sleep of 1s added to the test, to make sure that the timestamp of the >> files we compare have a different number. But do we really need that >> anyway if we have the debug information with the file map printed in >> it? > > > thought it would act like an extra sanity check ,to prove the point that the > pg_rewind is saying through the debug info that it has been really copied > or skipped. > I don't feel too strongly about it. If we want to rely on checking debug log that works for me. We don't do any in-depth checks that every file was properly copied anyway for existing files. > ... > i guess this is what you want ,please let me know if it's otherwise, > ... That looks right to me. >> >> It seems to me that it would be good for the test to run some sanity >> checks on the rewound standby, as well. That would provide more >> validation for the case of the "corrupted" segment that gets copied >> anyway. > > ... > These checks show that before the corrupt segment's size on > rewound standby(target) was not the same as source but after > rewind it was the same as source,please let me know if i am > missing your point here. > What did you have in mind for additional sanity checks? The existing test checks that when sizes are different the correct branch is taken. For something more in-depth that requires comparing data before and after the rewind with a "corrupted" segment that seems complicated since the segment would have to somehow be applied to writer but not replica prior to divergence. -- John Hsu - Amazon Web Services
On Thu, Oct 23, 2025 at 03:01:32PM -0700, John H wrote: > On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote: >> ... >> i guess this is what you want ,please let me know if it's otherwise, >> ... > > That looks right to me. Yes, that would be something among these lines, WAL segments being treated as an exception when printing the file map to show that they are not copied. >> These checks show that before the corrupt segment's size on >> rewound standby(target) was not the same as source but after >> rewind it was the same as source,please let me know if i am >> missing your point here. > > What did you have in mind for additional sanity checks? > The existing test checks that when sizes are different the correct > branch is taken. For something more in-depth that requires comparing > data before and after the rewind with a "corrupted" segment that seems > complicated since the segment would have to somehow be applied to > writer but not replica prior to divergence. I was just wondering about some queries on the new standby once we are done with the rewind. But after sleeping on it, I'd be happy with just the set of tests we have: the debug output checks, the size checks pre and post-rewind, and no mtime. -- Michael
Вложения
On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote: > I was just wondering about some queries on the new standby once we are > done with the rewind. But after sleeping on it, I'd be happy with > just the set of tests we have: the debug output checks, the size > checks pre and post-rewind, and no mtime. So, I am coming down to the attached (minus the commit message) after more edits and adjustments. It's Friday evening here, and I am not planning to do more today and to take bets on the buildfarm, even if the CI is OK. -- Michael
Вложения
On Fri, Oct 24, 2025 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
> I was just wondering about some queries on the new standby once we are
> done with the rewind. But after sleeping on it, I'd be happy with
> just the set of tests we have: the debug output checks, the size
> checks pre and post-rewind, and no mtime.
So, I am coming down to the attached (minus the commit message) after
more edits and adjustments. It's Friday evening here, and I am not
planning to do more today and to take bets on the buildfarm, even if
the CI is OK.
Thanks for the updated patch, I have reviewed and tested the
patch, except these minor refactors ,it LGTM.
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 59672e66932..467fd97ebcf 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -728,7 +728,7 @@ decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
/*
* Avoid copying files before the last common segment.
*
- * These files exist on the source and the target services, so they should
+ * These files exist on the source and the target servers, so they should
* be identical and located strictly before the segment that contains the
* LSN where target and source servers have diverged.
*
I think as we are not using mtime to show that the file has not been copied
and been skipped ,instead we are doing the same with the debug message
(qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
segment can be removed.
segment can be removed.
diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/011_wal_copy.pl
index fb5b7104378..c4d7200d710 100644
--- a/src/bin/pg_rewind/t/011_wal_copy.pl
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -43,12 +43,6 @@ RewindTest::promote_standby;
my $new_timeline_wal_seg = $node_standby->safe_psql('postgres',
'SELECT pg_walfile_name(pg_current_wal_lsn())');
-# Get some stats info for the WAL file whose copy is skipped.
-my $wal_skipped_path =
- $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
-my $wal_skipped_stat = stat($wal_skipped_path);
-defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
-
# Corrupt a WAL segment on target that has been generated before the
# divergence point. We will check that it is copied from the source.
Test:
I tried to demonstrate that with this patch, we are not copying WAL
segments before the divergence point by using wal_keep_size.
Environment:
A primary and standby physical streaming replication setup was used.
The TPC-H dbgen tool with a Scale Factor of 10 was used to generate
data, which in turn generated sufficient WAL to test with wal_keep_size = 20000 MB.
A failover and rewind were performed.
Result:
Without Patch:
SELECT
pg_size_pretty(sum(size)) AS wal_size
FROM
pg_ls_waldir();
wal_size
----------
19 GB
(1 row)
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/DEACBAB0 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/C8250D28 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 20449 MB (total source directory size is 32778 MB)
20940602/20940602 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
To show that only a small amount of WAL was generated between the last common
checkpoint and the latest checkpoint on the source:
postgres=# select pg_size_pretty('4/DEACBB78'::pg_lsn - '4/C8250D28'::pg_lsn);
pg_size_pretty
----------------
360 MB
(1 row)
so the 20.449 GB copied was primarily due to wal_keep_size
(19 GB of accumulated WAL) being copied, as previously non-data files were copied in full.
With Patch:
pg_size_pretty
----------------
360 MB
(1 row)
so the 20.449 GB copied was primarily due to wal_keep_size
(19 GB of accumulated WAL) being copied, as previously non-data files were copied in full.
With Patch:
SELECT
pg_size_pretty(sum(size)) AS wal_size
FROM
pg_ls_waldir();
wal_size
----------
19 GB
(1 row)
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/B0F2FE78 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/B087E468 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 212 MB (total source directory size is 32074 MB)
pg_rewind: skipped WAL 19168 MB -> tweaked calculate_totals to get this,
just to give more info (not included in patch).
217467/217467 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
postgres=# select pg_size_pretty('4/B0F30AB0'::pg_lsn - '4/B087E468'::pg_lsn);
pg_size_pretty
----------------
6855 kB
(1 row)
This test shows that with the patch applied, only the necessary WAL segments
after the divergence point are copied, while previously accumulated WAL
segments (due to wal_keep_size) are correctly skipped.
-- pg_size_pretty(sum(size)) AS wal_size
FROM
pg_ls_waldir();
wal_size
----------
19 GB
(1 row)
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/B0F2FE78 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/B087E468 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 212 MB (total source directory size is 32074 MB)
pg_rewind: skipped WAL 19168 MB -> tweaked calculate_totals to get this,
just to give more info (not included in patch).
217467/217467 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
postgres=# select pg_size_pretty('4/B0F30AB0'::pg_lsn - '4/B087E468'::pg_lsn);
pg_size_pretty
----------------
6855 kB
(1 row)
This test shows that with the patch applied, only the necessary WAL segments
after the divergence point are copied, while previously accumulated WAL
segments (due to wal_keep_size) are correctly skipped.
On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote: > - * These files exist on the source and the target services, so they > should > + * These files exist on the source and the target servers, so they > should Not sure what my fingers were doing here. > I think as we are not using mtime to show that the file has not been copied > and been skipped ,instead we are doing the same with the debug message > (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL > segment can be removed. Yes, removing this one makes sense. And, to keep it short: applied. -- Michael
Вложения
On Sat, Oct 25, 2025 at 6:46 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
> - * These files exist on the source and the target services, so they
> should
> + * These files exist on the source and the target servers, so they
> should
Not sure what my fingers were doing here.
> I think as we are not using mtime to show that the file has not been copied
> and been skipped ,instead we are doing the same with the debug message
> (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
> segment can be removed.
Yes, removing this one makes sense.
And, to keep it short: applied.
Thanks Michael.
On Thu, Oct 23, 2025 at 08:40:14AM -0400, Robert Haas wrote: > While I'm not against cross-checking against the control file, this > sounds like an imaginary scenario to me. That is, it would only happen > if somebody maliciously modified the contents of the data directory by > hand with the express goal of breaking the tool. But we fundamentally > cannot defend against a malicious user whose express goal is to break > the tool, and I do not see any compelling reason to expend energy on > it even in cases like this where we could theoretically detect it > without much effort. If we go down that path, we'll end up not only > complicating the code, but also obscuring our own goals: it will look > like we've either done too much sanity checking (because we will have > added checks that are unnecessary with a non-malicious user) or too > little (because we will not have caught all the things a malicious > user might do). I was thinking about this argument over the weekend, and I am wondering if we could not do better here to detect if a file should be copied or not. What if we included a checksum of each file if both exist on the target and source, and just not copy them if the checksums match? You cannot do that for relation files when the source is online, of course, but for files like the oldest segments before the divergence point, that's better than checking the size, still more expensive due to the cost of the checksum computation. And there is a sha256() available at SQL level. Just throwing one idea in the bucket of ideas. That may not be worth the extra cost here, of course, but attaching a checksum to file_entry_t is not what I would qualify as an invasive change. -- Michael
Вложения
On Tue, Oct 28, 2025 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote: > I was thinking about this argument over the weekend, and I am > wondering if we could not do better here to detect if a file should be > copied or not. What if we included a checksum of each file if both > exist on the target and source, and just not copy them if the > checksums match? Well, that would require reading the entire file on both sides to compute the checksum, which sounds pretty expensive. I mean, a copy would only be a read on one side and a write on the other. Even granting that writes are more expensive than reads, a read of both sides would still be a substantial percentage of the total cost, I think. Also, I don't think we really want to reinvent a worse version of rsync. If you want to use checksums or file timestamps to decide what to copy, there are already good tools for that which probably handle that task better than our code ever will. What we can bring to the table is PG-specific logic, where we're able to reason about the behavior of PG in a way that a general-purpose tool can't. That's why for example we use the WAL to decide what data blocks need to be copied, rather than checksums -- it's an optimization that rsync can't do, and we can. The rule implemented here is similar: rsync can't know that WAL from before the divergence point should be the same on both sides, but we can. Now, of course, if in a specific situation the assumptions on which pg_rewind relies are not valid, e.g. because manual data directory modification has occurred, then pg_rewind should not be used. And if on the other hand we find some flaw that will keep pg_rewind from delivering correct results even when nothing strange has happened, then that's a bug or a design problem that we need to fix. But if we just start second-guessing ourselves by adding overhead to protect against can't-happen scenarios, we'll end up making pg_rewind useless. -- Robert Haas EDB: http://www.enterprisedb.com