Обсуждение: Trap errors from streaming child in pg_basebackup to exit early

Поиск
Список
Период
Сортировка

Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:
When using pg_basebackup with WAL streaming (-X stream), we have observed on a
number of times in production that the streaming child exited prematurely (to
no fault of the code it seems, most likely due to network middleboxes), which
cause the backup to fail but only after it has run to completion.  On long
running backups this can consume a lot of time before it’s noticed.

By trapping the failure of the streaming process we can instead exit early to
allow the user to fix and/or restart the process.

The attached adds a SIGCHLD handler for Unix, and catch the returnvalue from
the Windows thread, in order to break out early from the main loop.  It still
needs a test, and proper testing on Windows, but early feedback on the approach
would be appreciated.

--
Daniel Gustafsson        https://vmware.com/


Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Bharath Rupireddy
Дата:
On Thu, Aug 26, 2021 at 2:55 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> When using pg_basebackup with WAL streaming (-X stream), we have observed on a
> number of times in production that the streaming child exited prematurely (to
> no fault of the code it seems, most likely due to network middleboxes), which
> cause the backup to fail but only after it has run to completion.  On long
> running backups this can consume a lot of time before it’s noticed.

Hm.

> By trapping the failure of the streaming process we can instead exit early to
> allow the user to fix and/or restart the process.
>
> The attached adds a SIGCHLD handler for Unix, and catch the returnvalue from
> the Windows thread, in order to break out early from the main loop.  It still
> needs a test, and proper testing on Windows, but early feedback on the approach
> would be appreciated.

Here are some comments on the patch:
1) Do we need volatile keyword here to read the value of the variables
always from the memory?
+static volatile sig_atomic_t bgchild_exited = false;

2) Do we need #ifndef WIN32 ... #endif around sigchld_handler function
definition?

3) I'm not sure if the new value of bgchild_exited being set in the
child thread will reflect in the main process on Windows? But
theoretically, I can understand that the memory will be shared between
the main process thread and child thread.
#ifdef WIN32
/*
* In order to signal the main thread of an ungraceful exit we
* set the flag used on Unix to signal SIGCHLD.
*/
bgchild_exited = true;
#endif

4) How about "set the same flag that we use on Unix to signal
SIGCHLD."  instead of "* set the flag used on Unix to signal
SIGCHLD."?

5) How about "background WAL receiver terminated unexpectedly" instead
of "log streamer child terminated unexpectedly"? This will be in sync
with the existing message "starting background WAL receiver". "log
streamer" is the word used internally in the code, user doesn't know
it with that name.

6) How about giving the exit code (like postmaster's reaper function
does) instead of just a message saying unexpected termination? It will
be useful to know for what reason the process exited. For Windows, we
can use GetExitCodeThread (I'm referring to the code around waitpid in
pg_basebackup) and for Unix we can use waitpid.

Regards,
Bharath Rupireddy.



Re: Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:

> On 30 Aug 2021, at 12:31, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

> Here are some comments on the patch:
> 1) Do we need volatile keyword here to read the value of the variables
> always from the memory?
> +static volatile sig_atomic_t bgchild_exited = false;

Yes, fixed.

> 2) Do we need #ifndef WIN32 ... #endif around sigchld_handler function
> definition?

Ah yes, good point. Fixed.

> 3) I'm not sure if the new value of bgchild_exited being set in the
> child thread will reflect in the main process on Windows? But
> theoretically, I can understand that the memory will be shared between
> the main process thread and child thread.

The child does not have it’s own copy of bgchild_exited.

> 4) How about "set the same flag that we use on Unix to signal
> SIGCHLD."  instead of "* set the flag used on Unix to signal
> SIGCHLD."?

Fixed.

> 5) How about "background WAL receiver terminated unexpectedly" instead
> of "log streamer child terminated unexpectedly"? This will be in sync
> with the existing message "starting background WAL receiver". "log
> streamer" is the word used internally in the code, user doesn't know
> it with that name.

Good point, that’s better.

> 6) How about giving the exit code (like postmaster's reaper function
> does) instead of just a message saying unexpected termination? It will
> be useful to know for what reason the process exited. For Windows, we
> can use GetExitCodeThread (I'm referring to the code around waitpid in
> pg_basebackup) and for Unix we can use waitpid.

The rest of the program is doing exit(1) regardless of the failure of the
child/thread, so it seems more consistent to keep doing that for this class of
error as well.

A v2 with the above fixes is attached.

--
Daniel Gustafsson        https://vmware.com/


Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Bharath Rupireddy
Дата:
On Wed, Sep 1, 2021 at 1:56 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> A v2 with the above fixes is attached.

Thanks for the updated patch. Here are some comments:

1) Do we need to set bgchild = -1 before the exit(1); in the code
below so that we don't kill(bgchild, SIGTERM); unnecessarily in
kill_bgchild_atexit?
+ if (bgchild_exited)
+ {
+ pg_log_error("background WAL receiver terminated unexpectedly");
+ exit(1);
+ }
+

2) Missing "," after "On Windows, we use a ....."
+ * that time. On Windows we use a background thread which can communicate

3) How about "/* Flag to indicate whether or not child process exited
*/" instead of +/* State of child process */?

4) Instead of just exiting from the main pg_basebackup process when
the child WAL receiver dies, can't we think of restarting the child
process, probably with the WAL streaming position where it left off or
stream from the beginning? This way, the work that the main
pg_basebackup has done so far doesn't get wasted. I'm not sure if this
affects the pg_basebackup functionality. We can restart the child
process for 1 or 2 times, if it still dies, we can kill the main
pg_baasebackup process too. Thoughts?

Regards,
Bharath Rupireddy.



Re: Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:
> On 1 Sep 2021, at 12:28, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Sep 1, 2021 at 1:56 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> A v2 with the above fixes is attached.
>
> Thanks for the updated patch. Here are some comments:
>
> 1) Do we need to set bgchild = -1 before the exit(1); in the code
> below so that we don't kill(bgchild, SIGTERM); unnecessarily in
> kill_bgchild_atexit?

Good point. We can also inspect bgchild_exited in kill_bgchild_atexit.

> 2) Missing "," after "On Windows, we use a ....."
> + * that time. On Windows we use a background thread which can communicate
>
> 3) How about "/* Flag to indicate whether or not child process exited
> */" instead of +/* State of child process */?

Fixed.

> 4) Instead of just exiting from the main pg_basebackup process when
> the child WAL receiver dies, can't we think of restarting the child
> process, probably with the WAL streaming position where it left off or
> stream from the beginning? This way, the work that the main
> pg_basebackup has done so far doesn't get wasted. I'm not sure if this
> affects the pg_basebackup functionality. We can restart the child
> process for 1 or 2 times, if it still dies, we can kill the main
> pg_baasebackup process too. Thoughts?

I was toying with the idea, but I ended up not pursuing it.  This error is well
into the “really shouldn’t happen, but can” territory and it’s quite likely
that some level of manual intervention is required to make it successfully
restart.  I’m not convinced that adding complicated logic to restart (and even
more complicated tests to simulate and test it) will be worthwhile.

--
Daniel Gustafsson        https://vmware.com/


Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Bharath Rupireddy
Дата:
On Fri, Sep 3, 2021 at 3:23 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> > 4) Instead of just exiting from the main pg_basebackup process when
> > the child WAL receiver dies, can't we think of restarting the child
> > process, probably with the WAL streaming position where it left off or
> > stream from the beginning? This way, the work that the main
> > pg_basebackup has done so far doesn't get wasted. I'm not sure if this
> > affects the pg_basebackup functionality. We can restart the child
> > process for 1 or 2 times, if it still dies, we can kill the main
> > pg_baasebackup process too. Thoughts?
>
> I was toying with the idea, but I ended up not pursuing it.  This error is well
> into the “really shouldn’t happen, but can” territory and it’s quite likely
> that some level of manual intervention is required to make it successfully
> restart.  I’m not convinced that adding complicated logic to restart (and even
> more complicated tests to simulate and test it) will be worthwhile.

 I withdraw my suggestion because I now feel that it's better not to
make it complex and let the user decide if at all the child process
exits abnormally.

I think we might still miss abnormal child thread exits on Windows
because we set bgchild_exited = true only if ReceiveXlogStream or
walmethod->finish() returns false. I'm not sure the parent thread on
Windows can detect a child's abnormal exit. Since there is no signal
mechanism on Windows, what the patch does is better to detect child
exit on two important functions failures.

Overall, the v3 patch looks good to me.

Regards,
Bharath Rupireddy.



Re: Trap errors from streaming child in pg_basebackup to exit early

От
Magnus Hagander
Дата:
On Fri, Sep 3, 2021 at 11:53 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 1 Sep 2021, at 12:28, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Sep 1, 2021 at 1:56 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> >> A v2 with the above fixes is attached.
> >
> > Thanks for the updated patch. Here are some comments:
> >
> > 1) Do we need to set bgchild = -1 before the exit(1); in the code
> > below so that we don't kill(bgchild, SIGTERM); unnecessarily in
> > kill_bgchild_atexit?
>
> Good point. We can also inspect bgchild_exited in kill_bgchild_atexit.
>
> > 2) Missing "," after "On Windows, we use a ....."
> > + * that time. On Windows we use a background thread which can communicate
> >
> > 3) How about "/* Flag to indicate whether or not child process exited
> > */" instead of +/* State of child process */?
>
> Fixed.
>
> > 4) Instead of just exiting from the main pg_basebackup process when
> > the child WAL receiver dies, can't we think of restarting the child
> > process, probably with the WAL streaming position where it left off or
> > stream from the beginning? This way, the work that the main
> > pg_basebackup has done so far doesn't get wasted. I'm not sure if this
> > affects the pg_basebackup functionality. We can restart the child
> > process for 1 or 2 times, if it still dies, we can kill the main
> > pg_baasebackup process too. Thoughts?
>
> I was toying with the idea, but I ended up not pursuing it.  This error is well
> into the “really shouldn’t happen, but can” territory and it’s quite likely
> that some level of manual intervention is required to make it successfully
> restart.  I’m not convinced that adding complicated logic to restart (and even
> more complicated tests to simulate and test it) will be worthwhile.
>

I think the restart scenario while nice, definitely means moving the
goalposts quite far. Let's get this detection in first at least, and
then we can always consider that a separate patch in the future.

Might be worth noting in one of the comments the difference in
behaviour if the backend process/thread *crashes* -- that is, on Unix
it will be detected via the signal handler and on Windows the whole
process including the main thread will die, so there is nothing to
detect.

Other places in the code just refers to the background process as "the
background process".  The term "WAL receiver" is new from this patch.
While I agree it's in many ways a better term, I think (1) we should
try to be consistent, and (2) maybe use a different term than "WAL
receiver" specifically because we have a backend component with that
name.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:
> On 28 Sep 2021, at 15:48, Magnus Hagander <magnus@hagander.net> wrote:

> Might be worth noting in one of the comments the difference in
> behaviour if the backend process/thread *crashes* -- that is, on Unix
> it will be detected via the signal handler and on Windows the whole
> process including the main thread will die, so there is nothing to
> detect.

Good point, done.

> Other places in the code just refers to the background process as "the
> background process".  The term "WAL receiver" is new from this patch.
> While I agree it's in many ways a better term, I think (1) we should
> try to be consistent, and (2) maybe use a different term than "WAL
> receiver" specifically because we have a backend component with that
> name.

Looking at the user-facing messaging we have before this patch, there is a bit
of variability:

On UNIX:

  pg_log_error("could not create pipe for background process: %m");
  pg_log_error("could not create background process: %m");
  pg_log_info("could not send command to background pipe: %m");
  pg_log_error("could not wait for child process: %m");

On Windows:

  pg_log_error("could not create background thread: %m");
  pg_log_error("could not get child thread exit status: %m");
  pg_log_error("could not wait for child thread: %m");
  pg_log_error("child thread exited with error %u", ..);

On Both:

  pg_log_info("starting background WAL receiver");
  pg_log_info("waiting for background process to finish streaming ...");

So there is one mention of a background WAL receiver already in there, but it's
pretty inconsistent as to what we call it.  For now I've changed the messaging
in this patch to say "background process", leaving making this all consistent
for a follow-up patch.

The attached fixes the above, as well as the typo mentioned off-list and is
rebased on top of todays HEAD.

--
Daniel Gustafsson        https://vmware.com/


Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Masahiko Sawada
Дата:
On Wed, Sep 29, 2021 at 8:18 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 28 Sep 2021, at 15:48, Magnus Hagander <magnus@hagander.net> wrote:
>
> > Might be worth noting in one of the comments the difference in
> > behaviour if the backend process/thread *crashes* -- that is, on Unix
> > it will be detected via the signal handler and on Windows the whole
> > process including the main thread will die, so there is nothing to
> > detect.
>
> Good point, done.
>
> > Other places in the code just refers to the background process as "the
> > background process".  The term "WAL receiver" is new from this patch.
> > While I agree it's in many ways a better term, I think (1) we should
> > try to be consistent, and (2) maybe use a different term than "WAL
> > receiver" specifically because we have a backend component with that
> > name.
>
> Looking at the user-facing messaging we have before this patch, there is a bit
> of variability:
>
> On UNIX:
>
>   pg_log_error("could not create pipe for background process: %m");
>   pg_log_error("could not create background process: %m");
>   pg_log_info("could not send command to background pipe: %m");
>   pg_log_error("could not wait for child process: %m");
>
> On Windows:
>
>   pg_log_error("could not create background thread: %m");
>   pg_log_error("could not get child thread exit status: %m");
>   pg_log_error("could not wait for child thread: %m");
>   pg_log_error("child thread exited with error %u", ..);
>
> On Both:
>
>   pg_log_info("starting background WAL receiver");
>   pg_log_info("waiting for background process to finish streaming ...");
>
> So there is one mention of a background WAL receiver already in there, but it's
> pretty inconsistent as to what we call it.  For now I've changed the messaging
> in this patch to say "background process", leaving making this all consistent
> for a follow-up patch.
>
> The attached fixes the above, as well as the typo mentioned off-list and is
> rebased on top of todays HEAD.

Thank you for working on this issue.

The patch looks good to me but there is one minor comment:

--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -174,6 +174,8 @@ static int bgpipe[2] = {-1, -1};
 /* Handle to child process */
 static pid_t bgchild = -1;
 static bool in_log_streamer = false;
+/* Flag to indicate if child process exited unexpectedly */
+static volatile sig_atomic_t bgchild_exited = false;

It's better to have a new line before the new comment.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Trap errors from streaming child in pg_basebackup to exit early

От
Bharath Rupireddy
Дата:
On Wed, Sep 29, 2021 at 4:48 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> > Other places in the code just refers to the background process as "the
> > background process".  The term "WAL receiver" is new from this patch.
> > While I agree it's in many ways a better term, I think (1) we should
> > try to be consistent, and (2) maybe use a different term than "WAL
> > receiver" specifically because we have a backend component with that
> > name.
>
> Looking at the user-facing messaging we have before this patch, there is a bit
> of variability:
>
> On UNIX:
>
>   pg_log_error("could not create pipe for background process: %m");
>   pg_log_error("could not create background process: %m");
>   pg_log_info("could not send command to background pipe: %m");
>   pg_log_error("could not wait for child process: %m");
>
> On Windows:
>
>   pg_log_error("could not create background thread: %m");
>   pg_log_error("could not get child thread exit status: %m");
>   pg_log_error("could not wait for child thread: %m");
>   pg_log_error("child thread exited with error %u", ..);
>
> On Both:
>
>   pg_log_info("starting background WAL receiver");
>   pg_log_info("waiting for background process to finish streaming ...");
>
> So there is one mention of a background WAL receiver already in there, but it's
> pretty inconsistent as to what we call it.  For now I've changed the messaging
> in this patch to say "background process", leaving making this all consistent
> for a follow-up patch.
>
> The attached fixes the above, as well as the typo mentioned off-list and is
> rebased on top of todays HEAD.

The documentation [1] of pg_basebackup specifies it as a "second
replication connection". Also, I see that the pg_receivewal.c using
the following message:
    if (db_name)
    {
        pg_log_error("replication connection using slot \"%s\" is
unexpectedly database specific",
                     replication_slot);
        exit(1);

We can use something like "stream replication connection" or
"background replication connection" or "background process/thread for
replication". Otherwise just "background process" on Unix and
"background thread" on Windows look fine to me. If others are okay, we
can remove the "WAL receiver" and use it consistently.

[1]
s
stream

Stream write-ahead log data while the backup is being taken. This
method will open a second connection to the server and start streaming
the write-ahead log in parallel while running the backup. Therefore,
it will require two replication connections not just one. As long as
the client can keep up with the write-ahead log data, using this
method requires no extra write-ahead logs to be saved on the source
server.

When tar format is used, the write-ahead log files will be written to
a separate file named pg_wal.tar (if the server is a version earlier
than 10, the file will be named pg_xlog.tar).

This value is the default.

Regards,
Bharath Rupireddy.



Re: Trap errors from streaming child in pg_basebackup to exit early

От
Michael Paquier
Дата:
On Wed, Sep 29, 2021 at 01:18:40PM +0200, Daniel Gustafsson wrote:
> So there is one mention of a background WAL receiver already in there, but it's
> pretty inconsistent as to what we call it.  For now I've changed the messaging
> in this patch to say "background process", leaving making this all consistent
> for a follow-up patch.
>
> The attached fixes the above, as well as the typo mentioned off-list and is
> rebased on top of todays HEAD.

I have been looking a bit at this patch, and did some tests on Windows
to find out that this is able to catch the failure of the thread
streaming the WAL segments in pg_basebackup, avoiding a completion of
the base backup, while HEAD waits until the backup finishes.  Testing
this scenario is actually simple by issuing pg_terminate_backend() on
the WAL sender that streams the WAL with START_REPLICATION, while
throttling the base backup.

Could you add a test to automate this scenario?  As far as I can see,
something like the following should be stable even for Windows:
1) Run a pg_basebackup in the background with IPC::Run, using
--max-rate with a minimal value to slow down the base backup, for slow
machines.  013_crash_restart.pl does that as one example with $killme.
2) Find out the WAL sender doing START_REPLICATION in the backend, and
issue pg_terminate_backend() on it.
3) Use a variant of pump_until() on the pg_basebackup process and
check after one or more failure patterns.  We should refactor this
part, actually.  If this new test uses the same logic, that would make
three tests doing that with 022_crash_temp_files.pl and
013_crash_restart.pl.  The CI should be fine to provide any feedback
with the test in place, though I am fine to test things also in my
box.
--
Michael

Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:
> On 16 Feb 2022, at 08:27, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 29, 2021 at 01:18:40PM +0200, Daniel Gustafsson wrote:
>> So there is one mention of a background WAL receiver already in there, but it's
>> pretty inconsistent as to what we call it.  For now I've changed the messaging
>> in this patch to say "background process", leaving making this all consistent
>> for a follow-up patch.
>>
>> The attached fixes the above, as well as the typo mentioned off-list and is
>> rebased on top of todays HEAD.
>
> I have been looking a bit at this patch, and did some tests on Windows
> to find out that this is able to catch the failure of the thread
> streaming the WAL segments in pg_basebackup, avoiding a completion of
> the base backup, while HEAD waits until the backup finishes.  Testing
> this scenario is actually simple by issuing pg_terminate_backend() on
> the WAL sender that streams the WAL with START_REPLICATION, while
> throttling the base backup.

Great, thanks!

> Could you add a test to automate this scenario?  As far as I can see,
> something like the following should be stable even for Windows:
> 1) Run a pg_basebackup in the background with IPC::Run, using
> --max-rate with a minimal value to slow down the base backup, for slow
> machines.  013_crash_restart.pl does that as one example with $killme.
> 2) Find out the WAL sender doing START_REPLICATION in the backend, and
> issue pg_terminate_backend() on it.
> 3) Use a variant of pump_until() on the pg_basebackup process and
> check after one or more failure patterns.  We should refactor this
> part, actually.  If this new test uses the same logic, that would make
> three tests doing that with 022_crash_temp_files.pl and
> 013_crash_restart.pl.  The CI should be fine to provide any feedback
> with the test in place, though I am fine to test things also in my
> box.

This is good idea, I was going in a different direction earlier with a test but
this is cleaner.  The attached 0001 refactors pump_until; 0002 fixes a trivial
spelling error found while hacking; and 0003 is the previous patch complete
with a test that passes on Cirrus CI.

--
Daniel Gustafsson        https://vmware.com/


Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Michael Paquier
Дата:
On Fri, Feb 18, 2022 at 10:00:43PM +0100, Daniel Gustafsson wrote:
> This is good idea, I was going in a different direction earlier with a test but
> this is cleaner.  The attached 0001 refactors pump_until; 0002 fixes a trivial
> spelling error found while hacking; and 0003 is the previous patch complete
> with a test that passes on Cirrus CI.

This looks rather sane to me, and I can confirm that this passes
the CI and a manual run of MSVC tests with my own box.

+is($node->poll_query_until('postgres',
+   "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE " .
+   "application_name = '010_pg_basebackup.pl' AND wait_event =
'WalSenderMain' " .
+   "AND backend_type = 'walsender'"), "1", "Walsender killed");
If you do that, don't you have a risk to kill the WAL sender doing the
BASE_BACKUP?  That could falsify the test.  It seems to me that it
would be safer to add a check on query ~ 'START_REPLICATION' or
something like that.

-           diag("aborting wait: program timed out");
-           diag("stream contents: >>", $$stream, "<<");
-           diag("pattern searched for: ", $untl);
Keeping some of this information around would be useful for
debugging in the refactored routine.

+my $sigchld_bb = IPC::Run::start(
+   [
+       @pg_basebackup_defs, '-X', 'stream', '-D', "$tempdir/sigchld",
+       '-r', '32', '-d', $node->connstr('postgres')
+   ],
I would recommend the use of long options here as a matter to
self-document what this does, and add a comment explaining why
--max-rate is preferable, mainly for fast machines.
--
Michael

Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:
> On 21 Feb 2022, at 03:03, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Feb 18, 2022 at 10:00:43PM +0100, Daniel Gustafsson wrote:
>> This is good idea, I was going in a different direction earlier with a test but
>> this is cleaner.  The attached 0001 refactors pump_until; 0002 fixes a trivial
>> spelling error found while hacking; and 0003 is the previous patch complete
>> with a test that passes on Cirrus CI.
>
> This looks rather sane to me, and I can confirm that this passes
> the CI and a manual run of MSVC tests with my own box.

Great, thanks!

> +is($node->poll_query_until('postgres',
> +   "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE " .
> +   "application_name = '010_pg_basebackup.pl' AND wait_event =
> 'WalSenderMain' " .
> +   "AND backend_type = 'walsender'"), "1", "Walsender killed");
> If you do that, don't you have a risk to kill the WAL sender doing the
> BASE_BACKUP?  That could falsify the test.  It seems to me that it
> would be safer to add a check on query ~ 'START_REPLICATION' or
> something like that.

I don't think there's a risk, but I've added the check on query as well since
it also makes it more readable.

> -           diag("aborting wait: program timed out");
> -           diag("stream contents: >>", $$stream, "<<");
> -           diag("pattern searched for: ", $untl);
> Keeping some of this information around would be useful for
> debugging in the refactored routine.

Maybe, but we don't really have diag output anywhere in the modules or the
tests so I didn't see much of a precedent for keeping it.  Inspectig the repo I
think we can remove two more in pg_rewind, which I just started a thread for.

> +my $sigchld_bb = IPC::Run::start(
> +   [
> +       @pg_basebackup_defs, '-X', 'stream', '-D', "$tempdir/sigchld",
> +       '-r', '32', '-d', $node->connstr('postgres')
> +   ],
>     I would recommend the use of long options here as a matter to
> self-document what this does, and add a comment explaining why
> --max-rate is preferable, mainly for fast machines.

Fair enough, done.

--
Daniel Gustafsson        https://vmware.com/


Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Michael Paquier
Дата:
On Mon, Feb 21, 2022 at 03:11:30PM +0100, Daniel Gustafsson wrote:
>On 21 Feb 2022, at 03:03, Michael Paquier <michael@paquier.xyz> wrote:
>> +is($node->poll_query_until('postgres',
>> +   "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE " .
>> +   "application_name = '010_pg_basebackup.pl' AND wait_event =
>> 'WalSenderMain' " .
>> +   "AND backend_type = 'walsender'"), "1", "Walsender killed");
>> If you do that, don't you have a risk to kill the WAL sender doing the
>> BASE_BACKUP?  That could falsify the test.  It seems to me that it
>> would be safer to add a check on query ~ 'START_REPLICATION' or
>> something like that.
>
> I don't think there's a risk, but I've added the check on query as well since
> it also makes it more readable.

Okay, thanks.

>> -           diag("aborting wait: program timed out");
>> -           diag("stream contents: >>", $$stream, "<<");
>> -           diag("pattern searched for: ", $untl);
>> Keeping some of this information around would be useful for
>> debugging in the refactored routine.
>
> Maybe, but we don't really have diag output anywhere in the modules or the
> tests so I didn't see much of a precedent for keeping it.  Inspectig the repo I
> think we can remove two more in pg_rewind, which I just started a thread for.

Hmm.  If you think this is better this way, I won't fight hard on this
point, either.

The patch set looks fine overall.
--
Michael

Вложения

Re: Trap errors from streaming child in pg_basebackup to exit early

От
Daniel Gustafsson
Дата:
> On 22 Feb 2022, at 02:13, Michael Paquier <michael@paquier.xyz> wrote:

> The patch set looks fine overall.

Thanks for reviewing and testing, I pushed this today after taking another
round at it.

--
Daniel Gustafsson        https://vmware.com/