Discussion: Return codes for archive and restore commands


Return codes for archive and restore commands

From
PG Doc comments form
Date:
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/archive-recovery-settings.html
Description:

For instance, the documentation says this about the restore command:

It is important for the command to return a zero exit status only if it
succeeds. The command will be asked for file names that are not present in
the archive; it must return nonzero when so asked. Examples:
...
An exception is that if the command was terminated by a signal (other than
SIGTERM, which is used as part of a database server shutdown) or an error by
the shell (such as command not found), then recovery will abort and the
server will not start up.
(end of quotation)

This is not correct. I think that the way PostgreSQL's behavior depends on
the return codes of the restore and archive commands must be explained more
precisely; this is important for those who write scripts and applications
for these commands. For instance, if the AWS command line interface (awscli)
is used as the restore command, aws returns code 255 for some failures (for
instance a network fault), and this leads to unexpected results with
PostgreSQL.

For the archive command:
<=128 No errors in the PostgreSQL log (no messages with severity ERROR or
higher). First there are 3 LOG messages about the failure, then a WARNING
about it and a pause of 1 minute, after which the cycle repeats.
>=129 FATAL error in the PostgreSQL log. The message is about stopping the
archiver process, not the database. Retried after roughly 16 seconds.

For the restore command:
<=125 No errors in the PostgreSQL log; the command is retried after several
seconds. Suitable for reporting a network failure or an absent file.
>=126 FATAL error in the PostgreSQL log; the startup process stops and the
database shuts down. Suitable for a fatal error such as a misconfiguration.

In this case PostgreSQL tries to conform to the rules for return codes of a
Unix shell. A Unix shell returns 126 for "command not executable", 127 for
"command not found", and 128 + the signal number if the application was
terminated by an uncaught signal.
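
As a rough illustration, the thresholds observed above can be written down
as a small classifier. This is only a sketch of the behavior reported here
for PostgreSQL 11, not something taken from the documentation; the function
and constant names are made up for the example.

```python
"""Classify a command's exit status using the thresholds reported above."""

# Thresholds come from the observations in this report (PostgreSQL 11);
# they are assumptions for illustration, not documented guarantees.
ARCHIVE_FATAL_ABOVE = 128  # archive command: status > 128 was logged as FATAL
RESTORE_FATAL_ABOVE = 125  # restore command: status > 125 was logged as FATAL


def classify_exit_status(status, fatal_above):
    """Return 'success', 'retry', or 'fatal' for a shell exit status (0..255)."""
    if not 0 <= status <= 255:
        raise ValueError(f"exit status out of range: {status}")
    if status == 0:
        return "success"
    if status > fatal_above:
        return "fatal"
    return "retry"


def classify_archive(status):
    """Observed reaction to an archive command's exit status."""
    return classify_exit_status(status, ARCHIVE_FATAL_ABOVE)


def classify_restore(status):
    """Observed reaction to a restore command's exit status."""
    return classify_exit_status(status, RESTORE_FATAL_ABOVE)


if __name__ == "__main__":
    # Print the boundary cases side by side.
    for code in (0, 1, 125, 126, 127, 128, 129, 255):
        print(code, classify_archive(code), classify_restore(code))
```

With these thresholds, the 255 returned by aws falls into the "fatal" range
for both commands, which is the surprise described above.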

Re: Return codes for archive and restore commands

From
Michael Paquier
Date:
On Wed, Nov 28, 2018 at 11:00:31AM +0000, PG Doc comments form wrote:
> For the archive command:
> <=128 No errors in the PostgreSQL log (no messages with severity ERROR or
> higher). First there are 3 LOG messages about the failure, then a WARNING
> about it and a pause of 1 minute, after which the cycle repeats.
> >=129 FATAL error in the PostgreSQL log. The message is about stopping the
> archiver process, not the database. Retried after roughly 16 seconds.

This code has been around for some time, and comes from this commit:
commit: 3ad0728c817bf8abd2c76bd11d856967509b307c
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Tue, 21 Nov 2006 20:59:53 +0000
committer: Tom Lane <tgl@sss.pgh.pa.us>
date: Tue, 21 Nov 2006 20:59:53 +0000
On systems that have setsid(2) (which should be just about everything except
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process.  This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command.  Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system().  (There is no support in the
core backend for that, but it's widely done using untrusted PLs.)  Per gripe
from Stephen Harris and subsequent discussion.

The relevant part is pgarch_archiveXlog() in pgarch.c, and this comment
is the most relevant:
* Per the Single Unix Spec, shells report exit status > 128 when a
* called command died on a signal.
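
The convention that comment refers to can be checked with a short script.
This is a minimal sketch that assumes a POSIX "sh" is available on the
PATH; the helper names are invented for the example and are not taken from
pgarch.c.

```python
import signal
import subprocess


def shell_status_after_signal(sig):
    """Exit status an outer `sh` reports after its child dies from `sig`."""
    # The inner shell sends the signal to itself ($$ is its own PID); the
    # outer shell then exits with the status it observed in $?.
    script = f"sh -c 'kill -{int(sig)} $$'; exit $?"
    return subprocess.run(["sh", "-c", script], check=False).returncode


def signal_from_shell_status(status):
    """Return the signal number encoded in a shell status, or None."""
    return status - 128 if status > 128 else None


if __name__ == "__main__":
    status = shell_status_after_signal(signal.SIGKILL)
    print(status, signal_from_shell_status(status))
```

Note that a command which itself exits with a status above 128, without
being killed, cannot be told apart from a signal death by this convention.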

> In this case PostgreSQL tries to conform to the rules for return codes of a
> Unix shell. A Unix shell returns 126 for "command not executable", 127 for
> "command not found", and 128 + the signal number if the application was
> terminated by an uncaught signal.

If you were to rewrite those paragraphs or make them more precise, how
would you actually shape your suggestions?  I personally quite like the
current formulations, but I am rather used to it to be honest.
--
Michael


Re: Return codes for archive and restore commands

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Wed, Nov 28, 2018 at 11:00:31AM +0000, PG Doc comments form wrote:
> > For the archive command:
> > <=128 No errors in the PostgreSQL log (no messages with severity ERROR or
> > higher). First there are 3 LOG messages about the failure, then a WARNING
> > about it and a pause of 1 minute, after which the cycle repeats.
> > >=129 FATAL error in the PostgreSQL log. The message is about stopping the
> > archiver process, not the database. Retried after roughly 16 seconds.
>
> This code has been around for some time, and comes from this commit:
> commit: 3ad0728c817bf8abd2c76bd11d856967509b307c
> author: Tom Lane <tgl@sss.pgh.pa.us>
> date: Tue, 21 Nov 2006 20:59:53 +0000
> committer: Tom Lane <tgl@sss.pgh.pa.us>
> date: Tue, 21 Nov 2006 20:59:53 +0000
> On systems that have setsid(2) (which should be just about everything except
> Windows), arrange for each postmaster child process to be its own process
> group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
> process group not only the direct child process.  This provides saner behavior
> for archive and recovery scripts; in particular, it's possible to shut down a
> warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
> of SIGQUIT to the startup subprocess will result in killing the waiting
> recovery_command.  Also, this makes Query Cancel and statement_timeout apply
> to scripts being run from backends via system().  (There is no support in the
> core backend for that, but it's widely done using untrusted PLs.)  Per gripe
> from Stephen Harris and subsequent discussion.
>
> The relevant part is pgarch_archiveXlog() in pgarch.c, and this comment
> is the most relevant:
> * Per the Single Unix Spec, shells report exit status > 128 when a
> * called command died on a signal.
>
> > In this case PostgreSQL tries to conform to the rules for return codes of a
> > Unix shell. A Unix shell returns 126 for "command not executable", 127 for
> > "command not found", and 128 + the signal number if the application was
> > terminated by an uncaught signal.
>
> If you were to rewrite those paragraphs or make them more precise, how
> would you actually shape your suggestions?  I personally quite like the
> current formulations, but I am rather used to it to be honest.

This is another example, at least imv, of why we really need to move
away from archive_command as an interface for doing WAL archiving.

Having discussed this quite a bit lately with David Steele and Magnus,
it's pretty clear that we need to completely rip out how this works
today and rewrite it based around an extension model where a background
worker can start up and essentially take the place of the archiver
process, with flexibility to jump forward through the WAL stream,
communicate clearly with other processes, handle failure to do so
gracefully based on the specific cases, etc.

We could then possibly write an extension to be included that mimics
what archive_command does today, but imv we should immediately consider
it deprecated and encourage people to move off of it.

Thanks!

Stephen


Re: Return codes for archive and restore commands

From
Michael Paquier
Date:
On Wed, Nov 28, 2018 at 09:39:58PM -0500, Stephen Frost wrote:
> Having discussed this quite a bit lately with David Steele and Magnus,
> it's pretty clear that we need to completely rip out how this works
> today and rewrite it based around an extension model where a background
> worker can start up and essentially take the place of the archiver
> process, with flexibility to jump forward through the WAL stream,
> communicate clearly with other processes, handle failure to do so
> gracefully based on the specific cases, etc.

Hm.  When an instance is in state PM_SHUTDOWN_2, the postmaster
explicitly waits for the WAL senders and the archiver to shut down.  So
I think that you would need more control over the timing of when a
bgworker should be shut down to be completely correct.
--
Michael


Re: Return codes for archive and restore commands

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Wed, Nov 28, 2018 at 09:39:58PM -0500, Stephen Frost wrote:
> > Having discussed this quite a bit lately with David Steele and Magnus,
> > it's pretty clear that we need to completely rip out how this works
> > today and rewrite it based around an extension model where a background
> > worker can start up and essentially take the place of the archiver
> > process, with flexibility to jump forward through the WAL stream,
> > communicate clearly with other processes, handle failure to do so
> > gracefully based on the specific cases, etc.
>
> Hm.  When an instance is in state PM_SHUTDOWN_2, the postmaster
> explicitly waits for the WAL senders and the archiver to shut down.  So
> I think that you would need more control over the timing of when a
> bgworker should be shut down to be completely correct.

Yes, it couldn't be exactly the same as a generic background worker,
that's a good point.  We definitely need to make sure that the
postmaster waits for the archiver to shut down, as it does for the WAL
senders.

Thanks!

Stephen


Re: Return codes for archive and restore commands

From
Michael Paquier
Date:
On Wed, Nov 28, 2018 at 10:27:31PM -0500, Stephen Frost wrote:
> Yes, it couldn't be exactly the same as a generic background worker,
> that's a good point.  We definitely need to make sure that the
> postmaster waits for the archiver to shut down, as it does for the WAL
> senders.

Just to be clear, please note that I don't think removing the
archiver code from the core code is a bad idea, quite the contrary
actually.  But I doubt that it would be acceptable to rip out this code
without something which has the same properties and guarantees for any
users depending on it.  And archive_command is used a lot.
--
Michael


Re: Return codes for archive and restore commands

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Wed, Nov 28, 2018 at 10:27:31PM -0500, Stephen Frost wrote:
> > Yes, it couldn't be exactly the same as a generic background worker,
> > that's a good point.  We definitely need to make sure that the
> > postmaster waits for the archiver to shut down, as it does for the WAL
> > senders.
>
> Just to be clear, please note that I don't think removing the
> archiver code from the core code is a bad idea, quite the contrary
> actually.  But I doubt that it would be acceptable to rip out this code
> without something which has the same properties and guarantees for any
> users depending on it.  And archive_command is used a lot.

Yes, it's used a lot and I tend to agree that we'll need to have
something which replaces it, at least for a while, but we shouldn't let
the fact that it's used a lot (because it's the only option in many
ways...) lead us to think it's actually a good interface which should be
kept forever.  We're growing up here and realizing that the initial
implementations of things around the edges of the core system, while
used extensively, need to be updated to be reliable and resilient and
the previous unreliable interfaces need to be deprecated and eventually
removed, for the benefit of all of our users who might otherwise think
they are as reliable as we all wish they had been when they were
initially implemented.

Thanks!

Stephen


Re: Return codes for archive and restore commands

From
Oleg Bartunov
Date:
On Thu, Nov 29, 2018 at 5:40 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Michael Paquier (michael@paquier.xyz) wrote:
> > On Wed, Nov 28, 2018 at 11:00:31AM +0000, PG Doc comments form wrote:
> > > For the archive command:
> > > <=128 No errors in the PostgreSQL log (no messages with severity ERROR or
> > > higher). First there are 3 LOG messages about the failure, then a WARNING
> > > about it and a pause of 1 minute, after which the cycle repeats.
> > > >=129 FATAL error in the PostgreSQL log. The message is about stopping the
> > > archiver process, not the database. Retried after roughly 16 seconds.
> >
> > This code has been around for some time, and comes from this commit:
> > commit: 3ad0728c817bf8abd2c76bd11d856967509b307c
> > author: Tom Lane <tgl@sss.pgh.pa.us>
> > date: Tue, 21 Nov 2006 20:59:53 +0000
> > committer: Tom Lane <tgl@sss.pgh.pa.us>
> > date: Tue, 21 Nov 2006 20:59:53 +0000
> > On systems that have setsid(2) (which should be just about everything except
> > Windows), arrange for each postmaster child process to be its own process
> > group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
> > process group not only the direct child process.  This provides saner behavior
> > for archive and recovery scripts; in particular, it's possible to shut down a
> > warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
> > of SIGQUIT to the startup subprocess will result in killing the waiting
> > recovery_command.  Also, this makes Query Cancel and statement_timeout apply
> > to scripts being run from backends via system().  (There is no support in the
> > core backend for that, but it's widely done using untrusted PLs.)  Per gripe
> > from Stephen Harris and subsequent discussion.
> >
> > The relevant part is pgarch_archiveXlog() in pgarch.c, and this comment
> > is the most relevant:
> > * Per the Single Unix Spec, shells report exit status > 128 when a
> > * called command died on a signal.
> >
> > > In this case PostgreSQL tries to conform to the rules for return codes of a
> > > Unix shell. A Unix shell returns 126 for "command not executable", 127 for
> > > "command not found", and 128 + the signal number if the application was
> > > terminated by an uncaught signal.
> >
> > If you were to rewrite those paragraphs or make them more precise, how
> > would you actually shape your suggestions?  I personally quite like the
> > current formulations, but I am rather used to it to be honest.
>
> This is another example, at least imv, of why we really need to move
> away from archive_command as an interface for doing WAL archiving.

+1

>
> Having discussed this quite a bit lately with David Steele and Magnus,
> it's pretty clear that we need to completely rip out how this works
> today and rewrite it based around an extension model where a background
> worker can start up and essentially take the place of the archiver
> process, with flexibility to jump forward through the WAL stream,
> communicate clearly with other processes, handle failure to do so
> gracefully based on the specific cases, etc.
>
> We could then possibly write an extension to be included that mimics
> what archive_command does today, but imv we should immediately consider
> it deprecated and encourage people to move off of it.
>
> Thanks!
>
> Stephen



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Return codes for archive and restore commands

From
Олег Самойлов
Date:
> If you were to rewrite those paragraphs or make them more precise, how
> would you actually shape your suggestions?  I personally quite like the
> current formulations, but I am rather used to it to be honest.
> --
> Michael

Yes, I am in favor of making them more precise. Right now these paragraphs
describe PostgreSQL and bash behavior for users of PostgreSQL, and maybe they
are good at that. But for a script or application programmer, the program
interface must be described precisely too, not only the behavior of
PostgreSQL. For instance, the aws cli utility that I use for my archive and
restore commands sometimes returns code 255, for instance when a network fault
prevents it from connecting to S3 (the object get command). I was surprised
that PostgreSQL suddenly stopped in that case; there was nothing in the
documentation about this. So explicitly describing the behavior of PostgreSQL
in terms of script return codes would be useful for script programmers.
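
One way a script author might work around this today is a thin wrapper that
remaps the statuses it considers transient. The sketch below is only an
illustration under assumptions: treating 255 as retryable is my reading of
the aws case above, the choice of 1 as the replacement relies on the
thresholds reported earlier in this thread, and all names are invented.

```python
import subprocess
import sys

# Assumption for illustration: statuses the wrapped tool uses for failures
# worth retrying (255 is the aws cli case described in this thread).
TRANSIENT_STATUSES = frozenset({255})
RETRY_STATUS = 1  # reported above as retried rather than fatal


def map_exit_status(returncode):
    """Translate the wrapped command's result into the status to return."""
    if returncode < 0:
        # subprocess reports death by signal N as -N; keep the shell's
        # 128+N convention so that it is still treated as fatal.
        return 128 - returncode
    if returncode in TRANSIENT_STATUSES:
        return RETRY_STATUS
    # Everything else, including 126 and 127, is passed through unchanged.
    return returncode


def run_wrapped(argv):
    """Run the real command and return the mapped exit status."""
    return map_exit_status(subprocess.run(argv, check=False).returncode)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: wrapper.py REAL_COMMAND [ARGS...]
    sys.exit(run_wrapped(sys.argv[1:]))
```

The archive or restore command would then call the wrapper followed by the
real command and its arguments. Statuses 126 and 127 are deliberately left
alone so that a missing or non-executable command still stops recovery.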