Discussion: Return codes for archive and restore commands
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/archive-recovery-settings.html
Description:

For instance, for the restore command the documentation says:

  It is important for the command to return a zero exit status only if it succeeds. The command will be asked for file names that are not present in the archive; it must return nonzero when so asked. Examples: ... An exception is that if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start up.

(end of quote)

This is not correct. I think that how the behavior of PostgreSQL depends on the return codes of the restore and archive commands must be explained more exactly; this is important for those who write scripts and applications for these commands. For instance, if the AWS command line interface (awscli) is used as the restore command, aws returns code 255 from some commands (for instance, in case of a network fault), and this leads to an unexpected result with PostgreSQL.

For the archive command:

<=128  There are no errors in the PostgreSQL log (messages with severity equal to or higher than ERROR). First come 3 messages of type LOG about the failure, then a WARNING about it and a pause of 1 minute, then the cycle repeats.
>=129  FATAL error in the PostgreSQL log. The message is about stopping the archiver process, but not the database. Repeated after roughly 16 seconds.

For the restore command:

<=125  There are no errors in the PostgreSQL log; the command is repeated after several seconds. Good to return for a network failure or an absent file.
>=126  FATAL error in the PostgreSQL log; the startup process stops and the database shuts down. Good for a fatal error, for instance misconfiguration.

In this case PostgreSQL tries to conform to the rules for return codes of a Unix shell. A Unix shell returns 126 in the case of "command not executable", 127 in the case of "command not found", and 128 + the signal number if the application was interrupted by an uncaught signal.
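The thresholds reported above can be restated as a small lookup. This is only a sketch of the behavior as observed in this report, not a transcription of the PostgreSQL source; the function name `observed_outcome` and the outcome labels are made up for illustration.

```python
# Thresholds as reported in this thread (observations, not documented constants):
# the highest exit code that is still retried without a FATAL message.
OBSERVED_RETRY_LIMIT = {"archive": 128, "restore": 125}


def observed_outcome(command: str, exit_code: int) -> str:
    """Return the outcome this report describes for a command's exit code."""
    if command not in OBSERVED_RETRY_LIMIT:
        raise ValueError("command must be 'archive' or 'restore'")
    if not 0 <= exit_code <= 255:
        raise ValueError("exit codes are in the range 0..255")
    if exit_code == 0:
        return "success"
    if exit_code <= OBSERVED_RETRY_LIMIT[command]:
        return "retried"  # no ERROR-or-higher message in the log
    return "FATAL"        # archiver restarts; for restore, the server stops


if __name__ == "__main__":
    for code in (1, 125, 126, 128, 129, 255):
        print(code, observed_outcome("archive", code), observed_outcome("restore", code))
```

Under these reported thresholds, the exit code 255 mentioned for awscli falls in the FATAL range for both commands.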
On Wed, Nov 28, 2018 at 11:00:31AM +0000, PG Doc comments form wrote:
> For the archive command:
> <=128 There are not errors in the PostgreSQL log (messages with severity
> equal or higher than ERROR). Firstly 3 messages of type LOG about fault,
> then WARNING about this and pause for 1 minute, then repeated.
> >=129 FATAL error in the PostgreSQL log. The message about stopping an archive
> process, but not the database. Repeated after roughly 16 seconds.

This code has been around for some time, and comes from this commit:

commit: 3ad0728c817bf8abd2c76bd11d856967509b307c
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Tue, 21 Nov 2006 20:59:53 +0000
committer: Tom Lane <tgl@sss.pgh.pa.us>
date: Tue, 21 Nov 2006 20:59:53 +0000

    On systems that have setsid(2) (which should be just about everything except
    Windows), arrange for each postmaster child process to be its own process
    group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
    process group not only the direct child process.  This provides saner behavior
    for archive and recovery scripts; in particular, it's possible to shut down a
    warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
    of SIGQUIT to the startup subprocess will result in killing the waiting
    recovery_command.  Also, this makes Query Cancel and statement_timeout apply
    to scripts being run from backends via system().  (There is no support in the
    core backend for that, but it's widely done using untrusted PLs.)  Per gripe
    from Stephen Harris and subsequent discussion.

The relevant part is pgarch_archiveXlog() in pgarch.c, and this comment
is the most relevant:

     * Per the Single Unix Spec, shells report exit status > 128 when a
     * called command died on a signal.

> In this case PostgreSQL tries confirm rules for return codes of a unix
> shell. A unix shell return 126 in the case of "command not executable", 127
> in the case "command not found", 128+# of signal in the case if application
> interrupted by uncatched signal.

If you were to rewrite those paragraphs or make them more precise, how
would you actually shape your suggestions?  I personally quite like the
current formulations, but I am rather used to them, to be honest.
--
Michael
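The shell convention quoted from pgarch.c can be observed directly. The following is a minimal sketch, assuming a POSIX `sh` is on the PATH; the helper name `shell_status` and the nonexistent command path are hypothetical and chosen only for illustration.

```python
import subprocess


def shell_status(command: str) -> int:
    """Run `command` through `sh -c` and return the shell's own exit status."""
    completed = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# An ordinary failure: the command itself chooses the exit code.
plain_failure = shell_status("exit 3")

# "Command not found" is reported by the shell, not by any command.
not_found = shell_status("/nonexistent/hypothetical-command")

# The inner sh kills itself with SIGKILL (signal 9); the outer shell
# then reports 128 + 9 as the status of that command.
killed = shell_status('sh -c "kill -9 \\$\\$"; exit $?')

print(plain_failure, not_found, killed)  # expected on a POSIX shell: 3 127 137
```

A command that deliberately exits with a code above 128, such as the 255 mentioned for awscli, is indistinguishable at this level from one that died on a signal, which is the source of the surprise described in the original comment.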
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> [...]
> If you were to rewrite those paragraphs or make them more precise, how
> would you actually shape your suggestions?  I personally quite like the
> current formulations, but I am rather used to it to be honest.

This is another example, at least imv, of why we really need to move
away from archive_command as an interface for doing WAL archiving.

Having discussed this quite a bit lately with David Steele and Magnus,
it's pretty clear that we need to completely rip out how this works
today and rewrite it based around an extension model where a background
worker can start up and essentially take the place of the archiver
process, with flexibility to jump forward through the WAL stream,
communicate clearly with other processes, handle failure to do so
gracefully based on the specific cases, etc.

We could then possibly write an extension to be included that mimics
what archive_command does today, but imv we should immediately consider
it deprecated and encourage people to move off of it.

Thanks!

Stephen
On Wed, Nov 28, 2018 at 09:39:58PM -0500, Stephen Frost wrote:
> Having discussed this quite a bit lately with David Steele and Magnus,
> it's pretty clear that we need to completely rip out how this works
> today and rewrite it based around an extension model where a background
> worker can start up and essentially take the place of the archiver
> process, with flexibility to jump forward through the WAL stream,
> communicate clearly with other processes, handle failure to do so
> gracefully based on the specific cases, etc.

Hm.  When the instance state is PM_SHUTDOWN_2, the postmaster
explicitly waits for the WAL senders and the archiver to shut down.  So
I think that, to be completely correct, you would need more control over
when such a bgworker is shut down.
--
Michael
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> Hm.  When an instance state is in PM_SHUTDOWN_2, the postmaster
> explicitly waits for the WAL senders and the archiver to shut down.  So
> I think that you would need more control regarding the timing a bgworker
> should be shut down first to be completely correct.

Yes, it couldn't be exactly the same as a generic background worker,
that's a good point.  We definitely need to make sure that the
postmaster waits for the archiver to shut down, as it does for the WAL
senders.

Thanks!

Stephen
On Wed, Nov 28, 2018 at 10:27:31PM -0500, Stephen Frost wrote:
> Yes, it couldn't be exactly the same as a generic background worker,
> that's a good point.  We definitely need to make sure that the
> postmaster waits for the archiver to shut down, as it does for the WAL
> senders.

Just to be clear, please note that I don't think removing the
archiver code from the core code is a bad idea, quite the contrary
actually.  But I doubt that it would be acceptable to rip out this code
without something which has the same properties and guarantees for any
users depending on it.  And archive_command is used a lot.
--
Michael
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> Just to be clear, please note I don't think that what removing the
> archiver code from the core code is a bad idea, quite the contrary
> actually.  But I doubt that it would be acceptable to rip off this code
> without something which has the same properties and guarantees for any
> users depending on it.  And archive_command is used a lot.

Yes, it's used a lot and I tend to agree that we'll need to have
something which replaces it, at least for a while, but we shouldn't let
the fact that it's used a lot (because it's the only option in many
ways...) lead us to think it's actually a good interface which should be
kept forever.

We're growing up here and realizing that the initial implementations of
things around the edges of the core system, while used extensively, need
to be updated to be reliable and resilient, and the previous unreliable
interfaces need to be deprecated and eventually removed, for the benefit
of all of our users who might otherwise think they are as reliable as we
all wish they had been when they were initially implemented.

Thanks!

Stephen
On Thu, Nov 29, 2018 at 5:40 AM Stephen Frost <sfrost@snowman.net> wrote:
> [...]
> This is another example, at least imv, of why we really need to move
> away from archive_command as an interface for doing WAL archiving.

+1

> Having discussed this quite a bit lately with David Steele and Magnus,
> it's pretty clear that we need to completely rip out how this works
> today and rewrite it based around an extension model where a background
> worker can start up and essentially take the place of the archiver
> process, with flexibility to jump forward through the WAL stream,
> communicate clearly with other processes, handle failure to do so
> gracefully based on the specific cases, etc.
>
> We could then possibly write an extension to be included that mimics
> what archive_command does today, but imv we should immediately consider
> it deprecated and encourage people to move off of it.
>
> Thanks!
>
> Stephen

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> If you were to rewrite those paragraphs or make them more precise, how
> would you actually shape your suggestions?  I personally quite like the
> current formulations, but I am rather used to it to be honest.
> --
> Michael

Yes, I am in favor of making them more precise. At present these paragraphs describe the behavior of PostgreSQL and bash for users of PostgreSQL, and perhaps they are good for that purpose. But for a script or application programmer, it is not enough to describe the behavior of PostgreSQL; the program interface must also be described precisely. For instance, the aws cli utility that I use for my archive and restore commands sometimes returns code 255, for example when a network fault prevents it from connecting to S3 (object get command). I was surprised that PostgreSQL suddenly stopped in such a case, and there was nothing in the documentation about this. So explicitly describing the behavior of PostgreSQL in terms of script return codes would be useful for script programmers.
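One way a script author could work around the 255 case is to wrap the real command and translate its exit status before PostgreSQL sees it. The following is a minimal sketch under stated assumptions: the limit of 125 is the threshold observed earlier in this thread rather than a documented constant, and the names `translate_status` and `MAX_RETRYABLE` are hypothetical.

```python
import subprocess
import sys

# Highest exit code reported in this thread as non-fatal for restore_command;
# an observation from the thread, not a documented PostgreSQL constant.
MAX_RETRYABLE = 125


def translate_status(returncode: int, fallback: int = 1) -> int:
    """Map a subprocess return code to the exit code the wrapper should use."""
    if returncode < 0:
        # subprocess reports death by signal N as -N; mirror the shell's 128 + N.
        return 128 - returncode
    if returncode > MAX_RETRYABLE:
        # The tool exited on its own with a high code (e.g. 255): treat as retryable.
        return fallback
    return returncode


def main(argv: list) -> int:
    """Run the wrapped command and return the translated exit code."""
    try:
        completed = subprocess.run(argv)
    except FileNotFoundError:
        return 127  # keep "command not found" in the fatal range
    except PermissionError:
        return 126  # keep "command not executable" in the fatal range
    return translate_status(completed.returncode)


# Only act as a wrapper when a command to run was actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

It could be used, for example, as `restore_command = 'python3 /path/to/restore_wrapper.py aws s3 cp s3://example-bucket/wal/%f %p'`, where the script path and bucket name are placeholders. Whether a high exit code should really be retried is a policy decision for the script author; the sketch only shows where that decision can be made.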