Discussion: pg_ctl restart - behaviour based on wrong instance
I am not sure the following pg_ctl behaviour is really a bug, but I find it unexpected enough to report.

I was testing synchronous replication in a test setup on a single machine. (After all, one could have different instances on different arrays, right? If you think this is an unlikely use-case, perhaps the following is not important.)

There are two installations of 9.1devel (git as of today):

primary: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1
standby: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2

The standby's data_directory is generated by pg_basebackup from vanilla_1.

The problem is the very first run of pg_ctl restart: pg_ctl first correctly decides that the standby instance (=vanilla_2) isn't yet running:

pg_ctl: PID file "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid" does not exist

This is OK and expected. But then it continues (in the logfile) with:

FATAL: lock file "postmaster.pid" already exists
HINT: Is another postmaster (PID 20519) running in data directory "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?

So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl start) has happened.

It starts fine when started right away with 'start' instead of 'restart'. Also, if it has been started once, it will react to 'pg_ctl restart' without the errors.

I'll attach a shell script that provokes the error; see the 'restart' on the line with the comment 'HERE'.

It would seem (see below) that pg_ctl's final decision about the standby (that it has started up) is wrong; the standby does *not* eventually start.

Below is the output of the attached shell script. (Careful - it deletes stuff.) (It still contains some debug lines, but I didn't want to change it too much.)

$ clear; ./split_vanilla.sh
PGPASSFILE=/home/rijkers/.pg_rijkers
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000018'
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000019'
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000019.00000020.backup'
removed `/var/data1/pg_stuff/archive_dir/00000001000000000000001A'
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_basebackup
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin
PGPORT=6564
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_ctl
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
UID       PID   PPID  C STIME TTY    STAT TIME CMD
rijkers   20519     1 20 17:19 pts/25 S+   0:00 /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/postgres -D /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
rijkers   20521 20519  0 17:19 ?      Ss   0:00 \_ postgres: writer process
rijkers   20522 20519  0 17:19 ?      Ss   0:00 \_ postgres: wal writer process
rijkers   20523 20519  0 17:19 ?      Ss   0:00 \_ postgres: autovacuum launcher process
rijkers   20524 20519  0 17:19 ?      Ss   0:00 \_ postgres: archiver process
rijkers   20525 20519  0 17:19 ?      Ss   0:00 \_ postgres: stats collector process
BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin
PGPORT=6664
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin/pg_ctl
pg_ctl: PID file "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid" does not exist
Is server running?
starting server anyway
waiting for server to start............................................................... done
server started

-- logfile 1:
LOG: database system is shut down
LOG: database system was shut down at 2011-03-18 17:19:54 CET
LOG: autovacuum launcher started
LOG: database system is ready to accept connections

-- logfile 2:
LOG: shutting down
LOG: database system is shut down
FATAL: lock file "postmaster.pid" already exists
HINT: Is another postmaster (PID 20519) running in data directory "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?

thanks,

Erik Rijkers
Attachments
On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> This is OK and expected. But then it continues (in the logfile) with:
>
> FATAL: lock file "postmaster.pid" already exists
> HINT: Is another postmaster (PID 20519) running in data directory
> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>
> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
> start) has happened.

I'm guessing that leftover postmaster.pid contents might be responsible for this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> This is OK and expected. But then it continues (in the logfile) with:
>>
>> FATAL: lock file "postmaster.pid" already exists
>> HINT: Is another postmaster (PID 20519) running in data directory
>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>>
>> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
>> start) has happened.
>
> I'm guessing that leftover postmaster.pid contents might be
> responsible for this?

The cause is that "pg_ctl restart" uses the postmaster.opts which was
created in the primary. Since its content was something like
"pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
wrongly.

The simple workaround is to exclude postmaster.opts from the backup
as well as postmaster.pid. But when postmaster.opts doesn't exist,
"pg_ctl restart" cannot start up the server. We might also need to change
the code of "pg_ctl restart" so that it does just "pg_ctl start" when
postmaster.opts doesn't exist.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
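[Editorial sketch, not part of the original message.] The diagnosis above can be checked by hand before running "pg_ctl restart" on a freshly copied standby. The following is a minimal Python sketch under stated assumptions: it assumes postmaster.opts holds a single shell-style command line containing a separate -D argument, and the function names data_dir_from_opts and opts_points_elsewhere are hypothetical helpers, not anything shipped with PostgreSQL.

```python
import shlex
from pathlib import Path


def data_dir_from_opts(opts_text):
    """Return the value that follows -D in a recorded command line, or None.

    Assumption: the file content is one shell-style command line.
    """
    args = shlex.split(opts_text)
    for i, arg in enumerate(args[:-1]):
        if arg == "-D":
            return args[i + 1]
    return None


def opts_points_elsewhere(data_dir):
    """True if postmaster.opts inside data_dir names a different data directory."""
    opts_file = Path(data_dir) / "postmaster.opts"
    if not opts_file.exists():
        return False
    recorded = data_dir_from_opts(opts_file.read_text())
    if recorded is None:
        return False
    # A relative recorded path is resolved against the current working directory.
    return Path(recorded).resolve() != Path(data_dir).resolve()
```

If opts_points_elsewhere() returns True for the standby's data directory, the file was most likely copied from the primary, which matches the situation described above.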
On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
>>> This is OK and expected. But then it continues (in the logfile) with:
>>>
>>> FATAL: lock file "postmaster.pid" already exists
>>> HINT: Is another postmaster (PID 20519) running in data directory
>>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>>>
>>> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
>>> start) has happened.
>>
>> I'm guessing that leftover postmaster.pid contents might be
>> responsible for this?
>
> The cause is that "pg_ctl restart" uses the postmaster.opts which was
> created in the primary. Since its content was something like
> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> wrongly.
>
> The simple workaround is to exclude postmaster.opts from the backup
> as well as postmaster.pid. But when postmaster.opts doesn't exist,
> "pg_ctl restart" cannot start up the server. We might also need to change
> the code of "pg_ctl restart" so that it does just "pg_ctl start" when
> postmaster.opts doesn't exist.

Sounds reasonable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Fujii Masao wrote:
> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> >> This is OK and expected. But then it continues (in the logfile) with:
> >>
> >> FATAL: lock file "postmaster.pid" already exists
> >> HINT: Is another postmaster (PID 20519) running in data directory
> >> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
> >>
> >> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
> >> start) has happened.
> >
> > I'm guessing that leftover postmaster.pid contents might be
> > responsible for this?
>
> The cause is that "pg_ctl restart" uses the postmaster.opts which was
> created in the primary. Since its content was something like
> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> wrongly.

FYI, my talk "The Magic of Hot Streaming Replication" shows this exact issue on slide 16:

    http://momjian.us/main/presentations/features.html#hot_streaming

    Remove /data2/postmaster.pid so the standby server does not see the
    primary server's pid as its own:

    rm /u/pg/data2/postmaster.pid

This is because my demo creates the standby on the same machine as the
master, so the pid is still valid and owned by 'postgres', which is what
the user is reporting.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
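[Editorial sketch, not part of the original message.] The manual "rm" step from the slide can be scripted. The Python sketch below is only an illustration: the helper name remove_stale_control_files is hypothetical, and including postmaster.opts alongside postmaster.pid follows the workaround suggested earlier in this thread rather than the slide itself. It is meant only for a freshly restored data directory on which no server has been started yet.

```python
from pathlib import Path

# File names taken from the thread; both describe the server the copy was taken from.
STALE_FILES = ("postmaster.pid", "postmaster.opts")


def remove_stale_control_files(standby_data_dir):
    """Delete copied postmaster.pid / postmaster.opts and return the names removed.

    Assumption: no postmaster is running on standby_data_dir.
    """
    removed = []
    for name in STALE_FILES:
        path = Path(standby_data_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling it a second time on the same directory returns an empty list, so it is safe to run more than once.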
Robert Haas wrote:
> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> >>> This is OK and expected. But then it continues (in the logfile) with:
> >>>
> >>> FATAL: lock file "postmaster.pid" already exists
> >>> HINT: Is another postmaster (PID 20519) running in data directory
> >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
> >>>
> >>> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
> >>> start) has happened.
> >>
> >> I'm guessing that leftover postmaster.pid contents might be
> >> responsible for this?
> >
> > The cause is that "pg_ctl restart" uses the postmaster.opts which was
> > created in the primary. Since its content was something like
> > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> > wrongly.
> >
> > The simple workaround is to exclude postmaster.opts from the backup
> > as well as postmaster.pid. But when postmaster.opts doesn't exist,
> > "pg_ctl restart" cannot start up the server. We might also need to change
> > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
> > postmaster.opts doesn't exist.
>
> Sounds reasonable.

Has this been handled?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Robert Haas wrote:
> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> >>> This is OK and expected. But then it continues (in the logfile) with:
> >>>
> >>> FATAL: lock file "postmaster.pid" already exists
> >>> HINT: Is another postmaster (PID 20519) running in data directory
> >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
> >>>
> >>> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
> >>> start) has happened.
> >>
> >> I'm guessing that leftover postmaster.pid contents might be
> >> responsible for this?
> >
> > The cause is that "pg_ctl restart" uses the postmaster.opts which was
> > created in the primary. Since its content was something like
> > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> > wrongly.
> >
> > The simple workaround is to exclude postmaster.opts from the backup
> > as well as postmaster.pid. But when postmaster.opts doesn't exist,
> > "pg_ctl restart" cannot start up the server. We might also need to change
> > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
> > postmaster.opts doesn't exist.
>
> Sounds reasonable.

I looked over this issue and I don't think having pg_ctl restart fall
back to 'start' is a good solution. I am concerned about cases where we
start a different server without shutting down the old server, for some
reason. When they say 'restart', I think we have to assume they want a
restart.

What I did do was to document that not backing up postmaster.pid and
postmaster.opts might help prevent pg_ctl from getting confused. Patch
applied and backpatched to 9.1.X.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
new file mode 100644
index b8daedc..737355a
*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
*************** SELECT pg_stop_backup();
*** 869,875 ****
      of mistakes when restoring. This is easy to arrange if
      <filename>pg_xlog/</> is a symbolic link pointing to someplace outside
      the cluster directory, which is a common setup anyway for performance
!     reasons.
     </para>

     <para>
--- 869,879 ----
      of mistakes when restoring. This is easy to arrange if
      <filename>pg_xlog/</> is a symbolic link pointing to someplace outside
      the cluster directory, which is a common setup anyway for performance
!     reasons. You might also want to exclude <filename>postmaster.pid</>
!     and <filename>postmaster.opts</>, which record information
!     about the running <application>postmaster</>, not about the
!     <application>postmaster</> which will eventually use this backup.
!     (These files can confuse <application>pg_ctl</>.)
     </para>

     <para>
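[Editorial sketch, not part of the original message.] The documentation change above recommends leaving postmaster.pid and postmaster.opts out of a file-system backup. As a rough illustration of only that exclusion step, the Python sketch below copies a directory tree while skipping both files. The function name copy_data_dir_without_control_files is hypothetical, and the sketch is not a complete backup procedure: the surrounding steps described in backup.sgml (such as pg_start_backup and pg_stop_backup) are assumed to be handled elsewhere.

```python
import shutil

# The two files named in the documentation patch.
EXCLUDED = ("postmaster.pid", "postmaster.opts")


def copy_data_dir_without_control_files(src, dst):
    """Copy src to dst (dst must not exist yet), skipping the excluded files."""
    return shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*EXCLUDED))
```

Because shutil.ignore_patterns matches on file names at every directory level, a file with one of these names in a subdirectory would be skipped as well.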
On Tue, Oct 11, 2011 at 23:35, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> >>> This is OK and expected. But then it continues (in the logfile) with:
>> >>>
>> >>> FATAL: lock file "postmaster.pid" already exists
>> >>> HINT: Is another postmaster (PID 20519) running in data directory
>> >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>> >>>
>> >>> So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
>> >>> start) has happened.
>> >>
>> >> I'm guessing that leftover postmaster.pid contents might be
>> >> responsible for this?
>> >
>> > The cause is that "pg_ctl restart" uses the postmaster.opts which was
>> > created in the primary. Since its content was something like
>> > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
>> > wrongly.
>> >
>> > The simple workaround is to exclude postmaster.opts from the backup
>> > as well as postmaster.pid. But when postmaster.opts doesn't exist,
>> > "pg_ctl restart" cannot start up the server. We might also need to change
>> > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
>> > postmaster.opts doesn't exist.
>>
>> Sounds reasonable.
>
> I looked over this issue and I don't think having pg_ctl restart fall
> back to 'start' is a good solution. I am concerned about cases where we
> start a different server without shutting down the old server, for some
> reason. When they say 'restart', I think we have to assume they want a
> restart.
>
> What I did do was to document that not backing up postmaster.pid and
> postmaster.opts might help prevent pg_ctl from getting confused.

Should we exclude postmaster.opts from streaming base backups? We
already exclude postmaster.pid...

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Magnus Hagander wrote:
> > I looked over this issue and I don't think having pg_ctl restart fall
> > back to 'start' is a good solution. I am concerned about cases where we
> > start a different server without shutting down the old server, for some
> > reason. When they say 'restart', I think we have to assume they want a
> > restart.
> >
> > What I did do was to document that not backing up postmaster.pid and
> > postmaster.opts might help prevent pg_ctl from getting confused.
>
> Should we exclude postmaster.opts from streaming base backups? We
> already exclude postmaster.pid...

Uh, I think so, unless my analysis was wrong.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On Wednesday, October 12, 2011, Bruce Momjian wrote:
> Magnus Hagander wrote:
> > > I looked over this issue and I don't think having pg_ctl restart fall
> > > back to 'start' is a good solution. I am concerned about cases where we
> > > start a different server without shutting down the old server, for some
> > > reason. When they say 'restart', I think we have to assume they want a
> > > restart.
> > >
> > > What I did do was to document that not backing up postmaster.pid and
> > > postmaster.opts might help prevent pg_ctl from getting confused.
> >
> > Should we exclude postmaster.opts from streaming base backups? We
> > already exclude postmaster.pid...
>
> Uh, I think so, unless my analysis was wrong.

Ok, fixed and applied.
//Magnus
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Ok, fixed and applied.

You seem to have forgotten to change protocol.sgml.
Patch attached.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments
Oh, sorry for repeating the same posts. Gmail seems not to have worked properly... :(

On Wed, Oct 19, 2011 at 1:24 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Ok, fixed and applied.
>
> You seem to have forgot to change protocol.sgml.
> Patch attached.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Oct 18, 2011 at 12:18 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Ok, fixed and applied.
>
> You seem to have forgot to change protocol.sgml.
> Patch attached.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company