Thread: pg_ctl restart - behaviour based on wrong instance

pg_ctl restart - behaviour based on wrong instance

From
"Erik Rijkers"
Date:
I am not sure the following pg_ctl behaviour is really a bug, but I find it unexpected enough to
report.

I was testing synchronous replication in a test setup on a single machine.  (After all, one could
have different instances on different arrays, right?  If you think this is an unlikely use-case,
perhaps the following is not important.)

There are two installations of 9.1devel (git as of today):
  primary: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1
  standby: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2

The standby's data_directory is generated by pg_basebackup from vanilla_1.

The problem is the very first run of  pg_ctl restart:

pg_ctl first correctly decides that the standby instance (=vanilla_2) isn't yet running:

pg_ctl: PID file "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid" does
not exist

This is OK and expected.  But then it continues (in the logfile) with:

FATAL:  lock file "postmaster.pid" already exists
HINT:  Is another postmaster (PID 20519) running in data directory
"/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?

So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
start) has happened.

It starts fine when started right away with 'start' instead of  'restart'.

Also, if it has been started once, it will react to 'pg_ctl restart' without the errors.

I'll attach a shell-script, that provokes the error, see the 'restart' on the line with the
comment: 'HERE'

It would seem (see below) that pg_ctl's final decision about the standby (that it has started up)
is wrong; the standby does *not* eventually start.


Below the output of the attached shell script.  (careful - it deletes stuff)
(It still contains some debug lines, but I didn't want to change it too much.)


$ clear; ./split_vanilla.sh

PGPASSFILE=/home/rijkers/.pg_rijkers
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000018'
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000019'
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000019.00000020.backup'
removed `/var/data1/pg_stuff/archive_dir/00000001000000000000001A'
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_basebackup
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin
PGPORT=6564
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_ctl
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
rijkers  20519     1 20 17:19 pts/25   S+     0:00
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/postgres -D
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
rijkers  20521 20519  0 17:19 ?        Ss     0:00  \_ postgres: writer process
rijkers  20522 20519  0 17:19 ?        Ss     0:00  \_ postgres: wal writer process
rijkers  20523 20519  0 17:19 ?        Ss     0:00  \_ postgres: autovacuum launcher process
rijkers  20524 20519  0 17:19 ?        Ss     0:00  \_ postgres: archiver process
rijkers  20525 20519  0 17:19 ?        Ss     0:00  \_ postgres: stats collector process

BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin
PGPORT=6664
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin/pg_ctl
pg_ctl: PID file "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid" does
not exist
Is server running?
starting server anyway
waiting for server to start............................................................... done
server started

-- logfile 1:
LOG:  database system is shut down
LOG:  database system was shut down at 2011-03-18 17:19:54 CET
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

-- logfile 2:
LOG:  shutting down
LOG:  database system is shut down
FATAL:  lock file "postmaster.pid" already exists
HINT:  Is another postmaster (PID 20519) running in data directory
"/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?




thanks,

Erik Rijkers

Attachments

Re: pg_ctl restart - behaviour based on wrong instance

From
Robert Haas
Date:
On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> This is OK and expected.  But then it continues (in the logfile) with:
>
> FATAL:  lock file "postmaster.pid" already exists
> HINT:  Is another postmaster (PID 20519) running in data directory
> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>
> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
> start) has happened.

I'm guessing that leftover postmaster.pid contents might be
responsible for this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pg_ctl restart - behaviour based on wrong instance

From
Fujii Masao
Date:
On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> This is OK and expected.  But then it continues (in the logfile) with:
>>
>> FATAL:  lock file "postmaster.pid" already exists
>> HINT:  Is another postmaster (PID 20519) running in data directory
>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>>
>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
>> start) has happened.
>
> I'm guessing that leftover postmaster.pid contents might be
> responsible for this?

The cause is that "pg_ctl restart" uses the postmaster.opts which was
created in the primary. Since its content was something like
"pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
wrongly.

The simple workaround is to exclude postmaster.opts from the backup
as well as postmaster.pid. But when postmaster.opts doesn't exist,
"pg_ctl restart" cannot start up the server. We might also need to change
the code of "pg_ctl restart" so that it does just "pg_ctl start" when
postmaster.opts doesn't exist.
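
A minimal sketch of that workaround, applied after the copy rather than during
the backup. It assumes a hypothetical scratch directory (STANDBY_DATA) standing
in for the freshly copied standby data directory, and the file contents written
here are only illustrative, not what a real postmaster records:

```shell
#!/bin/sh
# Sketch only. STANDBY_DATA is a hypothetical scratch directory that stands in
# for a standby data directory freshly copied from the primary.
STANDBY_DATA=$(mktemp -d)

# Simulate the two files inherited from the primary. The contents are
# illustrative; the real files are written by the primary's postmaster.
echo 20519 > "$STANDBY_DATA/postmaster.pid"
echo '/path/to/vanilla_1/bin/postgres "-D" "/path/to/vanilla_1/data"' \
    > "$STANDBY_DATA/postmaster.opts"

# Remove the files that describe the primary's postmaster, so that nothing
# in the standby directory points back at the primary.
rm -f "$STANDBY_DATA/postmaster.pid" "$STANDBY_DATA/postmaster.opts"

# The first launch would then use 'start' rather than 'restart', e.g.:
#   pg_ctl -D "$STANDBY_DATA" start
```

The pg_ctl call is left commented out because, as noted above, 'restart'
cannot start the server once postmaster.opts is gone; 'start' is the one that
worked in the original report.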

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_ctl restart - behaviour based on wrong instance

From
Robert Haas
Date:
On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
>>> This is OK and expected.  But then it continues (in the logfile) with:
>>>
>>> FATAL:  lock file "postmaster.pid" already exists
>>> HINT:  Is another postmaster (PID 20519) running in data directory
>>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>>>
>>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
>>> start) has happened.
>>
>> I'm guessing that leftover postmaster.pid contents might be
>> responsible for this?
>
> The cause is that "pg_ctl restart" uses the postmaster.opts which was
> created in the primary. Since its content was something like
> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> wrongly.
>
> The simple workaround is to exclude postmaster.opts from the backup
> as well as postmaster.pid. But when postmaster.opts doesn't exist,
> "pg_ctl restart" cannot start up the server. We might also need to change
> the code of "pg_ctl restart" so that it does just "pg_ctl start" when
> postmaster.opts doesn't exist.

Sounds reasonable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pg_ctl restart - behaviour based on wrong instance

From
Bruce Momjian
Date:
Fujii Masao wrote:
> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> >> This is OK and expected.  But then it continues (in the logfile) with:
> >>
> >> FATAL:  lock file "postmaster.pid" already exists
> >> HINT:  Is another postmaster (PID 20519) running in data directory
> >> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
> >>
> >> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
> >> start) has happened.
> >
> > I'm guessing that leftover postmaster.pid contents might be
> > responsible for this?
> 
> The cause is that "pg_ctl restart" uses the postmaster.opts which was
> created in the primary. Since its content was something like
> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> wrongly.

FYI, my "The Magic of Hot Streaming Replication" talk shows this exact
issue on slide 16:

  http://momjian.us/main/presentations/features.html#hot_streaming

  Remove /data2/postmaster.pid so the standby server does not see the
  primary server's pid as its own:

    rm /u/pg/data2/postmaster.pid

This is because my demo creates the standby on the same machine as the
master so the pid is still valid and owned by 'postgres', which is what
the user is reporting.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: pg_ctl restart - behaviour based on wrong instance

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> >>> This is OK and expected.  But then it continues (in the logfile) with:
> >>>
> >>> FATAL:  lock file "postmaster.pid" already exists
> >>> HINT:  Is another postmaster (PID 20519) running in data directory
> >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
> >>>
> >>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
> >>> start) has happened.
> >>
> >> I'm guessing that leftover postmaster.pid contents might be
> >> responsible for this?
> >
> > The cause is that "pg_ctl restart" uses the postmaster.opts which was
> > created in the primary. Since its content was something like
> > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> > wrongly.
> >
> > The simple workaround is to exclude postmaster.opts from the backup
> > as well as postmaster.pid. But when postmaster.opts doesn't exist,
> > "pg_ctl restart" cannot start up the server. We might also need to change
> > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
> > postmaster.opts doesn't exist.
> 
> Sounds reasonable.

Has this been handled?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: pg_ctl restart - behaviour based on wrong instance

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
> >>> This is OK and expected.  But then it continues (in the logfile) with:
> >>>
> >>> FATAL:  lock file "postmaster.pid" already exists
> >>> HINT:  Is another postmaster (PID 20519) running in data directory
> >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
> >>>
> >>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
> >>> start) has happened.
> >>
> >> I'm guessing that leftover postmaster.pid contents might be
> >> responsible for this?
> >
> > The cause is that "pg_ctl restart" uses the postmaster.opts which was
> > created in the primary. Since its content was something like
> > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
> > wrongly.
> >
> > The simple workaround is to exclude postmaster.opts from the backup
> > as well as postmaster.pid. But when postmaster.opts doesn't exist,
> > "pg_ctl restart" cannot start up the server. We might also need to change
> > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
> > postmaster.opts doesn't exist.
>
> Sounds reasonable.

I looked over this issue and I don't think having pg_ctl restart fall
back to 'start' is a good solution.  I am concerned about cases where we
start a different server without shutting down the old server, for some
reason.  When they say 'restart', I think we have to assume they want a
restart.

What I did do was to document that not backing up postmaster.pid and
postmaster.opts might help prevent pg_ctl from getting confused.

Patch applied and backpatched to 9.1.X.
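
As a rough illustration of the exclusion that the documentation change
suggests, the following sketch archives a data directory with tar while
leaving out the two files. SRC, ARCHIVE and the file contents are hypothetical
stand-ins, and this shows only the exclusion itself, not a complete base
backup procedure:

```shell
#!/bin/sh
# Sketch only. SRC and ARCHIVE are hypothetical scratch paths; SRC/data stands
# in for the running cluster's data directory.
SRC=$(mktemp -d)
ARCHIVE="$SRC/base.tar"
mkdir "$SRC/data"
echo 9.1 > "$SRC/data/PG_VERSION"
echo 20519 > "$SRC/data/postmaster.pid"
echo 'postgres "-D" "/path/to/data"' > "$SRC/data/postmaster.opts"

# Archive the data directory, leaving out the two files that describe the
# currently running postmaster rather than the one that will use the backup.
tar -cf "$ARCHIVE" \
    --exclude='postmaster.pid' --exclude='postmaster.opts' \
    -C "$SRC" data

# List what actually went into the archive.
LISTING=$(tar -tf "$ARCHIVE")
echo "$LISTING"
```

Note that the exclude patterns match by file name, so a file called
postmaster.pid or postmaster.opts anywhere under the archived tree would be
skipped as well.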

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
new file mode 100644
index b8daedc..737355a
*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
*************** SELECT pg_stop_backup();
*** 869,875 ****
      of mistakes when restoring.  This is easy to arrange if
      <filename>pg_xlog/</> is a symbolic link pointing to someplace outside
      the cluster directory, which is a common setup anyway for performance
!     reasons.
     </para>

     <para>
--- 869,879 ----
      of mistakes when restoring.  This is easy to arrange if
      <filename>pg_xlog/</> is a symbolic link pointing to someplace outside
      the cluster directory, which is a common setup anyway for performance
!     reasons.  You might also want to exclude <filename>postmaster.pid</>
!     and <filename>postmaster.opts</>, which record information
!     about the running <application>postmaster</>, not about the
!     <application>postmaster</> which will eventually use this backup.
!     (These files can confuse <application>pg_ctl</>.)
     </para>

     <para>

Re: pg_ctl restart - behaviour based on wrong instance

From
Magnus Hagander
Date:
On Tue, Oct 11, 2011 at 23:35, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> >>> This is OK and expected.  But then it continues (in the logfile) with:
>> >>>
>> >>> FATAL:  lock file "postmaster.pid" already exists
>> >>> HINT:  Is another postmaster (PID 20519) running in data directory
>> >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
>> >>>
>> >>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
>> >>> start) has happened.
>> >>
>> >> I'm guessing that leftover postmaster.pid contents might be
>> >> responsible for this?
>> >
>> > The cause is that "pg_ctl restart" uses the postmaster.opts which was
>> > created in the primary. Since its content was something like
>> > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
>> > wrongly.
>> >
>> > The simple workaround is to exclude postmaster.opts from the backup
>> > as well as postmaster.pid. But when postmaster.opts doesn't exist,
>> > "pg_ctl restart" cannot start up the server. We might also need to change
>> > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
>> > postmaster.opts doesn't exist.
>>
>> Sounds reasonable.
>
> I looked over this issue and I don't think having pg_ctl restart fall
> back to 'start' is a good solution.  I am concerned about cases where we
> start a different server without shutting down the old server, for some
> reason.  When they say 'restart', I think we have to assume they want a
> restart.
>
> What I did do was to document that not backing up postmaster.pid and
> postmaster.opts might help prevent pg_ctl from getting confused.

Should we exclude postmaster.opts from streaming base backups? We
already exclude postmaster.pid...


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: pg_ctl restart - behaviour based on wrong instance

From
Bruce Momjian
Date:
Magnus Hagander wrote:
> > I looked over this issue and I don't think having pg_ctl restart fall
> > back to 'start' is a good solution.  I am concerned about cases where we
> > start a different server without shutting down the old server, for some
> > reason.  When they say 'restart', I think we have to assume they want a
> > restart.
> >
> > What I did do was to document that not backing up postmaster.pid and
> > postmaster.opts might help prevent pg_ctl from getting confused.
> 
> Should we exclude postmaster.opts from streaming base backups? We
> already exclude postmaster.pid...

Uh, I think so, unless my analysis was wrong.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: pg_ctl restart - behaviour based on wrong instance

From
Magnus Hagander
Date:
On Wednesday, October 12, 2011, Bruce Momjian wrote:
> Magnus Hagander wrote:
> > > > I looked over this issue and I don't think having pg_ctl restart fall
> > > > back to 'start' is a good solution.  I am concerned about cases where we
> > > > start a different server without shutting down the old server, for some
> > > > reason.  When they say 'restart', I think we have to assume they want a
> > > > restart.
> > > >
> > > > What I did do was to document that not backing up postmaster.pid and
> > > > postmaster.opts might help prevent pg_ctl from getting confused.
> > >
> > > Should we exclude postmaster.opts from streaming base backups? We
> > > already exclude postmaster.pid...
> >
> > Uh, I think so, unless my analysis was wrong.


Ok, fixed and applied.

//Magnus



--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: pg_ctl restart - behaviour based on wrong instance

From
Fujii Masao
Date:
On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Ok, fixed and applied.

You seem to have forgotten to change protocol.sgml.
Patch attached.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments

Re: pg_ctl restart - behaviour based on wrong instance

From
Fujii Masao
Date:
Oh, sorry for the repeated posts. Gmail seems not to have worked
properly... :(

On Wed, Oct 19, 2011 at 1:24 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Ok, fixed and applied.
>
> You seem to have forgot to change protocol.sgml.
> Patch attached.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_ctl restart - behaviour based on wrong instance

From
Robert Haas
Date:
On Tue, Oct 18, 2011 at 12:18 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Ok, fixed and applied.
>
> You seem to have forgot to change protocol.sgml.
> Patch attached.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company