Re: unable to fail over to warm standby server

From Mason Hale
Subject Re: unable to fail over to warm standby server
Date
Msg-id 1e85dd391001280703l4c13e231m77e50e2630f34975@mail.gmail.com
In reply to Re: unable to fail over to warm standby server  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: unable to fail over to warm standby server  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-bugs
Hello Heikki --

Thank you for investigating this issue and clearing up this mystery.
I do not believe it is obvious that the postgres process needs to be able to
remove the trigger file.

My naive assumption was that the trigger file was merely a flag to signal
that recovery mode needed to be stopped. If I were to guess what those steps
would be, I would assume the following:

   - detect the presence of the trigger file
   - stop the postgres process safely (e.g. pg_ctl ... stop)
   - rename recovery.conf to recovery.done
   - restart the postgres process (e.g. pg_ctl ... start)

It is not obvious that the trigger file needs to be removed.
And if permissions prevent it from being removed, the last thing that should
happen is for the database to become corrupted.

At minimum the pg_standby documentation should make this requirement clear.
I suggest language to the effect of the following:

> Note: it is critical that the trigger file be created with permissions that
> allow the postgres process to remove the file. Generally this is best done by
> creating the file from the postgres user account. Data corruption may result
> if the trigger file permissions prevent deletion of the trigger file.


Of course the best solution is to avoid this issue entirely. Something as
easy to miss as file permissions should not cause data corruption,
especially in the process meant to fail over from a crashing primary
database.

thanks,

Mason Hale
http://www.onespot.com
direct +1 800.618.0768 ext 701



On Thu, Jan 28, 2010 at 3:49 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

> Mason Hale wrote:
> >  ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permittedtrigger file found
> >
> >  ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
> >
> > This file was not looked at until after the attempt to recover was
> > aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> > problem,
> > but we don't see how that would explain the error messages, which seem
> > to indicate that data on the standby server was corrupted.
>
> Yes, that permission problem seems to be the root cause of the troubles.
> If pg_standby fails to remove the trigger file, it exit()s with whatever
> return code the unlink() call returned:
>
> >               /*
> >                * If trigger file found, we *must* delete it. Here's why: When
> >                * recovery completes, we will be asked again for the same file from
> >                * the archive using pg_standby so must remove trigger file so we can
> >                * reload file again and come up correctly.
> >                */
> >               rc = unlink(triggerPath);
> >               if (rc != 0)
> >               {
> >                       fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> >                       fflush(stderr);
> >                       exit(rc);
> >               }
>
> unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
> the range of normal return codes, and apparently gets mangled into the
> mysterious 65280 code you saw in the logs. The server treats that as a
> fatal error, and dies.
>
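
The arithmetic behind the 65280 value can be reproduced with a small sketch.
The helper name raw_wait_status_for_exit is made up for illustration. It forks
a child that exits with the given code and returns the raw wait status the
parent receives, which is the same kind of value a caller gets back from
system(). Only the low 8 bits of an exit code are kept, so -1 becomes 255, and
on Linux and most Unix systems the raw status stores that in the high byte:
255 << 8 = 65280.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code` and return the raw wait status
 * seen by the parent, or -1 if fork() or waitpid() fails.
 */
int raw_wait_status_for_exit(int code)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);    /* child: only the low 8 bits of code survive */

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;      /* decode with WIFEXITED() / WEXITSTATUS() */
}
```

Calling raw_wait_status_for_exit(-1) and passing the result to WEXITSTATUS()
yields 255. The child uses _exit() rather than exit() only to avoid flushing
the parent's stdio buffers twice; the resulting status is the same.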
> That seems like a bug in pg_standby, but I'm not sure what it should do
> if the unlink() fails. It could exit with some other exit code, so that
> the server wouldn't die, but the lingering trigger file could cause
> problems, as the comment explains. If it should indeed cause FATAL, it
> should do so in a more robust way than the exit(rc) call above.
>
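
One way to read "a more robust way" is to stop passing unlink()'s return value
to exit() and use a fixed exit code in the 1..255 range instead. The sketch
below is not the actual pg_standby fix; the function name remove_trigger_file
and the constant EXIT_TRIGGER_REMOVE_FAILED (and its value) are assumptions
made for illustration. Whether the server should treat that code as fatal is a
separate policy question this sketch does not settle.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical exit code; any fixed value in 1..255 would serve. */
#define EXIT_TRIGGER_REMOVE_FAILED 2

/*
 * Try to remove the trigger file. Returns 0 on success, or a well-defined
 * exit code on failure instead of unlink()'s raw -1.
 */
int remove_trigger_file(const char *trigger_path)
{
    if (unlink(trigger_path) != 0)
    {
        fprintf(stderr, "ERROR: could not remove \"%s\": %s\n",
                trigger_path, strerror(errno));
        fflush(stderr);
        return EXIT_TRIGGER_REMOVE_FAILED;
    }
    return 0;
}
```

A caller would then write rc = remove_trigger_file(triggerPath); and call
exit(rc) only when rc is nonzero, so the parent always sees an exit code it
can interpret.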
> BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
> delete the trigger file (so that problematic block of code is gone), but
> there's a new recovery_end_command option in recovery.conf instead, where
> you're supposed to put 'rm <triggerfile>'. I think in that
> configuration, the standby would've started up, even though removal of
> the trigger file would've still failed.
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>
