Re: Failback to old master

From: Maeldron T.
Subject: Re: Failback to old master
Msg-id: CAKatfSmMyjBQTtOQKkWOb0TL2LBYUW5p0wJDq_COnSn7NkyTrA@mail.gmail.com
In reply to: Re: Failback to old master (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Failback to old master
List: pgsql-hackers

Thank you, Robert.

I thought that removing the recovery.conf file makes the slave a master only after the slave is restarted (unlike creating a trigger_file). Isn't this true?

I also thought that if the original master crashed and had applied WAL entries to itself that are not present on the slave, then it would throw an error when I try to connect it to the new master (the old slave).

It would be nice to know, as creating a base_backup takes a long time.

As for the other case, when there was no crash: being able to safely swap the master and the slave twice without creating base_backups would make upgrading the OS much easier (with only a couple of seconds of downtime).

I am afraid to try it in production until someone confirms that it's safe. It seems to work, though (but I don't like to bet).

M.

2014-10-29 15:41 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
On Wed, Oct 29, 2014 at 6:21 AM, Maeldron T. <maeldron@gmail.com> wrote:
> I swear I have read a couple of old threads. Yet I am not sure if it is safe to
> failback to the old master in case of async replication without base backup.
>
> Considering:
> I have the latest 9.3 server
> A: master
> B: slave
> B is actively connected to A
>
> I shut down A manually with -m fast (it's the default FreeBSD init script
> setting)
> I remove the recovery.conf from B
> I restart B
> I create a recovery.conf on A
> I start A
> I see nothing wrong in the logs
> I go for a lunch
> I shut down B
> I remove the recovery.conf on A
> I restart A
> I restore the recovery.conf on B
> I start B
> I see nothing wrong in the logs and I see that replication is working
>
> Can I say that my data is safe in this case?
>
> If the answer is yes, is it safe to do this if there was a power outage on A
> instead of manual shutdown? Considering that the log says nothing wrong. (Of
> course if it complains I'd do base backup from B).

The threshold question here is whether the original master might have
written (and thus, perhaps, applied) write-ahead log records that were
not replayed on the slave.  If A crashed, that is definitely possible,
so this is definitely not safe.  If A was shut down cleanly, then
streaming replication *should* take everything up through the shutdown
checkpoint and replicate those to the standby, which *should* replay
them.  If all goes according to plan, I think this will work.

I'm not sure we really have enough safeties to make this robust,
though: for example, at the point when the shutdown checkpoint is
written, I believe that the master is no longer accepting new
connections - so if the connection to the slave is broken before the
shutdown checkpoint record is replicated, then it's not safe any more,
but how will we detect that?  And, if you remove recovery.conf on the
slave, it will abort replay and enter normal running as soon as it
reaches what it thinks is end-of-WAL, with no cross-check to make sure
that's really the same point that the master was actually at.  So
it strikes me that it might be quite difficult to really have
confidence that nothing will go wrong.
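
One manual cross-check for the clean-shutdown case can be sketched as follows. This is only an illustration under assumptions, not a safeguard that PostgreSQL performs and not something the thread confirms as sufficient. It assumes you run pg_controldata on the stopped old master and read its "Latest checkpoint location", and that you run SELECT pg_last_xlog_replay_location() on the standby before promoting it. It also assumes that the replay location reports the end of the last replayed record, so a standby that has replayed the shutdown checkpoint record should be strictly past the checkpoint's start position. The function names below are hypothetical helpers.

```python
def parse_lsn(text):
    """Convert an LSN such as '0/3000060' into an integer byte position."""
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def checkpoint_location(controldata_output):
    """Extract the 'Latest checkpoint location' value from pg_controldata text."""
    for line in controldata_output.splitlines():
        label, _, value = line.partition(":")
        if label.strip() == "Latest checkpoint location":
            return value.strip()
    raise ValueError("no 'Latest checkpoint location' line found")


def standby_passed_checkpoint(controldata_output, standby_replay_lsn):
    """Return True if the standby replayed past the start of the master's
    last checkpoint record (assumption: the shutdown checkpoint)."""
    checkpoint = parse_lsn(checkpoint_location(controldata_output))
    return parse_lsn(standby_replay_lsn) > checkpoint
```

A False result means the standby is missing WAL that the old master wrote, so a fresh base backup would be the cautious choice. A True result is only a necessary condition; it is also worth confirming that pg_controldata reports the cluster state as "shut down" rather than a crashed state.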

I'm definitely not the expert in this area on this mailing list, so
I'm curious what others think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
