Re: Failback to old master

From: Ants Aasma
Subject: Re: Failback to old master
Msg-id: CA+CSw_u0kw9ZdzMTxm5o6B1G0qKFP7=mFyG91h2wWdq2vLb_-w@mail.gmail.com
In response to: Re: Failback to old master  ("Maeldron T." <maeldron@gmail.com>)
Responses: Re: Failback to old master  (Heikki Linnakangas <hlinnakangas@vmware.com>)
           Re: Failback to old master  ("Maeldron T." <maeldron@gmail.com>)
List: pgsql-hackers
On Tue, Nov 11, 2014 at 11:52 PM, Maeldron T. <maeldron@gmail.com> wrote:
> As far as I remember (I can’t test it right now but I am 99% sure)
> promoting the slave makes it impossible to connect the old master to
> the new one without making a base_backup. The reason is the timeline
> change. It complains.

A safely shut down master (-m fast is safe) can be safely restarted as
a slave to the newly promoted master. Fast shutdown shuts down all
normal connections, does a shutdown checkpoint and then waits for this
checkpoint to be replicated to all active streaming clients. Promoting
a slave to master creates a timeline switch, which prior to version 9.3
could only be replicated using the archive mechanism. As of version 9.3
you don't need to configure archiving to follow timeline switches; just
add a recovery.conf to the old master to start it up as a slave and it
will fetch everything it needs from the new master.
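
As a rough illustration of the recovery.conf step, here is a small
Python sketch that prints a minimal file for the old master. The host,
port and user are hypothetical placeholders, and the
recovery_target_timeline = 'latest' line is my assumption of what is
needed to follow the new timeline; it is not spelled out above.

```python
def render_recovery_conf(host, port=5432, user="replicator"):
    """Return recovery.conf text for starting the old master as a standby.

    host, port and user are hypothetical placeholders for the new master.
    """
    conninfo = f"host={host} port={port} user={user}"
    lines = [
        "standby_mode = 'on'",
        f"primary_conninfo = '{conninfo}'",
        # Assumption: follow the timeline switch created by the promotion.
        "recovery_target_timeline = 'latest'",
    ]
    return "\n".join(lines) + "\n"


print(render_recovery_conf("new-master.example.com"))
```

The printed text would go into recovery.conf in the old master's data
directory before starting it.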

In case of an unsafe shutdown (crash) it is possible that you have WAL
lying around that was not streamed out to the slave. In this case the
old master will request recovery from a point after the timeline
switch and the new master will reply with an error. So it is safe to
try re-adding a crashed master as a slave, but this might fail.
Success is more likely when the whole operating system went down, as
then it's somewhat likely that any WAL got streamed out before it made
it to disk.
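
The condition described above can be sketched as a comparison of two
WAL positions. This is only a model of the check, not the server's
actual code; the function names and the positions used are
hypothetical. WAL positions are written as two hex numbers separated by
a slash.

```python
def parse_lsn(text):
    """Convert a WAL position written as 'hi/lo' in hex to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def can_follow(old_master_end, switch_point):
    """True if the old master's WAL ends at or before the timeline switch."""
    return parse_lsn(old_master_end) <= parse_lsn(switch_point)


# Hypothetical positions, for illustration only.
print(can_follow("0/3000060", "0/3000060"))  # clean shutdown: True
print(can_follow("0/3000128", "0/3000060"))  # crash with unstreamed WAL: False
```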

In general my suggestion is to avoid slave promotion by removal of
recovery.conf: it's too easy to get confused and end up with
hard-to-diagnose data corruption.

In your example, if B happens to disconnect at WAL position x1 and
remains disconnected while shutdown on A occurs at WAL position x2, it
will be missing the WAL interval A(x1..x2). Now B is restarted as
master from position x1 and generates some new WAL past x2; then A is
restarted as slave and starts streaming at x2, as to the best of its
knowledge that was where things left off. At this point the slave A is
corrupted: it has the x1..x2 changes from A that are not on the master,
and it is also missing some changes that are on the master. Wrong data
and/or crashes ensue.
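
To make the scenario concrete, here is a toy Python model. It treats
the WAL as a list where the index is the position and the value records
which server wrote it. The positions and the stream_from helper are
made up for illustration; real WAL is of course not a list of strings.

```python
def stream_from(master_wal, standby_wal, start):
    """Standby keeps its own records below `start` and copies the rest."""
    return standby_wal[:start] + master_wal[start:]


x1, x2 = 3, 5

# Old master A wrote positions 0..x2-1 before it was shut down.
a_wal = [f"A{i}" for i in range(x2)]

# B disconnected at x1, so it only has positions 0..x1-1.
b_wal = a_wal[:x1]

# B is started as master without a timeline switch and writes past x2.
b_wal = b_wal + [f"B{i}" for i in range(x1, x2 + 2)]

# A comes back as a slave and starts streaming at x2.
a_wal = stream_from(b_wal, a_wal, x2)

# Positions where slave A and master B silently disagree.
mismatch = [i for i, (a, b) in enumerate(zip(a_wal, b_wal)) if a != b]
print(mismatch)  # [3, 4]
```

Nothing in this model raises an error, which is the point: the two
servers disagree on x1..x2 and neither of them notices.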

Always use the promotion mechanism because then you are likely to get
errors when something is screwy. Unfortunately it's still possible to
end up in a corrupted state with no errors, as timeline identifiers
are sequential integers, not GUIDs, but at least it's significantly
harder.
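
A simplified sketch of why sequential identifiers can collide: two
slaves promoted independently from the same timeline arrive at the same
new ID, whereas random identifiers would practically never match. The
"current + 1" rule is a simplification I am assuming for illustration,
and promote_random is a purely hypothetical alternative, not something
PostgreSQL does.

```python
import uuid


def promote_sequential(current_timeline):
    """Simplified model: promotion picks the next integer timeline ID."""
    return current_timeline + 1


def promote_random(_current_timeline):
    """Hypothetical alternative: a random, globally unique timeline ID."""
    return uuid.uuid4()


# Two slaves, B and C, are promoted independently from timeline 1.
tl_b = promote_sequential(1)
tl_c = promote_sequential(1)
print(tl_b, tl_c, tl_b == tl_c)  # 2 2 True

# With random IDs the two histories would be distinguishable.
print(promote_random(1) == promote_random(1))
```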

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


