pg9.6 when is a promoted cluster ready to accept "rewind" request?

From	magodo
Subject	pg9.6 when is a promoted cluster ready to accept "rewind" request?
Date	12 November 2018, 08:11:23
Msg-id	3663c1bfe329a2f934301604c56851f029c4c881.camel@sina.com
Replies	Re: pg9.6 when is a promoted cluster ready to accept "rewind" request? (talk to ben <blo.talkto@gmail.com>)
List	pgsql-general

Dear supporters,

I'm writing some scripts to implement manual failover. I have two
clusters (let's say p1 and p2), where one is the primary (e.g. p1) and
the other is the standby (e.g. p2). The manual failover procedure is
straightforward:

1. promote on p2
2. wait for `pg_isready` to succeed on p2
3. rewind on p1
4. prepare a recovery.conf on p1
5. start p1
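
As a rough sketch of how a script might chain the five steps above, here
is a Python version. The data directories, the connection string and the
function names are placeholders made up for illustration, and it assumes
p1 has already been shut down cleanly before the rewind.

```python
import os
import subprocess
import time


def failover_commands(old_primary_dir, new_primary_dir, new_primary_conninfo):
    """Return the commands for steps 1, 2, 3 and 5, in that order."""
    return [
        ["pg_ctl", "promote", "-D", new_primary_dir],               # step 1
        ["pg_isready", "-d", new_primary_conninfo],                 # step 2
        ["pg_rewind", "--target-pgdata", old_primary_dir,
         "--source-server=" + new_primary_conninfo],                # step 3
        ["pg_ctl", "start", "-w", "-D", old_primary_dir],           # step 5
    ]


def recovery_conf(new_primary_conninfo):
    """Step 4: recovery.conf contents for the old primary (9.6 style)."""
    return (
        "standby_mode = 'on'\n"
        "primary_conninfo = '" + new_primary_conninfo + "'\n"
        "recovery_target_timeline = 'latest'\n"
    )


def run_failover(old_primary_dir, new_primary_dir, new_primary_conninfo):
    promote, isready, rewind, start = failover_commands(
        old_primary_dir, new_primary_dir, new_primary_conninfo)
    subprocess.run(promote, check=True)
    # pg_isready exits with 0 once the server accepts connections.
    while subprocess.run(isready).returncode != 0:
        time.sleep(1)
    subprocess.run(rewind, check=True)
    with open(os.path.join(old_primary_dir, "recovery.conf"), "w") as f:
        f.write(recovery_conf(new_primary_conninfo))
    subprocess.run(start, check=True)
```

Note that nothing in this sketch waits between `pg_isready` succeeding
and `pg_rewind` starting, which matches the scripted run described below.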

This should end up with the same HA setup, but with the roles switched.

It works fine if I run each step manually.

But if I run the steps sequentially from a script, it fails when I try
to switch back after the first role switch.

For example, with a fresh setup (timeline starts from 1), I first
switched roles, and it worked: I got p1 as a standby following p2,
which was the primary. Then I switched roles again and an error
occurred. The log looks like this:

   < 2018-11-12 04:59:24.547 UTC > LOG:  entering standby mode
   < 2018-11-12 04:59:24.555 UTC > LOG:  redo starts at 0/4000028
   < 2018-11-12 04:59:24.566 UTC > LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
   < 2018-11-12 04:59:24.566 UTC > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000005 has already been removed

   < 2018-11-12 04:59:24.577 UTC > LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
   < 2018-11-12 04:59:24.577 UTC > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000005 has already been removed

   < 2018-11-12 04:59:25.413 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:26.416 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:27.419 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:28.422 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:29.425 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:29.576 UTC > LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
   < 2018-11-12 04:59:29.576 UTC > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000005 has already been removed

The pg_rewind output is as follows:

   servers diverged at WAL position 0/5000060 on timeline 1         
   rewinding from last common checkpoint at 0/4000060 on timeline 1 

From the log, it seems the divergence is evaluated on the wrong
timeline: it should be timeline 2 rather than 1.
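
To make the timeline reading concrete, here is a small Python sketch
that decodes a WAL segment file name into its timeline and starting
position. It assumes the default 16 MB segment size, and
`parse_wal_segment` is a name made up for illustration.

```python
def parse_wal_segment(name, segment_size=16 * 1024 * 1024):
    """Split a 24-hex-digit WAL segment name into (timeline, start LSN).

    The name is three 8-digit hex fields: timeline ID, "log" number
    (high 32 bits of the position) and segment number within that log.
    """
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    segments_per_log = 0x100000000 // segment_size
    start = (log * segments_per_log + seg) * segment_size
    # Format the position the way the server logs it: high/low in hex.
    lsn = "%X/%X" % (start >> 32, start & 0xFFFFFFFF)
    return timeline, lsn


print(parse_wal_segment("000000020000000000000005"))  # (2, '0/5000000')
```

So the segment named in the error belongs to timeline 2 and starts at
0/5000000, the same position the standby tries to stream from on
timeline 1.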

Furthermore, if I add a `sleep` between the promote (steps 1 and 2) and
the rewind (step 3), it just works.

Hence, I suspect the promoted cluster is not ready to be used for
rewinding right after the promote. Is there anything I need to wait for
before I rewind the old primary against this promoted cluster?
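
One candidate condition, offered as an unconfirmed assumption rather
than an established answer, is that the promoted cluster's control file
still reports the old timeline until its first checkpoint after
promotion completes. A fixed `sleep` could then be replaced by polling
the timeline reported by `pg_control_checkpoint()` (available in 9.6),
roughly as in this Python sketch. The function names and the connection
string argument are hypothetical.

```python
import subprocess
import time


def control_file_timeline(conninfo):
    """Ask the server for the timeline recorded by its latest checkpoint."""
    result = subprocess.run(
        ["psql", "-d", conninfo, "-At", "-c",
         "SELECT timeline_id FROM pg_control_checkpoint()"],
        check=True, capture_output=True, text=True)
    return int(result.stdout.strip())


def wait_for_timeline(get_timeline, old_timeline,
                      timeout=60.0, interval=0.5, sleep=time.sleep):
    """Poll until the reported timeline exceeds old_timeline.

    Returns True on success and False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_timeline() > old_timeline:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

In a failover script this might be called as
`wait_for_timeline(lambda: control_file_timeline(conninfo), old_timeline)`
between waiting for `pg_isready` and running pg_rewind. Issuing an
explicit `CHECKPOINT` on the promoted cluster first may make the wait
short, but that too is an assumption to verify.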

Thank you in advance!

---
magodo
