BUG? Slave don't reconnect to the master

From Олег Самойлов
Subject BUG? Slave don't reconnect to the master
Date
Msg-id 60590EC6-4062-4F25-A49C-3948ED2A7D47@ya.ru
Responses Re: BUG? Slave don't reconnect to the master  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List pgsql-general
Hi all.

I found some strange behaviour in PostgreSQL that I believe is a bug. First of all, let me explain the situation.

I created a test bed to test high availability clusters based on Pacemaker and PostgreSQL. The test bed
consists of 12 virtual machines (on VirtualBox) running on a MacBook Pro, forming 4 HA clusters with
different structures. All 4 HA clusters are tested constantly in a loop: a failure of some kind is simulated,
the test waits for failover to occur, the failure is fixed, and so on. For simplicity I'll describe only one
HA cluster. It consists of 3 virtual machines, with the master on one and sync and async slaves on the
others. The PostgreSQL service is provided by floating IPs that point to the working master and slaves. The
slaves connect to the master's floating IP too. When Pacemaker detects a failure, for instance on the
master, it promotes the other node with the lowest WAL lag to master and switches the floating IPs, so the
third node remains a sync slave. My company decided to release this project as open source; I am now
finishing the formalities.

Almost everything works fine, but occasionally, rather rarely, I have seen a slave fail to reconnect to the
new master after a failure. The first case is the PostgreSQL-STOP test, where I `kill` postgres on the master
with the STOP signal to simulate a freeze. The slave doesn't reconnect to the new master, with these errors
in the log:

18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10
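
The freeze simulation described above can be sketched in a few lines. This is only an illustration, assuming the test driver already knows the PIDs of the postgres processes it wants to suspend; the helper names `freeze` and `thaw` are hypothetical and are not taken from the test bed itself.

```python
import os
import signal
import subprocess
import sys
from typing import Iterable


def freeze(pids: Iterable[int]) -> None:
    """Suspend each process with SIGSTOP.

    A stopped process keeps its TCP connections open but sends nothing,
    so a peer only notices the problem through its own timeouts.
    """
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def thaw(pids: Iterable[int]) -> None:
    """Resume processes that were suspended with SIGSTOP."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in for a postgres process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    try:
        freeze([child.pid])
        # WUNTRACED makes waitpid report a stopped (not only exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        print("child stopped:", os.WIFSTOPPED(status))
        thaw([child.pid])
    finally:
        child.kill()
        child.wait()
```

Because SIGSTOP cannot be caught or ignored, the target has no chance to close its sockets cleanly, which is what makes this a reasonable stand-in for a frozen server.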

What is strange is that the error about incorrect WAL is raised after the connection is terminated. Well,
this can be worked around by turning off wal_receiver_timeout. Now PostgreSQL-STOP works fine, but the
problem still exists with another test. ForkBomb simulates an out-of-memory situation. In this case too, a
slave sometimes doesn't reconnect to the new master, with these errors in the log:

10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got 0

By the way, the last error message (the last line in the log) varied between occurrences.
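
When comparing excerpts like the two above, it can help to split each line into time, PID, severity and message, since in both excerpts the two messages carry different PIDs. Below is a minimal sketch, assuming the log lines look exactly like the ones quoted here (the real layout depends on log_line_prefix); the names `LogEntry` and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:
#   18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<pid>\d+)\] "
    r"(?P<severity>[A-Z]+):\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    time: str
    pid: int
    severity: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None for continuation lines."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # e.g. the indented "This probably means ..." lines
    return LogEntry(
        match["time"], int(match["pid"]), match["severity"], match["message"]
    )


if __name__ == "__main__":
    for raw in (
        "18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout",
        "18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10",
    ):
        print(parse_line(raw))
```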

What I expect as the right behaviour: the PostgreSQL slave must reconnect to the master IP (floating IP)
after wal_retrieve_retry_interval.
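
One way a test loop could check this expectation is to poll the slave until it reports that it is streaming again, or give up after a deadline. The sketch below is an assumption about how such a check might look, not the test bed's actual code: `wait_for_streaming` is a hypothetical helper, and the `is_streaming` callback is left to the caller (on the slave it could, for example, query the pg_stat_wal_receiver view and check for status 'streaming').

```python
import time
from typing import Callable


def wait_for_streaming(
    is_streaming: Callable[[], bool],
    deadline_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `is_streaming` until it returns True or `deadline_s` elapses.

    Returns True if the slave was seen streaming in time, False otherwise.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    end = clock() + deadline_s
    while True:
        if is_streaming():
            return True
        if clock() >= end:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Fake check: the slave "reconnects" on the third poll.
    answers = iter([False, False, True])
    print(wait_for_streaming(lambda: next(answers), deadline_s=10, poll_s=0.01))
```

The deadline should be comfortably longer than wal_retrieve_retry_interval plus the failover time, otherwise the check reports a failure while the slave is still legitimately waiting to retry.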

