Re: BUG? Slave don't reconnect to the master

From: Олег Самойлов
Subject: Re: BUG? Slave don't reconnect to the master
Msg-id: 32B9ED9C-0603-47D8-9C5D-0A2461E42971@ya.ru
In reply to: Re: BUG? Slave don't reconnect to the master  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses: Re: BUG? Slave don't reconnect to the master  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List: pgsql-general

> On 29 Sep 2020, at 12:31, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
> On Thu, 24 Sep 2020 15:22:46 +0300
> Олег Самойлов <splarv@ya.ru> wrote:
>
>> Hi, Jehan.
>>
>>> On 9 Sep 2020, at 18:19, Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
>>> wrote:
>>>
>>> On Mon, 7 Sep 2020 23:46:17 +0300
>>> Олег Самойлов <splarv@ya.ru> wrote:
>>>
>>>>> [...]
>>>>>>>> 10:30:55.965 FATAL:  terminating walreceiver process due to administrator cmd
>>>>>>>> 10:30:55.966 LOG:  redo done at 0/1600C4B0
>>>>>>>> 10:30:55.966 LOG:  last completed transaction was at log time 10:25:38.76429
>>>>>>>> 10:30:55.968 LOG:  selected new timeline ID: 4
>>>>>>>> 10:30:56.001 LOG:  archive recovery complete
>>>>>>>> 10:30:56.005 LOG:  database system is ready to accept connections
>>>>>>>
>>>>>>>> The slave that didn't reconnect replication, tuchanka3c. I also
>>>>>>>> separated the logs copied from the old master with a blank line:
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> 10:20:25.168 LOG:  database system was interrupted; last known up at 10:20:19
>>>>>>>> 10:20:25.180 LOG:  entering standby mode
>>>>>>>> 10:20:25.181 LOG:  redo starts at 0/11000098
>>>>>>>> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
>>>>>>>> 10:20:25.183 LOG:  database system is ready to accept read only connections
>>>>>>>> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
>>>>>>>> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
>>>>>>>> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
>>>>>>>> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8
>>>>>>>
>>>>>>> This message appears before the effective promotion of tuchanka3b. Do you
>>>>>>> have logs about what happens *after* the promotion?
>>>>>>
>>>>>> This is the end of the slave log. Nothing. Replication is just absent.
>>>>>
>>>>> This is unusual. Could you log some more details about replication
>>>>> attempts to your PostgreSQL logs? Set log_replication_commands and lower
>>>>> log_min_messages to debug?
>>>>
>>>> Sure, these are the PostgreSQL logs for the cluster tuchanka3.
>>>> Tuchanka3a is the old (failed) master.
>>>
>>> According to your logs:
>>>
>>> 20:29:41 tuchanka3a: freeze
>>> 20:30:39 tuchanka3c: wal receiver timeout (default 60s timeout)
>>> 20:30:39 tuchanka3c: switched to archives, and error'ed (expected)
>>> 20:30:39 tuchanka3c: switched to stream again (expected)
>>>                    no more news from this new wal receiver
>>> 20:34:21 tuchanka3b: promoted
>>>
>>> I'm not sure where your floating IP is located at 20:30:39, but I suppose it
>>> is still on tuchanka3a as the wal receiver doesn't hit any connection error
>>> and tuchanka3b is not promoted yet.
>>
>> I think so.
>>
>>>
>>> So at this point, I suppose the wal receiver is stuck in libpqrcv_connect
>>> waiting for frozen tuchanka3a to answer, with no connection timeout. You
>>> might track tcp sockets on tuchanka3a to confirm this.
>>
>> I don't know how to do this.
>
> Use ss, see its manual page. Here is an example, using the standard 5432 pgsql port:
>
>  ss -tapn 'dport = 5432 or sport = 5432'
>
> Look for Local and Peer addresses and their status.
>
>>> To avoid such a wait, try adding e.g. connect_timeout=2 to your
>>> primary_conninfo parameter. See:
>>> https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
>>
>> Nope, this was not enough. But I went further and added TCP keepalive
>> options too. So now the PAF file, for instance on tuchanka3c, is:
>>
>> # recovery.conf for krogan3, pgsqlms pacemaker module
>> primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3c connect_timeout=5 keepalives=1 keepalives_idle=1 keepalives_interval=3 keepalives_count=3'
>> recovery_target_timeline = 'latest'
>> standby_mode = 'on'
>>
>> And now the problem with PostgreSQL-STOP is solved. But I am surprised: why
>> was this needed? I thought that wal_receiver_timeout would be enough for this case.
>
> Because wal_receiver_timeout applies to already established, streaming
> connections, when the server end of the stream becomes silent.
>
> The timeout you hit happens during connection establishment, where
> connect_timeout takes effect.
>
> Regarding the keepalive parameters, I am a bit surprised. According to the
> source code, the parameter defaults are:
>
>  keepalives=1
>  keepalives_idle=1
>  keepalives_interval=1
>  keepalives_count=1
>
> But I just had a quick look there, so I am probably missing something.

According to the official documentation, if the keepalive parameters are not specified, the default values from the
OS are used.
https://www.postgresql.org/docs/12/runtime-config-connection.html#RUNTIME-CONFIG-CONNECTION-SETTINGS

Quote: "A value of 0 (the default) selects the operating system's default."

I don't know what the default values are on CentOS 7. I can only say that adding the keepalive settings solved the
issue with the Postgres-STOP test, and apparently the problems with ForkBomb too. And keep in mind that I still use
PostgreSQL 11. Maybe something changed in 12 or 13.
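
As a rough illustration of why the keepalive settings could matter here, the sketch below estimates how long an idle connection to a frozen peer survives before TCP keepalive gives up, using the approximation idle + interval * count. The function names are hypothetical. The values 7200/75/9 are an assumption: they are the stock Linux kernel defaults, not something I verified on CentOS 7, so the helper also reads the real values from /proc when it can.

```python
from pathlib import Path


def detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate seconds before an idle connection to a dead peer is dropped.

    The first probe is sent after `idle` seconds of silence, then one probe
    every `interval` seconds; after `count` unanswered probes the kernel
    gives up on the connection.
    """
    return idle + interval * count


def linux_keepalive_defaults(proc_dir: str = "/proc/sys/net/ipv4"):
    """Return (time, intvl, probes) from the running kernel, or None if unavailable."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    try:
        return tuple(int((Path(proc_dir) / name).read_text()) for name in names)
    except (OSError, ValueError):
        # Not Linux, or the files are not readable in this environment.
        return None


if __name__ == "__main__":
    # keepalives_idle=1 keepalives_interval=3 keepalives_count=3 from primary_conninfo
    print("with primary_conninfo settings:", detection_seconds(1, 3, 3), "s")
    # Assumed stock Linux defaults (not verified for CentOS 7)
    print("with assumed stock defaults:", detection_seconds(7200, 75, 9), "s")
    current = linux_keepalive_defaults()
    if current is not None:
        print("this host:", current, "->", detection_seconds(*current), "s")
```

Under these assumptions the difference is about 10 seconds versus more than two hours, which would be consistent with the standby appearing stuck when the OS defaults are in use. The same three values can be read on the host with sysctl net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes.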

