Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From	Rui DeSousa
Subject	Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date	12 November 2018, 16:30:53
Msg-id	9D8241F9-9E87-46B0-AF33-1893A0013C5B@crazybean.net
In reply to	PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Replies	Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
List	pgsql-admin

> On Nov 12, 2018, at 5:41 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
> This Sunday (yesterday) we had an incident caused by wal sender terminating (on Friday) after reaching timeout (5 mins). This left the replication slot retaining wals till our production primary server run out of space. (this is not connected with the wal fill up of the previous Sunday nor does it explain why it happened, still in the dark about this one).

This sounds like there was a network-related issue. Did the WAL receiver time out too, or did it remain “connected”? If the downstream server did not detect the network issue, and thus failed to drop the abandoned connection and reconnect on its own, then this is normal behavior, as the replication slot would not have been active.

I’m a bit confused, as I thought you stated before that you checked the replication slots and they were active and moving forward; right?
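
For reference, one way to check this on the publisher is to look at pg_replication_slots. The following is only a minimal sketch: the query uses the PostgreSQL 10 function names, while the helper name slots_needing_attention and the 1 GiB threshold are made up for illustration and are not something from this thread.

```python
# Query to run on the publisher with any client (PostgreSQL 10 naming).
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def slots_needing_attention(rows, max_retained_bytes=1024 ** 3):
    """Return names of slots that are inactive or retain too much WAL.

    rows: iterable of (slot_name, active, retained_bytes) tuples, in the
    column order of SLOT_QUERY. The 1 GiB default is an arbitrary example
    threshold, not a recommendation.
    """
    flagged = []
    for slot_name, active, retained_bytes in rows:
        # retained_bytes is NULL (None) when the slot has no restart_lsn yet.
        too_much_wal = (
            retained_bytes is not None and retained_bytes > max_retained_bytes
        )
        if not active or too_much_wal:
            flagged.append(slot_name)
    return flagged
```

A slot that shows active = false for longer than the configured timeouts is the case described above: the sender has gone away and nothing has reconnected, so WAL keeps accumulating.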

> - We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible to monitor this and act accordingly or write an automated tool to handle such events (to do what???), otherwise set it to 0 but be prepared, in case of permanent problems, to loose availability when you run out of disk space.
>
> So is there any way to restart the WAL sender ? Is there any way to tell postgresql to retry after this specified amount of time? Otherwise what is the purpose of the LOG message? (which is not even an ERROR ?) Should a restart of the subscriber or the publisher node remedy this?

wal_sender_timeout and wal_receiver_timeout are timeouts: when one expires, Postgres terminates the connection, and the downstream server will reconnect on its own (as long as it also terminates its own side of the connection, which is what wal_receiver_timeout is for).

This is very useful when you have an overzealous firewall that drops idle connections without sending resets, or in any other situation that causes a network connection issue.

Disabling the timeout seems like a really bad idea: the end result would then depend on your TCP/IP stack, which can take a day or so to detect an abandoned connection unless TCP/IP keepalive is enabled. I would also recommend setting up TCP/IP keepalive to detect abandoned sessions, for example a firewall dropping a user session without a reset (this can happen on a long-running query, as the connection would look idle to the firewall), or a user closing the lid on their laptop and heading home for the day while still logged in. Without the timeouts, their session will continue to run active queries and/or hold on to open transactions until the OS terminates the session or some other timeout is reached.
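
To put rough numbers on the keepalive point: on the server side the relevant postgresql.conf settings are tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count. The sketch below is only an approximation of how long an idle, silently dropped connection takes to be noticed; the function name and the 60/10/6 values are illustrative assumptions, not recommendations.

```python
def dead_peer_detection_seconds(idle, interval, count):
    """Approximate time for TCP keepalive to give up on a silent peer.

    idle:     seconds of inactivity before the first probe
              (tcp_keepalives_idle)
    interval: seconds between unanswered probes (tcp_keepalives_interval)
    count:    unanswered probes before the connection is considered dead
              (tcp_keepalives_count)
    """
    return idle + interval * count


# Hypothetical tuned values: about two minutes to notice a dead peer.
tuned = dead_peer_detection_seconds(idle=60, interval=10, count=6)

# Common Linux kernel defaults (7200 s idle, 75 s interval, 9 probes):
# a little over two hours, and only if keepalive is enabled at all.
linux_default = dead_peer_detection_seconds(idle=7200, interval=75, count=9)
```

The same idea applies on the client side, where libpq accepts keepalives_idle, keepalives_interval and keepalives_count in the connection string.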

Did you check to see if you have any long-running queries or open transactions that are holding on to an xmin?
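
A quick way to look for those is pg_stat_activity. Again, just a sketch: the query uses standard columns, but the helper name long_running and the one-hour cutoff are invented for the example.

```python
from datetime import datetime, timedelta

# Sessions currently holding an xmin, oldest transaction first.
XMIN_QUERY = """
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start
"""


def long_running(rows, now, max_age=timedelta(hours=1)):
    """Return pids whose transaction has been open longer than max_age.

    rows: iterable of (pid, state, xact_start, backend_xmin) tuples, in
    the column order of XMIN_QUERY. `now` must be comparable with
    xact_start (both timezone-aware or both naive).
    """
    return [
        pid
        for pid, _state, xact_start, _xmin in rows
        # xact_start can be NULL for backends without an open transaction.
        if xact_start is not None and now - xact_start > max_age
    ]
```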
