PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

Поиск

Список

Период

Сортировка

От	Achilleas Mantzios
Тема	PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"
Дата	12 ноября 2018 г. 13:41:30
Msg-id	ae8812c3-d138-73b7-537a-a273e15ef6e1@matrix.gatewaynet.com обсуждение исходный текст
Ответы	Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device" Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"
Список	pgsql-admin

Дерево обсуждения

Hello,
while setting up logical replication since August we had seen early on the need to increase max_receiver_timeout and
max_sender_timeoutfrom 60sec to 5mins, otherwise the synchronization would never

take place.
This Sunday (yesterday) we had an incident caused by wal sender terminating (on Friday) after reaching timeout (5
mins).This left the replication slot retaining wals till our production primary

server run out of space. (this is not connected with the wal fill up of the previous Sunday nor does it explain why it
happened,still in the dark about this one).

I got the following messages on the publisher (primary) host :

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:
terminatingwalsender process due to replication timeout

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10
CONTEXT: slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change

callback, associated LSN 13DF/393BF7F0
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:
disconnection:session time: 49:47:17.937 user=repmgr database=dynacom

host=10.9.0.77 port=48650

By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.

I didn't get *any* ERROR/FATAL message on the subscriber side. We have built a monitoring system that notifies us about
problems,conflicts such as unique key violations, etc but we had no alerts

coming from the subscriber logs. So this went unnoticed. Also the scripts about disk size growth fired but this was
alreadytoo late.

So after I got the alerts about our primary being dead, I had to free the space in pg_wal, keeping all the wals
elsewhere,consult pg_controldata about last checkpoint, bring the primary up, and then

making sure that all standbys were up and running. Our physical standby easily caught up, but the logical one (which
waspart of the cause) of course didn't. So I used the moved wals, moving them back

to pg_wal (of the primary) in bulks (as the subscriber asked) till I managed to bring the subscriber node in sync with
thepublisher, sans the last wal (the one created at the time of the PANIC),

which I had to bring from our WAL archiving server. This pretty much brought everything up and running again.

Lessons learned are :
* monitoring the subscriber is not enough
* a walsender (sending WAL across for logical decoding) having a timeout may be fatal

I have seen howtos recommending disabling those timeouts :
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html

The postgresql docs are quite "laconic" about the use of the two timeout parameters, and both default to 1min. :

wal_sender_timeout (integer)
Terminate replication connections that are inactive longer than the specified number of milliseconds. This is
usefulfor the sending server to detect a standby crash or network outage. A value of

zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or on the server
commandline. The default value is 60 seconds.

wal_receiver_timeout (integer)
Terminate replication connections that are inactive longer than the specified number of milliseconds. This is
usefulfor the receiving standby server to detect a primary node crash or network

outage. A value of zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or
onthe server command line. The default value is 60 seconds.

So what postgresql manual basically says to the user is :

- We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible to
monitorthis and act accordingly or write an automated tool to handle such events (to

do what???), otherwise set it to 0 but be prepared, in case of permanent problems, to loose availability when you run
outof disk space.

So is there any way to restart the WAL sender ? Is there any way to tell postgresql to retry after this specified
amountof time? Otherwise what is the purpose of the LOG message? (which is not even

an ERROR ?) Should a restart of the subscriber or the publisher node remedy this?

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

В списке pgsql-admin по дате отправления:

Предыдущее

От: Peter Eisentraut
Дата: 11 ноября 2018 г., 23:12:45
Сообщение: Re: How to downgrade a cluster from 9.5 to 9.3

Следующее

От: Achilleas Mantzios
Дата: 12 ноября 2018 г., 17:00:39
Сообщение: Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with theshutdown checkpoint

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

Предыдущее

Следующее