Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

Поиск
Список
Период
Сортировка
От Achilleas Mantzios
Тема Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"
Дата
Msg-id be3f23dc-f7ea-2c7e-fa11-22e68a22ef6f@matrix.gatewaynet.com
обсуждение исходный текст
Ответ на Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"  (Rui DeSousa <rui@crazybean.net>)
Список pgsql-admin
On 12/11/18 6:30 μ.μ., Rui DeSousa wrote:
>
>> On Nov 12, 2018, at 5:41 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>> This Sunday (yesterday) we had an incident caused by wal sender terminating (on Friday) after reaching timeout (5
mins).This left the replication slot retaining wals till our production primary server run out of space. (this is not
connectedwith the wal fill up of the previous Sunday nor does it explain why it happened, still in the dark about this
one).
> This sounds like there was a network related issue.

yes, apparently, or a transaction was large enough so that it didn't get through within the timeout limit. However I
countedabout 24 WAL files in this period  so it would be max about 400MB. Which 
 
means the speed fell down to 1.3MB/sec which is pretty slow for connections between the same cloud provider.

> Did the WAL receiver timeout too or did it remain “connected”?

I mention it in the original message :
"By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.
I didn't get *any* ERROR/FATAL message on the subscriber side".

Or to be more precise, I didn't get any related message on the subscriber side. Here what the subscriber logs contain
forthe said time window :
 

  [106486] 2018-11-09 14:55:16.996 EET  @ line:468 LOG:  checkpoint complete: wrote 17564 buffers (1.7%); 0 WAL file(s)
added,0 removed, 43 recycled; write=610.394 s, sync=0.044 s, total=610.521 s; 
 
sync files=507, longest=0.011 s, average=0.000 s; distance=564195 kB, estimate=712814 kB
  [106486] 2018-11-09 14:55:19.093 EET  @ line:469 LOG:  checkpoint starting: xlog
10.9.202.211(49220) [18191] 2018-11-09 14:59:35.772 EET [unknown] [unknown]@[unknown] line:1 LOG:  connection received:
host=10.9.202.211port=49220
 
.......
  [106486] 2018-11-09 15:06:56.846 EET  @ line:470 LOG:  checkpoint complete: wrote 9034 buffers (0.9%); 0 WAL file(s)
added,0 removed, 24 recycled; write=697.701 s, sync=0.003 s, total=697.752 s; 
 
sync files=414, longest=0.002 s, average=0.000 s; distance=722719 kB, estimate=722719 kB
.......
  [106486] 2018-11-09 15:10:19.880 EET  @ line:471 LOG:  checkpoint starting: time
  [106486] 2018-11-09 15:23:49.049 EET  @ line:472 LOG:  checkpoint complete: wrote 9554 buffers (0.9%); 0 WAL file(s)
added,0 removed, 0 recycled; write=809.158 s, sync=0.003 s, total=809.169 s; 
 
sync files=453, longest=0.001 s, average=0.000 s; distance=327761 kB, estimate=683223 kB


> If the downstream server did not detected the network issue thus failed to drop the abandoned connection and
reconnecton its own then this is normal behavior as the replication slot would not have be active.
 

The subscriber did not detect the walsender termination. It was only after the subscriber node restart (this Sunday)
thatit started again asking for WAL segments.
 
The replication slot lives on the publisher node, it was apparently inactive since the walsender terminated on Friday
(tillthe PANIC)
 

>    
>
> I’m a bit confused and as I thought you stated before that you checked the replication slots and they where active
andmoving forward; right?
 

I have sent two separate emails with two separate subjects, reporting two distinct incidents. In this incident, I could
notcheck any replication slot because the primary had PANIC'ed.
 

>
>> - We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible to
monitorthis and act accordingly or write an automated tool to handle such events (to do what???), otherwise set it to 0
butbe prepared, in case of permanent problems, to loose availability when you run out of disk space.
 
>>
>> So is there any way to restart the WAL sender ? Is there any way to tell postgresql to retry after this specified
amountof time? Otherwise what is the purpose of the LOG message? (which is not even an ERROR ?) Should a restart of the
subscriberor the publisher node remedy this?
 
> wal_sender_timeout and wal_receiver_timeout are timeout and Postgres will terminate the connect and the downstream
serverwill reconnect on its own (as long as it terminates its own connection — wal_receiver_timeout).
 

This didn't happen. Nothing replication related coming from the subscriber logs since the walsender termination.

> Did you check to see if you have any long running queries or opened transactions that are holding on to a xmin?

https://www.postgresql.org/docs/10/protocol-logical-replication.html
An uncommitted transaction won't make it to the subscriber.

>
>
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



В списке pgsql-admin по дате отправления:

Предыдущее
От: Andrey Zhidenkov
Дата:
Сообщение: Re: hot_standby_feedback parameter doesn't work
Следующее
От: Andrey Zhidenkov
Дата:
Сообщение: Re: hot_standby_feedback parameter doesn't work