Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From Achilleas Mantzios
Subject Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date
Msg-id c39ea7f4-c7a2-2111-6d09-58f25df5b6c5@matrix.gatewaynet.com
In reply to Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-admin
Hi Alvaro!

On 23/11/18 1:10 PM, Alvaro Herrera wrote:
> On 2018-Nov-14, Rui DeSousa wrote:
>
>>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>>
>>> Our sysadms (seasoned linux/network guys : we have been working here
>>> for more than 10 yrs) were absolute in that we run no firewall or
>>> other traffic shaping system between the two hosts. (if we did the
>>> problem would manifest itself earlier).  Can you recommend what to
>>> look for exactly regarding both TCP stacks ? The subscriber node is
>>> a clone of the primary. We have :
>>>
>>> # sysctl -a | grep -i keepaliv
>>> net.ipv4.tcp_keepalive_intvl = 75
>>> net.ipv4.tcp_keepalive_probes = 9
>>> net.ipv4.tcp_keepalive_time = 7200
>> Those keepalive settings are Linux's defaults and work out to be 18
>> hours before the abandoned connection is dropped.  So, the WAL receiver
>> should have corrected itself after that time.  For reference, I
>> terminate abandoned sessions within 15 mins, as they take up valuable
>> database resources and could potentially hold on to locks, snapshots,
>> etc.
> Where does your 18h figure come from?  As I understand it, these numbers
> mean "wait 7200 seconds, then send 9 probes 75 seconds apart", kill the
> connection if no reply is obtained.  So that works out to about 131
> minutes (modulo fencepost bug).  Certainly not 18 hours ...

Thanks, yes, it sums up to 2 hrs 11 mins. In the moments after the primary crashed I didn't have the nerves/patience/guts to
wait that long and actually prove that the subscriber was happily listening to a ghost/stuck connection.
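
The arithmetic behind that figure can be sketched as follows. This is only an illustration, assuming the semantics Alvaro describes (wait for the idle time, then send the probes at a fixed interval) and ignoring the possible off-by-one he mentions; the function name `keepalive_detection_seconds` is hypothetical, and the values are the ones from the sysctl output quoted above.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate time until a dead peer is detected by TCP keepalive.

    Assumes: wait `idle` seconds with no traffic, then send `probes`
    keepalive probes `interval` seconds apart, and drop the connection
    when none of them is answered.
    """
    return idle + interval * probes


# Values taken from the sysctl output quoted in this thread
total = keepalive_detection_seconds(
    idle=7200,     # net.ipv4.tcp_keepalive_time
    interval=75,   # net.ipv4.tcp_keepalive_intvl
    probes=9,      # net.ipv4.tcp_keepalive_probes
)

hours, remainder = divmod(total, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{total} s = {hours} h {minutes} min {seconds} s")
# prints "7875 s = 2 h 11 min 15 s"
```

So with these settings a stuck connection would be expected to survive for roughly 131 minutes, not 18 hours.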

>
> Now ... I have seen Linux kernel code that seemed to me to cause network
> transmission get stuck *in the kernel* without any way out.  Now I'm not
> a kernel expert and I don't know if this applies to your case (maybe it
> got fixed already), but it was definitely some process that was stuck
> with "wchan" set to a network kernel call and way beyond TCP keepalives.

It seems we'll have to upgrade our systems/kernels ASAP.  Thanks a lot!

>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


