Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From Alvaro Herrera
Subject Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date
Msg-id 20181123111024.73ck7f7zbzexfudz@alvherre.pgsql
In reply to Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Rui DeSousa <rui@crazybean.net>)
Responses Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
List pgsql-admin
On 2018-Nov-14, Rui DeSousa wrote:

> > On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
> > 
> > Our sysadmins (seasoned Linux/network guys; we have been working here
> > for more than 10 yrs) were adamant that we run no firewall or
> > other traffic-shaping system between the two hosts (if we did, the
> > problem would have manifested itself earlier).  Can you recommend what to
> > look for exactly regarding both TCP stacks?  The subscriber node is
> > a clone of the primary.  We have:
> > 
> > # sysctl -a | grep -i keepaliv
> > net.ipv4.tcp_keepalive_intvl = 75
> > net.ipv4.tcp_keepalive_probes = 9
> > net.ipv4.tcp_keepalive_time = 7200
> 
> Those keepalive settings are Linux’s defaults and work out to be 18
> hours before the abandoned connection is dropped.  So, the WAL receiver
> should have corrected itself after that time.  For reference, I
> terminate abandoned sessions within 15 minutes, as they take up valuable
> database resources and could potentially hold on to locks, snapshots,
> etc.

Where does your 18h figure come from?  As I understand it, these numbers
mean "wait 7200 seconds, then send 9 probes 75 seconds apart, and kill
the connection if no reply is obtained".  So that works out to
7200 + 9 * 75 = 7875 seconds, or about 131 minutes (modulo fencepost
bug).  Certainly not 18 hours ...
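As a quick check of that arithmetic, here is a minimal Python sketch of
the reading above.  The function name is made up for illustration, and
the last line is only a guess at how an 18-hour figure could arise
(multiplying the idle time by the probe count); the thread does not say
how it was derived.

```python
# Linux TCP keepalive sysctl values quoted earlier in the thread.
KEEPALIVE_TIME = 7200   # net.ipv4.tcp_keepalive_time: idle seconds before the first probe
KEEPALIVE_INTVL = 75    # net.ipv4.tcp_keepalive_intvl: seconds between probes
KEEPALIVE_PROBES = 9    # net.ipv4.tcp_keepalive_probes: unanswered probes before giving up


def keepalive_timeout_seconds(idle, interval, probes):
    """Approximate seconds until an idle, dead connection is dropped.

    Hypothetical helper: idle wait plus one interval per probe.
    Off-by-one (fencepost) details of the kernel timer are ignored.
    """
    return idle + probes * interval


total = keepalive_timeout_seconds(KEEPALIVE_TIME, KEEPALIVE_INTVL, KEEPALIVE_PROBES)
print(f"{total} s = {total / 60:.2f} min")  # 7875 s = 131.25 min

# Guess only: idle time multiplied by the probe count gives 18 hours.
guess_hours = KEEPALIVE_TIME * KEEPALIVE_PROBES / 3600
print(f"guess: {guess_hours} h")  # guess: 18.0 h
```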

Now ... I have seen Linux kernel code that seemed to me to cause network
transmission to get stuck *in the kernel* without any way out.  I'm not
a kernel expert and I don't know if this applies to your case (maybe it
got fixed already), but it was definitely some process that was stuck
with "wchan" set to a network kernel call, and way beyond TCP keepalives.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

