Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

From: Jelte Fennema
Subject: Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
Date:
Msg-id: AM5PR83MB01780E7649EC5802643666A5F7589@AM5PR83MB0178.EURPRD83.prod.outlook.com
In reply to: Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
It seems the man page for TCP_USER_TIMEOUT does not align with
reality then. When I use it on my local machine, it effectively acts as a
connection timeout too. The second command below times out after
two seconds:

sudo iptables -A INPUT -p tcp --destination-port 5432 -j DROP
psql 'host=localhost tcp_user_timeout=2000'
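
For reference, on Linux the tcp_user_timeout setting boils down to a
setsockopt() call along these lines (just a sketch of the socket option
itself, not the actual libpq code):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* TCP_USER_TIMEOUT takes a value in milliseconds; available since
 * Linux 2.6.37. */
static int
set_user_timeout(int sockfd, unsigned int timeout_ms)
{
    return setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}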

The keepalive settings, however, only apply once you get to the recv. And yes,
it is pretty unlikely for the connection to break right when it is waiting for data,
but it has happened to us. And when it happens it is really bad: since recv is a
blocking call, the process stays blocked forever.
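
To illustrate the kind of bounded wait that would avoid the hang (a sketch
only, not the actual PQcancel code): instead of blocking in recv()
indefinitely while waiting for the server to close the cancel connection,
the wait could be capped with poll():

#include <poll.h>
#include <sys/socket.h>

/* Wait at most timeout_ms for data or an orderly close, instead of
 * blocking in recv() forever. Illustrative only. */
static int
wait_for_close(int sockfd, int timeout_ms)
{
    struct pollfd pfd = {.fd = sockfd, .events = POLLIN};
    char c;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;                     /* timeout or poll() error */
    return recv(sockfd, &c, 1, 0);     /* returns 0 on orderly close */
}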

When we investigated one of these occurrences, it appeared to be a combination
of a few things:
1. The way Citus uses cancellation requests: a Citus query on the coordinator opens
   multiple connections to a worker, using 2PC for distributed transactions. If one
   connection receives an error, it sends a cancel request for all the others.
2. When a machine is under heavy CPU or memory pressure, things don't work
   well:
   i. Errors can occur pretty frequently, causing Citus to send lots of cancels.
   ii. The postmaster can be slow to handle new cancellation requests.
   iii. Our failover system can think the node is down, because health checks are
      failing.
3. Our failover system effectively cuts the power and the network of the primary
   when it triggers a failover to the secondary.

All of this together can result in a cancel request being interrupted at exactly the
wrong moment. When that happens, a distributed query on the Citus coordinator
becomes blocked forever; we've had queries stuck in this state for multiple days.
The only way out at that point is either restarting Postgres or manually closing
the blocked socket (with ss or gdb; see the examples below).
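
For anyone running into the same thing, the commands look roughly like
this (addresses, pids and fds are placeholders):

# Kill the stuck TCP connection to the worker. Needs an iproute2 with
# 'ss -K' support and a kernel built with CONFIG_INET_DIAG_DESTROY.
sudo ss -K dst <worker-ip> dport = :5432

# Or attach gdb to the blocked process and close the fd by hand
# (find the fd number via /proc/<pid>/fd or lsof):
gdb -p <pid> --batch -ex 'call (int) close(<fd>)'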

Jelte

