Re: libpq connect keepalive* parms: no effect!?

From: Bill Clay
Subject: Re: libpq connect keepalive* parms: no effect!?
Date:
Msg-id: 54F60FB9.2090302@acm.org
In reply to: libpq connect keepalive* parms: no effect!?  (Bill Clay <william.b.clay@acm.org>)
List: pgsql-interfaces
The problem I described yesterday ("libpq connect keepalive* parms: no 
effect!?") is RESOLVED and, as I suspected, has nothing to do with libpq 
or PostgreSQL.

Searches show that I am far from alone in struggling to satisfy this 
elusive objective, so I describe the solution below, at least for recent 
Linux-provided TCP protocol implementations.

My friend tcpdump finally illuminated the issue.  It showed me that TCP 
keepalives are necessary BUT NOT SUFFICIENT to promptly detect loss of 
end-to-end connectivity on an otherwise idle session.

The ca. 17 minute timeout observed on broken sessions is due to a 
system-wide default value for the TCP packet retransmission limit. 
Presently for Linux, the default value for this parameter (sysctl 
parameter net.ipv4.tcp_retries2) is 15.  Linux TCP spaces these retries 
nearly exponentially: the first retry is subsecond, and subsequent 
retries occur at increasing intervals until they reach and remain at a 
maximum spacing of about 2 minutes.  In my tests, 15 retries take about 
17 minutes (doubtless influenced somewhat by the TCP round-trip-time 
["RTT"] just prior to the session break; for my transatlantic test case, 
around 200 ms.)

Why do keepalive parameters NOT shorten this delay?  Because the first 
unacknowledged keepalive triggers TCP's retry method as just described; 
a keepalive gets the same retry treatment as ordinary user data.  Thus, 
keepalives WILL guarantee detection of the problem AT THE TCP LAYER 
within the configured keepalive interval on an otherwise idle 
session, but they DO NOT REDUCE THE TIME ELAPSED before the TCP layer 
gives up and relays the bad news to the application layer, where you're 
impatiently waiting.

Unfortunately, at least on Linux, the TCP parameter necessary to reduce 
application-visible timeouts on a broken session is system-wide and is 
not accessible as a socket option.  So, to summarize, and based on a TCP 
RTT of around 200 ms:

To guarantee that a PostgreSQL application is informed of a break in 
connectivity to its DBMS server within 2 minutes of the loss, whether or 
not client traffic is active at the time:

1. PQconnectdb() and variants may specify keepalives, keepalives_idle, 
keepalives_interval, and keepalives_count as 1, 60, 60, and 1, 
respectively (no need to set the count > 1 for this purpose; the TCP 
layer has already tried multiple times before it tells libpq of a 
persistent loss of connectivity).

2. The CLIENT host's OS can be configured for a lower non-default TCP 
retransmission count of 7.  (For the Linux distributions I know, this 
can be made non-volatile by setting net.ipv4.tcp_retries2 in config file 
/etc/sysctl.conf or in a file in /etc/sysctl.d; note that "retries1" 
impacts other TCP recovery logic and does NOT aid session break detection).

I presume (but do not know) other *nix flavors present similar behavior 
with similar remedies, YMMV.  Unfortunately my last Sun Solaris box went 
to the recycling center a couple years ago ...


