Thread: libpq connect keepalive* parms: no effect!?

libpq connect keepalive* parms: no effect!?

From: Bill Clay
Date:
I have searched fairly thoroughly and have been unable to find a way to
make a client application's session break promptly when the PostgreSQL
client-to-server transport fails.

I run a 7x24 PostgreSQL 9.1 "write-only" libpq client application 
(solely INSERTs/COPYs running on Debian 7 "wheezy" OS) that communicates 
with its PostgreSQL 9.0 DB server (Debian 6 "squeeze") via 
less-than-perfect intercontinental TCP Internet/VPN transport. The 
application has been running very reliably for over 4 years except for 
communication breaks.

Unfortunately, in this environment, connectivity lapses of a minute or
two to an hour or two are common.  To minimize the risk of data loss
when session recovery is attempted only AFTER the client has queued
data, I want to promptly detect and attempt recovery of lost sessions
even when no transactions are pending.  To this end, I have tried:

#define PGSQL_KEEPALIVE_QSECS "60"

char pstring[6];
snprintf(pstring, sizeof(pstring), "%i", cnt0->conf.pgsql_port);
conn = PQconnectdbParams(
    (const char *[]) {"dbname", "host", "user", "password", "port",
                      "sslmode", "application_name", "connect_timeout",
                      "keepalives", "keepalives_idle",
                      "keepalives_interval", "keepalives_count", NULL},
    (const char *[]) {cnt0->conf.pgsql_db, cnt0->conf.pgsql_host,
                      cnt0->conf.pgsql_user, cnt0->conf.pgsql_password,
                      pstring, "disable", "motion", PGSQL_KEEPALIVE_QSECS,
                      "1", PGSQL_KEEPALIVE_QSECS, PGSQL_KEEPALIVE_QSECS,
                      "3", NULL},
    0);


As a baseline comparison, I establish a psql session with an all-default
environment, break the VPN link, and then attempt a simple query (select
count(*) from ...).  The query and psql session fail after about 17
minutes' wait.  When testing the application -- even specifying the
above connection parameters -- I get approximately the same 17-minute
timeout before a broken session is signalled to the application
(PQconsumeInput(conn); if (PQstatus(conn) != CONNECTION_OK) ...)
when testing over the intentionally broken link.  This is a far cry from
the maximum of 5 minutes I expected.
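
For reference, here is a minimal sketch of the idle-session check in
question (the helper name is my own; it assumes an established
PGconn *conn):

#include <libpq-fe.h>

/* Poll an otherwise idle connection for liveness.  PQconsumeInput()
   reads whatever is available on the socket and returns 0 on error;
   once the TCP layer declares the session dead, PQstatus() drops to
   CONNECTION_BAD.  Returns 1 while the session still looks healthy. */
static int session_alive(PGconn *conn)
{
    if (!PQconsumeInput(conn))
        return 0;
    return PQstatus(conn) == CONNECTION_OK;
}

When this returns 0, the application attempts recovery, e.g. with
PQreset(conn).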

Based on postings elsewhere, I have also tried changing the relevant 
Linux kernel defaults of:

/proc/sys/net/ipv4/tcp_keepalive_time=7200
/proc/sys/net/ipv4/tcp_keepalive_probes=9
/proc/sys/net/ipv4/tcp_keepalive_intvl=75

to:

/proc/sys/net/ipv4/tcp_keepalive_time=60
/proc/sys/net/ipv4/tcp_keepalive_probes=3
/proc/sys/net/ipv4/tcp_keepalive_intvl=15

... with no detectable effect; still a ca. 17-minute timeout.  (Failure
of initial connection establishment IS indicated rapidly, in ca. 20 sec.,
with or without any of the above measures, even with connect_timeout=60.)
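
For what it's worth, as far as I can tell these per-socket options are
what the keepalives_* connection parameters are supposed to arrange
under the hood on Linux; a sketch of the equivalent raw setsockopt()
calls, assuming a connected socket fd:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalives on one socket and override the system-wide
   tcp_keepalive_* defaults for that socket only.  Returns 0 on
   success, -1 on the first failing setsockopt(). */
static int set_keepalive(int fd, int idle_s, int intvl_s, int cnt)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}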

Any ideas how to achieve the keepalives as specified in 
PQconnectdbParams when running on these platforms?

Thanks,
Bill Clay



Re: libpq connect keepalive* parms: no effect!?

From: Bill Clay
Date:
The problem I described yesterday ("libpq connect keepalive* parms: no 
effect!?") is RESOLVED and, as I suspected, has nothing to do with libpq 
or PostgreSQL.

Searches reveal that my difficulty in satisfying this elusive objective
is not rare, so I describe the solution below, at least for recent
Linux TCP implementations.

My friend tcpdump finally illuminated the issue.  It showed me that TCP 
keepalives are necessary BUT NOT SUFFICIENT to promptly detect loss of 
end-to-end connectivity on an otherwise idle session.

The ca. 17-minute timeout observed on broken sessions is due to a
system-wide default value for the TCP packet retransmission limit.
Presently on Linux, the default value for this parameter (sysctl
parameter net.ipv4.tcp_retries2) is 15.  Linux TCP spaces these retries
nearly exponentially: the first retry comes in under a second, and
subsequent retries occur at increasing intervals until they reach and
remain at a maximum spacing of about 2 minutes.  In my tests, 15 retries
take about 17 minutes (doubtless influenced somewhat by the TCP
round-trip time ["RTT"] just prior to the session break; around 200 ms
for my transatlantic test case).
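
A back-of-the-envelope model (my own simplification, not the kernel's
exact RTO computation) lands in the same ballpark:

#include <stdio.h>

/* Rough model of Linux TCP retransmission backoff: each retry waits
   one RTO, the RTO doubles each time, and the spacing is capped at
   about 120 s (TCP_RTO_MAX).  An initial RTO of 0.4 s assumes the
   ~200 ms RTT mentioned above. */
int main(void)
{
    double rto = 0.4, elapsed = 0.0;
    for (int retry = 1; retry <= 15; retry++) {   /* net.ipv4.tcp_retries2 */
        elapsed += rto;
        rto = (rto * 2.0 > 120.0) ? 120.0 : rto * 2.0;
    }
    printf("15 retries take about %.1f minutes\n", elapsed / 60.0);
    return 0;
}

This prints roughly 15 minutes, consistent with the observed ca. 17
minutes once real-world timer granularity and RTT variation are
allowed for.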

Why do the keepalive parameters NOT shorten this delay?  Because the
first unacknowledged keepalive triggers TCP's retry mechanism just
described; a keepalive gets the same retry treatment as ordinary user
data.  Thus, keepalives WILL guarantee detection of the problem AT THE
TCP LAYER within the configured keepalive interval on an otherwise idle
session, but they DO NOT REDUCE THE TIME ELAPSED before the TCP layer
gives up and relays the bad news to the application layer, where you're
impatiently waiting.

Unfortunately, at least on Linux, the TCP parameter necessary to reduce
application-visible timeouts on a broken session is system-wide and is
not accessible as a socket option.  So, to summarize, based on a TCP
RTT of around 200 ms:

To guarantee that a PostgreSQL application is informed of a break in 
connectivity to its DBMS server within 2 minutes of the loss, whether or 
not client traffic is active at the time:

1. PQconnectdb() and variants may specify keepalives, keepalives_idle,
keepalives_interval, and keepalives_count as 1, 60, 60, and 1,
respectively (no need to set the count > 1 for this purpose; the TCP
layer has already retried multiple times before it tells libpq of a
persistent loss of connectivity); see the sketch following this list.

2. The CLIENT host's OS can be configured for a lower, non-default TCP
retransmission count of 7.  (For the Linux distributions I know, this
can be made non-volatile by setting net.ipv4.tcp_retries2 in config file
/etc/sysctl.conf or in a file in /etc/sysctl.d; note that "tcp_retries1"
affects other TCP recovery logic and does NOT aid session break detection.)
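
Putting the two together, a minimal sketch of the connection setup
(host/dbname/user are hypothetical placeholders; the conninfo-string
form is equivalent to the keyword arrays shown in my first message):

#include <libpq-fe.h>
#include <stdio.h>

int main(void)
{
    /* Companion OS setting, persisted in /etc/sysctl.conf or a file
       in /etc/sysctl.d:  net.ipv4.tcp_retries2 = 7
       With ~200 ms RTT this has TCP give up on a dead session in
       roughly a minute or two instead of ca. 17 minutes. */
    PGconn *conn = PQconnectdb(
        "host=db.example.org dbname=mydb user=app "
        "connect_timeout=60 "
        "keepalives=1 keepalives_idle=60 "
        "keepalives_interval=60 keepalives_count=1");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }
    /* ... normal work; poll idle sessions as shown earlier ... */
    PQfinish(conn);
    return 0;
}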

I presume (but do not know) that other *nix flavors present similar
behavior with similar remedies; YMMV.  Unfortunately, my last Sun
Solaris box went to the recycling center a couple of years ago ...