Thread: Escaping a blocked sendto() syscall without causing a restart

Escaping a blocked sendto() syscall without causing a restart

From
Jerry Sievers
Date:
Does anyone know if one of the signals below can be sent to break out
of this state *without* the postmaster sensing a crashed backend?

I've seen several times in the past, at other companies, backends that
will not respond to cancel or SIGTERM because of a syscall that's blocked
on I/O.

Quite often, though, the backend would apparently notice the broken
socket eventually, receive the signals, and exit cleanly.

I've got one that's been wedged like that for a couple of days now.

I recall trying several signals in a similar situation a while ago, and of
course one of them interrupted the syscall all right, but it was an
abort and we got the customary spontaneous postmaster restart.


 PostgreSQL 8.4.13 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1.1) 4.3.2, 64-bit

$ uname -a
Linux somebox.foo.zizzy 2.6.36 #5 SMP Thu Jul 28 17:52:31 UTC 2011 x86_64 GNU/Linux
$
$ strace -p 31603
Process 31603 attached - interrupt to quit
sendto(9, "default_rate_3m_v4: 0.1224\nmonth_"..., 3440, 0, NULL, 0 <unfinished ...>
Process 31603 detached
$
$ kill -l
 1) SIGHUP        2) SIGINT        3) SIGQUIT       4) SIGILL
 5) SIGTRAP       6) SIGABRT       7) SIGBUS        8) SIGFPE
 9) SIGKILL      10) SIGUSR1      11) SIGSEGV      12) SIGUSR2
13) SIGPIPE      14) SIGALRM      15) SIGTERM      16) SIGSTKFLT
17) SIGCHLD      18) SIGCONT      19) SIGSTOP      20) SIGTSTP
21) SIGTTIN      22) SIGTTOU      23) SIGURG       24) SIGXCPU
25) SIGXFSZ      26) SIGVTALRM    27) SIGPROF      28) SIGWINCH
29) SIGIO        30) SIGPWR       31) SIGSYS       34) SIGRTMIN
35) SIGRTMIN+1   36) SIGRTMIN+2   37) SIGRTMIN+3   38) SIGRTMIN+4
39) SIGRTMIN+5   40) SIGRTMIN+6   41) SIGRTMIN+7   42) SIGRTMIN+8
43) SIGRTMIN+9   44) SIGRTMIN+10  45) SIGRTMIN+11  46) SIGRTMIN+12
47) SIGRTMIN+13  48) SIGRTMIN+14  49) SIGRTMIN+15  50) SIGRTMAX-14
51) SIGRTMAX-13  52) SIGRTMAX-12  53) SIGRTMAX-11  54) SIGRTMAX-10
55) SIGRTMAX-9   56) SIGRTMAX-8   57) SIGRTMAX-7   58) SIGRTMAX-6
59) SIGRTMAX-5   60) SIGRTMAX-4   61) SIGRTMAX-3   62) SIGRTMAX-2
63) SIGRTMAX-1   64) SIGRTMAX
$

Thanks

--
Jerry Sievers
e: jerry.sievers@comcast.net
p: 312.241.7800


Re: Escaping a blocked sendto() syscall without causing a restart

From
Tom Lane
Date:
Jerry Sievers <gsievers19@comcast.net> writes:
> Does anyone know if one of the signals below can be sent to break out
> of this state *without* the postmaster sensing a crashed backend?

> I've seen several times in the past, at other companies, backends that
> will not respond to cancel or SIGTERM because of a syscall that's blocked
> on I/O.

> Quite often, though, the backend would apparently notice the broken
> socket eventually, receive the signals, and exit cleanly.

> I've got one that's been wedged like that for a couple of days now.

> I recall trying several signals in a similar situation a while ago, and of
> course one of them interrupted the syscall all right, but it was an
> abort and we got the customary spontaneous postmaster restart.

Offhand it looks to me like most signals would kick the backend off the
send() call ... but it would loop right back and try again.  See
internal_flush() in pqcomm.c.  (If you're using SSL, this diagnosis
may or may not apply.)
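
As a rough illustration of that retry behavior, here is a minimal C sketch of a send loop that resumes after a signal interrupts the call. It is not the actual internal_flush() code from pqcomm.c; the helper name flush_all and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not PostgreSQL source): push the whole buffer
 * out on fd. Returns 0 once everything is sent, -1 on a hard error.
 */
static int flush_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* a signal interrupted send(): try again */
            return -1;      /* any other error: give up */
        }
        sent += (size_t) n; /* partial send: advance and keep going */
    }
    return 0;
}
```

Because the loop treats EINTR as "try again", a signal that merely interrupts send() puts the process straight back into the same blocking call.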

We can't do anything except repeat the send attempt if the client
connection is to be kept in a sane state.  It's possible that if the
interrupt was a SIGTERM (forced exit) we could mark the connection dead
and return early, but it would probably take some thought and
experimentation to get useful behavior that way.  And I'm not at all
sure if we could get it to work in SSL mode ...

So the short answer is no, you probably can't kill the session without
causing a restart.  Possibly we should add a TODO to make this better.

What you might consider instead, if this is a recurring problem, is
adjusting the postmaster-side TCP keepalive parameters so that dead
connections are noticed more quickly.  The default connection timeout
according to the TCP standards is on the order of hours, but you can
reduce that quite a lot if your network environment is at all reliable.
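
For illustration, this is roughly what those keepalive settings amount to at the socket level on Linux. In PostgreSQL the server-side knobs are tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count in postgresql.conf; the C helper below is only a sketch of the underlying setsockopt() calls, and the name enable_keepalive and the sample values are assumptions for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP keepalive for fd (Linux option names).
 *   idle_secs     - idle time before the first probe is sent
 *   interval_secs - delay between unanswered probes
 *   probe_count   - unanswered probes before the connection is dropped
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
static int enable_keepalive(int fd, int idle_secs, int interval_secs,
                            int probe_count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof idle_secs) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof interval_secs) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probe_count, sizeof probe_count) < 0)
        return -1;
    return 0;
}
```

With sample values such as enable_keepalive(fd, 60, 10, 5), an idle connection whose peer has vanished would be declared dead after roughly 60 + 5 * 10 = 110 seconds, rather than after hours.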

(But it's not clear to me why your stuck-for-a-couple-days case wouldn't
have timed out long since.  Are you sure this isn't a client-side
problem, i.e., the client is wedged?  If so, why not kill the client instead?)

            regards, tom lane