Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.
Дата
Msg-id 20170920010629.3icid5ukemkrr42k@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.
Список pgsql-committers
On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So this is geniuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, incuding valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > #       This probably means the server terminated abnormally
> > #       before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
> 
> That seems pretty weird.  Maybe it's not the same case, but in
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
> 
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
> 
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG:  connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG:  connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG:  statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING:  terminating connection because of crash of
anotherserver process
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process
toroll back the current transaction and exit, because another server process exited abnormally and possibly corrupted
sharedmemory.
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your command.
 
> 
> Have we forgotten an fflush() or something?

After hacking a fix for my previous theory, I started adding strace into
the mix, to verify this. Takes longer to reproduce, but after filtering
to -e trace=network, I got this:

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
+++ exited with 2 +++

So indeed, we got a connreset before receiving the proper error
message.

The corresponding server log (debug3):
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 730
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 716
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 715
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 717
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 718
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 719
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another
serverprocess
 
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to
rollback the current t
 
ransaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your c
 
ommand.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [720] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  proc_exit(-1): 0 callbacks to make
...
2017-09-20 00:57:00.577 UTC [713] DEBUG:  server process (PID 730) exited with exit code 2
2017-09-20 00:57:00.577 UTC [713] DETAIL:  Failed process was running: SELECT pg_sleep(3600);
2017-09-20 00:57:00.577 UTC [713] LOG:  all server processes terminated; reinitializing

So the server indeed was killed by SIGQUIT, not an escalation to
SIGKILL. And it output stuff to the server log, and didn't complain
about communication to the client... Odd.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

В списке pgsql-committers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: [COMMITTERS] pgsql: Set partitioned_rels appropriately when UNIONALL is used.
Следующее
От: Andres Freund
Дата:
Сообщение: Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.