Обсуждение: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.

Поиск

Список

Период

Сортировка

[COMMITTERS] pgsql: Make new crash restart test a bit more robust.

От

Andres Freund

Дата:

19 сентября 2017 г., 23:40:20

Make new crash restart test a bit more robust.

Add timeouts in case psql doesn't deliver the expected output, and try
to cause the monitoring psql to be fully connected to a backend.  This
isn't necessarily everything needed, but at least the timeouts should
reduce the pain for buildfarm owners.

Author: Andres Freund
Reported-By: Tom Lane, BF animals prairiedog and calliphoridae
Discussion: https://postgr.es/m/E1du6ZT-00043I-91@gemulon.postgresql.org

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/1910353675bd149e1020b29c0fae02538fc358cd

Modified Files
--------------
src/test/recovery/t/013_crash_restart.pl | 34 ++++++++++++++++++++------------
1 file changed, 21 insertions(+), 13 deletions(-)


--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

От

Andres Freund

Дата:

20 сентября 2017 г., 01:20:32

On 2017-09-19 17:40:20 +0000, Andres Freund wrote:
> Make new crash restart test a bit more robust.
> 
> Add timeouts in case psql doesn't deliver the expected output, and try
> to cause the monitoring psql to be fully connected to a backend.  This
> isn't necessarily everything needed, but at least the timeouts should
> reduce the pain for buildfarm owners.
> 
> Author: Andres Freund
> Reported-By: Tom Lane, BF animals prairiedog and calliphoridae
> Discussion: https://postgr.es/m/E1du6ZT-00043I-91@gemulon.postgresql.org

Checkining on calliphoridae why that's not sufficient - the machine's
busy, so the build & test will take a bit.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.

От

Tom Lane

Дата:

20 сентября 2017 г., 01:24:49

Andres Freund <andres@anarazel.de> writes:
> Checkining on calliphoridae why that's not sufficient - the machine's
> busy, so the build & test will take a bit.

FWIW, prairiedog got through the recovery tests this time --- run's
still going though.
        regards, tom lane


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

От

Andres Freund

Дата:

20 сентября 2017 г., 02:25:18

On 2017-09-19 15:24:49 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Checkining on calliphoridae why that's not sufficient - the machine's
> > busy, so the build & test will take a bit.
> 
> FWIW, prairiedog got through the recovery tests this time --- run's
> still going though.

So this is geniuinely interesting. When the machine is really loaded (as
in 6 animals running on a vm at the same time, incuding valgrind), psql
sometimes doesn't get the WARNING message from a shutdown. Instead it
gets
# psql:<stdin>:3: server closed the connection unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
# psql:<stdin>:3: connection to server was lost

We can obviously easily make the test accept both - but are we ok with
the client sometimes not getting the message?

Greetings,

Andres Freund

-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.

От

Tom Lane

Дата:

20 сентября 2017 г., 02:46:58

Andres Freund <andres@anarazel.de> writes:
> So this is geniuinely interesting. When the machine is really loaded (as
> in 6 animals running on a vm at the same time, incuding valgrind), psql
> sometimes doesn't get the WARNING message from a shutdown. Instead it
> gets
> # psql:<stdin>:3: server closed the connection unexpectedly
> #       This probably means the server terminated abnormally
> #       before or while processing the request.
> # psql:<stdin>:3: connection to server was lost

That seems pretty weird.  Maybe it's not the same case, but in

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02

you can see from the postmaster log that the backend *is* issuing
the message, or at least it's getting to the server log:

2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG:  connection received: host=[local]
2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG:  connection authorized: user=andres database=postgres
2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG:  statement: SELECT $$psql-connected$$;
...
2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another
serverprocess 
2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to
rollback the current transaction and exit, because another server process exited abnormally and possibly corrupted
sharedmemory. 
2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your command. 

Have we forgotten an fflush() or something?

Also, maybe problem is on client side.  I vaguely recall a libpq bug
wherein it would complain about socket EOF even though data remained
to be processed.  Maybe we reintroduced something like that?

> We can obviously easily make the test accept both - but are we ok with
> the client sometimes not getting the message?

I'm not ...
        regards, tom lane

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

От

Andres Freund

Дата:

20 сентября 2017 г., 02:53:18

On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So this is geniuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, incuding valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > #       This probably means the server terminated abnormally
> > #       before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
> 
> That seems pretty weird.  Maybe it's not the same case, but in
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
> 
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
> 
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG:  connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG:  connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG:  statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING:  terminating connection because of crash of
anotherserver process
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process
toroll back the current transaction and exit, because another server process exited abnormally and possibly corrupted
sharedmemory.
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your command.
 

I think it's likely the same - I've observed the same with the added
instrumentation.


> Have we forgotten an fflush() or something?
> 
> Also, maybe problem is on client side.  I vaguely recall a libpq bug
> wherein it would complain about socket EOF even though data remained
> to be processed.  Maybe we reintroduced something like that?

That seems quite possible.


> > We can obviously easily make the test accept both - but are we ok with
> > the client sometimes not getting the message?
> 
> I'm not ...

Same here.

I'll see if I can spot the bug in an hour or two. If not I'll make the
test temporarily accept both outputs while investigating?


Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.

От

Tom Lane

Дата:

20 сентября 2017 г., 03:03:07

Andres Freund <andres@anarazel.de> writes:
> I'll see if I can spot the bug in an hour or two. If not I'll make the
> test temporarily accept both outputs while investigating?

Seems reasonable.
        regards, tom lane


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

От

Andres Freund

Дата:

20 сентября 2017 г., 05:16:50

On 2017-09-19 13:53:18 -0700, Andres Freund wrote:
> On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> > Have we forgotten an fflush() or something?
> > 
> > Also, maybe problem is on client side.  I vaguely recall a libpq bug
> > wherein it would complain about socket EOF even though data remained
> > to be processed.  Maybe we reintroduced something like that?
> 
> That seems quite possible.

Preface: This is largely a as of yet unverified theory. But seems
worthwhile to investigate whether it's the cause of this issue or not.

By changing the error message I know that the "server closed the
connection unexpectedly" ERROR is coming from pqsecure_raw_read(). The
caller here has to be pqReadData(). Where have the following block:
if (nread > 0){    conn->inEnd += nread;
    /*     * Hack to deal with the fact that some kernels will only give us back     * 1 packet per recv() call, even
ifwe asked for more and there is     * more available.  If it looks like we are reading a long message,     * loop back
torecv() again immediately, until we run out of data or     * buffer space.  Without this, the block-and-restart
behaviorof     * libpq's higher levels leads to O(N^2) performance on long messages.     *     * Since we
left-justifiedthe data above, conn->inEnd gives the     * amount of data already read in the current message.  We
consider    * the message "long" once we have acquired 32k ...B     */    if (conn->inEnd > 32768 &&
(conn->inBufSize- conn->inEnd) >= 8192)    {        someread = 1;        goto retry3;    }    return 1;}

So imagine we've just read one block containing the error message from
the server. That's going to be less than 32kb. So we go into the retry3
path.
/* OK, try to read some data */
retry3:
nread = pqsecure_read(conn, conn->inBuffer + conn->inEnd,                      conn->inBufSize - conn->inEnd);if (nread
<0){    if (SOCK_ERRNO == EINTR)        goto retry3;    /* Some systems return EAGAIN/EWOULDBLOCK for no data */

#ifdef EAGAIN    if (SOCK_ERRNO == EAGAIN)        return someread;
#endif
#if defined(EWOULDBLOCK) && (!defined(EAGAIN) || (EWOULDBLOCK != EAGAIN))    if (SOCK_ERRNO == EWOULDBLOCK)
returnsomeread;

#endif    /* We might get ECONNRESET here if using TCP and backend died */
#ifdef ECONNRESET    if (SOCK_ERRNO == ECONNRESET)        goto definitelyFailed;
#endif    /* pqsecure_read set the error message for us */    return -1;}

So if the connection actually was closed we:
definitelyFailed:/* Do *not* drop any already-read data; caller still wants it */pqDropConnection(conn,
false);conn->status= CONNECTION_BAD;    /* No more connection to backend */return -1;

and PQgetResult() will happily signal that upwards with an error:    /* Wait for some more data, and load it. */    if
(flushResult||        pqWait(TRUE, FALSE, conn) ||        pqReadData(conn) < 0)    {        /*         *
conn->errorMessagehas been set by pqWait or pqReadData. We         * want to append it to any already-received error
message.        */        pqSaveErrorResult(conn);        conn->asyncStatus = PGASYNC_IDLE;        return
pqPrepareAsyncResult(conn);

ISTM, we need to react differently if we've already partially read data
successfully? Am I missing something?

Greetings,

Andres Freund

-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

От

Andres Freund

Дата:

20 сентября 2017 г., 07:06:29

On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So this is geniuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, incuding valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > #       This probably means the server terminated abnormally
> > #       before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
> 
> That seems pretty weird.  Maybe it's not the same case, but in
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
> 
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
> 
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG:  connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG:  connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG:  statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING:  terminating connection because of crash of
anotherserver process
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process
toroll back the current transaction and exit, because another server process exited abnormally and possibly corrupted
sharedmemory.
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your command.
 
> 
> Have we forgotten an fflush() or something?

After hacking a fix for my previous theory, I started adding strace into
the mix, to verify this. Takes longer to reproduce, but after filtering
to -e trace=network, I got this:

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
+++ exited with 2 +++

So indeed, we got a connreset before receiving the proper error
message.

The corresponding server log (debug3):
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 730
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 716
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 715
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 717
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 718
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 719
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another
serverprocess
 
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to
rollback the current t
 
ransaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your c
 
ommand.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [720] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  proc_exit(-1): 0 callbacks to make
...
2017-09-20 00:57:00.577 UTC [713] DEBUG:  server process (PID 730) exited with exit code 2
2017-09-20 00:57:00.577 UTC [713] DETAIL:  Failed process was running: SELECT pg_sleep(3600);
2017-09-20 00:57:00.577 UTC [713] LOG:  all server processes terminated; reinitializing

So the server indeed was killed by SIGQUIT, not an escalation to
SIGKILL. And it output stuff to the server log, and didn't complain
about communication to the client... Odd.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.

От

Andres Freund

Дата:

20 сентября 2017 г., 08:00:38

On 2017-09-19 18:06:29 -0700, Andres Freund wrote:
> On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> > Have we forgotten an fflush() or something?
> 
> After hacking a fix for my previous theory, I started adding strace into
> the mix, to verify this. Takes longer to reproduce, but after filtering
> to -e trace=network, I got this:
> 
> socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
> socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
> socket(AF_UNIX, SOCK_STREAM, 0)         = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
> getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
> sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
> recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
> sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
> recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
> sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
> recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
> +++ exited with 2 +++
> 
> So indeed, we got a connreset before receiving the proper error
> message.
> 
> The corresponding server log (debug3):
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 730
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 716
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 715
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 717
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 718
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 719
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another
serverprocess
 
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to
rollback the current t
 
> ransaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your c
 
> ommand.
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to
make
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
> 2017-09-20 00:57:00.573 UTC [720] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  proc_exit(-1): 0 callbacks to make
> ...
> 2017-09-20 00:57:00.577 UTC [713] DEBUG:  server process (PID 730) exited with exit code 2
> 2017-09-20 00:57:00.577 UTC [713] DETAIL:  Failed process was running: SELECT pg_sleep(3600);
> 2017-09-20 00:57:00.577 UTC [713] LOG:  all server processes terminated; reinitializing
> 
> So the server indeed was killed by SIGQUIT, not an escalation to
> SIGKILL. And it output stuff to the server log, and didn't complain
> about communication to the client... Odd.

Hah! The reason for that is that socket_flush tries to avoid doing stuff
recursively:

static int
socket_flush(void)
{int            res;
/* No-op if reentrant call */if (PqCommBusy)    return 0;
...

(detected by putting an elog(COMMERROR) there)

adding an abort shows the following backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f089de4e42a in __GI_abort () at abort.c:89
#2  0x000056473218a3f6 in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1408
#3  0x000056473246e4ec in send_message_to_frontend (edata=0x5647329e34e0 <errordata>)   at
/home/andres/src/postgresql/src/backend/utils/error/elog.c:3319
#4  0x000056473246ad02 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1483
#5  0x00005647324680af in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:483
#6  0x00005647322fb340 in quickdie (postgres_signal_arg=3) at
/home/andres/src/postgresql/src/backend/tcop/postgres.c:2608
#7  <signal handler called>
#8  0x00007f089fbb65bd in __libc_send (fd=7, buf=0x564734a22c80, n=79, flags=0) at
../sysdeps/unix/sysv/linux/x86_64/send.c:26
#9  0x000056473217c874 in secure_raw_write (port=0x564734a1c730, ptr=0x564734a22c80, len=79)   at
/home/andres/src/postgresql/src/backend/libpq/be-secure.c:310
#10 0x000056473217c6fc in secure_write (port=0x564734a1c730, ptr=0x564734a22c80, len=79)   at
/home/andres/src/postgresql/src/backend/libpq/be-secure.c:256
#11 0x000056473218a47c in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1437
#12 0x000056473218a40c in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1413
#13 0x00005647322f693d in ReadyForQuery (dest=DestRemote) at /home/andres/src/postgresql/src/backend/tcop/dest.c:259
#14 0x00005647322fd078 in PostgresMain (argc=1, argv=0x564734a24db8, dbname=0x564734a24c98 "postgres",
username=0x5647349fbda8"andres")   at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4047
 
#15 0x000056473225b54e in BackendRun (port=0x564734a1c730) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4364
#16 0x000056473225ac5e in BackendStartup (port=0x564734a1c730) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4036
#17 0x0000564732257007 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1756
#18 0x0000564732256596 in PostmasterMain (argc=3, argv=0x5647349f9b70)   at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1364
#19 0x0000564732190be8 in main (argc=3, argv=0x5647349f9b70) at
/home/andres/src/postgresql/src/backend/main/main.c:228

I think this is actually not the $monitor backend but $killme, but I
think it shows how this is possible.  I don't think there's an easy way
out of this - making all low-level client communcation properly
reentrant sounds hard.  What I would like to do is to add an log message
somewhere around send_message_to_frontend() noticing that the message to
the client could not be sent, to avoid future confusion like this.

Does anybody have a better idea?

Given this fact pattern, I'll allow the case without a received error
message in the recovery test. Objections?

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Make new crash restart test abit more robust.

От

Andres Freund

Дата:

20 сентября 2017 г., 10:42:15

On 2017-09-19 19:00:38 -0700, Andres Freund wrote:
> Given this fact pattern, I'll allow the case without a received error
> message in the recovery test. Objections?

Hearing none. Pushed.

While debugging this, I've also introduced a pump wrapper so that we now
get:
ok 4 - exactly one process killed with SIGQUIT
# aborting wait: program died
# stream contents: >>psql:<stdin>:9: WARNING:  terminating connection because of crash of another server process
# DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.

# HINT:  In a moment you should be able to reconnect to the database and repeat your command.
# psql:<stdin>:9: server closed the connection unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
# psql:<stdin>:9: connection to server was lost
# <<
# pattern searched for: (?^m:MAYDAY:  terminating connection because of crash of another server process)
not ok 5 - psql query died successfully after SIGQUIT

Greetings,

Andres Freund

-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Make new crash restart test abit more robust.

От

Thomas Munro

Дата:

22 сентября 2017 г., 08:37:38

On Wed, Sep 20, 2017 at 4:42 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-09-19 19:00:38 -0700, Andres Freund wrote:
>> Given this fact pattern, I'll allow the case without a received error
>> message in the recovery test. Objections?
>
> Hearing none. Pushed.
>
> While debugging this, I've also introduced a pump wrapper so that we now
> get:
> ok 4 - exactly one process killed with SIGQUIT
> # aborting wait: program died
> # stream contents: >>psql:<stdin>:9: WARNING:  terminating connection because of crash of another server process
> # DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.
 
> # HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> # psql:<stdin>:9: server closed the connection unexpectedly
> #       This probably means the server terminated abnormally
> #       before or while processing the request.
> # psql:<stdin>:9: connection to server was lost
> # <<
> # pattern searched for: (?^m:MAYDAY:  terminating connection because of crash of another server process)
> not ok 5 - psql query died successfully after SIGQUIT

Seeing these failures in 013_crash_restart.pl from time to time on
Travis CI.  Examples:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/278419122
https://travis-ci.org/postgresql-cfbot/postgresql/builds/278247756

There are a couple of other weird problems in the TAP test that
probably belong on another thread (see build IDs 278302509 and
278247756 which are for different CF patches but exhibit the same
symptom: some test never returns control but we can't see its output,
maybe due to -Otarget, before the whole job is nuked by Travis for not
making progress).

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.