Re: BUG #16811: Severe reproducible server backend crash

From: James Inform
Subject: Re: BUG #16811: Severe reproducible server backend crash
Date:
Msg-id: 78c1f0db-8417-7bc3-83e4-1673872c0793@pharmapp.de
In response to: Re: BUG #16811: Severe reproducible server backend crash (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
Is there any further progress on this topic?

January 7, 2021, 18:22

I can reproduce it without anything extra. What's needed is to run
the problematic statement in extended query mode, which you can
do like this:

$ cat foo.sql
do $$ begin rollback; end $$;

$ pgbench -n -f foo.sql -M prepared
pgbench: error: client 0 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing
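
[Editor's sketch, not part of Tom's message: the extended-query-protocol path
that pgbench -M prepared exercises here (and that the JDBC driver behind
DBeaver uses) can also be driven directly from libpq. This is a minimal
illustration, assuming a local "postgres" database and an unpatched 13.1
server; it should hit the same exec_execute_message path seen in the
backtraces in this thread.]

/*
 * repro.c: send the DO block once via the simple protocol (PQexec) and once
 * via the extended protocol (PQexecParams). Per the thread, only the
 * extended path takes the backend down.
 *
 * Build: cc repro.c -o repro -I$(pg_config --includedir) \
 *            -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    const char *sql = "do $$ begin rollback; end $$;";
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Simple query protocol: reported to work fine (this is what psql uses). */
    res = PQexec(conn, sql);
    printf("simple protocol:   %s\n", PQresStatus(PQresultStatus(res)));
    PQclear(res);

    /* Extended query protocol (parse/bind/execute): on an unpatched 13.1
     * backend this is where the crash occurs. */
    res = PQexecParams(conn, sql, 0, NULL, NULL, NULL, NULL, 0);
    printf("extended protocol: %s\n", PQresStatus(PQresultStatus(res)));
    PQclear(res);

    PQfinish(conn);
    return 0;
}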

That lnext() should certainly not find pstmt->stmts to be NIL,
seeing that we are inside a loop over that list. Ergo, something
is clobbering this active portal. A bit of gdb'ing says the clobber
happens here:

#0 AtAbort_Portals () at portalmem.c:833
(this appears to be inlined code from PortalReleaseCachedPlan)
#1 0x00000000005a4ce2 in AbortTransaction () at xact.c:2711
#2 0x00000000005a55d5 in AbortCurrentTransaction () at xact.c:3322
#3 0x00000000006d1557 in _SPI_rollback (chain=<optimized out>) at spi.c:326
#4 0x00007feef9e851c5 in exec_stmt_rollback (stmt=0x2babca8,
estate=0x7fff35e55ee0) at pl_exec.c:4961
#5 exec_stmts (estate=0x7fff35e55ee0, stmts=0x2babd80) at pl_exec.c:2081
#6 0x00007feef9e863cb in exec_stmt_block (estate=0x7fff35e55ee0,
block=0x2babdd8) at pl_exec.c:1904
#7 0x00007feef9e864bb in exec_toplevel_block (
estate=estate@entry=0x7fff35e55ee0, block=0x2babdd8) at pl_exec.c:1602
#8 0x00007feef9e86ced in plpgsql_exec_function (func=func@entry=0x2ba7c60,
fcinfo=fcinfo@entry=0x7fff35e56060,
simple_eval_estate=simple_eval_estate@entry=0x2bad6b0,
simple_eval_resowner=simple_eval_resowner@entry=0x2b12e40,
atomic=<optimized out>) at pl_exec.c:605
#9 0x00007feef9e8fd58 in plpgsql_inline_handler (fcinfo=<optimized out>)
at pl_handler.c:344
#10 0x000000000091a540 in FunctionCall1Coll (flinfo=0x7fff35e561f0,
collation=<optimized out>, arg1=<optimized out>) at fmgr.c:1141
#11 0x000000000091aaa9 in OidFunctionCall1Coll (functionId=<optimized out>,
collation=collation@entry=0, arg1=45120272) at fmgr.c:1419
#12 0x000000000064df7e in ExecuteDoStmt (stmt=stmt@entry=0x2b07ed8,
atomic=atomic@entry=false) at functioncmds.c:2027
#13 0x000000000080fa14 in standard_ProcessUtility (pstmt=0x2b07e40,
queryString=0x2b079a0 "do $$ begin rollback; end $$;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0xa90540 <donothingDR>, qc=0x7fff35e56630) at utility.c:696
#14 0x000000000080d044 in PortalRunUtility (portal=0x2b47240, pstmt=0x2b07e40,
isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>,
dest=0xa90540 <donothingDR>, qc=0x7fff35e56630) at pquery.c:1159
#15 0x000000000080db24 in PortalRunMulti (portal=portal@entry=0x2b47240,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=0xa90540 <donothingDR>,
dest@entry=0x2adfa88, altdest=0xa90540 <donothingDR>,
altdest@entry=0x2adfa88, qc=qc@entry=0x7fff35e56630) at pquery.c:1311
#16 0x000000000080e937 in PortalRun (portal=portal@entry=0x2b47240,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x2adfa88,
altdest=altdest@entry=0x2adfa88, qc=0x7fff35e56630) at pquery.c:779
#17 0x000000000080c77b in exec_execute_message (max_rows=9223372036854775807,
portal_name=0x2adf670 "") at postgres.c:2196
#18 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff35e569c0,
dbname=<optimized out>, username=<optimized out>) at postgres.c:4452

So I would say that the conditions under which AtAbort_Portals
decides that it can destroy a portal rather than just mark it failed
need to be reconsidered. It's not clear to me exactly how that
should change though. Maybe Peter has more insight.

regards, tom lane
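
[Editor's sketch, not PostgreSQL code: a self-contained C toy of the failure
class Tom describes above. An abort handler frees and nulls a list while the
caller is still walking it, the analogue of AtAbort_Portals releasing the
cached plan and thereby clearing portal->stmts underneath the loop in
PortalRunMulti.]

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a Portal: just a statement list and its length. */
struct fake_portal
{
    int  *stmts;
    int   nstmts;
};

/* Stand-in for the transaction-abort path: releases the "plan" and thereby
 * clears the statement list the caller is still using. */
static void
at_abort(struct fake_portal *p)
{
    free(p->stmts);
    p->stmts = NULL;            /* the clobber */
    p->nstmts = 0;
}

int
main(void)
{
    struct fake_portal portal = { calloc(2, sizeof(int)), 2 };

    /* The caller set out to run two statements and does not expect the
     * list to vanish mid-loop. */
    for (int i = 0; i < 2; i++)
    {
        /* "Run" statement i; the first one performs a rollback, which takes
         * the abort path and clears portal.stmts behind the loop's back. */
        if (i == 0)
            at_abort(&portal);

        /* Analogue of the post-statement lnext(portal->stmts, ...) check:
         * it dereferences the list header that was just nulled and
         * segfaults, much like the lnext() frame in the other backtrace
         * in this thread. */
        printf("next stmt present: %d\n", portal.stmts[i] != 0);
    }
    return 0;
}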
January 7, 2021, 11:14
On Thu, Jan 7, 2021 at 10:25 PM PG Bug reporting form wrote:

Thanks for the report. I happened to have DBeaver here and could
reproduce this, and got the following core:

#0 0x00005651d5c07927 in lnext (l=0x0, c=0x5651d794aca0) at
../../../src/include/nodes/pg_list.h:312
312 Assert(c >= &l->elements[0] && c < &l->elements[l->length]);
(gdb) bt
#0 0x00005651d5c07927 in lnext (l=0x0, c=0x5651d794aca0) at
../../../src/include/nodes/pg_list.h:312
#1 0x00005651d5c096c9 in PortalRunMulti (portal=0x5651d79900a0,
isTopLevel=true, setHoldSnapshot=false,
dest=0x5651d60a1580 <donothingDR>, altdest=0x5651d60a1580
<donothingDR>, qc=0x7ffc44cf7530) at pquery.c:1321
#2 0x00005651d5c08b31 in PortalRun (portal=0x5651d79900a0, count=200,
isTopLevel=true, run_once=false,
dest=0x5651d7928378, altdest=0x5651d7928378, qc=0x7ffc44cf7530) at
pquery.c:779
#3 0x00005651d5c03fea in exec_execute_message
(portal_name=0x5651d7927f60 "", max_rows=200) at postgres.c:2196
#4 0x00005651d5c06d28 in PostgresMain (argc=1, argv=0x7ffc44cf7760,
dbname=0x5651d79235f8 "postgres",
username=0x5651d7953f58 "tmunro") at postgres.c:4405
#5 0x00005651d5b481d8 in BackendRun (port=0x5651d794d100) at postmaster.c:4484
#6 0x00005651d5b47b07 in BackendStartup (port=0x5651d794d100) at
postmaster.c:4206
#7 0x00005651d5b43f28 in ServerLoop () at postmaster.c:1730
#8 0x00005651d5b43777 in PostmasterMain (argc=3, argv=0x5651d79215b0)
at postmaster.c:1402
#9 0x00005651d5a43279 in main (argc=3, argv=0x5651d79215b0) at main.c:209
(gdb) f 1
#1 0x00005651d5c096c9 in PortalRunMulti (portal=0x5651d79900a0,
isTopLevel=true, setHoldSnapshot=false,
dest=0x5651d60a1580 <donothingDR>, altdest=0x5651d60a1580
<donothingDR>, qc=0x7ffc44cf7530) at pquery.c:1321
1321 if (lnext(portal->stmts, stmtlist_item) != NULL)
(gdb) print portal->stmts
$1 = (List *) 0x0

I didn't have time to investigate whether this is the right fix, but
this cargo cult change fixes the problem:

--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1318,7 +1318,7 @@ PortalRunMulti(Portal portal,
             * Increment command counter between queries, but not after the last
             * one.
             */
-           if (lnext(portal->stmts, stmtlist_item) != NULL)
+           if (portal->stmts && lnext(portal->stmts, stmtlist_item) != NULL)
                CommandCounterIncrement();

Maybe something to do with commit
1cff1b95ab6ddae32faa3efe0d95a820dbfdc164. I can dig some more
tomorrow if someone doesn't beat me to it.
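
[Editor's note, not part of Thomas's message: commit 1cff1b95 is the rewrite
of List as an expansible array that went into v13, and with it lnext()
changed from a macro over the cell alone into an inline function that also
needs the owning List. A simplified paraphrase of the v13 version from
src/include/nodes/pg_list.h (the Assert is the line 312 shown in the gdb
output above) makes the crash mechanism visible:]

static inline ListCell *
lnext(const List *l, const ListCell *c)
{
    /* With l == NIL, i.e. a NULL pointer, &l->elements[...] dereferences
     * NULL, so a clobbered portal->stmts crashes right here. */
    Assert(c >= &l->elements[0] && c < &l->elements[l->length]);
    c++;
    if (c < &l->elements[l->length])
        return (ListCell *) c;
    return NULL;
}

[Before that commit, lnext was roughly "#define lnext(lc) ((lc)->next)" and
did not touch portal->stmts at this call site, which is at least consistent
with the report that 12.5 does not crash here.]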
January 7, 2021, 00:25
The following bug has been logged on the website:

Bug reference: 16811
Logged by: James Inform
Email address: james.inform@pharmapp.de
PostgreSQL version: 13.1
Operating system: Ubuntu 18.04 LTS AND macOS Catalina 10.15.7
Description:

First things first: Happy new year to all of you and stay healthy during
these days.

I have run into a severe backend crash that makes the whole PostgreSQL server
shut down and restart with a recovery message.

This is reproducible on Mac using the latest version of the Postgresapp
(www.postgresapp.com), which comes with PG 13.1, as well as on Ubuntu with a
self-built PG 13.1 and also with the latest version from the REL_13_STABLE
branch.

The issue doesn't exist on PG 12.5, neither on Mac nor on Ubuntu.

It's very severe because, with the steps for reproduction below, it seems that
you can tear down every PostgreSQL 13 server you have access to.

It's enough to run the following simple anonymous block with an appropriate
db tool:

do
$$
begin
rollback;
end
$$
;

This works fine in psql or from Python using psycopg2.

But when using the DBeaver tool (www.dbeaver.io / Version 7.3.2) to run the
command, the whole PostgreSQL server is torn down. It is not only DBeaver: it
is also reproducible with Lazarus/Delphi [yes, I am still using good old
Pascal :)] on Linux and Windows. With Lazarus I am using components that have
a PG client library compiled in, while DBeaver uses Java libraries. I can
reproduce the behaviour with these two tools and I would bet that there are
many more tools out there that deliver the same result.

Steps to reproduce:
1. Install or build PostgreSQL 13.1 on Ubuntu or use the Postgresapp on
Mac.
2. Init a new PG13 cluster (initdb)
3. Download and install DBeaver on Linux, Mac or Windows
4. Connect to PostgreSQL using the postgres superuser
5. Execute the anonymous block from above
6. Voila ... you have forced PostgreSQL to shut down and recover

I have started postgres with the core dump option: pg_ctl start -D /opt/data13 -c
So I have a core dump at hand. But using the bug form on www.postgresql.org
I am not able to add an attachment. How can I provide you with such an
attachment?

Cheers, James

--################################################################################
-- Now the relevant log entries
-- You can see that the server starts
-- And just after executing the command from (in this case) DBeaver, the
corresponding error is logged
--################################################################################

pglinux@aggaserv:/opt/data13$ pg_ctl start -D /opt/data13/ -c
waiting for server to start....2021-01-06 22:36:18.375 UTC [38787] LOG:
starting PostgreSQL 13.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2021-01-06 22:36:18.375 UTC [38787] LOG: listening on IPv6 address "::1",
port 5432
2021-01-06 22:36:18.375 UTC [38787] LOG: listening on IPv4 address
"127.0.0.1", port 5432
2021-01-06 22:36:18.386 UTC [38787] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5432"
2021-01-06 22:36:18.405 UTC [38788] LOG: database system was shut down at
2021-01-06 22:26:20 UTC
2021-01-06 22:36:18.417 UTC [38787] LOG: database system is ready to accept
connections
done
server started
pglinux@aggaserv:/opt/data13$ 2021-01-06 22:36:30.022 UTC [38787] LOG:
server process (PID 38881) was terminated by signal 11: Segmentation fault
2021-01-06 22:36:30.022 UTC [38787] DETAIL: Failed process was running:
do
$$
begin
rollback;
end
$$

2021-01-06 22:36:30.022 UTC [38787] LOG: terminating any other active
server processes
2021-01-06 22:36:30.022 UTC [38792] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:30.022 UTC [38792] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:30.022 UTC [38792] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:30.023 UTC [38880] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:30.023 UTC [38880] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:30.023 UTC [38880] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:30.024 UTC [38787] LOG: all server processes terminated;
reinitializing
2021-01-06 22:36:30.043 UTC [38884] LOG: database system was interrupted;
last known up at 2021-01-06 22:36:18 UTC
2021-01-06 22:36:30.088 UTC [38884] LOG: database system was not properly
shut down; automatic recovery in progress
2021-01-06 22:36:30.095 UTC [38884] LOG: invalid record length at
0/1693F78: wanted 24, got 0
2021-01-06 22:36:30.095 UTC [38884] LOG: redo is not required
2021-01-06 22:36:30.144 UTC [38787] LOG: database system is ready to accept
connections
2021-01-06 22:36:30.686 UTC [38787] LOG: server process (PID 38977) was
terminated by signal 11: Segmentation fault
2021-01-06 22:36:30.686 UTC [38787] DETAIL: Failed process was running:
do
$$
begin
rollback;
end
$$

2021-01-06 22:36:30.686 UTC [38787] LOG: terminating any other active
server processes
2021-01-06 22:36:30.686 UTC [38906] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:30.686 UTC [38906] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:30.686 UTC [38906] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:30.686 UTC [38976] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:30.686 UTC [38976] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:30.686 UTC [38976] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:30.688 UTC [38787] LOG: all server processes terminated;
reinitializing
2021-01-06 22:36:30.695 UTC [38979] LOG: database system was interrupted;
last known up at 2021-01-06 22:36:30 UTC
2021-01-06 22:36:30.702 UTC [38979] LOG: database system was not properly
shut down; automatic recovery in progress
2021-01-06 22:36:30.709 UTC [38979] LOG: invalid record length at
0/1693FF0: wanted 24, got 0
2021-01-06 22:36:30.709 UTC [38979] LOG: redo is not required
2021-01-06 22:36:30.748 UTC [38787] LOG: database system is ready to accept
connections
2021-01-06 22:36:31.894 UTC [38787] LOG: server process (PID 39159) was
terminated by signal 11: Segmentation fault
2021-01-06 22:36:31.894 UTC [38787] DETAIL: Failed process was running:
do
$$
begin
rollback;
end
$$

2021-01-06 22:36:31.894 UTC [38787] LOG: terminating any other active
server processes
2021-01-06 22:36:31.894 UTC [39158] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:31.894 UTC [39158] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:31.894 UTC [39158] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:31.894 UTC [38986] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:31.894 UTC [38986] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:31.894 UTC [38986] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:31.895 UTC [38787] LOG: all server processes terminated;
reinitializing
2021-01-06 22:36:31.905 UTC [39161] LOG: database system was interrupted;
last known up at 2021-01-06 22:36:30 UTC
2021-01-06 22:36:31.912 UTC [39161] LOG: database system was not properly
shut down; automatic recovery in progress
2021-01-06 22:36:31.919 UTC [39161] LOG: invalid record length at
0/1694080: wanted 24, got 0
2021-01-06 22:36:31.919 UTC [39161] LOG: redo is not required
2021-01-06 22:36:31.958 UTC [38787] LOG: database system is ready to accept
connections
2021-01-06 22:36:32.561 UTC [38787] LOG: server process (PID 39253) was
terminated by signal 11: Segmentation fault
2021-01-06 22:36:32.561 UTC [38787] DETAIL: Failed process was running:
do
$$
begin
rollback;
end
$$

2021-01-06 22:36:32.561 UTC [38787] LOG: terminating any other active
server processes
2021-01-06 22:36:32.562 UTC [39252] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:32.562 UTC [39252] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:32.562 UTC [39252] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:32.562 UTC [39167] WARNING: terminating connection because
of crash of another server process
2021-01-06 22:36:32.562 UTC [39167] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2021-01-06 22:36:32.562 UTC [39167] HINT: In a moment you should be able to
reconnect to the database and repeat your command.
2021-01-06 22:36:32.562 UTC [38787] LOG: all server processes terminated;
reinitializing
2021-01-06 22:36:32.569 UTC [39255] LOG: database system was interrupted;
last known up at 2021-01-06 22:36:31 UTC
2021-01-06 22:36:32.576 UTC [39255] LOG: database system was not properly
shut down; automatic recovery in progress
2021-01-06 22:36:32.583 UTC [39255] LOG: invalid record length at
0/16940F8: wanted 24, got 0
2021-01-06 22:36:32.583 UTC [39255] LOG: redo is not required
2021-01-06 22:36:32.621 UTC [38787] LOG: database system is ready to accept
connections

