Обсуждение: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed
BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 15727 Logged by: Roman Zharkov Email address: r.zharkov@postgrespro.ru PostgreSQL version: Unsupported/Unknown Operating system: Ubuntu 5.4.0-6ubuntu1~16.04.10 Description: Hello, Our test server crashed while benchmarking connection pooler called Yandex Odyssey. There is no debugging info in the code dump now, because it takes few hours to reproduce the error. We will try to reproduce the error with debugging info. Server version - PostgreSQL 12devel-master, commit - bb76134 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit The error does not reproduce on PostgreSQL 12devel-master, commit - 2fadf24 2019-04-01 15:27:38.829 +07 [7524] ERROR: unexpected table_lock_tuple status: 1 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1; 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction 400048276, it was already committed 2019-04-01 15:27:39.001 +07 [7499] ERROR: unexpected table_lock_tuple status: 1 2019-04-01 15:27:39.001 +07 [7499] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2; 2019-04-01 15:27:39.001 +07 [7499] PANIC: cannot abort transaction 400051561, it was already committed 2019-04-01 15:27:41.193 +07 [7509] ERROR: unexpected table_lock_tuple status: 1 2019-04-01 15:27:41.193 +07 [7509] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1; 2019-04-01 15:27:41.193 +07 [7509] PANIC: cannot abort transaction 400088425, it was already committed 2019-04-01 15:41:01.176 +07 [3559] LOG: server process (PID 7499) was terminated by signal 6: Aborted 2019-04-01 15:41:01.176 +07 [3559] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2; 2019-04-01 15:41:01.176 +07 [3559] LOG: terminating any other active server processes [New LWP 7509] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test 127.0.0.1(42666) UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x00007f1f4c6c902a in __GI_abort () at abort.c:89 #2 0x000000000085faac in errfinish () #3 0x00000000008633e4 in elog_finish () #4 0x000000000050266d in RecordTransactionAbort () #5 0x0000000000502788 in AbortTransaction () #6 0x0000000000502fb5 in AbortCurrentTransaction () #7 0x000000000075029f in PostgresMain () #8 0x0000000000480593 in ServerLoop () #9 0x00000000006dfc04 in PostmasterMain () #10 0x00000000004813b9 in main ()
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-02 10:30:15 +0000, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 15727 > Logged by: Roman Zharkov > Email address: r.zharkov@postgrespro.ru > PostgreSQL version: Unsupported/Unknown > Operating system: Ubuntu 5.4.0-6ubuntu1~16.04.10 > Description: > > Hello, > Our test server crashed while benchmarking connection pooler called Yandex > Odyssey. What's the exact workload? Just pgbench? > There is no debugging info in the code dump now, because it takes few hours > to reproduce the error. We will try to reproduce the error with debugging > info. Unfortunately a backtrace is probably not going to help up much, I'm afraid. The issue likely is triggered at an earlier state than the error thrown here. > Server version - PostgreSQL 12devel-master, commit - bb76134 on > x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 > 20160609, 64-bit > > The error does not reproduce on PostgreSQL 12devel-master, commit - > 2fadf24 > > 2019-04-01 15:27:38.829 +07 [7524] ERROR: unexpected table_lock_tuple > status: 1 Hm, so that appears to be related to my changes. Could you run this with log_error_verbosity=verbose, so we know which of the codepaths this hits? Apparently we're encountering an invisible tuple somewhere, which I thought was not possible (hence the error) - previously we just ignored that. > 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET > abalance = 1 WHERE aid = 1; > 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction > 400048276, it was already committed But that's probably a separate issue. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET >> abalance = 1 WHERE aid = 1; >> 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction >> 400048276, it was already committed > But that's probably a separate issue. What that seems to indicate is that the "unexpected table_lock_tuple status" error was thrown during commit, which seems pretty odd. regards, tom lane
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-02 11:44:26 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET > >> abalance = 1 WHERE aid = 1; > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction > >> 400048276, it was already committed > > > But that's probably a separate issue. > > What that seems to indicate is that the "unexpected table_lock_tuple > status" error was thrown during commit, which seems pretty odd. I suspect that's a deferred trigger. But that code obviously could throw errors, so we gotta handle that correctly. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Alvaro Herrera
Дата:
On 2019-Apr-02, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET > >> abalance = 1 WHERE aid = 1; > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction > >> 400048276, it was already committed > > > But that's probably a separate issue. > > What that seems to indicate is that the "unexpected table_lock_tuple > status" error was thrown during commit, which seems pretty odd. AFAICS this error can only come from ExecDelete(), because the value 1 is TM_Invisible and the other callsites where the "unexpected table_lock_tuple" error appears use different wording for that one. Maybe it's the result of a deferred constraint being checked at that time ... maybe it's trying to honor an "on cascade delete" setting for an FK, and the affected tuple has already been updated or deleted? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@anarazel.de> writes: > On 2019-04-02 11:44:26 -0400, Tom Lane wrote: >> What that seems to indicate is that the "unexpected table_lock_tuple >> status" error was thrown during commit, which seems pretty odd. > I suspect that's a deferred trigger. But that code obviously could throw > errors, so we gotta handle that correctly. Deferred triggers execute before the real transaction commit, not during the critical section surrounding where we set the bit in pg_clog. But this error is seemingly getting thrown from within that critical section, otherwise we wouldn't have inconsistent xact status afterwards. regards, tom lane
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-02 11:51:38 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2019-04-02 11:44:26 -0400, Tom Lane wrote: > >> What that seems to indicate is that the "unexpected table_lock_tuple > >> status" error was thrown during commit, which seems pretty odd. > > > I suspect that's a deferred trigger. But that code obviously could throw > > errors, so we gotta handle that correctly. > > Deferred triggers execute before the real transaction commit, not during > the critical section surrounding where we set the bit in pg_clog. > But this error is seemingly getting thrown from within that critical > section, otherwise we wouldn't have inconsistent xact status afterwards. I don't think it was thrown during the critical section - we'd otherwise see the ERROR about unexpected lock status converted to a PANIC (and the "cannot abort transaction, already committed" is a PANIC itself). I suspect it's more likely that we corrupt the xact.c status somewhere - but it's hard to say without knowing more about the workload. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote: > On 2019-Apr-02, Tom Lane wrote: > > > Andres Freund <andres@anarazel.de> writes: > > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET > > >> abalance = 1 WHERE aid = 1; > > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction > > >> 400048276, it was already committed > > > > > But that's probably a separate issue. > > > > What that seems to indicate is that the "unexpected table_lock_tuple > > status" error was thrown during commit, which seems pretty odd. > > AFAICS this error can only come from ExecDelete(), because the value 1 > is TM_Invisible and the other callsites where the "unexpected > table_lock_tuple" error appears use different wording for that one. Hm? Why couldn't it be the ExecUpdate() case? > Maybe it's the result of a deferred constraint being checked at that > time ... maybe it's trying to honor an "on cascade delete" setting for > an FK, and the affected tuple has already been updated or deleted? Then it ought to get TM_Deleted, no? We ought to wait till that transaction commits, and then roll back. So there's something odd happening here. I suspect there has to be some additional log output or such to explain this. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Alvaro Herrera
Дата:
On 2019-Apr-02, Andres Freund wrote: > On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote: > > AFAICS this error can only come from ExecDelete(), because the value 1 > > is TM_Invisible and the other callsites where the "unexpected > > table_lock_tuple" error appears use different wording for that one. > > Hm? Why couldn't it be the ExecUpdate() case? You're right, that one too. > > Maybe it's the result of a deferred constraint being checked at that > > time ... maybe it's trying to honor an "on cascade delete" setting for > > an FK, and the affected tuple has already been updated or deleted? > > Then it ought to get TM_Deleted, no? We ought to wait till that > transaction commits, and then roll back. I was thinking that this would have happened in the same transaction; but yeah I didn't spend too much time analyzing the exact code flow. Anyway I agree that there's something odd going on, and perhaps you just unmasked an earlier bug. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-02 23:13, Andres Freund wrote: > Hi, > > On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote: >> On 2019-Apr-02, Tom Lane wrote: >> >> > Andres Freund <andres@anarazel.de> writes: >> > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET >> > >> abalance = 1 WHERE aid = 1; >> > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction >> > >> 400048276, it was already committed >> > >> > > But that's probably a separate issue. >> > >> > What that seems to indicate is that the "unexpected table_lock_tuple >> > status" error was thrown during commit, which seems pretty odd. >> >> AFAICS this error can only come from ExecDelete(), because the value 1 >> is TM_Invisible and the other callsites where the "unexpected >> table_lock_tuple" error appears use different wording for that one. > > Hm? Why couldn't it be the ExecUpdate() case? > > >> Maybe it's the result of a deferred constraint being checked at that >> time ... maybe it's trying to honor an "on cascade delete" setting for >> an FK, and the affected tuple has already been updated or deleted? > > Then it ought to get TM_Deleted, no? We ought to wait till that > transaction commits, and then roll back. So there's something odd > happening here. I suspect there has to be some additional log output > or > such to explain this. > > Greetings, > > Andres Freund I made the new core file: [New LWP 5967] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test 127.0.0.1(35888) UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x00007fe19476b02a in __GI_abort () at abort.c:89 #2 0x0000000000a9c0ac in errfinish (dummy=0) at elog.c:552 #3 0x0000000000a9e653 in elog_finish (elevel=22, fmt=0xb695a0 "cannot abort transaction %u, it was already committed") at elog.c:1365 #4 0x0000000000562c87 in RecordTransactionAbort (isSubXact=false) at xact.c:1644 #5 0x0000000000563c93 in AbortTransaction () at xact.c:2670 #6 0x0000000000564516 in AbortCurrentTransaction () at xact.c:3191 #7 0x00000000009145b9 in PostgresMain (argc=1, argv=0x255d970, dbname=0x255d950 "test", username=0x252dd88 "zharkov") at postgres.c:4033 #8 0x000000000086406c in BackendRun (port=0x2556020) at postmaster.c:4399 #9 0x00000000008637cd in BackendStartup (port=0x2556020) at postmaster.c:4090 #10 0x000000000085facf in ServerLoop () at postmaster.c:1703 #11 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x252bcf0) at postmaster.c:1376 #12 0x000000000077aeba in main (argc=3, argv=0x252bcf0) at main.c:228 Log: 2019-04-02 21:51:19.639 +07 [5967] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1; 2019-04-02 21:51:19.639 +07 [5967] PANIC: cannot abort transaction 200011850, it was already committed 2019-04-02 21:53:57.924 +07 [2473] LOG: server process (PID 5967) was terminated by signal 6: Aborted 2019-04-02 21:53:57.924 +07 [2473] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1; 2019-04-02 21:53:57.924 +07 [2473] LOG: terminating any other active server processes We run pgbench with two custom scripts. First: \set i random_zipfian(1, 100000 * :scale, 1.01) SELECT abalance FROM pgbench_accounts WHERE aid = :i; Second: \set i random_zipfian(1, 100000 * :scale, 1.01) UPDATE pgbench_accounts SET abalance = 1 WHERE aid = :i; pgbench connects to the database through the connection pooller called Odyssey. We use pool of 16 or 32 backends. We run four series of pgbench tests with different connections count ( 100, 150, 200, ... , 1000 ) Three series with pool size 16 runs normally. But fourth series fails with 200-300 connections. It took about five hours to reproduce the error this time. Here is the little piece of the pgbench log: Bench start Tue Apr 2 14:43:12 UTC 2019 Timestamp 1554216192 transaction type: multiple scripts scaling factor: 1 query mode: simple number of clients: 200 number of threads: 32 duration: 300 s number of transactions actually processed: 8062981 latency average = 7.442 ms tps = 26875.375100 (including connections establishing) tps = 26875.694564 (excluding connections establishing) SQL script 1: ./sql/ycsb_read_zipf.sql - weight: 1 (targets 50.0% of total) - 4019635 transactions (49.9% of total, tps = 13398.171023) - latency average = 6.955 ms - latency stddev = 8.366 ms SQL script 2: ./sql/ycsb_update_zipf.sql - weight: 1 (targets 50.0% of total) - 4021508 transactions (49.9% of total, tps = 13404.414071) - latency average = 7.909 ms - latency stddev = 8.478 ms Bench end Tue Apr 2 14:48:12 UTC 2019 Timestamp 1554216492 Bench start Tue Apr 2 14:48:12 UTC 2019 Timestamp 1554216492 client 220 script 1 aborted in command 1 query 0: PANIC: cannot abort transaction 200011850, it was already committed WARNING: terminating connection because of crash of another server process -- regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-03 00:04:15 +0700, r.zharkov@postgrespro.ru wrote: > pgbench connects to the database through the connection pooller called > Odyssey. > We use pool of 16 or 32 backends. > We run four series of pgbench tests with different connections count ( 100, > 150, 200, ... , 1000 ) > Three series with pool size 16 runs normally. > But fourth series fails with 200-300 connections. It took about five hours > to reproduce the error this time. > > Here is the little piece of the pgbench log: > > Bench start > Tue Apr 2 14:43:12 UTC 2019 > Timestamp 1554216192 > transaction type: multiple scripts > scaling factor: 1 Is this the actual scaling factor? Did you create the pgbench database with foreign keys? Greetings, Andres Freund
r.zharkov@postgrespro.ru writes: > I made the new core file: Thanks, but this isn't much help --- it just shows what we already know, ie the "cannot abort transaction" error occurs after some other error was thrown. What would be more helpful would be to adjust the places that can throw the "unexpected table_lock_tuple status" error text to be PANIC rather than ERROR, so that the core dump would show a stack trace showing how we got to whichever of those places this is. Or, run the test with a higher log verbosity, so that you can find out which of those places is throwing the error to begin with; then just PANIC-ize that one. FWIW, I see six potential candidates, not two: pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test); pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u", pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status:%u", pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuplestatus: %u", pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u",res); pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u",res); regards, tom lane
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-02 13:13:58 -0400, Tom Lane wrote: > FWIW, I see six potential candidates, not two: > > pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test); > pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u", > pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u",res); > pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u",res); These have an explicit case / separate error for TM_Invisible (= 1 - which is the status we're seeing according to the error message) though. So afaict they can't be relevant here. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Alvaro Herrera
Дата:
On 2019-Apr-02, Tom Lane wrote: > FWIW, I see six potential candidates, not two: > > pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test); > pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u", > pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuplestatus: %u", > pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuplestatus: %u", > pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u",res); > pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u",res); The other four handle the TM_Invisible case with a different error message. Anyway, the six of them are unexpected so changing them all to PANIC ought to be good enough. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@anarazel.de> writes: > On 2019-04-02 13:13:58 -0400, Tom Lane wrote: >> FWIW, I see six potential candidates, not two: > These have an explicit case / separate error for TM_Invisible (= 1 - > which is the status we're seeing according to the error message) > though. So afaict they can't be relevant here. Oh, right, so it has to be from one of the calls in nodeModifyTable.c. But it's hardly plausible that that runs post-commit. So now I'm thinking that what we really need to know is why a TM_Invisible result was returned. That's going to be a bit harder ... regards, tom lane
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed
От
Жарков Роман
Дата:
> 3 апр. 2019 г., в 0:13, Tom Lane <tgl@sss.pgh.pa.us> написал(а): > > r.zharkov@postgrespro.ru writes: >> I made the new core file: > > Thanks, but this isn't much help --- it just shows what we already > know, ie the "cannot abort transaction" error occurs after some other > error was thrown. > > What would be more helpful would be to adjust the places that > can throw the "unexpected table_lock_tuple status" error text to be > PANIC rather than ERROR, so that the core dump would show a stack > trace showing how we got to whichever of those places this is. > Or, run the test with a higher log verbosity, so that you can find > out which of those places is throwing the error to begin with; > then just PANIC-ize that one. > > FWIW, I see six potential candidates, not two: > > pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test); > pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u", > pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuplestatus: %u", > pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuplestatus: %u", > pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u",res); > pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u",res); > > > regards, tom lane Ok. I will made it tomorrow morning. regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-02 13:30:56 -0400, Tom Lane wrote: > But it's hardly plausible that that runs post-commit. Yea, one surely would hope that that's not happening. I suspect it's possible it's running pre-commit and somehow the failure handling is screwed up leading to the 'already committed' error. > So now I'm thinking that what we really need to know is why a > TM_Invisible result was returned. That's going to be a bit harder ... Yea. I re-skimmed through the code and can't immediately see anything wrong - not that such subtle issues would necessarily be noticed without more information. I wonder, is there a chance that you're using an extension with xact hooks, or a modified version of postgres? Could you show us the exact schema used, and the config? Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-03 00:13, Tom Lane wrote: > r.zharkov@postgrespro.ru writes: >> I made the new core file: > > Thanks, but this isn't much help --- it just shows what we already > know, ie the "cannot abort transaction" error occurs after some other > error was thrown. > > What would be more helpful would be to adjust the places that > can throw the "unexpected table_lock_tuple status" error text to be > PANIC rather than ERROR, so that the core dump would show a stack > trace showing how we got to whichever of those places this is. > Or, run the test with a higher log verbosity, so that you can find > out which of those places is throwing the error to begin with; > then just PANIC-ize that one. > > FWIW, I see six potential candidates, not two: > > pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, > "unexpected table_lock_tuple status: %u", test); > pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, > "unexpected table_lock_tuple status: %u", > pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, > "unexpected table_lock_tuple status: %u", > pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, > "unexpected table_lock_tuple status: %u", > pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, > "unexpected table_lock_tuple status: %u", res); > pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, > "unexpected table_lock_tuple status: %u", res); > > > regards, tom lane Hello, I have made the new core dump. Here are the results: PostgreSQL 12devel-master/bb76134 Log: TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 0x0010))) == 0x0040)))", File: "heapam.c", Line: 2019-04-03 15:13:36.668 +07 [23163] LOG: server process (PID 27904) was terminated by signal 6: Aborted 2019-04-03 15:13:36.668 +07 [23163] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2; 2019-04-03 15:13:36.668 +07 [23163] LOG: terminating any other active server processes 2019-04-03 15:13:36.669 +07 [27907] WARNING: terminating connection because of crash of another server process Backtrace: [New LWP 27904] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test 127.0.0.1(44582) UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x00007f5e0dc7d02a in __GI_abort () at abort.c:89 #2 0x0000000000a9b7ce in ExceptionalCondition ( conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 0x0010))) == 0x0040)))", errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c", lineNumber=4286) at assert.c:54 #3 0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80, tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, follow_updates=false, buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286 #4 0x00000000004f5903 in heapam_tuple_lock (relation=0x7f5e0f252d80, tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffed968cd80) at heapam_handler.c:352 #5 0x000000000074055a in table_lock_tuple (rel=0x7f5e0f252d80, tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffed968cd80) at ../../../src/include/access/tableam.h:993 #6 0x00000000007421cc in ExecUpdate (mtstate=0x30eb780, tupleid=0x7ffed968cdf0, oldtuple=0x0, slot=0x30ed1c8, planSlot=0x30ed050, epqstate=0x30eb878, estate=0x30eb3d8, canSetTag=true) at nodeModifyTable.c:1242 #7 0x0000000000743756 in ExecModifyTable (pstate=0x30eb780) at nodeModifyTable.c:2070 #8 0x0000000000712671 in ExecProcNodeFirst (node=0x30eb780) at execProcnode.c:445 #9 0x000000000070748c in ExecProcNode (node=0x30eb780) at ../../../src/include/executor/executor.h:239 #10 0x0000000000709d11 in ExecutePlan (estate=0x30eb3d8, planstate=0x30eb780, use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x30ee508, execute_once=true) at execMain.c:1643 #11 0x0000000000707ac9 in standard_ExecutorRun (queryDesc=0x303faf8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:362 #12 0x0000000000707905 in ExecutorRun (queryDesc=0x303faf8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:306 #13 0x0000000000915950 in ProcessQuery (plan=0x30df120, sourceText=0x30161d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;", params=0x0, queryEnv=0x0, dest=0x30ee508, completionTag=0x7ffed968d270 "") at pquery.c:161 #14 0x000000000091739c in PortalRunMulti (portal=0x3098078, isTopLevel=true, setHoldSnapshot=false, dest=0x30ee508, altdest=0x30ee508, completionTag=0x7ffed968d270 "") at pquery.c:1283 #15 0x00000000009168e8 in PortalRun (portal=0x3098078, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x30ee508, altdest=0x30ee508, completionTag=0x7ffed968d270 "") at pquery.c:796 #16 0x000000000091027a in exec_simple_query (query_string=0x30161d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;") at postgres.c:1215 #17 0x00000000009148ff in PostgresMain (argc=1, argv=0x3042980, dbname=0x3042960 "test", username=0x3012d98 "zharkov") at postgres.c:4247 #18 0x000000000086406c in BackendRun (port=0x303b030) at postmaster.c:4399 #19 0x00000000008637cd in BackendStartup (port=0x303b030) at postmaster.c:4090 #20 0x000000000085facf in ServerLoop () at postmaster.c:1703 #21 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x3010d00) at postmaster.c:1376 #22 0x000000000077aeba in main (argc=3, argv=0x3010d00) at main.c:228 The changes I made: diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index e03ffdd..ab3bace 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -3358,7 +3358,7 @@ GetTupleForTrigger(EState *estate, ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), errmsg("could not serialize access due to concurrent update"))); - elog(ERROR, "unexpected table_lock_tuple status: %u", test); + elog(PANIC, "unexpected table_lock_tuple status: %u", test); break; case TM_Deleted: diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c index d8b48c6..d06cbe0 100644 --- a/src/backend/executor/execReplication.c +++ b/src/backend/executor/execReplication.c @@ -207,7 +207,7 @@ retry: elog(ERROR, "attempted to lock invisible tuple"); break; default: - elog(ERROR, "unexpected table_lock_tuple status: %u", res); + elog(PANIC, "unexpected table_lock_tuple status: %u", res); break; } } @@ -371,7 +371,7 @@ retry: elog(ERROR, "attempted to lock invisible tuple"); break; default: - elog(ERROR, "unexpected table_lock_tuple status: %u", res); + elog(PANIC, "unexpected table_lock_tuple status: %u", res); break; } } diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c index 7674ac8..c7a261c 100644 --- a/src/backend/executor/nodeLockRows.c +++ b/src/backend/executor/nodeLockRows.c @@ -229,7 +229,7 @@ lnext: ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), errmsg("could not serialize access due to concurrent update"))); - elog(ERROR, "unexpected table_lock_tuple status: %u", + elog(PANIC, "unexpected table_lock_tuple status: %u", test); break; diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 7be0e77..b15b7a0 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -780,7 +780,7 @@ ldelete:; * locking the latest version via * TUPLE_LOCK_FLAG_FIND_LAST_VERSION. */ - elog(ERROR, "unexpected table_lock_tuple status: %u", + elog(PANIC, "unexpected table_lock_tuple status: %u", result); return NULL; } @@ -1269,7 +1269,7 @@ lreplace:; default: /* see table_lock_tuple call in ExecDelete() */ - elog(ERROR, "unexpected table_lock_tuple status: %u", + elog(PANIC, "unexpected table_lock_tuple status: %u", result); return NULL; } regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-03 00:11, Andres Freund wrote: > Hi, > > On 2019-04-03 00:04:15 +0700, r.zharkov@postgrespro.ru wrote: >> pgbench connects to the database through the connection pooller called >> Odyssey. >> We use pool of 16 or 32 backends. >> We run four series of pgbench tests with different connections count ( >> 100, >> 150, 200, ... , 1000 ) >> Three series with pool size 16 runs normally. >> But fourth series fails with 200-300 connections. It took about five >> hours >> to reproduce the error this time. >> >> Here is the little piece of the pgbench log: >> >> Bench start >> Tue Apr 2 14:43:12 UTC 2019 >> Timestamp 1554216192 >> transaction type: multiple scripts >> scaling factor: 1 > > Is this the actual scaling factor? Did you create the pgbench database > with foreign keys? > > Greetings, > > Andres Freund Hi, Yes, 1 is the actual scaling factor. My script to create database: /db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -l /db/zharkov/vanila/log.log stop rm -rf /db/zharkov/vanila/* /db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila --data-checksums #/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila cp postgresql.conf /db/zharkov/vanila/ ulimit -c unlimited /db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -c -l /db/zharkov/vanila/log.log start /db/zharkov/.vanila/bin/psql postgres -c 'create database test' /db/zharkov/.vanila/bin/psql test -c 'create extension pg_prewarm;' /db/zharkov/.vanila/bin/pgbench -i test -s 1 #/db/zharkov/.vanila/bin/pgbench -i test -s 1000 /db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 'pgbench_accounts' );" /db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 'pgbench_branches' );" /db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 'pgbench_history' );" /db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 'pgbench_tellers' );" The dump of the server settings is in the attachment. -- regards, Roman Zharkov
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Alvaro Herrera
Дата:
On 2019-Apr-03, r.zharkov@postgrespro.ru wrote: > #0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at > ../sysdeps/unix/sysv/linux/raise.c:54 > #1 0x00007f5e0dc7d02a in __GI_abort () at abort.c:89 > #2 0x0000000000a9b7ce in ExceptionalCondition ( > conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) || > (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | > 0x0010))) == 0x0040)))", > errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c", > lineNumber=4286) at assert.c:54 > #3 0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80, > tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive, > wait_policy=LockWaitBlock, follow_updates=false, > buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286 Uhm ... this assertion failure is doesn't seem related to the other problem you were reporting. Here, it's failing because it thinks the tuple should be marked as only locked when the current transaction is the only Xmax for the tuple. /* * As a check independent from those above, we can also avoid sleeping * if the current transaction is the sole locker of the tuple. Note * that the strength of the lock already held is irrelevant; this is * not about recording the lock in Xmax (which will be done regardless * of this optimization, below). Also, note that the cases where we * hold a lock stronger than we are requesting are already handled * above by not doing anything. * * Note we only deal with the non-multixact case here; MultiXactIdWait * is well equipped to deal with this situation on its own. */ if (require_sleep && !(infomask & HEAP_XMAX_IS_MULTI) && TransactionIdIsCurrentTransactionId(xwait)) { /* ... but if the xmax changed in the meantime, start over */ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data), xwait)) goto l3; Assert(HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask)); // <--- failed assertion require_sleep = false; } I'm not quite sure what's up with that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-03 19:20, Alvaro Herrera wrote: > On 2019-Apr-03, r.zharkov@postgrespro.ru wrote: > >> #0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at >> ../sysdeps/unix/sysv/linux/raise.c:54 >> #1 0x00007f5e0dc7d02a in __GI_abort () at abort.c:89 >> #2 0x0000000000a9b7ce in ExceptionalCondition ( >> conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) >> || >> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 >> | >> 0x0010))) == 0x0040)))", >> errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 >> "heapam.c", >> lineNumber=4286) at assert.c:54 >> #3 0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80, >> tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive, >> wait_policy=LockWaitBlock, follow_updates=false, >> buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286 > > Uhm ... this assertion failure is doesn't seem related to the other > problem you were reporting. Here, it's failing because it thinks the > tuple should be marked as only locked when the current transaction is > the only Xmax for the tuple. > > TransactionIdIsCurrentTransactionId(xwait)) > { > /* ... but if the xmax changed in the meantime, start over */ > LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) || > !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data), > xwait)) > goto l3; > Assert(HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask)); // > <--- failed assertion > require_sleep = false; > } > > I'm not quite sure what's up with that. Today we reproduced the error without the pooller. All server settings are the same. pgbench test -j 32 -T 36000 -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql -c 32 -P 60 Backtrace: [New LWP 29710] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test [local] UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007f0db56c3428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007f0db56c3428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x00007f0db56c502a in __GI_abort () at abort.c:89 #2 0x0000000000a9c0ac in errfinish (dummy=0) at elog.c:552 #3 0x0000000000a9e653 in elog_finish (elevel=22, fmt=0xc7c7c0 "unexpected table_lock_tuple status: %u") at elog.c:1365 #4 0x00000000007422c1 in ExecUpdate (mtstate=0x2e02370, tupleid=0x7fff6a5c1610, oldtuple=0x0, slot=0x2e03db8, planSlot=0x2e03c40, epqstate=0x2e02468, estate=0x2e01fc8, canSetTag=true) at nodeModifyTable.c:1272 #5 0x0000000000743756 in ExecModifyTable (pstate=0x2e02370) at nodeModifyTable.c:2070 #6 0x0000000000712671 in ExecProcNodeFirst (node=0x2e02370) at execProcnode.c:445 #7 0x000000000070748c in ExecProcNode (node=0x2e02370) at ../../../src/include/executor/executor.h:239 #8 0x0000000000709d11 in ExecutePlan (estate=0x2e01fc8, planstate=0x2e02370, use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x2e050f8, execute_once=true) at execMain.c:1643 #9 0x0000000000707ac9 in standard_ExecutorRun (queryDesc=0x2d56a18, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:362 #10 0x0000000000707905 in ExecutorRun (queryDesc=0x2d56a18, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:306 #11 0x0000000000915950 in ProcessQuery (plan=0x2df5d10, sourceText=0x2d2d1d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;", params=0x0, queryEnv=0x0, dest=0x2e050f8, completionTag=0x7fff6a5c1a90 "") at pquery.c:161 #12 0x000000000091739c in PortalRunMulti (portal=0x2daf078, isTopLevel=true, setHoldSnapshot=false, dest=0x2e050f8, altdest=0x2e050f8, completionTag=0x7fff6a5c1a90 "") at pquery.c:1283 #13 0x00000000009168e8 in PortalRun (portal=0x2daf078, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2e050f8, altdest=0x2e050f8, completionTag=0x7fff6a5c1a90 "") at pquery.c:796 #14 0x000000000091027a in exec_simple_query (query_string=0x2d2d1d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;") at postgres.c:1215 #15 0x00000000009148ff in PostgresMain (argc=1, argv=0x2d59a80, dbname=0x2d59960 "test", username=0x2d29d98 "zharkov") at postgres.c:4247 #16 0x000000000086406c in BackendRun (port=0x2d52030) at postmaster.c:4399 #17 0x00000000008637cd in BackendStartup (port=0x2d52030) at postmaster.c:4090 #18 0x000000000085facf in ServerLoop () at postmaster.c:1703 #19 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x2d27d00) at postmaster.c:1376 #20 0x000000000077aeba in main (argc=3, argv=0x2d27d00) at main.c:228 ------------------------------------------- Some information from first core dump ( thanks to Konstantin Knizhnik ): (gdb) p *tuple->t_data $6 = {t_choice = {t_heap = {t_xmin = 200004728, t_xmax = 200004747, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 200004728, datum_typmod = 200004747, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 0, bi_lo = 2337}, ip_posid = 61}, t_infomask2 = 49156, t_infomask = 9474, t_hoff = 24 '\030', t_bits = 0x7f59f9ea1c17 ""} (gdb) p xwait $4 = 200004747 (gdb) p /x infomask $7 = 0x2502 (gdb) p result $8 = TM_Updated (gdb) p mode $9 = LockTupleNoKeyExclusive (gdb) p *CurrentTransactionState $11 = {fullTransactionId = {value = 200004747}, subTransactionId = 1, name = 0x0, savepointLevel = 0, state = TRANS_INPROGRESS, blockState = TBLOCK_STARTED, nestingLevel = 1, gucNestLevel = 1, curTransactionContext = 0x303cda0, curTransactionOwner = 0x3054660, childXids = 0x0, nChildXids = 0, maxChildXids = 0, prevUser = 10, prevSecContext = 0, prevXactReadOnly = false, startedInRecovery = false, didLogXid = false, parallelModeLevel = 0, chain = false, parent = 0x0} -- regards Roman Zharkov
It seems that there may be some connection between this problem and EPQ. I was working on committing Amit's fix for bug #15677, which demonstrated that EPQ doesn't work for partitioned-table target rels. It seemed like there really needed to be regression test coverage for that, so I tried to convert his crasher example into an isolation test. It does indeed crash without Amit's fix ... but with it, lookee what I get: +error in steps c1 complexpartupdate: ERROR: unexpected table_lock_tuple status: 1 That seems fully reproducible in this test. I haven't looked into exactly what's causing that, but now that we have a reproducible example, somebody should. I'm not quite sure if I should commit this as-is or wait till the other problem is fixed. A crash is probably worse than a bogus error, but I don't like committing obviously-wrong "expected" output. Thoughts? regards, tom lane diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 602a08e..5c4b3f4 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -2799,15 +2799,24 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree) estate->es_output_cid = parentestate->es_output_cid; if (parentestate->es_num_result_relations > 0) { - int numResultRelations = parentestate->es_num_result_relations; + int numResultRels = parentestate->es_num_result_relations; + int numRootResultRels = parentestate->es_num_root_result_relations; ResultRelInfo *resultRelInfos; resultRelInfos = (ResultRelInfo *) - palloc(numResultRelations * sizeof(ResultRelInfo)); + palloc(numResultRels * sizeof(ResultRelInfo)); memcpy(resultRelInfos, parentestate->es_result_relations, - numResultRelations * sizeof(ResultRelInfo)); + numResultRels * sizeof(ResultRelInfo)); estate->es_result_relations = resultRelInfos; - estate->es_num_result_relations = numResultRelations; + estate->es_num_result_relations = numResultRels; + + /* Also transfer partitioned root result relations. */ + resultRelInfos = (ResultRelInfo *) + palloc(numRootResultRels * sizeof(ResultRelInfo)); + memcpy(resultRelInfos, parentestate->es_root_result_relations, + numRootResultRels * sizeof(ResultRelInfo)); + estate->es_root_result_relations = resultRelInfos; + estate->es_num_root_result_relations = numRootResultRels; } /* es_result_relation_info must NOT be copied */ /* es_trig_target_relations must NOT be copied */ diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out index c09e97f..47843d7 100644 --- a/src/test/isolation/expected/eval-plan-qual.out +++ b/src/test/isolation/expected/eval-plan-qual.out @@ -562,3 +562,16 @@ step multireadwcte: <... completed> subid id 1 1 + +starting permutation: simplepartupdate complexpartupdate c1 c2 +step simplepartupdate: + update parttbl set a = a; + +step complexpartupdate: + with u as (update parttbl set a = a returning parttbl.*) + update parttbl set a = u.a from u; + <waiting ...> +step c1: COMMIT; +step complexpartupdate: <... completed> +error in steps c1 complexpartupdate: ERROR: unexpected table_lock_tuple status: 1 +step c2: COMMIT; diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec index 41c8d56..7d3f1f6 100644 --- a/src/test/isolation/specs/eval-plan-qual.spec +++ b/src/test/isolation/specs/eval-plan-qual.spec @@ -29,6 +29,10 @@ setup CREATE TABLE jointest AS SELECT generate_series(1,10) AS id, 0 AS data; CREATE INDEX ON jointest(id); + + CREATE TABLE parttbl (a int) PARTITION BY LIST (a); + CREATE TABLE parttbl1 PARTITION OF parttbl FOR VALUES IN (1); + INSERT INTO parttbl VALUES (1); } teardown @@ -37,6 +41,7 @@ teardown DROP TABLE accounts_ext; DROP TABLE p CASCADE; DROP TABLE table_a, table_b, jointest; + DROP TABLE parttbl; } session "s1" @@ -133,6 +138,12 @@ step "selectresultforupdate" { where jt.id = y for update of jt, ss1, ss2; } +# test for EPQ on a partitioned result table + +step "simplepartupdate" { + update parttbl set a = a; +} + session "s2" setup { BEGIN ISOLATION LEVEL READ COMMITTED; } @@ -170,6 +181,10 @@ step "updateforcip3" { } step "wrtwcte" { UPDATE table_a SET value = 'tableAValue2' WHERE id = 1; } step "wrjt" { UPDATE jointest SET data = 42 WHERE id = 7; } +step "complexpartupdate" { + with u as (update parttbl set a = a returning parttbl.*) + update parttbl set a = u.a from u; +} step "c2" { COMMIT; } step "r2" { ROLLBACK; } @@ -234,3 +249,5 @@ permutation "wrtwcte" "readwcte" "c1" "c2" permutation "wrjt" "selectjoinforupdate" "c2" "c1" permutation "wrjt" "selectresultforupdate" "c2" "c1" permutation "wrtwcte" "multireadwcte" "c1" "c2" + +permutation "simplepartupdate" "complexpartupdate" "c1" "c2"
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-06 12:23:06 -0400, Tom Lane wrote: > It seems that there may be some connection between this problem and > EPQ. I was working on committing Amit's fix for bug #15677, which > demonstrated that EPQ doesn't work for partitioned-table target rels. > It seemed like there really needed to be regression test coverage for > that, so I tried to convert his crasher example into an isolation test. > It does indeed crash without Amit's fix ... but with it, lookee what > I get: > > +error in steps c1 complexpartupdate: ERROR: unexpected table_lock_tuple status: 1 > > That seems fully reproducible in this test. I haven't looked into > exactly what's causing that, but now that we have a reproducible > example, somebody should. > > I'm not quite sure if I should commit this as-is or wait till the > other problem is fixed. A crash is probably worse than a bogus > error, but I don't like committing obviously-wrong "expected" output. > Thoughts? Let me have a look at the testcase - I'd been running Roman's testcase for quite a few hours without being able to reproduce. But your testcase seems to trigger this reliably, so I hope I can make some quick progress. - Andres
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-06 23:28, Andres Freund wrote: > Hi, > > Let me have a look at the testcase - I'd been running Roman's testcase > for quite a few hours without being able to reproduce. But your > testcase > seems to trigger this reliably, so I hope I can make some quick > progress. > > - Andres Hello, I try to find the bad commit using bisect. But it takes very long time. The error reproduces with the default config using 24 clients ( server has 24 CPUs ) pgbench test -j 12 -T 36000 -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql -c 24 -P 60 It does not reproduce when updating the only one record. CPU configuration: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 44 Model name: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz Stepping: 2 CPU MHz: 2464.940 CPU max MHz: 3066.0000 CPU min MHz: 1600.0000 BogoMIPS: 6133.57 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 12288K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23 -- regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-06 09:28:46 -0700, Andres Freund wrote: > On 2019-04-06 12:23:06 -0400, Tom Lane wrote: > > It seems that there may be some connection between this problem and > > EPQ. I was working on committing Amit's fix for bug #15677, which > > demonstrated that EPQ doesn't work for partitioned-table target rels. > > It seemed like there really needed to be regression test coverage for > > that, so I tried to convert his crasher example into an isolation test. > > It does indeed crash without Amit's fix ... but with it, lookee what > > I get: > > > > +error in steps c1 complexpartupdate: ERROR: unexpected table_lock_tuple status: 1 > > > > That seems fully reproducible in this test. I haven't looked into > > exactly what's causing that, but now that we have a reproducible > > example, somebody should. > > > > I'm not quite sure if I should commit this as-is or wait till the > > other problem is fixed. A crash is probably worse than a bogus > > error, but I don't like committing obviously-wrong "expected" output. > > Thoughts? > > Let me have a look at the testcase - I'd been running Roman's testcase > for quite a few hours without being able to reproduce. But your testcase > seems to trigger this reliably, so I hope I can make some quick > progress. Hm. I see what's wrong here - the new code assumed that we couldn't get a SelfModified because the first version of the to-be-(deleted|updated) tuple was visible. To properly discern that from the TM_Deleted case, I'd to change/fix heapam_lock_tuple's follow-the-update chain to return SelfModified, rather than Invisible in this case (I don't think we want to allow invisible - we'd have to have waited for the earlier tuple version) - which is a more accurate return code anyway. I'm still not understanding how that'd be possible in Roman's case. Given the workload there never should be any self updating going on? Heavily-WIP patch attached. I noticed that we say + ereport(ERROR, + (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION), + errmsg("tuple to be updated was already modified by an operation triggered by thecurrent command"), in the ExecDelete() case (that's not new). Which seems odd. I think my fix would need a non-partition reproducer. I'll work on that and polishing it after having a coffee. Greetings, Andres Freund
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-07 00:09:15 +0700, r.zharkov@postgrespro.ru wrote: > On 2019-04-06 23:28, Andres Freund wrote: > > Hi, > > > > Let me have a look at the testcase - I'd been running Roman's testcase > > for quite a few hours without being able to reproduce. But your testcase > > seems to trigger this reliably, so I hope I can make some quick > > progress. > > > > - Andres > > Hello, > I try to find the bad commit using bisect. But it takes very long > time. I'd be very surprised if it weren't commit 5db6df0c0117ff2a4e0cd87594d2db408cd5022f Author: Andres Freund <andres@anarazel.de> Date: 2019-03-23 19:55:57 -0700 tableam: Add tuple_{insert, delete, update, lock} and use. I just sent a fix for the issue Tom just reported, but I don't quite see how it applies to your case, given that there is - as far as I understand - only a single statement per transaction, no triggers including foreign keys, no CTEs etc. But it'd sure be interesting if my fix changes his error into trigering on TM_SelfModified rather than TM_Invisible. I'm kinda wondering if your / Roman's case is exposing a race condition somewhere (like wrong order of clog / procarray checks or such) that previously wasn't user visible. I think we probably should expand the error messages for the unexpected cases to include the tid of the failed tuple (both original and followed) - then we could at least look through the heap and WAL to get more understanding. > The error reproduces with the default config using 24 clients ( server has > 24 CPUs ) > pgbench test -j 12 -T 36000 -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql -c > 24 -P 60 > It does not reproduce when updating the only one record. I ran it for like 9 hours over night, without triggering the error. On a computer with fewer CPUs though. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-06 10:10:25 -0700, Andres Freund wrote: > diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c > index add0d65f816..4bca240f367 100644 > --- a/src/backend/access/heap/heapam_handler.c > +++ b/src/backend/access/heap/heapam_handler.c > @@ -464,7 +464,7 @@ tuple_lock_retry: > HeapTupleHeaderGetCmin(tuple->t_data) >= cid) > { > ReleaseBuffer(buffer); > - return TM_Invisible; > + return TM_SelfModified; > } This needs to set tmfd->cmax = HeapTupleHeaderGetCmax(tuple->t_data); so we don't throw if (tmfd.cmax != estate->es_output_cid) ereport(ERROR, (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION), errmsg("tuple to be updated was already modified by an operation triggered by the current command"), errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes toother rows."))); in the testcase at hand. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-06 10:10:25 -0700, Andres Freund wrote: > I noticed that we say > + ereport(ERROR, > + (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION), > + errmsg("tuple to be updated was already modified by an operation triggered bythe current command"), > > in the ExecDelete() case (that's not new). Which seems odd. My inclination is to fix this in master, but not backpatch. I did in the current version of the patch, because it made it easier to verify my tests actually work. One bigger question, around precisely that error, I have is that fixing this bug led me to add tests for the codepath. What I noticed is that the current version of the patch detects the above error even if we first have to follow the update chain, whereas previously that case was just blindly ignored. Which seems hard to defend to me? So with the new code we're doing: starting permutation: wx1 updwctefail c1 c2 read step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; balance 400 step updwctefail: WITH doup AS (UPDATE accounts SET balance = balance + 1100 WHERE accountid = 'checking' RETURNING *, update_checking(999))UPDATE accounts a step c1: COMMIT; step updwctefail: <... completed> accountid balance accountid balance update_checking savings 1600 checking 1500 t step c2: COMMIT; step read: SELECT * FROM accounts ORDER BY accountid; accountid balance checking 1501 savings 1600 but previously we got: starting permutation: wx1 updwctefail c1 c2 read step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; balance 400 step updwctefail: WITH doup AS (UPDATE accounts SET balance = balance + 1100 WHERE accountid = 'checking' RETURNING *, update_checking(999))UPDATE accounts a step c1: COMMIT; step updwctefail: <... completed> accountid balance accountid balance update_checking savings 1600 checking 1500 t step c2: COMMIT; step read: SELECT * FROM accounts ORDER BY accountid; accountid balance checking 1501 savings 1600 In other words, the previous error check put there by https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6868ed7491b7ea7f0af6133bb66566a2f5fe5a75 commit 6868ed7491b7ea7f0af6133bb66566a2f5fe5a75 Author: Kevin Grittner <kgrittn@postgresql.org> Date: 2012-10-26 14:55:36 -0500 Throw error if expiring tuple is again updated or deleted. weren't that thorough. As long as the "original" row to be updated is updated by another transaction (i.e. we start read committed ctid chasing), the old EPQ fetch logic just didn't perform the check described in the commit + comment: /* * The target tuple was already updated or deleted by the * current command, or by a later command in the current * transaction. The former case is possible in a join DELETE * where multiple tuples join to the same target tuple. This * is somewhat questionable, but Postgres has always allowed * it: we just ignore additional deletion attempts. * * The latter case arises if the tuple is modified by a * command in a BEFORE trigger, or perhaps by a command in a * volatile function used in the query. In such situations we * should not ignore the deletion, but it is equally unsafe to * proceed. We don't want to discard the original DELETE * while keeping the triggered actions based on its deletion; * and it would be no better to allow the original DELETE * while discarding updates that it triggered. The row update * carries some information that might be important according * to business rules; so throwing an error is the only safe * course. * * If a trigger actually intends this type of interaction, it * can re-execute the DELETE and then return NULL to cancel * the outer delete. */ It seems mighty finnicky to fix this in < v12 (as the error would need to happen in the guts of EvalPlanQualFetch() rather than in ExecUpdate/Delete) - so I'm inclined to just fix it in master. Thoughts? Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-06 16:10:37 -0700, Andres Freund wrote: > On 2019-04-06 10:10:25 -0700, Andres Freund wrote: > > I noticed that we say > > + ereport(ERROR, > > + (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION), > > + errmsg("tuple to be updated was already modified by an operation triggeredby the current command"), > > > > in the ExecDelete() case (that's not new). Which seems odd. > > My inclination is to fix this in master, but not backpatch. I did in the > current version of the patch, because it made it easier to verify my > tests actually work. Attached patch does so. > One bigger question, around precisely that error, I have is that fixing > this bug led me to add tests for the codepath. What I noticed is that > the current version of the patch detects the above error even if we > first have to follow the update chain, whereas previously that case was > just blindly ignored. Which seems hard to defend to me? > > [ long explanation ] > It seems mighty finnicky to fix this in < v12 (as the error would need > to happen in the guts of EvalPlanQualFetch() rather than in > ExecUpdate/Delete) - so I'm inclined to just fix it in master. Attached patch now errors out. Patch with tests attached. Tom, any chance you could comment on whether you think it's sensible to make these behaviour changes in master, but not backpatch? Greetings, Andres Freund
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Amit Langote
Дата:
On 2019/04/07 1:23, Tom Lane wrote: > It seems that there may be some connection between this problem and > EPQ. I was working on committing Amit's fix for bug #15677, which > demonstrated that EPQ doesn't work for partitioned-table target rels. > It seemed like there really needed to be regression test coverage for > that, so I tried to convert his crasher example into an isolation test. > It does indeed crash without Amit's fix ... but with it, lookee what > I get: > > +error in steps c1 complexpartupdate: ERROR: unexpected table_lock_tuple status: 1 > > That seems fully reproducible in this test. I haven't looked into > exactly what's causing that, but now that we have a reproducible > example, somebody should. > > I'm not quite sure if I should commit this as-is or wait till the > other problem is fixed. A crash is probably worse than a bogus > error, but I don't like committing obviously-wrong "expected" output. > Thoughts? Thank you Tom for picking it up and adding the test which I should've done to begin with. Now that Andres has taken care of the other issues [1], maybe this one's good to go? The isolation test part needed to be rebased over Andres' commit, which I've done in the attached updated patch. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=41f5e04ae
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Amit Langote
Дата:
On 2019/04/08 16:21, Amit Langote wrote: > Now that Andres has taken care of the other issues [1], maybe this one's > good to go? The isolation test part needed to be rebased over Andres' > commit, which I've done in the attached updated patch. The patch I attached in the previous email doesn't apply as-is to back-branches due to rebasing. I've attached another patch here, which applies to both PG 11 and 10 branches. Thanks, Amit
Вложения
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/04/08 16:21, Amit Langote wrote: >> Now that Andres has taken care of the other issues [1], maybe this one's >> good to go? The isolation test part needed to be rebased over Andres' >> commit, which I've done in the attached updated patch. > The patch I attached in the previous email doesn't apply as-is to > back-branches due to rebasing. I've attached another patch here, which > applies to both PG 11 and 10 branches. Agreed we can push this now, and done. It struck me just as I was pushing it that this test doesn't exercise EPQ with any of the interesting cases for partition routing (ie where the update causes a move to a different partition). It would likely be a good idea to have test coverage for all of these scenarios: * EPQ where the initial update would involve a partition change, and that's still true after reapplying the update to the concurrently-updated tuple version; * EPQ where the initial update would *not* require a partition change, but we need one after reapplying the update to the concurrently-updated tuple version; * EPQ where the initial update would involve a partition change, but that's no longer true after reapplying the update to the concurrently-updated tuple version. You could probably build cases exercising the latter two scenarios by doing updates in which the partition key column is set from some other column that's modified by the concurrent update. BTW, what happens if the concurrent update caused a partition change? I imagine we would think the original tuple is now dead, since there's no way to chase up to the replacement tuple in the other partition, and so we'd abandon our update. Is this documented? None of this is related to bug #15727, though, so I suggest starting a new thread with a proposed test patch. regards, tom lane
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Andres Freund
Дата:
Hi, On 2019-04-08 12:31:37 -0400, Tom Lane wrote: > BTW, what happens if the concurrent update caused a partition change? > I imagine we would think the original tuple is now dead, since there's > no way to chase up to the replacement tuple in the other partition, > and so we'd abandon our update. Is this documented? FWIW, you should get an error: if (ItemPointerIndicatesMovedPartitions(tid)) ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), errmsg("tuple to be locked was already moved to another partition due to concurrent update"))); I think there's tests for the simpler cases. Probably wouldn't hurt to test that case as another axis for the suite of tests you suggest. Greetings, Andres Freund
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Amit Langote
Дата:
On 2019/04/09 1:31, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2019/04/08 16:21, Amit Langote wrote: >>> Now that Andres has taken care of the other issues [1], maybe this one's >>> good to go? The isolation test part needed to be rebased over Andres' >>> commit, which I've done in the attached updated patch. > >> The patch I attached in the previous email doesn't apply as-is to >> back-branches due to rebasing. I've attached another patch here, which >> applies to both PG 11 and 10 branches. > > Agreed we can push this now, and done. Thank you. > It struck me just as I was pushing it that this test doesn't exercise > EPQ with any of the interesting cases for partition routing (ie where > the update causes a move to a different partition). It would likely > be a good idea to have test coverage for all of these scenarios: > > * EPQ where the initial update would involve a partition change, > and that's still true after reapplying the update to the > concurrently-updated tuple version; > > * EPQ where the initial update would *not* require a partition change, > but we need one after reapplying the update to the > concurrently-updated tuple version; > > * EPQ where the initial update would involve a partition change, > but that's no longer true after reapplying the update to the > concurrently-updated tuple version. > > You could probably build cases exercising the latter two scenarios > by doing updates in which the partition key column is set from > some other column that's modified by the concurrent update. > > BTW, what happens if the concurrent update caused a partition change? > I imagine we would think the original tuple is now dead, since there's > no way to chase up to the replacement tuple in the other partition, > and so we'd abandon our update. Is this documented? This seems to be documented in the Notes section of UPDATE's page [1]. > None of this is related to bug #15727, though, so I suggest > starting a new thread with a proposed test patch. OK, I will post something to -hackers, also considering the tip from Andres. Thanks, Amit [1] https://www.postgresql.org/docs/devel/sql-update.html
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-07 00:17, Andres Freund wrote: > I'd be very surprised if it weren't > > commit 5db6df0c0117ff2a4e0cd87594d2db408cd5022f > Author: Andres Freund <andres@anarazel.de> > Date: 2019-03-23 19:55:57 -0700 > > tableam: Add tuple_{insert, delete, update, lock} and use. > Greetings, > > Andres Freund Hello, I finished bisect and recheck the result. ad308058cc8666c50b43179e64d6bb7aeb3ba169 is the first bad commit commit ad308058cc8666c50b43179e64d6bb7aeb3ba169 Author: Thomas Munro <tmunro@postgresql.org> Date: Thu Mar 28 10:59:19 2019 +1300 Use FullTransactionId for the transaction stack. Provide GetTopFullTransactionId() and GetCurrentFullTransactionId(). The intended users of these interfaces are access methods that use xids for visibility checks but don't want to have to go back and "freeze" existing references some time later before the 32 bit xid counter wraps around. Use a new struct to serialize the transaction state for parallel query, because FullTransactionId doesn't fit into the previous serialization scheme very well. Author: Thomas Munro Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com :040000 040000 3080a54b6840daf3e08ada1edfb8d5d61c1099c6 4663f00a917eb526d27da29d1c059e20c58d4b34 M src I reproduced the error on two servers. First server, commit 16954e2: PostgreSQL 12devel-master/16954e2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit [New LWP 8346] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test [local] UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007fd37aca9428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007fd37aca9428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x00007fd37acab02a in __GI_abort () at abort.c:89 #2 0x0000000000aa52a7 in ExceptionalCondition (conditionName=0xb47688 "!((((tuple->t_data->t_infomask) & 0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 0x0010))) == 0x0040)))", errorType=0xb432e9 "FailedAssertion", fileName=0xb43368 "heapam.c", lineNumber=4289) at assert.c:54 #3 0x00000000004ea31d in heap_lock_tuple (relation=0x7fd37c3260b0, tuple=0x2ec61a0, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, follow_updates=false, buffer=0x7ffd787bf774, tmfd=0x7ffd787bf940) at heapam.c:4289 #4 0x00000000004f5c3e in heapam_tuple_lock (relation=0x7fd37c3260b0, tid=0x7ffd787bf9b0, snapshot=0x2e731b0, slot=0x2ec6150, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffd787bf940) at heapam_handler.c:361 #5 0x0000000000745726 in table_lock_tuple (rel=0x7fd37c3260b0, tid=0x7ffd787bf9b0, snapshot=0x2e731b0, slot=0x2ec6150, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffd787bf940) at ../../../src/include/access/tableam.h:1228 #6 0x000000000074794d in ExecUpdate (mtstate=0x2eb6a50, tupleid=0x7ffd787bf9b0, oldtuple=0x0, slot=0x2eb8498, planSlot=0x2eb8320, epqstate=0x2eb6b48, estate=0x2eb66a8, canSetTag=true) at nodeModifyTable.c:1369 #7 0x0000000000748f39 in ExecModifyTable (pstate=0x2eb6a50) at nodeModifyTable.c:2216 #8 0x0000000000717c7d in ExecProcNodeFirst (node=0x2eb6a50) at execProcnode.c:445 #9 0x000000000070c9e6 in ExecProcNode (node=0x2eb6a50) at ../../../src/include/executor/executor.h:239 #10 0x000000000070f265 in ExecutePlan (estate=0x2eb66a8, planstate=0x2eb6a50, use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x2eb97d8, execute_once=true) at execMain.c:1648 #11 0x000000000070d023 in standard_ExecutorRun (queryDesc=0x2e16688, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365 #12 0x000000000070ce5f in ExecutorRun (queryDesc=0x2e16688, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309 #13 0x000000000091d5e0 in ProcessQuery (plan=0x2eaa3f0, sourceText=0x2dee1c8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;", params=0x0, queryEnv=0x0, dest=0x2eb97d8, completionTag=0x7ffd787bfe30 "") at pquery.c:161 #14 0x000000000091f02c in PortalRunMulti (portal=0x2e557c8, isTopLevel=true, setHoldSnapshot=false, dest=0x2eb97d8, altdest=0x2eb97d8, completionTag=0x7ffd787bfe30 "") at pquery.c:1283 #15 0x000000000091e578 in PortalRun (portal=0x2e557c8, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2eb97d8, altdest=0x2eb97d8, completionTag=0x7ffd787bfe30 "") at pquery.c:796 #16 0x0000000000917ee9 in exec_simple_query (query_string=0x2dee1c8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;") at postgres.c:1215 #17 0x000000000091c58f in PostgresMain (argc=1, argv=0x2e198c0, dbname=0x2e197a0 "test", username=0x2dead88 "zharkov") at postgres.c:4249 #18 0x000000000086afb2 in BackendRun (port=0x2e0f170) at postmaster.c:4430 #19 0x000000000086a713 in BackendStartup (port=0x2e0f170) at postmaster.c:4121 #20 0x0000000000866952 in ServerLoop () at postmaster.c:1704 #21 0x0000000000866160 in PostmasterMain (argc=3, argv=0x2de8cf0) at postmaster.c:1377 #22 0x00000000007803ba in main (argc=3, argv=0x2de8cf0) at main.c:228 Second server, commit 16954e2: PostgreSQL 12devel-master/16954e22e2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit [New LWP 6145] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test [local] UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007fc5cfa2dfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x00007fc5cfa2dfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fc5cfa2f3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00005580a00b5590 in ExceptionalCondition ( conditionName=0x5580a01550e8 "!((((tuple->t_data->t_infomask) & 0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 0x0010))) == 0x0040)))", errorType=0x5580a0150d49 "FailedAssertion", fileName=0x5580a0150dc8 "heapam.c", lineNumber=4289) at assert.c:54 #3 0x000055809fb1e26c in heap_lock_tuple (relation=0x7fc5d10b6c00, tuple=0x5580a0bd3e20, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, follow_updates=false, buffer=0x7ffe391f876c, tmfd=0x7ffe391f8880) at heapam.c:4289 #4 0x000055809fb2980e in heapam_tuple_lock (relation=0x7fc5d10b6c00, tid=0x7ffe391f895a, snapshot=0x5580a0b280e0, slot=0x5580a0bd3dd0, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffe391f8880) at heapam_handler.c:361 #5 0x000055809fd68ba0 in table_lock_tuple (rel=0x7fc5d10b6c00, tid=0x7ffe391f895a, snapshot=0x5580a0b280e0, slot=0x5580a0bd3dd0, cid=0, mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffe391f8880) at ../../../src/include/access/tableam.h:1228 #6 0x000055809fd6ace3 in ExecUpdate (mtstate=0x5580a0bca7d0, tupleid=0x7ffe391f895a, oldtuple=0x0, slot=0x5580a0bcc218, planSlot=0x5580a0bcc0a0, epqstate=0x5580a0bca8c8, estate=0x5580a0bca428, canSetTag=true) at nodeModifyTable.c:1369 #7 0x000055809fd6c2da in ExecModifyTable (pstate=0x5580a0bca7d0) at nodeModifyTable.c:2216 #8 0x000055809fd3b65d in ExecProcNodeFirst (node=0x5580a0bca7d0) at execProcnode.c:445 #9 0x000055809fd304b7 in ExecProcNode (node=0x5580a0bca7d0) at ../../../src/include/executor/executor.h:239 #10 0x000055809fd32e7a in ExecutePlan (estate=0x5580a0bca428, planstate=0x5580a0bca7d0, use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x5580a0bcd558, execute_once=true) at execMain.c:1648 #11 0x000055809fd30b0d in standard_ExecutorRun (queryDesc=0x5580a0b2a8d8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365 #12 0x000055809fd30931 in ExecutorRun (queryDesc=0x5580a0b2a8d8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309 #13 0x000055809ff3a439 in ProcessQuery (plan=0x5580a0bbe170, sourceText=0x5580a0b031a8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 3;", params=0x0, queryEnv=0x0, dest=0x5580a0bcd558, completionTag=0x7ffe391f8d60 "") at pquery.c:161 #14 0x000055809ff3bdaf in PortalRunMulti (portal=0x5580a0b69958, isTopLevel=true, setHoldSnapshot=false, dest=0x5580a0bcd558, altdest=0x5580a0bcd558, completionTag=0x7ffe391f8d60 "") at pquery.c:1283 #15 0x000055809ff3b366 in PortalRun (portal=0x5580a0b69958, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5580a0bcd558, altdest=0x5580a0bcd558, completionTag=0x7ffe391f8d60 "") at pquery.c:796 #16 0x000055809ff34e8d in exec_simple_query (query_string=0x5580a0b031a8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 3;") at postgres.c:1215 #17 0x000055809ff394ca in PostgresMain (argc=1, argv=0x5580a0b2da50, dbname=0x5580a0b2d930 "test", username=0x5580a0affd68 "zharkov") at postgres.c:4249 #18 0x000055809fe8aa98 in BackendRun (port=0x5580a0b256f0) at postmaster.c:4430 #19 0x000055809fe8a201 in BackendStartup (port=0x5580a0b256f0) at postmaster.c:4121 #20 0x000055809fe863b6 in ServerLoop () at postmaster.c:1704 #21 0x000055809fe85be4 in PostmasterMain (argc=3, argv=0x5580a0afdcd0) at postmaster.c:1377 #22 0x000055809fda294e in main (argc=3, argv=0x5580a0afdcd0) at main.c:228 On the first bad commit: First server: PostgreSQL 12devel-HEAD/ad30805 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit [New LWP 25344] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test [local] UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007fc51e46e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007fc51e46e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x00007fc51e47002a in __GI_abort () at abort.c:89 #2 0x0000000000a98e22 in errfinish (dummy=0) at elog.c:552 #3 0x0000000000a9b3c9 in elog_finish (elevel=22, fmt=0xb66160 "cannot abort transaction %u, it was already committed") at elog.c:1365 #4 0x0000000000561df4 in RecordTransactionAbort (isSubXact=false) at xact.c:1644 #5 0x0000000000562e00 in AbortTransaction () at xact.c:2670 #6 0x0000000000563683 in AbortCurrentTransaction () at xact.c:3191 #7 0x00000000009113f6 in PostgresMain (argc=1, argv=0x240e8b0, dbname=0x240e790 "test", username=0x23dfd98 "zharkov") at postgres.c:4033 #8 0x0000000000861064 in BackendRun (port=0x2404160) at postmaster.c:4399 #9 0x00000000008607c5 in BackendStartup (port=0x2404160) at postmaster.c:4090 #10 0x000000000085cac7 in ServerLoop () at postmaster.c:1703 #11 0x000000000085c2d5 in PostmasterMain (argc=3, argv=0x23ddd00) at postmaster.c:1376 #12 0x0000000000777edf in main (argc=3, argv=0x23ddd00) at main.c:228 Second server: PostgreSQL 12devel-HEAD/ad308058cc on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit [New LWP 11211] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: zharkov test [local] UPDATE '. Program terminated with signal SIGABRT, Aborted. #0 0x00007feda4a2cfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x00007feda4a2cfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007feda4a2e3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x000055802b411af6 in errfinish (dummy=0) at elog.c:552 #3 0x000055802b414035 in elog_finish (elevel=22, fmt=0x55802b4dbb60 "cannot abort transaction %u, it was already committed") at elog.c:1365 #4 0x000055802aefbd39 in RecordTransactionAbort (isSubXact=false) at xact.c:1644 #5 0x000055802aefcdbd in AbortTransaction () at xact.c:2670 #6 0x000055802aefd6c7 in AbortCurrentTransaction () at xact.c:3191 #7 0x000055802b296b90 in PostgresMain (argc=1, argv=0x55802d723a30, dbname=0x55802d723910 "test", username=0x55802d6f5d78 "zharkov") at postgres.c:4033 #8 0x000055802b1e9059 in BackendRun (port=0x55802d71a1f0) at postmaster.c:4399 #9 0x000055802b1e87c2 in BackendStartup (port=0x55802d71a1f0) at postmaster.c:4090 #10 0x000055802b1e4a44 in ServerLoop () at postmaster.c:1703 #11 0x000055802b1e4272 in PostmasterMain (argc=3, argv=0x55802d6f3ce0) at postmaster.c:1376 #12 0x000055802b10299f in main (argc=3, argv=0x55802d6f3ce0) at main.c:228 The error does not reproduce on commit 2fc7af5 on both servers. -- regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Michael Paquier
Дата:
On Thu, Apr 11, 2019 at 10:29:03AM +0700, r.zharkov@postgrespro.ru wrote: > The error does not reproduce on commit 2fc7af5 on both servers. Okay. As nothing seems to remain here, I have marked this open item as fixed. -- Michael
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed
От
Жарков Роман
Дата:
> 11 апр. 2019 г., в 10:47, Michael Paquier <michael@paquier.xyz> написал(а): > >> On Thu, Apr 11, 2019 at 10:29:03AM +0700, r.zharkov@postgrespro.ru wrote: >> The error does not reproduce on commit 2fc7af5 on both servers. > > Okay. As nothing seems to remain here, I have marked this open item > as fixed. > -- > Michael Hi, There is no fix. 2fc7af5 is the last good commit. regards, Roman
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Michael Paquier
Дата:
On Thu, Apr 11, 2019 at 11:17:54AM +0700, Жарков Роман wrote: > There is no fix. 2fc7af5 is the last good commit. Cool, thanks for confirming. -- Michael
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Thomas Munro
Дата:
On Thu, Apr 11, 2019 at 4:23 PM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Apr 11, 2019 at 11:17:54AM +0700, Жарков Роман wrote: > > There is no fix. 2fc7af5 is the last good commit. > > Cool, thanks for confirming. Thanks. Looking into this. -- Thomas Munro https://enterprisedb.com
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-11 11:44, Thomas Munro wrote: >> Cool, thanks for confirming. > > Thanks. Looking into this. Hi, I think that the problem is in the autovacuum. ... 2019-04-11 08:04:49.593 MSK [24671] LOG: automatic analyze of table "test.public.pgbench_accounts" system usage: CPU: user: 0.15 s, system: 0.00 s, elapsed: 0.17 s 2019-04-11 08:05:29.137 MSK [13263] LOG: checkpoint starting: wal 2019-04-11 08:05:49.352 MSK [25040] LOG: automatic analyze of table "test.public.pgbench_accounts" system usage: CPU: user: 0.15 s, system: 0.00 s, elapsed: 0.18 s 2019-04-11 08:06:49.449 MSK [25058] LOG: automatic vacuum of table "test.public.pgbench_accounts": index scans: 1 pages: 0 removed, 2145 remain, 1 skipped due to pins, 0 skipped frozen tuples: 8349 removed, 100975 remain, 976 are dead but not yet removable, oldest xmin: 195581926 buffer usage: 6303 hits, 0 misses, 43 dirtied avg read rate: 0.000 MB/s, avg write rate: 1.353 MB/s system usage: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.24 s 2019-04-11 08:06:49.623 MSK [25058] LOG: automatic analyze of table "test.public.pgbench_accounts" system usage: CPU: user: 0.13 s, system: 0.01 s, elapsed: 0.17 s 2019-04-11 08:06:57.478 MSK [13263] LOG: checkpoint complete: wrote 2378 buffers (14.5%); 0 WAL file(s) added, 0 removed, 42 recycled; write=88.322 s, sync=0.000 s, total=88.340 s; sync fi les=6, longest=0.000 s, average=0.000 s; distance=688127 kB, estimate=688128 kB 2019-04-11 08:07:49.451 MSK [25084] LOG: automatic analyze of table "test.public.pgbench_accounts" system usage: CPU: user: 0.15 s, system: 0.01 s, elapsed: 0.22 s 2019-04-11 08:08:28.963 MSK [13263] LOG: checkpoint starting: wal 2019-04-11 08:08:49.500 MSK [25100] LOG: automatic vacuum of table "test.public.pgbench_accounts": index scans: 1 pages: 0 removed, 2145 remain, 1 skipped due to pins, 0 skipped frozen tuples: 8666 removed, 101047 remain, 1047 are dead but not yet removable, oldest xmin: 198748530 buffer usage: 6243 hits, 0 misses, 15 dirtied avg read rate: 0.000 MB/s, avg write rate: 0.480 MB/s system usage: CPU: user: 0.17 s, system: 0.00 s, elapsed: 0.24 s 2019-04-11 08:08:49.678 MSK [25100] LOG: automatic analyze of table "test.public.pgbench_accounts" system usage: CPU: user: 0.13 s, system: 0.01 s, elapsed: 0.17 s TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 0x0010))) == 0x0040)))", File: "heapam.c", Line: 4289) 2019-04-11 08:09:37.452 MSK [13261] LOG: server process (PID 13294) was terminated by signal 6: Aborted 2019-04-11 08:09:37.452 MSK [13261] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2; -- regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Thomas Munro
Дата:
On Thu, Apr 11, 2019 at 5:35 PM <r.zharkov@postgrespro.ru> wrote: > TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || > (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | > 0x0010))) == 0x0040)))", File: "heapam.c", Line: > 4289) Hi Roman, Ah, I see it. Can you reproduce the problem with this patch applied? While going over that commit spotted an incorrect sizeof in EstimateTransactionStateSpace(), but I'll fix that separately. -- Thomas Munro https://enterprisedb.com
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-11 15:52, Thomas Munro wrote: > Ah, I see it. Can you reproduce the problem with this patch applied? > > While going over that commit spotted an incorrect sizeof in > EstimateTransactionStateSpace(), but I'll fix that separately. Of course, I will try it now. -- regards Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
r.zharkov@postgrespro.ru
Дата:
On 2019-04-11 15:52, Thomas Munro wrote: > On Thu, Apr 11, 2019 at 5:35 PM <r.zharkov@postgrespro.ru> wrote: >> TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || >> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 >> | >> 0x0010))) == 0x0040)))", File: "heapam.c", Line: >> 4289) > > Hi Roman, > > Ah, I see it. Can you reproduce the problem with this patch applied? > > While going over that commit spotted an incorrect sizeof in > EstimateTransactionStateSpace(), but I'll fix that separately. Hello, My test finished successfully. Everything is OK. Thanks! -- regards, Roman Zharkov
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Thomas Munro
Дата:
On Fri, Apr 12, 2019 at 2:44 PM <r.zharkov@postgrespro.ru> wrote: > My test finished successfully. Everything is OK. Pushed. Thanks for the report and testing! -- Thomas Munro https://enterprisedb.com
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Michael Paquier
Дата:
On Fri, Apr 12, 2019 at 05:04:53PM +1200, Thomas Munro wrote: > On Fri, Apr 12, 2019 at 2:44 PM <r.zharkov@postgrespro.ru> wrote: >> My test finished successfully. Everything is OK. > > Pushed. Thanks for the report and testing! Could you mark this item in [1] as fixed please? [1]: https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Open_Issues -- Michael
Вложения
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed
От
Thomas Munro
Дата:
On Fri, Apr 12, 2019 at 5:19 PM Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Apr 12, 2019 at 05:04:53PM +1200, Thomas Munro wrote: > > On Fri, Apr 12, 2019 at 2:44 PM <r.zharkov@postgrespro.ru> wrote: > >> My test finished successfully. Everything is OK. > > > > Pushed. Thanks for the report and testing! > > Could you mark this item in [1] as fixed please? Oops, done (moved to the fixed items list). -- Thomas Munro https://enterprisedb.com