Обсуждение: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Поиск

Список

Период

Сортировка

BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

PG Bug reporting form

Дата:

02 апреля 2019 г., 13:30:15

The following bug has been logged on the website:

Bug reference:      15727
Logged by:          Roman Zharkov
Email address:      r.zharkov@postgrespro.ru
PostgreSQL version: Unsupported/Unknown
Operating system:   Ubuntu 5.4.0-6ubuntu1~16.04.10
Description:

Hello,
Our test server crashed while benchmarking connection pooler called Yandex
Odyssey.
There is no debugging info in the code dump now, because it takes few hours
to reproduce the error. We will try to reproduce the error with debugging
info.

Server version - PostgreSQL 12devel-master, commit - bb76134 on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
20160609, 64-bit

The error does not reproduce on PostgreSQL 12devel-master, commit -
2fadf24

2019-04-01 15:27:38.829 +07 [7524] ERROR:  unexpected table_lock_tuple
status: 1
2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
400048276, it was already committed
2019-04-01 15:27:39.001 +07 [7499] ERROR:  unexpected table_lock_tuple
status: 1
2019-04-01 15:27:39.001 +07 [7499] STATEMENT:  UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 2;
2019-04-01 15:27:39.001 +07 [7499] PANIC:  cannot abort transaction
400051561, it was already committed
2019-04-01 15:27:41.193 +07 [7509] ERROR:  unexpected table_lock_tuple
status: 1
2019-04-01 15:27:41.193 +07 [7509] STATEMENT:  UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:41.193 +07 [7509] PANIC:  cannot abort transaction
400088425, it was already committed
2019-04-01 15:41:01.176 +07 [3559] LOG:  server process (PID 7499) was
terminated by signal 6: Aborted
2019-04-01 15:41:01.176 +07 [3559] DETAIL:  Failed process was running:
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-01 15:41:01.176 +07 [3559] LOG:  terminating any other active server
processes

[New LWP 7509]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(42666) UPDATE
'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f1f4c6c902a in __GI_abort () at abort.c:89
#2  0x000000000085faac in errfinish ()
#3  0x00000000008633e4 in elog_finish ()
#4  0x000000000050266d in RecordTransactionAbort ()
#5  0x0000000000502788 in AbortTransaction ()
#6  0x0000000000502fb5 in AbortCurrentTransaction ()
#7  0x000000000075029f in PostgresMain ()
#8  0x0000000000480593 in ServerLoop ()
#9  0x00000000006dfc04 in PostmasterMain ()
#10 0x00000000004813b9 in main ()

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 18:29:25

Hi,

On 2019-04-02 10:30:15 +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      15727
> Logged by:          Roman Zharkov
> Email address:      r.zharkov@postgrespro.ru
> PostgreSQL version: Unsupported/Unknown
> Operating system:   Ubuntu 5.4.0-6ubuntu1~16.04.10
> Description:        
> 
> Hello,
> Our test server crashed while benchmarking connection pooler called Yandex
> Odyssey.

What's the exact workload? Just pgbench?

> There is no debugging info in the code dump now, because it takes few hours
> to reproduce the error. We will try to reproduce the error with debugging
> info.

Unfortunately a backtrace is probably not going to help up much, I'm
afraid. The issue likely is triggered at an earlier state than the error
thrown here.

> Server version - PostgreSQL 12devel-master, commit - bb76134 on
> x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
> 20160609, 64-bit
> 
> The error does not reproduce on PostgreSQL 12devel-master, commit -
> 2fadf24
> 
> 2019-04-01 15:27:38.829 +07 [7524] ERROR:  unexpected table_lock_tuple
> status: 1

Hm, so that appears to be related to my changes. Could you run this with
log_error_verbosity=verbose, so we know which of the codepaths this
hits?  Apparently we're encountering an invisible tuple somewhere, which
I thought was not possible (hence the error) - previously we just
ignored that.

> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> abalance = 1 WHERE aid = 1;
> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> 400048276, it was already committed

But that's probably a separate issue.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Tom Lane

Дата:

02 апреля 2019 г., 18:44:26

Andres Freund <andres@anarazel.de> writes:
>> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
>> abalance = 1 WHERE aid = 1;
>> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
>> 400048276, it was already committed

> But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

            regards, tom lane

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 18:47:20

Hi,

On 2019-04-02 11:44:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> >> abalance = 1 WHERE aid = 1;
> >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> >> 400048276, it was already committed
> 
> > But that's probably a separate issue.
> 
> What that seems to indicate is that the "unexpected table_lock_tuple
> status" error was thrown during commit, which seems pretty odd.

I suspect that's a deferred trigger. But that code obviously could throw
errors, so we gotta handle that correctly.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Alvaro Herrera

Дата:

02 апреля 2019 г., 18:51:08

On 2019-Apr-02, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
> >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> >> abalance = 1 WHERE aid = 1;
> >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> >> 400048276, it was already committed
> 
> > But that's probably a separate issue.
> 
> What that seems to indicate is that the "unexpected table_lock_tuple
> status" error was thrown during commit, which seems pretty odd.

AFAICS this error can only come from ExecDelete(), because the value 1
is TM_Invisible and the other callsites where the "unexpected
table_lock_tuple" error appears use different wording for that one.

Maybe it's the result of a deferred constraint being checked at that
time ... maybe it's trying to honor an "on cascade delete" setting for
an FK, and the affected tuple has already been updated or deleted?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Tom Lane

Дата:

02 апреля 2019 г., 18:51:38

Andres Freund <andres@anarazel.de> writes:
> On 2019-04-02 11:44:26 -0400, Tom Lane wrote:
>> What that seems to indicate is that the "unexpected table_lock_tuple
>> status" error was thrown during commit, which seems pretty odd.

> I suspect that's a deferred trigger. But that code obviously could throw
> errors, so we gotta handle that correctly.

Deferred triggers execute before the real transaction commit, not during
the critical section surrounding where we set the bit in pg_clog.
But this error is seemingly getting thrown from within that critical
section, otherwise we wouldn't have inconsistent xact status afterwards.

            regards, tom lane

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 19:00:48

Hi,

On 2019-04-02 11:51:38 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-04-02 11:44:26 -0400, Tom Lane wrote:
> >> What that seems to indicate is that the "unexpected table_lock_tuple
> >> status" error was thrown during commit, which seems pretty odd.
> 
> > I suspect that's a deferred trigger. But that code obviously could throw
> > errors, so we gotta handle that correctly.
> 
> Deferred triggers execute before the real transaction commit, not during
> the critical section surrounding where we set the bit in pg_clog.
> But this error is seemingly getting thrown from within that critical
> section, otherwise we wouldn't have inconsistent xact status afterwards.

I don't think it was thrown during the critical section - we'd otherwise
see the ERROR about unexpected lock status converted to a PANIC (and the
"cannot abort transaction, already committed" is a PANIC itself).  I
suspect it's more likely that we corrupt the xact.c status somewhere -
but it's hard to say without knowing more about the workload.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 19:13:20

Hi,

On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:
> On 2019-Apr-02, Tom Lane wrote:
> 
> > Andres Freund <andres@anarazel.de> writes:
> > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> > >> abalance = 1 WHERE aid = 1;
> > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> > >> 400048276, it was already committed
> > 
> > > But that's probably a separate issue.
> > 
> > What that seems to indicate is that the "unexpected table_lock_tuple
> > status" error was thrown during commit, which seems pretty odd.
> 
> AFAICS this error can only come from ExecDelete(), because the value 1
> is TM_Invisible and the other callsites where the "unexpected
> table_lock_tuple" error appears use different wording for that one.

Hm? Why couldn't it be the ExecUpdate() case?


> Maybe it's the result of a deferred constraint being checked at that
> time ... maybe it's trying to honor an "on cascade delete" setting for
> an FK, and the affected tuple has already been updated or deleted?

Then it ought to get TM_Deleted, no? We ought to wait till that
transaction commits, and then roll back. So there's something odd
happening here.  I suspect there has to be some additional log output or
such to explain this.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Alvaro Herrera

Дата:

02 апреля 2019 г., 20:00:51

On 2019-Apr-02, Andres Freund wrote:

> On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:

> > AFAICS this error can only come from ExecDelete(), because the value 1
> > is TM_Invisible and the other callsites where the "unexpected
> > table_lock_tuple" error appears use different wording for that one.
> 
> Hm? Why couldn't it be the ExecUpdate() case?

You're right, that one too.

> > Maybe it's the result of a deferred constraint being checked at that
> > time ... maybe it's trying to honor an "on cascade delete" setting for
> > an FK, and the affected tuple has already been updated or deleted?
> 
> Then it ought to get TM_Deleted, no? We ought to wait till that
> transaction commits, and then roll back.

I was thinking that this would have happened in the same transaction;
but yeah I didn't spend too much time analyzing the exact code flow.
Anyway I agree that there's something odd going on, and perhaps you just
unmasked an earlier bug.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

02 апреля 2019 г., 20:04:15

On 2019-04-02 23:13, Andres Freund wrote:
> Hi,
> 
> On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:
>> On 2019-Apr-02, Tom Lane wrote:
>> 
>> > Andres Freund <andres@anarazel.de> writes:
>> > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
>> > >> abalance = 1 WHERE aid = 1;
>> > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
>> > >> 400048276, it was already committed
>> >
>> > > But that's probably a separate issue.
>> >
>> > What that seems to indicate is that the "unexpected table_lock_tuple
>> > status" error was thrown during commit, which seems pretty odd.
>> 
>> AFAICS this error can only come from ExecDelete(), because the value 1
>> is TM_Invisible and the other callsites where the "unexpected
>> table_lock_tuple" error appears use different wording for that one.
> 
> Hm? Why couldn't it be the ExecUpdate() case?
> 
> 
>> Maybe it's the result of a deferred constraint being checked at that
>> time ... maybe it's trying to honor an "on cascade delete" setting for
>> an FK, and the affected tuple has already been updated or deleted?
> 
> Then it ought to get TM_Deleted, no? We ought to wait till that
> transaction commits, and then roll back. So there's something odd
> happening here.  I suspect there has to be some additional log output 
> or
> such to explain this.
> 
> Greetings,
> 
> Andres Freund

I made the new core file:

[New LWP 5967]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(35888) UPDATE    
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fe19476b02a in __GI_abort () at abort.c:89
#2  0x0000000000a9c0ac in errfinish (dummy=0) at elog.c:552
#3  0x0000000000a9e653 in elog_finish (elevel=22, fmt=0xb695a0 "cannot 
abort transaction %u, it was already committed") at elog.c:1365
#4  0x0000000000562c87 in RecordTransactionAbort (isSubXact=false) at 
xact.c:1644
#5  0x0000000000563c93 in AbortTransaction () at xact.c:2670
#6  0x0000000000564516 in AbortCurrentTransaction () at xact.c:3191
#7  0x00000000009145b9 in PostgresMain (argc=1, argv=0x255d970, 
dbname=0x255d950 "test", username=0x252dd88 "zharkov") at 
postgres.c:4033
#8  0x000000000086406c in BackendRun (port=0x2556020) at 
postmaster.c:4399
#9  0x00000000008637cd in BackendStartup (port=0x2556020) at 
postmaster.c:4090
#10 0x000000000085facf in ServerLoop () at postmaster.c:1703
#11 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x252bcf0) at 
postmaster.c:1376
#12 0x000000000077aeba in main (argc=3, argv=0x252bcf0) at main.c:228

Log:
2019-04-02 21:51:19.639 +07 [5967] STATEMENT:  UPDATE pgbench_accounts 
SET abalance = 1 WHERE aid = 1;
2019-04-02 21:51:19.639 +07 [5967] PANIC:  cannot abort transaction 
200011850, it was already committed
2019-04-02 21:53:57.924 +07 [2473] LOG:  server process (PID 5967) was 
terminated by signal 6: Aborted
2019-04-02 21:53:57.924 +07 [2473] DETAIL:  Failed process was running: 
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;
2019-04-02 21:53:57.924 +07 [2473] LOG:  terminating any other active 
server processes

We run pgbench with two custom scripts.

First:
\set i random_zipfian(1, 100000 * :scale, 1.01)
SELECT abalance FROM pgbench_accounts WHERE aid = :i;

Second:
\set i random_zipfian(1, 100000 * :scale, 1.01)
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = :i;

pgbench connects to the database through the connection pooller called 
Odyssey.
We use pool of 16 or 32 backends.
We run four series of pgbench tests with different connections count ( 
100, 150, 200, ... , 1000 )
Three series with pool size 16 runs normally.
But fourth series fails with 200-300 connections. It took about five 
hours to reproduce the error this time.

Here is the little piece of the pgbench log:

Bench start
Tue Apr  2 14:43:12 UTC 2019
Timestamp 1554216192
transaction type: multiple scripts
scaling factor: 1
query mode: simple
number of clients: 200
number of threads: 32
duration: 300 s
number of transactions actually processed: 8062981
latency average = 7.442 ms
tps = 26875.375100 (including connections establishing)
tps = 26875.694564 (excluding connections establishing)
SQL script 1: ./sql/ycsb_read_zipf.sql
  - weight: 1 (targets 50.0% of total)
  - 4019635 transactions (49.9% of total, tps = 13398.171023)
  - latency average = 6.955 ms
  - latency stddev = 8.366 ms
SQL script 2: ./sql/ycsb_update_zipf.sql
  - weight: 1 (targets 50.0% of total)
  - 4021508 transactions (49.9% of total, tps = 13404.414071)
  - latency average = 7.909 ms
  - latency stddev = 8.478 ms
Bench end
Tue Apr  2 14:48:12 UTC 2019
Timestamp 1554216492
Bench start
Tue Apr  2 14:48:12 UTC 2019
Timestamp 1554216492
client 220 script 1 aborted in command 1 query 0: PANIC:  cannot abort 
transaction 200011850, it was already committed
WARNING:  terminating connection because of crash of another server 
process

-- 
regards,

Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 20:11:52

Hi,

On 2019-04-03 00:04:15 +0700, r.zharkov@postgrespro.ru wrote:
> pgbench connects to the database through the connection pooller called
> Odyssey.
> We use pool of 16 or 32 backends.
> We run four series of pgbench tests with different connections count ( 100,
> 150, 200, ... , 1000 )
> Three series with pool size 16 runs normally.
> But fourth series fails with 200-300 connections. It took about five hours
> to reproduce the error this time.
> 
> Here is the little piece of the pgbench log:
> 
> Bench start
> Tue Apr  2 14:43:12 UTC 2019
> Timestamp 1554216192
> transaction type: multiple scripts
> scaling factor: 1

Is this the actual scaling factor? Did you create the pgbench database
with foreign keys?

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Tom Lane

Дата:

02 апреля 2019 г., 20:13:58

r.zharkov@postgrespro.ru writes:
> I made the new core file:

Thanks, but this isn't much help --- it just shows what we already
know, ie the "cannot abort transaction" error occurs after some other
error was thrown.

What would be more helpful would be to adjust the places that
can throw the "unexpected table_lock_tuple status" error text to be
PANIC rather than ERROR, so that the core dump would show a stack
trace showing how we got to whichever of those places this is.
Or, run the test with a higher log verbosity, so that you can find
out which of those places is throwing the error to begin with;
then just PANIC-ize that one.

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380:                 elog(ERROR, "unexpected table_lock_tuple status: %u",
test);
pgsql/src/backend/executor/nodeLockRows.c: 232:                 elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 881:                             elog(ERROR, "unexpected table_lock_tuple
status:%u", 
pgsql/src/backend/executor/nodeModifyTable.c: 1384:                             elog(ERROR, "unexpected
table_lock_tuplestatus: %u", 
pgsql/src/backend/executor/execReplication.c: 211:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",res); 
pgsql/src/backend/executor/execReplication.c: 375:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",res); 


            regards, tom lane

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 20:17:14

Hi,

On 2019-04-02 13:13:58 -0400, Tom Lane wrote:
> FWIW, I see six potential candidates, not two:
> 
> pgsql/src/backend/commands/trigger.c: 3380:                 elog(ERROR, "unexpected table_lock_tuple status: %u",
test);
> pgsql/src/backend/executor/nodeLockRows.c: 232:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",
> pgsql/src/backend/executor/execReplication.c: 211:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",res);
 
> pgsql/src/backend/executor/execReplication.c: 375:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",res);
 

These have an explicit case / separate error for TM_Invisible (= 1 -
which is the status we're seeing according to the error message)
though. So afaict they can't be relevant here.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Alvaro Herrera

Дата:

02 апреля 2019 г., 20:17:21

On 2019-Apr-02, Tom Lane wrote:

> FWIW, I see six potential candidates, not two:
> 
> pgsql/src/backend/commands/trigger.c: 3380:                 elog(ERROR, "unexpected table_lock_tuple status: %u",
test);
> pgsql/src/backend/executor/nodeLockRows.c: 232:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",
> pgsql/src/backend/executor/nodeModifyTable.c: 881:                             elog(ERROR, "unexpected
table_lock_tuplestatus: %u",
 
> pgsql/src/backend/executor/nodeModifyTable.c: 1384:                             elog(ERROR, "unexpected
table_lock_tuplestatus: %u",
 
> pgsql/src/backend/executor/execReplication.c: 211:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",res);
 
> pgsql/src/backend/executor/execReplication.c: 375:                 elog(ERROR, "unexpected table_lock_tuple status:
%u",res);
 

The other four handle the TM_Invisible case with a different error
message.  Anyway, the six of them are unexpected so changing them all to
PANIC ought to be good enough.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Tom Lane

Дата:

02 апреля 2019 г., 20:30:56

Andres Freund <andres@anarazel.de> writes:
> On 2019-04-02 13:13:58 -0400, Tom Lane wrote:
>> FWIW, I see six potential candidates, not two:

> These have an explicit case / separate error for TM_Invisible (= 1 -
> which is the status we're seeing according to the error message)
> though. So afaict they can't be relevant here.

Oh, right, so it has to be from one of the calls in nodeModifyTable.c.
But it's hardly plausible that that runs post-commit.  So now I'm
thinking that what we really need to know is why a TM_Invisible
result was returned.  That's going to be a bit harder ...

            regards, tom lane

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Жарков Роман

Дата:

02 апреля 2019 г., 20:35:47


> 3 апр. 2019 г., в 0:13, Tom Lane <tgl@sss.pgh.pa.us> написал(а):
>
> r.zharkov@postgrespro.ru writes:
>> I made the new core file:
>
> Thanks, but this isn't much help --- it just shows what we already
> know, ie the "cannot abort transaction" error occurs after some other
> error was thrown.
>
> What would be more helpful would be to adjust the places that
> can throw the "unexpected table_lock_tuple status" error text to be
> PANIC rather than ERROR, so that the core dump would show a stack
> trace showing how we got to whichever of those places this is.
> Or, run the test with a higher log verbosity, so that you can find
> out which of those places is throwing the error to begin with;
> then just PANIC-ize that one.
>
> FWIW, I see six potential candidates, not two:
>
> pgsql/src/backend/commands/trigger.c: 3380:                elog(ERROR, "unexpected table_lock_tuple status: %u",
test);
> pgsql/src/backend/executor/nodeLockRows.c: 232:                elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 881:                            elog(ERROR, "unexpected
table_lock_tuplestatus: %u", 
> pgsql/src/backend/executor/nodeModifyTable.c: 1384:                            elog(ERROR, "unexpected
table_lock_tuplestatus: %u", 
> pgsql/src/backend/executor/execReplication.c: 211:                elog(ERROR, "unexpected table_lock_tuple status:
%u",res); 
> pgsql/src/backend/executor/execReplication.c: 375:                elog(ERROR, "unexpected table_lock_tuple status:
%u",res); 
>
>
>            regards, tom lane

Ok. I will made it tomorrow morning.

regards,
Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

02 апреля 2019 г., 20:35:54

Hi,

On 2019-04-02 13:30:56 -0400, Tom Lane wrote:
> But it's hardly plausible that that runs post-commit.

Yea, one surely would hope that that's not happening. I suspect it's
possible it's running pre-commit and somehow the failure handling is
screwed up leading to the 'already committed' error.

> So now I'm thinking that what we really need to know is why a
> TM_Invisible result was returned.  That's going to be a bit harder ...

Yea. I re-skimmed through the code and can't immediately see anything
wrong - not that such subtle issues would necessarily be noticed without
more information.

I wonder, is there a chance that you're using an extension with xact
hooks, or a modified version of postgres?  Could you show us the exact
schema used, and the config?

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

03 апреля 2019 г., 11:29:21

On 2019-04-03 00:13, Tom Lane wrote:
> r.zharkov@postgrespro.ru writes:
>> I made the new core file:
> 
> Thanks, but this isn't much help --- it just shows what we already
> know, ie the "cannot abort transaction" error occurs after some other
> error was thrown.
> 
> What would be more helpful would be to adjust the places that
> can throw the "unexpected table_lock_tuple status" error text to be
> PANIC rather than ERROR, so that the core dump would show a stack
> trace showing how we got to whichever of those places this is.
> Or, run the test with a higher log verbosity, so that you can find
> out which of those places is throwing the error to begin with;
> then just PANIC-ize that one.
> 
> FWIW, I see six potential candidates, not two:
> 
> pgsql/src/backend/commands/trigger.c: 3380:                 elog(ERROR,
> "unexpected table_lock_tuple status: %u", test);
> pgsql/src/backend/executor/nodeLockRows.c: 232:                 elog(ERROR,
> "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 881:                             elog(ERROR,
> "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 1384:                             elog(ERROR,
> "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/execReplication.c: 211:                 elog(ERROR,
> "unexpected table_lock_tuple status: %u", res);
> pgsql/src/backend/executor/execReplication.c: 375:                 elog(ERROR,
> "unexpected table_lock_tuple status: %u", res);
> 
> 
>             regards, tom lane

Hello,
I have made the new core dump. Here are the results:
PostgreSQL 12devel-master/bb76134

Log:
TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || 
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 
0x0010))) == 0x0040)))", File: "heapam.c", Line:
2019-04-03 15:13:36.668 +07 [23163] LOG:  server process (PID 27904) was 
terminated by signal 6: Aborted
2019-04-03 15:13:36.668 +07 [23163] DETAIL:  Failed process was running: 
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-03 15:13:36.668 +07 [23163] LOG:  terminating any other active 
server processes
2019-04-03 15:13:36.669 +07 [27907] WARNING:  terminating connection 
because of crash of another server process

Backtrace:
[New LWP 27904]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(44582) UPDATE    
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
#2  0x0000000000a9b7ce in ExceptionalCondition (
     conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) || 
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 
0x0010))) == 0x0040)))",
     errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c", 
lineNumber=4286) at assert.c:54
#3  0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80, 
tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive, 
wait_policy=LockWaitBlock, follow_updates=false,
     buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286
#4  0x00000000004f5903 in heapam_tuple_lock (relation=0x7f5e0f252d80, 
tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0, 
mode=LockTupleNoKeyExclusive,
     wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffed968cd80) at 
heapam_handler.c:352
#5  0x000000000074055a in table_lock_tuple (rel=0x7f5e0f252d80, 
tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0, 
mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock,
     flags=2 '\002', tmfd=0x7ffed968cd80) at 
../../../src/include/access/tableam.h:993
#6  0x00000000007421cc in ExecUpdate (mtstate=0x30eb780, 
tupleid=0x7ffed968cdf0, oldtuple=0x0, slot=0x30ed1c8, 
planSlot=0x30ed050, epqstate=0x30eb878, estate=0x30eb3d8, 
canSetTag=true)
     at nodeModifyTable.c:1242
#7  0x0000000000743756 in ExecModifyTable (pstate=0x30eb780) at 
nodeModifyTable.c:2070
#8  0x0000000000712671 in ExecProcNodeFirst (node=0x30eb780) at 
execProcnode.c:445
#9  0x000000000070748c in ExecProcNode (node=0x30eb780) at 
../../../src/include/executor/executor.h:239
#10 0x0000000000709d11 in ExecutePlan (estate=0x30eb3d8, 
planstate=0x30eb780, use_parallel_mode=false, operation=CMD_UPDATE, 
sendTuples=false, numberTuples=0,
     direction=ForwardScanDirection, dest=0x30ee508, execute_once=true) 
at execMain.c:1643
#11 0x0000000000707ac9 in standard_ExecutorRun (queryDesc=0x303faf8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:362
#12 0x0000000000707905 in ExecutorRun (queryDesc=0x303faf8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:306
#13 0x0000000000915950 in ProcessQuery (plan=0x30df120, 
sourceText=0x30161d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid 
= 2;", params=0x0, queryEnv=0x0, dest=0x30ee508,
     completionTag=0x7ffed968d270 "") at pquery.c:161
#14 0x000000000091739c in PortalRunMulti (portal=0x3098078, 
isTopLevel=true, setHoldSnapshot=false, dest=0x30ee508, 
altdest=0x30ee508, completionTag=0x7ffed968d270 "") at pquery.c:1283
#15 0x00000000009168e8 in PortalRun (portal=0x3098078, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x30ee508, altdest=0x30ee508, completionTag=0x7ffed968d270 "")
     at pquery.c:796
#16 0x000000000091027a in exec_simple_query (query_string=0x30161d8 
"UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;") at 
postgres.c:1215
#17 0x00000000009148ff in PostgresMain (argc=1, argv=0x3042980, 
dbname=0x3042960 "test", username=0x3012d98 "zharkov") at 
postgres.c:4247
#18 0x000000000086406c in BackendRun (port=0x303b030) at 
postmaster.c:4399
#19 0x00000000008637cd in BackendStartup (port=0x303b030) at 
postmaster.c:4090
#20 0x000000000085facf in ServerLoop () at postmaster.c:1703
#21 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x3010d00) at 
postmaster.c:1376
#22 0x000000000077aeba in main (argc=3, argv=0x3010d00) at main.c:228


The changes I made:
diff --git a/src/backend/commands/trigger.c 
b/src/backend/commands/trigger.c
index e03ffdd..ab3bace 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3358,7 +3358,7 @@ GetTupleForTrigger(EState *estate,
                                         ereport(ERROR,
                                                         
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                                                          errmsg("could 
not serialize access due to concurrent update")));
-                               elog(ERROR, "unexpected table_lock_tuple 
status: %u", test);
+                               elog(PANIC, "unexpected table_lock_tuple 
status: %u", test);
                                 break;

                         case TM_Deleted:
diff --git a/src/backend/executor/execReplication.c 
b/src/backend/executor/execReplication.c
index d8b48c6..d06cbe0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -207,7 +207,7 @@ retry:
                                 elog(ERROR, "attempted to lock invisible 
tuple");
                                 break;
                         default:
-                               elog(ERROR, "unexpected table_lock_tuple 
status: %u", res);
+                               elog(PANIC, "unexpected table_lock_tuple 
status: %u", res);
                                 break;
                 }
         }
@@ -371,7 +371,7 @@ retry:
                                 elog(ERROR, "attempted to lock invisible 
tuple");
                                 break;
                         default:
-                               elog(ERROR, "unexpected table_lock_tuple 
status: %u", res);
+                               elog(PANIC, "unexpected table_lock_tuple
status: %u", res);
                                 break;
                 }
         }
diff --git a/src/backend/executor/nodeLockRows.c 
b/src/backend/executor/nodeLockRows.c
index 7674ac8..c7a261c 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -229,7 +229,7 @@ lnext:
                                         ereport(ERROR,
                                                         
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                                                          errmsg("could 
not serialize access due to concurrent update")));
-                               elog(ERROR, "unexpected table_lock_tuple 
status: %u",
+                               elog(PANIC, "unexpected table_lock_tuple 
status: %u",
                                          test);
                                 break;

diff --git a/src/backend/executor/nodeModifyTable.c 
b/src/backend/executor/nodeModifyTable.c
index 7be0e77..b15b7a0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -780,7 +780,7 @@ ldelete:;
                                                          * locking the 
latest version via
                                                          * 
TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
                                                          */
-                                                       elog(ERROR, 
"unexpected table_lock_tuple status: %u",
+                                                       elog(PANIC, 
"unexpected table_lock_tuple status: %u",
                                                                  
result);
                                                         return NULL;
                                         }
@@ -1269,7 +1269,7 @@ lreplace:;

                                                 default:
                                                         /* see 
table_lock_tuple call in ExecDelete() */
-                                                       elog(ERROR, 
"unexpected table_lock_tuple status: %u",
+                                                       elog(PANIC, 
"unexpected table_lock_tuple status: %u",
                                                                  
result);
                                                         return NULL;
                                         }


regards,

Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

03 апреля 2019 г., 11:38:16

On 2019-04-03 00:11, Andres Freund wrote:
> Hi,
> 
> On 2019-04-03 00:04:15 +0700, r.zharkov@postgrespro.ru wrote:
>> pgbench connects to the database through the connection pooller called
>> Odyssey.
>> We use pool of 16 or 32 backends.
>> We run four series of pgbench tests with different connections count ( 
>> 100,
>> 150, 200, ... , 1000 )
>> Three series with pool size 16 runs normally.
>> But fourth series fails with 200-300 connections. It took about five 
>> hours
>> to reproduce the error this time.
>> 
>> Here is the little piece of the pgbench log:
>> 
>> Bench start
>> Tue Apr  2 14:43:12 UTC 2019
>> Timestamp 1554216192
>> transaction type: multiple scripts
>> scaling factor: 1
> 
> Is this the actual scaling factor? Did you create the pgbench database
> with foreign keys?
> 
> Greetings,
> 
> Andres Freund

Hi,
Yes, 1 is the actual scaling factor.

My script to create database:
/db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -l 
/db/zharkov/vanila/log.log stop
rm -rf /db/zharkov/vanila/*
/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila --data-checksums
#/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila
cp postgresql.conf /db/zharkov/vanila/
ulimit -c unlimited
/db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -c -l 
/db/zharkov/vanila/log.log start
/db/zharkov/.vanila/bin/psql postgres -c 'create database test'
/db/zharkov/.vanila/bin/psql test -c 'create extension pg_prewarm;'
/db/zharkov/.vanila/bin/pgbench -i test -s 1
#/db/zharkov/.vanila/bin/pgbench -i test -s 1000
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 
'pgbench_accounts' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 
'pgbench_branches' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 
'pgbench_history' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm( 
'pgbench_tellers' );"

The dump of the server settings is in the attachment.

-- 
regards,

Roman Zharkov

Вложения

settings.txt

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Alvaro Herrera

Дата:

03 апреля 2019 г., 15:20:27

On 2019-Apr-03, r.zharkov@postgrespro.ru wrote:

> #0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
> #2  0x0000000000a9b7ce in ExceptionalCondition (
>     conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) ||
> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
> 0x0010))) == 0x0040)))",
>     errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c",
> lineNumber=4286) at assert.c:54
> #3  0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80,
> tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive,
> wait_policy=LockWaitBlock, follow_updates=false,
>     buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286

Uhm ... this assertion failure is doesn't seem related to the other
problem you were reporting.  Here, it's failing because it thinks the
tuple should be marked as only locked when the current transaction is
the only Xmax for the tuple.

        /*
         * As a check independent from those above, we can also avoid sleeping
         * if the current transaction is the sole locker of the tuple.  Note
         * that the strength of the lock already held is irrelevant; this is
         * not about recording the lock in Xmax (which will be done regardless
         * of this optimization, below).  Also, note that the cases where we
         * hold a lock stronger than we are requesting are already handled
         * above by not doing anything.
         *
         * Note we only deal with the non-multixact case here; MultiXactIdWait
         * is well equipped to deal with this situation on its own.
         */
        if (require_sleep && !(infomask & HEAP_XMAX_IS_MULTI) &&
            TransactionIdIsCurrentTransactionId(xwait))
        {
            /* ... but if the xmax changed in the meantime, start over */
            LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
            if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
                !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
                                     xwait))
                goto l3;
            Assert(HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask));    //  <--- failed assertion
            require_sleep = false;
        }

I'm not quite sure what's up with that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

03 апреля 2019 г., 15:59:23

On 2019-04-03 19:20, Alvaro Herrera wrote:
> On 2019-Apr-03, r.zharkov@postgrespro.ru wrote:
> 
>> #0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
>> ../sysdeps/unix/sysv/linux/raise.c:54
>> #1  0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
>> #2  0x0000000000a9b7ce in ExceptionalCondition (
>>     conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) 
>> ||
>> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 
>> |
>> 0x0010))) == 0x0040)))",
>>     errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 
>> "heapam.c",
>> lineNumber=4286) at assert.c:54
>> #3  0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80,
>> tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive,
>> wait_policy=LockWaitBlock, follow_updates=false,
>>     buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286
> 
> Uhm ... this assertion failure is doesn't seem related to the other
> problem you were reporting.  Here, it's failing because it thinks the
> tuple should be marked as only locked when the current transaction is
> the only Xmax for the tuple.
> 
>             TransactionIdIsCurrentTransactionId(xwait))
>         {
>             /* ... but if the xmax changed in the meantime, start over */
>             LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>             if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
>                 !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
>                                      xwait))
>                 goto l3;
>             Assert(HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask));    //
> <--- failed assertion
>             require_sleep = false;
>         }
> 
> I'm not quite sure what's up with that.

Today we reproduced the error without the pooller. All server settings 
are the same.

pgbench test -j 32 -T 36000 -f ycsb_read_zipf.sql -f 
ycsb_update_zipf.sql -c 32 -P 60

Backtrace:

[New LWP 29710]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test [local] UPDATE             
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f0db56c3428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f0db56c3428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f0db56c502a in __GI_abort () at abort.c:89
#2  0x0000000000a9c0ac in errfinish (dummy=0) at elog.c:552
#3  0x0000000000a9e653 in elog_finish (elevel=22, fmt=0xc7c7c0 
"unexpected table_lock_tuple status: %u") at elog.c:1365
#4  0x00000000007422c1 in ExecUpdate (mtstate=0x2e02370, 
tupleid=0x7fff6a5c1610, oldtuple=0x0, slot=0x2e03db8, 
planSlot=0x2e03c40, epqstate=0x2e02468, estate=0x2e01fc8, 
canSetTag=true) at nodeModifyTable.c:1272
#5  0x0000000000743756 in ExecModifyTable (pstate=0x2e02370) at 
nodeModifyTable.c:2070
#6  0x0000000000712671 in ExecProcNodeFirst (node=0x2e02370) at 
execProcnode.c:445
#7  0x000000000070748c in ExecProcNode (node=0x2e02370) at 
../../../src/include/executor/executor.h:239
#8  0x0000000000709d11 in ExecutePlan (estate=0x2e01fc8, 
planstate=0x2e02370, use_parallel_mode=false, operation=CMD_UPDATE, 
sendTuples=false, numberTuples=0, direction=ForwardScanDirection, 
dest=0x2e050f8,
     execute_once=true) at execMain.c:1643
#9  0x0000000000707ac9 in standard_ExecutorRun (queryDesc=0x2d56a18, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:362
#10 0x0000000000707905 in ExecutorRun (queryDesc=0x2d56a18, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:306
#11 0x0000000000915950 in ProcessQuery (plan=0x2df5d10, 
sourceText=0x2d2d1d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid 
= 1;", params=0x0, queryEnv=0x0, dest=0x2e050f8, 
completionTag=0x7fff6a5c1a90 "")
     at pquery.c:161
#12 0x000000000091739c in PortalRunMulti (portal=0x2daf078, 
isTopLevel=true, setHoldSnapshot=false, dest=0x2e050f8, 
altdest=0x2e050f8, completionTag=0x7fff6a5c1a90 "") at pquery.c:1283
#13 0x00000000009168e8 in PortalRun (portal=0x2daf078, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x2e050f8, altdest=0x2e050f8, completionTag=0x7fff6a5c1a90 "") at 
pquery.c:796
#14 0x000000000091027a in exec_simple_query (query_string=0x2d2d1d8 
"UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;") at 
postgres.c:1215
#15 0x00000000009148ff in PostgresMain (argc=1, argv=0x2d59a80, 
dbname=0x2d59960 "test", username=0x2d29d98 "zharkov") at 
postgres.c:4247
#16 0x000000000086406c in BackendRun (port=0x2d52030) at 
postmaster.c:4399
#17 0x00000000008637cd in BackendStartup (port=0x2d52030) at 
postmaster.c:4090
#18 0x000000000085facf in ServerLoop () at postmaster.c:1703
#19 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x2d27d00) at 
postmaster.c:1376
#20 0x000000000077aeba in main (argc=3, argv=0x2d27d00) at main.c:228

-------------------------------------------

Some information from first core dump ( thanks to Konstantin Knizhnik ):

(gdb) p *tuple->t_data
$6 = {t_choice = {t_heap = {t_xmin = 200004728, t_xmax = 200004747, 
t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 200004728,
       datum_typmod = 200004747, datum_typeid = 0}}, t_ctid = {ip_blkid = 
{bi_hi = 0, bi_lo = 2337}, ip_posid = 61}, t_infomask2 = 49156, 
t_infomask = 9474,
   t_hoff = 24 '\030', t_bits = 0x7f59f9ea1c17 ""}
(gdb) p xwait
$4 = 200004747
(gdb) p  /x infomask
$7 = 0x2502
(gdb) p result
$8 = TM_Updated
(gdb) p mode
$9 = LockTupleNoKeyExclusive
(gdb) p *CurrentTransactionState
$11 = {fullTransactionId = {value = 200004747}, subTransactionId = 1, 
name = 0x0, savepointLevel = 0, state = TRANS_INPROGRESS, blockState = 
TBLOCK_STARTED,
   nestingLevel = 1, gucNestLevel = 1, curTransactionContext = 0x303cda0, 
curTransactionOwner = 0x3054660, childXids = 0x0, nChildXids = 0, 
maxChildXids = 0,
   prevUser = 10, prevSecContext = 0, prevXactReadOnly = false, 
startedInRecovery = false, didLogXid = false, parallelModeLevel = 0, 
chain = false, parent = 0x0}



-- 
regards

Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Tom Lane

Дата:

06 апреля 2019 г., 19:23:06

It seems that there may be some connection between this problem and
EPQ.  I was working on committing Amit's fix for bug #15677, which
demonstrated that EPQ doesn't work for partitioned-table target rels.
It seemed like there really needed to be regression test coverage for
that, so I tried to convert his crasher example into an isolation test.
It does indeed crash without Amit's fix ... but with it, lookee what
I get:

+error in steps c1 complexpartupdate: ERROR:  unexpected table_lock_tuple status: 1

That seems fully reproducible in this test.  I haven't looked into
exactly what's causing that, but now that we have a reproducible
example, somebody should.

I'm not quite sure if I should commit this as-is or wait till the
other problem is fixed.  A crash is probably worse than a bogus
error, but I don't like committing obviously-wrong "expected" output.
Thoughts?

            regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 602a08e..5c4b3f4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2799,15 +2799,24 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
     estate->es_output_cid = parentestate->es_output_cid;
     if (parentestate->es_num_result_relations > 0)
     {
-        int            numResultRelations = parentestate->es_num_result_relations;
+        int            numResultRels = parentestate->es_num_result_relations;
+        int            numRootResultRels = parentestate->es_num_root_result_relations;
         ResultRelInfo *resultRelInfos;

         resultRelInfos = (ResultRelInfo *)
-            palloc(numResultRelations * sizeof(ResultRelInfo));
+            palloc(numResultRels * sizeof(ResultRelInfo));
         memcpy(resultRelInfos, parentestate->es_result_relations,
-               numResultRelations * sizeof(ResultRelInfo));
+               numResultRels * sizeof(ResultRelInfo));
         estate->es_result_relations = resultRelInfos;
-        estate->es_num_result_relations = numResultRelations;
+        estate->es_num_result_relations = numResultRels;
+
+        /* Also transfer partitioned root result relations. */
+        resultRelInfos = (ResultRelInfo *)
+            palloc(numRootResultRels * sizeof(ResultRelInfo));
+        memcpy(resultRelInfos, parentestate->es_root_result_relations,
+               numRootResultRels * sizeof(ResultRelInfo));
+        estate->es_root_result_relations = resultRelInfos;
+        estate->es_num_root_result_relations = numRootResultRels;
     }
     /* es_result_relation_info must NOT be copied */
     /* es_trig_target_relations must NOT be copied */
diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index c09e97f..47843d7 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -562,3 +562,16 @@ step multireadwcte: <... completed>
 subid          id

 1              1
+
+starting permutation: simplepartupdate complexpartupdate c1 c2
+step simplepartupdate:
+    update parttbl set a = a;
+
+step complexpartupdate:
+    with u as (update parttbl set a = a returning parttbl.*)
+    update parttbl set a = u.a from u;
+ <waiting ...>
+step c1: COMMIT;
+step complexpartupdate: <... completed>
+error in steps c1 complexpartupdate: ERROR:  unexpected table_lock_tuple status: 1
+step c2: COMMIT;
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 41c8d56..7d3f1f6 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -29,6 +29,10 @@ setup

  CREATE TABLE jointest AS SELECT generate_series(1,10) AS id, 0 AS data;
  CREATE INDEX ON jointest(id);
+
+ CREATE TABLE parttbl (a int) PARTITION BY LIST (a);
+ CREATE TABLE parttbl1 PARTITION OF parttbl FOR VALUES IN (1);
+ INSERT INTO parttbl VALUES (1);
 }

 teardown
@@ -37,6 +41,7 @@ teardown
  DROP TABLE accounts_ext;
  DROP TABLE p CASCADE;
  DROP TABLE table_a, table_b, jointest;
+ DROP TABLE parttbl;
 }

 session "s1"
@@ -133,6 +138,12 @@ step "selectresultforupdate"    {
       where jt.id = y for update of jt, ss1, ss2;
 }

+# test for EPQ on a partitioned result table
+
+step "simplepartupdate"    {
+    update parttbl set a = a;
+}
+

 session "s2"
 setup        { BEGIN ISOLATION LEVEL READ COMMITTED; }
@@ -170,6 +181,10 @@ step "updateforcip3"    {
 }
 step "wrtwcte"    { UPDATE table_a SET value = 'tableAValue2' WHERE id = 1; }
 step "wrjt"    { UPDATE jointest SET data = 42 WHERE id = 7; }
+step "complexpartupdate"    {
+    with u as (update parttbl set a = a returning parttbl.*)
+    update parttbl set a = u.a from u;
+}
 step "c2"    { COMMIT; }
 step "r2"    { ROLLBACK; }

@@ -234,3 +249,5 @@ permutation "wrtwcte" "readwcte" "c1" "c2"
 permutation "wrjt" "selectjoinforupdate" "c2" "c1"
 permutation "wrjt" "selectresultforupdate" "c2" "c1"
 permutation "wrtwcte" "multireadwcte" "c1" "c2"
+
+permutation "simplepartupdate" "complexpartupdate" "c1" "c2"

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

06 апреля 2019 г., 19:28:46

Hi,

On 2019-04-06 12:23:06 -0400, Tom Lane wrote:
> It seems that there may be some connection between this problem and
> EPQ.  I was working on committing Amit's fix for bug #15677, which
> demonstrated that EPQ doesn't work for partitioned-table target rels.
> It seemed like there really needed to be regression test coverage for
> that, so I tried to convert his crasher example into an isolation test.
> It does indeed crash without Amit's fix ... but with it, lookee what
> I get:
> 
> +error in steps c1 complexpartupdate: ERROR:  unexpected table_lock_tuple status: 1
> 
> That seems fully reproducible in this test.  I haven't looked into
> exactly what's causing that, but now that we have a reproducible
> example, somebody should.
> 
> I'm not quite sure if I should commit this as-is or wait till the
> other problem is fixed.  A crash is probably worse than a bogus
> error, but I don't like committing obviously-wrong "expected" output.
> Thoughts?

Let me have a look at the testcase - I'd been running Roman's testcase
for quite a few hours without being able to reproduce. But your testcase
seems to trigger this reliably, so I hope I can make some quick
progress.

- Andres

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

06 апреля 2019 г., 20:09:15

On 2019-04-06 23:28, Andres Freund wrote:
> Hi,
> 
> Let me have a look at the testcase - I'd been running Roman's testcase
> for quite a few hours without being able to reproduce. But your 
> testcase
> seems to trigger this reliably, so I hope I can make some quick
> progress.
> 
> - Andres

Hello,
I try to find the bad commit using bisect. But it takes very long time.
The error reproduces with the default config using 24 clients ( server 
has 24 CPUs )
pgbench test -j 12 -T 36000 -f ycsb_read_zipf.sql -f 
ycsb_update_zipf.sql -c 24 -P 60
It does not reproduce when updating the only one record.

CPU configuration:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
Stepping:              2
CPU MHz:               2464.940
CPU max MHz:           3066.0000
CPU min MHz:           1600.0000
BogoMIPS:              6133.57
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23


-- 
regards,
Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

06 апреля 2019 г., 20:10:25

Hi,

On 2019-04-06 09:28:46 -0700, Andres Freund wrote:
> On 2019-04-06 12:23:06 -0400, Tom Lane wrote:
> > It seems that there may be some connection between this problem and
> > EPQ.  I was working on committing Amit's fix for bug #15677, which
> > demonstrated that EPQ doesn't work for partitioned-table target rels.
> > It seemed like there really needed to be regression test coverage for
> > that, so I tried to convert his crasher example into an isolation test.
> > It does indeed crash without Amit's fix ... but with it, lookee what
> > I get:
> > 
> > +error in steps c1 complexpartupdate: ERROR:  unexpected table_lock_tuple status: 1
> > 
> > That seems fully reproducible in this test.  I haven't looked into
> > exactly what's causing that, but now that we have a reproducible
> > example, somebody should.
> > 
> > I'm not quite sure if I should commit this as-is or wait till the
> > other problem is fixed.  A crash is probably worse than a bogus
> > error, but I don't like committing obviously-wrong "expected" output.
> > Thoughts?
> 
> Let me have a look at the testcase - I'd been running Roman's testcase
> for quite a few hours without being able to reproduce. But your testcase
> seems to trigger this reliably, so I hope I can make some quick
> progress.

Hm. I see what's wrong here - the new code assumed that we couldn't get
a SelfModified because the first version of the to-be-(deleted|updated)
tuple was visible. To properly discern that from the TM_Deleted case,
I'd to change/fix heapam_lock_tuple's follow-the-update chain to return
SelfModified, rather than Invisible in this case (I don't think we want
to allow invisible - we'd have to have waited for the earlier tuple
version) - which is a more accurate return code anyway.

I'm still not understanding how that'd be possible in Roman's
case. Given the workload there never should be any self updating going
on?

Heavily-WIP patch attached.

I noticed that we say
+                                ereport(ERROR,
+                                        (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+                                         errmsg("tuple to be updated was already modified by an operation triggered by
thecurrent command"),

in the ExecDelete() case (that's not new). Which seems odd.

I think my fix would need a non-partition reproducer. I'll work on that
and polishing it after having a coffee.

Greetings,

Andres Freund

Вложения

fix-repeated-self-mod-chain.diff

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

06 апреля 2019 г., 20:17:05

Hi,

On 2019-04-07 00:09:15 +0700, r.zharkov@postgrespro.ru wrote:
> On 2019-04-06 23:28, Andres Freund wrote:
> > Hi,
> > 
> > Let me have a look at the testcase - I'd been running Roman's testcase
> > for quite a few hours without being able to reproduce. But your testcase
> > seems to trigger this reliably, so I hope I can make some quick
> > progress.
> > 
> > - Andres
> 
> Hello,
> I try to find the bad commit using bisect. But it takes very long
> time.

I'd be very surprised if it weren't

commit 5db6df0c0117ff2a4e0cd87594d2db408cd5022f
Author: Andres Freund <andres@anarazel.de>
Date:   2019-03-23 19:55:57 -0700

    tableam: Add tuple_{insert, delete, update, lock} and use.

I just sent a fix for the issue Tom just reported, but I don't quite see
how it applies to your case, given that there is - as far as I
understand - only a single statement per transaction, no triggers
including foreign keys, no CTEs etc.  But it'd sure be interesting if my
fix changes his error into trigering on TM_SelfModified rather than
TM_Invisible.

I'm kinda wondering if your / Roman's case is exposing a race condition
somewhere (like wrong order of clog / procarray checks or such) that
previously wasn't user visible.

I think we probably should expand the error messages for the unexpected
cases to include the tid of the failed tuple (both original and
followed) - then we could at least look through the heap and WAL to get
more understanding.

> The error reproduces with the default config using 24 clients ( server has
> 24 CPUs )
> pgbench test -j 12 -T 36000 -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql -c
> 24 -P 60
> It does not reproduce when updating the only one record.

I ran it for like 9 hours over night, without triggering the error. On a
computer with fewer CPUs though.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

06 апреля 2019 г., 20:27:08

Hi,

On 2019-04-06 10:10:25 -0700, Andres Freund wrote:
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index add0d65f816..4bca240f367 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -464,7 +464,7 @@ tuple_lock_retry:
>                          HeapTupleHeaderGetCmin(tuple->t_data) >= cid)
>                      {
>                          ReleaseBuffer(buffer);
> -                        return TM_Invisible;
> +                        return TM_SelfModified;
>                      }

This needs to set
tmfd->cmax = HeapTupleHeaderGetCmax(tuple->t_data);
so we don't throw
                if (tmfd.cmax != estate->es_output_cid)
                    ereport(ERROR,
                            (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
                             errmsg("tuple to be updated was already modified by an operation triggered by the current
command"),
                             errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes
toother rows.")));
 

in the testcase at hand.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

07 апреля 2019 г., 02:10:37

Hi,

On 2019-04-06 10:10:25 -0700, Andres Freund wrote:
> I noticed that we say
> +                                ereport(ERROR,
> +                                        (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
> +                                         errmsg("tuple to be updated was already modified by an operation triggered
bythe current command"),
 
> 
> in the ExecDelete() case (that's not new). Which seems odd.

My inclination is to fix this in master, but not backpatch. I did in the
current version of the patch, because it made it easier to verify my
tests actually work.

One bigger question, around precisely that error, I have is that fixing
this bug led me to add tests for the codepath. What I noticed is that
the current version of the patch detects the above error even if we
first have to follow the update chain, whereas previously that case was
just blindly ignored. Which seems hard to defend to me?

So with the new code we're doing:

starting permutation: wx1 updwctefail c1 c2 read
step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
balance        

400            
step updwctefail: WITH doup AS (UPDATE accounts SET balance = balance + 1100 WHERE accountid = 'checking' RETURNING *,
update_checking(999))UPDATE accounts a 
 
step c1: COMMIT;
step updwctefail: <... completed>
accountid      balance        accountid      balance        update_checking

savings        1600           checking       1500           t              
step c2: COMMIT;
step read: SELECT * FROM accounts ORDER BY accountid;
accountid      balance        

checking       1501           
savings        1600           

but previously we got:

starting permutation: wx1 updwctefail c1 c2 read
step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
balance        

400            
step updwctefail: WITH doup AS (UPDATE accounts SET balance = balance + 1100 WHERE accountid = 'checking' RETURNING *,
update_checking(999))UPDATE accounts a 
 
step c1: COMMIT;
step updwctefail: <... completed>
accountid      balance        accountid      balance        update_checking

savings        1600           checking       1500           t              
step c2: COMMIT;
step read: SELECT * FROM accounts ORDER BY accountid;
accountid      balance        

checking       1501           
savings        1600           

In other words, the previous error check put there by
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6868ed7491b7ea7f0af6133bb66566a2f5fe5a75
commit 6868ed7491b7ea7f0af6133bb66566a2f5fe5a75
Author: Kevin Grittner <kgrittn@postgresql.org>
Date:   2012-10-26 14:55:36 -0500

    Throw error if expiring tuple is again updated or deleted.

weren't that thorough. As long as the "original" row to be updated is
updated by another transaction (i.e. we start read committed ctid
chasing), the old EPQ fetch logic just didn't perform the check
described in the commit + comment:

                /*
                 * The target tuple was already updated or deleted by the
                 * current command, or by a later command in the current
                 * transaction.  The former case is possible in a join DELETE
                 * where multiple tuples join to the same target tuple. This
                 * is somewhat questionable, but Postgres has always allowed
                 * it: we just ignore additional deletion attempts.
                 *
                 * The latter case arises if the tuple is modified by a
                 * command in a BEFORE trigger, or perhaps by a command in a
                 * volatile function used in the query.  In such situations we
                 * should not ignore the deletion, but it is equally unsafe to
                 * proceed.  We don't want to discard the original DELETE
                 * while keeping the triggered actions based on its deletion;
                 * and it would be no better to allow the original DELETE
                 * while discarding updates that it triggered.  The row update
                 * carries some information that might be important according
                 * to business rules; so throwing an error is the only safe
                 * course.
                 *
                 * If a trigger actually intends this type of interaction, it
                 * can re-execute the DELETE and then return NULL to cancel
                 * the outer delete.
                 */

It seems mighty finnicky to fix this in < v12 (as the error would need
to happen in the guts of EvalPlanQualFetch() rather than in
ExecUpdate/Delete) - so I'm inclined to just fix it in master.

Thoughts?

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

07 апреля 2019 г., 23:33:15

Hi,

On 2019-04-06 16:10:37 -0700, Andres Freund wrote:
> On 2019-04-06 10:10:25 -0700, Andres Freund wrote:
> > I noticed that we say
> > +                                ereport(ERROR,
> > +                                        (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
> > +                                         errmsg("tuple to be updated was already modified by an operation
triggeredby the current command"),
 
> > 
> > in the ExecDelete() case (that's not new). Which seems odd.
> 
> My inclination is to fix this in master, but not backpatch. I did in the
> current version of the patch, because it made it easier to verify my
> tests actually work.

Attached patch does so.


> One bigger question, around precisely that error, I have is that fixing
> this bug led me to add tests for the codepath. What I noticed is that
> the current version of the patch detects the above error even if we
> first have to follow the update chain, whereas previously that case was
> just blindly ignored. Which seems hard to defend to me?
>
> [ long explanation ]
> It seems mighty finnicky to fix this in < v12 (as the error would need
> to happen in the guts of EvalPlanQualFetch() rather than in
> ExecUpdate/Delete) - so I'm inclined to just fix it in master.

Attached patch now errors out.


Patch with tests attached.  Tom, any chance you could comment on whether
you think it's sensible to make these behaviour changes in master, but
not backpatch?

Greetings,

Andres Freund

Вложения

0001-Fix-a-number-of-issues-around-modifying-a-previously.patch

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Amit Langote

Дата:

08 апреля 2019 г., 10:21:47

On 2019/04/07 1:23, Tom Lane wrote:
> It seems that there may be some connection between this problem and
> EPQ.  I was working on committing Amit's fix for bug #15677, which
> demonstrated that EPQ doesn't work for partitioned-table target rels.
> It seemed like there really needed to be regression test coverage for
> that, so I tried to convert his crasher example into an isolation test.
> It does indeed crash without Amit's fix ... but with it, lookee what
> I get:
> 
> +error in steps c1 complexpartupdate: ERROR:  unexpected table_lock_tuple status: 1
> 
> That seems fully reproducible in this test.  I haven't looked into
> exactly what's causing that, but now that we have a reproducible
> example, somebody should.
> 
> I'm not quite sure if I should commit this as-is or wait till the
> other problem is fixed.  A crash is probably worse than a bogus
> error, but I don't like committing obviously-wrong "expected" output.
> Thoughts?

Thank you Tom for picking it up and adding the test which I should've done
to begin with.

Now that Andres has taken care of the other issues [1], maybe this one's
good to go?  The isolation test part needed to be rebased over Andres'
commit, which I've done in the attached updated patch.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=41f5e04ae

Вложения

bug-15677-fix-with-test_v2.patch

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Amit Langote

Дата:

08 апреля 2019 г., 11:18:48

On 2019/04/08 16:21, Amit Langote wrote:
> Now that Andres has taken care of the other issues [1], maybe this one's
> good to go?  The isolation test part needed to be rebased over Andres'
> commit, which I've done in the attached updated patch.

The patch I attached in the previous email doesn't apply as-is to
back-branches due to rebasing.  I've attached another patch here, which
applies to both PG 11 and 10 branches.

Thanks,
Amit

Вложения

pg11-10-bug-15677-fix-with-test_v2.patch

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Tom Lane

Дата:

08 апреля 2019 г., 19:31:37

Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/04/08 16:21, Amit Langote wrote:
>> Now that Andres has taken care of the other issues [1], maybe this one's
>> good to go?  The isolation test part needed to be rebased over Andres'
>> commit, which I've done in the attached updated patch.

> The patch I attached in the previous email doesn't apply as-is to
> back-branches due to rebasing.  I've attached another patch here, which
> applies to both PG 11 and 10 branches.

Agreed we can push this now, and done.

It struck me just as I was pushing it that this test doesn't exercise
EPQ with any of the interesting cases for partition routing (ie where
the update causes a move to a different partition).  It would likely
be a good idea to have test coverage for all of these scenarios:

* EPQ where the initial update would involve a partition change,
and that's still true after reapplying the update to the
concurrently-updated tuple version;

* EPQ where the initial update would *not* require a partition change,
but we need one after reapplying the update to the
concurrently-updated tuple version;

* EPQ where the initial update would involve a partition change,
but that's no longer true after reapplying the update to the
concurrently-updated tuple version.

You could probably build cases exercising the latter two scenarios
by doing updates in which the partition key column is set from
some other column that's modified by the concurrent update.

BTW, what happens if the concurrent update caused a partition change?
I imagine we would think the original tuple is now dead, since there's
no way to chase up to the replacement tuple in the other partition,
and so we'd abandon our update.  Is this documented?

None of this is related to bug #15727, though, so I suggest
starting a new thread with a proposed test patch.

            regards, tom lane

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Andres Freund

Дата:

08 апреля 2019 г., 19:41:38

Hi,

On 2019-04-08 12:31:37 -0400, Tom Lane wrote:
> BTW, what happens if the concurrent update caused a partition change?
> I imagine we would think the original tuple is now dead, since there's
> no way to chase up to the replacement tuple in the other partition,
> and so we'd abandon our update.  Is this documented?

FWIW, you should get an error:
                if (ItemPointerIndicatesMovedPartitions(tid))
                    ereport(ERROR,
                            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                             errmsg("tuple to be locked was already moved to another partition due to concurrent
update")));

I think there's tests for the simpler cases. Probably wouldn't hurt to
test that case as another axis for the suite of tests you suggest.

Greetings,

Andres Freund

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Amit Langote

Дата:

09 апреля 2019 г., 07:55:22

On 2019/04/09 1:31, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2019/04/08 16:21, Amit Langote wrote:
>>> Now that Andres has taken care of the other issues [1], maybe this one's
>>> good to go?  The isolation test part needed to be rebased over Andres'
>>> commit, which I've done in the attached updated patch.
> 
>> The patch I attached in the previous email doesn't apply as-is to
>> back-branches due to rebasing.  I've attached another patch here, which
>> applies to both PG 11 and 10 branches.
> 
> Agreed we can push this now, and done.

Thank you.

> It struck me just as I was pushing it that this test doesn't exercise
> EPQ with any of the interesting cases for partition routing (ie where
> the update causes a move to a different partition).  It would likely
> be a good idea to have test coverage for all of these scenarios:
> 
> * EPQ where the initial update would involve a partition change,
> and that's still true after reapplying the update to the
> concurrently-updated tuple version;
> 
> * EPQ where the initial update would *not* require a partition change,
> but we need one after reapplying the update to the
> concurrently-updated tuple version;
> 
> * EPQ where the initial update would involve a partition change,
> but that's no longer true after reapplying the update to the
> concurrently-updated tuple version.
> 
> You could probably build cases exercising the latter two scenarios
> by doing updates in which the partition key column is set from
> some other column that's modified by the concurrent update.
> 
> BTW, what happens if the concurrent update caused a partition change?
> I imagine we would think the original tuple is now dead, since there's
> no way to chase up to the replacement tuple in the other partition,
> and so we'd abandon our update.  Is this documented?

This seems to be documented in the Notes section of UPDATE's page [1].

> None of this is related to bug #15727, though, so I suggest
> starting a new thread with a proposed test patch.

OK, I will post something to -hackers, also considering the tip from Andres.

Thanks,
Amit

[1] https://www.postgresql.org/docs/devel/sql-update.html

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

11 апреля 2019 г., 06:29:03

On 2019-04-07 00:17, Andres Freund wrote:

> I'd be very surprised if it weren't
> 
> commit 5db6df0c0117ff2a4e0cd87594d2db408cd5022f
> Author: Andres Freund <andres@anarazel.de>
> Date:   2019-03-23 19:55:57 -0700
> 
>     tableam: Add tuple_{insert, delete, update, lock} and use.

> Greetings,
> 
> Andres Freund

Hello,
I finished bisect and recheck the result.

ad308058cc8666c50b43179e64d6bb7aeb3ba169 is the first bad commit
commit ad308058cc8666c50b43179e64d6bb7aeb3ba169
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Thu Mar 28 10:59:19 2019 +1300

     Use FullTransactionId for the transaction stack.

     Provide GetTopFullTransactionId() and GetCurrentFullTransactionId().
     The intended users of these interfaces are access methods that use
     xids for visibility checks but don't want to have to go back and
     "freeze" existing references some time later before the 32 bit xid
     counter wraps around.

     Use a new struct to serialize the transaction state for parallel
     query, because FullTransactionId doesn't fit into the previous
     serialization scheme very well.

     Author: Thomas Munro
     Reviewed-by: Heikki Linnakangas
     Discussion: 
https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com

:040000 040000 3080a54b6840daf3e08ada1edfb8d5d61c1099c6 
4663f00a917eb526d27da29d1c059e20c58d4b34 M      src


I reproduced the error on two servers.
First server, commit 16954e2:
PostgreSQL 12devel-master/16954e2 on x86_64-pc-linux-gnu, compiled by 
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit

[New LWP 8346]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test [local] UPDATE             
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fd37aca9428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fd37aca9428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fd37acab02a in __GI_abort () at abort.c:89
#2  0x0000000000aa52a7 in ExceptionalCondition (conditionName=0xb47688 
"!((((tuple->t_data->t_infomask) & 0x0080) || 
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 
0x0010))) == 0x0040)))",
     errorType=0xb432e9 "FailedAssertion", fileName=0xb43368 "heapam.c", 
lineNumber=4289) at assert.c:54
#3  0x00000000004ea31d in heap_lock_tuple (relation=0x7fd37c3260b0, 
tuple=0x2ec61a0, cid=0, mode=LockTupleNoKeyExclusive, 
wait_policy=LockWaitBlock, follow_updates=false, buffer=0x7ffd787bf774, 
tmfd=0x7ffd787bf940)
     at heapam.c:4289
#4  0x00000000004f5c3e in heapam_tuple_lock (relation=0x7fd37c3260b0, 
tid=0x7ffd787bf9b0, snapshot=0x2e731b0, slot=0x2ec6150, cid=0, 
mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002',
     tmfd=0x7ffd787bf940) at heapam_handler.c:361
#5  0x0000000000745726 in table_lock_tuple (rel=0x7fd37c3260b0, 
tid=0x7ffd787bf9b0, snapshot=0x2e731b0, slot=0x2ec6150, cid=0, 
mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock, flags=2 '\002',
     tmfd=0x7ffd787bf940) at ../../../src/include/access/tableam.h:1228
#6  0x000000000074794d in ExecUpdate (mtstate=0x2eb6a50, 
tupleid=0x7ffd787bf9b0, oldtuple=0x0, slot=0x2eb8498, 
planSlot=0x2eb8320, epqstate=0x2eb6b48, estate=0x2eb66a8, 
canSetTag=true) at nodeModifyTable.c:1369
#7  0x0000000000748f39 in ExecModifyTable (pstate=0x2eb6a50) at 
nodeModifyTable.c:2216
#8  0x0000000000717c7d in ExecProcNodeFirst (node=0x2eb6a50) at 
execProcnode.c:445
#9  0x000000000070c9e6 in ExecProcNode (node=0x2eb6a50) at 
../../../src/include/executor/executor.h:239
#10 0x000000000070f265 in ExecutePlan (estate=0x2eb66a8, 
planstate=0x2eb6a50, use_parallel_mode=false, operation=CMD_UPDATE, 
sendTuples=false, numberTuples=0, direction=ForwardScanDirection, 
dest=0x2eb97d8,
     execute_once=true) at execMain.c:1648
#11 0x000000000070d023 in standard_ExecutorRun (queryDesc=0x2e16688, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:365
#12 0x000000000070ce5f in ExecutorRun (queryDesc=0x2e16688, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:309
#13 0x000000000091d5e0 in ProcessQuery (plan=0x2eaa3f0, 
sourceText=0x2dee1c8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid 
= 1;", params=0x0, queryEnv=0x0, dest=0x2eb97d8, 
completionTag=0x7ffd787bfe30 "")
     at pquery.c:161
#14 0x000000000091f02c in PortalRunMulti (portal=0x2e557c8, 
isTopLevel=true, setHoldSnapshot=false, dest=0x2eb97d8, 
altdest=0x2eb97d8, completionTag=0x7ffd787bfe30 "") at pquery.c:1283
#15 0x000000000091e578 in PortalRun (portal=0x2e557c8, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x2eb97d8, altdest=0x2eb97d8, completionTag=0x7ffd787bfe30 "") at 
pquery.c:796
#16 0x0000000000917ee9 in exec_simple_query (query_string=0x2dee1c8 
"UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;") at 
postgres.c:1215
#17 0x000000000091c58f in PostgresMain (argc=1, argv=0x2e198c0, 
dbname=0x2e197a0 "test", username=0x2dead88 "zharkov") at 
postgres.c:4249
#18 0x000000000086afb2 in BackendRun (port=0x2e0f170) at 
postmaster.c:4430
#19 0x000000000086a713 in BackendStartup (port=0x2e0f170) at 
postmaster.c:4121
#20 0x0000000000866952 in ServerLoop () at postmaster.c:1704
#21 0x0000000000866160 in PostmasterMain (argc=3, argv=0x2de8cf0) at 
postmaster.c:1377
#22 0x00000000007803ba in main (argc=3, argv=0x2de8cf0) at main.c:228

Second server, commit 16954e2:
PostgreSQL 12devel-master/16954e22e2 on x86_64-pc-linux-gnu, compiled by 
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

[New LWP 6145]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test [local] UPDATE             
         '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fc5cfa2dfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fc5cfa2dfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fc5cfa2f3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00005580a00b5590 in ExceptionalCondition (
     conditionName=0x5580a01550e8 "!((((tuple->t_data->t_infomask) & 
0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) 
| 0x0040 | 0x0010))) == 0x0040)))",
     errorType=0x5580a0150d49 "FailedAssertion", fileName=0x5580a0150dc8 
"heapam.c", lineNumber=4289) at assert.c:54
#3  0x000055809fb1e26c in heap_lock_tuple (relation=0x7fc5d10b6c00, 
tuple=0x5580a0bd3e20, cid=0, mode=LockTupleNoKeyExclusive, 
wait_policy=LockWaitBlock, follow_updates=false,
     buffer=0x7ffe391f876c, tmfd=0x7ffe391f8880) at heapam.c:4289
#4  0x000055809fb2980e in heapam_tuple_lock (relation=0x7fc5d10b6c00, 
tid=0x7ffe391f895a, snapshot=0x5580a0b280e0, slot=0x5580a0bd3dd0, cid=0, 
mode=LockTupleNoKeyExclusive,
     wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffe391f8880) at 
heapam_handler.c:361
#5  0x000055809fd68ba0 in table_lock_tuple (rel=0x7fc5d10b6c00, 
tid=0x7ffe391f895a, snapshot=0x5580a0b280e0, slot=0x5580a0bd3dd0, cid=0, 
mode=LockTupleNoKeyExclusive,
     wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffe391f8880) at 
../../../src/include/access/tableam.h:1228
#6  0x000055809fd6ace3 in ExecUpdate (mtstate=0x5580a0bca7d0, 
tupleid=0x7ffe391f895a, oldtuple=0x0, slot=0x5580a0bcc218, 
planSlot=0x5580a0bcc0a0, epqstate=0x5580a0bca8c8,
     estate=0x5580a0bca428, canSetTag=true) at nodeModifyTable.c:1369
#7  0x000055809fd6c2da in ExecModifyTable (pstate=0x5580a0bca7d0) at 
nodeModifyTable.c:2216
#8  0x000055809fd3b65d in ExecProcNodeFirst (node=0x5580a0bca7d0) at 
execProcnode.c:445
#9  0x000055809fd304b7 in ExecProcNode (node=0x5580a0bca7d0) at 
../../../src/include/executor/executor.h:239
#10 0x000055809fd32e7a in ExecutePlan (estate=0x5580a0bca428, 
planstate=0x5580a0bca7d0, use_parallel_mode=false, operation=CMD_UPDATE,
sendTuples=false, numberTuples=0,
     direction=ForwardScanDirection, dest=0x5580a0bcd558, 
execute_once=true) at execMain.c:1648
#11 0x000055809fd30b0d in standard_ExecutorRun 
(queryDesc=0x5580a0b2a8d8, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:365
#12 0x000055809fd30931 in ExecutorRun (queryDesc=0x5580a0b2a8d8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:309
#13 0x000055809ff3a439 in ProcessQuery (plan=0x5580a0bbe170, 
sourceText=0x5580a0b031a8 "UPDATE pgbench_accounts SET abalance = 1 
WHERE aid = 3;", params=0x0, queryEnv=0x0,
     dest=0x5580a0bcd558, completionTag=0x7ffe391f8d60 "") at 
pquery.c:161
#14 0x000055809ff3bdaf in PortalRunMulti (portal=0x5580a0b69958, 
isTopLevel=true, setHoldSnapshot=false, dest=0x5580a0bcd558, 
altdest=0x5580a0bcd558, completionTag=0x7ffe391f8d60 "")
     at pquery.c:1283
#15 0x000055809ff3b366 in PortalRun (portal=0x5580a0b69958, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x5580a0bcd558, altdest=0x5580a0bcd558,
     completionTag=0x7ffe391f8d60 "") at pquery.c:796
#16 0x000055809ff34e8d in exec_simple_query (query_string=0x5580a0b031a8 
"UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 3;") at 
postgres.c:1215
#17 0x000055809ff394ca in PostgresMain (argc=1, argv=0x5580a0b2da50, 
dbname=0x5580a0b2d930 "test", username=0x5580a0affd68 "zharkov") at 
postgres.c:4249
#18 0x000055809fe8aa98 in BackendRun (port=0x5580a0b256f0) at 
postmaster.c:4430
#19 0x000055809fe8a201 in BackendStartup (port=0x5580a0b256f0) at 
postmaster.c:4121
#20 0x000055809fe863b6 in ServerLoop () at postmaster.c:1704
#21 0x000055809fe85be4 in PostmasterMain (argc=3, argv=0x5580a0afdcd0) 
at postmaster.c:1377
#22 0x000055809fda294e in main (argc=3, argv=0x5580a0afdcd0) at 
main.c:228


On the first bad commit:
First server:
PostgreSQL 12devel-HEAD/ad30805 on x86_64-pc-linux-gnu, compiled by gcc 
(Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit

[New LWP 25344]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test [local] UPDATE             
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fc51e46e428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fc51e46e428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fc51e47002a in __GI_abort () at abort.c:89
#2  0x0000000000a98e22 in errfinish (dummy=0) at elog.c:552
#3  0x0000000000a9b3c9 in elog_finish (elevel=22, fmt=0xb66160 "cannot 
abort transaction %u, it was already committed") at elog.c:1365
#4  0x0000000000561df4 in RecordTransactionAbort (isSubXact=false) at 
xact.c:1644
#5  0x0000000000562e00 in AbortTransaction () at xact.c:2670
#6  0x0000000000563683 in AbortCurrentTransaction () at xact.c:3191
#7  0x00000000009113f6 in PostgresMain (argc=1, argv=0x240e8b0, 
dbname=0x240e790 "test", username=0x23dfd98 "zharkov") at 
postgres.c:4033
#8  0x0000000000861064 in BackendRun (port=0x2404160) at 
postmaster.c:4399
#9  0x00000000008607c5 in BackendStartup (port=0x2404160) at 
postmaster.c:4090
#10 0x000000000085cac7 in ServerLoop () at postmaster.c:1703
#11 0x000000000085c2d5 in PostmasterMain (argc=3, argv=0x23ddd00) at 
postmaster.c:1376
#12 0x0000000000777edf in main (argc=3, argv=0x23ddd00) at main.c:228

Second server:
PostgreSQL 12devel-HEAD/ad308058cc on x86_64-pc-linux-gnu, compiled by 
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

[New LWP 11211]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test [local] UPDATE             
         '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007feda4a2cfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007feda4a2cfcf in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007feda4a2e3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000055802b411af6 in errfinish (dummy=0) at elog.c:552
#3  0x000055802b414035 in elog_finish (elevel=22, fmt=0x55802b4dbb60 
"cannot abort transaction %u, it was already committed") at elog.c:1365
#4  0x000055802aefbd39 in RecordTransactionAbort (isSubXact=false) at 
xact.c:1644
#5  0x000055802aefcdbd in AbortTransaction () at xact.c:2670
#6  0x000055802aefd6c7 in AbortCurrentTransaction () at xact.c:3191
#7  0x000055802b296b90 in PostgresMain (argc=1, argv=0x55802d723a30, 
dbname=0x55802d723910 "test", username=0x55802d6f5d78 "zharkov") at 
postgres.c:4033
#8  0x000055802b1e9059 in BackendRun (port=0x55802d71a1f0) at 
postmaster.c:4399
#9  0x000055802b1e87c2 in BackendStartup (port=0x55802d71a1f0) at 
postmaster.c:4090
#10 0x000055802b1e4a44 in ServerLoop () at postmaster.c:1703
#11 0x000055802b1e4272 in PostmasterMain (argc=3, argv=0x55802d6f3ce0) 
at postmaster.c:1376
#12 0x000055802b10299f in main (argc=3, argv=0x55802d6f3ce0) at 
main.c:228


The error does not reproduce on commit 2fc7af5 on both servers.


-- 
regards,
Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Michael Paquier

Дата:

11 апреля 2019 г., 06:47:30

On Thu, Apr 11, 2019 at 10:29:03AM +0700, r.zharkov@postgrespro.ru wrote:
> The error does not reproduce on commit 2fc7af5 on both servers.

Okay.  As nothing seems to remain here, I have marked this open item
as fixed.
--
Michael

Вложения

signature.asc

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

От

Жарков Роман

Дата:

11 апреля 2019 г., 07:17:54


> 11 апр. 2019 г., в 10:47, Michael Paquier <michael@paquier.xyz> написал(а):
>
>> On Thu, Apr 11, 2019 at 10:29:03AM +0700, r.zharkov@postgrespro.ru wrote:
>> The error does not reproduce on commit 2fc7af5 on both servers.
>
> Okay.  As nothing seems to remain here, I have marked this open item
> as fixed.
> --
> Michael

Hi,
There is no fix. 2fc7af5 is the last good commit.

regards,
Roman

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Michael Paquier

Дата:

11 апреля 2019 г., 07:23:42

On Thu, Apr 11, 2019 at 11:17:54AM +0700, Жарков Роман wrote:
> There is no fix. 2fc7af5 is the last good commit.

Cool, thanks for confirming.
--
Michael

Вложения

signature.asc

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Thomas Munro

Дата:

11 апреля 2019 г., 07:44:37

On Thu, Apr 11, 2019 at 4:23 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Apr 11, 2019 at 11:17:54AM +0700, Жарков Роман wrote:
> > There is no fix. 2fc7af5 is the last good commit.
>
> Cool, thanks for confirming.

Thanks.  Looking into this.

--
Thomas Munro
https://enterprisedb.com

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

11 апреля 2019 г., 08:35:34

On 2019-04-11 11:44, Thomas Munro wrote:
>> Cool, thanks for confirming.
> 
> Thanks.  Looking into this.

Hi,
I think that the problem is in the autovacuum.

...
2019-04-11 08:04:49.593 MSK [24671] LOG:  automatic analyze of table 
"test.public.pgbench_accounts" system usage: CPU: user: 0.15 s, system: 
0.00 s, elapsed: 0.17 s
2019-04-11 08:05:29.137 MSK [13263] LOG:  checkpoint starting: wal
2019-04-11 08:05:49.352 MSK [25040] LOG:  automatic analyze of table 
"test.public.pgbench_accounts" system usage: CPU: user: 0.15 s, system: 
0.00 s, elapsed: 0.18 s
2019-04-11 08:06:49.449 MSK [25058] LOG:  automatic vacuum of table 
"test.public.pgbench_accounts": index scans: 1
         pages: 0 removed, 2145 remain, 1 skipped due to pins, 0 skipped 
frozen
         tuples: 8349 removed, 100975 remain, 976 are dead but not yet 
removable, oldest xmin: 195581926
         buffer usage: 6303 hits, 0 misses, 43 dirtied
         avg read rate: 0.000 MB/s, avg write rate: 1.353 MB/s
         system usage: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.24 s
2019-04-11 08:06:49.623 MSK [25058] LOG:  automatic analyze of table 
"test.public.pgbench_accounts" system usage: CPU: user: 0.13 s, system: 
0.01 s, elapsed: 0.17 s
2019-04-11 08:06:57.478 MSK [13263] LOG:  checkpoint complete: wrote 
2378 buffers (14.5%); 0 WAL file(s) added, 0 removed, 42 recycled; 
write=88.322 s, sync=0.000 s, total=88.340 s; sync fi
les=6, longest=0.000 s, average=0.000 s; distance=688127 kB, 
estimate=688128 kB
2019-04-11 08:07:49.451 MSK [25084] LOG:  automatic analyze of table 
"test.public.pgbench_accounts" system usage: CPU: user: 0.15 s, system: 
0.01 s, elapsed: 0.22 s
2019-04-11 08:08:28.963 MSK [13263] LOG:  checkpoint starting: wal
2019-04-11 08:08:49.500 MSK [25100] LOG:  automatic vacuum of table 
"test.public.pgbench_accounts": index scans: 1
         pages: 0 removed, 2145 remain, 1 skipped due to pins, 0 skipped 
frozen
         tuples: 8666 removed, 101047 remain, 1047 are dead but not yet 
removable, oldest xmin: 198748530
         buffer usage: 6243 hits, 0 misses, 15 dirtied
         avg read rate: 0.000 MB/s, avg write rate: 0.480 MB/s
         system usage: CPU: user: 0.17 s, system: 0.00 s, elapsed: 0.24 s
2019-04-11 08:08:49.678 MSK [25100] LOG:  automatic analyze of table 
"test.public.pgbench_accounts" system usage: CPU: user: 0.13 s, system: 
0.01 s, elapsed: 0.17 s
TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || 
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 
0x0010))) == 0x0040)))", File: "heapam.c", Line:
4289)
2019-04-11 08:09:37.452 MSK [13261] LOG:  server process (PID 13294) was 
terminated by signal 6: Aborted
2019-04-11 08:09:37.452 MSK [13261] DETAIL:  Failed process was running: 
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;

-- 
regards,

Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Thomas Munro

Дата:

11 апреля 2019 г., 11:52:17

On Thu, Apr 11, 2019 at 5:35 PM <r.zharkov@postgrespro.ru> wrote:
> TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) ||
> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
> 0x0010))) == 0x0040)))", File: "heapam.c", Line:
> 4289)

Hi Roman,

Ah, I see it.  Can you reproduce the problem with this patch applied?

While going over that commit spotted an incorrect sizeof in
EstimateTransactionStateSpace(), but I'll fix that separately.

-- 
Thomas Munro
https://enterprisedb.com

Вложения

0001-Fix-GetNewTransactionId-on-collision-with-xidVacLimi.patch

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

11 апреля 2019 г., 11:55:58

On 2019-04-11 15:52, Thomas Munro wrote:
> Ah, I see it.  Can you reproduce the problem with this patch applied?
> 
> While going over that commit spotted an incorrect sizeof in
> EstimateTransactionStateSpace(), but I'll fix that separately.

Of course, I will try it now.

-- 
regards

Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

r.zharkov@postgrespro.ru

Дата:

12 апреля 2019 г., 05:44:04

On 2019-04-11 15:52, Thomas Munro wrote:
> On Thu, Apr 11, 2019 at 5:35 PM <r.zharkov@postgrespro.ru> wrote:
>> TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) ||
>> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 
>> |
>> 0x0010))) == 0x0040)))", File: "heapam.c", Line:
>> 4289)
> 
> Hi Roman,
> 
> Ah, I see it.  Can you reproduce the problem with this patch applied?
> 
> While going over that commit spotted an incorrect sizeof in
> EstimateTransactionStateSpace(), but I'll fix that separately.

Hello,
My test finished successfully. Everything is OK.
Thanks!

-- 
regards,
Roman Zharkov

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Thomas Munro

Дата:

12 апреля 2019 г., 08:04:53

On Fri, Apr 12, 2019 at 2:44 PM <r.zharkov@postgrespro.ru> wrote:
> My test finished successfully. Everything is OK.

Pushed.  Thanks for the report and testing!

-- 
Thomas Munro
https://enterprisedb.com

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Michael Paquier

Дата:

12 апреля 2019 г., 08:19:05

On Fri, Apr 12, 2019 at 05:04:53PM +1200, Thomas Munro wrote:
> On Fri, Apr 12, 2019 at 2:44 PM <r.zharkov@postgrespro.ru> wrote:
>> My test finished successfully. Everything is OK.
>
> Pushed.  Thanks for the report and testing!

Could you mark this item in [1] as fixed please?

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Open_Issues
--
Michael

Вложения

signature.asc

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it wasalready committed

От

Thomas Munro

Дата:

12 апреля 2019 г., 08:39:21

On Fri, Apr 12, 2019 at 5:19 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Apr 12, 2019 at 05:04:53PM +1200, Thomas Munro wrote:
> > On Fri, Apr 12, 2019 at 2:44 PM <r.zharkov@postgrespro.ru> wrote:
> >> My test finished successfully. Everything is OK.
> >
> > Pushed.  Thanks for the report and testing!
>
> Could you mark this item in [1] as fixed please?

Oops, done (moved to the fixed items list).

-- 
Thomas Munro
https://enterprisedb.com

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения