Thread: bug or simply not enough stack space?
L.S.

We're working on some conversion script which in itself ran fine. We then
added a couple of additional checks, one of which was wrong and thus bailed
out, but ran into the following (from the log):

LOG:  statement: update stock_item_composition set ......
ERROR:  new row for relation "site_stock" violates check constraint
CONTEXT:  SQL statement "............"
	PL/pgSQL function "tr_site_stock_amounts_after_iud" line 74 at SQL statement
	SQL statement "............"
	PL/pgSQL function "tr_stock_item_composition_after_iud" line 202 at SQL statement
STATEMENT:  update stock_item_composition set .........
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 75610 in MyProc
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 75610 in MyProc
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 75610 in MyProc
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
PANIC:  ERRORDATA_STACK_SIZE exceeded
LOG:  server process (PID 14757) was terminated by signal 6: Aborted

The backtrace itself is not very informative; if needed, we can easily create
a new one based on breakpoint information provided:

(gdb) bt
#0  0x00007ffe7a7a7f05 in raise () from /lib64/libc.so.6
#1  0x00007ffe7a7a9960 in abort () from /lib64/libc.so.6
#2  0x0000000000675e05 in errfinish ()
#3  0x000000000067693f in elog_start ()
#4  0x000000000047518d in AbortSubTransaction ()
#5  0x0000000000475659 in AbortCurrentTransaction ()
#6  0x00000000005cf685 in PostgresMain ()
#7  0x00000000005a5155 in ServerLoop ()
#8  0x00000000005a5c9f in PostmasterMain ()
#9  0x0000000000554a70 in main ()

This is a 64-bit machine with 10GB of memory, configured with
max_stack_depth = 6MB. The queries/checks/etc. used are not recursive, so I
found the above a bit suspicious.

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> WARNING: AbortSubTransaction while in ABORT state
> WARNING: did not find subXID 75610 in MyProc
> ERROR: tupdesc reference 0x7ffe74eaf028 is not owned by resource owner
> SubTransaction

Those are bugs (although there is probably only one bug and the rest is
collateral damage).  May we have a test case?

			regards, tom lane
Hi,

> > WARNING: AbortSubTransaction while in ABORT state
> > WARNING: did not find subXID 75610 in MyProc
> > ERROR: tupdesc reference 0x7ffe74eaf028 is not owned by resource owner
> > SubTransaction
>
> Those are bugs (although there is probably only one bug and the rest is
> collateral damage).  May we have a test case?

The scripts, triggers and such are a bit complex; before assigning the
resources for that, could we help by creating a backtrace? That we could do
immediately ;)

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> Those are bugs (although there is probably only one bug and the rest is
>> collateral damage).  May we have a test case?

> Scripts, triggers and stuff are a bit complex, before assigning the resources
> for that, could we help with creating a backtrace?

You did show a backtrace --- it proves only what was already obvious,
namely that the problem is in transaction cleanup.

			regards, tom lane
On Thursday 16 July 2009, Tom Lane wrote:
> >> Those are bugs (although there is probably only one bug and the rest is
> >> collateral damage).  May we have a test case?
> >
> > Scripts, triggers and stuff are a bit complex, before assigning the
> > resources for that, could we help with creating a backtrace?
>
> You did show a backtrace --- it proves only what was already obvious,
> namely that the problem is in transaction cleanup.

Ok, working on it.

-- 
Best,
Frank.
Hi Tom,

> >> Those are bugs (although there is probably only one bug and the rest is
> >> collateral damage).  May we have a test case?
> >
> > Scripts, triggers and stuff are a bit complex, before assigning the
> > resources for that, could we help with creating a backtrace?
>
> You did show a backtrace --- it proves only what was already obvious,
> namely that the problem is in transaction cleanup.

Bummer, this is not going to work.... ;(

I've been trying to mimic the structure of the tables, the AFTER and deferred
triggers and the corresponding inserts/updates, but I'm still unsuccessful.
On our development database I confirmed that it's savepoint related, but that
was kind of obvious from the error as well.

Can you think of another way for me to get more information out of the
problem case that I _can_ reproduce? It would for example help if I could get
a clue as to which table is involved with the problematic tuples. I could
easily add extra debug/log statements at the locations these errors come
from, but would appreciate a hint as to what to add exactly ;)

-- 
Best,
Frank.
Ok,

So this took a while, but here's your test case.....
Turns out to be quite small actually ;)

create table t1 (id int);

CREATE FUNCTION tr_t1_after_iud() RETURNS trigger
    LANGUAGE 'plpgsql' VOLATILE STRICT SECURITY INVOKER
AS 'DECLARE
BEGIN
    RAISE NOTICE ''%'', ROW(NEW.*);
    SELECT 1/0;
    RETURN NEW;
END;';

CREATE TRIGGER t1_after AFTER INSERT OR UPDATE OR DELETE ON t1
    FOR EACH ROW EXECUTE PROCEDURE tr_t1_after_iud();

begin;
savepoint s1;
INSERT INTO t1 values (1);

=> this will result in the following:

db=# INSERT INTO t1 values (1);
NOTICE:  (1)
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 77063 in MyProc
ERROR:  division by zero
CONTEXT:  SQL statement "SELECT 1/0"
	PL/pgSQL function "tr_t1_after_iud" line 4 at SQL statement
ERROR:  tupdesc reference 0x7ffe74f24ad0 is not owned by resource owner SubTransaction

=> mind the fact that the savepoint is 'needed'; without it there will be no
   problem

=> in my real-life example, this resulted in a "PANIC: ERRORDATA_STACK_SIZE
   exceeded". I cannot reproduce that, but as you stated earlier, this might
   just be collateral damage, which I'll be able to easily confirm once the
   problem above has been fixed.

=> cleanup:

rollback;
drop table t1;
drop function tr_t1_after_iud();

=> to avoid possible confusion:

db=# select version();
                                        version
---------------------------------------------------------------------------------------
 PostgreSQL 8.4.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.2.4, 64-bit

Looking forward to your reply.

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> So this took a while, but here's your test case.....
> Turns out to be quite small actually ;)

Hmm ... the relevant code change seems to have been
http://archives.postgresql.org/pgsql-committers/2009-04/msg00127.php

I think I might have been overenthusiastic in trying to free resources
during a subtransaction abort.  Try this patch and see if you notice
any bad side-effects.

			regards, tom lane

Index: src/pl/plpgsql/src/pl_exec.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/pl_exec.c,v
retrieving revision 1.244
diff -c -r1.244 pl_exec.c
*** src/pl/plpgsql/src/pl_exec.c	17 Jun 2009 13:46:12 -0000	1.244
--- src/pl/plpgsql/src/pl_exec.c	17 Jul 2009 16:12:22 -0000
***************
*** 5292,5298 ****
  	{
  		SimpleEcontextStackEntry *next;
  
! 		FreeExprContext(simple_econtext_stack->stack_econtext);
  		next = simple_econtext_stack->next;
  		pfree(simple_econtext_stack);
  		simple_econtext_stack = next;
--- 5292,5299 ----
  	{
  		SimpleEcontextStackEntry *next;
  
! 		if (event == SUBXACT_EVENT_COMMIT_SUB)
! 			FreeExprContext(simple_econtext_stack->stack_econtext);
  		next = simple_econtext_stack->next;
  		pfree(simple_econtext_stack);
  		simple_econtext_stack = next;
Hi Tom,

> Hmm ... the relevant code change seems to have been
> http://archives.postgresql.org/pgsql-committers/2009-04/msg00127.php

Well, though not during beta, the field testing did pay off ;)

> I think I might have been overenthusiastic in trying to free resources
> during a subtransaction abort.  Try this patch and see if you notice
> any bad side-effects.

All examples I had that crashed and burned now work correctly and/or bail out
correctly where needed.

No side-effects noticed.

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> I think I might have been overenthusiastic in trying to free resources
>> during a subtransaction abort.  Try this patch and see if you notice
>> any bad side-effects.

> All examples I had that crashed and burned, now work correctly and/or bail out
> correctly where needed.

> No side-effects noticed.

Well, it had some :-(.  I've committed a better version of that patch,
which will be in 8.4.1.  However, if you're not seeing any serious
memory leakage then you could keep using the patch you have for the
time being.

			regards, tom lane
On Sat, Jul 18, 2009 at 2:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> >> I think I might have been overenthusiastic in trying to free resources
> >> during a subtransaction abort.  Try this patch and see if you notice
> >> any bad side-effects.
>
> > All examples I had that crashed and burned, now work correctly and/or bail out
> > correctly where needed.
>
> > No side-effects noticed.
>
> Well, it had some :-(.  I've committed a better version of that patch,
> which will be in 8.4.1.  However, if you're not seeing any serious
> memory leakage then you could keep using the patch you have for the
> time being.

Reviving this ancient thread.  I saw "did not find subXID" errors, in
9.6.12.  Here is what happened.

2019-10-03 19:54:01 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:01 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:54:15 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:15 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: LOG:  could not send data to client: Broken pipe
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:53:31 CDT [: @] []: LOG:  checkpoints are occurring too frequently (16 seconds apart)
2019-10-03 19:53:31 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:53:31 CDT [: @] []: LOG:  checkpoints are occurring too frequently (16 seconds apart)
2019-10-03 19:53:31 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:53:47 CDT [: @] []: LOG:  checkpoints are occurring too frequently (16 seconds apart)
2019-10-03 19:53:47 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:54:01 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:01 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:54:15 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:15 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: LOG:  could not send data to client: Broken pipe
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  you don't own a lock of type ExclusiveLock
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  ReleaseLockIfHeld: failed??
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: PANIC:  ERRORDATA_STACK_SIZE exceeded
2019-10-03 19:58:37 CDT [: @] []: LOG:  server process (PID 10715) was terminated by signal 6: Aborted
2019-10-03 19:58:37 CDT [: @] []: LOG:  terminating any other active server processes

A couple of interesting things here:

*) Logs are strongly out of order
*) "loadhistorydatafromysm_testing2()" is using pl/sh, which is a known source
   of weird (but rare) instability issues (I'm assuming this is the underlying
   cause of the issue)
*) everything restarted and is working fine
*) the chance of providing a reproducible case is zero

Bringing this up due to this thread:
https://github.com/laurenz/oracle_fdw/issues/328

I can't help but wonder if we have some kind of obscure issue that is related
to C extension problems; just throwing a data point on the table.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> Reviving this ancient thread.  I saw "did not find subXID" errors, in
> 9.6.12.  Here is what happened.

> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function
> loadhistorydatafromysm_testing2() line 99 during exception cleanup
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: LOG:  could not send data to client: Broken
> pipe
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function
> loadhistorydatafromysm_testing2() line 99 during exception cleanup
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: STATEMENT:  select
> LoadHistoryDataFromYSM_testing2();
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function
> loadhistorydatafromysm_testing2() line 99 during exception cleanup
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: STATEMENT:  select
> LoadHistoryDataFromYSM_testing2();

[ and then we get into recursive error-during-error-cleanup failures ]

Yeah, something has left stuff in a bad state here.

> *) "loadhistorydatafromysm_testing2()" is using pl/sh, which is a
> known source of weird (but rare) instability issues (I'm assuming this
> is underlying cause of issue)

Hm.  Yeah, I'd be way more interested if this could be reproduced
without pl/sh.

> I can't help but wonder if we have some kind of obscure issue that is
> related to C extension problems; just throwing a data point on the
> table.

Well, there's nothing too obscure about the rule that error cleanup
needs to avoid doing anything that might cause another error, for fear
of causing infinite recursion.
I suspect that the underlying issue is that pl/sh is violating that rule
somewhere.  The other thread you point to suggests that maybe oracle_fdw
also used to do that, and fixed it.

			regards, tom lane