Thread: bug or simply not enough stack space?
L.S.

We're working on some conversion script which in itself ran fine. We then
added a couple of additional checks, one of which was wrong and thus bailed
out, but ran into the following (from the log):

LOG:  statement: update stock_item_composition set ......
ERROR:  new row for relation "site_stock" violates check constraint
CONTEXT:  SQL statement "............"
	PL/pgSQL function "tr_site_stock_amounts_after_iud" line 74 at SQL statement
	SQL statement "............"
	PL/pgSQL function "tr_stock_item_composition_after_iud" line 202 at SQL statement
STATEMENT:  update stock_item_composition set .........
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 75610 in MyProc
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 75610 in MyProc
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 75610 in MyProc
ERROR:  tupdesc reference 0x7ffe74eaf028 is not owned by resource owner SubTransaction
PANIC:  ERRORDATA_STACK_SIZE exceeded
LOG:  server process (PID 14757) was terminated by signal 6: Aborted

The backtrace itself is not very informative; if needed, we can easily create
a new one based on breakpoint information provided:

(gdb) bt
#0  0x00007ffe7a7a7f05 in raise () from /lib64/libc.so.6
#1  0x00007ffe7a7a9960 in abort () from /lib64/libc.so.6
#2  0x0000000000675e05 in errfinish ()
#3  0x000000000067693f in elog_start ()
#4  0x000000000047518d in AbortSubTransaction ()
#5  0x0000000000475659 in AbortCurrentTransaction ()
#6  0x00000000005cf685 in PostgresMain ()
#7  0x00000000005a5155 in ServerLoop ()
#8  0x00000000005a5c9f in PostmasterMain ()
#9  0x0000000000554a70 in main ()

This is a 64-bit machine with 10GB of memory, configured with
max_stack_depth = 6MB. The queries/checks/etc. used are not recursive, so I
found the above a bit suspicious.

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> WARNING: AbortSubTransaction while in ABORT state
> WARNING: did not find subXID 75610 in MyProc
> ERROR: tupdesc reference 0x7ffe74eaf028 is not owned by resource owner
> SubTransaction

Those are bugs (although there is probably only one bug and the rest is
collateral damage).  May we have a test case?

			regards, tom lane
Hi,

> > WARNING: AbortSubTransaction while in ABORT state
> > WARNING: did not find subXID 75610 in MyProc
> > ERROR: tupdesc reference 0x7ffe74eaf028 is not owned by resource owner
> > SubTransaction
>
> Those are bugs (although there is probably only one bug and the rest is
> collateral damage).  May we have a test case?

The scripts, triggers and such are a bit complex; before assigning the
resources for that, could we help by creating a backtrace? That we could do
immediately ;)

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> Those are bugs (although there is probably only one bug and the rest is
>> collateral damage).  May we have a test case?

> Scripts, triggers and stuff are a bit complex, before assigning the resources
> for that, could we help with creating a backtrace?

You did show a backtrace --- it proves only what was already obvious,
namely that the problem is in transaction cleanup.

			regards, tom lane
On Thursday 16 July 2009, Tom Lane wrote:
> >> Those are bugs (although there is probably only one bug and the rest is
> >> collateral damage).  May we have a test case?
> >
> > Scripts, triggers and stuff are a bit complex, before assigning the
> > resources for that, could we help with creating a backtrace?
>
> You did show a backtrace --- it proves only what was already obvious,
> namely that the problem is in transaction cleanup.

Ok, working on it.

-- 
Best,
Frank.
Hi Tom,

> >> Those are bugs (although there is probably only one bug and the rest is
> >> collateral damage).  May we have a test case?
> >
> > Scripts, triggers and stuff are a bit complex, before assigning the
> > resources for that, could we help with creating a backtrace?
>
> You did show a backtrace --- it proves only what was already obvious,
> namely that the problem is in transaction cleanup.

Bummer, this is not going to work.... ;(

I've been trying to mimic the structure of the tables, the AFTER and deferred
triggers and the corresponding inserts/updates, but I'm still unsuccessful.
On our development database I confirmed that it's savepoint related, but that
was kind of obvious from the error as well.

Can you think of another way for me to get more information out of the
problem case that I _can_ reproduce? It would for example help if I could get
a clue as to which table is involved with the problematic tuples. I could
easily add extra debug/log statements at the locations these errors come
from, but would appreciate a hint as to what to add exactly ;)

-- 
Best,
Frank.
Ok,

So this took a while, but here's your test case.....
Turns out to be quite small actually ;)

create table t1 (id int);

CREATE FUNCTION tr_t1_after_iud() RETURNS trigger
    LANGUAGE 'plpgsql' VOLATILE STRICT SECURITY INVOKER
AS 'DECLARE
BEGIN
    RAISE NOTICE ''%'', ROW(NEW.*);
    SELECT 1/0;
    RETURN NEW;
END;';

CREATE TRIGGER t1_after AFTER INSERT OR UPDATE OR DELETE ON t1
    FOR EACH ROW EXECUTE PROCEDURE tr_t1_after_iud();

begin;
savepoint s1;
INSERT INTO t1 values (1);

=> this will result in the following:

db=# INSERT INTO t1 values (1);
NOTICE:  (1)
WARNING:  AbortSubTransaction while in ABORT state
WARNING:  did not find subXID 77063 in MyProc
ERROR:  division by zero
CONTEXT:  SQL statement "SELECT 1/0"
	PL/pgSQL function "tr_t1_after_iud" line 4 at SQL statement
ERROR:  tupdesc reference 0x7ffe74f24ad0 is not owned by resource owner SubTransaction

=> mind the fact that the savepoint is 'needed'; without it there will be no
   problem

=> in my real-life example, this resulted in a "PANIC: ERRORDATA_STACK_SIZE
   exceeded". I cannot reproduce that, but as you stated earlier, this might
   just be collateral damage, which I'll be able to easily confirm once the
   problem above has been fixed.

=> cleanup:

rollback;
drop table t1;
drop function tr_t1_after_iud();

=> to avoid possible confusion:

db=# select version();
                                        version
---------------------------------------------------------------------------------------
 PostgreSQL 8.4.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.2.4, 64-bit

Looking forward to your reply.

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> So this took a while, but here's your test case.....
> Turns out to be quite small actually ;)

Hmm ... the relevant code change seems to have been
http://archives.postgresql.org/pgsql-committers/2009-04/msg00127.php

I think I might have been overenthusiastic in trying to free resources
during a subtransaction abort.  Try this patch and see if you notice
any bad side-effects.

			regards, tom lane

Index: src/pl/plpgsql/src/pl_exec.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/pl_exec.c,v
retrieving revision 1.244
diff -c -r1.244 pl_exec.c
*** src/pl/plpgsql/src/pl_exec.c	17 Jun 2009 13:46:12 -0000	1.244
--- src/pl/plpgsql/src/pl_exec.c	17 Jul 2009 16:12:22 -0000
***************
*** 5292,5298 ****
  	{
  		SimpleEcontextStackEntry *next;
  
! 		FreeExprContext(simple_econtext_stack->stack_econtext);
  		next = simple_econtext_stack->next;
  		pfree(simple_econtext_stack);
  		simple_econtext_stack = next;
--- 5292,5299 ----
  	{
  		SimpleEcontextStackEntry *next;
  
! 		if (event == SUBXACT_EVENT_COMMIT_SUB)
! 			FreeExprContext(simple_econtext_stack->stack_econtext);
  		next = simple_econtext_stack->next;
  		pfree(simple_econtext_stack);
  		simple_econtext_stack = next;
Hi Tom,

> Hmm ... the relevant code change seems to have been
> http://archives.postgresql.org/pgsql-committers/2009-04/msg00127.php

Well, though not during beta, the field testing did pay off ;)

> I think I might have been overenthusiastic in trying to free resources
> during a subtransaction abort.  Try this patch and see if you notice
> any bad side-effects.

All examples I had that crashed and burned now work correctly and/or bail out
correctly where needed.

No side-effects noticed.

-- 
Best,
Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> I think I might have been overenthusiastic in trying to free resources
>> during a subtransaction abort.  Try this patch and see if you notice
>> any bad side-effects.

> All examples I had that crashed and burned, now work correctly and/or bail out
> correctly where needed.

> No side-effects noticed.

Well, it had some :-(.  I've committed a better version of that patch,
which will be in 8.4.1.  However, if you're not seeing any serious
memory leakage then you could keep using the patch you have for the
time being.

			regards, tom lane
On Sat, Jul 18, 2009 at 2:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> >> I think I might have been overenthusiastic in trying to free resources
> >> during a subtransaction abort.  Try this patch and see if you notice
> >> any bad side-effects.
>
> > All examples I had that crashed and burned, now work correctly and/or bail out
> > correctly where needed.
>
> > No side-effects noticed.
>
> Well, it had some :-(.  I've committed a better version of that patch,
> which will be in 8.4.1.  However, if you're not seeing any serious
> memory leakage then you could keep using the patch you have for the
> time being.

Reviving this ancient thread.  I saw "did not find subXID" errors, in
9.6.12.  Here is what happened.

2019-10-03 19:54:01 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:01 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:54:15 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:15 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: LOG:  could not send data to client: Broken pipe
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:53:31 CDT [: @] []: LOG:  checkpoints are occurring too frequently (16 seconds apart)
2019-10-03 19:53:31 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:53:31 CDT [: @] []: LOG:  checkpoints are occurring too frequently (16 seconds apart)
2019-10-03 19:53:31 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:53:47 CDT [: @] []: LOG:  checkpoints are occurring too frequently (16 seconds apart)
2019-10-03 19:53:47 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:54:01 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:01 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:54:15 CDT [: @] []: LOG:  checkpoints are occurring too frequently (14 seconds apart)
2019-10-03 19:54:15 CDT [: @] []: HINT:  Consider increasing the configuration parameter "max_wal_size".
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: LOG:  could not send data to client: Broken pipe
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function loadhistorydatafromysm_testing2() line 99 during exception cleanup
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: STATEMENT:  select LoadHistoryDataFromYSM_testing2();
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  AbortSubTransaction while in ABORT state
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  you don't own a lock of type ExclusiveLock
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: WARNING:  ReleaseLockIfHeld: failed??
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2] [10.22.236.83(54943)]: PANIC:  ERRORDATA_STACK_SIZE exceeded
2019-10-03 19:58:37 CDT [: @] []: LOG:  server process (PID 10715) was terminated by signal 6: Aborted
2019-10-03 19:58:37 CDT [: @] []: LOG:  terminating any other active server processes

A couple of interesting things here:

*) Logs are strongly out of order
*) "loadhistorydatafromysm_testing2()" is using pl/sh, which is a known source
   of weird (but rare) instability issues (I'm assuming this is the underlying
   cause of the issue)
*) everything restarted and is working fine
*) the chance of providing a reproducible case is zero

Bringing this up due to this thread:
https://github.com/laurenz/oracle_fdw/issues/328

I can't help but wonder if we have some kind of obscure issue that is related
to C extension problems; just throwing a data point on the table.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> Reviving this ancient thread.  I saw "did not find subXID" errors, in
> 9.6.12.  Here is what happened.

> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: WARNING:  did not find subXID 384134 in MyProc
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function
> loadhistorydatafromysm_testing2() line 99 during exception cleanup
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: LOG:  could not send data to client: Broken
> pipe
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function
> loadhistorydatafromysm_testing2() line 99 during exception cleanup
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: STATEMENT:  select
> LoadHistoryDataFromYSM_testing2();
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: ERROR:  failed to re-find shared lock object
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: CONTEXT:  PL/pgSQL function
> loadhistorydatafromysm_testing2() line 99 during exception cleanup
> 2019-10-03 19:58:37 CDT [10.22.236.83: rms@cds2]
> [10.22.236.83(54943)]: STATEMENT:  select
> LoadHistoryDataFromYSM_testing2();

[ and then we get into recursive error-during-error-cleanup failures ]

Yeah, something has left stuff in a bad state here.

> *) "loadhistorydatafromysm_testing2()" is using pl/sh, which is a
> known source of weird (but rare) instability issues (I'm assuming this
> is underlying cause of issue)

Hm.  Yeah, I'd be way more interested if this could be reproduced
without pl/sh.

> I can't help but wonder if we have some kind of obscure issue that is
> related to C extension problems; just throwing a data point on the
> table.

Well, there's nothing too obscure about the rule that error cleanup
needs to avoid doing anything that might cause another error, for fear
of causing infinite recursion.
I suspect that the underlying issue is that pl/sh is violating that rule
somewhere.  The other thread you point to suggests that maybe oracle_fdw
also used to do that, and fixed it.

			regards, tom lane