Обсуждение: "invalid memory alloc request size " in deferred trigger causes transaction to fail, but the backend keeps running

Поиск
Список
Период
Сортировка
L.S.

Situation:

- PostgreSQL v8.0.0beta5 running on x86 hardware

- a database schema that uses before/after/deferred triggers to do additional
processing on stuff like orders and invoices based on the order- and
invoice--lines; the deferred triggers are basically of the format 'update
order set ..... from (select aggregates from order_line where order_id = ...)
where id = ...' and use an updated_on field to make sure they do the
necessary work only once

The program causing the problem is a conversion tool that fills a database
thus firing the various triggers. This process is fairly lengthy (couple of
hours) and this morning I noticed that the conversion tool bailed out due to
a failed transaction commit, the log mentions:

        PL/pgSQL function "tr_purchaseorder_def" line 42 at SQL statement
ERROR:  invalid memory alloc request size 3248857576

The statement on line 42 is the earlier mentioned update statement.

Obviously, since the backend keeps running, it won't be easy to get my hands
on the backtrace from when this error occurs and I'm wondering about the best
way to proceed in order to produce some helpfull input to the developers.

In order to find out whether this was some hardware glitch, I immediately
restarted the process and just now it ended with exactly the same error on
exactly the same spot.



Any hints on how to dig deeper ?





--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> ERROR:  invalid memory alloc request size 3248857576

> Obviously, since the backend keeps running, it won't be easy to get my hands
> on the backtrace from when this error occurs and I'm wondering about the best
> way to proceed in order to produce some helpfull input to the developers.

Provide a reproducible test case maybe?

Also, building with --enable-cassert might help track it down.

            regards, tom lane
> > I'm wondering
> > about the best way to proceed in order to produce some helpfull input to
> > the developers.
>
> Provide a reproducible test case maybe?

Sorry, won't be that easy. I didn't encounter the problem with earlier
(smaller) data-sets and even if I would put in the hours needed in order to
strip the schema and change the conversion tool both with the uncertainty
that I'd be reproducing the proper sequence of (trigger) events, I wouldn't
be able to provide the dataset itself (i.e. non-disclosure)

> Also, building with --enable-cassert might help track it down.

Will do that for starters.

Hold on, I thought that this error message would be generated from numerous
places in the code but it isn't. I'll recompile with debugging enabled and
put breakpoints on the appropriate lines to produce a backtrace.

<some work is done>

Ok, using --enable-cassert might not be such a good idea for now, since it
triggers a trap very early on in the conversion:


TRAP: FailedAssertion("!(((ntp)->t_data)->t_infomask & 0x0010)", File:
"catcache.c", Line: 1729)


So for now I have disabled assertions again and will focus on creating the
backtrace for the memory alloc problem first.

When needed / wanted I can always try and go over any assertion failures
later.


Will be able to post a backtrace in a few hours, I hope.




--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Ok, using --enable-cassert might not be such a good idea for now, since it
> triggers a trap very early on in the conversion:

> TRAP: FailedAssertion("!(((ntp)->t_data)->t_infomask & 0x0010)", File:
> "catcache.c", Line: 1729)

[ raised eyebrow... ]  That should definitely not be happening either.
In fact, I find that considerably more surprising than your original
report.  I'd suggest chasing this first.

            regards, tom lane
> [ raised eyebrow... ]  That should definitely not be happening either.
> In fact, I find that considerably more surprising than your original
> report.  I'd suggest chasing this first.

It was already well on its way, so....

Here's what's going on when the memory alloc problem occurs:


(gdb) b mcxt.c:502
Breakpoint 1 at 0x81c5667: file mcxt.c, line 502.
(gdb) b mcxt.c:523
Breakpoint 2 at 0x81c56af: file mcxt.c, line 523.
(gdb) b mcxt.c:548
Breakpoint 3 at 0x81c572f: file mcxt.c, line 548.
(gdb) b mcxt.c:612
Breakpoint 4 at 0x81c57ab: file mcxt.c, line 612.
(gdb) cont
Continuing.

Breakpoint 1, MemoryContextAlloc (context=0x86a6fa8, size=3248857576) at
mcxt.c:502
502                     elog(ERROR, "invalid memory alloc request size %lu",


(gdb) bt
#0  MemoryContextAlloc (context=0x86a6fa8, size=3248857576) at mcxt.c:502
#1  0x81cb6ad in CopySnapshot (snapshot=0x8735038) at tqual.c:1205
#2  0x8101414 in postquel_start (es=0x87b52f8, fcache=0x86aa990) at
functions.c:317
#3  0x810175f in postquel_execute (es=0x87b52f8, fcinfo=0xbfffe498,
fcache=0x86aa990) at functions.c:464
#4  0x8101998 in fmgr_sql (fcinfo=0xbfffe498) at functions.c:633
#5  0x80fd261 in ExecMakeFunctionResult (fcache=0x86aa7d8, econtext=0x86aa920,
isNull=0xbfffe5d3 "", isDone=0x0) at execQual.c:1038
#6  0x80fd9c1 in ExecEvalFunc (fcache=0x86aa7d8, econtext=0x86aa920,
isNull=0xbfffe5d3 "", isDone=0x0) at execQual.c:1455
#7  0x80feee1 in ExecEvalExprSwitchContext (expression=0x86aa7d8,
econtext=0x86aa920, isNull=0xbfffe5d3 "", isDone=0x0) at execQual.c:2708
#8  0x8134cf3 in evaluate_expr (expr=0x86f1388, result_type=23) at
clauses.c:2390
#9  0x81345c7 in evaluate_function (funcid=8615860, result_type=23,
args=0x86f1370, func_tuple=0x417d98b0, context=0xbfffe788) at clauses.c:2007
#10 0x81344cc in simplify_function (funcid=8615860, result_type=23,
args=0x86f1370, allow_inline=1, context=0xbfffe788) at clauses.c:1913
#11 0x8133c2d in eval_const_expressions_mutator (node=0x87894b0,
context=0xbfffe788) at clauses.c:1245
#12 0x81359e3 in expression_tree_mutator (node=0x87881c8, mutator=0x8133b6c
<eval_const_expressions_mutator>, context=0xbfffe788) at clauses.c:3198
#13 0x8134060 in eval_const_expressions_mutator (node=0x87881a0,
context=0xbfffe788) at clauses.c:1574
#14 0x8135959 in expression_tree_mutator (node=0x86f05c8, mutator=0x8133b6c
<eval_const_expressions_mutator>, context=0xbfffe788) at clauses.c:3163
#15 0x8134343 in eval_const_expressions_mutator (node=0x86f05c8,
context=0xbfffe788) at clauses.c:1756
#16 0x813597f in expression_tree_mutator (node=0x86f1058, mutator=0x8133b6c
<eval_const_expressions_mutator>, context=0xbfffe788) at clauses.c:3184
#17 0x8134343 in eval_const_expressions_mutator (node=0x86f1058,
context=0xbfffe788) at clauses.c:1756
#18 0x8133b49 in eval_const_expressions (node=0x86f1058) at clauses.c:1152
#19 0x812d925 in preprocess_expression (parse=0x8734248, expr=0x86f1058,
kind=1) at planner.c:415
#20 0x812d6e5 in subquery_planner (parse=0x8734248, tuple_fraction=0) at
planner.c:240
#21 0x812d5bd in planner (parse=0x8734248, isCursor=0 '\000', cursorOptions=0,
boundParams=0x0) at planner.c:129
#22 0x815bb4f in pg_plan_query (querytree=0x8734248, boundParams=0x0) at
postgres.c:647
#23 0x815bbeb in pg_plan_queries (querytrees=0x86f10c8, boundParams=0x0,
needSnapshot=0 '\000') at postgres.c:715
#24 0x810b428 in _SPI_prepare_plan (
    src=0x87c6a08 "UPDATE purchaseorder SET price_total_val = total_val,
valuta_id = total_valuta_id, status_id = CASE WHEN count_lines = 0 OR
count_pol_quotation > 0 THEN po_stat('PO_QUOTATION') WHEN
count_pol_ontransp"..., plan=0xbfffe8a4) at spi.c:1276
#25 0x810a291 in SPI_prepare (
    src=0x87c6a08 "UPDATE purchaseorder SET price_total_val = total_val,
valuta_id = total_valuta_id, status_id = CASE WHEN count_lines = 0 OR
count_pol_quotation > 0 THEN po_stat('PO_QUOTATION') WHEN
count_pol_ontransp"..., nargs=1, argtypes=0x87b3480) at spi.c:407
#26 0x41500e02 in ?? ()
#27 0x41500fce in ?? ()
#28 0x414ffc2e in ?? ()
#29 0x414ffa4b in ?? ()
#30 0x414ff98b in ?? ()
#31 0x414ff36f in ?? ()
#32 0x414fbecc in ?? ()
#33 0x80ebadc in ExecCallTriggerFunc (trigdata=0xbfffec10, finfo=0x422da3b0,
per_tuple_context=0x8769ba0) at trigger.c:1149
#34 0x80ec860 in AfterTriggerExecute (event=0x41b84b38, rel=0x418097f0,
trigdesc=0x422d8ef0, finfo=0x422da200, per_tuple_context=0x8769ba0)
    at trigger.c:1979
#35 0x80eca7f in afterTriggerInvokeEvents (events=0x87053c8, firing_id=9977,
delete_ok=1 '\001') at trigger.c:2175
#36 0x80ecc66 in AfterTriggerEndXact () at trigger.c:2376
#37 0x8090806 in CommitTransaction () at xact.c:1450
#38 0x8090c6d in CommitTransactionCommand () at xact.c:1938
#39 0x815ceba in finish_xact_command () at postgres.c:1843
#40 0x815bebf in exec_simple_query (query_string=0x82ebbd0 "COMMIT") at
postgres.c:965
#41 0x815e0db in PostgresMain (argc=4, argv=0x82a5478, username=0x82a5448
"megaconv") at postgres.c:2986
#42 0x813b85e in BackendRun (port=0x82b8538) at postmaster.c:2817
#43 0x813b121 in BackendStartup (port=0x82b8538) at postmaster.c:2453
#44 0x81399b4 in ServerLoop () at postmaster.c:1198
#45 0x81394c5 in PostmasterMain (argc=3, argv=0x82a4038) at postmaster.c:917
#46 0x8111e11 in main (argc=3, argv=0xbffff914) at main.c:268
#47 0x400d4577 in __libc_start_main () from /lib/libc.so.6




In the mean time, I have compiled postgres on another machine (different
version of slackware) with assertions on, but being a fresh install I needed
to re-run the first phase of conversion which basically is a copy tables /
create indexes loop. However, during the read of *some* of the tables I see
this :

Program received signal SIGUSR1, User defined signal 1.
0x40109ac1 in kill () from /lib/libc.so.6
(gdb) where
#0  0x40109ac1 in kill () from /lib/libc.so.6
#1  0x0818e9e1 in SendPostmasterSignal (reason=PMSIGNAL_PASSWORD_CHANGE) at
pmsignal.c:69
#2  0x081906a8 in SIInsertDataEntry (segP=0xdbd, data=0x837aa94) at
sinvaladt.c:229
#3  0x0818f7e2 in SendSharedInvalidMessage (msg=0x837aa94) at sinval.c:122
#4  0x081ffd18 in ProcessInvalidationMessages (hdr=0x837a73c, func=0x818f7c0
<SendSharedInvalidMessage>) at inval.c:344
#5  0x082001d6 in AtEOXact_Inval (isCommit=1 '\001') at inval.c:674
#6  0x080a2548 in CommitTransaction () at xact.c:1543
#7  0x080a2af5 in CommitTransactionCommand () at xact.c:1938
#8  0x0819caef in finish_xact_command () at postgres.c:1843
#9  0x0819b795 in exec_simple_query (query_string=0x834ee8c "create index
lijst03_table_idx15 on lijst03_table(fn16)") at postgres.c:965
#10 0x0819dce0 in PostgresMain (argc=4, argv=0x8308b0c, username=0x8308adc
"megaconv") at postgres.c:2986
#11 0x08171f31 in BackendRun (port=0x83177f0) at postmaster.c:2817
#12 0x08171991 in BackendStartup (port=0x83177f0) at postmaster.c:2453
#13 0x0816ff6f in ServerLoop () at postmaster.c:1198
#14 0x0816f4a3 in PostmasterMain (argc=3, argv=0x8307788) at postmaster.c:917
#15 0x0813d22d in main (argc=3, argv=0x8307788) at main.c:268
#16 0x400f5d06 in __libc_start_main () from /lib/libc.so.6

I'm able to issue 'cont' to move on, and it's not like the server load is too
high or something.


Will proceed to the first assertion TRAP mentioned earlier.





--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Here's what's going on when the memory alloc problem occurs:

AFAICS this can only be explained as a memory stomp on the static data
structures SerializableSnapshotData or LatestSnapshotData.  That seems a
tad improbable.  I wonder if what you have is a build problem.  Have you
tried a full "make distclean", configure, rebuild from scratch?

            regards, tom lane
> > Here's what's going on when the memory alloc problem occurs:

> AFAICS this can only be explained as a memory stomp on the static data
> structures SerializableSnapshotData or LatestSnapshotData.  That seems a
> tad improbable.  I wonder if what you have is a build problem.  Have you
> tried a full "make distclean", configure, rebuild from scratch?

Yep, I always do a distclean when changing configuration options, I've been
bitten before ;)

Mind you, I am using the earlier discussed patch to avoid the glibc bug :

******************************************************
$> more oracle_compat.diff
--- src/backend/utils/adt/oracle_compat.c_orig  Mon Nov 29 12:25:20 2004
+++ src/backend/utils/adt/oracle_compat.c       Mon Nov 29 12:26:49 2004
@@ -43,7 +43,7 @@
  * We assume if we have these two functions, we have their friends too, and
  * can use the wide-character method.
  */
-#if defined(HAVE_WCSTOMBS) && defined(HAVE_TOWLOWER)
+#if defined(HAVE_WCSTOMBS) && defined(HAVE_TOWLOWER) && FALSE
 #define USE_WIDE_UPPER_LOWER
 #endif
******************************************************

I could reproduce this case tomorrow and inspect some variables or call
abort() if you think that could be of any help?




As far as the other problem is concerned, I just finished the first half of
the conversion on a different machine that was build with assertions enabled
and saw a number of reports like the ones below. What's causing these?

I'll start the second half in a minute, which should reproduce the TRAP pretty
soon. Will post a backtrace of that asap as well.


Program received signal SIGUSR1, User defined signal 1.
0x40109ac1 in kill () from /lib/libc.so.6
(gdb) where
#0  0x40109ac1 in kill () from /lib/libc.so.6
#1  0x0818e9e1 in SendPostmasterSignal (reason=PMSIGNAL_PASSWORD_CHANGE) at
pmsignal.c:69
#2  0x081906a8 in SIInsertDataEntry (segP=0xdbd, data=0x83e4af8) at
sinvaladt.c:229
#3  0x0818f7e2 in SendSharedInvalidMessage (msg=0x83e4af8) at sinval.c:122
#4  0x081ffcd8 in ProcessInvalidationMessages (hdr=0x83dd6a4, func=0x818f7c0
<SendSharedInvalidMessage>) at inval.c:343
#5  0x082001d6 in AtEOXact_Inval (isCommit=1 '\001') at inval.c:674
#6  0x080a2548 in CommitTransaction () at xact.c:1543
#7  0x080a2af5 in CommitTransactionCommand () at xact.c:1938
#8  0x0819caef in finish_xact_command () at postgres.c:1843
#9  0x0819b795 in exec_simple_query (query_string=0x834ee8c "drop table
lijst45_table") at postgres.c:965
#10 0x0819dce0 in PostgresMain (argc=4, argv=0x8308b0c, username=0x8308adc
"megaconv") at postgres.c:2986
#11 0x08171f31 in BackendRun (port=0x83177f0) at postmaster.c:2817
#12 0x08171991 in BackendStartup (port=0x83177f0) at postmaster.c:2453
#13 0x0816ff6f in ServerLoop () at postmaster.c:1198
#14 0x0816f4a3 in PostmasterMain (argc=3, argv=0x8307788) at postmaster.c:917
#15 0x0813d22d in main (argc=3, argv=0x8307788) at main.c:268
#16 0x400f5d06 in __libc_start_main () from /lib/libc.so.6

Program received signal SIGUSR1, User defined signal 1.
0x40109ac1 in kill () from /lib/libc.so.6
(gdb) where
#0  0x40109ac1 in kill () from /lib/libc.so.6
#1  0x0818e9e1 in SendPostmasterSignal (reason=PMSIGNAL_PASSWORD_CHANGE) at
pmsignal.c:69
#2  0x081906a8 in SIInsertDataEntry (segP=0xdbd, data=0x8408e20) at
sinvaladt.c:229
#3  0x0818f7e2 in SendSharedInvalidMessage (msg=0x8408e20) at sinval.c:122
#4  0x081ffcd8 in ProcessInvalidationMessages (hdr=0x8407844, func=0x818f7c0
<SendSharedInvalidMessage>) at inval.c:343
#5  0x082001d6 in AtEOXact_Inval (isCommit=1 '\001') at inval.c:674
#6  0x080a2548 in CommitTransaction () at xact.c:1543
#7  0x080a2af5 in CommitTransactionCommand () at xact.c:1938
#8  0x0819caef in finish_xact_command () at postgres.c:1843
#9  0x0819b795 in exec_simple_query (
    query_string=0x834f0a4 "create table lijst55_table(invoice_id
int,serial_id int,payment_cond text,old_debtor_id text,debtor_descr
text,korting numeric,magazijncode int,pakbon_nr text,totaal_bruto
numeric,totaal_netto numeric"...) at postgres.c:965
#10 0x0819dce0 in PostgresMain (argc=4, argv=0x8308b0c, username=0x8308adc
"megaconv") at postgres.c:2986
#11 0x08171f31 in BackendRun (port=0x83177f0) at postmaster.c:2817
#12 0x08171991 in BackendStartup (port=0x83177f0) at postmaster.c:2453
#13 0x0816ff6f in ServerLoop () at postmaster.c:1198
#14 0x0816f4a3 in PostmasterMain (argc=3, argv=0x8307788) at postmaster.c:917
#15 0x0813d22d in main (argc=3, argv=0x8307788) at main.c:268
#16 0x400f5d06 in __libc_start_main () from /lib/libc.so.6
(gdb)

Program received signal SIGUSR1, User defined signal 1.
0x40109ac1 in kill () from /lib/libc.so.6
(gdb) where
#0  0x40109ac1 in kill () from /lib/libc.so.6
#1  0x0818e9e1 in SendPostmasterSignal (reason=PMSIGNAL_PASSWORD_CHANGE) at
pmsignal.c:69
#2  0x081906a8 in SIInsertDataEntry (segP=0xdbd, data=0x842bb40) at
sinvaladt.c:229
#3  0x0818f7e2 in SendSharedInvalidMessage (msg=0x842bb40) at sinval.c:122
#4  0x081ffcd8 in ProcessInvalidationMessages (hdr=0x842514c, func=0x818f7c0
<SendSharedInvalidMessage>) at inval.c:343
#5  0x082001d6 in AtEOXact_Inval (isCommit=1 '\001') at inval.c:674
#6  0x080a2548 in CommitTransaction () at xact.c:1543
#7  0x080a2af5 in CommitTransactionCommand () at xact.c:1938
#8  0x0819caef in finish_xact_command () at postgres.c:1843
#9  0x0819b795 in exec_simple_query (query_string=0x834ee8c "drop table
lijst61_table") at postgres.c:965
#10 0x0819dce0 in PostgresMain (argc=4, argv=0x8308b0c, username=0x8308adc
"megaconv") at postgres.c:2986
#11 0x08171f31 in BackendRun (port=0x83177f0) at postmaster.c:2817
#12 0x08171991 in BackendStartup (port=0x83177f0) at postmaster.c:2453
#13 0x0816ff6f in ServerLoop () at postmaster.c:1198
#14 0x0816f4a3 in PostmasterMain (argc=3, argv=0x8307788) at postmaster.c:917
#15 0x0813d22d in main (argc=3, argv=0x8307788) at main.c:268
#16 0x400f5d06 in __libc_start_main () from /lib/libc.so.6
(gdb) cont





--
Best,




Frank.
On Fri, Dec 03, 2004 at 01:03:43AM +0100, Frank van Vugt wrote:

> As far as the other problem is concerned, I just finished the first half of
> the conversion on a different machine that was build with assertions enabled
> and saw a number of reports like the ones below. What's causing these?

This happens when a transaction finishes that creates, drops, or
modifies a user.  If it happens at a different time, something is wrong.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> I could reproduce this case tomorrow and inspect some variables or call
> abort() if you think that could be of any help?

Do you have hardware watchpoints on that machine?  What I'd be inclined
to try is tracing through the transaction to see where
SerializableSnapshotData.xcnt or LatestSnapshotData.xcnt gets clobbered
from.  It'd be mighty tedious if you have to fall back on a software
watchpoint though.

Also, I gather from the xact.c traceback that the transaction looks like

    begin;
    savepoint a;
    ...
    commit; -- without releasing the savepoint

Does anything happen before the SAVEPOINT?  Now that you see which
plpgsql function is failing, do you have a better shot at making
a self-contained example?

            regards, tom lane
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Fri, Dec 03, 2004 at 01:03:43AM +0100, Frank van Vugt wrote:
>> As far as the other problem is concerned, I just finished the first half of
>> the conversion on a different machine that was build with assertions enabled
>> and saw a number of reports like the ones below. What's causing these?

> This happens when a transaction finishes that creates, drops, or
> modifies a user.  If it happens at a different time, something is wrong.

Given that the call site is SIInsertDataEntry, it's clear that
SendPostmasterSignal is actually being told PMSIGNAL_WAKEN_CHILDREN ...
the gdb traceback is reporting the wrong value for the parameter, which
is no big surprise for values in registers.

Which makes this a perfectly normal event; it's just trying to flush the
SI messaging queue before it overflows.  Nothing to see here, move
along.

            regards, tom lane
> > #1  0x0818e9e1 in SendPostmasterSignal (reason=PMSIGNAL_PASSWORD_CHANGE)
> >  at pmsignal.c:69
> >
> > As far as the other problem is concerned, I just finished the first half
> > of the conversion on a different machine that was build with assertions
> > enabled and saw a number of reports like the ones below. What's causing
> > these?
>
> This happens when a transaction finishes that creates, drops, or
> modifies a user.  If it happens at a different time, something is wrong.

The code for the first part of the conversion does not contain any statements
for create/alter/drop user, it doesn't even fire triggers. Basically it's
just looping over the following set of statements for a number of textfiles:

- read, validate and post-process the contents of a text file
- write the result into another text file
- (try to) drop the table
- create the table
- grant select on this table to public
- copy table from a textfile

That's it, so there is no explicit user-handling at all.

Anything particular I can take a look at?




--
Best,




Frank.
> > I could reproduce this case tomorrow and inspect some variables or call
> > abort() if you think that could be of any help?
>
> Do you have hardware watchpoints on that machine?

Mmm, don't know for sure, will have a look tomorrow.

> What I'd be inclined
> to try is tracing through the transaction to see where
> SerializableSnapshotData.xcnt or LatestSnapshotData.xcnt gets clobbered
> from.  It'd be mighty tedious if you have to fall back on a software
> watchpoint though.

A hint on the expression(s) to use would be appreciated !


> Also, I gather from the xact.c traceback that the transaction looks like
>
>     begin;
>     savepoint a;
>     ...
>     commit; -- without releasing the savepoint
>
> Does anything happen before the SAVEPOINT?

Actually, the SQL involved is v7.4.6, so there are no savepoints to begin with
(or arguably, there's nothing 'before' a savepoint ;))

> Now that you see which plpgsql function is failing, do you have a better
> shot at making a self-contained example?

Not really, but if tracing the transaction won't reveal anything else I guess
I will have no choice..... ;-(



--
Best,




Frank.
Ok, here are the results of the assertion TRAP (mind you, same error on the
same line, but a different machine than the original)


database log:

TRAP: FailedAssertion("!(((ntp)->t_data)->t_infomask & 0x0010)", File:
"catcache.c", Line: 1729)


gdb:

Program received signal SIGABRT, Aborted.
0x40109ac1 in kill () from /lib/libc.so.6
(gdb) where
#0  0x40109ac1 in kill () from /lib/libc.so.6
#1  0x401096eb in raise () from /lib/libc.so.6
#2  0x4010b127 in abort () from /lib/libc.so.6
#3  0x08208036 in ExceptionalCondition (conditionName=0x40212d24 "8,\023",
errorType=0x6 <Address 0x6 out of bounds>, fileName=0x0, lineNumber=1729)
    at assert.c:51
#4  0x081ff88f in build_dummy_tuple (cache=0x41469048, nkeys=1,
skeys=0xbfffe410) at catcache.c:1735
#5  0x081fe917 in SearchCatCache (cache=0x41469048, v1=0, v2=0, v3=0, v4=0) at
catcache.c:1240
#6  0x08173908 in pgstat_vacuum_tabstat () at pgstat.c:817
#7  0x081156d9 in vacuum (vacstmt=0x8342b5c) at vacuum.c:272
#8  0x0819f7cc in PortalRunUtility (portal=0x835bf2c, query=0x8342bb8,
dest=0x8342a3c, completionTag=0xbfffebd0 "") at pquery.c:934
#9  0x0819fb01 in PortalRunMulti (portal=0x835bf2c, dest=0x8342a3c,
altdest=0x8342a3c, completionTag=0xbfffebd0 "") at pquery.c:1001
#10 0x0819f174 in PortalRun (portal=0x835bf2c, count=2147483647,
dest=0x8342a3c, altdest=0x8342a3c, completionTag=0xbfffebd0 "") at
pquery.c:617
#11 0x0819b760 in exec_simple_query (query_string=0x8342824 "vacuum full
analyse ean_available") at postgres.c:933
#12 0x0819dce0 in PostgresMain (argc=4, argv=0x8308b04, username=0x8308adc
"vugtf") at postgres.c:2986
#13 0x08171f31 in BackendRun (port=0x83177f0) at postmaster.c:2817
#14 0x08171991 in BackendStartup (port=0x83177f0) at postmaster.c:2453
#15 0x0816ff6f in ServerLoop () at postmaster.c:1198
#16 0x0816f4a3 in PostmasterMain (argc=3, argv=0x8307788) at postmaster.c:917
#17 0x0813d22d in main (argc=3, argv=0x8307788) at main.c:268
#18 0x400f5d06 in __libc_start_main () from /lib/libc.so.6


Obviously, continuing was not an option ;)



--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> What I'd be inclined
>> to try is tracing through the transaction to see where
>> SerializableSnapshotData.xcnt or LatestSnapshotData.xcnt gets clobbered
>> from.  It'd be mighty tedious if you have to fall back on a software
>> watchpoint though.

> A hint on the expression(s) to use would be appreciated !

(gdb) watch SerializableSnapshotData.xcnt
Hardware watchpoint 1: SerializableSnapshotData.xcnt
(gdb) watch LatestSnapshotData.xcnt
Hardware watchpoint 2: LatestSnapshotData.xcnt
(gdb) cont
Continuing.

The only "authorized" place to set these is the assignment near the
bottom of GetSnapshotData(), which would normally get hit about
once per command.

            regards, tom lane
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> Does anything happen before the SAVEPOINT?

> Actually, the SQL involved is v7.4.6,

Not with that backtrace, it isn't.

            regards, tom lane
> >> Does anything happen before the SAVEPOINT?
> >
> > Actually, the SQL involved is v7.4.6,
>
> Not with that backtrace, it isn't.

Excuse me?

I'm running v8.0.0beta5, but the sql-statements that are in my source-tree are
all still 'based on v7.4.6', i.e. there is no occurence of the word
'savepoint' in the whole tree.




--
Best,




Frank.
> > A hint on the expression(s) to use would be appreciated !
> (gdb) watch SerializableSnapshotData.xcnt

I could have thought of that ;)

(I misinterpreted what you meant by SerializableSnapshotData.xcnt, wouldn't
have asked otherwise)

Anyway, I just confirmed that hardware watchpoints are available, so I'll
follow this up first thing in the morning.






--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>>>> Does anything happen before the SAVEPOINT?
>
>>>> Actually, the SQL involved is v7.4.6,
>>
>> Not with that backtrace, it isn't.

> Excuse me?

> I'm running v8.0.0beta5,

Oh, I thought you meant the backend was 7.4.6.

> but the sql-statements that are in my source-tree are
> all still 'based on v7.4.6', i.e. there is no occurence of the word
> 'savepoint' in the whole tree.

Hmm.  The line number referenced for CommitTransactionCommand() is
definitely in code that is only entered after a savepoint, but I see
it is an identical code path to some other arms of the switch()
command.  Probably the compiler decided to merge those arms.  You might
want to back off the compiler optimization level a step so you can get
more readable tracebacks ...

            regards, tom lane
> Hmm.  The line number referenced for CommitTransactionCommand() is
> definitely in code that is only entered after a savepoint, but I see
> it is an identical code path to some other arms of the switch()
> command.  Probably the compiler decided to merge those arms.  You might
> want to back off the compiler optimization level a step so you can get
> more readable tracebacks ...

Yup, will do that as well.


Will read any comments you may have on the TRAP backtrace in a couple of
hours, need to take myself offline for a while ;)





--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Ok, here are the results of the assertion TRAP (mind you, same error on the
> same line, but a different machine than the original)

That's another can't-happen problem.  As best I can tell, this suggests
that the tuple descriptor associated with the RELOID catalog cache has
gotten clobbered (it should have the HASOID bit set, but an Assert that
says it does is failing).

Either you've found some code path that clobbers random bits of memory,
or you've got flaky hardware.  But the same error on two different
machines makes the second theory a bit hard to swallow.

What's the hardware platform exactly?  Is it 64-bit, or anything at all
off the beaten path?

            regards, tom lane
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Meanwhile, one of the application developers here bumped into a way to
> reproduce what looks like the same memory alloc problem (exactly the same
> point in exactly the same trigger) using our application software
> only,

Oh good.  Can you construct a self-contained test case then?

> Here are both the query-set and the corresponding backtrace.

The query set's not very interesting without a database to try it
against :-(

> I then got a firm set of results, all of which were looking like this:

> Hardware access (read/write) watchpoint 1: SerializableSnapshotData.xcnt
> Value = 1
> Hardware access (read/write) watchpoint 2: LatestSnapshotData.xcnt
> Value = 1

> All were located at sinval.c:888

This is the expected case.  The failure in CopySnapshot has got to
indicate that somebody set one or the other field to some bizarrely
large value, though.  I take it you didn't run the watchpointed backend
far enough to get the memory-alloc error?

            regards, tom lane
> > Meanwhile, one of the application developers here bumped into a way to
> > reproduce what looks like the same memory alloc problem (exactly the same
> > point in exactly the same trigger) using our application software
> > only,
>
> Oh good.  Can you construct a self-contained test case then?

Will try to do just that when the rest won't work.

> > All were located at sinval.c:888
>
> This is the expected case.  The failure in CopySnapshot has got to
> indicate that somebody set one or the other field to some bizarrely
> large value, though.  I take it you didn't run the watchpointed backend
> far enough to get the memory-alloc error?

Oh, but I did.....

All the breaks are at sinval.c:888 and at some point the memory-alloc simply
occurs. Do you mean you want a backtrace of the last break at line 888 just
before the error ?



--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> This is the expected case.  The failure in CopySnapshot has got to
>> indicate that somebody set one or the other field to some bizarrely
>> large value, though.  I take it you didn't run the watchpointed backend
>> far enough to get the memory-alloc error?

> Oh, but I did.....

[ scratches head ... ]  AFAICS, the CopySnapshot that gets the error
must be trying to copy one or the other of those snapshots.  Maybe you
should look at the address passed to CopySnapshot and see what it's
really pointing at, and where it came from.

            regards, tom lane
> [ scratches head ... ]  AFAICS, the CopySnapshot that gets the error
> must be trying to copy one or the other of those snapshots.  Maybe you
> should look at the address passed to CopySnapshot and see what it's
> really pointing at, and where it came from.

Ok, stepping through the transaction didn't reveal any change in the
CopySnapshot address, except for the last one when the error occurs, which is
when the deferred trigger is executing an update statement.


I guess the only thing left for me is to try and create a script that will
reproduce the problem.... ;(


Please allow me to describe the sequence of events of the one thing that comes
to mind when I take another good look at the functions/triggers and try to
think about what could be causing this:

- db
    - a deferred trigger is defined on insert/update/delete of
        purchaseorder_line (pol)
    - this trigger is meant to update the corresponding purchaseorder(s)


- begin transaction

- select purchaseorder_line (pol) X for update
    (here it is determined whether or not to spawn the line)

- call function spawn_pol() on pol X
    (inside this function marked as volatile)
    - select pol X for update
        (in this case for the second time, it's called from other places too)
    - insert new pol Y
    - update old pol X
        (with regards to the spawned amount)

- update pol Y
    (with regards to a number of other fields)

- commit transaction
    - this will fire the earlier deferred trigger on pol
        - update purchaseorder (po) to which pols X and Y belong, using aggregates
        (it is this update statement where the error occurs)


The only thing I can think of is that the handling of pol X and creation of
pol Y from within spawn_pol() is somehow messing things up, but......

If this doesn't ring any alarm bells, then I'll start working on a script
first thing next Monday.





--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> The only thing I can think of is that the handling of pol X and creation of
> pol Y from within spawn_pol() is somehow messing things up, but......

Certainly the mere firing of a deferred trigger isn't the problem; we do
that all the time.  What struck me about the traceback you showed is
that it involved an attempt by the planner to constant-fold an
SQL-language function that was used in the trigger function's query
(and was presumably marked IMMUTABLE and had constant arguments in the
query).

Because the trigger function is plpgsql, this could happen only the
first time the trigger is fired in a particular session (unless you are
using EXECUTE to invoke the update command?).  If the problem is related
to the planner constant-folding environment, then the first-time-only
limitation would help make it hard to reproduce.

            regards, tom lane
> > The only thing I can think of is that the handling of pol X and creation
> > of pol Y from within spawn_pol() is somehow messing things up, but......

> Certainly the mere firing of a deferred trigger isn't the problem; we do
> that all the time.

Me too ;)

I was more trying to emphase the multiple select for update from both in- as
well as outside the functions, but you'd probably have caught that already if
it could have been the culprit.

> What struck me about the traceback <cut>
> Because the trigger function is plpgsql, this could happen only the
> first time the trigger is fired in a particular session

I've tried to run all immutable functions used at least once before running
the query-set, this made no difference, same error on the same location.

> (unless you are using EXECUTE to invoke the update command?)

No, no, it's a plain call.




--
Best,




Frank.
Hi Tom,

> Because the trigger function is plpgsql, this could happen only the
> first time the trigger is fired in a particular session (unless you are
> using EXECUTE to invoke the update command?).  If the problem is related
> to the planner constant-folding environment, then the first-time-only
> limitation would help make it hard to reproduce.

It looks like your suspicion was right.... I've isolated one particulare case,
simply run the script attached. If you remove the comment before the extra
call of spawn_pol() at the bottom of the script, the error is avoided.

Hopefully, solving this will also solve the other cases of mem-alloc error
I've been seeing ;) I'm not using spawn_pol/sol() during the earlier
mentioned conversion, but my hope is that the error generated during
conversion will turn out to be related and caused by the constant-folding of
some other function. But I'll know soon enough if it isn't ;)

Looking forward to your assesment.




--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Looking forward to your assesment.

Looks like the problem is associated with DEFERRED AFTER triggers: we
don't normally set a snapshot for TransactionStmt commands, including
COMMIT, but there needs to be a snapshot set when running trigger
functions.  The attached patch fixes the test case you sent; can you
try it against your other problem?

            regards, tom lane

*** src/backend/commands/trigger.c.orig    Sun Nov 14 14:37:05 2004
--- src/backend/commands/trigger.c    Mon Dec  6 13:20:43 2004
***************
*** 2365,2370 ****
--- 2365,2378 ----
      Assert(afterTriggers->query_depth == -1);

      /*
+      * If there are any triggers to fire, make sure we have set a snapshot
+      * for them to use.  (Since PortalRunUtility doesn't set a snap for
+      * COMMIT, we can't assume ActiveSnapshot is valid on entry.)
+      */
+     if (afterTriggers->events.head != NULL)
+         ActiveSnapshot = CopySnapshot(GetTransactionSnapshot());
+
+     /*
       * Run all the remaining triggers.  Loop until they are all gone,
       * just in case some trigger queues more for us to do.
       */
Hi Tom,

> Looks like the problem is associated with DEFERRED AFTER triggers

> The attached patch fixes the test case you sent; can you
> try it against your other problem?

I patched RC1 with this and tried the conversion again. Though still running,
it's passed the point where the mem-alloc error occured earlier, so it would
seem we're in the clear on this one.

Thanks for the quick fix !

I guess this patch is already in the RC2 tree?




--
Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> I guess this patch is already in the RC2 tree?

No, I was waiting for confirmation ...

            regards, tom lane