Thread: assertion failure 9.3.4


assertion failure 9.3.4

From:
Andrew Dunstan
Date:
With a client's code I have just managed to produce the following 
assertion failure on 9.3.4:
   2014-04-15 01:02:46 GMT [19854] 76299: LOG:  execute <unnamed>:
   select * from "asp_ins_event_task_log"( job_id:=1, event_id:=3164,
   task_name:='EventUtcComputeTask', task_status_code:='VALID'
   , task_start_utc:='04/15/2014 01:02:44.563',
   task_end_utc:='04/15/2014 01:02:44.563')
   TRAP: FailedAssertion("!(update_xact == ((TransactionId) 0))", File:
   "/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c", Line: 5873)
   2014-04-15 01:02:46 GMT [11959] 11: LOG:  server process (PID 19854)
   was terminated by signal 6: Aborted
   2014-04-15 01:02:46 GMT [11959] 12: DETAIL:  Failed process was
   running: select * from "asp_ins_event_task_log"( job_id:=1,
   event_id:=3164, task_name:='EventUtcComputeTask', task_status_code
   :='VALID', task_start_utc:='04/15/2014 01:02:44.563',
   task_end_utc:='04/15/2014 01:02:44.563')
   2014-04-15 01:02:46 GMT [11959] 13: LOG:  terminating any other
   active server processes
 


When running without assertions, the client reports experiencing tables 
with duplicate primary keys among other things. It's apparently quite 
reproducible.
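
For anyone wanting to check a suspect table for this, duplicates can be
surfaced with a query along these lines (just a sketch; "event_task_log"
and "task_id" are stand-ins for the affected table and its primary key):

   select task_id, count(*)
   from event_task_log
   group by task_id
   having count(*) > 1;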

I'm digging into this, but it's a nasty bug and any useful thoughts 
would be appreciated.

cheers

andrew



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/14/2014 09:28 PM, Andrew Dunstan wrote:
>
> With a client's code I have just managed to produce the following 
> assertion failure on 9.3.4:
>
>    2014-04-15 01:02:46 GMT [19854] 76299: LOG:  execute <unnamed>:
>    select * from "asp_ins_event_task_log"( job_id:=1, event_id:=3164,
>    task_name:='EventUtcComputeTask', task_status_code:='VALID'
>    , task_start_utc:='04/15/2014 01:02:44.563',
>    task_end_utc:='04/15/2014 01:02:44.563')
>    TRAP: FailedAssertion("!(update_xact == ((TransactionId) 0))", File:
>    "/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c", Line: 
> 5873)
>    2014-04-15 01:02:46 GMT [11959] 11: LOG:  server process (PID 19854)
>    was terminated by signal 6: Aborted
>    2014-04-15 01:02:46 GMT [11959] 12: DETAIL:  Failed process was
>    running: select * from "asp_ins_event_task_log"( job_id:=1,
>    event_id:=3164, task_name:='EventUtcComputeTask', task_status_code
>    :='VALID', task_start_utc:='04/15/2014 01:02:44.563',
>    task_end_utc:='04/15/2014 01:02:44.563')
>    2014-04-15 01:02:46 GMT [11959] 13: LOG:  terminating any other
>    active server processes
>
>
> When running without assertions, the client reports experiencing 
> tables with duplicate primary keys among other things. It's apparently 
> quite reproducible.
>
> I'm digging into this, but it's a nasty bug and any useful thoughts 
> would be appreciated.


and here is the stack trace:
   #0  0x000000361ba36285 in __GI_raise (sig=6) at
   ../nptl/sysdeps/unix/sysv/linux/raise.c:64
   #1  0x000000361ba37b9b in __GI_abort () at abort.c:91
   #2  0x000000000075c157 in ExceptionalCondition
   (conditionName=<optimized out>, errorType=<optimized out>,
   fileName=<optimized out>, lineNumber=<optimized out>)
        at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54
   #3  0x000000000048c2af in MultiXactIdGetUpdateXid (xmax=<optimized
   out>, t_infomask=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873
   #4  0x000000000078ad50 in HeapTupleSatisfiesMVCC
   (tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at
   /home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221
   #5  0x000000000048aad2 in heapgetpage (scan=0x11522f0, page=6) at
   /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:397
   #6  0x000000000048aee9 in heapgettup_pagemode (scan=0x11522f0,
   dir=<optimized out>, nkeys=0, key=0x0) at
   /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:939
   #7  0x000000000048d646 in heap_getnext (scan=0x11522f0,
   direction=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:1459
   #8  0x00000000005becab in SeqNext (node=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/nodeSeqscan.c:66
   #9  0x00000000005ac66e in ExecScanFetch (recheckMtd=0x5bec70
   <SeqRecheck>, accessMtd=0x5bec80 <SeqNext>, node=0x1151488)
        at /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:82
   #10 ExecScan (node=0x1151488, accessMtd=0x5bec80 <SeqNext>,
   recheckMtd=0x5bec70 <SeqRecheck>) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:167
   #11 0x00000000005a5338 in ExecProcNode (node=0x1151488) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:400
   #12 0x00000000005b9bcf in ExecLockRows (node=0x1151278) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/nodeLockRows.c:56
   #13 0x00000000005a51d0 in ExecProcNode (node=0x1151278) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:496
   #14 0x00000000005a250a in ExecutePlan (dest=0xb952e0,
   direction=<optimized out>, numberTuples=1, sendTuples=1 '\001',
   operation=CMD_SELECT, planstate=0x1151278, estate=0x1151038)
        at /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:1473
   #15 standard_ExecutorRun (queryDesc=0x114f230, direction=<optimized
   out>, count=1) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:307
   #16 0x00007feb5cf2f27d in explain_ExecutorRun (queryDesc=0x114f230,
   direction=ForwardScanDirection, count=1) at
   /home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:233
   #17 0x00007feb5cd2a235 in pgss_ExecutorRun (queryDesc=0x114f230,
   direction=ForwardScanDirection, count=1) at
   /home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:717
   #18 0x00000000005c8b97 in _SPI_pquery (tcount=1, fire_triggers=0
   '\000', queryDesc=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2369
   #19 _SPI_execute_plan (plan=0x10afac8, paramLI=0x114f028,
   snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000',
   fire_triggers=0 '\000', tcount=1)
        at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2157
   #20 0x00000000005c8fd1 in SPI_execute_snapshot (plan=0x10afac8,
   Values=0x7fff147d8500, Nulls=0x7fff147d8700 " ", snapshot=0x0,
   crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=0 '\000',
   tcount=1) at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:488
   #21 0x0000000000722635 in ri_PerformCheck (riinfo=0xf98168,
   qkey=0x7fff147d8a20, qplan=0x10afac8, fk_rel=0x7feb39048048,
   pk_rel=0x7feb38ff9b40, old_tuple=0x0, new_tuple=0x7fff147d9010,
   detectNewRows=0 '\000', expect_OK=5) at
   /home/andrew/pgl/pg_9_3/src/backend/utils/adt/ri_triggers.c:3094
   #22 0x0000000000722cfc in RI_FKey_check (trigdata=<optimized out>)
   at /home/andrew/pgl/pg_9_3/src/backend/utils/adt/ri_triggers.c:440
   #23 0x0000000000588add in ExecCallTriggerFunc
   (trigdata=0x7fff147d8fe0, tgindx=4, finfo=0x11409a8, instr=0x0,
   per_tuple_context=0x1198d28)
        at /home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:1871
   #24 0x0000000000588f6b in AfterTriggerExecute
   (per_tuple_context=0x1198d28, instr=0x0, finfo=0x11408e8,
   trigdesc=0x11401f8, rel=0x7feb39048048, event=0x114d038)
        at /home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:3454
   #25 afterTriggerInvokeEvents (events=0x1025620, firing_id=1,
   estate=0x113ffb8, delete_ok=1 '\001') at
   /home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:3633
   #26 0x000000000058e873 in AfterTriggerEndQuery (estate=0x113ffb8) at
   /home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:3823
   #27 0x00000000005a26f4 in standard_ExecutorFinish
   (queryDesc=0x1076d08) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:379
   #28 0x00007feb5cf2f1a5 in explain_ExecutorFinish
   (queryDesc=0x1076d08) at
   /home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:256
   #29 0x00007feb5cd2a163 in pgss_ExecutorFinish (queryDesc=0x1076d08)
   at
   /home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:740
   #30 0x00000000005c8be4 in _SPI_pquery (tcount=2, fire_triggers=1
   '\001', queryDesc=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2381
   #31 _SPI_execute_plan (plan=0x10905a8, paramLI=0x112ab10,
   snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000',
   fire_triggers=1 '\001', tcount=2)
        at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2157
   #32 0x00000000005c909f in SPI_execute_plan_with_paramlist
   (plan=0x10905a8, params=0x112ab10, read_only=0 '\000', tcount=2) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:449
   #33 0x00007feb3908d877 in exec_stmt_execsql (estate=0x7fff147d9700,
   stmt=0x10630f0) at
   /home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:3293
   #34 0x00007feb3908ee9e in exec_stmt (stmt=0x10630f0,
   estate=0x7fff147d9700) at
   /home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:1446
   #35 exec_stmts (estate=0x7fff147d9700, stmts=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:1345
   #36 0x00007feb3909157b in exec_stmt_block (estate=0x7fff147d9700,
   block=0x1063c68) at
   /home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:1283
   #37 0x00007feb390921dd in plpgsql_exec_function (func=0x106bbe8,
   fcinfo=0x7fff147d9940) at
   /home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:321
   #38 0x00007feb39086ff4 in plpgsql_call_handler (fcinfo=0x7fff147d9940)
   at /home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_handler.c:129
   #39 0x00000000005aa632 in ExecMakeTableFunctionResult
   (funcexpr=0x1122070, econtext=0x1121b98, expectedDesc=0x1121f18,
   randomAccess=0 '\000')
        at /home/andrew/pgl/pg_9_3/src/backend/executor/execQual.c:2164
   #40 0x00000000005be0c1 in FunctionNext (node=0x1121a80) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/nodeFunctionscan.c:64
   #41 0x00000000005ac74e in ExecScanFetch (recheckMtd=0x5be040
   <FunctionRecheck>, accessMtd=0x5be050 <FunctionNext>, node=0x1121a80)
        at /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:82
   #42 ExecScan (node=0x1121a80, accessMtd=0x5be050 <FunctionNext>,
   recheckMtd=0x5be040 <FunctionRecheck>) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:132
   #43 0x00000000005a52d8 in ExecProcNode (node=0x1121a80) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:426
   #44 0x00000000005a250a in ExecutePlan (dest=0xeeaa40,
   direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
   operation=CMD_SELECT, planstate=0x1121a80, estate=0x1121968)
        at /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:1473
   #45 standard_ExecutorRun (queryDesc=0x1029e60, direction=<optimized
   out>, count=0) at
   /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:307
   #46 0x00007feb5cf2f27d in explain_ExecutorRun (queryDesc=0x1029e60,
   direction=ForwardScanDirection, count=0) at
   /home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:233
   #47 0x00007feb5cd2a235 in pgss_ExecutorRun (queryDesc=0x1029e60,
   direction=ForwardScanDirection, count=0) at
   /home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:717
   #48 0x000000000068fe77 in PortalRunSelect (portal=0xf2d598,
   forward=<optimized out>, count=0, dest=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/tcop/pquery.c:946
   #49 0x00000000006913c1 in PortalRun (portal=0xf2d598,
   count=9223372036854775807, isTopLevel=1 '\001', dest=0xeeaa40,
   altdest=0xeeaa40, completionTag=0x7fff147da470 "")
        at /home/andrew/pgl/pg_9_3/src/backend/tcop/pquery.c:790
   #50 0x000000000068ec50 in exec_execute_message (max_rows=<optimized
   out>, portal_name=0xeea628 "") at
   /home/andrew/pgl/pg_9_3/src/backend/tcop/postgres.c:1931
   #51 PostgresMain (argc=<optimized out>, argv=<optimized out>,
   dbname=0xe3e0c8 "eventjobs", username=<optimized out>)
        at /home/andrew/pgl/pg_9_3/src/backend/tcop/postgres.c:4060
   #52 0x000000000064465c in BackendRun (port=0xe63490) at
   /home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:4011
   #53 BackendStartup (port=0xe63490) at
   /home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:3685
   #54 ServerLoop () at
   /home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:1586
   #55 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
   /home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:1253
   #56 0x000000000045c897 in main (argc=3, argv=0xe3d270) at
   /home/andrew/pgl/pg_9_3/src/backend/main/main.c:226
 


cheers

andrew



Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Andrew Dunstan wrote:

> and here is the stack trace:
> 
>    #0  0x000000361ba36285 in __GI_raise (sig=6) at
>    ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>    #1  0x000000361ba37b9b in __GI_abort () at abort.c:91
>    #2  0x000000000075c157 in ExceptionalCondition
>    (conditionName=<optimized out>, errorType=<optimized out>,
>    fileName=<optimized out>, lineNumber=<optimized out>)
>         at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54
>    #3  0x000000000048c2af in MultiXactIdGetUpdateXid (xmax=<optimized
>    out>, t_infomask=<optimized out>) at
>    /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873
>    #4  0x000000000078ad50 in HeapTupleSatisfiesMVCC
>    (tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at
>    /home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221

Clearly this is a bug related to multixacts and the related tqual.c
changes.  Will look.


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/14/2014 10:02 PM, Alvaro Herrera wrote:
> Andrew Dunstan wrote:
>
>> and here is the stack trace:
>>
>>     #0  0x000000361ba36285 in __GI_raise (sig=6) at
>>     ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>     #1  0x000000361ba37b9b in __GI_abort () at abort.c:91
>>     #2  0x000000000075c157 in ExceptionalCondition
>>     (conditionName=<optimized out>, errorType=<optimized out>,
>>     fileName=<optimized out>, lineNumber=<optimized out>)
>>          at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54
>>     #3  0x000000000048c2af in MultiXactIdGetUpdateXid (xmax=<optimized
>>     out>, t_infomask=<optimized out>) at
>>     /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873
>>     #4  0x000000000078ad50 in HeapTupleSatisfiesMVCC
>>     (tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at
>>     /home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221
> Clearly this is a bug related to multixacts and the related tqual.c
> changes.  Will look.
>
>


OK. If you need me to test patches I can do so.

cheers

andrew



Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
So, from top to bottom I see the following elements:

* backend is executing a query
* this query is getting captured by pg_stat_statements
* the query is also getting captured by autoexplain, in chain from pg_stat_statements
* autoexplain runs the query, which invokes a plpgsql function
* this plpgsql function runs another query, and this one is captured by pg_stat_statements
* and also by autoexplain
* The autoexplain run of this inner query invokes a trigger
* .. which is a FK trigger, and therefore runs ri_PerformCheck which runs another query
* This FK check query gets captured by pg_stat_statements
* and also by autoexplain, which runs it
* this autoexplain run of the third query invokes SeqNext
* this does a heap access, which uses HeapTupleSatisfiesMVCC
* this one tries to read the update XID and gets an assertion failure.

Oh my.

Would it be possible for you to provide a self-contained setup that
behaves like this?  I think a "bt full" would be useful since it'd
provide the queries at each of the three stages.

I'm not quite clear on why the third query, the one in ri_PerformCheck,
is invoking a sequence.  AFAICS it's this:
    /* ----------
     * The query string built is
     *    SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...]
     *           FOR KEY SHARE OF x
     * The type id's for the $ parameters are those of the
     * corresponding FK attributes.
     * ----------
     */
 
so either I'm misreading the whole thing, or there is something very odd
going on here.

Are you aware of something unusual in the FK setups?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I'm not quite clear on why the third query, the one in ri_PerformCheck,
> is invoking a sequence.

It's not --- SeqNext is the next-tuple function for a sequential scan.
Nothing to do with sequences.

Now, it *is* worth wondering why the heck a query on the table's primary
key is using a seqscan and not an indexscan.  Maybe the planner thinks
there are just a few rows in the table?  But the stack trace seems
unexceptional other than that.
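
One quick way to check that theory is to look at the planner's notion of
the table size in pg_class (a sketch, with "pktable" as a stand-in name):

   select relname, relpages, reltuples
   from pg_class
   where relname = 'pktable';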

I'm wondering if the combination of autoexplain and pg_stat_statements
is causing trouble.

Yeah, it would be real nice to see a self-contained test case for this.
        regards, tom lane



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/16/2014 07:19 PM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I'm not quite clear on why the third query, the one in ri_PerformCheck,
>> is invoking a sequence.
> It's not --- SeqNext is the next-tuple function for a sequential scan.
> Nothing to do with sequences.
>
> Now, it *is* worth wondering why the heck a query on the table's primary
> key is using a seqscan and not an indexscan.  Maybe the planner thinks
> there are just a few rows in the table?  But the stack trace seems
> unexceptional other than that.
>
> I'm wondering if the combination of autoexplain and pg_stat_statements
> is causing trouble.
>
> Yeah, it would be real nice to see a self-contained test case for this.
>
>             


Well, that might be hard to put together, but I did try running without 
pg_stat_statements and auto_explain loaded and the error did not occur. 
Not sure where that gets us in terms of deciding on a culprit.

cheers

andrew




Re: assertion failure 9.3.4

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 04/16/2014 07:19 PM, Tom Lane wrote:
>> Yeah, it would be real nice to see a self-contained test case for this.

> Well, that might be hard to put together, but I did try running without 
> pg_stat_statements and auto_explain loaded and the error did not occur. 
> Not sure where that gets us in terms of deciding on a culprit.

Could we at least see the exact parameter settings for pg_stat_statements
and auto_explain?  (And any other GUCs with nondefault values?)
        regards, tom lane



Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Andrew Dunstan wrote:
> 
> On 04/16/2014 07:19 PM, Tom Lane wrote:
> >Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >>I'm not quite clear on why the third query, the one in ri_PerformCheck,
> >>is invoking a sequence.
> >It's not --- SeqNext is the next-tuple function for a sequential scan.
> >Nothing to do with sequences.
> >
> >Now, it *is* worth wondering why the heck a query on the table's primary
> >key is using a seqscan and not an indexscan.  Maybe the planner thinks
> >there are just a few rows in the table?  But the stack trace seems
> >unexceptional other than that.
> >
> >I'm wondering if the combination of autoexplain and pg_stat_statements
> >is causing trouble.
> >
> >Yeah, it would be real nice to see a self-contained test case for this.
> 
> Well, that might be hard to put together, but I did try running
> without pg_stat_statements and auto_explain loaded and the error did
> not occur.

Well, can you get us the queries that are being run, and the schemas
involved in those queries?

> Not sure where that gets us in terms of deciding on a
> culprit.

I suspect, in the blind, that autoexplain is executing a query that
causes a second update XID to be added to a multixact (which already has
another one), perhaps with the same snapshot as the original update.
Why would this be happening in a FK-check query?  No idea.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/16/2014 10:28 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 04/16/2014 07:19 PM, Tom Lane wrote:
>>> Yeah, it would be real nice to see a self-contained test case for this.
>> Well, that might be hard to put together, but I did try running without
>> pg_stat_statements and auto_explain loaded and the error did not occur.
>> Not sure where that gets us in terms of deciding on a culprit.
> Could we at least see the exact parameter settings for pg_stat_statements
> and auto_explain?  (And any other GUCs with nondefault values?)
>
>             


Here are all the settings from the run that failed:
   listen_addresses = '*'
   port = 5432
   fsync = on
   synchronous_commit = off
   checkpoint_segments = 128
   checkpoint_completion_target = 0.9
   shared_buffers = 512MB
   max_connections = 300
   work_mem = 128MB
   maintenance_work_mem = 32MB
   effective_cache_size = 16GB
   effective_io_concurrency = 2
   logging_collector = on
   log_destination = 'stderr'
   log_filename = 'postgresql-%a.log'
   log_rotation_size = 0
   log_truncate_on_rotation = on
   log_line_prefix = '%t [%p] %l: '
   log_connections = on
   log_disconnections = on
   log_statement = 'all'
   track_activity_query_size = 10240
   shared_preload_libraries = 'auto_explain,pg_stat_statements'
 

As you can see, auto_explain's log_min_duration hasn't been set, so it 
shouldn't be doing anything very much, I should think.


cheers

andrew




Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
All,

So I have encountered a 2nd report of this issue, or of an issue which
sounds very similar:

- corruption in two "queue" tables

- the tables are written in a high-concurrency, lock-contested environment

- user uses SELECT FOR UPDATE with these tables.

- pg_stat_statements.so is loaded, but the extension is not installed

- four rows were added to the queue tables, while not being added to the
PK index.  This allowed duplicate PKs to be added.

Currently the user is testing not loading the pg_stat_statements.so to
see if the problem goes away.  They have a destruction test environment,
so we should be able to confirm/deny in a couple days.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 7:15 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>  track_activity_query_size = 10240
>    shared_preload_libraries = 'auto_explain,pg_stat_statements'
>
> As you can see, auto_explain's log_min_duration hasn't been set, so it
> shouldn't be doing anything very much, I should think.

track_activity_query_size is 10240? Not 1024?


-- 
Peter Geoghegan



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/17/2014 09:04 PM, Peter Geoghegan wrote:
> On Thu, Apr 17, 2014 at 7:15 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>   track_activity_query_size = 10240
>>     shared_preload_libraries = 'auto_explain,pg_stat_statements'
>>
>> As you can see, auto_explain's log_min_duration hasn't been set, so it
>> shouldn't be doing anything very much, I should think.
> track_activity_query_size is 10240? Not 1024?
>
>

yes.

I normally find 1024 is far too small.

cheers

andrew



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/17/2014 10:15 AM, Andrew Dunstan wrote:
>
> On 04/16/2014 10:28 PM, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> On 04/16/2014 07:19 PM, Tom Lane wrote:
>>>> Yeah, it would be real nice to see a self-contained test case for 
>>>> this.
>>> Well, that might be hard to put together, but I did try running without
>>> pg_stat_statements and auto_explain loaded and the error did not occur.
>>> Not sure where that gets us in terms of deciding on a culprit.
>> Could we at least see the exact parameter settings for 
>> pg_stat_statements
>> and auto_explain?  (And any other GUCs with nondefault values?)
>>
>>
>
>
> Here are all the settings from the run that failed:
>
>    listen_addresses = '*'
>    port = 5432
>    fsync = on
>    synchronous_commit = off
>    checkpoint_segments = 128
>    checkpoint_completion_target = 0.9
>    shared_buffers = 512MB
>    max_connections = 300
>    work_mem = 128MB
>    maintenance_work_mem = 32MB
>    effective_cache_size = 16GB
>    effective_io_concurrency = 2
>    logging_collector = on
>    log_destination = 'stderr'
>    log_filename = 'postgresql-%a.log'
>    log_rotation_size = 0
>    log_truncate_on_rotation = on
>    log_line_prefix = '%t [%p] %l: '
>    log_connections = on
>    log_disconnections = on
>    log_statement = 'all'
>    track_activity_query_size = 10240
>    shared_preload_libraries = 'auto_explain,pg_stat_statements'
>
> As you can see, auto_explain's log_min_duration hasn't been set, so it 
> shouldn't be doing anything very much, I should think.



There definitely seems to be something going on involving these two 
pre-loaded modules. With both auto_explain and pg_stat_statements 
preloaded I can reproduce the error fairly reliably. I have also 
reproduced it, but less reliably, with auto_explain alone loaded. I have 
not reproduced it with pg_stat_statements alone loaded, but Josh Berkus 
has reported that another client has experienced something very similar 
(duplicate PK errors on a non-assertion-enabled build) with 
pg_stat_statements alone loaded.

cheers

andrew






Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
On 04/18/2014 09:42 AM, Andrew Dunstan wrote:

> There definitely seems to be something going on involving these two
> pre-loaded modules. With both auto_explain and pg_stat_statements
> preloaded I can reproduce the error fairly reliably. I have also
> reproduced it, but less reliably, with auto_explain alone loaded. I have
> not reproduced it with pg_stat_statements alone loaded, but Josh Berkus
> has reported that another client has experienced something very similar
> (duplicate PK errors on a non-assertion-enabled build) with
> pg_stat_statements alone loaded.

Correct.   However, this seems to produce the issue less "reliably";
that is, it takes several hours of load testing for a duplicate PK to
show up, so I suspect that with pg_stat_statements alone it's also a
timing issue.  I'm still waiting on the results with
pg_stat_statements.so NOT loaded to confirm that we're not actually
getting two separate bugs with similar symptoms.

The second site does not have any increase in query size.  Their only
settings for pg_s_s are:

pg_stat_statements.max = 10000
pg_stat_statements.track = all

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
All,

More on this:

1) I've confirmed at the 2nd site that the issue doesn't happen if
pg_stat_statements.so is not loaded.  So this seems to be confirmation
that either auto_explain, pg_stat_statements, or both need to be loaded
(but not necessarily created as extensions) in order to have the issue.

2) I've been unable to reproduce the issue using a naive test case, but
I'll keep trying.

3) The XID information is interesting.

a) I've confirmed that this is a case of having multiple row versions
rather than allowing a duplicate PK to be inserted.

b) the end result of this corruption is XIDs which go backwards:
  xmin   |  xmax   |   ctid    | step_id | workitem_id | status
---------+---------+-----------+---------+-------------+---------
 3362707 | 3362707 | (6177,31) |       1 |      446469 | pending
 3362710 | 3362707 | (6177,32) |       1 |      446469 | working

 5520953 | 5520953 | (5064,105) |       1 |      727946 | pending
 5520954 | 5520953 | (5064,108) |       1 |      727946 | working
 

What's additionally problematic is that the current snapshot minxid is
in the 9000000 range, so it's not clear why any of the above rows are
visible at all.
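
For reference, rows like the above can be pulled with an ordinary query
over the system columns (a sketch; "queue_steps" is a stand-in for the
affected queue table):

   select xmin, xmax, ctid, step_id, workitem_id, status
   from queue_steps
   where workitem_id in (446469, 727946);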

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Stephen Frost
Date:
* Josh Berkus (josh@agliodbs.com) wrote:
> 1) I've confirmed at the 2nd site that the issue doesn't happen if
> pg_stat_statements.so is not loaded.  So this seems to be confirmation
> that either auto_explain, pg_stat_statements, or both need to be loaded
> (but not necessarily created as extensions) in order to have the issue.

Interesting...

> 2) I've been unable to reproduce the issue using a naive test case, but
> I'll keep trying.

Thanks.

> 3) The XID information is interesting.
>
> a) I've confirmed that this is a case of having multiple row versions
> rather than allowing a duplicate PK to be inserted.
>
> b) the end result of this corruption is XIDs which go backwards:
>
>   xmin   |  xmax   |   ctid    | step_id | workitem_id | status
> ---------+---------+-----------+---------+-------------+---------
>  3362707 | 3362707 | (6177,31) |       1 |      446469 | pending
>  3362710 | 3362707 | (6177,32) |       1 |      446469 | working
>
>  5520953 | 5520953 | (5064,105) |       1 |      727946 | pending
>  5520954 | 5520953 | (5064,108) |       1 |      727946 | working
>
> What's additionally problematic is that the current snapshot minxid is
> in the 9000000 range, so it's not clear why any of the above rows are
> visible at all.

Can you get the infomask bits?  What does pg_controldata report wrt
the MultiXids?
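
If the pageinspect extension is available, the bits can be read straight
off a page; a sketch, where "queue_steps" is a stand-in and the block
number comes from one of the ctids upthread:

   create extension if not exists pageinspect;
   select lp, t_xmin, t_xmax, t_ctid,
          to_hex(t_infomask) as infomask,
          to_hex(t_infomask2) as infomask2
   from heap_page_items(get_raw_page('queue_steps', 6177));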

Thanks,
    Stephen

Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
> Can you get the infomask bits?  What does pg_controldata report wrt
> the MultiXids?

Can't get the infomask bits.

pg_controldata attached, with some redactions.  Unfortunately, it
appears that they've continued to do tests on this system, so the XID
counter has advanced somewhat.

pg_control version number:            937
Catalog version number:               {redacted}
Database system identifier:           {redacted}
Database cluster state:               in production
pg_control last modified:             Mon 21 Apr 2014 04:28:30 PM UTC
Latest checkpoint location:           8/C8B05F98
Prior checkpoint location:            8/C663A4B0
Latest checkpoint's REDO location:    8/C6DC9A10
Latest checkpoint's REDO WAL file:    0000000300000008000000C6
Latest checkpoint's TimeLineID:       3
Latest checkpoint's PrevTimeLineID:   3
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/11214375
Latest checkpoint's NextOID:          32769
Latest checkpoint's NextMultiXactId:  28939
Latest checkpoint's NextMultiOffset:  58004
Latest checkpoint's oldestXID:        674
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  11214375
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint:            Mon 21 Apr 2014 04:21:02 PM UTC
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> 1) I've confirmed at the 2nd site that the issue doesn't happen if
> pg_stat_statements.so is not loaded.  So this seems to be confirmation
> that either auto_explain, pg_stat_statements, or both need to be loaded
> (but not necessarily created as extensions) in order to have the issue.

I looked a little bit for possible interactions between those two modules,
and almost immediately noticed that pg_stat_statements is not being polite
enough to call any previous occupant of the post_parse_analyze_hook.
That's not the bug we're looking for, since auto_explain does not use
post_parse_analyze_hook, but it's a bug nonetheless.
        regards, tom lane



Re: assertion failure 9.3.4

From:
Andres Freund
Date:
Hi,

I spent the last two hours poking around in the environment Andrew
provided and I was able to reproduce the issue, find an assert to
reproduce it much faster and find a possible root cause.

Since the symptom of the problem seems to be multixacts with more than
one updating xid, I added a check to MultiXactIdCreateFromMembers()
preventing that. That requires moving ISUPDATE_from_mxstatus() to a
header, but I think we should definitely add such an assert.

As it turns out, the problem is in the

    else if (result == HeapTupleBeingUpdated && wait)

branch in (at least) heap_update(). When the problem is hit, the
to-be-updated tuple originally has HEAP_XMIN_COMMITTED |
HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK set. So we release the
buffer lock, acquire the tuple lock, and reacquire the buffer lock. But
in between, the locking backend has actually updated the tuple.
The code tries to protect against that with:

    /*
     * recheck the locker; if someone else changed the tuple while
     * we weren't looking, start over.
     */
    if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
        !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
                             xwait))
        goto l2;

    can_continue = true;
    locker_remains = true;

and similar. The problem is that in Andrew's case the infomask changes
from 0x2192 to 0x2102 (i.e. it's a normal update afterwards), while xmax
stays the same. Ooops.

A bit later there's:

    result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated;

So, from there on we happily continue to update the tuple, thinking
there's no previous updater. Which obviously causes problems.

I've hacked^Wfixed this by changing the infomask test above into
infomask != oldtup.t_data->t_infomask in a couple of places. That seems
to be sufficient to survive the testcase a couple of times.

I am too hungry right now to think about a proper fix for this and
whether there are further problematic areas.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> I spent the last two hours poking around in the environment Andrew
> provided and I was able to reproduce the issue, find an assert to
> reproduce it much faster and find a possible root cause.

Hmm ... is this the same thing Josh is reporting?  If so, why the
apparent connection with use of pg_stat_statements?  Maybe that
just changes the timing to make the race condition more likely?
        regards, tom lane



Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
On 04/21/2014 12:26 PM, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> I spent the last two hours poking around in the environment Andrew
>> provided and I was able to reproduce the issue, find an assert to
>> reproduce it much faster and find a possible root cause.
> 
> Hmm ... is this the same thing Josh is reporting?  If so, why the
> apparent connection with use of pg_stat_statements?  Maybe that
> just changes the timing to make the race condition more likely?

Well, in Andrew's case it only happens if auto_explain is loaded.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/21/2014 03:26 PM, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> I spent the last two hours poking around in the environment Andrew
>> provided and I was able to reproduce the issue, find an assert to
>> reproduce it much faster and find a possible root cause.
> Hmm ... is this the same thing Josh is reporting?  If so, why the
> apparent connection with use of pg_stat_statements?  Maybe that
> just changes the timing to make the race condition more likely?
>
>             


That's my theory. Josh's case is very similar indeed to mine.

cheers

andrew




Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-21 15:26:03 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I spent the last two hours poking around in the environment Andrew
> > provided and I was able to reproduce the issue, find an assert to
> > reproduce it much faster and find a possible root cause.
> 
> Hmm ... is this the same thing Josh is reporting?  If so, why the
> apparent connection with use of pg_stat_statements?  Maybe that
> just changes the timing to make the race condition more likely?

I am not sure. I could only reproduce the problem with both
stat_statements and auto_explain loaded, but I didn't try very
long. It's pretty timing sensitive, so I could very well imagine that
some additional overhead makes it easier to reproduce.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-21 12:31:09 -0700, Josh Berkus wrote:
> On 04/21/2014 12:26 PM, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> >> I spent the last two hours poking around in the environment Andrew
> >> provided and I was able to reproduce the issue, find an assert to
> >> reproduce it much faster and find a possible root cause.
> > 
> > Hmm ... is this the same thing Josh is reporting?  If so, why the
> > apparent connection with use of pg_stat_statements?  Maybe that
> > just changes the timing to make the race condition more likely?
> 
> Well, in Andrew's case it only happens if auto_explain is loaded.

Josh, how long does it take you to reproduce the issue? And can you
reproduce it outside of a production environment?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

От
Josh Berkus
Дата:
> Josh, how long does it take you to reproduce the issue? 

A couple hours.

> And can you
> reproduce it outside of a production environment?

Not yet, still working on that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/21/2014 02:54 PM, Andres Freund wrote:
> Hi,
>
> I spent the last two hours poking around in the environment Andrew
> provided and I was able to reproduce the issue, find an assert to
> reproduce it much faster and find a possible root cause.


What's the assert that makes it happen faster? That might help a lot in 
constructing a self-contained test.

cheers

andrew




Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
All,

I've taken a stab at creating a reproducible test case based on the
characteristics of the production issues I'm seeing.  But clearly
there's an element I'm missing, because I'm not able to produce the bug
with a pgbench-based test case.

My current test has FKs, updates both FK'd tables, updates both
indexed and non-indexed columns, does multiple updates in the same
transaction, and involves lock-blocking.

Files are attached in case someone has better ideas.  queue_bench.sql is
the setup file, and then you do:

pgbench -n -T 600 -c 15 -j 5 -f queue_adder.bench -f queue_worker.bench
-f queue_worker.bench -f queue_worker.bench -f queue_worker.bench {dbname}

... or whatever levels of c and j make sense on your hardware.

FWIW, this particular test case might make a nice destruction test case
for replication testing, too.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Attachments

Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-21 19:43:15 -0400, Andrew Dunstan wrote:
>
> On 04/21/2014 02:54 PM, Andres Freund wrote:
> >Hi,
> >
> >I spent the last two hours poking around in the environment Andrew
> >provided and I was able to reproduce the issue, find an assert to
> >reproduce it much faster and find a possible root cause.
>
>
> What's the assert that makes it happen faster? That might help a lot in
> constructing a self-contained test.

Assertion and *preliminary*, *hacky* fix attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-04-21 19:43:15 -0400, Andrew Dunstan wrote:
> >
> > On 04/21/2014 02:54 PM, Andres Freund wrote:
> > >Hi,
> > >
> > >I spent the last two hours poking around in the environment Andrew
> > >provided and I was able to reproduce the issue, find an assert to
> > >reproduce it much faster and find a possible root cause.
> >
> >
> > What's the assert that makes it happen faster? That might help a lot in
> > constructing a self-contained test.
>
> Assertion and *preliminary*, *hacky* fix attached.

Thanks for the analysis and patches.  I've been playing with this on my
own a bit, and one thing that I just noticed is that at least for
heap_update I cannot reproduce a problem when the xmax is originally a
multixact, so AFAICT the number of places that need patching aren't as
many.

Some testing later, I think the issue only occurs if we determine that
we don't need to wait for the xid/multi to complete, because otherwise
the wait itself saves us.  (It's easy to cause the problem by adding a
breakpoint in heapam.c:3325, i.e. just before re-acquiring the buffer
lock, and then having transaction A lock for key share, then transaction
B update the tuple which stops at the breakpoint, then transaction A
also update the tuple, and finally release transaction B).
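
In SQL terms the interleaving looks roughly like this (a sketch; table
"t" with columns "id" and "val" is hypothetical):

   -- session A:
   begin;
   select * from t where id = 1 for key share;

   -- session B (its update stops at the heapam.c breakpoint):
   begin;
   update t set val = 'b' where id = 1;

   -- session A again, while B is stopped:
   update t set val = 'a' where id = 1;

   -- now release session B: its recheck sees the same raw xmax and
   -- concludes nothing changed, even though A has become an updater.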

For now I offer a cleaned up version of your patch to add the assertion
that multis don't contain multiple updates.  I considered the idea of
making this #ifdef USE_ASSERT_CHECKING, because it has to walk the
complete array of members; and then have full elogs in MultiXactIdExpand
and MultiXactIdCreate, which are lighter because they can check more
easily.  But on second thoughts I refrained from doing that, because
surely the arrays are not as large anyway, are they.

I think I should push this patch first, so that Andrew and Josh can try
their respective test cases which should start throwing errors, then
push the actual fixes.  Does that sound okay?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
On 04/22/2014 02:01 PM, Alvaro Herrera wrote:
> I think I should push this patch first, so that Andrew and Josh can try
> their respective test cases which should start throwing errors, then
> push the actual fixes.  Does that sound okay?

Note that I have a limited ability to actually test my failing test case
-- I have to fire up the user's full application test suite to produce
it, and that needs to be scheduled.  However, Andrew is able to rerun
his test case repeatedly.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Andrew Dunstan
Date:
On 04/22/2014 05:20 PM, Josh Berkus wrote:
> On 04/22/2014 02:01 PM, Alvaro Herrera wrote:
>> I think I should push this patch first, so that Andrew and Josh can try
>> their respective test cases which should start throwing errors, then
>> push the actual fixes.  Does that sound okay?
> Note that I have a limited ability to actually test my failing test case
> -- I have to fire up the user's full application test suite to produce
> it, and that needs to be scheduled.  However, Andrew is able to rerun
> his test case repeatedly.
>

Alvaro has access (as does Andres) to my setup, and instructions on how 
to run the test. I am going to be offline most of the time from tonight 
until next Tuesday.

cheers

andrew





Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
On 04/22/2014 02:01 PM, Alvaro Herrera wrote:
> Some testing later, I think the issue only occurs if we determine that
> we don't need to wait for the xid/multi to complete, because otherwise
> the wait itself saves us.  (It's easy to cause the problem by adding a
> breakpoint in heapam.c:3325, i.e. just before re-acquiring the buffer
> lock, and then having transaction A lock for key share, then transaction
> B update the tuple which stops at the breakpoint, then transaction A
> also update the tuple, and finally release transaction B).

So, trying to make my synthetic test work:

In order to encounter this issue, I'd need to have two concurrent
processes update the child records of the same parent record?  That is:

A ---> B1
  \---> B2

... and the issue should only happen if I update both B1 and B2
concurrently in separate sessions?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-22 17:36:42 -0400, Andrew Dunstan wrote:
> 
> On 04/22/2014 05:20 PM, Josh Berkus wrote:
> >On 04/22/2014 02:01 PM, Alvaro Herrera wrote:
> >>I think I should push this patch first, so that Andrew and Josh can try
> >>their respective test cases which should start throwing errors, then
> >>push the actual fixes.  Does that sound okay?
> >Note that I have a limited ability to actually test my failing test case
> >-- I have to fire up the user's full application test suite to produce
> >it, and that needs to be scheduled.  However, Andrew is able to rerun
> >his test case repeatedly.
> >
> 
> Alvaro has access (as does Andres) to my setup, and instructions on how to
> run the test. I am going to be offline most of the time from tonight until
> next Tuesday.

My preliminary patch seems to have fixed the problem in your setup...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-22 14:40:46 -0700, Josh Berkus wrote:
> On 04/22/2014 02:01 PM, Alvaro Herrera wrote:
> > Some testing later, I think the issue only occurs if we determine that
> > we don't need to wait for the xid/multi to complete, because otherwise
> > the wait itself saves us.  (It's easy to cause the problem by adding a
> > breakpoint in heapam.c:3325, i.e. just before re-acquiring the buffer
> > lock, and then having transaction A lock for key share, then transaction
> > B update the tuple which stops at the breakpoint, then transaction A
> > also update the tuple, and finally release transaction B).
> 
> So, trying to make my synthetic test work:
> 
> In order to encounter this issue, I'd need to have two concurrent
> processes update the child records of the same parent record?  That is:
> 
> A ---> B1
>   \---> B2
> 
> ... and the issue should only happen if I update both B1 and B2
> concurrently in separate sessions?

I don't think that'll trigger it. You need rows that are first key share
locked and then updated by the locking transaction. Under
concurrency. And the time window really is rather small...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-22 18:01:40 -0300, Alvaro Herrera wrote:
> Thanks for the analysis and patches.  I've been playing with this on my
> own a bit, and one thing that I just noticed is that at least for
> heap_update I cannot reproduce a problem when the xmax is originally a
> > multixact, so AFAICT the number of places that need patching aren't as
> many.

I am quite uncomfortable with that assumption. I don't immediately see a
problem for some of the cases, but leaving them in a weaker state than
9.2 makes me uncomfortable.

> For now I offer a cleaned up version of your patch to add the assertion
> that multis don't contain multiple updates.  I considered the idea of
> making this #ifdef USE_ASSERT_CHECKING, because it has to walk the
> complete array of members; and then have full elogs in MultiXactIdExpand
> and MultiXactIdCreate, which are lighter because they can check more
> easily.  But on second thoughts I refrained from doing that, because
> surely the arrays are not as large anyway, are they.

Yea, I think it's fine to do it where it's in the proposed patch.

> I think I should push this patch first, so that Andrew and Josh can try
> their respective test cases which should start throwing errors, then
> push the actual fixes.  Does that sound okay?

+1

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
>> In order to encounter this issue, I'd need to have two concurrent
>> processes update the child records of the same parent record?  That is:
>>
>> A ---> B1
>>   \---> B2
>>
>> ... and the issue should only happen if I update both B1 and B2
>> concurrently in separate sessions?
> 
> I don't think that'll trigger it. You need rows that are first key share
> locked and then updated by the locking transaction. Under
> concurrency. And the time window really is rather small...

Well, currently I have a test which locks A and B1, then updates B1
(twice, actually), and then updates A.  However, since there's a lock on
A, there's no concurrent updating of B1 and B2. This is based on the
behavior of the queue where I originally saw the problem, but it doesn't
reproduce the bug.

I'm thinking I need to just lock B1, update B1, then A, while allowing a
concurrent session to update B2 and A.  No?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Andres Freund
Date:
On 2014-04-22 14:49:00 -0700, Josh Berkus wrote:
> 
> >> In order to encounter this issue, I'd need to have two concurrent
> >> processes update the child records of the same parent record?  That is:
> >>
> >> A ---> B1
> >>   \---> B2
> >>
> >> ... and the issue should only happen if I update both B1 and B2
> >> concurrently in separate sessions?
> > 
> > I don't think that'll trigger it. You need rows that are first key share
> > locked and then updated by the locking transaction. Under
> > concurrency. And the time window really is rather small...
> 
> Well, currently I have a test which locks A and B1, then updates B1
> (twice, actually), and then updates A.  However, since there's a lock on
> A, there's no concurrent updating of B1 and B2. This is based on the
> behavior of the queue where I originally saw the problem, but it doesn't
> reproduce the bug.
> 
> I'm thinking I need to just lock B1, update B1, then A, while allowing a
> concurrent session to update B2 and A.  No?

I don't think this gets any easier to reproduce by introducing more than
one relation. Have one session acquire key share locks and then update,
and another one just doing updates.
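
Something like these two scripts, run concurrently against the same row,
should capture that pattern (a sketch; a table "t" with an int "id"
primary key and a counter column "val" is assumed):

   -- locker: key share lock first, then update, in one transaction
   begin;
   select * from t where id = 1 for key share;
   update t set val = val + 1 where id = 1;
   commit;

   -- updater: plain concurrent updates of the same row
   update t set val = val + 1 where id = 1;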

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Josh Berkus wrote:
> 
> >> In order to encounter this issue, I'd need to have two concurrent
> >> processes update the child records of the same parent record?  That is:
> >>
> >> A ---> B1
> >>   \---> B2
> >>
> >> ... and the issue should only happen if I update both B1 and B2
> >> concurrently in separate sessions?
> > 
> > I don't think that'll trigger it. You need rows that are first key share
> > locked and then updated by the locking transaction. Under
> > concurrency. And the time window really is rather small...
> 
> Well, currently I have a test which locks A and B1, then updates B1
> (twice, actually), and then updates A.  However, since there's a lock on
> A, there's no concurrent updating of B1 and B2. This is based on the
> behavior of the queue where I originally saw the problem, but it doesn't
> reproduce the bug.

If you want to make it easier to reproduce, you need to insert some
pg_usleep() calls in carefully selected spots.  As Andres says, the
window is small normally.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-04-22 18:01:40 -0300, Alvaro Herrera wrote:
> > Thanks for the analysis and patches.  I've been playing with this on my
> > own a bit, and one thing that I just noticed is that at least for
> > heap_update I cannot reproduce a problem when the xmax is originally a
> > multixact, so AFAICT the number of places that need patching aren't as
> > many.
> 
> I am quite uncomfortable with that assumption. I don't immediately see a
> problem for some of the cases, but leaving them in a weaker state than
> 9.2 makes me uncomfortable.

That's true too.

I'm thinking about the comparison of full infomask as you propose
instead of just the bits that we actually care about.   I think the only
thing that could cause a spurious failure (causing an extra execution of
the HeapTupleSatisfiesUpdate call and the stuff below) is somebody
setting HEAP_XMIN_COMMITTED concurrently; but that seems infrequent
enough that it should be pretty harmless.  However, should we worry about
possible future infomask bit changes that could negatively affect this
behavior?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: assertion failure 9.3.4

From:
Josh Berkus
Date:
On 04/22/2014 05:07 PM, Alvaro Herrera wrote:
> If you want to make it easier to reproduce, you need to insert some
> pg_usleep() calls in carefully selected spots.  As Andres says, the
> window is small normally.

Yeah, but the whole point of this is that having
pg_stat_statements/auto_explain loaded should make the window much
bigger.  In the cases where we're hitting it, we're hitting it with a
fair degree of predicability.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> I'm thinking about the comparison of full infomask as you propose
> instead of just the bits that we actually care about.   I think the only
> thing that could cause a spurious failure (causing an extra execution of
> the HeapTupleSatisfiesUpdate call and the stuff below) is somebody
> setting HEAP_XMIN_COMMITTED concurrently; but that seems infrequent
> enough that it should be pretty harmless.  However, should we worry about
> possible future infomask bit changes that could negatively affect this
> behavior?

Here's a complete patch illustrating what I mean.  This is slightly more
expensive than straight infomask comparison in terms of machine
instructions, but that seems okay to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: assertion failure 9.3.4

From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> 
> > I'm thinking about the comparison of full infomask as you propose
> > instead of just the bits that we actually care about.   I think the only
> > thing that could cause a spurious failure (causing an extra execution of
> > the HeapTupleSatisfiesUpdate call and the stuff below) is somebody
> > setting HEAP_XMIN_COMMITTED concurrently; but that seems infrequent
> > enough that it should be pretty harmless.  However, should we worry about
> > possible future infomask bit changes that could negatively affect this
> > behavior?
> 
> Here's a complete patch illustrating what I mean.  This is slightly more
> expensive than straight infomask comparison in terms of machine
> instructions, but that seems okay to me.

I have pushed a slightly tweaked version of this, after verifying that
it solves the problem in Andrew's test environment.

Josh, if you could verify that it solves the problem for you too, it'd
be great.

Thanks for the report and test case.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services