Thread: BUG #15041: dsa alloc_object null pointer

BUG #15041: dsa alloc_object null pointer

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15041
Logged by:          Daniel Farina
Email address:      daniel@fdr.io
PostgreSQL version: 10.1
Operating system:   Linux
Description:

A database that had been operating normally for quite a while suddenly
generated three similar-looking core dumps in close succession. The stack
traces look like this.

It is possible there was unusual memory pressure at the time this occurred.
This is the first occurrence.

#0  alloc_object (size_class=<optimized out>, area=0x0) at dsa.c:1433
#1  dsa_allocate_extended (area=0x0, size=size@entry=72,
flags=flags@entry=4) at dsa.c:785
#2  0x000000000062d277 in tbm_prepare_shared_iterate
(tbm=tbm@entry=0x1e54160) at tidbitmap.c:807
#3  0x00000000005f69a0 in BitmapHeapNext (node=node@entry=0x1d22a48) at
nodeBitmapHeapscan.c:155
#4  0x00000000005ebe63 in ExecScanFetch (recheckMtd=0x5f62c0
<BitmapHeapRecheck>, accessMtd=0x5f6320 <BitmapHeapNext>, node=0x1d22a48) at
execScan.c:97
#5  ExecScan (node=0x1d22a48, accessMtd=0x5f6320 <BitmapHeapNext>,
recheckMtd=0x5f62c0 <BitmapHeapRecheck>) at execScan.c:164
#6  0x000000000060402f in ExecProcNode (node=0x1d22a48) at
../../../src/include/executor/executor.h:250
#7  ExecNestLoop (pstate=0x1d22888) at nodeNestloop.c:109
#8  0x0000000000606506 in ExecProcNode (node=0x1d22888) at
../../../src/include/executor/executor.h:250
#9  ExecSort (pstate=0x1d22618) at nodeSort.c:106
#10 0x00000000005fed83 in ExecProcNode (node=0x1d22618) at
../../../src/include/executor/executor.h:250
#11 gather_merge_readnext (gm_state=0x1d223a8, reader=<optimized out>,
nowait=<optimized out>) at nodeGatherMerge.c:631
#12 0x00000000005ff06d in gather_merge_init (gm_state=0x1d223a8) at
nodeGatherMerge.c:468
#13 gather_merge_getnext (gm_state=0x1d223a8) at nodeGatherMerge.c:536
#14 ExecGatherMerge (pstate=0x1d223a8) at nodeGatherMerge.c:250
#15 0x00000000005fdfc9 in ExecProcNode (node=0x1d223a8) at
../../../src/include/executor/executor.h:250
#16 ExecLimit (pstate=0x1d21c38) at nodeLimit.c:95
#17 0x00000000005e6672 in ExecProcNode (node=0x1d21c38) at
../../../src/include/executor/executor.h:250
#18 ExecutePlan (execute_once=<optimized out>, dest=0x1ddc088,
direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
operation=CMD_SELECT, 
    use_parallel_mode=<optimized out>, planstate=0x1d21c38,
estate=0x1d21988) at execMain.c:1722
#19 standard_ExecutorRun (queryDesc=0x161b1f8, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#20 0x00007f7f42c12fd5 in pgss_ExecutorRun (queryDesc=0x161b1f8,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at
pg_stat_statements.c:889
#21 0x000000000070850c in PortalRunSelect (portal=portal@entry=0x135a8e8,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x1ddc088)
    at pquery.c:932
#22 0x0000000000709880 in PortalRun (portal=portal@entry=0x135a8e8,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001',
    run_once=run_once@entry=1 '\001', dest=dest@entry=0x1ddc088,
altdest=altdest@entry=0x1ddc088, completionTag=0x7fff5626f4d0 "") at
pquery.c:773
#23 0x00000000007059f9 in exec_simple_query (


Re: BUG #15041: dsa alloc_object null pointer

From
Thomas Munro
Date:
On Thu, Feb 1, 2018 at 8:48 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      15041
> Logged by:          Daniel Farina
> Email address:      daniel@fdr.io
> PostgreSQL version: 10.1
> Operating system:   Linux
> Description:
>
> A database that was operating normally for quite a while suddenly generated
> three similar looking core-dumps near one another. The stack traces look
> like this.
>
> It is possible there was unusual memory pressure, at the time this occurred.
> This is the first occurrence.
>
> #0  alloc_object (size_class=<optimized out>, area=0x0) at dsa.c:1433
> #1  dsa_allocate_extended (area=0x0, size=size@entry=72,
> flags=flags@entry=4) at dsa.c:785
> #2  0x000000000062d277 in tbm_prepare_shared_iterate
> (tbm=tbm@entry=0x1e54160) at tidbitmap.c:807
> #3  0x00000000005f69a0 in BitmapHeapNext (node=node@entry=0x1d22a48) at
> nodeBitmapHeapscan.c:155

Hi Daniel,

Thanks for the report.  This looks like the bug described here, where
"area" is a NULL pointer because we failed to launch a parallel query
(i.e. we're running a parallel query plan, but there are no workers and
no shared memory):

https://www.postgresql.org/message-id/CAEepm=0kADK5inNf_KuemjX=HQ=PuTP0DykM--fO5jS5ePVFEA@mail.gmail.com

It was fixed in commit c6755e233be1cccadd0884d952a2bb455fa0db1f and
back-patched to REL_10_STABLE, so the fix will be in 10.2 (target 8th
Feb).  The cause is running out of DSM slots, but not handling that
case correctly.  I think this implies that you're running queries with
a lot of Gather [Merge] nodes in them? The number of DSM slots is 64 +
2 * max_connections, so one workaround is to crank up max_connections,
and another is just to disable parallelism for that query.
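
To make the slot arithmetic above concrete, here is a minimal sketch in C. The macro and function names are hypothetical and chosen only for illustration; the constants 64 and 2 come from the formula quoted in this message, and the real server code may count backends somewhat differently from a plain max_connections value.

```c
#include <assert.h>

/* Hypothetical names; the values come from the formula quoted above. */
#define FIXED_DSM_SLOTS        64
#define DSM_SLOTS_PER_BACKEND  2

/*
 * Rough estimate of how many DSM slots are available for a given
 * max_connections setting: 64 + 2 * max_connections.
 */
static int
dsm_slot_estimate(int max_connections)
{
    assert(max_connections >= 0);
    return FIXED_DSM_SLOTS + DSM_SLOTS_PER_BACKEND * max_connections;
}
```

For example, dsm_slot_estimate(100) evaluates to 264 and dsm_slot_estimate(500) to 1064, which is why raising max_connections gives queries with many Gather nodes more headroom.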

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15041: dsa alloc_object null pointer

From
Thomas Munro
Date:
On Thu, Feb 1, 2018 at 9:04 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> The cause is running out of DSM slots, but not handling that
> case correctly.  I think this implies that you're running queries with
> a lot of Gather [Merge] nodes in them? The number of DSM slots is 64 +
> 2 * max_connections, so one workaround is to crank up max_connections,
> and another is just to disable parallelism for that query.

Alternatively, it could be running out of workers due to the
max_parallel_workers or max_worker_processes limits, which on reflection
may be more likely (the other cases like this that I ran into ran out of
DSM slots, because they had very high numbers of Gather nodes).
Either way, it's running out of some resource needed for parallel
query and falling back to non-parallel execution, but Bitmap Heap Scan
had failed to anticipate that possibility.
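
The general shape of the missing check can be sketched as follows. This is an illustration under stated assumptions, not the code from the actual fix: the SketchArea type, the SketchIterMode enum and choose_iter_mode() are hypothetical stand-ins, and the only idea taken from this thread is that a NULL "area" means parallel setup did not happen and nothing may be allocated from it.

```c
#include <stddef.h>

/* Hypothetical stand-in for a shared memory area; not PostgreSQL's type. */
typedef struct SketchArea
{
    int unused;
} SketchArea;

typedef enum SketchIterMode
{
    SKETCH_ITER_PRIVATE,    /* backend-local iteration, no shared memory */
    SKETCH_ITER_SHARED      /* iteration state allocated in the shared area */
} SketchIterMode;

/*
 * Decide how to iterate.  A NULL area means the parallel machinery was
 * never set up, so the caller must fall back to private iteration
 * instead of trying to allocate from the area.
 */
static SketchIterMode
choose_iter_mode(const SketchArea *area)
{
    if (area == NULL)
        return SKETCH_ITER_PRIVATE;
    return SKETCH_ITER_SHARED;
}
```

In the crash reported above, the equivalent of this check was absent, so the allocation path was entered with area=0x0.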

-- 
Thomas Munro
http://www.enterprisedb.com