Discussion: ERROR: too many dynamic shared memory segments
Hi,
I started seeing these errors on PostgreSQL 10.1 running on an Ubuntu 16.04.3
(64GB/8CPU) server.
The DB enters recovery mode after that.
2017-11-23 07:20:39 CET::@:[24823]: ERROR: could not attach to dynamic
shared area
2017-11-23 07:20:39 CET::@:[24822]: ERROR: could not map dynamic shared
memory segment
2017-11-23 07:20:40 CET:192.168.xx,xx(33974):u(at)db:[24209]: ERROR: too
many dynamic shared memory segments
The errors happen when parallel execution is enabled and multiple
queries are executed simultaneously.
If I set max_parallel_workers_per_gather = 0 the error doesn't occur.
The queries are rather big; each executes for over 10 seconds.
I see the error when 4 or more queries are started at the same time.
my postgresql.conf:
max_connections = 100
shared_buffers = 12GB
effective_cache_size = 51GB
work_mem = 83886kB
maintenance_work_mem = 2GB
max_wal_size = 8GB
min_wal_size = 4GB
checkpoint_completion_target = 0.9
wal_buffers = 32MB
default_statistics_target = 1000
max_locks_per_transaction = 128
#max_parallel_workers_per_gather = 0
PS. I've posted this question on pgsql-admin distro but didn't get any
feedback.
--
regards,
Jakub Glapa
On Mon, Nov 27, 2017 at 10:54 PM, Jakub Glapa <jakub.glapa@gmail.com> wrote:
> The DB enters recovery mode after that.

That's not good.  So it actually crashes?  Can you please show the
full error messages?

> 2017-11-23 07:20:39 CET::@:[24823]: ERROR: could not attach to dynamic
> shared area

From src/backend/utils/mmgr/dsa.c.  The same message is used for
dsa_attach() and for attach_internal(), but in this case it must be
the latter because we use in-place DSA areas for parallel query.  This
means that when the worker tried to attach it found that
control->refcnt == 0, meaning 'too late, every other backend including
the leader has already detached'.

> 2017-11-23 07:20:39 CET::@:[24822]: ERROR: could not map dynamic shared
> memory segment

From src/backend/access/transam/parallel.c when dsm_attach returns
null.  I think this also means 'too late'.

So those are both workers that have started up and found that the
leader has abandoned the parallel query already, but they discovered
it at different stages.  PID 24823 didn't manage to attach to the DSM
segment, while PID 24822 did but found that the other(s) had already
detached from the per-query DSA area inside it.

> 2017-11-23 07:20:40 CET:192.168.xx,xx(33974):u(at)db:[24209]: ERROR: too
> many dynamic shared memory segments
>
> The errors happen when the parallel execution is enabled and multiple
> queries are executed simultaneously.
> If I set the max_parallel_workers_per_gather = 0 the error doesn't occur.
> The queries are rather big, each executes over 10sec.
> I see the error when 4 or more queries are started at the same time.
>
> my postgresql.conf:
>
> max_connections = 100

So the two questions we need to figure out are: (1) how do we manage
to use up all 64 + 2 * 100 DSM slots (or somehow corrupt things so it
looks that way) by running only 4 queries, and (2) why would we be
crashing rather than failing and reporting an error?

I'm not able to reproduce the problem from your description running
lots of parallel queries at the same time.  Can you provide
reproduction steps?  Does your query plan happen to include a Parallel
Bitmap Heap Scan?

--
Thomas Munro
http://www.enterprisedb.com
Hi Thomas,
log excerpt:
...
2017-11-27 12:21:14 CET:192.168.10.83(33424):user@db:[27291]: ERROR: too
many dynamic shared memory segments
2017-11-27 12:21:14 CET:192.168.10.83(33424):user@db:[27291]: STATEMENT: SELECT <<REMOVED>>
2017-11-27 12:21:14 CET:192.168.10.83(35182):user@db:[28281]: ERROR: too
many dynamic shared memory segments
2017-11-27 12:21:14 CET:192.168.10.83(35182):user@db:[28281]: STATEMENT: SELECT <<REMOVED>>
2017-11-27 12:21:14 CET::@:[28618]: ERROR: could not map dynamic shared
memory segment
2017-11-27 12:21:14 CET::@:[28619]: ERROR: could not map dynamic shared
memory segment
2017-11-27 12:21:14 CET::@:[25645]: LOG: worker process: parallel worker
for PID 27291 (PID 28618) exited with exit code 1
2017-11-27 12:21:14 CET::@:[25645]: LOG: worker process: parallel worker
for PID 27291 (PID 28619) exited with exit code 1
...
This time the DB didn't crash, but the queries failed to execute.
The queries are somewhat special.
We are still using the old-style partitioning (list type) but we abuse it a
bit when querying.
When querying a set of partitions, instead of doing it via the parent table we
stitch together the required tables with UNION ALL (this requires less
locking and was more performant in our benchmarks; the new native
partitioning might improve this but we haven't researched that yet).
The queries are in the form of:

SELECT col1,col2,col3 FROM
(SELECT * FROM par1
 WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
 UNION ALL
 SELECT * FROM par2
 WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
 UNION ALL
 SELECT * FROM par2
 WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
 WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
 UNION ALL
 ...
) unionalias
WHERE ....
and indeed the query planner shows the usage of Bitmap Heap Scan
....
-> Subquery Scan on "*SELECT* 2"  (cost=3068.58..19793.94 rows=1 width=1284)
   -> Gather  (cost=3068.58..19793.93 rows=1 width=5014)
      Workers Planned: 2
      -> Parallel Bitmap Heap Scan on par_6  (cost=2068.58..18793.83 rows=1 width=5014)
         Recheck Cond: <<CONDITION>>
         Filter: <<CONDITION>>
         -> BitmapAnd  (cost=2068.58..2068.58 rows=30955 width=0)
            -> BitmapOr  (cost=999.30..999.30 rows=42989 width=0)
               -> Bitmap Index Scan on par_6_datasource  (cost=0.00..990.21 rows=42922 width=0)
                  Index Cond: ((datasource)::text = 'one'::text)
               -> Bitmap Index Scan on par_6_datasource  (cost=0.00..4.30 rows=1 width=0)
                  Index Cond: ((datasource)::text = 'two'::text)
               -> Bitmap Index Scan on par_6_datasource  (cost=0.00..4.79 rows=67 width=0)
                  Index Cond: ((datasource)::text = 'three'::text)
            -> Bitmap Index Scan on par_6_rangestart  (cost=0.00..1069.02 rows=47564 width=0)
               Index Cond: (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
-> Subquery Scan on "*SELECT* 3"  (cost=761.33..7944.99 rows=1 width=1086)
   -> Bitmap Heap Scan on par_7  (cost=761.33..7944.98 rows=1 width=4816)
      Recheck Cond: <<CONDITION>>
      Filter: <<CONDITION>>
      -> BitmapAnd  (cost=761.33..761.33 rows=7045 width=0)
         -> Bitmap Index Scan on par_7_rangestart  (cost=0.00..380.35 rows=14942 width=0)
            Index Cond: (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
         -> BitmapOr  (cost=380.72..380.72 rows=12248 width=0)
            -> Bitmap Index Scan on par_7_datasource  (cost=0.00..372.00 rows=12228 width=0)
               Index Cond: ((datasource)::text = 'one'::text)
            -> Bitmap Index Scan on par_7_datasource  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: ((datasource)::text = 'two'::text)
            -> Bitmap Index Scan on par_7_datasource  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: ((datasource)::text = 'three'::text)
....
In this particular query there were over _100_ partitions connected with
the UNION ALL operator.
--
regards,
pozdrawiam,
Jakub Glapa
On Mon, Nov 27, 2017 at 11:47 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:
> On Mon, Nov 27, 2017 at 10:54 PM, Jakub Glapa
> wrote:
> > The DB enters recovery mode after that.
>
> That's not good. So it actually crashes? Can you please show the
> full error messages?
>
> > 2017-11-23 07:20:39 CET::@:[24823]: ERROR: could not attach to dynamic
> > shared area
>
> From src/backend/utils/mmgr/dsa.c. The same message is used for
> dsa_attach() and for attach_internal(), but in this case it must be
> the latter because we use in-place DSA areas for parallel query. This
> means that when the worker tried to attach it found that
> control->refcnt == 0, meaning 'too late, every other backend including
> the leader has already detached'.
>
> > 2017-11-23 07:20:39 CET::@:[24822]: ERROR: could not map dynamic shared
> > memory segment
>
> From src/backend/access/transam/parallel.c when dsm_attach returns
> null. I think this also means 'too late'.
>
> So those are both workers that have started up and found that the
> leader has abandoned the parallel query already, but they discovered
> it at different stages. PID 24823 didn't manage to attach to the DSM
> segment, while PID 24822 did but found that the other(s) had already
> detached from the per-query DSA area inside it.
>
> > 2017-11-23 07:20:40 CET:192.168.xx,xx(33974):u(at)db:[24209]: ERROR:
> too
> > many
> > dynamic shared memory segments
> >
> > The errors happen when the parallel execution is enabled and multiple
> > queries are executed simultaneously.
> > If I set the max_parallel_workers_per_gather = 0 the error doesn't
> occur.
> > The queries are rather big, each executes over 10sec.
> > I see the error when 4 or more queries are started at the same time.
> >
> > my postgresql.conf:
> >
> > max_connections = 100
>
> So the two questions we need to figure out are: (1) how do we manage
> to use up all 64 + 2 * 100 DSM slots (or somehow corrupt things so it
> looks that way) by running only 4 queries, and (2) why would we be
> crashing rather than failing and reporting an error? I'm not able to
> reproduce the problem from your description running lots of parallel
> queries running at the same time. Can you provide reproduction steps?
> Does your query plan happen to include a Parallel Bitmap Heap Scan?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
On Tue, Nov 28, 2017 at 1:13 AM, Jakub Glapa <jakub.glapa@gmail.com> wrote:
> The queries are somehow special.
> We are still using the old style partitioning (list type) but we abuse it a
> bit when querying.
> When querying a set of partitions instead of doing it via parent table we
> stitch together the required tables with UNION ALL (this requires less
> locking) and was more performant in our benchmark (the new native
> partitioning might improve this but we didn't research that yet).
>
> The queries are in form of
> SELECT col1,col2,col3 FROM
> (SELECT * FROM par1
>  WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
>  UNION ALL
>  SELECT * FROM par2
>  WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
>  UNION ALL
>  SELECT * FROM par2
>  WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
>  WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <= '2017-11-26 23:59:59.999'
>  UNION ALL
>  ...
> ) unionalias
> WHERE ....
>
> and indeed the query planner shows the usage of Bitmap Heap Scan
> ....
> -> Subquery Scan on "*SELECT* 2"  (cost=3068.58..19793.94 rows=1 width=1284)
>    -> Gather  (cost=3068.58..19793.93 rows=1 width=5014)
>       Workers Planned: 2
>       -> Parallel Bitmap Heap Scan on par_6  (cost=2068.58..18793.83 rows=1 width=5014)
>          Recheck Cond: <<CONDITION>>
>          Filter: <<CONDITION>>
>          -> BitmapAnd  (cost=2068.58..2068.58 rows=30955 width=0)
>             -> BitmapOr  (cost=999.30..999.30 rows=42989 width=0)
>                -> Bitmap Index Scan on par_6_datasource  (cost=0.00..990.21 rows=42922 width=0)
>                   Index Cond: ((datasource)::text = 'one'::text)
>                -> Bitmap Index Scan on par_6_datasource  (cost=0.00..4.30 rows=1 width=0)
>                   Index Cond: ((datasource)::text = 'two'::text)
>                -> Bitmap Index Scan on par_6_datasource  (cost=0.00..4.79 rows=67 width=0)
>                   Index Cond: ((datasource)::text = 'three'::text)
>             -> Bitmap Index Scan on par_6_rangestart  (cost=0.00..1069.02 rows=47564 width=0)
>                Index Cond: (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
> -> Subquery Scan on "*SELECT* 3"  (cost=761.33..7944.99 rows=1 width=1086)
>    -> Bitmap Heap Scan on par_7  (cost=761.33..7944.98 rows=1 width=4816)
>       Recheck Cond: <<CONDITION>>
>       Filter: <<CONDITION>>
>       -> BitmapAnd  (cost=761.33..761.33 rows=7045 width=0)
>          -> Bitmap Index Scan on par_7_rangestart  (cost=0.00..380.35 rows=14942 width=0)
>             Index Cond: (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
>          -> BitmapOr  (cost=380.72..380.72 rows=12248 width=0)
>             -> Bitmap Index Scan on par_7_datasource  (cost=0.00..372.00 rows=12228 width=0)
>                Index Cond: ((datasource)::text = 'one'::text)
>             -> Bitmap Index Scan on par_7_datasource  (cost=0.00..4.36 rows=10 width=0)
>                Index Cond: ((datasource)::text = 'two'::text)
>             -> Bitmap Index Scan on par_7_datasource  (cost=0.00..4.36 rows=10 width=0)
>                Index Cond: ((datasource)::text = 'three'::text)
> ....
>
> In this particular query there were over _100_ partitions connected with the
> UNION ALL operator.
Ah, so you have many Gather nodes under Append?  That's one way to eat
arbitrarily many DSM slots.  We allow for 64 + 2 * max_backends.  Does
it help if you increase max_connections?  I am concerned about the
crash failure mode you mentioned in the first email though: we should
always be able to handle that condition gracefully.

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Ah, so you have many Gather nodes under Append?  That's one way to eat
> arbitrarily many DSM slots.  We allow for 64 + 2 * max_backends.  Does
> it help if you increase max_connections?  I am concerned about the
> crash failure mode you mentioned in the first email though: we should
> always be able to handle that condition gracefully.

I suspect this is an instance of the issue I complained about before
[1] that parallel query is unable to cope with worker start failure.

			regards, tom lane

[1] https://www.postgresql.org/message-id/4905.1492813727@sss.pgh.pa.us
Hi Thomas,
Doubling max_connections made the problem go away for now! Yay!
As for the crash: I dug up the initial log and it looks like a segmentation
fault...
2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR: too
many dynamic shared memory segments
2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: STATEMENT: SELECT ....
2017-11-23 07:26:55 CET::@:[29398]: LOG: server process (PID 26992) was
terminated by signal 11: Segmentation fault
2017-11-23 07:26:55 CET::@:[29398]: DETAIL: Failed process was running:
SELECT .....
2017-11-23 07:26:55 CET::@:[29398]: LOG: terminating any other active
server processes
2017-11-23 07:26:55 CET:192.168.10.83(35242):user@db:[30009]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(35242):user@db:[30009]: DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(35242):user@db:[30009]: HINT: In a
moment you should be able to reconnect to the database and repeat your
command.
2017-11-23 07:26:55 CET:192.168.10.83(35268):user@db:[30097]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(35268):user@db:[30097]: DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(35268):user@db:[30097]: HINT: In a
moment you should be able to reconnect to the database and repeat your
command.
2017-11-23 07:26:55 CET:192.168.10.83(46614):user@db:[31080]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(46614):user@db:[31080]: DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2017-11-23 07:26:55 CET:192.168.10.83(46614):user@db: HINT: In a moment
you should be able to reconnect to the database and repeat your command.
2017-11-23 07:26:55 CET:192.168.10.83(35238):user@db:[30003]: WARNING: terminating connection because of crash of another server process
2017-11-23 07:26:55 CET:192.168.10.83(35238):user@db:[30003]: DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory
2017-11-23 07:26:55 CET:192.168.10.83(35300):user@db:[30152]: FATAL: the
database system is in recovery mode
--
regards,
pozdrawiam,
Jakub Glapa
On Mon, Nov 27, 2017 at 7:53 PM, Thomas Munro wrote:
> On Tue, Nov 28, 2017 at 1:13 AM, Jakub Glapa
> wrote:
> > The queries are somehow special.
> > We are still using the old style partitioning (list type) but we abuse
> it a
> > bit when querying.
> > When querying a set of partitions instead of doing it via parent table we
> > stitch together the required tables with UNION ALL (this requires less
> > locking) and was more performant in our benchmark (the new native
> > partitioning might improve this but we didn't research that yet).
> >
> > The queries are in form of
> > SELECT col1,col2,col3 FROM
> > (SELECT *
> > FROM par1
> > WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <=
> > '2017-11-26 23:59:59.999'
> > UNION ALL SELECT *
> > FROM par2
> > WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend
> <=
> > '2017-11-26 23:59:59.999'
> > UNION ALL SELECT *
> > FROM par2
> > WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend
> <=
> > '2017-11-26 23:59:59.999'
> >
> > WHERE rangestart >= '2017-01-01 00:00:00.0' AND rangeend <=
> > '2017-11-26 23:59:59.999'
> >
> > UNION ALL
> > ...
> > ) unionalias
> > WHERE ....
> >
> >
> >
> > and indeed the query planner shows the usage of Bitmap Heap Scan
> > ....
> > -> Subquery Scan on "*SELECT* 2"
> > (cost=3068.58..19793.94 rows=1 width=1284)
> > -> Gather
> (cost=3068.58..19793.93
> > rows=1 width=5014)
> > Workers Planned: 2
> > -> Parallel Bitmap Heap
> Scan
> > on par_6 (cost=2068.58..18793.83 rows=1 width=5014)
> > Recheck Cond:
> > <>
> > Filter: <>
> > -> BitmapAnd
> > (cost=2068.58..2068.58 rows=30955 width=0)
> > -> BitmapOr
> > (cost=999.30..999.30 rows=42989 width=0)
> > -> Bitmap
> > Index Scan on par_6_datasource (cost=0.00..990.21 rows=42922 width=0)
> >
> Index
> > Cond: ((datasource)::text = 'one'::text)
> > -> Bitmap
> > Index Scan on par_6_datasource (cost=0.00..4.30 rows=1 width=0)
> >
> Index
> > Cond: ((datasource)::text = 'two'::text)
> > -> Bitmap
> > Index Scan on par_6_datasource (cost=0.00..4.79 rows=67 width=0)
> >
> Index
> > Cond: ((datasource)::text = 'three'::text)
> > -> Bitmap Index
> > Scan on par_6_rangestart (cost=0.00..1069.02 rows=47564 width=0)
> > Index
> Cond:
> > (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
> > -> Subquery Scan on "*SELECT* 3"
> > (cost=761.33..7944.99 rows=1 width=1086)
> > -> Bitmap Heap Scan on par_7
> > (cost=761.33..7944.98 rows=1 width=4816)
> > Recheck Cond:
> > <>
> > Filter: <>
> > -> BitmapAnd
> > (cost=761.33..761.33 rows=7045 width=0)
> > -> Bitmap Index Scan
> on
> > par_7_rangestart (cost=0.00..380.35 rows=14942 width=0)
> > Index Cond:
> > (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
> > -> BitmapOr
> > (cost=380.72..380.72 rows=12248 width=0)
> > -> Bitmap Index
> > Scan on par_7_datasource (cost=0.00..372.00 rows=12228 width=0)
> > Index
> Cond:
> > ((datasource)::text = 'one'::text)
> > -> Bitmap Index
> > Scan on par_7_datasource (cost=0.00..4.36 rows=10 width=0)
> > Index
> Cond:
> > ((datasource)::text = 'two'::text)
> > -> Bitmap Index
> > Scan on par_7_datasource (cost=0.00..4.36 rows=10 width=0)
> > Index
> Cond:
> > ((datasource)::text = 'three'::text)
> >
> > ....
> >
> >
> > In this particular query there were over _100_ partitions connected with
> the
> > UNION ALL operator.
>
> Ah, so you have many Gather nodes under Append? That's one way to eat
> arbitrarily many DSM slots. We allow for 64 + 2 * max_backends. Does
> it help if you increase max_connections? I am concerned about the
> crash failure mode you mentioned in the first email though: we should
> always be able to handle that condition gracefully.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa <jakub.glapa@gmail.com> wrote:
> As for the crash. I dug up the initial log and it looks like a segmentation
> fault...
>
> 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR: too
> many dynamic shared memory segments

Hmm.  Well this error can only occur in dsm_create() called without
DSM_CREATE_NULL_IF_MAXSEGMENTS.  parallel.c calls it with that flag
and dsa.c doesn't (perhaps it should, not sure, but that'd just change
the error message), so that means the error arose from dsa.c trying to
get more segments.  That would be when Parallel Bitmap Heap Scan tried
to allocate memory.

I hacked my copy of PostgreSQL so that it allows only 5 DSM slots and
managed to reproduce a segv crash by trying to run concurrent Parallel
Bitmap Heap Scans.  The stack looks like this:

* frame #0: 0x00000001083ace29 postgres`alloc_object(area=0x0000000000000000, size_class=10) + 25 at dsa.c:1433
  frame #1: 0x00000001083acd14 postgres`dsa_allocate_extended(area=0x0000000000000000, size=72, flags=4) + 1076 at dsa.c:785
  frame #2: 0x0000000108059c33 postgres`tbm_prepare_shared_iterate(tbm=0x00007f9743027660) + 67 at tidbitmap.c:780
  frame #3: 0x0000000108000d57 postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at nodeBitmapHeapscan.c:156
  frame #4: 0x0000000107fefc5b postgres`ExecScanFetch(node=0x00007f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 459 at execScan.c:95
  frame #5: 0x0000000107fef983 postgres`ExecScan(node=0x00007f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 147 at execScan.c:162
  frame #6: 0x00000001080008d1 postgres`ExecBitmapHeapScan(pstate=0x00007f9743019c88) + 49 at nodeBitmapHeapscan.c:735

(lldb) f 3
frame #3: 0x0000000108000d57 postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at nodeBitmapHeapscan.c:156
   153	     * dsa_pointer of the iterator state which will be used by
   154	     * multiple processes to iterate jointly.
   155	     */
-> 156	    pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
   157	#ifdef USE_PREFETCH
   158	    if (node->prefetch_maximum > 0)
   159
(lldb) print tbm->dsa
(dsa_area *) $3 = 0x0000000000000000
(lldb) print node->ss.ps.state->es_query_dsa
(dsa_area *) $5 = 0x0000000000000000
(lldb) f 17
frame #17: 0x000000010800363b postgres`ExecGather(pstate=0x00007f9743019320) + 635 at nodeGather.c:220
   217	     * Get next tuple, either from one of our workers, or by running the plan
   218	     * ourselves.
   219	     */
-> 220	    slot = gather_getnext(node);
   221	    if (TupIsNull(slot))
   222	        return NULL;
   223
(lldb) print *node->pei
(ParallelExecutorInfo) $8 = {
  planstate = 0x00007f9743019640
  pcxt = 0x00007f97450001b8
  buffer_usage = 0x0000000108b7e218
  instrumentation = 0x0000000108b7da38
  area = 0x0000000000000000
  param_exec = 0
  finished = '\0'
  tqueue = 0x0000000000000000
  reader = 0x0000000000000000
}
(lldb) print *node->pei->pcxt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
(ParallelContext) $9 = {
  node = {
    prev = 0x000000010855fb60
    next = 0x000000010855fb60
  }
  subid = 1
  nworkers = 0
  nworkers_launched = 0
  library_name = 0x00007f9745000248 "postgres"
  function_name = 0x00007f9745000268 "ParallelQueryMain"
  error_context_stack = 0x0000000000000000
  estimator = (space_for_chunks = 180352, number_of_keys = 19)
  seg = 0x0000000000000000
  private_memory = 0x0000000108b53038
  toc = 0x0000000108b53038
  worker = 0x0000000000000000
}

I think there are two failure modes: one of your sessions showed the
"too many ..." error (that's good, ran out of slots and said so and
our error machinery worked as it should), and another crashed with a
segfault, because it tried to use a NULL "area" pointer (bad).  I
think this is a degenerate case where we completely failed to launch
parallel query, but we ran the parallel query plan anyway and this
code thinks that the DSA is available.  Oops.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Nov 28, 2017 at 4:18 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa <jakub.glapa@gmail.com> wrote:
> As for the crash. I dug up the initial log and it looks like a segmentation
> fault...
>
> 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR: too
> many dynamic shared memory segments
I think there are two failure modes: one of your sessions showed the
"too many ..." error (that's good, ran out of slots and said so and
our error machinery worked as it should), and another crashed with a
segfault, because it tried to use a NULL "area" pointer (bad). I
think this is a degenerate case where we completely failed to launch
parallel query, but we ran the parallel query plan anyway and this
code thinks that the DSA is available. Oops.
I think BitmapHeapScan should check whether the DSA is valid; if the DSA is not valid then it should assume it's a non-parallel plan.
Attached patch should fix the issue.
Attachments
On Tue, Nov 28, 2017 at 2:32 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think BitmapHeapScan check whether dsa is valid or not if DSA is not
> valid then it should assume it's non-parallel plan.
>
> Attached patch should fix the issue.

So, create the pstate and then pretend we didn't?  Why not just avoid
creating it in the first place, like this?

I haven't checked whether this fixes the bug, but if it does, we can
avoid introducing an extra branch in BitmapHeapNext.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Tue, Nov 28, 2017 at 7:13 PM, Robert Haas wrote:
> On Tue, Nov 28, 2017 at 2:32 AM, Dilip Kumar
> wrote:
> > I think BitmapHeapScan check whether dsa is valid or not if DSA is not
> > valid then it should assume it's non-parallel plan.
> >
> > Attached patch should fix the issue.
>
> So, create the pstate and then pretend we didn't? Why not just avoid
> creating it in the first place, like this?
>
This is a better way to fix it.
>
> I haven't checked whether this fixes the bug, but if it does, we can
> avoid introducing an extra branch in BitmapHeapNext.
With my test it's fixing the problem.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 28, 2017 at 9:45 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I haven't checked whether this fixes the bug, but if it does, we can
>> avoid introducing an extra branch in BitmapHeapNext.
>
> With my test it's fixing the problem.

I tested it some more and found that, for me, it PARTIALLY fixes the
problem.  I tested like this:

--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -279,7 +279,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
      * parallelism than to fail outright.
      */
     segsize = shm_toc_estimate(&pcxt->estimator);
-    if (pcxt->nworkers > 0)
+    if (pcxt->nworkers > 0 && false)
         pcxt->seg = dsm_create(segsize, DSM_CREATE_NULL_IF_MAXSEGMENTS);
     if (pcxt->seg != NULL)
         pcxt->toc = shm_toc_create(PARALLEL_MAGIC,

That turned out to produce more than one problem.  I find that the
select_parallel test then fails like this:

ERROR:  could not find key 18446744073709486082 in shm TOC at 0x10be98040

The fix for that problem seems to be:

--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -430,7 +430,8 @@ ReinitializeParallelDSM(ParallelContext *pcxt)

     /* Recreate error queues. */
     error_queue_space =
-        shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, false);
+        shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, true);
+    Assert(pcxt->nworkers == 0 || error_queue_space != NULL);
     for (i = 0; i < pcxt->nworkers; ++i)
     {
         char *start;

With that fix in place, I then hit a crash in parallel bitmap heap
scan.  After applying no-pstate.patch, which I just committed and
back-patched to v10, things look OK.  I'm going to apply the fix for
the error_queue_space problem also.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I see that the segfault is under active discussion but just wanted to ask
if increasing the max_connections to mitigate the DSM slots shortage is the
way to go?
--
regards,
Jakub Glapa
On Mon, Nov 27, 2017 at 11:48 PM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:
> On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa
> wrote:
> > As for the crash. I dug up the initial log and it looks like a
> segmentation
> > fault...
> >
> > 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR:
> too
> > many dynamic shared memory segments
>
> Hmm. Well this error can only occur in dsm_create() called without
> DSM_CREATE_NULL_IF_MAXSEGMENTS. parallel.c calls it with that flag
> and dsa.c doesn't (perhaps it should, not sure, but that'd just change
> the error message), so that means the error arose from dsa.c
> trying to get more segments. That would be when Parallel Bitmap Heap
> Scan tried to allocate memory.
>
> I hacked my copy of PostgreSQL so that it allows only 5 DSM slots and
> managed to reproduce a segv crash by trying to run concurrent Parallel
> Bitmap Heap Scans. The stack looks like this:
>
> * frame #0: 0x00000001083ace29
> postgres`alloc_object(area=0x0000000000000000, size_class=10) + 25 at
> dsa.c:1433
> frame #1: 0x00000001083acd14
> postgres`dsa_allocate_extended(area=0x0000000000000000, size=72,
> flags=4) + 1076 at dsa.c:785
> frame #2: 0x0000000108059c33
> postgres`tbm_prepare_shared_iterate(tbm=0x00007f9743027660) + 67 at
> tidbitmap.c:780
> frame #3: 0x0000000108000d57
> postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at
> nodeBitmapHeapscan.c:156
> frame #4: 0x0000000107fefc5b
> postgres`ExecScanFetch(node=0x00007f9743019c88,
> accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77),
> recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) +
> 459 at execScan.c:95
> frame #5: 0x0000000107fef983
> postgres`ExecScan(node=0x00007f9743019c88,
> accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77),
> recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) +
> 147 at execScan.c:162
> frame #6: 0x00000001080008d1
> postgres`ExecBitmapHeapScan(pstate=0x00007f9743019c88) + 49 at
> nodeBitmapHeapscan.c:735
>
> (lldb) f 3
> frame #3: 0x0000000108000d57
> postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at
> nodeBitmapHeapscan.c:156
> 153 * dsa_pointer of the iterator state which will be used by
> 154 * multiple processes to iterate jointly.
> 155 */
> -> 156 pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
> 157 #ifdef USE_PREFETCH
> 158 if (node->prefetch_maximum > 0)
> 159
> (lldb) print tbm->dsa
> (dsa_area *) $3 = 0x0000000000000000
> (lldb) print node->ss.ps.state->es_query_dsa
> (dsa_area *) $5 = 0x0000000000000000
> (lldb) f 17
> frame #17: 0x000000010800363b
> postgres`ExecGather(pstate=0x00007f9743019320) + 635 at
> nodeGather.c:220
> 217 * Get next tuple, either from one of our workers, or by running the
> plan
> 218 * ourselves.
> 219 */
> -> 220 slot = gather_getnext(node);
> 221 if (TupIsNull(slot))
> 222 return NULL;
> 223
> (lldb) print *node->pei
> (ParallelExecutorInfo) $8 = {
> planstate = 0x00007f9743019640
> pcxt = 0x00007f97450001b8
> buffer_usage = 0x0000000108b7e218
> instrumentation = 0x0000000108b7da38
> area = 0x0000000000000000
> param_exec = 0
> finished = '\0'
> tqueue = 0x0000000000000000
> reader = 0x0000000000000000
> }
> (lldb) print *node->pei->pcxt
> warning: could not load any Objective-C class information. This will
> significantly reduce the quality of type information available.
> (ParallelContext) $9 = {
> node = {
> prev = 0x000000010855fb60
> next = 0x000000010855fb60
> }
> subid = 1
> nworkers = 0
> nworkers_launched = 0
> library_name = 0x00007f9745000248 "postgres"
> function_name = 0x00007f9745000268 "ParallelQueryMain"
> error_context_stack = 0x0000000000000000
> estimator = (space_for_chunks = 180352, number_of_keys = 19)
> seg = 0x0000000000000000
> private_memory = 0x0000000108b53038
> toc = 0x0000000108b53038
> worker = 0x0000000000000000
> }
>
> I think there are two failure modes: one of your sessions showed the
> "too many ..." error (that's good, ran out of slots and said so and
> our error machinery worked as it should), and another crashed with a
> segfault, because it tried to use a NULL "area" pointer (bad). I
> think this is a degenerate case where we completely failed to launch
> parallel query, but we ran the parallel query plan anyway and this
> code thinks that the DSA is available. Oops.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
On Tue, Dec 5, 2017 at 1:18 AM, Jakub Glapa <jakub.glapa@gmail.com> wrote:
> I see that the segfault is under active discussion but just wanted to ask if
> increasing the max_connections to mitigate the DSM slots shortage is the way
> to go?

Hi Jakub,

Yes.  In future releases this situation will improve (maybe we'll
figure out how to use one DSM segment for all the gather nodes in your
query plan, and maybe it'll be moot anyway because maybe we'll be able
to use a Parallel Append for queries like yours so that it uses the
same set of workers over all the child plans instead of the
fork()-fest you're presumably seeing).  For now your only choice, if
you want that plan to run, is to crank up max_connections so that the
total number of concurrently executing Gather nodes is less than about
64 + 2 * max_connections.  There is also a crash bug right now in the
out-of-slots case as discussed, fixed in the next point release, but
even with that fix in place you'll still need a high enough
max_connections setting to be sure to be able to complete the query
without an error.

Thanks for the report!

--
Thomas Munro
http://www.enterprisedb.com
Thank You Thomas!
--
regards,
Jakub Glapa