Discussion: ERROR: too many dynamic shared memory segments


ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
Hello,
We are running postgres 11.5 and in the last two weeks we did the following:

- upgraded postgres from 11.4 to 11.5
- increased shared_buffers to 1/3 of the memory
- increased effective_cache_size from 120GB to 160GB
- increased checkpoint_completion_target from 0.7 to 0.9
- increased checkpoint_timeout to 1h
- increased work_mem from 600MB to 2GB (this can be set up to 4GB)

Since then, over the last two weeks, we have seen an increase in this error:

ERROR: too many dynamic shared memory segments

Is there any relation to these parameters or to the pgsql 11.5 version?

Any help would be appreciated.

Thank you,
Nicola

Re: ERROR: too many dynamic shared memory segments

From: Pavel Stehule
Date:
Hi

On Wed, 11 Sep 2019 at 09:48, Nicola Contu <nicola.contu@gmail.com> wrote:
> We are running postgres 11.5 and in the last two weeks we did the following:
> [...]
> Is there any relation to these parameters or to the pgsql 11.5 version?

I expect it can be related to the increased work_mem - maybe a parallel hash join was used.

Regards

Pavel


Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
Hello,
We did not see any other errors in the logs, just that one.
Unfortunately we had problems installing updates on this machine, so we have not installed updates for a few months.

Do you think that could be the issue? We are running CentOS 7.

I will look into those parameters as well.

Thanks for your feedback


On Wed, 11 Sep 2019 at 09:56, Mickael Franc (mickaelf) <mickaelf@cisco.com> wrote:
Hello,

Maybe a change is needed in a kernel parameter such as "kernel.shmmax" or "kernel.shmall" to allow the kernel to provide a huge amount of shared memory.
Did you see any other errors in the logs? Have you upgraded your kernel in the last two weeks?

Best,

From: Nicola Contu <nicola.contu@gmail.com>
Sent: Wednesday, 11 September 2019 09:47
To: pgsql-generallists.postgresql.org <pgsql-general@lists.postgresql.org>
Cc: Alessandro Aste <Alessandro.aste@gtt.net>
Subject: ERROR: too many dynamic shared memory segments
 

Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
If the error persists I will try to revert work_mem.
Thanks a lot

On Wed, 11 Sep 2019 at 10:10, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I expect it can be related to the increased work_mem - maybe a parallel hash join was used.

Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu <nicola.contu@gmail.com> wrote:
> If the error persists I will try to revert work_mem.
> Thanks a lot

Hi Nicola,

It's hard to say exactly what the cause of the problem is in your case
and how to avoid it, without knowing what your query plans look like.
PostgreSQL allows 64 + 2 * max_connections segments to exist at a time,
and it needs a number of them that depends on work_mem (in the case of
Parallel Hash Join and Parallel Bitmap Index Scan), and also depends
on the number of Gather nodes that appear in the plan, which in some
unusual cases can result from partitioning.
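
For a rough sense of the ceiling, the limit can be computed on a running server (a sketch assuming the 64 + 2 * max_connections formula above):

SELECT 64 + 2 * current_setting('max_connections')::int AS dsm_slot_limit;
-- e.g. with max_connections = 220 this yields 64 + 440 = 504 slots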

I've seen people reaching this error by running a lot of parallel
queries concurrently.  If that's the cause, then you can definitely
get some relief by turning work_mem down, or by turning
max_connections up (even though you don't want to allow more
connections -- because it influences the formula for deciding on the
DSM segment limit).  We should probably adjust some of the internal
constants to give us more slots, to avoid that problem, as discussed
here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com
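
As a sketch of those two knobs (the value 400 is hypothetical; max_connections only takes effect after a restart, while work_mem can be lowered with just a reload):

ALTER SYSTEM SET max_connections = 400;  -- hypothetical value; requires a restart
ALTER SYSTEM SET work_mem = '600MB';     -- lower value; applied on reload
SELECT pg_reload_conf();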

I've also seen people reaching this error by somehow coming up with
plans that had a very large number of Gather nodes in them,
corresponding to partitions; that's probably a bad plan (it'd
presumably be better to terminate parallelism higher up in the plan,
but these plans do seem to exist in the wild; I don't recall exactly
why).  I think we need a bit of a redesign so that if there are
multiple Gather nodes, they share the same main DSM segment, instead
of blowing through this limit.

-- 
Thomas Munro
https://enterprisedb.com



Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
Hey Thomas,
after a few months, we started having this issue again.
So we reverted the work_mem parameter from 2GB to 600MB.
But the issue is still there. A query hit a segmentation fault, the DB went into recovery mode and our app went read-only for a few minutes.

I understand we can increase max_connections so we can have many more segments.

My question is: is there a way to know how many segments we have reached?
Currently we have max_connections = 220, so by your formula of 64 + 2 * 220 we have about 500 shared segments.
We would like to increase max_connections to 300 or 400, but it would be great to know whether there is a way to make sure this will solve the issue, as it requires a restart of the service.

I know you were also talking about redesigning this part of PostgreSQL. Do you know if anything has changed in any of the newer versions after 11.5?

Thanks a lot,
Nicola





On Thu, 12 Sep 2019 at 01:01, Thomas Munro <thomas.munro@gmail.com> wrote:
> PostgreSQL allows 64 + 2 * max_connections segments to exist at a time
> [...]
> I think we need a bit of a redesign so that if there are
> multiple Gather nodes, they share the same main DSM segment, instead
> of blowing through this limit.

Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
We also reverted this parameter:

cmdv3=# show max_parallel_workers_per_gather;
 max_parallel_workers_per_gather
---------------------------------
 2
(1 row)

It was set to 8.
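
For anyone testing the same change, a sketch: the lower value can be tried in a single session first, before editing postgresql.conf:

SET max_parallel_workers_per_gather = 2;  -- session-only; no restart needed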


On Tue, 21 Jan 2020 at 16:06, Nicola Contu <nicola.contu@gmail.com> wrote:
> after a few months, we started having this issue again.
> So we reverted the work_mem parameter from 2GB to 600MB.
> [...]

Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
Hello, may I ask you for feedback?
Thanks a lot


Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Wed, Jan 22, 2020 at 4:06 AM Nicola Contu <nicola.contu@gmail.com> wrote:
> after a few months, we started having this issue again.
> So we reverted the work_mem parameter from 2GB to 600MB.
> But the issue is still there. A query hit a segmentation fault, the DB went into recovery mode and our app went read-only for a few minutes.

Hi Nicola,
Hmm, a segmentation fault sounds like a different problem.  Can you
please share the exact error messages from PostgreSQL and OS logs?

> I understand we can increase max_connections so we can have many more segments.
>
> My question is: is there a way to know how many segments we have reached?

If you're on Linux, you can probably see them with "ls /dev/shm".

> Currently we have max_connections = 220, so by your formula of 64 + 2 * 220 we have about 500 shared segments.
> We would like to increase max_connections to 300 or 400, but it would be great to know whether there is a way to make sure this will solve the issue, as it requires a restart of the service.
>
> I know you were also talking about redesigning this part of PostgreSQL. Do you know if anything has changed in any of the newer versions after 11.5?

It's possible that we should increase a couple of constants used in the
formula -- I'll look into that again.  But first I'd like to see if
we're even investigating the right problem here.



Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
This is the error in the postgres log for the segmentation fault:

2020-01-21 14:20:29 GMT [] [42222]: [108-1] db=,user= LOG:  server process (PID 2042) was terminated by signal 11: Segmentation fault
2020-01-21 14:20:29 GMT [] [42222]: [109-1] db=,user= DETAIL:  Failed process was running: select pid from pg_stat_activity where query ilike 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
2020-01-21 14:20:29 GMT [] [42222]: [110-1] db=,user= LOG:  terminating any other active server processes
2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [4-1] db=cmdv3,user=admin WARNING:  terminating connection because of crash of another server process
2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [5-1] db=cmdv3,user=admin DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [6-1] db=cmdv3,user=admin HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-01-21 14:20:29 GMT [127.0.0.1(34026)] [2055]: [5-1] db=cmdv3,user=admin WARNING:  terminating connection because of crash of another server process

At the CentOS level:
Jan 21 14:20:29 usnyh-cmd1.gt-t.net kernel: postmaster[2042]: segfault at 0 ip 000000000048bef4 sp 00007ffdf4955bb0 error 4 in postgres[400000+6c5000]
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT [33727] WARNING C-0x21526e8: cmdv3/admin@10.151.2.154:39688 pooler error: server conn crashed?
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT [33727] WARNING S-0x1f5ff58: cmdv3/admin@127.0.0.1:5432 got packet 'N' from server when not linked
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT [33727] WARNING S-0x25097f0: cmdv3/admin@127.0.0.1:5432 got packet 'N' from server when not linked
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.906 GMT [33727] WARNING S-0x2508b60: cmdv3/admin@127.0.0.1:5432 got packet 'N' from server when not linked
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.981 GMT [33727] ERROR S: login failed: FATAL: the database system is in recovery mode


> If you're on Linux, you can probably see them with "ls /dev/shm".

I see a lot of files there, and when I cat them they are empty. What can I do with them?

I guess those are two different problems, but they are related, because right before the segmentation fault I see a lot of shared segment errors in the postgres log.

On Wed, 29 Jan 2020 at 10:09, Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm, a segmentation fault sounds like a different problem.  Can you
> please share the exact error messages from PostgreSQL and OS logs?
> [...]
> If you're on Linux, you can probably see them with "ls /dev/shm".

Re: ERROR: too many dynamic shared memory segments

From: Julian Backes
Date:
Hi,

we only had the "too many dynamic shared memory segments" error but no segmentation faults. The error started occurring after upgrading from postgres 10 to postgres 12 (the server has 24 cores / 48 threads, i.e. many parallel workers). The error itself was not that much of a problem, but /dev/shm started filling up with orphaned files which probably (?) had not been cleaned up by postgres after the parallel workers died. In consequence, after some time, /dev/shm was full and everything crashed.

Unfortunately, the only "solution" we found so far was to increase max_connections from 100 to 1000. After that (about 2 months ago, I think), the error was gone.

Maybe this helps...

Julian


Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu <nicola.contu@gmail.com> wrote:
> This is the error in the postgres log for the segmentation fault:
>
> 2020-01-21 14:20:29 GMT [] [42222]: [108-1] db=,user= LOG:  server process (PID 2042) was terminated by signal 11: Segmentation fault
> 2020-01-21 14:20:29 GMT [] [42222]: [109-1] db=,user= DETAIL:  Failed process was running: select pid from pg_stat_activity where query ilike 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> 2020-01-21 14:20:29 GMT [] [42222]: [110-1] db=,user= LOG:  terminating any other active server processes

Ok, this is a bug.  Do you happen to have a core file?  I don't recall
where CentOS puts them.

> > If you're on Linux, you can probably see them with "ls /dev/shm".
>
> I see a lot of files there, and doing a cat they are empty. What can I do with them?

Not much, but it tells you approximately how many 'slots' are in use
at a given time (ie because of currently running parallel queries), if
they were created since PostgreSQL started up (if they're older ones
they could have leaked from a crashed server, but we try to avoid that
by trying to clean them up when you restart).

> I guess those are two different problems, but they are related, because right before the segmentation fault I see a lot of shared segment errors in the postgres log.

That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
to decide whether to pretend there are no slots free (see below), and
I managed to make it crash in the regression tests when doing a
parallel index build.  It's late here now, but I'll look into that
tomorrow.  It's possible that the parallel index code needs to learn
to cope with that.

#2  0x0000000000a096f6 in SharedFileSetInit (fileset=0x80b2fe14c,
seg=0x0) at sharedfileset.c:71
#3  0x0000000000c72440 in tuplesort_initialize_shared
(shared=0x80b2fe140, nWorkers=2, seg=0x0) at tuplesort.c:4341
#4  0x00000000005ab405 in _bt_begin_parallel
(buildstate=0x7fffffffc070, isconcurrent=false, request=1) at
nbtsort.c:1402
#5  0x00000000005aa7c7 in _bt_spools_heapscan (heap=0x801ddd7e8,
index=0x801dddc18, buildstate=0x7fffffffc070, indexInfo=0x80b2b62d0)
at nbtsort.c:396
#6  0x00000000005aa695 in btbuild (heap=0x801ddd7e8,
index=0x801dddc18, indexInfo=0x80b2b62d0) at nbtsort.c:328
#7  0x0000000000645b5c in index_build (heapRelation=0x801ddd7e8,
indexRelation=0x801dddc18, indexInfo=0x80b2b62d0, isreindex=false,
parallel=true) at index.c:2879
#8  0x0000000000643e5c in index_create (heapRelation=0x801ddd7e8,
indexRelationName=0x7fffffffc510 "pg_toast_24587_index",
indexRelationId=24603, parentIndexRelid=0,

I don't know if that's the bug that you're hitting, but it definitely
could be: REFRESH MATERIALIZED VIEW could be rebuilding an index.

===

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 90e0d739f8..f0b49d94ee 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -468,6 +468,13 @@ dsm_create(Size size, int flags)
        nitems = dsm_control->nitems;
        for (i = 0; i < nitems; ++i)
        {
+               /* BEGIN HACK */
+               if (random() % 10 > 5)
+               {
+                       nitems = dsm_control->maxitems;
+                       break;
+               }
+               /* END HACK */
                if (dsm_control->item[i].refcnt == 0)
                {
                        dsm_control->item[i].handle = seg->handle;



Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Wed, Jan 29, 2020 at 11:24 PM Julian Backes <julianbackes@gmail.com> wrote:
> we only had the "too many dynamic shared memory segments" error but no segmentation faults. The error started occurring after upgrading from postgres 10 to postgres 12 (the server has 24 cores / 48 threads, i.e. many parallel workers). The error itself was not that much of a problem, but /dev/shm started filling up with orphaned files which probably (?) had not been cleaned up by postgres after the parallel workers died. In consequence, after some time, /dev/shm was full and everything crashed.

Oh, thanks for the report.  I think I see what was happening there, and
it's a third independent problem.  The code in dsm_create() does
DSM_OP_DESTROY (ie cleans up) in the DSM_CREATE_NULL_IF_MAXSEGMENTS
case, but in the case where you see "ERROR: too many dynamic shared
memory segments" it completely fails to clean up after itself.  I can
reproduce that here.  That's a terrible bug, and has been sitting in
the tree for 5 years.

> Unfortunately, the only "solution" we found so far was to increase max_connections from 100 to 1000. After that (about 2 months ago, I think), the error was gone.

I'll take that as a vote for increasing the number of slots.



Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Wed, Jan 29, 2020 at 11:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu <nicola.contu@gmail.com> wrote:
> > This is the error in the postgres log for the segmentation fault:
> > 2020-01-21 14:20:29 GMT [] [42222]: [108-1] db=,user= LOG:  server process (PID 2042) was terminated by signal 11: Segmentation fault
> > [...]

> That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
> to decide whether to pretend there are no slots free (see below), and
> I managed to make it crash in the regression tests when doing a
> parallel index build.  It's late here now, but I'll look into that
> tomorrow.  It's possible that the parallel index code needs to learn
> to cope with that.

Hi Nicola,

Without more information I can't know if I found the same bug you
experienced, but I think it's likely.  I have committed a fix for
that, which will be available in the next release (mid-February).

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=74618e77b43cfce670b4725d5b9a300a2afd12d1



Re: ERROR: too many dynamic shared memory segments

From: Nicola Contu
Date:
Hi Thomas,
unfortunately I can't find any core dump to help you more.
Thanks for the fix. We are in the process of installing 12.1 in production, so we can wait for the next release and go live with 12.2.

I will let you know if I still get this after installing 12.2, and I will try to obtain a core dump file.

Do you still recommend increasing max_connections?


On Thu, 30 Jan 2020 at 23:41, Thomas Munro <thomas.munro@gmail.com> wrote:
> Without more information I can't know if I found the same bug you
> experienced, but I think it's likely.  I have committed a fix for
> that, which will be available in the next release (mid-February).
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=74618e77b43cfce670b4725d5b9a300a2afd12d1

Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Thu, Jan 30, 2020 at 12:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jan 29, 2020 at 11:24 PM Julian Backes <julianbackes@gmail.com> wrote:
> > we only had the "too many dynamic shared memory segments" error but no segmentation faults. [...] /dev/shm started filling up with orphaned files which probably (?) had not been cleaned up by postgres after the parallel workers died. In consequence, after some time, /dev/shm was full and everything crashed.
>
> Oh, thanks for the report.  I think I see what was happening there, and
> it's a third independent problem.  The code in dsm_create() does
> DSM_OP_DESTROY (ie cleans up) in the DSM_CREATE_NULL_IF_MAXSEGMENTS
> case, but in the case where you see "ERROR: too many dynamic shared
> memory segments" it completely fails to clean up after itself.  I can
> reproduce that here.  That's a terrible bug, and has been sitting in
> the tree for 5 years.

I committed a fix for that.  It'll be in the new releases that are due
out in a couple of weeks.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=93745f1e019543fe7b742d0c5e971aad8d08fd56

> > Unfortunately, the only "solution" we found so far was to increase max_connections from 100 to 1000. After that (about 2 months ago, I think), the error was gone.
>
> I'll take that as a vote for increasing the number of slots.

I committed something to do this for 13 (due out at the end of the year),
but I'm not game to back-patch it to the release branches.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d061ea21fc1cc1c657bb5c742f5c4a1564e82ee2



Re: ERROR: too many dynamic shared memory segments

From: Thomas Munro
Date:
On Fri, Jan 31, 2020 at 11:05 PM Nicola Contu <nicola.contu@gmail.com> wrote:
> Do you still recommend increasing max_connections?

Yes, as a workaround of last resort.  The best thing would be to
figure out why you are hitting the segment limit, and see if there is
something we could tune to fix that. If you EXPLAIN your queries, do
you see plans that have a lot of "Gather" nodes in them, perhaps
involving many partitions?  Or are you running a lot of parallel
queries at the same time?  Or are you running queries that do very,
very large parallel hash joins?  Or something else?
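
A sketch of that check (some_big_table is a placeholder name; COSTS OFF just shortens the output):

EXPLAIN (COSTS OFF) SELECT count(*) FROM some_big_table;

A parallel plan will contain Gather (or Gather Merge) nodes, for example:

 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on some_big_table

The more Gather nodes (and the larger the parallel hash joins) in the plans running concurrently, the more DSM segments are needed.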



Re: ERROR: too many dynamic shared memory segments

From: "dainius.b"
Date:
Hello,
I also get a high number of "too many dynamic shared memory segments" errors.
I upgraded Postgres to version 12.2, but that did not help.
The server has 64GB RAM / 16 CPUs. Postgres parameters:
        "max_connections":500,
        "shared_buffers":"16GB",
        "effective_cache_size":"48GB",
        "maintenance_work_mem":"2GB",
        "checkpoint_completion_target":0.9,
        "wal_buffers":"32MB",
        "default_statistics_target":500,
        "random_page_cost":1.1,
        "effective_io_concurrency":200,
        "work_mem":"20971kB",
        "min_wal_size":"2GB",
        "max_wal_size":"8GB",
        "max_worker_processes":8,
        "max_parallel_workers_per_gather":4,
        "max_parallel_workers":8,
        "log_statement":"none"

This error happens when executing many parallel queries that have quite complex plans:

Limit  (cost=163243.91..163250.02 rows=51 width=464) (actual time=1224.817..1224.834 rows=31 loops=1)
  ->  Gather Merge  (cost=163243.91..165477.68 rows=18656 width=464) (actual time=1224.815..1254.031 rows=31 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        ->  Sort  (cost=162243.85..162255.51 rows=4664 width=464) (actual time=1214.032..1214.032 rows=6 loops=5)
              Sort Key: (...)
              Sort Method: quicksort  Memory: 30kB
              Worker 0:  Sort Method: quicksort  Memory: 28kB
              Worker 1:  Sort Method: quicksort  Memory: 27kB
              Worker 2:  Sort Method: quicksort  Memory: 27kB
              Worker 3:  Sort Method: quicksort  Memory: 27kB
              ->  Parallel Hash Semi Join  (cost=41604.51..162088.25 rows=4664 width=464) (actual time=409.437..1213.922 rows=6 loops=5)
                    Hash Cond: (...)
                    ->  Parallel Hash Join  (cost=28073.57..148289.42 rows=17880 width=464) (actual time=234.973..1165.754 rows=36930 loops=5)
                          Hash Cond: (...)
                          ->  Parallel Hash Left Join  (cost=20732.39..140901.30 rows=17880 width=445) (actual time=187.482..1083.629 rows=36930 loops=5)
                                Hash Cond: (...)
                                ->  Parallel Hash Left Join  (cost=14850.80..134972.78 rows=17880 width=435) (actual time=148.107..1010.915 rows=36930 loops=5)
                                      Hash Cond: (...)
                                      ->  Parallel Hash Left Join  (cost=8969.21..129044.25 rows=17880 width=425) (actual time=110.696..938.602 rows=36930 loops=5)
                                            Hash Cond: (...)
                                            ->  Nested Loop  (cost=3087.61..123115.72 rows=17880 width=411) (actual time=70.827..861.142 rows=36930 loops=5)
                                                  ->  Nested Loop  (cost=3087.19..104038.13 rows=38340 width=263) (actual time=70.742..621.262 rows=37073 loops=5)
                                                        ->  Parallel Bitmap Heap Scan on (...)  (cost=3086.76..73462.00 rows=39271 width=167) (actual time=70.653..358.576 rows=37103 loops=5)
                                                              Recheck Cond: (...)
                                                              Filter: (...)
                                                              Rows Removed by Filter: 42915
                                                              Heap Blocks: exact=17872
                                                              ->  Bitmap Index Scan on (...)  (cost=0.00..3047.49 rows=378465 width=0) (actual time=52.331..52.331 rows=400144 loops=1)
                                                                    Index Cond: (...)
                                                        ->  Index Scan using (...)  (cost=0.43..0.78 rows=1 width=96) (actual time=0.006..0.006 rows=1 loops=185514)
                                                              Index Cond: (...)
                                                              Filter: (...)
                                                              Rows Removed by Filter: 0
                                                  ->  Index Scan using (...)  (cost=0.43..0.50 rows=1 width=152) (actual time=0.006..0.006 rows=1 loops=185367)
                                                        Index Cond: (...)
                                                        Filter: (...)
                                                        Rows Removed by Filter: 0
                                            ->  Parallel Hash  (cost=3675.71..3675.71 rows=176471 width=18) (actual time=38.590..38.590 rows=60000 loops=5)
                                                  Buckets: 524288  Batches: 1  Memory Usage: 20640kB
                                                  ->  Parallel Seq Scan on (...)  (cost=0.00..3675.71 rows=176471 width=18) (actual time=0.020..11.378 rows=60000 loops=5)
                                      ->  Parallel Hash  (cost=3675.71..3675.71 rows=176471 width=18) (actual time=36.769..36.770 rows=60000 loops=5)
                                            Buckets: 524288  Batches: 1  Memory Usage: 20608kB
                                            ->  Parallel Seq Scan on (...)  (cost=0.00..3675.71 rows=176471 width=18) (actual time=0.018..11.665 rows=60000 loops=5)
                                ->  Parallel Hash  (cost=3675.71..3675.71 rows=176471 width=18) (actual time=38.415..38.415 rows=60000 loops=5)
                                      Buckets: 524288  Batches: 1  Memory Usage: 20640kB
                                      ->  Parallel Seq Scan on (...)  (cost=0.00..3675.71 rows=176471 width=18) (actual time=0.021..11.781 rows=60000 loops=5)
                          ->  Parallel Hash  (cost=5619.97..5619.97 rows=137697 width=27) (actual time=46.665..46.665 rows=66096 loops=5)
                                Buckets: 524288  Batches: 1  Memory Usage: 24544kB
                                ->  Parallel Seq Scan on (...)  (cost=0.00..5619.97 rows=137697 width=27) (actual time=0.024..16.629 rows=66096 loops=5)
                    ->  Parallel Hash  (cost=12679.65..12679.65 rows=68103 width=4) (actual time=28.674..28.674 rows=41176 loops=5)
                          Buckets: 262144  Batches: 1  Memory Usage: 10176kB
                          ->  Parallel Index Only Scan using (...)  (cost=0.57..12679.65 rows=68103 width=4) (actual time=0.048..15.443 rows=41176 loops=5)
                                Index Cond: ()
                                Heap Fetches: 205881
Planning Time: 6.941 ms
Execution Time: 1254.251 ms

So is the only way to avoid these errors to decrease work_mem or to turn off parallelism? I hoped that in such cases the queries would simply take longer to complete, but instead I get a high number of queries just crashing.
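
For reference, a session-level sketch of the second option (disabling parallel query without a restart):

SET max_parallel_workers_per_gather = 0;  -- this session only; the planner stops using Gather nodes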


Thomas Munro wrote:
> On Fri, Jan 31, 2020 at 11:05 PM Nicola Contu <nicola.contu@gmail.com> wrote:
> > Do you still recommend increasing max_connections?
>
> Yes, as a workaround of last resort.  The best thing would be to
> figure out why you are hitting the segment limit, and see if there is
> something we could tune to fix that.




