Discussion: 9.2.2 - semop hanging


9.2.2 - semop hanging

From
Rafael Domiciano
Date:
Hello all you guys,

I've already sent the same problem to the performance list. Some people answered me, but it didn't resolve the situation.

For the last 2 weeks I've been stuck in a very strange situation: from time to time (sometimes at intervals of less than 10 minutes), the server gets "stuck"/"hangs" (I don't know what to call it) and every connection to Postgres (it doesn't matter if it's SELECT, UPDATE, DELETE, INSERT, startup, authentication...) seems to get "paused"; after some seconds (say ~10 or ~15 sec, sometimes less) everything "goes OK".

So, my first step was to check the disks. Running "iostat" apparently showed that the disks were OK. It's a RAID 10 of 4x 600 GB SAS disks on an IBM DS3512 storage array, over FC. IBM DS Storage Manager says the disks are OK.

Then, memory. Apparently no swap is being used:
[###@### data]# free -m
             total       used       free     shared    buffers     cached
Mem:        145182     130977      14204          0         43     121407
-/+ buffers/cache:       9526     135655
Swap:         6143         65       6078

No errors in /var/log/messages.

Here is what I've tried so far:
1) Emre Hasegeli suggested reducing my shared_buffers, but it's already low:
  total server memory: 141 GB
  shared_buffers: 16 GB

Maybe it's too low? I've been thinking of increasing it to 32 GB.

max_connections = 500, with ~400 connections on average

2) Since the processes seem to be hanging in "semop", I tried the following, as suggested on some "tuning page" on the web (there is a short note on these values after item 4). Is it right?

echo "250 32000 200 128" > /proc/sys/kernel/sem

3) I think my problem could be related to "LWLocks", as I did some googling and found some related problems and slides. Is there some way I can confirm this? (There is a sketch of one way to check at the end of this message.)

4) Rebooting the server didn't make any difference.
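
(On item 2: as I understand it, the four numbers written to /proc/sys/kernel/sem are SEMMSL, SEMMNS, SEMOPM and SEMMNI. A rough sketch of how I checked and persisted them; same values as above, just for reference:)

cat /proc/sys/kernel/sem                     # current SEMMSL SEMMNS SEMOPM SEMMNI
sysctl -w kernel.sem="250 32000 200 128"     # same change as in item 2, via sysctl
# to keep it across reboots, add this line to /etc/sysctl.conf:
#   kernel.sem = 250 32000 200 128
ipcs -u                                      # semaphore usage summary (output further below)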

Below is a strace of one process, plus some other info that may be useful. Every process I've straced shows the same scenario: it seems to get stuck on semop.

Any help appreciated,

[###@### ~]# strace -ttp 5209
Process 5209 attached - interrupt to quit
09:01:54.122445 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.368785 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.368902 semop(2523148, {{11, 1, 0}}, 1) = 0
09:01:55.368978 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.369861 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.370648 semop(3047452, {{6, 1, 0}}, 1) = 0
09:01:55.370694 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.370762 semop(2785300, {{12, 1, 0}}, 1) = 0
09:01:55.370805 access("base/2048098929", F_OK) = 0
09:01:55.370953 open("base/2048098929/PG_VERSION", O_RDONLY) = 5

[###@### ~]# strace -p 16877 -tt
Process 16877 attached - interrupt to quit
09:57:56.305123 semop(163844, {{13, -1, 0}}, 1) = 0
09:57:59.453714 semop(163844, {{13, -1, 0}}, 1) = 0
09:58:04.004023 semop(163844, {{13, -1, 0}}, 1) = 0
09:58:04.004209 brk(0x1f44000)          = 0x1f44000
09:58:04.004305 brk(0x1f42000)          = 0x1f42000

[###@### data]# ipcs -l

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 83886080
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 200
semaphore max value = 32767

------ Messages: Limits --------
max queues system wide = 32768
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536

[###@### data]# ipcs -u
----- Semaphore Status -------
used arrays: 34
allocated semaphores: 546

[###@### data]# uname -a
Linux ### 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

postgres=# select version();
                                                   version
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
(1 registro)

[###@### data]# cat /etc/redhat-release
CentOS release 6.3 (Final)
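
(On item 3: one idea I found for confirming LWLock contention is to take a backtrace of a backend while it is stalled; a stack going through semop -> PGSemaphoreLock -> LWLockAcquire would point that way. Is this a reasonable approach? A sketch, assuming gdb and the postgresql debuginfo package are installed; the PID is just an example:)

PID=5209                       # example: a backend that is currently hanging
gdb -p "$PID" -batch -ex bt    # one-shot backtrace, then detach
# running "perf top -g" during a stall is another way to see where the time goes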

Re: 9.2.2 - semop hanging

From
Eduardo Morras
Date:
On Mon, 15 Jul 2013 10:16:19 -0300
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

I'm not a Linux expert, I'm a BSD man, but

a) do you have an interrupt storm?
b) what does postgres do before the hang?
c) do you have any other software running? Including contrib modules. It may be a DNS lookup timeout (watch port 53 tcp/udp; see the quick sketch after this list).
d) what filesystem? Does the filesystem need some kind of "maintenance window"? I mean flushing dirty caches, metadata, waking up barriers...
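
For c), something as simple as this, run during a stall, should show whether DNS lookups are hanging (only a sketch):

tcpdump -ni any port 53    # watch DNS queries/answers; long gaps before an answer suggest timeouts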


---   ---
Eduardo Morras <emorrasg@yahoo.es>



Re: 9.2.2 - semop hanging

From
Prashanth Ranjalkar
Date:
On 15-07-2013 19:14, Eduardo Morras wrote:
On Mon, 15 Jul 2013 10:16:19 -0300
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

I'm not a Linux expert, I'm a BSD man, but 

a) do you have an interrupt storm?
b) what does postgres do before the hang?
c) do you have anyother software running? Including contrib modules. It may be a dns lookup timeout (watch port 53 tcp/udp)
d) what filesystem? Needs filesystem some kind of "maintenance window"? I mean flush dirty caches, metadata, wake up barriers...


---   ---
Eduardo Morras <emorrasg@yahoo.es>

Max seg size and max total shared memory represent SHMMAX and SHMALL respectively.
max seg size (kbytes) = 83886080
max total shared memory (kbytes) = 17179869184

At first glance, I suspect adjusting these values may resolve the issue, as SHMALL should be set either to the total maximum memory size or as a number of pages. I think the value currently set is too large, so I recommend leaving the SHMALL parameter commented out and setting only SHMMAX, to 24 GB.

Please check the server logs for any shared memory errors, and make sure work_mem is not set to an extremely high value, since max_connections is set to 500.
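
A rough sketch of that change (24 GB expressed in bytes, since kernel.shmmax takes bytes; treat the figure as illustrative):

sysctl -w kernel.shmmax=$((24 * 1024 * 1024 * 1024))    # = 25769803776
# to persist across reboots, add to /etc/sysctl.conf:
#   kernel.shmmax = 25769803776
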
Thanks & Regards,
 
Prashanth Ranjalkar
Database Consultant & Architect
Cell: +91 932 568 2271





Re: 9.2.2 - semop hanging

From
Kevin Grittner
Date:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc
> (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

> CentOS release 6.3 (Final)

> Since 2 weeks I'm get stucked in a very strange situation: from
> time to time (sometimes with intervals less than 10 minutes), the
> server get "stucked"/"hang" (I dont know how to call it) and
> every connections on postgres (dont matter if it's SELECT,
> UPDATE, DELETE, INSERT, startup, authentication...) seems like
> get "paused"; after some seconds (say ~10 or ~15 sec, sometimes
> less) everything "goes OK".

During these episodes, do you see high system CPU time?  If so, try
disabling transparent huge page support, and see whether it affects
the frequency or severity of the episodes.
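
For reference, disabling it at runtime usually looks something like this (a sketch; on RHEL/CentOS 6 kernels the directory may be /sys/kernel/mm/redhat_transparent_hugepage instead, so check which path your kernel exposes):

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# the same echo lines can go in /etc/rc.local to survive a reboot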

> So, my first trial was to check disks. Running "iostat"
> apparently showed that disks was OK.

Did you run iostat during an episode of slowness?  What did it
show?  An interpretation that it was "apparently OK" doesn't
provide much useful information.
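
Concretely, something along these lines left running in a terminal when an episode hits would tell us much more (just a sketch):

iostat -xm 1    # extended per-device stats (utilization, await) in MB, once per second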

> It's a Raid10, 4 600GB SAS, IBM Storage DS3512, over FC. IBM DS
> Storage Manager says that disks is OK.

Are there any reports to show you when writing was saturated?

>              total       used       free     shared    buffers    cached
> Mem:        145182     130977      14204          0         43    121407
> -/+ buffers/cache:       9526     135655
> Swap:         6143         65       6078

> Following is what I've tried:
> 1) Emre Hasegeli has suggested to reduce my shared buffers, but
> it's already low:
>   total server memory: 141 GB
>   shared_buffers: 16 GB

On a machine with nearly twice that RAM, I've had to decrease
shared_buffers to 2GB to avoid the symptoms you describe.  That is
in conjunction with making the background writer more aggressive
and making sure the checkpoint completion target is set to 0.9.
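
As a rough sketch, the kind of postgresql.conf settings I mean look like this (the numbers are illustrative, not the exact values I used and not a recommendation for your server):

shared_buffers = 2GB
bgwriter_delay = 100ms                 # wake the background writer more often (default 200ms)
bgwriter_lru_maxpages = 1000           # let it write more pages per round (default 100)
bgwriter_lru_multiplier = 4.0          # write further ahead of projected demand (default 2.0)
checkpoint_completion_target = 0.9     # spread checkpoint writes over more of the interval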

> Maybe it's too low? I've been thinking to increase to 32 GB.

Well, you could try that; if the symptoms get worse, then you might
be willing to go the other direction....

> max_connections = 500 and ~400 connections average

How many cores (not "hardware threads") does the machine have?  You
will probably have better throughput and latency if you use
connection pooling to limit the number of active database
transactions to somewhere around two times the number of cores, or
slightly above that.
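
As a sketch of what that can look like with pgbouncer in transaction pooling mode (the database name and sizes below are made up for illustration; size the pool to roughly twice your real core count):

; pgbouncer.ini, illustrative excerpt
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction     ; a server connection is held only for the duration of a transaction
default_pool_size = 32      ; roughly 2x cores on a 16-core box; adjust to the real core count
max_client_conn = 500       ; existing clients connect here instead of directly to postgres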

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.2.2 - semop hanging

From
Rafael Domiciano
Date:
First of all, thanks for the response; answers below.


On Mon, Jul 15, 2013 at 4:12 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc
> (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

> CentOS release 6.3 (Final)

> Since 2 weeks I'm get stucked in a very strange situation: from
> time to time (sometimes with intervals less than 10 minutes), the
> server get "stucked"/"hang" (I dont know how to call it) and
> every connections on postgres (dont matter if it's SELECT,
> UPDATE, DELETE, INSERT, startup, authentication...) seems like
> get "paused"; after some seconds (say ~10 or ~15 sec, sometimes
> less) everything "goes OK".

During these episodes, do you see high system CPU time?  If so, try
disabling transparent huge page support, and see whether it affects
the frequency or severity of the episodes.


Well, running mpstat 1 gives me the following:

08:27:48     all    5,44    0,00    3,97    0,59    0,03    0,03    0,00    0,00   89,93
08:27:49     all    7,61    0,00    3,22    3,13    0,00    0,06    0,00    0,00   85,97
08:27:50     all    2,54    0,00   24,23    0,06    0,00    0,00    0,00    0,00   73,17
08:27:51     all    1,76    0,00   33,33    0,19    0,00    0,00    0,00    0,00   64,72

08:27:51     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
08:27:52     all    5,07    0,00   23,63    4,10    0,00    0,06    0,00    0,00   67,14
08:27:53     all    0,34    0,00   17,89    3,07    0,00    0,00    0,00    0,00   78,70
08:27:54     all    0,06    0,00   14,94    0,03    0,00    0,03    0,00    0,00   84,93
08:27:55     all    4,64    0,00    4,64    3,41    0,03    0,09    0,00    0,00   87,19
08:27:56     all    9,27    0,00    2,29    3,76    0,03    0,03    0,00    0,00   84,62
08:27:57     all    3,32    0,00   15,49    1,82    0,00    0,03    0,00    0,00   79,34
08:27:58     all    0,09    0,00   16,67    0,31    0,00    0,00    0,00    0,00   82,92

Another sample:
[###@###~]# mpstat 1
Linux 2.6.32-279.14.1.el6.x86_64 (###.###)    16-07-2013      _x86_64_        (32 CPU)

08:37:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
08:37:51     all    4,85    0,00   15,37    1,60    0,00    0,03    0,00    0,00   78,15
08:37:52     all    4,70    0,00   22,89    0,06    0,00    0,03    0,00    0,00   72,32
08:37:53     all    0,97    0,00   21,55    0,03    0,00    0,00    0,00    0,00   77,45
08:37:54     all    0,53    0,00   19,54    0,03    0,00    0,00    0,00    0,00   79,89
08:37:55     all    0,19    0,00   13,24    0,91    0,03    0,06    0,00    0,00   85,57
08:37:56     all    6,56    0,00    1,91    7,00    0,00    0,16    0,00    0,00   84,37
08:37:57     all    3,72    0,00    0,47    6,29    0,00    0,00    0,00    0,00   89,51
08:37:58     all    5,35    0,00    0,66    3,79    0,00    0,03    0,00    0,00   90,17

Yeah, disabling THP seems to lower the severity of the situation. Thanks. Right now it has been about 1 hour without any episode.
Same problem here and same resolution: http://dba.stackexchange.com/questions/32890/postgresql-pg-stat-activity-shows-commit.

Googling, I found that others had the same problem and resolved it by disabling THP. Is that the right way?

About the disk activity, my reference point is the test that was done when the storage was installed/configured. In that test iostat showed around ~600 tps. During my episodes tps was around ~300.

The processors are 2x Intel Xeon E5-2690, giving a total of 32 threads.

About shared_buffers, I'm going to try different values and test.

Thanks,
Rafael Domiciano

 

Re: 9.2.2 - semop hanging

From
Kevin Grittner
Date:
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> Yeah, disabling THP seens to lower the severity of the situation.
> Thanks. Right now is about 1 hour without any episode.

> Googling I've found that others had the same problem, and
> resolved disabling THP. Is it the right way?

That is the only way I am aware of to correct this problem when it
appears.  I have seen recommendations to disable THP defrag
instead, but where I have seen people do that, they wound up
entirely disabling THP support later.

Huge pages should benefit performance in general, but some
implementations seem to have problems.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.2.2 - semop hanging

From
Eduardo
Date:
On Tue, 16 Jul 2013 10:08:49 -0300
Rafael Domiciano <rafael.domiciano@gmail.com> wrote:

> First of all, Thanks for response, answers below.
>
>
> On Mon, Jul 15, 2013 at 4:12 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> > Rafael Domiciano <rafael.domiciano@gmail.com> wrote:
> >
> > > PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc
> > > (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
> >
> > > CentOS release 6.3 (Final)
> >
> > > Since 2 weeks I'm get stucked in a very strange situation: from
> > > time to time (sometimes with intervals less than 10 minutes), the
> > > server get "stucked"/"hang" (I dont know how to call it) and
> > > every connections on postgres (dont matter if it's SELECT,
> > > UPDATE, DELETE, INSERT, startup, authentication...) seems like
> > > get "paused"; after some seconds (say ~10 or ~15 sec, sometimes
> > > less) everything "goes OK".
> >
> > During these episodes, do you see high system CPU time?  If so, try
> > disabling transparent huge page support, and see whether it affects
> > the frequency or severity of the episodes.
> >
> Yeah, disabling THP seens to lower the severity of the situation. Thanks.
> Right now is about 1 hour without any episode.
> Same problem here and same resolution:
> http://dba.stackexchange.com/questions/32890/postgresql-pg-stat-activity-shows-commit
> .
>
> Googling I've found that others had the same problem, and resolved
> disabling THP. Is it the right way?

Why not try to configure THP for your needs? It looks like the transparent hugepage manager tries to defragment memory when it's too late to do it fast, because there's not enough free memory, and its default configuration may be bad for you. A quick search of the PostgreSQL source code finds no MADV_HUGEPAGE, so THP can't be enabled for those pages only and must be configured for the whole system.

Set THP on: # echo always > /sys/kernel/mm/transparent_hugepage/enabled

Get the current values from these files and try to find and set better ones:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan (how many pages to scan in each pass)
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs (how many millisecs to sleep between passes)
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs (how many millisecs to wait after a failed huge page allocation)

The khugepaged daemon is similar to our autovacuum; it should be "easy" to find locally optimal values.

Alternatively, you can disable the defrag, but this way you will end up losing a lot of memory to fragmentation:

# echo never > /sys/kernel/mm/transparent_hugepage/defrag

Once the values are set, Postgres must be restarted.
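
A quick way to capture the current settings before experimenting (a sketch; on RHEL/CentOS 6 kernels the directory may be named redhat_transparent_hugepage instead):

for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag \
         /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan \
         /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs \
         /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs; do
    echo "$f: $(cat $f)"
done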


--
Eduardo <emorrasg@yahoo.es>