Thread: Sudden slow down and spike in system CPU causes max_connections to get exhausted

Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
"Anand Kumar, Karthik"
Date:
Hi,

We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem

Everything will run along okay, and every few hours, for about a couple of minutes, postgres will slow way down. A "select 1" query takes between 10 and 15 seconds to run, and the box in general gets lethargic.

This causes a pile up of connections at the DB, and we run out of max_connections.

This is accompanied with a steep spike in system CPU and load avg. No spike in user CPU or in I/O.

So far:
- We've ruled out checkpoints as a cause.
- We have statement logging turned on and no single statement seems to be causing this. All statements slow down, including "select 1"
- There is no spike in incoming traffic that we can see.

We do typically have a lot of idle connections (1500 connections total, over a 1000 idle at any given time). We're in the midst of installing pgbouncer to try and mitigate the problem, but that still doesn't address the root cause.

Anyone have any tips for why this might be occurring?

Thanks,
Karthik

Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
John R Pierce
Date:
On 1/6/2014 5:06 PM, Anand Kumar, Karthik wrote:
>
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem

please tell me that's a typo, and you're using ext4, or at least ext3.

> We do typically have a lot of idle connections (1500 connections
> total, over a 1000 idle at any given time). We're in the midst of
> installing pgbouncer to try and mitigate the problem, but that still
> doesn't address the root cause.
>

having 500 not-idle connections is disturbing. depending on the
complexity of those 500 active queries, you could be using 500 or 1000
or more times your work_mem setting in temporary buffers on top of your
other memory allocations. that's also going to use a LOT of file
handles in your kernel. and having 500 active queries competing for
your CPU cores, ouch.

I bet you're running out of memory during these busy peaks and going
very page-swap bound.
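
To put rough numbers on that estimate, here is a back-of-the-envelope sketch in shell. The 500 active backends come from the thread; the work_mem value of 1 MB is only the PostgreSQL default and an assumption here (the actual setting on this box is not given), and the function name is made up for illustration.

```shell
#!/bin/sh
# Rough upper bound on memory used by per-query work areas, in MB.
# worst_case_work_mem_mb ACTIVE_BACKENDS WORK_MEM_MB ALLOCS_PER_QUERY
# ALLOCS_PER_QUERY models the "500 or 1000 or more times" remark: each sort
# or hash step in a query plan may use up to work_mem on its own.
worst_case_work_mem_mb() {
    echo $(( $1 * $2 * $3 ))
}

# Assumed inputs: 500 active backends (from the thread) and work_mem = 1 MB
# (the PostgreSQL default, NOT a confirmed setting for this server).
echo "1 allocation per query:  $(worst_case_work_mem_mb 500 1 1) MB"
echo "2 allocations per query: $(worst_case_work_mem_mb 500 1 2) MB"
```

With a larger work_mem the same arithmetic scales linearly, which is why a few hundred concurrently active queries can push a box into swap.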


--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
Tom Lane
Date:
"Anand Kumar, Karthik" <Karthik.AnandKumar@classmates.com> writes:
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem
> Everything will run along okay, and every few hours, for about a couple of minutes, postgres will slow way down. A "select 1" query takes between 10 and 15 seconds to run, and the box in general gets lethargic.
> This causes a pile up of connections at the DB, and we run out of max_connections.
> This is accompanied with a steep spike in system CPU and load avg. No spike in user CPU or in I/O.

System CPU only huh?  There have been some reports of such behavior
apparently caused by inefficiencies in the kernel's support of
"transparent huge pages".  See for instance this thread

http://www.postgresql.org/message-id/flat/CABMVzL2y8mRM5C9xxejAyDqe0i1S78RAE3cEATGYNf5Ktz_Zdg@mail.gmail.com

although it looks like in that case the real fix was to reduce the number
of backends.

> We do typically have a lot of idle connections (1500 connections total, over a 1000 idle at any given time). We're in the midst of installing pgbouncer to try and mitigate the problem, but that still doesn't address the root cause.

1500 connections?  What makes you think that itself isn't the root cause?

            regards, tom lane


Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
John R Pierce
Date:
On 1/6/2014 5:06 PM, Anand Kumar, Karthik wrote:
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem

also, centos 6.3 is a couple-year-old release, you really should `yum
update` and get the latest centos 6.everything. there have been lots and
lots of fixes between 6.3 and now (6.5 was the last released build, plus a
couple months of incremental patches to that).

let's see... 6.3 was released with kernel-2.6.32-279.el6.x86_64.rpm in
June 2012,
6.5 came with kernel-2.6.32-431.el6.x86_64.rpm in November 2013,
and the latest update is kernel-2.6.32-431.1.2.0.1.el6.x86_64.rpm





--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
"Anand Kumar, Karthik"
Date:
The reason we're on ext2 is to get around
http://www.postgresql.org/message-id/CED87E13.C57E7%karthik.anandkumar@memorylane.com

We had pretty severe index and table corruption that would occur randomly
- this was on ext3, and centos 5.6, 2.6.18 kernel. The problems got fixed
after we upgraded the kernel to 2.6.32, Centos 6.3. We also dropped down
to ext2 because we would see the filesystem go readonly, and wanted to get
the journal out of the way (yes, maybe overkill, but we desperately needed
to stop crashing)

We are getting a standby box up on Centos 6.5 with xfs, will move to that.
That's longer term though, and I'm hoping to be able to resolve this issue
before we get there.

We'll try reducing the number of backends, and disable transparent huge
pages. I'll update the thread with our results.

Thanks,
Karthik



Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
Sergey Konoplev
Date:
On Mon, Jan 6, 2014 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Anand Kumar, Karthik" <Karthik.AnandKumar@classmates.com> writes:
>> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem
>> Everything will run along okay, and every few hours, for about a couple of minutes, postgres will slow way down. A "select 1" query takes between 10 and 15 seconds to run, and the box in general gets lethargic.
>> This causes a pile up of connections at the DB, and we run out of max_connections.
>> This is accompanied with a steep spike in system CPU and load avg. No spike in user CPU or in I/O.
>
> System CPU only huh?  There have been some reports of such behavior
> apparently caused by inefficiencies in the kernel's support of
> "transparent huge pages".  See for instance this thread
>
> http://www.postgresql.org/message-id/flat/CABMVzL2y8mRM5C9xxejAyDqe0i1S78RAE3cEATGYNf5Ktz_Zdg@mail.gmail.com
>
> although it looks like in that case the real fix was to reduce the number
> of backends.

I experienced the THP defragmentation problem even with <10
connections. What always saves me is to set

echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

The names might be slightly different on CentOS, something like
redhat_transparent_hugepage; I don't remember exactly.
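
Before changing anything it can help to see what the kernel is currently using. The sketch below prints the active value of the THP `enabled` and `defrag` files, trying both the mainline directory name and the `redhat_transparent_hugepage` name used by RHEL/CentOS 6 kernels. The function names, and the optional base-directory argument (there only so the lookup can be exercised against a scratch directory), are illustrative assumptions.

```shell
#!/bin/sh
# Print the active value, i.e. the one shown in [brackets], of a THP sysfs
# file. Example file content: "always madvise [never]"
thp_active_value() {
    sed -n 's/.*\[\(.*\)].*/\1/p' "$1"
}

# Locate the THP directory. Mainline kernels use transparent_hugepage,
# RHEL/CentOS 6 kernels use redhat_transparent_hugepage.
# The base directory defaults to /sys/kernel/mm.
thp_dir() {
    base=${1:-/sys/kernel/mm}
    for d in "$base/transparent_hugepage" "$base/redhat_transparent_hugepage"; do
        if [ -d "$d" ]; then
            echo "$d"
            return 0
        fi
    done
    return 1
}

if dir=$(thp_dir); then
    echo "enabled: $(thp_active_value "$dir/enabled")"
    echo "defrag:  $(thp_active_value "$dir/defrag")"
else
    echo "no transparent hugepage directory found"
fi
```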

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
Scott Marlowe
Date:
On Mon, Jan 6, 2014 at 6:06 PM, Anand Kumar, Karthik
<Karthik.AnandKumar@classmates.com> wrote:
> Hi,
>
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem
>
> Everything will run along okay, and every few hours, for about a couple of
> minutes, postgres will slow way down. A "select 1" query takes between 10
> and 15 seconds to run, and the box in general gets lethargic.
>
> This causes a pile up of connections at the DB, and we run out of
> max_connections.
>
> This is accompanied with a steep spike in system CPU and load avg. No spike
> in user CPU or in I/O.

As well as the previously mentioned huge pages thing, also look at vm
dirty ratio:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm
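
For a quick look at the current writeback settings, a sketch along these lines prints vm.dirty_background_ratio and vm.dirty_ratio and turns the percentages into approximate megabytes. Applying the percentage to MemTotal is a simplifying assumption: the kernel bases the limit on reclaimable memory rather than total RAM, so the real thresholds are somewhat lower. The helper name is made up for illustration.

```shell
#!/bin/sh
# pct_of_mb TOTAL_MB PERCENT -> integer megabytes (rounded down)
pct_of_mb() {
    echo $(( $1 * $2 / 100 ))
}

# Only runs on Linux systems that expose these files.
if [ -r /proc/sys/vm/dirty_ratio ] && [ -r /proc/meminfo ]; then
    total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
    total_mb=${total_mb:-0}
    for knob in dirty_background_ratio dirty_ratio; do
        pct=$(cat "/proc/sys/vm/$knob")
        echo "vm.$knob = $pct% (roughly $(pct_of_mb "$total_mb" "$pct") MB of $total_mb MB)"
    done
fi
```

On a box with a lot of RAM even a modest percentage is a large amount of dirty data to flush at once, which is the kind of stall the linked article describes.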


Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
Sameer Kumar
Date:

On Tue, Jan 7, 2014 at 9:06 AM, Anand Kumar, Karthik <Karthik.AnandKumar@classmates.com> wrote:
> We do typically have a lot of idle connections (1500 connections total, over a 1000 idle at any given time). We're in the midst of installing pgbouncer to try and mitigate the problem, but that still doesn't address the root cause.

1500 connections in total (is that also your max_connections setting)? How many CPUs/CPU threads do you have on your server? How many concurrent transactions are happening in your database? 1000 idle at any given time? Whoa! Is that by design, or a defect you have decided to live with? Are those 1000 connections idle, or idle in transaction?
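
One way to answer the idle versus idle-in-transaction question on 9.1 is sketched below. It assumes the 9.1 layout of pg_stat_activity, where there is no `state` column and idle backends appear as `<IDLE>` or `<IDLE> in transaction` in `current_query`; the variable and function names are made up for illustration.

```shell
#!/bin/sh
# SQL for PostgreSQL 9.1: classify each backend by its current_query value.
# (From 9.2 on, the "state" column would be used instead.)
# Run as a superuser; otherwise other users' backends show a placeholder
# in current_query and end up counted as 'active'.
STATE_SQL="SELECT CASE current_query
                   WHEN '<IDLE>' THEN 'idle'
                   WHEN '<IDLE> in transaction' THEN 'idle in transaction'
                   ELSE 'active'
              END
         FROM pg_stat_activity"

# Read one state label per line on stdin and print "count label",
# largest count first.
tally_states() {
    sort | uniq -c | sort -rn |
        awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n, $0 }'
}

# Usage (not run here; add whatever connection options your setup needs):
#   psql -At -c "$STATE_SQL" | tally_states
```

A large "idle in transaction" count usually points at the application holding transactions open, while plain "idle" connections are the ones a pooler can absorb.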


What pooling mode are you using in pgbouncer? Going by your description, I would suggest transaction mode, since you either have bugs or deliberately (why? why?) keep connections open/idle.

In session mode you may not get much benefit from pgbouncer.

With pgbouncer you can set max_client_conn to a higher number and set timeouts for your clients (note that this will time out even clients that are in the middle of a transaction but have been idle for a long time).
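
As a rough illustration of those suggestions, the sketch below writes a minimal pgbouncer.ini using transaction pooling, a high client limit, and client timeouts. Every value, the database entry `appdb`, the file paths, and the function name are assumptions for illustration only, not tuned recommendations for this system.

```shell
#!/bin/sh
# Write a minimal pgbouncer.ini sketch to the given path.
write_pgbouncer_sketch() {
    cat > "$1" <<'EOF'
[databases]
; hypothetical database entry
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; hand the server connection back to the pool after each transaction
pool_mode = transaction
; many clients may connect to pgbouncer ...
max_client_conn = 2000
; ... but only this many server connections per database/user pair
default_pool_size = 50
; seconds; disconnect clients that sit idle this long (0 = disabled)
client_idle_timeout = 600
; seconds; disconnect clients idle inside a transaction this long (0 = disabled)
idle_transaction_timeout = 60
EOF
}

# Example: write_pgbouncer_sketch ./pgbouncer.ini
```

The point of the split between max_client_conn and default_pool_size is that the application can keep its many mostly idle connections while PostgreSQL itself only sees a small number of backends.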


Best Regards,
Sameer Kumar | Database Consultant
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533
M : +65 8110 0350 T: +65 6438 3504 | www.ashnik.com
www.facebook.com/ashnikbiz | www.twitter.com/ashnikbiz


Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted

From
"Anand Kumar, Karthik"
Date:
Thanks all for your suggestions. Looks like disabling transparent huge
pages fixed this issue for us. We haven't had it occur in two days now
after the change.

Thanks,
Karthik
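
The message above does not say how transparent huge pages were disabled. One common approach, sketched here as an assumption rather than as what was actually done on this system, is to write `never` to the THP `enabled` and `defrag` files. The function name is made up, and the directory is passed as an argument so the same code works with either directory name.

```shell
#!/bin/sh
# Write "never" to the THP "enabled" and "defrag" files under the given
# directory. Needs root when pointed at the real sysfs tree, and the
# setting does not survive a reboot unless it is reapplied at boot.
disable_thp() {
    for f in enabled defrag; do
        if [ -w "$1/$f" ]; then
            echo never > "$1/$f"
        fi
    done
}

# Usage as root on a RHEL/CentOS 6 kernel:
#   disable_thp /sys/kernel/mm/redhat_transparent_hugepage
# On mainline kernels:
#   disable_thp /sys/kernel/mm/transparent_hugepage
```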