Thread: Sudden slow down and spike in system CPU causes max_connections to get exhausted
Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
"Anand Kumar, Karthik"
Date:
Hi,
We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem.
Everything will run along okay, and every few hours, for about a couple of minutes, postgres will slow way down. A "select 1" query takes between 10 and 15 seconds to run, and the box in general gets lethargic.
This causes a pile up of connections at the DB, and we run out of max_connections.
This is accompanied by a steep spike in system CPU and load avg. No spike in user CPU or in I/O.
So far:
- We've ruled out checkpoints as a cause.
- We have statement logging turned on and no single statement seems to be causing this. All statements slow down, including "select 1".
- There is no spike in incoming traffic that we can see.
We do typically have a lot of idle connections (1500 connections total, over 1000 idle at any given time). We're in the midst of installing pgbouncer to try and mitigate the problem, but that still doesn't address the root cause.
Does anyone have any tips for why this might be occurring?
Thanks,
Karthik
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
John R Pierce
Date:
On 1/6/2014 5:06 PM, Anand Kumar, Karthik wrote:
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem

please tell me that's a typo, and you're using ext4, or at least ext3.

> We do typically have a lot of idle connections (1500 connections
> total, over a 1000 idle at any given time). We're in the midst of
> installing pgbouncer to try and mitigate the problem, but that still
> doesn't address the root cause.

having 500 not-idle connections is disturbing. depending on the complexities of those 500 active queries, you could be using 500 or 1000 or more times your work_mem setting in temporary buffers on top of your other memory allocations. that's also going to use a LOT of file handles in your kernel. and having 500 active queries competing for your CPU cores, ouch.

I bet you're running out of memory during these busy peaks and going very page-swap bound.

--
john r pierce 37N 122W
somewhere on the middle of the left coast
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
Tom Lane
Date:
"Anand Kumar, Karthik" <Karthik.AnandKumar@classmates.com> writes:
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem
> Everything will run along okay, and every few hours, for about a couple of minutes, postgres will slow way down. A "select 1" query takes between 10 and 15 seconds to run, and the box in general gets lethargic.
> This causes a pile up of connections at the DB, and we run out of max_connections.
> This is accompanied with a steep spike in system CPU and load avg. No spike in user CPU or in I/O.

System CPU only huh? There have been some reports of such behavior apparently caused by inefficiencies in the kernel's support of "transparent huge pages". See for instance this thread

http://www.postgresql.org/message-id/flat/CABMVzL2y8mRM5C9xxejAyDqe0i1S78RAE3cEATGYNf5Ktz_Zdg@mail.gmail.com

although it looks like in that case the real fix was to reduce the number of backends.

> We do typically have a lot of idle connections (1500 connections total, over a 1000 idle at any given time). We're in the midst of installing pgbouncer to try and mitigate the problem, but that still doesn't address the root cause.

1500 connections? What makes you think that itself isn't the root cause?

regards, tom lane
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
John R Pierce
Date:
On 1/6/2014 5:06 PM, Anand Kumar, Karthik wrote:
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem

also, centos 6.3 is a couple year old release, you really should `yum update` and get the latest centos 6.everything. been lots and lots of fixes between 6.3 and now (6.5 was the last released build, plus a couple months of incremental patches to that).

lets see...
6.3 was released with kernel-2.6.32-279.el6.x86_64.rpm in June 2012
6.5 came with kernel-2.6.32-431.el6.x86_64.rpm in November 2013
and the latest update is kernel-2.6.32-431.1.2.0.1.el6.x86_64.rpm

--
john r pierce 37N 122W
somewhere on the middle of the left coast
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
"Anand Kumar, Karthik"
Date:
The reason we're on ext2 is to get around
http://www.postgresql.org/message-id/CED87E13.C57E7%karthik.anandkumar@memorylane.com

We had pretty severe index and table corruption that would occur randomly - this was on ext3, and centos 5.6, 2.6.18 kernel. The problems got fixed after we upgraded the kernel to 2.6.32, Centos 6.3. We also dropped down to ext2 because we would see the filesystem go readonly, and wanted to get the journal out of the way (yes, maybe overkill, but we desperately needed to stop crashing).

We are getting a standby box up on Centos 6.5 with xfs, will move to that. That's longer term though and I'm hoping to be able to resolve this issue before we get there.

We'll try reducing the number of backends, and disable transparent huge pages. I'll update the thread with our results.

Thanks,
Karthik
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
Sergey Konoplev
Date:
On Mon, Jan 6, 2014 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Anand Kumar, Karthik" <Karthik.AnandKumar@classmates.com> writes:
>> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem
>> Everything will run along okay, and every few hours, for about a couple of minutes, postgres will slow way down. A "select 1" query takes between 10 and 15 seconds to run, and the box in general gets lethargic.
>> This causes a pile up of connections at the DB, and we run out of max_connections.
>> This is accompanied with a steep spike in system CPU and load avg. No spike in user CPU or in I/O.
>
> System CPU only huh? There have been some reports of such behavior
> apparently caused by inefficiencies in the kernel's support of
> "transparent huge pages". See for instance this thread
>
> http://www.postgresql.org/message-id/flat/CABMVzL2y8mRM5C9xxejAyDqe0i1S78RAE3cEATGYNf5Ktz_Zdg@mail.gmail.com
>
> although it looks like in that case the real fix was to reduce the number
> of backends.

I experienced the THP defragmentation problem even with <10 connections. What always saves me is to set

    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

The names might be slightly different on CentOS, like redhat_transparent_hugepage or something like this; I don't remember exactly.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
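Since the sysfs directory name differs between kernels, it can help to check which one exists and what is currently active before changing anything. The following is a minimal read-only sketch, not a definitive tool: the function names (thp_active_value, thp_dir, thp_report) are made up for illustration, and it assumes the directory is either transparent_hugepage or redhat_transparent_hugepage under /sys/kernel/mm, with the active value shown in square brackets.

```shell
# Extract the active value from a THP sysfs line such as "always madvise [never]".
# The kernel marks the active choice with square brackets.
thp_active_value() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z+]*\)\].*/\1/p'
}

# Locate the THP directory. The optional argument is the directory to search
# (default /sys/kernel/mm); it exists mainly so the function can be tried
# against a scratch directory. Both candidate names are assumptions.
thp_dir() {
    base=${1:-/sys/kernel/mm}
    for name in transparent_hugepage redhat_transparent_hugepage; do
        if [ -d "$base/$name" ]; then
            printf '%s\n' "$base/$name"
            return 0
        fi
    done
    return 1
}

# Print the active "enabled" and "defrag" settings, one per line.
thp_report() {
    dir=$(thp_dir "$1") || { echo "THP directory not found" >&2; return 1; }
    for f in enabled defrag; do
        [ -r "$dir/$f" ] || continue
        printf '%s: %s\n' "$f" "$(thp_active_value "$(cat "$dir/$f")")"
    done
}
```

Running thp_report with no argument inspects the live system. Changing a setting means writing one of the listed values into the same file as root (for example never to turn the feature off), and a value written this way does not survive a reboot.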
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
Scott Marlowe
Date:
On Mon, Jan 6, 2014 at 6:06 PM, Anand Kumar, Karthik <Karthik.AnandKumar@classmates.com> wrote:
> Hi,
>
> We run postgres 9.1.11, on Centos 6.3, and an ext2 filesystem
>
> Everything will run along okay, and every few hours, for about a couple of
> minutes, postgres will slow way down. A "select 1" query takes between 10
> and 15 seconds to run, and the box in general gets lethargic.
>
> This causes a pile up of connections at the DB, and we run out of
> max_connections.
>
> This is accompanied with a steep spike in system CPU and load avg. No spike
> in user CPU or in I/O.

As well as the previously mentioned huge pages thing, also look at vm dirty ratio:
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
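To see what the dirty-page thresholds are currently set to, something like the sketch below may be enough. It only reads values and changes nothing. The function name vm_dirty_report is made up for illustration, and the optional directory argument exists only so the function can be pointed at a scratch directory instead of /proc/sys/vm.

```shell
# Print the kernel's dirty-page writeback thresholds.
# Keys that are missing on a given kernel are skipped silently.
vm_dirty_report() {
    dir=${1:-/proc/sys/vm}
    for key in dirty_background_ratio dirty_ratio dirty_background_bytes dirty_bytes; do
        if [ -r "$dir/$key" ]; then
            printf 'vm.%s = %s\n' "$key" "$(cat "$dir/$key")"
        fi
    done
}
```

The ratio values are percentages of memory, so on a box with a lot of RAM even a modest percentage can allow a large amount of dirty data to build up before writeback starts.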
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
Sameer Kumar
Date:
On Tue, Jan 7, 2014 at 9:06 AM, Anand Kumar, Karthik <Karthik.AnandKumar@classmates.com> wrote:
> We do typically have a lot of idle connections (1500 connections total, over a 1000 idle at any given time). We're in the midst of installing pgbouncer to try and mitigate the problem, but that still doesn't address the root cause.

1500 connections in total (is that also the value of your max_connections)? How many CPUs/CPU threads do you have on your server? How many concurrent transactions are happening in your database? 1000 idle at any given time? Whoa! Is that by design, or a defect which you have decided to live with? Are those 1000 connections idle, or idle in transaction?
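The idle versus idle-in-transaction question can be answered from pg_stat_activity. Here is a rough sketch for 9.1, where the session state is encoded in the current_query column as '<IDLE>' or '<IDLE> in transaction' (the separate state column only arrived in 9.2). The names STATE_QUERY and tally_states are made up for illustration, and the psql invocation in the comment assumes a superuser connection that works without extra options.

```shell
# Query that labels each backend; written for 9.1's pg_stat_activity layout.
STATE_QUERY="SELECT CASE current_query
                 WHEN '<IDLE>' THEN 'idle'
                 WHEN '<IDLE> in transaction' THEN 'idle_in_transaction'
                 ELSE 'active'
             END
             FROM pg_stat_activity"

# Tally the labels (one per line on stdin) into a single summary line.
tally_states() {
    awk '
        $0 == "idle"                { idle++ }
        $0 == "idle_in_transaction" { idle_tx++ }
        $0 == "active"              { active++ }
        END {
            printf "idle=%d idle_in_transaction=%d active=%d\n", idle, idle_tx, active
        }
    '
}

# Example use (connection options are an assumption):
#   psql -At -c "$STATE_QUERY" | tally_states
```

When run as a non-superuser, other users' sessions show '<insufficient privilege>' in current_query and would be counted as active here, so the numbers are only meaningful from a superuser connection.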
What pooling mode are you using in pgbouncer? Going by your description, I would suggest transaction mode, since you either have bugs or deliberately (why? why?) keep connections open/idle.
In session mode you may not get much benefit from pgbouncer.
With pgbouncer you can set max_client_conn to a higher number and try setting timeouts for your clients (note that this will also time out clients that are in the middle of a transaction but have been idle for a long time).
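As an illustration of that advice, a pgbouncer.ini fragment could look roughly like the following. This is a sketch under assumptions, not a tuned configuration: the database name appdb, the addresses, the file path, and every number are placeholders to adjust. The client connection limit in pgbouncer is called max_client_conn, and the two timeouts shown are one way to apply the client timeouts mentioned above.

```ini
[databases]
; "appdb" and the connection details are placeholders
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; release the server connection back to the pool after each transaction
pool_mode = transaction

; clients allowed to connect to pgbouncer (can exceed max_connections)
max_client_conn = 2000

; server connections per user/database pair (illustrative value)
default_pool_size = 50

; seconds before an idle client is disconnected
client_idle_timeout = 600

; seconds before a client sitting idle in a transaction is disconnected
idle_transaction_timeout = 300
```

With transaction pooling, the number of real backends is bounded by the pool size rather than by the number of clients, which is what lets many mostly idle clients share a small number of server connections.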
Best Regards,
Sameer Kumar | Database Consultant
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533
M : +65 8110 0350 T: +65 6438 3504 | www.ashnik.com
www.facebook.com/ashnikbiz | www.twitter.com/ashnikbiz
Re: Sudden slow down and spike in system CPU causes max_connections to get exhausted
From
"Anand Kumar, Karthik"
Date:
Thanks all for your suggestions. Looks like disabling transparent huge pages fixed this issue for us. We haven't had it occur in two days now after the change.

Thanks,
Karthik