Random performance hit, unknown cause.

From:
Brian Fehrle
Date:
Hi all,

OS: 64-bit Linux, kernel 2.6.32
PostgreSQL 9.0.5 installed from Ubuntu packages.
8 CPU cores
64 GB system memory
Database cluster is on a RAID 10 direct-attached array, using an HP p800 controller card.


I have a system that has been having occasional performance hits: the load on the system skyrockets, all queries take longer to execute, and a hot standby slave I have set up via streaming replication starts to fall behind. I'm having trouble pinpointing exactly where the issue is.

This morning, during our nightly backup process (where we grab a copy of the data directory), we started having this same issue. The main thing that I see in all of these episodes is high disk wait on the system. When we are performing 'well', the %wa from top is usually around 30% and our load is around 12 - 15. This morning we saw a load of 21 - 23, and a %wa jumping between 60% and 75%.

The top process pretty much at all times is the WAL Sender Process; is this normal?

From what I can tell, my access patterns on the database have not changed: the same average number of inserts, updates, and deletes, and nothing on the system has changed in any way. There are no abnormal autovacuum processes beyond those that are normally already running.

So what can I do to track down what the issue is? Currently the system has returned to a 'good' state and performance looks great, but I would like to know how to prevent this, as well as be able to grab good stats if it does happen again in the future.

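Roughly what I have in mind for grabbing those stats next time (an untested
sketch; the output directory and psql connection options are placeholders for
whatever fits our setup), run from cron every minute or so during an episode:

    #!/bin/sh
    # rough snapshot collector: system load, IO, and backend activity
    TS=$(date +%Y%m%d-%H%M%S)
    OUT=/var/tmp/perf-snapshots/$TS
    mkdir -p "$OUT"
    top -b -n 1      > "$OUT/top.txt"
    vmstat 5 2       > "$OUT/vmstat.txt"
    iostat -d -x 5 2 > "$OUT/iostat.txt"
    # what the backends are doing at that moment (9.0 column names)
    psql -X -c "SELECT procpid, waiting, now() - query_start AS runtime, current_query
                FROM pg_stat_activity ORDER BY query_start;" > "$OUT/pg_stat_activity.txt"
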
Has anyone had any issues with the HP p800 controller card in a postgres environment? Is there anything that can help us maximise performance to disk in this case, as it seems to be one of our major bottlenecks? I do plan on moving pg_xlog to a separate drive down the road; the cluster is extremely active, so that will help out a ton.

some IO stats:

$ iostat -d -x 5 3
Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dev1            1.99    75.24  651.06  438.04 41668.57  8848.18    46.38     0.60    3.68   0.70  76.36
dev2            0.00     0.00  653.05  513.43 41668.57  8848.18    43.31     2.18    4.78   0.65  76.35

Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util

dev1            0.00    35.20  676.20  292.00 35105.60  5688.00    42.13    67.76   70.73   1.03 100.00
dev2            0.00     0.00  671.80  295.40 35273.60  4843.20    41.48    73.41   76.62   1.03 100.00

Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util

dev1            1.20    40.80  865.40  424.80 51355.20  8231.00    46.18    37.87   29.22   0.77  99.80
dev2            0.00     0.00  867.40  465.60 51041.60  8231.00    44.47    38.28   28.58   0.75  99.80


Thanks in advance,
Brian F

Re: Random performance hit, unknown cause.

From:
Claudio Freire
Date:
On Thu, Apr 12, 2012 at 3:41 PM, Brian Fehrle
<brianf@consistentstate.com> wrote:
> This morning, during our nightly backup process (where we grab a copy of the
> data directory), we started having this same issue. The main thing that I
> see in all of these is a high disk wait on the system. When we are
> performing 'well', the %wa from top is usually around 30%, and our load is
> around 12 - 15. This morning we saw a load  21 - 23, and an %wa jumping
> between 60% and 75%.
>
> The top process pretty much at all times is the WAL Sender Process, is this
> normal?

Sounds like vacuum to me.

Re: Random performance hit, unknown cause.

From:
"Kevin Grittner"
Date:
Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Apr 12, 2012 at 3:41 PM, Brian Fehrle
> <brianf@consistentstate.com> wrote:
>> This morning, during our nightly backup process (where we grab a
>> copy of the data directory), we started having this same issue.
>> The main thing that I see in all of these is a high disk wait on
>> the system. When we are performing 'well', the %wa from top is
>> usually around 30%, and our load is around 12 - 15. This morning
>> we saw a load  21 - 23, and an %wa jumping between 60% and 75%.
>>
>> The top process pretty much at all times is the WAL Sender
>> Process, is this normal?
>
> Sounds like vacuum to me.

More particularly, it seems consistent with autovacuum finding a
large number of tuples which had reached their freeze threshold.
Rewriting the tuple in place with a frozen xmin is a WAL-logged
operation.
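
One way to check whether that is what's happening (just a sketch; run it
against the affected database) is to compare each table's relfrozenxid age
with the freeze settings:

    psql -X -c "SHOW vacuum_freeze_min_age;"
    psql -X -c "SHOW autovacuum_freeze_max_age;"
    psql -X -c "SELECT relname, age(relfrozenxid) AS xid_age
                FROM pg_class WHERE relkind = 'r'
                ORDER BY age(relfrozenxid) DESC LIMIT 10;"

Tables with a very high xid_age are the ones likely to generate a burst of
freeze-related WAL when vacuum next scans them.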

-Kevin

Re: Random performance hit, unknown cause.

From:
Claudio Freire
Date:
On Thu, Apr 12, 2012 at 3:41 PM, Brian Fehrle
<brianf@consistentstate.com> wrote:
> Is there anything that can help us maximise the performance to disk in this
> case, as it seems to be one of our major bottlenecks?

If it's indeed autovacuum, like I think it is, you can try limiting it
with pg's autovacuum_cost_delay params.
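
If it's mostly one big table doing this, you can also throttle autovacuum per
table via storage parameters (a sketch; "big_table" is a placeholder name and
the values are only examples; the delay is in milliseconds):

    psql -X -c "ALTER TABLE big_table SET (autovacuum_vacuum_cost_delay = 20);"
    psql -X -c "ALTER TABLE big_table SET (autovacuum_vacuum_cost_limit = 200);"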

Re: Random performance hit, unknown cause.

From:
Brian Fehrle
Date:
Interesting; that is very likely.

In this system I have a table that is extremely active. On a 'normal'
day, the autovacuum process takes about 7 hours to complete on this
table, and once it's complete, the system performs an autoanalyze on the
table, finding that we have millions of new dead rows. Once this
happens, it kicks off the autovacuum again, so we basically always have
a vacuum running on this table at any given time.
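
For reference, this is roughly how I watch it (nothing fancy, just the
standard stats view; connection options omitted):

    psql -X -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
                FROM pg_stat_user_tables
                ORDER BY n_dead_tup DESC LIMIT 10;"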

If I were to tweak the autovacuum_vacuum_cost_delay parameter, what
would that be doing? Would it be limiting what the current autovacuum is
allowed to do? Or does it simply space out the time between autovacuum
runs? In my case, with 7 hour long autovacuums (sometimes 14 hours), a
few milliseconds between each vacuum wouldn't mean anything to me.

If that parameter does limit the amount of work autovacuum can do, it may
cause the system to perform better at that time, but it would prolong the
length of the autovacuum, right? That's already an issue I'm dealing with,
and I wouldn't want to make the autovacuum any longer if I don't need to.

- Brian F


On 04/12/2012 01:52 PM, Kevin Grittner wrote:
> Claudio Freire<klaussfreire@gmail.com>  wrote:
>> On Thu, Apr 12, 2012 at 3:41 PM, Brian Fehrle
>> <brianf@consistentstate.com>  wrote:
>>> This morning, during our nightly backup process (where we grab a
>>> copy of the data directory), we started having this same issue.
>>> The main thing that I see in all of these is a high disk wait on
>>> the system. When we are performing 'well', the %wa from top is
>>> usually around 30%, and our load is around 12 - 15. This morning
>>> we saw a load  21 - 23, and an %wa jumping between 60% and 75%.
>>>
>>> The top process pretty much at all times is the WAL Sender
>>> Process, is this normal?
>> Sounds like vacuum to me.
>
> More particularly, it seems consistent with autovacuum finding a
> large number of tuples which had reached their freeze threshold.
> Rewriting the tuple in place with a frozen xmin is a WAL-logged
> operation.
>
> -Kevin


Re: Random performance hit, unknown cause.

From:
"Kevin Grittner"
Date:
Brian Fehrle <brianf@consistentstate.com> wrote:

> In this system I have a table that is extremely active. On a
> 'normal' day, the autovacuum process takes about 7 hours to
> complete on this table, and once it's complete, the system
> performs an autoanalyze on the table, finding that we have
> millions of new dead rows. Once this happens, it kicks off the
> autovacuum again, so we basically always have a vacuum running on
> this table at any given time.
>
> If I were to tweak the autovacuum_vacuum_cost_delay parameter,
> what would that be doing?

That controls how long an autovacuum worker naps after it has done
enough work to hit autovacuum_vacuum_cost_limit.  As tuning knobs go,
this one is pretty coarse.

> Would it be limiting what the current autovacuum is allowed to do?

No, just how fast it does it.

> Or does it simply space out the time between autovacuum runs?

Not that either; it's part of pacing the work of a run.

> In my case, with 7 hour long autovacuums (sometimes 14 hours), a
> few milliseconds between each vacuum wouldn't mean anything to me.

Generally, I find that the best way to tune it is to pick 10ms to
20ms for autovacuum_vacuum_cost_delay, and then adjust
autovacuum_vacuum_cost_limit from there.  A small change in the
former can cause a huge change in pacing; the latter is better for
fine-tuning.

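As a sketch of what that looks like in practice (the numbers below are only a
starting point, not a recommendation for your box), edit postgresql.conf and
reload:

    # lines in postgresql.conf might end up reading:
    #   autovacuum_vacuum_cost_delay = 10ms
    #   autovacuum_vacuum_cost_limit = 1000
    # both settings take effect on a reload, no restart needed:
    psql -X -c "SELECT pg_reload_conf();"
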
> It may cause the system to perform better at that time, but would
> prolong the length of the autovacuum right?

Right.

-Kevin

Re: Random performance hit, unknown cause.

From:
"Strange, John W"
Date:
Check your pagecache settings. When doing heavy IO writes of a large file you
can basically force a Linux box to completely stall. At some point, once the
pagecache has reached its limit, it will force all IO to go sync, at least
from my understanding. We are still fighting with this, but lots of changes
in RH6 seem to address a lot of these issues.

grep -i dirty /proc/meminfo          # how much dirty page cache is outstanding
ls /proc/sys/vm/                     # the vm tunables available on this kernel
cat /proc/sys/vm/nr_pdflush_threads  # number of pdflush writeback threads

Once the dirty pages reach a really large size and hit the pagecache limit,
your system should experience a pretty abrupt drop in performance. You should
be able to avoid this by using sync writes, but we haven't had a chance to
completely isolate and address this issue.

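The knobs we have been poking at, for what it's worth (the values below are
only examples to experiment with, not recommendations):

    # where the kernel starts background writeback / starts blocking writers
    cat /proc/sys/vm/dirty_background_ratio
    cat /proc/sys/vm/dirty_ratio
    # example of lowering them so writeback starts earlier and the stall is shorter
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=10
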
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Claudio
Freire
Sent: Thursday, April 12, 2012 1:50 PM
To: Brian Fehrle
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Random performance hit, unknown cause.

On Thu, Apr 12, 2012 at 3:41 PM, Brian Fehrle <brianf@consistentstate.com> wrote:
> This morning, during our nightly backup process (where we grab a copy 
> of the data directory), we started having this same issue. The main 
> thing that I see in all of these is a high disk wait on the system. 
> When we are performing 'well', the %wa from top is usually around 30%, 
> and our load is around 12 - 15. This morning we saw a load  21 - 23, 
> and an %wa jumping between 60% and 75%.
>
> The top process pretty much at all times is the WAL Sender Process, is 
> this normal?

Sounds like vacuum to me.
