Thread: Write workload is causing severe slowdown in Production


Write workload is causing severe slowdown in Production

From
"Gnanakumar"
Date:
Hi,

We're running a web-based application powered by PostgreSQL.  Recently,
we've developed a "new" separate Java-based standalone (daemon process)
threaded program that performs heavy read and write operations on 2
"huge" tables.  One table has 5.4 million records and the other has 1.3
million records.  Moreover, more than one read and/or write operation may
be executing concurrently.

The issue we're currently facing on our Production server is that whenever
this "newly" developed Java program is started, the entire web application
immediately becomes very slow to respond.  At this time, I can also see
from the output of "iostat -tx" that "%util" is crossing more than 80%.
So, what I can infer based on my knowledge is that the write operations
are creating heavy IO traffic.  Since it was entirely slowing down the web
application, we've temporarily stopped running this standalone application.

Meanwhile, I also read that "checkpoint spikes" could be a reason for
slowdowns in a write-heavy database.  I'm also reading that, starting in
PostgreSQL 8.3, we can get verbose logging of the checkpoint process by
turning on "log_checkpoints".

My question is, how do I determine whether "checkpoint" occurrences are the
root cause of this slowdown in my case?  We're running PostgreSQL v8.2.22 on
CentOS5.2 having 35 GB RAM.  "log_checkpoints" is not available in
PostgreSQL v8.2.22.

We want to optimize our Production database to handle both reads and writes,
any suggestions/advice/guidelines on this are highly appreciated.

Some important "postgresql.conf" parameters are:
# - Memory -
shared_buffers=1536MB

# - Planner Cost Constants -
effective_cache_size = 4GB

# - Checkpoints -
checkpoint_segments=32
checkpoint_timeout=5min
checkpoint_warning=270s

# - Background writer -
bgwriter_delay = 200ms
bgwriter_lru_percent = 1.0
bgwriter_lru_maxpages = 5
bgwriter_all_percent = 0.333
bgwriter_all_maxpages = 5

Regards,
Gnanam



Re: Write workload is causing severe slowdown in Production

From
Vitalii Tymchyshyn
Date:
Check for the following messages in your log:
LOG: checkpoints are occurring too frequently (ZZZ seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
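As a quick sketch of how to act on this advice (the log lines below are hypothetical; real lines carry whatever prefix your log_line_prefix setting adds), one way to pull the reported intervals out of those messages:

```python
import re

# Hypothetical log excerpt; real lines depend on your log_line_prefix setting.
log_lines = [
    'LOG:  checkpoints are occurring too frequently (14 seconds apart)',
    'HINT:  Consider increasing the configuration parameter "checkpoint_segments".',
    'LOG:  checkpoints are occurring too frequently (9 seconds apart)',
]

PATTERN = re.compile(r"checkpoints are occurring too frequently \((\d+) seconds apart\)")

# Collect the reported gaps between checkpoints, in seconds.
intervals = [int(m.group(1)) for line in log_lines if (m := PATTERN.search(line))]
print(intervals)  # [14, 9]
```

Very short intervals here would point at checkpoint_segments being too low for the write rate.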

Best regards, Vitalii Tymchyshyn

22.03.12 09:27, Gnanakumar wrote:
> Hi,
>
> We're running a web-based application powered by PostgreSQL.  Recently,
> we've developed a "new" separate Java-based standalone (daemon process)
> threaded program that performs both read and write operations heavily on 2
> "huge" tables.  One table has got 5.4 million records and other has 1.3
> million records.  Moreover, more than one read and/or write operations may
> be executing concurrently.
>
> The issue that we're facing currently in our Production server is, whenever
> this "newly" developed Java program is started/run, then immediately the
> entire web application becomes very slow in response.  At this time, I could
> also see from the output of " iostat -tx" that "%util" is even crossing more
> than 80%.  So, what I could infer here based on my knowledge is, this is
> creating heavy IO traffic because of write operation.  Since it was entirely
> slowing down web application, we've temporarily stopped running this
> standalone application.
>
> Meantime, I also read about "checkpoint spikes" could be a reason for slow
> down in "write workload" database.  I'm also reading that starting in
> PostgreSQL 8.3, we can get verbose logging of the checkpoint process by
> turning on "log_checkpoints".
>
> My question is, how do I determine whether "checkpoint" occurrences are the
> root cause of this slowdown in my case?  We're running PostgreSQL v8.2.22 on
> CentOS5.2 having 35 GB RAM.  "log_checkpoints" is not available in
> PostgreSQL v8.2.22.



Re: Write workload is causing severe slowdown in Production

From
"Tomas Vondra"
Date:
On 22 March 2012, 10:42, Vitalii Tymchyshyn wrote:
> Check for next messages in your log:
> LOG: checkpoints are occurring too frequently (ZZZ seconds apart)
> HINT: Consider increasing the configuration parameter
> "checkpoint_segments".
>
> Best regards, Vitalii Tymchyshyn
>
> 22.03.12 09:27, Gnanakumar wrote:
>> Hi,
>>
>> We're running a web-based application powered by PostgreSQL.  Recently,
>> we've developed a "new" separate Java-based standalone (daemon process)
>> threaded program that performs both read and write operations heavily on
>> 2
>> "huge" tables.  One table has got 5.4 million records and other has 1.3
>> million records.  Moreover, more than one read and/or write operations
>> may
>> be executing concurrently.
>>
>> The issue that we're facing currently in our Production server is,
>> whenever
>> this "newly" developed Java program is started/run, then immediately the
>> entire web application becomes very slow in response.  At this time, I
>> could
>> also see from the output of " iostat -tx" that "%util" is even crossing
>> more
>> than 80%.  So, what I could infer here based on my knowledge is, this is
>> creating heavy IO traffic because of write operation.  Since it was
>> entirely
>> slowing down web application, we've temporarily stopped running this
>> standalone application.

I'd say you should investigate what the application actually does. The
chances are it's poorly written, issuing a lot of queries and causing a
lot of IO.

And 80% utilization does not mean the operations need to be writes - it's
about IO operations, i.e. both reads and writes.

>> Meantime, I also read about "checkpoint spikes" could be a reason for
>> slow
>> down in "write workload" database.  I'm also reading that starting in
>> PostgreSQL 8.3, we can get verbose logging of the checkpoint process by
>> turning on "log_checkpoints".
>>
>> My question is, how do I determine whether "checkpoint" occurrences are
>> the
>> root cause of this slowdown in my case?  We're running PostgreSQL
>> v8.2.22 on
>> CentOS5.2 having 35 GB RAM.  "log_checkpoints" is not available in
>> PostgreSQL v8.2.22.

There's a checkpoint_warning option. Set it to 3600 and you should get
messages in the log. Correlate those to the issues (do they happen at the
same time?). Sadly, 8.2 doesn't have any of the nice statistic views :-(

Check this: http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

It talks about improvements in 8.3 but it mentions older versions too.

If you can, install iotop and watch the processes that cause the I/O.
IIRC, the title of the process should say 'CHECKPOINT' etc. But if the
issues disappear once the application is stopped, it's unlikely the
checkpoints are the issue.

What we need is more details about your setup, especially

   - checkpoint_segments
   - checkpoint_timeout
   - shared_buffers

also it'd be nice to have samples from the vmstat/iostat and messages from
the log.

kind regards
Tomas



Re: Write workload is causing severe slowdown in Production

From
"Gnanakumar"
Date:
> There's a checkpoint_warning option. Set it to 3600 and you should get
> messages in the log. Correlate those to the issues (do they happen at the
> same time?).
After setting "checkpoint_warning" to 3600, can you explain how I should correlate it with the messages?

> If you can, install iotop and watch the processes that cause the I/O.
I tried installing "iotop", but it failed to run because it requires Linux >= 2.6.20.  Our CentOS 5.2 kernel is 2.6.18-8.

> What we need is more details about your setup, especially
>   - checkpoint_segments
>   - checkpoint_timeout
>   - shared_buffers
# - Memory -
shared_buffers=1536MB

# - Planner Cost Constants -
effective_cache_size = 4GB

# - Checkpoints -
checkpoint_segments=32
checkpoint_timeout=5min
checkpoint_warning=270s

# - Background writer -
bgwriter_delay = 200ms
bgwriter_lru_percent = 1.0
bgwriter_lru_maxpages = 5
bgwriter_all_percent = 0.333
bgwriter_all_maxpages = 5

> also it'd be nice to have samples from the vmstat/iostat and messages from
> the log.
Unfortunately, I don't have "exact" logs from when the problem actually happened.



Re: Write workload is causing severe slowdown in Production

From
"Tomas Vondra"
Date:
On 22 March 2012, 11:32, Gnanakumar wrote:
>> There's a checkpoint_warning option. Set it to 3600 and you should get
>> messages in the log. Correlate those to the issues (do they happen at
>> the
>> same time?).
> After setting "checkpoint_warning" to 3600, can you explain on how do I
> correlate with the messages?

Well, you do know when the issues happened, so that you can check the logs
and see if there are messages at that time. Try to install the application
and watch the logs / IO performance.

>> If you can, install iotop and watch the processes that cause the I/O.
> I tried installing "iotop", but it failed to run because it requires Linux
> >= 2.6.20.  Our CentOS5.2 is 2.6.18-8.
>
>> What we need is more details about your setup, especially
>>   - checkpoint_segments
>>   - checkpoint_timeout
>>   - shared_buffers
> # - Memory -
> shared_buffers=1536MB
>
> # - Planner Cost Constants -
> effective_cache_size = 4GB

So, what else is running on the system? Because if there's 35GB RAM and
the shared buffers are 1.5GB, then there's about 33GB for page cache.
Something like 16GB would be a conservative setting.

I'm not saying this will fix the issues, but maybe it shows that something
else is running on the box and maybe that's the culprit, not PostgreSQL?

>> also it'd be nice to have samples from the vmstat/iostat and messages
>> from
>> the log.
> Unfortunately, I don't have "exact" logs when the problem actually
> happened

Then install the app again and collect as much info as possible. Otherwise
it's all just wild guesses.

Tomas


Re: Write workload is causing severe slowdown in Production

From
"Gnanakumar"
Date:
> So, what else is running on the system? Because if there's 35GB RAM and
> the shared buffers are 1.5GB, then there's about 33GB for page cache.
> Something like 16GB would be a conservative setting.

Yes, you guessed it right.  Both the Web server and the DB server are running on the same machine.

> I'm not saying this will fix the issues, but maybe it shows that something
> else is running on the box and maybe that's the culprit, not PostgreSQL?

Based on my observations and analysis, I'm sure that the database write
operations "are" causing the slowdown, and not other applications running
on the same server.

> Then install the app again and collect as much info as possible. Otherwise
> it's all just wild guesses.

OK.



Re: Write workload is causing severe slowdown in Production

From
"Gnanakumar"
Date:
> There's a checkpoint_warning option. Set it to 3600 and you should get messages in the log.

I have a basic question about the "checkpoint_warning" configuration setting.  The 8.2 doc
(http://www.postgresql.org/docs/8.2/interactive/runtime-config-wal.html) says:

"Write a message to the server log if checkpoints caused by the filling of checkpoint segment files happen closer
togetherthan this many seconds (which suggests that checkpoint_segments ought to be raised). The default is 30 seconds
(30s)."

How does increasing the default 30s to 3600s (which is 1 hour) cause messages to be printed to the log?  Even after
reading the description from the above doc, I am not able to get this point clearly.  Can you help me understand this?



Re: Write workload is causing severe slowdown in Production

From
"Tomas Vondra"
Date:
On 22 March 2012, 13:38, Gnanakumar wrote:
>> There's a checkpoint_warning option. Set it to 3600 and you should get
>> messages in the log.
>
> I've a basic question about setting "checkpoint_warning" configuration.
> 8.2 doc
> (http://www.postgresql.org/docs/8.2/interactive/runtime-config-wal.html)
> says:
>
> "Write a message to the server log if checkpoints caused by the filling of
> checkpoint segment files happen closer together than this many seconds
> (which suggests that checkpoint_segments ought to be raised). The default
> is 30 seconds (30s)."
>
> How does increasing the default 30s to 3600s (which is 1 hour or 60
> minutes) print messages to the log?  Even after reading the description
> from above doc, am not able to get this point clearly.  Can you help me in
> understanding this?

A checkpoint may be triggered for two main reasons:

  (1) all available WAL segments are filled (you do have 32 of them, i.e.
      512MB of WAL data)

  (2) the checkpoint_timeout runs out (by default 5 mins IIRC)

The checkpoint_warning option should emit a warning message whenever a
checkpoint happens less than the specified number of seconds since the
last one, so setting it high enough should log all checkpoints.
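To make the two triggers concrete, here is a back-of-the-envelope sketch using the settings quoted in this thread (the sustained WAL write rate is a made-up figure; 16 MB is the default WAL segment size):

```python
# Back-of-the-envelope sketch of when a checkpoint is triggered, using the
# settings quoted in this thread (checkpoint_segments = 32, timeout = 5 min).
SEGMENT_MB = 16                # default WAL segment size
checkpoint_segments = 32
checkpoint_timeout_s = 5 * 60

wal_write_rate_mb_s = 4.0      # hypothetical sustained WAL write rate

# Trigger 1: seconds until all 32 segments (512 MB of WAL) fill up.
fill_time_s = checkpoint_segments * SEGMENT_MB / wal_write_rate_mb_s

# Whichever trigger fires first determines the checkpoint interval.
interval_s = min(fill_time_s, checkpoint_timeout_s)
print(interval_s)  # 128.0 -- segments fill before the 5-minute timeout

# With checkpoint_warning = 3600, every such checkpoint would be logged,
# since each one happens far less than 3600 seconds after the previous one.
logged = interval_s < 3600
print(logged)  # True
```

Under a heavy write load the segment-fill trigger usually wins, which is exactly the case the warning message is designed to catch.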

The fact that this does not happen (no warning messages) suggests this is
not caused by checkpoints. It may be caused by the database, but it seems
unlikely it's checkpoints.

Tomas


Re: Write workload is causing severe slowdown in Production

From
"Tomas Vondra"
Date:
On 22 March 2012, 13:10, Gnanakumar wrote:
>> So, what else is running on the system? Because if there's 35GB RAM and
>> the shared buffers are 1.5GB, then there's about 33GB for page cache.
>> Something like 16GB would be a conservative setting.
>
> Yes, you guessed it right.  Both Web server and DB server are running in
> the same machine.
>
>> I'm not saying this will fix the issues, but maybe it shows that
>> something
>> else is running on the box and maybe that's the culprit, not PostgreSQL?
>
> Based on my observations and analysis, I'm sure that database write
> operation "is" causing the slowdown, but not because of other applications
> running in the same server.

You haven't provided any clear evidence for that statement so far. Let's
wait for the vmstat/iostat logs etc.

Tomas


Re: Write workload is causing severe slowdown in Production

From
"Kevin Grittner"
Date:
"Gnanakumar" <gnanam@zoniac.com> wrote:

> We're running a web-based application powered by PostgreSQL.
> Recently, we've developed a "new" separate Java-based standalone
> (daemon process) threaded program that performs both read and
> write operations heavily on 2 "huge" tables.  One table has got
> 5.4 million records and other has 1.3 million records.  Moreover,
> more than one read and/or write operations may be executing
> concurrently.

We're running a web application using PostgreSQL and Java which has
80 tables with over 1 million records each, the largest of which has
212 million rows.  It is updated by replication from 3000 directly
attached users at 72 sites, using a multi-threaded Java application.
We have one connection pool for the read-only web application,
which allows about 30 concurrent requests, and a connection pool for
the replication which allows 6.

If you want a peek at our performance, you can access the site here:
http://wcca.wicourts.gov/  -- if you view a case and click on the
"Court Record Events" button, you'll be viewing records in the table
with 212 million rows.

My point is that you're not asking PostgreSQL to do anything it
*can't* handle well.

> The issue that we're facing currently in our Production server is,
> whenever this "newly" developed Java program is started/run, then
> immediately the entire web application becomes very slow in
> response.  At this time, I could also see from the output of "
> iostat -tx" that "%util" is even crossing more than 80%.  So, what
> I could infer here based on my knowledge is, this is creating
> heavy IO traffic because of write operation.  Since it was
> entirely slowing down web application, we've temporarily stopped
> running this standalone application.

How are you handling concurrency?  (Are you using FOR SHARE on your
SELECT statements?  Are you explicitly acquiring table locks before
modifying data?  Etc.)  You might be introducing blocking somehow.
When things are slow, try running some of the queries show on this
page to get more clues:

http://wiki.postgresql.org/wiki/Lock_Monitoring

In particular, I recommend that you *never* leave transactions open
or hold locks while waiting for user response or input.  They *will*
answer phone calls or go to lunch with things pending, potentially
blocking other users for extended periods.

> Meantime, I also read about "checkpoint spikes" could be a reason
> for slow down in "write workload" database.

When you hit that issue, there is not a continual slowdown --
queries which normally run very fast (a small fraction of a second)
may periodically all take tens of seconds.  Is that the pattern
you're seeing?

> We're running PostgreSQL v8.2.22 on CentOS5.2 having 35 GB RAM.
> "log_checkpoints" is not available in PostgreSQL v8.2.22.
>
> We want to optimize our Production database to handle both reads
> and writes, any suggestions/advice/guidelines on this are highly
> appreciated.

Support for 8.2 was dropped last year, five years after it was
released.  PostgreSQL has had a new major release every year since
8.2 was released, many of which have provided dramatic performance
improvements.  If you want good performance my first suggestion
would be to upgrade your version of PostgreSQL to at least 9.0, and
preferably 9.1.  Because of stricter typing in 8.3 and later,
upgrading from 8.2 takes a bit more work than most PostgreSQL major
releases.  Be sure to test well.

> # - Background writer -
> bgwriter_delay = 200ms
> bgwriter_lru_percent = 1.0
> bgwriter_lru_maxpages = 5
> bgwriter_all_percent = 0.333
> bgwriter_all_maxpages = 5

These settings result in a very slow dribble of dirty buffers out to
the OS cache.  *If* you're hitting the "checkpoint spikes" issue
(see above), you might want to boost the aggressiveness of the
background writer.  I couldn't recommend settings without knowing a
lot more about your storage system and its capabilities.  In
supported releases of PostgreSQL, the checkpoint system and
background writer are much improved, so again -- upgrading would be
the most effective way to solve the problem.

Besides the outdated PostgreSQL release and possible blocking, I
would be concerned if you are using any sort of ORM for the update
application.  You want to watch that very closely because the
default behavior of many of them does not scale well.  There's
usually a way to get better performance through configuration and/or
bypassing automatic query generation for complex data requests.

-Kevin

Re: Write workload is causing severe slowdown in Production

From
Merlin Moncure
Date:
On Thu, Mar 22, 2012 at 10:13 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> In particular, I recommend that you *never* leave transactions open
> or hold locks while waiting for user response or input.  They *will*
> answer phone calls or go to lunch with things pending, potentially
> blocking other users for extended periods.

This -- transactions aren't meant to intentionally block users but to
allow you to ensure data remains in a valid state in the face of
concurrent activity.  In-transaction locks should be as short as
possible while still giving those guarantees and should hopefully
never be user facing.  If you want user controlled locks, they should
be cooperative and non transactional -- advisory locks.

http://www.postgresql.org/docs/current/static/explicit-locking.html#ADVISORY-LOCKS

merlin

Re: Write workload is causing severe slowdown in Production

From
Mark Kirkwood
Date:
On 22/03/12 20:27, Gnanakumar wrote:
>
> The issue that we're facing currently in our Production server is, whenever
> this "newly" developed Java program is started/run, then immediately the
> entire web application becomes very slow in response.  At this time, I could
> also see from the output of " iostat -tx" that "%util" is even crossing more
> than 80%.  So, what I could infer here based on my knowledge is, this is
> creating heavy IO traffic because of write operation.  Since it was entirely
> slowing down web application, we've temporarily stopped running this
> standalone application.

I'd recommend taking a hard look at what the Java app is doing. You
might be able to get useful data by adding:

log_min_duration_statement = 10000 # queries taking longer than 10 sec

into your postgresql.conf. I'd *guess* that locking is not the issue -
as that is unlikely to cause high IO load (although checking for lots of
locks with granted = f in pg_locks might be worthwhile just to rule it out).
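As a toy illustration of that pg_locks check (the rows below are made up; in practice you would query the pg_locks view directly), counting ungranted entries is enough to rule blocking in or out:

```python
# Hypothetical rows from the pg_locks view; "granted" is False for a
# backend that is waiting on a lock held by someone else.
pg_locks = [
    {"pid": 101, "mode": "RowExclusiveLock", "granted": True},
    {"pid": 102, "mode": "AccessShareLock",  "granted": True},
    {"pid": 103, "mode": "RowExclusiveLock", "granted": False},
]

# Lots of waiting entries would suggest blocking; none suggests the
# slowdown is raw I/O, not lock contention.
waiting = [l for l in pg_locks if not l["granted"]]
print(len(waiting))  # 1
```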

You could also try running the Java app in one of your development
environments to see if you can provoke high load behaviour in a more
easily studied environment.

regards

Mark

Re: Write workload is causing severe slowdown in Production

From
"Gnanakumar"
Date:
First off, thank you *so much* for that detailed explanation and the
comparison with a real production application's performance, which was
really enlightening for me.

> How are you handling concurrency?  (Are you using FOR SHARE on your
> SELECT statements?  Are you explicitly acquiring table locks before
> modifying data?  Etc.)  You might be introducing blocking somehow.

No, actually I am not explicitly locking any tables -- all are *simple*
select, update, and insert statements only.

> In particular, I recommend that you *never* leave transactions open
> or hold locks while waiting for user response or input.

Again, we're not leaving any transactions open while waiting for user
responses, etc.

> When you hit that issue, there is not a continual slowdown --
> queries which normally run very fast (a small fraction of a second)
> may periodically all take tens of seconds.  Is that the pattern
> you're seeing?

Yes, you're correct.  Queries that normally run fast become slow at
the time of this slowdown.

> Besides the outdated PostgreSQL release and possible blocking, I
> would be concerned if you are using any sort of ORM for the update
> application.  You want to watch that very closely because the
> default behavior of many of them does not scale well.  There's
> usually a way to get better performance through configuration and/or
> bypassing automatic query generation for complex data requests.

I am not able to understand the above statements (...any sort of ORM for the
update application...) clearly.  Can you help me understand this?




Re: Write workload is causing severe slowdown in Production

From
"Tomas Vondra"
Date:
On 23 March 2012, 11:10, Gnanakumar wrote:
> First off, thank you *so much* for that detailed explanation comparing
> with
> a real-time application performance benchmark, which was really
> enlightening
> for me.
>
>> How are you handling concurrency?  (Are you using FOR SHARE on your
>> SELECT statements?  Are you explicitly acquiring table locks before
>> modifying data?  Etc.)  You might be introducing blocking somehow.
>
> No, actually am not explicitly locking any tables -- all are *simple*
> select, update, insert statements only.

Are those wrapped in a transaction or not? Each transaction forces an fsync
when committing, and if each of those INSERT/UPDATE statements stands on
its own it may cause a lot of I/O.
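As a rough illustration of the commit/fsync arithmetic (all numbers hypothetical), batching statements into fewer explicit transactions cuts the number of commit-time fsyncs dramatically:

```python
# Rough illustration of why batching statements into fewer transactions
# reduces commit-time fsyncs (the counts are hypothetical).
statements = 10_000   # INSERT/UPDATE statements issued by the daemon
batch_size = 100      # statements grouped per explicit transaction

fsyncs_autocommit = statements  # one commit (and one fsync) per statement
fsyncs_batched = -(-statements // batch_size)  # one commit per batch (ceiling division)

print(fsyncs_autocommit, fsyncs_batched)  # 10000 100
```

The trade-off is that a failed statement rolls back the whole batch, so batch sizing depends on the application's error-handling needs.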

>> Besides the outdated PostgreSQL release and possible blocking, I
>> would be concerned if you are using any sort of ORM for the update
>> application.  You want to watch that very closely because the
>> default behavior of many of them does not scale well.  There's
>> usually a way to get better performance through configuration and/or
>> bypassing automatic query generation for complex data requests.
>
> Am not able to understand above statements (...any sort of ORM for the
> update application ...) clearly.  Can you help me in understanding this?

There are tools that claim to remove the object vs. relational discrepancy
when accessing the database. They often generate queries on the fly, and
some of the queries are pretty awful (depends on how well the ORM model is
defined). There are various reasons why this may suck - loading too much
data, using lazy fetch everywhere etc.

Are you using something like Hibernate, JPA, ... to handle persistence?

Tomas


Re: Write workload is causing severe slowdown in Production

From
"Gnanakumar"
Date:
> Are those wrapped in a transaction or not? Each transaction forces a fsync
> when committing, and if each of those INSERT/UPDATE statements stands on
> it's own it may cause of lot of I/O.

Yes, it's wrapped inside a transaction.  Maybe this could be a reason for the slowdown, as you've highlighted here.
At least we've got some guidance here to troubleshoot this aspect as well.

> There are tools that claim to remove the object vs. relational discrepancy
> when accessing the database. They often generate queries on the fly, and
> some of the queries are pretty awful (depends on how well the ORM model is
> defined). There are various reasons why this may suck - loading too much
> data, using lazy fetch everywhere etc.

Thanks for the clarification.

> Are you using something like Hibernate, JPA, ... to handle persistence?

No, we're not using any persistence frameworks/libraries as such.



Re: Write workload is causing severe slowdown in Production

From
"Kevin Grittner"
Date:
"Gnanakumar" <gnanam@zoniac.com> wrote:

>>  When you hit that issue, there is not a continual slowdown --
>> queries which normally run very fast (a small fraction of a
>> second) may periodically all take tens of seconds.  Is that the
>> pattern you're seeing?
>
> Yes, you're correct.  Queries those normally run fast are becoming
> slow at the time of this slowdown.

But the question is -- while the update application is running is
performance *usually* good with *brief periods* of high latency, or
does it just get bad and stay bad?  The *pattern* is the clue as to
whether it is likely to be write saturation.

Here's something I would recommend as a diagnostic step: run `vmstat
1` (or equivalent, based on your OS) to capture about a minute's
worth of activity while things are running well, and also while
things are slow.  Pick a few lines that are "typical" of each and
paste them into a post here.  (If there is a lot of variation over
the sample, it may be best to attach them to your post in their
entirety.  Don't just paste in more than a few lines of vmstat
output, as the wrapping would make it hard to read.)
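Once those samples are captured, a small sketch like this (the vmstat lines below are made up; real column order can vary slightly between procps versions, so check the header) can help flag I/O-bound periods via the cpu "wa" column:

```python
# Sketch: pick out "slow" samples from `vmstat 1` output by looking at the
# cpu "wa" (time spent waiting on I/O) column.  Sample lines are made up.
vmstat_output = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 512000 102400 900000    0    0     5    20  300  500 10  3 85  2  0
 2  5      0 498000 102400 901000    0    0   120  8200  900 1500 12  6 30 52  0
"""

lines = vmstat_output.strip().splitlines()
header = lines[1].split()          # second line holds the column names
wa_idx = header.index("wa")

# Flag samples where the CPU spends more than 30% of its time waiting on I/O.
high_iowait = [row for row in lines[2:] if int(row.split()[wa_idx]) > 30]
print(len(high_iowait))  # 1
```

Consistently high "wa" during the slow periods, together with a large "bo" (blocks out), would support the write-saturation theory.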

Also, you should try running queries from this page when things are
slow:

http://wiki.postgresql.org/wiki/Lock_Monitoring

If there is any blocking, that might be interesting.

-Kevin