Discussion: strange pgbench results (as if blocked at the end)

strange pgbench results (as if blocked at the end)

From
"Tomas Vondra"
Date:
Hi,

I've run a lot of pgbench tests recently (trying to compare various fs,
block sizes etc.), and I've noticed several really strange results.

Each benchmark consists of three simple steps:

1) set-up the database
2) read-only run (10 clients, 5 minutes)
3) read-write run (10 clients, 5 minutes)

with a short read-only warm-up (1 client, 1 minute) before each run.
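
Roughly speaking, the steps map to pgbench invocations like these; the
scale factor, database name and the -l logging switch are illustrative
placeholders, not the exact values I used:

  # 1) set-up the database
  pgbench -i -s 100 bench

  # warm-up before each run: 1 client, 1 minute, read-only
  pgbench -S -c 1 -T 60 bench

  # 2) read-only run: 10 clients, 5 minutes, per-transaction log
  pgbench -S -c 10 -T 300 -l bench

  # 3) read-write run: 10 clients, 5 minutes, per-transaction log
  pgbench -c 10 -T 300 -l bench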

I've run nearly 200 of these, and in about 10 cases I got something that
looks like this:

http://www.fuzzy.cz/tmp/pgbench/tps.png
http://www.fuzzy.cz/tmp/pgbench/latency.png

i.e. it runs just fine for about 3:40 and then something goes wrong. The
run should take 5 minutes, but it somehow locks up, does nothing for
about 2 minutes, and then all the clients end at the same time. So instead
of 5 minutes the run actually takes about 6:40.

The question is what went wrong - AFAIK there's nothing else running on
the machine that could cause this. I'm looking for possible culprits -
I'll try to repeat this run and see if it happens again.

The pgbench log is available here (notice the 10 lines at the end, those
are the 10 blocked clients) along with the postgres.log

http://www.fuzzy.cz/tmp/pgbench/pgbench.log.gz
http://www.fuzzy.cz/tmp/pgbench/pg.log

Ignore the "immediate shutdown request" warning (once the benchmark is
over, I don't need it anymore). Besides that, there is just a bunch of
"pgstat wait timeout" warnings (which makes sense, because the pgbench run
does a lot of I/O).

I'd understand a slowdown, but why does it block?

I'm using PostgreSQL 9.0.4, the machine has 2GB of RAM and 1GB of shared
buffers. I admit the machine might be configured a bit differently (e.g.
smaller shared buffers) but I've seen just about 10 such strange results
out of 200 runs, so I doubt this is the cause.

I was thinking about something like autovacuum, but I'd expect that to
happen much more frequently (same config, same workload, etc.), and it
happens with only some file systems.

For example for ext3/writeback, the STDDEV(latency) looks like this
(x-axis represents PostgreSQL block size, y-axis fs block size):

  http://www.fuzzy.cz/tmp/pgbench/ext3-writeback.png

while for ext4/journal:

  http://www.fuzzy.cz/tmp/pgbench/ext4-journal.png
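
(The "writeback" and "journal" labels refer to the journaling mode the
filesystem is mounted with. The exact mkfs/mount commands aren't shown
here; the combinations look roughly like this, with the device and mount
point being placeholders:)

  # ext3, 4kB fs blocks, writeback journaling mode
  mkfs.ext3 -b 4096 /dev/sdb1
  mount -o data=writeback /dev/sdb1 /mnt/data

  # ext4, 4kB fs blocks, full data journaling
  mkfs.ext4 -b 4096 /dev/sdb1
  mount -o data=journal /dev/sdb1 /mnt/data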

thanks
Tomas


Re: strange pgbench results (as if blocked at the end)

From
Craig Ringer
Date:
On 13/08/2011 7:37 AM, Tomas Vondra wrote:

> I'd understand a slowdown, but why does it block?

My first guess is that you're having checkpoint issues. Try enabling
logging of checkpoints and checkpoint timings and see if anything there
lines up with the pause you encounter.
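
Something along these lines in postgresql.conf should be enough; the
log_line_prefix value is just a suggestion so the log lines can be
matched against the pgbench timeline:

  log_checkpoints = on      # logs checkpoint start/complete with write/sync/total times
  log_line_prefix = '%t '   # prefix each line with a timestamp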

--
Craig Ringer


Re: strange pgbench results (as if blocked at the end)

From
Greg Smith
Date:
On 08/12/2011 07:37 PM, Tomas Vondra wrote:
> I've run nearly 200 of these, and in about 10 cases I got something that
> looks like this:
>
> http://www.fuzzy.cz/tmp/pgbench/tps.png
> http://www.fuzzy.cz/tmp/pgbench/latency.png
>
> i.e. it runs just fine for about 3:40 and then something goes wrong. The
> run should take 5 minutes, but it somehow locks up, does nothing for
> about 2 minutes, and then all the clients end at the same time. So instead
> of 5 minutes the run actually takes about 6:40.

You need to run tests like these for 10 minutes to see the full cycle of things; then you'll likely see them on most runs, instead of only 5%.  It's probably the case that some of your tests are finishing before the first checkpoint does, which is why you don't see the bad stuff every time.

The long pauses are most likely every client blocking once the checkpoint sync runs.  When those fsync calls go out, Linux will freeze for quite a while there on ext3.  In this example, the drop in TPS/rise in latency at around 50:30 is either the beginning of a checkpoint or the dirty_background_ratio threshold in Linux being exceeded; they tend to happen around the same time.  It executes the write phase for a bit, then gets into the sync phase around 51:40.  You can find a couple of examples just like this on my giant test set around what was committed as the fsync compaction feature in 9.1, all at http://www.2ndquadrant.us/pgbench-results/index.htm
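
If you want to see where those thresholds sit on your test box, the knobs look like this; the lowered values are only an example of the direction to push them, not a tuning recommendation:

  # current thresholds, as a percentage of RAM
  cat /proc/sys/vm/dirty_background_ratio
  cat /proc/sys/vm/dirty_ratio

  # make the kernel start writing earlier and hold less dirty data
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=5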

The one most similar to your case is http://www.2ndquadrant.us/pgbench-results/481/index.html  Had that test only run for 5 minutes, it would have looked just like yours, ending after the long pause that's in the middle on my run.  The freeze was over 3 minutes long in that example.  (My server has a fairly fast disk subsystem, probably faster than what you're testing, but it also has 8GB of RAM that it can dirty to more than make up for it).

In my tests, I switched from ext3 to XFS to get better behavior.  You got the same sort of benefit from ext4.  ext3 just doesn't handle its write cache filling and then having fsync calls execute very well.  I've given up on that as an unsolvable problem; improving behavior on XFS and ext4 are the only problems worth worrying about now to me.

And I keep seeing too many data corruption issues on ext4 to recommend anyone use it yet for PostgreSQL, that's why I focused on XFS.  ext4 still needs at least a few more months before all the bug fixes it's gotten in later kernels are backported to the 2.6.32 versions deployed in RHEL6 and Debian Squeeze, the newest Linux distributions my customers care about right now.  On RHEL6 for example, go read http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html , specifically BZ#635199, and you tell me if that sounds like it's considered stable code yet or not.  "The block layer will be updated in future kernels to provide this more efficient mechanism of ensuring ordering...these future block layer improvements will change some kernel interfaces..."  Yikes, that does not inspire confidence to me.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: strange pgbench results (as if blocked at the end)

From
tv@fuzzy.cz
Date:
On 13 August 2011, 5:09, Greg Smith wrote:
> The long pauses are most likely every client blocking once the
> checkpoint sync runs.  When those fsync calls go out, Linux will freeze
> for quite a while there on ext3.  In this example, the drop in TPS/rise
> in latency at around 50:30 is either the beginning of a checkpoint or
> the dirty_background_ratio threshold in Linux being exceeded; they tend
> to happen around the same time.  It executes the write phase for a bit,
> then gets into the sync phase around 51:40.  You can find a couple of
> examples just like this on my giant test set around what was committed
> as the fsync compaction feature in 9.1, all at
> http://www.2ndquadrant.us/pgbench-results/index.htm
>
> The one most similar to your case is
> http://www.2ndquadrant.us/pgbench-results/481/index.html  Had that test
> only run for 5 minutes, it would have looked just like yours, ending
> after the long pause that's in the middle on my run.  The freeze was
> over 3 minutes long in that example.  (My server has a fairly fast disk
> subsystem, probably faster than what you're testing, but it also has 8GB
> of RAM that it can dirty to more than make up for it).

I guess you're right - I was thinking about checkpoints too, but what
really puzzled me was that only some of the runs (with about the same
workload) were affected by that.

It's probably a timing issue - the tests were running for 5 minutes and
checkpoint timeout is 5 minutes too. So the runs where the checkpoint
timed out early had to write very little data, but when the checkpoint
timed out just before the end it had to write much more data.

I've increased the test duration to 10 minutes, decreased the
checkpoint timeout to 4 minutes, and a checkpoint is issued just before
the pgbench run. That way the starting position should be more or less
the same for all runs.
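
In shell terms the new protocol looks roughly like this (database name
and client count are simplified; checkpoint_timeout = 4min is set in
postgresql.conf before the batch starts):

  # align the starting position before each run
  psql -d bench -c "CHECKPOINT"

  # then the actual run, now 10 minutes long
  pgbench -c 10 -T 600 -l bench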

> In my tests, I switched from ext3 to XFS to get better behavior.  You
> got the same sort of benefit from ext4.  ext3 just doesn't handle its
> write cache filling and then having fsync calls execute very well.  I've
> given up on that as an unsolvable problem; improving behavior on XFS and
> ext4 are the only problems worth worrying about now to me.

For production systems, XFS seems like a good choice. The purpose of
the tests I've run was merely to see the effect of various block sizes
and mount options for the available file systems (including
experimental ones).

If you're interested, you can see the results here: http://www.fuzzy.cz/bench/
although it's only the first batch, with runs that were not long enough
(just 5 minutes) and a few slightly misconfigured options (shared buffers
and checkpoints).

> And I keep seeing too many data corruption issues on ext4 to recommend
> anyone use it yet for PostgreSQL, that's why I focused on XFS.  ext4
> still needs at least a few more months before all the bug fixes it's
> gotten in later kernels are backported to the 2.6.32 versions deployed
> in RHEL6 and Debian Squeeze, the newest Linux distributions my customers
> care about right now.  On RHEL6 for example, go read
> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html
> , specifically BZ#635199, and you tell me if that sounds like it's
> considered stable code yet or not.  "The block layer will be updated in
> future kernels to provide this more efficient mechanism of ensuring
> ordering...these future block layer improvements will change some kernel
> interfaces..."  Yikes, that does not inspire confidence to me.

XFS is naturally much more mature / stable than EXT4, but I'm not quite
sure I want to judge the stability of code based on a comment in release
notes. As I understand it, the comment says something like "things are
not working as efficiently as they should, we'll improve that in the
future" and it relates to the block layer as a whole, not just specific
file systems. But I don't have access to the bug #635199, so maybe I
missed something.

regards
Tomas

Re: strange pgbench results (as if blocked at the end)

From
Scott Marlowe
Date:
On Sun, Aug 14, 2011 at 6:51 AM,  <tv@fuzzy.cz> wrote:
>
> I've increased the test duration to 10 minutes, decreased the
> checkpoint timeout to 4 minutes and a checkpoint is issued just before
> the pgbench. That way the starting position should be more or less the
> same for all runs.

Also look at increasing checkpoint_completion_target to something
closer to 1; 0.8 is a nice starting place.

Re: strange pgbench results (as if blocked at the end)

From
tv@fuzzy.cz
Date:
On Sun, 14 Aug 2011 07:15:00 -0600, Scott Marlowe
<scott.marlowe@gmail.com> wrote:
> On Sun, Aug 14, 2011 at 6:51 AM,  <tv@fuzzy.cz> wrote:
>>
>> I've increased the test duration to 10 minutes, decreased the
>> checkpoint timeout to 4 minutes and a checkpoint is issued just before
>> the pgbench. That way the starting position should be more or less the
>> same for all runs.
>
> Also look at increasing checkpoint completion target to something
> closer to 1. 0.8 is a nice starting place.

Yes, I've increased that already:

checkpoint_segments=64
checkpoint_completion_target=0.9

Tomas

Re: strange pgbench results (as if blocked at the end)

From
Tom Lane
Date:
<tv@fuzzy.cz> writes:
> On 13 August 2011, 5:09, Greg Smith wrote:
>> And I keep seeing too many data corruption issues on ext4 to recommend
>> anyone use it yet for PostgreSQL, that's why I focused on XFS.  ext4
>> still needs at least a few more months before all the bug fixes it's
>> gotten in later kernels are backported to the 2.6.32 versions deployed
>> in RHEL6 and Debian Squeeze, the newest Linux distributions my customers
>> care about right now.  On RHEL6 for example, go read
>> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html
>> , specifically BZ#635199, and you tell me if that sounds like it's
>> considered stable code yet or not.  "The block layer will be updated in
>> future kernels to provide this more efficient mechanism of ensuring
>> ordering...these future block layer improvements will change some kernel
>> interfaces..."  Yikes, that does not inspire confidence to me.

> XFS is naturally much more mature / stable than EXT4, but I'm not quite
> sure I want to judge the stability of code based on a comment in release
> notes. As I understand it, the comment says something like "things are
> not working as efficiently as it should, we'll improve that in the
> future" and it relates to the block layer as a whole, not just specific
> file systems. But I don't have access to the bug #635199, so maybe I
> missed something.

I do ;-).  The reason for the tech note was to point out that RHEL6.1
would incorporate backports of upstream kernel changes that broke the
ABI for loadable kernel modules, compared to what it had been in
RHEL6.0.  That's of great interest to third-party software developers
who add device or filesystem drivers to RHEL, but I don't think it
speaks at all to whether the code is unstable from a user's standpoint.
(The changes in question were purely for performance, and involved a
conversion from write barriers in the block layer to flush+fua, whatever
that is.)  Furthermore, this affected every filesystem, not only ext4,
so it entirely fails to support Greg's argument.

            regards, tom lane

Re: strange pgbench results (as if blocked at the end)

From
Greg Smith
Date:
On 08/14/2011 08:51 AM, tv@fuzzy.cz wrote:
> I've increased the test duration to 10 minutes, decreased the
> checkpoint timeout to 4 minutes and a checkpoint is issued just before
> the pgbench. That way the starting position should be more or less the
> same for all runs.
>

That's basically what I settled on for pgbench-tools.  Force a
checkpoint just before the test, so the beginning of each run is aligned
more consistently, then run for long enough that you're guaranteed at
least one checkpoint finishes[1] (and you might see more than one if you
fill checkpoint_segments fast enough).  I never bothered trying to
compress that test cycle down by decreasing checkpoint_timeout.  There are
already too many things you need to do in order to get this test working
well, and I didn't want to add to the mix a change I'd never recommend
people make on a production server.

[1] If your checkpoint behavior goes pathological, for example the
extended checkpoints possible when the background writer fsync queue
fills, it's not actually guaranteed that the checkpoint will finish
within 5 minutes after it starts.  So a 10 minute run doesn't assure
you'll see a checkpoint begin and end in all circumstances, but that is the
expected case.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


Re: strange pgbench results (as if blocked at the end)

From
Greg Smith
Date:
On 08/14/2011 01:33 PM, Tom Lane wrote:
> (The changes in question were purely for performance, and involved a
> conversion from write barriers in the block layer to flush+fua, whatever
> that is.)  Furthermore, this affected every filesystem not only ext4,
> so it really entirely fails to support Greg's argument.
>

FUA is "Force Unit Access", shorthand for forcing write-through.  If
you're staring at this from the perspective I am, where I assume that
every line of new filesystem code always takes a while from release
until it's stable, it still concerns me.  I suspect that both of these
code paths--the write barrier one and the flush+FUA one--are fine on
XFS.  One of the reasons XFS was so terrible in its early years was
because it took a long time to get those parts in particular correct,
which was complicated by how badly drives lied about everything back then.

If these changes shift around which sections of the ext4 write path code
are exercised, it could easily be the case that they move fsync related
work (which has been one of the buggiest parts of that filesystem) from
code that has been tested heavily to parts that haven't been hit as hard
yet by users.  My paranoid raving about this particular bug isn't
strongly supported; I'll concede that with the additional data you've
provided.  But I don't think it's completely unfounded either.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us