Discussion: strange pgbench results (as if blocked at the end)
Hi,

I've run a lot of pgbench tests recently (trying to compare various file systems, block sizes etc.), and I've noticed several really strange results. Each benchmark consists of three simple steps:

1) set up the database
2) read-only run (10 clients, 5 minutes)
3) read-write run (10 clients, 5 minutes)

with a short read-only warm-up (1 client, 1 minute) before each run.

I've run nearly 200 of these, and in about 10 cases I got something that looks like this:

http://www.fuzzy.cz/tmp/pgbench/tps.png
http://www.fuzzy.cz/tmp/pgbench/latency.png

i.e. it runs just fine for about 3:40 and then something goes wrong. The run should take 5:00 minutes, but it somehow locks up, does nothing for about 2 minutes, and then all the clients end at the same time. So instead of 5 minutes the run actually takes about 6:40.

The question is what went wrong - AFAIK there's nothing else running on the machine that could cause this. I'm looking for possible culprits - I'll try to repeat this run and see if it happens again. The pgbench log is available here (notice the 10 lines at the end - those are the 10 blocked clients), along with the postgres log:

http://www.fuzzy.cz/tmp/pgbench/pgbench.log.gz
http://www.fuzzy.cz/tmp/pgbench/pg.log

Ignore the "immediate shutdown request" warning (once the benchmark is over, I don't need the cluster anymore). Besides that there's just a bunch of "pgstat wait timeout" warnings (which makes sense, because the pgbench run does a lot of I/O).

I'd understand a slowdown, but why does it block? I'm using PostgreSQL 9.0.4; the machine has 2GB of RAM and 1GB of shared buffers. I admit the machine might be configured a bit differently (e.g. smaller shared buffers), but I've seen just about 10 such strange results out of 200 runs, so I doubt this is the cause. I was thinking about something like autovacuum, but I'd expect that to happen much more frequently (same config, same workload, etc.). And it happens with just some file systems.
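The three steps can be sketched as pgbench invocations; the scale factor and database name here are assumptions (not from the post), and a running server is required:

```shell
# Sketch of one benchmark cycle as described above (flag values assumed)
pgbench -i -s 100 bench           # 1) set up the database
pgbench -S -c 1 -T 60 bench       # read-only warm-up: 1 client, 1 minute
pgbench -S -c 10 -T 300 bench     # 2) read-only run: 10 clients, 5 minutes
pgbench -S -c 1 -T 60 bench       # warm-up again before the next run
pgbench -c 10 -T 300 -l bench     # 3) read-write run; -l logs per-transaction latency
```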
For example, for ext3/writeback the STDDEV(latency) looks like this (x-axis represents PostgreSQL block size, y-axis fs block size):

http://www.fuzzy.cz/tmp/pgbench/ext3-writeback.png

while for ext4/journal:

http://www.fuzzy.cz/tmp/pgbench/ext4-journal.png

thanks
Tomas
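Plots and summaries like these can be rebuilt from the pgbench -l per-transaction log. A minimal awk sketch, assuming the six-column log format pgbench wrote in the 8.4-9.3 era (the sample lines are synthetic, standing in for a real pgbench_log file):

```shell
# Rebuild per-second TPS and average latency from a pgbench -l log.
# Assumed per-transaction line format (pgbench 8.4-9.3):
#   client_id transaction_no latency_us file_no epoch_sec epoch_usec
printf '%s\n' \
  '0 1 1200 0 1313180000 0' \
  '1 1 1800 0 1313180000 500' \
  '0 2 1500 0 1313180001 0' > /tmp/pgbench_log.sample

awk '{ tps[$5]++; sum += $3 }
     END {
       for (s in tps) print "second", s, "tps", tps[s]
       printf "avg latency %.2f ms\n", sum / NR / 1000
     }' /tmp/pgbench_log.sample
```

The same per-transaction latency column is what a STDDEV(latency) summary like the heat maps above would aggregate.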
On 13/08/2011 7:37 AM, Tomas Vondra wrote:
> I'd understand a slowdown, but why does it block?

My first guess is that you're having checkpoint issues. Try enabling logging of checkpoints and checkpoint timings and see if anything there lines up with the pause you encounter.

--
Craig Ringer
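Enabling that logging is a small postgresql.conf change (both parameters exist in 9.0; the checkpoint_warning value here is just an illustration):

```ini
# postgresql.conf - make checkpoint activity visible (9.0 GUC names)
log_checkpoints = on        # log each checkpoint's start, end, and write/sync timings
checkpoint_warning = 30s    # warn if WAL-triggered checkpoints come closer together than this
```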
On 08/12/2011 07:37 PM, Tomas Vondra wrote:
You need to run tests like these for 10 minutes to see the full cycle of things; then you'll likely see them on most runs, instead of only 5%. It's probably the case that some of your tests are finishing before the first checkpoint does, which is why you don't see the bad stuff every time.
The long pauses are most likely every client blocking once the checkpoint sync runs. When those fsync calls go out, Linux will freeze for quite a while there on ext3. In this example, the drop in TPS/rise in latency at around 50:30 is either the beginning of a checkpoint or the dirty_background_ratio threshold in Linux being exceeded; they tend to happen around the same time. It executes the write phase for a bit, then gets into the sync phase around 51:40. You can find a couple of examples just like this on my giant test set around what was committed as the fsync compaction feature in 9.1, all at http://www.2ndquadrant.us/pgbench-results/index.htm
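The kernel writeback thresholds involved can be checked from the shell (Linux; these /proc paths are standard):

```shell
# Inspect the writeback thresholds mentioned above
cat /proc/sys/vm/dirty_background_ratio   # % of RAM dirtied before background writeback starts
cat /proc/sys/vm/dirty_ratio              # % of RAM dirtied before writers block outright
# Watch how much dirty data is queued while a run is in progress:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```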
The one most similar to your case is http://www.2ndquadrant.us/pgbench-results/481/index.html Had that test only run for 5 minutes, it would have looked just like yours, ending after the long pause that's in the middle on my run. The freeze was over 3 minutes long in that example. (My server has a fairly fast disk subsystem, probably faster than what you're testing, but it also has 8GB of RAM that it can dirty to more than make up for it).
In my tests, I switched from ext3 to XFS to get better behavior. You got the same sort of benefit from ext4. ext3 just doesn't handle its write cache filling and then having fsync calls execute very well. I've given up on that as an unsolvable problem; improving behavior on XFS and ext4 are the only problems worth worrying about now to me.
And I keep seeing too many data corruption issues on ext4 to recommend anyone use it yet for PostgreSQL; that's why I focused on XFS. ext4 still needs at least a few more months before all the bug fixes it's gotten in later kernels are backported to the 2.6.32 versions deployed in RHEL6 and Debian Squeeze, the newest Linux distributions my customers care about right now. On RHEL6 for example, go read http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html , specifically BZ#635199, and you tell me if that sounds like it's considered stable code yet or not. "The block layer will be updated in future kernels to provide this more efficient mechanism of ensuring ordering...these future block layer improvements will change some kernel interfaces..." Yikes, that does not inspire confidence in me.
-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On 13 August 2011, 5:09, Greg Smith wrote:
> The long pauses are most likely every client blocking once the
> checkpoint sync runs. When those fsync calls go out, Linux will freeze
> for quite a while there on ext3. In this example, the drop in TPS/rise
> in latency at around 50:30 is either the beginning of a checkpoint or
> the dirty_background_ratio threshold in Linux being exceeded; they tend
> to happen around the same time. It executes the write phase for a bit,
> then gets into the sync phase around 51:40. You can find a couple of
> examples just like this on my giant test set around what was committed
> as the fsync compaction feature in 9.1, all at
> http://www.2ndquadrant.us/pgbench-results/index.htm
>
> The one most similar to your case is
> http://www.2ndquadrant.us/pgbench-results/481/index.html Had that test
> only run for 5 minutes, it would have looked just like yours, ending
> after the long pause that's in the middle on my run. The freeze was
> over 3 minutes long in that example. (My server has a fairly fast disk
> subsystem, probably faster than what you're testing, but it also has 8GB
> of RAM that it can dirty to more than make up for it).

I guess you're right - I was thinking about checkpoints too, but what really puzzled me was that only some of the runs (with about the same workload) were affected. It's probably a timing issue - the tests were running for 5 minutes and the checkpoint timeout is 5 minutes too. So the runs where the checkpoint timed out early had very little data to write, but when the checkpoint timed out just before the end it had much more data to write.

I've increased the test duration to 10 minutes, decreased the checkpoint timeout to 4 minutes, and a checkpoint is issued just before the pgbench run. That way the starting position should be more or less the same for all runs.

> In my tests, I switched from ext3 to XFS to get better behavior. You
> got the same sort of benefit from ext4.
> ext3 just doesn't handle its write cache filling and then having fsync
> calls execute very well. I've given up on that as an unsolvable problem;
> improving behavior on XFS and ext4 are the only problems worth worrying
> about now to me.

For production systems, XFS seems like a good choice. The purpose of the tests I've run was merely to see the effect of various block sizes and mount options across the available file systems (including experimental ones). If interested, you can see the results here:

http://www.fuzzy.cz/bench/

although it's the first run, with runs that were not long enough (just 5 minutes) and some other slightly misconfigured options (shared buffers and checkpoints).

> And I keep seeing too many data corruption issues on ext4 to recommend
> anyone use it yet for PostgreSQL, that's why I focused on XFS. ext4
> still needs at least a few more months before all the bug fixes it's
> gotten in later kernels are backported to the 2.6.32 versions deployed
> in RHEL6 and Debian Squeeze, the newest Linux distributions my customers
> care about right now. On RHEL6 for example, go read
> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html
> , specifically BZ#635199, and you tell me if that sounds like it's
> considered stable code yet or not. "The block layer will be updated in
> future kernels to provide this more efficient mechanism of ensuring
> ordering...these future block layer improvements will change some kernel
> interfaces..." Yikes, that does not inspire confidence to me.

XFS is naturally much more mature / stable than ext4, but I'm not quite sure I want to judge the stability of code based on a comment in release notes. As I understand it, the comment says something like "things are not working as efficiently as they should, we'll improve that in the future", and it relates to the block layer as a whole, not just to specific file systems. But I don't have access to bug #635199, so maybe I missed something.
regards Tomas
On Sun, Aug 14, 2011 at 6:51 AM, <tv@fuzzy.cz> wrote:
>
> I've increased the test duration to 10 minutes, decreased the
> checkpoint timeout to 4 minutes and a checkpoint is issued just before
> the pgbench. That way the starting position should be more or less the
> same for all runs.

Also look at increasing checkpoint completion target to something closer to 1. 0.8 is a nice starting place.
On Sun, 14 Aug 2011 07:15:00 -0600, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sun, Aug 14, 2011 at 6:51 AM, <tv@fuzzy.cz> wrote:
>>
>> I've increased the test duration to 10 minutes, decreased the
>> checkpoint timeout to 4 minutes and a checkpoint is issued just before
>> the pgbench. That way the starting position should be more or less the
>> same for all runs.
>
> Also look at increasing checkpoint completion target to something
> closer to 1. 0.8 is a nice starting place.

Yes, I've increased that already:

checkpoint_segments = 64
checkpoint_completion_target = 0.9

Tomas
<tv@fuzzy.cz> writes:
> On 13 August 2011, 5:09, Greg Smith wrote:
>> And I keep seeing too many data corruption issues on ext4 to recommend
>> anyone use it yet for PostgreSQL, that's why I focused on XFS. ext4
>> still needs at least a few more months before all the bug fixes it's
>> gotten in later kernels are backported to the 2.6.32 versions deployed
>> in RHEL6 and Debian Squeeze, the newest Linux distributions my customers
>> care about right now. On RHEL6 for example, go read
>> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html
>> , specifically BZ#635199, and you tell me if that sounds like it's
>> considered stable code yet or not. "The block layer will be updated in
>> future kernels to provide this more efficient mechanism of ensuring
>> ordering...these future block layer improvements will change some kernel
>> interfaces..." Yikes, that does not inspire confidence to me.

> XFS is naturally much more mature / stable than EXT4, but I'm not quite
> sure I want to judge the stability of code based on a comment in release
> notes. As I understand it, the comment says something like "things are
> not working as efficiently as it should, we'll improve that in the
> future" and it relates to the block layer as a whole, not just specific
> file systems. But I don't have access to the bug #635199, so maybe I
> missed something.

I do ;-). The reason for the tech note was to point out that RHEL6.1 would incorporate backports of upstream kernel changes that broke the ABI for loadable kernel modules, compared to what it had been in RHEL6.0. That's of great interest to third-party software developers who add device or filesystem drivers to RHEL, but I don't think it speaks at all to whether the code is unstable from a user's standpoint. (The changes in question were purely for performance, and involved a conversion from write barriers in the block layer to flush+fua, whatever that is.)
Furthermore, this affected every filesystem, not only ext4, so it really entirely fails to support Greg's argument.

regards, tom lane
On 08/14/2011 08:51 AM, tv@fuzzy.cz wrote:
> I've increased the test duration to 10 minutes, decreased the
> checkpoint timeout to 4 minutes and a checkpoint is issued just before
> the pgbench. That way the starting position should be more or less the
> same for all runs.

That's basically what I settled on for pgbench-tools. Force a checkpoint just before the test, so the beginning of each run is aligned more consistently, then run for long enough that you're guaranteed at least one checkpoint finishes[1] (and you might see more than one if you fill checkpoint_segments fast enough).

I never bothered trying to compress that test cycle down by decreasing checkpoint_timeout. There are already too many things you need to do in order to get this test working well, and I didn't want to include a change I'd never recommend people make on a production server in the mix.

[1] If your checkpoint behavior goes pathological, for example the extended checkpoints possible when the background writer fsync queue fills, it's not actually guaranteed that the checkpoint will finish within 5 minutes after it starts. So a 10-minute run doesn't assure you'll see a checkpoint begin and end in all circumstances, but it is the expected case.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
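The cycle described in that message - force a checkpoint, then run long enough to span at least one timed checkpoint - can be sketched as commands (database name and duration are assumptions; a running server is required):

```shell
# Align the starting point of every run, then run long enough
# for a timed checkpoint to begin and finish during the test
psql -d bench -c "CHECKPOINT;"
pgbench -c 10 -T 600 -l bench     # 10 minutes, with per-transaction latency log
```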
On 08/14/2011 01:33 PM, Tom Lane wrote:
> (The changes in question were purely for performance, and involved a
> conversion from write barriers in the block layer to flush+fua, whatever
> that is.) Furthermore, this affected every filesystem not only ext4,
> so it really entirely fails to support Greg's argument.

FUA is "Force Unit Access", shorthand for forcing write-through. If you're staring at this from the perspective I am, where I assume that every line of new filesystem code always takes a while from release until it's stable, it still concerns me. I suspect that both of these code paths--the write barrier one and the flush+FUA one--are fine on XFS. One of the reasons XFS was so terrible in its early years was that it took a long time to get those parts in particular correct, which was complicated by how badly drives lied about everything back then. If these changes shift around which sections of the ext4 write path are exercised, it could easily be the case that they move fsync-related work (which has been one of the buggiest parts of that filesystem) from code that has been tested heavily to parts that haven't been hit as hard yet by users.

My paranoid raving about this particular bug isn't strongly supported; I'll concede that with the additional data you've provided. But I don't think it's completely unfounded either.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us