Thread: Massive I/O spikes during checkpoint

From:
David Kerr
Date:

Howdy!

I'm trying to figure out why checkpointing is completely pegging my I/O under moderate to high write load.
I'm on PG 9.1.1, RHEL 6.2 x64

checkpoint_completion_target = 0.7
checkpoint_timeout = 10m

Jul 10 00:32:30 perf01 postgres[52619]: [1895-1] user=,db= LOG:  checkpoint starting: time
[...]
Jul 10 00:36:47 perf01 postgres[52619]: [1896-1] user=,db= LOG:  checkpoint complete: wrote 119454 buffers (11.4%); 0 transaction log file(s) added, 0 removed

Watching my I/O with: iostat -t -d -x dm-2 5
(dm-2 is my $PGDATA mount point, ext4), I get the following:
Date r/s w/s rsec/s wsec/s await svctm %util
[...]
07/10/12 00:35:36 0 69.8 0 2233.6 0.63 0.07 0.46
07/10/12 00:35:41 1.2 810 99.2 22200 4.13 0.05 4.02
07/10/12 00:35:46 0 111.6 0 5422.4 1.82 0.08 0.9
07/10/12 00:35:51 0 299.2 0 5670.4 1.27 0.04 1.24
07/10/12 00:35:56 0.8 176.6 41.6 3654.4 2.16 0.07 1.32
07/10/12 00:36:01 0 364.8 0 6670.4 1.1 0.04 1.62
07/10/12 00:36:06 0.8 334.6 12.8 5953.6 1.18 0.05 1.64
07/10/12 00:36:11 0 118.6 0 6948.8 1.82 0.07 0.82
07/10/12 00:36:16 0 8274.6 0 148764.8 10.55 0.07 61.18
07/10/12 00:36:21 0.2 8577.4 3.2 161806.4 16.68 0.12 99.62
07/10/12 00:36:26 0.8 9244.6 12.8 167841.6 15.01 0.11 99.82
07/10/12 00:36:31 0.8 9434.2 44.8 208156.8 16.22 0.11 99.7
07/10/12 00:36:36 0 9582.8 0 202508.8 14.84 0.1 99.72
07/10/12 00:36:41 0 9830.2 0 175326.4 14.42 0.1 99.5
07/10/12 00:36:46 0 8208.6 0 149372.8 17.82 0.12 99.64
07/10/12 00:36:51 3 1438.4 102.4 26748.8 8.49 0.12 18
07/10/12 00:36:56 0.6 2004.6 9.6 27400 1.25 0.03 5.74
07/10/12 00:37:01 0.6 1723 9.6 23758.4 1.85 0.03 5.08
07/10/12 00:37:06 0.4 181.2 35.2 2928 1.49 0.06 1.06

The ramp-up is barely using any I/O, but then just before the checkpoint ends I get a flood of I/O all at once.

I thought that the idea of checkpoint_completion_target was that we try to finish writing
out the data throughout the entire checkpoint (leaving some room to spare, in my case 30%
of the total estimated checkpoint time).
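As a sanity check on that expectation, the spread works out like this (my arithmetic, using the settings above):

```shell
# With checkpoint_timeout = 10min and checkpoint_completion_target = 0.7,
# the checkpointer aims to finish its buffer writes within 70% of the
# checkpoint interval, leaving the remaining 30% as slack.
timeout_s=$((10 * 60))
spread_s=$(awk -v t="$timeout_s" 'BEGIN { print t * 0.7 }')
echo "write phase target: ${spread_s}s of ${timeout_s}s"
```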

But what appears to be happening is that all of the data is being written out at the end of the checkpoint.

This happens at every checkpoint while the system is under load.

I get the feeling that this isn't the correct behavior and I've done something wrong.

Also, I didn't see this sort of behavior in PG 8.3; unfortunately, I don't have data to back that statement up.

Any suggestions? I'm willing and able to profile, or whatever.

Thanks

From:
Maxim Boguk
Date:



> But what appears to be happening is that all of the data is being written out at the end of the checkpoint.
>
> This happens at every checkpoint while the system is under load.
>
> I get the feeling that this isn't the correct behavior and I've done something wrong.

It's not the actual checkpoint.
It's the fsync after the checkpoint that creates the write spikes hurting the server.

You should set the sysctls vm.dirty_background_bytes and vm.dirty_bytes to reasonably low values.
(For a RAID controller with 512MB of cache I would suggest something like
vm.dirty_background_bytes = 33554432
vm.dirty_bytes = 268435456
i.e. 32MB and 256MB respectively.)

If your server doesn't have RAID with a BBU cache, then you should tune these to much lower values.
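A minimal sketch of those numbers in byte form (the apply commands are the standard ones, shown commented since they need root):

```shell
# The suggested limits expressed in bytes: 32MB background / 256MB hard.
bg_bytes=$((32 * 1024 * 1024))
hard_bytes=$((256 * 1024 * 1024))
echo "vm.dirty_background_bytes = $bg_bytes"
echo "vm.dirty_bytes = $hard_bytes"

# To apply at runtime (as root), and persist in /etc/sysctl.conf:
#   sysctl -w vm.dirty_background_bytes=33554432
#   sysctl -w vm.dirty_bytes=268435456
```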

Please read http://blog.2ndquadrant.com/tuning_linux_for_low_postgresq/
and related posts.

--
Maxim Boguk
Senior Postgresql DBA.
http://www.postgresql-consulting.com/

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

Skype: maxim.boguk
Jabber:
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."


From:
Jeff Janes
Date:

On Mon, Jul 9, 2012 at 10:39 PM, David Kerr <> wrote:
>
> I thought that the idea of checkpoint_completion_target was that we try to
> finish writing
> out the data throughout the entire checkpoint (leaving some room to spare,
> in my case 30%
> of the total estimated checkpoint time)
>
> But what appears to be happening is that all of the data is being written
> out at the end of the checkpoint.

Postgres is writing data out to the kernel throughout the checkpoint.
But the kernel is just buffering it up dirty, until the end of the
checkpoint when the fsyncs start landing like bombs.
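One way to watch that buffering happen (my suggestion, not from the thread): sample the kernel's dirty-page counters while a checkpoint runs:

```shell
# "Dirty:" grows through the checkpoint's write phase, then collapses
# (with "Writeback:" spiking) when the fsyncs finally land.
# Run repeatedly (e.g. under watch -n1) alongside the iostat capture.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```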

>
> This happens at every checkpoint while the system is under load.
>
> I get the feeling that this isn't the correct behavior and i've done
> something wrong.
>
> Also, I didn't see this sort of behavior in PG 8.3, however unfortunately, I
> don't have data to back that
> statement up.

Did you have less RAM back when you were running PG 8.3?

> Any suggestions. I'm willing and able to profile, or whatever.

How much RAM do you have?  What are your settings for /proc/sys/vm/dirty_* ?

Cheers,

Jeff

From:
David Kerr
Date:

On Jul 9, 2012, at 10:52 PM, Jeff Janes wrote:

> On Mon, Jul 9, 2012 at 10:39 PM, David Kerr <> wrote:
>>
>> I thought that the idea of checkpoint_completion_target was that we try to
>> finish writing
>> out the data throughout the entire checkpoint (leaving some room to spare,
>> in my case 30%
>> of the total estimated checkpoint time)
>>
>> But what appears to be happening is that all of the data is being written
>> out at the end of the checkpoint.
>
> Postgres is writing data out to the kernel throughout the checkpoint.
> But the kernel is just buffering it up dirty, until the end of the
> checkpoint when the fsyncs start landing like bombs.

Ahh. duh!

I guess I assumed that the point of spreading the checkpoint I/O was
to spread the syncs out.

>
>>
>> This happens at every checkpoint while the system is under load.
>>
>> I get the feeling that this isn't the correct behavior and i've done
>> something wrong.
>>
>> Also, I didn't see this sort of behavior in PG 8.3, however unfortunately, I
>> don't have data to back that
>> statement up.
>
> Did you have less RAM back when you were running PG 8.3?
Nope. I was on RHEL 5.5 back then, though.

>
>> Any suggestions. I'm willing and able to profile, or whatever.
>
> How much RAM do you have?  What are your settings for /proc/sys/vm/dirty_* ?

256GB.
And I've been running with this for a while now, but I think these are the defaults in RHEL 6+:
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 >/proc/sys/vm/dirty_background_ratio



From:
David Kerr
Date:


On Jul 9, 2012, at 10:51 PM, Maxim Boguk wrote:



>> But what appears to be happening is that all of the data is being written out at the end of the checkpoint.
>>
>> This happens at every checkpoint while the system is under load.
>>
>> I get the feeling that this isn't the correct behavior and I've done something wrong.
>
> It's not the actual checkpoint.
> It's the fsync after the checkpoint that creates the write spikes hurting the server.
> You should set the sysctls vm.dirty_background_bytes and vm.dirty_bytes to reasonably low values.

So use bla_bytes instead of bla_ratio?

> (For a RAID controller with 512MB of cache I would suggest something like
> vm.dirty_background_bytes = 33554432
> vm.dirty_bytes = 268435456
> i.e. 32MB and 256MB respectively.)

I'll take a look.


> If your server doesn't have RAID with a BBU cache, then you should tune these to much lower values.
>
> Please read http://blog.2ndquadrant.com/tuning_linux_for_low_postgresq/
> and related posts.

Yeah, I saw that; I guess I didn't put 2+2 together. Thanks.



From:
Maxim Boguk
Date:



On Tue, Jul 10, 2012 at 4:03 PM, David Kerr <> wrote:

> On Jul 9, 2012, at 10:51 PM, Maxim Boguk wrote:
>
>>> But what appears to be happening is that all of the data is being written out at the end of the checkpoint.
>>>
>>> This happens at every checkpoint while the system is under load.
>>>
>>> I get the feeling that this isn't the correct behavior and I've done something wrong.
>>
>> It's not the actual checkpoint.
>> It's the fsync after the checkpoint that creates the write spikes hurting the server.
>> You should set the sysctls vm.dirty_background_bytes and vm.dirty_bytes to reasonably low values.
>
> So use bla_bytes instead of bla_ratio?


Yes, because on a 256GB server
echo 10 > /proc/sys/vm/dirty_ratio
is equivalent to ~26GB of dirty_bytes,

and
echo 5 > /proc/sys/vm/dirty_background_ratio
is equivalent to ~13GB of dirty_background_bytes.

Those are really huge values.

So the kernel doesn't start writing any pages out in the background until it has at least ~13GB of dirty pages in kernel memory.
And at the end of the checkpoint the kernel tries to flush all those dirty pages to disk at once.

Even echo 1 > /proc/sys/vm/dirty_background_ratio is too high a value for a contemporary server.
That is why the *_bytes controls were added to the kernel.
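The arithmetic behind those figures, as a quick check (my sketch):

```shell
# Convert the percentage knobs into absolute sizes for a 256GB machine.
awk -v ram_gb=256 'BEGIN {
    printf "dirty_ratio = 10%%            -> %.1f GB hard limit\n", ram_gb * 0.10
    printf "dirty_background_ratio = 5%% -> %.1f GB background threshold\n", ram_gb * 0.05
}'
```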

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.com/

From:
David Kerr
Date:

On 7/9/2012 11:14 PM, Maxim Boguk wrote:
>
>
> On Tue, Jul 10, 2012 at 4:03 PM, David Kerr <> wrote:
>
>
>     On Jul 9, 2012, at 10:51 PM, Maxim Boguk wrote:
>
>>
>>
>>         But what appears to be happening is that all of the data is
>>         being written out at the end of the checkpoint.
>>
>>         This happens at every checkpoint while the system is under load.
>>
>>         I get the feeling that this isn't the correct behavior and
>>         I've done something wrong.
>>
>>
>>
>>     It's not the actual checkpoint.
>>     It's the fsync after the checkpoint that creates the write spikes
>>     hurting the server.
>>     You should set the sysctls vm.dirty_background_bytes and vm.dirty_bytes
>>     to reasonably low values
>
>     So use bla_bytes instead of bla_ratio?
>
>
> Yes, because on a 256GB server
> echo 10 > /proc/sys/vm/dirty_ratio
> is equivalent to ~26GB of dirty_bytes,
>
> and
> echo 5 > /proc/sys/vm/dirty_background_ratio
> is equivalent to ~13GB of dirty_background_bytes.
>
> Those are really huge values.
<sigh> yeah, I never bothered to think that through.

> So the kernel doesn't start writing any pages out in the background until it
> has at least ~13GB of dirty pages in kernel memory.
> And at the end of the checkpoint the kernel tries to flush all those dirty
> pages to disk at once.
>
> Even echo 1 > /proc/sys/vm/dirty_background_ratio is too high a value for a
> contemporary server.
> That is why the *_bytes controls were added to the kernel.

Awesome, Thanks.

From:
Andres Freund
Date:

On Tuesday, July 10, 2012 08:14:00 AM Maxim Boguk wrote:
> On Tue, Jul 10, 2012 at 4:03 PM, David Kerr <> wrote:
> > On Jul 9, 2012, at 10:51 PM, Maxim Boguk wrote:
> >> But what appears to be happening is that all of the data is being
> >> written out at the end of the checkpoint.
> >>
> >> This happens at every checkpoint while the system is under load.
> >>
> >> I get the feeling that this isn't the correct behavior and I've done
> >> something wrong.
> >
> > It's not the actual checkpoint.
> > It's the fsync after the checkpoint that creates the write spikes hurting
> > the server.
> >
> > You should set the sysctls vm.dirty_background_bytes and vm.dirty_bytes
> > to reasonably low values
> >
> >
> > So use bla_bytes instead of bla_ratio?
>
> Yes, because on a 256GB server
> echo 10 > /proc/sys/vm/dirty_ratio
> is equivalent to ~26GB of dirty_bytes,
>
> and
> echo 5 > /proc/sys/vm/dirty_background_ratio
> is equivalent to ~13GB of dirty_background_bytes.
>
> Those are really huge values.
>
> So the kernel doesn't start writing any pages out in the background until it
> has at least ~13GB of dirty pages in kernel memory.
> And at the end of the checkpoint the kernel tries to flush all those dirty
> pages to disk at once.
That's not entirely true. The kernel will also write out pages which haven't
been written to for dirty_expire_centisecs.

But yes, adjusting the dirty_* settings is definitely a good idea.
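For reference, the age-based knob sits next to the size-based ones; a quick way to inspect all of them (my sketch, Linux-only):

```shell
# dirty_expire_centisecs: pages dirty for longer than this (in 1/100s of
# a second; 3000 = 30s is a common default) become eligible for writeback
# on the next flusher wakeup, regardless of the size thresholds.
for knob in dirty_expire_centisecs dirty_writeback_centisecs \
            dirty_background_ratio dirty_ratio; do
    printf '%-26s %s\n' "$knob" "$(cat /proc/sys/vm/$knob)"
done
```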

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From:
Jeff Janes
Date:

On Tue, Jul 10, 2012 at 5:44 AM, Andres Freund <> wrote:
> On Tuesday, July 10, 2012 08:14:00 AM Maxim Boguk wrote:
>>
>> So the kernel doesn't start writing any pages out in the background until
>> it has at least ~13GB of dirty pages in kernel memory.
>> And at the end of the checkpoint the kernel tries to flush all those dirty
>> pages to disk at once.

> That's not entirely true. The kernel will also write out pages which haven't
> been written to for dirty_expire_centisecs.

There seem to be many situations in which it totally fails to do that,
although I've never been able to categorize just what those situations are.

Cheers,

Jeff

From:
Andres Freund
Date:

On Tuesday, July 10, 2012 03:36:35 PM Jeff Janes wrote:
> On Tue, Jul 10, 2012 at 5:44 AM, Andres Freund <> wrote:
> > On Tuesday, July 10, 2012 08:14:00 AM Maxim Boguk wrote:
> >> So the kernel doesn't start writing any pages out in the background
> >> until it has at least ~13GB of dirty pages in kernel memory.
> >> And at the end of the checkpoint the kernel tries to flush all those
> >> dirty pages to disk at once.
> >
> > That's not entirely true. The kernel will also write out pages which
> > haven't been written to for dirty_expire_centisecs.
>
> There seem to be many situations in which it totally fails to do that.
Totally, as in dirty pages sitting around without any I/O activity? Or just
not aggressive enough?

Currently it's a bit hard to speculate about all this without knowing the
kernel version, because there have been massive rewrites of all that stuff
across several kernels in the last two years...

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services