Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: 51E2F1C4.6010503@2ndQuadrant.com
In reply to: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Ants Aasma <ants@cybertec.at>)
List: pgsql-hackers
On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
> Yeah, the checkpoint scheduling logic doesn't take into account the
> heavy WAL activity caused by full page images...
> Rationalizing a bit, I could even argue to myself that it's a *good*
> thing. At the beginning of a checkpoint, the OS write cache should be
> relatively empty, as the checkpointer hasn't done any writes yet. So it
> might make sense to write a burst of pages at the beginning, to
> partially fill the write cache first, before starting to throttle. But
> this is just handwaving - I have no idea what the effect is in real life.

That's exactly right.  When a checkpoint finishes the OS write cache is 
clean.  That means all of the full-page writes aren't even hitting disk 
in many cases.  They just pile up in OS dirty memory, often sitting 
there all the way until the next checkpoint's fsyncs start.  That's 
why I never wandered down the road of changing FPW behavior.  I have 
never seen a benchmark workload hit a write bottleneck until long after 
the big burst of FPW pages is over.

I could easily believe that there are low-memory systems where the FPW 
write pressure becomes a problem earlier.  And slim VMs are exactly the 
sort of environment where this behavior would show up.

I'm a big fan of instrumenting the code around a performance change 
before touching anything, as a companion patch that might make sense to 
commit on its own.  In the case of a change to FPW spacing, I'd want to 
see some diagnostic output in something like pg_stat_bgwriter that 
tracks how many full-page writes are being generated.  A 
pg_stat_bgwriter.full_page_writes counter would be perfect here; that 
data could then be graphed over time as the benchmark runs.
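
To sketch what I have in mind, the benchmark harness would sample the 
view every few seconds and graph the deltas.  The full_page_writes 
column below is the hypothetical counter proposed above; the other 
columns already exist in pg_stat_bgwriter:

    -- Sample once per interval during the benchmark and graph the deltas.
    -- full_page_writes is the hypothetical counter proposed above; the
    -- other columns already exist in pg_stat_bgwriter.
    SELECT now() AS sample_time,
           checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,
           buffers_clean,
           buffers_backend
           -- , full_page_writes
    FROM pg_stat_bgwriter;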

> Another thought is that rather than trying to compensate for that effect
> in the checkpoint scheduler, could we avoid the sudden rush of full-page
> images in the first place? The current rule for when to write a full
> page image is conservative: you don't actually need to write a full page
> image when you modify a buffer that's sitting in the buffer cache, if
> that buffer hasn't been flushed to disk by the checkpointer yet, because
> the checkpointer will write and fsync it later. I'm not sure how much it
> would smoothen WAL write I/O, but it would be interesting to try.

There, too, I think the right way to proceed is to instrument that area 
first.
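
One way to instrument it from the outside, without touching the backend 
at all, is to sample the WAL insert position during the run and diff 
consecutive samples; the spike right after each checkpoint starts is 
mostly full-page images.  Just a sketch, with placeholder LSN values:

    -- Sample the current WAL position periodically during the test.
    SELECT now() AS sample_time,
           pg_current_xlog_location() AS wal_position;

    -- Bytes of WAL generated between two sampled positions.
    SELECT pg_xlog_location_diff('0/3000158', '0/2000060') AS wal_bytes;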

> A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
> www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
> He posted very promising performance numbers, but it was dropped because
> Tom couldn't reproduce the numbers, and because sorting requires
> allocating a large array, which has the risk of running out of memory,
> which would be bad when you're trying to checkpoint.

I updated and re-reviewed that in 2011: 
http://www.postgresql.org/message-id/4D31AE64.3000202@2ndquadrant.com 
and commented on why I think the improvement was difficult to reproduce 
back then.  The improvement didn't show up for me either.  It would take 
a really amazing bit of data to get me to believe write sorting code is 
worthwhile after that.  On large systems capable of dirtying enough 
blocks to cause a problem, the operating system and RAID controllers are 
already sorting blocks.  And *that* sorting also takes concurrent read 
requests into account, which are a lot more important to an efficient 
schedule than anything the checkpoint process knows about.  The database 
doesn't have nearly enough information yet to compete against OS-level 
sorting.


>> Bad point of my patch is longer checkpoint. Checkpoint time was
>> increased about 10% - 20%. But it can work correctly on schedule-time in
>> checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6).
>
> For a fair comparison, you should increase the
> checkpoint_completion_target of the unpatched test, so that the
> checkpoints run for roughly the same amount of time with and without the
> patch. Otherwise the benefit you're seeing could be just because of a
> more lazy checkpoint.

Heikki has nailed the problem with the submitted dbt-2 results here.  If 
you spread checkpoints out more, you cannot fairly compare the resulting 
TPS or latency numbers anymore.

Simple example: a 20-minute-long test.  Server A does a checkpoint every 
5 minutes.  Server B has modified parameters or server code such that 
checkpoints happen every 6 minutes.  If you run both to completion, A 
will have hit 4 checkpoints that flush the buffer cache, B only 3.  Of 
course B will seem faster.  It didn't do as much work.

pgbench_tools measures the number of checkpoints during the test, as 
well as the buffer count statistics.  If those numbers are very 
different between two tests, I have to throw them out as unfair.  A lot 
of things that seem promising turn out to have this sort of problem.
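
For anyone checking this by hand, the test is roughly a before/after 
snapshot like the following (not the exact queries pgbench_tools runs, 
just the idea); subtract the two snapshots and compare across runs:

    -- Run before and after each test; subtract to get per-test totals.
    -- If two runs show very different checkpoint or buffer counts,
    -- their TPS and latency numbers aren't comparable.
    SELECT checkpoints_timed + checkpoints_req AS checkpoints,
           buffers_checkpoint,
           buffers_clean,
           buffers_backend
    FROM pg_stat_bgwriter;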

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


