Re: Improvement of checkpoint IO scheduler for stable transaction responses

From        | Greg Smith
Subject     | Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date        |
Msg-id      | 51E2F1C4.6010503@2ndQuadrant.com
In reply to | Re: Improvement of checkpoint IO scheduler for stable transaction responses (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses   | Re: Improvement of checkpoint IO scheduler for stable transaction responses (Ants Aasma <ants@cybertec.at>)
List        | pgsql-hackers
On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
> Yeah, the checkpoint scheduling logic doesn't take into account the
> heavy WAL activity caused by full page images...
>
> Rationalizing a bit, I could even argue to myself that it's a *good*
> thing. At the beginning of a checkpoint, the OS write cache should be
> relatively empty, as the checkpointer hasn't done any writes yet. So it
> might make sense to write a burst of pages at the beginning, to
> partially fill the write cache first, before starting to throttle. But
> this is just handwaving - I have no idea what the effect is in real life.

That's exactly right. When a checkpoint finishes, the OS write cache is
clean. That means in many cases the full-page writes aren't even hitting
disk; they just pile up in OS dirty memory, often sitting there all the
way until the next checkpoint's fsyncs start.

That's why I never wandered down the road of changing FPW behavior. I
have never seen a benchmark workload hit a write bottleneck until long
after the big burst of FPW pages is over. I could easily believe there
are low-memory systems where the FPW write pressure becomes a problem
earlier, and slim VMs make sense as the place where this behavior is
being seen.

I'm a big fan of instrumenting the code around a performance change
before touching anything, as a companion patch that might make sense to
commit on its own. In the case of a change to FPW spacing, I'd want to
see some diagnostic output in something like pg_stat_bgwriter that
tracks how many FPW pages are being written. A
pgstat_bgwriter.full_page_writes counter would be perfect here; that
data could then be graphed over time as the benchmark runs.

> Another thought is that rather than trying to compensate for that
> effect in the checkpoint scheduler, could we avoid the sudden rush of
> full-page images in the first place?
> The current rule for when to write a full page image is conservative:
> you don't actually need to write a full page image when you modify a
> buffer that's sitting in the buffer cache, if that buffer hasn't been
> flushed to disk by the checkpointer yet, because the checkpointer will
> write and fsync it later. I'm not sure how much it would smoothen WAL
> write I/O, but it would be interesting to try.

There I also think the right way to proceed is instrumenting that area
first.

> A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
> www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
> He posted very promising performance numbers, but it was dropped
> because Tom couldn't reproduce the numbers, and because sorting
> requires allocating a large array, which has the risk of running out
> of memory, which would be bad when you're trying to checkpoint.

I updated and re-reviewed that in 2011:
http://www.postgresql.org/message-id/4D31AE64.3000202@2ndquadrant.com
and commented on why I think the improvement was difficult to reproduce
back then. The improvement didn't follow for me either. It would take a
really amazing bit of data to get me to believe write sorting code is
worthwhile after that. On large systems capable of dirtying enough
blocks to cause a problem, the operating system and RAID controllers
are already sorting blocks. And *that* sorting is also considering
concurrent read requests, which are a lot more important to an
efficient schedule than anything the checkpoint process knows about.
The database doesn't have nearly enough information yet to compete
against OS-level sorting.

>> Bad point of my patch is longer checkpoint. Checkpoint time was
>> increased about 10% - 20%. But it can work correctly on schedule-time
>> in checkpoint_timeout. Please see checkpoint result
>> (http://goo.gl/NsbC6).
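For reference, the core idea of the Itagaki sorting patch discussed above can be sketched in a few lines: collect the dirty buffers' block addresses up front, then issue the writes in on-disk order so the kernel sees a mostly-sequential stream. This is a simplified Python model under assumed buffer-tag fields and made-up sample data, not the actual patch:

```python
# Minimal model of checkpoint write sorting: order dirty buffers by
# (tablespace, relation, fork, block) before writing.  The tag fields
# mirror PostgreSQL's buffer tags; the sample values are hypothetical.
from collections import namedtuple

BufferTag = namedtuple("BufferTag", "tablespace relfilenode forknum blocknum")

def sorted_checkpoint_writes(dirty_buffers):
    """Return the dirty buffer tags in on-disk order."""
    return sorted(dirty_buffers,
                  key=lambda t: (t.tablespace, t.relfilenode,
                                 t.forknum, t.blocknum))

dirty = [
    BufferTag(1663, 16384, 0, 42),
    BufferTag(1663, 16384, 0, 7),
    BufferTag(1663, 16385, 0, 3),
    BufferTag(1663, 16384, 0, 8),
]
for tag in sorted_checkpoint_writes(dirty):
    print(tag)
```

Note the downside already flagged in the quoted thread: the array of tags has to be allocated at checkpoint time, which is exactly when running out of memory would hurt most.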
> For a fair comparison, you should increase the
> checkpoint_completion_target of the unpatched test, so that the
> checkpoints run for roughly the same amount of time with and without
> the patch. Otherwise the benefit you're seeing could be just because
> of a more lazy checkpoint.

Heikki has nailed the problem with the submitted dbt-2 results here. If
you spread checkpoints out more, you cannot fairly compare the
resulting TPS or latency numbers anymore.

Simple example: a 20 minute long test. Server A does a checkpoint every
5 minutes. Server B has modified parameters or server code such that
checkpoints happen every 6 minutes. If you run both to completion, A
will have hit 4 checkpoints that flush the buffer cache, B only 3. Of
course B will seem faster; it didn't do as much work.

pgbench_tools measures the number of checkpoints during the test, as
well as the buffer count statistics. If those numbers are very
different between two tests, I have to throw them out as unfair. A lot
of things that seem promising turn out to have this sort of problem.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
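[The checkpoint-count arithmetic in the 20 minute example above can be sketched as a quick sanity check; the helper function is hypothetical, not part of pgbench_tools.]

```python
# How many cache-flushing checkpoints each server completes during a
# fixed-length test.  Fewer checkpoints means less flush work done, so
# raw TPS/latency numbers from such runs are not directly comparable.
def checkpoints_completed(test_minutes, checkpoint_interval_minutes):
    return test_minutes // checkpoint_interval_minutes

# The example: a 20 minute test.
server_a = checkpoints_completed(20, 5)   # checkpoint every 5 minutes
server_b = checkpoints_completed(20, 6)   # spread out to every 6 minutes
print(server_a, server_b)  # 4 3
```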