Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Jeff Janes
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: CAMkU=1wXRCo85AxXDRDr8-kt_=kUVNMC_WAOEyNyGRfpa8rWjA@mail.gmail.com
In reply to: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Greg Smith <greg@2ndQuadrant.com>)
List: pgsql-hackers
On Sunday, July 14, 2013, Greg Smith wrote:
> On 7/14/13 5:28 PM, james wrote:
>> Some random seeks during sync can't be helped, but if they are done when
>> we aren't waiting for sync completion then they are in effect free.
>
> That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes.  In a write-heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS.  WAL is going out constantly as well, and in many cases that's competing for the disks too.

While I think it is probably true that many systems don't separate WAL from non-WAL onto different I/O controllers, is it true that many systems in need of heavy I/O tuning fail to do so?  I thought that would be the first stop for any DBA of a highly I/O-write-constrained database.

> The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time.  That's easy to prove to yourself with pg_buffercache.
>
> And once the write cache fills, every I/O operation is now competing.  There is nothing happening for free.  You're stealing I/O from something else any time you force a write out.  The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again.  And the buffers being written by the checkpointer are, on average, the most popular ones in the database.  Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint.

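The throughput claim above — that deferring buffer writes lets repeated dirtyings of a hot block collapse into one physical write at checkpoint time — can be sketched with a toy model.  This is not PostgreSQL's actual buffer manager; the page counts and the skewed access pattern are invented purely for illustration:

```python
import random

# Toy model of deferred checkpoint writes: repeated dirtyings of the
# same (hot) page collapse into a single physical write at checkpoint.
# Page counts and access skew are invented; this is only a sketch.
def physical_writes(n_pages=100, n_dirties=10_000, eager=False, seed=1):
    rng = random.Random(seed)
    dirty = set()
    writes = 0
    for _ in range(n_dirties):
        # skewed access: low-numbered pages are the popular ones
        page = min(int(rng.expovariate(1 / 10)), n_pages - 1)
        if eager:
            writes += 1        # write-through: every dirtying costs a write
        else:
            dirty.add(page)    # deferred: just mark the page dirty
    writes += len(dirty)       # checkpoint flushes whatever is still dirty
    return writes

print(physical_writes(eager=True), physical_writes(eager=False))
```

With any skewed workload the deferred count is bounded by the number of distinct pages, while the eager count grows with every dirtying — which is the sense in which an early write of a popular block "has high odds of writing the same block more than once per checkpoint."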

Should the checkpointer make multiple passes over the buffer pool, writing out the high usage_count buffers first, because no one else is going to do it, and then going back for the low usage_count buffers in the hope they were already written out?  On the other hand, if the checkpointer writes out a low-usage buffer, why would anyone else need to write it again soon?  If it were likely to get dirtied often, it wouldn't be low usage.  If it were dirtied rarely, it wouldn't be dirty anymore once written.
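The point that dirtying frequency and usage_count track each other can be illustrated with a small clock-sweep sketch.  The usage_count cap of 5 matches PostgreSQL's buffer manager, but the pool size, operation count, and skewed access pattern are invented for illustration:

```python
import random

# Toy clock-sweep model: each access bumps usage_count (capped at 5,
# as in PostgreSQL), while the advancing clock hand decrements it.
# Pool size and access skew are invented; this is only a sketch.
def run(n_buffers=100, n_ops=10_000, seed=2):
    rng = random.Random(seed)
    usage = [0] * n_buffers
    dirties = [0] * n_buffers       # times each buffer was dirtied
    hand = 0
    for _ in range(n_ops):
        # skewed access: low-numbered buffers are the hot ones
        buf = min(int(rng.expovariate(1 / 5)), n_buffers - 1)
        usage[buf] = min(usage[buf] + 1, 5)
        dirties[buf] += 1
        # the clock hand sweeps on, decrementing usage counts
        usage[hand] = max(usage[hand] - 1, 0)
        hand = (hand + 1) % n_buffers
    return usage, dirties

usage, dirties = run()
hot = [i for i, u in enumerate(usage) if u == 5]
cold = [i for i, u in enumerate(usage) if u == 0]
# Buffers that end with a high usage_count are the ones dirtied often;
# a cold buffer, once flushed at checkpoint, rarely needs writing again.
print(sum(dirties[i] for i in hot) / len(hot),
      sum(dirties[i] for i in cold) / len(cold))
```

The buffers that finish the run with a high usage_count are exactly the ones that were dirtied most often — so a buffer the checkpointer finds with a low count is, by construction, one that is unlikely to be dirty again soon after it is written.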

Cheers,

Jeff
