Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Msg-id: 51E32A51.7080309@2ndQuadrant.com
In reply to: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (james <james@mansionfamily.plus.com>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
On 7/14/13 5:28 PM, james wrote:
> Some random seeks during sync can't be helped, but if they are done when
> we aren't waiting for sync completion then they are in effect free.

That happens sometimes, but if you measure you'll find this doesn't 
actually occur usefully in the situation everyone dislikes.  In a write 
heavy environment where the database doesn't fit in RAM, backends and/or 
the background writer are constantly writing data out to the OS.  WAL is 
going out constantly as well, and in many cases that's competing for the 
disks too.  The most popular blocks in the database get high usage 
counts and they never leave shared_buffers except at checkpoint time. 
That's easy to prove to yourself with pg_buffercache.
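
As a rough illustration of that pg_buffercache check, here is a minimal
sketch. It assumes the pg_buffercache extension is installed and relies
only on its usagecount and isdirty columns. The function accepts any
DB-API connection; the names USAGE_QUERY, usage_histogram and
demo_connection are hypothetical, and the SQLite stand-in holds made-up
rows so the sketch runs without a server.

```python
import sqlite3

# Buffers grouped by usage count, with how many in each group are dirty.
USAGE_QUERY = """
SELECT usagecount,
       count(*) AS buffers,
       sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount
"""


def usage_histogram(conn):
    """Return (usagecount, buffers, dirty) rows from a DB-API connection."""
    cur = conn.cursor()
    cur.execute(USAGE_QUERY)
    return cur.fetchall()


def demo_connection():
    """In-memory SQLite stand-in with made-up rows (not real server data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pg_buffercache "
        "(bufferid INTEGER, isdirty INTEGER, usagecount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO pg_buffercache VALUES (?, ?, ?)",
        [(1, 0, 1), (2, 1, 5), (3, 1, 5), (4, 0, 0), (5, 1, 5)],
    )
    return conn
```

Against a real server, pass a PostgreSQL DB-API connection instead of
demo_connection(). If the claim above holds, most dirty buffers should
sit in the highest usagecount rows between checkpoints.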

And once the write cache fills, every I/O operation is now competing. 
There is nothing happening for free.  You're stealing I/O from something 
else any time you force a write out.  The optimal throughput path for 
checkpoints turns out to be delaying every single bit of I/O as long as 
possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever 
you delay a buffer write, you have increased the possibility that 
someone else will write the same block again.  And the buffers being 
written by the checkpointer are, on average, the most popular ones in 
the database.  Writing any of them to disk pre-emptively has high odds 
of writing the same block more than once per checkpoint.  That waste is
easy to measure (it shows up as more writes/transaction in
pg_stat_bgwriter), and it hurts throughput more than any reduction in
seek overhead you might otherwise get from early writes.  The big gain isn't
chasing after cheap seeks.  The best path is the one that decreases the 
total volume of writes.
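
The write-volume argument can be sketched with a toy simulation. This
is an illustrative model only, not PostgreSQL code: the function name,
the 1/rank popularity skew, and all parameters are assumptions. It
counts block writes in one checkpoint cycle when everything is delayed
to the end, versus when all dirty blocks are also forced out once
partway through.

```python
import random


def simulate_checkpoint(num_blocks=1000, num_updates=20000,
                        early_fraction=0.5, seed=0):
    """Return (deferred_writes, early_writes) for one checkpoint cycle."""
    rng = random.Random(seed)
    # Assumed skew: block i is updated with weight 1/(i+1), so a few
    # blocks are far more popular than the rest.
    weights = [1.0 / (i + 1) for i in range(num_blocks)]
    updates = rng.choices(range(num_blocks), weights=weights, k=num_updates)

    # Policy A: delay every write; each dirtied block is written once.
    deferred_writes = len(set(updates))

    # Policy B: force out everything dirty at the cut point, then write
    # whatever was dirtied again before the checkpoint finishes.
    cut = int(num_updates * early_fraction)
    first_half = set(updates[:cut])
    second_half = set(updates[cut:])
    early_writes = len(first_half) + len(second_half)

    return deferred_writes, early_writes
```

Any block touched on both sides of the cut is written twice under the
early policy, and under a skewed distribution those are exactly the
popular blocks.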

We played this game with the background writer work for 8.3.  The main
reason the committed version improved on the original design is that it
completely eliminated doing work on popular buffers in advance.
Everything happens at the last possible time, which is the optimal
throughput situation.  The 8.1/8.2 BGW used to try to write buffers out
before it was strictly necessary, in the hope that the I/O would be
free.  But it rarely was, while there was always a cost to forcing them
to disk early.  And that cost is highest when you're talking about the
higher usage blocks the checkpointer tends to write.  When in doubt, 
always delay the write in hopes it will be written to again and you'll 
save work.

> So it occurs to me that perhaps we can watch for patterns where we have
> groups of adjacent writes that might stream, and when they form we might
> schedule them...

Stop here.  I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening.  We 
can't even easily make it know, not without adding a whole new source of 
IPC and locking contention among clients.

Whatever scheduling decision the checkpointer might make with its 
limited knowledge of system I/O is going to be poor.  You might find a 
100% write benchmark that it helps, but those are not representative of 
the real world.  In any mixed read/write case, the operating system is 
likely to do better.  That's why things like sorting blocks sometimes 
seem to help someone, somewhere, with one workload, but then aren't 
repeatable.

We can decide to trade throughput for latency by nudging the OS to deal 
with its queued writes more regularly.  That will result in more total 
writes, which is the reason throughput drops.
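
One generic way to nudge the OS like this is to call fsync at regular
intervals instead of once at the end. The sketch below shows only that
general technique on an ordinary file. It is not how the PostgreSQL
checkpointer is implemented, and the function names, block size, and
sync interval are assumptions chosen for illustration.

```python
import os
import tempfile


def write_with_periodic_sync(path, blocks, sync_every):
    """Write blocks to path, calling fsync after every sync_every blocks.

    Returns the number of fsync calls made, including the final one.
    """
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for i, block in enumerate(blocks, start=1):
            os.write(fd, block)
            if i % sync_every == 0:
                os.fsync(fd)  # push queued writes to disk now
                syncs += 1
        os.fsync(fd)  # final sync covers any remaining blocks
        syncs += 1
    finally:
        os.close(fd)
    return syncs


def demo():
    """Write ten 8 kB blocks, syncing every four; return (syncs, size)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        syncs = write_with_periodic_sync(path, [b"x" * 8192] * 10, 4)
        return syncs, os.path.getsize(path)
```

A smaller sync_every bounds how much dirty data can pile up before any
one sync, at the price of giving the OS less room to combine repeated
writes to the same block.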

But the idea that PostgreSQL is going to do a better global job of I/O
scheduling is a hard road to walk.  It's only going to happen
if we pull all of the I/O into the database *and* do a better job on the
entire process than the existing OS kernel does.  That dream of
outperforming the filesystem is very difficult to realize.  There's
a good reason that companies like Oracle stopped pushing so hard on 
recommending raw partitions.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


