Greg Stark wrote:
> Using sync_file_range you can specify the set of blocks to sync and
> then block on them only after some time has passed. But there's no
> documentation on how this relates to the I/O scheduler so it's not
> clear it would have any effect on the problem.
As I understand it, this is exactly where we're stalled in getting this
improved on the Linux side. *The* answer for this class of problem on
Linux is to use sync_file_range, and I don't think we'll get any sympathy
from the kernel developers until we do. But that's a Linux-specific call,
so using it means forking the write path with platform-specific code in
the database.
If I thought sync_file_range was a silver bullet guaranteed to make this
better, maybe I'd go for that. I think there's some relatively
low-hanging fruit on the database side that would do better before going
to that extreme though, thus the patch.
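
For anyone who hasn't looked at the call, here is a rough sketch of the
two-phase pattern being discussed, not anything from the patch. The
function names are made up for illustration, and I'm assuming glibc on
Linux. Phase one only asks the kernel to start writeback on a range;
since sync_file_range does not flush file metadata or give a durability
guarantee by itself, the sketch still ends with a plain fsync. (The call
also has SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER flags
for blocking on a range, which I've left out here.)

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Phase 1 (hypothetical helper): hint that dirty pages in
 * [offset, offset + nbytes) should start being written out now.
 * This does not wait for the writes to complete.
 */
static int
start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
#ifdef __linux__
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0)
        return 0;
    /* It is only a hint, so treat "not implemented" as success. */
    return (errno == ENOSYS) ? 0 : -1;
#else
    /* Other platforms: nothing to do, phase 2 does all the work. */
    (void) fd;
    (void) offset;
    (void) nbytes;
    return 0;
#endif
}

/*
 * Phase 2 (hypothetical helper): called some time later. fsync is
 * still what makes the data and metadata durable; the hope is that
 * phase 1 left it with less to do.
 */
static int
finish_file_sync(int fd)
{
    return fsync(fd);
}
```

The #ifdef is the "write path fork" I'm complaining about: even in this
toy form there are two code paths to test, and whether phase one helps at
all depends on how the kernel schedules the resulting I/O.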
> We might still have to delay the beginning of the sync to allow the
> dirty blocks to be synced naturally and then when we issue it still end
> up catching a lot of other i/o as well.
>
Whether it's "a lot" or not is really workload dependent. I work from
the assumption that the blocks being written out by the checkpoint are
the most popular ones in the database, the ones that accumulate a high
usage count and stay there. If that's true, my guess is that the writes
being done while the checkpoint is executing are a bit less likely to be
touching the same files. You raise a valid concern, I just haven't seen
that actually happen in practice yet.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us