Re: Spread checkpoint sync

From: Greg Smith
Subject: Re: Spread checkpoint sync
Date:
Msg-id 4CE994F8.8020800@2ndquadrant.com
In reply to: Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Spread checkpoint sync  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas wrote:
> Doing all the writes and then all the fsyncs meets this requirement
> trivially, but I'm not so sure that's a good idea.  For example, given
> files F1 ... Fn with dirty pages needing checkpoint writes, we could
> do the following: first, do any pending fsyncs for files not among F1
> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
> and fsync, write all pages for F3 and fsync, etc.  This might seem
> dumb because we're not really giving the OS a chance to write anything
> out before we fsync, but think about the ext3 case where the whole
> filesystem cache gets flushed anyway.

I'm not horribly interested in optimizing for the ext3 case per se, as I
consider that filesystem fundamentally broken from the perspective of
its ability to deliver low latency here.  I wouldn't want a patch that
improved behavior on filesystems with granular fsync to make the ext3
situation worse.  That's as much as I'd want the design to lean toward
considering its quirks.  Jeff Janes made a case downthread for "why not
make it the admin/OS's job to worry about this?"  In cases where there
is a reasonable solution available, in the form of "switch to XFS or
ext4", I'm happy to take that approach.

Let me throw some numbers out to give a better idea of the shape and 
magnitude of the problem case I've been working on here.  In the
situation that leads to the near hour-long sync phase I've seen,
checkpoints will start with about a 3GB backlog of data in the kernel 
write cache to deal with.  That's about 4% of RAM, just under the 5% 
threshold set by dirty_background_ratio.  Whether or not the 256MB write 
cache on the controller is also filled is a relatively minor detail I 
can't monitor easily.  The checkpoint itself?  <250MB each time. 

This proportion is why I didn't think to follow the alternate path of 
worrying about spacing the write and fsync calls out differently.  I
shrank shared_buffers down to make the actual checkpoints smaller, which
helped to some degree; that's what got them down to smaller than the 
RAID cache size.  But the amount of data cached by the operating system 
is the real driver of total sync time here.  Whether or not you include 
all of the writes from the checkpoint itself before you start calling 
fsync didn't actually matter very much; in the case I've been chasing, 
those are getting cached anyway.  The write storm from the fsync calls 
themselves forcing things out seems to be the driver on I/O spikes, 
which is why I started with spacing those out.
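
To make the idea of spacing the fsync calls out concrete, here is a
minimal Python sketch.  It is not the code from the patch; the function
name spread_fsync, the pause_s parameter, and the injectable sleep
argument are all hypothetical names chosen for illustration.  It assumes
every checkpoint write has already been issued, so only the sync phase
is shown.

```python
import os
import time


def spread_fsync(paths, pause_s, sleep=time.sleep):
    """fsync each file in turn, pausing between calls.

    Illustrative only: 'paths', 'pause_s' and 'sleep' are hypothetical
    names, not anything taken from PostgreSQL.
    """
    synced = []
    for i, path in enumerate(paths):
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)  # force this file's cached writes to disk
        finally:
            os.close(fd)
        synced.append(path)
        # Pause between files so the forced writes do not all land at
        # once; there is nothing left to wait for after the last file.
        if i < len(paths) - 1:
            sleep(pause_s)
    return synced
```

Passing a different sleep callable makes the pacing easy to observe
without actually waiting.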

Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog 
takes a minimum of 10 minutes of real time.  There are about 300 1GB 
relation files involved in the case I've been chasing.  This is where 
the 3 second delay number came from; 300 files, 3 seconds each, 900 
seconds = 15 minutes of sync spread.  You can turn that math around to 
figure out how much delay per relation you can afford while still 
keeping checkpoints to a planned end time, which the patch I submitted
doesn't do yet.
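
The arithmetic above, including the "turned around" version, can be
sketched in a few lines of Python.  The function names and the
600-second sync budget in the last line are assumptions made up for the
example; only the 3GB, 5MB/s, 300 files, and 3 second figures come from
the numbers quoted here.

```python
def drain_seconds(backlog_mb, write_rate_mb_s):
    """Minimum time to flush a backlog at a steady write rate."""
    return backlog_mb / write_rate_mb_s


def sync_spread_seconds(num_files, delay_s):
    """Total time added by pausing delay_s after each file's fsync."""
    return num_files * delay_s


def affordable_delay(num_files, sync_budget_s):
    """The same math turned around: per-file delay that fits a budget."""
    if num_files <= 0:
        return 0.0
    return sync_budget_s / num_files


# 3GB backlog at 5MB/s: a little over 10 minutes
print(round(drain_seconds(3 * 1024, 5) / 60, 1))
# 300 files at 3 seconds each: 15 minutes of sync spread
print(sync_spread_seconds(300, 3) / 60)
# Hypothetical 600 second budget for the sync phase: 2 seconds per file
print(affordable_delay(300, 600))
```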

Ultimately what I want to do here is some sort of smarter write-behind
sync operation, perhaps with an LRU on relations with pending fsync
requests.  The idea would be to sync relations that haven't been touched 
in a while in advance of the checkpoint even.  I think that's similar to 
the general idea Robert is suggesting here, to get some sync calls 
flowing before all of the checkpoint writes have happened.  I think that 
the final sync calls will need to get spread out regardless, and since
doing that requires a fairly small amount of code too, that's why we
started with that.
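
One possible shape for the LRU idea, as a rough Python sketch.  Nothing
like this exists in the submitted patch; the class PendingSyncLRU, its
method names, and the idle_s threshold are hypothetical.  It assumes
timestamps passed in never go backwards, and it only decides which
relations to sync early, leaving the fsync itself to the caller.

```python
from collections import OrderedDict


class PendingSyncLRU:
    """Relations with pending fsync requests, least recently dirtied first.

    Hypothetical structure for illustration only.
    """

    def __init__(self):
        self._pending = OrderedDict()  # relation -> time last dirtied

    def mark_dirty(self, relation, now):
        # Record the write and move the relation to the "recent" end.
        self._pending[relation] = now
        self._pending.move_to_end(relation)

    def pop_idle(self, now, idle_s):
        """Remove and return relations untouched for at least idle_s."""
        idle = []
        while self._pending:
            relation, touched = next(iter(self._pending.items()))
            if now - touched < idle_s:
                break  # everything after this one is more recent
            self._pending.popitem(last=False)
            idle.append(relation)
        return idle

    def drain(self):
        """At checkpoint time, return whatever is still pending."""
        remaining = list(self._pending)
        self._pending.clear()
        return remaining
```

Relations returned by pop_idle could be synced ahead of the checkpoint,
while the ones left for drain would still go through the spread-out
sync phase.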

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


