Re: Improvement of checkpoint IO scheduler for stable transaction responses
From: Andres Freund
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: 20130704130555.GA1403@awork2.anarazel.de
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses:
  Re: Improvement of checkpoint IO scheduler for stable transaction responses (Tom Lane <tgl@sss.pgh.pa.us>)
  Re: Improvement of checkpoint IO scheduler for stable transaction responses ("Joshua D. Drake" <jd@commandprompt.com>)
List: pgsql-hackers
On 2013-07-04 21:28:11 +0900, KONDO Mitsumasa wrote:
> > That would move all the vm and fsm forks to separate directories,
> > which would cut down the number of files in the main-fork directory
> > significantly. That might be worth doing independently of the issue
> > you're raising here. For large clusters, you'd even want one more
> > level to keep the directories from getting too big:
> >
> > base/${DBOID}/${FORK}/${X}/${RELFILENODE}
> >
> > ...where ${X} is two hex digits, maybe just the low 16 bits of the
> > relfilenode number. But this would be not as good for small clusters
> > where you'd end up with oodles of little-tiny directories, and I'm not
> > sure it'd be practical to smoothly fail over from one system to the
> > other.
>
> It seems a good idea! In general, the base directory is not seen by the
> user, so it could use a more efficient arrangement for performance and
> be better adapted to large databases.
>
> > Presumably the smaller segsize is better because we don't
> > completely stall the system by submitting up to 1GB of io at once. So,
> > if we were to do it in 32MB chunks and then do a final fsync()
> > afterwards we might get most of the benefits.
>
> Yes, I will try to test this setting './configure --with-segsize=0.03125'
> tonight. I will send you this test result tomorrow.

I don't like going in this direction at all:

1) It breaks pg_upgrade. Which means many of the bigger users won't be
able to migrate to this, and most packagers would carry the old segsize
around forever. Even if we could get pg_upgrade to split files
accordingly, link mode would still be broken.

2) It drastically increases the number of file handles necessary and, by
extension, the number of open/close calls. Those aren't all that cheap.
And it increases metadata traffic, since mtime/atime are kept for more
files. Also, file creation is rather expensive since it requires a
metadata transaction at the filesystem level.

3) It breaks readahead, since that usually only works within a single
file.
I am pretty sure that this will significantly slow down uncached
sequential reads on larger tables.

> (2013/07/03 22:39), Andres Freund wrote:
> > On 2013-07-03 17:18:29 +0900
> > Hm. I wonder how much of this could be gained by doing a
> > sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
> > the original checkpoint-pass through the buffers or when fsyncing the
> > files.
>
> The sync_file_range system call is interesting. But it is supported only
> by Linux kernel 2.6.22 or later. For PostgreSQL, Robert's idea, which
> does not depend on the kind of OS, would suit better.

Well. But it can be implemented without breaking things... Even if we
don't have sync_file_range() we can cope by simply doing fsync()s more
frequently. For every open file, keep track of the amount of buffers
dirtied, and every 32MB or so issue an fdatasync()/fsync().

> I think that the best way to write buffers in a checkpoint is sorted by
> buffer FD and block number, with a small segsize setting and per-file
> sleep times. It would realize a genuine sorted checkpoint with
> sequential disk writing!

That would make regular fdatasync()ing even easier.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
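[Editor's note: the portable fallback Andres describes, tracking dirtied bytes per open file and issuing an fdatasync() every 32MB or so instead of relying on sync_file_range(), could be sketched roughly as below. The struct and function names are invented for illustration; they are not PostgreSQL's actual checkpointer data structures.]

```c
#include <stdint.h>
#include <unistd.h>

/* Flush after roughly this much dirtied data per file (32MB, as in the mail). */
#define SYNC_CHUNK_BYTES (32u * 1024 * 1024)

/* Hypothetical per-file bookkeeping kept alongside each open segment. */
typedef struct FileSyncState
{
    int      fd;                 /* open file descriptor for the segment */
    uint64_t bytes_since_sync;   /* dirtied bytes not yet pushed to disk */
} FileSyncState;

/*
 * Note that nbytes were just written through st->fd. Once a chunk's worth
 * has accumulated, push it out with fdatasync() so the final fsync() at
 * the end of the checkpoint has little left to write. Returns 1 if a
 * flush happened, 0 if not, -1 on fdatasync() failure.
 */
static int
note_write_and_maybe_sync(FileSyncState *st, uint64_t nbytes)
{
    st->bytes_since_sync += nbytes;
    if (st->bytes_since_sync < SYNC_CHUNK_BYTES)
        return 0;
    if (fdatasync(st->fd) != 0)
        return -1;
    st->bytes_since_sync = 0;
    return 1;
}
```

On Linux, sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) could instead start writeback of just the written range without waiting for completion; the fdatasync() approach above works on any POSIX system that provides it.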