Re: Improvement of checkpoint IO scheduler for stable transaction responses
From: Andres Freund
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: 20130704130555.GA1403@awork2.anarazel.de
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses:
  Re: Improvement of checkpoint IO scheduler for stable transaction responses (Tom Lane <tgl@sss.pgh.pa.us>)
  Re: Improvement of checkpoint IO scheduler for stable transaction responses ("Joshua D. Drake" <jd@commandprompt.com>)
List: pgsql-hackers
On 2013-07-04 21:28:11 +0900, KONDO Mitsumasa wrote:
> > That would move all the vm and fsm forks to separate directories,
> > which would cut down the number of files in the main-fork directory
> > significantly. That might be worth doing independently of the issue
> > you're raising here. For large clusters, you'd even want one more
> > level to keep the directories from getting too big:
> >
> > base/${DBOID}/${FORK}/${X}/${RELFILENODE}
> >
> > ...where ${X} is two hex digits, maybe just the low 16 bits of the
> > relfilenode number. But this would be not as good for small clusters
> > where you'd end up with oodles of little-tiny directories, and I'm not
> > sure it'd be practical to smoothly fail over from one system to the
> > other.
>
> It seems a good idea! In general, the base directory is not seen by the
> user, so it could use a more efficient arrangement for performance and
> be better adapted to large databases.
>
> > Presumably the smaller segsize is better because we don't
> > completely stall the system by submitting up to 1GB of io at once. So,
> > if we were to do it in 32MB chunks and then do a final fsync()
> > afterwards we might get most of the benefits.
>
> Yes, I will try to test this setting './configure --with-segsize=0.03125'
> tonight. I will send you this test result tomorrow.

I don't like going in this direction at all:

1) It breaks pg_upgrade. Which means many of the bigger users won't be
able to migrate to this, and most packagers would carry the old segsize
around forever. Even if we could get pg_upgrade to split files
accordingly, link mode would still be broken.

2) It drastically increases the number of file handles necessary and, by
extension, the number of open/close calls. Those aren't all that cheap.
And it increases metadata traffic, since mtime/atime are kept for more
files. Also, file creation is rather expensive since it requires a
metadata transaction at the filesystem level.

3) It breaks readahead, since that usually only works within a single
file.
I am pretty sure that this will significantly slow down uncached
sequential reads on larger tables.

> (2013/07/03 22:39), Andres Freund wrote:
> > On 2013-07-03 17:18:29 +0900
> > Hm. I wonder how much of this could be gained by doing a
> > sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
> > the original checkpoint-pass through the buffers or when fsyncing the
> > files.
>
> The sync_file_range system call is interesting. But it is supported only
> by Linux kernel 2.6.22 or later. For PostgreSQL, Robert's idea, which
> does not depend on the kind of OS, would suit better.

Well. But it can be implemented without breaking things... Even if we
don't have sync_file_range() we can cope by simply doing fsync()s more
frequently. For every open file, keep track of the amount of buffers
dirtied, and every 32MB or so issue an fdatasync()/fsync().

> I think that the best way to write buffers in a checkpoint is sorted by
> buffer FD and block number, with a small segsize setting and per-file
> sleep times. It would realize a genuine sorted checkpoint with
> sequential disk writing!

That would make regular fdatasync()ing even easier.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
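[Editor's note: the portable fallback Andres describes, tracking dirtied bytes per open file and issuing an fdatasync() every 32MB or so instead of relying on sync_file_range(), could be sketched roughly as below. The struct and function names are invented for illustration; they are not PostgreSQL's actual checkpointer data structures.]

```c
#include <stdint.h>
#include <unistd.h>

/* Flush after roughly this much dirtied data per file (32MB, as in the mail). */
#define SYNC_CHUNK_BYTES (32u * 1024 * 1024)

/* Hypothetical per-file bookkeeping kept alongside each open segment. */
typedef struct FileSyncState
{
    int      fd;                 /* open file descriptor for the segment */
    uint64_t bytes_since_sync;   /* dirtied bytes not yet pushed to disk */
} FileSyncState;

/*
 * Note that nbytes were just written through st->fd. Once a chunk's worth
 * has accumulated, push it out with fdatasync() so the final fsync() at
 * the end of the checkpoint has little left to write. Returns 1 if a
 * flush happened, 0 if not, -1 on fdatasync() failure.
 */
static int
note_write_and_maybe_sync(FileSyncState *st, uint64_t nbytes)
{
    st->bytes_since_sync += nbytes;
    if (st->bytes_since_sync < SYNC_CHUNK_BYTES)
        return 0;
    if (fdatasync(st->fd) != 0)
        return -1;
    st->bytes_since_sync = 0;
    return 1;
}
```

On Linux, sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) could instead start writeback of just the written range without waiting for completion; the fdatasync() approach above works on any POSIX system that provides it.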