Re: Spread checkpoint sync

From: Robert Haas
Subject: Re: Spread checkpoint sync
Date:
Msg-id: AANLkTi=oiyz8V5aGaxnx-0bkpU6izkwkvzh2ruUF55jv@mail.gmail.com
In reply to: Re: Spread checkpoint sync  (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: Spread checkpoint sync  (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> The thing to realize
>>> that complicates the design is that the actual sync execution may take a
>>> considerable period of time.  It's much more likely for that to happen than
>>> in the case of an individual write, as the current spread checkpoint does,
>>> because those are usually cached.  In the spread sync case, it's easy for
>>> one slow sync to make the rest turn into ones that fire in quick succession,
>>> to make up for lost time.
>>
>> I think the behavior of file systems and operating systems is highly
>> relevant here.  We seem to have a theory that allowing a delay between
>> the write and the fsync should give the OS a chance to start writing
>> the data out,
>
> I thought that the theory was that doing too many fsyncs in short order
> can lead to some kind of starvation of other IO.
>
> If the theory is that we want to wait between writes and fsyncs, then
> the current behavior is probably the best: spreading the writes out
> and then doing all the syncs at the end gives the longest average delay
> between a write and the sync of the file it was written to.  Or,
> spread the writes out over 150 seconds, sleep for 140 seconds, then do
> the fsyncs.  But I don't think that that is the theory.

Well, I've heard Bruce and, I think, possibly also Greg talk about
wanting to wait after doing the writes in the hopes that the kernel
will start to flush the dirty pages, but I'm wondering whether it
wouldn't be better to just give up on that and do: small batch of
writes - fsync those writes - another small batch of writes - fsync
that batch - etc.

>> but do we have any evidence indicating whether and under
>> what circumstances that actually occurs?  For example, if we knew that
>> it's important to wait at least 30 s but waiting 60 s is no better,
>> that would be useful information.
>>
>> Another question I have is about how we're actually going to know when
>> any given fsync can be performed.  For any given segment, there are a
>> certain number of pages A that are already dirty at the start of the
>> checkpoint.
>
> Dirty in the shared pool, or dirty in the OS cache?

OS cache, sorry.

>> Then there are a certain number of additional pages B
>> that are going to be written out during the checkpoint.  If it so
>> happens that B = 0, we can call fsync() at the beginning of the
>> checkpoint without losing anything (in fact, we gain something: any
>> pages dirtied by cleaning scans or backend writes during the
>> checkpoint won't need to hit the disk;
>
> Aren't those pages written out by cleaning scans and backend writes
> while the checkpoint is occurring exactly what you defined to be page
> set B, and then to be zero?

No, sorry, I'm referring to cases where all the dirty pages in a
segment have been written out to the OS but we have not yet issued the
necessary fsync.

>> and if the filesystem dumps
>> more of its cache than necessary on fsync, we may as well take that
>> hit before dirtying a bunch more stuff).  But if B > 0, then we
>> shouldn't attempt the fsync() until we've written them all; otherwise we'll end
>> up having to fsync() that segment twice.
>>
>> Doing all the writes and then all the fsyncs meets this requirement
>> trivially, but I'm not so sure that's a good idea.  For example, given
>> files F1 ... Fn with dirty pages needing checkpoint writes, we could
>> do the following: first, do any pending fsyncs for files not among F1
>> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
>> and fsync, write all pages for F3 and fsync, etc.  This might seem
>> dumb because we're not really giving the OS a chance to write anything
>> out before we fsync, but think about the ext3 case where the whole
>> filesystem cache gets flushed anyway.  It's much better to dump the
>> cache at the beginning of the checkpoint and then again after every
>> file than it is to spew many GB of dirty stuff into the cache and then
>> drop the hammer.
>
> But the kernel has knobs to prevent that from happening.
> dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
> kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
> is supposed to do a journal commit every 5 seconds under default mount
> conditions.

I don't know in detail.  dirty_expire_centisecs sounds useful; I think
the problem with dirty_background_ratio and dirty_ratio is that the
default ratios are large enough that on systems with a huge pile of
memory, they allow more dirty data to accumulate than can be flushed
without causing an I/O storm.  I believe Greg Smith made a comment
along the lines of: memory sizes are growing faster than I/O speeds;
therefore a ratio that is OK for a low-end system with a modest amount
of memory causes problems on a high-end system that has faster I/O but
MUCH more memory.
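[For reference, the Linux writeback knobs Jeff mentions can be inspected as below; this is a sketch, and knob availability and defaults vary by kernel version:]

```shell
# Inspect the current writeback thresholds (Linux)
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs

# On large-memory machines, newer kernels allow absolute byte limits
# instead of ratios, which avoids the "ratio of a huge RAM" problem,
# e.g. (example value only, not a recommendation):
#   sysctl -w vm.dirty_background_bytes=268435456
```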

As a kernel developer, I suspect the tendency is to try to set the
ratio so that you keep enough free memory around to service future
allocation requests.  Optimizing for possible future fsyncs is
probably not the top priority...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

