Re: limiting hint bit I/O

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: limiting hint bit I/O
Дата
Msg-id AANLkTi=O7RR7HJvUVH3sYG7DqocM2mG1Q4b359Ows_=C@mail.gmail.com
обсуждение исходный текст
Ответ на Re: limiting hint bit I/O  (Bruce Momjian <bruce@momjian.us>)
Ответы Re: limiting hint bit I/O  (Cédric Villemain <cedric.villemain.debian@gmail.com>)
Список pgsql-hackers
On Mon, Feb 7, 2011 at 10:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > Uh, in this C comment:
>> >
>> > + ? ? ? ?* or not we want to take the time to write it. ?We allow up to 5% of
>> > + ? ? ? ?* otherwise-not-dirty pages to be written due to hint bit changes,
>> >
>> > 5% of what? ?5% of all buffers? ?5% of all hint-bit-dirty ones? ?Can you
>> > clarify this in the patch?
>>
>> 5% of buffers that are hint-bit-dirty but not otherwise dirty.  ISTM
>> that's exactly what the comment you just quoted says on its face, but
>> I'm open to some other wording you want to propose.
>
> How about:
>
>        otherwise-not-dirty -> only-hint-bit-dirty
>
> So 95% of your hint bit modificates are discarded if the pages is not
> otherwise dirtied?  That seems pretty radical.

No, it's more subtle than that, although I admit it *is* radical.
There are three ways that pages can get written out to disk:

1. Checkpoints.
2. Background writer activity.
3. Backends writing out dirty buffers because there are no clean
buffers available to allocate.

What the latest version of the patch implements is:

1. Checkpoints no longer write only-hint-bit-dirty pages to disk.
Since a checkpoint doesn't evict pages from memory, the hint bits are
still there to be written out (or not) by (2) or (3), below.

2. When the background writer's cleaning scan hits an
only-hint-bit-dirty page, it writes it, same as before.  This
definitely doesn't result in the loss of any hint bits.

3. When a backend writes out a dirty buffer itself, because there are
no clean buffers available to allocate, it initially writes them.  But
if there are more than 100 such pages per block of 2000 allocations,
it recycles any after the first 100 without writing them.

In normal operation, I suspect that there will be very little impact
from this change.  The change described in #1 may slightly reduce the
size of some checkpoints, but it's unclear that it will be enough to
be material.  The change described in #3 will probably also not
matter, because, in a well-tuned system, the background writer should
be set aggressively enough to provide a supply of clean pages, and
therefore backends shouldn't be doing many writes themselves, and
therefore most buffer allocations will be of already-clean pages, and
the logic described in #3 will probably never kick in.  Even if they
are writing a lot of buffers themselves, the logic in #3 still won't
kick in if many of the pages being written are actually dirty - it
will only matter if the backends are writing out lots and lots of
pages *solely because they are only-hint-bit-dirty*.

Where I expect this to make a big difference is on sequential scans of
just-loaded tables.  In that case, the BufferAccessStrategy machinery
will force the backend to reuse the same buffers over and over again,
and all of those pages will be only-hint-bit-dirty.  So the backend
has to do a write for every page it allocates, and even though those
writes are being absorbed by the OS cache, it's still slow.  With this
patch, what will happen is that the backend will write about 100
pages, then perform the next 1900 allocations without writing, then
write another 100 pages, etc.  So at the end of the scan, instead of
having written an amount of data equal to the size of the table, we
will have written 5% of that amount, and 5% of the hint bits will be
on disk.  Each subsequent scan will get another 5% of the hint bits on
disk until after 20 scans they are all set.  So the work of setting
the hint bits is spread out across the first 20 table scans instead of
all being done the first time through.

Clearly, there's further jiggering that can be done here.  But the
overall goal is simply that some of our users don't seem to like it
when the first scan of a newly loaded table generates a huge storm of
*write* traffic.  Given that the hint bits appear to be quite
important from a performance perspective (see benchmark numbers
upthread), we don't really have the option of just not writing them -
but we can try to not to do it all at once, if we think that's an
improvement, which I think is likely.

Overall, I'm inclined to move this patch to the next CommitFest and
forget about it for now.  I don't think we're going to get enough
testing of this in the next week to be really confident that it's
right.  I might be willing to commit with some more moderate amount of
testing if we were right at the beginning of a development cycle,
figuring that we'd shake out any warts as the cycle went along, but
this isn't seeming like the right time for this kind of a change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jeff Davis
Дата:
Сообщение: Re: Range Types
Следующее
От: Chris Browne
Дата:
Сообщение: Re: OpenVMS - an effort which needs guidance and support.