Re: Page replacement algorithm in buffer cache

From Merlin Moncure
Subject Re: Page replacement algorithm in buffer cache
Date
Msg-id CAHyXU0zFmRFBJeGtWuwxey-8brcxS60Ve0vqT=wW+1v8P3E7UA@mail.gmail.com
In reply to Re: Page replacement algorithm in buffer cache  (Jim Nasby <jim@nasby.net>)
Responses Re: Page replacement algorithm in buffer cache  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Apr 1, 2013 at 6:09 PM, Jim Nasby <jim@nasby.net> wrote:
> On 4/1/13 4:55 PM, Merlin Moncure wrote:
>>
>> On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>
>>> On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
>>>>
>>>> On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>>>
>>>>> On Friday, March 22, 2013, Ants Aasma wrote:
>>>>>>
>>>>>> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>>>>>
>>>>>>> well if you do a non-locking test first you could at least avoid some
>>>>>>> cases (and, if you get the answer wrong, so what?) by jumping to the
>>>>>>> next buffer immediately.  if the non locking test comes good, only
>>>>>>> then do you do a hardware TAS.
>>>>>>>
>>>>>>> you could in fact go further and dispense with all locking in front of
>>>>>>> usage_count, on the premise that it's only advisory and not a real
>>>>>>> refcount.  so you only then lock if/when it's time to select a
>>>>>>> candidate buffer, and only then when you did a non locking test first.
>>>>>>> this would of course require some amusing adjustments to various
>>>>>>> logical checks (usage_count <= 0, heh).
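
A minimal sketch of that non-locking pre-test, written against a simplified,
hypothetical buffer header rather than the real BufferDesc: "MyBufDesc" and
"try_claim_victim" are illustrative names, and only slock_t and
SpinLockAcquire/SpinLockRelease are PostgreSQL's actual spinlock primitives.

    #include "postgres.h"
    #include "storage/spin.h"           /* slock_t, SpinLockAcquire/Release */

    typedef struct MyBufDesc
    {
        volatile int usage_count;       /* advisory popularity counter */
        volatile int refcount;          /* pin count */
        slock_t      hdr_lock;          /* per-buffer spinlock (hardware TAS) */
    } MyBufDesc;

    /*
     * Return true if buf was claimed as an eviction candidate.  The first
     * check is a plain unlocked read; if it says the buffer is pinned or
     * popular, we move on without ever touching the spinlock.  Only when
     * the dirty read looks promising do we pay for the TAS, and then we
     * re-check under the lock because the unlocked answer may be stale.
     */
    static bool
    try_claim_victim(MyBufDesc *buf)
    {
        if (buf->usage_count > 0 || buf->refcount > 0)
            return false;               /* cheap bail-out; being wrong is ok */

        SpinLockAcquire(&buf->hdr_lock);        /* the hardware TAS */
        if (buf->usage_count == 0 && buf->refcount == 0)
        {
            /* ... mark the buffer as our victim while still holding the lock ... */
            SpinLockRelease(&buf->hdr_lock);
            return true;
        }
        SpinLockRelease(&buf->hdr_lock);
        return false;                   /* lost the race; try the next buffer */
    }

Being wrong in the unlocked check costs little either way: a stale "busy"
just makes the sweep look at one more buffer, and a stale "free" is caught by
the re-check under the spinlock.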
>>>>>>
>>>>>> Moreover, if the buffer happens to miss a decrement due to a data
>>>>>> race, there's a good chance that the buffer is heavily used and
>>>>>> wouldn't need to be evicted soon anyway. (if you arrange it to be a
>>>>>> read-test-inc/dec-store operation then you will never go out of
>>>>>> bounds) However, clocksweep and usage_count maintenance is not what
>>>>>> is causing contention because that workload is distributed. The
>>>>>> issue is pinning and unpinning.
>>>>>
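
That read-test-and-store pattern could look roughly like this, again against
the hypothetical MyBufDesc from the sketch above; the point is only that the
bounds test is made on a local copy before the unlocked write, so the stored
value can never leave [0, max] even if a concurrent update is lost.

    #define MY_MAX_USAGE_COUNT 5

    /* Bump on buffer access; no lock, a lost update is acceptable. */
    static void
    usage_count_bump(MyBufDesc *buf)
    {
        int     cur = buf->usage_count;         /* read  */

        if (cur < MY_MAX_USAGE_COUNT)           /* test  */
            buf->usage_count = cur + 1;         /* store */
    }

    /* Decay during the clock sweep; the same pattern going down. */
    static void
    usage_count_decay(MyBufDesc *buf)
    {
        int     cur = buf->usage_count;

        if (cur > 0)
            buf->usage_count = cur - 1;
    }

Two backends racing here can overwrite each other's write, so an occasional
bump or decay is lost, but every stored value was derived from an in-bounds
read, so the counter can never underflow or overshoot.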
>>>>> That is one of multiple issues.  Contention on the BufFreelistLock is
>>>>> another one.  I agree that usage_count maintenance is unlikely to
>>>>> become a bottleneck unless one or both of those is fixed first (and
>>>>> maybe not even then)
>>>>
>>>> usage_count manipulation is not a bottleneck but that is irrelevant.
>>>> It can be affected by other page contention which can lead to priority
>>>> inversion.  I don't believe there is any reasonable argument that
>>>> sitting and spinning while holding the BufFreelistLock is a good idea.
>>>
>>> In my experience the mere fact of (unlockedly, but still) accessing all
>>> the buffer headers can cause noticeable slowdowns in write only/mostly
>>> workloads with big amounts of shmem.
>>> Due to the write-only nature, large numbers of the buffers have similar
>>> usagecounts (since they are infrequently touched after the initial
>>> insertion) and there are no free ones around, so the search for a buffer
>>> frequently runs through *all* buffer headers multiple times till it has
>>> decremented all usagecounts to 0. Then comes a period where free buffers
>>> are found easily (since all usagecounts from the current sweep point
>>> onwards are zero). After that it starts all over.
>>> I have now seen that scenario multiple times :(
>>
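To put an illustrative number on that scenario (not a measurement): with
shared_buffers = 8GB there are roughly a million 8kB buffers, so if nearly
all of them sit at usage_count 5 and the freelist is empty, finding the first
victim can require visiting and decrementing on the order of five million
buffer headers; once the plateau breaks, evictions are briefly cheap until
the counts build back up and the cycle repeats.
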
>> Interesting -- I was thinking about that too, but it's a separate
>> problem with a different trigger.  Maybe a bailout should be in there
>> so that after X usage_count adjustments the sweeper summarily does an
>> eviction, or maybe the "max" declines from 5 once per hundred buffers
>> inspected or some such.
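
One way that "declining max" bail-out could look, reusing the hypothetical
MyBufDesc from the earlier sketch; the starting ceiling of 5 and the factor
of one hundred are just the numbers floated in this thread, not tuned values.

    /*
     * One clock-sweep pass for a single allocation.  A buffer is evictable
     * once its usage_count is at or below "threshold"; the threshold starts
     * at 0 and rises by one for every hundred buffers inspected (which is
     * the same thing as the effective maximum declining from 5), so even a
     * pool full of "popular" pages yields a victim after a bounded amount
     * of sweeping instead of up to five full rotations.
     */
    static MyBufDesc *
    sweep_for_victim(MyBufDesc *buffers, int nbuffers, int *sweep_pos)
    {
        int     inspected = 0;
        int     threshold = 0;

        for (;;)
        {
            MyBufDesc *buf = &buffers[*sweep_pos];

            *sweep_pos = (*sweep_pos + 1) % nbuffers;

            if (++inspected % 100 == 0 && threshold < 5)
                threshold++;        /* get less picky as the sweep drags on */

            if (buf->refcount == 0 && buf->usage_count <= threshold)
                return buf;         /* candidate; caller re-checks under the lock */

            if (buf->usage_count > 0)
                buf->usage_count--; /* normal clock-sweep decay */

            if (inspected > 6 * nbuffers)
                return NULL;        /* everything pinned; caller must cope */
        }
    }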
>
> What's the potential downside on that though? IE: what happens if this
> scheme suddenly evicts the root page on a heavily used index? You'll
> suddenly have a ton of stuff blocked waiting for that page to come back in.

That seems pretty unlikely because of (A) the sheer luck required to hit
that page for the dropout (if your buffer count is N, the chance of losing
it would seem to be 1/N at most) and (B) heavily used pages are much more
likely to be pinned and thus immune from eviction.  But my issue with this
whole line of analysis is that I've never been able to turn it up in
simulated testing.  To do that you'd probably need very, very fast storage.
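
For scale (an illustrative configuration, not one from this thread): at
shared_buffers = 4GB there are roughly half a million buffers, so a single
bail-out eviction has at most about a 1-in-500,000 chance of landing on any
one particular page, before even accounting for hot pages being pinned.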

> This is a use case that I think would benefit greatly from a background
> process that keeps pages in the free list.

> That said, I now suspect that your "frightened turtle" approach would be of
> higher value than "bgfreelist"... but I suspect we'll ultimately want both
> of them for different reasons.

Well, maybe.  Performance analysis of patches like this has to be
systematic and extremely thorough, so who knows what the best way forward
is? (hint: not me).  I'm hoping that adjusting clock sweep behavior will
shave a few cycles in uncontended workloads -- that will make it a lot
easier to sell performance-wise.  But I'm optimistic that maybe we can
greatly reduce allocation contention without a lot of work.

merlin


