Re: ReadRecentBuffer() doesn't scale well

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: ReadRecentBuffer() doesn't scale well
Дата
Msg-id 20230627040931.km2ek55e2ri4a3rv@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: ReadRecentBuffer() doesn't scale well  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: ReadRecentBuffer() doesn't scale well
Re: ReadRecentBuffer() doesn't scale well
Список pgsql-hackers
Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue, but
> > due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care to not hold the buffer
> > header spinlock, it uses compare_exchange to atomically acquire the pin, while
> > guaranteing nobody holds the lock.  When holding the buffer header spinlock,
> > there obviously is the risk of being scheduled out (or even just not have
> > exclusive access to the cacheline).
> 
> Yeah.  Aside from inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.


> > The fairly obvious solution to this is to just use PinBuffer() and just unpin
> > the buffer if its identity was changed concurrently. There could be an
> > unlocked pre-check as well.  However, there's the following comment in
> > ReadRecentBuffer():
> >                          * It's now safe to pin the buffer.  We can't pin first and ask
> >                          * questions later, because it might confuse code paths like
> >                          * InvalidateBuffer() if we pinned a random non-matching buffer.
> >                          */
> >
> > But I'm not sure I buy that - there's plenty other things that can briefly
> > acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> > contents, etc).
> 
> I may well have been too cautious with that.  The worst thing I can
> think of right now is that InvalidateBuffer() would busy loop (as it
> already does in other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.


> > Another difference between using PinBuffer() and PinBuffer_locked() is that
> > the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing usagecount anymore?
> 
> Yeah, that is not great.  The simplification you suggest would fix
> that too, though I guess it would also bump the usage count of buffers
> that don't have the tag we expected; that's obviously rare and erring
> on a better side though.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.


> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> > the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> > a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> > easy to get much bigger wins, e.g. with a index nested loop with a relatively
> > small index on the inner side.
> 
> Wooo, that's better than I was hoping.  Thanks for trying it out!  I
> think, for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.


> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.

I wonder if we should simple add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...


> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.


Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jaime Casanova
Дата:
Сообщение: Assert !bms_overlap(joinrel->relids, required_outer)
Следующее
От: Dilip Kumar
Дата:
Сообщение: Re: Improving btree performance through specializing by key shape, take 2