Re: Move unused buffers to freelist

From: Andres Freund
Subject: Re: Move unused buffers to freelist
Date:
Msg-id: 20130627142417.GK1254@alap2.anarazel.de
In reply to: Re: Move unused buffers to freelist  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2013-06-27 09:50:32 -0400, Robert Haas wrote:
> On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Contention-wise I agree. What I have seen is that we have a huge
> > amount of cacheline bouncing around the buffer header spinlocks.
> 
> How did you measure that?

perf record -e cache-misses. If you want more detail, looking at
{L1,LLC}-{load,store}{s,misses} can sometimes be helpful too.
Also, running perf stat -vvv postgres -D ... for a whole benchmark can
be useful for comparing how much a change influences cache misses and such.

For very detailed analysis running something under valgrind/cachegrind
can be helpful too, but I usually find perf to be sufficient.

> > I have previously added some ad-hoc instrumentation that printed the
> > number of buffers that were required (by other backends) during a
> > bgwriter cycle and the number of buffers that the buffer manager could
> > actually write out.
> 
> I think you can see how many are needed from buffers_alloc.  No?

Not easily correlated with bgwriter activity. If we cannot keep up
because the bgwriter is 100% busy writing out buffers, I don't have much
of a problem with that. But I don't think it often is.

> > Problems with the current code:
> >
> > * doesn't manipulate the usage_count and never does anything to used
> >   pages, which means it will just about never find a victim buffer in a
> >   busy database.
> 
> Right.  I was thinking that was part of this patch, but it isn't.  I
> think we should definitely add that.  In other words, the background
> writer's job should be to run the clock sweep and add buffers to the
> free list.

We might need to split it into two processes for that: one to write out
dirty pages, one to populate the freelist.
Otherwise we will probably keep hitting the current scalability issues
whenever we are IO contended, say during a busy or even an immediate
checkpoint.
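
For illustration, here is a toy sketch of the freelist-populating side
(standalone C, not actual PostgreSQL code; all names and constants are
made up, and locking/dirty-buffer handling is elided):

/* Toy model: run the clock sweep, decay usage counts, and push
 * zero-usage buffers onto a freelist until a target length is reached. */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS        16
#define FREELIST_TARGET 4

static int  usage_count[NBUFFERS];  /* per-buffer usage counts */
static bool on_freelist[NBUFFERS];  /* already handed out as free? */
static int  freelist[NBUFFERS];
static int  freelist_len;
static int  next_victim;            /* the clock hand */

static void fill_freelist(void)
{
    int swept = 0;

    /* bound the sweep so a fully "hot" buffer pool cannot loop forever */
    while (freelist_len < FREELIST_TARGET && swept++ < NBUFFERS * 5)
    {
        int buf = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;

        if (on_freelist[buf])
            continue;
        if (usage_count[buf] > 0)
            usage_count[buf]--;     /* recently used: give it another chance */
        else
        {
            /* a real implementation would write the buffer out here if dirty */
            on_freelist[buf] = true;
            freelist[freelist_len++] = buf;
        }
    }
}

int main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = i % 6;
    fill_freelist();
    for (int i = 0; i < freelist_len; i++)
        printf("free: %d\n", freelist[i]);
    return 0;
}

The point is just that usage count decay and freelist filling happen in the
same sweep, so a busy database still yields victims eventually.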

>  I think we should also split the lock: a spinlock for the
> freelist, and an lwlock for the clock sweep.

Yea, I thought about that when writing the bit about the exclusive lock
during the clock sweep.
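
For illustration, the split could look roughly like this (standalone sketch
with pthread primitives standing in for s_lock/LWLock; not actual PostgreSQL
code): the freelist pop only ever takes a cheap spinlock, and the clock
sweep only takes its own lock, so the common path never waits for a sweep
in progress.

#include <pthread.h>
#include <stdio.h>

#define NBUFFERS 128

static pthread_spinlock_t freelist_lock;    /* protects freelist[] only */
static int freelist[NBUFFERS];
static int freelist_len;

static pthread_mutex_t clocksweep_lock = PTHREAD_MUTEX_INITIALIZER;
static int nextVictimBuffer;                /* protected by clocksweep_lock */

/* fast path: pop a buffer from the freelist, -1 if it is empty */
static int freelist_pop(void)
{
    int buf = -1;

    pthread_spin_lock(&freelist_lock);
    if (freelist_len > 0)
        buf = freelist[--freelist_len];
    pthread_spin_unlock(&freelist_lock);
    return buf;
}

/* slow path: advance the clock hand under its own, separate lock */
static int clocksweep_next(void)
{
    int buf;

    pthread_mutex_lock(&clocksweep_lock);
    buf = nextVictimBuffer;
    nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;
    pthread_mutex_unlock(&clocksweep_lock);
    return buf;
}

int main(void)
{
    pthread_spin_init(&freelist_lock, PTHREAD_PROCESS_PRIVATE);
    freelist[freelist_len++] = 42;
    printf("from freelist: %d\n", freelist_pop());
    printf("from clock sweep: %d\n", clocksweep_next());
    return 0;
}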

> > * by far not aggressive enough, touches only a few buffers ahead of the
> >   clock sweep.
> 
> Check.  Fixing this might be a separate patch, but then again maybe
> not.  The changes we're talking about here provide a natural feedback
> mechanism: if we observe that the freelist is empty (or less than some
> length, like 32 buffers?) set the background writer's latch, because
> we know it's not keeping up.

Yes, that makes sense. It also provides adaptability to bursty workloads,
which means we don't need overly complex logic in the bgwriter for that.
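
For illustration, the feedback itself could be as simple as this (standalone
sketch, hypothetical names; in the real thing the wakeup would presumably be
a SetLatch() on the bgwriter's latch):

#include <stdio.h>

#define FREELIST_LOW_WATERMARK 32   /* wake the bgwriter below this length */

static int freelist_len = 10;       /* stand-in for the shared freelist length */

/* stand-in for setting the background writer's latch */
static void wake_bgwriter(void)
{
    printf("bgwriter woken: freelist down to %d buffers\n", freelist_len);
}

/* called by a backend after it pops (or fails to pop) a free buffer */
static void maybe_wake_bgwriter(void)
{
    if (freelist_len < FREELIST_LOW_WATERMARK)
        wake_bgwriter();            /* bgwriter refills via the clock sweep */
}

int main(void)
{
    maybe_wake_bgwriter();
    return 0;
}

That way the bgwriter only gets aggressive when backends are actually
draining the freelist faster than it is being refilled.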

> > There's another thing we could do to noticeably improve scalability of
> > buffer acquisition. Currently we do a huge amount of work under the
> > freelist lock.
> > ...
> > So, we perform the entire clock sweep, until we find a single buffer we
> > can use, inside a *global* lock. At times we need to iterate over the
> > whole of shared buffers BM_MAX_USAGE_COUNT (5) times till we have pushed
> > down all the usage counts enough (if the database is busy it can take
> > even longer...).
> > In a busy database, where usually all the usage counts are high, the
> > next backend will touch a lot of those buffers again, which causes
> > massive cache eviction & bouncing.
> >
> > It seems far more sensible to only protect the clock sweep's
> > nextVictimBuffer with a spinlock. With some care all the rest can happen
> > without any global interlock.
> 
> That's a lot more spinlock acquire/release cycles, but it might work
> out to a win anyway.  Or it might lead to the system suffering a
> horrible spinlock-induced death spiral on eviction-heavy workloads.

I can't imagine it being worse than what we have today. Also, nobody
requires us to only advance the clock sweep by one page; we can easily
advance it by, say, 29 pages at a time if we detect the lock is contended.

Alternatively, it shouldn't be too hard to turn it into an atomic
increment, although that requires some trickery to handle the wraparound
sanely.
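
For illustration, the atomic variant could look like this C11 sketch
(hypothetical names, not actual PostgreSQL code): the counter is allowed to
run past the number of buffers, the hand is the counter modulo NBUFFERS, and
the counter is occasionally folded back into range to avoid overflow;
advancing by more than one slot at a time falls out naturally.

#include <stdatomic.h>
#include <stdio.h>

#define NBUFFERS 1024

static atomic_uint next_victim_counter;     /* monotonically increasing */

/* claim the next victim slot; nslots > 1 advances the hand in one step */
static unsigned int clock_sweep_advance(unsigned int nslots)
{
    unsigned int old = atomic_fetch_add(&next_victim_counter, nslots);
    unsigned int victim = old % NBUFFERS;

    /*
     * Wraparound trickery: once the raw counter is far past NBUFFERS, try
     * to fold it back into range.  If the CAS loses a race with another
     * backend, that is fine; a later caller will fold it instead.
     */
    if (old >= (unsigned int) NBUFFERS * 1024)
    {
        unsigned int expected = old + nslots;

        atomic_compare_exchange_strong(&next_victim_counter, &expected,
                                       expected % NBUFFERS);
    }
    return victim;
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("victim slot %u\n", clock_sweep_advance(1));
    return 0;
}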

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


