Re: WIP: dynahash replacement for buffer table

From: Robert Haas
Subject: Re: WIP: dynahash replacement for buffer table
Date:
Msg-id: CA+TgmoYZ=0vO5W97v=EH18WrpsgLELOih3yYiSPyHQtBOa=XUw@mail.gmail.com
In reply to: Re: WIP: dynahash replacement for buffer table  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: WIP: dynahash replacement for buffer table
           Re: WIP: dynahash replacement for buffer table
List: pgsql-hackers
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> The code in CHashSearch shows the problem there: you need to STORE the
>> hazard pointer before you begin to do the LOAD operations to scan the
>> bucket, and you must finish all of those LOADs before you STORE the
>> NULL hazard pointer.  A read or write barrier won't provide store/load
>> or load/store ordering.
>
> I'm not sure I understand the problem here - but a read + write barrier
> should suffice? To avoid falling back to two full barriers, we'd need a
> separate define pg_read_write_barrier(), but that's not a problem. IIUC
> that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.  It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads.  We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.
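
To spell out the ordering I mean, here's a rough sketch of that window
(not the actual CHashSearch code; the node layout, hazard-pointer slot,
and function name are made up for illustration), with pg_memory_barrier()
standing in for both fences:

#include "postgres.h"
#include "storage/barrier.h"            /* pg_memory_barrier() */

/* hypothetical node layout and per-backend hazard-pointer slot */
typedef struct HNode
{
    struct HNode *next;
    uint32        hashcode;
} HNode;

static void * volatile my_hazard_ptr;   /* really lives in shared memory */

static HNode *
hazard_search(HNode **bucket, uint32 hashcode)
{
    HNode      *node;

    my_hazard_ptr = bucket;     /* STORE: advertise the bucket we'll scan */

    /*
     * Store/load ordering: the store above has to hit memory before any
     * of the loads below.  Neither pg_read_barrier() nor pg_write_barrier()
     * promises that, so this one must be a full barrier (mfence, or
     * lock; addl, on x86).
     */
    pg_memory_barrier();

    for (node = *bucket; node != NULL; node = node->next)      /* LOADs */
    {
        if (node->hashcode == hashcode)
            break;
    }

    /*
     * Load/store ordering: the loads above must complete before the store
     * below.  x86 never moves a store ahead of an earlier load, so this is
     * the fence that might be removable there; on PPC lwsync might do.
     */
    pg_memory_barrier();
    my_hazard_ptr = NULL;       /* STORE: drop the hazard pointer */

    return node;
}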

I tested PPC, because that's what I have.  I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases.  The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it.  Or that's my reading anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/
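
And if we wanted to benchmark your pg_read_write_barrier() idea, note
that no such macro exists today; a rough sketch of how it might be
spelled (assuming, per the first link, that lwsync is strong enough on
PPC, which is exactly what's in question) is:

/*
 * Hypothetical pg_read_write_barrier(): orders load/load, load/store and
 * store/store, but not store/load.  On PPC that's what lwsync provides;
 * elsewhere, fall back to a full barrier to stay safe.
 */
#if defined(__powerpc__) || defined(__powerpc64__)
#define pg_read_write_barrier() __asm__ __volatile__ ("lwsync" : : : "memory")
#else
#define pg_read_write_barrier() pg_memory_barrier()
#endif

If the second link is right, that would only be a candidate for the fence
before clearing the hazard pointer, not the one after setting it.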

>> > My guess is that the additional indirection via the arena explains the
>> > difference in cache misses? But I might be completely off here.
>>
>> The arena makes the calculation of the pointer address less
>> predictable, which might make things more difficult for the CPU
>> pipeline.  But aside from that, I think it's the same basic idea: you
>> bounce from some sort of bucket header to the first element of the
>> chain and then, hopefully, you're done.  Am I missing something?
>
> I haven't really read much of the code yet (wrote most of this email on
> my workstation before embarking on a plane), but afaics there's one
> indirection to the bucket, and then one to the first node in the linked
> list. Right?
> In dynahash it's first to the segment, but there's only few, so it's
> likely to be cached. Then to the bucket - which, afaics, already can
> contain the key.
>
> So it's not really the arena, but that you chose not to store the first
> element in a chain inline...

Hmm, you have a point.  I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise.  If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm.  There's no good way
to atomically swap out the bucket header if it is located at a fixed
position and is bigger than the largest object the machine can manipulate
atomically.
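
Roughly what I mean, glossing over the arena and the rest of the delete
protocol (the node and bucket layouts here are made up, and I'm leaning
on the GCC __sync_bool_compare_and_swap builtin):

#include "postgres.h"

/* hypothetical layouts, just to illustrate the width problem */
typedef struct Node
{
    struct Node *next;
    uint32       key;
    uint32       value;
} Node;

typedef struct Bucket
{
    Node       *head;           /* header is one pointer-sized word */
} Bucket;

/*
 * Unlink the first node of a chain with a single compare-and-swap on the
 * pointer-sized bucket header.  If the first element were stored inline
 * in the bucket, the "header" would be the whole (next, key, value)
 * struct, wider than anything the machine can swap atomically, and this
 * stops working.
 */
static bool
delete_first(Bucket *b, Node *victim)
{
    return __sync_bool_compare_and_swap(&b->head, victim, victim->next);
}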

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


