Re: [HACKERS] Parallel bitmap heap scan

From: Dilip Kumar
Subject: Re: [HACKERS] Parallel bitmap heap scan
Msg-id: CAFiTN-tDSoB6smUHAS35o=7Co3B4v_ED-UOXqnhL7_gsSOzNSQ@mail.gmail.com
In reply to: Re: [HACKERS] Parallel bitmap heap scan (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

Thanks for the review.

> When Andres wrote in
> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)

Here is my analysis of why we chose this approach instead of sharing only the iterator:
----------------------------------------------------------------------------------------------------
The problem is that each entry of the iteration array is just a pointer
to a hash entry, so if we shared only the iterator, its elements would
point into the local memory of other workers. I did consider one more
option: during tbm_begin_iterate, instead of keeping a pointer to the
hash entry, we could copy it into the DSA area, but that would keep two
copies of each hash-table element around for a short duration, which is
not good. I think what we are doing currently is better than that.
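To make that concrete, here is a simplified sketch of the relevant
structs in tidbitmap.c (fields as named in the existing code, trimmed
for brevity):

struct TIDBitmap
{
    /* ... */
    PagetableEntry **spages;    /* sorted exact-page list: pointers into
                                 * the backend-local hash table */
    PagetableEntry **schunks;   /* sorted lossy-chunk list, likewise */
};

struct TBMIterator
{
    TIDBitmap  *tbm;            /* TIDBitmap we're iterating over */
    int         spageptr;       /* next spages index */
    int         schunkptr;      /* next schunks index */
    int         schunkbit;      /* next bit to check in current schunk */
    TBMIterateResult output;
};

Sharing only these arrays would mean sharing arrays of backend-local
pointers; a worker dereferencing an element built by another backend
would read garbage.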
> The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.
> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.

Actually, tbm_begin_iterate processes each element of the hash table
and converts them into the chunk and page arrays; currently every
worker does this for itself, and it's pretty cheap. Suppose I instead
have only the first worker do it; then the implementation would look
something like this:

1. First, we need to make the tbm->spages array and the tbm->schunks
array arrays of dsa_pointers.
2. Then each worker will process these arrays and convert them into
arrays of backend-local pointers (see the sketch after this list).
3. But with the current solution, where all hash elements are stored in
one large DSA chunk, how would we divide them into multiple
dsa_pointers?
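
A minimal sketch of steps 1 and 2, assuming the arrays live in a DSA
area (spages_dp, schunks_dp, and the tbm->dsa field are hypothetical
names for this sketch):

/* Step 1: built once, in shared memory, by the first worker. */
dsa_pointer *spages_dp;     /* one dsa_pointer per exact-page entry */
dsa_pointer *schunks_dp;    /* one dsa_pointer per lossy-chunk entry */

/* Step 2: each worker converts them into backend-local pointers. */
int     i;

for (i = 0; i < tbm->npages; i++)
    tbm->spages[i] = (PagetableEntry *)
        dsa_get_address(tbm->dsa, spages_dp[i]);

The trouble in step 3 is that the hash elements all live inside one big
DSA allocation, so there is no separate dsa_pointer per element to put
into spages_dp in the first place.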

I will work on remaining comments.
> And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.
>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).
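
Okay, I will change this. A minimal sketch of what I understand you are
suggesting (the struct and the tranche id below are hypothetical names;
the tranche would be a new entry in BuiltinTrancheIds):

/* shared iterator state, living in shared memory, advanced in place */
typedef struct TBMSharedIteratorState
{
    int     spageptr;       /* next spages index */
    int     schunkptr;      /* next schunks index */
    int     schunkbit;      /* next bit in current schunk */
    LWLock  lock;           /* protects the three fields above */
} TBMSharedIteratorState;

/* once, while setting up the shared state */
LWLockInitialize(&istate->lock, LWTRANCHE_PARALLEL_BITMAP_SCAN);

/* in each worker, inside tidbitmap.c's iteration routine */
LWLockAcquire(&istate->lock, LW_EXCLUSIVE);
/* advance istate->spageptr / schunkptr / schunkbit directly here;
 * unlike a spinlock, an LWLock tolerates palloc/elog in this section */
LWLockRelease(&istate->lock);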
>
> In 0002, you have this:
>
> +    tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.
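
If I understand the SH_TYPE suggestion correctly, embedding the
allocator would look roughly like this (a sketch against the generated
struct in simplehash.h, using the names you proposed):

typedef struct SH_TYPE
{
    /* ... existing members ... */

    /* element allocator, embedded directly rather than palloc'd */
    void       *(*element_allocate) (struct SH_TYPE *tb, Size size);
    void        (*element_free) (struct SH_TYPE *tb, void *pointer);
    void       *element_allocator_ctx;
} SH_TYPE;

That avoids the pointer chase, and SH_CREATE just takes the two
callbacks and the context as extra arguments.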


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


