Re: WIP: Fast GiST index build

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: WIP: Fast GiST index build
Дата
Msg-id CAPpHfdu4DusbHtMLE+csQJj61rMg_JtFOGSwrQVoOF5wPBkixQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WIP: Fast GiST index build  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: WIP: Fast GiST index build  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
Hi!

Thank you for your notes.

On Fri, Aug 12, 2011 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> [ new patch ]

Some random comments:

- It appears that the "noFollowFight" flag is really supposed to be
called "noFollowRight".
Fixed.

- In gist_private.h you've written "halt-filled" where you really mean
"half-filled".
Fixed.
 
- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time.  IIUC, "BUFFERING",
"LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
they're just used ephemerally while constructing it.  If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.
Having these as index parameters may be helpful when you reindex. It's likely that you would like to rebuild it with same parameters as it was created. Actually, we have the same situation with FILLFACTOR: it is used only during index creation.
 
- Function names should begin with "gist" or some other, appropriate
prefix, especially if they are non-static.  decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.
Fixed.
 
- validateBufferOption needs to use ereport(), not elog().
Fixed.
 
- This needs a bit of attention:

+               /* TODO: Write the WAL record */
+               if (RelationNeedsWAL(state->r))
+                       recptr = gistXLogSplit(state->r->rd_node,
blkno, is_leaf,
+                                                               dist,
oldrlink, oldnsn, InvalidBuffer, true);
+               else
+                       recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert(). Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.

- In gistBufferingPlaceToPage, you've got a series of loops that look like this:

+               for (ptr = dist; ptr; ptr = ptr->next)

The first loop allocates a bunch of buffers.  The second loop sets up
downlink pointers.   Then there's some other code.  Then there's a
third loop, which adds items to each page in turn and sets up right
links.  Then there's a fourth loop, which marks all those buffers
dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list.  All of this
seems like it could be simplified.  In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page.  And the fifth and
sixth loops can also be merged.  You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks & pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.

I'm also wondering how long this linked list can be.  It's not good to
have large numbers of buffers locked for a long period of time.  At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards.  The overall flow of the function is:

if (is_split)
{
   /* complicated stuff */
}
else
{
   /* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{
   /* simple case */
   return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.
While, I've merged gistplacetopage() and gistBufferingBuildPlaceToPage(). Now I'm trying some more refactoring.
 
- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index.  Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes.  Some kind of comment seems like it
would be a good idea.
Corresponding comment was added.
 
- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them.  If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?
Idea of having buffers on levels with some step is following. We have enough of cache to have a sub-tree of some height fits to cache. When we loaded such sub-tree once we can process index tuples inside it effectively (without actual IO). During buffer emptying we're flushing index tuples to undeflying buffers or leaf pages. Having buffers on levels with step we guarantee that flushing don't require loading(and writing) more then such sub-tree (which fits to cache). Thus, if we've processed many enough of index tuples during emptying, it's IO effective. It's possible that some more effective distribution of buffers exists, but it's currently unclear for me.

Other changes:
1) Levelstep and buffersize user options were removed.
2) Buffer size is now run time tuned.
3) Buffer emptying now stops when some child can't take index tuple anymore.

------
With best regards,
Alexander Korotkov.
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: our buffer replacement strategy is kind of lame
Следующее
От: Robert Haas
Дата:
Сообщение: Re: VACUUM FULL versus unsafe order-of-operations in DDL commands