Re: Hash Indexes

From: Jeff Janes
Subject: Re: Hash Indexes
Date:
Msg-id: CAMkU=1y==EbuEhxbL_A6CoP07unaL8sQNjSL9zZw_Zb-tH5FfQ@mail.gmail.com
In reply to: Re: Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Hash Indexes  (Jeff Janes <jeff.janes@gmail.com>)
           Re: Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>>
>>> I have fixed all other issues you have raised.  Updated patch is
>>> attached with this mail.
>>
>>
>> I am finding the comments (particularly README) quite hard to follow.  There
>> are many references to an "overflow bucket", or similar phrases.  I think
>> these should be "overflow pages".  A bucket is a conceptual thing consisting
>> of a primary page for that bucket and zero or more overflow pages for the
>> same bucket.  There are no overflow buckets, unless you are referring to the
>> new bucket to which things are being moved.
>>
>
> Hmm.  I think page or block is a concept of database systems and
> buckets is a general concept used in hashing technology.  I think the
> difference is that there are primary buckets and overflow buckets. I
> have checked how they are referred in one of the wiki pages [1],
> search for overflow on that wiki page.

That page seems to use "slot" to refer to the primary bucket/page and all the overflow buckets/pages which cover the same post-masked values.  I don't think that would be an improvement for us, because "slot" is already pretty well-used for other things.  Their use of "bucket" does seem to be mostly the same as "page" (or maybe "buffer" or "block"?), but I don't think we gain anything from creating yet another synonym for page/buffer/block.  I think the easiest thing would be to keep using the meanings that the existing committed code uses, so that we at least have internal consistency.

 
> Now, I think we shouldn't be
> inconsistent in using them. I will change to make it same if I find
> any inconsistency based on what you or other people think is the
> better way to refer overflow space.

I think just "overflow page" or "buffer containing the overflow page".


Here are some more notes I've taken, mostly about the README and comments.  

It took me a while to understand that once a tuple is marked as moved by split, it stays that way forever.  It doesn't mean "recently moved by split", but "ever moved by split".  Which works, but is rather subtle.  Perhaps this deserves a parenthetical comment in the README the first time the flag is mentioned.

========

#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */

This is no longer true, maybe:
/* bit 0x2000 is reserved for index-AM specific usage */
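To make that concrete, here is roughly what I picture a reader of the flag doing (untested sketch; the helper name is mine, and I'm assuming the patch keeps its moved-by-split flag in that reserved 0x2000 bit of t_info and never clears it once set):

#include "postgres.h"
#include "access/itup.h"

/* hypothetical name; assumes the patch takes over the reserved bit */
#define INDEX_MOVED_BY_SPLIT_MASK 0x2000

/*
 * True if this index tuple was *ever* relocated by a bucket split.
 * The flag is sticky: it does not mean "recently moved by split".
 */
static inline bool
itup_ever_moved_by_split(IndexTuple itup)
{
    return (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}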

========

   Note that this is designed to allow concurrent splits and scans.  If a
   split occurs, tuples relocated into the new bucket will be visited twice
   by the scan, but that does no harm.  As we are releasing the locks during
   scan of a bucket, it will allow concurrent scan to start on a bucket and
   ensures that scan will always be behind cleanup.

Above, the abrupt transition from splits (first sentence) to cleanup is confusing.  If the cleanup referred to is vacuuming, it should be a new paragraph or at least have a transition sentence.  Or is it referring to clean-up locks used for control purposes, rather than for actual vacuum clean-up?  I think it is the first one, the vacuum.  (I find the committed version of this comment confusing as well--how in the committed code would a tuple be visited twice, and why does that not do harm in the committed coding? So maybe the issue here is me, not the comment.)


=======

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
+that are moved due to split.  The need for cleanup lock to remove dead tuples
+is to ensure that scans' returns correct results.  Scan that returns multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page, than to force more cleanup locks to be taken?
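Something along these lines is what I have in mind (rough, uncompiled sketch, not code from the patch; it assumes the scan opaque remembers the heap TID of the last tuple it returned): re-find the position by TID, and fall back to rescanning the page if the item is gone.

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "storage/off.h"

/* Illustrative only: re-locate the last-returned item on a bucket page
 * by its heap TID instead of by a remembered offset number that vacuum
 * may have shifted. */
static OffsetNumber
refind_scan_position(Page page, ItemPointer last_heap_tid)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));

        if (ItemPointerEquals(&itup->t_tid, last_heap_tid))
            return off;         /* resume just after this item */
    }
    return InvalidOffsetNumber; /* item was removed; rescan the whole page */
}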

=======
This comment no longer seems accurate (as long as it is just an ERROR and not a PANIC):

                 * XXX we have a problem here if we fail to get space for a
                 * new overflow page: we'll error out leaving the bucket split
                 * only partially complete, meaning the index is corrupt,
                 * since searches may fail to find entries they should find.

The split will still be marked as being in progress, so any scanner will have to scan the old page and see the tuple there.

========
In the _hash_splitbucket comments, this needs updating:

 * The caller must hold exclusive locks on both buckets to ensure that
 * no one else is trying to access them (see README).

The true prereq here is a buffer clean up lock (pin plus exclusive buffer content lock), correct?

And then:

 * Split needs to hold pin on primary bucket pages of both old and new
 * buckets till end of operation.

'retain' is probably better than 'hold', to emphasize that we are dropping the buffer content lock part of the clean-up lock, but that the pin part of it is kept continuously (this also matches the variable name used in the code).  Also, the paragraph after that one seems to be obsolete, and contradicts the newly added comments.
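To spell out the locking I think the comment should describe (plain bufmgr calls, not code lifted from the patch; the relation and block number are placeholders):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Sketch of "take a cleanup lock, then retain only the pin". */
static void
split_locking_sketch(Relation rel, BlockNumber old_bucket_blkno)
{
    Buffer      obuf = ReadBuffer(rel, old_bucket_blkno);

    /* pin + exclusive content lock, waiting until ours is the only pin */
    LockBufferForCleanup(obuf);

    /* ... mark the split as in progress ... */

    /* drop the content lock, but retain the pin for the whole split */
    LockBuffer(obuf, BUFFER_LOCK_UNLOCK);

    /* ... move tuples, taking and releasing content locks as needed ... */

    /* only at the very end do we give up the pin as well */
    ReleaseBuffer(obuf);
}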

===========

    /*
     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
     * there is no pending scan that has seen the flag after it is cleared.
     */

But we are not acquiring a cleanup lock.  We already have a pin, and we do acquire a write buffer-content lock, but we don't observe that our pin is the only one.  I don't see why it is necessary to have a cleanup lock (what harm is done if an under-way scan thinks it is scanning a bucket that is being split when it actually just finished the split?), but if it is necessary then I think this code is wrong.  If not necessary, the comment is wrong.

Also, why must we hold a write lock on both old and new primary bucket pages simultaneously?  Is this in anticipation of the WAL patch?  The contract for the function does say that it returns both pages write locked, but I don't see a reason for that part of the contract at the moment.
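To make the distinction concrete (sketch only; the flag name and bit are placeholders for whatever the patch actually uses): what the code appears to do is the first thing below, whereas the comment's wording implies the second.

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"

#define LH_BUCKET_SPLIT_IN_PROGRESS (1 << 6)    /* placeholder flag name/bit */

static void
clear_split_flag_sketch(Buffer old_bucket_buf)
{
    HashPageOpaque  opaque;

    /* 1. exclusive content lock on the pin we already hold */
    LockBuffer(old_bucket_buf, BUFFER_LOCK_EXCLUSIVE);
    opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(old_bucket_buf));
    opaque->hasho_flag &= ~LH_BUCKET_SPLIT_IN_PROGRESS;
    MarkBufferDirty(old_bucket_buf);
    LockBuffer(old_bucket_buf, BUFFER_LOCK_UNLOCK);

    /* 2. a true cleanup lock would instead be LockBufferForCleanup(),
     * which additionally waits until no other backend holds a pin. */
}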

=========

   To avoid deadlock between readers and inserters, whenever there is a need
   to lock multiple buckets, we always take in the order suggested in Locking 
   Definitions above.  This algorithm allows them a very high degree of
   concurrency.

The section referred to is actually spelled "Lock Definitions", no "ing".

The Lock Definitions section doesn't mention the meta page at all.  I think there needs to be something added to it about how the meta page gets locked and why that is deadlock free.  (But we could be optimistic and assume the patch to implement caching of the metapage will go in and will take care of that.)
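For what it's worth, the property I'd want the README to argue is the usual one: every code path that locks more than one of these pages takes the locks in a single global order.  Purely illustrative of that argument (this is not the patch's actual rule):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Illustrative only: lock two bucket primary pages in a fixed global
 * order (lower block number first), so no two backends can each wait
 * for a lock the other already holds. */
static void
lock_two_buckets_in_order(Buffer a, Buffer b)
{
    if (BufferGetBlockNumber(a) < BufferGetBlockNumber(b))
    {
        LockBuffer(a, BUFFER_LOCK_EXCLUSIVE);
        LockBuffer(b, BUFFER_LOCK_EXCLUSIVE);
    }
    else
    {
        LockBuffer(b, BUFFER_LOCK_EXCLUSIVE);
        LockBuffer(a, BUFFER_LOCK_EXCLUSIVE);
    }
}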

=========

And an operational question on this:  A lot of stuff is done conditionally here.  Under high concurrency, do splits ever actually occur?  It seems like they could easily be permanently starved.
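Concretely, the pattern I'm worried about is something like this (sketch; the function name is invented), where a busy bucket can lose the race every time:

#include "postgres.h"
#include "storage/bufmgr.h"

/* Sketch of a purely opportunistic split: if the cleanup lock is not
 * available right now, give up and let the insert proceed, so under
 * constant scanning the split may never happen. */
static bool
try_to_split_bucket(Buffer old_bucket_buf)
{
    if (!ConditionalLockBufferForCleanup(old_bucket_buf))
        return false;           /* someone else has the page pinned; skip */

    /* ... perform the split ... */

    LockBuffer(old_bucket_buf, BUFFER_LOCK_UNLOCK);
    return true;
}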

Cheers,

Jeff
