Re: Hash Indexes

From: Amit Kapila
Subject: Re: Hash Indexes
Date:
Msg-id: CAA4eK1+pnr8jbpf=u64tSboVbhA0cJrzzcoTsF+Eh5_R64SQcg@mail.gmail.com
In reply to: Re: Hash Indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Hash Indexes
List: pgsql-hackers
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [ new patches ]
>
> Attached is yet another incremental patch with some suggested changes.
>
> + * This expects that the caller has acquired a cleanup lock on the target
> + * bucket (primary page of a bucket) and it is responsibility of caller to
> + * release that lock.
>
> This is confusing, because it makes it sound like we retain the lock
> through the entire execution of the function, which isn't always true.
> I would say that caller must acquire a cleanup lock on the target
> primary bucket page before calling this function, and that on return
> that page will again be write-locked.  However, the lock might be
> temporarily released in the meantime, while visiting overflow pages.
> (Attached patch has a suggested rewrite.)
>
> + * During scan of overflow pages, first we need to lock the next bucket and
> + * then release the lock on current bucket.  This ensures that any concurrent
> + * scan started after we start cleaning the bucket will always be behind the
> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
> + * required for sanctity of scan.
>
> This comment says that it's bad if other scans can pass our cleanup
> scan, but it doesn't explain why.  I think it's because we don't have
> page-at-a-time mode yet, and cleanup might decrease the TIDs for
> existing index entries.  (Attached patch has a suggested rewrite, but
> might need further adjustment if my understanding of the reasons is
> incomplete.)
>

Okay, I have included your changes with a minor typo fix and updated
the README to use similar language.
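
For readers of the README, the rule in code form is roughly the
following sketch of the cleanup loop's page-to-page step; buf, opaque,
and bstrategy stand for the loop's local state and are illustrative:

    next_blkno = opaque->hasho_nextblkno;
    if (!BlockNumberIsValid(next_blkno))
        break;

    next_buf = _hash_getbuf_with_strategy(rel, next_blkno, HASH_WRITE,
                                          LH_OVERFLOW_PAGE, bstrategy);

    /* release the current page only after the next one is locked */
    _hash_relbuf(rel, buf);
    buf = next_buf;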


> +        if (delay)
> +            vacuum_delay_point();
>
> You don't really need "delay".  If we're not in a cost-accounted
> VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
> which should be safe (and a good idea) regardless.  (Fixed in
> attached.)
>

The new patch contains this fix.
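
For anyone following along, the reason the "delay" flag is unnecessary
is visible from the shape of vacuum_delay_point() itself; paraphrased
from src/backend/commands/vacuum.c, with the sleep details elided:

    void
    vacuum_delay_point(void)
    {
        /* always check for interrupts, even without cost accounting */
        CHECK_FOR_INTERRUPTS();

        /* nap only when inside a cost-accounted VACUUM */
        if (VacuumCostActive && !InterruptPending &&
            VacuumCostBalance >= VacuumCostLimit)
        {
            /* sleep for the configured delay, then reset the balance */
        }
    }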

> +            if (callback && callback(htup, callback_state))
> +            {
> +                /* mark the item for deletion */
> +                deletable[ndeletable++] = offno;
> +                if (tuples_removed)
> +                    *tuples_removed += 1;
> +            }
> +            else if (bucket_has_garbage)
> +            {
> +                /* delete the tuples that are moved by split. */
> +                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
> +                                              maxbucket,
> +                                              highmask,
> +                                              lowmask);
> +                /* mark the item for deletion */
> +                if (bucket != cur_bucket)
> +                {
> +                    /*
> +                     * We expect tuples to either belong to current bucket or
> +                     * new_bucket.  This is ensured because we don't allow
> +                     * further splits from bucket that contains garbage. See
> +                     * comments in _hash_expandtable.
> +                     */
> +                    Assert(bucket == new_bucket);
> +                    deletable[ndeletable++] = offno;
> +                }
> +                else if (num_index_tuples)
> +                    *num_index_tuples += 1;
> +            }
> +            else if (num_index_tuples)
> +                *num_index_tuples += 1;
> +        }
>
> OK, a couple things here.  First, it seems like we could also delete
> any tuples where ItemIdIsDead, and that seems worth doing. In fact, I
> think we should check it prior to invoking the callback, because it's
> probably quite substantially cheaper than the callback.  Second,
> repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
> doesn't seem very clean to me; I think we should introduce a new bool
> to track whether we're keeping the tuple or killing it, and then use
> that to drive which of those things we do.  (Fixed in attached.)
>

As discussed up thread, I have included your changes apart from the
change related to ItemIdIsDead.
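
With your restructuring applied, the loop body now reads roughly as
below (a sketch; the surrounding variables are as in the quoted hunk):

    bool        kill_tuple = false;

    if (callback && callback(htup, callback_state))
    {
        kill_tuple = true;
        if (tuples_removed)
            *tuples_removed += 1;
    }
    else if (bucket_has_garbage)
    {
        bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
                                      maxbucket, highmask, lowmask);
        if (bucket != cur_bucket)
        {
            /* tuple was moved by a split, so it is garbage here */
            Assert(bucket == new_bucket);
            kill_tuple = true;
        }
    }

    if (kill_tuple)
        deletable[ndeletable++] = offno;
    else if (num_index_tuples)
        *num_index_tuples += 1;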

> +        if (H_HAS_GARBAGE(bucket_opaque) &&
> +            !H_INCOMPLETE_SPLIT(bucket_opaque))
>
> This is the only place in the entire patch that uses
> H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
> here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here?  (And
> couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no
> garbage to be removed from the "new" bucket until the next split, when
> it will take on the role of the "old" bucket.
>

Fixed.
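
That is, at this point in the patch series the test reads (sketch):

    if (H_HAS_GARBAGE(bucket_opaque) &&
        !H_OLD_INCOMPLETE_SPLIT(bucket_opaque))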

> I think it would be a good idea to change this so that
> LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
> LH_BUCKET_OLD_PAGE_SPLIT is cleared.  The current way is confusing,
> because those tuples are NOT garbage until the split is completed!
> Moreover, both of the places that care about
> LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
> LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
> LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
> actually simplify the code very slightly.
>

Yeah, I have changed it as per the above suggestion.  However, I think
with this change we only need to check the garbage flag during vacuum.
For now, I am checking both the incomplete-split and garbage flags in
vacuum just to be extra sure, but if you also feel that we can remove
the incomplete-split check, then I will do so.
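
So the garbage flag is now transferred only at split completion,
roughly as follows (a sketch using the pre-rename flag names):

    /* the old bucket's moved tuples become garbage only once the
     * split is done */
    oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
    oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE;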

> +#define H_OLD_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
> +#define H_NEW_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
>
> The code isn't consistent about the use of these macros, or at least
> not in a good way.  When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
> test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
> you ignore the macro and test it directly.  Either get rid of both
> macros and always test directly, or keep both macros and use both of
> them. Using a macro for one but not the other is strange.
>

Used macro for both.

> I wonder if we should rename these flags and macros.  Maybe
> LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
> LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED.  I think that might be
> clearer.  When LH_BEING_POPULATED is set, the bucket is being filled -
> that is, populated - from the old bucket.  And maybe
> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
>

Changed the names as per discussion up thread.
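
With the rename, the flags and their test macros look like this (the
bit values are illustrative, not necessarily what the patch uses):

    #define LH_BEING_POPULATED      (1 << 4)    /* new bucket, being filled */
    #define LH_BEING_SPLIT          (1 << 5)    /* old bucket, split pending */
    #define LH_NEEDS_SPLIT_CLEANUP  (1 << 6)    /* old bucket, moved tuples */

    #define H_BEING_POPULATED(opaque) \
        (((opaque)->hasho_flag & LH_BEING_POPULATED) != 0)
    #define H_BEING_SPLIT(opaque) \
        (((opaque)->hasho_flag & LH_BEING_SPLIT) != 0)
    #define H_NEEDS_SPLIT_CLEANUP(opaque) \
        (((opaque)->hasho_flag & LH_NEEDS_SPLIT_CLEANUP) != 0)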

> +         * Copy bucket mapping info now;  The comment in _hash_expandtable
> +         * where we copy this information and calls _hash_splitbucket explains
> +         * why this is OK.
>
> After a semicolon, the next word should not be capitalized.  There
> shouldn't be two spaces after a semicolon, either.  Also,
> _hash_splitbucket appears to have a verb before it (calls) and a verb
> after it (explains) and I have no idea what that means.
>

Fixed.

> +    for (;;)
> +    {
> +        mask = lowmask + 1;
> +        new_bucket = old_bucket | mask;
> +        if (new_bucket > metap->hashm_maxbucket)
> +        {
> +            lowmask = lowmask >> 1;
> +            continue;
> +        }
> +        blkno = BUCKET_TO_BLKNO(metap, new_bucket);
> +        break;
> +    }
>
> I can't help feeling that it should be possible to do this without
> looping.  Can we ever loop more than once?  How?  Can we just use an
> if-then instead of a for-loop?
>
> Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
> instead of duplicating the logic?
>

Changed as per discussion up thread.
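
For the record, the loop can fall through the "continue" at most once:
the candidate overshoots hashm_maxbucket only when old_bucket's latest
split happened in the previous halving round, so it reduces to an
if-then (sketch of the revised logic):

    new_bucket = old_bucket | (lowmask + 1);
    if (new_bucket > metap->hashm_maxbucket)
    {
        /* old_bucket was last split in the previous round */
        lowmask = lowmask >> 1;
        new_bucket = old_bucket | (lowmask + 1);
    }
    blkno = BUCKET_TO_BLKNO(metap, new_bucket);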

> I still don't like the names of these functions very much.  If you
> said "get X from Y", it would be clear that you put in Y and you get
> out X.  If you say "X 2 Y", it would be clear that you put in X and
> you get out Y.  As it is, it's not very clear which is the input and
> which is the output.
>

Changed as per discussion up thread.

> +               bool primary_buc_page)
>
> I think we could just go with "primary_page" here.  (Fixed in attached.)
>

Included the change in the attached version of the patch.

> +    /*
> +     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
> +     * there is no pending scan that has seen the flag after it is cleared.
> +     */
> +    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
> +    opage = BufferGetPage(bucket_obuf);
> +    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>
> I don't understand the comment, because the code *isn't* acquiring a
> cleanup lock.
>

Removed this comment.

>> Some more review:
>>
>> The API contract of _hash_finish_split seems a bit unfortunate.   The
>> caller is supposed to have obtained a cleanup lock on both the old and
>> new buffers, but the first thing it does is walk the entire new bucket
>> chain, completely ignoring the old one.  That means holding a cleanup
>> lock on the old buffer across an unbounded number of I/O operations --
>> which also means that you can't interrupt the query by pressing ^C,
>> because an LWLock (on the old buffer) is held.
>>
>

Fixed in the attached patch as per the algorithm proposed a few lines
down in this mail.

> I see the problem you are talking about, but it was done to ensure the
> locking order (old bucket first and then new bucket); otherwise there
> could be a deadlock risk.  However, I think we can avoid holding the
> cleanup lock on the old bucket till we scan the new bucket to form a
> hash table of TIDs.
>
>>  Moreover, the
>> requirement to hold a lock on the new buffer isn't convenient for
>> either caller; they both have to go do it, so why not move it into the
>> function?
>>
>
> Yeah, we can move the locking of the new bucket entirely into the new function.
>

Done.

>>  Perhaps the function should be changed so that it
>> guarantees that a pin is held on the primary page of the existing
>> bucket, but no locks are held.
>>
>
> Okay, so we can change the locking order as follows:
> a. ensure a cleanup lock on the old bucket and check if the bucket
> has a pending split.
> b. if there is a pending split, release the lock on the old bucket,
> but not the pin.
>
> The steps below will be performed by _hash_finish_split():
>
> c. acquire the read content lock on the new bucket and form the hash
> table of TIDs; in the process of forming the hash table, we need to
> traverse the whole bucket chain.  While traversing the bucket chain,
> release the lock on the previous bucket (both lock and pin if not the
> primary bucket page).
> d. After the hash table is formed, acquire the cleanup lock on the old
> and new buckets conditionally; if we are not able to get the cleanup
> lock on either, then just return from there.
> e. Perform the split operation.
> f. release the lock and pin on the new bucket.
> g. release the lock on the old bucket.  We don't want to release the
> pin on the old bucket, as the caller has ensured it before passing the
> bucket to _hash_finish_split(), so releasing the pin should be the
> responsibility of the caller.
>
> Now, both callers need to ensure that they restart the operation from
> the beginning, as after we release the lock on the old bucket, somebody
> might have split the bucket.
>
> Does the above change in locking strategy sound okay?
>

I have changed the locking strategy as per my description above and
accordingly changed the prototype of _hash_finish_split.
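
In outline, the revised _hash_finish_split follows steps (a) to (g)
above; a condensed sketch, where form_moved_tid_hash is an illustrative
helper name rather than anything in the patch:

    /* caller holds a pin, but no lock, on the old bucket's primary page */

    /*
     * (c) walk the new bucket chain under read content locks, building
     * a hash table of the TIDs already moved from the old bucket;
     * locks (and pins, except on the primary page) are dropped as we
     * advance along the chain
     */
    tidhtab = form_moved_tid_hash(rel, nbuf);

    /* (d) take cleanup locks conditionally, old bucket first */
    if (!ConditionalLockBufferForCleanup(obuf))
        return;                 /* caller restarts from the beginning */
    if (!ConditionalLockBufferForCleanup(nbuf))
    {
        LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
        return;
    }

    /* (e) complete the split, skipping TIDs already in tidhtab */

    /* (f) release the lock and pin on the new bucket */
    _hash_relbuf(rel, nbuf);

    /* (g) release only the lock on the old bucket; the pin stays with
     * the caller */
    LockBuffer(obuf, BUFFER_LOCK_UNLOCK);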

>> Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
>> should it instead be trying to get the lock conditionally and
>> returning immediately if it doesn't get the lock?  Seems like a good
>> idea...
>>
>
> Yeah, we can take the cleanup lock conditionally, but it would waste
> the effort of forming the hash table if we don't get the lock
> immediately.  Considering incomplete splits to be rare, maybe the
> wasted effort is okay, but I am not sure.  Don't you think we should
> avoid that effort?
>

Changed it to conditional lock.

>>      * We're at the end of the old bucket chain, so we're done partitioning
>>      * the tuples.  Mark the old and new buckets to indicate split is
>>      * finished.
>>      *
>>      * To avoid deadlocks due to locking order of buckets, first lock the old
>>      * bucket and then the new bucket.
>>
>> These comments have drifted too far from the code to which they refer.
>> The first part is basically making the same point as the
>> slightly-later comment /* indicate that split is finished */.
>>
>
> I think we can remove the second comment /* indicate that split is
> finished */.

Removed this comment.

>                 itupsize = new_itup->t_info & INDEX_SIZE_MASK;
>                 new_itup->t_info &= ~INDEX_SIZE_MASK;
>                 new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
>                 new_itup->t_info |= itupsize;
>
> If I'm not mistaken, you could omit the first, second, and fourth
> lines here and keep only the third one, and it would do exactly the
> same thing.  The first line saves the bits in INDEX_SIZE_MASK.  The
> second line clears the bits in INDEX_SIZE_MASK.   The fourth line
> re-sets the bits that were originally saved.
>

You are right and I have changed the code as per your suggestion.
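
To convince myself, a tiny standalone check; the mask values are my
reading of itup.h and the patch (0x1FFF for the size bits, 0x2000 for
the moved-by-split flag), so treat them as assumptions:

    #include <stdio.h>
    #include <stdint.h>

    #define INDEX_SIZE_MASK             0x1FFF  /* assumed, per itup.h */
    #define INDEX_MOVED_BY_SPLIT_MASK   0x2000  /* assumed, per the patch */

    int
    main(void)
    {
        uint16_t    t_info = 0x8A48;    /* arbitrary size and flag bits */

        /* long form: save size bits, clear them, set flag, restore them */
        uint16_t    itupsize = t_info & INDEX_SIZE_MASK;
        uint16_t    longform = t_info;

        longform &= ~INDEX_SIZE_MASK;
        longform |= INDEX_MOVED_BY_SPLIT_MASK;
        longform |= itupsize;

        /* short form: the flag bit lies outside INDEX_SIZE_MASK */
        printf("%s\n",
               longform == (uint16_t) (t_info | INDEX_MOVED_BY_SPLIT_MASK)
               ? "identical" : "differ");
        return 0;
    }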

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
