Re: Hash Indexes
From: Ashutosh Sharma
Subject: Re: Hash Indexes
Date:
Msg-id: CAE9k0PnXBGwPKXxG+hF8dXR-qotFp-yk00thiHxG7h_QsF=oHA@mail.gmail.com
In reply to: Re: Hash Indexes (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
Hi All,

I have executed a few test-cases to validate the v12 patch for concurrent
hash index shared upthread and have found no issues. Below are some of the
test-cases I ran:

1) A pgbench test on a read-write workload with the following configuration
(this was basically to validate the locking strategy, not for performance
testing).

PostgreSQL non-default configuration:
----------------------------------------------------
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout=900
maintenance_work_mem=1GB
checkpoint_completion_target=0.9
max_connections=200
shared_buffers=8GB

pgbench settings:
-------------------------
Scale Factor=300
run time=30 mins
pgbench -c $thread -j $thread -T $time_for_reading -M prepared postgres

2) As the v12 patch mainly has locking changes related to bucket squeezing
in hash indexes, I ran a small test-case that builds a hash index with a
good number of overflow pages and then performs a delete operation to check
whether bucket squeezing happened. The test script "test_squeezeb_hindex.sh"
used for this test is attached to this mail, and the results are shown below:

=====Number of bucket and overflow pages before delete=====
274671
Tuples only is on.
148390
131263 bucket
17126 overflow
1 bitmap

=====Number of bucket and overflow pages after delete=====
274671
Tuples only is on.
141240
131263 bucket
9976 overflow
1 bitmap

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Nov 23, 2016 at 7:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> You are right and I have changed the code as per your suggestion.
>>
>> So...
>>
>> + /*
>> + * We always maintain the pin on bucket page for whole scan operation,
>> + * so releasing the additional pin we have acquired here.
>> + */
>> + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
>> + _hash_dropbuf(rel, *bufp);
>>
>> This relies on the page contents to know whether we took a pin; that
>> seems like a bad plan. We need to know intrinsically whether we took
>> a pin.
>>
>
> Okay, changed to not rely on page contents.
>
>> + * If the bucket split is in progress, then we need to skip tuples that
>> + * are moved from old bucket. To ensure that vacuum doesn't clean any
>> + * tuples from old or new buckets till this scan is in progress, maintain
>> + * a pin on both of the buckets. Here, we have to be cautious about
>>
>> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
>> because they'd have to be dead anyway. It also wouldn't be a problem
>> if it removed tuples from the old bucket that were actually dead. The
>> real issue isn't vacuum anyway, but the process of cleaning up after a
>> split. We need to hold the pin so that tuples being moved from the
>> old bucket to the new bucket by the split don't get removed from the
>> old bucket until our scan is done.
>>
>
> Updated comments to explain this clearly.
>
>> + old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket);
>>
>> Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
>> feel like I'm repeating this ad nauseam, but I really think it's bad
>> to rely on the special space instead of our own local variables!
>>
>
> Okay, changed as per suggestion.
>
>> - /* we ran off the end of the bucket without finding a match */
>> + /*
>> + * We ran off the end of the bucket without finding a match.
>> + * Release the pin on bucket buffers. Normally, such pins are
>> + * released at end of scan, however scrolling cursors can
>> + * reacquire the bucket lock and pin in the same scan multiple
>> + * times.
>> + */
>> *bufP = so->hashso_curbuf = InvalidBuffer;
>> ItemPointerSetInvalid(current);
>> + _hash_dropscanbuf(rel, so);
>>
>> I think this comment is saying that we'll release the pin on the
>> primary bucket page for now, and then reacquire it later if the user
>> reverses the scan direction. But that doesn't sound very safe,
>> because the bucket could be split in the meantime and the order in
>> which tuples are returned could change. I think we want that to
>> remain stable within a single query execution.
>>
>
> As explained in [1], this shouldn't be a problem.
>
>> + _hash_readnext(rel, &buf, &page, &opaque,
>> + (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
>>
>> Same comment: don't rely on the special space to figure this out.
>> Keep track. Also != 0 would be better than ? true : false.
>>
>
> After gluing the scans of the old and new buckets together in the
> _hash_read* APIs, this is no longer required.
>
>> + /*
>> + * setting hashso_skip_moved_tuples to false
>> + * ensures that we don't check for tuples that are
>> + * moved by split in old bucket and it also
>> + * ensures that we won't retry to scan the old
>> + * bucket once the scan for same is finished.
>> + */
>> + so->hashso_skip_moved_tuples = false;
>>
>> I think you've got a big problem here. Suppose the user starts the
>> scan in the new bucket and runs it forward until they end up in the
>> old bucket. Then they turn around and run the scan backward. When
>> they reach the beginning of the old bucket, they're going to stop, not
>> move back to the new bucket, AFAICS. Oops.
>>
>> _hash_first() has a related problem: a backward scan starts at the end
>> of the new bucket and moves backward, but it should start at the end
>> of the old bucket, and then when it reaches the beginning, flip to the
>> new bucket and move backward through that one. Otherwise, a backward
>> scan and a forward scan don't return tuples in opposite order, which
>> they should.
>>
>> I think what you need to do to fix both of these problems is a more
>> thorough job gluing the two buckets together. I'd suggest that the
>> responsibility for switching between the two buckets should probably
>> be given to _hash_readprev() and _hash_readnext(), because every place
>> that needs to advance to the next or previous page cares about this.
>> Right now you are trying to handle it mostly in the functions that
>> call those functions, but that is prone to errors of omission.
>>
>
> Changed as per this idea to change the APIs and fix the problem.
>
>> Also, I think that so->hashso_skip_moved_tuples is badly designed.
>> There are two separate facts you need to know: (1) whether you are
>> scanning a bucket that was still being populated at the start of your
>> scan and (2) if yes, whether you are scanning the bucket being
>> populated or whether you are instead scanning the corresponding "old"
>> bucket. You're trying to keep track of that using one Boolean, but
>> one Boolean only has two states and there are three possible states
>> here.
>>
>
> The updated patch uses two boolean variables to track the bucket state.
>
>> + if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
>> + {
>> +
>> + /* release the lock on bucket buffer, before completing the split. */
>>
>> Extra blank line.
>>
>
> Removed.
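To make the two-boolean bucket-state tracking discussed above concrete, here
is a minimal C sketch of how two flags can encode the three legal states.
The type and field names are illustrative only and may not match what the
patch actually uses:

#include <stdbool.h>

/*
 * Sketch of per-scan split state: two booleans encode the three legal
 * states Robert describes.  Names are hypothetical.
 */
typedef struct HashScanSplitStateSketch
{
    /*
     * True if, at the start of the scan, the bucket being scanned was
     * still being populated by an unfinished split.
     */
    bool        buc_populated;

    /*
     * Meaningful only when buc_populated is true: false while scanning
     * the new (being-populated) bucket, true once the scan has switched
     * to the corresponding old bucket.
     */
    bool        buc_split;
} HashScanSplitStateSketch;

/*
 * Legal states:
 *   buc_populated=false                  -> no split in progress; plain scan
 *   buc_populated=true,  buc_split=false -> scanning the new bucket; skip
 *                                           tuples marked moved-by-split
 *   buc_populated=true,  buc_split=true  -> scanning the old bucket for
 *                                           tuples the split has not moved
 */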
>
>> +moved-by-split flag on a tuple indicates that tuple is moved from old to new
>> +bucket. The concurrent scans can skip such tuples till the split operation is
>> +finished. Once the tuple is marked as moved-by-split, it will remain so
>> +forever but that does no harm. We have intentionally not cleared it as that
>> +can generate an additional I/O which is not necessary.
>>
>> The first sentence needs to start with "the" but the second sentence shouldn't.
>>
>
> Changed.
>
>> It would be good to adjust this part a bit to more clearly explain
>> that the split-in-progress and split-cleanup flags are bucket-level
>> flags, while moved-by-split is a per-tuple flag. It's possible to
>> figure this out from what you've written, but I think it could be more
>> clear. Another thing that is strange is that the code uses THREE
>> flags, bucket-being-split, bucket-being-populated, and
>> needs-split-cleanup, but the README conflates the first two and uses a
>> different name.
>>
>
> Updated the patch to use bucket-being-split and bucket-being-populated to
> explain the split operation in the README. I have also changed the README
> to clearly indicate which are the bucket-level and which the tuple-level
> flags.
>
>> +previously-acquired content lock, but not pin and repeat the process using the
>>
>> s/but not pin/but not the pin,/
>>
>
> Changed.
>
>> A problem is that if a split fails partway through (eg due to insufficient
>> -disk space) the index is left corrupt. The probability of that could be
>> -made quite low if we grab a free page or two before we update the meta
>> -page, but the only real solution is to treat a split as a WAL-loggable,
>> +disk space or crash) the index is left corrupt. The probability of that
>> +could be made quite low if we grab a free page or two before we update the
>> +meta page, but the only real solution is to treat a split as a WAL-loggable,
>> must-complete action. I'm not planning to teach hash about WAL in this
>> -go-round.
>> +go-round. However, we do try to finish the incomplete splits during insert
>> +and split.
>>
>> I think this paragraph needs a much heavier rewrite explaining the new
>> incomplete split handling. It's basically wrong now. Perhaps replace
>> it with something like this:
>>
>> --
>> If a split fails partway through (e.g. due to insufficient disk space
>> or an interrupt), the index will not be corrupted. Instead, we'll
>> retry the split every time a tuple is inserted into the old bucket
>> prior to inserting the new tuple; eventually, we should succeed. The
>> fact that a split is left unfinished doesn't prevent subsequent
>> buckets from being split, but we won't try to split the bucket again
>> until the prior split is finished. In other words, a bucket can be in
>> the middle of being split for some time, but ti can't be in the middle
>> of two splits at the same time.
>>
>> Although we can survive a failure to split a bucket, a crash is likely
>> to corrupt the index, since hash indexes are not yet WAL-logged.
>> --
>>
>
> s/ti/it
> Fixed the typo and used the suggested text in the README.
>
>> + Acquire cleanup lock on target bucket
>> + Scan and remove tuples
>> + For overflow page, first we need to lock the next page and then
>> + release the lock on current bucket or overflow page
>> + Ensure to have buffer content lock in exclusive mode on bucket page
>> + If buffer pincount is one, then compact free space as needed
>> + Release lock
>>
>> I don't think this summary is particularly correct. You would never
>> guess from this that we lock each bucket page in turn and then go back
>> and try to relock the primary bucket page at the end. It's more like:
>>
>> acquire cleanup lock on primary bucket page
>> loop:
>>     scan and remove tuples
>>     if this is the last bucket page, break out of loop
>>     pin and x-lock next page
>>     release prior lock and pin (except keep pin on primary bucket page)
>> if the page we have locked is not the primary bucket page:
>>     release lock and take exclusive lock on primary bucket page
>> if there are no other pins on the primary bucket page:
>>     squeeze the bucket to remove free space
>>
>
> Yeah, that is clearer, so I have used it in the README.
>
>> Come to think of it, I'm a little worried about the locking in
>> _hash_squeezebucket(). It seems like we drop the lock on each "write"
>> bucket page before taking the lock on the next one. So a concurrent
>> scan could get ahead of the cleanup process. That would be bad,
>> wouldn't it?
>>
>
> As discussed in [2], I have changed the code to use lock-chaining during
> the squeeze phase.
>
> Apart from the above, I have fixed a bug in the calculation of lowmask in
> _hash_get_oldblock_from_newbucket().
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAA4eK1J%2B0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M%3D-qpkMtwQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
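For readers following the README discussion, here is a rough C sketch of the
cleanup flow described in the pseudocode above. It uses existing hash AM
helpers (_hash_getbuf, _hash_relbuf, LockBuffer, IsBufferCleanupOK), but the
function itself and the tuple-removal step are illustrative assumptions, not
the patch's actual code:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Sketch of page-at-a-time bucket cleanup with lock-chaining.
 * The caller is assumed to hold a pin and cleanup lock on the
 * primary bucket page (bucket_buf).
 */
static void
hash_bucket_cleanup_sketch(Relation rel, Buffer bucket_buf)
{
    Buffer      buf = bucket_buf;

    for (;;)
    {
        Page            page = BufferGetPage(buf);
        HashPageOpaque  opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        BlockNumber     next_blkno = opaque->hasho_nextblkno;
        Buffer          next_buf;

        /* ... scan this page and remove deletable tuples here ... */

        if (!BlockNumberIsValid(next_blkno))
            break;              /* this was the last page of the bucket */

        /* lock-chain: pin and x-lock the next page before letting go */
        next_buf = _hash_getbuf(rel, next_blkno, HASH_WRITE,
                                LH_OVERFLOW_PAGE);

        /* keep the pin (but not the lock) on the primary bucket page */
        if (buf == bucket_buf)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);

        buf = next_buf;
    }

    /* if we finished on an overflow page, relock the primary bucket page */
    if (buf != bucket_buf)
    {
        _hash_relbuf(rel, buf);
        LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
    }

    /* compact the bucket only if no other backend holds a pin on it */
    if (IsBufferCleanupOK(bucket_buf))
    {
        /* ... squeeze the bucket to remove free space ... */
    }
}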
Attachments