Re: Hash Indexes
From: Ashutosh Sharma
Subject: Re: Hash Indexes
Date:
Msg-id: CAE9k0PnXBGwPKXxG+hF8dXR-qotFp-yk00thiHxG7h_QsF=oHA@mail.gmail.com
In reply to: Re: Hash Indexes (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
Hi All,

I have executed a few test-cases to validate the v12 patch for concurrent
hash index shared upthread and have found no issues. Below are some of the
test-cases I ran:

1) A pgbench test on a read-write workload with the following configuration
(this was basically to validate the locking strategy, not for performance
testing).

PostgreSQL non-default configuration:
----------------------------------------------------
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout=900
maintenance_work_mem=1GB
checkpoint_completion_target=0.9
max_connections=200
shared_buffers=8GB

pgbench settings:
-------------------------
Scale Factor=300
run time=30 mins
pgbench -c $thread -j $thread -T $time_for_reading -M prepared postgres

2) As the v12 patch mainly has locking changes related to bucket squeezing
in hash indexes, I ran a small test-case that builds a hash index with a
good number of overflow pages and then performs a delete operation to check
whether bucket squeezing happened. The test script "test_squeezeb_hindex.sh"
used for this test is attached to this mail, and the results are shown below:

=====Number of bucket and overflow pages before delete=====
274671
Tuples only is on.
148390
131263 bucket
17126 overflow
1 bitmap

=====Number of bucket and overflow pages after delete=====
274671
Tuples only is on.
141240
131263 bucket
9976 overflow
1 bitmap

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Nov 23, 2016 at 7:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> You are right and I have changed the code as per your suggestion.
>>
>> So...
>>
>> + /*
>> + * We always maintain the pin on bucket page for whole scan operation,
>> + * so releasing the additional pin we have acquired here.
>> + */
>> + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
>> + _hash_dropbuf(rel, *bufp);
>>
>> This relies on the page contents to know whether we took a pin; that
>> seems like a bad plan. We need to know intrinsically whether we took
>> a pin.
>>
>
> Okay, changed to not rely on page contents.
>
>> + * If the bucket split is in progress, then we need to skip tuples that
>> + * are moved from old bucket. To ensure that vacuum doesn't clean any
>> + * tuples from old or new buckets till this scan is in progress, maintain
>> + * a pin on both of the buckets. Here, we have to be cautious about
>>
>> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
>> because they'd have to be dead anyway. It also wouldn't be a problem
>> if it removed tuples from the old bucket that were actually dead. The
>> real issue isn't vacuum anyway, but the process of cleaning up after a
>> split. We need to hold the pin so that tuples being moved from the
>> old bucket to the new bucket by the split don't get removed from the
>> old bucket until our scan is done.
>>
>
> Updated comments to explain this clearly.
>
>> + old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket);
>>
>> Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
>> feel like I'm repeating this ad nauseam, but I really think it's bad
>> to rely on the special space instead of our own local variables!
>>
>
> Okay, changed as per suggestion.
>
>> - /* we ran off the end of the bucket without finding a match */
>> + /*
>> + * We ran off the end of the bucket without finding a match.
>> + * Release the pin on bucket buffers. Normally, such pins are
>> + * released at end of scan, however scrolling cursors can
>> + * reacquire the bucket lock and pin in the same scan multiple
>> + * times.
>> + */
>> *bufP = so->hashso_curbuf = InvalidBuffer;
>> ItemPointerSetInvalid(current);
>> + _hash_dropscanbuf(rel, so);
>>
>> I think this comment is saying that we'll release the pin on the
>> primary bucket page for now, and then reacquire it later if the user
>> reverses the scan direction. But that doesn't sound very safe,
>> because the bucket could be split in the meantime and the order in
>> which tuples are returned could change. I think we want that to
>> remain stable within a single query execution.
>>
>
> As explained in [1], this shouldn't be a problem.
>
>> + _hash_readnext(rel, &buf, &page, &opaque,
>> + (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
>>
>> Same comment: don't rely on the special space to figure this out.
>> Keep track. Also != 0 would be better than ? true : false.
>>
>
> After gluing the scans of the old and new buckets together in the
> _hash_read* APIs, this is no longer required.
>
>> + /*
>> + * setting hashso_skip_moved_tuples to false
>> + * ensures that we don't check for tuples that are
>> + * moved by split in old bucket and it also
>> + * ensures that we won't retry to scan the old
>> + * bucket once the scan for same is finished.
>> + */
>> + so->hashso_skip_moved_tuples = false;
>>
>> I think you've got a big problem here. Suppose the user starts the
>> scan in the new bucket and runs it forward until they end up in the
>> old bucket. Then they turn around and run the scan backward. When
>> they reach the beginning of the old bucket, they're going to stop, not
>> move back to the new bucket, AFAICS. Oops.
>>
>> _hash_first() has a related problem: a backward scan starts at the end
>> of the new bucket and moves backward, but it should start at the end
>> of the old bucket, and then when it reaches the beginning, flip to the
>> new bucket and move backward through that one. Otherwise, a backward
>> scan and a forward scan don't return tuples in opposite order, which
>> they should.
>>
>> I think what you need to do to fix both of these problems is a more
>> thorough job gluing the two buckets together. I'd suggest that the
>> responsibility for switching between the two buckets should probably
>> be given to _hash_readprev() and _hash_readnext(), because every place
>> that needs to advance to the next or previous page cares about this.
>> Right now you are trying to handle it mostly in the functions that
>> call those functions, but that is prone to errors of omission.
>>
>
> Changed as per this idea to change the APIs and fix the problem.
>
>> Also, I think that so->hashso_skip_moved_tuples is badly designed.
>> There are two separate facts you need to know: (1) whether you are
>> scanning a bucket that was still being populated at the start of your
>> scan and (2) if yes, whether you are scanning the bucket being
>> populated or whether you are instead scanning the corresponding "old"
>> bucket. You're trying to keep track of that using one Boolean, but
>> one Boolean only has two states and there are three possible states
>> here.
>>
>
> The updated patch uses two boolean variables to track the bucket state.
>
>> + if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
>> + {
>> +
>> + /* release the lock on bucket buffer, before completing the split. */
>>
>> Extra blank line.
>>
>
> Removed.
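To make the two-boolean bucket-state tracking discussed above concrete, here
is a minimal C sketch of how two flags can encode the three legal states.
The type and field names are illustrative only and may not match what the
patch actually uses:

#include <stdbool.h>

/*
 * Sketch of per-scan split state: two booleans encode the three legal
 * states Robert describes.  Names are hypothetical.
 */
typedef struct HashScanSplitStateSketch
{
    /*
     * True if, at the start of the scan, the bucket being scanned was
     * still being populated by an unfinished split.
     */
    bool        buc_populated;

    /*
     * Meaningful only when buc_populated is true: false while scanning
     * the new (being-populated) bucket, true once the scan has switched
     * to the corresponding old bucket.
     */
    bool        buc_split;
} HashScanSplitStateSketch;

/*
 * Legal states:
 *   buc_populated=false                  -> no split in progress; plain scan
 *   buc_populated=true,  buc_split=false -> scanning the new bucket; skip
 *                                           tuples marked moved-by-split
 *   buc_populated=true,  buc_split=true  -> scanning the old bucket for
 *                                           tuples the split has not moved
 */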
>
>> +moved-by-split flag on a tuple indicates that tuple is moved from old to new
>> +bucket. The concurrent scans can skip such tuples till the split operation is
>> +finished. Once the tuple is marked as moved-by-split, it will remain so
>> +forever but that does no harm. We have intentionally not cleared it as that
>> +can generate an additional I/O which is not necessary.
>>
>> The first sentence needs to start with "the" but the second sentence shouldn't.
>>
>
> Changed.
>
>> It would be good to adjust this part a bit to more clearly explain
>> that the split-in-progress and split-cleanup flags are bucket-level
>> flags, while moved-by-split is a per-tuple flag. It's possible to
>> figure this out from what you've written, but I think it could be more
>> clear. Another thing that is strange is that the code uses THREE
>> flags, bucket-being-split, bucket-being-populated, and
>> needs-split-cleanup, but the README conflates the first two and uses a
>> different name.
>>
>
> Updated the patch to use bucket-being-split and bucket-being-populated to
> explain the split operation in the README. I have also changed the README
> to clearly indicate which are the bucket-level and which the tuple-level
> flags.
>
>> +previously-acquired content lock, but not pin and repeat the process using the
>>
>> s/but not pin/but not the pin,/
>>
>
> Changed.
>
>> A problem is that if a split fails partway through (eg due to insufficient
>> -disk space) the index is left corrupt. The probability of that could be
>> -made quite low if we grab a free page or two before we update the meta
>> -page, but the only real solution is to treat a split as a WAL-loggable,
>> +disk space or crash) the index is left corrupt. The probability of that
>> +could be made quite low if we grab a free page or two before we update the
>> +meta page, but the only real solution is to treat a split as a WAL-loggable,
>> must-complete action. I'm not planning to teach hash about WAL in this
>> -go-round.
>> +go-round. However, we do try to finish the incomplete splits during insert
>> +and split.
>>
>> I think this paragraph needs a much heavier rewrite explaining the new
>> incomplete split handling. It's basically wrong now. Perhaps replace
>> it with something like this:
>>
>> --
>> If a split fails partway through (e.g. due to insufficient disk space
>> or an interrupt), the index will not be corrupted. Instead, we'll
>> retry the split every time a tuple is inserted into the old bucket
>> prior to inserting the new tuple; eventually, we should succeed. The
>> fact that a split is left unfinished doesn't prevent subsequent
>> buckets from being split, but we won't try to split the bucket again
>> until the prior split is finished. In other words, a bucket can be in
>> the middle of being split for some time, but ti can't be in the middle
>> of two splits at the same time.
>>
>> Although we can survive a failure to split a bucket, a crash is likely
>> to corrupt the index, since hash indexes are not yet WAL-logged.
>> --
>>
>
> s/ti/it
> Fixed the typo and used the suggested text in the README.
>
>> + Acquire cleanup lock on target bucket
>> + Scan and remove tuples
>> + For overflow page, first we need to lock the next page and then
>> + release the lock on current bucket or overflow page
>> + Ensure to have buffer content lock in exclusive mode on bucket page
>> + If buffer pincount is one, then compact free space as needed
>> + Release lock
>>
>> I don't think this summary is particularly correct. You would never
>> guess from this that we lock each bucket page in turn and then go back
>> and try to relock the primary bucket page at the end. It's more like:
>>
>> acquire cleanup lock on primary bucket page
>> loop:
>>     scan and remove tuples
>>     if this is the last bucket page, break out of loop
>>     pin and x-lock next page
>>     release prior lock and pin (except keep pin on primary bucket page)
>> if the page we have locked is not the primary bucket page:
>>     release lock and take exclusive lock on primary bucket page
>> if there are no other pins on the primary bucket page:
>>     squeeze the bucket to remove free space
>>
>
> Yeah, that is clearer, so I have used it in the README.
>
>> Come to think of it, I'm a little worried about the locking in
>> _hash_squeezebucket(). It seems like we drop the lock on each "write"
>> bucket page before taking the lock on the next one. So a concurrent
>> scan could get ahead of the cleanup process. That would be bad,
>> wouldn't it?
>>
>
> As discussed in [2], I have changed the code to use lock-chaining during
> the squeeze phase.
>
> Apart from the above, I have fixed a bug in the calculation of lowmask in
> _hash_get_oldblock_from_newbucket().
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAA4eK1J%2B0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M%3D-qpkMtwQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
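For readers following the README discussion, here is a rough C sketch of the
cleanup flow described in the pseudocode above. It uses existing hash AM
helpers (_hash_getbuf, _hash_relbuf, LockBuffer, IsBufferCleanupOK), but the
function itself and the tuple-removal step are illustrative assumptions, not
the patch's actual code:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Sketch of page-at-a-time bucket cleanup with lock-chaining.
 * The caller is assumed to hold a pin and cleanup lock on the
 * primary bucket page (bucket_buf).
 */
static void
hash_bucket_cleanup_sketch(Relation rel, Buffer bucket_buf)
{
    Buffer      buf = bucket_buf;

    for (;;)
    {
        Page            page = BufferGetPage(buf);
        HashPageOpaque  opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        BlockNumber     next_blkno = opaque->hasho_nextblkno;
        Buffer          next_buf;

        /* ... scan this page and remove deletable tuples here ... */

        if (!BlockNumberIsValid(next_blkno))
            break;              /* this was the last page of the bucket */

        /* lock-chain: pin and x-lock the next page before letting go */
        next_buf = _hash_getbuf(rel, next_blkno, HASH_WRITE,
                                LH_OVERFLOW_PAGE);

        /* keep the pin (but not the lock) on the primary bucket page */
        if (buf == bucket_buf)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);

        buf = next_buf;
    }

    /* if we finished on an overflow page, relock the primary bucket page */
    if (buf != bucket_buf)
    {
        _hash_relbuf(rel, buf);
        LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
    }

    /* compact the bucket only if no other backend holds a pin on it */
    if (IsBufferCleanupOK(bucket_buf))
    {
        /* ... squeeze the bucket to remove free space ... */
    }
}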
Attachments