Re: Hash Indexes

Поиск

Список

Период

Сортировка

От	Robert Haas
Тема	Re: Hash Indexes
Дата	21 июня 2016 г. 18:57:05
Msg-id	CA+TgmoaLO1Qs1W7s40vcdss8VaFikF8PMeoZ2it+JTyDPG_BbQ@mail.gmail.com обсуждение исходный текст
Ответ на	Hash Indexes (Amit Kapila <amit.kapila16@gmail.com>)
Ответы	Re: Hash Indexes
Список	pgsql-hackers

Дерево обсуждения

On Tue, May 10, 2016 at 8:09 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> For making hash indexes usable in production systems, we need to improve its concurrency and make them crash-safe by
WALlogging them.  The first problem I would like to tackle is improve the concurrency of hash indexes.  First
advantage,I see with improving concurrency of hash indexes is that it has the potential of out performing btree for
"equalto" searches (with my WIP patch attached with this mail, I could see hash index outperform btree index by 20 to
30%for very simple cases which are mentioned later in this e-mail).   Another advantage as explained by Robert [1]
earlieris that if we remove heavy weight locks under which we perform arbitrarily large number of operations, it can
helpus to sensibly WAL log it.  With this patch, I would also like to make hash indexes capable of completing the
incomplete_splitswhich can occur due to interrupts (like cancel) or errors or crash. 
>
> I have studied the concurrency problems of hash index and some of the solutions proposed for same previously and
basedon that came up with below solution which is based on idea by Robert [1], community discussion on thread [2] and
someof my own thoughts. 
>
> Maintain a flag that can be set and cleared on the primary bucket page, call it split-in-progress, and a flag that
canoptionally be set on particular index tuples, call it moved-by-split. We will allow scans of all buckets and
insertionsinto all buckets while the split is in progress, but (as now) we will not allow more than one split for a
bucketto be in progress at the same time.  We start the split by updating metapage to incrementing the number of
bucketsand set the split-in-progress flag in primary bucket pages for old and new buckets (lets number them as old
bucket- N+1/2; new bucket - N + 1 for the matter of discussion). While the split-in-progress flag is set, any scans of
N+1will first scan that bucket, ignoring any tuples flagged moved-by-split, and then ALSO scan bucket N+1/2. To ensure
thatvacuum doesn't clean any tuples from old or new buckets till this scan is in progress, maintain a pin on both of
thebuckets (first pin on old bucket needs to be acquired). The moved-by-split flag never has any effect except when
scanningthe new bucket that existed at the start of that particular scan, and then only if the split-in-progress flag
wasalso set at that time. 

You really need parentheses in (N+1)/2.  Because you are not trying to
add 1/2 to N.  https://en.wikipedia.org/wiki/Order_of_operations

> Once the split operation has set the split-in-progress flag, it will begin scanning bucket (N+1)/2.  Every time it
findsa tuple that properly belongs in bucket N+1, it will insert the tuple into bucket N+1 with the moved-by-split flag
set. Tuples inserted by anything other than a split operation will leave this flag clear, and tuples inserted while the
splitis in progress will target the same bucket that they would hit if the split were already complete.  Thus, bucket
N+1will end up with a mix of moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples coming from
parallelinsertion activity.  When the scan of bucket (N+1)/2 is complete, we know that bucket N+1 now contains all the
tuplesthat are supposed to be there, so we clear the split-in-progress flag on both buckets.  Future scans of both
bucketscan proceed normally.  Split operation needs to take a cleanup lock on primary bucket to ensure that it doesn't
startif there is any Insertion happening in the bucket.  It will leave the lock on primary bucket, but not pin as it
proceedsfor next overflow page.  Retaining pin on primary bucket will ensure that vacuum doesn't start on this bucket
tillthe split is finished. 

In the second-to-last sentence, I believe you have reversed the words
"lock" and "pin".

> Insertion will happen by scanning the appropriate bucket and needs to retain pin on primary bucket to ensure that
concurrentsplit doesn't happen, otherwise split might leave this tuple unaccounted. 

What do you mean by "unaccounted"?

> Now for deletion of tuples from (N+1/2) bucket, we need to wait for the completion of any scans that began before we
finishedpopulating bucket N+1, because otherwise we might remove tuples that they're still expecting to find in bucket
(N+1)/2.The scan will always maintain a pin on primary bucket and Vacuum can take a buffer cleanup lock (cleanup lock
includesExclusive lock on bucket and wait till all the pins on buffer becomes zero) on primary bucket for the buffer.
Ithink we can relax the requirement for vacuum to take cleanup lock (instead take Exclusive Lock on buckets where no
splithas happened) with the additional flag has_garbage which will be set on primary bucket, if any tuples have been
movedfrom that bucket, however I think for squeeze phase (in this phase, we try to move the tuples from later overflow
pagesto earlier overflow pages in the bucket and then if there are any empty overflow pages, then we move them to kind
ofa free pool) of vacuum, we need a cleanup lock, otherwise scan results might get effected. 

affected, not effected.

I think this is basically correct, although I don't find it to be as
clear as I think it could be.  It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans.  However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.  Also, I think that when cleaning up after a
split, an exclusive lock is likewise sufficient to remove tuples from
a particular page provided that we know that every scan currently in
progress started after split-in-progress was set.  If each scan holds
a pin on the primary bucket and setting the split-in-progress flag
requires a cleanup lock on that page, then this is always true.

(Plain text email is preferred to HTML on this mailing list.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

В списке pgsql-hackers по дате отправления:

Предыдущее

От: alain radix
Дата: 21 июня 2016 г., 18:54:29
Сообщение: Requesting external_pid_file with postgres -C when not initialized lead to coredump

Следующее

От: Robert Haas
Дата: 21 июня 2016 г., 19:03:33
Сообщение: Re: Hash Indexes

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Hash Indexes

Предыдущее

Следующее