Re: GSoC on WAL-logging hash indexes

From: Heikki Linnakangas
Subject: Re: GSoC on WAL-logging hash indexes
Date:
Msg-id: 531873E8.6050802@vmware.com
In reply to: Re: GSoC on WAL-logging hash indexes  (Greg Stark <stark@mit.edu>)
Responses: Re: GSoC on WAL-logging hash indexes  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
(de-CC'ing pgsql-advocacy)

On 03/06/2014 04:03 AM, Greg Stark wrote:
> On Mon, Mar 3, 2014 at 4:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Unfortunately, I don't believe that it's possible to do this easily
>> today because of the way bucket splits are handled.  I wrote about
>> this previously here, with an idea for solving the problem:
>
> We could just tackle this in the same incomplete, buggy way that
> btrees tackled it for years until Heikki fixed them, and the way gin
> and gist still do, I believe. Namely, just emit xlog records for each
> page individually, and during replay remember when you have an
> "incomplete split", and complain if recovery ends with any still
> incomplete. It would be unfortunate to be adding new cases of this
> just as Heikki and company are making progress eliminating the ones we
> already had, but that's surely better than having no recovery at all.

Grmph. Indeed.

Looking at Robert's idea from November:

> I have thought about doing some work on this, although I don't know
> when I'll ever have the time.  As far as I can see, the basic problem
> is that we need a better way of splitting buckets.  Right now,
> splitting a bucket means locking it, scanning the entire bucket chain
> and moving ~1/2 the tuples to the new bucket, and then unlocking it.
> Since this is a long operation, we have to use heavyweight locks to
> protect the buckets, which is bad for concurrency.  Since it involves
> a modification to an arbitrarily large number of pages that has to be
> atomic, it's not possible to WAL-log it sensibly -- and in fact, I
> think this is really the major obstacle to being able to implement
> WAL-logging for hash indexes.

I don't think it's necessary to improve concurrency just to get 
WAL-logging. Better concurrency is a worthy goal of its own, of course, 
but it's a separate concern.

For crash safety, the key is to perform the whole operation in small 
atomic steps, with a way of knowing where to continue after a crash 
(this is the same whether you do the cleanup immediately at the end of 
recovery, which I want to avoid, or lazily afterwards). But you can 
hold locks across those small atomic steps, to ensure concurrency 
safety.

> I think we could work around this problem as follows.   Suppose we
> have buckets 1..N and the split will create bucket N+1.  We need a
> flag that can be set and cleared on the metapage, call it
> split-in-progress, and a flag that can optionally be set on particular
> index tuples, call it moved-by-split.  We will allow scans of all
> buckets and insertions into all buckets while the split is in
> progress, but (as now) we will not allow more than one split to be in
> progress at the same time.

Hmm, unless I'm reading the code wrong, it *does* allow more than one 
split to be in progress at the same time today.

> [description of how to use the moved-by-split to allow scans to run concurrently with the split]

I guess that would work, although you didn't actually describe how to 
continue a split after a crash. But it's a lot simpler if you don't also 
try to make it more concurrent:

---
When you start splitting a bucket, first acquire a heavyweight lock on 
the old and new buckets. Allocate the required number of pages before 
changing anything on disk, so that you can easily back out if you run 
out of disk space. So far, this is how splitting works today.
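
For concreteness, a rough C sketch of that first step could look like 
the below. (Every type and hyp_* function here is an invented 
stand-in, not the actual hash AM code.)

#include <stdbool.h>

/* Illustrative sketch only: these types and hyp_* helpers are invented
 * stand-ins, not PostgreSQL's actual hash AM API. */
typedef unsigned int Bucket;
typedef struct Relation Relation;

extern void hyp_lock_bucket(Relation *rel, Bucket b);   /* heavyweight lock */
extern void hyp_unlock_bucket(Relation *rel, Bucket b);
extern bool hyp_alloc_bucket_pages(Relation *rel, Bucket b);

void
start_split(Relation *rel, Bucket old_bucket, Bucket new_bucket)
{
    /* Lock both buckets; the locks are held until the split completes. */
    hyp_lock_bucket(rel, old_bucket);
    hyp_lock_bucket(rel, new_bucket);

    /*
     * Allocate every page the new bucket might need before changing
     * anything on disk.  Nothing has been modified yet, so running out
     * of disk space here can be backed out by just releasing the locks.
     */
    if (!hyp_alloc_bucket_pages(rel, new_bucket))
    {
        hyp_unlock_bucket(rel, new_bucket);
        hyp_unlock_bucket(rel, old_bucket);
        return;                 /* caller reports the out-of-space error */
    }
}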

Then you update the metapage to show that the bucket has been split and 
initialize the new bucket's primary page (as one atomic WAL-logged 
operation). Also mark the new bucket's primary page with a 
split-in-progress flag. That's used later by scans to detect if the 
split was interrupted.
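
Sketched with the same kind of invented helpers, the important 
property being that both pages are covered by a single WAL record:

/* Invented names again.  The metapage update and the new primary
 * page's initialization are covered by one WAL record, so replay
 * sees them together or not at all. */
typedef unsigned int Bucket;
typedef struct Relation Relation;

#define HYP_SPLIT_IN_PROGRESS 0x0001    /* flag bit in the page header */

extern void hyp_begin_atomic_op(void);
extern void hyp_update_metapage_for_split(Relation *rel, Bucket new_bucket);
extern void hyp_init_primary_page(Relation *rel, Bucket b, int flags);
extern void hyp_log_atomic_op(void);    /* emits the single WAL record */

void
publish_split(Relation *rel, Bucket new_bucket)
{
    hyp_begin_atomic_op();
    hyp_update_metapage_for_split(rel, new_bucket);
    hyp_init_primary_page(rel, new_bucket, HYP_SPLIT_IN_PROGRESS);
    hyp_log_atomic_op();
}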

Now you scan the old bucket, and move all the tuples belonging to the 
new bucket. That needs to be done as a series of small atomic WAL-logged 
operations, each involving a small number of old and new pages (one WAL 
record for each moved tuple is the simplest, but you can do some 
batching for efficiency). After you're done, clear the split-in-progress 
flag in the new bucket's primary page (WAL-log that), and release the locks.
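
The moving phase could look roughly like this, again with made-up 
names:

#include <stddef.h>

/* Invented names; IndexTuple here is just an opaque stand-in. */
typedef unsigned int Bucket;
typedef struct Relation Relation;
typedef struct IndexTuple IndexTuple;

extern IndexTuple *hyp_next_tuple_for_new_bucket(Relation *rel,
                                                 Bucket old_b, Bucket new_b);
extern void hyp_atomic_move_tuple(Relation *rel, IndexTuple *tup,
                                  Bucket from, Bucket to); /* one WAL record */
extern void hyp_clear_split_flag(Relation *rel, Bucket b); /* WAL-logged */
extern void hyp_unlock_bucket(Relation *rel, Bucket b);

void
complete_split(Relation *rel, Bucket old_bucket, Bucket new_bucket)
{
    IndexTuple *tup;

    /*
     * Each move is one atomic WAL-logged step (several tuples could be
     * batched per record).  A crash between steps leaves a consistent
     * index; the still-set flag shows the split needs finishing.
     */
    while ((tup = hyp_next_tuple_for_new_bucket(rel, old_bucket,
                                                new_bucket)) != NULL)
        hyp_atomic_move_tuple(rel, tup, old_bucket, new_bucket);

    /* All tuples moved: clear the flag (WAL-logged), release the locks. */
    hyp_clear_split_flag(rel, new_bucket);
    hyp_unlock_bucket(rel, new_bucket);
    hyp_unlock_bucket(rel, old_bucket);
}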

In a scan, if the bucket you're about to scan has the split-in-progress 
flag set, that indicates that the split was interrupted by a crash. You 
won't see the flag as set if a concurrent split is in progress, because 
you will block on the lock it's holding on the bucket. If you see the 
flag as set, you share-lock and scan both buckets, the old and the new.
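
The scan-side check, as a sketch with the same disclaimer about 
invented names:

#include <stdbool.h>

typedef unsigned int Bucket;
typedef struct Relation Relation;

extern void   hyp_share_lock_bucket(Relation *rel, Bucket b);
extern bool   hyp_split_flag_is_set(Relation *rel, Bucket b);
extern Bucket hyp_bucket_split_from(Relation *rel, Bucket b);
extern void   hyp_scan_bucket_pages(Relation *rel, Bucket b);

void
scan_bucket(Relation *rel, Bucket bucket)
{
    /*
     * This blocks while a split of this bucket is in progress, so a
     * set flag after we get the lock can only mean a crashed split.
     */
    hyp_share_lock_bucket(rel, bucket);

    if (hyp_split_flag_is_set(rel, bucket))
    {
        /* Unmoved tuples may remain in the bucket this one split from. */
        Bucket old_bucket = hyp_bucket_split_from(rel, bucket);

        hyp_share_lock_bucket(rel, old_bucket);
        hyp_scan_bucket_pages(rel, old_bucket);
    }
    hyp_scan_bucket_pages(rel, bucket);
}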

If you see the split-in-progress flag in the bucket you're about to 
insert to, you don't need to do anything special. Just insert the tuple 
into the new bucket as normal. Before splitting the new bucket again, 
however, the previous split needs to be finished, or things will get 
complicated. To do that, acquire the locks on the old and the new 
bucket, scan the old bucket for any remaining tuples that belong to 
the new bucket and move them, and finally clear the split-in-progress 
flag.
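
Finishing an interrupted split is then just a re-run of the moving 
phase, roughly:

/* Invented helpers; complete_split() is the moving loop sketched above. */
typedef unsigned int Bucket;
typedef struct Relation Relation;

extern void   hyp_lock_bucket(Relation *rel, Bucket b);
extern Bucket hyp_bucket_split_from(Relation *rel, Bucket b);
extern void   complete_split(Relation *rel, Bucket old_b, Bucket new_b);

void
finish_interrupted_split(Relation *rel, Bucket new_bucket)
{
    Bucket old_bucket = hyp_bucket_split_from(rel, new_bucket);

    /* Take the same locks a normal split would hold. */
    hyp_lock_bucket(rel, old_bucket);
    hyp_lock_bucket(rel, new_bucket);

    /*
     * The moving loop only touches tuples that hash to the new bucket,
     * so re-running it after a crash is safe.  It clears the
     * split-in-progress flag and releases the locks when it's done.
     */
    complete_split(rel, old_bucket, new_bucket);
}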
---

This is similar to your description, in that you scan both buckets if 
you see an interrupted split, but it doesn't require the per-tuple 
moved-by-split flag or waiting out scans. Also, I put the 
split-in-progress flag in the new bucket's primary page instead of the 
metapage. That allows multiple splits to be in progress at the same time.

- Heikki


