Re: B-tree parent pointer and checkpoints

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: B-tree parent pointer and checkpoints
Дата
Msg-id 4CDC3205.6030001@enterprisedb.com
обсуждение исходный текст
Ответ на Re: B-tree parent pointer and checkpoints  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: B-tree parent pointer and checkpoints  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 11.11.2010 17:16, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> GiST is different. When you insert a key to a leaf page, you (sometimes)
>> need to adjust the parent pointer to reflect the new key as well. B-tree
>> tolerates incomplete splits with the 'next page' pointer, but that is
>> not applicable to gist. Teodor described the issue back in 2005 when
>> WAL-logging was added to GiST
>> (http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
>> Reading that I wonder: what harm would an incomplete insert cause if we
>> just left it in the tree? Imagine that you insert a key to a leaf page,
>> but crash before updating the parent. If you search for the key starting
>> from the root, you'll fail to find it, because the parent pointer claims
>> that there are no entries with such a key on the child page. But that's
>> OK, the inserting transaction aborted with the crash!
>
> I think it'd be okay as far as that one entry is concerned, since as you
> say it doesn't matter whether a search finds it.  (We'd have to be sure
> that VACUUM would still find it to remove it, of course, but that
> doesn't use a normal search.)  You're right that it poses a hazard of
> subsequent inserts deciding that they don't need to do work on upper
> levels because the lower ones look OK already.  But depending on the
> details of the search algorithm, this might be a non-problem: if you
> remember that the upper level entries didn't cover your key when you
> descended, you'd still know you need to recompute them.

Hmm, we don't currently keep track of that when we descend the tree to 
choose the target page, but perhaps an extra Consistent call to check 
that would be acceptable. We already call Penalty for every tuple on 
each internal node on the way, so compared to that one more call should 
not be too expensive. If we do that, I think it would simplify the 
algorithm quite a bit to just update all the parents on the way down, 
instead of traversing up from the bottom after inserting the tuple to 
the leaf.

> Something else I just noticed is that WAL replay isn't capable of
> completely fixing the index anyway:
>
>   * To complete insert we can't use basic insertion algorithm because
>   * during insertion we can't call user-defined support functions of opclass.
>   * So, we insert 'invalid' tuples without real key and do it by separate algorithm.
>   * 'invalid' tuple should be updated by vacuum full.
>
> Given that there's no more vacuum full, and nobody has been expected to
> run it routinely for a long time anyway, this fixup approach seems
> pretty completely broken anyhow.

The 'invalid' tuples don't affect correctness, but are a drag on 
performance, so they are similar to incomplete b-tree splits. I suspect 
the overhead of an invalid gist pointer is much bigger than the overhead 
of an incomplete b-tree split, though. I agree we should get rid of 
that, it's not comforting to get a stream of messages in the logs saying 
you should run VACUUM FULL.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: wCTE behaviour
Следующее
От: Darren Duncan
Дата:
Сообщение: Re: MULTISET and additional functions for ARRAY