Re: B-tree parent pointer and checkpoints

From Heikki Linnakangas
Subject Re: B-tree parent pointer and checkpoints
Date
Msg-id 4E65F428.6030306@enterprisedb.com
In reply to Re: B-tree parent pointer and checkpoints  (Bruce Momjian <bruce@momjian.us>)
Responses Re: B-tree parent pointer and checkpoints
List pgsql-hackers
On 05.09.2011 21:55, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> On 11.03.2011 19:41, Tom Lane wrote:
>>> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>   writes:
>>>> On 11.03.2011 17:59, Tom Lane wrote:
>>>>> But that will be fixed during WAL replay.
>>>
>>>> Not under the circumstances that started the original thread:
>>>
>>>> 1. Backend splits a page
>>>> 2. Checkpoint starts
>>>> 3. Checkpoint runs to completion
>>>> 4. Crash
>>>> (5. Backend never got to insert the parent pointer)
>>>
>>>> WAL replay starts at the checkpoint redo pointer, which is after the
>>>> page split record, so WAL replay won't insert the parent pointer. That's
>>>> an incredibly tight window to hit in practice, but it's possible in theory.
>>>
>>> Hmm.  It's not so improbable that checkpoint would start inside that
>>> window, but that the parent insertion is still pending by the time the
>>> checkpoint finishes is pretty improbable.
>>>
>>> How about just reducing the deletion-time ERROR for missing downlink to a LOG?
>>
>> Well, the code that follows expects to have a valid parent page locked,
>> so you can't literally do just that. But yeah, LOG and aborting the page
>> deletion seems fine to me.
>
> Did this get fixed?

Nope.

On a closer look, this isn't only a problem for page deletion. Page 
splitting also barfs if it can't find the parent of a page. As the code 
stands, a missing downlink is not harmless, but causes all sorts of trouble.

The window for this to happen with a checkpoint is extremely tight, but
there's another situation where you can end up with a missing downlink:
if you run out of disk space while splitting a parent page in order to
insert a downlink into it.

I think we should apply to B-tree a fix similar to the one I did for
GiST, and put a flag on pages with missing downlinks. Then we can fix
the missing downlinks in vacuum and insertion, and get rid of the code
to fix incomplete splits after WAL replay.

The way it would work is that on page split the right page is flagged
with a MISSING_DOWNLINK flag. When the downlink is inserted into the
parent, the flag is cleared in the same critical section in which the
WAL record for the downlink insertion is written. Normally, a backend
would never see the flag set, because the locks on the split pages are
not released until the parent record is written and the flag cleared
again. But if inserting the downlink fails for any reason, the next
inserter or vacuum that steps on the page can finish the split by
inserting the downlink.
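
To make the idea concrete, here is a rough sketch in C. It is a toy
in-memory model, not nbtree code: the Page struct, the function names,
and the fixed-size downlink array are all made up for illustration, and
MISSING_DOWNLINK is only the proposed flag name from above, not an
existing symbol. Locking and WAL logging are left out; the comments
mark where they would go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Proposed flag name from this mail; hypothetical, not an existing symbol. */
#define MISSING_DOWNLINK 0x01u

#define MAX_DOWNLINKS 8         /* toy limit, stands in for "page is full" */

/* Toy page: just enough state to model a split and its downlink. */
typedef struct Page
{
    unsigned     flags;
    struct Page *right;                     /* right sibling */
    struct Page *downlinks[MAX_DOWNLINKS];  /* children, if this is a parent */
    int          ndownlinks;
} Page;

/*
 * Step 1 of a split: link the new right page in and flag it.
 * In a real implementation this would be covered by the split WAL record.
 */
static void
split_page(Page *left, Page *newright)
{
    newright->right = left->right;
    newright->flags |= MISSING_DOWNLINK;
    left->right = newright;
}

/*
 * Step 2: insert the downlink and clear the flag together, modelling
 * "same critical section as the WAL record". Returns false if the parent
 * has no room, which stands in for any failure (e.g. out of disk space).
 */
static bool
insert_downlink(Page *parent, Page *child)
{
    assert(parent != child);
    if (parent->ndownlinks >= MAX_DOWNLINKS)
        return false;
    parent->downlinks[parent->ndownlinks++] = child;
    child->flags &= ~MISSING_DOWNLINK;
    return true;
}

/*
 * What a later inserter or vacuum would do when it steps on a page:
 * if the flag is still set, finish the split. Returns true only if
 * it actually inserted a downlink.
 */
static bool
finish_split_if_needed(Page *parent, Page *page)
{
    if ((page->flags & MISSING_DOWNLINK) == 0)
        return false;
    return insert_downlink(parent, page);
}

/* Helper for checking the toy tree: does parent point at child? */
static bool
has_downlink(const Page *parent, const Page *child)
{
    for (int i = 0; i < parent->ndownlinks; i++)
        if (parent->downlinks[i] == child)
            return true;
    return false;
}
```

If split_page() runs but insert_downlink() never does (crash, error),
the right page keeps the flag, and the first caller of
finish_split_if_needed() that visits it completes the split. In the
sketch the parent is passed in; in a real tree, finding the parent is
part of the work.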

Unfortunately that means holding the locks on the split pages longer 
than we do at the moment. Currently they are released as soon as the 
parent page is locked; with this change they would need to be held until 
the WAL record of the downlink insertion is done. B-tree is so heavily 
used that I'm a bit hesitant to sacrifice any concurrency there, but I 
don't think it would be noticeable in practice.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

