B-tree parent pointer and checkpoints

From Heikki Linnakangas
Subject B-tree parent pointer and checkpoints
Date
Msg-id 4CCFEE61.2090702@enterprisedb.com
Replies Intelligent RDBMS  (ghatpande@vsnl.net)
Re: B-tree parent pointer and checkpoints  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
We have the rm_safe_restartpoint mechanism to ensure that we don't use a 
checkpoint that splits a multi-level B-tree insertion as a restart 
point. But to my surprise, we don't have anything to protect against the 
analogous case during normal operation. This is possible:

1. Split child page. Write WAL records for the child pages.
2. Begin and finish a checkpoint
3. Crash, before writing the WAL record of inserting the child pointer 
in the parent B-tree page.
4. Recovery begins at the new checkpoint, never sees the incomplete 
split, so it stays incomplete.
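
To make the sequence above concrete, here is a toy model in Python. It is 
only a sketch, not PostgreSQL code: the WAL is a plain list, the record 
names ("split", "parent_insert", "checkpoint") and the function name are 
made up for illustration, and the REDO pointer is simplified to the list 
index of the checkpoint record.

```python
def incomplete_splits_seen_by_recovery(wal, redo_ptr):
    """Replay the toy WAL from redo_ptr and return the child pages whose
    split record was seen without a matching parent insert."""
    pending = set()
    for kind, page in wal[redo_ptr:]:
        if kind == "split":
            pending.add(page)
        elif kind == "parent_insert":
            pending.discard(page)
    return pending


# Case A: crash right after the split, no checkpoint in between.
# Recovery starts at the old checkpoint (index 0) and notices the split.
wal_a = [("checkpoint", None), ("split", "child_7")]
seen_a = incomplete_splits_seen_by_recovery(wal_a, 0)

# Case B: a whole checkpoint fits between the split and the crash.
# Recovery starts at the new checkpoint (index 2) and sees nothing.
wal_b = [("checkpoint", None), ("split", "child_7"), ("checkpoint", None)]
seen_b = incomplete_splits_seen_by_recovery(wal_b, 2)

print(seen_a)  # the incomplete split is detected
print(seen_b)  # empty: the split stays incomplete and unnoticed
```

In case B the parent pointer for child_7 was never written, but nothing 
after the new checkpoint tells recovery about it.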

In practice that's pretty hard to hit, because a checkpoint takes some 
time, while locking the parent page and writing the child pointer is 
usually very quick. But it's possible.

It surprises me that we thought of this when we introduced 
restartpoints, but this more obvious case during normal operation seems 
to have been there forever. Nothing very bad happens if you lose the 
parent update, but this would be nice to fix nevertheless.

I bumped into this while thinking about archive recovery - the above can 
happen during archive recovery too, if the checkpoint is caused by 
pg_start_backup().

I think we can fix this by requiring that any multi-WAL-record actions 
that are in-progress when a checkpoint starts (at the REDO-pointer) must 
finish before the checkpoint record is written. That will close the 
issue with restartpoints, archive recovery etc. as well, so we no longer 
need to worry about this anywhere else than while performing an online 
checkpoint.
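
A rough sketch of that rule, again as a toy Python model rather than 
actual backend code. The class ToyWal and all of its method names are 
hypothetical. The checkpoint remembers which multi-record actions were in 
progress at the REDO pointer and refuses to write the checkpoint record 
until those have finished.

```python
class ToyWal:
    """Toy WAL that tracks in-progress multi-record actions."""

    def __init__(self):
        self.records = []
        self.in_progress = set()

    def begin_action(self, action_id):
        # First WAL record of a multi-record action (e.g. the page split).
        self.in_progress.add(action_id)
        self.records.append(("split", action_id))

    def finish_action(self, action_id):
        # Last WAL record of the action (e.g. the parent pointer insert).
        self.records.append(("parent_insert", action_id))
        self.in_progress.discard(action_id)

    def start_checkpoint(self):
        # Returns the REDO pointer and the actions that must complete
        # before the checkpoint record may be written.
        return len(self.records), set(self.in_progress)

    def try_finish_checkpoint(self, redo_ptr, must_finish):
        # Write the checkpoint record only if none of the actions that
        # were running at the REDO pointer is still incomplete.
        if must_finish & self.in_progress:
            return False
        self.records.append(("checkpoint", redo_ptr))
        return True


wal = ToyWal()
wal.begin_action("child_7")                    # split started before checkpoint
redo_ptr, must_finish = wal.start_checkpoint()
wal.begin_action("child_9")                    # split started after REDO pointer
blocked = not wal.try_finish_checkpoint(redo_ptr, must_finish)
wal.finish_action("child_7")
done = wal.try_finish_checkpoint(redo_ptr, must_finish)
```

Note that child_9 does not hold up the checkpoint: its split record lies 
after the REDO pointer, so recovery starting from this checkpoint would 
replay it and see the incomplete split.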

I'm thinking of using the isCommit flag for this, to delay writing the 
checkpoint record until all incomplete splits are finished. isCommit 
already protects against a similar race condition, between writing the 
commit record and flushing the clog page. We will obviously need to 
rename it, and double-check that it's safe: b-tree splits take longer, 
and there's no critical section there like there is in the commit 
codepath.

Comments?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

