WAL dirty-buffer management bug

Поиск
Список
Период
Сортировка
От Tom Lane
Тема WAL dirty-buffer management bug
Дата
Msg-id 16678.1143744716@sss.pgh.pa.us
обсуждение исходный текст
Ответы Re: WAL dirty-buffer management bug  (Simon Riggs <simon@2ndquadrant.com>)
Re: WAL dirty-buffer management bug  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
I added some notes to src/backend/access/transam/README explaining the
protocol for executing a WAL-logged action:

---

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION()  (Any error during the next two steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk.  Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Build a WAL log record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location.  For instance,
    recptr = XLogInsert(rmgr_id, info, rdata);
    PageSetLSN(dp, recptr);    PageSetTLI(dp, ThisTimeLineID);

5. END_CRIT_SECTION()

6. Unlock and write the buffer(s):
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);    WriteBuffer(buffer);

(Note: WriteBuffer doesn't really "write" the buffer anymore, it just marks it
dirty and unpins it.  The write will not happen until a checkpoint occurs or
the shared buffer is needed for another page.)

---

This is pretty much what heapam and btree currently do, but on looking
at it I think it's got a problem: we really ought to mark the buffer
dirty before releasing the critical section.  Otherwise, if there's an
elog(ERROR) before the WriteBuffer call is reached, the backend would go
on about its business, and we'd have changes in a disk buffer that isn't
marked dirty.  The changes would be uncommitted, presumably, because of
the error --- but nonetheless this could result in inconsistency down
the road.  One example scenario is:1. We insert a tuple at, say, index 3 on a page.2. elog after making the XLOG entry,
butbefore WriteBuffer.3. page is later discarded from shared buffers; since it's not   marked dirty, it'll just be
droppedwithout writing it.4. Later we need to insert another tuple in same table, and   we again choose index 3 on this
pageas the place to put it.5. system crash leads to replay from WAL.
 
Now we'll have two different WAL records trying to insert tuple 3.
Not good.

While there's not much chance of an elog in just LockBuffer/WriteBuffer,
some of the code paths in btree do quite a bit of stuff before they
write the buffer (this was probably done to avoid multiple entries to
bufmgr, back when the BufMgrLock was a severe source of contention).
So I think there's a nontrivial risk of index corruption from this
coding practice.

I'm thinking we should change the code and the README to specify that
you must mark the buffer dirty before you can END_CRIT_SECTION().
Comments?
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: Index vacuum improvements
Следующее
От: "Jonah H. Harris"
Дата:
Сообщение: Re: pg_class catalog question...