XLogSaveBufferForHint() correctness and more

From: Andres Freund
Subject: XLogSaveBufferForHint() correctness and more
Msg-id: 20230714154209.upywymkplodefvuo@awork3.anarazel.de
List: pgsql-hackers

Hi,

While looking at [1] I started to wonder why it is safe that
CreateCheckPoint() updates XLogCtl->RedoRecPtr after releasing the WAL
insertion lock:

    /*
     * Now we can release the WAL insertion locks, allowing other xacts to
     * proceed while we are flushing disk buffers.
     */
    WALInsertLockRelease();

    /* Update the info_lck-protected copy of RedoRecPtr as well */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->RedoRecPtr = checkPoint.redo;
    SpinLockRelease(&XLogCtl->info_lck);

The most important user of that is GetRedoRecPtr().

Right now I'm a bit confused about why it's OK for that update of
XLogCtl->RedoRecPtr to happen after WALInsertLockRelease().


But then I started to wonder: even if that weren't the case, how come
XLogSaveBufferForHint() and other uses of GetRedoRecPtr() aren't racy as
hell?

The reason XLogInsertRecord() can safely check whether an FPW is needed is that
it holds a WAL insertion lock: the redo pointer cannot change until the
insertion lock is released.
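
The interlock described above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not PostgreSQL code: ToyWal, toy_try_advance_redo() and toy_insert_record() are hypothetical names, the insertion lock is a plain flag, and LSNs are bare integers. The "retry" result is modeled on how the insertion path is understood to recompute the record when the locked recheck finds that an FPW is needed after all; treat that detail as an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/* Hypothetical model of the WAL state relevant to the FPW decision. */
typedef struct ToyWal
{
    ToyLsn redo;             /* authoritative redo pointer */
    bool   insert_lock_held; /* stands in for a WAL insertion lock */
} ToyWal;

typedef enum ToyResult { TOY_INSERTED, TOY_RETRY } ToyResult;

/*
 * Model assumption: a checkpoint may only move the redo pointer while no
 * inserter holds the insertion lock. Returns false if it had to back off.
 */
static bool
toy_try_advance_redo(ToyWal *wal, ToyLsn new_redo)
{
    if (wal->insert_lock_held)
        return false;
    wal->redo = new_redo;
    return true;
}

/*
 * The caller built its record assuming 'assumed_redo'. Under the lock the
 * redo pointer is stable, so the FPW decision can be rechecked safely.
 */
static ToyResult
toy_insert_record(ToyWal *wal, ToyLsn page_lsn, ToyLsn assumed_redo)
{
    ToyResult result = TOY_INSERTED;

    assert(!wal->insert_lock_held);
    wal->insert_lock_held = true;   /* redo cannot move from here on */

    if (wal->redo != assumed_redo)
    {
        bool had_fpw = (page_lsn <= assumed_redo);
        bool needs_fpw = (page_lsn <= wal->redo);

        /* The record lacks an FPW that is now required: start over. */
        if (needs_fpw && !had_fpw)
            result = TOY_RETRY;
    }

    wal->insert_lock_held = false;
    return result;
}
```

In this model a caller that assumed a redo pointer of 100 for a page with LSN 150 is told to retry once the authoritative redo pointer is 200, because the decision is rechecked at a point where the pointer cannot move.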

But there's *zero* interlock in XLogSaveBufferForHint() from what I can tell?
A checkpoint could easily start between the GetRedoRecPtr() and the
check whether this buffer needs to be WAL logged?
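
The suspected interleaving can be written down as a deterministic toy model. Again, this is a sketch under assumptions rather than the real code path: ToyShared, toy_get_redo(), toy_checkpoint_start() and toy_hint_decision_is_stale() are hypothetical names, LSNs are bare integers, and the checkpoint is forced to start exactly between the unlocked read of the redo pointer and the comparison against the page LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/* Hypothetical stand-in for the shared copy of the redo pointer. */
typedef struct ToyShared
{
    ToyLsn redo;
} ToyShared;

/* Returns a snapshot only; nothing prevents the value changing afterwards. */
static ToyLsn
toy_get_redo(const ToyShared *shared)
{
    return shared->redo;
}

/* A starting checkpoint moves the redo pointer forward. */
static void
toy_checkpoint_start(ToyShared *shared, ToyLsn new_redo)
{
    assert(new_redo >= shared->redo);
    shared->redo = new_redo;
}

/* An FPW is needed if the page was last modified at or before the redo point. */
static bool
toy_needs_fpw(ToyLsn page_lsn, ToyLsn redo)
{
    return page_lsn <= redo;
}

/*
 * Replays the interleaving: snapshot the redo pointer, let a checkpoint
 * start, then decide using the stale snapshot. Returns true if that
 * decision differs from the one the current redo pointer would give.
 */
static bool
toy_hint_decision_is_stale(ToyLsn initial_redo, ToyLsn page_lsn,
                           ToyLsn checkpoint_redo)
{
    ToyShared shared = { initial_redo };
    ToyLsn    snapshot = toy_get_redo(&shared);

    toy_checkpoint_start(&shared, checkpoint_redo); /* the race window */

    bool decided = toy_needs_fpw(page_lsn, snapshot);
    bool correct = toy_needs_fpw(page_lsn, toy_get_redo(&shared));

    return decided != correct;
}
```

With an initial redo pointer of 100, a page LSN of 150 and a checkpoint that moves the redo pointer to 200, the stale snapshot says no FPW is needed while the current pointer says one is. Any page whose LSN falls between the old and new redo pointers is affected in this model.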


While XLogSaveBufferForHint() makes no note of this, its sole caller,
MarkBufferDirtyHint(), tries to deal with some related concerns to some
degree:

            /*
             * If the block is already dirty because we either made a change
             * or set a hint already, then we don't need to write a full page
             * image.  Note that aggressive cleaning of blocks dirtied by hint
             * bit setting would increase the call rate. Bulk setting of hint
             * bits would reduce the call rate...
             *
             * We must issue the WAL record before we mark the buffer dirty.
             * Otherwise we might write the page before we write the WAL. That
             * causes a race condition, since a checkpoint might occur between
             * writing the WAL record and marking the buffer dirty. We solve
             * that with a kluge, but one that is already in use during
             * transaction commit to prevent race conditions. Basically, we
             * simply prevent the checkpoint WAL record from being written
             * until we have marked the buffer dirty. We don't start the
             * checkpoint flush until we have marked dirty, so our checkpoint
             * must flush the change to disk successfully or the checkpoint
             * never gets written, so crash recovery will fix.
             *
             * It's possible we may enter here without an xid, so it is
             * essential that CreateCheckPoint waits for virtual transactions
             * rather than full transactionids.
             */
            Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
            MyProc->delayChkptFlags |= DELAY_CHKPT_START;
            delayChkptFlags = true;
            lsn = XLogSaveBufferForHint(buffer, buffer_std);

but I don't think that really does all that much, because the
DELAY_CHKPT_START handling in CreateCheckPoint() happens after we determine
the redo pointer.  This code isn't even reached if we wrongly skipped because
of the if (lsn <= RedoRecPtr) check.


I seriously doubt this can correctly be implemented outside of xlog*.c /
without the use of a WALInsertLock?

I feel like I must be missing something here; this isn't a particularly
narrow race?


It looks to me like the use of GetRedoRecPtr() in nextval_internal() is also
wrong. I think the uses in slot.c, snapbuild.c, rewriteheap.c are fine.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230714151626.rhgae7taigk2xrq7%40awork3.anarazel.de


