slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",)

Поиск
Список
Период
Сортировка
От Tom Lane
Тема slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",)
Дата
Msg-id 15543.1130714273@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",  ("Jim Nasby" <jnasby@pervasive.com>)
Ответы Re: slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",)  ("Jim C. Nasby" <jnasby@pervasive.com>)
Re: slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",)  ("Jim C. Nasby" <jnasby@pervasive.com>)
Re: slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",)  ("Jim C. Nasby" <jnasby@pervasive.com>)
Список pgsql-hackers
OK, I think I see it.  The problem is that the code in slru.c is careful
about not modifying state when it doesn't hold the proper lock, but not
so careful about not *inspecting* state without the proper lock.  In
particular consider these lines in SimpleLruReadPage (line numbers are
as in CVS tip):
    /* Mark the slot read-busy (no-op if it already was) */
277        shared->page_number[slotno] = pageno;
278        shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS;
    ...
    /* Release shared lock, grab per-buffer lock instead */
287        LWLockRelease(shared->ControlLock);
288        LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE);
    /*     * Check to see if someone else already did the read, or took     * the buffer away from us.  If so, restart
fromthe top.     */
 
294        if (shared->page_number[slotno] != pageno ||
295            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)    ...

Suppose that we arrive at line 277 when the page is currently being
faulted in by another process (ie, its state is already
read-in-progress).  The assignments at lines 277 & 278 are then no-ops,
and we'll block at line 288 waiting for the other guy to finish the I/O
and release the per-buffer lock.

Suppose that after the I/O finishes and before we get to run again, this
buffer sinks back to the bottom of the LRU list and gets chosen to be
replaced with another page.  Some other process will then start
executing this code and will change the page number (line 277) and
change the state from clean to read-in-progress (line 278).

The race condition is that this could happen between the tests at lines
294 and 295 in our original process.  In that case the original process
would think that the page still needed to be read in, and would set
about doing so.  It would discover its error at the Assert at line
308, exactly where Jim reports a problem.

The association we thought we'd noted to recursion through
SlruSelectLRUPage is in part a red herring: the race can occur without
that.  However, it's perhaps a bit more probable to occur in that path,
because when SlruSelectLRUPage recurses back to SimpleLruReadPage, we
*know* that there is another process currently doing read-in, and so
the first coincidence is already satisfied.

Still, the race condition window is extremely narrow, only a couple of
instructions.  I looked at the assembly output for the compiler
Jim is using, and it looks like this:
cmpl    %r13d, 104(%rbp,%r12,4)je    .L155
.L116:... code for if() body here...
.L155:cmpl    $1, 72(%rbp,%r12,4)jne    .L116... code for subsequent lines here

However, if the processor predicts the forward branch not to be taken,
it could waste a few cycles recovering from its mistake, so the window
maybe is a little wider than it appears.

I'd like Jim to test this theory by seeing if it helps to reverse the
order of the if-test elements at lines 294/295, ie make it look like
       if (shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||           shared->page_number[slotno] !=
pageno)

This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.

I'm still thinking about how to make a real fix without introducing
another lock/unlock cycle --- we can do that if we have to, but I think
maybe it's possible to fix it without.

SimpleLruWritePage has an identical problem BTW :-(
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: pg_dump option to dump only functions
Следующее
От: Tom Lane
Дата:
Сообщение: Re: add_missing_from breaks existing views