Re: Some bogus results from prairiedog

Поиск

Список

Период

Сортировка

От	Tom Lane
Тема	Re: Some bogus results from prairiedog
Дата	24 июля 2014 г. 19:33:47
Msg-id	6322.1406219591@sss.pgh.pa.us обсуждение исходный текст
Ответ на	Re: Some bogus results from prairiedog (Robert Haas <robertmhaas@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> and it sure looks to me like that
>> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
>> shared-memory lock object you just released.  If someone else had managed
>> to recycle that locktable entry for some other purpose, the
>> ConflictsWithRelationFastPath call might incorrectly return true.
>> 
>> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
>> something.

> ...this is probably the real cause of the failures we've actually been
> seeing.  I'll go back-patch that change.

For the archives' sake --- I took another look at the two previous
instances we'd seen of FastPathStrongRelationLocks assertion failures:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

Neither of these tripped at the bug site in LockRefindAndRelease, but
that's not so surprising, because there was no Assert there at the time.
An occurrence of the bug would have silently left a negative entry in the
FastPathStrongRelationLocks->count[] array, which would have triggered an
assertion by the next process unlucky enough to use the same array entry.

In the prairiedog case, the assert happened in a test running concurrently
with the prepared_xacts test (which is presumably where the bug occurred),
so that this seems a highly plausible explanation.  In the rover_firefly
case, the assert happened quite a bit later than prepared_xacts; but it
might've been just luck that that particular array entry (out of 1024)
wasn't re-used for awhile.

So it's not certain that this one bug explains all three cases,
but it seems well within the bounds of plausibility that it does.
Also, the narrow width of the race condition window (from
LWLockRelease(partitionLock) to an inlined test in the next line)
is consistent with the very low failure rate we've seen in the buildfarm.

BTW, I wonder whether it would be interesting for testing purposes to
have a build option that causes SpinLockRelease and LWLockRelease to
do a sched_yield before returning.  One could assume that that would
greatly increase the odds of detecting this type of bug.
        regards, tom lane

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Magnus Hagander
Дата: 24 июля 2014 г., 19:28:01
Сообщение: Re: 9.4 docs current as of

Следующее

От: Alvaro Herrera
Дата: 24 июля 2014 г., 19:33:56
Сообщение: Re: parametric block size?

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Some bogus results from prairiedog

Предыдущее

Следующее