Re: Deadlock in multiple CIC.

From: Tom Lane
Subject: Re: Deadlock in multiple CIC.
Msg-id: 6744.1523833660@sss.pgh.pa.us
In reply to: Re: Deadlock in multiple CIC.  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Replies: Re: Deadlock in multiple CIC.  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

Awhile back, Alvaro Herrera wrote:
>> Pushed to all affected branches, along with a somewhat lame
>> isolationtester test for the condition (since we've already broken this
>> twice and not noticed for long).

> Buildfarm member okapi just failed this test in 9.4:

okapi has continued to fail that test, not 100% of the time but much
more often than not ... but only in 9.4.  And no other animals have
shown it at all.  So what to make of that?

Noting that okapi uses a pretty old icc version running at a high -O
level, we could dismiss it as probably-a-compiler-bug.  But that theory
doesn't really account for the fact that it sometimes succeeds.

Another theory, noting that 9.5 and later have memory barriers in S_UNLOCK
which 9.4 lacks, is that 9.4 fails because there is no memory barrier
between SnapshotResetXmin and GetCurrentVirtualXIDs, which allows both
processes to observe the other's xmin as still nonzero given the
right timing.  This seems like a stretch, because really the latter
function's LWLockAcquire on ProcArrayLock ought to be enough to serialize
things.  But there has to be *something* different between 9.4 and all the
later branches, and the barrier stuff sure looks like it's in the right
neighborhood.

As an investigative measure, I propose that we insert

    Assert(MyPgXact->xmin == InvalidTransactionId);

into 9.4's DefineIndex, just after its InvalidateCatalogSnapshot call.
I don't want to leave that there permanently, because it's not clear to me
that there are no legitimate cases where a backend would have extra
snapshots active during CREATE INDEX CONCURRENTLY --- but we seem to get
through 9.4's regression tests with it, and it would quickly confirm or
deny whether okapi is failing because it somehow has an extra snapshot.

Assuming that that doesn't show anything, I'm inclined to think that
the next step should be to add a pg_memory_barrier() call to
SnapshotResetXmin (again only in the 9.4 branch), and see if that helps.

            regards, tom lane