[HACKERS] Buildfarm failure and dubious coding in predicate.c

From Tom Lane
Subject [HACKERS] Buildfarm failure and dubious coding in predicate.c
Date
Msg-id 10593.1500670709@sss.pgh.pa.us
Responses Re: [HACKERS] Buildfarm failure and dubious coding in predicate.c  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Buildfarm member culicidae just showed a transient failure in
the 9.4 branch:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-07-21%2017%3A49%3A37

It's an assert trap, for which the buildfarm helpfully captured a
stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fb8d388a3fa in __GI_abort () at abort.c:89
#2  0x0000558d34d90814 in ExceptionalCondition (conditionName=conditionName@entry=0x558d34df6e2d "!(!found)",
errorType=errorType@entry=0x558d34dcef3c "FailedAssertion", fileName=fileName@entry=0x558d34f19dc0
"/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c",
lineNumber=lineNumber@entry=2023) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/utils/error/assert.c:54
#3  0x0000558d34c9374b in RestoreScratchTarget (lockheld=lockheld@entry=1 '\001') at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c:2023
#4  0x0000558d34c966c4 in DropAllPredicateLocksFromTable (transfer=1 '\001', relation=relation@entry=0x7fb8d4d3aa18) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c:2997
#5  TransferPredicateLocksToHeapRelation (relation=relation@entry=0x7fb8d4d3aa18) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/storage/lmgr/predicate.c:3014
#6  0x0000558d34ac7a70 in index_drop (indexId=29755, concurrent=concurrent@entry=0 '\000') at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/index.c:1516
#7  0x0000558d34ac00f8 in doDeletion (flags=-1369083928, object=0x558d35c2c03c) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:1125
#8  deleteOneObject (object=0x558d35c2c03c, depRel=depRel@entry=0x7fffae656fe8, flags=flags@entry=0) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:1036
#9  0x0000558d34ac0545 in deleteObjectsInList (targetObjects=targetObjects@entry=0x558d35bae140,
depRel=depRel@entry=0x7fffae656fe8, flags=flags@entry=0) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:227
#10 0x0000558d34ac06c8 in performMultipleDeletions (objects=objects@entry=0x558d35badef0, behavior=DROP_CASCADE,
flags=flags@entry=0) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/catalog/dependency.c:366
#11 0x0000558d34b3e2e9 in RemoveObjects (stmt=stmt@entry=0x558d35bf5678) at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/commands/dropcmds.c:134
#12 0x0000558d34ca61f0 in ExecDropStmt (stmt=stmt@entry=0x558d35bf5678, isTopLevel=isTopLevel@entry=1 '\001') at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/tcop/utility.c:1364
#13 0x0000558d34ca8455 in ProcessUtilitySlow (parsetree=parsetree@entry=0x558d35bf5678,
queryString=queryString@entry=0x558d35bf4b50 "DROP SCHEMA selinto_schema CASCADE;",
context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, dest=dest@entry=0x558d35bf5a20,
completionTag=completionTag@entry=0x7fffae657710 "") at
/home/andres/build/buildfarm-culicidae/REL9_4_STABLE/pgsql.build/../pgsql/src/backend/tcop/utility.c:1295

I've been staring at that for a little while, and I can't see any logic
error that would lead to the failure.  Clearly it'd be expected if two
sessions tried to remove/reinsert the "scratch target" concurrently,
but the locking operations should be enough to prevent that.  (Moreover,
if that had happened, you'd have expected an earlier assertion failure
in one or the other of the RemoveScratchTarget calls.)

Plausible explanations at this point seem to be:

1. Cosmic ray bit-flip.
2. There's some bug in the lock infrastructure, allowing two processes to acquire an LWLock concurrently.
3. Logic error I'm missing.

Probably it's #3, but what?

And, while I'm looking at this ... isn't this "scratch target" logic
just an ineffective attempt at waving a dead chicken?  It's assuming
that freeing an entry in a shared hash table guarantees that it can
insert another entry.  But that hash table is partitioned, meaning it has
a separate freelist per partition.  So the extra entry only provides a
guarantee that you can insert something into the same partition it's in,
making it useless for this purpose AFAICS.

By the same token, I do not think I believe the nearby assumptions that
deleting one entry from PredicateLockHash guarantees we can insert
another one.  That hash is partitioned as well.

It looks to me like we either need to do a fairly significant rewrite
here, or to give up on making these hashtables partitioned.  Either
one is pretty annoying, considering the very low probability of running
out of shared memory right here; but what we've got is not up to project
standards IMO.

I have some ideas about fixing this by enlisting the help of dynahash.c
explicitly, rather than fooling with "scratch entries".  But I haven't
been able yet to write a design for that that doesn't have obvious bugs.
        regards, tom lane


