I wrote:
> This is a bit disturbing:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bushpig&dt=2013-01-07%2019%3A15%3A02
> ...
> The assertion failure seems to indicate that the number of
> LockMethodProcLockHash entries found by hash_seq_search didn't match the
> number that had been counted by hash_get_num_entries immediately before
> that. I don't see any bug in GetLockStatusData itself, so this suggests
> that there's something wrong with dynahash's entry counting, or that
> somebody somewhere is modifying the shared hash table without holding
> the appropriate lock. The latter seems a bit more likely, given that
> this must be a very low-probability bug or we'd have seen it before.
> An overlooked locking requirement in a seldom-taken code path would fit
> the symptoms.
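For anyone without the source at hand, the check that tripped has roughly the shape sketched below. This is a toy, self-contained model, not dynahash or GetLockStatusData code; every name in it (ToyHash, toy_insert, toy_num_entries, toy_scan_count, toy_count_then_scan) is made up for illustration. It only shows why a count taken before a scan stops matching the scan if someone inserts in between.

```c
/* Toy model only: NOT PostgreSQL code, all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define TOY_SLOTS 8

typedef struct ToyHash
{
    bool used[TOY_SLOTS];   /* which slots currently hold an entry */
    int  nentries;          /* entry counter maintained by toy_insert */
} ToyHash;

/* Insert into a slot, keeping the counter in step with the slots. */
static void
toy_insert(ToyHash *h, int slot)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    if (!h->used[slot])
    {
        h->used[slot] = true;
        h->nentries++;
    }
}

/* Analogue of asking the table how many entries it holds. */
static int
toy_num_entries(const ToyHash *h)
{
    return h->nentries;
}

/* Analogue of a sequential scan: count the entries actually visited. */
static int
toy_scan_count(const ToyHash *h)
{
    int found = 0;

    for (int i = 0; i < TOY_SLOTS; i++)
        if (h->used[i])
            found++;
    return found;
}

/*
 * Count first, then scan.  A concurrent_slot >= 0 simulates another
 * process inserting between the two steps, which can only happen if it
 * is not holding the lock the reader holds.  Returns true when the scan
 * agrees with the earlier count.
 */
static bool
toy_count_then_scan(ToyHash *h, int concurrent_slot)
{
    int expected = toy_num_entries(h);

    if (concurrent_slot >= 0)
        toy_insert(h, concurrent_slot);
    return toy_scan_count(h) == expected;
}
```

With no interleaved insert the two numbers agree; with one, the scan finds an entry the count never saw, which is the mismatch the assertion reports.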
After digging around a bit, I can find only one place where it looks
like somebody might be messing with the LockMethodProcLockHash table
while not holding the appropriate lock-partition LWLock(s):
1. VirtualXactLock finds that the target xact holds its VXID lock via the fast path.
2. VirtualXactLock calls SetupLockInTable to convert the fast-path lock to a regular lock.
3. SetupLockInTable makes entries in LockMethodLockHash and LockMethodProcLockHash.
I see no partition lock acquisition anywhere in the above code path.
Is there one that I'm missing? Why isn't SetupLockInTable documented
as expecting the caller to hold the partition lock, as is generally
done for lock.c subroutines that require that?
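To make the convention concrete, here is a minimal sketch of the caller-holds-the-partition-lock pattern. It is a single-threaded toy model, not lock.c code: ToyLockTable, toy_partition_for, toy_partition_acquire, toy_partition_release, toy_setup_lock_in_table and toy_convert_fast_path_lock are all hypothetical names, and the "lock" is just a flag so that the requirement can be checked.

```c
/* Toy model only: NOT PostgreSQL code, all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_PARTITIONS 16

typedef struct ToyPartition
{
    bool lock_held;     /* stands in for the partition's lock */
    int  nentries;      /* entries stored in this partition */
} ToyPartition;

typedef struct ToyLockTable
{
    ToyPartition part[TOY_NUM_PARTITIONS];
} ToyLockTable;

/* Map a hash code to the partition that owns it. */
static unsigned
toy_partition_for(unsigned hashcode)
{
    return hashcode % TOY_NUM_PARTITIONS;
}

static void
toy_partition_acquire(ToyLockTable *t, unsigned p)
{
    assert(!t->part[p].lock_held);
    t->part[p].lock_held = true;
}

static void
toy_partition_release(ToyLockTable *t, unsigned p)
{
    assert(t->part[p].lock_held);
    t->part[p].lock_held = false;
}

/*
 * Add an entry to the shared table.
 *
 * Caller must hold the partition lock for hashcode.  In this toy the
 * requirement is checked: the function refuses (returns false) rather
 * than touching the table without the lock.
 */
static bool
toy_setup_lock_in_table(ToyLockTable *t, unsigned hashcode)
{
    unsigned p = toy_partition_for(hashcode);

    if (!t->part[p].lock_held)
        return false;
    t->part[p].nentries++;
    return true;
}

/* The calling pattern: take the partition lock around the table update. */
static bool
toy_convert_fast_path_lock(ToyLockTable *t, unsigned hashcode)
{
    unsigned p = toy_partition_for(hashcode);
    bool     ok;

    toy_partition_acquire(t, p);
    ok = toy_setup_lock_in_table(t, hashcode);
    toy_partition_release(t, p);
    return ok;
}
```

Calling toy_setup_lock_in_table directly, without the acquire, is the analogue of the code path in question; going through toy_convert_fast_path_lock is the analogue of what I would expect to see there.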
If this is a bug, it's rather disturbing that it took us this long to
recognize it. That code path isn't all that seldom-taken, AFAIK.
regards, tom lane