Excerpts from Noah Misch's message of Sat Jul 16 13:11:49 -0400 2011:
> In any event, I have attached a patch that fixes the problems I have described
> here. To ignore autovacuum, it only recognizes a wait when one of the
> backends under test holds a conflicting lock. (It occurs to me that perhaps
> we should expose a pg_lock_conflicts(lockmode_held text, lockmode_req text)
> function to simplify this query -- this is a fairly common monitoring need.)
Applied it. I agree that having such a utility function is worthwhile,
particularly if we're working on making pg_locks more usable as a whole.
(I wasn't able to reproduce Rémi's hangups here, so I wasn't able to
reproduce the other bits either.)
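
For illustration only, here is a rough Python sketch of what such a
function might compute, together with the "only count a wait when a
backend under test holds a conflicting lock" rule. pg_lock_conflicts
does not exist today; its name and arguments come from the suggestion
above. The helper waits_on_test_backend and the (pid, target, mode,
granted) tuple layout are assumptions made up for this sketch, not what
the patch does. The conflict sets follow the documented table-level
lock conflict matrix, using the mode names that pg_locks reports.

```python
# Table-level lock modes as reported in pg_locks.mode, weakest first.
MODES = [
    "AccessShareLock", "RowShareLock", "RowExclusiveLock",
    "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
    "ExclusiveLock", "AccessExclusiveLock",
]

# For each mode, the set of modes it conflicts with.
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "RowShareLock": {"ExclusiveLock", "AccessExclusiveLock"},
    "RowExclusiveLock": {"ShareLock", "ShareRowExclusiveLock",
                         "ExclusiveLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock": {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                  "ShareRowExclusiveLock", "ExclusiveLock",
                  "AccessExclusiveLock"},
    "ShareRowExclusiveLock": {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                              "ShareLock", "ShareRowExclusiveLock",
                              "ExclusiveLock", "AccessExclusiveLock"},
    "ExclusiveLock": set(MODES) - {"AccessShareLock"},
    "AccessExclusiveLock": set(MODES),
}


def pg_lock_conflicts(lockmode_held, lockmode_req):
    """Hypothetical: True if a held lock mode blocks a requested one."""
    return lockmode_req in CONFLICTS[lockmode_held]


def waits_on_test_backend(waiter_pid, test_pids, locks):
    """Hypothetical helper: is waiter_pid blocked by a backend under test?

    locks is an iterable of (pid, target, mode, granted) tuples, a
    simplified stand-in for rows of pg_locks. Locks held by other
    backends, such as autovacuum, are ignored.
    """
    locks = list(locks)
    # Lock requests the waiter has made but not yet been granted.
    wanted = [(t, m) for (p, t, m, g) in locks if p == waiter_pid and not g]
    for pid, target, mode, granted in locks:
        if not granted or pid == waiter_pid or pid not in test_pids:
            continue
        if any(t == target and pg_lock_conflicts(mode, m) for t, m in wanted):
            return True
    return False
```

With this shape, a session queued behind only an autovacuum worker's
ShareUpdateExclusiveLock is not reported as blocked, while the same
session queued behind another test session's RowExclusiveLock is.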
> With that change in place, my setup survived through about fifty suite runs at
> a time. The streak would end when session 2 would unexpectedly detect a
> deadlock that session 1 should have detected. The session 1 deadlock_timeout
> I chose, 20ms, is too aggressive. When session 2 is to issue the command that
> completes the deadlock, it must do so before session 1 runs the deadlock
> detector. Since we burn 10ms just noticing that the previous statement has
> blocked, that left only 10ms to issue the next statement. This patch bumps
> the figure from 20ms to 100ms; hopefully that will be enough for even a
> decently-loaded virtual host.
Committed this too.
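
To spell out the arithmetic in the quoted paragraph, a small sketch.
The function name issue_window_ms is made up for this example; the
10ms figure for noticing that the previous statement has blocked is
taken from the description above, and the model (timeout minus notice
delay) is a simplification of it.

```python
def issue_window_ms(deadlock_timeout_ms, notice_delay_ms=10):
    """Time left for session 2 to issue the deadlock-completing
    statement before session 1's deadlock detector runs.

    Simplified model: the harness spends notice_delay_ms just noticing
    that the previous statement has blocked, and whatever remains of
    session 1's deadlock_timeout is the window.
    """
    return deadlock_timeout_ms - notice_delay_ms


old_window = issue_window_ms(20)    # previous setting for session 1
new_window = issue_window_ms(100)   # setting after this patch
print(old_window, new_window)       # prints "10 90"
```

Under that model the change widens the window from 10ms to 90ms.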
> With this patch in its final form, I have completed 180+ suite runs without a
> failure. In the absence of better theories on the cause for the buildfarm
> failures, we should give the buildfarm a whirl with this patch.
Great. If there is some other failure mechanism, we'll find out ...
> I apologize for the quantity of errata this change is entailing.
No need to apologize. I might as well apologize myself because I didn't
detect these problems on review. But we don't do that -- we just fix
the problems and move on. It's great that you were able to come up with
a fix quickly.
And this is precisely why I committed this way ahead of the patch that
it was written to help: we're now not fixing problems in both
simultaneously. By the time we get that other patch in, this test
harness will be fully robust.
Thanks for all your effort in this.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support