Re: FOR KEY LOCK foreign keys

From: Noah Misch
Subject: Re: FOR KEY LOCK foreign keys
Date:
Msg-id: 20110716171121.GB2047@tornado.leadboat.com
In response to: Re: FOR KEY LOCK foreign keys  (Alvaro Herrera <alvherre@commandprompt.com>)
Responses: Re: FOR KEY LOCK foreign keys
List: pgsql-hackers
On Fri, Jul 15, 2011 at 07:01:26PM -0400, Alvaro Herrera wrote:
> Excerpts from Noah Misch's message of mié jul 13 01:34:10 -0400 2011:
>
> > coypu failed during the run of the test due to a different session being chosen
> > as the deadlock victim.  We can now vary deadlock_timeout to prevent this; see
> > attached fklocks-tests-deadlock_timeout.patch.  This also makes the tests much
> > faster on a default postgresql.conf.
>
> I applied your patch, thanks.  I couldn't reproduce the failures without
> it, even running only the three new tests in a loop a few dozen times.

It's probably more likely to crop up on a loaded system.  I did not actually
reproduce it myself.  However, if you swap the timeouts, the opposite session
finds the deadlock.  From there, I'm convinced that the right timing
perturbations could yield the symptom coypu exhibited.

> > crake failed when it reported waiting on the first step of an existing isolation
> > test ("two-ids.spec").  I will need to look into that further.
>
> Actually, there are four failures in tests other than the two fixed by
> your patch.  These are:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-12%2022:32:02
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2011-07-14%2016:27:00
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pitta&dt=2011-07-15%2015:00:08
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-15%2018:32:02

Thanks for summarizing.  These all boil down to lock waits not anticipated by
the test specs.  Having pondered this, I've been able to come up with just one
explanation.  If autovacuum runs VACUUM during the test and finds that it can
truncate dead space from the end of a relation, it will acquire an
AccessExclusiveLock.  When I decrease autovacuum_naptime to 1s, I do see
plenty of pg_type and pg_attribute truncations during a test run.

When I sought to reproduce this, what I first saw instead was an indefinite
test suite hang.  That turned out to arise from an unrelated thinko -- I
assumed that backend IDs were stable for the life of the backend, but they're
only stable for the life of a pgstat snapshot.  This fell down when a backend
older than one of the test backends exited during the test:

4199 2011-07-16 03:33:28.733 EDT DEBUG:  forked new backend, pid=23984 socket=8
23984 2011-07-16 03:33:28.737 EDT LOG:  statement: SET client_min_messages = warning;
23984 2011-07-16 03:33:28.739 EDT LOG:  statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE
pg_stat_get_backend_pid(i)= pg_backend_pid() 
23985 2011-07-16 03:33:28.740 EDT DEBUG:  autovacuum: processing database "postgres"
4199 2011-07-16 03:33:28.754 EDT DEBUG:  forked new backend, pid=23986 socket=8
23986 2011-07-16 03:33:28.754 EDT LOG:  statement: SET client_min_messages = warning;
4199 2011-07-16 03:33:28.755 EDT DEBUG:  server process (PID 23985) exited with exit code 0
23986 2011-07-16 03:33:28.755 EDT LOG:  statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE
pg_stat_get_backend_pid(i)= pg_backend_pid() 
4199 2011-07-16 03:33:28.766 EDT DEBUG:  forked new backend, pid=23987 socket=8
23987 2011-07-16 03:33:28.766 EDT LOG:  statement: SET client_min_messages = warning;
23987 2011-07-16 03:33:28.767 EDT LOG:  statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE
pg_stat_get_backend_pid(i)= pg_backend_pid() 
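
In other words (a toy model, not isolationtester's actual code), a backend ID
is effectively a position in the pgstat snapshot taken at query time, so the
ID one session caches can later collide with the ID handed to a newer session:

```python
# Toy model of pg_stat_get_backend_idset(): backend IDs are 1-based
# positions in the current pgstat snapshot, not stable per-backend
# identifiers.  (Illustrative only; real slot assignment is more
# involved, but the instability is the same.)
def backend_id_of(pid, snapshot_pids):
    return snapshot_pids.index(pid) + 1

# A test backend (PID 1001) snapshots while an older backend (900)
# is still alive, and caches its ID:
cached = backend_id_of(1001, [900, 1001])
print(cached)                             # 2

# The older backend exits; a later test backend (1002) snapshots:
print(backend_id_of(1002, [1001, 1002]))  # 2 -- collides with the cached ID
```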

This led isolationtester to initialize backend_ids = {1,2,2}, making us unable
to detect lock waits correctly.  That's also consistent with the symptoms Rémi
Zara just reported.  With that fixed, I was able to reproduce the failure due
to autovacuum-truncate-induced transient waiting using this recipe:
- autovacuum_naptime = 1s
- src/test/isolation/Makefile changed to pass --use-existing during installcheck
- Run 'make installcheck' in a loop
- A concurrent session running this in a loop:
CREATE TABLE churn (a int, b int, c int, d int, e int, f int, g int, h int);
DROP TABLE churn;

That yields a steady stream of vacuum truncations, and an associated lock wait
generally capsized the suite within 5-10 runs.  Frankly, I have some
difficulty believing that this mechanic alone produced all four failures you
cite above; I suspect I'm still missing some more-frequent cause.  Any other
theories on which system background activities can cause a transient lock
wait?  It would have to produce a "pgstat_report_waiting(true)" call, so I
believe that excludes all LWLock and lighter contention.

In any event, I have attached a patch that fixes the problems I have described
here.  To ignore autovacuum, it only recognizes a wait when one of the
backends under test holds a conflicting lock.  (It occurs to me that perhaps
we should expose a pg_lock_conflicts(lockmode_held text, lockmode_req text)
function to simplify this query -- this is a fairly common monitoring need.)
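
For illustration, here is a rough Python sketch of the test such a
hypothetical pg_lock_conflicts() function would encode; the table is the
standard PostgreSQL table-level lock conflict matrix from the docs:

```python
# Sketch of the check a hypothetical pg_lock_conflicts(held, req) SQL
# function would perform, using PostgreSQL's table-level lock conflict
# matrix (see the "Explicit Locking" chapter of the documentation).
CONFLICTS = {
    "AccessShareLock":          {"AccessExclusiveLock"},
    "RowShareLock":             {"ExclusiveLock", "AccessExclusiveLock"},
    "RowExclusiveLock":         {"ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock":                {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareRowExclusiveLock":    {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
    "ExclusiveLock":            {"RowShareLock", "RowExclusiveLock",
                                 "ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "AccessExclusiveLock":      {"AccessShareLock", "RowShareLock",
                                 "RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
}

def pg_lock_conflicts(lockmode_held, lockmode_req):
    """True if a request for lockmode_req must wait behind lockmode_held."""
    return lockmode_req in CONFLICTS[lockmode_held]

# Autovacuum's truncation lock conflicts with even the weakest reader lock:
print(pg_lock_conflicts("AccessExclusiveLock", "AccessShareLock"))  # True
# Two plain readers never conflict:
print(pg_lock_conflicts("AccessShareLock", "AccessShareLock"))      # False
```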

With that change in place, my setup survived through about fifty suite runs at
a time.  The streak would end when session 2 would unexpectedly detect a
deadlock that session 1 should have detected.  The session 1 deadlock_timeout
I chose, 20ms, is too aggressive.  When session 2 is to issue the command that
completes the deadlock, it must do so before session 1 runs the deadlock
detector.  Since we burn 10ms just noticing that the previous statement has
blocked, that left only 10ms to issue the next statement.  This patch bumps
the figure from 20ms to 100ms; hopefully that will be enough for even a
decently-loaded virtual host.  We should keep it as low as is reasonable,
because it contributes directly to the isolation suite runtime.  Each addition
to deadlock_timeout slows the suite by 12x that amount.
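
To make that budget concrete, here is the arithmetic as a small Python sketch
(the 10ms polling figure and the 12x multiplier are the ones described above;
the function names are mine, for illustration only):

```python
# Timing budget for the isolation deadlock tests, in milliseconds.
# isolationtester burns ~10ms just noticing the previous statement
# has blocked, so that much comes off the top of deadlock_timeout.
POLL_INTERVAL_MS = 10

def issue_window_ms(deadlock_timeout_ms):
    """Time left for session 2 to issue the deadlock-completing command
    before session 1 runs the deadlock detector."""
    return deadlock_timeout_ms - POLL_INTERVAL_MS

print(issue_window_ms(20))   # 10 -- the old, too-aggressive setting
print(issue_window_ms(100))  # 90 -- the new setting

def suite_cost_ms(deadlock_timeout_increase_ms, multiplier=12):
    """Each addition to deadlock_timeout slows the suite by ~12x that amount."""
    return multiplier * deadlock_timeout_increase_ms

print(suite_cost_ms(80))     # 960 -- cost of going from 20ms to 100ms
```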

With this patch in its final form, I have completed 180+ suite runs without a
failure.  In the absence of better theories on the cause for the buildfarm
failures, we should give the buildfarm a whirl with this patch.

I apologize for the quantity of errata this change has entailed.

Thanks,
nm

--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments
