Discussion: Recent failures in IsolationCheck deadlock-hard


Recent failures in IsolationCheck deadlock-hard

From
Thomas Munro
Date:

Re: Recent failures in IsolationCheck deadlock-hard

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> There have been five failures on three animals like this, over the
> past couple of months:

Also worth noting is that anole failed its first try at the new
deadlock-parallel isolation test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16

What that looks like is the queries got stuck and eventually
isolationtester gave up and canceled the test.  So I'm suspicious
that there's a second bug in the parallel deadlock detection code.

Possibly relevant factoids: all three of the animals in question
run HEAD with force_parallel_mode = regress, and there's reason
to think that their timing behavior could be different from other
animals (anole and gharial run on HPUX, while hyrax uses
CLOBBER_CACHE_ALWAYS).

            regards, tom lane



Re: Recent failures in IsolationCheck deadlock-hard

From
Thomas Munro
Date:
On Sat, Aug 3, 2019 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > There have been five failures on three animals like this, over the
> > past couple of months:
>
> Also worth noting is that anole failed its first try at the new
> deadlock-parallel isolation test:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16

And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.

  animal   |      snapshot       | branch | commit  | result  |   fail_stage   |     fail_tests
-----------+---------------------+--------+---------+---------+----------------+---------------------
 lousyjack | 2019-08-05 11:33:02 | HEAD   | a76cfba | FAILURE | IsolationCheck | {deadlock-parallel}
 gharial   | 2019-08-05 10:30:37 | HEAD   | a76cfba | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-05 05:20:01 | HEAD   | 8548ddc | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-04 05:20:02 | HEAD   | 69edf4f | FAILURE | IsolationCheck | {deadlock-parallel}
 hyrax     | 2019-08-03 12:20:57 | HEAD   | 2abd7ae | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-03 05:20:01 | HEAD   | 2abd7ae | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-02 05:20:00 | HEAD   | a9f301d | FAILURE | IsolationCheck | {deadlock-parallel}
 anole     | 2019-08-01 15:48:16 | HEAD   | da9456d | FAILURE | IsolationCheck | {deadlock-parallel}

 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2019-08-05%2011:33:02
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-08-05%2010:30:37
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-08-05%2005:20:01
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-08-04%2005:20:02
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2019-08-03%2012:20:57
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-08-03%2005:20:01
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-08-02%2005:20:00
 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015:48:16

 1
 step d2a1: <... completed>
-sum
-
-10000
+error in steps d1c e1l d2a1: ERROR:  canceling statement due to user request
 step e1c: COMMIT;
-step d2c: COMMIT;
 step e2l: <... completed>
 lock_excl

 1
+step d2c: COMMIT;
 step e2c: COMMIT;

-- 
Thomas Munro
https://enterprisedb.com



Re: Recent failures in IsolationCheck deadlock-hard

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, Aug 3, 2019 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also worth noting is that anole failed its first try at the new
>> deadlock-parallel isolation test:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16

> And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.

Yeah, there have been half a dozen failures since deadlock-parallel
went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
or valgrind.  I've tried repeatedly to reproduce that here, without
success :-(.  It's unclear whether the failures represent a real
code bug or just a problem in the test case, so I don't really want
to speculate about fixes till I can reproduce it.

            regards, tom lane



Re: Recent failures in IsolationCheck deadlock-hard

From
Thomas Munro
Date:
On Tue, Aug 6, 2019 at 6:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Sat, Aug 3, 2019 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Also worth noting is that anole failed its first try at the new
> >> deadlock-parallel isolation test:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16
>
> > And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.
>
> Yeah, there have been half a dozen failures since deadlock-parallel
> went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
> or valgrind.  I've tried repeatedly to reproduce that here, without
> success :-(.  It's unclear whether the failures represent a real
> code bug or just a problem in the test case, so I don't really want
> to speculate about fixes till I can reproduce it.

I managed to reproduce a failure that looks a lot like lousyjack's
(note that there are two slightly different failure modes):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2019-08-05%2011:33:02

I did that by changing the deadlock_timeout values for sessions d1 and
d2 to just a few milliseconds on my slowest computer, guessing that
this might be a race involving the deadlock timeout and the time it
takes for workers to fork and join a lock queue.  While normally
deadlock.c with DEBUG_DEADLOCK defined prints out something like this
during this test:

DeadLockCheck: lock 0x80a2812d0 queue  33087 33088 33089 33090 33091
rearranged to: lock 0x80a2812d0 queue  33091 33090 33089 33088 33087

... when it failed like lousyjack my run printed out:

DeadLockCheck: lock 0x80a2721f8 queue  33108 33114
rearranged to: lock 0x80a2721f8 queue  33114 33108

... and then it hung for a while, so I could inspect the lock table
and see that PID 33108 was e1l (not granted), and PID 33114 was gone
but was almost certainly the first worker for d2a1 (I can tell because
33110-33113 are the workers for d1a2 and they're still waiting and
d2a1's first worker should have had the next sequential PID, on my
OS).
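
(For anyone following along, "inspect the lock table" just means querying
pg_locks from another session.  A minimal sketch of the sort of query
involved, assuming the spec's lock_share()/lock_excl() helpers take
advisory locks:

    SELECT l.pid, l.locktype, l.classid, l.objid, l.mode, l.granted,
           a.wait_event_type, a.wait_event
      FROM pg_locks l
      JOIN pg_stat_activity a USING (pid)
     WHERE l.locktype = 'advisory'   -- assumes the helpers use advisory locks
     ORDER BY l.pid;
)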

Another thing I noticed is that all 4 times I managed to reproduce
this, the "rearranged to" queue had only two entries; I can understand
that d1's workers might not feature yet due to bad timing, but it's
not clear to me why there should always be only one d2a1 worker and
not more.  I don't have time to study this further today and I might
be way off, but my first guess is that in theory we need a way to make
sure that the d1-e2 edge exists before d2's deadlock timer expires,
no?  That's pretty tricky though, so maybe we just need to crank the
times up.
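
(To be concrete about the reproduction recipe above: the timing change is
just an edit to the two session setup blocks in
src/test/isolation/specs/deadlock-parallel.spec.  The exact value isn't
important, only that it's much shorter than the committed settings;
a sketch:

    -- in both the "d1" and "d2" session setup blocks:
    SET deadlock_timeout = '5ms';
)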

-- 
Thomas Munro
https://enterprisedb.com



Re: Recent failures in IsolationCheck deadlock-hard

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Aug 6, 2019 at 6:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, there have been half a dozen failures since deadlock-parallel
>> went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
>> or valgrind.  I've tried repeatedly to reproduce that here, without
>> success :-(.  It's unclear whether the failures represent a real
>> code bug or just a problem in the test case, so I don't really want
>> to speculate about fixes till I can reproduce it.

> I managed to reproduce a failure that looks a lot like lousyjack's
> (note that there are two slightly different failure modes):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2019-08-05%2011:33:02

> I did that by changing the deadlock_timeout values for sessions d1 and
> d2 to just a few milliseconds on my slowest computer, guessing that
> this might be a race involving the deadlock timeout and the time it
> takes for workers to fork and join a lock queue.

Yeah, I eventually managed to reproduce it (not too reliably) by
introducing a randomized delay into parallel worker startup.

The scenario seems to be: some d1a2 worker arrives so late that it's not
accounted for in the initial DeadLockCheck performed by some d2a1 worker.
The other d1a2 workers are released, and run and finish, but the late one
goes to sleep, with a long deadlock_timeout.  If the next DeadLockCheck is
run by e1l's worker, that prefers to release d2a1 workers, which then all
run to completion.  When the late d1a2 worker finally wakes up and runs
DeadLockCheck, *there is no deadlock to resolve*: the d2 session is idle,
not waiting for any lock.  So the worker goes back to sleep, and we sit
till isolationtester times out.

Another way to look at it is that there is a deadlock condition, but
one of the waits-for constraints is on the client side where DeadLockCheck
can't see it.  isolationtester is waiting for d1a2 to complete before it
will execute d1c which would release session d2, so that d2 is effectively
waiting for d1, but DeadLockCheck doesn't know that and thinks that it's
equally good to unblock either d1 or d2.

The attached proposed patch resolves this by introducing another lock
that is held by d1 and then d2 tries to take it, ensuring that the
deadlock detector will recognize that d1 must be released.
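
(In plain SQL, and assuming the spec's lock_excl()/lock_share() helpers
boil down to pg_advisory_xact_lock()/pg_advisory_xact_lock_shared(), the
new handshake is essentially:

    -- session d1, taken up front in step d1a1 and held until d1's COMMIT:
    SELECT pg_advisory_xact_lock(3);

    -- session d2, at the end of step d2a1: blocks until d1 commits, giving
    -- the deadlock detector a server-side d2-waits-for-d1 edge in place of
    -- the wait that previously existed only inside isolationtester
    SELECT pg_advisory_xact_lock_shared(3);
)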

I've run several thousand iterations of the test this way without a
problem, where before the MTBF was maybe a hundred or two iterations
with the variable startup delay active.  So I think this fix is good,
but I could be wrong.  One notable thing is that every so often the
test takes ~10s to complete instead of a couple hundred msec.  I think
that what's happening there is that the last deadlock condition doesn't
form until after all of session d2's DeadLockChecks have run, meaning
that we don't spot the deadlock until some other session runs it.  The
test still passes, though.  This is probably fine given that it would
never happen except with platforms that are horridly slow anyway.
Possibly we could shorten the 10s values to make that case complete
quicker, but I'm afraid of maybe breaking things on slow machines.

> Another thing I noticed is that all 4 times I managed to reproduce
> this, the "rearranged to" queue had only two entries; I can understand
> that d1's workers might not feature yet due to bad timing, but it's
> not clear to me why there should always be only one d2a1 worker and
> not more.

I noticed that too, and eventually realized that it's a
max_worker_processes constraint: with the default max_worker_processes = 8,
we have two parallel workers waiting in e1l and e2l, so if d1a2 takes four,
there are only two slots left for d2a1; and for reasons that aren't totally
clear, we don't get to use the last slot.  (Not sure if that's a bug in
itself.)

The attached patch therefore also knocks max_parallel_workers_per_gather
down to 3 in this test, so that we have room for at least 2 d2a1 workers.

            regards, tom lane

diff --git a/src/test/isolation/expected/deadlock-parallel.out b/src/test/isolation/expected/deadlock-parallel.out
index 871a80c..cf4d07e 100644
--- a/src/test/isolation/expected/deadlock-parallel.out
+++ b/src/test/isolation/expected/deadlock-parallel.out
@@ -1,10 +1,10 @@
 Parsed test spec with 4 sessions

 starting permutation: d1a1 d2a2 e1l e2l d1a2 d2a1 d1c e1c d2c e2c
-step d1a1: SELECT lock_share(1,x) FROM bigt LIMIT 1;
-lock_share
+step d1a1: SELECT lock_share(1,x), lock_excl(3,x) FROM bigt LIMIT 1;
+lock_share     lock_excl

-1
+1              1
 step d2a2: select lock_share(2,x) FROM bigt LIMIT 1;
 lock_share

@@ -16,15 +16,19 @@ step d1a2: SET force_parallel_mode = on;
               SET parallel_tuple_cost = 0;
               SET min_parallel_table_scan_size = 0;
               SET parallel_leader_participation = off;
-              SET max_parallel_workers_per_gather = 4;
+              SET max_parallel_workers_per_gather = 3;
               SELECT sum(lock_share(2,x)) FROM bigt; <waiting ...>
 step d2a1: SET force_parallel_mode = on;
               SET parallel_setup_cost = 0;
               SET parallel_tuple_cost = 0;
               SET min_parallel_table_scan_size = 0;
               SET parallel_leader_participation = off;
-              SET max_parallel_workers_per_gather = 4;
-              SELECT sum(lock_share(1,x)) FROM bigt; <waiting ...>
+              SET max_parallel_workers_per_gather = 3;
+              SELECT sum(lock_share(1,x)) FROM bigt;
+              SET force_parallel_mode = off;
+              RESET parallel_setup_cost;
+              RESET parallel_tuple_cost;
+              SELECT lock_share(3,x) FROM bigt LIMIT 1; <waiting ...>
 step d1a2: <... completed>
 sum

@@ -38,6 +42,9 @@ step d2a1: <... completed>
 sum

 10000
+lock_share
+
+1
 step e1c: COMMIT;
 step d2c: COMMIT;
 step e2l: <... completed>
diff --git a/src/test/isolation/specs/deadlock-parallel.spec b/src/test/isolation/specs/deadlock-parallel.spec
index aa4a084..7ad290c 100644
--- a/src/test/isolation/specs/deadlock-parallel.spec
+++ b/src/test/isolation/specs/deadlock-parallel.spec
@@ -15,6 +15,25 @@
 # The deadlock detector resolves the deadlock by reversing the d1-e2 edge,
 # unblocking d1.

+# However ... it's not actually that well-defined whether the deadlock
+# detector will prefer to unblock d1 or d2.  It depends on which backend
+# is first to run DeadLockCheck after the deadlock condition is created:
+# that backend will search outwards from its own wait condition, and will
+# first find a loop involving the *other* lock.  We encourage that to be
+# one of the d2a1 parallel workers, which will therefore unblock d1a2
+# workers, by setting a shorter deadlock_timeout in session d2.  But on
+# slow machines, one or more d1a2 workers may not yet have reached their
+# lock waits, so that they're not unblocked by the first DeadLockCheck.
+# The next DeadLockCheck may choose to unblock the d2a1 workers instead,
+# which would allow d2a1 to complete before d1a2, causing the test to
+# freeze up because isolationtester isn't expecting that completion order.
+# (In effect, we have an undetectable deadlock because d2 is waiting for
+# d1's completion, but on the client side.)  To fix this, introduce an
+# additional lock (advisory lock 3), which is initially taken by d1 and
+# then d2a1 will wait for it after completing the main part of the test.
+# In this way, the deadlock detector can see that d1 must be completed
+# first, regardless of timing.
+
 setup
 {
   create function lock_share(int,int) returns int language sql as
@@ -39,15 +58,15 @@ setup        { BEGIN isolation level repeatable read;
               SET force_parallel_mode = off;
               SET deadlock_timeout = '10s';
 }
-# this lock will be taken in the leader, so it will persist:
-step "d1a1"    { SELECT lock_share(1,x) FROM bigt LIMIT 1; }
+# these locks will be taken in the leader, so they will persist:
+step "d1a1"    { SELECT lock_share(1,x), lock_excl(3,x) FROM bigt LIMIT 1; }
 # this causes all the parallel workers to take locks:
 step "d1a2"    { SET force_parallel_mode = on;
               SET parallel_setup_cost = 0;
               SET parallel_tuple_cost = 0;
               SET min_parallel_table_scan_size = 0;
               SET parallel_leader_participation = off;
-              SET max_parallel_workers_per_gather = 4;
+              SET max_parallel_workers_per_gather = 3;
               SELECT sum(lock_share(2,x)) FROM bigt; }
 step "d1c"    { COMMIT; }

@@ -58,14 +77,19 @@ setup        { BEGIN isolation level repeatable read;
 }
 # this lock will be taken in the leader, so it will persist:
 step "d2a2"    { select lock_share(2,x) FROM bigt LIMIT 1; }
-# this causes all the parallel workers to take locks:
+# this causes all the parallel workers to take locks;
+# after which, make the leader take lock 3 to prevent client-driven deadlock
 step "d2a1"    { SET force_parallel_mode = on;
               SET parallel_setup_cost = 0;
               SET parallel_tuple_cost = 0;
               SET min_parallel_table_scan_size = 0;
               SET parallel_leader_participation = off;
-              SET max_parallel_workers_per_gather = 4;
-              SELECT sum(lock_share(1,x)) FROM bigt; }
+              SET max_parallel_workers_per_gather = 3;
+              SELECT sum(lock_share(1,x)) FROM bigt;
+              SET force_parallel_mode = off;
+              RESET parallel_setup_cost;
+              RESET parallel_tuple_cost;
+              SELECT lock_share(3,x) FROM bigt LIMIT 1; }
 step "d2c"    { COMMIT; }

 session "e1"