Discussion: Deadlock in multiple CIC.
c3d09b3bd23f5f6 fixed it so concurrent CIC would not deadlock (or at least not as reliably as before) by dropping its own snapshot before waiting for all the other ones to go away.
With commit 8aa3e47510b969354ea02a, concurrent CREATE INDEX CONCURRENTLY on different tables in the same database started deadlocking against each other again quite reliably.
I think the solution is simply to drop the catalog snapshot as well, as in the attached.
Reported by Jeremy Finzel here: https://www.postgresql.org/message-id/flat/CAMa1XUhHjCv8Qkx0WOr1Mpm_R4qxN26EibwCrj0Oor2YBUFUTg%40mail.gmail.com#CAMa1XUhHjCv8Qkx0WOr1Mpm_R4qxN26EibwCrj0Oor2YBUFUTg@mail.gmail.com
Attachments
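The mechanism behind the deadlock can be illustrated with a toy model: each CIC backend advertises an xmin while it holds any registered snapshot, and then waits for every other backend whose advertised xmin is at or below the xmin of its own reference snapshot. Two CIC backends that took snapshots at the same moment therefore wait on each other forever, unless each drops its own snapshots first. The following Python simulation is purely illustrative — the class and names are invented, and real backends track this state in shared memory, not in Python objects:

```python
# Toy model (not PostgreSQL code): a backend advertises an xmin while it
# holds any registered snapshot; CIC waits out every other backend whose
# advertised xmin is at or below its own reference snapshot's xmin.

class Backend:
    def __init__(self, name, snapshot_xmin):
        self.name = name
        self.limit_xmin = snapshot_xmin       # xmin of the CIC reference snapshot
        self.advertised_xmin = snapshot_xmin  # held while any snapshot is registered

    def drop_snapshots(self):
        # What c3d09b3bd did: release our own snapshots before waiting.
        self.advertised_xmin = None

    def must_wait_for(self, others):
        # Rough analogue of GetCurrentVirtualXIDs(limitXmin, ...).
        return [o.name for o in others
                if o.advertised_xmin is not None
                and o.advertised_xmin <= self.limit_xmin]

# Two CIC backends whose snapshots were taken at the same instant.
a = Backend("cic_on_table_one", 100)
b = Backend("cic_on_table_two", 100)

# Each waits for the other: a classic deadlock cycle.
assert a.must_wait_for([b]) == ["cic_on_table_two"]
assert b.must_wait_for([a]) == ["cic_on_table_one"]

# Dropping our own snapshots before the wait breaks the cycle.
a.drop_snapshots()
b.drop_snapshots()
assert a.must_wait_for([b]) == []
assert b.must_wait_for([a]) == []
```

In this model, the regression discussed in the thread corresponds to drop_snapshots() clearing the reference snapshot but leaving a catalog snapshot's xmin advertised.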
Jeff Janes wrote:
> c3d09b3bd23f5f6 fixed it so concurrent CIC would not deadlock (or at least
> not as reliably as before) by dropping its own snapshot before waiting for
> all the other ones to go away.
>
> With commit 8aa3e47510b969354ea02a, concurrent CREATE INDEX CONCURRENTLY on
> different tables in the same database started deadlocking against each
> other again quite reliably.
>
> I think the solution is simply to drop the catalog snapshot as well, as in
> the attached.

Thanks for the analysis -- it sounds reasonable to me.  However, I'm
wondering why you used the *Conditionally() version instead of plain
InvalidateCatalogSnapshot().  I think they must have the same effect in
practice (the assumption being that you can't run CIC in a transaction
that may have other snapshots) but the theory seems simpler when calling
the other routine: just do away with the snapshot always, period.

This is back-patchable to 9.4, first branch which has MVCC catalog
scans.

It's strange that this has gone undetected for so long.  I wonder if
there's an interaction with logical decoding and its historical snapshot
stuff here.

I pinged Jeremy on the other thread about your fix.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 26, 2017 at 8:31 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Jeff Janes wrote:
> c3d09b3bd23f5f6 fixed it so concurrent CIC would not deadlock (or at least
> not as reliably as before) by dropping its own snapshot before waiting for
> all the other ones to go away.
>
> With commit 8aa3e47510b969354ea02a, concurrent CREATE INDEX CONCURRENTLY on
> different tables in the same database started deadlocking against each
> other again quite reliably.
>
> I think the solution is simply to drop the catalog snapshot as well, as in
> the attached.
> Thanks for the analysis -- it sounds reasonable to me.  However, I'm
> wondering why you used the *Conditionally() version instead of plain
> InvalidateCatalogSnapshot().
My thinking was that if there was for some reason another snapshot hanging around, dropping the catalog snapshot unconditionally would be a correctness bug, while doing it conditionally would just fail to avoid a theoretically avoidable deadlock. So it seemed safer.
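To make the distinction concrete, here is a rough, hypothetical Python model of the semantics being discussed — the real routines live in snapmgr.c and differ in detail, and the names below are paraphrased for illustration: the conditional variant drops the catalog snapshot only when no other snapshot is registered, while the plain variant drops it unconditionally.

```python
# Toy model (not PostgreSQL code) of a backend's snapshot bookkeeping.

class SnapshotManager:
    def __init__(self):
        self.registered = []          # xmins of other registered snapshots
        self.catalog_snapshot = None  # xmin of the catalog snapshot, if any

    def advertised_xmin(self):
        xmins = list(self.registered)
        if self.catalog_snapshot is not None:
            xmins.append(self.catalog_snapshot)
        return min(xmins) if xmins else None

    def invalidate_catalog_snapshot(self):
        # Plain variant: always throw the catalog snapshot away.
        self.catalog_snapshot = None

    def invalidate_catalog_snapshot_conditionally(self):
        # Conditional variant: only drop it when nothing else is registered,
        # i.e. when it is the sole thing pinning our advertised xmin.
        if not self.registered:
            self.invalidate_catalog_snapshot()

# CIC case: the reference snapshot is already gone, so only the catalog
# snapshot remains, and the conditional drop clears our xmin.
mgr = SnapshotManager()
mgr.catalog_snapshot = 200
mgr.invalidate_catalog_snapshot_conditionally()
assert mgr.advertised_xmin() is None

# If some other snapshot were still around, the conditional variant would
# leave the catalog snapshot alone (dropping it would not lower the xmin).
mgr2 = SnapshotManager()
mgr2.registered = [180]
mgr2.catalog_snapshot = 200
mgr2.invalidate_catalog_snapshot_conditionally()
assert mgr2.catalog_snapshot == 200
assert mgr2.advertised_xmin() == 180
```

Note that in both scenarios the advertised xmin ends up the same under either variant, which matches the observation above that the two routines should behave identically for CIC in practice.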
> I think they must have the same effect in
> practice (the assumption being that you can't run CIC in a transaction
> that may have other snapshots) but the theory seems simpler when calling
> the other routine: just do away with the snapshot always, period.
That is probably true. But I never even knew that catalog snapshots existed until yesterday, so didn't want to make assumptions about what else might exist, to avoid introducing new bugs similar to the one that 8aa3e47510b969354ea02a fixed.
> This is back-patchable to 9.4, first branch which has MVCC catalog
> scans.  It's strange that this has gone undetected for so long.
Since the purpose of CIC is to build an index with minimal impact on other users, I think wanting to use it in concurrent cases might be rather rare. In a maintenance window, I wouldn't want to use CIC because it is slower and I'd rather just hold the stronger lock and do it fast, and as a hot-fix outside a maintenance window I usually wouldn't want to hog the CPU with concurrent builds when I could do them sequentially instead. Also, since deadlocks are "expected" errors rather than "should never happen" errors, and since the docs don't promise that you can do parallel CIC without deadlocks, many people would probably shrug it off (which I initially did) rather than report it as a bug. I was looking into it as an enhancement rather than a bug until I discovered that it was already enhanced and then undone.
Cheers,
Jeff
On Tue, Dec 26, 2017 at 11:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> [quoted message trimmed]
I was able to get this compiled, and ran the test first on stock 9.6.6, then on the patched version. I reproduced the deadlock on 9.6.6, and the patched version indeed fixes my issue.
Let me know if I can be of further help.
Thanks,
Jeremy
Jeff Janes <jeff.janes@gmail.com> writes:

> [earlier quoted exchange trimmed]
>
> Since the purpose of CIC is to build an index with minimal impact on
> other users, I think wanting to use it in concurrent cases might be
> rather rare.  In a maintenance window, I wouldn't want to use CIC
> because it is slower and I'd rather just hold the stronger lock and
> do it fast, and as a hot-fix outside a maintenance window I usually
> wouldn't want to hog the CPU with concurrent builds when I could do

Hmmm, given that most/all large sites lately are probably running on hw
with dozens or perhaps hundreds of CPUs/threads, I can see DBAs not
being too concerned about "hogging".

> them sequentially instead.  Also, since deadlocks are "expected"
> errors rather than "should never happen" errors, and since the docs
> don't promise that you can do parallel CIC without deadlocks, many
> people would probably shrug it off (which I initially did) rather
> than report it as a bug.  I was looking into it as an enhancement
> rather than a bug until I discovered that it was already enhanced and
> then undone.

Agree such an edge case is not a high priority to support for the above
reasons, but good to fix assuming no breakage in some other regard :-)

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
On Wed, Dec 27, 2017 at 8:50 AM, Jeremy Finzel <finzelj@gmail.com> wrote:
> I was able to get this compiled, and ran the test before on stock 9.6.6,
> then on this patched version. I indeed reproduced it on 9.6.6, but on the
> patched version, it indeed fixes my issue.
>
> Let me know if I can be of further help.

Thanks for reporting and testing.

Cheers,

Jeff
Jeff Janes wrote:
> Since the purpose of CIC is to build an index with minimal impact on
> other users, I think wanting to use it in concurrent cases might be
> rather rare. [...]
FWIW, yes, I agree it is a rather rare use case. For me, it's for doing concurrent index builds on logical replica tables, especially after the initial copy of non-indexed tables, where replicated tables may have traffic coming in constantly. That means the DBA doesn't have to wait for hours for the indexes to build one by one, and also doesn't have to worry about long locks.

However, IIRC we have also run into deadlocks before when trying to build multiple indexes in an OLTP system, which may have been due to this issue rather than only to trying to index the same table.
On 2017-12-26 13:31:03 -0300, Alvaro Herrera wrote:
> It's strange that this has gone undetected for so long.  I wonder if
> there's an interaction with logical decoding and its historical
> snapshot stuff here.

I can't see how - did you have a vague theory you could share?

Greetings,

Andres Freund
Jeff Janes wrote:
> c3d09b3bd23f5f6 fixed it so concurrent CIC would not deadlock (or at least
> not as reliably as before) by dropping its own snapshot before waiting for
> all the other ones to go away.
>
> With commit 8aa3e47510b969354ea02a, concurrent CREATE INDEX CONCURRENTLY on
> different tables in the same database started deadlocking against each
> other again quite reliably.
>
> I think the solution is simply to drop the catalog snapshot as well, as in
> the attached.

Pushed to all affected branches, along with a somewhat lame
isolationtester test for the condition (since we've already broken this
twice and not noticed for long).

Thanks for the diagnosis and patch,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund wrote:
> On 2017-12-26 13:31:03 -0300, Alvaro Herrera wrote:
> > It's strange that this has gone undetected for so long.  I wonder if
> > there's an interaction with logical decoding and its historical
> > snapshot stuff here.
>
> I can't see how - did you have a vague theory you could share?

Not really.  I was just looking at the snapmgr.c code and wondering
whether there was a chance that the historic snapshot could be involved
in waits.  But then, as presented the failure does not use logical
decoding, so even if there's a bug there (for which I have no evidence)
it would have to be a separate one.

More generally I was wondering if there was anyplace else that would
keep a catalog snapshot registered for a long time and then go to
sleep.  Looked for it desultorily, came up with nothing.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
> Pushed to all affected branches, along with a somewhat lame
> isolationtester test for the condition (since we've already broken this
> twice and not noticed for long).

Buildfarm member okapi just failed this test in 9.4:

2018-01-04 02:44:43.481 PST [5a4e059a.3af1:6] ERROR:  deadlock detected
2018-01-04 02:44:43.481 PST [5a4e059a.3af1:7] DETAIL:  Process 15089 waits for ShareLock on virtual transaction 4/882; blocked by process 15090.
	Process 15090 waits for ShareLock on virtual transaction 3/906; blocked by process 15089.
	Process 15089: CREATE INDEX CONCURRENTLY mcic_one_pkey ON mcic_one (id) WHERE lck_shr(281457);
	Process 15090: CREATE INDEX CONCURRENTLY mcic_two_pkey ON mcic_two (id) WHERE unlck();

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-01-04%2010%3A35%3A02

The interesting thing here is that the failure seems to be just like
before the patch: each session is waiting on a snapshot from the other
session.  I suppose this must be some *other* snapshot, not the catalog
snapshot ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Awhile back, Alvaro Herrera wrote:
>> Pushed to all affected branches, along with a somewhat lame
>> isolationtester test for the condition (since we've already broken this
>> twice and not noticed for long).

> Buildfarm member okapi just failed this test in 9.4:

okapi has continued to fail that test, not 100% of the time but much
more often than not ... but only in 9.4.  And no other animals have
shown it at all.  So what to make of that?

Noting that okapi uses a pretty old icc version running at a high -O
level, we could dismiss it as probably-a-compiler-bug.  But that theory
doesn't really account for the fact that it sometimes succeeds.

Another theory, noting that 9.5 and later have memory barriers in
S_UNLOCK which 9.4 lacks, is that the reason 9.4 has a problem is lack
of a memory barrier between SnapshotResetXmin and GetCurrentVirtualXIDs,
thus allowing both processes to observe the other's xmin as still
nonzero given the right timing.  This seems like a stretch, because
really the latter function's LWLockAcquire on ProcArrayLock ought to be
enough to serialize things.  But there has to be *something* different
between 9.4 and all the later branches, and the barrier stuff sure looks
like it's in the right neighborhood.

As an investigative measure, I propose that we insert

	Assert(MyPgXact->xmin == InvalidTransactionId);

into 9.4's DefineIndex, just after its InvalidateCatalogSnapshot call.
I don't want to leave that there permanently, because it's not clear to
me that there are no legitimate cases where a backend wouldn't have
extra snapshots active during CREATE INDEX CONCURRENTLY --- but we seem
to get through 9.4's regression tests with it, and it would quickly
confirm or deny whether okapi is failing because it somehow has an
extra snapshot.

Assuming that that doesn't show anything, I'm inclined to think that
the next step should be to add a pg_memory_barrier() call to
SnapshotResetXmin (again only in the 9.4 branch), and see if that helps.

			regards, tom lane
I wrote:
> As an investigative measure, I propose that we insert
>	Assert(MyPgXact->xmin == InvalidTransactionId);
> into 9.4's DefineIndex, just after its InvalidateCatalogSnapshot call.

Well, isn't this interesting:

TRAP: FailedAssertion("!(MyPgXact->xmin == ((TransactionId) 0))", File: "indexcmds.c", Line: 777)
2018-04-16 02:41:25.814 PDT [5ad46fbd.5412:2] LOG:  server process (PID 21777) was terminated by signal 6: Aborted
2018-04-16 02:41:25.814 PDT [5ad46fbd.5412:3] DETAIL:  Failed process was running: CREATE INDEX CONCURRENTLY concur_index1 ON concur_heap(f2,f1);

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-04-16%2009%3A35%3A02

So we can now refine the problem statement to "SnapshotResetXmin isn't
doing what it's supposed to".  No idea why yet.  9.4 is using a simple
RegisteredSnapshots counter which 9.5 has replaced with a pairing heap,
so you'd think the newer code would be *more* likely to have bugs...

			regards, tom lane
I wrote:
> So we can now refine the problem statement to "SnapshotResetXmin isn't
> doing what it's supposed to".  No idea why yet.  9.4 is using a simple
> RegisteredSnapshots counter which 9.5 has replaced with a pairing heap,
> so you'd think the newer code would be *more* likely to have bugs...

It's still not entirely clear what's happening on okapi, but in the
meantime I've thought of an easily-reproducible way to cause similar
failures in any branch.  That is to run CREATE INDEX CONCURRENTLY with
default_transaction_isolation = serializable.  Then, snapmgr.c will set
up a transaction snapshot (actually identical to the "reference
snapshot" used by DefineIndex), and that will not get released, so the
process's xmin doesn't get cleared, and we have a deadlock hazard.

I experimented with running the isolation tests under "alter system set
default_transaction_isolation to serializable".  Oddly, multiple-cic
tends to not fail that way for me, though if I reduce the
isolation_schedule file to contain just that one test, it fails nine
times out of ten.  Leftover activity from the previous tests must be
messing up the timing somehow.  Anyway, the problem is definitely real.
(A couple of the other isolation tests do fail reliably under this
scenario; is it worth hardening them?)

I thought for a bit about trying to force C.I.C.'s transactions to be
run with a lower transaction isolation level, but that seems messy and
I'm not very sure it wouldn't have bad side-effects.  A much simpler
fix is to just start YA transaction before waiting, as in the attached
proposed patch.  (With the transaction restart, I feel sufficiently
confident that there should be no open snapshots that it seems okay to
put in the Assert I was previously afraid to add.)

I don't know whether this would make okapi's problem go away.  But it's
seeming somewhat likely at this point that we're hitting a weird
compiler misoptimization there, and this might dodge it.
In any case this is demonstrably fixing a problem.

			regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f2dcc1c..dda0dcb 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1197,14 +1197,25 @@ DefineIndex(Oid relationId,
 	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
 	 * they must wait for.  But first, save the snapshot's xmin to use as
 	 * limitXmin for GetCurrentVirtualXIDs().
-	 *
-	 * Our catalog snapshot could have the same effect, so drop that one too.
 	 */
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-	InvalidateCatalogSnapshot();
+
+	/*
+	 * The snapshot subsystem could still contain registered snapshots that
+	 * are holding back our process's advertised xmin; in particular, if
+	 * default_transaction_isolation = serializable, there is a transaction
+	 * snapshot that is still active.  The CatalogSnapshot is likewise a
+	 * hazard.  To ensure no deadlocks, we must commit and start yet another
+	 * transaction, and do our wait before any snapshot has been taken in it.
+	 */
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* We should now definitely not be advertising any xmin. */
+	Assert(MyPgXact->xmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
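The effect of the commit-and-restart approach can be sketched with one more toy model (hypothetical Python, not backend code, with invented names): unregistering CIC's own reference snapshot is not enough when something else — such as a serializable transaction snapshot — is still registered, but ending the transaction releases everything, so the freshly started transaction advertises no xmin when the wait begins.

```python
# Toy model (not PostgreSQL code) of why a transaction restart clears xmin.

class ToyBackend:
    def __init__(self):
        self.registered_snapshots = []  # xmins of all registered snapshots
        self.advertised_xmin = None

    def register_snapshot(self, xmin):
        self.registered_snapshots.append(xmin)
        self.advertised_xmin = min(self.registered_snapshots)

    def unregister_snapshot(self, xmin):
        self.registered_snapshots.remove(xmin)
        self.advertised_xmin = (min(self.registered_snapshots)
                                if self.registered_snapshots else None)

    def commit_and_restart(self):
        # Models CommitTransactionCommand() + StartTransactionCommand():
        # the old transaction's snapshots all go away, and the new
        # transaction has taken no snapshot yet, so no xmin is advertised.
        self.registered_snapshots.clear()
        self.advertised_xmin = None

b = ToyBackend()
b.register_snapshot(500)    # serializable transaction snapshot
b.register_snapshot(500)    # CIC's reference snapshot
b.unregister_snapshot(500)  # CIC drops its reference snapshot...
assert b.advertised_xmin == 500  # ...but the txn snapshot still pins our xmin

b.commit_and_restart()
assert b.advertised_xmin is None  # now it is safe to wait
```

This is also why the Assert in the patch is safe to add only after the restart: before it, leftover registered snapshots can legitimately keep the xmin nonzero.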
Tom Lane wrote:
> It's still not entirely clear what's happening on okapi, but in the
> meantime I've thought of an easily-reproducible way to cause similar
> failures in any branch.  That is to run CREATE INDEX CONCURRENTLY
> with default_transaction_isolation = serializable.  Then, snapmgr.c
> will set up a transaction snapshot (actually identical to the
> "reference snapshot" used by DefineIndex), and that will not get
> released, so the process's xmin doesn't get cleared, and we have
> a deadlock hazard.

Hah, ouch.

> I experimented with running the isolation tests under "alter system set
> default_transaction_isolation to serializable".  Oddly, multiple-cic
> tends to not fail that way for me, though if I reduce the
> isolation_schedule file to contain just that one test, it fails nine
> times out of ten.  Leftover activity from the previous tests must be
> messing up the timing somehow.  Anyway, the problem is definitely real.
> (A couple of the other isolation tests do fail reliably under this
> scenario; is it worth hardening them?)

Yes, I think it's worth making them pass somehow -- see commits
f18795e7b74c, a0eae1a2eeb6.

> I thought for a bit about trying to force C.I.C.'s transactions to
> be run with a lower transaction isolation level, but that seems messy
> and I'm not very sure it wouldn't have bad side-effects.  A much simpler
> fix is to just start YA transaction before waiting, as in the attached
> proposed patch.  (With the transaction restart, I feel sufficiently
> confident that there should be no open snapshots that it seems okay
> to put in the Assert I was previously afraid to add.)

Seems like an acceptable fix to me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
>> It's still not entirely clear what's happening on okapi, ...

okapi has now passed two consecutive runs with elog(LOG) messages in
place between DefineIndex's snapmgr calls.  Considering that it had
failed 37 of 44 test runs since 47a3a13 went in, I think two successive
passes is sufficient evidence to conclude that we have a Heisenbug in
which the presence of debug tooling affects the result.  And that in
turn suggests strongly that it's a compiler bug.  Broken interprocedural
optimization, perhaps?  Although it'd have to be cross-file
optimization, which is more than I thought icc would do.

Anyway, at this point I'm going to give up on the debug logging, revert
9.4 to its prior state, and then see if the transaction-restart patch
makes the problem go away.

>> (A couple of the other isolation tests do fail reliably under this
>> scenario; is it worth hardening them?)

> Yes, I think it's worth making them pass somehow -- see commits
> f18795e7b74c, a0eae1a2eeb6.

Will look into that too.  I'm not sure that adding extra expected
outputs is sane, though --- might be best to just force the intended
isolation level within those tests.

			regards, tom lane
Tom Lane wrote:
> Anyway, at this point I'm going to give up on the debug logging, revert
> 9.4 to its prior state, and then see if the transaction-restart patch
> makes the problem go away.

Agreed, thanks.

> >> (A couple of the other isolation tests do fail reliably under this
> >> scenario; is it worth hardening them?)
>
> > Yes, I think it's worth making them pass somehow -- see commits
> > f18795e7b74c, a0eae1a2eeb6.
>
> Will look into that too.  I'm not sure that adding extra expected
> outputs is sane, though --- might be best to just force the intended
> isolation level within those tests.

As I recall (not much, admittedly) that was one of the options we
considered in the old commit, but since the other isolation levels
behaved differently we agreed that it was worth adding coverage for
them.  I don't know which ones are failing now; maybe forcing a
specific isolation level is sufficient.

Clearly we should have done something to make sure these tests were run
periodically with different isolation levels.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
>>> (A couple of the other isolation tests do fail reliably under this
>>> scenario; is it worth hardening them?)

>> Yes, I think it's worth making them pass somehow -- see commits
>> f18795e7b74c, a0eae1a2eeb6.

> Will look into that too.  I'm not sure that adding extra expected
> outputs is sane, though --- might be best to just force the intended
> isolation level within those tests.

Hmm, so one of the ones that fails is lock-update-delete, which I see
*already has* an alternate output file for serializable mode ... but it
doesn't match what I get:

*** /home/postgres/pgsql/src/test/isolation/expected/lock-update-delete_1.out	Mon Feb 12 14:53:46 2018
--- /home/postgres/pgsql/src/test/isolation/output_iso/results/lock-update-delete.out	Wed Apr 18 11:30:23 2018
***************
*** 150,156 ****
  
  t
  step s1l: <... completed>
! error in steps s2_unlock s1l: ERROR:  could not serialize access due to concurrent update
  
  starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
  pg_advisory_lock
--- 150,158 ----
  
  t
  step s1l: <... completed>
! key            value
! 
! 1              1
  
  starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
  pg_advisory_lock

It looks like maybe this one wasn't updated in 533e9c6b0 --- would you
check/confirm that?

			regards, tom lane
Tom Lane wrote:
> *** /home/postgres/pgsql/src/test/isolation/expected/lock-update-delete_1.out	Mon Feb 12 14:53:46 2018
> --- /home/postgres/pgsql/src/test/isolation/output_iso/results/lock-update-delete.out	Wed Apr 18 11:30:23 2018
> ***************
> *** 150,156 ****
>
>   t
>   step s1l: <... completed>
> ! error in steps s2_unlock s1l: ERROR:  could not serialize access due to concurrent update
>
>   starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
>   pg_advisory_lock
> --- 150,158 ----
>
>   t
>   step s1l: <... completed>
> ! key            value
> !
> ! 1              1
>
>   starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
>   pg_advisory_lock
>
> It looks like maybe this one wasn't updated in 533e9c6b0 --- would
> you check/confirm that?

I think the new output is correct in REPEATABLE READ but it represents a
bug for the SERIALIZABLE mode.  The case is: a tuple is updated, with
its key not modified; a concurrent transaction is trying to read the
tuple.  The original expected output says that the reading transaction
is aborted, which matches what the test comment says:

# When run in REPEATABLE READ or SERIALIZABLE transaction isolation levels, all
# permutations that commit s2 cause a serializability error; all permutations
# that rollback s2 can get through.

In REPEATABLE READ it seems fine to read the original version of the
tuple (returns 1) and not raise an error (the reading transaction will
simply see the value that was current when it started).  But in
SERIALIZABLE mode, as far as I understand it, this case should raise a
serializability error.  I hope I'm wrong.

Copying Kevin, just in case.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Will look into that too.  I'm not sure that adding extra expected
>> outputs is sane, though --- might be best to just force the intended
>> isolation level within those tests.

> As I recall (not much, admittedly) that was one of the options we
> considered in the old commit, but since the other isolation levels
> behaved differently we agreed that it was worth adding coverage for them.
> I don't know which ones are failing now; maybe forcing a specific
> isolation level is sufficient.

So the other one that fails for me is tuplelock-update, and a bit of
bisection testing confirms that it's never worked under serializable
isolation.  The "expected" behavior in that case is unexciting --- all
but the s1 transaction just fail with "could not serialize access".  So
I'm not convinced whether it's better to provide that as an expected
output, or to change the test to force a lower isolation level where
it's testing something useful.  Anyway, that's your test case so I'll
leave it to you to do something with it.

> Clearly we should have done something to make sure these tests were run
> periodically with different isolation levels.

Yeah :-(.  Both of these test cases have failed this scenario since
they were created.

			regards, tom lane