Обсуждение: FastPathStrongRelationLocks still has an issue in HEAD

Поиск
Список
Период
Сортировка

FastPathStrongRelationLocks still has an issue in HEAD

От
Tom Lane
Дата:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
[53418a51.6a08:2] LOG:  server process (PID 27631) was terminated by signal 6
[53418a51.6a08:3] DETAIL:  Failed process was running: create table gc1() inherits (c1);
        regards, tom lane



Re: FastPathStrongRelationLocks still has an issue in HEAD

От
Robert Haas
Дата:
On Sun, Apr 6, 2014 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
> [53418a51.6a08:2] LOG:  server process (PID 27631) was terminated by signal 6
> [53418a51.6a08:3] DETAIL:  Failed process was running: create table gc1() inherits (c1);

Uggh.  That's unfortunate, but not terribly surprising: I didn't think
that missing volatile was very likely to be the cause of this.  Have
we been getting random failures of this type since the fastlock stuff
went in, and we're only just now noticing?  Or did some recent change
expose this problem?  I'm a bit suspicious of the patches to
static-ify stuff, since that might cause the compiler to think it
could move things across function calls that it hadn't thought
move-able before, but FastPathStrongLocks references would seem to be
the obvious candidate for that, and volatile-izing it ought to have
fixed it.  I would think.

One thing I noticed, looking at this particular failure, is that at
the time that "create table gc1() inherits (c1)" failed an assertion,
another backend was inside "select blockme()", and specifically inside
of "select count(*)        from tenk1 a, tenk1 b, tenk1 c".  I can't
help but suspect that the bug is somehow concurrency-related, so the
presence of concurrent activity seems like a clue, but I can't figure
out the relationship.  blockme() shouldn't be taking any strong
relation locks.  Unless AV decided to truncate a table just then, the
process that failed should be the only one in the system with any
strong relation lock, so if there's a race, what is it racing against?

In the failure on prairiedog back on March 25th...

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03

...there's a lot more concurrent activity.  The process that failed
the assertion was running "CREATE TABLE t3 (name TEXT, n INTEGER)";
concurrently, the following other queries were running:

SELECT oid AS clsoid, relname, relnatts + 10 AS x  INTO selinto_schema.tmp2  FROM pg_class WHERE relname like '%b%';

CREATE TEMP TABLE foo (id integer);

create temp table t3 as select generate_series(-1000,1000) as x;

CREATE TEMPORARY TABLE bitwise_test( i2 INT2, i4 INT4, i8 INT8, i INTEGER, x INT2, y BIT(4)
);

DROP TABLE savepoints;

create temp table tt1(f1 int);

CREATE TEMP TABLE arrtest2 (i integer ARRAY[4], f float8[], n
numeric[], t text[], d timestamp[]);

COMMIT PREPARED 'regress-one';

All but the first and last of those take a strong relation lock, so
some kind of race could certainly account for that failure.  It's also
interesting that COMMIT PREPARED is shown as being involved here; that
code is presumably much more rarely executed than the code for the
regular commit or abort paths, and might therefore be thought more
likely to harbor a bug.  In particular, there's this code in
LockRefindAndRelease:
   /*    * Decrement strong lock count.  This logic is needed only for 2PC.    */   if (decrement_strong_lock_count
 && ConflictsWithRelationFastPath(&lock->tag, lockmode))   {       uint32      fasthashcode =
FastPathStrongLockHashPartition(hashcode);
       SpinLockAcquire(&FastPathStrongRelationLocks->mutex);       FastPathStrongRelationLocks->count[fasthashcode]--;
    SpinLockRelease(&FastPathStrongRelationLocks->mutex);   }
 

I notice that this code lacks an
Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0).  I think
we should add one.  If this code is somehow managing to decrement one
of the counts when it's already zero, the next process whose lock gets
mapped to this partition would increment the count from the maximum
value that can be stored back to zero.  Then, when it goes to release
the strong lock, it finds that the count is already zero and goes
boom.  This theory could even explain the new crash, since the COMMIT
PREPARED stuff has already happened by the point where rover_firefly
failed; since the lock tags are hashed to create fasthashcode,
variation in which objects got which OIDs due to concurrency in the
regression test could cause the failure to move around or even escape
detection altogether.  Now, even if the 2PC code is the problem, that
doesn't explain exactly what's wrong with the above logic, but it
would help narrow down where to look.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: FastPathStrongRelationLocks still has an issue in HEAD

От
Andres Freund
Дата:
On 2014-04-07 10:06:07 -0400, Robert Haas wrote:
> I'm a bit suspicious of the patches to
> static-ify stuff, since that might cause the compiler to think it
> could move things across function calls that it hadn't thought
> move-able before, but FastPathStrongLocks references would seem to be
> the obvious candidate for that, and volatile-izing it ought to have
> fixed it.  I would think.

Hm. It generally might be interesting to get a few !X86 buildfarms
running builds with LTO enabled. That might expose some dangerous
assumptions more easily.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: FastPathStrongRelationLocks still has an issue in HEAD

От
Robert Haas
Дата:
On Mon, Apr 7, 2014 at 10:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-04-07 10:06:07 -0400, Robert Haas wrote:
>> I'm a bit suspicious of the patches to
>> static-ify stuff, since that might cause the compiler to think it
>> could move things across function calls that it hadn't thought
>> move-able before, but FastPathStrongLocks references would seem to be
>> the obvious candidate for that, and volatile-izing it ought to have
>> fixed it.  I would think.
>
> Hm. It generally might be interesting to get a few !X86 buildfarms
> running builds with LTO enabled. That might expose some dangerous
> assumptions more easily.

I strongly suspect that will break stuff all over the place.  We can
either get compiler barriers working for real, or we can start
volatile-izing every reference in an LWLock-protected critical
section.  Hint: the second one is insane.

That might be off-topic for this issue at hand, though...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: FastPathStrongRelationLocks still has an issue in HEAD

От
Andres Freund
Дата:
On 2014-04-07 10:45:51 -0400, Robert Haas wrote:
> On Mon, Apr 7, 2014 at 10:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Hm. It generally might be interesting to get a few !X86 buildfarms
> > running builds with LTO enabled. That might expose some dangerous
> > assumptions more easily.
> 
> I strongly suspect that will break stuff all over the place.  We can
> either get compiler barriers working for real, or we can start
> volatile-izing every reference in an LWLock-protected critical
> section.  Hint: the second one is insane.

You don't have to convince me. The way there is where I am not sure
we're agreeing.

I didn't break a few months back for x86 on light loads btw. Not that
that's saying much.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: FastPathStrongRelationLocks still has an issue in HEAD

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Apr 6, 2014 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

> Uggh.  That's unfortunate, but not terribly surprising: I didn't think
> that missing volatile was very likely to be the cause of this.

Yeah.  That was a bug, but evidently it's not the bug we're looking for.

> Have
> we been getting random failures of this type since the fastlock stuff
> went in, and we're only just now noticing?  Or did some recent change
> expose this problem?

Not sure.  I used to rely on the pgbuildfarm-status-green daily digests
to cue me to look at transient buildfarm failures, but that list has been
AWOL for months.  However, I'm pretty sure this has not been happening
ever since 9.2, so yeah, it's at least become more probable in the last
few months.

> I'm a bit suspicious of the patches to
> static-ify stuff, since that might cause the compiler to think it
> could move things across function calls that it hadn't thought
> move-able before, but FastPathStrongLocks references would seem to be
> the obvious candidate for that, and volatile-izing it ought to have
> fixed it.  I would think.

Keep in mind also that prairiedog is running a pretty old gcc (4.0.1 if
memory serves), so I'd not expect it to be doing any crazy optimizations.
I suspect we are looking at some plain old logic bug, but as you say it's
hard to guess where exactly.

> [ LockRefindAndRelease ] lacks an
> Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0).  I think
> we should add one.

Absolutely.
        regards, tom lane



Re: FastPathStrongRelationLocks still has an issue in HEAD

От
Robert Haas
Дата:
On Mon, Apr 7, 2014 at 10:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [ LockRefindAndRelease ] lacks an
>> Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0).  I think
>> we should add one.
>
> Absolutely.

Turns out there were two places missing such an assertion: the 2PC
path, and the abort-strong-lock-acquire path.  I added an assertion to
both.  In theory, if the problem is coming from either of those
places, this might even increase the frequency of buildfarm failures,
since it removes the necessity for another normal-path release to hit
the same partition afterwards.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company