Discussion: FastPathStrongRelationLocks still has an issue in HEAD
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
[53418a51.6a08:2] LOG:  server process (PID 27631) was terminated by signal 6
[53418a51.6a08:3] DETAIL:  Failed process was running: create table gc1() inherits (c1);

			regards, tom lane
On Sun, Apr 6, 2014 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
> [53418a51.6a08:2] LOG:  server process (PID 27631) was terminated by signal 6
> [53418a51.6a08:3] DETAIL:  Failed process was running: create table gc1() inherits (c1);

Uggh. That's unfortunate, but not terribly surprising: I didn't think that missing volatile was very likely to be the cause of this. Have we been getting random failures of this type since the fastlock stuff went in, and we're only just now noticing? Or did some recent change expose this problem? I'm a bit suspicious of the patches to static-ify stuff, since that might cause the compiler to think it could move things across function calls that it hadn't thought move-able before, but FastPathStrongLocks references would seem to be the obvious candidate for that, and volatile-izing it ought to have fixed it. I would think.

One thing I noticed, looking at this particular failure, is that at the time that "create table gc1() inherits (c1)" failed an assertion, another backend was inside "select blockme()", and specifically inside of "select count(*) from tenk1 a, tenk1 b, tenk1 c". I can't help but suspect that the bug is somehow concurrency-related, so the presence of concurrent activity seems like a clue, but I can't figure out the relationship. blockme() shouldn't be taking any strong relation locks. Unless autovacuum decided to truncate a table just then, the process that failed should be the only one in the system with any strong relation lock, so if there's a race, what is it racing against?

In the failure on prairiedog back on March 25th...

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03

...there's a lot more concurrent activity. The process that failed the assertion was running "CREATE TABLE t3 (name TEXT, n INTEGER)"; concurrently, the following other queries were running:

    SELECT oid AS clsoid, relname, relnatts + 10 AS x INTO selinto_schema.tmp2 FROM pg_class WHERE relname like '%b%';
    CREATE TEMP TABLE foo (id integer);
    create temp table t3 as select generate_series(-1000,1000) as x;
    CREATE TEMPORARY TABLE bitwise_test(i2 INT2, i4 INT4, i8 INT8, i INTEGER, x INT2, y BIT(4));
    DROP TABLE savepoints;
    create temp table tt1(f1 int);
    CREATE TEMP TABLE arrtest2 (i integer ARRAY[4], f float8[], n numeric[], t text[], d timestamp[]);
    COMMIT PREPARED 'regress-one';

All but the first and last of those take a strong relation lock, so some kind of race could certainly account for that failure. It's also interesting that COMMIT PREPARED is shown as being involved here; that code is presumably much more rarely executed than the code for the regular commit or abort paths, and might therefore be thought more likely to harbor a bug. In particular, there's this code in LockRefindAndRelease:

    /*
     * Decrement strong lock count.  This logic is needed only for 2PC.
     */
    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

I notice that this code lacks an Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0). I think we should add one.
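Concretely, the proposed guard would look like this (a sketch against the snippet just quoted; only the Assert line is new):

    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        /* New: a spurious decrement here would otherwise wrap the counter. */
        Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }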
If this code is somehow managing to decrement one of the counts when it's already zero, the unsigned counter would wrap around to the maximum value it can store; the next process whose lock gets mapped to this partition would then increment the count from that maximum back to zero. Then, when it goes to release its strong lock, it finds that the count is already zero and goes boom.

This theory could even explain the new crash, since the COMMIT PREPARED stuff has already happened by the point where rover_firefly failed; since the lock tags are hashed to create fasthashcode, variation in which objects got which OIDs due to concurrency in the regression test could cause the failure to move around or even escape detection altogether.

Now, even if the 2PC code is the problem, that doesn't explain exactly what's wrong with the above logic, but it would help narrow down where to look.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
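The wraparound arithmetic behind this theory, as a standalone sketch you can actually run (plain C, not PostgreSQL code; the real counters live behind a spinlock in lock.c):

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t count = 0;     /* a partition counter: no strong locks held */

        /* The hypothesized bug: an extra decrement while the count is
         * already zero.  Unsigned arithmetic wraps to the maximum value. */
        count--;
        printf("after bogus decrement: %" PRIu32 "\n", count);  /* 4294967295 */

        /* An innocent backend takes a strong lock whose tag hashes to this
         * partition: its increment wraps the counter back around to zero. */
        count++;
        printf("after next increment:  %" PRIu32 "\n", count);  /* 0 */

        /* When that backend releases its lock, it finds the count already
         * zero: the exact condition asserted against at lock.c line 1240. */
        assert(count > 0);      /* goes boom, like the buildfarm crash */
        return 0;
    }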
On 2014-04-07 10:06:07 -0400, Robert Haas wrote:
> I'm a bit suspicious of the patches to static-ify stuff, since that
> might cause the compiler to think it could move things across function
> calls that it hadn't thought move-able before, but FastPathStrongLocks
> references would seem to be the obvious candidate for that, and
> volatile-izing it ought to have fixed it. I would think.

Hm. It generally might be interesting to get a few !x86 buildfarm animals running builds with LTO enabled. That might expose some dangerous assumptions more easily.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 7, 2014 at 10:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-04-07 10:06:07 -0400, Robert Haas wrote:
>> I'm a bit suspicious of the patches to static-ify stuff, since that
>> might cause the compiler to think it could move things across function
>> calls that it hadn't thought move-able before, but FastPathStrongLocks
>> references would seem to be the obvious candidate for that, and
>> volatile-izing it ought to have fixed it. I would think.
>
> Hm. It generally might be interesting to get a few !x86 buildfarm
> animals running builds with LTO enabled. That might expose some
> dangerous assumptions more easily.

I strongly suspect that will break stuff all over the place. We can either get compiler barriers working for real, or we can start volatile-izing every reference in an LWLock-protected critical section. Hint: the second one is insane.

That might be off-topic for the issue at hand, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
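For context, a compiler barrier in the sense meant here is a construct that forbids the compiler from moving loads and stores across a point in the code, without emitting any CPU fence. A minimal GCC-flavored sketch (the macro name is illustrative, not the project's actual API at the time):

    /*
     * Compiler barrier: the empty asm with a "memory" clobber tells
     * GCC-style compilers that arbitrary memory may be read or written
     * here, so no load or store can be reordered across this point.  It
     * emits no instruction, so it constrains the compiler, not the CPU.
     */
    #define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

With such barriers built into spinlock acquire and release, references inside the critical section cannot leak out of it, which is what would make volatile-izing each individual reference unnecessary.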
On 2014-04-07 10:45:51 -0400, Robert Haas wrote:
> On Mon, Apr 7, 2014 at 10:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Hm. It generally might be interesting to get a few !x86 buildfarm
>> animals running builds with LTO enabled. That might expose some
>> dangerous assumptions more easily.
>
> I strongly suspect that will break stuff all over the place. We can
> either get compiler barriers working for real, or we can start
> volatile-izing every reference in an LWLock-protected critical
> section. Hint: the second one is insane.

You don't have to convince me; it's how to get there that I'm not sure we're agreeing on. It didn't break anything when I tried it a few months back, on x86 under light load, btw. Not that that's saying much.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Apr 6, 2014 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

> Uggh. That's unfortunate, but not terribly surprising: I didn't think
> that missing volatile was very likely to be the cause of this.

Yeah. That was a bug, but evidently it's not the bug we're looking for.

> Have we been getting random failures of this type since the fastlock
> stuff went in, and we're only just now noticing? Or did some recent
> change expose this problem?

Not sure. I used to rely on the pgbuildfarm-status-green daily digests to cue me to look at transient buildfarm failures, but that list has been AWOL for months. However, I'm pretty sure this has not been happening ever since 9.2, so yeah, it's at least become more probable in the last few months.

> I'm a bit suspicious of the patches to static-ify stuff, since that
> might cause the compiler to think it could move things across function
> calls that it hadn't thought move-able before, but FastPathStrongLocks
> references would seem to be the obvious candidate for that, and
> volatile-izing it ought to have fixed it. I would think.

Keep in mind also that prairiedog is running a pretty old gcc (4.0.1, if memory serves), so I'd not expect it to be doing any crazy optimizations. I suspect we are looking at some plain old logic bug, but as you say, it's hard to guess where exactly.

> [ LockRefindAndRelease ] lacks an
> Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0). I think
> we should add one.

Absolutely.

			regards, tom lane
On Mon, Apr 7, 2014 at 10:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [ LockRefindAndRelease ] lacks an
>> Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0). I think
>> we should add one.
>
> Absolutely.

Turns out there were two places missing such an assertion: the 2PC path and the abort-strong-lock-acquire path. I added an assertion to both. In theory, if the problem is coming from either of those places, this might even increase the frequency of buildfarm failures, since it removes the necessity for another normal-path release to hit the same partition afterwards.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
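For reference, the same guard in the second of those two places, the abort path, would look roughly like this; the sketch below is reconstructed from the discussion above rather than from the actual commit, and locallock->hashcode is an assumption about where the lock's hash is kept:

    /*
     * Sketch of the abort-strong-lock-acquire path: we bumped the
     * strong-lock count for this partition but then abandoned the
     * acquisition, so undo the bump.  Our own increment is still
     * reflected in the counter, so it must be positive here.
     */
    fasthashcode = FastPathStrongLockHashPartition(locallock->hashcode);

    SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
    Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);   /* new */
    FastPathStrongRelationLocks->count[fasthashcode]--;
    SpinLockRelease(&FastPathStrongRelationLocks->mutex);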