Discussion: Some bogus results from prairiedog


Some bogus results from prairiedog

From
Tom Lane
Date:
According to
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
but there's not a lot of evidence as to why in the buildfarm report,
because the postmaster log file is truncated well before where things got
interesting.  Fortunately, I was able to capture a copy of check.log
before it got overwritten by the next run.  I find that the place where
the webserver report stops matches this section of check.log:

[53cd99bb.134a:158] LOG:  statement: create index test_range_gist_idx on test_range_gist using gist (ir);
[53cd99bb.134a:159] LOG:  statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG:  statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
[53cd99ba.1344:330] LOG:  statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
[53cd99ba.1344:331] LOG:  statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');

The ^@'s represent nul bytes, which I find runs of elsewhere in the file
as well.  I think they are an artifact of OS X buffering policy caused by
multiple processes writing into the same file without any interlocks.
Perhaps we ought to consider making buildfarm runs use the logging
collector by default?  But in any case, it seems uncool that either the
buildfarm log-upload process, or the buildfarm web server, is unable to
cope with log files containing nul bytes.
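
For illustration, here's a minimal sketch of one way such nul runs can arise: when unsynchronized writers each track their own file offset, a write that lands past the current end of file leaves a hole, and holes read back as nul bytes. (This is only a guess at the mechanism; the file name and offsets below are made up.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("demo.log", O_CREAT | O_TRUNC | O_RDWR, 0644);
    char    buf[64] = {0};

    /* Writer A appends 19 bytes; the file is now 19 bytes long. */
    write(fd, "line from writer A\n", 19);

    /*
     * Writer B believes its offset is 100 (stale bookkeeping) and writes
     * there; bytes 19..99 become a hole that reads back as nul bytes.
     */
    pwrite(fd, "line from writer B\n", 19, 100);

    lseek(fd, 19, SEEK_SET);
    read(fd, buf, sizeof(buf));
    printf("byte right after A's line: %d (a nul)\n", buf[0]);

    close(fd);
    return 0;
}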

Anyway, to cut to the chase, the crash seems to be from this:

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)

which the postmaster shortly later says this about:

[53cd99b6.130e:2] LOG:  server process (PID 5230) was terminated by signal 6: Abort trap
[53cd99b6.130e:3] DETAIL:  Failed process was running: ROLLBACK PREPARED 'foo1';
[53cd99b6.130e:4] LOG:  terminating any other active server processes

So there is still something rotten in the fastpath lock logic.
        regards, tom lane



Re: Some bogus results from prairiedog

From
Andrew Dunstan
Date:
On 07/22/2014 12:24 AM, Tom Lane wrote:
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting.  Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run.  I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG:  statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG:  statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
>
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG:  statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG:  statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG:  statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well.  I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default?  But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.


The data is there, just click on the "check" stage link at the top of 
the page to see it in raw form.

cheers

andrew





Re: Some bogus results from prairiedog

From
Robert Haas
Date:
On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting.  Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run.  I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG:  statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG:  statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
>
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG:  statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG:  statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG:  statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well.  I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default?  But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.
>
> Anyway, to cut to the chase, the crash seems to be from this:
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>
> which the postmaster shortly later says this about:
>
> [53cd99b6.130e:2] LOG:  server process (PID 5230) was terminated by signal 6: Abort trap
> [53cd99b6.130e:3] DETAIL:  Failed process was running: ROLLBACK PREPARED 'foo1';
> [53cd99b6.130e:4] LOG:  terminating any other active server processes
>
> So there is still something rotten in the fastpath lock logic.

Gosh, that sucks.

The inconstancy of this problem would seem to suggest some kind of
locking bug rather than a flat-out concurrency issue, but it looks to
me like everything relevant is marked volatile.  But ... maybe the
problem could be in s_lock.h?  That machine is powerpc, and powerpc's
definition of S_UNLOCK looks like this:

#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock)  \
do \
{ \
    __asm__ __volatile__ (" lwsync \n"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock)  \
do \
{ \
    __asm__ __volatile__ (" sync \n"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif   /* USE_PPC_LWSYNC */

I believe Andres recently reported that this isn't enough to prevent
compiler reordering; for that, the __asm__ __volatile__ would need to
be augmented with ::: "memory".  If that's correct, then the compiler
could decide to issue the volatile store before the lwsync or sync.
Then, the CPU could decide to perform the volatile store to the
spinlock before performing the update to FastPathStrongRelationLocks.
That would explain this problem.
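
If that theory is right, a sketch of the repair would be to add the clobber to both variants, along these lines (a guess at the shape of the change, not a tested patch):

/*
 * Sketch only: the same macros as above, with a "memory" clobber added
 * so the compiler cannot move the volatile store ahead of the barrier.
 */
#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock)  \
do \
{ \
    __asm__ __volatile__ (" lwsync \n" ::: "memory"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock)  \
do \
{ \
    __asm__ __volatile__ (" sync \n" ::: "memory"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif   /* USE_PPC_LWSYNC */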

The other possible explanation (at least, AFAICS) is that
lock_twophase_postcommit() is trying to release a strong lock count
that it doesn't actually hold.  That would indicate a
strong-lock-count handling bug upstream someplace - e.g. that the
count got released when the transaction was prepared, or somesuch.  I
certainly can't rule that out, although I don't know exactly where to
look.  We could add an assertion to AtPrepare_Locks(), just before
setting locallock->holdsStrongLockCount = FALSE, that
locallock->holdsStrongLockCount is true if and only if the lock tag
and mode suggest that it ought to be.  But that's really just guessing
wildly.
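
In rough code form, that guessed assertion might look something like this (the placement inside AtPrepare_Locks() and the exact condition are speculative):

/*
 * Speculative sketch: before AtPrepare_Locks() clears the flag, check
 * that it agrees with what the lock tag and mode say it should be.
 */
Assert(locallock->holdsStrongLockCount ==
       ConflictsWithRelationFastPath(&locallock->tag.lock,
                                     locallock->tag.mode));
locallock->holdsStrongLockCount = FALSE;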

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some bogus results from prairiedog

From
Andrew Dunstan
Date:
On 07/22/2014 10:55 AM, Andrew Dunstan wrote:
>
> On 07/22/2014 12:24 AM, Tom Lane wrote:
>> According to
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55 
>>
>> prairiedog saw a crash in "make check" on the 9.4 branch earlier 
>> tonight;
>> but there's not a lot of evidence as to why in the buildfarm report,
>> because the postmaster log file is truncated well before where things 
>> got
>> interesting.  Fortunately, I was able to capture a copy of check.log
>> before it got overwritten by the next run.  I find that the place where
>> the webserver report stops matches this section of check.log:
>>
>> [53cd99bb.134a:158] LOG:  statement: create index test_range_gist_idx 
>> on test_range_gist using gist (ir);
>> [53cd99bb.134a:159] LOG:  statement: insert into test_range_gist 
>> select int4range(g, g+10) from generate_series(1,2000) g;
>>
>> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG:  statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
>> [53cd99ba.1344:330] LOG:  statement: INSERT INTO num_exp_add VALUES 
>> (7,9,'-107955289.045047420');
>> [53cd99ba.1344:331] LOG:  statement: INSERT INTO num_exp_sub VALUES 
>> (7,9,'-58101680.954952580');
>>
>> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
>> as well.  I think they are an artifact of OS X buffering policy 
>> caused by
>> multiple processes writing into the same file without any interlocks.
>> Perhaps we ought to consider making buildfarm runs use the logging
>> collector by default?  But in any case, it seems uncool that either the
>> buildfarm log-upload process, or the buildfarm web server, is unable to
>> cope with log files containing nul bytes.
>
>
> The data is there, just click on the "check" stage link at the top of 
> the page to see it in raw form.
>
>


I have made a change in the upload receiver app to escape nul bytes in 
the main log field too. This will operate prospectively.

Using the logging collector would be a larger change, but we can look at 
it if this isn't sufficient.

cheers

andrew



Re: Some bogus results from prairiedog

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, to cut to the chase, the crash seems to be from this:
>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>> So there is still something rotten in the fastpath lock logic.

> Gosh, that sucks.

> The inconstancy of this problem would seem to suggest some kind of
> locking bug rather than a flat-out concurrency issue, but it looks to
> me like everything relevant is marked volatile.

I don't think that you need any big assumptions about machine-specific
coding issues to spot the problem.  The assert in question is here:
   /*
    * Decrement strong lock count.  This logic is needed only for 2PC.
    */
   if (decrement_strong_lock_count
       && ConflictsWithRelationFastPath(&lock->tag, lockmode))
   {
       uint32    fasthashcode = FastPathStrongLockHashPartition(hashcode);

       SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
       Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
       FastPathStrongRelationLocks->count[fasthashcode]--;
       SpinLockRelease(&FastPathStrongRelationLocks->mutex);
   }

and it sure looks to me like that
"ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
shared-memory lock object you just released.  If someone else had managed
to recycle that locktable entry for some other purpose, the
ConflictsWithRelationFastPath call might incorrectly return true.

I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
something.
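
In code form, that substitution would make the block read roughly like this, using the locktag argument that LockRefindAndRelease() already has on its stack:

/*
 * Proposed fix (per the substitution above): test the caller-supplied
 * locktag, which is still valid, rather than lock->tag, which points
 * into a shared hash entry that may already have been recycled.
 */
if (decrement_strong_lock_count
    && ConflictsWithRelationFastPath(locktag, lockmode))
{
    uint32    fasthashcode = FastPathStrongLockHashPartition(hashcode);

    SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
    Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
    FastPathStrongRelationLocks->count[fasthashcode]--;
    SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}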
        regards, tom lane



Re: Some bogus results from prairiedog

From
Robert Haas
Date:
On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Anyway, to cut to the chase, the crash seems to be from this:
>>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>>> So there is still something rotten in the fastpath lock logic.
>
>> Gosh, that sucks.
>
>> The inconstancy of this problem would seem to suggest some kind of
>> locking bug rather than a flat-out concurrency issue, but it looks to
>> me like everything relevant is marked volatile.
>
> I don't think that you need any big assumptions about machine-specific
> coding issues to spot the problem.

I don't think that I'm making what could be described as big
assumptions; I think we should fix and back-patch the PPC64 spinlock
change.

But...

> The assert in question is here:
>
>     /*
>      * Decrement strong lock count.  This logic is needed only for 2PC.
>      */
>     if (decrement_strong_lock_count
>         && ConflictsWithRelationFastPath(&lock->tag, lockmode))
>     {
>         uint32    fasthashcode = FastPathStrongLockHashPartition(hashcode);
>
>         SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
>         Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
>         FastPathStrongRelationLocks->count[fasthashcode]--;
>         SpinLockRelease(&FastPathStrongRelationLocks->mutex);
>     }
>
> and it sure looks to me like that
> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
> shared-memory lock object you just released.  If someone else had managed
> to recycle that locktable entry for some other purpose, the
> ConflictsWithRelationFastPath call might incorrectly return true.
>
> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
> something.

...this is probably the real cause of the failures we've actually been
seeing.  I'll go back-patch that change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Some bogus results from prairiedog

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> and it sure looks to me like that
>> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
>> shared-memory lock object you just released.  If someone else had managed
>> to recycle that locktable entry for some other purpose, the
>> ConflictsWithRelationFastPath call might incorrectly return true.
>> 
>> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
>> something.

> ...this is probably the real cause of the failures we've actually been
> seeing.  I'll go back-patch that change.

For the archives' sake --- I took another look at the two previous
instances we'd seen of FastPathStrongRelationLocks assertion failures:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

Neither of these tripped at the bug site in LockRefindAndRelease, but
that's not so surprising, because there was no Assert there at the time.
An occurrence of the bug would have silently left a negative entry in the
FastPathStrongRelationLocks->count[] array, which would have triggered an
assertion by the next process unlucky enough to use the same array entry.

In the prairiedog case, the assert happened in a test running concurrently
with the prepared_xacts test (which is presumably where the bug occurred),
so that this seems a highly plausible explanation.  In the rover_firefly
case, the assert happened quite a bit later than prepared_xacts; but it
might've been just luck that that particular array entry (out of 1024)
wasn't re-used for awhile.

So it's not certain that this one bug explains all three cases,
but it seems well within the bounds of plausibility that it does.
Also, the narrow width of the race condition window (from
LWLockRelease(partitionLock) to an inlined test in the next line)
is consistent with the very low failure rate we've seen in the buildfarm.

BTW, I wonder whether it would be interesting for testing purposes to
have a build option that causes SpinLockRelease and LWLockRelease to
do a sched_yield before returning.  One could assume that that would
greatly increase the odds of detecting this type of bug.
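
A minimal sketch of how such a build option could look for the spinlock side (CHAOS_YIELD is a made-up flag name; the real SpinLockRelease lives in spin.h, and LWLockRelease would need an equivalent sched_yield() call at the end of the function):

#ifdef CHAOS_YIELD
#include <sched.h>

/*
 * Hypothetical testing aid: give up the CPU immediately after every
 * spinlock release, to widen narrow race windows like the one above.
 */
#define SpinLockRelease(lock) \
    do { \
        S_UNLOCK(lock); \
        sched_yield(); \
    } while (0)
#else
#define SpinLockRelease(lock) S_UNLOCK(lock)
#endif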
        regards, tom lane