On 08/14/2018 01:49 PM, Tomas Vondra wrote:
> On 08/13/2018 04:49 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2018-08-13 11:46:30 -0300, Alvaro Herrera wrote:
>>> On 2018-Aug-11, Tomas Vondra wrote:
>>>
>>>> Hmmm, it's difficult to compare "bt full" output, but my backtraces look
>>>> somewhat different (and all the backtraces I'm seeing are 100% exactly
>>>> the same). Attached for comparison.
>>>
>>> Hmm, looks similar enough to me -- at the bottom you have the executor
>>> doing its thing, then an AcceptInvalidationMessages in the middle
>>> section atop which sit a few more catalog accesses, and further up from
>>> that you have another AcceptInvalidationMessages with more catalog
>>> accesses. AFAICS that's pretty much the same thing Andres was
>>> describing.
>>
>> It's somewhat different because it doesn't seem to involve a reload of a
>> nailed table, which my traces did. I wasn't able to reproduce the crash
>> more than once, so I'm not at all sure how to properly verify the issue.
>> I'd appreciate if Thomas could try to do so again with the small patch I
>> provided.
>>
>
> I'll try in the evening. I've tried reproducing it on my laptop, but I
> can't make that happen for some reason - I know I've seen some crashes
> here, but all the reproducers were from the workstation I have at home.
>
> I wonder if there's some subtle difference between the two boxes, making
> it more likely on one of them ... the whole environment (distribution,
> packages, compiler, ...) should be exactly the same, though. The only
> thing I can think of is different CPU speed, possibly making some race
> conditions more/less likely. No idea.
>

I take that back - I can reproduce the crashes, both with and without
the patch, all the way back to 9.6. Attached is a bunch of backtraces
from various versions. There's a bit of variability depending on which
pgbench script gets started first (insert.sql or vacuum.sql) - in one
case (when vacuum is started before insert) the crash happens in
InitPostgres/RelationCacheInitializePhase3, otherwise it happens in
exec_simple_query.

Another observation is that the failing COPY is not necessary; I can
reproduce the crashes without it (so even with wal_level=replica).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services