Обсуждение: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

Поиск
Список
Период
Сортировка

Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Tom Lane
Дата:
The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
contrib/test_decoding with

TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)

for example here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09
and here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05
        regards, tom lane



Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Andres Freund
Дата:
Hi,

On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
> contrib/test_decoding with
>
> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)
>
> for example here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09
> and here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05

Yes, I've been trying to debug that. I've finally managed to reproduce
locally. Unfortunately it's not directly reproducible on my laptop...

A fair amount of tinkering later I've found out that it's not actually
CLOBBER_CACHE_ALWAYS itself that triggers the problem but catchup
interrupts. It triggers with CLOBBER_CACHE_ALWAYS because that happens
to make parts of the code slow enough to not reach a ordinary
AcceptInvalidationMessages() in time.

The problem is that in the added regression test there can be a
situation where there's a relcache entry that's *not* currently visible
but still referenced. Consider:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE somechange(id serial primary key);
INSERT INTO changeresult   SELECT data FROM pg_logical_slot_get_changes(...);

While get_changes stores it data into a tuplestore there can be a moment
where 'changeresult' has a refcnt > 0 due to the INSERT, but is
invisible because we're using a historic snapshot from when changeresult
doesn't exist.

Without catchup invalidations and/or an outside reference to a relation
that's normally not a problem because it won't get reloaded from the
caches at an inappropriate time while invisible. Until a few weeks ago
there was no regression test covering that case which is why these
crashes are only there now.

It triggers via:
RelationCacheInvalidate() -> RelationClearRelation(relation, true) ->       /* Build temporary entry, but don't link it
intohashtable */       newrel = RelationBuildDesc(save_relid, false);       if (newrel == NULL)       {           /*
Shouldonly get here if relation was deleted */           RelationCacheDelete(relation);
RelationDestroyRelation(relation,false);           elog(ERROR, "relation %u deleted while still in use", save_relid);
   }
 

That path is only hit while refcnt > 0

RelationDestroyRelation() has   Assert(RelationHasReferenceCountZero(relation));

So that path doesn't really work... Without assertions we'd "just" get a
transient ERROR. I think that path should generally loose the
RelationDestroyRelation() - if it's referenced that's surely not the
right thing to do.  I'm not actually convinced logical decoding is the
only way that assert can be reached.

Since logical decoding happens inside a subtransaction (when called via
SQL) and invalidate caches one relatively straightforward way to fix
this would be to add something roughly akin to:    Assert(!relation->rd_isvalid)    /*     * This should only happen
whenwe're doing logical decoding where     * it can happen when [above].     */    if (HistoricSnapshotActive())
return;

There's no chance that such a entry could survive too long in an invalid
state (minus preexisting bugs independent of decoding) since we hit the
same paths that subtransaction abort hits.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Tom Lane
Дата:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
>> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
>> contrib/test_decoding with
>> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)

> Without catchup invalidations and/or an outside reference to a relation
> that's normally not a problem because it won't get reloaded from the
> caches at an inappropriate time while invisible. Until a few weeks ago
> there was no regression test covering that case which is why these
> crashes are only there now.

I've always thought that this whole idea of allowing the caches to contain
stale information was a Rube Goldberg plan that was going to bite you on
the ass eventually.  This case isn't doing anything to increase my faith
in it, and the proposed patch seems like just another kluge on top of
a rickety stack.

I think the safest fix would be to defer catchup interrupt processing
while you're in this mode.  You don't really want to be processing any
remote sinval messages at all, I'd think.
        regards, tom lane



Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Andres Freund
Дата:
On 2014-12-15 11:30:40 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
> >> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
> >> contrib/test_decoding with
> >> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)
> 
> > Without catchup invalidations and/or an outside reference to a relation
> > that's normally not a problem because it won't get reloaded from the
> > caches at an inappropriate time while invisible. Until a few weeks ago
> > there was no regression test covering that case which is why these
> > crashes are only there now.
> 
> I've always thought that this whole idea of allowing the caches to contain
> stale information was a Rube Goldberg plan that was going to bite you on
> the ass eventually.

I guess we'll see. I don't think this specific case is that dramatic.
The complication here is that there's a reference to a relation from
outside logical decoding - that's not something actually likely to be
used at least by replication solutions. It's also impossible to hit from
the walsender interface.

We could just prohibit referencing relations like that... The SQL
interface primarily is used for regression tests and discovery of the
feature. At least as long there's no way to do streaming from SQL which
I can't really see how to do.

> This case isn't doing anything to increase my faith
> in it, and the proposed patch seems like just another kluge on top of
> a rickety stack.

Another possibility would be to replace
else if (!IsTransactionState())
by
else if (!IsTransactionState() || HistoricSnapshotActive())

as that pretty much what we need - invalidations during decoding happen
either outside of a valid transaction state or when entries aren't
referenced anyway.

Btw, if ever hit that code path would Assert() out without logical
decoding as well - it's not allowed to Destroy() a relation that's still
referenced as that assert shows. It's an especially bad idea if the
reference is actually from outside a subxact and the error is caught. I
think we should just remove the Destroy() call - the normal refcount
and/or resowners will take care of deleting the entry later.

> I think the safest fix would be to defer catchup interrupt processing
> while you're in this mode.  You don't really want to be processing any
> remote sinval messages at all, I'd think.

Well, we need to do relmap, smgr and similar things. So I think that'd
be more complicated than we want.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Andrew Dunstan
Дата:
On 12/15/2014 12:04 PM, Andres Freund wrote:

>> I think the safest fix would be to defer catchup interrupt processing
>> while you're in this mode.  You don't really want to be processing any
>> remote sinval messages at all, I'd think.
> Well, we need to do relmap, smgr and similar things. So I think that'd
> be more complicated than we want.
>



Where are we on this? Traffic seems to have gone quite but we still have 
a bunch of buildfarm animals red.

cheers

andrew



Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Andres Freund
Дата:
On January 4, 2015 9:51:43 PM CET, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>On 12/15/2014 12:04 PM, Andres Freund wrote:
>
>>> I think the safest fix would be to defer catchup interrupt
>processing
>>> while you're in this mode.  You don't really want to be processing
>any
>>> remote sinval messages at all, I'd think.
>> Well, we need to do relmap, smgr and similar things. So I think
>that'd
>> be more complicated than we want.
>>
>
>
>
>Where are we on this? Traffic seems to have gone quite but we still
>have 
>a bunch of buildfarm animals red.

I've a simple fix (similar too what I iriginally outkined) which I plan to post soonish. I've tried a bunch of things
roughlyin the vein of Tom's suggestions, but they all are more invasive and still incomplete.
 

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

От
Andres Freund
Дата:
On 2015-01-04 21:54:40 +0100, Andres Freund wrote:
> On January 4, 2015 9:51:43 PM CET, Andrew Dunstan <andrew@dunslane.net> wrote:
> >
> >On 12/15/2014 12:04 PM, Andres Freund wrote:
> >
> >>> I think the safest fix would be to defer catchup interrupt
> >processing
> >>> while you're in this mode.  You don't really want to be processing
> >any
> >>> remote sinval messages at all, I'd think.
> >> Well, we need to do relmap, smgr and similar things. So I think
> >that'd
> >> be more complicated than we want.
> >>
> >
> >
> >
> >Where are we on this? Traffic seems to have gone quite but we still
> >have
> >a bunch of buildfarm animals red.
>
> I've a simple fix (similar too what I iriginally outkined) which I
> plan to post soonish. I've tried a bunch of things roughly in the vein
> of Tom's suggestions, but they all are more invasive and still
> incomplete.

Attached.

Note that part 1) referenced in the commit message is actually
problematic in all branches. I think we actually should backpatch that
part all the way, because if we ever hit that case the consequences of
the current coding will be rather hard to analyze. If others think so as
well, I'm going to split the commit into two parts, so commit messages
for < 9.4 won't reference logical decoding. Since there hasn't been a
report of that error for a long while (~8.1 era), I can also live with
not backpatching further than 9.4.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения