Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly
Дата
Msg-id 609a67b3-4bff-c1c5-6e6e-9694b0e30aa8@2ndquadrant.com
обсуждение исходный текст
Ответ на logical decoding bug when mapped relation with toast contents isrewritten repeatedly  (Andres Freund <andres@anarazel.de>)
Ответы Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
Hi,

On 09/14/2018 04:10 AM, Andres Freund wrote:
> Hi,
> 
> (Tomas, CCing you because you IIRC mentioned encountered an issue like
> this)
> 

I might have mentioned an issue with this symptom recently, but that
turned out to be already fixed by da10d6a8a9 (before the minor version
with that fix got released).

> I just spent quite a while debugging an issue where running logical
> decoding yielded a:
> ERROR:  could not map filenode "base/$X/$Y" to relation OID
> error.
> 
> After discarding like 30 different theories, I have found the cause:
> 

Yeah. These issues are not exactly trivial to investigate ;-)

> During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
> toast table with actual live toasted tuples (pg_proc in my case and
> henceforth) heap inserts with the toast data happen into the new toast
> relation, triggered by:
> 
> static void
> raw_heap_insert(RewriteState state, HeapTuple tup)
> ...
>     /*
>      * If the new tuple is too big for storage or contains already toasted
>      * out-of-line attributes from some other relation, invoke the toaster.
>      *
>      * Note: below this point, heaptup is the data we actually intend to store
>      * into the relation; tup is the caller's original untoasted data.
>      */
>     if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
>     {
>         /* toast table entries should never be recursively toasted */
>         Assert(!HeapTupleHasExternal(tup));
>         heaptup = tup;
>     }
>     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
>         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
>                                          HEAP_INSERT_SKIP_FSM |
>                                          (state->rs_use_wal ?
>                                           0 : HEAP_INSERT_SKIP_WAL));
>     else
>         heaptup = tup;
> 
> 
> At that point the new toast relation does *NOT* appear to be a system
> catalog, it's just appears as an "independent" table.  Therefore we do
> not trigger, in heap_insert():
> 

Hmm, can't we change that? Recognizing the new TOAST table as a catalog
would fix the issue, no?

> /*
>  * RelationIsLogicallyLogged
>  *        True if we need to log enough information to extract the data from the
>  *        WAL stream.
>  *
>  * We don't log information for unlogged tables (since they don't WAL log
>  * anyway) and for system tables (their content is hard to make sense of, and
>  * it would complicate decoding slightly for little gain). Note that we *do*
>  * log information for user defined catalog tables since they presumably are
>  * interesting to the user...
>  */
> #define RelationIsLogicallyLogged(relation) \
>     (XLogLogicalInfoActive() && \
>      RelationNeedsWAL(relation) && \
>      !IsCatalogRelation(relation))
> 
>         /*
>          * For logical decoding, we need the tuple even if we're doing a full
>          * page write, so make sure it's included even if we take a full-page
>          * image. (XXX We could alternatively store a pointer into the FPW).
>          */
>         if (RelationIsLogicallyLogged(relation))
>         {
>             xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
>             bufflags |= REGBUF_KEEP_DATA;
>         }
> 
> i.e. the inserted toast tuple will be marked as
> XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't, because it's a
> system table. Which we currently do not allow do be logically decoded.
> 
> That normally ends up being harmless, because ReorderBufferCommit() has the
> following check:
>                     if (!RelationIsLogicallyLogged(relation))
>                         goto change_done;
> 
> but to reach that check, we first have to map the relfilenode from the
> WAL to the corresponding OID:
>                     reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
>                                                 change->data.tp.relnode.relNode);
> 
> That works correctly if there's only one rewrite - the relmapper
> contains the data for the new toast table.  But if there's been *two*
> consecutive rewrites, the relmapper *does not* contain the intermediary
> relfilenode of pg_proc.  There's no such problem for non-mapped tables,
> because historic snapshots allow us to access the relevant data, but the
> relmapper isn't mvcc.
> 
> Therefore the catalog-rewrite escape hatch of:
>                     /*
>                      * Catalog tuple without data, emitted while catalog was
>                      * in the process of being rewritten.
>                      */
>                     if (reloid == InvalidOid &&
>                         change->data.tp.newtuple == NULL &&
>                         change->data.tp.oldtuple == NULL)
>                         goto change_done;
> does not trigger and we run into:
>                     else if (reloid == InvalidOid)
>                         elog(ERROR, "could not map filenode \"%s\" to relation OID",
>                              relpathperm(change->data.tp.relnode,
>                                          MAIN_FORKNUM));
> 
> 
> commenting out this error / converting it into a warning makes this case
> harmless, but could obviously be problematic in other scenarios.
> 

Yeah, that seems like a bad idea. That error already caught a couple of
bugs (including da10d6a8a9 and this one), and I have a hunch those are
not the last ones.

> 
> I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
> option, and specify that in raw_heap_insert() iff
> RelationIsLogicallyLogged(state->rs_old_rel) or something like that.
> 
> Attached is a *prototype* patch of that approach.  Without the code
> level changes the addition to test_decoding's rewrite.sql trigger the
> bug, after it they're fixed.
> 
> 
> The only reason the scenario I was debugging hit this was that there was
> a cluster wide VACUUM FULL a couple times a day, and replication was
> several hours behind due to slow network / receiving side.
> 
> 
> Now I'm having a beer outside.
> 

After discarding 30 theories? Have two.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Antonin Houska
Дата:
Сообщение: Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: Cache lookup errors with functions manipulation object addresses