logical decoding bug when mapped relation with toast contents isrewritten repeatedly

Поиск
Список
Период
Сортировка
От Andres Freund
Тема logical decoding bug when mapped relation with toast contents isrewritten repeatedly
Дата
Msg-id 20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
обсуждение исходный текст
Ответы Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
Hi,

(Tomas, CCing you because you IIRC mentioned encountered an issue like
this)

I just spent quite a while debugging an issue where running logical
decoding yielded a:
ERROR:  could not map filenode "base/$X/$Y" to relation OID
error.

After discarding like 30 different theories, I have found the cause:

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
toast table with actual live toasted tuples (pg_proc in my case and
henceforth) heap inserts with the toast data happen into the new toast
relation, triggered by:

static void
raw_heap_insert(RewriteState state, HeapTuple tup)
...
    /*
     * If the new tuple is too big for storage or contains already toasted
     * out-of-line attributes from some other relation, invoke the toaster.
     *
     * Note: below this point, heaptup is the data we actually intend to store
     * into the relation; tup is the caller's original untoasted data.
     */
    if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
    {
        /* toast table entries should never be recursively toasted */
        Assert(!HeapTupleHasExternal(tup));
        heaptup = tup;
    }
    else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
        heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
                                         HEAP_INSERT_SKIP_FSM |
                                         (state->rs_use_wal ?
                                          0 : HEAP_INSERT_SKIP_WAL));
    else
        heaptup = tup;


At that point the new toast relation does *NOT* appear to be a system
catalog, it's just appears as an "independent" table.  Therefore we do
not trigger, in heap_insert():

/*
 * RelationIsLogicallyLogged
 *        True if we need to log enough information to extract the data from the
 *        WAL stream.
 *
 * We don't log information for unlogged tables (since they don't WAL log
 * anyway) and for system tables (their content is hard to make sense of, and
 * it would complicate decoding slightly for little gain). Note that we *do*
 * log information for user defined catalog tables since they presumably are
 * interesting to the user...
 */
#define RelationIsLogicallyLogged(relation) \
    (XLogLogicalInfoActive() && \
     RelationNeedsWAL(relation) && \
     !IsCatalogRelation(relation))

        /*
         * For logical decoding, we need the tuple even if we're doing a full
         * page write, so make sure it's included even if we take a full-page
         * image. (XXX We could alternatively store a pointer into the FPW).
         */
        if (RelationIsLogicallyLogged(relation))
        {
            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
            bufflags |= REGBUF_KEEP_DATA;
        }

i.e. the inserted toast tuple will be marked as
XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't, because it's a
system table. Which we currently do not allow do be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has the
following check:
                    if (!RelationIsLogicallyLogged(relation))
                        goto change_done;

but to reach that check, we first have to map the relfilenode from the
WAL to the corresponding OID:
                    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                                change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite - the relmapper
contains the data for the new toast table.  But if there's been *two*
consecutive rewrites, the relmapper *does not* contain the intermediary
relfilenode of pg_proc.  There's no such problem for non-mapped tables,
because historic snapshots allow us to access the relevant data, but the
relmapper isn't mvcc.

Therefore the catalog-rewrite escape hatch of:
                    /*
                     * Catalog tuple without data, emitted while catalog was
                     * in the process of being rewritten.
                     */
                    if (reloid == InvalidOid &&
                        change->data.tp.newtuple == NULL &&
                        change->data.tp.oldtuple == NULL)
                        goto change_done;
does not trigger and we run into:
                    else if (reloid == InvalidOid)
                        elog(ERROR, "could not map filenode \"%s\" to relation OID",
                             relpathperm(change->data.tp.relnode,
                                         MAIN_FORKNUM));


commenting out this error / converting it into a warning makes this case
harmless, but could obviously be problematic in other scenarios.


I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
option, and specify that in raw_heap_insert() iff
RelationIsLogicallyLogged(state->rs_old_rel) or something like that.

Attached is a *prototype* patch of that approach.  Without the code
level changes the addition to test_decoding's rewrite.sql trigger the
bug, after it they're fixed.


The only reason the scenario I was debugging hit this was that there was
a cluster wide VACUUM FULL a couple times a day, and replication was
several hours behind due to slow network / receiving side.


Now I'm having a beer outside.


Greetings,

Andres Freund

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: stat() on Windows might cause error if target file is largerthan 4GB
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Cache lookup errors with functions manipulation object addresses