logical decoding bug when mapped relation with toast contents is rewritten repeatedly
From
Andres Freund
Date:
Hi,

(Tomas, CCing you because you IIRC mentioned encountering an issue like this)

I just spent quite a while debugging an issue where running logical decoding yielded a:

    ERROR: could not map filenode "base/$X/$Y" to relation OID

After discarding like 30 different theories, I have found the cause:

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a toast table with actual live toasted tuples (pg_proc in my case, and henceforth), heap inserts with the toast data happen into the new toast relation, triggered by:

    static void
    raw_heap_insert(RewriteState state, HeapTuple tup)
    ...
        /*
         * If the new tuple is too big for storage or contains already toasted
         * out-of-line attributes from some other relation, invoke the toaster.
         *
         * Note: below this point, heaptup is the data we actually intend to store
         * into the relation; tup is the caller's original untoasted data.
         */
        if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
        {
            /* toast table entries should never be recursively toasted */
            Assert(!HeapTupleHasExternal(tup));
            heaptup = tup;
        }
        else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
            heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
                                             HEAP_INSERT_SKIP_FSM |
                                             (state->rs_use_wal ?
                                              0 : HEAP_INSERT_SKIP_WAL));
        else
            heaptup = tup;

At that point the new toast relation does *NOT* appear to be a system catalog; it just appears as an "independent" table. Therefore, in heap_insert(), we do not trigger:

    /*
     * RelationIsLogicallyLogged
     *     True if we need to log enough information to extract the data from the
     *     WAL stream.
     *
     * We don't log information for unlogged tables (since they don't WAL log
     * anyway) and for system tables (their content is hard to make sense of, and
     * it would complicate decoding slightly for little gain). Note that we *do*
     * log information for user defined catalog tables since they presumably are
     * interesting to the user...
     */
    #define RelationIsLogicallyLogged(relation) \
        (XLogLogicalInfoActive() && \
         RelationNeedsWAL(relation) && \
         !IsCatalogRelation(relation))

        /*
         * For logical decoding, we need the tuple even if we're doing a full
         * page write, so make sure it's included even if we take a full-page
         * image. (XXX We could alternatively store a pointer into the FPW).
         */
        if (RelationIsLogicallyLogged(relation))
        {
            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
            bufflags |= REGBUF_KEEP_DATA;
        }

i.e. the inserted toast tuple will be marked as XLH_INSERT_CONTAINS_NEW_TUPLE, which it shouldn't be, because it's a system table, and those we currently do not allow to be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has the following check:

    if (!RelationIsLogicallyLogged(relation))
        goto change_done;

but to reach that check, we first have to map the relfilenode from the WAL to the corresponding OID:

    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite: the relmapper contains the data for the new toast table. But if there have been *two* consecutive rewrites, the relmapper *does not* contain the intermediary relfilenode of pg_proc. There's no such problem for non-mapped tables, because historic snapshots allow us to access the relevant data, but the relmapper isn't MVCC.

Therefore the catalog-rewrite escape hatch of:

    /*
     * Catalog tuple without data, emitted while catalog was
     * in the process of being rewritten.
     */
    if (reloid == InvalidOid &&
        change->data.tp.newtuple == NULL &&
        change->data.tp.oldtuple == NULL)
        goto change_done;

does not trigger and we run into:

    else if (reloid == InvalidOid)
        elog(ERROR, "could not map filenode \"%s\" to relation OID",
             relpathperm(change->data.tp.relnode,
                         MAIN_FORKNUM));

Commenting out this error / converting it into a warning makes this case harmless, but could obviously be problematic in other scenarios.
I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL option, and specify that in raw_heap_insert() iff RelationIsLogicallyLogged(state->rs_old_rel), or something like that.

Attached is a *prototype* patch of that approach. Without the code-level changes, the additions to test_decoding's rewrite.sql trigger the bug; after them, they pass.

The only reason the scenario I was debugging hit this was that there was a cluster-wide VACUUM FULL a couple of times a day, and replication was several hours behind due to a slow network / receiving side.

Now I'm having a beer outside.

Greetings,

Andres Freund
Attachments
Re: logical decoding bug when mapped relation with toast contents is rewritten repeatedly
From
Tomas Vondra
Date:
Hi,

On 09/14/2018 04:10 AM, Andres Freund wrote:
> Hi,
>
> (Tomas, CCing you because you IIRC mentioned encountering an issue like
> this)

I might have mentioned an issue with this symptom recently, but that turned out to be already fixed by da10d6a8a9 (before the minor version with that fix got released).

> I just spent quite a while debugging an issue where running logical
> decoding yielded a:
> ERROR: could not map filenode "base/$X/$Y" to relation OID
>
> After discarding like 30 different theories, I have found the cause:

Yeah. These issues are not exactly trivial to investigate ;-)

> During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
> toast table with actual live toasted tuples (pg_proc in my case, and
> henceforth), heap inserts with the toast data happen into the new toast
> relation, triggered by:
>
> [quoted raw_heap_insert() code trimmed]
>
> At that point the new toast relation does *NOT* appear to be a system
> catalog; it just appears as an "independent" table. Therefore, in
> heap_insert(), we do not trigger:

Hmm, can't we change that? Recognizing the new TOAST table as a catalog would fix the issue, no?

> [quoted RelationIsLogicallyLogged / heap_insert() code trimmed]
>
> i.e. the inserted toast tuple will be marked as
> XLH_INSERT_CONTAINS_NEW_TUPLE, which it shouldn't be, because it's a
> system table, and those we currently do not allow to be logically decoded.
>
> That normally ends up being harmless, because ReorderBufferCommit() has
> the following check:
>     if (!RelationIsLogicallyLogged(relation))
>         goto change_done;
>
> but to reach that check, we first have to map the relfilenode from the
> WAL to the corresponding OID:
>     reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
>                                 change->data.tp.relnode.relNode);
>
> That works correctly if there's only one rewrite: the relmapper contains
> the data for the new toast table. But if there have been *two*
> consecutive rewrites, the relmapper *does not* contain the intermediary
> relfilenode of pg_proc. There's no such problem for non-mapped tables,
> because historic snapshots allow us to access the relevant data, but
> the relmapper isn't MVCC.
>
> Therefore the catalog-rewrite escape hatch does not trigger and we run
> into:
>     else if (reloid == InvalidOid)
>         elog(ERROR, "could not map filenode \"%s\" to relation OID",
>              relpathperm(change->data.tp.relnode,
>                          MAIN_FORKNUM));
>
> Commenting out this error / converting it into a warning makes this case
> harmless, but could obviously be problematic in other scenarios.

Yeah, that seems like a bad idea. That error already caught a couple of bugs (including da10d6a8a9 and this one), and I have a hunch those are not the last ones.

> I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
> option, and specify that in raw_heap_insert() iff
> RelationIsLogicallyLogged(state->rs_old_rel), or something like that.
>
> Attached is a *prototype* patch of that approach. Without the code-level
> changes, the additions to test_decoding's rewrite.sql trigger the bug;
> after them, they pass.
>
> The only reason the scenario I was debugging hit this was that there was
> a cluster-wide VACUUM FULL a couple of times a day, and replication was
> several hours behind due to a slow network / receiving side.
>
> Now I'm having a beer outside.

After discarding 30 theories? Have two.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: logical decoding bug when mapped relation with toast contents is rewritten repeatedly
From
Andres Freund
Date:
Hi,

On 2018-09-14 16:13:46 +0200, Tomas Vondra wrote:
> > I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
> > option, and specify that in raw_heap_insert() iff
> > RelationIsLogicallyLogged(state->rs_old_rel), or something like that.
> >
> > Attached is a *prototype* patch of that approach. Without the code-level
> > changes, the additions to test_decoding's rewrite.sql trigger the bug;
> > after them, they pass.
>
> Yeah, that seems like a bad idea. That error already caught a couple of
> bugs (including da10d6a8a9 and this one), and I have a hunch those are
> not the last ones.

One problem with this is that it means upgrading won't fix an existing instance of the problem, but turning the ERROR into a WARNING would. I personally think that's *NOT* enough justification for relaxing the error, given that recreating the slot would fix the issue, but I see how other people can reasonably differ. I can't really see a reasonably complex approach that solves the issue in a "have your cake and eat it too" way...

> After discarding 30 theories? Have two.

I will neither confirm nor deny. ;)

Greetings,

Andres Freund
Re: logical decoding bug when mapped relation with toast contents is rewritten repeatedly
From
Andres Freund
Date:
Hi,

On 2018-09-13 19:10:46 -0700, Andres Freund wrote:
> I just spent quite a while debugging an issue where running logical
> decoding yielded a:
> ERROR: could not map filenode "base/$X/$Y" to relation OID
>
> [analysis and quoted code trimmed]
>
> I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
> option, and specify that in raw_heap_insert() iff
> RelationIsLogicallyLogged(state->rs_old_rel), or something like that.
>
> Attached is a *prototype* patch of that approach. Without the code-level
> changes, the additions to test_decoding's rewrite.sql trigger the bug;
> after them, they pass.
>
> The only reason the scenario I was debugging hit this was that there was
> a cluster-wide VACUUM FULL a couple of times a day, and replication was
> several hours behind due to a slow network / receiving side.

I've pushed this now. I added more tests, which found an issue with the change around rewrites of non-mapped catalog tables.

Greetings,

Andres Freund