Re: REINDEX INDEX results in a crash for an index of pg_class since9.6

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: REINDEX INDEX results in a crash for an index of pg_class since9.6
Дата
Msg-id 20190425195130.hhdp5tcpapnupszq@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi,

On 2019-04-25 14:50:09 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Perhaps that's because I don't quite understand what you mean with "It
> > assumes that it's passed an already-entirely-valid relcache entry". What
> > do you mean by that / where does it assume that?
>
> Well, I can see heapam_relation_set_new_filenode touching all of these
> fields right now:
>
> rel->rd_node
> rel->rd_rel->relpersistence (and why is it looking at that rather than
> the passed-in persistence???)

Ugh.

> rel->rd_rel->relkind
> whatever RelationOpenSmgr touches
> rel->rd_smgr


> As far as I can see, there is no API restriction on what parts of the
> relcache entry it may presume are valid.  It *certainly* thinks that
> rd_rel is valid, which is rather at odds with the fact that this has
> to be called before the pg_class entry exists all (for the creation
> case) or has been updated (for the set-new-relfilenode case).

Well, that's just what we did before. And given the way the code is
structured, I am not sure I see a decent alternative that's not a
disproportionate amount of work.  I mean, heap.c's heap_create() and
heap_create_with_catalog() basically work off the Relation after the
RelationBuildLocalRelation() call, a good bit before the underlying
storage is valid.


> But my point is that we need to restrict what the function can touch
> or assume valid, if it's going to be called before the pg_class update
> happens.  And I'd rather that we did so by restricting its argument
> list so that it hasn't even got access to stuff we don't want it
> assuming valid.

OTOH, that'd mean we'd need to separately look up the amhandler, pass in
a lot more arguments etc. ISTM it'd be easier to just declare that only
the fields RelationBuildLocalRelation() sets are to be considered valid.
See the end of the email for a proposal.


> Also, in order to fix this problem, we cannot change the actual
> relcache entry contents until after we've performed the tuple update.
> So if we want the tableam code to be getting the relfilenode or
> persistence info out of the relcache entry, rather than passing
> those as standalone parameters, the call can't happen till after
> the tuple update and CCI call.  That's why I was thinking about
> splitting it into two functions.  Get the xid values, update the
> pg_class tuple, CCI, then do relation_set_new_filenode with the
> updated relcache entry would work.

I think that'd be hard for the initial relation creation. At the moment
we intentionally create the storage for the new relation before
inserting the catalog contents.

Currently the only thing that table_relation_set_new_filenode() accesses
that already is updated is the RelFileNode. I wonder if we shouldn't
change the API so that table_relation_set_new_filenode() will get a
relcache entry *without* any updates passed in, then internally does
GetNewRelFileNode() (if so desired by the AM), and returns the new rnode
via a new out parameter.  So RelationSetNewRelfilenode() would basically
work like this:

    switch (relation->rd_rel->relkind)
    {
        case RELKIND_INDEX:
        case RELKIND_SEQUENCE:
            newrelfilenode = GetNewRelFileNode(...);
            RelationCreateStorage(newrelfilenode, persistence);
            RelationOpenSmgr(relation);
            break;
        case RELKIND_RELATION:
        case RELKIND_TOASTVALUE:
        case RELKIND_MATVIEW:
            table_relation_set_new_filenode(relation, persistence,
                                            &newrnode, &freezeXid, &minmulti);
            break;
    }

    /* Now update the pg_class row. */
    if (relation->rd_rel->relkind != RELKIND_SEQUENCE)
    {
        classform->relpages = 0;    /* it's empty until further notice */
        classform->reltuples = 0;
        classform->relallvisible = 0;
    }
    classform->relfrozenxid = freezeXid;
    classform->relminmxid = minmulti;
    classform->relpersistence = persistence;

    /*
     * If we're dealing with a mapped index, pg_class.relfilenode doesn't
     * change; instead we'll have to send the update to the relation mapper.
     * But we can do so only after doing the catalog update, otherwise the
     * contents of the old data is going to be invalid.
     *
     * XXX: Can this actually validly be reached for a mapped table?
     */
    if (!RelationIsMapped(relation))
        classform->relfilenode = newrelfilenode;

    CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);

    /* now that the catalog is updated, update relmapper if necessary */
    if (RelationIsMapped(relation))
        RelationMapUpdateMap(RelationGetRelid(relation),
                             newrelfilenode,
                             relation->rd_rel->relisshared,
                             true);

    /*
     * Make the pg_class row change visible, as well as the relation map
     * change if any.  This will cause the relcache entry to get updated, too.
     */
    CommandCounterIncrement();

    // XXX: Previously we called RelationInitPhysicalAddr() in this routine
    // but I don't think that should be needed due to the CCI?

and the table AM would do the necessary
            *newrelfilenode = GetNewRelFileNode(...);
            RelationCreateStorage(*newrelfilenode, persistence);
            RelationOpenSmgr(relation);
dance *iff* it wants to do so.

That seems like it'd go towards allowing a fair bit more flexibility for
table AMs (and possibly index AMs in the future).

We'd also have to make the equivalent set of changes to
ATExecSetTableSpace(). But that seems like it'd architecturally be
cleaner anyway. I think we'd move the FlushRelationBuffers(),
GetNewRelFileNode() logic into the callback.  It'd probably make sense
to have ATExecSetTableSpace() first call
table_relation_set_new_filenode() and then call
table_relation_copy_data() with the new relfilenode, but not yet updated
relcache entry.

We don't currently allow that, but as far as I can see the current
coding of ATExecSetTableSpace() also has bad problems with system
catalog updates. It copies the data and *then* does
CatalogTupleUpdate(), but *witout* updating the reclache - which ijust
would cause the update to be lost.


I could come up with a patch for that if you want me to.


Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: jsonpath
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Segfault when restoring -Fd dump on current HEAD