Discussion: partition routing layering in nodeModifyTable.c

partition routing layering in nodeModifyTable.c

From: Andres Freund
Hi,

While discussing partition related code with David in [1], I again was
confused by the layering of partition related code in
nodeModifyTable.c.


1) How come partition routing is done outside of ExecInsert()?

            case CMD_INSERT:
                /* Prepare for tuple routing if needed. */
                if (proute)
                    slot = ExecPrepareTupleRouting(node, estate, proute,
                                                   resultRelInfo, slot);
                slot = ExecInsert(node, slot, planSlot,
                                  estate, node->canSetTag);
                /* Revert ExecPrepareTupleRouting's state change. */
                if (proute)
                    estate->es_result_relation_info = resultRelInfo;
                break;

That already seems like a layering violation, but it's made worse by
ExecUpdate() having its partition handling entirely inside it - including
another call to ExecInsert(), along with the surrounding partition setup
code.

And even worse, after all that, ExecInsert() still contains partitioning
code.

It seems to me that if we just moved the ExecPrepareTupleRouting() into
ExecInsert(), we could remove the duplication.


2) The contents of the
        /*
         * If a partition check failed, try to move the row into the right
         * partition.
         */
        if (partition_constraint_failed)

block ought to be moved to a separate function (maybe
ExecCrossPartitionUpdate or ExecMove). ExecUpdate() is already
complicated enough without dealing with the partition move.
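To make that concrete, the extraction could look roughly like the
following.  This is a sketch only: the name ExecCrossPartitionUpdate and
the parameter list are illustrative, and the real signature would follow
whatever state ExecUpdate() actually needs to hand over.

```c
/*
 * Hypothetical helper encapsulating the row-movement path of ExecUpdate():
 * delete the row from the source partition, route the new version to the
 * destination, and insert it there.  Returns true if the row was moved
 * (with any RETURNING projection in *returning); returns false if a
 * concurrent update means the caller should re-fetch and retry.
 */
static bool
ExecCrossPartitionUpdate(ModifyTableState *mtstate,
                         ResultRelInfo *resultRelInfo,
                         ItemPointer tupleid,
                         HeapTuple oldtuple,
                         TupleTableSlot *slot,
                         TupleTableSlot *planSlot,
                         EPQState *epqstate,
                         bool canSetTag,
                         TupleTableSlot **returning);

/* ExecUpdate() would then only retain: */
        if (partition_constraint_failed)
        {
            TupleTableSlot *ret;

            if (ExecCrossPartitionUpdate(mtstate, resultRelInfo,
                                         tupleid, oldtuple, slot, planSlot,
                                         epqstate, canSetTag, &ret))
                return ret;
            /* else: tuple was concurrently updated; fall through to retry */
        }
```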


3) How come we reset estate->es_result_relation_info after partition
   routing, but not the mtstate-wide changes made by
   ExecPrepareTupleRouting()?  Note that its comment says:

 * Caller must revert the estate changes after executing the insertion!
 * In mtstate, transition capture changes may also need to be reverted.

ExecUpdate() contains

            /*
             * Updates set the transition capture map only when a new subplan
             * is chosen.  But for inserts, it is set for each row. So after
             * INSERT, we need to revert back to the map created for UPDATE;
             * otherwise the next UPDATE will incorrectly use the one created
             * for INSERT.  So first save the one created for UPDATE.
             */
            if (mtstate->mt_transition_capture)
                saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

but as I read the code, that's not really true?  It's
ExecPrepareTupleRouting() that does so, and that's called directly in ExecUpdate().


4)
    /*
     * If this insert is the result of a partition key update that moved the
     * tuple to a new partition, put this row into the transition NEW TABLE,
     * if there is one. We need to do this separately for DELETE and INSERT
     * because they happen on different tables.
     */
    ar_insert_trig_tcs = mtstate->mt_transition_capture;
    if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
        && mtstate->mt_transition_capture->tcs_update_new_table)
    {
        ExecARUpdateTriggers(estate, resultRelInfo, NULL,
                             NULL,
                             slot,
                             NULL,
                             mtstate->mt_transition_capture);

        /*
         * We've already captured the NEW TABLE row, so make sure any AR
         * INSERT trigger fired below doesn't capture it again.
         */
        ar_insert_trig_tcs = NULL;
    }

    /* AFTER ROW INSERT Triggers */
    ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
                         ar_insert_trig_tcs);

Besides not using the just-defined ar_insert_trig_tcs and instead
repeatedly referring to mtstate->mt_transition_capture, wouldn't this be
easier to understand if this were an if/else, instead of resetting
ar_insert_trig_tcs?  If the block were

    /*
     * Triggers behave differently depending on whether this insert is
     * part of a partition move, or an insert proper.
     */
    if (mtstate->operation == CMD_UPDATE)
    {
        /*
         * If this insert is the result of a partition key update that moved the
         * tuple to a new partition, put this row into the transition NEW TABLE,
         * if there is one. We need to do this separately for DELETE and INSERT
         * because they happen on different tables.
         */
        ExecARUpdateTriggers(estate, resultRelInfo, NULL,
                             NULL,
                             slot,
                             NULL,
                             mtstate->mt_transition_capture);

        /*
         * But we do want to fire plain per-row INSERT triggers on the
         * new table. By not passing in transition_capture we prevent
         * ....
         */
         ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
                              NULL);
    }
    else
    {
        /* AFTER ROW INSERT Triggers */
        ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
                             ar_insert_trig_tcs);
    }

it seems like it'd be quite a bit clearer (although I do think the
comments also need a fair bit of polishing independent of this proposed
change).


Greetings,

Andres Freund


[1] https://www.postgresql.org/message-id/CAKJS1f-YObQJTbncGJGRZ6gSFiS%2Bgw_Y5kvrpR%3DvEnFKH17AVA%40mail.gmail.com



Re: partition routing layering in nodeModifyTable.c

From: Amit Langote
Hi Andres,

On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:
> 1) How come partition routing is done outside of ExecInsert()?
>
>             case CMD_INSERT:
>                 /* Prepare for tuple routing if needed. */
>                 if (proute)
>                     slot = ExecPrepareTupleRouting(node, estate, proute,
>                                                    resultRelInfo, slot);
>                 slot = ExecInsert(node, slot, planSlot,
>                                   estate, node->canSetTag);
>                 /* Revert ExecPrepareTupleRouting's state change. */
>                 if (proute)
>                     estate->es_result_relation_info = resultRelInfo;
>                 break;
>
> That already seems like a layering violation,

The decision to move partition routing out of ExecInsert() came about
when we encountered a bug [1] whereby ExecInsert() would fail to reset
estate->es_result_relation_info back to the root table if it had to
take an abnormal path out (early return), of which there are quite a
few instances.  The first solution I came up with was to add a goto
label for the code to reset estate->es_result_relation_info and jump
to it from the various places that do an early return, which was
complained about as reducing readability.  So, the solution we
eventually settled on in 6666ee49f was to perform the ResultRelInfo
switching at a higher level.

> but it's made worse by
> ExecUpdate() having its partition handling solely inside - including another
> call to ExecInsert(), including the surrounding partition setup code.
>
> And even worse, after all that, ExecInsert() still contains partitioning
> code.

AFAIK, it's only to check the partition constraint when necessary.
Partition routing complexity is totally outside, but based on what you
write in point 4 below, there's a bit more...

> It seems to me that if we just moved the ExecPrepareTupleRouting() into
> ExecInsert(), we could remove the duplication.

I agree that there's duplication here.  Given what I wrote above, I
can think of doing this: move all of ExecInsert()'s code into
ExecInsertInternal() and make the former instead look like this:

static TupleTableSlot *
ExecInsert(ModifyTableState *mtstate,
           TupleTableSlot *slot,
           TupleTableSlot *planSlot,
           EState *estate,
           bool canSetTag)
{
    PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
    ResultRelInfo *resultRelInfo = estate->es_result_relation_info;

    /* Prepare for tuple routing if needed. */
    if (proute)
        slot = ExecPrepareTupleRouting(mtstate, estate, proute, resultRelInfo,
                                       slot);

    slot = ExecInsertInternal(mtstate, slot, planSlot, estate, canSetTag);

    /* Revert ExecPrepareTupleRouting's state change. */
    if (proute)
        estate->es_result_relation_info = resultRelInfo;

    return slot;
}

> 2) The contents of the
>         /*
>          * If a partition check failed, try to move the row into the right
>          * partition.
>          */
>         if (partition_constraint_failed)
>
> block ought to be moved to a separate function (maybe
> ExecCrossPartitionUpdate or ExecMove). ExecUpdate() is already
> complicated enough without dealing with the partition move.

I tend to agree with this.  Adding Amit Khandekar in case he wants to
chime in about this.

> 3) How come we reset estate->es_result_relation_info after partition
>    routing, but not the mtstate wide changes by
>    ExecPrepareTupleRouting()?  Note that its comment says:
>
>  * Caller must revert the estate changes after executing the insertion!
>  * In mtstate, transition capture changes may also need to be reverted.
>
> ExecUpdate() contains
>
>             /*
>              * Updates set the transition capture map only when a new subplan
>              * is chosen.  But for inserts, it is set for each row. So after
>              * INSERT, we need to revert back to the map created for UPDATE;
>              * otherwise the next UPDATE will incorrectly use the one created
>              * for INSERT.  So first save the one created for UPDATE.
>              */
>             if (mtstate->mt_transition_capture)
>                 saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> but as I read the code, that's not really true?  It's
> ExecPrepareTupleRouting() that does so, and that's called directly in ExecUpdate().

Calling ExecPrepareTupleRouting() is considered a part of a given
INSERT operation, so anything it does is to facilitate the INSERT.  In
this case, which map to assign to tcs_map can only be determined after
a partition is chosen and determining the partition (routing) is a job
of ExecPrepareTupleRouting().  Perhaps, we need to update the comment
here a bit.

> 4)
>     /*
>      * If this insert is the result of a partition key update that moved the
>      * tuple to a new partition, put this row into the transition NEW TABLE,
>      * if there is one. We need to do this separately for DELETE and INSERT
>      * because they happen on different tables.
>      */
>     ar_insert_trig_tcs = mtstate->mt_transition_capture;
>     if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
>         && mtstate->mt_transition_capture->tcs_update_new_table)
>     {
>         ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>                              NULL,
>                              slot,
>                              NULL,
>                              mtstate->mt_transition_capture);
>
>         /*
>          * We've already captured the NEW TABLE row, so make sure any AR
>          * INSERT trigger fired below doesn't capture it again.
>          */
>         ar_insert_trig_tcs = NULL;
>     }
>
>     /* AFTER ROW INSERT Triggers */
>     ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
>                          ar_insert_trig_tcs);
>
> Besides not using the just defined ar_insert_trig_tcs and instead
> repeatedly referring to mtstate->mt_transition_capture, wouldn't this be
> a easier to understand if the were an if/else, instead of resetting
> ar_insert_trig_tcs?  If the block were
>
>     /*
>      * triggers behave differently depending on this being a delete as
>      * part of a partion move, or a deletion proper.
>     if (mtstate->operation == CMD_UPDATE)
>     {
>         /*
>          * If this insert is the result of a partition key update that moved the
>          * tuple to a new partition, put this row into the transition NEW TABLE,
>          * if there is one. We need to do this separately for DELETE and INSERT
>          * because they happen on different tables.
>          */
>         ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>                              NULL,
>                              slot,
>                              NULL,
>                              mtstate->mt_transition_capture);
>
>         /*
>          * But we do want to fire plain per-row INSERT triggers on the
>          * new table. By not passing in transition_capture we prevent
>          * ....
>          */
>          ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
>                               NULL);
>     }
>     else
>     {
>         /* AFTER ROW INSERT Triggers */
>         ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
>                              ar_insert_trig_tcs);
>     }

Maybe you meant to use mtstate->mt_transition_capture instead of
ar_insert_trig_tcs in the else block.  We don't need
ar_insert_trig_tcs at all.

> it seems like it'd be quite a bit clearer (although I do think the
> comments also need a fair bit of polishing independent of this proposed
> change).

Fwiw, I agree with your proposed restructuring, although I'd let Amit
Kh chime in as he'd be more familiar with this code.  I wasn't aware
of this partitioning-related bit being present in ExecInsert().

Would you like me to write a patch for some or all items?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/flat/0473bf5c-57b1-f1f7-3d58-455c2230bc5f%40lab.ntt.co.jp



Re: partition routing layering in nodeModifyTable.c

From: Andres Freund
Hi,

On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:
> > 1) How come partition routing is done outside of ExecInsert()?
> >
> >             case CMD_INSERT:
> >                 /* Prepare for tuple routing if needed. */
> >                 if (proute)
> >                     slot = ExecPrepareTupleRouting(node, estate, proute,
> >                                                    resultRelInfo, slot);
> >                 slot = ExecInsert(node, slot, planSlot,
> >                                   estate, node->canSetTag);
> >                 /* Revert ExecPrepareTupleRouting's state change. */
> >                 if (proute)
> >                     estate->es_result_relation_info = resultRelInfo;
> >                 break;
> >
> > That already seems like a layering violation,
> 
> The decision to move partition routing out of ExecInsert() came about
> when we encountered a bug [1] whereby ExecInsert() would fail to reset
> estate->es_result_relation_info back to the root table if it had to
> take an abnormal path out (early return), of which there are quite a
> few instances.  The first solution I came up with was to add a goto
> label  for the code to reset estate->es_result_relation_info and jump
> to it from the various places that do an early return, which was
> complained about as reducing readability.  So, the solution we
> eventually settled on in 6666ee49f was to perform ResultRelInfos
> switching at a higher level.

I think that was the wrong path, given that the code now lives in
multiple places, without even a comment explaining that if one copy has
to be changed, the other has to be changed too.


> > It seems to me that if we just moved the ExecPrepareTupleRouting() into
> > ExecInsert(), we could remove the duplication.
> 
> I agree that there's duplication here.  Given what I wrote above, I
> can think of doing this: move all of ExecInsert()'s code into
> ExecInsertInternal() and make the former instead look like this:

For me just having the gotos is cleaner than that here.

But perhaps the right fix would be to not have ExecPrepareTupleRouting()
change global state at all, and instead change it much more locally
inside ExecInsert(), around the calls that need it to be set
differently.

Or perhaps the actually correct fix is to remove es_result_relation_info
altogether, and just pass it down to the places that need it - we've a
lot more code setting it than using the value.  And it'd not be hard to
actually pass it to the places that read it.  Given all the
setting/resetting of it, it's pretty obvious that a query-global
resource isn't the right place for it.



> >     /*
> >      * triggers behave differently depending on this being a delete as
> >      * part of a partion move, or a deletion proper.
> >     if (mtstate->operation == CMD_UPDATE)
> >     {
> >         /*
> >          * If this insert is the result of a partition key update that moved the
> >          * tuple to a new partition, put this row into the transition NEW TABLE,
> >          * if there is one. We need to do this separately for DELETE and INSERT
> >          * because they happen on different tables.
> >          */
> >         ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> >                              NULL,
> >                              slot,
> >                              NULL,
> >                              mtstate->mt_transition_capture);
> >
> >         /*
> >          * But we do want to fire plain per-row INSERT triggers on the
> >          * new table. By not passing in transition_capture we prevent
> >          * ....
> >          */
> >          ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> >                               NULL);
> >     }
> >     else
> >     {
> >         /* AFTER ROW INSERT Triggers */
> >         ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> >                              ar_insert_trig_tcs);
> >     }
> 
> Maybe you meant to use mtstate->mt_transition_capture instead of
> ar_insert_trig_tcs in the else block.  We don't need
> ar_insert_trig_tcs at all.

Yes, it was just an untested example of how the code could be made
clearer.


> > it seems like it'd be quite a bit clearer (although I do think the
> > comments also need a fair bit of polishing independent of this proposed
> > change).
> 
> Fwiw, I agree with your proposed restructuring, although I'd let Amit
> Kh chime in as he'd be more familiar with this code.  I wasn't aware
> of this partitioning-related bit being present in ExecInsert().
> 
> Would you like me to write a patch for some or all items?

Yes, that would be awesome.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From: Amit Langote
On Thu, Jul 18, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > 1) How come partition routing is done outside of ExecInsert()?
> > >
> > >             case CMD_INSERT:
> > >                 /* Prepare for tuple routing if needed. */
> > >                 if (proute)
> > >                     slot = ExecPrepareTupleRouting(node, estate, proute,
> > >                                                    resultRelInfo, slot);
> > >                 slot = ExecInsert(node, slot, planSlot,
> > >                                   estate, node->canSetTag);
> > >                 /* Revert ExecPrepareTupleRouting's state change. */
> > >                 if (proute)
> > >                     estate->es_result_relation_info = resultRelInfo;
> > >                 break;
> > >
> > > That already seems like a layering violation,
> >
> > The decision to move partition routing out of ExecInsert() came about
> > when we encountered a bug [1] whereby ExecInsert() would fail to reset
> > estate->es_result_relation_info back to the root table if it had to
> > take an abnormal path out (early return), of which there are quite a
> > few instances.  The first solution I came up with was to add a goto
> > label  for the code to reset estate->es_result_relation_info and jump
> > to it from the various places that do an early return, which was
> > complained about as reducing readability.  So, the solution we
> > eventually settled on in 6666ee49f was to perform ResultRelInfos
> > switching at a higher level.
>
> I think that was the wrong path, given that the code now lives in
> multiple places. Without even a comment explaining that if one has to be
> changed, the other has to be changed too.
>
>
> > > It seems to me that if we just moved the ExecPrepareTupleRouting() into
> > > ExecInsert(), we could remove the duplication.
> >
> > I agree that there's duplication here.  Given what I wrote above, I
> > can think of doing this: move all of ExecInsert()'s code into
> > ExecInsertInternal() and make the former instead look like this:
>
> For me just having the gotos is cleaner than that here.
>
> But perhaps the right fix would be to not have ExecPrepareTupleRouting()
> change global state at all, and instead change it much more locally
> inside ExecInsert(), around the calls that need it to be set
> differently.
>
> Or perhaps the actually correct fix is to remove es_result_relation_info
> alltogether, and just pass it down the places that need it - we've a lot
> more code setting it than using the value.  And it'd not be hard to
> actually pass it to the places that read it.  Given all the
> setting/resetting of it it's pretty obvious that a query-global resource
> isn't the right place for it.

I tend to agree that managing state through es_result_relation_info
across various operations on a result relation has become a bit messy
at this point.  That said, while most of the places that access the
currently active result relation from es_result_relation_info can be
easily modified to receive it directly, the FDW API BeginDirectModify
poses a bit of a challenge.  BeginDirectModify() is called via
ExecInitForeignScan(), which in turn can't be changed to add a result
relation (Index or ResultRelInfo *) argument, so the only way left for
BeginDirectModify() is to access it via es_result_relation_info.

Maybe we can do to ExecPrepareTupleRouting() what you say -- remove
all code in it that changes ModifyTable-global and EState-global
states.  Also, maybe call ExecPrepareTupleRouting() inside
ExecInsert() at the beginning instead of outside of it.  I agree that
setting and reverting global state around the exact piece of code that
needs it is better for clarity.  All of that assumes you're not saying
that we should scrap ExecPrepareTupleRouting altogether.

Thoughts?  Other opinions?

> > Would you like me to write a patch for some or all items?
>
> Yes, that would be awesome.

OK, I will try to post a patch soon.

Thanks,
Amit



Re: partition routing layering in nodeModifyTable.c

From: Etsuro Fujita
On Thu, Jul 18, 2019 at 4:51 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jul 18, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > > 1) How come partition routing is done outside of ExecInsert()?
> > > >
> > > >             case CMD_INSERT:
> > > >                 /* Prepare for tuple routing if needed. */
> > > >                 if (proute)
> > > >                     slot = ExecPrepareTupleRouting(node, estate, proute,
> > > >                                                    resultRelInfo, slot);
> > > >                 slot = ExecInsert(node, slot, planSlot,
> > > >                                   estate, node->canSetTag);
> > > >                 /* Revert ExecPrepareTupleRouting's state change. */
> > > >                 if (proute)
> > > >                     estate->es_result_relation_info = resultRelInfo;
> > > >                 break;
> > > >
> > > > That already seems like a layering violation,
> > >
> > > The decision to move partition routing out of ExecInsert() came about
> > > when we encountered a bug [1] whereby ExecInsert() would fail to reset
> > > estate->es_result_relation_info back to the root table if it had to
> > > take an abnormal path out (early return), of which there are quite a
> > > few instances.  The first solution I came up with was to add a goto
> > > label  for the code to reset estate->es_result_relation_info and jump
> > > to it from the various places that do an early return, which was
> > > complained about as reducing readability.  So, the solution we
> > > eventually settled on in 6666ee49f was to perform ResultRelInfos
> > > switching at a higher level.
> >
> > I think that was the wrong path, given that the code now lives in
> > multiple places. Without even a comment explaining that if one has to be
> > changed, the other has to be changed too.

I thought this would be OK because we have the
ExecPrepareTupleRouting() call in just two places in a single source
file, at least currently.

> > Or perhaps the actually correct fix is to remove es_result_relation_info
> > alltogether, and just pass it down the places that need it - we've a lot
> > more code setting it than using the value.  And it'd not be hard to
> > actually pass it to the places that read it.  Given all the
> > setting/resetting of it it's pretty obvious that a query-global resource
> > isn't the right place for it.
>
> I tend to agree that managing state through es_result_relation_info
> across various operations on a result relation has turned a bit messy
> at this point.  That said, while most of the places that access the
> currently active result relation from es_result_relation_info can be
> easily modified to receive it directly, the FDW API BeginDirectModify
> poses bit of a challenge.  BeginDirectlyModify() is called via
> ExecInitForeignScan() that in turn can't be changed to add a result
> relation (Index or ResultRelInfo *) argument, so the only way left for
> BeginDirectlyModify() is to access it via es_result_relation_info.

That's right.  I'm not sure that's a good idea, because I think other
extensions might also look at es_result_relation_info, and if so,
removing it altogether would force those extension authors to update
their extensions without gaining any benefit, which I don't think is a
good thing.

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From: Amit Langote
On Thu, Jul 18, 2019 at 4:50 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jul 18, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:
> > Or perhaps the actually correct fix is to remove es_result_relation_info
> > alltogether, and just pass it down the places that need it - we've a lot
> > more code setting it than using the value.  And it'd not be hard to
> > actually pass it to the places that read it.  Given all the
> > setting/resetting of it it's pretty obvious that a query-global resource
> > isn't the right place for it.
> >
> > > Would you like me to write a patch for some or all items?
> >
> > Yes, that would be awesome.
>
> OK, I will try to post a patch soon.

Attached are two patches.

The first one (0001) deals with reducing the core executor's reliance
on es_result_relation_info to access the currently active result
relation, in favor of receiving it from the caller as a function
argument.  So no piece of core code relies on it being correctly set
anymore.  It still needs to be set correctly for the third-party code
such as FDWs.  Also, because the partition routing related suggestions
upthread are closely tied into this, especially those around
ExecInsert(), I've included them in the same patch.  I chose to keep
the function ExecPrepareTupleRouting, even though it's now only called
from ExecInsert(), to preserve the readability of the latter.

The second patch (0002) implements some rearrangement of the UPDATE
tuple movement code, addressing point 2 in the first email of this
thread.  Mainly, the block of code in ExecUpdate() that implements row
movement proper has been moved into a function called ExecMove().  It
also contains the cosmetic improvements suggested in point 4.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From: Andres Freund
Hi,

On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> Attached are two patches.

Awesome.


> The first one (0001) deals with reducing the core executor's reliance
> on es_result_relation_info to access the currently active result
> relation, in favor of receiving it from the caller as a function
> argument.  So no piece of core code relies on it being correctly set
> anymore.  It still needs to be set correctly for the third-party code
> such as FDWs.

I'm inclined to just remove it.  There's not much code out there relying
on it, as far as I can tell.  Most FDWs don't support the direct modify
API, and that's afaict the only case where one needs to use
es_result_relation_info?

In fact, I searched through all FDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
that are on github and in the first few categories (up to and including
"file wrappers"), and there was only one reference to
es_result_relation_info, and that just in comments in a test:
https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
which I think was just copied from our source code.

IOW, we should just change the direct modify calls to get the relevant
ResultRelInfo or something in that vein (perhaps just the relevant
RT index?).
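For the direct-modify case specifically, the change might look like the
following.  This is purely a sketch of the idea; the actual callback
signature would have to be settled in a patch:

```c
/* fdwapi.h today: the callback can learn the target relation only by
 * consulting global executor state (estate->es_result_relation_info). */
typedef void (*BeginDirectModify_function) (ForeignScanState *node,
                                            int eflags);

/* Sketched alternative (illustrative only): hand the result relation,
 * or perhaps just its range table index, to the callback directly, so
 * no query-global variable is needed. */
typedef void (*BeginDirectModify_function) (ForeignScanState *node,
                                            ResultRelInfo *resultRelInfo,
                                            int eflags);
```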

pglogical also references it, but just because it creates its own
EState afaict.



> @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
>   *        ExecInsert
>   *
>   *        For INSERT, we have to insert the tuple into the target relation
> - *        and insert appropriate tuples into the index relations.
> + *        (or partition thereof) and insert appropriate tuples into the index
> + *        relations.
>   *
>   *        Returns RETURNING result if any, otherwise NULL.
> + *
> + *        This may change the currently active tuple conversion map in
> + *        mtstate->mt_transition_capture, so the callers must take care to
> + *        save the previous value to avoid losing track of it.
>   * ----------------------------------------------------------------
>   */
>  static TupleTableSlot *
>  ExecInsert(ModifyTableState *mtstate,
> +           ResultRelInfo *resultRelInfo,
>             TupleTableSlot *slot,
>             TupleTableSlot *planSlot,
>             EState *estate,
>             bool canSetTag)
>  {
> -    ResultRelInfo *resultRelInfo;
>      Relation    resultRelationDesc;
>      List       *recheckIndexes = NIL;
>      TupleTableSlot *result = NULL;
>      TransitionCaptureState *ar_insert_trig_tcs;
>      ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
>      OnConflictAction onconflict = node->onConflictAction;
> +    PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> +
> +    /*
> +     * If the input result relation is a partitioned table, find the leaf
> +     * partition to insert the tuple into.
> +     */
> +    if (proute)
> +    {
> +        ResultRelInfo *partRelInfo;
> +
> +        slot = ExecPrepareTupleRouting(mtstate, estate, proute,
> +                                       resultRelInfo, slot,
> +                                       &partRelInfo);
> +        resultRelInfo = partRelInfo;
> +        /* Result relation has changed, so update EState reference too. */
> +        estate->es_result_relation_info = resultRelInfo;
> +    }

I think by removing es_result_relation_info entirely, this would look
cleaner.


> @@ -1271,18 +1274,18 @@ lreplace:;
>                                               mtstate->mt_root_tuple_slot);
>  
>              /*
> -             * Prepare for tuple routing, making it look like we're inserting
> -             * into the root.
> +             * ExecInsert() may scribble on mtstate->mt_transition_capture,
> +             * so save the currently active map.
>               */
> +            if (mtstate->mt_transition_capture)
> +                saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

Wonder if we could remove the need for this somehow, it's still pretty
darn ugly.  Thomas, perhaps you have some insights?

To me the need to modify these ModifyTable wide state on a per-subplan
and even per-partition basis indicates that the datastructures are in
the wrong place.



> @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
>          switch (operation)
>          {
>              case CMD_INSERT:
> -                /* Prepare for tuple routing if needed. */
> -                if (proute)
> -                    slot = ExecPrepareTupleRouting(node, estate, proute,
> -                                                   resultRelInfo, slot);
> -                slot = ExecInsert(node, slot, planSlot,
> +                slot = ExecInsert(node, resultRelInfo, slot, planSlot,
>                                    estate, node->canSetTag);
> -                /* Revert ExecPrepareTupleRouting's state change. */
> -                if (proute)
> -                    estate->es_result_relation_info = resultRelInfo;
>                  break;
>              case CMD_UPDATE:
> -                slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
> -                                  &node->mt_epqstate, estate, node->canSetTag);
> +                slot = ExecUpdate(node, resultRelInfo, tupleid, oldtuple, slot,
> +                                  planSlot, &node->mt_epqstate, estate,
> +                                  node->canSetTag);
>                  break;
>              case CMD_DELETE:
> -                slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> -                                  &node->mt_epqstate, estate,
> +                slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> +                                  planSlot, &node->mt_epqstate, estate,
>                                    true, node->canSetTag,
>                                    false /* changingPart */ , NULL, NULL);
>                  break;

This reminds me of another complaint: ExecDelete and ExecInsert() have
gotten more boolean parameters for partition moving, but only one of
them is explained with a comment (/* changingPart */) - think we should
do that for all.



> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 43edfef089..7df3e78b22 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -173,10 +173,10 @@ ensure_transaction(void)
>   * This is based on similar code in copy.c
>   */
>  static EState *
> -create_estate_for_relation(LogicalRepRelMapEntry *rel)
> +create_estate_for_relation(LogicalRepRelMapEntry *rel,
> +                           ResultRelInfo **resultRelInfo)
>  {
>      EState       *estate;
> -    ResultRelInfo *resultRelInfo;
>      RangeTblEntry *rte;
>  
>      estate = CreateExecutorState();
> @@ -188,12 +188,11 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
>      rte->rellockmode = AccessShareLock;
>      ExecInitRangeTable(estate, list_make1(rte));
>  
> -    resultRelInfo = makeNode(ResultRelInfo);
> -    InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
> +    *resultRelInfo = makeNode(ResultRelInfo);
> +    InitResultRelInfo(*resultRelInfo, rel->localrel, 1, NULL, 0);
>  
> -    estate->es_result_relations = resultRelInfo;
> +    estate->es_result_relations = *resultRelInfo;
>      estate->es_num_result_relations = 1;
> -    estate->es_result_relation_info = resultRelInfo;
>  
>      estate->es_output_cid = GetCurrentCommandId(true);
>  
> @@ -567,6 +566,7 @@ GetRelationIdentityOrPK(Relation rel)
>  static void
>  apply_handle_insert(StringInfo s)
>  {
> +    ResultRelInfo *resultRelInfo;
>      LogicalRepRelMapEntry *rel;
>      LogicalRepTupleData newtup;
>      LogicalRepRelId relid;
> @@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
>      }
>  
>      /* Initialize the executor state. */
> -    estate = create_estate_for_relation(rel);
> +    estate = create_estate_for_relation(rel, &resultRelInfo);

Hm. It kinda seems cleaner if we were to instead return the relevant
index, rather than the entire ResultRelInfo, as an output from
create_estate_for_relation().  Makes it clearer that it's still in the
EState.

Or perhaps we ought to compute it in a separate step? Then that'd be
more amenable to supporting replicating into partition roots.
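To sketch what returning the index might look like (toy stand-in types, not the actual executor structs; this only illustrates the ownership point):

```c
#include <assert.h>

/* Toy stand-ins for EState/ResultRelInfo; purely illustrative. */
typedef struct ToyResultRel
{
    int         relid;
} ToyResultRel;

typedef struct ToyEState
{
    ToyResultRel es_result_relations[8];
    int         es_num_result_relations;
} ToyEState;

/*
 * Hypothetical shape of create_estate_for_relation() returning an index:
 * the result relation visibly stays inside (and is owned by) the EState,
 * instead of escaping through an out-parameter.
 */
static int
toy_create_estate_for_relation(ToyEState *estate, int relid)
{
    int         idx = estate->es_num_result_relations++;

    estate->es_result_relations[idx].relid = relid;
    return idx;
}
```

A caller would then do `rri = &estate->es_result_relations[idx]` at the point where it actually needs the struct.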


>      /*
> -     * If this insert is the result of a partition key update that moved the
> -     * tuple to a new partition, put this row into the transition NEW TABLE,
> -     * if there is one. We need to do this separately for DELETE and INSERT
> -     * because they happen on different tables.
> +     * If this delete is a part of a partition key update, put this row into
> +     * the UPDATE trigger's NEW TABLE instead of that of an INSERT trigger.
>       */
> -    ar_insert_trig_tcs = mtstate->mt_transition_capture;
> -    if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
> -        && mtstate->mt_transition_capture->tcs_update_new_table)
> +    if (mtstate->operation == CMD_UPDATE &&
> +        mtstate->mt_transition_capture &&
> +        mtstate->mt_transition_capture->tcs_update_new_table)
>      {
> -        ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> -                             NULL,
> -                             slot,
> -                             NULL,
> -                             mtstate->mt_transition_capture);
> +        ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, slot,
> +                             NIL, mtstate->mt_transition_capture);
>  
>          /*
> -         * We've already captured the NEW TABLE row, so make sure any AR
> -         * INSERT trigger fired below doesn't capture it again.
> +         * Execute AFTER ROW INSERT Triggers, but such that the row is not
> +         * captured again in the transition table if any.
>           */
> -        ar_insert_trig_tcs = NULL;
> +        ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> +                             NULL);
> +    }
> +    else
> +    {
> +        /* AFTER ROW INSERT Triggers */
> +        ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> +                             mtstate->mt_transition_capture);
>      }
> -
> -    /* AFTER ROW INSERT Triggers */
> -    ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> -                         ar_insert_trig_tcs);

While a tiny bit more code, perhaps, this is considerably clearer
imo. Thanks.


> +/*
> + *    ExecMove
> + *        Move an updated tuple from the input result relation to the
> + *        new partition of its root parent table
> + *
> + *    This works by first deleting the tuple from the input result relation
> + *    followed by inserting it into the root parent table, that is,
> + *    mtstate->rootResultRelInfo.
> + *
> + *    Returns true if it's detected that the tuple we're trying to move has
> + *    been concurrently updated.
> + */
> +static bool
> +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> +         ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> +         EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> +         TupleTableSlot **inserted_tuple)
> +{

I know that it was one of the names I proposed, but now that I'm
thinking about it again, it sounds too generic. Perhaps
ExecCrossPartitionUpdate() wouldn't be quite so generic a name? Since
there's only one reference the longer name wouldn't be painful.


> +    /*
> +     * Row movement, part 1.  Delete the tuple, but skip RETURNING
> +     * processing. We want to return rows from INSERT.
> +     */
> +    ExecDelete(mtstate, resultRelInfo, tupleid, oldtuple, planSlot,
> +               epqstate, estate, false, false /* canSetTag */ ,
> +               true /* changingPart */ , &tuple_deleted, &epqslot);

Here again it'd be nice if all the booleans would be explained with a
comment.


Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From:
Alvaro Herrera
Date:
On 2019-Jul-19, Andres Freund wrote:

> On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument.  So no piece of core code relies on it being correctly set
> > anymore.  It still needs to be set correctly for the third-party code
> > such as FDWs.
> 
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where one needs to use
> es_result_relation_info?

Yeah, I too agree with removing it; since our code doesn't use it, it
seems very likely that it will become slightly out of sync with reality
and we'd not notice until some FDW misbehaves weirdly.

> > -                slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > -                                  &node->mt_epqstate, estate,
> > +                slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > +                                  planSlot, &node->mt_epqstate, estate,
> >                                    true, node->canSetTag,
> >                                    false /* changingPart */ , NULL, NULL);
> >                  break;
> 
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.

Maybe change the API to use a flags bitmask?

(IMO the placement of the comment inside the function call, making the
comma appear preceded with a space, looks ugly.  If we want to add
comments, let's put each param on its own line with the comment beyond
the comma.  That's what we do in other places where this pattern is
used.)
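For what a flags bitmask could look like, here is a minimal sketch (the flag names are made up for illustration, not taken from any patch):

```c
#include <assert.h>

/* Hypothetical flags replacing ExecDelete()'s trailing booleans. */
typedef enum
{
    TOY_DELETE_PROCESS_RETURNING = 1 << 0,  /* build a RETURNING slot */
    TOY_DELETE_CAN_SET_TAG       = 1 << 1,  /* update the command tag */
    TOY_DELETE_CHANGING_PART     = 1 << 2   /* part of a cross-partition UPDATE */
} ToyDeleteFlags;

/*
 * Toy stand-in for ExecDelete(): just counts the requested options, to
 * show that each call site reads as self-documenting flag names instead
 * of a row of bare true/false arguments.
 */
static int
toy_exec_delete(int flags)
{
    int         n = 0;

    if (flags & TOY_DELETE_PROCESS_RETURNING)
        n++;
    if (flags & TOY_DELETE_CAN_SET_TAG)
        n++;
    if (flags & TOY_DELETE_CHANGING_PART)
        n++;
    return n;
}
```

A call site would then read `toy_exec_delete(TOY_DELETE_CAN_SET_TAG | TOY_DELETE_CHANGING_PART)` rather than `..., true, false, true, ...`.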

> >      /* Initialize the executor state. */
> > -    estate = create_estate_for_relation(rel);
> > +    estate = create_estate_for_relation(rel, &resultRelInfo);
> 
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation().  Makes it clearer that it's still in the
> EState.

Yeah.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to supporting replicating into partition roots.

I'm not quite seeing the shape that you're imagining this would take.
I vote not to mess with that for this patch; I bet that we'll have to
change a few other things in this code when we add better support for
partitioning in logical replication.

> > +/*
> > + *    ExecMove
> > + *        Move an updated tuple from the input result relation to the
> > + *        new partition of its root parent table
> > + *
> > + *    This works by first deleting the tuple from the input result relation
> > + *    followed by inserting it into the root parent table, that is,
> > + *    mtstate->rootResultRelInfo.
> > + *
> > + *    Returns true if it's detected that the tuple we're trying to move has
> > + *    been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > +         ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> > +         EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > +         TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be quite so generic a name? Since
> there's only one reference the longer name wouldn't be painful.

That name sounds good.  Isn't the return convention backwards?  Sounds
like "true" should mean that it succeeded.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: partition routing layering in nodeModifyTable.c

From:
Andres Freund
Date:
Hi,

On 2019-07-19 17:11:10 -0400, Alvaro Herrera wrote:
> On 2019-Jul-19, Andres Freund wrote:
> > > -                slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > > -                                  &node->mt_epqstate, estate,
> > > +                slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > > +                                  planSlot, &node->mt_epqstate, estate,
> > >                                    true, node->canSetTag,
> > >                                    false /* changingPart */ , NULL, NULL);
> > >                  break;
> > 
> > This reminds me of another complaint: ExecDelete and ExecInsert() have
> > gotten more boolean parameters for partition moving, but only one of
> > them is explained with a comment (/* changingPart */) - think we should
> > do that for all.
> 
> Maybe change the API to use a flags bitmask?
> 
> (IMO the placement of the comment inside the function call, making the
> comma appear preceded with a space, looks ugly.  If we want to add
> comments, let's put each param on its own line with the comment beyond
> the comma.  That's what we do in other places where this pattern is
> used.)

Well, that's the pre-existing style, so I'd just have gone with
that. I'm not sure I buy there's much point in going for a bitmask, as
this is file-private code, not code where changing the signature
requires modifying multiple files.


> > >      /* Initialize the executor state. */
> > > -    estate = create_estate_for_relation(rel);
> > > +    estate = create_estate_for_relation(rel, &resultRelInfo);
> > 
> > Hm. It kinda seems cleaner if we were to instead return the relevant
> > index, rather than the entire ResultRelInfo, as an output from
> > create_estate_for_relation().  Makes it clearer that it's still in the
> > EState.
> 
> Yeah.
> 
> > Or perhaps we ought to compute it in a separate step? Then that'd be
> > more amenable to supporting replicating into partition roots.
> 
> I'm not quite seeing the shape that you're imagining this would take.
> I vote not to mess with that for this patch; I bet that we'll have to
> change a few other things in this code when we add better support for
> partitioning in logical replication.

Yea, I think it's fine to do that separately.  If we wanted to support
partition roots as replication targets, we'd obviously need to do
something pretty similar to what ExecInsert()/ExecUpdate() already
do. And there we can't just reference an index in EState, as partition
children aren't in there.

I kind of was wondering if we were to have a separate function for
getting the ResultRelInfo targeted, we'd be able to just extend that
function to support replication.  But now that I think about it a bit
more, that's pretty much just scratching the surface...

We really ought to have the replication "sink" code share more code with
nodeModifyTable.c.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Hi Andres,

Sorry about the delay in replying as I was on vacation for the last few days.

On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <andres@anarazel.de> wrote:
> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument.  So no piece of core code relies on it being correctly set
> > anymore.  It still needs to be set correctly for the third-party code
> > such as FDWs.
>
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where one needs to use
> es_result_relation_info?

Right, only the directly modify API uses it.

> In fact, I searched through all FDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
> that are on github and in first few categories (up and including to
> "file wrappers"), and there was only one reference to
> es_result_relation_info, and that just in comments in a test:
> https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
> which I think was just copied from our source code.
>
> IOW, we should just change the direct modify calls to get the relevant
> ResultRelationInfo or something in that vein (perhaps just the relevant
> RT index?).

It seems easy to make one of the two functions that constitute the
direct modify API, IterateDirectModify(), access the result relation
from ForeignScanState by saving either the result relation RT index or
ResultRelInfo pointer itself into the ForeignScanState's FDW-private
area.  For example, for postgres_fdw, one would simply add a new
member to PgFdwDirectModifyState struct.

Doing that for the other function BeginDirectModify() seems a bit more
involved.  We could add a new field to ForeignScan, say
resultRelation, that's set by either PlanDirectModify() (the FDW code)
or make_modifytable() (the core code) if the ForeignScan node contains
the command for direct modification.  BeginDirectModify() can then use
that value instead of relying on es_result_relation_info being set.
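That shape can be modeled with a toy sketch: the plan node itself carries the target's RT index, so Begin-time code needs no global executor state (the types and field names here are hypothetical stand-ins, not the actual nodes):

```c
#include <assert.h>

/* Simplified stand-in for ForeignScan with the proposed new field. */
typedef struct ToyForeignScan
{
    int         scanrelid;          /* relation being scanned */
    int         resultRelation;     /* hypothetical: target RT index, 0 if
                                     * this node isn't a direct modification */
} ToyForeignScan;

/*
 * Toy BeginDirectModify(): reads the target from the plan node instead of
 * consulting something like es_result_relation_info.
 */
static int
toy_begin_direct_modify(const ToyForeignScan *fsplan)
{
    return fsplan->resultRelation;
}
```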

Thoughts?  Fujita-san, do you have any opinion on whether that would
be a good idea?

> pglogical also references it, but just because it creates its own
> EState afaict.

That sounds easily manageable.

> > @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
> >   *           ExecInsert
> >   *
> >   *           For INSERT, we have to insert the tuple into the target relation
> > - *           and insert appropriate tuples into the index relations.
> > + *           (or partition thereof) and insert appropriate tuples into the index
> > + *           relations.
> >   *
> >   *           Returns RETURNING result if any, otherwise NULL.
> > + *
> > + *           This may change the currently active tuple conversion map in
> > + *           mtstate->mt_transition_capture, so the callers must take care to
> > + *           save the previous value to avoid losing track of it.
> >   * ----------------------------------------------------------------
> >   */
> >  static TupleTableSlot *
> >  ExecInsert(ModifyTableState *mtstate,
> > +                ResultRelInfo *resultRelInfo,
> >                  TupleTableSlot *slot,
> >                  TupleTableSlot *planSlot,
> >                  EState *estate,
> >                  bool canSetTag)
> >  {
> > -     ResultRelInfo *resultRelInfo;
> >       Relation        resultRelationDesc;
> >       List       *recheckIndexes = NIL;
> >       TupleTableSlot *result = NULL;
> >       TransitionCaptureState *ar_insert_trig_tcs;
> >       ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
> >       OnConflictAction onconflict = node->onConflictAction;
> > +     PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> > +
> > +     /*
> > +      * If the input result relation is a partitioned table, find the leaf
> > +      * partition to insert the tuple into.
> > +      */
> > +     if (proute)
> > +     {
> > +             ResultRelInfo *partRelInfo;
> > +
> > +             slot = ExecPrepareTupleRouting(mtstate, estate, proute,
> > +                                                                        resultRelInfo, slot,
> > +                                                                        &partRelInfo);
> > +             resultRelInfo = partRelInfo;
> > +             /* Result relation has changed, so update EState reference too. */
> > +             estate->es_result_relation_info = resultRelInfo;
> > +     }
>
> I think by removing es_result_relation_info entirely, this would look
> cleaner.

I agree.  Maybe setting es_result_relation_info here isn't really
needed, because the ResultRelInfo is directly passed through
ExecForeignInsert.  Still, some FDWs may be relying on
es_result_relation_info being correctly set despite that.  Again, the
only way to get them to stop doing so may be to remove it.


> > @@ -1271,18 +1274,18 @@ lreplace:;
> >                                              mtstate->mt_root_tuple_slot);
> >
> >                       /*
> > -                      * Prepare for tuple routing, making it look like we're inserting
> > -                      * into the root.
> > +                      * ExecInsert() may scribble on mtstate->mt_transition_capture,
> > +                      * so save the currently active map.
> >                        */
> > +                     if (mtstate->mt_transition_capture)
> > +                             saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> Wonder if we could remove the need for this somehow, it's still pretty
> darn ugly.  Thomas, perhaps you have some insights?
>
> To me the need to modify these ModifyTable wide state on a per-subplan
> and even per-partition basis indicates that the datastructures are in
> the wrong place.

I agree that having to ensure tcs_map is set correctly is cumbersome,
because it has to be reset every time the currently active result
relation changes.  I think a better place for the map to be is
ResultRelInfo itself.  The trigger code can just get the correct map
from the ResultRelInfo of the result relation it's processing.

Regarding that idea, the necessary map is already present in the
tuple-routing state struct that's embedded in the partition's
ResultRelInfo.  But the UPDATE result relations that are never
processed as tuple routing targets don't have routing info initialized
(also think non-partition inheritance children), so we could add
another TupleConversionMap * field in ResultRelInfo.  Attached patch
0003 implements that.
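The layout change can be sketched like this (toy types standing in for ResultRelInfo and TupleConversionMap; the field name is illustrative of the patch's idea, not a quote from it):

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyConversionMap
{
    int         from_natts;
    int         to_natts;
} ToyConversionMap;

/*
 * Each result relation carries its own child-to-root conversion map, so
 * nothing ModifyTable-wide (like tcs_map) has to be saved and restored
 * as the currently active result relation changes.
 */
typedef struct ToyResultRelInfo
{
    const char *relname;
    ToyConversionMap *ri_ChildToRootMap;    /* NULL if no conversion needed */
} ToyResultRelInfo;

/* Trigger code asks the relation, not global state, for the map. */
static ToyConversionMap *
toy_transition_capture_map(const ToyResultRelInfo *rri)
{
    return rri->ri_ChildToRootMap;
}
```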

With this change, we no longer need to track the map in a global
variable, that is, TransitionCaptureState no longer needs tcs_map.  We
still have tcs_original_insert_tuple though, which must be set during
ExecInsert and reset after it's read by AfterTriggerSaveEvent.  I have
moved the resetting of its value to right after where the originally
set value is read to make it clear that the value must be read only
once.

> > @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
> >               switch (operation)
> >               {
> >                       case CMD_INSERT:
> > -                             /* Prepare for tuple routing if needed. */
> > -                             if (proute)
> > -                                     slot = ExecPrepareTupleRouting(node, estate, proute,
> > -                                                                                                resultRelInfo, slot);
> > -                             slot = ExecInsert(node, slot, planSlot,
> > +                             slot = ExecInsert(node, resultRelInfo, slot, planSlot,
> >                                                                 estate, node->canSetTag);
> > -                             /* Revert ExecPrepareTupleRouting's state change. */
> > -                             if (proute)
> > -                                     estate->es_result_relation_info = resultRelInfo;
> >                               break;
> >                       case CMD_UPDATE:
> > -                             slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
> > -                                                               &node->mt_epqstate, estate, node->canSetTag);
> > +                             slot = ExecUpdate(node, resultRelInfo, tupleid, oldtuple, slot,
> > +                                                               planSlot, &node->mt_epqstate, estate,
> > +                                                               node->canSetTag);
> >                               break;
> >                       case CMD_DELETE:
> > -                             slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > -                                                               &node->mt_epqstate, estate,
> > +                             slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > +                                                               planSlot, &node->mt_epqstate, estate,
> >                                                                 true, node->canSetTag,
> >                                                                 false /* changingPart */ , NULL, NULL);
> >                               break;
>
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.

Agree about the confusing state of ExecDelete call sites.  I've
reformatted the calls to properly label the arguments (the changes are
contained in the revised 0001).  I don't see many
partitioning-specific boolean parameters in ExecInsert though.

> > diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> > index 43edfef089..7df3e78b22 100644
> > --- a/src/backend/replication/logical/worker.c
> > +++ b/src/backend/replication/logical/worker.c
> > @@ -173,10 +173,10 @@ ensure_transaction(void)
> >   * This is based on similar code in copy.c
> >   */
> >  static EState *
> > -create_estate_for_relation(LogicalRepRelMapEntry *rel)
> > +create_estate_for_relation(LogicalRepRelMapEntry *rel,
> > +                                                ResultRelInfo **resultRelInfo)
> >  {
> >       EState     *estate;
> > -     ResultRelInfo *resultRelInfo;
> >       RangeTblEntry *rte;
> >
> >       estate = CreateExecutorState();
> > @@ -188,12 +188,11 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
> >       rte->rellockmode = AccessShareLock;
> >       ExecInitRangeTable(estate, list_make1(rte));
> >
> > -     resultRelInfo = makeNode(ResultRelInfo);
> > -     InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
> > +     *resultRelInfo = makeNode(ResultRelInfo);
> > +     InitResultRelInfo(*resultRelInfo, rel->localrel, 1, NULL, 0);
> >
> > -     estate->es_result_relations = resultRelInfo;
> > +     estate->es_result_relations = *resultRelInfo;
> >       estate->es_num_result_relations = 1;
> > -     estate->es_result_relation_info = resultRelInfo;
> >
> >       estate->es_output_cid = GetCurrentCommandId(true);
> >
> > @@ -567,6 +566,7 @@ GetRelationIdentityOrPK(Relation rel)
> >  static void
> >  apply_handle_insert(StringInfo s)
> >  {
> > +     ResultRelInfo *resultRelInfo;
> >       LogicalRepRelMapEntry *rel;
> >       LogicalRepTupleData newtup;
> >       LogicalRepRelId relid;
> > @@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
> >       }
> >
> >       /* Initialize the executor state. */
> > -     estate = create_estate_for_relation(rel);
> > +     estate = create_estate_for_relation(rel, &resultRelInfo);
>
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation().  Makes it clearer that it's still in the
> EState.

For now, I've reverted these changes in favor of just doing this:

     /* Initialize the executor state. */
     estate = create_estate_for_relation(rel);
+    resultRelInfo = &estate->es_result_relations[0];

This seems OK as we know for sure that there is only one target relation.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to supporting replicating into partition roots.

If we think of create_estate_for_relation() being like InitPlan(),
then perhaps it makes sense to leave it as is.  Any setup needed for
replicating into partition roots will have to be in a separate
function anyway.

> > +/*
> > + *   ExecMove
> > + *           Move an updated tuple from the input result relation to the
> > + *           new partition of its root parent table
> > + *
> > + *   This works by first deleting the tuple from the input result relation
> > + *   followed by inserting it into the root parent table, that is,
> > + *   mtstate->rootResultRelInfo.
> > + *
> > + *   Returns true if it's detected that the tuple we're trying to move has
> > + *   been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > +              ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> > +              EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > +              TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be quite so generic a name? Since
> there's only one reference the longer name wouldn't be painful.

OK, I've renamed ExecMove to ExecCrossPartitionUpdate.

> > +     /*
> > +      * Row movement, part 1.  Delete the tuple, but skip RETURNING
> > +      * processing. We want to return rows from INSERT.
> > +      */
> > +     ExecDelete(mtstate, resultRelInfo, tupleid, oldtuple, planSlot,
> > +                        epqstate, estate, false, false /* canSetTag */ ,
> > +                        true /* changingPart */ , &tuple_deleted, &epqslot);
>
> Here again it'd be nice if all the booleans would be explained with a
> comment.

Done too.

Attached updated 0001, 0002, and the new 0003 for transition tuple
conversion map related refactoring as explained above.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <andres@anarazel.de> wrote:
> > > The first one (0001) deals with reducing the core executor's reliance
> > > on es_result_relation_info to access the currently active result
> > > relation, in favor of receiving it from the caller as a function
> > > argument.  So no piece of core code relies on it being correctly set
> > > anymore.  It still needs to be set correctly for the third-party code
> > > such as FDWs.
> >
> > I'm inclined to just remove it. There's not much code out there relying
> > on it, as far as I can tell. Most FDWs don't support the direct modify
> > API, and that's afaict the case where one needs to use
> > es_result_relation_info?
>
> Right, only the directly modify API uses it.
>
> > In fact, I searched through all FDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
> > that are on github and in first few categories (up and including to
> > "file wrappers"), and there was only one reference to
> > es_result_relation_info, and that just in comments in a test:
> > https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
> > which I think was just copied from our source code.
> >
> > IOW, we should just change the direct modify calls to get the relevant
> > ResultRelationInfo or something in that vein (perhaps just the relevant
> > RT index?).
>
> It seems easy to make one of the two functions that constitute the
> direct modify API, IterateDirectModify(), access the result relation
> from ForeignScanState by saving either the result relation RT index or
> ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> area.  For example, for postgres_fdw, one would simply add a new
> member to PgFdwDirectModifyState struct.
>
> Doing that for the other function BeginDirectModify() seems a bit more
> involved.  We could add a new field to ForeignScan, say
> resultRelation, that's set by either PlanDirectModify() (the FDW code)
> or make_modifytable() (the core code) if the ForeignScan node contains
> the command for direct modification.  BeginDirectModify() can then use
> that value instead of relying on es_result_relation_info being set.
>
> Thoughts?  Fujita-san, do you have any opinion on whether that would
> be a good idea?

I looked into trying to do the things I mentioned above and it seems
to me that revising BeginDirectModify()'s API to receive the
ResultRelInfo directly as Andres suggested might be the best way
forward.  I've implemented that in the attached 0001.  Patches that
were previously 0001, 0002, and 0003 are now 0002, 0003, and 0004,
respectively.  0002 is now a patch to "remove"
es_result_relation_info.
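To make the shape of that API change concrete, here is a minimal self-contained sketch (not the actual PostgreSQL types or signatures, just illustrative stand-ins) contrasting a callback that reads ambient executor state, as BeginDirectModify() does via es_result_relation_info, with one that receives the target relation as an explicit parameter:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for ResultRelInfo; not the real structure. */
typedef struct Rel
{
    int rt_index;               /* range-table index of the target relation */
} Rel;

/* Stand-in for the global es_result_relation_info field. */
static Rel *current_result_rel = NULL;

/* Old style: the callee depends on the caller having set the global
 * beforehand, which is exactly the ordering dependency being criticized. */
static int
begin_direct_modify_old(void)
{
    return current_result_rel ? current_result_rel->rt_index : -1;
}

/* New style: the target relation arrives as an explicit parameter, as
 * BeginForeignModify() already does, so there is no hidden global state. */
static int
begin_direct_modify_new(Rel *result_rel)
{
    return result_rel->rt_index;
}
```

The parameter-passing form has no ordering dependency on global state being set up first, which is the layering problem this patch series removes.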

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From
Etsuro Fujita
Date:
On Wed, Jul 31, 2019 at 5:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <andres@anarazel.de> wrote:
> > > IOW, we should just change the direct modify calls to get the relevant
> > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > RT index?).
> >
> > It seems easy to make one of the two functions that constitute the
> > direct modify API, IterateDirectModify(), access the result relation
> > from ForeignScanState by saving either the result relation RT index or
> > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > area.  For example, for postgres_fdw, one would simply add a new
> > member to PgFdwDirectModifyState struct.
> >
> > Doing that for the other function BeginDirectModify() seems a bit more
> > involved.  We could add a new field to ForeignScan, say
> > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > or make_modifytable() (the core code) if the ForeignScan node contains
> > the command for direct modification.  BeginDirectModify() can then use
> > that value instead of relying on es_result_relation_info being set.
> >
> > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > be a good idea?

I'm still not sure that it's a good idea to remove
es_result_relation_info, but if I had to say then I think the latter
would probably be better.  I'm planning to rework direct
modification to base it on upper planner pathification so we can
perform direct modification without the ModifyTable node.  (I'm not
sure we can really do this for inherited UPDATE/DELETE, though.)  For
that rewrite, I'm thinking to call BeginDirectModify() from the
ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
approach would allow that without any changes and avoid changing that
API many times.  That's the reason why I think the latter would
probably be better.

> I looked into trying to do the things I mentioned above and it seems
> to me that revising BeginDirectModify()'s API to receive the
> ResultRelInfo directly as Andres suggested might be the best way
> forward.  I've implemented that in the attached 0001.  Patches that
> were previously 0001, 0002, and 0003 are now 0002, 0003, and 0004,
> respectively.  0002 is now a patch to "remove"
> es_result_relation_info.

Sorry for responding this late.

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From
Andres Freund
Date:
Hi,

On 2019-07-31 21:03:58 +0900, Etsuro Fujita wrote:
> I'm still not sure that it's a good idea to remove
> es_result_relation_info, but if I had to say then I think the latter
> would probably be better.  I'm planning to rework direct
> modification to base it on upper planner pathification so we can
> perform direct modification without the ModifyTable node.  (I'm not
> sure we can really do this for inherited UPDATE/DELETE, though.)  For
> that rewrite, I'm thinking to call BeginDirectModify() from the
> ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> approach would allow that without any changes and avoid changing that
> API many times.  That's the reason why I think the latter would
> probably be better.

I think if we did that, it'd become *more* urgent to remove
es_result_relation_info. Having more and more plan nodes change global
resources is a recipe for disaster.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
Fujita-san,

Thanks for the reply and sorry I didn't wait a bit more before posting
the patch.

On Wed, Jul 31, 2019 at 9:04 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Jul 31, 2019 at 5:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <andres@anarazel.de> wrote:
> > > > IOW, we should just change the direct modify calls to get the relevant
> > > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > > RT index?).
> > >
> > > It seems easy to make one of the two functions that constitute the
> > > direct modify API, IterateDirectModify(), access the result relation
> > > from ForeignScanState by saving either the result relation RT index or
> > > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > > area.  For example, for postgres_fdw, one would simply add a new
> > > member to PgFdwDirectModifyState struct.
> > >
> > > Doing that for the other function BeginDirectModify() seems a bit more
> > > involved.  We could add a new field to ForeignScan, say
> > > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > > or make_modifytable() (the core code) if the ForeignScan node contains
> > > the command for direct modification.  BeginDirectModify() can then use
> > > that value instead of relying on es_result_relation_info being set.
> > >
> > > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > > be a good idea?
>
> I'm still not sure that it's a good idea to remove
> es_result_relation_info, but if I had to say then I think the latter
> would probably be better.

Could you please clarify what you meant by the "latter"?

If it's the approach of adding a resultRelation Index field to
ForeignScan node, I tried and had to give up, realizing that we don't
maintain ResultRelInfos in an array that is indexable by RT indexes.
It would've worked if es_result_relations had mirrored es_range_table,
although that probably complicates how the individual ModifyTable
nodes attach to that array.  In any case, given this discussion,
further hacking on a global variable like es_result_relations may be a
course we might not want to pursue.

>  I'm planning to rework direct
> modification to base it on upper planner pathification so we can
> perform direct modification without the ModifyTable node.  (I'm not
> sure we can really do this for inherited UPDATE/DELETE, though.)  For
> that rewrite, I'm thinking to call BeginDirectModify() from the
> ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> approach would allow that without any changes and avoid changing that
> API many times.  That's the reason why I think the latter would
> probably be better.

Will the new planning approach you're thinking of get rid of needing
any result relations at all (and so the ResultRelInfos in the
executor)?

Thanks,
Amit



Re: partition routing layering in nodeModifyTable.c

From
Etsuro Fujita
Date:
Amit-san,

On Thu, Aug 1, 2019 at 10:33 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jul 31, 2019 at 9:04 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Wed, Jul 31, 2019 at 5:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > IOW, we should just change the direct modify calls to get the relevant
> > > > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > > > RT index?).
> > > >
> > > > It seems easy to make one of the two functions that constitute the
> > > > direct modify API, IterateDirectModify(), access the result relation
> > > > from ForeignScanState by saving either the result relation RT index or
> > > > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > > > area.  For example, for postgres_fdw, one would simply add a new
> > > > member to PgFdwDirectModifyState struct.
> > > >
> > > > Doing that for the other function BeginDirectModify() seems a bit more
> > > > involved.  We could add a new field to ForeignScan, say
> > > > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > > > or make_modifytable() (the core code) if the ForeignScan node contains
> > > > the command for direct modification.  BeginDirectModify() can then use
> > > > that value instead of relying on es_result_relation_info being set.
> > > >
> > > > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > > > be a good idea?
> >
> > I'm still not sure that it's a good idea to remove
> > es_result_relation_info, but if I had to say then I think the latter
> > would probably be better.
>
> Could you please clarify what you meant by the "latter"?
>
> If it's the approach of adding a resultRelation Index field to
> ForeignScan node, I tried and had to give up, realizing that we don't
> maintain ResultRelInfos in an array that is indexable by RT indexes.
> It would've worked if es_result_relations had mirrored es_range_table,
> although that probably complicates how the individual ModifyTable
> nodes attach to that array.  In any case, given this discussion,
> further hacking on a global variable like es_result_relations may be a
> course we might not want to pursue.

Yeah, I mean that approach.  To get the ResultRelInfo, I think we can
search through es_result_relations in BeginDirectModify() for the entry
matching the resultRelation added to the ForeignScan, as in the attached
patch, which was created along the lines of your proposal.
ExecFindResultRelInfo() added by the patch wouldn't work efficiently
for inherited UPDATE/DELETE where there are many children that are
foreign tables, but I think that would probably be OK because in most
use-cases, including sharding, the number of such children would be at
most 100 or so.  For improving the efficiency of the cases where
there are a lot more such children, however, I think it would be an
option to do something about the global variables so that we can access
the ResultRelInfos by RT index more efficiently, because I don't think
that would go against the point here, ie, removing the dependency on
es_result_relation_info.  Maybe I'm missing something, though.
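As a rough illustration of the cost being discussed, here is a self-contained model (hypothetical struct and function names, not the real executor types) of an ExecFindResultRelInfo()-style linear lookup by range-table index:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: each open result relation records the range-table
 * index of the table it modifies, as ResultRelInfo does with
 * ri_RangeTableIndex. */
typedef struct ResultRel
{
    int rt_index;               /* range-table index of this result relation */
    const char *name;           /* stand-in for the rest of the struct */
} ResultRel;

/* Linear scan over the open result relations.  This is O(n) in the number
 * of result relations: fine for small partition counts, but it degrades
 * when an inherited UPDATE/DELETE has many foreign-table children, which
 * is the efficiency concern raised above. */
static ResultRel *
find_result_rel(ResultRel *rels, size_t nrels, int rt_index)
{
    for (size_t i = 0; i < nrels; i++)
    {
        if (rels[i].rt_index == rt_index)
            return &rels[i];
    }
    return NULL;
}
```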

> >  I'm planning to rework direct
> > modification to base it on upper planner pathification so we can
> > perform direct modification without the ModifyTable node.  (I'm not
> > sure we can really do this for inherited UPDATE/DELETE, though.)  For
> > that rewrite, I'm thinking to call BeginDirectModify() from the
> > ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> > approach would allow that without any changes and avoid changing that
> > API many times.  That's the reason why I think the latter would
> > probably be better.
>
> Will the new planning approach you're thinking of get rid of needing
> any result relations at all (and so the ResultRelInfos in the
> executor)?

I think the new planning approach would still need result relations
and ResultRelInfos in the executor as-is; and the FDW would probably
use the ResultRelInfo for the foreign table created by the core.  Some
of the ResultRelInfo data would probably need to be initialized by the
FDW itself, though (eg, WCO constraints and/or RETURNING if any).

Best regards,
Etsuro Fujita

Attachments

Re: partition routing layering in nodeModifyTable.c

From
Andres Freund
Date:
Hi,

On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> I looked into trying to do the things I mentioned above and it seems
> to me that revising BeginDirectModify()'s API to receive the
> ResultRelInfo directly as Andres suggested might be the best way
> forward.  I've implemented that in the attached 0001.  Patches that
> were previously 0001, 0002, and 0003 are now 0002, 0003, and 0004,
> respectively.  0002 is now a patch to "remove"
> es_result_relation_info.

Thanks!  Some minor quibbles aside, the non FDW patches look good to me.

Fujita-san, do you have any comments on the FDW API change?  Or anybody
else?

I'm a bit worried about the move of BeginDirectModify() into
nodeModifyTable.c - it just seems like an odd control flow to me. Not
allowing any intermittent nodes between ForeignScan and ModifyTable also
seems like an undesirable restriction for the future. I realize that we
already do that for BeginForeignModify() (just btw, that already accepts
resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
makes sense), but it still seems like the wrong direction to me.

The need for that move, I assume, comes from needing to know the correct
ResultRelInfo, correct?  I wonder if we shouldn't instead determine that
at plan time (in setrefs.c), somewhat similar to how we determine
ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?

Then we could just have BeginForeignModify, BeginDirectModify,
BeginForeignScan all be called from ExecInitForeignScan().



Patch 0004 is such a nice improvement. Besides getting rid of a substantial
amount of code, it also makes the control flow a lot easier to read.


> @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType)
>   * If there are no triggers in 'trigdesc' that request relevant transition
>   * tables, then return NULL.
>   *
> - * The resulting object can be passed to the ExecAR* functions.  The caller
> - * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
> - * with child tables.
> + * The resulting object can be passed to the ExecAR* functions.
>   *
>   * Note that we copy the flags from a parent table into this struct (rather
>   * than subsequently using the relation's TriggerDesc directly) so that we can
> @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
>       */
>      if (row_trigger && transition_capture != NULL)
>      {
> -        TupleTableSlot *original_insert_tuple = transition_capture->tcs_original_insert_tuple;
> -        TupleConversionMap *map = transition_capture->tcs_map;
> +        TupleTableSlot *original_insert_tuple;
> +        PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo;
> +        TupleConversionMap *map = pinfo ?
> +                                pinfo->pi_PartitionToRootMap :
> +                                relinfo->ri_ChildToRootMap;
>          bool        delete_old_table = transition_capture->tcs_delete_old_table;
>          bool        update_old_table = transition_capture->tcs_update_old_table;
>          bool        update_new_table = transition_capture->tcs_update_new_table;
>          bool        insert_new_table = transition_capture->tcs_insert_new_table;
>  
>          /*
> +         * Get the originally inserted tuple from the global variable and set
> +         * the latter to NULL because any given tuple must be read only once.
> +         * Note that the TransitionCaptureState is shared across many calls
> +         * to this function.
> +         */
> +        original_insert_tuple = transition_capture->tcs_original_insert_tuple;
> +        transition_capture->tcs_original_insert_tuple = NULL;

Maybe I'm missing something, but original_insert_tuple is not a global
variable?


> @@ -888,7 +889,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
>                      PartitionTupleRouting *proute,
>                      PartitionDispatch dispatch,
>                      ResultRelInfo *partRelInfo,
> -                    int partidx)
> +                    int partidx,
> +                    bool is_update_result_rel)
>  {
>      MemoryContext oldcxt;
>      PartitionRoutingInfo *partrouteinfo;
> @@ -935,10 +937,15 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
>      if (mtstate &&
>          (mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
>      {
> -        partrouteinfo->pi_PartitionToRootMap =
> -            convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
> -                                   RelationGetDescr(partRelInfo->ri_PartitionRoot),
> -                                   gettext_noop("could not convert row type"));
> +        /* If partition is an update target, then we already got the map. */
> +        if (is_update_result_rel)
> +            partrouteinfo->pi_PartitionToRootMap =
> +                partRelInfo->ri_ChildToRootMap;
> +        else
> +            partrouteinfo->pi_PartitionToRootMap =
> +                convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
> +                                       RelationGetDescr(partRelInfo->ri_PartitionRoot),
> +                                       gettext_noop("could not convert row type"));
>      }

Hm, isn't is_update_result_rel just ModifyTable->operation == CMD_UPDATE?



Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From
Etsuro Fujita
Date:
Hi,

On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > I looked into trying to do the things I mentioned above and it seems
> > to me that revising BeginDirectModify()'s API to receive the
> > ResultRelInfo directly as Andres suggested might be the best way
> > forward.  I've implemented that in the attached 0001.

> Fujita-san, do you have any comments on the FDW API change?  Or anybody
> else?
>
> I'm a bit worried about the move of BeginDirectModify() into
> nodeModifyTable.c - it just seems like an odd control flow to me. Not
> allowing any intermittent nodes between ForeignScan and ModifyTable also
> seems like an undesirable restriction for the future. I realize that we
> already do that for BeginForeignModify() (just btw, that already accepts
> resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> makes sense), but it still seems like the wrong direction to me.
>
> The need for that move, I assume, comes from needing to know the correct
> ResultRelInfo, correct?  I wonder if we shouldn't instead determine that
> at plan time (in setrefs.c), somewhat similar to how we determine
> ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?

I'd vote for that; I created a patch for that [1].

> Then we could just have BeginForeignModify, BeginDirectModify,
> BeginForeignScan all be called from ExecInitForeignScan().

I think so too.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com



Re: partition routing layering in nodeModifyTable.c

From
Andres Freund
Date:
Hi,

On 2019-08-03 05:20:35 +0900, Etsuro Fujita wrote:
> On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > > I looked into trying to do the things I mentioned above and it seems
> > > to me that revising BeginDirectModify()'s API to receive the
> > > ResultRelInfo directly as Andres suggested might be the best way
> > > forward.  I've implemented that in the attached 0001.
>
> > Fujita-san, do you have any comments on the FDW API change?  Or anybody
> > else?
> >
> > I'm a bit worried about the move of BeginDirectModify() into
> > nodeModifyTable.c - it just seems like an odd control flow to me. Not
> > allowing any intermittent nodes between ForeignScan and ModifyTable also
> > seems like an undesirable restriction for the future. I realize that we
> > already do that for BeginForeignModify() (just btw, that already accepts
> > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> > makes sense), but it still seems like the wrong direction to me.
> >
> > The need for that move, I assume, comes from needing to know the correct
> > ResultRelInfo, correct?  I wonder if we shouldn't instead determine that
> > at plan time (in setrefs.c), somewhat similar to how we determine
> > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?
>
> I'd vote for that; I created a patch for that [1].
>
> [1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com

Oh, missed that. But that's not quite what I'm proposing. I don't like
ExecFindResultRelInfo at all. What's the point of it? Its introduction
is still an API break - I don't understand why that break is better than
just passing the ResultRelInfo directly to BeginDirectModify()?  I want
to again remark that BeginForeignModify() does get the ResultRelInfo -
it should have been done the same when adding direct modify.

Even if you need the loop - which I don't think is right - it should
live somewhere that individual FDWs don't have to care about.

- Andres



Re: partition routing layering in nodeModifyTable.c

From
Andres Freund
Date:
Hi,

On 2019-08-01 18:38:09 +0900, Etsuro Fujita wrote:
> On Thu, Aug 1, 2019 at 10:33 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > If it's the approach of adding a resultRelation Index field to
> > ForeignScan node, I tried and had to give up, realizing that we don't
> > maintain ResultRelInfos in an array that is indexable by RT indexes.
> > It would've worked if es_result_relations had mirrored es_range_table,
> > although that probably complicates how the individual ModifyTable
> > nodes attach to that array.

We know at plan time what the resultRelation offset for a
ModifyTable node is. We just need to transport that to the respective
foreign scan node, and update it properly in setrefs?  Then we can index
es_result_relations without any additional mapping?

Maybe I'm missing something? I think all we need to do is to have
setrefs.c:set_plan_refs() iterate over ->fdwDirectModifyPlans or such,
and set the respective node's result_relation_offset or whatever we're
naming it to splan->resultRelIndex + offset from fdwDirectModifyPlans?
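A minimal self-contained model of that setrefs-time pass, with illustrative names rather than the real planner fields, might look like the following: each direct-modify scan's query-wide result-relation index is just its position within its ModifyTable plus that ModifyTable's offset into the flat list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mini-model: each ModifyTable knows where its result
 * relations start in the query-wide list (cf. resultRelIndex), and each
 * direct-modify ForeignScan under it knows its position within that
 * ModifyTable's own list of plans. */
typedef struct DirectModifyScan
{
    int pos_in_modify_table;    /* position among this ModifyTable's plans */
    int result_rel_index;       /* filled in by the "setrefs" pass below */
} DirectModifyScan;

/* Plan-time pass: combine the two into an index usable directly against a
 * query-wide es_result_relations-style array, so no run-time search or
 * global variable is needed in the executor. */
static void
assign_result_rel_indexes(DirectModifyScan *scans, size_t nscans,
                          int modify_table_result_rel_offset)
{
    for (size_t i = 0; i < nscans; i++)
        scans[i].result_rel_index =
            modify_table_result_rel_offset + scans[i].pos_in_modify_table;
}
```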


> > In any case, given this discussion, further hacking on a global
> > variable like es_result_relations may be a course we might not want
> > to pursue.

I don't think es_result_relations really is a problem - it doesn't have to
change while processing individual subplans / partitions / whatnot.  If
we needed a mapping between rtis and result indexes, I'd not see a
problem. Doubtful it's needed though.

There's a fundamental difference between EState->es_result_relations and
EState->es_result_relation_info. The former stays static during the
whole query once initialized, whereas es_result_relation_info changes
depending on which relation we're processing. The latter is what makes
the code more complicated, because we cannot ever return early etc.

Similarly, ModifyTableState->mt_per_subplan_tupconv_maps is not a
problem, it stays static, but e.g. mtstate->mt_transition_capture is a
problem, because we have to change for each subplan / routing /
partition movement.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From
Etsuro Fujita
Date:
Hi,

On Sat, Aug 3, 2019 at 5:31 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-08-03 05:20:35 +0900, Etsuro Fujita wrote:
> > On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > > > I looked into trying to do the things I mentioned above and it seems
> > > > to me that revising BeginDirectModify()'s API to receive the
> > > > ResultRelInfo directly as Andres suggested might be the best way
> > > > forward.  I've implemented that in the attached 0001.
> >
> > > Fujita-san, do you have any comments on the FDW API change?  Or anybody
> > > else?
> > >
> > > I'm a bit worried about the move of BeginDirectModify() into
> > > nodeModifyTable.c - it just seems like an odd control flow to me. Not
> > > allowing any intermittent nodes between ForeignScan and ModifyTable also
> > > seems like an undesirable restriction for the future. I realize that we
> > > already do that for BeginForeignModify() (just btw, that already accepts
> > > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> > > makes sense), but it still seems like the wrong direction to me.
> > >
> > > The need for that move, I assume, comes from needing to know the correct
> > > ResultRelInfo, correct?  I wonder if we shouldn't instead determine that
> > > at plan time (in setrefs.c), somewhat similar to how we determine
> > > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?
> >
> > I'd vote for that; I created a patch for that [1].
> >
> > [1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com
>
> Oh, missed that. But that's not quite what I'm proposing.

Sorry, I misread your message.  I think I was too tired.

> I don't like
> ExecFindResultRelInfo at all. What's the point of it? Its introduction
> is still an API break - I don't understand why that break is better than
> just passing the ResultRelInfo directly to BeginDirectModify()?

What API does that function break?  The point of that function was to
keep the direct modify layering/API as-is, because 1) I too felt the
same way about the move of BeginDirectModify() to nodeModifyTable.c,
and 2) I was thinking that when rewriting direct modify with upper
planner pathification so that we can perform it without ModifyTable,
we could still use the existing layering/API as-is, leading to smaller
changes to the core for that.

> I want
> to again remark that BeginForeignModify() does get the ResultRelInfo -
> it should have been done the same when adding direct modify.

Might have been so.

> Even if you need the loop - which I don't think is right - it should
> live somewhere that individual FDWs don't have to care about.

I was thinking to use a hash lookup in ExecFindResultRelInfo() when
es_result_relations is very long, but I think the setrefs.c approach
you mentioned above might be much better.  Will consider that.

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From
Andres Freund
Date:
Hi,

On 2019-08-03 19:41:55 +0900, Etsuro Fujita wrote:
> > I don't like
> > ExecFindResultRelInfo at all. What's the point of it? Its introduction
> > is still an API break - I don't understand why that break is better than
> > just passing the ResultRelInfo directly to BeginDirectModify()?
> 
> What API does that function break?

You need to call it, whereas previously you did not need to call it. The
effort to change an FDW to get one more parameter, or to call that
function is about the same.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-08-03 19:41:55 +0900, Etsuro Fujita wrote:
>> What API does that function break?

> You need to call it, whereas previously you did not need to call it. The
> effort to change an FDW to get one more parameter, or to call that
> function is about the same.

If those are the choices, adding a parameter is clearly the preferable
solution, because it makes the API breakage obvious at compile time.

Adding a function would make sense, perhaps, if only a minority of FDWs
need to do so.  It'd still be risky if the need to do so could be missed
in light testing.

            regards, tom lane



Re: partition routing layering in nodeModifyTable.c

From
Andres Freund
Date:
Hi,

On 2019-08-03 13:48:01 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-08-03 19:41:55 +0900, Etsuro Fujita wrote:
> >> What API does that function break?
> 
> > You need to call it, whereas previously you did not need to call it. The
> > effort to change an FDW to get one more parameter, or to call that
> > function is about the same.
> 
> If those are the choices, adding a parameter is clearly the preferable
> solution, because it makes the API breakage obvious at compile time.

Right.  I think it's a *bit* less clear in this case because we'd also
remove the field that such FDWs with direct modify support would use
now (EState.es_result_relation_info).

But I think it's also just plainly a better API to use the
parameter. Even if, in contrast to the BeginDirectModify at hand,
BeginForeignModify didn't already accept it. Requiring a function call to
gather information that just about every realistic implementation is
going to need doesn't make sense.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From
Etsuro Fujita
Date:
Hi,

On Sun, Aug 4, 2019 at 3:03 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-08-03 13:48:01 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2019-08-03 19:41:55 +0900, Etsuro Fujita wrote:
> > >> What API does that function break?
> >
> > > You need to call it, whereas previously you did not need to call it. The
> > > effort to change an FDW to get one more parameter, or to call that
> > > function is about the same.

I got the point.

> > If those are the choices, adding a parameter is clearly the preferable
> > solution, because it makes the API breakage obvious at compile time.
>
> Right.  I think it's a *bit* less clear in this case because we'd also
> remove the field that such FDWs with direct modify support would use
> now (EState.es_result_relation_info).
>
> But I think it's also just plainly a better API to use the
> parameter. Even if, in contrast to the BeginDirectModify at hand,
> BeginForeignModify didn't already accept it. Requiring a function call to
> gather information that just about every realistic implementation is
> going to need doesn't make sense.

Agreed.

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
Hi,

On Sun, Aug 4, 2019 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Sun, Aug 4, 2019 at 3:03 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-08-03 13:48:01 -0400, Tom Lane wrote:
> > > If those are the choices, adding a parameter is clearly the preferable
> > > solution, because it makes the API breakage obvious at compile time.
> >
> > Right.  I think it's a *bit* less clear in this case because we'd also
> > remove the field that such FDWs with direct modify support would use
> > now (EState.es_result_relation_info).
> >
> > But I think it's also just plainly a better API to use the
> > parameter. Even if, in contrast to the BeginDirectModify at hand,
> > BeginForeignModify didn't already accept it. Requiring a function call to
> > gather information that just about every realistic implementation is
> > going to need doesn't make sense.
>
> Agreed.

So, is it correct to think that the consensus is to add a parameter to
BeginDirectModify()?

Also, avoid changing where BeginDirectModify() is called from, like my
patch did, only to have easy access to the ResultRelInfo to pass.  We
can do that by augmenting the ForeignScan node to add the information
needed to fetch the ResultRelInfo efficiently from
ExecInitForeignScan() itself.  That information is the ordinal
position of a given result relation in PlannedStmt.resultRelations,
not the RT index as we were discussing.
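A tiny sketch, with hypothetical names rather than the actual executor fields, of how storing that ordinal position lets ExecInitForeignScan() fetch its result relation with a direct array index instead of a search:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for ResultRelInfo. */
typedef struct Rel
{
    const char *name;
} Rel;

/* If the planner records the scan's ordinal position within the flat list
 * of all result relations (cf. PlannedStmt.resultRelations), the executor
 * can index its array of result relations directly: an O(1) lookup, with
 * no dependency on es_result_relation_info and no linear search. */
static Rel *
fetch_result_rel_by_ordinal(Rel *result_relations, size_t nrels, int ordinal)
{
    if (ordinal < 0 || (size_t) ordinal >= nrels)
        return NULL;            /* guard against a bogus plan-time value */
    return &result_relations[ordinal];
}
```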

Thanks,
Amit



Re: partition routing layering in nodeModifyTable.c

From
Etsuro Fujita
Date:
Amit-san,

On Mon, Aug 5, 2019 at 1:31 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Sun, Aug 4, 2019 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Sun, Aug 4, 2019 at 3:03 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2019-08-03 13:48:01 -0400, Tom Lane wrote:
> > > > If those are the choices, adding a parameter is clearly the preferable
> > > > solution, because it makes the API breakage obvious at compile time.
> > >
> > > Right.  I think it's a *bit* less clear in this case because we'd also
> > > remove the field that such FDWs with direct modify support would use
> > > now (EState.es_result_relation_info).
> > >
> > > But I think it's also just plainly a better API to use the
> > > parameter. Even if, in contrast to the BeginDirectModify at hand,
> > > BeginForeignModify didn't already accept it. Requiring a function call to
> > > gather information that just about every realistic implementation is
> > > going to need doesn't make sense.
> >
> > Agreed.
>
> So, is it correct to think that the consensus is to add a parameter to
> BeginDirectModify()?

I think so.

> Also, avoid changing where BeginDirectModify() is called from, like my
> patch did, only to have easy access to the ResultRelInfo to pass.  We
> can do that by augmenting the ForeignScan node to add the information
> needed to fetch the ResultRelInfo efficiently from
> ExecInitForeignScan() itself.

I think so.

> That information is the ordinal
> position of a given result relation in PlannedStmt.resultRelations,
> not the RT index as we were discussing.

Yeah, that would be what Andres is proposing, which I think is much
better than what I proposed using the RT index.

Could you update your patch?

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Fujita-san,

Thanks for the quick follow up.

On Mon, Aug 5, 2019 at 2:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Mon, Aug 5, 2019 at 1:31 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Sun, Aug 4, 2019 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > On Sun, Aug 4, 2019 at 3:03 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2019-08-03 13:48:01 -0400, Tom Lane wrote:
> > > > > If those are the choices, adding a parameter is clearly the preferable
> > > > > solution, because it makes the API breakage obvious at compile time.
> > > >
> > > > Right.  I think it's a *bit* less clear in this case because we'd also
> > > > remove the field that such FDWs with direct modify support would use
> > > > now (EState.es_result_relation_info).
> > > >
> > > > But I think it's also just plainly a better API to use the
> > > > parameter. Even if, in contrast to the BeginDirectModify at hand,
> > > > BeginForeignModify didn't already accept it. Requiring a function call to
> > > > gather information that just about every realistic implementation is
> > > > going to need doesn't make sense.
> > >
> > > Agreed.
> >
> > So, is it correct to think that the consensus is to add a parameter to
> > BeginDirectModify()?
>
> I think so.
>
> > Also, avoid changing where BeginDirectModify() is called from, like my
> > patch did, only to have easy access to the ResultRelInfo to pass.  We
> > can do that by augmenting the ForeignScan node to add the information
> > needed to fetch the ResultRelInfo efficiently from
> > ExecInitForeignScan() itself.
>
> I think so.
>
> > That information is the ordinal
> > position of a given result relation in PlannedStmt.resultRelations,
> > not the RT index as we were discussing.
>
> Yeah, that would be what Andres is proposing, which I think is much
> better than what I proposed using the RT index.
>
> Could you update your patch?

OK, I will do that.  I'll reply with the updated patches to an
upthread email of Andres' [1], where he also comments on the other
patches.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/20190802180138.64zcircokw2upaho%40alap3.anarazel.de



Re: partition routing layering in nodeModifyTable.c

From:
Etsuro Fujita
Date:
Amit-san,

On Mon, Aug 5, 2019 at 2:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Aug 5, 2019 at 2:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Mon, Aug 5, 2019 at 1:31 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Sun, Aug 4, 2019 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > > On Sun, Aug 4, 2019 at 3:03 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > On 2019-08-03 13:48:01 -0400, Tom Lane wrote:
> > > > > > If those are the choices, adding a parameter is clearly the preferable
> > > > > > solution, because it makes the API breakage obvious at compile time.
> > > > >
> > > > > Right.  I think it's a *bit* less clear in this case because we'd also
> > > > > remove the field that such FDWs with direct modify support would use
> > > > > now (EState.es_result_relation_info).
> > > > >
> > > > > But I think it's also just plainly a better API to use the
> > > > > parameter. Even if, in contrast to the BeginDirectModify at hand,
> > > > > BeginForeignModify didn't already accept it. Requiring a function call to
> > > > > gather information that just about every realistic implementation is
> > > > > going to need doesn't make sense.
> > > >
> > > > Agreed.
> > >
> > > So, is it correct to think that the consensus is to add a parameter to
> > > BeginDirectModify()?
> >
> > I think so.
> >
> > > Also, avoid changing where BeginDirectModify() is called from, like my
> > > patch did, only to have easy access to the ResultRelInfo to pass.  We
> > > can do that by augmenting the ForeignScan node to add the information
> > > needed to fetch the ResultRelInfo efficiently from
> > > ExecInitForeignScan() itself.
> >
> > I think so.
> >
> > > That information is the ordinal
> > > position of a given result relation in PlannedStmt.resultRelations,
> > > not the RT index as we were discussing.
> >
> > Yeah, that would be what Andres is proposing, which I think is much
> > better than what I proposed using the RT index.
> >
> > Could you update your patch?
>
> OK, I will do that.  I'll reply with the updated patches to an
> upthread email of Andres' [1], where he also comments on the other
> patches.

Thanks!  Will review the updated version of the FDW patch, at least.

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Hi Andres, Fujita-san,

On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > I looked into trying to do the things I mentioned above and it seems
> > to me that revising BeginDirectModify()'s API to receive the
> > ResultRelInfo directly as Andres suggested might be the best way
> > forward.  I've implemented that in the attached 0001.  Patches that
> > were previously 0001, 0002, and 0003 are now 0002, 003, and 0004,
> > respectively.  0002 is now a patch to "remove"
> > es_result_relation_info.
>
> Thanks!  Some minor quibbles aside, the non FDW patches look good to me.
>
> Fujita-san, do you have any comments on the FDW API change?  Or anybody
> else?

Based on the discussion, I have updated the patch.

> I'm a bit worried about the move of BeginDirectModify() into
> nodeModifyTable.c - it just seems like an odd control flow to me. Not
> allowing any intermediate nodes between ForeignScan and ModifyTable also
> seems like an undesirable restriction for the future. I realize that we
> already do that for BeginForeignModify() (just btw, that already accepts
> resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> makes sense), but it still seems like the wrong direction to me.
>
> The need for that move, I assume, comes from needing to know the correct
> ResultRelInfo, correct?  I wonder if we shouldn't instead determine that
> at plan time (in setrefs.c), somewhat similarly to how we determine
> ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?

The patch adds a resultRelIndex field to ForeignScan node, which is
set to a value >= 0 for non-SELECT queries.  I first thought to set it
only if direct modification is being used, but maybe it'd be simpler
to set it even if direct modification is not used.  To set it, the
patch teaches set_plan_refs() to initialize resultRelIndex of
ForeignScan plans that appear under ModifyTable.  Fujita-san said he
plans to revise the planning of direct-modification style queries to
not require a ModifyTable node anymore, but maybe he'll just need to
add similar code elsewhere but not outside setrefs.c.

> Then we could just have BeginForeignModify, BeginDirectModify,
> BeginForeignScan all be called from ExecInitForeignScan().

I too think that it would've been great if we could call both
BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but
the former's API seems to be designed to be called from
ExecInitModifyTable from the get-go.  Maybe we should leave that
as-is?

> Patch 04 is such a nice improvement. Besides getting rid of a substantial
> amount of code, it also makes the control flow a lot easier to read.

Thanks.

> > @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType)
> >   * If there are no triggers in 'trigdesc' that request relevant transition
> >   * tables, then return NULL.
> >   *
> > - * The resulting object can be passed to the ExecAR* functions.  The caller
> > - * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
> > - * with child tables.
> > + * The resulting object can be passed to the ExecAR* functions.
> >   *
> >   * Note that we copy the flags from a parent table into this struct (rather
> >   * than subsequently using the relation's TriggerDesc directly) so that we can
> > @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
> >        */
> >       if (row_trigger && transition_capture != NULL)
> >       {
> > -             TupleTableSlot *original_insert_tuple = transition_capture->tcs_original_insert_tuple;
> > -             TupleConversionMap *map = transition_capture->tcs_map;
> > +             TupleTableSlot *original_insert_tuple;
> > +             PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo;
> > +             TupleConversionMap *map = pinfo ?
> > +                                                             pinfo->pi_PartitionToRootMap :
> > +                                                             relinfo->ri_ChildToRootMap;
> >               bool            delete_old_table = transition_capture->tcs_delete_old_table;
> >               bool            update_old_table = transition_capture->tcs_update_old_table;
> >               bool            update_new_table = transition_capture->tcs_update_new_table;
> >               bool            insert_new_table = transition_capture->tcs_insert_new_table;
> >
> >               /*
> > +              * Get the originally inserted tuple from the global variable and set
> > +              * the latter to NULL because any given tuple must be read only once.
> > +              * Note that the TransitionCaptureState is shared across many calls
> > +              * to this function.
> > +              */
> > +             original_insert_tuple = transition_capture->tcs_original_insert_tuple;
> > +             transition_capture->tcs_original_insert_tuple = NULL;
>
> Maybe I'm missing something, but original_insert_tuple is not a global
> variable?

I really meant to refer to the fact that it's maintained in a
ModifyTable-global struct.  I've updated this comment a bit.

> > @@ -888,7 +889,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
> >                                       PartitionTupleRouting *proute,
> >                                       PartitionDispatch dispatch,
> >                                       ResultRelInfo *partRelInfo,
> > -                                     int partidx)
> > +                                     int partidx,
> > +                                     bool is_update_result_rel)
> >  {
> >       MemoryContext oldcxt;
> >       PartitionRoutingInfo *partrouteinfo;
> > @@ -935,10 +937,15 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
> >       if (mtstate &&
> >               (mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
> >       {
> > -             partrouteinfo->pi_PartitionToRootMap =
> > -                     convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
> > -                                                                RelationGetDescr(partRelInfo->ri_PartitionRoot),
> > -                                                                gettext_noop("could not convert row type"));
> > +             /* If partition is an update target, then we already got the map. */
> > +             if (is_update_result_rel)
> > +                     partrouteinfo->pi_PartitionToRootMap =
> > +                             partRelInfo->ri_ChildToRootMap;
> > +             else
> > +                     partrouteinfo->pi_PartitionToRootMap =
> > +                             convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
> > +                                                                        RelationGetDescr(partRelInfo->ri_PartitionRoot),
> > +                                                                        gettext_noop("could not convert row type"));
> >       }
>
> Hm, isn't is_update_result_rel just ModifyTable->operation == CMD_UPDATE?

No.  The operation being CMD_UPDATE doesn't mean that the
ResultRelInfo that is passed to ExecInitRoutingInfo() is an UPDATE
result rel.  It could be a ResultRelInfo built by ExecFindPartition()
when a row needed to be moved into a partition that is not present in
the UPDATE result rels contained in ModifyTableState.  Though I
realized that we don't really need to add a new parameter to figure
that out.  Looking at ri_RangeTableIndex property of the passed-in
ResultRelInfo is enough to distinguish the two types of
ResultRelInfos.  I've updated the patch that way.

I found more dead code related to transition capture setup, which I've
removed in the latest 0004.  For example, the
mt_per_subplan_tupconv_maps array and the code in nodeModifyTable.c
that was used to initialize it.

Attached updated patches.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Etsuro Fujita
Date:
Amit-san,

On Mon, Aug 5, 2019 at 6:16 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
> Based on the discussion, I have updated the patch.
>
> > I'm a bit worried about the move of BeginDirectModify() into
> > nodeModifyTable.c - it just seems like an odd control flow to me. Not
> > allowing any intermediate nodes between ForeignScan and ModifyTable also
> > seems like an undesirable restriction for the future. I realize that we
> > already do that for BeginForeignModify() (just btw, that already accepts
> > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> > makes sense), but it still seems like the wrong direction to me.
> >
> > The need for that move, I assume, comes from needing to know the correct
> > ResultRelInfo, correct?  I wonder if we shouldn't instead determine that
> > at plan time (in setrefs.c), somewhat similarly to how we determine
> > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?
>
> The patch adds a resultRelIndex field to ForeignScan node, which is
> set to a value >= 0 for non-SELECT queries.

Thanks for the updated patch!

> I first thought to set it
> only if direct modification is being used, but maybe it'd be simpler
> to set it even if direct modification is not used.  To set it, the
> patch teaches set_plan_refs() to initialize resultRelIndex of
> ForeignScan plans that appear under ModifyTable.  Fujita-san said he
> plans to revise the planning of direct-modification style queries to
> not require a ModifyTable node anymore, but maybe he'll just need to
> add similar code elsewhere but not outside setrefs.c.

Yeah, but I'm not sure this is a good idea:

@@ -877,12 +878,6 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
                    rc->rti += rtoffset;
                    rc->prti += rtoffset;
                }
-               foreach(l, splan->plans)
-               {
-                   lfirst(l) = set_plan_refs(root,
-                                             (Plan *) lfirst(l),
-                                             rtoffset);
-               }

                /*
                 * Append this ModifyTable node's final result relation RT
@@ -908,6 +903,27 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
                        lappend_int(root->glob->rootResultRelations,
                                    splan->rootRelation);
                }
+
+               resultRelIndex = splan->resultRelIndex;
+               foreach(l, splan->plans)
+               {
+                   lfirst(l) = set_plan_refs(root,
+                                             (Plan *) lfirst(l),
+                                             rtoffset);
+
+                   /*
+                    * For foreign table result relations, save their index
+                    * in the global list of result relations into the
+                    * corresponding ForeignScan nodes.
+                    */
+                   if (IsA(lfirst(l), ForeignScan))
+                   {
+                       ForeignScan *fscan = (ForeignScan *) lfirst(l);
+
+                       fscan->resultRelIndex = resultRelIndex;
+                   }
+                   resultRelIndex++;
+               }
            }

because I still feel the same way as mentioned above by Andres.  What
I'm thinking for the setrefs.c change is to modify ForeignScan (ie,
set_foreignscan_references) rather than ModifyTable, like the
attached.  Maybe I'm missing something, but for direct modification
without ModifyTable, I think we would probably only have to modify
that function further so that it not only adjusts resultRelIndex but
does some extra work such as appending the result relation RT index to
root->glob->resultRelations as done for ModifyTable.

> > Then we could just have BeginForeignModify, BeginDirectModify,
> > BeginForeignScan all be called from ExecInitForeignScan().

Sorry, previously, I mistakenly agreed with that.  As I said before, I
think I was too tired.

> I too think that it would've been great if we could call both
> BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but
> the former's API seems to be designed to be called from
> ExecInitModifyTable from the get-go.  Maybe we should leave that
> as-is?

+1 for leaving that as-is; it seems reasonable to me to call
BeginForeignModify in ExecInitModifyTable, because the ForeignModify
API is designed based on an analogy with local table modifications, in
which case the initialization needed for performing
ExecInsert/ExecUpdate/ExecDelete is done in ModifyTable, not in the
underlying scan/join node.

@@ -895,6 +898,12 @@ BeginDirectModify(ForeignScanState *node,
      for <function>ExplainDirectModify</function> and <function>EndDirectModify</function>.
     </para>

+    <note>
+     Also note that it's a good idea to store the <literal>rinfo</literal>
+     in the <structfield>fdw_state</structfield> for
+     <function>IterateDirectModify</function> to use.
+    </note>

Actually, if the FDW only supports direct modifications for queries
without RETURNING, it wouldn't need the rinfo in IterateDirectModify,
so I think we would probably need to update this as such.  Having said
that, it seems too detailed to me to describe such a thing in the FDW
documentation.  To avoid making the documentation verbose, it would be
better to not add such kind of thing at all?

Note: the other change in the attached patch is that I modified
_readForeignScan accordingly.

Best regards,
Etsuro Fujita

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Andres Freund
Date:
Hi,

On 2019-08-05 18:16:10 +0900, Amit Langote wrote:
> The patch adds a resultRelIndex field to ForeignScan node, which is
> set to a value >= 0 for non-SELECT queries.  I first thought to set it
> only if direct modification is being used, but maybe it'd be simpler
> to set it even if direct modification is not used.

Yea, I think we should just always set it.


> To set it, the
> patch teaches set_plan_refs() to initialize resultRelIndex of
> ForeignScan plans that appear under ModifyTable.  Fujita-san said he
> plans to revise the planning of direct-modification style queries to
> not require a ModifyTable node anymore, but maybe he'll just need to
> add similar code elsewhere but not outside setrefs.c.

I think I prefer the approach in Fujita-san's email. While not extremely
pretty either, it would allow for having nodes between the foreign scan
and the modify node.


> > Then we could just have BeginForeignModify, BeginDirectModify,
> > BeginForeignScan all be called from ExecInitForeignScan().
> 
> I too think that it would've been great if we could call both
> BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but
> the former's API seems to be designed to be called from
> ExecInitModifyTable from the get-go.  Maybe we should leave that
> as-is?

Yea, we should leave it where it is.  I think the API here is fairly
ugly, but it's probably not worth changing. And if we were to change it,
it'd need a lot bigger hammer.

Greetings,

Andres Freund



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Fujita-san,

Thanks a lot for the review.

On Tue, Aug 6, 2019 at 9:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Mon, Aug 5, 2019 at 6:16 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I first thought to set it
> > only if direct modification is being used, but maybe it'd be simpler
> > to set it even if direct modification is not used.  To set it, the
> > patch teaches set_plan_refs() to initialize resultRelIndex of
> > ForeignScan plans that appear under ModifyTable.  Fujita-san said he
> > plans to revise the planning of direct-modification style queries to
> > not require a ModifyTable node anymore, but maybe he'll just need to
> > add similar code elsewhere but not outside setrefs.c.
>
> Yeah, but I'm not sure this is a good idea:
>
> @@ -877,12 +878,6 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
>                     rc->rti += rtoffset;
>                     rc->prti += rtoffset;
>                 }
> -               foreach(l, splan->plans)
> -               {
> -                   lfirst(l) = set_plan_refs(root,
> -                                             (Plan *) lfirst(l),
> -                                             rtoffset);
> -               }
>
>                 /*
>                  * Append this ModifyTable node's final result relation RT
> @@ -908,6 +903,27 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
>                         lappend_int(root->glob->rootResultRelations,
>                                     splan->rootRelation);
>                 }
> +
> +               resultRelIndex = splan->resultRelIndex;
> +               foreach(l, splan->plans)
> +               {
> +                   lfirst(l) = set_plan_refs(root,
> +                                             (Plan *) lfirst(l),
> +                                             rtoffset);
> +
> +                   /*
> +                    * For foreign table result relations, save their index
> +                    * in the global list of result relations into the
> +                    * corresponding ForeignScan nodes.
> +                    */
> +                   if (IsA(lfirst(l), ForeignScan))
> +                   {
> +                       ForeignScan *fscan = (ForeignScan *) lfirst(l);
> +
> +                       fscan->resultRelIndex = resultRelIndex;
> +                   }
> +                   resultRelIndex++;
> +               }
>             }
>
> because I still feel the same way as mentioned above by Andres.

Reading Andres' emails again, I now understand why we shouldn't set
ForeignScan's resultRelIndex the way my patches did.

>  What
> I'm thinking for the setrefs.c change is to modify ForeignScan (ie,
> set_foreignscan_references) rather than ModifyTable, like the
> attached.

Thanks for the patch.  I have couple of comments:

* I'm afraid that we've implicitly created an ordering constraint on
some code in set_plan_refs().  That is, a ModifyTable's plans now must
always be processed before adding its result relations to the global
list, which for good measure, should be written down somewhere; I
propose this comment in the ModifyTable's case block in set_plan_refs:

@@ -877,6 +877,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
                     rc->rti += rtoffset;
                     rc->prti += rtoffset;
                 }
+                /*
+                 * Caution: Do not change the relative ordering of this loop
+                 * and the statement below that adds the result relations to
+                 * root->glob->resultRelations, because we need to use the
+                 * current value of list_length(root->glob->resultRelations)
+                 * in some plans.
+                 */
                 foreach(l, splan->plans)
                 {
                     lfirst(l) = set_plan_refs(root,

* Regarding setting ForeignScan.resultRelIndex even for non-direct
modifications, maybe that's not a good idea anymore.  A foreign table
result relation might be involved in a local join, which prevents it
from being directly-modifiable and also hides the ForeignScan node
from being easily modifiable in PlanForeignModify.  Maybe, we should
just interpret resultRelIndex as being set only when
direct-modification is feasible.  Should we rename the field
accordingly to be self-documenting?

Please let me know your thoughts, so that I can modify the patch.

>  Maybe I'm missing something, but for direct modification
> without ModifyTable, I think we would probably only have to modify
> that function further so that it not only adjusts resultRelIndex but
> does some extra work such as appending the result relation RT index to
> root->glob->resultRelations as done for ModifyTable.

Yeah, that seems reasonable.

> > > Then we could just have BeginForeignModify, BeginDirectModify,
> > > BeginForeignScan all be called from ExecInitForeignScan().
>
> Sorry, previously, I mistakenly agreed with that.  As I said before, I
> think I was too tired.
>
> > I too think that it would've been great if we could call both
> > BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but
> > the former's API seems to be designed to be called from
> > ExecInitModifyTable from the get-go.  Maybe we should leave that
> > as-is?
>
> +1 for leaving that as-is; it seems reasonable to me to call
> BeginForeignModify in ExecInitModifyTable, because the ForeignModify
> API is designed based on an analogy with local table modifications, in
> which case the initialization needed for performing
> ExecInsert/ExecUpdate/ExecDelete is done in ModifyTable, not in the
> underlying scan/join node.

Thanks for the explanation.

> @@ -895,6 +898,12 @@ BeginDirectModify(ForeignScanState *node,
>       for <function>ExplainDirectModify</function> and <function>EndDirectModify</function>.
>      </para>
>
> +    <note>
> +     Also note that it's a good idea to store the <literal>rinfo</literal>
> +     in the <structfield>fdw_state</structfield> for
> +     <function>IterateDirectModify</function> to use.
> +    </note>
>
> Actually, if the FDW only supports direct modifications for queries
> without RETURNING, it wouldn't need the rinfo in IterateDirectModify,
> so I think we would probably need to update this as such.  Having said
> that, it seems too detailed to me to describe such a thing in the FDW
> documentation.  To avoid making the documentation verbose, it would be
> better to not add such kind of thing at all?

Hmm OK.  Perhaps, others who want to implement the direct modification
API can work that out by looking at postgres_fdw implementation.

> Note: the other change in the attached patch is that I modified
> _readForeignScan accordingly.

Thanks.

Regards,
Amit



Re: partition routing layering in nodeModifyTable.c

From:
Etsuro Fujita
Date:
Amit-san,

On Wed, Aug 7, 2019 at 10:24 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Aug 6, 2019 at 9:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> >  What
> > I'm thinking for the setrefs.c change is to modify ForeignScan (ie,
> > set_foreignscan_references) rather than ModifyTable, like the
> > attached.
>
> Thanks for the patch.  I have couple of comments:
>
> * I'm afraid that we've implicitly created an ordering constraint on
> some code in set_plan_refs().  That is, a ModifyTable's plans now must
> always be processed before adding its result relations to the global
> list, which for good measure, should be written down somewhere; I
> propose this comment in the ModifyTable's case block in set_plan_refs:
>
> @@ -877,6 +877,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
>                      rc->rti += rtoffset;
>                      rc->prti += rtoffset;
>                  }
> +                /*
> +                 * Caution: Do not change the relative ordering of this loop
> +                 * and the statement below that adds the result relations to
> +                 * root->glob->resultRelations, because we need to use the
> +                 * current value of list_length(root->glob->resultRelations)
> +                 * in some plans.
> +                 */
>                  foreach(l, splan->plans)
>                  {
>                      lfirst(l) = set_plan_refs(root,

+1

> * Regarding setting ForeignScan.resultRelIndex even for non-direct
> modifications, maybe that's not a good idea anymore.  A foreign table
> result relation might be involved in a local join, which prevents it
> from being directly-modifiable and also hides the ForeignScan node
> from being easily modifiable in PlanForeignModify.  Maybe, we should
> just interpret resultRelIndex as being set only when
> direct-modification is feasible.

Yeah, I think so; when using PlanForeignModify because, for example,
the foreign table result relation is involved in a local join, as you
mentioned, ForeignScan.operation would be left unchanged (ie,
CMD_SELECT), so to me it's more understandable to not set
ForeignScan.resultRelIndex.

> Should we rename the field
> accordingly to be self-documenting?

IMO I like the name resultRelIndex, but do you have any better idea?

> > @@ -895,6 +898,12 @@ BeginDirectModify(ForeignScanState *node,
> >       for <function>ExplainDirectModify</function> and <function>EndDirectModify</function>.
> >      </para>
> >
> > +    <note>
> > +     Also note that it's a good idea to store the <literal>rinfo</literal>
> > +     in the <structfield>fdw_state</structfield> for
> > +     <function>IterateDirectModify</function> to use.
> > +    </note>
> >
> > Actually, if the FDW only supports direct modifications for queries
> > without RETURNING, it wouldn't need the rinfo in IterateDirectModify,
> > so I think we would probably need to update this as such.  Having said
> > that, it seems too detailed to me to describe such a thing in the FDW
> > documentation.  To avoid making the documentation verbose, it would be
> > better to not add such kind of thing at all?
>
> Hmm OK.  Perhaps, others who want to implement the direct modification
> API can work that out by looking at postgres_fdw implementation.

Yeah, I think so.

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Fujita-san,

Thanks for the quick follow up.

On Wed, Aug 7, 2019 at 11:30 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Aug 7, 2019 at 10:24 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > * Regarding setting ForeignScan.resultRelIndex even for non-direct
> > modifications, maybe that's not a good idea anymore.  A foreign table
> > result relation might be involved in a local join, which prevents it
> > from being directly-modifiable and also hides the ForeignScan node
> > from being easily modifiable in PlanForeignModify.  Maybe, we should
> > just interpret resultRelIndex as being set only when
> > direct-modification is feasible.
>
> Yeah, I think so; when using PlanForeignModify because for example,
> the foreign table result relation is involved in a local join, as you
> mentioned, ForeignScan.operation would be left unchanged (ie,
> CMD_SELECT), so to me it's more understandable to not set
> ForeignScan.resultRelIndex.

OK.

> > Should we rename the field
> > accordingly to be self-documenting?
>
> IMO I like the name resultRelIndex, but do you have any better idea?

On second thought, I'm fine with sticking to resultRelIndex.  Trying
to make it self documenting might make the name very long.

Here are the updated patches.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Etsuro Fujita
Date:
Hi,

On Wed, Aug 7, 2019 at 11:47 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Aug 7, 2019 at 11:30 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Wed, Aug 7, 2019 at 10:24 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > > * Regarding setting ForeignScan.resultRelIndex even for non-direct
> > > modifications, maybe that's not a good idea anymore.  A foreign table
> > > result relation might be involved in a local join, which prevents it
> > > from being directly-modifiable and also hides the ForeignScan node
> > > from being easily modifiable in PlanForeignModify.  Maybe, we should
> > > just interpret resultRelIndex as being set only when
> > > direct-modification is feasible.
> >
> > Yeah, I think so; when using PlanForeignModify because for example,
> > the foreign table result relation is involved in a local join, as you
> > mentioned, ForeignScan.operation would be left unchanged (ie,
> > CMD_SELECT), so to me it's more understandable to not set
> > ForeignScan.resultRelIndex.
>
> OK.
>
> > > Should we rename the field
> > > accordingly to be self-documenting?
> >
> > IMO I like the name resultRelIndex, but do you have any better idea?
>
> On second thought, I'm fine with sticking to resultRelIndex.  Trying
> to make it self documenting might make the name very long.

OK

> Here are the updated patches.

IIUC, I think we reached a consensus at least on the 0001 patch.
Andres, would you mind if I commit that patch?

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Wed, Aug 7, 2019 at 12:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> IIUC, I think we reached a consensus at least on the 0001 patch.
> Andres, would you mind if I commit that patch?

I just noticed obsolete references to es_result_relation_info that
0002 failed to remove.  One of them is in fdwhandler.sgml:

<programlisting>
TupleTableSlot *
IterateDirectModify(ForeignScanState *node);
</programlisting>

    ... The data that was actually inserted, updated
     or deleted must be stored in the
     <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
     of the node's <structname>EState</structname>.

We will need to rewrite this without mentioning
es_result_relation_info.  How about as follows:

-     <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
-     of the node's <structname>EState</structname>.
+     <literal>ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
+     of the result relation's <structname>ResultRelInfo</structname> that has
+     been made available via node.

I've updated 0001 with the above change.

Also, I updated 0002 to remove other references.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Etsuro Fujita
Date:
Amit-san,

On Wed, Aug 7, 2019 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Aug 7, 2019 at 12:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > IIUC, I think we reached a consensus at least on the 0001 patch.
> > Andres, would you mind if I commit that patch?
>
> I just noticed obsolete references to es_result_relation_info that
> 0002 failed to remove.  One of them is in fdwhandler.sgml:
>
> <programlisting>
> TupleTableSlot *
> IterateDirectModify(ForeignScanState *node);
> </programlisting>
>
>     ... The data that was actually inserted, updated
>      or deleted must be stored in the
>      <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
>      of the node's <structname>EState</structname>.
>
> We will need to rewrite this without mentioning
> es_result_relation_info.  How about as follows:
>
> -     <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> -     of the node's <structname>EState</structname>.
> +     <literal>ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> +     of the result relation's <structname>ResultRelInfo</structname> that has
> +     been made available via node.
>
> I've updated 0001 with the above change.

Good catch!

This would be nitpicking, but:

* IIUC, we don't use the term "result relation" in fdwhandler.sgml.
For consistency with your change to the doc for BeginDirectModify, how
about using the term "target foreign table" instead of "result
relation"?

* ISTM that "<structname>ResultRelInfo</structname> that has been made
available via node" would be a bit fuzzy to FDW authors.  To be more
specific, how about changing it to
"<structname>ResultRelInfo</structname> passed to
<function>BeginDirectModify</function>" or something like that?

Best regards,
Etsuro Fujita



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Fujita-san,

On Wed, Aug 7, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Aug 7, 2019 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I just noticed obsolete references to es_result_relation_info that
> > 0002 failed to remove.  One of them is in fdwhandler.sgml:
> >
> > <programlisting>
> > TupleTableSlot *
> > IterateDirectModify(ForeignScanState *node);
> > </programlisting>
> >
> >     ... The data that was actually inserted, updated
> >      or deleted must be stored in the
> >      <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> >      of the node's <structname>EState</structname>.
> >
> > We will need to rewrite this without mentioning
> > es_result_relation_info.  How about as follows:
> >
> > -     <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> > -     of the node's <structname>EState</structname>.
> > +     <literal>ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> > +     of the result relation's <structname>ResultRelInfo</structname> that has
> > +     been made available via node.
> >
> > I've updated 0001 with the above change.
>
> Good catch!

Thanks for the review.

> This would be nitpicking, but:
>
> * IIUC, we don't use the term "result relation" in fdwhandler.sgml.
> For consistency with your change to the doc for BeginDirectModify, how
> about using the term "target foreign table" instead of "result
> relation"?

Agreed, done.

> * ISTM that "<structname>ResultRelInfo</structname> that has been made
> available via node" would be a bit fuzzy to FDW authors.  To be more
> specific, how about changing it to
> "<structname>ResultRelInfo</structname> passed to
> <function>BeginDirectModify</function>" or something like that?

That works for me, although an FDW author reading this still has got
to make the connection.

Attached updated patches; only 0001 changed in this version.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Etsuro Fujita
Date:
Hi,

On Thu, Aug 8, 2019 at 10:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Aug 7, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Wed, Aug 7, 2019 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > I just noticed obsolete references to es_result_relation_info that
> > > 0002 failed to remove.  One of them is in fdwhandler.sgml:
> > >
> > > <programlisting>
> > > TupleTableSlot *
> > > IterateDirectModify(ForeignScanState *node);
> > > </programlisting>
> > >
> > >     ... The data that was actually inserted, updated
> > >      or deleted must be stored in the
> > >      <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> > >      of the node's <structname>EState</structname>.
> > >
> > > We will need to rewrite this without mentioning
> > > es_result_relation_info.  How about as follows:
> > >
> > > -     <literal>es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> > > -     of the node's <structname>EState</structname>.
> > > +     <literal>ri_projectReturning->pi_exprContext->ecxt_scantuple</literal>
> > > +     of the result relation's <structname>ResultRelInfo</structname> that has
> > > +     been made available via node.
> > >
> > > I've updated 0001 with the above change.

> > This would be nitpicking, but:
> >
> > * IIUC, we don't use the term "result relation" in fdwhandler.sgml.
> > For consistency with your change to the doc for BeginDirectModify, how
> > about using the term "target foreign table" instead of "result
> > relation"?
>
> Agreed, done.
>
> > * ISTM that "<structname>ResultRelInfo</structname> that has been made
> > available via node" would be a bit fuzzy to FDW authors.  To be more
> > specific, how about changing it to
> > "<structname>ResultRelInfo</structname> passed to
> > <function>BeginDirectModify</function>" or something like that?
>
> That works for me, although an FDW author reading this still has got
> to make the connection.
>
> Attached updated patches; only 0001 changed in this version.

Thanks for the updated version, Amit-san!  I updated the 0001 patch a
bit further:

* Tweaked comments in plannodes.h, createplan.c, and nodeForeignscan.c.
* Made cosmetic changes to postgres_fdw.c.
* Adjusted doc changes a bit, mainly not to produce unnecessary diff.
* Modified the commit message.

Attached is an updated version of the 0001 patch.  Does that make sense?

Best regards,
Etsuro Fujita

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Fujita-san,

On Thu, Aug 8, 2019 at 9:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Aug 8, 2019 at 10:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Attached updated patches; only 0001 changed in this version.
>
> Thanks for the updated version, Amit-san!  I updated the 0001 patch a
> bit further:
>
> * Tweaked comments in plannodes.h, createplan.c, and nodeForeignscan.c.
> * Made cosmetic changes to postgres_fdw.c.
> * Adjusted doc changes a bit, mainly not to produce unnecessary diff.
> * Modified the commit message.
>
> Attached is an updated version of the 0001 patch.  Does that make sense?

Looks perfect, thank you.

Regards,
Amit



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Fri, Aug 9, 2019 at 10:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Fujita-san,
>
> On Thu, Aug 8, 2019 at 9:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Thu, Aug 8, 2019 at 10:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > > Attached updated patches; only 0001 changed in this version.
> >
> > Thanks for the updated version, Amit-san!  I updated the 0001 patch a
> > bit further:
> >
> > * Tweaked comments in plannodes.h, createplan.c, and nodeForeignscan.c.
> > * Made cosmetic changes to postgres_fdw.c.
> > * Adjusted doc changes a bit, mainly not to produce unnecessary diff.
> > * Modified the commit message.
> >
> > Attached is an updated version of the 0001 patch.  Does that make sense?
>
> Looks perfect, thank you.

To avoid losing track of this, I've added this to November CF.

https://commitfest.postgresql.org/25/2277/

Struggled a bit to give a title to the entry though.

Thanks,
Amit



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Wed, Sep 4, 2019 at 10:45 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Aug 9, 2019 at 10:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
> To avoid losing track of this, I've added this to November CF.
>
> https://commitfest.postgresql.org/25/2277/
>
> Struggled a bit to give a title to the entry though.

Noticed that one of the patches needed a rebase.

Attached updated patches.  Note that v8-0001 is v7-0001 unchanged that
Fujita-san posted on Aug 8.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Thu, Sep 26, 2019 at 1:56 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Sep 4, 2019 at 10:45 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Fri, Aug 9, 2019 at 10:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > To avoid losing track of this, I've added this to November CF.
> >
> > https://commitfest.postgresql.org/25/2277/
> >
> > Struggled a bit to give a title to the entry though.
>
> Noticed that one of the patches needed a rebase.
>
> Attached updated patches.  Note that v8-0001 is v7-0001 unchanged that
> Fujita-san posted on Aug 8.

Rebased again.

Thanks,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> Rebased again.

Seems to need that again, according to cfbot :-(

            regards, tom lane



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Mon, Mar 2, 2020 at 4:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > Rebased again.
>
> Seems to need that again, according to cfbot :-(

Thank you, done.

Regards,
Amit

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Daniel Gustafsson
Date:
> On 2 Mar 2020, at 06:08, Amit Langote <amitlangote09@gmail.com> wrote:
> 
> On Mon, Mar 2, 2020 at 4:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Langote <amitlangote09@gmail.com> writes:
>>> Rebased again.
>> 
>> Seems to need that again, according to cfbot :-(
> 
> Thank you, done.

...and another one is needed as it no longer applies; please submit a
rebased version.

cheers ./daniel



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Wed, Jul 1, 2020 at 6:56 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 2 Mar 2020, at 06:08, Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2020 at 4:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Amit Langote <amitlangote09@gmail.com> writes:
> >>> Rebased again.
> >>
> >> Seems to need that again, according to cfbot :-(
> >
> > Thank you, done.
>
> ..and another one is needed as it no longer applies, please submit a rebased
> version.

Sorry, it took me a while to get to this.

It's been over 11 months since there was any significant commentary on
the contents of the patches themselves, so perhaps I should reiterate
what the patches are about and why it might still be a good idea to
consider them.

The thread started with some very valid criticism of the way
executor's partition tuple routing logic looks randomly sprinkled over
in nodeModifyTable.c, execPartition.c.  In the process of making it
look less random, we decided to get rid of the global variable
es_result_relation_info to avoid complex maneuvers of
setting/resetting it correctly when performing partition tuple
routing, causing some other churn beside the partitioning code.  Same
with another global variable TransitionCaptureState.tcs_map.  So, the
patches neither add any new capabilities, nor improve performance, but
they do make the code in this area a bit easier to follow.

Actually, there is a problem that some of the changes here conflict
with patches being discussed on other threads ([1], [2]), so much so
that I decided to absorb some changes here into another "refactoring"
patch that I have posted at [2].

Attached rebased patches.

0001 contains preparatory FDW API changes to stop relying on
es_result_relation_info being set correctly.

0002 removes es_result_relation_info in favor of passing the active
result relation around as a parameter in the various functions that
need it

0003 Moves UPDATE tuple-routing logic into a new function

0004 removes the global variable TransitionCaptureState.tcs_map, which
needed to be set/reset whenever the active result relation changes, in
favor of a new field in ResultRelInfo to store the same map

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/28/2575/
[2] https://commitfest.postgresql.org/28/2621/

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Heikki Linnakangas
Date:
On 13/07/2020 08:47, Amit Langote wrote:
> It's been over 11 months since there was any significant commentary on
> the contents of the patches themselves, so perhaps I should reiterate
> what the patches are about and why it might still be a good idea to
> consider them.
> 
> The thread started with some very valid criticism of the way
> executor's partition tuple routing logic looks randomly sprinkled over
> in nodeModifyTable.c, execPartition.c.  In the process of making it
> look less random, we decided to get rid of the global variable
> es_result_relation_info to avoid complex maneuvers of
> setting/resetting it correctly when performing partition tuple
> routing, causing some other churn beside the partitioning code.  Same
> with another global variable TransitionCaptureState.tcs_map.  So, the
> patches neither add any new capabilities, nor improve performance, but
> they do make the code in this area a bit easier to follow.
> 
> Actually, there is a problem that some of the changes here conflict
> with patches being discussed on other threads ([1], [2]), so much so
> that I decided to absorb some changes here into another "refactoring"
> patch that I have posted at [2].

Thanks for the summary. It's been a bit hard to follow what depends on 
what across these threads, and how they work together. It seems that 
this patch set is the best place to start.

> Attached rebased patches.
> 
> 0001 contains preparatory FDW API changes to stop relying on
> es_result_relation_info being set correctly.

Makes sense. The only thing I don't like about this is the way the 
ForeignScan->resultRelIndex field is set. make_foreignscan() initializes 
it to -1, and the FDW's PlanDirectModify() function is expected to set 
it, like you did in postgres_fdw:

> @@ -2319,6 +2322,11 @@ postgresPlanDirectModify(PlannerInfo *root,
>              rebuild_fdw_scan_tlist(fscan, returningList);
>      }
>  
> +    /*
> +     * Set the index of the subplan result rel.
> +     */
> +    fscan->resultRelIndex = subplan_index;
> +
>      table_close(rel, NoLock);
>      return true;
>  }

It has to be set to that value (subplan_index is an argument to 
PlanDirectModify()), the FDW doesn't have any choice there, so this is 
just additional boilerplate code that has to be copied to every FDW that 
implements direct modify. Furthermore, if the FDW doesn't set it 
correctly, you could have some very interesting results, like updating 
wrong table. It would be better to set it in make_modifytable(), just 
after calling PlanDirectModify().

I'm also a bit unhappy with the way it's updated in set_plan_refs():

> --- a/src/backend/optimizer/plan/setrefs.c
> +++ b/src/backend/optimizer/plan/setrefs.c
> @@ -904,6 +904,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
>                      rc->rti += rtoffset;
>                      rc->prti += rtoffset;
>                  }
> +                /*
> +                 * Caution: Do not change the relative ordering of this loop
> +                 * and the statement below that adds the result relations to
> +                 * root->glob->resultRelations, because we need to use the
> +                 * current value of list_length(root->glob->resultRelations)
> +                 * in some plans.
> +                 */
>                  foreach(l, splan->plans)
>                  {
>                      lfirst(l) = set_plan_refs(root,
> @@ -1243,6 +1250,14 @@ set_foreignscan_references(PlannerInfo *root,
>      }
>  
>      fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
> +
> +    /*
> +     * Adjust resultRelIndex if it's valid (note that we are called before
> +     * adding the RT indexes of ModifyTable result relations to the global
> +     * list)
> +     */
> +    if (fscan->resultRelIndex >= 0)
> +        fscan->resultRelIndex += list_length(root->glob->resultRelations);
>  }
>  
>  /*

That "Caution" comment is well deserved, but could we make this more 
robust to begin with? The most straightforward solution would be to pass 
down the "current resultRelIndex" as an extra parameter to 
set_plan_refs(), similar to rtoffset. If we did that, we wouldn't 
actually need to set it before setrefs.c processing at all.

I'm a bit wary of adding another argument to set_plan_refs() because 
that's a lot of code churn, but it does seem like the most natural 
solution to me. Maybe create a new context struct to hold the 
PlannerInfo, rtoffset, and the new "currentResultRelIndex" value, 
similar to fix_scan_expr_context, to avoid passing through so many 
arguments.


Another idea is to merge "resultRelIndex" and a "range table index" into 
one value. Range table entries that are updated would have a 
ResultRelInfo, others would not. I'm not sure if that would end up being 
cleaner or messier than what we have now, but might be worth trying.

> 0002 removes es_result_relation_info in favor of passing the active
> result relation around as a parameter in the various functions that
> need it

Looks good.

> 0003 Moves UPDATE tuple-routing logic into a new function
> 
> 0004 removes the global variable TransitionCaptureState.tcs_map, which
> needed to be set/reset whenever the active result relation changes, in
> favor of a new field in ResultRelInfo to store the same map

I didn't look closely, but these make sense at a quick glance.

- Heikki




Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
Heikki,

Thanks a lot for the review!

On Tue, Oct 6, 2020 at 12:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 13/07/2020 08:47, Amit Langote wrote:
> > It's been over 11 months since there was any significant commentary on
> > the contents of the patches themselves, so perhaps I should reiterate
> > what the patches are about and why it might still be a good idea to
> > consider them.
> >
> > The thread started with some very valid criticism of the way
> > executor's partition tuple routing logic looks randomly sprinkled over
> > in nodeModifyTable.c, execPartition.c.  In the process of making it
> > look less random, we decided to get rid of the global variable
> > es_result_relation_info to avoid complex maneuvers of
> > setting/resetting it correctly when performing partition tuple
> > routing, causing some other churn beside the partitioning code.  Same
> > with another global variable TransitionCaptureState.tcs_map.  So, the
> > patches neither add any new capabilities, nor improve performance, but
> > they do make the code in this area a bit easier to follow.
> >
> > Actually, there is a problem that some of the changes here conflict
> > with patches being discussed on other threads ([1], [2]), so much so
> > that I decided to absorb some changes here into another "refactoring"
> > patch that I have posted at [2].
>
> Thanks for the summary. It's been a bit hard to follow what depends on
> what across these threads, and how they work together. It seems that
> this patch set is the best place to start.

Great.  I'd be happy to have one less set of patches to keep around. :-)

> > Attached rebased patches.
> >
> > 0001 contains preparatory FDW API changes to stop relying on
> > es_result_relation_info being set correctly.
>
> Makes sense. The only thing I don't like about this is the way the
> ForeignScan->resultRelIndex field is set. make_foreignscan() initializes
> it to -1, and the FDW's PlanDirectModify() function is expected to set
> it, like you did in postgres_fdw:
>
> > @@ -2319,6 +2322,11 @@ postgresPlanDirectModify(PlannerInfo *root,
> >                       rebuild_fdw_scan_tlist(fscan, returningList);
> >       }
> >
> > +     /*
> > +      * Set the index of the subplan result rel.
> > +      */
> > +     fscan->resultRelIndex = subplan_index;
> > +
> >       table_close(rel, NoLock);
> >       return true;
> >  }
>
> It has to be set to that value (subplan_index is an argument to
> PlanDirectModify()), the FDW doesn't have any choice there, so this is
> just additional boilerplate code that has to be copied to every FDW that
> implements direct modify. Furthermore, if the FDW doesn't set it
> correctly, you could have some very interesting results, like updating
> wrong table. It would be better to set it in make_modifytable(), just
> after calling PlanDirectModify().

Actually, that's how it was done in earlier iterations but I think I
decided to move that into the FDW's functions due to some concern of
one of the other patches that depended on this patch.  Maybe it makes
sense to bring that back into make_modifytable() and worry about the
other patch later.

> I'm also a bit unhappy with the way it's updated in set_plan_refs():
>
> > --- a/src/backend/optimizer/plan/setrefs.c
> > +++ b/src/backend/optimizer/plan/setrefs.c
> > @@ -904,6 +904,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
> >                                       rc->rti += rtoffset;
> >                                       rc->prti += rtoffset;
> >                               }
> > +                             /*
> > +                              * Caution: Do not change the relative ordering of this loop
> > +                              * and the statement below that adds the result relations to
> > +                              * root->glob->resultRelations, because we need to use the
> > +                              * current value of list_length(root->glob->resultRelations)
> > +                              * in some plans.
> > +                              */
> >                               foreach(l, splan->plans)
> >                               {
> >                                       lfirst(l) = set_plan_refs(root,
> > @@ -1243,6 +1250,14 @@ set_foreignscan_references(PlannerInfo *root,
> >       }
> >
> >       fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
> > +
> > +     /*
> > +      * Adjust resultRelIndex if it's valid (note that we are called before
> > +      * adding the RT indexes of ModifyTable result relations to the global
> > +      * list)
> > +      */
> > +     if (fscan->resultRelIndex >= 0)
> > +             fscan->resultRelIndex += list_length(root->glob->resultRelations);
> >  }
> >
> >  /*
>
> That "Caution" comment is well deserved, but could we make this more
> robust to begin with? The most straightforward solution would be to pass
> down the "current resultRelIndex" as an extra parameter to
> set_plan_refs(), similar to rtoffset. If we did that, we wouldn't
> actually need to set it before setrefs.c processing at all.

Hmm, I don't think I understand the last sentence.  A given
ForeignScan node's resultRelIndex will have to be set before getting
to set_plan_refs().  I mean we shouldn't be making it a job of
setrefs.c to figure out which ForeignScan nodes need to have its
resultRelIndex set to a valid value.

> I'm a bit wary of adding another argument to set_plan_refs() because
> that's a lot of code churn, but it does seem like the most natural
> solution to me. Maybe create a new context struct to hold the
> PlannerInfo, rtoffset, and the new "currentResultRelIndex" value,
> similar to fix_scan_expr_context, to avoid passing through so many
> arguments.

I like the idea of a context struct.  I've implemented it as a
separate refactoring patch (0001) and 0002 (what was before 0001)
extends it for "current ResultRelIndex", although I used the name
rroffset for "current ResultRelIndex" to go along with rtoffset.

> Another idea is to merge "resultRelIndex" and a "range table index" into
> one value. Range table entries that are updated would have a
> ResultRelInfo, others would not. I'm not sure if that would end up being
> cleaner or messier than what we have now, but might be worth trying.

I have thought about something like this before.  An idea I had is to
make the es_result_relations array indexable by plain RT indexes; then
we don't need to maintain the separate indexes that we do today for
result relations.

> > 0002 removes es_result_relation_info in favor of passing the active
> > result relation around as a parameter in the various functions that
> > need it
>
> Looks good.
>
> > 0003 Moves UPDATE tuple-routing logic into a new function
> >
> > 0004 removes the global variable TransitionCaptureState.tcs_map, which
> > needed to be set/reset whenever the active result relation changes, in
> > favor of a new field in ResultRelInfo to store the same map
>
> I didn't look closely, but these make sense at a quick glance.

Updated patches attached.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Heikki Linnakangas
Date:
On 07/10/2020 12:50, Amit Langote wrote:
> On Tue, Oct 6, 2020 at 12:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> It would be better to set it in make_modifytable(), just
>> after calling PlanDirectModify().
> 
> Actually, that's how it was done in earlier iterations but I think I
> decided to move that into the FDW's functions due to some concern of
> one of the other patches that depended on this patch.  Maybe it makes
> sense to bring that back into make_modifytable() and worry about the
> other patch later.

On second thoughts, I take back my earlier comment. Setting it in 
make_modifytable() relies on the assumption that the subplan is a single 
ForeignScan node, on the target relation. The documentation for 
PlanDirectModify says:

> To execute the direct modification on the remote server, this
> function must rewrite the target subplan with a ForeignScan plan node
> that executes the direct modification on the remote server.

So I guess that assumption is safe. But I'd like to have some wiggle
room here. Wouldn't it be OK to have a Result node on top of the 
ForeignScan, for example? If it really must be a simple ForeignScan 
node, the PlanDirectModify API seems pretty strange.

I'm not entirely sure what I would like to do with this now. I could 
live with either version, but I'm not totally happy with either. (I like 
your suggestion below)

Looking at this block in postgresBeginDirectModify:

>     /*
>      * Identify which user to do the remote access as.  This should match what
>      * ExecCheckRTEPerms() does.
>      */
>     Assert(fsplan->resultRelIndex >= 0);
>     dmstate->resultRelIndex = fsplan->resultRelIndex;
>     rtindex = list_nth_int(resultRelations, fsplan->resultRelIndex);
>     rte = exec_rt_fetch(rtindex, estate);
>     userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();

That's a complicated way of finding out the target table's RTI. We 
should probably store the result RTI in the ForeignScan in the first place.

>> Another idea is to merge "resultRelIndex" and a "range table index" into
>> one value. Range table entries that are updated would have a
>> ResultRelInfo, others would not. I'm not sure if that would end up being
>> cleaner or messier than what we have now, but might be worth trying.
> 
> I have thought about something like this before.  An idea I had is to
> make es_result_relations array indexable by plain RT indexes, then we
> don't need to maintain separate indexes that we do today for result
> relations.

That sounds like a good idea. es_result_relations is currently an array 
of ResultRelInfos, so that would leave a lot of unfilled structs in the 
array. But in one of your other threads, you proposed turning 
es_result_relations into an array of pointers anyway 
(https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=KpHRDTfSKxA@mail.gmail.com).

- Heikki



Re: partition routing layering in nodeModifyTable.c

От
Amit Langote
Дата:
On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 07/10/2020 12:50, Amit Langote wrote:
> > On Tue, Oct 6, 2020 at 12:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> It would be better to set it in make_modifytable(), just
> >> after calling PlanDirectModify().
> >
> > Actually, that's how it was done in earlier iterations but I think I
> > decided to move that into the FDW's functions due to some concern of
> > one of the other patches that depended on this patch.  Maybe it makes
> > sense to bring that back into make_modifytable() and worry about the
> > other patch later.
>
> On second thoughts, I take back my earlier comment. Setting it in
> make_modifytable() relies on the assumption that the subplan is a single
> ForeignScan node, on the target relation. The documentation for
> PlanDirectModify says:
>
> > To execute the direct modification on the remote server, this
> > function must rewrite the target subplan with a ForeignScan plan node
> > that executes the direct modification on the remote server.
>>
> So I guess that assumption is safe. But I'd like to have some wiggle
> room here. Wouldn't it be OK to have a Result node on top of the
> ForeignScan, for example? If it really must be a simple ForeignScan
> node, the PlanDirectModify API seems pretty strange.
>
> I'm not entirely sure what I would like to do with this now. I could
> live with either version, but I'm not totally happy with either. (I like
> your suggestion below)

Assuming you mean the idea of using RT index to access ResultRelInfos
in es_result_relations, we would still need to store the index in the
ForeignScan node, so the question of whether to do it in
make_modifytable() or in PlanDirectModify() must still be answered.

> Looking at this block in postgresBeginDirectModify:
>
> >       /*
> >        * Identify which user to do the remote access as.  This should match what
> >        * ExecCheckRTEPerms() does.
> >        */
> >       Assert(fsplan->resultRelIndex >= 0);
> >       dmstate->resultRelIndex = fsplan->resultRelIndex;
> >       rtindex = list_nth_int(resultRelations, fsplan->resultRelIndex);
> >       rte = exec_rt_fetch(rtindex, estate);
> >       userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
>
> That's a complicated way of finding out the target table's RTI. We
> should probably store the result RTI in the ForeignScan in the first place.
>
> >> Another idea is to merge "resultRelIndex" and a "range table index" into
> >> one value. Range table entries that are updated would have a
> >> ResultRelInfo, others would not. I'm not sure if that would end up being
> >> cleaner or messier than what we have now, but might be worth trying.
> >
> > I have thought about something like this before.  An idea I had is to
> > make es_result_relations array indexable by plain RT indexes, then we
> > don't need to maintain separate indexes that we do today for result
> > relations.
>
> That sounds like a good idea. es_result_relations is currently an array
> of ResultRelInfos, so that would leave a lot of unfilled structs in the
> array. But in one of your other threads, you proposed turning
> es_result_relations into an array of pointers anyway
> (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=KpHRDTfSKxA@mail.gmail.com).

Okay, I am reorganizing the patches around that idea and will post an
update soon.


--
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

От
Amit Langote
Дата:
On Thu, Oct 8, 2020 at 9:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On 07/10/2020 12:50, Amit Langote wrote:
> > > I have thought about something like this before.  An idea I had is to
> > > make es_result_relations array indexable by plain RT indexes, then we
> > > don't need to maintain separate indexes that we do today for result
> > > relations.
> >
> > That sounds like a good idea. es_result_relations is currently an array
> > of ResultRelInfos, so that would leave a lot of unfilled structs in the
> > array. But in one of your other threads, you proposed turning
> > es_result_relations into an array of pointers anyway
> > (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=KpHRDTfSKxA@mail.gmail.com).
>
> Okay, I am reorganizing the patches around that idea and will post an
> update soon.

Attached updated patches.

0001 makes es_result_relations an RTI-indexable array, which allows us
to get rid of all "result relation index" fields across the code.
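In rough outline, the scheme looks like this (simplified stand-in structs, not the actual patch; the real EState and ResultRelInfo are far larger):

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the executor structs. */
typedef struct ResultRelInfo
{
    int         ri_RangeTableIndex;
} ResultRelInfo;

typedef struct EState
{
    int         es_range_table_size;
    /* Array indexed by (RT index - 1); NULL for non-result relations. */
    ResultRelInfo **es_result_relations;
} EState;

/* Register a result relation under its own RT index. */
static void
register_result_relation(EState *estate, ResultRelInfo *rri)
{
    estate->es_result_relations[rri->ri_RangeTableIndex - 1] = rri;
}

/* Direct lookup: no separate "result relation index" needed anymore. */
static ResultRelInfo *
lookup_result_relation(EState *estate, int rti)
{
    return estate->es_result_relations[rti - 1];
}
```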

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Вложения

Re: partition routing layering in nodeModifyTable.c

От
Heikki Linnakangas
Дата:
On 09/10/2020 11:01, Amit Langote wrote:
> On Thu, Oct 8, 2020 at 9:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> On 07/10/2020 12:50, Amit Langote wrote:
>>>> I have thought about something like this before.  An idea I had is to
>>>> make es_result_relations array indexable by plain RT indexes, then we
>>>> don't need to maintain separate indexes that we do today for result
>>>> relations.
>>>
>>> That sounds like a good idea. es_result_relations is currently an array
>>> of ResultRelInfos, so that would leave a lot of unfilled structs in the
>>> array. But in one of your other threads, you proposed turning
>>> es_result_relations into an array of pointers anyway
>>> (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=KpHRDTfSKxA@mail.gmail.com).
>>
>> Okay, I am reorganizing the patches around that idea and will post an
>> update soon.
> 
> Attached updated patches.
> 
> 0001 makes es_result_relations an RTI-indexable array, which allows us
> to get rid of all "result relation index" fields across the code.

Thanks! A couple small things I wanted to check with you before committing:

1. We have many different cleanup/close routines now: 
ExecCloseResultRelations, ExecCloseRangeTableRelations, and 
ExecCleanUpTriggerState. Do we need them all? It seems to me that we 
could merge ExecCloseRangeTableRelations() and 
ExecCleanUpTriggerState(), they seem to do roughly the same thing: close 
relations that were opened for ResultRelInfos. They are always called 
together, except in afterTriggerInvokeEvents(). And in 
afterTriggerInvokeEvents() too, there would be no harm in doing both, 
even though we know there aren't any entries in the es_result_relations 
array at that point.

2. The way this is handled in worker.c is a bit funny. In 
create_estate_for_relation(), you create a ResultRelInfo, but you 
*don't* put it in the es_opened_result_relations list. That's 
surprising, but I'm also surprised there are no 
ExecCloseResultRelations() calls before the FreeExecutorState() calls in 
worker.c. It's not needed because the 
apply_handle_insert/update/delete_internal() functions call 
ExecCloseIndices() directly, so they don't rely on the 
ExecCloseResultRelations() function for cleanup. That works too, but 
it's a bit surprising because it's different from how it's done in 
copy.c and nodeModifyTable.c. It would feel natural to rely on 
ExecCloseResultRelations() in worker.c as well, but on the other hand, 
it also calls ExecOpenIndices() in a more lazy fashion, and it makes 
sense to call ExecCloseIndices() in the same functions that 
ExecOpenIndices() is called. So I'm not sure if changing that would be 
an improvement overall. What do you think? Did you consider doing that?

Attached is your original patch v13, and a patch on top of it that 
merges ExecCloseResultRelations() and ExecCleanUpTriggerState(), and 
makes some minor comment changes. I didn't do anything about the 
worker.c business, aside from adding a comment about it.

- Heikki

Вложения

Re: partition routing layering in nodeModifyTable.c

От
Amit Langote
Дата:
On Mon, Oct 12, 2020 at 8:12 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 09/10/2020 11:01, Amit Langote wrote:
> > On Thu, Oct 8, 2020 at 9:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >> On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>> On 07/10/2020 12:50, Amit Langote wrote:
> >>>> I have thought about something like this before.  An idea I had is to
> >>>> make es_result_relations array indexable by plain RT indexes, then we
> >>>> don't need to maintain separate indexes that we do today for result
> >>>> relations.
> >>>
> >>> That sounds like a good idea. es_result_relations is currently an array
> >>> of ResultRelInfos, so that would leave a lot of unfilled structs in the
> >>> array. But in one of your other threads, you proposed turning
> >>> es_result_relations into an array of pointers anyway
> >>> (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=KpHRDTfSKxA@mail.gmail.com).
> >>
> >> Okay, I am reorganizing the patches around that idea and will post an
> >> update soon.
> >
> > Attached updated patches.
> >
> > 0001 makes es_result_relations an RTI-indexable array, which allows us
> > to get rid of all "result relation index" fields across the code.
>
> Thanks! A couple small things I wanted to check with you before committing:

Thanks for checking.

> 1. We have many different cleanup/close routines now:
> ExecCloseResultRelations, ExecCloseRangeTableRelations, and
> ExecCleanUpTriggerState. Do we need them all? It seems to me that we
> could merge ExecCloseRangeTableRelations() and
> ExecCleanUpTriggerState(), they seem to do roughly the same thing: close
> relations that were opened for ResultRelInfos. They are always called
> together, except in afterTriggerInvokeEvents(). And in
> afterTriggerInvokeEvents() too, there would be no harm in doing both,
> even though we know there aren't any entries in the es_result_relations
> array at that point.

Hmm, I find trigger result relations to behave differently enough to
deserve a separate function.  For example, unlike plan-specified
result relations, they don't point to range table relations and don't
open indices.  Maybe the name could be revisited, say,
ExecCloseTriggerResultRelations().  Also, maybe call the other
functions:

ExecInitPlanResultRelationsArray()
ExecInitPlanResultRelation()
ExecClosePlanResultRelations()

Thoughts?

> 2. The way this is handled in worker.c is a bit funny. In
> create_estate_for_relation(), you create a ResultRelInfo, but you
> *don't* put it in the es_opened_result_relations list. That's
> surprising, but I'm also surprised there are no
> ExecCloseResultRelations() calls before the FreeExecutorState() calls in
> worker.c. It's not needed because the
> apply_handle_insert/update/delete_internal() functions call
> ExecCloseIndices() directly, so they don't rely on the
> ExecCloseResultRelations() function for cleanup. That works too, but
> it's a bit surprising because it's different from how it's done in
> copy.c and nodeModifyTable.c. It would feel natural to rely on
> ExecCloseResultRelations() in worker.c as well, but on the other hand,
> it also calls ExecOpenIndices() in a more lazy fashion, and it makes
> sense to call ExecCloseIndices() in the same functions that
> ExecOpenIndices() is called. So I'm not sure if changing that would be
> an improvement overall. What do you think? Did you consider doing that?

Yeah, that bothered me a bit too.  I'm okay either way, but it does
look a bit inconsistent.

Actually, maybe we don't need to be so paranoid about setting up
es_result_relations in worker.c, because none of the downstream
functionality invoked seems to rely on it, that is, no need to call
ExecInitResultRelationsArray() and ExecInitResultRelation().
ExecSimpleRelation* and downstream functionality assume a
single-relation operation and the ResultRelInfo is explicitly passed.

> Attached is your original patch v13, and a patch on top of it that
> merges ExecCloseResultRelations() and ExecCleanUpTriggerState(), and
> makes some minor comment changes. I didn't do anything about the
> worker.c business, aside from adding a comment about it.

Thanks for the cleanup.

I had noticed there was some funny capitalization in my patch:

+   ResultRelInfo **es_result_relations;    /* Array of Per-range-table-entry

s/Per-/per-

Also, I think a comma may be needed in the parenthetical below:

+ * can index it by the RT index (minus 1 to be accurate).

...(minus 1, to be accurate)

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

От
Heikki Linnakangas
Дата:
On 12/10/2020 16:47, Amit Langote wrote:
> On Mon, Oct 12, 2020 at 8:12 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> 1. We have many different cleanup/close routines now:
>> ExecCloseResultRelations, ExecCloseRangeTableRelations, and
>> ExecCleanUpTriggerState. Do we need them all? It seems to me that we
>> could merge ExecCloseRangeTableRelations() and
>> ExecCleanUpTriggerState(), they seem to do roughly the same thing: close
>> relations that were opened for ResultRelInfos. They are always called
>> together, except in afterTriggerInvokeEvents(). And in
>> afterTriggerInvokeEvents() too, there would be no harm in doing both,
>> even though we know there aren't any entries in the es_result_relations
>> array at that point.
> 
> Hmm, I find trigger result relations to behave differently enough to
> deserve a separate function.  For example, unlike plan-specified
> result relations, they don't point to range table relations and don't
> open indices.  Maybe the name could be revisited, say,
> ExecCloseTriggerResultRelations().

Matter of perception I guess. I still prefer to club them together into 
one Close call. It's true that they're slightly different, but they're 
also pretty similar. And IMHO they're more similar than different.

> Also, maybe call the other functions:
> 
> ExecInitPlanResultRelationsArray()
> ExecInitPlanResultRelation()
> ExecClosePlanResultRelations()
> 
> Thoughts?

Hmm. How about initializing the array lazily, on the first 
ExecInitPlanResultRelation() call? It's not performance critical, and 
that way there's one fewer initialization function that you need to 
remember to call.
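A sketch of that lazy approach (hypothetical function name, simplified stand-in structs, and plain calloc in place of the executor's memory-context allocation):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct ResultRelInfo
{
    int         ri_RangeTableIndex;
} ResultRelInfo;

typedef struct EState
{
    int         es_range_table_size;
    ResultRelInfo **es_result_relations;    /* NULL until first use */
} EState;

/*
 * Hypothetical ExecInitPlanResultRelation(): allocates the RTI-indexed
 * array on the first call, so no separate array-init function has to be
 * remembered by callers.
 */
static ResultRelInfo *
exec_init_plan_result_relation(EState *estate, int rti)
{
    ResultRelInfo *rri;

    if (estate->es_result_relations == NULL)
        estate->es_result_relations = (ResultRelInfo **)
            calloc(estate->es_range_table_size, sizeof(ResultRelInfo *));

    rri = (ResultRelInfo *) calloc(1, sizeof(ResultRelInfo));
    rri->ri_RangeTableIndex = rti;
    estate->es_result_relations[rti - 1] = rri;
    return rri;
}
```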

It occurred to me that if we do that (initialize the array lazily), 
there's very little need for the PlannedStmt->resultRelations list 
anymore. It's only used in ExecRelationIsTargetRelation(), but if we 
assume that ExecRelationIsTargetRelation() is only called after InitPlan 
has initialized the result relation for the relation, it can easily 
check es_result_relations instead. I think that's a safe assumption. 
ExecRelationIsTargetRelation() is only used in FDWs, and I believe the 
FDWs initialization routine can only be called after ExecInitModifyTable 
has been called on the relation.

The PlannedStmt->rootResultRelations field is even more useless.

> Actually, maybe we don't need to be so paranoid about setting up
> es_result_relations in worker.c, because none of the downstream
> functionality invoked seems to rely on it, that is, no need to call
> ExecInitResultRelationsArray() and ExecInitResultRelation().
> ExecSimpleRelation* and downstream functionality assume a
> single-relation operation and the ResultRelInfo is explicitly passed.

Hmm, yeah, I like that. Similarly in ExecuteTruncateGuts(), there isn't 
actually any need to put the ResultRelInfos in the es_result_relations 
array.

Putting all this together, I ended up with the attached. It doesn't 
include the subsequent commits in this patch set yet, for removal of 
es_result_relation_info et al.

- Heikki

Вложения

Re: partition routing layering in nodeModifyTable.c

От
Amit Langote
Дата:
On Tue, Oct 13, 2020 at 1:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 12/10/2020 16:47, Amit Langote wrote:
> > On Mon, Oct 12, 2020 at 8:12 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> 1. We have many different cleanup/close routines now:
> >> ExecCloseResultRelations, ExecCloseRangeTableRelations, and
> >> ExecCleanUpTriggerState. Do we need them all? It seems to me that we
> >> could merge ExecCloseRangeTableRelations() and
> >> ExecCleanUpTriggerState(), they seem to do roughly the same thing: close
> >> relations that were opened for ResultRelInfos. They are always called
> >> together, except in afterTriggerInvokeEvents(). And in
> >> afterTriggerInvokeEvents() too, there would be no harm in doing both,
> >> even though we know there aren't any entries in the es_result_relations
> >> array at that point.
> >
> > Hmm, I find trigger result relations to behave differently enough to
> > deserve a separate function.  For example, unlike plan-specified
> > result relations, they don't point to range table relations and don't
> > open indices.  Maybe the name could be revisited, say,
> > ExecCloseTriggerResultRelations().
>
> Matter of perception I guess. I still prefer to club them together into
> one Close call. It's true that they're slightly different, but they're
> also pretty similar. And IMHO they're more similar than different.

Okay, fine with me.

> > Also, maybe call the other functions:
> >
> > ExecInitPlanResultRelationsArray()
> > ExecInitPlanResultRelation()
> > ExecClosePlanResultRelations()
> >
> > Thoughts?
>
> Hmm. How about initializing the array lazily, on the first
> ExecInitPlanResultRelation() call? It's not performance critical, and
> that way there's one fewer initialization function that you need to
> remember to call.

Agree that's better.

> It occurred to me that if we do that (initialize the array lazily),
> there's very little need for the PlannedStmt->resultRelations list
> anymore. It's only used in ExecRelationIsTargetRelation(), but if we
> assume that ExecRelationIsTargetRelation() is only called after InitPlan
> has initialized the result relation for the relation, it can easily
> check es_result_relations instead. I think that's a safe assumption.
> ExecRelationIsTargetRelation() is only used in FDWs, and I believe the
> FDWs initialization routine can only be called after ExecInitModifyTable
> has been called on the relation.
>
> The PlannedStmt->rootResultRelations field is even more useless.

I am very much tempted to remove those fields from PlannedStmt,
although I am concerned that the following now assumes that *all*
result relations are initialized in the executor initialization phase:

bool
ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
{
    if (!estate->es_result_relations)
        return false;

    return estate->es_result_relations[scanrelid - 1] != NULL;
}

In the other thread [1], I am proposing that we initialize result
relations lazily, but the above will be a blocker to that.

> > Actually, maybe we don't need to be so paranoid about setting up
> > es_result_relations in worker.c, because none of the downstream
> > functionality invoked seems to rely on it, that is, no need to call
> > ExecInitResultRelationsArray() and ExecInitResultRelation().
> > ExecSimpleRelation* and downstream functionality assume a
> > single-relation operation and the ResultRelInfo is explicitly passed.
>
> Hmm, yeah, I like that. Similarly in ExecuteTruncateGuts(), there isn't
> actually any need to put the ResultRelInfos in the es_result_relations
> array.
>
> Putting all this together, I ended up with the attached. It doesn't
> include the subsequent commits in this patch set yet, for removal of
> es_result_relation_info et al.

Thanks.

+    * We put the ResultRelInfos in the es_opened_result_relations list, even
+    * though we don't have a range table and don't populate the
+    * es_result_relations array.  That's a bit bogus, but it's enough to make
+    * ExecGetTriggerResultRel() find them.
     */
    estate = CreateExecutorState();
    resultRelInfos = (ResultRelInfo *)
        palloc(list_length(rels) * sizeof(ResultRelInfo));
    resultRelInfo = resultRelInfos;
+   estate->es_result_relations = (ResultRelInfo **)
+       palloc(list_length(rels) * sizeof(ResultRelInfo *));

Maybe don't allocate es_result_relations here?

+/*
+ * Close all relations opened by ExecGetRangeTableRelation()
+ */
+void
+ExecCloseRangeTableRelations(EState *estate)
+{
+   int         i;
+
+   for (i = 0; i < estate->es_range_table_size; i++)
    {
        if (estate->es_relations[i])
            table_close(estate->es_relations[i], NoLock);
    }

I think we have an optimization opportunity here (maybe as a separate
patch).  Why don't we introduce es_opened_relations?  That way, if
only one or a few of the potentially thousands of relations in the
range table are opened, we don't needlessly loop over *all* relations
here.  That can happen, for example, with a query where no partitions
could be pruned at planning time, so the range table contains all
partitions, but only one or a few are accessed during execution and
the rest are pruned at run time.  Although, in the workloads where it
would matter, other overheads easily mask the overhead of this loop;
see the first message at the linked thread [1], so it is hard to show
an immediate benefit from this.
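The idea can be sketched like this (simplified stand-in structs and a fixed-size tracking array, purely for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in: a Relation is just an "open" flag here. */
typedef struct Relation
{
    bool        open;
} Relation;

typedef struct EState
{
    int         es_range_table_size;
    Relation  **es_relations;         /* RTI-indexed, NULL if never opened */
    int        *es_opened_relations;  /* RT indexes actually opened */
    int         es_num_opened;
} EState;

/* Remember each relation as it is opened (hypothetical helper). */
static void
note_opened(EState *estate, int rti, Relation *rel)
{
    rel->open = true;
    estate->es_relations[rti - 1] = rel;
    estate->es_opened_relations[estate->es_num_opened++] = rti;
}

/* Close only the relations that were opened, instead of scanning the
 * entire (possibly huge) range table. */
static void
close_opened_relations(EState *estate)
{
    for (int i = 0; i < estate->es_num_opened; i++)
    {
        int         rti = estate->es_opened_relations[i];

        estate->es_relations[rti - 1]->open = false;
    }
    estate->es_num_opened = 0;
}
```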

Anyway, other than my concern about ExecRelationIsTargetRelation()
mentioned above, I think the patch looks good.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/30/2621/



Re: partition routing layering in nodeModifyTable.c

От
Heikki Linnakangas
Дата:
On 13/10/2020 07:32, Amit Langote wrote:
> On Tue, Oct 13, 2020 at 1:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> It occurred to me that if we do that (initialize the array lazily),
>> there's very little need for the PlannedStmt->resultRelations list
>> anymore. It's only used in ExecRelationIsTargetRelation(), but if we
>> assume that ExecRelationIsTargetRelation() is only called after InitPlan
>> has initialized the result relation for the relation, it can easily
>> check es_result_relations instead. I think that's a safe assumption.
>> ExecRelationIsTargetRelation() is only used in FDWs, and I believe the
>> FDWs initialization routine can only be called after ExecInitModifyTable
>> has been called on the relation.
>>
>> The PlannedStmt->rootResultRelations field is even more useless.
> 
> I am very much tempted to remove those fields from PlannedStmt,
> although I am concerned that the following now assumes that *all*
> result relations are initialized in the executor initialization phase:
> 
> bool
> ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
> {
>      if (!estate->es_result_relations)
>          return false;
> 
>      return estate->es_result_relations[scanrelid - 1] != NULL;
> }
> 
> In the other thread [1], I am proposing that we initialize result
> relations lazily, but the above will be a blocker to that.

Ok, I'll leave it alone then. But I'll still merge resultRelations and 
rootResultRelations into one list. I don't see any point in keeping them 
separate.

I'm tempted to remove ExecRelationIsTargetRelation() altogether, but 
keeping the resultRelations list isn't really a big deal, so I'll leave 
that for another discussion.

>>> Actually, maybe we don't need to be so paranoid about setting up
>>> es_result_relations in worker.c, because none of the downstream
>>> functionality invoked seems to rely on it, that is, no need to call
>>> ExecInitResultRelationsArray() and ExecInitResultRelation().
>>> ExecSimpleRelation* and downstream functionality assume a
>>> single-relation operation and the ResultRelInfo is explicitly passed.
>>
>> Hmm, yeah, I like that. Similarly in ExecuteTruncateGuts(), there isn't
>> actually any need to put the ResultRelInfos in the es_result_relations
>> array.
>>
>> Putting all this together, I ended up with the attached. It doesn't
>> include the subsequent commits in this patch set yet, for removal of
>> es_result_relation_info et al.
> 
> Thanks.
> 
> +    * We put the ResultRelInfos in the es_opened_result_relations list, even
> +    * though we don't have a range table and don't populate the
> +    * es_result_relations array.  That's a bit bogus, but it's enough to make
> +    * ExecGetTriggerResultRel() find them.
>       */
>      estate = CreateExecutorState();
>      resultRelInfos = (ResultRelInfo *)
>          palloc(list_length(rels) * sizeof(ResultRelInfo));
>      resultRelInfo = resultRelInfos;
> +   estate->es_result_relations = (ResultRelInfo **)
> +       palloc(list_length(rels) * sizeof(ResultRelInfo *));
> 
> Maybe don't allocate es_result_relations here?

Fixed.

> Anyway, other than my concern about ExecRelationIsTargetRelation()
> mentioned above, I think the patch looks good.

Ok, committed. I'll continue to look at the rest of the patches in this 
patch series now.

- Heikki



Re: partition routing layering in nodeModifyTable.c

От
Amit Langote
Дата:
On Tue, Oct 13, 2020 at 7:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 13/10/2020 07:32, Amit Langote wrote:
> > On Tue, Oct 13, 2020 at 1:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> It occurred to me that if we do that (initialize the array lazily),
> >> there's very little need for the PlannedStmt->resultRelations list
> >> anymore. It's only used in ExecRelationIsTargetRelation(), but if we
> >> assume that ExecRelationIsTargetRelation() is only called after InitPlan
> >> has initialized the result relation for the relation, it can easily
> >> check es_result_relations instead. I think that's a safe assumption.
> >> ExecRelationIsTargetRelation() is only used in FDWs, and I believe the
> >> FDWs initialization routine can only be called after ExecInitModifyTable
> >> has been called on the relation.
> >>
> >> The PlannedStmt->rootResultRelations field is even more useless.
> >
> > I am very much tempted to remove those fields from PlannedStmt,
> > although I am concerned that the following now assumes that *all*
> > result relations are initialized in the executor initialization phase:
> >
> > bool
> > ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
> > {
> >      if (!estate->es_result_relations)
> >          return false;
> >
> >      return estate->es_result_relations[scanrelid - 1] != NULL;
> > }
> >
> > In the other thread [1], I am proposing that we initialize result
> > relations lazily, but the above will be a blocker to that.
>
> Ok, I'll leave it alone then. But I'll still merge resultRelations and
> rootResultRelations into one list. I don't see any point in keeping them
> separate.

Should be fine.  As you said in the commit message, it should probably
have been that way to begin with, but I don't recall why I didn't make
it so.

> I'm tempted to remove ExecRelationIsTargetRelation() altogether, but
> keeping the resultRelations list isn't really a big deal, so I'll leave
> that for another discussion.

Yeah, makes sense.

> > Anyway, other than my concern about ExecRelationIsTargetRelation()
> > mentioned above, I think the patch looks good.
>
> Ok, committed. I'll continue to look at the rest of the patches in this
> patch series now.

Thanks.

BTW, you mentioned the lazy ResultRelInfo optimization bit in the
commit message, so does that mean you intend to take a look at the
other thread [1] too?  Or should I post a rebased version of the lazy
ResultRelInfo initialization patch here in this thread?  That patch is
just a bunch of refactoring too.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/30/2621/



Re: partition routing layering in nodeModifyTable.c

От
Heikki Linnakangas
Дата:
On 13/10/2020 15:03, Amit Langote wrote:
> On Tue, Oct 13, 2020 at 7:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Ok, committed. I'll continue to look at the rest of the patches in this
>> patch series now.

I've reviewed the next two patches in the series, they are pretty much 
ready for commit now. I made just a few minor changes, notably:

- I moved the responsibility to set ForeignScan->resultRelation to the 
FDWs, like you had in the original patch version. Sorry for 
flip-flopping on that.

- In postgres_fdw.c, I changed it to store the ResultRelInfo pointer in 
PgFdwDirectModifyState, instead of storing the RT index and looking it 
up in the BeginDirectModify and IterateDirectModify. I think you did it 
that way in the earlier patch versions, too.

- Some minor comment and docs kibitzing.

One little idea I had:

I think all FDWs that support direct modify will have to carry the 
resultRelation index or the ResultRelInfo pointer from BeginDirectModify 
to IterateDirectModify in the FDW's private struct. It's not 
complicated, but should we make life easier for FDWs by storing the 
ResultRelInfo pointer in the ForeignScanState struct in the core code? 
The doc now says:

> The data that was actually inserted, updated or deleted must be
> stored in the ri_projectReturning->pi_exprContext->ecxt_scantuple of
> the target foreign table's ResultRelInfo obtained using the
> information passed to BeginDirectModify. Return NULL if no more rows
> are available.

That "ResultRelInfo obtained using the information passed to 
BeginDirectModify" part is pretty vague. We could expand it, but if we 
stored the ResultRelInfo in the ForeignScanState, we could explain it 
succinctly.
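Roughly, the split of responsibilities would look like this (simplified stand-in structs; the real change would add the field to ForeignScanState in execnodes.h):

```c
#include <stddef.h>

typedef struct ResultRelInfo
{
    int         ri_RangeTableIndex;
} ResultRelInfo;

/* Simplified ForeignScanState carrying the proposed field. */
typedef struct ForeignScanState
{
    ResultRelInfo *resultRelInfo;   /* target rel for direct modify, or NULL */
} ForeignScanState;

/* The core executor would set the pointer once during node init... */
static void
init_foreign_scan(ForeignScanState *node, ResultRelInfo *rri)
{
    node->resultRelInfo = rri;
}

/* ...and the FDW's IterateDirectModify would simply read it, with no
 * need to stash an index in its private state struct. */
static ResultRelInfo *
fdw_target_rel(ForeignScanState *node)
{
    return node->resultRelInfo;
}
```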

> BTW, you mentioned the lazy ResultRelInfo optimization bit in the
> commit message, so does that mean you intend to take a look at the
> other thread [1] too?  Or should I post a rebased version of the lazy
> ResultRelInfo initialization patch here in this thread?  That patch is
> just a bunch of refactoring too.

No promises, but yeah, now that I'm knee-deep in this ResultRelInfo 
business, I'll try to take a look at that too :-).

- Heikki


Вложения

Re: partition routing layering in nodeModifyTable.c

От
Heikki Linnakangas
Дата:
On 13/10/2020 19:09, Heikki Linnakangas wrote:
> One little idea I had:
> 
> I think all FDWs that support direct modify will have to carry the
> resultRelation index or the ResultRelInfo pointer from BeginDirectModify
> to IterateDirectModify in the FDW's private struct. It's not
> complicated, but should we make life easier for FDWs by storing the
> ResultRelInfo pointer in the ForeignScanState struct in the core code?
> The doc now says:
> 
>> The data that was actually inserted, updated or deleted must be
>> stored in the ri_projectReturning->pi_exprContext->ecxt_scantuple of
>> the target foreign table's ResultRelInfo obtained using the
>> information passed to BeginDirectModify. Return NULL if no more rows
>> are available.
> 
> That "ResultRelInfo obtained using the information passed to
> BeginDirectModify" part is pretty vague. We could expand it, but if we
> stored the ResultRelInfo in the ForeignScanState, we could explain it
> succinctly.

I tried that approach, see attached. Yeah, this feels better to me.

- Heikki

Вложения

Re: partition routing layering in nodeModifyTable.c

От
Amit Langote
Дата:
On Wed, Oct 14, 2020 at 1:30 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 13/10/2020 19:09, Heikki Linnakangas wrote:
> > One little idea I had:
> >
> > I think all FDWs that support direct modify will have to carry the
> > resultRelaton index or the ResultRelInfo pointer from BeginDirectModify
> > to IterateDirectModify in the FDW's private struct. It's not
> > complicated, but should we make life easier for FDWs by storing the
> > ResultRelInfo pointer in the ForeignScanState struct in the core code?
> > The doc now says:
> >
> >> The data that was actually inserted, updated or deleted must be
> >> stored in the ri_projectReturning->pi_exprContext->ecxt_scantuple of
> >> the target foreign table's ResultRelInfo obtained using the
> >> information passed to BeginDirectModify. Return NULL if no more rows
> >> are available.
> >
> > That "ResultRelInfo obtained using the information passed to
> > BeginDirectModify" part is pretty vague. We could expand it, but if we
> > stored the ResultRelInfo in the ForeignScanState, we could explain it
> > succinctly.
>
> I tried that approach, see attached. Yeah, this feels better to me.

I like the idea of storing the ResultRelInfo in ForeignScanState, but
it would be better if we could document the fact that an FDW may not
reliably access it until IterateDirectModify(). That's because setting
it in ExecInitForeignScan() will mean *all* result relations must be
initialized during ExecInitModifyTable(), which defeats my
lazy-ResultRelInfo-initialization proposal.  As to why I'm
pushing that proposal, consider that when we get the ability to use
run-time pruning for UPDATE/DELETE with [1], initializing all result
relations before initializing the plan tree will mean most of those
ResultRelInfos will be unused, because run-time pruning that occurs
when the plan tree is initialized (and/or when it is executed) may
eliminate all but a few result relations.

I've attached a diff to v17-0001 to show one way of delaying setting
ForeignScanState.resultRelInfo.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/30/2575/

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Heikki Linnakangas
Date:
On 14/10/2020 09:44, Amit Langote wrote:
> I like the idea of storing the ResultRelInfo in ForeignScanState, but
> it would be better if we could document the fact that an FDW may not
> reliably access it until IterateDirectModify(). That's because setting
> it in ExecInitForeignScan() will mean *all* result relations must be
> initialized during ExecInitModifyTable(), which defeats my
> lazy-ResultRelInfo-initialization proposal.  As to why I'm
> pushing that proposal, consider that when we get the ability to use
> run-time pruning for UPDATE/DELETE with [1], initializing all result
> relations before initializing the plan tree will mean most of those
> ResultRelInfos will be unused, because run-time pruning that occurs
> when the plan tree is initialized (and/or when it is executed) may
> eliminate all but a few result relations.
> 
> I've attached a diff to v17-0001 to show one way of delaying setting
> ForeignScanState.resultRelInfo.

The BeginDirectModify function does a lot of expensive things, like 
opening a connection to the remote server. If we want to optimize 
run-time pruning, I think we need to avoid calling BeginDirectModify for 
pruned partitions altogether.

I pushed this without those delay-setting-resultRelInfo changes. But we 
can revisit those changes with the run-time pruning optimization patch.

I'll continue with the last couple of patches in this thread.

- Heikki



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 14/10/2020 09:44, Amit Langote wrote:
> > I like the idea of storing the ResultRelInfo in ForeignScanState, but
> > it would be better if we could document the fact that an FDW may not
> > reliably access it until IterateDirectModify(). That's because setting
> > it in ExecInitForeignScan() will mean *all* result relations must be
> > initialized during ExecInitModifyTable(), which defeats my
> > lazy-ResultRelInfo-initialization proposal.  As to why I'm
> > pushing that proposal, consider that when we get the ability to use
> > run-time pruning for UPDATE/DELETE with [1], initializing all result
> > relations before initializing the plan tree will mean most of those
> > ResultRelInfos will be unused, because run-time pruning that occurs
> > when the plan tree is initialized (and/or when it is executed) may
> > eliminate all but a few result relations.
> >
> > I've attached a diff to v17-0001 to show one way of delaying setting
> > ForeignScanState.resultRelInfo.
>
> The BeginDirectModify function does a lot of expensive things, like
> opening a connection to the remote server. If we want to optimize
> run-time pruning, I think we need to avoid calling BeginDirectModify for
> pruned partitions altogether.

Note that if foreign partitions get pruned during the so-called "init"
run-time pruning (that is, in the ExecInitNode() phase),
BeginDirectModify() won't get called on them.  Although, your concern
does apply if there is only going to be "exec" run-time pruning and no
"init" pruning.

For me, the former case is a bit more interesting, as it occurs with
prepared statements using a generic plan (update parted_table set ...
where partkey = $1).

> I pushed this without those delay-setting-resultRelInfo changes. But we
> can revisit those changes with the run-time pruning optimization patch.

Sure, that makes sense.

> I'll continue with the last couple of patches in this thread.

Okay, thanks.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From:
Heikki Linnakangas
Date:
On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'll continue with the last couple of patches in this thread.

I committed the move of the cross-partition logic to new 
ExecCrossPartitionUpdate() function, with just minor comment editing and 
pgindent. I left out the refactoring around the calls to AFTER ROW 
INSERT/DELETE triggers. I stared at the change for a while, and wasn't 
sure whether I liked the patched or the unpatched version better, so I
left it alone.

Looking at the last patch, "Revise child-to-root tuple conversion map 
management", that's certainly an improvement. However, I find it 
confusing that sometimes the mapping from child to root is in 
relinfo->ri_ChildToRootMap, and sometimes in 
relinfo->ri_PartitionInfo->pi_PartitionToRootMap. When is each of those 
filled in? If both are set, is it well defined which one is initialized 
first?

In general, I'm pretty confused by the initialization of 
ri_PartitionInfo. Where is it initialized, and when? In execnodes.h, the
definition of ResultRelInfo says:

>     /* info for partition tuple routing (NULL if not set up yet) */
>     struct PartitionRoutingInfo *ri_PartitionInfo;

That implies that the field is initialized lazily. But in 
ExecFindPartition, we have this:

>         if (partidx == partdesc->boundinfo->default_index)
>         {
>             PartitionRoutingInfo *partrouteinfo = rri->ri_PartitionInfo;
> 
>             /*
>              * The tuple must match the partition's layout for the constraint
>              * expression to be evaluated successfully.  If the partition is
>              * sub-partitioned, that would already be the case due to the code
>              * above, but for a leaf partition the tuple still matches the
>              * parent's layout.
>              *
>              * Note that we have a map to convert from root to current
>              * partition, but not from immediate parent to current partition.
>              * So if we have to convert, do it from the root slot; if not, use
>              * the root slot as-is.
>              */
>             if (partrouteinfo)
>             {
>                 TupleConversionMap *map = partrouteinfo->pi_RootToPartitionMap;
> 
>                 if (map)
>                     slot = execute_attr_map_slot(map->attrMap, rootslot,
>                                                  partrouteinfo->pi_PartitionTupleSlot);
>                 else
>                     slot = rootslot;
>             }
> 
>             ExecPartitionCheck(rri, slot, estate, true);
>         }

That check implies that it's not just lazily initialized, the code will 
work differently if ri_PartitionInfo is set or not.

I think all this would be more clear if ri_PartitionInfo and 
ri_ChildToRootMap were both truly lazily initialized, the first time 
they're needed. And if we removed 
ri_PartitionInfo->pi_PartitionToRootMap, and always used 
ri_ChildToRootMap for it.

Maybe remove PartitionRoutingInfo struct altogether, and just move its 
fields directly to ResultRelInfo.

- Heikki



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Thu, Oct 15, 2020 at 11:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I'll continue with the last couple of patches in this thread.
>
> I committed the move of the cross-partition logic to new
> ExecCrossPartitionUpdate() function, with just minor comment editing and
> pgindent. I left out the refactoring around the calls to AFTER ROW
> INSERT/DELETE triggers. I stared at the change for a while, and wasn't
> sure whether I liked the patched or the unpatched version better, so I
> left it alone.

Okay, thanks for committing that.

> Looking at the last patch, "Revise child-to-root tuple conversion map
> management", that's certainly an improvement. However, I find it
> confusing that sometimes the mapping from child to root is in
> relinfo->ri_ChildToRootMap, and sometimes in
> relinfo->ri_PartitionInfo->pi_PartitionToRootMap. When is each of those
> filled in? If both are set, is it well defined which one is initialized
> first?

It is ri_ChildToRootMap that is set first, because it's only set in
child UPDATE target relations, which are initialized in
ExecInitModifyTable(), that is, way before partition tuple routing
comes into the picture.

ri_PartitionInfo and hence pi_PartitionToRootMap is set in tuple
routing target partition's ResultRelInfos, which are lazily
initialized when tuples land into them.

If a tuple routing target partition happens to be an UPDATE target
relation and we need to initialize the partition-to-root map, which
for a tuple routing target partition is to be saved in
pi_PartitionToRootMap, with the patch, we will try to reuse
ri_ChildToRootMap because it would already be initialized.

But as you say below, maybe we don't need to have two fields for the
same thing, which I agree with.  Having only ri_ChildToRootMap as you
suggest below suffices.

> In general, I'm pretty confused by the initialization of
> ri_PartitionInfo. Where is it initialized, and when? In execnodes.h, the
> definition of ResultRelInfo says:
>
> >       /* info for partition tuple routing (NULL if not set up yet) */
> >       struct PartitionRoutingInfo *ri_PartitionInfo;
>
> That implies that the field is initialized lazily. But in
> ExecFindPartition, we have this:
>
> >               if (partidx == partdesc->boundinfo->default_index)
> >               {
> >                       PartitionRoutingInfo *partrouteinfo = rri->ri_PartitionInfo;
> >
> >                       /*
> >                        * The tuple must match the partition's layout for the constraint
> >                        * expression to be evaluated successfully.  If the partition is
> >                        * sub-partitioned, that would already be the case due to the code
> >                        * above, but for a leaf partition the tuple still matches the
> >                        * parent's layout.
> >                        *
> >                        * Note that we have a map to convert from root to current
> >                        * partition, but not from immediate parent to current partition.
> >                        * So if we have to convert, do it from the root slot; if not, use
> >                        * the root slot as-is.
> >                        */
> >                       if (partrouteinfo)
> >                       {
> >                               TupleConversionMap *map = partrouteinfo->pi_RootToPartitionMap;
> >
> >                               if (map)
> >                                       slot = execute_attr_map_slot(map->attrMap, rootslot,
> >                                                                    partrouteinfo->pi_PartitionTupleSlot);
> >                               else
> >                                       slot = rootslot;
> >                       }
> >
> >                       ExecPartitionCheck(rri, slot, estate, true);
> >               }
>
> That check implies that it's not just lazily initialized, the code will
> work differently if ri_PartitionInfo is set or not.
>
> I think all this would be more clear if ri_PartitionInfo and
> ri_ChildToRootMap were both truly lazily initialized, the first time
> they're needed.

So, we initialize these maps when we initialize a partition's
ResultRelInfo.  I mean if the partition has a different tuple
descriptor than root, we know we are going to need to convert tuples
between them (in either direction), so we might as well initialize the
maps when the ResultRelInfo is built, which we do lazily for tuple
routing target relations at least.  In that sense, at least
root-to-partition maps are initialized lazily, that is only when a
partition receives a tuple via routing.

Partition-to-root map initialization, though, is not always lazy,
because they are also needed by UPDATE target relations, whose
ResultRelInfo are initialized in ExecInitModifyTable(), which is not
lazy enough.  That will change with my other patch though. :)

> And if we removed
> ri_PartitionInfo->pi_PartitionToRootMap, and always used
> ri_ChildToRootMap for it.

Done in the attached.

> Maybe remove PartitionRoutingInfo struct altogether, and just move its
> fields directly to ResultRelInfo.

If we do that, we'll end up with 3 notations for the same thing across
releases: In v10 and v11, PartitionRoutingInfos members are saved in
arrays in ModifyTableState, totally detached from the partition
ResultRelInfos.  In v12 (3f2393edef), we moved them into ResultRelInfo
but chose to add them into a sub-struct (PartitionRoutingInfo), which
in retrospect was not a great decision.  Now if we pull them into
ResultRelInfo, we'll have invented the 3rd notation.  Maybe that makes
things hard when back-patching bug-fixes?

Attached updated patch.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Alvaro Herrera
Date:
On 2020-Oct-16, Amit Langote wrote:

> On Thu, Oct 15, 2020 at 11:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> > And if we removed
> > ri_PartitionInfo->pi_PartitionToRootMap, and always used
> > ri_ChildToRootMap for it.
> 
> Done in the attached.

Hmm...  Overall I like the simplification.

> > Maybe remove PartitionRoutingInfo struct altogether, and just move its
> > fields directly to ResultRelInfo.
> 
> If we do that, we'll end up with 3 notations for the same thing across
> releases: In v10 and v11, PartitionRoutingInfos members are saved in
> arrays in ModifyTableState, totally detached from the partition
> ResultRelInfos.  In v12 (3f2393edef), we moved them into ResultRelInfo
> but chose to add them into a sub-struct (PartitionRoutingInfo), which
> in retrospect was not a great decision.  Now if we pull them into
> ResultRelInfo, we'll have invented the 3rd notation.  Maybe that makes
> things hard when back-patching bug-fixes?

I don't necessarily agree that PartitionRoutingInfo was such a bad idea.
In fact I wonder if we shouldn't move *more* stuff into it
(ri_PartitionCheckExpr), and keep struct ResultRelInfo clean of
partitioning-related stuff (other than ri_PartitionInfo and
ri_PartitionRoot); there are plenty of ResultRelInfos that are not
partitions, so I think it makes sense to keep the split.  I'm thinking
that the ChildToRootMap should continue to be in PartitionRoutingInfo.

Maybe what we need in order to keep the initialization "lazy enough" is
some inline functions that act as getters, initializing members of
PartitionRoutingInfo when first needed.  (This would probably need
boolean flags, to distinguish "hasn't been set up yet" from "it is not
needed for this partition" for each member that requires it).

BTW it is curious that ExecInitRoutingInfo is called both in
ExecInitPartitionInfo() (from ExecFindPartition when the ResultRelInfo
for the partition is not found) *and* from ExecFindPartition again, when
the ResultRelInfo for the partition *is* found.  Doesn't this mean that
ri_PartitionInfo is set up twice for the same partition?




Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Fri, Oct 16, 2020 at 11:45 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2020-Oct-16, Amit Langote wrote:
> > On Thu, Oct 15, 2020 at 11:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> > > And if we removed
> > > ri_PartitionInfo->pi_PartitionToRootMap, and always used
> > > ri_ChildToRootMap for it.
> >
> > Done in the attached.
>
> Hmm...  Overall I like the simplification.

Thank you for looking it over.

> > > Maybe remove PartitionRoutingInfo struct altogether, and just move its
> > > fields directly to ResultRelInfo.
> >
> > If we do that, we'll end up with 3 notations for the same thing across
> > releases: In v10 and v11, PartitionRoutingInfos members are saved in
> > arrays in ModifyTableState, totally detached from the partition
> > ResultRelInfos.  In v12 (3f2393edef), we moved them into ResultRelInfo
> > but chose to add them into a sub-struct (PartitionRoutingInfo), which
> > in retrospect was not a great decision.  Now if we pull them into
> > ResultRelInfo, we'll have invented the 3rd notation.  Maybe that makes
> > things hard when back-patching bug-fixes?
>
> I don't necessarily agree that PartitionRoutingInfo was such a bad idea.
> In fact I wonder if we shouldn't move *more* stuff into it
> (ri_PartitionCheckExpr), and keep struct ResultRelInfo clean of
> partitioning-related stuff (other than ri_PartitionInfo and
> ri_PartitionRoot); there are plenty of ResultRelInfos that are not
> partitions, so I think it makes sense to keep the split.  I'm thinking
> that the ChildToRootMap should continue to be in PartitionRoutingInfo.

Hmm, I don't see ri_PartitionCheckExpr as being a piece of routing
information, because it's primarily meant to be used when inserting
*directly* into a partition, although it's true we do initialize it in
routing target partitions too in some cases.

Also, ChildToRootMap was introduced by the trigger transition table
project, not tuple routing.  I think we misjudged this when we added
PartitionToRootMap to PartitionRoutingInfo, because it doesn't really
belong there.  This patch fixes that by removing PartitionToRootMap.

RootToPartitionMap and the associated partition slot is the only piece
of extra information that is needed by tuple routing target relations.

> Maybe what we need in order to keep the initialization "lazy enough" is
> some inline functions that act as getters, initializing members of
> PartitionRoutingInfo when first needed.  (This would probably need
> boolean flags, to distinguish "hasn't been set up yet" from "it is not
> needed for this partition" for each member that requires it).

As I said in my previous email, I don't see how we can make
initializing the map any lazier than it already is.  If a partition
has a different tuple descriptor than the root table, then we know for
sure that any tuples that are routed to it will need to be converted
from the root tuple format to its tuple format, so we might as well
build the map when the ResultRelInfo is built.  If no tuple lands into
a partition, we would neither build its ResultRelInfo nor the map.
With the current arrangement, if the map field is NULL, it simply
means that the partition has the same tuple format as the root table.

> BTW it is curious that ExecInitRoutingInfo is called both in
> ExecInitPartitionInfo() (from ExecFindPartition when the ResultRelInfo
> for the partition is not found) *and* from ExecFindPartition again, when
> the ResultRelInfo for the partition *is* found.  Doesn't this mean that
> ri_PartitionInfo is set up twice for the same partition?

No.  ExecFindPartition() directly calls ExecInitRoutingInfo() only for
reused update result relations, that too, only the first time a tuple
lands into such a partition.  For the subsequent tuples that land into
the same partition, ExecFindPartition() will be able to find that
ResultRelInfo in the proute->partitions[] array.  All ResultRelInfos
in that array are assumed to have been processed by
ExecInitRoutingInfo().

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From:
Alvaro Herrera
Date:
On 2020-Oct-17, Amit Langote wrote:

> Hmm, I don't see ri_PartitionCheckExpr as being a piece of routing
> information, because it's primarily meant to be used when inserting
> *directly* into a partition, although it's true we do initialize it in
> routing target partitions too in some cases.
> 
> Also, ChildToRootMap was introduced by the trigger transition table
> project, not tuple routing.  I think we misjudged this when we added
> PartitionToRootMap to PartitionRoutingInfo, because it doesn't really
> belong there.  This patch fixes that by removing PartitionToRootMap.
> 
> RootToPartitionMap and the associated partition slot is the only piece
> of extra information that is needed by tuple routing target relations.

Well, I was thinking of making the ri_PartitionInfo be about
partitioning in general, not just specifically for partition tuple
routing.  Maybe Heikki is right that it may end up being simpler to
remove ri_PartitionInfo altogether.  It'd just be a couple of additional
pointers in ResultRelInfo after all.  (Remember that we wanted to get
rid of fields specific to only certain kinds of RTEs in RangeTblEntry
for example, to keep things cleanly separated, although that project
eventually found its demise for other reasons.)

> As I said in my previous email, I don't see how we can make
> initializing the map any lazier than it already is.  If a partition
> has a different tuple descriptor than the root table, then we know for
> sure that any tuples that are routed to it will need to be converted
> from the root tuple format to its tuple format, so we might as well
> build the map when the ResultRelInfo is built.  If no tuple lands into
> a partition, we would neither build its ResultRelInfo nor the map.
> With the current arrangement, if the map field is NULL, it simply
> means that the partition has the same tuple format as the root table.

I see -- makes sense.

> On Fri, Oct 16, 2020 at 11:45 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > BTW it is curious that ExecInitRoutingInfo is called both in
> > ExecInitPartitionInfo() (from ExecFindPartition when the ResultRelInfo
> > for the partition is not found) *and* from ExecFindPartition again, when
> > the ResultRelInfo for the partition *is* found.  Doesn't this mean that
> > ri_PartitionInfo is set up twice for the same partition?
> 
> No.  ExecFindPartition() directly calls ExecInitRoutingInfo() only for
> reused update result relations, that too, only the first time a tuple
> lands into such a partition.  For the subsequent tuples that land into
> the same partition, ExecFindPartition() will be able to find that
> ResultRelInfo in the proute->partitions[] array.  All ResultRelInfos
> in that array are assumed to have been processed by
> ExecInitRoutingInfo().

Doh, right, sorry, I was misreading the if/else maze there.



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Sun, Oct 18, 2020 at 12:54 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2020-Oct-17, Amit Langote wrote:
> > Hmm, I don't see ri_PartitionCheckExpr as being a piece of routing
> > information, because it's primarily meant to be used when inserting
> > *directly* into a partition, although it's true we do initialize it in
> > routing target partitions too in some cases.
> >
> > Also, ChildToRootMap was introduced by the trigger transition table
> > project, not tuple routing.  I think we misjudged this when we added
> > PartitionToRootMap to PartitionRoutingInfo, because it doesn't really
> > belong there.  This patch fixes that by removing PartitionToRootMap.
> >
> > RootToPartitionMap and the associated partition slot is the only piece
> > of extra information that is needed by tuple routing target relations.
>
> Well, I was thinking of making the ri_PartitionInfo be about
> partitioning in general, not just specifically for partition tuple
> routing.  Maybe Heikki is right that it may end up being simpler to
> remove ri_PartitionInfo altogether.  It'd just be a couple of additional
> pointers in ResultRelInfo after all.

So that's 2 votes for removing PartitionRoutingInfo from the tree.
Okay, I have tried that in the attached 0002 patch.  Also, I fixed
some comments in 0001 that still referenced PartitionToRootMap.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachments

Re: partition routing layering in nodeModifyTable.c

From:
Heikki Linnakangas
Date:
On 19/10/2020 07:54, Amit Langote wrote:
> On Sun, Oct 18, 2020 at 12:54 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> On 2020-Oct-17, Amit Langote wrote:
>>> Hmm, I don't see ri_PartitionCheckExpr as being a piece of routing
>>> information, because it's primarily meant to be used when inserting
>>> *directly* into a partition, although it's true we do initialize it in
>>> routing target partitions too in some cases.
>>>
>>> Also, ChildToRootMap was introduced by the trigger transition table
>>> project, not tuple routing.  I think we misjudged this when we added
>>> PartitionToRootMap to PartitionRoutingInfo, because it doesn't really
>>> belong there.  This patch fixes that by removing PartitionToRootMap.
>>>
>>> RootToPartitionMap and the associated partition slot is the only piece
>>> of extra information that is needed by tuple routing target relations.
>>
>> Well, I was thinking of making the ri_PartitionInfo be about
>> partitioning in general, not just specifically for partition tuple
>> routing.  Maybe Heikki is right that it may end up being simpler to
>> remove ri_PartitionInfo altogether.  It'd just be a couple of additional
>> pointers in ResultRelInfo after all.
> 
> So that's 2 votes for removing PartitionRoutingInfo from the tree.
> Okay, I have tried that in the attached 0002 patch.  Also, I fixed
> some comments in 0001 that still referenced PartitionToRootMap.

Pushed, with minor comment changes.

I also noticed that the way the getTargetResultRelInfo() helper function 
was used was a bit messy. It was used when firing AFTER STATEMENT 
triggers, but for some reason the code to fire BEFORE STATEMENT triggers 
didn't use it but duplicated the logic instead. I made that a bit 
simpler, by always setting the rootResultRelInfo field in 
ExecInitModifyTable(), making the getTargetResultRelInfo() function 
unnecessary.

Thanks!

- Heikki



Re: partition routing layering in nodeModifyTable.c

From:
Heikki Linnakangas
Date:
On 17/10/2020 18:54, Alvaro Herrera wrote:
> On 2020-Oct-17, Amit Langote wrote:
>> As I said in my previous email, I don't see how we can make
>> initializing the map any lazier than it already is.  If a partition
>> has a different tuple descriptor than the root table, then we know for
>> sure that any tuples that are routed to it will need to be converted
>> from the root tuple format to its tuple format, so we might as well
>> build the map when the ResultRelInfo is built.  If no tuple lands into
>> a partition, we would neither build its ResultRelInfo nor the map.
>> With the current arrangement, if the map field is NULL, it simply
>> means that the partition has the same tuple format as the root table.
> 
> I see -- makes sense.

It's probably true that there's no performance gain from initializing 
them more lazily. But the reasoning and logic around the initialization 
is complicated. After tracing various paths through the code, I'm 
convinced enough that it's correct, or at least these patches didn't 
break it, but I still think some sort of lazy initialization on first 
use would make it more readable. Or perhaps there's some other 
refactoring we could do.

Perhaps we should have a magic TupleConversionMap value to mean "no 
conversion required". NULL could then mean "not initialized yet".

>> On Fri, Oct 16, 2020 at 11:45 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 
>>> BTW it is curious that ExecInitRoutingInfo is called both in
>>> ExecInitPartitionInfo() (from ExecFindPartition when the ResultRelInfo
>>> for the partition is not found) *and* from ExecFindPartition again, when
>>> the ResultRelInfo for the partition *is* found.  Doesn't this mean that
>>> ri_PartitionInfo is set up twice for the same partition?
>>
>> No.  ExecFindPartition() directly calls ExecInitRoutingInfo() only for
>> reused update result relations, that too, only the first time a tuple
>> lands into such a partition.  For the subsequent tuples that land into
>> the same partition, ExecFindPartition() will be able to find that
>> ResultRelInfo in the proute->partitions[] array.  All ResultRelInfos
>> in that array are assumed to have been processed by
>> ExecInitRoutingInfo().
> 
> Doh, right, sorry, I was misreading the if/else maze there.

I think that demonstrates my point that the logic is hard to follow :-).

- Heikki



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Mon, Oct 19, 2020 at 8:48 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 19/10/2020 07:54, Amit Langote wrote:
> > On Sun, Oct 18, 2020 at 12:54 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >> Well, I was thinking of making the ri_PartitionInfo be about
> >> partitioning in general, not just specifically for partition tuple
> >> routing.  Maybe Heikki is right that it may end up being simpler to
> >> remove ri_PartitionInfo altogether.  It'd just be a couple of additional
> >> pointers in ResultRelInfo after all.
> >
> > So that's 2 votes for removing PartitionRoutingInfo from the tree.
> > Okay, I have tried that in the attached 0002 patch.  Also, I fixed
> > some comments in 0001 that still referenced PartitionToRootMap.
>
> Pushed, with minor comment changes.

Thank you.

> I also noticed that the way the getTargetResultRelInfo() helper function
> was used was a bit messy. It was used when firing AFTER STATEMENT
> triggers, but for some reason the code to fire BEFORE STATEMENT triggers
> didn't use it but duplicated the logic instead. I made that a bit
> simpler, by always setting the rootResultRelInfo field in
> ExecInitModifyTable(), making the getTargetResultRelInfo() function
> unnecessary.

Good, I was mildly annoyed by that function too.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From:
Amit Langote
Date:
On Mon, Oct 19, 2020 at 8:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 17/10/2020 18:54, Alvaro Herrera wrote:
> > On 2020-Oct-17, Amit Langote wrote:
> >> As I said in my previous email, I don't see how we can make
> >> initializing the map any lazier than it already is.  If a partition
> >> has a different tuple descriptor than the root table, then we know for
> >> sure that any tuples that are routed to it will need to be converted
> >> from the root tuple format to its tuple format, so we might as well
> >> build the map when the ResultRelInfo is built.  If no tuple lands into
> >> a partition, we would neither build its ResultRelInfo nor the map.
> >> With the current arrangement, if the map field is NULL, it simply
> >> means that the partition has the same tuple format as the root table.
> >
> > I see -- makes sense.
>
> It's probably true that there's no performance gain from initializing
> them more lazily. But the reasoning and logic around the initialization
> is complicated. After tracing through various paths through the code, I'm
> convinced enough that it's correct, or at least these patches didn't
> break it, but I still think some sort of lazy initialization on first
> use would make it more readable. Or perhaps there's some other
> refactoring we could do.

So the other patch I have mentioned is about lazy initialization of
the ResultRelInfo itself, not the individual fields, but maybe with
enough refactoring we can get the latter too.

Currently, ExecInitModifyTable() performs ExecInitResultRelation() for
all relations in ModifyTable.resultRelations, which sets most but not
all ResultRelInfo fields (whatever InitResultRelInfo() can set),
followed by initializing some other fields based on the contents of
the ModifyTable plan.  My patch moves those two steps into a function
ExecBuildResultRelation() which is called lazily during
ExecModifyTable() for a given result relation on the first tuple
produced by that relation's plan.  Actually, there's a "getter" named
ExecGetResultRelation() which first consults es_result_relations[rti -
1] for the requested relation and if it's NULL then calls
ExecBuildResultRelation().

Would you mind taking a look at that as a starting point?  I am
thinking there's enough relevant discussion here that I should post
the rebased version of that patch here.

> Perhaps we should have a magic TupleConversionMap value to mean "no
> conversion required". NULL could then mean "not initialized yet".

Perhaps, a TupleConversionMap with its attrMap set to NULL means "no
conversion required".


--
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Tue, Oct 20, 2020 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Oct 19, 2020 at 8:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > It's probably true that there's no performance gain from initializing
> > them more lazily. But the reasoning and logic around the initialization
> > is complicated. After tracing through various paths through the code, I'm
> > convinced enough that it's correct, or at least these patches didn't
> > break it, but I still think some sort of lazy initialization on first
> > use would make it more readable. Or perhaps there's some other
> > refactoring we could do.
>
> So the other patch I have mentioned is about lazy initialization of
> the ResultRelInfo itself, not the individual fields, but maybe with
> enough refactoring we can get the latter too.

So, I tried implementing a lazy-initialization-on-first-access
approach for both the ResultRelInfos themselves and some of the
individual fields of ResultRelInfo that don't need to be set right
away.  You can see the end result in the attached 0003 patch.  This
slims down ExecInitModifyTable() significantly, both in terms of code
footprint and the amount of work that it does.

0001 fixes a thinko of the recent commit 1375422c782 that I discovered
when debugging a problem with 0003.

0002 is for something I have mentioned upthread.
ForeignScanState.resultRelInfo cannot be set in the ExecInit* stage as
is done now, because with 0003, child ResultRelInfos will not have
been added to es_result_relations at that stage.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachments

Re: partition routing layering in nodeModifyTable.c

From
Alvaro Herrera
Date:
On 2020-Oct-22, Amit Langote wrote:

> 0001 fixes a thinko of the recent commit 1375422c782 that I discovered
> when debugging a problem with 0003.

Hmm, how hard is it to produce a test case that fails because of this
problem?




Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Thu, Oct 22, 2020 at 11:25 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2020-Oct-22, Amit Langote wrote:
>
> > 0001 fixes a thinko of the recent commit 1375422c782 that I discovered
> > when debugging a problem with 0003.
>
> Hmm, how hard is it to produce a test case that fails because of this
> problem?

I checked and don't think there's any live bug here.  You will notice
if you take a look at 1375422c7 that we've made es_result_relations an
array of pointers while the individual ModifyTableState nodes own the
actual ResultRelInfos.  So, EvalPlanQualStart() setting the parent
EState's es_result_relations array to NULL implies that those pointers
become inaccessible to the parent query's execution after
EvalPlanQual() returns.  However, nothing in the tree today accesses
ResultRelInfos through the es_result_relations array, except during the
ExecInit* stage (see ExecInitForeignScan()) but it would still be
intact at that stage.

With the lazy-initialization patch though, we do check
es_result_relations when trying to open a result relation to see if it
has already been initialized (a non-NULL pointer in that array means
yes), so resetting it in the middle of execution can't be safe.
For example, we would end up initializing the same relation many
times after not finding it in es_result_relations, and also adding
duplicate entries to the es_opened_result_relations list, breaking the
invariant that that list contains only distinct relations.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From
Heikki Linnakangas
Date:
On 23/10/2020 05:56, Amit Langote wrote:
> On Thu, Oct 22, 2020 at 11:25 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>
>> On 2020-Oct-22, Amit Langote wrote:
>>
>>> 0001 fixes a thinko of the recent commit 1375422c782 that I discovered
>>> when debugging a problem with 0003.
>>
>> Hmm, how hard is it to produce a test case that fails because of this
>> problem?
> 
> I checked and don't think there's any live bug here.  You will notice
> if you take a look at 1375422c7 that we've made es_result_relations an
> array of pointers while the individual ModifyTableState nodes own the
> actual ResultRelInfos.  So, EvalPlanQualStart() setting the parent
> EState's es_result_relations array to NULL implies that those pointers
> become inaccessible to the parent query's execution after
> EvalPlanQual() returns.  However, nothing in the tree today accesses
> ResultRelInfos through the es_result_relations array, except during the
> ExecInit* stage (see ExecInitForeignScan()) but it would still be
> intact at that stage.
> 
> With the lazy-initialization patch though, we do check
> es_result_relations when trying to open a result relation to see if it
> has already been initialized (a non-NULL pointer in that array means
> yes), so resetting it in the middle of the execution can't be safe.
> For one example, we will end up initializing the same relation many
> times after not finding it in es_result_relations and also add it
> *duplicatively* to es_opened_result_relations list, breaking the
> invariant that that list contains distinct relations.

Pushed that thinko-fix, thanks!

- Heikki



Re: partition routing layering in nodeModifyTable.c

From
Heikki Linnakangas
Date:
On 22/10/2020 16:49, Amit Langote wrote:
> On Tue, Oct 20, 2020 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> On Mon, Oct 19, 2020 at 8:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> It's probably true that there's no performance gain from initializing
>>> them more lazily. But the reasoning and logic around the initialization
>>> is complicated. After tracing through various paths through the code, I'm
>>> convinced enough that it's correct, or at least these patches didn't
>>> break it, but I still think some sort of lazy initialization on first
>>> use would make it more readable. Or perhaps there's some other
>>> refactoring we could do.
>>
>> So the other patch I have mentioned is about lazy initialization of
>> the ResultRelInfo itself, not the individual fields, but maybe with
>> enough refactoring we can get the latter too.
> 
> So, I tried implementing a lazy-initialization-on-first-access
> approach for both the ResultRelInfos themselves and some of the
> individual fields of ResultRelInfo that don't need to be set right
> away.  You can see the end result in the attached 0003 patch.  This
> slims down ExecInitModifyTable() significantly, both in terms of code
> footprint and the amount of work that it does.

Have you done any performance testing? I'd like to know how much of a 
difference this makes in practice.

Another alternative is to continue to create the ResultRelInfos in 
ExecInitModifyTable(), but initialize the individual fields in them lazily.

Does this patch become moot if we do the "Overhaul UPDATE/DELETE 
processing"? 
(https://www.postgresql.org/message-id/CA%2BHiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ%40mail.gmail.com)

- Heikki



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Fri, Oct 23, 2020 at 4:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 22/10/2020 16:49, Amit Langote wrote:
> > On Tue, Oct 20, 2020 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >> On Mon, Oct 19, 2020 at 8:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>> It's probably true that there's no performance gain from initializing
> >>> them more lazily. But the reasoning and logic around the initialization
> >>> is complicated. After tracing through various paths through the code, I'm
> >>> convinced enough that it's correct, or at least these patches didn't
> >>> break it, but I still think some sort of lazy initialization on first
> >>> use would make it more readable. Or perhaps there's some other
> >>> refactoring we could do.
> >>
> >> So the other patch I have mentioned is about lazy initialization of
> >> the ResultRelInfo itself, not the individual fields, but maybe with
> >> enough refactoring we can get the latter too.
> >
> > So, I tried implementing a lazy-initialization-on-first-access
> > approach for both the ResultRelInfos themselves and some of the
> > individual fields of ResultRelInfo that don't need to be set right
> > away.  You can see the end result in the attached 0003 patch.  This
> > slims down ExecInitModifyTable() significantly, both in terms of code
> > footprint and the amount of work that it does.
>
> Have you done any performance testing? I'd like to know how much of a
> difference this makes in practice.

I have shown some numbers here:

https://www.postgresql.org/message-id/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com

To reiterate, if you apply the following patch:

> Does this patch become moot if we do the "Overhaul UPDATE/DELETE
> processing"?
> (https://www.postgresql.org/message-id/CA%2BHiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ%40mail.gmail.com)

...and run this benchmark with plan_cache_mode = force_generic_plan

pgbench -i -s 10 --partitions={0, 10, 100, 1000}
pgbench -T120 -f test.sql -M prepared

test.sql:
\set aid random(1, 1000000)
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;

you may see roughly the following results:

HEAD:

0 tps = 13045.485121 (excluding connections establishing)
10 tps = 9358.157433 (excluding connections establishing)
100 tps = 1878.274500 (excluding connections establishing)
1000 tps = 84.684695 (excluding connections establishing)

Patched (overhaul update/delete processing):

0 tps = 12743.487196 (excluding connections establishing)
10 tps = 12644.240748 (excluding connections establishing)
100 tps = 4158.123345 (excluding connections establishing)
1000 tps = 391.248067 (excluding connections establishing)

And if you apply the patch being discussed here, TPS shoots up a bit,
especially for higher partition counts:

Patched (lazy-ResultRelInfo-initialization)

0 tps = 13419.283168 (excluding connections establishing)
10 tps = 12588.016095 (excluding connections establishing)
100 tps = 8560.824225 (excluding connections establishing)
1000 tps = 1926.553901 (excluding connections establishing)

To explain these numbers a bit, the "overhaul update/delete processing"
patch improves the performance of that benchmark by allowing the
updates to use run-time pruning when executing generic plans, which
they can't today.

However without "lazy-ResultRelInfo-initialization" patch,
ExecInitModifyTable() (or InitPlan() when I ran those benchmarks) can
be seen to be spending time initializing all of those result
relations, whereas only one of those will actually be used.

As mentioned further in that email, it's really the locking of all
relations by AcquireExecutorLocks() that occurs even before we enter
the executor that's a much thornier bottleneck for this benchmark.
But the ResultRelInfo initialization bottleneck sounded like it could
get alleviated in a relatively straightforward manner.  The patches
that were developed for attacking the locking bottleneck would require
further reflection on whether they are correct.

(Note: I've just copy pasted the numbers I reported in that email.  To
reproduce, I'll have to rebase the "overhaul update/delete processing"
patch on this one, which I haven't yet done.)

> Another alternative is to continue to create the ResultRelInfos in
> ExecInitModify(), but initialize the individual fields in them lazily.

If you consider the above, maybe you can see how that will not really
eliminate the bottleneck I'm aiming to fix here.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From
Heikki Linnakangas
Date:
On 23/10/2020 12:37, Amit Langote wrote:
> To explain these numbers a bit, the "overhaul update/delete processing"
> patch improves the performance of that benchmark by allowing the
> updates to use run-time pruning when executing generic plans, which
> they can't today.
> 
> However without "lazy-ResultRelInfo-initialization" patch,
> ExecInitModifyTable() (or InitPlan() when I ran those benchmarks) can
> be seen to be spending time initializing all of those result
> relations, whereas only one of those will actually be used.
> 
> As mentioned further in that email, it's really the locking of all
> relations by AcquireExecutorLocks() that occurs even before we enter
> the executor that's a much thornier bottleneck for this benchmark.
> But the ResultRelInfo initialization bottleneck sounded like it could
> get alleviated in a relatively straightforward manner.  The patches
> that were developed for attacking the locking bottleneck would require
> further reflection on whether they are correct.
> 
> (Note: I've just copy pasted the numbers I reported in that email.  To
> reproduce, I'll have to rebase the "overhaul update/delete processing"
> patch on this one, which I haven't yet done.)

Ok, thanks for the explanation, now I understand.

This patch looks reasonable to me at a quick glance. I'm a bit worried 
or unhappy about the impact on FDWs, though. It doesn't seem nice that 
the ResultRelInfo is not available in the BeginDirectModify call. It's 
not too bad, the FDW can call ExecGetResultRelation() if it needs it, 
but still. Perhaps it would be better to delay calling 
BeginDirectModify() until the first modification is performed, to avoid 
any initialization overhead there, like establishing the connection in 
postgres_fdw.

But since this applies on top of the "overhaul update/delete processing" 
patch, let's tackle that patch set next. Could you rebase that, please?

- Heikki



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 23/10/2020 12:37, Amit Langote wrote:
> > To explain these numbers a bit, the "overhaul update/delete processing"
> > patch improves the performance of that benchmark by allowing the
> > updates to use run-time pruning when executing generic plans, which
> > they can't today.
> >
> > However without "lazy-ResultRelInfo-initialization" patch,
> > ExecInitModifyTable() (or InitPlan() when I ran those benchmarks) can
> > be seen to be spending time initializing all of those result
> > relations, whereas only one of those will actually be used.
> >
> > As mentioned further in that email, it's really the locking of all
> > relations by AcquireExecutorLocks() that occurs even before we enter
> > the executor that's a much thornier bottleneck for this benchmark.
> > But the ResultRelInfo initialization bottleneck sounded like it could
> > get alleviated in a relatively straightforward manner.  The patches
> > that were developed for attacking the locking bottleneck would require
> > further reflection on whether they are correct.
> >
> > (Note: I've just copy pasted the numbers I reported in that email.  To
> > reproduce, I'll have to rebase the "overhaul update/delete processing"
> > patch on this one, which I haven't yet done.)
>
> Ok, thanks for the explanation, now I understand.
>
> But since this applies on top of the "overhaul update/delete processing"
> patch, let's tackle that patch set next. Could you rebase that, please?

Actually, I made lazy-ResultRelInfo-initialization apply on HEAD
directly at one point because of its separate CF entry, that is, to
appease the CF app's automatic patch tester that wouldn't know to
apply the other patch first.  Because both of these patch sets want to
change how ModifyTable works, there are conflicts.

The "overhaul update/delete processing" patch is somewhat complex and
I expect some amount of back and forth on its design points.  OTOH,
the lazy-ResultRelInfo-initialization patch is straightforward enough
that I hoped it would be easier to bring it into a committable state
than the other.  But I can see why one may find it hard to justify
committing the latter without the former already in, because the
bottleneck it purports to alleviate (that of eager ResultRelInfo
initialization) is not apparent until update/delete can use run-time
pruning.

Anyway, I will post the rebased patch on the "overhaul update/delete
processing" thread.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This patch looks reasonable to me at a quick glance. I'm a bit worried
> or unhappy about the impact on FDWs, though. It doesn't seem nice that
> the ResultRelInfo is not available in the BeginDirectModify call. It's
> not too bad, the FDW can call ExecGetResultRelation() if it needs it,
> but still. Perhaps it would be better to delay calling
> BeginDirectModify() until the first modification is performed, to avoid
> any initialization overhead there, like establishing the connection in
> postgres_fdw.

Ah, calling BeginDirectModify() itself lazily sounds like a good idea;
see attached updated 0001 to see how that looks.  While updating that
patch, I realized that the ForeignScan.resultRelation that we
introduced in 178f2d560d will now be totally useless. :-(

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachments

Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Wed, Oct 28, 2020 at 12:02 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > But since this applies on top of the "overhaul update/delete processing"
> > patch, let's tackle that patch set next. Could you rebase that, please?
>
>
> Anyway, I will post the rebased patch on the "overhaul update/delete
> processing" thread.

Done.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: partition routing layering in nodeModifyTable.c

From
Amit Langote
Date:
On Wed, Oct 28, 2020 at 4:46 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > This patch looks reasonable to me at a quick glance. I'm a bit worried
> > or unhappy about the impact on FDWs, though. It doesn't seem nice that
> > the ResultRelInfo is not available in the BeginDirectModify call. It's
> > not too bad, the FDW can call ExecGetResultRelation() if it needs it,
> > but still. Perhaps it would be better to delay calling
> > BeginDirectModify() until the first modification is performed, to avoid
> > any initialization overhead there, like establishing the connection in
> > postgres_fdw.
>
> Ah, calling BeginDirectModify() itself lazily sounds like a good idea;
> see attached updated 0001 to see how that looks.  While updating that
> patch, I realized that the ForeignScan.resultRelation that we
> introduced in 178f2d560d will now be totally useless. :-(

Given that we've closed the CF entry for this thread and given that
there seems to be enough context to these patches, I will move these
patches back to their original thread, that is:

* ModifyTable overheads in generic plans *
https://www.postgresql.org/message-id/flat/CA%2BHiwqE4k1Q2TLmCAvekw%2B8_NXepbnfUOamOeX%3DKpHRDTfSKxA%40mail.gmail.com

That will also make the CF-bot happy.

-- 
Amit Langote
EDB: http://www.enterprisedb.com