Re: Parallel Inserts in CREATE TABLE AS

Поиск
Список
Период
Сортировка
От Zhihong Yu
Тема Re: Parallel Inserts in CREATE TABLE AS
Дата
Msg-id CALNJ-vRYzKbjTF3p++8tpVOK5Y5qroV3gYirCEdFnEsuE1g3tw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Ответы Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Список pgsql-hackers
For v16-0002-Tuple-Cost-Adjustment-for-Parallel-Inserts-in-CTAS.patch:

+       if (ignore &&
+           (root->parse->CTASParallelInsInfo &
+            CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))

I wonder why CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is checked again in the above if since when ignore_parallel_tuple_cost returns true, CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is set already.

+ * In this function we only care Append and Gather nodes.

'care' -> 'care about'

+       for (int i = 0; i < aps->as_nplans; i++)
+       {
+           parallel |= PushDownCTASParallelInsertState(dest,
+                                                       aps->appendplans[i],
+                                                       gather_exists);

It seems the loop termination condition can include parallel since we can come out of the loop once parallel is true.

+   if (!allow && tuple_cost_flags && gather_exists)

As the above code shows, gather_exists is only checked when allow is false.

+            * We set the flag for two cases when there is no parent path will
+            * be created(such as : limit,sort,distinct...):

Please correct the grammar : there are two verbs following 'when'

For set_append_rel_size:

+           {
+               root->parse->CTASParallelInsInfo |=
+                                       CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND;
+           }
+       }
+
+       if (root->parse->CTASParallelInsInfo &
+           CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND)
+       {
+           root->parse->CTASParallelInsInfo &=
+                                       ~CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND;

In the if block for childrel->rtekind == RTE_SUBQUERY, CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND maybe set. Why is it cleared immediately after ?

+   /* Set to this in case tuple cost needs to be ignored for Append cases. */
+   CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND = 1 << 3

Since each CTAS_PARALLEL_INS_ flag is a bit, maybe it's better to use 'turn on' or similar term in the comment. Because 'set to' normally means assignment.

Cheers

On Sun, Dec 27, 2020 at 12:50 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Sat, Dec 26, 2020 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have reviewed part of v15-0001 patch, I have a few comments, I will
> continue to review this.

Thanks a lot.

> 1.
> Why is this temporary hack? and what is the plan for removing this hack?

The changes in xact.c, xact.h and heapam.c are common to all the
parallel insert patches - COPY, INSERT INTO SELECT. That was the
initial comment, I forgot to keep it in sync with the other patches.
Now, I used the comment from INSERT INTO SELECT patch. IIRC, the plan
was to have these code in all the parallel inserts patch, whichever
gets to review and commit first, others will update their patches
accordingly.

> 2.
> +/*
> + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> + * insertion is possible, if yes set the parallel insert state i.e. push down
> + * the dest receiver to the Gather nodes.
> + */
> +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> +{
> +    if (!IS_CTAS(into))
> +        return;
>
> When will this hit?  The functtion name suggest that it is from CTAS
> but now you have a check that if it is
> not for CTAS then return,  can you add the comment that when do you
> expect this case?

Yes it will hit for explain cases, but I choose to remove this and
check outside in the explain something like:
if (into)
    ChooseParallelInsertsInCTAS()

> Also the function name should start in a new line
> i.e
> void
> ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)

Ah, missed that. Modified now.

> 3.
> +/*
> + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> + * insertion is possible, if yes set the parallel insert state i.e. push down
> + * the dest receiver to the Gather nodes.
> + */
>
> Push down to the Gather nodes?  I think the right statement will be
> push down below the Gather node.

Modified.

> 4.
> intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
> {
>      DR_intorel *myState = (DR_intorel *) self;
>
>     -- Comment ->in parallel worker we don't need to crease dest recv blah blah
> +    if (myState->is_parallel_worker)
>     {
>         --parallel worker handling--
>         return;
>     }
>
>     --non-parallel worker code stay right there, instead of moving to else

Done.

> 5.
> +/*
> + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> + * insertion is possible, if yes set the parallel insert state i.e. push down
> + * the dest receiver to the Gather nodes.
> + */
> +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> +{
>
> From function name and comments it appeared that this function will
> return boolean saying whether
> Parallel insert should be selected or not.  I think name/comment
> should be better for this

Yeah that function can still return void because no point in returning
bool there, since the intention is to see if parallel inserts can be
performed, if yes, set the state otherwise exit. I changed the
function name to TryParallelizingInsertsInCTAS(). Let me know your
suggestions if that doesn't work out.

> 6.
>         /*
> +         * For parallelizing inserts in CTAS i.e. making each parallel worker
> +         * insert the tuples, we must send information such as into clause (for
> +         * each worker to build separate dest receiver), object id (for each
> +         * worker to open the created table).
>
> Comment is saying we need to pass object id but the code under this
> comment is not doing so.

Improved the comment.

> 7.
> +        /*
> +         * Since there are no rows that are transferred from workers to Gather
> +         * node, so we set it to 0 to be visible in estimated row count of
> +         * explain plans.
> +         */
> +        queryDesc->planstate->plan->plan_rows = 0;
>
> This seems a bit hackies Why it is done after the planning,  I mean
> plan must know that it is returning a 0 rows?

This exists to show up the estimated row count(in case of EXPLAIN CTAS
without ANALYZE) in the output. For EXPLAIN ANALYZE CTAS actual tuples
are shown correctly as 0 because Gather doesn't receive any tuples.
    if (es->costs)
    {
        if (es->format == EXPLAIN_FORMAT_TEXT)
        {
            appendStringInfo(es->str, "  (cost=%.2f..%.2f rows=%.0f width=%d)",
                             plan->startup_cost, plan->total_cost,
                             plan->plan_rows, plan->plan_width);

Since it's an estimated row count(which may not be always correct), we
will let the EXPLAIN plan show that and I think we can remove that
part. Thoughts?

I removed it in v6 patch set.

> 8.
> +        char *intoclause_space = shm_toc_allocate(pcxt->toc,
> +                                                  intoclause_len);
> +        memcpy(intoclause_space, intoclausestr, intoclause_len);
> +        shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space);
>
> One blank line between variable declaration and next code segment,
> take care at other places as well.

Done.

I'm attaching the v16 patch set. Please note that I added the
documentation saying that parallel insertions can happen and a sample
output of the explain to 0003 patch as discussed in [1]. But I didn't
move the explain output related code to a separate patch because it's
a small snippet in explain.c. I hope that's okay.

[1] - https://www.postgresql.org/message-id/CAA4eK1JqwXGYoGa1%2B3-f0T50dBGufvKaKQOee_AfFhygZ6QKtA%40mail.gmail.com



With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: Re: range_agg
Следующее
От: "Joel Jacobson"
Дата:
Сообщение: hash_extension(text)