Re: Parallel Inserts in CREATE TABLE AS

Поиск
Список
Период
Сортировка
От Dilip Kumar
Тема Re: Parallel Inserts in CREATE TABLE AS
Дата
Msg-id CAFiTN-tgTyaLXiJVvhBMb3EoALZhDxBBdr6XY844bvVTCgaWQg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Ответы Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Список pgsql-hackers
On Thu, Dec 24, 2020 at 1:07 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote:
> > You could change intoclause_len = strlen(intoclausestr) to
> > strlen(intoclausestr) + 1 and use intoclause_len in the remaining
> > places. We can avoid the +1 in the other places.
> > +       /* Estimate space for into clause for CTAS. */
> > +       if (IS_CTAS(intoclause) && OidIsValid(objectid))
> > +       {
> > +               intoclausestr = nodeToString(intoclause);
> > +               intoclause_len = strlen(intoclausestr);
> > +               shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1);
> > +               shm_toc_estimate_keys(&pcxt->estimator, 1);
> > +       }
>
> Done.
>
> > Can we use  node->nworkers_launched == 0 in place of
> > node->need_to_scan_locally, that way the setting and resetting of
> > node->need_to_scan_locally can be removed. Unless need_to_scan_locally
> > is needed in any of the functions that gets called.
> > +       /* Enable leader to insert in case no parallel workers were launched. */
> > +       if (node->nworkers_launched == 0)
> > +               node->need_to_scan_locally = true;
> > +
> > +       /*
> > +        * By now, for parallel workers (if launched any), would have
> > started their
> > +        * work i.e. insertion to target table. In case the leader is chosen to
> > +        * participate for parallel inserts in CTAS, then finish its
> > share before
> > +        * going to wait for the parallel workers to finish.
> > +        */
> > +       if (node->need_to_scan_locally)
> > +       {
>
> need_to_scan_locally is being set in ExecGather() even if
> nworkers_launched > 0 it can still be true, so I think we can not
> remove need_to_scan_locally in ExecParallelInsertInCTAS.
>
> Attaching v15 patch set for further review. Note that the change is
> only in 0001 patch, other patches remain unchanged from v14.

I have reviewed part of v15-0001 patch, I have a few comments, I will
continue to review this.

1.

@@ -763,18 +763,34 @@ GetCurrentCommandId(bool used)
     /* this is global to a transaction, not subtransaction-local */
     if (used)
     {
-        /*
-         * Forbid setting currentCommandIdUsed in a parallel worker, because
-         * we have no provision for communicating this back to the leader.  We
-         * could relax this restriction when currentCommandIdUsed was already
-         * true at the start of the parallel operation.
-         */
-        Assert(!IsParallelWorker());
+         /*
+          * This is a temporary hack for all common parallel insert cases i.e.
+          * insert into, ctas, copy from. To be changed later. In a parallel
+          * worker, set currentCommandIdUsed to true only if it was not set to
+          * true at the start of the parallel operation (by way of
+          * SetCurrentCommandIdUsedForWorker()). We have to do this because
+          * GetCurrentCommandId(true) may be called from anywhere, especially
+          * for parallel inserts, within parallel worker.
+          */
+        Assert(!(IsParallelWorker() && !currentCommandIdUsed));

Why is this temporary hack? and what is the plan for removing this hack?

2.
+/*
+ * ChooseParallelInsertsInCTAS --- determine whether or not parallel
+ * insertion is possible, if yes set the parallel insert state i.e. push down
+ * the dest receiver to the Gather nodes.
+ */
+void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
+{
+    if (!IS_CTAS(into))
+        return;

When will this hit?  The functtion name suggest that it is from CTAS
but now you have a check that if it is
not for CTAS then return,  can you add the comment that when do you
expect this case?

Also the function name should start in a new line
i.e
void
ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)

3.
+/*
+ * ChooseParallelInsertsInCTAS --- determine whether or not parallel
+ * insertion is possible, if yes set the parallel insert state i.e. push down
+ * the dest receiver to the Gather nodes.
+ */

Push down to the Gather nodes?  I think the right statement will be
push down below the Gather node.


4.
intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
{
     DR_intorel *myState = (DR_intorel *) self;

+    if (myState->is_parallel_worker)
+    {
+        /* In the worker */

+        SetCurrentCommandIdUsedForWorker();
+        myState->output_cid = GetCurrentCommandId(false);
+    }
+    else
     {
        non-parallel worker code
    }
}

I think instead of moving all the code related to non-parallel worker
in the else we can do better.  This
will avoid unnecessary code movement.

4.
intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
{
     DR_intorel *myState = (DR_intorel *) self;

    -- Comment ->in parallel worker we don't need to crease dest recv blah blah
+    if (myState->is_parallel_worker)
    {
        --parallel worker handling--
        return;
    }

    --non-parallel worker code stay right there, instead of moving to else

5.
+/*
+ * ChooseParallelInsertsInCTAS --- determine whether or not parallel
+ * insertion is possible, if yes set the parallel insert state i.e. push down
+ * the dest receiver to the Gather nodes.
+ */
+void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
+{

From function name and comments it appeared that this function will
return boolean saying whether
Parallel insert should be selected or not.  I think name/comment
should be better for this

6.
        /*
+         * For parallelizing inserts in CTAS i.e. making each parallel worker
+         * insert the tuples, we must send information such as into clause (for
+         * each worker to build separate dest receiver), object id (for each
+         * worker to open the created table).

Comment is saying we need to pass object id but the code under this
comment is not doing so.

7.
+        /*
+         * Since there are no rows that are transferred from workers to Gather
+         * node, so we set it to 0 to be visible in estimated row count of
+         * explain plans.
+         */
+        queryDesc->planstate->plan->plan_rows = 0;

This seems a bit hackies Why it is done after the planning,  I mean
plan must know that it is returning a 0 rows?

8.
+        char *intoclause_space = shm_toc_allocate(pcxt->toc,
+                                                  intoclause_len);
+        memcpy(intoclause_space, intoclausestr, intoclause_len);
+        shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space);

One blank line between variable declaration and next code segment,
take care at other places as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: proposal: schema variables
Следующее
От: Erik Rijkers
Дата:
Сообщение: Re: proposal: schema variables