Re: [DESIGN] ParallelAppend

From: Kouhei Kaigai
Subject: Re: [DESIGN] ParallelAppend
Date:
Msg-id: 9A28C8860F777E439AA12E8AEA7694F80111F3A5@BPXM15GP.gisp.nec.co.jp
In response to: Re: [DESIGN] ParallelAppend  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: [DESIGN] ParallelAppend  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
> On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >
> > > -----Original Message-----
> > > From: pgsql-hackers-owner@postgresql.org
> > > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> > > Sent: Monday, July 27, 2015 11:07 PM
> > > To: Amit Kapila
> > > >
> > > > Is there a real need to have new node like ParallelAppendPath?
> > > > Can't we have Funnel node beneath AppendNode and then each
> > > > worker will be responsible to have SeqScan on each inherited child
> > > > relation.  Something like
> > > >
> > > > Append
> > > >    ---> Funnel
> > > >       --> SeqScan rel1
> > > >       --> SeqScan rel2
> > > >
> > > If Funnel can handle both horizontal and vertical parallelism,
> > > it is a great simplification. I don't insist on a new node.
> > >
> > > Once Funnel get a capability to have multiple child nodes, probably,
> > > Append node above will have gone. I expect set_append_rel_pathlist()
> > > add two paths based on Append and Funnel, then planner will choose
> > > the cheaper one according to its cost.
> > >
> > In the latest v16 patch, Funnel is declared as follows:
> >
> >   typedef struct Funnel
> >   {
> >       Scan        scan;
> >       int         num_workers;
> >   } Funnel;
> >
> > If we try to add Append capability here, I expect the structure will
> > be adjusted as follows, for example:
> >
> >   typedef struct Funnel
> >   {
> >       Scan        scan;
> >       List       *funnel_plans;
> >       List       *funnel_num_workers;
> >   } Funnel;
> >
> > As literal, funnel_plans saves underlying Plan nodes instead of the
> > lefttree. Also, funnel_num_workers saves number of expected workers
> > to be assigned on individual child plans.
> >
> 
> or shall we have a node like above and name it as FunnelAppend or
> AppendFunnel?
>
It is better to have a smaller number of node types capable of kicking
background workers, because that simplifies path construction.

Let's assume the case below. When the planner considers a path to append
child scans on rel1, rel2 and rel3, but the cheapest path for rel2 is
Funnel+PartialSeqScan, we cannot put a Funnel here unless we pull up
the Funnel of rel2, can we?
  (Append? or Funnel)
   --> SeqScan on rel1
   --> Funnel
        --> PartialSeqScan on rel2
   --> IndexScan on rel3

If we pull up the Funnel here, I think the plan shall be as follows:

  Funnel
   --> SeqScan on rel1
   --> PartialSeqScan on rel2
   --> IndexScan on rel3

If the Funnel node is all we have to pay attention to, it makes the code
around path construction and pull-up logic much simpler than having
multiple node types that can kick background workers.
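
For illustration, a minimal sketch of that pull-up is below. FunnelPath
and its subpaths field are my assumptions for this sketch, not names
taken from the v16 patch; child Funnels are absorbed into the parent's
child list instead of being nested.

  #include "postgres.h"
  #include "nodes/relation.h"     /* Path; FunnelPath here is hypothetical */

  static List *
  pull_up_child_funnels(List *child_paths)
  {
      List       *flattened = NIL;
      ListCell   *lc;

      foreach(lc, child_paths)
      {
          Path   *subpath = (Path *) lfirst(lc);

          if (IsA(subpath, FunnelPath))
          {
              FunnelPath *fpath = (FunnelPath *) subpath;

              /* absorb the child Funnel's underlying path(s) directly */
              flattened = list_concat(flattened,
                                      list_copy(fpath->subpaths));
          }
          else
              flattened = lappend(flattened, subpath);
      }
      return flattened;   /* becomes the parent Funnel's child list */
  }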

> > Even though create_parallelscan_paths() in v16 set num_workers not
> > larger than parallel_seqscan_degree, total number of the concurrent
> > background workers may exceed this configuration if more than two
> > PartialSeqScan nodes are underlying.
> > It is a different configuration from max_worker_processes, so it is
> > not a matter as long as we have another restriction.
> > However, how do we control the cap of number of worker processes per
> > "appendable" Funnel node? For example, if a parent table has 200
> > child tables but max_worker_processes are configured to 50.
> > It is obviously impossible to launch all the background workers
> > simultaneously. One idea I have is to suspend launch of some plans
> > until earlier ones are completed.
> >
> 
> Okay, but I think in that idea you need to re-launch the workers again for
> new set of relation scan's which could turn out to be costly, how about
> designing some way where workers after completing their assigned work
> check for new set of task/'s (which in this case would be to scan a new) and
> then execute the same.  I think in this way we can achieve dynamic allocation
> of work and achieve maximum parallelism with available set of workers.
> We have achieved this in ParallelSeqScan by scanning at block level, once
> a worker finishes a block, it checks for new block to scan.
>
It is possible to put multiple PlannedStmts on the TOC, isn't it?
If each background worker picks up an uncompleted PlannedStmt first
(likely based on round-robin?), it may achieve the maximum
parallelism. Yep, it seems to me a good idea which I want to try.
If (number of workers) > (number of sub-plans), some sub-plans can
have multiple workers from the beginning, and then other workers can
also help to execute the heavy plans later.
It may be better to put the PlannedStmts in order of total_cost, to
bias execution of the heavy plans towards multiple workers from the
beginning.
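
Just as a sketch of that idea (the structure and function names below
are my assumptions, not from any existing patch): the workers could
claim sub-plans through an atomic round-robin cursor kept in dynamic
shared memory next to the serialized PlannedStmts.

  #include "postgres.h"
  #include "port/atomics.h"

  /* hypothetical shared state placed alongside the PlannedStmts on TOC */
  typedef struct ParallelAppendShared
  {
      pg_atomic_uint32    next_plan;      /* round-robin cursor */
      uint32              num_plans;      /* number of serialized sub-plans */
  } ParallelAppendShared;

  /* each worker claims the next sub-plan slot; the wrap-around lets
   * late workers pile onto sub-plans that are still running */
  static int
  choose_next_subplan(ParallelAppendShared *shared)
  {
      uint32  seq = pg_atomic_fetch_add_u32(&shared->next_plan, 1);

      return (int) (seq % shared->num_plans);
  }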

TODO: Even if a heavy query occupies most of the available worker slots,
another session may want to use parallel execution later, while the
primary query is still running. We may need a 'scoreboard' in shared
memory to know how many workers are potentially needed and how many
are overused by somebody. If someone has overconsumed background
workers, those workers should exit first, rather than picking up the
next PlannedStmt.
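
A rough shape of such a scoreboard could be as below; the structure and
field names are only my assumption of what it might record.

  #include "postgres.h"
  #include "storage/spin.h"

  /* hypothetical scoreboard in shared memory: records demand versus
   * consumption, so an over-consuming query knows it should yield */
  typedef struct WorkerScoreboard
  {
      slock_t     mutex;              /* protects the counters below */
      uint32      workers_wanted;     /* workers requested by all sessions */
      uint32      workers_running;    /* workers actually launched */
  } WorkerScoreboard;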

> > > We will need to pay attention another issues we will look at when Funnel
> > > kicks background worker towards asymmetric relations.
> > >
> > > If number of rows of individual child nodes are various, we may
> > > want to assign 10 background workers to scan rel1 with PartialSeqScan.
> > > On the other hands, rel2 may have very small number of rows thus
> > > its total_cost may be smaller than cost to launch a worker.
> > > In this case, Funnel has child nodes to be executed asynchronously and
> > > synchronously.
> > >
> 
> I think this might turn out to be slightly tricky, for example how do we know
> for what size of relation, how many workers are sufficient?
>
I expected a comparison between the total_cost of the sub-plan and a
threshold that represents the cost of kicking a background worker.
However, I'm now inclined towards the above approach (multiple
PlannedStmts on the TOC, picked up by background workers in
round-robin).
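
In code, the check I expected is as simple as the sketch below
(worker_startup_cost is a hypothetical threshold, not an existing GUC):

  /* hypothetical check: give the sub-plan an extra worker only when
   * its cost is large enough to amortize the worker startup overhead */
  if (subpath->total_cost > worker_startup_cost)
      num_workers++;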

> Another way to look at dividing the work in this case could be in terms of
> chunk-of-blocks, once a worker finishes it current set of block/'s, it should
> be
> able to get new set of block's to scan.  So let us assume if we decide
> chunk-size as 32 and total number of blocks in whole inheritance hierarchy
> are 3200, then the max workers we should allocate to this scan are 100 and
> if we have parallel_seqscan degree lesser than that then we can use those
> many workers and then let them scan 32-blocks-at-a-time.
>
If we use the above multi-PlannedStmt approach, the TOC also needs a
counter to track how many background workers are running on a
particular PlannedStmt; then, if enough workers are already running on
a PlannedStmt, should the next available worker skip it (even under
round-robin) or just exit?
Anyway, I think some infrastructure may be needed to avoid overly
aggressive parallel execution.
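
Continuing the earlier sketch (again, hypothetical names), such a
counter might look like this; a saturated or completed sub-plan is
simply skipped, and a worker with nothing left to claim exits. The
check-then-increment below is not race-free, but it is enough to show
the idea.

  #include "postgres.h"
  #include "port/atomics.h"

  /* hypothetical per-plan slot kept next to each serialized PlannedStmt */
  typedef struct SubPlanSlot
  {
      pg_atomic_uint32    nworkers;       /* workers currently attached */
      uint32              max_workers;    /* planner-chosen cap */
      bool                completed;      /* no more tuples to return */
  } SubPlanSlot;

  /* round-robin scan over the slots; returns -1 when nothing is left */
  static int
  pick_subplan_with_cap(SubPlanSlot *slots, int num_plans, int start)
  {
      int     i;

      for (i = 0; i < num_plans; i++)
      {
          int             idx = (start + i) % num_plans;
          SubPlanSlot    *slot = &slots[idx];

          if (slot->completed)
              continue;
          if (pg_atomic_read_u32(&slot->nworkers) >= slot->max_workers)
              continue;       /* already saturated, try the next one */

          pg_atomic_fetch_add_u32(&slot->nworkers, 1);
          return idx;
      }
      return -1;              /* nothing to do: this worker may exit */
  }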

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

