Re: [DESIGN] ParallelAppend

Поиск
Список
Период
Сортировка
От Kouhei Kaigai
Тема Re: [DESIGN] ParallelAppend
Дата
Msg-id 9A28C8860F777E439AA12E8AEA7694F80111DA4F@BPXM15GP.gisp.nec.co.jp
обсуждение исходный текст
Ответ на [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Ответы Re: [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Re: [DESIGN] ParallelAppend  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Список pgsql-hackers
> On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >
> > Hello,
> >
> > I'm recently working/investigating on ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > ----------
> > Its concept is quite simple anybody might consider more than once.
> > ParallelAppend node kicks background worker process to execute
> > child nodes in parallel / asynchronous.
> > It intends to improve the performance to scan a large partitioned
> > tables from standpoint of entire throughput, however, latency of
> > the first multi-hundred rows are not scope of this project.
> > From standpoint of technology trend, it primarily tries to utilize
> > multi-cores capability within a system, but also enables to expand
> > distributed database environment using foreign-tables inheritance
> > features.
> > Its behavior is very similar to Funnel node except for several
> > points, thus, we can reuse its infrastructure we have had long-
> > standing discussion through the v9.5 development cycle.
> >
> > 2. Problems to be solved
> > -------------------------
> > Typical OLAP workloads takes tons of tables join and scan on large
> > tables which are often partitioned, and its KPI is query response
> > time but very small number of sessions are active simultaneously.
> > So, we are required to run a single query as rapid as possible even
> > if it consumes larger computing resources than typical OLTP workloads.
> >
> > Current implementation to scan heap is painful when we look at its
> > behavior from the standpoint - how many rows we can read within a
> > certain time, because of synchronous manner.
> > In the worst case, when SeqScan node tries to fetch the next tuple,
> > heap_getnext() looks up a block on shared buffer, then ReadBuffer()
> > calls storage manager to read the target block from the filesystem
> > if not on the buffer. Next, operating system makes the caller
> > process slept until required i/o get completed.
> > Most of the cases are helped in earlier stage than the above worst
> > case, however, the best scenario we can expect is: the next tuple
> > already appear on top of the message queue (of course visibility
> > checks are already done also) with no fall down to buffer manager
> > or deeper.
> > If we can run multiple scans in parallel / asynchronous, CPU core
> > shall be assigned to another process by operating system, thus,
> > it eventually improves the i/o density and enables higher processing
> > throughput.
> > Append node is an ideal point to be parallelized because
> > - child nodes can have physically different location by tablespace,
> >   so further tuning is possible according to the system landscape.
> > - it can control whether subplan is actually executed on background
> >   worker, per subplan basis. If subplan contains large tables and
> >   small tables, ParallelAppend may kick background worker to scan
> >   large tables only, but scan on small tables are by itself.
> > - Like as Funnel node, we don't need to care about enhancement of
> >   individual node types. SeqScan, IndexScan, ForeignScan or others
> >   can perform as usual, but actually in parallel.
> >
> >
> > 3. Implementation
> > ------------------
> > * Plan & Cost
> >
> > ParallelAppend shall appear where Appen can appear except for the
> > usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
> > both of AppendPath and ParallelAppendPath with cost for each.
> >
> 
> Is there a real need to have new node like ParallelAppendPath?
> Can't we have Funnel node beneath AppendNode and then each
> worker will be responsible to have SeqScan on each inherited child
> relation.  Something like
> 
> Append
>    ---> Funnel
>       --> SeqScan rel1
>       --> SeqScan rel2
>
If Funnel can handle both of horizontal and vertical parallelism,
it is a great simplification. I never stick a new node.

Once Funnel get a capability to have multiple child nodes, probably,
Append node above will have gone. I expect set_append_rel_pathlist()
add two paths based on Append and Funnel, then planner will choose
the cheaper one according to its cost.

We will need to pay attention another issues we will look at when Funnel
kicks background worker towards asymmetric relations.

If number of rows of individual child nodes are various, we may
want to assign 10 background workers to scan rel1 with PartialSeqScan.
On the other hands, rel2 may have very small number of rows thus
its total_cost may be smaller than cost to launch a worker.
In this case, Funnel has child nodes to be executed asynchronously and
synchronously.

If cheapest path of the child relation is a pair of Funnel and
PartialSeqScan, we have to avoid to stack Funnel node. Probably,
Funnel node that performs like Append needs to pull up underlying
Funnel and assign equivalen number of workers as follows.
 Append  --> Funnel       --> PartialSeqScan on rel1 (num_workers = 4)  --> Funnel       --> PartialSeqScan on rel2
(num_workers= 8)  --> SeqScan on rel3
 
shall be rewritten to Funnel   --> PartialSeqScan on rel1 (num_workers = 4)   --> PartialSeqScan on rel2 (num_workers =
8)  --> SeqScan on rel3        (num_workers = 1)
 

We also need to consider whether Funnel will have capability
equivalent to MergeAppend, even though parallel sorting is
a fantastic challenge.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Buildfarm TAP testing is useless as currently implemented
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: Failing assertions in indxpath.c, placeholder.c and brin_minmax.c