Re: [DESIGN] ParallelAppend

From: Kouhei Kaigai
Subject: Re: [DESIGN] ParallelAppend
Date:
Msg-id: 9A28C8860F777E439AA12E8AEA7694F80111D984@BPXM15GP.gisp.nec.co.jp
In reply to: Re: [DESIGN] ParallelAppend  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List: pgsql-hackers
> Hello, can I ask some questions?
>
> I suppose we can take this as the analog of ParallelSeqScan.  I
> don't see much distinction between Append(ParallelSeqScan) and
> ParallelAppend(SeqScan). What difference is there between them?
>
Append does not start executing the second or subsequent child nodes
until the first node reaches the end of its scan.
ParallelAppend, on the other hand, kicks off all the child nodes
(almost) simultaneously.
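
To illustrate the difference, here is a minimal standalone sketch; it
is not executor code at all (child scans are modelled as plain arrays,
and the "concurrent" progress is faked by interleaving), it only shows
the two consumption orders:

#include <stdio.h>

#define NCHILDREN 3
#define NROWS 2

static const char *rows[NCHILDREN][NROWS] = {
    {"a1", "a2"},
    {"b1", "b2"},
    {"c1", "c2"},
};

/* Append: drain child 0 completely, then child 1, and so on */
static void run_append(void)
{
    for (int c = 0; c < NCHILDREN; c++)
        for (int r = 0; r < NROWS; r++)
            printf("Append         -> %s\n", rows[c][r]);
}

/* ParallelAppend: all children make progress at the same time */
static void run_parallel_append(void)
{
    for (int r = 0; r < NROWS; r++)
        for (int c = 0; c < NCHILDREN; c++)
            printf("ParallelAppend -> %s\n", rows[c][r]);
}

int main(void)
{
    run_append();           /* a1 a2 b1 b2 c1 c2 */
    run_parallel_append();  /* a1 b1 c1 a2 b2 c2 */
    return 0;
}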

> If other nodes will have the same functionality as you mention at
> the end of this proposal, it might be better for some part of
> this feature to be implemented as part of the existing executor
> itself, rather than as a dedicated additional node, just as my
> asynchronous FDW execution patch partially does. (Although it
> lacks the planner part and bgworker launching...) If that is the
> case, it might be better to modify ExecProcNode so that it
> supports both the in-process and inter-bgworker cases through a
> single API.
>
> What do you think about this?
>
Its downside is that we need to adjust all the existing nodes to
follow the new executor's capability. At this moment, we have 38
node types derived from Plan. I think it is not an easy job to
review a patch that changes multiple dozens of files.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> regards,
>
> > Hello,
> >
> > I've recently been working on / investigating the ParallelAppend
> > feature towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > ----------
> > Its concept is simple enough that anybody might have thought of it
> > more than once: the ParallelAppend node kicks off background worker
> > processes to execute its child nodes in parallel / asynchronously.
> > It intends to improve the performance of scanning large partitioned
> > tables from the standpoint of overall throughput; the latency of the
> > first few hundred rows is not in the scope of this project.
> > From the standpoint of technology trends, it primarily tries to
> > utilize the multi-core capability within a system, but it also
> > enables expansion to a distributed database environment using the
> > foreign-table inheritance feature.
> > Its behavior is very similar to the Funnel node except for several
> > points, so we can reuse the infrastructure we have built through the
> > long-standing discussion over the v9.5 development cycle.
> >
> > 2. Problems to be solved
> > -------------------------
> > Typical OLAP workloads involve joins over many tables and scans of
> > large tables, which are often partitioned; their KPI is query
> > response time, while only a very small number of sessions are active
> > simultaneously. So we are required to run a single query as quickly
> > as possible, even if it consumes more computing resources than
> > typical OLTP workloads do.
> >
> > The current heap-scan implementation is painful when we look at its
> > behavior from this standpoint (how many rows we can read within a
> > certain time) because of its synchronous manner.
> > In the worst case, when the SeqScan node tries to fetch the next
> > tuple, heap_getnext() looks up a block in the shared buffers, then
> > ReadBuffer() calls the storage manager to read the target block from
> > the filesystem if it is not already buffered. Next, the operating
> > system puts the calling process to sleep until the required I/O
> > completes.
> > Most cases are resolved at an earlier stage than this worst case;
> > however, the best scenario we can expect is that the next tuple
> > already sits on top of the message queue (with visibility checks
> > already done, of course), with no descent into the buffer manager or
> > deeper.
> > If we can run multiple scans in parallel / asynchronously, the CPU
> > core will be assigned to another process by the operating system;
> > this eventually improves the I/O density and enables higher
> > processing throughput.
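> >
> > As a trivial standalone illustration of that synchronous pattern
> > (nothing PostgreSQL-specific here; each pread() puts the calling
> > process to sleep until the kernel completes the I/O, so only one
> > block is ever in flight):
> >
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > #define BLCKSZ 8192            /* matches PostgreSQL's block size */
> >
> > int main(int argc, char **argv)
> > {
> >     char    buf[BLCKSZ];
> >     off_t   off = 0;
> >     ssize_t n;
> >     int     fd = open(argc > 1 ? argv[1] : "/etc/hosts", O_RDONLY);
> >
> >     if (fd < 0) { perror("open"); return 1; }
> >
> >     /* the caller sleeps inside each pread() until the I/O is done */
> >     while ((n = pread(fd, buf, sizeof(buf), off)) > 0)
> >         off += n;
> >
> >     close(fd);
> >     return 0;
> > }
> >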
> > The Append node is an ideal point to parallelize because:
> > - child nodes can sit in physically different locations by
> >   tablespace, so further tuning is possible according to the
> >   system landscape.
> > - it can control whether each subplan is actually executed on a
> >   background worker, on a per-subplan basis. If the plan contains
> >   both large and small tables, ParallelAppend may kick off
> >   background workers to scan only the large tables, and scan the
> >   small tables by itself.
> > - like the Funnel node, we don't need to care about enhancing
> >   individual node types. SeqScan, IndexScan, ForeignScan and the
> >   others can perform as usual, but actually in parallel.
> >
> >
> > 3. Implementation
> > ------------------
> > * Plan & Cost
> >
> > ParallelAppend shall appear wherever Append can appear, except for
> > its use as a dummy plan. So, I'll enhance set_append_rel_pathlist()
> > to add both an AppendPath and a ParallelAppendPath, with a cost for
> > each. The cost estimation logic will take further discussion;
> > however, I expect the logic below to estimate the cost of
> > ParallelAppend (a worked example follows below):
> >   1. Sum startup_cost and run_cost for each child pathnode, but
> >      distinguish between synchronous and asynchronous ones.
> >      Probably, a pathnode whose total cost is less than
> >       (parallel_setup_cost + its total cost / parallel_append_degree
> >                            + number of rows * cpu_tuple_comm_cost)
> >      makes no sense to run on a background worker.
> >   2. parallel_setup_cost * (# of asynchronous nodes) is added to the
> >      sum of the startup_cost of the asynchronous nodes.
> >   3. The sum of the run_cost of the asynchronous nodes is divided by
> >      parallel_append_degree, then cpu_tuple_comm_cost * (total # of
> >      rows from the asynchronous nodes) is added.
> >   4. The synchronous and asynchronous costs are added together; the
> >      result becomes the cost of ParallelAppend.
> > Obviously, this stands on the viewpoint that cost reflects the
> > response time of the underlying plan. So, the cost of ParallelAppend
> > can be smaller than the sum of its underlying child nodes.
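> >
> > As a worked example of steps 1-4 (standalone C, illustration only:
> > the numbers for the GUCs and the child pathnodes below are made up,
> > not actual defaults or planner estimates):
> >
> > #include <stdio.h>
> >
> > typedef struct
> > {
> >     double startup_cost;
> >     double run_cost;
> >     double nrows;
> >     int    async;   /* run on a background worker? */
> > } ChildCost;
> >
> > static const double parallel_setup_cost    = 1000.0;
> > static const double cpu_tuple_comm_cost    = 0.1;
> > static const int    parallel_append_degree = 4;
> >
> > int main(void)
> > {
> >     ChildCost children[] = {
> >         {0.0, 50000.0, 100000.0, 1},   /* large partition: async */
> >         {0.0, 40000.0,  80000.0, 1},   /* large partition: async */
> >         {0.0,   500.0,   1000.0, 0},   /* small partition: sync  */
> >     };
> >     int    nchildren = sizeof(children) / sizeof(children[0]);
> >     double sync_cost = 0.0;
> >     double async_startup = 0.0, async_run = 0.0, async_rows = 0.0;
> >     int    nasync = 0;
> >
> >     /* step 1: sum costs, separating sync from async children */
> >     for (int i = 0; i < nchildren; i++)
> >     {
> >         if (children[i].async)
> >         {
> >             async_startup += children[i].startup_cost;
> >             async_run     += children[i].run_cost;
> >             async_rows    += children[i].nrows;
> >             nasync++;
> >         }
> >         else
> >             sync_cost += children[i].startup_cost
> >                        + children[i].run_cost;
> >     }
> >
> >     /* step 2: one setup cost per background worker */
> >     double startup = async_startup + parallel_setup_cost * nasync;
> >
> >     /* step 3: async run cost shrinks by the degree; add the cost
> >      * of transferring the rows back to the parent process */
> >     double run = async_run / parallel_append_degree
> >                + cpu_tuple_comm_cost * async_rows;
> >
> >     /* step 4: the synchronous part is added as-is */
> >     printf("ParallelAppend cost: %.1f\n", startup + run + sync_cost);
> >     printf("plain Append cost:   %.1f\n",
> >            async_startup + async_run + sync_cost);
> >     return 0;
> > }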
> >
> > * Execution
> >
> > Like the Funnel node, it kicks off background workers in the
> > ExecProcNode handler; thus, its startup may be later than in
> > Fujita-san's approach if the first call of ParallelAppend comes
> > late, for example, when ParallelAppend sits under a HashJoin whose
> > inner Hash loads billions of rows.
> > Even though I expect ExecParallelAppend to use, at least, simple
> > round-robin scheduling like funnel_getnext(), we may prefer the
> > synchronous nodes over the asynchronous ones just after the
> > background workers start up; see the sketch below.
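> >
> > A standalone sketch of that scheduling idea (illustration only;
> > none of this is executor code, and the child states are faked):
> >
> > #include <stdbool.h>
> > #include <stdio.h>
> >
> > #define NCHILDREN 4
> >
> > typedef struct
> > {
> >     bool async;       /* executed by a background worker? */
> >     bool exhausted;
> > } ChildState;
> >
> > /* pick the next child to pull from; -1 once all are exhausted */
> > static int next_child(ChildState *ch, int *rr_pos)
> > {
> >     /* prefer synchronous children while the workers warm up */
> >     for (int i = 0; i < NCHILDREN; i++)
> >         if (!ch[i].async && !ch[i].exhausted)
> >             return i;
> >
> >     /* then simple round-robin over the asynchronous children */
> >     for (int n = 0; n < NCHILDREN; n++)
> >     {
> >         int i = (*rr_pos + n) % NCHILDREN;
> >
> >         if (ch[i].async && !ch[i].exhausted)
> >         {
> >             *rr_pos = (i + 1) % NCHILDREN;
> >             return i;
> >         }
> >     }
> >     return -1;
> > }
> >
> > int main(void)
> > {
> >     ChildState ch[NCHILDREN] = {
> >         {true, false}, {true, false}, {false, false}, {false, false}
> >     };
> >     int rr_pos = 0, i;
> >
> >     while ((i = next_child(ch, &rr_pos)) >= 0)
> >     {
> >         printf("pull a tuple from child %d\n", i);
> >         ch[i].exhausted = true;   /* pretend one pull drains it */
> >     }
> >     return 0;
> > }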
> >
> > 4. Further challenges
> > ----------------------
> > * Serialization of CustomScan via outfuncs.c/readfuncs.c
> >   Because the methods field is, basically, a set of per-process
> >   pointers, we need an infrastructure to reproduce the same table on
> >   the background worker process, identified by its name.
> >   (I am also trying to design it; a sketch of the idea follows
> >   below.)
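> >
> >   A sketch of the name-based lookup this implies (the registration
> >   API here is hypothetical, not an existing PostgreSQL one, and
> >   "GpuScan" is just an example name):
> >
> > #include <stdio.h>
> > #include <string.h>
> >
> > /* stand-in for the real callback table; illustration only */
> > typedef struct CustomScanMethodsStub
> > {
> >     const char *name;
> >     void      (*begin)(void);
> > } CustomScanMethodsStub;
> >
> > #define MAX_METHODS 16
> > static const CustomScanMethodsStub *registry[MAX_METHODS];
> > static int nregistered = 0;
> >
> > /* each extension registers its table once per process (_PG_init) */
> > static void register_methods(const CustomScanMethodsStub *methods)
> > {
> >     registry[nregistered++] = methods;
> > }
> >
> > /* a worker resolves the serialized name back to local pointers */
> > static const CustomScanMethodsStub *lookup_methods(const char *name)
> > {
> >     for (int i = 0; i < nregistered; i++)
> >         if (strcmp(registry[i]->name, name) == 0)
> >             return registry[i];
> >     return NULL;
> > }
> >
> > static void gpuscan_begin(void) { puts("GpuScan begin"); }
> >
> > int main(void)
> > {
> >     static const CustomScanMethodsStub gpuscan =
> >         {"GpuScan", gpuscan_begin};
> >     const CustomScanMethodsStub *m;
> >
> >     register_methods(&gpuscan);
> >
> >     /* ...later, in the worker, after deserializing the plan... */
> >     m = lookup_methods("GpuScan");
> >     if (m)
> >         m->begin();
> >     return 0;
> > }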
> >
> > * Duplication of parallelism
> >   If Funnel+PartialSeqScan is located under ParallelAppend, directly
> >   or indirectly, it eventually leads a background worker process to
> >   launch further background workers. Is that an expected usage of
> >   the current background workers?
> >
> > * Join pushdown
> >   Distributing nested-loop and hash joins may gain an advantage
> >   from parallel processing, and from a reduction of the hash size if
> >   the CHECK() constraints of the individual partitioned tables tell
> >   us which rows obviously cannot be joined.
> >   Also see the thread:
> >     [idea] table partition + hash join: http://bit.ly/1S2xpHT
> >   My colleague has already started to investigate / develop this
> >   feature based on the existing Append, to reduce num_batches.
> >
> >   As an aside, my GpuJoin feature works most effectively if the
> >   entire inner relations can be loaded into the hash table in GPU
> >   RAM, so such features are very welcome.
> >
> > * Sort break-down
> >   If a merge join tried to have a ParallelAppend node on its left
> >   or right input, we may be able to compare its cost with
> >   MergeParallelAppend + Sort on the partial relations.
> >
> > * Aggregate Push Down
> >   It is exactly what I want to do.
> >
> > Thanks,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


