[DESIGN] ParallelAppend

From: Kouhei Kaigai
Subject: [DESIGN] ParallelAppend
Msg-id: 9A28C8860F777E439AA12E8AEA7694F80111CF22@BPXM15GP.gisp.nec.co.jp
Replies: Re: [DESIGN] ParallelAppend  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
         Re: [DESIGN] ParallelAppend  (Amit Kapila <amit.kapila16@gmail.com>)
         Re: [DESIGN] ParallelAppend  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hello,

I've recently been working on / investigating the ParallelAppend
feature for the next commit fest. Below is my design proposal.

1. Concept
----------
The concept is simple enough that anybody may have considered it more
than once: a ParallelAppend node kicks off background worker processes
to execute its child nodes in parallel / asynchronously.
It intends to improve the performance of scanning large partitioned
tables from the standpoint of overall throughput; the latency of the
first few hundred rows is out of scope for this project.
From the standpoint of technology trends, it primarily tries to
utilize the multi-core capability within a system, but it also makes
it possible to scale out to a distributed database environment using
the foreign-table inheritance feature.
Its behavior is very similar to the Funnel node except for several
points, so we can reuse the infrastructure we have discussed at
length through the v9.5 development cycle.

2. Problems to be solved
-------------------------
Typical OLAP workloads involve joins across many tables and scans on
large tables, which are often partitioned; their KPI is query
response time, while only a very small number of sessions are active
simultaneously. So we need to run a single query as fast as possible,
even if it consumes more computing resources than typical OLTP
workloads.

The current implementation of heap scans is painful when we look at
its behavior from this standpoint - how many rows we can read within
a certain time - because of its synchronous manner.
In the worst case, when a SeqScan node tries to fetch the next tuple,
heap_getnext() looks up a block in the shared buffers, then
ReadBuffer() calls the storage manager to read the target block from
the filesystem if it is not already buffered. The operating system
then puts the calling process to sleep until the required I/O
completes.
Most cases are resolved at an earlier stage than this worst case;
still, the best scenario we can expect is that the next tuple already
appears at the top of the message queue (with visibility checks
already done, of course), with no fall-through to the buffer manager
or deeper.
If we can run multiple scans in parallel / asynchronously, the
operating system can assign the CPU core to another process while one
waits for I/O, which eventually improves I/O density and enables
higher processing throughput.
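
To make the blocking pattern concrete, here is a minimal standalone C
sketch (my own simplified model, not PostgreSQL code; a BLCKSZ-sized
pread() stands in for the whole heap_getnext() / ReadBuffer() /
storage manager path): while pread() sleeps in the kernel, the
scanning process does no work for the query, and I/O requests never
overlap.

    #include <unistd.h>
    #include <fcntl.h>

    #define BLCKSZ 8192

    /* Simplified model of the synchronous path: ReadBuffer() falls
     * through to the storage manager, which issues a blocking read.
     * The caller sleeps here until the i/o completes. */
    static ssize_t
    read_block(int fd, long blkno, char *buffer)
    {
        return pread(fd, buffer, BLCKSZ, blkno * BLCKSZ);
    }

    int
    main(int argc, char **argv)
    {
        char    buffer[BLCKSZ];
        long    blkno;
        int     fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;
        /* one block at a time: at most a single read is in flight,
         * so the achieved i/o density stays low */
        for (blkno = 0; read_block(fd, blkno, buffer) == BLCKSZ; blkno++)
        {
            /* ... visibility checks and tuple processing here ... */
        }
        close(fd);
        return 0;
    }
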
The Append node is an ideal point to parallelize, because:
- child nodes can be placed in physically different locations via
  tablespaces, so further tuning is possible according to the system
  landscape.
- it can control whether each subplan is actually executed on a
  background worker, on a per-subplan basis. If the subplans contain
  both large and small tables, ParallelAppend may kick off background
  workers to scan the large tables only, and scan the small tables by
  itself.
- like the Funnel node, we don't need to care about enhancing
  individual node types. SeqScan, IndexScan, ForeignScan and others
  can perform as usual, but actually in parallel.


3. Implementation
------------------
* Plan & Cost

ParallelAppend shall appear anywhere Append can appear, except for
its usage as a dummy. So I'll enhance set_append_rel_pathlist() to
add both an AppendPath and a ParallelAppendPath, with a cost for each.
The cost estimation logic needs further discussion, but I expect
something like the logic below to estimate the cost of ParallelAppend:
1. Sum startup_cost and run_cost for each child path node, but
   distinguish synchronous nodes from asynchronous ones. Probably,
   a path node whose total cost is less than:
     (parallel_setup_cost + its total cost / parallel_append_degree
      + number of rows * cpu_tuple_comm_cost)
   is nonsense to run on a background worker.
2. parallel_setup_cost * (# of asynchronous nodes) is added to the
   sum of the startup_cost of the asynchronous nodes.
3. The sum of the run_cost of the asynchronous nodes is divided by
   parallel_append_degree, then cpu_tuple_comm_cost * (total # of
   rows from asynchronous nodes) is added.
4. The synchronous and asynchronous costs are added together; the
   result becomes the cost of ParallelAppend.
Obviously, this stands on the viewpoint that cost reflects the
response time of the underlying plan. So the cost of ParallelAppend
can be smaller than the sum of its underlying child nodes.
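
As a concrete illustration of steps 1-4, here is a standalone C
sketch. The ChildPath struct and both functions are hypothetical
scaffolding of mine; parallel_setup_cost, parallel_append_degree and
cpu_tuple_comm_cost are the parameter names from the proposal above,
with made-up default values, and the actual patch may organize this
quite differently.

    /* hypothetical sketch of the cost logic in steps 1-4 above */
    typedef struct ChildPath
    {
        double  startup_cost;
        double  run_cost;
        double  rows;
    } ChildPath;

    static double parallel_setup_cost = 1000.0;   /* assumed GUC */
    static double cpu_tuple_comm_cost = 0.1;      /* assumed GUC */
    static int    parallel_append_degree = 4;     /* assumed GUC */

    /* step 1: a child is worth running asynchronously only if the
     * parallel cost estimate beats its plain synchronous cost */
    static int
    run_asynchronously(const ChildPath *c)
    {
        double  sync_cost = c->startup_cost + c->run_cost;
        double  async_cost = parallel_setup_cost
            + sync_cost / parallel_append_degree
            + c->rows * cpu_tuple_comm_cost;

        return async_cost < sync_cost;
    }

    static double
    estimate_parallel_append_cost(const ChildPath *children, int n)
    {
        double  sync_cost = 0.0;
        double  async_startup = 0.0;
        double  async_run = 0.0;
        double  async_rows = 0.0;
        int     i;

        for (i = 0; i < n; i++)
        {
            if (run_asynchronously(&children[i]))
            {
                /* step 2: one worker setup per asynchronous node */
                async_startup += parallel_setup_cost
                    + children[i].startup_cost;
                async_run += children[i].run_cost;
                async_rows += children[i].rows;
            }
            else
                sync_cost += children[i].startup_cost
                    + children[i].run_cost;
        }
        /* step 3: asynchronous nodes run concurrently, but each row
         * pays the transfer cost over the message queue */
        async_run = async_run / parallel_append_degree
            + async_rows * cpu_tuple_comm_cost;

        /* step 4: total cost of the ParallelAppend node */
        return sync_cost + async_startup + async_run;
    }
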

* Execution

Like the Funnel node, it kicks off background workers in the
ExecProcNode handler, so its startup may come later than in
Fujita-san's approach if the first call of ParallelAppend is late -
for example, when ParallelAppend is located under a HashJoin whose
inner Hash loads billions of rows.
Although I expect ExecParallelAppend to use at least simple round-
robin scheduling like funnel_getnext(), we may want to favor
synchronous nodes over asynchronous ones just after the background
workers start up.
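
For illustration, a standalone C sketch of the round-robin part (my
own simplification, not funnel_getnext() itself; TupleSlot,
SubPlanState and pull_tuple are hypothetical stand-ins): the executor
cycles over the subplans, skipping exhausted ones, and resumes after
the subplan that supplied the previous tuple.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct TupleSlot TupleSlot;   /* hypothetical */

    typedef struct SubPlanState
    {
        bool    done;                     /* subplan exhausted? */
        /* pull one tuple; NULL means nothing available right now,
         * and done is set once the subplan is exhausted */
        TupleSlot *(*pull_tuple)(struct SubPlanState *sps);
    } SubPlanState;

    /* simple round-robin over subplans, in the spirit of
     * funnel_getnext(): each call resumes after the subplan that
     * supplied the previous tuple */
    static TupleSlot *
    parallel_append_getnext(SubPlanState **subplans, int nplans,
                            int *current)
    {
        int     checked;

        for (checked = 0; checked < nplans; checked++)
        {
            SubPlanState *sps = subplans[*current];
            TupleSlot    *slot;

            *current = (*current + 1) % nplans;
            if (sps->done)
                continue;
            slot = sps->pull_tuple(sps);
            if (slot != NULL)
                return slot;
        }
        /* no subplan had a tuple ready on this pass */
        return NULL;
    }
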

4. Further challenges
----------------------
* Serialization of CustomScan via outfuncs.c/readfuncs.c
Because the methods field is basically a set of per-process pointers,
we need an infrastructure to reproduce the same method table on the
background worker process, identified by name. (I'll also try to
design this.)
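
A minimal sketch of the kind of infrastructure this implies (all
names here are hypothetical; the real design is still open): each
extension registers its method table under a unique name in every
process at load time, outfuncs.c serializes only the name, and
readfuncs.c in the worker resolves the name back to that process's
own pointer set.

    #include <string.h>
    #include <stddef.h>

    typedef struct CustomScanMethods  /* stand-in for the real struct */
    {
        const char *name;
        /* ... per-process callback pointers ... */
    } CustomScanMethods;

    #define MAX_METHODS 64

    static const CustomScanMethods *registry[MAX_METHODS];
    static int  nregistered = 0;

    /* called from each extension's init hook, in every process, so
     * the same name maps to that process's own pointers */
    void
    register_custom_scan_methods(const CustomScanMethods *methods)
    {
        if (nregistered < MAX_METHODS)
            registry[nregistered++] = methods;
    }

    /* readfuncs.c side: turn the serialized name back into the
     * method table of the current (worker) process */
    const CustomScanMethods *
    lookup_custom_scan_methods(const char *name)
    {
        int     i;

        for (i = 0; i < nregistered; i++)
        {
            if (strcmp(registry[i]->name, name) == 0)
                return registry[i];
        }
        return NULL;    /* extension not loaded in this process */
    }
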

* Duplication of parallelism
If Funnel+PartialSeqScan is located under ParallelAppend, directly or
indirectly, it eventually leads a background worker process to launch
further background workers. Is that an expected usage of the current
background workers?

* Join pushdown
Distributing nested-loop and hash joins may benefit from parallel
processing, and from a smaller hash size if the CHECK() constraints
of the individual partitioned tables tell us which rows obviously
cannot be joined. Also see the thread:
  [idea] table partition + hash join: http://bit.ly/1S2xpHT
My colleague has already started to investigate / develop this
feature based on the existing Append, to reduce num_batches.
As an aside, my GpuJoin feature works most effectively if the entire
inner relations can be loaded into a hash table in GPU RAM, so such
features are very welcome.

* Sort break-down
If a merge join tries to take a ParallelAppend node as its left or
right input, we may be able to compare its cost with
MergeParallelAppend + Sort on the partial relations.

* Aggregate push-down
This is exactly what I want to do.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



