Re: asynchronous and vectorized execution
From: Kyotaro HORIGUCHI
Subject: Re: asynchronous and vectorized execution
Date:
Msg-id: 20160706.162914.131241734.horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Re: asynchronous and vectorized execution (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: asynchronous and vectorized execution (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hello,

At Tue, 5 Jul 2016 11:39:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobnQ6ZpsubttBYC=pSLQ6d=0GuSgBsUFoaARMrie_75BA@mail.gmail.com>:

> On Wed, Jun 29, 2016 at 11:00 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > We may also want to consider handling abstract events such as
> > "tuples-are-available-at-plan-node-X".
> >
> > One benefit is: we can combine this with batch processing. E.g., in the case
> > of an Append node containing foreign scans, its parent node may not want to
> > process the Append node's result until Append is ready with at least 1000
> > rows. In that case, the Append node needs to raise an event
> > "n-tuples-are-ready"; we cannot just rely on fd-ready events. An fd-ready
> > event will wake up the foreign scan, but it may not eventually cause its
> > parent Append node to in turn wake up its parent.
>
> Right, I agree. I think this case only arises in parallel query. In
> serial execution, there's not really any way for a plan node to just
> become ready other than an FD or latch event or the parent becoming
> ready. But in parallel query it can happen, of course, because some
> other backend can do work that makes that node ready to produce
> tuples.
>
> It's not necessarily the case that we have to deal with this in the
> initial patches for this feature, because the most obvious win for
> this sort of thing is when we have an Append of ForeignScan plans.
> Sure, parallel query has interesting cases, too, but a prototype that
> just handles Append over a bunch of postgres_fdw ForeignScans would be
> pretty cool. I suggest that we make that the initial goal here.

This seems to be a good opportunity to show this patch. The attached patch set does async execution of ForeignScan (postgres_fdw) on Robert's first infrastructure, with some modifications.

ExecAsyncWaitForNode can get into an infinite wait through recursive calls to ExecAsyncWaitForNode, caused by ExecProcNode being called from async-unaware nodes.
Such recursive calls cause a wait on already-ready nodes. I solved that in the patch set by allocating a separate async-execution context for every async-execution subtree made by ExecProcNode, instead of one async-exec context for the whole execution tree. This works fine, but the way the contexts are switched seems ugly. It might also be solved by making ExecAsyncWaitForNode return when there is no node to wait for, even if the waiting node is not ready. That might keep the async-exec context (state) simpler, so I'll try it.

> > How we can do this event abstraction is the other question. We can have one
> > latch for each of the events, and each node would raise its own event by
> > setting the corresponding latch. But I am not sure about latches within a
> > single process as against one process waking up another process. Or else,
> > some internal event structures need to be present (in estate?), which
> > ExecProcNode would then use when it does the event looping to wake up (i.e.
> > execute) the required nodes.
>
> I think adding more latches would be a bad idea. What I think we
> should do instead is add two additional data structures to dynamic
> shared memory:
>
> 1. An array of PGPROC * pointers for all of the workers. Processes
> add their PGPROC * to this array as they start up. Then, parallel.h
> can expose a new API ParallelWorkerSetLatchesForGroup(void). In the
> leader, this sets the latch for every worker process for every
> parallel context with which the leader is associated; in a worker, it
> sets the latch for the other processes in the parallel group, and the
> leader also.
>
> 2. An array of executor nodes where one process might do something
> that allows other processes to make progress on that node. This would
> be set up somehow by execParallel.c, which would need to somehow
> figure out which plan nodes want to be included in the list.
> When an
> executor node does something that might unblock other workers, it
> calls ParallelWorkerSetLatchesForGroup() and the async stuff then
> tries calling all of the nodes in this array again to see if any of
> them now think that they've got tuples to return (or just to let them
> do additional work without returning tuples).

Does ParallelWorkerSetLatchesForGroup use a mutex or semaphore or something of the kind instead of latches?

> > Also, the WaitEvent.user_data field can have some more info besides the
> > plan state. It can have its parent PlanState stored, so that we don't have
> > to have a parent field in the plan state. It can also have some more data
> > such as "n-tuples-available".
>
> I don't think that works, because execution may need to flow
> arbitrarily far up the tree. Just knowing the immediate parent isn't
> good enough. If it generates a tuple, then you have to in turn call
> its parent, and if that one then produces a tuple, you have to continue
> on even further up the tree. I think it's going to be very awkward to
> make this work without those parent pointers.

Basically agreed, but going up too far is bad, for the reason above.

> BTW, we also need to benchmark those changes to add the parent
> pointers and change the return conventions and see if they have any
> measurable impact on performance.

I have to bring you bad news. With the attached patch, an Append over four foreign scans performs about 10% faster when they are on one (local) foreign server, and about twice as fast when three or four foreign scans are on separate foreign servers (connections) respectively, but only when compiled with -O0. The patched version gains hopelessly little from compiler optimization, while the unpatched version gets faster. Anyway, the current state of this patch is attached.
For binaries compiled with both -O0 and -O2, I ran a simple query "select sum(a) from <table>" on tables generated by the attached script: t0, pl, pf0 and pf1, which are a local table, an append on local tables, an append on foreign tables on the same foreign server, and an append on foreign tables on different foreign servers, respectively. The numbers are the mean values of ten runs.

                       average(ms)   stddev
  patched-O0      t0      891.3934   18.74902154
                  pl      416.3298   47.38902802
                  pf0   13523.0777   87.45769903
                  pf1    3376.6415  183.3578028
  patched-O2      t0      891.4309    5.245807775
                  pl      408.2932    1.04260004
                  pf0   13640.3551   52.52211814
                  pf1    3470.1739  262.3522963
  not-patched-O0  t0      845.3927   18.98379876
                  pl      363.4933    4.142091341
                  pf0   14986.1284   23.07288416
                  pf1   14961.0596  127.2587286
  not-patched-O2  t0      429.8462   31.51970532
                  pl      176.3959    0.237832551
                  pf0    8129.3762   44.68774182
                  pf1    8211.6319   97.93206675

By the way, when running the attached testrun.sh, the first one or two runs of pf0 are faster than the rest by about 30%-50%, for some reason unknown to me...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
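[Editorial addendum, not part of the original mail: the mean/stddev columns in the table above can be recomputed from raw timings with a small awk pipeline. The psql loop is only sketched in comments, since the attached testrun.sh is the authoritative script and is not reproduced here; the three example timings are placeholders.]

```shell
# Hypothetical shape of the measurement; the real script is the attached
# testrun.sh. Three example timings stand in for ten "\timing" results.
timings="891.3 872.1 901.4"
# for i in $(seq 10); do
#   psql -c '\timing on' -c 'select sum(a) from t0' | awk '/^Time:/{print $2}'
# done
echo "$timings" | tr ' ' '\n' | awk '
  { sum += $1; sq += $1 * $1; n++ }
  END {
    mean = sum / n
    printf "mean=%.4f stddev=%.4f\n", mean, sqrt(sq / n - mean * mean)
  }'
```

Note that this computes the population standard deviation; the table's stddev column may instead use the sample (n-1) formula.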