Discussion: Parallel Sequence Scan doubts
Corrected subject.

Hi Hackers,

Implementation of "Parallel Sequence Scan"

Approach:

1. "Parallel Sequence Scan" can be achieved by using background
workers to do the job of the actual sequence scan, including the
qualification check.

2. The planner generates the parallel scan plan by checking the
possible criteria of the table for a parallel scan, and generates the
tasks (ranges of blocks).

3. In the executor Init phase, copy the necessary data required by
the workers and start the workers.

4. In the executor run phase, just get the tuples which are sent by
the workers and process them further in the plan node execution.

Some of the problems I am thinking about:

1. The data structures that are required to be copied from backend to
worker are CurrentTransactionState, Snapshot, GUC, ComboCID, EState,
etc. I see some problems in copying the "EState" data structure into
shared memory because it contains so many pointers. There is a need
for some infrastructure to copy these data structures into shared
memory.

If the backend takes care of locking the relation, and there are no
sub-plans involved in the qual and targetlist, is the EState really
required in the worker to achieve the parallel scan? Is it possible
to initialize the qual, targetlist and projinfo with a local "EState"
structure created in the worker, with the help of the "plan"
structure received from the backend?

Or, in case the EState is required in the worker for the processing,
is there any better way to share backend data structures with the
workers?

Any suggestions?

Regards,
Hari Babu
Fujitsu Australia
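To make step 2 above concrete, here is a minimal sketch of how the
planner-generated tasks might divide a table into one contiguous block
range per worker. BlockRangeTask and assign_block_ranges are names
invented for illustration only; RelationGetNumberOfBlocks() and the
Min() macro are the only existing PostgreSQL calls used.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Illustrative per-worker task descriptor - not an existing structure. */
typedef struct BlockRangeTask
{
    BlockNumber start_block;    /* first block of the worker's range */
    BlockNumber end_block;      /* one past the last block */
} BlockRangeTask;

/*
 * Divide the relation into one contiguous block range per worker.
 * Assumes nworkers > 0; the trailing ranges come out empty when the
 * table has fewer blocks than workers.
 */
static void
assign_block_ranges(Relation rel, int nworkers, BlockRangeTask *tasks)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber per_worker = (nblocks + nworkers - 1) / nworkers;
    int         i;

    Assert(nworkers > 0);

    for (i = 0; i < nworkers; i++)
    {
        tasks[i].start_block = Min((BlockNumber) i * per_worker, nblocks);
        tasks[i].end_block = Min(tasks[i].start_block + per_worker, nblocks);
    }
}

Contiguous ranges keep each worker's own reads sequential; a
finer-grained alternative would be to hand out smaller chunks on
demand, so workers that finish cheap ranges early can pick up more work.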
On 08/21/2014 02:47 PM, Haribabu Kommi wrote:
> Corrected subject.
>
> Hi Hackers,
>
> Implementation of "Parallel Sequence Scan"

I think you mean "Parallel Sequential Scan". Scanning a sequence in
parallel doesn't make much sense.

> 1. "Parallel Sequence Scan" can be achieved by using background
> workers to do the job of the actual sequence scan, including the
> qualification check.

Only if the qualifiers are stable/immutable, I think.

Not even necessarily stable functions - consider use of the
fmgr/executor state contexts to carry information over between calls,
and what parallel execution would do to that.

> 3. In the executor Init phase, copy the necessary data required by
> the workers and start the workers.

Copy how?

Back-ends can only communicate with each other over shared memory,
signals, and sockets. Presumably you'd use a dynamic shared memory
segment, but it's going to be "interesting" to copy that kind of state
over. Some of the work Robert has proposed to add support for data
structures that are native to dynamic shmem might help, I guess...

> 4. In the executor run phase, just get the tuples which are sent by
> the workers and process them further in the plan node execution.

Again, how do you propose to copy these back to the main backend?
That's been one of the things that's limited parallel scan when it's
been looked at before.

> 1. The data structures that are required to be copied from backend to
> worker are CurrentTransactionState, Snapshot, GUC, ComboCID, EState,
> etc.

That's a big "etc". Huge, in fact.

Any function can reference any global variable. Even an immutable
function might use globals for a cache etc - and might, for example,
set them up using an executor start hook. You cannot really make any
assumptions about what functions access what memory.

So it's not possible, as far as I can understand, to define a simple
subset of state that must be copied. Nor is it necessarily correct to
discard the copied state at the end of the run even if you can copy
it. Code may well depend on that state being updated and preserved
across calls.

> I see some problems in copying the "EState" data structure into
> shared memory because it contains so many pointers. There is a need
> for some infrastructure to copy these data structures into shared
> memory.

It's not just a matter of copying them into/via shmem. It's about
their meaning. Does it even make sense to copy the executor state to
another backend? If so, you can't copy it back, so what do you do at
the end of the scans?

> Any suggestions?

Before you try to design anything more on this, study the *large*
amount of discussion that has happened on this topic on this mailing
list over the last few years.

This is not a simple or easy task, and it's not one you should
approach without studying what's already been worked on, built,
contemplated, etc.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
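Craig's immutable-only restriction maps naturally onto an existing
planner helper. A minimal sketch, assuming the check is made wherever
the planner considers generating a parallel plan; qual_is_parallel_candidate
is an invented name, while contain_mutable_functions() is the real
routine declared in optimizer/clauses.h.

#include "postgres.h"
#include "nodes/pg_list.h"
#include "optimizer/clauses.h"

/*
 * Hypothetical planner-side test for the "immutable qualifiers only"
 * rule.  Any stable or volatile function in the qual disqualifies it,
 * since its result may depend on backend-local state a worker would
 * not share.
 */
static bool
qual_is_parallel_candidate(List *qual)
{
    return !contain_mutable_functions((Node *) qual);
}

As the reply above notes, though, this test is necessary but not
sufficient: an immutable C function may still consult process-local
global state that a worker would not have.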
On Sun, Aug 24, 2014 at 1:11 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 08/21/2014 02:47 PM, Haribabu Kommi wrote:
>> Implementation of "Parallel Sequence Scan"
>
> I think you mean "Parallel Sequential Scan".

Sorry for not being clear. Yes, it is a parallel sequential scan.

>> 1. "Parallel Sequence Scan" can be achieved by using background
>> workers to do the job of the actual sequence scan, including the
>> qualification check.
>
> Only if the qualifiers are stable/immutable, I think.
>
> Not even necessarily stable functions - consider use of the
> fmgr/executor state contexts to carry information over between calls,
> and what parallel execution would do to that.

Thanks for your input. As of now we are targeting only immutable
functions, which can be executed in parallel without any side effects.

>> 3. In the executor Init phase, copy the necessary data required by
>> the workers and start the workers.
>
> Copy how?
>
> Back-ends can only communicate with each other over shared memory,
> signals, and sockets.

Sorry for not being clear: copying those data structures into dynamic
shared memory only. From there the workers can access them.

>> 4. In the executor run phase, just get the tuples which are sent by
>> the workers and process them further in the plan node execution.
>
> Again, how do you propose to copy these back to the main backend?

With the help of message queues created in the dynamic shared memory,
the workers can send the data to the queue. On the other side, the
main backend receives the tuples from the queue.

>> 1. The data structures that are required to be copied from backend to
>> worker are CurrentTransactionState, Snapshot, GUC, ComboCID, EState,
>> etc.
>
> That's a big "etc". Huge, in fact.
>
> Any function can reference any global variable. Even an immutable
> function might use globals for a cache etc - and might, for example,
> set them up using an executor start hook. You cannot really make any
> assumptions about what functions access what memory.

Yes, you are correct. For that reason I am thinking of supporting only
functions that depend solely on their input variables and do not
modify any global data.

>> I see some problems in copying the "EState" data structure into
>> shared memory because it contains so many pointers. There is a need
>> for some infrastructure to copy these data structures into shared
>> memory.
>
> It's not just a matter of copying them into/via shmem.
>
> It's about their meaning. Does it even make sense to copy the executor
> state to another backend? If so, you can't copy it back, so what do you
> do at the end of the scans?

If we handle the locking of the relation in the backend, and avoid
doing the parallel sequential scan when any sub-query is involved,
then there is no need for the full EState in the worker. In those
cases, by sharing less information, I think we can execute the plan in
the worker.

>> Any suggestions?
>
> Before you try to design anything more on this, study the *large*
> amount of discussion that has happened on this topic on this mailing
> list over the last few years.
>
> This is not a simple or easy task, and it's not one you should
> approach without studying what's already been worked on, built,
> contemplated, etc.

Thanks for your information.

Regards,
Hari Babu
Fujitsu Australia
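A minimal sketch of the dynamic-shared-memory handoff described here,
assuming the 9.4-era one-argument dsm_create() signature.
ParallelScanShared is an invented layout, deliberately flat and
pointer-free; anything like a snapshot or GUC state would need real
serialization infrastructure, which is exactly the gap being discussed.

#include "postgres.h"
#include "storage/block.h"
#include "storage/dsm.h"

/*
 * Illustrative layout of the read-only state handed to a worker - not
 * an existing PostgreSQL structure.  Only flat, pointer-free data can
 * be copied this naively.
 */
typedef struct ParallelScanShared
{
    Oid         relid;          /* relation to scan */
    BlockNumber start_block;    /* assigned block range */
    BlockNumber end_block;      /* one past the last block */
} ParallelScanShared;

static dsm_segment *
setup_scan_segment(Oid relid, BlockNumber start_block, BlockNumber end_block)
{
    /* one-argument dsm_create() as in the 9.4-era API */
    dsm_segment *seg = dsm_create(sizeof(ParallelScanShared));
    ParallelScanShared *shared = (ParallelScanShared *) dsm_segment_address(seg);

    shared->relid = relid;
    shared->start_block = start_block;
    shared->end_block = end_block;

    return seg;
}

The worker would reattach with dsm_attach(), receiving the segment
handle through its bgw_main_arg - the pattern the test_shm_mq test
module demonstrates.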
On 08/24/2014 09:40 AM, Haribabu Kommi wrote:
> Any suggestions?

Another point I didn't raise the first time around, but that's IMO
quite significant, is that you haven't addressed why this approach to
fully parallel seqscans is useful and solves real problems in
effective ways.

It might seem obvious - "of course they're useful". But I see two
things they'd address:

- CPU-limited sequential scans, where expensive predicates are
  filtering the scan; and

- I/O-limited sequential scans, where the predicates already execute
  fast enough on one CPU, so most time is spent waiting for more disk
  I/O.

The problem I see with your design is that it's not going to be useful
for a large class of CPU-limited scans where the predicate isn't
composed entirely of immutable functions and operators. Especially
since immutable-only predicates are the best candidates for expression
indexes anyway.

While it'd likely be useful for I/O-limited scans, it's going to
increase contention on shared_buffers locking and page management.
More importantly, is it the most efficient way to solve the problem
with I/O-limited scans?

I would seriously suggest looking at first adding support for
asynchronous I/O across ranges of extents during a sequential scan.
You might not need multiple worker backends at all.

I'm sure using async I/O to implement effective_io_concurrency for
seqscans has been discussed and explored before, so again I think some
time in the list archives might make sense.

I don't know if it makes sense to do something as complex as parallel
multi-process seqscans without having a path forward for supporting
non-immutable functions - probably with fmgr API enhancements,
additional function labels ("PARALLEL"), etc, depending on what you
find is needed.

Do you have specific workloads where you see this as useful, and where
doing async I/O and readahead within a single backend wouldn't solve
the same problem?

>>> 3. In the executor Init phase, copy the necessary data required by
>>> the workers and start the workers.
>>
>> Copy how?
>>
>> Back-ends can only communicate with each other over shared memory,
>> signals, and sockets.
>
> Sorry for not being clear: copying those data structures into dynamic
> shared memory only. From there the workers can access them.

That'll probably work with read-only data, but it's not viable for
read/write data unless you use a big lock to protect it, in which case
you lose the parallelism you want to achieve.

You'd have to classify what may be modified during scan execution
carefully and determine if you need to feed any of the resulting
modifications back to the original backend - and how to merge
modifications made by multiple workers, if it's even possible.

That's going to involve a detailed structure-by-structure analysis and
seems likely to be error-prone and buggy.

I think you should probably talk to Robert Haas about what he's been
doing over the last couple of years on parallel query.

>>> 4. In the executor run phase, just get the tuples which are sent by
>>> the workers and process them further in the plan node execution.
>>
>> Again, how do you propose to copy these back to the main backend?
>
> With the help of message queues created in the dynamic shared memory,
> the workers can send the data to the queue. On the other side, the
> main backend receives the tuples from the queue.

OK, so you plan to implement shmem queues.

That'd be a useful starting point, as it'd be something that would be
useful in its own right.

You'd have to be able to handle individual values that're larger than
the ring buffer or whatever you're using for transfers, in case you're
dealing with already-detoasted tuples or in-memory tuples.

Again, chatting with Robert and others who've worked on dynamic shmem,
parallel query, etc would be wise here.

> Yes, you are correct. For that reason I am thinking of supporting only
> functions that depend solely on their input variables and do not
> modify any global data.

You'll want to be careful with that. Nothing stops an immutable
function referencing a cache in a C global that it initializes once
and then treats as read-only, for example.

I suspect you'll need a per-function whitelist. I'd love to be wrong.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
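Craig's whitelist suggestion could start as small as an audited array
of function OIDs. A sketch only: the listed functions are arbitrary
examples, and a real design would more likely end up as a flag in
pg_proc, much like the "PARALLEL" label floated earlier in the thread.

#include "postgres.h"
#include "utils/fmgroids.h"

/* An audited list of function OIDs considered safe to run in a worker.
 * The entries here are arbitrary examples for illustration. */
static const Oid parallel_safe_funcs[] = {
    F_INT4EQ,                   /* int4 = int4 */
    F_TEXTEQ                    /* text = text */
};

static bool
function_is_whitelisted(Oid funcid)
{
    int         i;

    for (i = 0; i < lengthof(parallel_safe_funcs); i++)
    {
        if (parallel_safe_funcs[i] == funcid)
            return true;
    }
    return false;
}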
On Sun, Aug 24, 2014 at 12:34 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 08/24/2014 09:40 AM, Haribabu Kommi wrote:
>> Any suggestions?
>
> Another point I didn't raise the first time around, but that's IMO
> quite significant, is that you haven't addressed why this approach to
> fully parallel seqscans is useful and solves real problems in
> effective ways.
>
> It might seem obvious - "of course they're useful". But I see two
> things they'd address:
>
> - CPU-limited sequential scans, where expensive predicates are
>   filtering the scan; and

Yes, we are mainly targeting CPU-limited sequential scans. For that
reason I want the worker to handle the predicates as well, not just
read the tuples from disk.

> - I/O-limited sequential scans, where the predicates already execute
>   fast enough on one CPU, so most time is spent waiting for more disk
>   I/O.
>
> The problem I see with your design is that it's not going to be useful
> for a large class of CPU-limited scans where the predicate isn't
> composed entirely of immutable functions and operators. Especially
> since immutable-only predicates are the best candidates for expression
> indexes anyway.
>
> While it'd likely be useful for I/O-limited scans, it's going to
> increase contention on shared_buffers locking and page management.
> More importantly, is it the most efficient way to solve the problem
> with I/O-limited scans?
>
> I would seriously suggest looking at first adding support for
> asynchronous I/O across ranges of extents during a sequential scan.
> You might not need multiple worker backends at all.
>
> I'm sure using async I/O to implement effective_io_concurrency for
> seqscans has been discussed and explored before, so again I think some
> time in the list archives might make sense.

Thanks for your inputs. I will check it.

> I don't know if it makes sense to do something as complex as parallel
> multi-process seqscans without having a path forward for supporting
> non-immutable functions - probably with fmgr API enhancements,
> additional function labels ("PARALLEL"), etc, depending on what you
> find is needed.

Thanks for your inputs. I will look for a solution that extends the
support to non-immutable functions as well.

>>>> 3. In the executor Init phase, copy the necessary data required by
>>>> the workers and start the workers.
>>>
>>> Copy how?
>>>
>>> Back-ends can only communicate with each other over shared memory,
>>> signals, and sockets.
>>
>> Sorry for not being clear: copying those data structures into dynamic
>> shared memory only. From there the workers can access them.
>
> That'll probably work with read-only data, but it's not viable for
> read/write data unless you use a big lock to protect it, in which case
> you lose the parallelism you want to achieve.

As of now I am thinking of sharing only read-only data with the
workers. In case read/write data needs to be shared, then we may need
another approach to handle it.

> You'd have to classify what may be modified during scan execution
> carefully and determine if you need to feed any of the resulting
> modifications back to the original backend - and how to merge
> modifications made by multiple workers, if it's even possible.
>
> That's going to involve a detailed structure-by-structure analysis and
> seems likely to be error-prone and buggy.

Thanks for your inputs. I will check it properly.

> I think you should probably talk to Robert Haas about what he's been
> doing over the last couple of years on parallel query.

Sure, I will check with him.

>>>> 4. In the executor run phase, just get the tuples which are sent by
>>>> the workers and process them further in the plan node execution.
>>>
>>> Again, how do you propose to copy these back to the main backend?
>>
>> With the help of message queues created in the dynamic shared memory,
>> the workers can send the data to the queue. On the other side, the
>> main backend receives the tuples from the queue.
>
> OK, so you plan to implement shmem queues.
>
> That'd be a useful starting point, as it'd be something that would be
> useful in its own right.

Shmem queues are already possible with dynamic shared memory; I just
want to use them here.

> You'd have to be able to handle individual values that're larger than
> the ring buffer or whatever you're using for transfers, in case you're
> dealing with already-detoasted tuples or in-memory tuples.
>
> Again, chatting with Robert and others who've worked on dynamic shmem,
> parallel query, etc would be wise here.
>
>> Yes, you are correct. For that reason I am thinking of supporting only
>> functions that depend solely on their input variables and do not
>> modify any global data.
>
> You'll want to be careful with that. Nothing stops an immutable
> function referencing a cache in a C global that it initializes once
> and then treats as read-only, for example.
>
> I suspect you'll need a per-function whitelist. I'd love to be wrong.

Yes, we need per-function details. Once we have a better solution that
also handles non-immutable functions, these may not be required.

Regards,
Hari Babu
Fujitsu Australia
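For reference, a minimal sketch of that tuple transfer using the
shm_mq API as it shipped in 9.4; queue creation, attach and detach
handling are omitted, and the function names are invented.

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/itemptr.h"
#include "storage/shm_mq.h"

/* Worker side: push one tuple's raw data into the queue. */
static void
worker_send_tuple(shm_mq_handle *mqh, HeapTuple tuple)
{
    if (shm_mq_send(mqh, tuple->t_len, tuple->t_data, false) != SHM_MQ_SUCCESS)
        elog(ERROR, "parallel scan worker: tuple queue detached");
}

/* Backend side: pull one tuple out and copy it into backend-local
 * memory, since the queue buffer is only valid until the next receive. */
static HeapTuple
leader_receive_tuple(shm_mq_handle *mqh)
{
    Size        nbytes;
    void       *data;
    HeapTupleData htup;

    if (shm_mq_receive(mqh, &nbytes, &data, false) != SHM_MQ_SUCCESS)
        return NULL;            /* detached: the worker has finished */

    /* Rebuild a tuple around the received payload before copying it. */
    htup.t_len = (uint32) nbytes;
    htup.t_data = (HeapTupleHeader) data;
    ItemPointerSetInvalid(&htup.t_self);
    htup.t_tableOid = InvalidOid;

    return heap_copytuple(&htup);
}

One point in shm_mq's favour here: it streams messages larger than the
ring buffer in pieces, which covers Craig's earlier concern about
long, already-detoasted values.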
On Thu, Aug 21, 2014 at 2:47 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Implementation of "Parallel Sequence Scan"
>
> Approach:
>
> 1. "Parallel Sequence Scan" can be achieved by using background
> workers to do the job of the actual sequence scan, including the
> qualification check.
>
> 2. The planner generates the parallel scan plan by checking the
> possible criteria of the table for a parallel scan, and generates the
> tasks (ranges of blocks).
>
> 3. In the executor Init phase, copy the necessary data required by
> the workers and start the workers.
>
> 4. In the executor run phase, just get the tuples which are sent by
> the workers and process them further in the plan node execution.

Well, this is something I've thought quite a bit about already. Many
of my thoughts on parallelism are here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Although the page title is parallel sort, many of the concerns are
applicable to parallelism of any sort.

I posted some patches containing some of the necessary infrastructure
here:

http://archives.postgresql.org/message-id/CA+Tgmoam66dTzCP8N2cRcS6S6dBMFX+JMba+mDf68H=KAkNjPQ@mail.gmail.com

I seem to have forgotten to add that message to the CommitFest. Crap.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/24/14, 6:22 AM, Haribabu Kommi wrote:
> Yes, we are mainly targeting CPU-limited sequential scans. For that
> reason I want the worker to handle the predicates as well, not just
> read the tuples from disk.

In that case, I would suggest focusing on parallel execution of
conditions regardless of where they show up in the query plan. In my
experience, they often have nothing to do with a seqscan.

Here's a real-world example. We have a view that pivots our
application's accounting journal into a ledger. The expensive part of
the view is this:

sum(
    CASE WHEN b.tag::text = 'installment_principal'::text
         THEN b.type_cd    -- type_cd is either 1, 0, or -1
         ELSE 0::numeric
    END
) * transaction_amount AS installment_principal

The view with this pivot has about 100 of these CASE statements.
Frequently we only reference a few of them, but anytime we need to
refer to 20+ of them, the evaluation of that expression gets VERY
CPU-expensive compared to the rest of the query.

The other thing I would look at before seqscan filters is join
processing and bitmap index combining (ie: ANDing together the results
of several bitmap index scans). Those are things that can be very CPU
intensive even when doing simple equality comparisons.

BTW, it's also possible that these cases would be good fits for GPU
parallel execution.

--
Jim C. Nasby, Data Architect       jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net