Re: Parallel Seq Scan

From: Robert Haas
Subject: Re: Parallel Seq Scan
Msg-id: CA+TgmoZ=N3JzjGOojg+aPV3KA2Z8J7LhTKkqSryOsGmeun_z4w@mail.gmail.com
In reply to: Re: Parallel Seq Scan (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: Parallel Seq Scan (Amit Kapila <amit.kapila16@gmail.com>)
           Re: Parallel Seq Scan (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
On Sat, Apr 4, 2015 at 5:19 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> Going over the previous emails in this thread I see that it has been a long
> time since anyone discussed anything around how we might decide at planning
> time how many workers should be used for the query, and from the emails I
> don't recall anyone proposing a good idea about how this might be done, and
> I for one can't see how this is at all possible to do at planning time.
>
> I think that the planner should know nothing of parallel query at all, and
> the planner quite possibly should go completely unmodified for this patch.
> One major problem I can see is that, given a query such as:
>
> SELECT * FROM million_row_product_table WHERE category = 'ELECTRONICS';
>
> where we have a non-unique index on category, some plans which may be
> considered might be:
>
> 1. Index scan on the category index to get all rows matching 'ELECTRONICS'
> 2. Sequential scan on the table, filtering for matching rows.
> 3. Parallel plan which performs a series of partial sequential scans pulling
> out all matching rows.
>
> I really think that if we end up choosing things like plan 3, when plan 2 was
> thrown out because of its cost, then we'll end up consuming more CPU and I/O
> than we can possibly justify using. The environmentalist in me screams that
> this is wrong. What if we kicked off 128 worker processes on some high-end
> hardware to do this? I certainly wouldn't want to pay the power bill. I
> understand there's costing built in to perhaps stop this, but I still think
> it's wrong-headed, and we still need to choose the fastest non-parallel plan
> and only consider parallelising that later.

I agree that this is an area that needs more thought. I don't (currently, anyway) agree that the planner shouldn't know anything about parallelism. The problem with that is that there's lots of relevant stuff that can only be known at plan time. For example, consider the query you mention above on a table with no index.
If the WHERE clause is highly selective, a parallel plan may well be best. But if the selectivity is only, say, 50%, a parallel plan is stupid: the IPC costs of shipping many rows back to the master will overwhelm any benefit we could possibly have hoped to get, and the overall result will likely be that the parallel plan both runs slower and uses more resources. At plan time, we have the selectivity information conveniently at hand, and can use that as part of the cost model to make educated decisions. Execution time is way too late to be thinking about those kinds of questions.

I think one of the philosophical questions that has to be answered here is "what does it mean to talk about the cost of a parallel plan?". For a non-parallel plan, the cost of the plan means both "the amount of effort we will spend executing the plan" and also "the amount of time we think the plan will take to complete", but those two things are different for parallel plans. I'm inclined to think it's right to view the cost of a parallel plan as a proxy for execution time, because the fundamental principle of the planner is that we pick the lowest-cost plan. But there also clearly needs to be some way to prevent the selection of a plan which runs slightly faster at the cost of using vastly more resources.

Currently, the planner tracks the best unsorted path for each relation as well as the best path for each useful sort order. Suppose we treat parallelism as another axis for judging the quality of a plan: we keep the best unsorted, non-parallel path; the best non-parallel path for each useful sort order; the best unsorted, parallel path; and the best parallel path for each sort order. Each time we plan a node, we generate non-parallel paths first, and then parallel paths. But, if a parallel plan isn't markedly faster than the non-parallel plan for the same sort order, then we discard it.
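To make the selectivity point concrete, here is a toy cost model. This is an invented illustration, not PostgreSQL's actual costing; the function names and the per-row constants are assumptions chosen only to show the shape of the trade-off. The key idea is that the filtering work divides across workers, but every row surviving the filter pays an IPC cost to be shipped back to the master.

```python
# Toy cost model (illustration only, NOT PostgreSQL's real costing).
# Filtering work is split across workers; rows that pass the filter
# must each be shipped back to the master, which is not split.

def serial_cost(rows, cpu_per_row=1.0):
    """Cost of a plain sequential scan: one process filters every row."""
    return rows * cpu_per_row

def parallel_cost(rows, selectivity, workers,
                  cpu_per_row=1.0, ipc_per_row=4.0):
    """Cost of a parallel scan: scanning is divided, row-shipping is not."""
    scan = rows * cpu_per_row / workers       # filtering is parallelised
    ipc = rows * selectivity * ipc_per_row    # shipping surviving rows is not
    return scan + ipc

rows = 1_000_000
# Highly selective filter: few rows shipped, parallel wins.
print(parallel_cost(rows, 0.01, 4) < serial_cost(rows))  # True
# 50% selectivity: IPC dominates, parallel loses.
print(parallel_cost(rows, 0.50, 4) < serial_cost(rows))  # False
```

Since selectivity estimates are exactly what the planner has at hand, this comparison can only be made at plan time, which is the point of the paragraph above.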
I'm not sure exactly what the thresholds should be here, and they probably need to be configurable, because on a single-user system with excess capacity available it may be absolutely desirable to use ten times the resources to get an answer 25% faster, but on a heavily loaded system that will stink.

Some ideas for GUCs:

max_parallel_degree = The largest number of processes we'll consider using for a single query.

min_parallel_speedup = The minimum percentage by which a parallel path must be cheaper (in terms of execution time) than a non-parallel path in order to survive. I'm imagining the default here might be something like 15%.

min_parallel_speedup_per_worker = Like the previous one, but per worker. e.g. if this is 5%, which might be a sensible default, then a plan with 4 workers must be at least 20% better to survive, but a plan using only 2 workers only needs to be 10% better.

An additional benefit of this line of thinking is that planning would always produce a best non-parallel path. And sometimes, there would also be a best parallel path that is expected to run faster. We could then choose between them dynamically at execution time.

I think it's pretty hard to imagine a scenario as extreme as the one you mention above ever actually occurring in practice. I mean, even the most naive implementation of parallel query will presumably have something like max_parallel_degree, and you probably won't have that set to 128. For starters, it can't possibly make sense unless your server has at least 128 CPUs, and even then it only makes sense if you don't mind a single query using all of them, and even if the first of those things is true, the second one probably isn't. I don't doubt that less extreme variants of this scenario are possible, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
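[Editor's note: the per-worker path-survival rule proposed in the message above can be sketched as follows. This is a hypothetical illustration; the GUC name and its 5% default come from the proposal, but the exact formula and function are assumptions.]

```python
# Sketch of the proposed min_parallel_speedup_per_worker rule
# (illustration only, not PostgreSQL source).  At 5% per worker,
# a 4-worker plan must be at least 20% cheaper than the best
# non-parallel plan for the same sort order, and a 2-worker plan
# at least 10% cheaper, or the parallel path is discarded.

MIN_PARALLEL_SPEEDUP_PER_WORKER = 0.05  # proposed default: 5% per worker

def parallel_path_survives(parallel_cost, serial_cost, num_workers):
    """Keep the parallel path only if it beats the serial path by
    the required per-worker margin."""
    required_speedup = MIN_PARALLEL_SPEEDUP_PER_WORKER * num_workers
    return parallel_cost <= serial_cost * (1.0 - required_speedup)

print(parallel_path_survives(75.0, 100.0, 4))  # 25% faster, needs 20% -> True
print(parallel_path_survives(85.0, 100.0, 4))  # 15% faster, needs 20% -> False
```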