Re: Parallel Seq Scan

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Parallel Seq Scan
Дата
Msg-id CA+TgmoZ=N3JzjGOojg+aPV3KA2Z8J7LhTKkqSryOsGmeun_z4w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Parallel Seq Scan  (David Rowley <dgrowleyml@gmail.com>)
Ответы Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Parallel Seq Scan  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-hackers
On Sat, Apr 4, 2015 at 5:19 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> Going over the previous emails in this thread I see that it has been a long
> time since anyone discussed anything around how we might decide at planning
> time how many workers should be used for the query, and from the emails I
> don't recall anyone proposing a good idea about how this might be done, and
> I for one can't see how this is at all possible to do at planning time.
>
> I think that the planner should know nothing of parallel query at all, and
> the planner quite possibly should go completely unmodified for this patch.
> One major problem I can see is that, given a query such as:
>
> SELECT * FROM million_row_product_table WHERE category = 'ELECTRONICS';
>
> Where we have a non-unique index on category, some plans which may be
> considered might be:
>
> 1. Index scan on the category index to get all rows matching 'ELECTRONICS'
> 2. Sequence scan on the table, filter matching rows.
> 3. Parallel plan which performs a series of partial sequence scans pulling
> out all matching rows.
>
> I really think that if we end choosing things like plan 3, when plan 2 was
> thrown out because of its cost, then we'll end up consuming more CPU and I/O
> than we can possibly justify using. The environmentalist in me screams that
> this is wrong. What if we kicked off 128 worker process on some high-end
> hardware to do this? I certainly wouldn't want to pay the power bill. I
> understand there's costing built in to perhaps stop this, but I still think
> it's wrong headed, and we need to still choose the fastest non-parallel plan
> and only consider parallelising that later.

I agree that this is an area that needs more thought.  I don't
(currently, anyway) agree that the planner shouldn't know anything
about parallelism.  The problem with that is that there's lots of
relevant stuff that can only be known at plan time.  For example,
consider the query you mention above on a table with no index.  If the
WHERE clause is highly selective, a parallel plan may well be best.
But if the selectivity is only, say, 50%, a parallel plan is stupid:
the IPC costs of shipping many rows back to the master will overwhelm
any benefit we could possibly have hoped to get, and the overall
result will likely be that the parallel plan both runs slower and uses
more resources.  At plan time, we have the selectivity information
conveniently at hand, and can use that as part of the cost model to
make educated decisions.  Execution time is way too late to be
thinking about those kinds of questions.

I think one of the philosophical questions that has to be answered
here is "what does it mean to talk about the cost of a parallel
plan?".  For a non-parallel plan, the cost of the plan means both "the
amount of effort we will spend executing the plan" and also "the
amount of time we think the plan will take to complete", but those two
things are different for parallel plans.  I'm inclined to think it's
right to view the cost of a parallel plan as a proxy for execution
time, because the fundamental principle of the planner is that we pick
the lowest-cost plan.  But there also clearly needs to be some way to
prevent the selection of a plan which runs slightly faster at the cost
of using vastly more resources.

Currently, the planner tracks the best unsorted path for each relation
as well as the best path for each useful sort order.  Suppose we treat
parallelism as another axis for judging the quality of a plan: we keep
the best unsorted, non-parallel path; the best non-parallel path for
each useful sort order; the best unsorted, parallel path; and the best
parallel path for each sort order.  Each time we plan a node, we
generate non-parallel paths first, and then parallel paths.  But, if a
parallel plan isn't markedly faster than the non-parallel plan for the
same sort order, then we discard it.  I'm not sure exactly what the
thresholds should be here, and they probably need to be configurable,
because on a single-user system with excess capacity available it may
be absolutely desirable to use ten times the resources to get an
answer 25% faster, but on a heavy-loaded system that will stink.

Some ideas for GUCs:

max_parallel_degree = The largest number of processes we'll consider
using for a single query.
min_parallel_speedup = The minimum percentage by which a parallel path
must be cheaper (in terms of execution time) than a non-parallel path
in order to survive.  I'm imagining the default here might be
something like 15%.
min_parallel_speedup_per_worker = Like the previous one, but per
worker.  e.g. if this is 5%, which might be a sensible default, then a
plan with 4 workers must be at least 20% better to survive, but a plan
using only 2 workers only needs to be 10% better.

An additional benefit of this line of thinking is that planning would
always produce a best non-parallel path.  And sometimes, there would
also be a best parallel path that is expected to run faster.  We could
then choose between them dynamically at execution time.

I think it's pretty hard to imagine a scenario as extreme as the one
you mention above ever actually occurring in practice.  I mean, even
the most naive implementation of parallel query will presumably have
something like max_parallel_degree, and you probably won't have that
set to 128.  For starters, it can't possibly make sense unless you
server has at least 128 CPUs, and even then it only makes sense if you
don't mind a single query using all of them, and even if the first of
those things is true, the second one probably isn't.  I don't doubt
that less extreme variants of this scenario are possible, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: Tuple visibility within a single XID
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0