Re: Parallelize correlated subqueries that execute within each worker

Поиск
Список
Период
Сортировка
От James Coleman
Тема Re: Parallelize correlated subqueries that execute within each worker
Дата
Msg-id CAAaqYe-FYVq=CSZu8VtKi0fx8Zdi6fzyzEP2B9iK2cUgY8xGjA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Parallelize correlated subqueries that execute within each worker  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Parallelize correlated subqueries that execute within each worker  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Hi Robert, thanks for the detailed reply.

On Wed, Nov 3, 2021 at 10:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> As a preliminary comment, it would be quite useful to get Tom Lane's
> opinion on this, since it's not an area I understand especially well,
> and I think he understands it better than anyone.
>
> On Fri, May 7, 2021 at 12:30 PM James Coleman <jtc331@gmail.com> wrote:
> > The basic idea is that we need to track (both on nodes and relations)
> > not only whether that node or rel is parallel safe but also whether
> > it's parallel safe assuming params are rechecked in the using context.
> > That allows us to delay making a final decision until we have
> > sufficient context to conclude that a given usage of a Param is
> > actually parallel safe or unsafe.
>
> I don't really understand what you mean by "assuming params are
> rechecked in the using context." However, I think that a possibly
> better approach to this whole area would be to try to solve the
> problem by putting limits on where you can insert a Gather node.
> Consider:
>
> Nested Loop
> -> Seq Scan on x
> -> Index Scan on y
>    Index Cond: y.q = x.q
>
> If you insert a Gather node atop the Index Scan, necessarily changing
> it to a Parallel Index Scan, then you need to pass values around. For
> every value we get for x.q, we would need to start workers, sending
> them the value of x.q, and they do a parallel index scan working
> together to find all rows where y.q = x.q, and then exit. We repeat
> this for every tuple from x.q. In the absence of infrastructure to
> pass those parameters, we can't put the Gather there. We also don't
> want to, because it would be really slow.
>
> If you insert the Gather node atop the Seq Scan or the Nested Loop, in
> either case necessarily changing the Seq Scan to a Parallel Seq Scan,
> you have no problem. If you put it on top of the Nested Loop, the
> parameter will be set in the workers and used in the workers and
> everything is fine. If you put it on top of the Seq Scan, the
> parameter will be set in the leader -- by the Nested Loop -- and used
> in the leader, and again you have no problem.
>
> So in my view of the world, the parameter just acts as an additional
> constraint on where Gather nodes can be placed. I don't see that there
> are any parameters that are unsafe categorically -- they're just
> unsafe if the place where they are set is on a different side of the
> Gather from the place where they are used. So I don't understand --
> possibly just because I'm dumb -- the idea behind
> consider_parallel_rechecking_params, because that seems to be making a
> sort of overall judgement about the safety or unsafety of the
> parameter on its own merits, rather than thinking about the Gather
> placement.

I had to read through this several times before I understood the point
(not your fault, this is, as you note, a complicated area). I *think*
if I grok it properly you're effectively describing what this patch
results in conceptually (but possibly solving it from a different
direction).

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Under that paradigm the existing consider_parallel and parallel_safe
boolean values imply everything is about whether a plan is inherently
parallel safe. Thus the current doesn't have the context to handle the
nuance of params (as they are not inherently parallel-safe or unsafe).

Introducing consider_parallel_rechecking_params and
parallel_safe_ignoring_params allows us to keep more context on params
and make a more nuanced decision at the proper level of the plan. This
is what I mean by "rechecked in the using context", though I realize
now that both "recheck" and "context" are overloaded terms in the
project, so don't describe the concept particularly clearly. When a
path relies on params we can only make a final determination about its
parallel safety if we know whether or not the current parallel node
can provide the param's value. We don't necessarily know that
information until we attempt to generate a full parallel node in the
plan (I think what you're describing as "inserting a Gather node")
since the param may come from another node in the plan. These new
values allow us to do that by tracking tentatively parallel-safe
subplans (given proper Gather node placement) and delaying the
parallel-safety determination until the point at which a param is
available (or not).

Is that a more helpful framing of what my goal is here?

> When I last worked on this, I had hoped that extParam or allParam
> would be the thing that would answer the question: are there any
> parameters used under this node that are not also set under this node?
> But I seem to recall that neither seemed to be answering precisely
> that question, and the lousy naming of those fields and limited
> documentation of their intended purpose did not help.

I don't really know anything about extParam or allParam, so I can't
offer any insight here.

Thanks,
James Coleman



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Antonin Houska
Дата:
Сообщение: Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Следующее
От: Jan Wieck
Дата:
Сообщение: Re: should we enable log_checkpoints out of the box?