Re: Consider parallel for lateral subqueries with limit

Поиск
Список
Период
Сортировка
От James Coleman
Тема Re: Consider parallel for lateral subqueries with limit
Дата
Msg-id CAAaqYe_zVOj5epc=WhdaSAStNxEuJQHyi2=eF4GXm7rH6ERtFw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Consider parallel for lateral subqueries with limit  (James Coleman <jtc331@gmail.com>)
Ответы Re: Consider parallel for lateral subqueries with limit  (Greg Nancarrow <gregn4422@gmail.com>)
Список pgsql-hackers
Brian's email didn't keep the relevant headers, and so didn't show up
as a reply, so I've pasted it here and am replying in this original
thread. You can find the original at [1].

On Sun, Dec 6, 2020 at 7:34 PM Brian Davis <brian@brianlikespostgres.com> wrote:
>
> > Note that near the end of grouping planner we have a similar check:
> >
> > if (final_rel->consider_parallel && root->query_level > 1 &&
> >         !limit_needed(parse))
> >
> > guarding copying the partial paths from the current rel to the final
> > rel. I haven't managed to come up with a test case that exposes that
>
> Played around with this a bit, here's a non-correlated subquery that gets us to that if statement
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (bar int);
>
> INSERT INTO foo (bar)
> SELECT
>   g
> FROM
>   generate_series(1, 10000) AS g;
>
>
> SELECT
>   (
>     SELECT
>       bar
>     FROM
>       foo
>     LIMIT 1
>   ) AS y
> FROM
>   foo;

Thanks. That triggers the parallel if case for limit -- any additional
GUCs need to be modified to get that result? I assume regardless a
parallel plan isn't chosen (even if you remove the limit needed
check)?

> I also was thinking about the LATERAL part.
>
> I couldn't think of any reason why the uncorrelated subquery's results would need to be shared and therefore the
same,when we'll be "looping" over each row of the source table, running the subquery anew for each, conceptually. 
>
> But then I tried this...
>
> test=# CREATE TABLE foo (bar int);
> CREATE TABLE
> test=#
> test=# INSERT INTO foo (bar)
> test-# SELECT
> test-#   g
> test-# FROM
> test-#   generate_series(1, 10) AS g;
> INSERT 0 10
> test=#
> test=#
> test=# SELECT
> test-#   foo.bar,
> test-#   lat.bar
> test-# FROM
> test-#   foo JOIN LATERAL (
> test(#     SELECT
> test(#       bar
> test(#     FROM
> test(#       foo AS foo2
> test(#     ORDER BY
> test(#       random()
> test(#     LIMIT 1
> test(#   ) AS lat ON true;
>  bar | bar
> -----+-----
>    1 |   7
>    2 |   7
>    3 |   7
>    4 |   7
>    5 |   7
>    6 |   7
>    7 |   7
>    8 |   7
>    9 |   7
>   10 |   7
> (10 rows)
>
>
> As you can see, random() is only called once.  If postgres were supposed to be running the subquery for each source
row,conceptually, it would be a mistake to cache the results of a volatile function like random(). 

This was genuinely surprising to me. I think one could argue that this
is just an optimization (after all -- if there is no correlation, then
running it once is conceptually/safely the same as running it multiple
times), but that doesn't seem to hold water with the volatile function
in play.

Of course, given the volatile function we'd never parallelize this
anyway. But we still have to consider the case where the result is
otherwise ordered differently between workers (just by virtue of disk
order, for example).

I've tried the above query using tenk1 from the regress tests to get a
bit more data, and, along with modifying several GUCs, can force
parallel plans. However in no case can I get it to execute that
uncorrelated lateral in multiple workers. That makes me suspicious
that there's another check in play here ensuring the lateral subquery
is executed for each group, and that in the uncorrelated case really
that rule still holds -- it's just a single group.

> The docs say: "When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of
theFROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the
LATERALitem is evaluated using that row or row set's values of the columns. The resulting row(s) are joined as usual
withthe rows they were computed from. This is repeated for each row or set of rows from the column source table(s)." 
>
> They don't say what happens with LATERAL when there aren't cross-references though.  As we expect, adding one does
showrandom() being called once for each source row. 

If my theory above is correct then it's implicit that the row set is
the whole previous FROM group.

> test=# SELECT
> test-#   foo.bar,
> test-#   lat.bar
> test-# FROM
> test-#   foo JOIN LATERAL (
> test(#     SELECT
> test(#       bar
> test(#     FROM
> test(#       foo AS foo2
> test(#     WHERE
> test(#       foo2.bar < foo.bar + 100000
> test(#     ORDER BY
> test(#       random()
> test(#     LIMIT 1
> test(#   ) AS lat ON true;
>  bar | bar
> -----+-----
>    1 |   5
>    2 |   8
>    3 |   3
>    4 |   4
>    5 |   5
>    6 |   5
>    7 |   1
>    8 |   3
>    9 |   7
>   10 |   3
> (10 rows)
>
> It seems like to keep the same behavior that exists today, results of LATERAL subqueries would need to be the same if
theyaren't correlated, and so you couldn't run them in parallel with a limit if the order wasn't guaranteed.  But I'll
bethe first to admit that it's easy enough for me to miss a key piece of logic on something like this, so I could be
wayoff base too. 

If it weren't for the volatile function in the example, I think I
could argue we could change before (i.e., my theorizing above
originally about just being an optimization)...but yes, it seems like
this behavior shouldn't change. I can't seem to make it break though
with the patch.

While I haven't actually tracked down to guarantee this is handled
elsewhere, a thought experiment -- I think -- shows it must be so.
Here's why: suppose we don't have a limit here, but the query return
order is different in different backends. Then we would have the same
problem you bring up. In that case this code is already setting
consider_parallel=true on the rel. So I don't think we're changing any
behavior here.

James

1: https://www.postgresql.org/message-id/a50766a4-a927-41c4-984c-76e513b6d1c4%40www.fastmail.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Rémi Lapeyre
Дата:
Сообщение: Re: Add header support to text format and matching feature
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: range_agg