Thread: BUG #13528: LATERAL vs. correlated scalar subquery


BUG #13528: LATERAL vs. correlated scalar subquery

From: marko@joh.to
Date:
The following bug has been logged on the website:

Bug reference:      13528
Logged by:          Marko Tiikkaja
Email address:      marko@joh.to
PostgreSQL version: 9.4.4
Operating system:   Linux
Description:

Hi,

Observe the following case:

=# create table data(a int, b int, primary key(a,b));
CREATE TABLE
=# insert into data select i, random() * 100 from generate_series(1, 100000)
i;
INSERT 0 100000
=# create view counts as select a, count(*) from data group by a;
CREATE VIEW
=# explain analyze select u.elem, x.count from unnest(array[1]) u(elem),
lateral (select counts.count from counts where counts.a = u.elem) x;
                                                      QUERY PLAN

----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1867.28..1873.03 rows=100 width=12) (actual
time=69.858..77.021 rows=1 loops=1)
   Hash Cond: (data.a = u.elem)
   ->  HashAggregate  (cost=1865.03..1867.03 rows=200 width=4) (actual
time=44.528..70.394 rows=100000 loops=1)
         Group Key: data.a
         ->  Seq Scan on data  (cost=0.00..1391.02 rows=94802 width=4)
(actual time=0.013..8.586 rows=100000 loops=1)
   ->  Hash  (cost=1.00..1.00 rows=100 width=4) (actual time=0.012..0.012
rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Function Scan on unnest u  (cost=0.00..1.00 rows=100 width=4)
(actual time=0.010..0.011 rows=1 loops=1)
 Planning time: 0.142 ms
 Execution time: 77.551 ms
(10 rows)

Tweaking any of the enable_* parameters doesn't get me to the desired query
produced by the old way of LATERALizing:

=# explain analyze select u.elem, (select counts.count from counts where
counts.a = u.elem) from unnest(array[1]) u(elem);
                                                                  QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on unnest u  (cost=0.00..125498.75 rows=100 width=4) (actual
time=0.037..0.038 rows=1 loops=1)
   SubPlan 1
     ->  Subquery Scan on counts  (cost=0.29..1254.98 rows=1 width=8)
(actual time=0.024..0.024 rows=1 loops=1)
           ->  GroupAggregate  (cost=0.29..1254.97 rows=1 width=4) (actual
time=0.023..0.023 rows=1 loops=1)
                 Group Key: data.a
                 ->  Index Only Scan using data_pkey on data
(cost=0.29..1252.59 rows=474 width=4) (actual time=0.017..0.018 rows=1
loops=1)
                       Index Cond: (a = u.elem)
                       Heap Fetches: 1
 Planning time: 0.125 ms
 Execution time: 0.073 ms
(10 rows)

Is there some fundamental issue here which prevents the planner from
producing the same plan?

Re: BUG #13528: LATERAL vs. correlated scalar subquery

From: Maxim Boguk
Date:
On Thu, Jul 30, 2015 at 6:53 PM, <marko@joh.to> wrote:
> [...]
> Is there some fundamental issue here which prevents the planner from
> producing the same plan?

Hi Marko,

Note that the new plan has a lower total cost than the old plan
(cost=1867.28..1873.03 vs cost=0.00..125498.75).
I think that's the primary reason it was selected (the planner could produce
the old plan, but the new plan wins on cost).
A more fundamental issue is that unnest is expected to return 100 rows by
default, instead of the 1 row in your case:
( Function Scan on unnest u  (cost=0.00..1.00 rows=100 width=4) vs (actual
time=0.010..0.011 rows=1 loops=1) )
If you try the same test with a 100-element array, I think the new
plan/lateral query will be faster than the old one.
It's a somewhat fundamental issue with planning queries over
unnest(constant array); I've been burned by it many times.
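One way to address the default row estimate is to wrap unnest() in a SQL
function declared with an explicit ROWS estimate. This is a sketch, not
something from this thread, and the wrapper name unnest1 is invented for
illustration; as the later messages show, fixing the estimate alone does not
enable the pushdown Marko is after.

```sql
-- Hypothetical workaround: a wrapper around unnest() whose declared row
-- estimate matches the expected array size, so the planner no longer
-- assumes the default of 100 rows.
CREATE FUNCTION unnest1(anyarray) RETURNS SETOF anyelement
    LANGUAGE sql IMMUTABLE
    ROWS 1  -- planner now expects ~1 row from each call
AS $$ SELECT unnest($1) $$;

-- Used in place of unnest(), the outer side of the lateral join is
-- costed for a single row:
-- EXPLAIN ANALYZE
-- SELECT u.elem, x.count
-- FROM unnest1(array[1]) u(elem),
--      LATERAL (SELECT counts.count FROM counts WHERE counts.a = u.elem) x;
```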



-- 
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."

Re: BUG #13528: LATERAL vs. correlated scalar subquery

From: Marko Tiikkaja
Date:
On 7/30/15 1:48 PM, Maxim Boguk wrote:
> Note that the new plan has a lower total cost than the old plan
> (cost=1867.28..1873.03 vs cost=0.00..125498.75).
> I think that's the primary reason it was selected (the planner could produce
> the old plan, but the new plan wins on cost).

I'll have to admit I could've put more time into the original report,
but I don't think that's accurate.  If I disable hashagg and hashjoin
and tweak the query to tell the planner that only one row is expected,
the plan looks like this:

=#* explain analyze select u.elem, x.count from (SELECT u.elem FROM
unnest(array[1]) u(elem) LIMIT 1) u(elem), lateral (select counts.count
from counts where counts.a = u.elem) x;
                                                                   QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.29..4778.86 rows=1 width=12) (actual
time=0.060..52.380 rows=1 loops=1)
    Join Filter: (u.elem = data.a)
    Rows Removed by Join Filter: 99999
    ->  Limit  (cost=0.00..0.01 rows=1 width=4) (actual
time=0.010..0.011 rows=1 loops=1)
          ->  Function Scan on unnest u  (cost=0.00..1.00 rows=100
width=4) (actual time=0.009..0.009 rows=1 loops=1)
    ->  GroupAggregate  (cost=0.29..4774.33 rows=200 width=4) (actual
time=0.047..45.634 rows=100000 loops=1)
          Group Key: data.a
          ->  Index Only Scan using data_pkey on data
(cost=0.29..4298.32 rows=94802 width=4) (actual time=0.042..19.381
rows=100000 loops=1)
                Heap Fetches: 100000
  Planning time: 0.147 ms
  Execution time: 52.429 ms
(11 rows)


which to me suggests that the planner just doesn't realize that it can
push the condition on counts.a into the view.
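A manual rewrite illustrates the pushdown the planner isn't doing here. This
is a sketch against the thread's schema that inlines the view's aggregate by
hand, so the condition reaches the base table before grouping:

```sql
-- Sketch: apply the condition directly to data instead of to the counts
-- view, so the aggregate can use data_pkey and touch only matching rows.
-- Note: this returns 0 rather than NULL when no rows match, unlike the
-- correlated subquery against the view.
SELECT u.elem,
       (SELECT count(*)
        FROM data
        WHERE data.a = u.elem)   -- filter pushed below the aggregate
FROM unnest(array[1]) u(elem);
```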


.m

Re: BUG #13528: LATERAL vs. correlated scalar subquery

From: Tom Lane
Date:
Marko Tiikkaja <marko@joh.to> writes:
> On 7/30/15 1:48 PM, Maxim Boguk wrote:
>> Note that the new plan has a lower total cost than the old plan
>> (cost=1867.28..1873.03 vs cost=0.00..125498.75).
>> I think that's the primary reason it was selected (the planner could produce
>> the old plan, but the new plan wins on cost).

> I'll have to admit I could've put more time into the original report,
> but I don't think that's accurate.

Yeah.  It would be nice if we could produce a more accurate rowcount
estimate for unnest(array[...]); that's something that's been a pain
for Salesforce so I've been considering ways to fix it.  But it's
not the killer problem here.

> which to me suggests that the planner just doesn't realize that it can
> push the condition on counts.a into the view.

It can't.  We'd need parameterized paths for subqueries, which we don't
have (yet).

            regards, tom lane