Обсуждение: Pull up sublink of type 'NOT NOT (expr)'

Поиск
Список
Период
Сортировка

Pull up sublink of type 'NOT NOT (expr)'

От
Richard Guo
Дата:
Hi hackers,

Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would not
be considered when pulling up sublinks. For instance:

gpadmin=# explain select * from a where NOT NOT (a.i in (select b.i from b));
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on a  (cost=51.50..85.62 rows=1005 width=8)
   Filter: (hashed SubPlan 1)
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..44.00 rows=3000 width=4)
(4 rows)


Should we give it a chance, like the attached does?

Thanks
Richard
Вложения

Re: Pull up sublink of type 'NOT NOT (expr)'

От
Alexey Bashtanov
Дата:
Hello Richard,

Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would not
be considered when pulling up sublinks. For instance:

gpadmin=# explain select * from a where NOT NOT (a.i in (select b.i from b));
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on a  (cost=51.50..85.62 rows=1005 width=8)
   Filter: (hashed SubPlan 1)
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..44.00 rows=3000 width=4)
(4 rows)


Should we give it a chance, like the attached does?

Sometimes hashed subplan is faster than hash join and than all the other options, as it preserves the order.
Using NOT NOT, one can control whether to use it or not:
https://pgblog.bashtanov.com/2017/12/08/double-negative-and-query-performance/ (test case and results in the bottom of the page).

Surely dirty tricks should not be the way to control the planner, but when breaking them we should probably provide a way to achieve the same result,
ideally making the planner choose the best plan without hints.

Best,
  Alex

Re: Pull up sublink of type 'NOT NOT (expr)'

От
Richard Guo
Дата:
Hi Alex,

Yes hashed SubPlan preserves order and may be faster than hash join in some
cases. But I don't think that is a reason good enough to prevent the subplan
from being converted to join.

Let's suppose the subplan is uncorrelated, otherwise hashed SubPlan would not
be used. Hashed SubPlan can only be applied when the size of the subquery
result can fit in work_mem. When the data size increases or work_mem is set to
a smaller value that more than one batch is needed, hashed SubPlan will not be
used and the total cost will increase dramatically. Hash join, by contrast,
handles multiple batches better.

Below is an example to show the behavior of hash join and hashed SubPlan for
single batch and multiple batches.

create table a (i int);
create table b (i int);
insert into a select i from generate_series(1,10000)i;
insert into b select i from generate_series(1,10000)i;

For single batch:

Hash Join
gpadmin=# explain analyze select * from a where a.i in (select b.i from b);
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=270.00..552.50 rows=10000 width=4) (actual time=4.332..9.499 rows=10000 loops=1)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.016..1.328 rows=10000 loops=1)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4) (actual time=4.292..4.292 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.013..1.817 rows=10000 loops=1)
 Planning Time: 0.236 ms
 Execution Time: 10.099 ms
(8 rows)

Hashed SubPlan
gpadmin=# explain analyze select * from a where not not a.i in (select b.i from b);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=170.00..340.00 rows=5000 width=4) (actual time=6.359..11.843 rows=10000 loops=1)
   Filter: (hashed SubPlan 1)
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.017..1.749 rows=10000 loops=1)
 Planning Time: 0.109 ms
 Execution Time: 12.905 ms
(6 rows)

insert into b select i from generate_series(1,100000)i;
insert into b select i from generate_series(1,100000)i;


For multiple batches:

Hash Join
gpadmin=# explain analyze select * from a where a.i in (select b.i from b);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=6476.00..7659.50 rows=10000 width=4) (actual time=89.915..121.016 rows=10000 loops=1)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.020..1.779 rows=10000 loops=1)
   ->  Hash  (cost=3030.00..3030.00 rows=210000 width=4) (actual time=89.666..89.667 rows=210000 loops=1)
         Buckets: 131072  Batches: 4  Memory Usage: 2860kB
         ->  Seq Scan on b  (cost=0.00..3030.00 rows=210000 width=4) (actual time=0.015..37.686 rows=210000 loops=1)
 Planning Time: 0.256 ms
 Execution Time: 121.769 ms
(8 rows)

Plain SubPlan
gpadmin=# explain analyze select * from a where not not a.i in (select b.i from b);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..27130170.00 rows=5000 width=4) (actual time=0.030..9994.036 rows=10000 loops=1)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Materialize  (cost=0.00..4901.00 rows=210000 width=4) (actual time=0.000..0.359 rows=5000 loops=10000)
           ->  Seq Scan on b  (cost=0.00..3030.00 rows=210000 width=4) (actual time=0.010..1.673 rows=10000 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 9995.556 ms
(7 rows)

Thanks
Richard

On Wed, Oct 24, 2018 at 12:39 AM, Alexey Bashtanov <bashtanov@imap.cc> wrote:
Hello Richard,

Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would not
be considered when pulling up sublinks. For instance:

gpadmin=# explain select * from a where NOT NOT (a.i in (select b.i from b));
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on a  (cost=51.50..85.62 rows=1005 width=8)
   Filter: (hashed SubPlan 1)
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..44.00 rows=3000 width=4)
(4 rows)


Should we give it a chance, like the attached does?

Sometimes hashed subplan is faster than hash join and than all the other options, as it preserves the order.
Using NOT NOT, one can control whether to use it or not:
https://pgblog.bashtanov.com/2017/12/08/double-negative-and-query-performance/ (test case and results in the bottom of the page).

Surely dirty tricks should not be the way to control the planner, but when breaking them we should probably provide a way to achieve the same result,
ideally making the planner choose the best plan without hints.

Best,
  Alex

Re: Pull up sublink of type 'NOT NOT (expr)'

От
Tom Lane
Дата:
Richard Guo <riguo@pivotal.io> writes:
> Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would
> not be considered when pulling up sublinks.

Yup.

> Should we give it a chance, like the attached does?

What is the argument that this occurs often enough to be worth expending
extra cycles and code space on?

If we do do something like this, I'd be inclined to make it handle
any-number-of-consecutive-NOTs, and maybe remove NOT NOT over an ANY,
not just EXISTS.  But I don't honestly think that it's worth troubling
over.  Do even the dumbest ORMs generate such code?

            regards, tom lane


Re: Pull up sublink of type 'NOT NOT (expr)'

От
Richard Guo
Дата:
Hi Tom,

Thanks for reviewing.

On Tue, Nov 13, 2018 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <riguo@pivotal.io> writes:
> Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would
> not be considered when pulling up sublinks.

Yup.

> Should we give it a chance, like the attached does?

What is the argument that this occurs often enough to be worth expending
extra cycles and code space on?

If we do do something like this, I'd be inclined to make it handle
any-number-of-consecutive-NOTs, and maybe remove NOT NOT over an ANY,
not just EXISTS.  But I don't honestly think that it's worth troubling
over.  Do even the dumbest ORMs generate such code?

What this patch does is to recursively remove NOT NOT over a SubLink, so it 
actually can handle any-number-of-consecutive-NOTs, both over ANY and over EXISTS.

Over ANY:

gpadmin=# explain select * from a where not not not not a.i in (select i from b);
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=42.75..93.85 rows=1130 width=8)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..32.60 rows=2260 width=8)
   ->  Hash  (cost=40.25..40.25 rows=200 width=4)
         ->  HashAggregate  (cost=38.25..40.25 rows=200 width=4)
               Group Key: b.i
               ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)

Over EXISTS:

gpadmin=# explain select * from a where not not not not exists (select 1 from b where a.i = b.i);
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=42.75..93.85 rows=1130 width=8)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..32.60 rows=2260 width=8)
   ->  Hash  (cost=40.25..40.25 rows=200 width=4)
         ->  HashAggregate  (cost=38.25..40.25 rows=200 width=4)
               Group Key: b.i
               ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)


I am not using an ORM, but just considering maybe it would be better if PostgreSQL can do such pull-up.
Tom, what's your suggestion? Is it worthwhile expending several lines of codes to do this pull-up?

Thanks
Richard

Re: Pull up sublink of type 'NOT NOT (expr)'

От
Tom Lane
Дата:
Richard Guo <riguo@pivotal.io> writes:
> On Tue, Nov 13, 2018 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What is the argument that this occurs often enough to be worth expending
>> extra cycles and code space on?

> I am not using an ORM, but just considering maybe it would be better if
> PostgreSQL can do such pull-up.
> Tom, what's your suggestion? Is it worthwhile expending several lines of
> codes to do this pull-up?

Well, really, if we're just doing this on speculation and don't even have
any concrete indication that anybody writes code like that, I can't get
excited about expending even a few lines of code on it.

The reason why we perform optimizations similar to this in places like
eval_const_expressions is (IMO anyway) that transformations such as
function inlining and subquery pullup can create parse trees that look
like this, even when the original query was perfectly sane and without
obvious redundancy.  However, because pull_up_sublinks runs before any
of that, it would only ever see NOT NOT if someone had actually written
such a thing.  So to justify expending any code space or cycles on
optimizing this, you have to make the case that that actually happens
in the wild, and does so often enough to justify wasting some (admittedly
small) number of cycles for everybody else.  I'd kind of like to see some
actual field occurrence of NOT NOT over an optimizable IN/EXISTS before
deciding that it's worth doing.

It's sort of annoying that we have to run pull_up_sublinks before we do
scalar expression simplification.  If we could do that in the other order,
NOT NOT elimination would fall out automatically, and we'd also be able
to recognize some other cases where it initially seems that an IN or
EXISTS is not at top level, but it becomes so after we const-fold, apply
DeMorgan's laws, etc.  However, according to the point I made above,
it makes more sense to apply expression simplification after we've
completed pullup-like operations.  So we can't really get there from
here unless we wanted to do (parts of?) expression simplification twice,
which is unlikely to win often enough to justify the cost.

So I'm inclined to reject this patch as probably being a net loss in
almost all cases.  If you can show any real-world cases where it wins,
we could reconsider.

            regards, tom lane


Re: Pull up sublink of type 'NOT NOT (expr)'

От
Richard Guo
Дата:
Hi Tom,

Thanks for your input.

On Fri, Nov 16, 2018 at 6:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <riguo@pivotal.io> writes:
> On Tue, Nov 13, 2018 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What is the argument that this occurs often enough to be worth expending
>> extra cycles and code space on?

> I am not using an ORM, but just considering maybe it would be better if
> PostgreSQL can do such pull-up.
> Tom, what's your suggestion? Is it worthwhile expending several lines of
> codes to do this pull-up?

Well, really, if we're just doing this on speculation and don't even have
any concrete indication that anybody writes code like that, I can't get
excited about expending even a few lines of code on it.

The reason why we perform optimizations similar to this in places like
eval_const_expressions is (IMO anyway) that transformations such as
function inlining and subquery pullup can create parse trees that look
like this, even when the original query was perfectly sane and without
obvious redundancy.  However, because pull_up_sublinks runs before any
of that, it would only ever see NOT NOT if someone had actually written
such a thing.  So to justify expending any code space or cycles on
optimizing this, you have to make the case that that actually happens
in the wild, and does so often enough to justify wasting some (admittedly
small) number of cycles for everybody else.  I'd kind of like to see some
actual field occurrence of NOT NOT over an optimizable IN/EXISTS before
deciding that it's worth doing.

It's sort of annoying that we have to run pull_up_sublinks before we do
scalar expression simplification.  If we could do that in the other order,
NOT NOT elimination would fall out automatically, and we'd also be able
to recognize some other cases where it initially seems that an IN or
EXISTS is not at top level, but it becomes so after we const-fold, apply
DeMorgan's laws, etc.  However, according to the point I made above,
it makes more sense to apply expression simplification after we've
completed pullup-like operations.  So we can't really get there from
here unless we wanted to do (parts of?) expression simplification twice,
which is unlikely to win often enough to justify the cost.

Agree! If we do expression simplification in the other order, we have to do that twice,
which is not a good idea.
 

So I'm inclined to reject this patch as probably being a net loss in
almost all cases.  If you can show any real-world cases where it wins,
we could reconsider.

I am convinced. I do not see this happen often enough in real world neither.
Feel free to reject his patch.

Thanks
Richard