Re: Converting NOT IN to anti-joins during planning

Поиск
Список
Период
Сортировка
От Li, Zheng
Тема Re: Converting NOT IN to anti-joins during planning
Дата
Msg-id AB147063-56E3-49A9-82B6-E8E57C1363BD@amazon.com
обсуждение исходный текст
Ответ на Re: Converting NOT IN to anti-joins during planning  (David Rowley <david.rowley@2ndquadrant.com>)
Список pgsql-hackers
----- 
   The big "IF" here is if we can calculate the size of the subplan to
    know if it'll be hashed or not at the point in planning where this
    conversion is done. I personally can't quite see how that'll work
    reliably without actually planning the subquery, which I really doubt
    is something we'd consider doing just for a cost estimate. Remember
    the subquery may not just be a single relation scan, it could be a
    complex query containing many joins and UNION / GROUP BY / DISTINCT /
    HAVING clauses etc.
-----

In our latest patch, we plan the subquery right before conversion, we only
proceed with the ANTI JOIN conversion if subplan_is_hashable(subplan) is
false. To avoid re-planning the subquery again in a later phase, I think we can
keep a pointer to the subplan in SubLink.

-----------
Zheng Li
AWS, Amazon Aurora PostgreSQL
 

On 6/14/19, 5:51 AM, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

    On Fri, 14 Jun 2019 at 20:41, Simon Riggs <simon@2ndquadrant.com> wrote:
    >
    > On Wed, 6 Mar 2019 at 04:11, David Rowley <david.rowley@2ndquadrant.com> wrote:
    >>
    >> Hi Jim,
    >>
    >> Thanks for replying here.
    >>
    >> On Wed, 6 Mar 2019 at 16:37, Jim Finnerty <jfinnert@amazon.com> wrote:
    >> >
    >> > Actually, we're working hard to integrate the two approaches.  I haven't had
    >> > time since I returned to review your patch, but I understand that you were
    >> > checking for strict predicates as part of the nullness checking criteria,
    >> > and we definitely must have that.  Zheng tells me that he has combined your
    >> > patch with ours, but before we put out a new patch, we're trying to figure
    >> > out how to preserve the existing NOT IN execution plan in the case where the
    >> > materialized subplan fits in memory.  This (good) plan is effectively an
    >> > in-memory hash anti-join.
    >> >
    >> > This is tricky to do because the NOT IN Subplan to anti-join transformation
    >> > currently happens early in the planning process, whereas the decision to
    >> > materialize is made much later, when the best path is being converted into a
    >> > Plan.
    >>
    >> I guess you're still going with the OR ... IS NULL in your patch then?
    >> otherwise, you'd likely find that the transformation (when NULLs are
    >> not possible) is always a win since it'll allow hash anti-joins. (see
    >> #2 in the original email on this thread)  FWIW I mentioned in [1] and
    >> Tom confirmed in [2] that we both think hacking the join condition to
    >> add an OR .. IS NULL is a bad idea. I guess you're not deterred by
    >> that?
    >
    >
    > Surely we want both?
    >
    > 1. Transform when we can
    > 2. Else apply some other approach if the cost can be reduced by doing it
    
    Maybe.  If the scope for the conversion is reduced to only add the OR
    .. IS NULL join clause when the subplan could not be hashed then it's
    maybe less likely to cause performance regressions. Remember that this
    forces the planner to use a nested loop join since no other join
    algorithms support OR clauses. I think Jim and Zheng have now changed
    their patch to do that.  If we can perform a parameterised nested loop
    join then that has a good chance of being better than scanning the
    subquery multiple times, however, if there's no index to do a
    parameterized nested loop, then we need to do a normal nested loop
    which will perform poorly, but so will the non-hashed subplan...
    
    # create table t1 (a int);
    CREATE TABLE
    # create table t2 (a int);
    CREATE TABLE
    # set work_mem = '64kB';
    SET
    # insert into t1 select generate_Series(1,10000);
    INSERT 0 10000
    # insert into t2 select generate_Series(1,10000);
    INSERT 0 10000
    # explain analyze select count(*) from t1 where a not in(select a from t2);
                                                            QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=1668739.50..1668739.51 rows=1 width=8) (actual
    time=7079.077..7079.077 rows=1 loops=1)
       ->  Seq Scan on t1  (cost=0.00..1668725.16 rows=5738 width=0)
    (actual time=7079.072..7079.072 rows=0 loops=1)
             Filter: (NOT (SubPlan 1))
             Rows Removed by Filter: 10000
             SubPlan 1
               ->  Materialize  (cost=0.00..262.12 rows=11475 width=4)
    (actual time=0.004..0.397 rows=5000 loops=10000)
                     ->  Seq Scan on t2  (cost=0.00..159.75 rows=11475
    width=4) (actual time=0.019..4.921 rows=10000 loops=1)
     Planning Time: 0.348 ms
     Execution Time: 7079.309 ms
    (9 rows)
    
    # explain analyze select count(*) from t1 where not exists(select 1
    from t2 where t1.a = t2.a or t2.a is null);
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=1250873.25..1250873.26 rows=1 width=8) (actual
    time=7263.980..7263.980 rows=1 loops=1)
       ->  Nested Loop Anti Join  (cost=0.00..1250858.97 rows=5709
    width=0) (actual time=7263.976..7263.976 rows=0 loops=1)
             Join Filter: ((t1.a = t2.a) OR (t2.a IS NULL))
             Rows Removed by Join Filter: 49995000
             ->  Seq Scan on t1  (cost=0.00..159.75 rows=11475 width=4)
    (actual time=0.013..2.350 rows=10000 loops=1)
             ->  Materialize  (cost=0.00..262.12 rows=11475 width=4)
    (actual time=0.004..0.396 rows=5000 loops=10000)
                   ->  Seq Scan on t2  (cost=0.00..159.75 rows=11475
    width=4) (actual time=0.007..4.075 rows=10000 loops=1)
     Planning Time: 0.086 ms
     Execution Time: 7264.141 ms
    (9 rows)
    
    When the index exists the transformation is certainly much better.
    
    # create index on t2(a);
    CREATE INDEX
    # explain analyze select count(*) from t1 where not exists(select 1
    from t2 where t1.a = t2.a or t2.a is null);
                                                                  QUERY
    PLAN

---------------------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=111342.50..111342.51 rows=1 width=8) (actual
    time=29.580..29.581 rows=1 loops=1)
       ->  Nested Loop Anti Join  (cost=7.10..111342.50 rows=1 width=0)
    (actual time=29.574..29.622 rows=0 loops=1)
             ->  Seq Scan on t1  (cost=0.00..145.00 rows=10000 width=4)
    (actual time=0.010..0.883 rows=10000 loops=1)
             ->  Bitmap Heap Scan on t2  (cost=7.10..11.11 rows=1 width=4)
    (actual time=0.002..0.002 rows=1 loops=10000)
                   Recheck Cond: ((t1.a = a) OR (a IS NULL))
                   Heap Blocks: exact=10000
                   ->  BitmapOr  (cost=7.10..7.10 rows=1 width=0) (actual
    time=0.002..0.002 rows=0 loops=10000)
                         ->  Bitmap Index Scan on t2_a_idx
    (cost=0.00..0.30 rows=1 width=0) (actual time=0.001..0.001 rows=1
    loops=10000)
                               Index Cond: (a = t1.a)
                         ->  Bitmap Index Scan on t2_a_idx
    (cost=0.00..4.29 rows=1 width=0) (actual time=0.001..0.001 rows=0
    loops=10000)
                               Index Cond: (a IS NULL)
     Planning Time: 0.311 ms
     Execution Time: 29.670 ms
    (13 rows)
    
    The big "IF" here is if we can calculate the size of the subplan to
    know if it'll be hashed or not at the point in planning where this
    conversion is done. I personally can't quite see how that'll work
    reliably without actually planning the subquery, which I really doubt
    is something we'd consider doing just for a cost estimate. Remember
    the subquery may not just be a single relation scan, it could be a
    complex query containing many joins and UNION / GROUP BY / DISTINCT /
    HAVING clauses etc.
    
    However, if there turns out to be some good way to do that that I
    can't see then I think that each patch should be separate so that they
    can be accepted or rejected on their own merits. The problem, for now,
    is that the patches conflict with each other. I don't really want to
    base mine on Jim and Zheng's patch, perhaps they feel the same about
    basing theirs on mine.
    
    -- 
     David Rowley                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
    


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Stephen Frost
Дата:
Сообщение: Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Следующее
От: Fabien COELHO
Дата:
Сообщение: Re: pg_dump multi VALUES INSERT