Обсуждение: Optimizing NOT IN plans / verify rewrite

Поиск
Список
Период
Сортировка

Optimizing NOT IN plans / verify rewrite

От
Maciek Sakrejda
Дата:
Hi,

I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
subqueries:

DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM bar
cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type != 'o');

The plan produced for this is:

                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_type_index on foo  (cost=17851.93..1271633830.75
rows=66410 width=6)
   Index Cond: (type = 'o'::bpchar)
   Filter: ((NOT (subplan)) AND (NOT (subplan)))
   SubPlan
     ->  Materialize  (cost=6077.87..10238.57 rows=299170 width=8)
           ->  Seq Scan on bar cqc  (cost=0.00..4609.70 rows=299170 width=8)
     ->  Materialize  (cost=11774.06..15728.45 rows=284339 width=8)
           ->  Seq Scan on foo car  (cost=0.00..10378.73 rows=284339 width=8)
                 Filter: (type <> 'o'::bpchar)
(9 rows)


Unfortunately, when these tables get large-ish, the materilzations get
really expensive to re-scan for every tuple (see cost above). At the
moment, I have ~500k rows in foo and ~300k rows in bar. The
selectivity of type = 'o' is ~50%. I've tried to re-write the query as
follows:

DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as b FROM
(SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM foo
cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND cqar2.type
!= 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER JOIN bar ON
(candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS NULL AND
bar.b IS NULL);

This gives the more sensible plan:

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=48999.81..71174.41 rows=66410 width=6)
   Hash Cond: (foo.b = cqar1.b)
   ->  Seq Scan on foo  (cost=0.00..9003.78 rows=549978 width=14)
   ->  Hash  (cost=47909.68..47909.68 rows=66410 width=8)
         ->  Hash Left Join  (cost=24562.29..47909.68 rows=66410 width=8)
               Hash Cond: (cqar1.b = bar.b)
               Filter: (bar.b IS NULL)
               ->  Hash Left Join  (cost=15043.96..33635.58 rows=132820 width=8)
                     Hash Cond: (cqar1.b = cqar2.b)
                     Filter: (cqar2.b IS NULL)
                     ->  Seq Scan on foo cqar1  (cost=0.00..10378.73
rows=265639 width=8)
                           Filter: (type = 'o'::bpchar)
                     ->  Hash  (cost=10378.73..10378.73 rows=284339 width=8)
                           ->  Seq Scan on foo cqar2
(cost=0.00..10378.73 rows=284339 width=8)
                                 Filter: (type <> 'o'::bpchar)
               ->  Hash  (cost=4609.70..4609.70 rows=299170 width=8)
                     ->  Seq Scan on bar  (cost=0.00..4609.70
rows=299170 width=8)
(17 rows)


As far as I can tell, the results are identical.

My questions

1. Is my rewrite valid?
2. Any way to reliably achieve the second plan (or really, any plan
that doesn't rescan ~~500k tuples per each of ~250k tuples) by
tweaking (per-session) planner constants? I've played with this a
little, but without much success. As with any rewrite situation, I'd
prefer to stick with the simpler, more explicit original query.

Here is a SQL script to reproduce the problem:

\set ON_ERROR_STOP

drop schema if exists not_in_test cascade;
create schema not_in_test;

set search_path to not_in_test;

create table foo (
       a oid not null,
       b bigint not null,
       type char not null,
       ts timestamp without time zone not null
);
create index "foo_b_a_type_index" on foo (b, a, type);
create index "foo_a_index" on foo (a);
create index "foo_type_index" on foo(type);

create table bar (
       b bigint unique not null,
       c timestamp with time zone not null
);
create index "bar_b_index" on bar(b);

insert into foo select (random()*10)::integer,
generate_series(1,550000), case when random() > 0.5 then 'o' else 'x'
end, now();
insert into bar select val, now() from generate_series(1,1200000) as
vals(val) where random() > 0.75;

analyze foo;
analyze bar;

EXPLAIN DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b
FROM bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type
!= 'o');
EXPLAIN DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as
b FROM (SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM
foo cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND
cqar2.type != 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER
JOIN bar ON (candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS
NULL AND bar.b IS NULL);


Thanks,
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Re: Optimizing NOT IN plans / verify rewrite

От
"Kevin Grittner"
Дата:
Maciek Sakrejda <msakrejda@truviso.com> wrote:

> DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM
> bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type !=
> 'o');

Can "b" be null in any of these tables?  If not, then you can
rewrite your query to us NOT EXISTS and have the same semantics.
That will often be much faster.  Something like:

DELETE FROM foo
  WHERE type = 'o'
    AND NOT EXISTS (SELECT * FROM bar cqc where cqc.b = foo.b)
    AND NOT EXISTS (SELECT * FROM foo car WHERE car.b = foo.b
                    AND car.type <> 'o');

-Kevin

Re: Optimizing NOT IN plans / verify rewrite

От
Maciek Sakrejda
Дата:
> Can "b" be null in any of these tables?  If not, then you can
> rewrite your query to us NOT EXISTS and have the same semantics.
> That will often be much faster.

Thanks, Kevin.

No NULLs. It looks like it's a good deal slower than the LOJ version,
but a good deal faster than the original. Since the equivalence of
semantics is much easier to verify here, we may go with this (at least
for the moment).
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 230
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Re: Optimizing NOT IN plans / verify rewrite

От
"Kevin Grittner"
Дата:
Maciek Sakrejda <msakrejda@truviso.com> wrote:

> No NULLs. It looks like it's a good deal slower than the LOJ
> version, but a good deal faster than the original.

On 8.4 and later the NOT EXISTS I suggested is a bit faster than
your fast version, since Tom did some very nice work in this area,
implementing semi join and anti join.  If you've got much load with
this kind of query, it might be worth upgrading.

-Kevin

Re: Optimizing NOT IN plans / verify rewrite

От
Andres Freund
Дата:
Hi,

On Mon, Aug 02, 2010 at 12:12:51PM -0700, Maciek Sakrejda wrote:
> I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
> subqueries:
With 8.3 you will have to use manual antijoins (i.e LEFT JOIN
... WHERE NULL). If you use 8.4 NOT EXISTS() will do that
automatically in many cases (contrary to NOT IN () which has strange
NULL semantics).

Anbdres

Re: Optimizing NOT IN plans / verify rewrite

От
Maciek Sakrejda
Дата:
All fields involved are declared NOT NULL, but thanks for the heads up.
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Re: Optimizing NOT IN plans / verify rewrite

От
Dave Crooke
Дата:
With Oracle, I've found an anti-union (MINUS in Oracle, EXCEPT in PGSQL) to be often a bit better than an anti-join, which is in turn faster than NOT IN. Depends of course on row distribution and index layouts, and a bunch of other details.

Depending on what you're returning, it can pay to make sure this computation is done with the shortest possible rows, if necessary using a subquery.

Cheers
Dave

On Mon, Aug 2, 2010 at 2:49 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Maciek Sakrejda <msakrejda@truviso.com> wrote:

> No NULLs. It looks like it's a good deal slower than the LOJ
> version, but a good deal faster than the original.

On 8.4 and later the NOT EXISTS I suggested is a bit faster than
your fast version, since Tom did some very nice work in this area,
implementing semi join and anti join.  If you've got much load with
this kind of query, it might be worth upgrading.

-Kevin

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Optimizing NOT IN plans / verify rewrite

От
Andres Freund
Дата:
On Mon, Aug 02, 2010 at 01:06:00PM -0700, Maciek Sakrejda wrote:
> All fields involved are declared NOT NULL, but thanks for the heads up.
Afair the planner doesnt use that atm.

Andres

Re: Optimizing NOT IN plans / verify rewrite

От
Maciek Sakrejda
Дата:
>> All fields involved are declared NOT NULL, but thanks for the heads up.
>Afair the planner doesnt use that atm.

I was referring to not having to care about the strange NULL semantics
(as per your original comment), since I have no NULLs. Given that, I
think the NOT EXISTS could be a good solution, even on 8.3 (we're
planning to upgrade, but it's not a feasible solution to this
particular problem), no?

Basically, it seems like the main issue with the current plans is the
per-tuple seq scans on the full materializations. Adding correlation
(by rewriting NOT IN as NOT EXISTS) prevents materialization, hence
getting rid of the biggest performance problem.

Thanks,
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Re: Optimizing NOT IN plans / verify rewrite

От
"Kevin Grittner"
Дата:
Dave Crooke <dcrooke@gmail.com> wrote:

> With Oracle, I've found an anti-union (MINUS in Oracle, EXCEPT in
> PGSQL) to be often a bit better than an anti-join, which is in
> turn faster than NOT IN. Depends of course on row distribution and
> index layouts, and a bunch of other details.

I found that assertion intriguing, so I tested the "fast" query from
the original post against my suggestion and a version using EXCEPT.
(This was against the development HEAD, not any release.)

OP "fast":  32.9 seconds
NOT EXISTS: 11.2 seconds
EXCEPT:      7.7 seconds

That last was using this query, which just might work OK on 8.3:

DELETE FROM foo
  where foo.b in (
    select b from foo WHERE type = 'o'
    except SELECT b FROM bar
    except SELECT b FROM foo where type <> 'o');

I wonder whether this could make a reasonable alternative plan for
the optmizer to consider some day....

-Kevin

Re: Optimizing NOT IN plans / verify rewrite

От
"Kevin Grittner"
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> DELETE FROM foo
>   where foo.b in (
>     select b from foo WHERE type = 'o'
>     except SELECT b FROM bar
>     except SELECT b FROM foo where type <> 'o');

Oops.  Maybe before I get excited I should try it with a query which
is actually logically equivalent.  :-(

-Kevin

Re: Optimizing NOT IN plans / verify rewrite

От
"Kevin Grittner"
Дата:
Kevin Grittner <Kgrittn@wicourts.gov> wrote:

> Maybe before I get excited I should try it with a query which is
> actually logically equivalent.

Fixed version:

DELETE FROM foo
  where type = 'o' and foo.b in (
  select b from foo WHERE type = 'o'
  except SELECT b FROM bar
  except SELECT b FROM foo where type <> 'o');

The change didn't affect run time significantly; it still beats the
others.

-Kevin


Re: Optimizing NOT IN plans / verify rewrite

От
Andres Freund
Дата:
On Mon, Aug 02, 2010 at 01:35:13PM -0700, Maciek Sakrejda wrote:
> >> All fields involved are declared NOT NULL, but thanks for the heads up.
> >Afair the planner doesnt use that atm.
>
> I was referring to not having to care about the strange NULL semantics
> (as per your original comment), since I have no NULLs. Given that, I
> think the NOT EXISTS could be a good solution, even on 8.3 (we're
> planning to upgrade, but it's not a feasible solution to this
> particular problem), no?
The point is that only 8.4 will optimize that case properly. 8.3 will
generate plans which are inefficient in many (or most) cases for both
variants. I would suggest to use manual antijoins...

Re: Optimizing NOT IN plans / verify rewrite

От
Maciek Sakrejda
Дата:
>> Maybe before I get excited I should try it with a query which is
>> actually logically equivalent.

Yes, the joys of manual rewrites...

> Fixed version:
>
> DELETE FROM foo
>  where type = 'o' and foo.b in (
>  select b from foo WHERE type = 'o'
>  except SELECT b FROM bar
>  except SELECT b FROM foo where type <> 'o');
>
> The change didn't affect run time significantly; it still beats the
> others.

On my 8.3, it still performs a little worse than your original
correlated EXCEPT (which is actually on par with the antijoin in 8.3,
but significantly better in 8.4). In 8.4, this EXCEPT version does
seem somewhat better.

It looks like according to Andres, though, I should not be depending
on these plans with 8.3, so I may want to stick with the manual
antijoin.

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com