Обсуждение: Bad query plan

Поиск
Список
Период
Сортировка

Bad query plan

От
Дмитрий Васильев
Дата:
I have a problem with poor query plan.

My PostgreSQL is "PostgreSQL 8.4.8, compiled by Visual C++ build 1400,
32-bit" installed by EnterpriseDB installer on Windows 7 32 bit.

Steps to reproduce:

Start with fresh installation and execute the following:

drop table if exists small;
drop table if exists large;

CREATE TABLE small
(
  id bigint,
  primary key(id)
);

CREATE TABLE large
(
  id bigint,
  primary key(id)
);

--Insert 100000 rows into large
CREATE or replace FUNCTION populate_large() RETURNS bigint AS $$
DECLARE
    id1 bigint := 0;
BEGIN
    LOOP
        insert into large(id) values(id1);
        id1 := id1 +1;
        if id1>100000 then
            exit;
        end if;
    END LOOP;
    return id1;
END
$$ LANGUAGE plpgsql;

--Insert 1000 rows into small
CREATE or replace FUNCTION populate_small() RETURNS bigint AS $$
DECLARE
    id1 bigint := 0;
BEGIN
    LOOP
        insert into small(id) values(id1);
        id1 := id1 +1;
        if id1>1000 then
            exit;
        end if;
    END LOOP;
    return id1;
END
$$ LANGUAGE plpgsql;

select populate_large(),populate_small();
analyze;

Then execute

explain analyze insert into large(id) select id from small where id
not in(select id from large);

It gives

"Seq Scan on small  (cost=1934.01..823278.28 rows=500 width=8) (actual
time=6263.588..6263.588 rows=0 loops=1)"
"  Filter: (NOT (SubPlan 1))"
"  SubPlan 1"
"    ->  Materialize  (cost=1934.01..3325.02 rows=100001 width=8)
(actual time=0.007..3.012 rows=501 loops=1001)"
"          ->  Seq Scan on large  (cost=0.00..1443.01 rows=100001
width=8) (actual time=0.010..5.810 rows=1001 loops=1)"
"Total runtime: 6263.703 ms"

But

explain analyze insert into large(id) select id from small where not
exists (select id from large l where small.id=l.id);

exeutes much faster:

"Merge Anti Join  (cost=0.00..85.58 rows=1 width=8) (actual
time=15.793..15.793 rows=0 loops=1)"
"  Merge Cond: (small.id = l.id)"
"  ->  Index Scan using small_pkey on small  (cost=0.00..43.27
rows=1001 width=8) (actual time=0.025..3.515 rows=1001 loops=1)"
"  ->  Index Scan using large_pkey on large l  (cost=0.00..3050.28
rows=100001 width=8) (actual time=0.017..2.932 rows=1001 loops=1)"
"Total runtime: 15.863 ms"

Both queries are semantically the same.

Re: Bad query plan

От
Tom Lane
Дата:
=?KOI8-R?B?5M3J1NLJyiD3wdPJzNjF1w==?= <dmitry.vasil@gmail.com> writes:
> explain analyze insert into large(id) select id from small where id
> not in(select id from large);
> [ crummy plan ]
> explain analyze insert into large(id) select id from small where not
> exists (select id from large l where small.id=l.id);
> [ better plan ]
> Both queries are semantically the same.

No, they are not.  NOT IN is hard to optimize because it has strange
behaviors with nulls in the data.  Use the NOT EXISTS formulation.

            regards, tom lane

Re: Bad query plan

От
Gavin Flower
Дата:
On 25/07/11 02:06, Дмитрий Васильев wrote:
> I have a problem with poor query plan.
>
> My PostgreSQL is "PostgreSQL 8.4.8, compiled by Visual C++ build 1400,
> 32-bit" installed by EnterpriseDB installer on Windows 7 32 bit.
>
> Steps to reproduce:
>
> Start with fresh installation and execute the following:
>
> drop table if exists small;
> drop table if exists large;
>
> CREATE TABLE small
> (
>    id bigint,
>    primary key(id)
> );
>
> CREATE TABLE large
> (
>    id bigint,
>    primary key(id)
> );
>
> --Insert 100000 rows into large
> CREATE or replace FUNCTION populate_large() RETURNS bigint AS $$
> DECLARE
>      id1 bigint := 0;
> BEGIN
>      LOOP
>          insert into large(id) values(id1);
>          id1 := id1 +1;
>          if id1>100000 then
>              exit;
>          end if;
>      END LOOP;
>      return id1;
> END
> $$ LANGUAGE plpgsql;
>
> --Insert 1000 rows into small
> CREATE or replace FUNCTION populate_small() RETURNS bigint AS $$
> DECLARE
>      id1 bigint := 0;
> BEGIN
>      LOOP
>          insert into small(id) values(id1);
>          id1 := id1 +1;
>          if id1>1000 then
>              exit;
>          end if;
>      END LOOP;
>      return id1;
> END
> $$ LANGUAGE plpgsql;
>
> select populate_large(),populate_small();
> analyze;
>
> Then execute
>
> explain analyze insert into large(id) select id from small where id
> not in(select id from large);
>
> It gives
>
> "Seq Scan on small  (cost=1934.01..823278.28 rows=500 width=8) (actual
> time=6263.588..6263.588 rows=0 loops=1)"
> "  Filter: (NOT (SubPlan 1))"
> "  SubPlan 1"
> "    ->   Materialize  (cost=1934.01..3325.02 rows=100001 width=8)
> (actual time=0.007..3.012 rows=501 loops=1001)"
> "          ->   Seq Scan on large  (cost=0.00..1443.01 rows=100001
> width=8) (actual time=0.010..5.810 rows=1001 loops=1)"
> "Total runtime: 6263.703 ms"
>
> But
>
> explain analyze insert into large(id) select id from small where not
> exists (select id from large l where small.id=l.id);
>
> exeutes much faster:
>
> "Merge Anti Join  (cost=0.00..85.58 rows=1 width=8) (actual
> time=15.793..15.793 rows=0 loops=1)"
> "  Merge Cond: (small.id = l.id)"
> "  ->   Index Scan using small_pkey on small  (cost=0.00..43.27
> rows=1001 width=8) (actual time=0.025..3.515 rows=1001 loops=1)"
> "  ->   Index Scan using large_pkey on large l  (cost=0.00..3050.28
> rows=100001 width=8) (actual time=0.017..2.932 rows=1001 loops=1)"
> "Total runtime: 15.863 ms"
>
> Both queries are semantically the same.
>
Out of interest, I ran your code on my existing 9.1beta3 installation.

Notes
(1) the second SELECT ran a faster than the first.
(2) both plans are different to the ones you got

$ psql
psql (9.1beta3)
[...]
gavin=> explain analyze insert into large(id) select id from small where id
gavin-> not in(select id from large);
                                                         QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
  Insert on large  (cost=1543.01..1559.02 rows=500 width=8) (actual
time=51.090..51.090 rows=0 loops=1)
    ->  Seq Scan on small  (cost=1543.01..1559.02 rows=500 width=8)
(actual time=51.087..51.087 rows=0 loops=1)
          Filter: (NOT (hashed SubPlan 1))
          SubPlan 1
            ->  Seq Scan on large  (cost=0.00..1443.01 rows=100001
width=8) (actual time=0.008..13.867 rows=100001 loops=1)
  Total runtime: 51.582 ms
(6 rows)

gavin=> explain analyze insert into large(id) select id from small where not
gavin-> exists (select id from large l where small.id=l.id);
                                                                 QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
  Insert on large  (cost=0.00..80.94 rows=1 width=8) (actual
time=0.907..0.907 rows=0 loops=1)
    ->  Merge Anti Join  (cost=0.00..80.94 rows=1 width=8) (actual
time=0.906..0.906 rows=0 loops=1)
          Merge Cond: (small.id = l.id)
          ->  Index Scan using small_pkey on small  (cost=0.00..40.61
rows=1001 width=8) (actual time=0.010..0.225 rows=1001 loops=1)
          ->  Index Scan using large_pkey on large l
(cost=0.00..2800.12 rows=100001 width=8) (actual time=0.006..0.235
rows=1001 loops=1)
  Total runtime: 1.000 ms
(6 rows)

postgresql.conf parameters changed:
shared_buffers = 2GB
temp_buffers = 64MB
work_mem = 16MB
maintenance_work_mem = 512MB
max_stack_depth = 6MB
checkpoint_segments = 8
cpu_index_tuple_cost = 0.0025
cpu_operator_cost = 0.001
effective_cache_size = 2GB