Обсуждение: plan variations: join vs. exists vs. row comparison

Поиск
Список
Период
Сортировка

plan variations: join vs. exists vs. row comparison

От
Jon Nelson
Дата:
Originally, I posted to -general but I found some time to write some
samples, and realized it's probably more of a performance question.

The original post is here:
http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php

I was hoping that somebody could help me understand the differences
between three plans.
All of the plans are updating a table using a second table, and should
be logically equivalent.
Two of the plans use joins, and one uses an exists subquery.
One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
this plan which has really awful performance.
Clearly it is due to the nested loop, but why would the planner choose
that approach?

I also don't understand why in the 'exists' plan the planner thinks
the index scan will provide 1019978 rows, when there are only 1000000,
but that is a lesser issue.

Here is a sample SQL file which demonstrates the issues and includes
all three variations.

begin;
create temporary table t7 (
  i BIGINT NOT NULL,
  k BIGINT
);

create temporary table t8 (
  i BIGINT NOT NULL,
  j INT
);

CREATE FUNCTION populate_t8()
RETURNS VOID
LANGUAGE SQL
AS
$$
truncate t8;
insert into t8
SELECT i, 1 from t7
ORDER BY i LIMIT 10000;

insert into t8
SELECT i, 2 from t7
WHERE i > 10000
ORDER BY i LIMIT 10000;

SELECT i, 3 from t7
WHERE i > 20000
ORDER BY i LIMIT 20000;

analyze t8;
$$;

INSERT INTO t7
select x, x + 10 from generate_series(1,1000000) as x ;
analyze t7;

select populate_t8();

explain analyze verbose
update
  t7
SET
  k = 1
FROM
  t8
WHERE
  t7.i = t8.i
  AND
  (
    t8.j = 2
    OR
    t8.j = 1
  );

select populate_t8();

explain analyze verbose
update
  t7
SET
  k = 1
WHERE
  EXISTS (
    SELECT 1 FROM t8
    WHERE t8.i = t7.i
    AND
    (
      t8.j = 2
      OR
      t8.j = 1
    )
  );

select populate_t8();

explain
update
  t7
SET
  k = 1
FROM
  t8
WHERE
  ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i)
  AND
  (
    t8.j = 2
    OR
    t8.j = 1
  );

explain analyze verbose
update
  t7
SET
  k = 1
FROM
  t8
WHERE
  ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i)
  AND
  (
    t8.j = 2
    OR
    t8.j = 1
  );

rollback;





--
Jon

Re: plan variations: join vs. exists vs. row comparison

От
Tom Lane
Дата:
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> I was hoping that somebody could help me understand the differences
> between three plans.
> All of the plans are updating a table using a second table, and should
> be logically equivalent.
> Two of the plans use joins, and one uses an exists subquery.
> One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
> this plan which has really awful performance.
> Clearly it is due to the nested loop, but why would the planner choose
> that approach?

IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be
an indexqual, merge join qual, or hash join qual.  So it's not
surprising that you get a sucky plan for it.  Possibly somebody will
work on improving that someday.

As for your other questions, what PG version are you using?  Because I
do get pretty much the same plan (modulo a plain join versus a semijoin)
for the first two queries, when using 9.0 or later.  And the results of
ANALYZE are only approximate, so you shouldn't be surprised at all if a
rowcount estimate is off by a couple percent.  Most of the time, you
should be happy if it's within a factor of 2 of reality.

            regards, tom lane

Re: plan variations: join vs. exists vs. row comparison

От
Merlin Moncure
Дата:
On Mon, Mar 7, 2011 at 1:07 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Originally, I posted to -general but I found some time to write some
> samples, and realized it's probably more of a performance question.
>
> The original post is here:
> http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php
>
> I was hoping that somebody could help me understand the differences
> between three plans.
> All of the plans are updating a table using a second table, and should
> be logically equivalent.
> Two of the plans use joins, and one uses an exists subquery.
> One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
> this plan which has really awful performance.

The problem is really coming from SQL: it requires row wise
comparisons to be of all fields in left to right order and the fact
that you can't match NULL to NULL with =.

If you have a table with a,b,c, (1,1,NULL) is not distinct from (1,2,3) becomes:
Filter: ((NOT (a IS DISTINCT FROM 1)) AND (NOT (b IS DISTINCT FROM 1))
AND (NOT (c IS DISTINCT FROM NULL::integer)))

At present postgresql does not have the facilities to turn that into
an index lookup.  SQL doesn't allow the way you'd want to write this
the way you'd really like to:

select * from v where (a,b,c) = (1,1,NULL);

because the comparison can't be applied from a row to another row but
only between the member fields. You can cheat the system, but only if
you reserve a special index for that purpose:

create table v(a int, b int, c int);
create index on v(v);
select * from v where v = (1,1, NULL) will match as 'is not distinct
from' does, using the index.  This is because composite type
comparison (as opposed to its fields) follows a different code path.
Confused yet?  You can also use the above trick with a type if you are
not comparing all fields of 'v':

create type foo(a int, b int);
create index on v(((a,b)::foo));
select * from v where (a,b)::foo = (1,1);

will get you field subset comparison with index.  Note if you do the
above, the index can only match on the entire composite, not
particular fields...

merlin

Re: plan variations: join vs. exists vs. row comparison

От
Jon Nelson
Дата:
On Mon, Mar 7, 2011 at 2:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> I was hoping that somebody could help me understand the differences
>> between three plans.
>> All of the plans are updating a table using a second table, and should
>> be logically equivalent.
>> Two of the plans use joins, and one uses an exists subquery.
>> One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
>> this plan which has really awful performance.
>> Clearly it is due to the nested loop, but why would the planner choose
>> that approach?
>
> IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be
> an indexqual, merge join qual, or hash join qual.  So it's not
> surprising that you get a sucky plan for it.  Possibly somebody will
> work on improving that someday.
>
> As for your other questions, what PG version are you using?  Because I
> do get pretty much the same plan (modulo a plain join versus a semijoin)
> for the first two queries, when using 9.0 or later.  And the results of
> ANALYZE are only approximate, so you shouldn't be surprised at all if a
> rowcount estimate is off by a couple percent.  Most of the time, you
> should be happy if it's within a factor of 2 of reality.

Sorry - I had stated in the original post that I was using 8.4.5 on 64
bit openSUSE and CentOS 5.5, and had forgotten to carry that
information over into the second post.

What is the difference between a plain join and a semi join?

--
Jon

Re: plan variations: join vs. exists vs. row comparison

От
"Kevin Grittner"
Дата:
Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

> What is the difference between a plain join and a semi join?

As soon as a semi join finds a match it stops looking for more.

-Kevin