Обсуждение: plan variations: join vs. exists vs. row comparison
Originally, I posted to -general but I found some time to write some samples, and realized it's probably more of a performance question. The original post is here: http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php I was hoping that somebody could help me understand the differences between three plans. All of the plans are updating a table using a second table, and should be logically equivalent. Two of the plans use joins, and one uses an exists subquery. One of the plans uses row constructors and IS NOT DISTINCT FROM. It is this plan which has really awful performance. Clearly it is due to the nested loop, but why would the planner choose that approach? I also don't understand why in the 'exists' plan the planner thinks the index scan will provide 1019978 rows, when there are only 1000000, but that is a lesser issue. Here is a sample SQL file which demonstrates the issues and includes all three variations. begin; create temporary table t7 ( i BIGINT NOT NULL, k BIGINT ); create temporary table t8 ( i BIGINT NOT NULL, j INT ); CREATE FUNCTION populate_t8() RETURNS VOID LANGUAGE SQL AS $$ truncate t8; insert into t8 SELECT i, 1 from t7 ORDER BY i LIMIT 10000; insert into t8 SELECT i, 2 from t7 WHERE i > 10000 ORDER BY i LIMIT 10000; SELECT i, 3 from t7 WHERE i > 20000 ORDER BY i LIMIT 20000; analyze t8; $$; INSERT INTO t7 select x, x + 10 from generate_series(1,1000000) as x ; analyze t7; select populate_t8(); explain analyze verbose update t7 SET k = 1 FROM t8 WHERE t7.i = t8.i AND ( t8.j = 2 OR t8.j = 1 ); select populate_t8(); explain analyze verbose update t7 SET k = 1 WHERE EXISTS ( SELECT 1 FROM t8 WHERE t8.i = t7.i AND ( t8.j = 2 OR t8.j = 1 ) ); select populate_t8(); explain update t7 SET k = 1 FROM t8 WHERE ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i) AND ( t8.j = 2 OR t8.j = 1 ); explain analyze verbose update t7 SET k = 1 FROM t8 WHERE ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i) AND ( t8.j = 2 OR t8.j = 1 ); rollback; -- Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes: > I was hoping that somebody could help me understand the differences > between three plans. > All of the plans are updating a table using a second table, and should > be logically equivalent. > Two of the plans use joins, and one uses an exists subquery. > One of the plans uses row constructors and IS NOT DISTINCT FROM. It is > this plan which has really awful performance. > Clearly it is due to the nested loop, but why would the planner choose > that approach? IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be an indexqual, merge join qual, or hash join qual. So it's not surprising that you get a sucky plan for it. Possibly somebody will work on improving that someday. As for your other questions, what PG version are you using? Because I do get pretty much the same plan (modulo a plain join versus a semijoin) for the first two queries, when using 9.0 or later. And the results of ANALYZE are only approximate, so you shouldn't be surprised at all if a rowcount estimate is off by a couple percent. Most of the time, you should be happy if it's within a factor of 2 of reality. regards, tom lane
On Mon, Mar 7, 2011 at 1:07 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > Originally, I posted to -general but I found some time to write some > samples, and realized it's probably more of a performance question. > > The original post is here: > http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php > > I was hoping that somebody could help me understand the differences > between three plans. > All of the plans are updating a table using a second table, and should > be logically equivalent. > Two of the plans use joins, and one uses an exists subquery. > One of the plans uses row constructors and IS NOT DISTINCT FROM. It is > this plan which has really awful performance. The problem is really coming from SQL: it requires row wise comparisons to be of all fields in left to right order and the fact that you can't match NULL to NULL with =. If you have a table with a,b,c, (1,1,NULL) is not distinct from (1,2,3) becomes: Filter: ((NOT (a IS DISTINCT FROM 1)) AND (NOT (b IS DISTINCT FROM 1)) AND (NOT (c IS DISTINCT FROM NULL::integer))) At present postgresql does not have the facilities to turn that into an index lookup. SQL doesn't allow the way you'd want to write this the way you'd really like to: select * from v where (a,b,c) = (1,1,NULL); because the comparison can't be applied from a row to another row but only between the member fields. You can cheat the system, but only if you reserve a special index for that purpose: create table v(a int, b int, c int); create index on v(v); select * from v where v = (1,1, NULL) will match as 'is not distinct from' does, using the index. This is because composite type comparison (as opposed to its fields) follows a different code path. Confused yet? You can also use the above trick with a type if you are not comparing all fields of 'v': create type foo(a int, b int); create index on v(((a,b)::foo)); select * from v where (a,b)::foo = (1,1); will get you field subset comparison with index. Note if you do the above, the index can only match on the entire composite, not particular fields... merlin
On Mon, Mar 7, 2011 at 2:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jon Nelson <jnelson+pgsql@jamponi.net> writes: >> I was hoping that somebody could help me understand the differences >> between three plans. >> All of the plans are updating a table using a second table, and should >> be logically equivalent. >> Two of the plans use joins, and one uses an exists subquery. >> One of the plans uses row constructors and IS NOT DISTINCT FROM. It is >> this plan which has really awful performance. >> Clearly it is due to the nested loop, but why would the planner choose >> that approach? > > IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be > an indexqual, merge join qual, or hash join qual. So it's not > surprising that you get a sucky plan for it. Possibly somebody will > work on improving that someday. > > As for your other questions, what PG version are you using? Because I > do get pretty much the same plan (modulo a plain join versus a semijoin) > for the first two queries, when using 9.0 or later. And the results of > ANALYZE are only approximate, so you shouldn't be surprised at all if a > rowcount estimate is off by a couple percent. Most of the time, you > should be happy if it's within a factor of 2 of reality. Sorry - I had stated in the original post that I was using 8.4.5 on 64 bit openSUSE and CentOS 5.5, and had forgotten to carry that information over into the second post. What is the difference between a plain join and a semi join? -- Jon
Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > What is the difference between a plain join and a semi join? As soon as a semi join finds a match it stops looking for more. -Kevin