Обсуждение: problem (bug?) with "in (subquery)"

Поиск
Список
Период
Сортировка

problem (bug?) with "in (subquery)"

От
Luca Pireddu
Дата:
I have the following query that isn't behaving like I would expect:

select * from strains s where s.id in (select strain_id from pathway_strains);

I would expect each strain record to appear only once.  Instead I get output 
like this, where the same strain id appears many times:
 id   |     name     | organism
-------+--------------+----------   83 | common       |       82    83 | common       |       82    83 | common       |
     82    83 | common       |       82    83 | common       |       82    83 | common       |       82    83 | common
    |       82    83 | common       |       82    83 | common       |       82    83 | common       |       82   506 |
common      |      487  506 | common       |      487
 
... continues

By the way, this output is the same as if running the query:
select * from strains s join pathway_strains ps on ps.strain_id = s.id;

=====================================
Table "public.strains"   Column     |     Type     |                        Modifiers
---------------+--------------+---------------------------------------------------------id            | integer      |
notnull name          | text         | not null default 'common'::textorganism      | integer      | not null
 

Indexes:   "strains_pkey" PRIMARY KEY, btree (id)
================================== View "public.pathway_strains"  Column   |  Type   | Modifiers
------------+---------+-----------pathway_id | integer |strain_id  | integer |
View definition:SELECT DISTINCT p.id AS pathway_id, c.strain_id  FROM catalyst_associations c  JOIN pathway_edges e ON
c.pathway_edge_id= e.id  RIGHT JOIN pathways p ON p.id = e.pathway_id ORDER BY p.id, c.strain_id;
 

The contents of pathways_strains are likepathway_id | strain_id
------------+-----------      2083 |        76      2083 |        80      2083 |        83      2083 |        95
2084|        76      2084 |        80      2084 |        83      2084 |        95      2084 |       162
 
...etc

So, am I wrong in expecting each strain record to appear only once in the 
result set?  Or is there something wrong with PostgreSQL?  I would be happy 
to provide more information if it's needed.

Thank you!

Luca

ps: # select version();                                                    version

-----------------------------------------------------------------------------------------------------------------PostgreSQL
8.0.3on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.2.3 
 
20030502 (Red Hat Linux 3.2.3-42)


Re: problem (bug?) with "in (subquery)"

От
Michael Fuhr
Дата:
On Thu, Jul 14, 2005 at 01:34:21AM -0600, Luca Pireddu wrote:
> I have the following query that isn't behaving like I would expect:
> 
> select * from strains s where s.id in (select strain_id from pathway_strains);

Any reason the subquery isn't doing "SELECT DISTINCT strain_id"?

> I would expect each strain record to appear only once.  Instead I get output 
> like this, where the same strain id appears many times:
> 
>   id   |     name     | organism
> -------+--------------+----------
>     83 | common       |       82 
>     83 | common       |       82 
>     83 | common       |       82 

What happens when you try each of the following?  Do they give the
expected results?  I did some tests and I'm wondering if the planner's
hash join is responsible for the duplicate rows.

SELECT * FROM strains WHERE id IN ( SELECT strain_id FROM pathway_strains ORDER BY strain_id
);

CREATE TEMPORARY TABLE foo AS SELECT strain_id FROM pathway_strains;
SELECT * FROM strains WHERE id IN (SELECT strain_id FROM foo);

SET enable_hashjoin TO off;
SELECT * FROM strains WHERE id IN (SELECT strain_id FROM pathway_strains);

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/


Re: problem (bug?) with "in (subquery)"

От
Tom Lane
Дата:
Luca Pireddu <luca@cs.ualberta.ca> writes:
> So, am I wrong in expecting each strain record to appear only once in the 
> result set?  Or is there something wrong with PostgreSQL?

Could we see a self-contained example (table definitions and sample data
as a SQL script)?  I don't really have time to reverse-engineer a test
case from your description ...
        regards, tom lane


Re: problem (bug?) with "in (subquery)"

От
Michael Fuhr
Дата:
On Fri, Jul 15, 2005 at 09:59:27AM -0400, Tom Lane wrote:
> Luca Pireddu <luca@cs.ualberta.ca> writes:
> > So, am I wrong in expecting each strain record to appear only once in the 
> > result set?  Or is there something wrong with PostgreSQL?
> 
> Could we see a self-contained example (table definitions and sample data
> as a SQL script)?  I don't really have time to reverse-engineer a test
> case from your description ...

I've been reverse-engineering and simplifying this.  Here's something
that I think is close:

CREATE TABLE foo (id integer);
CREATE TABLE bar (id1 integer, id2 integer);

INSERT INTO foo VALUES (1);

INSERT INTO bar VALUES (1, 1);
INSERT INTO bar VALUES (2, 2);
INSERT INTO bar VALUES (3, 1);

SELECT *
FROM foo
WHERE id IN (SELECT id2 FROM (SELECT DISTINCT id1, id2 FROM bar) AS s);id 
---- 1 1
(2 rows)

SELECT *
FROM foo
WHERE id IN (SELECT id2 FROM (SELECT id1, id2 FROM bar) AS s);id 
---- 1
(1 row)

8.0.3 and HEAD behave as shown.  7.4.8, 7.3.10, and 7.2.8 return a
single row for both queries.

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/


Re: problem (bug?) with "in (subquery)"

От
Tom Lane
Дата:
Michael Fuhr <mike@fuhr.org> writes:
> I've been reverse-engineering and simplifying this.  Here's something
> that I think is close:

> CREATE TABLE foo (id integer);
> CREATE TABLE bar (id1 integer, id2 integer);

> INSERT INTO foo VALUES (1);

> INSERT INTO bar VALUES (1, 1);
> INSERT INTO bar VALUES (2, 2);
> INSERT INTO bar VALUES (3, 1);

> SELECT *
> FROM foo
> WHERE id IN (SELECT id2 FROM (SELECT DISTINCT id1, id2 FROM bar) AS s);
>  id 
> ----
>   1
>   1
> (2 rows)

Ah-hah: this one is the fault of create_unique_path, which quoth
   /*    * If the input is a subquery whose output must be unique already, we    * don't need to do anything.    */

Of course, that needs to read "... unique already, *and we are using all
of its output columns in our DISTINCT list*, we don't need to do
anything."
        regards, tom lane


Re: problem (bug?) with "in (subquery)"

От
Luca Pireddu
Дата:
On July 15, 2005 07:34, Michael Fuhr wrote:
> On Thu, Jul 14, 2005 at 01:34:21AM -0600, Luca Pireddu wrote:
> > I have the following query that isn't behaving like I would expect:

Thanks for creating the reduced test case Michael.  My apologies for not doing 
it myself.

> > 
> > select * from strains s where s.id in (select strain_id from 
pathway_strains);
> 
> Any reason the subquery isn't doing "SELECT DISTINCT strain_id"?

because I don't need to according to the specification of "in".  However, it 
does generate the correct output.  So does

select distinct * from strains s where s.id in (select strain_id from 
pathway_strains);

> 
> > I would expect each strain record to appear only once.  Instead I get 
output 
> > like this, where the same strain id appears many times:
> > 
> >   id   |     name     | organism
> > -------+--------------+----------
> >     83 | common       |       82 
> >     83 | common       |       82 
> >     83 | common       |       82 
> 
> What happens when you try each of the following?  Do they give the
> expected results?  I did some tests and I'm wondering if the planner's
> hash join is responsible for the duplicate rows.
> 
> SELECT * FROM strains WHERE id IN (
>   SELECT strain_id FROM pathway_strains ORDER BY strain_id
> );

With the "order by"  it works as it should, not generating duplicate rows.

> 
> CREATE TEMPORARY TABLE foo AS SELECT strain_id FROM pathway_strains;
> SELECT * FROM strains WHERE id IN (SELECT strain_id FROM foo);

This one's interesting.  It only returns the unique rows.  

> 
> SET enable_hashjoin TO off;
> SELECT * FROM strains WHERE id IN (SELECT strain_id FROM pathway_strains);

With hashjoin off the query returns the correct output.

On July 15, 2005 08:58, Tom Lane wrote:
> Ah-hah: this one is the fault of create_unique_path, which quoth
> 
>     /*
>      * If the input is a subquery whose output must be unique already, we
>      * don't need to do anything.
>      */
> 
> Of course, that needs to read "... unique already, *and we are using all
> of its output columns in our DISTINCT list*, we don't need to do
> anything."
> 
>             regards, tom lane

In any case, it looks like Tom has already found the problem :-)  Thanks guys!

Luca


Re: problem (bug?) with "in (subquery)"

От
Tom Lane
Дата:
Luca Pireddu <lucap@shaw.ca> writes:
> On July 15, 2005 08:58, Tom Lane wrote:
>> Ah-hah: this one is the fault of create_unique_path, which quoth

> In any case, it looks like Tom has already found the problem :-)  Thanks guys!

On closer analysis, the test in create_unique_path is almost but not
quite completely wrong :-(.  Here is the patch against 8.0 branch,
if you need it right away.
        regards, tom lane


Index: src/backend/optimizer/util/pathnode.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/util/pathnode.c,v
retrieving revision 1.111
diff -c -r1.111 pathnode.c
*** src/backend/optimizer/util/pathnode.c    31 Dec 2004 22:00:23 -0000    1.111
--- src/backend/optimizer/util/pathnode.c    15 Jul 2005 17:03:06 -0000
***************
*** 34,40 **** #include "utils/syscache.h"  
! static bool is_distinct_query(Query *query); static bool hash_safe_tlist(List *tlist);  
--- 34,41 ---- #include "utils/syscache.h"  
! static List *translate_sub_tlist(List *tlist, int relid);
! static bool query_is_distinct_for(Query *query, List *colnos); static bool hash_safe_tlist(List *tlist);  
***************
*** 642,655 ****     pathnode->subpath = subpath;      /*
!      * If the input is a subquery whose output must be unique already, we
!      * don't need to do anything.      */
!     if (rel->rtekind == RTE_SUBQUERY)     {         RangeTblEntry *rte = rt_fetch(rel->relid, root->rtable); 
!         if (is_distinct_query(rte->subquery))         {             pathnode->umethod = UNIQUE_PATH_NOOP;
pathnode->rows= rel->rows;
 
--- 643,683 ----     pathnode->subpath = subpath;      /*
!      * Try to identify the targetlist that will actually be unique-ified.
!      * In current usage, this routine is only used for sub-selects of IN
!      * clauses, so we should be able to find the tlist in in_info_list.
!      */
!     sub_targetlist = NIL;
!     foreach(l, root->in_info_list)
!     {
!         InClauseInfo *ininfo = (InClauseInfo *) lfirst(l);
! 
!         if (bms_equal(ininfo->righthand, rel->relids))
!         {
!             sub_targetlist = ininfo->sub_targetlist;
!             break;
!         }
!     }
! 
!     /*
!      * If the input is a subquery whose output must be unique already,
!      * then we don't need to do anything.  The test for uniqueness has
!      * to consider exactly which columns we are extracting; for example
!      * "SELECT DISTINCT x,y" doesn't guarantee that x alone is distinct.
!      * So we cannot check for this optimization unless we found our own
!      * targetlist above, and it consists only of simple Vars referencing
!      * subquery outputs.  (Possibly we could do something with expressions
!      * in the subquery outputs, too, but for now keep it simple.)      */
!     if (sub_targetlist && rel->rtekind == RTE_SUBQUERY)     {         RangeTblEntry *rte = rt_fetch(rel->relid,
root->rtable);
+         List   *sub_tlist_colnos; 
!         sub_tlist_colnos = translate_sub_tlist(sub_targetlist, rel->relid);
! 
!         if (sub_tlist_colnos &&
!             query_is_distinct_for(rte->subquery, sub_tlist_colnos))         {             pathnode->umethod =
UNIQUE_PATH_NOOP;            pathnode->rows = rel->rows;
 
***************
*** 664,686 ****     }      /*
-      * Try to identify the targetlist that will actually be unique-ified.
-      * In current usage, this routine is only used for sub-selects of IN
-      * clauses, so we should be able to find the tlist in in_info_list.
-      */
-     sub_targetlist = NIL;
-     foreach(l, root->in_info_list)
-     {
-         InClauseInfo *ininfo = (InClauseInfo *) lfirst(l);
- 
-         if (bms_equal(ininfo->righthand, rel->relids))
-         {
-             sub_targetlist = ininfo->sub_targetlist;
-             break;
-         }
-     }
- 
-     /*      * If we know the targetlist, try to estimate number of result rows;      * otherwise punt.      */
--- 692,697 ----
***************
*** 755,804 **** }  /*
!  * is_distinct_query - does query never return duplicate rows?  */
! static bool
! is_distinct_query(Query *query) {
!     /* DISTINCT (but not DISTINCT ON) guarantees uniqueness */
!     if (has_distinct_clause(query))
!         return true; 
!     /* UNION, INTERSECT, EXCEPT guarantee uniqueness, except with ALL */
!     if (query->setOperations)     {
!         SetOperationStmt *topop = (SetOperationStmt *) query->setOperations; 
!         Assert(IsA(topop, SetOperationStmt));
!         Assert(topop->op != SETOP_NONE); 
!         if (!topop->all)             return true;     }      /*
!      * GROUP BY guarantees uniqueness if all the grouped columns appear in
!      * the output.    In our implementation this means checking they are non
!      * resjunk columns.      */     if (query->groupClause)     {
!         ListCell   *gl;
! 
!         foreach(gl, query->groupClause)         {
!             GroupClause *grpcl = (GroupClause *) lfirst(gl);             TargetEntry *tle =
get_sortgroupclause_tle(grpcl,                                                       query->targetList); 
 
!             if (tle->resdom->resjunk)
!                 break;         }
!         if (!gl)                /* got to the end? */             return true;     }      /*      * XXX Are there any
othercases in which we can easily see the result      * must be distinct?      */
 
--- 766,888 ---- }  /*
!  * translate_sub_tlist - get subquery column numbers represented by tlist
!  *
!  * The given targetlist should contain only Vars referencing the given relid.
!  * Extract their varattnos (ie, the column numbers of the subquery) and return
!  * as an integer List.
!  *
!  * If any of the tlist items is not a simple Var, we cannot determine whether
!  * the subquery's uniqueness condition (if any) matches ours, so punt and
!  * return NIL.  */
! static List *
! translate_sub_tlist(List *tlist, int relid) {
!     List       *result = NIL;
!     ListCell   *l; 
!     foreach(l, tlist)     {
!         Var       *var = (Var *) lfirst(l); 
!         if (!var || !IsA(var, Var) ||
!             var->varno != relid)
!             return NIL;            /* punt */ 
!         result = lappend_int(result, var->varattno);
!     }
!     return result;
! }
! 
! /*
!  * query_is_distinct_for - does query never return duplicates of the
!  *        specified columns?
!  *
!  * colnos is an integer list of output column numbers (resno's).  We are
!  * interested in whether rows consisting of just these columns are certain
!  * to be distinct.
!  */
! static bool
! query_is_distinct_for(Query *query, List *colnos)
! {
!     ListCell   *l;
! 
!     /*
!      * DISTINCT (including DISTINCT ON) guarantees uniqueness if all the
!      * columns in the DISTINCT clause appear in colnos.
!      */
!     if (query->distinctClause)
!     {
!         foreach(l, query->distinctClause)
!         {
!             SortClause *scl = (SortClause *) lfirst(l);
!             TargetEntry *tle = get_sortgroupclause_tle(scl,
!                                                        query->targetList);
! 
!             if (!list_member_int(colnos, tle->resdom->resno))
!                 break;            /* exit early if no match */
!         }
!         if (l == NULL)            /* had matches for all? */             return true;     }      /*
!      * Similarly, GROUP BY guarantees uniqueness if all the grouped columns
!      * appear in colnos.      */     if (query->groupClause)     {
!         foreach(l, query->groupClause)         {
!             GroupClause *grpcl = (GroupClause *) lfirst(l);             TargetEntry *tle =
get_sortgroupclause_tle(grpcl,                                                       query->targetList); 
 
!             if (!list_member_int(colnos, tle->resdom->resno))
!                 break;            /* exit early if no match */         }
!         if (l == NULL)            /* had matches for all? */
!             return true;
!     }
!     else
!     {
!         /*
!          * If we have no GROUP BY, but do have aggregates or HAVING, then
!          * the result is at most one row so it's surely unique.
!          */
!         if (query->hasAggs || query->havingQual)             return true;     }      /*
+      * UNION, INTERSECT, EXCEPT guarantee uniqueness of the whole output row,
+      * except with ALL
+      */
+     if (query->setOperations)
+     {
+         SetOperationStmt *topop = (SetOperationStmt *) query->setOperations;
+ 
+         Assert(IsA(topop, SetOperationStmt));
+         Assert(topop->op != SETOP_NONE);
+ 
+         if (!topop->all)
+         {
+             /* We're good if all the nonjunk output columns are in colnos */
+             foreach(l, query->targetList)
+             {
+                 TargetEntry *tle = (TargetEntry *) lfirst(l);
+ 
+                 if (!tle->resdom->resjunk &&
+                     !list_member_int(colnos, tle->resdom->resno))
+                     break;        /* exit early if no match */
+             }
+             if (l == NULL)        /* had matches for all? */
+                 return true;
+         }
+     }
+ 
+     /*      * XXX Are there any other cases in which we can easily see the result      * must be distinct?      */
Index: src/test/regress/expected/subselect.out
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/expected/subselect.out,v
retrieving revision 1.10.4.2
diff -c -r1.10.4.2 subselect.out
*** src/test/regress/expected/subselect.out    1 Feb 2005 23:09:00 -0000    1.10.4.2
--- src/test/regress/expected/subselect.out    15 Jul 2005 17:03:06 -0000
***************
*** 203,208 ****
--- 203,265 ---- (1 row)  --
+ -- Test cases to check for overenthusiastic optimization of
+ -- "IN (SELECT DISTINCT ...)" and related cases.  Per example from
+ -- Luca Pireddu and Michael Fuhr.
+ --
+ CREATE TEMP TABLE foo (id integer);
+ CREATE TEMP TABLE bar (id1 integer, id2 integer);
+ INSERT INTO foo VALUES (1);
+ INSERT INTO bar VALUES (1, 1);
+ INSERT INTO bar VALUES (2, 2);
+ INSERT INTO bar VALUES (3, 1);
+ -- These cases require an extra level of distinct-ing above subquery s
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT DISTINCT id1, id2 FROM bar) AS s);
+  id 
+ ----
+   1
+ (1 row)
+ 
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id1,id2 FROM bar GROUP BY id1,id2) AS s);
+  id 
+ ----
+   1
+ (1 row)
+ 
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id1, id2 FROM bar UNION
+                       SELECT id1, id2 FROM bar) AS s);
+  id 
+ ----
+   1
+ (1 row)
+ 
+ -- These cases do not
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT DISTINCT ON (id2) id1, id2 FROM bar) AS s);
+  id 
+ ----
+   1
+ (1 row)
+ 
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id2 FROM bar GROUP BY id2) AS s);
+  id 
+ ----
+   1
+ (1 row)
+ 
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id2 FROM bar UNION
+                       SELECT id2 FROM bar) AS s);
+  id 
+ ----
+   1
+ (1 row)
+ 
+ -- -- Test case to catch problems with multiply nested sub-SELECTs not getting -- recalculated properly.  Per bug
reportfrom Didier Moens. --
 
Index: src/test/regress/sql/subselect.sql
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/sql/subselect.sql,v
retrieving revision 1.7
diff -c -r1.7 subselect.sql
*** src/test/regress/sql/subselect.sql    4 Oct 2004 14:42:48 -0000    1.7
--- src/test/regress/sql/subselect.sql    15 Jul 2005 17:03:06 -0000
***************
*** 96,101 ****
--- 96,134 ----    where unique1 IN (select distinct hundred from tenk1 b)) ss;  --
+ -- Test cases to check for overenthusiastic optimization of
+ -- "IN (SELECT DISTINCT ...)" and related cases.  Per example from
+ -- Luca Pireddu and Michael Fuhr.
+ --
+ 
+ CREATE TEMP TABLE foo (id integer);
+ CREATE TEMP TABLE bar (id1 integer, id2 integer);
+ 
+ INSERT INTO foo VALUES (1);
+ 
+ INSERT INTO bar VALUES (1, 1);
+ INSERT INTO bar VALUES (2, 2);
+ INSERT INTO bar VALUES (3, 1);
+ 
+ -- These cases require an extra level of distinct-ing above subquery s
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT DISTINCT id1, id2 FROM bar) AS s);
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id1,id2 FROM bar GROUP BY id1,id2) AS s);
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id1, id2 FROM bar UNION
+                       SELECT id1, id2 FROM bar) AS s);
+ 
+ -- These cases do not
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT DISTINCT ON (id2) id1, id2 FROM bar) AS s);
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id2 FROM bar GROUP BY id2) AS s);
+ SELECT * FROM foo WHERE id IN
+     (SELECT id2 FROM (SELECT id2 FROM bar UNION
+                       SELECT id2 FROM bar) AS s);
+ 
+ -- -- Test case to catch problems with multiply nested sub-SELECTs not getting -- recalculated properly.  Per bug
reportfrom Didier Moens. --