Обсуждение: BUG #14107: Major query planner bug regarding subqueries and indices

Поиск
Список
Период
Сортировка

BUG #14107: Major query planner bug regarding subqueries and indices

От
mathiaskunter@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      14107
Logged by:          Mathias Kunter
Email address:      mathiaskunter@gmail.com
PostgreSQL version: 9.5.0
Operating system:   Windows 7
Description:

The query planner doesn't use an index although it could, causing an
unneccessary sequential table scan. Step by step instructions to reproduce
the problem are given below.


Step 1 - just create a simple test table with an indexed id column:

CREATE TABLE test (id serial NOT NULL, CONSTRAINT pkey PRIMARY KEY (id));


Step 2 - note that the index is used for the following query as expected:

EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (2);
                               QUERY PLAN
-------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=8.33..13.67 rows=2 width=4)
   Recheck Cond: ((id = 1) OR (id = 2))
   ->  BitmapOr  (cost=8.33..8.33 rows=2 width=0)
         ->  Bitmap Index Scan on pkey  (cost=0.00..4.16 rows=1 width=0)
               Index Cond: (id = 1)
         ->  Bitmap Index Scan on pkey  (cost=0.00..4.16 rows=1 width=0)
               Index Cond: (id = 2)


Step 3 - note that the index is NOT used for the following query:

EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (SELECT id FROM test WHERE
id = 2);
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on test  (cost=8.17..56.42 rows=1275 width=4)
   Filter: ((id = 1) OR (hashed SubPlan 1))
   SubPlan 1
     ->  Index Only Scan using pkey on test test_1  (cost=0.16..8.17 rows=1
width=4)
           Index Cond: (id = 2)

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
"David G. Johnston"
Дата:
On Thu, Apr 21, 2016 at 4:56 AM, <mathiaskunter@gmail.com> wrote:

> The following bug has been logged on the website:
>
> Bug reference:      14107
> Logged by:          Mathias Kunter
> Email address:      mathiaskunter@gmail.com
> PostgreSQL version: 9.5.0
> Operating system:   Windows 7
> Description:
>
> The query planner doesn't use an index although it could, causing an
> unneccessary sequential table scan. Step by step instructions to reproduc=
e
> the problem are given below.
>
>
> Step 1 - just create a simple test table with an indexed id column:
>
> CREATE TABLE test (id serial NOT NULL, CONSTRAINT pkey PRIMARY KEY (id));
>
>
> Step 2 - note that the index is used for the following query as expected:
>
> EXPLAIN SELECT * FROM test WHERE id =3D 1 OR id IN (2);
>                                QUERY PLAN
> -------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=3D8.33..13.67 rows=3D2 width=3D4)
>    Recheck Cond: ((id =3D 1) OR (id =3D 2))
>    ->  BitmapOr  (cost=3D8.33..8.33 rows=3D2 width=3D0)
>          ->  Bitmap Index Scan on pkey  (cost=3D0.00..4.16 rows=3D1 width=
=3D0)
>                Index Cond: (id =3D 1)
>          ->  Bitmap Index Scan on pkey  (cost=3D0.00..4.16 rows=3D1 width=
=3D0)
>                Index Cond: (id =3D 2)
>
>
> Step 3 - note that the index is NOT used for the following query:
>
> EXPLAIN SELECT * FROM test WHERE id =3D 1 OR id IN (SELECT id FROM test W=
HERE
> id =3D 2);
>                                      QUERY PLAN
>
> -------------------------------------------------------------------------=
------------
>  Seq Scan on test  (cost=3D8.17..56.42 rows=3D1275 width=3D4)
>    Filter: ((id =3D 1) OR (hashed SubPlan 1))
>    SubPlan 1
>      ->  Index Only Scan using pkey on test test_1  (cost=3D0.16..8.17 ro=
ws=3D1
> width=3D4)
>            Index Cond: (id =3D 2)
>
>
=E2=80=8BTo lazy to research at the moment but I think this has been fixed =
and
released.  You show 9.5.0 as your version.  Update and you should be fine.

David J=E2=80=8B.

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
The problem unfortunately persists in 9.5.2 as well, where the query
plan is exactly the same as in 9.5.0. (Tested on Windows 7, but this
presumably is a cross-platform bug.) From what I can tell, this affects
all queries of the type

SELECT ... WHERE <condition> OR <indexed_column> IN (<subselect>);

As a workaround, you can use UNION, which works as expected:

SELECT ... WHERE <condition> UNION SELECT ... WHERE <indexed_column> IN
(<subselect>);

However, as it currently stands, queries of the above form are de facto
unusable with PostgreSQL. So this is pretty serious IMO.


Am 21.04.2016 um 23:01 schrieb David G. Johnston:
> On Thu, Apr 21, 2016 at 4:56 AM, <mathiaskunter@gmail.com
> <mailto:mathiaskunter@gmail.com>>wrote:
>
>     The following bug has been logged on the website:
>
>     Bug reference:      14107
>     Logged by:          Mathias Kunter
>     Email address:      mathiaskunter@gmail.com
>     <mailto:mathiaskunter@gmail.com>
>     PostgreSQL version: 9.5.0
>     Operating system:   Windows 7
>     Description:
>
>     The query planner doesn't use an index although it could, causing an
>     unneccessary sequential table scan. Step by step instructions to
>     reproduce
>     the problem are given below.
>
>
>     Step 1 - just create a simple test table with an indexed id column:
>
>     CREATE TABLE test (id serial NOT NULL, CONSTRAINT pkey PRIMARY KEY
>     (id));
>
>
>     Step 2 - note that the index is used for the following query as
>     expected:
>
>     EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (2);
>                                    QUERY PLAN
>     -------------------------------------------------------------------------
>      Bitmap Heap Scan on test  (cost=8.33..13.67 rows=2 width=4)
>        Recheck Cond: ((id = 1) OR (id = 2))
>        ->  BitmapOr  (cost=8.33..8.33 rows=2 width=0)
>              ->  Bitmap Index Scan on pkey  (cost=0.00..4.16 rows=1 width=0)
>                    Index Cond: (id = 1)
>              ->  Bitmap Index Scan on pkey  (cost=0.00..4.16 rows=1 width=0)
>                    Index Cond: (id = 2)
>
>
>     Step 3 - note that the index is NOT used for the following query:
>
>     EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (SELECT id FROM
>     test WHERE
>     id = 2);
>                                          QUERY PLAN
>     -------------------------------------------------------------------------------------
>      Seq Scan on test  (cost=8.17..56.42 rows=1275 width=4)
>        Filter: ((id = 1) OR (hashed SubPlan 1))
>        SubPlan 1
>          ->  Index Only Scan using pkey on test test_1  (cost=0.16..8.17
>     rows=1
>     width=4)
>                Index Cond: (id = 2)
>
>
> ​To lazy to research at the moment but I think this has been fixed and
> released.  You show 9.5.0 as your version.  Update and you should be fine.
>
> David J​.
>

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
David Rowley
Дата:
On 21 April 2016 at 23:56,  <mathiaskunter@gmail.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14107
> Logged by:          Mathias Kunter
> Email address:      mathiaskunter@gmail.com
> PostgreSQL version: 9.5.0
> Operating system:   Windows 7
> Description:
>
> The query planner doesn't use an index although it could, causing an
> unneccessary sequential table scan. Step by step instructions to reproduce
> the problem are given below.
>

..

> Step 3 - note that the index is NOT used for the following query:
>
> EXPLAIN SELECT * FROM test WHERE id = 1 OR id IN (SELECT id FROM test WHERE
> id = 2);
>                                      QUERY PLAN
> -------------------------------------------------------------------------------------
>  Seq Scan on test  (cost=8.17..56.42 rows=1275 width=4)
>    Filter: ((id = 1) OR (hashed SubPlan 1))
>    SubPlan 1
>      ->  Index Only Scan using pkey on test test_1  (cost=0.16..8.17 rows=1
> width=4)
>            Index Cond: (id = 2)

Thanks for the report, but I think your subject line is a little
exaggerated. I'd describe this as a missed optimisation, and not even
a bug.

Having said that, I have looked previously at adding more smarts to
the planner around detecting when each relation can return, at most, a
single row, based on the restriction quals. When that can be proved,
instead of performing a join to that relation, use an init plan, such
as what you'd get if you'd written the query as;

EXPLAIN SELECT * FROM test WHERE id = 1 OR id = (SELECT id FROM test
WHERE id = 2);

then just remove the join/subplan altogether.

I'd not actually thought about expanding this to IN and NOT IN, but
likely it would be possible, providing NULLs could be handled
correctly too.

Was this the optimisation you think is missing? or did you expect the
subquery to feed the bitmap scan keys? If so I'm not too sure how that
could be made to work well when the row estimates are way off, and the
bitmap index scan has to deal with millions or billions of scan keys.
It might not end well.

As for if the IN() clause could be detected to return a single row...
well that's in my "I'd like to do that one day list".

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
> Thanks for the report, but I think your subject line is a little
> exaggerated.

Sorry for that :-)

> I'd not actually thought about expanding this to IN and NOT IN, but
> likely it would be possible, providing NULLs could be handled
> correctly too. Was this the optimisation you think is missing?

No - let's consider a real-world example where the problem becomes more
evident. Assume we have the following two tables:

CREATE TABLE book (id SERIAL NOT NULL, name VARCHAR, authorId SERIAL,
CONSTRAINT book_pkey PRIMARY KEY (id));

CREATE TABLE author (id SERIAL NOT NULL, surname VARCHAR, CONSTRAINT
author_pkey PRIMARY KEY (id));

All relevant columns are indexed. The book table contains 1.3 M rows.

Now let's say we want to find books by either name or by author name. A
simple problem, actually. The following query is used (which could of
course also be rewritten to use a JOIN instead, but that's not the point
here). We find out that it has an unexpected and unacceptably long
execution time:


EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' OR authorId IN
(SELECT id FROM author WHERE surname = 'Rowling');
                                         QUERY PLAN
-------------------------------------------------------------------------------------------
  Seq Scan on book  (cost=13.68..26792.88 rows=576709 width=40)
    Filter: (((name)::text = 'Harry Potter'::text) OR (hashed SubPlan 1))
    SubPlan 1
      ->  Bitmap Heap Scan on author  (cost=4.20..13.67 rows=6 width=4)
            Recheck Cond: ((surname)::text = 'Rowling'::text)
            ->  Bitmap Index Scan on author_surname_index
(cost=0.00..4.20 rows=6 width=0)
                  Index Cond: ((surname)::text = 'Rowling'::text)


In contrast to that, consider the query plan for the following logically
equivalent query, which executes in just a few milliseconds (!):

EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT *
FROM book WHERE authorId IN (SELECT id FROM author WHERE surname =
'Rowling');
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=454.87..455.99 rows=112 width=29)
    Group Key: book.id, book.name, book.authorid
    ->  Append  (cost=0.43..454.03 rows=112 width=29)
          ->  Index Scan using book_name_index on book
(cost=0.43..16.48 rows=3 width=29)
                Index Cond: ((name)::text = 'Harry Potter'::text)
          ->  Nested Loop  (cost=4.63..436.43 rows=109 width=29)
                ->  Bitmap Heap Scan on author  (cost=4.20..13.67 rows=6
width=4)
                      Recheck Cond: ((surname)::text = 'Rowling'::text)
                      ->  Bitmap Index Scan on author_surname_index
(cost=0.00..4.20 rows=6 width=0)
                            Index Cond: ((surname)::text = 'Rowling'::text)
                ->  Index Scan using book_authorid_index on book book_1
(cost=0.43..70.28 rows=18 width=29)
                      Index Cond: (authorid = author.id)

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
Sorry, typo:

> CREATE TABLE book (id SERIAL NOT NULL, name VARCHAR, authorId SERIAL, CONSTRAINT book_pkey PRIMARY KEY (id));

It should of course be "authorId INTEGER" instead of "authorId SERIAL".

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Victor Yegorov
Дата:
2016-04-22 17:02 GMT+03:00 Mathias Kunter <mathiaskunter@gmail.com>:

> EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' OR authorId IN
> (SELECT id FROM author WHERE surname = 'Rowling');
>                                         QUERY PLAN
>
> -------------------------------------------------------------------------------------------
>  Seq Scan on book  (cost=13.68..26792.88 rows=576709 width=40)
> ...
>
> EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT * FROM
> book WHERE authorId IN (SELECT id FROM author WHERE surname = 'Rowling');
>                                                  QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=454.87..455.99 rows=112 width=29)
>

Queries return different number of rows, meaning they're not fully
equivalent.
Can you provide full DDL for the tables, constraints and indexes in use?


--
Victor Y. Yegorov

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Francisco Olarte
Дата:
On Fri, Apr 22, 2016 at 4:14 PM, Victor Yegorov <vyegorov@gmail.com> wrote:
>> EXPLAIN SELECT * FROM book WHERE name = 'Harry Potter' UNION SELECT * FROM
...
> Queries return different number of rows, meaning they're not fully
> equivalent.

IIRC plain-EXPLAIN returns estimates, you need explain analyze to be comparable.

Francisco Olarte.

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Victor Yegorov
Дата:
2016-04-22 19:24 GMT+03:00 Francisco Olarte <folarte@peoplecall.com>:

> IIRC plain-EXPLAIN returns estimates, you need explain analyze to be
> comparable.


Of course, sorry for this.

Well, then `EXPLAIN (analyze, buffers)` is also wanted, together with
object definitions.


--
Victor Y. Yegorov

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
> Queries return different number of rows, meaning they're not fully equivalent.

Well, I think in the given example they should actually be fully
equivalent, since the unique id column is selected as well. Thus, UNION
can't do any unwanted row eliminations.

> Well, then `EXPLAIN (analyze, buffers)` is also wanted, together with
> object definitions.

So here are the exact SQL commands to reproduce the problem.

-- Create table "book"

CREATE TABLE book (id SERIAL NOT NULL, name VARCHAR, author INTEGER,
CONSTRAINT book_pkey PRIMARY KEY (id));
CREATE INDEX book_name_index ON book (name);
CREATE INDEX book_author_index ON book (author);


-- Create table "author"

CREATE TABLE author (id SERIAL NOT NULL, name VARCHAR, CONSTRAINT
author_pkey PRIMARY KEY (id));
CREATE INDEX author_name_index ON author (name);


-- Insert some test data so that the planner would never assume a
sequential scan could be faster

INSERT INTO book (id, name, author) SELECT generate_series(1, 100000),
md5(random()::text), (generate_series(1, 100000) - 1) % 10000 + 1;
INSERT INTO author (id, name) SELECT generate_series(1, 10000),
md5(random()::text);
ANALYZE book;
ANALYZE author;


-- Check query plan when using OR operator

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE name = 'Harry
Potter' OR author IN (SELECT id FROM author WHERE name = 'Rowling');
                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on book  (cost=8.30..2443.30 rows=50001 width=41) (actual
time=25.527..25.527 rows=0 loops=1)
    Filter: (((name)::text = 'Harry Potter'::text) OR (hashed SubPlan 1))
    Rows Removed by Filter: 100000
    Buffers: shared hit=937
    SubPlan 1
      ->  Index Scan using author_name_index on author  (cost=0.29..8.30
rows=1 width=4) (actual time=0.041..0.041 rows=0 loops=1)
            Index Cond: ((name)::text = 'Rowling'::text)
            Buffers: shared hit=2
  Planning time: 0.237 ms
  Execution time: 25.603 ms


-- Check query plan when using UNION operator

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE name = 'Harry
Potter' UNION SELECT * FROM book WHERE author IN (SELECT id FROM author
WHERE name = 'Rowling');
                                                                  QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=58.42..58.53 rows=11 width=41) (actual
time=0.066..0.066 rows=0 loops=1)
    Group Key: book.id, book.name, book.author
    Buffers: shared hit=5
    ->  Append  (cost=0.42..58.34 rows=11 width=41) (actual
time=0.061..0.061 rows=0 loops=1)
          Buffers: shared hit=5
          ->  Index Scan using book_name_index on book  (cost=0.42..8.44
rows=1 width=41) (actual time=0.035..0.035 rows=0 loops=1)
                Index Cond: ((name)::text = 'Harry Potter'::text)
                Buffers: shared hit=3
          ->  Nested Loop  (cost=4.66..49.79 rows=10 width=41) (actual
time=0.019..0.019 rows=0 loops=1)
                Buffers: shared hit=2
                ->  Index Scan using author_name_index on author
(cost=0.29..8.30 rows=1 width=4) (actual time=0.015..0.015 rows=0 loops=1)
                      Index Cond: ((name)::text = 'Rowling'::text)
                      Buffers: shared hit=2
                ->  Bitmap Heap Scan on book book_1  (cost=4.37..41.39
rows=10 width=41) (never executed)
                      Recheck Cond: (author = author.id)
                      ->  Bitmap Index Scan on book_author_index
(cost=0.00..4.37 rows=10 width=0) (never executed)
                            Index Cond: (author = author.id)
  Planning time: 0.669 ms
  Execution time: 0.183 ms

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Yaroslav
Дата:
Hmm... and this is even worse (on the data you provided):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM book
 WHERE name = 'Harry Potter'
    OR EXISTS (
       SELECT 1
         FROM author
        WHERE author.id = book.author AND author.name = 'Rowling'
       );


Seq Scan on book  (cost=0.00..832685.00 rows=50000 width=41) (actual
time=35.848..35.848 rows=0 loops=1)
  Filter: (((name)::text = 'Harry Potter'::text) OR (alternatives: SubPlan 1
or hashed SubPlan 2))
  Rows Removed by Filter: 100000
  Buffers: shared hit=937
  SubPlan 1
    ->  Index Scan using author_name_index on author  (cost=0.29..8.30
rows=1 width=0) (never executed)
          Index Cond: ((name)::text = 'Rowling'::text)
          Filter: (id = book.author)
  SubPlan 2
    ->  Index Scan using author_name_index on author author_1
(cost=0.29..8.30 rows=1 width=4) (actual time=0.027..0.027 rows=0 loops=1)
          Index Cond: ((name)::text = 'Rowling'::text)
          Buffers: shared hit=2
Planning time: 0.268 ms
Execution time: 35.923 ms



-----
WBR, Yaroslav Schekin.
--
View this message in context:
http://postgresql.nabble.com/BUG-14107-Major-query-planner-bug-regarding-subqueries-and-indices-tp5899873p5900147.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

Re: Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
> Hmm... and this is even worse (on the data you provided):
>
> EXPLAIN (ANALYZE, BUFFERS)
> SELECT *
>   FROM book
>  WHERE name = 'Harry Potter'
>     OR EXISTS (
>        SELECT 1
>          FROM author
>         WHERE author.id = book.author AND author.name = 'Rowling'
>        );

Yes, but the problem seems to be even bigger. Apparently it's neither
limited to subqueries nor to the operators EXISTS, IN, NOT IN, ANY,
SOME, and ALL. It rather seems that the planner has a severe bug
regarding usage of the OR operator itself. This seems hard to believe,
so please verify the query plans given below (and also earlier). I'd be
happy if I'm mistaken on this.



EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book JOIN author ON
(book.author = author.id) WHERE book.name = 'Harry Potter' OR
author.name = 'Rowling';
                                                        QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=309.00..4118.40 rows=11 width=78) (actual
time=325.283..325.283 rows=0 loops=1)
    Hash Cond: (book.author = author.id)
    Join Filter: (((book.name)::text = 'Harry Potter'::text) OR
((author.name)::text = 'Rowling'::text))
    Rows Removed by Join Filter: 100000
    Buffers: shared hit=1019
    ->  Seq Scan on book  (cost=0.00..1935.00 rows=100000 width=41)
(actual time=0.010..130.936 rows=100000 loops=1)
          Buffers: shared hit=935
    ->  Hash  (cost=184.00..184.00 rows=10000 width=37) (actual
time=28.933..28.933 rows=10000 loops=1)
          Buckets: 16384  Batches: 1  Memory Usage: 802kB
          Buffers: shared hit=84
          ->  Seq Scan on author  (cost=0.00..184.00 rows=10000
width=37) (actual time=0.007..14.061 rows=10000 loops=1)
                Buffers: shared hit=84
  Planning time: 0.456 ms
  Execution time: 325.546 ms



EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM book WHERE author IN (SELECT id
FROM author WHERE name = 'Rowling') OR FALSE;
                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on book  (cost=8.30..2193.30 rows=50000 width=41) (actual
time=13.838..13.838 rows=0 loops=1)
    Filter: (hashed SubPlan 1)
    Rows Removed by Filter: 100000
    Buffers: shared hit=937
    SubPlan 1
      ->  Index Scan using author_name_index on author  (cost=0.29..8.30
rows=1 width=4) (actual time=0.032..0.032 rows=0 loops=1)
            Index Cond: ((name)::text = 'Rowling'::text)
            Buffers: shared hit=2
  Planning time: 0.204 ms
  Execution time: 13.910 ms

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Yaroslav
Дата:
> Yes, but the problem seems to be even bigger.
Seems like it. I've asked folks on IRC for help, and it was confirmed the
bug exists since 8.2, at least:
https://gist.github.com/ilmari/b8ae0c37a7465d6fb3987dc651923660

Also, I've created sqlfiddle based on your example:
http://sqlfiddle.com/#!15/76ef4/3





-----
WBR, Yaroslav Schekin.
--
View this message in context:
http://postgresql.nabble.com/BUG-14107-Major-query-planner-bug-regarding-subqueries-and-indices-tp5899873p5900430.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
> Seems like it. I've asked folks on IRC for help, and it was confirmed the
> bug exists since 8.2


So, as it currently stands, queries of the following synopsis are de
facto unusable with PostgreSQL, as these will always ignore any existing
indices and trigger a full table scan instead:


SELECT ... FROM a WHERE a.x = ? OR a.y IN (...);

SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?;


The following SQL fiddle demonstrates the bug:
http://sqlfiddle.com/#!15/ab714/2

Does anybody care? Shouldn't something like this be fixed ASAP??

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Tom Lane
Дата:
Mathias Kunter <mathiaskunter@gmail.com> writes:
> Does anybody care? Shouldn't something like this be fixed ASAP??

As was already stated upthread, this is not a bug; it's an opportunity
for future improvement.  It's unlikely to get "fixed ASAP" because there
would be quite a bit of work involved, possibly including new executor
infrastructure not just planner work.

As far as the first case goes, the executor doesn't currently have any
direct way to treat "a.x IN (SELECT ...)" as an indexqual for a.x.
In simple cases the planner can work around that by transforming the query
into a join against a unique-ified version of the sub-select, but that
doesn't work if the IN is underneath an OR.  If you know that the
sub-select isn't going to return very many rows, you could do

SELECT ... FROM a WHERE a.x = ? OR a.y = ANY(ARRAY(SELECT ...));

but this would blow up rather badly with a large sub-select result,
so I'm not sure I want to try to make the planner transform it that
way automatically.

I don't actually see any way to do very much with your second example at
all:

> SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?;

There's no way to push anything down to either the A or B scans from
that WHERE condition: you can't remove any rows before the join because
they might join to rows on the other side that satisfy the other half
of the OR.

            regards, tom lane

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
David Rowley
Дата:
On 29 April 2016 at 13:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't actually see any way to do very much with your second example at
> all:
>
>> SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?;
>
> There's no way to push anything down to either the A or B scans from
> that WHERE condition: you can't remove any rows before the join because
> they might join to rows on the other side that satisfy the other half
> of the OR.

I was confused when I read that too. The only way I thought to
transform would be to create a UNION for each condition, which in
theory would be possible providing the SELECT lists contained the PK
columns. Of course, in practice this would be difficult, since UNIONs
are planned later, and having the query execute in that way would have
to be costed, and only used if cheaper.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
> If you know that the
> sub-select isn't going to return very many rows, you could do
>
> SELECT ... FROM a WHERE a.x = ? OR a.y = ANY(ARRAY(SELECT ...));

Isn't the planner already doing something like this, since the following
query is using the index as expected:

SELECT ... FROM a WHERE a.x IN (SELECT ...);

> but this would blow up rather badly with a large sub-select result,
> so I'm not sure I want to try to make the planner transform it that
> way automatically.

Wouldn't it be possible then to use this optimization based on the
estimated result size of the subquery? I think this would almost always
be faster than a sequential scan anyway. I observed that using the
ANY(ARRAY(SELECT...)) syntax on my small test tables (100 K rows)
already improves the query time by a factor of more than 100, and it
will be even more when tables are large. Please consider implementing
this optimization!

> I don't actually see any way to do very much with your second example at
> all:
>
>> SELECT ... FROM a JOIN b ON (...) WHERE a.x = ? OR b.y = ?;

Assuming both joined tables contain a PK (or another unique column),
then it should be possible by replanning the query as:

SELECT ... FROM a JOIN b ON ((join_cond AND a.x = ?) OR (join_cond AND
b.y = ?));

(Let "join_cond" denote the original join condition here.) Now, the JOIN
implementation must be smart enough to handle OR conditions: First,
obtain the rows satisfying

join_cond AND a.x = ?

by using the index as usual. For each matching row, create the tuple
(a.id, b.id) and insert it into a search tree (or hash or whatever).
Then, obtain the rows satisfying

join_cond AND b.y = ?

For each matching row, query the search tree whether it already contains
the tuple (a.id, b.id), and only add the current row to the final result
if it doesn't.

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
Sorry for bumping this one more time, but I'd like to share some more
real-life performance test results of using ANY(ARRAY(...)) instead of
IN(...), hoping that you'd maybe still consider implementing such an
optimization into the query planner. Since the test results indicate
that the performance boost can really be massive on certain query types
(factor 1000), I think that it'd really be worth the work.


===== Test setup =====

The tables "mb.release" and "mb.release_group" both contain about 1.5
million rows of real data, taken from the MusicBrainz database, and are
of course properly indexed. All performance tests have been repeated a
few times to be comparable.

The test covers subqueries which return just a few rows and also
subqueries which return more than 100000 rows. The queries test the
performance of IN vs. ANY(ARRAY()) when used in different scenarios.

For reference, the full query plans of all used queries are linked below.


===== Tested queries =====

1) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM
mb.release_group WHERE name = 'Bear');

2) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM
mb.release_group WHERE name < 'Bear');

3) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN
(SELECT id FROM mb.release_group WHERE name = 'Bear');

4) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN
(SELECT id FROM mb.release_group WHERE name < 'Bear');

5) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN
(SELECT id FROM mb.release_group WHERE name = 'Bear');

6) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN
(SELECT id FROM mb.release_group WHERE name < 'Bear');


===== Test results =====

All numbers are given in milliseconds and show the total query time
(planning + execution).

-------------------------------------------
| Query |    IN (...) | = ANY(ARRAY(...)) |
-------------------------------------------
|     1 |         0.7 |               0.4 |
|     2 |      6001.1 |            2517.8 |
|     3 |       711.3 |               0.5 |
|     4 | > 1000000.0 |            1962.6 |
|     5 |         0.8 |               0.5 |
|     6 |         0.9 |             492.7 |
-------------------------------------------

Note: Query 4 using the IN operator has been canceled after running for
more than 15 minutes.


===== Full query plans =====

For reference, all query plans of this performance test have been
recorded using EXPLAIN (ANALYZE, BUFFERS). Please find them at
http://pastebin.com/zymkbcSf

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
David Rowley
Дата:
On 11 May 2016 at 23:55, Mathias Kunter <mathiaskunter@gmail.com> wrote:
> Sorry for bumping this one more time, but I'd like to share some more
> real-life performance test results of using ANY(ARRAY(...)) instead of
> IN(...), hoping that you'd maybe still consider implementing such an
> optimization into the query planner. Since the test results indicate that
> the performance boost can really be massive on certain query types (factor
> 1000), I think that it'd really be worth the work.
>
>
> ===== Test setup =====
>
> The tables "mb.release" and "mb.release_group" both contain about 1.5
> million rows of real data, taken from the MusicBrainz database, and are of
> course properly indexed. All performance tests have been repeated a few
> times to be comparable.
>
> The test covers subqueries which return just a few rows and also subqueries
> which return more than 100000 rows. The queries test the performance of IN
> vs. ANY(ARRAY()) when used in different scenarios.
>
> For reference, the full query plans of all used queries are linked below.
>
>
> ===== Tested queries =====
>
> 1) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM
> mb.release_group WHERE name = 'Bear');
>
> 2) SELECT id FROM mb.release WHERE release_group IN (SELECT id FROM
> mb.release_group WHERE name < 'Bear');
>
> 3) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN
> (SELECT id FROM mb.release_group WHERE name = 'Bear');
>
> 4) SELECT id FROM mb.release WHERE name = 'Tiger' OR release_group IN
> (SELECT id FROM mb.release_group WHERE name < 'Bear');
>
> 5) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN
> (SELECT id FROM mb.release_group WHERE name = 'Bear');
>
> 6) SELECT id FROM mb.release WHERE name = 'Tiger' AND release_group IN
> (SELECT id FROM mb.release_group WHERE name < 'Bear');
>
>
> ===== Test results =====
>
> All numbers are given in milliseconds and show the total query time
> (planning + execution).
>
> -------------------------------------------
> | Query |    IN (...) | = ANY(ARRAY(...)) |
> -------------------------------------------
> |     1 |         0.7 |               0.4 |
> |     2 |      6001.1 |            2517.8 |
> |     3 |       711.3 |               0.5 |
> |     4 | > 1000000.0 |            1962.6 |
> |     5 |         0.8 |               0.5 |
> |     6 |         0.9 |             492.7 |
> -------------------------------------------
>
> Note: Query 4 using the IN operator has been canceled after running for more
> than 15 minutes.

How do you find the ANY(ARRAY(...)) version performs with say 10
million records in the array?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #14107: Major query planner bug regarding subqueries and indices

От
Mathias Kunter
Дата:
> How do you find the ANY(ARRAY(...)) version performs with say 10
> million records in the array?

I've tested with a subquery which returns about 20 million different
rows. In this case IN(...) is about 5 times faster than ANY(ARRAY(...))
for me. The exact numbers are:

IN(...):           about 22 seconds
ANY(ARRAY(...)):   about 115 seconds

However, estimated query costs seem to be always correct. So shouldn't
it be quite easy for the planner to create query plans for both the
ANY(ARRAY(...)) and the IN(...) version and then just use the plan where
costs are cheaper?