Обсуждение: Bad query plan inside EXISTS clause

Поиск
Список
Период
Сортировка

Bad query plan inside EXISTS clause

От
Benoit Delbosc
Дата:
Hi all,

I am trying to understand why inside an EXISTS clause the query planner
  does not use the index:

EXPLAIN ANALYZE SELECT 1 WHERE EXISTS (SELECT 1 FROM read_acls_cache
             WHERE users_md5 = '9bc9012eb29c0bb2ae3cc7b5e78c2acf');
                                         QUERY PLAN

--------------------------------------------------------------------------------------------
  Result  (cost=1.19..1.20 rows=1 width=0) (actual time=466.317..466.318
rows=1 loops=1)
    One-Time Filter: $0
    InitPlan 1 (returns $0)
      ->  Seq Scan on read_acls_cache  (cost=0.00..62637.01 rows=52517
width=0) (actual time=466.309..466.309 rows=1 loops=1)
            Filter: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
  Total runtime: 466.369 ms
(6 rows)

While it does use the index when executing only the subquery:

EXPLAIN ANALYZE SELECT 1 FROM read_acls_cache WHERE users_md5 =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf';
                             QUERY PLAN

--------------------------------------------------------------------------
  Bitmap Heap Scan on read_acls_cache  (cost=2176.10..35022.98
rows=52517 width=0) (actual time=9.065..21.988 rows=51446 loops=1)
    Recheck Cond: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
    ->  Bitmap Index Scan on read_acls_cache_users_md5_idx
(cost=0.00..2162.97 rows=52517 width=0) (actual time=8.900..8.900
rows=51446 loops=1)
          Index Cond: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
  Total runtime: 25.464 ms
(5 rows)

The table has been vacuumed, analyzed and reindexed.

Thanks for your support.

Regards

ben

Here are some more info :

\d read_acls_cache
         Table "public.read_acls_cache"
   Column   |         Type          | Modifiers
-----------+-----------------------+-----------
  users_md5 | character varying(34) | not null
  acl_id    | character varying(34) |
Indexes:
     "read_acls_cache_users_md5_idx" btree (users_md5)


SELECT COUNT(*) FROM read_acls_cache;
   count
---------
  2520899
(1 row)


SELECT COUNT(DISTINCT(users_md5)) FROM read_acls_cache ;
  count
-------
     49
(1 row)


SELECT Version();
                                                   version
------------------------------------------------------------------
  PostgreSQL 8.4.2 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.2.real
(GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64
(1 row)



Re: Bad query plan inside EXISTS clause

От
Kenneth Marshall
Дата:
EXISTS matches NULLs too and since they are not indexed a
sequential scan is needed to check for them. Try using
IN instead.

Cheers,
Ken

On Wed, Mar 10, 2010 at 02:26:20PM +0100, Benoit Delbosc wrote:
> Hi all,
>
> I am trying to understand why inside an EXISTS clause the query planner
> does not use the index:
>
> EXPLAIN ANALYZE SELECT 1 WHERE EXISTS (SELECT 1 FROM read_acls_cache
>      WHERE users_md5 = '9bc9012eb29c0bb2ae3cc7b5e78c2acf');
>                                         QUERY PLAN
> --------------------------------------------------------------------------------------------
>  Result  (cost=1.19..1.20 rows=1 width=0) (actual time=466.317..466.318
> rows=1 loops=1)
>    One-Time Filter: $0
>    InitPlan 1 (returns $0)
>      ->  Seq Scan on read_acls_cache  (cost=0.00..62637.01 rows=52517
> width=0) (actual time=466.309..466.309 rows=1 loops=1)
>            Filter: ((users_md5)::text =
> '9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
>  Total runtime: 466.369 ms
> (6 rows)
>
> While it does use the index when executing only the subquery:
>
> EXPLAIN ANALYZE SELECT 1 FROM read_acls_cache WHERE users_md5 =
> '9bc9012eb29c0bb2ae3cc7b5e78c2acf';
>                             QUERY PLAN
> --------------------------------------------------------------------------
>  Bitmap Heap Scan on read_acls_cache  (cost=2176.10..35022.98 rows=52517
> width=0) (actual time=9.065..21.988 rows=51446 loops=1)
>    Recheck Cond: ((users_md5)::text =
> '9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
>    ->  Bitmap Index Scan on read_acls_cache_users_md5_idx
> (cost=0.00..2162.97 rows=52517 width=0) (actual time=8.900..8.900
> rows=51446 loops=1)
>          Index Cond: ((users_md5)::text =
> '9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
>  Total runtime: 25.464 ms
> (5 rows)
>
> The table has been vacuumed, analyzed and reindexed.
>
> Thanks for your support.
>
> Regards
>
> ben
>
> Here are some more info :
>
> \d read_acls_cache
>         Table "public.read_acls_cache"
>   Column   |         Type          | Modifiers
> -----------+-----------------------+-----------
>  users_md5 | character varying(34) | not null
>  acl_id    | character varying(34) |
> Indexes:
>     "read_acls_cache_users_md5_idx" btree (users_md5)
>
>
> SELECT COUNT(*) FROM read_acls_cache;
>   count
> ---------
>  2520899
> (1 row)
>
>
> SELECT COUNT(DISTINCT(users_md5)) FROM read_acls_cache ;
>  count
> -------
>     49
> (1 row)
>
>
> SELECT Version();
>                                                   version
> ------------------------------------------------------------------
>  PostgreSQL 8.4.2 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.2.real
> (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64
> (1 row)
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: Bad query plan inside EXISTS clause

От
Grzegorz Jaśkiewicz
Дата:
try JOINs...

Re: Bad query plan inside EXISTS clause

От
Yeb Havinga
Дата:
Kenneth Marshall wrote:
> EXISTS matches NULLs too and since they are not indexed a
> sequential scan is needed to check for them. Try using
> IN instead.
>
This is nonsense in more than one way.

regards
Yeb Havinga


Re: Bad query plan inside EXISTS clause

От
Yeb Havinga
Дата:
Yeb Havinga wrote:
> Kenneth Marshall wrote:
>> EXISTS matches NULLs too and since they are not indexed a
>> sequential scan is needed to check for them. Try using
>> IN instead.
>>
> This is nonsense in more than one way.
Hit ctrl-return a bit too slow - exists does not match null but a set of
records, that is either empty or not empty. Also it is possible to index
table columns with nulls, and then the indexes can still be used.
Besides filtering record sets with expressions, indexes are also used
for ordering. There the effect of indexes with nulls can be seen: where
to put them: in front or after the non nulls? So indexes can be
perfectly used in conjunction with nulls. I found the original mail
rather intriguing and played with an example myself a bit, but could not
repeat the behavior (9.0 devel version), in my case the exists used an
index. Maybe it has something to do with the fact that the planner
estimates to return 50000 rows, even when the actual numbers list only 1
hit. In the exists case, it can stop at the first hit. In the select all
rows case, it must return all rows. Maybe a better plan emerges with
better statistics?

regards,
Yeb Havinga


Re: Bad query plan inside EXISTS clause

От
Benoit Delbosc
Дата:
Yeb Havinga a écrit :
> Yeb Havinga wrote:
>> Kenneth Marshall wrote:
>>> EXISTS matches NULLs too and since they are not indexed a
>>> sequential scan is needed to check for them. Try using
>>> IN instead.
>>>
>> This is nonsense in more than one way.
> Hit ctrl-return a bit too slow - exists does not match null but a set of
> records, that is either empty or not empty. Also it is possible to index
> table columns with nulls, and then the indexes can still be used.
> Besides filtering record sets with expressions, indexes are also used
> for ordering. There the effect of indexes with nulls can be seen: where
> to put them: in front or after the non nulls? So indexes can be
> perfectly used in conjunction with nulls. I found the original mail
> rather intriguing and played with an example myself a bit, but could not
> repeat the behavior (9.0 devel version), in my case the exists used an
> index. Maybe it has something to do with the fact that the planner
> estimates to return 50000 rows, even when the actual numbers list only 1
> hit. In the exists case, it can stop at the first hit. In the select all
> rows case, it must return all rows. Maybe a better plan emerges with
> better statistics?

Thanks for your quick investigation.

Changing the target statistics to 1000 for the column and analyzing the
table has not changed the query plan.

I just notice that using a "LIMIT 1" also gives a bad query plan

EXPLAIN ANALYZE SELECT 1 FROM read_acls_cache WHERE users_md5 =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf' LIMIT 1;
                                                         QUERY PLAN

---------------------------------------------------------------------------------------
  Limit  (cost=0.00..1.19 rows=1 width=0) (actual time=366.771..366.772
rows=1 loops=1)
    ->  Seq Scan on read_acls_cache  (cost=0.00..62637.01 rows=52517
width=0) (actual time=366.769..366.769 rows=1 loops=1)
          Filter: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
  Total runtime: 366.806 ms
(4 rows)

Perhaps the EXISTS clause is also trying to limit the sub select query.

There are no NULL value on this column but only 49 distinct values for
2.5m rows, around 51k rows per value.

What I am trying to do is to know if a value is present or not in the
table, so far the fastest way is to use a COUNT:

EXPLAIN ANALYZE SELECT COUNT(1) FROM read_acls_cache WHERE users_md5 =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf';
                                      QUERY PLAN
---------------------------------------------------------------------------
  Aggregate  (cost=35154.28..35154.29 rows=1 width=0) (actual
time=20.242..20.242 rows=1 loops=1)
    ->  Bitmap Heap Scan on read_acls_cache  (cost=2176.10..35022.98
rows=52517 width=0) (actual time=6.937..15.025 rows=51446 loops=1)
          Recheck Cond: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
          ->  Bitmap Index Scan on read_acls_cache_users_md5_idx
(cost=0.00..2162.97 rows=52517 width=0) (actual time=6.835..6.835
rows=51446 loops=1)
                Index Cond: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
  Total runtime: 20.295 ms
(6 rows)

ben

Re: Bad query plan inside EXISTS clause

От
Tom Lane
Дата:
Benoit Delbosc <bdelbosc@nuxeo.com> writes:
> I am trying to understand why inside an EXISTS clause the query planner
>   does not use the index:

I'm not sure this plan is as bad as all that.  The key point is that the
planner is expecting 52517 rows that match that users_md5 value (and the
true number is evidently 51446, so that estimate isn't far off).  That's
about 1/48th of the table.  It knows that the EXISTS case can stop as
soon as it finds one match, so it's betting that a plain seqscan will
hit a match faster than an index lookup would be able to, ie,
seqscanning about 48 tuples is faster than one index lookup.  This might
be a bad bet if the users_md5 values are correlated with physical order,
ie the matches are not randomly scattered but are all towards the end of
the table.  Barring that, though, it could be a good bet if the table
isn't swapped in.  Which is what the default cost parameters are set
up to assume.

I suspect your real complaint is that you expect the table to be swapped
in, in which case what you ought to be doing is adjusting the planner's
cost parameters.  Some playing around here with a similar case suggests
that even a small reduction in random_page_cost would make it prefer an
indexscan for this type of situation.

            regards, tom lane

Re: Bad query plan inside EXISTS clause

От
Benoit Delbosc
Дата:
Tom Lane a écrit :
> Benoit Delbosc <bdelbosc@nuxeo.com> writes:
>> I am trying to understand why inside an EXISTS clause the query planner
>>   does not use the index:
>
> I'm not sure this plan is as bad as all that.  The key point is that the
> planner is expecting 52517 rows that match that users_md5 value (and the
> true number is evidently 51446, so that estimate isn't far off).  That's
> about 1/48th of the table.  It knows that the EXISTS case can stop as
> soon as it finds one match, so it's betting that a plain seqscan will
> hit a match faster than an index lookup would be able to, ie,
> seqscanning about 48 tuples is faster than one index lookup.  This might
> be a bad bet if the users_md5 values are correlated with physical order,
> ie the matches are not randomly scattered but are all towards the end of
> the table.
exact, the data is not randomly scattered but ordered this explains why
in my case seq scan is a bad bet

  Barring that, though, it could be a good bet if the table
> isn't swapped in.  Which is what the default cost parameters are set
> up to assume.
there are lots of shared buffers and effective memory on this instance,
the query is executed many times I can assume that the table isn't
swapped in right ?

> I suspect your real complaint is that you expect the table to be swapped
> in, in which case what you ought to be doing is adjusting the planner's
> cost parameters.  Some playing around here with a similar case suggests
> that even a small reduction in random_page_cost would make it prefer an
> indexscan for this type of situation.
excellent !

Changing the random_page_cost from 4 to 2 do the trick

SET random_page_cost = 2;
EXPLAIN ANALYZE SELECT 1 WHERE EXISTS (SELECT 1 FROM read_acls_cache
WHERE users_md5 = '9bc9012eb29c0bb2ae3cc7b5e78c2acf');


    QUERY PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------------
  Result  (cost=1.06..1.07 rows=1 width=0) (actual time=0.048..0.048
rows=1 loops=1)
    One-Time Filter: $0
    InitPlan 1 (returns $0)
      ->  Index Scan using read_acls_cache_users_md5_idx on
read_acls_cache  (cost=0.00..55664.21 rows=52517 width=0) (actual
time=0.045..0.045 rows=1 loops=1)
            Index Cond: ((users_md5)::text =
'9bc9012eb29c0bb2ae3cc7b5e78c2acf'::text)
  Total runtime: 0.087 ms
(6 rows)

466/0.087 = 5360 thanks !

kind regards

ben