Обсуждение: will the planner ever use an index when the condition is <> ?

Поиск
Список
Период
Сортировка

will the planner ever use an index when the condition is <> ?

От
Roxanne Reid-Bennett
Дата:
I have a query that used <> against an indexed column. In this
case I can use the reverse and use in or = and get the performance
I need... but "in general"... will the planner ever use an index when
the related column is compared using <>?

I feel like the answer is no, but wanted to ask.

Roxanne
Postgres Version 8.4.9 PostGIS version 1.5.2



Context for question:

I have the following query:

select *
from op_region opr, yield_segment_info ysi, data_location dl
where opr.op_region_id in
          (select distinct op_region_id
           from yield_point
           where yield > 0
           and area > 0
           and ST_GeometryType(location) <> 'ST_Point'
          )
and ysi.op_region_id = opr.op_region_id
and dl.data_set_id = opr.data_set_id

Yield_Point has 161,575,599 records
where yield >0 and area > 0 has 161,263,193 records,
where ST_GeometryType(location)<> 'ST_Point' has just 231 records

yield_segment_info has 165,929 records
op_region has 566,212 records
data_location has 394,763

All of these have a high volume of insert/delete's.
The tables have recently been vacuum full'd and the indexes reindexed.
[they are under the management of the autovacuum, but we forced a
cleanup on the chance that things had degraded...]

If I run an explain analyze:

"Nested Loop
     (cost=5068203.00..5068230.31 rows=3 width=225308)
     (actual time=192571.730..193625.728 rows=236 loops=1)"
"->Nested Loop
       (cost=5068203.00..5068219.66 rows=1 width=57329)
       (actual time=192522.573..192786.698 rows=230 loops=1)"
"  ->Nested Loop
    (cost=5068203.00..5068211.36 rows=1 width=57268)
    (actual time=192509.822..192638.446 rows=230 loops=1)"
"    ->HashAggregate
            (cost=5068203.00..5068203.01 rows=1 width=4)
            (actual time=192471.507..192471.682 rows=230 loops=1)"
"       ->Seq Scan on yield_point
               (cost=0.00..5068203.00 rows=1 width=4)
               (actual time=602.174..192471.177 rows=230 loops=1)"
"             Filter: ((yield > 0::double precision) AND
                        (area > 0::double precision) AND
                        (st_geometrytype(location) <> 'ST_Point'::text))"
"    ->Index Scan using op_region_pkey on op_region opr
            (cost=0.00..8.33 rows=1 width=57264)
            (actual time=0.723..0.723 rows=1 loops=230)"
"          Index Cond: (opr.op_region_id = yield_point.op_region_id)"
"  ->Index Scan using yield_segment_info_key on yield_segment_info ysi
        (cost=0.00..8.29 rows=1 width=65)
        (actual time=0.643..0.643 rows=1 loops=230)"
"      Index Cond: (ysi.op_region_id = opr.op_region_id)"
"->Index Scan using data_location_data_set_idx on data_location dl
     (cost=0.00..10.61 rows=3 width=167979)
     (actual time=3.611..3.646 rows=1 loops=230)"
"Index Cond: (dl.data_set_id = opr.data_set_id)"
"Total runtime: 193625.955 ms"

yield_point has the following indexes:
       btree on ST_GeometryType(location)
       gist on location
       btree on op_region_id

I've also tried an index on
       ((yield > 0::double precision) AND (area > 0::double precision)
AND (st_geometrytype(location) <> 'ST_Point'::text))
... it still goes for the sequential scan.

But if I change it to st_geometrytype(location) = 'ST_Polygon' or
even in ('ST_Polygon','ST_MultiPolygon')

the planner uses the index.

Roxanne

Re: will the planner ever use an index when the condition is <> ?

От
Filip Rembiałkowski
Дата:
Normally there is no chance it could work,
because (a) the planner does not know all possible values of a column,
and (b) btree indexes cannot search on "not equal" operator.


BTW I've just made a case where - logically - it could work, but it
still does not:

create table nums ( num int4 not null, check(num=1 or num=2) );
insert into nums select case when random()<=0.99 then 1 else 2 end
from generate_series(1,1000000);
create index nums_idx on nums(num);
analyze nums;
set constraint_exclusion to 'on';
explain select * from nums where num<>1;
--planner could estimate selectivity as 1%, and use index with "=2"
filter basing on check constraint?




2011/12/17 Roxanne Reid-Bennett <rox@tara-lu.com>:
> I have a query that used <> against an indexed column. In this
> case I can use the reverse and use in or = and get the performance
> I need... but "in general"... will the planner ever use an index when the
> related column is compared using <>?
>
> I feel like the answer is no, but wanted to ask.
>
> Roxanne
> Postgres Version 8.4.9 PostGIS version 1.5.2
>
>
>
> Context for question:
>
> I have the following query:
>
> select *
> from op_region opr, yield_segment_info ysi, data_location dl
> where opr.op_region_id in
>         (select distinct op_region_id
>          from yield_point
>          where yield > 0
>          and area > 0
>          and ST_GeometryType(location) <> 'ST_Point'
>         )
> and ysi.op_region_id = opr.op_region_id
> and dl.data_set_id = opr.data_set_id
>
> Yield_Point has 161,575,599 records
> where yield >0 and area > 0 has 161,263,193 records,
> where ST_GeometryType(location)<> 'ST_Point' has just 231 records
>
> yield_segment_info has 165,929 records
> op_region has 566,212 records
> data_location has 394,763
>
> All of these have a high volume of insert/delete's.
> The tables have recently been vacuum full'd and the indexes reindexed.
> [they are under the management of the autovacuum, but we forced a cleanup on
> the chance that things had degraded...]
>
> If I run an explain analyze:
>
> "Nested Loop
>    (cost=5068203.00..5068230.31 rows=3 width=225308)
>    (actual time=192571.730..193625.728 rows=236 loops=1)"
> "->Nested Loop
>      (cost=5068203.00..5068219.66 rows=1 width=57329)
>      (actual time=192522.573..192786.698 rows=230 loops=1)"
> "  ->Nested Loop
>        (cost=5068203.00..5068211.36 rows=1 width=57268)
>        (actual time=192509.822..192638.446 rows=230 loops=1)"
> "    ->HashAggregate
>           (cost=5068203.00..5068203.01 rows=1 width=4)
>           (actual time=192471.507..192471.682 rows=230 loops=1)"
> "       ->Seq Scan on yield_point
>              (cost=0.00..5068203.00 rows=1 width=4)
>              (actual time=602.174..192471.177 rows=230 loops=1)"
> "             Filter: ((yield > 0::double precision) AND
>                       (area > 0::double precision) AND
>                       (st_geometrytype(location) <> 'ST_Point'::text))"
> "    ->Index Scan using op_region_pkey on op_region opr
>           (cost=0.00..8.33 rows=1 width=57264)
>           (actual time=0.723..0.723 rows=1 loops=230)"
> "          Index Cond: (opr.op_region_id = yield_point.op_region_id)"
> "  ->Index Scan using yield_segment_info_key on yield_segment_info ysi
>       (cost=0.00..8.29 rows=1 width=65)
>       (actual time=0.643..0.643 rows=1 loops=230)"
> "      Index Cond: (ysi.op_region_id = opr.op_region_id)"
> "->Index Scan using data_location_data_set_idx on data_location dl
>    (cost=0.00..10.61 rows=3 width=167979)
>    (actual time=3.611..3.646 rows=1 loops=230)"
> "Index Cond: (dl.data_set_id = opr.data_set_id)"
> "Total runtime: 193625.955 ms"
>
> yield_point has the following indexes:
>      btree on ST_GeometryType(location)
>      gist on location
>      btree on op_region_id
>
> I've also tried an index on
>      ((yield > 0::double precision) AND (area > 0::double precision) AND
> (st_geometrytype(location) <> 'ST_Point'::text))
> ... it still goes for the sequential scan.
>
> But if I change it to st_geometrytype(location) = 'ST_Polygon' or
> even in ('ST_Polygon','ST_MultiPolygon')
>
> the planner uses the index.
>
> Roxanne
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

Re: will the planner ever use an index when the condition is <> ?

От
Віталій Тимчишин
Дата:


17.12.2011 18:25 пользователь "Filip Rembiałkowski" <plk.zuber@gmail.com> написал:
>
> Normally there is no chance it could work,
> because (a) the planner does not know all possible values of a column,
> and (b) btree indexes cannot search on "not equal" operator.
>

Why so? a<>b is same as (a<b or a>b), so, planner should chech this option.

Re: will the planner ever use an index when the condition is <> ?

От
Roxanne Reid-Bennett
Дата:
On 12/17/2011 11:24 AM, Filip Rembiałkowski wrote:
> Normally there is no chance it could work,
> because (a) the planner does not know all possible values of a column,
> and (b) btree indexes cannot search on "not equal" operator.

Is there an index type that can check "not equal"?
This specific column has a limited number of possible values - it is
essentially an enumerated list.

Roxanne
>
>
> BTW I've just made a case where - logically - it could work, but it
> still does not:
>
> create table nums ( num int4 not null, check(num=1 or num=2) );
> insert into nums select case when random()<=0.99 then 1 else 2 end
> from generate_series(1,1000000);
> create index nums_idx on nums(num);
> analyze nums;
> set constraint_exclusion to 'on';
> explain select * from nums where num<>1;
> --planner could estimate selectivity as 1%, and use index with "=2"
> filter basing on check constraint?
>
>
>
>
> 2011/12/17 Roxanne Reid-Bennett<rox@tara-lu.com>:
>> I have a query that used<>  against an indexed column. In this
>> case I can use the reverse and use in or = and get the performance
>> I need... but "in general"... will the planner ever use an index when the
>> related column is compared using<>?
>>
>> I feel like the answer is no, but wanted to ask.
>>
>> Roxanne
>> Postgres Version 8.4.9 PostGIS version 1.5.2
>>
>>
>>
>> Context for question:
>>
>> I have the following query:
>>
>> select *
>> from op_region opr, yield_segment_info ysi, data_location dl
>> where opr.op_region_id in
>>          (select distinct op_region_id
>>           from yield_point
>>           where yield>  0
>>           and area>  0
>>           and ST_GeometryType(location)<>  'ST_Point'
>>          )
>> and ysi.op_region_id = opr.op_region_id
>> and dl.data_set_id = opr.data_set_id
>>
>> Yield_Point has 161,575,599 records
>> where yield>0 and area>  0 has 161,263,193 records,
>> where ST_GeometryType(location)<>  'ST_Point' has just 231 records
>>
>> yield_segment_info has 165,929 records
>> op_region has 566,212 records
>> data_location has 394,763
>>
>> All of these have a high volume of insert/delete's.
>> The tables have recently been vacuum full'd and the indexes reindexed.
>> [they are under the management of the autovacuum, but we forced a cleanup on
>> the chance that things had degraded...]
>>
>> If I run an explain analyze:
>>
>> "Nested Loop
>>     (cost=5068203.00..5068230.31 rows=3 width=225308)
>>     (actual time=192571.730..193625.728 rows=236 loops=1)"
>> "->Nested Loop
>>       (cost=5068203.00..5068219.66 rows=1 width=57329)
>>       (actual time=192522.573..192786.698 rows=230 loops=1)"
>> "  ->Nested Loop
>>         (cost=5068203.00..5068211.36 rows=1 width=57268)
>>         (actual time=192509.822..192638.446 rows=230 loops=1)"
>> "    ->HashAggregate
>>            (cost=5068203.00..5068203.01 rows=1 width=4)
>>            (actual time=192471.507..192471.682 rows=230 loops=1)"
>> "       ->Seq Scan on yield_point
>>               (cost=0.00..5068203.00 rows=1 width=4)
>>               (actual time=602.174..192471.177 rows=230 loops=1)"
>> "             Filter: ((yield>  0::double precision) AND
>>                        (area>  0::double precision) AND
>>                        (st_geometrytype(location)<>  'ST_Point'::text))"
>> "    ->Index Scan using op_region_pkey on op_region opr
>>            (cost=0.00..8.33 rows=1 width=57264)
>>            (actual time=0.723..0.723 rows=1 loops=230)"
>> "          Index Cond: (opr.op_region_id = yield_point.op_region_id)"
>> "  ->Index Scan using yield_segment_info_key on yield_segment_info ysi
>>        (cost=0.00..8.29 rows=1 width=65)
>>        (actual time=0.643..0.643 rows=1 loops=230)"
>> "      Index Cond: (ysi.op_region_id = opr.op_region_id)"
>> "->Index Scan using data_location_data_set_idx on data_location dl
>>     (cost=0.00..10.61 rows=3 width=167979)
>>     (actual time=3.611..3.646 rows=1 loops=230)"
>> "Index Cond: (dl.data_set_id = opr.data_set_id)"
>> "Total runtime: 193625.955 ms"
>>
>> yield_point has the following indexes:
>>       btree on ST_GeometryType(location)
>>       gist on location
>>       btree on op_region_id
>>
>> I've also tried an index on
>>       ((yield>  0::double precision) AND (area>  0::double precision) AND
>> (st_geometrytype(location)<>  'ST_Point'::text))
>> ... it still goes for the sequential scan.
>>
>> But if I change it to st_geometrytype(location) = 'ST_Polygon' or
>> even in ('ST_Polygon','ST_MultiPolygon')
>>
>> the planner uses the index.
>>
>> Roxanne
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance


Re: will the planner ever use an index when the condition is <> ?

От
Tom Lane
Дата:
Roxanne Reid-Bennett <rox@tara-lu.com> writes:
> On 12/17/2011 11:24 AM, Filip Rembiałkowski wrote:
>> Normally there is no chance it could work,
>> because (a) the planner does not know all possible values of a column,
>> and (b) btree indexes cannot search on "not equal" operator.

> Is there an index type that can check "not equal"?

There is not.  It's not so much that it's logically impossible as that
it doesn't seem worth the trouble to implement and maintain, because
most of the time a query like "where x <> constant" is going to fetch
most of the table, and so it would be better done as a seqscan anyway.

If you have a specific case where that's not true, you might consider
a partial index (CREATE INDEX ... WHERE x <> constant).  But the details
of that would depend a lot on the queries you're concerned about.

            regards, tom lane

Re: will the planner ever use an index when the condition is <> ?

От
Marti Raudsepp
Дата:
On Sun, Dec 18, 2011 at 16:52, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:
> Is there an index type that can check "not equal"?
> This specific column has a limited number of possible values - it is
> essentially an enumerated list.

Instead of writing WHERE foo<>3 you could rewrite it as WHERE foo IN
(1,2,4,...) or WHERE foo < 3 OR foo > 3. Both of these are indexable
queries, but obviously the planner may choose not to use index scan if
it's not worth it.

Regards,
Marti

Re: will the planner ever use an index when the condition is <> ?

От
Roxanne Reid-Bennett
Дата:
On 12/18/2011 1:31 PM, Tom Lane wrote:
> If you have a specific case where that's not true, you might consider
> a partial index (CREATE INDEX ... WHERE x <> constant). But the
> details of that would depend a lot on the queries you're concerned
> about. regards, tom lane

Which I had tried in the form of (st_geometrytype(location) <>
'ST_Point'::text)... planner never picked it (for the scenario given
before).  But this thread was all pretty much .. design/plan/future use.
This specific instance I've handled with   in
("ST_Polygon","ST_MultiPolygon").

Thank you for the feedback.

Roxanne