Обсуждение: Why does a simple query not use an obvious index?

Поиск
Список
Период
Сортировка

Why does a simple query not use an obvious index?

От
"Jack Kerkhof"
Дата:
The query:

    select count(*) from billing where timestamp > now()-60

should obviously use the index

    CREATE INDEX billing_timestamp_idx ON billing USING btree ("timestamp" timestamp_ops);

on a table with 1400000 rows.

But it uses a Seq Scan. If I set enable_seqscan=no, it indicates a queryplan could not be calculated.

Why does this simple query not use the timestamp index, and how can I get it to?

Thanks, Jack

Jack Kerkhof
Research & Development
jack.kerkhof@guest-tek.com
www.guest-tek.com
1.866.509.1010 3480

Guest-Tek is a leading provider of broadband technology solutions for the hospitality industry. Guest-Tek's GlobalSuite™ high-speed Internet solution enables hotels to offer their guests the convenience of wired and/or wireless broadband Internet access from guest rooms, meeting rooms and public areas.

 

Re: Why does a simple query not use an obvious index?

От
Mr Pink
Дата:
Strangely enough, I don't find that result surprising.

if the vast bulk of the data is in the past and now()-60 represents a very small slice of the data
we might expect that using an index is optimal, but there could be many reasons why it doesn't get
used.

AFAIK postgres doesn't peek at values used in a query when optimizing so any query with a ">" type
condition is gonna have a seq scan as the plan since the best guess is that you are gonna match
50% of the table. That's one possible explanation.

Another is that if the condition data types don't match then an indes won't be used you could try:

  select count(*) from billing where timestamp > (now()-60)::timestamp

Might make a difference, I dunno, it's a case of testing amd seing what happens.

You could try lowering the random page cost, it might help, but I don't like your chances.

If your problem is that you want to access the most recent data from a large table with fast
response, then you could consider:

1. a "recent" index. If the data is within the "recent" time from set a flag to true, other wise
null. Reset the flags periodically. Nulls aren't indexed so the selectivity of such an index is
much higher. Can work wonders.

2, duplicate recent data in another table that is purged when data passes the age limit. This is
basic archiving.

Something like that. Hopefully someone with more knowlege of the optimaizer will have a brighter
suggestion for you.

What version are you using by the way?

Regards
Mr Pink

--- Jack Kerkhof <jack.kerkhof@guest-tek.com> wrote:

> The query:
>     select count(*) from billing where timestamp > now()-60
>
> should obviously use the index
>
>     CREATE INDEX billing_timestamp_idx ON billing USING btree ("timestamp"
> timestamp_ops);
>
> on a table with 1400000 rows.
>
> But it uses a Seq Scan. If I set enable_seqscan=no, it indicates a queryplan
> could not be calculated.
>
> Why does this simple query not use the timestamp index, and how can I get it
> to?
>
> Thanks, Jack
>
>      Jack Kerkhof
>       Research & Development
>       jack.kerkhof@guest-tek.com
>       www.guest-tek.com
>       1.866.509.1010 3480
>
> --------------------------------------------------------------------------
>
>       Guest-Tek is a leading provider of broadband technology solutions for
> the hospitality industry. Guest-Tek's GlobalSuite�Ehigh-speed Internet
> solution enables hotels to offer their guests the convenience of wired
> and/or wireless broadband Internet access from guest rooms, meeting rooms
> and public areas.
>
>
>




__________________________________
Do you Yahoo!?
Take Yahoo! Mail with you! Get it on your mobile phone.
http://mobile.yahoo.com/maildemo

Re: Why does a simple query not use an obvious index?

От
"Scott Marlowe"
Дата:
On Fri, 2004-08-27 at 11:12, Jack Kerkhof wrote:
> The query:
>
>     select count(*) from billing where timestamp > now()-60
>
> should obviously use the index
>
>     CREATE INDEX billing_timestamp_idx ON billing USING btree
> ("timestamp" timestamp_ops);
>
> on a table with 1400000 rows.
>
> But it uses a Seq Scan. If I set enable_seqscan=no, it indicates a
> queryplan could not be calculated.

Have you tried this:

marlowe=> select now()-60;
ERROR:  operator does not exist: timestamp with time zone - integer
HINT:  No operator matches the given name and argument type(s). You may
need to add explicit type casts.

you likely need:

smarlowe=> select now()-'60 seconds'::interval;
           ?column?
-------------------------------
 2004-08-29 12:25:38.249564-06

inside there.

Also, count(*) is likely to always generate a seq scan due to the way
aggregates are implemented currently in pgsql.  you might want to try:

select somefield from sometable where timestampfield > now()-'60
seconds'::interval

and count the number of returned rows.  If there's a lot, it won't be
any faster, if there's a few, it should be a win.


Re: Why does a simple query not use an obvious index?

От
"Steinar H. Gunderson"
Дата:
On Sun, Aug 29, 2004 at 11:04:48AM -0700, Mr Pink wrote:
> Another is that if the condition data types don't match then an indes won't be used you could try:
>
>   select count(*) from billing where timestamp > (now()-60)::timestamp

In fact, I've had success with code like

    select count(*) from billing where timestamp > ( select now() - interval '1 minute' )

At least in my case (PostgreSQL 7.2, though), it made PostgreSQL magically do
an index scan. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Why does a simple query not use an obvious index?

От
Greg Stark
Дата:
Mr Pink <mr_pink_is_the_only_pro@yahoo.com> writes:

> AFAIK postgres doesn't peek at values used in a query when optimizing

Of course it does.

However sometimes things don't work perfectly.
To get good answers rather than just guesses we'll need two things:

. What version of postgres are you using.
. The output of EXPLAIN ANALYZE select ...

--
greg

Re: Why does a simple query not use an obvious index?

От
Greg Stark
Дата:
"Scott Marlowe" <smarlowe@qwest.net> writes:

> Also, count(*) is likely to always generate a seq scan due to the way
> aggregates are implemented currently in pgsql.  you might want to try:

Huh? I'm curious to know what you're talking about here.

> select somefield from sometable where timestampfield > now()-'60
> seconds'::interval
>
> and count the number of returned rows.  If there's a lot, it won't be
> any faster, if there's a few, it should be a win.

Why would this ever be faster? And how could postgres ever calculate that
without doing a sequential scan when count(*) would force it to do a
sequential scan?

--
greg

Re: Why does a simple query not use an obvious index?

От
"Scott Marlowe"
Дата:
On Sun, 2004-08-29 at 15:12, Greg Stark wrote:
> "Scott Marlowe" <smarlowe@qwest.net> writes:
>
> > Also, count(*) is likely to always generate a seq scan due to the way
> > aggregates are implemented currently in pgsql.  you might want to try:
>
> Huh? I'm curious to know what you're talking about here.

This has been discussed ad infinitum on the lists in the past.  And
explained by better minds than mine, but I'll give it a go.

PostgreSQL has a "generic" aggregate method.  Imagine instead doing a
select count(id1+id2-id3) from table where ...  In that instance, it's
not a simple shortcut to just grab the number of rows anymore.  Since
PostgreSQL uses a generic aggregate method that can be expanded by the
user with custom aggregates et. al., it has no optimizations to make
simple count(*) fast, like many other databases.

Add to that the fact that even when postgresql uses an index it still
has to hit the data store to get the actual value of the tuple, and
you've got very few instances in which an index scan of more than some
small percentage of the table is worth while.  I.e. a sequential scan
tends to "win" over an index scan quicker in postgresql than in other
databases like Oracle, where the data store is serialized and the
indexes have the correct information along with the application of the
transaction / roll back segments.

> > select somefield from sometable where timestampfield > now()-'60
> > seconds'::interval
> >
> > and count the number of returned rows.  If there's a lot, it won't be
> > any faster, if there's a few, it should be a win.
>
> Why would this ever be faster? And how could postgres ever calculate that
> without doing a sequential scan when count(*) would force it to do a
> sequential scan?

Because, count(*) CANNOT use an index.  So, if you're hitting, say,
0.01% of the table (let's say 20 out of 20,000,000 rows or something
like that) then the second should be MUCH faster.


Re: Why does a simple query not use an obvious index?

От
"Scott Marlowe"
Дата:
On Sun, 2004-08-29 at 15:38, Scott Marlowe wrote:
> On Sun, 2004-08-29 at 15:12, Greg Stark wrote:
> > "Scott Marlowe" <smarlowe@qwest.net> writes:
> >
> > > Also, count(*) is likely to always generate a seq scan due to the way
> > > aggregates are implemented currently in pgsql.  you might want to try:
> >
> > Huh? I'm curious to know what you're talking about here.
>
> This has been discussed ad infinitum on the lists in the past.  And
> explained by better minds than mine, but I'll give it a go.
>
> PostgreSQL has a "generic" aggregate method.  Imagine instead doing a
> select count(id1+id2-id3) from table where ...

that should be avg(id1+id2-id3)... doh


Re: Why does a simple query not use an obvious index?

От
Tom Lane
Дата:
>> select somefield from sometable where timestampfield > now()-'60
>> seconds'::interval

This is a FAQ, but since the archives don't seem to be up at the moment,
here's the answer once again:

The expression "now() - something" is not a constant, so the planner
is faced with "timestampfield > unknownvalue".  Its default assumption
about the number of rows that will match is much too high to make an
indexscan look profitable (from memory, I think it guesses that about
a third of the table will match...).

There are a couple of hacks you can use to deal with this.  Plan A
is just "set enable_seqscan = false" for this query.  This is ugly and
not really recommended, but you should try it first to verify that you
do get an indexscan that way, just to be sure that lack of statistics
is the problem and not something else.

Plan B is to add an extra WHERE clause to make the problem look like a
range query, eg

    where timestampfield > now() - ... AND timestampfield <= now();

The planner still doesn't know the exact values involved, so it still
can't make use of any statistics, but it can see that this is a range
constraint on timestampfield.  The default guess about the selectivity
will be a lot smaller than in the case of the one-sided inequality,
and in most cases you should get an indexscan out of it.  This isn't
completely guaranteed though.  Also, it's got a severe problem in that
if you sometimes do queries with a large interval, it'll still do an
indexscan even though that may be quite inappropriate.

Plan C is to fix things so that the compared-to value *does* look like a
constant; then the planner will correctly observe that only a small part
of the table is to be scanned, and do the right thing (given reasonably
up-to-date ANALYZE statistics, anyway).  The most trustworthy way of
doing that is to compute the "now() - interval" value on the client side
and send over a timestamp constant.  If that's not convenient for some
reason, people frequently use a hack like this:

    create function ago(interval) returns timestamptz as
    'select now() - $1' language sql strict immutable;

    select ... where timestampfield > ago('60 seconds');

This is a kluge because you are lying when you say that the result of
ago() is immutable; it obviously isn't.  But the planner will believe
you, fold the function call to a constant during planning, and use the
result.  CAUTION: this works nicely for interactively-issued SQL
queries, but it will come back to bite you if you try to use ago() in
prepared queries or plpgsql functions, because the premature collapsing
of the now() result will become significant.

We have speculated about ways to get the planner to treat expressions
involving now() and similar functions as pseudo-constants, so that it
would do the right thing in this sort of situation without any kluges.
It's not been done yet though.

BTW, the above discussion applies to PG 7.3 and later; if you're dealing
with an old version then there are some different considerations.

            regards, tom lane

Re: Why does a simple query not use an obvious index?

От
"Steinar H. Gunderson"
Дата:
On Sun, Aug 29, 2004 at 03:38:00PM -0600, Scott Marlowe wrote:
>>> select somefield from sometable where timestampfield > now()-'60
>>> seconds'::interval
>>>
>>> and count the number of returned rows.  If there's a lot, it won't be
>>> any faster, if there's a few, it should be a win.
>> Why would this ever be faster? And how could postgres ever calculate that
>> without doing a sequential scan when count(*) would force it to do a
>> sequential scan?
> Because, count(*) CANNOT use an index.  So, if you're hitting, say,
> 0.01% of the table (let's say 20 out of 20,000,000 rows or something
> like that) then the second should be MUCH faster.

Of course count(*) can use an index:

images=# explain analyze select count(*) from images where event='test';
                                                             QUERY PLAN
            

------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=168.97..168.97 rows=1 width=0) (actual time=68.211..68.215 rows=1 loops=1)
   ->  Index Scan using unique_filenames on images  (cost=0.00..168.81 rows=63 width=0) (actual time=68.094..68.149
rows=8loops=1) 
         Index Cond: ((event)::text = 'test'::text)
 Total runtime: 68.369 ms
(4 rows)

However, it cannot rely on an index _alone_; it has to go fetch the relevant
pages, but of course, so must "select somefield from" etc..

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Why does a simple query not use an obvious index?

От
Greg Stark
Дата:
"Scott Marlowe" <smarlowe@qwest.net> writes:

> PostgreSQL has a "generic" aggregate method.  Imagine instead doing a
> select count(id1+id2-id3) from table where ...  In that instance, it's
> not a simple shortcut to just grab the number of rows anymore.  Since
> PostgreSQL uses a generic aggregate method that can be expanded by the
> user with custom aggregates et. al., it has no optimizations to make
> simple count(*) fast, like many other databases.

People expect count(*) _without a where clause_ to be cached in a single
global variable. Postgres can't do this, but the reason has everything to do
with MVCC, not with postgres's generalized aggregates. Because of MVCC
Postgres can't just store a single cached value, because there is no single
cached value. It would have to store a complete history back to the oldest
extant transaction.

However in this case the user has a where clause. No database is going to
cache values of count(*) for random where clauses. But that doesn't stop
Postgres from using an index to fetch the records.


> > > select somefield from sometable where timestampfield > now()-'60
> > > seconds'::interval
> > >
> > > and count the number of returned rows.  If there's a lot, it won't be
> > > any faster, if there's a few, it should be a win.
> >
> > Why would this ever be faster? And how could postgres ever calculate that
> > without doing a sequential scan when count(*) would force it to do a
> > sequential scan?
>
> Because, count(*) CANNOT use an index.  So, if you're hitting, say,
> 0.01% of the table (let's say 20 out of 20,000,000 rows or something
> like that) then the second should be MUCH faster.

I think you've applied these past discussions and come up with some bogus
conclusions.

The problem here has nothing to do with the count(*) and everything to do with
the WHERE clause. To fetch the records satisfying that where clause postgres
has to do exactly the same thing regardless of whether it's going to feed the
data to count(*) or return some or all of it to the client.

If postgres decides the where clause isn't selective enough it'll choose to
use a sequential scan. However it would do that regardless of whether you're
calling count(*) or not. If the number is records is substantial then you
would get the overhead of the scan plus the time it takes to transfer all that
unnecessary data to the user.

What you're probably thinking of when you talk about general purpose aggregate
interfaces is the difficulty of making min()/max() use indexes. That's a whole
other case entirely. That's where postgres's generalized aggregates leaves it
without enough information about what records the aggregate functions are
interested in and what index scans might make them faster.

None of these common cases end up making it a good idea to read the records
into the clients and do the work in the client. The only cases where that
would make sense would be if the function requires doing some manipulation of
the data that's awkward to express in sql. The "top n" type of query is the
usual culprit, but with postgres's new array functions even that becomes
tractable.

--
greg

Re: Why does a simple query not use an obvious index?

От
Rod Taylor
Дата:
> People expect count(*) _without a where clause_ to be cached in a single
> global variable. Postgres can't do this, but the reason has everything to do

Someone should write an approx_count('table') function that reads
reltuples from pg_class and tell them to use it in combination with
autovac.

I've yet to see someone use count(*) across a table and not round the
result themselves (approx 9 million clients).



Re: Why does a simple query not use an obvious index?

От
Mr Pink
Дата:
--- Greg Stark <gsstark@mit.edu> wrote:

>
> Mr Pink <mr_pink_is_the_only_pro@yahoo.com> writes:
>
> > AFAIK postgres doesn't peek at values used in a query when optimizing
>
> Of course it does.

But not ones returned by a function such as now(), or when you use bind variables, as Tom aptly
explained.

That's what I meant by 'peek'. Interestingly enough Oracle does that, it's inline with their
policy of recommending the use of bind variables. Perhaps postgres could use such a feature some
day.





__________________________________
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail

Re: Why does a simple query not use an obvious index?

От
Tom Lane
Дата:
Mr Pink <mr_pink_is_the_only_pro@yahoo.com> writes:
>>> AFAIK postgres doesn't peek at values used in a query when optimizing
>>
>> Of course it does.

> But not ones returned by a function such as now(), or when you use
> bind variables, as Tom aptly explained.

FWIW, 8.0 does have the ability to use the values of bind variables for
planning estimation (Oliver Jowett did the work for this).  The main
issue in the way of dealing with now() is that whatever we did to now()
would apply to all functions marked "stable", and it's a bit
nervous-making to assume that they should all be treated this way.
Or we could introduce more function volatility categories, but that's
not much fun either.

            regards, tom lane

Re: Why does a simple query not use an obvious index?

От
Mr Pink
Дата:
Yeah! Bind variable peeking is great news. I did actually read the guff, but forgot about that.

Version 8 is looking great on paper, I hope I'll get a chance to play wth it soon.

I can kind of appreciate your point about peeking stable functions, however, I would have thought
that if it was possible to do for bind variables (which could change many times in a transaction)
then it would make even more sense for a stable function which doesn't change for the life of the
transaction. No doubt this is an oversimplification the situation.

regards
Mr Pink



_______________________________
Do you Yahoo!?
Win 1 of 4,000 free domain names from Yahoo! Enter now.
http://promotions.yahoo.com/goldrush

Re: Why does a simple query not use an obvious index?

От
Guy Thornley
Дата:
On Sun, Aug 29, 2004 at 06:03:43PM -0400, Tom Lane wrote:
> >> select somefield from sometable where timestampfield > now()-'60
> >> seconds'::interval
>
> This is a FAQ, but since the archives don't seem to be up at the moment,
> here's the answer once again:
>
> The expression "now() - something" is not a constant, so the planner
> is faced with "timestampfield > unknownvalue".  Its default assumption
> about the number of rows that will match is much too high to make an
> indexscan look profitable (from memory, I think it guesses that about
> a third of the table will match...).


Ok; this explains some really wierd stuff I've been seeing.

However, I'm seeing breakage of the form mentioned by the original poster
even when the query uses a _constant_ timestamp: [Postgres 7.4.3]

        ntais# \d detect.stats
                                                 Table "detect.stats"
            Column    |           Type           |                          Modifiers
        --------------+--------------------------+-------------------------------------------------------------
         anomaly_id   | integer                  | not null
         at           | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
         resolution   | real                     | default 1.0
         values       | real[]                   |
         stat_type_id | integer                  | not null
        Indexes:
            "stats_pkey" primary key, btree (anomaly_id, stat_type_id, "at")
            "stats__ends_at" btree (stats__ends_at("at", resolution, "values"))
        Foreign-key constraints:
            "$1" FOREIGN KEY (anomaly_id) REFERENCES anomalies(anomaly_id) ON DELETE CASCADE
            "$2" FOREIGN KEY (stat_type_id) REFERENCES stat_types(stat_type_id)


        ntais=# SET enable_seqscan = on;
        SET
        ntais=# EXPLAIN ANALYZE
          SELECT anomaly_id, stat_type_id
          FROM detect.stats
          WHERE detect.stats__ends_at(at, resolution, values) > '2004-08-30 16:21:09+12'::timestamptz
          ORDER BY anomaly_id, stat_type_id
          ;

                                                                 QUERY PLAN
             

-----------------------------------------------------------------------------------------------------------------------------
         Sort  (cost=602473.59..608576.72 rows=2441254 width=8) (actual time=198577.407..198579.136 rows=6152 loops=1)
           Sort Key: anomaly_id, stat_type_id
           ->  Seq Scan on stats  (cost=0.00..248096.42 rows=2441254 width=8) (actual time=198299.685..198551.460
rows=6152loops=1) 
                 Filter: (stats__ends_at("at", resolution, "values") > '2004-08-30 16:21:09+12'::timestamp with time
zone)
         Total runtime: 198641.649 ms
        (5 rows)


        ntais=# EXPLAIN ANALYZE
          SELECT anomaly_id, stat_type_id
          FROM detect.stats
          WHERE detect.stats__ends_at(at, resolution, values) > '2004-08-30
          16:21:09+12'::timestamptz
          ORDER BY anomaly_id, stat_type_id
          ;

                                                                         QUERY PLAN
                            

--------------------------------------------------------------------------------------------------------------------------------------------
         Sort  (cost=10166043.26..10172146.40 rows=2441254 width=8) (actual time=44.710..46.661 rows=6934 loops=1)
           Sort Key: anomaly_id, stat_type_id
           ->  Index Scan using stats__ends_at on stats  (cost=0.00..9811666.09 rows=2441254 width=8) (actual
time=0.075..24.702rows=6934 loops=1) 
         Index Cond: (stats__ends_at("at", resolution, "values") > '2004-08-30 16:21:09+12'::timestamp with time zone)
         Total runtime: 50.354 ms
        (5 rows)


        ntais=# SELECT count(*) FROM detect.stats;
          count
        ---------
         7326151
        (1 row)

Ive done repeated ANALYZE's, both table-specific and database-wide, and get
the same result every time.

For us, a global 'enable_seqscan = off' in postgresql.conf is the way to go.
You occasionally see an odd plan while developing a query (eg: scanning an
index with no contraint to simply get ORDER BY). Usually thats a broken
query/index, and I simply fix it.


Guy Thornley

Re: Why does a simple query not use an obvious index?

От
Tom Lane
Дата:
Guy Thornley <guy@esphion.com> writes:
> However, I'm seeing breakage of the form mentioned by the original poster
> even when the query uses a _constant_ timestamp: [Postgres 7.4.3]

>         Indexes:
>             "stats_pkey" primary key, btree (anomaly_id, stat_type_id, "at")
>             "stats__ends_at" btree (stats__ends_at("at", resolution, "values"))

>         ntais=# EXPLAIN ANALYZE
>           SELECT anomaly_id, stat_type_id
>           FROM detect.stats
>           WHERE detect.stats__ends_at(at, resolution, values) > '2004-08-30 16:21:09+12'::timestamptz
>           ORDER BY anomaly_id, stat_type_id
>           ;

Here I'm afraid you're just stuck until 8.0 comes out (or you're feeling
brave enough to use a beta).  Releases before 8.0 do not maintain any
statistics about the contents of functional indexes, so the planner is
flying blind here in any case, and you end up with the very same 1/3rd
default assumption no matter what the right-hand side looks like.
You'll have to fall back to Plan A or Plan B to get this case to work
in 7.4.

            regards, tom lane

Re: Why does a simple query not use an obvious index?

От
Greg Stark
Дата:
Guy Thornley <guy@esphion.com> writes:

>             "stats__ends_at" btree (stats__ends_at("at", resolution, "values"))

Postgres 7.4 doesn't have any stats on functional indexes. So it's back to
just guessing at the selectivity of this. 8.0 does gather stats for functional
indexes so it should be better off.

--
greg

Re: Why does a simple query not use an obvious index?

От
"Steinar H. Gunderson"
Дата:
On Sun, Aug 29, 2004 at 06:03:43PM -0400, Tom Lane wrote:
> The expression "now() - something" is not a constant, so the planner
> is faced with "timestampfield > unknownvalue".  Its default assumption
> about the number of rows that will match is much too high to make an
> indexscan look profitable (from memory, I think it guesses that about
> a third of the table will match...).

Out of curiosity, does the subselect query I presented earlier in the thread
count as "a constant"? It gives the correct query plan, but this could of
course just be correct by accident...

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Why does a simple query not use an obvious index?

От
Tom Lane
Дата:
"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
> On Sun, Aug 29, 2004 at 06:03:43PM -0400, Tom Lane wrote:
>> The expression "now() - something" is not a constant, so the planner
>> is faced with "timestampfield > unknownvalue".

> Out of curiosity, does the subselect query I presented earlier in the thread
> count as "a constant"? It gives the correct query plan, but this could of
> course just be correct by accident...

That was on 7.2, wasn't it?  I don't remember any longer exactly how 7.2
does this stuff, but it's different from 7.3 and later (and certainly
not any more "right").

You did at one time need to hide now() in a subselect to get the planner
to consider an indexscan at all --- that was before we made the
distinction between immutable and stable functions, and so now() had
to be treated as unsafe to index against (just as random() still is).
I think 7.2 behaved that way but I'm not totally sure.

            regards, tom lane

Re: Why does a simple query not use an obvious index?

От
Pierre-Frédéric Caillaud
Дата:
    Most likely your table has a SERIAL PRIMARY KEY in it, in this case, do
the following :

       my_limit = select primary_key_field from billing where timestamp >
(now()-60)::timestamp ORDER BY timestamp ASC LIMIT 1;

    then

    SELECT count(*) FROM billing WHERE primary_key_field>=my_limit;

    I don't know if it'll work better, but you can try.

    When you insert records in the table, they are appended at the end, so
this type of recent records query only requires reading the tail of the
table. It should be fast if planned correctly.

> Strangely enough, I don't find that result surprising.
>
> if the vast bulk of the data is in the past and now()-60 represents a
> very small slice of the data
> we might expect that using an index is optimal, but there could be many
> reasons why it doesn't get
> used.
>
> AFAIK postgres doesn't peek at values used in a query when optimizing so
> any query with a ">" type
> condition is gonna have a seq scan as the plan since the best guess is
> that you are gonna match
> 50% of the table. That's one possible explanation.
>
> Another is that if the condition data types don't match then an indes
> won't be used you could try:
>
>   select count(*) from billing where timestamp > (now()-60)::timestamp
>
> Might make a difference, I dunno, it's a case of testing amd seing what
> happens.
>
> You could try lowering the random page cost, it might help, but I don't
> like your chances.
>
> If your problem is that you want to access the most recent data from a
> large table with fast
> response, then you could consider:
>
> 1. a "recent" index. If the data is within the "recent" time from set a
> flag to true, other wise
> null. Reset the flags periodically. Nulls aren't indexed so the
> selectivity of such an index is
> much higher. Can work wonders.
>
> 2, duplicate recent data in another table that is purged when data
> passes the age limit. This is
> basic archiving.
>
> Something like that. Hopefully someone with more knowlege of the
> optimaizer will have a brighter
> suggestion for you.
>
> What version are you using by the way?
> Regards
> Mr Pink
> --- Jack Kerkhof <jack.kerkhof@guest-tek.com> wrote:
>
>> The query:
>>     select count(*) from billing where timestamp > now()-60
>>
>> should obviously use the index
>>
>>     CREATE INDEX billing_timestamp_idx ON billing USING btree
>> ("timestamp"
>> timestamp_ops);
>>
>> on a table with 1400000 rows.
>>
>> But it uses a Seq Scan. If I set enable_seqscan=no, it indicates a
>> queryplan
>> could not be calculated.
>>
>> Why does this simple query not use the timestamp index, and how can I
>> get it
>> to?
>>
>> Thanks, Jack
>>
>>      Jack Kerkhof
>>       Research & Development
>>       jack.kerkhof@guest-tek.com
>>       www.guest-tek.com
>>       1.866.509.1010 3480
>>
>> --------------------------------------------------------------------------
>>
>>       Guest-Tek is a leading provider of broadband technology solutions
>> for
>> the hospitality industry. Guest-Tek's GlobalSuite�Ehigh-speed Internet
>> solution enables hotels to offer their guests the convenience of wired
>> and/or wireless broadband Internet access from guest rooms, meeting
>> rooms
>> and public areas.
>>
>>
>>
>
>
>
>
> __________________________________
> Do you Yahoo!?
> Take Yahoo! Mail with you! Get it on your mobile phone.
> http://mobile.yahoo.com/maildemo
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>



Re: Why does a simple query not use an obvious index?

От
Pierre-Frédéric Caillaud
Дата:
>> Also, count(*) is likely to always generate a seq scan due to the way
>> aggregates are implemented currently in pgsql.  you might want to try:


    By the way, in an ideal world, count(*) should only read the index on the
timetamp column, not the rows. I guess this is not the case. Would this be
an useful optimization ?

Re: Why does a simple query not use an obvious index?

От
Pierre-Frédéric Caillaud
Дата:
    Another primary key trick :

    If you insert records with a serial primary key, and rarely delete them
or update the timestamp, you can use the primary key to compute an
approximate number of rows.

    a := SELECT pkey FROM table WHERE timestamp() > threshold ORDER BY
timestamp ASC LIMIT 1;
    b := SELECT pkey FROM table WHERE ORDER BY pkey DESC LIMIT 1;

    (b-a) is an approximate count.

    Performance is great because you only fetch two rows. Index scan is
guaranteed (LIMIT 1). On the downside, you get an approximation, and this
only works for tables where timestamp is a date of INSERT, timestamp
worrelated wiht pkey) not when timestamp is a date of UPDATE (uncorrelated
with pkey).

Re: Why does a simple query not use an obvious index?

От
Bruno Wolff III
Дата:
On Mon, Aug 30, 2004 at 21:21:26 +0200,
  Pierre-Frédéric Caillaud <lists@boutiquenumerique.com> wrote:
> >>Also, count(*) is likely to always generate a seq scan due to the way
> >>aggregates are implemented currently in pgsql.  you might want to try:
>
>
>     By the way, in an ideal world, count(*) should only read the index
>     on the  timetamp column, not the rows. I guess this is not the case. Would
> this be  an useful optimization ?

It's in the archives. The short answer is that no, postgres has to check
the heap to check tuple visibility to the current transaction.

Re: Why does a simple query not use an obvious index?

От
Greg Stark
Дата:

[I'm actually responding to the previous post from Tom Lane, but I've deleted
it and the archives seem to be down again.]


The assumption being made is that the first provided result is representative
of all future results. I don't see any reason that making this assumption of
all stable functions should be less scary than making the assumption about
user provided parameters.

However I have the complementary reaction. I find peeking at the first
bind parameter to be scary as hell. Functions seem slightly less scary.

On Oracle Peeking at bind parameters is a feature explicitly intended for DSS
data warehouse type systems. The use of placeholders there was purely for
security and programming ease, not efficiency, since the queries are only
planned executed a small number of times per plan. These are systems that
suffered enormously without the parameter values. They often involved full
table scans or bitmap index scans and without the statistics produced awful
plans.

For OLTP systems peeking at placeholders is more a danger than a benefit. The
query will be executed thousands of times and if it's planned based on a
single unusual value initially the entire system could fail.

Consider the following scenario which isn't farfetched at all. In fact I think
it well describes my current project:

I have a table with a few million records. 99% of the time users are working
with only a few hundred records at most. There's an index on the column
they're keying off of. 1% of the key values have an unusually large number of
records.

Without peeking at placeholders the system should see that virtually all the
key values are well under the threshold for an index scan to be best. So it
always uses an index scan. 1% of the time it takes longer than that it would
have with a sequential scan, but only by a small factor. (On the whole we're
probably still better off avoiding the cache pollution anyways.)

With peeking at placeholders 99% of the backends would perform the same way.
However 1 backend in 100 sees one of these unusual values for its first query.
This backend will use a sequential scan for *every* request. Executing a
sequential table scan of this big table once a second this backend will drive
the entire system into the ground.

This means every time I start the system up I stand a small but significant
chance of it just completely failing to perform properly. Worse, apache is
designed to periodically start new processes, so at any given time the system
could just randomly fall over and die.

I would rather incur a 10% penalty on every query than have a 1% chance of it
keeling over and dieing. Given this I would when I upgrade to 8.0 have to
ensure that my application driver is either not using placeholders at all (at
the protocol level -- I always prefer them at the api level) or ensure that
postgres is *not* peeking at the value.

I like the feature but I just want to be sure that it's optional.

--
greg

Re: Why does a simple query not use an obvious index?

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> However I have the complementary reaction. I find peeking at the first
> bind parameter to be scary as hell. Functions seem slightly less scary.

FWIW, we only do it in the context of unnamed parameterized queries.
As the protocol docs say, those are optimized on the assumption that
they will be executed only once.  It seems entirely legitimate to me
to use the parameter values in such a case.

We might in future get braver about using sample parameter values,
but 8.0 is conservative about it.

            regards, tom lane

Re: Why does a simple query not use an obvious index?

От
Greg Stark
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > However I have the complementary reaction. I find peeking at the first
> > bind parameter to be scary as hell. Functions seem slightly less scary.
>
> FWIW, we only do it in the context of unnamed parameterized queries.

I knew that. That's why I hadn't been jumping up and down screaming. I was
watching though to insist on an option to disable it if it became more
widespread.

> As the protocol docs say, those are optimized on the assumption that
> they will be executed only once.  It seems entirely legitimate to me
> to use the parameter values in such a case.

Sure. It's a great feature to have; it means people can be more aggressive
about using placeholders for other reasons without worrying about performance
impacts.

> We might in future get braver about using sample parameter values,
> but 8.0 is conservative about it.

If they're used for named parameters I would strongly recommend guc variable
to control the default on a server-wide basis. It could be a variable that
individual sessions could override since there's no security or resource
implications. It's purely a protocol interface issue.

For that matter, would it be possible for the default selectivity estimates to
be a guc variable? It's something that the DBA -- or even programmer on a
per-session basis -- might be able to provide a better value for his
applications than any hard coded default.

Or perhaps it would be one valid use of hints to provide selectivity estimates
for blind placeholders. It would be nice to be able to say for example:

  select * from foo where col > $0 /*+ 5% */ AND col2 > $1 /*+ 10% */

Would there be any hope of convincing you that this is a justifiable use of
hints; providing information that the optimizer has absolutely no possibility
of ever being able to calculate on its own?

--
greg

Re: Why does a simple query not use an obvious index?

От
Mr Pink
Дата:
Hi Greg, Tom, etal

It's true that oracle only peeks during a hard parse, and this can have good or bad results
depending on the situation. Basically, the first value used in that query will determine the plan
until that query is bumped from the sql cache or the server is restarted. As far as I know, there
is no option to disable that feature in Oracle, I don't know about postgres.

Overall, I think it's a good feature because it helps us in the goal of reducing hardparsing (that
was it's real purpose in oracle). The trick as with all good features is to use it cleverly. For
example, you could run scripts on server startup that run such queries with optimal values before
any one gets back on. If your application has optimal use of bind variables allowing re-use of
query plan, and the sql cache has enough memory then the query plans you created at server startup
could be expected to be current for the life of that instance.

I write all this from my knowlegdge of Oracle, but I can't be sure how it applies to postgres.
Come to think about it, I don't think I've seen a good discussion of plan caching, hard parsing
and such like specifically related to pg. I'd really like to know more about how pg treats that
stuff.

regards
Mr Pink

--- Greg Stark <gsstark@mit.edu> wrote:

>
>
> [I'm actually responding to the previous post from Tom Lane, but I've deleted
> it and the archives seem to be down again.]
>
>
> The assumption being made is that the first provided result is representative
> of all future results. I don't see any reason that making this assumption of
> all stable functions should be less scary than making the assumption about
> user provided parameters.
>
> However I have the complementary reaction. I find peeking at the first
> bind parameter to be scary as hell. Functions seem slightly less scary.
>
> On Oracle Peeking at bind parameters is a feature explicitly intended for DSS
> data warehouse type systems. The use of placeholders there was purely for
> security and programming ease, not efficiency, since the queries are only
> planned executed a small number of times per plan. These are systems that
> suffered enormously without the parameter values. They often involved full
> table scans or bitmap index scans and without the statistics produced awful
> plans.
>
> For OLTP systems peeking at placeholders is more a danger than a benefit. The
> query will be executed thousands of times and if it's planned based on a
> single unusual value initially the entire system could fail.
>
> Consider the following scenario which isn't farfetched at all. In fact I think
> it well describes my current project:
>
> I have a table with a few million records. 99% of the time users are working
> with only a few hundred records at most. There's an index on the column
> they're keying off of. 1% of the key values have an unusually large number of
> records.
>
> Without peeking at placeholders the system should see that virtually all the
> key values are well under the threshold for an index scan to be best. So it
> always uses an index scan. 1% of the time it takes longer than that it would
> have with a sequential scan, but only by a small factor. (On the whole we're
> probably still better off avoiding the cache pollution anyways.)
>
> With peeking at placeholders 99% of the backends would perform the same way.
> However 1 backend in 100 sees one of these unusual values for its first query.
> This backend will use a sequential scan for *every* request. Executing a
> sequential table scan of this big table once a second this backend will drive
> the entire system into the ground.
>
> This means every time I start the system up I stand a small but significant
> chance of it just completely failing to perform properly. Worse, apache is
> designed to periodically start new processes, so at any given time the system
> could just randomly fall over and die.
>
> I would rather incur a 10% penalty on every query than have a 1% chance of it
> keeling over and dieing. Given this I would when I upgrade to 8.0 have to
> ensure that my application driver is either not using placeholders at all (at
> the protocol level -- I always prefer them at the api level) or ensure that
> postgres is *not* peeking at the value.
>
> I like the feature but I just want to be sure that it's optional.
>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>




__________________________________
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail