Обсуждение: timestamped archive data index searches

Поиск
Список
Период
Сортировка

timestamped archive data index searches

От
"Stephen Birch"
Дата:
I know that the question of forcing PostgreSQL to use an index during search
ops is an FAQ and have worked with each of the suggested solutions to no
avail.

From the nature of those questions, it looks like the problem I have to
solve is common and unsolved. I am using the database to archive data
arriving at a rate of about 100 records a minute, the old data needs to be
stored hence the use of a database. Each record is timestamped as it is
inserted in the database.

The system needs a web site that can display data from (say) the last hour
of data.

Now, when the database is searched using a select on the timestamp it never
uses the index on that field  no matter how I set the db params. I think
that the query optimizer is noticing the sequential nature of the timestamp
field and assuming that an index will always slow the query.

The problem is that the retrieval of the past hour's data has to scan the
entire database and so is very, very slow.

Any ideas?


_________________________________________________________________
MSN Photos is the easiest way to share and print your photos:
http://photos.msn.com/support/worldwide.aspx


Re: timestamped archive data index searches

От
Jason Earl
Дата:
"Stephen Birch" <sgbirch@hotmail.com> writes:

> I know that the question of forcing PostgreSQL to use an index
> during search ops is an FAQ and have worked with each of the
> suggested solutions to no avail.
>
> > From the nature of those questions, it looks like the problem I
> > have to solve is common and unsolved. I am using the database to
> > archive data arriving at a rate of about 100 records a minute, the
> > old data needs to be stored hence the use of a database. Each
> > record is timestamped as it is inserted in the database.
>
> The system needs a web site that can display data from (say) the last
> hour of data.
>
> Now, when the database is searched using a select on the timestamp it
> never uses the index on that field  no matter how I set the db
> params. I think that the query optimizer is noticing the sequential
> nature of the timestamp field and assuming that an index will always
> slow the query.
>
> The problem is that the retrieval of the past hour's data has to scan
> the entire database and so is very, very slow.
>
> Any ideas?

What exactly does the query look like?  Have you ANALYZED the data?  I
do something very similar to this and it should be possible to get
PostgreSQL to use the index.

Jason

Re: timestamped archive data index searches

От
"Stephen Birch"
Дата:
The select is something like..

SELECT AVG(x) FROM arch WHERE tstamp > :t

or

SELECT * FROM arch WHERE tstamp > :t.

I am using embedded SQL and the variable t is set to the current time minus
one hour (60*60). The results are correct. It is just taking far too long.

Yes, I have been using ANALYZE, that is how I know a sequential search has
been selected. I have also tried telling the database to not use sequential
searches using the appropriate SET command.

As I mentioned, I believe this may be a 'feature' of PostgreSQL that will be
encountered by many people. Since the incoming data is timestamped as it
arrives, the time column (I called it tstamp) will always be ordered. This
fools the optimizer into thinking a sequential search would be faster.

In fact, my query only needs to look at a small subset of the stored data,
just the recently inserted records. But the whole database is always
checked.

I love PostgreSQL and have used it for many years, but this has me stumped!

Steve


>From: Jason Earl <jason.earl@simplot.com>
>To: "Stephen Birch" <sgbirch@hotmail.com>
>CC: pgsql-general@postgresql.org
>Subject: Re: [GENERAL] timestamped archive data index searches
>Date: 16 Jul 2002 13:15:26 -0600
>
>"Stephen Birch" <sgbirch@hotmail.com> writes:
>
> > I know that the question of forcing PostgreSQL to use an index
> > during search ops is an FAQ and have worked with each of the
> > suggested solutions to no avail.
> >
> > > From the nature of those questions, it looks like the problem I
> > > have to solve is common and unsolved. I am using the database to
> > > archive data arriving at a rate of about 100 records a minute, the
> > > old data needs to be stored hence the use of a database. Each
> > > record is timestamped as it is inserted in the database.
> >
> > The system needs a web site that can display data from (say) the last
> > hour of data.
> >
> > Now, when the database is searched using a select on the timestamp it
> > never uses the index on that field  no matter how I set the db
> > params. I think that the query optimizer is noticing the sequential
> > nature of the timestamp field and assuming that an index will always
> > slow the query.
> >
> > The problem is that the retrieval of the past hour's data has to scan
> > the entire database and so is very, very slow.
> >
> > Any ideas?
>
>What exactly does the query look like?  Have you ANALYZED the data?  I
>do something very similar to this and it should be possible to get
>PostgreSQL to use the index.
>
>Jason




_________________________________________________________________
Send and receive Hotmail on your mobile device: http://mobile.msn.com


Re: timestamped archive data index searches

От
Martijn van Oosterhout
Дата:
On Wed, Jul 17, 2002 at 08:45:53AM +0000, Stephen Birch wrote:
>
> The select is something like..
>
> SELECT AVG(x) FROM arch WHERE tstamp > :t
>
> or
>
> SELECT * FROM arch WHERE tstamp > :t.
>
> I am using embedded SQL and the variable t is set to the current time minus
> one hour (60*60). The results are correct. It is just taking far too long.
>
> Yes, I have been using ANALYZE, that is how I know a sequential search has
> been selected. I have also tried telling the database to not use sequential
> searches using the appropriate SET command.

Do you have an index on tstamp? What does EXPLAIN ANALYSE tell you? Both
with and without seq_scans enabled.

> As I mentioned, I believe this may be a 'feature' of PostgreSQL that will be
> encountered by many people. Since the incoming data is timestamped as it
> arrives, the time column (I called it tstamp) will always be ordered. This
> fools the optimizer into thinking a sequential search would be faster.

The planner in 7.2 knows about clustering.

More details please.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: timestamped archive data index searches

От
Tom Lane
Дата:
"Stephen Birch" <sgbirch@hotmail.com> writes:
> The select is something like..
> SELECT AVG(x) FROM arch WHERE tstamp > :t

What *exactly* arrives at the backend when this is done?  (Turn on query
logging to find out.)

I'm suspicious that the problem is a datatype-mismatch issue and you
might need a cast.  But it's hard to tell without seeing the exact
query.

            regards, tom lane

Re: timestamped archive data index searches

От
Tom Lane
Дата:
"Stephen Birch" <sgbirch@hotmail.com> writes:
> I see the same problem if I query the database using psql. But to answer
> your question, here is an example query that fails to use the index on
> tstamp.

> select sum(vol) from tdet where tstamp > 1026921570;

Some experimentation shows that that expression is actually interpreted
as
    where text(tstamp) > '1026921570'::text
No wonder it ain't using the index :-(.  I'm surprised that you believe
the results are correct --- most display styles for timestamps wouldn't
come anywhere near making this work as a textual comparison.

There are various hacks for converting numeric Unix timestamps to
Postgres timestamps.  The logically cleanest way is

regression=# select 'epoch'::timestamptz + '1026921570 seconds'::interval;
        ?column?
------------------------
 2002-07-17 11:59:30-04
(1 row)

If you write your query as
    select sum(vol) from tdet where tstamp > ('epoch'::timestamptz + '1026921570 seconds'::interval);
you should find that it'll use the index.

> Also, I can get the same effect using pgsql with something like:
> select sum(vol) from tdet where date(tstamp) = '2002-07-17';

> Again, I would hope this would use the index on tstamp to select a small
> subset of the very large database.

Not unless you build the index on date(tstamp).

            regards, tom lane

Re: timestamped archive data index searches

От
"Stephen Birch"
Дата:

Thank you for your reply, as usual you give tons to think about.

If I have understood your reasoning, I would expect your suggestion to work.
But it still seems to be using a sequence scan :-(

I also tried the following, which explain said is also using a sequence scan
:-(

SELECT sum(vol) FROM det
  WHERE tstamp > (current_timestamp - '5 seconds'::interval);

Any ideas?

(all tests done on SuSE 7.3, kernel updated to 2.4.18 and PostgeSQL at
7.2.1)



>From: Tom Lane <tgl@sss.pgh.pa.us>
>To: "Stephen Birch" <sgbirch@hotmail.com>
>CC: pgsql-general@postgreSQL.org
>Subject: Re: [GENERAL] timestamped archive data index searches Date: Wed,
>17 Jul 2002 12:37:14 -0400
>
>"Stephen Birch" <sgbirch@hotmail.com> writes:
> > I see the same problem if I query the database using psql. But to answer
> > your question, here is an example query that fails to use the index on
> > tstamp.
>
> > select sum(vol) from tdet where tstamp > 1026921570;
>
>Some experimentation shows that that expression is actually interpreted
>as
>    where text(tstamp) > '1026921570'::text
>No wonder it ain't using the index :-(.  I'm surprised that you believe
>the results are correct --- most display styles for timestamps wouldn't
>come anywhere near making this work as a textual comparison.
>
>There are various hacks for converting numeric Unix timestamps to
>Postgres timestamps.  The logically cleanest way is
>
>regression=# select 'epoch'::timestamptz + '1026921570 seconds'::interval;
>         ?column?
>------------------------
>  2002-07-17 11:59:30-04
>(1 row)
>
>If you write your query as
>     select sum(vol) from tdet where tstamp > ('epoch'::timestamptz +
>'1026921570 seconds'::interval);
>you should find that it'll use the index.
>
> > Also, I can get the same effect using pgsql with something like:
> > select sum(vol) from tdet where date(tstamp) = '2002-07-17';
>
> > Again, I would hope this would use the index on tstamp to select a small
> > subset of the very large database.
>
>Not unless you build the index on date(tstamp).
>
>            regards, tom lane




_________________________________________________________________
Join the world�s largest e-mail service with MSN Hotmail.
http://www.hotmail.com


Re: timestamped archive data index searches

От
Tom Lane
Дата:
"Stephen Birch" <sgbirch@hotmail.com> writes:
> I also tried the following, which explain said is also using a sequence scan
> :-(

> SELECT sum(vol) FROM det
>   WHERE tstamp > (current_timestamp - '5 seconds'::interval);

This doesn't work (in 7.2 and before) because the planner doesn't think
current_timestamp is a constant.  You can get around that with a custom
function that hides the current_timestamp computation and is marked
isCachable --- this is a cheat but works well enough in interactive
queries.  (It'd not work inside plpgsql unfortunately.)  See past
archived discussions --- searching for isCachable should turn up
examples.

Beginning in 7.3 there will be a finer-grain notion of constant
functions so that the planner can optimize this sort of thing as-is.

            regards, tom lane

Re: timestamped archive data index searches

От
"Stephen Birch"
Дата:
ok

I assume that the string 'now' would suffer from the same problem and should
also be avoided until 7.3?

BTW Another thing I notice with a large database is that count(*) takes
ages, it
looks like it has to scan every record. This may be a newbie question or a
faq, but isn't there a record count kept that would make count(*) return
quickly?

Steve

PS I have been using PostgreSQL for about 5 years now in a mission critical
application and have grown to love it. What I can't figure out is how Tom
and the others find time to do such fantastic work!


>From: Tom Lane <tgl@sss.pgh.pa.us>
>To: "Stephen Birch" <sgbirch@hotmail.com>
>CC: pgsql-general@postgreSQL.org
>Subject: Re: [GENERAL] timestamped archive data index searches Date: Sat,
>20 Jul 2002 15:11:24 -0400
>
>"Stephen Birch" <sgbirch@hotmail.com> writes:
> > I also tried the following, which explain said is also using a sequence
>scan
> > :-(
>
> > SELECT sum(vol) FROM det
> >   WHERE tstamp > (current_timestamp - '5 seconds'::interval);
>
>This doesn't work (in 7.2 and before) because the planner doesn't think
>current_timestamp is a constant.  You can get around that with a custom
>function that hides the current_timestamp computation and is marked
>isCachable --- this is a cheat but works well enough in interactive
>queries.  (It'd not work inside plpgsql unfortunately.)  See past
>archived discussions --- searching for isCachable should turn up
>examples.
>
>Beginning in 7.3 there will be a finer-grain notion of constant
>functions so that the planner can optimize this sort of thing as-is.
>
>            regards, tom lane






_________________________________________________________________
Send and receive Hotmail on your mobile device: http://mobile.msn.com


Re: timestamped archive data index searches

От
merlyn@stonehenge.com (Randal L. Schwartz)
Дата:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> "Stephen Birch" <sgbirch@hotmail.com> writes:
>> I also tried the following, which explain said is also using a sequence scan
>> :-(

>> SELECT sum(vol) FROM det
>> WHERE tstamp > (current_timestamp - '5 seconds'::interval);

Tom> This doesn't work (in 7.2 and before) because the planner doesn't think
Tom> current_timestamp is a constant.  You can get around that with a custom
Tom> function that hides the current_timestamp computation and is marked
Tom> isCachable --- this is a cheat but works well enough in interactive
Tom> queries.  (It'd not work inside plpgsql unfortunately.)  See past
Tom> archived discussions --- searching for isCachable should turn up
Tom> examples.

I've found this to work:

SELECT sum(vol) FROM det
WHERE tstamp > (select current_timestamp - '5 seconds'::interval);

I don't know where I got that trick from, but it works fine.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

Re: timestamped archive data index searches

От
"Stephen Birch"
Дата:
Huh!!

I just tried this and it didn't work for me, it was very slow. I did an
EXPLAIN and it was back to sequence scan again.

Steve


>From: merlyn@stonehenge.com (Randal L. Schwartz)
>To: pgsql-general@postgresql.org
>Subject: Re: [GENERAL] timestamped archive data index searches
>Date: 20 Jul 2002 12:51:00 -0700
>
> >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>Tom> "Stephen Birch" <sgbirch@hotmail.com> writes:
> >> I also tried the following, which explain said is also using a sequence
>scan
> >> :-(
>
> >> SELECT sum(vol) FROM det
> >> WHERE tstamp > (current_timestamp - '5 seconds'::interval);
>
>Tom> This doesn't work (in 7.2 and before) because the planner doesn't
>think
>Tom> current_timestamp is a constant.  You can get around that with a
>custom
>Tom> function that hides the current_timestamp computation and is marked
>Tom> isCachable --- this is a cheat but works well enough in interactive
>Tom> queries.  (It'd not work inside plpgsql unfortunately.)  See past
>Tom> archived discussions --- searching for isCachable should turn up
>Tom> examples.
>
>I've found this to work:
>
>SELECT sum(vol) FROM det
>WHERE tstamp > (select current_timestamp - '5 seconds'::interval);
>
>I don't know where I got that trick from, but it works fine.
>
>--
>Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
><merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
>Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
>See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl
>training!
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

_________________________________________________________________
Send and receive Hotmail on your mobile device: http://mobile.msn.com


Re: timestamped archive data index searches

От
"Stephen Birch"
Дата:
I am still puzzled by the systems use of sequence scans. Using Tom's
suggestion, I am now able to get a reasonable response time on the 1M record
database by searching on the tstamp field.

But ... I tried asking the database what the earliest record is:

SELECT MIN(tstamp) FROM det;

This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off.

Shouldn't this also use an index?

Steve



>From: Tom Lane <tgl@sss.pgh.pa.us>
>To: "Stephen Birch" <sgbirch@hotmail.com>
>CC: pgsql-general@postgreSQL.org
>Subject: Re: [GENERAL] timestamped archive data index searches Date: Wed,
>17 Jul 2002 12:37:14 -0400
>
>"Stephen Birch" <sgbirch@hotmail.com> writes:
> > I see the same problem if I query the database using psql. But to answer
> > your question, here is an example query that fails to use the index on
> > tstamp.
>
> > select sum(vol) from tdet where tstamp > 1026921570;
>
>Some experimentation shows that that expression is actually interpreted
>as
>    where text(tstamp) > '1026921570'::text
>No wonder it ain't using the index :-(.  I'm surprised that you believe
>the results are correct --- most display styles for timestamps wouldn't
>come anywhere near making this work as a textual comparison.
>
>There are various hacks for converting numeric Unix timestamps to
>Postgres timestamps.  The logically cleanest way is
>
>regression=# select 'epoch'::timestamptz + '1026921570 seconds'::interval;
>         ?column?
>------------------------
>  2002-07-17 11:59:30-04
>(1 row)
>
>If you write your query as
>     select sum(vol) from tdet where tstamp > ('epoch'::timestamptz +
>'1026921570 seconds'::interval);
>you should find that it'll use the index.
>
> > Also, I can get the same effect using pgsql with something like:
> > select sum(vol) from tdet where date(tstamp) = '2002-07-17';
>
> > Again, I would hope this would use the index on tstamp to select a small
> > subset of the very large database.
>
>Not unless you build the index on date(tstamp).
>
>            regards, tom lane




_________________________________________________________________
Send and receive Hotmail on your mobile device: http://mobile.msn.com


Re: timestamped archive data index searches

От
Ralph Graulich
Дата:
Hi Stephen,


> SELECT MIN(tstamp) FROM det;
> This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off.
> Shouldn't this also use an index?
[...]

Have a look at the thread I started under "max() not using index". I think
you can do the same trick to work around the min() problem if it's
logical. That'd be:

    SELECT tmstamp FROM det ORDER BY tmstamp ASC LIMIT 1;


Kind regards
... Ralph ...



Re: timestamped archive data index searches

От
Bruno Wolff III
Дата:
On Sun, Jul 21, 2002 at 07:32:22 +0000,
  Stephen Birch <sgbirch@hotmail.com> wrote:
> I am still puzzled by the systems use of sequence scans. Using Tom's
> suggestion, I am now able to get a reasonable response time on the 1M
> record database by searching on the tstamp field.
>
> But ... I tried asking the database what the earliest record is:
>
> SELECT MIN(tstamp) FROM det;
>
> This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off.
>
> Shouldn't this also use an index?

No because there isn't hardcoded special knowledge about the min and max
aggregate functions. This gets discussed on the lists pretty often so
you should be able to find more detailed discussions in the archives.
If there is a usable index on column of interest you should rewrite
your query to use order by and limit. For example:
select tstamp from det order by tstamp limit 1;

Re: timestamped archive data index searches

От
Andrew Sullivan
Дата:
On Sat, Jul 20, 2002 at 07:29:08PM +0000, Stephen Birch wrote:
>
> BTW Another thing I notice with a large database is that count(*) takes
> ages, it
> looks like it has to scan every record. This may be a newbie question or a
> faq, but isn't there a record count kept that would make count(*) return
> quickly?

No.  MVCC means that the count() for every connection could be
different.  So it has to do a seqscan every time.

A

----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: timestamped archive data index searches

От
Bruce Momjian
Дата:
It is now in th4e FAQ on the web site.

---------------------------------------------------------------------------

Bruno Wolff III wrote:
> On Sun, Jul 21, 2002 at 07:32:22 +0000,
>   Stephen Birch <sgbirch@hotmail.com> wrote:
> > I am still puzzled by the systems use of sequence scans. Using Tom's
> > suggestion, I am now able to get a reasonable response time on the 1M
> > record database by searching on the tstamp field.
> >
> > But ... I tried asking the database what the earliest record is:
> >
> > SELECT MIN(tstamp) FROM det;
> >
> > This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off.
> >
> > Shouldn't this also use an index?
>
> No because there isn't hardcoded special knowledge about the min and max
> aggregate functions. This gets discussed on the lists pretty often so
> you should be able to find more detailed discussions in the archives.
> If there is a usable index on column of interest you should rewrite
> your query to use order by and limit. For example:
> select tstamp from det order by tstamp limit 1;
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026