Обсуждение: timestamped archive data index searches
I know that the question of forcing PostgreSQL to use an index during search ops is an FAQ and have worked with each of the suggested solutions to no avail. From the nature of those questions, it looks like the problem I have to solve is common and unsolved. I am using the database to archive data arriving at a rate of about 100 records a minute, the old data needs to be stored hence the use of a database. Each record is timestamped as it is inserted in the database. The system needs a web site that can display data from (say) the last hour of data. Now, when the database is searched using a select on the timestamp it never uses the index on that field no matter how I set the db params. I think that the query optimizer is noticing the sequential nature of the timestamp field and assuming that an index will always slow the query. The problem is that the retrieval of the past hour's data has to scan the entire database and so is very, very slow. Any ideas? _________________________________________________________________ MSN Photos is the easiest way to share and print your photos: http://photos.msn.com/support/worldwide.aspx
"Stephen Birch" <sgbirch@hotmail.com> writes: > I know that the question of forcing PostgreSQL to use an index > during search ops is an FAQ and have worked with each of the > suggested solutions to no avail. > > > From the nature of those questions, it looks like the problem I > > have to solve is common and unsolved. I am using the database to > > archive data arriving at a rate of about 100 records a minute, the > > old data needs to be stored hence the use of a database. Each > > record is timestamped as it is inserted in the database. > > The system needs a web site that can display data from (say) the last > hour of data. > > Now, when the database is searched using a select on the timestamp it > never uses the index on that field no matter how I set the db > params. I think that the query optimizer is noticing the sequential > nature of the timestamp field and assuming that an index will always > slow the query. > > The problem is that the retrieval of the past hour's data has to scan > the entire database and so is very, very slow. > > Any ideas? What exactly does the query look like? Have you ANALYZED the data? I do something very similar to this and it should be possible to get PostgreSQL to use the index. Jason
The select is something like.. SELECT AVG(x) FROM arch WHERE tstamp > :t or SELECT * FROM arch WHERE tstamp > :t. I am using embedded SQL and the variable t is set to the current time minus one hour (60*60). The results are correct. It is just taking far too long. Yes, I have been using ANALYZE, that is how I know a sequential search has been selected. I have also tried telling the database to not use sequential searches using the appropriate SET command. As I mentioned, I believe this may be a 'feature' of PostgreSQL that will be encountered by many people. Since the incoming data is timestamped as it arrives, the time column (I called it tstamp) will always be ordered. This fools the optimizer into thinking a sequential search would be faster. In fact, my query only needs to look at a small subset of the stored data, just the recently inserted records. But the whole database is always checked. I love PostgreSQL and have used it for many years, but this has me stumped! Steve >From: Jason Earl <jason.earl@simplot.com> >To: "Stephen Birch" <sgbirch@hotmail.com> >CC: pgsql-general@postgresql.org >Subject: Re: [GENERAL] timestamped archive data index searches >Date: 16 Jul 2002 13:15:26 -0600 > >"Stephen Birch" <sgbirch@hotmail.com> writes: > > > I know that the question of forcing PostgreSQL to use an index > > during search ops is an FAQ and have worked with each of the > > suggested solutions to no avail. > > > > > From the nature of those questions, it looks like the problem I > > > have to solve is common and unsolved. I am using the database to > > > archive data arriving at a rate of about 100 records a minute, the > > > old data needs to be stored hence the use of a database. Each > > > record is timestamped as it is inserted in the database. > > > > The system needs a web site that can display data from (say) the last > > hour of data. > > > > Now, when the database is searched using a select on the timestamp it > > never uses the index on that field no matter how I set the db > > params. I think that the query optimizer is noticing the sequential > > nature of the timestamp field and assuming that an index will always > > slow the query. > > > > The problem is that the retrieval of the past hour's data has to scan > > the entire database and so is very, very slow. > > > > Any ideas? > >What exactly does the query look like? Have you ANALYZED the data? I >do something very similar to this and it should be possible to get >PostgreSQL to use the index. > >Jason _________________________________________________________________ Send and receive Hotmail on your mobile device: http://mobile.msn.com
On Wed, Jul 17, 2002 at 08:45:53AM +0000, Stephen Birch wrote: > > The select is something like.. > > SELECT AVG(x) FROM arch WHERE tstamp > :t > > or > > SELECT * FROM arch WHERE tstamp > :t. > > I am using embedded SQL and the variable t is set to the current time minus > one hour (60*60). The results are correct. It is just taking far too long. > > Yes, I have been using ANALYZE, that is how I know a sequential search has > been selected. I have also tried telling the database to not use sequential > searches using the appropriate SET command. Do you have an index on tstamp? What does EXPLAIN ANALYSE tell you? Both with and without seq_scans enabled. > As I mentioned, I believe this may be a 'feature' of PostgreSQL that will be > encountered by many people. Since the incoming data is timestamped as it > arrives, the time column (I called it tstamp) will always be ordered. This > fools the optimizer into thinking a sequential search would be faster. The planner in 7.2 knows about clustering. More details please. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
"Stephen Birch" <sgbirch@hotmail.com> writes: > The select is something like.. > SELECT AVG(x) FROM arch WHERE tstamp > :t What *exactly* arrives at the backend when this is done? (Turn on query logging to find out.) I'm suspicious that the problem is a datatype-mismatch issue and you might need a cast. But it's hard to tell without seeing the exact query. regards, tom lane
"Stephen Birch" <sgbirch@hotmail.com> writes: > I see the same problem if I query the database using psql. But to answer > your question, here is an example query that fails to use the index on > tstamp. > select sum(vol) from tdet where tstamp > 1026921570; Some experimentation shows that that expression is actually interpreted as where text(tstamp) > '1026921570'::text No wonder it ain't using the index :-(. I'm surprised that you believe the results are correct --- most display styles for timestamps wouldn't come anywhere near making this work as a textual comparison. There are various hacks for converting numeric Unix timestamps to Postgres timestamps. The logically cleanest way is regression=# select 'epoch'::timestamptz + '1026921570 seconds'::interval; ?column? ------------------------ 2002-07-17 11:59:30-04 (1 row) If you write your query as select sum(vol) from tdet where tstamp > ('epoch'::timestamptz + '1026921570 seconds'::interval); you should find that it'll use the index. > Also, I can get the same effect using pgsql with something like: > select sum(vol) from tdet where date(tstamp) = '2002-07-17'; > Again, I would hope this would use the index on tstamp to select a small > subset of the very large database. Not unless you build the index on date(tstamp). regards, tom lane
Thank you for your reply, as usual you give tons to think about. If I have understood your reasoning, I would expect your suggestion to work. But it still seems to be using a sequence scan :-( I also tried the following, which explain said is also using a sequence scan :-( SELECT sum(vol) FROM det WHERE tstamp > (current_timestamp - '5 seconds'::interval); Any ideas? (all tests done on SuSE 7.3, kernel updated to 2.4.18 and PostgeSQL at 7.2.1) >From: Tom Lane <tgl@sss.pgh.pa.us> >To: "Stephen Birch" <sgbirch@hotmail.com> >CC: pgsql-general@postgreSQL.org >Subject: Re: [GENERAL] timestamped archive data index searches Date: Wed, >17 Jul 2002 12:37:14 -0400 > >"Stephen Birch" <sgbirch@hotmail.com> writes: > > I see the same problem if I query the database using psql. But to answer > > your question, here is an example query that fails to use the index on > > tstamp. > > > select sum(vol) from tdet where tstamp > 1026921570; > >Some experimentation shows that that expression is actually interpreted >as > where text(tstamp) > '1026921570'::text >No wonder it ain't using the index :-(. I'm surprised that you believe >the results are correct --- most display styles for timestamps wouldn't >come anywhere near making this work as a textual comparison. > >There are various hacks for converting numeric Unix timestamps to >Postgres timestamps. The logically cleanest way is > >regression=# select 'epoch'::timestamptz + '1026921570 seconds'::interval; > ?column? >------------------------ > 2002-07-17 11:59:30-04 >(1 row) > >If you write your query as > select sum(vol) from tdet where tstamp > ('epoch'::timestamptz + >'1026921570 seconds'::interval); >you should find that it'll use the index. > > > Also, I can get the same effect using pgsql with something like: > > select sum(vol) from tdet where date(tstamp) = '2002-07-17'; > > > Again, I would hope this would use the index on tstamp to select a small > > subset of the very large database. > >Not unless you build the index on date(tstamp). > > regards, tom lane _________________________________________________________________ Join the world�s largest e-mail service with MSN Hotmail. http://www.hotmail.com
"Stephen Birch" <sgbirch@hotmail.com> writes: > I also tried the following, which explain said is also using a sequence scan > :-( > SELECT sum(vol) FROM det > WHERE tstamp > (current_timestamp - '5 seconds'::interval); This doesn't work (in 7.2 and before) because the planner doesn't think current_timestamp is a constant. You can get around that with a custom function that hides the current_timestamp computation and is marked isCachable --- this is a cheat but works well enough in interactive queries. (It'd not work inside plpgsql unfortunately.) See past archived discussions --- searching for isCachable should turn up examples. Beginning in 7.3 there will be a finer-grain notion of constant functions so that the planner can optimize this sort of thing as-is. regards, tom lane
ok I assume that the string 'now' would suffer from the same problem and should also be avoided until 7.3? BTW Another thing I notice with a large database is that count(*) takes ages, it looks like it has to scan every record. This may be a newbie question or a faq, but isn't there a record count kept that would make count(*) return quickly? Steve PS I have been using PostgreSQL for about 5 years now in a mission critical application and have grown to love it. What I can't figure out is how Tom and the others find time to do such fantastic work! >From: Tom Lane <tgl@sss.pgh.pa.us> >To: "Stephen Birch" <sgbirch@hotmail.com> >CC: pgsql-general@postgreSQL.org >Subject: Re: [GENERAL] timestamped archive data index searches Date: Sat, >20 Jul 2002 15:11:24 -0400 > >"Stephen Birch" <sgbirch@hotmail.com> writes: > > I also tried the following, which explain said is also using a sequence >scan > > :-( > > > SELECT sum(vol) FROM det > > WHERE tstamp > (current_timestamp - '5 seconds'::interval); > >This doesn't work (in 7.2 and before) because the planner doesn't think >current_timestamp is a constant. You can get around that with a custom >function that hides the current_timestamp computation and is marked >isCachable --- this is a cheat but works well enough in interactive >queries. (It'd not work inside plpgsql unfortunately.) See past >archived discussions --- searching for isCachable should turn up >examples. > >Beginning in 7.3 there will be a finer-grain notion of constant >functions so that the planner can optimize this sort of thing as-is. > > regards, tom lane _________________________________________________________________ Send and receive Hotmail on your mobile device: http://mobile.msn.com
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> "Stephen Birch" <sgbirch@hotmail.com> writes: >> I also tried the following, which explain said is also using a sequence scan >> :-( >> SELECT sum(vol) FROM det >> WHERE tstamp > (current_timestamp - '5 seconds'::interval); Tom> This doesn't work (in 7.2 and before) because the planner doesn't think Tom> current_timestamp is a constant. You can get around that with a custom Tom> function that hides the current_timestamp computation and is marked Tom> isCachable --- this is a cheat but works well enough in interactive Tom> queries. (It'd not work inside plpgsql unfortunately.) See past Tom> archived discussions --- searching for isCachable should turn up Tom> examples. I've found this to work: SELECT sum(vol) FROM det WHERE tstamp > (select current_timestamp - '5 seconds'::interval); I don't know where I got that trick from, but it works fine. -- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 <merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Huh!! I just tried this and it didn't work for me, it was very slow. I did an EXPLAIN and it was back to sequence scan again. Steve >From: merlyn@stonehenge.com (Randal L. Schwartz) >To: pgsql-general@postgresql.org >Subject: Re: [GENERAL] timestamped archive data index searches >Date: 20 Jul 2002 12:51:00 -0700 > > >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > >Tom> "Stephen Birch" <sgbirch@hotmail.com> writes: > >> I also tried the following, which explain said is also using a sequence >scan > >> :-( > > >> SELECT sum(vol) FROM det > >> WHERE tstamp > (current_timestamp - '5 seconds'::interval); > >Tom> This doesn't work (in 7.2 and before) because the planner doesn't >think >Tom> current_timestamp is a constant. You can get around that with a >custom >Tom> function that hides the current_timestamp computation and is marked >Tom> isCachable --- this is a cheat but works well enough in interactive >Tom> queries. (It'd not work inside plpgsql unfortunately.) See past >Tom> archived discussions --- searching for isCachable should turn up >Tom> examples. > >I've found this to work: > >SELECT sum(vol) FROM det >WHERE tstamp > (select current_timestamp - '5 seconds'::interval); > >I don't know where I got that trick from, but it works fine. > >-- >Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 ><merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/> >Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. >See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl >training! > >---------------------------(end of broadcast)--------------------------- >TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) _________________________________________________________________ Send and receive Hotmail on your mobile device: http://mobile.msn.com
I am still puzzled by the systems use of sequence scans. Using Tom's suggestion, I am now able to get a reasonable response time on the 1M record database by searching on the tstamp field. But ... I tried asking the database what the earliest record is: SELECT MIN(tstamp) FROM det; This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off. Shouldn't this also use an index? Steve >From: Tom Lane <tgl@sss.pgh.pa.us> >To: "Stephen Birch" <sgbirch@hotmail.com> >CC: pgsql-general@postgreSQL.org >Subject: Re: [GENERAL] timestamped archive data index searches Date: Wed, >17 Jul 2002 12:37:14 -0400 > >"Stephen Birch" <sgbirch@hotmail.com> writes: > > I see the same problem if I query the database using psql. But to answer > > your question, here is an example query that fails to use the index on > > tstamp. > > > select sum(vol) from tdet where tstamp > 1026921570; > >Some experimentation shows that that expression is actually interpreted >as > where text(tstamp) > '1026921570'::text >No wonder it ain't using the index :-(. I'm surprised that you believe >the results are correct --- most display styles for timestamps wouldn't >come anywhere near making this work as a textual comparison. > >There are various hacks for converting numeric Unix timestamps to >Postgres timestamps. The logically cleanest way is > >regression=# select 'epoch'::timestamptz + '1026921570 seconds'::interval; > ?column? >------------------------ > 2002-07-17 11:59:30-04 >(1 row) > >If you write your query as > select sum(vol) from tdet where tstamp > ('epoch'::timestamptz + >'1026921570 seconds'::interval); >you should find that it'll use the index. > > > Also, I can get the same effect using pgsql with something like: > > select sum(vol) from tdet where date(tstamp) = '2002-07-17'; > > > Again, I would hope this would use the index on tstamp to select a small > > subset of the very large database. > >Not unless you build the index on date(tstamp). > > regards, tom lane _________________________________________________________________ Send and receive Hotmail on your mobile device: http://mobile.msn.com
Hi Stephen, > SELECT MIN(tstamp) FROM det; > This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off. > Shouldn't this also use an index? [...] Have a look at the thread I started under "max() not using index". I think you can do the same trick to work around the min() problem if it's logical. That'd be: SELECT tmstamp FROM det ORDER BY tmstamp ASC LIMIT 1; Kind regards ... Ralph ...
On Sun, Jul 21, 2002 at 07:32:22 +0000, Stephen Birch <sgbirch@hotmail.com> wrote: > I am still puzzled by the systems use of sequence scans. Using Tom's > suggestion, I am now able to get a reasonable response time on the 1M > record database by searching on the tstamp field. > > But ... I tried asking the database what the earliest record is: > > SELECT MIN(tstamp) FROM det; > > This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off. > > Shouldn't this also use an index? No because there isn't hardcoded special knowledge about the min and max aggregate functions. This gets discussed on the lists pretty often so you should be able to find more detailed discussions in the archives. If there is a usable index on column of interest you should rewrite your query to use order by and limit. For example: select tstamp from det order by tstamp limit 1;
On Sat, Jul 20, 2002 at 07:29:08PM +0000, Stephen Birch wrote: > > BTW Another thing I notice with a large database is that count(*) takes > ages, it > looks like it has to scan every record. This may be a newbie question or a > faq, but isn't there a record count kept that would make count(*) return > quickly? No. MVCC means that the count() for every connection could be different. So it has to do a seqscan every time. A ---- Andrew Sullivan 87 Mowat Avenue Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M6K 3E3 +1 416 646 3304 x110
It is now in th4e FAQ on the web site. --------------------------------------------------------------------------- Bruno Wolff III wrote: > On Sun, Jul 21, 2002 at 07:32:22 +0000, > Stephen Birch <sgbirch@hotmail.com> wrote: > > I am still puzzled by the systems use of sequence scans. Using Tom's > > suggestion, I am now able to get a reasonable response time on the 1M > > record database by searching on the tstamp field. > > > > But ... I tried asking the database what the earliest record is: > > > > SELECT MIN(tstamp) FROM det; > > > > This used a sequence scan even if I do a SET ENABLE_SEQSCAN to off. > > > > Shouldn't this also use an index? > > No because there isn't hardcoded special knowledge about the min and max > aggregate functions. This gets discussed on the lists pretty often so > you should be able to find more detailed discussions in the archives. > If there is a usable index on column of interest you should rewrite > your query to use order by and limit. For example: > select tstamp from det order by tstamp limit 1; > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026