Discussion: Optimization with dates
Hello,
I really have a problem dealing with dates:

I have a big table (~10M rows) like:

create table gains (
    dategain timestamp,
    otherfields
);

There's an index on dategain and I want to use it to get the last
30 days' records:

explain select count(*) from gains where dategain >= now() - 30;
Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)

whereas:

explain select count(*) from gains where dategain >= 'now';

Aggregate  (cost=27338.47..27338.47 rows=1 width=0)
  ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..27320.71 rows=7103 width=0)
I have tried:

    where dategain >= (select now()-30);
    where dategain+30 >= 'now'
    where date(dategain) >= date('now')-30

and many others, syntactically absurd :-)
Can anyone help me use this index?
TIA,
--
Jean-Christophe Boggio
cat@thefreecat.org -o)
Independent Consultant and Developer /\\
Delphi, Linux, Perl, PostgreSQL, Debian _\_V
Jean-Christophe,

> Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
>   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
>
> whereas :

Hmmm... if the number of rows is actually accurate (2M out of 10M in the
last 30 days) then a Seq Scan seems like a good plan to me. If the
numbers aren't accurate, it's time to run a VACUUM ANALYZE.

Regardless, if you're actually querying for 2M recordsets, you'd better
throw some hardware at the problem, and learn the .conf parameters.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology        josh@agliodbs.com
and data management solutions          (415) 565-7293
for law firms, small businesses        fax 621-2533
and non-profit organizations.          San Francisco
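A minimal sketch of Josh's suggestion: refresh the statistics, then
re-check the plan and its row estimate (names as in the original post):

    vacuum analyze gains;
    explain select count(*) from gains where dategain >= now() - 30;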
On Wed, 14 Nov 2001, Jean-Christophe Boggio wrote:
> Hello,
>
> I really have a problem dealing with dates :
>
> I have a big table (~10M rows) like :
>
> create table gains (
> dategain timestamp,
> otherfields
> );
>
> There's an index on dategain and I want to use it to get the last
> 30 days records :
>
> explain select count(*) from gains where dategain>=now()-30;
>
> Aggregate (cost=256546.78..256546.78 rows=1 width=0)
> -> Seq Scan on gains (cost=0.00..250627.68 rows=2367640 width=0)
>
> whereas :
>
> explain select count(*) from gains where dategain>='now';
>
> Aggregate (cost=27338.47..27338.47 rows=1 width=0)
> -> Index Scan using ix_gains_dategain on gains (cost=0.00..27320.71 rows=7103 width=0)
>
> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)
>
> Anyone can help me use this index ?
Is 2367640 a reasonable estimate for the number of
rows that match the condition? Have you run vacuum
analyze?
If the estimate is right, you'll probably find that
the sequence scan is actually faster than an index
scan since about 1/4 of the table is being selected.
Yikes! Good catch. My example database returns ~0.6M rows for the last
30 days, and an index scan still turns out to be a win (of course, it
turned out to be a bigger win to have a separate table with 15-minute
summarizations of the data :).

Josh is right: chances are good that a sequential scan will actually
perform better for you than an index scan if you are returning 20% of
your entire table.

Jason

"Josh Berkus" <josh@agliodbs.com> writes:
> Jean-Christophe,
>
> > Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
> >   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
> >
> > whereas :
>
> Hmmm... if the number of rows is actually accurate (2M out of 10M in the
> last 30 days) then a Seq Scan seems like a good plan to me. If the
> numbers aren't accurate, it's time to run a VACUUM ANALYZE.
>
> Regardless, if you're actually querying for 2M recordsets, you'd better
> throw some hardware at the problem, and learn the .conf parameters.
>
> -Josh
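For reference, a simplified sketch of the summary-table idea Jason
mentions, using daily rather than 15-minute buckets (the table and
column names here are hypothetical):

    create table gains_daily (
        day date primary key,
        cnt integer
    );

    insert into gains_daily
        select date(dategain), count(*)
        from gains
        group by date(dategain);

    -- the last-30-days count now reads ~30 small rows instead of 10M:
    select sum(cnt) from gains_daily where day >= date('now') - 30;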
I have a similar table (~16M rows) with an indexed timestamp, and have
had similar problems. I have found that even when I am using a
constant timestamp, like in this query:

SELECT * FROM caseweights1 WHERE dt > '2001-11-01';

I start getting sequential scans with 7.1.3 long before they are
faster than index-based queries. I believe there is a constant
that can be fiddled with to modify this behavior, and it seems like I
have also read that this constant has been changed in the new 7.2 release.
However, for queries that you *know* will be faster using the index
you can always issue:
set enable_seqscan to off;
before running your query. This will force PostgreSQL to use the
index even in queries like this one:
SELECT * FROM caseweights1 WHERE dt > (SELECT now() - '30 days'::interval);
Jason
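One caveat: enable_seqscan stays off for the rest of the session, so it
is safest to switch it back on right after the query. A minimal sketch:

    set enable_seqscan to off;
    select * from caseweights1 where dt > (select now() - '30 days'::interval);
    set enable_seqscan to on;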
Jean-Christophe Boggio <cat@thefreecat.org> writes:
> Hello,
>
> I really have a problem dealing with dates :
>
> I have a big table (~10M rows) like :
>
> create table gains (
> dategain timestamp,
> otherfields
> );
>
> There's an index on dategain and I want to use it to get the last
> 30 days records :
>
> explain select count(*) from gains where dategain>=now()-30;
>
> Aggregate (cost=256546.78..256546.78 rows=1 width=0)
> -> Seq Scan on gains (cost=0.00..250627.68 rows=2367640 width=0)
>
> whereas :
>
> explain select count(*) from gains where dategain>='now';
>
> Aggregate (cost=27338.47..27338.47 rows=1 width=0)
> -> Index Scan using ix_gains_dategain on gains (cost=0.00..27320.71 rows=7103 width=0)
>
> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)
>
> Anyone can help me use this index ?
>
> TIA,
>
> --
> Jean-Christophe Boggio
> cat@thefreecat.org -o)
> Independent Consultant and Developer /\\
> Delphi, Linux, Perl, PostgreSQL, Debian _\_V
>
>
Jason Earl <jason.earl@simplot.com> writes:
> I have a similar table (~16M rows) with an indexed timestamp, and have
> had similar problems. I have found that even when I am using a
> constant timestamp like in this query.
> SELECT * FROM caseweights1 WHERE dt > '2001-11-01';
> I start getting sequential scans with 7.1.3 long before they are
> faster than index based queries.
Just out of curiosity, do the indexed timestamps correlate closely to
the physical order of the table? I'd expect that to happen if you
are timestamping records by insertion time and there are few or no
updates.
7.1 and before assume that the index order is random with respect to
the physical order, which is a good conservative assumption ... but it
results in drastically overestimating the cost of an indexscan when
strong correlation exists. 7.2 keeps some statistics about ordering
correlation, and may perhaps do better with this sort of situation.
(I have no doubt that its estimates will need further tweaking, but
at least the effect is somewhat accounted for now.)
regards, tom lane
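Assuming a 7.2 installation, the new ordering-correlation statistic can
be inspected through the pg_stats view after an ANALYZE; a sketch
(values near 1.0 mean the index order closely tracks the physical
order):

    analyze caseweights1;
    select tablename, attname, correlation
    from pg_stats
    where tablename = 'caseweights1' and attname = 'dt';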
Jean-Christophe Boggio <cat@thefreecat.org> writes:
> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)
Try

    dategain >= ago(30)
where "ago" is a function that computes "date(now()) - n" and is
marked "iscachable". This is a cheat but it keeps the planner from
being distracted by the noncachable nature of "now()". You can find
past discussions of this if you search the archives for "iscachable",
eg
http://fts.postgresql.org/db/mw/msg.html?mid=1037521
http://fts.postgresql.org/db/mw/msg.html?mid=60584
http://fts.postgresql.org/db/mw/msg.html?mid=97823
As several other people pointed out, there's also a question of whether
the system *should* use the index --- you haven't told us how many
rows you expect the query to visit. But until you express the WHERE
condition in the form "column >= something-that-can-be-reduced-to-
a-constant", you're unlikely to get the system to even try.
regards, tom lane
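A minimal sketch of the function Tom describes, taking a number of days
and returning a date (Jean-Christophe's interval-based variant appears
in his follow-up below):

    create function ago(integer) returns date as '
        select date(now()) - $1
    ' language 'sql' with (iscachable);

    explain select count(*) from gains where dategain >= ago(30);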
Hello,
Thanks for all the answers. A little feedback:
>> I have tried :
>> where dategain>=(select now()-30);
>> and many other, syntactically absurd :-)
TL> dategain >= ago(30)
TL> where "ago" is a function that computes "date(now()) - n" and is
TL> marked "iscachable".
create function ago(interval) returns timestamp as '
    select now() - $1
' language 'sql' with (iscachable);

explain select count(*) from gains where dategain >= ago('30 0:00');

Aggregate  (cost=180640.90..180640.90 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..179761.71 rows=351676 width=0)

=====

explain select count(*) from gains where dategain >= ago('5 days');

Aggregate  (cost=172340.65..172340.65 rows=1 width=0)
  ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..172202.94 rows=55084 width=0)

=====

explain select count(*) from gains where dategain >= ago('6 days');

Aggregate  (cost=179929.06..179929.06 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..179761.71 rows=66940 width=0)
TL> Just out of curiosity, do the indexed timestamps correlate closely to
TL> the physical order of the table? I'd expect that to happen if you
TL> are timestamping records by insertion time and there are few or no
TL> updates.
That's right, there are very few updates.
===========================================
Now, for Jason's idea:
set enable_seqscan to off;
SET VARIABLE
explain select count(*) from gains where dategain>=now()-30;
Aggregate  (cost=100256770.86..100256770.86 rows=1 width=0)
  ->  Seq Scan on gains  (cost=100000000.00..100250847.08 rows=2369512 width=0)
Strange, isn't it? (Presumably this is Tom's point above: with a
non-constant right-hand side the index path isn't even considered, so
disabling seq scans just inflates the seq scan's cost.)
Is it possible to do the equivalent of "set enable_seqscan to off"
outside of psql (in PHP or Perl code)?
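For what it's worth, SET is an ordinary SQL statement, so any client
interface that can send a query can send it too (e.g. through DBI's
do() or PHP's pg_exec()); a sketch of the statement sequence, with the
driver calls omitted:

    set enable_seqscan to off;
    select count(*) from gains where dategain >= ago('30 days');
    set enable_seqscan to on;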
===========================================
To answer Stephan and Josh:
SS> Is 2367640 a reasonable estimate for the number of
SS> rows that match the condition?
JB> Hmmm... if the number of rows is actually accurate (2M out of 10M in the
JB> last 30 days) then a Seq Scan seems like a good plan to me. If the
JB> numbers aren't accurate, it's time to run a VACUUM ANALYZE.
select avg(cnt) from (select count(*) as cnt from gains group by date(dategain)) as foo;
  => 12009.6131756757
If I did it right, this should be the average number of rows per day.
The db has existed since April 1st, 2000.
select date('now') - date('2000-04-01');  => 592

select 592*12009;  => 7109328

select count(*) from gains;  => 7109753
As you see, dategain is *quite* linear!

So to answer your question, a reasonable estimate for the number of
rows that match the condition is:

select 30*12009;  => 360270

The real answer is:

select count(*) from gains where dategain >= now() - 30;  => 231781
SS> Have you run vacuum analyze?
Every night (and it's a VEERRYYYY long process, even when dropping the
indexes beforehand and recreating them afterwards; maybe that's the
real problem?). Keeping the indexes makes the VACUUM take several
hours. We'll try 7.2, which should solve part of this problem, but
since these are production systems, we'll wait for a little feedback
from 7.2 users first.
SS> If the estimate is right, you'll probably find that
SS> the sequence scan is actually faster than an index
SS> scan since about 1/4 of the table is being selected.
It should select only about 30 days out of 592, roughly 1/20 of the
table (and by the real count above, closer to 1/30), not 1/4!

Any further advice VERY appreciated. Thanks again everyone for your
help.
--
Jean-Christophe Boggio
cat@thefreecat.org -o)
Independent Consultant and Developer /\\
Delphi, Linux, Perl, PostgreSQL, Debian _\_V
Hi Tom,

I've found this in a recent mail of yours:

> dategain >= ago(30)

It looks very useful for some of my applications. Since when has this
function been implemented?

Regards,
Christoph
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Jason Earl <jason.earl@simplot.com> writes:
> > I have a similar table (~16M rows) with an indexed timestamp, and have
> > had similar problems. I have found that even when I am using a
> > constant timestamp like in this query.
> >
> > SELECT * FROM caseweights1 WHERE dt > '2001-11-01';
> >
> > I start getting sequential scans with 7.1.3 long before they are
> > faster than index based queries.
>
> Just out of curiosity, do the indexed timestamps correlate closely
> to the physical order of the table? I'd expect that to happen if
> you are timestamping records by insertion time and there are few or
> no updates.

That's it precisely. Records in this table are inserted only, and
never updated. The records are timestamped when inserted.

> 7.1 and before assume that the index order is random with respect to
> the physical order, which is a good conservative assumption ... but
> it results in drastically overestimating the cost of an indexscan
> when strong correlation exists. 7.2 keeps some statistics about
> ordering correlation, and may perhaps do better with this sort of
> situation. (I have no doubt that its estimates will need further
> tweaking, but at least the effect is somewhat accounted for now.)
>
> 			regards, tom lane

I feel bad that I don't have 7.2b in testing yet. I have been waiting
for a Debian package :). I am quite excited about several new
features. I read HACKERS pretty religiously, and so I was aware that
you had added some new statistics and that you have fiddled a bit with
the constant that decides when a sequential scan is a good idea. But
the fact of the matter is that I have already learned how to make
PostgreSQL do "the right thing" when it comes to using these indexes.
Besides, I generally only select from this table in 15 minute
intervals (~200 records), and PostgreSQL has no problems doing the
right thing by default.

The feature that I am most excited about is the new vacuum. Currently
I can only vacuum this table (and several similar tables) once every
two weeks while the factory is down for maintenance. This isn't a
very big deal, as there are no updates or deletions from these tables
and the statistics in a ~16 million row table don't change much in two
weeks, but the new vacuum should do the correct thing for my tables
without any special handling on my part.

Thanks again,
Jason
So, no one can help? Should I stop defining indexes for date fields?
I.e., should DATE types be considered broken, with no short-term
solution, and be replaced with strings?

--
Jean-Christophe Boggio
cat@thefreecat.org                        -o)
Independent Consultant and Developer      /\\
Delphi, Linux, Perl, PostgreSQL, Debian  _\_V
cat@thefreecat.org (Jean-Christophe Boggio) wrote in message news:<72319017913.20011114005656@thefreecat.org>...
> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)
>
> Anyone can help me use this index ?
From my brief experiments, it _looks_ like the optimizer uses an index
scan on dates only when those dates are constants (in my case, anyway;
as always, YMMV).
I tried functions, variables, and using both upper and lower limits.
The only time index scanning was used (according to "explain") was
when I used constant values for the dates.
So ... do you _have_ to do your querying in psql? Or will your
application permit you to create queries with embedded constant date
values in the P-language (PHP, Python, Perl) of your choice?
Calvin
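A sketch of what Calvin describes: compute the cutoff once from the
application, then build the query string with the result embedded as a
constant (the literal date below is hypothetical):

    select date('now') - 30;   -- run once, returns e.g. 2001-10-15

    select count(*) from gains where dategain >= '2001-10-15';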