Discussion: Querying 19million records very slowly

Querying 19million records very slowly

From: Kjell Tore Fossbakk
Date:
Hello!

I use FreeBSD 4.11 with PostgreSQL 7.3.8.

I have a huge database with roughly 19 million records. There is just one
table, with a time field, a few ints, and a few strings.


table test
fields time (timestamp), source (string), destination (string), p1 (int),
p2 (int)


I have run VACUUM ANALYZE ;

I have created indexes on every field, but for some reason my PostgreSQL
server wants to use a seq scan, even though I know an index scan would be
much faster.


create index test_time_idx on test (time) ;
create index test_source_idx on test (source) ;
create index test_destination_idx on test (destination) ;
create index test_p1_idx on test (p1) ;
create index test_p2_idx on test (p2) ;



What is really strange is that when I query a count(*) on one of the int
fields (p1) for a value with a very low count, PostgreSQL uses a seq scan.
In another count on the same int field (p1), which I know gives about 2.2
million hits, it also uses a seq scan instead of an index scan. Isn't the
whole idea of indexing to increase performance on large queries? To make a
sort of phone book of the values, so it is faster to look up whatever you
need... This just seems like the opposite...
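For reference, these are roughly the two counts being compared (the value
999 stands in for a rarely occurring p1 and is only illustrative; 53 is the
value that matches about 2.2 million rows):

database=> explain select count(*) from test where p1 = 999 ;  -- only a few matching rows
database=> explain select count(*) from test where p1 = 53 ;   -- about 2.2 million matching rows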

Here is an EXPLAIN of my query:

database=> explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
   ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
         ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083
width=8)
                     Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
(6 rows)




database=> drop INDEX test_<TABULATOR>
test_source_idx         test_destination_idx        test_p1_idx       
test_p2_idx       test_time_idx


After all this, I tried to set enable_seqscan to off and enable_nestloop
to on. This didn't help much either. The time to run the query is still in
minutes. My result is the number of elements for each hour, and it gives
about 1000-2000 hits per hour. I have read somewhere that PostgreSQL can
easily handle 100-200 million records, and that with a properly tuned
system it can give great performance... I would like to learn how :)

I also found this tip in an article
( http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php ):
Tip #11:  Don't bother indexing columns with huge numbers of records and a
small range of values, such as BOOLEAN columns.

This tip, regretfully, is perhaps the only tip where I cannot provide a
good, real-world example from my work.  So I'll give you a hypothetical
situation instead:

Imagine that you have a database table with a list of every establishment
vending ice cream in the US.  A simple example might look like:

Where there were almost 1 million rows, but due to simplistic data entry,
only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
3-OTHER) which are relatively evenly distributed.  In this hypothetical
situation, you might find (with testing using EXPLAIN) that an index on
type is ignored and the parser uses a "seq scan" (or table scan) instead.
This is because a table scan can actually be faster than an index scan in
this situation.  Thus, any index on type should be dropped.

Certainly, the boolean column (active) requires no indexing as it has only
two possible values and no index will be faster than a table scan.


Then I ask, what is the use of indexing when I can't use it on a VERY
large database? It is on my 15 million record database that it takes
forever to do seq scans over and over again... This is probably, as I
mentioned earlier (read the quote), the reason why it chooses a full scan
and not an index scan...

So what do I do? :confused:

I've used SQL for years, but never at such a scale, so I have never had to
learn how to deal with large numbers of records; usually a maximum of 1000
records. Now, with millions, I need to learn a way to make my sucky
queries better.

I'm trying to learn more about tuning my system, making better queries and
such. I've found some documents on the Internet, but far from the best.

Feedback most appreciated!

Regards,
a learning PostgreSQL user

Re: Querying 19million records very slowly

From: Paul Ramsey
Date:
Some tips:

- EXPLAIN ANALYZE provides a more useful analysis of a slow query,
because it gives both the estimate and actual times/rows for each step
in the plan.

- The documentation is right: rows with little variation are pretty
useless to index. Indexing is about "selectivity", reducing the amount
of stuff the database has to read off the disk.

- You only have two things in your WHERE clause, so that is where the
most important indexes reside. How many of your rows have p1=53? How
many of your rows have happened in the last day? If your answer is "a
lot" then the indexes are not going to help: PostgreSQL will be more
efficient scanning every tuple than it will be jumping around the index
structure for a large number of tuples.

- If neither time nor p1 is particularly selective individually, but
they are selective when taken together, try a multi-key index on them both.
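A sketch of what that could look like (the index name is only illustrative;
with an equality test on p1 and a range test on time, putting p1 first
usually lets the index narrow the scan the most):

create index test_p1_time_idx on test (p1, time) ;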

Paul

Kjell Tore Fossbakk wrote:

> Hello!
>
> I use FreeBSD 4.11 with PostGreSQL 7.3.8.
>
> I got a huge database with roughly 19 million records. There is just one
> table, with a time field, a few ints and a few strings.
>
>
> table test
> fields time (timestamp), source (string), destination (string), p1 (int),
> p2 (int)
>
>
> I have run VACUUM ANALYZE ;
>
> I have created indexes on every field, but for some reason my postgre
> server wants to use a seqscan, even tho i know a indexed scan would be
> much faster.
>
>
> create index test_time_idx on test (time) ;
> create index test_source_idx on test (source) ;
> create index test_destination_idx on test (destination) ;
> create index test_p1_idx on test (p1) ;
> create index test_p2_idx on test (p2) ;
>
>
>
> What is really strange, is that when i query a count(*) on one of the int
> fields (p1), which has a very low count, postgre uses seqscan. In another
> count on the same int field (p1), i know he is giving about 2.2 million
> hits, but then he suddenly uses seqscan, instead of a indexed one. Isn't
> the whole idea of indexing to increase performance in large queries.. To
> make sort of a phonebook for the values, to make it faster to look up what
> ever you need... This just seems opposite..
>
> Here is a EXPLAIN of my query
>
> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;
>                                         QUERY PLAN
> ------------------------------------------------------------------------------------------
> Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
>    ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
>          ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083
> width=8)
>                      Filter: ((p1 = 53) AND ("time" > (now() - '1
> day'::interval)))
> (6 rows)
>
>
>
>
> database=> drop INDEX test_<TABULATOR>
> test_source_idx         test_destination_idx        test_p1_idx
> test_p2_idx       test_time_idx
>
>
> After all this, i tried to set enable_seqscan to off and
> enable_nestedloops to on. This didnt help much either. The time to run the
> query is still in minutes. My results are the number of elements for each
> hour, and it gives about 1000-2000 hits per hour. I have read somewhere,
> about PostGreSQL, that it can easily handle 100-200million records. And
> with the right tuned system, have a great performance.. I would like to
> learn how :)
>
> I also found an article on a page
> ( http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
> Tip #11:  Don't bother indexing columns with huge numbers of records and a
> small range of values, such as BOOLEAN columns.
>
> This tip, regretfully, is perhaps the only tip where I cannot provide a
> good, real-world example from my work.  So I'll give you a hypothetical
> situation instead:
>
> Imagine that you have a database table with a list of every establishment
> vending ice cream in the US.  A simple example might look like:
>
> Where there were almost 1 million rows, but due to simplistic data entry,
> only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
> 3-OTHER) which are relatively evenly distributed.  In this hypothetical
> situation, you might find (with testing using EXPLAIN) that an index on
> type is ignored and the parser uses a "seq scan" (or table scan) instead.
> This is because a table scan can actually be faster than an index scan in
> this situation.  Thus, any index on type should be dropped.
>
> Certainly, the boolean column (active) requires no indexing as it has only
> two possible values and no index will be faster than a table scan.
>
>
> Then I ask, what is useful with indexing, when I can't use it on a VERY
> large database? It is on my 15 million record database it takes for ever
> to do seqscans over and over again... This is probably why, as i mentioned
> earlier, the reason (read the quote) why he chooses a full scan and not a
> indexed one...
>
> So what do I do? :confused:
>
> I'v used SQL for years, but never in such a big scale. Thus, not having to
> learn how to deal with large number of records. Usually a maximum of 1000
> records. Now, with millions, I need to learn a way to make my sucky
> queries better.
>
> Im trying to learn more about tuning my system, makeing better queries and
> such. I'v found some documents on the Internet, but far from the best.
>
> Feedback most appreciated!
>
> Regards,
> a learning PostGreSQL user
>


Re: Querying 19million records very slowly

From: PFC
Date:
Use CURRENT_TIME, which is a constant, instead of now(), which is not
considered constant... (I think)
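A sketch of that rewrite against the query from the first message, using
CURRENT_TIMESTAMP rather than CURRENT_TIME since the time column is a
timestamp (treating it as more constant than now() is the assumption here):

select date_trunc('hour', time), count(*) as total
  from test
 where p1 = 53
   and time > current_timestamp - interval '24 hours'
 group by date_trunc
 order by date_trunc ;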

Re: Querying 19million records very slowly

From: Kjell Tore Fossbakk
Date:
database=> set enable_seqscan to on;
SET
Time: 0.34 ms



database=> explain analyze select count(*) from test where p1=53;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=522824.50..522824.50 rows=1 width=0) (actual
time=56380.72..56380.72 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..517383.30 rows=2176479 width=0)
(actual time=9.61..47677.48 rows=2220746 loops=1)
         Filter: (p1 = 53)
 Total runtime: 56380.79 msec
(4 rows)

Time: 56381.40 ms



database=> explain analyze select count(*) from test where p1=53 and
time > now() - interval '24 hours' ;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=661969.01..661969.01 rows=1 width=0) (actual
time=45787.02..45787.02 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493 width=0)
(actual time=37799.32..45613.58 rows=42424 loops=1)
         Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
 Total runtime: 45787.09 msec
(4 rows)

Time: 45787.79 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=53 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
                                                             QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=755116.97..760558.17 rows=72549 width=8) (actual
time=46040.63..46717.61 rows=23 loops=1)
   ->  Group  (cost=755116.97..758744.44 rows=725493 width=8) (actual
time=46022.06..46548.84 rows=42407 loops=1)
         ->  Sort  (cost=755116.97..756930.70 rows=725493 width=8)
(actual time=46022.04..46198.94 rows=42407 loops=1)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493
width=8) (actual time=37784.91..45690.88 rows=42407 loops=1)
                     Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
 Total runtime: 46718.43 msec
(7 rows)

Time: 46719.44 ms



database=> create index test_time_p1_idx on test(time,p1) ;
CREATE INDEX
Time: 178926.02 ms

database=> vacuum analyze test ;
VACUUM
Time: 73058.33 ms

database=> \d test
               Table "public.test"
  Column     |           Type           | Modifiers
-------------+--------------------------+-----------
 time        | timestamp with time zone |
 source      | inet                     |
 destination | inet                     |
 p1          | integer                  |
 p2          | integer                  |



database=> \di
 public | test_time_idx        | index | database | test
 public | test_source_idx      | index | database | test
 public | test_destination_idx | index | database | test
 public | test_p1_idx          | index | database | test
 public | test_p2_idx          | index | database | test
 public | test_time_p1_idx     | index | database | test



database=> set enable_seqscan to off ;
SET
Time: 0.28 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=53 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=2315252.66..2320767.17 rows=73527 width=8) (actual
time=2081.15..2720.44 rows=23 loops=1)
   ->  Group  (cost=2315252.66..2318929.00 rows=735268 width=8)
(actual time=2079.76..2564.22 rows=41366 loops=1)
         ->  Sort  (cost=2315252.66..2317090.83 rows=735268 width=8)
(actual time=2079.74..2243.32 rows=41366 loops=1)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Index Scan using test_time_p1_idx on test
(cost=0.00..2218878.46 rows=735268 width=8) (actual
time=29.50..1774.52 rows=41366 loops=1)
                     Index Cond: (("time" > (now() - '1
day'::interval)) AND (p1 = 53))
 Total runtime: 2735.42 msec

Time: 2736.48 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=80 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=1071732.15..1074305.59 rows=34313 width=8) (actual
time=6353.93..7321.99 rows=22 loops=1)
   ->  Group  (cost=1071732.15..1073447.77 rows=343125 width=8)
(actual time=6323.76..7078.10 rows=64267 loops=1)
         ->  Sort  (cost=1071732.15..1072589.96 rows=343125 width=8)
(actual time=6323.75..6579.42 rows=64267 loops=1)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Index Scan using test_time_p1_idx on test
(cost=0.00..1035479.58 rows=343125 width=8) (actual time=0.20..5858.67
rows=64267 loops=1)
                     Index Cond: (("time" > (now() - '1
day'::interval)) AND (p1 = 80))
 Total runtime: 7322.82 msec

Time: 7323.90 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=139 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=701562.34..703250.12 rows=22504 width=8) (actual
time=2448.41..3033.80 rows=22 loops=1)
   ->  Group  (cost=701562.34..702687.53 rows=225037 width=8) (actual
time=2417.39..2884.25 rows=36637 loops=1)
         ->  Sort  (cost=701562.34..702124.94 rows=225037 width=8)
(actual time=2417.38..2574.19 rows=36637 loops=1)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Index Scan using test_time_p1_idx on test
(cost=0.00..679115.34 rows=225037 width=8) (actual time=8.47..2156.18
rows=36637 loops=1)
                     Index Cond: (("time" > (now() - '1
day'::interval)) AND (p1 = 139))
 Total runtime: 3034.57 msec

Time: 3035.70 ms



Now, this query gives me all the hours in a day, with the count of all
p1=53 for each hour. Pg uses 46.7 seconds to run with a seq scan, versus
2.7 seconds with the index on (time, p1). I think I then ran "set
enable_seqscan to on;" again, and the planner went back to using the seq
scan, not the index.
- Why does Pg not see the benefit of using the index?
- And how can I tune the optimization settings in postgresql.conf to help it?

So now my PG uses a reasonable amount of time on these queries (with
enable_seqscan turned off).
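One way to experiment with those planner settings per session, instead of
disabling seq scans outright (using the same values tried in postgresql.conf
further down; they are starting points, not values tuned for this machine):

database=> set random_page_cost = 2 ;
database=> set effective_cache_size = 10000 ;
database=> explain analyze select count(*) from test where p1=53 and time > now() - interval '24 hours' ;

If the index scan wins with settings like these, the same values can then be
made permanent in postgresql.conf.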

The next place that seems to slow my queries down is probably my
connection to PHP. I have a bash script running from cron on my server
(FreeBSD 4.11), which runs php on a PHP file. To force PG not to use
seq scans, I have modified postgresql.conf:

..
enable_seqscan = false
enable_indexscan = true
..
effective_cache_size = 10000
random_page_cost = 2
..

I save the file, type 'pg_crl reload' then enter 'psql database'.

database=> show enable_seqscan ;
 enable_seqscan
----------------
 on
(1 row)


argus=> show effective_cache_size ;
 effective_cache_size
----------------------
 1000
(1 row)

I have read the manual pages for postgresql, postmaster, and so on, but
I can't find anywhere to specify which config file Pg is to use. I'm
not entirely sure it uses the one I'm editing
(/usr/local/etc/postgresql.conf).

Any hints, tips or help is most appreciated!

Kjell Tore.





On 6/21/05, PFC <lists@boutiquenumerique.com> wrote:
>
> use CURRENT_TIME which is a constant instead of now() which is not
> considered constant... (I think)
>

Re: Querying 19million records very slowly

From: Tobias Brox
Date:
[Kjell Tore Fossbakk - Wed at 09:45:22AM +0200]
> database=> explain analyze select count(*) from test where p1=53 and
> time > now() - interval '24 hours' ;

Sorry to say that I have not followed the entire thread nor read the
entire email I'm replying to, but I have a quick hint on this one (ref. my
earlier thread about timestamp indices): the PostgreSQL planner will
generally behave smarter when given a fixed timestamp (typically generated
by the app server) than with logic based on now().
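A minimal sketch of what "a fixed timestamp" means here, with the literal
computed by the application beforehand (the value itself is only illustrative):

select count(*) from test
 where p1 = 53
   and time > '2005-06-21 09:45:22+02' ;  -- "24 hours ago", computed outside the query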

One of my colleagues also claimed that he found the usage of
localtimestamp faster than now().

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept

Re: Querying 19million records very slowly

From: Kjell Tore Fossbakk
Date:
Appreciate your time, Mr Brox.

I'll test the use of current_timestamp rather than now(). I am not
sure if Pg can match a fixed timestamp against a datetime?

time > current_timestamp - interval '24 hours',
where time is yyyy-mm-dd hh-mm-ss+02, like 2005-06-22 16:00:00+02.

If Pg can't do it, and current_time is faster, I could possibly convert
the time field in my database to a timestamp, and insert all rows as
timestamps rather than a datetime. But that is quite a script to run over
19 million rows, so I need to know whether it will give me any more speed...

Kjell Tore.

On 6/22/05, Tobias Brox <tobias@nordicbet.com> wrote:
> [Kjell Tore Fossbakk - Wed at 09:45:22AM +0200]
> > database=> explain analyze select count(*) from test where p1=53 and
> > time > now() - interval '24 hours' ;
>
> Sorry to say that I have not followed the entire thread neither read the
> entire email I'm replying to, but I have a quick hint on this one (ref my
> earlier thread about timestamp indices) - the postgresql planner will
> generally behave smarter when using a fixed timestamp (typically generated
> by the app server) than logics based on now().
>
> One of my colleagues also claimed that he found the usage of
> localtimestamp faster than now().
>
> --
> Tobias Brox, +86-13521622905
> Nordicbet, IT dept
>

Re: Querying 19million records very slowly

From: Tobias Brox
Date:
[Kjell Tore Fossbakk - Wed at 10:18:30AM +0200]
> I'll test the use of current_timestamp, rather than now(). I am not
> sure if Pg can do a match between a fixed timestamp and a datetime?

I have almost all my experience with timestamps without time zones, but...
isn't that almost the same as the datetime type?

> time > current_timestamp - interval '24 hours',
> when time is yyyy-mm-dd hh-mm-ss+02, like 2005-06-22 16:00:00+02.

Try typing '2005-06-21 16:36:22+08' directly into the query, and see if it
makes a difference.  Or probably '2005-06-21 10:36:22+02' in your case ;-)

(btw, does PostgreSQL really handle time zones?  '+02' is quite different
from 'CET', which will become obvious sometime in the late autumn...)

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept

Re: Querying 19million records very slowly

From: Michael Glaesemann
Date:
On Jun 22, 2005, at 5:39 PM, Tobias Brox wrote:

> (btw, does postgresql really handles timezones?  '+02' is quite
> different
> from 'CET', which will be obvious sometime in the late autoumn...)

Yes, it does. It doesn't (currently) record the time zone name, but
rather only the offset from UTC. If a time zone name (rather than UTC
offset) is given, it is converted to the UTC offset *at that
timestamptz* when it is stored. For time zones that take into account
DST, their UTC offset changes during the year, and PostgreSQL records
the equivalent UTC offset for the appropriate timestamptz values.
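A small illustration from psql (the literals are only examples; the offset
shown in the output follows the session's TimeZone setting):

database=> select '2005-06-22 16:00:00 CET'::timestamptz, '2005-06-22 16:00:00+02'::timestamptz ;

Both literals are resolved to a UTC offset when they are parsed; the zone
name "CET" itself is not kept.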

There has been discussion in the past on storing the time zone name
with the timestamptz as well, though no one has implemented this yet.

Michael Glaesemann
grzm myrealbox com


Re: Querying 19million records very slowly

From: Kjell Tore Fossbakk
Date:
> Try to type in '2005-06-21 16:36:22+08' directly in the query, and see if it
> makes changes.  Or probably '2005-06-21 10:36:22+02' in your case ;-)

Which one does Pg read fastest? Does it convert the datetime in the table,
then my where clause, and compare them for each row? How does it compare a
datetime with a datetime? Timestamps are easy: one large number bigger than
another large number...

time (datetime) > '2005-06-21 10:36:22+02'

or

time (timestamp) > 'some timestamp pointing to yesterday'

Hmm... I can't find any doc that describes this very well.


On 6/22/05, Michael Glaesemann <grzm@myrealbox.com> wrote:
>
> On Jun 22, 2005, at 5:39 PM, Tobias Brox wrote:
>
> > (btw, does postgresql really handles timezones?  '+02' is quite
> > different
> > from 'CET', which will be obvious sometime in the late autoumn...)
>
> Yes, it does. It doesn't (currently) record the time zone name, but
> rather only the offset from UTC. If a time zone name (rather than UTC
> offset) is given, it is converted to the UTC offset *at that
> timestamptz* when it is stored. For time zones that take into account
> DST, their UTC offset changes during the year, and PostgreSQL records
> the equivalent UTC offset for the appropriate timestamptz values.
>
> There has been discussion in the past on storing the time zone name
> with the timestamptz as well, though no one has implemented this yet.
>
> Michael Glaesemann
> grzm myrealbox com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

Re: Querying 19million records very slowly

From: Tobias Brox
Date:
[Kjell Tore Fossbakk - Wed at 11:10:42AM +0200]
> Which one does Pg read fastes? Does he convert datetime in the table,
> then my where clause and check, for each row? How does he compare a
> datetime with a datetime? Timestamp are easy, large number bigger than
> another large number..
>
> time (datetime) > '2005-06-21 10:36:22+02'
>
> or
>
> time (timestamp) > 'some timestamp pointing to yesterday'

If I have understood it correctly, the planner will recognize the timestamp
and compare it with the statistics in the first example but not in the
second, and thus it will be more likely to use index scan on the first one
and seqscan on the second.

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept

Re: Querying 19million records very slowly

From: Tom Lane
Date:
Tobias Brox <tobias@nordicbet.com> writes:
>> time (datetime) > '2005-06-21 10:36:22+02'
>> or
>> time (timestamp) > 'some timestamp pointing to yesterday'

> If I have understood it correctly, the planner will recognize the timestamp
> and compare it with the statistics in the first example but not in the
> second, and thus it will be more likely to use index scan on the first one
> and seqscan on the second.

That statement is true for releases before 8.0.  Kjell has not at any
point told us what PG version he is running, unless I missed it...

            regards, tom lane

Re: Querying 19million records very slowly

From: Paul Ramsey
Date:
OK, so the planner is in fact making a mistake (I think).  Try turning
down your random_page_cost a little. It defaults to 4.0; see if 2.0
works "right". (Careful: if you move these things around too much for one
query, you will wreck others.)  4.0 is a little large for almost all
modern hardware, so see if moving it down a little makes things
somewhat smarter.

P

On Wednesday, June 22, 2005, at 12:45 AM, Kjell Tore Fossbakk wrote:

> database=> set enable_seqscan to on;
> SET
> Time: 0.34 ms
>
>
>
> database=> explain analyze select count(*) from test where p1=53;
>                                                       QUERY PLAN
> -----------------------------------------------------------------------
> ------------------------------------------------
>  Aggregate  (cost=522824.50..522824.50 rows=1 width=0) (actual
> time=56380.72..56380.72 rows=1 loops=1)
>    ->  Seq Scan on test  (cost=0.00..517383.30 rows=2176479 width=0)
> (actual time=9.61..47677.48 rows=2220746 loops=1)
>          Filter: (p1 = 53)
>  Total runtime: 56380.79 msec
> (4 rows)
>
> Time: 56381.40 ms
>
>
>
> database=> explain analyze select count(*) from test where p1=53 and
> time > now() - interval '24 hours' ;
>                                                        QUERY PLAN
> -----------------------------------------------------------------------
> -------------------------------------------------
>  Aggregate  (cost=661969.01..661969.01 rows=1 width=0) (actual
> time=45787.02..45787.02 rows=1 loops=1)
>    ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493 width=0)
> (actual time=37799.32..45613.58 rows=42424 loops=1)
>          Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
>  Total runtime: 45787.09 msec
> (4 rows)
>
> Time: 45787.79 ms
>
>
>
> database=> explain analyze select date_trunc('hour', time),count(*) as
> total from test where p1=53 and time>now()-interval '24 hours' group
> by date_trunc order by date_trunc;
>                                                              QUERY
> PLAN
> -----------------------------------------------------------------------
> -------------------------------------------------------------
>  Aggregate  (cost=755116.97..760558.17 rows=72549 width=8) (actual
> time=46040.63..46717.61 rows=23 loops=1)
>    ->  Group  (cost=755116.97..758744.44 rows=725493 width=8) (actual
> time=46022.06..46548.84 rows=42407 loops=1)
>          ->  Sort  (cost=755116.97..756930.70 rows=725493 width=8)
> (actual time=46022.04..46198.94 rows=42407 loops=1)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493
> width=8) (actual time=37784.91..45690.88 rows=42407 loops=1)
>                      Filter: ((p1 = 53) AND ("time" > (now() - '1
> day'::interval)))
>  Total runtime: 46718.43 msec
> (7 rows)
>
> Time: 46719.44 ms
>
>
>
> database=> create index test_time_p1_idx on test(time,p1) ;
> CREATE INDEX
> Time: 178926.02 ms
>
> database=> vacuum analyze test ;
> VACUUM
> Time: 73058.33 ms
>
> database=> \d test
>                Table "public.test"
>   Column     |           Type           | Modifiers
> -------------+--------------------------+-----------
>  time        | timestamp with time zone |
>  source      | inet                     |
>  destination | inet                     |
>  p1          | integer                  |
>  p2          | integer                  |
>
>
>
> database=> \di
>  public | test_time_idx        | index | database | test
>  public | test_source_idx      | index | database | test
>  public | test_destination_idx | index | database | test
>  public | test_p1_idx          | index | database | test
>  public | test_p2_idx          | index | database | test
>  public | test_time_p1_idx     | index | database | test
>
>
>
> database=> set enable_seqscan to off ;
> SET
> Time: 0.28 ms
>
>
>
> database=> explain analyze select date_trunc('hour', time),count(*) as
> total from test where p1=53 and time>now()-interval '24 hours' group
> by date_trunc order by date_trunc;
>  Aggregate  (cost=2315252.66..2320767.17 rows=73527 width=8) (actual
> time=2081.15..2720.44 rows=23 loops=1)
>    ->  Group  (cost=2315252.66..2318929.00 rows=735268 width=8)
> (actual time=2079.76..2564.22 rows=41366 loops=1)
>          ->  Sort  (cost=2315252.66..2317090.83 rows=735268 width=8)
> (actual time=2079.74..2243.32 rows=41366 loops=1)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Index Scan using test_time_p1_idx on test
> (cost=0.00..2218878.46 rows=735268 width=8) (actual
> time=29.50..1774.52 rows=41366 loops=1)
>                      Index Cond: (("time" > (now() - '1
> day'::interval)) AND (p1 = 53))
>  Total runtime: 2735.42 msec
>
> Time: 2736.48 ms
>
>
>
> database=> explain analyze select date_trunc('hour', time),count(*) as
> total from test where p1=80 and time>now()-interval '24 hours' group
> by date_trunc order by date_trunc;
>  Aggregate  (cost=1071732.15..1074305.59 rows=34313 width=8) (actual
> time=6353.93..7321.99 rows=22 loops=1)
>    ->  Group  (cost=1071732.15..1073447.77 rows=343125 width=8)
> (actual time=6323.76..7078.10 rows=64267 loops=1)
>          ->  Sort  (cost=1071732.15..1072589.96 rows=343125 width=8)
> (actual time=6323.75..6579.42 rows=64267 loops=1)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Index Scan using test_time_p1_idx on test
> (cost=0.00..1035479.58 rows=343125 width=8) (actual time=0.20..5858.67
> rows=64267 loops=1)
>                      Index Cond: (("time" > (now() - '1
> day'::interval)) AND (p1 = 80))
>  Total runtime: 7322.82 msec
>
> Time: 7323.90 ms
>
>
>
> database=> explain analyze select date_trunc('hour', time),count(*) as
> total from test where p1=139 and time>now()-interval '24 hours' group
> by date_trunc order by date_trunc;
>  Aggregate  (cost=701562.34..703250.12 rows=22504 width=8) (actual
> time=2448.41..3033.80 rows=22 loops=1)
>    ->  Group  (cost=701562.34..702687.53 rows=225037 width=8) (actual
> time=2417.39..2884.25 rows=36637 loops=1)
>          ->  Sort  (cost=701562.34..702124.94 rows=225037 width=8)
> (actual time=2417.38..2574.19 rows=36637 loops=1)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Index Scan using test_time_p1_idx on test
> (cost=0.00..679115.34 rows=225037 width=8) (actual time=8.47..2156.18
> rows=36637 loops=1)
>                      Index Cond: (("time" > (now() - '1
> day'::interval)) AND (p1 = 139))
>  Total runtime: 3034.57 msec
>
> Time: 3035.70 ms
>
>
>
> Now, this query gives me all the hours in a day, with the count of all
> p1=53 for each hour. Pg uses 46.7 seconds to run with seqscan, while
> 2.7 seconds indexing on (time,p1). I think I turned "set
> enable_seqscan to on;" again, and then the planner used seqscan, and
> not index.
> - Why does Pg not see the benefits of using index?
> - and how can i tune the optimisation fields in postgresql.conf to
> help him?
>
> So now my PG uses a reasonable amout of time on these queries (with
> enable_seqscan turned off)
>
> The next place which seems to slow my queries, is probably my
> connection to PHP. I got a bash script running in cron on my server
> (freebsd 4.11), which runs php on a php file. To force PG to not use
> seqscans, I have modifies the postgresql.conf:
>
> ..
> enable_seqscan = false
> enable_indexscan = true
> ..
> effective_cache_size = 10000
> random_page_cost = 2
> ..
>
> I save the file, type 'pg_crl reload' then enter 'psql database'.
>
> database=> show enable_seqscan ;
>  enable_seqscan
> ----------------
>  on
> (1 row)
>
>
> argus=> show effective_cache_size ;
>  effective_cache_size
> ----------------------
>  1000
> (1 row)
>
> I have used the manual pages on postgresql, postmaster, and so on, but
> I cant find anywhere to specify which config file Pg is to use. I'm
> not entirely sure if he uses the one im editing
> (/usr/local/etc/postgresql.conf).
>
> Any hints, tips or help is most appreciated!
>
> Kjell Tore.
>
>
>
>
>
> On 6/21/05, PFC <lists@boutiquenumerique.com> wrote:
>>
>> use CURRENT_TIME which is a constant instead of now() which is not
>> considered constant... (I think)
>>
>>
      Paul Ramsey
      Refractions Research
      Email: pramsey@refractions.net
      Phone: (250) 885-0632


Re: Querying 19million records very slowly

From: Kjell Tore Fossbakk
Date:
I can't get the config file to load into my Postgres; that's the
problem. I want to set it to 10k, but it is still at 1000... I
save the file and restart the service.

Yes, I meant 'pg_ctl reload', sorry about that one.

Kjell Tore

On 6/22/05, Bricklen Anderson <BAnderson@presinet.com> wrote:
> >> enable_seqscan = false
> >> enable_indexscan = true
> >> ..
> >> effective_cache_size = 10000
> >> random_page_cost = 2
> >> ..
> >>
> >> I save the file, type 'pg_crl reload' then enter 'psql database'.
> >>
> >> argus=> show effective_cache_size ;
> >>  effective_cache_size
> >> ----------------------
> >>  1000
> >> (1 row)
>
> I assume that 'pg_crl' is a typo? That should read 'pg_ctl reload'
> Also, you said that your effective_cache_size = 10000, yet when you SHOW
> it,
> it's only 1000. A cut 'n paste error, or maybe your erroneous "pg_crl"
> didn't
> trigger the reload?
>
>

Re: Querying 19million records very slowly

From: Stefan Weiss
Date:
On 2005-06-22 10:55, Michael Glaesemann wrote:
> There has been discussion in the past on storing the time zone name
> with the timestamptz as well, though no one has implemented this yet.

The reason for this may be that time zone names (abbreviations) are not
unique. For example, "ECT" can mean "Ecuador Time" (offset -05) or
"Eastern Caribbean Time" (offset -04).

http://www.worldtimezone.com/wtz-names/timezonenames.html

cheers,
stefan

Re: Querying 19million records very slowly

From: Tobias Brox
Date:
[Kjell Tore Fossbakk - Wed at 07:41:54AM -0700]
> I cant get the config file to load into my postgres. that's the
> problem. I want to set it to 10k, but it is only still at 1000... I
> save the file and restart the service..
>
> yes, i ment 'pg_ctl reload', sry about that one.

A classic problem, depending a bit on the distro you are using.

The "master" file usually resides in /etc/postgresql while the actual file
used usually resides in /var/lib/postgres/data ... or something.  Some
distros copy the file over (so that one should always edit the file in
/etc); others don't (so you either have to do that yourself, or edit the
file in the database data directory).

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept

Re: Querying 19million records very slowly

From: Kjell Tore Fossbakk
Date:
- I also changed now() to current_time, which improved performance quite a
bit. I need to run further tests before I post any results.
- I tried to change now() - interval 'x hours' to something like
2005-06-22 16:00:00+02. This also improved performance.

So either time > '2005-06-22 16:00:00+02' (or whatever date is 24 hours
back) or time > current_time - interval '24 hours' will be used.

I'm running FreeBSD 4.11, and I'm editing the file in
/usr/local/etc/postgresql.conf, but it doesn't help. When I start up
"psql database", none of the options are changed (even with a restart of
the postmaster). I can't find a '--configuration=path/file' option for the
postmaster either...

Kjell Tore
 
On 6/22/05, Tobias Brox <tobias@nordicbet.com> wrote:
[Kjell Tore Fossbakk - Wed at 07:41:54AM -0700]
> I cant get the config file to load into my postgres. that's the
> problem. I want to set it to 10k, but it is only still at 1000... I
> save the file and restart the service..
>
> yes, i ment 'pg_ctl reload', sry about that one.

Classical problem, a bit depending on the distro you are using.

The "master" file usually resides in /etc/postgresql while the actual file
used usually resides in /var/lib/postgres/data ... or something.  Some
distros copies over the file (such that one always should edit the file in
/etc) others don't (thus you either have to do that your self, or edit the
file in the database data directory.

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept

Re: Querying 19million records very slowly

From: Tobias Brox
Date:
>  I'm running FreeBSD 4.11, and im editing the file in
> /usr/local/etc/postgresql.conf, but it doesnt help.

On my system the 'live' config file resides in
/var/lib/postgresql/data/postgresql.conf - maybe you have yours in
/usr/local/var/lib ...
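For what it's worth, on PostgreSQL 8.0 and later you can ask the running
server directly (the config_file setting does not exist on 7.3, where you
have to look at the -D data directory the postmaster was started with):

database=> show config_file ;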

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept