Обсуждение: Slow query and indexes...
Hi,
I'm trying to figure out how to make postgres utilize my indexes on a table.
this query:
>> explain analyze SELECT max(date_time) FROM data_values;
Goes fast and returns:
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=0.08..0.09 rows=1 width=0) (actual time=0.108..0.111
rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.08 rows=1 width=8) (actual
time=0.090..0.092 rows=1 loops=1)
           ->  Index Scan Backward using
data_values_data_date_time_index on data_values  (cost=0.00..58113.06
rows=765121 width=8) (actual time=0.078..0.078 rows=1 loops=1)
                 Filter: (date_time IS NOT NULL)
 Total runtime: 0.204 ms
(6 rows)
while if I add a GROUP BY data_logger  the query uses a seq scan and a
lot of time:
>> explain analyze SELECT max(date_time) FROM data_values GROUP BY
data_logger_id;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=20171.82..20171.85 rows=3 width=12) (actual
time=3510.500..3510.506 rows=3 loops=1)
   ->  Seq Scan on data_values  (cost=0.00..16346.21 rows=765121
width=12) (actual time=0.039..1598.518 rows=765121 loops=1)
 Total runtime: 3510.634 ms
(3 rows)
Tha table contains approx 765000 rows. It has three distinct
data_logger_id's. I can make quick queries on each of them using:
SELECT max(date_time) FROM data_values where data_logger_id=1
I have an index on the date_time field and on the data_logger_id
field, and I ahve also tried to make an index with both date_time and
data_logger_id. Anyone have any idea whats going on, and suggestions
what I should do to speed up my query?
Regards Jonas:)))
Im using PostgreSQL 8.2.3 on windows xp.
My table:
CREATE TABLE data_values
(
  data_value_id serial NOT NULL,
  data_type_id integer NOT NULL,
  data_collection_id integer NOT NULL,
  data_logger_id integer NOT NULL,
  date_time timestamp without time zone NOT NULL,
  lat_wgs84 double precision NOT NULL,
  lon_wgs84 double precision NOT NULL,
  height integer NOT NULL,
  parallell integer NOT NULL DEFAULT 0,
  data_value double precision NOT NULL,
  sensor_id integer,
  CONSTRAINT data_values_pkey PRIMARY KEY (data_value_id),
  CONSTRAINT data_values_data_collection_id_fkey FOREIGN KEY
(data_collection_id)
      REFERENCES data_collections (data_collection_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_data_logger_id_fkey FOREIGN KEY (data_logger_id)
      REFERENCES data_loggers (data_logger_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_data_type_id_fkey FOREIGN KEY (data_type_id)
      REFERENCES data_types (data_type_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_sensor_id_fkey FOREIGN KEY (sensor_id)
      REFERENCES sensors (sensor_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_data_type_id_key UNIQUE (data_type_id,
data_logger_id, date_time, lat_wgs84, lon_wgs84, height, parallell)
);
CREATE INDEX data_values_data_date_time_index
  ON data_values
  USING btree
  (date_time);
CREATE INDEX data_values_data_logger_id_index
  ON data_values
  USING btree
  (data_logger_id);
CREATE INDEX data_values_time_logger_index
  ON data_values
  USING btree
  (data_logger_id, date_time);
			
		Am Montag, 7. Mai 2007 15:53 schrieb Jonas Henriksen: > while if I add a GROUP BY data_logger the query uses a seq scan and a > > lot of time: > >> explain analyze SELECT max(date_time) FROM data_values GROUP BY > data_logger_id; I don't think there is anything you can do about this. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On May 7, 2007, at 8:53 AM, Jonas Henriksen wrote: > while if I add a GROUP BY data_logger the query uses a seq scan and a > lot of time: >>> explain analyze SELECT max(date_time) FROM data_values GROUP BY > data_logger_id; What do you get if you run that with SET enable_seqscan = off; ? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Well thanks, but that don't help me much.
I've tried setting an extra condition using datetime>(now() - '14
weeks'::interval)
explain analyze
SELECT max(date_time) FROM data_values
where date_time > (now() - '14 weeks'::interval)
 GROUP BY data_logger_id;
HashAggregate  (cost=23264.52..23264.55 rows=2 width=12) (actual
time=1691.447..1691.454 rows=3 loops=1)
  ->  Bitmap Heap Scan on data_values  (cost=7922.08..21787.31
rows=295442 width=12) (actual time=320.643..951.043 rows=298589
loops=1)
        Recheck Cond: (date_time > (now() - '98 days'::interval))
        ->  Bitmap Index Scan on data_values_data_date_time_index
(cost=0.00..7848.22 rows=295442 width=0) (actual time=319.708..319.708
rows=298589 loops=1)
              Index Cond: (date_time > (now() - '98 days'::interval))
Total runtime: 1691.598 ms
However, when I switch to using datetime>(now() - '15 weeks'::interval) I get:
explain analyze
SELECT max(date_time) FROM data_values
where date_time > (now() - '15 weeks'::interval)
 GROUP BY data_logger_id;
HashAggregate  (cost=23798.26..23798.28 rows=2 width=12) (actual
time=3237.816..3237.823 rows=3 loops=1)
  ->  Seq Scan on data_values  (cost=0.00..22084.62 rows=342728
width=12) (actual time=0.037..2409.234 rows=344111 loops=1)
        Filter: (date_time > (now() - '105 days'::interval))
Total runtime: 3237.944 ms
Doing "SET enable_seqscan=off" speeds up the query and forces the use
of the index, but I dont really love that solution...
regards Jonas:))
On 5/7/07, Peter Eisentraut <peter_e@gmx.net> wrote:
> Am Montag, 7. Mai 2007 15:53 schrieb Jonas Henriksen:
> > while if I add a GROUP BY data_logger the query uses a seq scan and a
> >
> > lot of time:
> > >> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> > data_logger_id;
>
> I don't think there is anything you can do about this.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
>
			
		Jonas Henriksen wrote: >>> explain analyze SELECT max(date_time) FROM data_values; > Goes fast and returns: In prior postgres versions, the planner could not take advantage of indexes with max() (nor min()) calculations. A workaround to this was (given an appropriate index) a query like: select date_time from data_values order by date_time desc limit 1; The planner in recent versions has been upgraded to recognize this case and basically apply the same workaround automatically. This is shown by the "Index Scan Backward" and "Limit" nodes in the plan you posted. >>> explain analyze SELECT max(date_time) FROM data_values GROUP BY > data_logger_id; I cannot think of a workaround like above that would speed this up. The planner modifications that work in the above case probably don't handle queries like this in the same way. > Tha table contains approx 765000 rows. It has three distinct > data_logger_id's. I can make quick queries on each of them using: > SELECT max(date_time) FROM data_values where data_logger_id=1 If your 3 distinct data_logger_id will never change (or if you can handle code changes if/when they do change), the following might provide what you are looking for: select max(date_time) from data_values where data_logger_id=1 union all select max(date_time) from data_values where data_logger_id=2 union all select max(date_time) from data_values where data_logger_id=3 If that works for you, you may also be able to eliminate the (data_logger_id, date_time) index if no other queries need it (i.e. you added it in an attempt to speed up this specific case). Hope this helps. Andrew
On 5/7/07, Andrew Kroeger <andrew@sprocks.gotdns.com> wrote: > Jonas Henriksen wrote: > > >>> explain analyze SELECT max(date_time) FROM data_values; > > Goes fast and returns: > > In prior postgres versions, the planner could not take advantage of > indexes with max() (nor min()) calculations. A workaround to this was > (given an appropriate index) a query like: > > select date_time from data_values order by date_time desc limit 1; > > The planner in recent versions has been upgraded to recognize this case > and basically apply the same workaround automatically. This is shown by > the "Index Scan Backward" and "Limit" nodes in the plan you posted. > > >>> explain analyze SELECT max(date_time) FROM data_values GROUP BY > > data_logger_id; > > I cannot think of a workaround like above that would speed this up. The > planner modifications that work in the above case probably don't handle > queries like this in the same way. > > > Tha table contains approx 765000 rows. It has three distinct > > data_logger_id's. I can make quick queries on each of them using: > > SELECT max(date_time) FROM data_values where data_logger_id=1 > > If your 3 distinct data_logger_id will never change (or if you can > handle code changes if/when they do change), the following might provide > what you are looking for: > > select max(date_time) from data_values where data_logger_id=1 > union all > select max(date_time) from data_values where data_logger_id=2 > union all > select max(date_time) from data_values where data_logger_id=3 > > If that works for you, you may also be able to eliminate the > (data_logger_id, date_time) index if no other queries need it (i.e. you > added it in an attempt to speed up this specific case). Naive question, but how would an index on (date_time, data_logger_id) affect things? Say coupled with limit 3 for the above case, or the date interval condition. Isak > > Hope this helps. > > Andrew > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match >
Thanks for a good answer, I'll try to find a workaround. The number of data_loggers will change, but not to frequently. I was actually hoping to make a view showing the latest data for each logger, maybe I can manage that with a stored procedure thingy... Regards, Jonas:)) On 5/7/07, Andrew Kroeger <andrew@sprocks.gotdns.com> wrote: > Jonas Henriksen wrote: > > >>> explain analyze SELECT max(date_time) FROM data_values; > > Goes fast and returns: > > In prior postgres versions, the planner could not take advantage of > indexes with max() (nor min()) calculations. A workaround to this was > (given an appropriate index) a query like: > > select date_time from data_values order by date_time desc limit 1; > > The planner in recent versions has been upgraded to recognize this case > and basically apply the same workaround automatically. This is shown by > the "Index Scan Backward" and "Limit" nodes in the plan you posted. > > >>> explain analyze SELECT max(date_time) FROM data_values GROUP BY > > data_logger_id; > > I cannot think of a workaround like above that would speed this up. The > planner modifications that work in the above case probably don't handle > queries like this in the same way. > > > Tha table contains approx 765000 rows. It has three distinct > > data_logger_id's. I can make quick queries on each of them using: > > SELECT max(date_time) FROM data_values where data_logger_id=1 > > If your 3 distinct data_logger_id will never change (or if you can > handle code changes if/when they do change), the following might provide > what you are looking for: > > select max(date_time) from data_values where data_logger_id=1 > union all > select max(date_time) from data_values where data_logger_id=2 > union all > select max(date_time) from data_values where data_logger_id=3 > > If that works for you, you may also be able to eliminate the > (data_logger_id, date_time) index if no other queries need it (i.e. you > added it in an attempt to speed up this specific case). > > Hope this helps. > > Andrew > >
> Thanks for a good answer, I'll try to find a workaround. The number of
> data_loggers will change, but not to frequently. I was actually hoping
> to make a view showing the latest data for each logger, maybe I can
> manage that with a stored procedure thingy...
    - Create a table which contains your list of loggers (since it's good
normalization anyway, you probably have it already) and have your data
table's logger_id REFERENCE it
    - You now have a simple way to get the list of loggers (just select from
the loggers table which will contain 3 rows)
    - Then, to get the most recent record for each logger_id, you do :
SELECT l.logger_id, (SELECT id FROM data d WHERE d.logger_id = l.logger_id
ORDER BY d.logger_id DESC, d.date_time DESC LIMIT 1) AS last_record_id
 FROM loggers l
    2 minute example :
forum_bench=> CREATE TABLE loggers (id SERIAL PRIMARY KEY, name TEXT );
CREATE TABLE
forum_bench=> INSERT INTO loggers (name) VALUES ('logger 1'),('logger
2'),('logger 3');
INSERT 0 3
forum_bench=> CREATE TABLE data (id SERIAL PRIMARY KEY, logger_id INTEGER
NOT NULL REFERENCES loggers( id ));
CREATE TABLE
forum_bench=> INSERT INTO data (logger_id) SELECT 1+floor(random()*3) FROM
generate_series(1,1000000);
forum_bench=> SELECT logger_id, count(*) FROM data GROUP BY logger_id;
  logger_id | count
-----------+--------
          3 | 333058
          2 | 333278
          1 | 333664
NOTE : I use id rather than timestamp to get the last one
forum_bench=> EXPLAIN ANALYZE SELECT logger_id, max(id) FROM data GROUP BY
logger_id;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=19166.82..19169.32 rows=200 width=8) (actual
time=1642.556..1642.558 rows=3 loops=1)
    ->  Seq Scan on data  (cost=0.00..14411.88 rows=950988 width=8) (actual
time=0.028..503.308 rows=1000000 loops=1)
  Total runtime: 1642.610 ms
forum_bench=> CREATE INDEX data_by_logger ON data (logger_id, id);
CREATE INDEX
forum_bench=> EXPLAIN ANALYZE SELECT l.id, (SELECT d.id FROM data d WHERE
d.logger_id=l.id ORDER BY d.logger_id DESC, d.id DESC LIMIT 1) FROM
loggers l;
                                                                      QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on loggers l  (cost=0.00..3128.51 rows=1160 width=4) (actual
time=0.044..0.074 rows=3 loops=1)
    SubPlan
      ->  Limit  (cost=0.00..2.68 rows=1 width=8) (actual time=0.020..0.020
rows=1 loops=3)
            ->  Index Scan Backward using data_by_logger on data d
(cost=0.00..13391.86 rows=5000 width=8) (actual time=0.018..0.018 rows=1
loops=3)
                  Index Cond: (logger_id = $0)
  Total runtime: 0.113 ms
(6 lignes)
forum_bench=> SELECT l.id, (SELECT d.id FROM data d WHERE d.logger_id=l.id
ORDER BY d.logger_id DESC, d.id DESC LIMIT 1) FROM loggers l;
  id | ?column?
----+----------
   1 |   999999
   2 |  1000000
   3 |   999990
(3 lignes)
			
		On May 8, 2007, at 3:29 AM, PFC wrote: > Create a table which contains your list of loggers (since it's good > normalization anyway, you probably have it already) and have your > data table's logger_id REFERENCE it BTW, you could do that dynamically with a subselect: (SELECT DISTINCT logger_id FROM data) AS loggers, though I'm not sure how optimal the plan would be. BTW, I encourage you to not use 'id' as a field name; I've found it makes doing things like joins a lot trickier. Easier to just make every id field the same (logger_id in this case). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
There are other ways to influence the selection of a seqscan, notably effective_cache_size and random_page_cost. First, you need to find out at what point a seqscan is actually faster than an index scan. That's going to be a trial and error search, but eventually if you're going back far enough in time the seqscan will be faster. EXPLAIN ANALYZE has it's own overhead, so a better way to test this is with psql's timing command, and wrap the query into a count so you're not shoving a bunch of data across to psql: SELECT count(*) FROM (... your query goes here ...) a; (SELECT 1 might work too and would be more accurate) Once you've found the break even point, you can tweak all the cost estimates. Start by making sure that effective_cache_size is set approximately to how much memory you have. Increasing that will favor an index scan. Decreasing random_page_cost will also favor an index scan, though I'd try not to go below 2 and definitely not below 1. You can also tweak the CPU cost estimates (lower numbers will favor indexes). But keep in mind that doing that at a system level will impact every query running in the system. You may have no choice but to explicitly set custom parameters for just this statement. SET LOCAL and wrapping the SELECT in a transaction is a less painful way to do that. On May 7, 2007, at 10:47 AM, Jonas Henriksen wrote: > Well thanks, but that don't help me much. > > I've tried setting an extra condition using datetime>(now() - '14 > weeks'::interval) > > explain analyze > SELECT max(date_time) FROM data_values > where date_time > (now() - '14 weeks'::interval) > GROUP BY data_logger_id; > > HashAggregate (cost=23264.52..23264.55 rows=2 width=12) (actual > time=1691.447..1691.454 rows=3 loops=1) > -> Bitmap Heap Scan on data_values (cost=7922.08..21787.31 > rows=295442 width=12) (actual time=320.643..951.043 rows=298589 > loops=1) > Recheck Cond: (date_time > (now() - '98 days'::interval)) > -> Bitmap Index Scan on data_values_data_date_time_index > (cost=0.00..7848.22 rows=295442 width=0) (actual time=319.708..319.708 > rows=298589 loops=1) > Index Cond: (date_time > (now() - '98 days'::interval)) > Total runtime: 1691.598 ms > > However, when I switch to using datetime>(now() - '15 > weeks'::interval) I get: > explain analyze > SELECT max(date_time) FROM data_values > where date_time > (now() - '15 weeks'::interval) > GROUP BY data_logger_id; > > HashAggregate (cost=23798.26..23798.28 rows=2 width=12) (actual > time=3237.816..3237.823 rows=3 loops=1) > -> Seq Scan on data_values (cost=0.00..22084.62 rows=342728 > width=12) (actual time=0.037..2409.234 rows=344111 loops=1) > Filter: (date_time > (now() - '105 days'::interval)) > Total runtime: 3237.944 ms > > Doing "SET enable_seqscan=off" speeds up the query and forces the use > of the index, but I dont really love that solution... > > > regards Jonas:)) > > > > > On 5/7/07, Peter Eisentraut <peter_e@gmx.net> wrote: >> Am Montag, 7. Mai 2007 15:53 schrieb Jonas Henriksen: >> > while if I add a GROUP BY data_logger the query uses a seq scan >> and a >> > >> > lot of time: >> > >> explain analyze SELECT max(date_time) FROM data_values GROUP BY >> > data_logger_id; >> >> I don't think there is anything you can do about this. >> >> -- >> Peter Eisentraut >> http://developer.postgresql.org/~petere/ >> > > ---------------------------(end of > broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)