Discussion: database performance and query performance question
Our database has slowed down considerably. We are getting very poor
performance from our biggest table, "forecastelement".
The table has 93,218,671 records in it and climbing.
The index is on 4 columns; originally it was on 3. I added another to see
if it would improve performance. It did not.
Should there be fewer columns in the index?
How can we improve database performance?
How should I improve my query?
PWFPM_DEV=# \d forecastelement
Table "public.forecastelement"
Column | Type | Modifiers
----------------+-----------------------------+-----------
version | character varying(99) | not null
origin | character varying(10) | not null
timezone | character varying(99) | not null
region_id | character varying(20) | not null
wx_element | character varying(99) | not null
value | character varying(99) | not null
flag | character(3) | not null
units | character varying(99) | not null
valid_time | timestamp without time zone | not null
issue_time | timestamp without time zone | not null
next_forecast | timestamp without time zone | not null
reception_time | timestamp without time zone | not null
Indexes:
"forecastelement_vrwi_idx" btree
(valid_time, region_id, wx_element, issue_time)
explain analyze
select DISTINCT ON (valid_time)
       to_char(valid_time,'YYYYMMDDHH24MISS') as valid_time, value
from (select valid_time, value, "time"(valid_time) as hour,
             reception_time, issue_time
      from forecastelement
      where valid_time between '2002-09-02 04:00:00' and '2002-09-07 03:59:59'
        and region_id = 'PU-REG-WTO-00200'
        and wx_element = 'TEMP_VALEUR1'
        and issue_time between '2002-09-02 05:00:00' and '2002-09-06 05:00:00'
        and origin = 'REGIONAL'
        and "time"(issue_time) = '05:00:00'
      order by issue_time, reception_time DESC, valid_time) as foo
where (date(valid_time) = date(issue_time)+1 -1
    or date(valid_time) = date(issue_time)+1
    or (valid_time between '2002-09-07 00:00:00' and '2002-09-07 03:59:59'
        and issue_time = '2002-09-06 05:00:00'))
order by valid_time, issue_time DESC;
USING INDEX
"forecastelement_vrwi_idx" btree (valid_time, region_id, wx_element,
issue_time)
Unique (cost=116.75..116.76 rows=1 width=83) (actual
time=9469.088..9470.002 rows=115 loops=1)
-> Sort (cost=116.75..116.75 rows=1 width=83) (actual
time=9469.085..9469.308 rows=194 loops=1)
Sort Key: to_char(valid_time, 'YYYYMMDDHH24MISS'::text), issue_time
-> Subquery Scan foo (cost=116.72..116.74 rows=1 width=83)
(actual time=9465.979..9467.735 rows=194 loops=1)
-> Sort (cost=116.72..116.73 rows=1 width=30) (actual
time=9440.756..9440.981 rows=194 loops=1)
Sort Key: issue_time, reception_time, valid_time
-> Index Scan using forecastelement_vrwi_idx on
forecastelement (cost=0.00..116.71 rows=1 width=30) (actual
time=176.510..9439.470 rows=194 loops=1)
Index Cond: ((valid_time >= '2002-09-02
04:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07
03:59:59'::timestamp without time zone) AND ((region_id)::text =
'PU-REG-WTO-00200'::text) AND ((wx_element)::text = 'TEMP_VALEUR1'::text)
AND (issue_time >= '2002-09-02 05:00:00'::timestamp without time zone) AND
(issue_time <= '2002-09-06 05:00:00'::timestamp without time zone))
Filter: (((origin)::text = 'REGIONAL'::text) AND
("time"(issue_time) = '05:00:00'::time without time zone) AND
((date(valid_time) = ((date(issue_time) + 1) - 1)) OR (date(valid_time) =
(date(issue_time) + 1)) OR ((valid_time >= '2002-09-07 00:00:00'::timestamp
without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp
without time zone) AND (issue_time = '2002-09-06 05:00:00'::timestamp
without time zone))))
Total runtime: 9470.404 ms
We are running postgresql-7.4-0.5PGDG.i386.rpm .
on a Dell Poweredge 6650.
system
OS RHAS 3.0
cpu 4
memory 3.6 GB
disk 270 GB raid 5
postgresql.conf
max_connections = 64
shared_buffers = 4000
vacuum_mem = 32768
effective_cache_size = 312500
random_page_cost = 2
Dan,

> Should there be fewer columns in the index?
> How can we improve database performance?
> How should I improve my query?

Your query plan isn't the problem. It's a good plan, and a reasonably
efficient query. Under other circumstances, the SELECT DISTINCT with the
to_char could be a performance-killer, but it's not in that result set.

Overall, you're taking 9 seconds to scan 93 million records. Is this the
time the first time you run the query, or the 2nd and successive times?

When did you last run VACUUM ANALYZE on the table? Have you tried
increasing the ANALYZE statistics on the index columns to, say, 500?

Your disks are RAID 5. How many drives? In RAID5, more drives improves the
speed of large scans.

And what's your sort_mem setting? You didn't mention it.

Why is your effective cache size only 300mb when you have 3 GB of RAM? It's
not affecting this query, but it could affect others.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Dan,

> Why is your effective cache size only 300mb when you have 3 GB of RAM? It's
> not affecting this query, but it could affect others.

Ignore this last question, I dropped a zero from my math. Sorry!

--
-Josh Berkus
Aglio Database Solutions
San Francisco
-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Thursday, January 22, 2004 3:01 PM
To: Shea,Dan [CIS]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] database performance and query performance
question
Dan,
> Should there be less columns in the index?
> How can we improve database performance?
> How should I improve my query?
>>Your query plan isn't the problem. It's a good plan, and a reasonably
>>efficient query. Under other circumstances, the SELECT DISTINCT with the
>>to_char could be a performance-killer, but it's not in that result set.
>>Overall, you're taking 9 seconds to scan 93 million records. Is this the
>>time the first time you run the query, or the 2nd and successive times?
This is actually the second time. The first query took more time.
Concerning the number of columns for an index, I switched the index to have
only one column and tried the same query. It is below.
>>When did you last run VACUUM ANALYZE on the table? Have you tried
>>increasing the ANALYZE statistics on the index columns to, say, 500?
It is run nightly. But last night's did not complete. It was taking quite
some time (over 4 hours), so I cancelled it. I will try increasing the
ANALYZE statistics to 500.
>>Your disks are RAID 5. How many drives? In RAID5, more drives improves
>>the speed of large scans.
There are 4 drives in this raid 5. We are using lvm with ext3 filesystem.
Will be moving the database to a SAN within the next month.
>>And what's your sort_mem setting? You didn't mention it.
The sort_mem is the default:
PWFPM_DEV=# show sort_mem;
sort_mem
----------
1024
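A 1 MB sort_mem is very small for sorting over a table this size. A hedged
sketch of raising it (the value 16384 is illustrative only, not a tuned
recommendation):

```sql
-- sort_mem is measured in kilobytes; the default 1024 = 1 MB.
-- Raise it for the current session only (value is illustrative):
SET sort_mem = 16384;
-- or set it server-wide in postgresql.conf:
--   sort_mem = 16384
```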
>>Why is your effective cache size only 300mb when you have 3 GB of RAM? It's
>>not affecting this query, but it could affect others.
Oh, I thought I had it set for 2.5 GB of RAM. 312500 * 8k = 2.5 GB
Query plan with 1 column in the index:
Unique (cost=717633.28..717633.29 rows=1 width=83) (actual
time=62922.399..62923.334 rows=115 loops=1)
-> Sort (cost=717633.28..717633.29 rows=1 width=83) (actual
time=62922.395..62922.615 rows=194 loops=1)
Sort Key: to_char(valid_time, 'YYYYMMDDHH24MISS'::text), issue_time
-> Subquery Scan foo (cost=717633.26..717633.27 rows=1 width=83)
(actual time=62918.232..62919.989 rows=194 loops=1)
-> Sort (cost=717633.26..717633.26 rows=1 width=30) (actual
time=62902.378..62902.601 rows=194 loops=1)
Sort Key: issue_time, reception_time, valid_time
-> Index Scan using forecastelement_v_idx on
forecastelement (cost=0.00..717633.25 rows=1 width=30) (actual
time=1454.974..62900.752 rows=194 loops=1)
Index Cond: ((valid_time >= '2002-09-02
04:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07
03:59:59'::timestamp without time zone))
Filter: (((region_id)::text =
'PU-REG-WTO-00200'::text) AND ((wx_element)::text = 'TEMP_VALEUR1'::text)
AND (issue_time >= '2002-09-02 05:00:00'::timestamp without time zone) AND
(issue_time <= '2002-09-06 05:00:00'::timestamp without time zone) AND
((origin)::text = 'REGIONAL'::text) AND ("time"(issue_time) =
'05:00:00'::time without time zone) AND ((date(valid_time) =
((date(issue_time) + 1) - 1)) OR (date(valid_time) = (date(issue_time) + 1))
OR ((valid_time >= '2002-09-07 00:00:00'::timestamp without time zone) AND
(valid_time <= '2002-09-07 03:59:59'::timestamp without time zone) AND
(issue_time = '2002-09-06 05:00:00'::timestamp without time zone))))
Total runtime: 62923.723 ms
(10 rows)
PWFPM_DEV=# explain analyze select DISTINCT ON (valid_time)
to_char(valid_time,'YYYYMMDDHH24MISS') as valid_time,value from
PWFPM_DEV-# (select valid_time,value,"time"(valid_time)
as hour,reception_time,
PWFPM_DEV(# issue_time from forecastelement where
PWFPM_DEV(# valid_time between '2002-09-02 04:00:00' and
PWFPM_DEV(# '2002-09-07 03:59:59' and region_id =
'PU-REG-WTO-00200'
PWFPM_DEV(# and wx_element = 'TEMP_VALEUR1' and
issue_time between
PWFPM_DEV(# '2002-09-02 05:00:00' and '2002-09-06
05:00:00'
PWFPM_DEV(# and origin = 'REGIONAL' and
"time"(issue_time) = '05:00:00'
PWFPM_DEV(# order by issue_time,reception_time
DESC,valid_time) as foo where
PWFPM_DEV-# (date(valid_time) = date(issue_time)+1 -1 or
date(valid_time) = date(issue_time)+1 or
PWFPM_DEV(# (valid_time between '2002-09-07 00:00:00' and
'2002-09-07 03:59:59'
PWFPM_DEV(# and issue_time = '2002-09-06 05:00:00'))
order by valid_time ,issue_time DESC;
QUERY PLAN
----------------------------------------------------------------------------
Unique (cost=717633.28..717633.29 rows=1 width=83) (actual
time=21468.227..21469.164 rows=115 loops=1)
-> Sort (cost=717633.28..717633.29 rows=1 width=83) (actual
time=21468.223..21468.452 rows=194 loops=1)
Sort Key: to_char(valid_time, 'YYYYMMDDHH24MISS'::text), issue_time
-> Subquery Scan foo (cost=717633.26..717633.27 rows=1 width=83)
(actual time=21465.274..21467.006 rows=194 loops=1)
-> Sort (cost=717633.26..717633.26 rows=1 width=30) (actual
time=21465.228..21465.452 rows=194 loops=1)
Sort Key: issue_time, reception_time, valid_time
-> Index Scan using forecastelement_v_idx on
forecastelement (cost=0.00..717633.25 rows=1 width=30) (actual
time=1479.649..21463.779 rows=194 loops=1)
Index Cond: ((valid_time >= '2002-09-02
04:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07
03:59:59'::timestamp without time zone))
Filter: (((region_id)::text =
'PU-REG-WTO-00200'::text) AND ((wx_element)::text = 'TEMP_VALEUR1'::text)
AND (issue_time >= '2002-09-02 05:00:00'::timestamp without time zone) AND
(issue_time <= '2002-09-06 05:00:00'::timestamp without time zone) AND
((origin)::text = 'REGIONAL'::text) AND ("time"(issue_time) =
'05:00:00'::time without time zone) AND ((date(valid_time) =
((date(issue_time) + 1) - 1)) OR (date(valid_time) = (date(issue_time) + 1))
OR ((valid_time >= '2002-09-07 00:00:00'::timestamp without time zone) AND
(valid_time <= '2002-09-07 03:59:59'::timestamp without time zone) AND
(issue_time = '2002-09-06 05:00:00'::timestamp without time zone))))
Total runtime: 21469.485 ms
(10 rows)
PWFPM_DEV=#
--
-Josh Berkus
Aglio Database Solutions
San Francisco
Something that I do not understand is why, if you use valid_time =
'2004-01-22 00:00:00', the query will use the index, but if you use
valid_time > '2004-01-22 00:00:00' it does not use the index.
PWFPM_DEV=# explain analyze select * from forecastelement where valid_time >
date '2004-01-23'::date limit 10;
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------------
Limit (cost=0.00..3.82 rows=10 width=129) (actual
time=199550.388..199550.783 rows=10 loops=1)
-> Seq Scan on forecastelement (cost=0.00..2722898.40 rows=7131102
width=129) (actual time=199550.382..199550.757 rows=10 loops=1)
Filter: (valid_time > '2004-01-23 00:00:00'::timestamp without time
zone)
Total runtime: 199550.871 ms
(4 rows)
PWFPM_DEV=# explain analyze select * from forecastelement where valid_time =
date '2004-01-23'::date limit 10;
QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
--------
Limit (cost=0.00..18.76 rows=10 width=129) (actual time=176.141..276.577
rows=10 loops=1)
-> Index Scan using forecastelement_vrwi_idx on forecastelement
(cost=0.00..160770.98 rows=85707 width=129) (actual time=176.133..276.494
rows=10 loops=1)
Index Cond: (valid_time = '2004-01-23 00:00:00'::timestamp without
time zone)
Total runtime: 276.721 ms
(4 rows)
Dan,

> Something that I do not understand is why, if you use valid_time =
> '2004-01-22 00:00:00', the query will use the index, but if you use
> valid_time > '2004-01-22 00:00:00' it does not use the index.

Because of the expected number of rows to be returned. Take a look at the
row estimates on the forecastelement scans. You can improve these estimates
by increasing the ANALYZE stats and/or running ANALYZE more often. Of
course, increasing the stats makes analyze run slower ...

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Shea,Dan [CIS] wrote on Thu, 22.01.2004 at 22:35:

> Something that I do not understand is why, if you use valid_time =
> '2004-01-22 00:00:00', the query will use the index, but if you use
> valid_time > '2004-01-22 00:00:00' it does not use the index.

It probably can't tell if > is selective enough to justify using the index.

Together with "limit 10" it may be.

You could try

explain analyze select * from forecastelement where valid_time between
'2004-01-22'::date and '2004-01-22'::date limit 10;

to see if this is considered good enough.

--------------
Hannu
Hannu Krosing wrote on Thu, 22.01.2004 at 22:46:

> It probably can't tell if > is selective enough to justify using the index.
>
> Together with "limit 10" it may be.
>
> You could try
>
> explain analyze select * from forecastelement where valid_time between
> '2004-01-22'::date and '2004-01-22'::date limit 10;

Sorry, that should have been:

between '2004-01-22'::date and '2004-01-23'::date

> to see if this is considered good enough.
>
> --------------
> Hannu
This sure sped up the query; it is fast.

PWFPM_DEV=# explain analyze select * from forecastelement where valid_time
between '2004-01-12'::date and '2003-01-12'::date;
                               QUERY PLAN
----------------------------------------------------------------------------
 Index Scan using forecastelement_v_idx on forecastelement
(cost=0.00..159607.11 rows=466094 width=129) (actual time=49.504..49.504
rows=0 loops=1)
   Index Cond: ((valid_time >= '2004-01-12 00:00:00'::timestamp without
time zone) AND (valid_time <= '2003-01-12 00:00:00'::timestamp without
time zone))
 Total runtime: 49.589 ms
(3 rows)

-----Original Message-----
From: Hannu Krosing [mailto:hannu@tm.ee]
Sent: Thursday, January 22, 2004 3:54 PM
To: Shea,Dan [CIS]
Cc: 'josh@agliodbs.com'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] database performance and query performance question

> Sorry, that should have been:
> between '2004-01-22'::date and '2004-01-23'::date
The end date in the previous example was actually invalid: between
'2004-01-12'::date and '2003-01-12'::date.

There have been multiple inserts since I recreated the index, but it took
quite some time to complete the following:

PWFPM_DEV=# explain analyze select * from forecastelement where valid_time
between '2004-01-12'::date and '2004-01-13'::date;
                               QUERY PLAN
----------------------------------------------------------------------------
 Index Scan using forecastelement_v_idx on forecastelement
(cost=0.00..832139.81 rows=2523119 width=129) (actual
time=0.519..467159.658 rows=2940600 loops=1)
   Index Cond: ((valid_time >= '2004-01-12 00:00:00'::timestamp without
time zone) AND (valid_time <= '2004-01-13 00:00:00'::timestamp without
time zone))
 Total runtime: 472627.148 ms
(3 rows)

-----Original Message-----
From: Shea,Dan [CIS]
Sent: Thursday, January 22, 2004 4:10 PM
To: 'Hannu Krosing'; Shea,Dan [CIS]
Cc: 'josh@agliodbs.com'; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] database performance and query performance question

> This sure sped up the query; it is fast.
Dan,

Of course it took forever. You're retrieving 2.9 million rows!

> Index Scan using forecastelement_v_idx on forecastelement
> (cost=0.00..832139.81 rows=2523119 width=129) (actual
> time=0.519..467159.658 rows=2940600 loops=1)

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Shea,Dan [CIS] wrote on Thu, 22.01.2004 at 23:32:

> The end date in the previous example was actually invalid: between
> '2004-01-12'::date and '2003-01-12'::date.
> There have been multiple inserts since I recreated the index, but it took
> quite some time to complete the following:
> PWFPM_DEV=# explain analyze select * from forecastelement where valid_time
> between '2004-01-12'::date and '2004-01-13'::date;

You could try ORDER BY to bias the optimiser towards using an index:

explain analyze select * from forecastelement where valid_time >
'2004-01-12'::date order by valid_time limit 10;

This also may be closer to what you are expecting :)

------------------
Hannu
"Shea,Dan [CIS]" <Dan.Shea@ec.gc.ca> writes:

> Indexes:
>     "forecastelement_vrwi_idx" btree (valid_time, region_id, wx_element, issue_time)
>
> explain analyze
> SELECT DISTINCT ON (valid_time)
>        to_char(valid_time,'YYYYMMDDHH24MISS') AS valid_time,
>        value
> from (
>     SELECT valid_time, value, "time"(valid_time) AS hour, reception_time, issue_time
>     FROM forecastelement
>     WHERE valid_time BETWEEN '2002-09-02 04:00:00' AND '2002-09-07 03:59:59'
>       AND region_id = 'PU-REG-WTO-00200'
>       AND wx_element = 'TEMP_VALEUR1'
>       AND issue_time BETWEEN '2002-09-02 05:00:00' AND '2002-09-06 05:00:00'
>       AND origin = 'REGIONAL'
>       AND "time"(issue_time) = '05:00:00'
>     ORDER BY issue_time, reception_time DESC, valid_time
> ) AS foo
> WHERE ( date(valid_time) = date(issue_time)+1 -1
>      OR date(valid_time) = date(issue_time)+1
>      OR ( valid_time BETWEEN '2002-09-07 00:00:00' AND '2002-09-07 03:59:59'
>           AND issue_time = '2002-09-06 05:00:00' ) )
> ORDER BY valid_time, issue_time DESC;

Incidentally, I find it easier to analyze queries when they've been
formatted well. This makes what's going on much clearer.

From this it's clear your index doesn't match the query. Adding more
columns will be useless because only the leading column "valid_time" will
be used at all. Since you're fetching a whole range of valid_times, the
remaining columns are all irrelevant. They only serve to bloat the index
and require reading a lot more data.

You could either try creating an index just on valid_time, or create an
index on (region_id,wx_element,valid_time) or
(region_id,wx_element,issue_time), whichever is more selective. You could
put wx_element first if it's more selective than region_id.

Moreover, what purpose does the inner ORDER BY clause serve? It's only
going to be re-sorted again by the outer ORDER BY.

--
greg
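As DDL, one of the index shapes Greg suggests might look like the sketch
below. The index name and the column order are assumptions: whether
region_id or wx_element should lead depends on which is more selective on
the real data, which would need checking.

```sql
-- Equality-tested columns first, the range column (valid_time) last, so
-- all three index columns can constrain the scan for this query shape.
CREATE INDEX forecastelement_rwv_idx
    ON forecastelement (region_id, wx_element, valid_time);
```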