Discussion: SELECT running slow on Postgres
Hi,
I have a SELECT statement that runs on a partition having, say, a couple
million rows. The table has indexes on two columns. However, the query
uses non-indexed columns in its WHERE clause too. For example:

SELECT lane_id,measurement_start,
measurement_end,speed,volume,occupancy,quality,effective_date
FROM tss.lane_data_06_08
WHERE lane_id in(select lane_id from lane_info where inactive is null )
AND date_part('hour', measurement_start) between 5 and 23
AND date_part('day',measurement_start)=30
GROUP BY lane_id,measurement_start,measurement_end,speed,volume,occupancy,quality,effective_date
ORDER BY lane_id, measurement_start

Out of these, only lane_id and measurement_start are indexed. This query
will return around 10,000 rows, but it seems to be taking a long time to
execute, which doesn't make sense for a SELECT statement. It also doesn't
make sense to create an index for every field used in the WHERE clause.
Isn't there any way we can improve the performance?

Samantha
On Tue, Jul 1, 2008 at 1:29 PM, samantha mahindrakar
<sam.mahindrakar@gmail.com> wrote:
> Hi
> I have a SELECT statement that runs on a partition having, say, a couple
> million rows.
> The table has indexes on two columns. However, the query uses
> non-indexed columns in its WHERE clause too.
> For example:
> SELECT lane_id,measurement_start,
> measurement_end,speed,volume,occupancy,quality,effective_date
> FROM tss.lane_data_06_08
> WHERE lane_id in(select lane_id from lane_info where inactive is null )
> AND date_part('hour', measurement_start) between 5 and 23
> AND date_part('day',measurement_start)=30
> GROUP BY lane_id,measurement_start,measurement_end,speed,volume,occupancy,quality,effective_date
> ORDER BY lane_id, measurement_start
>
> Out of these, only lane_id and measurement_start are indexed. This query
> will return around 10,000 rows, but it seems to be taking a long time
> to execute, which doesn't make sense for a SELECT statement. It also
> doesn't make sense to create an index for every field used in the
> WHERE clause.
> Isn't there any way we can improve the performance?

I'm guessing that adding an index for either
date_part('hour', measurement_start) or
date_part('day', measurement_start), or both, would help.

What does EXPLAIN ANALYZE SELECT ... (rest of query here) say?
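The expression indexes Scott suggests might look roughly like this (index names are illustrative, not from the thread; this also assumes measurement_start is a plain timestamp, since, as noted later in the thread, date_part() on a timestamptz cannot be indexed directly):

```sql
-- Illustrative expression indexes matching the two date_part()
-- predicates in the WHERE clause; index names are made up.
CREATE INDEX lane_data_06_08_hour_idx
    ON tss.lane_data_06_08 (date_part('hour', measurement_start));

CREATE INDEX lane_data_06_08_day_idx
    ON tss.lane_data_06_08 (date_part('day', measurement_start));
```

For the planner to consider these indexes, the query's predicates must use exactly the same expressions as the index definitions.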
I ran the EXPLAIN ANALYZE. Here is what I got:
"Group (cost=112266.37..112266.40 rows=1 width=56) (actual time=5583.399..5615.476 rows=13373 loops=1)"
" -> Sort (cost=112266.37..112266.38 rows=1 width=56) (actual time=5583.382..5590.890 rows=13373 loops=1)"
" Sort Key: lane_data_07_08.lane_id, lane_data_07_08.measurement_start, lane_data_07_08.measurement_end, lane_data_07_08.speed, lane_data_07_08.volume, lane_data_07_08.occupancy, lane_data_07_08.quality, lane_data_07_08.effective_date"
" -> Nested Loop IN Join (cost=0.00..112266.36 rows=1 width=56) (actual time=1100.307..5547.768 rows=13373 loops=1)"
" -> Seq Scan on lane_data_07_08 (cost=0.00..112241.52 rows=3 width=56) (actual time=1087.666..5341.662 rows=20581 loops=1)"
" Filter: (((volume = 255::double precision) OR (speed = 255::double precision) OR (occupancy = 255::double precision) OR (occupancy >= 100::double precision) OR (volume > 52::double precision) OR (volume < 0::double precision) OR (speed > 120::double precision) OR (speed < 0::double precision)) AND (date_part('hour'::text, measurement_start) >= 5::double precision) AND (date_part('hour'::text, measurement_start) <= 23::double precision) AND (date_part('day'::text, measurement_start) = 1::double precision))"
" -> Index Scan using lane_info_pk on lane_info (cost=0.00..8.27 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=20581)"
" Index Cond: (lane_data_07_08.lane_id = lane_info.lane_id)"
" Filter: (inactive IS NULL)"
"Total runtime: 5621.409 ms"
Well, instead of creating extra indexes (since they eat up a lot of
space), I made use of the whole measurement_start field, so that the
query uses the existing index and makes the search faster.
So I changed the query to include measurement_start as follows:
SELECT lane_id,measurement_start,
measurement_end,speed,volume,occupancy,quality,effective_date
FROM tss.lane_data_06_08
WHERE lane_id in(select lane_id from lane_info where inactive is null )
AND measurement_start between '2008-06-30 05:00:00-04' AND '2008-06-30 23:00:00-04'
GROUP BY lane_id,measurement_start,measurement_end,speed,volume,occupancy,quality,effective_date
ORDER BY lane_id, measurement_start
Samantha
On 7/1/08, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Tue, Jul 1, 2008 at 1:29 PM, samantha mahindrakar
> <sam.mahindrakar@gmail.com> wrote:
> > Hi
> > I have a SELECT statement that runs on a partition having, say, a couple
> > million rows.
> > The table has indexes on two columns. However, the query uses
> > non-indexed columns in its WHERE clause too.
> > For example:
> > SELECT lane_id,measurement_start,
> > measurement_end,speed,volume,occupancy,quality,effective_date
> > FROM tss.lane_data_06_08
> > WHERE lane_id in(select lane_id from lane_info where inactive is null )
> > AND date_part('hour', measurement_start) between 5 and 23
> > AND date_part('day',measurement_start)=30
> > GROUP BY lane_id,measurement_start,measurement_end,speed,volume,occupancy,quality,effective_date
> > ORDER BY lane_id, measurement_start
> >
> > Out of these, only lane_id and measurement_start are indexed. This query
> > will return around 10,000 rows, but it seems to be taking a long time
> > to execute, which doesn't make sense for a SELECT statement. It also
> > doesn't make sense to create an index for every field used in the
> > WHERE clause.
> > Isn't there any way we can improve the performance?
>
> I'm guessing that adding an index for either
> date_part('hour', measurement_start) or
> date_part('day', measurement_start), or both, would help.
>
> What does EXPLAIN ANALYZE SELECT ... (rest of query here) say?
>
On Wed, Jul 2, 2008 at 1:01 PM, samantha mahindrakar
<sam.mahindrakar@gmail.com> wrote:
> I ran the EXPLAIN ANALYZE. Here is what I got:
>
> "Group (cost=112266.37..112266.40 rows=1 width=56) (actual
> time=5583.399..5615.476 rows=13373 loops=1)"
> " -> Sort (cost=112266.37..112266.38 rows=1 width=56) (actual
> time=5583.382..5590.890 rows=13373 loops=1)"
> "   Sort Key: lane_data_07_08.lane_id,
> lane_data_07_08.measurement_start, lane_data_07_08.measurement_end,
> lane_data_07_08.speed, lane_data_07_08.volume, lane_data_07_08.occupancy,
> lane_data_07_08.quality, lane_data_07_08.effective_date"
> "   -> Nested Loop IN Join (cost=0.00..112266.36 rows=1 width=56)
> (actual time=1100.307..5547.768 rows=13373 loops=1)"
> "     -> Seq Scan on lane_data_07_08 (cost=0.00..112241.52 rows=3
> width=56) (actual time=1087.666..5341.662 rows=20581 loops=1)"

You can see here that the seq scan on lane_data is what's eating up all
your time. Also, since the row estimate is WAY off, the planner chose a
nested loop, thinking it would be joining up only 1 row, when it
actually ran across 20k rows.
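One way to attack the bad row estimate Scott points out, sketched here as a suggestion rather than a verified fix, is to raise the statistics targets on the heavily filtered columns and re-ANALYZE so the planner has better selectivity data:

```sql
-- Sketch: collect more detailed per-column statistics (the value 100
-- is illustrative), then refresh the statistics the planner uses.
ALTER TABLE tss.lane_data_07_08 ALTER COLUMN volume    SET STATISTICS 100;
ALTER TABLE tss.lane_data_07_08 ALTER COLUMN speed     SET STATISTICS 100;
ALTER TABLE tss.lane_data_07_08 ALTER COLUMN occupancy SET STATISTICS 100;
ANALYZE tss.lane_data_07_08;
```

The date_part() expressions themselves remain hard to estimate without expression indexes; indexing those expressions, as Scott suggests, also helps here, because ANALYZE gathers statistics on indexed expressions.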
> "       Filter: (((volume = 255::double precision) OR (speed =
> 255::double precision) OR (occupancy = 255::double precision) OR (occupancy
> >= 100::double precision) OR (volume > 52::double precision) OR (volume <
> 0::double precision) OR (speed > 120::double precision) OR (speed <
> 0::double precision)) AND (date_part('hour'::text, measurement_start) >=
> 5::double precision) AND (date_part('hour'::text, measurement_start) <=
> 23::double precision) AND (date_part('day'::text, measurement_start) =
> 1::double precision))"
> "     -> Index Scan using lane_info_pk on
> lane_info (cost=0.00..8.27 rows=1 width=4) (actual time=0.007..0.007 rows=1
> loops=20581)"
> "       Index Cond: (lane_data_07_08.lane_id =
> lane_info.lane_id)"
> "       Filter: (inactive IS NULL)"
> "Total runtime: 5621.409 ms"
>
> Well, instead of creating extra indexes (since they eat up a lot of space), I
> made use of the whole measurement_start field, so that the query uses the
> existing index and makes the search faster.
> So I changed the query to include measurement_start as follows:
>
> SELECT lane_id,measurement_start,
> measurement_end,speed,volume,occupancy,quality,effective_date
> FROM tss.lane_data_06_08
> WHERE lane_id in(select lane_id from lane_info where inactive is null )
> AND measurement_start between '2008-06-30 05:00:00-04' AND '2008-06-30
> 23:00:00-04'
> GROUP BY
> lane_id,measurement_start,measurement_end,speed,volume,occupancy,quality,effective_date
> ORDER BY lane_id, measurement_start

Yeah, any time you can just compare a date / timestamp on an indexed
field you'll do better. If you find yourself needing the other syntax,
so you can, for instance, grab the data for 5 days in a row from 5am to
11am or something, then the method I mentioned of making indexes on
date_part() is a good choice. Note that you need a regular timestamp,
not timestamptz, to create those indexes.
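Scott's closing caveat exists because date_part() on a timestamptz depends on the session's TimeZone setting, so it is only STABLE, not IMMUTABLE, and CREATE INDEX rejects it. If the column really is timestamptz, one common workaround, sketched here as an illustration rather than something from the thread, is to pin a fixed zone inside the indexed expression:

```sql
-- date_part('hour', <timestamptz>) is not IMMUTABLE, so indexing it
-- directly would be rejected. Converting to a plain timestamp at a
-- fixed zone first makes the expression immutable and indexable:
CREATE INDEX lane_data_06_08_hour_utc_idx
    ON tss.lane_data_06_08
    (date_part('hour', measurement_start AT TIME ZONE 'UTC'));
```

Queries must then filter on that identical expression for the index to be used.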