Discussion: Index isn't used during a join.

From:
Robert Creager
Date:


Hey folks,

I'm working with a query to get more info out with a join.  The base query works great speed-wise because of index
usage.  When the join is tossed in, the index is no longer used, so the query performance tanks.

Can anyone advise on how to get the index usage back?

weather=# select version();
                                                        version
-----------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.1.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.0.1 (4.0.1-5mdk for Mandriva Linux release 2006.0)
(1 row)

The base query is:

weather=# EXPLAIN ANALYZE
weather-# SELECT min_reading, max_reading, avg_reading, -- doy,
weather-#        unmunge_time( time_group ) AS time
weather-# FROM minute."windspeed"
weather-# --JOIN readings_doy ON EXTRACT( doy FROM unmunge_time( time_group ) ) = doy
weather-# WHERE unmunge_time( time_group ) > ( now() - '24 hour'::interval )
weather-# ORDER BY time_group;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10995.29..11155.58 rows=64117 width=28) (actual time=4.509..4.574 rows=285 loops=1)
   Sort Key: time_group
   ->  Bitmap Heap Scan on windspeed  (cost=402.42..5876.05 rows=64117 width=28) (actual time=0.784..3.639 rows=285 loops=1)
         Recheck Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
         ->  Bitmap Index Scan on minute_windspeed_index  (cost=0.00..402.42 rows=64117 width=0) (actual time=0.675..0.675 rows=285 loops=1)
               Index Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
 Total runtime: 4.880 ms
(7 rows)

When I add in the join, the query tosses out the nice quick index scan in favor of sequential scans:

weather=# EXPLAIN ANALYZE
weather-# SELECT min_reading, max_reading, avg_reading, -- doy,
weather-#        unmunge_time( time_group ) AS time
weather-# FROM minute."windspeed"
weather-# JOIN readings_doy ON EXTRACT( doy FROM unmunge_time( time_group ) ) = doy
weather-# WHERE unmunge_time( time_group ) > ( now() - '24 hour'::interval )
weather-# ORDER BY time_group;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=98239590.88..99052623.66 rows=325213113 width=28) (actual time=60136.484..61079.845 rows=1030656 loops=1)
   Sort Key: windspeed.time_group
   ->  Merge Join  (cost=265774.21..8396903.54 rows=325213113 width=28) (actual time=34318.334..47113.277 rows=1030656 loops=1)
         Merge Cond: ("outer"."?column5?" = "inner"."?column2?")
         ->  Sort  (cost=12997.68..13157.98 rows=64120 width=28) (actual time=2286.155..2286.450 rows=284 loops=1)
               Sort Key: date_part('doy'::text, unmunge_time(windspeed.time_group))
               ->  Seq Scan on windspeed  (cost=0.00..7878.18 rows=64120 width=28) (actual time=2279.275..2285.271 rows=284 loops=1)
                     Filter: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
         ->  Sort  (cost=252776.54..255312.51 rows=1014389 width=8) (actual time=32001.370..33473.407 rows=1051395 loops=1)
               Sort Key: date_part('doy'::text, readings."when")
               ->  Seq Scan on readings  (cost=0.00..142650.89 rows=1014389 width=8) (actual time=0.053..13759.015 rows=1014448 loops=1)
 Total runtime: 61720.935 ms
(12 rows)

weather=# \d minute.windspeed
          Table "minute.windspeed"
   Column    |       Type       | Modifiers
-------------+------------------+-----------
 time_group  | integer          | not null
 min_reading | double precision | not null
 max_reading | double precision | not null
 avg_reading | double precision | not null
Indexes:
    "windspeed_pkey" PRIMARY KEY, btree (time_group)
    "minute_windspeed_index" btree (unmunge_time(time_group))

CREATE OR REPLACE FUNCTION unmunge_time( integer )
RETURNS timestamp AS '
DECLARE
   input ALIAS FOR $1;
BEGIN
   RETURN (''epoch''::timestamptz + input * ''1sec''::interval)::timestamp;
END;
' LANGUAGE plpgsql IMMUTABLE STRICT;
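[For readers unfamiliar with the schema: time_group is plain Unix epoch seconds, so unmunge_time is just an epoch-to-timestamp conversion.  A rough Python sketch of the same conversion, with a hypothetical munge_time inverse added for illustration; note the SQL version renders the result in the session time zone, while this sketch stays in UTC for determinism:]

```python
from datetime import datetime, timezone

def unmunge_time(time_group: int) -> datetime:
    """Sketch of the plpgsql unmunge_time(): interpret the integer as
    seconds since the Unix epoch and return a naive timestamp (UTC here;
    the SQL version uses the session time zone)."""
    return datetime.fromtimestamp(time_group, tz=timezone.utc).replace(tzinfo=None)

def munge_time(ts: datetime) -> int:
    """Hypothetical inverse (not shown in the thread): timestamp back to
    epoch seconds, treating the naive timestamp as UTC."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp())
```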

weather=# \d readings
                                             Table "public.readings"
        Column        |            Type             |                          Modifiers
----------------------+-----------------------------+-------------------------------------------------------------
 when                 | timestamp without time zone | not null default (timeofday())::timestamp without time zone
 hour_group           | integer                     |
 minute_group         | integer                     |
 day_group            | integer                     |
 week_group           | integer                     |
 month_group          | integer                     |
 year_group           | integer                     |
 year_group_updated   | boolean                     | default false
 month_group_updated  | boolean                     | default false
 week_group_updated   | boolean                     | default false
 day_group_updated    | boolean                     | default false
 hour_group_updated   | boolean                     | default false
 minute_group_updated | boolean                     | default false
Indexes:
    "readings_pkey" PRIMARY KEY, btree ("when")
    "day_group_updated_index" btree (day_group_updated, day_group)
    "hour_group_updated_index" btree (hour_group_updated, hour_group)
    "month_group_updated_index" btree (month_group_updated, month_group)
    "readings_doy_index" btree (date_part('doy'::text, "when"))
    "week_group_updated_index" btree (week_group_updated, week_group)
    "year_group_updated_index" btree (year_group_updated, year_group)
Triggers:
    munge_time BEFORE INSERT OR UPDATE ON readings FOR EACH ROW EXECUTE PROCEDURE munge_time()

readings_doy is a view that adds date_part('doy'::text, readings."when") AS doy to the readings table.

Thanks,
Rob

--
 21:15:51 up 2 days, 13:42,  9 users,  load average: 3.14, 2.63, 2.62
Linux 2.6.12-12-2 #4 SMP Tue Jan 3 19:56:19 MST 2006

From:
Michael Fuhr
Date:

On Mon, Jan 09, 2006 at 09:23:38PM -0700, Robert Creager wrote:
> I'm working with a query to get more info out with a join.  The base
> query works great speed wise because of index usage.  When the join is
> tossed in, the index is no longer used, so the query performance tanks.

The first query you posted returns 285 rows and the second returns
over one million; index usage aside, that difference surely accounts
for a performance penalty.  And as is often pointed out, index scans
aren't always faster than sequential scans: the more of a table a
query has to fetch, the more likely a sequential scan will be faster.
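[This tradeoff can be sketched with a toy cost model; the page-cost numbers are loosely inspired by the planner's seq_page_cost/random_page_cost defaults, but this is an illustration, not the real planner arithmetic:]

```python
def scan_cost(rows_total, rows_matched,
              seq_page_cost=1.0, random_page_cost=4.0, rows_per_page=100):
    """Toy comparison: a sequential scan pays for every page once,
    while a (pessimistic) index scan pays a random page fetch per
    matching row.  Returns (seq_cost, index_cost)."""
    pages = rows_total / rows_per_page
    seq_scan = pages * seq_page_cost
    index_scan = rows_matched * random_page_cost
    return seq_scan, index_scan
```

With few matching rows the index scan wins; fetch a large fraction of the table and the sequential scan is cheaper.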

Have the tables been vacuumed and analyzed?  The planner's estimates
for windspeed are pretty far off, which could be affecting the query
plan:

>         ->  Sort  (cost=12997.68..13157.98 rows=64120 width=28) (actual time=2286.155..2286.450 rows=284 loops=1)
>               Sort Key: date_part('doy'::text, unmunge_time(windspeed.time_group))
>               ->  Seq Scan on windspeed  (cost=0.00..7878.18 rows=64120 width=28) (actual time=2279.275..2285.271 rows=284 loops=1)
>                     Filter: (unmunge_time(time_group) > (now() - '24:00:00'::interval))

That's a small amount of the total query time, however, so although
an index scan might help it probably won't provide the big gain
you're looking for.

Have you done any tests with enable_seqscan disabled?  That'll show
whether an index or bitmap scan would be faster.  And have you
verified that the join condition is correct?  Should the query be
returning over a million rows?

--
Michael Fuhr

From:
Robert Creager
Date:

When grilled further on (Mon, 9 Jan 2006 22:58:18 -0700),
Michael Fuhr <> confessed:

> On Mon, Jan 09, 2006 at 09:23:38PM -0700, Robert Creager wrote:
> > I'm working with a query to get more info out with a join.  The base
> > query works great speed wise because of index usage.  When the join is
> > tossed in, the index is no longer used, so the query performance tanks.
>
> The first query you posted returns 285 rows and the second returns
> over one million; index usage aside, that difference surely accounts
> for a performance penalty.  And as is often pointed out, index scans
> aren't always faster than sequential scans: the more of a table a
> query has to fetch, the more likely a sequential scan will be faster.

Thanks for pointing out the obvious that I missed.  Too much data in the second query.  It's supposed to match (row-wise) what was returned from the first query.

Just ignore me for now...

Thanks,
Rob

--
 08:15:24 up 3 days, 42 min,  9 users,  load average: 2.07, 2.20, 2.25
Linux 2.6.12-12-2 #4 SMP Tue Jan 3 19:56:19 MST 2006

From:
Robert Creager
Date:

Ok, I'm back, and in a little better shape.

The query is now correct, but still slow because of lack of index usage.  I don't know how to structure the query
correctly to use the index.

Taken individually:

weather=# explain analyze select * from doy_agg where doy = extract( doy from now() );
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13750.67..13750.71 rows=2 width=20) (actual time=123.134..123.135 rows=1 loops=1)
   ->  Bitmap Heap Scan on readings  (cost=25.87..13720.96 rows=3962 width=20) (actual time=6.384..116.559 rows=4175 loops=1)
         Recheck Cond: (date_part('doy'::text, "when") = date_part('doy'::text, now()))
         ->  Bitmap Index Scan on readings_doy_index  (cost=0.00..25.87 rows=3962 width=0) (actual time=5.282..5.282 rows=4215 loops=1)
               Index Cond: (date_part('doy'::text, "when") = date_part('doy'::text, now()))
 Total runtime: 123.366 ms

produces the data:

weather=# select * from doy_agg where doy = extract( doy from now() );
 doy |  avg_windspeed   | max_windspeed
-----+------------------+---------------
  10 | 8.53403056583666 |            59

and:

weather=# EXPLAIN ANALYZE
weather-# SELECT *,
weather-# unmunge_time( time_group ) AS time
weather-# FROM minute."windspeed"
weather-# WHERE unmunge_time( time_group ) > ( now() - '24 hour'::interval )
weather-# ORDER BY time_group;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=595.33..595.77 rows=176 width=28) (actual time=4.762..4.828 rows=283 loops=1)
   Sort Key: time_group
   ->  Bitmap Heap Scan on windspeed  (cost=2.62..588.76 rows=176 width=28) (actual time=0.901..3.834 rows=283 loops=1)
         Recheck Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
         ->  Bitmap Index Scan on minute_windspeed_unmunge_index  (cost=0.00..2.62 rows=176 width=0) (actual time=0.745..0.745 rows=284 loops=1)
               Index Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
 Total runtime: 5.108 ms

produces:

 time_group |    min_reading    | max_reading |    avg_reading    |        time
------------+-------------------+-------------+-------------------+---------------------
 1136869500 |               0.8 |           6 |  2.62193548387097 | 2006-01-09 22:05:00
 1136869800 |                 0 |           3 | 0.406021505376343 | 2006-01-09 22:10:00
 1136870100 |                 0 |           5 |              1.68 | 2006-01-09 22:15:00
...

But I want the composite of the two queries, and I'm stuck on:

weather=# EXPLAIN ANALYZE
weather-# SELECT *,
weather-# unmunge_time( time_group ) AS time
weather-# FROM minute."windspeed"
weather-# JOIN doy_agg ON( EXTRACT( doy FROM unmunge_time( time_group ) ) = doy )
weather-# WHERE unmunge_time( time_group ) > ( now() - '24 hour'::interval )
weather-# ORDER BY time_group;
                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=153627.67..153628.48 rows=322 width=48) (actual time=10637.681..10637.748 rows=286 loops=1)
   Sort Key: windspeed.time_group
   ->  Merge Join  (cost=153604.82..153614.26 rows=322 width=48) (actual time=10633.375..10636.728 rows=286 loops=1)
         Merge Cond: ("outer"."?column5?" = "inner".doy)
         ->  Sort  (cost=594.89..595.33 rows=176 width=28) (actual time=5.539..5.612 rows=286 loops=1)
               Sort Key: date_part('doy'::text, unmunge_time(windspeed.time_group))
               ->  Bitmap Heap Scan on windspeed  (cost=2.62..588.32 rows=176 width=28) (actual time=0.918..4.637 rows=286 loops=1)
                     Recheck Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
                      ->  Bitmap Index Scan on minute_windspeed_unmunge_index  (cost=0.00..2.62 rows=176 width=0) (actual time=0.739..0.739 rows=287 loops=1)
                           Index Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
         ->  Sort  (cost=153009.93..153010.84 rows=366 width=20) (actual time=10627.699..10627.788 rows=295 loops=1)
               Sort Key: doy_agg.doy
               ->  HashAggregate  (cost=152984.28..152990.69 rows=366 width=20) (actual time=10625.649..10626.601 rows=366 loops=1)
                      ->  Seq Scan on readings  (cost=0.00..145364.93 rows=1015914 width=20) (actual time=0.079..8901.123 rows=1015917 loops=1)
 Total runtime: 10638.298 ms

Where:

weather=# \d doy_agg
            View "public.doy_agg"
    Column     |       Type       | Modifiers
---------------+------------------+-----------
 doy           | double precision |
 avg_windspeed | double precision |
 max_windspeed | integer          |
View definition:
 SELECT doy_readings.doy, avg(doy_readings.windspeedaverage1) AS avg_windspeed, max(doy_readings.windspeedmax1) AS max_windspeed
   FROM ONLY doy_readings
  GROUP BY doy_readings.doy;

which I don't want because of the full scan on readings.

I can easily do the two queries separately in the script utilizing this data, but want to do it in the db itself.  I
figure I'm just not seeing how to combine the two queries effectively.

Thoughts?

Thanks,
Rob

--
 22:08:50 up 3 days, 14:35,  9 users,  load average: 2.71, 2.48, 2.51
Linux 2.6.12-12-2 #4 SMP Tue Jan 3 19:56:19 MST 2006

From:
Michael Fuhr
Date:

On Tue, Jan 10, 2006 at 10:10:55PM -0700, Robert Creager wrote:
> The query is now correct, but still is slow because of lack of
> index usage.  I don't know how to structure the query correctly to
> use the index.

Have you tried adding restrictions on doy in the WHERE clause?
Something like this, I think:

WHERE ...
  AND doy >= EXTRACT(doy FROM now() - '24 hour'::interval)
  AND doy <= EXTRACT(doy FROM now())

Something else occurred to me: do you (or will you) have more than
one year of data?  If so then matching on doy could be problematic
unless you also check for the year, or unless you want to match
more than one year.

--
Michael Fuhr

From:
Michael Fuhr
Date:

On Wed, Jan 11, 2006 at 12:56:55AM -0700, Michael Fuhr wrote:
> WHERE ...
>   AND doy >= EXTRACT(doy FROM now() - '24 hour'::interval)
>   AND doy <= EXTRACT(doy FROM now())

To work on 1 Jan this should be more like

WHERE ...
  AND (doy = EXTRACT(doy FROM now() - '24 hour'::interval) OR
       doy = EXTRACT(doy FROM now()))

In any case the point is to add conditions to the WHERE clause that
will use an index on the table for which you're currently getting
a sequential scan.

--
Michael Fuhr

From:
Robert Creager
Date:

When grilled further on (Wed, 11 Jan 2006 00:56:55 -0700),
Michael Fuhr <> confessed:

> On Tue, Jan 10, 2006 at 10:10:55PM -0700, Robert Creager wrote:
> > The query is now correct, but still is slow because of lack of
> > index usage.  I don't know how to structure the query correctly to
> > use the index.
>
> Have you tried adding restrictions on doy in the WHERE clause?
> Something like this, I think:

I cannot.  That's what I thought I would get from the join.  The query shown will always have two days involved, and
only grows from there.  The data is graphed at http://www.logicalchaos.org/weather/index.html, and I'm looking at adding
historical data to the graphs.

Oops, never mind.  You hit the nail on the head:

weather-# SELECT *, unmunge_time( time_group ) AS time,
weather-# EXTRACT( doy FROM unmunge_time( time_group ) )
weather-# FROM minute."windspeed"
weather-# JOIN doy_agg ON( EXTRACT( doy FROM unmunge_time( time_group ) ) = doy )
weather-# WHERE unmunge_time( time_group ) > ( now() - '24 hour'::interval )
weather-# AND doy BETWEEN EXTRACT( doy FROM now() - '24 hour'::interval)
weather-#             AND EXTRACT( doy FROM now() )
weather-# ORDER BY time_group;
                                                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=21914.09..21914.10 rows=1 width=48) (actual time=76.595..76.662 rows=286 loops=1)
   Sort Key: windspeed.time_group
   ->  Hash Join  (cost=21648.19..21914.08 rows=1 width=48) (actual time=64.656..75.562 rows=286 loops=1)
         Hash Cond: (date_part('doy'::text, unmunge_time("outer".time_group)) = "inner".doy)
         ->  Bitmap Heap Scan on windspeed  (cost=2.27..267.40 rows=74 width=28) (actual time=0.585..1.111 rows=286 loops=1)
               Recheck Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
               ->  Bitmap Index Scan on minute_windspeed_unmunge_index  (cost=0.00..2.27 rows=74 width=0) (actual time=0.566..0.566 rows=287 loops=1)
                     Index Cond: (unmunge_time(time_group) > (now() - '24:00:00'::interval))
         ->  Hash  (cost=21645.92..21645.92 rows=3 width=20) (actual time=63.849..63.849 rows=2 loops=1)
               ->  HashAggregate  (cost=21645.84..21645.89 rows=3 width=20) (actual time=63.832..63.834 rows=2 loops=1)
                      ->  Bitmap Heap Scan on readings  (cost=59.21..21596.85 rows=6532 width=20) (actual time=15.174..53.249 rows=7613 loops=1)
                            Recheck Cond: ((date_part('doy'::text, "when") >= date_part('doy'::text, (now() - '24:00:00'::interval))) AND (date_part('doy'::text, "when") <= date_part('doy'::text, now())))
                            ->  Bitmap Index Scan on readings_doy_index  (cost=0.00..59.21 rows=6532 width=0) (actual time=12.509..12.509 rows=10530 loops=1)
                                  Index Cond: ((date_part('doy'::text, "when") >= date_part('doy'::text, (now() - '24:00:00'::interval))) AND (date_part('doy'::text, "when") <= date_part('doy'::text, now())))
 Total runtime: 77.177 ms

What I had thought is that PG would (could?) be smart enough to realize that one query was restricted, and apply that
restriction to the other based on the join.  I know it works in other cases (using indexes on both tables via the
join)...

>
> Something else occurred to me: do you (or will you) have more than
> one year of data?  If so then matching on doy could be problematic
> unless you also check for the year, or unless you want to match
> more than one year.

Yes and yes.  I'm doing both an aggregate by day of the year for all data, and an aggregate by day of year within each year.
The examples are:

weather=# select * from doy_agg where doy = extract( doy from now() );
 doy |  avg_windspeed   | max_windspeed
-----+------------------+---------------
  11 | 6.14058239764748 |            69
(1 row)

weather=# select * from doy_day_agg where extract( doy from day ) = extract( doy from now() );
         day         |  avg_windspeed   | max_windspeed
---------------------+------------------+---------------
 2004-01-11 00:00:00 | 5.03991313397539 |            17
 2006-01-11 00:00:00 |  18.532050716667 |            69
 2005-01-11 00:00:00 |  3.6106763448041 |            13

Thanks for your help Michael.

Cheers,
Rob

--
 07:07:30 up 3 days, 23:34,  9 users,  load average: 2.29, 2.44, 2.43
Linux 2.6.12-12-2 #4 SMP Tue Jan 3 19:56:19 MST 2006

From:
Robert Creager
Date:

When grilled further on (Wed, 11 Jan 2006 07:26:59 -0700),
Robert Creager <> confessed:

>
> weather-# SELECT *, unmunge_time( time_group ) AS time,
> weather-# EXTRACT( doy FROM unmunge_time( time_group ) )
> weather-# FROM minute."windspeed"
> weather-# JOIN doy_agg ON( EXTRACT( doy FROM unmunge_time( time_group ) ) = doy )
> weather-# WHERE unmunge_time( time_group ) > ( now() - '24 hour'::interval )
> weather-# AND doy BETWEEN EXTRACT( doy FROM now() - '24 hour'::interval)
> weather-#             AND EXTRACT( doy FROM now() )
> weather-# ORDER BY time_group;

The more I think about it, the more I believe PG is missing an opportunity.  The query is adequately constrained
without the BETWEEN clause.  Why doesn't PG see that?  I realize I'm a hack and my db organization shows that...

The query is wrong as stated, as it won't work when the interval crosses a year boundary, but it's a stop gap for now.

Cheers,
Rob

--
 07:58:30 up 4 days, 25 min,  9 users,  load average: 2.13, 2.15, 2.22
Linux 2.6.12-12-2 #4 SMP Tue Jan 3 19:56:19 MST 2006

From:
Tom Lane
Date:

Robert Creager <> writes:
> What I had thought is that PG would (could?) be smart enough to realize that
> one query was restricted, and apply that restriction to the other based on
> the join.  I know it works in other cases (using indexes on both tables using
> the join)...

The planner understands about transitivity of equality, ie given a = b
and b = c it can infer a = c.  It doesn't do any such thing for
inequalities though, nor does it deduce f(a) = f(b) for arbitrary
functions f.  The addition Michael suggested requires much more
understanding of the properties of the functions in your query than
I think would be reasonable to put into the planner.

            regards, tom lane
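[Tom's point about transitivity can be illustrated with a toy union-find over expression strings; this is purely illustrative, not planner code.  Given a = b and b = c the planner merges them into one equivalence class, but f(a) and f(b) remain distinct opaque expressions unless something proves them equal:]

```python
class EquivClasses:
    """Minimal union-find: merges expressions known equal and answers
    whether two expressions fall in the same equivalence class."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def equal(self, a, b):
        return self.find(a) == self.find(b)
```

With union("a", "b") and union("b", "c"), equal("a", "c") holds by transitivity, but "f(a)" and "f(b)" stay unrelated: nothing merges them.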

From:
Michael Fuhr
Date:

On Wed, Jan 11, 2006 at 08:02:37AM -0700, Robert Creager wrote:
> The query is wrong as stated, as it won't work when the interval
> crosses a year boundary, but it's a stop gap for now.

Yeah, I realized that shortly after I posted the original and posted
a correction.

http://archives.postgresql.org/pgsql-performance/2006-01/msg00104.php

--
Michael Fuhr

From:
Robert Creager
Date:

When grilled further on (Wed, 11 Jan 2006 10:33:03 -0500),
Tom Lane <> confessed:

> The planner understands about transitivity of equality, ie given a = b
> and b = c it can infer a = c.  It doesn't do any such thing for
> inequalities though, nor does it deduce f(a) = f(b) for arbitrary
> functions f.  The addition Michael suggested requires much more
> understanding of the properties of the functions in your query than
> I think would be reasonable to put into the planner.
>

OK.  I think I've reached a point where I need to re-organize how the data is stored,
maybe ridding myself of the schema and switching entirely to views.  At that
point, I likely could rid myself of the function (unmunge_time) I'm using, and
work with times and doy fields.

Thanks,
Rob

--
 21:17:00 up 4 days, 13:43,  9 users,  load average: 2.02, 2.18, 2.23
Linux 2.6.12-12-2 #4 SMP Tue Jan 3 19:56:19 MST 2006