near identical queries have vastly different plans

Here's the setup:

I'm cross joining two dimensions before left outer joining to a fact table so that I can throw a default value into the result set wherever a value is missing from the fact table. I have a time dimension and a sensor dimension. I want the cross join to cross only a subset of rows from each dimension, so I have some filters in the ON clause of the inner join.
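Schematically, ignoring the long sensor filter (the full queries are further down) and collapsing the default-value handling into coalesce() purely for illustration, the shape of the query is:

 SELECT t.local_key,
        s.device_name||'_'||s.dpoint_name,
        coalesce(avg(f.dpoint_value), -999999) AS dpoint_value
   FROM dimensions.sensor s
  INNER JOIN dimensions.time_ny t
          ON s.building_id=1                                  -- sensor subset
         AND t.tstamp BETWEEN '2011-06-15' AND '2011-06-29'   -- time subset
   LEFT OUTER JOIN facts.bldg_1_thermal_fact f
           ON f.time_fk=t.local_key
          AND f.sensor_fk=s.sensor_pk
  GROUP BY 1,2
  ORDER BY 1,2;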

I've got two nearly identical queries that perform very differently. The only difference between them is that the fast query filters the inner join on two text columns from the sensor dimension, while the slow query uses bigint id values, where the ids correspond to the text strings used in the fast query. The string columns are device_name and dpoint_name: device_name is unique, but many devices have dpoints with the same name. The bigint columns are device_id and dpoint_id, where both device_id and dpoint_id map to a single row. There are indexes on all of them, so the only real difference is that an index on dpoint_name will have more rows for a given value than the index on dpoint_id, because dpoints with the same name still have different ids if they are attached to different devices. In both queries, exactly 35 rows from the sensor dimension will be returned. Note also that I'm aggregating fact table rows via avg() because I have duplicate rows in the fact table, but I want to extract only a single row for each (time, sensor) pair. The left outer join allows me to populate any missing rows with a default value, and the aggregation via avg() combines duplicate rows so that each appears only once.

I can easily just use the fast query, but my concern is that without understanding why the queries execute differently, I might suddenly discover my code using the slow query plan instead of the fast one at some point in the future, even when using the varchar columns instead of the bigint ids for filtering. They differ in execution speed by about 5x (which translates to many minutes), so that would be a big problem. If I could figure out either a query structure or an index structure that will force the fast query plan, I'd be much happier. So that is what I am looking for: an explanation of how I might convince the planner to always use the fast plan.
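For example (purely illustrative, the index names are made up, and I haven't actually tried this), an index structure that covers the filter pairs directly might look like:

 CREATE INDEX idx_sensor_device_dpoint_name ON dimensions.sensor (device_name, dpoint_name);
 CREATE INDEX idx_sensor_device_dpoint_id   ON dimensions.sensor (device_id, dpoint_id);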

It's a CentOS host: quad-core AMD Opteron 1.6GHz, 2GB of RAM, and a single 75GB disk with everything on it. I'm not looking for outright performance, just relative performance between the two queries. The DB config was taken wholesale from pg_tune with no changes, IIRC. It isn't a production box. I have yet to spec out production hardware for this application, so I don't know what it will eventually be.

# select version();
                                                     version                                                      
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit

             name             |                 current_setting                 
------------------------------+-------------------------------------------------
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 64
 default_statistics_target    | 100
 effective_cache_size         | 1408MB
 lc_collate                   | en_US.UTF-8
 lc_ctype                     | en_US.UTF-8
 log_directory                | pg_log
 log_filename                 | postgresql-%a.log
 log_rotation_age             | 1d
 log_rotation_size            | 0
 log_truncate_on_rotation     | on
 logging_collector            | on
 maintenance_work_mem         | 240MB
 max_connections              | 20
 max_stack_depth              | 2MB
 port                         | 5432
 server_encoding              | UTF8
 shared_buffers               | 480MB
 TimeZone                     | UTC
 wal_buffers                  | 32MB
 work_mem                     | 48MB


The time dimension looks like this:

          Column          |            Type             | Modifiers 
--------------------------+-----------------------------+-----------
 time_zone                | character varying(64)       | 
 tstamp                   | timestamp without time zone | 
 tstamptz                 | timestamp with time zone    | 
 local_key                | bigint                      | 
 utc_key                  | bigint                      | 
Indexes:
    "idx_time_ny_local_key" btree (local_key)
    "idx_time_ny_tstamp" btree (tstamp) CLUSTER
    "idx_time_ny_tstamptz" btree (tstamptz)
    "idx_time_ny_utc_key" btree (utc_key)

plus lots of other columns (approx 25 columns, mostly integer) that aren't relevant to this query. It has 420,480 rows, where each row is 300 seconds after the previous row. local_key and utc_key are bigint columns in the form yyyyMMddHHmm (utc_key in UTC and local_key in the local time zone represented by the table). tstamp is a timestamp without time zone holding the same date and time as local_key, expressed as an actual timestamp in the table's local time zone, and tstamptz is just a convenient column for when I need to do time zone conversions.
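Just to illustrate the key format (nothing more than that), the keys should line up with tstamp like this:

 SELECT tstamp,
        local_key,
        to_char(tstamp, 'YYYYMMDDHH24MI')::bigint AS local_key_derived
   FROM dimensions.time_ny
  LIMIT 5;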

The other dimension looks like this:

    Column     |            Type             |                         Modifiers                          
---------------+-----------------------------+------------------------------------------------------------
 sensor_pk     | bigint                      | not null default nextval('sensor_sensor_pk_seq'::regclass)
 building_id   | bigint                      | 
 device_id     | bigint                      | 
 device_type   | character varying(16)       | 
 device_name   | character varying(64)       | 
 dpoint_id     | bigint                      | 
 dpoint_name   | character varying(64)       | 
Indexes:
    "sensor_pkey" PRIMARY KEY, btree (sensor_pk)
    "idx_sensor_device_id" btree (device_id)
    "idx_sensor_device_name" btree (device_name)
    "idx_sensor_device_type" btree (device_type)
    "idx_sensor_dpoint_id" btree (dpoint_id)
    "idx_sensor_dpoint_name" btree (dpoint_name)

There are other columns, but again, they aren't relevant: about 10 more columns, half bigint and half varchar. Row count is less than 400 rows at the moment, but it will potentially get much larger unless I partition on building_id, which I may well do; in that case, the row count will never exceed 100,000 and will rarely exceed 10,000.

The fact table is as follows:

             Table "facts.bldg_1_thermal_fact"
    Column    |            Type             |   Modifiers   
--------------+-----------------------------+---------------
 time_fk      | bigint                      | not null
 sensor_fk    | bigint                      | not null
 tstamp       | timestamp without time zone | 
 dpoint_value | real                        | 
 device_mode  | bigint                      | 

With actual data in child tables:

         Table "facts.bldg_1_thermal_fact_20110501"
    Column    |            Type             |   Modifiers   
--------------+-----------------------------+---------------
 time_fk      | bigint                      | not null
 sensor_fk    | bigint                      | not null
 tstamp       | timestamp without time zone | 
 dpoint_value | real                        | 
 device_mode  | bigint                      | 
Indexes:
    "idx_bldg_1_thermal_fact_20110501_sensor_fk" btree (sensor_fk)
    "idx_bldg_1_thermal_fact_20110501_time_fk" btree (time_fk) CLUSTER
Check constraints:
    "bldg_1_thermal_fact_20110501_time_fk_check" CHECK (time_fk >= 201105010000::bigint AND time_fk < 201106010000::bigint)
    "bldg_1_thermal_fact_20110501_tstamp_check" CHECK (tstamp >= '2011-05-01 00:00:00'::timestamp without time zone AND tstamp < '2011-06-01 00:00:00'::timestamp without time zone)
Inherits: bldg_1_thermal_fact

One of the two partitions that exist at the moment contains 2 million rows, none of which are relevant to the query in question. The other partition has 4 million rows and contains 100% of the rows returned by the query.
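For reference, a rough way to confirm that distribution from the catalog (approximate planner row counts; illustrative only):

 SELECT c.relname, c.reltuples::bigint AS approx_rows
   FROM pg_class c
   JOIN pg_inherits i ON i.inhrelid = c.oid
  WHERE i.inhparent = 'facts.bldg_1_thermal_fact'::regclass;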

The fast query is as follows:

 SELECT t.local_key,
        s.device_name||'_'||s.dpoint_name,
        CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
             ELSE avg(f.dpoint_value)::numeric(10,2)
        END as dpoint_value
   FROM dimensions.sensor s
  INNER JOIN dimensions.time_ny t
          ON s.building_id=1
         AND ((s.device_name='AHU_1M02' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_01' AND s.dpoint_name='EffectSetPt')OR(s.device_name='VAV_1M_01' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_02' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_02' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_03' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_03' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_04' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_04' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_05' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_05' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_06' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_06' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_07' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_07' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_08' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_08' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_09' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_09' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_10' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_10' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_11' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_11' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_12' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_12' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_13' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_13' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_14' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_14' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_15' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_15' AND s.dpoint_name='SpaceTemp'))
         AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
  LEFT OUTER JOIN facts.bldg_1_thermal_fact f
          ON f.time_fk=t.local_key
         AND f.sensor_fk=s.sensor_pk
  GROUP BY 1,2
  ORDER BY 1,2

Note: as complicated as that big filter on the inner join is, it will never result in more than a few tens of rows being selected.
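One alternative way to express that filter, in case the structure matters for the plan (untested, and the VALUES list below is truncated to the first few pairs), is to join the sensor dimension against an inline VALUES list instead of the long OR chain:

 SELECT s.sensor_pk, s.device_name, s.dpoint_name
   FROM dimensions.sensor s
   JOIN (VALUES ('AHU_1M02','SpaceTemp'),
                ('VAV_1M_01','EffectSetPt'),
                ('VAV_1M_01','SpaceTemp')
                -- ...and so on for the remaining pairs
        ) AS wanted(device_name, dpoint_name)
     ON s.device_name = wanted.device_name
    AND s.dpoint_name = wanted.dpoint_name
  WHERE s.building_id=1;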

Fast explain is here: 

The slow query is as follows:

 SELECT t.local_key,
         s.device_name||'_'||s.dpoint_name,
         CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
              ELSE avg(f.dpoint_value)::numeric(10,2)
         END as dpoint_value
    FROM dimensions.sensor s
   INNER JOIN dimensions.time_ny t
           ON s.building_id=1
          AND ((s.device_id=33 AND s.dpoint_id=183) OR (s.device_id=33 AND s.dpoint_id=184) OR (s.device_id=34 AND s.dpoint_id=187) OR (s.device_id=34 AND s.dpoint_id=188) OR (s.device_id=35 AND s.dpoint_id=191) OR (s.device_id=35 AND s.dpoint_id=192) OR (s.device_id=36 AND s.dpoint_id=195) OR (s.device_id=36 AND s.dpoint_id=196) OR (s.device_id=77 AND s.dpoint_id=364) OR (s.device_id=20 AND s.dpoint_id=131) OR (s.device_id=20 AND s.dpoint_id=132) OR (s.device_id=21 AND s.dpoint_id=135) OR (s.device_id=21 AND s.dpoint_id=136) OR (s.device_id=22 AND s.dpoint_id=139) OR (s.device_id=22 AND s.dpoint_id=140) OR (s.device_id=30 AND s.dpoint_id=171) OR (s.device_id=30 AND s.dpoint_id=172) OR (s.device_id=23 AND s.dpoint_id=143) OR (s.device_id=23 AND s.dpoint_id=144) OR (s.device_id=24 AND s.dpoint_id=147) OR (s.device_id=24 AND s.dpoint_id=148) OR (s.device_id=25 AND s.dpoint_id=151) OR (s.device_id=25 AND s.dpoint_id=152) OR (s.device_id=26 AND s.dpoint_id=155) OR (s.device_id=26 AND s.dpoint_id=156) OR (s.device_id=27 AND s.dpoint_id=159) OR (s.device_id=27 AND s.dpoint_id=160) OR (s.device_id=28 AND s.dpoint_id=163) OR (s.device_id=28 AND s.dpoint_id=164) OR (s.device_id=29 AND s.dpoint_id=167) OR (s.device_id=29 AND s.dpoint_id=168) OR (s.device_id=31 AND s.dpoint_id=175) OR (s.device_id=31 AND s.dpoint_id=176) OR (s.device_id=32 AND s.dpoint_id=179) OR (s.device_id=32 AND s.dpoint_id=180))
         AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
  LEFT OUTER JOIN facts.bldg_1_thermal_fact f
          ON f.time_fk=t.local_key
         AND f.sensor_fk=s.sensor_pk 
  GROUP BY 1,2 
  ORDER BY 1,2 

Slow explain is here:

I tried rewriting the slow query such that the cross join is expressed as a subquery left outer joined to the fact table, but it had no effect. The query plan was identical.
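For reference, that rewrite was roughly of this shape (sensor filter abbreviated to just building_id here):

 SELECT d.local_key,
        d.label,
        CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
             ELSE avg(f.dpoint_value)::numeric(10,2)
        END as dpoint_value
   FROM (SELECT t.local_key,
                s.device_name||'_'||s.dpoint_name AS label,
                s.sensor_pk
           FROM dimensions.sensor s
          INNER JOIN dimensions.time_ny t
                  ON s.building_id=1
                 AND t.tstamp BETWEEN '2011-06-15' AND '2011-06-29') d
   LEFT OUTER JOIN facts.bldg_1_thermal_fact f
           ON f.time_fk=d.local_key
          AND f.sensor_fk=d.sensor_pk
  GROUP BY 1,2
  ORDER BY 1,2;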

The query plan also doesn't change if I replace the 2nd column with s.sensor_pk (which maps one-to-one to the string I am constructing, but which does have an index). This query is actually being used as the first query in a call to crosstab(text, text), and that second column is just the pivot column. It doesn't matter to me whether it is a string or the id, since those labels wind up being rewritten as column names of the function result (select * from crosstab(...) as q(local_key bigint, column1 real, column2 real, ...)).
