Using an index for IS DISTINCT FROM queries

Поиск

Список

Период

Сортировка

От	Steven Schlansker
Тема	Using an index for IS DISTINCT FROM queries
Дата	22 апреля 2013 г. 20:37:33
Msg-id	AEBD8B4A-D4DF-45C2-8B2D-7BBDA1C5B69F@likeness.com обсуждение исходный текст
Список	pgsql-general

Дерево обсуждения

Hi everyone,
I have a large table (~150M rows) that keeps a version field.  At any given time, it is expected that the vast majority
ofthe rows are on the "current" version, but some may be missing. 

To figure out all the missing our outdated values, I run a query along the lines of

SELECT id FROM source_table LEFT OUTER JOIN dest_table WHERE version IS DISTINCT FROM <current-version>

However, this query always selects a sequential scan and hash of both tables, which is *very* slow.

The statistics reflect that the planner knows that current-version is overwhelmingly common:

 schemaname | tablename |     attname      | inherited | null_frac | avg_width | n_distinct | most_common_vals |
most_common_freqs| histogram_bounds | correlation 

------------+-----------+------------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public     | event     | unpacker_version | f         |         0 |         4 |          1 | {1}              | {1}
          |                  |           1 

but it doesn't help:

event=> explain select count(1) from event where unpacker_version is distinct from 1;
                            QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=10658841.93..10658841.94 rows=1 width=0)
   ->  Seq Scan on event  (cost=0.00..10658841.93 rows=1 width=0)
         Filter: (unpacker_version IS DISTINCT FROM 1)
(3 rows)


I can "by hand" force the planner to consider the obvious solution (find values below, above, and null) but it is ugly:

event=> explain select count(1) from event where unpacker_version < 1 or unpacker_version > 1 or unpacker_version is
null;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Aggregate  (cost=139.14..139.15 rows=1 width=0)
   ->  Bitmap Heap Scan on event  (cost=135.13..139.14 rows=1 width=0)
         Recheck Cond: ((unpacker_version < 1) OR (unpacker_version > 1) OR (unpacker_version IS NULL))
         ->  BitmapOr  (cost=135.13..135.13 rows=1 width=0)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version < 1)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version > 1)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version IS NULL)
(10 rows)

but this sucks to do such tuning by munging the query.  Is there some case where these are not equivalent?  If they are
equivalent,would this be a reasonable case for the planner to consider and optimize by itself? 

Thanks,
Steven

В списке pgsql-general по дате отправления:

Предыдущее

От: hamann.w@t-online.de
Дата: 22 апреля 2013 г., 20:26:47
Сообщение: how can this get faster

Следующее

От: Christophe Pettus
Дата: 22 апреля 2013 г., 20:41:23
Сообщение: Re: Need solution for weekly database "snapshot"

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Using an index for IS DISTINCT FROM queries

Предыдущее

Следующее