Re: indexing for distinct search in timestamp based table

Поиск

Список

Период

Сортировка

От	H. Hall
Тема	Re: indexing for distinct search in timestamp based table
Дата	28 августа 2008 г. 06:46:20
Msg-id	48B673DE.2030200@reedyriver.com обсуждение исходный текст
Ответ на	indexing for distinct search in timestamp based table ("Rainer Mager" <rainer@vanten.com>)
Список	pgsql-performance

Дерево обсуждения

Rainer Mager wrote:
> I'm looking for some help in speeding up searches. My table is pretty simple
> (see below), but somewhat large, and continuously growing. Currently it has
> about 50 million rows.
>

Regarding your use of  LIKE:
 (1)If you are able to specify the beginning character(s) of the
statement you are searching for, you will have a better chance of your
statement using an index. If you specify a wildcard(%) before the search
string, the entire string in the column must be searched therefore no
index will be used.
(2) Reorder your where clause to reduce the size of the set that LIKE
operates on. In your example below, put the BETWEEN before the LIKE.
(3) Consider the use of trigrams instead of LIKE. I have not used it but
I notice that postgres supports trigrams:

The pg_trgm module provides functions and operators for determining the
similarity of text based on trigram matching, as well as index operator
classes that support fast searching for similar strings.

Here is the link: http://www.postgresql.org/docs/current/static/pgtrgm.html

--cheers
HH

> The table is (I know I have excessive indexes, I'm trying to get the
> appropriate ones and drop the extras):
>                                           Table "public.ad_log"
>     Column    |            Type             |
> Modifiers
> --------------+-----------------------------+-------------------------------
> -----------------------------
>  ad_log_id    | integer                     | not null default
> nextval('ad_log_ad_log_id_seq'::regclass)
>  channel_name | text                        | not null
>  player_name  | text                        | not null
>  ad_name      | text                        | not null
>  start_time   | timestamp without time zone | not null
>  end_time     | timestamp without time zone | not null
> Indexes:
>     "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
>     "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
> ad_name, start_time, end_time)
>     "ad_log_ad_and_start" btree (ad_name, start_time)
>     "ad_log_ad_name" btree (ad_name)
>     "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
>     "ad_log_channel_name" btree (channel_name)
>     "ad_log_end_time" btree (end_time)
>     "ad_log_player_and_start" btree (player_name, start_time)
>     "ad_log_player_name" btree (player_name)
>     "ad_log_start_time" btree (start_time)
>
>
>
> The query I'm trying to speed up is below. In it the <field> tag can be one
> of channel_name, player_name, or ad_name. I'm actually trying to return the
> distinct values and I found GROUP BY to be slightly faster than using
> DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses
> in which case we use '%', but it seems Postgres optimizes that pretty well.
>
> SELECT <field> FROM ad_log
>     WHERE channel_name LIKE :channel_name
>     AND player_name LIKE :player_name
>     AND ad_name LIKE :ad_name
>     AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
>     GROUP BY <field> ORDER BY <field>
>
>
> A typical query is:
>
> explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
> AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
> (date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;
>
> with the result being:
>
> QUERY PLAN
> ----------------------------------------------------------------------------
> ----------------------------------------------------------------------------
> -------
>  Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual
> time=75460.187..75460.192 rows=15 loops=1)
>    Sort Key: channel_name
>    Sort Method:  quicksort  Memory: 17kB
>    ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual
> time=75460.107..75460.114 rows=15 loops=1)
>          ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84
> rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296
> loops=1)
>                Recheck Cond: ((start_time >= '2008-07-01
> 00:00:00'::timestamp without time zone) AND (start_time <=
> '2008-07-29'::date))
>                Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~
> '%'::text))
>                ->  Bitmap Index Scan on ad_log_start_time
> (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443
> rows=13701296 loops=1)
>                      Index Cond: ((start_time >= '2008-07-01
> 00:00:00'::timestamp without time zone) AND (start_time <=
> '2008-07-29'::date))
>  Total runtime: 75460.361 ms
>
>
> It seems to me there should be some way to create an index to speed this up,
> but the various ones I've tried so far haven't helped. Any suggestions would
> be greatly appreciated.
>
>
>


--
H. Hall
ReedyRiver Group LLC
http://www.reedyriver.com

В списке pgsql-performance по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: indexing for distinct search in timestamp based table