Re: Patch: pg_trgm: gin index scan performance for similarity search

Поиск
Список
Период
Сортировка
От Teodor Sigaev
Тема Re: Patch: pg_trgm: gin index scan performance for similarity search
Дата
Msg-id 567D1624.6090702@sigaev.ru
обсуждение исходный текст
Ответ на Patch: pg_trgm: gin index scan performance for similarity search  (Fornaroli Christophe <cfornaro@gmail.com>)
Ответы Re: Patch: pg_trgm: gin index scan performance for similarity search  (Fornaroli Christophe <cfornaro@gmail.com>)
Список pgsql-hackers
Good catch, committed.

Fornaroli Christophe wrote:
> Hi,
>
> I think that we can improve the gin index scan performance for similarity search
> defined in the pg_trgm extension. The similarity function is (for the default
> case where DIVUNION is defined in the code):
>
>          count / (len1 + len2 - count) >= trgm_limit
>
> where
>    len1 is the number of unique trigrams for the first string,
>    len2 is the same number for the second string,
>    count is the number of common trigrams between both strings,
>    trgm_limit is a user specfied limit in [0, 1].
>
> The code used to determine if a tuple may match the query string is:
>
>          case SimilarityStrategyNumber:
>              /* Count the matches */
>              ntrue = 0;
>              for (i = 0; i < nkeys; i++)
>              {
>                  if (check[i] != GIN_FALSE)
>                      ntrue++;
>              }
> #ifdef DIVUNION
>              res = (nkeys == ntrue) ? GIN_MAYBE : (((((float4) ntrue) /
> ((float4) (nkeys - ntrue))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
> #else
>              res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4)
> nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
> #endif
>
> where
>    ntrue is the number of common trigrams in both strings,
>    nkeys is the number of trigrams in the search string.
>
> This code uses this upper bound for the similarity: ntrue / (nkeys - ntrue). But
> if there is ntrue trigrams in common, we know that the indexed string is at
> least ntrue trigrams long. We can then use a more aggressive upper bound: ntrue
> / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is a patch that changes this.
>
> Here are some performance gains with this test case:
>
> create table foo as select
>    substring(md5(random()::text) for random() * 5) || '123' as bar
> from generate_series(1,1000000);
>
> create index on foo using gin (bar gin_trgm_ops);
>
> patched:
>
> test=# explain analyze select count(*) from foo where bar % 'abc123';
>                                                                QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------
>   Aggregate  (cost=2511.14..2511.15 rows=1 width=0) (actual
> time=807.434..807.435 rows=1 loops=1)
>     ->  Bitmap Heap Scan on foo  (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=109.893..787.261 rows=54746 loops=1)
>           Recheck Cond: (bar % 'abc123'::text)
>           Rows Removed by Index Recheck: 55125
>           Heap Blocks: exact=4514
>           ->  Bitmap Index Scan on foo_bar_idx  (cost=0.00..99.50 rows=1000
> width=0) (actual time=108.456..108.456 rows=109871 loops=1)
>                 Index Cond: (bar % 'abc123'::text)
>   Planning time: 0.353 ms
>   Execution time: 807.593 ms
> (9 rows)
>
> test=# explain analyze select count(*) from foo where bar % 'abcdef';
>                                                             QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------
>   Aggregate  (cost=2511.14..2511.15 rows=1 width=0) (actual time=4.829..4.830
> rows=1 loops=1)
>     ->  Bitmap Heap Scan on foo  (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=3.512..4.794 rows=5 loops=1)
>           Recheck Cond: (bar % 'abcdef'::text)
>           Rows Removed by Index Recheck: 137
>           Heap Blocks: exact=139
>           ->  Bitmap Index Scan on foo_bar_idx  (cost=0.00..99.50 rows=1000
> width=0) (actual time=3.355..3.355 rows=142 loops=1)
>                 Index Cond: (bar % 'abcdef'::text)
>   Planning time: 0.363 ms
>   Execution time: 5.061 ms
> (9 rows)
>
>
> master:
>
> test=# explain analyze select count(*) from foo where bar % 'abc123';
>                                                                QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------
>   Aggregate  (cost=2511.14..2511.15 rows=1 width=0) (actual
> time=6416.554..6416.554 rows=1 loops=1)
>     ->  Bitmap Heap Scan on foo  (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=484.359..6389.819 rows=54746 loops=1)
>           Recheck Cond: (bar % 'abc123'::text)
>           Rows Removed by Index Recheck: 945250
>           Heap Blocks: exact=4514
>           ->  Bitmap Index Scan on foo_bar_idx  (cost=0.00..99.50 rows=1000
> width=0) (actual time=482.677..482.677 rows=999996 loops=1)
>                 Index Cond: (bar % 'abc123'::text)
>   Planning time: 0.359 ms
>   Execution time: 6416.945 ms
> (9 rows)
>
> test=# explain analyze select count(*) from foo where bar % 'abcdef';
>                                                             QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------
>   Aggregate  (cost=2511.14..2511.15 rows=1 width=0) (actual time=30.678..30.679
> rows=1 loops=1)
>     ->  Bitmap Heap Scan on foo  (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=9.020..30.643 rows=5 loops=1)
>           Recheck Cond: (bar % 'abcdef'::text)
>           Rows Removed by Index Recheck: 2789
>           Heap Blocks: exact=2110
>           ->  Bitmap Index Scan on foo_bar_idx  (cost=0.00..99.50 rows=1000
> width=0) (actual time=7.696..7.696 rows=2794 loops=1)
>                 Index Cond: (bar % 'abcdef'::text)
>   Planning time: 0.254 ms
>   Execution time: 30.809 ms
> (9 rows)
>
>
> Cheers,
>
> Christophe
>
>
>

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Etsuro Fujita
Дата:
Сообщение: Re: Optimization for updating foreign tables in Postgres FDW
Следующее
От: Teodor Sigaev
Дата:
Сообщение: Re: Review: GiST support for UUIDs