Re: trgm regex index peculiarity

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: trgm regex index peculiarity
Дата
Msg-id CAPpHfdu1+Y6mvrg2PU5raEE5efziVKztbR30Y1nR=Fz_bL6eyg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: trgm regex index peculiarity  ("Erik Rijkers" <er@xs4all.nl>)
Ответы Re: trgm regex index peculiarity  ("Erik Rijkers" <er@xs4all.nl>)
Список pgsql-hackers
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, June 21, 2013 05:25, Tom Lane wrote:
> "Erik Rijkers" <er@xs4all.nl> writes:
>> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
>> timings:
>> select txt from azjunk6 where txt ~ '^abcd';
>>    130 ms
>> select txt from azjunk6
>> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>>    3 ms
>
> Hm, could you provide a self-contained test case?
>

yes, sorry.   I tested on a 1M row table:

#!/bin/sh

# create table:
for power in 6;
do
  table=azjunk${power}
  index=${table}_trgm_re_idx
  perl -E'
  sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
  say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \
  | psql -aqXc "
    drop table if exists $table;
    create table $table(txt text);
    copy $table from stdin;";
  echo "set session maintenance_work_mem='1GB';
    create index $index on $table using gin (txt gin_trgm_ops);
    analyze $table;" | psql -qtAX;
done

# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd';  -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa

Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. However trigrams '__a' is much more frequent than '_ab' which in turn is much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of '__a' and '_ab' and that gives so significant speedup. 
The problem is that trigram regex engine doesn't know that '__a' and '_ab' is more frequent than another trigrams because we don't know their distribution. However, simple assumption that blank character (in pg_trgm meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE macro which is used in selectColorTrigrams like count of characters in COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.

------
With best regards,
Alexander Korotkov. 
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Support for REINDEX CONCURRENTLY
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls