Re: wildcard search support for pg_trgm

From	Jesper Krogh
Subject	Re: wildcard search support for pg_trgm
Date	24 January 2011, 12:14:26
Msg-id	4D3DA553.2070909@krogh.cc
In reply to	Re: wildcard search support for pg_trgm (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: wildcard search support for pg_trgm
List	pgsql-hackers

On 2011-01-24 16:34, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański<wulczer@wulczer.org>  wrote:
>
>> I see two issues with this patch. The first is the resulting index
>> size. I created a table with 5 copies of
>> /usr/share/dict/american-english in it and a gin index on it, using
>> gin_trgm_ops. The results were:
>>
>>   * relation size: 18 MB
>>   * index size: 109 MB
>>
>> while without the patch the GIN index was 43 MB. I'm not really sure
>> *why* this happens, as it's not obvious from reading the patch what
>> exactly is this extra data that gets stored in the index, making it more
>> than double its size.
>>
> Are you sure that you did the comparison correctly? The order of index
> building and data insertion matters. I tried building a GIN index on 5
> copies of /usr/share/dict/american-english with the patch and got a 43 MB
> index.
>
>
>> That leads me to the second issue. The pg_trgm code is already woefully
>> uncommented, and after spending quite some time reading it back and
>> forth I have to admit that I don't really understand what the code does
>> in the first place, and so I don't understand what the patch
>> changes. I read all the changes in detail and I couldn't find any obvious
>> mistakes like reading over array boundaries or dereferencing
>> uninitialized pointers, but I can't tell if the patch is correct
>> semantically. All test cases I threw at it work, though.
>>
> I'll try to write sufficient comments and send a new revision of the patch.
>
Would it be hard to make it support "n-grams" (i.e. make the length
configurable) instead of only trigrams? I actually had the feeling that
penta-grams (pen-tuples, or whatever they would be called) would
be better for my use case (large substring searches in large documents,
e.g. 500 within 3.000).

Larger n-grams mean less "sensitivity" and therefore faster lookups, or
perhaps my logic is wrong?
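
To make the trade-off concrete, here is a minimal sketch in Python (not
pg_trgm code). The function names are hypothetical. The padding rule for
n other than 3 is an assumption: it generalizes the documented pg_trgm
behaviour of treating each word as having two spaces prefixed and one
space suffixed.

```python
def word_ngrams(word, n=3):
    """Return the set of n-grams an index might store for one word.

    Assumption: pad with n-1 leading spaces and one trailing space,
    which for n=3 matches the padding pg_trgm documents for trigrams.
    """
    padded = " " * (n - 1) + word.lower() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def substring_ngrams(fragment, n=3):
    """Return the n-grams usable for a '%fragment%' style search.

    No padding is applied, because the fragment may sit in the middle
    of a word. A fragment shorter than n yields an empty set, so the
    index could not narrow the search at all.
    """
    fragment = fragment.lower()
    return {fragment[i:i + n] for i in range(len(fragment) - n + 1)}


if __name__ == "__main__":
    for n in (3, 5):
        print(n, sorted(word_ngrams("cat", n)))
        print(n, sorted(substring_ngrams("wildcard", n)))
        print(n, sorted(substring_ngrams("gram", n)))
```

With n=5 each query fragment produces fewer keys, and each key is likely
to match fewer documents, but any fragment shorter than five characters
produces no keys at all. Whether that results in faster lookups overall
would have to be measured.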

Hm, or will the knngist stuff help me here by selecting the best matches
using penta-grams from the beginning?

The above comment is actually general to pg_trgm and not specific to the
wildcard search patch above.

Jesper
-- 
Jesper
