Re: Proposal: q-gram GIN and GiST indexes

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: Proposal: q-gram GIN and GiST indexes
Дата
Msg-id BANLkTin6d3NSe=LGhFi1O4BpNn2cz4EYpw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Proposal: q-gram GIN and GiST indexes  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Proposal: q-gram GIN and GiST indexes  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I would like to propose a q-gram module which would have following
> differences in comparison with pg_trgm:
> 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> queries and LIKE/ILIKE queries
> 2) Support of various q

Doesn't the index size grow rather drastically with increasing q?

1) GIN
GIN index size can be represent as entries pages size + data pages size. With increasement of data volume grow of entries pages size is slowing , because set of q-grams which are actually occurs in text is limited. It can be illustrated on following example:

data set              count   avg. length   entries pages   data pages
english dictionary    98569           8.5             529          656
names of persons     105420          19.3             475         1985
paper titles        2526871          47.2            1234        84819

Statistics of GIN index of pg_trgm module is presented in table above. We can see that ratio of entries pages to data pages is decreasing with increasing of data volume. Since number of q-grams extracted from one indexed value is similar for distinct q, we should not expect significant grow of data pages size with increasing q. With increasing q we should expect increase of entries pages size, but on large datasets and with not very high q we shoudn't expect significant grow of index size. In some papers I met assumption that set of whole q-grams in natural text is relatively small when q <= 5. Accordingly, I think we should expect indexes to be usable with at least with q = 5.

2) GiST 
Since I'm going to store exact values in leaf nodes instead sets of q-grams, index grow isn't directly expected with increasing of q. Because index size would depends on size of indexed values and signature size(both don't depends on q). There would be possible indirect index grow, because we probably need longer signatures for appropriate representation of q-gram set. 

----
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [DOCS] Uppercase SGML entity declarations
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: GSoC proposal: Fast GiST index build