Re: Proposal: q-gram GIN and GiST indexes

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: Proposal: q-gram GIN and GiST indexes
Дата
Msg-id BANLkTinVYEVBq1=Lcs-7zzApCmax-wTXsw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Proposal: q-gram GIN and GiST indexes  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Proposal: q-gram GIN and GiST indexes  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> relatively small when q <= 5. Accordingly, I think we should expect indexes
> to be usable with at least with q = 5.

I defer to your opinion on this, since you know more about it than I
do.  But I think it would still be worthwhile to write a quick Perl
script and calculate the number q-grams in various sample texts for
various values of q.  The worst case is surely exponential in q, so
it'd be nice to have some evidence of what the real-world behavior is.

Here is distribution of numbers of different q-grams count in various datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for example, lowercased) should have lower counts.
q      ds1     ds2     ds3    ds4
2     2313    3461    1625   1288
3    15146   25094   14090  10728
4    58510  105908   69127  47499
5   161801  298466  182680 110929
6   351175  633750  331090 176336
7   613299 1049088  496426 234730
8   921962 1450715  657965 283698
9  1248339 1793158  802188 321261
10 1556838 2066775  926043 348058
ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
ds3 - set of person first and last names, 2142298 bytes
ds4 - english dictionary, 931708 bytes
Sure, q-grams count grows with q increasing. At low q we can see approximately exponential grow. At high q grow is slowing and it is approximately linear.
In the worst case count of q-grams is exponential in q if we think data volume to be much higher then number of possible q-grams. But with high q real limitation is total number of q-grams extracted from dataset. In worst case each extracted q-gram is unique. This means that entries pages number would be comparable with data pages number. In this case index size with high q would be few times greater that index size with low q. 

----
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dimitri Fontaine
Дата:
Сообщение: Re: Re: synchronous_commit and synchronous_replication Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Следующее
От: Shigeru HANADA
Дата:
Сообщение: Re: Re: [COMMITTERS] pgsql: Support comments on FOREIGN DATA WRAPPER and SERVER objects.