Обсуждение: Proposal: q-gram GIN and GiST indexes

Поиск
Список
Период
Сортировка

Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
Hackers,

I would like to ask you about currency of the work above. I propose to
develop functionality of GIN and GiST q-gram indexes with following
features:
1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE
queries(using GIN partial match if no full q-grams can be extracted
from wildcard)
2) Support of various q
3) Support of positional q-grams in GIN (for more effective edit
distance filtering)
4) Various signature size in GiST
As you can see, there are some significant differences from pg_trgm.
Do you see this functionality useful? If you think this functionality
useful, where do you like to see it: separate project, contrib module,
core (of course, in the case when code have sufficient quality)?
I have stong confidence level about implementability of this project
in few month. That's why I could propose this as an GSoC project.

----
With best regards,
Alexander Korotkov.


Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
On Fri, Mar 25, 2011 at 8:32 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> I would like to ask you about currency of the work above.
This seems to be a mess of words. Sorry for my bad english. Actually,
I meant that I need a appraisal of my proposal.

----
With best regards,
Alexander Korotkov.


Re: Proposal: q-gram GIN and GiST indexes

От
Robert Haas
Дата:
On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> I would like to ask you about currency of the work above. I propose to
> develop functionality of GIN and GiST q-gram indexes with following
> features:
> 1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE
> queries(using GIN partial match if no full q-grams can be extracted
> from wildcard)
> 2) Support of various q
> 3) Support of positional q-grams in GIN (for more effective edit
> distance filtering)
> 4) Various signature size in GiST
> As you can see, there are some significant differences from pg_trgm.
> Do you see this functionality useful? If you think this functionality
> useful, where do you like to see it: separate project, contrib module,
> core (of course, in the case when code have sufficient quality)?
> I have stong confidence level about implementability of this project
> in few month. That's why I could propose this as an GSoC project.

I'm afraid I don't know this code well enough to give you any
meaningful feedback, but I hope someone will.

All - note that Alexander has contributed a number of patches in this
area previously that have been committed, so it'd be great if we can
do our part to help him continue contributing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Proposal: q-gram GIN and GiST indexes

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> I would like to ask you about currency of the work above. I propose to
>> develop functionality of GIN and GiST q-gram indexes with following
>> features:

> I'm afraid I don't know this code well enough to give you any
> meaningful feedback, but I hope someone will.

Really Oleg and Teodor are the only people well-qualified to comment on
such stuff.  (It sounds reasonable to me, but I wouldn't know if there
are problems in the idea.)  They may be too busy right at the moment.
        regards, tom lane


Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
On Mon, Mar 28, 2011 at 7:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm afraid I don't know this code well enough to give you any
> meaningful feedback, but I hope someone will.

Really Oleg and Teodor are the only people well-qualified to comment on
such stuff.  (It sounds reasonable to me, but I wouldn't know if there
are problems in the idea.)  They may be too busy right at the moment.

Thank you for reply. I'm going to ask Oleg and Teodor for their feedback.

----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
Here is text of my GSoC proposal. Given details probably makes essence of my proposal clear. Any comments are welcome.

Name of project

Q-gram indexing module

Synopsis

Currently PostgreSQL has support for trigram-based string collection indexing in pg_trgm module. Indexes in pg_trgm was originally focused on trigram similatiry queries optimization. LIKE/ILIKE operations support was added by my patch, but this support is limited, because trigrams not always can be extracted from wildcard.
I would like to propose a q-gram module which would have following differences in comparison with pg_trgm:
1) Focus on acceleration of edit distance (e.g. levenshtein distance) queries and LIKE/ILIKE queries
2) Support of various q
3) Support of positional q-grams (q-gram which contain position in which it appears in text) in GIN which give a hope to handle edit distance queries better than ordinal q-grams
4) LIKE/ILIKE acceleration even in case when no q-gram can be extracted from wildcard (via partial match)
5) Support of various signature size in GiST
6) Store exact string value in leaf pages of GiST index tree (In spite of set of q-grams, exact string value takes less space and allow exact filtering for LIKE/ILIKE and edit distance filtering)

Benefits to the PostgreSQL Community

Additional support of q-gram indexing will consolidate PostgreSQL as most advanced DBMS in fuzzy search on string collections. Q-gram indexes will be applicable for industry tasks. 

Quantifiable results

Acceleration of edit distance queries and LIKE/ILIKE queries.

Project Details

Q-gram index for LIKE/ILIKE queries

Basic idea is so [1]:
1) Extract q-grams which should anyway present in string which conforms to wildcard.
2) Filter only string which contain q-grams extracted on previous step. As the rule, recheck is required.
Currently this approach was impelemented for pg_trgm, but there is following limitation besides fixed q. In some cases we can extract no q-grams from wildcard (for example '%zz%' and q = 3). In GIN we could filter all q-grams which starts from 'zz' using partial match. But it require to store original q-gram in index (in pg_trgm crc32 in stored in the case of multibyte encoding).

Using q-gram index for edit distance queries

Number of common trigrams in strings p and r is at least max{length(p), length(r)} − q + 1 − k*q [2,5]. If p is query string and r is indexed string, we don't know exactly length of r. Then we can use following minimal common q-grams number bound length(p) − q + 1 − k*q. This bound allow us to filter strings using q-gram GIN and GiST indexes. Moreover, when filtering using GIN or GiST index we know about absence in indexing string of particular q-grams. In some cases it should be more effective then minimal bound of common q-grams. For examle, if query string is 'abcdefg' then absence of trigrams 'abc' and 'efg' can ensure us that minimal possible edit distance is 2.

Using positional q-gram index for edit distance queries

In string to set of q-grams decomposition not only q-gram presence itself but also q-gram position is useful information [3,4,5]. GIN index can store q-gram position as lower part of key. Q-gram position can be-used for edit distance filtering. For example, if we extract q-gram x with position l from query with threshold k, then corresponding q-gram in indexed string should have position in interval [l-k,l+k]. This condition can be used for more effective filtering than with ordinal q-grams.

Questions to discuss with community

1) Find a way to give additional parameters to index (value of q and signature size for GiST). If it wouldn't be possible to add extension-specific parameters to GiST and GIN indexes then distinct opclasses can be created for various parameters values;
2) Find a way to represent edit distance query. Edit distance query require two parameters query string and threshold. Threshold is maximum edit distance between result string and query string. Similar problem presents in pg_trgm for trigram similarity query: we need threshold of matchind trigrams amount. In pg_trgm threshold is declared as global session parameter and it can be manipulated by set_limit show_limit. However this approach have some limitation. For example, different thresholds can't be combined in same query. Probably, we should consider using of composite type for edit distance query respresentation.
3) Find appropriate place for proposed functionality: core, contrib module or a separate project. 

Inch-stones (project broken into small, distinct chunks)

1) Produce a solution af architecture questions with help of community.
2) Edit distance operator implementation
3) Implement GIN qgram index
    b) Index building implementation
    c) Edit distance strategy implementation
    d) LIKE/ILIKE strategy implementation
4) Implement GIN pqgram index
    a) Index building implementation
    b) Edit distance strategy implementation
    c) LIKE/ILIKE strategy implementation
5) Implement GiST qgram index
    a) Index building implementation
    b) Edit distance strategy implementation
    c) Edit distance KNN-strategy implementation
    d) LIKE/ILIKE strategy implementation
6) Perfomance testing on real life datasets
7) Manual writing
8) Code cleanup and commiting

Project Schedule

until May 22
Solve architecture questins with help of commutiny.

May 23 - May 29
Implement edit distance filter operator and KNN-operator.

May 30 - June 12
Implement GIN q-gram index.

June 13 - June 26
Implement GIN pq-gram index.

June 27 - July 10
Implement GiST q-gram index.

July 11 - July 24
Perform testing on real life data.

July 25 - July 31
Manural writing.

August 1 - August 16
Final code cleanup and commiting.

Links

2) Jokinen, P., Ukkonen, E.: Two Algorithms for Approximate String Matching in Static Texts, Proceedings of the 16th Symposium on Mathematical Foundations of Computer Science, number 520 in LNCS, Springer, 1991.
3) Erkki Sutinen and Jorma Tarhio. On using q-gram locations in approximate string matching. Algorithms — ESA '95, p. 327-340
4) Chen Li, Bin Wang, Xiaochun Yang. VGRAM: improving performance of approximate queries on string collections using variable-length grams. VLDB '07 Proceedings of the 33rd international conference on Very large data bases
5) Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava. Approximate String Joins in a Database (Almost) for Free. VLDB '01 Proceedings of the 27th International Conference on Very Large Data Bases

----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Robert Haas
Дата:
On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I would like to propose a q-gram module which would have following
> differences in comparison with pg_trgm:
> 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> queries and LIKE/ILIKE queries
> 2) Support of various q

Doesn't the index size grow rather drastically with increasing q?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I would like to propose a q-gram module which would have following
> differences in comparison with pg_trgm:
> 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> queries and LIKE/ILIKE queries
> 2) Support of various q

Doesn't the index size grow rather drastically with increasing q?

1) GIN
GIN index size can be represent as entries pages size + data pages size. With increasement of data volume grow of entries pages size is slowing , because set of q-grams which are actually occurs in text is limited. It can be illustrated on following example:

data set              count   avg. length   entries pages   data pages
english dictionary    98569           8.5             529          656
names of persons     105420          19.3             475         1985
paper titles        2526871          47.2            1234        84819

Statistics of GIN index of pg_trgm module is presented in table above. We can see that ratio of entries pages to data pages is decreasing with increasing of data volume. Since number of q-grams extracted from one indexed value is similar for distinct q, we should not expect significant grow of data pages size with increasing q. With increasing q we should expect increase of entries pages size, but on large datasets and with not very high q we shoudn't expect significant grow of index size. In some papers I met assumption that set of whole q-grams in natural text is relatively small when q <= 5. Accordingly, I think we should expect indexes to be usable with at least with q = 5.

2) GiST 
Since I'm going to store exact values in leaf nodes instead sets of q-grams, index grow isn't directly expected with increasing of q. Because index size would depends on size of indexed values and signature size(both don't depends on q). There would be possible indirect index grow, because we probably need longer signatures for appropriate representation of q-gram set. 

----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Robert Haas
Дата:
On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> relatively small when q <= 5. Accordingly, I think we should expect indexes
> to be usable with at least with q = 5.

I defer to your opinion on this, since you know more about it than I
do.  But I think it would still be worthwhile to write a quick Perl
script and calculate the number q-grams in various sample texts for
various values of q.  The worst case is surely exponential in q, so
it'd be nice to have some evidence of what the real-world behavior is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> relatively small when q <= 5. Accordingly, I think we should expect indexes
> to be usable with at least with q = 5.

I defer to your opinion on this, since you know more about it than I
do.  But I think it would still be worthwhile to write a quick Perl
script and calculate the number q-grams in various sample texts for
various values of q.  The worst case is surely exponential in q, so
it'd be nice to have some evidence of what the real-world behavior is.

Here is distribution of numbers of different q-grams count in various datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for example, lowercased) should have lower counts.
q      ds1     ds2     ds3    ds4
2     2313    3461    1625   1288
3    15146   25094   14090  10728
4    58510  105908   69127  47499
5   161801  298466  182680 110929
6   351175  633750  331090 176336
7   613299 1049088  496426 234730
8   921962 1450715  657965 283698
9  1248339 1793158  802188 321261
10 1556838 2066775  926043 348058
ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
ds3 - set of person first and last names, 2142298 bytes
ds4 - english dictionary, 931708 bytes
Sure, q-grams count grows with q increasing. At low q we can see approximately exponential grow. At high q grow is slowing and it is approximately linear.
In the worst case count of q-grams is exponential in q if we think data volume to be much higher then number of possible q-grams. But with high q real limitation is total number of q-grams extracted from dataset. In worst case each extracted q-gram is unique. This means that entries pages number would be comparable with data pages number. In this case index size with high q would be few times greater that index size with low q. 

----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Robert Haas
Дата:
On Tue, Apr 5, 2011 at 4:52 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> > relatively small when q <= 5. Accordingly, I think we should expect
>> > indexes
>> > to be usable with at least with q = 5.
>>
>> I defer to your opinion on this, since you know more about it than I
>> do.  But I think it would still be worthwhile to write a quick Perl
>> script and calculate the number q-grams in various sample texts for
>> various values of q.  The worst case is surely exponential in q, so
>> it'd be nice to have some evidence of what the real-world behavior is.
>
> Here is distribution of numbers of different q-grams count in various
> datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for
> example, lowercased) should have lower counts.
> q      ds1     ds2     ds3    ds4
> 2     2313    3461    1625   1288
> 3    15146   25094   14090  10728
> 4    58510  105908   69127  47499
> 5   161801  298466  182680 110929
> 6   351175  633750  331090 176336
> 7   613299 1049088  496426 234730
> 8   921962 1450715  657965 283698
> 9  1248339 1793158  802188 321261
> 10 1556838 2066775  926043 348058
> ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
> ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
> ds3 - set of person first and last names, 2142298 bytes
> ds4 - english dictionary, 931708 bytes
> Sure, q-grams count grows with q increasing. At low q we can see
> approximately exponential grow. At high q grow is slowing and it is
> approximately linear.
> In the worst case count of q-grams is exponential in q if we think data
> volume to be much higher then number of possible q-grams. But with high q
> real limitation is total number of q-grams extracted from dataset. In worst
> case each extracted q-gram is unique. This means that entries pages number
> would be comparable with data pages number. In this case index size with
> high q would be few times greater that index size with low q.

So with q=5, the index will be approximately 10x larger than with q=3.Maybe that's OK, I'm not sure.  But it is a big
difference.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
On Tue, Apr 5, 2011 at 3:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
So with q=5, the index will be approximately 10x larger than with q=3.
 Maybe that's OK, I'm not sure.  But it is a big difference.
Not whole index will be approximately 10x larger, but only entries pages number (which contains btree on gin keys, i.e. q-grams), while data pages number (which contains links to rows in lists or btrees) will be similar. In dependence on data volume index size can be 10x larger (on small datasets) or few percents larger (on large datasets).
 
----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
For example, here is distribution of q-grams count in 120 Mb of dblp paper titles (pretty large dataset).
q   count
2    7218
3  115107
4  589428
5 1648453
6 3336685
Number of 5-grams if about 15x larger than number of 3-grams. But most part of index space will be occupied by links to the rows(about 120 millions of links), while size of q-grams itself will be almost ignorable in comparison with it.

----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Robert Haas
Дата:
On Tue, Apr 5, 2011 at 8:41 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> For example, here is distribution of q-grams count in 120 Mb of dblp paper
> titles (pretty large dataset).
> q   count
> 2    7218
> 3  115107
> 4  589428
> 5 1648453
> 6 3336685
> Number of 5-grams if about 15x larger than number of 3-grams. But most part
> of index space will be occupied by links to the rows(about 120 millions of
> links), while size of q-grams itself will be almost ignorable in comparison
> with it.

I am probably being stupid here, but doesn't the number of links to
rows grow proportionately to the number of n-grams?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Proposal: q-gram GIN and GiST indexes

От
Alexander Korotkov
Дата:
On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I am probably being stupid here, but doesn't the number of links to
rows grow proportionately to the number of n-grams?
Number of links to rows grow proportionally to total number of extracted q-grams, but not proportionally to number of unique q-grams. Though, if extracted q-grams are not unique inside same indexed value, then it can reduce number of links (but it is rarity). 
Lets consider simple example. Two rows contains strings 'aaa' and 'aaab'. We extract 3-gram 'aaa' from first string and 3-grams 'aaa' and 'aab' from second string (for simplicity, there is no padding here). GIN index will contain structure, which can be represented so:
'aaa' => 1, 2
'aab' => 2
We can see, that there are 2 unique 3-grams, but 3 links to the rows.

----
With best regards,
Alexander Korotkov.

Re: Proposal: q-gram GIN and GiST indexes

От
Tom Lane
Дата:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I am probably being stupid here, but doesn't the number of links to
>> rows grow proportionately to the number of n-grams?

> Number of links to rows grow proportionally to total number of extracted
> q-grams, but not proportionally to number of unique q-grams.

Sure.  The number of links is exactly proportional to the size of the
text, no?  An n-character text contains exactly n-q+1 q-grams, no more,
no less.  You might have some rules that cause you to discard some of
them, but basically the TID portion of the index will be proportional
to data volume, with no measurable dependence on q.

Or at least that's what it seems like before I've had my morning
caffeine fix...
        regards, tom lane