Re: Proposal: q-gram GIN and GiST indexes

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: Proposal: q-gram GIN and GiST indexes
Дата
Msg-id BANLkTi=A6S0bsdPqmZDzuQ0SH4fhEUfGKA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Proposal: q-gram GIN and GiST indexes  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: Proposal: q-gram GIN and GiST indexes  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Here is text of my GSoC proposal. Given details probably makes essence of my proposal clear. Any comments are welcome.

Name of project

Q-gram indexing module

Synopsis

Currently PostgreSQL has support for trigram-based string collection indexing in pg_trgm module. Indexes in pg_trgm was originally focused on trigram similatiry queries optimization. LIKE/ILIKE operations support was added by my patch, but this support is limited, because trigrams not always can be extracted from wildcard.
I would like to propose a q-gram module which would have following differences in comparison with pg_trgm:
1) Focus on acceleration of edit distance (e.g. levenshtein distance) queries and LIKE/ILIKE queries
2) Support of various q
3) Support of positional q-grams (q-gram which contain position in which it appears in text) in GIN which give a hope to handle edit distance queries better than ordinal q-grams
4) LIKE/ILIKE acceleration even in case when no q-gram can be extracted from wildcard (via partial match)
5) Support of various signature size in GiST
6) Store exact string value in leaf pages of GiST index tree (In spite of set of q-grams, exact string value takes less space and allow exact filtering for LIKE/ILIKE and edit distance filtering)

Benefits to the PostgreSQL Community

Additional support of q-gram indexing will consolidate PostgreSQL as most advanced DBMS in fuzzy search on string collections. Q-gram indexes will be applicable for industry tasks. 

Quantifiable results

Acceleration of edit distance queries and LIKE/ILIKE queries.

Project Details

Q-gram index for LIKE/ILIKE queries

Basic idea is so [1]:
1) Extract q-grams which should anyway present in string which conforms to wildcard.
2) Filter only string which contain q-grams extracted on previous step. As the rule, recheck is required.
Currently this approach was impelemented for pg_trgm, but there is following limitation besides fixed q. In some cases we can extract no q-grams from wildcard (for example '%zz%' and q = 3). In GIN we could filter all q-grams which starts from 'zz' using partial match. But it require to store original q-gram in index (in pg_trgm crc32 in stored in the case of multibyte encoding).

Using q-gram index for edit distance queries

Number of common trigrams in strings p and r is at least max{length(p), length(r)} − q + 1 − k*q [2,5]. If p is query string and r is indexed string, we don't know exactly length of r. Then we can use following minimal common q-grams number bound length(p) − q + 1 − k*q. This bound allow us to filter strings using q-gram GIN and GiST indexes. Moreover, when filtering using GIN or GiST index we know about absence in indexing string of particular q-grams. In some cases it should be more effective then minimal bound of common q-grams. For examle, if query string is 'abcdefg' then absence of trigrams 'abc' and 'efg' can ensure us that minimal possible edit distance is 2.

Using positional q-gram index for edit distance queries

In string to set of q-grams decomposition not only q-gram presence itself but also q-gram position is useful information [3,4,5]. GIN index can store q-gram position as lower part of key. Q-gram position can be-used for edit distance filtering. For example, if we extract q-gram x with position l from query with threshold k, then corresponding q-gram in indexed string should have position in interval [l-k,l+k]. This condition can be used for more effective filtering than with ordinal q-grams.

Questions to discuss with community

1) Find a way to give additional parameters to index (value of q and signature size for GiST). If it wouldn't be possible to add extension-specific parameters to GiST and GIN indexes then distinct opclasses can be created for various parameters values;
2) Find a way to represent edit distance query. Edit distance query require two parameters query string and threshold. Threshold is maximum edit distance between result string and query string. Similar problem presents in pg_trgm for trigram similarity query: we need threshold of matchind trigrams amount. In pg_trgm threshold is declared as global session parameter and it can be manipulated by set_limit show_limit. However this approach have some limitation. For example, different thresholds can't be combined in same query. Probably, we should consider using of composite type for edit distance query respresentation.
3) Find appropriate place for proposed functionality: core, contrib module or a separate project. 

Inch-stones (project broken into small, distinct chunks)

1) Produce a solution af architecture questions with help of community.
2) Edit distance operator implementation
3) Implement GIN qgram index
    b) Index building implementation
    c) Edit distance strategy implementation
    d) LIKE/ILIKE strategy implementation
4) Implement GIN pqgram index
    a) Index building implementation
    b) Edit distance strategy implementation
    c) LIKE/ILIKE strategy implementation
5) Implement GiST qgram index
    a) Index building implementation
    b) Edit distance strategy implementation
    c) Edit distance KNN-strategy implementation
    d) LIKE/ILIKE strategy implementation
6) Perfomance testing on real life datasets
7) Manual writing
8) Code cleanup and commiting

Project Schedule

until May 22
Solve architecture questins with help of commutiny.

May 23 - May 29
Implement edit distance filter operator and KNN-operator.

May 30 - June 12
Implement GIN q-gram index.

June 13 - June 26
Implement GIN pq-gram index.

June 27 - July 10
Implement GiST q-gram index.

July 11 - July 24
Perform testing on real life data.

July 25 - July 31
Manural writing.

August 1 - August 16
Final code cleanup and commiting.

Links

2) Jokinen, P., Ukkonen, E.: Two Algorithms for Approximate String Matching in Static Texts, Proceedings of the 16th Symposium on Mathematical Foundations of Computer Science, number 520 in LNCS, Springer, 1991.
3) Erkki Sutinen and Jorma Tarhio. On using q-gram locations in approximate string matching. Algorithms — ESA '95, p. 327-340
4) Chen Li, Bin Wang, Xiaochun Yang. VGRAM: improving performance of approximate queries on string collections using variable-length grams. VLDB '07 Proceedings of the 33rd international conference on Very large data bases
5) Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava. Approximate String Joins in a Database (Almost) for Free. VLDB '01 Proceedings of the 27th International Conference on Very Large Data Bases

----
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: GSoC proposal: Fast GiST index build
Следующее
От: "Kevin Grittner"
Дата:
Сообщение: Re: GUC assign hooks (was Re: wal_buffers = -1 and SIGHUP)