GSoC'14: KNN on SP-GiST

Поиск
Список
Период
Сортировка
От Vladislav Sterzhanov
Тема GSoC'14: KNN on SP-GiST
Дата
Msg-id CAK2aJC1_HZ-YNA_BQmqezr0Tb0dQALcWNjCcqBju2hes3OxeKQ@mail.gmail.com
обсуждение исходный текст
Список pgsql-students
Greetings, Yusuph Adekoya!
My fault, forgot that it's only visible for mentors, quoting here (with the feedback from Heikki Linnakangas):

My Project:

Rather straightforward, the ultimate goal of the project is in empowering PostgreSQL's SP-GiST index type with a capability of performing nearest-neighbour search, which naturally divides in a sequence of goals: first of all, the general SP-GiST interface for KNN should be designed and introduced. Then comes adjusting of all the currently existing SP-GiST structures (k-d tree, radix trie over text and quad trees) in order to make all of them use appropriate nearest-neighbourhood picking logic and, finally, a thorough testing of each one.

As I see it, pushing the general high-level interface down to implementations is nescessary in this situation (in contrast to regular PostgreSQL practices) since the neighbouring logic in different SP-GiSTs can vary: e.g. in a quad-tree, if we habe two leaves branching from a single down-level inner node and thus belonging to the closest (in their relation) quadrant we can't guarantee their "global" closeness (in contrast to e.g. text trie where it's true for two leaves having the most common prefix).  

Schedule / Timeline:

- Milestone 1 - Pre-Community-bonding - get to know the related codebase of PostgreSQL in order to ensure that the following work will continue in accordance with best practices.  

- Milestone 2 - Community-bonding + 2 weeks: Get everything structured and discuss project with community (due to the late application). Set up coding and testing environment. By this time I need to clearly imagine and sketch the project.

- Milestone 3 – till the early June - implement the general high-level interface.

- Milestone 4 – till the midterm - implement and test the specification for text radix-trie. Ensure everything goes at right direction, prepare for midterm.

- Milestone 5 – till end of July - implement the specification for k-d quad- trees. 

- Milestone 6 – till final eval - test everything up thoroughly, final discussion of results and new directions with the community.

Heikki Linnakangas March 23, 2014, 2:04 p.m.

Sounds good. There are some design details that need to be figured out: * What new support functions does an operator class need to implement, to support kNN-searches? (GiST needs a "distance" function, is that how it would work with SP-GiST too?) * You said you'd implement the support for text tries, for text datatype, k-d and quad trees, presumably for points. Sounds good, but I wonder what does the query that using text look like? We already have kNN-GiST support for points, so presumably the kNN support for SP-GiST would help with the same operators as the GiST ones, but what about text? Can you give an example of a query that could use the kNN support for text? * (related to previous point) I suspect it might be easier and more useful to implement the support for points than text, so you might want to do that first.

quadrocube March 26, 2014, 11:12 p.m.

Actually, GiST requires a distance relation only on queue entries to maintain a currently-nearest list of nodes to traverse the tree in the smallest-distance-first order. I'll need the similar queue for SP-GiST, but for a different purpose - simply keep track of the best-observed-points, without any order-modification logic. KNN search for k-d tree is essentially a recursive descend down the tree, populating the nearest-neighbour heap, followed by an ascension, which updates this heap by traversing the unvisited branches if needed (bounds-overlap-ball test on the domain of the corresponding inner nodes). KNN for quadtree is essentially the same, keeping in mind more intuitive space organization techniques. Regarding text tries - the generalization is rather obvious - "Find K most lexicographically closest strings from the subset to a given one", but the practical application of the queries like this is debatable - I personally thought of some kind of bioinformatics genome manipulation applications, but have almost no experience in the field. Moreover, recently I've stumbled upon a k-d tree rebuild optimization which ensures an average O(log n) complexity of the NN-search operation for every possible point distribution (which could otherwise return to O(N) for a specific node distribution patterns), but it needs a complete re-structure of the tree thus requiring the "build-once-use-often" way to work with it. (For instance, a link to a mentioned modification: http://dl.acm.org/citation.cfm?doid=355744.355745) So, it seems like the text implentation can really be put at the end of the to-do list, probably even being replaced by the closely related k-d search optimization.

В списке pgsql-students по дате отправления:

Предыдущее
От: Vladislav Sterzhanov
Дата:
Сообщение: GSoC'14: KNN on SP-GiST
Следующее
От: Madhurima Das
Дата:
Сообщение: unable to insert column using postgresql and C