Обсуждение: knngist questions

Поиск
Список
Период
Сортировка

knngist questions

От
Robert Haas
Дата:
I'm gradually slogging my way through the KNNGIST patches which were
posted here:

http://archives.postgresql.org/pgsql-hackers/2010-07/msg01183.php

I have a couple of conceptual questions.

1. Is KNNGIST intended to work if there's more than one pathkey?  If
so, how?  Example:

SELECT * FROM tab ORDER BY this_point <-> '(0,0)', this_point <-> '(1,1)'

As far as I can see, there's nothing in match_pathkey_to_index() which
would prevent lists of pathkeys and sortclauses of length 2 from being
constructed, but can GIST actually handle that?  It seems hard.  If it
can't, which I suspect is the case, then we can make this logic a lot
simpler by removing one of the loops from match_pathkey_to_index() and
bailing out quickly whenever list_length(root->query_pathkeys) != 1.

2. I'm concerned by the fact that the new consistent() methods only
return 0 or -1.0 for non-leaf pages.  If we have a query like:

SELECT * FROM tab WHERE this_point <-> '(0,0)'

...it seems that every non-leaf page will be consistent and we'll end
up loading every key in the entire index into an RBTree, which doesn't
seem practical from a memory-usage perspective.  Maybe I'm
misunderstanding something, but it seems like in a case like this
you'd need to have the consistent function for each page return a
minimum and maximum distance to '(0,0)'.  If you have one page that
has a minimum distance of 0 and a maximum distance of 100 and another
page which has a minimum distance of 101 and a maximum distance of
200, you don't even need to look at the second page until all the keys
from the first one have been processed.  If there's a third page with
a minimum distance of 50 and a maximum distance of 150, you can return
the tuples from the first page with distances less than 50, in order;
then you can do a merge between the remaining keys on that page and
the keys on the third page with values <= 100; then you can do a merge
between the remaining keys on that page and those on the second page
with values <= 150; and finally you can return the remaining keys on
the second page in order.

That still doesn't seem to provide any particularly tight bound on
memory usage, but it's certainly way better than traversing the entire
tree up front.  If you are already doing something like this somewhere
in the code, please point me in the right direction...

3. I've been scratching my head over the following bit of code and it
doesn't make any sense to me.  As far as I can tell, this is
effectively comparing the number of columns in the ORDER BY clause to
the number of restriction clauses applicable to the relation being
scanned.  Those two quantities don't seem to have much to do with each
other, so either I'm confused or the code is.  It doesn't seem like it
should matter anyway, since I don't think we're planning on any AMs
being both amoptionalkey and amcanorderbyop.

+                               if (list_length(restrictclauses) <
indexcol && !index->amoptionalkey)
+                                       break;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: knngist questions

От
Teodor Sigaev
Дата:
> 1. Is KNNGIST intended to work if there's more than one pathkey?  If
> so, how?  Example:
>
> SELECT * FROM tab ORDER BY this_point<->  '(0,0)', this_point<->  '(1,1)'

Why not, if distances from two points to '(0,0)' are equal, it's need to compare 
distances to '(1,1)'. Nothing new here, KNN-GiST supports it, that a reason why 
StackElem struct has variable-size array for distances.

In practice, it could be used for finding closest restaurant with some fish in 
its name (pg_trgm has a support of disttance):
SELECT * FROM restaurants ORDER BY MY_COORDS <-> r_coords, 'fish' <-> r_name;

> 2. I'm concerned by the fact that the new consistent() methods only
> return 0 or -1.0 for non-leaf pages.  If we have a query like:
M-m, it's returns 0 or -1.0 only for operation from WHERE clause, for ORDER BY 
operation it should return >= 0 value. That's why consistentFn shoulf be able to 
distinguish source of operation (actually, it's need because now we allow 
boolean distances, if distance could not be a boolean then operation could not 
come from WHERE clause)

> SELECT * FROM tab WHERE this_point<->  '(0,0)'

For point, it's incorrect query: non-boolean operations could not present in 
WHERE clause.

> you'd need to have the consistent function for each page return a
> minimum and maximum distance to '(0,0)'.  If you have one page that
Only minimum distance which guarantees that in it's sub-tree there is no more 
close point. For R-tree it's rather obvious because keys on inner pages are 
actually a bounding box of its subtree.


> has a minimum distance of 0 and a maximum distance of 100 and another
> page which has a minimum distance of 101 and a maximum distance of
> 200, you don't even need to look at the second page until all the keys
> from the first one have been processed.  If there's a third page with
Right, it works so, until I made a big mistake in code. But performance test 
allows me to believe that I didn't :)

It's read a page (for the start - root page) and puts all keys in binary tree 
ordered by distance and then takes left-most key from binary tree. If that key 
is pointer to heap, GiST returns it to postgres, if it's a pointer to index's 
page then repeat loop with new page.

> tree up front.  If you are already doing something like this somewhere
> in the code, please point me in the right direction...

getNextNearest() function: it gets next pointer from tree with a help of 
getNextDataPointer(), and then, depending of pointer type (to index or heap 
page) returns a pointer or call processPage() which puts new pointers into tree.

May be, not obvious thing that all pointers on the same distance are stored in 
single binary tree node (struct StackElem). Each pointer is represented by 
DataPointer which contains information needed for support concurrency. List of 
that structs is keeped "semi-ordered": pointer to heap are always in the 
beginning of list to increase response time and memory requirement.


>
> 3. I've been scratching my head over the following bit of code and it
> doesn't make any sense to me.  As far as I can tell, this is
> effectively comparing the number of columns in the ORDER BY clause to
> the number of restriction clauses applicable to the relation being
> scanned.  Those two quantities don't seem to have much to do with each
> other, so either I'm confused or the code is.  It doesn't seem like it
Oops, it seems to me that's artefact of developing, sorry. I'm not very familiar 
with planner/optimizer so that required a lot of my brain and I just missed that 
after another attempt to get it work.

> should matter anyway, since I don't think we're planning on any AMs
> being both amoptionalkey and amcanorderbyop.

Agree.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/