Re: KNN-GiST with recheck

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: KNN-GiST with recheck
Дата
Msg-id 52E7B6AD.9090608@vmware.com
обсуждение исходный текст
Ответ на KNN-GiST with recheck  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: KNN-GiST with recheck  (Alexander Korotkov <aekorotkov@gmail.com>)
Re: KNN-GiST with recheck  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-hackers
On 01/13/2014 07:17 PM, Alexander Korotkov wrote:
> Here goes a desription of this patch same as in original thread.
>
> KNN-GiST provides ability to get ordered results from index, but this order
> is based only on index information. For instance, GiST index contains
> bounding rectangles for polygons, and we can't get exact distance to
> polygon from index (similar situation is in PostGIS). In attached patch,
> GiST distance method can set recheck flag (similar to consistent method).
> This flag means that distance method returned lower bound of distance and
> we should recheck it from heap.
>
> See an example.
>
> create table test as (select id, polygon(3+(random()*10)::int,
> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> generate_series(1,1000000) id);
> create index test_idx on test using gist (p);
>
> We can get results ordered by distance from polygon to point.
>
> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> point(0.5,0.5) limit 10;
>     id   |       ?column?
> --------+----------------------
>   755611 | 0.000405855808916853
>   807562 | 0.000464123777564343
>   437778 | 0.000738524708741959
>   947860 |  0.00076250998760724
>   389843 | 0.000886362723569568
>    17586 | 0.000981960100555216
>   411329 |  0.00145338112316853
>   894191 |  0.00149399559703506
>   391907 |   0.0016647896049741
>   235381 |  0.00167554614889509
> (10 rows)
>
> It's fast using just index scan.
>
>                                                              QUERY PLAN
>
>
----------------------------------------------------------------------------------------------------------------------------------
>   Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> rows=10 loops=1)
>     ->  Index Scan using test_idx on test  (cost=0.29..157672.29
> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
>           Order By: (p <-> '(0.5,0.5)'::point)
>   Total runtime: 0.305 ms
> (4 rows)

Nice! Some thoughts:

1. This patch introduces a new "polygon <-> point" operator. That seems 
useful on its own, with or without this patch.

2. I wonder how useful it really is to allow mixing exact and non-exact 
return values from the distance function. The distance function included 
in the patch always returns recheck=true. I have a feeling that all 
other distance function will also always return either true or false.

3. A binary heap would be a better data structure to buffer the 
rechecked values. A Red-Black tree allows random insertions and 
deletions, but in this case you need to insert arbitrary values but only 
remove the minimum item. That's exactly what a binary heap excels at. We 
have a nice binary heap implementation in the backend that you can use, 
see src/backend/lib/binaryheap.c.

4. (as you mentioned in the other thread: ) It's a modularity violation 
that you peek into the heap tuple from gist. I think the proper way to 
do this would be to extend the IndexScan executor node to perform the 
re-shuffling of tuples that come from the index in wrong order, or 
perhaps add a new node type for it.

Of course that's exactly what your partial sort patch does :-). I 
haven't looked at that in detail, but I don't think the approach the 
partial sort patch takes will work here as is. In the KNN-GiST case, the 
index is returning tuples roughly in the right order, but a tuple that 
it returns might in reality belong somewhere later in the ordering. In 
the partial sort patch, the "input stream" of tuples is divided into 
non-overlapping groups, so that the tuples within the group are not 
sorted, but the groups are. I think the partial sort case is a special 
case of the KNN-GiST case, if you consider the lower bound of each tuple 
to be the leading keys that you don't need to sort.

BTW, this capability might also be highly useful for the min/max indexes 
as well. A min/max index cannot return an exact ordering of tuples, but 
it can also give a lower bound for a group of tuples.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Christian Kruse
Дата:
Сообщение: Re: [PATCH] Use MAP_HUGETLB where supported (v3)
Следующее
От: salah jubeh
Дата:
Сообщение: Re: Add force option to dropdb