Re: [HACKERS] [PATCH] kNN for btree

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: [HACKERS] [PATCH] kNN for btree
Дата
Msg-id CAPpHfdtvTg+y5aAq3C5dF8HTC8P21iHoAE-6ur3Li91wWpDEmw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] [PATCH] kNN for btree  (Nikita Glukhov <n.gluhov@postgrespro.ru>)
Ответы Re: [HACKERS] [PATCH] kNN for btree  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Hi, Nikita!

I have assigned as a reviewer for this patchset.  I took a fist look to these patches.
At first, I'd like to notice that it's very cool that you picked up this work.  I frequently hear people complains about lack of this feature.

     k  |  kNN-btree   |  kNN-GiST    |  Opt. query   |  Seq. scan
        |              | (btree_gist) |  with UNION   |  with sort
--------|--------------|--------------|---------------|------------
      1 |  0.041     4 |  0.079     4 |   0.060     8 |  41.1 1824
     10 |  0.048     7 |  0.091     9 |   0.097    17 |  41.8 1824
    100 |  0.107    47 |  0.192    52 |   0.342   104 |  42.3 1824
   1000 |  0.735   573 |  0.913   650 |   2.970  1160 |  43.5 1824
  10000 |  5.070  5622 |  6.240  6760 |  36.300 11031 |  54.1 1824
 100000 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824

These results looks quite expected.  KNN-btree uses about half of blocks in comparison with UNION query, and it's more than twice faster.  In comparison with kNN-GiST there is still some win. 

1. Introduce amcanorderbyop() function

This patch transforms existing boolean AM property amcanorderbyop into a method
(function pointer).  This is necessary because, unlike GiST, kNN for btree
supports only a one ordering operator on the first index column and we need a
different pathkey matching logic for btree (there was a corresponding comment
in match_pathkeys_to_index()).  GiST-specific logic has been moved from
match_pathkeys_to_index() to gistcanorderbyop().

I'm not very excited about this design of amcanorderbyop callback.  Introducing new callback from index access method to the planner should imply quite good flexibility to the future.  In this particular signature of callback I see no potential future use-cases than your implementation for btree.  We could just add amcanorderbyonlyfisrtop property and that would give us same level of flexibility I think.
With existing index types, we could cover much more orderings that we currently do.  Some of possible cases:
1) "ORDER BY col" for btree_gist, SP-GiST text_ops
2) "ORDER BY col1, col2 <-> const" for btree_gist
3) "ORDER BY col1, col2 <-> const" for btree

I understand that #3 is quite hard task and I don't ask you to implement it now.  But it would be nice if some day we decide to add #3, we wouldn't have to change IndexAmRoutine definition.

My idea is that we need more general redesign of specifying ordering which index can produce.  Ideally, we should replace amcanorder, amcanbackward and amcanorderbyop with single callback.  Such callback should take a list of pathkeys and return number of leading pathkeys index could satisfy (with corresponding information for index scan).  I'm not sure that other hackers would agree with such design, but I'm very convinced that we need something of this level of extendability.  Otherwise we would have to hack our planner <-> index_access_method interface each time we decide to cover another index produced ordering.

6. Remove duplicate distance operators from contrib/btree_gist.

References to their own distance operators in btree_gist opclasses are
replaced with references to the built-in operators and than duplicate
operators are dropped.  But if the user is using somewhere these operators,
upgrade of btree_gist from 1.3 to 1.4 would fail.

The query in "btree_gist--1.3--1.4.sql" which directly touches system catalogue to update opfamilies looks too hackery.  I think we shouldn't use such queries until we have no other choice.
In this particular case we can update opfamilies using legal mechanism "ALTER OPERATOR FAMILY name USING index_method ADD/DROP ... " (note that operator name could be schema-qualified if needed).  This way wouldn't be that brief, but it is much more correct. 

Also this like catch my eyes.
info->amcanorderbyop = (void (*)()) amroutine->amcanorderbyop;
It's not necessary to use cast here.  For instance, we don't use cast for amcostestimate.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: [HACKERS] SERIALIZABLE with parallel query
Следующее
От: Kuntal Ghosh
Дата:
Сообщение: Re: [HACKERS] Passing query string to workers