[HACKERS] [PATCH] kNN for btree

Поиск
Список
Период
Сортировка
От Nikita Glukhov
Тема [HACKERS] [PATCH] kNN for btree
Дата
Msg-id ce35e97b-cf34-3f5d-6b99-2c25bae49999@postgrespro.ru
обсуждение исходный текст
Ответы Re: [HACKERS] [PATCH] kNN for btree  (Nikita Glukhov <n.gluhov@postgrespro.ru>)
Список pgsql-hackers
Hi hackers,

I'd like to present a series of patches that implements k-Nearest 
Neighbors (kNN)
search forbtree, which can be usedto speed up ORDER BY distance queries 
like this:
SELECT * FROM eventsORDER BY date <-> '2000-01-01'::date ASC LIMIT 100;
Now only GiST supports kNN, but kNN on btree can be emulated using 
contrib/btree_gist.


Scanning algorithm
==================

Algorithm is very simple: we use bidirectional B-tree index scan 
starting at the point
from which we measure the distance(target point).At each step, we advance
this scan in the directionthat has the nearest point.  Butwhen the 
target point
does not fall into the scannedrange, we don't even need to 
useabidirectional
scan here --- wecan use ordinaryunidirectional scaninthe right direction.


Performance results
===================

Test database istaken from original kNN-GiST presentation (PGCon2010).

Test query

SELECT * FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;

can be optimizedto the next rather complicated UNION form, whichnolonger 
requireskNN:

WITH
   t1 AS (SELECT * FROM events WHERE date >= '1957-10-04'::date ORDER BY 
date ASC  LIMIT k),
   t2 AS (SELECT * FROM events WHERE date < '1957-10-04'::date ORDER BY 
date DESC LIMIT k),
t  AS (SELECT * FROM t1 UNION SELECT * FROM t2)
SELECT * FROM t ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;


In each cell of this table shown query execution time in milliseconds and
the number of accessed blocks:


    k   | kNN-btree   | kNN-GiST|  Opt. query   | Seq.scan
        |              | (btree_gist) |  withUNION |  with sort
-------|--------------|--------------|---------------|------------
      1 | 0.0414| 0.0794|   0.0608 |41.11824
     10 |  0.0487 |  0.0919 |   0.09717 |41.81824
    100 |  0.10747 |  0.192    52|   0.342104 |42.31824
   1000 |  0.735573 |  0.913   650 | 2.9701160 |43.51824
  10000 |  5.070 5622|  6.240 6760|  36.300 11031 |  54.1 1824
100000 | 49.600 51608| 61.900 64194 | 295.100 94980| 115.0 1824


As you can see, kNN-btree can be two times faster than kNN-GiST(btree_gist)
when k < 1000,but the number of blocks readis roughly the same.


Implementation details
======================

A brief description is given below for each of the patches:

1. Introduce amcanorderbyop() function

This patch transformsexisting boolean AMpropertyamcanorderbyop into a 
method
(function pointer).This is necessarybecause, unlike GiST,kNN for 
btreesupports
only a one ordering operator onthe first index column and we need a 
different pathkey
matching logicfor btree (there was acorresponding comment 
inmatch_pathkeys_to_index()).
GiST-specific logic has been moved from match_pathkeys_to_index() to 
gistcanorderbyop().

2. Extract substructure BTScanState from BTScanOpaque

This refactoringis necessary for bidirectional kNN-scanimplementation. Now,
BTScanOpaque'ssubstructure BTScanState containing only thefields related
toscanpositionis passed to some functions where thewhole BTScanOpaque
waspassedpreviously.

3. Extract get_index_column_opclass(), 
get_opclass_opfamily_and_input_type().

Extracted two simple common functions usedingistproperty() and
btproperty() (see the next patch).

4. Add kNN supportto btree

   * Added additional optional BTScanState to BTScanOpaque for 
bidirectional kNN scan.
   * Implemented bidirectional kNN scan.
   * Implemented logic for selecting kNN strategy
* Implemented btcanorderbyop(), updated btproperty() and btvalidate()

B-tree user interface functions have not been altered because ordering 
operators
are used directly.

5. Add distance operators for sometypes

These operators for integer, float, date, time, timestamp, interval, 
cash and oidtypes
havebeencopied fromcontrib/btree_gistand added to the existing btree 
opclasses
as ordering operators.  Their btree_gist duplicates areremoved in the 
next patch.

6. Remove duplicate distance operators from contrib/btree_gist.

References to their own distance operators in btree_gist opclassesare 
replaced
with references to the built-in operatorsand thanduplicate operators are 
dropped.
But if the user is using somewhere these operators, upgrade of btree_gist
from 1.3 to 1.4 should fail.

7. Add regression tests for btree kNN.

Tests were added only after the built-in distance operators were added.

-- 
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: [HACKERS] increasing the default WAL segment size