Re: index structure for 114-dimension vector

Список
Период
Сортировка
От Oleg Bartunov
Тема Re: index structure for 114-dimension vector
Дата
Msg-id Pine.LNX.4.64.0704270838230.12152@sn.sai.msu.ru
обсуждение исходный текст
Ответ на Re: index structure for 114-dimension vector  ("Alexander Staubo")
Ответы Re: index structure for 114-dimension vector  (Andrew Lazarus)
Список pgsql-performance
Дерево обсуждения
Basic Q on superfluous primary keys  ("Kynn Jones", )
 Re: Basic Q on superfluous primary keys  (Bill Moran, )
  Re: Basic Q on superfluous primary keys  ("Merlin Moncure", )
   Re: Basic Q on superfluous primary keys  ("Craig A. James", )
    Re: Basic Q on superfluous primary keys  ("Merlin Moncure", )
     Re: Basic Q on superfluous primary keys  (Greg Smith, )
      Re: Basic Q on superfluous primary keys  ("Merlin Moncure", )
       Re: Basic Q on superfluous primary keys  ("Craig A. James", )
        Re: Basic Q on superfluous primary keys  (Richard Huxton, )
         Re: Basic Q on superfluous primary keys  (Greg Smith, )
        Re: Basic Q on superfluous primary keys  ("Merlin Moncure", )
         Re: Basic Q on superfluous primary keys  ("Craig A. James", )
        Re: Basic Q on superfluous primary keys  (Jeff Davis, )
       Re: Basic Q on superfluous primary keys  ("Dave Dutcher", )
        Re: Basic Q on superfluous primary keys  ("Merlin Moncure", )
         index structure for 114-dimension vector  (Andrew Lazarus, )
          Re: index structure for 114-dimension vector  (Jeff Davis, )
           Re: index structure for 114-dimension vector  (Mark Kirkwood, )
            Re: index structure for 114-dimension vector  (Andrew Lazarus, )
             Re: index structure for 114-dimension vector  (Mark Kirkwood, )
             Re: index structure for 114-dimension vector  (Tom Lane, )
            Re: index structure for 114-dimension vector  (Arjen van der Meijden, )
          Re: index structure for 114-dimension vector  (C Storm, )
          Re: index structure for 114-dimension vector  ("Alexander Staubo", )
           Re: index structure for 114-dimension vector  (Oleg Bartunov, )
            Re: index structure for 114-dimension vector  (Andrew Lazarus, )
             Re: index structure for 114-dimension vector  ("Alexander Staubo", )
       Re: Basic Q on superfluous primary keys  ("Craig A. James", )
    Re: Basic Q on superfluous primary keys  (Ron Mayer, )
On Fri, 27 Apr 2007, Alexander Staubo wrote:

> On 4/20/07, Andrew Lazarus <> wrote:
>> I have a table with 2.5 million real[] arrays. (They are points in a
>> time series.) Given a new array X, I'd like to find, say, the 25
>> closest to X in some sense--for simplification, let's just say in the
>> usual vector norm. Speed is critical here, and everything I have tried
>> has been too slow.
>
> Let me chime in with the observation that this is a multidimensional
> nearest neighbour (reverse nearest neighbour and its close cousin,
> k-NN) that is well known in statistics, and particularly relevant to
> statistical learning and classification. Knowing the jargon might help
> you dig up efficient algorithms to mine your data; there are tons of
> fascinating papers available through Citeseer.
>
> In particular, I recommend the paper "Efficient k-NN Search on
> Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
> http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
> proposes an algorithm called BOND to drastically reduce the search
> space by probalistic means. They give an example using image
> histograms, but the algorithm applies to any multidimensional data.
> Briefly put, it points out that proximity comparison can be computed
> vertically, a few dimensions at a time, and entire subsets can be
> thrown away when it's apparent that they are below a statistically
> derived lower bound. The only gotcha is that the algorithm derives
> much of its performance from the assumption that your data is
> vertically decomposed, one table per dimension, otherwise the search
> effectively incurs a sequential scan of the entire dataset, and then
> you're pretty much back to square one.
>
> The most common approach to nearest neighbour search is to use a
> spatial data structure. The classic algorithm is the kd-tree
> (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
> tree, neither of which are available in PostgreSQL. If I remember
> correctly, R-trees have also been shown to be useful for high numbers
> of dimensions; with PostgreSQL you have R-trees and even better
> R-tree-equivalent support through GiST. I have no idea whether you can
> actually munge your integer vectors into something GiST can index and
> search, but it's a thought. (GiST, presumably, can also theoretically
> index kd-trees and other spatial trees.)

you're right, but currently  only theoretically due to interface restriction.
We have plan to improve it sometime. There was SP-GiST project, which
could be used for k-d-b tree,  see http://www.cs.purdue.edu/spgist/
I don't know if it works with 8.2 version. Also, it doesn't supports
concurrency and recovery

>
> Alexander.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>              http://archives.postgresql.org
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: , http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

В списке pgsql-performance по дате отправления:

Предыдущее
От: Shane Ambler
Дата:
Сообщение: Re: Usage up to 50% CPU
Следующее
От: Michael Stone
Дата:
Сообщение: Re: Usage up to 50% CPU