Discussion: CUBE_MAX_DIM

CUBE_MAX_DIM

From
Devrim Gündüz
Date:
Hi,

Someone contacted me about increasing CUBE_MAX_DIM
in contrib/cube/cubedata.h (in the community RPMs). The current value
is 100 with the following comment:

* This limit is pretty arbitrary, but don't make it so large that you
* risk overflow in sizing calculations.


They said they use 500, and never had a problem. I never added such patches to the RPMS, and will not -- but wanted to
ask if we can safely increase it in upstream?

Regards,

--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR


Re: CUBE_MAX_DIM

From
Darafei "Komяpa" Praliaskouski
Date:
Hello,

The problem with higher-dimension cubes is that starting at a dimensionality of around 52, the "distance" metrics in 64-bit floats have less than a single bit of mantissa per dimension, making cubes indistinguishable. Developers of facial-recognition software discussed this in the Russian Postgres Telegram group https://t.me/pgsql. Their problem was that they had 128-dimensional points and recompiled Postgres, but the distances weren't helpful and GiST KNN severely degraded to almost full scans. They had to reduce the number of facial features to make KNN search work.
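
One way to see the effect without recompiling anything (a rough sketch that stays within the stock CUBE_MAX_DIM of
100): compare how tightly pairwise distances between random points cluster at 2 versus 100 dimensions. The ratio of
the farthest to the nearest pair collapses towards 1 as dimensionality grows, so an index ordering by distance has
very little to work with:

    CREATE EXTENSION IF NOT EXISTS cube;

    WITH pts AS (
        SELECT dims.d, i, cube(coords.a) AS c
        FROM (VALUES (2), (100)) AS dims(d)
        CROSS JOIN generate_series(1, 20) AS i
        CROSS JOIN LATERAL (
            -- referencing i keeps the planner from reusing one set of
            -- random coordinates for every row
            SELECT array_agg(random()) AS a
            FROM generate_series(1, dims.d)
            WHERE i > 0
        ) AS coords
    )
    SELECT p1.d AS dimensions,
           round((max(cube_distance(p1.c, p2.c))
                  / min(cube_distance(p1.c, p2.c)))::numeric, 2) AS max_over_min
    FROM pts p1
    JOIN pts p2 ON p1.d = p2.d AND p1.i < p2.i
    GROUP BY p1.d
    ORDER BY p1.d;

The 2-D ratio typically comes out at an order of magnitude or more, while the 100-D one sits close to 1, and it only
gets flatter as the dimensionality grows further.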

Floating-point overflow isn't much of a risk per se; in the worst case the result becomes an Infinity or 0, which is usually acceptable in those contexts.

While higher-dimension cubes are mathematically possible, there are implementation issues with them. I'm OK with raising the limit if such nuances get a mention in the docs.

On Thu, Jun 25, 2020 at 1:01 PM Devrim Gündüz <devrim@gunduz.org> wrote:

> Hi,
>
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:
>
> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.
>
>
> They said they use 500, and never had a problem. I never added such patches to the RPMS, and will not -- but wanted to ask if we can safely increase it in upstream?
>
> Regards,
>
> --
> Devrim Gündüz
> Open Source Solution Architect, Red Hat Certified Engineer
> Twitter: @DevrimGunduz , @DevrimGunduzTR


--
Darafei Praliaskouski

Re: CUBE_MAX_DIM

From
Tom Lane
Date:
Devrim Gündüz <devrim@gunduz.org> writes:
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:

> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.

> They said they use 500, and never had a problem.

I guess I'm wondering what's the use-case.  100 already seems an order of
magnitude more than anyone could want.  Or, if it's not enough, why does
raising the limit just 5x enable any large set of new applications?

The practical issue here is that, since the data requires 16 bytes per
dimension (plus a little bit of overhead), we'd be talking about
increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
bytes.  And cube is not toastable, so that couldn't be compressed or
shoved out-of-line.  Maybe your OP never had a problem with it, but
plenty of use-cases would have "tuple too large" failures due to not
having room on a heap page for whatever other data they want in the row.
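
For example, a quick way to check that arithmetic (just a sketch; the exact figure includes a few bytes of varlena
and header overhead):

    -- a 100-D box cube stores two corners, i.e. ~16 bytes per dimension
    SELECT pg_column_size(
               cube(array_fill(0.0::float8, ARRAY[100]),
                    array_fill(1.0::float8, ARRAY[100]))) AS box_100d_bytes;
    -- a bit over 1600 bytes; the same shape at 500 dimensions would be
    -- around 8000 bytes, most of an 8kB heap page before any other columns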

Even a non-toastable 2KB field is going to give the tuple toaster
algorithm problems, as it'll end up shoving every other toastable field
out-of-line in an ultimately vain attempt to bring the tuple size below
2KB.  So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
it is now without any other changes.

A more credible proposal would be to make cube toast-aware and then
raise the limit to ~1GB ... but that would take a significant amount
of work, and we still haven't got a use-case justifying it.

I think I'd counsel storing such data as plain float8 arrays, which
do have the necessary storage infrastructure.  Is there something
about the cube operators that's particularly missing?
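
Computing distances over float8[] is easy enough in plain SQL, for instance -- a sketch, with a made-up function name:

    -- hypothetical helper: Euclidean distance over float8[]; storage and
    -- TOAST work fine, but nothing here is index-assisted
    CREATE FUNCTION float8_l2_distance(a float8[], b float8[])
    RETURNS float8
    LANGUAGE sql IMMUTABLE STRICT
    AS $$
        SELECT sqrt(sum((x - y) ^ 2))
        FROM unnest(a, b) AS t(x, y)
    $$;

    SELECT float8_l2_distance(ARRAY[1,2,3]::float8[], ARRAY[4,5,6]::float8[]);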

            regards, tom lane



Re: CUBE_MAX_DIM

From
Alastair McKinley
Date:
>
> Devrim Gündüz <devrim@gunduz.org> writes:
> > Someone contacted me about increasing CUBE_MAX_DIM
> > in contrib/cube/cubedata.h (in the community RPMs). The current value
> > is 100 with the following comment:
>
> > * This limit is pretty arbitrary, but don't make it so large that you
> > * risk overflow in sizing calculations.
>
> > They said they use 500, and never had a problem.
>
> I guess I'm wondering what's the use-case.  100 already seems an order of
> magnitude more than anyone could want.  Or, if it's not enough, why does
> raising the limit just 5x enable any large set of new applications?

The dimensionality of embeddings generated by deep neural networks can be high.
Google's BERT has 768-dimensional embeddings, for example.

I know that Cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form
(I have tried recompilation with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes can be
useful for these applications.  There are other pre-processing techniques that can be used to improve the speed of the
search, but it still ends up with a kNN search in a high-ish dimensional space.

> The practical issue here is that, since the data requires 16 bytes per
> dimension (plus a little bit of overhead), we'd be talking about
> increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
> bytes.  And cube is not toastable, so that couldn't be compressed or
> shoved out-of-line.  Maybe your OP never had a problem with it, but
> plenty of use-cases would have "tuple too large" failures due to not
> having room on a heap page for whatever other data they want in the row.
>
> Even a non-toastable 2KB field is going to give the tuple toaster
> algorithm problems, as it'll end up shoving every other toastable field
> out-of-line in an ultimately vain attempt to bring the tuple size below
> 2KB.  So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
> it is now without any other changes.
>
> A more credible proposal would be to make cube toast-aware and then
> raise the limit to ~1GB ... but that would take a significant amount
> of work, and we still haven't got a use-case justifying it.
>
> I think I'd counsel storing such data as plain float8 arrays, which
> do have the necessary storage infrastructure.  Is there something
> about the cube operators that's particularly missing?
>

Indexable nearest-neighbour search is one of the great cube features not available with float8 arrays.
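
For example (a sketch against a hypothetical table, kept to 3 dimensions for brevity):

    CREATE TABLE embeddings (id bigint PRIMARY KEY, vec cube);
    CREATE INDEX ON embeddings USING gist (vec);

    -- k nearest neighbours by Euclidean distance, driven by the GiST index
    SELECT id
    FROM embeddings
    ORDER BY vec <-> cube(ARRAY[0.1, 0.2, 0.3]::float8[])
    LIMIT 10;

There is no equivalent index-assisted ORDER BY ... LIMIT k strategy for a plain float8[] column in core.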

>                         regards, tom lane

Best regards,
Alastair


Re: CUBE_MAX_DIM

From
Tom Lane
Date:
Alastair McKinley <a.mckinley@analyticsengines.com> writes:
> I know that Cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form
> (I have tried recompilation with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes can be
> useful for these applications.  There are other pre-processing techniques that can be used to improve the speed of
> the search, but it still ends up with a kNN search in a high-ish dimensional space.

Is there a way to fix the numerical instability involved?  If we could do
that, then we'd definitely have a use-case justifying the work to make
cube toastable.

            regards, tom lane



Re: CUBE_MAX_DIM

From
Alastair McKinley
Date:
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Sent: 25 June 2020 17:43
>
> Alastair McKinley <a.mckinley@analyticsengines.com> writes:
> > I know that Cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw
> > form (I have tried recompilation with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using
> > Cubes can be useful for these applications.  There are other pre-processing techniques that can be used to
> > improve the speed of the search, but it still ends up with a kNN search in a high-ish dimensional space.
>
> Is there a way to fix the numerical instability involved?  If we could do
> that, then we'd definitely have a use-case justifying the work to make
> cube toastable.

I am not that familiar with the nature of the numerical instability, but it might be worth noting for additional
context that for the NN use case:

- The value of each dimension is likely to be between 0 and 1
- The L1 distance is meaningful for high numbers of dimensions, and it *possibly* suffers less from the numeric issues
  than Euclidean distance (cube already exposes it as an operator; see the sketch below).
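
For example, the taxicab metric is already available as a cube operator (a quick sketch, not exercised at high
dimensions here):

    -- L1 (taxicab) distance between two point cubes via the <#> operator
    SELECT cube(ARRAY[0.1, 0.2, 0.3]::float8[])
           <#> cube(ARRAY[0.2, 0.1, 0.5]::float8[]) AS l1_dist;  -- roughly 0.4
    -- like <->, this operator can also drive ORDER BY kNN queries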

Numerical stability isn't the only issue for high-dimensional kNN: the GiST search performance currently degrades
with increasing N towards sequential-scan performance, although maybe the two are related?

>                         regards, tom lane

Best regards,
Alastair