Re: CUBE_MAX_DIM

From: Alastair McKinley
Subject: Re: CUBE_MAX_DIM
Date:
Msg-id: PR1PR02MB53401C2502090B6EF3BB3DE3E3920@PR1PR02MB5340.eurprd02.prod.outlook.com
In reply to: Re: CUBE_MAX_DIM (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: CUBE_MAX_DIM (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
>
> Devrim Gündüz <devrim@gunduz.org> writes:
> > Someone contacted me about increasing CUBE_MAX_DIM
> > in contrib/cube/cubedata.h (in the community RPMs). The current value
> > is 100 with the following comment:
>
> > * This limit is pretty arbitrary, but don't make it so large that you
> > * risk overflow in sizing calculations.
>
> > They said they use 500, and never had a problem.
>
> I guess I'm wondering what's the use-case.  100 already seems an order of
> magnitude more than anyone could want.  Or, if it's not enough, why does
> raising the limit just 5x enable any large set of new applications?

The dimensionality of embeddings generated by deep neural networks can be high;
Google's BERT produces 768-dimensional vectors, for example.

I know that cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form (I
have tried recompiling with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using cubes can be useful
for these applications.  There are pre-processing techniques that can improve the speed of the search,
but it still ends up as a kNN search in a high-ish dimensional space.
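
To make the intended usage concrete, here is a minimal sketch of the kind of
kNN query involved (table and column names are hypothetical), using cube's
Euclidean distance operator <->, which a GiST index can serve as of
PostgreSQL 9.6:

    CREATE EXTENSION cube;

    -- hypothetical table of embedding vectors stored as point cubes
    CREATE TABLE embeddings (id serial PRIMARY KEY, vec cube);
    CREATE INDEX embeddings_vec_idx ON embeddings USING gist (vec);

    -- 10 nearest neighbours of a query point; the GiST index can
    -- drive the distance-ordered scan
    SELECT id
    FROM embeddings
    ORDER BY vec <-> cube(ARRAY[0.12, 0.48, 0.97]::float8[])  -- 3-D for brevity
    LIMIT 10;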

> The practical issue here is that, since the data requires 16 bytes per
> dimension (plus a little bit of overhead), we'd be talking about
> increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
> bytes.  And cube is not toastable, so that couldn't be compressed or
> shoved out-of-line.  Maybe your OP never had a problem with it, but
> plenty of use-cases would have "tuple too large" failures due to not
> having room on a heap page for whatever other data they want in the row.
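
As a back-of-envelope check of those figures (assuming the layout of two
float8 corners per dimension plus a few bytes of header):

    -- 100 dims: 100 * 16 = 1600 bytes; at 500 dims: 500 * 16 = 8000 bytes
    SELECT pg_column_size(cube(array_fill(0.0::float8, ARRAY[100]),
                               array_fill(1.0::float8, ARRAY[100])));
    -- reports roughly 1608 bytes (the corners plus headers)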
>
> Even a non-toastable 2KB field is going to give the tuple toaster
> algorithm problems, as it'll end up shoving every other toastable field
> out-of-line in an ultimately vain attempt to bring the tuple size below
> 2KB.  So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
> it is now without any other changes.
>
> A more credible proposal would be to make cube toast-aware and then
> raise the limit to ~1GB ... but that would take a significant amount
> of work, and we still haven't got a use-case justifying it.
>
> I think I'd counsel storing such data as plain float8 arrays, which
> do have the necessary storage infrastructure.  Is there something
> about the cube operators that's particularly missing?
>

Indexable nearest-neighbour search is one of the great cube features not available with float8 arrays.
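
For comparison, a float8[] nearest-neighbour search has to compute the distance
per row, with no index support. For instance, with a hypothetical helper like
this (embeddings_f8 and l2_dist are illustrative names):

    -- Euclidean distance over float8[], evaluated row by row
    CREATE FUNCTION l2_dist(a float8[], b float8[]) RETURNS float8
    LANGUAGE sql IMMUTABLE AS $$
        SELECT sqrt(sum((x - y) ^ 2)) FROM unnest(a, b) AS t(x, y)
    $$;

    -- no operator class ties this to an index, so it is a full scan
    -- plus sort, unlike cube's indexable "vec <-> query" ordering
    SELECT id FROM embeddings_f8
    ORDER BY l2_dist(vec, ARRAY[0.12, 0.48, 0.97]::float8[])
    LIMIT 10;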

>                         regards, tom lane

Best regards,
Alastair
