Re: CUBE_MAX_DIM

From: Alastair McKinley
Subject: Re: CUBE_MAX_DIM
Date:
Msg-id: PR1PR02MB53401C2502090B6EF3BB3DE3E3920@PR1PR02MB5340.eurprd02.prod.outlook.com
In reply to: Re: CUBE_MAX_DIM (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: CUBE_MAX_DIM (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
>
> Devrim Gündüz <devrim@gunduz.org> writes:
> > Someone contacted me about increasing CUBE_MAX_DIM
> > in contrib/cube/cubedata.h (in the community RPMs). The current value
> > is 100 with the following comment:
>
> > * This limit is pretty arbitrary, but don't make it so large that you
> > * risk overflow in sizing calculations.
>
> > They said they use 500, and never had a problem.
>
> I guess I'm wondering what's the use-case.  100 already seems an order of
> magnitude more than anyone could want.  Or, if it's not enough, why does
> raising the limit just 5x enable any large set of new applications?

The dimensionality of embeddings generated by deep neural networks can be high;
Google's BERT produces 768-dimensional vectors, for example.

I know that cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form (I
have tried recompiling with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using cubes can be useful
for these applications.  There are pre-processing techniques that can improve the speed of the search,
but it still ends up as a kNN search in a high-ish dimensional space.
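
To make the intended usage concrete, here is a minimal sketch of the kind of
kNN query involved (table and column names are hypothetical), using cube's
Euclidean distance operator <->, which a GiST index can serve as of
PostgreSQL 9.6:

    CREATE EXTENSION cube;

    -- hypothetical table of embedding vectors stored as point cubes
    CREATE TABLE embeddings (id serial PRIMARY KEY, vec cube);
    CREATE INDEX embeddings_vec_idx ON embeddings USING gist (vec);

    -- 10 nearest neighbours of a query point; the GiST index can
    -- drive the distance-ordered scan
    SELECT id
    FROM embeddings
    ORDER BY vec <-> cube(ARRAY[0.12, 0.48, 0.97]::float8[])  -- 3-D for brevity
    LIMIT 10;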

> The practical issue here is that, since the data requires 16 bytes per
> dimension (plus a little bit of overhead), we'd be talking about
> increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
> bytes.  And cube is not toastable, so that couldn't be compressed or
> shoved out-of-line.  Maybe your OP never had a problem with it, but
> plenty of use-cases would have "tuple too large" failures due to not
> having room on a heap page for whatever other data they want in the row.
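
As a back-of-envelope check of those figures (assuming the layout of two
float8 corners per dimension plus a few bytes of header):

    -- 100 dims: 100 * 16 = 1600 bytes; at 500 dims: 500 * 16 = 8000 bytes
    SELECT pg_column_size(cube(array_fill(0.0::float8, ARRAY[100]),
                               array_fill(1.0::float8, ARRAY[100])));
    -- reports roughly 1608 bytes (the corners plus headers)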
>
> Even a non-toastable 2KB field is going to give the tuple toaster
> algorithm problems, as it'll end up shoving every other toastable field
> out-of-line in an ultimately vain attempt to bring the tuple size below
> 2KB.  So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
> it is now without any other changes.
>
> A more credible proposal would be to make cube toast-aware and then
> raise the limit to ~1GB ... but that would take a significant amount
> of work, and we still haven't got a use-case justifying it.
>
> I think I'd counsel storing such data as plain float8 arrays, which
> do have the necessary storage infrastructure.  Is there something
> about the cube operators that's particularly missing?
>

Indexable nearest-neighbour search is one of the great cube features not available with float8 arrays.
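
For comparison, a float8[] nearest-neighbour search has to compute the distance
per row, with no index support. For instance, with a hypothetical helper like
this (embeddings_f8 and l2_dist are illustrative names):

    -- Euclidean distance over float8[], evaluated row by row
    CREATE FUNCTION l2_dist(a float8[], b float8[]) RETURNS float8
    LANGUAGE sql IMMUTABLE AS $$
        SELECT sqrt(sum((x - y) ^ 2)) FROM unnest(a, b) AS t(x, y)
    $$;

    -- no operator class ties this to an index, so it is a full scan
    -- plus sort, unlike cube's indexable "vec <-> query" ordering
    SELECT id FROM embeddings_f8
    ORDER BY l2_dist(vec, ARRAY[0.12, 0.48, 0.97]::float8[])
    LIMIT 10;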

>                         regards, tom lane

Best regards,
Alastair
