Re: vector search support

From	Giuseppe Broccolo
Subject	Re: vector search support
Date	29 May 2023, 16:18:03
Msg-id	CAFtuf8AzttX4Vzy5AZebNY_PxKzve_aWYf+0YUFfeKJ0xzYm_A@mail.gmail.com
In reply to	Re: vector search support ("Jonathan S. Katz" <jkatz@postgresql.org>)
List	pgsql-hackers

Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz <jkatz@postgresql.org> wrote:

On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> We finally opted for ElasticSearch as search engine, considering that it
> was providing what we needed:
>
> * support to store dense vectors
> * support for kNN searches (last version of ElasticSearch allows this)

I do want to note that we can implement indexing techniques with GiST
that perform K-NN searches with the "distance" support function[1], so
adding the fundamental functions to help with this around known vector
search techniques could add this functionality. We already have this
today with "cube", but as Nathan mentioned, it's limited to 100 dims.

Yes, I was aware of this. It would be enough to define the required support
functions for GiST indexing (I was somewhat in the loop when PG14 presorting
support for GiST indexing was being added to PostGIS[1]). That would be really
helpful indeed. I mentioned it only because I know of other teams that use
ElasticSearch as a store for dense vectors solely for this reason.
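
To make the K-NN discussion concrete, here is a minimal sketch of the result a
nearest-neighbour search over dense vectors has to produce. It is a brute-force
scan in plain Python, not the GiST mechanism: an index with a "distance" support
function would return the same ordering without visiting every row. The names
`euclidean` and `knn`, the choice of Euclidean distance, and the sample vectors
are all assumptions made for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query: Sequence[float],
        vectors: Sequence[Sequence[float]],
        k: int) -> List[Tuple[float, int]]:
    """Return the k nearest (distance, row_index) pairs, closest first.

    Every stored vector is scanned here; avoiding this full scan is
    the job of a K-NN-capable index.
    """
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dims.
vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [5.0, 5.0, 5.0],
]
nearest = knn([0.9, 0.1, 0.0], vectors, k=2)
nearest_rows = [row for _, row in nearest]
```

The cost of this scan grows linearly with the number of stored vectors, which is
why index support matters once the table is large.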

> An internal benchmark showed us that we were able to achieve the
> expected performance, although we are still lacking some points:
>
> * clustering of vectors (this has to be done outside the search engine,
> using DBScan for our use case)

From your experience, have you found any particular clustering
algorithms better at driving a good performance/recall tradeoff?

Nope, it really depends on the use case. The main reason for using DBScan above
was that it clusters without needing to know a priori the number of clusters to
find, which is a required parameter for Kmeans. Depending on the use case,
DBScan may also perform better on noisy datasets (i.e. those with entries that
do not really belong to any particular cluster). Noise in vectors obtained from
embedding models is quite normal, especially when the embedding model is not
properly tuned/trained.

In our use case, DBScan was more or less the best choice, as it did not bias the
expected clusters.

Also PostGIS includes an implementation of DBScan for its geometries[2].
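
As an illustration of the two properties mentioned above (no cluster count given
up front, and explicit handling of noise), here is a minimal sketch of the
textbook DBSCAN procedure in plain Python. It is not the PostGIS implementation
referenced in [2], and it uses a quadratic neighbour scan for brevity. The
parameter names `eps` and `min_pts`, the helper `region_query`, and the sample
points are assumptions made for this example.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Sequence[float]], idx: int, eps: float) -> List[int]:
    """Indexes of all points within eps of points[idx], including idx itself."""
    return [j for j, p in enumerate(points) if math.dist(points[idx], p) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)  # None means unvisited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Two dense groups plus one isolated point; no cluster count is supplied.
points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
```

With these inputs the two dense groups get separate cluster ids and the isolated
point is labelled as noise, whereas Kmeans would need the number of clusters as
input and would assign the isolated point to one of them.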

> * concurrency in updating the ElasticSearch indexes storing the dense
> vectors

I do think concurrent updates of vector-based indexes is one area
PostgreSQL can ultimately be pretty good at, whether in core or in an
extension.

Oh, that would save a lot of overhead when updating indexed vectors! This is
needed whenever embedding models are re-trained: the vectors are re-generated
and the indexes need to be updated.

Regards,

Giuseppe.

[1] https://github.com/postgis/postgis/blob/a4f354398e52ad7ed3564c47773701e4b6b87ae8/doc/release_notes.xml#L284

[2] https://github.com/postgis/postgis/blob/ce75a0e81aec2e8a9fad2649ff7b230327acb64b/postgis/lwgeom_window.c#L117
