Re: vector search support

From Giuseppe Broccolo
Subject Re: vector search support
Date
Msg-id CAFtuf8AzttX4Vzy5AZebNY_PxKzve_aWYf+0YUFfeKJ0xzYm_A@mail.gmail.com
In reply to Re: vector search support  ("Jonathan S. Katz" <jkatz@postgresql.org>)
List pgsql-hackers
Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz <jkatz@postgresql.org> wrote:
> On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
>> We finally opted for ElasticSearch as search engine, considering that it
>> was providing what we needed:
>>
>> * support to store dense vectors
>> * support for kNN searches (last version of ElasticSearch allows this)
>
> I do want to note that we can implement indexing techniques with GiST
> that perform K-NN searches with the "distance" support function[1], so
> adding the fundamental functions to help with this around known vector
> search techniques could add this functionality. We already have this
> today with "cube", but as Nathan mentioned, it's limited to 100 dims.

Yes, I was aware of this. It would be enough to define the required support functions for GiST
indexing (I was partly involved when PG14 presorting support was being added to GiST indexing
in PostGIS[1]). That would be really helpful indeed. I only mentioned it because I know of
other teams that use ElasticSearch as a store for dense vectors just for this reason.
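
To make the operation under discussion concrete, here is a minimal brute-force sketch in Python of what a K-NN search over dense vectors is expected to return. It only illustrates the semantics (order by distance, keep the first k); it is not how GiST or ElasticSearch evaluate the query, and the function name `knn`, the Euclidean metric and the sample vectors are assumptions made up for this example.

```python
import heapq
import math


def knn(vectors, query, k):
    """Return the k (distance, id) pairs closest to `query`, nearest first.

    `vectors` maps a hypothetical row id to a dense vector (tuple of floats).
    Euclidean distance is an assumption; other metrics would plug in the same way.
    """
    scored = ((math.dist(vec, query), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Toy data, purely illustrative.
vectors = {
    "a": (0.0, 0.0, 0.0),
    "b": (1.0, 0.0, 0.0),
    "c": (0.0, 3.0, 4.0),
    "d": (1.0, 1.0, 1.0),
}
nearest = knn(vectors, (0.9, 0.1, 0.0), k=2)
print([vec_id for _, vec_id in nearest])
```

The scan above touches every stored vector on each query; the point of an index with distance support is to return the same ordering without doing that.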
 
>> An internal benchmark showed us that we were able to achieve the
>> expected performance, although we are still lacking some points:
>>
>> * clustering of vectors (this has to be done outside the search engine,
>> using DBScan for our use case)
>
> From your experience, have you found any particular clustering
> algorithms better at driving a good performance/recall tradeoff?

Not really, it depends on the use case. We used DBScan above mainly because it
clusters without knowing a priori how many clusters to expect, which is
a parameter that Kmeans requires. Depending on the use case, DBScan may also perform
better on noisy datasets (i.e. those containing entries that do not really belong to any
particular cluster). Noise in vectors obtained from embedding models is quite normal,
especially when the embedding model is not properly tuned/trained.

In our use case, DBScan was more or less the best choice, as it did not bias the
expected clusters.

Also, PostGIS includes an implementation of DBScan for its geometries[2].
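
For reference, the two properties mentioned above (no cluster count given up front, and explicit handling of noise) can be seen in a minimal pure-Python sketch of DBScan. This is a textbook-style illustration, not the PostGIS implementation and not the code we ran; the parameter names `eps` and `min_pts`, the Euclidean metric and the toy points are assumptions for the example.

```python
import math

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Note that the number of clusters is never passed in: it falls out of
    the density parameters eps and min_pts.
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Two dense groups plus one isolated point (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

With these parameters the isolated point is reported as noise instead of being forced into one of the two groups, which is what a centroid-based method with a fixed number of clusters would do.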
 
>> * concurrency in updating the ElasticSearch indexes storing the dense
>> vectors
>
> I do think concurrent updates of vector-based indexes is one area
> PostgreSQL can ultimately be pretty good at, whether in core or in an
> extension.

Oh, that would save a lot of overhead when updating indexed vectors! This is needed
whenever embedding models are re-trained: the vectors are re-generated and the indexes
need to be updated.

Regards,
Giuseppe.
 
