Re: [pg_trgm] Making similarity(?, ?) < ? use an index

Поиск
Список
Период
Сортировка
От Jeff Janes
Тема Re: [pg_trgm] Making similarity(?, ?) < ? use an index
Дата
Msg-id CAMkU=1z4ZA-xHBbtRzUEqqT36ac_X2x1Tfa7DBawL03_rWJnGg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Greg Navis <contact@gregnavis.com>)
Ответы Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Greg Navis <contact@gregnavis.com>)
Список pgsql-general
On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for your replies.
>
> Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> which of course is completely equivalent to `lhs % rhs`, I wanted to write
> `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> with ternary operators is quite a bit of work. I might have a simpler idea:

The problem is that you would have to build your index on the
expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
would have to be a column, and the other would have to be a constant
string at the time you define the index, which is the same constant
string as specified when you want to use the index.  This could work
as long as there are only a few strings you ever care about checking
similarity to, and you would build an index for each one.  These would
be ordinary btree indexes, not gin indexes.   I've done this before,
only using substring rather than similarity.


> pg_trgm also provides `<->` but it seems this operator doesn't use indexes
> either. It seems the shortest path to per-query thresholds, without
> compromising the design, is making this operator use the index.

To be efficient, the index has to know where to start and stop its
scan.  Such a thing doesn't exists for "a<->b", there is no stopping
point.  It would have to be "a <-> b < c", at which point you are back
to ternary again.

> Please help
> me understand whether my reasoning is correct. If it is, I'd appreciate a
> high-level overview of what needs to be done. I can block a few hours to
> work on this in the upcoming weeks.

I think that you are proposing rewriting the indexing architecture
from scratch.  Doing that in a way that is robust and tested and
backward compatible would  probably take years, not hours, for a
single individual who is not already extensively experienced in the
internals of the code.  And, it is not really clear even what you want
to rewrite it into.

Tom's proposal is to do it in a way that works with the existing
architecture, rather than trying to re-write that architecture.
(Which could probably not be done in a few hours, either, especially
not once you write documentation and upgrade scripts and all the other
stuff that goes into production-quality code.)

I don't know if this would even be appropriate as an addition to
pg_trgm.  We might want to fork that code instead.  That would be a
shame, because the underlying c code would be the fundamentally the
same, but the alternative would be to force people who like % and
set_limit() to carry around the baggage of new operators and types
they have no interest in using, and vice versa.  True, we did just add
several new functions and operators to pg_trgm that many people will
have no interest in, so maybe that is not a big deal.

Cheers,

Jeff


В списке pgsql-general по дате отправления:

Предыдущее
От: Vik Fearing
Дата:
Сообщение: Re: WAL's listing in pg_xlog by some sql query
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [pg_trgm] Making similarity(?, ?) < ? use an index