Re: [pg_trgm] Making similarity(?, ?) < ? use an index

Поиск
Список
Период
Сортировка
От Greg Navis
Тема Re: [pg_trgm] Making similarity(?, ?) < ? use an index
Дата
Msg-id CAA6WWt8xgOc2PeJRtmHcGYnQuqOr1j561p7hakefwZGT0whwJQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Jeff Janes <jeff.janes@gmail.com>)
Ответы Re: [pg_trgm] Making similarity(?, ?) < ? use an index
Список pgsql-general
Thanks for the replies.

On Sat, Jun 4, 2016 at 8:48 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for your replies.
>
> Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> which of course is completely equivalent to `lhs % rhs`, I wanted to write
> `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> with ternary operators is quite a bit of work. I might have a simpler idea:

The problem is that you would have to build your index on the
expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
would have to be a column, and the other would have to be a constant
string at the time you define the index, which is the same constant
string as specified when you want to use the index.  This could work
as long as there are only a few strings you ever care about checking
similarity to, and you would build an index for each one.  These would
be ordinary btree indexes, not gin indexes.   I've done this before,
only using substring rather than similarity.


> pg_trgm also provides `<->` but it seems this operator doesn't use indexes
> either. It seems the shortest path to per-query thresholds, without
> compromising the design, is making this operator use the index.

To be efficient, the index has to know where to start and stop its
scan.  Such a thing doesn't exists for "a<->b", there is no stopping
point.  It would have to be "a <-> b < c", at which point you are back
to ternary again.
Sorry, that was sloppy thinking on my end. I was fixated on looking for something simple and saw things that aren't there. ;-) You're absolutely right that we're back to ternary with `a <-> b < c`.

> Please help
> me understand whether my reasoning is correct. If it is, I'd appreciate a
> high-level overview of what needs to be done. I can block a few hours to
> work on this in the upcoming weeks.

I think that you are proposing rewriting the indexing architecture
from scratch.  Doing that in a way that is robust and tested and
backward compatible would  probably take years, not hours, for a
single individual who is not already extensively experienced in the
internals of the code.  And, it is not really clear even what you want
to rewrite it into.
No rewrites, please. ;-) I was just confused. I'm looking for a simple, incremental contribution, not a rewrite. 

Tom's proposal is to do it in a way that works with the existing
architecture, rather than trying to re-write that architecture.
(Which could probably not be done in a few hours, either, especially
not once you write documentation and upgrade scripts and all the other
stuff that goes into production-quality code.)
Thanks for the info. I blocked a few hours for open source work every week so whether it takes 10 or 40 hours shouldn't be a problem. 

I don't know if this would even be appropriate as an addition to
pg_trgm.  We might want to fork that code instead.  That would be a
shame, because the underlying c code would be the fundamentally the
same, but the alternative would be to force people who like % and
set_limit() to carry around the baggage of new operators and types
they have no interest in using, and vice versa.  True, we did just add
several new functions and operators to pg_trgm that many people will
have no interest in, so maybe that is not a big deal.

Cheers,

Jeff

Would this be a better plan then:

1. Add support for trigram operators.
2. Implement `issimilar(lhs, rhs, threshold)`.
3. Add `issimilar` to the trigram operator classes.

PS Should we move this discussion to pgsql-hackers?
--
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.

В списке pgsql-general по дате отправления:

Предыдущее
От: Tatsuo Ishii
Дата:
Сообщение: Re: High availability and load balancing ...
Следующее
От: Artur Zakirov
Дата:
Сообщение: Re: [pg_trgm] Making similarity(?, ?) < ? use an index