Thread: something better than pgtrgm?


something better than pgtrgm?

From
Willy-Bas Loos
Date:
Hi,

I need a language-unaware text comparison algorithm, so I found pg_trgm.
But I am not so content with it, because the similarities it finds are:
  • biased in favor of text that shares the same first character
  • heavily dependent on the strings being of similar length

Are there any other options?

(I want to use it for "did you mean ...?" for approx. 6-10 character codes or 8-20 letter words of mixed languages)

Cheers,

WBL
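
For readers who want to reproduce the two complaints above, here is a minimal Python sketch of trigram similarity. It assumes the behaviour described in the pg_trgm documentation (lower-casing, two spaces of padding in front of a word and one behind, shared trigrams divided by all distinct trigrams) and handles single words only. The function names are made up for this illustration and it is not the extension's actual code.

```python
def trigrams(word: str) -> set[str]:
    """Trigram set of a single word, padded the way pg_trgm documents it."""
    padded = "  " + word.lower() + " "  # two spaces in front, one behind
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Same word with a wrong first letter, a wrong last letter,
    # and a longer word that contains it as a prefix.
    for other in ("xparrow", "sparrox", "sparrowhawk"):
        print(f"sparrow vs {other}: {similarity('sparrow', other):.2f}")
```

Under these assumptions a wrong first letter disturbs three trigrams while a wrong last letter disturbs only two, so "xparrow" scores lower against "sparrow" than "sparrox" does. The union in the denominator is also what makes strings of different length score low even when one contains the other.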
--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth

Re: something better than pgtrgm?

From
Andrew Sullivan
Date:
On Tue, Oct 09, 2012 at 02:10:26PM +0200, Willy-Bas Loos wrote:
> Hi,
>
> I need a *language unaware* text comparison algorithm

[. . .]

> (i want to use it for *"did you mean ...?"* for approx 6-10 character codes
> or 8-20 letter words of mixed languages)

I don't think this is going to do what you want, at least from the
user's point of view.

The character codes case probably would work in a language-unaware
way.

But for the mixed languages case, surely it's not _any_ mixed
language?  Are you mixing Arabic, Farsi, Chinese, and Hindi, for
instance?

If not, then you're not really language unaware, but instead
constrained by a subset of languages.  That is a more tractable
problem (for instance, you may not have to worry about direction
changes, which vastly simplifies the problem).

Best,

A

--
Andrew Sullivan
ajs@crankycanuck.ca


Re: something better than pgtrgm?

From
Willy-Bas Loos
Date:
Hi Andrew, thanks for replying.

On Tue, Oct 9, 2012 at 2:18 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> But for the mixed languages case, surely it's not _any_ mixed
> language?  Are you mixing Arabic, Farsi, Chinese, and Hindi, for
> instance?

We're mixing species names of birds in Greek and Latin (scientific names), and all languages spoken in Africa, Europe and western Asia.

> If not, then you're not really language unaware, but instead
> constrained by a subset of languages.  That is a more tractable
> problem (for instance, you may not have to worry about direction
> changes, which vastly simplifies the problem).

I'm not very knowledgeable about scripts around the world, but I am afraid that the above list does include scripts that read from right to left.


--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth

Re: something better than pgtrgm?

From
Andrew Sullivan
Date:
On Tue, Oct 09, 2012 at 03:10:31PM +0200, Willy-Bas Loos wrote:
> >
> We're mixing species names of birds in greek and latin (scientific names),
> and all languages spoken in africa, europe and western asia.

Yikes.

> I'm not very knowledgeable about scripts around the world, but i am afraid
> that the above list does include scripts that read from right to left.

It's much worse than that.

It includes at least two variations of Arabic keyboard (depending on
which language you are using, for instance, you get a different
Unicode encoding of the character YEH, which in some languages has
something approximating the frequency of the letter a in English), and
you have endless problems with dots versus no dots on Arabic-script
spellings (not all uses of Arabic the script are Arabic the
language).  You also run smack into the problem of correct syllable
formation in Brahmi-derived scripts.

If you're going to do something with this sort of language-agnostic
"did you mean" work, you will need to be extremely rigorous about
normalizing spellings on the way in.  Is that a possibility?  If so, I
can almost imagine a way this could work.  If not, well,
"internationalization is hard."  :-/

A

--
Andrew Sullivan
ajs@crankycanuck.ca
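
As a rough illustration of what "normalizing spellings on the way in" could look like, the Python sketch below applies Unicode normalization and case folding, then maps a few look-alike code points onto one representative. The fold table is a hypothetical two-entry example built around the YEH case mentioned above. A real table would have to be worked out per language, and folding letters that a given language treats as distinct is a deliberate trade-off, not a recommendation.

```python
import unicodedata

# Hypothetical fold table: code points that might be typed interchangeably
# are mapped onto one representative. Illustrative only, not a complete or
# linguistically authoritative list.
FOLD = str.maketrans({
    "\u06cc": "\u064a",  # ARABIC LETTER FARSI YEH   -> ARABIC LETTER YEH
    "\u0649": "\u064a",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER YEH
})


def normalize_name(text: str) -> str:
    """Return a comparison key for a name; store it next to the original."""
    composed = unicodedata.normalize("NFKC", text)   # unify composed/decomposed forms
    folded = composed.casefold()                     # language-neutral lower-casing
    folded = unicodedata.normalize("NFKC", folded)   # case folding can denormalize
    return folded.translate(FOLD).strip()


if __name__ == "__main__":
    # "e" + combining acute accent versus the precomposed upper-case letter
    print(normalize_name("Cafe\u0301") == normalize_name("CAF\u00c9"))
    # Farsi YEH versus Arabic YEH
    print(normalize_name("\u06cc") == normalize_name("\u064a"))
```

The key would be computed both when an admin stores a name and when a user's input is compared, so that the similarity search only ever sees normalized text.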


Re: something better than pgtrgm?

From
Willy-Bas Loos
Date:


On Tue, Oct 9, 2012 at 3:23 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> you will need to be extremely rigorous about
> normalizing spellings on the way in.  Is that a possibility?

Yes, it is.

> If so, I
> can almost imagine a way this could work

Great! How?

--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth

Re: something better than pgtrgm?

From
Andrew Sullivan
Date:
On Tue, Oct 09, 2012 at 03:54:35PM +0200, Willy-Bas Loos wrote:
>
> >  If so, I
> > can almost imagine a way this could work
> >
>
> Great! How?

Well, it involves very large tables.  But basically, you work out a
"variant" table for any language you like, and then query across it
with subsets of the trigrams you were just working with.  It probably
sucks in performance, but at least you're likely to get valid
sequences this way.

For inspiration on this (and why I have so much depressing news on the
subject of internationalization in a multi-script and multi-lingual
environment), see RFC 3743 and RFC 4290.  These are related (among
other things) to how to make "variants" of different DNS labels
somehow hang together.  The problem is not directly related to what
you're working on, but it's a similar sort of problem: people have
rough ideas of what they're entering, and they need an exact match.
You have the good fortune of being able to provide them with a hint!
I wish I were in your shoes.

A

--
Andrew Sullivan
ajs@crankycanuck.ca
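
The "variant table" idea is only outlined above, so the following Python sketch is one possible reading of it rather than the method Andrew had in mind. It assumes a table that maps every accepted spelling to one canonical name, indexes those spellings by trigram, and answers a "did you mean" query by scoring only the spellings that share at least one trigram with the input. The table contents, function names and the 0.3 threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical variant table: every accepted spelling points at one canonical name.
VARIANTS = {
    "passer domesticus": "Passer domesticus",
    "house sparrow": "Passer domesticus",
    "huismus": "Passer domesticus",
    "haussperling": "Passer domesticus",
    "erithacus rubecula": "Erithacus rubecula",
    "european robin": "Erithacus rubecula",
    "roodborst": "Erithacus rubecula",
}


def trigrams(text):
    """Trigrams of each whitespace-separated word, padded as pg_trgm documents."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def build_index(variants):
    """Inverted index: trigram -> set of spellings that contain it."""
    index = defaultdict(set)
    for spelling in variants:
        for gram in trigrams(spelling):
            index[gram].add(spelling)
    return index


def did_you_mean(query, variants, index, threshold=0.3):
    """Canonical names whose best-matching variant reaches the threshold."""
    query_grams = trigrams(query)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    best = {}
    for spelling in candidates:
        grams = trigrams(spelling)
        score = len(query_grams & grams) / len(query_grams | grams)
        if score >= threshold:
            canonical = variants[spelling]
            best[canonical] = max(score, best.get(canonical, 0.0))
    # Highest score first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    index = build_index(VARIANTS)
    print(did_you_mean("huismuss", VARIANTS, index))
    print(did_you_mean("roodborts", VARIANTS, index))
```

Because every variant is a spelling that someone entered deliberately, a hit always resolves to a valid name. The cost is the one mentioned above: the table grows with every language and spelling that has to be covered.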


Re: something better than pgtrgm?

From
Willy-Bas Loos
Date:
Thanks, but no, we do need the performance.
We have admins (not users) enter the names and codes, but we can't make that overly complicated for them.
I thought you meant that they would see to it that the names end up in the database in the correct encoding (which is a logical thing to do).

Thanks anyway :)!

WBL

On Tue, Oct 9, 2012 at 5:16 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
[. . .]


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general