Discussion: something better than pgtrgm?
Hi,
I need a language-unaware text comparison algorithm, so I found pg_trgm.
But I am not so content with it, because the similarities it finds are:
- biased toward strings that share the same first character
- strongly dependent on the strings having similar length
Are there any other options?
(I want to use it for "did you mean ...?" suggestions on codes of approx. 6-10 characters or words of 8-20 letters in mixed languages.)
Cheers,
WBL
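The two complaints above can be reproduced outside the database. The following is a minimal sketch, assuming pg_trgm's documented behaviour of lowercasing, splitting on non-alphanumeric characters, padding each word with two leading spaces and one trailing space, and scoring by shared trigrams over total distinct trigrams. The function names are illustrative, and the real extension differs in details such as multibyte handling.

```python
import re


def trigrams(text: str) -> set:
    """Approximate the trigram set pg_trgm would extract from text."""
    grams = set()
    # Words are runs of letters and digits; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Two spaces in front, one behind: this padding is what gives the
        # start of a word more weight than its end.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(similarity("abcdef", "abcdeg"))    # last character differs
    print(similarity("abcdef", "xbcdef"))    # first character differs
    print(similarity("abc", "abcdefghij"))   # very different lengths
```

In this sketch a wrong first character removes three trigrams from the match while a wrong last character removes only two, and a short string compared with a long one scores low even when it is a perfect prefix, because the union in the denominator grows with the longer string.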
--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth
On Tue, Oct 09, 2012 at 02:10:26PM +0200, Willy-Bas Loos wrote:
> Hi,
>
> I need a *language unaware* text comparison algorithm [. . .]
> (i want to use it for *"did you mean ...?"* for approx 6-10 character codes
> or 8-20 letter words of mixed languages)

I don't think this is going to do what you want, at least from the user's point of view.

The character codes case probably would work in a language-unaware way. But for the mixed languages case, surely it's not _any_ mixed language? Are you mixing Arabic, Farsi, Chinese, and Hindi, for instance?

If not, then you're not really language unaware, but instead constrained by a subset of languages. That is a more tractable problem (for instance, you may not have to worry about direction changes, which vastly simplifies the problem).

Best,

A

--
Andrew Sullivan
ajs@crankycanuck.ca
Hi Andrew, thanks for replying.

On Tue, Oct 9, 2012 at 2:18 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> But for the mixed languages case, surely it's not _any_ mixed
> language? Are you mixing Arabic, Farsi, Chinese, and Hindi, for
> instance?

We're mixing species names of birds in Greek and Latin (scientific names), and all languages spoken in Africa, Europe and western Asia.

> If not, then you're not really language unaware, but instead
> constrained by a subset of languages. That is a more tractable
> problem (for instance, you may not have to worry about direction
> changes, which vastly simplifies the problem).

I'm not very knowledgeable about scripts around the world, but I am afraid that the above list does include scripts that read from right to left.
On Tue, Oct 09, 2012 at 03:10:31PM +0200, Willy-Bas Loos wrote:
> We're mixing species names of birds in greek and latin (scientific names),
> and all languages spoken in africa, europe and western asia.

Yike.

> I'm not very knowledgeable about scripts around the world, but i am afraid
> that the above list does include scripts that read from right to left.

It's much worse than that. It includes at least two variations of Arabic keyboard (depending on which language you are using, for instance, you get a different Unicode encoding of the character YEH, which in some languages has something approximating the frequency of the letter a in English), and you have endless problems with dots versus no dots on Arabic-script spellings (not all uses of Arabic the script are Arabic the language). You also run smack into the problem of correct syllable formation in Brahmi-derived scripts.

If you're going to do something with this sort of language-agnostic "did you mean" work, you will need to be extremely rigorous about normalizing spellings on the way in. Is that a possibility? If so, I can almost imagine a way this could work. If not, well, "internationalization is hard." :-/

A

--
Andrew Sullivan
ajs@crankycanuck.ca
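To make "normalizing spellings on the way in" concrete, here is a minimal sketch of what such a step might look like before a name is stored or compared. It assumes Unicode NFKC normalization plus case folding as a baseline, and adds a tiny hand-written fold table for the YEH case mentioned above. The table `YEH_VARIANTS` and the function `normalize_name` are hypothetical names for illustration; a real table would have to be worked out per script and language, and folding ALEF MAKSURA into YEH in particular is a lossy choice that will not suit every language.

```python
import unicodedata

# Hypothetical, deliberately tiny fold table: code points that different
# keyboard layouts produce for what a user regards as the same letter.
YEH_VARIANTS = {
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH   -> ARABIC LETTER YEH
    "\u0649": "\u064A",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER YEH (lossy)
}
_FOLD = str.maketrans(YEH_VARIANTS)


def normalize_name(raw: str) -> str:
    """Return a canonical spelling to store and compare against."""
    # NFKC composes base letter + combining mark and replaces compatibility
    # forms; it does NOT merge FARSI YEH and YEH, hence the fold table.
    text = unicodedata.normalize("NFKC", raw)
    text = text.casefold()
    # Case folding can expand characters, so normalize once more.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_FOLD)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_name("  Ciconia   CICONIA "))
    print(normalize_name("\u06CC") == normalize_name("\u064A"))
```

The same function would be applied both when admins enter a name and when a search term arrives, so that trigram comparison only ever sees one spelling per name.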
On Tue, Oct 9, 2012 at 3:23 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> you will need to be extremely rigorous about
> normalizing spellings on the way in. Is that a possibility?

Yes, it is.

> If so, I
> can almost imagine a way this could work

Great! How?
On Tue, Oct 09, 2012 at 03:54:35PM +0200, Willy-Bas Loos wrote:
> > If so, I
> > can almost imagine a way this could work
>
> Great! How?

Well, it involves very large tables. But basically, you work out a "variant" table for any language you like, and then query across it with subsets of the trigrams you were just working with. It probably sucks in performance, but at least you're likely to get valid sequences this way.

For inspiration on this (and why I have so much depressing news on the subject of internationalization in a multi-script and multi-lingual environment), see RFC 3743 and RFC 4290. These are related (among other things) to how to make "variants" of different DNS labels somehow hang together. The problem is not directly related to what you're working on, but it's a similar sort of problem: people have rough ideas of what they're entering, and they need an exact match. You have the good fortune of being able to provide them with a hint! I wish I were in your shoes.

A

--
Andrew Sullivan
ajs@crankycanuck.ca
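The message above does not spell out the table layout, so the following is only one possible reading of the idea, sketched in memory rather than in SQL. It assumes a "variant" table that maps each variant character to a canonical one, an inverted index from trigrams of the canonical spelling to stored names, and a lookup that ranks candidates by trigram overlap. `VARIANTS`, `VariantIndex`, and the sample entries are all hypothetical.

```python
from collections import defaultdict

# Hypothetical variant table: variant character -> canonical character.
VARIANTS = {
    "\u06CC": "\u064A",  # FARSI YEH -> YEH
    "\u00e9": "e",       # e with acute -> e
    "\u00e7": "c",       # c with cedilla -> c
}


def canonical(text: str) -> str:
    """Rewrite text through the variant table, character by character."""
    return "".join(VARIANTS.get(ch, ch) for ch in text.lower())


def trigrams(text: str) -> set:
    """Trigrams of the whole string, padded in the style of pg_trgm."""
    padded = "  " + text + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class VariantIndex:
    """Inverted index from canonical trigrams to the names as entered."""

    def __init__(self):
        self._postings = defaultdict(set)  # trigram -> set of names
        self._grams = {}                   # name -> its trigram set

    def add(self, name: str) -> None:
        grams = trigrams(canonical(name))
        self._grams[name] = grams
        for gram in grams:
            self._postings[gram].add(name)

    def suggest(self, query: str, limit: int = 3) -> list:
        wanted = trigrams(canonical(query))
        # Only names sharing at least one trigram are scored at all.
        candidates = set()
        for gram in wanted:
            candidates |= self._postings.get(gram, set())
        scored = [
            (len(wanted & self._grams[n]) / len(wanted | self._grams[n]), n)
            for n in candidates
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:limit]]


if __name__ == "__main__":
    index = VariantIndex()
    for entry in ["h\u00e9ron", "ibis", "herring gull"]:
        index.add(entry)
    print(index.suggest("herom"))
```

Because every suggestion comes out of the index of entered names, whatever is proposed to the user is a spelling that actually exists, which seems to be the "valid sequences" property mentioned above. The cost is the extra table and index, consistent with the performance warning.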
Thanks, but no, we do need the performance.
And we have admins (not users) entering the names and codes, but we can't make that too complicated for them.
I thought you meant that they would see to it that the names end up in the database in the correct encoding (which is a logical thing to do...).
Thanks anyway :)!
WBL
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)