Discussion: BUG #15347: Unaccent for greek characters does not work

BUG #15347: Unaccent for greek characters does not work

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15347
Logged by:          Tasos Maschalidis
Email address:      tas.o.s@hotmail.com
PostgreSQL version: 9.3.18
Operating system:   Ubuntu 4.8.4
Description:

Call to unaccent function with greek characters does not return the greek
characters without the accents as expected (not even just the few diacritics
used in modern Greek). While the customization of unaccent.rules is an
option for dedicated servers, most cloud services do not provide write
access to the file system and thus this is limiting the unaccent feature for
greek characters. This forces us to find workarounds for something
relatively simple (just some extra characters with diacritics in the
official dictionary). Please find more details in this Stack Overflow
answer: https://stackoverflow.com/a/49849260/5909738

Thank you,
Tasos Maschalidis


Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Thu, Aug 23, 2018 at 3:08 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      15347
> Logged by:          Tasos Maschalidis
> Email address:      tas.o.s@hotmail.com
> PostgreSQL version: 9.3.18
> Operating system:   Ubuntu 4.8.4
> Description:
>
> Call to unaccent function with greek characters does not return the greek
> characters without the accents as expected (not even just the few diacritics
> used in modern Greek).

Hello Tasos,

Right.  We generate the unaccent.rules file from the Unicode data file
using the Python script contrib/unaccent/generate_unaccent_rules.py in
the PostgreSQL source tree.  The script currently limits itself to
Latin characters here:

def is_plain_letter(codepoint):
    """Return true if codepoint represents a plain ASCII letter."""
    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z'))

I was not brave enough to support other kinds of characters, because I
can't read 'em and check if the results are garbage (if you remove the
diacritics from Klingon, it might change the meaning of any word into
a declaration of war for all I know).  If you know Python and would
like to have a go at modifying that script to support Greek, please
do!  Otherwise perhaps I could try to do it and you could review the
results.
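
To make the effect of that check concrete, here is a minimal sketch. It is not the code from generate_unaccent_rules.py: it uses Python's standard unicodedata module instead of parsing the Unicode data file, and the helper names is_ascii_letter, is_ascii_or_greek_letter, and strip_marks are made up for illustration.

```python
import unicodedata

def is_ascii_letter(ch):
    """Mirror of the ASCII-only check quoted above."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")

def is_ascii_or_greek_letter(ch):
    """Hypothetical wider check: also accept the basic Greek alphabet."""
    return (is_ascii_letter(ch)
            or 0x03B1 <= ord(ch) <= 0x03C9   # small alpha .. small omega
            or 0x0391 <= ord(ch) <= 0x03A9)  # capital alpha .. capital omega

def strip_marks(ch, is_plain):
    """Return the plain base letter of ch, or None if there is none."""
    if is_plain(ch):
        return ch
    decomposition = unicodedata.decomposition(ch)
    if not decomposition or decomposition.startswith("<"):
        return None  # no canonical decomposition to follow
    parts = [chr(int(cp, 16)) for cp in decomposition.split()]
    base, marks = parts[0], parts[1:]
    if not all(unicodedata.category(m).startswith("M") for m in marks):
        return None  # something other than combining marks is attached
    return strip_marks(base, is_plain)

e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
alpha_tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS

print(strip_marks(e_acute, is_ascii_letter))
print(strip_marks(alpha_tonos, is_ascii_letter))
print(strip_marks(alpha_tonos, is_ascii_or_greek_letter) == "\u03b1")
```

With the ASCII-only predicate the accented Greek letter produces no mapping at all, while the wider predicate maps it to plain alpha.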

There is a precedent already that it knows how to remove a diacritic
from at least one Cyrillic character.  I think there is no reason at
all we shouldn't take a patch to support Greek or any other alphabet
that a native speaker can advise us on.

I think the chances of squeaking a change into PostgreSQL 11 are slim,
since it would require a special exception from the Release Management
Team at this point.  Failing that, it'd be for PostgreSQL 12.  We
don't usually back-patch unaccent.rules changes because they can
affect indexed data, and we don't want minor version upgrades to
break stuff.
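
The concern about indexed data can be illustrated with a toy model. This is only a sketch under loud assumptions: a Python dict stands in for an expression index, and OLD_RULES and NEW_RULES are hypothetical two-entry rule sets, not the real unaccent.rules contents.

```python
# Hypothetical, tiny rule sets standing in for two versions of unaccent.rules.
OLD_RULES = {"\u00e9": "e"}                      # e-acute only
NEW_RULES = {"\u00e9": "e", "\u03ad": "\u03b5"}  # plus Greek epsilon with tonos

def unaccent(text, rules):
    """Replace each character that has a rule; leave the rest untouched."""
    return "".join(rules.get(ch, ch) for ch in text)

word = "\u0398\u03ad\u03bc\u03b1"  # the Greek word used as an example later in the thread

# Toy "expression index": the key was computed when the row was written,
# using the rules that were installed at that time.
index = {unaccent(word, OLD_RULES): "row 1"}

# After a rules change, lookups compute their key with the new rules.
probe = unaccent(word, NEW_RULES)
print(index.get(probe))
```

The lookup misses because the stored key still carries the accent; in this toy model only rebuilding the dict with the new rules would fix it.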

[1] https://www.postgresql.org/message-id/CAEepm%3D1KRVinFtuDao4L%2BqSBh4T4k3z996EwD5-zgytu4Qa5Fw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Michael Paquier
Date:
On Thu, Aug 23, 2018 at 05:22:21PM +1200, Thomas Munro wrote:
> There is a precedent already that it knows how to remove a diacritic
> from at least one Cyrillic character.  I think there is no reason at
> all we shouldn't take a patch to support Greek or any other alphabet
> that a native speaker can advise us on.

+1.  ec0a69e4 recently added support for Vietnamese characters.
Once you get into it, hacking this Python code is not that difficult.

> I think the chances of squeaking a change into PostgreSQL 11 are slim,
> since it would require a special exception from the Release Management
> Team at this point.  Failing that, it'd be for PostgreSQL 12.  We
> don't usually back-patch unaccent.rules changes because they can
> affect indexed data, and we don't want minor version upgrades to
> break stuff.

Getting that into v11 would be too late :(
--
Michael


Re: BUG #15347: Unaccent for greek characters does not work

From
Tasos Maschalidis
Date:

Hi Thomas,

Your concerns are understandable, especially when Klingon is taken into consideration.

I am not familiar enough with Python to set up and run the script and check the result, but I am more than willing to review the results! If you need any more input from me (as a native Greek speaker), please ask away!

If I understood correctly, to include the Greek characters the method would need to change to this:

    return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
           (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
           (codepoint.id >= ord('α') and codepoint.id <= ord('ω')) or \
           (codepoint.id >= ord('Α') and codepoint.id <= ord('Ω'))

Thanks,

Tasos Maschalidis

PS: This gist is what the results should look like, considering Greek characters (lines 190-409).

Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Fri, Aug 24, 2018 at 12:22 AM, Tasos Maschalidis <TaS.O.S@hotmail.com> wrote:
> return (codepoint.id >= ord('a') and codepoint.id <= ord('z')) or \
>            (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
>
>            (codepoint.id >= ord('α') and codepoint.id <= ord('ω')) or \
>            (codepoint.id >= ord('Α') and codepoint.id <= ord('Ω'))

Thank you.  Here it is in the form of a patch that I propose to commit
to PostgreSQL 12.  It adds 221 lines to unaccent.rules.  They look
sane to my untrained eye.  Do you agree?

Example of use:

postgres=# select unaccent('Θέμα: Re: BUG #15347: Unaccent for greek ...');
                   unaccent
----------------------------------------------
 Θεμα: Re: BUG #15347: Unaccent for greek ...
(1 row)

I wondered if the documentation might need a change, but it already
says something broad enough: "A more complete example, which is
directly useful for most European languages, can be found in
unaccent.rules, ...".

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Tasos Maschalidis
Date:
Hi Thomas,

The results are legit for all vowels. There is only one thing missing, which I guess does fall under unaccent functionality. When a "σ" is the last letter of a word, it is written as "ς", unless the whole word is in capitals, in which case it stays the same ("Σ") even at the end of the word. In searches it is useful to convert any "ς" to "σ". I had included this in a custom unaccent.rules file I was using, and it produced the desired results. For example, searching for "Θωμάς" would not match "ΘΩΜΑΣ" unless such a conversion exists. I am not sure whether that should be taken care of somewhere else, but in my case (and also in the gist I sent you, see the last comments) it proved useful and made sense.

Thank you,
Tasos Maschalidis


Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Fri, Aug 24, 2018 at 10:47 AM, Tasos Maschalidis <tas.o.s@hotmail.com> wrote:
> The results are legit for all vowels.

Cool.

> There is only one thing missing, which
> I guess does fall under unaccent functionality. When a "σ" is the
> last letter of a word, it is written as "ς", unless the whole
> word is in capitals, in which case it stays the same ("Σ") even at the end
> of the word. In searches it is useful to convert any "ς" to "σ". I had
> included this in a custom unaccent.rules file I was using, and it produced
> the desired results. For example, searching for "Θωμάς" would not match
> "ΘΩΜΑΣ" unless such a conversion exists. Not sure if that should be taken
> care of somewhere else, but in my case (and also in the gist I sent you,
> see the last comments) it proved useful and made sense.

Hmm, I see.  Also described here:

https://en.wikipedia.org/wiki/Sigma

I take it you are making searches case insensitive by converting
everything to lower case.  Since you have a distinction that exists in
lower case but not in upper case, wouldn't it make more sense to
convert everything to upper case?

postgres=# select upper('Θωμάς'), upper('Θωμάσ'), upper('Θωμάσ') =
upper('Θωμάς');
 upper | upper | ?column?
-------+-------+----------
 ΘΩΜΆΣ | ΘΩΜΆΣ | t
(1 row)
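
The same folding idea can be sketched outside the database. The snippet below is an assumption-laden Python analogue, not PostgreSQL's unaccent: the hypothetical helper fold_for_search upper-cases the text and then drops combining marks via Unicode decomposition, so both sigma forms and the accent collapse into one search key.

```python
import unicodedata

def fold_for_search(text):
    """Upper-case, then drop combining marks (a rough stand-in for unaccent)."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

final_sigma = "\u0398\u03c9\u03bc\u03ac\u03c2"   # name ending in final sigma
medial_sigma = "\u0398\u03c9\u03bc\u03ac\u03c3"  # same name, ordinary small sigma
capitals = "\u0398\u03a9\u039c\u0391\u03a3"      # all capitals, no accent

print(fold_for_search(final_sigma) == capitals)
print(fold_for_search(medial_sigma) == capitals)
```

Upper-casing alone is not enough to match the all-capitals spelling, because the accent survives it (as the query output above shows); removing the marks as well is what makes the three forms compare equal here.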

PS On PostgreSQL mailing lists, we try to avoid "top posting" (=
leaving the message we're replying to below our reply), because it
makes the archive of email threads harder to read.

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Michael Paquier
Date:
On Fri, Aug 24, 2018 at 10:16:14AM +1200, Thomas Munro wrote:
> I wondered if the documentation might need a change, but it already
> says something broad enough: "A more complete example, which is
> directly useful for most European languages, can be found in
> unaccent.rules, ...".

Perhaps it would be better to avoid non-ASCII characters in this script?
--
Michael


Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Fri, Aug 24, 2018 at 12:12 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Aug 24, 2018 at 10:16:14AM +1200, Thomas Munro wrote:
>> I wondered if the documentation might need a change, but it already
>> says something broad enough: "A more complete example, which is
>> directly useful for most European languages, can be found in
>> unaccent.rules, ...".
>
> Perhaps it would be better to avoid non-ASCII characters in this script?

You mean in the Python script?  Why?  At the top it has a PEP-263
encoding declaration:

# -*- coding: utf-8 -*-

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Fri, Aug 24, 2018 at 12:12 PM, Michael Paquier <michael@paquier.xyz> wrote:
>> Perhaps it would be better to avoid non-ASCII characters in this script?

> You mean in the Python script?  Why?  At the top it has a PEP-263
> encoding declaration:
> # -*- coding: utf-8 -*-

What happens if someone tries to view this in a non-UTF8 encoding?

As a comparison point, we generally avoid using non-ASCII characters
directly in the SGML docs; we write out the appropriate SGML entity
instead.  I think we should try to do the equivalent thing here ---
I assume python has some way to write "U+nnnn" or some such.
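
For reference, Python does offer ASCII-only ways to spell a non-ASCII character. The short sketch below shows three of them for GREEK SMALL LETTER ALPHA; the variable names are arbitrary, and the escapes are shown as Python 3 string literals.

```python
# Three ASCII-only spellings of the same character, GREEK SMALL LETTER ALPHA.
by_number = chr(0x03B1)                       # numeric code point
by_escape = "\u03b1"                          # \uXXXX escape in a string literal
by_name = "\N{GREEK SMALL LETTER ALPHA}"      # named escape in a string literal

print(by_number == by_escape == by_name)
print(hex(ord(by_escape)))
```

Any of these keeps the source file pure ASCII, so it reads the same regardless of the encoding a viewer assumes.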

            regards, tom lane


Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Fri, Aug 24, 2018 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Fri, Aug 24, 2018 at 12:12 PM, Michael Paquier <michael@paquier.xyz> wrote:
>>> Perhaps it would be better to avoid non-ASCII characters in this script?
>
>> You mean in the Python script?  Why?  At the top it has a PEP-263
>> encoding declaration:
>> # -*- coding: utf-8 -*-
>
> What happens if someone tries to view this in a non-UTF8 encoding?
>
> As a comparison point, we generally avoid using non-ASCII characters
> directly in the SGML docs; we write out the appropriate SGML entity
> instead.  I think we should try to do the equivalent thing here ---
> I assume python has some way to write "U+nnnn" or some such.

Ok, 2 against 1.  Done.

I'll wait for other opinions on what to do about lower case sigma
before committing.  I'm not keen on adding that special case because:

1.  It's a new kind of thing: previously we did only accent and
ligature removal, but this is removal of variants that exist in only
one case.  It's admittedly a bit like the German ß, which lacks an
upper case version according to some German speakers and undergoes a
lossy conversion to double-S, but that was already handled without a
special case by ligature expansion, so it's not the same thing.

2.  We are down to only 5 hardcoded special cases: two Cyrillic
characters which I suspect will go away if we allow Cyrillic to be
processed via the general mechanism as we are doing here with Greek,
and 3 oddballs that we inherited from the old hand-maintained
unaccent.rules files: DEGREE CELSIUS, DEGREE FAHRENHEIT, and SOUND
RECORDING COPYRIGHT.  I think the degrees signs can be done
automatically with just a bit more Unicode smarts, and I might try
reporting SOUND RECORDING COPYRIGHT as missing from
<character-fallback> to the CLDR project whose data we're using.

3.  The problem seems to go away by itself if you convert to upper case.

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Michael Paquier
Date:
On Fri, Aug 24, 2018 at 03:32:28PM +1200, Thomas Munro wrote:
> Ok, 2 against 1.  Done.

Thanks for considering it.  I have not gone through the patch in detail
but...

+           (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
+           (codepoint.id >= 0x03b1 and codepoint.id <= 0x03c9) or \
+           (codepoint.id >= 0x0391 and codepoint.id <= 0x03a9)

...  If you could add notes about what those codepoints are, or just
allocate them in a variable with a proper name, that would help with the
readability.  My apologies for the nits on this thread.
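
One possible shape for the named-variable variant is sketched below. It is an illustration only, not the committed patch: the Codepoint namedtuple is a stand-in that models just the .id attribute seen in the quoted snippets, and the constant names are chosen here to match the Unicode character names.

```python
from collections import namedtuple

# Stand-in for the script's code point objects; only the .id attribute,
# which appears in the snippets quoted in this thread, is modelled.
Codepoint = namedtuple("Codepoint", ["id"])

# Range endpoints, named after their Unicode character names.
GREEK_SMALL_LETTER_ALPHA = 0x03B1
GREEK_SMALL_LETTER_OMEGA = 0x03C9
GREEK_CAPITAL_LETTER_ALPHA = 0x0391
GREEK_CAPITAL_LETTER_OMEGA = 0x03A9

def is_plain_letter(codepoint):
    """Return true if codepoint is a plain ASCII or Greek letter."""
    return (ord("a") <= codepoint.id <= ord("z")
            or ord("A") <= codepoint.id <= ord("Z")
            or GREEK_SMALL_LETTER_ALPHA <= codepoint.id <= GREEK_SMALL_LETTER_OMEGA
            or GREEK_CAPITAL_LETTER_ALPHA <= codepoint.id <= GREEK_CAPITAL_LETTER_OMEGA)

print(is_plain_letter(Codepoint(0x03B1)))  # plain small alpha
print(is_plain_letter(Codepoint(0x03AC)))  # small alpha with tonos
```

The accented alpha (U+03AC) sits between the two ranges, so it is not treated as a plain letter, which is the point of the check.
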
--
Michael


Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Fri, Aug 24, 2018 at 11:35 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Aug 24, 2018 at 03:32:28PM +1200, Thomas Munro wrote:
> Thanks for considering it.  I have not gone through the patch in detail
> but...
>
> +           (codepoint.id >= ord('A') and codepoint.id <= ord('Z')) or \
> +           (codepoint.id >= 0x03b1 and codepoint.id <= 0x03c9) or \
> +           (codepoint.id >= 0x0391 and codepoint.id <= 0x03a9)
>
> ...  If you could add notes about what those codepoints are, or just
> allocate them in a variable with a proper name, that would help with the
> readability.  My apologies for the nits on this thread.

Fair criticism, here's a version with comments.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Michael Paquier
Date:
On Tue, Aug 28, 2018 at 10:50:38AM +1200, Thomas Munro wrote:
> Fair criticism, here's a version with comments.

Thanks, that's way better in my opinion.  On the fancier side, today
I discovered the Python module unicodedata, which can replace, for
example, 0x03b1 with ord("\N{GREEK SMALL LETTER ALPHA}"), leading to
perhaps more readable code.

Joking aside, I would have preferred that you use the Unicode code
points directly, as those are easier to look up in UnicodeData.txt, say
'\u03B1' for small alpha.  If you want to go with the hex code, it would
be better to copy the character names directly from UnicodeData.txt, as
those are easier to search for in the future, perhaps together with
their code points:
- GREEK SMALL LETTER ALPHA
- GREEK SMALL LETTER OMEGA
- GREEK CAPITAL LETTER ALPHA
- GREEK CAPITAL LETTER OMEGA

Running generate_unaccent_rules.py, I get the same result for
unaccent.rules as you do.
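
As a cross-check of those four names against the hex values in the patch, the standard unicodedata module can translate in both directions. This is a small sketch; the ENDPOINTS tuple and the names dict are illustrative, not part of the script.

```python
import unicodedata

# The four range endpoints from the patch, as numeric code points.
ENDPOINTS = (0x03B1, 0x03C9, 0x0391, 0x03A9)

# Code point -> official character name, as listed in UnicodeData.txt.
names = {cp: unicodedata.name(chr(cp)) for cp in ENDPOINTS}

for cp in ENDPOINTS:
    print("U+%04X %s" % (cp, names[cp]))

# The reverse direction: character name -> character.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(ord(alpha) == 0x03B1)
```

Note that the "\N{...}" form is a string-literal escape, while unicodedata.lookup() performs the same name resolution at run time.
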
--
Michael


Re: BUG #15347: Unaccent for greek characters does not work

From
Thomas Munro
Date:
On Tue, Aug 28, 2018 at 3:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> Joking aside, I would have preferred that you use the Unicode code
> points directly, as those are easier to look up in UnicodeData.txt, say
> '\u03B1' for small alpha.  If you want to go with the hex code, it would
> be better to copy the character names directly from UnicodeData.txt, as
> those are easier to search for in the future, perhaps together with
> their code points:
> - GREEK SMALL LETTER ALPHA
> - GREEK SMALL LETTER OMEGA
> - GREEK CAPITAL LETTER ALPHA
> - GREEK CAPITAL LETTER OMEGA

OK, I added the full code point names "GREEK ..." in comments, and
pushed this to master.  Thanks Tasos for the report, and Michael for
the review.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15347: Unaccent for greek characters does not work

From
Tasos Maschalidis
Date:

Thank you everyone for such quick communication to work this out!

Wishing you all the best,

Tasos Maschalidis


Re: BUG #15347: Unaccent for greek characters does not work

From
Michael Paquier
Date:
On Sun, Sep 02, 2018 at 07:17:59AM +1200, Thomas Munro wrote:
> OK, I added the full code point names "GREEK ..." in comments, and
> pushed this to master.  Thanks Tasos for the report, and Michael for
> the review.

Thanks Thomas for taking care of this!
--
Michael
