Обсуждение: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Поиск
Список
Период
Сортировка

BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15805
Logged by:          Sergey kuznetsov
Email address:      iksss.88@gmail.com
PostgreSQL version: 9.6.10
Operating system:   linux
Description:

I see unexpected behaviour of lower function for greek sigma letter Σ.
According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
is in final word position, it should be ς in lowercase, and not σ. But
PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
lower string from another part of a system (java code, for example) Postgre
will not return this row cause it differs from java-generated one.

Thanks, 
Sergey Kuznetsov


Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Daniel Gustafsson
Дата:
> On 15 May 2019, at 09:57, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      15805
> Logged by:          Sergey kuznetsov
> Email address:      iksss.88@gmail.com
> PostgreSQL version: 9.6.10
> Operating system:   linux
> Description:
>
> I see unexpected behaviour of lower function for greek sigma letter Σ.
> According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
> is in final word position, it should be ς in lowercase, and not σ. But
> PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
> "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
> lower string from another part of a system (java code, for example) Postgre
> will not return this row cause it differs from java-generated one.

This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
googling there is only a single case of word-final lowercasing which is this
sigma. The attached patch takes a stab at fixing this.

cheers ./daniel


Вложения

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Alvaro Herrera
Дата:
On 2019-May-15, Daniel Gustafsson wrote:

> > I see unexpected behaviour of lower function for greek sigma letter Σ.
> > According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
> > is in final word position, it should be ς in lowercase, and not σ. But
> > PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
> > "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
> > lower string from another part of a system (java code, for example) Postgre
> > will not return this row cause it differs from java-generated one.
> 
> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
> googling there is only a single case of word-final lowercasing which is this
> sigma. The attached patch takes a stab at fixing this.

Ummm ... isn't this a counterexample?
https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-May-15, Daniel Gustafsson wrote:
>> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
>> googling there is only a single case of word-final lowercasing which is this
>> sigma. The attached patch takes a stab at fixing this.

> Ummm ... isn't this a counterexample?
> https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

I do not think the patch as given is acceptable in any case:

1. assumes without any evidence whatsoever that the system's wide-character
representation is Unicode code points;

2. assumes without checking that the locale is one that would allow this
conversion (counterexample: C locale);

3. unreasonable hard-coded assumption about what the "not a word character"
condition is.

It's possible that 1 and 2 could be finessed by checking both that the
original character is Σ and the new one is σ (in Unicode).  We'd still
theoretically be taking a risk of the wrong substitution if the wchar
representation is not Unicode, but the odds seem fairly small.  As for
point 3, why aren't you using iswalpha() on the next character?

            regards, tom lane



Re: BUG #15805: Problem with lower function for greek sigma(Σ) letter

От
"Daniel Verite"
Дата:
    PG Bug reporting form wrote:

> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
> "δημοτεσ φαιστου" instead of "δημοτες φαιστου"

With PostgreSQL version 10 or newer, you could use an ICU
locale. lower() would produce the expected result:

psql (11.3 (Debian 11.3-1.pgdg90+1))

=> select lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ' collate "und-x-icu");
      lower
-----------------
 δημοτες φαιστου
(1 row)

This case looks comparable to the case of the german ß (sharp s),
which should be upcased into 'SS', but the locales backed by libc
don't do that:

=> select upper(''Ich muß');
  upper
---------
 ICH MUß

For that exemple as well, an ICU locale produces a correct
result with regard to linguistic rules:

=> select upper('Ich muß' collate "und-x-icu");
  upper
----------
 ICH MUSS

The libc library provides an API with character-by-character
case conversions (tolower/toupper), which is too limited
to deal with the above cases, and PostgreSQL basically
just uses this API.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Tom Lane
Дата:
"Daniel Verite" <daniel@manitou-mail.org> writes:
>     PG Bug reporting form wrote:
>> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου"

> With PostgreSQL version 10 or newer, you could use an ICU
> locale. lower() would produce the expected result:

Oh, if using ICU already fixes this, I think we might as well just
say that you have to use ICU if you want the right behavior for such
cases.

> The libc library provides an API with character-by-character
> case conversions (tolower/toupper), which is too limited
> to deal with the above cases, and PostgreSQL basically
> just uses this API.

Right.

            regards, tom lane



Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Daniel Gustafsson
Дата:
> On 15 May 2019, at 14:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-May-15, Daniel Gustafsson wrote:
>
>>> I see unexpected behaviour of lower function for greek sigma letter Σ.
>>> According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
>>> is in final word position, it should be ς in lowercase, and not σ. But
>>> PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
>>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
>>> lower string from another part of a system (java code, for example) Postgre
>>> will not return this row cause it differs from java-generated one.
>>
>> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
>> googling there is only a single case of word-final lowercasing which is this
>> sigma. The attached patch takes a stab at fixing this.
>
> Ummm ... isn't this a counterexample?
> https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

Hebrew doesn’t have case, so it doesn’t apply in this case.

cheers ./daniel


Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Alvaro Herrera
Дата:
On 2019-May-15, Daniel Gustafsson wrote:

> > On 15 May 2019, at 14:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > Ummm ... isn't this a counterexample?
> > https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html
> 
> Hebrew doesn’t have case, so it doesn’t apply in this case.

Yeah, I realized that afterwards.  Sorry for the noise.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

От
Daniel Gustafsson
Дата:
> On 15 May 2019, at 17:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> "Daniel Verite" <daniel@manitou-mail.org> writes:
>>     PG Bug reporting form wrote:
>>> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
>>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου"
>
>> With PostgreSQL version 10 or newer, you could use an ICU
>> locale. lower() would produce the expected result:
>
> Oh, if using ICU already fixes this, I think we might as well just
> say that you have to use ICU if you want the right behavior for such
> cases.

Seems reasonable.  Maybe it warrants a mention in the docs on the string
function page since it may suprise users?

cheers ./daniel


Вложения