[HACKERS] strcmp() tie-breaker for identical ICU-collated strings

From	Amit Khandekar
Subject	[HACKERS] strcmp() tie-breaker for identical ICU-collated strings
Date	2 June 2017, 00:58:51
Msg-id	CAJ3gD9ez9O7scYT46iRXU-1KDfDeSQvJ2Ekzxs7RYGofYfB4cg@mail.gmail.com
Responses	Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings (Thomas Munro <thomas.munro@enterprisedb.com>)
List	pgsql-hackers

When varstr_cmp() compares two text strings and the strcoll*() call
returns 0, we fall back to a strcmp() tie-breaker, i.e. a binary
comparison, because strcoll() can return 0 for non-identical strings:

varstr_cmp()
{
    ...
    /*
     * In some locales strcoll() can claim that nonidentical strings are
     * equal.  Believing that would be bad news for a number of reasons,
     * so we follow Perl's lead and sort "equal" strings according to
     * strcmp().
     */
    if (result == 0)
        result = strcmp(a1p, a2p);
    ...
}
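
To make the tie-breaker pattern concrete, here is a minimal standalone
sketch. It is not the actual varstr_cmp() code: the helper name
cmp_with_tiebreak is hypothetical, and it assumes NUL-terminated strings
and whatever LC_COLLATE locale the process currently has set.

```c
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL source: compare two NUL-terminated
 * strings with strcoll(), and fall back to strcmp() when strcoll()
 * reports them as equal.
 */
static int
cmp_with_tiebreak(const char *a, const char *b)
{
    int result = strcoll(a, b);

    /* After the tie-breaker, only byte-identical strings compare equal. */
    if (result == 0)
        result = strcmp(a, b);

    return result;
}
```

With this shape, a result of 0 is only possible when the two strings are
byte-for-byte identical, whatever the locale says.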

But is this supposed to apply to ICU collations as well? If the
collation provider is icu, the comparison is done using
ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally reports
some strings as equal, so doing strcmp() afterwards may not make
sense.

For example, if the two characters below are compared using
ucol_strcollUTF8(), it returns 0, meaning the strings are equal:
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this:
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
 ?column? | ?column?
----------+----------
 ´        | ΄
(Although the characters in this example look similar, that might not
be the reason they are treated as equal.)

Since ucol_strcoll*() returns 0, these strings always end up being
compared with strcmp(), so 1FFD > 0384 returns true:

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 t
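
The 't' result follows from plain byte order. As a small sketch, assuming
the strings reach strcmp() in UTF-8 (the byte sequences below are the
UTF-8 encodings of the two code points; the variable and function names
are made up for illustration):

```c
#include <string.h>

/* UTF-8 encodings of the two code points from the example. */
static const char greek_oxia[]  = "\xE1\xBF\xBD";   /* U+1FFD */
static const char greek_tonos[] = "\xCE\x84";       /* U+0384 */

/*
 * Returns 1 if strcmp() orders U+1FFD after U+0384, else 0.
 * strcmp() compares bytes as unsigned char, so the first bytes
 * 0xE1 and 0xCE already decide the outcome.
 */
static int
oxia_sorts_after_tonos(void)
{
    return strcmp(greek_oxia, greek_tonos) > 0;
}
```

Since 0xE1 is greater than 0xCE, the tie-breaker always places 1FFD after
0384, regardless of what the collation considers equal.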

Whereas, if strcmp() is skipped for ICU collations:
    if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
        result = strcmp(a1p, a2p);

... then the comparison using the ICU collation reports the strings as equal:

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
 ?column?
----------
 t
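
The effect of skipping the tie-breaker can be sketched in isolation. This
is only an illustration under stated assumptions: provider_t, collate_fn,
always_equal and cmp_by_provider are hypothetical names, and the stub
comparator stands in for a collator that returns 0 for the pair above; no
real ICU call is made.

```c
#include <string.h>

/* Hypothetical stand-in for the collation provider, illustration only. */
typedef enum { PROVIDER_LIBC, PROVIDER_ICU } provider_t;

/* Collation-aware comparator, strcoll-like: returns <0, 0 or >0. */
typedef int (*collate_fn)(const char *a, const char *b);

/* Stub comparator that treats every pair as equal. */
static int
always_equal(const char *a, const char *b)
{
    (void) a;
    (void) b;
    return 0;
}

static int
cmp_by_provider(provider_t provider, collate_fn collate,
                const char *a, const char *b)
{
    int result = collate(a, b);

    /* Apply the strcmp() tie-breaker only for non-ICU providers. */
    if (result == 0 && provider != PROVIDER_ICU)
        result = strcmp(a, b);

    return result;
}
```

Called with PROVIDER_ICU and the stub, the two UTF-8 strings compare as
equal (0); with PROVIDER_LIBC the same call falls through to strcmp() and
returns a positive value.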


I have verified that with strcoll() the comparison 1FFD > 0384 comes
out true. So it looks like the ICU API function ucol_strcoll() reports
these strings as equal intentionally. That is why I feel the
strcmp-if-strcoll-returns-0 logic might not be applicable to ICU. But
I may be wrong; please correct me if I am missing something.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
