Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

From: Thomas Munro
Subject: Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
Date:
Msg-id: CAEepm=3nmZj6AAFn7CjCwHw_59nrP+2c58ryn5fhS4C9PWggMQ@mail.gmail.com
In reply to: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses: Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Fri, Jun 2, 2017 at 6:58 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> While comparing two text strings using varstr_cmp(), if *strcoll*()
> call returns 0, we do strcmp() tie-breaker to do binary comparison,
> because strcoll() can return 0 for non-identical strings :
>
> varstr_cmp()
> {
> ...
> /*
> * In some locales strcoll() can claim that nonidentical strings are
> * equal.  Believing that would be bad news for a number of reasons,
> * so we follow Perl's lead and sort "equal" strings according to
> * strcmp().
> */
> if (result == 0)
> result = strcmp(a1p, a2p);
> ...
> }
>
> But is this supposed to apply for ICU collations as well ? If
> collation provider is icu, the comparison is done using
> ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns
> some characters as being identical, so doing strcmp() may not make
> sense.
>
> For example, if the two characters below are compared using
> ucol_strcollUTF8(), it returns 0, meaning the strings are identical :
> Greek Oxia : UTF-16 encoding : 0x1FFD
> (http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
> Greek Tonos : UTF-16 encoding : 0x0384
> (http://www.fileformat.info/info/unicode/char/0384/index.htm)
>
> The characters are displayed like this :
> postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
>  ?column? | ?column?
> ----------+----------
>  ´        | ΄
> (Although this example has similar looking characters, this might not
> be a factor behind treating them equal)
>
> Now since ucol_strcoll*() returns 0, these strings are always compared
> using strcmp(), so 1FFD > 0384 returns true :
>
> create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
> Whereas, if strcmp() is skipped for ICU collations :
> if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
>    result = strcmp(a1p, a2p);
>
> ... then the comparison using ICU collation tells they are identical strings :
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
>
> Now I have verified that strcoll() returns true for 1FFD > 0384. So,
> it looks like ICU API function ucol_strcoll() returns false by
> intention. That's the reason I feel like the
> strcmp-if-strcoll-returns-0 thing might not be applicable for ICU. But
> I may be wrong, please correct me if I may be missing something.

I may not have had enough coffee yet, but...

Why should ICU be any different than the system provider in this
respect?  In both cases, we have a two-level comparison: first we use
the collation-aware comparison, and then as a tie breaker, we use a
binary comparison.  If we didn't do a binary comparison as a
tie-breaker, wouldn't the result be logically incompatible with the =
operator, which does a binary comparison?
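
To make the two-level comparison concrete, here is a minimal sketch in C. It is not the varstr_cmp() code: loose_cmp() is a hypothetical stand-in for strcoll() or ucol_strcoll*() that ignores ASCII case, chosen only because it reports 0 for some non-identical strings without depending on any installed locale. two_level_cmp() is likewise an illustrative name.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical stand-in for a collation-aware comparison: it ignores
 * ASCII case, so it can return 0 for strings that are not byte-identical.
 */
int
loose_cmp(const char *a, const char *b)
{
    while (*a && *b)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a++;
        b++;
    }
    if (*a)
        return 1;               /* a is longer */
    if (*b)
        return -1;              /* b is longer */
    return 0;
}

/*
 * Two-level comparison: collation-aware comparison first, then a binary
 * comparison as the tie-breaker when the first level reports equality.
 */
int
two_level_cmp(const char *a, const char *b)
{
    int result = loose_cmp(a, b);

    if (result == 0)
        result = strcmp(a, b);
    return result;
}
```

With this sketch, two_level_cmp("abc", "ABC") is nonzero even though loose_cmp("abc", "ABC") is 0, so the ordering returns 0 only for byte-identical strings, which is what a binary = operator expects.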

Put another way, if we didn't use binary order tie-breaking, we'd have
to teach texteq to understand collations (i.e. be defined as not (a < b)
and not (a > b)); otherwise we'd permit contradictions like a != b and
not (a < b) and not (a > b).
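
The contradiction can be checked mechanically. The sketch below is an illustration under stated assumptions, not PostgreSQL code: collate_cmp() is a hypothetical case-insensitive stand-in for the collation-aware comparison, text_eq() models a binary equality operator, and trichotomy_holds() tests whether exactly one of a < b, a = b, a > b is true for a given "less than" function.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef bool (*lt_fn) (const char *, const char *);

/* Hypothetical collation stand-in: compares while ignoring ASCII case. */
int
collate_cmp(const char *a, const char *b)
{
    for (; *a && *b; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
    }
    return (*a != '\0') - (*b != '\0');
}

/* Binary equality, modelling an = operator that ignores collation. */
bool
text_eq(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* "Less than" using only the collation-aware comparison. */
bool
lt_without_tiebreak(const char *a, const char *b)
{
    return collate_cmp(a, b) < 0;
}

/* "Less than" with the binary tie-breaker applied on a 0 result. */
bool
lt_with_tiebreak(const char *a, const char *b)
{
    int r = collate_cmp(a, b);

    if (r == 0)
        r = strcmp(a, b);
    return r < 0;
}

/* True when exactly one of a < b, a = b, a > b holds. */
bool
trichotomy_holds(const char *a, const char *b, lt_fn lt)
{
    int count = lt(a, b) + text_eq(a, b) + lt(b, a);

    return count == 1;
}
```

For the pair "abc" and "ABC", trichotomy_holds() is false with lt_without_tiebreak (the strings are neither equal nor ordered) and true with lt_with_tiebreak.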

--
Thomas Munro
http://www.enterprisedb.com


