Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings |
Date | |
Msg-id | CAEepm=3nmZj6AAFn7CjCwHw_59nrP+2c58ryn5fhS4C9PWggMQ@mail.gmail.com |
In reply to | [HACKERS] strcmp() tie-breaker for identical ICU-collated strings (Amit Khandekar <amitdkhan.pg@gmail.com>) |
Responses | Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings (Peter Geoghegan <pg@bowt.ie>) |
List | pgsql-hackers |
On Fri, Jun 2, 2017 at 6:58 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> While comparing two text strings using varstr_cmp(), if *strcoll*()
> call returns 0, we do strcmp() tie-breaker to do binary comparison,
> because strcoll() can return 0 for non-identical strings :
>
> varstr_cmp()
> {
>     ...
>     /*
>      * In some locales strcoll() can claim that nonidentical strings are
>      * equal. Believing that would be bad news for a number of reasons,
>      * so we follow Perl's lead and sort "equal" strings according to
>      * strcmp().
>      */
>     if (result == 0)
>         result = strcmp(a1p, a2p);
>     ...
> }
>
> But is this supposed to apply for ICU collations as well ? If
> collation provider is icu, the comparison is done using
> ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns
> some characters as being identical, so doing strcmp() may not make
> sense.
>
> For e.g. , if the below two characters are compared using
> ucol_strcollUTF8(), it returns 0, meaning the strings are identical :
> Greek Oxia : UTF-16 encoding : 0x1FFD
> (http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
> Greek Tonos : UTF-16 encoding : 0x0384
> (http://www.fileformat.info/info/unicode/char/0384/index.htm)
>
> The characters are displayed like this :
> postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
>  ?column? | ?column?
> ----------+----------
>  ´        | ΄
> (Although this example has similar looking characters, this might not
> be a factor behind treating them equal)
>
> Now since ucol_strcoll*() returns 0, these strings are always compared
> using strcmp(), so 1FFD > 0384 returns true :
>
> create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
> Whereas, if strcmp() is skipped for ICU collations :
> if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
>     result = strcmp(a1p, a2p);
>
> ... then the comparison using ICU collation tells they are identical strings :
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
>
> Now I have verified that strcoll() returns true for 1FFD > 0384. So,
> it looks like ICU API function ucol_strcoll() returns false by
> intention. That's the reason I feel like the
> strcmp-if-strtoll-returns-0 thing might not be applicable for ICU. But
> I may be wrong, please correct me if I may be missing something.

I may not have had enough coffee yet, but...

Why should ICU be any different than the system provider in this
respect? In both cases, we have a two-level comparison: first we use
the collation-aware comparison, and then as a tie-breaker, we use a
binary comparison. If we didn't do a binary comparison as a
tie-breaker, wouldn't the result be logically incompatible with the =
operator, which does a binary comparison?

Put another way, if we didn't use binary order tie-breaking, we'd have
to teach texteq to understand collations (i.e. be defined as
not (a < b) and not (a > b)); otherwise we'd permit contradictions like
a != b and not (a < b) and not (a > b).

-- 
Thomas Munro
http://www.enterprisedb.com
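
The two-level comparison described in the reply can be illustrated with a small standalone C sketch. This is not PostgreSQL code: toy_collate_cmp() is a hypothetical stand-in for strcoll() or ucol_strcollUTF8() that ignores ASCII case, chosen only because it returns 0 for some non-identical strings without depending on the installed locales. two_level_cmp() and binary_eq() are likewise illustrative names, the latter assuming (as the thread states) that = is a binary comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for a collation-aware comparison. It ignores
 * ASCII case, so it can return 0 for strings that differ byte-wise.
 */
static int
toy_collate_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a && *b)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a++;
        b++;
    }
    if (*a)
        return 1;       /* a is longer, so it sorts after b */
    if (*b)
        return -1;      /* b is longer, so it sorts after a */
    return 0;
}

/* Level 1: collation order. Level 2: binary order as the tie-breaker. */
static int
two_level_cmp(const char *a, const char *b)
{
    int result = toy_collate_cmp(a, b);

    if (result == 0)
        result = strcmp(a, b);
    return result;
}

/* Binary equality, assumed here to mirror what "=" does for text. */
static bool
binary_eq(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

With the tie-breaker in place, two_level_cmp(a, b) == 0 holds exactly when binary_eq(a, b) is true, so the ordering and the equality test cannot disagree. Calling toy_collate_cmp() alone on "abc" and "ABC" reports a tie while binary_eq() says the strings differ, which is the contradiction the reply warns about.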