Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From: Palle Girgensohn
Subject: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Msg-id: A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net
In reply to: Re: Implementing full UTF-8 support (aka supporting 0x00)  (Bruce Momjian <bruce@momjian.us>)
Responses: Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)  (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers

> On 4 Aug 2016, at 02:40, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
>
>     https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.
In this latest version, I've removed support for encodings other than UTF-8, mostly because I don't know how to test
them, but also because I see little point in supporting anything else with ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from the regression tests when
running:

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out    2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out    2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas whether one is more correct than
the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?
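
To make the lower(both) versus upper(both) question concrete, here is a minimal Python sketch. It is only an illustration, not what ICU, glibc or the patch actually do internally. Python's str.lower() and str.upper() apply the locale-independent Unicode default mappings; turkish_lower() is a hypothetical helper that I made up to approximate the Turkish rule (I lowercases to dotless ı, İ lowercases to i).

```python
# Illustration only: compares case-insensitive matching strategies for the
# Turkish dotted/dotless i. None of this is taken from ICU, glibc or the patch.

def match_via_lower(a: str, b: str) -> bool:
    """Case-insensitive equality using Unicode default lowercasing."""
    return a.lower() == b.lower()


def match_via_upper(a: str, b: str) -> bool:
    """Case-insensitive equality using Unicode default uppercasing."""
    return a.upper() == b.upper()


def turkish_lower(s: str) -> str:
    """Hypothetical helper: lowercase with the Turkish I/ı and İ/i pairing."""
    # Handle the two Turkish-specific letters first, then fall back to the
    # default mapping for everything else.
    return s.replace("İ", "i").replace("I", "ı").lower()


def match_via_turkish_lower(a: str, b: str) -> bool:
    """Case-insensitive equality using the Turkish-aware lowercasing above."""
    return turkish_lower(a) == turkish_lower(b)


if __name__ == "__main__":
    # 'bıt' vs 'BIT': the default mappings disagree with each other.
    print(match_via_lower("bıt", "BIT"))          # default lower: 'bıt' vs 'bit'
    print(match_via_upper("bıt", "BIT"))          # default upper: 'BIT' vs 'BIT'
    print(match_via_turkish_lower("bıt", "BIT"))  # Turkish lower: 'bıt' vs 'bıt'

    # 'KI' inside 'Türkiye': a substring test in the spirit of ~*.
    print("KI".lower() in "Türkiye".lower())                  # default rules
    print(turkish_lower("KI") in turkish_lower("Türkiye"))    # Turkish rules
```

With the default mappings, lowercasing both sides makes 'bıt' and 'BIT' differ, while uppercasing both sides makes them equal, so the direction of folding alone can flip the first case. Only the Turkish-aware folding gives the results the expected test output asks for in both cases (no match for 'KI' in 'Türkiye', match for 'bıt' against 'BIT').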


As Thomas points out, using ucol_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html

