Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From	Palle Girgensohn
Subject	Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date	10 August 2016, 23:42:07
Msg-id	A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net
In reply to	Re: Implementing full UTF-8 support (aka supporting 0x00) (Bruce Momjian <bruce@momjian.us>)
Responses	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
List	pgsql-hackers

> On 4 Aug 2016, at 02:40, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
>
>     https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

https://github.com/girgen/postgres/

in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.
In this latest version, I've removed support for encodings other than UTF-8, mostly because I don't know how to test
them, but also because I see little point in supporting anything else when using ICU.

I have one question for someone with knowledge of Turkish (Devrim?). This is the diff from the regression tests when
running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out    2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out    2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU in the above (corner?) cases. Any ideas whether one is more correct than
the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?
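
To make the lower(both) versus upper(both) question concrete, here is a minimal Python sketch. It is not what glibc, ICU, or the patch actually does: the helpers `turkish_lower`, `turkish_upper`, and `imatch` are hypothetical, the Turkish tailoring is hand-written and assumes that only the dotted/dotless i pairs differ from the default Unicode mappings, and a plain substring test stands in for the `~*` regex match.

```python
# Python's str.lower()/str.upper() apply the default (locale-independent)
# Unicode case mappings, so they serve as the "non-Turkish" baseline here.
def default_lower(s: str) -> str:
    return s.lower()


def default_upper(s: str) -> str:
    return s.upper()


# Hand-written Turkish tailoring (assumption: only I/ı and İ/i are special).
def turkish_lower(s: str) -> str:
    # Turkish lowercases I to dotless ı and İ to dotted i.
    return s.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(s: str) -> str:
    # Turkish uppercases dotted i to İ and dotless ı to I.
    return s.replace("i", "İ").replace("ı", "I").upper()


def imatch(text: str, pattern: str, fold) -> bool:
    """Case-insensitive substring test: fold both sides, then compare."""
    return fold(pattern) in fold(text)


# The two (text, pattern) pairs from the regression diff above.
CASES = [("Türkiye", "KI"), ("bıt", "BIT")]

FOLDS = {
    "default lower": default_lower,
    "default upper": default_upper,
    "turkish lower": turkish_lower,
    "turkish upper": turkish_upper,
}


def evaluate() -> dict:
    """Return, per folding strategy, the match result for each case."""
    return {
        name: [imatch(text, pattern, fold) for text, pattern in CASES]
        for name, fold in FOLDS.items()
    }


if __name__ == "__main__":
    for name, outcome in evaluate().items():
        print(f"{name:14} {outcome}")
```

In this sketch both Turkish-tailored strategies give [False, True], the pattern in the expected output file, while default lowercasing gives [True, False], the pattern in the results file. Default uppercasing gives [True, True], so without Turkish tailoring the choice between lower(both) and upper(both) does change the answer for 'bıt'. This only shows which mappings would be consistent with each result; it does not establish which path either library takes.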

As Thomas points out, using ucoll_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html
