Re: Patch for collation using ICU
From | Palle Girgensohn |
---|---|
Subject | Re: Patch for collation using ICU |
Date | |
Msg-id | 55C6D914B6055CD5721BEC40@palle.girgensohn.se |
In response to | Patch for collation using ICU (Palle Girgensohn <girgen@pingpong.net>) |
Responses | Re: Patch for collation using ICU (Stephan Szabo <sszabo@megazone.bigpanda.com>); Re: Patch for collation using ICU (Hannu Krosing <hannu@tm.ee>) |
List | pgsql-hackers |
--On Friday, March 25, 2005 00.40.04 +0100 Palle Girgensohn <girgen@pingpong.net> wrote:

> Hi!
>
> I've put together a patch for using IBM's ICU package for collation.
>
> If your OS does not have full support for collation or
> uppercase/lowercase in multibyte locales, this might be useful. If you
> are using a multibyte character encoding in your database and want
> collation, i.e. order by, and also lower(), upper() and initcap() to work
> properly, this patch will do just that.
>
> This patch is needed for FreeBSD, since this OS has no support for
> collation of, for example, Unicode locales (that is, wcscoll(3) does not do
> what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the
> patch is *not* necessary for Linux, although IBM claims ICU collation to
> be about twice as fast as glibc for simple western locales.
>
> It adds a configure switch, `--with-icu', which will set up the code to
> use ICU instead of wchar_t and wcscoll.
>
> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
> seems to run well. I've not had the time to do any comparative
> performance tests yet, but it seems it is at least not slower than using
> LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
>
> I'd be delighted if some more experienced postgresql hackers would review
> this stuff. The patch is pretty compact, so it's fast reading :) I'm
> planning to add this patch as an option (tagged "experimental") to
> FreeBSD's postgresql port. Any ideas about whether this is a good idea or
> not?
>
> Any thoughts or ideas are welcome!
>
> Cheers,
> Palle
>
> Patch at:
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>
>
> ICU at sourceforge: <http://icu.sf.net/>

Hi!

There's a new patch to fix some reported problems.

<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>

This version uses the DatabaseEncoding and sets the ICU encoding at the same time.
I had to create a conversion table from PostgreSQL's own, somewhat odd and non-standard, encoding names to the preferred IANA names. One or two of the odder ones might be slightly incorrect, but hopefully not too far off.

I've noticed a couple of things about using the ICU patch vs. pristine pg-8.0.1:

- ORDER BY is case insensitive when using ICU. This might break the SQL standard (?), but sure is nice :)

- When the database is initialized using the C locale, upper() and lower() normally do not work at all for non-ASCII characters, even if the database's encoding is, say, LATIN1 or UNICODE. (It does not work for me anyway, on FreeBSD, and this is probably correct since the locale is still `C', I believe?) The ICU patch changes nothing for the LATIN1 case, since it does not act on single-byte encodings, but for the UNICODE representation it works and does what I expect: upper() and lower() neatly upper- or lowercase characters with diacritics, e.g. lower('ÅÄÖ') -> 'åäö'. This is a good thing, although I'm surprised that upper/lower is dragged along with the LC_COLLATE fixation at initdb. I never run initdb in the C locale, but only now do I realize how broken that really is if you need to store anything other than English :-)

I'd be delighted to get more feedback about this stuff.

Thanks,
Palle
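
The two observations above can be illustrated outside PostgreSQL with a minimal Python sketch. It does not use ICU or the patch. It only contrasts plain code-point ordering with a case-insensitive ordering, and ASCII-only lowercasing (roughly the C-locale behaviour described above) with Unicode-aware lowercasing. The sample words and all function names are made up for illustration.

```python
# Illustration only: plain Python, not ICU and not the patch itself.
words = ["banana", "Apple", "cherry", "apple", "Banana"]


def codepoint_order(items):
    """Sort by raw code points: uppercase ASCII letters sort before lowercase ones."""
    return sorted(items)


def case_insensitive_order(items):
    """Sort ignoring case; equal keys keep their input order (sorted() is stable)."""
    return sorted(items, key=str.casefold)


def lower_ascii_only(text):
    """Lowercase only A-Z and leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


print(codepoint_order(words))          # case-sensitive ordering
print(case_insensitive_order(words))   # case-insensitive ordering
print(lower_ascii_only("ÅÄÖ ABC"))     # non-ASCII letters are left as they are
print("ÅÄÖ".lower())                   # Unicode-aware case mapping
```

The ASCII-only helper leaves 'ÅÄÖ' unchanged, while the Unicode-aware call lowercases it. That is the same difference reported above between lower() in a C-locale database and lower() with the patch on a UNICODE database.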