Re: Patch for collation using ICU

Поиск
Список
Период
Сортировка
От John Hansen
Тема Re: Patch for collation using ICU
Дата
Msg-id 5066E5A966339E42AA04BA10BA706AE50A92FD@rodrick.geeknet.com.au
обсуждение исходный текст
Ответ на Patch for collation using ICU  (Palle Girgensohn <girgen@pingpong.net>)
Список pgsql-hackers
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > >
> > > Is this patch ready for application?
> >
> > I don't think so, not quite. I have not had any positive
> reports from
> > linux users, this is only tested in a FreeBSD environment.
> I'd say it
> > needs some more testing.
>
> OK.
>
> > Also, apparently, ICU is installed by default in many linux
> > distributions, and usually it is version 2.8. Some linux users have
> > asked me if there are plans for a patch that works with ICU 2.8.
> > That's probably a good idea. IBM and the ICU folks seem to consider
> > 3.2 to be the stable version, older versions are hard to
> find on their
> > sites, but most linux distributers seem to consider it too bleeding
> > edge, even gentoo. I don't know why they don't agree.
>
> Good point.  Why would linux folks need ICU?  Doesn't their
> OS support encodings natively?  I am particularly excited
> about this for OSs that don't have such encodings, like UTF8
> support for Win32.
>
> Because ICU will not be used unless enabled by configure, it
> seems we are fine with only supporting the newest version.
> Do Linux users need to use ICU for any reason?

Yes, because on many linux platforms locale support is broken.
Also, ICU enables full unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.

There are also many other useful things in ICU that could be
implemented. Transliteration, and break-iterators for example.
Break-iteration particularly interresting for converting a text to a
list of words. Another is it's builtin substring searches.

>
> > > I do have a few questions:
> > >
> > > Why don't you use the lc_ctype_is_c() part of this test?
> > >
> > >      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :)  What would be the
> locale in
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> CTYPE=C and use
> > a wide encoding, indeed. Then the strings will be handled
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a
> problem with some OSs that have processing problems with UTF8
> when the locale is supposed to be turned off, like in "C".  I
> realize ICU might be fine with it but the optimization still
> is an issue.

That the locale is supposed to be turned off, doesn't mean it shouldn't
use ICU.
ICU is more than just locales.

> > > Why is so much code added, for example, in lower()?  The existing
> > > multibyte code is much smaller, and lots of code is added
> in other
> > > places too.
> >
> > ICU uses UTF-16 internally, so all strings must be
> converted from the
> > database encoding to UTF-16. Since that means the strings
> need to be
> > copied, I took the same approach as in
> varlena.c:varstr_cmp(), where
> > small strings use the heap and only larger strings use a palloc.
> > Comments in varstr_cmp about performance made me use that approach.
>
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and
> easier to maintain.
>
> > Also, in the latest patch, I also added checks and logging
> for *every*
> > status returned from ICU. I hope this will help debugging
> on debian,
> > where previous version didn't work. That excessive status
> checking is
> > hardly be necessary once the stuff is better tested.
> >
> > I think the string copying and heap/palloc choices stands
> for most of
> > the code bloat, together with the excessive status checking
> and logging.
>
> OK, move that into some common functions and I think it will
> be better.
>
> > > Why do you need to add a mapping of encoding names from
> iana to our
> > > names?
> >
> > This was already answered by John Hansen... There's an old
> thread here
> > about the choice of the name "UNICODE" to describe an
> encoding, which
> > it doesn't. There's half a dozen unicode based encodings...
> UTF-8 is
> > used by postgresql, that would have been a better name... Similarly
> > for most other encodings, really. ICU expect a setlocale(3) string
> > (i.e. IANA). PostgreSQL can't provide it, so a mapping
> table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).
>  Does that help?
>
> > I use this patch in production on one FreeBSD 4.10 server
> at the moment.
> > With the latest version, I've had no problems. Logging is
> swithed on
> > for now, and it shows no signs of ICU complaining. I'd like more
> > reports on Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should
> have feature freeze on July 1.
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square,
> Pennsylvania 19073
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
>


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Palle Girgensohn
Дата:
Сообщение: Re: Patch for collation using ICU
Следующее
От: "John Hansen"
Дата:
Сообщение: Re: Patch for collation using ICU