Обсуждение: Patch for collation using ICU
Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged "experimental") to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff> ICU at sourceforge: <http://icu.sf.net/>
Useful if it's going to support earlier releases of ICU.... Not all os's come with ICU3.2, debian for example, currently has 2.1 in testing, and 2.6 in unstable. ... John > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of > Palle Girgensohn > Sent: Friday, March 25, 2005 10:40 AM > To: pgsql-hackers@postgresql.org > Subject: [HACKERS] Patch for collation using ICU > > Hi! > > I've put together a patch for using IBM's ICU package for collation. > > If your OS does not have full support for collation ur > uppercase/lowercase in multibyte locales, this might be > useful. If you are using a multibyte character encoding in > your database and want collation, i.e. order by, and also > lower(), upper() and initcap() to work properly, this patch > will do just that. > > This patch is needed for FreeBSD, since this OS has no > support for collation of for example unicode locales (that > is, wcscoll(3) does not do what you expect if you set > LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* > necessary for Linux, although IBM claims ICU collation to be > about twice as fast as glibc for simple western locales. > > It adds a configure switch, `--with-icu', which will set up > the code to use ICU instead of wchar_t and wcscoll. > > This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, > where it seems to run well. I've not had the time to do any > comparative performance tests yet, but it seems it is at > least not slower than using LATIN1 with > sv_SE.ISO8859-1 locale, perhaps even faster. > > I'd be delighted if some more experienced postgresql hackers > would review this stuff. The patch is pretty compact, so it's > fast reading :) I'm planning to add this patch as an option > (tagged "experimental") to FreeBSD's postgresql port. Any > ideas about whether this is a good idea or not? > > Any thoughts or ideas are welcome! > > Cheers, > Palle > > Patch at: > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-14.diff> > > ICU at sourceforge: <http://icu.sf.net/> > > > ---------------------------(end of > broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > >
--On fredag, mars 25, 2005 16.34.41 +1100 John Hansen <john@geeknet.com.au> wrote: > Useful if it's going to support earlier releases of ICU.... > > Not all os's come with ICU3.2, debian for example, currently has 2.1 in > testing, and 2.6 in unstable. Oh, OK. FreeBSD has only the 3.2 as port. I can check the older version, I doubt it would too much difference. Some autoconf sorcery needed, perhaps. /Palle > > ... John > >> -----Original Message----- >> From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of >> Palle Girgensohn >> Sent: Friday, March 25, 2005 10:40 AM >> To: pgsql-hackers@postgresql.org >> Subject: [HACKERS] Patch for collation using ICU >> >> Hi! >> >> I've put together a patch for using IBM's ICU package for collation. >> >> If your OS does not have full support for collation ur >> uppercase/lowercase in multibyte locales, this might be >> useful. If you are using a multibyte character encoding in >> your database and want collation, i.e. order by, and also >> lower(), upper() and initcap() to work properly, this patch >> will do just that. >> >> This patch is needed for FreeBSD, since this OS has no >> support for collation of for example unicode locales (that >> is, wcscoll(3) does not do what you expect if you set >> LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* >> necessary for Linux, although IBM claims ICU collation to be >> about twice as fast as glibc for simple western locales. >> >> It adds a configure switch, `--with-icu', which will set up >> the code to use ICU instead of wchar_t and wcscoll. >> >> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, >> where it seems to run well. I've not had the time to do any >> comparative performance tests yet, but it seems it is at >> least not slower than using LATIN1 with >> sv_SE.ISO8859-1 locale, perhaps even faster. >> >> I'd be delighted if some more experienced postgresql hackers >> would review this stuff. The patch is pretty compact, so it's >> fast reading :) I'm planning to add this patch as an option >> (tagged "experimental") to FreeBSD's postgresql port. Any >> ideas about whether this is a good idea or not? >> >> Any thoughts or ideas are welcome! >> >> Cheers, >> Palle >> >> Patch at: >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 > 005-03-14.diff> >> >> ICU at sourceforge: <http://icu.sf.net/> >> >> >> ---------------------------(end of >> broadcast)--------------------------- >> TIP 7: don't forget to increase your free space map settings >> >>
> --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen > <john@geeknet.com.au> > wrote: > > > Useful if it's going to support earlier releases of ICU.... > > > > Not all os's come with ICU3.2, debian for example, > currently has 2.1 > > in testing, and 2.6 in unstable. > > Oh, OK. FreeBSD has only the 3.2 as port. I can check the > older version, I doubt it would too much difference. Some > autoconf sorcery needed, perhaps. Naww, it's no biggie, we'll just need to include ICU with pg I think. I tried that, there are several functions from ICU that you use, that are not in ICU2.1 Dono about 2.6. However, ICU3.2 compiles on debian with a small change to the debian/rules file. debian/tmp/etc is missing, so add mkdir debian/tmp/etc ... John > > /Palle > > > > > ... John > > > >> -----Original Message----- > >> From: pgsql-hackers-owner@postgresql.org > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Palle > >> Girgensohn > >> Sent: Friday, March 25, 2005 10:40 AM > >> To: pgsql-hackers@postgresql.org > >> Subject: [HACKERS] Patch for collation using ICU > >> > >> Hi! > >> > >> I've put together a patch for using IBM's ICU package for > collation. > >> > >> If your OS does not have full support for collation ur > >> uppercase/lowercase in multibyte locales, this might be useful. If > >> you are using a multibyte character encoding in your database and > >> want collation, i.e. order by, and also lower(), upper() and > >> initcap() to work properly, this patch will do just that. > >> > >> This patch is needed for FreeBSD, since this OS has no support for > >> collation of for example unicode locales (that is, wcscoll(3) does > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for > example). > >> AFAIK the patch is *not* necessary for Linux, although IBM > claims ICU > >> collation to be about twice as fast as glibc for simple western > >> locales. > >> > >> It adds a configure switch, `--with-icu', which will set > up the code > >> to use ICU instead of wchar_t and wcscoll. > >> > >> This has been tested only on FreeBSD-4.11 & > FreeBSD-5-stable, where > >> it seems to run well. I've not had the time to do any comparative > >> performance tests yet, but it seems it is at least not slower than > >> using LATIN1 with > >> sv_SE.ISO8859-1 locale, perhaps even faster. > >> > >> I'd be delighted if some more experienced postgresql hackers would > >> review this stuff. The patch is pretty compact, so it's > fast reading > >> :) I'm planning to add this patch as an option (tagged > >> "experimental") to FreeBSD's postgresql port. Any ideas > about whether > >> this is a good idea or not? > >> > >> Any thoughts or ideas are welcome! > >> > >> Cheers, > >> Palle > >> > >> Patch at: > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 > > 005-03-14.diff> > >> > >> ICU at sourceforge: <http://icu.sf.net/> > >> > >> > >> ---------------------------(end of > >> broadcast)--------------------------- > >> TIP 7: don't forget to increase your free space map settings > >> > >> > > > > > >
Ok,.. tested on debian sarge with ICU 3.2 UNICODE Database, C locale. upper() and lower() returns an empty string for any input, including 7bit ascii, regardless of client_encoding, so something is obviously broken. Have you tested this patch on a UNICODE DB with locale C/POSIX ? ... John > -----Original Message----- > From: John Hansen > Sent: Friday, March 25, 2005 10:27 PM > To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org' > Subject: RE: [HACKERS] Patch for collation using ICU > > > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen > > <john@geeknet.com.au> > > wrote: > > > > > Useful if it's going to support earlier releases of ICU.... > > > > > > Not all os's come with ICU3.2, debian for example, > > currently has 2.1 > > > in testing, and 2.6 in unstable. > > > > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older > > version, I doubt it would too much difference. Some > autoconf sorcery > > needed, perhaps. > > Naww, it's no biggie, we'll just need to include ICU with pg I think. > I tried that, there are several functions from ICU that you > use, that are not in ICU2.1 > > Dono about 2.6. > > However, ICU3.2 compiles on debian with a small change to the > debian/rules file. > debian/tmp/etc is missing, so add mkdir debian/tmp/etc > > ... John > > > > > /Palle > > > > > > > > ... John > > > > > >> -----Original Message----- > > >> From: pgsql-hackers-owner@postgresql.org > > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Palle > > >> Girgensohn > > >> Sent: Friday, March 25, 2005 10:40 AM > > >> To: pgsql-hackers@postgresql.org > > >> Subject: [HACKERS] Patch for collation using ICU > > >> > > >> Hi! > > >> > > >> I've put together a patch for using IBM's ICU package for > > collation. > > >> > > >> If your OS does not have full support for collation ur > > >> uppercase/lowercase in multibyte locales, this might be > useful. If > > >> you are using a multibyte character encoding in your > database and > > >> want collation, i.e. order by, and also lower(), upper() and > > >> initcap() to work properly, this patch will do just that. > > >> > > >> This patch is needed for FreeBSD, since this OS has no > support for > > >> collation of for example unicode locales (that is, > wcscoll(3) does > > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for > > example). > > >> AFAIK the patch is *not* necessary for Linux, although IBM > > claims ICU > > >> collation to be about twice as fast as glibc for simple western > > >> locales. > > >> > > >> It adds a configure switch, `--with-icu', which will set > > up the code > > >> to use ICU instead of wchar_t and wcscoll. > > >> > > >> This has been tested only on FreeBSD-4.11 & > > FreeBSD-5-stable, where > > >> it seems to run well. I've not had the time to do any > comparative > > >> performance tests yet, but it seems it is at least not > slower than > > >> using LATIN1 with > > >> sv_SE.ISO8859-1 locale, perhaps even faster. > > >> > > >> I'd be delighted if some more experienced postgresql > hackers would > > >> review this stuff. The patch is pretty compact, so it's > > fast reading > > >> :) I'm planning to add this patch as an option (tagged > > >> "experimental") to FreeBSD's postgresql port. Any ideas > > about whether > > >> this is a good idea or not? > > >> > > >> Any thoughts or ideas are welcome! > > >> > > >> Cheers, > > >> Palle > > >> > > >> Patch at: > > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 > > > 005-03-14.diff> > > >> > > >> ICU at sourceforge: <http://icu.sf.net/> > > >> > > >> > > >> ---------------------------(end of > > >> broadcast)--------------------------- > > >> TIP 7: don't forget to increase your free space map settings > > >> > > >> > > > > > > > > > > > >
--On fredag, mars 25, 2005 23.39.33 +1100 John Hansen <john@geeknet.com.au> wrote: > Ok,.. tested on debian sarge with ICU 3.2 > UNICODE Database, C locale. > > upper() and lower() returns an empty string for any input, including > 7bit ascii, regardless of client_encoding, so something is obviously > broken. > > Have you tested this patch on a UNICODE DB with locale C/POSIX ? No, honestly not. Mostly tested it with my needs, sv_SE.UTF-8 and UNICODE, and also de_DE.UTF-8. How will PostgreSQL react to this combo? A database cluster initdb:ed with locale=C/POSIX, and then a database in UNICODE (really utf-8) representation... hmm... I think I might have made a false assumption that the locale string would contain the character encoding. I do something like encoding = strchr(locale, '.') + 1... That code will be confused by a 'C' locale, indeed. I'll check it out! /Palle > > ... John > >> -----Original Message----- >> From: John Hansen >> Sent: Friday, March 25, 2005 10:27 PM >> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org' >> Subject: RE: [HACKERS] Patch for collation using ICU >> >> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen >> > <john@geeknet.com.au> >> > wrote: >> > >> > > Useful if it's going to support earlier releases of ICU.... >> > > >> > > Not all os's come with ICU3.2, debian for example, >> > currently has 2.1 >> > > in testing, and 2.6 in unstable. >> > >> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older >> > version, I doubt it would too much difference. Some >> autoconf sorcery >> > needed, perhaps. >> >> Naww, it's no biggie, we'll just need to include ICU with pg I think. >> I tried that, there are several functions from ICU that you >> use, that are not in ICU2.1 >> >> Dono about 2.6. >> >> However, ICU3.2 compiles on debian with a small change to the >> debian/rules file. >> debian/tmp/etc is missing, so add mkdir debian/tmp/etc >> >> ... John >> >> > >> > /Palle >> > >> > > >> > > ... John >> > > >> > >> -----Original Message----- >> > >> From: pgsql-hackers-owner@postgresql.org >> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Palle >> > >> Girgensohn >> > >> Sent: Friday, March 25, 2005 10:40 AM >> > >> To: pgsql-hackers@postgresql.org >> > >> Subject: [HACKERS] Patch for collation using ICU >> > >> >> > >> Hi! >> > >> >> > >> I've put together a patch for using IBM's ICU package for >> > collation. >> > >> >> > >> If your OS does not have full support for collation ur >> > >> uppercase/lowercase in multibyte locales, this might be >> useful. If >> > >> you are using a multibyte character encoding in your >> database and >> > >> want collation, i.e. order by, and also lower(), upper() and >> > >> initcap() to work properly, this patch will do just that. >> > >> >> > >> This patch is needed for FreeBSD, since this OS has no >> support for >> > >> collation of for example unicode locales (that is, >> wcscoll(3) does >> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for >> > example). >> > >> AFAIK the patch is *not* necessary for Linux, although IBM >> > claims ICU >> > >> collation to be about twice as fast as glibc for simple western >> > >> locales. >> > >> >> > >> It adds a configure switch, `--with-icu', which will set >> > up the code >> > >> to use ICU instead of wchar_t and wcscoll. >> > >> >> > >> This has been tested only on FreeBSD-4.11 & >> > FreeBSD-5-stable, where >> > >> it seems to run well. I've not had the time to do any >> comparative >> > >> performance tests yet, but it seems it is at least not >> slower than >> > >> using LATIN1 with >> > >> sv_SE.ISO8859-1 locale, perhaps even faster. >> > >> >> > >> I'd be delighted if some more experienced postgresql >> hackers would >> > >> review this stuff. The patch is pretty compact, so it's >> > fast reading >> > >> :) I'm planning to add this patch as an option (tagged >> > >> "experimental") to FreeBSD's postgresql port. Any ideas >> > about whether >> > >> this is a good idea or not? >> > >> >> > >> Any thoughts or ideas are welcome! >> > >> >> > >> Cheers, >> > >> Palle >> > >> >> > >> Patch at: >> > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 >> > > 005-03-14.diff> >> > >> >> > >> ICU at sourceforge: <http://icu.sf.net/> >> > >> >> > >> >> > >> ---------------------------(end of >> > >> broadcast)--------------------------- >> > >> TIP 7: don't forget to increase your free space map settings >> > >> >> > >> >> > >> > >> > >> > >> > >> >
Palle Girgensohn <girgen@pingpong.net> writes: > hmm... I think I might have made a false assumption that > the locale string would contain the character encoding. You certainly cannot assume that. Would that it were so easy to find out the character set for a locale :-(. There's some code in initdb that you might emulate, though I can't say that I trust it a whole lot. regards, tom lane
Tom Lane wrote: >Palle Girgensohn <girgen@pingpong.net> writes: > > >>hmm... I think I might have made a false assumption that >>the locale string would contain the character encoding. >> >> > >You certainly cannot assume that. Would that it were so easy to find >out the character set for a locale :-(. > >There's some code in initdb that you might emulate, though I can't say >that I trust it a whole lot. > > > > Me either. On FC3 I saw this the other day, haven't had time to dig into what's going on: [andrew inst]$ echo $LANG en_US.UTF-8 [andrew inst]$ bin/initdb floobl The files belonging to this database system will be owned by user "andrew". This user must also own the server process. The database cluster will be initialized with locale en_US.UTF-8. initdb: could not find suitable encoding for locale "en_US.UTF-8" Rerun initdb with the -E option. Try "initdb --help" for more information. cheers andrew
--On fredag, mars 25, 2005 09.53.38 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Palle Girgensohn <girgen@pingpong.net> writes: >> hmm... I think I might have made a false assumption that >> the locale string would contain the character encoding. > > You certainly cannot assume that. Would that it were so easy to find > out the character set for a locale :-(. > > There's some code in initdb that you might emulate, though I can't say > that I trust it a whole lot. Best thing would of course be to check the encoding of the database. I realized I've made a bad assumption that the initdb locale has anything to do with the character encoding of the database - it doesn't. There must be some already existsing trusted way to get the character encoding of a database? /Palle
> --On fredag, mars 25, 2005 23.39.33 +1100 John Hansen > <john@geeknet.com.au> > wrote: > > > Ok,.. tested on debian sarge with ICU 3.2 UNICODE Database, > C locale. > > > > upper() and lower() returns an empty string for any input, > including > > 7bit ascii, regardless of client_encoding, so something is > obviously > > broken. > > > > Have you tested this patch on a UNICODE DB with locale C/POSIX ? FYI, I also found that initdb crashes with error 139 on any locale other than C/POSIX. > > No, honestly not. Mostly tested it with my needs, sv_SE.UTF-8 > and UNICODE, and also de_DE.UTF-8. > > How will PostgreSQL react to this combo? A database cluster > initdb:ed with locale=C/POSIX, and then a database in UNICODE > (really utf-8) representation... hmm... I think I might have > made a false assumption that the locale string would contain > the character encoding. I do something like encoding = > strchr(locale, '.') + 1... That code will be confused by a 'C' > locale, indeed. I'll check it out! > > /Palle > > > > > > > ... John > > > >> -----Original Message----- > >> From: John Hansen > >> Sent: Friday, March 25, 2005 10:27 PM > >> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org' > >> Subject: RE: [HACKERS] Patch for collation using ICU > >> > >> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen > >> > <john@geeknet.com.au> > >> > wrote: > >> > > >> > > Useful if it's going to support earlier releases of ICU.... > >> > > > >> > > Not all os's come with ICU3.2, debian for example, > >> > currently has 2.1 > >> > > in testing, and 2.6 in unstable. > >> > > >> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older > >> > version, I doubt it would too much difference. Some > >> autoconf sorcery > >> > needed, perhaps. > >> > >> Naww, it's no biggie, we'll just need to include ICU with > pg I think. > >> I tried that, there are several functions from ICU that > you use, that > >> are not in ICU2.1 > >> > >> Dono about 2.6. > >> > >> However, ICU3.2 compiles on debian with a small change to the > >> debian/rules file. > >> debian/tmp/etc is missing, so add mkdir debian/tmp/etc > >> > >> ... John > >> > >> > > >> > /Palle > >> > > >> > > > >> > > ... John > >> > > > >> > >> -----Original Message----- > >> > >> From: pgsql-hackers-owner@postgresql.org > >> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf > Of Palle > >> > >> Girgensohn > >> > >> Sent: Friday, March 25, 2005 10:40 AM > >> > >> To: pgsql-hackers@postgresql.org > >> > >> Subject: [HACKERS] Patch for collation using ICU > >> > >> > >> > >> Hi! > >> > >> > >> > >> I've put together a patch for using IBM's ICU package for > >> > collation. > >> > >> > >> > >> If your OS does not have full support for collation ur > >> > >> uppercase/lowercase in multibyte locales, this might be > >> useful. If > >> > >> you are using a multibyte character encoding in your > >> database and > >> > >> want collation, i.e. order by, and also lower(), upper() and > >> > >> initcap() to work properly, this patch will do just that. > >> > >> > >> > >> This patch is needed for FreeBSD, since this OS has no > >> support for > >> > >> collation of for example unicode locales (that is, > >> wcscoll(3) does > >> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for > >> > example). > >> > >> AFAIK the patch is *not* necessary for Linux, although IBM > >> > claims ICU > >> > >> collation to be about twice as fast as glibc for > simple western > >> > >> locales. > >> > >> > >> > >> It adds a configure switch, `--with-icu', which will set > >> > up the code > >> > >> to use ICU instead of wchar_t and wcscoll. > >> > >> > >> > >> This has been tested only on FreeBSD-4.11 & > >> > FreeBSD-5-stable, where > >> > >> it seems to run well. I've not had the time to do any > >> comparative > >> > >> performance tests yet, but it seems it is at least not > >> slower than > >> > >> using LATIN1 with > >> > >> sv_SE.ISO8859-1 locale, perhaps even faster. > >> > >> > >> > >> I'd be delighted if some more experienced postgresql > >> hackers would > >> > >> review this stuff. The patch is pretty compact, so it's > >> > fast reading > >> > >> :) I'm planning to add this patch as an option (tagged > >> > >> "experimental") to FreeBSD's postgresql port. Any ideas > >> > about whether > >> > >> this is a good idea or not? > >> > >> > >> > >> Any thoughts or ideas are welcome! > >> > >> > >> > >> Cheers, > >> > >> Palle > >> > >> > >> > >> Patch at: > >> > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 > >> > > 005-03-14.diff> > >> > >> > >> > >> ICU at sourceforge: <http://icu.sf.net/> > >> > >> > >> > >> > >> > >> ---------------------------(end of > >> > >> broadcast)--------------------------- > >> > >> TIP 7: don't forget to increase your free space map settings > >> > >> > >> > >> > >> > > >> > > >> > > >> > > >> > > >> > > > > > > >
--On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn <girgen@pingpong.net> wrote: > Hi! > > I've put together a patch for using IBM's ICU package for collation. > > If your OS does not have full support for collation ur > uppercase/lowercase in multibyte locales, this might be useful. If you > are using a multibyte character encoding in your database and want > collation, i.e. order by, and also lower(), upper() and initcap() to work > properly, this patch will do just that. > > This patch is needed for FreeBSD, since this OS has no support for > collation of for example unicode locales (that is, wcscoll(3) does not do > what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the > patch is *not* necessary for Linux, although IBM claims ICU collation to > be about twice as fast as glibc for simple western locales. > > It adds a configure switch, `--with-icu', which will set up the code to > use ICU instead of wchar_t and wcscoll. > > This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it > seems to run well. I've not had the time to do any comparative > performance tests yet, but it seems it is at least not slower than using > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. > > I'd be delighted if some more experienced postgresql hackers would review > this stuff. The patch is pretty compact, so it's fast reading :) I'm > planning to add this patch as an option (tagged "experimental") to > FreeBSD's postgresql port. Any ideas about whether this is a good idea or > not? > > Any thoughts or ideas are welcome! > > Cheers, > Palle > > Patch at: > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.d > iff> > > ICU at sourceforge: <http://icu.sf.net/> Hi! There's a new patch to fix some reported problems. <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff> This version uses the DatabaseEncoding and sets the ICU encoding at the same time. I had to create a conversion table from PostgreSQL's own, somewhat odd and non-standard, names of encodings, into the prefered IANA names. On or two of the more odd ones might be slightly incorrect, hopefully not too far off anyway? I've noticed a couple of things about using the ICU patch vs. pristine pg-8.0.1: - ORDER BY is case insensitive when using ICU. This might break the SQL standard (?), but sure is nice :) - When the database is initialized using the C locale, upper() and lower() normally does not work at all for non-ASCII characters even if the database's encoding is say LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, and this is probably correct since the locale is still `C', I believe?). The ICU patch changes nothing for the LATIN1 case, since it does not act on single byte encodings, but for the UNICODE representation, it works and does what I expect it to, namely upper() and lower() neatly upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') -> 'åäö'. This is a good thing, although I'm surprised that upper/lower is dragged along with the LC_COLLATE fixation at initdb. I never run initdb in the C locale, but only now do I realize how broken that really is if you need to store anything else than English :-) I'd be delighted to get more feedback about this stuff. Thanks, Palle
--On lördag, mars 26, 2005 10.42.19 +1100 John Hansen <john@geeknet.com.au> wrote: > FYI, I also found that initdb crashes with error 139 on any locale other > than C/POSIX. Odd, not for me, but I did make a bad assumption about character encoding. Perhaps the new patch will help? (see previous mail) What is error 139, anyone? /Palle
> -----Original Message----- > From: Palle Girgensohn [mailto:girgen@pingpong.net] > Sent: Saturday, March 26, 2005 1:10 PM > To: pgsql-hackers@postgresql.org > Cc: John Hansen; Andrew Dunstan > Subject: Re: [HACKERS] Patch for collation using ICU > > --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn > <girgen@pingpong.net> wrote: > > > Hi! > > > > I've put together a patch for using IBM's ICU package for collation. > > > > If your OS does not have full support for collation ur > > uppercase/lowercase in multibyte locales, this might be > useful. If you > > are using a multibyte character encoding in your database and want > > collation, i.e. order by, and also lower(), upper() and > initcap() to > > work properly, this patch will do just that. > > > > This patch is needed for FreeBSD, since this OS has no support for > > collation of for example unicode locales (that is, > wcscoll(3) does not > > do what you expect if you set LC_ALL=sv_SE.UTF-8, for > example). AFAIK > > the patch is *not* necessary for Linux, although IBM claims ICU > > collation to be about twice as fast as glibc for simple > western locales. > > > > It adds a configure switch, `--with-icu', which will set up > the code > > to use ICU instead of wchar_t and wcscoll. > > > > This has been tested only on FreeBSD-4.11 & > FreeBSD-5-stable, where it > > seems to run well. I've not had the time to do any comparative > > performance tests yet, but it seems it is at least not slower than > > using > > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. > > > > I'd be delighted if some more experienced postgresql hackers would > > review this stuff. The patch is pretty compact, so it's > fast reading > > :) I'm planning to add this patch as an option (tagged > > "experimental") to FreeBSD's postgresql port. Any ideas > about whether > > this is a good idea or not? > > > > Any thoughts or ideas are welcome! > > > > Cheers, > > Palle > > > > Patch at: > > > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1 > > 4.d > > iff> > > > > ICU at sourceforge: <http://icu.sf.net/> > > > Hi! > > There's a new patch to fix some reported problems. > > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-26.diff> > > This version uses the DatabaseEncoding and sets the ICU > encoding at the same time. I had to create a conversion table > from PostgreSQL's own, somewhat odd and non-standard, names > of encodings, into the prefered IANA names. On or two of the > more odd ones might be slightly incorrect, hopefully not too > far off anyway? > > I've noticed a couple of things about using the ICU patch vs. pristine > pg-8.0.1: > > - ORDER BY is case insensitive when using ICU. This might > break the SQL standard (?), but sure is nice :) This would mean that indexes are also case insensitive right? Which makes it a Bad Thing(tm). > - When the database is initialized using the C locale, > upper() and lower() normally does not work at all for > non-ASCII characters even if the database's encoding is say > LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, > and this is probably correct since the locale is still `C', I > believe?). The ICU patch changes nothing for the LATIN1 case, > since it does not act on single byte encodings, but for the > UNICODE representation, it works and does what I expect it > to, namely upper() and lower() neatly > upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') > -> 'åäö'. > This is a good thing, although I'm surprised that upper/lower > is dragged along with the LC_COLLATE fixation at initdb. I > never run initdb in the C locale, but only now do I realize > how broken that really is if you need to store anything else > than English :-) That is what I would have expected. However, it probably won't work for the more exotic cases, like turkish I, which dependson the locale. > > I'd be delighted to get more feedback about this stuff. > > Thanks, > Palle > > >
Still doesn't work for me :( UNICODE DB C locale set client_encoding = iso88591; select upper('æ');upper ------- (1 row) Trying to initdb with en_IN.utf8 /usr/lib/postgresql/bin/initdb -D /var/lib/postgres/data/ -E UNICODE --locale=en_IN.utf8 The files belonging to this database system will be owned by user "postgres". This user must also own the server process. The database cluster will be initialized with locale en_IN.utf8. fixing permissions on existing directory /var/lib/postgres/data ... ok creating directory /var/lib/postgres/data/global ... ok creating directory /var/lib/postgres/data/pg_xlog ... ok creating directory /var/lib/postgres/data/pg_xlog/archive_status ... ok creating directory /var/lib/postgres/data/pg_clog ... ok creating directory /var/lib/postgres/data/pg_subtrans ... ok creating directory /var/lib/postgres/data/base ... ok creating directory /var/lib/postgres/data/base/1 ... ok creating directory /var/lib/postgres/data/pg_tblspc ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 1000 creating configuration files ... ok creating template1 database in /var/lib/postgres/data/base/1 ... ok initializing pg_shadow ... ok enabling unlimited row size for system tables ... ok initializing pg_depend ... ok creating system views ... ok loading pg_description ... ok creating conversions ... ok setting privileges on built-in objects ... ok creating information schema ... child process exited with exit code 139 initdb: removing contents of data directory "/var/lib/postgres/data" ... John > -----Original Message----- > From: Palle Girgensohn [mailto:girgen@pingpong.net] > Sent: Saturday, March 26, 2005 1:10 PM > To: pgsql-hackers@postgresql.org > Cc: John Hansen; Andrew Dunstan > Subject: Re: [HACKERS] Patch for collation using ICU > > --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn > <girgen@pingpong.net> wrote: > > > Hi! > > > > I've put together a patch for using IBM's ICU package for collation. > > > > If your OS does not have full support for collation ur > > uppercase/lowercase in multibyte locales, this might be > useful. If you > > are using a multibyte character encoding in your database and want > > collation, i.e. order by, and also lower(), upper() and > initcap() to > > work properly, this patch will do just that. > > > > This patch is needed for FreeBSD, since this OS has no support for > > collation of for example unicode locales (that is, > wcscoll(3) does not > > do what you expect if you set LC_ALL=sv_SE.UTF-8, for > example). AFAIK > > the patch is *not* necessary for Linux, although IBM claims ICU > > collation to be about twice as fast as glibc for simple > western locales. > > > > It adds a configure switch, `--with-icu', which will set up > the code > > to use ICU instead of wchar_t and wcscoll. > > > > This has been tested only on FreeBSD-4.11 & > FreeBSD-5-stable, where it > > seems to run well. I've not had the time to do any comparative > > performance tests yet, but it seems it is at least not slower than > > using > > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. > > > > I'd be delighted if some more experienced postgresql hackers would > > review this stuff. The patch is pretty compact, so it's > fast reading > > :) I'm planning to add this patch as an option (tagged > > "experimental") to FreeBSD's postgresql port. Any ideas > about whether > > this is a good idea or not? > > > > Any thoughts or ideas are welcome! > > > > Cheers, > > Palle > > > > Patch at: > > > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1 > > 4.d > > iff> > > > > ICU at sourceforge: <http://icu.sf.net/> > > > Hi! > > There's a new patch to fix some reported problems. > > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 > 005-03-26.diff> > > This version uses the DatabaseEncoding and sets the ICU > encoding at the same time. I had to create a conversion table > from PostgreSQL's own, somewhat odd and non-standard, names > of encodings, into the prefered IANA names. On or two of the > more odd ones might be slightly incorrect, hopefully not too > far off anyway? > > I've noticed a couple of things about using the ICU patch vs. pristine > pg-8.0.1: > > - ORDER BY is case insensitive when using ICU. This might > break the SQL standard (?), but sure is nice :) > > - When the database is initialized using the C locale, > upper() and lower() normally does not work at all for > non-ASCII characters even if the database's encoding is say > LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, > and this is probably correct since the locale is still `C', I > believe?). The ICU patch changes nothing for the LATIN1 case, > since it does not act on single byte encodings, but for the > UNICODE representation, it works and does what I expect it > to, namely upper() and lower() neatly > upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') > -> 'åäö'. > This is a good thing, although I'm surprised that upper/lower > is dragged along with the LC_COLLATE fixation at initdb. I > never run initdb in the C locale, but only now do I realize > how broken that really is if you need to store anything else > than English :-) > > I'd be delighted to get more feedback about this stuff. > > Thanks, > Palle > > >
On Sat, 26 Mar 2005, Palle Girgensohn wrote: > I've noticed a couple of things about using the ICU patch vs. pristine > pg-8.0.1: > > - ORDER BY is case insensitive when using ICU. This might break the SQL > standard (?), but sure is nice :) Err, I think if your system implements strcoll correctly 8.0.1 can do this if the chosen collation is set up that way (or at least naive tests I've done seem to imply that). Or are you speaking about C locale?
--On lördag, mars 26, 2005 08.16.01 -0800 Stephan Szabo <sszabo@megazone.bigpanda.com> wrote: > On Sat, 26 Mar 2005, Palle Girgensohn wrote: >> I've noticed a couple of things about using the ICU patch vs. pristine >> pg-8.0.1: >> >> - ORDER BY is case insensitive when using ICU. This might break the SQL >> standard (?), but sure is nice :) > > Err, I think if your system implements strcoll correctly 8.0.1 can do this > if the chosen collation is set up that way (or at least naive tests I've > done seem to imply that). Or are you speaking about C locale? No, I doubt this. Example: set up a cluster: $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1 $ createdb foo CREATE DATABASE $ psql foo foo=# create table bar (val text); CREATE TABLE foo=# insert into bar values ('aaa'); INSERT 18354409 1 foo=# insert into bar values ('BBB'); INSERT 18354412 1 foo=# select val from bar order by val;val -----BBBaaa (2 rows) Order by is not case insensitive. It shouldn't be for any system, AFAIK. As John Hansen noted, this might be a bad thing. I'm not sure about that, though... As for general collation of unicode, the reason for me to use ICU is that my system does not support strcoll correctly for multibyte locales, as I mentioned earlier. I also noted that even for systems that do handle strcoll correctly for unicode, ICU claims to be a couple of magnitudes faster, so this patch might be useful for other systems (read Linux) as well. See previous emails for details. Regards, Palle
--On lördag, mars 26, 2005 13.59.19 +1100 John Hansen <john@geeknet.com.au> wrote: >> - ORDER BY is case insensitive when using ICU. This might >> break the SQL standard (?), but sure is nice :) > > This would mean that indexes are also case insensitive right? > Which makes it a Bad Thing(tm). Well, no, not really. Indices use collation rules, yes, but upper and lower case strings are not considered *equal*, just "closer related". In collation, characters are compared at four levels. See [1] for a good explaination. This means that indices will use a case insensitive sort order, but equality will not be different, so it shouldn't break anything. >> - When the database is initialized using the C locale, >> upper() and lower() normally does not work at all for >> non-ASCII characters even if the database's encoding is say >> LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, >> and this is probably correct since the locale is still `C', I >> believe?). The ICU patch changes nothing for the LATIN1 case, >> since it does not act on single byte encodings, but for the >> UNICODE representation, it works and does what I expect it >> to, namely upper() and lower() neatly >> upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') >> -> 'åäö'. >> This is a good thing, although I'm surprised that upper/lower >> is dragged along with the LC_COLLATE fixation at initdb. I >> never run initdb in the C locale, but only now do I realize >> how broken that really is if you need to store anything else >> than English :-) > > That is what I would have expected. However, it probably won't work for > the more exotic cases, like turkish I, which depends on the locale. Nope, Turkish must of course have its locale to for example handle their special capital "i". Let's just say it is less broken :) /Palle [1] <http://icu.sourceforge.net/userguide/Collate_Concepts.html#Comparison_Levels>
On Sun, 27 Mar 2005, Palle Girgensohn wrote: > > > --On l�rdag, mars 26, 2005 08.16.01 -0800 Stephan Szabo > <sszabo@megazone.bigpanda.com> wrote: > > > On Sat, 26 Mar 2005, Palle Girgensohn wrote: > >> I've noticed a couple of things about using the ICU patch vs. pristine > >> pg-8.0.1: > >> > >> - ORDER BY is case insensitive when using ICU. This might break the SQL > >> standard (?), but sure is nice :) > > > > Err, I think if your system implements strcoll correctly 8.0.1 can do this > > if the chosen collation is set up that way (or at least naive tests I've > > done seem to imply that). Or are you speaking about C locale? > > No, I doubt this. > > Example: set up a cluster: > $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1 > $ createdb foo > CREATE DATABASE > $ psql foo > foo=# create table bar (val text); > CREATE TABLE > foo=# insert into bar values ('aaa'); > INSERT 18354409 1 > foo=# insert into bar values ('BBB'); > INSERT 18354412 1 > foo=# select val from bar order by val; > val > ----- > BBB > aaa > (2 rows) > > > Order by is not case insensitive. It shouldn't be for any system, AFAIK. As It is on my machine... for the same test: foo=# select val from bar order by val;val -----aaaBBB (2 rows) I think this just implies even greater breakage of either the collation or strcoll on the system you're trying on. ;) Which, of course, is a fairly reasonable reason to offer an alternative. Especially if it's generically useful.
--On lördag, mars 26, 2005 17.40.01 -0800 Stephan Szabo <sszabo@megazone.bigpanda.com> wrote: > > On Sun, 27 Mar 2005, Palle Girgensohn wrote: > >> >> >> --On lördag, mars 26, 2005 08.16.01 -0800 Stephan Szabo >> <sszabo@megazone.bigpanda.com> wrote: >> >> > On Sat, 26 Mar 2005, Palle Girgensohn wrote: >> >> I've noticed a couple of things about using the ICU patch vs. pristine >> >> pg-8.0.1: >> >> >> >> - ORDER BY is case insensitive when using ICU. This might break the >> >> SQL standard (?), but sure is nice :) >> > >> > Err, I think if your system implements strcoll correctly 8.0.1 can do >> > this if the chosen collation is set up that way (or at least naive >> > tests I've done seem to imply that). Or are you speaking about C >> > locale? >> >> No, I doubt this. >> >> Example: set up a cluster: >> $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1 >> $ createdb foo >> CREATE DATABASE >> $ psql foo >> foo=# create table bar (val text); >> CREATE TABLE >> foo=# insert into bar values ('aaa'); >> INSERT 18354409 1 >> foo=# insert into bar values ('BBB'); >> INSERT 18354412 1 >> foo=# select val from bar order by val; >> val >> ----- >> BBB >> aaa >> (2 rows) >> >> >> Order by is not case insensitive. It shouldn't be for any system, AFAIK. >> As > > It is on my machine... for the same test: > > foo=# select val from bar order by val; > val > ----- > aaa > BBB > (2 rows) > > I think this just implies even greater breakage of either the collation or > strcoll on the system you're trying on. ;) Which, of course, is a fairly > reasonable reason to offer an alternative. Especially if it's generically > useful. Interesting! Indeed, just tried on an old Linux Redhat system... BTW, that's pretty odd for a unix system. "ls -l" sorts aaa before BBB, I've never seen the likes of it! Call me old fashion if you like ;-) Still, as you say, FreeBSD does it capital letters first, and does not handle unicode locales' collation, so I need an alternative. Perhaps the best way would be to inject ICU into BSD instead :-) /Palle
>As for general collation of unicode, the reason for me to use >ICU is that >my system does not support strcoll correctly for multibyte >locales, as I >mentioned earlier. I also noted that even for systems that do handle >strcoll correctly for unicode, ICU claims to be a couple of magnitudes >faster, so this patch might be useful for other systems (read >Linux) as >well. See previous emails for details. Does it work on win32? If so, perhaps it could help with the lack of UTF-8 strcoll there? //Magnus
--On söndag, mars 27, 2005 20.11.48 +0200 Magnus Hagander <mha@sollentuna.net> wrote: >> As for general collation of unicode, the reason for me to use >> ICU is that >> my system does not support strcoll correctly for multibyte >> locales, as I >> mentioned earlier. I also noted that even for systems that do handle >> strcoll correctly for unicode, ICU claims to be a couple of magnitudes >> faster, so this patch might be useful for other systems (read >> Linux) as >> well. See previous emails for details. > > Does it work on win32? If so, perhaps it could help with the lack of > UTF-8 strcoll there? Hej Magnus! Well, ICU exists for Windows. If you configure postgresql's source to link with ICU on Windows, it would probably work. Please try it out. /Palle
On L, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote: > Hi! > ... > I've noticed a couple of things about using the ICU patch vs. pristine > pg-8.0.1: > > - ORDER BY is case insensitive when using ICU. This might break the SQL > standard (?), but sure is nice :) How does your patch interact with the ability to use indexes for anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ? -- Hannu Krosing <hannu@tm.ee>
--On söndag, mars 27, 2005 04.34.03 +0300 Hannu Krosing <hannu@tm.ee> wrote: > On L, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote: > >> Hi! >> > ... >> I've noticed a couple of things about using the ICU patch vs. pristine >> pg-8.0.1: >> >> - ORDER BY is case insensitive when using ICU. This might break the SQL >> standard (?), but sure is nice :) Just a comment: ORDER BY *is* already case sensitive on Linux, since its strcoll ignores case. I doubt very much it violates SQL standards. > How does your patch interact with the ability to use indexes for > anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ? I don't think it matters. You still need to use the special non-locale index functions described in the handbook to get anchored like queries use indices. My ICU patch does not alter this. ICU is "injected" where strings are compared and the database cluster was initialized with a multibyte character encoding. The problem, AFAIK, has to do with the nature of (some) locales, not with a specific implementation of collation. Regards, Palle
Andrew Dunstan wrote: > The database cluster will be initialized with locale en_US.UTF-8. > initdb: could not find suitable encoding for locale "en_US.UTF-8" What does $ LC_ALL=en_US.UTF-8 locale charmap show? -- Peter Eisentraut http://developer.postgresql.org/~petere/
Palle Girgensohn wrote: > Just a comment: ORDER BY *is* already case sensitive on Linux, since > its strcoll ignores case. I doubt very much it violates SQL > standards. The behavior of collation sequences is implementation-defined. So as long as you can put the behavior in words, it should be OK. It would seem, however, that the behavior of a certain locale name should be the same with or without ICU, so perhaps some locale renaming might be needed, but that is speculation on my part. > > How does your patch interact with the ability to use indexes for > > anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use > > index) ? > The problem, AFAIK, has to do with the nature of (some) locales, not > with a specific implementation of collation. Yeah, pretty much the whole point of that code is to avoid collating stuff. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: >Andrew Dunstan wrote: > > >>The database cluster will be initialized with locale en_US.UTF-8. >>initdb: could not find suitable encoding for locale "en_US.UTF-8" >> >> > >What does > >$ LC_ALL=en_US.UTF-8 locale charmap > >show? > > > [andrew pgsql]$ LC_ALL=en_US.UTF-8 locale charmap UTF-8 [andrew pgsql]$ cheers andrew
Andrew Dunstan wrote: > [andrew pgsql]$ LC_ALL=en_US.UTF-8 locale charmap > UTF-8 That seems normal. Time to get out the debugger, I suppose. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Is this patch ready for application? http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.diff.gz The web site is: http://people.freebsd.org/~girgen/postgresql-icu/readme.html I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too. Why do you need to add a mapping of encoding names from iana to our names? --------------------------------------------------------------------------- Palle Girgensohn wrote: > Hi! > > I've put together a patch for using IBM's ICU package for collation. > > If your OS does not have full support for collation ur uppercase/lowercase > in multibyte locales, this might be useful. If you are using a multibyte > character encoding in your database and want collation, i.e. order by, and > also lower(), upper() and initcap() to work properly, this patch will do > just that. > > This patch is needed for FreeBSD, since this OS has no support for > collation of for example unicode locales (that is, wcscoll(3) does not do > what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the > patch is *not* necessary for Linux, although IBM claims ICU collation to be > about twice as fast as glibc for simple western locales. > > It adds a configure switch, `--with-icu', which will set up the code to use > ICU instead of wchar_t and wcscoll. > > This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it > seems to run well. I've not had the time to do any comparative performance > tests yet, but it seems it is at least not slower than using LATIN1 with > sv_SE.ISO8859-1 locale, perhaps even faster. > > I'd be delighted if some more experienced postgresql hackers would review > this stuff. The patch is pretty compact, so it's fast reading :) I'm > planning to add this patch as an option (tagged "experimental") to > FreeBSD's postgresql port. Any ideas about whether this is a good idea or > not? > > Any thoughts or ideas are welcome! > > Cheers, > Palle > > Patch at: > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff> > > ICU at sourceforge: <http://icu.sf.net/> > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Is this patch ready for application? Not until ICU is released under a BSD license ... regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Is this patch ready for application? > > Not until ICU is released under a BSD license ... Well, readline isn't BSD either, but we use it. It is any different? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On 2005-05-07, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Tom Lane wrote: >> Bruce Momjian <pgman@candle.pha.pa.us> writes: >> > Is this patch ready for application? >> >> Not until ICU is released under a BSD license ... > > Well, readline isn't BSD either, but we use it. It is any different? ICU appears to be under the X license, which is no more restrictive than BSD-with-no-advertising-clause. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
> Why do you need to add a mapping of encoding names from iana > to our names? > The pg encoding names are not recognized by ICU, hence the mappings.... Install ICU 3.2 on your system, and run uconv -l, that will give you a list of valid ICU encoding names. ... John
Btw, Does it feel right to have pg depend on the bleeding edge version of ICU? On many distro's, even gentoo (known for being bleeding edge) 2.8 is still the default. 2.8 and 3.2 are however incompatible, and supporting both, would bloat the source somewhat. ... John
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Not until ICU is released under a BSD license ... > Well, readline isn't BSD either, but we use it. It is any different? Did you read the license? Some of the more troubling bits: : It is the understanding of INTERNATIONAL BUSINESS MACHINES CORPORATION : that the purpose for which its publications are being reproduced is : accurate and true as stated in your attached request. (er, which attached request would that be?) : Permission to quote from or reprint IBM publications is limited to the : purpose and quantities originally requested and must not be construed as : a blanket license to use the material for other purposes or to reprint : other IBM copyrighted material. : IBM reserves the right to withdraw permission to reproduce copyrighted : material whenever, in its discretion, it feels that the privilege of : reproducing its material is being used in a way detrimental to its : interest or the above instructions are not being followed properly to : protect its copyright. : IBM may have patents or pending patent applications covering subject : matter in this document. The furnishing of this document does not give : you any license to these patents. You can send license inquiries, in : writing, to: : For license inquiries regarding double-byte (DBCS) information, contact : the IBM Intellectual Property Department in your country or send : inquiries, in writing, to: regards, tom lane
Where'd you get the licence from? None of that is in the licence I'm reading! (http://www-306.ibm.com/software/globalization/icu/index.jsp) (http://www-306.ibm.com/software/globalization/icu/license.jsp) ... John > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > Sent: Saturday, May 07, 2005 3:17 PM > To: Bruce Momjian > Cc: Palle Girgensohn; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> Not until ICU is released under a BSD license ... > > > Well, readline isn't BSD either, but we use it. It is any > different? > > Did you read the license? Some of the more troubling bits: > > : It is the understanding of INTERNATIONAL BUSINESS MACHINES > CORPORATION > : that the purpose for which its publications are being reproduced is > : accurate and true as stated in your attached request. > > (er, which attached request would that be?) > > : Permission to quote from or reprint IBM publications is > limited to the > : purpose and quantities originally requested and must not be > construed as > : a blanket license to use the material for other purposes or > to reprint > : other IBM copyrighted material. > > : IBM reserves the right to withdraw permission to reproduce > copyrighted > : material whenever, in its discretion, it feels that the privilege of > : reproducing its material is being used in a way detrimental to its > : interest or the above instructions are not being followed > properly to > : protect its copyright. > > : IBM may have patents or pending patent applications covering subject > : matter in this document. The furnishing of this document > does not give > : you any license to these patents. You can send license inquiries, in > : writing, to: > > : For license inquiries regarding double-byte (DBCS) > information, contact > : the IBM Intellectual Property Department in your country or send > : inquiries, in writing, to: > > regards, tom lane > > ---------------------------(end of > broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to > majordomo@postgresql.org) > >
--On fredag, maj 06, 2005 23.31.20 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Is this patch ready for application? > > Not until ICU is released under a BSD license ... It's not GPL anyway. Seems pretty much like the BSD license, at least more BSD-ish than GPL-ish. <http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html> /Palle
--On fredag, maj 06, 2005 22.57.59 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > Is this patch ready for application? > > http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.d > iff.gz > > The web site is: > > http://people.freebsd.org/~girgen/postgresql-icu/readme.html I don't think so, not quite. I have not had any positive reports from linux users, this is only tested in a FreeBSD environment. I'd say it needs some more testing. Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. > I do have a few questions: > > Why don't you use the lc_ctype_is_c() part of this test? > > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. > Why is so much code added, for example, in lower()? The existing > multibyte code is much smaller, and lots of code is added in other > places too. ICU uses UTF-16 internally, so all strings must be converted from the database encoding to UTF-16. Since that means the strings need to be copied, I took the same approach as in varlena.c:varstr_cmp(), where small strings use the heap and only larger strings use a palloc. Comments in varstr_cmp about performance made me use that approach. Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. > Why do you need to add a mapping of encoding names from iana to our > names? This was already answered by John Hansen... There's an old thread here about the choice of the name "UNICODE" to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. /Palle > > ------------------------------------------------------------------------- > -- > > Palle Girgensohn wrote: >> Hi! >> >> I've put together a patch for using IBM's ICU package for collation. >> >> If your OS does not have full support for collation ur >> uppercase/lowercase in multibyte locales, this might be useful. If you >> are using a multibyte character encoding in your database and want >> collation, i.e. order by, and also lower(), upper() and initcap() to >> work properly, this patch will do just that. >> >> This patch is needed for FreeBSD, since this OS has no support for >> collation of for example unicode locales (that is, wcscoll(3) does not >> do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK >> the patch is *not* necessary for Linux, although IBM claims ICU >> collation to be about twice as fast as glibc for simple western locales. >> >> It adds a configure switch, `--with-icu', which will set up the code to >> use ICU instead of wchar_t and wcscoll. >> >> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it >> seems to run well. I've not had the time to do any comparative >> performance tests yet, but it seems it is at least not slower than >> using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. >> >> I'd be delighted if some more experienced postgresql hackers would >> review this stuff. The patch is pretty compact, so it's fast reading :) >> I'm planning to add this patch as an option (tagged "experimental") to >> FreeBSD's postgresql port. Any ideas about whether this is a good idea >> or not? >> >> Any thoughts or ideas are welcome! >> >> Cheers, >> Palle >> >> Patch at: >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14. >> diff> >> >> ICU at sourceforge: <http://icu.sf.net/> >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 7: don't forget to increase your free space map settings >> > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania > 19073
> > I use this patch in production on one FreeBSD 4.10 server at > the moment. > With the latest version, I've had no problems. Logging is > swithed on for > now, and it shows no signs of ICU complaining. I'd like more > reports on > Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t -----æøåÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters.
Palle Girgensohn wrote: > > > > Is this patch ready for application? > > I don't think so, not quite. I have not had any positive reports from linux > users, this is only tested in a FreeBSD environment. I'd say it needs some > more testing. OK. > Also, apparently, ICU is installed by default in many linux distributions, > and usually it is version 2.8. Some linux users have asked me if there are > plans for a patch that works with ICU 2.8. That's probably a good idea. IBM > and the ICU folks seem to consider 3.2 to be the stable version, older > versions are hard to find on their sites, but most linux distributers seem > to consider it too bleeding edge, even gentoo. I don't know why they don't > agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? > > I do have a few questions: > > > > Why don't you use the lc_ctype_is_c() part of this test? > > > > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) > > Um, well, I didn't think about that. :) What would be the locale in this > case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide > encoding, indeed. Then the strings will be handled like byte-wide chars. > Yeah, it's a bug. I'll fix it! Thanks. The additional test is more of an optmization, and it fixes a problem with some OSs that have processing problems with UTF8 when the locale is supposed to be turned off, like in "C". I realize ICU might be fine with it but the optimization still is an issue. > > Why is so much code added, for example, in lower()? The existing > > multibyte code is much smaller, and lots of code is added in other > > places too. > > ICU uses UTF-16 internally, so all strings must be converted from the > database encoding to UTF-16. Since that means the strings need to be > copied, I took the same approach as in varlena.c:varstr_cmp(), where small > strings use the heap and only larger strings use a palloc. Comments in > varstr_cmp about performance made me use that approach. Oh, interesting. I think you need to create new functions that factor out that common code so the patch is smaller and easier to maintain. > Also, in the latest patch, I also added checks and logging for *every* > status returned from ICU. I hope this will help debugging on debian, where > previous version didn't work. That excessive status checking is hardly be > necessary once the stuff is better tested. > > I think the string copying and heap/palloc choices stands for most of the > code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. > > Why do you need to add a mapping of encoding names from iana to our > > names? > > This was already answered by John Hansen... There's an old thread here > about the choice of the name "UNICODE" to describe an encoding, which it > doesn't. There's half a dozen unicode based encodings... UTF-8 is used by > postgresql, that would have been a better name... Similarly for most other > encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL > can't provide it, so a mapping table is required. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? > I use this patch in production on one FreeBSD 4.10 server at the moment. > With the latest version, I've had no problems. Logging is swithed on for > now, and it shows no signs of ICU complaining. I'd like more reports on > Linux, though. OK, I certainly would like this all done for 8.1 which should have feature freeze on July 1. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Errm,... initdb --encoding UNICODE --locale C > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen > Sent: Saturday, May 07, 2005 10:23 PM > To: Palle Girgensohn; Bruce Momjian > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > I use this patch in production on one FreeBSD 4.10 server at the > > moment. > > With the latest version, I've had no problems. Logging is > swithed on > > for now, and it shows no signs of ICU complaining. I'd like more > > reports on Linux, though. > > I currently use this on gentoo with ICU3.2 unmasked. > > Works a dream, even with locale C and UNICODE database. > > Small test: > > createdb --encoding UNICODE --locale C test psql test set > client_encoding=iso88591; CREATE TABLE test (t text); INSERT > INTO test (t) VALUES ('æøå'); set client_encoding=unicode; > INSERT INTO test (t) SELECT upper(t) FROM test; set > client_encoding=iso88591; SELECT * FROM test; > t > ----- > æøå > ÆØÅ > (2 rows) > > Just as I'd expect, as upper/lower/initcap are locale > independent for these characters. > > > ---------------------------(end of > broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > >
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen <john@geeknet.com.au> wrote: > Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: >> I do have a few questions: >> >> Why don't you use the lc_ctype_is_c() part of this test? >> >> if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) > > Um, well, I didn't think about that. :) What would be the locale in this > case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide > encoding, indeed. Then the strings will be handled like byte-wide chars. > Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle > >> -----Original Message----- >> From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen >> Sent: Saturday, May 07, 2005 10:23 PM >> To: Palle Girgensohn; Bruce Momjian >> Cc: pgsql-hackers@postgresql.org >> Subject: Re: [HACKERS] Patch for collation using ICU >> >> > >> > I use this patch in production on one FreeBSD 4.10 server at the >> > moment. >> > With the latest version, I've had no problems. Logging is >> swithed on >> > for now, and it shows no signs of ICU complaining. I'd like more >> > reports on Linux, though. >> >> I currently use this on gentoo with ICU3.2 unmasked. >> >> Works a dream, even with locale C and UNICODE database. >> >> Small test: >> >> createdb --encoding UNICODE --locale C test psql test set >> client_encoding=iso88591; CREATE TABLE test (t text); INSERT >> INTO test (t) VALUES ('æøå'); set client_encoding=unicode; >> INSERT INTO test (t) SELECT upper(t) FROM test; set >> client_encoding=iso88591; SELECT * FROM test; >> t >> ----- >> æøå >> ÆØÅ >> (2 rows) >> >> Just as I'd expect, as upper/lower/initcap are locale >> independent for these characters. >> >> >> ---------------------------(end of >> broadcast)--------------------------- >> TIP 5: Have you checked our extensive FAQ? >> >> http://www.postgresql.org/docs/faq >> >>
Bruce Momjian wrote: > Palle Girgensohn wrote: > > > > > > Is this patch ready for application? > > > > I don't think so, not quite. I have not had any positive > reports from > > linux users, this is only tested in a FreeBSD environment. > I'd say it > > needs some more testing. > > OK. > > > Also, apparently, ICU is installed by default in many linux > > distributions, and usually it is version 2.8. Some linux users have > > asked me if there are plans for a patch that works with ICU 2.8. > > That's probably a good idea. IBM and the ICU folks seem to consider > > 3.2 to be the stable version, older versions are hard to > find on their > > sites, but most linux distributers seem to consider it too bleeding > > edge, even gentoo. I don't know why they don't agree. > > Good point. Why would linux folks need ICU? Doesn't their > OS support encodings natively? I am particularly excited > about this for OSs that don't have such encodings, like UTF8 > support for Win32. > > Because ICU will not be used unless enabled by configure, it > seems we are fine with only supporting the newest version. > Do Linux users need to use ICU for any reason? Yes, because on many linux platforms locale support is broken. Also, ICU enables full unicode support, particularly in multi-language situations where locale is C, and makes upper/lower/initcap work as expected, except where it depends on locale information. There are also many other useful things in ICU that could be implemented. Transliteration, and break-iterators for example. Break-iteration particularly interresting for converting a text to a list of words. Another is it's builtin substring searches. > > > > I do have a few questions: > > > > > > Why don't you use the lc_ctype_is_c() part of this test? > > > > > > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) > > > > Um, well, I didn't think about that. :) What would be the > locale in > > this case? c_C.UTF-8? ;) Hmm, it is possible to have > CTYPE=C and use > > a wide encoding, indeed. Then the strings will be handled > like byte-wide chars. > > Yeah, it's a bug. I'll fix it! Thanks. > > The additional test is more of an optmization, and it fixes a > problem with some OSs that have processing problems with UTF8 > when the locale is supposed to be turned off, like in "C". I > realize ICU might be fine with it but the optimization still > is an issue. That the locale is supposed to be turned off, doesn't mean it shouldn't use ICU. ICU is more than just locales. > > > Why is so much code added, for example, in lower()? The existing > > > multibyte code is much smaller, and lots of code is added > in other > > > places too. > > > > ICU uses UTF-16 internally, so all strings must be > converted from the > > database encoding to UTF-16. Since that means the strings > need to be > > copied, I took the same approach as in > varlena.c:varstr_cmp(), where > > small strings use the heap and only larger strings use a palloc. > > Comments in varstr_cmp about performance made me use that approach. > > Oh, interesting. I think you need to create new functions that > factor out that common code so the patch is smaller and > easier to maintain. > > > Also, in the latest patch, I also added checks and logging > for *every* > > status returned from ICU. I hope this will help debugging > on debian, > > where previous version didn't work. That excessive status > checking is > > hardly be necessary once the stuff is better tested. > > > > I think the string copying and heap/palloc choices stands > for most of > > the code bloat, together with the excessive status checking > and logging. > > OK, move that into some common functions and I think it will > be better. > > > > Why do you need to add a mapping of encoding names from > iana to our > > > names? > > > > This was already answered by John Hansen... There's an old > thread here > > about the choice of the name "UNICODE" to describe an > encoding, which > > it doesn't. There's half a dozen unicode based encodings... > UTF-8 is > > used by postgresql, that would have been a better name... Similarly > > for most other encodings, really. ICU expect a setlocale(3) string > > (i.e. IANA). PostgreSQL can't provide it, so a mapping > table is required. > > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). > Does that help? > > > I use this patch in production on one FreeBSD 4.10 server > at the moment. > > With the latest version, I've had no problems. Logging is > swithed on > > for now, and it shows no signs of ICU complaining. I'd like more > > reports on Linux, though. > > OK, I certainly would like this all done for 8.1 which should > have feature freeze on July 1. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, > Pennsylvania 19073 > > ---------------------------(end of > broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > >
> --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen > <john@geeknet.com.au> > wrote: > > > Errm,... initdb --encoding UNICODE --locale C > > You mean that ICU *shall* be used even for the C locale, and > not as Bruce suggested here: Yes, that's exactly what I mean. > > >> I do have a few questions: > >> > >> Why don't you use the lc_ctype_is_c() part of this test? > >> > >> if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) > > > > Um, well, I didn't think about that. :) What would be the > locale in > > this case? c_C.UTF-8? ;) Hmm, it is possible to have > CTYPE=C and use > > a wide encoding, indeed. Then the strings will be handled > like byte-wide chars. > > Yeah, it's a bug. I'll fix it! Thanks. > > John disagrees here, and I'm obliged to agree. Using the C > locale, one will expect C collation, but upper/lower is > better off still using ICU. Hence, the above stuff is *not* a > bug. Do we agree? > > /Palle > > > > > >> -----Original Message----- > >> From: pgsql-hackers-owner@postgresql.org > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of > John Hansen > >> Sent: Saturday, May 07, 2005 10:23 PM > >> To: Palle Girgensohn; Bruce Momjian > >> Cc: pgsql-hackers@postgresql.org > >> Subject: Re: [HACKERS] Patch for collation using ICU > >> > >> > > >> > I use this patch in production on one FreeBSD 4.10 server at the > >> > moment. > >> > With the latest version, I've had no problems. Logging is > >> swithed on > >> > for now, and it shows no signs of ICU complaining. I'd like more > >> > reports on Linux, though. > >> > >> I currently use this on gentoo with ICU3.2 unmasked. > >> > >> Works a dream, even with locale C and UNICODE database. > >> > >> Small test: > >> > >> createdb --encoding UNICODE --locale C test psql test set > >> client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO > >> test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO > >> test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; > >> SELECT * FROM test; > >> t > >> ----- > >> æøå > >> ÆØÅ > >> (2 rows) > >> > >> Just as I'd expect, as upper/lower/initcap are locale > independent for > >> these characters. > >> > >> > >> ---------------------------(end of > >> broadcast)--------------------------- > >> TIP 5: Have you checked our extensive FAQ? > >> > >> http://www.postgresql.org/docs/faq > >> > >> > > > > > >
Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls toICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversiontables. ICU also has a fair few conversions that we do not have at present. Any thoughts? ... John > -----Original Message----- > From: John Hansen > Sent: Saturday, May 07, 2005 11:09 PM > To: 'Palle Girgensohn'; 'Bruce Momjian' > Cc: 'pgsql-hackers@postgresql.org' > Subject: RE: [HACKERS] Patch for collation using ICU > > > --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen > > <john@geeknet.com.au> > > wrote: > > > > > Errm,... initdb --encoding UNICODE --locale C > > > > You mean that ICU *shall* be used even for the C locale, and not as > > Bruce suggested here: > > Yes, that's exactly what I mean. > > > > > >> I do have a few questions: > > >> > > >> Why don't you use the lc_ctype_is_c() part of this test? > > >> > > >> if (pg_database_encoding_max_length() > 1 && > !lc_ctype_is_c()) > > > > > > Um, well, I didn't think about that. :) What would be the > > locale in > > > this case? c_C.UTF-8? ;) Hmm, it is possible to have > > CTYPE=C and use > > > a wide encoding, indeed. Then the strings will be handled > > like byte-wide chars. > > > Yeah, it's a bug. I'll fix it! Thanks. > > > > John disagrees here, and I'm obliged to agree. Using the C > locale, one > > will expect C collation, but upper/lower is better off still using > > ICU. Hence, the above stuff is *not* a bug. Do we agree? > > > > /Palle > > > > > > > > > >> -----Original Message----- > > >> From: pgsql-hackers-owner@postgresql.org > > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of > > John Hansen > > >> Sent: Saturday, May 07, 2005 10:23 PM > > >> To: Palle Girgensohn; Bruce Momjian > > >> Cc: pgsql-hackers@postgresql.org > > >> Subject: Re: [HACKERS] Patch for collation using ICU > > >> > > >> > > > >> > I use this patch in production on one FreeBSD 4.10 > server at the > > >> > moment. > > >> > With the latest version, I've had no problems. Logging is > > >> swithed on > > >> > for now, and it shows no signs of ICU complaining. I'd > like more > > >> > reports on Linux, though. > > >> > > >> I currently use this on gentoo with ICU3.2 unmasked. > > >> > > >> Works a dream, even with locale C and UNICODE database. > > >> > > >> Small test: > > >> > > >> createdb --encoding UNICODE --locale C test psql test set > > >> client_encoding=iso88591; CREATE TABLE test (t text); > INSERT INTO > > >> test (t) VALUES ('æøå'); set client_encoding=unicode; > INSERT INTO > > >> test (t) SELECT upper(t) FROM test; set > client_encoding=iso88591; > > >> SELECT * FROM test; > > >> t > > >> ----- > > >> æøå > > >> ÆØÅ > > >> (2 rows) > > >> > > >> Just as I'd expect, as upper/lower/initcap are locale > > independent for > > >> these characters. > > >> > > >> > > >> ---------------------------(end of > > >> broadcast)--------------------------- > > >> TIP 5: Have you checked our extensive FAQ? > > >> > > >> http://www.postgresql.org/docs/faq > > >> > > >> > > > > > > > > > > > >
--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Palle Girgensohn wrote: >> > >> > Is this patch ready for application? >> >> I don't think so, not quite. I have not had any positive reports from >> linux users, this is only tested in a FreeBSD environment. I'd say it >> needs some more testing. > > OK. John Hansen just reported that it does work on linux. fine! >> Also, apparently, ICU is installed by default in many linux >> distributions, and usually it is version 2.8. Some linux users have >> asked me if there are plans for a patch that works with ICU 2.8. That's >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be >> the stable version, older versions are hard to find on their sites, but >> most linux distributers seem to consider it too bleeding edge, even >> gentoo. I don't know why they don't agree. > > Good point. Why would linux folks need ICU? Doesn't their OS support > encodings natively? I am particularly excited about this for OSs that > don't have such encodings, like UTF8 support for Win32. > > Because ICU will not be used unless enabled by configure, it seems we > are fine with only supporting the newest version. Do Linux users need > to use ICU for any reason? There are corner cases where it is impossible to upper/lowercase one character at the time. for example: -- without ICUselect upper('Eßer');upper -------EßER (1 row) -- with ICU select upper('Eßer');upper -------ESSER (1 rad) This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an Ì (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms. >> > I do have a few questions: >> > >> > Why don't you use the lc_ctype_is_c() part of this test? >> > >> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) >> >> Um, well, I didn't think about that. :) What would be the locale in >> this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a >> wide encoding, indeed. Then the strings will be handled like byte-wide >> chars. Yeah, it's a bug. I'll fix it! Thanks. > > The additional test is more of an optmization, and it fixes a problem > with some OSs that have processing problems with UTF8 when the locale is > supposed to be turned off, like in "C". I realize ICU might be fine > with it but the optimization still is an issue. Well, the results are quite different, depending on whether ICU is used or not. See separate mail. >> > Why is so much code added, for example, in lower()? The existing >> > multibyte code is much smaller, and lots of code is added in other >> > places too. >> >> ICU uses UTF-16 internally, so all strings must be converted from the >> database encoding to UTF-16. Since that means the strings need to be >> copied, I took the same approach as in varlena.c:varstr_cmp(), where >> small strings use the heap and only larger strings use a palloc. >> Comments in varstr_cmp about performance made me use that approach. > > Oh, interesting. I think you need to create new functions that > factor out that common code so the patch is smaller and easier to > maintain. Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me... >> Also, in the latest patch, I also added checks and logging for *every* >> status returned from ICU. I hope this will help debugging on debian, >> where previous version didn't work. That excessive status checking is >> hardly be necessary once the stuff is better tested. >> >> I think the string copying and heap/palloc choices stands for most of >> the code bloat, together with the excessive status checking and logging. > > OK, move that into some common functions and I think it will be better. Best way for upper/lower/initcap is probably to use a function pointer... uhh... >> > Why do you need to add a mapping of encoding names from iana to our >> > names? >> >> This was already answered by John Hansen... There's an old thread here >> about the choice of the name "UNICODE" to describe an encoding, which it >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used >> by postgresql, that would have been a better name... Similarly for most >> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). >> PostgreSQL can't provide it, so a mapping table is required. > > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that > help? I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. >> I use this patch in production on one FreeBSD 4.10 server at the moment. >> With the latest version, I've had no problems. Logging is swithed on for >> now, and it shows no signs of ICU complaining. I'd like more reports on >> Linux, though. > > OK, I certainly would like this all done for 8.1 which should have > feature freeze on July 1. That shouldn't be a problem. /Palle > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania > 19073
Palle Girgensohn wrote: > I'm aware of that. It might help for unicode, but there are a > bunch of > other encodings. IANA has decided that utf-8 has *no* > aliases, hence only > utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is > fogiving, I don't remember/know, but I think we need the mappings, > unfortunately. > Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584 windows-1201 cp1200 cp1201 UTF16_BigEndian UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian windows-1200 UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4 UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233 UTF-32LE UTF32_LittleEndian ibm-1234 UTF16_PlatformEndian UTF16_OppositeEndian UTF32_PlatformEndian UTF32_OppositeEndian UTF-7 windows-65000 IMAP-mailbox-name SCSU BOCU-1 csBOCU-1 CESU-8 ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100 ISO_8859-1:1987 l1 819 US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991 iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646 windows-20127 gb18030 ibm-1392 windows-54936 ibm-367_P100-1995 ibm-367 IBM367 ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2 iso-ir-101 l2 8859_2 cp912 912 windows-28592 ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3 iso-ir-109 l3 8859_3 cp913 913 windows-28593 ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110 ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594 ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595 ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089 windows-28596 ISO-8859-6-I ISO-8859-6-E ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118 csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813 windows-28597 ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138 ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598 ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148 ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128 ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921 ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0 csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605 ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78 ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932 pck sjis ibm-943_VSUB_VPUA ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722 ibm-33722_VASCII_VPUA ibm-954_P101-2000 ibm-954 EUC-JP ibm-1373_P100-2002 ibm-1373 windows-950 windows-950-2000 Big5 csBig5 windows-950 x-big5 ibm-950_P110-1999 ibm-950 cp950 950 macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5 ibm-1375_P100-2003 ibm-1375 Big5-HKSCS ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA windows-936-2000 GBK CP936 MS936 windows-936 ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN cp1383 1383 ibm-1383_VPUA ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280 gb2312-1980 GB2312.1980-0 ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964 ibm-964_VPUA ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR ibm-eucKR KSC_5601 5601 ibm-970_VPUA ibm-971_P100-1995 ibm-971 ibm-971_VPUA ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601 csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949 ibm-1363_VSUB_VPUA ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601 csKSC56011987 korean iso-ir-149 ms949 ibm-1162_P100-1999 ibm-1162 ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH cp9066 windows-874-2000 TIS-620 windows-874 MS874 ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437 ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual windows-850 ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851 ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852 ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855 ibm-856_P100-1995 ibm-856 cp856 856 ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857 ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858 PC-Multilingual-850+euro cp858 ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM860 ibm-861_P100-1995 ibm-861 IBM861 cp861 861 cp-is csIBM861 windows-861 ibm-862_P100-1995 ibm-862 IBM862 cp862 862 csPC862LatinHebrew DOS-862 windows-862 ibm-863_P100-1995 ibm-863 IBM863 cp863 863 csIBM863 ibm-864_X110-1999 ibm-864 IBM864 cp864 csIBM864 ibm-865_P100-1995 ibm-865 IBM865 cp865 865 csIBM865 ibm-866_P100-1995 ibm-866 IBM866 cp866 866 csIBM866 windows-866 ibm-867_P100-1998 ibm-867 cp867 ibm-868_P100-1995 ibm-868 IBM868 CP868 868 csIBM868 cp-ar ibm-869_P100-1995 ibm-869 IBM869 cp869 869 cp-gr csIBM869 windows-869 ibm-878_P100-1996 ibm-878 KOI8-R koi8 csKOI8R cp878 ibm-901_P100-1999 ibm-901 ibm-902_P100-1999 ibm-902 ibm-922_P100-1999 ibm-922 cp922 922 ibm-4909_P100-1999 ibm-4909 ibm-5346_P100-1998 ibm-5346 windows-1250 cp1250 ibm-5347_P100-1998 ibm-5347 windows-1251 cp1251 ibm-5348_P100-1997 ibm-5348 windows-1252 cp1252 ibm-5349_P100-1998 ibm-5349 windows-1253 cp1253 ibm-5350_P100-1998 ibm-5350 windows-1254 cp1254 ibm-9447_P100-2002 ibm-9447 windows-1255 cp1255 windows-1256-2000 windows-1256 cp1256 ibm-9449_P100-2002 ibm-9449 windows-1257 cp1257 ibm-5354_P100-1998 ibm-5354 windows-1258 cp1258 ibm-1250_P100-1995 ibm-1250 windows-1250 ibm-1251_P100-1995 ibm-1251 windows-1251 ibm-1252_P100-2000 ibm-1252 windows-1252 ibm-1253_P100-1995 ibm-1253 windows-1253 ibm-1254_P100-1995 ibm-1254 windows-1254 ibm-1255_P100-1995 ibm-1255 ibm-5351_P100-1998 ibm-5351 windows-1255 ibm-1256_P110-1997 ibm-1256 ibm-5352_P100-1998 ibm-5352 windows-1256 ibm-1257_P100-1995 ibm-1257 ibm-5353_P100-1998 ibm-5353 windows-1257 ibm-1258_P100-1997 ibm-1258 windows-1258 macos-0_2-10.2 macintosh mac csMacintosh windows-10000 macos-6-10.2 x-mac-greek windows-10006 macgr macos-7_3-10.2 x-mac-cyrillic windows-10007 maccy macos-29-10.2 x-mac-centraleurroman windows-10029 x-mac-ce macce macos-35-10.2 x-mac-turkish windows-10081 mactr ibm-1051_P100-1995 ibm-1051 hp-roman8 roman8 r8 csHPRoman8 ibm-1276_P100-1995 ibm-1276 Adobe-Standard-Encoding csAdobeStandardEncoding ibm-1277_P100-1995 ibm-1277 Adobe-Latin1-Encoding ibm-1006_P100-1995 ibm-1006 cp1006 1006 ibm-1098_P100-1995 ibm-1098 cp1098 1098 ibm-1124_P100-1996 ibm-1124 cp1124 1124 ibm-1125_P100-1997 ibm-1125 cp1125 ibm-1129_P100-1997 ibm-1129 ibm-1131_P100-1997 ibm-1131 cp1131 ibm-1133_P100-1997 ibm-1133 ibm-1381_P110-1999 ibm-1381 cp1381 1381 ISO_2022,locale=ja,version=0 ISO-2022-JP csISO2022JP ISO_2022,locale=ja,version=1 ISO-2022-JP-1 JIS JIS_Encoding ISO_2022,locale=ja,version=2 ISO-2022-JP-2 csISO2022JP2 ISO_2022,locale=ja,version=3 JIS7 csJISEncoding ISO_2022,locale=ja,version=4 JIS8 ISO_2022,locale=ko,version=0 ISO-2022-KR csISO2022KR ISO_2022,locale=ko,version=1 ibm-25546 ISO_2022,locale=zh,version=0 ISO-2022-CN ISO_2022,locale=zh,version=1 ISO-2022-CN-EXT HZ HZ-GB-2312 ibm-897_P100-1995 ibm-897 JIS_X0201 X0201 csHalfWidthKatakana ISCII,version=0 x-iscii-de windows-57002 iscii-dev ISCII,version=1 x-iscii-be windows-57003 iscii-bng windows-57006 x-iscii-as ISCII,version=2 x-iscii-pa windows-57011 iscii-gur ISCII,version=3 x-iscii-gu windows-57010 iscii-guj ISCII,version=4 x-iscii-or windows-57007 iscii-ori ISCII,version=5 x-iscii-ta windows-57004 iscii-tml ISCII,version=6 x-iscii-te windows-57005 iscii-tlg ISCII,version=7 x-iscii-ka windows-57008 iscii-knd ISCII,version=8 x-iscii-ma windows-57009 iscii-mlm LMBCS-1 lmbcs LMBCS-2 LMBCS-3 LMBCS-4 LMBCS-5 LMBCS-6 LMBCS-8 LMBCS-11 LMBCS-16 LMBCS-17 LMBCS-18 LMBCS-19 ibm-37_P100-1995 ibm-37 IBM037 ibm-037 ebcdic-cp-us ebcdic-cp-ca ebcdic-cp-wt ebcdic-cp-nl csIBM037 cp037 037 cpibm37 cp37 ibm-273_P100-1995 ibm-273 IBM273 CP273 csIBM273 ebcdic-de cpibm273 273 ibm-277_P100-1995 ibm-277 IBM277 cp277 EBCDIC-CP-DK EBCDIC-CP-NO csIBM277 ebcdic-dk cpibm277 277 ibm-278_P100-1995 ibm-278 IBM278 cp278 ebcdic-cp-fi ebcdic-cp-se csIBM278 ebcdic-sv cpibm278 278 ibm-280_P100-1995 ibm-280 IBM280 CP280 ebcdic-cp-it csIBM280 cpibm280 280 ibm-284_P100-1995 ibm-284 IBM284 CP284 ebcdic-cp-es csIBM284 cpibm284 284 ibm-285_P100-1995 ibm-285 IBM285 CP285 ebcdic-cp-gb csIBM285 ebcdic-gb cpibm285 285 ibm-290_P100-1995 ibm-290 IBM290 cp290 EBCDIC-JP-kana csIBM290 ibm-297_P100-1995 ibm-297 IBM297 cp297 ebcdic-cp-fr csIBM297 cpibm297 297 ibm-420_X120-1999 ibm-420 IBM420 cp420 ebcdic-cp-ar1 csIBM420 420 ibm-424_P100-1995 ibm-424 IBM424 cp424 ebcdic-cp-he csIBM424 424 ibm-500_P100-1995 ibm-500 IBM500 CP500 ebcdic-cp-be csIBM500 ebcdic-cp-ch cpibm500 500 ibm-803_P100-1999 ibm-803 cp803 ibm-838_P100-1995 ibm-838 IBM-Thai csIBMThai cp838 838 ibm-9030 ibm-870_P100-1995 ibm-870 IBM870 CP870 ebcdic-cp-roece ebcdic-cp-yu csIBM870 ibm-871_P100-1995 ibm-871 IBM871 ebcdic-cp-is csIBM871 CP871 ebcdic-is cpibm871 871 ibm-875_P100-1995 ibm-875 IBM875 cp875 875 ibm-918_P100-1995 ibm-918 IBM918 CP918 ebcdic-cp-ar2 csIBM918 ibm-930_P120-1999 ibm-930 ibm-5026 cp930 cpibm930 930 ibm-933_P110-1995 ibm-933 cp933 cpibm933 933 ibm-935_P110-1999 ibm-935 cp935 cpibm935 935 ibm-937_P110-1999 ibm-937 cp937 cpibm937 937 ibm-939_P120-1999 ibm-939 ibm-931 ibm-5035 cp939 939 ibm-1025_P100-1995 ibm-1025 cp1025 1025 ibm-1026_P100-1995 ibm-1026 IBM1026 CP1026 csIBM1026 1026 ibm-1047_P100-1995 ibm-1047 IBM1047 cpibm1047 ibm-1097_P100-1995 ibm-1097 cp1097 1097 ibm-1112_P100-1995 ibm-1112 cp1112 1112 ibm-1122_P100-1999 ibm-1122 cp1122 1122 ibm-1123_P100-1995 ibm-1123 cp1123 1123 cpibm1123 ibm-1130_P100-1997 ibm-1130 ibm-1132_P100-1998 ibm-1132 ibm-1140_P100-1997 ibm-1140 IBM01140 CCSID01140 CP01140 cp1140 cpibm1140 ebcdic-us-37+euro ibm-1141_P100-1997 ibm-1141 IBM01141 CCSID01141 CP01141 cp1141 cpibm1141 ebcdic-de-273+euro ibm-1142_P100-1997 ibm-1142 IBM01142 CCSID01142 CP01142 cp1142 cpibm1142 ebcdic-dk-277+euro ebcdic-no-277+euro ibm-1143_P100-1997 ibm-1143 IBM01143 CCSID01143 CP01143 cp1143 cpibm1143 ebcdic-fi-278+euro ebcdic-se-278+euro ibm-1144_P100-1997 ibm-1144 IBM01144 CCSID01144 CP01144 cp1144 cpibm1144 ebcdic-it-280+euro ibm-1145_P100-1997 ibm-1145 IBM01145 CCSID01145 CP01145 cp1145 cpibm1145 ebcdic-es-284+euro ibm-1146_P100-1997 ibm-1146 IBM01146 CCSID01146 CP01146 cp1146 cpibm1146 ebcdic-gb-285+euro ibm-1147_P100-1997 ibm-1147 IBM01147 CCSID01147 CP01147 cp1147 cpibm1147 ebcdic-fr-297+euro ibm-1148_P100-1997 ibm-1148 IBM01148 CCSID01148 CP01148 cp1148 cpibm1148 ebcdic-international-500+euro ibm-1149_P100-1997 ibm-1149 IBM01149 CCSID01149 CP01149 cp1149 cpibm1149 ebcdic-is-871+euro ibm-1153_P100-1999 ibm-1153 cpibm1153 ibm-1154_P100-1999 ibm-1154 cpibm1154 ibm-1155_P100-1999 ibm-1155 cpibm1155 ibm-1156_P100-1999 ibm-1156 cpibm1156 ibm-1157_P100-1999 ibm-1157 cpibm1157 ibm-1158_P100-1999 ibm-1158 cpibm1158 ibm-1160_P100-1999 ibm-1160 cpibm1160 ibm-1164_P100-1999 ibm-1164 cpibm1164 ibm-1364_P110-1997 ibm-1364 cp1364 ibm-1371_P100-1999 ibm-1371 cpibm1371 ibm-1388_P103-2001 ibm-1388 ibm-9580 ibm-1390_P110-2003 ibm-1390 cpibm1390 ibm-1399_P110-2003 ibm-1399 ibm-16684_P110-2003 ibm-16684 ibm-4899_P100-1998 ibm-4899 cpibm4899 ibm-4971_P100-1999 ibm-4971 cpibm4971 ibm-12712_P100-1998 ibm-12712 cpibm12712 ebcdic-he ibm-16804_X110-1999 ibm-16804 cpibm16804 ebcdic-ar ibm-1137_P100-1999 ibm-1137 ibm-5123_P100-1999 ibm-5123 ibm-8482_P100-1999 ibm-8482 ibm-37_P100-1995,swaplfnl ibm-37-s390 ibm037-s390 ibm-1047_P100-1995,swaplfnl ibm-1047-s390 ibm-1140_P100-1997,swaplfnl ibm-1140-s390 ibm-1142_P100-1997,swaplfnl ibm-1142-s390 ibm-1143_P100-1997,swaplfnl ibm-1143-s390 ibm-1144_P100-1997,swaplfnl ibm-1144-s390 ibm-1145_P100-1997,swaplfnl ibm-1145-s390 ibm-1146_P100-1997,swaplfnl ibm-1146-s390 ibm-1147_P100-1997,swaplfnl ibm-1147-s390 ibm-1148_P100-1997,swaplfnl ibm-1148-s390 ibm-1149_P100-1997,swaplfnl ibm-1149-s390 ibm-1153_P100-1999,swaplfnl ibm-1153-s390 ibm-12712_P100-1998,swaplfnl ibm-12712-s390 ibm-16804_X110-1999,swaplfnl ibm-16804-s390 ebcdic-xml-us
--On lördag, maj 07, 2005 23.25.15 +1000 John Hansen <john@geeknet.com.au> wrote: > Palle Girgensohn wrote: >> I'm aware of that. It might help for unicode, but there are a >> bunch of >> other encodings. IANA has decided that utf-8 has *no* >> aliases, hence only >> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is >> fogiving, I don't remember/know, but I think we need the mappings, >> unfortunately. >> > > Here is the list of encoding names and aliases the ICU accepts as of > 3.2: > (it's a bit long...) > > UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 No UTF8 in there. I think that's good, charset aliases are a hassle. /Palle
> -----Original Message----- > From: Palle Girgensohn [mailto:girgen@pingpong.net] > Sent: Saturday, May 07, 2005 11:30 PM > To: John Hansen; Bruce Momjian > Cc: pgsql-hackers@postgresql.org > Subject: RE: [HACKERS] Patch for collation using ICU > > > > --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen > <john@geeknet.com.au> > wrote: > > > Palle Girgensohn wrote: > >> I'm aware of that. It might help for unicode, but there > are a bunch > >> of other encodings. IANA has decided that utf-8 has *no* aliases, > >> hence only > >> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is > >> fogiving, I don't remember/know, but I think we need the mappings, > >> unfortunately. > >> > > > > Here is the list of encoding names and aliases the ICU accepts as of > > 3.2: > > (it's a bit long...) > > > > UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 > > No UTF8 in there. I think that's good, charset aliases are a hassle. Yup! :) > > /Palle > > >
--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen <john@geeknet.com.au> wrote: >> >> I use this patch in production on one FreeBSD 4.10 server at >> the moment. >> With the latest version, I've had no problems. Logging is >> swithed on for >> now, and it shows no signs of ICU complaining. I'd like more >> reports on >> Linux, though. > > I currently use this on gentoo with ICU3.2 unmasked. > > Works a dream, even with locale C and UNICODE database. Ah, good to hear, John. I beleive your report about linux is what's keeping this back. Did you also manage to get it running on Debian? /Palle
> -----Original Message----- > From: Palle Girgensohn [mailto:girgen@pingpong.net] > Sent: Saturday, May 07, 2005 11:33 PM > To: John Hansen; Bruce Momjian > Cc: pgsql-hackers@postgresql.org > Subject: RE: [HACKERS] Patch for collation using ICU > > > > --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen > <john@geeknet.com.au> > wrote: > > >> > >> I use this patch in production on one FreeBSD 4.10 server at the > >> moment. > >> With the latest version, I've had no problems. Logging is > swithed on > >> for now, and it shows no signs of ICU complaining. I'd like more > >> reports on Linux, though. > > > > I currently use this on gentoo with ICU3.2 unmasked. > > > > Works a dream, even with locale C and UNICODE database. > > Ah, good to hear, John. I beleive your report about linux is > what's keeping this back. Did you also manage to get it > running on Debian? Not without ICU3.2 > > /Palle > >
--On lördag, maj 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au> wrote: > Btw, I had been planning to propose replacing every single one of the > built in charset conversion functions with calls to ICU (thus making pg > _depend_ on ICU), as this would seem like a cleaner solution than for us > to maintain our own conversion tables. > > ICU also has a fair few conversions that we do not have at present. > > Any thoughts? I just had a similar though. And why use ICU only for multibyte charsets? If I use LATIN1, I still expect upper('ß') => SS, and I don't get it... Same for the Turkish example. It does eat more memory, and can perhaps cush some performance bits? With the current scheme, a strdup is often enough, or at least just one palloc. With ICU, using UTF-16, you must allocate memory twice, once for the ICU internal UTF-16 representation. That's not a very strong objection, though, as this would be an option... :) John, I have a hard time finding docs about what differs in ICU 2.8 from 3.2. Do you have any pointers? It seems 3.2 has much more support and bug fixes, I'm not sure if we should really consider 2.8? /Palle > > ... John > >> -----Original Message----- >> From: John Hansen >> Sent: Saturday, May 07, 2005 11:09 PM >> To: 'Palle Girgensohn'; 'Bruce Momjian' >> Cc: 'pgsql-hackers@postgresql.org' >> Subject: RE: [HACKERS] Patch for collation using ICU >> >> > --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen >> > <john@geeknet.com.au> >> > wrote: >> > >> > > Errm,... initdb --encoding UNICODE --locale C >> > >> > You mean that ICU *shall* be used even for the C locale, and not as >> > Bruce suggested here: >> >> Yes, that's exactly what I mean. >> >> > >> > >> I do have a few questions: >> > >> >> > >> Why don't you use the lc_ctype_is_c() part of this test? >> > >> >> > >> if (pg_database_encoding_max_length() > 1 && >> !lc_ctype_is_c()) >> > > >> > > Um, well, I didn't think about that. :) What would be the >> > locale in >> > > this case? c_C.UTF-8? ;) Hmm, it is possible to have >> > CTYPE=C and use >> > > a wide encoding, indeed. Then the strings will be handled >> > like byte-wide chars. >> > > Yeah, it's a bug. I'll fix it! Thanks. >> > >> > John disagrees here, and I'm obliged to agree. Using the C >> locale, one >> > will expect C collation, but upper/lower is better off still using >> > ICU. Hence, the above stuff is *not* a bug. Do we agree? >> > >> > /Palle >> > >> > >> > > >> > >> -----Original Message----- >> > >> From: pgsql-hackers-owner@postgresql.org >> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of >> > John Hansen >> > >> Sent: Saturday, May 07, 2005 10:23 PM >> > >> To: Palle Girgensohn; Bruce Momjian >> > >> Cc: pgsql-hackers@postgresql.org >> > >> Subject: Re: [HACKERS] Patch for collation using ICU >> > >> >> > >> > >> > >> > I use this patch in production on one FreeBSD 4.10 >> server at the >> > >> > moment. >> > >> > With the latest version, I've had no problems. Logging is >> > >> swithed on >> > >> > for now, and it shows no signs of ICU complaining. I'd >> like more >> > >> > reports on Linux, though. >> > >> >> > >> I currently use this on gentoo with ICU3.2 unmasked. >> > >> >> > >> Works a dream, even with locale C and UNICODE database. >> > >> >> > >> Small test: >> > >> >> > >> createdb --encoding UNICODE --locale C test psql test set >> > >> client_encoding=iso88591; CREATE TABLE test (t text); >> INSERT INTO >> > >> test (t) VALUES ('æøå'); set client_encoding=unicode; >> INSERT INTO >> > >> test (t) SELECT upper(t) FROM test; set >> client_encoding=iso88591; >> > >> SELECT * FROM test; >> > >> t >> > >> ----- >> > >> æøå >> > >> ÆØÅ >> > >> (2 rows) >> > >> >> > >> Just as I'd expect, as upper/lower/initcap are locale >> > independent for >> > >> these characters. >> > >> >> > >> >> > >> ---------------------------(end of >> > >> broadcast)--------------------------- >> > >> TIP 5: Have you checked our extensive FAQ? >> > >> >> > >> http://www.postgresql.org/docs/faq >> > >> >> > >> >> > >> > >> > >> > >> > >> >
--On lördag, maj 07, 2005 23.33.31 +1000 John Hansen <john@geeknet.com.au> wrote: > > >> -----Original Message----- >> From: Palle Girgensohn [mailto:girgen@pingpong.net] >> Sent: Saturday, May 07, 2005 11:33 PM >> To: John Hansen; Bruce Momjian >> Cc: pgsql-hackers@postgresql.org >> Subject: RE: [HACKERS] Patch for collation using ICU >> >> >> >> --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen >> <john@geeknet.com.au> >> wrote: >> >> >> >> >> I use this patch in production on one FreeBSD 4.10 server at the >> >> moment. >> >> With the latest version, I've had no problems. Logging is >> swithed on >> >> for now, and it shows no signs of ICU complaining. I'd like more >> >> reports on Linux, though. >> > >> > I currently use this on gentoo with ICU3.2 unmasked. >> > >> > Works a dream, even with locale C and UNICODE database. >> >> Ah, good to hear, John. I beleive your report about linux is >> what's keeping this back. Did you also manage to get it >> running on Debian? > > Not without ICU3.2 Did you try the latest patch? Maybe it will help, and if not, it will (hopefully) give a lot more informative error messages. /Palle
John Hansen wrote: >Here is the list of encoding names and aliases the ICU accepts as of >3.2: >(it's a bit long...) > >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 > > > [snip] Don't we use "unicode" as an alias for UTF-8 ? cheers andrew
> Did you try the latest patch? Maybe it will help, and if not, it will > (hopefully) give a lot more informative error messages. No, and I got rid of my debian boxes @ home. The patch required a certain amount of modifications too, to even compile with 2.8. So I guess it's a valid question to ask: it it worth supporting 2.8? It is of course an option to bundle icu 3.2 with pg! ... John
John Hansen wrote: > > --On l?rdag, maj 07, 2005 22.53.46 +1000 John Hansen > > <john@geeknet.com.au> > > wrote: > > > > > Errm,... initdb --encoding UNICODE --locale C > > > > You mean that ICU *shall* be used even for the C locale, and > > not as Bruce suggested here: > > Yes, that's exactly what I mean. There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. So, the first issue doesn't apply for ICU, and the second might not depending on what characters you are using in the Unicode character set. I guess I am little confused how ICU can do upper() when the locale is C. What is it using to determine A is upper for a? Am I confused? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
> -----Original Message----- > From: Andrew Dunstan [mailto:andrew@dunslane.net] > Sent: Saturday, May 07, 2005 11:39 PM > To: John Hansen > Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > > John Hansen wrote: > > >Here is the list of encoding names and aliases the ICU accepts as of > >3.2: > >(it's a bit long...) > > > >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 > >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 > > > > > > > [snip] > > Don't we use "unicode" as an alias for UTF-8 ? Yes, unfortunately! > > cheers > > andrew > >
Palle Girgensohn wrote: > >> Also, apparently, ICU is installed by default in many linux > >> distributions, and usually it is version 2.8. Some linux users have > >> asked me if there are plans for a patch that works with ICU 2.8. That's > >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be > >> the stable version, older versions are hard to find on their sites, but > >> most linux distributers seem to consider it too bleeding edge, even > >> gentoo. I don't know why they don't agree. > > > > Good point. Why would linux folks need ICU? Doesn't their OS support > > encodings natively? I am particularly excited about this for OSs that > > don't have such encodings, like UTF8 support for Win32. > > > > Because ICU will not be used unless enabled by configure, it seems we > > are fine with only supporting the newest version. Do Linux users need > > to use ICU for any reason? > > > There are corner cases where it is impossible to upper/lowercase one > character at the time. for example: > > -- without ICU > select upper('E?er'); > upper > ------- > E?ER > (1 row) > > -- with ICU > select upper('E?er'); > upper > ------- > ESSER > (1 rad) > > This is because in the standard postgres implementation, upper/lower is > done one character at the time. A proper upper/lower cannot do it that way. > Other known example is in Turkish, where an ? (?) should look different > whether it is an initial letter or not. This fails in standard postgresql > for all platforms. Uh, where do you see that? Our code has: workspace = texttowcs(string); for (i = 0; workspace[i] != 0; i++) workspace[i] = towupper(workspace[i]); result = wcstotext(workspace, i); > >> Also, in the latest patch, I also added checks and logging for *every* > >> status returned from ICU. I hope this will help debugging on debian, > >> where previous version didn't work. That excessive status checking is > >> hardly be necessary once the stuff is better tested. > >> > >> I think the string copying and heap/palloc choices stands for most of > >> the code bloat, together with the excessive status checking and logging. > > > > OK, move that into some common functions and I think it will be better. > > Best way for upper/lower/initcap is probably to use a function pointer... > uhh... Uh, I don't think so. Just send pointers to the the function and let the function allocate the memory, and another function to free them, or something like that. I can probably do it if you want. > >> > Why do you need to add a mapping of encoding names from iana to our > >> > names? > >> > >> This was already answered by John Hansen... There's an old thread here > >> about the choice of the name "UNICODE" to describe an encoding, which it > >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used > >> by postgresql, that would have been a better name... Similarly for most > >> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). > >> PostgreSQL can't provide it, so a mapping table is required. > > > > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that > > help? > > I'm aware of that. It might help for unicode, but there are a bunch of > other encodings. IANA has decided that utf-8 has *no* aliases, hence only > utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is > fogiving, I don't remember/know, but I think we need the mappings, > unfortunately. OK. I guess I am just confused why the native implementations are OK. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Palle Girgensohn wrote: > > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au> > wrote: > > > Btw, I had been planning to propose replacing every single one of the > > built in charset conversion functions with calls to ICU (thus making pg > > _depend_ on ICU), as this would seem like a cleaner solution than for us > > to maintain our own conversion tables. > > > > ICU also has a fair few conversions that we do not have at present. That is a much larger issue, similar to our shipping our own timezone database. What does it buy us?o Do we ship it in our tarball?o Is the license compatible?o Does it remove utils/mb conversions?o Does it allow us to index LIKE (next high char)?o Does it allow us to support multiple encodings in a singledatabase easier?o performance? > I just had a similar though. And why use ICU only for multibyte charsets? > If I use LATIN1, I still expect upper('?') => SS, and I don't get it... > Same for the Turkish example. We assume the native toupper() can handle single-byte character encodings. We use towupper() only for wide character sets. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Andrew Dunstan wrote: > > > John Hansen wrote: > > >Here is the list of encoding names and aliases the ICU accepts as of > >3.2: > >(it's a bit long...) > > > >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 > >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 > > > > > > > [snip] > > Don't we use "unicode" as an alias for UTF-8 ? We do. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
--On lördag, maj 07, 2005 09.52.59 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Palle Girgensohn wrote: >> >> Also, apparently, ICU is installed by default in many linux >> >> distributions, and usually it is version 2.8. Some linux users have >> >> asked me if there are plans for a patch that works with ICU 2.8. >> >> That's probably a good idea. IBM and the ICU folks seem to consider >> >> 3.2 to be the stable version, older versions are hard to find on >> >> their sites, but most linux distributers seem to consider it too >> >> bleeding edge, even gentoo. I don't know why they don't agree. >> > >> > Good point. Why would linux folks need ICU? Doesn't their OS support >> > encodings natively? I am particularly excited about this for OSs that >> > don't have such encodings, like UTF8 support for Win32. >> > >> > Because ICU will not be used unless enabled by configure, it seems we >> > are fine with only supporting the newest version. Do Linux users need >> > to use ICU for any reason? >> >> >> There are corner cases where it is impossible to upper/lowercase one >> character at the time. for example: >> >> -- without ICU >> select upper('E?er'); >> upper >> ------- >> E?ER >> (1 row) >> >> -- with ICU >> select upper('E?er'); >> upper >> ------- >> ESSER >> (1 rad) >> >> This is because in the standard postgres implementation, upper/lower is >> done one character at the time. A proper upper/lower cannot do it that >> way. Other known example is in Turkish, where an ? (?) should look >> different whether it is an initial letter or not. This fails in >> standard postgresql for all platforms. > > Uh, where do you see that? Our code has: > > workspace = texttowcs(string); > > for (i = 0; workspace[i] != 0; i++) > workspace[i] = towupper(workspace[i]); as you see, the loop runs towupper for one character at the time. I cannot consider whether the letter is the initial, as required in Turkish, and it cannot really convert one character into two ('ß' -> 'SS') > > result = wcstotext(workspace, i); > > >> >> Also, in the latest patch, I also added checks and logging for *every* >> >> status returned from ICU. I hope this will help debugging on debian, >> >> where previous version didn't work. That excessive status checking is >> >> hardly be necessary once the stuff is better tested. >> >> >> >> I think the string copying and heap/palloc choices stands for most of >> >> the code bloat, together with the excessive status checking and >> >> logging. >> > >> > OK, move that into some common functions and I think it will be better. >> >> Best way for upper/lower/initcap is probably to use a function >> pointer... uhh... > > Uh, I don't think so. Just send pointers to the the function and let > the function allocate the memory, and another function to free them, or > something like that. I can probably do it if you want. I'll check it out, it seems simple enough. >> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does >> > that help? >> >> I'm aware of that. It might help for unicode, but there are a bunch of >> other encodings. IANA has decided that utf-8 has *no* aliases, hence >> only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is >> fogiving, I don't remember/know, but I think we need the mappings, >> unfortunately. > > OK. I guess I am just confused why the native implementations are OK. They're OK since they understand that UNICODE (or UTF8) is really utf-8. Problem is the strings used to describe them are not understood by ICU. BTW, the pg_enc2iananame_tbl is only used *from* internal representation *to* IANA, not the other way around. Maybe that fact lowers the rate of confusion? ;-) /Palle
Bruce Momjian wrote: > > There are two reasons for that optimization --- first, some > locale support is broken and Unicode encoding with a C locale > crashes (not an issue for ICU), and second, it is an > optimization for languages like Japanese that want to use > unicode, but don't need a locale because upper/lower means > nothing in those character sets. No, upper/lower means nothing in those languages, so why would you need to optimize upper/lower if they're not used?? And if they are, it's obviously because the text contains characters from other languages (probably english) and as such they should behave correctly. Did I mention that for japanese and the like, ICU would also offer transliteration... > > So, the first issue doesn't apply for ICU, and the second > might not depending on what characters you are using in the > Unicode character set. > > I guess I am little confused how ICU can do upper() when the > locale is C. What is it using to determine A is upper for a? > Am I confused? Simple, UNICODE basically consist of a table of characters (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) Excerpt: 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061; ... 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041 From this you can see, that for 0041, which is capital letter A, there is a mapping to it's lowercase counterpart, 0061 Likewise, there is a mapping for 0061 which says it's uppercase counterpart is 0041. There is also SpecialCasing.txt which covers those mappings that haven't got a 1-1 mapping, such as the german SS. These mappings are fixed, independent of locale, only a few cases from specialcasing.txt depend on locale/context.
> It seems 3.2 has much more support and bug fixes, I'm not > sure if we should really consider 2.8? As I said, probably not worth the effort.
Palle Girgensohn wrote: > >> This is because in the standard postgres implementation, upper/lower is > >> done one character at the time. A proper upper/lower cannot do it that > >> way. Other known example is in Turkish, where an ? (?) should look > >> different whether it is an initial letter or not. This fails in > >> standard postgresql for all platforms. > > > > Uh, where do you see that? Our code has: > > > > workspace = texttowcs(string); > > > > for (i = 0; workspace[i] != 0; i++) > > workspace[i] = towupper(workspace[i]); > > as you see, the loop runs towupper for one character at the time. I cannot > consider whether the letter is the initial, as required in Turkish, and it > cannot really convert one character into two ('?' -> 'SS') Oh, OK. I thought texttowcs() would expand the string to allow such conversions. > >> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does > >> > that help? > >> > >> I'm aware of that. It might help for unicode, but there are a bunch of > >> other encodings. IANA has decided that utf-8 has *no* aliases, hence > >> only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is > >> fogiving, I don't remember/know, but I think we need the mappings, > >> unfortunately. > > > > OK. I guess I am just confused why the native implementations are OK. > > They're OK since they understand that UNICODE (or UTF8) is really utf-8. > Problem is the strings used to describe them are not understood by ICU. > > BTW, the pg_enc2iananame_tbl is only used *from* internal representation > *to* IANA, not the other way around. Maybe that fact lowers the rate of > confusion? ;-) OK, got it. I am still a little confused why every native implementation understands our existing names but ICU does not. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > Palle Girgensohn wrote: > > > > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen > > <john@geeknet.com.au> > > wrote: > > > > > Btw, I had been planning to propose replacing every single one of > > > the built in charset conversion functions with calls to ICU (thus > > > making pg _depend_ on ICU), as this would seem like a cleaner > > > solution than for us to maintain our own conversion tables. > > > > > > ICU also has a fair few conversions that we do not have > at present. > > That is a much larger issue, similar to our shipping our own > timezone database. What does it buy us? > > o Do we ship it in our tarball? > o Is the license compatible? > o Does it remove utils/mb conversions? > o Does it allow us to index LIKE (next high char)? > o Does it allow us to support multiple encodings in > a single database easier? > o performance? > > > I just had a similar though. And why use ICU only for > multibyte charsets? > > If I use LATIN1, I still expect upper('?') => SS, and I > don't get it... > > Same for the Turkish example. > > We assume the native toupper() can handle single-byte > character encodings. We use towupper() only for wide character sets. That assumption is wrong,... Encoding latin1 Locale <> de* Select Upper('ß'); (lowercase german SS) Should return SS, but returns ß ... John
--On lördag, maj 07, 2005 10.06.43 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Palle Girgensohn wrote: >> >> --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen >> <john@geeknet.com.au> wrote: >> >> > Btw, I had been planning to propose replacing every single one of the >> > built in charset conversion functions with calls to ICU (thus making pg >> > _depend_ on ICU), as this would seem like a cleaner solution than for >> > us to maintain our own conversion tables. >> > >> > ICU also has a fair few conversions that we do not have at present. > > That is a much larger issue, similar to our shipping our own timezone > database. What does it buy us? > > o Do we ship it in our tarball? > o Is the license compatible? It looks pretty similar to BSD, although I'm a novice on the subject. > o Does it remove utils/mb conversions? Yes, it would probably be possible to remove pg's own conversions. > o Does it allow us to index LIKE (next high char)? I beleive so, using ICU's substring stuff. > o Does it allow us to support multiple encodings in > a single database easier? Heh, the ultimate dream. Perhaps? > o performance? ICU in itself is said to be much faster than for example glibc. Problem is the need for conversion via UTF-16, which requires extra memory allocations and cpu cycles. I don't use glibc, but my very simple performance tests for FreeBSD show that it is similiar in speed. > >> I just had a similar though. And why use ICU only for multibyte >> charsets? If I use LATIN1, I still expect upper('?') => SS, and I don't >> get it... Same for the Turkish example. > > We assume the native toupper() can handle single-byte character > encodings. We use towupper() only for wide character sets. True, problem is that native toupper/towupper run one char at the time. This is a bad design decision in POSIX, there is no way it can handle the examples above unless considering more than one character. ICU does just that. /Palle
John Hansen wrote: > Bruce Momjian wrote: > > > > There are two reasons for that optimization --- first, some > > locale support is broken and Unicode encoding with a C locale > > crashes (not an issue for ICU), and second, it is an > > optimization for languages like Japanese that want to use > > unicode, but don't need a locale because upper/lower means > > nothing in those character sets. > > No, upper/lower means nothing in those languages, so why would you need > to optimize upper/lower if they're not used?? True. I suppose it is for databases that use both Japanese and Latin alphabets and run upper() on all values. > And if they are, it's obviously because the text contains characters > from other languages (probably english) and as such they should behave > correctly. > > Did I mention that for japanese and the like, ICU would also offer > transliteration... Interesting. > > So, the first issue doesn't apply for ICU, and the second > > might not depending on what characters you are using in the > > Unicode character set. > > > > I guess I am little confused how ICU can do upper() when the > > locale is C. What is it using to determine A is upper for a? > > Am I confused? > > Simple, UNICODE basically consist of a table of characters > (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) > > Excerpt: > > 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061; > ... > 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041 > > >From this you can see, that for 0041, which is capital letter A, there > is a mapping to it's lowercase counterpart, 0061 > Likewise, there is a mapping for 0061 which says it's uppercase > counterpart is 0041. > There is also SpecialCasing.txt which covers those mappings that haven't > got a 1-1 mapping, such as the german SS. > > These mappings are fixed, independent of locale, only a few cases from > specialcasing.txt depend on locale/context. As far as I know, the only way to use Unicode currently is to use a locale that is unicode-aware. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
"John Hansen" <john@geeknet.com.au> writes: > Where'd you get the licence from? It was the first thing I came across in their docs: http://icu.sourceforge.net/userguide/intro.html Looking more closely, it may be that this license is only intended to apply to the documentation and not the code ... though "free" code with un-free documentation isn't real useful. regards, tom lane
--On lördag, maj 07, 2005 10.58.09 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > "John Hansen" <john@geeknet.com.au> writes: >> Where'd you get the licence from? > > It was the first thing I came across in their docs: > > http://icu.sourceforge.net/userguide/intro.html > > Looking more closely, it may be that this license is only intended to > apply to the documentation and not the code ... though "free" code with > un-free documentation isn't real useful. > > regards, tom lane Someone should ask the ICU folks about that...
Tom Lane wrote: > "John Hansen" <john@geeknet.com.au> writes: > > Where'd you get the licence from? > > It was the first thing I came across in their docs: > > http://icu.sourceforge.net/userguide/intro.html > > Looking more closely, it may be that this license is only > intended to apply to the documentation and not the code ... > though "free" code with un-free documentation isn't real useful. > Right, it seems to apply only to the resources found on sourceforge. > regards, tom lane > >
"John Hansen" <john@geeknet.com.au> writes: > Btw, I had been planning to propose replacing every single one of the > built in charset conversion functions with calls to ICU (thus making > pg _depend_ on ICU), I find that fairly unacceptable ... especially given the licensing questions, but in any case. It might be OK to rip out the existing conversion support and say that *if* you want encoding conversion, you have to use ICU. But I don't want to be told you cannot build PG without ICU period. The 3.2 vs 2.8 business is disturbing also; specifically, I don't think we get to require 3.2 on a platform where 2.8 is installed. People just aren't going to hold still for that, even assuming that ICU supports installing both versions at once, which isn't clear to me at the moment ... regards, tom lane
> Palle Girgensohn wrote: > > > > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au> > > wrote: > > > > > Btw, I had been planning to propose replacing every single one of the > > > built in charset conversion functions with calls to ICU (thus making pg > > > _depend_ on ICU), as this would seem like a cleaner solution than for us > > > to maintain our own conversion tables. I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. -- Tatsuo Ishii > > > ICU also has a fair few conversions that we do not have at present. > > That is a much larger issue, similar to our shipping our own timezone > database. What does it buy us? > > o Do we ship it in our tarball? > o Is the license compatible? > o Does it remove utils/mb conversions? > o Does it allow us to index LIKE (next high char)? > o Does it allow us to support multiple encodings in > a single database easier? > o performance?
> Bruce Momjian wrote: > > > > There are two reasons for that optimization --- first, some > > locale support is broken and Unicode encoding with a C locale > > crashes (not an issue for ICU), and second, it is an > > optimization for languages like Japanese that want to use > > unicode, but don't need a locale because upper/lower means > > nothing in those character sets. > > No, upper/lower means nothing in those languages, so why would you need > to optimize upper/lower if they're not used?? > And if they are, it's obviously because the text contains characters > from other languages (probably english) and as such they should behave > correctly. Yes, Japanese (and probably Chinese and Korean) languages include ASCII character. More precisely ASCII is part of Japanese encodings(LATIN1 is not, however). And we have no problem at all with glibc/C locale. See below("unitest" is an UNICODE database). unitest=# create table t1(t text); CREATE TABLE unitest=# \encoding EUC_JP unitest=# insert into t1 values('abcあいう'); INSERT 1842628 1 unitest=# select upper(t) from t1; upper -----------ABCあいう (1 row) So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. So I strongly object removing that optimization. -- Tatsuo Ishii
Tom Lane wrote: > "John Hansen" <john@geeknet.com.au> writes: > > Btw, I had been planning to propose replacing every single > one of the > > built in charset conversion functions with calls to ICU > (thus making > > pg _depend_ on ICU), > > I find that fairly unacceptable ... especially given the > licensing questions, but in any case. The licencing seems pretty clear to me. http://www-306.ibm.com/software/globalization/icu/license.jsp > > It might be OK to rip out the existing conversion support and > say that *if* you want encoding conversion, you have to use > ICU. But I don't want to be told you cannot build PG without > ICU period. Right, that could be done, but I think the issue at heart is _are_ we going to use it at all, and if so, locale support would certainly benefit from going that way as well. > > The 3.2 vs 2.8 business is disturbing also; specifically, I > don't think we get to require 3.2 on a platform where 2.8 is > installed. There seems to be nothing in the ICU licence that would prevent us from bundling it. This would solve both the 3.2 vs 2.8 problems, and would remove the 'dependency'. > People just aren't going to hold still for that, even > assuming that ICU supports installing both versions at once, > which isn't clear to me at the moment ... There's no problems with having both installed. I did that on debian to get the patch going. Tho, bundling it seems cleaner to me. > > regards, tom lane > >
> I don't buy it. If current conversion tables does the right > thing, why we need to replace. Or if conversion tables are > not correct, why don't you fix it? I think the rule of > character conversion will not change frequently, especially > for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. ... John
Tatsuo Ishii wrote: > Sent: Sunday, May 08, 2005 10:09 AM > To: John Hansen > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > Bruce Momjian wrote: > > > > > > There are two reasons for that optimization --- first, > some locale > > > support is broken and Unicode encoding with a C locale > crashes (not > > > an issue for ICU), and second, it is an optimization for > languages > > > like Japanese that want to use unicode, but don't need a locale > > > because upper/lower means nothing in those character sets. > > > > No, upper/lower means nothing in those languages, so why would you > > need to optimize upper/lower if they're not used?? > > And if they are, it's obviously because the text contains > characters > > from other languages (probably english) and as such they > should behave > > correctly. > > Yes, Japanese (and probably Chinese and Korean) languages > include ASCII character. More precisely ASCII is part of Japanese > encodings(LATIN1 is not, however). And we have no problem at > all with glibc/C locale. See below("unitest" is an UNICODE database). > > unitest=# create table t1(t text); > CREATE TABLE > unitest=# \encoding EUC_JP > unitest=# insert into t1 values('abcあいう'); > INSERT 1842628 1 > unitest=# select upper(t) from t1; > upper > ----------- > ABCあいう > (1 row) > > So Japanese(including ASCII)/UNICODE behavior is perfectly > correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) > So I strongly object removing that optimization. I'm guessing this would call for a vote then, since if implementing ICU, then I'd have to object to leaving it in. Changing the bahaviour of ICU doesn't seem right. Changing the behaviour of pg, so that it works as it should when using unicode, seems the right solution to me. > -- > Tatsuo Ishii > >
On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: > Tatsuo Ishii wrote: > > So Japanese(including ASCII)/UNICODE behavior is perfectly > > correct at this moment. > > Right, so you _never_ use accented ascii characters in Japanese? > (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "La principal característica humana es la tontería" (Augusto Monterroso)
Alvaro Herrera wrote: > Sent: Sunday, May 08, 2005 2:49 PM > To: John Hansen > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; > girgen@pingpong.net; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: > > Tatsuo Ishii wrote: > > > > So Japanese(including ASCII)/UNICODE behavior is > perfectly correct > > > at this moment. > > > > Right, so you _never_ use accented ascii characters in Japanese? > > (like è for example, whose uppercase is È) > > That isn't ASCII. It's latin1 or some other ASCII extension. Point taken... But... If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). UTF-8 encoded databases are very useful for representing multiple languages in the same database, but this usefulness vanishes if functions like upper/lower doesn't work correctly. So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair! > > -- > Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "La principal > característica humana es la tontería" > (Augusto Monterroso) > > ... John
> Alvaro Herrera wrote: > > Sent: Sunday, May 08, 2005 2:49 PM > > To: John Hansen > > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; > > girgen@pingpong.net; pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: > > > Tatsuo Ishii wrote: > > > > > > So Japanese(including ASCII)/UNICODE behavior is > > perfectly correct > > > > at this moment. > > > > > > Right, so you _never_ use accented ascii characters in Japanese? > > > (like è for example, whose uppercase is È) > > > > That isn't ASCII. It's latin1 or some other ASCII extension. > > Point taken... > But... > > If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). > UTF-8 encoded databases are very useful for representing multiple languages in the same database, > but this usefulness vanishes if functions like upper/lower doesn't work correctly. I'm just curious if Germany/French/Spanish mixed text can be sorted correctly. I think these languages need their own locales even with UNICODE/ICU. > So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair! Why don't you add a GUC variable or some such to control the upper/lower behavior? -- Tatsuo Ishii
Tatsuo Ishii > Sent: Sunday, May 08, 2005 3:41 PM > To: John Hansen > Cc: alvherre@dcc.uchile.cl; pgman@candle.pha.pa.us; > girgen@pingpong.net; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > Alvaro Herrera wrote: > > > Sent: Sunday, May 08, 2005 2:49 PM > > > To: John Hansen > > > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; girgen@pingpong.net; > > > pgsql-hackers@postgresql.org > > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: > > > > Tatsuo Ishii wrote: > > > > > > > > So Japanese(including ASCII)/UNICODE behavior is > > > perfectly correct > > > > > at this moment. > > > > > > > > Right, so you _never_ use accented ascii characters in > Japanese? > > > > (like è for example, whose uppercase is È) > > > > > > That isn't ASCII. It's latin1 or some other ASCII extension. > > > > Point taken... > > But... > > > > If you want EUC_JP (Japanese + ASCII) then use that as your > backend encoding, not UTF-8 (unicode). > > UTF-8 encoded databases are very useful for representing multiple > > languages in the same database, but this usefulness > vanishes if functions like upper/lower doesn't work correctly. > > I'm just curious if Germany/French/Spanish mixed text can be > sorted correctly. I think these languages need their own > locales even with UNICODE/ICU. No, they will not sort correctly, for that you still need the locale. > > > So optimizing for 3 languages breaks more than a hundred, > that's doesn't seem fair! That is a compromise I'd be willing to agree on. :) > Why don't you add a GUC variable or some such to control the > upper/lower behavior? > -- > Tatsuo Ishii > >
>> Is this patch ready for application? >> >> >http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-200 >5-05-06.d >> iff.gz >> >> The web site is: >> >> http://people.freebsd.org/~girgen/postgresql-icu/readme.html > >I don't think so, not quite. I have not had any positive >reports from linux >users, this is only tested in a FreeBSD environment. I'd say >it needs some more testing. I've just finished some simple testing on win32, and it does seem to work fine there as well, with a few modifications tothe build step. As I don't have a working autoconf, I applied the stuff that would come from your configure.in changesdirectly to the files. Meaning putting USE_ICU in pg_config.h, and the following changes to Makefile.global: 1) Add the directory for the ICU include files to CPPFLAGS 2) Add -licuuc -licuin to LIBS. I notice these are different names from those used on Unix, so a different configure testwill be needed there. 3) Add the icu lib directory to LDFLAGS 4) Remove encodings "PG_WIN_1258"/PG_TCVN and "IBM866"/PG_ALT. Didn't investigate further why this was needed, but this probablyhas something to do with my tests being off CVS tip vs the patch being for 8.0-stable. After doing this, I can properly get upper and lowercase for the swedish characters åäö/ÅÄÖ, as well as the ß->SS conversion.That's all I really tested at this point, but it did *not* work before in a unicode/UTF8 database. I've been working off the ICU 3.2 binaries available from the ICU page, the version compiled with MSVC 6 (because it usesthe same runtime DLLs as stuff compiled with mingw. It should work with the MSVC7 version as well, but that would introduceadditional DLL dependencies). //Magnus
>> The 3.2 vs 2.8 business is disturbing also; specifically, I >> don't think we get to require 3.2 on a platform where 2.8 is >> installed. > >There seems to be nothing in the ICU licence that would prevent us from >bundling it. >This would solve both the 3.2 vs 2.8 problems, and would remove the >'dependency'. > >> People just aren't going to hold still for that, even >> assuming that ICU supports installing both versions at once, >> which isn't clear to me at the moment ... > >There's no problems with having both installed. ... unless you're on win32, it seems. For some reason, they name their libs with the version on unix (libicu18n.so.32), but not on win32 where they all have the same name. And they don't stuff versioning information in the DLL files. That can be lived with as long as libpq doesn't depend on it, though - you can just stick the DLL in the same directory as the EXE, which is also what the ICU people recommend in their docs. Unnecessarily ugly, but it works. >I did that on debian to get the patch going. >Tho, bundling it seems cleaner to me. The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2. That means the size of the distribution would almost *double* if we bundled ICU. It's probably fine bundling it in the binary distributions (at least we'd probably do it on win32, since not many ppl will have it already there), but bundling the source seems a bit excessive to me. //Magnus
> The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2. > That means the size of the distribution would almost *double* > if we bundled ICU. Ermm,. Don't forget to remove the current charset conversions and locale support before making your size estimation. > > It's probably fine bundling it in the binary distributions > (at least we'd probably do it on win32, since not many ppl > will have it already there), but bundling the source seems a > bit excessive to me. > > //Magnus > >
> > I don't buy it. If current conversion tables does the right > > thing, why we need to replace. Or if conversion tables are > > not correct, why don't you fix it? I think the rule of > > character conversion will not change frequently, especially > > for LATIN languages. Thus maintaining cost is not too high. > > I never said we need to, but if we're going to implement ICU, > then we might as well go all the way. So you admit there's no benefit using ICU for replacing existing conversions? Besides ICU does not support all existing conversions, I think ICU has serious flaw for using conversion. If I understand correctly, ICU uses UNICODE internally to do the conversion. For example, to implement SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then converts UNICODE to EUC_JP. Problem is these conversion is not roud trip(conversion between SJIS/EUC_JP and UNICODE will lose some information). Thus SJIS->EUC_JP->SJIS conversion using ICU does not preserve original text. -- Tatsuo Ishii
> > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: > > > > > Tatsuo Ishii wrote: > > > > > > > > > > So Japanese(including ASCII)/UNICODE behavior is > > > > perfectly correct > > > > > > at this moment. > > > > > > > > > > Right, so you _never_ use accented ascii characters in > > Japanese? > > > > > (like è for example, whose uppercase is È) > > > > > > > > That isn't ASCII. It's latin1 or some other ASCII extension. > > > > > > Point taken... > > > But... > > > > > > If you want EUC_JP (Japanese + ASCII) then use that as your > > backend encoding, not UTF-8 (unicode). > > > UTF-8 encoded databases are very useful for representing multiple > > > languages in the same database, but this usefulness > > vanishes if functions like upper/lower doesn't work correctly. > > > > I'm just curious if Germany/French/Spanish mixed text can be > > sorted correctly. I think these languages need their own > > locales even with UNICODE/ICU. > > No, they will not sort correctly, for that you still need the locale. I'm confused. I thought the ICU patches is intended for using on broken locale platforms? -- Tatsuo Ishii
--On söndag, maj 08, 2005 22.19.25 +0900 Tatsuo Ishii <t-ishii@sra.co.jp> wrote: >> > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: >> > > > > Tatsuo Ishii wrote: >> > > > >> > > > > > So Japanese(including ASCII)/UNICODE behavior is >> > > > perfectly correct >> > > > > > at this moment. >> > > > > >> > > > > Right, so you _never_ use accented ascii characters in >> > Japanese? >> > > > > (like è for example, whose uppercase is È) >> > > > >> > > > That isn't ASCII. It's latin1 or some other ASCII extension. >> > > >> > > Point taken... >> > > But... >> > > >> > > If you want EUC_JP (Japanese + ASCII) then use that as your >> > backend encoding, not UTF-8 (unicode). >> > > UTF-8 encoded databases are very useful for representing multiple >> > > languages in the same database, but this usefulness >> > vanishes if functions like upper/lower doesn't work correctly. >> > >> > I'm just curious if Germany/French/Spanish mixed text can be >> > sorted correctly. I think these languages need their own >> > locales even with UNICODE/ICU. >> >> No, they will not sort correctly, for that you still need the locale. > > I'm confused. I thought the ICU patches is intended for using on > broken locale platforms? It will sort correctly in *one* locale, using ICU. You still cannot mix different locales in the same database cluster, the collation locale is still fixed at initdb time, unfortunately. /Palle
Palle Girgensohn <girgen@pingpong.net> writes: >> I'm confused. I thought the ICU patches is intended for using on >> broken locale platforms? > It will sort correctly in *one* locale, using ICU. You still cannot mix > different locales in the same database cluster, the collation locale is > still fixed at initdb time, unfortunately. I thought the point of using ICU was to be able to dig out from under that restriction? It's a bit of a large pill to swallow if we will still have to throw it away someday to become SQL spec compliant. regards, tom lane
Magnus Hagander wrote: >The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2. >That means the size of the distribution would almost *double* if we >bundled ICU. > >It's probably fine bundling it in the binary distributions (at least >we'd probably do it on win32, since not many ppl will have it already >there), but bundling the source seems a bit excessive to me. > > > > I'm also mildly curious to know what effect using ICU will have on memory consumption. Has anyone looked? My suspicion was aroused by this library that it installed on my FC3 box: -rwxr-xr-x 1 root root 9777876 Jan 1 20:18 /usr/lib/libicudata.so.32.0 cheers andrew
> Palle Girgensohn <girgen@pingpong.net> writes: >>> I'm confused. I thought the ICU patches is intended for using on >>> broken locale platforms? > >> It will sort correctly in *one* locale, using ICU. You still cannot mix >> different locales in the same database cluster, the collation locale is >> still fixed at initdb time, unfortunately. > > I thought the point of using ICU was to be able to dig out from under > that restriction? I think it might be quite possible to mix several locales, using ICU. It's just that this is not what the patch does at moment. It just finds out the locale set at initdb and uses it for collation with ICU. Handling mixed locales for collation has a few hard problems, AFAIK. First, isn't the main obstacle for mixing collations that indices require a single well defined locale? I assume that locale dependant comparison (collation) is used when indexing tuples, right? As long as a specific locales collation is used for indexing text fields, I believe we cannot easily mix different locales, right? Second, how do we tell the backend which locale to use? Is there some SQL spec for this? > It's a bit of a large pill to swallow if we will still > have to throw it away someday to become SQL spec compliant. What do we need to be SQL spec compliant in this respect? /Palle
Tatsuo Ishii wrote: > Sent: Sunday, May 08, 2005 11:19 PM > To: John Hansen > Cc: alvherre@dcc.uchile.cl; pgman@candle.pha.pa.us; > girgen@pingpong.net; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: > > > > > > Tatsuo Ishii wrote: > > > > > > > > > > > > So Japanese(including ASCII)/UNICODE behavior is > > > > > perfectly correct > > > > > > > at this moment. > > > > > > > > > > > > Right, so you _never_ use accented ascii characters in > > > Japanese? > > > > > > (like è for example, whose uppercase is È) > > > > > > > > > > That isn't ASCII. It's latin1 or some other ASCII extension. > > > > > > > > Point taken... > > > > But... > > > > > > > > If you want EUC_JP (Japanese + ASCII) then use that as your > > > backend encoding, not UTF-8 (unicode). > > > > UTF-8 encoded databases are very useful for > representing multiple > > > > languages in the same database, but this usefulness > > > vanishes if functions like upper/lower doesn't work correctly. > > > > > > I'm just curious if Germany/French/Spanish mixed text can > be sorted > > > correctly. I think these languages need their own locales > even with > > > UNICODE/ICU. > > > > No, they will not sort correctly, for that you still need > the locale. > > I'm confused. I thought the ICU patches is intended for using > on broken locale platforms? Initially yes, but why duplicate code? What I meant was, that they will not sort correctly using the C locale. Locale _name_ needs to be known to ICU for it to sort correctly. > -- > Tatsuo Ishii > >
Tom Lane wrote: > Sent: Monday, May 09, 2005 2:47 AM > To: Palle Girgensohn > Cc: Tatsuo Ishii; John Hansen; alvherre@dcc.uchile.cl; > pgman@candle.pha.pa.us; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > Palle Girgensohn <girgen@pingpong.net> writes: > >> I'm confused. I thought the ICU patches is intended for using on > >> broken locale platforms? > > > It will sort correctly in *one* locale, using ICU. You still cannot > > mix different locales in the same database cluster, the collation > > locale is still fixed at initdb time, unfortunately. > > I thought the point of using ICU was to be able to dig out > from under that restriction? It's a bit of a large pill to > swallow if we will still have to throw it away someday to > become SQL spec compliant. That is not a limitation of ICU but of postgresql. I don't know what the specs say, but imagine something like: SELECT foo FROM bar ORDER BY foo WITH LOCALE 'en_US', foobar WITH LOCALE 'jp_JP'; Which would be less difficult to implement using ICU. > > regards, tom lane > > ... John
Tatsuo Ishii wrote: > Sent: Sunday, May 08, 2005 11:08 PM > To: John Hansen > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > > I don't buy it. If current conversion tables does the > right thing, > > > why we need to replace. Or if conversion tables are not > correct, why > > > don't you fix it? I think the rule of character > conversion will not > > > change frequently, especially for LATIN languages. Thus > maintaining > > > cost is not too high. > > > > I never said we need to, but if we're going to implement > ICU, then we > > might as well go all the way. > > So you admit there's no benefit using ICU for replacing > existing conversions? > > Besides ICU does not support all existing conversions, I > think ICU has serious flaw for using conversion. If I > understand correctly, ICU uses UNICODE internally to do the > conversion. For example, to implement > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then > converts UNICODE to EUC_JP. Problem is these conversion is > not roud trip(conversion between SJIS/EUC_JP and UNICODE will > lose some information). Thus SJIS->EUC_JP->SJIS conversion > using ICU does not preserve original text. Could you please send me a sample text as an attachment encoded in SJIS where this would happen? > -- > Tatsuo Ishii > >
> -----Original Message----- > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] > Sent: Sunday, May 08, 2005 11:08 PM > To: John Hansen > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > > I don't buy it. If current conversion tables does the > right thing, > > > why we need to replace. Or if conversion tables are not > correct, why > > > don't you fix it? I think the rule of character > conversion will not > > > change frequently, especially for LATIN languages. Thus > maintaining > > > cost is not too high. > > > > I never said we need to, but if we're going to implement > ICU, then we > > might as well go all the way. > > So you admit there's no benefit using ICU for replacing > existing conversions? > > Besides ICU does not support all existing conversions, I > think ICU has serious flaw for using conversion. If I > understand correctly, ICU uses UNICODE internally to do the > conversion. For example, to implement > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then > converts UNICODE to EUC_JP. Problem is these conversion is > not roud trip(conversion between SJIS/EUC_JP and UNICODE will > lose some information). Thus SJIS->EUC_JP->SJIS conversion > using ICU does not preserve original text. Just for the record, I fetched a web page encoded in sjis, and converted it to euc-jp and back using uconv from ICU 3.2, and the result is the original is identical to the transformed file. uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.htmluconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.eucdiffindex.html index.html.sjis ... John
> > -----Original Message----- > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] > > Sent: Sunday, May 08, 2005 11:08 PM > > To: John Hansen > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > > pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > I don't buy it. If current conversion tables does the > > right thing, > > > > why we need to replace. Or if conversion tables are not > > correct, why > > > > don't you fix it? I think the rule of character > > conversion will not > > > > change frequently, especially for LATIN languages. Thus > > maintaining > > > > cost is not too high. > > > > > > I never said we need to, but if we're going to implement > > ICU, then we > > > might as well go all the way. > > > > So you admit there's no benefit using ICU for replacing > > existing conversions? > > > > Besides ICU does not support all existing conversions, I > > think ICU has serious flaw for using conversion. If I > > understand correctly, ICU uses UNICODE internally to do the > > conversion. For example, to implement > > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then > > converts UNICODE to EUC_JP. Problem is these conversion is > > not roud trip(conversion between SJIS/EUC_JP and UNICODE will > > lose some information). Thus SJIS->EUC_JP->SJIS conversion > > using ICU does not preserve original text. > > Just for the record, I fetched a web page encoded in sjis, and converted > it to euc-jp and back using uconv from ICU 3.2, and the result is the > original is identical to the transformed file. > > uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html > uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc > diff index.html index.html.sjis Not all SJIS/EUC_JP characters have the problem. You might want to try: Shift_JIS 0x81e6, 0x879a, 0xfa5b. BTW, I got this with ICU 3.2: $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found The contents of a.txt is 0xa1c1 which is a valid EUC_JP character. This makes me nervous in using ICU... -- Tatsuo Ishii
Tatsuo Ishii wrote: > Sent: Tuesday, May 10, 2005 12:32 AM > To: John Hansen > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > > -----Original Message----- > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] > > > Sent: Sunday, May 08, 2005 11:08 PM > > > To: John Hansen > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > > > pgsql-hackers@postgresql.org > > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > > > I don't buy it. If current conversion tables does the > > > right thing, > > > > > why we need to replace. Or if conversion tables are not > > > correct, why > > > > > don't you fix it? I think the rule of character > > > conversion will not > > > > > change frequently, especially for LATIN languages. Thus > > > maintaining > > > > > cost is not too high. > > > > > > > > I never said we need to, but if we're going to implement > > > ICU, then we > > > > might as well go all the way. > > > > > > So you admit there's no benefit using ICU for replacing existing > > > conversions? > > > > > > Besides ICU does not support all existing conversions, I > think ICU > > > has serious flaw for using conversion. If I understand correctly, > > > ICU uses UNICODE internally to do the conversion. For example, to > > > implement > > > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then > > > converts UNICODE to EUC_JP. Problem is these conversion > is not roud > > > trip(conversion between SJIS/EUC_JP and UNICODE will lose some > > > information). Thus SJIS->EUC_JP->SJIS conversion using > ICU does not > > > preserve original text. > > > > Just for the record, I fetched a web page encoded in sjis, and > > converted it to euc-jp and back using uconv from ICU 3.2, and the > > result is the original is identical to the transformed file. > > > > uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html > uconv -f > > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc diff > index.html > > index.html.sjis > > Not all SJIS/EUC_JP characters have the problem. You might want to > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b. > > BTW, I got this with ICU 3.2: > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt > Conversion from Unicode to codepage failed at input byte > position 0. Unicode: 301c Error: Invalid character found > > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character. That actually makes perfect sense, since according to unicode.org's database: 301C ~ WAVE DASH This character was encoded to match JIS C 6226-1978 1-33 "wave dash". The JIS standards and some industry practise disagree in mapping. - 3030 wavy dash - FF5E full width tilde In PG FF5E is the mapping currently used. That is obviously wrong (according to the standards), as that is only a 'similar character'. Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis doesn't define "WAVE DASH". In all, I believe this behaviour to be correct according to the standards. There'd be nothing to stop us from defining alternative mappings for the cases where we deviate from the standard, but the question is, should we be non-standard? > > This makes me nervous in using ICU... > -- > Tatsuo Ishii > > ... John
> Tatsuo Ishii wrote: > > Sent: Tuesday, May 10, 2005 12:32 AM > > To: John Hansen > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > > pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > -----Original Message----- > > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] > > > > Sent: Sunday, May 08, 2005 11:08 PM > > > > To: John Hansen > > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > > > > pgsql-hackers@postgresql.org > > > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > > > > > I don't buy it. If current conversion tables does the > > > > right thing, > > > > > > why we need to replace. Or if conversion tables are not > > > > correct, why > > > > > > don't you fix it? I think the rule of character > > > > conversion will not > > > > > > change frequently, especially for LATIN languages. Thus > > > > maintaining > > > > > > cost is not too high. > > > > > > > > > > I never said we need to, but if we're going to implement > > > > ICU, then we > > > > > might as well go all the way. > > > > > > > > So you admit there's no benefit using ICU for replacing existing > > > > conversions? > > > > > > > > Besides ICU does not support all existing conversions, I > > think ICU > > > > has serious flaw for using conversion. If I understand correctly, > > > > ICU uses UNICODE internally to do the conversion. For example, to > > > > implement > > > > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then > > > > converts UNICODE to EUC_JP. Problem is these conversion > > is not roud > > > > trip(conversion between SJIS/EUC_JP and UNICODE will lose some > > > > information). Thus SJIS->EUC_JP->SJIS conversion using > > ICU does not > > > > preserve original text. > > > > > > Just for the record, I fetched a web page encoded in sjis, and > > > converted it to euc-jp and back using uconv from ICU 3.2, and the > > > result is the original is identical to the transformed file. > > > > > > uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html > > uconv -f > > > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc diff > > index.html > > > index.html.sjis > > > > Not all SJIS/EUC_JP characters have the problem. You might want to > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b. > > > > BTW, I got this with ICU 3.2: > > > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt > > Conversion from Unicode to codepage failed at input byte > > position 0. Unicode: 301c Error: Invalid character found > > > > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character. > > That actually makes perfect sense, since according to unicode.org's > database: > 301C ~ WAVE DASH > This character was encoded to match JIS C 6226-1978 1-33 "wave > dash". > The JIS standards and some industry practise disagree in mapping. > - 3030 wavy dash > - FF5E full width tilde > > In PG FF5E is the mapping currently used. That is obviously wrong > (according to the standards), as that is only a 'similar character'. > > Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis > doesn't define "WAVE DASH". > In all, I believe this behaviour to be correct according to the > standards. > > There'd be nothing to stop us from defining alternative mappings for the > cases where we deviate from the standard, but the question is, should we > be non-standard? You missed the point. EUC_JP 0xa1c1 is a perfect valid data and uconv -f EUC_JP -t Shift_JIS should convert it to Shift_JIS 0x8160 regardless of the internal of uconv. -- Tatsuo Ishii
Tatsuo Ishii wrote: > Sent: Tuesday, May 10, 2005 5:45 PM > To: John Hansen > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Patch for collation using ICU > > > Tatsuo Ishii wrote: > > > Sent: Tuesday, May 10, 2005 12:32 AM > > > To: John Hansen > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > > > pgsql-hackers@postgresql.org > > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > > > -----Original Message----- > > > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] > > > > > Sent: Sunday, May 08, 2005 11:08 PM > > > > > To: John Hansen > > > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; > > > > > pgsql-hackers@postgresql.org > > > > > Subject: Re: [HACKERS] Patch for collation using ICU > > > > > > > > > > > > I don't buy it. If current conversion tables does the > > > > > right thing, > > > > > > > why we need to replace. Or if conversion tables are not > > > > > correct, why > > > > > > > don't you fix it? I think the rule of character > > > > > conversion will not > > > > > > > change frequently, especially for LATIN languages. Thus > > > > > maintaining > > > > > > > cost is not too high. > > > > > > > > > > > > I never said we need to, but if we're going to implement > > > > > ICU, then we > > > > > > might as well go all the way. > > > > > > > > > > So you admit there's no benefit using ICU for > replacing existing > > > > > conversions? > > > > > > > > > > Besides ICU does not support all existing conversions, I > > > think ICU > > > > > has serious flaw for using conversion. If I understand > > > > > correctly, ICU uses UNICODE internally to do the > conversion. For > > > > > example, to implement > > > > > SJIS->EUC_JP conversion, ICU first converts SJIS to > UNICODE then > > > > > converts UNICODE to EUC_JP. Problem is these conversion > > > is not roud > > > > > trip(conversion between SJIS/EUC_JP and UNICODE will > lose some > > > > > information). Thus SJIS->EUC_JP->SJIS conversion using > > > ICU does not > > > > > preserve original text. > > > > > > > > Just for the record, I fetched a web page encoded in sjis, and > > > > converted it to euc-jp and back using uconv from ICU > 3.2, and the > > > > result is the original is identical to the transformed file. > > > > > > > > uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html > > > uconv -f > > > > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc diff > > > index.html > > > > index.html.sjis > > > > > > Not all SJIS/EUC_JP characters have the problem. You might want to > > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b. > > > > > > BTW, I got this with ICU 3.2: > > > > > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt > Conversion > > > from Unicode to codepage failed at input byte position 0. > Unicode: > > > 301c Error: Invalid character found > > > > > > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character. > > > > That actually makes perfect sense, since according to unicode.org's > > database: > > 301C ~ WAVE DASH > > This character was encoded to match JIS C 6226-1978 > 1-33 "wave > > dash". > > The JIS standards and some industry practise > disagree in mapping. > > - 3030 wavy dash > > - FF5E full width tilde > > > > In PG FF5E is the mapping currently used. That is obviously wrong > > (according to the standards), as that is only a 'similar character'. > > > > Unfortunately, there is no mapping from 301C to shift_jis, as > > shift_jis doesn't define "WAVE DASH". > > In all, I believe this behaviour to be correct according to the > > standards. > > > > There'd be nothing to stop us from defining alternative > mappings for > > the cases where we deviate from the standard, but the question is, > > should we be non-standard? > > You missed the point. EUC_JP 0xa1c1 is a perfect valid data > and uconv -f EUC_JP -t Shift_JIS should convert it to > Shift_JIS 0x8160 regardless of the internal of uconv. Studying ICU forther, I found that it works fine, provided you use the _correct_ charset for the conversion.. a.txt contains 0x81 0x60 uconv -f ibm-943_P130-1999 -t EUC_JP a.txt -o b.txt b.txt now contains 0xa1 0xc1 uconv -t ibm-943_P130-1999 -f EUC_JP b.txt -o a.txt a.txt still contains 0x81 0x60 The mapping table you want is ibm-943_P130-1999 Similar, we'd need to find the right euc-jp (and plain jis) mapping, assuming we want the one that strictly defines JIS X 0208 right? I trust this to put your fears to rest... > -- > Tatsuo Ishii > > ... John