Обсуждение: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)
The following bug has been logged online: Bug reference: 1775 Logged by: Marcus Raphelt Email address: marcus@raphelt.de PostgreSQL version: latest Operating system: WinXP (dev), Win2000 (server) Description: UTF8 to ISO does not convert the german "sharp s" (ß) Details: Hi, I currently have a iso_8859_1 app and included the opengeodb tables to have a set of german cities and zipcodes. http://opengeodb.sourceforge.net/index.php?FrontPage_en I downloaded this dump: http://prdownloads.sourceforge.net/opengeodb/opengeodb-0.2.3c-UTF8-postgres. zip?download They offer a Postgresql dump that can be imported without hassle, but is in UTF8. I did several attempts using convert() to convert the city names to iso. It works, and it also works fine on german umlauts (äöü), but it seems like convert() does NOT convert the german "Ã" / sharp s / ß... cheers, marcus
PiBCdWcgcmVmZXJlbmNlOiAgICAgIDE3NzUNCj4gTG9nZ2VkIGJ5OiAgICAg ICAgICBNYXJjdXMgUmFwaGVsdA0KPiBFbWFpbCBhZGRyZXNzOiAgICAgIG1h cmN1c0ByYXBoZWx0LmRlDQo+IFBvc3RncmVTUUwgdmVyc2lvbjogbGF0ZXN0 DQo+IE9wZXJhdGluZyBzeXN0ZW06ICAgV2luWFAgKGRldiksIFdpbjIwMDAg KHNlcnZlcikNCj4gRGVzY3JpcHRpb246ICAgICAgICBVVEY4IHRvIElTTyBk b2VzIG5vdCBjb252ZXJ0IHRoZSBnZXJtYW4gInNoYXJwIHMiDQo+ICgmc3ps aWc7KQ0KPiBEZXRhaWxzOiANCj4gDQo+IEhpLA0KPiANCj4gSSBjdXJyZW50 bHkgaGF2ZSBhIGlzb184ODU5XzEgYXBwIGFuZCBpbmNsdWRlZCB0aGUgb3Bl bmdlb2RiIA0KPiB0YWJsZXMgdG8gaGF2ZSBhIHNldCBvZiBnZXJtYW4gY2l0 aWVzIGFuZCB6aXBjb2Rlcy4NCj4gDQo+IGh0dHA6Ly9vcGVuZ2VvZGIuc291 cmNlZm9yZ2UubmV0L2luZGV4LnBocD9Gcm9udFBhZ2VfZW4NCj4gSSBkb3du bG9hZGVkIHRoaXMgZHVtcDoNCj4gDQo+IGh0dHA6Ly9wcmRvd25sb2Fkcy5z b3VyY2Vmb3JnZS5uZXQvb3Blbmdlb2RiL29wZW5nZW9kYi0wLjIuM2MtDQo+ IFVURjgtcG9zdGdyZXMuDQo+IHppcD9kb3dubG9hZA0KPiANCj4gVGhleSBv ZmZlciBhIFBvc3RncmVzcWwgZHVtcCB0aGF0IGNhbiBiZSBpbXBvcnRlZCB3 aXRob3V0IA0KPiBoYXNzbGUsIGJ1dCBpcyBpbiBVVEY4LiBJIGRpZCBzZXZl cmFsIGF0dGVtcHRzIHVzaW5nIA0KPiBjb252ZXJ0KCkgdG8gY29udmVydCB0 aGUgY2l0eSBuYW1lcyB0byBpc28uIEl0IHdvcmtzLCBhbmQgaXQgDQo+IGFs c28gd29ya3MgZmluZSBvbiBnZXJtYW4gdW1sYXV0cyAow4PCpMODwrbDg8K8 KSwgYnV0IGl0IHNlZW1zIGxpa2UgDQo+IGNvbnZlcnQoKSBkb2VzIE5PVCBj b252ZXJ0IHRoZSBnZXJtYW4gIsODxbgiIC8gc2hhcnAgcyAvICZzemxpZzsu Li4NCg0KUG9zdGdyZVNRTCA4LjAgZG9lcyBub3Qgc3VwcG9ydCBVVEY4IG9u IFdpbjMyLiBZb3Ugd2lsbCBuZWVkIHRvIGNvbnZlcnQgdGhlIGZpbGUgYmVm b3JlIHlvdSBsb2FkIGl0IGludG8gdGhlIGRhdGFiYXNlIC0gbG9vayBhdCBm b3IgZXhhbXBsZSB0aGUgR05VIGljb252IHByb2dyYW0gdG8gZG8gdGhpcy4N Cg0KLy9NYWdudXMNCg==
Magnus Hagander wrote: > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to > convert the file before you load it into the database - look at for > example the GNU iconv program to do this. PostgreSQL does not support UTF8 locales on Windows, but the character conversion should work the same on all platforms, shouldn't it? -- Peter Eisentraut http://developer.postgresql.org/~petere/
> > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to=20 > > convert the file before you load it into the database - look at for=20 > > example the GNU iconv program to do this. >=20 > PostgreSQL does not support UTF8 locales on Windows, but the=20 > character conversion should work the same on all platforms,=20 > shouldn't it? I thought that was also handled in the locale code. But now that you say it, it's probably not - it should be the same. I'll withdraw my comment then, clearly wrong. Thanks for clearing that up. //Magnus
Magnus Hagander Wrote: > > > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to=20 > > > convert the file before you load it into the database -=20 > look at for=20 > > > example the GNU iconv program to do this. > >=20 > > PostgreSQL does not support UTF8 locales on Windows, but=20 > the character=20 > > conversion should work the same on all platforms, shouldn't it? >=20 > I thought that was also handled in the locale code. But now=20 > that you say it, it's probably not - it should be the same. > I'll withdraw my comment then, clearly wrong. Thanks for=20 > clearing that up. Currently, it is very much so handled by the systems locale support, hence it is handled differently depending on platform. ICU support should make it to 8.1 I hear, which will make the bahaviour uniform. ... John >=20 >=20 > //Magnus >=20 > ---------------------------(end of=20 > broadcast)--------------------------- > TIP 6: explain analyze is your friend >=20 >=20
> Magnus Hagander wrote: > > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to > > convert the file before you load it into the database - look at for > > example the GNU iconv program to do this. > > PostgreSQL does not support UTF8 locales on Windows, but the character > conversion should work the same on all platforms, shouldn't it? Right. BTW the original bug report sounds interesting to me. The conversion between UTF-8 and ISO 8859-1 is handled by a program logic, not by conversion tables. So if german umlauts are converted fine, there's no reason the conversion for german sharp s does not work. Marcus, Can you give me the exact error message from PostgreSQL when the conversio failed? -- Tatsuo Ishii
> Hi Tatsuo, > > Am Mittwoch, 20. Juli 2005 um 01:00 schrieben Sie: > > TI> conversion tables. So if german umlauts are converted fine, there's no > TI> reason the conversion for german sharp s does not work. > > TI> Marcus, > > TI> Can you give me the exact error message from PostgreSQL when the > TI> conversio failed? > > Well - actually, there IS no error message, convert() just returns an > empty sting. In detail, here's what I did: > > -Set up PostgreSQL (Ascii) > > -Imported the opengeodb > http://sourceforge.net/project/showfiles.php?group_id=132421 > As they offer a PostgreSQL dump, I chose this one. > > -Imported the dump using pgAdminIII > > -Created a view that returns all german cities with ZIP and > opengeodb-locationID: > > CREATE OR REPLACE VIEW orte_de AS > SELECT code.text_val AS plz, code.loc_id, town.text_val AS ort > FROM geodb_hierarchies hi, geodb_textdata state, geodb_textdata town, geodb_textdata code > WHERE hi.id_lvl2 = state.loc_id AND state.text_val = 'DE'::text > AND state.text_type = 500100001 AND town.loc_id = hi.loc_id > AND town.text_type = 500100000 AND code.loc_id = town.loc_id > AND code.text_type = 500300000; > > So now I've got three columns: "plz" (zip), "ort" (city) and loc_id. > Assuming I want to retrieve cites in the Hamburg / Hannover area... > (perfect for this task, as they have pretty strange city names there > :) > > SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as > ort_conv > from orte_de > where plz between 20000 and 30000 > order by ort_conv > > This query returns empty values for "ort_conv" if "ort" contains a > sharp s. > Btw, it seems like it is the same for "Ä" (Ä), have a look at > loc_id 25182. I see no problem with encoding conversion itself: utf8=# \encoding latin1 utf8=# select * from t1; t ---- $Bd_(B (1 row) > SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as > ort_conv > from orte_de > where plz between 20000 and 30000 > order by ort_conv Problem here is the result of convert(ort using utf_8_to_iso_8859_1) is ISO-8859-1 but your database encoding is UTF-8, so the terminal treats ISO-8859-1 chars as UTF-8 which will result in unexpected characters displayed. I guess what you want to do is: SELECT plz, loc_id, ort from orte_de where plz between 20000 and 30000 order by convert(ort using utf_8_to_iso_8859_1) -- Tatsuo Ishii
> TI> Problem here is the result of convert(ort using utf_8_to_iso_8859_1) > TI> is ISO-8859-1 but your database encoding is UTF-8, > > Not exactly, my database is in iso_8859_1, only the opengeodb-dump is > solely available in utf-8. > The whole thing is no problem so far, as I can convert the values in > the application that uses the data - I just wanted to let you know. That's surprising. Since you try to convert from UTF-8 to ISO 8859-1, I thought you store data as UTF-8. > TI> I guess what you want to do is: > TI> SELECT plz, loc_id, ort from orte_de > TI> where plz between 20000 and 30000 > TI> order by convert(ort using utf_8_to_iso_8859_1) > > Well, that was a fantasy-query :) just to show some "odd" values, I do > not need that much data in the application. Currently, I cannot > retrieve ready-converted iso data from the orte_de view, as converted > names that contain "ß" are not shown... Let me make sure. You store UTF-8 data in ISO 8859-1 DB? -- Tatsuo Ishii
Hi, Am Mittwoch, 20. Juli 2005 um 10:27 schrieben Sie: TI> Problem here is the result of convert(ort using utf_8_to_iso_8859_1) TI> is ISO-8859-1 but your database encoding is UTF-8, Not exactly, my database is in iso_8859_1, only the opengeodb-dump is solely available in utf-8. The whole thing is no problem so far, as I can convert the values in the application that uses the data - I just wanted to let you know. TI> I guess what you want to do is: TI> SELECT plz, loc_id, ort from orte_de TI> where plz between 20000 and 30000 TI> order by convert(ort using utf_8_to_iso_8859_1) Well, that was a fantasy-query :) just to show some "odd" values, I do not need that much data in the application. Currently, I cannot retrieve ready-converted iso data from the orte_de view, as converted names that contain "ß" are not shown... -- Mit freundlichen Grüssen Marcus Raphelt mailto:marcus@raphelt.de
Hi Tatsuo, Am Mittwoch, 20. Juli 2005 um 01:00 schrieben Sie: TI> conversion tables. So if german umlauts are converted fine, there's no TI> reason the conversion for german sharp s does not work. TI> Marcus, TI> Can you give me the exact error message from PostgreSQL when the TI> conversio failed? Well - actually, there IS no error message, convert() just returns an empty sting. In detail, here's what I did: -Set up PostgreSQL (Ascii) -Imported the opengeodb http://sourceforge.net/project/showfiles.php?group_id=132421 As they offer a PostgreSQL dump, I chose this one. -Imported the dump using pgAdminIII -Created a view that returns all german cities with ZIP and opengeodb-locationID: CREATE OR REPLACE VIEW orte_de AS SELECT code.text_val AS plz, code.loc_id, town.text_val AS ort FROM geodb_hierarchies hi, geodb_textdata state, geodb_textdata town, geodb_textdata code WHERE hi.id_lvl2 = state.loc_id AND state.text_val = 'DE'::text AND state.text_type = 500100001 AND town.loc_id = hi.loc_id AND town.text_type = 500100000 AND code.loc_id = town.loc_id AND code.text_type = 500300000; So now I've got three columns: "plz" (zip), "ort" (city) and loc_id. Assuming I want to retrieve cites in the Hamburg / Hannover area... (perfect for this task, as they have pretty strange city names there :) SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as ort_conv from orte_de where plz between 20000 and 30000 order by ort_conv This query returns empty values for "ort_conv" if "ort" contains a sharp s. Btw, it seems like it is the same for "Ä" (Ä), have a look at loc_id 25182. -- Mit freundlichen Grüssen Marcus Raphelt mailto:marcus@raphelt.de
Hi, PE> PostgreSQL does not support UTF8 locales on Windows, but the character PE> conversion should work the same on all platforms, shouldn't it? It works pretty fine even under win32, all umlauts are coverted as expexted, except for the "ß" (ß). For example, "Köln" converts to "Köln", but ""Groà Kreutz" results in an empty string. -- Mit freundlichen Grüssen Marcus Raphelt mailto:marcus@raphelt.de
On Wed, Jul 20, 2005 at 08:35:33 +1000, John Hansen <john@geeknet.com.au> wrote: > > ICU support should make it to 8.1 I hear, which will make the bahaviour > uniform. There was some discussion about that, but my impression was that that wasn't going to happen.
> > ICU support should make it to 8.1 I hear, which will make the=20 > > bahaviour uniform. >=20 > There was some discussion about that, but my impression was=20 > that that wasn't going to happen. AFAIK the result was that it would go in as an alternative, but not to *replace* what's there now. //Magnus
"Magnus Hagander" <mha@sollentuna.net> writes: >>> ICU support should make it to 8.1 I hear, which will make the >>> bahaviour uniform. >> >> There was some discussion about that, but my impression was >> that that wasn't going to happen. > AFAIK the result was that it would go in as an alternative, but not to > *replace* what's there now. The whole issue is still TBD as far as I know --- there were arguments on both sides but not consensus, and if any committer has reviewed the ICU patch in any detail, I didn't hear about it. It'd probably be easier to get consensus for a "make it a compile-time option" proposal than for a "wholesale replacement" proposal. regards, tom lane
> >>> ICU support should make it to 8.1 I hear, which will make the=20 > >>> bahaviour uniform. > >>=20 > >> There was some discussion about that, but my impression=20 > was that that=20 > >> wasn't going to happen. >=20 > > AFAIK the result was that it would go in as an alternative,=20 > but not to > > *replace* what's there now. >=20 > The whole issue is still TBD as far as I know --- there were=20 > arguments on both sides but not consensus, and if any=20 > committer has reviewed the ICU patch in any detail, I didn't=20 > hear about it. Naturally it has to pass the standard requirements for an acceptable patch. But IIRC the early version was checked over (I think by Bruce). Not in details, but as an overview. I fully realise that doesn't mean it's approved, but it's a step ;-) > It'd probably be easier to get consensus for a "make it a=20 > compile-time option" proposal than for a "wholesale=20 > replacement" proposal. My point exactly. Considering it's vital for e.g. win32 to get unicode, and apparantly pretty good for freebsd people if I read Palle right, whereas other platforms have what they need already... //Magnus