Discussion: PostgreSQL fails to convert decomposed utf-8 to other encodings
There's a bug in encoding conversions from utf-8 to other encodings that
results in corrupt output if decomposed utf-8 is used.

PostgreSQL doesn't process utf-8 to pre-composed form first, so
decomposed UTF-8 is not handled correctly.

Take á:

regress=> -- Decomposed - 'a' then 'acute'
regress=> SELECT E'\u0061\u0301';
 ?column?
----------
 á
(1 row)

regress=> -- Precomposed - 'a-acute'
regress=> SELECT E'\u00E1';
 ?column?
----------
 á
(1 row)

regress=> SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no equivalent in encoding "LATIN1"

regress=> SELECT convert_to(E'\u00E1', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)

This affects input from the client too:

regress=> SELECT convert_to('á', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no equivalent in encoding "LATIN1"

regress=> SELECT convert_to('á', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)

... yes, that looks like the same function producing different results
on identical input. You might not be able to reproduce with copy and
paste from this mail if your client normalizes UTF-8, but you'll be able
to by printing the decomposed character to your terminal as an escape
string, then copying and pasting from there.

We should've probably been normalizing decomposed sequences to
precomposed as part of utf-8 validation wherever 'text' input occurs,
but it's too late for that now as DBs in the wild will contain
decomposed chars. Instead, conversion functions need to normalize
decomposed chars to precomposed before converting from utf-8 to another
encoding.

Comments?

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

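For anyone who wants to see the effect outside the database, here is a minimal Python sketch (not PostgreSQL code) of the approach proposed above: compose to NFC first, then convert to the target encoding. It relies only on the standard library's `unicodedata.normalize`; the helper name `to_latin1` is made up for illustration.

```python
import unicodedata

DECOMPOSED = "a\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
PRECOMPOSED = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE


def to_latin1(text):
    """Hypothetical helper: compose to NFC, then convert to ISO-8859-1."""
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")


# The decomposed form is three bytes in UTF-8; 0xcc 0x81 is the accent.
print(DECOMPOSED.encode("utf-8").hex())

# Converting without normalizing fails, much like convert_to() does.
try:
    DECOMPOSED.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("conversion failed:", exc.reason)

# After NFC composition both spellings convert to the same single byte.
print(to_latin1(DECOMPOSED), to_latin1(PRECOMPOSED))
```

This only helps when a precomposed equivalent exists in the target encoding; a combining sequence with no such equivalent would still fail to convert.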
Craig Ringer <craig@2ndquadrant.com> writes:
> There's a bug in encoding conversions from utf-8 to other encodings that
> results in corrupt output if decomposed utf-8 is used.

We don't actually support "decomposed" utf8; if there is any bug here,
it's that the input you show isn't rejected.  But I think there was
some intentional choice to not check \u escapes fully.

			regards, tom lane

On 08/06/2014 09:14 AM, Tom Lane wrote:
> We don't actually support "decomposed" utf8; if there is any bug here,
> it's that the input you show isn't rejected.  But I think there was
> some intentional choice to not check \u escapes fully.

Combining characters (i.e. decomposed utf-8 form, for chars where there
is a combined equivalent) are part of utf-8. They're not an optional add-on.

So if Pg doesn't support them, it doesn't fully support utf-8. Which is
fine as far as it goes, but must be documented as a limitation at
minimum. (I'll deal with that).

It also means that you get fun anomalies like:

regress=> SELECT 'á' = 'á';
 ?column?
----------
 f
(1 row)

which is IMO insane.

Not only that, but we can't reject decomposed forms, because they will
already exist in live installs. That'd break dump and reload of such
installs and cause exciting problems with pg_upgrade.

The "we'll just reject part of utf-8" opportunity has flown. It needs to
be documented as a bug in existing versions, and I guess given that I'm
the one complaining I get to see if I can find a sane fix for 9.5...

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

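The comparison anomaly is not specific to Pg: any comparison by code point treats the two spellings of á as different strings. A short Python sketch, assuming NFC is the form one would compare in; `nfc_equal` is a hypothetical helper, not an existing API.

```python
import unicodedata


def nfc_equal(left, right):
    """Hypothetical helper: compare two strings after composing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


decomposed = "a\u0301"   # 'a' + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# Plain comparison looks at code points, so the two spellings differ.
print(decomposed == precomposed)

# Normalizing both sides first makes them compare equal.
print(nfc_equal(decomposed, precomposed))
```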
On 08/06/2014 11:54 AM, Craig Ringer wrote:
> On 08/06/2014 09:14 AM, Tom Lane wrote:
>> We don't actually support "decomposed" utf8; if there is any bug here,
>> it's that the input you show isn't rejected.  But I think there was
>> some intentional choice to not check \u escapes fully.
>
> Combining characters (i.e. decomposed utf-8 form, for chars where there
> is a combined equivalent) are part of utf-8. They're not an optional add-on.

... though we can advertise partial Unicode support, saying that we
support UTF-8 for UCS (ISO 10646-1:2000 Annex D / RFC 3629)
implementation level 1 only, requiring Normalization Form C (NFC) input.

Given that Pg doesn't seem to understand \xf8 or \xfc utf-8 chars, so it
doesn't cover the full utf-8 range, it doesn't look like it meets Level 1
either. So it supports "mostly-utf8".

With level 1 we should really _reject_ combining chars, but can't do
that w/o breaking BC.

I guess I should turn this:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

into a regression test. Possibly also parts of this:

http://www.columbia.edu/~fdc/utf8/

though it's more oriented toward rendering.

It's worth noting that Konsole and Thunderbird had no issues with
combining chars when I was testing this.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

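On the \xf8 and \xfc point, a small Python sketch for comparison only, showing how another strict decoder treats those lead bytes. RFC 3629 restricts UTF-8 to U+0000..U+10FFFF, which means at most four bytes per character, so a decoder following it will not accept 0xF8 or 0xFC as the start of a sequence. The five- and six-byte strings below are hand-built in the older, longer scheme purely for illustration, and `decodes_as_utf8` is a hypothetical helper.

```python
def decodes_as_utf8(raw):
    """Hypothetical helper: True if Python's strict UTF-8 decoder accepts raw."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Highest code point allowed by RFC 3629, encoded in four bytes.
four_byte = "\U0010FFFF".encode("utf-8")

# Sequences starting with the 0xF8 and 0xFC lead bytes (five and six bytes).
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
six_byte = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])

for label, raw in (("4-byte", four_byte), ("5-byte", five_byte), ("6-byte", six_byte)):
    print(label, raw.hex(), decodes_as_utf8(raw))
```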
> On 08/06/2014 09:14 AM, Tom Lane wrote:
>> We don't actually support "decomposed" utf8; if there is any bug here,
>> it's that the input you show isn't rejected.  But I think there was
>> some intentional choice to not check \u escapes fully.
>
> Combining characters (i.e. decomposed utf-8 form, for chars where there
> is a combined equivalent) are part of utf-8. They're not an optional add-on.
>
> So if Pg doesn't support them, it doesn't fully support utf-8. Which is
> fine as far as it goes, but must be documented as a limitation at
> minimum. (I'll deal with that).
>
> It also means that you get fun anomalies like:
>
> regress=> SELECT 'á' = 'á';
>  ?column?
> ----------
>  f
> (1 row)
>
> which is IMO insane.
>
> Not only that, but we can't reject decomposed forms, because they will
> already exist in live installs. That'd break dump and reload of such
> installs and cause exciting problems with pg_upgrade.
>
> The "we'll just reject part of utf-8" opportunity has flown. It needs to
> be documented as a bug in existing versions, and I guess given that I'm
> the one complaining I get to see if I can find a sane fix for 9.5...

I'm not sure what you mean by decomposed utf8 because there's no such
thing in the Unicode standard. Maybe you mean "composite character"
or "precomposed character"?

Anyway, in my understanding, to handle composite characters we should do
"Unicode normalization" in the first place. There are 4 types of
normalization:

NFD (Normalization Form Canonical Decomposition)
NFC (Normalization Form Canonical Composition)
NFKD (Normalization Form Compatibility Decomposition)
NFKC (Normalization Form Compatibility Composition)

I don't know how we could implement one of these without major
performance degradation.

Also some composite characters can be decomposed but, after being
composed again, they do not return to the original form of composite
characters (round trip conversion is impossible). Such characters are
called "Composition Exclusion" (see
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt).
I have no idea how to deal with the issue.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
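To make the four Unicode normalization forms (NFD, NFC, NFKD, NFKC) and the composition-exclusion problem concrete, here is a small Python sketch using the standard library's `unicodedata` module. The helper names `codepoints` and `show_forms` are made up for illustration, and the three sample characters are my own picks; U+0958 is one of the entries in CompositionExclusions.txt.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def codepoints(text):
    """Hypothetical helper: render a string as space-separated U+XXXX values."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


def show_forms(text):
    """Hypothetical helper: map each normalization form to its code points."""
    return {form: codepoints(unicodedata.normalize(form, text)) for form in FORMS}


# U+00E1 round-trips: the decomposing forms split it, the composing forms
# put it back together.
print(show_forms("\u00e1"))

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition,
# so only NFKD and NFKC change it.
print(show_forms("\ufb01"))

# U+0958 DEVANAGARI LETTER QA is a composition exclusion: it decomposes
# to U+0915 U+093C, and NFC does not compose it back.
print(show_forms("\u0958"))
```

The last case is the round-trip problem in miniature: normalizing stored text to NFC can change a string even though it started out as a single precomposed character.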