Discussion: BUG #13785: PostgreSQL encoding screw-up
The following bug has been logged on the website:

Bug reference: 13785
Logged by: ntpt
Email address: ntpt@seznam.cz
PostgreSQL version: 9.4.5
Operating system: linux

Description:

Hi all, I have used PostgreSQL for several years, but after a recent hardware upgrade I ran into a problem:

pg_exec(): Query failed: ERROR: character with byte sequence 0xc2 0x96 in encoding "UTF8" has no equivalent in encoding "WIN1250"

This is strange. The sequence of events was:

1. There was a database with LATIN2 encoding.
2. An application connected to this database with "set client_encoding to win1250" and manipulated data.
3. The database was dumped with pg_dump -E UTF8.
4. The dump was restored with pg_restore into a UTF8-encoded database on another cluster.
5. The application connected to the new database with "set client_encoding to win1250", and the query failed.

How could invalid characters reach the database in this scenario? See the whole thread here:
http://postgresql.nabble.com/Query-failed-ERROR-character-with-byte-sequence-0xc2-0x96-in-encoding-quot-UTF8-quot-has-no-equivale-td5875048.html

I examined this situation, and I am afraid there is a major design flaw or bug that can screw up a lot of databases. Please look at https://cs.wikipedia.org/wiki/Windows-1250 (in Czech only); the comparison table "Srovnání s ISO 8859-2" (comparison with ISO 8859-2) is self-explanatory.

Consider a situation where the database has LATIN2 encoding but the client sets client_encoding to WIN1250 and manipulates data. Because some characters in CP1250 (WIN1250) are not present in the LATIN2 encoding, PostgreSQL stores such characters "as is", i.e. without charset translation. For example, 0x96 (EN DASH in WIN1250) is stored as the byte 0x96 (undefined in LATIN2), at least in version 9.0. If you read from that database with client_encoding set to WIN1250, these "as is" characters are returned "as is", so from the client application's point of view everything works as expected. But...
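The byte-level mismatch described above can be illustrated with Python's codec tables (a sketch; Python's iso8859-2 codec, like the conversion described in this thread, maps the 0x80-0x9F range to C1 control characters):

```python
raw = b"\x96"  # the byte a WIN1250 client sends for an en dash

# Interpreted in the client's actual encoding, it is EN DASH:
assert raw.decode("cp1250") == "\u2013"

# Interpreted as LATIN2 (the database encoding), the same byte
# maps to the C1 control character U+0096 instead:
assert raw.decode("iso8859-2") == "\u0096"

# U+0096 encoded in UTF-8 is exactly the byte sequence from the
# error message above: 0xc2 0x96
assert "\u0096".encode("utf-8") == b"\xc2\x96"

print("byte 0x96 is U+%04X in WIN1250 but U+%04X in LATIN2"
      % (ord(raw.decode("cp1250")), ord(raw.decode("iso8859-2"))))
```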
If you need to migrate and transcode the database to UTF8 the recommended way (pg_dump -E utf8), things go weird. Because there is no character 0x96 in LATIN2, the transcoder to UTF8 does not know how to treat this character, so it leaves it "as is", producing the character U+0096 in the output. But if you look at the table "Mapování do Unikódu" (mapping to Unicode) on the wiki page mentioned above, the proper Unicode code point for this character in WIN1250 is U+2013. (And PostgreSQL cannot know that, because it does not know which code page these stored "as is" characters came from.)

So if you now restore the dump into a new UTF8 database, some characters will carry "unicodized" raw values: characters that were not representable in the original database encoding, but were representable in the original client encoding. The result is the error described in my mailing-list post. So everybody who needs to migrate to UTF8, and whose original database encoding has no representation for some characters in the client encodings, ends up like this: with a working, but screwed-up and non-transferable database.

I think the safe practice would be: pg_dump with -E set to the encoding used by the client application, then restore into a freshly created UTF8 database. This should at least be mentioned in the docs as the safe way. But in an environment where clients use multiple encodings that have more characters than the database encoding, the database is screwed forever: it will keep working, but it cannot be repaired and migrated to another encoding. Correct me if I am wrong...

PS: Excuse my bad English.
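The two migration paths can be sketched in Python terms (a sketch of the byte interpretation only, not of pg_dump itself):

```python
raw = b"\x96"  # byte stored "as is" in the LATIN2 database

# Wrong path (pg_dump -E UTF8 against the LATIN2 database):
# the byte is interpreted as LATIN2, yielding the useless
# C1 control character U+0096 in the dump.
wrong = raw.decode("iso8859-2")
assert wrong == "\u0096"

# Safe path (pg_dump -E WIN1250, the encoding the client actually
# used): the byte comes out as its true character, which the new
# UTF8 database can then store correctly.
right = raw.decode("cp1250")
assert right == "\u2013"                       # EN DASH
assert right.encode("utf-8") == b"\xe2\x80\x93"
```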
Hi,

> there is a major design flaw or bug

I feel your pain, but how is this a bug? Once a character that cannot be mapped to LATIN2 has been stored, no information about the source encoding (WIN1250) of that character is available anymore. Any client connecting (whether your application or pg_dump) will get that character "as is". I don't see a general way around this, other than rejecting characters that do not fit into the target character set.

> where client use multiple encodings that have more characters then database
> encoding, the database is screwed forever

The set of allowed conversions from LATIN2 to other encodings is quite limited (MULE_INTERNAL, UTF8, WIN1250), see:
http://www.postgresql.org/docs/9.4/static/multibyte.html#AEN35768

If the clients using different encodings all touch the same data, the data is already dirty; the migration only brings that to light. If the clients all touch different parts of the data, the data can be migrated safely by exporting each distinct part in its correct encoding and then importing it with that encoding into the target database with UTF8 encoding.

> I think that safe practice would be: pg_dump with -E as used by client
> application and then restore to newly created utf8 database. It should be
> mentioned as safe way in the doc, at least

This looks safe to me: you export the unknown-character data in its original encoding, thereby making it known again. If you then import it into UTF8, it will be encoded correctly, because both the source (WIN1250) and the target (UTF8) can encode these characters.

regards,

Feike Steenbergen
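Feike's last point, that both the source and target encodings can represent the character while LATIN2 cannot, can be checked directly against Python's codec tables (a sketch):

```python
ch = "\u2013"  # EN DASH

# Both the source (WIN1250) and the target (UTF8) can encode it:
assert ch.encode("cp1250") == b"\x96"
assert ch.encode("utf-8") == b"\xe2\x80\x93"

# LATIN2 cannot, which is why storing it in a LATIN2 database
# lost the encoding information in the first place:
try:
    ch.encode("iso8859-2")
    raise AssertionError("LATIN2 unexpectedly encoded EN DASH")
except UnicodeEncodeError:
    pass  # expected: EN DASH has no LATIN2 representation
```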
On 2015-11-26 12:15:58 +0000, ntpt@seznam.cz wrote:

> Because there is no character 0x96 in latin2, transcoder to utf8 does not
> know the recipe how treat this character - and leave it "as is" producing
> \u0096 character in output.

Actually, it does know. While the ISO-8859-2 standard[1] only defines the printable characters, those are commonly combined with the control characters from ISO 6429, which does define a control code 0x96 (SPA). The Unicode standard also defines two blocks of control characters, and they have the same code points: 0x0D (CR) is translated to U+000D, 0x1B (ESC) to U+001B, and 0x96 (SPA) to U+0096.

So the "problem" here isn't that PostgreSQL doesn't know how to translate an ISO-8859-2 0x96 into Unicode (if that were the case, it could reject it, forcing the user to fix the client configuration), but that it does know how to convert it and therefore does so, even though the result is almost certainly wrong (when did you ever need an SPA character?).

hp

PS: I would write a script which fixes the wrong characters in situ. That takes a bit of scripting, but:

* You probably can't guarantee that all your clients are fixed, so the problem may crop up again.
* So you want to be able to find out when that happens and fix it again without taking the db down.
* A script like that can be expanded to fix other encoding errors, too (e.g. UTF-8 double-encoding, ISO-8859-2 vs. ISO-8859-1, ...).

[1] I actually looked at ECMA-94, but they should be identical.

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/
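The core transformation of the in-situ fix Peter describes could look like this (a minimal sketch, assuming the damaged UTF8 data contains C1 control characters that were really WIN1250 characters; the function name is made up for illustration, and a real script would also have to handle the few C1 bytes that are undefined in WIN1250):

```python
def fix_win1250_mojibake(s: str) -> str:
    """Re-interpret C1 control characters (U+0080-U+009F) that were
    stored "as is" by a WIN1250 client: encode each back to its raw
    byte via LATIN2, then decode that byte as WIN1250."""
    out = []
    for ch in s:
        if "\x80" <= ch <= "\x9f":
            # Note: 0x81, 0x83, 0x88, 0x90, 0x98 are undefined in
            # WIN1250 and would raise UnicodeDecodeError here.
            out.append(ch.encode("iso8859-2").decode("cp1250"))
        else:
            out.append(ch)
    return "".join(out)

# U+0096 (the bad control char) becomes U+2013 (EN DASH):
assert fix_win1250_mojibake("10\u009615") == "10\u201315"
```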