Discussion: Why is an ISO-8859-8 database allowing values not within that set?


Why is an ISO-8859-8 database allowing values not within that set?

From:
"Herouth Maoz"
Date:

I am using PostgreSQL 8.3.14 on our reporting system. There are scripts that collect data from many databases across the firm into this database. Recently I added tables from a particular database which has encoding UTF-8. My dump procedure says

\encoding ISO-8859-8
\copy ( SELECT ... ) to file

And this fails at a certain row because that row contains Arabic text and it cannot be mapped into ISO-8859-8 (which is 8 bit Hebrew).
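[Editor's note: the failure is easy to reproduce outside psql with any codec library. A minimal sketch in Python; the sample Arabic string is made up, but any Arabic text behaves the same:]

```python
# Arabic text round-trips through UTF-8 without trouble...
text = "مرحبا"  # hypothetical sample string
assert text.encode("utf-8").decode("utf-8") == text

# ...but ISO-8859-8 is 8-bit Hebrew and has no Arabic code points,
# so encoding raises, just as \copy with \encoding ISO-8859-8 fails.
try:
    text.encode("iso-8859-8")
    raise AssertionError("should not have encoded")
except UnicodeEncodeError as e:
    print("cannot encode:", e.reason)
```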

This is expected behavior, but I was puzzled: when I tested the same setup manually, it all worked well.

Turns out that when I did it manually, I did not specify the output encoding. I did the \copy straight. So the file was in UTF-8.

But this puzzles me, because I then took the file, ran psql and \copy <table> from file. And it worked. I tried it again now, and I can see the row with its Arabic content, even though it is not in the database encoding.

I checked \encoding. It replies
ISO_8859_8
but it then happily gives me the Arabic row when I select it.

What's happening here? Why does the database accept input in the wrong encoding and doesn't shout when I then try to select that input?


Secondly, suppose I want to get pure ISO-8859-8 output for now, and replace every incompatible character within the select statement into '*' or whatever. Is there any function that will help me detect such characters? Can I tell the psql conversion function to ignore bad characters?

Thank you,
Herouth

Re: Why is an ISO-8859-8 database allowing values not within that set?

From:
Craig Ringer
Date:
On 07/21/2012 04:59 PM, Herouth Maoz wrote:

I am using Postgresql 8.3.14 on our reporting system. There are scripts that collect data from many databases across the firm into this database. Recently I added tables from a particular database which has encoding UTF-8.
First, I know there have been encoding and UTF-8 handling fixes since 8.3. It'd be interesting to see if this still happens on a more recent version.

You're also missing five bug-fix point-releases in the 8.3 series, as the latest is 8.3.19. See:
  http://www.postgresql.org/docs/8.3/static/release.html
for fixes you're missing.

Explanation for what I think is going on below:


But this puzzles me, because I then took the file

... which was created with a \copy with client encoding set to utf-8, right?

ran psql and \copy <table> from file


With which client encoding set? UTF-8 or ISO_8859_8? I bet you copied it in with ISO_8859_8.

And it worked. I tried it again now, and I can see the row with its Arabic content, even though it is not in the database encoding.

It shows up correctly?

If you \copy a dump in utf-8, then \copy it back in with ISO_8859_8, it should be mangled.

If you set your client_encoding to utf-8 ("\encoding utf-8"), does it still show up correctly? I suspect it's wrong in the database and you're just unmangling it on display.

It would help if you would actually show the bytes of:

- The chars in the \copy dump, using `xxd' or similar
- The chars in the database before the copy out and copy in, using a CAST to `bytea`
- The chars in the database AFTER the copy out and copy in, again with a CAST to `bytea`

... as well as the database encoding, NOT just the client encoding (see below):

I checked \encoding. It replies
ISO_8859_8

That is the client encoding.

Try:

   \l+

to list databases. You'll see the database encoding there. The same info is available from:

  SELECT datname, encoding from pg_database WHERE datname = 'mydatabase';


Maybe this demo will help enlighten you.

regress=# select version();
                                                   version                                                  
-------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.1.4 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5), 64-bit
(1 row)

regress=# \l+
                                                                           List of databases
        Name        |   Owner    | Encoding |   Collate   |    Ctype    |     Access privileges     |  Size   | Tablespace |                Description                
--------------------+------------+----------+-------------+-------------+---------------------------+---------+------------+--------------------------------------------
 ....
 regress            | craig      | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                           | 41 MB   | pg_default |

regress=# CREATE TABLE enctest (a serial, x text);
CREATE TABLE
regress=# -- Some text randomly pulled off Google News taiwan, since it was convenient and the exact text doesn't matter
regress=# insert into enctest (x) values ('心情已平復很多」。 中國網絡電視台報導');
INSERT 0 1
regress=# \encoding
UTF8
regress=# \copy ( select x from enctest ) to enctest.csv


Set a 1-byte non-UTF-8 client encoding; it doesn't really matter which one. Then import the data we dumped as utf-8.

regress=# \encoding latin-1
regress=# \copy enctest(x) from enctest.csv

enctest now contains two rows. One is the correctly encoded original, one is the dumped and reloaded one.
We can't view the whole table while we're in latin-1 encoding because the correct row won't translate right.

regress=# select * from enctest;
ERROR:  character 0xe5bf83 of encoding "UTF8" has no equivalent in "LATIN1"

but we *CAN* view the second row we dumped as utf-8 then imported as latin-1:

regress=# select * from enctest where a = 2;
 a |                            x                           
---+---------------------------------------------------------
 2 | 心情已平復很多」。 中國網絡電視台報導
(1 row)



At this point you're probably thinking "WTF!?!".

It shows up correctly in my terminal because my terminal is utf-8, irrespective of the encoding set in psql. Setting a non-utf-8 encoding in psql via "\encoding" just lies to psql about the encoding of the bytes I paste in on my terminal. It receives a byte sequence that could be valid latin-1; though it's actually nonsense garbage, psql can't tell the difference. It trusts me and translates the "latin-1" I sent into utf-8 for storage.

When I ask for the data back while in a latin-1 client encoding, the server converts that utf-8 back into latin-1, or so it thinks. It's actually demangling mangled utf-8 so my console can display it.
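[Editor's note: that lie-and-demangle round trip can be reproduced outside PostgreSQL. A minimal sketch in plain Python, with latin-1 standing in for any 1-byte client encoding:]

```python
original = "心情"                  # what the client actually typed (UTF-8 terminal)
wire = original.encode("utf-8")    # the bytes psql sends to the server

# The server is told the client encoding is latin-1, so it "converts"
# those UTF-8 bytes from latin-1 to UTF-8 for storage: mojibake.
stored = wire.decode("latin-1").encode("utf-8")
assert stored != wire              # what's in the table is garbage

# On SELECT with client_encoding still latin-1, the server converts back,
# exactly undoing the mangling, so a UTF-8 terminal displays it "correctly".
displayed = stored.decode("utf-8").encode("latin-1")
assert displayed == wire
```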

We can understand this better if we examine what's actually in the database.

regress=# \encoding utf-8
regress=# select * from enctest;
 a |                                                                 x                                                                 
---+------------------------------------------------------------------------------------------------------------------------------------
 1 | 心情已平復很多」。 中國網絡電視台報導
 2 | å¿\u0083æ\u0083\u0085已平復å¾\u0088å¤\u009Aã\u0080\u008Dã\u0080\u0082 中å\u009C\u008B網絡é\u009B»è¦\u0096å\u008F°å ±å°\u008E
(2 rows)


regress=# select a, x::bytea from enctest;
 a |                                                                                                              x                                                            
                                                 
---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
 1 | \xe5bf83e68385e5b7b2e5b9b3e5bea9e5be88e5a49ae3808de3808220e4b8ade59c8be7b6b2e7b5a1e99bbbe8a696e58fb0e5a0b1e5b08e
 2 | \xc3a5c2bfc283c3a6c283c285c3a5c2b7c2b2c3a5c2b9c2b3c3a5c2bec2a9c3a5c2bec288c3a5c2a4c29ac3a3c280c28dc3a3c280c28220c3a4c2b8c2adc3a5c29cc28bc3a7c2b6c2b2c3a7c2b5c2a1c3a9c29bc2b
bc3a8c2a6c296c3a5c28fc2b0c3a5c2a0c2b1c3a5c2b0c28e
(2 rows)


This should help. See how the first row shows up correctly when we're in the right client encoding, but the second one is gibberish? That's what UTF-8 that's been interpreted as latin-1 and "converted" into utf-8 looks like.

It's even more informative if I start a terminal in latin-1 and then set client_encoding to latin-1, so I'm not lying to psql about my client encoding.

$ LANG=en_AU.iso88591 LC_ALL=en_AU.iso88591 xterm
$ psql regress
regress=# \encoding
LATIN-1
regress=# select x from enctest where a = 2;
                            x                           
---------------------------------------------------------
 å¿æ          ã 中å網絡é¦å°å ±å°
(1 row)¾©å¾å¤ã


Note that latin-1 is a synonym for iso-8859-1.

Now you can see that the data in the DB is actually mangled. It's just that when you accidentally lie to Pg the same way in the input and output phases as I've shown, it *looks* ok though the byte sequence in the database is garbage.

Here's yet another way to illustrate it. Let's take the first char of our input and see what data is produced when we convert its utf-8 bytes, using a conversion function, from latin-1 into utf-8.

regress=# select '心'::bytea, convert( '心'::bytea, 'latin-1', 'utf-8') from enctest;
  bytea   |    convert    
----------+----------------
 \xe5bf83 | \xc3a5c2bfc283
(1 row)


Look familiar from the example above? See how the valid utf-8 sequence in the first col gets converted into garbage in the second col? Yet we can reverse the incorrect conversion to get valid utf-8 again:


regress=# select convert( BYTEA '\xc3a5c2bfc283', 'utf-8', 'latin-1');
 convert 
----------
 \xe5bf83
(1 row)


... which is probably what you've been doing.
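[Editor's note: the same pair of conversions can be checked byte-for-byte in Python; the values match the `convert()` output above exactly:]

```python
ch = "心"
utf8 = ch.encode("utf-8")
assert utf8 == b"\xe5\xbf\x83"     # matches '心'::bytea above

# convert(..., 'latin-1', 'utf-8'): treat the UTF-8 bytes as latin-1 text
# and re-encode that text as UTF-8 -- producing the mojibake bytes.
mangled = utf8.decode("latin-1").encode("utf-8")
assert mangled == b"\xc3\xa5\xc2\xbf\xc2\x83"   # matches the convert() result

# convert(..., 'utf-8', 'latin-1') reverses the damage.
assert mangled.decode("utf-8").encode("latin-1") == utf8
```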


What's happening here? Why does the database accept input in the wrong encoding and doesn't shout when I then try to select that input?

Correct. Legacy 1-byte encodings cannot be verified: every possible byte value is a "valid" character, so there's no way to say "Yup, this is latin-1" or "Yup, this is ISO-8859-8".

Don't set client_encoding unless you REALLY know encodings. If you do set it for \copy, make sure you always \copy in with the same encoding you used for the \copy out. PostgreSQL cannot protect you from this.
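[Editor's note: the asymmetry is easy to demonstrate directly. Latin-1 serves as the 1-byte example here because all 256 of its byte values are assigned; UTF-8's multi-byte structure is what makes invalid input detectable:]

```python
data = bytes(range(256))

# Every possible byte sequence decodes as latin-1: nothing to verify.
assert len(data.decode("latin-1")) == 256   # never raises

# UTF-8 is structured, so invalid sequences ARE detectable.
try:
    b"\xe5\xbf".decode("utf-8")             # truncated multi-byte sequence
    raise AssertionError("should not have decoded")
except UnicodeDecodeError:
    pass
```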

--
Craig Ringer

Re: Why is an ISO-8859-8 database allowing values not within that set?

From:
Herouth Maoz
Date:
Thanks. That makes sense. The default client encoding on the reports database is ISO-8859-8, so I guess when I don't set it using \encoding, it does exactly what you say.

OK, so I'm still looking for a way to convert illegal characters into something that won't collide with my encoding (asterisks or whatever).

Thank you,
Herouth


On 21/07/2012, at 15:36, Craig Ringer wrote:




--
Herouth Maoz
Unicell Advanced Cellular Solutions
☎ 03-5181717 ext. 742

Re: Why is an ISO-8859-8 database allowing values not within that set?

From:
Craig Ringer
Date:
On 07/22/2012 03:58 PM, Herouth Maoz wrote:
Thanks. That makes sense. The default client encoding on the reports database is ISO-8859-8, so I guess when I don't set it using \encoding, it does exactly what you say.

OK, so I'm still looking for a way to convert illegal characters into something that won't collide with my encoding (asterisks or whatever).


As far as I know, PostgreSQL's encoding handling functions do not offer substitution for unsupported characters, nor does the built-in client<->server charset translation feature. You could do it with a regular expression replacement of any character not in a class that contains every char valid in the target encoding. It feels like a very clunky approach though.

An alternative is to use a procedural language that DOES support lossy character encoding conversions. I don't think plpython does, and plpgsql certainly doesn't if PostgreSQL itself doesn't. I'd be amazed if plperl didn't support lossy conversions, but I haven't done much with Perl in years.

It'd be handy if Pg's client<->server conversion supported lossy conversions for this kind of thing. Honestly I'm not sad it doesn't, because it'd be something people would misuse to make the error messages they didn't understand go away - then come back and complain that PostgreSQL ate their data later.

--
Craig Ringer