Обсуждение: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

Поиск
Список
Период
Сортировка

COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Steven Schlansker
Дата:
Hello fine PostgreSQL bug-busters,

I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte,
causingdata corruption in backups and transferred data. 

First, the environment -

 PostgreSQL 8.4.4 on i386-apple-darwin10.3.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc.
build5646) (dot 1), 64-bit 

Mac OS X 10.6.4

[steven@xxx:~]% psql --version
psql (PostgreSQL) 8.4.4
contains support for command-line editing

Now, the setup:
          Name           |       Owner        | Encoding |  Collation  |    Ctype    |   Access privileges   |  Size
|Tablespace |        Description       
baddb                    | xxxxxxx_production | UTF8     | en_US.utf-8 | en_US.utf-8 |                       | 207 MB
|pg_default |  

baddb=> create table badtable (a int, b int, c character varying, d character varying, e character varying, f character
varying[],g text, h character varying[],i character varying[], j character varying[], k character varying[], l
charactervarying[], m character varying[], n character varying[],o character varying, p character varying); 

baddb=> \copy badtable from '/tmp/data.copy'
baddb=> \copy badtable to '/tmp/badness.copy'
baddb=> \copy badtable from '/tmp/badness.copy'
ERROR:  invalid byte sequence for encoding "UTF8": 0xcf2c
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
CONTEXT:  COPY badtable, line 1

Obviously, this wouldn't be too helpful without the datafile in question:

1       2377510 FOURSQUARE      1403504 Pizza Hut       {}      \N      {}      {}      {}      {Pizza}
{πίτσα,hut,food,ζωγράφου,pizza,eat,zografou}   {}      \N      \N  \N 

Since this is likely to be eaten by various mail clients or lost in translation, please find attached a TGZ of the data
filein question. 



The hexdump shows exactly what goes wrong:

[steven@xxx:/tmp]% hexdump -C data.copy
00000000  31 09 32 33 37 37 35 31  30 09 46 4f 55 52 53 51  |1.2377510.FOURSQ|
00000010  55 41 52 45 09 31 34 30  33 35 30 34 09 50 69 7a  |UARE.1403504.Piz|
00000020  7a 61 20 48 75 74 09 7b  7d 09 5c 4e 09 7b 7d 09  |za Hut.{}.\N.{}.|
00000030  7b 7d 09 7b 7d 09 7b 50  69 7a 7a 61 7d 09 7b cf  |{}.{}.{Pizza}.{.|
00000040  80 ce af cf 84 cf 83 ce  b1 2c 68 75 74 2c 66 6f  |.........,hut,fo|
00000050  6f 64 2c ce b6 cf 89 ce  b3 cf 81 ce ac cf 86 ce  |od,.............|
00000060  bf cf 85 2c 70 69 7a 7a  61 2c 65 61 74 2c 7a 6f  |...,pizza,eat,zo|
00000070  67 72 61 66 6f 75 7d 09  7b 7d 09 5c 4e 09 5c 4e  |grafou}.{}.\N.\N|
00000080  09 5c 4e 0a                                       |.\N.|
00000084
[steven@xxx:/tmp]% hexdump -C badness.out
00000000  31 09 32 33 37 37 35 31  30 09 46 4f 55 52 53 51  |1.2377510.FOURSQ|
00000010  55 41 52 45 09 31 34 30  33 35 30 34 09 50 69 7a  |UARE.1403504.Piz|
00000020  7a 61 20 48 75 74 09 7b  7d 09 5c 4e 09 7b 7d 09  |za Hut.{}.\N.{}.|
00000030  7b 7d 09 7b 7d 09 7b 50  69 7a 7a 61 7d 09 7b cf  |{}.{}.{Pizza}.{.|
00000040  80 ce af cf 84 cf 83 ce  b1 2c 68 75 74 2c 66 6f  |.........,hut,fo|
00000050  6f 64 2c ce b6 cf 89 ce  b3 cf 81 ce ac cf 86 ce  |od,.............|
00000060  bf cf 2c 70 69 7a 7a 61  2c 65 61 74 2c 7a 6f 67  |..,pizza,eat,zog|
00000070  72 61 66 6f 75 7d 09 7b  7d 09 5c 4e 09 5c 4e 09  |rafou}.{}.\N.\N.|
00000080  5c 4e 0a                                          |\N.|


Note offset 0x62:
00000060  bf cf 85 2c 70 69 7a 7a  61 2c 65 61 74 2c 7a 6f  |...,pizza,eat,zo|
00000060  bf cf 2c 70 69 7a 7a 61  2c 65 61 74 2c 7a 6f 67  |..,pizza,eat,zog|

The 0xCF85 multibyte UTF-8 character was truncated to 0xCF!  When I try to reimport it, it goes in as CF2C (the 2C is
thefollowing comma) which is not valid and matches the error printed by the client. 

I can reproduce this problem using the JDBC driver as well, so I do not think it is a PSQL bug but instead a PostgreSQL
backendproblem.  I could be wrong, of course. 

Interestingly enough, this problem *does not* happen on a different machine -
 PostgreSQL 8.4.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit
(stock Ubuntu package)

I can reliably reproduce it on two different Macs, though.

As with all random email posters, I'm not actually subscribed to the list - please keep me on the CC for replies.

I would very much appreciate help tracking this down!  Thanks for your time :)

Steven Schlansker


Вложения

Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Tom Lane
Дата:
Steven Schlansker <steven@trumpet.io> writes:
> I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte,
causingdata corruption in backups and transferred data.
 

I was able to reproduce this on my own Mac.  Some tracing shows that the
problem is that isspace(0x85) returns true when in locale en_US.utf-8.
This causes array_in to drop the final byte of the array element string,
thinking that it's insignificant whitespace.  I believe that you must
not have produced the data file data.copy on a Mac, or at least not in
that locale setting, because array_out should have double-quoted the
array element given that behavior of isspace().

Now, it's probably less than sane for isspace() to be behaving that way,
since in a UTF8-based locale 0x85 can't be regarded as a valid character
code at all.  But I'm not hopeful about the results of filing a bug with
Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious
behaviors too, which they appear not to care much about.

In any case, stepping back and taking a larger viewpoint, it seems unsafe
for array_in/array_out to have any locale-sensitive behavior anyhow.
As an example, in a LATINx locale it is entirely sane for isspace() to
return true for 0xA0, while it should certainly not do so in C locale.
This means we are at risk of data corruption, ie dropping a valid data
character, when an array value starting or ending with 0xA0 is dumped
from a C-locale database and loaded into a LATINx-locale one.

So it seems like the safest answer is to modify array_in/array_out to
use an ASCII-only definition of isspace().  I believe this is
traditionally defined as space, tab, CR, LF, VT, FF.  We could perhaps
trim that further, like just space and tab, but there might be some risk
of breaking client code that expects the other traditional whitespace to
be ignored.

I'm not sure if there are any other places with similar risks.
hstore's I/O routines contain isspace calls, but I haven't
analyzed the implications.  There is an isspace call in record_out
but it is just there for cosmetic purposes and doesn't protect
any decisions in record_in, so I think it's okay if it makes
platform/locale-dependent choices.

Comments?
        regards, tom lane


Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Tom Lane
Дата:
Steven Schlansker <steven@trumpet.io> writes:
> On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
>> I was able to reproduce this on my own Mac.  Some tracing shows that the
>> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
>> This causes array_in to drop the final byte of the array element string,
>> thinking that it's insignificant whitespace.

> The 0x85 seems to be the second byte of a multibyte UTF-8
> sequence.

Check.

> I'm not at all experienced with character encodings so I could
> be totally off base, but isn't it wrong to ever call isspace(0x85), 
> whatever the result may be, given that the actual character is 0xCF85?
> (U+03C5, GREEK SMALL LETTER UPSILON)

We generally assume that in server-safe encodings, the ctype.h functions
will behave sanely on any single-byte value.  You can argue the wisdom
of that, but deciding to change that policy would be a rather massive
code change; I'm not excited about going that direction.

>> I believe that you must
>> not have produced the data file data.copy on a Mac, or at least not in
>> that locale setting, because array_out should have double-quoted the
>> array element given that behavior of isspace().

> Correct, it was produced on a Linux machine.  That said, the charset
> there was also UTF-8.

Right ... but you had an isspace function that meets our expectations.

> I actually can't reproduce that behavior here:

You need a setlocale() call, else the program acts as though it's in C
locale regardless of environment.  My test case looks like this:

$ cat isspace.c
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

int main()
{ int c;
 setlocale(LC_ALL, "");
 for (c = 1; c < 256; c++)   {     if (isspace(c))       printf("%3o is space\n", c);   }
 return 0;
}
$ gcc -O -Wall isspace.c
$ LANG=C ./a.out11 is space12 is space13 is space14 is space15 is space40 is space
$ LANG=en_US.utf-8 ./a.out11 is space12 is space13 is space14 is space15 is space40 is space
205 is space
240 is space
$ 
        regards, tom lane


Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Tatsuo Ishii
Дата:
> We generally assume that in server-safe encodings, the ctype.h functions
> will behave sanely on any single-byte value.

I think this "wisedom" is only true for C locale.  I'm not surprised
all that it does not work with non C locales.

From array_funcs.c:
    while (isspace((unsigned char) *p))        p++;

IMO this should be something like:
    while (isspace((unsigned char) *p))        p += pg_mblen(p);
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Steven Schlansker
Дата:
On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:

> Steven Schlansker <steven@trumpet.io> writes:
>> I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a
byte,causing data corruption in backups and transferred data. 
>
> I was able to reproduce this on my own Mac.  Some tracing shows that the
> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
> This causes array_in to drop the final byte of the array element string,
> thinking that it's insignificant whitespace.

The 0x85 seems to be the second byte of a multibyte UTF-8
sequence.  I'm not at all experienced with character encodings so I could
be totally off base, but isn't it wrong to ever call isspace(0x85),
whatever the result may be, given that the actual character is 0xCF85?
(U+03C5, GREEK SMALL LETTER UPSILON)


>  I believe that you must
> not have produced the data file data.copy on a Mac, or at least not in
> that locale setting, because array_out should have double-quoted the
> array element given that behavior of isspace().

Correct, it was produced on a Linux machine.  That said, the charset
there was also UTF-8.

>
> Now, it's probably less than sane for isspace() to be behaving that way,
> since in a UTF8-based locale 0x85 can't be regarded as a valid character
> code at all.  But I'm not hopeful about the results of filing a bug with
> Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious
> behaviors too, which they appear not to care much about.

I actually can't reproduce that behavior here:

#include <ctype.h>
#include <stdio.h>
int main() {   printf("%d\n", isspace(0x85));   return 0;
}

[steven@xxx:~]% gcc -o test test.c
[steven@xxx:~]% ./test
0
[steven@xxx:~]% locale
LANG="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_CTYPE="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_ALL=
[steven@xxx:~]% uname -a
Darwin xxx.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386
i386i386 


Thanks much for your help,
Steven Schlansker



Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Steven Schlansker
Дата:
On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
> Steven Schlansker <steven@trumpet.io> writes:
>> 
>> I'm not at all experienced with character encodings so I could
>> be totally off base, but isn't it wrong to ever call isspace(0x85), 
>> whatever the result may be, given that the actual character is 0xCF85?
>> (U+03C5, GREEK SMALL LETTER UPSILON)
> 
> We generally assume that in server-safe encodings, the ctype.h functions
> will behave sanely on any single-byte value.  You can argue the wisdom
> of that, but deciding to change that policy would be a rather massive
> code change; I'm not excited about going that direction.

Fair enough.  I presume there are no "server-safe encodings" for which
a multibyte sequence 0x XX20 would be valid - which would break anyway
(as the second byte looks like a real space)

> You need a setlocale() call, else the program acts as though it's in C
> locale regardless of environment.

Sigh.  I hate C sometimes. :-p

Anyway, it looks like this is actually a BSD bug which got copy +
pasted into Apple's Darwin source -

http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

I have a couple of contacts at Apple so I'll see if there's any interest in
backporting a fix, but I wouldn't hope for it to happen quickly if at all...

Thanks for taking a look into fixing this, I hope you guys can reach
consensus on how to get it fixed :)

Best,
Steven Schlansker


Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> We generally assume that in server-safe encodings, the ctype.h functions
>> will behave sanely on any single-byte value.

> I think this "wisedom" is only true for C locale.  I'm not surprised
> all that it does not work with non C locales.

> From array_funcs.c:

>         while (isspace((unsigned char) *p))
>             p++;

> IMO this should be something like:

>         while (isspace((unsigned char) *p))
>             p += pg_mblen(p);

I don't think that's likely to help at all.  The risk is that isspace
will do something not-sane with a fragment of a character.  If it's not
coded to guard against that, it's just as likely to give wrong results
for the leading byte as for non-leading bytes.  (In the case at hand,
I think the underlying problem is that it imagines what it's given is
a Unicode code point, not a byte of a UTF8 string.  There apparently
aren't any code points in the range U+00C0 - U+00FF for which isspace
is true, but that's not true for isalpha for example.)

If we were going to try to code around this, we'd need to change all
these loops to look something like
    while ((isascii((unsigned char) *p) ||            pg_database_encoding_max_length() == 1) &&
isspace((unsignedchar) *p))        p += pg_mblen(p);    // or p++, it wouldn't matter
 

However, given the limited number of platforms where this is an issue
and the fact that it is an acknowledged bug on those platforms,
I'm not eager to go there.

In any case, no matter whether we changed that or not, we'd still have
the problem that it's a bad idea to have any locale-dependent behavior
in array_in; and the behavior *would* still be locale-dependent, at
least in single-byte encodings.
        regards, tom lane


Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Tom Lane
Дата:
Steven Schlansker <steven@trumpet.io> writes:
> On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
>> We generally assume that in server-safe encodings, the ctype.h functions
>> will behave sanely on any single-byte value.  You can argue the wisdom
>> of that, but deciding to change that policy would be a rather massive
>> code change; I'm not excited about going that direction.

> Fair enough.  I presume there are no "server-safe encodings" for which
> a multibyte sequence 0x XX20 would be valid - which would break anyway
> (as the second byte looks like a real space)

Right: our definition of a "server-safe encoding" is precisely that no
byte of a multibyte character looks like ASCII, ie all bytes have their
high bit set.  We're essentially assuming that the <ctype.h> functions
will all return false for any byte with the high bit set, if the
selected encoding is multibyte.

> Anyway, it looks like this is actually a BSD bug which got copy +
> pasted into Apple's Darwin source -
> http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

Interesting.  So the BSD people did fix it upstream?
        regards, tom lane


Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

От
Tom Lane
Дата:
Steven Schlansker <steven@trumpet.io> writes:
> Anyway, it looks like this is actually a BSD bug which got copy +
> pasted into Apple's Darwin source -
> http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

I've applied a patch for this to HEAD & 9.0:
http://archives.postgresql.org/pgsql-committers/2010-08/msg00273.php

I'm reluctant to back-patch it into already-released branches, though.
Given the lack of prior reports, the odds of breaking something for
somebody in a minor release seem to outweigh the odds of doing good.
But you could easily drop it into a local copy of 8.4 if you wish.
        regards, tom lane