Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

From	Tom Lane
Subject	Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Date	19 August 2010, 19:24:50
Msg-id	28944.1282256681@sss.pgh.pa.us
In reply to	Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
List	pgsql-hackers

Steven Schlansker <steven@trumpet.io> writes:
> On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
>> I was able to reproduce this on my own Mac.  Some tracing shows that the
>> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
>> This causes array_in to drop the final byte of the array element string,
>> thinking that it's insignificant whitespace.

> The 0x85 seems to be the second byte of a multibyte UTF-8
> sequence.

Check.
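
The failure mode described above can be sketched in a few lines of C. This is only an illustration, not the actual array_in code: trim_trailing, is_space_ascii and is_space_mac_like are hypothetical names, and the second predicate simply hard-codes the behavior reported for the Mac (0x85 and 0xA0 treated as whitespace) so that the result does not depend on the locale of the machine compiling it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: only the six ASCII whitespace characters. */
int is_space_ascii(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}

/*
 * Hypothetical predicate that mimics the isspace() results reported in
 * this thread for en_US.utf-8 on the Mac: 0x85 and 0xA0 also count.
 */
int is_space_mac_like(unsigned char c)
{
    return is_space_ascii(c) || c == 0x85 || c == 0xA0;
}

/*
 * Return the length of s after dropping trailing bytes for which
 * is_space() is true.  The test is applied byte by byte, so it cannot
 * tell whether a byte belongs to a multibyte character.
 */
size_t trim_trailing(const char *s, int (*is_space)(unsigned char))
{
    size_t len = strlen(s);

    while (len > 0 && is_space((unsigned char) s[len - 1]))
        len--;
    return len;
}
```

Under these assumptions, trim_trailing("\xCF\x85", is_space_ascii) keeps both bytes, while trim_trailing("\xCF\x85", is_space_mac_like) returns 1, leaving the lead byte 0xCF without its trailing byte.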

> I'm not at all experienced with character encodings so I could
> be totally off base, but isn't it wrong to ever call isspace(0x85), 
> whatever the result may be, given that the actual character is 0xCF85?
> (U+03C5, GREEK SMALL LETTER UPSILON)

We generally assume that in server-safe encodings, the ctype.h functions
will behave sanely on any single-byte value.  You can argue the wisdom
of that, but deciding to change that policy would be a rather massive
code change; I'm not excited about going that direction.
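
For reference, the correspondence between the bytes 0xCF 0x85 and U+03C5 mentioned in the quoted question follows from the two-byte UTF-8 layout (110xxxxx 10xxxxxx). The sketch below works through that arithmetic; the function names are hypothetical, and it deliberately skips checks a real decoder would need, such as rejecting overlong encodings.

```c
#include <stdint.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
 * Returns the code point, or -1 if the bytes do not have that shape.
 * Overlong forms (lead bytes 0xC0 and 0xC1) are not rejected here.
 */
int32_t decode_utf8_2byte(unsigned char lead, unsigned char trail)
{
    if ((lead & 0xE0) != 0xC0 || !is_utf8_continuation(trail))
        return -1;
    /* Five payload bits from the lead byte, six from the trailing byte. */
    return ((int32_t) (lead & 0x1F) << 6) | (trail & 0x3F);
}
```

With these definitions, decode_utf8_2byte(0xCF, 0x85) yields (0x0F << 6) | 0x05 = 0x03C5, and is_utf8_continuation(0x85) is true, which is why 0x85 never stands for a character of its own in valid UTF-8.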

>> I believe that you must
>> not have produced the data file data.copy on a Mac, or at least not in
>> that locale setting, because array_out should have double-quoted the
>> array element given that behavior of isspace().

> Correct, it was produced on a Linux machine.  That said, the charset
> there was also UTF-8.

Right ... but you had an isspace function that meets our expectations.

> I actually can't reproduce that behavior here:

You need a setlocale() call, else the program acts as though it's in C
locale regardless of environment.  My test case looks like this:

$ cat isspace.c
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

int main()
{
    int c;

    setlocale(LC_ALL, "");
    for (c = 1; c < 256; c++)
    {
        if (isspace(c))
            printf("%3o is space\n", c);
    }
    return 0;
}
$ gcc -O -Wall isspace.c
$ LANG=C ./a.out
 11 is space
 12 is space
 13 is space
 14 is space
 15 is space
 40 is space
$ LANG=en_US.utf-8 ./a.out
 11 is space
 12 is space
 13 is space
 14 is space
 15 is space
 40 is space
205 is space
240 is space
$ 
        regards, tom lane
