Re: Mac OS: invalid byte sequence for encoding "UTF8"

From Tom Lane
Subject Re: Mac OS: invalid byte sequence for encoding "UTF8"
Date
Msg-id 15966.1455142785@sss.pgh.pa.us
In response to Re: Mac OS: invalid byte sequence for encoding "UTF8"  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Mac OS: invalid byte sequence for encoding "UTF8"  (Larry Rosenman <ler@lerctr.org>)
Re: Mac OS: invalid byte sequence for encoding "UTF8"  (Chapman Flack <chap@anastigmatix.net>)
Re: Mac OS: invalid byte sequence for encoding "UTF8"  (Artur Zakirov <a.zakirov@postgrespro.ru>)
List pgsql-hackers
I wrote:
> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf() 
>> with the %s format reads the string one character at a time. The size of 
>> letter 'х' is 2. And sscanf() separate it into two wrong characters.

> That argument might be convincing if OSX behaved that way for all
> multibyte characters, but it doesn't seem to be doing that.  Why is
> only 'х' affected?

I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a clearly
insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
has or once had the same issue.  (A look at the freebsd project on github
says it still does, assuming that's the authoritative repo.)  Not sure
about other BSDen.
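
For anyone who wants to check a particular platform, a minimal probe along these lines might do. The helper name token_survives_sscanf is made up for illustration; it only reports whether sscanf() with %s returns a whitespace-free word unchanged, and says nothing about why it might not.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if sscanf("%s") hands back `word`
 * unchanged, 0 otherwise.  `word` is assumed to contain no ASCII
 * whitespace and to be shorter than 256 bytes.
 */
static int
token_survives_sscanf(const char *word)
{
    char buf[256];

    assert(word != NULL);
    if (strlen(word) >= sizeof(buf))
        return 0;               /* too long for this probe */
    if (sscanf(word, "%255s", buf) != 1)
        return 0;               /* nothing was converted */
    return strcmp(buf, word) == 0;
}
```

The caller would first select a UTF8-based locale with setlocale(LC_CTYPE, "") and then pass the UTF-8 bytes of the word in question, for example "\xd1\x85" for 'х'. A result of 0 would be consistent with the byte-at-a-time behavior described above.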

I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why
it thinks that, but that explains why 'х' fails when adjacent code points
don't.
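
To make the mechanism concrete: 'х' is U+0445, which UTF-8 encodes as the bytes 0xD1 0x85, while its neighbors 'ф' (U+0444) and 'ц' (U+0446) encode as 0xD1 0x84 and 0xD1 0x86. The sketch below is a model, not the OS X code: scan_token_bytewise and space_with_high_bytes are hypothetical names, and the predicate simply hard-codes the two byte values reported above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predicate type: nonzero if the byte counts as a space. */
typedef int (*space_pred)(unsigned char);

/*
 * Model of a byte-wise "%s" conversion: skip leading spaces, then copy
 * bytes until the next space or the end of input.  Returns the number of
 * bytes copied; `out` is always NUL-terminated.
 */
static size_t
scan_token_bytewise(const char *in, space_pred is_space,
                    char *out, size_t outsz)
{
    size_t n = 0;

    assert(outsz > 0);
    while (*in != '\0' && is_space((unsigned char) *in))
        in++;
    while (*in != '\0' && !is_space((unsigned char) *in) && n + 1 < outsz)
        out[n++] = *in++;
    out[n] = '\0';
    return n;
}

/* ASCII-only whitespace test, independent of the locale. */
static int
ascii_space(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Stand-in for an isspace() that also accepts 0x85 and 0xA0 (assumption). */
static int
space_with_high_bytes(unsigned char c)
{
    return ascii_space(c) || c == 0x85 || c == 0xA0;
}
```

With space_with_high_bytes, scanning "\xd1\x85" stops after the lead byte 0xD1, leaving a truncated sequence that is not valid UTF-8, whereas "\xd1\x84" comes through whole. By the same reasoning any character whose encoding contains 0x85 or 0xA0 would presumably be hit as well, for instance 'Р' (U+0420, bytes 0xD0 0xA0).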

So apparently the coding rule we have to adopt is "don't use *scanf()
on data that might contain multibyte characters".  (There might be corner
cases where it'd work all right for conversion specifiers other than %s,
but probably you might as well just use strtol and friends in such cases.)
Ugh.
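
One possible shape for such a replacement, as a sketch only: split tokens on ASCII whitespace with strspn()/strcspn(), which compare raw bytes and so never match a UTF-8 continuation byte, and convert numbers with strtol(). The names next_token and parse_int are hypothetical and not taken from any existing code.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy the next token delimited by ASCII whitespace into `out`.
 * Returns a pointer just past the token, or NULL if there is no token
 * or it does not fit in `outsz` bytes.
 */
static const char *
next_token(const char *in, char *out, size_t outsz)
{
    size_t len;

    assert(outsz > 0);
    in += strspn(in, " \t\r\n");        /* skip leading ASCII whitespace */
    len = strcspn(in, " \t\r\n");       /* token length in bytes */
    if (len == 0 || len >= outsz)
        return NULL;
    memcpy(out, in, len);
    out[len] = '\0';
    return in + len;
}

/* Parse a whole token as a base-10 int; returns 1 on success, 0 otherwise. */
static int
parse_int(const char *tok, int *result)
{
    char *end;
    long  val;

    /*
     * strtol() skips leading whitespace as defined by isspace(), so insist
     * on a sign or digit up front to keep the locale out of the picture.
     */
    if (!(*tok == '-' || *tok == '+' || (*tok >= '0' && *tok <= '9')))
        return 0;
    errno = 0;
    val = strtol(tok, &end, 10);
    if (end == tok || *end != '\0' || errno == ERANGE ||
        val < INT_MIN || val > INT_MAX)
        return 0;
    *result = (int) val;
    return 1;
}
```

A caller would invoke next_token() repeatedly, feeding the returned pointer back in, and hand numeric fields to parse_int(). Multibyte words pass through byte for byte because only the four listed ASCII bytes are treated as delimiters.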
        regards, tom lane


