Re: Mac OS: invalid byte sequence for encoding "UTF8"

From: Larry Rosenman
Subject: Re: Mac OS: invalid byte sequence for encoding "UTF8"
Msg-id: d94fdeb7997353bf0ba6906679a89d0c@thebighonker.lerctr.org
In reply to: Re: Mac OS: invalid byte sequence for encoding "UTF8"  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Mac OS: invalid byte sequence for encoding "UTF8"  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 2016-02-10 16:19, Tom Lane wrote:
> I wrote:
>> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
>>> with the %s format reads the string one character at a time. The size of
>>> letter 'х' is 2. And sscanf() separate it into two wrong characters.
> 
>> That argument might be convincing if OSX behaved that way for all
>> multibyte characters, but it doesn't seem to be doing that.  Why is
>> only 'х' affected?
> 
> I looked into the OS X sources, and found that indeed you are right:
> *scanf processes the input a byte at a time, and applies isspace() to
> each byte separately, even when the locale is such that that's a clearly
> insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
> has or once had the same issue.  (A look at the freebsd project on github
> says it still does, assuming that's the authoritative repo.)  Not sure
> about other BSDen.
> 
> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
> 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why
> it thinks that, but that explains why 'х' fails when adjacent code points
> don't.
> 
> So apparently the coding rule we have to adopt is "don't use *scanf()
> on data that might contain multibyte characters".  (There might be corner
> cases where it'd work all right for conversion specifiers other than %s,
> but probably you might as well just use strtol and friends in such cases.)
> Ugh.
> 
>             regards, tom lane
Definitive FreeBSD Sources:

https://svnweb.freebsd.org/base/
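
For illustration only, here is a rough C sketch of the coding rule quoted above. It is not taken from the PostgreSQL tree; the helper names is_ascii_space(), first_token() and parse_int() are hypothetical. The idea is to split on ASCII whitespace by hand instead of using sscanf() with %s, so that a UTF-8 continuation byte such as 0x85 (the second byte of 'х', which encodes as 0xD1 0x85) is never mistaken for a space, and to use strtol() for numbers.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* ASCII-only whitespace test. Unlike isspace(), it does not consult the
 * locale, so bytes >= 0x80 (e.g. 0x85, 0xA0) are never treated as spaces. */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

/* Hypothetical helper: copy the first whitespace-delimited token of src
 * into dst (at most dstlen - 1 bytes, always NUL-terminated when
 * dstlen > 0). Returns the number of bytes copied. A long token is cut at
 * a byte boundary, which may fall inside a multibyte character. */
static size_t first_token(const char *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    if (dstlen == 0)
        return 0;
    while (*src && is_ascii_space((unsigned char) *src))
        src++;                      /* skip leading ASCII whitespace */
    while (*src && !is_ascii_space((unsigned char) *src) && n + 1 < dstlen)
        dst[n++] = *src++;          /* copy bytes verbatim */
    dst[n] = '\0';
    return n;
}

/* Hypothetical helper: parse a base-10 int at the start of s with strtol().
 * On success stores the value in *out, points *end at the first unparsed
 * byte and returns 1. Returns 0 if no digits were found or the value does
 * not fit in an int. */
static int parse_int(const char *s, int *out, const char **end)
{
    char *stop;
    long v;

    errno = 0;
    v = strtol(s, &stop, 10);
    if (stop == s || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;
    *out = (int) v;
    *end = stop;
    return 1;
}
```

One caveat with this sketch: strtol() itself skips leading whitespace as determined by isspace(), so a caller that wants fully locale-independent behavior would probably skip leading ASCII whitespace first and then check that the next byte is a sign or a digit before calling parse_int().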


-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961


