Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From Guillaume Cottenceau
Subject Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution
Date
Msg-id 873btccqgw.fsf@meuh.mnc.ch
In reply to Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution  (Anders Hermansen <anders@yoyo.no>)
Replies Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution  (Anders Hermansen <anders@yoyo.no>)
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-jdbc
Anders Hermansen <anders 'at' yoyo.no> writes:

> * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > Anders Hermansen <anders 'at' yoyo.no> writes:
> > > * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > > > Isn't there a problem with your UTF-8 data containing 0x00EF?
> > >
> > > E0 to EF hex (224 to 239): first byte of a three-byte sequence.
> >
> > Well 00 is first byte here, isn't it?
>
> UTF-8 is a byte sequence, so it's not about the first byte in the whole
> sequence, but about the first byte in a three-byte sequence.

Yes. I forgot that you assumed the machine was little-endian. So the
UTF-8 character here probably has first byte 0xEF and second byte
0x00?

I did my test with first byte 0x00 and second byte 0xEF, hence the
confusion with your initial comment.

My reasoning was that if the first byte of this two-byte
sequence is 0x00, then the rule that 0xEF is the first byte of a
three-byte sequence doesn't apply, since 0xEF is the second byte in
the sequence.
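
As a rough check of the byte-level reading (a sketch in Python, not
something from the original exchange; the helper name utf8_decode_report
is made up for illustration), one can feed both byte orders to a strict
UTF-8 decoder. It suggests that 0xEF is treated as a lead byte wherever
it appears, since a preceding 0x00 decodes on its own as U+0000:

```python
def utf8_decode_report(raw: bytes) -> str:
    """Try a strict UTF-8 decode and describe what happened."""
    try:
        return "ok: " + repr(raw.decode("utf-8"))
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        return f"invalid at offset {exc.start}: {exc.reason}"


# 0x00 first, 0xEF second: 0x00 decodes alone, then 0xEF lacks its
# two continuation bytes
print(utf8_decode_report(bytes([0x00, 0xEF])))
# 0xEF first, 0x00 second: 0x00 is not a valid continuation byte
print(utf8_decode_report(bytes([0xEF, 0x00])))
# A lone 0x00 is accepted by the decoder
print(utf8_decode_report(bytes([0x00])))
```

Either order is rejected as UTF-8, only at a different offset.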

> There should be no nul (0) bytes when encoding UTF-8. I believe
> this is in the specification to allow it to be compatible with
> C nul-terminated strings.
>
> I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> 1) It contains nul (0x00) byte
> 2) 0xEF is not followed by two more bytes
>
> On the other hand, U+00EF is a valid Unicode code point, which points to:

I think this is assumed big-endian, i.e. first byte 0x00 and
second byte 0xEF (especially because UTF-8 is just a series of
bytes without any endianness aspects, so it makes good sense to
actually read this left-to-right, i.e. byte 0x00 first).

> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)

Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
the same[1].

> As 0xEF in ISO-8859-1
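
The encodings listed above can be reproduced with a few lines of
Python (a sketch for illustration, not part of the original thread;
the variable names are arbitrary):

```python
import unicodedata

ch = "\u00ef"  # the code point U+00EF

# Hex dump of the same character in each encoding discussed here
encoded = {
    name: ch.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-16-le", "iso-8859-1")
}

print(unicodedata.name(ch))
for name, hexbytes in encoded.items():
    print(f"{name:>10}: 0x{hexbytes}")
```

Only the big-endian UTF-16 form has the bytes in the order 0x00 then
0xEF; the little-endian form reverses them.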

Hmm, I think I may understand what's going on here. It's possible
that in the message:

        ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

when they say "0x00ef" they are not giving the UTF-8 bytes per se but
the Unicode code point (which is error-prone).


Ref:
[1] UCS-2 is a subset of UTF-16 which comprises all the 2-byte
    sequence characters but none of the 4-byte (surrogate pair) ones
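
To illustrate the footnote (a Python sketch, not from the original
message; the helper name and sample characters are arbitrary picks):
a character inside the Basic Multilingual Plane takes one 16-bit unit
in UTF-16, while one outside it takes a surrogate pair, which UCS-2
cannot express.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


bmp_char = "\u00ef"         # U+00EF, inside the BMP
astral_char = "\U0001d11e"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_units(bmp_char))     # one unit: same bytes in UCS-2 and UTF-16
print(utf16_units(astral_char))  # two units: a surrogate pair, UTF-16 only
```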

--
Guillaume Cottenceau
