Re: Patch: add conversion from pg_wchar to multibyte

From	Robert Haas
Subject	Re: Patch: add conversion from pg_wchar to multibyte
Date	July 2, 2012, 22:17:00
Msg-id	CA+TgmoaHLC6tD+88XZJmo-gJ7Ue+5d7oNKeES-5hyrUTC_LiKQ@mail.gmail.com
In reply to	Re: Patch: add conversion from pg_wchar to multibyte (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: Patch: add conversion from pg_wchar to multibyte (Tatsuo Ishii <ishii@postgresql.org>); Re: Patch: add conversion from pg_wchar to multibyte (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

On Mon, Jul 2, 2012 at 4:46 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> So, I provided such a transformation in versions 0.3 and 0.4, based on
> the explanation from Tatsuo Ishii. The problem is that both conversions are
> nontrivial and it's not evident that they mirror each other (understanding
> that they do requires some additional assumptions about encodings, not
> evident just from the transformation itself). I thought you mentioned that
> problem two messages back.

Yeah, I did.  I think I may be a bit confused here, so let me try to
understand this a bit better.  It seems like pg_mule2wchar_with_len
uses the following algorithm:

- If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
with shifts of 16 and 0.
- If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
skipping the first one and storing the remaining two with shifts of 16
and 0.
- If the first character IS_LC2 (0x90-0x99), decode three bytes,
stored with shifts of 16, 8, and 0.
- If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
skipping the first one and storing the remaining three with shifts of
16, 8, and 0.
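
As a rough sketch (not the actual PostgreSQL source), the four cases above
could be written in C roughly as follows. The macro ranges are the ones
quoted in the list; mule_decode_one is a hypothetical name used only for
illustration, and the fall-through case assumes that any other byte (ASCII
in particular) is stored as-is.

```c
#include <stddef.h>
#include <stdint.h>

/* Lead-byte classes, using the ranges quoted in the list above. */
#define IS_LC1(c)    ((c) >= 0x81 && (c) <= 0x8d)
#define IS_LCPRV1(c) ((c) == 0x9a || (c) == 0x9b)
#define IS_LC2(c)    ((c) >= 0x90 && (c) <= 0x99)
#define IS_LCPRV2(c) ((c) == 0x9c || (c) == 0x9d)

/*
 * Decode one MULE character starting at s, with len bytes available.
 * Stores the wide value in *out and returns the number of bytes consumed,
 * or 0 if the input is empty or truncated.
 * (Hypothetical helper, not a PostgreSQL function.)
 */
static size_t
mule_decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
    unsigned char c;

    if (len == 0)
        return 0;
    c = s[0];

    if (IS_LC1(c))
    {
        /* two bytes, stored with shifts of 16 and 0 */
        if (len < 2)
            return 0;
        *out = ((uint32_t) c << 16) | s[1];
        return 2;
    }
    if (IS_LCPRV1(c))
    {
        /* three bytes; the prefix byte itself is not stored */
        if (len < 3)
            return 0;
        *out = ((uint32_t) s[1] << 16) | s[2];
        return 3;
    }
    if (IS_LC2(c))
    {
        /* three bytes, stored with shifts of 16, 8, and 0 */
        if (len < 3)
            return 0;
        *out = ((uint32_t) c << 16) | ((uint32_t) s[1] << 8) | s[2];
        return 3;
    }
    if (IS_LCPRV2(c))
    {
        /* four bytes; the prefix byte itself is not stored */
        if (len < 4)
            return 0;
        *out = ((uint32_t) s[1] << 16) | ((uint32_t) s[2] << 8) | s[3];
        return 4;
    }

    /* Assumption: everything else, ASCII in particular, is stored as-is. */
    *out = c;
    return 1;
}
```

Note that in both private cases the prefix byte (0x9a-0x9d) is dropped, so
the reverse transformation has to reconstruct it from what remains.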

In the reverse transformation implemented by pg_wchar2mule_with_len,
if the byte stored with shift 16 IS_LC1 or IS_LC2, then we emit 2 or
3 bytes, respectively, exactly as I would expect.  ASCII is also
handled as I would expect.  The case I don't understand is what happens
when the leading byte of the multibyte character was IS_LCPRV1 or
IS_LCPRV2.  In that case, we ought to emit three bytes if it was
IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
always emit 4 bytes.  That implies that the IS_LCPRV1() case in
pg_mule2wchar_with_len is dead code, and that any 4-byte characters
are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
comment there is driving at, but it's not too clear to me.
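
For comparison, a reverse transformation that keeps the three-byte and
four-byte private cases distinct might look like the sketch below. This is
not the code from the patch. The mapping from private charset ID ranges to
prefix bytes (0xa0-0xdf to 0x9a, 0xe0-0xef to 0x9b, 0xf0-0xf4 to 0x9c,
0xf5-0xfe to 0x9d) is an assumption taken from the comments in pg_wchar.h
and should be checked against the source; mule_encode_one is a hypothetical
name.

```c
#include <stddef.h>
#include <stdint.h>

#define IS_LC1(c) ((c) >= 0x81 && (c) <= 0x8d)
#define IS_LC2(c) ((c) >= 0x90 && (c) <= 0x99)

/*
 * Encode one wide value into out, which must have room for 4 bytes.
 * Returns the number of bytes written.
 * (Hypothetical helper, not a PostgreSQL function.)
 */
static size_t
mule_encode_one(uint32_t w, unsigned char *out)
{
    unsigned char hi = (w >> 16) & 0xff;    /* byte stored with shift 16 */
    unsigned char mid = (w >> 8) & 0xff;    /* byte stored with shift 8 */
    unsigned char lo = w & 0xff;            /* byte stored with shift 0 */
    unsigned char prefix;

    if (IS_LC1(hi))
    {
        out[0] = hi;
        out[1] = lo;
        return 2;
    }
    if (IS_LC2(hi))
    {
        out[0] = hi;
        out[1] = mid;
        out[2] = lo;
        return 3;
    }

    /* ASSUMPTION: single-byte private charsets, prefixed by 0x9a or 0x9b. */
    if (hi >= 0xa0 && hi <= 0xef)
    {
        prefix = (hi <= 0xdf) ? 0x9a : 0x9b;
        out[0] = prefix;
        out[1] = hi;
        out[2] = lo;
        return 3;
    }

    /* ASSUMPTION: two-byte private charsets, prefixed by 0x9c or 0x9d. */
    if (hi >= 0xf0 && hi <= 0xfe)
    {
        prefix = (hi <= 0xf4) ? 0x9c : 0x9d;
        out[0] = prefix;
        out[1] = hi;
        out[2] = mid;
        out[3] = lo;
        return 4;
    }

    /* ASCII, and anything this sketch does not recognize: low byte only. */
    out[0] = lo;
    return 1;
}
```

Under those assumed ranges, the output length is determined entirely by the
byte stored with shift 16, so a wide value that came from an IS_LCPRV1
sequence goes back to three bytes rather than four.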

Am I close?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
