Re: Patch: add conversion from pg_wchar to multibyte

From Tatsuo Ishii
Subject Re: Patch: add conversion from pg_wchar to multibyte
Date
Msg-id 20120522.165029.1187711886221407331.t-ishii@sraoss.co.jp
In response to Re: Patch: add conversion from pg_wchar to multibyte  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: Patch: add conversion from pg_wchar to multibyte  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers

Hi Alexander,

It was good seeing you in Ottawa!

> Hello, Ishii-san!
> 
> We talked at PGCon about my questions on the mule to wchar
> conversion. My questions about the pg_mule2wchar_with_len function are
> as follows. In these parts of the code:
> 
> else if (IS_LCPRV1(*from) && len >= 3)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 3;
> }
> 
> and
> 
> else if (IS_LCPRV2(*from) && len >= 4)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++ << 8;
>     *to |= *from++;
>     len -= 4;
> }
> 
> we skip the first character of the original string. Are we able to
> restore it from the pg_wchar?

I think it's possible. The first characters are defined like this:

#define IS_LCPRV1(c)    ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c)    ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

It seems IS_LCPRV1 is not used by any of the encodings PostgreSQL
supports at this point, which means there is no chance that existing
databases contain LCPRV1 characters. So you can safely ignore it.

IS_LCPRV2 is only used for the Chinese encodings (EUC_TW and BIG5)
in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
and there the first byte is fixed to 0x9d.  So you can always restore
the value as 0x9d.
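
To illustrate, here is a minimal sketch of the round trip for the
LCPRV2 case. The names lcprv2_to_wchar, wchar_to_lcprv2 and
LCPRV2_PREFIX are made up for this example and do not exist in the
tree. It assumes, as described above, that the dropped first byte is
always 0x9d, and it handles only a single character.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the only LCPRV2 prefix assumed to be in use. */
#define LCPRV2_PREFIX 0x9d

/*
 * Forward direction, mirroring the quoted IS_LCPRV2 branch:
 * the prefix byte is skipped and the next three bytes are packed.
 */
uint32_t
lcprv2_to_wchar(const unsigned char *from)
{
    uint32_t    w;

    assert(from[0] == 0x9c || from[0] == 0x9d);
    w = (uint32_t) from[1] << 16;
    w |= (uint32_t) from[2] << 8;
    w |= from[3];
    return w;
}

/*
 * Inverse direction: re-emit the fixed prefix, then unpack the three
 * bytes. Returns the number of bytes written, which is always 4.
 */
int
wchar_to_lcprv2(uint32_t w, unsigned char *to)
{
    to[0] = LCPRV2_PREFIX;      /* assumption: prefix is always 0x9d */
    to[1] = (w >> 16) & 0xff;
    to[2] = (w >> 8) & 0xff;
    to[3] = w & 0xff;
    return 4;
}
```

A real inverse function would also have to decide from the wchar value
which branch produced it; this sketch leaves that out.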

> Also in this part of code we're shifting first byte by 16 bits:
> 
> if (IS_LC1(*from) && len >= 2)
> {
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 2;
> }
> else if (IS_LCPRV1(*from) && len >= 3)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 3;
> }
> 
> Why don't we shift it by 8 bits?

Because we want the first byte of the LC1 case to be placed in the
second byte of the wchar, i.e.

0th byte: always 0
1st byte: leading byte (the first byte of the multibyte)
2nd byte: always 0
3rd byte: the second byte of the multibyte

Note that we always assume that the 1st byte (called "leading byte",
LB for short) represents the id of the character set (from 0x81 to
0xff) in MULE INTERNAL encoding. For the mapping between LB and
charsets, see pg_wchar.h.
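
As a rough sketch of that layout for the LC1 case (the function names
lc1_to_wchar, wchar_leading_byte and wchar_to_lc1 are hypothetical and
only for illustration; uint32_t stands in for pg_wchar here):

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 2-byte LC1 character: LB goes to bits 16..23, data to bits 0..7. */
uint32_t
lc1_to_wchar(const unsigned char *from)
{
    uint32_t    w;

    assert(from[0] >= 0x81);    /* LB is a charset id, 0x81 or above */
    w = (uint32_t) from[0] << 16;
    w |= from[1];
    return w;
}

/* The charset id can be read back from the same position in every case. */
unsigned char
wchar_leading_byte(uint32_t w)
{
    return (w >> 16) & 0xff;
}

/* Inverse of lc1_to_wchar. Returns the number of bytes written (2). */
int
wchar_to_lc1(uint32_t w, unsigned char *to)
{
    to[0] = wchar_leading_byte(w);
    to[1] = w & 0xff;
    return 2;
}
```

With a shift of 8 bits instead, the LB would sit in a different byte
for LC1 than for the other cases, and wchar_leading_byte could not use
one fixed position.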

> You can see my patch in this thread where I propose purely mechanical
> changes in this function which make inverse conversion possible.
> 
> ------
> With best regards,
> Alexander Korotkov.

