Re: Patch: add conversion from pg_wchar to multibyte

From Tatsuo Ishii
Subject Re: Patch: add conversion from pg_wchar to multibyte
Date
Msg-id 20120708.111057.2187928410302833000.t-ishii@sraoss.co.jp
In reply to Re: Patch: add conversion from pg_wchar to multibyte  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Patch: add conversion from pg_wchar to multibyte  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
>> Tatsuo Ishii <ishii@postgresql.org> writes:
>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>> that I can find.
>>>> 
>>>> I also read in the xemacs internals doc, at
>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>> charset codes.  I don't know how we got to the current state of using
>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>> with XEmacs.
>> 
>>> At the time when mule internal code was introduced to PostgreSQL,
>>> xemacs did not have multi encoding capability and mule (a patch to
>>> emacs) was the only implementation allowed to use multi encoding. So I
>>> used the specification of mule documented in the URL I wrote.
>> 
>> I see.  Given that upstream has decided that a simpler definition is
>> more appropriate, is there any reason not to follow their lead, to the
>> extent that we can do so without breaking existing on-disk data?
> 
> Please let me spend the weekend understanding their latest spec.

This is an interim report on the internal multi-byte charset
implementations of the emacsen. I have read the document Tom linked
to, and I also skimmed the xemacs-21.4.0 source code, especially
xemacs-21.4.0/src/mule-charset.h. The web document appears to be
essentially a copy of the comments in that file. I also looked at
other parts of the xemacs code, and I think I can conclude that xemacs
21.4's multi-byte implementation follows the web document.

Next I looked into the emacs 24.1 source code, because I could not find
any documentation on emacs's (not xemacs's) implementation of the
internal multi-byte charset. I found the following in src/charset.h:

/* Leading-code followed by extended leading-code.    DIMENSION/COLUMN */
#define EMACS_MULE_LEADING_CODE_PRIVATE_11    0x9A /* 1/1 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_12    0x9B /* 1/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_21    0x9C /* 2/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_22    0x9D /* 2/2 */

And these are used like this:

/* Read one non-ASCII character from INSTREAM.  The character is
   encoded in `emacs-mule' and the first byte is already read in
   C.  */

static int
read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
{
:
:
  else if (len == 3)
    {
      if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
          || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
        {
          charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
          code = buf[2] & 0x7F;
        }


As far as I can tell, this is exactly how PostgreSQL handles
single-byte private character sets: each character consists of 3
bytes, and the leading byte is either 0x9a or 0x9b. Other examples for
single-byte and multi-byte private charsets can be seen in coding.c.
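
To make the 3-byte layout concrete, here is a minimal sketch of a decoder for it. It is only an illustration of the layout described above, not code from emacs or PostgreSQL: the type private_char, the function decode_private_single_byte, and the LEADING_CODE_* macros are names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the two leading codes quoted from src/charset.h. */
#define LEADING_CODE_PRIVATE_11 0x9A	/* dimension 1, 1 column */
#define LEADING_CODE_PRIVATE_12 0x9B	/* dimension 1, 2 columns */

/* Hypothetical result type for this sketch only. */
typedef struct
{
	unsigned char charset_id;	/* second byte: extended leading code */
	unsigned char code;			/* third byte with the high bit cleared */
} private_char;

/*
 * Decode one 3-byte private single-byte character.
 * Returns false if the buffer is shorter than 3 bytes or does not
 * start with 0x9a or 0x9b.
 */
static bool
decode_private_single_byte(const unsigned char *buf, size_t len,
						   private_char *out)
{
	assert(buf != NULL && out != NULL);

	if (len < 3)
		return false;
	if (buf[0] != LEADING_CODE_PRIVATE_11 &&
		buf[0] != LEADING_CODE_PRIVATE_12)
		return false;

	out->charset_id = buf[1];	/* identifies the private charset */
	out->code = buf[2] & 0x7F;	/* same masking as in the emacs excerpt */
	return true;
}
```

The second byte is kept as a raw identifier here; mapping it to an actual charset (what CHARSET_FROM_ID does in emacs) is outside the scope of the sketch.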

As far as I can tell, emacs and xemacs employ different
implementations of multi-byte charsets where "private" charsets are
concerned. Emacs's is the same as PostgreSQL's, while xemacs's is
not. I am contacting the original Mule author to ask whether he knows
anything about this.
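
The difference can be summarized as a rough sketch, based only on the emacs defines quoted above and on the xemacs internals document Tom cited. The enum and function names are hypothetical and exist only for this illustration.

```c
/* Hypothetical classification used only in this sketch. */
typedef enum
{
	NOT_PRIVATE,
	PRIVATE_SINGLE_BYTE,
	PRIVATE_MULTI_BYTE
} private_kind;

/* emacs 24.1 (and PostgreSQL): 0x9a/0x9b are dimension 1, 0x9c/0x9d dimension 2. */
static private_kind
classify_emacs(unsigned char lead)
{
	if (lead == 0x9a || lead == 0x9b)
		return PRIVATE_SINGLE_BYTE;
	if (lead == 0x9c || lead == 0x9d)
		return PRIVATE_MULTI_BYTE;
	return NOT_PRIVATE;
}

/* xemacs, per the internals doc: 0x9e single-byte only, 0x9f multi-byte only. */
static private_kind
classify_xemacs(unsigned char lead)
{
	if (lead == 0x9e)
		return PRIVATE_SINGLE_BYTE;
	if (lead == 0x9f)
		return PRIVATE_MULTI_BYTE;
	return NOT_PRIVATE;
}
```

Under this reading, a byte such as 0x9d is a private multi-byte marker for emacs and PostgreSQL, but is not a private marker at all for xemacs.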

BTW, while looking into the emacs source code, I found that their
charset definitions are in lisp/international/mule-conf.el. According
to that file, several new charsets have been added. Included below is
a patch to follow their changes. It makes no change to current
behavior, since it only touches comments and the definitions of
unsupported charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 1bcdfbc..e44749e 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -108,7 +108,7 @@ typedef unsigned int pg_wchar;
 #define LC_KOI8_R            0x8b    /* Cyrillic KOI8-R */
 #define LC_ISO8859_5        0x8c    /* ISO8859 Cyrillic */
 #define LC_ISO8859_9        0x8d    /* ISO8859 Latin 5 (not supported yet) */
-/* #define FREE                0x8e    free (unused) */
+#define LC_ISO8859_15        0x8e    /* ISO8859 Latin 15 (not supported yet) */
 /* #define CONTROL_1        0x8f    control characters (unused) */
 
 /* Is a leading byte for "official" single byte encodings? */
@@ -119,14 +119,13 @@ typedef unsigned int pg_wchar;
  * 0x9a-0x9d are free. 0x9e and 0x9f are reserved.
  */
 #define LC_JISX0208_1978    0x90    /* Japanese Kanji, old JIS (not supported) */
-/* #define FREE                0x90    free (unused) */
 #define LC_GB2312_80        0x91    /* Chinese */
 #define LC_JISX0208            0x92    /* Japanese Kanji (JIS X 0208) */
 #define LC_KS5601            0x93    /* Korean */
 #define LC_JISX0212            0x94    /* Japanese Kanji (JIS X 0212) */
 #define LC_CNS11643_1        0x95    /* CNS 11643-1992 Plane 1 */
 #define LC_CNS11643_2        0x96    /* CNS 11643-1992 Plane 2 */
-/* #define FREE                0x97    free (unused) */
+#define LC_JISX0213_1        0x97    /* Japanese Kanji (JIS X 0213 Plane 1) (not supported) */
 #define LC_BIG5_1            0x98    /* Plane 1 Chinese traditional (not supported) */
 #define LC_BIG5_2            0x99    /* Plane 1 Chinese traditional (not supported) */
 
@@ -184,6 +183,12 @@ typedef unsigned int pg_wchar;
                                      * (not supported) */
 #define LC_TIBETAN_1_COLUMN    0xf1    /* Tibetan 1-column width glyphs
                                      * (not supported) */
+#define LC_UNICODE_SUBSET_2    0xf2    /* Unicode characters of the range U+2500..U+33FF.
+                                     * (not supported) */
+#define LC_UNICODE_SUBSET_3    0xf3    /* Unicode characters of the range U+E000..U+FFFF.
+                                     * (not supported) */
+#define LC_UNICODE_SUBSET    0xf4    /* Unicode characters of the range U+0100..U+24FF.
+                                     * (not supported) */
 #define LC_ETHIOPIC            0xf5    /* Ethiopic characters (not supported) */
 #define LC_CNS11643_3        0xf6    /* CNS 11643-1992 Plane 3 */
 #define LC_CNS11643_4        0xf7    /* CNS 11643-1992 Plane 4 */
