Обсуждение: Unicode escapes in literals

Поиск
Список
Период
Сортировка

Unicode escapes in literals

От
Peter Eisentraut
Дата:
I would like to add an escape mechanism to PostgreSQL for entering 
arbitrary Unicode characters into string literals.  We currently only 
have the option of entering the character directly via the keyboard or 
cut-and-paste, which is difficult for a number of reasons, such as when 
the font doesn't have the character, and entering the UTF8-encoded bytes 
using the E'...' strings, which is hardly usable.

SQL has the following escape syntax for it:
   U&'special character: \xxxx' [ UESCAPE '\' ]

where xxxx is the hexadecimal Unicode codepoint.  So this is pretty much 
just another variant on what the E'...' syntax does.

The trick is that since we have user-definable encoding conversion 
routines, we can't convert the Unicode codepoint to the server encoding 
in the scanner stage.  I imagine there are two ways to address this:

1. Only support this syntax when the server encoding is UTF8.  This 
would probably cover most use cases anyway.  We could have limited 
support for characters in the ASCII range for all server encodings.

2. Convert this syntax to a function call.  But that would then create a 
lot of inconsistencies, such as needing functional indexes for matches 
against what should really be a literal.

I'd be happy to start with UTF8 support only.  Other ideas?


Re: Unicode escapes in literals

От
Tom Lane
Дата:
Peter Eisentraut <peter_e@gmx.net> writes:
> SQL has the following escape syntax for it:
>     U&'special character: \xxxx' [ UESCAPE '\' ]

Man that's ugly.  Why the ampersand?  How do you propose to distinguish
this from a perfectly legitimate use of the & operator?

> 2. Convert this syntax to a function call.  But that would then create a 
> lot of inconsistencies, such as needing functional indexes for matches 
> against what should really be a literal.

Uh, why do you think that?  The function could surely be stable, even
immutable if you grant that a database's encoding can't change.
        regards, tom lane


Re: Unicode escapes in literals

От
Peter Eisentraut
Дата:
Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> SQL has the following escape syntax for it:
>>     U&'special character: \xxxx' [ UESCAPE '\' ]
> 
> Man that's ugly.  Why the ampersand?

Yeah, excellent question.  It seems completely unnecessary, but it is 
surely there in the syntax diagram.

> How do you propose to distinguish
> this from a perfectly legitimate use of the & operator?

Well, technically, there is going to be some conflict, but the practical 
impact should be minimal because:

- There are no spaces allowed between U&' .  We typically suggest spaces 
around binary operators.

- Naming a column "u" might not be terribly common.

- Binary-and with an undecorated string literal is not very common.

Of course, I have no data for these assertions.  An inquiry on -general 
might give more insight.

>> 2. Convert this syntax to a function call.  But that would then create a 
>> lot of inconsistencies, such as needing functional indexes for matches 
>> against what should really be a literal.
> 
> Uh, why do you think that?  The function could surely be stable, even
> immutable if you grant that a database's encoding can't change.

Yeah, true, that would work.

There are some other disadvantages for making a function call.  You 
couldn't use that kind of literal in any other place where the parser 
calls for a string constant: role names, tablespace locations, 
passwords, copy delimiters, enum values, function body, file names.

There is also a related feature for Unicode escapes in identifiers, and 
it might be good to keep the door open on that.

We could to a dual approach: Convert in the scanner when server encoding  is UTF8, and pass on as function call
otherwise. Surely ugly though.
 

Or pass it on as a separate token type to the analyze phase, but that is 
a lot more work.


Others: What use cases do you envision, and what requirements would they 
create for this feature?


Re: Unicode escapes in literals

От
Tom Lane
Дата:
Peter Eisentraut <peter_e@gmx.net> writes:
> There are some other disadvantages for making a function call.  You 
> couldn't use that kind of literal in any other place where the parser 
> calls for a string constant: role names, tablespace locations, 
> passwords, copy delimiters, enum values, function body, file names.

Good point.  I'm okay with supporting the feature only when database
encoding is UTF8.
        regards, tom lane


Re: Unicode escapes in literals

От
Andrew Sullivan
Дата:
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
>> Man that's ugly.  Why the ampersand?
>
> Yeah, excellent question.  It seems completely unnecessary, but it is 
> surely there in the syntax diagram.

Probably because many Unicode representations are done with "U+"
followed by 4-6 hexadecimal units, but "+" is problematic for other
reasons (in some vendor's implementation)?

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Unicode escapes in literals

От
Tom Lane
Дата:
Andrew Sullivan <ajs@commandprompt.com> writes:
> On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
>> Yeah, excellent question.  It seems completely unnecessary, but it is 
>> surely there in the syntax diagram.

> Probably because many Unicode representations are done with "U+"
> followed by 4-6 hexadecimal units, but "+" is problematic for other
> reasons (in some vendor's implementation)?

They could hardly ignore the conflict with the operator interpretation
for +.  The committee has now cut themselves off from ever having a
standard operator named &, but I suppose they didn't think ahead to that.
        regards, tom lane


Re: Unicode escapes in literals

От
Peter Eisentraut
Дата:
I wrote:
> SQL has the following escape syntax for it:
>
>    U&'special character: \xxxx' [ UESCAPE '\' ]

Here is an in-progress patch for this.  It still needs updates in the
psql scanner and possibly other scanners.  But the server-side
functionality works.

Index: doc/src/sgml/syntax.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v
retrieving revision 1.123
diff -u -3 -p -c -r1.123 syntax.sgml
*** doc/src/sgml/syntax.sgml    26 Jun 2008 22:24:42 -0000    1.123
--- doc/src/sgml/syntax.sgml    27 Oct 2008 16:54:26 -0000
*************** UPDATE "my_table" SET "a" = 5;
*** 190,195 ****
--- 190,247 ----
     </para>

     <para>
+     A variant of quoted identifiers allows including escaped Unicode
+     characters identified by their code points.  This variant starts
+     with <literal>U&</literal> (upper or lower case U followed by
+     ampersand) immediately before the opening double quote, without
+     any spaces in between, for example <literal>U&"foo"</literal>.
+     (Note that this creates an ambiguity with the
+     operator <literal>&</literal>.  Use spaces around the operator to
+     avoid this problem.)  Inside the quotes, Unicode characters can be
+     specified in escaped form by writing a backslash followed by the
+     four-digit hexadecimal code point number or alternatively a
+     backslash followed by a plus sign followed by a six-digt
+     hexadecimal code point number.  For example, the
+     identifier <literal>"data"</literal> could be written as
+ <programlisting>
+ U&"d\0061t\0061"
+ </programlisting>
+     or equivalently
+ <programlisting>
+ U&"d\+000061t\+000061"
+ </programlisting>
+     The following less trivial example writes the Russian
+     word <quote>slon</quote> (elephant) in Cyrillic letters:
+ <programlisting>
+ U&"\0441\043B\043E\043D"
+ </programlisting>
+    </para>
+
+    <para>
+     If a different escape character than backslash is desired, it can
+     be specified using the <literal>UESCAPE</literal> clause after the
+     string, for example:
+ <programlisting>
+ U&"d!0061t!0061" UESCAPE '!'
+ </programlisting>
+     The escape character can be any single character other than a
+     hexadecimal digit, the plus sign, a single quote, a double quote,
+     or a whitespace character.  Note that the escape character is
+     written in single quotes, not double quotes.
+    </para>
+
+    <para>
+     To include the escape character in the identifier literally, write
+     it twice.
+    </para>
+
+    <para>
+     The Unicode escape syntax works only when the server encoding is
+     UTF8.  When other server encodings are used, only code points in
+     the ASCII range (up to <literal>\007F</literal>) can be specified.
+    </para>
+
+    <para>
      Quoting an identifier also makes it case-sensitive, whereas
      unquoted names are always folded to lower case.  For example, the
      identifiers <literal>FOO</literal>, <literal>foo</literal>, and
*************** UPDATE "my_table" SET "a" = 5;
*** 245,251 ****
       write two adjacent single quotes, e.g.
       <literal>'Dianne''s horse'</literal>.
       Note that this is <emphasis>not</> the same as a double-quote
!      character (<literal>"</>).
      </para>

      <para>
--- 297,303 ----
       write two adjacent single quotes, e.g.
       <literal>'Dianne''s horse'</literal>.
       Note that this is <emphasis>not</> the same as a double-quote
!      character (<literal>"</>). <!-- font-lock sanity: " -->
      </para>

      <para>
*************** SELECT 'foo'      'bar';
*** 269,282 ****
       by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is
       following the standard.)
      </para>

-     <para>
       <indexterm>
        <primary>escape string syntax</primary>
       </indexterm>
       <indexterm>
        <primary>backslash escapes</primary>
       </indexterm>
       <productname>PostgreSQL</productname> also accepts <quote>escape</>
       string constants, which are an extension to the SQL standard.
       An escape string constant is specified by writing the letter
--- 321,339 ----
       by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is
       following the standard.)
      </para>
+    </sect3>
+
+    <sect3 id="sql-syntax-strings-escape">
+     <title>String Constants with C-Style Escapes</title>

       <indexterm>
        <primary>escape string syntax</primary>
       </indexterm>
       <indexterm>
        <primary>backslash escapes</primary>
       </indexterm>
+
+     <para>
       <productname>PostgreSQL</productname> also accepts <quote>escape</>
       string constants, which are an extension to the SQL standard.
       An escape string constant is specified by writing the letter
*************** SELECT 'foo'      'bar';
*** 287,293 ****
       Within an escape string, a backslash character (<literal>\</>) begins a
       C-like <firstterm>backslash escape</> sequence, in which the combination
       of backslash and following character(s) represent a special byte
!      value:

       <table id="sql-backslash-table">
        <title>Backslash Escape Sequences</title>
--- 344,351 ----
       Within an escape string, a backslash character (<literal>\</>) begins a
       C-like <firstterm>backslash escape</> sequence, in which the combination
       of backslash and following character(s) represent a special byte
!      value, shown in <xref linkend="sql-backslash-table">
!     </para>

       <table id="sql-backslash-table">
        <title>Backslash Escape Sequences</title>
*************** SELECT 'foo'      'bar';
*** 341,354 ****
        </tgroup>
       </table>

!      It is your responsibility that the byte sequences you create are
!      valid characters in the server character set encoding. Any other
       character following a backslash is taken literally. Thus, to
       include a backslash character, write two backslashes (<literal>\\</>).
       Also, a single quote can be included in an escape string by writing
       <literal>\'</literal>, in addition to the normal way of <literal>''</>.
      </para>

      <caution>
      <para>
       If the configuration parameter
--- 399,422 ----
        </tgroup>
       </table>

!     <para>
!      Any other
       character following a backslash is taken literally. Thus, to
       include a backslash character, write two backslashes (<literal>\\</>).
       Also, a single quote can be included in an escape string by writing
       <literal>\'</literal>, in addition to the normal way of <literal>''</>.
      </para>

+     <para>
+      It is your responsibility that the byte sequences you create are
+      valid characters in the server character set encoding.  When the
+      server encoding is UTF-8, then the alternative Unicode escape
+      syntax, explained in <xref linkend="sql-syntax-strings-uescape">,
+      should be used instead.  (The alternative would be doing the
+      UTF-8 encoding by hand and writing out the bytes, which would be
+      very cumbersome.)
+     </para>
+
      <caution>
      <para>
       If the configuration parameter
*************** SELECT 'foo'      'bar';
*** 379,384 ****
--- 447,509 ----
      </para>
     </sect3>

+    <sect3 id="sql-syntax-strings-uescape">
+     <title>String Constants with Unicode Escapes</title>
+
+     <para>
+      <productname>PostgreSQL</productname> also supports another type
+      of escape syntax for strings that allows specifying arbitrary
+      Unicode characters by code point.  A Unicode escape string
+      constant starts with <literal>U&</literal> (upper or lower case
+      letter U followed by ampersand) immediately before the opening
+      quote, without any spaces in between, for
+      example <literal>U&'foo'</literal>.  (Note that this creates an
+      ambiguity with the operator <literal>&</literal>.  Use spaces
+      around the operator to avoid this problem.)  Inside the quotes,
+      Unicode characters can be specified in escaped form by writing a
+      backslash followed by the four-digit hexadecimal code point
+      number or alternatively a backslash followed by a plus sign
+      followed by a six-digt hexadecimal code point number.  For
+      example, the string <literal>'data'</literal> could be written as
+ <programlisting>
+ U&'d\0061t\0061'
+ </programlisting>
+      or equivalently
+ <programlisting>
+ U&'d\+000061t\+000061'
+ </programlisting>
+      The following less trivial example writes the Russian
+      word <quote>slon</quote> (elephant) in Cyrillic letters:
+ <programlisting>
+ U&'\0441\043B\043E\043D'
+ </programlisting>
+     </para>
+
+     <para>
+      If a different escape character than backslash is desired, it can
+      be specified using the <literal>UESCAPE</literal> clause after
+      the string, for example:
+ <programlisting>
+ U&'d!0061t!0061' UESCAPE '!'
+ </programlisting>
+      The escape character can be any single character other than a
+      hexadecimal digit, the plus sign, a single quote, a double quote,
+      or a whitespace character.
+     </para>
+
+     <para>
+      The Unicode escape syntax works only when the server encoding is
+      UTF8.  When other server encodings are used, only code points in
+      the ASCII range (up to <literal>\007F</literal>) can be
+      specified.
+     </para>
+
+     <para>
+      To include the escape character in the string literally, write it
+      twice.
+     </para>
+    </sect3>
+
     <sect3 id="sql-syntax-dollar-quoting">
      <title>Dollar-Quoted String Constants</title>

Index: src/backend/parser/scan.l
===================================================================
RCS file: /cvsroot/pgsql/src/backend/parser/scan.l,v
retrieving revision 1.146
diff -u -3 -p -c -r1.146 scan.l
*** src/backend/parser/scan.l    1 Sep 2008 20:42:45 -0000    1.146
--- src/backend/parser/scan.l    27 Oct 2008 16:54:27 -0000
*************** static int        literalalloc;    /* current all
*** 76,81 ****
--- 76,82 ----
  static void addlit(char *ytext, int yleng);
  static void addlitchar(unsigned char ychar);
  static char *litbufdup(void);
+ static char *litbuf_udeescape(unsigned char escape);

  #define lexer_errposition()  scanner_errposition(yylloc)

*************** static unsigned char unescape_single_cha
*** 125,130 ****
--- 126,133 ----
   *  <xq> standard quoted strings
   *  <xe> extended quoted strings (support backslash escape sequences)
   *  <xdolq> $foo$ quoted strings
+  *  <xui> quoted identifier with Unicode escapes
+  *  <xus> quoted string with Unicode escapes
   */

  %x xb
*************** static unsigned char unescape_single_cha
*** 134,139 ****
--- 137,144 ----
  %x xe
  %x xq
  %x xdolq
+ %x xui
+ %x xus

  /*
   * In order to make the world safe for Windows and Mac clients as well as
*************** xdstop            {dquote}
*** 244,249 ****
--- 249,273 ----
  xddouble        {dquote}{dquote}
  xdinside        [^"]+

+ /* Unicode escapes */
+ uescape            [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+ /* error rule to avoid backup */
+ uescapefail
("-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU])
+
+ /* Quoted identifier with Unicode escapes */
+ xuistart        [uU]&{dquote}
+ xuistop1        {dquote}{whitespace}*{uescapefail}?
+ xuistop2        {dquote}{whitespace}*{uescape}
+
+ /* Quoted string with Unicode escapes */
+ xusstart        [uU]&{quote}
+ xusstop1        {quote}{whitespace}*{uescapefail}?
+ xusstop2        {quote}{whitespace}*{uescape}
+
+ /* error rule to avoid backup */
+ xufailed        [uU]&
+
+
  /* C-style comments
   *
   * The "extended comment" syntax closely resembles allowable operator syntax.
*************** other            .
*** 444,449 ****
--- 468,478 ----
                      BEGIN(xe);
                      startlit();
                  }
+ {xusstart}        {
+                     SET_YYLLOC();
+                     BEGIN(xus);
+                     startlit();
+                 }
  <xq,xe>{quotestop}    |
  <xq,xe>{quotefail} {
                      yyless(1);
*************** other            .
*** 456,465 ****
                      yylval.str = litbufdup();
                      return SCONST;
                  }
! <xq,xe>{xqdouble} {
                      addlitchar('\'');
                  }
! <xq>{xqinside}  {
                      addlit(yytext, yyleng);
                  }
  <xe>{xeinside}  {
--- 485,506 ----
                      yylval.str = litbufdup();
                      return SCONST;
                  }
! <xus>{xusstop1} {
!                     /* throw back all but the quote */
!                     yyless(1);
!                     BEGIN(INITIAL);
!                     yylval.str = litbuf_udeescape('\\');
!                     return SCONST;
!                 }
! <xus>{xusstop2} {
!                     BEGIN(INITIAL);
!                     yylval.str = litbuf_udeescape(yytext[yyleng-2]);
!                     return SCONST;
!                 }
! <xq,xe,xus>{xqdouble} {
                      addlitchar('\'');
                  }
! <xq,xus>{xqinside}  {
                      addlit(yytext, yyleng);
                  }
  <xe>{xeinside}  {
*************** other            .
*** 496,509 ****
                      if (IS_HIGHBIT_SET(c))
                          saw_high_bit = true;
                  }
! <xq,xe>{quotecontinue} {
                      /* ignore */
                  }
  <xe>.            {
                      /* This is only needed for \ just before EOF */
                      addlitchar(yytext[0]);
                  }
! <xq,xe><<EOF>>        { yyerror("unterminated quoted string"); }

  {dolqdelim}        {
                      SET_YYLLOC();
--- 537,550 ----
                      if (IS_HIGHBIT_SET(c))
                          saw_high_bit = true;
                  }
! <xq,xe,xus>{quotecontinue} {
                      /* ignore */
                  }
  <xe>.            {
                      /* This is only needed for \ just before EOF */
                      addlitchar(yytext[0]);
                  }
! <xq,xe,xus><<EOF>>        { yyerror("unterminated quoted string"); }

  {dolqdelim}        {
                      SET_YYLLOC();
*************** other            .
*** 553,558 ****
--- 594,604 ----
                      BEGIN(xd);
                      startlit();
                  }
+ {xuistart}        {
+                     SET_YYLLOC();
+                     BEGIN(xui);
+                     startlit();
+                 }
  <xd>{xdstop}    {
                      char           *ident;

*************** other            .
*** 565,577 ****
                      yylval.str = ident;
                      return IDENT;
                  }
! <xd>{xddouble}    {
                      addlitchar('"');
                  }
! <xd>{xdinside}    {
                      addlit(yytext, yyleng);
                  }
! <xd><<EOF>>        { yyerror("unterminated quoted identifier"); }

  {typecast}        {
                      SET_YYLLOC();
--- 611,656 ----
                      yylval.str = ident;
                      return IDENT;
                  }
! <xui>{xuistop1}    {
!                     char           *ident;
!
!                     BEGIN(INITIAL);
!                     if (literallen == 0)
!                         yyerror("zero-length delimited identifier");
!                     ident = litbuf_udeescape('\\');
!                     if (literallen >= NAMEDATALEN)
!                         truncate_identifier(ident, literallen, true);
!                     yylval.str = ident;
!                     /* throw back all but the quote */
!                     yyless(1);
!                     return IDENT;
!                 }
! <xui>{xuistop2}    {
!                     char           *ident;
!
!                     BEGIN(INITIAL);
!                     if (literallen == 0)
!                         yyerror("zero-length delimited identifier");
!                     ident = litbuf_udeescape(yytext[yyleng - 2]);
!                     if (literallen >= NAMEDATALEN)
!                         truncate_identifier(ident, literallen, true);
!                     yylval.str = ident;
!                     return IDENT;
!                 }
! <xd,xui>{xddouble}    {
                      addlitchar('"');
                  }
! <xd,xui>{xdinside}    {
                      addlit(yytext, yyleng);
                  }
! <xd,xui><<EOF>>        { yyerror("unterminated quoted identifier"); }
!
! {xufailed}    {
!                     /* throw back all but the initial u/U */
!                     yyless(1);
!                     /* and treat it as {other} */
!                     return yytext[0];
!                 }

  {typecast}        {
                      SET_YYLLOC();
*************** litbufdup(void)
*** 908,913 ****
--- 987,1082 ----
      return new;
  }

+ static int
+ hexval(unsigned char c)
+ {
+     if (c >= '0' && c <= '9')
+         return c - '0';
+     if (c >= 'a' && c <= 'f')
+         return c - 'a' + 0xA;
+     if (c >= 'A' && c <= 'F')
+         return c - 'A' + 0xA;
+     elog(ERROR, "invalid hexadecimal digit");
+     return 0; /* not reached */
+ }
+
+ static void
+ check_unicode_value(pg_wchar c, char * loc)
+ {
+     if (GetDatabaseEncoding() == PG_UTF8)
+         return;
+
+     if (c > 0x7F)
+     {
+         yylloc += (char *) loc - literalbuf + 3;   /* 3 for U&" */
+         yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is
notUTF8"); 
+     }
+ }
+
+ static char *
+ litbuf_udeescape(unsigned char escape)
+ {
+     char *new;
+     char *in, *out;
+
+     if (isxdigit(escape)
+         || escape == '+'
+         || escape == '\''
+         || escape == '"'
+         || scanner_isspace(escape))
+         yyerror("invalid Unicode escape character");
+
+     /*
+      * This relies on the subtle assumption that a UTF-8 expansion
+      * cannot be longer than its escaped representation.
+      */
+     new = palloc(literallen + 1);
+
+     in = literalbuf;
+     out = new;
+     while (*in)
+     {
+         if (in[0] == escape)
+         {
+             if (in[1] == escape)
+             {
+                 *out++ = escape;
+                 in += 2;
+             }
+             else if (isxdigit(in[1]) && isxdigit(in[2]) && isxdigit(in[3]) && isxdigit(in[4]))
+             {
+                 pg_wchar unicode = hexval(in[1]) * 16*16*16 + hexval(in[2]) * 16*16 + hexval(in[3]) * 16 +
hexval(in[4]);
+                 check_unicode_value(unicode, in);
+                 unicode_to_utf8(unicode, (unsigned char *) out);
+                 in += 5;
+                 out += pg_mblen(out);
+             }
+             else if (in[1] == '+'
+                      && isxdigit(in[2]) && isxdigit(in[3])
+                      && isxdigit(in[4]) && isxdigit(in[5])
+                      && isxdigit(in[6]) && isxdigit(in[7]))
+             {
+                 pg_wchar unicode = hexval(in[2]) * 16*16*16*16*16 + hexval(in[3]) * 16*16*16*16 + hexval(in[4]) *
16*16*16
+                                     + hexval(in[5]) * 16*16 + hexval(in[6]) * 16 + hexval(in[7]);
+                 check_unicode_value(unicode, in);
+                 unicode_to_utf8(unicode, (unsigned char *) out);
+                 in += 8;
+                 out += pg_mblen(out);
+             }
+             else
+             {
+                 yylloc += in - literalbuf + 3;   /* 3 for U&" */
+                 yyerror("invalid Unicode escape value");
+             }
+         }
+         else
+             *out++ = *in++;
+     }
+
+     *out = '\0';
+     pg_verifymbstr(new, out - new, false);
+     return new;
+ }

  static unsigned char
  unescape_single_char(unsigned char c)
Index: src/backend/utils/adt/xml.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/xml.c,v
retrieving revision 1.79
diff -u -3 -p -c -r1.79 xml.c
*** src/backend/utils/adt/xml.c    14 Oct 2008 17:12:33 -0000    1.79
--- src/backend/utils/adt/xml.c    27 Oct 2008 16:54:27 -0000
*************** unicode_to_sqlchar(pg_wchar c)
*** 1497,1524 ****
  {
      static unsigned char utf8string[5]; /* need trailing zero */

!     if (c <= 0x7F)
!     {
!         utf8string[0] = c;
!     }
!     else if (c <= 0x7FF)
!     {
!         utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
!         utf8string[1] = 0x80 | (c & 0x3F);
!     }
!     else if (c <= 0xFFFF)
!     {
!         utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
!         utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
!         utf8string[2] = 0x80 | (c & 0x3F);
!     }
!     else
!     {
!         utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
!         utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
!         utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
!         utf8string[3] = 0x80 | (c & 0x3F);
!     }

      return (char *) pg_do_encoding_conversion(utf8string,
                                                pg_mblen((char *) utf8string),
--- 1497,1503 ----
  {
      static unsigned char utf8string[5]; /* need trailing zero */

!     unicode_to_utf8(c, utf8string);

      return (char *) pg_do_encoding_conversion(utf8string,
                                                pg_mblen((char *) utf8string),
Index: src/backend/utils/mb/wchar.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/mb/wchar.c,v
retrieving revision 1.66
diff -u -3 -p -c -r1.66 wchar.c
*** src/backend/utils/mb/wchar.c    15 Nov 2007 21:14:40 -0000    1.66
--- src/backend/utils/mb/wchar.c    27 Oct 2008 16:54:27 -0000
*************** pg_utf2wchar_with_len(const unsigned cha
*** 419,424 ****
--- 419,459 ----
      return cnt;
  }

+
+ /*
+  * Map a Unicode codepoint to UTF-8.  utf8string must have 4 bytes of
+  * space allocated.
+  */
+ unsigned char *
+ unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
+ {
+     if (c <= 0x7F)
+     {
+         utf8string[0] = c;
+     }
+     else if (c <= 0x7FF)
+     {
+         utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
+         utf8string[1] = 0x80 | (c & 0x3F);
+     }
+     else if (c <= 0xFFFF)
+     {
+         utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
+         utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
+         utf8string[2] = 0x80 | (c & 0x3F);
+     }
+     else
+     {
+         utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
+         utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
+         utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
+         utf8string[3] = 0x80 | (c & 0x3F);
+     }
+
+     return utf8string;
+ }
+
+
  /*
   * Return the byte length of a UTF8 character pointed to by s
   *
Index: src/include/mb/pg_wchar.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/mb/pg_wchar.h,v
retrieving revision 1.79
diff -u -3 -p -c -r1.79 pg_wchar.h
*** src/include/mb/pg_wchar.h    18 Jun 2008 18:42:54 -0000    1.79
--- src/include/mb/pg_wchar.h    27 Oct 2008 16:54:27 -0000
*************** extern const char *GetDatabaseEncodingNa
*** 380,385 ****
--- 380,386 ----
  extern int    pg_valid_client_encoding(const char *name);
  extern int    pg_valid_server_encoding(const char *name);

+ extern unsigned char *unicode_to_utf8(pg_wchar c, unsigned char *utf8string);
  extern int    pg_utf_mblen(const unsigned char *);
  extern unsigned char *pg_do_encoding_conversion(unsigned char *src, int len,
                            int src_encoding,