Обсуждение: BUG #2895: Private Use Unicode character crashes server when using ILIKE

Поиск
Список
Период
Сортировка

BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
"James Russell"
Дата:
The following bug has been logged online:

Bug reference:      2895
Logged by:          James Russell
Email address:      internationalhobo@gmail.com
PostgreSQL version: 8.1.4
Operating system:   Linux (Fedora Core 5)
Description:        Private Use Unicode character crashes server when using
ILIKE
Details:

A UTF-8 text field contains (among others) the following Unicode character:
 0xf4808286 in UTF-8 (== 0x00100086 in UTF-32)

This corresponds to a character in the Unicode "Private Use" area, where the
codepoints are undefined and designated for proprietary uses
(0x100000-0x10FFFD).

If a text field contains a such a character, and if the search term contains
certain Unicode characters, then queries using ILIKE will crash the server.

We discovered this bug when a user searched for rows containing the "TM"
character (UTF-8 0xE284A2 / UTF-32 0x2122), which translated to the
following query:

SELECT * FROM foo WHERE bar ILIKE('%™%');

If the rows searched contain the row with the Private Use character, then
the server crashes during the query and psql returns the following:

"psql:recreatebug.sql:8: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:recreatebug.sql:8: connection to server was lost"

This bug is 100% reproducible on multiple servers with different
configurations. Our server configuration does not have any changes from
default.

The following PHP script will generate a psql script that can reproduce the
problem. Run this script to generate the SQL script, then run the SQL script
in psql. The server will crash when the SELECT statement is run.

<?php
$sql = "
\c template1
CREATE DATABASE temp_db WITH ENCODING 'UNICODE';
\c temp_db
BEGIN;
CREATE TABLE foo (bar text);
INSERT INTO foo (bar) VALUES
('".chr(0xf4).chr(0x80).chr(0x82).chr(0x86)."');
SELECT * FROM foo WHERE bar ILIKE('%".chr(0xe2).chr(0x84).chr(0xa2)."%');

\c template1
DROP DATABASE temp_db;
";
file_put_contents('recreatebug.sql', $sql);


Additional notes:
- The bug does not appear to occur when using LIKE, just ILIKE.
- The bug does not appear to occur when searching for plain ASCII strings
like '%abc%'.
- I have not yet tried to reproduce the bug on the latest Postgres 8.2.x

Re: BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
Michael Fuhr
Дата:
On Tue, Jan 16, 2007 at 06:16:22AM +0000, James Russell wrote:
> Description: Private Use Unicode character crashes server when using ILIKE

The archives show that ILIKE is known to be broken with multibyte
characters in 8.1 and earlier, although I don't recall seeing reports
of a crash resulting.  I got a crash in 8.1.6 built from the latest
source in CVS; here's a partial stack trace:

(gdb) bt
#0  MBMatchTextIC (t=0x2a98613d1c "�\200\202\206", tlen=4, p=0x0, plen=4) at like_match.c:195
#1  0x00000000005ae558 in texticlike (fcinfo=Variable "fcinfo" is not available.
) at like.c:355

I wonder if this is a problem only with code points outside of Plane 0,
viz., those with UTF-8 sequences longer than three bytes.  I don't get
a crash with U+FFFD (E'\357\277\275') but I do with U+10000
(E'\360\220\200\200') and other four-byte sequences.

> - I have not yet tried to reproduce the bug on the latest Postgres 8.2.x

It appears to work in 8.2.1; at least it didn't crash.  The 8.2
Release Notes contain the following item:

* Allow ILIKE to work for multi-byte encodings (Tom)

  Internally, ILIKE now calls lower() and then uses LIKE.  Locale-specific
  regular expression patterns still do not work in these encodings.

--
Michael Fuhr

Re: BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
Tom Lane
Дата:
"James Russell" <internationalhobo@gmail.com> writes:
> If a text field contains a such a character, and if the search term contains
> certain Unicode characters, then queries using ILIKE will crash the server.

I can't reproduce this with your example, so I suppose there's some
critical bit of information you've left out.  Can you provide a stack
trace from the crash?

            regards, tom lane

Re: BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
Michael Fuhr
Дата:
On Tue, Jan 23, 2007 at 09:09:52PM -0500, Tom Lane wrote:
> "James Russell" <internationalhobo@gmail.com> writes:
> > If a text field contains a such a character, and if the search term contains
> > certain Unicode characters, then queries using ILIKE will crash the server.
>
> I can't reproduce this with your example, so I suppose there's some
> critical bit of information you've left out.  Can you provide a stack
> trace from the crash?

This might be platform-dependent -- I get a crash with an 8.1.6
UTF-8 database on RHEL AS 4 x86_64 but not on Solaris 9/sparc 32-bit.
I don't get a crash on either platform with 8.2.1 or 8.3devel.

The crash seems to happen only with code points that have UTF-8
byte sequences of more than three bytes (U+10000 and higher).
Here's an example that segfaults in 8.1.6 on the Linux box:

select e'\360\220\200\200' ilike e'%\342\204\242';

#0  MBMatchTextIC (t=0x8ebcbc "�\220\200\200�7\212", tlen=4, p=0x0, plen=3) at like_match.c:195
195                    if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_'))
(gdb) bt
#0  MBMatchTextIC (t=0x8ebcbc "�\220\200\200�7\212", tlen=4, p=0x0, plen=3) at like_match.c:195
#1  0x00000000005ae558 in texticlike (fcinfo=Variable "fcinfo" is not available.
) at like.c:355
#2  0x0000000000501044 in ExecMakeFunctionResult (fcache=0x90b530, econtext=0x90b400, isNull=0x90bdb0 "",
isDone=0x90bdd0)at execQual.c:1095 
#3  0x0000000000504c23 in ExecProject (projInfo=0x90bbf0, isDone=0x7fbfffdd74) at execQual.c:3704
#4  0x000000000050f64a in ExecResult (node=0x90b2e8) at nodeResult.c:157
#5  0x00000000004ff90d in ExecProcNode (node=0x90b2e8) at execProcnode.c:306
#6  0x00000000004feb8a in ExecutorRun (queryDesc=Variable "queryDesc" is not available.
) at execMain.c:1122
#7  0x000000000058857e in PortalRunSelect (portal=0x908ae8, forward=Variable "forward" is not available.
) at pquery.c:794
#8  0x0000000000588b7f in PortalRun (portal=0x908ae8, count=9223372036854775807, dest=0x8ec510, altdest=0x8ec510,
completionTag=0x7fbfffe0f0"") 
    at pquery.c:646
#9  0x0000000000584846 in exec_simple_query (query_string=0x8eb4e8 "select e'\\360\\220\\200\\200' ilike
e'%\\342\\204\\242';")at postgres.c:1004 
#10 0x00000000005864ee in PostgresMain (argc=4, argv=0x8844a8, username=0x884390 "mfuhr") at postgres.c:3232
#11 0x000000000055c31a in ServerLoop () at postmaster.c:2863
#12 0x000000000055d90a in PostmasterMain (argc=5, argv=0x8828c0) at postmaster.c:941
#13 0x000000000051d5e3 in main (argc=5, argv=0x8828c0) at main.c:265

--
Michael Fuhr

Re: BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
Tom Lane
Дата:
Michael Fuhr <mike@fuhr.org> writes:
> The crash seems to happen only with code points that have UTF-8
> byte sequences of more than three bytes (U+10000 and higher).
> Here's an example that segfaults in 8.1.6 on the Linux box:

> select e'\360\220\200\200' ilike e'%\342\204\242';

Well, tracing through it I still don't get a crash, but I see where it's
stomping on stack :-(.  pg_utf_mblen() and pg_utf8_islegal() allow
4-byte utf8 sequences, but pg_utf2wchar_with_len() only supports 3!
I'm surprised we've not seen reports of other problems.  The lack of
crash in 8.2 is because iwchareq went away completely, but that's
certainly not the only place doing conversions to pg_wchar and expecting
the result to fit in a prespecified buffer ...

            regards, tom lane

Re: BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
Tom Lane
Дата:
I wrote:
> I'm surprised we've not seen reports of other problems.  The lack of
> crash in 8.2 is because iwchareq went away completely, but that's
> certainly not the only place doing conversions to pg_wchar and expecting
> the result to fit in a prespecified buffer ...

Actually, it looks like all the other places that use mb2wchar allocate
a buffer with more pg_wchar's than there are bytes in the input, and so
overrun is impossible even with the breakage in pg_utf2wchar_with_len.
The only exception I see in CVS HEAD is sqlchar_to_unicode() in xml.c,
which is new code as of 8.3devel anyway.  So we don't appear to have
a serious security issue as I first feared --- the worst that can
happen is a crash like this when iwchareq's return address is
overwritten with a zero, and even that is only in pre-8.2.

Nonetheless, the code is certainly giving wrong answers for 4-byte
characters.  Will go fix...

            regards, tom lane

Re: BUG #2895: Private Use Unicode character crashes server when using ILIKE

От
Tom Lane
Дата:
I wrote:
> Nonetheless, the code is certainly giving wrong answers for 4-byte
> characters.  Will go fix...

I've applied the attached patch for 8.1, and related patches in all
supported branches.

            regards, tom lane


Index: wchar.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/mb/wchar.c,v
retrieving revision 1.47.2.4
diff -c -r1.47.2.4 wchar.c
*** wchar.c    22 Aug 2006 12:11:38 -0000    1.47.2.4
--- wchar.c    24 Jan 2007 16:16:27 -0000
***************
*** 345,362 ****
  }

  /*
!  * convert UTF8 string to pg_wchar (UCS-2)
!  * caller should allocate enough space for "to"
   * len: length of from.
   * "from" not necessarily null terminated.
   */
  static int
  pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
  {
-     unsigned char c1,
-                 c2,
-                 c3;
      int            cnt = 0;

      while (len > 0 && *from)
      {
--- 345,363 ----
  }

  /*
!  * convert UTF8 string to pg_wchar (UCS-4)
!  * caller must allocate enough space for "to", including a trailing zero!
   * len: length of from.
   * "from" not necessarily null terminated.
   */
  static int
  pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
  {
      int            cnt = 0;
+     uint32        c1,
+                 c2,
+                 c3,
+                 c4;

      while (len > 0 && *from)
      {
***************
*** 365,390 ****
              *to = *from++;
              len--;
          }
!         else if ((*from & 0xe0) == 0xc0 && len >= 2)
          {
              c1 = *from++ & 0x1f;
              c2 = *from++ & 0x3f;
!             *to = c1 << 6;
!             *to |= c2;
              len -= 2;
          }
!         else if ((*from & 0xe0) == 0xe0 && len >= 3)
          {
              c1 = *from++ & 0x0f;
              c2 = *from++ & 0x3f;
              c3 = *from++ & 0x3f;
!             *to = c1 << 12;
!             *to |= c2 << 6;
!             *to |= c3;
              len -= 3;
          }
          else
          {
              *to = *from++;
              len--;
          }
--- 366,404 ----
              *to = *from++;
              len--;
          }
!         else if ((*from & 0xe0) == 0xc0)
          {
+             if (len < 2)
+                 break;            /* drop trailing incomplete char */
              c1 = *from++ & 0x1f;
              c2 = *from++ & 0x3f;
!             *to = (c1 << 6) | c2;
              len -= 2;
          }
!         else if ((*from & 0xf0) == 0xe0)
          {
+             if (len < 3)
+                 break;            /* drop trailing incomplete char */
              c1 = *from++ & 0x0f;
              c2 = *from++ & 0x3f;
              c3 = *from++ & 0x3f;
!             *to = (c1 << 12) | (c2 << 6) | c3;
              len -= 3;
          }
+         else if ((*from & 0xf8) == 0xf0)
+         {
+             if (len < 4)
+                 break;            /* drop trailing incomplete char */
+             c1 = *from++ & 0x07;
+             c2 = *from++ & 0x3f;
+             c3 = *from++ & 0x3f;
+             c4 = *from++ & 0x3f;
+             *to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4;
+             len -= 4;
+         }
          else
          {
+             /* treat a bogus char as length 1; not ours to raise error */
              *to = *from++;
              len--;
          }
***************
*** 396,407 ****
  }

  /*
!  * returns the byte length of a UTF8 character pointed to by s
   */
  int
  pg_utf_mblen(const unsigned char *s)
  {
!     int            len = 1;

      if ((*s & 0x80) == 0)
          len = 1;
--- 410,429 ----
  }

  /*
!  * Return the byte length of a UTF8 character pointed to by s
!  *
!  * Note: in the current implementation we do not support UTF8 sequences
!  * of more than 4 bytes; hence do NOT return a value larger than 4.
!  * We return "1" for any leading byte that is either flat-out illegal or
!  * indicates a length larger than we support.
!  *
!  * pg_utf2wchar_with_len(), utf2ucs(), pg_utf8_islegal(), and perhaps
!  * other places would need to be fixed to change this.
   */
  int
  pg_utf_mblen(const unsigned char *s)
  {
!     int            len;

      if ((*s & 0x80) == 0)
          len = 1;
***************
*** 411,421 ****
          len = 3;
      else if ((*s & 0xf8) == 0xf0)
          len = 4;
      else if ((*s & 0xfc) == 0xf8)
          len = 5;
      else if ((*s & 0xfe) == 0xfc)
          len = 6;
!     return (len);
  }

  static int
--- 433,447 ----
          len = 3;
      else if ((*s & 0xf8) == 0xf0)
          len = 4;
+ #ifdef NOT_USED
      else if ((*s & 0xfc) == 0xf8)
          len = 5;
      else if ((*s & 0xfe) == 0xfc)
          len = 6;
! #endif
!     else
!         len = 1;
!     return len;
  }

  static int