Обсуждение: BUG #2895: Private Use Unicode character crashes server when using ILIKE
The following bug has been logged online: Bug reference: 2895 Logged by: James Russell Email address: internationalhobo@gmail.com PostgreSQL version: 8.1.4 Operating system: Linux (Fedora Core 5) Description: Private Use Unicode character crashes server when using ILIKE Details: A UTF-8 text field contains (among others) the following Unicode character: 0xf4808286 in UTF-8 (== 0x00100086 in UTF-32) This corresponds to a character in the Unicode "Private Use" area, where the codepoints are undefined and designated for proprietary uses (0x100000-0x10FFFD). If a text field contains a such a character, and if the search term contains certain Unicode characters, then queries using ILIKE will crash the server. We discovered this bug when a user searched for rows containing the "TM" character (UTF-8 0xE284A2 / UTF-32 0x2122), which translated to the following query: SELECT * FROM foo WHERE bar ILIKE('%â¢%'); If the rows searched contain the row with the Private Use character, then the server crashes during the query and psql returns the following: "psql:recreatebug.sql:8: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql:recreatebug.sql:8: connection to server was lost" This bug is 100% reproducible on multiple servers with different configurations. Our server configuration does not have any changes from default. The following PHP script will generate a psql script that can reproduce the problem. Run this script to generate the SQL script, then run the SQL script in psql. The server will crash when the SELECT statement is run. <?php $sql = " \c template1 CREATE DATABASE temp_db WITH ENCODING 'UNICODE'; \c temp_db BEGIN; CREATE TABLE foo (bar text); INSERT INTO foo (bar) VALUES ('".chr(0xf4).chr(0x80).chr(0x82).chr(0x86)."'); SELECT * FROM foo WHERE bar ILIKE('%".chr(0xe2).chr(0x84).chr(0xa2)."%'); \c template1 DROP DATABASE temp_db; "; file_put_contents('recreatebug.sql', $sql); Additional notes: - The bug does not appear to occur when using LIKE, just ILIKE. - The bug does not appear to occur when searching for plain ASCII strings like '%abc%'. - I have not yet tried to reproduce the bug on the latest Postgres 8.2.x
On Tue, Jan 16, 2007 at 06:16:22AM +0000, James Russell wrote: > Description: Private Use Unicode character crashes server when using ILIKE The archives show that ILIKE is known to be broken with multibyte characters in 8.1 and earlier, although I don't recall seeing reports of a crash resulting. I got a crash in 8.1.6 built from the latest source in CVS; here's a partial stack trace: (gdb) bt #0 MBMatchTextIC (t=0x2a98613d1c "�\200\202\206", tlen=4, p=0x0, plen=4) at like_match.c:195 #1 0x00000000005ae558 in texticlike (fcinfo=Variable "fcinfo" is not available. ) at like.c:355 I wonder if this is a problem only with code points outside of Plane 0, viz., those with UTF-8 sequences longer than three bytes. I don't get a crash with U+FFFD (E'\357\277\275') but I do with U+10000 (E'\360\220\200\200') and other four-byte sequences. > - I have not yet tried to reproduce the bug on the latest Postgres 8.2.x It appears to work in 8.2.1; at least it didn't crash. The 8.2 Release Notes contain the following item: * Allow ILIKE to work for multi-byte encodings (Tom) Internally, ILIKE now calls lower() and then uses LIKE. Locale-specific regular expression patterns still do not work in these encodings. -- Michael Fuhr
"James Russell" <internationalhobo@gmail.com> writes: > If a text field contains a such a character, and if the search term contains > certain Unicode characters, then queries using ILIKE will crash the server. I can't reproduce this with your example, so I suppose there's some critical bit of information you've left out. Can you provide a stack trace from the crash? regards, tom lane
On Tue, Jan 23, 2007 at 09:09:52PM -0500, Tom Lane wrote: > "James Russell" <internationalhobo@gmail.com> writes: > > If a text field contains a such a character, and if the search term contains > > certain Unicode characters, then queries using ILIKE will crash the server. > > I can't reproduce this with your example, so I suppose there's some > critical bit of information you've left out. Can you provide a stack > trace from the crash? This might be platform-dependent -- I get a crash with an 8.1.6 UTF-8 database on RHEL AS 4 x86_64 but not on Solaris 9/sparc 32-bit. I don't get a crash on either platform with 8.2.1 or 8.3devel. The crash seems to happen only with code points that have UTF-8 byte sequences of more than three bytes (U+10000 and higher). Here's an example that segfaults in 8.1.6 on the Linux box: select e'\360\220\200\200' ilike e'%\342\204\242'; #0 MBMatchTextIC (t=0x8ebcbc "�\220\200\200�7\212", tlen=4, p=0x0, plen=3) at like_match.c:195 195 if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_')) (gdb) bt #0 MBMatchTextIC (t=0x8ebcbc "�\220\200\200�7\212", tlen=4, p=0x0, plen=3) at like_match.c:195 #1 0x00000000005ae558 in texticlike (fcinfo=Variable "fcinfo" is not available. ) at like.c:355 #2 0x0000000000501044 in ExecMakeFunctionResult (fcache=0x90b530, econtext=0x90b400, isNull=0x90bdb0 "", isDone=0x90bdd0)at execQual.c:1095 #3 0x0000000000504c23 in ExecProject (projInfo=0x90bbf0, isDone=0x7fbfffdd74) at execQual.c:3704 #4 0x000000000050f64a in ExecResult (node=0x90b2e8) at nodeResult.c:157 #5 0x00000000004ff90d in ExecProcNode (node=0x90b2e8) at execProcnode.c:306 #6 0x00000000004feb8a in ExecutorRun (queryDesc=Variable "queryDesc" is not available. ) at execMain.c:1122 #7 0x000000000058857e in PortalRunSelect (portal=0x908ae8, forward=Variable "forward" is not available. ) at pquery.c:794 #8 0x0000000000588b7f in PortalRun (portal=0x908ae8, count=9223372036854775807, dest=0x8ec510, altdest=0x8ec510, completionTag=0x7fbfffe0f0"") at pquery.c:646 #9 0x0000000000584846 in exec_simple_query (query_string=0x8eb4e8 "select e'\\360\\220\\200\\200' ilike e'%\\342\\204\\242';")at postgres.c:1004 #10 0x00000000005864ee in PostgresMain (argc=4, argv=0x8844a8, username=0x884390 "mfuhr") at postgres.c:3232 #11 0x000000000055c31a in ServerLoop () at postmaster.c:2863 #12 0x000000000055d90a in PostmasterMain (argc=5, argv=0x8828c0) at postmaster.c:941 #13 0x000000000051d5e3 in main (argc=5, argv=0x8828c0) at main.c:265 -- Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes: > The crash seems to happen only with code points that have UTF-8 > byte sequences of more than three bytes (U+10000 and higher). > Here's an example that segfaults in 8.1.6 on the Linux box: > select e'\360\220\200\200' ilike e'%\342\204\242'; Well, tracing through it I still don't get a crash, but I see where it's stomping on stack :-(. pg_utf_mblen() and pg_utf8_islegal() allow 4-byte utf8 sequences, but pg_utf2wchar_with_len() only supports 3! I'm surprised we've not seen reports of other problems. The lack of crash in 8.2 is because iwchareq went away completely, but that's certainly not the only place doing conversions to pg_wchar and expecting the result to fit in a prespecified buffer ... regards, tom lane
I wrote: > I'm surprised we've not seen reports of other problems. The lack of > crash in 8.2 is because iwchareq went away completely, but that's > certainly not the only place doing conversions to pg_wchar and expecting > the result to fit in a prespecified buffer ... Actually, it looks like all the other places that use mb2wchar allocate a buffer with more pg_wchar's than there are bytes in the input, and so overrun is impossible even with the breakage in pg_utf2wchar_with_len. The only exception I see in CVS HEAD is sqlchar_to_unicode() in xml.c, which is new code as of 8.3devel anyway. So we don't appear to have a serious security issue as I first feared --- the worst that can happen is a crash like this when iwchareq's return address is overwritten with a zero, and even that is only in pre-8.2. Nonetheless, the code is certainly giving wrong answers for 4-byte characters. Will go fix... regards, tom lane
I wrote: > Nonetheless, the code is certainly giving wrong answers for 4-byte > characters. Will go fix... I've applied the attached patch for 8.1, and related patches in all supported branches. regards, tom lane Index: wchar.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/mb/wchar.c,v retrieving revision 1.47.2.4 diff -c -r1.47.2.4 wchar.c *** wchar.c 22 Aug 2006 12:11:38 -0000 1.47.2.4 --- wchar.c 24 Jan 2007 16:16:27 -0000 *************** *** 345,362 **** } /* ! * convert UTF8 string to pg_wchar (UCS-2) ! * caller should allocate enough space for "to" * len: length of from. * "from" not necessarily null terminated. */ static int pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len) { - unsigned char c1, - c2, - c3; int cnt = 0; while (len > 0 && *from) { --- 345,363 ---- } /* ! * convert UTF8 string to pg_wchar (UCS-4) ! * caller must allocate enough space for "to", including a trailing zero! * len: length of from. * "from" not necessarily null terminated. */ static int pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len) { int cnt = 0; + uint32 c1, + c2, + c3, + c4; while (len > 0 && *from) { *************** *** 365,390 **** *to = *from++; len--; } ! else if ((*from & 0xe0) == 0xc0 && len >= 2) { c1 = *from++ & 0x1f; c2 = *from++ & 0x3f; ! *to = c1 << 6; ! *to |= c2; len -= 2; } ! else if ((*from & 0xe0) == 0xe0 && len >= 3) { c1 = *from++ & 0x0f; c2 = *from++ & 0x3f; c3 = *from++ & 0x3f; ! *to = c1 << 12; ! *to |= c2 << 6; ! *to |= c3; len -= 3; } else { *to = *from++; len--; } --- 366,404 ---- *to = *from++; len--; } ! else if ((*from & 0xe0) == 0xc0) { + if (len < 2) + break; /* drop trailing incomplete char */ c1 = *from++ & 0x1f; c2 = *from++ & 0x3f; ! *to = (c1 << 6) | c2; len -= 2; } ! else if ((*from & 0xf0) == 0xe0) { + if (len < 3) + break; /* drop trailing incomplete char */ c1 = *from++ & 0x0f; c2 = *from++ & 0x3f; c3 = *from++ & 0x3f; ! *to = (c1 << 12) | (c2 << 6) | c3; len -= 3; } + else if ((*from & 0xf8) == 0xf0) + { + if (len < 4) + break; /* drop trailing incomplete char */ + c1 = *from++ & 0x07; + c2 = *from++ & 0x3f; + c3 = *from++ & 0x3f; + c4 = *from++ & 0x3f; + *to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4; + len -= 4; + } else { + /* treat a bogus char as length 1; not ours to raise error */ *to = *from++; len--; } *************** *** 396,407 **** } /* ! * returns the byte length of a UTF8 character pointed to by s */ int pg_utf_mblen(const unsigned char *s) { ! int len = 1; if ((*s & 0x80) == 0) len = 1; --- 410,429 ---- } /* ! * Return the byte length of a UTF8 character pointed to by s ! * ! * Note: in the current implementation we do not support UTF8 sequences ! * of more than 4 bytes; hence do NOT return a value larger than 4. ! * We return "1" for any leading byte that is either flat-out illegal or ! * indicates a length larger than we support. ! * ! * pg_utf2wchar_with_len(), utf2ucs(), pg_utf8_islegal(), and perhaps ! * other places would need to be fixed to change this. */ int pg_utf_mblen(const unsigned char *s) { ! int len; if ((*s & 0x80) == 0) len = 1; *************** *** 411,421 **** len = 3; else if ((*s & 0xf8) == 0xf0) len = 4; else if ((*s & 0xfc) == 0xf8) len = 5; else if ((*s & 0xfe) == 0xfc) len = 6; ! return (len); } static int --- 433,447 ---- len = 3; else if ((*s & 0xf8) == 0xf0) len = 4; + #ifdef NOT_USED else if ((*s & 0xfc) == 0xf8) len = 5; else if ((*s & 0xfe) == 0xfc) len = 6; ! #endif ! else ! len = 1; ! return len; } static int