Re: multibyte charater set in levenshtein function
От | Alexander Korotkov |
---|---|
Тема | Re: multibyte charater set in levenshtein function |
Дата | |
Msg-id | AANLkTilQ5bzoOcA_A6mRC0U8BOyXcmwsd4BzZc_0QwlK@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: multibyte charater set in levenshtein function (Itagaki Takahiro <itagaki.takahiro@gmail.com>) |
Список | pgsql-hackers |
Such version with macros and includes can look like this:<br /><br />#ifdef MULTIBYTE <br />#define NEXT_X (x+= char_lens[i-1])<br/>#define NEXT_Y (y+= y_char_len)<br />#define CMP (char_cmp(x, char_lens[i-1], y, y_char_len))<br />#else<br/> #define NEXT_X (x++)<br />#define NEXT_Y (y++)<br />#define CMP (*x == *y)<br />#endif<br /><br />static int<br/>levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)<br />{<br /> int m,<br /> n,<br /> d;<br /> int *prev;<br /> int *curr;<br /> int i,<br /> j;<br /> const char *x;<br /> const char *y;<br />#ifdef MULTIBYTE <br /> int *char_lens;<br /> int y_char_len;<br />#endif<br /><br /> /*<br /> * We should calculatenumber of characters for multibyte encodings<br /> */<br /> m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));<br/> n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));<br /><br /> /*<br /> * We can transform an empty s into t with n insertions, or a non-empty t<br /> * into an empty s with m deletions.<br/> */<br /> if (m == 0)<br /> return n * ins_c;<br /> if (n == 0)<br /> return m * del_c;<br/><br /> /*<br /> * For security concerns, restrict excessive CPU+RAM usage. (This<br /> * implementationuses O(m) memory and has O(mn) complexity.)<br /> */<br /> if (m > MAX_LEVENSHTEIN_STRLEN ||<br /> n > MAX_LEVENSHTEIN_STRLEN)<br /> ereport(ERROR,<br /> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),<br/> errmsg("argument exceeds the maximum length of %d bytes",<br/> MAX_LEVENSHTEIN_STRLEN)));<br /><br /> /* One more cell for initialization columnand row. */<br /> ++m;<br /> ++n;<br /><br /> /*<br /> * Instead of building an (m+1)x(n+1) array, we'lluse two different<br /> * arrays of size m+1 for storing accumulated values. At each step one<br /> * representsthe "previous" row and one is the "current" row of the<br /> * notional large array.<br /> * For multibyteencoding we'll also store array of lengths of<br /> * characters of first string in order to avoid great numberof<br /> * pg_mblen calls.<br /> */<br />#ifdef MULTIBYTE <br /> prev = (int *) palloc(3 * m * sizeof(int));<br/> curr = prev + m;<br /> char_lens = prev + 2 * m;<br /> for (i = 0, x = VARDATA_ANY(s); i <m - 1; i++)<br /> {<br /> char_lens[i] = pg_mblen(x);<br /> x += char_lens[i];<br /> }<br /> char_lens[i] = 0;<br />#else<br /> prev = (int *) palloc(2 * m * sizeof(int));<br /> curr = prev + m;<br />#endif<br/><br /> /* Initialize the "previous" row to 0..cols */<br /> for (i = 0; i < m; i++)<br /> prev[i]= i * del_c;<br /><br /> /*<br /> * For multibyte encoding we should increase x and y pointers by the<br /> * results of pg_mblen function. Also we should use CHAR_CMP macros<br /> * for character comparison.<br /> */<br /> for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)<br /> {<br /> int *temp;<br/>#ifdef MULTIBYTE <br /> y_char_len = pg_mblen(y);<br />#endif<br /><br /> curr[0] = j * ins_c;<br/><br /> for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)<br /> {<br /> int ins;<br /> int del;<br /> int sub;<br /><br /> ins = prev[i]+ ins_c;<br /> del = curr[i - 1] + del_c;<br /> sub = prev[i - 1] + (CMP ? 0 : sub_c); <br /> curr[i] = Min(ins, del);<br /> curr[i] = Min(curr[i], sub);<br /> }<br /><br /> temp= curr;<br /> curr = prev;<br /> prev = temp;<br /> }<br /><br /> d = prev[m - 1];<br /><br /> /*<br /> * Because the final value was swapped from the previous row to the<br /> * current row, that's wherewe'll find it.<br /> */<br /> return d;<br />}<br /><br />What do you thing about it?<br /><br />----<br />Withbest regards,<br />Alexander Korotkov.<br />
В списке pgsql-hackers по дате отправления: