Re: multibyte charater set in levenshtein function

Поиск

Список

Период

Сортировка

От	Alexander Korotkov
Тема	Re: multibyte charater set in levenshtein function
Дата	22 июля 2010 г. 11:16:11
Msg-id	AANLkTilQ5bzoOcA_A6mRC0U8BOyXcmwsd4BzZc_0QwlK@mail.gmail.com обсуждение исходный текст
Ответ на	Re: multibyte charater set in levenshtein function (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

Such version with macros and includes can look like this:<br /><br />#ifdef MULTIBYTE <br />#define NEXT_X (x+=
char_lens[i-1])<br/>#define NEXT_Y (y+= y_char_len)<br />#define CMP (char_cmp(x, char_lens[i-1], y, y_char_len))<br
/>#else<br/> #define NEXT_X (x++)<br />#define NEXT_Y (y++)<br />#define CMP (*x == *y)<br />#endif<br /><br />static
int<br/>levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)<br />{<br />    int            m,<br
/>               n,<br />                 d;<br />    int           *prev;<br />    int           *curr;<br />   
int           i,<br />                j;<br />    const char *x;<br />    const char *y;<br />#ifdef MULTIBYTE <br
/>   int           *char_lens;<br />     int        y_char_len;<br />#endif<br /><br />    /*<br />     * We should
calculatenumber of characters for multibyte encodings<br />     */<br />    m = pg_mbstrlen_with_len(VARDATA_ANY(s),
VARSIZE_ANY_EXHDR(s));<br/>    n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));<br /><br />    /*<br
/>    * We can transform an empty s into t with n insertions, or a non-empty t<br />     * into an empty s with m
deletions.<br/>     */<br />    if (m == 0)<br />        return n * ins_c;<br />    if (n == 0)<br />        return m *
del_c;<br/><br />    /*<br />     * For security concerns, restrict excessive CPU+RAM usage. (This<br />     *
implementationuses O(m) memory and has O(mn) complexity.)<br />     */<br />    if (m > MAX_LEVENSHTEIN_STRLEN ||<br
/>       n > MAX_LEVENSHTEIN_STRLEN)<br />         ereport(ERROR,<br />               
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),<br/>                 errmsg("argument exceeds the maximum length of %d
bytes",<br/>                        MAX_LEVENSHTEIN_STRLEN)));<br /><br />    /* One more cell for initialization
columnand row. */<br />    ++m;<br />    ++n;<br /><br />    /*<br />     * Instead of building an (m+1)x(n+1) array,
we'lluse two different<br />     * arrays of size m+1 for storing accumulated values. At each step one<br />      *
representsthe "previous" row and one is the "current" row of the<br />     * notional large array.<br />     * For
multibyteencoding we'll also store array of lengths of<br />     * characters of first string in order to avoid great
numberof<br />       * pg_mblen calls.<br />     */<br />#ifdef MULTIBYTE <br />    prev = (int *) palloc(3 * m *
sizeof(int));<br/>    curr = prev + m;<br />    char_lens = prev + 2 * m;<br />    for (i = 0, x = VARDATA_ANY(s); i
<m - 1; i++)<br />     {<br />        char_lens[i] = pg_mblen(x);<br />        x += char_lens[i];<br />    }<br
/>   char_lens[i] = 0;<br />#else<br />    prev = (int *) palloc(2 * m * sizeof(int));<br />    curr = prev + m;<br
/>#endif<br/><br />    /* Initialize the "previous" row to 0..cols */<br />     for (i = 0; i < m; i++)<br />       
prev[i]= i * del_c;<br /><br />    /*<br />     * For multibyte encoding we should increase x and y pointers by the<br
/>    * results of pg_mblen function. Also we should use CHAR_CMP macros<br />      * for character comparison.<br
/>    */<br />    for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)<br />    {<br />        int          
*temp;<br/>#ifdef MULTIBYTE <br />        y_char_len = pg_mblen(y);<br />#endif<br /><br />         curr[0] = j *
ins_c;<br/><br />        for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)<br />        {<br />            int   
       ins;<br />            int            del;<br />            int            sub;<br /><br />             ins =
prev[i]+ ins_c;<br />            del = curr[i - 1] + del_c;<br />            sub = prev[i - 1] + (CMP ? 0 : sub_c); <br
/>           curr[i] = Min(ins, del);<br />            curr[i] = Min(curr[i], sub);<br />        }<br /><br />       
temp= curr;<br />        curr = prev;<br />        prev = temp;<br />    }<br /><br />    d = prev[m - 1];<br /><br
/>   /*<br />     * Because the final value was swapped from the previous row to the<br />     * current row, that's
wherewe'll find it.<br />      */<br />    return d;<br />}<br /><br />What do you thing about it?<br /><br />----<br
/>Withbest regards,<br />Alexander Korotkov.<br />

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: multibyte charater set in levenshtein function