Re: multibyte character set in levenshtein function

From	Alexander Korotkov
Subject	Re: multibyte character set in levenshtein function
Date
Msg-id	AANLkTin8t2ZlF8AjAdQxa6-Wm9ZgiTinyfTEPkNwPgNt@mail.gmail.com
In reply to	Re: multibyte character set in levenshtein function  (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: multibyte character set in levenshtein function  (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers
On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> *scratches head*  Aren't you just moving the same call to a different place?

So, where can you find this different place? :) In this patch, null-terminated strings are not used at all.

> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.

Ok.

> In these case we should add many checks of max_d in levenshtein_internal
> function which make code more complex.

When you say "many" checks, how many?

> Actually, we can merge all four functions into one function. But such
> function will have many checks about multibyte encoding and max_d. So, I see
> four cases here:
> 1) one function with checks for multibyte encoding and max_d
> 2) two functions with checks for multibyte encoding
> 3) two functions with checks for max_d
> 4) four separate functions
> If you prefer case number 3 you should argue your position little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking if you've checked and whether there is an effect
and whether it's significant.  The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.
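For reference, the two-row dynamic-programming core that all four variants would share can be sketched as follows (a simplified single-byte, unit-cost version of my own, not code from the patch):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal single-byte Levenshtein distance using two rows of the
 * DP matrix; unit insertion/deletion/substitution costs assumed. */
static int
levenshtein_sketch(const char *s, const char *t)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev = malloc((m + 1) * sizeof(int));
    int    *curr = malloc((m + 1) * sizeof(int));
    int     i, j, d;

    /* First row: transforming a prefix of s into an empty string. */
    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (j = 1; j <= n; j++)
    {
        int    *tmp;

        curr[0] = j;            /* j insertions to build t's prefix */
        for (i = 1; i <= m; i++)
        {
            int     ins = curr[i - 1] + 1;
            int     del = prev[i] + 1;
            int     sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);

            d = ins < del ? ins : del;
            curr[i] = d < sub ? d : sub;
        }
        /* Swap rows: "current" becomes "previous" for the next pass. */
        tmp = curr;
        curr = prev;
        prev = tmp;
    }
    d = prev[m];                /* final value ended up in prev after swap */
    free(prev);
    free(curr);
    return d;
}
```

The multibyte and max_d variants differ only in how characters are stepped over and compared, and in whether out-of-bound cells are skipped, which is exactly the duplication being discussed.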

I've tested it with a big value of max_d, and I thought it was evident that checking for a negative value of max_d would not produce a significant benefit. Anyway, I tried to add a check for negative max_d to the levenshtein_less_equal_internal_mb function.

static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
                                   int ins_c, int del_c, int sub_c, int max_d)
{
    int            m,
                n;
    int           *prev;
    int           *curr;
    int            i,
                j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int        y_char_len;
    int        curr_left, curr_right, prev_left, prev_right, d;
    int        delta, min_d;


    /*
     * We have to count characters, which for multibyte encodings can be
     * fewer than the number of bytes.
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * We can find the minimal distance by the difference of lengths
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = - delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d characters",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for initialization column and row. */
    ++m;
    ++n;


    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     * For multibyte encodings we also store an array with the length and
     * offset of each character of the first string, in order to avoid a
     * great number of pg_mblen calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m );
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *)(prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
                lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;


    /* Initialize the "previous" row with minimal possible final distances. */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * The following optimizations are used:
     * 1) The value stored in a cell is actually the minimal possible final
     * distance, assuming all the remaining characters match. When moving
     * away from the diagonal which contains the last cell, the value is
     * increased by ins_c + del_c; when moving back towards that diagonal,
     * it is not.
     * 2) [prev_left, prev_right] is the range of indexes of the previous
     * row whose values do not exceed max_d, so only the cells connected
     * with this range have to be calculated.
     * For multibyte encodings we advance the x and y pointers by the
     * result of pg_mblen. We also use the char_cmp function for character
     * comparison.
     */
    for (y = t, j = 1; j < n; y += y_char_len, j++)
    {
        int           *temp;
        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }
        else
            prev_left = 1;

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else
        {
            if (j <= - delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset;
             i < m;
             x += lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;

                if (i <= prev_right)
                {
                    d = Min(prev[i] + ((j + delta > i)?(ins_c + del_c):0), d);
                }

                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j)?(ins_c + del_c):0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length, y, y_char_len) ? 0 : sub_c), d);
                }

                curr[i] = d;

                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }
            else
            {
                d = Min(Min(prev[i] + ((j + delta > i)?(ins_c + del_c):0), curr[i - 1] + ((i - delta > j)?(ins_c + del_c):0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length, y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled, then return max_d + 1; otherwise
     * return the final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}
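For completeness: the function above relies on a CharLengthAndOffset struct and a char_cmp helper whose definitions aren't shown here. A self-contained sketch of what they plausibly look like (the UTF-8-only length function is my stand-in for pg_mblen, not the patch's code):

```c
#include <stdbool.h>
#include <string.h>

/* Per-character byte length and byte offset within the first string,
 * precomputed once to avoid repeated pg_mblen() calls. */
typedef struct
{
    int     length;
    int     offset;
} CharLengthAndOffset;

/* Byte length of one UTF-8 character, assuming a valid leading byte;
 * a simplified stand-in for pg_mblen(). */
static int
utf8_char_len(const char *c)
{
    unsigned char b = (unsigned char) *c;

    if (b < 0x80)
        return 1;
    if (b < 0xE0)
        return 2;
    if (b < 0xF0)
        return 3;
    return 4;
}

/* Two multibyte characters are equal only if they have the same byte
 * length and the same bytes -- the likely shape of the patch's char_cmp. */
static bool
char_cmp(const char *x, int x_len, const char *y, int y_len)
{
    if (x_len != y_len)
        return false;
    return memcmp(x, y, x_len) == 0;
}
```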

I tested it with the american-english dictionary, which contains 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum  
---------
 1074376
(1 row)

Time: 131,435 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from words;
   sum  
---------
 1074376
(1 row)

Time: 221,078 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from words;
   sum  
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't turn out to be faster than with just a big value of max_d.
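To make the max_d semantics concrete: as soon as every cell of the current row exceeds max_d, the final distance can no longer come in under the bound, so the function can return max_d + 1 immediately. A minimal single-byte, unit-cost sketch of that early exit (my own simplification, not the patch):

```c
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance where any result above max_d is reported as
 * max_d + 1; a row whose minimum already exceeds max_d ends the search.
 * A negative max_d disables the bound entirely. */
static int
levenshtein_leq_sketch(const char *s, const char *t, int max_d)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev = malloc((m + 1) * sizeof(int));
    int    *curr = malloc((m + 1) * sizeof(int));
    int     i, j, result;

    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (j = 1; j <= n; j++)
    {
        int    *tmp;
        int     row_min;

        curr[0] = j;
        row_min = curr[0];
        for (i = 1; i <= m; i++)
        {
            int     ins = curr[i - 1] + 1;
            int     del = prev[i] + 1;
            int     sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            int     d = ins < del ? ins : del;

            curr[i] = d < sub ? d : sub;
            if (curr[i] < row_min)
                row_min = curr[i];
        }
        /* Every remaining path only grows, so the bound can't be met. */
        if (max_d >= 0 && row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;
        }
        tmp = curr;
        curr = prev;
        prev = tmp;
    }
    result = prev[m];
    if (max_d >= 0 && result > max_d)
        result = max_d + 1;
    free(prev);
    free(curr);
    return result;
}
```

Note that when max_d covers the whole matrix (or is negative), the row_min bookkeeping is pure overhead, which matches the timings above.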

----
With best regards,
Alexander Korotkov.
