Re: multibyte charater set in levenshtein function

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: multibyte charater set in levenshtein function
Дата
Msg-id AANLkTimpvXnGZdJjDf3iTKq8=FnQB00LK6tPAb1uJfBr@mail.gmail.com
обсуждение исходный текст
Ответ на Re: multibyte charater set in levenshtein function  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I reviewed this code in a fair amount of detail today and ended up
rewriting it.  In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch.  This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed.  It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code.  
I'm sorry. This is my first patch, I will be more accurate next time. But, I think there is no unified opinion about some changes like replacement "!m" with "m==0".

I think this line is not correct:
"if (m != s_bytes && n != t_bytes)"
The correct negation of assumption, that both strings are single-byte, is the assumption, that at least one string is not single-byte. In this patch levenshtein function can calculate distance incorrectly:
test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)
This line should be rewritten by this.
"if (m != s_bytes || n != t_bytes)"

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters).  It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present.  It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases.  It's also quite
a bit less code duplication.

Can I look at the test which shows that your implementation is faster than my when multi-byte characters are present? My tests show reverse. For example, with russian dictionary of openoffice:

With my version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum  
---------
 1675281
(1 row)

Time: 277,190 ms

With your version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum  
---------
 1675281
(1 row)

Time: 398,011 ms

I think that the cause of slow down slowdown is memcmp function. This function is very fast for long parts of memory, but of few bytes inline function is faster, because compiler have more freedom for optimization. After replacing memcmp with inline function like following in your implementation:

static inline bool char_cmp(const char *s, const char *d, int len)
{
    while (len--)
    {
        if (*s++ != *d++)
            return false;
    }
    return true;
}

Code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum  
---------
 1675281
(1 row)

Time: 241,272 ms

----
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Yeb Havinga
Дата:
Сообщение: Re: tracking inherited columns (was: patch for check constraints using multiple inheritance)
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: multibyte charater set in levenshtein function