Re: multibyte charater set in levenshtein function

From: Robert Haas
Subject: Re: multibyte charater set in levenshtein function
Date:
Msg-id: AANLkTim0ZZs8zp6DM7qowV4jwyYhy1Dy+_fN6ApEWBf5@mail.gmail.com
In reply to: Re: multibyte charater set in levenshtein function (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
List: pgsql-hackers
2010/8/2 Alexander Korotkov <aekorotkov@gmail.com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it.  In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch.  This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed.  It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry; this is my first patch, and I will be more careful next
> time. But I think there is no consensus about some of these changes,
> such as replacing "!m" with "m == 0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters).  It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present.  It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases.  It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster
> than mine when multi-byte characters are present? My tests show the
> reverse. For example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters).  I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran in about 102 ms, 222
ms, and 129 ms.  With your patch, I get about 134 ms, 222 ms, and 130
ms.  Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both.  I don't have a dictionary handy with multi-byte
words in it.  Can you send me a file?

> I think the cause of the slowdown is the memcmp function. It is very
> fast for long regions of memory, but for a few bytes an inline
> function is faster, because the compiler has more freedom to optimize.
> After replacing memcmp with an inline function like the following in
> your implementation:

Yeah, that does seem to be slightly better.  I've done a version of
this in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments

In pgsql-hackers by date:

Previous
From: Yeb Havinga
Date:
Message: Re: patch for check constraints using multiple inheritance
Next
From: Marc Cousin
Date:
Message: Re: lock_timeout GUC patch - Review