Re: multibyte charater set in levenshtein function

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: multibyte charater set in levenshtein function
Дата
Msg-id AANLkTimZcAjwCJyRgwKaM3Ms_0ipO8NteVCnfgKCf7RL@mail.gmail.com
обсуждение исходный текст
Ответ на Re: multibyte charater set in levenshtein function  (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
Ответы Re: multibyte charater set in levenshtein function  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
2010/8/2 Alexander Korotkov <aekorotkov@gmail.com>:
> The dump of the table with russian dictionary is in attachment.
>
> I use following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of patch results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found some other optimization. With this condition:
> if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters results almost didn't change, but they come better
> with multi-byte characters. Generally, this improvement is because first
> bytes of different characters are frequently same (for example, almost all
> Cyrillic characters start from same byte, and I think that there is similar
> situation in other alphabets), but match of last bytes is just a
> coincidence.

Hey, that's really cool.  It helps a lot here, too:

My previous patch, with your 5 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 5 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character
first, same 5 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: patch for check constraints using multiple inheritance
Следующее
От: Mike Fowler
Дата:
Сообщение: Re: review: xml_is_well_formed