Re: multibyte charater set in levenshtein function

Поиск

Список

Период

Сортировка

От	Alexander Korotkov
Тема	Re: multibyte charater set in levenshtein function
Дата	13 июля 2010 г. 12:05:58
Msg-id	AANLkTilE4gCI4NnYhe3MxBurK9OiESYju8POczysrVcv@mail.gmail.com обсуждение исходный текст
Ответ на	Re: multibyte charater set in levenshtein function (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
Ответы	Re: multibyte charater set in levenshtein function
Список	pgsql-hackers

Дерево обсуждения

Hi!

* levenshtein_internal() and levenshtein_less_equal_internal() are very
similar. Can you merge the code? We can always use less_equal_internal()
if the overhead is ignorable. Did you compare them?

With big value of max_d overhead is significant. Here is example on american-english dictionary from Openoffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
sum
---------
1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
sum
---------
1386456
(1 row)

Time: 317,821 ms

* There are many "if (!multibyte_encoding)" in levenshtein_internal().
How about split the function into two funcs for single-byte chars and
multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
use multi-byte version if the overhead is small.

The overhead of multi-byte version was about 4 times slower. But I have rewritten my CHAR_CMP macro with inline function. And now it's only about 1.5 times slower.

In database with muti-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 136,742 ms

In database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway I think that overhead is not ignorable. That's why I have splited levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb.

* I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
numbers for me. Could you name the entries with definition of a struct?
/*
* For multibyte encoding we'll also store array of lengths of
* characters and array with character offsets in first string
* in order to avoid great number of pg_mblen calls.
*/
prev = (int *) palloc(4 * m * sizeof(int));

I this line of code the memory is allocated for 4 arrays: prev, curr, offsets, char_lens. So I have joined offsets and char_lens into struct. But I can't join prev and curr because of this trick:
        temp = curr;
        curr = prev;
        prev = temp;

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function

Fixed.

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.

Fixed.

* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encoding! We should
mention which function supports mbchars not to confuse users.

I've updated this notification. Also I've added documentation for levenshtein_less_equal function.

* (Not an issue for the patch, but...)
Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
Unpacking versions make the core a bit faster.

Fixed.

With best regards,
Alexander Korotkov.

Вложения

fuzzystrmatch-0.4.diff.gz

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: multibyte charater set in levenshtein function

Вложения