Re: multibyte charater set in levenshtein function

From Alexander Korotkov
Subject Re: multibyte charater set in levenshtein function
Msg-id AANLkTilE4gCI4NnYhe3MxBurK9OiESYju8POczysrVcv@mail.gmail.com
In reply to Re: multibyte charater set in levenshtein function  (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
Responses Re: multibyte charater set in levenshtein function  (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
List pgsql-hackers
Hi!

* levenshtein_internal() and levenshtein_less_equal_internal() are very
 similar. Can you merge the code? We can always use less_equal_internal()
 if the overhead is ignorable. Did you compare them?
With a large value of max_d the overhead is significant. Here is an example using the American English dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum  
---------
 1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
   sum  
---------
 1386456
(1 row)

Time: 317,821 ms
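
To illustrate why a bounded variant carries extra per-row work that only pays off when the bound is small enough to prune, here is a minimal byte-wise sketch. It is not the code from the patch: the name lev_less_equal_sketch is hypothetical, and the early exit used here (stop once every cell in a row exceeds max_d) is a simpler rule than whatever band restriction the patch may apply.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the patch: byte-wise Levenshtein
 * distance that gives up once the result must exceed max_d.
 * Returns the exact distance if it is <= max_d, otherwise max_d + 1.
 */
static int
lev_less_equal_sketch(const char *s, const char *t, int max_d)
{
    int   m = (int) strlen(s);
    int   n = (int) strlen(t);
    int  *rows, *prev, *curr, *temp;
    int   i, j, result;

    assert(max_d >= 0);

    /* The length difference is a lower bound on the distance. */
    if (abs(m - n) > max_d)
        return max_d + 1;

    /* One allocation holding both rows of the DP table. */
    rows = malloc(2 * (size_t) (m + 1) * sizeof(int));
    if (rows == NULL)
        abort();
    prev = rows;
    curr = rows + (m + 1);

    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (j = 1; j <= n; j++)
    {
        int row_min = j;

        curr[0] = j;
        for (i = 1; i <= m; i++)
        {
            int sub = prev[i - 1] + (s[i - 1] != t[j - 1]);
            int del = prev[i] + 1;
            int ins = curr[i - 1] + 1;
            int best = (sub < del) ? sub : del;

            if (ins < best)
                best = ins;
            curr[i] = best;
            if (best < row_min)
                row_min = best;
        }

        /*
         * The final distance is never smaller than the minimum of any row,
         * so once a whole row exceeds max_d the answer is known.
         */
        if (row_min > max_d)
        {
            free(rows);
            return max_d + 1;
        }

        temp = curr;
        curr = prev;
        prev = temp;
    }

    result = prev[m];
    free(rows);
    return (result > max_d) ? max_d + 1 : result;
}
```

With a bound such as 100 on dictionary words, neither the length check nor the row check ever fires, so the bookkeeping is pure overhead, which is consistent with the slower timing above.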
 
* There are many "if (!multibyte_encoding)" in levenshtein_internal().
 How about split the function into two funcs for single-byte chars and
 multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
 use multi-byte version if the overhead is small.
The multi-byte version was about 4 times slower. I have since rewritten my CHAR_CMP macro as an inline function, and now it is only about 1.5 times slower.
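
For readers who have not seen the patch, a character comparison written as an inline function might look roughly like the sketch below. The name mb_char_same and its signature are assumptions made for illustration; the actual replacement for CHAR_CMP in the patch may differ.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the rewritten CHAR_CMP: two multi-byte
 * characters are equal when they have the same byte length and the
 * same bytes.
 */
static inline bool
mb_char_same(const char *a, int a_len, const char *b, int b_len)
{
    if (a_len != b_len)
        return false;

    /* Fast path: most characters in typical text are a single byte. */
    if (a_len == 1)
        return *a == *b;

    return memcmp(a, b, (size_t) a_len) == 0;
}
```

The single-byte fast path avoids a memcmp call for ASCII characters, which is one plausible way such a function could narrow the gap to the single-byte code path.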

In a database with multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word  
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word  
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Either way, I think the overhead is not negligible. That's why I have split levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb.
 
* I prefer a struct rather than an array.  "4 * m" and "3 * m" look like magic
 numbers for me. Could you name the entries with definition of a struct?
   /*
    * For multibyte encoding we'll also store array of lengths of
    * characters and array with character offsets in first string
    * in order to avoid great number of pg_mblen calls.
    */
   prev = (int *) palloc(4 * m * sizeof(int));
In this line of code, memory is allocated for 4 arrays: prev, curr, offsets, and char_lens. So I have joined offsets and char_lens into a struct. But I can't join prev and curr because of this trick:
        temp = curr;
        curr = prev;
        prev = temp;
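
The layout described above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: CharInfo, utf8_len, and lev_mb_sketch are hypothetical names, utf8_len stands in for pg_mblen, and the input is assumed to be valid UTF-8. Offsets and lengths of the first string's characters are computed once and kept together in a struct, while prev and curr stay separate pointers so the rows can be swapped without copying.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct: offset and byte length of one character of s. */
typedef struct
{
    int offset;
    int len;
} CharInfo;

/* Stand-in for pg_mblen(); assumes valid UTF-8 input. */
static int
utf8_len(const char *p)
{
    unsigned char c = (unsigned char) *p;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Character-wise (not byte-wise) Levenshtein distance, two-row version. */
static int
lev_mb_sketch(const char *s, const char *t)
{
    int         s_bytes = (int) strlen(s);
    int         m = 0;          /* number of characters in s */
    int         i, off, y_len, result;
    const char *y;
    CharInfo   *chars;
    int        *rows, *prev, *curr, *temp;

    /* Scan s once so the inner loop never has to re-measure characters. */
    chars = malloc((size_t) (s_bytes + 1) * sizeof(CharInfo));
    if (chars == NULL)
        abort();
    for (off = 0; off < s_bytes; off += chars[m - 1].len)
    {
        chars[m].offset = off;
        chars[m].len = utf8_len(s + off);
        m++;
    }

    rows = malloc(2 * (size_t) (m + 1) * sizeof(int));
    if (rows == NULL)
        abort();
    prev = rows;
    curr = rows + (m + 1);
    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (y = t; *y != '\0'; y += y_len)
    {
        y_len = utf8_len(y);
        curr[0] = prev[0] + 1;
        for (i = 1; i <= m; i++)
        {
            const CharInfo *c = &chars[i - 1];
            int same = (c->len == y_len &&
                        memcmp(s + c->offset, y, (size_t) y_len) == 0);
            int best = prev[i - 1] + (same ? 0 : 1);

            if (prev[i] + 1 < best)
                best = prev[i] + 1;
            if (curr[i - 1] + 1 < best)
                best = curr[i - 1] + 1;
            curr[i] = best;
        }

        /* Swap the rows by exchanging pointers; nothing is copied. */
        temp = curr;
        curr = prev;
        prev = temp;
    }

    result = prev[m];
    free(rows);
    free(chars);
    return result;
}
```

Because prev and curr trade roles after every row, they have to remain independent pointers, whereas the offset and length of a character are always read together and fit naturally in one struct.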

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function
Fixed.

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.
Fixed.
 
* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encoding!  We should
mention which function supports mbchars not to confuse users.
I've updated this caution. I've also added documentation for the levenshtein_less_equal function.

* (Not an issue for the patch, but...)
 Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
 PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
 Unpacking versions make the core a bit faster.
 Fixed.

With best regards,
Alexander Korotkov.