Discussion: multibyte character set in levenshtein function

multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
Hackers,

The current version of the levenshtein function in the fuzzystrmatch contrib module doesn't work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           2
(1 row)

My patch makes this function work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           1
(1 row)

It also avoids a text_to_cstring call.
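For illustration, here is a minimal standalone sketch (not the patch itself) of the character-wise approach: decode UTF-8 sequence lengths the way pg_mblen does and compare whole characters, so the two-byte 'ф' versus the two-byte 'а' counts as one substitution rather than two byte edits. All names here are invented for the sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* UTF-8 sequence length from the lead byte (stand-in for pg_mblen) */
static int
utf8_char_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80)
        return 1;
    if (c < 0xE0)
        return 2;
    if (c < 0xF0)
        return 3;
    return 4;
}

/* classic two-row dynamic programming, comparing whole characters */
static int
levenshtein_utf8(const char *s, const char *t)
{
    int         m = 0, n = 0, i, j, result;
    const char *p, *x, *y;
    int        *prev, *curr, *temp;

    for (p = s; *p; p += utf8_char_len(p))
        m++;
    for (p = t; *p; p += utf8_char_len(p))
        n++;

    prev = malloc((m + 1) * sizeof(int));
    curr = malloc((m + 1) * sizeof(int));

    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (y = t, j = 1; j <= n; y += utf8_char_len(y), j++)
    {
        int y_len = utf8_char_len(y);

        curr[0] = j;
        for (x = s, i = 1; i <= m; x += utf8_char_len(x), i++)
        {
            int x_len = utf8_char_len(x);
            int d = prev[i - 1] +
                ((x_len == y_len && memcmp(x, y, x_len) == 0) ? 0 : 1);

            if (prev[i] + 1 < d)
                d = prev[i] + 1;
            if (curr[i - 1] + 1 < d)
                d = curr[i - 1] + 1;
            curr[i] = d;
        }
        temp = curr;
        curr = prev;
        prev = temp;
    }

    result = prev[m];
    free(prev);
    free(curr);
    return result;
}

int
main(void)
{
    /* one substituted character => distance 1, not 2 */
    printf("%d\n", levenshtein_utf8("фыва", "аыва"));
    return 0;
}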

Regards,
Alexander Korotkov.
Attachments

Re: multibyte character set in levenshtein function

From:
Alvaro Herrera
Date:
Excerpts from Alexander Korotkov's message of Mon May 10 11:35:02 -0400 2010:
> Hackers,
> 
> The current version of the levenshtein function in the fuzzystrmatch contrib
> module doesn't work properly with multibyte character sets.

> My patch makes this function work properly with multibyte character sets.

Great.  Please add it to the next commitfest:
http://commitfest.postgresql.org

On a quick look, I didn't like the way you separated the
"pg_database_encoding_max_length() > 1" cases.  There seems to be too
much common code.  Can that be refactored a bit better?
-- 


Re: multibyte character set in levenshtein function

From:
Alvaro Herrera
Date:
Alexander Korotkov wrote:
> On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 
> > On a quick look, I didn't like the way you separated the
> > "pg_database_encoding_max_length() > 1" cases.  There seem to be too
> > much common code.  Can that be refactored a bit better?
> >
> I did a little refactoring in order to avoid some duplicated code.
> I'm not quite sure about my CHAR_CMP macro. Is it a good idea?

Well, since it's only used in one place, why are you defining a macro at
all?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Well, since it's only used in one place, why are you defining a macro at
all?
In order to structure the code better. But my question was about something else: is memcmp a good choice for comparing very short sequences of bytes (from 1 to 4 bytes)?
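For context, the kind of inline alternative being asked about might look like this (the name is hypothetical): a fixed tiny loop the compiler can inline and fully unroll, whereas memcmp is an opaque library call unless the compiler chooses to expand it. The thread returns to this question later with measurements.

#include <stdbool.h>

/* sketch: compare 1-4 byte characters with an open-coded loop */
static inline bool
tiny_byte_cmp(const char *a, const char *b, int len)
{
    while (len-- > 0)
        if (*a++ != *b++)
            return false;
    return true;
}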

Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On a quick look, I didn't like the way you separated the
"pg_database_encoding_max_length() > 1" cases.  There seem to be too
much common code.  Can that be refactored a bit better?
I did a little refactoring in order to avoid some duplicated code.
I'm not quite sure about my CHAR_CMP macro. Is it a good idea?

Attachments

Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
Hello Hackers!

I have extended my patch by introducing a levenshtein_less_equal function. This function has an additional argument max_d and stops calculating when the distance exceeds max_d. With low values of max_d the function works much faster than the original one.

An example of the original levenshtein function usage:

test=# select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist 
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=19.019..203.601 rows=5 loops=1)
         Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)

Example of levenshtein_less_equal usage in this case:

test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
    word     | dist 
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2

test=# explain analyze select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=5.391..42.143 rows=5 loops=1)
         Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)

In the example above, levenshtein_less_equal works about 5 times faster.
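For illustration, a minimal sketch of the early-exit idea, assuming single-byte characters and unit costs (the function name is invented; the real patch is more elaborate): track the minimum of each row, and since row minima never decrease, stop as soon as one exceeds max_d.

#include <stdlib.h>
#include <string.h>

static int
levenshtein_less_equal_sketch(const char *s, const char *t, int max_d)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev = malloc((m + 1) * sizeof(int));
    int    *curr = malloc((m + 1) * sizeof(int));
    int    *temp;
    int     i, j, result;

    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (j = 1; j <= n; j++)
    {
        int     row_min = j;

        curr[0] = j;
        for (i = 1; i <= m; i++)
        {
            int     d = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);

            if (prev[i] + 1 < d)
                d = prev[i] + 1;
            if (curr[i - 1] + 1 < d)
                d = curr[i - 1] + 1;
            curr[i] = d;
            if (d < row_min)
                row_min = d;
        }

        /* row minima never decrease: once past max_d, the bound is lost */
        if (row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;
        }

        temp = curr;
        curr = prev;
        prev = temp;
    }

    result = prev[m];
    free(prev);
    free(curr);
    return result > max_d ? max_d + 1 : result;
}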

With best regards,
Alexander Korotkov.
Attachments

Re: multibyte character set in levenshtein function

From:
Itagaki Takahiro
Date:
Hi, I'm reviewing the "Multibyte character set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304

The main logic seems to be good, but I have some comments about
the coding style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
  similar. Can you merge the code? We can always use less_equal_internal()
  if the overhead is ignorable. Did you compare them?

* There are many "if (!multibyte_encoding)" in levenshtein_internal().
  How about splitting the function into two funcs for single-byte chars
  and multi-byte chars? (ex. levenshtein_internal_mb()) Or, we can always
  use the multi-byte version if the overhead is small.

* I prefer a struct rather than an array.  "4 * m" and "3 * m" look like
  magic numbers to me. Could you name the entries with the definition of
  a struct?
    /*
     * For multibyte encoding we'll also store array of lengths of
     * characters and array with character offsets in first string
     * in order to avoid great number of pg_mblen calls.
     */
    prev = (int *) palloc(4 * m * sizeof(int));

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.

* Need to fix the caution in the docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encodings!  We should
mention which functions support mbchars so as not to confuse users.

* (Not an issue for the patch, but...) Could you rewrite PG_GETARG_TEXT_P,
VARDATA, and VARSIZE to PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
The unpacked versions make the core a bit faster.
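A hedged sketch of what that last suggestion looks like in a varlena-taking function (the function name and body are placeholders; only the accessor macros are the point, since the _PP / _ANY variants work on packed short-header values without forcing a copy):

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(levenshtein_sketch);

Datum
levenshtein_sketch(PG_FUNCTION_ARGS)
{
    text       *src = PG_GETARG_TEXT_PP(0);       /* was PG_GETARG_TEXT_P */
    text       *dst = PG_GETARG_TEXT_PP(1);
    const char *s = VARDATA_ANY(src);             /* was VARDATA */
    const char *t = VARDATA_ANY(dst);
    int         s_bytes = VARSIZE_ANY_EXHDR(src); /* was VARSIZE - VARHDRSZ */
    int         t_bytes = VARSIZE_ANY_EXHDR(dst);

    /* placeholder: a real implementation would compute the distance
     * over (s, s_bytes) and (t, t_bytes) here */
    PG_RETURN_INT32((s_bytes == t_bytes && memcmp(s, t, s_bytes) == 0) ? 0 : 1);
}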

--
Itagaki Takahiro


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
Hi!

* levenshtein_internal() and levenshtein_less_equal_internal() are very
 similar. Can you merge the code? We can always use less_equal_internal()
 if the overhead is ignorable. Did you compare them?
With a big value of max_d the overhead is significant. Here is an example on the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum  
---------
 1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
   sum  
---------
 1386456
(1 row)

Time: 317,821 ms
 
* There are many "if (!multibyte_encoding)" in levenshtein_internal().
 How about splitting the function into two funcs for single-byte chars and
 multi-byte chars? (ex. levenshtein_internal_mb()) Or, we can always
 use the multi-byte version if the overhead is small.
The multi-byte version was about 4 times slower. But I have rewritten my CHAR_CMP macro as an inline function, and now it's only about 1.5 times slower.

In a database with multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word  
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word  
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think that the overhead is not ignorable. That's why I have split levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb.
 
* I prefer a struct rather than an array.  "4 * m" and "3 * m" look like magic
 numbers to me. Could you name the entries with the definition of a struct?
   /*
    * For multibyte encoding we'll also store array of lengths of
    * characters and array with character offsets in first string
    * in order to avoid great number of pg_mblen calls.
    */
   prev = (int *) palloc(4 * m * sizeof(int));
In this line of code the memory is allocated for 4 arrays: prev, curr, offsets, and char_lens. So I have joined offsets and char_lens into a struct. But I can't join prev and curr because of this trick:
        temp = curr;
        curr = prev;
        prev = temp;

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function
Fixed.

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.
Fixed.
 
* Need to fix the caution in the docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encodings!  We should
mention which functions support mbchars so as not to confuse users.
I've updated this caution. Also I've added documentation for the levenshtein_less_equal function.

* (Not an issue for the patch, but...)
 Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
 PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
 The unpacked versions make the core a bit faster.
 Fixed.

With best regards,
Alexander Korotkov.
Attachments

Re: multibyte character set in levenshtein function

From:
Itagaki Takahiro
Date:
2010/7/13 Alexander Korotkov <aekorotkov@gmail.com>:
> Anyway, I think that the overhead is not ignorable. That's why I have split
> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
> levenshtein_less_equal_internal_mb.

Thank you for the good measurements! Then, it's reasonable to have multiple
implementations. It also has documentation. I'll change the status of the
patch to "Ready for Committer".

The patch is good enough except for the argument types of some functions.
For example:
  - char* vs. const char*
  - text* vs. const char* + length
I hope committers will check whether there are better types for them.

-- 
Itagaki Takahiro


Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> 2010/7/13 Alexander Korotkov <aekorotkov@gmail.com>:
>> Anyway, I think that the overhead is not ignorable. That's why I have split
>> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
>> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
>> levenshtein_less_equal_internal_mb.
>
> Thank you for the good measurements! Then, it's reasonable to have multiple
> implementations. It also has documentation. I'll change the status of the
> patch to "Ready for Committer".
>
> The patch is good enough except for the argument types of some functions.
> For example:
>  - char* vs. const char*
>  - text* vs. const char* + length
> I hope committers will check whether there are better types for them.

This patch still needs some work.  It includes a bunch of stylistic
changes that aren't relevant to the purpose of the patch.  There's no
reason that I can see to change the existing levenshtein_internal
function to take text arguments instead of char *, or to change !m to
m == 0 in existing code, or to change the whitespace in the comments
of that function.  All of those need to be reverted before we can
consider committing this.

There is a huge amount of duplicated code here.  I think it makes
sense to have the multibyte version of the function be separate, but
perhaps we can merge the less-than-or-equal-to bits into the main
code, so that we only have two copies instead of four.  Perhaps we
can just add a max_d argument to levenshtein_internal,
and if this value is >=0 then it represents the max allowable
distance, but if it is <0 then there is no limit.  Sure, that might
slow down the existing code a bit, but it might not be significant.
I'd at least like to see some numbers showing that it is significant
before we go to this much trouble.

The code doesn't follow the project coding style.  Braces must be
uncuddled.  Comment paragraphs will be reflowed unless they begin and
end with ------.  Function definitions should have the type
declaration on one line and the function name at the start of the
next.
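For reference, a small invented example of the three conventions just listed (not from the patch):

/*----------
 * A comment paragraph that begins and ends with dashes keeps its
 * hand-made layout; pgindent reflows other comment paragraphs.
 *----------
 */
static int
clamp_distance(int d, int max_d)    /* type above, name starts this line */
{
    if (max_d >= 0 && d > max_d)
    {
        /* braces on their own lines ("uncuddled") */
        d = max_d + 1;
    }
    return d;
}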

Freeing memory with pfree is likely a waste of time; is there any
reason not to just rely on the memory context reset, as the original
coding did?

I think we might need to remove the acknowledgments section from this
code.  If everyone who touches this code adds their name, we're
quickly going to have a mess.  If we're not going to remove the
acknowledgments section, then please add my name, too, because I've
already patched this code once...

I'm going to set this back to "Waiting on Author".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This patch still needs some work.  It includes a bunch of stylistic
changes that aren't relevant to the purpose of the patch.  There's no
reason that I can see to change the existing levenshtein_internal
function to take text arguments instead of char *, or to change !m to
m == 0 in existing code, or to change the whitespace in the comments
of that function.  All of those need to be reverted before we can
consider committing this.
I changed the arguments of the function from char * to text * in order to avoid a text_to_cstring call. The same benefit can be achieved by replacing char * with char * and length.
I changed !m to m == 0 because Itagaki asked me to make it conform to the coding style. Do you think there is no reason to fix coding style in existing code?
There is a huge amount of duplicated code here.  I think it makes
sense to have the multibyte version of the function be separate, but
perhaps we can merge the less-than-or-equal-to bits into the main
code, so that we only have two copies instead of four.  Perhaps we
can just add a max_d argument to levenshtein_internal,
and if this value is >=0 then it represents the max allowable
distance, but if it is <0 then there is no limit.  Sure, that might
slow down the existing code a bit, but it might not be significant.
I'd at least like to see some numbers showing that it is significant
before we go to this much trouble.
In that case we should add many checks of max_d in the levenshtein_internal function, which make the code more complex.
Actually, we can merge all four functions into one function. But such a function will have many checks for multibyte encoding and max_d. So, I see four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3, you should argue your position a little more.
The code doesn't follow the project coding style.  Braces must be
uncuddled.  Comment paragraphs will be reflowed unless they begin and
end with ------.  Function definitions should have the type
declaration on one line and the function name at the start of the
next.

Freeing memory with pfree is likely a waste of time; is there any
reason not to just rely on the memory context reset, as the original
coding did?
Ok, I'll fix these things.
 
I think we might need to remove the acknowledgments section from this
code.  If everyone who touches this code adds their name, we're
quickly going to have a mess.  If we're not going to remove the
acknowledgments section, then please add my name, too, because I've
already patched this code once...
In that case I think we can leave the original acknowledgments section.

----
With best regards,
Alexander Korotkov.

Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This patch still needs some work.  It includes a bunch of stylistic
>> changes that aren't relevant to the purpose of the patch.  There's no
>> reason that I can see to change the existing levenshtein_internal
>> function to take text arguments instead of char *, or to change !m to
>> m == 0 in existing code, or to change the whitespace in the comments
>> of that function.  All of those need to be reverted before we can
>> consider committing this.
>
> I changed the arguments of the function from char * to text * in order to
> avoid a text_to_cstring call.

*scratches head*  Aren't you just moving the same call to a different place?

> The same benefit can be achieved by replacing char * with
> char * and length.
> I changed !m to m == 0 because Itagaki asked me to make it conform to the coding
> style. Do you think there is no reason to fix coding style in existing
> code?

Yeah, we usually try to avoid changing that sort of thing in existing
code, unless there's a very good reason.

>> There is a huge amount of duplicated code here.  I think it makes
>> sense to have the multibyte version of the function be separate, but
>> perhaps we can merge the less-than-or-equal-to bits into the main
>> code, so that we only have two copies instead of four.  Perhaps we
>> can just add a max_d argument to levenshtein_internal,
>> and if this value is >=0 then it represents the max allowable
>> distance, but if it is <0 then there is no limit.  Sure, that might
>> slow down the existing code a bit, but it might not be significant.
>> I'd at least like to see some numbers showing that it is significant
>> before we go to this much trouble.
>
> In that case we should add many checks of max_d in the levenshtein_internal
> function, which make the code more complex.

When you say "many" checks, how many?

> Actually, we can merge all four functions into one function. But such a
> function will have many checks for multibyte encoding and max_d. So, I see
> four cases here:
> 1) one function with checks for multibyte encoding and max_d
> 2) two functions with checks for multibyte encoding
> 3) two functions with checks for max_d
> 4) four separate functions
> If you prefer case number 3, you should argue your position a little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking if you've checked and whether there is an effect
and whether it's significant.  The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
*scratches head*  Aren't you just moving the same call to a different place?
So, where can you find this different place? :) In this patch null-terminated strings are not used at all.
 
Yeah, we usually try to avoid changing that sort of thing in existing
code, unless there's a very good reason.
Ok.

> In that case we should add many checks of max_d in the levenshtein_internal
> function, which make the code more complex.

When you say "many" checks, how many?

> Actually, we can merge all four functions into one function. But such a
> function will have many checks for multibyte encoding and max_d. So, I see
> four cases here:
> 1) one function with checks for multibyte encoding and max_d
> 2) two functions with checks for multibyte encoding
> 3) two functions with checks for max_d
> 4) four separate functions
> If you prefer case number 3, you should argue your position a little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking if you've checked and whether there is an effect
and whether it's significant.  The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.

I've tested it with a big value of max_d, and I thought it was evident that checking for a negative value of max_d would not produce a significant benefit. Anyway, I tried to add a check for negative max_d to the levenshtein_less_equal_internal_mb function.

static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len, int ins_c, int del_c, int sub_c, int max_d)
{
    int            m,
                n;
    int           *prev;
    int           *curr;
    int            i,
                j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int        y_char_len;
    int        curr_left, curr_right, prev_left, prev_right, d;
    int        delta, min_d;


    /*
     * We should calculate number of characters for multibyte encodings
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * We can find the minimal distance by the difference of lengths
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = - delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for initialization column and row. */
    ++m;
    ++n;


    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     * For multibyte encoding we'll also store array of lengths of
     * characters and array with character offsets in first string
     * in order to avoid great number of
     * pg_mblen calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m );
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *)(prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
                lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;


    /* Initialize the "previous" row to 0..cols */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * There are the following optimizations:
     * 1) The minimal possible value of the final distance (in the case of
     * all possible matches) is stored in the cells of the matrix. In the
     * case of movement towards the diagonal which contains the last cell,
     * the value should be increased by ins_c + del_c. In the case of
     * movement backwards from this diagonal the cell value should not be
     * increased.
     * 2) The range of indexes of the previous row where values not greater
     * than max_d are stored is [prev_left, prev_right]. So, only the cells
     * connected with this range should be calculated.
     * For multibyte encoding we should increase the x and y pointers by
     * the results of the pg_mblen function. Also we should use the
     * CHAR_CMP macro for character comparison.
     */
    for (y = t, j = 1; j < n; y+= y_char_len, j++)
    {
        int           *temp;
        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }else{
            prev_left = 1;
        }

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else{
            if (j <= - delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset; i < m; x+= lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;

                if (i <= prev_right){
                    d = Min(prev[i] + ((j + delta > i)?(ins_c + del_c):0), d);
                }

                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j)?(ins_c + del_c):0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length, y, y_char_len) ? 0 : sub_c), d);
                }

                curr[i] = d;

                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }else{
                d = Min(Min(prev[i] + ((j + delta > i)?(ins_c + del_c):0), curr[i - 1] + ((i - delta > j)?(ins_c + del_c):0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length, y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled then return max_d + 1, otherwise
     * return the final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}

I tested it with an american-english dictionary of 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum  
---------
 1074376
(1 row)

Time: 131,435 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from words;
   sum  
---------
 1074376
(1 row)

Time: 221,078 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from words;
   sum  
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't become faster than with just a big value of max_d.

----
With best regards,
Alexander Korotkov.


 

Re: multibyte character set in levenshtein function

From:
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Wed Jul 21 14:25:47 -0400 2010:
> On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> > On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> > The same benefit can be achieved by replacing char * with
> > char * and length.
> > I changed !m to m == 0 because Itagaki asked me to make it conform to the coding
> > style. Do you think there is no reason to fix coding style in existing
> > code?
> 
> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.

I think fixing a stylistic issue in code that's being edited for other
purposes is fine, and a good idea going forward.  We wouldn't commit a
patch that would *only* fix those, because that would cause a problem
for backpatches for no benefit, but if the patch touches something else,
then a backpatch of another patch is going to need manual intervention
anyway.


Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> *scratches head*  Aren't you just moving the same call to a different
>> place?
>
> So, where can you find this different place? :) In this patch
> null-terminated strings are not used at all.

I can't.  You win.  :-)

Actually, I wonder if there's enough performance improvement there
that we might think about extracting that part of the patch and applying
it separately.  Then we could continue trying to figure out what to do
with the rest.  Sometimes it's simpler to deal with one change at a
time.

> I tested it with an american-english dictionary of 98569 words.
>
> test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 131,435 ms
> test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
> words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 221,078 ms
> test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
> words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 254,819 ms
>
> The function with a negative value of max_d didn't become faster than with
> just a big value of max_d.

Ah, I see.  That's pretty compelling, I guess.  Although it still
seems like a lot of code...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Thu, Jul 22, 2010 at 3:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Ah, I see.  That's pretty compelling, I guess.  Although it still
>> seems like a lot of code...
>
> I think there is a way to merge single-byte and multi-byte versions of
> functions without loss of performance using macros and includes (like in
> 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

I'm not sure.  I'd like to get a second opinion from someone else on
which way to go with this, but unfortunately 2 or 3 of the main
committers are on vacation at the moment.

Does anyone else have an opinion on what to do about this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
Such a version with macros and includes can look like this:

#ifdef MULTIBYTE
#define NEXT_X (x += char_lens[i - 1])
#define NEXT_Y (y += y_char_len)
#define CMP (char_cmp(x, char_lens[i - 1], y, y_char_len))
#else
#define NEXT_X (x++)
#define NEXT_Y (y++)
#define CMP (*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
    int            m,
                   n,
                   d;
    int           *prev;
    int           *curr;
    int            i,
                   j;
    const char    *x;
    const char    *y;
#ifdef MULTIBYTE
    int           *char_lens;
    int            y_char_len;
#endif

    /*
     * We should calculate the number of characters for multibyte encodings
     */
    m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
    n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     * For multibyte encoding we'll also store an array of lengths of the
     * characters of the first string in order to avoid a great number of
     * pg_mblen calls.
     */
#ifdef MULTIBYTE
    prev = (int *) palloc(3 * m * sizeof(int));
    curr = prev + m;
    char_lens = prev + 2 * m;
    for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
    {
        char_lens[i] = pg_mblen(x);
        x += char_lens[i];
    }
    char_lens[i] = 0;
#else
    prev = (int *) palloc(2 * m * sizeof(int));
    curr = prev + m;
#endif

    /* Initialize the "previous" row to 0..cols */
    for (i = 0; i < m; i++)
        prev[i] = i * del_c;

    /*
     * For multibyte encoding we should increase the x and y pointers by the
     * results of the pg_mblen function. Also we should use the CMP macro
     * for character comparison.
     */
    for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
    {
        int           *temp;
#ifdef MULTIBYTE
        y_char_len = pg_mblen(y);
#endif

        curr[0] = j * ins_c;

        for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
        {
            int            ins;
            int            del;
            int            sub;

            ins = prev[i] + ins_c;
            del = curr[i - 1] + del_c;
            sub = prev[i - 1] + (CMP ? 0 : sub_c);
            curr[i] = Min(ins, del);
            curr[i] = Min(curr[i], sub);
        }

        temp = curr;
        curr = prev;
        prev = temp;
    }

    d = prev[m - 1];

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    return d;
}

What do you think about it?

----
With best regards,
Alexander Korotkov.

Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Ah, I see.  That's pretty compelling, I guess.  Although it still
seems like a lot of code...
I think there is a way to merge single-byte and multi-byte versions of functions without loss of performance using macros and includes (like in 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

----
With best regards,
Alexander Korotkov.

Re: multibyte character set in levenshtein function

From:
Alvaro Herrera
Date:
Excerpts from Alexander Korotkov's message of Thu Jul 22 03:21:57 -0400 2010:
> On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > Ah, I see.  That's pretty compelling, I guess.  Although it still
> > seems like a lot of code...
> >
> I think there is a way to merge single-byte and multi-byte versions of
> functions without loss of performance using macros and includes (like in
> 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

Using the trick of a single source file similar to like_match.c that
implements all the different behaviors, and include that file more than
once with different macro definitions, is probably a good idea.
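Concretely, the like_match.c trick compiles one template body twice under different macro definitions. A toy two-file sketch (file and function names are hypothetical; it counts characters instead of computing distances, just to show the mechanism):

/* ---- levenshtein_body.c (hypothetical template file) ----
 * Expects the includer to define FUNCNAME and NEXT_CHAR. */
static int
FUNCNAME(const char *s)
{
    int n = 0;

    while (*s)
    {
        s += NEXT_CHAR(s);
        n++;
    }
    return n;
}
#undef FUNCNAME
#undef NEXT_CHAR

/* ---- main.c ---- */
#include <stdio.h>

/* UTF-8 sequence length from the lead byte (stand-in for pg_mblen) */
static int
utf8_char_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    return c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
}

#define FUNCNAME count_bytes
#define NEXT_CHAR(s) 1
#include "levenshtein_body.c"           /* single-byte instantiation */

#define FUNCNAME count_chars
#define NEXT_CHAR(s) utf8_char_len(s)
#include "levenshtein_body.c"           /* multi-byte instantiation */

int
main(void)
{
    printf("%d %d\n", count_bytes("фыва"), count_chars("фыва")); /* 8 4 */
    return 0;
}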


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
Here is a new version of my patch. There are the following changes:

1) I've merged the single-byte and multibyte versions of levenshtein_internal and levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes reasonable time even for long strings. Here is an example with strings which contain random combinations of words.

test=# select count(*) from words;
 count
-------
 98569
(1 row)

test=# select * from words limit 10;
  id   |                                        word                                        | next_id
-------+------------------------------------------------------------------------------------+---------
   200 | Agnew diodes drilled composts Wassermann nonprofit adjusting Chautauqua            |   17370
  4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither                    |   70983
  5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal punctuality             |   96685
  6103 | G prompt boron overthrew cricking begonia snuffers blazed                          |   34518
  4707 | Djibouti Mari pencilling approves pointing's pashas magpie rams                    |   71200
 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's inhospitable                |   28018
  9615 | Leos deputy evener yogis assault palatals Octobers surceased                       |   21878
 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier Algeria              |   19316
 15776 | Sucre battle's misapprehension's predicating nutmegged electrification bowl planks |   65739
 16878 | Updike Forster ragout keenly exhalation grayed switch guava's                      |   43013
(10 rows)

Time: 26,125 ms

Time: 137,061 ms
test=# select avg(length(word)) from words;
         avg        
---------------------
 74.6052207083362923
(1 row)

Time: 96,415 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher')<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2957,426 ms
test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',10)<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, del_c)) in the worst case. That's why I changed the limitation of this function to max_d <= 255 * min(ins_c, del_c).

3) I found that there is no need to store character offsets in levenshtein_less_equal_internal_mb. Instead, the first position in the string where the distance value wasn't greater than max_d can be stored.

4) I found that there is a theoretical maximum distance based on the string lengths. It represents the case when no characters match. If the theoretical maximum distance is less than the given maximum distance, we can use the theoretical maximum distance (see the sketch after the examples below). It improves the behavior of levenshtein_less_equal with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',1000)<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 4102,731 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher')<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2983,244 ms
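A hedged sketch of the bound described in point 4, using the thread's cost parameters (the function name is invented): even when no characters match, s can be turned into t by substituting min(m, n) characters and inserting or deleting the length difference, or by deleting all of s and inserting all of t, whichever is cheaper.

/* sketch: upper bound on the distance when no characters match */
static int
theoretical_max_distance(int m, int n, int ins_c, int del_c, int sub_c)
{
    int common = (m < n) ? m : n;
    int via_subst = common * sub_c +
        ((m > n) ? (m - n) * del_c : (n - m) * ins_c);
    int via_rewrite = m * del_c + n * ins_c;

    return (via_subst < via_rewrite) ? via_subst : via_rewrite;
}

/* a caller can then tighten the bound: if (max_d > theo) max_d = theo; */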

----
With best regards,
Alexander Korotkov.
Attachments

Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
I forgot the attribution in the levenshtein.c file.

----
With best regards,
Alexander Korotkov.
Attachments

Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Wed, Jul 21, 2010 at 5:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> *scratches head*  Aren't you just moving the same call to a different
>>> place?
>>
>> So, where can you find this different place? :) In this patch
>> null-terminated strings are not used at all.
>
> I can't.  You win.  :-)
>
> Actually, I wonder if there's enough performance improvement there
> that we might think about extracting that part of the patch and applying
> it separately.  Then we could continue trying to figure out what to do
> with the rest.  Sometimes it's simpler to deal with one change at a
> time.

I tested this today and the answer was a resounding yes.  I ran
sum(levenshtein(t, 'foo')) over a dictionary file with about 2 million
words and got a speedup of around 15% just by eliminating the
text_to_cstring() calls.  So I've committed that part of this patch.

I'll try to look at the rest of the patch when I get a chance, but I'm
wondering if it might make sense to split it into two patches -
specifically, one patch to handle multi-byte characters correctly, and
then a second patch for the less-than-or-equal-to functions.  I think
that might simplify reviewing a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Ok, here is the patch for multi-byte characters.
> I changed the arguments of the levenshtein_internal function from text * to
> const char * and int. I think that it makes levenshtein_internal more
> reusable. For example, this function can be used for calculating the distance
> of two fragments of strings without needing to copy those fragments.
> Actually, I'm not fully satisfied with my solution for merging the single-byte
> and multi-byte versions of the function. There are "ifdef"'s in the body of
> the function and non context-free macros. This is because the multi-byte case
> needs to store the character lengths returned by pg_mblen when the single-byte
> version doesn't need that. I tried the multi-byte version without storing
> character lengths, but in that case the function is about 1.5 times slower
> (the function needs to call pg_mblen n*m times).

I reviewed this code in a fair amount of detail today and ended up
rewriting it.  In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch.  This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed.  It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code.   The
biggest problem I found is that, as coded, this is more than 50%
slower on UTF-8 databases, because the multi-byte path is really slow,
even if there are actually no multi-byte characters present.

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters).  It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present.  It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases.  It's also quite
a bit less code duplication.

Please review and let me know your thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments

Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
2010/8/2 Alexander Korotkov <aekorotkov@gmail.com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it.  In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch.  This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed.  It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry. This is my first patch, I will be more accurate next time. But, I
> think there is no unified opinion about some changes, like replacing "!m"
> with "m==0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters).  It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present.  It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases.  It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster than
> mine when multi-byte characters are present? My tests show the reverse. For
> example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters).  I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran about 102 ms, 222 ms,
129 ms.  With your patch, I get about 134 ms, 222 ms, 130 ms.  Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both.  I don't have a dictionary handy with multi-byte
words in it.  Can you send me a file?

> I think that the cause of the slowdown is the memcmp function. This
> function is very fast for long spans of memory, but for a few bytes an inline
> function is faster, because the compiler has more freedom for optimization.
> After replacing memcmp with an inline function like the following in your
> implementation:

Yeah, that does seem to be slightly better.  I've done a version of
this in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments

Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
2010/8/2 Alexander Korotkov <aekorotkov@gmail.com>:
> The dump of the table with the Russian dictionary is attached.
>
> I use following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of the patch the results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found another optimization. With this condition:
> if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters the results almost didn't change, but they get
> better with multi-byte characters. Generally, this improvement is because the
> first bytes of different characters are frequently the same (for example,
> almost all Cyrillic characters start with the same byte, and I think that
> there is a similar situation in other alphabets), but a match of the last
> bytes is just a coincidence.

Hey, that's really cool.  It helps a lot here, too:

My previous patch, with your 5 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 5 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character
first, same 5 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.
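For reference, a hedged sketch of the comparison order being adopted (char_same is the helper the quoted condition assumes; exact names may differ in the patch):

#include <stdbool.h>

/* compare the first len bytes of two characters */
static inline bool
char_same(const char *x, const char *y, int len)
{
    while (len-- > 0)
        if (x[len] != y[len])
            return false;
    return true;
}

/*
 * The quoted condition: check the last byte first (the byte most likely
 * to differ, since e.g. Cyrillic characters share their lead byte), then
 * the lengths, then the remaining bytes.
 */
#define CHAR_EQ(x, xlen, y, ylen) \
    ((x)[(xlen) - 1] == (y)[(ylen) - 1] && (xlen) == (ylen) && \
     ((xlen) == 1 || char_same((x), (y), (xlen) - 1)))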

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments

Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Now I think the patch is as good as it can be. :)

OK, committed.

> I'm going to prepare the less-or-equal function in the same manner as this patch.

Sounds good.  Since we're now more than half-way through this
CommitFest and this patch has undergone quite a bit of change from
what we started out with, I'd like to ask that you submit the new
patch for the next CommitFest, to begin September 15th.

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I reviewed this code in a fair amount of detail today and ended up
rewriting it.  In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch.  This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed.  It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code.  
I'm sorry. This is my first patch, I will be more accurate next time. But, I think there is no unified opinion about some changes, like replacing "!m" with "m==0".

I think this line is not correct:
"if (m != s_bytes && n != t_bytes)"
The correct negation of the assumption that both strings are single-byte is the assumption that at least one string is not single-byte. With this patch the levenshtein function can calculate the distance incorrectly:
test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)
This line should be rewritten as:
"if (m != s_bytes || n != t_bytes)"

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters).  It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present.  It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases.  It's also quite
a bit less code duplication.

Can I look at the test which shows that your implementation is faster than mine when multi-byte characters are present? My tests show the reverse. For example, with the Russian dictionary from OpenOffice:

With my version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum  
---------
 1675281
(1 row)

Time: 277,190 ms

With your version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum  
---------
 1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This function is very fast for long spans of memory, but for a few bytes an inline function is faster, because the compiler has more freedom for optimization. After replacing memcmp with an inline function like the following in your implementation:

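/*
 * For 1-4 byte characters the compiler can inline and fully unroll this
 * loop, unlike an opaque memcmp() library call:
 */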
static inline bool char_cmp(const char *s, const char *d, int len)
{
    while (len--)
    {
        if (*s++ != *d++)
            return false;
    }
    return true;
}

The code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum  
---------
 1675281
(1 row)

Time: 241,272 ms

----
With best regards,
Alexander Korotkov.

Re: multibyte character set in levenshtein function

From:
Alexander Korotkov
Date:
Now I think the patch is as good as it can be. :)
I'm going to prepare the less-or-equal function in the same manner as this patch.

----
With best regards,
Alexander Korotkov.

Re: multibyte character set in levenshtein function

From:
Robert Haas
Date:
On Aug 28, 2010, at 8:34 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Here is the patch which adds the levenshtein_less_equal function. I'm going to add it to the current commitfest.

Cool. Please submit some performance results comparing levenshtein in HEAD vs. levenshtein with this patch vs. levenshtein_less_equal. Perhaps the test cases we used previously would be a good place to start.

...Robert

Re: multibyte charater set in levenshtein function

От
Alexander Korotkov
Дата:
Here is the patch which adds the levenshtein_less_equal function. I'm going to add it to the current commitfest.

----
With best regards,
Alexander Korotkov.


On Tue, Aug 3, 2010 at 3:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Now I think the patch is as good as it can be. :)

OK, committed.

> I'm going to prepare the less-or-equal function in the same manner as this patch.

Sounds good.  Since we're now more than half-way through this
CommitFest and this patch has undergone quite a bit of change from
what we started out with, I'd like to ask that you submit the new
patch for the next CommitFest, to begin September 15th.

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Вложения

Re: multibyte charater set in levenshtein function

От
Alexander Korotkov
Дата:
SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
SELECT SUM(levenshtein(a, 'foo')) from words2;
SELECT SUM(levenshtein(a, 'дом')) FROM words2;
SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;

Before this patch the results were:
67 98 63 102 108 177
And after the patch:
65 100 65 104 111 182
It is a bit slower, but I think it is the price for having less code duplication.

Now test for levenshtein_less_equal performance.

test=# SELECT * FROM words WHERE levenshtein(a, 'extensize') <= 2;
     a
-----------
 expensive
 extension
 extensive
(3 rows)

Time: 98,083 ms

test=# SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
     a
-----------
 expensive
 extension
 extensive
(3 rows)

Time: 48,069 ms

test=# SELECT * FROM words2 WHERE levenshtein(a, 'клубничный') <= 2;
     a
------------
 кузничный
 кубичный
 клубочный
 клубничный
(4 rows)

Time: 197,499 ms

test=# SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
     a
------------
 кузничный
 кубичный
 клубочный
 клубничный
(4 rows)

Time: 100,073 ms

It gives some speedup for searching words with small distances.

test=# SELECT sum(levenshtein(a, 'extensize')) from words;
  sum
--------
 835545
(1 row)

Time: 96,253 ms

test=# SELECT sum(levenshtein_less_equal(a, 'extensize', 20)) from words;
  sum
--------
 835545
(1 row)

Time: 98,438 ms

With big distances it gives similar performance, because in that case a similar branch of code is used.
In order to test it with longer strings I created a table with random combinations of words.

test=# create table phrases (a text primary key);
test=# insert into phrases select string_agg(y.a, ' ') from (select x.a, row_number() over () from (select a from words order by random()) x) y group by (y.row_number / 10);

test=# select * from phrases limit 10;
                                                      a
--------------------------------------------------------------------------------------------------------------
 hosiery's spermatozoa Haleakala disco greenness beehive paranoids atrophy unfair
 Weinberg's all profanity's individualized bellowed perishables Romanov's communicant's landlady's pauperized
 Ito seasoned jawbreakers you'll hurling's showcasing anecdota cauliflower intellectualism facilitate
 infantryman's busheled designing lucid nutritionists Aesculapius's rifle clefting exult Milosevic
 foresight lurking capitulation's pantsuits traumatizes moonlighting lancer's Longfellow rising unclasps
 permutes centralest cavalryman Dwight granddaughter knight tallyho's sober graduate locket's
 knucklehead courtliest sapphires baize coniferous emolument antarctic Laocoon's deadens unseemly
 Tartars utopians Pekingese's Cartwright badge's Dollie's spryness Ajax's Amber's bookkeeper
 spouting concordant Indus carbonation cinnamon's stockbrokers Evita's Canaries Waldorf's rampage
 Amparo exertions Kuznetsk's divots humongous wolverine chugged laurels Goliaths hazelnut
(10 rows)

test=# select * from phrases where levenshtein('kkkknucklehead courtliest   sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a) <= 10;
                                                a
--------------------------------------------------------------------------------------------------
 knucklehead courtliest sapphires baize coniferous emolument antarctic Laocoon's deadens unseemly
(1 row)

Time: 581,850 ms

test=# select * from phrases where levenshtein_less_equal('kkkknucklehead courtliest   sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a, 10) <= 10;
                                                a
--------------------------------------------------------------------------------------------------
 knucklehead courtliest sapphires baize coniferous emolument antarctic Laocoon's deadens unseemly
(1 row)

Time: 22,876 ms

It gives a great speedup for searches with small distances on relatively long strings.

test=# select sum(levenshtein('kkkknucklehead courtliest   sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a)) from phrases;
  sum
--------
 794601
(1 row)

Time: 571,138 ms

test=# select sum(levenshtein_less_equal('kkkknucklehead courtliest   sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a, 100)) from phrases;
  sum
--------
 794601
(1 row)

Time: 562,895 ms

Similar results appear for multi-byte strings.

test=# create table phrases2 (a text primary key);
test=# insert into phrases2 select string_agg(y.a, ' ') from (select x.a, row_number() over () from (select a from words2 order by random()) x) y group by (y.row_number / 10);
INSERT 0 14584

test=# select * from phrases2 limit 10;
                                                                 a
-----------------------------------------------------------------------------------------------------------------------------------
 борис спиннинг растопочный цифра выдохнуть иридий гнёздышко омоновский базовый
 пожжёте закосить насыщающий паратый продрых обеспложивание милливатт бамбуковый отпекающий книжница
 приложиться разный устремляющийся отучающийся абрикосовка протоколируются сострадательный отрясший вывалить браманизм
 наполниться агафоновна испольный подтаскивающий запруживавшийся трофимович перетаскиваемый обручавший процентщица передов
 вихрастый отволочённый дискотека пришей распасовывавший полиция возделавший трехглавый битва загазованный
 обовьетесь перехитривший инулин стелить недоброжелательность изотрёте пятиалтынный жабовидный щелочно дефицитность
 сиротиночка хлорбензол вгрызаться прокрашивать никарагуанец валентный понесённый перестегивание воспретительный переименовываемый
 таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий
 слепнул зонт уластить удобрение тормозивший ушибся ошибся мкс сейсмологический лесомелиорация
 рабовладельцем неудачный самовоспламенение карбидный круглопильный кубинцем подлакированный наблюдение исцарапавшийся издёргавший
(10 rows)

test=# select * from phrases2 where levenshtein('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный уууудобряющий', a) <= 10;
                                                          a
----------------------------------------------------------------------------------------------------------------------
 таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий
(1 row)

Time: 1291,318 ms

test=# select * from phrases2 where levenshtein_less_equal('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный уууудобряющий', a, 10) <= 10;
                                                          a
----------------------------------------------------------------------------------------------------------------------
 таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий
(1 row)

Time: 55,405 ms

test=# select sum(levenshtein_less_equal('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный уууудобряющий', a, 200)) from phrases;
   sum
---------
 1091878
(1 row)

Time: 674,734 ms

test=# select sum(levenshtein('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный уууудобряющий', a)) from phrases;
   sum
---------
 1091878
(1 row)

Time: 673,515 ms

----
With best regards,
Alexander Korotkov.
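The large speedups at small max_d rest on an early-exit property of the distance matrix: the minimum value in a row never decreases from one row to the next, so once an entire row exceeds max_d the final distance must exceed it as well. A single-byte, self-contained sketch of that idea follows; it is illustrative only, not the submitted patch, which also handles multi-byte characters.

#include <stdlib.h>
#include <string.h>

/* smallest of three ints */
static int
min3(int a, int b, int c)
{
    int     m = (a < b) ? a : b;

    return (m < c) ? m : c;
}

/*
 * Bounded levenshtein sketch: returns the exact distance when it is
 * <= max_d, and otherwise some value greater than max_d.
 */
static int
levenshtein_less_equal_sketch(const char *s, const char *t, int max_d)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev = malloc((n + 1) * sizeof(int));
    int    *curr = malloc((n + 1) * sizeof(int));
    int     i, j, result;

    for (j = 0; j <= n; j++)
        prev[j] = j;                /* distance from the empty string */

    for (i = 1; i <= m; i++)
    {
        int    *tmp;
        int     row_min;

        curr[0] = i;
        row_min = curr[0];
        for (j = 1; j <= n; j++)
        {
            curr[j] = min3(prev[j] + 1,            /* deletion */
                           curr[j - 1] + 1,        /* insertion */
                           prev[j - 1] +           /* substitution */
                           ((s[i - 1] == t[j - 1]) ? 0 : 1));
            if (curr[j] < row_min)
                row_min = curr[j];
        }

        /*
         * The row minimum never decreases in later rows, so once the whole
         * row exceeds max_d the final distance must exceed it too.
         */
        if (row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;
        }

        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result;
}

This also explains why levenshtein_less_equal with a large max_d performs like plain levenshtein: the early exit never fires, and the loop degenerates into the ordinary computation.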

Re: multibyte charater set in levenshtein function

От
Robert Haas
Дата:
2010/8/28 Alexander Korotkov <aekorotkov@gmail.com>:
> Now test for levenshtein_less_equal performance.

Nice results.  I'll try to find time to look at this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company