Обсуждение: Fuzzy matching?

Поиск
Список
Период
Сортировка

Fuzzy matching?

От
"Josh Berkus"
Дата:
Folks,

For many of my programs, it would be extremely useful to have some form
of "fuzzy matching" for VARCHAR fields.  There are two kinds of fuzzy
matching for words that I know of:

1. Phonetic matching, which would be nice but will have to wait for
someone's $100,000 project;

2. Textual mathcing, which I will outline below.

The way textual fuzzy matching should work is as follows:
The developer supplies two VARCHARs to match and a number/percent of
character mis-match that is acceptable:

Fuzzy_match('Thornton','Tornton',1)

And the fuzzy_match should return True if the two phrases are no more
than that number of characters different.  Thus, we should get:

fuzzy_match('Thornton','Tornton',1) = TRUE
fuzzy_match('Thornton','Torntin',1) = FALSE
fuzzy_match('Thornton','Torntin',2) = TRUE

Unfortunately, I cannot think of a way to make this happen in a function
without cycling through all the possible permutations of characters for
both words or doing some character-by-character comparison with
elaborate logic for placement.  Either of these approaches would be very
slow, and completely unsuitable for column comparisons on large tables.

Can anyone suggest some shortcuts here?  Perhaps using pl/perl or
something similar?

Grazie!

-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Вложения

Re: Fuzzy matching?

От
"Joe Conway"
Дата:
> And the fuzzy_match should return True if the two phrases are no more
> than that number of characters different.  Thus, we should get:
>
> fuzzy_match('Thornton','Tornton',1) = TRUE
> fuzzy_match('Thornton','Torntin',1) = FALSE
> fuzzy_match('Thornton','Torntin',2) = TRUE
>
> Unfortunately, I cannot think of a way to make this happen in a function
> without cycling through all the possible permutations of characters for
> both words or doing some character-by-character comparison with
> elaborate logic for placement.  Either of these approaches would be very
> slow, and completely unsuitable for column comparisons on large tables.
>
> Can anyone suggest some shortcuts here?  Perhaps using pl/perl or
> something similar?

Sounds like you want something along the lines of soundex or metaphone? I
don't see either function in PostgreSQL, but take a look at the PHP manual
to see examples: http://www.php.net/manual/en/function.soundex.php ,
http://www.php.net/manual/en/function.metaphone.php

I looked at the soundex function in the PHP source, and it looks like it
would be fairly easy to port to a Postgres C function. The algorithm itself
comes from Donald Knuth in "The Art Of Computer Programming, vol. 3: Sorting
And Searching", Addison-Wesley (1973), pp. 391-392.

HTH,

-- Joe




RE: Fuzzy matching?

От
"Robby Slaughter"
Дата:
Here's an off the cuff reply:

It sounds like fuzzy_match(str1,str2,num) is
really just a tokenizer-type operation. The number is exactly
one less than the potential number of string segments
that you are interested in. For example:
  fuzzy_match('Thornton','Tornton',1) = TRUE
 Because the two segements are 'T' and 'ornton'

And also:
  fuzzy_match('Thornton','Torntin',2) = TRUE
 Becuse the three segments are 'T', "ornt', and 'n'

So, it seems like you could try to build the tokens,
which would be probably more efficient than just trying
all permutations.

HTH
-Robby


-----Original Message-----
From: pgsql-sql-owner@postgresql.org
[mailto:pgsql-sql-owner@postgresql.org]On Behalf Of Josh Berkus
Sent: Tuesday, July 31, 2001 11:05 AM
To: pgsql-sql@postgresql.org
Subject: [SQL] Fuzzy matching?


Folks,

For many of my programs, it would be extremely useful to have some form
of "fuzzy matching" for VARCHAR fields.  There are two kinds of fuzzy
matching for words that I know of:

1. Phonetic matching, which would be nice but will have to wait for
someone's $100,000 project;

2. Textual mathcing, which I will outline below.

The way textual fuzzy matching should work is as follows:
The developer supplies two VARCHARs to match and a number/percent of
character mis-match that is acceptable:

Fuzzy_match('Thornton','Tornton',1)

And the fuzzy_match should return True if the two phrases are no more
than that number of characters different.  Thus, we should get:

fuzzy_match('Thornton','Tornton',1) = TRUE
fuzzy_match('Thornton','Torntin',1) = FALSE
fuzzy_match('Thornton','Torntin',2) = TRUE

Unfortunately, I cannot think of a way to make this happen in a function
without cycling through all the possible permutations of characters for
both words or doing some character-by-character comparison with
elaborate logic for placement.  Either of these approaches would be very
slow, and completely unsuitable for column comparisons on large tables.

Can anyone suggest some shortcuts here?  Perhaps using pl/perl or
something similar?

Grazie!

-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco
 



Re: Fuzzy matching?

От
Tom Lane
Дата:
"Josh Berkus" <josh@agliodbs.com> writes:
> For many of my programs, it would be extremely useful to have some form
> of "fuzzy matching" for VARCHAR fields.  There are two kinds of fuzzy
> matching for words that I know of:

> 1. Phonetic matching, which would be nice but will have to wait for
> someone's $100,000 project;

Uh, have you looked at contrib/soundex?  The Soundex code is kinda
specialized but might be just what you want...
        regards, tom lane


Re: Fuzzy matching?

От
"Joe Conway"
Дата:
> > Sounds like you want something along the lines of soundex or metaphone?
I
> > don't see either function in PostgreSQL, but take a look at the PHP
manual
> > to see examples: http://www.php.net/manual/en/function.soundex.php ,
> > http://www.php.net/manual/en/function.metaphone.php
> >
>
> See /contrib/soundex.

Sorry, missed that -- I only looked in the Documentation :(
I guess it's not there because it is a contrib. FWIW, both Oracle and MSSQL
have a built-in soundex function.

In any case, metaphone is reportedly more accurate (at least for English
words) than soundex, and levenshtein offers an entirely different and
interesting approach. Any interest in having all three of these in the
backend?

-- Joe





Re: Fuzzy matching?

От
Bruce Momjian
Дата:
> > See /contrib/soundex.
> 
> Sorry, missed that -- I only looked in the Documentation :(
> I guess it's not there because it is a contrib. FWIW, both Oracle and MSSQL
> have a built-in soundex function.
> 
> In any case, metaphone is reportedly more accurate (at least for English
> words) than soundex, and levenshtein offers an entirely different and
> interesting approach. Any interest in having all three of these in the
> backend?

Sure.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Fuzzy matching?

От
"Joe Conway"
Дата:
> >
> > Actually, this may even be closer to what you want:
> > http://www.php.net/manual/en/function.levenshtein.php
>
> Hey, that's terrific!   I didn't know that those programs existed
> outside fo expensive proprietary software.
>
> Now, who can I talk into porting them (metaphone, levenstein) to
> Postgres?  Hey, GreatBridge folks?   (this would be a significant value
> enhancement for Postgres)
>
> -Josh

I wouldn't mind doing it if the core team agrees. It will probably be a
couple of weeks before I can get to it though -- not sure if that's soon
enough to make it into 7.2. Should it be a contrib, or in the backend?

-- Joe



Re: Fuzzy matching?

От
Bruce Momjian
Дата:
> Sounds like you want something along the lines of soundex or metaphone? I
> don't see either function in PostgreSQL, but take a look at the PHP manual
> to see examples: http://www.php.net/manual/en/function.soundex.php ,
> http://www.php.net/manual/en/function.metaphone.php
> 

See /contrib/soundex.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Fuzzy matching?

От
"Josh Berkus"
Дата:
Joe,

> In any case, metaphone is reportedly more accurate (at least for
> English
> words) than soundex, and levenshtein offers an entirely different and
> interesting approach. Any interest in having all three of these in
> the
> backend?

I'm quite interested, myself.  How difficult is it for somebody that
doesn't program C to attach a function from the Contrib directory?  If
it's not very difficult, then I'd recommend putting metaphone in
/contrib, and levenstein in the backend.  My reasoning is that
levenstein is useful for all roman alphabets, but metaphone is not so
useful for non-english versions of postgres.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco
 


Re: Fuzzy matching?

От
"Josh Berkus"
Дата:
Joe,

> > Sounds like you want something along the lines of soundex or
> metaphone? I
> > don't see either function in PostgreSQL, but take a look at the PHP
> manual
> > to see examples: http://www.php.net/manual/en/function.soundex.php
> ,
> > http://www.php.net/manual/en/function.metaphone.php
> >
> > I looked at the soundex function in the PHP source, and it looks
> like it
> > would be fairly easy to port to a Postgres C function. The
> algorithm
> itself
> > comes from Donald Knuth in "The Art Of Computer Programming, vol.
> 3:
> Sorting
> > And Searching", Addison-Wesley (1973), pp. 391-392.
> >
> 
> Actually, this may even be closer to what you want:
> http://www.php.net/manual/en/function.levenshtein.php

Hey, that's terrific!   I didn't know that those programs existed
outside fo expensive proprietary software.

Now, who can I talk into porting them (metaphone, levenstein) to
Postgres?  Hey, GreatBridge folks?   (this would be a significant value
enhancement for Postgres)

-Josh


______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco
 


Re: Fuzzy matching?

От
Tom Lane
Дата:
"Joe Conway" <joseph.conway@home.com> writes:
> In any case, metaphone is reportedly more accurate (at least for English
> words) than soundex, and levenshtein offers an entirely different and
> interesting approach. Any interest in having all three of these in the
> backend?

I'd certainly accept such functions as /contrib material, but probably
would want to wait to see if they get used much before putting them in
the standard backend ...
        regards, tom lane


RE: Fuzzy matching?

От
Jeff Eckermann
Дата:
With version 7.2 we will have pl/perlu (untrusted), which will allow use of
the various Perl modules which do this sort of thing.

> -----Original Message-----
> From:    Josh Berkus [SMTP:josh@agliodbs.com]
> Sent:    Tuesday, July 31, 2001 1:16 PM
> To:    Joe Conway; Bruce Momjian
> Cc:    Josh Berkus; pgsql-sql@postgresql.org
> Subject:    Re: Fuzzy matching?
> 
> Joe,
> 
> > In any case, metaphone is reportedly more accurate (at least for
> > English
> > words) than soundex, and levenshtein offers an entirely different and
> > interesting approach. Any interest in having all three of these in
> > the
> > backend?
> 
> I'm quite interested, myself.  How difficult is it for somebody that
> doesn't program C to attach a function from the Contrib directory?  If
> it's not very difficult, then I'd recommend putting metaphone in
> /contrib, and levenstein in the backend.  My reasoning is that
> levenstein is useful for all roman alphabets, but metaphone is not so
> useful for non-english versions of postgres.
> 
> -Josh
> 
> ______AGLIO DATABASE SOLUTIONS___________________________
>                                        Josh Berkus
>   Complete information technology      josh@agliodbs.com
>    and data management solutions       (415) 565-7293
>   for law firms, small businesses        fax 621-2533
>     and non-profit organizations.      San Francisco
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Fuzzy matching?

От
"Josh Berkus"
Дата:
Tom, Joe,

> Our usual practice with stuff of uncertain usefulness has been to
> stick
> it in contrib for awhile and see if anyone uses it.  If there's
> sufficient interest, we'll promote it to mainstream in a future
> release.

Makes sense to me.  Go, Joe!

Since I can't help with the porting of metaphone or levenstein (but will
benefit immensely), I pledge to write a name-alike data checking
PL/pgSQL function which I will post to Roberto's library for public
consumption.

-Josh


______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco
 


Re: Fuzzy matching?

От
Tom Lane
Дата:
"Josh Berkus" <josh@agliodbs.com> writes:
> I'm quite interested, myself.  How difficult is it for somebody that
> doesn't program C to attach a function from the Contrib directory?

Run the install script.

> If it's not very difficult, then I'd recommend putting metaphone in
> /contrib, and levenstein in the backend.  My reasoning is that
> levenstein is useful for all roman alphabets, but metaphone is not so
> useful for non-english versions of postgres.

Our usual practice with stuff of uncertain usefulness has been to stick
it in contrib for awhile and see if anyone uses it.  If there's
sufficient interest, we'll promote it to mainstream in a future release.
        regards, tom lane


Re: Fuzzy matching?

От
Christopher Sawtell
Дата:
On Wed, 01 Aug 2001 05:44, Tom Lane wrote:
> "Josh Berkus" <josh@agliodbs.com> writes:
> > For many of my programs, it would be extremely useful to have some form
> > of "fuzzy matching" for VARCHAR fields.  There are two kinds of fuzzy
> > matching for words that I know of:
> >
> > 1. Phonetic matching, which would be nice but will have to wait for
> > someone's $100,000 project;
>
> Uh, have you looked at contrib/soundex?  The Soundex code is kinda
> specialized but might be just what you want...

Be aware that Soundex simply does not work at all well on names of Pacific 
origin. Samoan in particular is a total no no.


Re: Fuzzy matching?

От
"Timothy H. Keitt"
Дата:
I posted this many moons ago to pgsql-hackers.  'Guess nobody noticed.

Tim

Josh Berkus wrote:

>Folks,
>
>For many of my programs, it would be extremely useful to have some form
>of "fuzzy matching" for VARCHAR fields.  There are two kinds of fuzzy
>matching for words that I know of:
>
>1. Phonetic matching, which would be nice but will have to wait for
>someone's $100,000 project;
>
>2. Textual mathcing, which I will outline below.
>
>The way textual fuzzy matching should work is as follows:
>The developer supplies two VARCHARs to match and a number/percent of
>character mis-match that is acceptable:
>
>Fuzzy_match('Thornton','Tornton',1)
>
>And the fuzzy_match should return True if the two phrases are no more
>than that number of characters different.  Thus, we should get:
>
>fuzzy_match('Thornton','Tornton',1) = TRUE
>fuzzy_match('Thornton','Torntin',1) = FALSE
>fuzzy_match('Thornton','Torntin',2) = TRUE
>
>Unfortunately, I cannot think of a way to make this happen in a function
>without cycling through all the possible permutations of characters for
>both words or doing some character-by-character comparison with
>elaborate logic for placement.  Either of these approaches would be very
>slow, and completely unsuitable for column comparisons on large tables.
>
>Can anyone suggest some shortcuts here?  Perhaps using pl/perl or
>something similar?
>
>Grazie!
>
>-Josh Berkus
>
>______AGLIO DATABASE SOLUTIONS___________________________
>                                       Josh Berkus
>  Complete information technology      josh@agliodbs.com
>   and data management solutions       (415) 565-7293
>  for law firms, small businesses        fax 621-2533
>    and non-profit organizations.      San Francisco
>
>
>------------------------------------------------------------------------
>
>
>
>------------------------------------------------------------------------
>
>
>
>------------------------------------------------------------------------
>
>
>
>------------------------------------------------------------------------
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
> Part 1.2
>
> Content-Type:
>
> text/plain
> Content-Encoding:
>
> base64
>
>
> ------------------------------------------------------------------------
> Part 1.3
>
> Content-Type:
>
> text/plain
> Content-Encoding:
>
> base64
>
>
> ------------------------------------------------------------------------
> Part 1.4
>
> Content-Type:
>
> text/plain
> Content-Encoding:
>
> base64
>
>
> ------------------------------------------------------------------------
> Part 1.5
>
> Content-Type:
>
> text/plain
> Content-Encoding:
>
> binary
>
>

--
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/


/* Copyright (c) 2000 Timothy H. Keitt */
/* Licence: GPL version 2 or higher (see http://www.gnu.org/) */

#include "postgres.h"

#define STATIC_SIZE 32

/* This must be changed if STATIC_SIZE is changed */
static int4 static_array[STATIC_SIZE][STATIC_SIZE] = {
  {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
   19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31},
  {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {28, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};

/* Fast version for strings up to STATIC_SIZE characters */
int4
levenshtein_distance(text *s1, text *s2) {
  register int i, j, l, m, n, add, rows, columns;

  columns = VARSIZE(s1) - VARHDRSZ + 1;
  rows = VARSIZE(s2) - VARHDRSZ + 1;

  /* Use slower dynamically allocated version for larger strings */
  if (columns > STATIC_SIZE || rows > STATIC_SIZE)
    return levenshtein_distance_dynamic(s1, s2);

  for (j=1; j<rows; ++j) {
    for (i=1; i<columns; ++i) {

      if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1])    add=0; else add=1;

      m = 1 + static_array[j-1][i];
      l = 1 + static_array[j][i-1];
      n = add + static_array[j-1][i-1];

      static_array[j][i] = (m < l ? (m < n ? m : n): (l < n ? l : n));

    } /* next column (i) */
  } /* next row (j) */

  return static_array[rows-1][columns-1];

} /* end levenshtein_distance(text *s1, text *s2) */

int4
levenshtein_distance_dynamic(text *s1, text *s2) {
  register int i, j, l, m, n, add, rows, columns, out;
  int4 *dynamic_array;

  columns = VARSIZE(s1) - VARHDRSZ + 1;
  rows = VARSIZE(s2) - VARHDRSZ + 1;

  dynamic_array = (int4 *) palloc(rows * columns * sizeof(int4));
  if (dynamic_array == NULL) return -1;

  for (i=0; i<columns; ++i) dynamic_array[i] = i;
  for (j=0; j<rows; ++j) dynamic_array[j*columns] = j;

  for (j=1; j<rows; ++j) {
    for (i=1; i<columns; ++i) {

      if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1])    add=0; else add=1;

      m = 1 + dynamic_array[(j-1)*columns+i];
      l = 1 + dynamic_array[j*columns+i-1];
      n = add + dynamic_array[(j-1)*columns+i-1];

      dynamic_array[j*columns+i] =
    (m < l ? (m < n ? m : n): (l < n ? l : n));

    } /* next column (i) */
  } /* next row (j) */

  out = dynamic_array[(rows-1)*columns+columns-1];

  pfree(dynamic_array);

  return out;

} /* end levenshtein_distance_dynamic(text *s1, text *s2) */



Re: Fuzzy matching?

От
"Timothy H. Keitt"
Дата:
Oh and despite the copyright notice, I'm happy to put it in the public 
domain, so feel free to incorporate into postgresql.

Tim

Timothy H. Keitt wrote:

> I posted this many moons ago to pgsql-hackers.  'Guess nobody noticed.
>
> Tim
>
> Josh Berkus wrote:
>
>> Folks,
>>
>> For many of my programs, it would be extremely useful to have some form
>> of "fuzzy matching" for VARCHAR fields.  There are two kinds of fuzzy
>> matching for words that I know of:
>>
>> 1. Phonetic matching, which would be nice but will have to wait for
>> someone's $100,000 project;
>>
>> 2. Textual mathcing, which I will outline below.
>>
>> The way textual fuzzy matching should work is as follows:
>> The developer supplies two VARCHARs to match and a number/percent of
>> character mis-match that is acceptable:
>>
>> Fuzzy_match('Thornton','Tornton',1)
>>
>> And the fuzzy_match should return True if the two phrases are no more
>> than that number of characters different.  Thus, we should get:
>>
>> fuzzy_match('Thornton','Tornton',1) = TRUE
>> fuzzy_match('Thornton','Torntin',1) = FALSE
>> fuzzy_match('Thornton','Torntin',2) = TRUE
>>
>> Unfortunately, I cannot think of a way to make this happen in a function
>> without cycling through all the possible permutations of characters for
>> both words or doing some character-by-character comparison with
>> elaborate logic for placement.  Either of these approaches would be very
>> slow, and completely unsuitable for column comparisons on large tables.
>>
>> Can anyone suggest some shortcuts here?  Perhaps using pl/perl or
>> something similar?
>>
>> Grazie!
>>
>> -Josh Berkus
>>
>> ______AGLIO DATABASE SOLUTIONS___________________________
>>                                       Josh Berkus
>>  Complete information technology      josh@agliodbs.com
>>   and data management solutions       (415) 565-7293
>>  for law firms, small businesses        fax 621-2533
>>    and non-profit organizations.      San Francisco
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 2: you can get off all lists at once with the unregister command
>>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>>
>> Part 1.2
>>
>> Content-Type:
>>
>> text/plain
>> Content-Encoding:
>>
>> base64
>>
>>
>> ------------------------------------------------------------------------
>> Part 1.3
>>
>> Content-Type:
>>
>> text/plain
>> Content-Encoding:
>>
>> base64
>>
>>
>> ------------------------------------------------------------------------
>> Part 1.4
>>
>> Content-Type:
>>
>> text/plain
>> Content-Encoding:
>>
>> base64
>>
>>
>> ------------------------------------------------------------------------
>> Part 1.5
>>
>> Content-Type:
>>
>> text/plain
>> Content-Encoding:
>>
>> binary
>>
>>
>
>
>------------------------------------------------------------------------
>
>/* Copyright (c) 2000 Timothy H. Keitt */
>/* Licence: GPL version 2 or higher (see http://www.gnu.org/) */
>
>#include "postgres.h"
>
>#define STATIC_SIZE 32
>
>/* This must be changed if STATIC_SIZE is changed */
>static int4 static_array[STATIC_SIZE][STATIC_SIZE] = {
>  {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
>   19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31},
>  {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, 
>  {2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, 
>  {10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, 
>  {24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {28, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>  {30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, 
>  {31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};
>
>/* Fast version for strings up to STATIC_SIZE characters */
>int4
>levenshtein_distance(text *s1, text *s2) { 
>  register int i, j, l, m, n, add, rows, columns;
>
>  columns = VARSIZE(s1) - VARHDRSZ + 1;
>  rows = VARSIZE(s2) - VARHDRSZ + 1;
>
>  /* Use slower dynamically allocated version for larger strings */
>  if (columns > STATIC_SIZE || rows > STATIC_SIZE)
>    return levenshtein_distance_dynamic(s1, s2);
>
>  for (j=1; j<rows; ++j) { 
>    for (i=1; i<columns; ++i) { 
>
>      if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1])    add=0; else add=1;
>
>      m = 1 + static_array[j-1][i]; 
>      l = 1 + static_array[j][i-1]; 
>      n = add + static_array[j-1][i-1]; 
>
>      static_array[j][i] = (m < l ? (m < n ? m : n): (l < n ? l : n)); 
>
>    } /* next column (i) */
>  } /* next row (j) */
>
>  return static_array[rows-1][columns-1]; 
>
>} /* end levenshtein_distance(text *s1, text *s2) */
>
>int4
>levenshtein_distance_dynamic(text *s1, text *s2) { 
>  register int i, j, l, m, n, add, rows, columns, out;
>  int4 *dynamic_array;
>
>  columns = VARSIZE(s1) - VARHDRSZ + 1;
>  rows = VARSIZE(s2) - VARHDRSZ + 1;
>
>  dynamic_array = (int4 *) palloc(rows * columns * sizeof(int4));
>  if (dynamic_array == NULL) return -1;
>
>  for (i=0; i<columns; ++i) dynamic_array[i] = i; 
>  for (j=0; j<rows; ++j) dynamic_array[j*columns] = j; 
>
>  for (j=1; j<rows; ++j) { 
>    for (i=1; i<columns; ++i) { 
>
>      if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1])    add=0; else add=1;
>
>      m = 1 + dynamic_array[(j-1)*columns+i]; 
>      l = 1 + dynamic_array[j*columns+i-1]; 
>      n = add + dynamic_array[(j-1)*columns+i-1]; 
>
>      dynamic_array[j*columns+i] =
>    (m < l ? (m < n ? m : n): (l < n ? l : n));
>
>    } /* next column (i) */
>  } /* next row (j) */
>
>  out = dynamic_array[(rows-1)*columns+columns-1]; 
>
>  pfree(dynamic_array);
>
>  return out;
>
>} /* end levenshtein_distance_dynamic(text *s1, text *s2) */
>
>
>
>------------------------------------------------------------------------
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>
> levenshtein_distance.c
>
> Content-Type:
>
> text/plain
> Content-Encoding:
>
> 7bit
>
>
> ------------------------------------------------------------------------
> Part 1.3
>
> Content-Type:
>
> text/plain
> Content-Encoding:
>
> binary
>
>

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/





Re: Fuzzy matching?

От
Oleg Bartunov
Дата:
I have C-codes which implement Levenstein Distance Algorithms
Here is a header for soe info:

// Levenstein Distance Algorithms Module
//
// Contains functions to do "Inexact Alphanumeric Comparisons".  The original
// concept for this code came from "The C Users Journal", May 1991, starting
// on page 127.  The author was Hans Z. Zwakenberg.
//
// These routines have been modified so they do not need to use double
// subscripted arrays in the CUJ code.  An additional function was created to
// "look up" a word from a list, returning the most likely matches.
//
// Written initailly by R. Bruce Roberts, MCI Systems Engineering, 1/7/93
// Compiled with Borland C++, V3.1
//

It's compiled and works well. Don't know about license but it's published
and it's possible to ask author directly

zen:~/app/algo/dist/ldist$ ./test simple sample
Returns 'Levenstein Distance' of two strings on command line.
Determining Distance for simple and sample.
Length is limited to 40 characters.
Words are alike, L Distance is 1, Threshold was 3.

PLease let me know if you need it. I thought about implementing the
same feature as Google has into our OpenFTS search
Regards,
    Oleg
On Tue, 31 Jul 2001, Tom Lane wrote:

> "Josh Berkus" <josh@agliodbs.com> writes:
> > I'm quite interested, myself.  How difficult is it for somebody that
> > doesn't program C to attach a function from the Contrib directory?
>
> Run the install script.
>
> > If it's not very difficult, then I'd recommend putting metaphone in
> > /contrib, and levenstein in the backend.  My reasoning is that
> > levenstein is useful for all roman alphabets, but metaphone is not so
> > useful for non-english versions of postgres.
>
> Our usual practice with stuff of uncertain usefulness has been to stick
> it in contrib for awhile and see if anyone uses it.  If there's
> sufficient interest, we'll promote it to mainstream in a future release.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://www.postgresql.org/search.mpl
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: Fuzzy matching?

От
Peter Eisentraut
Дата:
Josh Berkus writes:

> For many of my programs, it would be extremely useful to have some form
> of "fuzzy matching" for VARCHAR fields.

For lexical similarity, check out the agrep algorithm.  Last I checked the
source code wasn't quite Free(tm), but the algorithm was published in an
academic work.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Re: Fuzzy matching?

От
"Timothy H. Keitt"
Дата:
The Levenshtein-distance code I posted uses the agrep algorithm.

Tim

Peter Eisentraut wrote:

>Josh Berkus writes:
>
>>For many of my programs, it would be extremely useful to have some form
>>of "fuzzy matching" for VARCHAR fields.
>>
>
>For lexical similarity, check out the agrep algorithm.  Last I checked the
>source code wasn't quite Free(tm), but the algorithm was published in an
>academic work.
>

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/