Re: Need magic for identifieing double adresses

From: Bill Thoen
Subject: Re: Need magic for identifieing double adresses
Date:
Msg-id: 4C924C4B.2060305@gisnet.com
In reply to: Need magic for identifieing double adresses  (Andreas <maps.on@gmx.net>)
List: pgsql-general
> On 9/16/2010 5:18 AM, Sam Mason wrote:
>
>> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>>
>>> I need to clean up a lot of contact data because of a merge of customer
>>> lists that used to be kept separate.
>>> I already know that there are double entries within the lists and they
>>> overlap, too.
>>>
>>> Relevant fields could be name, street, zip, city, phone
>
> Since you're working with words, a better "fuzzy" match algorithm is
> Levenshtein distance. It's surprisingly good at coming up with an
> index number for words (or even phrases). Calculate the Levenshtein
> distance for two words; the closer the indexes are to each other, the
> more alike the words are. It's well documented on Wikipedia (and other
> places).

The problem with using something like this is that it's awkward to do
anything other than take the Cartesian product and compare everything
with everything else.  If you've got 500k rows to start with, that's a
*lot* of comparisons to be doing.  The nice thing about using something
like soundex is that it allows you to compare with equality, and hence
PG has a chance to finish the query in a reasonable amount of time.

If you're dealing with a small number of possibilities, I'd agree that
some form of edit distance is a very useful tool.
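To make the two approaches concrete, here's a small sketch (not from the thread; the function names, the simplified soundex, and the distance threshold are all my assumptions) that combines them: block rows on a phonetic key so only rows sharing a key get compared, then confirm candidates with edit distance inside each small block. This avoids the full Cartesian product while still catching near-miss spellings.

```python
from collections import defaultdict
from itertools import combinations

def soundex(word):
    """Simplified 4-character soundex code (sketch; ignores h/w edge cases)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip vowels and repeated codes
            out += code
        prev = code
    return (out + "000")[:4]

def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def candidate_pairs(names, max_dist=2):
    """Block on soundex, then confirm with edit distance within each block."""
    blocks = defaultdict(list)
    for n in names:
        blocks[soundex(n)].append(n)
    return [(a, b) for group in blocks.values()
            for a, b in combinations(group, 2)
            if levenshtein(a, b) <= max_dist]
```

In PostgreSQL itself the equivalent would be an equality join on a soundex-style key (so an index can be used), with the expensive edit-distance check applied only to the pairs that join.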


> But overall, you're basically normalizing an address list with people's
> names. So I'd first go through the lists with an eye to getting rid of
> the junk (regexp is great for identifying names with "illegal"
> characters) and creating a starting name in uppercase. There's always
> at least one data entry clerk (usually several) who does quirky things
> like including notes in the last-name field.

Yup, normalisation is very important when dealing with this sort of
thing; reducing free-form text down to something more constrained makes
the later matching much easier.
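A minimal normalisation pass along those lines might look like the following sketch (the function names and the "legal" character set are my assumptions, not anything specified in the thread): uppercase the value, collapse whitespace, and flag entries containing characters that suggest a note rather than a name.

```python
import re

# Crude sketch of the characters we consider "legal" in an uppercased name.
ALLOWED = re.compile(r"^[A-Z][A-Z' \-]*$")

def normalise_name(raw):
    """Uppercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw.strip()).upper()

def looks_suspect(name):
    """Flag normalised entries with digits, punctuation, etc. --
    likely a clerk's note rather than an actual name."""
    return not ALLOWED.match(name)
```

Anything flagged by `looks_suspect` would then go into the hand-inspection pile rather than through the automatic matcher.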


> You really need to parse each word, correct its spelling, and consider
> it in the local context, i.e. does it "fit" where it is? (as a
> geocoding routine would do with street names). There are canonical
> forms for names, and though these vary from country to country, they
> make good templates for a name parser.

I've tried doing something like this before, but tended to give up.  I
guess all the matching I've done has been small enough that it was
quicker to handle the odd cases by hand than to spend more time writing
general-purpose code that never seemed to be quite good enough.

-- Sam http://samason.me.uk/

