Re: Removing duplicates

Поиск
Список
Период
Сортировка
От Josh Berkus
Тема Re: Removing duplicates
Дата
Msg-id web-810507@davinci.ethosmedia.com
обсуждение исходный текст
Ответ на Removing duplicates  (Matthew Hagerty <matthew@brwholesale.com>)
Ответы Re: Removing duplicates  (Matthew Hagerty <matthew@brwholesale.com>)
Список pgsql-sql
Matt,

> I have a customer database (name, address1, address2, city, state,
>  zip) and I need a query (or two) that will give me a mailing list
>  with the least amount of duplicates possible.  I know that precise
>  matching is not possible, i.e. "P.O. Box 123" will never match "PO
>  Box 123" without some data massaging, but if I can isolate even 50%
>  of any duplicates, that would help greatly.

From the sound of things, you are trying to get out a mailing with theleast number of duplicates you can in a limited
time,rather thantrying to clean up the list for permanent storage.  Chances are, youbought or traded this list from an
outsidesource, yes?
 

In that case, here's some quick de-duplication tricks from myfundraising days:

1. Compare text columns with the punctuation stripped out.  It'samazing how many typographical differences come down to
punctuation.FYI, in Roberto Mello's function catalog (accessable fromhttp://techdocs.postgresql.org/ ) I believe that I
postedsomepunctuation-stripping PL/pgSQL procedures.  For furtherde-duplication, compare only the left 15 characters of
atext field(e.g. SUBSTR(strip_string(address1), 1, 15)), but beware ... this cancause you to weed out some
non-duplicates,such as multiple residentsof large apartment buildings.
 

2. For box office lists, you can use phonetic formulas to comparepersonal names (NOT addresses).  I believe that
Soundexand Metaphoneare included in /contrib these days.  This does not work well onChinese or Southeast Asian names.
 

3. If you got phone numbers along with the addresses, these are anexcellend guage of uniqueness.

> Also, any suggestions on which parameters to check the duplicates
>  for?  My first thoughts were to make sure there were no two
>  addresses the same in the same zip code.  Any insight (or examples)
>  would be greatly appreciated.

Just don't forget that some of the zip codes will probably beerroneous.

-Josh


В списке pgsql-sql по дате отправления:

Предыдущее
От: "Dan MacNeil"
Дата:
Сообщение: Re: Removing duplicates
Следующее
От: panjas51@yahoo.fr (panjas51)
Дата:
Сообщение: [CHALLANGE] Add seconds to a date