Matt,
> I have a customer database (name, address1, address2, city, state,
> zip) and I need a query (or two) that will give me a mailing list
> with the least amount of duplicates possible. I know that precise
> matching is not possible, i.e. "P.O. Box 123" will never match "PO
> Box 123" without some data massaging, but if I can isolate even 50%
> of any duplicates, that would help greatly.
From the sound of things, you are trying to get out a mailing with theleast number of duplicates you can in a limited
time,rather thantrying to clean up the list for permanent storage. Chances are, youbought or traded this list from an
outsidesource, yes?
In that case, here's some quick de-duplication tricks from myfundraising days:
1. Compare text columns with the punctuation stripped out. It'samazing how many typographical differences come down to
punctuation.FYI, in Roberto Mello's function catalog (accessable fromhttp://techdocs.postgresql.org/ ) I believe that I
postedsomepunctuation-stripping PL/pgSQL procedures. For furtherde-duplication, compare only the left 15 characters of
atext field(e.g. SUBSTR(strip_string(address1), 1, 15)), but beware ... this cancause you to weed out some
non-duplicates,such as multiple residentsof large apartment buildings.
2. For box office lists, you can use phonetic formulas to comparepersonal names (NOT addresses). I believe that
Soundexand Metaphoneare included in /contrib these days. This does not work well onChinese or Southeast Asian names.
3. If you got phone numbers along with the addresses, these are anexcellend guage of uniqueness.
> Also, any suggestions on which parameters to check the duplicates
> for? My first thoughts were to make sure there were no two
> addresses the same in the same zip code. Any insight (or examples)
> would be greatly appreciated.
Just don't forget that some of the zip codes will probably beerroneous.
-Josh