Обсуждение: "Fuzzy" Matches on Nicknames
Greetings, I have two tables that are populated using large datasets from disparate external systems, and I am trying to match recordsby customer name between these two tables. I do not have any authoritative key, such as customerID or nationalID,by which I can match them up, and I have found many cases where the same customer has different first names inthe two datasets. A sampling of the differences is as follows: Michael <=> Mike Tom <=> Thomas Liz <=> Elizabeth Margaret <=> Maggie How can I build a query in PostgreSQL (v. 9.6) that will find possible matches like these on nicknames? My initial guessis that I would have to either find or build some sort of intermediary table that contains associated names like thoseabove. Sometimes though, there will be more than matching pairs, like: Jim <=> James <=> Jimmy <=> Jimmie Bill <=> Will <=> Willie <=> William and so forth. Has anyone used or developed PostgreSQL queries that will find matches like these? I am running all my database queries.on my local laptops (Win7 and macOS), so performance or uptime is no issue here. I am curious to see how others inthis community have creatively solved this common problem. One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might work here, but honestly I am clueless as to how to setthis up or use it in queries successfully. Thanks, Michael (aka Mike, aka Mikey)
Hello Michael, On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote: > Greetings, > > I have two tables that are populated using large datasets from > disparate external systems, and I am trying to match records by > customer name between these two tables. I do not have any > authoritative key, such as customerID or nationalID, by which I can > match them up, and I have found many cases where the same customer > has different first names in the two datasets. A sampling of the > differences is as follows: > > Michael <=> Mike > Tom <=> Thomas > Liz <=> Elizabeth > Margaret <=> Maggie > > How can I build a query in PostgreSQL (v. 9.6) that will find > possible matches like these on nicknames? My initial guess is that I > would have to either find or build some sort of intermediary table > that contains associated names like those above. Sometimes though, > there will be more than matching pairs, like: > > Jim <=> James <=> Jimmy <=> Jimmie > Bill <=> Will <=> Willie <=> William > > and so forth. > > Has anyone used or developed PostgreSQL queries that will find > matches like these? I am running all my database queries. on my local > laptops (Win7 and macOS), so performance or uptime is no issue here. > I am curious to see how others in this community have creatively > solved this common problem. > > One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might > work here, but honestly I am clueless as to how to set this up or use > it in queries successfully. > > Thanks, > Michael (aka Mike, aka Mikey) > Check out chapter F15 in the doco. Try the double metaphone. I worked on something similar many years ago cleaning up input created by data entry clerks from hand written speeding tickets, so as to match with "trusted" data held in a database. As the volume of input was small in comparison with the number of licensed drivers, we could iterate over and over again trying to match it up. HTH. Rob
On Tue, Nov 29, 2016 at 6:56 PM, rob stone <floriparob@gmail.com> wrote: > Hello Michael, > On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote: >> Greetings, >> >> I have two tables that are populated using large datasets from >> disparate external systems, and I am trying to match records by >> customer name between these two tables. I do not have any >> authoritative key, such as customerID or nationalID, by which I can >> match them up, and I have found many cases where the same customer >> has different first names in the two datasets. A sampling of the >> differences is as follows: >> >> Michael <=> Mike >> Tom <=> Thomas >> Liz <=> Elizabeth >> Margaret <=> Maggie >> >> How can I build a query in PostgreSQL (v. 9.6) that will find >> possible matches like these on nicknames? My initial guess is that I >> would have to either find or build some sort of intermediary table >> that contains associated names like those above. Sometimes though, >> there will be more than matching pairs, like: >> >> Jim <=> James <=> Jimmy <=> Jimmie >> Bill <=> Will <=> Willie <=> William >> >> and so forth. >> >> Has anyone used or developed PostgreSQL queries that will find >> matches like these? I am running all my database queries. on my local >> laptops (Win7 and macOS), so performance or uptime is no issue here. >> I am curious to see how others in this community have creatively >> solved this common problem. >> >> One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might >> work here, but honestly I am clueless as to how to set this up or use >> it in queries successfully. >> >> Thanks, >> Michael (aka Mike, aka Mikey) >> > > Check out chapter F15 in the doco. > Try the double metaphone. > I worked on something similar many years ago cleaning up input created > by data entry clerks from hand written speeding tickets, so as to match > with "trusted" data held in a database. > As the volume of input was small in comparison with the number of > licensed drivers, we could iterate over and over again trying to match > it up. Also check out pg_trgm extension. It's better for addresses than names, but might be something to look at depending on how things turn up with the data. merlin