On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:
> Indeed. However, the situation is not quite as bleak as it appears:
> - I am only dealing with 50 countries (Luxemburg and Vatican are not
> amongst them)
> - Only for two countries will city/region be displayed instead of
> country.
> - Ultimately, where the only important bit of imformation is the
> country.
> - The only really important persons are those from the two countries.
Ah, see that's critical. You've just been able to restrict the problem
domain to a much simpler task with a smaller and well-defined range of
possibilities. Most of the complexity is in the nasty corner cases and
weirdness, and you've (probably) just cut most of that away.
> Of 17000 historical records, 4400 don't match this simple pattern.
> Of the 4400, 1300 are "USA" or "Usa" instead of "United States", 900
> "North America" whatever that is! There are plenty of common +
> valid region abbreviations.
>
> I get about 1000 new records of this type per year.
I'd do this kind of analysis in a PL/Perl or PL/python function myself.
It's easier to write "If <x> then <y> else <x>" logic in a readable
form, and such chained tests are usually better for this sort of work.
That also makes it easier to do a cleanup pass first, where you
substitute common spelling errors and canonicalize country names.
> However, the import process has to be as automatic as possible in such
> a way that inconsistencies are flagged up for later manual
> intervention. I say later because, for instance, a person's data will
> have to be imported with or without location info because other new
> data will link to it.
That's another good reason to use a PL function for this cleanup work.
It's easy to INSERT a record into a side table that flags it for later
examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if
you need to do closer-to-realtime checking.
--
Craig Ringer
Tech-related writing at http://soapyfrogs.blogspot.com/