Re: big joins not converging

Поиск
Список
Период
Сортировка
От fork
Тема Re: big joins not converging
Дата
Msg-id loom.20110311T003919-939@post.gmane.org
обсуждение исходный текст
Ответ на big joins not converging  (Dan Ancona <da@vizbang.com>)
Ответы Re: big joins not converging  (Dan Ancona <da@vizbang.com>)
Список pgsql-performance
Steve Atkins <steve <at> blighty.com> writes:

>
>
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>
> > Hi postgressers -
> >
> > As part of my work with voter file data, I pretty regularly have to join one
large-ish (over 500k rows) table
> to another. Sometimes this is via a text field (countyname) + integer (voter
id). I've noticed sometimes
> this converges and sometimes it doesn't, seemingly regardless of how I index
things.

By "converge" you mean "finish running" -- "converge" has a lot of other
overtones for us amateur math types.

Note that I think you are doing "record linkage" which is a stepchild academic
of its own these days.  It might bear some research.  THere is also a CDC
matching program for text files freely downloadalbe to windows (ack), if you
hunt for it.

For now, my first thought is that you should try a few different matches, maybe
via PL/PGSQL functions, cascading the non-hits to the next step in the process
while shrinking your tables. upcase and delete all spaces, etc. First use
equality on all columns, which should be able to use indices, and separate those
records.  Then try equality on a few columns.  Then try some super fuzzy regexes
on a few columns.  Etc.

You will also have to give some thought to scoring a match, with perfection a
one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein, and other fuzzy string tools are your friend.  I
want to write a version of SAS's COMPGED for Postgres, but I haven't got round
to it yet.

В списке pgsql-performance по дате отправления:

Предыдущее
От: Marti Raudsepp
Дата:
Сообщение: Re: Tuning massive UPDATES and GROUP BY's?
Следующее
От: Dan Ancona
Дата:
Сообщение: Re: big joins not converging