Re: faster INSERT with possible pre-existing row?

Поиск
Список
Период
Сортировка
От Matthew Nuzum
Тема Re: faster INSERT with possible pre-existing row?
Дата
Msg-id f3c0b40805072612357db85f2c@mail.gmail.com
обсуждение исходный текст
Ответ на faster INSERT with possible pre-existing row?  (Dan Harris <fbsd@drivefaster.net>)
Ответы Re: faster INSERT with possible pre-existing row?  (John A Meinel <john@arbash-meinel.com>)
Список pgsql-performance
On 7/26/05, Dan Harris <fbsd@drivefaster.net> wrote:
> I am working on a process that will be inserting tens of million rows
> and need this to be as quick as possible.
>
> The catch is that for each row I could potentially insert, I need to
> look and see if the relationship is already there  to prevent
> multiple entries.  Currently I am doing a SELECT before doing the
> INSERT, but I recognize the speed penalty in doing to operations.  I
> wonder if there is some way I can say "insert this record, only if it
> doesn't exist already".  To see if it exists, I would need to compare
> 3 fields instead of just enforcing a primary key.

I struggled with this for a while. At first I tried stored procedures
and triggers, but it took very long (over 24 hours for my dataset).
After several iterations of rewritting it, first into C# then into
Python I got the whole process down to under 30 min.

My scenario is this:
I want to normalize log data. For example, for the IP address in a log
entry, I need to look up the unique id of the IP address, or if the IP
address is new, insert it and then return the newly created entry.
Multiple processes use the data, but only one process, run daily,
actually changes it. Because this one process knows that the data is
static, it selects the tables into in-memory hash tables (C#) or
Dictionaries (Python) and then does the lookups there. It is *super*
fast, but it uses a *lot* of ram. ;-)

To limit the ram, I wrote a version of the python code that uses gdbm
files instead of Dictionaries. This requires a newer version of Python
(to allow a gdbm db to work just like a dictionary) but makes life
easier in case someone is using my software on a lower end machine.
This doubled the time of the lookups from about 15 minutes to 30,
bringing the whole process to about 45 minutes.

--
Matthew Nuzum
www.bearfruit.org

В списке pgsql-performance по дате отправления:

Предыдущее
От: "Roberto Germano Vieweg Neto"
Дата:
Сообщение: [IMPORTANT] - My application performance
Следующее
От: John A Meinel
Дата:
Сообщение: Re: faster INSERT with possible pre-existing row?