Re: How to insert a bulk of data with unique-violations very fast

From: Torsten Zühlsdorff
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: ,
Msg-id: huiroa$sn6$1@news.eternal-september.org
In response to: Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C")
Responses: Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C")
List: pgsql-performance


How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff, )
 Re: How to insert a bulk of data with unique-violations very fast  (Scott Marlowe, )
 Re: How to insert a bulk of data with unique-violations very fast  (Scott Marlowe, )
  Re: How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff, )
   Re: How to insert a bulk of data with unique-violations very fast  (Scott Marlowe, )
    Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C", )
     Re: How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff, )
      Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C", )
       Re: How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff, )
        Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C", )
 Re: How to insert a bulk of data with unique-violations very fast  (Cédric Villemain, )
  Re: How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff, )
 Re: How to insert a bulk of data with unique-violations very fast  (Andy Colson, )

Pierre C wrote:
> Since you have lots of data you can use parallel loading.
>
> Split your data in several files and then do :
>
> CREATE TEMPORARY TABLE loader1 ( ... )
> COPY loader1 FROM ...
>
> Use a TEMPORARY TABLE for this : you don't need crash-recovery since if
> something blows up, you can COPY it again... and it will be much faster
> because no WAL will be written.

That's good advice, thank you :)

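For reference, the quoted loader approach might look like this; table and file names here are placeholders, with one such session per core and per chunk of the split input:

```sql
-- One of these per CPU core, each in its own session/connection.
-- TEMPORARY tables are not WAL-logged, so the COPY avoids the
-- write-ahead-log overhead; if the session dies, just COPY again.
CREATE TEMPORARY TABLE loader1 (
    hostname text NOT NULL
);

-- Bulk-load one chunk of the split input file into the temp table.
COPY loader1 FROM '/tmp/domains_part1.txt';
```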
> If your disk is fast, COPY is cpu-bound, so if you can do 1 COPY process
> per core, and avoid writing WAL, it will scale.
>
> This doesn't solve the other half of your problem (removing the
> duplicates) which isn't easy to parallelize, but it will make the COPY
> part a lot faster.
>
> Note that you can have 1 core process the INSERT / removing duplicates
> while the others are handling COPY and filling temp tables, so if you
> pipeline it, you could save some time.
>
> Does your data contain a lot of duplicates, or are they rare ? What
> percentage ?

Within the data to import, most rows have 20 to 50 duplicates; sometimes
much more, sometimes less.

But over 99.1% of the rows to import are already known. This percentage
is growing, because there is a finite number of rows I want to know.
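With a duplicate rate that high, the merge step is mostly an anti-join. A minimal sketch of the single-writer "remove duplicates and insert" pass, assuming a hypothetical target table `domains` with a unique `hostname` column, could be:

```sql
-- Run by the one merging process while the other cores keep COPYing.
-- SELECT DISTINCT collapses duplicates within the batch itself;
-- NOT EXISTS skips rows already present in the target table.
INSERT INTO domains (hostname)
SELECT DISTINCT l.hostname
FROM loader1 l
WHERE NOT EXISTS (
    SELECT 1 FROM domains d WHERE d.hostname = l.hostname
);
```

Since over 99% of incoming rows are already known, the anti-join filters out almost everything, and only the small remainder generates index and heap writes on the target table.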

In my special case I'm collecting domain names. Until now it's completely
for private interest and runs on normal PC hardware. I'm collecting them
by crawling known sites and checking them for new hosts. Maybe I will
later build an expired-domain service or a reverse-IP database or
something like that. But for now I'm just interested in the connections
between the sites and the structure people choose for domain names.

(Before someone asks: until now I have nearly more rows than domains
exist, because I also collect subdomains of all levels and do not delete
entries.)

Thanks everyone for your advice. This will help me a lot!

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer that abstracts the work with 8
different database systems, separates queries from the application,
and can automatically evaluate query results.

