Re: How to insert a bulk of data with unique-violations very fast

From: Pierre C
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date:
Msg-id: op.vdwg04j0eorkce@apollo13
In reply to: Re: How to insert a bulk of data with unique-violations very fast (Scott Marlowe)
Responses: Re: How to insert a bulk of data with unique-violations very fast (Torsten Zühlsdorff)
List: pgsql-performance

Since you have lots of data you can use parallel loading.

Split your data into several files, then do:

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this: you don't need crash recovery, since if
something blows up you can simply COPY the data again, and it will be much
faster because no WAL is written for temporary tables.

If your disk is fast, COPY is CPU-bound, so if you run one COPY process
per core and avoid writing WAL, it will scale.
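
For instance, a minimal sketch of one loader session (the table name,
column, and file path below are made up; run one such session per core,
each on its own chunk of the data):

-- hypothetical schema: target table entries(url text UNIQUE)
CREATE TEMPORARY TABLE loader1 (url text);  -- temp table: no WAL written
COPY loader1 FROM '/data/part1.txt';        -- server-side path; from psql,
                                            -- use \copy for a client-side file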

This doesn't solve the other half of your problem (removing the
duplicates), which isn't easy to parallelize, but it will make the COPY
part a lot faster.
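
One way to handle the duplicate removal (a sketch using the same made-up
entries/loader1 schema, assuming the UNIQUE constraint is on entries.url)
is a single INSERT ... SELECT that de-duplicates within the batch and
skips rows already present in the target:

INSERT INTO entries (url)
SELECT l.url
FROM (SELECT DISTINCT url FROM loader1) AS l   -- de-duplicate within the batch
WHERE NOT EXISTS (SELECT 1 FROM entries e
                  WHERE e.url = l.url);        -- skip rows already in the target

This is only race-free if a single session runs the INSERT, which matches
the pipelining suggestion below. (On PostgreSQL 9.5 and later, INSERT ...
ON CONFLICT DO NOTHING achieves the same effect directly.)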

Note that you can have one core process the INSERT / duplicate removal
while the others handle COPY and fill the temp tables, so if you
pipeline it you can save some time.

Does your data contain a lot of duplicates, or are they rare? What
percentage?

