Discussion: [GENERAL] Insert large number of records
Hi guys,

we need to insert from one table into another (PostgreSQL 9.6.1) a large amount of data (about 10-20 million rows) without locking the destination table.
pg_bulkload is the fastest way, but it locks the table.

Are there other ways?
The classic "COPY FROM"?

Thank you!
F
> On 19 Sep 2017, at 15:47, Job <Job@colliniconsulting.it> wrote:
>
> Hi guys,
>
> we need to insert from one table into another (PostgreSQL 9.6.1) a large amount of data (about 10-20 million rows) without locking the destination table.
> pg_bulkload is the fastest way, but it locks the table.
>
> Are there other ways?
> The classic "COPY FROM"?

We do something like that using a staging table to load into initially (although not bulk; data arrives in our staging table in batches of 5k to 100k rows) and then we transfer the data using insert/select and "on conflict do".

That data transfer within PG takes a couple of minutes on our rather limited VM for a wide 37M-row table (~37GB on disk). It only locks the staging table (during the initial bulk load) and the rows in the master table that are currently being altered (during the insert/select).

If your data source is a file in a format supported by COPY, then you can use COPY to do the initial bulk load into the staging table.

Some benefits of this 2-stage approach are that it leaves room to manipulate the data (type conversions, for example) and that it can handle the scenario where a matching target record already exists in the master table. In our case, we convert character fields to varchar (which saves a lot of space(s)).

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
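A minimal sketch of that two-stage load, with made-up table and column names (ON CONFLICT needs a unique constraint on the conflict target and is available from 9.5 on):

    -- 1) bulk load into the staging table
    COPY staging_orders (order_id, customer_id, amount, created_at)
    FROM '/path/to/orders.csv' WITH (FORMAT csv);

    -- 2) transfer into the production table; only the rows being
    --    inserted or updated there are locked
    INSERT INTO orders (order_id, customer_id, amount, created_at)
    SELECT order_id, customer_id, amount, created_at
    FROM   staging_orders
    ON CONFLICT (order_id) DO UPDATE
        SET amount     = EXCLUDED.amount,
            created_at = EXCLUDED.created_at;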
Dear Alban,

thank you for your precious reply, first of all.

>> On 19 Sep 2017, at 15:47, Job <Job@colliniconsulting.it> wrote:
>>
>> Hi guys,
>>
>> we need to insert from one table into another (PostgreSQL 9.6.1) a large amount of data (about 10-20 million rows) without locking the destination table.
>> pg_bulkload is the fastest way, but it locks the table.
>>
>> Are there other ways?
>> The classic "COPY FROM"?

> We do something like that using a staging table to load into initially (although not bulk; data arrives in our staging table in batches of 5k to 100k rows) and then we transfer the data using insert/select and "on conflict do".

> That data transfer within PG takes a couple of minutes on our rather limited VM for a wide 37M-row table (~37GB on disk). It only locks the staging table (during the initial bulk load) and the rows in the master table that are currently being altered (during the insert/select).

We use a "temporary" table, populated by pg_bulkload - it takes a few minutes in this first step.
Then, from the temporary table, the data is transferred by a trigger that copies the records into the production table.
But *this step* really takes a lot of time (sometimes even a few hours).
There are about 10 million records.

We cannot use pg_bulkload to load data directly into the production table, since pg_bulkload would lock the whole table, and the "COPY" command is slow and does not care about table partitioning (the COPY command fires the partitioned-table triggers).

Thank you for the help!
F
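For context, the usual routing pattern for inheritance partitioning in 9.6 is a row-level BEFORE INSERT trigger on the parent table, roughly like the sketch below (names are illustrative, borrowed from the PostgreSQL docs' measurement example, and not necessarily Job's actual schema). Because it runs once per inserted row, pushing 10 million rows through it is expensive:

    CREATE OR REPLACE FUNCTION measurement_insert_trigger()
    RETURNS trigger AS $$
    BEGIN
        IF (NEW.logdate >= DATE '2017-09-01' AND
            NEW.logdate <  DATE '2017-10-01') THEN
            INSERT INTO measurement_y2017m09 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'logdate out of range, fix the routing trigger';
        END IF;
        RETURN NULL;   -- the row is not stored in the parent itself
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER insert_measurement_trigger
        BEFORE INSERT ON measurement
        FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();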
On Tuesday, September 19, 2017, Job <Job@colliniconsulting.it> wrote:
> ... and does not care about table partitioning (the COPY command fires the partitioned-table triggers).
You might want to write a script that inserts directly into the partitions, bypassing routing altogether.
Insert into ... select from ... is your only option for table-to-table and you are stuck with whatever locks the execution of the command needs to take.
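A minimal sketch of the difference (illustrative names, assuming monthly partitions on a logdate column):

    -- inserting through the parent fires the routing trigger once per row:
    INSERT INTO measurement
    SELECT * FROM staging_measurement;

    -- inserting into the child directly skips routing entirely; the rows
    -- selected must satisfy that child's CHECK constraint:
    INSERT INTO measurement_y2017m09
    SELECT * FROM staging_measurement
    WHERE  logdate >= DATE '2017-09-01'
      AND  logdate <  DATE '2017-10-01';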
David J.
On 20 September 2017 at 07:42, Job <Job@colliniconsulting.it> wrote:
> We use a "temporary" table, populated by pg_bulkload - it takes a few minutes in this first step.
> Then, from the temporary table, the data is transferred by a trigger that copies the records into the production table.
> But *this step* really takes a lot of time (sometimes even a few hours).
> There are about 10 million records.

Perhaps the problem isn't entirely on the writing end of the process.

How often does this trigger fire? Once per row inserted into the "temporary" table, once per statement, or only after the bulkload has finished?

Do you have appropriate indices on the temporary table to guarantee quick lookup of the records that need to be copied to the target table(s)?

> We cannot use pg_bulkload to load data directly into the production table, since pg_bulkload would lock the whole table, and the "COPY" command is slow and does not care about table partitioning (the COPY command fires the partitioned-table triggers).

As David already said, inserting directly into the appropriate partition is certainly going to be faster. It removes a check on your partitioning conditions from the query execution plan; if you have many partitions, that adds up, because the database needs to check that condition among all your partitions for every row.

Come to think of it, I was assuming that the DB would stop checking other partitions once it found a suitable candidate, but now I'm not so sure it would. There may be good reasons not to stop, for example if partitions can be divided further into sub-partitions. Anybody?

Since you're already using a trigger, it would probably be more efficient to query your "temporary" table for batches belonging to the same partition and insert those into the partition directly, one partition at a time.

Even better would be if your bulkload could already be organised such that all the data in the "temporary" table can indiscriminately be inserted into the same target partition. That though depends a bit on your setup - at some point the time saved at one end gets consumed on the other or it takes even longer there.

Well, I think I've thrown enough ideas around for now ;)

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
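A rough sketch of that "one partition at a time" idea (purely illustrative: it assumes monthly partitions named measurement_yYYYYmMM keyed on a logdate column, which is not necessarily the actual layout):

    DO $$
    DECLARE
        part record;
    BEGIN
        FOR part IN
            SELECT to_char(d, '"measurement_y"YYYY"m"MM') AS child,
                   d::date                        AS lo,
                   (d + interval '1 month')::date AS hi
            FROM   generate_series(DATE '2017-01-01', DATE '2017-12-01',
                                   interval '1 month') AS d
        LOOP
            EXECUTE format(
                'INSERT INTO %I
                 SELECT * FROM staging_measurement
                 WHERE  logdate >= %L AND logdate < %L',
                part.child, part.lo, part.hi);
        END LOOP;
    END $$;

An index on staging_measurement (logdate) helps each of those per-partition SELECTs find its batch quickly.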
> Even better would be if your bulkload could already be organised such
> that all the data in the "temporary" table can indiscriminately be
> inserted into the same target partition. That though depends a bit on
> your setup - at some point the time saved at one end gets consumed on
> the other or it takes even longer there.

Thank you for the answers and the ideas, really!

We wrote a simple script that splits the data directly into the right partition, avoiding any trigger. We also split the transfer into 100k-record portions.
Now performance has really improved, thanks to everybody!

One further question: with a query launched on the MASTER table, I sometimes need to scan every table, for example to search for rows located in more than one partition.
Is there a way to improve "parallel scans" across several tables at the same time, or not?
I noticed, with explain analyze, that the scan on the master table is always sequential, descending into the partitions.

Thank you again,
F
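For the record, a sketch of what such a chunked, trigger-free transfer can look like (hypothetical: it assumes the staging table has a serial id column and the same illustrative partition layout as above, and is not the actual script):

    -- one 100k-row portion, aimed straight at the right partition
    INSERT INTO measurement_y2017m09
    SELECT * FROM staging_measurement
    WHERE  logdate >= DATE '2017-09-01' AND logdate < DATE '2017-10-01'
      AND  id >  0 AND id <= 100000;

    -- then repeat with id 100001..200000, and so on, until the staging
    -- table is drained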
On 20 September 2017 at 22:55, Job <Job@colliniconsulting.it> wrote:
> One further question: with a query launched on the MASTER table, I sometimes need to scan every table, for example to search for rows located in more than one partition.
> Is there a way to improve "parallel scans" across several tables at the same time, or not?
> I noticed, with explain analyze, that the scan on the master table is always sequential, descending into the partitions.

Since nobody has replied to your latest question yet, I'll give it a try.

Which tables a query on your MASTER table needs to scan largely depends on a PG feature called "constraint exclusion". That is to say, if the query optimizer can deduce from your query that it only needs to scan certain partitions for the required results, then it will do so.

Now, whether the optimizer can do that depends on whether your query conditions contain the same (or equivalent) expressions on the same fields, of the same types, as your partitioning constraints.

That 'same type' part is one that people easily miss. Sometimes part of an expression gets auto-cast to make it compatible with the remainder of the expression, but the result is sometimes not the same type as what is used in your partitioning (exclusion) constraint. In such cases the planner often doesn't see the similarity between the two expressions and ends up scanning the entire set of partitions.

See also section 5.10.4 in https://www.postgresql.org/docs/current/static/ddl-partitioning.html , although it doesn't go into details of how to construct your select statements to prevent scanning the entire partition set.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
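A small illustration of the difference (same made-up measurement layout as earlier; constraint_exclusion must be 'partition', the default, or 'on'):

    -- the constant condition matches the partitions' CHECK constraints,
    -- so the planner can prove only measurement_y2017m09 needs scanning:
    EXPLAIN SELECT * FROM measurement
    WHERE  logdate = DATE '2017-09-15';

    -- a comparison against a non-immutable expression such as CURRENT_DATE
    -- cannot be excluded at plan time, so every partition gets scanned:
    EXPLAIN SELECT * FROM measurement
    WHERE  logdate = CURRENT_DATE;

The same full-partition scan shows up when an implicit cast leaves the comparison in a type the planner can't match against the CHECK constraint, which is the 'same type' pitfall mentioned above.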