Discussion: Insert into on conflict, data size up to 3 billion records

Insert into on conflict, data size up to 3 billion records

From
Karthik Kumar Kondamudi
Date:
Hi, 

I'm looking for suggestions on how I can improve the performance of the merge statement below. We have a batch process that loads data into _batch tables in Postgres, and the task is to update the main target tables where a record already exists and insert it otherwise. Sometimes these batch tables can grow to 5 billion records. Here is the current scenario:

target_table_main has 700,070,247 records and is hash partitioned into 50 chunks; it has an index on logical_ts. The batch table has 2,715,020,546 records, close to 3 billion, so I'm dealing with a huge data set and I'm looking for the most efficient way to do this.
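
A rough sketch of the kind of upsert being described (table and column names other than target_table_main and logical_ts are placeholders, not taken from the thread):

-- hypothetical column names; only target_table_main and logical_ts appear in the post
INSERT INTO target_table_main AS t (id, logical_ts, payload)
SELECT id, logical_ts, payload
FROM   target_table_batch
ON CONFLICT (id) DO UPDATE
SET    logical_ts = EXCLUDED.logical_ts,
       payload    = EXCLUDED.payload;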

Thank you

Re: Insert into on conflict, data size up to 3 billion records

From
Ron
Date:
On 2/12/21 12:46 PM, Karthik Kumar Kondamudi wrote:
Hi, 

I'm looking for suggestions on how I can improve the performance of the merge statement below. We have a batch process that loads data into _batch tables in Postgres, and the task is to update the main target tables where a record already exists and insert it otherwise. Sometimes these batch tables can grow to 5 billion records. Here is the current scenario:

target_table_main has 700,070,247 records and is hash partitioned into 50 chunks; it has an index on logical_ts. The batch table has 2,715,020,546 records, close to 3 billion, so I'm dealing with a huge data set and I'm looking for the most efficient way to do this.

Many times, I have drastically sped up batch processing by #1 partitioning on the same field as an index, and #2 pre-sorting the input data by that field.

That way, you get excellent "locality of data" (meaning lots of writes to the same hot bits of cache, which later get asynchronously flushed to disk).  Unfortunately for your situation, the purpose of hash partitioning is to reduce locality of data.  (Sometimes that's useful, but not when processing batches.)
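
As a concrete sketch of the pre-sorting idea (the staging-table and key names are assumptions, and as noted above the gain is limited while the target stays hash-partitioned), the staged batch can be physically rewritten in key order before the merge:

-- sketch: rewrite the staged batch in the same order as the target's index
-- so that writes land on neighbouring index and heap pages together
CREATE INDEX IF NOT EXISTS batch_sort_idx ON target_table_batch (logical_ts);
CLUSTER target_table_batch USING batch_sort_idx;  -- rewrites the table in that physical order
ANALYZE target_table_batch;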

--
Angular momentum makes the world go 'round.

Re: Insert into on conflict, data size up to 3 billion records

From
Karthik K
Date:
Exactly. For now, since the table is already partitioned, I created 50 different connections and tried updating the target table by querying directly from the source partition tables. Are there any other techniques I can use to speed this up? Also, when we use an ON CONFLICT statement for both insert and update, does Postgres batch internally (committing every 10,000 records or so), or does it update all records at once? In that case, does it create a new version of each record and swap them all in at once? I'm wondering how atomicity is guaranteed. Also, if I have to batch some way other than selecting from individual partitions, does using batches of 10,000 records help?
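
For what it's worth on the atomicity question: a single INSERT ... ON CONFLICT statement is one atomic statement; PostgreSQL does not commit in internal sub-batches, and the new row versions only become visible to other sessions when the surrounding transaction commits. Any batching therefore has to come from the client, for example by key range (a sketch with placeholder table and column names):

-- each range is its own transaction; the statement inside it is atomic
BEGIN;
INSERT INTO target AS t (id, val)
SELECT id, val
FROM   batch
WHERE  id BETWEEN 1 AND 10000                -- client-chosen range
ON CONFLICT (id) DO UPDATE
SET    val = EXCLUDED.val
WHERE  t.val IS DISTINCT FROM EXCLUDED.val;  -- skip rows that would not change
COMMIT;
-- repeat with the next range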

On Sat, Feb 13, 2021 at 12:04 PM Ron <ronljohnsonjr@gmail.com> wrote:
On 2/12/21 12:46 PM, Karthik Kumar Kondamudi wrote:
Hi, 

I'm looking for suggestions on how I can improve the performance of the merge statement below. We have a batch process that loads data into _batch tables in Postgres, and the task is to update the main target tables where a record already exists and insert it otherwise. Sometimes these batch tables can grow to 5 billion records. Here is the current scenario:

target_table_main has 700,070,247 records and is hash partitioned into 50 chunks; it has an index on logical_ts. The batch table has 2,715,020,546 records, close to 3 billion, so I'm dealing with a huge data set and I'm looking for the most efficient way to do this.

Many times, I have drastically sped up batch processing by #1 partitioning on the same field as an index, and #2 pre-sorting the input data by that field.

That way, you get excellent "locality of data" (meaning lots of writes to the same hot bits of cache, which later get asynchronously flushed to disk).  Unfortunately for your situation, the purpose of hash partitioning is to reduce locality of data.  (Sometimes that's useful, but not when processing batches.)

--
Angular momentum makes the world go 'round.


--
Regards,

Karthik K Kondamudi

Re: Insert into on conflict, data size up to 3 billion records

From
Rob Sargent
Date:

On 2/15/21 11:41 AM, Karthik K wrote:
> exactly, for now, what I did was, as the table is already partitioned, I 
> created 50 different connections and tried updating the target table by 
> directly querying from the source partition tables. Are there any other 
> techniques that I can use to speed this up? also when we use on conflict 
> statement for both insert and update does Postgres uses batching 
> internally (committing for every 10000 records etc) or will it update 
> all records at once, in that case, does it create a version for each 
> record and do swap all at once? I'm wondering how atomicity is 
> guaranteed, also if I have to do batching other than selecting from 
> individual partitions does doing it batches of 10000 records help?
> 
> 

What is your ratio of inserts to updates? Can you separate the inserts 
and updates? Is the target table indexed on anything other than the 
primary key? If so, can those indexes be dropped?

Assuming you use \copy to load the batch tables, I've found this strategy 
to be effective: index the batch table on id, then

--update first
begin;
update target t set "all fields" from batch b
where t.id = b.id and b.id between "lo" and "hi";
commit;
-- increment lo and hi, avoiding overlap; repeat

--insert
begin;
insert into target
select b.* from batch b
where not exists (select 1 from target v where b.id = v.id)
and b.id between "lo" and "hi";
commit;
-- increment lo and hi, avoiding overlap; repeat
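
One way to drive those ranges without editing "lo"/"hi" by hand (a sketch; the table and column names are placeholders, and COMMIT inside a DO block needs PostgreSQL 11 or later):

DO $$
DECLARE
  lo   bigint;
  hi   bigint;
  step bigint := 10000;                     -- slice width, tune as needed
BEGIN
  SELECT min(id), max(id) INTO lo, hi FROM batch;
  WHILE lo <= hi LOOP
    UPDATE target t
    SET    val = b.val                      -- stand-in for "all fields"
    FROM   batch b
    WHERE  t.id = b.id
    AND    b.id >= lo AND b.id < lo + step;

    INSERT INTO target (id, val)
    SELECT b.id, b.val
    FROM   batch b
    WHERE  b.id >= lo AND b.id < lo + step
    AND    NOT EXISTS (SELECT 1 FROM target v WHERE v.id = b.id);

    COMMIT;                                 -- one transaction per slice
    lo := lo + step;
  END LOOP;
END $$;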

Re: Insert into on conflict, data size up to 3 billion records

From
Tim Cross
Date:
Karthik K <kar6308@gmail.com> writes:

> exactly, for now, what I did was, as the table is already partitioned, I
> created 50 different connections and tried updating the target table by
> directly querying from the source partition tables. Are there any other
> techniques that I can use to speed this up? also when we use on conflict
> statement for both insert and update does Postgres uses batching internally
> (committing for every 10000 records etc) or will it update all records at
> once, in that case, does it create a version for each record and do swap
> all at once? I'm wondering how atomicity is guaranteed, also if I have to
> do batching other than selecting from individual partitions does doing it
> batches of 10000 records help?
>

I have had pretty good success with the following strategy. However, you
will need to profile/test each assumption as things vary greatly
depending on data and table structure. A bit of trial and error is
usually required.

1. Use the \COPY command to upload the batch data into a temporary
table, or at least a table with logging turned off (an UNLOGGED table).

2. Run a stored procedure which first updates the existing rows and then
inserts the non-existing rows into your final table (see the sketch after
this list).

3. If your table is partitioned, pre-process your batch into separate
batches split by the partition key, so that instead of one big batch you
have multiple smaller ones. If this is not possible, at least break your
upload into multiple batches rather than one huge one.

4. Optimise the update/insert statements to suit your data and table
structure, dropping any unnecessary indexes and re-building them once
the whole upload has finished. Note that this needs profiling: depending
on the index and index structure, dropping and re-creating can be slower
overall than leaving the index in place.

5. Determine the best times to run ANALYZE to update the table stats.
You probably want to do it after each update and insert run, but
sometimes it may be faster overall to just do it after each 'job'
(update + insert).

6. Don't forget to check the logs and watch for WAL writes that are too
frequent, etc. Often things are tuned for 'normal' operation (outside
bulk uploads) and perform very poorly for bulk uploads, so make sure you
have the right balance.
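
A minimal sketch of steps 1 and 2 (table and column names other than target_table_main and logical_ts are placeholders; \copy is a psql command, shown here as a comment, and an UNLOGGED table skips WAL for its writes at the cost of not surviving a crash):

-- step 1: stage the batch into an UNLOGGED table
CREATE UNLOGGED TABLE batch_stage (LIKE target_table_main INCLUDING DEFAULTS);
-- in psql:  \copy batch_stage FROM 'batch_file.csv' WITH (FORMAT csv)

-- step 2: update existing rows first, then insert the rest
CREATE OR REPLACE PROCEDURE apply_batch()
LANGUAGE sql
AS $$
  UPDATE target_table_main t
  SET    logical_ts = s.logical_ts          -- plus the other columns being carried over
  FROM   batch_stage s
  WHERE  t.id = s.id;

  INSERT INTO target_table_main (id, logical_ts)
  SELECT s.id, s.logical_ts
  FROM   batch_stage s
  WHERE  NOT EXISTS (SELECT 1 FROM target_table_main t2 WHERE t2.id = s.id);
$$;

CALL apply_batch();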



Re: Insert into on conflict, data size up to 3 billion records

From
Karthik K
Date:
Yes, I'm using \copy to load the batch table.

With the new design we are working on, we expect fewer updates and more inserts going forward. One of the target columns I'm updating is indexed, so I will drop the index and try it out. Also, from your suggestion above, splitting the ON CONFLICT into separate insert and update statements performs well, but in order to split the records into batches (low, high) I first need to do a count on the primary key of the batch tables to split it into batches.


On Mon, Feb 15, 2021 at 11:06 AM Rob Sargent <robjsargent@gmail.com> wrote:


On 2/15/21 11:41 AM, Karthik K wrote:
> exactly, for now, what I did was, as the table is already partitioned, I
> created 50 different connections and tried updating the target table by
> directly querying from the source partition tables. Are there any other
> techniques that I can use to speed this up? also when we use on conflict
> statement for both insert and update does Postgres uses batching
> internally (committing for every 10000 records etc) or will it update
> all records at once, in that case, does it create a version for each
> record and do swap all at once? I'm wondering how atomicity is
> guaranteed, also if I have to do batching other than selecting from
> individual partitions does doing it batches of 10000 records help?
>
>

What is your ratio of inserts to updates? Can you separate the inserts
and updates? Is the target table indexed on anything other than the
primary key? If so, can those indexes be dropped?

Assuming you use \copy to load the batch tables, I've found this strategy
to be effective: index the batch table on id, then

--update first
begin;
update target t set "all fields" from batch b
where t.id = b.id and b.id between "lo" and "hi";
commit;
-- increment lo and hi, avoiding overlap; repeat

--insert
begin;
insert into target
select b.* from batch b
where not exists (select 1 from target v where b.id = v.id)
and b.id between "lo" and "hi";
commit;
-- increment lo and hi, avoiding overlap; repeat

--
Regards,

Karthik K Kondamudi

Re: Insert into on conflict, data size up to 3 billion records

From
Rob Sargent
Date:

On 2/15/21 12:22 PM, Karthik K wrote:
> yes, I'm using \copy to load the batch table,
> 
> with the new design that we are doing, we expect updates to be less 
> going forward and more inserts, one of the target columns I'm updating 
> is indexed, so I will drop the index and try it out, also from your 
> suggestion above splitting the on conflict into insert and update is 
> performant but in order to split the record into batches( low, high) I 
> need to do a count of primary key on the batch tables to first split it 
> into batches
> 
> 
I don't think you need to do a count per se.  If you know the 
approximate range (or better, the min and max) in the incoming/batch 
data you can approximate the range values.
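
A sketch of that approach (assuming a bigint key named id): take min and max once, then derive the slice boundaries from those two values instead of counting rows.

-- one aggregate scan (index-only if id is indexed) instead of a full count
SELECT min(id) AS lo, max(id) AS hi FROM batch;

-- or derive, say, 50 roughly equal id ranges directly:
SELECT b.lo + s.n * b.width       AS range_start,
       b.lo + (s.n + 1) * b.width AS range_end
FROM   (SELECT min(id) AS lo,
               ((max(id) - min(id)) / 50) + 1 AS width
        FROM   batch) AS b,
       generate_series(0, 49) AS s(n);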