Discussion: Bulk processing & deletion

Bulk processing & deletion

From:
Ivan Voras
Date:
Hello,

I have a table with a large number of records (millions), on which the
following should be performed:

    1. Retrieve a set of records by a SELECT query with a WHERE condition
    2. Process these in the application
    3. Delete them from the table

Now, in the default read-committed transaction isolation, I can't just
use the same WHERE condition with a DELETE in step 3 as it might delete
more records than are processed in step 1 (i.e. phantom read). I've
thought of several ways around it and would like some feedback on which
would be the most efficient:

#1: Create a giant DELETE WHERE ... IN (...) SQL command for step #3
with primary keys of records from step 1 - but will it hit a SQL string
length limitation in the database? Is there such a limit (and if so, what is it)?

#2: Same as #1 but with batching the records to e.g. 1000 at a time, all
in one transaction

#3: Use a higher isolation level, probably Repeatable Read (PG 9.0) -
but then the question is will this block other clients from inserting
new data into the table? Also, is Repeatable Read enough?

Any other ideas?
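
For concreteness, a sketch of what #2 could look like; big_table and id are
hypothetical stand-ins for the real table and its primary key, and
id = ANY (ARRAY[...]) keeps each statement short regardless of the total
number of rows:

BEGIN;
-- one DELETE per batch of ~1000 primary keys collected in step 1
DELETE FROM big_table WHERE id = ANY (ARRAY[101, 102, 103]);  -- batch 1
DELETE FROM big_table WHERE id = ANY (ARRAY[204, 205, 206]);  -- batch 2
-- ...and so on for the remaining batches
COMMIT;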




Re: Bulk processing & deletion

From:
Alban Hertroys
Date:
On 13 October 2011 14:20, Ivan Voras <ivoras@freebsd.org> wrote:
> Hello,
>
> I have a table with a large number of records (millions), on which the
> following should be performed:
>
>        1. Retrieve a set of records by a SELECT query with a WHERE condition
>        2. Process these in the application
>        3. Delete them from the table
>
> Now, in the default read-committed transaction isolation, I can't just
> use the same WHERE condition with a DELETE in step 3 as it might delete
> more records than are processed in step 1 (i.e. phantom read). I've
> thought of several ways around it and would like some feedback on which
> would be the most efficient:
>
> #1: Create a giant DELETE WHERE ... IN (...) SQL command for step #3
> with primary keys of records from step 1 - but will it hit a SQL string
> length limitation in the database? Is there such a limit (and if so, what is it)?
>
> #2: Same as #1 but with batching the records to e.g. 1000 at a time, all
> in one transaction
>
> #3: Use a higher isolation level, probably Repeatable Read (PG 9.0) -
> but then the question is will this block other clients from inserting
> new data into the table? Also, is Repeatable Read enough?
>
> Any other ideas?

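-- Schematic names: "table", id and condition stand in for the real ones
-- (a real table actually named "table" would need double quotes, since
-- TABLE is a reserved word).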
CREATE TABLE to_delete (
  job_created timestamp NOT NULL DEFAULT now(),
  fk_id int NOT NULL
);

-- Mark for deletion
INSERT INTO to_delete (fk_id) SELECT id FROM table WHERE condition = true;

-- Process in app
SELECT table.* FROM table INNER JOIN to_delete ON (table.id = to_delete.fk_id);

-- Delete them
DELETE FROM table WHERE id IN (SELECT fk_id FROM to_delete);
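
One detail the sketch leaves implicit: resetting the queue table so the
next run starts clean. TRUNCATE is transactional in PostgreSQL, so it can
share a transaction with the final DELETE:

-- after the final DELETE, clear the queue for the next run
TRUNCATE to_delete;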


--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: Bulk processing & deletion

From:
Gregg Jaskiewicz
Date:
If you don't need the data for more than a transaction, or for the length
of the connection, use temporary tables to store the ids of the data you
need to delete. If those ids change or move, it means you are missing a PK
on that table.
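
In SQL form, Gregg's suggestion might look like the sketch below; big_table,
id and condition are hypothetical stand-ins again, and ON COMMIT DROP removes
the temporary table automatically when the transaction ends:

BEGIN;
CREATE TEMPORARY TABLE to_delete ON COMMIT DROP AS
    SELECT id FROM big_table WHERE condition;
-- fetch the marked rows (e.g. by joining on to_delete) and process them
-- in the application, then:
DELETE FROM big_table WHERE id IN (SELECT id FROM to_delete);
COMMIT;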

Re: Bulk processing & deletion

From:
Ivan Voras
Date:
On 13/10/2011 14:34, Alban Hertroys wrote:

>> Any other ideas?
>
> CREATE TABLE to_delete (
>   job_created timestamp NOT NULL DEFAULT now(),
>   fk_id int NOT NULL
> );
>
> -- Mark for deletion
> INSERT INTO to_delete (fk_id) SELECT id FROM table WHERE condition = true;
>
> -- Process in app
> SELECT table.* FROM table INNER JOIN to_delete ON (table.id = to_delete.fk_id);
>
> -- Delete them
> DELETE FROM table WHERE id IN (SELECT fk_id FROM to_delete);

Good point. I can even use a temp table for this and make use of
UNLOGGED temp tables when we upgrade to 9.1!




Re: Bulk processing & deletion

From:
Alexander Pyhalov
Date:
On 10/13/2011 16:20, Ivan Voras wrote:
> Hello,
>
> I have a table with a large number of records (millions), on which the
> following should be performed:
>
>     1. Retrieve a set of records by a SELECT query with a WHERE condition
>     2. Process these in the application
>     3. Delete them from the table
>
Hello.
Maybe you can just do BEGIN; DELETE FROM ... WHERE ... RETURNING *; then
process the records and issue COMMIT after processing?
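
Spelled out as a sketch (hypothetical big_table and condition again; the
application consumes the RETURNING result set as its work list):

BEGIN;
DELETE FROM big_table WHERE condition RETURNING *;
-- process the returned rows in the application;
-- ROLLBACK on failure puts the rows back, COMMIT makes the deletion final
COMMIT;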


--
Best regards,
Alexander Pyhalov,
system administrator of Computer Center of Southern Federal University

Re: Bulk processing & deletion

From:
Steve Crawford
Date:
On 10/13/2011 05:20 AM, Ivan Voras wrote:
> Hello,
>
> I have a table with a large number of records (millions), on which the
> following should be performed:
>
>     1. Retrieve a set of records by a SELECT query with a WHERE condition
>     2. Process these in the application
>     3. Delete them from the table
>
Without knowing a bit more, it is difficult to say. A couple questions:

1. Are there conflicting processes - i.e. could multiple applications be
in contention to process the same set of records?

2. Is the processing "all or none" or could individual records fail? If
so, how do you deal with reprocessing or returning those to the main table?

Depending on the nature of your app, it might be feasible to reorder the
actions to move the records to be processed into a "processing" table
and delete them from that table as the records are processed by the
application.

You could move the records into the processing table with:

WITH foo AS (
    DELETE FROM main_table
    WHERE your_where_clause
    RETURNING a, b, c
)
INSERT INTO processing_table (a, b, c)
SELECT a, b, c FROM foo;
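
Then, as each record is handled, it can be dropped from the processing
table; a sketch, assuming column a is the key (the original does not say
which column that is):

DELETE FROM processing_table WHERE a = $1;  -- $1: key of the record just processed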

In this case I would not recommend temporary or unlogged tables for the
processing table as that becomes the only source of the data once
deleted from the master table.

Cheers,
Steve


Re: Bulk processing & deletion

From:
Ivan Voras
Date:
On 13 October 2011 20:08, Steve Crawford <scrawford@pinpointresearch.com> wrote:
> On 10/13/2011 05:20 AM, Ivan Voras wrote:
>>
>> Hello,
>>
>> I have a table with a large number of records (millions), on which the
>> following should be performed:
>>
>>        1. Retrieve a set of records by a SELECT query with a WHERE
>> condition
>>        2. Process these in the application
>>        3. Delete them from the table
>>
> Without knowing a bit more, it is difficult to say. A couple questions:
>
> 1. Are there conflicting processes - i.e. could multiple applications be in
> contention to process the same set of records?

No, only one bulk processor.

> 2. Is the processing "all or none" or could individual records fail? If so,
> how do you deal with reprocessing or returning those to the main table?

All or none; the nature of the thing is that there can be no fatal failures.

> Depending on the nature of your app, it might be feasible to reorder the
> actions to move the records to be processed into a "processing" table and
> delete them from that table as the records are processed by the application.
>
> You could move the records into the processing table with:
>
> with foo as (delete from main_table where your_where_clause returning a,b,c)
> insert into processing_table (a,b,c) select a,b,c from foo;
>
> In this case I would not recommend temporary or unlogged tables for the
> processing table as that becomes the only source of the data once deleted
> from the master table.

Ok, thanks (to everyone)!

Re: Bulk processing & deletion

From:
pasman pasmański
Date:
Unlogged tables can't be temporary: CREATE TABLE treats TEMPORARY and
UNLOGGED as mutually exclusive options, and temporary tables already bypass
WAL anyway.

2011/10/13, Ivan Voras <ivoras@freebsd.org>:
> On 13/10/2011 14:34, Alban Hertroys wrote:
>
>>> Any other ideas?
>>
>> CREATE TABLE to_delete (
>>   job_created timestamp NOT NULL DEFAULT now(),
>>   fk_id int NOT NULL
>> );
>>
>> -- Mark for deletion
>> INSERT INTO to_delete (fk_id) SELECT id FROM table WHERE condition = true;
>>
>> -- Process in app
>> SELECT table.* FROM table INNER JOIN to_delete ON (table.id =
>> to_delete.fk_id);
>>
>> -- Delete them
>> DELETE FROM table WHERE id IN (SELECT fk_id FROM to_delete);
>
> Good point. I can even use a temp table for this and make use of
> UNLOGGED temp tables when we upgrade to 9.1!
>
>
>


--
------------
pasman