Обсуждение: How should I deal with disconnects during insert?

Поиск
Список
Период
Сортировка

How should I deal with disconnects during insert?

От
Sergey Samokhin
Дата:
Hello!

Problem I'm talking about in this letter is related to how we deal
with extremely big amounts of data to be inserted into DB. Recently I
switched to using buffers.

The best approach to make a buffer I know (and the only I can use with
my driver) is to create a long query (possibly made of some
statements) like:

"INSERT INTO test VALUES (1, 2), (3, 4) ... (N, M); INSERT INTO test
VALUES (7, 8), (9, 10) ... (N2, M2);"

...and then pass it to query() function your driver has (no matter
what driver you use, it should be quite fast).

But what will happen if the connection to PostgreSQL is lost during
the execution?

There are two different things which may happen with such a buffer:

1) I've found that once PostgreSQL has started writing, connection can
safely be lost without affecting the data passed. In this case all the
data will be inserted, because PostgreSQL has the whole query to
process.

2) Data can't be inserted in the case the connection is lost before
PostgreSQL has started such a writing (i.e. during transmission of the
query).

Unfortunatelly, my driver can't say me at which stage the connection
has been lost, so I don't know should I insert it again or not after
reconnect.

Way #1. One of the ways to solve this problem is to guarantee that
query I execute fails completely when connection is lost no matter at
which stage. In this case I can safely insert a buffer after reconnect
without any checks. Is there a way to guarantee that?

Way #2. Another way I see is to have additional table containing
values that show which buffers were inserted successfully. For
example, the last statement in my buffer could be:

INSERT INTO insert_history VALUES (<BUFFER_ID>, <SUCCESSFULLY_INSERTED_FLAG>)

After reconnect, I can check if buffer I'm dealing with has been
inserted or not.

Which way do you prefer and why?

Is there a way to do #1 (it's a bit cleaner and simpler to implement)?

Thanks.

--
Sergey Samokhin

Re: How should I deal with disconnects during insert?

От
David Wilson
Дата:
On Wed, May 20, 2009 at 4:41 PM, Sergey Samokhin <prikrutil@gmail.com> wrote:

> Is there a way to do #1 (it's a bit cleaner and simpler to implement)?

You could wrap the whole thing in a transaction. If you don't get to
the commit, the whole transaction should roll back.

--
- David T. Wilson
david.t.wilson@gmail.com

Re: How should I deal with disconnects during insert?

От
Sam Mason
Дата:
On Wed, May 20, 2009 at 01:41:33PM -0700, Sergey Samokhin wrote:
> There are two different things which may happen with such a buffer:
>
> 1) I've found that once PostgreSQL has started writing, connection can
> safely be lost without affecting the data passed. In this case all the
> data will be inserted, because PostgreSQL has the whole query to
> process.
>
> 2) Data can't be inserted in the case the connection is lost before
> PostgreSQL has started such a writing (i.e. during transmission of the
> query).
>
> Unfortunatelly, my driver can't say me at which stage the connection
> has been lost, so I don't know should I insert it again or not after
> reconnect.

In general I'm not sure how it would ever know; the easiest way is to do
as you're suggesting and retry the insert on any failure.

> Way #1. One of the ways to solve this problem is to guarantee that
> query I execute fails completely when connection is lost no matter at
> which stage. In this case I can safely insert a buffer after reconnect
> without any checks. Is there a way to guarantee that?
>
> Way #2. Another way I see is to have additional table containing
> values that show which buffers were inserted successfully. For
> example, the last statement in my buffer could be:
>
> INSERT INTO insert_history VALUES (<BUFFER_ID>, <SUCCESSFULLY_INSERTED_FLAG>)
>
> After reconnect, I can check if buffer I'm dealing with has been
> inserted or not.

PRIMARY KEYs (or unique constraints in general) and transactions are
your friend here; you can check things if you want or just try dumping
data in and let it fail if it's already there.

Not sure what the "flag" is for; surely if there's a row in there
matching that buffer id then things are good to go.  You could have
a foreign key on this from the data (the "test" table above) and the
database will check that you're only inserting data into it when you say
you are.  Transactions will ensure that either everything happens or
nothing does.

--
  Sam  http://samason.me.uk/

Re: How should I deal with disconnects during insert?

От
Dimitri Fontaine
Дата:
Hi,

Sergey Samokhin <prikrutil@gmail.com> writes:
> Problem I'm talking about in this letter is related to how we deal
> with extremely big amounts of data to be inserted into DB. Recently I
> switched to using buffers.

You might appreciate this blog entry:
  http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/

Regards,
--
dim