Discussion: How should I deal with disconnects during insert?
Hello!

The problem I'm writing about concerns inserting very large amounts of data into the DB. Recently I switched to using buffers. The best way I know to build a buffer (and the only one I can use with my driver) is to create one long query, possibly made of several statements, like:

"INSERT INTO test VALUES (1, 2), (3, 4) ... (N, M); INSERT INTO test VALUES (7, 8), (9, 10) ... (N2, M2);"

...and then pass it to the query() function your driver provides (no matter what driver you use, it should be quite fast). But what happens if the connection to PostgreSQL is lost during execution? Two different things may happen with such a buffer:

1) I've found that once PostgreSQL has started writing, the connection can safely be lost without affecting the data passed. In this case all the data will be inserted, because PostgreSQL already has the whole query to process.

2) The data isn't inserted if the connection is lost before PostgreSQL has started writing (i.e. during transmission of the query).

Unfortunately, my driver can't tell me at which stage the connection was lost, so after reconnecting I don't know whether I should insert the buffer again or not.

Way #1. One way to solve this problem is to guarantee that the query I execute fails completely when the connection is lost, no matter at which stage. In that case I could safely re-insert a buffer after reconnect without any checks. Is there a way to guarantee that?

Way #2. Another way I see is to keep an additional table containing values that show which buffers were inserted successfully. For example, the last statement in my buffer could be:

INSERT INTO insert_history VALUES (<BUFFER_ID>, <SUCCESSFULLY_INSERTED_FLAG>)

After reconnect, I can check whether the buffer I'm dealing with has already been inserted.

Which way do you prefer, and why? Is there a way to do #1 (it's a bit cleaner and simpler to implement)?

Thanks.

--
Sergey Samokhin
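A sketch of what Way #2 could look like in SQL (the table layouts and buffer id 42 are illustrative assumptions, not part of the original setup):

```sql
-- Hypothetical schema for Way #2 (names and columns are illustrative).
CREATE TABLE test (a integer, b integer);
CREATE TABLE insert_history (buffer_id integer PRIMARY KEY);

-- One buffer: the data rows, followed by a marker row recording
-- that this particular buffer made it into the database.
INSERT INTO test VALUES (1, 2), (3, 4);
INSERT INTO insert_history VALUES (42);

-- After reconnect: was buffer 42 already applied?
SELECT EXISTS (SELECT 1 FROM insert_history WHERE buffer_id = 42);
```

If the marker row is present, the buffer can be skipped; if not, it can be resent.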
On Wed, May 20, 2009 at 4:41 PM, Sergey Samokhin <prikrutil@gmail.com> wrote:

> Is there a way to do #1 (it's a bit cleaner and simpler to implement)?

You could wrap the whole thing in a transaction. If you don't get to the commit, the whole transaction should roll back.

--
- David T. Wilson
david.t.wilson@gmail.com
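For example (a sketch using the "test" table from the original question): if the connection drops anywhere before COMMIT, PostgreSQL rolls the whole transaction back, so the buffer can be resent after reconnect without any checks.

```sql
-- Wrap the entire buffer in one explicit transaction.
BEGIN;
INSERT INTO test VALUES (1, 2), (3, 4);
INSERT INTO test VALUES (7, 8), (9, 10);
COMMIT;
-- Either all rows above are committed, or none of them are.
```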
On Wed, May 20, 2009 at 01:41:33PM -0700, Sergey Samokhin wrote:

> There are two different things which may happen with such a buffer:
>
> 1) I've found that once PostgreSQL has started writing, connection can
> safely be lost without affecting the data passed. In this case all the
> data will be inserted, because PostgreSQL has the whole query to
> process.
>
> 2) Data can't be inserted in the case the connection is lost before
> PostgreSQL has started such a writing (i.e. during transmission of the
> query).
>
> Unfortunatelly, my driver can't say me at which stage the connection
> has been lost, so I don't know should I insert it again or not after
> reconnect.

In general I'm not sure how it would ever know; the easiest way is to do as you're suggesting and retry the insert on any failure.

> Way #1. One of the ways to solve this problem is to guarantee that
> query I execute fails completely when connection is lost no matter at
> which stage. In this case I can safely insert a buffer after reconnect
> without any checks. Is there a way to guarantee that?
>
> Way #2. Another way I see is to have additional table containing
> values that show which buffers were inserted successfully. For
> example, the last statement in my buffer could be:
>
> INSERT INTO insert_history VALUES (<BUFFER_ID>, <SUCCESSFULLY_INSERTED_FLAG>)
>
> After reconnect, I can check if buffer I'm dealing with has been
> inserted or not.

PRIMARY KEYs (or unique constraints in general) and transactions are your friend here; you can check things if you want, or just try dumping the data in and let it fail if it's already there. Not sure what the "flag" is for; surely if there's a row in there matching that buffer id then things are good to go. You could have a foreign key on this from the data (the "test" table above), and the database will check that you're only inserting data into it when you say you are. Transactions will ensure that either everything happens or nothing does.
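Putting those pieces together, a sketch might look like this (the schema and buffer id are assumptions for illustration):

```sql
-- insert_history keyed by buffer id: retrying an already-applied buffer
-- fails on the primary key, which is exactly the signal we want.
CREATE TABLE insert_history (
    buffer_id integer PRIMARY KEY
);

-- The data table references insert_history, so rows can only be
-- inserted under a buffer id that is being recorded.
CREATE TABLE test (
    a integer,
    b integer,
    buffer_id integer REFERENCES insert_history
);

-- One buffer, atomically: marker row plus data rows.
BEGIN;
INSERT INTO insert_history VALUES (42);
INSERT INTO test VALUES (1, 2, 42), (3, 4, 42);
COMMIT;
```

A blind retry of the same buffer after reconnect either succeeds (the first attempt never committed) or aborts on the duplicate buffer_id (it did commit), and the abort can simply be ignored.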
-- Sam http://samason.me.uk/
Hi,

Sergey Samokhin <prikrutil@gmail.com> writes:

> Problem I'm talking about in this letter is related to how we deal
> with extremely big amounts of data to be inserted into DB. Recently I
> switched to using buffers.

You might appreciate this blog entry:

http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/

Regards,
--
dim