Discussion: COPY issue (GSoC project)
I want to participate in GSoC 2008. <br /><br />Here is my plan:<br /><span style="color: rgb(255, 0, 0);">the red part</span> comes from <span class="ppt" id="_user_gsmith@gregsmith.com">Greg Smith (thanks for his advice)</span>.<br /><br />1. Release 8.2 made COPY TO able to copy the output of an arbitrary SELECT statement, so I think COPY FROM could likewise take rows from a query's output and insert them into designated columns. The exact format of the command remains to be discussed.<br /><br /><span style="color: rgb(255, 0, 0);">This would be a nice feature. Right now there are often applications where there is a data loading or staging table that ends up being merged with a larger table after some cleanup. Moving that data from the preparation area into the final table right now is most easily done with INSERT INTO X (SELECT A,B FROM C) type actions. This is slow because INSERT takes much longer than COPY. Adding support for COPY X FROM (SELECT A,B FROM C) would make this problem go away.<br /><br />It is possible to do this right now with some clever use of STDIN/OUT, but having a pure SQL solution would be more widely applicable. The overhead of having to pass everything through the client (as STDIN/OUT do) is certainly not zero.</span><br /><br />2. This comes from the TODO list: COPY always behaves like a single unit of work consisting of many insert commands; if any row fails, the whole command rolls back. But sometimes we only care that the valid data gets inserted. In that situation I used to insert row by row inside a "try ... catch ..." to skip the bad rows, which takes much time because every row is examined individually. So:<br />Allow COPY to report error lines and continue.<br />This is a good idea.<br /><br /><span style="color: rgb(255, 0, 0);">This is a long-standing request and many people would be happy to see it implemented. You do want to make sure the implementation easily allows pushing all the lines that didn't commit into what's commonly called a "reject file".</span><br /><br />----------------------------------------------------------------------------<br /><br />Are these feasible?<br />Which one should I choose for my proposal, or both?
On Tue, 2008-03-11 at 20:56 +0800, longlong wrote:
> This would be a nice feature. Right now there are often applications
> where there is a data loading or staging table that ends up being
> merged with a larger table after some cleanup. Moving that data from
> the preparation area into the final table right now is most easily
> done with INSERT INTO X (SELECT A,B FROM C) type actions. This is
> slow because INSERT takes much longer than COPY.

Why would INSERT INTO ... SELECT be any slower than COPY ... FROM SELECT?

> 2. This comes from the TODO list: COPY always behaves like a unit of
> work consisting of some insert commands; if any error occurs, it rolls
> back. But sometimes we only care that the data gets inserted. In that
> situation, I used to use "try ... catch ..." to insert row by row and
> skip the errors, because it takes much time to examine every row. So:
> Allow COPY to report error lines and continue.
> This is a good idea.

Search the archives for prior discussions of this idea; the implementation will require some careful thought. This is a relevant thread:
http://markmail.org/message/y3atxu56s2afgidg

Note also that pg_bulkload currently does something analogous to this outside of the DBMS proper:
http://pgbulkload.projects.postgresql.org/

> which one should I choose for my proposal, or both?

FWIW, error handling for COPY sounds like a more useful project to me.

-Neil
On Tue, 2008-03-11 at 15:18 -0700, Neil Conway wrote:
> Note also that pg_bulkload currently does something analogous to this
> outside of the DBMS proper:
>
> http://pgbulkload.projects.postgresql.org/

Sorry, wrong project. I meant pgloader:
http://pgfoundry.org/projects/pgloader/

-Neil
<br /><br />---------- Forwarded message ----------<br /><span class="gmail_quote">From: <b class="gmail_sendername">longlong</b> <<a href="mailto:asfnuts@gmail.com">asfnuts@gmail.com</a>><br />Date: 2008-3-12 9:42<br />Subject: Re: [HACKERS] COPY issue(gsoc project)<br />To: Neil Conway <<a href="mailto:neilc@samurai.com">neilc@samurai.com</a>><br /><br /></span>The first feature has already been implemented. I used to ignore the pid column with the command "copy tablename (columns) from ...." a long time ago, and I had forgotten. That's my fault.<br /><br /><span style="color: rgb(0, 0, 102);">Allow COPY FROM to create index entries in bulk</span><br /><a href="http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php" target="_blank">http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php</a><p>I know of pg_bulkload from the link above. I don't know what the difference between pg_bulkload and pgloader is. Right now I am trying to figure out how COPY works and to get more information from the archives on prior discussions of this subject. Meanwhile I'll focus on pgloader.</p><br />
2008/3/12, Neil Conway <neilc@samurai.com>:
> I think this is a better idea.

I don't see why creating index entries in bulk has anything to do with
COPY vs. INSERT: if a lot of rows are being loaded into the table in a
single command, it would be a win to create the index entries in bulk,
regardless of whether COPY or INSERT ... SELECT is being used.
In any case, the "create indexes in bulk" hasn't actually been
implemented in mainline Postgres...
I mentioned pgloader just as an example of an existing implementation of
the "error recovery in COPY" idea. The issues with doing an
implementation of error recovery in the backend that I see are:
* in order to be sure that you can recover from an error, you
need to abort the current subtransaction
* starting and committing a subtransaction for every row of the COPY
would be too expensive
* therefore, start and commit a subtransaction for every "n" rows
of input. If an error occurs, you lose at most "n-1" rows of
valid input, which you need to backup and reinsert. There are
various approaches to choosing "n" (statically, based on the
error rate of previous batches in the same load, etc.).
-Neil
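Neil's three bullet points can be sketched as a toy loop. This is a minimal Python simulation under stated assumptions, not backend code: `insert_row` is a made-up stand-in for the per-row heap/index/trigger work, and a raised exception stands in for a subtransaction abort.

```python
def copy_with_batched_subtransactions(rows, insert, n):
    """Load rows using one (simulated) subtransaction per n rows.

    If a batch fails, replay that batch row by row so only the truly
    bad rows are lost and the valid ones are re-inserted.  Returns
    (loaded, rejected, subxacts), where subxacts counts the number of
    subtransactions used.
    """
    loaded, rejected, subxacts = [], [], 0
    for start in range(0, len(rows), n):
        batch = rows[start:start + n]
        subxacts += 1
        staged = []
        try:
            for row in batch:          # one subtransaction for the batch
                insert(row)
                staged.append(row)
            loaded.extend(staged)      # "commit": keep the whole batch
        except ValueError:             # "abort": the batch rolls back...
            for row in batch:          # ...so replay it one row at a time
                subxacts += 1
                try:
                    insert(row)
                    loaded.append(row)
                except ValueError:
                    rejected.append(row)   # goes to the reject file

    return loaded, rejected, subxacts

# toy "insert" that fails for negative rows
def insert_row(r):
    if r < 0:
        raise ValueError(r)

loaded, rejected, nsub = copy_with_batched_subtransactions(
    [1, 2, -3, 4, 5, 6], insert_row, 3)
```

With n = 3 and one bad row, the first batch aborts and is replayed row by row (three extra subtransactions), while the clean second batch costs a single subtransaction.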
From NikhilS (http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php):
But instead of using a per-insert or batch-insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction.
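As a rough illustration, the continue-until-failure scheme above can be simulated in Python. All names here are made up for the sketch: `insert` stands in for simple_heap_insert plus the index/trigger work, and `delete` for simple_heap_delete of the offending tuple.

```python
def load_until_error(rows, insert, delete):
    """Keep one subtransaction open until a row fails; then delete the
    offending heap tuple, commit (not abort) what was inserted so far,
    log the bad row, and start a new subtransaction."""
    committed, bad_file, pending, subxacts = [], [], [], 1
    for row in rows:
        try:
            insert(row)                  # heap insert + index/trigger work
            pending.append(row)
        except ValueError:
            delete(row)                  # remove the offending tuple
            committed.extend(pending)    # commit the open subtransaction
            bad_file.append(row)         # log the row to the "bad file"
            pending = []
            subxacts += 1                # start a new subtransaction
    committed.extend(pending)            # final commit
    return committed, bad_file, subxacts

heap = []
def heap_insert(row):
    heap.append(row)             # the heap insert itself succeeds...
    if row < 0:
        raise ValueError(row)    # ...but the index/trigger step fails

committed, bad, nsub = load_until_error([1, 2, -3, 4, -5, 6],
                                        heap_insert, heap.remove)
```

Note how the number of subtransactions here is one plus the number of bad rows, independent of any batch size "n".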
The simplest things are often the best.
I think the "n"-row approach is hard to implement well, or has some other deficiency, since you want a subtransaction for every "n" rows.
You mentioned in the thread "Re: VLDB Features" that "n" can be adjusted according to when and where errors happened; this is like some of the mechanisms in TCP congestion control. But you can't ignore the time wasted inside a subtransaction before it encounters an error, especially when "n" is big.
I don't know the exact cost of a subtransaction (begin and commit) versus reading one COPY line (CopyReadLine()), so I just count the number of subtransactions:
f(n)=(1-(1-p)^n)*m+m/n
where m is the number of lines, n is the number of lines per subtransaction, and p is the probability that a given row causes an error.
A big "n" reduces the number of batch subtransactions (m/n) but increases the chance that each batch hits an error; unless p is extremely small, choosing a big "n" is a big mistake.
In fact, errors often cluster together (in my experience), so the situation may be a little better in practice.
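The formula is easy to check numerically. In this short script the values of m and p are made up purely for illustration:

```python
def expected_subxacts(m, n, p):
    """f(n) = (1 - (1-p)**n) * m + m/n

    m/n batch subtransactions, plus roughly one replay subtransaction
    per row for every batch that hits at least one error (a fraction
    1 - (1-p)**n of the batches, each replaying n rows).
    """
    return (1 - (1 - p) ** n) * m + m / n

m, p = 1_000_000, 0.001      # illustration values: 1M rows, 0.1% bad
for n in (1, 10, 100, 1000):
    print(n, round(expected_subxacts(m, n, p)))
```

For these parameters the curve dips between n = 10 and n = 100 and then climbs steeply, matching the observation that a big "n" only pays off when p is extremely small.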
However, the idea (from NikhilS) that I started with is the perfect solution. Yes, I have seen the disagreements about commit problems etc. in the email archives of the "Re: VLDB Features" thread, but this shouldn't be a problem, since so many similar problems have been solved in pg.
Hi Longlong,
Yeah, simpler things are often the best, but as folks are mentioning, we need a carefully thought-out approach here. The reply from Tom to my posting there raises issues which need to be taken care of. I still think that if we carry out *sanity* checks before starting the load for the presence of triggers, constraints, fkey constraints, etc., and if others do not have any issues with the approach, the simple_heap_delete idea should work in some cases. Although the term I used, "after some minor cleanup", might need some more thought now that I think about it..
Also, if fkey checks or complex triggers are around, maybe we can fall back to a subtransaction per row insert as a worst-case measure..
Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com
Hi NikhilS,
I had seen every email of that thread before. I didn't think triggers, constraints, and fkey constraints would be a problem, because at that time I didn't know much about when and how those triggers were fired. :P
Now I believe it all comes down to efficiency (of course error handling is a base requirement). So I think a simple implementation should be done first; it can serve as a benchmark against which more advanced and efficient implementations are measured. That will help in finding the final solution.
2008/3/14, NikhilS <nikkhils@gmail.com>:
A subtransaction seems to be the only easy and feasible way to me, as described by Neil earlier:
* in order to be sure that you can recover from an error, you
need to abort the current subtransaction
* starting and committing a subtransaction for every row of the COPY
would be too expensive
* therefore, start and commit a subtransaction for every "n" rows
of input. If an error occurs, you lose at most "n-1" rows of
valid input, which you need to backup and reinsert. There are
various approaches to choosing "n" (statically, based on the
error rate of previous batches in the same load, etc.).
I want to implement this feature as a Google Summer of Code project. Error handling is the primary goal; test cases will be generated for efficiency analysis.
Any suggestion is welcome.