Discussion: COPY issue (gsoc project)


COPY issue (gsoc project)

From:
longlong
Date:
i want to participate in gsoc 2008.

here is my plan (the quoted parts below came from Greg Smith; thanks for Greg Smith's advice):

1. release 8.2 made COPY TO able to copy the output of an arbitrary SELECT statement. so i think maybe COPY FROM could take data from a query's output and 'insert into' designated columns. the format of the command would need to be discussed.

> This would be a nice feature.  Right now there are often applications
> where there is a data loading or staging table that ends up being merged
> with a larger table after some cleanup.  Moving that data from the
> preparation area into the final table right now is most easily done with
> INSERT INTO X (SELECT A,B FROM C) type actions.  This is slow because
> INSERT takes much longer than COPY.  Adding support for COPY X FROM
> (SELECT A,B FROM C) would make this problem go away.
>
> It is possible to do this right now with some clever use of STDIN/OUT like
> the below, but having a pure SQL solution would be more widely applicable.
> The overhead of having to pass everything through the client (as STDIN/OUT
> do) is certainly not zero.

2. this comes from the TODO list: COPY always behaves like a unit of work that consists of some insert commands; if there is any error, it rolls back. but sometimes we only care that the good data gets inserted. in that situation, i used to use "try....catch...." and insert row by row to skip the errors, because it would take too much time to examine every row first. so:
    Allow COPY to report error lines and continue.
this is a good idea.

> This is a long standing request and many people would be happy to see it
> implemented.  You do want to make sure the implementation easily allows
> pushing all the lines that didn't commit into what's commonly called a
> "reject file".

----------------------------------------------------------------------------

are these feasible?
which one should i choose for my proposal, or both?
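to make point 1 concrete, a rough sketch (the tables c and x and the columns a, b are placeholders, and the COPY ... FROM (query) form is only the proposed syntax, nothing that exists today):

    -- placeholder staging table c and final table x
    CREATE TABLE c (a int, b text);
    CREATE TABLE x (a int, b text);

    -- today's slow path: INSERT ... SELECT goes through the ordinary insert machinery
    INSERT INTO x (a, b) SELECT a, b FROM c;

    -- existing workaround: route the rows through the client with COPY TO STDOUT /
    -- COPY FROM STDIN, e.g. by piping one psql invocation into another:
    --   psql -c 'COPY (SELECT a, b FROM c) TO STDOUT' | psql -c 'COPY x (a, b) FROM STDIN'

    -- proposed (exact syntax still to be discussed):
    -- COPY x (a, b) FROM (SELECT a, b FROM c);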

Re: COPY issue (gsoc project)

From:
Neil Conway
Date:
On Tue, 2008-03-11 at 20:56 +0800, longlong wrote:
> This would be a nice feature.  Right now there are often applications
> where there is a data loading or staging table that ends up being
> merged with a larger table after some cleanup.  Moving that data from
> the preparation area into the final table right now is most easily
> done with INSERT INTO X (SELECT A,B FROM C) type actions.  This is
> slow because INSERT takes much longer than COPY.

Why would INSERT INTO ... SELECT be any slower than COPY ... FROM
SELECT?

> 2. this comes from the TODO list: COPY always behaves like a unit of work
> that consists of some insert commands; if there is any error, it rolls
> back. but sometimes we only care that the good data gets inserted. in that
> situation, i used to use "try....catch...." and insert row by row to skip
> the errors, because it would take too much time to examine every row
> first. so:
>     Allow COPY to report error lines and continue.
> this is a good idea.

Search the archives for prior discussions of this idea; the
implementation will require some careful thought. This is a relevant
thread:

http://markmail.org/message/y3atxu56s2afgidg

Note also that pg_bulkload currently does something analogous to this
outside of the DBMS proper:

http://pgbulkload.projects.postgresql.org/

> which one should i choose to proposal or both?

FWIW, error handling for COPY sounds like a more useful project to me.

-Neil




Re: COPY issue (gsoc project)

From:
Neil Conway
Date:
On Tue, 2008-03-11 at 15:18 -0700, Neil Conway wrote:
> Note also that pg_bulkload currently does something analogous to this
> outside of the DBMS proper:
> 
> http://pgbulkload.projects.postgresql.org/

Sorry, wrong project. I meant pgloader:

http://pgfoundry.org/projects/pgloader/

-Neil




Fwd: COPY issue (gsoc project)

From:
longlong
Date:
---------- Forwarded message ----------
From: longlong <asfnuts@gmail.com>
Date: 2008-3-12 9:42
Subject: Re: [HACKERS] COPY issue (gsoc project)
To: Neil Conway <neilc@samurai.com>

the first feature has already been implemented. i used to ignore the pid column with the command "copy tablename (columns) from ...." a long time ago and i forgot about it. that's my fault.

Allow COPY FROM to create index entries in bulk
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php

i know pg_bulkload from the link above. i don't know what the difference is between pg_bulkload and pgloader. right now i am trying to figure out how copy works and to get more information from the archives about prior discussions on this subject. meanwhile i'll focus on pgloader.
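(for the record, the existing column-list form i meant, with made-up table, column and file names; the omitted pid column is simply filled from its default:)

    CREATE TABLE tablename (pid serial, a int, b text);
    COPY tablename (a, b) FROM '/tmp/data.txt';  -- only a and b come from the file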

Re: COPY issue (gsoc project)

From:
longlong
Date:


2008/3/12, Neil Conway <neilc@samurai.com>:
> I don't see why creating index entries in bulk has anything to do with
> COPY vs. INSERT: if a lot of rows are being loaded into the table in a
> single command, it would be a win to create the index entries in bulk,
> regardless of whether COPY or INSERT ... SELECT is being used.
>
> In any case, the "create indexes in bulk" hasn't actually been
> implemented in mainline Postgres...
>
> I mentioned pgloader just as an example of an existing implementation of
> the "error recovery in COPY" idea. The issues with doing an
> implementation of error recovery in the backend that I see are:
>
>     * in order to be sure that you can recover from an error, you
>       need to abort the current subtransaction
>
>     * starting and committing a subtransaction for every row of the COPY
>       would be too expensive
>
>     * therefore, start and commit a subtransaction for every "n" rows
>       of input. If an error occurs, you lose at most "n-1" rows of
>       valid input, which you need to backup and reinsert. There are
>       various approaches to choosing "n" (statically, based on the
>       error rate of previous batches in the same load, etc.).
>
> -Neil


i think this is a better idea.
from NikhilS: http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php

> But instead of using a per insert or a batch insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction.

the simplest things are often the best.
i think it's hard to implement, or has some other deficiency, since you want a subtransaction for every "n" rows.

you have mentioned in the thread "Re: VLDB Features" that 'n' can be changed according to when and where the errors happen. this is like some of the mechanisms in TCP congestion control. but you can't ignore the time wasted in a subtransaction before it hits an error, especially when 'n' is big.
i don't know the exact cost of a subtransaction (begin and commit) or of reading a copy line (CopyReadLine()), so i just calculate the number of subtransactions:

    f(n) = (1 - (1 - p)^n) * m + m / n

where m is the number of lines, n is the number of lines per subtransaction, and p is the probability that a given row causes an error.
a big 'n' reduces the number of subtransactions (m/n), but it also increases the probability that a batch hits an error. unless p is extremely small, choosing a big 'n' is a big mistake.
in fact errors tend to cluster together (in my experience), so the situation may be a little better.
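for example, a quick way to evaluate f(n) for a few candidate batch sizes (the numbers m = 1,000,000 and p = 0.0001 are made up, just for illustration):

    SELECT n,
           round((1 - power(1 - 0.0001, n)) * 1000000 + 1000000.0 / n) AS expected_subtransactions
    FROM (VALUES (1), (10), (100), (1000), (10000)) AS t(n);

with these made-up numbers the expression bottoms out around n = 1/sqrt(p) = 100: both very small and very large batches lose.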

however, the idea (from NikhilS) that i started with still looks like the ideal solution to me. yes, i have seen the email archives of the thread "Re: VLDB Features", and i noticed some disagreements about commit problems and so on. i don't think this will be a problem, since so many similar problems have already been solved in pg.

Re: COPY issue (gsoc project)

From:
NikhilS
Date:
Hi Longlong,


> i think this is a better idea.
> from NikhilS: http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
>
> > But instead of using a per insert or a batch insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction.
>
> the simplest things are often the best.
> i think it's hard to implement, or has some other deficiency, since you want a subtransaction for every "n" rows.


Yeah, simpler things are often the best, but as folks are mentioning, we need a carefully thought out approach here. The reply from Tom to my posting there raises issues which need to be taken care of. Although I still think that if we carry out *sanity* checks before starting the load about the presence of triggers, constraints, fkey constraints etc., and if others do not have any issues with the approach, the simple_heap_delete idea should work in some cases. Although the term I used, "after some minor cleanup", might need some thought too, now that I think more about it...

Also, if Fkey checks or complex triggers are around, maybe we can fall back to a subtransaction-per-row insert as a worst-case measure...

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com

Re: COPY issue (gsoc project)

From:
longlong
Date:
hi NikhilS.

2008/3/14, NikhilS <nikkhils@gmail.com>:
> Hi Longlong,
>
> > i think this is a better idea.
> > from NikhilS: http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
> >
> > > But instead of using a per insert or a batch insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction.
> >
> > the simplest things are often the best.
> > i think it's hard to implement, or has some other deficiency, since you want a subtransaction for every "n" rows.
>
> Yeah, simpler things are often the best, but as folks are mentioning, we need a carefully thought out approach here. The reply from Tom to my posting there raises issues which need to be taken care of. Although I still think that if we carry out *sanity* checks before starting the load about the presence of triggers, constraints, fkey constraints etc., and if others do not have any issues with the approach, the simple_heap_delete idea should work in some cases. Although the term I used, "after some minor cleanup", might need some thought too, now that I think more about it...
>
> Also, if Fkey checks or complex triggers are around, maybe we can fall back to a subtransaction-per-row insert as a worst-case measure...
>
> Regards,
> Nikhils
> --
> EnterpriseDB http://www.enterprisedb.com

i had seen every email of that thread before. i didn't think triggers, constraints and fkey constraints would be a problem, because at that time i didn't know much about when and how those triggers are fired. :P
now i believe it is all about efficiency (of course error handling is a base requirement). so i think a simple implementation should be done first, and it can serve as a benchmark to measure other, more advanced and efficient implementations. this will help to find a final solution.
 
subtransactions seem to me the only easy and feasible way,
as Neil described earlier (a small illustration follows the quoted list):

   * in order to be sure that you can recover from an error, you
     need to abort the current subtransaction

   * starting and committing a subtransaction for every row of the COPY
     would be too expensive

   * therefore, start and commit a subtransaction for every "n" rows
     of input. If an error occurs, you lose at most "n-1" rows of
     valid input, which you need to backup and reinsert. There are
     various approaches to choosing "n" (statically, based on the
     error rate of previous batches in the same load, etc.).
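
a minimal client-level illustration of this batching idea, using SAVEPOINTs on a made-up temp table (the real work would of course happen inside COPY itself in the backend):

    CREATE TEMP TABLE load_target (id int PRIMARY KEY);

    BEGIN;
    SAVEPOINT batch_1;
    INSERT INTO load_target VALUES (1), (2), (3);   -- a batch of n = 3 rows, succeeds
    RELEASE SAVEPOINT batch_1;

    SAVEPOINT batch_2;
    INSERT INTO load_target VALUES (3), (4), (5);   -- fails: 3 violates the primary key
    -- only the subtransaction is aborted; the rows loaded so far survive
    ROLLBACK TO SAVEPOINT batch_2;
    -- at this point the failed batch would be retried row by row (each row in its
    -- own subtransaction) and the offending row written to a reject file
    COMMIT;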
 
i want to implement this feature as a google summer of code project. error handling is the utmost target. test examples will be generated for efficiency analysis.

any suggestion is welcome.