Discussion: in PL/pgSQL, do commit every 5000 records
Good morning,

In a PL/pgSQL function, I am trying to insert 900,000 records into several tables. I remember people mentioning before that it is better and more efficient to commit, say, every 5000 records inserted. May I get more input about why and how this commit can speed up the transaction, please?

Thanks a lot,
Ying
On Fri, Mar 10, 2006 at 09:36:16 -0500, Emi Lu <emilu@encs.concordia.ca> wrote:
> Good morning,
>
> In a PL/pgSQL function, I am trying to insert 900,000 records into
> several tables. I remember people mentioning before that it is better
> and more efficient to commit, say, every 5000 records inserted.

You can't do commits inside of a function. I think you are misremembering advice about not doing inserts with a transaction per row, which incurs a lot of overhead from all of the commits. You can do savepoints inside of a function, but those are going to slow things down, not speed them up.
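Bruno's point - one commit per row is expensive, batched commits are cheap, and batching belongs on the client since a function cannot COMMIT - can be sketched as follows. This is a hypothetical illustration, not code from the thread: it uses Python's stdlib sqlite3 (not PostgreSQL) so it is self-contained, and the table and column names are made up. The same pattern applies unchanged with a PostgreSQL client driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")

BATCH = 5000
rows = ((i, f"row-{i}") for i in range(20000))
count = 0
for row in rows:
    conn.execute("INSERT INTO t (id, val) VALUES (?, ?)", row)
    count += 1
    if count % BATCH == 0:
        conn.commit()   # one commit per 5000 inserts, not one per row
conn.commit()           # commit the final partial batch
total = conn.execute("SELECT count(*) FROM t").fetchone()[0]
print(total)  # 20000
```

The data ends up identical either way; the difference is how many times the transaction-commit machinery runs.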
Hi Bruno,

> You can't do commits inside of a function.

The example I have is:

CREATE OR REPLACE FUNCTION test() RETURNS boolean AS $$
DECLARE
    ... ...
    counter INTEGER := 0;
BEGIN
    ... ...
    query_value := ' ..... ';
    OPEN curs1 FOR EXECUTE query_value;
    LOOP
        FETCH curs1 INTO studid;
        EXIT WHEN NOT FOUND;

        query_value := 'INSERT INTO ... ...';
        EXECUTE query_value;

        counter := counter + 1;
        IF counter % 5000 = 0 THEN
            counter := 0;
            COMMIT;
        END IF;
    END LOOP;
    CLOSE curs1;
    ...
END;
... ...

The above function works OK. By "can't do commits inside of a function", do you mean that although the function compiled OK and ran successfully, it did not really commit the insertions at every 5000 records?

> I think you are misremembering advice about not doing inserts with a
> transaction per row, which incurs a lot of overhead from all of the
> commits.
I got the answer. Although the compile passed, when it reaches 5000 the COMMIT command causes a "SPI_ERROR_TRANSACTION" exception. Thank you for all your hints.

>> You can't do commits inside of a function.

< snipped code of stored procedure >
Emi Lu wrote:
> The example I have is:
< snipped code of stored procedure >

Are you aware of the "INSERT INTO <table> (<field1>, ..., <fieldn>) SELECT <val1>, ..., <valn> FROM ..." command? It'd be much faster to use that if it's possible...

greetings, Florian Pflug
Florian G. Pflug wrote:
> Emi Lu wrote:
>> The example I have is:
< snipped code of stored procedure >
>
> Are you aware of the "INSERT INTO <table> (<field1>, ..., <fieldn>)
> SELECT <val1>, ..., <valn> FROM ..." command? It'd be much faster to
> use that if it's possible...
>
> greetings, Florian Pflug

It did run faster. Thank you, Florian. Could you give me a hint why "INSERT INTO ... SELECT" is faster than a cursor loop, please?

How about UPDATE?

Way 1:

    UPDATE tableA
    SET col1 = X.col1, col2 = X.col2, ... coln = X.coln
    FROM (SELECT ... FROM ... WHERE ...) AS X
    WHERE tableA.pk = X.pk;

should be faster than

Way 2:

    open a cursor:
    FETCH (SELECT ... FROM ... WHERE ...) INTO xCol1, xCol2, ... xColn;
    UPDATE tableA
    SET col1 = xCol1, col2 = xCol2, ..., coln = xColn
    WHERE tableA.pkCols = xPkCols;

right?
Emi Lu wrote:
> Florian G. Pflug wrote:
< snipped code of stored procedure >
>>
>> Are you aware of the "INSERT INTO <table> (<field1>, ..., <fieldn>)
>> SELECT <val1>, ..., <valn> FROM ..." command? It'd be much faster to
>> use that if it's possible...
>>
>> greetings, Florian Pflug
>
> It did run faster. Thank you, Florian. Could you give me a hint why
> "INSERT INTO ... SELECT" is faster than a cursor loop, please?

Well, you're avoiding a lot of overhead. "INSERT INTO ... SELECT FROM ..." is just one SQL statement. Of course, postgres internally does something similar to your stored procedure, but it's all compiled C code now (instead of interpreted plpgsql). Additionally, postgres might be able to optimize this more than you could from plpgsql, because you're restricted to the API that is exposed to plpgsql, while the backend code might be able to "pull a few more tricks".

In general, if you have the choice between looping over a large result in a stored procedure (or, even worse, in a client app) and letting the backend do the looping, then letting the backend handle it is nearly always faster.

> How about UPDATE?
>
> Way 1:
>     UPDATE tableA
>     SET col1 = X.col1, col2 = X.col2, ... coln = X.coln
>     FROM (SELECT ... FROM ... WHERE ...) AS X
>     WHERE tableA.pk = X.pk;
>
> should be faster than
>
> Way 2:
>     open a cursor:
>     FETCH (SELECT ... FROM ... WHERE ...) INTO xCol1, xCol2, ... xColn;
>     UPDATE tableA
>     SET col1 = xCol1, col2 = xCol2, ..., coln = xColn
>     WHERE tableA.pkCols = xPkCols;
>
> right?

I'd say so, yes.

greetings, Florian Pflug
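The set-oriented rewrite Florian recommends - one "INSERT INTO ... SELECT" instead of a fetch/insert loop - looks like this in a self-contained sketch. Again this uses Python's stdlib sqlite3 purely as an illustration (the src/dst tables and the filter are made up), but the statement itself is the same shape in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE dst (id INTEGER PRIMARY KEY, val TEXT);
""")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(1000)])

# One set-oriented statement instead of fetching each row into the
# client (or into a plpgsql loop) and inserting it back one at a time:
conn.execute("INSERT INTO dst (id, val) SELECT id, val FROM src WHERE id % 2 = 0")
conn.commit()
copied = conn.execute("SELECT count(*) FROM dst").fetchone()[0]
print(copied)  # 500
```

The whole copy executes inside the database engine; no rows cross the statement boundary, which is the overhead Florian describes avoiding.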
> Florian G. Pflug wrote:
< snipped earlier discussion >
>
> In general, if you have the choice between looping over a large result
> in a stored procedure (or, even worse, in a client app) and letting the
> backend do the looping, then letting the backend handle it is nearly
> always faster.

The information is very helpful! Thank you again, Florian.

Now I have a series of queries to be run:

1. INSERT INTO t1 ... (SELECT ... FROM ... LEFT JOIN ... WHERE ...)
2. INSERT INTO t2 ... the same sub-query as in 1
3. UPDATE t3 SET ... FROM (the same sub-query as in 1) AS X WHERE t3.pk = X.pk
4. UPDATE t4 SET ... FROM (the same sub-query as in 1) AS X WHERE t4.pk = X.pk

The sub-query (SELECT ... FROM ... LEFT JOIN ... WHERE ...) is a left join of two big tables.

Would there be a better way between:

a. put all these 4 queries into one function; in perl or java, just call this function

b. in perl/java, write and run the 4 queries independently

The PL/pgSQL function does not allow COMMIT. So, in the function, if any step goes wrong, all 4 steps roll back. While in Java, after every query, I can do a commit. Might Java speed up all four updates?

- Ying
Emi Lu wrote:
< snipped earlier discussion >
>
> Now I have a series of queries to be run:
>
> 1. INSERT INTO t1 ... (SELECT ... FROM ... LEFT JOIN ... WHERE ...)
> 2. INSERT INTO t2 ... the same sub-query as in 1
> 3. UPDATE t3 SET ... FROM (the same sub-query as in 1) AS X WHERE t3.pk = X.pk
> 4. UPDATE t4 SET ... FROM (the same sub-query as in 1) AS X WHERE t4.pk = X.pk
>
> The sub-query (SELECT ... FROM ... LEFT JOIN ... WHERE ...) is a left
> join of two big tables.

If running the subquery "(SELECT ... FROM ... LEFT JOIN ... WHERE ...)" takes a long time even without inserting the records into a new table (you can benchmark this with "SELECT count(*) FROM ... LEFT JOIN ... WHERE ...", and see how long it takes), then it might be faster to first do "CREATE TEMPORARY TABLE t AS SELECT ... FROM ... LEFT JOIN ... WHERE ...", and then use the temp table instead of the subquery in the other statements. Whether this is faster or slower depends on a lot of factors, so you'll have to test which is better.

> Would there be a better way between:
>
> a. put all these 4 queries into one function; in perl or java, just
> call this function
>
> b. in perl/java, write and run the 4 queries independently

Should be about the same - just use whatever fits your overall software design better.

> The PL/pgSQL function does not allow COMMIT. So, in the function, if
> any step goes wrong, all 4 steps roll back. While in Java, after every
> query, I can do a commit. Might Java speed up all four updates?

In postgresql 8.0 and above, you could use the exception support in plpgsql to prevent the whole transaction from rolling back in case of an error. Only the statements _inside_ the block where you caught the error would roll back. From java, you could do the same by using the "SAVEPOINT" command manually (or maybe the jdbc driver for postgres has some support for this - I've never actually used jdbc).

In any case, the right approach depends on your application. Are those inserts/updates independent of each other, or will it cause data inconsistencies if one is done and the other is not? Is there actually something your app can do if a statement causes an error, or will it just be reported and a human will have to fix it?

greetings, Florian Pflug
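The manual-savepoint pattern Florian mentions - guard each step with a savepoint so an error rolls back only that step, not the whole transaction - can be sketched like this. A hypothetical illustration using Python's stdlib sqlite3 (which also supports SAVEPOINT), not PostgreSQL/JDBC; the table and the deliberately duplicated key are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # take manual control of transactions
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")

conn.execute("BEGIN")
for step_id in (1, 2, 2, 3):          # the duplicate 2 violates the primary key
    conn.execute("SAVEPOINT step")
    try:
        conn.execute("INSERT INTO t VALUES (?)", (step_id,))
        conn.execute("RELEASE step")  # step succeeded; discard the savepoint
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK TO step")  # undo only the failed step
        conn.execute("RELEASE step")
conn.execute("COMMIT")

surviving = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")]
print(surviving)  # [1, 2, 3]
```

The failed step disappears while the work before and after it survives the final commit - the same behavior a plpgsql BEGIN ... EXCEPTION ... END sub-block gives you inside a function.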
>> In general, if you have the choice between looping over a large result
>> in a stored procedure (or, even worse, in a client app) and letting the
>> backend do the looping, then letting the backend handle it is nearly
>> always faster.

There are different reasons why a large query might not always be the best approach. Unfortunately it is the only approach on the server side.

Large queries tend to become less and less practical when the database becomes really big. Just as a 'for example', it would be nice to be able to do part of a large complex job, stop it, and continue it again later. Stored procedures (not functions) are supposed to give you this power and allow you to do things which are non-transactional, like VACUUM.

merlin
Hi Merlin,

>> Large queries tend to become less and less practical when the database
>> becomes really big. Just as a 'for example', it would be nice to be
>> able to do part of a large complex job, stop it, and continue it again
>> later.

Also, combining the suggestions from Florian:

>> use the exception support in plpgsql to prevent the whole transaction
>> from rolling back in case of an error. Only the statements _inside_
>> the block where you caught the error would roll back.

I will try to separate my huge data computation into several pieces, something like:

DECLARE
    ...
BEGIN
    ...
    -- step 1
    BEGIN
        ...
        INSERT INTO ... (SELECT ... FROM ... WHERE ...);
    EXCEPTION WHEN ... THEN
        -- do nothing
    END;

    -- step 2
    BEGIN
        ...
        UPDATE tableA ... FROM ... WHERE ...;
    EXCEPTION WHEN ... THEN
        -- do nothing
    END;
    ... ...
END;

If I understood correctly, "BEGIN ... EXCEPTION WHEN ... THEN ... END" can work the same way as COMMIT. In other words, if the commands in a sub-block (such as step 1) run successfully, the data in that part (step 1) is committed. Then step 2, step 3, ... step N, which are all in "BEGIN ... EXCEPTION ... END" sub-blocks, will be run and "committed" one by one.

> stored procedures (not functions) are supposed to give you this power
> and allow you to do things which are non-transactional, like VACUUM.

"To define a procedure, i.e. a function that returns nothing, just specify RETURNS VOID." Copied from http://www.postgresql.org/docs/8.1/interactive/sql-createfunction.html

So, a stored procedure is "a void function", right?

Thanks, Ying
> I will try to separate my huge data computation into several pieces,
> something like: [...]
>
> If I understood correctly, "BEGIN ... EXCEPTION WHEN ... THEN ... END"
> can work the same way as COMMIT. In other words, if the commands in a
> sub-block (such as step 1) run successfully, the data in that part
> (step 1) is committed. Then step 2, step 3, ... step N, which are all
> in "BEGIN ... EXCEPTION ... END" sub-blocks, will be run and
> "committed" one by one.

BEGIN ... EXCEPTION ... END does not break the transaction up into smaller subtransactions that commit independently. It does, however, allow graceful handling of errors inside a function, but that is not what you are looking for. To put it another way, it is impossible for any part of the work inside the function to become visible to other backends unless you leave the function without error and the transaction that wraps it (if there is one) is committed.

> "To define a procedure, i.e. a function that returns nothing, just
> specify RETURNS VOID."
>
> So, a stored procedure is "a void function", right?

Yes and no. Yes, in that a procedure differs from a function in that it returns no value. No, in that 'pure' stored procedures, discussed some months back on -hackers (and IIRC not implemented yet), are not transactional entities. They are like server-side SQL scripts that take parameters. Try running VACUUM inside a function... it doesn't work, but it would inside a non-function-based stored procedure.

I think your best bet is to first try Florian's approach of the monolithic query and see if it works for you... if it does, great. If not, you have basically three options:

1. In-transaction cursor: declare your input cursor in a transaction and fetch x rows at a time (say, 1000), writing them back over another transaction and committing as you go. This is insensitive in that you can't see changes as you loop through the set.

2. Non-transactional cursor: using only one connection, declare your cursor WITH HOLD and loop over it, inserting over the same connection. Just beware that postgresql must materialize WITH HOLD cursors into a temporary table. This approach is also insensitive.

3. Client-side table browsing: you can browse your tables ISAM style. This is a fancy way of saying you use client code to loop over a table ordered on a key. This approach can be sensitive (one connection) or insensitive (two connections) while preserving the ability to commit as you go.

merlin
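Option 3, client-side browsing ordered on a key with a commit per batch, can be sketched as follows. This is a hypothetical illustration using Python's stdlib sqlite3 rather than PostgreSQL (the src/dst tables, batch size, and row counts are all made up); with a PostgreSQL driver the loop would look the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE dst (id INTEGER PRIMARY KEY, val TEXT);
""")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(1, 10001)])
conn.commit()

BATCH = 1000
last_id = 0
while True:
    # Walk the table in key order, resuming after the last key seen,
    # instead of holding one cursor open across the whole job.
    batch = conn.execute(
        "SELECT id, val FROM src WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not batch:
        break
    conn.executemany("INSERT INTO dst VALUES (?, ?)", batch)
    conn.commit()           # each batch is durable before the next starts
    last_id = batch[-1][0]

moved = conn.execute("SELECT count(*) FROM dst").fetchone()[0]
print(moved)  # 10000
```

Because each batch commits and the loop remembers only `last_id`, the job can be stopped and resumed later - the property merlin wanted from true stored procedures.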