Thread: in Pl/PgSQL, do commit every 5000 records


in Pl/PgSQL, do commit every 5000 records

From
Emi Lu
Date:
Good morning,

In a plpgsql function, I am trying to insert 900,000 records into
several tables. I remember people mentioning before that it is better
and more efficient to commit, let's say, every 5000 records during the
insertion.

May I get more input about why and how committing like this can speed
things up, please?

Thanks a lot,
Ying


Re: in Pl/PgSQL, do commit every 5000 records

From
Bruno Wolff III
Date:
On Fri, Mar 10, 2006 at 09:36:16 -0500,
  Emi Lu <emilu@encs.concordia.ca> wrote:
> Good morning,
>
> In a plpgsql function, I am trying to insert 900,000 records into
> several tables. I remember people mentioning before that it is better
> and more efficient to commit, let's say, every 5000 records during the
> insertion.

You can't do commits inside of a function. I think you are misremembering
advice about not doing inserts with a transaction per row, which has
a lot of overhead for all of the commits.

You can do savepoints inside of a function, but those are going to slow things
down, not speed them up.
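
As a minimal illustration of the advice in question, assuming a
hypothetical table t(id integer): in autocommit mode each statement is
its own transaction, so the first pattern below pays the commit
overhead once per row, while the second pays it once per batch.

-- one transaction (and one commit) per row: slow
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (2);
INSERT INTO t VALUES (3);

-- one transaction around the whole batch: much faster
BEGIN;
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (2);
INSERT INTO t VALUES (3);
COMMIT;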

Re: in Pl/PgSQL, do commit every 5000 records

From
Emi Lu
Date:
Hi Bruno,

>You can't do commits inside of a function.
>

The example I have is:

CREATE OR REPLACE function test() returns boolean AS $$
DECLARE
... ...
   counter                INTEGER := 0;
BEGIN
... ...
   query_value := ' .....' ;
   OPEN curs1 FOR EXECUTE query_value;
   LOOP
      FETCH curs1 INTO studid;
      EXIT WHEN NOT FOUND;

      query_value := ' INSERT INTO ... ...';

      EXECUTE query_value  ;

      counter := counter + 1 ;
      IF counter%5000 = 0 THEN
         counter := 0;
         COMMIT;
      END IF;

   END LOOP;


   CLOSE curs1;
...
END;

... ...

The above function works ok.

"can't do commits inside of a function " , do you mean although the
function compiled ok and ran successfully, it did not really commit
the insertions at every 5000 records?

>I think you are misremembering advice about not doing inserts with a transaction per row, which has
>a lot of overhead for all of the commits.
>



Re: in Pl/PgSQL, do commit every 5000 records

From
Emi Lu
Date:
I got the answer. Although it compiled, when the counter reaches 5000,
the COMMIT command raises a "SPI_ERROR_TRANSACTION" exception.

Thank you for all your hints.


>> You can't do commits inside of a function.
>
>
> The example I have is:
>
> CREATE OR REPLACE function test() returns boolean AS $$
> DECLARE
> ... ...
>   counter                INTEGER := 0;
> BEGIN
> ... ...
>   query_value := ' .....' ;
>   OPEN curs1 FOR EXECUTE query_value;
>   LOOP
>      FETCH curs1 INTO studid;
>      EXIT WHEN NOT FOUND;
>
>      query_value := ' INSERT INTO ... ...';
>          EXECUTE query_value  ;
>
>      counter := counter + 1 ;
>      IF counter%5000 = 0 THEN
>         counter := 0;
>         COMMIT;
>      END IF;
>
>   END LOOP;
>
>
>   CLOSE curs1;  ...
> END;
>
> ... ...
>
> The above function works ok.
> "can't do commits inside of a function " , do you mean although the
> function compiled ok and ran successfully, it did not really
> commit the insertions at every 5000 records?
>
>> I think you are misremembering advice about not doing inserts with a
>> transaction per row, which has
>> a lot of overhead for all of the commits
>

Re: in Pl/PgSQL, do commit every 5000 records

From
"Florian G. Pflug"
Date:
Emi Lu wrote:
> The example I have is:
>
> CREATE OR REPLACE function test() returns boolean AS $$
> DECLARE
> ... ...
>   counter                INTEGER := 0;
> BEGIN
> ... ...
>   query_value := ' .....' ;
>   OPEN curs1 FOR EXECUTE query_value;
>   LOOP
>      FETCH curs1 INTO studid;
>      EXIT WHEN NOT FOUND;
>
>      query_value := ' INSERT INTO ... ...';
>          EXECUTE query_value  ;
>
>      counter := counter + 1 ;
>      IF counter%5000 = 0 THEN
>         counter := 0;
>         COMMIT;
>      END IF;
>
>   END LOOP;
>
>
>   CLOSE curs1;  ...
> END;
Are you aware of the "insert into <table> (<field1>, ..., <fieldn>) select <val1>, .., <valn> from ...."
command? It'd be much faster to use that if it's possible...
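
For example (a minimal sketch; the table and column names enrolment and
staging are hypothetical, not from your schema), the whole cursor loop
can collapse into a single statement:

-- hypothetical tables: enrolment is the target, staging the source
INSERT INTO enrolment (studid, course)
SELECT s.studid, s.course
FROM   staging s
WHERE  s.valid;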

greetings, Florian Pflug


Re: in Pl/PgSQL, do commit every 5000 records

From
Emi Lu
Date:
Florian G. Pflug wrote:

> Emi Lu wrote:
>
>> The example I have is:
>>
>> CREATE OR REPLACE function test() returns boolean AS $$
>> DECLARE
>> ... ...
>>   counter                INTEGER := 0;
>> BEGIN
>> ... ...
>>   query_value := ' .....' ;
>>   OPEN curs1 FOR EXECUTE query_value;
>>   LOOP
>>      FETCH curs1 INTO studid;
>>      EXIT WHEN NOT FOUND;
>>
>>      query_value := ' INSERT INTO ... ...';
>>          EXECUTE query_value  ;
>>
>>      counter := counter + 1 ;
>>      IF counter%5000 = 0 THEN
>>         counter := 0;
>>         COMMIT;
>>      END IF;
>>
>>   END LOOP;
>>
>>
>>   CLOSE curs1;  ...
>> END;
>
> Are you aware of the "insert into <table> (<field1>, ..., <fieldn>)
> select <val1>, .., <valn> from ...."
> command? It'd be much faster to use that if it's possible...
>
> greetings, Florian Pflug

It is indeed faster. Thank you, Florian. Could you give me a hint as
to why "insert into .. select" is faster than a cursor loop, please?

How about update?

Way1:
update tableA
set col1 = X.col1, col2 = X.col2, ... coln = X.coln
from (select ... from ... where ...) AS X
where tableA.pk = X.pk;

should be faster than

Way2:
open cursor:
fetch (select ... from ... where ... ) into xCol1, xCol2, ... xColn
    update tableA
    set col1 = xCol1, col2 =xCol2..., coln =xColn
    where tableA.pkCols = xPkCols

right?

Re: in Pl/PgSQL, do commit every 5000 records

From
"Florian G. Pflug"
Date:
Emi Lu wrote:
> Florian G. Pflug wrote:
< snipped code of stored procedure >
>>
>> Are you aware of the "insert into <table> (<field1>, ..., <fieldn>)
>> select <val1>, .., <valn> from ...."
>> command? It'd be much faster to use that if it's possible...
>>
>> greetings, Florian Pflug
>
> It is indeed faster. Thank you, Florian. Could you give me a hint as
> to why "insert into .. select" is faster than a cursor loop, please?
Well, you're avoiding a lot of overhead. "insert into ... select from .."
is just one sql-statement. Of course, postgres internally does
something similar to your stored procedure, but it's all compiled
C code now (instead of interpreted plpgsql). Additionally, postgres
might be able to optimize this more than you could from plpgsql, because
you're restricted to the api that is exposed to plpgsql, while the backend-code
might be able to "pull a few more tricks".

In general, if you have the choice between looping over a large result
in a stored procedure (or, even worse, in a client app) and letting the
backend do the looping, then letting the backend handle it is nearly always
faster.
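
One easy way to see the difference for yourself is psql's \timing
toggle, which prints the elapsed time of every statement. The set-based
statement below reuses the hypothetical names from the sketch above;
test() is the cursor-loop function from earlier in the thread:

\timing
SELECT test();                          -- cursor-loop version
INSERT INTO enrolment (studid, course)
SELECT studid, course FROM staging;     -- set-based version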

> How about update?
>
> Way1:
> update tableA
> set col1 = X.col1, col2 = X.col2, ... coln = X.coln
> from (select ... from ... where ...) AS X
> where tableA.pk = X.pk;
>
> should be faster than
>
> Way2:
> open cursor:
> fetch (select ... from ... where ... ) into xCol1, xCol2, ... xColn
>    update tableA
>    set col1 = xCol1, col2 =xCol2..., coln =xColn
>    where tableA.pkCols = xPkCols
>
> right?
I'd say so, yes.
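
Concretely, such an update might look like this minimal sketch (tableA
and the column names follow your outline; tableB is a hypothetical
source table standing in for your subquery):

UPDATE tableA
SET    col1 = X.col1,
       col2 = X.col2
FROM   (SELECT pk, col1, col2 FROM tableB) AS X
WHERE  tableA.pk = X.pk;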

greetings, Florian Pflug

Re: in Pl/PgSQL, do commit every 5000 records

From
Emi Lu
Date:

> Florian G. Pflug wrote:
> < snipped code of stored procedure >
>
>>>
>>> Are you aware of the "insert into <table> (<field1>, ..., <fieldn>)
>>> select <val1>, .., <valn> from ...."
>>> command? It'd be much faster to use that if it's possible...
>>>
>>> greetings, Florian Pflug
>>
>>
>> It is indeed faster. Thank you, Florian. Could you give me a hint as
>> to why "insert into .. select" is faster than a cursor loop, please?
>
> Well, you're avoiding a lot of overhead. "insert into ... select from .."
> is just one sql-statement. Of course, postgres internally does
> something similar to your stored procedure, but it's all compiled
> C code now (instead of interpreted plpgsql). Additionally, postgres
> might be able to optimize this more than you could from plpgsql, because
> you're restricted to the api that is exposed to plpgsql, while the
> backend-code
> might be able to "pull a few more tricks".
>
> In general, if you have the choice between looping over a large result
> in a stored procedure (or, even worse, in a client app) and letting the
> backend do the looping, then letting the backend handle it is nearly
> always
> faster.


The information is very helpful! Thank you again, Florian.

Now, I have a series of queries to be run:

1. "insert into t1... (select .. from ...left join ... .. where ....) "
2. "insert into t2 ... the same sub-query as in 1 "
3. "update t3 set ... from ( the same sub-query as in 1) AS X where
t3.pk = X.pk "
4. "update t4 set ... from ( the same sub-query as in 1) AS X where
t4.pk = X.pk"

The subquery (select .. from ... left join ... .. where ....) is a left
join of two big tables.

Which would be the better way:

a. put all these 4 queries into one function,
   and in perl or java just call this function

b. in perl / java, write and run the 4 queries independently

A pl/pgsql function does not allow commit. So, in the function, if any
step goes wrong, all 4 steps roll back. In java, on the other hand, I
can commit after every query. Might java speed up all four updates?


- Ying

Re: in Pl/PgSQL, do commit every 5000 records

From
"Florian G. Pflug"
Date:
Emi Lu wrote:
 >> Florian G. Pflug wrote:
 >> < snipped code of stored procedure >
 >>>> Are you aware of the "insert into <table> (<field1>, ..., <fieldn>)
 >>>> select <val1>, .., <valn> from ...."
 >>>> command? It'd be much faster to use that if it's possible...
 >>>>
 >>>
 >>> It is indeed faster. Thank you, Florian. Could you give me a hint
 >>> as to why "insert into .. select" is faster than a cursor loop, please?
 >>
 >> Well, you're avoiding a lot of overhead. "insert into ... select from .."
 >> is just one sql-statement. Of course, postgres internally does
 >> something similar to your stored procedure, but it's all compiled
 >> C code now (instead of interpreted plpgsql). Additionally, postgres
 >> might be able to optimize this more than you could from plpgsql, because
 >> you're restricted to the api that is exposed to plpgsql, while the
 >> backend-code
 >> might be able to "pull a few more tricks".
 >>
 >> In general, if you have the choice between looping over a large result
 >> in a stored procedure (or, even worse, in a client app) and letting the
 >> backend do the looping, then letting the backend handle it is nearly
 >> always
 >> faster.
 >
 > The information is very helpful! Thank you again, Florian.
 >
 > Now, I have a series of queries to be run:
 >
 > 1. "insert into t1... (select .. from ...left join ... .. where ....) "
 > 2. "insert into t2 ... the same sub-query as in 1 "
 > 3. "update t3 set ... from ( the same sub-query as in 1) AS X where
 > t3.pk = X.pk "
 > 4. "update t4 set ... from ( the same sub-query as in 1) AS X where
 > t4.pk = X.pk"
 >
 > The subquery (select .. from ... left join ... .. where ....) is a
 > left join of two big tables.
If running the subquery "(select ... from .. left join ... .. where ...)"
takes a long time, even without inserting the records into a new table
(you can benchmark this with "select count(*) from ... left join ... where ...",
and see how long it takes),
then it might be faster to first do
"create temporary table t as select .. from .. left join ... where ...",
and then use the temp table instead of the subquery in the other statements.

Whether this is faster or slower depends on a lot of factors, so you'll
have to test which is better.
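
A minimal sketch of that approach (big_a, big_b, tmp_join and the
column names are hypothetical placeholders for your real schema; the
ANALYZE is optional but gives the planner statistics for the fresh
temp table):

CREATE TEMPORARY TABLE tmp_join AS
SELECT a.pk, a.col1, b.col2
FROM   big_a a
LEFT JOIN big_b b ON b.pk = a.pk;

ANALYZE tmp_join;

INSERT INTO t1 (pk, col1, col2) SELECT pk, col1, col2 FROM tmp_join;
INSERT INTO t2 (pk, col1, col2) SELECT pk, col1, col2 FROM tmp_join;
UPDATE t3 SET col1 = X.col1 FROM tmp_join AS X WHERE t3.pk = X.pk;
UPDATE t4 SET col1 = X.col1 FROM tmp_join AS X WHERE t4.pk = X.pk;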

 > Which would be the better way:
 >
 > a. put all these 4 queries into one function,
 >    and in perl or java just call this function
 >
 > b. in perl / java, write and run the 4 queries independently
Should be about the same - just use whatever fits your overall software design
better.

 > A pl/pgsql function does not allow commit. So, in the function, if any
 > step goes wrong, all 4 steps roll back. In java, on the other hand, I
 > can commit after every query. Might java speed up all four updates?
In postgresql 8.0 and above, you could use the exception support in plpgsql
to prevent the whole transaction from rolling back in case of an error.
Only the statements _inside_ the block where you caught the error would roll back.

From java, you could do the same by using the "savepoint" command manually
(or maybe the jdbc driver for postgres has some support for this - I've never
actually used jdbc).
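
At the SQL level, the savepoint pattern is this minimal sketch
(assuming a hypothetical table t; only the work since the savepoint is
rolled back, the rest of the transaction survives):

BEGIN;
INSERT INTO t VALUES (1);
SAVEPOINT step2;
INSERT INTO t VALUES (2);      -- suppose this statement fails
ROLLBACK TO SAVEPOINT step2;   -- undoes only the work since the savepoint
COMMIT;                        -- the first insert still commits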

In any case, the right approach depends on your application. Are those
inserts/updates independent of each other, or will it cause data
inconsistencies if one is done and the other is not? Is there actually
something your app can do if a statement causes an error, or will it just
be reported, and a human will have to fix it?

greetings, Florian Pflug


Re: in Pl/PgSQL, do commit every 5000 records

From
"Merlin Moncure"
Date:
>  >> In general, if you have the choice between looping over a large result
>  >> in a stored procedure (or, even worse, in a client app) and letting the
>  >> backend do the looping, then letting the backend handle it is nearly
>  >> always
>  >> faster.

There are different reasons why a large query might not always be the
best approach.  Unfortunately it is the only approach on the server
side.

Large queries tend to become less and less practical when the database
becomes really big.  Just as a 'for example', it would be nice to be
able to do part of a large complex job, stop it, and continue it again
later.

stored procedures (not functions) are supposed to give you this power
and allow you to do things which are non-transactional, like vacuum.

merlin

Re: in Pl/PgSQL, do commit every 5000 records

From
Emi Lu
Date:
Hi Merlin,

>> >> In general, if you have the choice between looping over a large result
>> >> in a stored procedure (or, even worse, in a client app) and letting the
>> >> backend do the looping, then letting the backend handle it is nearly
>> >> always
>> >> faster.
>
>There are different reasons why a large query might not always be the
>best approach.  Unfortunately it is the only approach on the server
>side.
>
>Large queries tend to become less and less practical when the database
>becomes really big.  Just as a 'for example', it would be nice to be
>able to do part of a large complex job, stop it, and continue it again
>later.
>
>
Also, combining the suggestions from Florian:

 >> use the exception support in plpgsql to prevent the whole
 >> transaction from rolling back in case of an error.
 >> Only the statements _inside_ the block where you caught the error
 >> would roll back.

I will try to separate my huge data computation into several pieces,
something like:

declare
...
begin
...
             -- step1

        BEGIN
            ...
            insert into ... (select ... ... from ... where ... );

        EXCEPTION WHEN ...... THEN
            -- do nothing
        END;


             -- step2

        BEGIN
            ...
            UPDATE tableA SET ... FROM ... WHERE ... ;

        EXCEPTION WHEN ...... THEN
            -- do nothing
        END;

...
...
end;

If I understood correctly, "begin ... exception when .. then ... end"
can work the same way as commit. In other words, if the commands in the
sub-block (such as step1) run successfully, the data in this part
(step1) is committed. Then step2, step3 ... stepN, which are all in
"begin .. exception .. end" sub-blocks, will be run and "committed" one
by one.


>stored procedures (not functions) are supposed to give you this power
>and allow you to do things which are non-transactional, like vacuum.
>
>
"To define a procedure, i.e. a function that returns nothing, just
specify RETURNS VOID. "
Copied from
http://www.postgresql.org/docs/8.1/interactive/sql-createfunction.html

So, a stored procedure is "a void function", right?

Thanks,
Ying




Re: in Pl/PgSQL, do commit every 5000 records

From
"Merlin Moncure"
Date:
> I will try to separate my huge data computation into several pieces,
> something like:
[...]

> If I understood correctly, "begin ... exception when .. then ... end"
> can work the same way as commit. In other words, if the commands in the
> sub-block (such as step1) run successfully, the data in this part
> (step1) is committed. Then step2, step3 ... stepN, which are all in
> "begin .. exception .. end" sub-blocks, will be run and "committed"
> one by one.

begin ... exception ... end; does not break the transaction up into
independently committed pieces. It does, however, allow graceful
handling of errors inside a function, but that is not what you are
looking for.

To put it another way, it is impossible for any part of the work done
inside the function to become visible to other backends unless you
leave the function without error and the transaction that wraps it (if
there is one) is committed.
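
A minimal sketch of the semantics (with a hypothetical table t; note
that nothing here commits anything):

CREATE OR REPLACE FUNCTION demo() RETURNS void AS $$
BEGIN
    INSERT INTO t VALUES (1);
    BEGIN
        INSERT INTO t VALUES (2);
        RAISE EXCEPTION 'force a failure';
    EXCEPTION WHEN OTHERS THEN
        NULL;  -- only the insert of 2 is rolled back; execution continues
    END;
    INSERT INTO t VALUES (3);
    -- rows 1 and 3 become visible to other backends only when the
    -- transaction wrapping the "SELECT demo()" call finally commits
END;
$$ LANGUAGE plpgsql;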


> "To define a procedure, i.e. a function that returns nothing, just
> specify RETURNS VOID. "
> Copied from
> http://www.postgresql.org/docs/8.1/interactive/sql-createfunction.html
>
> So, a stored procedure is "a void function", right?

yes and no.  Yes in that a procedure differs from a function in that
it returns no value.  No in that 'pure' stored procedures discussed
some months back in hackers (and IIRC not implemented yet) are not
transactional entities.  They are like server-side sql scripts that
take parameters.  Try running vacuum inside a function... it doesn't
work, but it would inside a non-function-based stored procedure.

I think your best bet is to first try Florian's approach of the
monolithic query and see if it works for you...if it does, great.  If
not, you have basically three options:

1. in-transaction cursor:
declare your input cursor in a transaction and fetch x rows at a time
(say, 1000) and write them back over another transaction, committing as
you go.  This is insensitive in that you can't see changes as you loop
through the set.

2. non-transactional cursor (see the sketch after this list):
using only one connection, you declare your cursor 'with hold' and loop
over it and insert over the same connection.  Just beware that postgresql
must materialize 'with hold' cursors into a temporary table.  This
approach is also insensitive.

3. client-side table browsing:
you can browse your tables ISAM style.  This is a fancy way of saying
you use client code to loop over a table ordered on a key.  This
approach can be sensitive (one connection) or insensitive (two
connections) while preserving the ability to commit as you go.
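
A minimal sketch of option 2 (src and dst are hypothetical tables; each
batch runs in its own transaction, so earlier batches stay committed
even if a later one fails):

DECLARE c CURSOR WITH HOLD FOR
    SELECT id, val FROM src ORDER BY id;

BEGIN;
FETCH 1000 FROM c;   -- client code inserts these rows into dst here
COMMIT;              -- this batch is now durably committed

BEGIN;
FETCH 1000 FROM c;   -- next batch, in its own transaction
COMMIT;

CLOSE c;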

merlin