Discussion: Slowing UPDATEs inside a transaction


Slowing UPDATEs inside a transaction

From: Matt Burke
Date:
Hi. I've only been using PostgreSQL properly for a week or so, so I
apologise if this has been covered numerous times, however Google is
producing nothing of use.

I'm trying to import a large amount of legacy data (billions of
denormalised rows) into a pg database with a completely different schema,
de-duplicating bits of it on-the-fly while maintaining a reference count.
The procedures to do this have proven easy to write, however the speed is
not pretty. I've spent some time breaking down the cause and it's come down
to a simple UPDATE as evidenced below:


CREATE TABLE foo (a int PRIMARY KEY, b int);
INSERT INTO foo VALUES (1,1);

CREATE OR REPLACE FUNCTION test() RETURNS int AS $$
DECLARE
    i int;
BEGIN
    FOR i IN 1..10000 LOOP
        UPDATE foo SET b=b+1 WHERE a=1;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;

When run individually, this function produces the following timing:
Time: 1912.593 ms
Time: 1937.183 ms
Time: 1941.607 ms
Time: 1943.673 ms
Time: 1944.738 ms

However, when run inside a transaction (to simulate a function doing the
same work) I get this:

START TRANSACTION;
Time: 0.836 ms
Time: 1908.931 ms
Time: 5370.110 ms
Time: 8809.378 ms
Time: 12274.294 ms
Time: 15698.745 ms
Time: 19218.389 ms


There is no disk I/O and the postgres process runs at 100% CPU.
The server is amd64 FreeBSD 8-STABLE with 16GB RAM, running PostgreSQL 9.0.3
from packages.

Looking at the timing of real data (heavily grouped), it seems the speed of
UPDATEs can vary depending on how heavily a row has been updated, so I set out
to produce a test case:

CREATE TABLE foo (a int PRIMARY KEY, b int);
INSERT INTO foo VALUES (1,1),(2,1),(3,1),(4,1);

CREATE OR REPLACE FUNCTION test(int) RETURNS int AS $$
DECLARE
    i int;
BEGIN
    FOR i IN 1..10000 LOOP
        UPDATE foo SET b=1 WHERE a=$1;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;
START TRANSACTION;
SELECT test(1);  Time: 1917.305 ms
SELECT test(2);  Time: 1926.758 ms
SELECT test(3);  Time: 1926.498 ms
SELECT test(1);  Time: 5376.691 ms
SELECT test(2);  Time: 5408.231 ms
SELECT test(3);  Time: 5403.654 ms
SELECT test(1);  Time: 8842.425 ms
SELECT test(4);  Time: 1925.954 ms
COMMIT; START TRANSACTION;
SELECT test(1);  Time: 1911.153 ms


As you can see, the more an individual row is updated /within a
transaction/, the slower it becomes for some reason.

Unfortunately in my real-world case, I need to do many billions of these
UPDATEs. Is there any way I can get around this without pulling my huge
source table out of the database and feeding everything in line-at-a-time
from outside the database?



Thanks.



Re: Slowing UPDATEs inside a transaction

From: Robert Haas
Date:
On Thu, Mar 3, 2011 at 9:13 AM, Matt Burke <mattblists@icritical.com> wrote:
> Hi. I've only been using PostgreSQL properly for a week or so, so I
> apologise if this has been covered numerous times, however Google is
> producing nothing of use.
>
> I'm trying to import a large amount of legacy data (billions of
> denormalised rows) into a pg database with a completely different schema,
> de-duplicating bits of it on-the-fly while maintaining a reference count.
> The procedures to do this have proven easy to write, however the speed is
> not pretty. I've spent some time breaking down the cause and it's come down
> to a simple UPDATE as evidenced below:

PostgreSQL uses MVCC, which means that transactions see a snapshot of
the database as it existed at a certain point in time, usually the
beginning of the current query.  Old row versions have to be kept
around until they're no longer of interest to any still-running
transaction.  Sadly, our ability to detect which row versions are
still of interest is imperfect, so we sometimes keep row versions that
are technically not required.  Unfortunately, repeated updates by the
same transaction to the same database row are one of the cases that we
don't handle very well - all the old row versions will be kept until
the transaction commits.  I suspect if you look at the problem case
you'll find that the table and index are getting bigger with every set
of updates, whereas when you do the updates in separate transactions
the size grows for a while and then levels off.
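
A quick way to watch this happen, reusing the foo table and test() function
from the first message (pg_total_relation_size() reports the table plus its
indexes):

START TRANSACTION;
SELECT test(1);
SELECT pg_total_relation_size('foo');  -- table plus the primary-key index
SELECT test(1);
SELECT pg_total_relation_size('foo');  -- expect a noticeably larger value: the
                                       -- dead versions created by this
                                       -- still-open transaction can't be
                                       -- pruned yet
COMMIT;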

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slowing UPDATEs inside a transaction

From: Merlin Moncure
Date:
On Thu, Mar 3, 2011 at 8:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 3, 2011 at 9:13 AM, Matt Burke <mattblists@icritical.com> wrote:
>> Hi. I've only been using PostgreSQL properly for a week or so, so I
>> apologise if this has been covered numerous times, however Google is
>> producing nothing of use.
>>
>> I'm trying to import a large amount of legacy data (billions of
>> denormalised rows) into a pg database with a completely different schema,
>> de-duplicating bits of it on-the-fly while maintaining a reference count.
>> The procedures to do this have proven easy to write, however the speed is
>> not pretty. I've spent some time breaking down the cause and it's come down
>> to a simple UPDATE as evidenced below:
>
> PostgreSQL uses MVCC, which means that transactions see a snapshot of
> the database as it existed at a certain point in time, usually the
> beginning of the current query.  Old row versions have to be kept
> around until they're no longer of interest to any still-running
> transaction.  Sadly, our ability to detect which row versions are
> still of interest is imperfect, so we sometimes keep row versions that
> are technically not required.  Unfortunately, repeated updates by the
> same transaction to the same database row are one of the cases that we
> don't handle very well - all the old row versions will be kept until
> the transaction commits.  I suspect if you look at the problem case
> you'll find that the table and index are getting bigger with every set
> of updates, whereas when you do the updates in separate transactions
> the size grows for a while and then levels off.

Another perspective on this is that not having explicit transaction
control via stored procedures contributes to the problem.  Being able
to manage transaction state inside the server would allow for a lot of
workarounds without forcing the processing into the client side.

To the OP I would suggest rethinking your processing as inserts into
one or more staging tables, followed up by an INSERT ... SELECT into the
final destination table.  Try to use less looping and more SQL if
possible...
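
A rough sketch of what that could look like for the
de-duplication/reference-count case; the table and column names below are
made up for illustration, not taken from the original post:

-- Bulk-load the raw, denormalised input into a staging table first
-- (COPY is usually the fastest way to get it in):
CREATE TABLE staging (val text);

CREATE TABLE dedup_target (
    val      text PRIMARY KEY,
    refcount bigint NOT NULL
);

-- Fold a whole batch in with two set-based statements instead of
-- billions of per-row UPDATEs: first bump the counts of values that
-- already exist...
UPDATE dedup_target d
SET    refcount = d.refcount + s.cnt
FROM  (SELECT val, count(*) AS cnt FROM staging GROUP BY val) s
WHERE  d.val = s.val;

-- ...then insert the values that are new:
INSERT INTO dedup_target (val, refcount)
SELECT s.val, count(*)
FROM   staging s
WHERE  NOT EXISTS (SELECT 1 FROM dedup_target d WHERE d.val = s.val)
GROUP  BY s.val;

TRUNCATE staging;

Each batch then touches every destination row at most once per statement, so
the repeated-update bloat described above never builds up.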

merlin

Re: Slowing UPDATEs inside a transaction

From: Matt Burke
Date:
Robert Haas wrote:
> Old row versions have to be kept around until they're no longer of
> interest to any still-running transaction.

Thanks for the explanation.

Regarding the snippet above, why would the intermediate history of
multiply-modified uncommitted rows be of interest to anything, or is the
current behaviour simply "cheaper" overall in terms of cpu/developer time?



Re: Slowing UPDATEs inside a transaction

From: Robert Haas
Date:
On Fri, Mar 4, 2011 at 4:21 AM, Matt Burke <mattblists@icritical.com> wrote:
> Robert Haas wrote:
>> Old row versions have to be kept around until they're no longer of
>> interest to any still-running transaction.
>
> Thanks for the explanation.
>
> Regarding the snippet above, why would the intermediate history of
> multiply-modified uncommitted rows be of interest to anything, or is the
> current behaviour simply "cheaper" overall in terms of cpu/developer time?

Because in theory you could have a cursor open.  You could open a
cursor, start to read from it, then make an update.  Now the cursor
needs to see things as they were before the update.
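
For instance, using the foo table from the original test case, a cursor
declared before the UPDATE still has to return the old value:

BEGIN;
DECLARE c CURSOR FOR SELECT b FROM foo WHERE a = 1;
UPDATE foo SET b = b + 1 WHERE a = 1;
FETCH ALL FROM c;  -- still returns the pre-UPDATE value of b, so that old
                   -- row version has to be kept until the transaction ends
COMMIT;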

We might be able to do some optimization here if we had some
infrastructure to detect when a backend has no registered snapshots
with a command-ID less than the command-ID of the currently active
snapshot, but nobody's put in the effort to figure out exactly what's
involved and whether it makes sense.  It's a problem I'm interested
in, but #(needed round-tuits) > #(actual round-tuits).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slowing UPDATEs inside a transaction

From: Merlin Moncure
Date:
On Fri, Mar 4, 2011 at 8:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 4, 2011 at 4:21 AM, Matt Burke <mattblists@icritical.com> wrote:
>> Robert Haas wrote:
>>> Old row versions have to be kept around until they're no longer of
>>> interest to any still-running transaction.
>>
>> Thanks for the explanation.
>>
>> Regarding the snippet above, why would the intermediate history of
>> multiply-modified uncommitted rows be of interest to anything, or is the
>> current behaviour simply "cheaper" overall in terms of cpu/developer time?
>
> Because in theory you could have a cursor open.  You could open a
> cursor, start to read from it, then make an update.  Now the cursor
> needs to see things as they were before the update.
>
> We might be able to do some optimization here if we had some
> infrastructure to detect when a backend has no registered snapshots
> with a command-ID less than the command-ID of the currently active
> snapshot, but nobody's put in the effort to figure out exactly what's
> involved and whether it makes sense.  It's a problem I'm interested
> in, but #(needed round-tuits) > #(actual round-tuits).

Not just cursors, but pl/pgsql for example is also pretty aggressive
about grabbing snapshots.  Also, it's a matter of speculation whether the
case of a single row being updated a high number of times in a single
transaction merits such complicated optimizations.

It bears repeating: explicit transaction control (beyond the dblink
type hacks that currently exist) in backend scripting would solve many
cases where this is a problem in practice, without having to muck
around in the MVCC engine.  Autonomous transactions are another way to
do this...
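
For reference, the dblink-style workaround looks roughly like this (the
connection string is a placeholder); each statement sent over the secondary
connection commits in its own transaction, independently of the calling
session:

-- dblink is a contrib module (CREATE EXTENSION dblink on 9.1+; on 9.0,
-- install contrib/dblink by hand).
SELECT dblink_connect('worker', 'dbname=mydb');  -- placeholder conninfo

-- Runs and commits on the worker connection, outside the caller's
-- transaction:
SELECT dblink_exec('worker', 'UPDATE foo SET b = b + 1 WHERE a = 1');

SELECT dblink_disconnect('worker');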

merlin