Discussion: performance with triggers depends on table size?

performance with triggers depends on table size?

From: Christian Mock
Date:
hello,

I've got two identical tables, "input" and "output"; input is filled
by an importing process, the data is processed and partially deleted,
and the rest is copied to "output" by means of
INSERT INTO output SELECT * FROM input

What I noticed is that with triggers on the output table which update
a statistics table, the time needed for that copying is more or less
linearly dependent on the number of rows in the output table.
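
Schematically, the setup looks something like this (the table and column
names here are only illustrative, not the actual schema; it assumes the
plpgsql language is installed in the database):

  CREATE TABLE input  (id integer, tag text);
  CREATE TABLE output (id integer, tag text);
  CREATE TABLE stats  (tag text PRIMARY KEY, count integer);

  -- trigger function that keeps a per-tag row count in "stats",
  -- written in the 7.2 style (RETURNS opaque, single-quoted body);
  -- newer releases would use RETURNS trigger and dollar quoting
  CREATE FUNCTION output_stats() RETURNS opaque AS '
  DECLARE
    cur integer;
  BEGIN
    SELECT INTO cur count FROM stats WHERE tag = NEW.tag;
    IF NOT FOUND THEN
      INSERT INTO stats (tag, count) VALUES (NEW.tag, 1);
    ELSE
      UPDATE stats SET count = count + 1 WHERE tag = NEW.tag;
    END IF;
    RETURN NEW;
  END;
  ' LANGUAGE 'plpgsql';

  CREATE TRIGGER output_stats_trg AFTER INSERT ON output
    FOR EACH ROW EXECUTE PROCEDURE output_stats();

  -- the copy step described above:
  INSERT INTO output SELECT * FROM input;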

If I empty the triggered function (PL/pgSQL), so that the trigger fires but
doesn't update any tables, performance is about 60% of the triggerless case,
which is OK.

Also, both without a trigger and with the empty trigger, performance is
pretty much constant and doesn't depend on the size of the output table.

What I don't understand is how the size of the output table, which
is never referenced in the trigger function (except via the NEW row alias),
can have such an extreme influence on the copying performance.

I made a small test case (Perl/DBI, attached) which demonstrates
the problem and displays the timings.

Regards,

cm.

--
Christian Mock                          Wiedner Hauptstrasse 15
Senior Security Engineer                1040 Wien
CoreTEC IT Security Solutions GmbH      +43-1-5037273

Attachments

Re: performance with triggers depends on table size?

From: Tom Lane
Date: Mon, 12 Aug 2002 18:21:05 -0400
Christian Mock <cm@coretec.at> writes:
> I've got two identical tables, "input" and "output"; input is filled
> by an importing process, the data is processed and partially deleted,
> and the rest is copied to "output" by means of
> INSERT INTO output SELECT * FROM input

> What I noticed is that with triggers on the output table which update
> a statistics table, the time needed for that copying is more or less
> linearly dependent on the number of rows in the output table.

Perhaps what it's actually dependent on is the size of the stats table?
Do you have indexes on the stats table, and if so are they being used
by the trigger's queries?

            regards, tom lane

Re: performance with triggers depends on table size?

From: Christian Mock
Date:
On Mon, Aug 12, 2002 at 06:21:05PM -0400, Tom Lane wrote:

> > What I noticed is that with triggers on the output table which update
> > a statistics table, the time needed for that copying is more or less
> > linearly dependent on the number of rows in the output table.
>
> Perhaps what it's actually dependent on is the size of the stats table?
> Do you have indexes on the stats table, and if so are they being used
> by the trigger's queries?

I don't think so -- in the test case, the stats table is at its full size
after the second block of copying, and performance still degrades
thereafter. In the real database where I stumbled across this problem,
the stats table has fewer than 10 entries, and it is indexed (both in
the test case and in the real DB) on all the necessary keys (I ran
EXPLAIN on the SELECTs in the trigger function and they used an
index scan).
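
For reference, such a check looks roughly like the following, with
placeholder values in place of real event data; the desired outcome is an
Index Scan rather than a Seq Scan in the resulting plan:

  EXPLAIN
  SELECT stat_id, count
    FROM event_stats
   WHERE sensor_id = '1'
     AND ip_src = '10.0.0.1'
     AND ip_dst = '10.0.0.2'
     AND sig_id = '42'
     AND hour = '2002-08-12 18:00:00'
     AND type = '1';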

Regards,

cm.
--
Christian Mock                          Wiedner Hauptstrasse 15
Senior Security Engineer                1040 Wien
CoreTEC IT Security Solutions GmbH      +43-1-5037273

Re: performance with triggers depends on table size?

From: Christian Mock
Date:
On Mon, Aug 12, 2002 at 06:21:05PM -0400, Tom Lane wrote:

> Perhaps what it's actually dependent on is the size of the stats table?
> Do you have indexes on the stats table, and if so are they being used
> by the trigger's queries?

I started the insert on the real database to copy some 3.6 million records
between the tables, and plotted the timings -- now (with about half a
million records done) you can see the time it takes to insert a row
actually goes in a sawtooth pattern -- cf.
http://www.coretec.at/~cm/insert-trigger.gif
which has msec/inserted row on the Y axis and number of rows in the
"output" table on the X axis.

Can anybody explain this?

Note that these are not the tables from the perl script I posted, but
something more complex with a few foreign key constraints; the trigger,
however, is basically doing the same thing.

Oh, and:

=> select version();
                            version
---------------------------------------------------------------
 PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4

on Debian/Stable (3.0).

Regards,

cm.

--
Christian Mock                          Wiedner Hauptstrasse 15
Senior Security Engineer                1040 Wien
CoreTEC IT Security Solutions GmbH      +43-1-5037273

Re: performance with triggers depends on table size?

From: Tom Lane
Date: Wed, 14 Aug 2002 09:35:31 -0400
Christian Mock <cm@coretec.at> writes:
> I started the insert on the real database to copy some 3.6 million records
> between the tables, and plotted the timings -- now (with about half a
> million records done) you can see the time it takes to insert a row
> actually goes in a sawtooth pattern -- cf.
> http://www.coretec.at/~cm/insert-trigger.gif
> which has msec/inserted row on the Y axis and number of rows in the
> "output" table on the X axis.

Interesting ... I'd have to guess that there is some external influence
causing that.  Do you by any chance have a cron job set up to do
"vacuum"s periodically, and if so do the vacuums correspond to the
points where the insertion time drops?

            regards, tom lane

Re: performance with triggers depends on table size?

From: Christian Mock
Date: Wed, 14 Aug 2002 19:53:19 +0200
On Wed, Aug 14, 2002 at 09:35:31AM -0400, Tom Lane wrote:

> Interesting ... I'd have to guess that there is some external influence
> causing that.  Do you by any chance have a cron job set up to do
> "vacuum"s periodically, and if so do the vacuums correspond to the
> points where the insertion time drops?

I can't seem to see any external influences; there is a once-a-night
vacuum run from cron, which fits with one of the sawtooth "drops",
but it still doesn't explain the other teeth. (And I checked the
postgres log for signs of other vacuum runs, there are none).

The database, and the machine it's running on, is otherwise quiescent.

After an afternoon of playing around, I found I can keep the performance
up (at least for 50000 rows) if I do

  BEGIN;
  INSERT INTO ac_event SELECT * FROM in_event WHERE eid < x AND eid > y;
  COMMIT;
  VACUUM ANALYZE event_stats;

in a tight loop, where the number of rows selected by x and y is very
small -- about 100 or less (and this is with the script not running
on the DB machine).
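
Spelled out, the first two iterations of that loop look like this (the eid
ranges are placeholders for whatever the driving script computes; the point
is that event_stats gets vacuumed after every small batch):

  BEGIN;
  INSERT INTO ac_event SELECT * FROM in_event WHERE eid < 101 AND eid > 0;
  COMMIT;
  VACUUM ANALYZE event_stats;

  BEGIN;
  INSERT INTO ac_event SELECT * FROM in_event WHERE eid < 201 AND eid > 100;
  COMMIT;
  VACUUM ANALYZE event_stats;

  -- ...and so on.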

What I also noticed on verbose vacuums of the event_stats table is that
there's always a relatively large number of "Deleted" tuples, which has
some correlation with the number of rows copied by the previous insert;
interestingly, if x and y are 1000 apart but there are "holes" in
in_event so that fewer rows get copied, "Deleted" is less than the number
of rows copied, but with 1000 rows copied, it is at exactly 1000.

Where's the big number of "Deleted" tuples coming from? The relevant part
of the trigger function is below; the "DELETE" code path should never be
taken (and, judging from the sequence for the stat_id column, which stays
completely in line with "count(*) from event_stats", it never is). Is
the update (on the non-indexed column "count" only) causing this?

I hope one of you gurus can make something out of this data...

regards,

cm.

  x_hour := date_trunc(''hour'', $5);
  SELECT INTO result stat_id, count FROM event_stats
   WHERE sensor_id = x_sensor_id AND ip_src = x_ip_src AND ip_dst = x_ip_dst
   AND sig_id = x_sig_id AND hour = x_hour AND type = x_type;
  IF NOT FOUND THEN
   INSERT INTO event_stats (sensor_id,ip_src,ip_dst,sig_id,hour,type,count)
    VALUES(x_sensor_id,x_ip_src,x_ip_dst,x_sig_id,x_hour,x_type,x_count);
  ELSE
   UPDATE event_stats SET count = count + x_count
    WHERE stat_id = result.stat_id;
   IF (result.count + x_count) = 0 THEN
    DELETE FROM event_stats WHERE stat_id = result.stat_id;
   END IF;
  END IF;

--
Christian Mock                          Wiedner Hauptstrasse 15
Senior Security Engineer                1040 Wien
CoreTEC IT Security Solutions GmbH      +43-1-5037273

Re: performance with triggers depends on table size?

From: Jochem van Dieten
Date:
Christian Mock wrote:
>
> I can't seem to see any external influences; there is a once-a-night
> vacuum run from cron, which fits with one of the sawtooth "drops",
> but it still doesn't explain the other teeth. (And I checked the
> postgres log for signs of other vacuum runs, there are none).

I think you should be more aggressive with vacuuming. If you can, make
the nightly vacuum a VACUUM FULL (it requires an exclusive table lock, so
it is not always an option). And I would further expect better results
if you run a plain VACUUM about every hour.
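
In SQL terms that schedule amounts to roughly the following (the table name
is the one from the rest of the thread; the actual scheduling would be done
from cron):

  -- nightly, during a quiet period (takes an exclusive lock on the table):
  VACUUM FULL ANALYZE event_stats;

  -- roughly every hour (a plain vacuum does not take an exclusive lock):
  VACUUM ANALYZE event_stats;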


> After an afternoon of playing around, I found I can keep the performance
> up (at least for 50000 rows) if I do
>
>   BEGIN;
>   INSERT INTO ac_event SELECT * FROM in_event WHERE eid < x AND eid > y;
>   COMMIT;
>   VACUUM ANALYZE event_stats;
>
> in a tight loop, where the number of rows selected by x and y is very
> small -- about 100 or less (and this is with the script not running
> on the DB machine).

There is something I have been thinking about (on earlier occasions as
well). For a trigger on this insert, the execution plan is determined on
the first insert and is then cached for the rest of the lifetime of the
connection. Right?
Wouldn't that mean that if you start with a very small event_stats
table, the planner would choose a seqscan and stick to that, even if the
planner should switch to an index scan after the event_stats table
reaches a certain size?

Maybe somebody can shed some light on this?


> Where's the big number of "Deleted" tuples coming from? The relevant part
> of the trigger function is below, the "DELETE" code path should never be
> chosen (and from watching the sequence for the stat_id column, which
> is completely in line with "count(*) from event_stats", it never is). Is
> the update (on the non-indexed column "count" only) causing this?

That would be a result of MVCC. The short version is that UPDATEs are
implemented as a combination of a DELETE and an INSERT. The long version
is at http://www.onlamp.com/pub/a/onlamp/2001/05/25/postgresql_mvcc.html
and http://www.postgresql.org/idocs/index.php?mvcc.html
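
A quick way to see the effect in isolation, using a throwaway table:

  CREATE TABLE mvcc_demo (id integer PRIMARY KEY, count integer);
  INSERT INTO mvcc_demo VALUES (1, 0);
  UPDATE mvcc_demo SET count = count + 1;  -- each UPDATE leaves one dead
  UPDATE mvcc_demo SET count = count + 1;  -- row version behind
  UPDATE mvcc_demo SET count = count + 1;
  VACUUM VERBOSE mvcc_demo;  -- the verbose output reports those dead
                             -- ("Deleted") row versions
  DROP TABLE mvcc_demo;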

Jochem


Re: performance with triggers depends on table size?

From: Manfred Koizar
Date: Wed, 14 Aug 2002 22:05:00 +0200
On Wed, 14 Aug 2002 19:53:19 +0200, Christian Mock <cm@coretec.at>
wrote:
>Where's the big number of "Deleted" tuples coming from?

From repeatedly
>   UPDATE event_stats SET count = count + x_count
>    WHERE stat_id = result.stat_id;

Due to MVCC Postgres treats an UPDATE much like a DELETE and an
INSERT.  I suspect the deleted tuples get in the way when the UNIQUE
constraint is checked.  This explains the linear increase of time per
tuple.  I guess you are inserting sorted by (c1, c2, c3) - or at least
in large chunks of equal (c1, c2, c3) -  and the "drops" in your
diagram occur whenever (c1, c2, c3) - or whatever the unique columns
on event_stats are - changes.

For now you have already found a solution/workaround.  In 7.3 this can
be expected to work better.

Servus
 Manfred

Re: performance with triggers depends on table size?

From: Christian Mock
Date: Thu, 15 Aug 2002 02:17:10 +0200
On Wed, Aug 14, 2002 at 10:05:00PM +0200, Manfred Koizar wrote:

> Due to MVCC Postgres treats an UPDATE much like a DELETE and an
> INSERT.  I suspect the deleted tuples get in the way when the UNIQUE

I see; thanks to you and Jochem for explaining.

> constraint is checked.  This explains the linear increase of time per
> tuple.  I guess you are inserting sorted by (c1, c2, c3) - or at least
> in large chunks of equal (c1, c2, c3) -  and the "drops" in your

There's a huge amount of clustering in the data; at the moment I've got
some 2 million rows in the output table and fewer than 1000 in the stats
table. I don't think they're inserted in equal chunks, but in the real
database there are some single-column indexes on the event_stats table,
e.g. on the hour column, and the data is inserted in chronological order.

> For now you have already found a solution/workaround.  In 7.3 this can
> be expected to work better.

So this means my kludgy solution is as good as it can get ATM? That
would mean going back to boring CGI script coding instead of fiddling
with performance :-/

Thanks a lot,

cm.

--
Christian Mock                          Wiedner Hauptstrasse 15
Senior Security Engineer                1040 Wien
CoreTEC IT Security Solutions GmbH      +43-1-5037273

Re: performance with triggers depends on table size?

From: Manfred Koizar
Date: Fri, 16 Aug 2002 11:19:18 +0200
On Thu, 15 Aug 2002 02:17:10 +0200, Christian Mock <cm@coretec.at>
wrote:
>So this means my kludgy solution is as good as it can get ATM? That
>would mean going back to boring CGI script coding instead of fiddling
>with performance :-/

Christian,

do you *need* to update event_stats in a trigger?  What I mean is, if
you have tight control over INSERTs to ac_event, you could remove that
part of the trigger and
    BEGIN;

    INSERT INTO ac_event SELECT * FROM in_event;

    UPDATE event_stats
       SET count = count + t.cnt
      FROM (SELECT c1, c2, ..., COUNT(*) AS cnt
              FROM in_event
             GROUP BY c1, c2, ...) AS t
     WHERE event_stats.c1 = t.c1 AND event_stats.c2 = t.c2 ...;

    INSERT INTO event_stats (c1, c2, ..., count)
    SELECT c1, c2, ..., COUNT(*)
          FROM in_event AS e
     WHERE NOT EXISTS (SELECT *
                         FROM event_stats AS s
                        WHERE s.c1 = e.c1 AND s.c2 = e.c2 ...)
     GROUP BY c1, c2, ...;

    DELETE FROM in_event;  -- ??

    COMMIT;

This assumes you want to move rows from in_event to ac_event.  If you
want to keep rows in in_event, you will probably need an additional
condition in the SELECTs from in_event ...

Servus
 Manfred

Re: performance with triggers depends on table size?

From: Christian Mock
Date:
On Fri, Aug 16, 2002 at 11:19:18AM +0200, Manfred Koizar wrote:

> do you *need* to update event_stats in a trigger?  What I mean is, if

Well, of course I don't need to, but given that the table in question is
not only inserted into but also updated, I'd rather have the stats kept in
sync by the trigger than in the application -- if not, I could use MySQL.

I did a big import (3.7 million rows) and it worked out pretty well with
the regular vacuum -- 50 rows/sec on average, and no more sawtooth
pattern.

Regards,

cm.

--
Christian Mock                          Wiedner Hauptstrasse 15
Senior Security Engineer                1040 Wien
CoreTEC IT Security Solutions GmbH      +43-1-5037273