Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.


Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From:
David Gould <daveg@sonic.net>
Date:
On Fri, 30 Oct 2015 21:49:00 -0700
Jeff Janes  wrote:

> On Fri, Oct 30, 2015 at 8:40 AM, Tom Lane  wrote:
> > Alvaro Herrera  writes:
> >> David Gould wrote:
> >>> Anyway, they are not actually vacuuming. They are waiting on the
> >>> VacuumScheduleLock. And requesting fresh snapshots from the
> >>> stats_collector.
> >
> >> Oh, I see.  Interesting.  Proposals welcome.  I especially dislike the
> >> ("very_expensive") pgstat check.
> >
> > Couldn't we simply move that out of the locked stanza?  That is, if no
> > other worker is working on the table, claim it, and release the lock
> > immediately.  Then do the "very expensive" check.  If that fails, we
> > have to re-take the lock to un-claim the table, but that sounds OK.
> 
> 
> The attached patch does that.  On a system with 4 CPUs and 100,000
> tables, a big chunk of them in need of vacuuming, and 30 worker
> processes, this increased throughput by a factor of 40.  Presumably it
> will do even better with more CPUs.
> 
> It is still horribly inefficient, but 40 times less so.

That is a good result for such a small change.

The attached patch against REL9_5_STABLE goes a little further. It
claims the table under the lock, but also addresses the problem of all the
workers racing to redo the same table by enforcing an ordering on all the
workers. No worker can claim a table with an oid smaller than the highest
oid claimed by any worker. That is, instead of racing to the same table,
workers leapfrog over each other.

In theory the recheck of the stats could be eliminated although this patch
does not do that. It does eliminate the special handling of stats snapshots
for autovacuum workers which cuts back on the excess rewriting of the stats
file somewhat.

I'll send numbers shortly, but as I recall it is over 100 times better than
the original.

-dg

-- 
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From:
David Gould <daveg@sonic.net>
Date:
On Fri, 30 Oct 2015 12:51:43 -0400
Tom Lane  wrote:

> Alvaro Herrera  writes:
> > Tom Lane wrote:
> >> Good point ... shouldn't we have already checked the stats before ever
> >> deciding to try to claim the table?
> 
> > The second check is there to allow for some other worker (or manual
> > vacuum) having vacuumed it after we first checked, but which had
> > finished before we check the array of current jobs.
> 
> I wonder whether that check costs more than it saves.

It does indeed. It drives the stats collector wild. And of course if there
are lots of tables and indexes the stats temp file gets very large so that
it can take a long time (seconds) to rewrite it. This happens for each
worker for each table that is a candidate for vacuuming.

Since it would not be convenient to provide a copy of the client's 8TB
database, I have made a standalone reproduction. The attached files:

 build_test_instance.sh - create a test instance
 datagen.py             - used by above to populate it with lots of tables
 logbyv.awk             - count auto analyze actions in postgres log
 trace.sh               - strace the stats collector and autovacuum workers
 tracereport.sh         - list top 50 calls in strace output

The test process is to run the build_test_instance script to create an
instance with one database with a large number of tiny tables. During the
setup autovacuuming is off. Then make a tarball of the instance for reuse.
For each test case, untar the instance, set the number of workers and start
it. After a short time autovacuum will start workers to analyze the new
tables. Expect to see the stats collector doing lots of writing.

You may want to use tmpfs or a ramdisk for the data dir for building the
test instance. The configuration is sized for a reasonable desktop, 8 to 16GB
of memory and an SSD.

-dg

-- 
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From:
David Gould <daveg@sonic.net>
Date:
On Fri, 30 Oct 2015 23:19:52 -0700
David Gould  wrote:

> The attached patch against REL9_5_STABLE goes a little further. It
> claims the table under the lock, but also addresses the problem of all the
> workers racing to redo the same table by enforcing an ordering on all the
> workers. No worker can claim a table with an oid smaller than the highest
> oid claimed by any worker. That is, instead of racing to the same table,
> workers leapfrog over each other.
> 
> In theory the recheck of the stats could be eliminated although this patch
> does not do that. It does eliminate the special handling of stats snapshots
> for autovacuum workers which cuts back on the excess rewriting of the stats
> file somewhat.
> 
> I'll send numbers shortly, but as I recall it is over 100 times better than
> the original.

I sent performance test data and a setup for reproducing it elsewhere in
the thread. I also ran tests on a larger system (128GB mem, many cores, 2x
SSD with battery backed raid).

This is for an idle system with 100,000 new small tables to analyze. I ran
each test for an hour or until 5000 tables were processed. "jj" refers to
the patch from Jeff Janes, "dg" refers to the attached patch (same as
previous).

/autovacuum actions per minute/
workers   9.5b1     jj     dg
-------   -----   ----  -----
   1        183    171    285
   4         45    212   1158
   8         23    462   1225
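
As a reading aid, the relative throughput can be worked out directly from the table. The sketch below only restates the numbers above; the `results` dictionary and the `speedup` helper are illustrative names, not part of any attachment.

```python
# Autovacuum actions per minute, copied from the table above.
results = {
    1: {"9.5b1": 183, "jj": 171, "dg": 285},
    4: {"9.5b1": 45, "jj": 212, "dg": 1158},
    8: {"9.5b1": 23, "jj": 462, "dg": 1225},
}


def speedup(workers, patch):
    """Throughput of `patch` relative to unpatched 9.5b1 at the same worker count."""
    row = results[workers]
    return row[patch] / row["9.5b1"]


for workers in sorted(results):
    print(workers, round(speedup(workers, "jj"), 1), round(speedup(workers, "dg"), 1))
```

At 8 workers this works out to roughly 20x for "jj" and 53x for "dg" over unpatched 9.5b1.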


Could someone please take a look at the patch and comment? Thanks.

-dg

-- 
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From:
David Gould <daveg@sonic.net>
Date:
On Mon, 29 Feb 2016 18:33:50 -0300
Alvaro Herrera  wrote:

> Hi David, did you ever post an updated version of this patch?

No. Let me fix that now. I've attached my current revision of the patch
based on master. This version is significantly better than the original
version and resolves multiple issues:

 - autovacuum workers no longer race each other
 - autovacuum workers do not revacuum each other's tables
 - autovacuum workers no longer thrash the stats collector, which saves a
   lot of IO when the stats file is large.

Hopefully the earlier discussion and the comments in the patch are
sufficient, but please ask questions if anything is not clear.

The result is that on a freshly created 40,000 table database with tiny
tables that all need an initial analyze the unpatched postgres saturates
an SSD updating the stats and manages to process less than tables per
minute. With the attached patch it processes several thousand tables per
minute.

The following is a summary of strace output for the autovacuum workers
and the stats collector while the 40k table test is running. The counts and
times are the cost per table.

postgresql 9.5:   85 tables per minute.

     Operations per Table
 calls    msec    system call        [ 4 autovacuum workers ]
------  ------    -------------------
 19.46  196.09    select(0,          <<< Waiting for stats snapshot
  3.26 1040.46    semop(43188238,    <<< Waiting for AutovacuumScheduleLock
  2.05    0.83    sendto(8,          <<< Asking for stats snapshot

 calls    msec    system call        [ stats collector ]
------  ------    -------------------
  3.02    0.05    recvfrom(8,        <<< Request for snapshot refresh
  1.55  248.64    rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh


+ autovacuum contention patch: 5225 tables per minute

     Operations per Table
 calls    msec    system call        [ 4 autovacuum workers ]
------  ------    -------------------
  0.63    6.34    select(0,          <<< Waiting for stats snapshot
  0.21    0.01    sendto(8,          <<< Asking for stats snapshot
  0.07    0.00    semop(43712518,    <<< Waiting for AutovacuumLock

 calls    msec    system call        [ stats collector ]
------  ------    -------------------
  0.40    0.01    recvfrom(8,        <<< Request for snapshot refresh
  0.04    6.75    rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh
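
For a rough comparison, the per-table worker times from the two summaries can be added up. This is only arithmetic on the figures shown above; the totals cover just the system calls listed, and the variable names are illustrative.

```python
# Per-table msec for the autovacuum workers, copied from the two summaries above.
unpatched = {"select": 196.09, "semop": 1040.46, "sendto": 0.83}
patched = {"select": 6.34, "sendto": 0.01, "semop": 0.00}

tables_per_minute = {"unpatched": 85, "patched": 5225}

# Total traced wait time per table, before and after the patch.
unpatched_total = sum(unpatched.values())
patched_total = sum(patched.values())

# Overall throughput improvement in tables per minute.
throughput_gain = tables_per_minute["patched"] / tables_per_minute["unpatched"]

print(f"{unpatched_total:.2f} {patched_total:.2f} {throughput_gain:.1f}")
```

In the unpatched run the semop wait accounts for most of the listed per-table time (1040.46 of about 1237 msec), which matches the workers queueing on the lock rather than vacuuming.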


Regards,

-dg


-- 
David Gould                                   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From:
Jeff Janes <jeff.janes@gmail.com>
Date:
On Fri, Oct 30, 2015 at 8:40 AM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> David Gould wrote:
>>> Anyway, they are not actually vacuuming. They are waiting on the
>>> VacuumScheduleLock. And requesting fresh snapshots from the
>>> stats_collector.
>
>> Oh, I see.  Interesting.  Proposals welcome.  I especially dislike the
>> ("very_expensive") pgstat check.
>
> Couldn't we simply move that out of the locked stanza?  That is, if no
> other worker is working on the table, claim it, and release the lock
> immediately.  Then do the "very expensive" check.  If that fails, we
> have to re-take the lock to un-claim the table, but that sounds OK.


The attached patch does that.  On a system with 4 CPUs and 100,000
tables, a big chunk of them in need of vacuuming, and 30 worker
processes, this increased throughput by a factor of 40.  Presumably it
will do even better with more CPUs.

It is still horribly inefficient, but 40 times less so.
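
The shape of the change, sketched in Python rather than the C of the actual patch: claim the table under the lock, release the lock at once, and only then run the expensive recheck. Everything here is illustrative; `try_process`, `claimed`, and the two callbacks are hypothetical names, and a `threading.Lock` stands in for the schedule lock.

```python
import threading

schedule_lock = threading.Lock()   # stand-in for the schedule lock
claimed = set()                    # table oids currently claimed by some worker


def try_process(oid, still_needs_vacuum, do_vacuum):
    """Claim `oid` under the lock, then run the expensive recheck unlocked."""
    with schedule_lock:
        if oid in claimed:
            return False           # another worker already has this table
        claimed.add(oid)           # cheap: claim, then release immediately

    try:
        # The "very expensive" stats recheck runs without holding the lock,
        # so other workers can claim other tables in the meantime.
        if not still_needs_vacuum(oid):
            return False
        do_vacuum(oid)
        return True
    finally:
        # Re-take the lock briefly to un-claim the table.
        with schedule_lock:
            claimed.discard(oid)
```

The lock is now held only for a set lookup and insert, so the time a worker spends on the recheck no longer blocks every other worker.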

Cheers,

Jeff
