Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From: David Gould <daveg@sonic.net>
On Fri, 30 Oct 2015 21:49:00 -0700
Jeff Janes wrote:
> On Fri, Oct 30, 2015 at 8:40 AM, Tom Lane wrote:
> > Alvaro Herrera writes:
> >> David Gould wrote:
> >>> Anyway, they are not actually vacuuming. They are waiting on the
> >>> VacuumScheduleLock. And requesting fresh snapshots from the
> >>> stats_collector.
> >
> >> Oh, I see. Interesting. Proposals welcome. I especially dislike the
> >> ("very_expensive") pgstat check.
> >
> > Couldn't we simply move that out of the locked stanza? That is, if no
> > other worker is working on the table, claim it, and release the lock
> > immediately. Then do the "very expensive" check. If that fails, we
> > have to re-take the lock to un-claim the table, but that sounds OK.
>
>
> The attached patch does that. In a system with 4 CPUs and that had
> 100,000 tables, with a big chunk of them in need of vacuuming, and
> with 30 worker processes, this increased the throughput by a factor of
> 40. Presumably it will do even better with more CPUs.
>
> It is still horribly inefficient, but 40 times less so.
That is a good result for such a small change.
The attached patch against REL9_5_STABLE goes a little further. It
claims the table under the lock, but also addresses the problem of all the
workers racing to redo the same table by enforcing an ordering on all the
workers. No worker can claim a table with an oid smaller than the highest
oid claimed by any worker. That is, instead of racing to the same table,
workers leapfrog over each other.
In theory the recheck of the stats could be eliminated although this patch
does not do that. It does eliminate the special handling of stats snapshots
for autovacuum workers which cuts back on the excess rewriting of the stats
file somewhat.
I'll send numbers shortly, but as I recall it is over 100 times better than
the original.
-dg
--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
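
The ordering rule described in the message above (no worker may claim a table whose oid is below the highest oid already claimed) can be sketched as follows. This is a toy model in Python, not the attached patch, which is C and is not reproduced here. `Scheduler`, `claim_next`, `highest_claimed` and `run_workers` are hypothetical names, and a `threading.Lock` stands in for the autovacuum schedule lock.

```python
import bisect
import threading


class Scheduler:
    """Toy model of the shared scheduling state (hypothetical names)."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the schedule lock
        self.highest_claimed = 0       # highest table oid claimed by any worker

    def claim_next(self, candidate_oids):
        """Claim the smallest candidate oid above the shared high-water mark.

        candidate_oids must be a sorted list. Returns the claimed oid, or
        None when no candidate lies above the mark.
        """
        with self.lock:
            i = bisect.bisect_right(candidate_oids, self.highest_claimed)
            if i == len(candidate_oids):
                return None
            self.highest_claimed = candidate_oids[i]
            return candidate_oids[i]


def worker(sched, candidates, done):
    # Each worker holds its own copy of the candidate list, as each
    # autovacuum worker builds its own list of tables to consider.
    while True:
        oid = sched.claim_next(candidates)
        if oid is None:
            return
        done.append(oid)               # placeholder for the real vacuum/analyze


def run_workers(n_workers, oids):
    sched = Scheduler()
    done = []
    threads = [
        threading.Thread(target=worker, args=(sched, sorted(oids), done))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because the high-water mark only moves forward, two workers holding the same candidate list never pick the same table; each claim skips past whatever the others have already taken. The lock is held only for the lookup, not for the work itself.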
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From: David Gould <daveg@sonic.net>
On Fri, 30 Oct 2015 12:51:43 -0400
Tom Lane wrote:
> Alvaro Herrera writes:
> > Tom Lane wrote:
> >> Good point ... shouldn't we have already checked the stats before ever
> >> deciding to try to claim the table?
>
> > The second check is there to allow for some other worker (or manual
> > vacuum) having vacuumed it after we first checked, but which had
> > finished before we check the array of current jobs.
>
> I wonder whether that check costs more than it saves.

It does indeed. It drives the stats collector wild. And of course if there
are lots of tables and indexes the stats temp file gets very large, so that
it can take a long time (seconds) to rewrite it. This happens for each
worker for each table that is a candidate for vacuuming.

Since it would not be convenient to provide a copy of the client's 8TB
database I have made a standalone reproduction. The attached files:

  build_test_instance.sh - create a test instance
  datagen.py             - used by above to populate it with lots of tables
  logbyv.awk             - count auto analyze actions in postgres log
  trace.sh               - strace the stats collector and autovacuum workers
  tracereport.sh         - list top 50 calls in strace output

The test process is to run the build_test_instance script to create an
instance with one database with a large number of tiny tables. During the
setup autovacuuming is off. Then make a tarball of the instance for reuse.
For each test case, untar the instance, set the number of workers and start
it. After a short time autovacuum will start workers to analyze the new
tables. Expect to see the stats collector doing lots of writing.

You may want to use tmpfs or a ramdisk for the data dir for building the
test instance. The configuration is sized for a reasonable desktop: 8 to
16GB of memory and an SSD.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
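
The attached scripts are not reproduced in this archive. As a rough illustration of the kind of counting that logbyv.awk is described as doing, here is a small Python sketch that tallies autovacuum actions per minute from a server log. It is not the author's script. It assumes `log_autovacuum_min_duration = 0`, so that every action is logged, and a `log_line_prefix` that begins with a `%t` timestamp. The function name `actions_per_minute` is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as (format assumed, see the note above):
#   2015-10-30 12:51:43 PDT LOG:  automatic analyze of table "db.public.t1" ...
# and captures the timestamp truncated to the minute.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}"
    r".*automatic (?:vacuum|analyze) of table"
)


def actions_per_minute(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to logged autovacuum actions."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_actions.py < postgresql.log
    for minute, n in sorted(actions_per_minute(sys.stdin).items()):
        print(minute, n)
```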
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From: David Gould <daveg@sonic.net>
On Fri, 30 Oct 2015 23:19:52 -0700
David Gould wrote:
> The attached patch against REL9_5_STABLE goes a little further. It
> claims the table under the lock, but also addresses the problem of all the
> workers racing to redo the same table by enforcing an ordering on all the
> workers. No worker can claim a table with an oid smaller than the highest
> oid claimed by any worker. That is, instead of racing to the same table,
> workers leapfrog over each other.
>
> In theory the recheck of the stats could be eliminated although this patch
> does not do that. It does eliminate the special handling of stats snapshots
> for autovacuum workers which cuts back on the excess rewriting of the stats
> file somewhat.
>
> I'll send numbers shortly, but as I recall it is over 100 times better than
> the original.

I sent performance test data and a setup for reproducing it elsewhere in
the thread. I also ran tests on a larger system (128GB mem, many cores, 2x
SSD with battery backed raid).

This is for an idle system with 100,000 new small tables to analyze. I ran
each test for an hour or until 5000 tables had been processed. "jj" refers
to the patch from Jeff Janes, "dg" refers to the attached patch (same as
previous).

       /autovacuum actions per minute/
  workers   9.5b1     jj      dg
  -------   -----   ----   -----
     1        183    171     285
     4         45    212    1158
     8         23    462    1225

Could someone please take a look at the patch and comment?

Thanks.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
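
For readers who want the ratios rather than the raw rates, the table above can be worked through with a few lines of Python. The numbers are copied from the table; the `results` layout and the `speedup` helper are only for illustration.

```python
# Autovacuum actions per minute, copied from the table in the message above.
results = {
    1: {"9.5b1": 183, "jj": 171, "dg": 285},
    4: {"9.5b1": 45, "jj": 212, "dg": 1158},
    8: {"9.5b1": 23, "jj": 462, "dg": 1225},
}


def speedup(workers, variant, baseline="9.5b1"):
    """Throughput of `variant` relative to `baseline` at a given worker count."""
    row = results[workers]
    return row[variant] / row[baseline]


if __name__ == "__main__":
    for workers in sorted(results):
        print(
            f"{workers} workers: "
            f"jj x{speedup(workers, 'jj'):.1f}, dg x{speedup(workers, 'dg'):.1f}"
        )
```

At 8 workers this gives roughly a 20x gain for "jj" and a 53x gain for "dg" over 9.5b1. It also makes the original problem visible: unpatched, going from 1 to 8 workers drops throughput from 183 to 23 actions per minute.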
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From: David Gould <daveg@sonic.net>
On Mon, 29 Feb 2016 18:33:50 -0300
Alvaro Herrera wrote:
> Hi David, did you ever post an updated version of this patch?
No. Let me fix that now. I've attached my current revision of the patch
based on master. This version is significantly better than the original
version and resolves multiple issues:
- autovacuum workers no longer race each other
- autovacuum workers do not revacuum each other's tables
- autovacuum workers no longer thrash the stats collector, which saves a
  lot of IO when the stats file is large.
Hopefully the earlier discussion and the comments in the patch are
sufficient, but please ask questions if anything is not clear.
The result is that on a freshly created 40,000 table database with tiny
tables that all need an initial analyze, unpatched postgres saturates
an SSD updating the stats and manages to process only about 85 tables per
minute. With the attached patch it processes several thousand tables per
minute.
The following is a summary of strace output for the autovacuum workers
and the stats collector while the 40k table test is running. The counts and
times are the cost per table.
postgresql 9.5: 85 tables per minute.

        Operations per Table
   calls     msec   system call [ 4 autovacuum workers ]
  ------  -------   -------------------
   19.46   196.09   select(0,           <<< Waiting for stats snapshot
    3.26  1040.46   semop(43188238,     <<< Waiting for AutovacuumScheduleLock
    2.05     0.83   sendto(8,           <<< Asking for stats snapshot

   calls     msec   system call [ stats collector ]
  ------  -------   -------------------
    3.02     0.05   recvfrom(8,         <<< Request for snapshot refresh
    1.55   248.64   rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

+ autovacuum contention patch: 5225 tables per minute

        Operations per Table
   calls     msec   system call [ 4 autovacuum workers ]
  ------  -------   -------------------
    0.63     6.34   select(0,           <<< Waiting for stats snapshot
    0.21     0.01   sendto(8,           <<< Asking for stats snapshot
    0.07     0.00   semop(43712518,     <<< Waiting for AutovacuumLock

   calls     msec   system call [ stats collector ]
  ------  -------   -------------------
    0.40     0.01   recvfrom(8,         <<< Request for snapshot refresh
    0.04     6.75   rename("pg_stat_tmp/db_16385.tmp",  <<< Snapshot refresh

Regards,
-dg
--
David Gould daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
From: Jeff Janes <jeff.janes@gmail.com>
On Fri, Oct 30, 2015 at 8:40 AM, Tom Lane wrote:
> Alvaro Herrera writes:
>> David Gould wrote:
>>> Anyway, they are not actually vacuuming. They are waiting on the
>>> VacuumScheduleLock. And requesting fresh snapshots from the
>>> stats_collector.
>
>> Oh, I see. Interesting. Proposals welcome. I especially dislike the
>> ("very_expensive") pgstat check.
>
> Couldn't we simply move that out of the locked stanza? That is, if no
> other worker is working on the table, claim it, and release the lock
> immediately. Then do the "very expensive" check. If that fails, we
> have to re-take the lock to un-claim the table, but that sounds OK.
The attached patch does that. In a system with 4 CPUs and that had
100,000 tables, with a big chunk of them in need of vacuuming, and
with 30 worker processes, this increased the throughput by a factor of
40. Presumably it will do even better with more CPUs.
It is still horribly inefficient, but 40 times less so.
Cheers,
Jeff
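
The approach Tom Lane suggests in the quoted text (claim the table under the lock, release the lock, run the expensive check, and un-claim if the check fails) can be sketched as follows. This is an illustrative Python model only. The attached patch is C and is not shown in this archive; `ClaimTable`, `process_table` and the two callbacks are hypothetical names.

```python
import threading


class ClaimTable:
    """Toy stand-in for the shared array of tables currently being worked on."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the schedule lock
        self.claimed = set()

    def try_claim(self, oid):
        """Claim oid if no other worker holds it. The lock is held only briefly."""
        with self.lock:
            if oid in self.claimed:
                return False
            self.claimed.add(oid)
            return True

    def unclaim(self, oid):
        """Re-take the lock just long enough to drop the claim."""
        with self.lock:
            self.claimed.discard(oid)


def process_table(claims, oid, still_needs_vacuum, vacuum):
    """Claim first, then run the expensive recheck outside the lock."""
    if not claims.try_claim(oid):
        return "skipped"               # another worker is on this table
    try:
        # The expensive stats recheck runs here with the lock released,
        # so other workers are not serialized behind it.
        if not still_needs_vacuum(oid):
            return "not needed"
        vacuum(oid)
        return "vacuumed"
    finally:
        claims.unclaim(oid)            # release the claim in every case
```

The trade-off is one extra short lock acquisition per table when the recheck fails, in exchange for never holding the lock across the slow stats lookup.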