Thread: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

[HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Amit Khandekar
Date:
In a server where autovacuum is disabled and its databases reach the
autovacuum_freeze_max_age limit, an autovacuum is forced to prevent
xid wraparound issues. At this stage, when the server is loaded with a
lot of DML operations, an exceedingly high number of autovacuum
workers keep getting spawned; these do nothing and then quit.

The issue is: sometimes an autovacuum worker A1 finds it has no
tables to vacuum, because another worker (or workers) B is already
concurrently vacuuming those tables. Worker A1 then calls
vac_update_datfrozenxid() at the end, which effectively requests a
new autovacuum (vac_update_datfrozenxid() -> vac_truncate_clog() ->
SetTransactionIdLimit() -> send PMSIGNAL_START_AUTOVAC_LAUNCHER).
*Immediately* a new autovacuum is spawned, since it is an
xid-wraparound vacuum. This new autovacuum chooses the same database
for a new worker A2, because the datfrozenxids aren't updated yet,
worker B still not being done. Worker A2 finds that the same worker(s)
B are still vacuuming the same tables. A2 again calls
vac_update_datfrozenxid(), which leads to spawning another autovacuum
worker A3, and the cycle keeps repeating until B finishes vacuuming
and updates datfrozenxid.
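
For reference, the end of that chain is the following bit of
SetTransactionIdLimit() in varsup.c (paraphrased here; the exact
guards may differ across branches):

    /*
     * Paraphrased tail of SetTransactionIdLimit().  Once curXid has
     * passed xidVacLimit, every call re-signals the postmaster, which
     * is what lets the cycle above repeat immediately.
     */
    if (TransactionIdFollowsOrEquals(curXid, xidVacLimit) &&
        IsUnderPostmaster && !InRecovery)
        SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);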

Steps to reproduce:

1. In a fresh cluster, create tables using the attached create.sql.gz.

2. Insert some data:
INSERT into pgbench_history select generate_series(1, 5402107, 1);
UPDATE pgbench_history set bid = 1, aid = 11100, delta = 500,
    mtime = '2017/1/1', filler = '19:00';

3. Set these GUCs in postgresql.conf and restart the server:
autovacuum = off
autovacuum_freeze_max_age = 100000  # make autovacuum start as early as possible

4. Run "pgbench -n -c 5 -t 2000000" and watch the log file.

5. After age(datfrozenxid) of the databases crosses the
autovacuum_freeze_max_age value, within another 1-2 minutes a large
number of messages like these will appear:
2017-01-13 14:50:12.304 IST [111811] LOG:  autovacuum launcher started
2017-01-13 14:50:12.346 IST [111816] LOG:  autovacuum launcher started
2017-01-13 14:50:12.825 IST [111818] LOG:  autovacuum launcher started

I see around 70 messages per second.


=== Fix ===

For fixing the issue, one approach I thought of was to make
do_start_worker() choose a different database after determining that
the earlier database is still being scanned by one of the workers in
AutoVacuumShmem->av_runningWorkers. The current logic is that if the
databases need xid-wraparound prevention, we pick the one with the
oldest datfrozenxid; so, for the fix, just choose the database with
the second-oldest datfrozenxid if the oldest one is still being
processed. But this does not solve the problem: once the other
databases are done, or if there is only a single database, the workers
will again go into a loop on the same database.

Instead, the attached patch (prevent_useless_vacuums.patch) prevents
the repeated cycle by noting that there's no point in doing whatever
vac_update_datfrozenxid() does if we didn't find anything to vacuum
and there's already another worker vacuuming the same database. Note
that it uses the wi_tableoid field to check concurrency. It does not
use the wi_dboid field to check for an already-processing worker,
because using that field might cause each of the workers to think that
some other worker is vacuuming, and eventually no one vacuums. We have
to be certain that the other worker has already taken a table to
vacuum.
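
In outline, the check works along these lines (a simplified sketch
only -- the attached patch is authoritative, and the real code notes
the concurrent worker while scanning the tables rather than in a
separate pass; did_vacuum is assumed to record whether this worker
vacuumed anything):

    /*
     * Sketch: is there another worker in our database that has
     * actually claimed a table?  wi_tableoid is only set once a
     * worker has picked a table, which is the certainty we need.
     */
    bool        found_concurrent_worker = false;
    dlist_iter  iter;

    LWLockAcquire(AutovacuumLock, LW_SHARED);
    dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
    {
        WorkerInfo  worker = dlist_container(WorkerInfoData,
                                             wi_links, iter.cur);

        /* ignore ourselves and workers in other databases */
        if (worker->wi_proc == MyProc || worker->wi_dboid != MyDatabaseId)
            continue;
        if (OidIsValid(worker->wi_tableoid))
        {
            found_concurrent_worker = true;
            break;
        }
    }
    LWLockRelease(AutovacuumLock);

    /* skip the end action only if someone else is on this database */
    if (did_vacuum || !found_concurrent_worker)
        vac_update_datfrozenxid();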

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Attachments

Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Alvaro Herrera
Date:
Amit Khandekar wrote:
> In a server where autovacuum is disabled and its databases reach the
> autovacuum_freeze_max_age limit, an autovacuum is forced to prevent
> xid wraparound issues. At this stage, when the server is loaded with a
> lot of DML operations, an exceedingly high number of autovacuum
> workers keep getting spawned; these do nothing and then quit.

I think this is the same problem as reported in
https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com

> === Fix ===
[...]
> Instead, the attached patch (prevent_useless_vacuums.patch) prevents
> the repeated cycle by noting that there's no point in doing whatever
> vac_update_datfrozenxid() does if we didn't find anything to vacuum
> and there's already another worker vacuuming the same database. Note
> that it uses the wi_tableoid field to check concurrency. It does not
> use the wi_dboid field to check for an already-processing worker,
> because using that field might cause each of the workers to think that
> some other worker is vacuuming, and eventually no one vacuums. We have
> to be certain that the other worker has already taken a table to
> vacuum.

Hmm, it seems reasonable to skip the end action if we didn't do any
cleanup after all.  This would normally give enough time between vacuum
attempts for the first worker to make further progress and avoid causing
a storm.  I'm not really sure that it fixes the problem completely, but
perhaps it's enough.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Amit Khandekar
Date:
On 13 January 2017 at 19:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think this is the same problem as reported in
> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com

Ah yes, this is the same problem. Not sure why I didn't land on that
thread when I tried to search pghackers using relevant keywords.
>
>> === Fix ===
> [...]
>> Instead, the attached patch (prevent_useless_vacuums.patch) prevents
>> the repeated cycle by noting that there's no point in doing whatever
>> vac_update_datfrozenxid() does if we didn't find anything to vacuum
>> and there's already another worker vacuuming the same database. Note
>> that it uses the wi_tableoid field to check concurrency. It does not
>> use the wi_dboid field to check for an already-processing worker,
>> because using that field might cause each of the workers to think that
>> some other worker is vacuuming, and eventually no one vacuums. We have
>> to be certain that the other worker has already taken a table to
>> vacuum.
>
> Hmm, it seems reasonable to skip the end action if we didn't do any
> cleanup after all. This would normally give enough time between vacuum
> attempts for the first worker to make further progress and avoid causing
> a storm.  I'm not really sure that it fixes the problem completely, but
> perhaps it's enough.

I had thought about this: if we didn't clean up anything, skip the
end action unconditionally, without checking whether there was any
concurrent worker. But then I thought it is better to skip only if we
know there is another worker doing the same job, because:
a) there might be some reason we are calling
vac_update_datfrozenxid() unconditionally. But I am not sure whether
it was intentionally kept like that; I didn't get any leads from the
history.
b) there's no harm in updating datfrozenxid if there was no other
worker. In that case, we *know* that there was indeed nothing to be
cleaned up, so this database won't be chosen again next time, and
there's no harm in just calling this function.

>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 16, 2017 at 1:50 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 13 January 2017 at 19:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> I think this is the same problem as reported in
>> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com
>
> Ah yes, this is the same problem. Not sure why I didn't land on that
> thread when I tried to search pghackers using relevant keywords.
>>
>>> === Fix ===
>> [...]
>>> Instead, the attached patch (prevent_useless_vacuums.patch) prevents
>>> the repeated cycle by noting that there's no point in doing whatever
>>> vac_update_datfrozenxid() does if we didn't find anything to vacuum
>>> and there's already another worker vacuuming the same database. Note
>>> that it uses the wi_tableoid field to check concurrency. It does not
>>> use the wi_dboid field to check for an already-processing worker,
>>> because using that field might cause each of the workers to think that
>>> some other worker is vacuuming, and eventually no one vacuums. We have
>>> to be certain that the other worker has already taken a table to
>>> vacuum.
>>
>> Hmm, it seems reasonable to skip the end action if we didn't do any
>> cleanup after all. This would normally give enough time between vacuum
>> attempts for the first worker to make further progress and avoid causing
>> a storm.  I'm not really sure that it fixes the problem completely, but
>> perhaps it's enough.
>
> I had thought about this: if we didn't clean up anything, skip the
> end action unconditionally, without checking whether there was any
> concurrent worker. But then I thought it is better to skip only if we
> know there is another worker doing the same job, because:
> a) there might be some reason we are calling
> vac_update_datfrozenxid() unconditionally. But I am not sure whether
> it was intentionally kept like that; I didn't get any leads from the
> history.
> b) there's no harm in updating datfrozenxid if there was no other
> worker. In that case, we *know* that there was indeed nothing to be
> cleaned up, so this database won't be chosen again next time, and
> there's no harm in just calling this function.
>

Since an autovacuum worker wakes up the autovacuum launcher after it
is launched, the launcher could try to spawn worker processes at high
frequency if you have a database with a very large table in it that
has just passed autovacuum_freeze_max_age.

autovacuum.c:L1605
        /* wake up the launcher */
        if (AutoVacuumShmem->av_launcherpid != 0)
            kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);

I think we should deal with this case as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Amit Khandekar
Date:
On 16 January 2017 at 15:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Since an autovacuum worker wakes up the autovacuum launcher after it
> is launched, the launcher could try to spawn worker processes at high
> frequency if you have a database with a very large table in it that
> has just passed autovacuum_freeze_max_age.
>
> autovacuum.c:L1605
>         /* wake up the launcher */
>         if (AutoVacuumShmem->av_launcherpid != 0)
>             kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);
>
> I think we should deal with this case as well.

When autovacuum is enabled, after getting SIGUSR2 the worker is
launched only when it's time to launch one. It doesn't look like it
will be launched immediately:

/* We're OK to start a new worker */
if (dlist_is_empty(&DatabaseList))
{
    /* Special case when the list is empty */
}
else
{
    ...
    /* launch a worker if next_worker is right now or it is in the past */
    if (TimestampDifferenceExceeds(avdb->adl_next_worker,
                                   current_time, 0))
        launch_worker(current_time);
}

So from the above, it looks as if there will not be a storm of workers.


Whereas if autovacuum is disabled, the autovacuum launcher does not
wait for the worker to start; it just starts the worker and quits, so
the issue won't show up here:

/*
 * In emergency mode, just start a worker (unless shutdown was requested)
 * and go away.
 */
if (!AutoVacuumingActive())
{
    if (!got_SIGTERM)
        do_start_worker();
    proc_exit(0);           /* done */
}


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Robert Haas
Date:
On Fri, Jan 13, 2017 at 8:45 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Amit Khandekar wrote:
>> In a server where autovacuum is disabled and its databases reach the
>> autovacuum_freeze_max_age limit, an autovacuum is forced to prevent
>> xid wraparound issues. At this stage, when the server is loaded with a
>> lot of DML operations, an exceedingly high number of autovacuum
>> workers keep getting spawned; these do nothing and then quit.
>
> I think this is the same problem as reported in
> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com

If I understand correctly, and it's possible that I don't, the issues
are distinct.  I think that the issue in that thread has to do with
the autovacuum launcher starting workers over and over again in a
tight loop, whereas this issue seems to be about autovacuum workers
restarting the launcher over and over again in a tight loop.  In that
thread, it's the autovacuum launcher that is looping, which can only
happen when autovacuum=on.  In this thread, the autovacuum launcher is
repeatedly exiting and getting restarted, which can only happen when
autovacuum=off.

In general, it seems we've been pretty cavalier about just how often
it's reasonable to start the autovacuum launcher when autovacuum=off.
That code probably doesn't see much real-world use.  Foreground
processes signal the postmaster only every 64K transactions, which on
today's hardware can't happen more often than every couple of seconds
if you're not using subtransactions or intentionally burning XIDs; but
hardware keeps getting faster, and you might be using subtransactions.
However, requiring that 65,536 transactions pass between signals does
serve as something of a rate limit.  In the case about which Amit is
complaining, there's no rate limit at all.  As fast as the autovacuum
launcher starts up, it spawns a worker and exits; as fast as the
worker can determine that it can't do anything useful, it starts a new
launcher.  Clearly, some kind of rate control is needed here; the only
question is where to put it.

I would be tempted to install something directly in postmaster.c.  If
CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) && Shutdown ==
NoShutdown but we last set start_autovac_launcher = true less than 10
seconds ago, don't do it again.  That limits us to launching the
autovacuum launcher at most six times a minute when autovacuum = off.
You could argue that defeats the point of the SendPostmasterSignal in
SetTransactionIdLimit, but I don't think so.  If vacuuming the oldest
database took less than 10 seconds, then we won't vacuum the
next-oldest database until we hit the next 64K transaction ID
boundary, but that can only cause a problem if we've got so many
databases that we don't get to them all before we run out of
transaction IDs, which is almost unthinkable.  If you had ten million
tiny databases that all crossed the threshold at the same instant, it
would take you 640 million transaction IDs to visit them all.  If you
also had autovacuum_freeze_max_age set very close to the upper limit
for that variable, you could conceivably have the system shut down
before all of those databases were reached.  But that's a pretty
artificial scenario.  If someone has that scenario, perhaps they
should consider more sensible configuration choices.
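
A minimal sketch of that throttle in postmaster.c's sigusr1_handler()
(untested, and last_avlauncher_request is a hypothetical new static):

    /* Untested sketch of the 10-second throttle described above. */
    static time_t last_avlauncher_request = 0;

    if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&
        Shutdown == NoShutdown)
    {
        time_t      now = time(NULL);

        /* honor at most one start request per 10 seconds */
        if (now - last_avlauncher_request >= 10)
        {
            last_avlauncher_request = now;
            start_autovac_launcher = true;
        }
    }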

I wondered for a while why the existing guard in
vac_update_datfrozenxid() isn't sufficient to prevent this problem.
That turns out to be due to Tom's commit
794e3e81a0e8068de2606015352c1254cb071a78, which causes
ForceTransactionIdLimitUpdate() to always return true once we're past
xidVacLimit.  The commit doesn't contain much in the way of
justification for the change, but I think the issue must be that if
the database nearest to wraparound is dropped, we need some mechanism
for eventually forcing xidVacLimit to get updated, rather than just
spewing warnings.

Another place where we could insert a guard is inside
SetTransactionIdLimit itself.  This is a little tricky.  The easy idea
would be just to skip sending the signal if xidVacLimit hasn't
advanced, but that's wrong in the case where there are multiple
databases with exactly the same oldest XID; vacuuming the first one
doesn't change anything.  It would be correct -- I think -- to skip
sending the signal when xidVacLimit doesn't advance and
vac_update_datfrozenxid() didn't change the current database's value
either, but that requires passing a flag down the call stack a few
levels.  That's only mildly ugly so I'd be fine with it if it were the
best fix, but there seem to be better options.
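
For concreteness, the tail of SetTransactionIdLimit() might then look
something like this (purely a shape sketch, not what was committed;
frozenxid_updated is the hypothetical flag passed down from
vac_update_datfrozenxid(), and prev_xidVacLimit the limit before this
call recomputed it):

    /*
     * Hypothetical: signal only if the limit advanced, or if this
     * database's pg_database entry actually changed (covering several
     * databases that share the same oldest XID).
     */
    if ((xidVacLimit != prev_xidVacLimit || frozenxid_updated) &&
        TransactionIdFollowsOrEquals(curXid, xidVacLimit) &&
        IsUnderPostmaster && !InRecovery)
        SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);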

Amit has chosen yet another possible place to insert the guard: teach
autovacuum that if a worker skips at least one table due to concurrent
autovacuum activity AND ends up vacuuming no tables, don't call
vac_update_datfrozenxid().  Since there is or was another worker
running, vac_update_datfrozenxid() either already has been called or
will be when that worker finishes.  So that seems safe.  If his patch
were changed to skip vac_update_datfrozenxid() in all cases where we
do nothing rather than only when we skip a table due to concurrent
activity, we'd reintroduce the dropped-database problem that was fixed
by 794e3e81a0e8068de2606015352c1254cb071a78.

I'm not entirely sure whether Amit's fix is better or worse than the
postmaster-based fix.  It seems like a fairly fundamental weakness for
the postmaster to have no rate-limiting logic whatsoever here; it
should be the postmaster's job to judge whether it's getting swamped
with signals, and if we fix it in the postmaster then it stops systems
with high rates of XID consumption from going bonkers for that reason.
On the other hand, if somebody does have a scenario where repeatedly
signaling the postmaster to start the launcher in a tight loop is
allowing the system to zip through many small databases efficiently,
Amit's fix will let that keep working, whereas throttling in the
postmaster will make it take longer to get to all of those databases.
In many cases, that could be an improvement, since it would tend to
spread out the datfrozenxid values better, but I can't quite shake the
niggling fear that there might be some case I'm not thinking of where
it's problematic.  So I don't know.

As far as the problem on the other thread goes, maybe we could extend
Amit's approach so that when a worker exits after having skipped some
tables but not vacuumed any, we blacklist the database for some
period of time or some number of iterations: autovacuum workers aren't
allowed to choose that database until the blacklist entry expires.
That way, if it becomes evident that more autovacuum workers in that
database are useless, other databases get a chance to attract some
workers, at least for some period of time.  I'm not sure how to
calibrate that exactly, but it's a thought.  I think we should fix
this problem first, though; it's subject to a narrower and
less-speculative repair.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Robert Haas
Date:
On Tue, Jan 17, 2017 at 4:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Amit has chosen yet another possible place to insert the guard: teach
> autovacuum that if a worker skips at least one table due to concurrent
> autovacuum activity AND ends up vacuuming no tables, don't call
> vac_update_datfrozenxid().  Since there is or was another worker
> running, vac_update_datfrozenxid() either already has been called or
> will be when that worker finishes.  So that seems safe.  If his patch
> were changed to skip vac_update_datfrozenxid() in all cases where we
> do nothing rather than only when we skip a table due to concurrent
> activity, we'd reintroduce the dropped-database problem that was fixed
> by 794e3e81a0e8068de2606015352c1254cb071a78.

After sleeping on this, I'm inclined to go with Amit's fix for now.
It seems less likely to break anything in the back-branches than any
other option I can think up.

An updated version of that patch is attached.  I changed the "if"
statement to avoid having an empty "if" clause and a non-empty "else"
clause, and I rewrote the comment based on my previous analysis.

Barring objections, I'll commit and back-patch this version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachments

Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Alvaro Herrera
Date:
Robert Haas wrote:

> After sleeping on this, I'm inclined to go with Amit's fix for now.
> It seems less likely to break anything in the back-branches than any
> other option I can think up.

Yeah, no objections here.

Note typo "imporatant" in the comment.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Amit Khandekar
Date:
On 18 January 2017 at 02:32, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 8:45 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I think this is the same problem as reported in
>> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com
>
> If I understand correctly, and it's possible that I don't, the issues
> are distinct.  I think that the issue in that thread has to do with
> the autovacuum launcher starting workers over and over again in a
> tight loop, whereas this issue seems to be about autovacuum workers
> restarting the launcher over and over again in a tight loop.  In that
> thread, it's the autovacuum launcher that is looping, which can only
> happen when autovacuum=on.  In this thread, the autovacuum launcher is
> repeatedly exiting and getting restarted, which can only happen when
> autovacuum=off.
Yes, that's true: in the other thread, autovacuum is on. Although, I
haven't been able to figure out why there would be a storm of workers
spawned in the autovacuum=on case. When it is on, the launcher starts
a worker only when it's time to start one.

>
> I would be tempted to install something directly in postmaster.c.  If
> CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) && Shutdown ==
> NoShutdown but we last set start_autovac_launcher = true less than 10
> seconds ago, don't do it again.

My impression was that the postmaster is supposed to do just the
minimal work of starting the autovacuum launcher if it's not already
running; the work of ensuring that everything keeps going is the job
of the autovacuum launcher.

> That limits us to launching the
> autovacuum launcher at most six times a minute when autovacuum = off.
> You could argue that defeats the point of the SendPostmasterSignal in
> SetTransactionIdLimit, but I don't think so.  If vacuuming the oldest
> database took less than 10 seconds, then we won't vacuum the
> next-oldest database until we hit the next 64K transaction ID
> boundary, but that can only cause a problem if we've got so many
> databases that we don't get to them all before we run out of
> transaction IDs, which is almost unthinkable.  If you had ten million
> tiny databases that all crossed the threshold at the same instant, it
> would take you 640 million transaction IDs to visit them all.  If you
> also had autovacuum_freeze_max_age set very close to the upper limit
> for that variable, you could conceivably have the system shut down
> before all of those databases were reached.  But that's a pretty
> artificial scenario.  If someone has that scenario, perhaps they
> should consider more sensible configuration choices.

Yeah, this logic makes sense...

But I guess, from looking at the code, that it was carefully made
sure that with autovacuum off, we should clean up all databases as
fast as possible, with multiple workers cleaning up multiple tables
in parallel.

Instead of the autovacuum launcher and workers together making sure
that the cycle of iterations keeps running, I was thinking the
autovacuum launcher itself should make sure it does not spawn another
worker on the same database if the last one did nothing. But that
seemed pretty invasive.



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Michael Paquier
Date:
On Fri, Jan 20, 2017 at 4:11 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>
>> After sleeping on this, I'm inclined to go with Amit's fix for now.
>> It seems less likely to break anything in the back-branches than any
>> other option I can think up.
>
> Yeah, no objections here.

+1.
-- 
Michael



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Robert Haas
Date:
On Fri, Jan 20, 2017 at 2:43 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jan 20, 2017 at 4:11 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Robert Haas wrote:
>>
>>> After sleeping on this, I'm inclined to go with Amit's fix for now.
>>> It seems less likely to break anything in the back-branches than any
>>> other option I can think up.
>>
>> Yeah, no objections here.
>
> +1.

OK, committed and back-patched all the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum

From
Jim Nasby
Date:
On 1/20/17 12:40 AM, Amit Khandekar wrote:
> My impression was that the postmaster is supposed to do just the
> minimal work of starting the autovacuum launcher if it's not already
> running; the work of ensuring that everything keeps going is the job
> of the autovacuum launcher.

There's already a ton of logic in the launcher... ISTM it'd be nice to 
not start adding additional logic to the postmaster. If we had a generic 
need for rate limiting launching of things maybe it wouldn't be that 
bad, but AFAIK we don't.

>> That limits us to launching the
>> autovacuum launcher at most six times a minute when autovacuum = off.
>> You could argue that defeats the point of the SendPostmasterSignal in
>> SetTransactionIdLimit, but I don't think so.  If vacuuming the oldest
>> database took less than 10 seconds, then we won't vacuum the
>> next-oldest database until we hit the next 64K transaction ID
>> boundary, but that can only cause a problem if we've got so many
>> databases that we don't get to them all before we run out of
>> transaction IDs, which is almost unthinkable.  If you had ten million
>> tiny databases that all crossed the threshold at the same instant, it
>> would take you 640 million transaction IDs to visit them all.  If you
>> also had autovacuum_freeze_max_age set very close to the upper limit
>> for that variable, you could conceivably have the system shut down
>> before all of those databases were reached.  But that's a pretty
>> artificial scenario.  If someone has that scenario, perhaps they
>> should consider more sensible configuration choices.
> Yeah, this logic makes sense...

I'm not sure that's true in the case of a significant number of 
databases and a very high XID rate, but I might be missing something. In 
any case I agree it's not worth worrying about. If you've disabled 
autovac you're already running with scissors.

> But I guess, from looking at the code, that it was carefully made
> sure that with autovacuum off, we should clean up all databases as
> fast as possible, with multiple workers cleaning up multiple tables
> in parallel.
>
> Instead of the autovacuum launcher and workers together making sure
> that the cycle of iterations keeps running, I was thinking the
> autovacuum launcher itself should make sure it does not spawn another
> worker on the same database if the last one did nothing. But that
> seemed pretty invasive.

IMHO we really need some more sophistication in scheduling for both
the launcher and the workers. Somewhere on my TODO list is allowing
the worker to call a user-defined SELECT to get a prioritized list,
but since the launcher doesn't connect to a database, that wouldn't
work. What we could do rather simply is honor adl_next_worker in the
logic that looks for freeze, something like the attached.

On another note, does anyone else find the database selection logic 
rather difficult to trace through? The logic is kinda spread throughout 
several functions. The naming of rebuild_database_list() and 
get_database_list() is rather confusing too.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Attachments