Discussion: RE: PG12 autovac issues


RE: PG12 autovac issues

From: Justin King
Date:
Hi Andres-

Thanks for the reply, answers below.

On Tue, Mar 17, 2020 at 8:19 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-17 17:18:57 -0500, Justin King wrote:
> > As you can see in this table, there are only ~80K rows, but billions
> > of updates.  What we have observed is that the frozenxid reaches the
> > 200M mark fairly quickly because of the amount of activity.
>
> And each of those updates is in a separate transaction? Is that
> required? I.e. any chance to perform multiple of those updates in one
> transaction?
>
> Have you considered just increasing the vacuum limit? It's mostly there
> because it can increase space usage a bit, but given today's systems it's
> not usually a problem unless you have hundreds of postgres instances
> on a single system.  It's pretty common to just increase that to 1.8
> billion or so.

We have considered increasing the limit, but as I mentioned, the
problem that we're seeing is that (effectively) an autovac starts on a
system database (postgres, template1) and never completes, or
deadlocks, or something.  This completely stops autovacs from running
until we manually intervene and run a VACUUM FREEZE -- at which point,
the autovacs resume.  If we increase the vacuum limit and this
situation happens, we're going to be in real trouble.
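
(For reference, "increasing the limit" would be roughly the following -- a sketch only,
exact value still TBD -- and it only takes effect after a restart, since
autovacuum_freeze_max_age can only be set at server start:

    ALTER SYSTEM SET autovacuum_freeze_max_age = 1800000000;
)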

> From a single stats snapshot we can't actually understand the actual xid
> consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xid/sec -- this is when we see the "aggressive" autovac kick off in
the logs.  What I don't understand is why these xids are being
consumed at this rate on the databases with no activity (postgres,
template1).
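
(For reference, one way to sample that rate -- a rough sketch, not necessarily how the
number above was gathered -- is to record the xid counter periodically and diff the
samples; note that txid_current() itself assigns an xid if the session doesn't have one:

    SELECT now() AS sampled_at, txid_current() AS current_xid;
)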

>
>
> What makes you think it is a problem that you have all these vacuums? If
> you actually update that much, and you have indexes, you're going to want a
> lot of vacuums?
>

I actually don't think there's a problem with the vacuums (I was
mostly pointing out that they are very regular and not problematic).
The main problem I am having is that something is causing the
autovacuums to completely stop and require manual intervention to
resume -- and it seems to be when the "postgres" or "template1"
database hits the autovacuum_freeze_max_age.

>
> > What is interesting is that this happens with the 'postgres' and
> > 'template1' databases as well and there is absolutely no activity in
> > those databases.
>
> That's normal. They should be pretty darn quick in v12?

Yes, a manual VACUUM FREEZE of either database takes less than 1
second -- which is why it's perplexing that the autovac starts and
never seems to complete and prevents other autovacs from running.
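
(For clarity, the manual run is just the database-wide form, executed while connected to
the affected database -- something like the following, VERBOSE being optional:

    VACUUM (FREEZE, VERBOSE);
)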

>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

From: Andres Freund
Date:
Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > From a single stats snapshot we can't actually understand the actual xid
> > consumption - is it actually the xid usage that triggers the vacuums?
> 
> We have looked at this and the xid consumption averages around 1250
> xid/sec -- this is when we see the "aggressive" autovac kick off in
> the logs.  What I don't understand is why these xid's are being
> consumed at this rate on the databases with no activity (postgres,
> template1).

The xid counter is global across all databases.


> > What makes you think it is a problem that you have all these vacuums? If
> > you actually update that much, and you have indexes, you're going to want a
> > lot of vacuums?

> I actually don't think there's a problem with the vacuums (I was
> mostly pointing out that they are very regular and not problematic).
> The main problem I am having is that something is causing the
> autovacuums to completely stop and require manual intervention to
> resume -- and it seems to be when the "postgres" or "template1"
> database hits the autovacuum_freeze_max_age.

Did you look at pg_stat_activity for those autovacuums to see whether
they're blocked on something?
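
Something along these lines should show whether the workers are waiting on a lock
(a sketch, adjust as needed):

    SELECT pid, wait_event_type, wait_event, state, xact_start, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker';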


> > > What is interesting is that this happens with the 'postgres' and
> > > 'template1' databases as well and there is absolutely no activity in
> > > those databases.
> >
> > That's normal. They should be pretty darn quick in v12?
> 
> Yes, a manual VACUUM FREEZE of either database takes less than 1
> second -- which is why it's perplexing that the autovac starts and
> never seems to complete and prevents other autovacs from running.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Greetings,

Andres Freund



Re: PG12 autovac issues

From: Justin King
Date:
On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > From a single stats snapshot we can't actually understand the actual xid
> > > consumption - is it actually the xid usage that triggers the vacuums?
> >
> > We have looked at this and the xid consumption averages around 1250
> > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > the logs.  What I don't understand is why these xid's are being
> > consumed at this rate on the databases with no activity (postgres,
> > template1).
>
> The xid counter is global across all databases.

Then what does the "age" value represent for each database in this
case?  Perhaps I'm misunderstanding what I'm looking at?

postgres=#  SELECT datname, age(datfrozenxid),
current_setting('autovacuum_freeze_max_age') FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 100937449 | 200000000
 template1 |  50244438 | 200000000
 template0 | 160207297 | 200000000
 feedi     | 150147602 | 200000000


>
> > > What makes you think it is a problem that you have all these vacuums? If
> > > you actually update that much, and you have indexes, you're going to want a
> > > lot of vacuums?
>
> > I actually don't think there's a problem with the vacuums (I was
> > mostly pointing out that they are very regular and not problematic).
> > The main problem I am having is that something is causing the
> > autovacuums to completely stop and require manual intervention to
> > resume -- and it seems to be when the "postgres" or "template1"
> > database hits the autovacuum_freeze_max_age.
>
> Did you look at pg_stat_activity for those autovacuums to see whether
> they're blocked on something?

This is not something we've done yet but will next time it occurs.

> > > > What is interesting is that this happens with the 'postgres' and
> > > > 'template1' databases as well and there is absolutely no activity in
> > > > those databases.
> > >
> > > That's normal. They should be pretty darn quick in v12?
> >
> > Yes, a manual VACUUM FREEZE of either database takes less than 1
> > second -- which is why it's perplexing that the autovac starts and
> > never seems to complete and prevents other autovacs from running.
>
> One big difference between a manual VACUUM and autovacuum is that with
> the default settings VACUUM is not throttled, but autovacuum is.
>
> What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> vacuum_cost_page_miss set to?

Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it
seems like it would complete eventually. We see blocked autovacs for
many hours.

> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

From: Andres Freund
Date:
Hi,

On 2020-03-19 18:07:14 -0500, Justin King wrote:
> On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > > From a single stats snapshot we can't actually understand the actual xid
> > > > consumption - is it actually the xid usage that triggers the vacuums?
> > >
> > > We have looked at this and the xid consumption averages around 1250
> > > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > > the logs.  What I don't understand is why these xid's are being
> > > consumed at this rate on the databases with no activity (postgres,
> > > template1).
> >
> > The xid counter is global across all databases.
> 
> Then what does the "age" value represent for each database in this
> case?  Perhaps I'm misunderstanding what I'm looking at?
> 
> postgres=#  SELECT datname, age(datfrozenxid),
> current_setting('autovacuum_freeze_max_age') FROM pg_database;
>   datname  |    age    | current_setting
> -----------+-----------+-----------------
>  postgres  | 100937449 | 200000000
>  template1 |  50244438 | 200000000
>  template0 | 160207297 | 200000000
>  feedi     | 150147602 | 200000000

Look at datfrozenxid without the age(). age(xid) computes how "old" xid
is compared to the "next" xid to be assigned. Until vacuum comes around
and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
constant, since they represent the values actually present in the
table.  But if xids are being consumed, their "age" increases, because
they're further and further in the past relative to the "newest" xids.
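
For example, something like this (just a sketch) shows both the raw value and its age,
so you can tell whether the stored value itself ever moves:

    SELECT datname, datfrozenxid, age(datfrozenxid)
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;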


> > One big difference between a manual VACUUM and autovacuum is that with
> > the default settings VACUUM is not throttled, but autovacuum is.
> >
> > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> > vacuum_cost_page_miss set to?
> 
> Here are all the vacuum related values for the server:
> 
> postgres=# select name,setting from pg_settings where name like '%vacuum%';
> autovacuum = on
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_analyze_threshold = 2500
> autovacuum_freeze_max_age = 200000000
> autovacuum_max_workers = 8
> autovacuum_multixact_freeze_max_age = 400000000
> autovacuum_naptime = 15
> autovacuum_vacuum_cost_delay = 20
> autovacuum_vacuum_cost_limit = -1
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_vacuum_threshold = 500
> autovacuum_work_mem = -1
> log_autovacuum_min_duration = 0
> vacuum_cleanup_index_scale_factor = 0.1
> vacuum_cost_delay = 0
> vacuum_cost_limit = 1000
> vacuum_cost_page_dirty = 20
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_defer_cleanup_age = 0
> vacuum_freeze_min_age = 50000000
> vacuum_freeze_table_age = 150000000
> vacuum_multixact_freeze_min_age = 5000000
> vacuum_multixact_freeze_table_age = 150000000
> 
> I know the database is busy, so the throttling makes sense, but it
> seems like it would complete eventually.

The cost limit is way too small and the delay way too long for a busy
postgres instance.


> We see blocked autovacs for many hours.

On the same table, or just generally being busy?

Greetings,

Andres Freund



Re: PG12 autovac issues

From: Justin King
Date:
On Thu, Mar 19, 2020 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-19 18:07:14 -0500, Justin King wrote:
> > On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > > > From a single stats snapshot we can't actually understand the actual xid
> > > > > consumption - is it actually the xid usage that triggers the vacuums?
> > > >
> > > > We have looked at this and the xid consumption averages around 1250
> > > > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > > > the logs.  What I don't understand is why these xid's are being
> > > > consumed at this rate on the databases with no activity (postgres,
> > > > template1).
> > >
> > > The xid counter is global across all databases.
> >
> > Then what does the "age" value represent for each database in this
> > case?  Perhaps I'm misunderstanding what I'm looking at?
> >
> > postgres=#  SELECT datname, age(datfrozenxid),
> > current_setting('autovacuum_freeze_max_age') FROM pg_database;
> >   datname  |    age    | current_setting
> > -----------+-----------+-----------------
> >  postgres  | 100937449 | 200000000
> >  template1 |  50244438 | 200000000
> >  template0 | 160207297 | 200000000
> >  feedi     | 150147602 | 200000000
>
> Look at datfrozenxid without the age(). age(xid) computes how "old" xid
> is compared to the "next" xid to be assigned. Until vacuum comes around
> and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
> constant, since they represent the values actually present in the
> table.  But if xids are being consumed, their "age" increases, because
> they're further and further in the past relative to the "newest" xids.
>
>
> > > One big difference between a manual VACUUM and autovacuum is that with
> > > the default settings VACUUM is not throttled, but autovacuum is.
> > >
> > > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> > > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> > > vacuum_cost_page_miss set to?
> >
> > Here are all the vacuum related values for the server:
> >
> > postgres=# select name,setting from pg_settings where name like '%vacuum%';
> > autovacuum = on
> > autovacuum_analyze_scale_factor = 0.1
> > autovacuum_analyze_threshold = 2500
> > autovacuum_freeze_max_age = 200000000
> > autovacuum_max_workers = 8
> > autovacuum_multixact_freeze_max_age = 400000000
> > autovacuum_naptime = 15
> > autovacuum_vacuum_cost_delay = 20
> > autovacuum_vacuum_cost_limit = -1
> > autovacuum_vacuum_scale_factor = 0.2
> > autovacuum_vacuum_threshold = 500
> > autovacuum_work_mem = -1
> > log_autovacuum_min_duration = 0
> > vacuum_cleanup_index_scale_factor = 0.1
> > vacuum_cost_delay = 0
> > vacuum_cost_limit = 1000
> > vacuum_cost_page_dirty = 20
> > vacuum_cost_page_hit = 1
> > vacuum_cost_page_miss = 10
> > vacuum_defer_cleanup_age = 0
> > vacuum_freeze_min_age = 50000000
> > vacuum_freeze_table_age = 150000000
> > vacuum_multixact_freeze_min_age = 5000000
> > vacuum_multixact_freeze_table_age = 150000000
> >
> > I know the database is busy, so the throttling makes sense, but it
> > seems like it would complete eventually.
>
> The cost limit is way too small and the delay way too long for a busy
> postgres instance.

This does make sense, we will look into adjusting those values.
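
(Probably something along these lines -- exact numbers still TBD; both settings are
reloadable without a restart, and the autovacuum_vacuum_cost_delay default dropped to
2ms in v12, so we'd likely move toward that and raise the limit:

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();
)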

> > We see blocked autovacs for many hours.
>
> On the same table, or just generally being busy?

We haven't isolated *which* table it is blocked on (assuming it is),
but all autovacs cease running until we manually intervene.

When we get into this state again, is there some other information
(other than what is in pg_stat_statement or pg_stat_activity) that
would be useful for folks here to help understand what is going on?
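
(For instance, something like this -- just a sketch -- would at least show whether
anything is actually blocked, via pg_blocking_pids():

    SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;
)
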
>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

From: Andres Freund
Date:
Hi,

On 2020-03-20 12:42:31 -0500, Justin King wrote:
> When we get into this state again, is there some other information
> (other than what is in pg_stat_statement or pg_stat_activity) that
> would be useful for folks here to help understand what is going on?

If it's actually stuck on a single table, and that table is not large,
it would be useful to get a backtrace with gdb.

Greetings,

Andres Freund



Re: PG12 autovac issues

From: Michael Lewis
Date:
> We haven't isolated *which* table it is blocked on (assuming it is),
> but all autovacs cease running until we manually intervene.
>
> When we get into this state again, is there some other information
> (other than what is in pg_stat_statement or pg_stat_activity) that
> would be useful for folks here to help understand what is going on?

Re: PG12 autovac issues

From: Julien Rouhaud
Date:
On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > When we get into this state again, is there some other information
> > (other than what is in pg_stat_statement or pg_stat_activity) that
> > would be useful for folks here to help understand what is going on?
> 
> If it's actually stuck on a single table, and that table is not large,
> it would be useful to get a backtrace with gdb.

FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
during performance tests since a recent upgrade to pg12.

What seems to be happening is that after reaching 200M transactions a first pass
of autovacuum freeze is being run, bumping pg_database.datfrozenxid by ~ 800k
(age(datfrozenxid) still being more than autovacuum_freeze_max_age afterwards).
After that point, all available information seems to indicate that no
autovacuum worker is scheduled anymore:

- log_autovacuum_min_duration is set to 0 and no activity is logged (while
  having thousands of those per hour before that)
- 15 min interval snapshot of pg_database and pg_class (queries of the shape
  sketched just after this list) shows that datfrozenxid/relfrozenxid keeps
  increasing at a consistent rate and never goes down
- 15 min interval snapshot of pg_stat_activity doesn't show any autovacuum
  worker
- the autovacuum launcher is up and running and doesn't show any sign of a
  problem
- n_mod_since_analyze keeps growing at a consistent rate, never going down
- 15 min delta of tup_updated and tup_deleted shows that the global write
  activity doesn't change before and after the autovacuum problem
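
The snapshot queries are roughly of this shape (a sketch, not the exact monitoring code
we run):

    SELECT now() AS snapped_at, datname, datfrozenxid, age(datfrozenxid)
    FROM pg_database;

    SELECT now() AS snapped_at, c.oid::regclass AS rel, age(c.relfrozenxid) AS frozen_age,
           s.n_dead_tup, s.n_mod_since_analyze
    FROM pg_class c JOIN pg_stat_user_tables s ON s.relid = c.oid
    ORDER BY age(c.relfrozenxid) DESC LIMIT 20;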

The situation continues for ~2h, at which point the bloat is so heavy that the
main filesystem becomes full, and postgres panics after a failed write in
pg_logical directory or similar.

The same bench was run against pg11 many times and never triggered this issue.
So far our best guess is a side effect of 2aa6e331ead7.

Michael and I have been trying to reproduce this issue locally (drastically
reducing the various freeze_age parameters) for hours, but no luck for now.

This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
relevant configuration changes are quite aggressive autovacuum settings on some
tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
factor to 0, for both heap and toast).



Re: PG12 autovac issues

From: Andres Freund
Date:
Hi,

On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > When we get into this state again, is there some other information
> > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > would be useful for folks here to help understand what is going on?
> > 
> > If it's actually stuck on a single table, and that table is not large,
> > it would be useful to get a backtrace with gdb.
> 
> FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
> during performance tests since a recent upgrade to pg12 .
>
> What seems to be happening is that after reaching 200M transactions a first pass
> of autovacuum freeze is being run, bumping pg_database.datfrozenxid by ~ 800k
> (age(datfrozenxid) still being more than autovacuum_freeze_max_age
> afterwards).

If you have older transactions around that'd not be surprising. Do you
have autovacuum logging output for this case?


> After that point, all available information seems to indicate that no
> autovacuum worker is scheduled anymore:

Do you mean that this table doesn't get autovac'ed at all anymore, that
no table is getting autovac'd, or just that there's nothing further
increasing relfrozenxid for that table?

It sounds like:

> - log_autovacuum_min_duration is set to 0 and no activity is logged (while
>   having thousands of those per hour before that)

no table at all?


> - 15 min interval snapshot of pg_database and pg_class shows that
>   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
>   goes down

I assume you mean their age?

What is:
- datfrozenxid, age(datfrozenxid) for the oldest database
  SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
- relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
  SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class
  WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;

- Oldest backend xmin
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity
  WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;

- oldest replication xmin:
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication
  WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;

- oldest slot xmin:
  SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots
  WHERE xmin <> 0 OR catalog_xmin <> 0
  ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;

- oldest prepared transaction (unfortunately xmin not available)
  SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;


> The same bench was run against pg11 many times and never triggered this issue.
> So far our best guess is a side effect of 2aa6e331ead7.

In that case you'd likely see DEBUG1 output, right? Did you try running
with that?


> Michael and I have been trying to reproduce this issue locally (drastically
> reducing the various freeze_age parameters) for hours, but no luck for now.

Have you considered using gdb to investigate?


> This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> relevant configuration changes are quite aggressive autovacuum settings on some
> tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> factor to 0, for both heap and toast).

That, uh, is not an insignificant set of changes for an autovac
scheduling issue.

It sounds like what might be happening is that you have something
holding back the "oldest needed transaction" horizon. Before
2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
vacuums unable to increase the horizon, but afterwards they'll all
immediately exit without any messages.

I wonder if what might be happening is that we've somehow missed/failed
to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
table in the oldest database, but that is *NOT* the oldest table itself,
does the problem "resolve" itself?
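
E.g. something like this (just a sketch) to see which tables are the oldest, so you can
pick a different, smallish one to VACUUM manually:

    SELECT oid::regclass AS rel, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind IN ('r', 't', 'm')
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;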

Greetings,

Andres Freund



Re: PG12 autovac issues

From: Julien Rouhaud
Date:
Hi,

On Mon, Mar 23, 2020 at 09:23:03AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> > On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > > When we get into this state again, is there some other information
> > > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > > would be useful for folks here to help understand what is going on?
> > > 
> > > If it's actually stuck on a single table, and that table is not large,
> > > it would be useful to get a backtrace with gdb.
> > 
> > FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
> > during performance tests since a recent upgrade to pg12 .
> >
> > What seems to be happening is that after reaching 200M transactions a first pass
> > of autovacuum freeze is being run, bumping pg_database.datfrozenxid by ~ 800k
> > (age(datfrozenxid) still being more than autovacuum_freeze_max_age
> > afterwards).
> 
> If you have older transactions around that'd not be surprising. Do you
> have autovacuum logging output for this case?

There's an hourly long-running query that can retain xmin up to a few million
xids, but definitely nothing close to 200M.

When I said a first pass, I meant a batch of vacuums.  Here's an overview of the "to
prevent wraparound" per-hour log counts (UTC timezone):

 442 2020-03-14 07
1686 2020-03-14 08
  14 2020-03-14 10

and an overview of all autovacuum activity (still in UTC):

1366 2020-03-14 00
1457 2020-03-14 01
1387 2020-03-14 02
1440 2020-03-14 03
1349 2020-03-14 04
7383 2020-03-14 05
13909 2020-03-14 06
14240 2020-03-14 07
2094 2020-03-14 08
   0 2020-03-14 09
  16 2020-03-14 10
   3 2020-03-14 11
   4 2020-03-14 12
   3 2020-03-14 13

The final outage being:
2020-03-14 10:27:23.280 UTC [...] ERROR: could not extend file "base/16386/20245.4": No space left on device

536222:2020-03-14 10:40:00.089 UTC [...] PANIC:  could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device

the last autovacuum evidence before that being:

274177:2020-03-14 08:54:11.797 UTC 5e6c8ed0.d665 0   LOG:  automatic vacuum of table "pg_toast.pg_toast_20237": index scans: 1
274178-    pages: 0 removed, 273666 remain, 0 skipped due to pins, 251545 skipped frozen
274179-    tuples: 83585 removed, 749 remain, 209 are dead but not yet removable, oldest xmin: 210363848
274180-    buffer usage: 50096 hits, 23521 misses, 19996 dirtied
274181-    avg read rate: 99.195 MB/s, avg write rate: 84.329 MB/s
274182-    system usage: CPU: user: 0.28 s, system: 0.21 s, elapsed: 1.85 s

After the crash and restart, autovacuum is working again (as seen in the 10AM -
1PM logs), although the bench stays stopped.


What I can see in pg_database is (GMT+1 here):

          timestamp            | datfrozenxid | age(datfrozenxid)
 [...]
 2020-03-14 09:12:11.279515+01 |          480 | 202554804
 2020-03-14 09:27:12.723617+01 |          480 | 205408276
 2020-03-14 09:42:13.868701+01 |          480 | 208239967
 2020-03-14 09:57:15.685827+01 |       827585 | 210143294
 2020-03-14 10:12:17.488993+01 |       827585 | 213143797
 2020-03-14 10:27:18.899525+01 |       827585 | 216104516
 2020-03-14 10:42:19.926601+01 |       827585 | 219075040
 2020-03-14 10:57:21.023513+01 |       827585 | 222085423
 2020-03-14 11:12:22.85198+01  |       827585 | 225057731


> > After that point, all available information seems to indicate that no
> > autovacuum worker is scheduled anymore:
> 
> Do you mean that this table doesn't get autovac'ed at all anymore, that
> no table is getting autovac'd, or just that there's nothing further
> increasing relfrozenxid for that table?
> 
> It sounds like:
> 
> > - log_autovacuum_min_duration is set to 0 and no activity is logged (while
> >   having thousands of those per hour before that)
> 
> no table at all?

Correct, no table being autovacuumed, no sign of autovacuum being scheduled or
anything.

> > - 15 min interval snapshot of pg_database and pg_class shows that
> >   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
> >   goes down
> 
> I assume you mean their age?

Yes sorry.

> What is:
> - datfrozenxid, age(datfrozenxid) for the oldest database
>   SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;

see above.  FTR there's only one database being used.

> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
>   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class
>   WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;

The vm has been trashed since, and I don't have that level of detail available
in the gathered stats unfortunately (and the available information I have is a
little bit messy, sorry for that).

> - Oldest backend xmin
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity
>   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest replication xmin:
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication
>   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest slot xmin:
>   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots
>   WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> - oldest prepared transaction (unfortunately xmin not available)
>   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;

I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
(which may be dumb).  Here are the non-null rows after 8AM GMT+1:

           timestamp           |    kind     |   age   |   x_min
-------------------------------+-------------+---------+-----------
 2020-03-14 08:12:04.082523+01 | active_xact |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | idle_xact   |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | query       | 2538322 | 188484737
 2020-03-14 08:27:07.046938+01 | active_xact |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | idle_xact   |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | query       |   39646 | 193878709
 2020-03-14 08:42:08.902248+01 | active_xact |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | idle_xact   |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | query       |   12787 | 196752649
 2020-03-14 08:57:10.089606+01 | active_xact |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | idle_xact   |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | query       |   25453 | 199659568
 2020-03-14 09:12:11.279515+01 | active_xact |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | idle_xact   |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | query       | 2316084 | 200239200
 2020-03-14 09:27:12.723617+01 | active_xact |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | idle_xact   |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | query       | 5169556 | 200239200
 2020-03-14 09:42:13.868701+01 | active_xact |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | idle_xact   |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | query       |  934819 | 207305628
 2020-03-14 09:57:15.685827+01 | active_xact |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | idle_xact   |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | query       |     791 | 210970088
 2020-03-14 10:12:17.488993+01 | active_xact |     243 | 213971139
 2020-03-14 10:12:17.488993+01 | idle_xact   |      61 | 213971321
 2020-03-14 10:27:18.899525+01 | active_xact |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | idle_xact   |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | query       |     109 | 216931992
 2020-03-14 10:42:19.926601+01 | active_xact | 1046073 | 218856552
 2020-03-14 10:42:19.926601+01 | idle_xact   |      14 | 219902611
 2020-03-14 10:42:19.926601+01 | query       |     234 | 219902391
 2020-03-14 10:57:21.023513+01 | active_xact |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | idle_xact   |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | query       |     707 | 222912301
 2020-03-14 11:12:22.85198+01  | active_xact |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | idle_xact   |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | query       |     302 | 225885014

> > The same bench was run against pg11 many times and never triggered this issue.
> > So far our best guess is a side effect of 2aa6e331ead7.
> 
> In that case you'd likely see DEBUG1 output, right? Did you try running
> with that?

That's unfortunately not an option: while the issue is reproducible, it
happens after roughly 16h at an average of 6.5ktps, and we definitely don't
have enough disk space for that amount of logs.  We'll try next time with a
recompiled version with the DEBUG1 messages promoted to WARNING, and maybe other elogs.

> > Michael and I have been trying to reproduce this issue locally (drastically
> > reducing the various freeze_age parameters) for hours, but no luck for now.
> 
> Have you considered using gdb to investigate?

Not for now, because it's hard to be around for the ~90min interval the system
stays alive after autovacuum gets stuck, especially since it's a weekend bench.

> > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > relevant configuration changes are quite aggressive autovacuum settings on some
> > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > factor to 0, for both heap and toast).
> 
> That, uh, is not an insignificant set of changes for an autovac
> > scheduling issue.
> 
> It sounds like what might be happening is that you have something
> holding back the "oldest needed transaction" horizon. Before
> 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> vacuums unable to increase the horizon, but afterwards they'll all
> immediately exit without any messages.

The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
200069684 to 210363848.  E.g.:

[...]
266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
266604-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266605-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
266606-    buffer usage: 25 hits, 1 misses, 1 dirtied
266607-    avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
266608-    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
266610-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266611-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266612-    buffer usage: 25 hits, 1 misses, 1 dirtied
266613-    avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
266614-    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
266616-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266617-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266618-    buffer usage: 25 hits, 1 misses, 2 dirtied
266619-    avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
266620-    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
266622-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266623-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
[...]


> > I wonder if what might be happening is that we've somehow missed/failed
> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> table in the oldest database, but that is *NOT* the oldest table itself,
> does the problem "resolve" itself?

I'll also add a pg_class snapshot for the next time we run the bench.  I'm not sure
if we'll be able to catch the 1h45 interval during which the system stays alive
after the issue, though.



Re: PG12 autovac issues

From: Andres Freund
Date:
Hi,

On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
> > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
> >   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class
> >   WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
> 
> The vm has been trashed since, and I don't have that level of detail available
> in the gathered stats unfortunately (and the available information I have is a
> little bit messy, sorry for that).
> 
> > - Oldest backend xmin
> >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity
> >   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > - oldest replication xmin:
> >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication
> >   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > - oldest slot xmin:
> >   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots
> >   WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> > - oldest prepared transaction (unfortunately xmin not available)
> >   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
> 
> I have output from this query:
> https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
> (which may be dumb).  Here are the non-null rows after 8AM GMT+1:

Could you share what the config of the server was?


> > > The same bench was run against pg11 many times and never triggered this issue.
> > > So far our best guess is a side effect of 2aa6e331ead7.
> > 
> > In that case you'd likely see DEBUG1 output, right? Did you try running
> > with that?
> 
> That's unfortunately not an option, as while the issue is reproducible, it
> happens after roughly 16h with an average of 6.5ktps, and we definitely don't
> have enough disk space for that amount of logs.  We'll try next time with a
> recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.

You wouldn't need to run it with DEBUG1 the whole time, you could just
change the config and reload config once you hit the issue.
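
I.e. roughly (a sketch):

    ALTER SYSTEM SET log_min_messages = 'debug1';
    SELECT pg_reload_conf();
    -- and once you've captured what you need:
    ALTER SYSTEM RESET log_min_messages;
    SELECT pg_reload_conf();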


> > > Michael and I have been trying to reproduce this issue locally (drastically
> > > reducing the various freeze_age parameters) for hours, but no luck for now.
> > 
> > Have you considered using gdb to investigate?
> 
> Not for now, because it's hard to be around for the ~ 90min interval the system
> stays alive after autovacuum gets stuck, especially since it's a weekend bench.
> 
> > > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > > relevant configuration changes are quite aggressive autovacuum settings on some
> > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > > factor to 0, for both heap and toast).
> > 
> > That, uh, is not an insignificant set of changes for an autovac
> > scheduling issue.
> > 
> > It sounds like what might be happening is that you have something
> > holding back the "oldest needed transaction" horizon. Before
> > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> > vacuums unable to increase the horizon, but afterwards they'll all
> > immediately exit without any messages.
> 
> The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
> for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
> 200069684 to 210363848.  E.g.:

Which database is this on?


> [...]
> 266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
> 266604-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266605-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
> 266606-    buffer usage: 25 hits, 1 misses, 1 dirtied
> 266607-    avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
> 266608-    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
> 266610-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266611-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> 266612-    buffer usage: 25 hits, 1 misses, 1 dirtied
> 266613-    avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> 266614-    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> 266616-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266617-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> 266618-    buffer usage: 25 hits, 1 misses, 2 dirtied
> 266619-    avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> 266620-    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> 266622-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266623-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
> [...]

Do you have any non-toast ones?


> > > I wonder if what might be happening is that we've somehow missed/failed
> > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > table in the oldest database, but that is *NOT* the oldest table itself,
> > does the problem "resolve" itself?
> 
> I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
> if we'll be able to catch the 1h45 interval when the system stays alive
> after the issue though.

Could you just script something to stop the benchmark once the disk is
90% full or so?

Did you see any errors / fatals around the time autovacuum stopped
working?

Greetings,

Andres Freund



Re: PG12 autovac issues

From: Justin King
Date:
On Mon, Mar 23, 2020 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
> > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
> > >   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class
> > >   WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
> >
> > The vm has been trashed since, and I don't have that level of detail available
> > in the gathered stats unfortunately (and the available information I have is a
> > little bit messy, sorry for that).
> >
> > > - Oldest backend xmin
> > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity
> > >   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > - oldest replication xmin:
> > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication
> > >   WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > - oldest slot xmin:
> > >   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots
> > >   WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> > > - oldest prepared transaction (unfortunately xmin not available)
> > >   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
> >
> > I have output from this query:
> > https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
> > (which may be dumb).  Here are the non-null rows after 8AM GMT+1:
>
> Could you share what the config of the server was?
>
>
> > > > The same bench was run against pg11 many times and never triggered this issue.
> > > > So far our best guess is a side effect of 2aa6e331ead7.
> > >
> > > In that case you'd likely see DEBUG1 output, right? Did you try running
> > > with that?
> >
> > That's unfortunately not an option, as while the issue is reproducible, it
> > happens after roughly 16h with an average of 6.5ktps, and we definitely don't
> > have enough disk space for that amount of logs.  We'll try next time with a
> > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.
>
> You wouldn't need to run it with DEBUG1 the whole time, you could just
> change the config and reload config once you hit the issue.
>
>
> > > > Michael and I have been trying to reproduce this issue locally (drastically
> > > > reducing the various freeze_age parameters) for hours, but no luck for now.
> > >
> > > Have you considered using gdb to investigate?
> >
> > Not for now, because it's hard to be around for the ~ 90min interval the system
> > stays alive after autovacuum gets stuck, especially since it's a weekend bench.
> >
> > > > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > > > relevant configuration changes are quite aggressive autovacuum settings on some
> > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > > > factor to 0, for both heap and toast).
> > >
> > > That, uh, is not an insignificant set of changes for an autovac
> > > scheduling issue.
> > >
> > > It sounds like what might be happening is that you have something
> > > holding back the "oldest needed transaction" horizon. Before
> > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> > > vacuums unable to increase the horizon, but afterwards they'll all
> > > immediately exit without any messages.
> >
> > The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
> > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
> > 200069684 to 210363848.  E.g.:
>
> Which database is this on?
>
>
> > [...]
> > 266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
> > 266604-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266605-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
> > 266606-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > 266607-       avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
> > 266608-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > 266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
> > 266610-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266611-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > 266612-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > 266613-       avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> > 266614-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> > 266616-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266617-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > 266618-       buffer usage: 25 hits, 1 misses, 2 dirtied
> > 266619-       avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> > 266620-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> > 266622-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266623-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
> > [...]
>
> Do you have any non-toast ones?

This is occurring in our environment right now (started about 30 min
ago).  Here are the latest logs (grepped by vacuum):

Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542
GMT [15569] LOG:  automatic vacuum of table "feedi.production.tita":
index scans: 1
Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964
GMT [15654] LOG:  automatic vacuum of table
"feedi.production.distributed_virtual_schedule": index scans: 1
Mar 23 20:54:29 cowtn postgres[15654]: [10-1] 2020-03-23 20:54:29.103
GMT [15654] LOG:  automatic vacuum of table "feedi.production.tita":
index scans: 1
Mar 23 20:54:29 cowtn postgres[15654]: [12-1] 2020-03-23 20:54:29.284
GMT [15654] LOG:  automatic vacuum of table
"feedi.pg_toast.pg_toast_2619": index scans: 1
Mar 23 20:54:44 cowtn postgres[15759]: [8-1] 2020-03-23 20:54:44.005
GMT [15759] LOG:  automatic vacuum of table
"feedi.production.flightplans": index scans: 1
Mar 23 20:54:45 cowtn postgres[15759]: [11-1] 2020-03-23 20:54:45.979
GMT [15759] LOG:  automatic vacuum of table "feedi.production.tita":
index scans: 1
Mar 23 20:54:54 cowtn postgres[15814]: [7-1] 2020-03-23 20:54:54.367
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_statistic": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [8-1] 2020-03-23 20:54:54.378
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_type": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [9-1] 2020-03-23 20:54:54.389
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_foreign_server": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [10-1] 2020-03-23 20:54:54.410
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_statistic_ext_data": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [11-1] 2020-03-23 20:54:54.420
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_user_mapping": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [7-1] 2020-03-23 20:55:09.194
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_attribute": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [8-1] 2020-03-23 20:55:09.205
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_proc": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [9-1] 2020-03-23 20:55:09.216
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_class": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [10-1] 2020-03-23 20:55:09.227
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_attrdef": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [11-1] 2020-03-23 20:55:09.239
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_constraint": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [12-1] 2020-03-23 20:55:09.250
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_inherits": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [13-1] 2020-03-23 20:55:09.260
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_index": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [14-1] 2020-03-23 20:55:09.272
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_operator": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [15-1] 2020-03-23 20:55:09.283
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_opfamily": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [16-1] 2020-03-23 20:55:09.294
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_opclass": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [17-1] 2020-03-23 20:55:09.305
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_am": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [18-1] 2020-03-23 20:55:09.317
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_amop": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [19-1] 2020-03-23 20:55:09.328
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_amproc": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [20-1] 2020-03-23 20:55:09.339
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_language": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [21-1] 2020-03-23 20:55:09.349
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_largeobject_metadata": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [22-1] 2020-03-23 20:55:09.360
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_aggregate": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [23-1] 2020-03-23 20:55:09.371
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_largeobject": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [24-1] 2020-03-23 20:55:09.382
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_statistic_ext": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [25-1] 2020-03-23 20:55:09.393
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_rewrite": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [26-1] 2020-03-23 20:55:09.404
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_trigger": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [27-1] 2020-03-23 20:55:09.415
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_event_trigger": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [28-1] 2020-03-23 20:55:09.426
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_description": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [29-1] 2020-03-23 20:55:09.437
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_cast": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [30-1] 2020-03-23 20:55:09.448
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_enum": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [31-1] 2020-03-23 20:55:09.459
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_namespace": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [32-1] 2020-03-23 20:55:09.470
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_conversion": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [33-1] 2020-03-23 20:55:09.481
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_depend": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [34-1] 2020-03-23 20:55:09.502
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_db_role_setting": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [35-1] 2020-03-23 20:55:09.524
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_pltemplate": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [36-1] 2020-03-23 20:55:09.545
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_shdepend": index scans: 1
Mar 23 20:55:09 cowtn postgres[15890]: [37-1] 2020-03-23 20:55:09.556
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_shdescription": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [38-1] 2020-03-23 20:55:09.567
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_config": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [39-1] 2020-03-23 20:55:09.578
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_config_map": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [40-1] 2020-03-23 20:55:09.589
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_dict": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [41-1] 2020-03-23 20:55:09.600
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_parser": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [42-1] 2020-03-23 20:55:09.611
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_template": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [43-1] 2020-03-23 20:55:09.622
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_extension": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [44-1] 2020-03-23 20:55:09.633
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_foreign_data_wrapper": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [45-1] 2020-03-23 20:55:09.643
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_foreign_table": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [46-1] 2020-03-23 20:55:09.654
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_policy": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [47-1] 2020-03-23 20:55:09.675
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_default_acl": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [48-1] 2020-03-23 20:55:09.686
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_init_privs": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [49-1] 2020-03-23 20:55:09.696
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_seclabel": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [50-1] 2020-03-23 20:55:09.719
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_collation": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [51-1] 2020-03-23 20:55:09.729
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_partitioned_table": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [52-1] 2020-03-23 20:55:09.740
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_range": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [53-1] 2020-03-23 20:55:09.751
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_transform": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [54-1] 2020-03-23 20:55:09.762
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_sequence": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [55-1] 2020-03-23 20:55:09.772
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_publication": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [56-1] 2020-03-23 20:55:09.783
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_publication_rel": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [57-1] 2020-03-23 20:55:09.794
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_subscription_rel": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [58-1] 2020-03-23 20:55:09.805
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_packages": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [59-1] 2020-03-23 20:55:09.816
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_features": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [60-1] 2020-03-23 20:55:09.827
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_implementation_info": index
scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [61-1] 2020-03-23 20:55:09.838
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_parts": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [62-1] 2020-03-23 20:55:09.849
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_languages": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [63-1] 2020-03-23 20:55:09.859
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_sizing": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [64-1] 2020-03-23 20:55:09.870
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_sizing_profiles": index scans:
0
Mar 23 20:55:09 cowtn postgres[15890]: [65-1] 2020-03-23 20:55:09.880
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2600": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [66-1] 2020-03-23 20:55:09.891
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2604": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [67-1] 2020-03-23 20:55:09.902
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3456": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [68-1] 2020-03-23 20:55:09.912
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2606": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [69-1] 2020-03-23 20:55:09.923
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_826": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [70-1] 2020-03-23 20:55:09.933
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2609": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [71-1] 2020-03-23 20:55:09.944
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3466": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [72-1] 2020-03-23 20:55:09.955
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3079": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [73-1] 2020-03-23 20:55:09.965
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2328": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [74-1] 2020-03-23 20:55:09.976
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1417": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [75-1] 2020-03-23 20:55:09.987
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3118": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [76-1] 2020-03-23 20:55:09.997
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3394": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [77-1] 2020-03-23 20:55:10.008
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2612": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [78-1] 2020-03-23 20:55:10.018
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2615": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [79-1] 2020-03-23 20:55:10.029
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3350": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [80-1] 2020-03-23 20:55:10.040
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3256": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [81-1] 2020-03-23 20:55:10.050
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1255": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [82-1] 2020-03-23 20:55:10.062
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2618": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [83-1] 2020-03-23 20:55:10.072
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3596": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [84-1] 2020-03-23 20:55:10.083
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2619": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [85-1] 2020-03-23 20:55:10.094
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3381": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [86-1] 2020-03-23 20:55:10.105
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3429": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [87-1] 2020-03-23 20:55:10.115
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2620": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [88-1] 2020-03-23 20:55:10.126
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3600": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [89-1] 2020-03-23 20:55:10.137
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1247": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [90-1] 2020-03-23 20:55:10.147
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1418": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [91-1] 2020-03-23 20:55:10.158
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1260": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [92-1] 2020-03-23 20:55:10.169
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1262": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [93-1] 2020-03-23 20:55:10.179
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2964": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [94-1] 2020-03-23 20:55:10.190
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1136": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [95-1] 2020-03-23 20:55:10.201
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_6000": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [96-1] 2020-03-23 20:55:10.211
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2396": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [97-1] 2020-03-23 20:55:10.222
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3592": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [98-1] 2020-03-23 20:55:10.232
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_6100": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [99-1] 2020-03-23 20:55:10.243
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1213": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [100-1] 2020-03-23 20:55:10.254
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13267": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [101-1] 2020-03-23 20:55:10.264
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13272": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [102-1] 2020-03-23 20:55:10.275
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13287": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [103-1] 2020-03-23 20:55:10.286
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13277": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [104-1] 2020-03-23 20:55:10.296
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13282": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [105-1] 2020-03-23 20:55:10.307
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13292": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [106-1] 2020-03-23 20:55:10.317
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13297": index scans: 0

>
>
> > > I wonder if what might be happening is that we're somehow missed/failed
> > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > does the problem "resolve" itself?

postgres=# SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 202375735 | 200000000
 template1 | 202345459 | 200000000
 template0 | 132459914 | 200000000
 feedi     | 132459914 | 200000000
(4 rows)


> >
> > I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
> > if we'll be able to catch the 1h45 interval when the system stays alive
> > after the issue though.
>
> Could you just script something to stop the benchmark once the disk is
> 90% full or so?
>
> Did you see any errors / fatals around the time autovacuum stopped
> working?

There are no ERROR or FATAL messages around the time -- (or at all).

Since this is occurring right now, what else would be useful to
capture?  You'd asked about a GDB -- do you want that of the main
process or the autovac worker?

>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

From:
Andres Freund
Date:
Hi,

On 2020-03-23 16:31:21 -0500, Justin King wrote:
> This is occurring in our environment right now (started about 30 min
> ago).  Here's the latest logs (grepped by vacuum):
> 
> Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542
> GMT [15569] LOG:  automatic vacuum of table "feedi.production.tita":
> index scans: 1
> Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964
> GMT [15654] LOG:  automatic vacuum of table
> "feedi.production.distributed_virtual_schedule": index scans: 1

Hm, unfortunately you've cut off the details in the subsequent
lines. There's a few newlines in the output. Any chance you could
re-post with those included?


> > > > I wonder if what might be happening is that we're somehow missed/failed
> > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > > does the problem "resolve" itself?
> 
> postgres=# SELECT datname
>     , age(datfrozenxid)
>     , current_setting('autovacuum_freeze_max_age')
> FROM pg_database;
>   datname  |    age    | current_setting
> -----------+-----------+-----------------
>  postgres  | 202375735 | 200000000
>  template1 | 202345459 | 200000000
>  template0 | 132459914 | 200000000
>  feedi     | 132459914 | 200000000
> (4 rows)

Can you show the oldest tables in 'feedi'? Or, hm, actually, could you
just post the result of all the queries from the "What is:" section in
https://postgr.es/m/20200323162303.s7ay5hjdvimtkz6u%40alap3.anarazel.de


> Since this is occurring right now, what else would be useful to
> capture?  You'd asked about a GDB -- do you want that of the main
> process or the autovac worker?

Unless you can give me gdb access directly, I don't yet have enough data
to suggest what exactly we would want to analyze with gdb in your case.


It'd be helpful if you could change log_min_messages to DEBUG1 and
reload the configuration (not restart!).
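
For example, one way to do that without editing postgresql.conf (just a
sketch; any equivalent way of setting the GUC and reloading is fine):

-- as a superuser
ALTER SYSTEM SET log_min_messages = 'debug1';
SELECT pg_reload_conf();  -- picks up the change without a restart

-- and to go back to the previous level once done
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();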

Greetings,

Andres Freund



Re: PG12 autovac issues

From:
Justin King
Date:
On Mon, Mar 23, 2020 at 4:31 PM Justin King <kingpin867@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
> > > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
> > > >   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
> > >
> > > The vm has been trashed since, and I don't have that level of detail available
> > > in the gathered stats unfortunately (and the available information I have is a
> > > little bit messy, sorry for that).
> > >
> > > > - Oldest backend xmin
> > > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > > - oldest replication xmin:
> > > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > > - oldest slot xmin:
> > > >   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> > > > - oldest prepared transaction (unfortunately xmin not available)
> > > >   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
> > >
> > > I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
> > > (which may be dumb).  Here are the non-null rows after 8AM GMT+1:
> >
> > Could you share what the config of the server was?
> >
> >
> > > > > The same bench was run against pg11 many times and never triggered this issue.
> > > > > So far our best guess is a side effect of 2aa6e331ead7.
> > > >
> > > > In that case you'd likely see DEBUG1 output, right? Did you try running
> > > > with that?
> > >
> > > That's unfortunately not an option, as while the issue is reproducible, it
> > > happens after roughly 16h with an average of 6.5ktps, and we definitely don't
> > > have enough disk space for that amount of logs.  We'll try next time with a
> > > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.
> >
> > You wouldn't need to run it with DEBUG1 the whole time, you could just
> > change the config and reload config once you hit the issue.
> >
> >
> > > > > Michael and I have been trying to reproduce this issue locally (drastically
> > > > > reducing the various freeze_age parameters) for hours, but no luck for now.
> > > >
> > > > Have you considered using gdb to investigate?
> > >
> > > Not for now, because it's hard to be around for the ~ 90min interval the system
> > > stays alive after autovacuum gets stuck, especially since it's a weekend bench.
> > >
> > > > > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > > > > relevant configuration changes are quite aggressive autovacuum settings on some
> > > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > > > > factor to 0, for both heap and toast).
> > > >
> > > > That, uh, is not an insignificant set of changes for an autovac
> > > > scheduling issue.
> > > >
> > > > It sounds like what might be happening is that you have something
> > > > holding back the "oldest needed transaction" horizon. Before
> > > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> > > > vacuums unable to increase the horizon, but afterwards they'll all
> > > > immediately exit without any messages.
> > >
> > > The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
> > > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
> > > 200069684 to 210363848.  E.g.:
> >
> > Which database is this on?
> >
> >
> > > [...]
> > > 266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
> > > 266604-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266605-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
> > > 266606-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > > 266607-       avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
> > > 266608-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > > 266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
> > > 266610-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266611-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > > 266612-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > > 266613-       avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> > > 266614-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > > 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> > > 266616-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266617-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > > 266618-       buffer usage: 25 hits, 1 misses, 2 dirtied
> > > 266619-       avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> > > 266620-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > > 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> > > 266622-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266623-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
> > > [...]
> >
> > Do you have any non-toast ones?
>
> This is occurring in our environment right now (started about 30 min
> ago).  Here's the latest logs (grepped by vacuum):
>
> [...]
>
> >
> >
> > > > I wonder if what might be happening is that we're somehow missed/failed
> > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > > does the problem "resolve" itself?

Andres-

I did confirm that if you run a manual vacuum on a table in the oldest
database but NOT the oldest table, the problem does resolve itself.
Here is the before and after:

postgres=# SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 202375735 | 200000000
 template1 | 202345459 | 200000000
 template0 | 132459914 | 200000000
 feedi     | 132459914 | 200000000
(4 rows)

postgres=#  SELECT c.oid::regclass
    , age(c.relfrozenxid)
    , pg_size_pretty(pg_total_relation_size(c.oid))
    , age(c.relfrozenxid)/200000000::float AS pct
FROM pg_class c
JOIN pg_namespace n on c.relnamespace = n.oid
WHERE relkind IN ('r', 't', 'm')
ORDER BY 2 DESC LIMIT 10;
          oid          |    age    | pg_size_pretty |     pct
-----------------------+-----------+----------------+-------------
 pg_subscription       | 203640681 | 24 kB          | 1.018203405
 pg_authid             | 203640681 | 80 kB          | 1.018203405
 pg_tablespace         | 203620449 | 80 kB          | 1.018102245
 pg_shseclabel         | 203620449 | 16 kB          | 1.018102245
 pg_auth_members       | 203620449 | 72 kB          | 1.018102245
 pg_database           | 203620449 | 80 kB          | 1.018102245
 pg_replication_origin | 203620449 | 24 kB          | 1.018102245
 pg_statistic          |  53646423 | 456 kB         | 0.268232115
 pg_statistic_ext_data |  53646423 | 16 kB          | 0.268232115
 pg_user_mapping       |  53646423 | 24 kB          | 0.268232115
(10 rows)

postgres=# vacuum FREEZE VERBOSE ANALYZE pg_statistic;
INFO:  aggressively vacuuming "pg_catalog.pg_statistic"
INFO:  "pg_statistic": found 0 removable, 18 nonremovable row versions
in 1 out of 38 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 39602876
There were 18 unused item identifiers.
Skipped 0 pages due to buffer pins, 37 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  aggressively vacuuming "pg_toast.pg_toast_2619"
INFO:  "pg_toast_2619": found 0 removable, 13 nonremovable row
versions in 4 out of 4 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 39602876
There were 12 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

postgres=# SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  |  54640707 | 200000000
 template1 |     38248 | 200000000
 template0 | 134719144 | 200000000
 feedi     | 134719144 | 200000000
(4 rows)

Immediately after running this, all the autovacs started working
again.  I've attached logfiles that might be easier to read.
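
(As a side note, here is a quick cross-check -- just a sketch, run while
connected to the database in question -- comparing age(datfrozenxid)
against the oldest age(relfrozenxid) that pg_class knows about:

SELECT age(d.datfrozenxid) AS datfrozenxid_age,
       (SELECT max(age(c.relfrozenxid))
          FROM pg_class c
         WHERE c.relkind IN ('r', 't', 'm')) AS oldest_relfrozenxid_age
FROM pg_database d
WHERE d.datname = current_database();

If the two drift far apart, that would point at the missed/failed
datfrozenxid update you mentioned.)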

We are going to try and recreate the issue; in the meantime, I've
turned on DEBUG1, and when it reoccurs, I'll run the queries in the
"What Is" section as well.

I also have a backtrace of the autovac launcher, but there appears to
be nothing useful at all there (as you'd probably expect) and there
were no autovac workers running.

Thank you again-
Justin

>
> postgres=# SELECT datname
>     , age(datfrozenxid)
>     , current_setting('autovacuum_freeze_max_age')
> FROM pg_database;
>   datname  |    age    | current_setting
> -----------+-----------+-----------------
>  postgres  | 202375735 | 200000000
>  template1 | 202345459 | 200000000
>  template0 | 132459914 | 200000000
>  feedi     | 132459914 | 200000000
> (4 rows)
>
>
> > >
> > > I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
> > > if we'll be able to catch the 1h45 interval when the system stays alive
> > > after the issue though.
> >
> > Could you just script something to stop the benchmark once the disk is
> > 90% full or so?
> >
> > Did you see any errors / fatals around the time autovacuum stopped
> > working?
>
> There are no ERROR or FATAL messages around the time -- (or at all).
>
> Since this is occurring right now, what else would be useful to
> capture?  You'd asked about a GDB -- do you want that of the main
> process or the autovac worker?
>
> >
> > Greetings,
> >
> > Andres Freund

Attachments

Re: PG12 autovac issues

From:
Michael Paquier
Date:
On Mon, Mar 23, 2020 at 01:00:51PM -0700, Andres Freund wrote:
> On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
>>> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
>>>   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
>>
>> The vm has been trashed since, and I don't have that level of detail available
>> in the gathered stats unfortunately (and the available information I have is a
>> little bit messy, sorry for that).
>>
>>> - Oldest backend xmin
>>>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
>>> - oldest replication xmin:
>>>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
>>> - oldest slot xmin:
>>>   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
>>> - oldest prepared transaction (unfortunately xmin not available)
>>>   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
>>
>> I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
>> (which may be dumb).  Here are the non-null rows after 8AM GMT+1:

This stuff does not use 2PC or logical slots (there is one physical
slot for a WAL archiver), but we got a dump of pg_stat_activity.
Grepping more info about the evolution of pg_database and pg_class is
on our list.

> Could you share what the config of the server was?

Nothing really fancy:
- autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it
in v11 as well).
- autovacuum_naptime = 15s
- autovacuum_max_workers = 6
- log_autovacuum_min_duration = 0
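
If it helps, we can also dump the exact values from the bench host next
time, e.g. with something like:

SELECT name, setting, unit, source
FROM pg_settings
WHERE name LIKE 'autovacuum%' OR name = 'log_autovacuum_min_duration'
ORDER BY name;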

>>> In that case you'd likely see DEBUG1 output, right? Did you try running
>>> with that?
>>
>> That's unfortunately not an option, as while the issue is reproducible, it
>> happens after roughly 16h with an average of 6.5ktps, and we definitely don't
>> have enough disk space for that amount of logs.  We'll try next time with a
>> recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.
>
> You wouldn't need to run it with DEBUG1 the whole time, you could just
> change the config and reload config once you hit the issue.

Sure, though it would reduce the window when the cluster is still up
and running.  One option that we have is just to increase the elevel
from DEBUG1 to WARNING for the log added in 2aa6e33 to keep the amount
of logs reduced without losing information, and we have no issue
doing runs with custom patches.

>> Not for now, because it's hard to be around for the ~ 90min interval the system
>> stays alive after autovacuum gets stuck, especially since it's a weekend bench.

Yeah, our room is very limited here.  I would likely not be able to do
that, or I would just get very lucky with my timezone.  Not sure for
Julien.

>>> It sounds like what might be happening is that you have something
>>> holding back the "oldest needed transaction" horizon. Before
>>> 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
>>> vacuums unable to increase the horizon, but afterwards they'll all
>>> immediately exit without any messages.
>>
>> The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
>> for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
>> 200069684 to 210363848.  E.g.:
>
> Which database is this on?

In this case this was the database used by the application, FWIW.  But
I can see from those logs that it kept increasing for all the other
databases, like postgres or template1 when this set of aggressive jobs
happened.

>> 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
>> 266622-    pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
>> 266623-    tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
>> [...]
>
> Do you have any non-toast ones?

Yep, this includes catalogs and normal tables, based on the logs all
the relations triggered aggressive and anti-wraparound jobs.
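
For the pg_class snapshots mentioned above, something along these lines
should be enough to track which relations have crossed the freeze limit
(a sketch):

SELECT c.oid::regclass AS relation,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')
  AND age(c.relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
ORDER BY age(c.relfrozenxid) DESC;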

>>> I wonder if what might be happening is that we're somehow missed/failed
>>> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
>>> table in the oldest database, but that is *NOT* the oldest table itself,
>>> does the problem "resolve" itself?
>>
>> I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
>> if we'll be able to catch the 1h45 interval when the system stays alive
>> after the issue though.
>
> Could you just script something to stop the benchmark once the disk is
> 90% full or so?

Hmm.  I think that this one is possibly still tricky.  There are some
alarms in place in this system.

> Did you see any errors / fatals around the time autovacuum stopped
> working?

Before going rogue (we are not sure if autovacuum didn't launch any
workers or if the workers were spawned and exited early as we did not
capture any worker information in pg_stat_activity), we saw a bunch of
aggressive wraparound jobs.  Even after that, we have traces in the
logs of one autovacuum analyze happening at equal intervals of time (17
minutes) on one single table, which is...  Er...  uncommon to say the
least.
--
Michael

Attachments

Re: PG12 autovac issues

From:
Andres Freund
Date:
Hi,

On 2020-03-24 14:26:06 +0900, Michael Paquier wrote:
> > Could you share what the config of the server was?
> 
> Nothing really fancy:
> - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it
> in v11 as well).
> - autovacuum_naptime = 15s
> - autovacuum_max_workers = 6
> - log_autovacuum_min_duration = 0

Oh, so you're also involved in this? I'm starting to get a bit confused
as to who is reporting what.


> > Did you see any errors / fatals around the time autovacuum stopped
> > working?
> 
> Before going rogue (we are not sure if autovacuum didn't launch any
> workers or if the workers were spawned and exited early as we did not
> capture any worker information in pg_stat_activity), we saw a bunch of
> aggressive wraparound jobs.  Even after that, we have traces in the
> logs of one autovacuum analyze happening at equal intervals of time (17
> minutes) on one single table, which is...  Er...  uncommon to say the
> least.

Well, there's no logging of autovacuum launchers that don't do anything
due to the "skipping redundant" logic, with normal log level. If somehow
the horizon logic of autovacuum workers gets out of whack with what
vacuumlazy.c does, then you'd not get any vacuums. But a usage level
triggered analyze could still happen on such a table, I think.

While looking at this issue I found a few problems, btw. That seems more
like a -hackers discussion, so I started:
https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

I think I might just have figured out another one...

Greetings,

Andres Freund



Re: PG12 autovac issues

From:
Michael Paquier
Date:
On Mon, Mar 23, 2020 at 10:40:39PM -0700, Andres Freund wrote:
> On 2020-03-24 14:26:06 +0900, Michael Paquier wrote:
>> Nothing really fancy:
>> - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it
>> in v11 as well).
>> - autovacuum_naptime = 15s
>> - autovacuum_max_workers = 6
>> - log_autovacuum_min_duration = 0
>
> Oh, so you're also involved in this? I'm starting to get a bit confused
> as to who is reporting what.

Yeah, sort of.  Julien has done a lot of work on that and I have
access to the data and test beds, so we are just saying the same
things.

> Well, there's no logging of autovacuum launchers that don't do anything
> due to the "skipping redundant" logic, with normal log level. If somehow
> the horizon logic of autovacuum workers gets out of whack with what
> vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> triggered analyze could still happen on such a table, I think.

What surprised me the most is that the same table happened to be
analyzed again and again after the launcher began its blackout.

> While looking at this issue I found a few problems, btw. That seems more
> like a -hackers discussion, so I started:
> https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

Yes, let's discuss there.

> I think I might just have figured out another one...

Ouch.
--
Michael

Attachments

Re: PG12 autovac issues

From
Andres Freund
Date:
Hi,

On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > Well, there's no logging of autovacuum launchers that don't do anything
> > due to the "skipping redundant" logic, with normal log level. If somehow
> > the horizon logic of autovacuum workers gets out of whack with what
> > vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> > triggered analyze could still happen on such a table, I think.
> 
> What surprised me the most is that the same table happened to be
> analyzed again and again after the launcher began its blackout.

Well, if there's an issue with the "redundant" logic, that would be a
not too surprising outcome. It's quite plausible that one or two tables
in the database would get enough changes to occasionally need to be
analyzed. If the workload is steady, that could e.g. work out to every
~17 minutes. All tables that autovacuum thinks are not wraparound
threatened will be skipped, but ones that are will get both vacuum and
analyze queued. The redundant logic could then entirely skip all
vacuums - but there's no equivalent for analyze.
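
As a rough back-of-the-envelope sketch (the ~50 changes/sec rate below
is an assumption for illustration, not a measured figure): with default
analyze settings a ~500k-row table needs about 50 + 0.1 * 500000 =
50,050 row changes before autoanalyze fires again, which at ~50
changes/sec works out to roughly 17 minutes:

SELECT 50 + 0.1 * 500000                    AS changes_needed,
       round((50 + 0.1 * 500000) / 50 / 60) AS minutes_at_50_changes_per_sec;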

> > While looking at this issue I found a few problems, btw. That seems more
> > like a -hackers discussion, so I started:
> > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
> 
> Yes, let's discuss there.

Cool. Would also be good if you could expand on the thread introducing
the "redundant" logic.

Greetings,

Andres Freund



Re: PG12 autovac issues

From
Justin King
Date:
All-

This started happening again.  DEBUG1 is enabled:

Mar 25 14:48:03 cowtn postgres[39720]: [35294-1] 2020-03-25
14:48:03.972 GMT [39720] DEBUG:  autovacuum: processing database
"template0"
Mar 25 14:48:06 cowtn postgres[39735]: [35294-1] 2020-03-25
14:48:06.545 GMT [39735] DEBUG:  autovacuum: processing database
"postgres"
Mar 25 14:48:11 cowtn postgres[39759]: [35294-1] 2020-03-25
14:48:11.284 GMT [39759] DEBUG:  autovacuum: processing database
"template1"
Mar 25 14:48:14 cowtn postgres[39772]: [35294-1] 2020-03-25
14:48:14.564 GMT [39772] DEBUG:  autovacuum: processing database
"feedi"
Mar 25 14:48:14 cowtn postgres[39772]: [35295-1] 2020-03-25
14:48:14.588 GMT [39772] LOG:  automatic vacuum of table
"feedi.pg_catalog.pg_statistic": index scans: 1
Mar 25 14:48:14 cowtn postgres[39772]: [35295-2] #011pages: 0 removed,
117 remain, 0 skipped due to pins, 0 skipped frozen
Mar 25 14:48:14 cowtn postgres[39772]: [35295-3] #011tuples: 477
removed, 704 remain, 0 are dead but not yet removable, oldest xmin:
189591147
Mar 25 14:48:14 cowtn postgres[39772]: [35295-4] #011buffer usage: 315
hits, 0 misses, 0 dirtied
Mar 25 14:48:14 cowtn postgres[39772]: [35295-5] #011avg read rate:
0.000 MB/s, avg write rate: 0.000 MB/s
Mar 25 14:48:14 cowtn postgres[39772]: [35295-6] #011system usage:
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Mar 25 14:48:15 cowtn postgres[39772]: [35296-1] 2020-03-25
14:48:15.606 GMT [39772] LOG:  automatic vacuum of table
"feedi.production.flightplans": index scans: 1
Mar 25 14:48:15 cowtn postgres[39772]: [35296-2] #011pages: 0 removed,
61331 remain, 0 skipped due to pins, 29772 skipped frozen
Mar 25 14:48:15 cowtn postgres[39772]: [35296-3] #011tuples: 21807
removed, 509983 remain, 1581 are dead but not yet removable, oldest
xmin: 189591147
Mar 25 14:48:15 cowtn postgres[39772]: [35296-4] #011buffer usage:
113684 hits, 0 misses, 1 dirtied
Mar 25 14:48:15 cowtn postgres[39772]: [35296-5] #011avg read rate:
0.000 MB/s, avg write rate: 0.008 MB/s
Mar 25 14:48:15 cowtn postgres[39772]: [35296-6] #011system usage:
CPU: user: 0.64 s, system: 0.12 s, elapsed: 1.00 s
Mar 25 14:48:16 cowtn postgres[39772]: [35297-1] 2020-03-25
14:48:16.537 GMT [39772] LOG:  automatic analyze of table
"feedi.production.flightplans" system usage: CPU: user: 0.83 s,
system: 0.03 s, elapsed: 0.93 s
Mar 25 14:48:16 cowtn postgres[39772]: [35298-1] 2020-03-25
14:48:16.627 GMT [39772] LOG:  automatic vacuum of table
"feedi.production.tita": index scans: 1
Mar 25 14:48:16 cowtn postgres[39772]: [35298-2] #011pages: 0 removed,
1711 remain, 0 skipped due to pins, 0 skipped frozen
Mar 25 14:48:16 cowtn postgres[39772]: [35298-3] #011tuples: 5936
removed, 75280 remain, 1079 are dead but not yet removable, oldest
xmin: 189591147
Mar 25 14:48:16 cowtn postgres[39772]: [35298-4] #011buffer usage:
5748 hits, 0 misses, 1 dirtied
Mar 25 14:48:16 cowtn postgres[39772]: [35298-5] #011avg read rate:
0.000 MB/s, avg write rate: 0.100 MB/s
Mar 25 14:48:16 cowtn postgres[39772]: [35298-6] #011system usage:
CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s
Mar 25 14:48:16 cowtn postgres[39772]: [35299-1] 2020-03-25
14:48:16.959 GMT [39772] LOG:  automatic analyze of table
"feedi.production.tita" system usage: CPU: user: 0.32 s, system: 0.00
s, elapsed: 0.33 s
Mar 25 14:48:16 cowtn postgres[39772]: [35300-1] 2020-03-25
14:48:16.969 GMT [39772] LOG:  automatic analyze of table
"feedi.production.virtual_clocks" system usage: CPU: user: 0.00 s,
system: 0.00 s, elapsed: 0.00 s
Mar 25 14:48:18 cowtn postgres[39790]: [35294-1] 2020-03-25
14:48:18.975 GMT [39790] DEBUG:  autovacuum: processing database
"template0"
Mar 25 14:48:21 cowtn postgres[39799]: [35294-1] 2020-03-25
14:48:21.546 GMT [39799] DEBUG:  autovacuum: processing database
"postgres"
Mar 25 14:48:22 cowtn postgres[39853]: [35294-1] 2020-03-25
14:48:22.446 GMT [39853] LOG:  connection received: host=10.4.4.11
port=25424
Mar 25 14:48:22 cowtn postgres[39853]: [35295-1] 2020-03-25
14:48:22.451 GMT [39853] LOG:  connection authorized: user=feedi
database=feedi SSL enabled (protocol=TLSv1.3,
cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
Mar 25 14:48:26 cowtn postgres[39875]: [35294-1] 2020-03-25
14:48:26.277 GMT [39875] DEBUG:  autovacuum: processing database
"postgres"
Mar 25 14:48:26 cowtn postgres[39875]: [35295-1] 2020-03-25
14:48:26.298 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_authid"
Mar 25 14:48:26 cowtn postgres[39875]: [35296-1] 2020-03-25
14:48:26.309 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_subscription"
Mar 25 14:48:26 cowtn postgres[39875]: [35297-1] 2020-03-25
14:48:26.319 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_database"
Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_tablespace"
Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_auth_members"
Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_replication_origin"
Mar 25 14:48:26 cowtn postgres[39875]: [35301-1] 2020-03-25
14:48:26.360 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_shseclabel"
Mar 25 14:48:26 cowtn postgres[39875]: [35302-1] 2020-03-25
14:48:26.361 GMT [39875] DEBUG:  transaction ID wrap limit is
2137086523, limited by database with OID 13432
Mar 25 14:48:26 cowtn postgres[39875]: [35303-1] 2020-03-25
14:48:26.361 GMT [39875] DEBUG:  MultiXactId wrap limit is 2147483648,
limited by database with OID 13432
Mar 25 14:48:26 cowtn postgres[39875]: [35304-1] 2020-03-25
14:48:26.361 GMT [39875] DEBUG:  MultiXact member stop limit is now
4294914944 based on MultiXact 1

postgres=# SELECT datname
    , oid
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |   oid    |    age    | current_setting
-----------+----------+-----------+-----------------
 postgres  |    13432 | 202588645 | 200000000
 template1 |        1 | 152588645 | 200000000
 template0 |    13431 |  87271781 | 200000000
 feedi     | 32178861 |  87271781 | 200000000

Here is the output from the "What is" queries referenced:

postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
 datname  |    age    | datfrozenxid
----------+-----------+--------------
 postgres | 202773709 |   4284570172


 postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
1;
    oid    |    age    | relfrozenxid
-----------+-----------+--------------
 pg_authid | 202793549 |   4284570172


 postgres=# SELECT pid, backend_xmin, age(backend_xmin) FROM
pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)
DESC LIMIT 3;
  pid  | backend_xmin | age
-------+--------------+------
 17032 |    192412343 | 7199
 17033 |    192412343 | 7199
 52320 |    192419542 |    0
(3 rows)


postgres=# SELECT pid, backend_xmin, age(backend_xmin) FROM
pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)
DESC LIMIT 3;
  pid  | backend_xmin | age
-------+--------------+------
 17032 |    192434622 | 2004
 17033 |    192434622 | 2004
(2 rows)


postgres=# SELECT slot_name, xmin, age(xmin), catalog_xmin,
age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR
catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC
LIMIT 3;
 slot_name | xmin | age | catalog_xmin | age
-----------+------+-----+--------------+-----
(0 rows)


postgres=# SELECT gid, database, transaction FROM pg_prepared_xacts
ORDER BY age(transaction) LIMIT 3;
 gid | database | transaction
-----+----------+-------------
(0 rows)

Let me know if there's anything else useful I can provide.

Thanks-
Justin

On Tue, Mar 24, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > > Well, there's no logging of autovacuum launchers that don't do anything
> > > due to the "skipping redundant" logic, with normal log level. If somehow
> > > the horizon logic of autovacuum workers gets out of whack with what
> > > vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> > > triggered analyze could still happen on such a table, I think.
> >
> > What surprised me the most is that the same table happened to be
> > analyzed again and again after the launcher began its blackout.
>
> Well, if there's an issue with the "redundant" logic, that would be a
> not too surprising outcome. It's quite plausible that one or two tables
> in the database would get enough changes to occasionally need to be
> analyzed. If the workload is steady, that could e.g. work out to every
> ~17 minutes. All tables that autovacuum thinks are not wraparound
> threatened will be skipped, but ones that are will get both vacuum and
> analyze queued. The redundant logic could then entirely skip all
> vacuums - but there's no equivalent for analyze.
>
> > > While looking at this issue I found a few problems, btw. That seems more
> > > like a -hackers discussion, so I started:
> > > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
> >
> > Yes, let's discuss there.
>
> Cool. Would also be good if you could expand on the thread introducing
> the "redundant" logic.
>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

From
Michael Paquier
Date:
On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> This started happening again.  DEBUG1 is enabled:

Thanks for enabling DEBUG1 logs while this happened.

> Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> 14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_tablespace"
> Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> 14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_auth_members"
> Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> 14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_replication_origin"

Are you seeing such log entries happening multiple times for the same
relations, meaning that autovacuum workers are just looping on the
same relations all over again?  This part of the logs points to catalog
tables, but are there other tables within your system facing the same
logs, particularly the database "feedi" with some of your own tables?

>  postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> 1;
>     oid    |    age    | relfrozenxid
> -----------+-----------+--------------
>  pg_authid | 202793549 |   4284570172

Ugh.  I think that this is exactly the thing I was suspecting
previously:
- The database stats look sane.
- The relation stats don't look good and visibly are put in such a
state that only one type of jobs gets triggered (non-aggressive but
anti-wraparound), which just keep being skipped and the relation stats
don't get refreshed.  (Such autovacuum jobs should never happen and we
have some underlying issues that will need separate care).

If you still have the cluster in this current state (perhaps you are
not able to keep it longer), could you provide more data about
pg_class.relfrozenxid for the tables which are mentioned in the logs
of the type "skipping redundant vacuum to prevent of table"?

> Let me know if there's anything else useful I can provide.

Thanks!
--
Michael

Attachments

Re: PG12 autovac issues

From
Andres Freund
Date:
Hi,

On 2020-03-26 10:43:36 +0900, Michael Paquier wrote:
> On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> > Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> > 14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_tablespace"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> > 14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_auth_members"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> > 14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_replication_origin"

FWIW, this kind of thing is why I think the added skipping logic is a
bad idea. Silently skipping things like this (same with the "bogus"
logic in datfrozenxid computation) is dangerous. I think we should
seriously consider backing this change out.

And if not, then we should at least include enough detail in the message
to be able to debug this.


> >  postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> > pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> > 1;
> >     oid    |    age    | relfrozenxid
> > -----------+-----------+--------------
> >  pg_authid | 202793549 |   4284570172
> 
> Ugh.  I think that this is exactly the thing I was suspecting
> previously:
> - The database stats look sane.
> - The relation stats don't look good and visibly are put in such a
> state that only one type of jobs gets triggered (non-aggressive but
> anti-wraparound), which just keep being skipped and the relation stats
> don't get refreshed.  (Such autovacuum jobs should never happen and we
> have some underlying issues that will need separate care).

Hm. Why is this a state that is clearly invalid compared to pg_database?
Seems to precisely match

> postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
> pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
>  datname  |    age    | datfrozenxid
> ----------+-----------+--------------
>  postgres | 202773709 |   4284570172

And why should this lead to anti-wraparound vacuums not happening? This
is older than the cutoff age?

xid 4284570172 having the age of 202 million xids suggests that
ReadNewTransactionId() is approx 192376585. Which comports with the log
saying: oldest xmin: 189591147.


Or are you saying that you conclude that the relcache entry is somehow
out of date? It sure is interesting that all of the tables that hit the
"skipping redundant vacuum" condition are shared tables.

Greetings,

Andres Freund



Re: PG12 autovac issues

From
Michael Paquier
Date:
On Wed, Mar 25, 2020 at 07:59:56PM -0700, Andres Freund wrote:
> FWIW, this kind of thing is why I think the added skipping logic is a
> bad idea. Silently skipping things like this (same with the "bogus"
> logic in datfrozenxid computation) is dangerous. I think we should
> seriously consider backing this change out.

That's actually what I would like to do at this stage as a first
step.  It looks pretty clear that it does not help.

> And if not, then we should at least include enough detail in the message
> to be able to debug this.

Sure.  In every attempt so far I was easily able to get some jobs
skipped, but it should get easier with a higher number of concurrent
workers and a higher number of heavily-updated relations.  Thinking
about it, only catalog jobs were getting skipped in my own runs...

>> postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
>> pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
>>  datname  |    age    | datfrozenxid
>> ----------+-----------+--------------
>>  postgres | 202773709 |   4284570172
>
> And why should this lead to anti-wraparound vacuums not happening? This
> is older than the cutoff age?
>
> xid 4284570172 having the age of 202 million xids suggests that
> ReadNewTransactionId() is approx 192376585. Which comports with the log
> saying: oldest xmin: 189591147.

Oops, sorry.  My previous email was incorrect.  It looked strange to
not see datfrozenxid being refreshed.

> Or are you saying that you conclude that the relcache entry is somehow
> out of date? It sure is interesting that all of the tables that hit the
> "skipping redundant vacuum" condition are shared tables.

Yeah, that's actually what I was thinking yesterday.  In
heap_vacuum_rel(), xidFullScanLimit may be calculated right, but an
incorrect value of rd_rel->relminmxid or rd_rel->relfrozenxid could
lead to a job becoming non-aggressive.  It should actually be easy
enough to check that.
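
One way to cross-check from SQL, sketched here as an illustration (the
relcache copy itself is not visible from SQL, so if pg_class says a
relation is past the cutoff while the worker still treats the job as
non-aggressive, a stale relcache entry is the suspect; the effective
full-scan cutoff is roughly the smaller of vacuum_freeze_table_age and
0.95 * autovacuum_freeze_max_age):

SELECT c.oid::regclass AS relation,
       age(c.relfrozenxid) AS xid_age,
       least(current_setting('vacuum_freeze_table_age')::int,
             (0.95 * current_setting('autovacuum_freeze_max_age')::int)::int)
         AS approx_full_scan_cutoff_age
FROM pg_class c
WHERE c.relisshared AND c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC;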
--
Michael

Attachments

Re: PG12 autovac issues

From
Justin King
Date:
On Wed, Mar 25, 2020 at 8:43 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> > This started happening again.  DEBUG1 is enabled:
>
> Thanks for enabling DEBUG1 logs while this happened.
>
> > Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> > 14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_tablespace"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> > 14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_auth_members"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> > 14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_replication_origin"
>
> Are you seeing such log entries happening multiple times for the same
> relations, meaning that autovacuum workers are just looping on the
> same relations all over again?  This part of the logs points to catalog
> tables, but are there other tables within your system facing the same
> logs, particularly the database "feedi" with some of your own tables?

Nope, it was just these tables that were looping over and over while
nothing else was getting autovac'd.  I'm happy to share the full log
if you'd like.

>
> >  postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> > pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> > 1;
> >     oid    |    age    | relfrozenxid
> > -----------+-----------+--------------
> >  pg_authid | 202793549 |   4284570172
>
> Ugh.  I think that this is exactly the thing I was suspecting
> previously:
> - The database stats look sane.
> - The relation stats don't look good and visibly are put in such a
> state that only one type of jobs gets triggered (non-aggressive but
> anti-wraparound), which just keep being skipped and the relation stats
> don't get refreshed.  (Such autovacuum jobs should never happen and we
> have some underlying issues that will need separate care).
>
> If you still have the cluster in this current state (perhaps you are
> not able to keep it longer), could you provide more data about
> pg_class.relfrozenxid for the tables which are mentioned in the logs
> of the type "skipping redundant vacuum to prevent wraparound of table"?

I did have to remove it from this state, but I can undo my workaround
and, undoubtedly, it'll end up back there.  Let me know if there's
something specific you'd like me to provide when it happens!

>
> > Let me know if there's anything else useful I can provide.
>
> Thanks!
> --
> Michael



Re: PG12 autovac issues

From
Michael Paquier
Date:
On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote:
> Nope, it was just these tables that were looping over and over while
> nothing else was getting autovac'd.  I'm happy to share the full log
> if you'd like.

Thanks, that could help.  If that's very large, it could be a problem
to send that to the lists, but you could send me directly a link to
it and I'll try to extract more information for the lists.  While
testing for reproducing the issue, I have noticed that basically one
set of catalog tables happened to see this "skipping redundant" log.
And I am wondering if we have a match with the set of catalog tables
looping.

> I did have to remove it from this state, but I can undo my workaround
> and, undoubtedly, it'll end up back there.  Let me know if there's
> something specific you'd like me to provide when it happens!

For now I think it's fine.  Note that Julien and I have an environment
where the issue can be reproduced easily (it takes roughly 12 hours
until the wraparound cutoffs are reached with the benchmark and
settings used), and we are checking things using a patched instance
with 2aa6e33 reverted.  I think that we are accumulating enough
evidence that this change was not a good idea anyway thanks to the
information you sent, so likely we'll finish first by a revert of
2aa6e33 from the master and REL_12_STABLE branches, before looking at
the issues with the catalogs for those anti-wraparound and
non-aggressive jobs (this looks like a relcache issue with the so-said
catalogs).
--
Michael

Attachments

Re: PG12 autovac issues

From
Julien Rouhaud
Date:
On Fri, Mar 27, 2020 at 02:12:04PM +0900, Michael Paquier wrote:
> On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote:
> > Nope, it was just these tables that were looping over and over while
> > nothing else was getting autovac'd.  I'm happy to share the full log
> > if you'd like.
> 
> Thanks, that could help.  If that's very large, it could be a problem
> to send that to the lists, but you could send me directly a link to
> it and I'll try to extract more information for the lists.  While
> testing for reproducing the issue, I have noticed that basically one
> set of catalog tables happened to see this "skipping redundant" log. 
> And I am wondering if we have a match with the set of catalog tables
> looping.
> 
> > I did have to remove it from this state, but I can undo my workaround
> > and, undoubtedly, it'll end up back there.  Let me know if there's
> > something specific you'd like me to provide when it happens!
> 
> For now I think it's fine.  Note that Julien and I have an environment
> where the issue can be reproduced easily (it takes roughly 12 hours
> until the wraparound cutoffs are reached with the benchmark and
> settings used), and we are checking things using a patched instance
> with 2aa6e33 reverted.  I think that we are accumulating enough
> evidence that this change was not a good idea anyway thanks to the
> information you sent, so likely we'll finish first by a revert of
> 2aa6e33 from the master and REL_12_STABLE branches, before looking at
> the issues with the catalogs for those anti-wraparound and
> non-aggressive jobs (this looks like a relcache issue with the so-said
> catalogs).

FTR we reached the 200M transaction threshold earlier, and I can see multiple logs
of the form "automatic vacuum to prevent wraparound", i.e. non-aggressive
anti-wraparound autovacuum, all on shared relations.

As those vacuums weren't skipped, autovacuum didn't get stuck in a loop on them
and continued its work normally.  This happened ~4h ago and didn't occur again,
even though the 200M threshold was reached again multiple times.



Re: PG12 autovac issues

From
Justin King
Date:
On Fri, Mar 27, 2020 at 12:12 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote:
> > Nope, it was just these tables that were looping over and over while
> > nothing else was getting autovac'd.  I'm happy to share the full log
> > if you'd like.
>
> Thanks, that could help.  If that's very large, it could be a problem
> to send that to the lists, but you could send me directly a link to
> it and I'll try to extract more information for the lists.  While
> testing for reproducing the issue, I have noticed that basically one
> set of catalog tables happened to see this "skipping redundant" log.
> And I am wondering if we have a match with the set of catalog tables
> looping.

Sounds great.  I will email you directly with a link!

> > I did have to remove it from this state, but I can undo my workaround
> > and, undoubtedly, it'll end up back there.  Let me know if there's
> > something specific you'd like me to provide when it happens!
>
> For now I think it's fine.  Note that Julien and I have an environment
> where the issue can be reproduced easily (it takes roughly 12 hours
> until the wraparound cutoffs are reached with the benchmark and
> settings used), and we are checking things using a patched instance
> with 2aa6e33 reverted.  I think that we are accumulating enough
> evidence that this change was not a good idea anyway thanks to the
> information you sent, so likely we'll finish first by a revert of
> 2aa6e33 from the master and REL_12_STABLE branches, before looking at
> the issues with the catalogs for those anti-wraparound and
> non-aggressive jobs (this looks like a relcache issue with the so-said
> catalogs).

This is encouraging. As I mentioned, we have a workaround in place for
the moment, but don't hesitate if you need anything else from me.
Thanks for jumping in on the thread, it was nice validation to know
that I wasn't the only one seeing the issue!

> --
> Michael



Re: PG12 autovac issues

From
Michael Paquier
Date:
On Fri, Mar 27, 2020 at 05:10:03PM -0500, Justin King wrote:
> Sounds great.  I will email you directly with a link!

Thanks.  From the logs, the infinite loop in which the autovacuum jobs are
stuck is clear.  We see repeated anti-wraparound and non-aggressive jobs
happening for 7 shared relations every 5 seconds or so, and autovacuum is
not able to do anything else:
pg_auth_members
pg_authid
pg_database
pg_replication_origin
pg_shseclabel
pg_subscription
pg_tablespace

However I have found a correlation between all the tests I have done
on my own laptop, all the tests Julien and I have done in our internal
company buildfarm, and your logs.  We have 10 shared catalogs in the
system, and the problem does not happen with the following three, for
which I was unable to trigger an anti-wraparound, non-aggressive job:
pg_db_role_setting
pg_shdepend
pg_shdescription

Not sure yet what that is, but it is hard to believe that it is just a
coincidence.

> This is encouraging. As I mentioned, we have a workaround in place for
> the moment, but don't hesitate if you need anything else from me.
> Thanks for jumping in on the thread, it was nice validation to know
> that I wasn't the only one seeing the issue!

Yes.  We have been able to confirm that 2aa6e33 is a direct cause of
your problem.
--
Michael

Attachments

Re: PG12 autovac issues

From
Michael Paquier
Date:
On Fri, Mar 27, 2020 at 08:23:03PM +0100, Julien Rouhaud wrote:
> FTR we reached the 200M transaction threshold earlier, and I can see multiple logs
> of the form "automatic vacuum to prevent wraparound", i.e. non-aggressive
> anti-wraparound autovacuum, all on shared relations.

Thanks Julien for sharing the information with the community.  Note
that three shared relations did not run such jobs though.

> As those vacuums weren't skipped, autovacuum didn't get stuck in a loop on them
> and continued its work normally.  This happened ~4h ago and didn't occur again,
> even though the 200M threshold was reached again multiple times.

As we have confirmed that 2aa6e33 is actually the origin of the
problem, I am planning to revert it at the beginning of next week, and
then dig more into the origin of those anti-wraparound and
non-aggressive jobs happening for a portion of the shared relations.
Andres has started a thread about the problem on -hackers here:
https://www.postgresql.org/message-id/20200323235036.6pje6usrjjx22zv3@alap3.anarazel.de

And I'll follow up there with anything new I find.  Please let me know
if there are any objections to the revert though, as this will address
the problem reported by Justin.
--
Michael

Attachments

Re: PG12 autovac issues

From
Andres Freund
Date:
Hi,

On 2020-03-28 17:47:19 +0900, Michael Paquier wrote:
> On Fri, Mar 27, 2020 at 05:10:03PM -0500, Justin King wrote:
> > This is encouraging. As I mentioned, we have a workaround in place for
> > the moment, but don't hesitate if you need anything else from me.
> > Thanks for jumping in on the thread, it was nice validation to know
> > that I wasn't the only one seeing the issue!
> 
> Yes.  We have been able to confirm that 2aa6e33 is a direct cause of
> your problem.

I assume you're still trying to track the actual cause of the problem
further?

Greetings,

Andres Freund



Re: PG12 autovac issues

From
Michael Paquier
Date:
On Sat, Mar 28, 2020 at 11:29:41AM -0700, Andres Freund wrote:
> I assume you're still trying to track the actual cause of the problem
> further?

That's the plan, and I'll try to spend some time on it next week.  Any
new information I have will be added to the thread you have begun on
-hackers a couple of days ago.
--
Michael

Attachments

Re: PG12 autovac issues

From
Michael Paquier
Date:
On Sat, Mar 28, 2020 at 05:53:59PM +0900, Michael Paquier wrote:
> And I'll follow up there with anything new I find.  Please let me know
> if there are any objections to the revert though, as this will address
> the problem reported by Justin.

Okay.  Done with this part now as of dd9ac7d.  Now for the older
issue..
--
Michael

Attachments