Обсуждение: Track skipped tables during autovacuum and autoanalyze

Поиск
Список
Период
Сортировка

Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
Hi,

I would like to propose adding new fields to pg_stat_all_tables to track
tables skipped during autovacuum and autoanalyze.

Currently, when autovacuum or autoanalyze is skipped because a lock cannot
be acquired, this information is only emitted in log messages. However, it
would be more useful if users could access this information via a system view,
alongside related fields such as last_autovacuum, on a per-table basis.

The attached patch add the following fields to pg_stat_all_tables:
 - last_skipped_autovacuum
 - last_skipped_autoanalyze
 - skipped_autovacuum_count
 - skipped_autoanalyze_count

Are there any concerns about exposing this in pg_stat_all_tables, or suggestions
for a better approach?

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Sami Imseih
Дата:
Hi,

Thanks for the patch!

> The attached patch add the following fields to pg_stat_all_tables:
>  - last_skipped_autovacuum
>  - last_skipped_autoanalyze
>  - skipped_autovacuum_count
>  - skipped_autoanalyze_count
>
> Are there any concerns about exposing this in pg_stat_all_tables, or suggestions
> for a better approach?

I am not sure about the timestamp columns. I am not saying they will
not be useful,
but I think it will be better to just start with counters for this.
The way the views get
used, a dashboard built for tracking the deltas of the counters can easily spot
when there is a spike of skipped autovacuum/autoanalyze count.
Also, for tables that are being autovacuumed and skipped quickly,
the timestamps will just be overwritten.

So, I am +1 on the counters, -1 on the timestamps.

Out of scope for this patch, but I also wonder if we should add another counter,
autovacuum_started_count. If there are other types of failure scenarios such as
corrupt indexes, checksum failures, etc. which terminate the
autovacuum in flight,
we would be able to catch this by looking at the number of autovacuums
started vs completed. The skipped counters in this patch and a started
counter would capture different stages of the autovacuum lifecycle;
skipped means
"never started" (lock contention), while a started-minus-completed delta means
"started but failed." Both are useful signals, but for different reasons.

In terms of the patch:

1/

+                       if (AmAutoVacuumWorkerProcess())
+                               pgstat_report_skipped_vacuum(relid);

Any reason why this should not also include manual vacuum/analyze?
If someone has a vacuum/analyze script that uses SKIP_LOCKED, and
the operation gets skipped, this should be included in the counter.
this can be done with separate counter fields for autovacuum/autoanalyze and
vacuum/analyze

2/

+            pg_stat_get_skipped_autovacuum_count(C.oid) AS
skipped_autovacuum_count,

How about a name like "autovacuum_lock_skip_count"?


--
Sami Imseih
Amazon Web Services (AWS)



Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
On Tue, 24 Mar 2026 09:58:48 -0500
Sami Imseih <samimseih@gmail.com> wrote:

Thank you for your comments!

> > The attached patch add the following fields to pg_stat_all_tables:
> >  - last_skipped_autovacuum
> >  - last_skipped_autoanalyze
> >  - skipped_autovacuum_count
> >  - skipped_autoanalyze_count
> >
> > Are there any concerns about exposing this in pg_stat_all_tables, or suggestions
> > for a better approach?
> 
> I am not sure about the timestamp columns. I am not saying they will
> not be useful,
> but I think it will be better to just start with counters for this.
> The way the views get
> used, a dashboard built for tracking the deltas of the counters can easily spot
> when there is a spike of skipped autovacuum/autoanalyze count.
> Also, for tables that are being autovacuumed and skipped quickly,
> the timestamps will just be overwritten.
> 
> So, I am +1 on the counters, -1 on the timestamps.

Although the timestamps are overwritten on each skipped autovacuum or
autoanalyze, they still indicate when the last attempt was made. This
can help users confirm that autovacuum is actively attempting to run,
and that the issue is due to repeated skips rather than inactivity.

While counters can indicate overall activity, they do not reveal when
the last skip occurred. With timestamps, users can immediately see the
most recent attempt, even without a separate dashboard or historical
tracking.

Therefore, counters are useful for monitoring overall activity, but
timestamps give additional, complementary information, so it seems
worthwhile to include them too.
 
> Out of scope for this patch, but I also wonder if we should add another counter,
> autovacuum_started_count. If there are other types of failure scenarios such as
> corrupt indexes, checksum failures, etc. which terminate the
> autovacuum in flight,
> we would be able to catch this by looking at the number of autovacuums
> started vs completed. The skipped counters in this patch and a started
> counter would capture different stages of the autovacuum lifecycle;
> skipped means
> "never started" (lock contention), while a started-minus-completed delta means
> "started but failed." Both are useful signals, but for different reasons.

That makes sense. I was considering adding a counter to track
"canceled" autovacuum or autoanalyze, but tracking the number of started
or attempted autovacuums might provide a more general information than counting
only the failed ones.

> In terms of the patch:
> 
> 1/
> 
> +                       if (AmAutoVacuumWorkerProcess())
> +                               pgstat_report_skipped_vacuum(relid);
> 
> Any reason why this should not also include manual vacuum/analyze?
> If someone has a vacuum/analyze script that uses SKIP_LOCKED, and
> the operation gets skipped, this should be included in the counter.
> this can be done with separate counter fields for autovacuum/autoanalyze and
> vacuum/analyze

For manual vacuum/analyze, an explicit WARNING is output when the
operation is skipped, so I initially thought that reporting it in the
stats view was not necessary. However, I now agree that it should be
included.

> 2/
> 
> +            pg_stat_get_skipped_autovacuum_count(C.oid) AS
> skipped_autovacuum_count,
> 
> How about a name like "autovacuum_lock_skip_count"?

I'm not sure this is a good fit, since it may be better to keep the
naming consistent with other fields whose names end with
"...vacuum_count" or "...analyze_count". Happy to hear other opinions.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Track skipped tables during autovacuum and autoanalyze

От
Michael Paquier
Дата:
On Wed, Mar 25, 2026 at 01:28:47AM +0900, Yugo Nagata wrote:
> Although the timestamps are overwritten on each skipped autovacuum or
> autoanalyze, they still indicate when the last attempt was made. This
> can help users confirm that autovacuum is actively attempting to run,
> and that the issue is due to repeated skips rather than inactivity.
>
> While counters can indicate overall activity, they do not reveal when
> the last skip occurred. With timestamps, users can immediately see the
> most recent attempt, even without a separate dashboard or historical
> tracking.
>
> Therefore, counters are useful for monitoring overall activity, but
> timestamps give additional, complementary information, so it seems
> worthwhile to include them too.

Hmm..  I can buy this argument for the timestamps, especially for
database with many relations of various sizes that could take a
various amount of time to process.  The timestamps could offer hints
about the time it takes between the skips, even if snapshots of the
stats data are not taken at a very aggressive frequency.

This is v20 material at this stage, of course..
--
Michael

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
On Wed, 25 Mar 2026 11:07:08 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Wed, Mar 25, 2026 at 01:28:47AM +0900, Yugo Nagata wrote:
> > Although the timestamps are overwritten on each skipped autovacuum or
> > autoanalyze, they still indicate when the last attempt was made. This
> > can help users confirm that autovacuum is actively attempting to run,
> > and that the issue is due to repeated skips rather than inactivity.
> > 
> > While counters can indicate overall activity, they do not reveal when
> > the last skip occurred. With timestamps, users can immediately see the
> > most recent attempt, even without a separate dashboard or historical
> > tracking.
> > 
> > Therefore, counters are useful for monitoring overall activity, but
> > timestamps give additional, complementary information, so it seems
> > worthwhile to include them too.
> 
> Hmm..  I can buy this argument for the timestamps, especially for
> database with many relations of various sizes that could take a
> various amount of time to process.  The timestamps could offer hints
> about the time it takes between the skips, even if snapshots of the
> stats data are not taken at a very aggressive frequency.
> 
> This is v20 material at this stage, of course..

Thank you for your comments.
Yes, I understand that it's too late for this to be included in v19.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Track skipped tables during autovacuum and autoanalyze

От
Sami Imseih
Дата:
> > On Wed, Mar 25, 2026 at 01:28:47AM +0900, Yugo Nagata wrote:
> > > Although the timestamps are overwritten on each skipped autovacuum or
> > > autoanalyze, they still indicate when the last attempt was made. This
> > > can help users confirm that autovacuum is actively attempting to run,
> > > and that the issue is due to repeated skips rather than inactivity.
> > >
> > > While counters can indicate overall activity, they do not reveal when
> > > the last skip occurred. With timestamps, users can immediately see the
> > > most recent attempt, even without a separate dashboard or historical
> > > tracking.
> > >
> > > Therefore, counters are useful for monitoring overall activity, but
> > > timestamps give additional, complementary information, so it seems
> > > worthwhile to include them too.
> >
> > Hmm..  I can buy this argument for the timestamps, especially for
> > database with many relations of various sizes that could take a
> > various amount of time to process.  The timestamps could offer hints
> > about the time it takes between the skips, even if snapshots of the
> > stats data are not taken at a very aggressive frequency.

I'm fine with adding timestamps, as there seem to be convincing
reasons to add them.
My other concern is bloat of the pg_stat_all_tables view. This patch
adds 4 columns, or
8 if we also include manual vacuum and analyze (which I think we should).

Given that, should we also start thinking about splitting the vacuum
activity related
columns into a dedicated view and out of pg_stat_all_tables for v20?

--
Sami Imseih
Amazon Web Services (AWS)



Re: Track skipped tables during autovacuum and autoanalyze

От
Michael Paquier
Дата:
On Wed, Mar 25, 2026 at 12:12:35PM -0500, Sami Imseih wrote:
> I'm fine with adding timestamps, as there seem to be convincing
> reasons to add them.
> My other concern is bloat of the pg_stat_all_tables view. This patch
> adds 4 columns, or
> 8 if we also include manual vacuum and analyze (which I think we should).
>
> Given that, should we also start thinking about splitting the vacuum
> activity related
> columns into a dedicated view and out of pg_stat_all_tables for v20?

PgStat_StatTabEntry is shared between indexes and tables.  A bunch of
its fields apply only to tables, not indexes (aka all the vacuum and
analyze ones).  Few fields apply only to indexes, not tables.  Not
that many are shared between both.  I would advocate for a clean split
between indexes and tables, as a start, with a new variable-sized
stats kind dedicated to indexes.
--
Michael

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
On Thu, 26 Mar 2026 08:26:26 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Wed, Mar 25, 2026 at 12:12:35PM -0500, Sami Imseih wrote:
> > I'm fine with adding timestamps, as there seem to be convincing
> > reasons to add them.
> > My other concern is bloat of the pg_stat_all_tables view. This patch
> > adds 4 columns, or
> > 8 if we also include manual vacuum and analyze (which I think we should).
> > 
> > Given that, should we also start thinking about splitting the vacuum
> > activity related
> > columns into a dedicated view and out of pg_stat_all_tables for v20?
> 
> PgStat_StatTabEntry is shared between indexes and tables.  A bunch of
> its fields apply only to tables, not indexes (aka all the vacuum and
> analyze ones).  Few fields apply only to indexes, not tables.  Not
> that many are shared between both.  I would advocate for a clean split
> between indexes and tables, as a start, with a new variable-sized
> stats kind dedicated to indexes.

I see that the following fields are used in pg_stat_all_indexes:

- numscans
- lastscan
- tuples_returned
- tuples_fetched
- stat_reset_time

but they are also shared with pg_stat_all_tables. Are you suggesting
splitting these shared fields from those that are specific to tables?

I'm not sure this would significantly reduce the size of
PgStat_StatTabEntry. Could you elaborate on the expected benefits?

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Track skipped tables during autovacuum and autoanalyze

От
Michael Paquier
Дата:
On Thu, Mar 26, 2026 at 10:18:39AM +0900, Yugo Nagata wrote:
> I'm not sure this would significantly reduce the size of
> PgStat_StatTabEntry. Could you elaborate on the expected benefits?

The point is that this reduces the shmem footprint for indexes (well,
it's also benefitial for tables, just less), on top of being cleaner
because the stats views would only need to store and query the fields
they care about for each relkind.
--
Michael

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
On Thu, 26 Mar 2026 10:31:19 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Mar 26, 2026 at 10:18:39AM +0900, Yugo Nagata wrote:
> > I'm not sure this would significantly reduce the size of
> > PgStat_StatTabEntry. Could you elaborate on the expected benefits?
> 
> The point is that this reduces the shmem footprint for indexes (well,
> it's also benefitial for tables, just less), on top of being cleaner
> because the stats views would only need to store and query the fields
> they care about for each relkind.

Thank you for the clarification. I think I understand your point now.

So, you mean introducing a separate stats kind for indexes, such as
PGSTAT_KIND_INDEX?

Regards,
Yugo Nagata

> --
> Michael


-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
> On Tue, 24 Mar 2026 09:58:48 -0500
> Sami Imseih <samimseih@gmail.com> wrote:

> > 1/
> > 
> > +                       if (AmAutoVacuumWorkerProcess())
> > +                               pgstat_report_skipped_vacuum(relid);
> > 
> > Any reason why this should not also include manual vacuum/analyze?
> > If someone has a vacuum/analyze script that uses SKIP_LOCKED, and
> > the operation gets skipped, this should be included in the counter.
> > this can be done with separate counter fields for autovacuum/autoanalyze and
> > vacuum/analyze
> 
> For manual vacuum/analyze, an explicit WARNING is output when the
> operation is skipped, so I initially thought that reporting it in the
> stats view was not necessary. However, I now agree that it should be
> included.

I've attached an updated patch to also report skipped manual
vacuum/analyze.

The pgstat reporting functions are unified into a single function,
pgstat_report_skipped_vacuum_analyze(), which handles both
auto/manual and vacuum/analyze cases. Also it is fixed to use
InvalidOid for shared relations.

To support manual vacuum/analyze, some hack were needed to
obtain the relid before the lock attempt. When SKIP_LOCKED is specified,
the relid is obtained using RangeVarGetRelid() with NoLock prior to
locking. If ConditionalLockRelationOid() then fails, the skip is
reported.

To handle the possibility that the table is dropped between
RangeVarGetRelid() and the lock attempt, SearchSysCacheExists1() is
used after acquiring the lock.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Michael Paquier
Дата:
On Thu, Mar 26, 2026 at 07:22:03PM +0900, Yugo Nagata wrote:
> To handle the possibility that the table is dropped between
> RangeVarGetRelid() and the lock attempt, SearchSysCacheExists1() is
> used after acquiring the lock.

(Noticed while skimming through my emails this morning..)

+void
+pgstat_report_skipped_vacuum_analyze(Oid relid, bool vacuum, bool analyze,
+                                     bool autovacuum)

I'd recommend to replace this interface with three booleans with a set
of three bitwise flags.  That would be less error prone for the
callers of this function, or we could finish by aggregating counters
we don't want to.
--
Michael

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
On Fri, 27 Mar 2026 08:07:29 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Mar 26, 2026 at 07:22:03PM +0900, Yugo Nagata wrote:
> > To handle the possibility that the table is dropped between
> > RangeVarGetRelid() and the lock attempt, SearchSysCacheExists1() is
> > used after acquiring the lock.
> 
> (Noticed while skimming through my emails this morning..)
> 
> +void
> +pgstat_report_skipped_vacuum_analyze(Oid relid, bool vacuum, bool analyze,
> +                                     bool autovacuum)
> 
> I'd recommend to replace this interface with three booleans with a set
> of three bitwise flags.  That would be less error prone for the
> callers of this function, or we could finish by aggregating counters
> we don't want to.

Thank you for the suggestion.
I've attached a revised patch reflecting this change, and it also includes
the documentation.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Вложения

Re: Track skipped tables during autovacuum and autoanalyze

От
Sami Imseih
Дата:
> I've attached a revised patch reflecting this change, and it also includes
> the documentation.

Thanks fo the update!

I have some comments:

1/
+pgstat_report_skipped_vacuum_analyze(Oid relid, bits8 flags)

using bit8 is fine here, but I would have just used int. For this
case, it's just a matter of prefernace.

2/
+/* flags for pgstat_flush_backend() */
+#define PGSTAT_REPORT_SKIPPED_VACUUM   (1 << 0)        /* vacuum is skipped */
+#define PGSTAT_REPORT_SKIPPED_ANALYZE  (1 << 1)        /* analyze is skipped */
+#define PGSTAT_REPORT_SKIPPED_AUTOVAC  (1 << 2)        /* skipped
during autovacuum/autoanalyze */
+#define PGSTAT_REPORT_SKIPPED_ANY   (PGSTAT_REPORT_SKIPPED_VACUUM |
PGSTAT_REPORT_SKIPPED_ANALYZE)

can we just have 4 flags, SKIPPED_VACUUM, SKIPPED_ANALYZE,
SKIPPED_AUTOVACUUM, SKIPPED_AUTOANALYZE,
which can then remove the nested if/else and makes the mapping more obvious

+       if (flags & PGSTAT_REPORT_SKIPPED_AUTOVAC)
+       {
+               if (flags & PGSTAT_REPORT_SKIPPED_VACUUM)
+               {
+                       tabentry->last_skipped_autovacuum_time = ts;
+                       tabentry->skipped_autovacuum_count++;
+               }
+               if (flags & PGSTAT_REPORT_SKIPPED_ANALYZE)
+               {
+                       tabentry->last_skipped_autoanalyze_time = ts;
+                       tabentry->skipped_autoanalyze_count++;
+               }
+       }
+       else
+       {
+               if (flags & PGSTAT_REPORT_SKIPPED_VACUUM)
+               {
+                       tabentry->last_skipped_vacuum_time = ts;
+                       tabentry->skipped_vacuum_count++;
+               }
+               if (flags & PGSTAT_REPORT_SKIPPED_ANALYZE)
+               {
+                       tabentry->last_skipped_analyze_time = ts;
+                       tabentry->skipped_analyze_count++;
+               }
+       }

3/
For the sake of consistency, can we rename the fields from

skipped_vacuum_count to vacuum_skipped_count, etc. ? to be similar
to fields like vacuum_count

4/
field documentation could be a bit better to match existing phrasing

For example, the timestamp fields:

-       Last time a manual vacuum on this table was attempted but skipped due to
-       lock unavailability (not counting <command>VACUUM FULL</command>)
+       The time of the last manual vacuum on this table that was skipped
+       due to lock unavailability (not counting <command>VACUUM FULL</command>)

and the counter fields

-       Number of times vacuums on this table have been attempted but skipped
+       Number of times a manual vacuum on this table has been skipped

5/
Partitioned table asymmetry between vacuum_count and vacuum_skipped_count.

vacuum_count never increments on a the parenttable, because the parent is never
pocessed. On the other hand, if the manual VACUUM/ANALYZE is on the
parent table,
then we will skip all the children. So, we should still report the skip on the
parent table, but we should add a Notes section in the docs perhaps to
document this caveat?

6/
It would be nice to add a test for this, but this requires concurrency and I'm
not sure it's woth it.

Also, can you create a CF entry in
https://commitfest.postgresql.org/59/, please.

Thanks!

--
Sami Imseih
Amazon Web Services (AWS)



Re: Track skipped tables during autovacuum and autoanalyze

От
Yugo Nagata
Дата:
On Fri, 27 Mar 2026 11:48:27 -0500
Sami Imseih <samimseih@gmail.com> wrote:

> > I've attached a revised patch reflecting this change, and it also includes
> > the documentation.
> 
> Thanks fo the update!
> 
> I have some comments:
> 
> 1/
> +pgstat_report_skipped_vacuum_analyze(Oid relid, bits8 flags)
> 
> using bit8 is fine here, but I would have just used int. For this
> case, it's just a matter of prefernace.

That makes sense, since using int for flags seems common in other
places in the code. I'm not sure how much it affects performance,
though.

> 2/
> +/* flags for pgstat_flush_backend() */
> +#define PGSTAT_REPORT_SKIPPED_VACUUM   (1 << 0)        /* vacuum is skipped */
> +#define PGSTAT_REPORT_SKIPPED_ANALYZE  (1 << 1)        /* analyze is skipped */
> +#define PGSTAT_REPORT_SKIPPED_AUTOVAC  (1 << 2)        /* skipped
> during autovacuum/autoanalyze */
> +#define PGSTAT_REPORT_SKIPPED_ANY   (PGSTAT_REPORT_SKIPPED_VACUUM |
> PGSTAT_REPORT_SKIPPED_ANALYZE)
> 
> can we just have 4 flags, SKIPPED_VACUUM, SKIPPED_ANALYZE,
> SKIPPED_AUTOVACUUM, SKIPPED_AUTOANALYZE,
> which can then remove the nested if/else and makes the mapping more obvious

I am fine with that. In that case, the nested logic would move to the
caller side.

> 3/
> For the sake of consistency, can we rename the fields from
> 
> skipped_vacuum_count to vacuum_skipped_count, etc. ? to be similar
> to fields like vacuum_count

Hmm, I think skipped_vacuum_count is more consistent with
fields like last_vacuum and total_vacuum_time, where the modifier
comes before vacuum/analyze. What do you think about that?

> 4/
> field documentation could be a bit better to match existing phrasing
> 
> For example, the timestamp fields:
> 
> -       Last time a manual vacuum on this table was attempted but skipped due to
> -       lock unavailability (not counting <command>VACUUM FULL</command>)
> +       The time of the last manual vacuum on this table that was skipped
> +       due to lock unavailability (not counting <command>VACUUM FULL</command>)

I intended to keep consistency with the existing last_vacuum:

  Last time at which this table was manually vacuumed (not counting VACUUM FULL)

although "at which" was accidentally omitted. Your suggestion seems
simpler and more natural to me. Should we prioritize that over consistency?
 
> and the counter fields
> 
> -       Number of times vacuums on this table have been attempted but skipped
> +       Number of times a manual vacuum on this table has been skipped

The "a munual" was also accidentally omitted, so I'll fix it.

> 5/
> Partitioned table asymmetry between vacuum_count and vacuum_skipped_count.
> 
> vacuum_count never increments on a the parenttable, because the parent is never
> pocessed. On the other hand, if the manual VACUUM/ANALYZE is on the
> parent table,
> then we will skip all the children. So, we should still report the skip on the
> parent table, but we should add a Notes section in the docs perhaps to
> document this caveat?

Yeah, we cannot report skips on the children when a manual
vacuum/analyze on the parent table is skipped. (It might be possible
to obtain child information with NoLock, but that would not be safe.)

Therefore, I agree that the best we can do here is to add a note to the
documentation of last_skipped_vacuum/analyze and skipped_vacuum/analyze_count.

For example:

  When a manual vacuum or analyze on a parent table in an inheritance
  or partitioning hierarchy is skipped, the statistics are recorded
  only for the parent table, not for its children.

> 6/
> It would be nice to add a test for this, but this requires concurrency and I'm
> not sure it's woth it.

I'm not sure what meaningful tests we could add for these statistics.
I couldn't find any existing tests for fields like last_vacuum.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>