Обсуждение: BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid
BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 19430 Logged by: Alexey Ermakov Email address: alexius.work@gmail.com PostgreSQL version: 14.15 Operating system: Ubuntu 22.04 Description: On a PostgreSQL 14.15 server with about 100 databases running on Ubuntu 22.04 I observed following problem: 1) According to the DB logs (log_autovacuum_min_duration = 1s) and monitoring graphs autovacuums almost completely stopped working across all DBs at once 2) This is also confirmed by checking pg_stat_user_tables.last_autovacuum in all databases, there were no autovacuums anywhere for several days 3) Counters in pg_stat_user_tables updated normally, no anomalies there 4) pg_stat_all_tables.last_autovacuum shows one anomaly with pg_authid catalog: only that catalog was autovacuumed recently (there are not many dead rows there, looks normal). There might be other anomalies in other databases as I only checked one but I doubt it. 5) I found nothing interesting in the logs around the time when problem started (I checked for ERROR/WARNING/FATAL messages and possible DDL queries but sadly log_statement = none there, log_min_duration_statement = 1s) 6) Autovacuum launcher process appears to be running normally, periodically wakes up and consume 1-3% CPU of single core, pg_stat_activity shows the wait event as AutoVacuumMain 7) Just in case, I checked dmesg entries and the time offset graph in monitoring (to verify that time wasn't moving backwards for some reason) - no anomalies there 8) Running vacuum on one of the bloated tables (which cause performance problems) worked without issues 9) There is no long running transactions or lagging standbys with hot_standby_feedback on 10) pg_authid catalog has `age(relfrozenxid)` around 290M but autovacuum_freeze_max_age is 200M and anti-wraparound autovacuum should've been triggered 11) After running `vacuum verbose pg_authid;` nothing changed (it wasn't an aggressive vacuum) but running `vacuum (freeze, verbose) pg_authid;` fixed problem and autovacuums started working properly. Output of these commands are at the end of the report. I assume that problem might have started when age(relfrozenxid) for pg_authid approached autovacuum_freeze_max_age. A rough estimate based on the current rate of age(relfrozenxid) confirms that we might indeed did 90M transactions during the autovacuum malfunction. An interesting observation: the query `ALTER ROLE xxx WITH ;` from some permission-granting business logic triggered several autovacuums on ~half of databases at same second for some time but didn't fix the problem completely. Perhaps some cache invalidation did that? Unfortunately, 14.15 is more than 1 year old. An update to 14.22 was discussed recently but haven't been made yet. There were several fixes related to vacuum/autovacuum but they didn't seem directly related to the issue but I might be wrong. Output of vacuum commands: vacuum verbose pg_authid; INFO: vacuuming "pg_catalog.pg_authid" INFO: table "pg_authid": found 0 removable, 7 nonremovable row versions in 1 out of 57 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828262030 Skipped 0 pages due to buffer pins, 55 frozen pages. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: vacuuming "pg_toast.pg_toast_1260" INFO: table "pg_toast_1260": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828262030 Skipped 0 pages due to buffer pins, 0 frozen pages. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. VACUUM vacuum (freeze, verbose) pg_authid; INFO: aggressively vacuuming "pg_catalog.pg_authid" INFO: table "pg_authid": found 0 removable, 196 nonremovable row versions in 4 out of 57 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828278683 Skipped 0 pages due to buffer pins, 53 frozen pages. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: aggressively vacuuming "pg_toast.pg_toast_1260" INFO: table "pg_toast_1260": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828278683 Skipped 0 pages due to buffer pins, 0 frozen pages. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. VACUUM Thanks, Alexey Ermakov
On Thu, 2026-03-12 at 09:24 +0000, PG Bug reporting form wrote: > PostgreSQL version: 14.15 > Operating system: Ubuntu 22.04 > > On a PostgreSQL 14.15 server with about 100 databases running on Ubuntu > 22.04 I observed following problem: > > 1) According to the DB logs (log_autovacuum_min_duration = 1s) and > monitoring graphs autovacuums almost completely stopped working across all > DBs at once > 2) This is also confirmed by checking pg_stat_user_tables.last_autovacuum in > all databases, there were no autovacuums anywhere for several days > 3) Counters in pg_stat_user_tables updated normally, no anomalies there > 4) pg_stat_all_tables.last_autovacuum shows one anomaly with pg_authid > catalog: only that catalog was autovacuumed recently (there are not many > dead rows there, looks normal). There might be other anomalies in other > databases as I only checked one but I doubt it. > 5) I found nothing interesting in the logs around the time when problem > started (I checked for ERROR/WARNING/FATAL messages and possible DDL queries > but sadly log_statement = none there, log_min_duration_statement = 1s) > 6) Autovacuum launcher process appears to be running normally, periodically > wakes up and consume 1-3% CPU of single core, pg_stat_activity shows the > wait event as AutoVacuumMain > 7) Just in case, I checked dmesg entries and the time offset graph in > monitoring (to verify that time wasn't moving backwards for some reason) - > no anomalies there > 8) Running vacuum on one of the bloated tables (which cause performance > problems) worked without issues > 9) There is no long running transactions or lagging standbys with > hot_standby_feedback on > 10) pg_authid catalog has `age(relfrozenxid)` around 290M but > autovacuum_freeze_max_age is 200M and anti-wraparound autovacuum should've > been triggered > 11) After running `vacuum verbose pg_authid;` nothing changed (it wasn't an > aggressive vacuum) but running `vacuum (freeze, verbose) pg_authid;` fixed > problem and autovacuums started working properly. Output of these commands > are at the end of the report. > > I assume that problem might have started when age(relfrozenxid) for > pg_authid approached autovacuum_freeze_max_age. A rough estimate based on > the current rate of age(relfrozenxid) confirms that we might indeed did 90M > transactions during the autovacuum malfunction. > > An interesting observation: the query `ALTER ROLE xxx WITH ;` from some > permission-granting business logic triggered several autovacuums on ~half of > databases at same second for some time but didn't fix the problem > completely. Perhaps some cache invalidation did that? > > Unfortunately, 14.15 is more than 1 year old. An update to 14.22 was > discussed recently but haven't been made yet. There were several fixes > related to vacuum/autovacuum but they didn't seem directly related to the > issue but I might be wrong. > > Output of vacuum commands: > > vacuum verbose pg_authid; > INFO: vacuuming "pg_catalog.pg_authid" > INFO: table "pg_authid": found 0 removable, 7 nonremovable row versions in > 1 out of 57 pages > DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828262030 > Skipped 0 pages due to buffer pins, 55 frozen pages. > CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. > INFO: vacuuming "pg_toast.pg_toast_1260" > INFO: table "pg_toast_1260": found 0 removable, 0 nonremovable row versions > in 0 out of 0 pages > DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828262030 > Skipped 0 pages due to buffer pins, 0 frozen pages. > CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. > VACUUM > > vacuum (freeze, verbose) pg_authid; > INFO: aggressively vacuuming "pg_catalog.pg_authid" > INFO: table "pg_authid": found 0 removable, 196 nonremovable row versions > in 4 out of 57 pages > DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828278683 > Skipped 0 pages due to buffer pins, 53 frozen pages. > CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. > INFO: aggressively vacuuming "pg_toast.pg_toast_1260" > INFO: table "pg_toast_1260": found 0 removable, 0 nonremovable row versions > in 0 out of 0 pages > DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3828278683 > Skipped 0 pages due to buffer pins, 0 frozen pages. > CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. > VACUUM This change in 14.21 looks relevant: - Fix issues around in-place catalog updates (Noah Misch) Send a nontransactional invalidation message for an in-place update, since such an update will survive transaction rollback. Also ensure that the update is WAL-logged before other sessions can see it. These fixes primarily prevent scenarios in which relations' frozen-XID attributes become inconsistent, possibly allowing premature CLOG truncation and subsequent “could not access status of transaction” errors. Yours, Laurenz Albe