Обсуждение: Add mode column to pg_stat_progress_vacuum
Hi hackers, I would like to propose a patch that enhances the pg_stat_progress_vacuum view by adding a mode column. The patch is attached. Although it is possible to identify an anti-wraparound VACUUM through the process title (to prevent wraparound) or specific log entries, it would be significantly more convenient for monitoring purposes to have this status clearly indicated in the pg_stat_progress_vacuum view. This would enable DBAs to immediately understand the urgency of the vacuum process without needing to check separate logs or system processes. This patch introduces a mode column to provide this visibility. The possible values are: - normal: A standard, user-initiated VACUUM or a regular autovacuum run. - anti-wraparound: An autovacuum run launched specifically to prevent transaction ID wraparound. - failsafe: A vacuum that has entered failsafe mode to prevent imminent transaction ID wraparound. This will allow administrators to better understand the context and urgency of vacuum operations, which is crucial for monitoring and troubleshooting. Design Considerations: When defining the scope of the anti-wraparound mode, I considered including manual commands like VACUUM (FREEZE) or VACUUM (DISABLE_PAGE_SKIPPING). However, I decided against this to keep the meaning of the mode clear and simple. These options can be used for various purposes, and overloading the anti-wraparound mode with too many meanings could become confusing. Therefore, the current implementation limits this mode to autovacuum runs that are explicitly launched for wraparound prevention. Regarding Testing: I was able to manually verify the failsafe mode's behavior by using the existing test script at src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl. This script successfully triggered the failsafe condition and the view reported the correct mode. However, I found this test to be somewhat flaky in my environment and decided not to add it to the patch to avoid introducing a potentially unstable test into the tree. Thought? -- Best regards, Shinya Kato NTT OSS Center
Вложения
On Thu, 14 Aug 2025 at 16:13, Shinya Kato <shinya11.kato@gmail.com> wrote: > This patch introduces a mode column to provide this visibility. The > possible values are: > - normal: A standard, user-initiated VACUUM or a regular autovacuum run. > - anti-wraparound: An autovacuum run launched specifically to prevent > transaction ID wraparound. > - failsafe: A vacuum that has entered failsafe mode to prevent > imminent transaction ID wraparound. > Thought? Just a small comment: I am more used to Lazy vs Eager vacuum types. It is how we use to call them in doc and code. Maybe this wording will be better? -- Best regards, Kirill Reshke
On Thu, Aug 14, 2025 at 9:20 PM Kirill Reshke <reshkekirill@gmail.com> wrote: > > On Thu, 14 Aug 2025 at 16:13, Shinya Kato <shinya11.kato@gmail.com> wrote: > > > This patch introduces a mode column to provide this visibility. The > > possible values are: > > - normal: A standard, user-initiated VACUUM or a regular autovacuum run. > > - anti-wraparound: An autovacuum run launched specifically to prevent > > transaction ID wraparound. > > - failsafe: A vacuum that has entered failsafe mode to prevent > > imminent transaction ID wraparound. > > > Thought? > > Just a small comment: > > I am more used to Lazy vs Eager vacuum types. It is how we use to call > them in doc and code. Maybe this wording will be better? Thanks for the feedback! Are you suggesting it would be better to change "normal" to "lazy" and "anti-wraparound" to "eager"? My hesitation is that "lazy" is a term used to contrast with VACUUM FULL, and the "lazy" vs. "eager" distinction also exists for tuple freezing logic within vacuum. Reusing these terms for a different purpose could be confusing. I also find "anti-wraparound" to be a much clearer and more descriptive term than "eager". Unless there’s a strong preference otherwise, I’d like to keep “anti-wraparound” and “failsafe” as-is, and keep “normal” (or possibly “plain”/“regular” if that reads better). -- Best regards, Shinya Kato NTT OSS Center
On Fri, 15 Aug 2025 at 09:19, Shinya Kato <shinya11.kato@gmail.com> wrote: eager". > > Unless there’s a strong preference otherwise, I’d like to keep > “anti-wraparound” and “failsafe” as-is, and keep “normal” (or possibly > “plain”/“regular” if that reads better). OK. I have tested the patch a bit, and noticed that `mode` column shows `normal` for both auto-vacuum and user-initiated vacuum (via VACUUM utility statement). Is it ok? Maybe for more visibility we can display different values for these two cases? -- Best regards, Kirill Reshke
On Thu, Aug 14, 2025 at 08:12:55PM +0900, Shinya Kato wrote: > I would like to propose a patch that enhances the > pg_stat_progress_vacuum view by adding a mode column. The patch is > attached. > > Although it is possible to identify an anti-wraparound VACUUM through > the process title (to prevent wraparound) or specific log entries, it > would be significantly more convenient for monitoring purposes to have > this status clearly indicated in the pg_stat_progress_vacuum view. > This would enable DBAs to immediately understand the urgency of the > vacuum process without needing to check separate logs or system > processes. This seems generally reasonable to me. > This patch introduces a mode column to provide this visibility. The > possible values are: > - normal: A standard, user-initiated VACUUM or a regular autovacuum run. > - anti-wraparound: An autovacuum run launched specifically to prevent > transaction ID wraparound. > - failsafe: A vacuum that has entered failsafe mode to prevent > imminent transaction ID wraparound. I wonder if we should also add "aggressive". > I was able to manually verify the failsafe mode's behavior by using > the existing test script at > src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl. This script > successfully triggered the failsafe condition and the view reported > the correct mode. However, I found this test to be somewhat flaky in > my environment and decided not to add it to the patch to avoid > introducing a potentially unstable test into the tree. Perhaps there's something we can do with injection points to improve the stability of the test. -- nathan
On Tue, Oct 7, 2025 at 11:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Thu, Aug 14, 2025 at 08:12:55PM +0900, Shinya Kato wrote: > > I would like to propose a patch that enhances the > > pg_stat_progress_vacuum view by adding a mode column. The patch is > > attached. > > > > Although it is possible to identify an anti-wraparound VACUUM through > > the process title (to prevent wraparound) or specific log entries, it > > would be significantly more convenient for monitoring purposes to have > > this status clearly indicated in the pg_stat_progress_vacuum view. > > This would enable DBAs to immediately understand the urgency of the > > vacuum process without needing to check separate logs or system > > processes. > > This seems generally reasonable to me. > There is a bit of an issue that an anti-wraparound vacuum is not in and of itself urgent, especially not with our defaults, so I have a little bit of concern that this patch could be mis-leading, but that isn't exactly an argument against the merits of it. > > This patch introduces a mode column to provide this visibility. The > > possible values are: > > - normal: A standard, user-initiated VACUUM or a regular autovacuum run. > > - anti-wraparound: An autovacuum run launched specifically to prevent > > transaction ID wraparound. > > - failsafe: A vacuum that has entered failsafe mode to prevent > > imminent transaction ID wraparound. > I think we should probably split out manual vacuums, which can be run for a whole host of different reasons. I'd suggest a mode of "manual" for those, and probably "standard" for a regular autovacuum run. > I wonder if we should also add "aggressive". > I don't think so. I feel like the point of the mode is to answer "why is this vacuum running" not "how is it operating under the hood". Robert Treat https://xzilla.net
On Tue, Oct 07, 2025 at 11:50:46AM -0400, Robert Treat wrote: > On Tue, Oct 7, 2025 at 11:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I wonder if we should also add "aggressive". > > I don't think so. I feel like the point of the mode is to answer "why > is this vacuum running" not "how is it operating under the hood". To some extent, those are tied together. For example, a failsafe vacuum is an anti-wraparound vacuum that skips index vacuuming, etc. And an anti-wraparound vacuum implies an aggressive scan, but not vice versa. There's also a separate parameter (vacuum_freeze_table_age) that controls when vacuum decides to perform an aggressive scan, just like there exists a parameter for anti-wraparound vacuums (autovacuum_freeze_max_age) and failsafe vacuums (vacuum_failsafe_age). -- nathan
On Tue, Oct 7, 2025 at 9:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Oct 07, 2025 at 11:50:46AM -0400, Robert Treat wrote: > > On Tue, Oct 7, 2025 at 11:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> I wonder if we should also add "aggressive". > > > > I don't think so. I feel like the point of the mode is to answer "why > > is this vacuum running" not "how is it operating under the hood". > > To some extent, those are tied together. For example, a failsafe vacuum is > an anti-wraparound vacuum that skips index vacuuming, etc. And an > anti-wraparound vacuum implies an aggressive scan, but not vice versa. > There's also a separate parameter (vacuum_freeze_table_age) that controls > when vacuum decides to perform an aggressive scan, just like there exists a > parameter for anti-wraparound vacuums (autovacuum_freeze_max_age) and > failsafe vacuums (vacuum_failsafe_age). Right. I think we cannot display both things in one mode column. Since both manual vacuums and anti-wraparound autovacuums can enter the failsafe mode dynamically, if we show "failsafe" in the mode column, we would lose the information "why is this vacuum running". I guess we would need separate columns. For example, I guess that the column showing "how is it operating under the hood" can have three values: "normal", "aggressive" (disables VM optimization), and "failsafe" (implies aggressive vacuum and disables many things to prioritize XID freezing). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
I am bit late to this thread, but I have a comment. > This patch introduces a mode column to provide this visibility. The > possible values are: > - normal: A standard, user-initiated VACUUM or a regular autovacuum run. > - anti-wraparound: An autovacuum run launched specifically to prevent > transaction ID wraparound. > - failsafe: A vacuum that has entered failsafe mode to prevent > imminent transaction ID wraparound. The vacuum command detail can now be determined from pg_stat_activity.query by joining with pg_stat_progress_vacuum, right? I don't see why this is not sufficient, especially because it already indicates how the vacuum was triggered, and the autovacuum activity message also tells you why it was triggered. We could perhaps add "due to failsafe" to the autovacuum activity message to explicitly show that reason. -- Sami Imseih Amazon Web Services (AWS)
On Tue, Oct 07, 2025 at 10:17:38AM -0700, Masahiko Sawada wrote: > Right. I think we cannot display both things in one mode column. Since > both manual vacuums and anti-wraparound autovacuums can enter the > failsafe mode dynamically, if we show "failsafe" in the mode column, > we would lose the information "why is this vacuum running". I guess we > would need separate columns. For example, I guess that the column > showing "how is it operating under the hood" can have three values: > "normal", "aggressive" (disables VM optimization), and "failsafe" > (implies aggressive vacuum and disables many things to prioritize XID > freezing). Am I understanding correctly that your idea is to have a "reason" column that would have values like "manual", "normal autovacuum", and "autovacuum for wraparound", and a "mode" column that would have values like "normal", "agressive", and "failsafe"? I wonder if we could be even more granular for the "normal autovacuum" case and point to the reason the table was chosen. For example, was it the insert threshold, the update/delete threshold, etc.? -- nathan
On Tue, Oct 07, 2025 at 12:45:12PM -0500, Sami Imseih wrote: > The vacuum command detail can now be determined from > pg_stat_activity.query by joining with pg_stat_progress_vacuum, right? > I don't see why this is not sufficient, especially because it already > indicates how the vacuum was triggered, and the autovacuum activity > message also tells you why it was triggered. We could perhaps add "due to > failsafe" to the autovacuum activity message to explicitly show that reason. Eh, IMHO requiring users to look for a certain substring in the query field doesn't seem especially user-friendly to me. (I was going to point out that it's undocumented, too, but it is in fact documented [0].) [0] https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM -- nathan
> On Tue, Oct 07, 2025 at 12:45:12PM -0500, Sami Imseih wrote: > > The vacuum command detail can now be determined from > > pg_stat_activity.query by joining with pg_stat_progress_vacuum, right? > > I don't see why this is not sufficient, especially because it already > > indicates how the vacuum was triggered, and the autovacuum activity > > message also tells you why it was triggered. We could perhaps add "due to > > failsafe" to the autovacuum activity message to explicitly show that reason. > > Eh, IMHO requiring users to look for a certain substring in the query field > doesn't seem especially user-friendly to me. (I was going to point out > that it's undocumented, too, but it is in fact documented [0].) I am not sure if it's a bad user experience. In my experience that string is quite easy to parse. Also, It is also common for a DBA to have to reference pg_stat_activity anyhow to determine how long the vacuum been running for, wait events, etc. IMO, the progress view is the wrong place for all information that is static ( does not change from the start of the command ) and can be derived from the query string. >> Right. I think we cannot display both things in one mode column. Since >> both manual vacuums and anti-wraparound autovacuums can enter the >> failsafe mode dynamically, if we show "failsafe" in the mode column, >> we would lose the information "why is this vacuum running". I guess we >> would need separate columns. For example, I guess that the column >> showing "how is it operating under the hood" can have three values: >> "normal", "aggressive" (disables VM optimization), and "failsafe" >> (implies aggressive vacuum and disables many things to prioritize XID >> freezing). > Am I understanding correctly that your idea is to have a "reason" column > that would have values like "manual", "normal autovacuum", and "autovacuum > for wraparound", and a "mode" column that would have values like "normal", > "agressive", and "failsafe"? I wonder if we could be even more granular > for the "normal autovacuum" case and point to the reason the table was > chosen. For example, was it the insert threshold, the update/delete > threshold, etc.? ahh, it's true that failsafe can trigger while an (auto)vacuum is in progress, the check does not happen at the start, but in places like the main loop of lazy_scan_heap. Since "failsafe" can be flipped on in-flight, I can see that being a useful (bool?) field in the progress view. -- Sami
On Tue, Oct 7, 2025 at 12:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Oct 07, 2025 at 10:17:38AM -0700, Masahiko Sawada wrote: > > Right. I think we cannot display both things in one mode column. Since > > both manual vacuums and anti-wraparound autovacuums can enter the > > failsafe mode dynamically, if we show "failsafe" in the mode column, > > we would lose the information "why is this vacuum running". I guess we > > would need separate columns. For example, I guess that the column > > showing "how is it operating under the hood" can have three values: > > "normal", "aggressive" (disables VM optimization), and "failsafe" > > (implies aggressive vacuum and disables many things to prioritize XID > > freezing). > > Am I understanding correctly that your idea is to have a "reason" column > that would have values like "manual", "normal autovacuum", and "autovacuum > for wraparound", and a "mode" column that would have values like "normal", > "agressive", and "failsafe"? Right. The first column provides an insight into whether or not the running vacuum is cancellable, and the second column provides information on how vacuums are actually processing tables under the hood. Users are able to get the former information by checking pg_stat_activity too but the latter information is available only in server logs. > I wonder if we could be even more granular > for the "normal autovacuum" case and point to the reason the table was > chosen. For example, was it the insert threshold, the update/delete > threshold, etc.? Sounds like reasonable information. I guess we might want to have such information in a cumulative statistics view but do you think it's better to have it in a dynamic statistics view? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> > I wonder if we could be even more granular > > for the "normal autovacuum" case and point to the reason the table was > > chosen. For example, was it the insert threshold, the update/delete > > threshold, etc.? > > Sounds like reasonable information. I guess we might want to have such > information in a cumulative statistics view but do you think it's > better to have it in a dynamic statistics view? +1 for this information in cumulative stats, on a per table level for sure. I do think however the pg_stat_all_tables views is getting too wide and moving new relation vacuum stats to a separate stats view will be very useful. -- Sami
On Wed, Oct 8, 2025 at 12:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > I was able to manually verify the failsafe mode's behavior by using > > the existing test script at > > src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl. This script > > successfully triggered the failsafe condition and the view reported > > the correct mode. However, I found this test to be somewhat flaky in > > my environment and decided not to add it to the patch to avoid > > introducing a potentially unstable test into the tree. > > Perhaps there's something we can do with injection points to improve the > stability of the test. Thank you for the advice. You're right, I can test it with injection point. However, many other progress reporting views do not have regression tests, so I do not have to add a regression test of pg_stat_progress_report. -- Best regards, Shinya Kato NTT OSS Center
On Wed, Oct 8, 2025 at 4:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Oct 7, 2025 at 12:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > On Tue, Oct 07, 2025 at 10:17:38AM -0700, Masahiko Sawada wrote: > > > Right. I think we cannot display both things in one mode column. Since > > > both manual vacuums and anti-wraparound autovacuums can enter the > > > failsafe mode dynamically, if we show "failsafe" in the mode column, > > > we would lose the information "why is this vacuum running". I guess we > > > would need separate columns. For example, I guess that the column > > > showing "how is it operating under the hood" can have three values: > > > "normal", "aggressive" (disables VM optimization), and "failsafe" > > > (implies aggressive vacuum and disables many things to prioritize XID > > > freezing). > > > > Am I understanding correctly that your idea is to have a "reason" column > > that would have values like "manual", "normal autovacuum", and "autovacuum > > for wraparound", and a "mode" column that would have values like "normal", > > "agressive", and "failsafe"? > > Right. The first column provides an insight into whether or not the > running vacuum is cancellable, and the second column provides > information on how vacuums are actually processing tables under the > hood. Users are able to get the former information by checking > pg_stat_activity too but the latter information is available only in > server logs. Thanks for the clarification. I agree with your proposal. Separating the "reason" from the "mode" into two columns is a great idea that will provide much clearer insight for DBAs. -- Best regards, Shinya Kato NTT OSS Center
On Wed, Oct 8, 2025 at 4:40 AM Sami Imseih <samimseih@gmail.com> wrote: > > > > I wonder if we could be even more granular > > > for the "normal autovacuum" case and point to the reason the table was > > > chosen. For example, was it the insert threshold, the update/delete > > > threshold, etc.? > > > > Sounds like reasonable information. I guess we might want to have such > > information in a cumulative statistics view but do you think it's > > better to have it in a dynamic statistics view? > > +1 for this information in cumulative stats, on a per table level for sure. > I do think however the pg_stat_all_tables views is getting too wide > and moving new relation vacuum stats to a separate stats view will > be very useful. Thanks for the discussion. IIUC are you suggesting I add such a last_autovacuum_reason column to pg_stat_all_tables, which would be populated with one of the following values? - autovacuum_vacuum_threshold - autovacuum_vacuum_insert_threshold - autovacuum_freeze_max_age - autovacuum_multixact_freeze_max_age (For consistency, I should probably add a last_autoanalyze_reason column to pg_stat_all_tables as well.) -- Best regards, Shinya Kato NTT OSS Center
> > > > I wonder if we could be even more granular > > > > for the "normal autovacuum" case and point to the reason the table was > > > > chosen. For example, was it the insert threshold, the update/delete > > > > threshold, etc.? > > > > > > Sounds like reasonable information. I guess we might want to have such > > > information in a cumulative statistics view but do you think it's > > > better to have it in a dynamic statistics view? > > > > +1 for this information in cumulative stats, on a per table level for sure. > > I do think however the pg_stat_all_tables views is getting too wide > > and moving new relation vacuum stats to a separate stats view will > > be very useful. > > Thanks for the discussion. > > IIUC are you suggesting I add such a last_autovacuum_reason column to > pg_stat_all_tables, which would be populated with one of the following > values? > - autovacuum_vacuum_threshold > - autovacuum_vacuum_insert_threshold > - autovacuum_freeze_max_age > - autovacuum_multixact_freeze_max_age This should be a separate discussion. But, I would think the counters will be n_aggressive, n_wraparound and n_failsafe. We need to separate aggressive and wraparound due to what is mentioned in vacuumlazy.c /* * While it's possible for a VACUUM to be both is_wraparound * and !aggressive, that's just a corner-case -- is_wraparound * implies aggressive. Produce distinct output for the corner * case all the same, just in case. */ A normal vacuum is the difference of autovacuum_count and the total of these counters. -- Sami Imseih Amazon Web Services (AWS)