Discussion: another autovacuum scheduling thread
/me dons flame-proof suit
My goal with this thread is to produce some incremental autovacuum
scheduling improvements for v19, but realistically speaking, I know that
it's a bit of a long-shot. There have been many discussions over the
years, and I've read through a few of them [0] [1] [2] [3] [4], but there
are certainly others I haven't found. Since this seems to be a contentious
topic, I figured I'd start small to see if we can get _something_
committed.
While I am by no means wedded to a specific idea, my current concrete
proposal (proof-of-concept patch attached) is to start by ordering the
tables a worker will process by (M)XID age. Here are the reasons:
* We already do some prioritization at the database level for databases at
risk of wraparound, per the following comment from autovacuum.c:
* Choose a database to connect to. We pick the database that was least
* recently auto-vacuumed, or one that needs vacuuming to prevent Xid
* wraparound-related data loss. If any db at risk of Xid wraparound is
* found, we pick the one with oldest datfrozenxid, independently of
* autovacuum times; similarly we pick the one with the oldest datminmxid
* if any is in MultiXactId wraparound. Note that those in Xid wraparound
* danger are given more priority than those in multi wraparound danger.
However, we do no such prioritization of the tables within a database. In
fact, the ordering of the tables is effectively random. IMHO this gives us
quite a bit of wiggle room to experiment; since we are processing tables in
no specific order today, changing the order to something vacuuming-related
seems more likely to help than it is to harm.
* Prioritizing tables based on their (M)XID age might help avoid more
aggressive vacuums, not to mention wraparound. Of course, there are
scenarios where this doesn't work. For example, the age of a table may
have changed greatly between the time we recorded it and the time we
process it. Or maybe there is another table in a different database that
is more important from a wraparound perspective. We could complicate the
patch to try to handle some of these things, but I maintain that even some
basic, incremental scheduling improvements would be better than the status
quo. And we can always change it further in the future to handle these
problems and to consider other things like bloat.
The attached patch works by storing the maximum of the XID age and the MXID
age in the list with the OIDs and sorting it prior to processing.
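For illustration only (this is not the attached patch, and the OIDs and
ages below are made up), a minimal standalone sketch of that ordering
could look like this:

/* Minimal standalone sketch: sort a table list by Max(XID age, MXID age),
 * oldest first.  OIDs and ages are made up for illustration. */
#include <stdio.h>
#include <stdlib.h>

typedef struct TableEntry
{
	unsigned int oid;		/* hypothetical table OID */
	unsigned int max_age;	/* Max(XID age, MXID age) at list-build time */
} TableEntry;

static int
compare_age_desc(const void *a, const void *b)
{
	const TableEntry *ta = (const TableEntry *) a;
	const TableEntry *tb = (const TableEntry *) b;

	if (ta->max_age > tb->max_age)
		return -1;
	if (ta->max_age < tb->max_age)
		return 1;
	return 0;
}

int
main(void)
{
	TableEntry	tables[] = {
		{16384, 150000000},
		{16390, 900000000},
		{16402, 20000000},
	};
	int			ntables = (int) (sizeof(tables) / sizeof(tables[0]));

	qsort(tables, ntables, sizeof(TableEntry), compare_age_desc);

	/* Prints the entries oldest-first, i.e. the order a worker would
	 * walk the list. */
	for (int i = 0; i < ntables; i++)
		printf("oid=%u max_age=%u\n", tables[i].oid, tables[i].max_age);
	return 0;
}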
Thoughts?
[0] https://postgr.es/m/CA%2BTgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM%2B%2BJPwK%2BQ%40mail.gmail.com
[1] https://postgr.es/m/CAEG8a3%2B3fwQbgzak%2Bh3Q7Bp%3DvK_aWhw1X7w7g5RCgEW9ufdvtA%40mail.gmail.com
[2] https://postgr.es/m/CAD21AoBUaSRBypA6pd9ZD%3DU-2TJCHtbyZRmrS91Nq0eVQ0B3BA%40mail.gmail.com
[3] https://postgr.es/m/CA%2BTgmobT3m%3D%2BdU5HF3VGVqiZ2O%2Bv6P5wN1Gj%2BPrq%2Bhj7dAm9AQ%40mail.gmail.com
[4] https://postgr.es/m/20130124215715.GE4528%40alvh.no-ip.org
--
nathan
Attachments
Thanks for raising this topic!  I agree that autovacuum scheduling could
be improved.

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.  Or maybe there is another table in a different database that
> is more important from a wraparound perspective.  We could complicate the
> patch to try to handle some of these things, but I maintain that even some
> basic, incremental scheduling improvements would be better than the status
> quo.  And we can always change it further in the future to handle these
> problems and to consider other things like bloat.

One risk I see with this approach is that we will end up autovacuuming
tables that also take the longest time to complete, which could cause
smaller, quick-to-process tables to be neglected.

It's not always the case that the oldest tables in terms of (M)XID age are
also the most expensive to vacuum, but that is often more true than not.

Not saying that the current approach, which, as you mention, is random, is
any better; however, this approach will likely make it more common for
large tables to saturate the workers.

But I also do see the merit of this approach when we know we are in
failsafe territory, because I would want my oldest-aged tables to be a/v'd
first.

--
Sami Imseih
Amazon Web Services (AWS)
On 2025-Oct-08, Sami Imseih wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.

Perhaps we can have autovacuum workers decide on a mode to use at startup
(or the launcher decides for them), and use different prioritization
heuristics depending on the mode.  For instance, if we're past the max
freeze age for any tables, then we know we have to vacuum the tables with
the higher MXID ages first, regardless of size considerations; but if
there's at least one worker already in that mode, then we use the mode
where smaller high-churn tables go first.

--
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"We do not dare to do many things because they are difficult, but they
are difficult because we do not dare to do them" (Séneca)
Hi,

On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> However, we do no such prioritization of the tables within a database.  In
> fact, the ordering of the tables is effectively random.

We don't prioritize tables, but I don't think the order really is random?
Isn't it basically in the order in which the data is in pg_class?  That
typically won't change from one autovacuum pass to the next...

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.

> Or maybe there is another table in a different database that
> is more important from a wraparound perspective.

That seems like something no ordering within a single AV worker can
address.  I think it's fine to just define that to be out of scope.

> We could complicate the patch to try to handle some of these things, but I
> maintain that even some basic, incremental scheduling improvements would be
> better than the status quo.  And we can always change it further in the
> future to handle these problems and to consider other things like bloat.

Agreed!  It doesn't take much to be better at scheduling than "order in
pg_class".

> The attached patch works by storing the maximum of the XID age and the MXID
> age in the list with the OIDs and sorting it prior to processing.

I think it may be worth trying to avoid reliably using the same order -
otherwise e.g. a corrupt index on the first scheduled table can cause
autovacuum to reliably fail on the same relation, never allowing it to
progress past that point.

Greetings,

Andres Freund
> Not saying that the current approach, which, as you mention, is random,
> is any better; however, this approach will likely make it more common
> for large tables to saturate the workers.

Maybe it would be good to allocate some workers to the oldest tables and
other workers to a randomly ordered list?  This could balance things out
between the oldest (large) tables and everything else to avoid this
problem.

--
Sami Imseih
Amazon Web Services (AWS)
On Wed, 8 Oct 2025 12:06:29 -0500
Sami Imseih <samimseih@gmail.com> wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.
>
> It's not always the case that the oldest tables in terms of (M)XID age
> are also the most expensive to vacuum, but that is often more true
> than not.

I think an approach of doing largest objects first actually might work
really well for balancing work amongst autovacuum workers.  Many years ago
I designed a system to back up many databases with a pool of workers and
used this same simple & naive algorithm of just reverse sorting on db
size, and it worked remarkably well.  If you have one big thing then you
probably want someone to get started on that first.  As long as there's a
pool of workers available, as you work through the queue, you can actually
end up with pretty optimal use of all the workers.

-Jeremy
On Thu, 9 Oct 2025 at 12:41, Jeremy Schneider <schneider@ardentperf.com> wrote:
> I think an approach of doing largest objects first actually might work
> really well for balancing work amongst autovacuum workers.  Many years
> ago I designed a system to back up many databases with a pool of workers
> and used this same simple & naive algorithm of just reverse sorting on
> db size, and it worked remarkably well.  If you have one big thing then
> you probably want someone to get started on that first.  As long as
> there's a pool of workers available, as you work through the queue, you
> can actually end up with pretty optimal use of all the workers.

I believe that methodology for processing work applies much better in
scenarios where there's no new work continually arriving and there are no
adverse effects from giving a lower priority to certain portions of the
work.  I don't think you can apply that so easily to autovacuum, as there
are scenarios where the work can pile up faster than it can be handled.
Also, smaller tables can bloat, in terms of growth proportional to the
original table size, much more quickly than larger tables, and that could
have huge consequences for queries on small tables which are not indexed
sufficiently to handle becoming bloated and large.

David
On Thu, 9 Oct 2025 12:59:23 +1300
David Rowley <dgrowleyml@gmail.com> wrote:

> I believe that methodology for processing work applies much better
> in scenarios where there's no new work continually arriving and
> there are no adverse effects from giving a lower priority to certain
> portions of the work.  I don't think you can apply that so easily to
> autovacuum, as there are scenarios where the work can pile up faster
> than it can be handled.  Also, smaller tables can bloat in terms of
> growth proportional to the original table size much more quickly than
> larger tables, and that could have huge consequences for queries on
> small tables which are not indexed sufficiently to handle becoming
> bloated and large.

I'm arguing that it works well with autovacuum.  Not saying there aren't
going to be certain workloads that it's suboptimal for.  We're talking
about sorting by (M)XID age.  As the clock continues to move forward, any
table that doesn't get processed naturally moves up the queue for the next
autovac run.  I think the concerns are minimal here and this would be a
good change in general.

-Jeremy

--
To know the thoughts and deeds that have marked man's progress is to feel
the great heart throbs of humanity through the centuries; and if one does
not feel in these pulsations a heavenward striving, one must indeed be
deaf to the harmonies of life.

Helen Keller, The Story Of My Life, 1902, 1903, 1905, introduction by
Ralph Barton Perry (Garden City, NY: Doubleday & Company, 1954), p90.
On Wed, 8 Oct 2025 17:27:27 -0700
Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 12:59:23 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > I believe that methodology for processing work applies much better
> > in scenarios where there's no new work continually arriving and
> > there are no adverse effects from giving a lower priority to certain
> > portions of the work.  I don't think you can apply that so easily to
> > autovacuum, as there are scenarios where the work can pile up faster
> > than it can be handled.  Also, smaller tables can bloat in terms of
> > growth proportional to the original table size much more quickly than
> > larger tables, and that could have huge consequences for queries on
> > small tables which are not indexed sufficiently to handle becoming
> > bloated and large.
>
> I'm arguing that it works well with autovacuum.  Not saying there
> aren't going to be certain workloads that it's suboptimal for.  We're
> talking about sorting by (M)XID age.  As the clock continues to move
> forward, any table that doesn't get processed naturally moves up the
> queue for the next autovac run.  I think the concerns are minimal here
> and this would be a good change in general.

Hmm, doesn't work quite like that if the full queue needs to be processed
before the next iteration ~ but at steady state these small tables are
going to get processed at the same rate whether they were at the top or
bottom of the queue, right?  And in non-steady-state conditions, this
seems like a better order than pg_class ordering?

-Jeremy
On Thu, 9 Oct 2025 at 13:27, Jeremy Schneider <schneider@ardentperf.com> wrote:
> I'm arguing that it works well with autovacuum.  Not saying there aren't
> going to be certain workloads that it's suboptimal for.  We're talking
> about sorting by (M)XID age.  As the clock continues to move forward, any
> table that doesn't get processed naturally moves up the queue for the
> next autovac run.  I think the concerns are minimal here and this would
> be a good change in general.

I thought if we're to have a priority queue that it would be hard to argue
against sorting by how far over the given auto-vacuum threshold the table
is.  Say a table that just meets the dead rows required to trigger
autovacuum based on the autovacuum_vacuum_scale_factor setting gets a
priority of 1.0, while another table whose n_mod_since_analyze is twice
the autovacuum_analyze_scale_factor threshold gets a priority of 2.0.
Effectively, prioritise by the percentage over the given threshold that
the table is.  That way users could still tune things when they weren't
happy with the priority given to a table by adjusting the corresponding
reloption.

It just seems strange to me to only account for 1 of the 4 trigger points
for autovacuum when it's possible to account for all 4 without much extra
trouble.

David
On Thu, 9 Oct 2025 14:03:34 +1300
David Rowley <dgrowleyml@gmail.com> wrote:

> I thought if we're to have a priority queue that it would be hard to
> argue against sorting by how far over the given auto-vacuum threshold
> the table is.  Say a table that just meets the dead rows required to
> trigger autovacuum based on the autovacuum_vacuum_scale_factor setting
> gets a priority of 1.0, while another table whose n_mod_since_analyze
> is twice the autovacuum_analyze_scale_factor threshold gets a priority
> of 2.0.  Effectively, prioritise by the percentage over the given
> threshold that the table is.  That way users could still tune things
> when they weren't happy with the priority given to a table by adjusting
> the corresponding reloption.

If users are tuning this thing then I feel like we've already lost the
battle :)

On a healthy system, autovac runs continually and hits tables at regular
intervals based on their steady state change rates.  We have existing
knobs (for better or worse) that people can use to tell PG to hit certain
tables more frequently, to get rid of sleeps/delays, etc.

With our fleet of PG databases here, my current approach is geared toward
setting log_autovacuum_min_duration to some conservative value fleet-wide,
then monitoring based on the logs for any cases where it runs longer than
a defined threshold.  I'm able to catch problems sooner this way, versus
monitoring on xid age alone.

Whenever there are problems with autovacuum, the actual issue is never
going to be resolved by what order autovacuum processes tables.  I don't
think we should encourage any tunables here... to me it seems like putting
focus entirely in the wrong place.

-Jeremy
On Wed, 8 Oct 2025 18:25:20 -0700
Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 14:03:34 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > I thought if we're to have a priority queue that it would be hard to
> > argue against sorting by how far over the given auto-vacuum threshold
> > the table is.  Say a table that just meets the dead rows required to
> > trigger autovacuum based on the autovacuum_vacuum_scale_factor setting
> > gets a priority of 1.0, while another table whose n_mod_since_analyze
> > is twice the autovacuum_analyze_scale_factor threshold gets a priority
> > of 2.0.  Effectively, prioritise by the percentage over the given
> > threshold that the table is.  That way users could still tune things
> > when they weren't happy with the priority given to a table by
> > adjusting the corresponding reloption.
>
> If users are tuning this thing then I feel like we've already lost the
> battle :)

I replied too quickly.  Re-reading your email, I think you're proposing a
different algorithm, taking tuple counts into account.  No tunables.  Is
there a fully fleshed-out version of the proposed alternative algorithm
somewhere?  (one of the older threads?)  I guess this is why it's so hard
to get anything committed in this area...

-J
On Thu, 9 Oct 2025 at 14:47, Jeremy Schneider <schneider@ardentperf.com> wrote:
>
> On Wed, 8 Oct 2025 18:25:20 -0700
> Jeremy Schneider <schneider@ardentperf.com> wrote:
> > If users are tuning this thing then I feel like we've already lost the
> > battle :)
>
> I replied too quickly.  Re-reading your email, I think you're proposing a
> different algorithm, taking tuple counts into account.  No tunables.  Is
> there a fully fleshed-out version of the proposed alternative algorithm
> somewhere?  (one of the older threads?)  I guess this is why it's so hard
> to get anything committed in this area...

It's along the lines of the "1a)" from [1].  I don't think that post does
a great job of explaining it.

I think the best way to understand it is if you look at
relation_needs_vacanalyze() and see how it calculates boolean values for
boolean output params.  So, instead of calculating just a boolean value,
it instead calculates a float4 where < 1.0 means don't do the operation
and anything >= 1.0 means do the operation.  For example, let's say a
table has 600 dead rows and the scale factor and threshold settings mean
that autovacuum will trigger at 200 (3 times more dead tuples than the
trigger point).  That would result in the value of 3.0 (600 / 200).  The
priority for the relfrozenxid portion is basically age(relfrozenxid) /
autovacuum_freeze_max_age (plus we need to account for mxid by doing the
same for that and taking the maximum of each value).  The priority for
autovacuum would then be the maximum of each of those component "scores".

Effectively, it's a method of aligning the different units of measure
(transactions or tuples) into a single value which is calculated based on
the very same values that we use today to trigger autovacuums.

David

[1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com
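A rough standalone sketch of the scoring idea described above; the
thresholds, table numbers, and function names here are made up for
illustration and are not taken from any patch:

/* Rough standalone sketch of the "score" idea: each trigger point
 * contributes (current value / trigger value), and the table's score is
 * the maximum of those components.  All inputs below are made up. */
#include <stdio.h>

static double
maxd(double a, double b)
{
	return (a > b) ? a : b;
}

/* >= 1.0 means at least one autovacuum trigger point has been reached. */
static double
autovacuum_score(double dead_tuples, double vacuum_threshold,
				 double mod_since_analyze, double analyze_threshold,
				 double xid_age, double freeze_max_age,
				 double mxid_age, double mxid_freeze_max_age)
{
	double		score = 0.0;

	score = maxd(score, dead_tuples / vacuum_threshold);
	score = maxd(score, mod_since_analyze / analyze_threshold);
	score = maxd(score, xid_age / freeze_max_age);
	score = maxd(score, mxid_age / mxid_freeze_max_age);
	return score;
}

int
main(void)
{
	/* 600 dead tuples with a trigger point of 200 -> component score 3.0,
	 * which outweighs the freeze-age components here. */
	printf("%.2f\n",
		   autovacuum_score(600, 200, 50, 500,
							150e6, 200e6, 10e6, 400e6));
	return 0;
}

With these made-up inputs, the dead-tuple component (600 / 200 = 3.0)
dominates, so such a table would sort ahead of one that is merely
approaching its freeze age.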
On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
>> The attached patch works by storing the maximum of the XID age and the MXID
>> age in the list with the OIDs and sorting it prior to processing.
>
> I think it may be worth trying to avoid reliably using the same order -
> otherwise e.g. a corrupt index on the first scheduled table can cause
> autovacuum to reliably fail on the same relation, never allowing it to
> progress past that point.

Hm.  What if we kept a short array of "failed" tables in shared memory?
Each worker would consult this table before processing.  If the table is
there, it would remove it from the shared table and skip processing it.
Then the next worker would try processing the table again.

I also wonder how hard it would be to gracefully catch the error and let
the worker continue with the rest of its list...

--
nathan
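A rough sketch of the "failed tables" idea above, using a plain fixed-size
array; a real version would need to live in shared memory with locking,
and every name here is hypothetical:

/* Standalone sketch of a "skip a recently-failed table once" list.
 * A real implementation would keep this in shared memory with locking. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_FAILED_TABLES 8

static unsigned int failed_tables[MAX_FAILED_TABLES];
static int	nfailed = 0;

/* Record a table whose vacuum failed (overflow is ignored in this sketch). */
static void
remember_failed(unsigned int oid)
{
	if (nfailed < MAX_FAILED_TABLES)
		failed_tables[nfailed++] = oid;
}

/* If the table is listed, remove it and tell the caller to skip it once;
 * the next worker that reaches this table will try it again. */
static bool
should_skip_once(unsigned int oid)
{
	for (int i = 0; i < nfailed; i++)
	{
		if (failed_tables[i] == oid)
		{
			failed_tables[i] = failed_tables[--nfailed];
			return true;
		}
	}
	return false;
}

int
main(void)
{
	remember_failed(16390);
	printf("16390 first pass: skip=%d\n", should_skip_once(16390));		/* skip=1 */
	printf("16390 second pass: skip=%d\n", should_skip_once(16390));	/* skip=0 */
	return 0;
}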
On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
> I think the best way to understand it is if you look at
> relation_needs_vacanalyze() and see how it calculates boolean values
> for boolean output params.  So, instead of calculating just a boolean
> value it instead calculates a float4 where < 1.0 means don't do the
> operation and anything >= 1.0 means do the operation.  For example,
> let's say a table has 600 dead rows and the scale factor and threshold
> settings mean that autovacuum will trigger at 200 (3 times more dead
> tuples than the trigger point).  That would result in the value of 3.0
> (600 / 200).  The priority for the relfrozenxid portion is basically
> age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account
> for mxid by doing the same for that and taking the maximum of each
> value).  For each of those component "scores", the priority for
> autovacuum would be the maximum of each of those.
>
> Effectively, it's a method of aligning the different units of measure
> (transactions or tuples) into a single value which is calculated based
> on the very same values that we use today to trigger autovacuums.

I like the idea of a "score" approach, but I'm worried that we'll never
come to an agreement on the formula to use.  Perhaps we'd have more luck
getting consensus on a multifaceted strategy if we kept it brutally simple.
IMHO it's worth a try...

--
nathan
Hi,

On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
> On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> > On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> >> The attached patch works by storing the maximum of the XID age and the MXID
> >> age in the list with the OIDs and sorting it prior to processing.
> >
> > I think it may be worth trying to avoid reliably using the same order -
> > otherwise e.g. a corrupt index on the first scheduled table can cause
> > autovacuum to reliably fail on the same relation, never allowing it to
> > progress past that point.
>
> Hm.  What if we kept a short array of "failed" tables in shared memory?

I've thought about having that as part of pgstats...

> Each worker would consult this table before processing.  If the table is
> there, it would remove it from the shared table and skip processing it.
> Then the next worker would try processing the table again.
>
> I also wonder how hard it would be to gracefully catch the error and let
> the worker continue with the rest of its list...

The main set of cases I've seen are when workers get hung up permanently in
corrupt indexes.  There never is actually an error, the autovacuums just get
terminated as part of whatever independent reason there is to restart.  The
problem with that is that you'll never actually have vacuum fail...

Greetings,

Andres Freund
On Thu, Oct 09, 2025 at 12:15:31PM -0400, Andres Freund wrote:
> On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
>> I also wonder how hard it would be to gracefully catch the error and let
>> the worker continue with the rest of its list...
>
> The main set of cases I've seen are when workers get hung up permanently in
> corrupt indexes.  There never is actually an error, the autovacuums just get
> terminated as part of whatever independent reason there is to restart.  The
> problem with that is that you'll never actually have vacuum fail...

Ah.  Wouldn't the other workers skip that table in that scenario?  I'm not
following the great advantage of varying the order in this case.  I suppose
the full set of workers might be able to process more tables before one
inevitably gets stuck.  Is that it?

--
nathan
On Thu, Oct 9, 2025 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
> > Each worker would consult this table before processing.  If the table is
> > there, it would remove it from the shared table and skip processing it.
> > Then the next worker would try processing the table again.
> >
> > I also wonder how hard it would be to gracefully catch the error and let
> > the worker continue with the rest of its list...
>
> The main set of cases I've seen are when workers get hung up permanently in
> corrupt indexes.

How recently was this?  I'm aware of problems like that that we discussed
around 2018, but they were greatly mitigated.  First by your commit
3a01f68e, then by my commit c34787f9.

In general, there's no particularly good reason why (at least with nbtree
indexes) VACUUM should ever hang forever.  The access pattern is
overwhelmingly simple, sequential access.  The only exception is nbtree
page deletion (plus backtracking), where it isn't particularly hard to
just be very careful about self-deadlock.

> There never is actually an error, the autovacuums just get
> terminated as part of whatever independent reason there is to restart.

What do you mean?  In general I'd expect nbtree VACUUM of a corrupt index
to either not fail at all (we'll soldier on to the best of our ability
when page deletion encounters an inconsistency), or to get permanently
stuck due to locking the same page twice/self-deadlock (though as I said,
those problems were mitigated, and might even be almost impossible these
days).  Every other case involves some kind of error (e.g., an OOM is just
about possible).

I agree with you about using a perfectly deterministic order coming with
real downsides, without any upside.  Don't interpret what I've said as
expressing opposition to that idea.

--
Peter Geoghegan
On Thu, Oct 09, 2025 at 11:13:48AM -0500, Nathan Bossart wrote:
> On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
>> I think the best way to understand it is if you look at
>> relation_needs_vacanalyze() and see how it calculates boolean values
>> for boolean output params.  So, instead of calculating just a boolean
>> value it instead calculates a float4 where < 1.0 means don't do the
>> operation and anything >= 1.0 means do the operation.  For example,
>> let's say a table has 600 dead rows and the scale factor and threshold
>> settings mean that autovacuum will trigger at 200 (3 times more dead
>> tuples than the trigger point).  That would result in the value of 3.0
>> (600 / 200).  The priority for the relfrozenxid portion is basically
>> age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account
>> for mxid by doing the same for that and taking the maximum of each
>> value).  For each of those component "scores", the priority for
>> autovacuum would be the maximum of each of those.
>>
>> Effectively, it's a method of aligning the different units of measure
>> (transactions or tuples) into a single value which is calculated based
>> on the very same values that we use today to trigger autovacuums.
>
> I like the idea of a "score" approach, but I'm worried that we'll never
> come to an agreement on the formula to use.  Perhaps we'd have more luck
> getting consensus on a multifaceted strategy if we kept it brutally simple.
> IMHO it's worth a try...

Here's a prototype of a "score" approach.  Two notes:

* I've given special priority to anti-wraparound vacuums.  I think this is
important to avoid focusing too much on bloat when wraparound is imminent.
In any case, we need a separate wraparound score in case autovacuum is
disabled.

* I didn't include the analyze threshold in the score because it doesn't
apply to TOAST tables, and therefore would artificially lower their
priority.  Perhaps there is another way to deal with this.

This is very much just a prototype of the basic idea.  As-is, I think it'll
favor processing tables with lots of bloat unless we're in an
anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific
we want to be about all of this, but I do intend to try some long-running
tests.

--
nathan
Attachments
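Based on the description above and the follow-up discussion, the
prototype's combination of sub-scores appears to behave roughly like the
simplified sketch below; the inputs are made-up, precomputed values rather
than anything computed from table statistics, and this is not code from
the patch:

/* Simplified sketch of how the prototype combines its sub-scores, as
 * described above: the components are added, but once the wraparound
 * portion reaches 1.0 everything else is disregarded. */
#include <stdio.h>

static double
combined_score(double bloat_score, double wraparound_score)
{
	/* Once the wraparound portion reaches 1.0, ignore everything else so
	 * that bloat can no longer outweigh a wraparound-driven vacuum. */
	if (wraparound_score >= 1.0)
		return wraparound_score;
	return bloat_score + wraparound_score;
}

int
main(void)
{
	printf("%.2f\n", combined_score(3.0, 0.4));	/* bloat-driven: 3.40 */
	printf("%.2f\n", combined_score(3.0, 1.2));	/* wraparound-driven: 1.20 */
	return 0;
}

Note how the second call returns a lower combined score than the first
even though its wraparound component is higher; that discontinuity is what
the next message picks up on.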
On Fri, Oct 10, 2025 at 1:31 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here's a prototype of a "score" approach.  Two notes:
>
> * I've given special priority to anti-wraparound vacuums.  I think this is
> important to avoid focusing too much on bloat when wraparound is imminent.
> In any case, we need a separate wraparound score in case autovacuum is
> disabled.
>
> * I didn't include the analyze threshold in the score because it doesn't
> apply to TOAST tables, and therefore would artificially lower their
> priority.  Perhaps there is another way to deal with this.
>
> This is very much just a prototype of the basic idea.  As-is, I think it'll
> favor processing tables with lots of bloat unless we're in an
> anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific
> we want to be about all of this, but I do intend to try some long-running
> tests.

I think this is a reasonable starting point, although I'm surprised that
you chose to combine the sub-scores using + rather than Max.

I think it will take a lot of experimentation to figure out whether this
particular algorithm (or any other) works well in practice.  My intuition
(for whatever that is worth to you, which may not be much) is that what
will anger users is cases when we ignore a horrible problem to deal with a
routine problem.  Figuring out how to design the scoring system to avoid
such outcomes is the hard part of this problem, IMHO.

For this particular algorithm, the main hazards that spring to mind for me
are:

- The wraparound score can't be more than about 10, but the bloat score
could be arbitrarily large, especially for tables with few tuples, so
there may be lots of cases in which the wraparound score has no impact on
the behavior.

- The patch attempts to guard against this by disregarding the
non-wraparound portion of the score once the wraparound portion reaches
1.0, but that results in an abrupt behavior shift at that point.  Suddenly
we go from mostly ignoring the wraparound score to entirely ignoring the
bloat score.  This might result in the system abruptly ignoring tables
that are bloating extremely rapidly in favor of trying to catch up in a
wraparound situation that is not yet terribly urgent.

When I've thought about this problem -- and I can't claim to have thought
about it very hard -- it's seemed to me that we need to (1) somehow
normalize everything to somewhat similar units and (2) make sure that
severe wraparound danger always wins over every other consideration, but
mild wraparound danger can lose to severe bloat.

--
Robert Haas
EDB: http://www.enterprisedb.com
Thanks for taking a look.

On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.

My thinking was that we should consider as many factors as we can in the
score, not just the worst one.  If a table has medium bloat and medium
wraparound risk, should it always be lower in priority than something with
large bloat and small wraparound risk?  It seems worth exploring.  I am
curious why you first thought of Max.

> When I've thought about this problem -- and I can't claim to have
> thought about it very hard -- it's seemed to me that we need to (1)
> somehow normalize everything to somewhat similar units and (2) make
> sure that severe wraparound danger always wins over every other
> consideration, but mild wraparound danger can lose to severe bloat.

Agreed.  I need to think about this some more.  While I'm optimistic that
we could come up with some sort of normalization framework, I desperately
want to avoid super complicated formulas and GUCs, as those seem like
sure-fire ways of ensuring nothing ever gets committed.

--
nathan
On Fri, Oct 10, 2025 at 3:44 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> > I think this is a reasonable starting point, although I'm surprised
> > that you chose to combine the sub-scores using + rather than Max.
>
> My thinking was that we should consider as many factors as we can in the
> score, not just the worst one.  If a table has medium bloat and medium
> wraparound risk, should it always be lower in priority than something
> with large bloat and small wraparound risk?  It seems worth exploring.
> I am curious why you first thought of Max.

The right answer depends a good bit on how exactly you do the scoring, but
it seems to me that it would be easy to overweight secondary problems.
Consider a table with an XID age of 900m and an MXID age of 900m and
another table with an XID age of 1.8b.  I think it is VERY clear that the
second one is MUCH worse; but just adding things up will make them seem
equal.

> Agreed.  I need to think about this some more.  While I'm optimistic that
> we could come up with some sort of normalization framework, I desperately
> want to avoid super complicated formulas and GUCs, as those seem like
> sure-fire ways of ensuring nothing ever gets committed.

IMHO, the trick here is to come up with something that's neither too
simple nor too complicated.  If it's too simple, we'll easily come up with
cases where it sucks, and possibly where it's worse than what we do now
(an impressive achievement, to be sure).  If it's too complicated, it will
be full of arbitrary things that will provoke dissent and probably not
work out well in practice.  I don't think we need something dramatically
awesome to make a change to the status quo, but if it's extremely easy to
think up simple scenarios in which a given idea will fail spectacularly,
I'd be inclined to suspect that there will be a lot of real-world
spectacular failures.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 10 Oct 2025 16:24:51 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> I don't think we
> need something dramatically awesome to make a change to the status
> quo, but if it's extremely easy to think up simple scenarios in which
> a given idea will fail spectacularly, I'd be inclined to suspect that
> there will be a lot of real-world spectacular failures.

What does a real-world spectacular failure look like?  "If those 3 autovac
workers had processed tables in a different order, everything would have
been peachy."

But if autovac is going to get jammed up long enough for the system to
wrap around, does it matter whether or not it did a one-time processing of
a bunch of small tables before it got jammed?  One particular table always
scoring high shouldn't block autovac from other tables, because it doesn't
start a new iteration until it goes all the way through the list from its
current iteration, right?  And one iteration of autovac needs to process
everything in the list... so it should take the same overall time
regardless of order?

The spectacular failures I've seen with autovac usually come down to
things like too much sleeping (cost_delay) or too few workers, where
better ordering would be nice but probably wouldn't fix any real problems
leading to the spectacular failures.

From Robert's 2024 pgConf.dev talk:

1. slow - forward progress not fast enough
2. stuck - no forward progress
3. spinning - not accomplishing anything
4. skipped - thinks not needed
5. starvation - can't keep up

I don't think any of these are really addressed by simply changing table
order.

From Robert's 2022 email to hackers:

> A few people have proposed scoring systems, which I think is closer
> to the right idea, because our basic goal is to start vacuuming any
> given table soon enough that we finish vacuuming it before some
> catastrophe strikes.
> ...
> If table A will cause wraparound in 2 hours and take 2 hours to
> vacuum, and table B will cause wraparound in 1 hour and take 10
> minutes to vacuum, table A is more urgent even though the catastrophe
> is further out.

Robert, it sounds to me like the main use case you're focused on here is
where basically wraparound is imminent - we are already screwed - and our
very last hope was that a last-ditch autovac can finish just in time.

Failsafe and dynamic cost updates were huge advancements.  Do we allow
dynamic adjustment of the worker count yet?

I hope y'all just pick something and commit it without getting too lost in
the details.  I honestly think in the list of improvements around autovac,
this is the lowest priority on my list of hopes and dreams as a user for
wraparound prevention :) because if this ever matters to me for avoiding
wraparound, I was screwed long before we got to this point, and this is
not going to fix my underlying problems.

-Jeremy
On Sat, 11 Oct 2025 at 07:43, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.
Adding up the component scores doesn't make sense to me either. That
means you could have 0.5 for inserted tuples, 0.5 for dead tuples and,
say, 0.1 for the analyze threshold, which all add up to 1.1, but no single
component score is high enough for auto-vacuum to have to do anything
yet. With Max(), we'd clearly see that there's nothing to do since the
overall score isn't >= 1.0.
> - The wraparound score can't be more than about 10, but the bloat
> score could be arbitrarily large, especially for tables with few
> tuples, so there may be lots of cases in which the wraparound score
> has no impact on the behavior.
That's a good point. I think we definitely do want to make it so
tables in near danger of causing the database to stop accepting
transactions are dealt with ASAP.
Maybe the score calculation could change when the relevant age() goes
above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start
scaling it very aggressively beyond that. There's plenty to debate,
but at a first cut, maybe something like the following (coded in SQL
for ease of result viewing):
select xidage as "age(relfrozenxid)",case xidage::float8 <
current_setting('vacuum_failsafe_age')::float8 when true then xidage /
current_setting('autovacuum_freeze_max_age')::float8 else power(xidage
/ current_setting('autovacuum_freeze_max_age')::float8,xidage::float8
/ 100_000_000) end xid_age_score from
generate_series(0,2_000_000_000,100_000_000) xidage;
which gives 1e+20 for age of 2 billion. It would take quite an
unreasonable amount of bloat to score higher than that.
I guess someone might argue that we should start taking it more
seriously before the table's relfrozenxid age gets to
vacuum_failsafe_age. Maybe that's true. I just don't know what. In any
case, if a table's age gets that old, then something's probably not
configured very well and needs attention. I did think maybe we could
keep the addressing of auto-vacuum being configured to run too slowly
as a separate thread.
David
On Fri, Oct 10, 2025 at 6:00 PM Jeremy Schneider <schneider@ardentperf.com> wrote:
> The spectacular failures I've seen with autovac usually come down to
> things like too much sleeping (cost_delay) or too few workers, where
> better ordering would be nice but probably wouldn't fix any real
> problems leading to the spectacular failures.

Since I have said the same thing myself, I can hardly disagree.  However,
there are probably a few exceptions.  For instance, if autovacuum on a
certain table is failing repeatedly or accomplishing nothing without
removing the apparent need to autovacuum, and happens to be the first one
in pg_class, it could divert a lot of attention from other tables.

> Robert, it sounds to me like the main use case you're focused on here
> is where basically wraparound is imminent - we are already screwed - and
> our very last hope was that a last-ditch autovac can finish just in time.

Yes, I would argue that this is the scenario that really matters.  As you
say above, the main thing is having little enough sleeping and a
sufficient number of workers.  When that's the case, we can do the work in
any order and life will mostly be fine.  However, if we get into a
desperate situation by, say, having one table that can't be vacuumed, and
eventually someone fixes that, say by dropping the corrupt index that is
preventing vacuuming of that table, we might like it if autovacuum focused
on getting that table vacuumed rather than getting lost in the sauce.

Of course, if we have the pretty common situation where autovacuum gets
behind on all tables, say due to a stale replication slot, then this is
less critical, although a perfect system would probably prioritize
vacuuming the *largest* tables in this situation, since those will take
the longest to finish, and it's when a vacuum of every table in the
cluster has been *completed* that the XID horizons can advance.

> I hope y'all just pick something and commit it without getting too lost
> in the details.  I honestly think in the list of improvements around
> autovac, this is the lowest priority on my list of hopes and dreams as a
> user for wraparound prevention :) because if this ever matters to me for
> avoiding wraparound, I was screwed long before we got to this point, and
> this is not going to fix my underlying problems.

I'm not sure if this was your intention, but to me this kind of reads like
"well, it's not going to matter anyway so just do whatever and move on",
and I don't agree with that.  I think that if we're not going to do
high-quality engineering here, we just shouldn't change anything at all.
It's better to keep having the same bad behavior than for each release to
have new and different bad behavior.

One possible positive result of leaning into this prioritization problem
is that whoever's working on it (Nathan, in this case) might gain some
useful insights about how to tackle some of the other problems in this
space.  All of this is hard enough that we haven't really had any major
improvements in this area since, I want to say, 8.3, and it's desirable to
break that logjam even if we don't all agree on which problems are most
urgent.  Even if I ultimately don't agree with whatever Nathan wants to do
or proposes, I'm glad he's trying to do something, which is (in my
experience) generally much better than making no effort at all.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Oct 12, 2025 at 07:27:10PM +1300, David Rowley wrote:
> On Sat, 11 Oct 2025 at 07:43, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this is a reasonable starting point, although I'm surprised
>> that you chose to combine the sub-scores using + rather than Max.
>
> Adding up the component scores doesn't make sense to me either. That
> means you could have 0.5 for inserted tuples, 0.5 for dead tuples and,
> say 0.1 for analyze threshold, which all add up to 1.1, but neither
> component score is high enough for auto-vacuum to have to do anything
> yet. With Max(), we'd clearly see that there's nothing to do since the
> overall score isn't >= 1.0.
In v3, I switched to Max().
> Maybe the score calculation could change when the relevant age() goes
> above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start
> scaling it very aggressively beyond that. There's plenty to debate,
> but at a first cut, maybe something like the following (coded in SQL
> for ease of result viewing):
>
> select xidage as "age(relfrozenxid)",case xidage::float8 <
> current_setting('vacuum_failsafe_age')::float8 when true then xidage /
> current_setting('autovacuum_freeze_max_age')::float8 else power(xidage
> / current_setting('autovacuum_freeze_max_age')::float8,xidage::float8
> / 100_000_000) end xid_age_score from
> generate_series(0,2_000_000_000,100_000_000) xidage;
>
> which gives 1e+20 for age of 2 billion. It would take quite an
> unreasonable amount of bloat to score higher than that.
>
> I guess someone might argue that we should start taking it more
> seriously before the table's relfrozenxid age gets to
> vacuum_failsafe_age. Maybe that's true. I just don't know what. In any
> case, if a table's age gets that old, then something's probably not
> configured very well and needs attention. I did think maybe we could
> keep the addressing of auto-vacuum being configured to run too slowly
> as a separate thread.
I did something similar to this in v3, although I used the *_freeze_max_age
parameters as the point to start scaling aggressively, and I simply raised
the score to the power of 10.
I've yet to do any real testing with this stuff.
--
nathan
Attachments
On Wed, 22 Oct 2025 at 03:38, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I did something similar to this in v3, although I used the *_freeze_max_age
> parameters as the point to start scaling aggressively, and I simply raised
> the score to the power of 10.
>
> I've yet to do any real testing with this stuff.

I've not tested it or compiled it, but the patch looks good.  I did think
that the freeze vacuum isn't that big a deal if it's just over the
*freeze_max_age and thought it should become aggressive very quickly at
the failsafe age, but that leaves a much smaller window of time to do the
freezing if autovacuum has been busy with other higher-priority tables.
Your scaling is much more gentle and comes out (with standard settings)
with a score of 1 billion for a table at the failsafe age, and about 1
million at half the failsafe age.  That seems reasonable, as it's hard to
imagine a table having a 1 billion bloat score.

However, just thinking of non-standard settings... I do wonder if it'll be
aggressive enough if someone did something like raise the *freeze_max_age
to 1 billion (it's certainly common that people raise this).  With a 1.6
billion vacuum_failsafe_age, a table at freeze_max_age only scores in at
110.  I guess there's no reason we couldn't keep your calc and then scale
the score further once over vacuum_failsafe_age to ensure those are the
highest priority.  There is a danger that if a table scores too low when
age(relfrozenxid) > vacuum_failsafe_age, autovacuum dawdles along handling
bloated tables while oblivious to the nearing armageddon.

Is it worth writing a comment explaining the philosophy behind the scoring
system to make it easier for people to understand that it aims to
standardise the priority of vacuums and unify the various trigger
thresholds into a single number to determine which tables are most
important to vacuum and/or analyze first?

Thanks for working on this.

David
On Wed, Oct 22, 2025 at 09:07:33AM +1300, David Rowley wrote:
> However, just thinking of non-standard settings... I do wonder if it'll
> be aggressive enough if someone did something like raise the
> *freeze_max_age to 1 billion (it's certainly common that people raise
> this).  With a 1.6 billion vacuum_failsafe_age, a table at
> freeze_max_age only scores in at 110.  I guess there's no reason we
> couldn't keep your calc and then scale the score further once over
> vacuum_failsafe_age to ensure those are the highest priority.  There is
> a danger that if a table scores too low when age(relfrozenxid) >
> vacuum_failsafe_age, autovacuum dawdles along handling bloated tables
> while oblivious to the nearing armageddon.

That's a good point.  I wonder if we should try to make the wraparound
score independent of the *_freeze_max_age parameters (once the table age
surpasses said parameters).  Else, different settings will greatly impact
how aggressively tables are prioritized the closer they are to wraparound.
Even if autovacuum_freeze_max_age is set to 200M, it's not critically
important for autovacuum to pick up tables right away as soon as their age
reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
to prioritize tables right away once their age reaches 2B.

> Is it worth writing a comment explaining the philosophy behind the
> scoring system to make it easier for people to understand that it aims
> to standardise the priority of vacuums and unify the various trigger
> thresholds into a single number to determine which tables are most
> important to vacuum and/or analyze first?

Yes, I think so.

> Thanks for working on this.

I appreciate the discussion.

--
nathan
On Wed, Oct 22, 2025 at 01:40:11PM -0500, Nathan Bossart wrote:
> On Wed, Oct 22, 2025 at 09:07:33AM +1300, David Rowley wrote:
>> However, just thinking of non-standard setting... I do wonder if it'll
>> be aggressive enough if someone did something like raise the
>> *freeze_max_age to 1 billion (it's certainly common that people raise
>> this). With a 1.6 billion vacuum_failsafe_age, a table at
>> freeze_max_age only scores in at 110. I guess there's no reason we
>> couldn't keep your calc and then scale the score further once over
>> vacuum_failsafe_age to ensure those are the highest priority. There is
>> a danger that if a table scores too low when age(relfrozenxid) >
>> vacuum_failsafe_age that autovacuum dawdles along handling bloated
>> tables while oblivious to the nearing armageddon.
>
> That's a good point. I wonder if we should try to make the wraparound
> score independent of the *_freeze_max_age parameters (once the table age
> surpasses said parameters). Else, different settings will greatly impact
> how aggressively tables are prioritized the closer they are to wraparound.
> Even if autovacuum_freeze_max_age is set to 200M, it's not critically
> important for autovacuum to pick up tables right away as soon as their age
> reaches 200M. But if the parameter is set to 2B, we _do_ want autovacuum
> to prioritize tables right away once their age reaches 2B.
I'm imagining something a bit like the following:
select xidage "age(relfrozenxid)",
power(1.001, xidage::float8 / (select min_val
from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
age(relfrozenxid) | xid_age_score
-------------------+--------------------
0 | 1
100000000 | 2.7169239322355936
200000000 | 7.38167565355452
300000000 | 20.055451243143093
400000000 | 54.48913545427955
500000000 | 148.0428361625591
600000000 | 402.22112456608977
700000000 | 1092.804199384323
800000000 | 2969.065882554825
900000000 | 8066.726152697397
1000000000 | 21916.681339054314
1100000000 | 59545.956045257895
1200000000 | 161781.8330472099
1300000000 | 439548.9340069078
1400000000 | 1194221.0181920114
1500000000 | 3244607.664704634
1600000000 | 8815352.21495106
1700000000 | 23950641.403886583
1800000000 | 65072070.82261215
1900000000 | 176795866.53808445
2000000000 | 480340920.9176516
(21 rows)
--
nathan
On Thu, 23 Oct 2025 at 07:58, Nathan Bossart <nathandbossart@gmail.com> wrote:
> > That's a good point.  I wonder if we should try to make the wraparound
> > score independent of the *_freeze_max_age parameters (once the table age
> > surpasses said parameters).  Else, different settings will greatly impact
> > how aggressively tables are prioritized the closer they are to wraparound.
> > Even if autovacuum_freeze_max_age is set to 200M, it's not critically
> > important for autovacuum to pick up tables right away as soon as their age
> > reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
> > to prioritize tables right away once their age reaches 2B.
>
> I'm imagining something a bit like the following:
>
> select xidage "age(relfrozenxid)",
>   power(1.001, xidage::float8 / (select min_val
>     from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
>   xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
>
>  age(relfrozenxid) | xid_age_score
> -------------------+--------------------
> 0 | 1
> 100000000 | 2.7169239322355936
> 200000000 | 7.38167565355452
> 300000000 | 20.055451243143093

This does start to put the score > 1 before the table reaches
autovacuum_freeze_max_age.  I don't think that's great, as the score of
1.0 was meant to represent that the table now requires some autovacuum
work.

The main reason I was trying to keep the score scaling with the percentage
over the given threshold that the table is was that I had imagined we
could use the score number to start reducing the sleep time between
autovacuum_vacuum_cost_limit when the highest-scoring table persists in
being high for too long.  I was considering this to fix the misconfigured
autovacuum problem that so many people have.  If we scaled it the way the
query above does, the score would look high even before it reaches the
limit.  This is the reason I was scaling the score linearly with the
autovacuum_freeze_max_age in the version I sent and only scaling
exponentially after the failsafe age.

I wanted to talk about the "reducing the cost delay" feature separately so
as not to load up this thread and widen the scope for varying opinions,
but in its most trivial form, the vacuum_cost_limit() code could be
adjusted to only sleep for autovacuum_vacuum_cost_delay / <the table's
score>.

I think the one I proposed in [1] does this quite well.  The table remains
eligible to be autovacuumed with any score >= 1.0, and there's still a
huge window of time to freeze a table once it's over
autovacuum_freeze_max_age before there are issues, and the exponential
scaling once over the failsafe age should ensure that the table is top of
the list for when the failsafe code kicks in and removes the cost limit.
If we had the varying sleep time as I mentioned above, the failsafe code
could even be removed, as the "autovacuum_vacuum_cost_delay / <table's
score>" calculation would effectively zero the sleep time for any table >
failsafe age.

David

[1] https://postgr.es/m/CAApHDvqrd=SHVUytdRj55OWnLH98Rvtzqam5zq2f4XKRZa7t9Q@mail.gmail.com
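A minimal sketch of that "divide the sleep by the score" idea; the helper
name and the numbers are hypothetical, and this is not existing PostgreSQL
code:

/* Standalone sketch: scale the autovacuum cost delay down by the table's
 * score, so tables far past their thresholds sleep proportionally less. */
#include <stdio.h>

static double
adjusted_cost_delay_ms(double autovacuum_vacuum_cost_delay_ms,
					   double table_score)
{
	/* A table just at its trigger point (score 1.0) sleeps the full delay;
	 * a score past the failsafe age effectively zeroes the sleep. */
	if (table_score < 1.0)
		table_score = 1.0;
	return autovacuum_vacuum_cost_delay_ms / table_score;
}

int
main(void)
{
	printf("score 1.0 -> %.3f ms\n", adjusted_cost_delay_ms(2.0, 1.0));
	printf("score 50  -> %.3f ms\n", adjusted_cost_delay_ms(2.0, 50.0));
	printf("score 1e9 -> %.9f ms\n", adjusted_cost_delay_ms(2.0, 1e9));
	return 0;
}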
On Thu, Oct 23, 2025 at 08:34:49AM +1300, David Rowley wrote:
> On Thu, 23 Oct 2025 at 07:58, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm imagining something a bit like the following:
>>
>> select xidage "age(relfrozenxid)",
>>   power(1.001, xidage::float8 / (select min_val
>>     from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
>>   xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
>>
>>  age(relfrozenxid) | xid_age_score
>> -------------------+--------------------
>> 0 | 1
>> 100000000 | 2.7169239322355936
>> 200000000 | 7.38167565355452
>> 300000000 | 20.055451243143093
>
> This does start to put the score > 1 before the table reaches
> autovacuum_freeze_max_age.  I don't think that's great, as the score of
> 1.0 was meant to represent that the table now requires some autovacuum
> work.

My thinking was that this formula would only be used once the table
reaches autovacuum_freeze_max_age.  If the age is less than that, we'd do
something else, such as dividing the age by the *_max_age setting.

> The main reason I was trying to keep the score scaling with the percentage
> over the given threshold that the table is was that I had imagined we
> could use the score number to start reducing the sleep time between
> autovacuum_vacuum_cost_limit when the highest-scoring table persists in
> being high for too long.  I was considering this to fix the misconfigured
> autovacuum problem that so many people have.  If we scaled it the way the
> query above does, the score would look high even before it reaches the
> limit.  This is the reason I was scaling the score linearly with the
> autovacuum_freeze_max_age in the version I sent and only scaling
> exponentially after the failsafe age.
>
> I wanted to talk about the "reducing the cost delay" feature separately so
> as not to load up this thread and widen the scope for varying opinions,
> but in its most trivial form, the vacuum_cost_limit() code could be
> adjusted to only sleep for autovacuum_vacuum_cost_delay / <the table's
> score>.

I see.

> I think the one I proposed in [1] does this quite well.  The table remains
> eligible to be autovacuumed with any score >= 1.0, and there's still a
> huge window of time to freeze a table once it's over
> autovacuum_freeze_max_age before there are issues, and the exponential
> scaling once over the failsafe age should ensure that the table is top of
> the list for when the failsafe code kicks in and removes the cost limit.

Yeah.  I'll update the patch with that formula.

--
nathan
> > I think the one I proposed in [1] does this quite well.  The table
> > remains eligible to be autovacuumed with any score >= 1.0, and there's
> > still a huge window of time to freeze a table once it's over
> > autovacuum_freeze_max_age before there are issues, and the exponential
> > scaling once over the failsafe age should ensure that the table is top
> > of the list for when the failsafe code kicks in and removes the cost
> > limit.
>
> Yeah.  I'll update the patch with that formula.

I was looking at v3, and I understand the formula will be updated in the
next version.  However, do you think we should benchmark the approach of
using an intermediary list to store the eligible tables and sorting that
list?  That may cause larger performance overhead for databases with
hundreds of tables that may all be eligible for autovacuum.  I do think
such cases are common out there, particularly in multi-tenant type
databases, where each tenant could be one or more tables.

What do you think?

--
Sami
On Thu, Oct 23, 2025 at 01:22:24PM -0500, Sami Imseih wrote:
> I was looking at v3, and I understand the formula will be updated in the
> next version.  However, do you think we should benchmark the approach of
> using an intermediary list to store the eligible tables and sorting that
> list?  That may cause larger performance overhead for databases with
> hundreds of tables that may all be eligible for autovacuum.  I do think
> such cases are common out there, particularly in multi-tenant type
> databases, where each tenant could be one or more tables.

We already have an intermediary list of table OIDs, so the additional
overhead is ultimately just the score calculation and the sort operation.
I'd be quite surprised if that added up to anything remotely worrisome,
even for thousands of eligible tables.

--
nathan
> On Thu, Oct 23, 2025 at 01:22:24PM -0500, Sami Imseih wrote: > > I was looking at v3, and I understand the formula will be updated in the > > next version. However, do you think we should benchmark the approach > > of using an intermediary list to store the eligible tables and sorting > > that list, > > which may cause larger performance overhead for databases with hundreds > > of tables that may all be eligible for autovacuum. I do think such cases > > out there are common, particularly in multi-tenant type databases, where > > each tenant could be one or more tables. > > We already have an intermediary list of table OIDs, so the additional > overhead is ultimately just the score calculation and the sort operation. > I'd be quite surprised if that added up to anything remotely worrisome, > even for thousands of eligible tables. Yeah, you’re correct, the list already exists; sorry I missed that. My main concern is the additional overhead of the sort operation, especially if we have many eligible tables and an aggressive autovacuum_naptime. I don’t think we should make the existing performance of many relations any worse with an additional sort. That said, in such cases the sort may not even be the main performance bottleneck, since the catalog scan itself already doesn’t scale well with many relations. With our current approach, we have more options to improve this, but if we add a sort, we may not be able to avoid a full scan. -- Sami
On Fri, 24 Oct 2025 at 08:33, Sami Imseih <samimseih@gmail.com> wrote: > Yeah, you’re correct, the list already exists; sorry I missed that. My > main concern is > the additional overhead of the sort operation, especially if we have > many eligible > tables and an aggressive autovacuum_naptime. It is true that there are reasons that millions of tables could suddenly become eligible for autovacuum work with the consumption of a single xid, but I imagine sorting the list of tables is probably the least of the DBAs worries for that case as sorting the tables_to_process list is going to take a tiny fraction of the time that doing the vacuum work will take. If your concern is that the sort could take too large a portion of someone's 1sec autovacuum_naptime instance, then you also need to consider that the list isn't likely to be very long as there's very little time for tables to become eligible in such a short naptime, and if the tables are piling up because autovacuum is configured to run too slowly, then lets fix that at the root cause rather than be worried about improving one area because another area needs work. If we think like that, we'll remain gridlocked and autovacuum will never be improved. TBH, I think that mindset has likely contributed quite a bit to the fact that we've made about zero improvements in this area despite nobody thinking that nothing needs to be done. There are also things that could be done if we were genuinely concerned and had actual proof that this could reasonably be a problem. sort_template.h would reduce the constant factor of the indirect function call overhead by quite a bit. On a quick test here with a table containing 1 million random float8 values, a Seq Scan and in-memory Sort, EXPLAIN ANALYZE reports the sort took about 21ms: (actual time=172.273..193.824). I really doubt anyone will be concerned with 21ms when there's a list of 1 million tables needing to be autovacuumed. David
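If that ever became necessary, using sort_template.h would look roughly like the following (a sketch assuming the entries were kept in a plain array rather than a List; all names are illustrative):

#include "postgres.h"

/* assumed array element holding a table and its score */
typedef struct TableScore
{
    Oid         relid;
    double      score;
} TableScore;

/* descending by score: a higher-priority table sorts earlier */
#define ST_SORT sort_table_scores
#define ST_ELEMENT_TYPE TableScore
#define ST_COMPARE(a, b) ((a)->score > (b)->score ? -1 : \
                          (a)->score < (b)->score ? 1 : 0)
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

/* usage: sort_table_scores(score_array, nentries); */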
> On Fri, 24 Oct 2025 at 08:33, Sami Imseih <samimseih@gmail.com> wrote: > > Yeah, you’re correct, the list already exists; sorry I missed that. My > > main concern is > > the additional overhead of the sort operation, especially if we have > > many eligible > > tables and an aggressive autovacuum_naptime. > > It is true that there are reasons that millions of tables could > suddenly become eligible for autovacuum work with the consumption of a > single xid, but I imagine sorting the list of tables is probably the > least of the DBAs worries for that case as sorting the > tables_to_process list is going to take a tiny fraction of the time > that doing the vacuum work will take. Yes, in my last reply, I did indicate that the sort will likely not be the operation that will tip the performance over, but the catalog scan itself that I have seen not scale well as the number of relations grow ( in cases of thousands or hundreds of thousands of tables). If we are to prioritize vacuuming by M(XID), then it will be hard to avoid the catalog scan anymore in a future improvement. >TBH, I think that mindset has likely contributed quite a > bit to the fact that we've made about zero improvements in this area > despite nobody thinking that nothing needs to be done. I am not against this idea, just thinking out loud about the high relation cases I have seen in the past. -- Sami
On Fri, 24 Oct 2025 at 09:48, Sami Imseih <samimseih@gmail.com> wrote: > Yes, in my last reply, I did indicate that the sort will likely not be > the operation that will tip the performance over, but the > catalog scan itself that I have seen not scale well as the number of > relations grow ( in cases of thousands or hundreds of thousands of tables). > If we are to prioritize vacuuming by M(XID), then it will be hard to avoid the > catalog scan anymore in a future improvement. I grant you that I could see that could be a problem for a sufficiently large number of tables and small enough autovacuum_naptime, but I don't see how anything being proposed here moves the goalposts on the requirements to scan pg_class. We at least need to get the relopts from somewhere, plus reltuples, relpages, relallfrozen. We can't magic those values out of thin air. So, since nothing is changing in regards to the scan of pg_class or which columns we need to look at in that table, I don't know why we'd consider it a topic to discuss on this thread. If this thread becomes a dumping ground for unrelated problems, then nothing will be done to fix the problem at hand. David
Here is an updated patch based on the latest discussion. -- nathan
Attachments
On Wed, Oct 22, 2025 at 3:35 PM David Rowley <dgrowleyml@gmail.com> wrote: > If we had the varying sleep time as I mentioned above, the > failsafe code could even be removed as the > "autovacuum_vacuum_cost_delay / <tables score>" calculation would > effectively zero the sleep time with any table > failsafe age. I'm not sure what you mean by "the failsafe could be removed". Importantly, the failsafe will abandon all further index vacuuming. That's why it's presented as something that you as a user are not supposed to rely on. -- Peter Geoghegan
On Sat, 25 Oct 2025 at 10:14, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Oct 22, 2025 at 3:35 PM David Rowley <dgrowleyml@gmail.com> wrote: > > If we had the varying sleep time as I mentioned above, the > > failsafe code could even be removed as the > > "autovacuum_vacuum_cost_delay / <tables score>" calculation would > > effectively zero the sleep time with any table > failsafe age. > > I'm not sure what you mean by "the failsafe could be removed". > Importantly, the failsafe will abandon all further index vacuuming. > That's why it's presented as something that you as a user are not > supposed to rely on. I didn't realise it did that too. I thought it just dropped the delay to zero. In that case, I revoke the statement. David
On Sat, 25 Oct 2025 at 04:08, Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is an updated patch based on the latest discussion.
Thanks. I've just had a look at it. A few comments and questions.
1) The subtraction here looks back to front:
+ xid_age = TransactionIdIsNormal(relfrozenxid) ? relfrozenxid - recentXid : 0;
+ mxid_age = MultiXactIdIsValid(relminmxid) ? relminmxid - recentMulti : 0;
2) Would it be better to move all the code that sets the xid_score and
mxid_score to under an "if (force_vacuum)"? Those two variables could
be declared in there too.
3) Could the following be refactored a bit so we only check the "relid
!= StatisticRelationId" condition once?
+ if (relid != StatisticRelationId &&
+ classForm->relkind != RELKIND_TOASTVALUE)
Something like:
/* ANALYZE refuses to work with pg_statistic and we don't analyze
toast tables */
if (anltuples > anlthresh && relid != StatisticRelationId &&
classForm->relkind != RELKIND_TOASTVALUE)
{
*doanalyze = true;
// calc analyze score and Max with *score
}
else
*doanalyze = false;
then delete:
/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
*doanalyze = false;
4) Should these be TransactionIds?
+ uint32 xid_age;
+ uint32 mxid_age;
5) Instead of:
+ double score = 0.0;
Is it better to zero the score inside relation_needs_vacanalyze() so
it works the same as the other output parameters?
David
On Sun, Oct 26, 2025 at 02:25:48PM +1300, David Rowley wrote: > Thanks. I've just had a look at it. A few comments and questions. Thanks. > 1) The subtraction here looks back to front: > > + xid_age = TransactionIdIsNormal(relfrozenxid) ? relfrozenxid - recentXid : 0; > + mxid_age = MultiXactIdIsValid(relminmxid) ? relminmxid - recentMulti : 0; D'oh. > 2) Would it be better to move all the code that sets the xid_score and > mxid_score to under an "if (force_vacuum)"? Those two variables could > be declared in there too. Seems reasonable. > 3) Could the following be refactored a bit so we only check the "relid > != StatisticRelationId" condition once? Yes. We can update the vacuum part to follow the same pattern, too. > 4) Should these be TransactionIds? > > + uint32 xid_age; > + uint32 mxid_age; Probably. > 5) Instead of: > > + double score = 0.0; > > Is it better to zero the score inside relation_needs_vacanalyze() so > it works the same as the other output parameters? My only concern about this is that some compilers might complain about potentially-uninitialized uses. But we can still zero it in the function regardless. -- nathan
Attachments
I spent some time looking at this, and I am not sure how much this will move the goalpost, since most of the time the bottleneck for autovacuum is the limited number of workers and large tables that take a long time to process. That said, this is a good change for the simple reason that it is better to have a well-defined prioritization strategy for autovacuum than something that is somewhat random, as mentioned earlier. Just a couple of comments on v5: 1/ Should we add documentation explaining this prioritization behavior in [0]? I wrote a sql that returns the tables and scores, which I found was useful when I was testing this out, so having the actually rules spelled out in docs will actually be super useful. If we don't want to go that much in depth, at minimum the docs should say: "Autovacuum prioritizes tables based on how far they exceed their thresholds or if they are approaching wraparound limits." so a DBA can understand this behavior. 2/ * The score is calculated as the maximum of the ratios of each of the table's * relevant values to its threshold. For example, if the number of inserted * tuples is 100, and the insert threshold for the table is 80, the insert * score is 1.25. Should we consider clamping down on the score when reltuples = -1, otherwise the scores for such tables ( new tables with a large amount of ingested data ) will be over-inflated? Perhaps, if reltuples = -1 ( # of reltuples not known ), then give a score of .5, so we are not over-prioritizing but not pushing down to the bottom? [0] https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM -- Sami Imseih Amazon Web Services
On Mon, Oct 27, 2025 at 12:47:15PM -0500, Sami Imseih wrote: > 1/ Should we add documentation explaining this prioritization behavior in [0]? > > I wrote a sql that returns the tables and scores, which I found was > useful when I was testing this out, so having the actually rules spelled out > in docs will actually be super useful. Can you elaborate on how it would be useful? I'd be open to adding a short note that autovacuum attempts to prioritize the tables in a smart way, but I'm not sure I see the value of documenting every detail. I also don't want to add too much friction to future changes to the prioritization logic. > If we don't want to go that much in depth, at minimum the docs should say: > > "Autovacuum prioritizes tables based on how far they exceed their thresholds > or if they are approaching wraparound limits." so a DBA can understand > this behavior. Yeah, I would probably choose to keep it relatively vague like this. > * The score is calculated as the maximum of the ratios of each of the table's > * relevant values to its threshold. For example, if the number of inserted > * tuples is 100, and the insert threshold for the table is 80, the insert > * score is 1.25. > > Should we consider clamping down on the score when > reltuples = -1, otherwise the scores for such tables ( new tables > with a large amount of ingested data ) will be over-inflated? Perhaps, > if reltuples = -1 ( # of reltuples not known ), then give a score of .5, > so we are not over-prioritizing but not pushing down to the bottom? I'm not sure it's worth expending too much energy to deal with this. In the worst case, the table will be given an arbitrarily high priority the first time it is vacuumed, but AFAICT that's it. But that's already the case, as the thresholds will be artificially low before the first VACUUM/ANALYZE. -- nathan
> > I wrote a sql that returns the tables and scores, which I found was > > useful when I was testing this out, so having the actually rules spelled out > > in docs will actually be super useful. > > Can you elaborate on how it would be useful? I'd be open to adding a short > note that autovacuum attempts to prioritize the tables in a smart way, but > I'm not sure I see the value of documenting every detail. We discuss the threshold calculations in the documentation, and users can write scripts to monitor which tables are eligible. However, there is nothing that indicates which table autovacuum will work on next (I have been asked that question by users a few times, sometimes out of curiosity, or because they are monitoring vacuum activity and wondering when their important table will get a vacuum cycle, or if they should kick off a manual vacuum). With the scoring system, it will be much more difficult to explain, unless someone walks through the code. > I also don't > want to add too much friction to future changes to the prioritization > logic. Maybe future changes is a good reason to document the way autovacuum prioritizes, since this is a user-facing change. > > If we don't want to go that much in depth, at minimum the docs should say: > > > > "Autovacuum prioritizes tables based on how far they exceed their thresholds > > or if they are approaching wraparound limits." so a DBA can understand > > this behavior. > > Yeah, I would probably choose to keep it relatively vague like this. With all the above said, starting with something small is definitely better than nothing. > > * The score is calculated as the maximum of the ratios of each of the table's > > * relevant values to its threshold. For example, if the number of inserted > > * tuples is 100, and the insert threshold for the table is 80, the insert > > * score is 1.25. > > > > Should we consider clamping down on the score when > > reltuples = -1, otherwise the scores for such tables ( new tables > > with a large amount of ingested data ) will be over-inflated? Perhaps, > > if reltuples = -1 ( # of reltuples not known ), then give a score of .5, > > so we are not over-prioritizing but not pushing down to the bottom? > > I'm not sure it's worth expending too much energy to deal with this. In > the worst case, the table will be given an arbitrarily high priority the > first time it is vacuumed, but AFAICT that's it. But that's already the > case, as the thresholds will be artificially low before the first > VACUUM/ANALYZE. I can think of scenarios where they may be workloads that create/drops staging tables and load some data ( like batch processing ) where this may become an issue because we are now forcing such tables to the top of the list, potentially impacting other tables from getting vacuum cycles. It could happen now, but the difference with this change is we are forcing these tables to the top of the priority; based on an unknown value (pg_class.reltuples = -1). -- Sami Imseih Amazon Web Services (AWS)
The patch is starting to look good. Here's a review of v5:
1. I think the following code at the bottom of
relation_needs_vacanalyze() can be deleted. You've added the check to
ensure *doanalyze never gets set to true for pg_statistic.
/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
*doanalyze = false;
2. As #1, but in recheck_relation_needs_vacanalyze(), the following I
think can now be removed:
/* ignore ANALYZE for toast tables */
if (classForm->relkind == RELKIND_TOASTVALUE)
*doanalyze = false;
3. Would you be able to include what the idea behind the * 1.05 is in the
preceding comment?
On Tue, 28 Oct 2025 at 05:06, Nathan Bossart <nathandbossart@gmail.com> wrote:
> + effective_xid_failsafe_age = Max(vacuum_failsafe_age,
> + autovacuum_freeze_max_age * 1.05);
> + effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
> + autovacuum_multixact_freeze_max_age * 1.05);
I assume it's to workaround some strange configuration settings, but
don't know for sure, or why 1.05 is a good value.
4. I think it might be neater to format the following as 3 separate "if" tests:
> + if (force_vacuum ||
> + vactuples > vacthresh ||
> + (vac_ins_base_thresh >= 0 && instuples > vacinsthresh))
> + {
> + *dovacuum = true;
> + *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> + if (vac_ins_base_thresh >= 0)
> + *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> + }
> + else
> + *dovacuum = false;
i.e:
if (force_vacuum)
*dovacuum = true;
if (vactuples > vacthresh)
{
*dovacuum = true;
*score = Max(*score, (double) vactuples / Max(vacthresh, 1));
}
if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
{
*dovacuum = true;
*score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
}
and also get rid of all the "else *dovacuum = false;" (and *dovacuum =
false) in favour of setting those to false at the top of the function.
It's just getting harder to track that those parameters are getting
set in all cases when they're meant to be.
doing that also gets rid of the duplicative "if (vac_ins_base_thresh
>= 0)" check and also saves doing the score calc when the inputs to it
don't make sense. The current code is relying on Max always picking
the current *score when the threshold isn't met.
David
On Tue, 28 Oct 2025 at 11:35, Sami Imseih <samimseih@gmail.com> wrote: > We discuss the threshold calculations in the documentation, and users > can write scripts to monitor which tables are eligible. However, there > is nothing that indicates which table autovacuum will work on next (I > have been asked that question by users a few times, sometimes out of > curiosity, or because they are monitoring vacuum activity and wondering > when their important table will get a vacuum cycle, or if they should > kick off a manual vacuum). With the scoring system, it will be much more > difficult to explain, unless someone walks through the code. I think it's reasonable to want to document how autovacuum prioritises tables, but maybe not in too much detail. Longer term, I think it would be good to have a pg_catalog view for this which showed the relid or schema/relname, and the output values of relation_needs_vacanalyze(). If we had that and we documented that autovacuum workers work from that list, but they just may have an older snapshot of it, then that might help make the score easier to document. It would also allow people to question the scores as I expect at least some people might not agree with the priorities. That would allow us to consider tuning the score calculation if someone points out a deficiency with the current calculation. Also, longer-term, it also doesn't seem that unreasonable that the autovacuum worker might want to refresh the tables_to_process once it finishes a table and if autovacuum_naptime * $value units of time have passed since it was last checked. That would allow the worker to deal with and react accordingly when scores have changed significantly since it last checked. I mean, it might be days between when autovacuum calculates the scores and finally vacuums the table when the list is long, of it it was tied up with large tables. Other workers may have gotten to some of the tables too, so the score may have dropped, but again made its way above the threshold, but to a lesser extent. David
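The refresh condition described above could be as simple as the following sketch (hypothetical; nothing like this exists in the patch, and the helper name and multiplier are invented):

#include "postgres.h"
#include "utils/timestamp.h"

/*
 * Hypothetical check: was the work list built long enough ago that it should
 * be recomputed?  refresh_factor is an invented multiplier applied to
 * autovacuum_naptime (which is in seconds).
 */
static bool
work_list_is_stale(TimestampTz list_built_at, int naptime_secs, int refresh_factor)
{
    return TimestampDifferenceExceeds(list_built_at,
                                      GetCurrentTimestamp(),
                                      naptime_secs * refresh_factor * 1000);
}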
On Tue, Oct 28, 2025 at 11:47:08AM +1300, David Rowley wrote:
> 1. I think the following code at the bottom of
> relation_needs_vacanalyze() can be deleted. You've added the check to
> ensure *doanalyze never gets set to true for pg_statistic.
>
> /* ANALYZE refuses to work with pg_statistic */
> if (relid == StatisticRelationId)
> *doanalyze = false;
>
> 2. As #1, but in recheck_relation_needs_vacanalyze(), the following I
> think can now be removed:
>
> /* ignore ANALYZE for toast tables */
> if (classForm->relkind == RELKIND_TOASTVALUE)
> *doanalyze = false;
Removed.
> 3. Would you be able to include what the idea behind the * 1.05 in the
> preceding comment?
>
> On Tue, 28 Oct 2025 at 05:06, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> + effective_xid_failsafe_age = Max(vacuum_failsafe_age,
>> + autovacuum_freeze_max_age * 1.05);
>> + effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
>> + autovacuum_multixact_freeze_max_age * 1.05);
>
> I assume it's to workaround some strange configuration settings, but
> don't know for sure, or why 1.05 is a good value.
This is lifted from vacuum_xid_failsafe_check(). As noted in the docs, the
failsafe settings are silently limited to 105% of *_freeze_max_age. I
expanded on this in the comment atop these lines.
> 4. I think it might be neater to format the following as 3 separate "if" tests:
>
>> + if (force_vacuum ||
>> + vactuples > vacthresh ||
>> + (vac_ins_base_thresh >= 0 && instuples > vacinsthresh))
>> + {
>> + *dovacuum = true;
>> + *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
>> + if (vac_ins_base_thresh >= 0)
>> + *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
>> + }
>> + else
>> + *dovacuum = false;
>
> i.e:
>
> if (force_vacuum)
> *dovacuum = true;
>
> if (vactuples > vacthresh)
> {
> *dovacuum = true;
> *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> }
>
> if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
> {
> *dovacuum = true;
> *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> }
>
> and also get rid of all the "else *dovacuum = false;" (and *dovacuum =
> false) in favour of setting those to false at the top of the function.
> It's just getting harder to track that those parameters are getting
> set in all cases when they're meant to be.
>
> doing that also gets rid of the duplicative "if (vac_ins_base_thresh
> >= 0)" check and also saves doing the score calc when the inputs to it
> don't make sense. The current code is relying on Max always picking
> the current *score when the threshold isn't met.
Done.
--
nathan
Attachments
On Tue, Oct 28, 2025 at 12:16:28PM +1300, David Rowley wrote: > I think it's reasonable to want to document how autovacuum prioritises > tables, but maybe not in too much detail. Longer term, I think it > would be good to have a pg_catalog view for this which showed the > relid or schema/relname, and the output values of > relation_needs_vacanalyze(). If we had that and we documented that > autovacuum workers work from that list, but they just may have an > older snapshot of it, then that might help make the score easier to > document. It would also allow people to question the scores as I > expect at least some people might not agree with the priorities. That > would allow us to consider tuning the score calculation if someone > points out a deficiency with the current calculation. > > Also, longer-term, it also doesn't seem that unreasonable that the > autovacuum worker might want to refresh the tables_to_process once it > finishes a table and if autovacuum_naptime * $value units of time have > passed since it was last checked. That would allow the worker to deal > with and react accordingly when scores have changed significantly > since it last checked. I mean, it might be days between when > autovacuum calculates the scores and finally vacuums the table when > the list is long, of it it was tied up with large tables. Other > workers may have gotten to some of the tables too, so the score may > have dropped, but again made its way above the threshold, but to a > lesser extent. Agreed on both points. -- nathan
> Done. My compiler is complaining about v6 "../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on ‘*score’ may be undefined [-Wsequence-point] 3293 | *score = *score = Max(*score, (double) instuples / Max(vacinsthresh, 1)); [2/2] Linking target src/backend/postgres" shouldn't just be like below? *score =Max(*score, (double) instuples / Max(vacinsthresh, 1)); -- Sami
Hi Nathan Bossart,
> + if (vactuples > vacthresh)
> + {
> + *dovacuum = true;
> + *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> + }
> +
> + if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
> + {
> + *dovacuum = true;
> + *score = *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> + }
I think this ( *score = *score = Max(*score, (double) instuples / Max(vacinsthresh, 1)); ) must be a slip of the hand; an extra assignment was copied in.
I also suggest adding a debug log for the score:
ereport(DEBUG2,
(errmsg("autovacuum candidate: %s (score=%.3f)",
get_rel_name(table->oid), table->score)));
(errmsg("autovacuum candidate: %s (score=%.3f)",
get_rel_name(table->oid), table->score)));
> + effective_xid_failsafe_age = Max(vacuum_failsafe_age,
> + autovacuum_freeze_max_age * 1.05);
Typically, DBAs avoid setting autovacuum_freeze_max_age too close to vacuum_failsafe_age. Therefore, your logic most likely uses the vacuum_failsafe_age value.
Would taking the average of the two be a better approach?
root@localhost:/data/pgsql/pg18data# grep vacuum_failsafe_age postgresql.conf
#vacuum_failsafe_age = 1600000000
root@localhost:/data/pgsql/pg18data# grep autovacuum_freeze_max_age postgresql.conf
#autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum
Thanks
On Wed, Oct 29, 2025 at 6:45 AM Sami Imseih <samimseih@gmail.com> wrote:
> Done.
My compiler is complaining about v6
"../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on
‘*score’ may be undefined [-Wsequence-point]
3293 | *score = *score = Max(*score, (double)
instuples / Max(vacinsthresh, 1));
[2/2] Linking target src/backend/postgres"
shouldn't just be like below?
*score =Max(*score, (double) instuples / Max(vacinsthresh, 1));
--
Sami
> On Tue, Oct 28, 2025 at 12:16:28PM +1300, David Rowley wrote: > > I think it's reasonable to want to document how autovacuum prioritises > > tables, but maybe not in too much detail. Longer term, I think it > > would be good to have a pg_catalog view for this which showed the > > relid or schema/relname, and the output values of > > relation_needs_vacanalyze(). If we had that and we documented that > > autovacuum workers work from that list, but they just may have an > > older snapshot of it, then that might help make the score easier to > > document. It would also allow people to question the scores as I > > expect at least some people might not agree with the priorities. That > > would allow us to consider tuning the score calculation if someone > > points out a deficiency with the current calculation. > > > > Also, longer-term, it also doesn't seem that unreasonable that the > > autovacuum worker might want to refresh the tables_to_process once it > > finishes a table and if autovacuum_naptime * $value units of time have > > passed since it was last checked. That would allow the worker to deal > > with and react accordingly when scores have changed significantly > > since it last checked. I mean, it might be days between when > > autovacuum calculates the scores and finally vacuums the table when > > the list is long, of it it was tied up with large tables. Other > > workers may have gotten to some of the tables too, so the score may > > have dropped, but again made its way above the threshold, but to a > > lesser extent. > > Agreed on both points. I think we do need some documentation about this behavior, which v6 is still missing. Another thing I have been contemplating about is the change in prioritization and the resulting difference in the order in which tables are vacuumed is what it means for workloads in which autovacuum tuning that was done with the current assumptions will no longer be beneficial. Let's imagine staging tables that get created and dropped during some batch processing window and they see huge data ingestion/changes. The current scan will make these less of a priority naturally in relation to other permanent tables, but with the new priority, we are making these staging tables more of a priority. Users will now need to maybe turn off autovacuum on a per-table level to prevent this scenario. That is just one example. What I am also trying to say is should we provide a way, I hate to say a GUC, for users to go back to the old behavior? or am I overstating the risk here? -- Sami Imseih Amazon Web Services (AWS)
On Tue, Oct 28, 2025 at 05:44:37PM -0500, Sami Imseih wrote: > My compiler is complaining about v6 > > "../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on > ‘*score’ may be undefined [-Wsequence-point] > 3293 | *score = *score = Max(*score, (double) > instuples / Max(vacinsthresh, 1)); > [2/2] Linking target src/backend/postgres" > > shouldn't just be like below? > > *score =Max(*score, (double) instuples / Max(vacinsthresh, 1)); Oops. I fixed that typo in v7. -- nathan
Attachments
On Wed, Oct 29, 2025 at 11:10:55AM +0800, wenhui qiu wrote: > Typically, DBAs avoid setting autovacuum_freeze_max_age too close to > vacuum_failsafe_age. Therefore, your logic most likely uses the > vacuum_failsafe_age value. > Would taking the average of the two be a better approach? That approach would begin aggressively scaling the priority of tables sooner, but I don't know if that's strictly better. In any case, I'd like to avoid making the score calculation too magical. -- nathan
On Wed, Oct 29, 2025 at 10:24:17AM -0500, Sami Imseih wrote: > I think we do need some documentation about this behavior, which v6 is > still missing. Would you be interested in giving that part a try? > Another thing I have been contemplating about is the change in prioritization > and the resulting difference in the order in which tables are vacuumed > is what it means for workloads in which autovacuum tuning that was > done with the current assumptions will no longer be beneficial. > > Let's imagine staging tables that get created and dropped during > some batch processing window and they see huge data > ingestion/changes. The current scan will make these less of a priority > naturally in relation to other permanent tables, but with the new priority, > we are making these staging tables more of a priority. Users will now > need to maybe turn off autovacuum on a per-table level to prevent this > scenario. That is just one example. > > What I am also trying to say is should we provide a way, I hate > to say a GUC, for users to go back to the old behavior? or am I > overstating the risk here? It's probably worth testing out this scenario, but I can't say I'm terribly worried. Those kinds of tables are already getting chosen by autovacuum earlier due to reltuples == -1, and this patch will just move them to the front of the list that autovacuum creates. In any case, I'd really like to avoid a GUC or fallback switch here. -- nathan
Hi Nathan,
> That approach would begin aggressively scaling the priority of tables
> sooner, but I don't know if that's strictly better. In any case, I'd like
> to avoid making the score calculation too magical.
In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate feature, if a table's age still exceeds 1.x times autovacuum_freeze_max_age, it suggests that the vacuum freeze process is not functioning properly. Once the age surpasses vacuum_failsafe_age, wraparound issues are likely to occur soon. Taking the average of vacuum_failsafe_age and autovacuum_freeze_max_age is not a complex approach. Under the default configuration, this average already exceeds four times autovacuum_freeze_max_age. At that stage, a DBA should have already intervened to investigate and resolve why the table age is not decreasing.
Thanks
On Thu, Oct 30, 2025 at 12:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Oct 29, 2025 at 10:24:17AM -0500, Sami Imseih wrote:
> I think we do need some documentation about this behavior, which v6 is
> still missing.
Would you be interested in giving that part a try?
> Another thing I have been contemplating about is the change in prioritization
> and the resulting difference in the order in which tables are vacuumed
> is what it means for workloads in which autovacuum tuning that was
> done with the current assumptions will no longer be beneficial.
>
> Let's imagine staging tables that get created and dropped during
> some batch processing window and they see huge data
> ingestion/changes. The current scan will make these less of a priority
> naturally in relation to other permanent tables, but with the new priority,
> we are making these staging tables more of a priority. Users will now
> need to maybe turn off autovacuum on a per-table level to prevent this
> scenario. That is just one example.
>
> What I am also trying to say is should we provide a way, I hate
> to say a GUC, for users to go back to the old behavior? or am I
> overstating the risk here?
It's probably worth testing out this scenario, but I can't say I'm terribly
worried. Those kinds of tables are already getting chosen by autovacuum
earlier due to reltuples == -1, and this patch will just move them to the
front of the list that autovacuum creates. In any case, I'd really like to
avoid a GUC or fallback switch here.
--
nathan
On Thu, 30 Oct 2025 at 15:58, wenhui qiu <qiuwenhuifx@gmail.com> wrote: > In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate feature, if a table’s age still exceeds morethan 1.x times the autovacuum_freeze_max_age, it suggests that the vacuum freeze process is not functioning properly.Once the age surpasses vacuum_failsafe_age, wraparound issues are likely to occur soon.Taking the average of vacuum_failsafe_ageand autovacuum_freeze_max_age is not a complex approach. Under the default configuration, this averagealready exceeds four times the autovacuum_freeze_max_age. At that stage, a DBA should have already intervened to investigateand resolve why the table age is not decreasing. I don't think anyone would like to modify PostgreSQL in any way that increases the chances that a table gets as old as vacuum_failsafe_age. Regardless of the order in which tables are vacuumed, if a table gets as old as that then vacuum is configured to run too slowly, or there are not enough workers configured to cope with the given amount of work. I think we need to tackle prioritisation and rate limiting as two separate items. Nathan is proposing to improve the prioritisation in this thread and it seems to me that your concerns are with rate limiting. I've suggested an idea that might help with reducing the cost_delay based on the score of the table in this thread. I'd rather not introduce that as a topic for further discussion here (I imagine Nathan agrees). It's not as if the server is going to consume 1 billion xids in 5 mins. It's at least going to take a day to days or longer for that to happen and if autovacuum has not managed to get on top of the workload in that time, then it's configured to run too slowly and the cost_limit or delay needs to be adjusted. My concern is that there are countless problems with autovacuum and if you try and lump them all into a single thread to fix them all at once, we'll get nowhere. Autovacuum was added to core in 8.1, 20 years ago and I don't believe we've done anything to change the ratelimiting aside from reducing the default cost_delay since then. It'd be good to fix that at some point, just not here, please. FWIW, I agree with Nathan about keeping the score calculation non-magical. The score should be simple and easy to document. We can introduce complexity to it as and when it's needed and when the supporting evidence arrives, rather than from people waving their hands. David
Hi,
I think there might be some misunderstanding — I’m only suggesting changing
effective_xid_failsafe_age = Max(vacuum_failsafe_age,
autovacuum_freeze_max_age * 1.05);
to
effective_xid_failsafe_age = (vacuum_failsafe_age + autovacuum_freeze_max_age) / 2.0;
In the current logic, effective_xid_failsafe_age is almost always equal to vacuum_failsafe_age.
As a result, increasing the vacuum priority only when a table’s age reaches vacuum_failsafe_age is too late.
Thanks
On Thu, Oct 30, 2025 at 11:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 30 Oct 2025 at 15:58, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate feature, if a table’s age still exceeds more than 1.x times the autovacuum_freeze_max_age, it suggests that the vacuum freeze process is not functioning properly. Once the age surpasses vacuum_failsafe_age, wraparound issues are likely to occur soon.Taking the average of vacuum_failsafe_age and autovacuum_freeze_max_age is not a complex approach. Under the default configuration, this average already exceeds four times the autovacuum_freeze_max_age. At that stage, a DBA should have already intervened to investigate and resolve why the table age is not decreasing.
I don't think anyone would like to modify PostgreSQL in any way that
increases the chances that a table gets as old as vacuum_failsafe_age.
Regardless of the order in which tables are vacuumed, if a table gets
as old as that then vacuum is configured to run too slowly, or there
are not enough workers configured to cope with the given amount of
work. I think we need to tackle prioritisation and rate limiting as
two separate items. Nathan is proposing to improve the prioritisation
in this thread and it seems to me that your concerns are with rate
limiting. I've suggested an idea that might help with reducing the
cost_delay based on the score of the table in this thread. I'd rather
not introduce that as a topic for further discussion here (I imagine
Nathan agrees). It's not as if the server is going to consume 1
billion xids in 5 mins. It's at least going to take a day to days or
longer for that to happen and if autovacuum has not managed to get on
top of the workload in that time, then it's configured to run too
slowly and the cost_limit or delay needs to be adjusted.
My concern is that there are countless problems with autovacuum and if
you try and lump them all into a single thread to fix them all at
once, we'll get nowhere. Autovacuum was added to core in 8.1, 20 years
ago and I don't believe we've done anything to change the ratelimiting
aside from reducing the default cost_delay since then. It'd be good to
fix that at some point, just not here, please.
FWIW, I agree with Nathan about keeping the score calculation
non-magical. The score should be simple and easy to document. We can
introduce complexity to it as and when it's needed and when the
supporting evidence arrives, rather than from people waving their
hands.
David
On Thu, 30 Oct 2025 at 19:48, wenhui qiu <qiuwenhuifx@gmail.com> wrote: > I think there might be some misunderstanding — I’m only suggesting changing > effective_xid_failsafe_age = Max(vacuum_failsafe_age, > autovacuum_freeze_max_age * 1.05); > to > effective_xid_failsafe_age = (vacuum_failsafe_age + autovacuum_freeze_max_age) / 2.0; > In the current logic, effective_xid_failsafe_age is almost always equal to vacuum_failsafe_age. > As a result, increasing the vacuum priority only when a table’s age reaches vacuum_failsafe_age is too late. I understand your proposal. The autovacuum will trigger for the wraparound at autovacuum_freeze_max_age, so for autovacuum still not to have gotten to the table by the time the table is aged at vacuum_failsafe_age, it means autovacuum isn't working quickly enough to get through the workload, therefore the problem is with the speed of autovacuum not the priority of autovacuum. David
On Wed, Oct 29, 2025 at 11:51 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Oops. I fixed that typo in v7. Are you planning to do some practical experimentation with this? I feel like it would be a good idea to set up some kind of a test case where this is expected to provide a benefit and see if it actually does; and also maybe set up a test case where it will reorder the tables but with no practical difference in the outcome expected and verify that, in fact, nothing changes. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 30, 2025 at 04:05:19PM -0400, Robert Haas wrote: > Are you planning to do some practical experimentation with this? I > feel like it would be a good idea to set up some kind of a test case > where this is expected to provide a benefit and see if it actually > does; and also maybe set up a test case where it will reorder the > tables but with no practical difference in the outcome expected and > verify that, in fact, nothing changes. Yes. I've been thinking through how I want to test this but have yet to actually do so. If you have ideas, I'm all ears. -- nathan
> On Thu, Oct 30, 2025 at 04:05:19PM -0400, Robert Haas wrote: > > Are you planning to do some practical experimentation with this? I > > feel like it would be a good idea to set up some kind of a test case > > where this is expected to provide a benefit and see if it actually > > does; and also maybe set up a test case where it will reorder the > > tables but with no practical difference in the outcome expected and > > verify that, in fact, nothing changes. > > Yes. I've been thinking through how I want to test this but have yet to > actually do so. If you have ideas, I'm all ears. FWIW, I've been putting some scripts together to test some workloads and I will share shortly what I have. -- Sami Imseih Amazon Web Services (AWS)
> FWIW, I've been putting some scripts together to test some workloads > and I will share shortly what I have. Here is my attempt to test the behavior with the new prioritization. I wanted a way to run the same tests with different workloads, both with and without the prioritization patch, and to see if anything stands out as suspicious in terms of autovacuum or autoanalyze activity. For example, certain tables showing too little or too much autovacuum activity. The scripts I put together (attached) run a busy update workload (OLTP) and a separate batch workload. They use pgbench to execute custom scripts that are generated on the fly. The results are summarized by the average number of autovacuum and autoanalyze runs *per table*, along with some other DML activity stats to ensure that the workloads being compared have similar DML activity. Using the scripts: Place the attached scripts in a specific directory, and modify the section under "Caller should adjust these values" in run_workloads.sh to adjust the workload. The scripts assume you have a running cluster with your specific config file adjusted for the test. Once ready, call run_workloads.sh and at the end a summary will show up as you see below. Hopefully it works for you :) The summary.sh script can also be run while the workloads are executing. Here is a example of a test I wanted to run based on the discussion [0]: This scenario is one that was mentioned, but there are others in which a batch process performing inserts only is prioritized over the update workload. I ran this test for 10 minutes, using 200 clients for the update workload and 5 clients for the batch workload, with the following configuration: ``` max_connections=1000; autovacuum_naptime = '10s' shared_buffers = '4GB' autovacuum_max_workers = 6 ``` -- HEAD ``` Total Activity -[ RECORD 1 ]-------------+---------- total_n_dead_tup | 985183 total_n_mod_since_analyze | 220294866 total_reltuples | 247690373 total_autovacuum_count | 137 total_autoanalyze_count | 470 total_n_tup_upd | 7720012 total_n_tup_ins | 446683000 table_count | 105 Activity By Workload Type -[ RECORD 1 ]-----------------+---------------- table_group | batch_tables ** avg_autovacuum_count | 7.400 ** avg_autoanalyze_count | 8.000 avg_vacuum_count | 0.000 avg_analyze_count | 0.000 rows_inserted | 436683000 rows_updated | 0 rows_hot_updated | 0 table_count | 5 -[ RECORD 2 ]-----------------+---------------- table_group | numbered_tables ** avg_autovacuum_count | 1.000 ** avg_autoanalyze_count | 4.300 avg_vacuum_count | 1.000 avg_analyze_count | 0.000 rows_inserted | 10000000 rows_updated | 7720012 rows_hot_updated | 7094573 table_count | 100 ``` -- with v7 applied ``` Total Activity -[ RECORD 1 ]-------------+---------- total_n_dead_tup | 1233045 total_n_mod_since_analyze | 137843507 total_reltuples | 350704437 total_autovacuum_count | 146 total_autoanalyze_count | 605 total_n_tup_upd | 7896354 total_n_tup_ins | 487974000 table_count | 105 Activity By Workload Type -[ RECORD 1 ]-----------------+---------------- table_group | batch_tables ** avg_autovacuum_count | 11.000 ** avg_autoanalyze_count | 13.200 avg_vacuum_count | 0.000 avg_analyze_count | 0.000 rows_inserted | 477974000 rows_updated | 0 rows_hot_updated | 0 table_count | 5 -[ RECORD 2 ]-----------------+---------------- table_group | numbered_tables ** avg_autovacuum_count | 0.910 ** avg_autoanalyze_count | 5.390 avg_vacuum_count | 1.000 avg_analyze_count | 0.000 rows_inserted | 10000000 rows_updated | 7896354 rows_hot_updated | 7123134 
table_count | 100 ``` The results above show what I expected: the batch tables receive higher priority, as seen from the averages of autovacuum and autoanalyze runs. This behavior is expected, but it may catch some users by surprise after an upgrade, since certain tables will now receive more attention than others. Longer tests might also show more bloat accumulating on heavily updated tables. In such cases, a user may need to adjust autovacuum settings on a per-table basis to restore the previous behavior. So, I am not quite sure what is the best way to test except for trying to find these non steady state workloads and see the impact of the prioritization change to (auto)vacuum/analyze activity . Maybe there is a better way? [0] https://www.postgresql.org/message-id/aQI7tGEs8IOPxG64%40nathan -- Sami Imseih Amazon Web Services (AWS)
Attachments
On Thu, Oct 30, 2025 at 07:38:15PM -0500, Sami Imseih wrote: > Here is my attempt to test the behavior with the new prioritization. Thanks. > The results above show what I expected: the batch tables receive higher > priority, as seen from the averages of autovacuum and autoanalyze runs. > This behavior is expected, but it may catch some users by surprise after > an upgrade, since certain tables will now receive more attention than > others. Longer tests might also show more bloat accumulating on heavily > updated tables. In such cases, a user may need to adjust autovacuum > settings on a per-table basis to restore the previous behavior. Interesting. From these results, it almost sounds as if we're further amplifying the intended effect of commit 06eae9e. That could be a good thing. Something else I'm curious about is datfrozenxid, i.e., whether prioritization keeps the database (M)XID ages lower. -- nathan
On Sat, 1 Nov 2025 at 09:12, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Oct 30, 2025 at 07:38:15PM -0500, Sami Imseih wrote: > > The results above show what I expected: the batch tables receive higher > > priority, as seen from the averages of autovacuum and autoanalyze runs. > > This behavior is expected, but it may catch some users by surprise after > > an upgrade, since certain tables will now receive more attention than > > others. Longer tests might also show more bloat accumulating on heavily > > updated tables. In such cases, a user may need to adjust autovacuum > > settings on a per-table basis to restore the previous behavior. > > Interesting. From these results, it almost sounds as if we're further > amplifying the intended effect of commit 06eae9e. That could be a good > thing. Something else I'm curious about is datfrozenxid, i.e., whether > prioritization keeps the database (M)XID ages lower. I wonder if it would be more realistic to throttle the work simulation to a certain speed with pgbench -R rather than having it go flat out. The results show that quite a bit higher "rows_inserted" for the batch_tables with the patched version. Sami didn't mention any changes to vacuum_cost_limit, so I suspect that autovacuum would be getting quite behind on this run, which isn't ideal. Rate limiting to something that the given vacuum_cost_limit could keep up with seems more realistic. The fact that the patched version did more insert work in the batch tables does seem a bit unfair as that gave autovacuum more work to do in the patched test run which would result in the lower-scoring tables being neglected more in the patched version. This makes me wonder if we should log the score of the table when the autovacuum starts for the table. We do calculate the score again in recheck_relation_needs_vacanalyze() just before doing the vacuum/analyze, so maybe the score can be stored in the autovac_table struct and displayed somewhere. Maybe along with the log_autovacuum_min_duration / log_autoanalyze_min_duration would be useful. It might be good in there for DBA analysis to give some visibility on how bad things got before autovacuum got around to working on a given table. If we logged the score, we could do the "unpatched" test with the patched code, just with commenting out the list_sort(tables_to_process, TableToProcessComparator); It'd then be interesting to zero the log_auto*_min_duration settings and review the order differences and how high the scores got. Would the average score be higher or lower with patched version? I'd guess lower since the higher scoring tables would tend to get vacuumed later with the unpatched version and their score would be even higher by the time autovacuum got to them. I think if the average score has gone down at the point that the vacuum starts, then that's a very good thing. Maybe we'd need to write a patch to recalculate the "tables_to_process" List after a table is vacuumed and autovacuum_naptime has elapsed for us to see this, else the priorities might have become too outdated. I'd expect that to be even more true when vacuum_cost_limit is configured too low. David
On Sat, 1 Nov 2025 at 14:50, David Rowley <dgrowleyml@gmail.com> wrote:
> If we logged the score, we could do the "unpatched" test with the
> patched code, just with commenting out the
> list_sort(tables_to_process, TableToProcessComparator); It'd then be
> interesting to zero the log_auto*_min_duration settings and review the
> order differences and how high the scores got. Would the average score
> be higher or lower with patched version? I'd guess lower since the
> higher scoring tables would tend to get vacuumed later with the
> unpatched version and their score would be even higher by the time
> autovacuum got to them. I think if the average score has gone down at
> the point that the vacuum starts, then that's a very good thing. Maybe
> we'd need to write a patch to recalculate the "tables_to_process" List
> after a table is vacuumed and autovacuum_naptime has elapsed for us to
> see this, else the priorities might have become too outdated. I'd
> expect that to be even more true when vacuum_cost_limit is configured
> too low.
I'm not yet sure how meaningful it is, but I tried adding the
following to recheck_relation_needs_vacanalyze():
elog(LOG, "Performing autovacuum of table \"%s\" with score = %f",
get_rel_name(relid), score);
then after grepping the logs and loading the data into a table and performing:
select case patched when true then 'v7' else 'master' end as
patched,case when left(tab, 11) = 'table_batch' then 'table_batch_*'
when left(tab,6) = 'table_' then 'table_*' else 'other' end,
avg(score) as avg_Score,count(*) as count from autovac where score>0
and score<2000 group by rollup(1,2) order by 2,1;
with vacuum_cost_limit = 5000, I got:
patched | case | avg_score | count
---------+---------------+--------------------+-------
master | other | 2.004997014705882 | 68
v7 | other | 1.9668087323943668 | 71
master | table_* | 1.196698981375357 | 1396
v7 | table_* | 1.2134741693430646 | 1370
master | table_batch_* | 2.1887380086206902 | 116
v7 | table_batch_* | 1.8882025693430664 | 137
master | | 1.3043197367088595 | 1580
v7 | | 1.3059485323193893 | 1578
| | 1.3051336187460454 | 3158
It would still be good to do the rate limiting as there's more work
being done in the patched version. Seems to be about 1.1% more rows in
batch_tables and 0.48% more updates in the numbered_tables in the
patched version.
David
Thanks for the ideas on improving the test! I am still trying to see how useful this type of testing is, but I will share what I have done. > I wonder if it would be more realistic to throttle the work simulation > to a certain speed with pgbench -R rather than having it go flat out. good point > > If we logged the score, we could do the "unpatched" test with the > > patched code, just with commenting out the > > list_sort(tables_to_process, TableToProcessComparator); It'd then be > > interesting to zero the log_auto*_min_duration settings and review the > > order differences and how high the scores got. Would the average score > > be higher or lower with patched version? I agree. I attached a patch on top of v7 that implements a debug GUC to enable or disable sorting for testing purposes. > I'm not yet sure how meaningful it is, but I tried adding the > following to recheck_relation_needs_vacanalyze(): > > elog(LOG, "Performing autovacuum of table \"%s\" with score = %f", > get_rel_name(relid), score); The same attached patch also implements this log. I also spent more time working on the test script. I cleaned it up and combined it into a single script. I added a few things: - Ability to run with or without the batch workload. - OLTP tables are no longer the same size; they are created with different row counts using a minimum and maximum row count and a multiplier for scaling the next table. - A background collector for pg_stat_all_tables on relevant tables, stored in relstats_monitor.log. - Logs are saved after the run for further analysis, such as examining the scores. Also attached is analysis for a run with 16 OLTP tables and 3 batch tables. It shows that with sorting enabled or disabled, the vacuum/analyze activity does not show any major differences. OLTP had very similar DML and autovacuum/autoanalyze activity. A few points to highlight: 1/ In the sorted run, we had an equal number of autovacuums/autoanalyze on the smaller OLTP tables, as if every eligible table needed both autovacuum and autoanalyze. The unsorted run was less consistent on the smaller tables. I observed this on several runs. I don't think it's a big deal, but interesting nonetheless. 2/ Batch tables in the sorted run had less autovacuum time (1,257,821 vs 962,794 ms), but very similar autovacuum counts. 3/ OLTP tables, on the other hand, had more autovacuum time in the sorted run (3,590,964 vs 3,852,460 ms), but I do not see much difference in autovacuum/autoanalyze counts. Other tests I plan on running: - batch updates/deletes, since the current batch option only tests append-only tables. - OLTP only test. Also, I am thinking about another sorting strategy based on average autovacuum/autoanalyze time per table. The idea is to sort ascending by the greater of the two averages, so workers process quicker tables first instead of all workers potentially getting hung on the slowest tables. We can calculate the average now that v18 includes total_autovacuum_time and total_autoanalyze time. The way I see it, regardless of prioritization, a few large tables may still monopolize autovacuum workers. But at least this way, the quick tables get a chance to get processed first. Will this be an idea worth testing out? -- Sami Imseih Amazon Web Services (AWS)
Attachments
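As a rough illustration of the ordering described above (this is not code
from any posted patch), the following standalone C sketch sorts a handful
of hypothetical tables ascending by the greater of their average
autovacuum and autoanalyze times. The struct, table names, and numbers are
invented; in practice the averages would be derived from
total_autovacuum_time / autovacuum_count and total_autoanalyze_time /
autoanalyze_count in pg_stat_all_tables.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-table stats; real values would come from pg_stat_all_tables. */
typedef struct TableAvg
{
    const char *relname;
    double      avg_autovacuum_ms;   /* total_autovacuum_time / autovacuum_count */
    double      avg_autoanalyze_ms;  /* total_autoanalyze_time / autoanalyze_count */
} TableAvg;

static double
table_cost(const TableAvg *t)
{
    /* Use the slower of the two maintenance operations as the table's "cost". */
    return (t->avg_autovacuum_ms > t->avg_autoanalyze_ms) ?
        t->avg_autovacuum_ms : t->avg_autoanalyze_ms;
}

/* Ascending: cheap (fast-to-process) tables come first. */
static int
avg_time_cmp(const void *a, const void *b)
{
    double ca = table_cost((const TableAvg *) a);
    double cb = table_cost((const TableAvg *) b);

    if (ca < cb)
        return -1;
    if (ca > cb)
        return 1;
    return 0;
}

int
main(void)
{
    TableAvg    tables[] = {
        {"batch_1", 900000.0, 120000.0},
        {"oltp_small", 1500.0, 800.0},
        {"oltp_medium", 45000.0, 9000.0},
    };
    size_t      ntables = sizeof(tables) / sizeof(tables[0]);

    qsort(tables, ntables, sizeof(TableAvg), avg_time_cmp);

    /* Prints oltp_small, oltp_medium, batch_1: quick tables first. */
    for (size_t i = 0; i < ntables; i++)
        printf("%s\n", tables[i].relname);
    return 0;
}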
On Fri, 7 Nov 2025 at 11:21, Sami Imseih <samimseih@gmail.com> wrote:
> Also, I am thinking about another sorting strategy based on average
> autovacuum/autoanalyze time per table. The idea is to sort ascending by
> the greater of the two averages, so workers process quicker tables first
> instead of all workers potentially getting hung on the slowest tables.
> We can calculate the average now that v18 includes total_autovacuum_time
> and total_autoanalyze_time.
>
> The way I see it, regardless of prioritization, a few large tables may
> still monopolize autovacuum workers. But at least this way, the quick tables
> get a chance to get processed first. Will this be an idea worth testing out?

This sounds like a terrible idea to me. It'll mean any table that starts
taking longer due to autovacuum neglect will have its priority dropped for
next time, which will result in further neglect. If vacuum_cost_limit is
too low, then the tables in need of vacuum the most could end up last in
the queue. I also don't see how you'd handle the fact that analyze is
likely to be faster than vacuum. Tables that only need an analyze would
just come last with no regard for how outdated the statistics are?

I'm confused at why we'd have set up our autovacuum trigger points as they
are today because we think those are good times to do a vacuum/analyze,
but then prioritise on something completely different. Surely if we think
20% dead tuples is worth a vacuum, we must therefore think that 40% dead
tuples are even more worthwhile?! I just cannot comprehend why we'd
deviate from making the priority the percentage over the trigger point
here. If we come to the conclusion that we want something else, then maybe
our trigger point threshold method also needs to be redefined. There
certainly have been complaints about 20% of a huge table being too much
(I guess autovacuum_vacuum_max_threshold is our answer to trying to fix
that one).

David
> On Fri, 7 Nov 2025 at 11:21, Sami Imseih <samimseih@gmail.com> wrote:
> > Also, I am thinking about another sorting strategy based on average
> > autovacuum/autoanalyze time per table. The idea is to sort ascending by
> > the greater of the two averages, so workers process quicker tables first
> > instead of all workers potentially getting hung on the slowest tables.
> > We can calculate the average now that v18 includes total_autovacuum_time
> > and total_autoanalyze_time.
> >
> > The way I see it, regardless of prioritization, a few large tables may
> > still monopolize autovacuum workers. But at least this way, the quick tables
> > get a chance to get processed first. Will this be an idea worth testing out?
>
> This sounds like a terrible idea to me. It'll mean any table that
> starts taking longer due to autovacuum neglect will have its priority
> dropped for next time which will result in further neglect.

yes, that is a possibility, but I am not sure how we can actually avoid
these scenarios. The flip side is that we are giving the eligible fast
tables more of a chance to get vacuumed, rather than being backed up
because workers are all occupied with the larger tables.

> If vacuum_cost_limit is too low, then the tables in need of vacuum the
> most could end up last in the queue. I also don't see how you'd handle
> the fact that analyze is likely to be faster than vacuum. Tables that
> only need an analyze would just come last with no regard for how
> outdated the statistics are?

In the "doanalyze" case only, we will look at the average autoanalyze
time, which will push these types of tables to the front of the queue, not
the last.

> I'm confused at why we'd have set up our autovacuum trigger points as
> they are today because we think those are good times to do a
> vacuum/analyze, but then prioritise on something completely different.
> Surely if we think 20% dead tuples is worth a vacuum, we must
> therefore think that 40% dead tuples are even more worthwhile?!

Sure, but thresholds alone don't indicate anything about how quickly the
table can be vacuumed, # of indexes, per-table a/v settings, etc. The
average a/v time is a good proxy to determine this.

What I am suggesting here is that we think beyond thresholds for
prioritization, and give a chance for more eligible tables to get
autovacuumed rather than workers being saturated on some of the
slowest-to-vacuum tables.

--
Sami Imseih
Amazon Web Services (AWS)
On Sat, 8 Nov 2025 at 08:23, Sami Imseih <samimseih@gmail.com> wrote:
> > I'm confused at why we'd have set up our autovacuum trigger points as
> > they are today because we think those are good times to do a
> > vacuum/analyze, but then prioritise on something completely different.
> > Surely if we think 20% dead tuples is worth a vacuum, we must
> > therefore think that 40% dead tuples are even more worthwhile?!
>
> Sure, but thresholds alone don't indicate anything about how quickly
> the table can be vacuumed, # of indexes, per-table a/v settings, etc.
> The average a/v time is a good proxy to determine this.
>
> What I am suggesting here is that we think beyond thresholds for
> prioritization, and give a chance for more eligible tables to get
> autovacuumed rather than workers being saturated on some
> of the slowest-to-vacuum tables.

Can you define "more eligible" here?

I think I'm not really grasping this because I don't understand why
faster-to-vacuum tables should be prioritised over slower-to-vacuum
tables. Can you explain why you think this is important? I do understand
that in your script the OLTP tables received less attention than
unpatched, but it wasn't obvious to me why this was an issue.

If it's a case of autovacuum acting on a stale score after it obtained the
list of tables and their scores, do things look different if we have the
autovacuum worker refresh the list and scores after it's done with a table
and autovacuum_naptime has elapsed since the list was last refreshed?

David
Still catching up on the latest discussion, but here's a v8 patch that
amends the DEBUG3 in relation_needs_vacanalyze() to also log the score. I
might attempt to add some sort of brief documentation about autovacuum
prioritization next.

From skimming the latest discussion, I gather we might want to consider
re-sorting the list periodically. Is the idea that we'll re-sort the
remaining tables in the list, or that we'll basically restart
do_autovacuum()? If it's the latter, then we'll need to come up with some
way to decide when to stop for the current database. Right now, we just go
through pg_class and call it a day.

--
nathan
Attachments
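The scoring formula used by the patch is not reproduced in this thread,
but as a rough standalone illustration of the "percentage over the trigger
point" idea discussed above, a score can be thought of as the ratio of a
table's observed counter to the threshold that makes it eligible. The
sketch below uses hypothetical names and numbers (it is not the patch's
code) and shows how 40% dead tuples would score roughly twice as high as
20% on the same table.

#include <stdio.h>

/*
 * Hypothetical score: how far past its trigger point a table is.
 * vacthresh mirrors autovacuum_vacuum_threshold +
 * autovacuum_vacuum_scale_factor * reltuples; a real implementation might
 * combine several such ratios (dead tuples, inserts, analyze, XID age).
 */
static double
vacuum_score(double dead_tuples, double reltuples,
             double vac_base_thresh, double vac_scale_factor)
{
    double vacthresh = vac_base_thresh + vac_scale_factor * reltuples;

    return dead_tuples / vacthresh;  /* >= 1.0 means eligible; higher is more urgent */
}

int
main(void)
{
    /* 20% dead vs 40% dead on the same table, default-ish settings. */
    printf("20%% dead: score %.2f\n",
           vacuum_score(200000, 1000000, 50, 0.2));
    printf("40%% dead: score %.2f\n",
           vacuum_score(400000, 1000000, 50, 0.2));
    return 0;
}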
On Tue, Nov 11, 2025 at 11:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Still catching up on the latest discussion, but here's a v8 patch that
> amends the DEBUG3 in relation_needs_vacanalyze() to also log the score. I
> might attempt to add some sort of brief documentation about autovacuum
> prioritization next.
>
> From skimming the latest discussion, I gather we might want to consider
> re-sorting the list periodically. Is the idea that we'll re-sort the
> remaining tables in the list, or that we'll basically restart
> do_autovacuum()? If it's the latter, then we'll need to come up with some
> way to decide when to stop for the current database. Right now, we just go
> through pg_class and call it a day.
>

FWIW, when I have built these types of systems in the past, and when I
wanted an aggressive recheck-type mechanism, the most common methods
involved tying it to autovacuum_max_workers. This usually was done under
the assumption that generating the list was relatively cheap and that
higher xid age would generate higher priority candidates. Of course I also
was biased towards having it be user controllable at the database level
(i.e., no need to modify some control file or cron job or whatever). To
the degree those things are aligned here, there is at least some anecdata
that this is a usable setting.

Robert Treat
https://xzilla.net
On Tue, Nov 11, 2025 at 02:43:19PM -0500, Robert Treat wrote:
> FWIW, when I have built these types of systems in the past, and when I
> wanted an aggressive recheck-type mechanism, the most common methods
> involved tying it to autovacuum_max_workers.

Would you mind elaborating on this point? Do you mean that you'd rebuild
the list every a_m_w tables, or something else?

--
nathan
On Tue, Nov 11, 2025 at 2:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Nov 11, 2025 at 02:43:19PM -0500, Robert Treat wrote:
> > FWIW, when I have built these types of systems in the past, and when I
> > wanted an aggressive recheck-type mechanism, the most common methods
> > involved tying it to autovacuum_max_workers.
>
> Would you mind elaborating on this point? Do you mean that you'd rebuild
> the list every a_m_w tables, or something else?
>

Yes.

Robert Treat
https://xzilla.net
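If that reading is right, the recheck interval being described amounts to
something like the following standalone sketch. It is purely illustrative:
the counter, its placement, and the hard-coded value of 3 (the default for
autovacuum_max_workers) are invented rather than taken from any patch.

#include <stdio.h>

int
main(void)
{
    int     autovacuum_max_workers = 3;  /* default value of the GUC */
    int     tables_done = 0;

    /* Stand-in for the worker's per-table loop. */
    for (int i = 0; i < 10; i++)
    {
        /* ... vacuum/analyze one table here ... */
        if (++tables_done % autovacuum_max_workers == 0)
            printf("after table %d: rebuild and re-sort the list\n",
                   tables_done);
    }
    return 0;
}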
On Wed, 12 Nov 2025 at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
> From skimming the latest discussion, I gather we might want to consider
> re-sorting the list periodically. Is the idea that we'll re-sort the
> remaining tables in the list, or that we'll basically restart
> do_autovacuum()? If it's the latter, then we'll need to come up with some
> way to decide when to stop for the current database. Right now, we just go
> through pg_class and call it a day.
I'm still trying to work out what Sami sees in the results that he
doesn't think is good. I resuggested he try coding up the periodic
refresh-the-list code to see if it makes the thing he sees better. I
was hoping that we could get away with not doing that for stage 1 of
this. My concern there is that these change-the-way-autovacuum-works
patches seem to blow up quickly as everyone chips in with autovacuum
problems they want fixed and expect the patch to do it all.
That said, the periodic refresh probably isn't too hard. I suspected
it was something like:
    /* when enough time has passed, refresh the list to ensure the
       scores aren't too out-of-date */
    if (time is > lastcheck + autovacuum_naptime * <something>)
    {
        list_free_deep(tables_to_process);
        goto the_top;
    }
} // end of foreach(cell, tables_to_process)
Perhaps if the test cases we're going to give this involve lengthy
autovacuum runs, then we might need that patch sooner. I'm uncertain
if that's the case with Sami's test. There were some 50GB tables, so I
imagine some of the runs could take a long time, especially when
running with the standard vacuum_cost_limit.
David
On Wed, Nov 12, 2025 at 09:03:54AM +1300, David Rowley wrote:
> I'm still trying to work out what Sami sees in the results that he
> doesn't think is good. I resuggested he try coding up the periodic
> refresh-the-list code to see if it makes the thing he sees better. I
> was hoping that we could get away with not doing that for stage 1 of
> this. My concern there is that these change-the-way-autovacuum-works
> patches seem to blow up quickly as everyone chips in with autovacuum
> problems they want fixed and expect the patch to do it all.
+1
> That said, the periodic refresh probably isn't too hard. I suspected
> it was something like:
>
>     /* when enough time has passed, refresh the list to ensure the
>        scores aren't too out-of-date */
>     if (time is > lastcheck + autovacuum_naptime * <something>)
>     {
>         list_free_deep(tables_to_process);
>         goto the_top;
>     }
> } // end of foreach(cell, tables_to_process)
My concern is that this might add already-processed tables back to the
list, so a worker might never be able to clear it. Maybe that's not a real
problem in practice for some reason, but it does feel like a step too far
for stage 1, as you said above.
--
nathan
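To make the shape of that trade-off more concrete, here is a small
standalone C simulation (not PostgreSQL code; the stubs, OIDs, and naptime
constant are all invented) of a worker loop that rebuilds its list once
the naptime has elapsed. Nothing in it remembers what has already been
processed in the current pass, so a table that immediately re-qualifies
would come straight back after the rebuild, which is the concern raised
above.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NAPTIME_SECS 60                 /* stand-in for autovacuum_naptime */

/* Hypothetical stand-ins; the real code walks pg_class and pgstat data. */
static int
build_table_list(int *oids)
{
    for (int i = 0; i < 5; i++)
        oids[i] = 16384 + i;            /* fake table OIDs */
    return 5;                           /* pretend the list is already sorted by score */
}

static bool
still_needs_vacuum(int oid)
{
    (void) oid;                         /* unused in this toy */
    return true;                        /* recheck step, as the current code does */
}

static void
vacuum_one_table(int oid)
{
    printf("vacuuming table %d\n", oid);
}

static void
worker_loop(void)
{
    int     oids[1024];
    int     ntables;
    time_t  lastcheck;

rebuild:
    ntables = build_table_list(oids);   /* gather, score, and sort */
    lastcheck = time(NULL);

    for (int i = 0; i < ntables; i++)
    {
        /*
         * When enough time has passed, refresh the list so the scores
         * aren't too out of date.  Note that nothing tracks which OIDs
         * were already processed, so a table that immediately
         * re-qualifies comes back at the next rebuild.
         */
        if (time(NULL) > lastcheck + NAPTIME_SECS)
            goto rebuild;

        if (still_needs_vacuum(oids[i]))
            vacuum_one_table(oids[i]);
    }
}

int
main(void)
{
    worker_loop();
    return 0;
}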
On Tue, Nov 11, 2025 at 02:50:55PM -0500, Robert Treat wrote:
> On Tue, Nov 11, 2025 at 2:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Tue, Nov 11, 2025 at 02:43:19PM -0500, Robert Treat wrote:
>> > FWIW, when I have built these types of systems in the past, and when I
>> > wanted an aggressive recheck-type mechanism, the most common methods
>> > involved tying it to autovacuum_max_workers.
>>
>> Would you mind elaborating on this point? Do you mean that you'd rebuild
>> the list every a_m_w tables, or something else?
>
> Yes.

Interesting. With our defaults, that would mean rebuilding the list every
few tables, which seems quite aggressive. I'd start worrying about the
pg_class scanning overhead a little...

--
nathan
> On Sat, 8 Nov 2025 at 08:23, Sami Imseih <samimseih@gmail.com> wrote:
> > > I'm confused at why we'd have set up our autovacuum trigger points as
> > > they are today because we think those are good times to do a
> > > vacuum/analyze, but then prioritise on something completely different.
> > > Surely if we think 20% dead tuples is worth a vacuum, we must
> > > therefore think that 40% dead tuples are even more worthwhile?!
> >
> > Sure, but thresholds alone don't indicate anything about how quickly
> > the table can be vacuumed, # of indexes, per-table a/v settings, etc.
> > The average a/v time is a good proxy to determine this.
> >
> > What I am suggesting here is that we think beyond thresholds for
> > prioritization, and give a chance for more eligible tables to get
> > autovacuumed rather than workers being saturated on some
> > of the slowest-to-vacuum tables.
>
> Can you define "more eligible" here?

What I mean by "more eligible" is that once a worker has its list of
tables that meet the autovacuum thresholds, it’s trying to get through as
many of them as possible within some time window. If the workers always go
after the slowest tables first, they’ll spend most of that time on just a
few heavy ones, and a lot of other eligible tables might end up waiting
much longer to get processed. Eventually the slow tables will be the
bottleneck anyway.

> I think I'm not really grasping this because I don't understand why
> faster-to-vacuum tables should be prioritised over slower-to-vacuum
> tables. Can you explain why you think this is important?

The thing I’m hoping to address is something I’ve seen many times in
practice. Autovacuum workers can get stuck on specific large or slow
tables, and when that happens, users often end up running manual vacuums
on those tables just to keep things moving for the smaller/faster vacuumed
tables.

Now, I am not so sure any type of autovacuum prioritization could actually
help in these cases. What does help is adding more autovacuum workers.

> if we have the autovacuum worker refresh the list and scores after
> it's done with a table and autovacuum_naptime has elapsed since the
> list was last refreshed?

That is an interesting idea, but refreshing the list that often may not be
such a good idea; it could be quite expensive on large catalogs.

--
Sami Imseih
Amazon Web Services (AWS)
On Wed, 12 Nov 2025 at 09:13, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Nov 12, 2025 at 09:03:54AM +1300, David Rowley wrote:
> > /* when enough time has passed, refresh the list to ensure the
> > scores aren't too out-of-date */
> > if (time is > lastcheck + autovacuum_naptime * <something>)
> > {
> > list_free_deep(tables_to_process);
> > goto the_top;
> > }
> > } // end of foreach(cell, tables_to_process)
>
> My concern is that this might add already-processed tables back to the
> list, so a worker might never be able to clear it. Maybe that's not a real
> problem in practice for some reason, but it does feel like a step too far
> for stage 1, as you said above.
Oh, that's a good point. That's a very valid concern. I guess that
could be fixed with a hashtable of vacuumed tables and skipping tables
that exist in there, but the problem with that is that the table might
genuinely need to be vacuumed again. It's a bit tricky to know when a
2nd vacuum is a legit requirement and when it's not. Figuring that out
might be more logic than this code wants to know about.
David
On Wed, 12 Nov 2025 at 09:25, Sami Imseih <samimseih@gmail.com> wrote:
> The thing I’m hoping to address is something I’ve seen many times in practice.
> Autovacuum workers can get stuck on specific large or slow tables, and when
> that happens, users often end up running manual vacuums on those tables
> just to keep things moving for the smaller/faster vacuumed tables.
>
> Now, I am not so sure any type of autovacuum prioritization could actually
> help in these cases. What does help is adding more autovacuum workers.

Thanks for explaining. I think the scoring system in Nathan's patch helps
with this, as any smaller tables which are neglected continue to bloat,
and because they're smaller, the score will increase more quickly, and
eventually they'll have a higher score than the larger tables.

I think the situation you're talking about is when *all* autovacuum
workers are busy with large tables and no workers remain to deal with the
now-higher-scoring smaller tables, and they bloat severely or statistics
become wildly outdated as a result. I'm aware of that problem. It seems to
happen mostly when large tables are busy receiving an anti-wraparound
vacuum. I'm not sure what to do about it, but I don't think changing the
scoring system is the right thing. Maybe we can have it configurable so
that 1 worker can be configured to not work on tables above a given size,
so there's at least 1 worker that is less likely to be tied up for
extended periods of time. I don't know if that's a good idea, and also
don't know what realistic values for "given size" are.

David
> > The thing I’m hoping to address is something I’ve seen many times in practice.
> > Autovacuum workers can get stuck on specific large or slow tables, and when
> > that happens, users often end up running manual vacuums on those tables
> > just to keep things moving for the smaller/faster vacuumed tables.
> >
> > Now, I am not so sure any type of autovacuum prioritization could actually
> > help in these cases. What does help is adding more autovacuum workers.
>
> Thanks for explaining. I think the scoring system in Nathan's patch
> helps with this, as any smaller tables which are neglected continue to
> bloat, and because they're smaller, the score will increase more
> quickly,

That is true.

> Maybe we can have it configurable so that 1 worker can be
> configured to not work on tables above a given size, so there's at
> least 1 worker that is less likely to be tied up for extended periods
> of time. I don't know if that's a good idea, and also don't know what
> realistic values for "given size" are.

Another approach would be to signal for more autovacuum workers to be spun
up (the user can have a minimum and maximum number of workers) if all
workers have been processing the list for a long time (also not sure what
the "long" threshold would be). This "auto-tuning" of workers could
perhaps reduce the need for manual vacuums. It will still not prevent all
workers from being tied up, but maybe reduce the likelihood.

--
Sami
On Tue, Nov 11, 2025 at 3:27 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 12 Nov 2025 at 09:13, Nathan Bossart <nathandbossart@gmail.com> wrote:
> > On Wed, Nov 12, 2025 at 09:03:54AM +1300, David Rowley wrote:
> > > /* when enough time has passed, refresh the list to ensure the
> > > scores aren't too out-of-date */
> > > if (time is > lastcheck + autovacuum_naptime * <something>)
> > > {
> > > list_free_deep(tables_to_process);
> > > goto the_top;
> > > }
> > > } // end of foreach(cell, tables_to_process)
> >
> > My concern is that this might add already-processed tables back to the
> > list, so a worker might never be able to clear it. Maybe that's not a real
> > problem in practice for some reason, but it does feel like a step too far
> > for stage 1, as you said above.
>
> Oh, that's a good point. That's a very valid concern. I guess that
> could be fixed with a hashtable of vacuumed tables and skipping tables
> that exist in there, but the problem with that is that the table might
> genuinely need to be vacuumed again. It's a bit tricky to know when a
> 2nd vacuum is a legit requirement and when it's not. Figuring that out
> might be more logic than this code wants to know about.
>
Yeah, there is a common theoretical pattern that always comes up in
these discussions where autovacuum gets stuck behind N big tables +
(AVMW - N) small tables that keep filtering up to the top of the list,
and I'm not saying that would never be a problem, but assuming the
algorithm is working correctly, this should be fairly avoidable,
because the use of xid age essentially works as a "hash of vacuumed
tables" equivalent for tracking purposes.
Walking through it, once a table is vacuumed, it should go to the
bottom of the list. The only way it crops back up quickly is due to
significant activity on it, but even then, you need a special set of
circumstances, like it needs to be a small enough table with heavy
updates and a small autovacuum_vacuum_threshold. This type of combo
would cause the table to look like it is excessively bloated and in
need of vacuuming, but even in this scenario, eventually other tables
will get an xid age high enough that they will "outrank" the high
activity table and get their turn. TBH I'm not sure if we need to do
replanning, but in the scenarios where I have used it, having more
accurate information on the state of the database has generally been
better than relying on more stale information. Of course it isn't
100%, but the current implementation isn't either, and don't forget we
still have the failsafe_age as, well, a failsafe.
Robert Treat
https://xzilla.net
On Tue, Nov 11, 2025 at 06:22:36PM -0500, Robert Treat wrote:
> On Tue, Nov 11, 2025 at 3:27 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> On Wed, 12 Nov 2025 at 09:13, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> > My concern is that this might add already-processed tables back to the
>> > list, so a worker might never be able to clear it. Maybe that's not a real
>> > problem in practice for some reason, but it does feel like a step too far
>> > for stage 1, as you said above.
>>
>> Oh, that's a good point. That's a very valid concern. I guess that
>> could be fixed with a hashtable of vacuumed tables and skipping tables
>> that exist in there, but the problem with that is that the table might
>> genuinely need to be vacuumed again. It's a bit tricky to know when a
>> 2nd vacuum is a legit requirement and when it's not. Figuring that out
>> might be more logic than this code wants to know about.
>
> Yeah, there is a common theoretical pattern that always comes up in
> these discussions where autovacuum gets stuck behind N big tables +
> (AVMW - N) small tables that keep filtering up to the top of the list,
> and I'm not saying that would never be a problem, but assuming the
> algorithm is working correctly, this should be fairly avoidable,
> because the use of xid age essentially works as a "hash of vacuumed
> tables" equivalent for tracking purposes.

I do think re-prioritization is worth considering, but IMHO we should
leave it out of phase 1. I think it's pretty easy to reason about one
round of prioritization being okay. The order is completely arbitrary
today, so how could ordering by vacuum-related criteria make things any
worse? In my view, changing the list contents in fancier ways (e.g.,
adding just-processed tables back to the list) is a step further that
requires more discussion and testing.

To be clear, I am totally for serious consideration of reprioritization,
adjusting cost delay settings, etc., but as David has repeatedly stressed,
we are unlikely to get anything committed if we try to boil the ocean. I'd
love for this thread to spin off into all kinds of other
autovacuum-related threads, but we should be taking baby steps if we want
to accomplish anything here.

--
nathan
> I do think re-prioritization is worth considering, but IMHO we should leave
> it out of phase 1. I think it's pretty easy to reason about one round of
> prioritization being okay. The order is completely arbitrary today, so how
> could ordering by vacuum-related criteria make things any worse?

While it’s true that the current table order is arbitrary, that
arbitrariness naturally helps distribute vacuum work across tables of
various sizes at a given time.

The proposal now, by design, forces the most bloated tables, which will
require more I/O to vacuum, to be vacuumed at the same time by all
workers. Users may observe this after they upgrade and wonder why their
I/O profile changed and perhaps slowed other, non-vacuum-related
processing down. They also don't have a knob to go back to the previous
behavior.

Of course, this behavior can and will happen now, but with this
prioritization, we are forcing it.

Is this a concern?

--
Sami Imseih
Amazon Web Services (AWS)
> On Nov 12, 2025, at 5:10 PM, Sami Imseih <samimseih@gmail.com> wrote:
>
>> I do think re-prioritization is worth considering, but IMHO we should leave
>> it out of phase 1. I think it's pretty easy to reason about one round of
>> prioritization being okay. The order is completely arbitrary today, so how
>> could ordering by vacuum-related criteria make things any worse?
>
> While it’s true that the current table order is arbitrary, that arbitrariness
> naturally helps distribute vacuum work across tables of various sizes
> at a given time
>
> The proposal now is by design forcing all the top bloated table, that
> will require more I/O to vacuum to be vacuumed at the same time,
> by all workers. Users may observe this after they upgrade and wonder
> why their I/O profile changed and perhaps slowed others non-vacuum
> related processing down. They also don't have a knob to go back to
> the previous behavior.
>
> Of course, this behavior can and will happen now, but with this
> prioritization, we are forcing it.
>
> Is this a concern?

It’s still possible to tune the cost delay, the number of autovacuum
workers, etc - if someone needs to manage too much autovacuum I/O
concurrency and dialing it back down a little bit. I think that’s
sufficient

-Jeremy
> On Nov 12, 2025, at 5:10 PM, Sami Imseih <samimseih@gmail.com> wrote:
>
>
>>
>> I do think re-prioritization is worth considering, but IMHO we should leave
>> it out of phase 1. I think it's pretty easy to reason about one round of
>> prioritization being okay. The order is completely arbitrary today, so how
>> could ordering by vacuum-related criteria make things any worse?
>
> While it’s true that the current table order is arbitrary, that arbitrariness
> naturally helps distribute vacuum work across tables of various sizes
> at a given time
>
> The proposal now is by design forcing all the top bloated table, that
> will require more I/O to vacuum to be vacuumed at the same time,
> by all workers. Users may observe this after they upgrade and wonder
> why their I/O profile changed and perhaps slowed others non-vacuum
> related processing down. They also don't have a knob to go back to
> the previous behavior.
>
> Of course, this behavior can and will happen now, but with this
> prioritization, we are forcing it.
>
> Is this a concern?
> It’s still possible to tune the cost delay, the number of autovacuum
> workers, etc - if someone needs to manage too much autovacuum I/O
> concurrency and dialing it back down a little bit. I think that’s
> sufficient
Yes, the need to tune a/v for I/O (lower cost limit, higher cost delay) will likely be
greater with this change.
--
Sami
On Wed, Nov 12, 2025 at 3:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I do think re-prioritization is worth considering, but IMHO we should leave
> it out of phase 1. I think it's pretty easy to reason about one round of
> prioritization being okay. The order is completely arbitrary today, so how
> could ordering by vacuum-related criteria make things any worse? In my
> view, changing the list contents in fancier ways (e.g., adding
> just-processed tables back to the list) is a step further that requires
> more discussion and testing.

I agree with your view around reprioritization. To answer your rhetorical
question, the way that reordering the list could hurt is if the current
ordering (pg_class scan order) happened to be a near-optimal choice. For
example, suppose the last table in pg_class order is in a state where
vacuuming appears to be necessary but will be painful and/or useless
(VACUUM will error, xmin will prevent all or most tuple removal, located
on an incredibly slow disk with nothing cached, whatever). Re-sorting the
list figures to move that table earlier, which will not work out for the
best. I suspect that reprioritization actually increases the danger of
this kind of failure mode. The more aggressive you are about making sure
that the highest-priority tables actually get handled first, the more
important it is to be correct about the real order of priority.

I do think in the long term a really good system is probably going to
accumulate a bunch of extra logic to deal with cases like this. For
example, if the first table in the queue causes VACUUM to spend an hour
chugging away and then fail with an I/O error, we would ideally want to
make sure to wait a while before retrying that table, so that others don't
get starved. But like you say, there's no need to solve every problem at
once. What seems important to me for this patch is that we don't choose an
actively bad sort order. For instance, if we don't get the balance between
prioritizing anti-wraparound activity and controlling runaway bloat
correct, and especially if there's no way to recover by tweaking settings,
to me that's a scary scenario.

I do think it's fairly realistic for a bad choice of sort order to end up
being a regression over the current lack of a sort order. You might just
be getting lucky right now -- say, because the catalog tables all occur
first in the catalog and vacuuming those tends to be important, and among
user tables, the ones you created first are actually the ones that are
most important. That's not a particularly crazy scenario, IMHO.

Point being: I think we need to avoid the mindset that we can't be
stupider than we are now. I don't think there's any way we would
commit something that is GENERALLY stupider than we are now, but it's
not about averages. It's about whether there are specific cases that
are common enough to worry about which end up getting regressed. I'm
honestly not sure how much of a risk that is, and, again, I'm not
trying to kill the patch. It might well be that the patch is already
good enough that such scenarios will be extremely rare. However, it's
easy to get overconfident when replacing a completely unintelligent
system with a smarter one. The risk of something backfiring can
sometimes be higher than one anticipates.

One idea that might be worth considering is adding a reloption of some
kind that lets the user exert positive control over the sort order. I know
that's scope creep, so maybe it's a bad idea for that reason.
But I think it would be a better idea than Sami's proposal to score system
catalogs more highly, not so much because his idea is necessarily
wrong-headed as because it doesn't help with what I see as the principal
danger here, namely, that whatever we do will sometimes turn out to be
wrong. Trying to be right 100% of the time is not going to work out as
well as having a backup plan for the cases where we are wrong.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 09:30:42AM -0500, Robert Haas wrote:
> Point being: I think we need to avoid the mindset that we can't be
> stupider than we are now. I don't think there's any way we would
> commit something that is GENERALLY stupider than we are now, but it's
> not about averages. It's about whether there are specific cases that
> are common enough to worry about which end up getting regressed. I'm
> honestly not sure how much of a risk that is, and, again, I'm not
> trying to kill the patch. It might well be that the patch is already
> good enough that such scenarios will be extremely rare. However, it's
> easy to get overconfident when replacing a completely unintelligent
> system with a smarter one. The risk of something backfiring can
> sometimes be higher than one anticipates.

That's a fair point. The possibly-entirely-theoretical case that's in my
head is when your oldest and lowest-OID table is also the biggest and most
active. That seems like it could be a popular pattern in the field, and it
probably benefits to some degree from the sequential scan returning it
earlier. I don't know how much to worry about stuff like this.

--
nathan
> On Thu, Nov 20, 2025 at 09:30:42AM -0500, Robert Haas wrote:
> > Point being: I think we need to avoid the mindset that we can't be
> > stupider than we are now. I don't think there's any way we would
> > commit something that is GENERALLY stupider than we are now, but it's
> > not about averages. It's about whether there are specific cases that
> > are common enough to worry about which end up getting regressed. I'm
> > honestly not sure how much of a risk that is, and, again, I'm not
> > trying to kill the patch. It might well be that the patch is already
> > good enough that such scenarios will be extremely rare. However, it's
> > easy to get overconfident when replacing a completely unintelligent
> > system with a smarter one. The risk of something backfiring can
> > sometimes be higher than one anticipates.
>
> That's a fair point. The possibly-entirely-theoretical case that's in my
> head is when your oldest and lowest-OID table is also the biggest and most
> active. That seems like it could be a popular pattern in the field, and it
> probably benefits to some degree from the sequential scan returning it
> earlier. I don't know how much to worry about stuff like this.

I think it would be difficult to introduce this new prioritization system
without a GUC to control the prioritization behavior. Since ordering by
pg_class has been the only behavior ever since autovacuum was released,
there should be a way for users to revert back to this. The default could
be the new prioritization strategy.

Introducing new GUCs is something to be avoided if possible, but this case
seems like a clear one to me.

--
Sami Imseih
Amazon Web Services (AWS)
On Thu, Nov 20, 2025 at 11:34 AM Sami Imseih <samimseih@gmail.com> wrote:
> I think it would be difficult to introduce this new prioritization
> system without a GUC to control the prioritization behavior. Since
> ordering by pg_class has been the only behavior ever since autovacuum
> was released, there should be a way for users to revert back to this.
> The default could be the new prioritization strategy.
>
> Introducing new GUCs is something to be avoided if possible, but this
> case seems like a clear one to me.

As I sort of alluded to in my previous message, I'd rather see us
introduce something that lets you get the behavior you want than
something that just lets you get back to the old behavior. Technically,
the latter is good enough to avoid any claim that we've regressed things:
you can always just turn the new thing off, and so by definition there are
no unavoidable regressions. But that only caters to the scenario where the
current behavior is good by accident (because it can never be good for any
other reason).

Don't take this too literally, but just mooting ideas wildly, suppose the
scoring has a wraparound component, a bloat component, and a
reloption-driven component, and the former two have a weighting factor
that can be adjusted via GUCs. If you want to shut off the new behavior,
you can set the weighting factors to 0. If you want to keep the new
behavior but adjust the trade-off between the wraparound and bloat
components, you can adjust the relative weighting factors between the two.
If you want to take more manual control, you can use the reloption, a
choice that you can layer on top of the default strategy or any of the
alternate strategies just proposed. Of course, making this all too
complicated is a recipe for failure, but I suspect that making it at least
somewhat configurable is a good idea.

--
Robert Haas
EDB: http://www.enterprisedb.com
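Just to visualize the kind of knob being mooted here (this is purely
illustrative; the function, parameter names, weights, and numbers are
invented and do not exist in any posted patch), a weighted score could
look something like the sketch below, where zeroing both weights
effectively disables prioritization:

#include <stdio.h>

/*
 * Hypothetical weighted score: a wraparound component and a bloat
 * component, each with a GUC-style weight, plus a per-table
 * reloption-driven boost.  Setting both weights to zero falls back to
 * "no prioritization".
 */
static double
combined_score(double xid_age_ratio,     /* e.g. age(relfrozenxid) / freeze_max_age */
               double bloat_ratio,       /* e.g. dead tuples / vacuum threshold */
               double wraparound_weight, /* imagined GUC */
               double bloat_weight,      /* imagined GUC */
               double reloption_boost)   /* imagined per-table reloption */
{
    return wraparound_weight * xid_age_ratio +
           bloat_weight * bloat_ratio +
           reloption_boost;
}

int
main(void)
{
    /* The same table scored under two different weightings. */
    printf("bloat-heavy weighting: %.2f\n",
           combined_score(0.4, 2.0, 0.5, 1.0, 0.0));
    printf("wraparound-heavy:      %.2f\n",
           combined_score(0.4, 2.0, 2.0, 0.25, 0.0));
    return 0;
}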
> something that just lets you get back to the old behavior.
> Technically, the latter is good enough to avoid any claim that we've
> regressed things:

yes, that is the intention with the GUC. I am worried about cases in which
a user has no way to go back to the old behavior if the new prioritization
strategy causes pain, for some reason.

> But that only caters
> to the scenario where the current behavior is good by accident
> (because it can never be good for any other reason).

Well, maybe it was never really good, but it was the only behavior, and
the user tuned and tested their autovacuum settings with this behavior,
whether they actually knew it's based on pg_class ordering or not (I know
users I have worked with who do not realize this).

I think if we are considering how to improve prioritization, the thing to
keep in mind is what we can do so users are no longer required to schedule
manual vacuums for specific tables (which is essentially how users are
currently prioritizing tables). If we go with the rigid strategy that is
being proposed now, will it reduce or eliminate the need for manual
scheduling? I am not so sure.

> Don't take this too literally, but just mooting ideas wildly, suppose
> the scoring has a wraparound component, a bloat component, and a
> reloption-driven component, and the former two have a weighting factor
> that can be adjusted via GUCs. If you want to shut off the new
> behavior, you can set the weighting factors to 0.

Something like this could be better, since it can both give control over
prioritization and allow reverting to the current behavior.

--
Sami Imseih
Amazon Web Services (AWS)
On Fri, 21 Nov 2025 at 07:36, Robert Haas <robertmhaas@gmail.com> wrote:
> If you want to take more manual control, you can use
> the reloption, a choice that you can layer on top of the default
> strategy or any of the alternate strategies just proposed. Of course,
> making this all too complicated is a recipe for failure, but I suspect
> that making it at least somewhat configurable is a good idea.

But it is configurable... you're free to change any of
autovacuum_freeze_max_age, autovacuum_multixact_freeze_max_age,
autovacuum_vacuum_scale_factor, autovacuum_vacuum_insert_scale_factor and
autovacuum_analyze_scale_factor, plus all the other
autovacuum_vacuum*_threshold GUCs and reloptions to adjust the score. The
design is no accident. Of course, that does also affect the eligibility
for the table to be vacuumed, not just the order, but it's not like
there's no way for users to influence the order. If we really do discover
that pg_catalog tables need vacuum attention sooner, then maybe we should
consider defaulting a reloption for that, or maybe there's only a subset
of pg_catalog tables that that matters for.

For the record, I don't deny that it is possible that there is some
scenario where the pg_class order is better than sorting by the
percentage-over-threshold method, but IMO, it seems quite extreme to go
adding a series of new reloptions to weight the scores based on no
evidence that there's an actual problem or that it's even a good solution
to fixing some currently unknown problem. If we later discover there is no
issue, then reloptions are quite painful to remove due to pg_dump (or
rather failed restores). I think the vacuum options are complex enough
without risking adding a few new ones that we don't even know are required
or are even useful to anyone.

As for the GUC, I think we should at least commit the patch first and add
an open item to "Decisions to Recheck Mid-Beta" for v19 to see if anyone
still thinks a GUC is a good escape hatch, or if we'd prefer to revert the
patch because it's causing trouble. As I see it, we've got about 6 months
or maybe a bit more of testing how well this works before we need to make
a decision. My vote is to use as much of that time as possible rather than
using it to allow people to dream up hypothetical problems that might or
might not exist.

David
On Thu, Nov 20, 2025 at 3:58 PM David Rowley <dgrowleyml@gmail.com> wrote:
> before we need to make a decision. My vote is to use as much of that
> time as possible rather than using it to allow people to dream up
> hypothetical problems that might or might not exist.

That seems a little harsh. I think the only hypothesis necessary for my
concern to be valid is the hypothesis that whatever algorithm we've
selected may not always work well. I admit that I could be wrong in
thinking so; there are plenty of heuristics in PostgreSQL that are so
effective that nobody ever cares about tuning them. But there's enough
problems with autovacuum that I don't think it's a particularly
adventurous hypothesis, either.

That said, I accept your point that even if we were to agree that
something ought to be made tunable here, we would still have the problem
of deciding exactly what GUCs or reloptions to add, and that might be
hard to figure out without more information. Unfortunately, I have a
feeling that unless you or someone here is planning to make a
determined testing effort over the coming months, we're more likely to
get feedback after final release than during development or even beta.
But I do also understand that you don't want us to be paralyzed and
never move forward.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 21 Nov 2025 at 10:16, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 20, 2025 at 3:58 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > before we need to make a decision. My vote is to use as much of that
> > time as possible rather than using it to allow people to dream up
> > hypothetical problems that might or might not exist.
>
> That seems a little harsh.

It wasn't intended to be offensive. It's an observation that there've been
quite a few posts on this thread about various extra things we should
account for in the score without any evidence that they're worthy of a
special case. I used "dream up" since I don't recall any of those posts
arriving with evidence of an actual problem or that the proposed solution
was a valid fix for it, and that the proposed solution didn't make
something else worse.

> That said, I accept your point that even if we were to agree that
> something ought to be made tunable here, we would still have the problem
> of deciding exactly what GUCs or reloptions to add, and that might be
> hard to figure out without more information. Unfortunately, I have a
> feeling that unless you or someone here is planning to make a
> determined testing effort over the coming months, we're more likely to
> get feedback after final release than during development or even beta.

You might be right. Or after a week we might discover a good reason why
the percentage-over-threshold method is rubbish and revert it. The key is
probably in how we act on getting no negative feedback.

I suspect the most likely area the new prioritisation order could cause
issues is from the lack of randomness. Will multiple workers working in
the same database be more likely to bump into each other somehow in a bad
way? Maybe that's a good area to focus testing.

> But I do also understand that you don't want us to be paralyzed and
> never move forward.

Yeah partly, but mostly I just really doubt that this matters that much.
It's been said on this thread already that prioritisation isn't as
important as the autovacuum-configured-to-run-too-slowly issue, and I
agree with that. I just find it hard to believe that the highly volatile
pg_class order has been just perfect all these years and that sorting by
percentage-over-threshold-desc will make things worse overall. There was
mention that pg_catalog tables are first in pg_class, but I don't really
agree with that, as if I create some new tables on a fresh database, I see
those getting lower ctids than any pg_catalog table. The space for that is
finite, but there's no shortage of other reasons for user tables to become
mentioned in pg_class before catalogue tables as the database gets used. I
see that table_beginscan_catalog() uses SO_ALLOW_SYNC too, so there's an
extra layer of randomness from sync scans. I don't recall any complaints
about the order autovacuum works on tables, so, to me, it just seems
strange to think that the volatile order of pg_class just happened to be
right all these years. I suspect what's happening is that the extra bloat
or stale statistics that people get as a result of the pg_class-order
autovacuum just go unnoticed, ignored, or attended to via adjustments to
the corresponding scale_factor reloption.

David
On Thu, Nov 20, 2025 at 5:12 PM David Rowley <dgrowleyml@gmail.com> wrote:
> It wasn't intended to be offensive.

OK.

> I suspect the most likely area the new prioritisation order could
> cause issues is from the lack of randomness. Will multiple workers
> working in the same database be more likely to bump into each other
> somehow in a bad way? Maybe that's a good area to focus testing.

I agree that lack of randomness could cause problems, but I don't see how
it could cause regressions, because the current system isn't random,
either. Even if the order of pg_class is unpredictable, it may (depending
on the workload) not change very much from one day to the next.

> Yeah partly, but mostly I just really doubt that this matters that
> much. It's been said on this thread already that prioritisation isn't
> as important as the autovacuum-configured-to-run-too-slowly issue, and
> I agree with that. I just find it hard to believe that the highly
> volatile pg_class order has been just perfect all these years and that
> sorting by percentage-over-threshold-desc will make things worse
> overall. There was mention that pg_catalog tables are first in
> pg_class, but I don't really agree with that, as if I create some new
> tables on a fresh database, I see those getting lower ctids than any
> pg_catalog table. The space for that is finite, but there's no
> shortage of other reasons for user tables to become mentioned in
> pg_class before catalogue tables as the database gets used. I see that
> table_beginscan_catalog() uses SO_ALLOW_SYNC too, so there's an extra
> layer of randomness from sync scans. I don't recall any complaints
> about the order autovacuum works on tables, so, to me, it just seems
> strange to think that the volatile order of pg_class just happened to
> be right all these years. I suspect what's happening is that the extra
> bloat or stale statistics that people get as a result of the
> pg_class-order autovacuum just go unnoticed, ignored, or attended to
> via adjustments to the corresponding scale_factor reloption.

Interesting. I don't have any real knowledge of how jumbled-up the order
of pg_class is on real production systems, and I agree that if the answer
is "it's usually quite jumbled up" then that is good news for this patch.
In any case, I'm not trying to say that prioritization is an intrinsically
bad idea, because I don't believe that. What I'm trying to say is that
there's a limited number of ways for this patch to make things worse, and
one of them is if someone is winning right now by accident, so we should
think about how many people might be in that situation. I would argue that
if a large number of users end up with a very similar pattern in terms of
how pg_class is ordered, that makes the patch higher-risk than if, as I
think you're arguing here, there's enough randomness in terms of where
things end up in pg_class to prevent any particular pattern from
predominating. In the latter case, one or two really unlucky users could
end up worse off, but that's not really an issue. What would be an issue
is if we regressed some kind of common pattern. I admit that's a bit
speculative and I'm probably being a little paranoid here: doing smart
things is typically better than doing dumb things, and what we're doing
right now is dumb.

On the other hand, once we ship something, we can't pull it back. If it
causes a problem, someone will call me at 2am and need their system fixed
right now.
If my answer is "well, there are no configuration knobs we can change and no way to get back to the old behavior and I'm sorry you're having that problem but the only answer is for you to run all your VACUUMs manually until two years from now when maybe the algorithm will have been improved," it's not going to be a very good night. After 15 years at EDB, I've learned that the problem isn't being wrong per se; it's having no way to get out from under being wrong. It is absolutely inevitable that I will screw up, you will screw up, the project as a whole will screw up, and that doesn't worry me a bit. What does worry me is the prospect that we won't have thought hard enough about what we're going to do if and when that happens. Most of the customers that I've gotten to work with over the years are very gracious about things going wrong with the software as long as there are some options to deal with the problem. I fully admit that this patch may already be good enough that I'll never hear a single customer complain about it, but the time to think through the reverse scenario, where some users are unhappy, is before we ship, not after. That necessarily involves some speculation about what might go wrong and some of that speculation may be groundless, but speculation causes a lot less pain than angry customers whose problems you can't fix. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 9:55 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>

Thanks for working on this problem. We frequently hear about the
autovacuum scheduling issue. I believe this is a great starting point to
prioritize based on the wraparound and vacuum threshold limit.

However, my vision for addressing this problem has always involved
maintaining two distinct priority queues (or sorted lists). Each of these
queues would contain tables, with the tables within each queue sorted by
their respective scores.

Queue 1: Wraparound-Critical: This queue contains tables that require
immediate action because their XID or MultiXact ID age is critical,
especially those approaching the failsafe limit.

Queue 2: Threshold-Based: This queue includes tables needing VACUUM due to
crossing other thresholds.

Both queues would be maintained as sorted lists, with the highest priority
score at the head. The autovacuum worker dynamically selects tables for
processing from the head of these 2 queues. For instance, if a table is
initially chosen from the threshold queue but processing took too long,
and another table approaches its failsafe limit due to a high rate of
concurrent XID generation, the latter can be prioritized from the
wraparound queue. I believe this 2-queue approach offers more flexibility
than attempting to merge these distinct concerns into a single scoring
dimension.

Tables may exist in both queues. If a table is selected and vacuumed, it
will be removed from both queues to prevent redundant efforts.

--
Regards,
Dilip Kumar
Google
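To sketch what that two-queue selection might look like (the structures,
relids, and scores below are invented for illustration and are not code
from any patch), a worker could keep two pre-sorted lists, always prefer
the head of the wraparound queue while it has pending entries, and remove
a processed table from both queues:

#include <stdbool.h>
#include <stdio.h>

/* Invented entry type; both arrays are assumed pre-sorted, highest score first. */
typedef struct QueueEntry
{
    int     relid;
    double  score;
    bool    done;
} QueueEntry;

static QueueEntry wraparound_q[] = {{1001, 9.0, false}, {1003, 2.5, false}};
static QueueEntry threshold_q[]  = {{1002, 4.0, false}, {1001, 3.0, false}};

#define QLEN 2

/* First not-yet-processed entry, i.e. the current head of a queue. */
static QueueEntry *
head(QueueEntry *q)
{
    for (int i = 0; i < QLEN; i++)
        if (!q[i].done)
            return &q[i];
    return NULL;
}

/* A vacuumed table is removed from both queues to avoid redundant work. */
static void
mark_done(int relid)
{
    for (int i = 0; i < QLEN; i++)
    {
        if (wraparound_q[i].relid == relid)
            wraparound_q[i].done = true;
        if (threshold_q[i].relid == relid)
            threshold_q[i].done = true;
    }
}

int
main(void)
{
    for (;;)
    {
        QueueEntry *w = head(wraparound_q);
        QueueEntry *t = head(threshold_q);
        QueueEntry *pick;

        if (w == NULL && t == NULL)
            break;
        /* Prefer the wraparound queue whenever it has pending work. */
        pick = (w != NULL) ? w : t;
        printf("processing relid %d (score %.1f)\n", pick->relid, pick->score);
        mark_done(pick->relid);
    }
    return 0;
}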
> > I suspect the most likely area the new prioritisation order could
> > cause issues is from the lack of randomness. Will multiple workers
> > working in the same database be more likely to bump into each other
> > somehow in a bad way? Maybe that's a good area to focus testing.
>
> I agree that lack of randomness could cause problems, but I don't see
> how it could cause regressions, because the current system isn't
> random, either. Even if the order of pg_class is unpredictable, it may
> (depending on the workload) not change very much from one day to the
> next.
>
> Interesting. I don't have any real knowledge of how jumbled-up the
> order of pg_class is on real production systems, and I agree that if
> the answer is "it's usually quite jumbled up" then that is good news
> for this patch. In any case, I'm not trying to say that prioritization
> is an intrinsically bad idea, because I don't believe that. What I'm
> trying to say is that there's a limited number of ways for this patch
> to make things worse

What I have not been able to prove from my tests is that the processing
order of tables by autovacuum will actually make things any better or any
worse. My tests have been short 30-minute tests that count how many vacuum
cycles tables with various DML activity and sizes received. I have not
found much difference. I am also not sure how valuable these
short-duration tests are either.

The field is where the real test occurs, and it may be discovered that the
new strategy improves the majority of the cases, and there may also be
cases where the existing strategy is somehow better. Having the ability to
go back to the existing behavior seems like the best way we can roll this
out and learn over time. These may be the only two strategies we will ever
need, or we may find out that a third strategy in which individual tables
are assigned a prioritization score will also be useful.

--
Sami
On Sat, Nov 22, 2025 at 12:28 PM Sami Imseih <samimseih@gmail.com> wrote:
> What I have not been able to prove from my tests is that the processing
> order of tables by autovacuum will actually make things any better or any
> worse. My tests have been short 30-minute tests that count how many
> vacuum cycles tables with various DML activity and sizes received.
> I have not found much difference. I am also not sure how valuable
> these short-duration tests are either.

Yeah, I'm not sure that would be the right way to look for a benefit from
something like this. I think that a better test scenario might involve
figuring out how fast we can recover from a bad situation. As we've
discussed before, if VACUUM is chronically unable to keep up with the
workload, then the system is going to get into a very bad state and
there's not really any help for it. But if we start to get into a bad
situation due to some outside interference and then someone removes the
interference, we might hope that this patch would help us get back on our
feet more quickly.

For instance, suppose that we have a database with a stale replication
slot, so the oldest-XID value for the cluster keeps getting older and
older. autovacuum is probably running but it can't clean anything up. Then
at some point, the DBA realizes that bad things are happening and drops
the replication slot. You might hope that, with the patch, autovacuum
would do a better job getting the system back to a working state. If you
set up some kind of test scenario, you could ask questions like "what is
the largest age(relfrozenxid) that we observe in the database at any point
during the test?" or "from the time the replication slot is dropped, how
much time passes before age(datfrozenxid) drops to normal?" or "what is
the maximum observed amount of bloat during the test?". The same kind of
idea could apply to anything else that stops vacuum from running or makes
it unproductive: a full table lock on a key table, an open transaction, a
table where VACUUM is failing.

I actually don't know exactly what kind of scenario would be good to test
here, because I struggle to think of a concrete scenario in which we'd be
better off with this than without it (which might be a reason not to
proceed with it, despite the fact that I think we all agree that, from a
theoretical point of view, the idea of prioritizing sounds better than the
idea of not prioritizing). But I think that if the patch has a benefit, it
won't be one where the system is in a steady state where vacuum is able to
keep up. It might be one where we're in a steady state where vacuum is not
able to keep up and things are getting worse and worse, but the patch
allows us to survive for longer before terrible things happen. But I would
say that the most promising scenario for this patch would be something
like what I describe above, where we're not in a steady state at all:
something bad has happened and now we're trying to recover.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, Nov 22, 2025 at 06:28:13AM -0500, Robert Haas wrote:
> What would be an issue is if we
> regressed some kind of common pattern. I admit that's a bit
> speculative and I'm probably being a little paranoid here: doing smart
> things is typically better than doing dumb things, and what we're
> doing right now is dumb.
>
> On the other hand, once we ship something, we can't pull it back. If
> it causes a problem, someone will call me at 2am and need their system
> fixed right now. If my answer is "well, there are no configuration
> knobs we can change and no way to get back to the old behavior and I'm
> sorry you're having that problem but the only answer is for you to run
> all your VACUUMs manually until two years from now when maybe the
> algorithm will have been improved," it's not going to be a very good
> night. After 15 years at EDB, I've learned that the problem isn't
> being wrong per se; it's having no way to get out from under being
> wrong.

Yeah. I'm tempted to code up the "weighting factor" GUCs for the next
revision. As you've noted, those would be useful for tuning and for
reverting to pre-v19 behavior. Sure, we might end up with a handful of
retail GUCs that most users don't need, but that's not so terrible.
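Roughly speaking, I'm picturing the per-table score as something like
the following, with the literal weights standing in for the
hypothetical GUCs. Nothing here is settled, and the real computation
would of course live in autovacuum.c rather than SQL; this is just to
show the shape of it.

    -- Sketch only: a weighted wraparound score per table, where each
    -- 1.0 stands in for a hypothetical weighting GUC (setting a weight
    -- to 0.0 would effectively take that factor out of the ordering).
    SELECT c.oid::regclass AS table_name,
           GREATEST(
               1.0 * age(c.relfrozenxid)
                   / current_setting('autovacuum_freeze_max_age')::numeric,
               1.0 * mxid_age(c.relminmxid)
                   / current_setting('autovacuum_multixact_freeze_max_age')::numeric
           ) AS score
    FROM pg_class AS c
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY score DESC;

--
nathan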
On Sun, 23 Nov 2025 at 07:35, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Nov 22, 2025 at 12:28 PM Sami Imseih <samimseih@gmail.com> wrote:
> > What I have not been able to prove from my tests is that the processing
> > order of tables by autovacuum will actually make things any better or any
> > worse. My tests have been short 30 minute tests that count how many
> > vacuum cycles tables with various DML activity and sizes received.
> > I have not found much difference. I am also not sure how valuable
> > these short-duration tests are either.
>
> Yeah, I'm not sure that would be the right way to look for a benefit
> from something like this. I think that a better test scenario might
> involve figuring out how fast we can recover from a bad situation. As
> we've discussed before, if VACUUM is chronically unable to keep up
> with the workload, then the system is going to get into a very bad
> state and there's not really any help for it. But if we start to get
> into a bad situation due to some outside interference and then someone
> removes the interference, we might hope that this patch would help us
> get back on our feet more quickly.

One thing that seems to be getting forgotten again is that the "/* Stop
applying cost limits from this point on */" behavior added in 1e55e7d17
is only applied when the table *currently* being vacuumed is over the
failsafe limit. Without Nathan's patch, the worker might end up idling
along, carefully obeying the cost limits on dozens of other tables,
before it gets around to vacuuming the table that's over the failsafe
limit, then suddenly drop the cost delay code and rush to get the table
frozen before Postgres stops accepting transactions. With the patch,
Nathan has added some aggressive score scaling, which should mean any
table over the failsafe limit has the highest score and gets attended
to first.
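For what it's worth, you can get a rough idea of whether you're already
in that territory with something like the query below. Rough, because
the actual failsafe cutoff is also clamped against
autovacuum_freeze_max_age and has a multixact counterpart, both of
which this ignores.

    -- Roughly, the tables that would trip the failsafe as soon as a
    -- worker reaches them.
    SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
    FROM pg_class AS c
    WHERE c.relkind IN ('r', 'm', 't')
      AND age(c.relfrozenxid) > current_setting('vacuum_failsafe_age')::bigint
    ORDER BY xid_age DESC;

David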
> > What I have not been able to prove from my tests is that the processing
> > order of tables by autovacuum will actually make things any better or any
> > worse. My tests have been short 30 minute tests that count how many
> > vacuum cycles tables with various DML activity and sizes received.
> > I have not found much difference. I am also not sure how valuable
> > these short-duration tests are either.
>
> Yeah, I'm not sure that would be the right way to look for a benefit
> from something like this. I think that a better test scenario might
> involve figuring out how fast we can recover from a bad situation. As
> we've discussed before, if VACUUM is chronically unable to keep up
> with the workload, then the system is going to get into a very bad
> state and there's not really any help for it. But if we start to get
> into a bad situation due to some outside interference and then someone
> removes the interference, we might hope that this patch would help us
> get back on our feet more quickly.
>
> For instance, suppose that we have a database with a stale replication
> slot, so the oldest-XID value for the cluster keeps getting older and
> older. autovacuum is probably running but it can't clean anything up.
> Then at some point, the DBA realizes that bad things are happening and
> drops the replication slot. You might hope that, with the patch,
> autovacuum would do a better job getting the system back to a working
> state. If you set up some kind of test scenario, you could ask
> questions like "what is the largest age(relfrozenxid) that we observe
> in the database at any point during the test?" or "from the time the
> replication slot is dropped, how much time passes before
> age(datfrozenxid) drops to normal?" or "what is the maximum observed
> amount of bloat during the test?".

From my experience, in these situations, you need to run manual vacuums
to supplement autovacuum and get bloat under control as quickly as
possible. If the tables are small and vacuum quickly, the order of
prioritization doesn't matter much, even with the extra bloat or high
XID age. However, if your slow-to-vacuum tables have the most bloat or
the oldest XID age, prioritizing those tables means that your smaller,
faster-to-vacuum tables will almost certainly not get the vacuum cycles
quickly enough after resolving whatever was blocking the vacuum (such
as long-running transactions or stale replication slots). In the
current system, these smaller tables might still get vacuumed, but
often by pure chance due to the pg_class ordering.

Speeding up recovery (removing bloat and freezing rows) as soon as
possible will require enabling more autovacuum workers (which are now
dynamic) or running manual vacuums. I don't think prioritization will
improve these situations much.
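To be clear about what I mean by supplementing autovacuum, the kind of
thing I'd reach for is roughly the following; the table name and worker
count are just placeholders.

    -- Give autovacuum more parallelism; now that autovacuum_max_workers
    -- is reloadable, this no longer needs a restart (up to
    -- autovacuum_worker_slots).
    ALTER SYSTEM SET autovacuum_max_workers = 10;
    SELECT pg_reload_conf();

    -- And/or vacuum the small, quick tables by hand so they are not
    -- starved while the large ones are being processed.
    VACUUM (VERBOSE) small_hot_table;

--
Sami Imseih
Amazon Web Services (AWS)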
On Sun, Nov 23, 2025 at 4:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
> One thing that seems to be getting forgotten again is that the "/* Stop
> applying cost limits from this point on */" behavior added in 1e55e7d17
> is only applied when the table *currently* being vacuumed is over the
> failsafe limit. Without Nathan's patch, the worker might end up idling
> along, carefully obeying the cost limits on dozens of other tables,
> before it gets around to vacuuming the table that's over the failsafe
> limit, then suddenly drop the cost delay code and rush to get the table
> frozen before Postgres stops accepting transactions. With the patch,
> Nathan has added some aggressive score scaling, which should mean any
> table over the failsafe limit has the highest score and gets attended
> to first.

Right, so can we use that to construct a specific, concrete scenario
where we can see that the patch ends up delivering better behavior than
we have today? I think it would be really good to have at least one
fully worked-out case where we can say "look, if you run this series of
commands without the patch, X happens, and with the patch, Y happens,
and look! Y is better."

--
Robert Haas
EDB: http://www.enterprisedb.com