Discussion: another autovacuum scheduling thread
/me dons flame-proof suit

My goal with this thread is to produce some incremental autovacuum scheduling improvements for v19, but realistically speaking, I know that it's a bit of a long-shot.  There have been many discussions over the years, and I've read through a few of them [0] [1] [2] [3] [4], but there are certainly others I haven't found.  Since this seems to be a contentious topic, I figured I'd start small to see if we can get _something_ committed.

While I am by no means wedded to a specific idea, my current concrete proposal (proof-of-concept patch attached) is to start by ordering the tables a worker will process by (M)XID age.  Here are the reasons:

* We do some amount of prioritization of databases at risk of wraparound at database level, per the following comment from autovacuum.c:

     * Choose a database to connect to.  We pick the database that was least
     * recently auto-vacuumed, or one that needs vacuuming to prevent Xid
     * wraparound-related data loss.  If any db at risk of Xid wraparound is
     * found, we pick the one with oldest datfrozenxid, independently of
     * autovacuum times; similarly we pick the one with the oldest datminmxid
     * if any is in MultiXactId wraparound.  Note that those in Xid wraparound
     * danger are given more priority than those in multi wraparound danger.

  However, we do no such prioritization of the tables within a database.  In fact, the ordering of the tables is effectively random.  IMHO this gives us quite a bit of wiggle room to experiment; since we are processing tables in no specific order today, changing the order to something vacuuming-related seems more likely to help than it is to harm.

* Prioritizing tables based on their (M)XID age might help avoid more aggressive vacuums, not to mention wraparound.  Of course, there are scenarios where this doesn't work.  For example, the age of a table may have changed greatly between the time we recorded it and the time we process it.  Or maybe there is another table in a different database that is more important from a wraparound perspective.  We could complicate the patch to try to handle some of these things, but I maintain that even some basic, incremental scheduling improvements would be better than the status quo.  And we can always change it further in the future to handle these problems and to consider other things like bloat.

The attached patch works by storing the maximum of the XID age and the MXID age in the list with the OIDs and sorting it prior to processing.

Thoughts?

[0] https://postgr.es/m/CA%2BTgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM%2B%2BJPwK%2BQ%40mail.gmail.com
[1] https://postgr.es/m/CAEG8a3%2B3fwQbgzak%2Bh3Q7Bp%3DvK_aWhw1X7w7g5RCgEW9ufdvtA%40mail.gmail.com
[2] https://postgr.es/m/CAD21AoBUaSRBypA6pd9ZD%3DU-2TJCHtbyZRmrS91Nq0eVQ0B3BA%40mail.gmail.com
[3] https://postgr.es/m/CA%2BTgmobT3m%3D%2BdU5HF3VGVqiZ2O%2Bv6P5wN1Gj%2BPrq%2Bhj7dAm9AQ%40mail.gmail.com
[4] https://postgr.es/m/20130124215715.GE4528%40alvh.no-ip.org

--
nathan
Attachments
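To make the proposed ordering concrete, here is a minimal standalone C sketch of the idea (it is not the attached patch; the struct, field names, and sample numbers are invented for illustration). Each table the worker plans to process carries the maximum of its XID and MXID age, and the list is sorted so the oldest tables come first:

#include <stdio.h>
#include <stdlib.h>

typedef struct AvTableEntry
{
    unsigned    relid;      /* pg_class OID of the table */
    unsigned    xid_age;    /* age(relfrozenxid) at scheduling time */
    unsigned    mxid_age;   /* mxid_age(relminmxid) at scheduling time */
} AvTableEntry;

/* sort key: the larger of the two ages */
static unsigned
entry_age(const AvTableEntry *e)
{
    return (e->xid_age > e->mxid_age) ? e->xid_age : e->mxid_age;
}

/* descending, so the oldest table is processed first */
static int
compare_by_age_desc(const void *a, const void *b)
{
    unsigned    age_a = entry_age((const AvTableEntry *) a);
    unsigned    age_b = entry_age((const AvTableEntry *) b);

    return (age_a > age_b) ? -1 : (age_a < age_b) ? 1 : 0;
}

int
main(void)
{
    AvTableEntry tables[] = {
        {16384, 210000000, 1000000},
        {16402, 50000000, 90000000},
        {16431, 5000000, 2000000},
    };
    size_t      ntables = sizeof(tables) / sizeof(tables[0]);

    qsort(tables, ntables, sizeof(AvTableEntry), compare_by_age_desc);

    for (size_t i = 0; i < ntables; i++)
        printf("relid %u, effective age %u\n",
               tables[i].relid, entry_age(&tables[i]));
    return 0;
}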
Thanks for raising this topic! I agree that autovacuum scheduling could be improved.

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.  Or maybe there is another table in a different database that
> is more important from a wraparound perspective.  We could complicate the
> patch to try to handle some of these things, but I maintain that even some
> basic, incremental scheduling improvements would be better than the status
> quo.  And we can always change it further in the future to handle these
> problems and to consider other things like bloat.

One risk I see with this approach is that we will end up autovacuuming tables that also take the longest time to complete, which could cause smaller, quick-to-process tables to be neglected.

It's not always the case that the oldest tables in terms of (M)XID age are also the most expensive to vacuum, but that is often more true than not.

Not saying that the current approach, which, as you mention, is random, is any better; however, this approach will likely increase the chances of large tables saturating workers.

But I also do see the merit of this approach when we know we are in failsafe territory, because I would want my oldest tables to be a/v'd first.

--
Sami Imseih
Amazon Web Services (AWS)
On 2025-Oct-08, Sami Imseih wrote: > One risk I see with this approach is that we will end up autovacuuming > tables that also take the longest time to complete, which could cause > smaller, quick-to-process tables to be neglected. Perhaps we can have autovacuum workers decide on a mode to use at startup (or launcher decides for them), and use different prioritization heuristics depending on the mode. For instance if we're past max freeze age for any tables then we know we have to first vacuum tables with higher MXID ages regardless of size considerations, but if there's at least one worker in that mode then we use the mode where smaller high-churn tables go first. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
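A minimal sketch of the two-mode idea described above, purely for illustration: a worker (or the launcher on its behalf) would pick a scheduling mode at startup, taking the wraparound-first mode only if some table is past the freeze max age and no other worker is already in that mode. The enum, function, and inputs below are invented, not anything proposed concretely in the thread:

#include <stdbool.h>
#include <stdio.h>

typedef enum AvWorkerMode
{
    AV_MODE_WRAPAROUND_FIRST,   /* order tables by (M)XID age, oldest first */
    AV_MODE_SMALL_TABLES_FIRST  /* let smaller high-churn tables go first */
} AvWorkerMode;

static AvWorkerMode
choose_worker_mode(bool any_table_past_freeze_max_age,
                   int workers_in_wraparound_mode)
{
    if (any_table_past_freeze_max_age && workers_in_wraparound_mode == 0)
        return AV_MODE_WRAPAROUND_FIRST;
    return AV_MODE_SMALL_TABLES_FIRST;
}

int
main(void)
{
    /* 0 = wraparound-first, 1 = small-tables-first */
    printf("%d\n", choose_worker_mode(true, 0));    /* wraparound-first */
    printf("%d\n", choose_worker_mode(true, 1));    /* small tables first */
    printf("%d\n", choose_worker_mode(false, 0));   /* small tables first */
    return 0;
}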
Hi,

On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> However, we do no such prioritization of the tables within a database.  In
> fact, the ordering of the tables is effectively random.

We don't prioritize tables, but I don't think the order really is random? Isn't it basically in the order in which the data is in pg_class? That typically won't change from one autovacuum pass to the next...

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.

> Or maybe there is another table in a different database that
> is more important from a wraparound perspective.

That seems like something no ordering within a single AV worker can address. I think it's fine to just define that to be out of scope.

> We could complicate the patch to try to handle some of these things, but I
> maintain that even some basic, incremental scheduling improvements would be
> better than the status quo.  And we can always change it further in the
> future to handle these problems and to consider other things like bloat.

Agreed! It doesn't take much to be better at scheduling than "order in pg_class".

> The attached patch works by storing the maximum of the XID age and the MXID
> age in the list with the OIDs and sorting it prior to processing.

I think it may be worth trying to avoid reliably using the same order - otherwise e.g. a corrupt index on the first scheduled table can cause autovacuum to reliably fail on the same relation, never allowing it to progress past that point.

Greetings,

Andres Freund
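One possible way to act on that suggestion (this is an invented illustration, not something proposed in the thread) is to keep a rough age-based priority while avoiding a perfectly repeatable order: bucket the ages coarsely, sort by bucket, and shuffle within each bucket so no single table reliably lands first on every pass. The bucket width and all names below are arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define AGE_BUCKET_WIDTH 10000000u  /* arbitrary: 10M XIDs per bucket */

typedef struct AvTableEntry
{
    unsigned    relid;
    unsigned    age;        /* max of XID and MXID age */
} AvTableEntry;

/* sort by coarse age bucket, oldest bucket first */
static int
compare_by_bucket_desc(const void *a, const void *b)
{
    unsigned    ba = ((const AvTableEntry *) a)->age / AGE_BUCKET_WIDTH;
    unsigned    bb = ((const AvTableEntry *) b)->age / AGE_BUCKET_WIDTH;

    return (ba > bb) ? -1 : (ba < bb) ? 1 : 0;
}

/* Fisher-Yates shuffle of entries [start, end) */
static void
shuffle_range(AvTableEntry *t, size_t start, size_t end)
{
    for (size_t i = end - 1; i > start; i--)
    {
        size_t      j = start + (size_t) rand() % (i - start + 1);
        AvTableEntry tmp = t[i];

        t[i] = t[j];
        t[j] = tmp;
    }
}

int
main(void)
{
    AvTableEntry tables[] = {
        {16384, 212000000}, {16402, 215000000},
        {16431, 51000000}, {16455, 52000000}, {16470, 3000000},
    };
    size_t      n = sizeof(tables) / sizeof(tables[0]);

    srand((unsigned) time(NULL));
    qsort(tables, n, sizeof(AvTableEntry), compare_by_bucket_desc);

    /* shuffle runs of equal buckets so ties vary from pass to pass */
    for (size_t i = 0; i < n;)
    {
        size_t      j = i + 1;

        while (j < n && tables[j].age / AGE_BUCKET_WIDTH ==
                        tables[i].age / AGE_BUCKET_WIDTH)
            j++;
        if (j - i > 1)
            shuffle_range(tables, i, j);
        i = j;
    }

    for (size_t i = 0; i < n; i++)
        printf("relid %u, age %u\n", tables[i].relid, tables[i].age);
    return 0;
}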
> Not saying that the current approach, which, as you mention, is random,
> is any better; however, this approach will likely increase the chances
> of large tables saturating workers.

Maybe it would be good to allocate some workers to the oldest tables and other workers based on some random list? This could balance things out between the oldest (large) tables and everything else to avoid this problem.

--
Sami Imseih
Amazon Web Services (AWS)
On Wed, 8 Oct 2025 12:06:29 -0500 Sami Imseih <samimseih@gmail.com> wrote: > > One risk I see with this approach is that we will end up autovacuuming > tables that also take the longest time to complete, which could cause > smaller, quick-to-process tables to be neglected. > > It’s not always the case that the oldest tables in terms of (M)XID age > are also the most expensive to vacuum, but that is often more true > than not. I think an approach of doing largest objects first actually might work really well for balancing work amongst autovacuum workers. Many years ago I designed a system to backup many databases with a pool of workers and used this same simple & naive algorithm of just reverse sorting on db size, and it worked remarkably well. If you have one big thing then you probably want someone to get started on that first. As long as there's a pool of workers available, as you work through the queue, you can actually end up with pretty optimal use of all the workers. -Jeremy
On Thu, 9 Oct 2025 at 12:41, Jeremy Schneider <schneider@ardentperf.com> wrote:
> I think an approach of doing largest objects first actually might work
> really well for balancing work amongst autovacuum workers. Many years
> ago I designed a system to backup many databases with a pool of workers
> and used this same simple & naive algorithm of just reverse sorting on
> db size, and it worked remarkably well. If you have one big thing then
> you probably want someone to get started on that first. As long as
> there's a pool of workers available, as you work through the queue, you
> can actually end up with pretty optimal use of all the workers.

I believe that methodology for processing work applies much better in scenarios where there's no new work continually arriving and there's no adverse effects from giving a lower priority to certain portions of the work. I don't think you can apply that so easily to autovacuum, as there are scenarios where the work can pile up faster than it can be handled. Also, smaller tables can bloat in terms of growth proportional to the original table size much more quickly than larger tables, and that could have huge consequences for queries to small tables which are not indexed sufficiently to handle becoming bloated and large.

David
On Thu, 9 Oct 2025 12:59:23 +1300 David Rowley <dgrowleyml@gmail.com> wrote: > I believe that is methodology for processing work applies much better > in scenarios where there's no new work continually arriving and > there's no adverse effects from giving a lower priority to certain > portions of the work. I don't think you can apply that so easily to > autovacuum as there are scenarios where the work can pile up faster > than it can be handled. Also, smaller tables can bloat in terms of > growth proportional to the original table size much more quickly than > larger tables and that could have huge consequences for queries to > small tables which are not indexed sufficiently to handle being > becoming bloated and large. I'm arguing that it works well with autovacuum. Not saying there aren't going to be certain workloads that it's suboptimal for. We're talking about sorting by (M)XID age. As the clock continues to move forward any table that doesn't get processed naturally moves up the queue for the next autovac run. I think the concerns are minimal here and this would be a good change in general. -Jeremy -- To know the thoughts and deeds that have marked man's progress is to feel the great heart throbs of humanity through the centuries; and if one does not feel in these pulsations a heavenward striving, one must indeed be deaf to the harmonies of life. Helen Keller, The Story Of My Life, 1902, 1903, 1905, introduction by Ralph Barton Perry (Garden City, NY: Doubleday & Company, 1954), p90.
On Wed, 8 Oct 2025 17:27:27 -0700 Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 12:59:23 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > I believe that methodology for processing work applies much
> > better in scenarios where there's no new work continually arriving
> > and there's no adverse effects from giving a lower priority to
> > certain portions of the work.  I don't think you can apply that so
> > easily to autovacuum as there are scenarios where the work can pile
> > up faster than it can be handled.  Also, smaller tables can bloat
> > in terms of growth proportional to the original table size much
> > more quickly than larger tables and that could have huge
> > consequences for queries to small tables which are not indexed
> > sufficiently to handle becoming bloated and large.
>
> I'm arguing that it works well with autovacuum. Not saying there
> aren't going to be certain workloads that it's suboptimal for. We're
> talking about sorting by (M)XID age. As the clock continues to move
> forward any table that doesn't get processed naturally moves up the
> queue for the next autovac run. I think the concerns are minimal here
> and this would be a good change in general.

Hmm, it doesn't work quite like that if the full queue needs to be processed before the next iteration ~ but at steady state these small tables are going to get processed at the same rate whether they were at the top or bottom of the queue, right? And in non-steady-state conditions, this seems like a better order than pg_class ordering?

-Jeremy
On Thu, 9 Oct 2025 at 13:27, Jeremy Schneider <schneider@ardentperf.com> wrote:
> I'm arguing that it works well with autovacuum. Not saying there aren't
> going to be certain workloads that it's suboptimal for. We're talking
> about sorting by (M)XID age. As the clock continues to move forward any
> table that doesn't get processed naturally moves up the queue for the
> next autovac run. I think the concerns are minimal here and this would
> be a good change in general.

I thought if we're to have a priority queue that it would be hard to argue against sorting by how far over the given auto-vacuum threshold that the table is. If you assume that a table that just meets the dead rows required to trigger autovacuum based on the autovacuum_vacuum_scale_factor setting gets a priority of 1.0, but another table that has n_mod_since_analyze twice over the autovacuum_analyze_scale_factor gets priority 2.0. Effectively, prioritise by the percentage over the given threshold the table is. That way users could still tune things when they weren't happy with the priority given to a table by adjusting the corresponding reloption.

It just seems strange to me to only account for 1 of the 4 trigger points for autovacuum when it's possible to account for all 4 without much extra trouble.

David
On Thu, 9 Oct 2025 14:03:34 +1300 David Rowley <dgrowleyml@gmail.com> wrote: > I thought if we're to have a priority queue that it would be hard to > argue against sorting by how far over the given auto-vacuum threshold > that the table is. If you assume that a table that just meets the > dead rows required to trigger autovacuum based on the > autovacuum_vacuum_scale_factor setting gets a priority of 1.0, but > another table that has n_mod_since_analyze twice over the > autovacuum_analyze_scale_factor gets priority 2.0. Effectively, > prioritise by the percentage over the given threshold the table is. > That way users could still tune things when they weren't happy with > the priority given to a table by adjusting the corresponding > reloption. If users are tuning this thing then I feel like we've already lost the battle :) On a healthy system, autovac runs continually and hits tables at regular intervals based on their steady state change rates. We have existing knobs (for better or worse) that people can use to tell PG to hit certain tables more frequently, to get rid of sleeps/delays, etc. With our fleet of PG databases here, my current approach is geared toward setting log_autovacuum_min_duration to some conservative value fleet-wide, then monitoring based on the logs for any cases where it runs longer than a defined threshold. I'm able to catch problems sooner this way, versus monitoring on xid age alone. Whenever there are problems with autovacuum, the actual issue is never going to be resolved by what order autovacuum processes tables. I don't think we should encourage any tunables here... to me it seems like putting focus entirely in the wrong place. -Jeremy
On Wed, 8 Oct 2025 18:25:20 -0700 Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 14:03:34 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > I thought if we're to have a priority queue that it would be hard to
> > argue against sorting by how far over the given auto-vacuum
> > threshold that the table is.  If you assume that a table that just
> > meets the dead rows required to trigger autovacuum based on the
> > autovacuum_vacuum_scale_factor setting gets a priority of 1.0, but
> > another table that has n_mod_since_analyze twice over the
> > autovacuum_analyze_scale_factor gets priority 2.0.  Effectively,
> > prioritise by the percentage over the given threshold the table is.
> > That way users could still tune things when they weren't happy with
> > the priority given to a table by adjusting the corresponding
> > reloption.
>
> If users are tuning this thing then I feel like we've already lost the
> battle :)

I replied too quickly. Re-reading your email, I think you're proposing a different algorithm, taking tuple counts into account. No tunables. Is there a fully fleshed out version of the proposed alternative algorithm somewhere? (one of the older threads?) I guess this is why it's so hard to get anything committed in this area...

-J
On Thu, 9 Oct 2025 at 14:47, Jeremy Schneider <schneider@ardentperf.com> wrote:
>
> On Wed, 8 Oct 2025 18:25:20 -0700
> Jeremy Schneider <schneider@ardentperf.com> wrote:
> > If users are tuning this thing then I feel like we've already lost the
> > battle :)
>
> I replied too quickly. Re-reading your email, I think you're proposing a
> different algorithm, taking tuple counts into account. No tunables. Is
> there a fully fleshed out version of the proposed alternative algorithm
> somewhere? (one of the older threads?) I guess this is why it's so hard
> to get anything committed in this area...

It's along the lines of the "1a)" from [1]. I don't think that post does a great job of explaining it.

I think the best way to understand it is if you look at relation_needs_vacanalyze() and see how it calculates boolean values for boolean output params. So, instead of calculating just a boolean value it instead calculates a float4 where < 1.0 means don't do the operation and anything >= 1.0 means do the operation. For example, let's say a table has 600 dead rows and the scale factor and threshold settings mean that autovacuum will trigger at 200 (3 times more dead tuples than the trigger point). That would result in the value of 3.0 (600 / 200). The priority for the relfrozenxid portion is basically age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account for mxid by doing the same for that and taking the maximum of each value). For each of those component "scores", the priority for autovacuum would be the maximum of each of those.

Effectively, it's a method of aligning the different units of measure, transactions or tuples, into a single value which is calculated based on the very same values that we use today to trigger autovacuums.

David

[1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com
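For illustration, here is a standalone C sketch of the threshold-ratio scoring described above, mirroring the quantities relation_needs_vacanalyze() already computes: each trigger becomes "observed value / trigger point", and the table's priority is the maximum of those ratios (>= 1.0 means the table is due). The struct, field names, and sample numbers are invented; this is not code from any of the patches in this thread:

#include <stdio.h>

typedef struct TableStats
{
    double      dead_tuples;        /* n_dead_tup */
    double      vacuum_trigger;     /* dead-tuple trigger point for this table */
    double      inserted_tuples;    /* n_ins_since_vacuum */
    double      insert_trigger;     /* insert trigger point for this table */
    double      mod_since_analyze;  /* n_mod_since_analyze */
    double      analyze_trigger;    /* analyze trigger point for this table */
    double      xid_age;            /* age(relfrozenxid) */
    double      mxid_age;           /* mxid_age(relminmxid) */
    double      freeze_max_age;     /* autovacuum_freeze_max_age */
    double      mxid_freeze_max_age;/* autovacuum_multixact_freeze_max_age */
} TableStats;

static double
maxd(double a, double b)
{
    return (a > b) ? a : b;
}

/* priority >= 1.0 means at least one trigger point has been reached */
static double
autovac_priority(const TableStats *s)
{
    double      score = 0.0;

    score = maxd(score, s->dead_tuples / s->vacuum_trigger);
    score = maxd(score, s->inserted_tuples / s->insert_trigger);
    score = maxd(score, s->mod_since_analyze / s->analyze_trigger);
    score = maxd(score, s->xid_age / s->freeze_max_age);
    score = maxd(score, s->mxid_age / s->mxid_freeze_max_age);

    return score;
}

int
main(void)
{
    /* 600 dead tuples against a trigger point of 200 => component score 3.0 */
    TableStats  t = {600, 200, 0, 1000, 50, 500,
                     120000000, 1000000, 200000000, 400000000};

    printf("priority = %.2f\n", autovac_priority(&t));  /* prints 3.00 */
    return 0;
}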
On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
>> The attached patch works by storing the maximum of the XID age and the MXID
>> age in the list with the OIDs and sorting it prior to processing.
>
> I think it may be worth trying to avoid reliably using the same order -
> otherwise e.g. a corrupt index on the first scheduled table can cause
> autovacuum to reliably fail on the same relation, never allowing it to
> progress past that point.

Hm.  What if we kept a short array of "failed" tables in shared memory? Each worker would consult this table before processing.  If the table is there, it would remove it from the shared table and skip processing it.  Then the next worker would try processing the table again.

I also wonder how hard it would be to gracefully catch the error and let the worker continue with the rest of its list...

--
nathan
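A minimal standalone sketch of the "short array of failed tables" idea above, purely to make the data flow concrete: a fixed-size list of table OIDs that a worker consults before processing; if a table is listed, the worker removes it and skips it this time, so the next worker retries it. In a real implementation this would live in shared memory under a lock; everything here (names, sizes, API) is invented:

#include <stdbool.h>
#include <stdio.h>

#define MAX_FAILED_TABLES 8

typedef struct FailedTableList
{
    unsigned    relids[MAX_FAILED_TABLES];
    int         nentries;
} FailedTableList;

/* called when a vacuum of the table errors out */
static void
record_failed_table(FailedTableList *list, unsigned relid)
{
    if (list->nentries < MAX_FAILED_TABLES)
        list->relids[list->nentries++] = relid;
    /* else: silently drop the entry; this is only a best-effort hint */
}

/* Returns true (and removes the entry) if this table should be skipped. */
static bool
check_and_clear_failed_table(FailedTableList *list, unsigned relid)
{
    for (int i = 0; i < list->nentries; i++)
    {
        if (list->relids[i] == relid)
        {
            list->relids[i] = list->relids[--list->nentries];
            return true;
        }
    }
    return false;
}

int
main(void)
{
    FailedTableList list = {{0}, 0};

    record_failed_table(&list, 16384);
    printf("%d\n", check_and_clear_failed_table(&list, 16384)); /* 1: skip once */
    printf("%d\n", check_and_clear_failed_table(&list, 16384)); /* 0: retry now */
    return 0;
}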
On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote: > I think the best way to understand it is if you look at > relation_needs_vacanalyze() and see how it calculates boolean values > for boolean output params. So, instead of calculating just a boolean > value it instead calculates a float4 where < 1.0 means don't do the > operation and anything >= 1.0 means do the operation. For example, > let's say a table has 600 dead rows and the scale factor and threshold > settings mean that autovacuum will trigger at 200 (3 times more dead > tuples than the trigger point). That would result in the value of 3.0 > (600 / 200). The priority for relfrozenxid portion is basically > age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account > for mxid by doing the same for that and taking the maximum of each > value). For each of those component "scores", the priority for > autovacuum would be the maximum of each of those. > > Effectively, it's a method of aligning the different units of measure, > transactions or tuples into a single value which is calculated based > on the very same values that we use today to trigger autovacuums. I like the idea of a "score" approach, but I'm worried that we'll never come to an agreement on the formula to use. Perhaps we'd have more luck getting consensus on a multifaceted strategy if we kept it brutally simple. IMHO it's worth a try... -- nathan
Hi, On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote: > On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote: > > On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote: > >> The attached patch works by storing the maximum of the XID age and the MXID > >> age in the list with the OIDs and sorting it prior to processing. > > > > I think it may be worth trying to avoid reliably using the same order - > > otherwise e.g. a corrupt index on the first scheduled table can cause > > autovacuum to reliably fail on the same relation, never allowing it to > > progress past that point. > > Hm. What if we kept a short array of "failed" tables in shared memory? I've thought about having that as part of pgstats... > Each worker would consult this table before processing. If the table is > there, it would remove it from the shared table and skip processing it. > Then the next worker would try processing the table again. > > I also wonder how hard it would be to gracefully catch the error and let > the worker continue with the rest of its list... The main set of cases I've seen are when workers get hung up permanently in corrupt indexes. There never is actually an error, the autovacuums just get terminated as part of whatever independent reason there is to restart. The problem with that is that you'll never actually have vacuum fail... Greetings, Andres Freund
On Thu, Oct 09, 2025 at 12:15:31PM -0400, Andres Freund wrote: > On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote: >> I also wonder how hard it would be to gracefully catch the error and let >> the worker continue with the rest of its list... > > The main set of cases I've seen are when workers get hung up permanently in > corrupt indexes. There never is actually an error, the autovacuums just get > terminated as part of whatever independent reason there is to restart. The > problem with that is that you'll never actually have vacuum fail... Ah. Wouldn't the other workers skip that table in that scenario? I'm not following the great advantage of varying the order in this case. I suppose the full set of workers might be able to process more tables before one inevitably gets stuck. Is that it? -- nathan
On Thu, Oct 9, 2025 at 12:15 PM Andres Freund <andres@anarazel.de> wrote: > > Each worker would consult this table before processing. If the table is > > there, it would remove it from the shared table and skip processing it. > > Then the next worker would try processing the table again. > > > > I also wonder how hard it would be to gracefully catch the error and let > > the worker continue with the rest of its list... > > The main set of cases I've seen are when workers get hung up permanently in > corrupt indexes. How recently was this? I'm aware of problems like that that we discussed around 2018, but they were greatly mitigated. First by your commit 3a01f68e, then by my commit c34787f9. In general, there's no particularly good reason why (at least with nbtree indexes) VACUUM should ever hang forever. The access pattern is overwhelmingly simple, sequential access. The only exception is nbtree page deletion (plus backtracking), where it isn't particularly hard to just be very careful about self-deadlock. > There never is actually an error, the autovacuums just get > terminated as part of whatever independent reason there is to restart. What do you mean? In general I'd expect nbtree VACUUM of a corrupt index to either not fail at all (we'll soldier on to the best of our ability when page deletion encounters an inconsistency), or to get permanently stuck due to locking the same page twice/self-deadlock (though as I said, those problems were mitigated, and might even be almost impossible these days). Every other case involves some kind of error (e.g., an OOM is just about possible). I agree with you about using a perfectly deterministic order coming with real downsides, without any upside. Don't interpret what I've said as expressing opposition to that idea. -- Peter Geoghegan
On Thu, Oct 09, 2025 at 11:13:48AM -0500, Nathan Bossart wrote:
> On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
>> I think the best way to understand it is if you look at
>> relation_needs_vacanalyze() and see how it calculates boolean values
>> for boolean output params. So, instead of calculating just a boolean
>> value it instead calculates a float4 where < 1.0 means don't do the
>> operation and anything >= 1.0 means do the operation. For example,
>> let's say a table has 600 dead rows and the scale factor and threshold
>> settings mean that autovacuum will trigger at 200 (3 times more dead
>> tuples than the trigger point). That would result in the value of 3.0
>> (600 / 200). The priority for relfrozenxid portion is basically
>> age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account
>> for mxid by doing the same for that and taking the maximum of each
>> value). For each of those component "scores", the priority for
>> autovacuum would be the maximum of each of those.
>>
>> Effectively, it's a method of aligning the different units of measure,
>> transactions or tuples into a single value which is calculated based
>> on the very same values that we use today to trigger autovacuums.
>
> I like the idea of a "score" approach, but I'm worried that we'll never
> come to an agreement on the formula to use.  Perhaps we'd have more luck
> getting consensus on a multifaceted strategy if we kept it brutally simple.
> IMHO it's worth a try...

Here's a prototype of a "score" approach.  Two notes:

* I've given special priority to anti-wraparound vacuums.  I think this is important to avoid focusing too much on bloat when wraparound is imminent.  In any case, we need a separate wraparound score in case autovacuum is disabled.

* I didn't include the analyze threshold in the score because it doesn't apply to TOAST tables, and therefore would artificially lower their priority.  Perhaps there is another way to deal with this.

This is very much just a prototype of the basic idea.  As-is, I think it'll favor processing tables with lots of bloat unless we're in an anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific we want to be about all of this, but I do intend to try some long-running tests.

--
nathan
Attachments
On Fri, Oct 10, 2025 at 1:31 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here's a prototype of a "score" approach.  Two notes:
>
> * I've given special priority to anti-wraparound vacuums.  I think this is
> important to avoid focusing too much on bloat when wraparound is imminent.
> In any case, we need a separate wraparound score in case autovacuum is
> disabled.
>
> * I didn't include the analyze threshold in the score because it doesn't
> apply to TOAST tables, and therefore would artificially lower their
> priority.  Perhaps there is another way to deal with this.
>
> This is very much just a prototype of the basic idea.  As-is, I think it'll
> favor processing tables with lots of bloat unless we're in an
> anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific
> we want to be about all of this, but I do intend to try some long-running
> tests.

I think this is a reasonable starting point, although I'm surprised that you chose to combine the sub-scores using + rather than Max. I think it will take a lot of experimentation to figure out whether this particular algorithm (or any other) works well in practice. My intuition (for whatever that is worth to you, which may not be much) is that what will anger users is cases when we ignore a horrible problem to deal with a routine problem. Figuring out how to design the scoring system to avoid such outcomes is the hard part of this problem, IMHO.

For this particular algorithm, the main hazards that spring to mind for me are:

- The wraparound score can't be more than about 10, but the bloat score could be arbitrarily large, especially for tables with few tuples, so there may be lots of cases in which the wraparound score has no impact on the behavior.

- The patch attempts to guard against this by disregarding the non-wraparound portion of the score once the wraparound portion reaches 1.0, but that results in an abrupt behavior shift at that point. Suddenly we go from mostly ignoring the wraparound score to entirely ignoring the bloat score. This might result in the system abruptly ignoring tables that are bloating extremely rapidly in favor of trying to catch up in a wraparound situation that is not yet terribly urgent.

When I've thought about this problem -- and I can't claim to have thought about it very hard -- it's seemed to me that we need to (1) somehow normalize everything to somewhat similar units and (2) make sure that severe wraparound danger always wins over every other consideration, but mild wraparound danger can lose to severe bloat.

--
Robert Haas
EDB: http://www.enterprisedb.com
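For concreteness, here is a rough standalone reconstruction of the scoring behavior being discussed: a bloat-style component plus a wraparound component, added together, with the non-wraparound portion disregarded once the wraparound portion reaches 1.0. This is not the attached patch; the function, inputs, and numbers are invented, and the output simply shows the abrupt shift described above:

#include <stdio.h>

static double
prototype_score(double dead_tuples, double vacuum_trigger,
                double xid_age, double freeze_max_age)
{
    double      bloat_score = dead_tuples / vacuum_trigger;
    double      wraparound_score = xid_age / freeze_max_age;

    /* anti-wraparound territory: ignore bloat, prioritize by age alone */
    if (wraparound_score >= 1.0)
        return wraparound_score;

    return bloat_score + wraparound_score;
}

int
main(void)
{
    /* heavily bloated, but young: bloat dominates (prints 25.25) */
    printf("%.2f\n", prototype_score(5000, 200, 50000000, 200000000));

    /* modest bloat, but past autovacuum_freeze_max_age: bloat ignored (3.00) */
    printf("%.2f\n", prototype_score(300, 200, 600000000, 200000000));
    return 0;
}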
Thanks for taking a look.

On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.

My thinking was that we should consider as many factors as we can in the score, not just the worst one.  If a table has medium bloat and medium wraparound risk, should it always be lower in priority to something with large bloat and small wraparound risk?  It seems worth exploring.  I am curious why you first thought of Max.

> When I've thought about this problem -- and I can't claim to have
> thought about it very hard -- it's seemed to me that we need to (1)
> somehow normalize everything to somewhat similar units and (2) make
> sure that severe wraparound danger always wins over every other
> consideration, but mild wraparound danger can lose to severe bloat.

Agreed.  I need to think about this some more.  While I'm optimistic that we could come up with some sort of normalization framework, I desperately want to avoid super complicated formulas and GUCs, as those seem like sure-fire ways of ensuring nothing ever gets committed.

--
nathan
On Fri, Oct 10, 2025 at 3:44 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote: > > I think this is a reasonable starting point, although I'm surprised > > that you chose to combine the sub-scores using + rather than Max. > > My thinking was that we should consider as many factors as we can in the > score, not just the worst one. If a table has medium bloat and medium > wraparound risk, should it always be lower in priority to something with > large bloat and small wraparound risk? It seems worth exploring. I am > curious why you first thought of Max. The right answer depends a good bit on how exactly you do the scoring, but it seems to me that it would be easy to overweight secondary problems. Consider a table with an XID age of 900m and an MXID age of 900m and another table with an XID age of 1.8b. I think it is VERY clear that the second one is MUCH worse; but just adding things up will make them seem equal. > Agreed. I need to think about this some more. While I'm optimistic that > we could come up with some sort of normalization framework, I deperately > want to avoid super complicated formulas and GUCs, as those seem like > sure-fire ways of ensuring nothing ever gets committed. IMHO, the trick here is to come up with something that's neither too simple nor too complicated. If it's too simple, we'll easily come up with cases where it sucks, and possibly where it's worse than what we do now (an impressive achievement, to be sure). If it's too complicated, it will be full of arbitrary things that will provoke dissent and probably not work out well in practice. I don't think we need something dramatically awesome to make a change to the status quo, but if it's extremely easy to think up simple scenarios in which a given idea will fail spectacularly, I'd be inclined to suspect that there will be a lot of real-world spectacular failures. -- Robert Haas EDB: http://www.enterprisedb.com
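A tiny arithmetic illustration of the 900m/900m versus 1.8b example above, assuming (purely for illustration) that both components are normalized by an autovacuum_freeze_max_age of 200 million:

#include <stdio.h>

int
main(void)
{
    const double freeze_max_age = 200e6;
    double      xid_a = 900e6, mxid_a = 900e6;  /* table A */
    double      xid_b = 1800e6, mxid_b = 0;     /* table B */

    /* adding the components makes the two tables look identical */
    printf("sum: A=%.1f B=%.1f\n",
           (xid_a + mxid_a) / freeze_max_age,
           (xid_b + mxid_b) / freeze_max_age);              /* 9.0 vs 9.0 */

    /* taking the maximum keeps B visibly closer to wraparound */
    printf("max: A=%.1f B=%.1f\n",
           ((xid_a > mxid_a) ? xid_a : mxid_a) / freeze_max_age,
           ((xid_b > mxid_b) ? xid_b : mxid_b) / freeze_max_age); /* 4.5 vs 9.0 */
    return 0;
}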
On Fri, 10 Oct 2025 16:24:51 -0400 Robert Haas <robertmhaas@gmail.com> wrote:

> I don't think we
> need something dramatically awesome to make a change to the status
> quo, but if it's extremely easy to think up simple scenarios in which
> a given idea will fail spectacularly, I'd be inclined to suspect that
> there will be a lot of real-world spectacular failures.

What does a real-world spectacular failure look like? "If those 3 autovac workers had processed tables in a different order everything would have been peachy"

But if autovac is going to get jammed up long enough to wraparound the system, does it matter whether or not it did a one-time processing of a bunch of small tables before it got jammed? One particular table always scoring high shouldn't block autovac from other tables, because it doesn't start a new iteration until it goes all the way through the list from its current iteration, right? And one iteration of autovac needs to process everything in the list... so it should take the same overall time regardless of order?

The spectacular failures I've seen with autovac usually come down to things like too much sleeping (cost_delay) or too few workers, where better ordering would be nice but probably wouldn't fix any real problems leading to the spectacular failures.

From Robert's 2024 pgConf.dev talk:

1. slow - forward progress not fast enough
2. stuck - no forward progress
3. spinning - not accomplishing anything
4. skipped - thinks not needed
5. starvation - can't keep up

I don't think any of these are really addressed by simply changing table order.

From Robert's 2022 email to hackers:

> A few people have proposed scoring systems, which I think is closer
> to the right idea, because our basic goal is to start vacuuming any
> given table soon enough that we finish vacuuming it before some
> catastrophe strikes.
...
> If table A will cause wraparound in 2 hours and take 2 hours to
> vacuum, and table B will cause wraparound in 1 hour and take 10
> minutes to vacuum, table A is more urgent even though the catastrophe
> is further out.

Robert, it sounds to me like the main use case you're focused on here is where basically wraparound is imminent - we are already screwed - and our very last hope was that a last-ditch autovac can finish just in time.

Failsafe and dynamic cost updates were huge advancements. Do we allow dynamic adjustment to worker count yet?

I hope y'all just pick something and commit it without getting too lost in the details. I honestly think in the list of improvements around autovac, this is the lowest priority on my list of hopes and dreams as a user for wraparound prevention :) because if this ever matters to me for avoiding wraparound, I was screwed long before we got to this point and this is not going to fix my underlying problems.

-Jeremy
On Sat, 11 Oct 2025 at 07:43, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.

Adding up the component scores doesn't make sense to me either. That means you could have 0.5 for inserted tuples, 0.5 for dead tuples and, say, 0.1 for the analyze threshold, which all add up to 1.1, but none of the component scores is high enough for auto-vacuum to have to do anything yet. With Max(), we'd clearly see that there's nothing to do since the overall score isn't >= 1.0.

> - The wraparound score can't be more than about 10, but the bloat
> score could be arbitrarily large, especially for tables with few
> tuples, so there may be lots of cases in which the wraparound score
> has no impact on the behavior.

That's a good point. I think we definitely do want to make it so tables in near danger of causing the database to stop accepting transactions are dealt with ASAP. Maybe the score calculation could change when the relevant age() goes above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start scaling it very aggressively beyond that.

There's plenty to debate, but at a first cut, maybe something like the following (coded in SQL for ease of result viewing):

select xidage as "age(relfrozenxid)",
       case xidage::float8 < current_setting('vacuum_failsafe_age')::float8
         when true then xidage / current_setting('autovacuum_freeze_max_age')::float8
         else power(xidage / current_setting('autovacuum_freeze_max_age')::float8,
                    xidage::float8 / 100_000_000)
       end xid_age_score
from generate_series(0,2_000_000_000,100_000_000) xidage;

which gives 1e+20 for an age of 2 billion. It would take quite an unreasonable amount of bloat to score higher than that.

I guess someone might argue that we should start taking it more seriously before the table's relfrozenxid age gets to vacuum_failsafe_age. Maybe that's true; I just don't know what that point would be. In any case, if a table's age gets that old, then something's probably not configured very well and needs attention.

I did think maybe we could keep the issue of auto-vacuum being configured to run too slowly as a separate thread.

David
On Fri, Oct 10, 2025 at 6:00 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > The spectacular failures I've seen with autovac usually come down to > things like too much sleeping (cost_delay) or too few workers, where > better ordering would be nice but probably wouldn't fix any real > problems leading to the spectacular failures Since I have said the same thing myself, I can hardly disagree. However, there are probably a few exceptions. For instance, if autovacuum on a certain table is failing repeatedly or accomplishing nothing without removing the apparent need to autovacuum, and happens to be the first one in pg_class, it could divert a lot of attention from other tables. > Robert it sounds to me like the main use case you're focused on here > is where basically wraparound is imminent - we are already screwed - and > our very last hope was that a last-ditch autovac can finish just in time Yes, I would argue that this is the scenario that really matters. As you say above, the main thing is having little enough sleeping and a sufficient number of workers. When that's the case, we can do the work in any order and life will mostly be fine. However, if we get into a desperate situation by, say, having one table that can't be vacuumed, and eventually someone fixes that, say by dropping the corrupt index that is preventing vacuuming of that table, we might like it if autovacuum focused on getting that table vacuumed rather than getting lost in the sauce. Of course, if we have the pretty common situation where autovacuum gets behind on all tables, say due to a stale replication slot, then this is less critical, although a perfect system would probably prioritize vacuuming the *largest* tables in this situation, since those will take the longest to finish, and it's when a vacuum of every table in the cluster has been *completed* that the XID horizons can advance. > I hope y'all just pick something and commit it without getting too lost > in the details. I honestly think in the list of improvements around > autovac, this is the lowest priority on my list of hopes and dreams as a > user for wraparound prevention :) because if this ever matters to me for > avoiding wraparound, I was screwed long before we got to this point and > this is not going to fix my underlying problems. I'm not sure if this was your intention, but to me this kind of reads like "well, it's not going to matter anyway so just do whatever and move on" and I don't agree with that. I think that if we're not going to do high-quality engineering here, we just shouldn't change anything at all. It's better to keep having the same bad behavior than for each release to have new and different bad behavior. One possible positive result of leaning into this prioritization problem is that whoever's working in it (Nathan, in this case) might gain some useful insights about how to tackle some of the other problems in this space. All of this is hard enough that we haven't really had any major improvements in this area since, I want to say, 8.3, and it's desirable to break that logjam even if we don't all agree on which problems are most urgent. Even if I ultimately don't agree with whatever Nathan wants to do or proposes, I'm glad he's trying to do something, which is (in my experience) generally much better than making no effort at all. -- Robert Haas EDB: http://www.enterprisedb.com