Обсуждение: do only critical work during single-user vacuum?

Поиск

Список

Период

Сортировка

do only critical work during single-user vacuum?

От

John Naylor

Дата:

09 декабря 2021 г., 22:28:18

When a user must shut down and restart in single-user mode to run
vacuum on an entire database, that does a lot of work that's
unnecessary for getting the system online again, even without
index_cleanup. We had a recent case where a single-user vacuum took
around 3 days to complete.

Now that we have a concept of a fail-safe vacuum, maybe it would be
beneficial to skip a vacuum in single-user mode if the fail-safe
criteria were not met at the beginning of vacuuming a relation. This
is not without risk, of course, but it should be much faster than
today and once up and running the admin would have a chance to get a
handle on things. Thoughts?

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

09 декабря 2021 г., 23:32:53

On 12/9/21, 11:34 AM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
> When a user must shut down and restart in single-user mode to run
> vacuum on an entire database, that does a lot of work that's
> unnecessary for getting the system online again, even without
> index_cleanup. We had a recent case where a single-user vacuum took
> around 3 days to complete.
>
> Now that we have a concept of a fail-safe vacuum, maybe it would be
> beneficial to skip a vacuum in single-user mode if the fail-safe
> criteria were not met at the beginning of vacuuming a relation. This
> is not without risk, of course, but it should be much faster than
> today and once up and running the admin would have a chance to get a
> handle on things. Thoughts?

Would the --min-xid-age and --no-index-cleanup vacuumdb options help
with this?

Nathan

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

10 декабря 2021 г., 00:04:43

On Thu, Dec 9, 2021 at 11:28 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Now that we have a concept of a fail-safe vacuum, maybe it would be
> beneficial to skip a vacuum in single-user mode if the fail-safe
> criteria were not met at the beginning of vacuuming a relation.

Obviously the main goal of the failsafe is to not get into this
situation in the first place. But it's still very reasonable to ask
"what happens when the failsafe even fails at that?". This was
something that we considered directly when working on the feature.

There is a precheck that takes place before any other work, which
ensures that we won't even start off any of the nonessential tasks the
failsafe skips (e.g., index vacuuming). The precheck works like any
other check -- it checks if relfrozenxid is dangerously old. (We won't
even bother trying to launch parallel workers when this precheck
triggers, which is another reason to have it that Mashahiko pointed
out during development.)

Presumably there is no need to specifically check if we're running in
single user mode when considering if we need to trigger the failsafe
-- which, as you say, we won't do. It shouldn't matter, because
anybody running single-user mode just to VACUUM must already be unable
to allocate new XIDs outside of single user mode. That condition alone
will trigger the failsafe.

That said, it would be very easy to add a check for single user mode.
It didn't happen because we weren't aware of any specific need for it.
Perhaps there is an argument for it.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

10 декабря 2021 г., 00:12:50

On Thu, Dec 9, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Dec 9, 2021 at 11:28 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Now that we have a concept of a fail-safe vacuum, maybe it would be
> > beneficial to skip a vacuum in single-user mode if the fail-safe
> > criteria were not met at the beginning of vacuuming a relation.
>
> Obviously the main goal of the failsafe is to not get into this
> situation in the first place. But it's still very reasonable to ask
> "what happens when the failsafe even fails at that?". This was
> something that we considered directly when working on the feature.

Oh, I think I misunderstood. Your concern is for the case where the
DBA runs a simple "VACUUM" in single-user mode; you want to skip over
tables that don't really need to advance relfrozenxid, automatically.

I can see an argument for something like that, but I think that it
should be a variant of VACUUM. Or maybe it could be addressed with a
better user interface; single-user mode should prompt the user about
what exact VACUUM command they ought to run to get things going.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

10 декабря 2021 г., 01:08:39

Hi,

On 2021-12-09 15:28:18 -0400, John Naylor wrote:
> When a user must shut down and restart in single-user mode to run
> vacuum on an entire database, that does a lot of work that's
> unnecessary for getting the system online again, even without
> index_cleanup. We had a recent case where a single-user vacuum took
> around 3 days to complete.
> 
> Now that we have a concept of a fail-safe vacuum, maybe it would be
> beneficial to skip a vacuum in single-user mode if the fail-safe
> criteria were not met at the beginning of vacuuming a relation. This
> is not without risk, of course, but it should be much faster than
> today and once up and running the admin would have a chance to get a
> handle on things. Thoughts?

What if the user tried to reclaim space by vacuuming (via truncation)? Or is
working around some corruption or such?  I think this is too much magic.

That said, having a VACUUM "selector" that selects the oldest tables could be
quite useful. And address this usecase both for single-user and normal
operation.

Another thing that might be worth doing is to update relfrozenxid earlier. We
definitely should update it before doing truncation (that can be quite
expensive). But we probably should do it even before the final
lazy_cleanup_all_indexes() pass - often that'll be the only pass, and there's
really no reason to delay relfrozenxid advancement till after that.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

10 декабря 2021 г., 02:53:40

On Thu, Dec 9, 2021 at 5:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Oh, I think I misunderstood. Your concern is for the case where the
> DBA runs a simple "VACUUM" in single-user mode; you want to skip over
> tables that don't really need to advance relfrozenxid, automatically.

Right.

> I can see an argument for something like that, but I think that it
> should be a variant of VACUUM. Or maybe it could be addressed with a
> better user interface;

On Thu, Dec 9, 2021 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> What if the user tried to reclaim space by vacuuming (via truncation)? Or is
> working around some corruption or such?  I think this is too much magic.
>
> That said, having a VACUUM "selector" that selects the oldest tables could be
> quite useful. And address this usecase both for single-user and normal
> operation.

All good points.

[Peter again]
> single-user mode should prompt the user about
> what exact VACUUM command they ought to run to get things going.

The current message is particularly bad in its vagueness because some
users immediately reach for VACUUM FULL, which quite logically seems
like the most complete thing to do.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

10 декабря 2021 г., 03:34:53

On Thu, Dec 9, 2021 at 3:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > single-user mode should prompt the user about
> > what exact VACUUM command they ought to run to get things going.
>
> The current message is particularly bad in its vagueness because some
> users immediately reach for VACUUM FULL, which quite logically seems
> like the most complete thing to do.

You mean the GetNewTransactionId() error, about single-user mode? Why
do we need to use single-user mode at all? I'm pretty sure that the
reason is "as an escape hatch", but I wonder what that really means.

***Thinks***

I suppose that it might be a good idea to make sure that autovacuum
cannot run, because in general autovacuum might need to allocate an
XID (for autoanalyze), and locking all that down in exactly the right
way might not be a very good use of our time.

But even still, why not have some variant of single-user mode just for
this task? Something that's easy to use when the DBA is rudely
awakened at 4am -- something a little bit like a big red button that
fixes the exact problem of XID exhaustion, in a reasonably targeted
way? I don't think that this needs to involve the VACUUM command
itself.

The current recommendation to do a whole-database VACUUM doesn't take
a position on how old the oldest datfrozenxid has to be in order to
become safe again, preferring to "make a conservative recommendation"
-- which is what a database-level VACUUM really is. But that doesn't
seem helpful at all. In fact, it's not even conservative. We could
easily come up with a reasonable definition of "datfrozenxid that's
sufficiently new to make it safe to come back online and allocate XIDs
again". Perhaps something based on the current
autovacuum_freeze_max_age (and autovacuum_multixact_freeze_max_age)
settings, with sanity checks.

We could then apply this criteria in new code that implements this
"big red button" (maybe this is a new option for the postgres
executable, a little like --single?). Something that's reasonably
targeted, and dead simple to use.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

10 декабря 2021 г., 04:05:42

On 12/9/21, 12:33 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 12/9/21, 11:34 AM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
>> Now that we have a concept of a fail-safe vacuum, maybe it would be
>> beneficial to skip a vacuum in single-user mode if the fail-safe
>> criteria were not met at the beginning of vacuuming a relation. This
>> is not without risk, of course, but it should be much faster than
>> today and once up and running the admin would have a chance to get a
>> handle on things. Thoughts?
>
> Would the --min-xid-age and --no-index-cleanup vacuumdb options help
> with this?

Sorry, I'm not sure what I was thinking.  Of coure you cannot use
vacuumdb in single-user mode.  But I think something like
--min-xid-age in VACUUM is what you are looking for.

Nathan

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

10 декабря 2021 г., 04:06:33

On 12/9/21, 4:36 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> We could then apply this criteria in new code that implements this
> "big red button" (maybe this is a new option for the postgres
> executable, a little like --single?). Something that's reasonably
> targeted, and dead simple to use.

+1

Nathan

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

10 декабря 2021 г., 04:11:57

On 12/9/21, 5:06 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 12/9/21, 4:36 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
>> We could then apply this criteria in new code that implements this
>> "big red button" (maybe this is a new option for the postgres
>> executable, a little like --single?). Something that's reasonably
>> targeted, and dead simple to use.
>
> +1

As Andres noted, such a feature might be useful during normal
operation, too.  Perhaps the vacuumdb --min-xid-age stuff should be
moved to a new VACUUM option.

Nathan

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

10 декабря 2021 г., 04:25:53

On Thu, Dec 9, 2021 at 5:12 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> As Andres noted, such a feature might be useful during normal
> operation, too.  Perhaps the vacuumdb --min-xid-age stuff should be
> moved to a new VACUUM option.

I was thinking of something like pg_import_system_collations() for
this: a function that's built-in, and can be called in single user
mode, that nevertheless doesn't make any assumptions about how it may
be called. Nothing stops a superuser from calling
pg_import_system_collations() themselves, outside of initdb. That
isn't particularly common, but it works in the way you'd expect it to
work. It's easy to test.

I imagine that this new function (to handle maintenance tasks in the
event of a wraparound emergency) would output information about its
progress. For example, it would make an up-front decision about which
tables needed to be vacuumed in order for the current DB's
datfrozenxid to be sufficiently new, before it started anything (with
handling for edge-cases with many tables, perhaps). It might also show
the size of each table, and show another line for each table that has
been processed so far, as a rudimentary progress indicator.

We could still have a separate option for the postgres executable,
just to invoke single-user mode and call this function. It would
mostly just be window dressing, of course, but that still seems
useful.

--
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

10 декабря 2021 г., 04:56:16

Hi,

On 2021-12-09 16:34:53 -0800, Peter Geoghegan wrote:
> But even still, why not have some variant of single-user mode just for
> this task?

> Something that's easy to use when the DBA is rudely
> awakened at 4am -- something a little bit like a big red button that
> fixes the exact problem of XID exhaustion, in a reasonably targeted
> way? I don't think that this needs to involve the VACUUM command
> itself.

I think we should move *away* from single user mode, rather than the
opposite. It's a substantial code burden and it's hard to use.

I don't think single user mode is a good fit for this anyway - it's inherently
focussed on connecting to a single database. But wraparound issues often
involve more than one database (often just because of shared catalogs).

Also, requiring a restart will often exascerbate the problem - the cache will
be cold, there's no walwriter, etc, making the vacuum slower. Making vacuum
not consume an xid seems like a lot more promising - and quite doable. Then
there's no need to restart at all.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

10 декабря 2021 г., 05:32:12

On Thu, Dec 9, 2021 at 5:56 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should move *away* from single user mode, rather than the
> opposite. It's a substantial code burden and it's hard to use.

I wouldn't say that this is moving closer to single user mode.

> I don't think single user mode is a good fit for this anyway - it's inherently
> focussed on connecting to a single database. But wraparound issues often
> involve more than one database (often just because of shared catalogs).

I don't disagree with any of that. My suggestions were based on the
assumption that it might be unrealistic to expect somebody to spend a
huge amount of time on this, given that (in a certain sense) it's
never really supposed to be used. Even a very simple approach would be
a big improvement.

> Also, requiring a restart will often exascerbate the problem - the cache will
> be cold, there's no walwriter, etc, making the vacuum slower. Making vacuum
> not consume an xid seems like a lot more promising - and quite doable. Then
> there's no need to restart at all.

I didn't give too much consideration to what it would take to keep the
system partially online, without introducing excessive complexity.
Maybe it wouldn't be that hard to teach the system to stop allocating
XIDs, while still allowing autovacuum workers to continue to get the
system functioning again. With the av workers taking a particular
emphasis on doing whatever work is required for the system to be able
to allocate XIDs again -- but not too much more (not until things are
back to normal). Now the plan is starting to get ambitious relative to
how often it'll be seen by users, though.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

10 декабря 2021 г., 07:41:07

On 12/9/21, 5:27 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> I imagine that this new function (to handle maintenance tasks in the
> event of a wraparound emergency) would output information about its
> progress. For example, it would make an up-front decision about which
> tables needed to be vacuumed in order for the current DB's
> datfrozenxid to be sufficiently new, before it started anything (with
> handling for edge-cases with many tables, perhaps). It might also show
> the size of each table, and show another line for each table that has
> been processed so far, as a rudimentary progress indicator.

I like the idea of having a built-in function that does the bare
minimum to resolve wraparound emergencies, and I think providing some
sort of simple progress indicator (even if rudimentary) would be very
useful.  I imagine the decision logic could be pretty simple.  If
we're only interested in getting the cluster out of a wraparound
emergency, we can probably just look for all tables with an age over
~2B.

Nathan

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

21 декабря 2021 г., 06:46:30

Hi,

On 2021-12-20 17:17:26 -0800, Peter Geoghegan wrote:
> On Thu, Dec 9, 2021 at 8:41 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> > I like the idea of having a built-in function that does the bare
> > minimum to resolve wraparound emergencies, and I think providing some
> > sort of simple progress indicator (even if rudimentary) would be very
> > useful.
> 
> If John doesn't have time to work on this during the Postgres 15
> cycle, and if nobody else picks it up, then we should at least do the
> bare minimum here: force the use of the failsafe in single user mode
> (regardless of the age of relfrozenxid/relminmxid, which in general
> might not be that old in tables where VACUUM might need to do a lot of
> work). Attached quick and dirty patch shows what this would take. If
> nothing else, it seems natural to define running any VACUUM in single
> user mode as an emergency.

As I said before I think this is a bad idea. I'm fine with adding a vacuum
parameter forcing failsafe mode. And perhaps a hint to suggest it in single
user mode. But forcing it is a bad idea - single user isn't just used for
emergencies (c.f. initdb, which this patch would regress) and not every
emergency making single user mode useful is related to wraparound.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

21 декабря 2021 г., 07:40:12

On Tue, Dec 21, 2021 at 12:46 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-20 17:17:26 -0800, Peter Geoghegan wrote:
> > On Thu, Dec 9, 2021 at 8:41 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> > > I like the idea of having a built-in function that does the bare
> > > minimum to resolve wraparound emergencies, and I think providing some
> > > sort of simple progress indicator (even if rudimentary) would be very
> > > useful.
> >
> > If John doesn't have time to work on this during the Postgres 15
> > cycle, and if nobody else picks it up, then we should at least do the
> > bare minimum here: force the use of the failsafe in single user mode
> > (regardless of the age of relfrozenxid/relminmxid, which in general
> > might not be that old in tables where VACUUM might need to do a lot of
> > work). Attached quick and dirty patch shows what this would take. If
> > nothing else, it seems natural to define running any VACUUM in single
> > user mode as an emergency.
>
> As I said before I think this is a bad idea. I'm fine with adding a vacuum
> parameter forcing failsafe mode. And perhaps a hint to suggest it in single
> user mode. But forcing it is a bad idea - single user isn't just used for
> emergencies (c.f. initdb, which this patch would regress) and not every
> emergency making single user mode useful is related to wraparound.

+1

BTW a vacuum automatically enters failsafe mode under the situation
where the user has to run a vacuum in the single-user mode, right?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

21 декабря 2021 г., 07:52:39

On Mon, Dec 20, 2021 at 8:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> BTW a vacuum automatically enters failsafe mode under the situation
> where the user has to run a vacuum in the single-user mode, right?

Only for the table that had the problem. Maybe there are no other
tables that a database level "VACUUM" will need to spend much time on,
or maybe there are, and they will make it take much much longer (it
all depends).

The goal of the patch is to make sure that when we're in single user
mode, we'll consistently trigger the failsafe, for every VACUUM
against every table -- not just the table (or tables) whose
relfrozenxid is very old. That's still naive, but much less naive than
simply telling users to VACUUM the whole database in single user mode
while vacuuming indexes, etc.

--
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

21 декабря 2021 г., 10:56:07

On Tue, Dec 21, 2021 at 1:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Dec 20, 2021 at 8:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > BTW a vacuum automatically enters failsafe mode under the situation
> > where the user has to run a vacuum in the single-user mode, right?
>
> Only for the table that had the problem. Maybe there are no other
> tables that a database level "VACUUM" will need to spend much time on,
> or maybe there are, and they will make it take much much longer (it
> all depends).
>
> The goal of the patch is to make sure that when we're in single user
> mode, we'll consistently trigger the failsafe, for every VACUUM
> against every table -- not just the table (or tables) whose
> relfrozenxid is very old. That's still naive, but much less naive than
> simply telling users to VACUUM the whole database in single user mode
> while vacuuming indexes, etc.

I understand the patch, thank you for the explanation!

I remember Simon proposed a VACUUM command option[1], called
FAST_FREEZE, to turn off index cleanup and heap truncation. Now that
we have failsafe mechanism probably we can have a VACUUM command
option to turn on failsafe mode instead.

Regards,

[1] https://commitfest.postgresql.org/32/2908/

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

12 января 2022 г., 03:58:56

On Tue, Dec 21, 2021 at 4:56 PM Peter Geoghegan <pg@bowt.ie> wrote:

> But if we're going to add a new option to the VACUUM command (or
> something of similar scope), then we might as well add a new behavior
> that is reasonably exact -- something that (say) only *starts* a
> VACUUM for those tables whose relfrozenxid age currently exceeds half
> the autovacuum_freeze_max_age for the table (usually taken from the
> GUC, sometimes taken from the reloption), which also forces the
> failsafe. And with similar handling for
> relminmxid/autovacuum_multixact_freeze_max_age.

> This new command/facility should probably not be a new flag to the
> VACUUM command, as such. Rather, I think that it should either be an
> SQL-callable function, or a dedicated top-level command (that doesn't
> accept any tables). The only reason to have this is for scenarios
> where the user is already in a tough spot with wraparound failure,
> like that client of yours. Nobody wants to force the failsafe for one
> specific table. It's not general purpose, at all, and shouldn't claim
> to be.

I've attached a PoC *untested* patch to show what it would look like
as a top-level statement. If the "shape" is uncontroversial, I'll put
work into testing it and fleshing it out.

For the PoC I wanted to try re-using existing keywords. I went with
"VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
a table name. It also brings "wraparound limit" to mind. We could add
a single-use unreserved keyword (such as VACUUM_MINIMAL or
VACUUM_FAST), but that doesn't seem great.

> In other words, while triggering the failsafe is important, simply *not
> starting* VACUUM for relations where there is really no need for it is
> at least as important. We shouldn't even think about pruning or
> freezing with these tables. (ISTM that the only thing that might be a
> bit controversial about any of this is my definition of "safe", which
> seems like about the right trade-off to me.)

I'm not sure what the right trade-off is, but as written I used 95% of
max age. It might be undesirable to end up so close to kicking off
uninterruptible vacuums, but the point is to get out of single-user
mode and back to streaming WAL as quickly as possible. It might also
be worth overriding the min ages as well, but haven't done so here.

It can be executed in normal mode (although it's not expected to be),
which makes testing easier and allows for a future possibility of not
requiring shutdown at all, by e.g. terminating non-superuser
connections.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Вложения

v2-vacuum-select-tables-closest-to-wraparound.patch

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

12 января 2022 г., 04:56:35

On Tue, Jan 11, 2022 at 4:59 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> I've attached a PoC *untested* patch to show what it would look like
> as a top-level statement. If the "shape" is uncontroversial, I'll put
> work into testing it and fleshing it out.

Great!

> For the PoC I wanted to try re-using existing keywords. I went with
> "VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
> a table name. It also brings "wraparound limit" to mind. We could add
> a single-use unreserved keyword (such as VACUUM_MINIMAL or
> VACUUM_FAST), but that doesn't seem great.

This seems reasonable, but you could add a new option instead, without
much downside. While INDEX_CLEANUP kind of looks like a keyword, it
isn't really a keyword. (Perhaps you knew this already.)

Making this a new option is a little awkward, admittedly. It's not
clear what it means to "VACUUM (LIMIT) my_table" -- do you just throw
an error for stuff like that? So perhaps your approach of adding
VacuumMinimalStmt (a minimal variant of the VACUUM command) is better.

> I'm not sure what the right trade-off is, but as written I used 95% of
> max age. It might be undesirable to end up so close to kicking off
> uninterruptible vacuums, but the point is to get out of single-user
> mode and back to streaming WAL as quickly as possible. It might also
> be worth overriding the min ages as well, but haven't done so here.

I wonder if we should keep autovacuum_freeze_max_age out of it -- its
default is too conservative in general. I'm concerned that applying
this autovacuum_freeze_max_age test during VACUUM LIMIT doesn't go far
enough -- it may require VACUUM LIMIT to do significantly more work
than is needed to get the system back online (while leaving a sensible
amount of headroom). Also seems like it might be a good idea to avoid
relying on the user configuration, given that VACUUM LIMIT is only run
when everything is already in disarray. (Besides, it's not clear that
it's okay to use the autovacuum_freeze_max_age GUC without also using
the reloption of the same name.)

What do you think of applying a similar test using a generic 1 billion
XID (and 1 billion MXID) age cutoff? When VACUUM LIMIT is run, we've
already learned that the *entire* XID space wasn't sufficient for the
user workload, so we're not really in a position to promise much.
Often the real problem will be something like a leaked replication
slot, or application code that's seriously misbehaving. It's really
the DBA's job to *keep* the system up. VACUUM LIMIT is just a tool
that allows the DBA to do this without excessive downtime.

The GetNewTransactionId() WARNINGs ought to be changed to reference
VACUUM LIMIT. (You probably just didn't get around to that in this
POC, but couldn't hurt to remind you.)

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

12 января 2022 г., 09:48:55

On Wed, Jan 12, 2022 at 10:57 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Jan 11, 2022 at 4:59 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > I've attached a PoC *untested* patch to show what it would look like
> > as a top-level statement. If the "shape" is uncontroversial, I'll put
> > work into testing it and fleshing it out.
>
> Great!

+1

>
> > For the PoC I wanted to try re-using existing keywords. I went with
> > "VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
> > a table name. It also brings "wraparound limit" to mind. We could add
> > a single-use unreserved keyword (such as VACUUM_MINIMAL or
> > VACUUM_FAST), but that doesn't seem great.
>
> This seems reasonable, but you could add a new option instead, without
> much downside. While INDEX_CLEANUP kind of looks like a keyword, it
> isn't really a keyword. (Perhaps you knew this already.)
>
> Making this a new option is a little awkward, admittedly. It's not
> clear what it means to "VACUUM (LIMIT) my_table" -- do you just throw
> an error for stuff like that? So perhaps your approach of adding
> VacuumMinimalStmt (a minimal variant of the VACUUM command) is better.

It seems to me that adding new syntax instead of a new option is less
flexible. In the future, for instance, when we support parallel heap
scan for VACUUM, we may want to add a parallel-related option to both
VACUUM statement and VACUUM LIMIT statement. VACUUM LIMIT statement
would end up becoming like VACUUM statement?

As another idea, we might be able to add a new option that takes an
optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
VACUUM (MIN_XID 500000). We vacuum only tables whose age is older than
the given value. If the value is omitted, we vacuum only tables whose
age exceeds a threshold (say autovacuum_freeze_max_age * 0.95), which
can be used in an emergency case and output in GetNewTransactionID()
WARNINGs output. vacuumdb’s --min-xid-age and --min-mxid-age can use
this option instead of fetching the list of tables from the server.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

12 января 2022 г., 18:07:47

On Tue, Jan 11, 2022 at 8:57 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 11, 2022 at 4:59 PM John Naylor
> > For the PoC I wanted to try re-using existing keywords. I went with
> > "VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
> > a table name. It also brings "wraparound limit" to mind. We could add
> > a single-use unreserved keyword (such as VACUUM_MINIMAL or
> > VACUUM_FAST), but that doesn't seem great.
>
> This seems reasonable, but you could add a new option instead, without
> much downside. While INDEX_CLEANUP kind of looks like a keyword, it
> isn't really a keyword. (Perhaps you knew this already.)
>
> Making this a new option is a little awkward, admittedly. It's not
> clear what it means to "VACUUM (LIMIT) my_table" -- do you just throw
> an error for stuff like that? So perhaps your approach of adding
> VacuumMinimalStmt (a minimal variant of the VACUUM command) is better.

We'd also have to do some checks to either ignore other options or
throw an error, which seems undesirable for code maintenance. For that
reason, I prefer the separate top-level statement, but I'm open to
bike-shedding on the actual syntax. I also briefly looked into a SQL
function, but the transaction management would make that more
difficult.

> > I'm not sure what the right trade-off is, but as written I used 95% of
> > max age. It might be undesirable to end up so close to kicking off
> > uninterruptible vacuums, but the point is to get out of single-user
> > mode and back to streaming WAL as quickly as possible. It might also
> > be worth overriding the min ages as well, but haven't done so here.
>
> I wonder if we should keep autovacuum_freeze_max_age out of it -- its
> default is too conservative in general. I'm concerned that applying
> this autovacuum_freeze_max_age test during VACUUM LIMIT doesn't go far
> enough -- it may require VACUUM LIMIT to do significantly more work
> than is needed to get the system back online (while leaving a sensible
> amount of headroom). Also seems like it might be a good idea to avoid
> relying on the user configuration, given that VACUUM LIMIT is only run
> when everything is already in disarray. (Besides, it's not clear that
> it's okay to use the autovacuum_freeze_max_age GUC without also using
> the reloption of the same name.)
>
> What do you think of applying a similar test using a generic 1 billion
> XID (and 1 billion MXID) age cutoff?

I like that a lot, actually. It's simple and insulates us from
wondering about corner cases in configuration.

> The GetNewTransactionId() WARNINGs ought to be changed to reference
> VACUUM LIMIT. (You probably just didn't get around to that in this
> POC, but couldn't hurt to remind you.)

I'll do that as well as documentation after we have agreement (or at
least lack of objection) on the syntax.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

12 января 2022 г., 18:09:49

On Tue, Jan 11, 2022 at 9:20 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Jan 11, 2022 at 07:58:56PM -0500, John Naylor wrote:
> > +             // FIXME: also check reloption
> > +             // WIP: 95% is a starting point for discussion
> > +             if ((table_xid_age < autovacuum_freeze_max_age * 0.95) ||
> > +                     (table_mxid_age < autovacuum_multixact_freeze_max_age * 0.95))
> > +                     continue;
>
> Should be &&

Thanks! Will fix.

> Should this emergency vacuum "order by age() DESC" ?

That would add complexity and only save a marginal amount of time.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

12 января 2022 г., 18:42:06

On Wed, Jan 12, 2022 at 1:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> It seems to me that adding new syntax instead of a new option is less
> flexible. In the future, for instance, when we support parallel heap
> scan for VACUUM, we may want to add a parallel-related option to both
> VACUUM statement and VACUUM LIMIT statement. VACUUM LIMIT statement
> would end up becoming like VACUUM statement?

This is intended for single-user mode, so parallelism is not relevant.

> As another idea, we might be able to add a new option that takes an
> optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
> VACUUM (MIN_XID 500000). We vacuum only tables whose age is older than
> the given value. If the value is omitted, we vacuum only tables whose
> age exceeds a threshold (say autovacuum_freeze_max_age * 0.95), which
> can be used in an emergency case and output in GetNewTransactionID()
> WARNINGs output. vacuumdb’s --min-xid-age and --min-mxid-age can use
> this option instead of fetching the list of tables from the server.

That could work, and maybe also have general purpose, but I see two
problems if I understand you correctly:

- If we have a default threshold when the values are omitted, that
implies we need to special-case single-user mode with non-obvious
behavior, which is not ideal, as Andres mentioned upthread. (Or, now
manual VACUUM by default would not do anything, except in extreme
cases, which is worse.)

- In the single-user case, the admin would still need to add
INDEX_CLEANUP = off for minimum downtime, and it should be really
simple.

- For the general case, we would now have the ability to vacuum a
table, and possibly have no effect at all. That seems out of place
with the other options.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

12 января 2022 г., 20:25:55

On 1/12/22, 7:43 AM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
> On Wed, Jan 12, 2022 at 1:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> As another idea, we might be able to add a new option that takes an
>> optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
>> VACUUM (MIN_XID 500000). We vacuum only tables whose age is older than
>> the given value. If the value is omitted, we vacuum only tables whose
>> age exceeds a threshold (say autovacuum_freeze_max_age * 0.95), which
>> can be used in an emergency case and output in GetNewTransactionID()
>> WARNINGs output. vacuumdb’s --min-xid-age and --min-mxid-age can use
>> this option instead of fetching the list of tables from the server.
>
> That could work, and maybe also have general purpose, but I see two
> problems if I understand you correctly:
>
> - If we have a default threshold when the values are omitted, that
> implies we need to special-case single-user mode with non-obvious
> behavior, which is not ideal, as Andres mentioned upthread. (Or, now
> manual VACUUM by default would not do anything, except in extreme
> cases, which is worse.)

I agree, I don't think such options should have a default value.

> - In the single-user case, the admin would still need to add
> INDEX_CLEANUP = off for minimum downtime, and it should be really
> simple.
>
> - For the general case, we would now have the ability to vacuum a
> table, and possibly have no effect at all. That seems out of place
> with the other options.

Perhaps a message would be emitted when tables are specified but
skipped due to the min-xid-age option.

As I've stated upthread, Sawada-san's suggested approach was my
initial reaction to this thread.  I'm not wedded to the idea of adding
new options, but I think there are a couple of advantages.  For both
single-user mode and normal operation (which may be in imminent
wraparound danger), you could use the same command:

        VACUUM (MIN_XID_AGE 1600000000, ...);

(As an aside, we'd need to figure out how XID and MXID options would
work together.  Presumably most users would want to OR them.)

This doesn't really tie in super nicely with the failsafe mechanism,
but adding something like a FAILSAFE option doesn't seem right to me,
as it's basically just an alias for a bunch of other options.  In my
mind, even a new top-level command would just be an alias for the
aforementioned command.  Of course, providing a new option is not
quite as simple as opening up single-user mode and typing "BAIL OUT,"
but I don't know if it is prohibitively complicated for end users.
They'll already have had to figure out how to start single-user mode
in the first place, and we can have nice ERROR/WARNING messages that
provide a suggested VACUUM command.

The other advantage I see with age-related options is that it can be
useful for non-imminent-wraparound situations as well.  For example,
maybe a user just wants to manually vacuum everything (including
indexes) with an age above 500M on the weekends.

Another idea is to do both.  We could add age-related options, and we
could also add a "BAIL OUT" command that is just an alias for a
special VACUUM command that we feel will help get things under control
as quickly as possible.

Nathan

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

13 января 2022 г., 15:57:44

On Wed, Jan 12, 2022 at 12:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> > - For the general case, we would now have the ability to vacuum a
> > table, and possibly have no effect at all. That seems out of place
> > with the other options.
>
> Perhaps a message would be emitted when tables are specified but
> skipped due to the min-xid-age option.
>
> As I've stated upthread, Sawada-san's suggested approach was my
> initial reaction to this thread.  I'm not wedded to the idea of adding
> new options, but I think there are a couple of advantages.  For both
> single-user mode and normal operation (which may be in imminent
> wraparound danger), you could use the same command:
>
>         VACUUM (MIN_XID_AGE 1600000000, ...);

My proposed top-level statement can also be used in normal operation,
so the only possible advantage is configurability. But I don't really
see any advantage in that -- I don't think we should be moving in the
direction of adding more-intricate ways to paper over the deficiencies
in autovacuum scheduling. (It could be argued that I'm doing exactly
that in this whole thread, but [imminent] shutdown situations have
other causes besides deficient scheduling.)

> (As an aside, we'd need to figure out how XID and MXID options would
> work together.  Presumably most users would want to OR them.)
>
> This doesn't really tie in super nicely with the failsafe mechanism,
> but adding something like a FAILSAFE option doesn't seem right to me,

I agree -- it would be awkward and messy as an option. However, I see
the same problem with xid/mxid -- I would actually argue they are not
even proper options; they are "selectors". Your comments above about
1) needing to OR them and 2) emitting a message when a VACUUM command
doesn't actually do anything are evidence of that fact.

> The other advantage I see with age-related options is that it can be
> useful for non-imminent-wraparound situations as well.  For example,
> maybe a user just wants to manually vacuum everything (including
> indexes) with an age above 500M on the weekends.

There is already vaccumdb for that, and I think it's method of
selecting tables is sound -- I'm not convinced that pushing table
selection to the server command as "options" is an improvement.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

14 января 2022 г., 01:04:11

On 1/13/22, 4:58 AM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
> On Wed, Jan 12, 2022 at 12:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> As I've stated upthread, Sawada-san's suggested approach was my
>> initial reaction to this thread.  I'm not wedded to the idea of adding
>> new options, but I think there are a couple of advantages.  For both
>> single-user mode and normal operation (which may be in imminent
>> wraparound danger), you could use the same command:
>>
>>         VACUUM (MIN_XID_AGE 1600000000, ...);
>
> My proposed top-level statement can also be used in normal operation,
> so the only possible advantage is configurability. But I don't really
> see any advantage in that -- I don't think we should be moving in the
> direction of adding more-intricate ways to paper over the deficiencies
> in autovacuum scheduling. (It could be argued that I'm doing exactly
> that in this whole thread, but [imminent] shutdown situations have
> other causes besides deficient scheduling.)

The new top-level command would be configurable, right?  Your patch
uses autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so
the behavior of this new command now depends on the values of
parameters that won't obviously be related to it.  If these parameters
are set very low (e.g., the default values), then this command will
end up doing far more work than is probably necessary.

If we did go the route of using a parameter to determine which tables
to vacuum, I think vacuum_failsafe_age is a much better candidate, as
it defaults to a much higher value that is more likely to prevent
doing extra work.  That being said, I don't know if overloading
parameters is the right way to go.

>> (As an aside, we'd need to figure out how XID and MXID options would
>> work together.  Presumably most users would want to OR them.)
>>
>> This doesn't really tie in super nicely with the failsafe mechanism,
>> but adding something like a FAILSAFE option doesn't seem right to me,
>
> I agree -- it would be awkward and messy as an option. However, I see
> the same problem with xid/mxid -- I would actually argue they are not
> even proper options; they are "selectors". Your comments above about
> 1) needing to OR them and 2) emitting a message when a VACUUM command
> doesn't actually do anything are evidence of that fact.

That's a fair point.  But I don't think these problems are totally
intractable.  We already emit "skipping" messages from VACUUM
sometimes, and interactions between VACUUM options exist today, too.
For example, FREEZE is redundant when FULL is specified, and
INDEX_CLEANUP is totally ignored when FULL is used.

>> The other advantage I see with age-related options is that it can be
>> useful for non-imminent-wraparound situations as well.  For example,
>> maybe a user just wants to manually vacuum everything (including
>> indexes) with an age above 500M on the weekends.
>
> There is already vaccumdb for that, and I think it's method of
> selecting tables is sound -- I'm not convinced that pushing table
> selection to the server command as "options" is an improvement.

I guess I'm ultimately imagining the new options as replacing the
vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
behind the scenes (as would a new top-level command).

Nathan

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

14 января 2022 г., 19:21:54

I see a CF entry has been created already, and the cfbot doesn't like
my PoC. To prevent confusion, I've taken the liberty of switching the
author to myself and set to Waiting on Author. FWIW, my local build
passed make check-world after applying Justin's fix and changing a
couple other things.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

19 января 2022 г., 08:46:01

On Fri, Jan 14, 2022 at 7:04 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 1/13/22, 4:58 AM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
> > On Wed, Jan 12, 2022 at 12:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> >> As I've stated upthread, Sawada-san's suggested approach was my
> >> initial reaction to this thread.  I'm not wedded to the idea of adding
> >> new options, but I think there are a couple of advantages.  For both
> >> single-user mode and normal operation (which may be in imminent
> >> wraparound danger), you could use the same command:
> >>
> >>         VACUUM (MIN_XID_AGE 1600000000, ...);
> >
> > My proposed top-level statement can also be used in normal operation,
> > so the only possible advantage is configurability. But I don't really
> > see any advantage in that -- I don't think we should be moving in the
> > direction of adding more-intricate ways to paper over the deficiencies
> > in autovacuum scheduling. (It could be argued that I'm doing exactly
> > that in this whole thread, but [imminent] shutdown situations have
> > other causes besides deficient scheduling.)
>
> The new top-level command would be configurable, right?  Your patch
> uses autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so
> the behavior of this new command now depends on the values of
> parameters that won't obviously be related to it.  If these parameters
> are set very low (e.g., the default values), then this command will
> end up doing far more work than is probably necessary.
>
> If we did go the route of using a parameter to determine which tables
> to vacuum, I think vacuum_failsafe_age is a much better candidate, as
> it defaults to a much higher value that is more likely to prevent
> doing extra work.  That being said, I don't know if overloading
> parameters is the right way to go.
>
> >> (As an aside, we'd need to figure out how XID and MXID options would
> >> work together.  Presumably most users would want to OR them.)
> >>
> >> This doesn't really tie in super nicely with the failsafe mechanism,
> >> but adding something like a FAILSAFE option doesn't seem right to me,
> >
> > I agree -- it would be awkward and messy as an option. However, I see
> > the same problem with xid/mxid -- I would actually argue they are not
> > even proper options; they are "selectors". Your comments above about
> > 1) needing to OR them and 2) emitting a message when a VACUUM command
> > doesn't actually do anything are evidence of that fact.
>
> That's a fair point.  But I don't think these problems are totally
> intractable.  We already emit "skipping" messages from VACUUM
> sometimes, and interactions between VACUUM options exist today, too.
> For example, FREEZE is redundant when FULL is specified, and
> INDEX_CLEANUP is totally ignored when FULL is used.
>
> >> The other advantage I see with age-related options is that it can be
> >> useful for non-imminent-wraparound situations as well.  For example,
> >> maybe a user just wants to manually vacuum everything (including
> >> indexes) with an age above 500M on the weekends.

I also think there is a use case where a user just wants to manually
vacuum tables that are older than a certain threshold. In this case,
they might want to specify VACUUM command options such as the parallel
option while selecting tables.

> >
> > There is already vaccumdb for that, and I think it's method of
> > selecting tables is sound -- I'm not convinced that pushing table
> > selection to the server command as "options" is an improvement.

I think that having the user not rely on vacuumdb by implementing it
on the server side would be an improvement.

> I guess I'm ultimately imagining the new options as replacing the
> vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
> behind the scenes (as would a new top-level command).

I had the same idea.

That having been said, I agree that xid/mxid options are different
things from the existing VACUUM command options; whereas the existing
VACUUM options control its behavior, xid/mxid options are selectors
for tables to vacuum (PROCESS_TOAST option could be a selector but I
think it’s slightly different from xid/mxid options).

IIUC what we want to do here are two things: (1) select only old
tables and (2) set INDEX_CLEANUP = off, TRUNCATE = off, and FREEZE =
on. VACUUM LIMIT statement does both things at the same time. Although
I’m concerned a bit about its flexibility, it’s a reasonable solution.

On the other hand, it’s probably also useful to do either one thing in
some cases. For instance, having a selector for (1) would be useful,
and having a new option like FAST_FREEZE for (2) would also be useful.
Given there is already a way for (2) (it does not default though), I
think it might also be a good start inventing something for (1). For
instance, a selector for VACUUM statement I came up with is:

VACUUM (verbose on) TABLES WITH (min_xid_age = 1600000000);
or
VACUUM (verbose on) TABLES WITH (min_age = failsafe_limit);

We can expand it in the future to select tables by, for example, dead
tuple ratio, size, etc.

It's a random thought but maybe worth considering.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

19 января 2022 г., 22:14:13

On Wed, Jan 19, 2022 at 12:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 14, 2022 at 7:04 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> >
> > I guess I'm ultimately imagining the new options as replacing the
> > vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
> > behind the scenes (as would a new top-level command).
>
> I had the same idea.

This seems to be the motivating reason for wanting new configurability
on the server side. In any case, new knobs are out of scope for this
thread. If the use case is compelling enough, may I suggest starting a
new thread?

Regarding the thread subject, I've been playing with the grammar, and
found it's quite easy to have

VACUUM FOR WRAPAROUND
or
VACUUM FOR EMERGENCY

since FOR is a reserved word (and following that can be an IDENT plus
a strcmp check) and cannot conflict with table names. This sounds a
bit more natural than VACUUM LIMIT. Opinions?

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

19 января 2022 г., 22:57:23

On 1/18/22, 9:47 PM, "Masahiko Sawada" <sawada.mshk@gmail.com> wrote:
> IIUC what we want to do here are two things: (1) select only old
> tables and (2) set INDEX_CLEANUP = off, TRUNCATE = off, and FREEZE =
> on. VACUUM LIMIT statement does both things at the same time. Although
> I’m concerned a bit about its flexibility, it’s a reasonable solution.
>
> On the other hand, it’s probably also useful to do either one thing in
> some cases. For instance, having a selector for (1) would be useful,
> and having a new option like FAST_FREEZE for (2) would also be useful.
> Given there is already a way for (2) (it does not default though), I
> think it might also be a good start inventing something for (1). For
> instance, a selector for VACUUM statement I came up with is:
>
> VACUUM (verbose on) TABLES WITH (min_xid_age = 1600000000);
> or
> VACUUM (verbose on) TABLES WITH (min_age = failsafe_limit);
>
> We can expand it in the future to select tables by, for example, dead
> tuple ratio, size, etc.
>
> It's a random thought but maybe worth considering.

That's an interesting idea.  A separate selector clause could also
allow users to choose how they interacted (e.g., should the options be
OR'd or AND'd).

Nathan

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

20 января 2022 г., 00:11:48

On 1/19/22, 11:15 AM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
> This seems to be the motivating reason for wanting new configurability
> on the server side. In any case, new knobs are out of scope for this
> thread. If the use case is compelling enough, may I suggest starting a
> new thread?

Sure.  Perhaps the new top-level command will use these new options
someday.

> Regarding the thread subject, I've been playing with the grammar, and
> found it's quite easy to have
>
> VACUUM FOR WRAPAROUND
> or
> VACUUM FOR EMERGENCY
>
> since FOR is a reserved word (and following that can be an IDENT plus
> a strcmp check) and cannot conflict with table names. This sounds a
> bit more natural than VACUUM LIMIT. Opinions?

I personally think VACUUM FOR WRAPAROUND is the best of the options
provided thus far.

Nathan

Re: do only critical work during single-user vacuum?

От

Michael Paquier

Дата:

20 января 2022 г., 01:26:32

On Wed, Jan 19, 2022 at 09:11:48PM +0000, Bossart, Nathan wrote:
> I personally think VACUUM FOR WRAPAROUND is the best of the options
> provided thus far.

Could you avoid introducing a new grammar pattern in VACUUM?  Any new
option had better be within the parenthesized part as it is extensible
at will with its set of DefElems.
--
Michael

Вложения

signature.asc

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

21 января 2022 г., 08:58:46

On Thu, Jan 20, 2022 at 4:14 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Wed, Jan 19, 2022 at 12:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jan 14, 2022 at 7:04 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> > >
> > > I guess I'm ultimately imagining the new options as replacing the
> > > vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
> > > behind the scenes (as would a new top-level command).
> >
> > I had the same idea.
>
> This seems to be the motivating reason for wanting new configurability
> on the server side. In any case, new knobs are out of scope for this
> thread. If the use case is compelling enough, may I suggest starting a
> new thread?

The purpose of this thread is to provide a way for users to run vacuum
only very old tables (while skipping index cleanup, etc.), and the way
is not limited to introducing a new top-level VACUUM statement yet,
right? A new top-level VACUUM statement you proposed seems a good idea
but trying to achieve it by extending the current VACUUM statement is
also a good idea. So I think the ideas like MIN_XID_AGE option and new
table selector in VACUUM statement are relevant to this thread.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

22 января 2022 г., 01:41:58

On Wed, Jan 19, 2022 at 5:26 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> Could you avoid introducing a new grammar pattern in VACUUM?  Any new
> option had better be within the parenthesized part as it is extensible
> at will with its set of DefElems.

This new behavior is not an option that one can sensibly mix with
other options as the user sees fit, but rather hard-codes the
parameters for its single purpose. That said, I do understand your
objection.

[*thinks*]

How about the attached patch (and test script)? It still needs polish,
but it could work. It allows "verbose" to coexist, although that's
really only for testing normal mode. While testing in single-user
mode, I was sad to find out that it not only doesn't emit messages
(not a client), but also doesn't log. That would have been a decent
way to monitor progress...

In this form, I'm no longer a fan of calling the option "wraparound",
because it's too close to the "is_wraparound" param member.
Internally, at least, we can use "emergency" or "minimal". (In fact
the bit symbol is VACOPT_MINIMAL for this draft). That can be worked
out later.

On Fri, Jan 21, 2022 at 12:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The purpose of this thread is to provide a way for users to run vacuum
> only very old tables (while skipping index cleanup, etc.),

Ah, thank you Sawada-san, now I understand why we have been talking
past each other. The purpose is actually:

- to have a simple, easy to type, command
- intended for single-user mode, but not limited to it (so it's easy to test)
- to get out of single user mode as quickly as possible

--
John Naylor
EDB: http://www.enterprisedb.com

Вложения

Re: do only critical work during single-user vacuum?

От

"Bossart, Nathan"

Дата:

22 января 2022 г., 02:32:57

On 1/21/22, 2:43 PM, "John Naylor" <john.naylor@enterprisedb.com> wrote:
> - to have a simple, easy to type, command

AFAICT the disagreement is really just about the grammar.
Sawada-san's idea would look something like

VACUUM (FREEZE, INDEX_CLEANUP OFF, MIN_XID_AGE 1600000000, MIN_MXID_AGE 1600000000);

while your proposal looks more like

VACUUM (WRAPAROUND);

The former is highly configurable, but it is probably annoying to type
at 3 AM, and the interaction between the two *_AGE options is not
exactly intuitive (although I expect MIN_XID_AGE to be sufficient in
most cases). The latter is not as configurable, but it is much easier
to type at 3 AM.

I think simplicity is a good goal, but I don't know if the difference
between the two approaches outweighs the benefits of configurability.
If you are in an emergency situation, you already will have to take
down the server, connect in single-user mode to the database(s) that
need vacuuming, and actually do the vacuuming. The wraparound
WARNING/ERROR already has a HINT that describes the next steps
required. Perhaps it would be enough to also emit an example VACUUM
command to use.

I think folks will find the configurability useful, too. With
MIN_XID_AGE, it's super easy to have pg_cron vacuum everything over
500M on the weekend (and also do index cleanup), which may allow you
to use more relaxed autovacuum settings during the week. The docs
already have suggestions for manually vacuuming when the load is low
[0], so I think it is reasonable to build additional support for this
use-case.

Nathan

[0] https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

02 февраля 2022 г., 00:50:31

On Thu, Jan 27, 2022 at 8:28 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

> I'm sure you meant "&" here (fixed in attached patch to appease the cfbot):
> +               if (options | VACOPT_MINIMAL)

Thanks for catching that! That copy-pasto was also masking my failure
to process the option properly -- fixed in the attached as v5.

> It should either refuse to run if a list of tables is specified with MINIMAL,
> or it should filter that list by XID condition.

I went with the former for simplicity. As a single-purpose option, it
makes sense.

> As for the name, it could be MINIMAL or FAILSAFE or EMERGENCY or ??
> I think the name should actually be a bit more descriptive, and maybe say XID,
> like MINIMAL_XID or XID_EMERGENCY...

I went with EMERGENCY in this version to reinforce its purpose in the
mind of the user (and reader of this code).

> Normally, options are independent, but VACUUM (MINIMAL) is a "shortcut" to a
> hardcoded set of options: freeze on, truncate off, cleanup off.  So it refuses
> to be combined with other options - good.
>
> This is effectively a shortcut to hypothetical parameters for selecting tables
> by XID/MXID age.  In the future, someone could debate adding user-facing knobs
> for table selection by age.

I used the params struct in v5 for the emergency cutoff ages. Even
with the values hard-coded, it seems cleaner to keep them here.

> I still wonder if the relations should be processed in order of decreasing age.
> An admin might have increased autovacuum_freeze_max_age up to 2e9, and your
> query might return thousands of tables, with a wide range of sizes and ages.
>
> Processing them in order of decreasing age would allow the admin to quickly
> vacuum the oldest tables, and optionally interrupt vacuum to get out of single
> user mode ASAP - even if their just want to run VACUUM(MINIMAL) in a normal
> backend when services aren't offline.  Processing them out of order might be
> pretty surprising - they might run vacuum for an hour (or overnight), cancel
> it, attempt to start the DB in normal mode, and conclude that it made no
> visible progress.

While that seems like a nice property to have, it does complicate
things, so can be left for follow-on work.

Also in v5:

- It mentions the new command in the error hint in
GetNewTransactionId(). I'm not sure if multi-word commands should be
quoted like this.
- A first draft of documentation

--
John Naylor
EDB: http://www.enterprisedb.com

Вложения

v5-do-only-critical-work-during-single-user-vacuum.patch

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

03 февраля 2022 г., 11:14:07

On Wed, Feb 2, 2022 at 6:50 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 27, 2022 at 8:28 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> > I'm sure you meant "&" here (fixed in attached patch to appease the cfbot):
> > +               if (options | VACOPT_MINIMAL)
>
> Thanks for catching that! That copy-pasto was also masking my failure
> to process the option properly -- fixed in the attached as v5.
>
> > It should either refuse to run if a list of tables is specified with MINIMAL,
> > or it should filter that list by XID condition.
>
> I went with the former for simplicity. As a single-purpose option, it
> makes sense.
>
> > As for the name, it could be MINIMAL or FAILSAFE or EMERGENCY or ??
> > I think the name should actually be a bit more descriptive, and maybe say XID,
> > like MINIMAL_XID or XID_EMERGENCY...
>
> I went with EMERGENCY in this version to reinforce its purpose in the
> mind of the user (and reader of this code).
>
> > Normally, options are independent, but VACUUM (MINIMAL) is a "shortcut" to a
> > hardcoded set of options: freeze on, truncate off, cleanup off.  So it refuses
> > to be combined with other options - good.
> >
> > This is effectively a shortcut to hypothetical parameters for selecting tables
> > by XID/MXID age.  In the future, someone could debate adding user-facing knobs
> > for table selection by age.
>
> I used the params struct in v5 for the emergency cutoff ages. Even
> with the values hard-coded, it seems cleaner to keep them here.
>
> > I still wonder if the relations should be processed in order of decreasing age.
> > An admin might have increased autovacuum_freeze_max_age up to 2e9, and your
> > query might return thousands of tables, with a wide range of sizes and ages.
> >
> > Processing them in order of decreasing age would allow the admin to quickly
> > vacuum the oldest tables, and optionally interrupt vacuum to get out of single
> > user mode ASAP - even if their just want to run VACUUM(MINIMAL) in a normal
> > backend when services aren't offline.  Processing them out of order might be
> > pretty surprising - they might run vacuum for an hour (or overnight), cancel
> > it, attempt to start the DB in normal mode, and conclude that it made no
> > visible progress.
>
> While that seems like a nice property to have, it does complicate
> things, so can be left for follow-on work.
>
> Also in v5:
>
> - It mentions the new command in the error hint in
> GetNewTransactionId(). I'm not sure if multi-word commands should be
> quoted like this.
> - A first draft of documentation

Thank you for updating the patch.

I have a few questions and comments:

+      The only other option that may be combined with
<literal>VERBOSE</literal>, although in single-user mode no client
messages are
+      output.

Given VERBOSE with EMERGENCY can work only in multi-user mode, why
only VERBOSE can be specified with EMERGENCY? I think the same is true
for other options like PARALLEL; PARALLEL can work only in multi-user
mode.

---
+      It performs a database-wide vacuum on tables, toast tables, and
materialized views whose
+      xid age or mxid age is older than 1 billion.

Do we need to allow the user to specify the threshold or need a higher
value (at least larger than 1.6 billion, default value of
vacuum_failsafe_age)? I imagined a case where there are a few very-old
tables (say 2 billion old) and many tables that are older than 1
billion. In this case, VACUUM (EMERGENCY) would take a long time to
complete. But to minimize the downtime, we might want to run VACUUM
(EMERGENCY) on only the very-old tables, start the cluster in
multi-user mode, and run vacuum on multiple tables in parallel.

---
+       if (params->options & VACOPT_EMERGENCY)
+       {
+           /*
+           * Only consider relations able to hold unfrozen XIDs (anything else
+           * should have InvalidTransactionId in relfrozenxid anyway).
+           */
+           if (classForm->relkind != RELKIND_RELATION &&
+               classForm->relkind != RELKIND_MATVIEW &&
+               classForm->relkind != RELKIND_TOASTVALUE)
+           {
+               Assert(!TransactionIdIsValid(classForm->relfrozenxid));
+               Assert(!MultiXactIdIsValid(classForm->relminmxid));
+               continue;
+           }
+
+           table_xid_age = DirectFunctionCall1(xid_age,
classForm->relfrozenxid);
+           table_mxid_age = DirectFunctionCall1(mxid_age,
classForm->relminmxid);
+

I think that instead of calling xid_age and mxid_age for each
relation, we can compute the thresholds for xid and mxid once, and
then compare them to relation's relfrozenxid and relminmxid.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

03 февраля 2022 г., 18:49:00

On Thu, Feb 3, 2022 at 3:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> +      The only other option that may be combined with
> <literal>VERBOSE</literal>, although in single-user mode no client
> messages are
> +      output.
>
> Given VERBOSE with EMERGENCY can work only in multi-user mode, why
> only VERBOSE can be specified with EMERGENCY? I think the same is true
> for other options like PARALLEL; PARALLEL can work only in multi-user
> mode.

You are right; it makes sense to allow options that would be turned
off automatically in single-user mode. Even if we don't expect it to
be used in normal mode, the restrictions should make sense. Also,
maybe documenting the allowed combinations is a distraction in the
main entry and should be put in the notes at the bottom.

> +      It performs a database-wide vacuum on tables, toast tables, and
> materialized views whose
> +      xid age or mxid age is older than 1 billion.
>
> Do we need to allow the user to specify the threshold or need a higher
> value (at least larger than 1.6 billion, default value of
> vacuum_failsafe_age)? I imagined a case where there are a few very-old
> tables (say 2 billion old) and many tables that are older than 1
> billion. In this case, VACUUM (EMERGENCY) would take a long time to
> complete.

I still don't think fine-tuning is helpful here. Shutdown vacuum
should be just as trivial to run as it is now, but with better
behavior. I believe a user knowledgeable enough to come up with the
best number is unlikely to get in this situation in the first place.
I'm also not sure a production support engineer would (or should)
immediately figure out a better number than a good default. That said,
the 1 billion figure was a suggestion from Peter G. upthread, and a
higher number could be argued.

> But to minimize the downtime, we might want to run VACUUM
> (EMERGENCY) on only the very-old tables, start the cluster in
> multi-user mode, and run vacuum on multiple tables in parallel.

That's exactly the idea. Also, back in normal mode, we can start
streaming WAL again. However, we don't want to go back online so close
to the limit that we risk shutdown again. People have a reasonable
expectation that if you fix an emergency, it's now fixed and the
application can go back online. Falling down repeatedly, or worrying
if it's possible, is very bad.

> +       if (params->options & VACOPT_EMERGENCY)
> +       {
> +           /*
> +           * Only consider relations able to hold unfrozen XIDs (anything else
> +           * should have InvalidTransactionId in relfrozenxid anyway).
> +           */
> +           if (classForm->relkind != RELKIND_RELATION &&
> +               classForm->relkind != RELKIND_MATVIEW &&
> +               classForm->relkind != RELKIND_TOASTVALUE)
> +           {
> +               Assert(!TransactionIdIsValid(classForm->relfrozenxid));
> +               Assert(!MultiXactIdIsValid(classForm->relminmxid));
> +               continue;
> +           }
> +
> +           table_xid_age = DirectFunctionCall1(xid_age,
> classForm->relfrozenxid);
> +           table_mxid_age = DirectFunctionCall1(mxid_age,
> classForm->relminmxid);
> +
>
> I think that instead of calling xid_age and mxid_age for each
> relation, we can compute the thresholds for xid and mxid once, and
> then compare them to relation's relfrozenxid and relminmxid.

That sounds like a good idea if it's simple to implement, so I will
try it. If it adds complexity, I don't think it's worth it. Scanning a
few thousand rows in pg_class along with the function calls is tiny
compared to the actual vacuum work.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

03 февраля 2022 г., 19:18:43

Thinking further about the use of emergency mode, we have this:

"If for some reason autovacuum fails to clear old XIDs from a table,
the system will begin to emit warning messages like this when the
database's oldest XIDs reach forty million transactions from the
wraparound point:

WARNING:  database "mydb" must be vacuumed within 39985967 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in
that database.
"

It seems people tend not to see these warnings if they didn't already
have some kind of monitoring which would prevent them from getting
here in the first place. But if they do, the hint should mention the
emergency option here, too. This puts Justin's idea upthread in a new
light -- if the admin does notice this warning, then emergency mode
should indeed vacuum the oldest tables first, since autovacuum is not
(yet) smart enough to do that. I'll pursue that as a follow-up.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

03 февраля 2022 г., 21:05:50

On Thu, Dec 9, 2021 at 8:56 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should move *away* from single user mode, rather than the
> opposite. It's a substantial code burden and it's hard to use.

Yes. This thread seems to be largely devoted to the topic of making
single-user vacuum work better, but I don't see anyone asking the
question "why do we have a message that tells people to vacuum in
single user mode in the first place?". It's basically bad advice, with
one small exception that I'll talk about in a minute. Suppose we had a
message in the tree that said "HINT: Consider angering a live anaconda
to fix this problem." If that were so, the correct thing to do
wouldn't be to add a section to our documentation explaining how to
deal with angry anacondas. The correct thing to do would be to remove
the hint as bad advice that we never should have offered in the first
place. And so here. We should not try to make vacuum in single
user-mode work better or differently, or at least that shouldn't be
our primary objective. We should just stop telling people to do it. We
should probably add messages and documentation *discouraging* the use
of single user mode for recovering from wraparound trouble, exactly
the opposite of what we do now. There's nothing we can do in
single-user mode that we can't do equally well in multi-user mode. If
people try to fix wraparound problems in multi-user mode, they still
have read-only access to their database, they can use parallelism,
they can use command line utilities like vacuumdb, and they can use
psql which has line editing and allows remote access and is a way
nicer user experience than running postgres --single. We need a really
compelling reason to tell people to give up all those advantages, and
there is no such reason. It makes just as much sense as telling people
to deal with wraparound problems by angering a live anaconda.

I did say there was an exception, and it's this: the last time I
studied this issue back in 2019,[1] vacuum insisted on trying to
truncate tables even when the system is in wraparound danger. Then it
would fail, because truncating the table required allocating an XID,
which would fail if we were short on XIDs. By putting the system in
single user mode, you could continue to allocate XIDs and thus VACUUM
would work. However, if you think about this for even 10 seconds, you
can see that it's terrible. If we're so short of XIDs that we are
scared to allocate them for fear of causing an actual wraparound,
putting the system into a mode where that protection is bypassed is a
super-terrible idea. People will be able to run vacuum, yes, but if
they have too many tables, they will actually experience wraparound
and thus data loss before they process all the tables they have. What
we ought to do to solve this problem is NOT TRUNCATE when the number
of remaining XIDs is small, so that we don't consume any of the
remaining XIDs until we get the system out of wraparound danger. I
think the "failsafe" stuff Peter added in v14 fixes that, though. If
not, we should adjust it so it does. And then we should KILL WITH FIRE
the message telling people to use single user mode -- and once we do
that, the question of what the behavior ought to be when someone does
run VACUUM in single user mode becomes a lot less important.

This problem is basically self-inflicted. We have given people bad
advice (use single user mode) and then they suffer when they take it.
Ameliorating the suffering isn't the worst idea ever, but it's
basically fixing the wrong problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] http://postgr.es/m/CA+Tgmob1QCMJrHwRBK8HZtGsr+6cJANRQw2mEgJ9e=D+z7cOsw@mail.gmail.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

03 февраля 2022 г., 21:34:28

On Thu, Feb 3, 2022 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Dec 9, 2021 at 8:56 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we should move *away* from single user mode, rather than the
> > opposite. It's a substantial code burden and it's hard to use.
>
> Yes. This thread seems to be largely devoted to the topic of making
> single-user vacuum work better, but I don't see anyone asking the
> question "why do we have a message that tells people to vacuum in
> single user mode in the first place?". It's basically bad advice, with
> one small exception that I'll talk about in a minute.

The word "advice" sounds like people have a choice, rather than the
system not accepting commands anymore. It would be much less painful
if the system closed connections and forbade all but superusers to
connect, but that sounds like a lot of work. (happy to be proven
otherwise)

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

03 февраля 2022 г., 21:42:20

On Thu, Feb 3, 2022 at 1:34 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> The word "advice" sounds like people have a choice, rather than the
> system not accepting commands anymore. It would be much less painful
> if the system closed connections and forbade all but superusers to
> connect, but that sounds like a lot of work. (happy to be proven
> otherwise)

They *do* have a choice. They can continue to operate the system in
multi-user mode, they can have read access to their data, and they can
run VACUUM and other non-XID-allocating commands to fix the issue.
Sure, their application can't run commands that allocate XIDs, but
it's not going to be able to do that if they go to single-user mode
either.

I don't understand why we would want the system to stop accepting
connections other than superuser connections. That would provide
strictly less functionality and I don't understand what it would gain.
But it would still be better than going into single-user mode, which
provides even less functionality and has basically no advantages of
any kind.

Why are you convinced that the user HAS to go to single-user mode? I
don't think they have to do that, and I don't think they should want
to do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

04 февраля 2022 г., 00:18:27

On Thu, Feb 3, 2022 at 1:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Feb 3, 2022 at 1:34 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > The word "advice" sounds like people have a choice, rather than the
> > system not accepting commands anymore. It would be much less painful
> > if the system closed connections and forbade all but superusers to
> > connect, but that sounds like a lot of work. (happy to be proven
> > otherwise)
>
> They *do* have a choice. They can continue to operate the system in
> multi-user mode, they can have read access to their data, and they can
> run VACUUM and other non-XID-allocating commands to fix the issue.
> Sure, their application can't run commands that allocate XIDs, but
> it's not going to be able to do that if they go to single-user mode
> either.

I just checked some client case notes where they tried just that
before getting outside help, and both SELECT and VACUUM FREEZE
commands were rejected. The failure is clearly indicated in the log.
-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

04 февраля 2022 г., 00:50:48

Hi,

On 2022-02-03 13:42:20 -0500, Robert Haas wrote:
> They *do* have a choice. They can continue to operate the system in
> multi-user mode, they can have read access to their data, and they can
> run VACUUM and other non-XID-allocating commands to fix the issue.
> Sure, their application can't run commands that allocate XIDs, but
> it's not going to be able to do that if they go to single-user mode
> either.

I wonder if we shouldn't add some exceptions to the xid allocation
prevention. It makes sense that we don't allow random DML. But it's e.g. often
more realistic to drop / truncate a few tables with unimportant content,
rather than spend the time vacuuming those.  We could e.g. allow xid
consumption within VACUUM, TRUNCATE, DROP TABLE / INDEX when run at the top
level for longer than we allow it for anything else.

> But it would still be better than going into single-user mode, which
> provides even less functionality and has basically no advantages of
> any kind.

Indeed. Single user is the worst response to this (and just about anything
else, really). Even just getting into the single user mode takes a while
(shutdown checkpoint). The user interface is completely different (and
awful). The buffer cache is completely cold. The system is slower because
there's no wal writer / checkpointer running.  Which basically is a list of
things one absolutely do not wants when confronted with a wraparound
situation.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

04 февраля 2022 г., 00:58:00

On Thu, Feb 3, 2022 at 4:18 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I just checked some client case notes where they tried just that
> before getting outside help, and both SELECT and VACUUM FREEZE
> commands were rejected. The failure is clearly indicated in the log.

It would be helpful to know how it failed - what was the error? And
then I think we should just fix whatever the problem is. As I said
before, I know TRUNCATE has been an issue in the past, and if that's
not already fixed in v14, we should. If there's other stuff, we should
fix that too. The point I'm making here, which I still believe to be
valid, is that there's nothing intrinsically better about being in
single user mode. In fact, it's clearly worse. And I don't think it's
hard to fix it so that we avoid people needing to do that in the first
place.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

04 февраля 2022 г., 00:58:30

Hi,

On 2022-02-03 16:18:27 -0500, John Naylor wrote:
> I just checked some client case notes where they tried just that
> before getting outside help, and both SELECT and VACUUM FREEZE
> commands were rejected.

What kind of SELECT was that? Any chance it caused a write via functions, a
view, whatnot? And what version? What was the exact error message?

VACUUM FREEZE is a *terrible* idea to run when encountering anti-wraparound
issues. I understand why people thing key might need it, but basically all it
achieves is to make VACUUM do a lot more, none of it helpful to get out of the
wraparound-can't-write situation (those rows will already get frozen).

I'd plus one the addition of a HINT that tells users that FREEZE likely is a
bad idea when in wraparound land. We should allow it, because there are
situation where it might make sense, but the people that can make that
judgement know they can ignore the HINT.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

04 февраля 2022 г., 01:02:15

On Thu, Feb 3, 2022 at 4:50 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if we shouldn't add some exceptions to the xid allocation
> prevention. It makes sense that we don't allow random DML. But it's e.g. often
> more realistic to drop / truncate a few tables with unimportant content,
> rather than spend the time vacuuming those.  We could e.g. allow xid
> consumption within VACUUM, TRUNCATE, DROP TABLE / INDEX when run at the top
> level for longer than we allow it for anything else.

True, although we currently don't start refusing XID allocation
altogether until only 1 million remain, IIRC. And that's cutting it
really close if we need to start consuming 1 XID per table we need to
drop. We might need to push out some of the thresholds a bit.

For the most part, I think that there's no reason why autovacuum
shouldn't be able to recover from this situation automatically, as
long as old replication slots and prepared transactions are cleaned up
and any old transactions are killed off. I don't think we're very far
from that Just Working, but we are not all there yet either. Manual
intervention to drop tables etc. is reasonable to allow a bit more
than we do now, but the big problem IMO is that the behavior when we
run short of XIDs has had very little testing and bug fixing, so
things that don't really need to break just do anyway.

> Indeed. Single user is the worst response to this (and just about anything
> else, really). Even just getting into the single user mode takes a while
> (shutdown checkpoint). The user interface is completely different (and
> awful). The buffer cache is completely cold. The system is slower because
> there's no wal writer / checkpointer running.  Which basically is a list of
> things one absolutely do not wants when confronted with a wraparound
> situation.

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

04 февраля 2022 г., 01:07:48

On Thu, Feb 3, 2022 at 4:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-03 16:18:27 -0500, John Naylor wrote:
> > I just checked some client case notes where they tried just that
> > before getting outside help, and both SELECT and VACUUM FREEZE
> > commands were rejected.
>
> What kind of SELECT was that? Any chance it caused a write via functions, a
> view, whatnot? And what version? What was the exact error message?

Looking closer, there is a function defined by an extension. I'd have
to dig further to see if writes happen. The error is exactly what
we've been talking about:

2022-01-03 22:03:23 PST ERROR: database is not accepting commands to
avoid wraparound data loss in database "<redacted>"
2022-01-03 22:03:23 PST HINT: Stop the postmaster and vacuum that
database in single-user mode. You might also need to commit or roll
back old prepared transactions.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

04 февраля 2022 г., 03:30:13

On Thu, Feb 3, 2022 at 5:08 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Looking closer, there is a function defined by an extension. I'd have
> to dig further to see if writes happen. The error is exactly what
> we've been talking about:
>
> 2022-01-03 22:03:23 PST ERROR: database is not accepting commands to
> avoid wraparound data loss in database "<redacted>"
> 2022-01-03 22:03:23 PST HINT: Stop the postmaster and vacuum that
> database in single-user mode. You might also need to commit or roll
> back old prepared transactions.

That error comes from GetNewTransactionId(), so that function must
either try to execute DML or do something else which causes an XID to
be assigned. I think a plain SELECT should work just fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

04 февраля 2022 г., 04:35:39

Hi,

On 2022-02-03 17:02:15 -0500, Robert Haas wrote:
> On Thu, Feb 3, 2022 at 4:50 PM Andres Freund <andres@anarazel.de> wrote:
> > I wonder if we shouldn't add some exceptions to the xid allocation
> > prevention. It makes sense that we don't allow random DML. But it's e.g. often
> > more realistic to drop / truncate a few tables with unimportant content,
> > rather than spend the time vacuuming those.  We could e.g. allow xid
> > consumption within VACUUM, TRUNCATE, DROP TABLE / INDEX when run at the top
> > level for longer than we allow it for anything else.
>
> True, although we currently don't start refusing XID allocation
> altogether until only 1 million remain, IIRC. And that's cutting it
> really close if we need to start consuming 1 XID per table we need to
> drop. We might need to push out some of the thresholds a bit.

Yea, I'd have no problem leaving the "hard" limit somewhere closer to 1
million (although 100k should be just as well), but introduce a softer "only
vacuum/drop/truncate" limit a good bit before that.

> For the most part, I think that there's no reason why autovacuum
> shouldn't be able to recover from this situation automatically, as
> long as old replication slots and prepared transactions are cleaned up
> and any old transactions are killed off.

To address the "as long as" part: I think that describing better what is
holding back the horizon would be a significant usability improvement.

Imagine that instead of the generic hints in these messages:
                ereport(ERROR,
                        (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                         errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
                                oldest_datname),
                         errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
                                 "You might also need to commit or roll back old prepared transactions, or drop stale
replicationslots.")));

and
        ereport(WARNING,
                (errmsg("oldest xmin is far in the past"),
                 errhint("Close open transactions soon to avoid wraparound problems.\n"
                         "You might also need to commit or roll back old prepared transactions, or drop stale
replicationslots.")));

we'd actually tell the user a bit more what about what is causing the
problem.

We can compute the:
1) oldest slot by xmin, with name
2) oldest walsender by xmin, with pid
3) oldest prepared transaction id by xid / xmin, with name
4) oldest in-progress transaction id by xid / xmin, with name
5) oldest database datfrozenxid, with database name

If 1-4) are close to 5), there's no point in trying to vacuum aggressively, it
won't help. So we instead can say that the xmin horizon (with a better name)
is held back by the oldest of these, with enough identifying information for
the user to actually know where to look.

In contrast, if 5) is older than 1-4), then we can tell the user which
database is the problem, as we do right now, but we can stop mentioning the
"You might also need to commit ..." bit.

Also, adding an SRF providing the above in a useful format would be great for
monitoring and for "remote debugging" of problems.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

04 февраля 2022 г., 05:08:03

On Thu, Feb 3, 2022 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, I'd have no problem leaving the "hard" limit somewhere closer to 1
> million (although 100k should be just as well), but introduce a softer "only
> vacuum/drop/truncate" limit a good bit before that.

+1.

> To address the "as long as" part: I think that describing better what is
> holding back the horizon would be a significant usability improvement.
>
> Imagine that instead of the generic hints in these messages:
>                                 ereport(ERROR,
>                                                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>                                                  errmsg("database is not accepting commands to avoid wraparound data
lossin database \"%s\"",
 
>                                                                 oldest_datname),
>                                                  errhint("Stop the postmaster and vacuum that database in single-user
mode.\n"
>                                                                  "You might also need to commit or roll back old
preparedtransactions, or drop stale replication slots.")));
 
> and
>                 ereport(WARNING,
>                                 (errmsg("oldest xmin is far in the past"),
>                                  errhint("Close open transactions soon to avoid wraparound problems.\n"
>                                                  "You might also need to commit or roll back old prepared
transactions,or drop stale replication slots.")));
 
>
> we'd actually tell the user a bit more what about what is causing the
> problem.
>
> We can compute the:
> 1) oldest slot by xmin, with name
> 2) oldest walsender by xmin, with pid
> 3) oldest prepared transaction id by xid / xmin, with name
> 4) oldest in-progress transaction id by xid / xmin, with name
> 5) oldest database datfrozenxid, with database name
>
> If 1-4) are close to 5), there's no point in trying to vacuum aggressively, it
> won't help. So we instead can say that the xmin horizon (with a better name)
> is held back by the oldest of these, with enough identifying information for
> the user to actually know where to look.

Yes. This kind of thing strikes me as potentially a huge help. To
rephrase that in other terms, we could tell the user what the actual
problem is instead of suggesting to them that they shut down their
database just for fun. It's "just for fun" because (a) it typically
won't fix the real problem, which is most often (1) or (3) from your
list, and even if it's (2) or (4) they could just kill the session
instead of shutting down the whole database, and (b) no matter what
needs to be done, whether it's VACUUM or ROLLBACK PREPARED or
something else, they may as well do that thing in multi-user mode
rather than single-user mode, unless we as PostgreSQL developers
forgot to make that actually work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

04 февраля 2022 г., 06:26:01

Hi,

On 2022-02-03 21:08:03 -0500, Robert Haas wrote:
> On Thu, Feb 3, 2022 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
> > We can compute the:
> > 1) oldest slot by xmin, with name
> > 2) oldest walsender by xmin, with pid
> > 3) oldest prepared transaction id by xid / xmin, with name
> > 4) oldest in-progress transaction id by xid / xmin, with name
> > 5) oldest database datfrozenxid, with database name
> >
> > If 1-4) are close to 5), there's no point in trying to vacuum aggressively, it
> > won't help. So we instead can say that the xmin horizon (with a better name)
> > is held back by the oldest of these, with enough identifying information for
> > the user to actually know where to look.
> 
> Yes. This kind of thing strikes me as potentially a huge help. To
> rephrase that in other terms, we could tell the user what the actual
> problem is instead of suggesting to them that they shut down their
> database just for fun. It's "just for fun" because (a) it typically
> won't fix the real problem, which is most often (1) or (3) from your
> list, and even if it's (2) or (4) they could just kill the session
> instead of shutting down the whole database

Not that it matters, but IME the leading cause is 5). Often due to autovacuum
configuration. Which reminded me of the one thing that single user mode
is actually helpful for: Being able to start a manual VACUUM.

Once autovacuum is churning along in anti-wrap mode, with multiple workers, it
can be hard to manually VACUUM without waiting for autovacuum to do it's
throttled thing. The only way is to start the manual VACUUM and kill
autovacuum workers whenever they're blocking the manual vacuum(s).

Which reminds me: Perhaps we ought to hint about reducing / removing
autovacuum cost limits in this situation? And perhaps make autovacuum absorb
config changes while running? It's annoying that an autovac halfway into a
huge table doesn't absorb changed cost limits for example.

> (b) no matter what needs to be done, whether it's VACUUM or ROLLBACK
> PREPARED or something else, they may as well do that thing in multi-user
> mode rather than single-user mode, unless we as PostgreSQL developers forgot
> to make that actually work.

One thing that we made quite hard is to rollback prepared transactions,
because we require to be in the same database (a lot of fun in single user
mode with a lot of databases). We can't commit in the same database, but I
wonder if it's doable to allow rollbacks?

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Justin Pryzby

Дата:

04 февраля 2022 г., 07:44:07

On Thu, Feb 03, 2022 at 07:26:01PM -0800, Andres Freund wrote:
> Which reminds me: Perhaps we ought to hint about reducing / removing
> autovacuum cost limits in this situation? And perhaps make autovacuum absorb
> config changes while running? It's annoying that an autovac halfway into a
> huge table doesn't absorb changed cost limits for example.

I remembered this thread:

https://commitfest.postgresql.org/32/2983/
| Running autovacuum dynamic update to cost_limit and delay

https://www.postgresql.org/message-id/flat/13A6B954-5C21-4E60-BC06-751C8EA469A0%40amazon.com
https://www.postgresql.org/message-id/flat/0A3F8A3C-4328-4A4B-80CF-14CEBE0B695D%40amazon.com

-- 
Justin

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

05 февраля 2022 г., 01:56:45

On Thu, Feb 3, 2022 at 7:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> That error comes from GetNewTransactionId(), so that function must
> either try to execute DML or do something else which causes an XID to
> be assigned. I think a plain SELECT should work just fine.

It was indeed doing writes, so that much is not a surprise anymore.

On Thu, Feb 3, 2022 at 9:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Feb 3, 2022 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
> > Yea, I'd have no problem leaving the "hard" limit somewhere closer to 1
> > million (although 100k should be just as well), but introduce a softer "only
> > vacuum/drop/truncate" limit a good bit before that.
>
> +1.

Since there seems to be agreement on this, I can attempt a stab at it,
but it'll be another week before I can do so.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

15 февраля 2022 г., 07:04:46

On Fri, Feb 4, 2022 at 4:58 AM Robert Haas <robertmhaas@gmail.com> wrote:

> As I said
> before, I know TRUNCATE has been an issue in the past, and if that's
> not already fixed in v14, we should. If there's other stuff, we should
> fix that too.

The failsafe mode does disable truncation as of v14:

commit 60f1f09ff44308667ef6c72fbafd68235e55ae27
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Tue Apr 13 12:58:31 2021 -0700

Don't truncate heap when VACUUM's failsafe is in effect.
--

To demonstrate to myself, I tried a few vacuums in a debugger session
with a breakpoint at GetNewTransactionId(). I've only seen it reach
here when heap truncation happens (or the not relevant for wraparound
situations FULL and ANALYZE).

With the maximum allowable setting of autovacuum_freeze_max_age of 2
billion, the highest allowable vacuum_failsafe_age is 2.1 billion, so
heap truncation will be shut off before the warnings start.

> And then we should KILL WITH FIRE
> the message telling people to use single user mode -- and once we do
> that, the question of what the behavior ought to be when someone does
> run VACUUM in single user mode becomes a lot less important.

Okay, so it sounds like changing the message is enough for v15? The
other two things mentioned are nice-to-haves, but wouldn't need to
hold back this minimal change, it seems:

On Fri, Feb 4, 2022 at 4:50 AM Andres Freund <andres@anarazel.de> wrote:

> I wonder if we shouldn't add some exceptions to the xid allocation
> prevention. It makes sense that we don't allow random DML. But it's e.g. often
> more realistic to drop / truncate a few tables with unimportant content,
> rather than spend the time vacuuming those.  We could e.g. allow xid
> consumption within VACUUM, TRUNCATE, DROP TABLE / INDEX when run at the top
> level for longer than we allow it for anything else.

It seems like this would require having access to "nodetag(parsetree)"
of the statement available in GetNewTransactionId. I don't immediately
see an easy way to do that...is a global var within the realm of
acceptability?

On Fri, Feb 4, 2022 at 8:35 AM Andres Freund <andres@anarazel.de> wrote:

> we'd actually tell the user a bit more what about what is causing the
> problem.
>
> We can compute the:
> 1) oldest slot by xmin, with name
> 2) oldest walsender by xmin, with pid
> 3) oldest prepared transaction id by xid / xmin, with name
> 4) oldest in-progress transaction id by xid / xmin, with name
> 5) oldest database datfrozenxid, with database name
[...]
> Also, adding an SRF providing the above in a useful format would be great for
> monitoring and for "remote debugging" of problems.

I concur it sounds very useful, and not terribly hard, but probably a
v16 project.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

15 февраля 2022 г., 07:21:41

On Mon, Feb 14, 2022 at 8:04 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> The failsafe mode does disable truncation as of v14:
>
> commit 60f1f09ff44308667ef6c72fbafd68235e55ae27
> Author: Peter Geoghegan <pg@bowt.ie>
> Date:   Tue Apr 13 12:58:31 2021 -0700
>
> Don't truncate heap when VACUUM's failsafe is in effect.

That's true, but bear in mind that it only does so when the specific
table being vacuumed actually triggers the failsafe. I believe that
VACUUM(EMERGENCY) doesn't just limit itself to vacuuming tables where
this is guaranteed (or even likely). If I'm not mistaken, it's
possible (even likely) that there will be a table whose
age(relfrozenxid) is high enough for VACUUM(EMERGENCY) to target the
table, and yet not so high that the failsafe will kick in at the
earliest opportunity.

> To demonstrate to myself, I tried a few vacuums in a debugger session
> with a breakpoint at GetNewTransactionId(). I've only seen it reach
> here when heap truncation happens (or the not relevant for wraparound
> situations FULL and ANALYZE).

It's possible for a manually issued VACUUM to directly disable
truncation (same with index_cleanup). Without getting into the
question of what the ideal behavior might be right now, I can say for
sure that it wouldn't be difficult to teach VACUUM(EMERGENCY) to pass
down the same options.

The failsafe is essentially a mechanism that dynamically changes these
options for an ongoing vacuum, once age(relfrozenxid) crosses a
certain threshold. There is nothing fundamentally special about that.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

15 февраля 2022 г., 09:04:42

On Tue, Feb 15, 2022 at 11:22 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Feb 14, 2022 at 8:04 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > The failsafe mode does disable truncation as of v14:
> >
> > commit 60f1f09ff44308667ef6c72fbafd68235e55ae27
> > Author: Peter Geoghegan <pg@bowt.ie>
> > Date:   Tue Apr 13 12:58:31 2021 -0700
> >
> > Don't truncate heap when VACUUM's failsafe is in effect.
>
> That's true, but bear in mind that it only does so when the specific
> table being vacuumed actually triggers the failsafe. I believe that
> VACUUM(EMERGENCY) doesn't just limit itself to vacuuming tables where
> this is guaranteed (or even likely). If I'm not mistaken, it's
> possible (even likely) that there will be a table whose
> age(relfrozenxid) is high enough for VACUUM(EMERGENCY) to target the
> table, and yet not so high that the failsafe will kick in at the
> earliest opportunity.

Well, the point of inventing this new vacuum mode was because I
thought that upon reaching xidStopLimit, we couldn't issue commands,
period, under the postmaster. If it was easier to get a test instance
to xidStopLimit, I certainly would have discovered this sooner. When
Andres wondered about getting away from single user mode, I assumed
that would involve getting into areas too deep to tackle for v15. As
Robert pointed out, lazy_truncate_heap is the only thing that can't
happen for vacuum at this point, and fully explains why in versions <
14 our client's attempts to vacuum resulted in error. Since the
failsafe mode turns off truncation, vacuum should now *just work* near
wraparound. If there is any doubt, we can tighten the check for
entering failsafe.

Now, it's certainly possible that autovacuum is either not working at
all because of something broken, or is not working on the oldest
tables at the moment, so one thing we could do is to make VACUUM [with
no tables listed] get the tables from pg_class in reverse order of
max(xid age, mxid age). That way, the horizon will eventually pull
back over time and the admin can optionally cancel the vacuum at some
point. Since the order is harmless when it's not needed, we can do
that unconditionally.
-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

15 февраля 2022 г., 17:16:06

On Tue, Feb 15, 2022 at 1:04 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Well, the point of inventing this new vacuum mode was because I
> thought that upon reaching xidStopLimit, we couldn't issue commands,
> period, under the postmaster. If it was easier to get a test instance
> to xidStopLimit, I certainly would have discovered this sooner. When
> Andres wondered about getting away from single user mode, I assumed
> that would involve getting into areas too deep to tackle for v15. As
> Robert pointed out, lazy_truncate_heap is the only thing that can't
> happen for vacuum at this point, and fully explains why in versions <
> 14 our client's attempts to vacuum resulted in error. Since the
> failsafe mode turns off truncation, vacuum should now *just work* near
> wraparound. If there is any doubt, we can tighten the check for
> entering failsafe.

+1 to all of that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

15 февраля 2022 г., 20:28:47

On Mon, Feb 14, 2022 at 10:04 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Well, the point of inventing this new vacuum mode was because I
> thought that upon reaching xidStopLimit, we couldn't issue commands,
> period, under the postmaster. If it was easier to get a test instance
> to xidStopLimit, I certainly would have discovered this sooner.

I did notice from my own testing of the failsafe (by artificially
inducing wraparound failure using an XID burning C function) that
autovacuum seemed to totally correct the problem, even when the system
had already crossed xidStopLimit - it came back on its own. I wasn't
completely sure of how robust this effect was, though.

> When
> Andres wondered about getting away from single user mode, I assumed
> that would involve getting into areas too deep to tackle for v15. As
> Robert pointed out, lazy_truncate_heap is the only thing that can't
> happen for vacuum at this point, and fully explains why in versions <
> 14 our client's attempts to vacuum resulted in error. Since the
> failsafe mode turns off truncation, vacuum should now *just work* near
> wraparound. If there is any doubt, we can tighten the check for
> entering failsafe.

Obviously having to enter single user mode is horrid. If we can
reasonably update the advice to something more reasonable now, then
that would help users that find themselves in this situation a great
deal.

> Now, it's certainly possible that autovacuum is either not working at
> all because of something broken, or is not working on the oldest
> tables at the moment, so one thing we could do is to make VACUUM [with
> no tables listed] get the tables from pg_class in reverse order of
> max(xid age, mxid age). That way, the horizon will eventually pull
> back over time and the admin can optionally cancel the vacuum at some
> point. Since the order is harmless when it's not needed, we can do
> that unconditionally.

My ongoing work on freezing/relfrozenxid tends to make the age of
relfrozenxid much more indicative of the amount of work that VACUUM
would have to do when run -- not limited to freezing. You could
probably do this anyway, but it's nice that that'll be true.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 02:17:25

On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Feb 14, 2022 at 10:04 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Well, the point of inventing this new vacuum mode was because I
> > thought that upon reaching xidStopLimit, we couldn't issue commands,
> > period, under the postmaster. If it was easier to get a test instance
> > to xidStopLimit, I certainly would have discovered this sooner.
>
> I did notice from my own testing of the failsafe (by artificially
> inducing wraparound failure using an XID burning C function) that
> autovacuum seemed to totally correct the problem, even when the system
> had already crossed xidStopLimit - it came back on its own. I wasn't
> completely sure of how robust this effect was, though.

It seemed worth noting this in comments above
should_attempt_truncation(). Pushed a commit to do that just now.

Thanks
-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

16 февраля 2022 г., 11:43:12

On Wed, Feb 16, 2022 at 6:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan <pg@bowt.ie> wrote:

> > I did notice from my own testing of the failsafe (by artificially
> > inducing wraparound failure using an XID burning C function) that
> > autovacuum seemed to totally correct the problem, even when the system
> > had already crossed xidStopLimit - it came back on its own. I wasn't
> > completely sure of how robust this effect was, though.

I'll put some effort in finding any way that it might not be robust.
After that, changing the message and docs is trivial.

> It seemed worth noting this in comments above
> should_attempt_truncation(). Pushed a commit to do that just now.

Thanks for that.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 19:12:53

On Wed, Feb 16, 2022 at 12:43 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> I'll put some effort in finding any way that it might not be robust.
> After that, changing the message and docs is trivial.

Great, thanks John.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Masahiko Sawada

Дата:

16 февраля 2022 г., 19:47:31

On Wed, Feb 16, 2022 at 2:29 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Feb 14, 2022 at 10:04 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Well, the point of inventing this new vacuum mode was because I
> > thought that upon reaching xidStopLimit, we couldn't issue commands,
> > period, under the postmaster. If it was easier to get a test instance
> > to xidStopLimit, I certainly would have discovered this sooner.
>
> I did notice from my own testing of the failsafe (by artificially
> inducing wraparound failure using an XID burning C function) that
> autovacuum seemed to totally correct the problem, even when the system
> had already crossed xidStopLimit - it came back on its own. I wasn't
> completely sure of how robust this effect was, though.

FYI, I've tested the situation that I assumed autovacuum can not
correct the problem; when the system had already crossed xidStopLimit,
it keeps failing to vacuum on tables that appear in the front of the
list and have sufficient garbage to trigger the truncation but are not
older than the failsafe limit. But contrary to my assumption, it did
correct the problem since autovacuum continues to the next table in
the list even after an error. This probably means that autovacuum
eventually succeeds to process all tables that trigger the failsafe
mode, ensuring advancing datfrozenxid, which is great.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 20:50:41

On Wed, Feb 16, 2022 at 8:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> FYI, I've tested the situation that I assumed autovacuum can not
> correct the problem; when the system had already crossed xidStopLimit,
> it keeps failing to vacuum on tables that appear in the front of the
> list and have sufficient garbage to trigger the truncation but are not
> older than the failsafe limit. But contrary to my assumption, it did
> correct the problem since autovacuum continues to the next table in
> the list even after an error. This probably means that autovacuum
> eventually succeeds to process all tables that trigger the failsafe
> mode, ensuring advancing datfrozenxid, which is great.

Right; it seems as if the situation is much improved, even when the
failsafe didn't prevent the system from going over xidStopLimit. If
autovacuum alone can bring the system back to a normal state as soon
as possible, without a human needing to do anything special, then
clearly the general risk is much smaller. Even this worst case
scenario where "the failsafe has failed" is not so bad anymore, in
practice. I don't think that it really matters if some concurrent
non-emergency VACUUMs fail when attempting to truncate the table (it's
no worse than ANALYZE failing, for example).

Good news!

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

16 февраля 2022 г., 20:56:42

On Wed, Feb 16, 2022 at 12:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Good news!

+1. But I think we might want to try to write documentation around
this. We should explicitly tell people NOT to use single-user mode,
because that stupid message has been there for a long time and a lot
of people have probably internalized it by now. And we should also
tell them that they SHOULD check for prepared transactions, old
replication slots, etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 21:14:19

On Wed, Feb 16, 2022 at 9:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> +1. But I think we might want to try to write documentation around
> this. We should explicitly tell people NOT to use single-user mode,
> because that stupid message has been there for a long time and a lot
> of people have probably internalized it by now. And we should also
> tell them that they SHOULD check for prepared transactions, old
> replication slots, etc.

Absolutely -- couldn't agree more. Do you think it's worth targeting
14 here, or just HEAD?

I'm pretty sure that some people believe that wraparound can cause
actual data corruption, in part because of the way the docs present
the information. The system won't do that, of course (precisely
because of this xidStopLimit behavior). The docs make it all sound
absolutely terrifying, which doesn't seem proportionate to me (at
least not with this stuff in place, maybe not ever).

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

16 февраля 2022 г., 21:18:46

Hi,

On 2022-02-16 10:14:19 -0800, Peter Geoghegan wrote:
> Absolutely -- couldn't agree more. Do you think it's worth targeting
> 14 here, or just HEAD?

I'd go for HEAD first, but wouldn't protest against 14.

> I'm pretty sure that some people believe that wraparound can cause
> actual data corruption

Well, historically they're not wrong. And we've enough things stored in 32bit
counters that I'd be surprised if we didn't have more wraparound issues. Of
course that's not related to anti-wrap vacuums...

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 21:27:38

On Wed, Feb 16, 2022 at 10:18 AM Andres Freund <andres@anarazel.de> wrote:
> > I'm pretty sure that some people believe that wraparound can cause
> > actual data corruption
>
> Well, historically they're not wrong.

True, but the most recent version where that's actually possible is
PostgreSQL 8.0, which was released in early 2005. That was a very
different time for the project. I don't think that people believe that
wraparound can cause data corruption because they remember a time when
it really could. It seems like general confusion to me (which could
have been avoided).

At a minimum, we ought to be very clear on the fact that Postgres
isn't going to just let your database become corrupt in some more or
less predictable way. The xidStopLimit thing is pretty bad, but it's
still much better than that.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 22:18:19

On Wed, Feb 16, 2022 at 10:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
> True, but the most recent version where that's actually possible is
> PostgreSQL 8.0, which was released in early 2005.

It just occurred to me that the main historic reason for the single
user mode advice was the lack of virtual XIDs. The commit that added
the xidStopLimit behavior (commit 60b2444cc3) came a couple of years
before the introduction of virtual transaction IDs (in commit
295e63983d). AFAICT, the advice about single-user mode was added at a
time where exceeding xidStopLimit caused the system to grind to a halt
completely -- even trivial SELECTs would have failed once Postgres 8.1
crossed the xidStopLimit limit.

It seems as if the advice about single user mode persisted for no
great reason at all. Technically there were some remaining reasons to
keep it around (like the truncation thing), but overall these
secondary reasons could have been addressed much sooner if somebody
had thought about it.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

16 февраля 2022 г., 23:11:16

On Wed, Feb 16, 2022 at 1:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Feb 16, 2022 at 10:18 AM Andres Freund <andres@anarazel.de> wrote:
> > > I'm pretty sure that some people believe that wraparound can cause
> > > actual data corruption
> >
> > Well, historically they're not wrong.
>
> True, but the most recent version where that's actually possible is
> PostgreSQL 8.0, which was released in early 2005. That was a very
> different time for the project. I don't think that people believe that
> wraparound can cause data corruption because they remember a time when
> it really could. It seems like general confusion to me (which could
> have been avoided).

No, I think it's PostgreSQL 13, because before the vacuum failsafe
thing you could end up truncating enough tables during vacuum
operations to actually wrap around.

And even in 14+, you can still do that, if you use single user mode.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

16 февраля 2022 г., 23:11:56

On Wed, Feb 16, 2022 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It seems as if the advice about single user mode persisted for no
> great reason at all. Technically there were some remaining reasons to
> keep it around (like the truncation thing), but overall these
> secondary reasons could have been addressed much sooner if somebody
> had thought about it.

I raised it on the list a couple of years ago, actually. I think I had
a bit of difficulty convincing people that it wasn't something we had
to keep recommending.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 февраля 2022 г., 23:21:31

On Wed, Feb 16, 2022 at 12:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> No, I think it's PostgreSQL 13, because before the vacuum failsafe
> thing you could end up truncating enough tables during vacuum
> operations to actually wrap around.

Why wouldn't the xidStopLimit thing prevent actual incorrect answers
to queries, even on Postgres 13? Why wouldn't that be enough, even if
we make the most pessimistic possible assumptions?

To me it looks like it's physically impossible to advance an XID past
xidStopLimit, unless you're in single user mode. Does your concern
have something to do with the actual xidStopLimit value in shared
memory not being sufficiently protective in practice?

> And even in 14+, you can still do that, if you use single user mode.

So what you're saying is that there is *some* reason for vacuuming in
single user mode after all, and so we should keep the advice about
that in place?   :-)

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

17 февраля 2022 г., 00:04:39

On Wed, Feb 16, 2022 at 3:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Feb 16, 2022 at 12:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > No, I think it's PostgreSQL 13, because before the vacuum failsafe
> > thing you could end up truncating enough tables during vacuum
> > operations to actually wrap around.
>
> Why wouldn't the xidStopLimit thing prevent actual incorrect answers
> to queries, even on Postgres 13? Why wouldn't that be enough, even if
> we make the most pessimistic possible assumptions?
>
> To me it looks like it's physically impossible to advance an XID past
> xidStopLimit, unless you're in single user mode. Does your concern
> have something to do with the actual xidStopLimit value in shared
> memory not being sufficiently protective in practice?

No, what I'm saying is that people running older versions routinely
run VACUUM in single-user mode because otherwise it fails due to the
truncation issue. But once they go into single-user mode they lose
protection.

> > And even in 14+, you can still do that, if you use single user mode.
>
> So what you're saying is that there is *some* reason for vacuuming in
> single user mode after all, and so we should keep the advice about
> that in place?   :-)

We could perhaps amend the text slightly, e.g. "This is a great idea
if you like pain."

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

17 февраля 2022 г., 00:47:39

On Wed, Feb 16, 2022 at 1:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> No, what I'm saying is that people running older versions routinely
> run VACUUM in single-user mode because otherwise it fails due to the
> truncation issue. But once they go into single-user mode they lose
> protection.

Seems logically consistent, but absurd. A Catch-22 situation if ever
there was one.

There might well be an element of survivorship bias here. Most VACUUM
operations won't ever attempt truncation (speaking very generally).
How many times might (say) the customer that John mentioned have
accidentally gone over xidStopLimit for just a little while, before
the situation corrected itself without anybody noticing? A lot of
applications are very read-heavy, or aren't very well monitored.

Eventually (maybe after several years of this), some laggard
anti-wraparound vacuum needs to truncate the relation, due to random
happenstance. Once that happens, the situation is bound to come to a
head. The user is bound to finally notice that the system has gone
over xidStopLimit, because there is no longer any way for the problem
to go away on its own.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

17 февраля 2022 г., 05:56:26

On Wed, Feb 16, 2022 at 4:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There might well be an element of survivorship bias here. Most VACUUM
> operations won't ever attempt truncation (speaking very generally).
> How many times might (say) the customer that John mentioned have
> accidentally gone over xidStopLimit for just a little while, before
> the situation corrected itself without anybody noticing? A lot of
> applications are very read-heavy, or aren't very well monitored.
>
> Eventually (maybe after several years of this), some laggard
> anti-wraparound vacuum needs to truncate the relation, due to random
> happenstance. Once that happens, the situation is bound to come to a
> head. The user is bound to finally notice that the system has gone
> over xidStopLimit, because there is no longer any way for the problem
> to go away on its own.

I think that's not really what is happening, at least not in the cases
that typically are brought to my attention. In those cases, the
typical pattern is:

1. Everything is fine.

2. Then the user forgets about a prepared transaction or a replication
slot, or leaves a transaction open forever, or has some kind of
corruption that causes VACUUM to fall over and die every time it tries
to run.

3. The user has no idea that VACUUM is no longer advanced
relfrozenxid. Time passes.

4. Eventually the system stops being willing to allocate new XIDs. It
tells the user to go to single user mode. So they do.

5. None of the tables in the database have been vacuumed in a long
time. There are a million XIDs left. How many of the tables in the
database are going to be truncate when they are vacuumed and burn one
of the remaining XIDs? Anybody's guess, could be all or none.

6. Sometimes the user decides to run VACUUM FULL instead of plain
VACUUM because it sounds better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

17 февраля 2022 г., 06:07:50

On Wed, Feb 16, 2022 at 6:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think that's not really what is happening, at least not in the cases
> that typically are brought to my attention. In those cases, the
> typical pattern is:

> 5. None of the tables in the database have been vacuumed in a long
> time. There are a million XIDs left. How many of the tables in the
> database are going to be truncate when they are vacuumed and burn one
> of the remaining XIDs? Anybody's guess, could be all or none.

I have to admit that this sounds way more plausible than my
speculative scenario. I haven't been involved in any kind of support
case with a customer in a *long* time, though (not by choice, mind
you).

> 6. Sometimes the user decides to run VACUUM FULL instead of plain
> VACUUM because it sounds better.

It's a pity that the name suggests otherwise. If only we'd named it
something that suggests "option of last resort". Oh well.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Robert Haas

Дата:

17 февраля 2022 г., 17:51:25

On Wed, Feb 16, 2022 at 10:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > 6. Sometimes the user decides to run VACUUM FULL instead of plain
> > VACUUM because it sounds better.
>
> It's a pity that the name suggests otherwise. If only we'd named it
> something that suggests "option of last resort". Oh well.

Unfortunately, such a name would also be misleading, just in a
different way. It is really not at all difficult to have a workload
that demands routine use of VACUUM FULL. I suppose technically it is a
last resort in such situations, because what would you resort to after
trying VF? But it's not like some kind of in-emergency-break-glass
kind of thing, it's just the right tool for the job.

Some years ago I worked with a customer who had a table that was being
used as an update-heavy queue. I don't remember all the details any
more, but I think the general pattern was that they would insert rows,
update them A TON, and then eventually delete them. And they got
really bad table bloat, because vacuum just wasn't running often
enough to keep up. Reducing autovacuum_naptime to 15s fixed the issue,
fortunately, but I was initially thinking that it might be completely
unfixable, because what if they'd also been running a series
4-minute-long reporting queries in a loop on some other table? More
frequent vacuuming wouldn't have helped then, because xmin would not
have been able to advance until the current instance of the reporting
query finished, and then vacuuming more often would have done nothing
useful. I think, anyway.

That's just one example that comes to mind. I think there are lots of
workloads where it's simply not possible to make VACUUM keep up.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Noah Misch

Дата:

20 февраля 2022 г., 07:57:57

On Wed, Feb 16, 2022 at 03:43:12PM +0700, John Naylor wrote:
> On Wed, Feb 16, 2022 at 6:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> 
> > > I did notice from my own testing of the failsafe (by artificially
> > > inducing wraparound failure using an XID burning C function) that
> > > autovacuum seemed to totally correct the problem, even when the system
> > > had already crossed xidStopLimit - it came back on its own. I wasn't
> > > completely sure of how robust this effect was, though.
> 
> I'll put some effort in finding any way that it might not be robust.

A VACUUM may create a not-trivially-bounded number of multixacts via
FreezeMultiXactId().  In a cluster at multiStopLimit, completing VACUUM
without error needs preparation something like:

1. Kill each XID that might appear in a multixact.
2. Resolve each prepared transaction that might appear in a multixact.
3. Run VACUUM.  At this point, multiStopLimit is blocking new multixacts from
   other commands, and the lack of running multixact members removes the need
   for FreezeMultiXactId() to create multixacts.

Adding to the badness of single-user mode so well described upthread, one can
enter it without doing (2) and then wrap the nextMXact counter.

Re: do only critical work during single-user vacuum?

От

Andres Freund

Дата:

21 февраля 2022 г., 01:15:37

Hi,

On 2022-02-19 20:57:57 -0800, Noah Misch wrote:
> On Wed, Feb 16, 2022 at 03:43:12PM +0700, John Naylor wrote:
> > On Wed, Feb 16, 2022 at 6:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > > On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > 
> > > > I did notice from my own testing of the failsafe (by artificially
> > > > inducing wraparound failure using an XID burning C function) that
> > > > autovacuum seemed to totally correct the problem, even when the system
> > > > had already crossed xidStopLimit - it came back on its own. I wasn't
> > > > completely sure of how robust this effect was, though.
> > 
> > I'll put some effort in finding any way that it might not be robust.
> 
> A VACUUM may create a not-trivially-bounded number of multixacts via
> FreezeMultiXactId().  In a cluster at multiStopLimit, completing VACUUM
> without error needs preparation something like:
> 
> 1. Kill each XID that might appear in a multixact.
> 2. Resolve each prepared transaction that might appear in a multixact.
> 3. Run VACUUM.  At this point, multiStopLimit is blocking new multixacts from
>    other commands, and the lack of running multixact members removes the need
>    for FreezeMultiXactId() to create multixacts.
> 
> Adding to the badness of single-user mode so well described upthread, one can
> enter it without doing (2) and then wrap the nextMXact counter.

If we collected the information along the lines of  I proposed in the second half of
https://www.postgresql.org/message-id/20220204013539.qdegpqzvayq3d4y2%40alap3.anarazel.de
we should be able to handle such cases more intelligently, I think?

We could e.g. add an error if FreezeMultiXactId() needs to create a new
multixact for a far-in-the-past xid. That's not great, of course, but if we
include the precise cause (pid of backend / prepared xact name / slot name /
...) necessitating creating a new multi, it'd still be a significant
improvement over the status quo.

Greetings,

Andres Freund

Re: do only critical work during single-user vacuum?

От

Noah Misch

Дата:

21 февраля 2022 г., 02:08:10

On Sun, Feb 20, 2022 at 02:15:37PM -0800, Andres Freund wrote:
> On 2022-02-19 20:57:57 -0800, Noah Misch wrote:
> > On Wed, Feb 16, 2022 at 03:43:12PM +0700, John Naylor wrote:
> > > On Wed, Feb 16, 2022 at 6:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > > > On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > > 
> > > > > I did notice from my own testing of the failsafe (by artificially
> > > > > inducing wraparound failure using an XID burning C function) that
> > > > > autovacuum seemed to totally correct the problem, even when the system
> > > > > had already crossed xidStopLimit - it came back on its own. I wasn't
> > > > > completely sure of how robust this effect was, though.
> > > 
> > > I'll put some effort in finding any way that it might not be robust.
> > 
> > A VACUUM may create a not-trivially-bounded number of multixacts via
> > FreezeMultiXactId().  In a cluster at multiStopLimit, completing VACUUM
> > without error needs preparation something like:
> > 
> > 1. Kill each XID that might appear in a multixact.
> > 2. Resolve each prepared transaction that might appear in a multixact.
> > 3. Run VACUUM.  At this point, multiStopLimit is blocking new multixacts from
> >    other commands, and the lack of running multixact members removes the need
> >    for FreezeMultiXactId() to create multixacts.
> > 
> > Adding to the badness of single-user mode so well described upthread, one can
> > enter it without doing (2) and then wrap the nextMXact counter.
> 
> If we collected the information along the lines of  I proposed in the second half of
> https://www.postgresql.org/message-id/20220204013539.qdegpqzvayq3d4y2%40alap3.anarazel.de
> we should be able to handle such cases more intelligently, I think?
> 
> We could e.g. add an error if FreezeMultiXactId() needs to create a new
> multixact for a far-in-the-past xid. That's not great, of course, but if we
> include the precise cause (pid of backend / prepared xact name / slot name /
> ...) necessitating creating a new multi, it'd still be a significant
> improvement over the status quo.

Yes, exactly.

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

21 февраля 2022 г., 02:12:36

On Sun, Feb 20, 2022 at 2:15 PM Andres Freund <andres@anarazel.de> wrote:
> We could e.g. add an error if FreezeMultiXactId() needs to create a new
> multixact for a far-in-the-past xid. That's not great, of course, but if we
> include the precise cause (pid of backend / prepared xact name / slot name /
> ...) necessitating creating a new multi, it'd still be a significant
> improvement over the status quo.

There are databases that have large tables (that grow and grow), and
also have tables that need to allocate many MultiXacts (or lots of
member space, at least). I strongly suspect that these are seldom the
same table, though.

The current inability of the system to recognize this difference seems
like it might be a real problem. Why should big tables that contain no
actual MultiXactIds at all (and never contained even one) get early
anti-wraparound VACUUMs, specifically focussed on averting MultiXact
wraparound? I'm hoping that the patch that adds smarter tracking of
final relfrozenxid/relminmxid values during VACUUM makes this less of
a problem automatically.

-- 
Peter Geoghegan

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

16 марта 2022 г., 00:48:24

On Wed, Feb 16, 2022 at 12:43 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> I'll put some effort in finding any way that it might not be robust.
> After that, changing the message and docs is trivial.

It would be great to be able to totally drop the idea of using
single-user mode before Postgres 15 feature freeze. How's that going?

I suggest that we apply the following patch as part of that work. It
adds one last final failsafe check at the point that VACUUM makes a
final decision on rel truncation.

It seems unlikely that the patch will ever make the crucial difference
in a wraparound scenario -- in practice it's very likely that we'd
have triggered the wraparound at that point if we run into trouble
with the target rel's relfrozenxid age. And even if it does get to
that point, it would still be possible for the autovacuum launcher to
launch another autovacuum -- this time around we will avoid rel
truncation, restoring the system to normal operation (i.e. no more
xidStopLimit state).

On the other hand it's possible that lazy_cleanup_all_indexes() will
take a very long time to run, and it runs after the current final
failsafe check. An index AM's amvacuumcleanup() routine can take a
long time to run sometimes, especially with GIN indexes. And so it's
just about possible that we won't have triggered the failsafe by the
time lazy_cleanup_all_indexes() is called, which then spends a long
time doing index cleanup -- long enough for the system to reach
xidStopLimit due to the target rel's relfrozenxid age crossing the
crucial xidStopLimit crossover point.

This patch makes this problem scenario virtually impossible. Right now
I'm only prepared to say it's very unlikely. I don't see a reason to
take any chances, though.

-- 
Peter Geoghegan

Вложения

v1-0001-Perform-final-failsafe-check-before-truncation.patch

Re: do only critical work during single-user vacuum?

От

John Naylor

Дата:

31 марта 2022 г., 12:51:05

On Wed, Mar 16, 2022 at 4:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Feb 16, 2022 at 12:43 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > I'll put some effort in finding any way that it might not be robust.
> > After that, changing the message and docs is trivial.
>
> It would be great to be able to totally drop the idea of using
> single-user mode before Postgres 15 feature freeze. How's that going?

Unfortunately, I was distracted from this work for a time, and just as
I had intended to focus on it during March, I was out sick for 2-3
weeks. I gather from subsequent discussion that a full solution goes
beyond just a new warning message and documentation. Either way I'm
not quite prepared to address this in time for v15.

> I suggest that we apply the following patch as part of that work. It
> adds one last final failsafe check at the point that VACUUM makes a
> final decision on rel truncation.

That is one thing that was in the back of my mind, and it seems
reasonable to me.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: do only critical work during single-user vacuum?

От

Justin Pryzby

Дата:

27 июня 2022 г., 22:36:09

On Thu, Feb 03, 2022 at 01:05:50PM -0500, Robert Haas wrote:
> On Thu, Dec 9, 2021 at 8:56 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we should move *away* from single user mode, rather than the
> > opposite. It's a substantial code burden and it's hard to use.
> 
> Yes. This thread seems to be largely devoted to the topic of making
> single-user vacuum work better, but I don't see anyone asking the
> question "why do we have a message that tells people to vacuum in
> single user mode in the first place?". It's basically bad advice,

> The correct thing to do would be to remove
> the hint as bad advice that we never should have offered in the first
> place. And so here. We should not try to make vacuum in single
> user-mode work better or differently, or at least that shouldn't be
> our primary objective. We should just stop telling people to do it. We
> should probably add messages and documentation *discouraging* the use
> of single user mode for recovering from wraparound trouble, exactly
> the opposite of what we do now. There's nothing we can do in
> single-user mode that we can't do equally well in multi-user mode. If
> people try to fix wraparound problems in multi-user mode, they still
> have read-only access to their database, they can use parallelism,
> they can use command line utilities like vacuumdb, and they can use
> psql which has line editing and allows remote access and is a way
> nicer user experience than running postgres --single. We need a really
> compelling reason to tell people to give up all those advantages, and
> there is no such reason. It makes just as much sense as telling people
> to deal with wraparound problems by angering a live anaconda.

By chance, I came across this prior thread which advocated the same thing in a
initially (rather than indirectly as in this year's thread).

https://www.postgresql.org/message-id/flat/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com
|We should stop telling users to "vacuum that database in single-user mode"

Re: do only critical work during single-user vacuum?

От

Peter Geoghegan

Дата:

27 июня 2022 г., 23:36:36

On Mon, Jun 27, 2022 at 12:36 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> By chance, I came across this prior thread which advocated the same thing in a
> initially (rather than indirectly as in this year's thread).

Revisiting this topic reminded me that PostgreSQL 14 (the first
version that had the wraparound failsafe mechanism controlled by
vacuum_failsafe_age) has been a stable release for 9 months now. As of
today I am still not aware of even one user that ran into the failsafe
mechanism in production. It might well have happened by now, of
course, but I am not aware of any specific case. Perhaps this will
change soon enough -- maybe somebody else will read this and enlighten
me.

To me the fact that the failsafe seems to seldom kick-in in practice
suggests something about workload characteristics in general: that it
isn't all that common for users to try to get away with putting off
freezing until a table attains an age that is significantly above 1
billion XIDs.

When people talk about things like 64-bit XIDs, I tend to wonder: if 2
billion XIDs wasn't enough, why should 4 billion or 8 billion be
enough? *Maybe* the system can do better by getting even further into
debt than it can today, but you can't expect to avoid freezing
altogether (without significant work elsewhere). My general sense is
that freezing isn't a particularly good thing to try to do lazily --
even if we ignore the risk of an eventual wraparound failure.

-- 
Peter Geoghegan

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: do only critical work during single-user vacuum?

Вложения

Вложения

Вложения

Вложения

Вложения