Re: New strategies for freezing, advancing relfrozenxid early

From: Peter Geoghegan
Subject: Re: New strategies for freezing, advancing relfrozenxid early
Date:
Msg-id: CAH2-WzmLhGZMxhR6zEz9213xxjpt6v2N4LnhiSy0a4DDb2p3Gw@mail.gmail.com
In reply to: Re: New strategies for freezing, advancing relfrozenxid early  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers
On Wed, Jan 4, 2023 at 5:21 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Some reviews (untested; only code review so far) on these versions of
> the patches:

Thanks for the review!

> > [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> I don't think the mention of 'cutoff point' is necessary when it has
> 'Threshold'.

Fair. Will fix.

> > +    int            freeze_strategy_threshold;    /* threshold to use eager
> > [...]
> > +    BlockNumber freeze_strategy_threshold;
>
> Is there a way to disable the 'eager' freezing strategy? `int` cannot
> hold the maximum BlockNumber...

I'm going to fix this by switching over to making the GUC (and the
reloption) GUC_UNIT_MB, while keeping it in ConfigureNamesInt[]. That
approach is a little bit more cumbersome, but not by much. That'll
solve this problem.
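
To sketch roughly what I have in mind (the name, group, default, and
bounds here are illustrative only, not necessarily what the patch will
end up using):

    {
        {"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
            gettext_noop("Table size at which VACUUM uses its eager freezing strategy."),
            NULL,
            GUC_UNIT_MB
        },
        &vacuum_freeze_strategy_threshold,
        4096, 0, INT_MAX,
        NULL, NULL, NULL
    },

Since the unit becomes MB rather than blocks, INT_MAX megabytes
comfortably covers the largest table that a BlockNumber can address, so
the overflow problem you point out goes away.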

> > +    lazy_scan_strategy(vacrel);
> >      if (verbose)
>
> I'm slightly surprised you didn't update the message for verbose vacuum
> to indicate whether we used the eager strategy: there are several GUCs
> for tuning this behaviour, so you'd expect to want direct confirmation
> that the configuration is effective.

Perhaps that would be worth doing, but I don't think that it's all
that useful in the grand scheme of things. I wouldn't mind including
it, but I think that it shouldn't be given much prominence. It's
certainly far less important than "aggressive vs non-aggressive" is
right now.

Eagerness is not just a synonym of aggressiveness. For example, every
VACUUM of a table like pgbench_tellers or pgbench_branches will use
the eager scanning strategy. More generally, you have to bear in mind that
the actual state of the table is just as important as the GUCs
themselves. We try to avoid obligations that could be very hard or
even impossible for vacuumlazy.c to fulfill.

There are far weaker constraints on things like the final relfrozenxid
value we'll set in pg_class (more on this below, when I talk about
MinXid/MinMulti). It will advance far more frequently and by many more
XIDs than it would today, on average. But occasionally it will allow a
far earlier relfrozenxid than aggressive mode would ever allow, since
making some small amount of progress now is almost always much better
than making no progress at all.

> (looks at further patches) I see that the message for verbose vacuum
> sees significant changes in patch 2 instead.

It just works out to be slightly simpler that way. I want to add the
scanned_pages stuff to VERBOSE in the vmsnap/scanning strategies
commit, so I need to make significant changes to the initial VERBOSE
message in that commit. There is little point in preserving
information about aggressive mode if it's removed in the very next
commit anyway.

> > [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

> Right now, we don't use syncscan to determine a startpoint. I can't
> find the reason why in the available documentation: [0] discusses the
> issue but without clearly describing an issue why it wouldn't be
> interesting from a 'nothing lost' perspective.

That's not something I've given much thought to. It's a separate issue, I think.

Though I will say that one reason why I think that the vm snapshot
concept will become important is that working off an immutable
structure makes various things much easier, in fairly obvious ways. It
makes it straightforward to reorder work. So things like parallel heap
vacuuming are a lot more straightforward.

I also think that it would be useful to teach VACUUM to speculatively
scan a random sample of pages. We'd start out doing what is otherwise a
normal VACUUM, except that scanned_pages are processed in random order.
At some point we'd look at the state of the pages seen so far. If it
looks like the table doesn't urgently need to be vacuumed, then we can
give up before paying much of a cost. If it looks like the table really
does need to be vacuumed, we can press on almost like any other VACUUM
would.

This is related to the problem of bad statistics that drive
autovacuum. Deciding as much as possible at runtime, dynamically,
seems promising to me.

> In addition, I noticed that progress reporting of blocks scanned
> ("heap_blocks_scanned", duh) includes skipped pages. Now that we have
> a solid grasp of how many blocks we're planning to scan, we can update
> the reported stats to how many blocks we're planning to scan (and have
> scanned), increasing the user value of that progress view.

Yeah, that's definitely a natural direction to go with this. Knowing
scanned_pages from the start is a basis for much more useful progress
reporting.

> > +    double        tableagefrac;
>
> I think this can use some extra info on the field itself, that it is
> the fraction of how "old" the relfrozenxid and relminmxid fields are,
> as a fraction between 0 (latest values; nextXID and nextMXID), and 1
> (values that are old by at least freeze_table_age and
> multixact_freeze_table_age (multi)transaction ids, respectively).

Agreed that the comments above the "tableagefrac" struct field need to
say more than they do right now.

> > +     * Decide vmsnap scanning strategy.
> >       *
> > -     * This test also enables more frequent relfrozenxid advancement during
> > -     * non-aggressive VACUUMs.  If the range has any all-visible pages then
> > -     * skipping makes updating relfrozenxid unsafe, which is a real downside.
> > +     * First acquire a visibility map snapshot, which determines the number of
> > +     * pages that each vmsnap scanning strategy is required to scan for us in
> > +     * passing.
>
> I think we should not take disk-backed vm snapshots when
> force_scan_all is set. We need VACUUM to be able to run on very
> resource-constrained environments, and this does not do that - it adds
> a disk space requirement for the VM snapshot for all but the smallest
> relation sizes, which is bad when you realize that we need VACUUM when
> we want to clean up things like CLOG.

I agree that I still have work to do to make visibility map snapshots
as robust as possible in resource constrained environments, including
in cases where there is simply no disk space at all. They should
gracefully degrade even when there isn't space on disk to store a copy
of the VM in temp files, or even a single page.

> Additionally, it took me several reads of the code and comments to
> understand what the decision-making process for lazy vs eager is, and
> why. The comments are interspersed with the code, with no single place
> that describes it from a bird's eyes' view.

You probably have a good point there. I'll try to come up with
something, possibly based on your suggested wording.

> > @@ -1885,6 +1902,30 @@ retry:
> >         tuples_frozen = 0;        /* avoid miscounts in instrumentation */
> >      }
> >
> >     /*
> > +     * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
> > +     * set.  Check that here in passing.
> > +     *
> > [...]
>
> I'm not sure this patch is the appropriate place for this added check.
> I don't disagree with the change, I just think that it's unrelated to
> the rest of the patch. Same with some of the changes in
> lazy_scan_heap.

This issue is hard to explain. I kind of need to do this in the VM
snapshot/scanning strategies commit, because it removes the
all_visible_according_to_vm local variable used inside lazy_scan_heap.

The change that you highlight is part of that: lazy_scan_prune now
detects cases where PD_ALL_VISIBLE was set incorrectly, and unsets it,
so that once lazy_scan_prune returns and lazy_scan_heap needs to
consider setting the VM, it can trust PD_ALL_VISIBLE -- it is definitely
up to date at that point, even in cases involving corruption. So the
step where we consider setting the VM now always starts from a clean
slate.

Now we won't just unset both PD_ALL_VISIBLE and the VM bits in the event
of corruption like this. We'll complain about it in lazy_scan_prune, and
then fully fix the issue in whatever way is most appropriate once
control returns to lazy_scan_heap (which could even mean setting the
page all-visible now, despite the fact that the bit shouldn't have been
set when we first arrived). We also won't fail to complain about
PD_ALL_VISIBLE corruption because lazy_scan_prune "destroyed the
evidence" before lazy_scan_heap had the chance to notice the problem.
PD_ALL_VISIBLE corruption should never happen, obviously, so we should
make a point of complaining about it whenever it can be detected --
which is much more often than on HEAD today.
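
To illustrate the kind of check I mean (a rough sketch only, not the
patch verbatim -- the variable names and the wording of the message are
just stand-ins):

    /*
     * Pruning found LP_DEAD items on a page whose header still claims
     * PD_ALL_VISIBLE.  Complain, then clear the bit so that the later
     * VM-setting step in lazy_scan_heap starts from a clean slate.
     */
    if (lpdead_items > 0 && PageIsAllVisible(page))
    {
        elog(WARNING, "page %u of relation \"%s\" is marked all-visible but contains dead tuples",
             blkno, vacrel->relname);
        PageClearAllVisible(page);
        MarkBufferDirty(buf);
    }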

> > +vm_snap_stage_blocks
>
> Doesn't this waste a lot of cycles on skipping frozen blocks if most
> of the relation is frozen? I'd expected something more like a byte- or
> word-wise processing of skippable blocks, as opposed to this per-block
> loop. I don't think it's strictly necessary to patch, but I think it
> would be a very useful addition for those with larger tables.

I agree that the visibility map snapshot stuff could stand to be a bit
more frugal with memory. It's certainly not critical, but it is
probably fairly easy to do better here, and so I should do better.
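
For example (purely illustrative, and assuming a simple one-bit-per-block
"skippable" bitmap rather than the actual vmsnap representation, with
vm_snap_stage_block standing in for the real staging call), the loop
could step over fully skippable words without visiting each block:

    for (uint64 word = 0; word < nwords; word++)
    {
        uint64      w = skip_bitmap[word];

        if (w == PG_UINT64_MAX)
            continue;           /* all 64 blocks skippable -- stage nothing */

        for (int bit = 0; bit < 64; bit++)
        {
            if ((w & (UINT64CONST(1) << bit)) == 0)
                vm_snap_stage_block(vmsnap, (BlockNumber) (word * 64 + bit));
        }
    }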

> > +    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
> > +        ((double) freeze_table_age + 0.5);
>
> I don't quite understand what this `+ 0.5` is used for, could you explain?

It avoids division by zero.
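
For instance, vacuum_freeze_table_age can legitimately be set to 0, in
which case the expression becomes

    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) / 0.5;

which stays finite instead of dividing by zero outright.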

> > + [...] Freezing and advancing
> > +         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
> > +         now take place more proactively, in every
> > +         <command>VACUUM</command> operation.
>
> This claim that it happens more proactively in "every" VACUUM
> operation is false, so I think the removal of "every" would be better.

Good catch. Will fix.

> > [PATCH v14 3/3] Finish removing aggressive mode VACUUM.

> I don't quite enjoy the refactoring+rewriting of the docs section;
> it's difficult to determine what changed when so many things changed
> line lengths and were moved around. Tomorrow I'll take a closer look,
> but a separation of changes vs moved would be useful for review.

I think that I should break out the doc changes some more. The docs
are likely the least worked out thing at this point.

> > +    cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
> > [...]
> > +    cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
>
> Why are these values adjusted down (up?) by 5%? If I configure this
> GUC, I'd expect this to be used effectively verbatim; not adjusted by
> an arbitrary factor.

It is kind of arbitrary, but not in the way that you suggest. This
isn't documented in the user docs, and shouldn't really need to be. It
should have very little if any noticeable impact on our final
relfrozenxid/relminmxid in practice. If it does have any noticeable
impact, I strongly suspect it'll be a useful, positive impact.
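
To make that concrete (illustrative arithmetic, assuming the stock
vacuum_freeze_table_age of 150 million): MinXid works out to nextXID -
142,500,000, which corresponds to a tableagefrac of about 0.95 -- only
5% short of the freeze_table_age cutoff itself. So in practice it only
comes into play when the table has just about run out of runway anyway.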

MinXid/MinMulti control the behavior around whether or not
lazy_scan_noprune is willing to wait the hard way for a cleanup lock,
no matter how long it takes. We do still need something like that, but
it can be far looser than it is right now. The problem with aggressive
mode is that it absolutely insists on a certain outcome, no matter the
cost, and regardless of whether or not a slightly inferior outcome is
acceptable. It's extremely rigid. Rigid things tend to break. Loose,
springy things much less so.

I think that it's an extremely bad idea to wait indefinitely for a
cleanup lock. Sure, it'll work out the vast majority of the time --
it's *very* likely to work. But when it doesn't work right away, there
is no telling how long the wait will be -- all bets are off. Could be
a day, a week, a month -- who knows? The application itself is the
crucial factor here, and in general the application can do whatever it
wants to do -- that is the reality. So we should be willing to kick
the can down the road in almost all cases -- that is actually the
responsible thing to do under the circumstances. We need to get on
with freezing every other page in the table!

There just cannot be very many pages that can't be cleanup locked at
any given time, so waiting indefinitely is a very drastic measure in
response to a problem that is quite likely to go away on its own. A
problem that waiting doesn't really solve anyway. Maybe the only thing
that will work is waiting for a very long time, but we have nothing to
lose (and everything to gain) by waiting to wait.

--
Peter Geoghegan


