Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Andres Freund
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date:
Msg-id: 20211123054902.b5motbm32ftn6xgu@alap3.anarazel.de
In response to: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Hi,

On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote:
> Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
> non-aggressive VACUUM -- so "at least waiting for one or two pins
> during non-aggressive VACUUM" might not have been the best way of
> expressing the idea that I wanted to express. The important point is
> that _we can make a choice_ about stuff like this dynamically, based
> on the observed characteristics of the table, and some general ideas
> about the costs and benefits (of waiting or not waiting, or of how
> long we want to wait in total, whatever might be important). This
> probably just means adding some heuristics that are pretty sensitive
> to any reason to not do more work in a non-aggressive VACUUM, without
> *completely* balking at doing even a tiny bit more work.

> For example, we can definitely afford to wait a few more milliseconds
> to get a cleanup lock just once

We currently have no infrastructure to wait for an LWLock or a pin count for a
limited time, and at least for the former it would not be easy to add. It may
be worth adding at some point, but I'm doubtful this is sufficient reason for
nontrivial new infrastructure in very performance-sensitive areas.
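(For illustration only: a crude bounded wait could be cobbled together today
by polling ConditionalLockBufferForCleanup() against a deadline, roughly like
the sketch below. The helper name, the millisecond budget and the polling
interval are made up, and a real solution would want proper wait
infrastructure rather than a sleep loop.)

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/timestamp.h"

/*
 * Hypothetical helper: try to get a cleanup lock on a buffer we already
 * hold a pin on, but give up after "budget_ms" milliseconds.  Polling is
 * obviously cruder than real bounded-wait infrastructure; this is just a
 * sketch of the idea.
 */
static bool
ConditionalCleanupLockWithBudget(Buffer buf, int budget_ms)
{
	TimestampTz start = GetCurrentTimestamp();

	for (;;)
	{
		if (ConditionalLockBufferForCleanup(buf))
			return true;		/* got the cleanup lock */

		if (TimestampDifferenceExceeds(start, GetCurrentTimestamp(),
									   budget_ms))
			return false;		/* budget exhausted, caller skips the page */

		pg_usleep(1000L);		/* wait 1ms between attempts */
	}
}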


> All of the autovacuums against the accounts table look similar to this
> one -- you don't see anything about relfrozenxid being advanced
> (because it isn't). Whereas for the smaller pgbench tables, every
> single VACUUM successfully advances relfrozenxid to a fairly recent
> XID (without there ever being an aggressive VACUUM) -- just because
> VACUUM needs to visit every page for the smaller tables. While the
> accounts table doesn't generally need to have 100% of all pages
> touched by VACUUM -- it's more like 95% there. Does that really make
> sense, though?

Does what really make sense?


> I'm pretty sure that less aggressive VACUUMing (e.g. higher
> scale_factor setting) would lead to more aggressive setting of
> relfrozenxid here. I'm always suspicious when I see insignificant
> differences that lead to significant behavioral differences. Am I
> worried over nothing here? Perhaps -- we don't really need to advance
> relfrozenxid early with this table/workload anyway. But I'm not so
> sure.

I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions holding down the xid
horizon, which is incredibly common in real workloads IME.  It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.

I think the most interesting case where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables too, including completely static ones. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.

This is becoming more common with the increased use of partitioning.


> > The problem is that the
> > autovacuum scheduling is way too naive for that to be a significant benefit -
> > nothing tries to schedule autovacuums so that they have a chance to complete
> > before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
> > to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.
> 
> Not sure what you mean about scheduling, since vacuum_freeze_table_age
> is only in place to make overnight (off hours low activity scripted
> VACUUMs) freeze tuples before any autovacuum worker gets the chance
> (since the latter may run at a much less convenient time). Sure,
> vacuum_freeze_table_age might also force a regular autovacuum worker
> to do an aggressive VACUUM -- but I think it's mostly intended for a
> manual overnight VACUUM. Not usually very helpful, but also not
> harmful.

> Oh, wait. I think that you're talking about how autovacuum workers in
> particular tend to be affected by this. We launch an av worker that
> wants to clean up bloat, but it ends up being aggressive (and maybe
> taking way longer), perhaps quite randomly, only due to
> vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
> that it?

No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, and not auto-cancelling). But the only mechanism we have
against anti-wrap vacuums happening in the first place is
vacuum_freeze_table_age, and as you say, that's not really a "real" mechanism,
because it requires an "independent" reason to vacuum a table.
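To make that concrete with a toy classification (illustrative only; the real
checks live in vacuum_set_xid_limits() and relation_needs_vacanalyze(), and
XID wraparound arithmetic is glossed over here):

#include "postgres.h"

/*
 * Illustrative only: vacuum_freeze_table_age merely upgrades a vacuum that
 * was going to happen anyway to an aggressive one, while
 * autovacuum_freeze_max_age is the only thing that forces a vacuum by
 * itself -- and by then it is already an anti-wraparound vacuum.  There is
 * nothing in between that makes autovacuum act earlier.
 */
typedef enum VacuumPressure
{
	VAC_NOT_FORCED,				/* vacuumed only if bloat triggers it */
	VAC_AGGRESSIVE_IF_TRIGGERED,	/* promoted to aggressive, but not forced */
	VAC_ANTIWRAP_FORCED			/* autovacuum launches it unconditionally */
} VacuumPressure;

static VacuumPressure
classify_table(uint32 relfrozenxid_age,
			   int vacuum_freeze_table_age,
			   int autovacuum_freeze_max_age)
{
	if (relfrozenxid_age > (uint32) autovacuum_freeze_max_age)
		return VAC_ANTIWRAP_FORCED;
	if (relfrozenxid_age > (uint32) vacuum_freeze_table_age)
		return VAC_AGGRESSIVE_IF_TRIGGERED;
	return VAC_NOT_FORCED;
}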

I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables became big enough that autovacuum didn't schedule vacuums that got
promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums kicked in. Then things started to burn, because of the unpaced
anti-wrap vacuums clogging up all IO, or maybe it was the vacuums not being
cancellable - I don't quite remember the details.

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).



> > This is one of the most embarrassing issues around the whole anti-wrap
> > topic. We kind of define it as an emergency that there's an anti-wraparound
> > vacuum. But we have *absolutely no mechanism* to prevent them from occurring.
> 
> What do you mean? Only an autovacuum worker can do an anti-wraparound
> VACUUM (which is not quite the same thing as an aggressive VACUUM).

Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.

Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.


> > > We now also collect LP_DEAD items in the dead_tuples array in the case
> > > where we cannot immediately get a cleanup lock on the buffer.  We cannot
> > > prune without a cleanup lock, but opportunistic pruning may well have
> > > left some LP_DEAD items behind in the past -- no reason to miss those.
> >
> > This has become *much* more important with the changes around deciding when to
> > index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
> > items, it's that a previous vacuum is quite likely to have left them there,
> > because the previous vacuum decided not to perform index cleanup.
> 
> I haven't seen any evidence of that myself (with the optimization
> added to Postgres 14 by commit 5100010ee4). I still don't understand
> why you doubted that work so much. I'm not saying that you're wrong
> to; I'm saying that I don't think that I understand your perspective
> on it.

I didn't (nor do I) doubt that it can be useful - to the contrary, I think the
unconditional index pass was a huge practical issue.  I do however think that
there are cases where it can cause trouble. The comment above wasn't meant as
a criticism - just that it seems worth pointing out that one reason we might
encounter a lot of LP_DEAD items is previous vacuums that didn't perform index
cleanup.


> What I have seen in my own tests (particularly with BenchmarkSQL) is
> that most individual tables either never apply the optimization even
> once (because the table reliably has heap pages with many more LP_DEAD
> items than the 2%-of-relpages threshold), or will never need to
> (because there are precisely zero LP_DEAD items anyway). Remaining
> tables that *might* use the optimization tend to not go very long
> without actually getting a round of index vacuuming. It's just too
> easy for updates (and even aborted xact inserts) to introduce new
> LP_DEAD items for us to go long without doing index vacuuming.

I think real workloads are a bit more varied than the realistic set of
benchmarks that one person can run themselves.

I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents index-only scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time.  Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit VACUUM doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.
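For reference, the bypass rule being discussed boils down to something like
this (a simplified restatement of the 2%-of-relpages test, not the actual
vacuumlazy.c code, which has additional conditions such as a cap on the total
amount of dead items):

#include "postgres.h"

#include "storage/block.h"

/*
 * Simplified restatement of the rule: index vacuuming can be skipped when
 * the pages containing LP_DEAD items are a small fraction of the table.
 */
#define BYPASS_THRESHOLD_PAGES	0.02	/* i.e. 2% of rel_pages */

static bool
would_bypass_index_vacuum(BlockNumber rel_pages, BlockNumber lpdead_item_pages)
{
	BlockNumber threshold = (BlockNumber) (rel_pages * BYPASS_THRESHOLD_PAGES);

	return lpdead_item_pages < threshold;
}

So if the hot value range lives on, say, 1% of the table's pages, those
LP_DEAD items can keep failing this test across many vacuums.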


> > > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> > > technique is independently capable of cleaning up line pointer bloat),
> >
> > One thing we could do around this, btw, would be to aggressively replace
> > LP_REDIRECT items with their target item. We can't do that in all situations
> > (somebody might be following a ctid chain), but I think we have all the
> > information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
> > state or something like that.
> 
> Another idea is to truncate the line pointer during pruning (including
> opportunistic pruning). Matthias van de Meent has a patch for that.

I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.


> > > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> > > since there is barely any real practical sense in which we actually
> > > miss doing useful work for these pages.  Besides, this information
> > > always seemed to have little practical value, even to Postgres hackers.
> >
> > -0.5. I think it provides some value, and I don't see why the removal of the
> > information should be tied to this change. It's hard to diagnose why some dead
> > tuples aren't cleaned up - a common cause for that on smaller tables is that
> > nearly all pages are pinned nearly all the time.
> 
> Is that still true, though? If it turns out that we need to leave it
> in, then I can do that. But I'd prefer to wait until we have more
> information before making a final decision. Remember, the high level
> idea of this whole patch is that we do as much work as possible for
> any scanned_pages, which now includes pages that we never successfully
> acquired a cleanup lock on. And so we're justified in assuming that
> they're exactly equivalent to pages that we did get a cleanup on --
> that's now the working assumption. I know that that's not literally
> true, but that doesn't mean it's not a useful fiction -- it should be
> very close to the truth.

IDK, it seems misleading to me. Small tables with a lot of churn - quite
common - are highly reliant on LP_DEAD entries getting removed, or the tiny
table suddenly isn't so tiny anymore. And it's harder to diagnose why the
cleanup isn't happening without knowing that pages needing cleanup couldn't
be cleaned up due to pins.

If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.


> > > +      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
> > > +      * the rel_pages used by lazy_scan_prune, from before a possible relation
> > > +      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
> > > +      */
> >
> > I think it should be doable to add an isolation test for this path. There have
> > been quite a few bugs around the wider topic...
> 
> I would argue that we already have one -- vacuum-reltuples.spec. I had
> to update its expected output in the patch. I would argue that the
> behavioral change (count tuples on a pinned-by-cursor heap page) that
> necessitated updating the expected output for the test is an
> improvement overall.

I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.


> > > +     {
> > > +             /* Can safely advance relfrozen and relminmxid, too */
> > > +             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
> > > +                        orig_rel_pages);
> > > +             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> > > +                                                     new_rel_allvisible, vacrel->nindexes > 0,
> > > +                                                     FreezeLimit, MultiXactCutoff, false);
> > > +     }
> >
> > I wonder if this whole logic wouldn't become easier and less fragile if we
> > just went for maintaining the "actually observed" horizon while scanning the
> > relation. If we skip a page via VM set the horizon to invalid. Otherwise we
> > can keep track of the accurate horizon and use that. No need to count pages
> > and stuff.
> 
> There is no question that that makes sense as an optimization -- my
> prototype convinced me of that already. But I don't think that it can
> simplify anything (not even the call to vac_update_relstats itself, to
> actually update relfrozenxid at the end).

Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to a wrong determination
of whether we scanned everything or not.
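To sketch what I mean by maintaining the "actually observed" horizon (names
are invented, and this glosses over MultiXactIds, which would need the same
treatment):

#include "postgres.h"

#include "access/transam.h"

/* hypothetical additions to the per-VACUUM state */
typedef struct ObservedHorizon
{
	bool		valid;			/* false once a non-all-frozen page is skipped */
	TransactionId oldest_unfrozen_xid;	/* starts InvalidTransactionId */
} ObservedHorizon;

/* Called when a page is skipped via the VM without inspecting it. */
static void
observed_horizon_page_skipped(ObservedHorizon *oh, bool page_all_frozen)
{
	if (!page_all_frozen)
		oh->valid = false;		/* unknown XIDs remain, can't advance further */
}

/* Called for every XID we leave unfrozen on a scanned page. */
static void
observed_horizon_note_xid(ObservedHorizon *oh, TransactionId xid)
{
	if (!TransactionIdIsNormal(xid))
		return;
	if (!TransactionIdIsValid(oh->oldest_unfrozen_xid) ||
		TransactionIdPrecedes(xid, oh->oldest_unfrozen_xid))
		oh->oldest_unfrozen_xid = xid;
}

At the end, if "valid" is still set, relfrozenxid can be advanced to the
tracked value instead of relying on page counts matching up.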


> Fundamentally, this will only work if we decide to only skip all-frozen
> pages, which (by definition) only happens within aggressive VACUUMs.

Hm? Or if there are just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.


> You recently said (on the heap-pruning-14-bug thread) that you don't
> think it would be practical to always set a page all-frozen when we
> see that we're going to set it all-visible -- apparently you feel that
> we could never opportunistically freeze early such that all-visible
> but not all-frozen pages practically cease to exist. I'm still not
> sure why you believe that (though you may be right, or I might have
> misunderstood, since it's complicated).

Yes, I think it may not work out to do that. But it's not a very strongly held
opinion.

One reason for my doubt is the following:

We can set all-visible on a page without an FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age, or for that limit never to be reached.

On a table with 40 million individually inserted rows, fully hint-bitted via
reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A
subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's quite a
large multiplier...

If we ever managed to not have a per-page all-visible flag this'd get even
more extreme, because we'd then not even need to dirty the page for
insert-only pages. But if we want to freeze, we'd need to (unless we just got
rid of freezing).


> It would certainly benefit this dynamic relfrozenxid business if it was
> possible, though. If we could somehow make that work, then almost every
> VACUUM would be able to advance relfrozenxid, independently of
> aggressive-ness -- because we wouldn't have any
> all-visible-but-not-all-frozen pages to skip (that important detail wouldn't
> be left to chance).

Perhaps we can get most of the benefit even without that.  If we were to
freeze whenever it didn't cause an additional FPW, and perhaps didn't skip
all-visible but not all-frozen pages when they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a lot
of cases?
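In pseudo-code form, the heuristic I have in mind is roughly the following (a
sketch only - the "did the prune record include an FPI" input is hypothetical
plumbing, not something lazy_scan_prune() exposes today):

#include "postgres.h"

/*
 * Sketch of the heuristic: opportunistically freeze a page only when doing
 * so does not add a full-page image we weren't already paying for.  The
 * caller would report whether the WAL record just emitted for pruning /
 * setting all-visible carried an FPI for this buffer.
 */
static bool
should_freeze_opportunistically(bool prune_record_included_fpi,
								bool all_tuples_freezable)
{
	if (!all_tuples_freezable)
		return false;			/* can't make the page all-frozen anyway */

	/*
	 * The FPI has already been emitted for the prune record; the additional
	 * freeze record is small by comparison, so freezing now is close to
	 * free in WAL terms.
	 */
	return prune_record_included_fpi;
}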


> > I don't particularly like doing BufferGetPage() before holding a lock on the
> > page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
> > be good to have a crosscheck that BufferGetPage() is only allowed when holding
> > a page level lock.
> 
> I have occasionally wondered if the whole idea of reading heap pages
> with only a pin (and having cleanup locks in VACUUM) is really worth
> it -- alternative designs seem possible. Obviously that's a BIG
> discussion, and not one to have right now. But it seems kind of
> relevant.

By 'reading' do you mean reads from the OS, or just references to buffer
contents?


> Since it is often legit to read a heap page without a buffer lock
> (only a pin), I can't see why BufferGetPage() without a buffer lock
> shouldn't also be okay -- if anything it seems safer. I think that I
> would agree with you if it wasn't for that inconsistency (which is
> rather a big "if", to be sure -- even for me).

At least for heap it's rarely legit to read buffer contents via
BufferGetPage() without a lock. It's legit to read data at already-determined
offsets, but you can't look at much other than the tuple contents.
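The crosscheck I was alluding to could look something like this assert-only
sketch (hypothetical; callers that legitimately inspect tuple contents with
only a pin would need a separate entry point, so this couldn't simply replace
BufferGetPage()):

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/lwlock.h"

/*
 * Assertion-checked variant of BufferGetPage(): only hand out the page
 * pointer if the caller holds the buffer's content lock.  Local buffers are
 * backend-private, so no lock check applies to them.
 */
static inline Page
BufferGetPageChecked(Buffer buffer)
{
	Assert(BufferIsValid(buffer));

#ifdef USE_ASSERT_CHECKING
	if (!BufferIsLocal(buffer))
	{
		BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);

		Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
	}
#endif

	return (Page) BufferGetBlock(buffer);
}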


> > Why does it make sense to track DEAD tuples this way? Isn't that going to lead
> > to counting them over-and-over again? I think it's quite misleading to include
> > them in "dead but not yet removable".
> 
> Compared to what? Do we really want to invent a new kind of DEAD tuple
> (e.g., to report on), just to handle this rare case?

When looking at logs I use the
"tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"
line to see whether the user is likely to have issues around an old
transaction / slot / prepared xact preventing cleanup. If new_dead_tuples
doesn't identify those cases anymore, that line is no longer reliable.


> I accept that this code is lying about the tuples being RECENTLY_DEAD,
> kind of. But isn't it still strictly closer to the truth, compared to
> HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
> counting it at all.

I don't see how it's closer at all. There's imo a significant difference
between not being able to remove tuples because of the xmin horizon, and not
being able to remove them because we couldn't get a cleanup lock.


Greetings,

Andres Freund


