Thread: Removing more vacuumlazy.c special cases, relfrozenxid optimizations


Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan
Date:
Attached WIP patch series significantly simplifies the definition of
scanned_pages inside vacuumlazy.c. Apart from making several very
tricky things a lot simpler, and moving more complex code outside of
the big "blkno" loop inside lazy_scan_heap (building on the Postgres
14 work), this refactoring directly facilitates 2 new optimizations
(also in the patch):

1. We now collect LP_DEAD items into the dead_tuples array for all
scanned pages -- even when we cannot get a cleanup lock.

2. We now don't give up on advancing relfrozenxid during a
non-aggressive VACUUM when we happen to be unable to get a cleanup
lock on a heap page.

Both optimizations are much more natural with the refactoring in
place. Especially #2, which can be thought of as making aggressive and
non-aggressive VACUUM behave similarly. Sure, we shouldn't wait for a
cleanup lock in a non-aggressive VACUUM (by definition) -- and we
still don't in the patch (obviously). But why wouldn't we at least
*check* if the page has tuples that need to be frozen in order for us
to advance relfrozenxid? Why give up on advancing relfrozenxid in a
non-aggressive VACUUM when there's no good reason to?
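
To give a feel for the shape this takes, here is a simplified sketch of
the no-cleanup-lock path inside lazy_scan_heap's main blkno loop with the
patch applied (illustrative only -- see the attached patches for the real
hunks):

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, vacrel->bstrategy);
    page = BufferGetPage(buf);

    if (!ConditionalLockBufferForCleanup(buf))
    {
        bool        hastup;

        LockBuffer(buf, BUFFER_LOCK_SHARE);

        /*
         * Reduced processing without a cleanup lock: remember any
         * preexisting LP_DEAD items in the dead_tuples array (#1), and
         * notice when the page has no tuples that need freezing, so that
         * failing to get a cleanup lock doesn't force us to give up on
         * advancing relfrozenxid (#2)
         */
        if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
        {
            /* Page fully processed -- no cleanup lock needed after all */
            UnlockReleaseBuffer(buf);
            continue;
        }

        /* Aggressive VACUUM only: wait for a cleanup lock, as before */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        LockBufferForCleanup(buf);
    }

    /* Have a cleanup lock -- prune and freeze via lazy_scan_prune */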

See the draft commit messages from the patch series for many more
details on the simplifications I am proposing.

I'm not sure how much value the second optimization has on its own.
But I am sure that the general idea of teaching non-aggressive VACUUM
to be conscious of the value of advancing relfrozenxid is a good one
-- and so #2 is a good start on that work, at least. I've discussed
this idea with Andres (CC'd) a few times before now. Maybe we'll need
another patch that makes VACUUM avoid setting heap pages to
all-visible without also setting them to all-frozen (and freezing as
necessary) in order to really get a benefit. Since, of course, a
non-aggressive VACUUM still won't be able to advance relfrozenxid when
it skipped over all-visible pages that are not also known to be
all-frozen.
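
To be concrete, the rough idea would be for lazy_scan_prune to freeze
eagerly whenever it is about to mark a page all-visible, so that both VM
bits can always be set together. A hand-wavy sketch, not something in the
attached patches:

    if (prunestate.all_visible)
    {
        uint8       flags = VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN;

        /* freeze whatever still needs freezing on the page first, then: */
        visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                          vmbuffer, InvalidTransactionId, flags);
    }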

Masahiko (CC'd) has expressed interest in working on opportunistic
freezing. This refactoring patch seems related to that general area,
too. At a high level, to me, this seems like the tuple freezing
equivalent of the Postgres 14 work on bypassing index vacuuming when
there are very few LP_DEAD items (interpret that as 0 LP_DEAD items,
which is close to the truth anyway).  There are probably quite a few
interesting opportunities to make VACUUM better by not having such a
sharp distinction between aggressive and non-aggressive VACUUM. Why
should they be so different? A good medium term goal might be to
completely eliminate aggressive VACUUMs.

I have heard many stories about anti-wraparound/aggressive VACUUMs
where the cure (which suddenly made autovacuum workers
non-cancellable) was worse than the disease (not actually much danger
of wraparound failure). For example:

https://www.joyent.com/blog/manta-postmortem-7-27-2015

Yes, this problem report is from 2015, which is before we even had the
freeze map stuff. I still think that the point about aggressive
VACUUMs blocking DDL (leading to chaos) remains valid.

There is another interesting area of future optimization within
VACUUM, that also seems relevant to this patch: the general idea of
*avoiding* pruning during VACUUM, when it just doesn't make sense to
do so -- better to avoid dirtying the page for now. Needlessly pruning
inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
with heap fill factor reduced to 95) will have autovacuums that
*constantly* do it (granted, it may not matter so much there because
VACUUM is unlikely to re-dirty the page anyway). This patch seems
relevant to that area because it recognizes that pruning during VACUUM
is not necessarily special -- a new function called lazy_scan_noprune
may be used instead of lazy_scan_prune (though only when a cleanup
lock cannot be acquired). These pages are nevertheless considered
fully processed by VACUUM (this is perhaps 99% true, so it seems
reasonable to round up to 100% true).

I find it easy to imagine generalizing the same basic idea --
recognizing more ways in which pruning by VACUUM isn't necessarily
better than opportunistic pruning, at the level of each heap page. Of
course we *need* to prune sometimes (e.g., might be necessary to do so
to set the page all-visible in the visibility map), but why bother
when we don't, and when there is no reason to think that it'll help
anyway? Something to think about, at least.

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Andres Freund
Date:
Hi,

On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote:
> I have heard many stories about anti-wraparound/aggressive VACUUMs
> where the cure (which suddenly made autovacuum workers
> non-cancellable) was worse than the disease (not actually much danger
> of wraparound failure). For example:
> 
> https://www.joyent.com/blog/manta-postmortem-7-27-2015
> 
> Yes, this problem report is from 2015, which is before we even had the
> freeze map stuff. I still think that the point about aggressive
> VACUUMs blocking DDL (leading to chaos) remains valid.

As I noted below, I think this is a bit of a separate issue from what your
changes address in this patch.


> There is another interesting area of future optimization within
> VACUUM, that also seems relevant to this patch: the general idea of
> *avoiding* pruning during VACUUM, when it just doesn't make sense to
> do so -- better to avoid dirtying the page for now. Needlessly pruning
> inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
> with heap fill factor reduced to 95) will have autovacuums that
> *constantly* do it (granted, it may not matter so much there because
> VACUUM is unlikely to re-dirty the page anyway).

Hm. I'm a bit doubtful that there's all that many cases where it's worth not
pruning during vacuum. However, it seems much more common for opportunistic
pruning during non-write accesses.

Perhaps checking whether we'd log an FPW would be a better criteria for
deciding whether to prune or not compared to whether we're dirtying the page?
IME the WAL volume impact of FPWs is a considerably bigger deal than
unnecessarily dirtying a page that has previously been dirtied in the same
checkpoint "cycle".


> This patch seems relevant to that area because it recognizes that pruning
> during VACUUM is not necessarily special -- a new function called
> lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a
> cleanup lock cannot be acquired). These pages are nevertheless considered
> fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable
> to round up to 100% true).

IDK, the potential of not having usable space on an overly fragmented page
doesn't seem that low. We can't just mark such pages as all-visible because
then we'll potentially never reclaim that space.




> Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
> relfrozenxid, we now make non-aggressive VACUUMs work just a little
> harder in order to make that desirable outcome more likely in practice.
> Aggressive VACUUMs have long checked contended pages with only a shared
> lock, to avoid needlessly waiting on a cleanup lock (in the common case
> where the contended page has no tuples that need to be frozen anyway).
> We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
> course -- if we did that they'd no longer be non-aggressive.

IMO the big difference between aggressive / non-aggressive isn't whether we
wait for a cleanup lock, but that we don't skip all-visible pages...
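
I.e. the skipping rule today boils down to roughly this:

    bool        skip;

    if (aggressive)
        skip = VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer);    /* all-frozen pages only */
    else
        skip = VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer);   /* all-visible pages too */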


> But we now make the non-aggressive case notice that a failure to acquire a
> cleanup lock on one particular heap page does not in itself make it unsafe
> to advance relfrozenxid for the whole relation (which is what we usually see
> in the aggressive case already).
> 
> This new relfrozenxid optimization might not be all that valuable on its
> own, but it may still facilitate future work that makes non-aggressive
> VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> rather than later.  In general it would be useful for non-aggressive
> VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> a cleanup lock once or twice if needed).

What do you mean by "waiting once or twice"? A single wait may simply never
end on a busy page that's constantly pinned by a lot of backends...


> It would also be generally useful if aggressive VACUUMs were "less
> aggressive" opportunistically (e.g. by being responsive to query
> cancellations when the risk of wraparound failure is still very low).

Being cancelable is already a different concept than anti-wraparound
vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
anti-wrap only at autovacuum_freeze_max_age. The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.
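
For reference, the "promotion" amounts to roughly this, early in
heap_vacuum_rel() (xidFullScanLimit / mxactFullScanLimit being derived
from vacuum_freeze_table_age and friends):

    /* an otherwise-scheduled VACUUM becomes aggressive purely because of age */
    aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
                                               xidFullScanLimit);
    aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
                                              mxactFullScanLimit);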

This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.


> We now also collect LP_DEAD items in the dead_tuples array in the case
> where we cannot immediately get a cleanup lock on the buffer.  We cannot
> prune without a cleanup lock, but opportunistic pruning may well have
> left some LP_DEAD items behind in the past -- no reason to miss those.

This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.


> Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line pointers in a number of realistic /
common scenarios would be quite the win.
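
(For reference, the current definition -- because the divisor includes a
full tuple header, the same value, 291 with the default 8kB block size,
also caps bare line pointers:)

    #define MaxHeapTuplesPerPage    \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))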


> Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> since there is barely any real practical sense in which we actually
> miss doing useful work for these pages.  Besides, this information
> always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.


I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.

Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.


> +    /*
> +     * Aggressive VACUUM (which is the same thing as anti-wraparound
> +     * autovacuum for most practical purposes) exists so that we'll reliably
> +     * advance relfrozenxid and relminmxid sooner or later.  But we can often
> +     * opportunistically advance them even in a non-aggressive VACUUM.
> +     * Consider if that's possible now.

I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.


> +     * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
> +     * the rel_pages used by lazy_scan_prune, from before a possible relation
> +     * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
> +     */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...


> +    if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
> +        !vacrel->freeze_cutoffs_valid)
> +    {
> +        /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
> +        Assert(!aggressive);
> +        vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> +                            new_rel_allvisible, vacrel->nindexes > 0,
> +                            InvalidTransactionId, InvalidMultiXactId, false);
> +    }
> +    else
> +    {
> +        /* Can safely advance relfrozen and relminmxid, too */
> +        Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
> +               orig_rel_pages);
> +        vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> +                            new_rel_allvisible, vacrel->nindexes > 0,
> +                            FreezeLimit, MultiXactCutoff, false);
> +    }

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.
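
Something like the following, maintained as we scan (sketch only -- the
observed_* fields and the helper are invented names):

    /* track the oldest unfrozen xmin actually seen on scanned pages */
    static void
    observe_tuple_horizon(LVRelState *vacrel, HeapTupleHeader tuple)
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(tuple);

        if (TransactionIdIsNormal(xmin) &&
            TransactionIdPrecedes(xmin, vacrel->observed_frozenxid))
            vacrel->observed_frozenxid = xmin;

        /* xmax and MultiXact tracking omitted for brevity */
    }

    /* whenever a page is skipped using the visibility map: */
    vacrel->observed_horizon_valid = false;

    /* ... and at the end, instead of counting pages: */
    if (vacrel->observed_horizon_valid)
        vac_update_relstats(rel, new_rel_pages, new_live_tuples,
                            new_rel_allvisible, vacrel->nindexes > 0,
                            vacrel->observed_frozenxid,
                            vacrel->observed_minmxid, false);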


> @@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
>          bool        all_visible_according_to_vm = false;
>          LVPagePruneState prunestate;
>  
> -        /*
> -         * Consider need to skip blocks.  See note above about forcing
> -         * scanning of last page.
> -         */
> -#define FORCE_CHECK_PAGE() \
> -        (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
> -
>          pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
>  
>          update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
>                                   blkno, InvalidOffsetNumber);
>  
> +        /*
> +         * Consider need to skip blocks
> +         */
>          if (blkno == next_unskippable_block)
>          {
>              /* Time to advance next_unskippable_block */
> @@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
>          else
>          {
>              /*
> -             * The current block is potentially skippable; if we've seen a
> -             * long enough run of skippable blocks to justify skipping it, and
> -             * we're not forced to check it, then go ahead and skip.
> -             * Otherwise, the page must be at least all-visible if not
> -             * all-frozen, so we can set all_visible_according_to_vm = true.
> +             * The current block can be skipped if we've seen a long enough
> +             * run of skippable blocks to justify skipping it.
> +             *
> +             * There is an exception: we will scan the table's last page to
> +             * determine whether it has tuples or not, even if it would
> +             * otherwise be skipped (unless it's clearly not worth trying to
> +             * truncate the table).  This avoids having lazy_truncate_heap()
> +             * take access-exclusive lock on the table to attempt a truncation
> +             * that just fails immediately because there are tuples in the
> +             * last page.
>               */
> -            if (skipping_blocks && !FORCE_CHECK_PAGE())
> +            if (skipping_blocks &&
> +                !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
>              {
>                  /*
>                   * Tricky, tricky.  If this is in aggressive vacuum, the page

I find the  FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.



> @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
>  
>          buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
>                                   RBM_NORMAL, vacrel->bstrategy);
> +        page = BufferGetPage(buf);
> +        vacrel->scanned_pages++;

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.
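
E.g. an assertion-enabled variant along these lines (sketch only --
BufferGetPageChecked is an invented name, and callers that legitimately
read with only a pin make this more complicated than shown):

    static inline Page
    BufferGetPageChecked(Buffer buffer)
    {
        if (!BufferIsLocal(buffer))
            Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(
                                      GetBufferDescriptor(buffer - 1))));
        return BufferGetPage(buffer);
    }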


>          /*
> -         * We need buffer cleanup lock so that we can prune HOT chains and
> -         * defragment the page.
> +         * We need a buffer cleanup lock to prune HOT chains and defragment
> +         * the page in lazy_scan_prune.  But when it's not possible to acquire
> +         * a cleanup lock right away, we may be able to settle for reduced
> +         * processing in lazy_scan_noprune.
>           */

s/in lazy_scan_noprune/via lazy_scan_noprune/?


>          if (!ConditionalLockBufferForCleanup(buf))
>          {
>              bool        hastup;
>  
> -            /*
> -             * If we're not performing an aggressive scan to guard against XID
> -             * wraparound, and we don't want to forcibly check the page, then
> -             * it's OK to skip vacuuming pages we get a lock conflict on. They
> -             * will be dealt with in some future vacuum.
> -             */
> -            if (!aggressive && !FORCE_CHECK_PAGE())
> +            LockBuffer(buf, BUFFER_LOCK_SHARE);
> +
> +            /* Check for new or empty pages before lazy_scan_noprune call */
> +            if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
> +                                       vmbuffer))
>              {
> -                ReleaseBuffer(buf);
> -                vacrel->pinskipped_pages++;
> +                /* Lock and pin released for us */
> +                continue;
> +            }

Why isn't this done in lazy_scan_noprune()?


> +            if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
> +            {
> +                /* No need to wait for cleanup lock for this page */
> +                UnlockReleaseBuffer(buf);
> +                if (hastup)
> +                    vacrel->nonempty_pages = blkno + 1;
>                  continue;
>              }

Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes, if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?
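
I.e. something like this at the top of each of them:

    Assert(BufferGetBlockNumber(buf) == blkno);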


> +        /* Check for new or empty pages before lazy_scan_prune call */
> +        if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
>          {

Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?



> +/*
> + * Empty pages are not really a special case -- they're just heap pages that
> + * have no allocated tuples (including even LP_UNUSED items).  You might
> + * wonder why we need to handle them here all the same.  It's only necessary
> + * because of a rare corner-case involving a hard crash during heap relation
> + * extension.  If we ever make relation-extension crash safe, then it should
> + * no longer be necessary to deal with empty pages here (or new pages, for
> + * that matter).

I don't think it's actually that rare - the window for this is huge. You just
need to crash / immediate shutdown at any time between the relation having
been extended and the new page contents being written out (checkpoint or
buffer replacement / ring writeout). That's often many minutes.

I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.


> + * Caller can either hold a buffer cleanup lock on the buffer, or a simple
> + * shared lock.
> + */

Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?


> +static bool
> +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
> +                       Page page, bool sharelock, Buffer vmbuffer)

It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.



> +/*
> + *    lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
> + *
> + * Caller need only hold a pin and share lock on the buffer, unlike
> + * lazy_scan_prune, which requires a full cleanup lock.

I'd add something like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.


> +        if (ItemIdIsRedirected(itemid))
> +        {
> +            *hastup = true;        /* page won't be truncatable */
> +            continue;
> +        }

It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().


> +            case HEAPTUPLE_DEAD:
> +            case HEAPTUPLE_RECENTLY_DEAD:
> +
> +                /*
> +                 * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
> +                 *
> +                 * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
> +                 * and never has to deal with DEAD tuples directly (they
> +                 * reliably become LP_DEAD items through pruning).  Our
> +                 * approach to DEAD tuples is a bit arbitrary, but it seems
> +                 * better than totally ignoring them.
> +                 */
> +                new_dead_tuples++;
> +                break;

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead bot not yet removable".


> +    /*
> +     * Now save details of the LP_DEAD items from the page in the dead_tuples
> +     * array iff VACUUM uses two-pass strategy case
> +     */

Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?



> +    }
> +    else
> +    {
> +        /*
> +         * We opt to skip FSM processing for the page on the grounds that it
> +         * is probably being modified by concurrent DML operations.  Seems
> +         * best to assume that the space is best left behind for future
> +         * updates of existing tuples.  This matches what opportunistic
> +         * pruning does.

Why can we assume that there is concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)



I think this is a change mostly in the right direction. But as formulated this
commit does *WAY* too much at once.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan
Date:
On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:
> Hm. I'm a bit doubtful that there's all that many cases where it's worth not
> pruning during vacuum. However, it seems much more common for opportunistic
> pruning during non-write accesses.

Fair enough. I just wanted to suggest an exploratory conversation
about pruning (among several other things). I'm mostly saying: hey,
pruning during VACUUM isn't actually that special, at least not with
this refactoring patch in place. So maybe it makes sense to go
further, in light of that general observation about pruning in VACUUM.

Maybe it wasn't useful to even mention this aspect now. I would rather
focus on freezing optimizations for now -- that's much more promising.

> Perhaps checking whether we'd log an FPW would be a better criteria for
> deciding whether to prune or not compared to whether we're dirtying the page?
> IME the WAL volume impact of FPWs is a considerably bigger deal than
> unnecessarily dirtying a page that has previously been dirtied in the same
> checkpoint "cycle".

Agreed. (I tend to say the former when I really mean the latter, which
I should try to avoid.)

> IDK, the potential of not having usable space on an overly fragmented page
> doesn't seem that low. We can't just mark such pages as all-visible because
> then we'll potentially never reclaim that space.

Don't get me started on this - because I'll never stop.

It makes zero sense that we don't think about free space holistically,
using the whole context of what changed in the recent past. As I think
you know already, a higher level concept (like open and closed pages)
seems like the right direction to me -- because it isn't sensible to
treat X bytes of free space in one heap page as essentially
interchangeable with any other space on any other heap page. That
misses an enormous amount of things that matter. The all-visible
status of a page is just one such thing.

> IMO the big difference between aggressive / non-aggressive isn't whether we
> wait for a cleanup lock, but that we don't skip all-visible pages...

I know what you mean by that, of course. But FWIW that definition
seems too focused on what actually happens today, rather than what is
essential given the invariants we have for VACUUM. And so I personally
prefer to define it as "a VACUUM that *reliably* advances
relfrozenxid". This looser definition will probably "age" well (ahem).

> > This new relfrozenxid optimization might not be all that valuable on its
> > own, but it may still facilitate future work that makes non-aggressive
> > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> > rather than later.  In general it would be useful for non-aggressive
> > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> > a cleanup lock once or twice if needed).
>
> What do you mean by "waiting once or twice"? A single wait may simply never
> end on a busy page that's constantly pinned by a lot of backends...

I was speculating about future work again. I think that you've taken
my words too literally. This is just a draft commit message, just a
way of framing what I'm really trying to do.

Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once, especially if we're already pretty
sure that that's all the extra work that it would take to ultimately
be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM
-- it's easy to make that case. Once you agree that it makes sense
under these favorable circumstances, you've already made
"aggressiveness" a continuous thing conceptually, at a high level.

The current binary definition of "aggressive" is needlessly
restrictive -- that much seems clear to me. I'm much less sure of what
specific alternative should replace it.

I've already prototyped advancing relfrozenxid using a dynamically
determined value, so that our final relfrozenxid is just about the
most recent safe value (not the original FreezeLimit). That's been
interesting. Consider this log output from an autovacuum with the
prototype patch (also uses my new instrumentation), based on standard
pgbench (just tuned heap fill factor a bit):

LOG:  automatic vacuum of table "regression.public.pgbench_accounts":
index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map
(3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. While the
accounts table doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?

I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.

Again, my point is that there is a good chance that redefining
aggressiveness in some way will be helpful. A more creative, flexible
definition might be just what we need. The details are very much up in
the air, though.

> > It would also be generally useful if aggressive VACUUMs were "less
> > aggressive" opportunistically (e.g. by being responsive to query
> > cancellations when the risk of wraparound failure is still very low).
>
> Being cancelable is already a different concept than anti-wraparound
> vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
> anti-wrap only at autovacuum_freeze_max_age.

You know what I meant. Also, did *you* mean "being cancelable is
already a different concept to *aggressive* vacuums"?   :-)

> The problem is that the
> autovacuum scheduling is way too naive for that to be a significant benefit -
> nothing tries to schedule autovacuums so that they have a chance to complete
> before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
> to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to let overnight VACUUMs (off-hours, low-activity,
scripted VACUUMs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.

Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?

> This is one of the most embarrassing issues around the whole anti-wrap
> topic. We kind of define it as an emergency that there's an anti-wraparound
> vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).

I agree that anti-wraparound autovacuum is way too unfriendly, though.

> > We now also collect LP_DEAD items in the dead_tuples array in the case
> > where we cannot immediately get a cleanup lock on the buffer.  We cannot
> > prune without a cleanup lock, but opportunistic pruning may well have
> > left some LP_DEAD items behind in the past -- no reason to miss those.
>
> This has become *much* more important with the changes around deciding when to
> index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
> items, it's that a previous vacuum is quite likely to have left them there,
> because the previous vacuum decided not to perform index cleanup.

I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.

What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.

If you can be more concrete about a problem you've seen, then I might
be able to help. It's not like there are no options in this already. I
already thought about introducing a small degree of randomness into
the process of deciding to skip or to not skip (in the
consider_bypass_optimization path of lazy_vacuum() on Postgres 14).
The optimization is mostly valuable because it allows us to do more
useful work in VACUUM -- not because it allows us to do less useless
work in VACUUM. In particular, it allows us to tune
autovacuum_vacuum_insert_scale_factor very aggressively with an
append-only table, without useless index vacuuming making it all but
impossible for autovacuum to get to the useful work.

> > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> > technique is independently capable of cleaning up line pointer bloat),
>
> One thing we could do around this, btw, would be to aggressively replace
> LP_REDIRECT items with their target item. We can't do that in all situations
> (somebody might be following a ctid chain), but I think we have all the
> information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
> state or something like that.

Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.

I am not aware of a specific workload where the patch helps, but that
doesn't mean that there isn't one, or that it doesn't matter. It's
subtle enough that I might have just missed something. I *expect* the
true damage over time to be very hard to model or understand -- I
imagine the potential for weird feedback loops is there.

> I think that'd be quite a win - we right now often "migrate" to other pages
> for modifications not because we're out of space on a page, but because we run
> out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
> number of line pointers, not just the number of actual tuples). Effectively
> doubling the number of available line pointers in a number of
> realistic / common scenarios would be quite the win.

I believe Masahiko is working on this in the current cycle. It would
be easier if we had a better sense of how increasing
MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of
increasing that seems sound to me.

> > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> since there is barely any real practical sense in which we actually
> > miss doing useful work for these pages.  Besides, this information
> > always seemed to have little practical value, even to Postgres hackers.
>
> -0.5. I think it provides some value, and I don't see why the removal of the
> information should be tied to this change. It's hard to diagnose why some dead
> tuples aren't cleaned up - a common cause for that on smaller tables is that
> nearly all pages are pinned nearly all the time.

Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.

Also, I would like to put more information (much more useful
information) in the same log output. Perhaps that will be less
controversial if I take something useless away first.

> I wonder if we could have a more restrained version of heap_page_prune() that
> doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
> it's not immediately obvious that we need it if we constrain ourselves to only
> modify tuple versions that cannot be visible to anybody.
>
> Random note: I really dislike that we talk about cleanup locks in some parts
> of the code, and super-exclusive locks in others :(.

Somebody should normalize that.

> > +     /*
> > +      * Aggressive VACUUM (which is the same thing as anti-wraparound
> > +      * autovacuum for most practical purposes) exists so that we'll reliably
> > +      * advance relfrozenxid and relminmxid sooner or later.  But we can often
> > +      * opportunistically advance them even in a non-aggressive VACUUM.
> > +      * Consider if that's possible now.
>
> I don't agree with the "most practical purposes" bit. There's a huge
> difference because manual VACUUMs end up aggressive but not anti-wrap once
> older than vacuum_freeze_table_age.

Okay.

> > +      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
> > +      * the rel_pages used by lazy_scan_prune, from before a possible relation
> > +      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
> > +      */
>
> I think it should be doable to add an isolation test for this path. There have
> been quite a few bugs around the wider topic...

I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.

> > +     {
> > +             /* Can safely advance relfrozen and relminmxid, too */
> > +             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
> > +                        orig_rel_pages);
> > +             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> > +                                                     new_rel_allvisible, vacrel->nindexes > 0,
> > +                                                     FreezeLimit, MultiXactCutoff, false);
> > +     }
>
> I wonder if this whole logic wouldn't become easier and less fragile if we
> just went for maintaining the "actually observed" horizon while scanning the
> relation. If we skip a page via VM set the horizon to invalid. Otherwise we
> can keep track of the accurate horizon and use that. No need to count pages
> and stuff.

There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end). Fundamentally, this will
only work if we decide to only skip all-frozen pages, which (by
definition) only happens within aggressive VACUUMs. Isn't it that
simple?

You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated). It would certainly benefit
this dynamic relfrozenxid business if it was possible, though. If we
could somehow make that work, then almost every VACUUM would be able
to advance relfrozenxid, independently of aggressive-ness -- because
we wouldn't have any all-visible-but-not-all-frozen pages to skip
(that important detail wouldn't be left to chance).

> > -                     if (skipping_blocks && !FORCE_CHECK_PAGE())
> > +                     if (skipping_blocks &&
> > +                             !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
> >                       {
> >                               /*
> >                                * Tricky, tricky.  If this is in aggressive vacuum, the page
>
> I find the  FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
> mixing such changes within a larger change doing many other things.

I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later
patch) because the patch also removes the only other
FORCE_CHECK_PAGE() call -- and the latter change is very much in scope
for the big patch (can't be broken down into smaller changes, I
think). And so this felt natural to me. But if you prefer, I can break
it out into a separate commit.

> > @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
> >
> >               buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
> >                                                                RBM_NORMAL, vacrel->bstrategy);
> > +             page = BufferGetPage(buf);
> > +             vacrel->scanned_pages++;
>
> I don't particularly like doing BufferGetPage() before holding a lock on the
> page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
> be good to have a crosscheck that BufferGetPage() is only allowed when holding
> a page level lock.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).

> > +                     /* Check for new or empty pages before lazy_scan_noprune call */
> > +                     if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
> > +                                                                        vmbuffer))
> >                       {
> > -                             ReleaseBuffer(buf);
> > -                             vacrel->pinskipped_pages++;
> > +                             /* Lock and pin released for us */
> > +                             continue;
> > +                     }
>
> Why isn't this done in lazy_scan_noprune()?

No reason, really -- could be done that way (we'd then also give
lazy_scan_prune the same treatment). I thought that it made a certain
amount of sense to keep some of this in the main loop, but I can
change it if you want.

> > +                     if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
> > +                     {
> > +                             /* No need to wait for cleanup lock for this page */
> > +                             UnlockReleaseBuffer(buf);
> > +                             if (hastup)
> > +                                     vacrel->nonempty_pages = blkno + 1;
> >                               continue;
> >                       }
>
> Do we really need all of buf, blkno, page for both of these functions? Quite
> possible that yes, if so, could we add an assertion that
> BufferGetBlockNumber(buf) == blkno?

This just matches the existing lazy_scan_prune function (which doesn't
mean all that much, since it was only added in Postgres 14). Will add
the assertion to both.

> > +             /* Check for new or empty pages before lazy_scan_prune call */
> > +             if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
> >               {
>
> Maybe worth a note mentioning that we need to redo this even in the aggressive
> case, because we didn't continually hold a lock on the page?

Isn't that obvious? Either way it isn't the kind of thing that I'd try
to optimize away. It's such a narrow issue.

> > +/*
> > + * Empty pages are not really a special case -- they're just heap pages that
> > + * have no allocated tuples (including even LP_UNUSED items).  You might
> > + * wonder why we need to handle them here all the same.  It's only necessary
> > + * because of a rare corner-case involving a hard crash during heap relation
> > + * extension.  If we ever make relation-extension crash safe, then it should
> > + * no longer be necessary to deal with empty pages here (or new pages, for
> > + * that matter).
>
> I don't think it's actually that rare - the window for this is huge.

I can just remove the comment, though it still makes sense to me.

> I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
> think the overhead would be prohibitive. We'd need to do synchronous WAL
> logging while holding the extension lock I think. Um, not fun.

My long term goal for the FSM (the lease based design I talked about
earlier this year) includes soft ownership of free space from
preallocated pages by individual xacts -- the smgr layer itself
becomes transactional and crash safe (at least to a limited degree).
This includes bulk extension of relations, to make up for the new
overhead implied by crash safe rel extension. I don't think that we
should require VACUUM (or anything else) to be cool with random
uninitialized pages -- to me that just seems backwards.

We can't do true bulk extension right now (just an inferior version
that doesn't give specific pages to specific backends) because the
risk of losing a bunch of empty pages for way too long is not
acceptable. But that doesn't seem fundamental to me -- that's one of
the things we'd be fixing at the same time (through what I call soft
ownership semantics). I think we'd come out ahead on performance, and
*also* have a more robust approach to relation extension.

> > + * Caller can either hold a buffer cleanup lock on the buffer, or a simple
> > + * shared lock.
> > + */
>
> Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
> made me wonder why that could be true. Perhaps just say that it needs to be
> called with at least a shared lock?

Okay.

> > +static bool
> > +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
> > +                                        Page page, bool sharelock, Buffer vmbuffer)
>
> It'd be good to document the return value - for me it's not a case where it's
> so obvious that it's not worth it.

Okay.

> > +/*
> > + *   lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
> > + *
> > + * Caller need only hold a pin and share lock on the buffer, unlike
> > + * lazy_scan_prune, which requires a full cleanup lock.
>
> I'd add something like "returns whether a cleanup lock is required". Having to
> read multiple paragraphs to understand the basic meaning of the return value
> isn't great.

Will fix.

> > +             if (ItemIdIsRedirected(itemid))
> > +             {
> > +                     *hastup = true;         /* page won't be truncatable */
> > +                     continue;
> > +             }
>
> It's not really new, but this comment is now a bit confusing, because it can
> be understood to be about PageTruncateLinePointerArray().

I didn't think of that. Will address it in the next version.

> Why does it make sense to track DEAD tuples this way? Isn't that going to lead
> to counting them over-and-over again? I think it's quite misleading to include
> them in "dead but not yet removable".

Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?

I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.

Note that we don't remember LP_DEAD items here, either (not here, in
lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we
pretty much interpret LP_DEAD items as "future LP_UNUSED items"
instead -- we make a soft assumption that we're going to go on to mark
the same items LP_UNUSED during a second pass over the heap. My point
is that there is no natural way to count "fully DEAD tuple that
autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD.

> > +     /*
> > +      * Now save details of the LP_DEAD items from the page in the dead_tuples
> > +      * array iff VACUUM uses two-pass strategy case
> > +      */
>
> Do we really need to have separate code for this in lazy_scan_prune() and
> lazy_scan_noprune()?

There is hardly any repetition, though.

> > +     }
> > +     else
> > +     {
> > +             /*
> > +              * We opt to skip FSM processing for the page on the grounds that it
> > +              * is probably being modified by concurrent DML operations.  Seems
> > +              * best to assume that the space is best left behind for future
> > +              * updates of existing tuples.  This matches what opportunistic
> > +              * pruning does.
>
> Why can we assume that there is concurrent DML rather than concurrent read-only
> operations? IME it's much more common for read-only operations to block
> cleanup locks than read-write ones (partially because the frequency makes it
> easier, partially because cursors allow long-held pins, partially because the
> EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I actually agree. It still probably isn't worth dealing with the FSM
here, though. It's just too much mechanism for too little benefit in a
very rare case. What do you think?

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Andres Freund
Date:
Hi,

On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote:
> Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
> non-aggressive VACUUM -- so "at least waiting for one or two pins
> during non-aggressive VACUUM" might not have been the best way of
> expressing the idea that I wanted to express. The important point is
> that _we can make a choice_ about stuff like this dynamically, based
> on the observed characteristics of the table, and some general ideas
> about the costs and benefits (of waiting or not waiting, or of how
> long we want to wait in total, whatever might be important). This
> probably just means adding some heuristics that are pretty sensitive
> to any reason to not do more work in a non-aggressive VACUUM, without
> *completely* balking at doing even a tiny bit more work.

> For example, we can definitely afford to wait a few more milliseconds
> to get a cleanup lock just once

We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.


> All of the autovacuums against the accounts table look similar to this
> one -- you don't see anything about relfrozenxid being advanced
> (because it isn't). Whereas for the smaller pgbench tables, every
> single VACUUM successfully advances relfrozenxid to a fairly recent
> XID (without there ever being an aggressive VACUUM) -- just because
> VACUUM needs to visit every page for the smaller tables. While the
> accounts table doesn't generally need to have 100% of all pages
> touched by VACUUM -- it's more like 95% there. Does that really make
> sense, though?

Does what really make sense?


> I'm pretty sure that less aggressive VACUUMing (e.g. higher
> scale_factor setting) would lead to more aggressive setting of
> relfrozenxid here. I'm always suspicious when I see insignificant
> differences that lead to significant behavioral differences. Am I
> worried over nothing here? Perhaps -- we don't really need to advance
> relfrozenxid early with this table/workload anyway. But I'm not so
> sure.

I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions that hold down the xid
horizon. But in real workloads that's incredibly common IME.  It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.

I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.

This is becoming more common with the increased use of partitioning.


> > The problem is that the
> > autovacuum scheduling is way too naive for that to be a significant benefit -
> > nothing tries to schedule autovacuums so that they have a chance to complete
> > before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
> > to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.
> 
> Not sure what you mean about scheduling, since vacuum_freeze_table_age
> is only in place to make overnight (off hours low activity scripted
> VACUUMs) freeze tuples before any autovacuum worker gets the chance
> (since the latter may run at a much less convenient time). Sure,
> vacuum_freeze_table_age might also force a regular autovacuum worker
> to do an aggressive VACUUM -- but I think it's mostly intended for a
> manual overnight VACUUM. Not usually very helpful, but also not
> harmful.

> Oh, wait. I think that you're talking about how autovacuum workers in
> particular tend to be affected by this. We launch an av worker that
> wants to clean up bloat, but it ends up being aggressive (and maybe
> taking way longer), perhaps quite randomly, only due to
> vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
> that it?

No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.

I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums. Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.

Behaviours that lead to a "sudden" falling over, rather than getting gradually
worse, are bad - they somehow tend to happen on Friday evenings :).



> > This is one of the most embarrassing issues around the whole anti-wrap
> > topic. We kind of define it as an emergency that there's an anti-wraparound
> > vacuum. But we have *absolutely no mechanism* to prevent them from occurring.
> 
> What do you mean? Only an autovacuum worker can do an anti-wraparound
> VACUUM (which is not quite the same thing as an aggressive VACUUM).

Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.

Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.


> > > We now also collect LP_DEAD items in the dead_tuples array in the case
> > > where we cannot immediately get a cleanup lock on the buffer.  We cannot
> > > prune without a cleanup lock, but opportunistic pruning may well have
> > > left some LP_DEAD items behind in the past -- no reason to miss those.
> >
> > This has become *much* more important with the changes around deciding when to
> > index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
> > items, it's that a previous vacuum is quite likely to have left them there,
> > because the previous vacuum decided not to perform index cleanup.
> 
> I haven't seen any evidence of that myself (with the optimization
> added to Postgres 14 by commit 5100010ee4). I still don't understand
> why you doubted that work so much. I'm not saying that you're wrong
> to; I'm saying that I don't think that I understand your perspective
> on it.

I didn't (nor do I) doubt that it can be useful - to the contrary, I think the
unconditional index pass was a huge practical issue.  I do however think that
there are cases where it can cause trouble. The comment above wasn't meant as
a criticism - just that it seems worth pointing out that one reason we might
encounter a lot of LP_DEAD items is previous vacuums that didn't perform index
cleanup.


> What I have seen in my own tests (particularly with BenchmarkSQL) is
> that most individual tables either never apply the optimization even
> once (because the table reliably has heap pages with many more LP_DEAD
> items than the 2%-of-relpages threshold), or will never need to
> (because there are precisely zero LP_DEAD items anyway). Remaining
> tables that *might* use the optimization tend to not go very long
> without actually getting a round of index vacuuming. It's just too
> easy for updates (and even aborted xact inserts) to introduce new
> LP_DEAD items for us to go long without doing index vacuuming.

I think workloads are a bit more varied than a realistic set of benchmarks
that one person can run themselves.

I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time.  Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.


> > > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> > > technique is independently capable of cleaning up line pointer bloat),
> >
> > One thing we could do around this, btw, would be to aggressively replace
> > LP_REDIRECT items with their target item. We can't do that in all situations
> > (somebody might be following a ctid chain), but I think we have all the
> > information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
> > state or something like that.
> 
> Another idea is to truncate the line pointer during pruning (including
> opportunistic pruning). Matthias van de Meent has a patch for that.

I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.


> > > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> > > since there is barely any real practical sense in which we actually
> > > miss doing useful work for these pages.  Besides, this information
> > > always seemed to have little practical value, even to Postgres hackers.
> >
> > -0.5. I think it provides some value, and I don't see why the removal of the
> > information should be tied to this change. It's hard to diagnose why some dead
> > tuples aren't cleaned up - a common cause for that on smaller tables is that
> > nearly all pages are pinned nearly all the time.
> 
> Is that still true, though? If it turns out that we need to leave it
> in, then I can do that. But I'd prefer to wait until we have more
> information before making a final decision. Remember, the high level
> idea of this whole patch is that we do as much work as possible for
> any scanned_pages, which now includes pages that we never successfully
> acquired a cleanup lock on. And so we're justified in assuming that
> they're exactly equivalent to pages that we did get a cleanup lock on --
> that's now the working assumption. I know that that's not literally
> true, but that doesn't mean it's not a useful fiction -- it should be
> very close to the truth.

IDK, it seems misleading to me. Small tables with a lot of churn - quite
common - are highly reliant on LP_DEAD entries getting removed or the tiny
table suddenly isn't so tiny anymore. And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.

If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.


> > > +      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
> > > +      * the rel_pages used by lazy_scan_prune, from before a possible relation
> > > +      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
> > > +      */
> >
> > I think it should be doable to add an isolation test for this path. There have
> > been quite a few bugs around the wider topic...
> 
> I would argue that we already have one -- vacuum-reltuples.spec. I had
> to update its expected output in the patch. I would argue that the
> behavioral change (count tuples on a pinned-by-cursor heap page) that
> necessitated updating the expected output for the test is an
> improvement overall.

I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.


> > > +     {
> > > +             /* Can safely advance relfrozen and relminmxid, too */
> > > +             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
> > > +                        orig_rel_pages);
> > > +             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> > > +                                                     new_rel_allvisible, vacrel->nindexes > 0,
> > > +                                                     FreezeLimit, MultiXactCutoff, false);
> > > +     }
> >
> > I wonder if this whole logic wouldn't become easier and less fragile if we
> > just went for maintaining the "actually observed" horizon while scanning the
> > relation. If we skip a page via VM set the horizon to invalid. Otherwise we
> > can keep track of the accurate horizon and use that. No need to count pages
> > and stuff.
> 
> There is no question that that makes sense as an optimization -- my
> prototype convinced me of that already. But I don't think that it can
> simplify anything (not even the call to vac_update_relstats itself, to
> actually update relfrozenxid at the end).

Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to wrong determination
about whether we scanned everything or not.


> Fundamentally, this will only work if we decide to only skip all-frozen
> pages, which (by definition) only happens within aggressive VACUUMs.

Hm? Or if there's just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.


> You recently said (on the heap-pruning-14-bug thread) that you don't
> think it would be practical to always set a page all-frozen when we
> see that we're going to set it all-visible -- apparently you feel that
> we could never opportunistically freeze early such that all-visible
> but not all-frozen pages practically cease to exist. I'm still not
> sure why you believe that (though you may be right, or I might have
> misunderstood, since it's complicated).

Yes, I think it may not work out to do that. But it's not a very strongly held
opinion.

One reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age or that limit never being reached.

On a table with 40 million individually inserted rows, fully hintbitted via
reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A
subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's quite a
large multiplier...

If we ever managed to not have a per-page all-visible flag this'd get even
more extreme, because we'd then not even need to dirty the page for
insert-only pages. But if we want to freeze, we'd need to (unless we just got
rid of freezing).


> It would certainly benefit this dynamic relfrozenxid business if it was
> possible, though. If we could somehow make that work, then almost every
> VACUUM would be able to advance relfrozenxid, independently of
> aggressive-ness -- because we wouldn't have any
> all-visible-but-not-all-frozen pages to skip (that important detail wouldn't
> be left to chance).

Perhaps we can have most of the benefit even without that.  If we were to
freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip
all-visible but not all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?


> > I don't particularly like doing BufferGetPage() before holding a lock on the
> > page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
> > be good to have a crosscheck that BufferGetPage() is only allowed when holding
> > a page level lock.
> 
> I have occasionally wondered if the whole idea of reading heap pages
> with only a pin (and having cleanup locks in VACUUM) is really worth
> it -- alternative designs seem possible. Obviously that's a BIG
> discussion, and not one to have right now. But it seems kind of
> relevant.

With 'reading' do you mean reads-from-os, or just references to buffer
contents?


> Since it is often legit to read a heap page without a buffer lock
> (only a pin), I can't see why BufferGetPage() without a buffer lock
> shouldn't also be okay -- if anything it seems safer. I think that I
> would agree with you if it wasn't for that inconsistency (which is
> rather a big "if", to be sure -- even for me).

At least for heap it's rarely legit to read buffer contents via
BufferGetPage() without a lock. It's legit to read data at already-determined
offsets, but you can't look at much other than the tuple contents.


> > Why does it make sense to track DEAD tuples this way? Isn't that going to lead
> > to counting them over-and-over again? I think it's quite misleading to include
> > them in "dead bot not yet removable".
> 
> Compared to what? Do we really want to invent a new kind of DEAD tuple
> (e.g., to report on), just to handle this rare case?

When looking at logs I use the
"tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"
line to see whether the user is likely to have issues around an old
transaction / slot / prepared xact preventing cleanup. If new_dead_tuples
no longer identifies those cases, that line isn't reliable anymore.


> I accept that this code is lying about the tuples being RECENTLY_DEAD,
> kind of. But isn't it still strictly closer to the truth, compared to
> HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
> counting it at all.

I don't see how it's closer at all. There's imo a significant difference
between not being able to remove tuples because of the xmin horizon, and not
being able to remove them because we couldn't get a cleanup lock.


Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres@anarazel.de> wrote:
> > For example, we can definitely afford to wait a few more milliseconds
> > to get a cleanup lock just once
>
> We currently have no infrastructure to wait for an lwlock or pincount for a
> limited time. And at least for the former it'd not be easy to add. It may be
> worth adding that at some point, but I'm doubtful this is sufficient reason
> for nontrivial new infrastructure in very performance sensitive areas.

It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time
(not forever) for a cleanup lock in non-aggressive VACUUM, once we
have some of the relfrozenxid stuff we've talked about in place. In a
world where we're smarter about advancing relfrozenxid in
non-aggressive VACUUMs, the choice between waiting for a cleanup lock,
and not waiting (but also not advancing relfrozenxid at all) matters
less -- it's no longer a binary choice.

It's no longer a binary choice because we will have done away with the
current rigid way in which our new relfrozenxid for the relation is
either FreezeLimit, or nothing at all. So far we've only talked about
the case where we can update relfrozenxid with a value that happens to
be much newer than FreezeLimit. If we can do that, that's great. But
what about setting relfrozenxid to an *older* value than FreezeLimit
instead (in a non-aggressive VACUUM)? That's also pretty good! There
is still a decent chance that the final "suboptimal" relfrozenxid that
we determine can be safely set in pg_class at the end of our VACUUM
will still be far more recent than the preexisting relfrozenxid.
Especially with larger tables.

Advancing relfrozenxid should be thought of as a totally independent
thing from freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high
level goal for this project.

Remember, FreezeLimit is derived from vacuum_freeze_min_age in the
obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.
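
For reference, the derivation amounts to something like the following toy
sketch (the real code in vacuum.c also caps the setting and uses
wraparound-aware XID arithmetic, both of which are omitted here):

#include <stdint.h>

typedef uint32_t xid_t;

#define FIRST_NORMAL_XID ((xid_t) 3)    /* XIDs 0-2 are reserved */

/* Toy model: FreezeLimit is OldestXmin minus vacuum_freeze_min_age. */
static xid_t
freeze_limit(xid_t oldest_xmin, uint32_t freeze_min_age)
{
    xid_t limit = oldest_xmin - freeze_min_age;

    /* guard against (unsigned) underflow into the reserved XID range */
    if (limit > oldest_xmin || limit < FIRST_NORMAL_XID)
        limit = FIRST_NORMAL_XID;
    return limit;
}

A smaller vacuum_freeze_min_age therefore produces a FreezeLimit that sits
closer to OldestXmin.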

> > All of the autovacuums against the accounts table look similar to this
> > one -- you don't see anything about relfrozenxid being advanced
> > (because it isn't).

> > Does that really make
> > sense, though?
>
> Does what really make sense?

Well, my accounts table example wasn't a particularly good one (it was
a conveniently available example). I am now sure that you got the
point I was trying to make here already, based on what you go on to
say about non-aggressive VACUUMs optionally *not* skipping
all-visible-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response).

On reflection, the simplest way of expressing the same idea is what I
just said about decoupling (decoupling advancing relfrozenxid from
freezing).

> I think pgbench_accounts is just a really poor showcase. Most importantly,
> there are no even slightly longer-running transactions that hold down the xid
> horizon. But in real workloads that's incredibly common IME.  It's also quite
> uncommon in real workloads to have huge tables in which all records are
> updated. It's more common to have value ranges that are nearly static, and a
> more heavily changing range.

I agree.

> I think the most interesting cases where using the "measured" horizon will be
> advantageous are anti-wrap vacuums. Those obviously have to happen for rarely
> modified tables, including completely static ones, too. Using the "measured"
> horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
> tables, because we'll be able to set a much more recent relfrozenxid.

That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no
reason to emphasize that case in the design itself. The "decoupling"
concept now seems like the key design-level concept -- everything else
follows naturally from that.

> This is becoming more common with the increased use of partitioning.

Also with bulk loading. There could easily be a tiny number of
distinct XIDs that are close together in time, for many many rows --
practically one XID, or even exactly one XID.

> No, not quite. We treat anti-wraparound vacuums as an emergency (including
> logging messages, not cancelling). But the only mechanism we have against
> anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
> not really a "real" mechanism, because it requires an "independent" reason to
> vacuum a table.

Got it.

> I've seen cases where anti-wraparound vacuums weren't a problem / never
> happened for important tables for a long time, because there always was an
> "independent" reason for autovacuum to start doing its thing before the table
> got to be autovacuum_freeze_max_age old. But at some point the important
> tables started to be big enough that autovacuum didn't schedule vacuums that
> got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
> vacuums.

Right. Not just because they were big; also because autovacuum runs at
geometric intervals -- the final reltuples from last time is used to
determine the point at which av runs this time. This might make sense,
or it might not make any sense -- it all depends (mostly on index
stuff).

> Then things started to burn, because of the unpaced anti-wrap vacuums
> clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
> remember the details.

Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause
chaos because they interact badly with automated DDL is one I've seen
several times -- I'm sure you have too. That was what the Manta/Joyent
blogpost I referenced upthread went into.

> Behaviours that lead to a "sudden" falling over, rather than getting gradually
> worse, are bad - they somehow tend to happen on Friday evenings :).

These are among our most important challenges IMV.

> Just that autovacuum should have a mechanism to trigger aggressive vacuums
> (i.e. ones that are guaranteed to be able to increase relfrozenxid unless
> cancelled) before getting to the "emergency"-ish anti-wraparound state.

Maybe, but that runs into the problem of needing another GUC that
nobody will ever be able to remember the name of. I consider the idea
of adding a variety of measures that make non-aggressive VACUUM much
more likely to advance relfrozenxid in practice to be far more
promising.

> Or alternatively that we should have a separate threshold for the "harsher"
> anti-wraparound measures.

Or maybe just raise the default of autovacuum_freeze_max_age, which
many people don't change? That might be a lot safer than it once was.
Or will be, once we manage to teach VACUUM to advance relfrozenxid
more often in non-aggressive VACUUMs on Postgres 15. Imagine a world
in which we have that stuff in place, as well as related enhancements
added in earlier releases: autovacuum_vacuum_insert_scale_factor, the
freezemap, and the wraparound failsafe.

These add up to a lot; with all of that in place, the risk we'd be
introducing by increasing the default value of
autovacuum_freeze_max_age would be *far* lower than the risk of making
the same change back in 2006. I bring up 2006 because it was the year
that commit 48188e1621 added autovacuum_freeze_max_age -- the default
hasn't changed since that time.

> I think workloads are a bit more varied than a realistic set of benchmarks
> that one person can run themselves.

No question. I absolutely accept that I only have to miss one
important detail with something like this -- that just goes with the
territory. Just saying that I have yet to see any evidence that the
bypass-indexes behavior really hurt anything. I do take the idea that
I might have missed something very seriously, despite all this.

> I gave you examples of cases that I see as likely being bitten by this,
> e.g. when the skipped index cleanup prevents IOS scans. When both the
> likely-to-be-modified and likely-to-be-queried value ranges are a small subset
> of the entire data, the 2% threshold can prevent vacuum from cleaning up
> LP_DEAD entries for a long time.  Or when all index scans are bitmap index
> scans, and nothing ends up cleaning up the dead index entries in certain
> ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
> small rollback / non-HOT update rate can start to be really painful.

That does seem possible. But I consider it very unlikely to appear as
a regression caused by the bypass mechanism itself -- not in any way
that was consistent over time. As far as I can tell, autovacuum
scheduling just doesn't operate at that level of precision, and never
has.

I have personally observed that ANALYZE does a very bad job at
noticing LP_DEAD items in tables/workloads where LP_DEAD items (not
DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE
should count these items as if they were normal tuples seems pretty
bad to me.

Put it this way: imagine you run into trouble with the bypass thing,
and then you opt to disable it on that table (using the INDEX_CLEANUP
reloption). Why should this step solve the problem on its own? In
order for that to work, VACUUM would have to know to be very
aggressive about these LP_DEAD items. But there is good reason to
believe that it just won't ever notice them, as long as ANALYZE is
expected to provide reliable statistics that drive autovacuum --
they're just too concentrated for the block-based approach to truly
work.

I'm not minimizing the risk. Just telling you my thoughts on this.

> I'm a bit doubtful that's as important (which is not to say that it's not
> worth doing). For a heavily updated table the max space usage of the line
> pointer array just isn't as big a factor as ending up with only half the
> usable line pointers.

Agreed; by far the best chance we have of improving the line pointer
bloat situation is preventing it in the first place, by increasing
MaxHeapTuplesPerPage. Once we actually do that, our remaining options
are going to be much less helpful -- then it really is mostly just up
to VACUUM.

> And it's harder to diagnose why the
> cleanup isn't happening without knowledge that pages needing cleanup couldn't
> be cleaned up due to pins.
>
> If you want to improve the logic so that we only count pages that would have
> something to clean up, I'd be happy as well. It doesn't have to mean exactly
> what it means today.

It seems like what you really care about here are remaining cases
where our inability to acquire a cleanup lock has real consequences --
you want to hear about it when it happens, however unlikely it may be.
In other words, you want to keep something in log_autovacuum_* that
indicates that "less than the expected amount of work was completed"
due to an inability to acquire a cleanup lock. And so for you, this is
a question of keeping instrumentation that might still be useful, not
a question of how we define things fundamentally, at the design level.

Sound right?

If so, then this proposal might be acceptable to you:

* Remaining DEAD tuples with storage (though not LP_DEAD items from
previous opportunistic pruning) will get counted separately in the
lazy_scan_noprune (no cleanup lock) path. Also count the total number
of distinct pages that were found to contain one or more such DEAD
tuples.

* These two new counters will be reported on their own line in the log
output, though only in the cases where we actually have any such
tuples -- which will presumably be much rarer than simply failing to
get a cleanup lock (that's now no big deal at all, because we now
consistently do certain cleanup steps, and because FreezeLimit isn't
the only viable thing that we can set relfrozenxid to, at least in the
non-aggressive case).

* There is still a limited sense in which the same items get counted
as RECENTLY_DEAD -- though just those aspects that make the overall
design simpler. So the helpful aspects of this are still preserved.

We only need to tell pgstat_report_vacuum() that these items are
"deadtuples" (remaining dead tuples). That can work by having its
caller add a new int64 counter (same new tuple-based counter used for
the new log line) to vacrel->new_dead_tuples. We'd also add the same
new tuple counter in about the same way at the point where we
determine a final vacrel->new_rel_tuples.

So we wouldn't really be treating anything as RECENTLY_DEAD anymore --
pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically
expect anything about RECENTLY_DEAD-ness already.
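
Roughly speaking, the accounting would look something like this standalone
sketch (the field and function names are placeholders for illustration, not
the actual patch):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative subset of per-VACUUM counters. */
typedef struct
{
    int64_t new_dead_tuples;     /* RECENTLY_DEAD tuples that remain */
    int64_t missed_dead_tuples;  /* DEAD tuples left: no cleanup lock */
    int64_t missed_dead_pages;   /* pages with at least one such tuple */
} vacuum_counters;

/* Called once per page processed without a cleanup lock. */
static void
account_noprune_page(vacuum_counters *c, int64_t ndead_with_storage)
{
    if (ndead_with_storage > 0)
    {
        c->missed_dead_tuples += ndead_with_storage;
        c->missed_dead_pages++;
    }
}

/* What pgstat would be told about "remaining dead tuples". */
static int64_t
reported_dead_tuples(const vacuum_counters *c)
{
    return c->new_dead_tuples + c->missed_dead_tuples;
}

/* Only emit the extra log line when there is something to report. */
static bool
should_log_missed_dead(const vacuum_counters *c)
{
    return c->missed_dead_tuples > 0;
}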

> I was thinking of truncations, which I don't think vacuum-reltuples.spec
> tests.

Got it. I'll look into that for v2.

> Maybe. But we've had quite a few bugs because we ended up changing some detail
> of what is excluded in one of the counters, leading to wrong determination
> about whether we scanned everything or not.

Right. But let me just point out that my whole approach is to make
that impossible, by not needing to count pages, except in
scanned_pages (and in frozenskipped_pages + rel_pages). The processing
performed for any page that we actually read during VACUUM should be
uniform (or practically uniform), by definition. With minimal fudging
in the cleanup lock case (because we mostly do the same work there
too).

There should be no reason for any more page counters now, except for
non-critical instrumentation. For example, if you want to get the
total number of pages skipped via the visibility map (not just
all-frozen pages), then you simply subtract scanned_pages from
rel_pages.

> > Fundamentally, this will only work if we decide to only skip all-frozen
> > pages, which (by definition) only happens within aggressive VACUUMs.
>
> Hm? Or if there's just no runs of all-visible pages of sufficient length, so
> we don't end up skipping at all.

Of course. But my point was: who knows when that'll happen?

> One reason for my doubt is the following:
>
> We can set all-visible on a page without a FPW image (well, as long as hint
> bits aren't logged). There's a significant difference between needing to WAL
> log FPIs for every heap page or not, and it's not that rare for data to live
> shorter than autovacuum_freeze_max_age or that limit never being reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea. The only essential part is
"opportunistic freezing during vacuum, when the cost is clearly very
low, and the benefit is probably high". And so it now seems you were
making a far more limited statement than I first believed.

Obviously many variations are possible -- there is a spectrum.
Example: a heuristic that makes VACUUM notice when it is going to
freeze at least one tuple on a page, iff the page will be marked
all-visible in any case -- we should instead freeze every tuple on the
page, and mark the page all-frozen, batching work (could account for
LP_DEAD items here too, not counting them on the assumption that
they'll become LP_UNUSED during the second heap pass later on).

If we see these conditions, then the likely explanation is that the
tuples on the heap page happen to have XIDs that are "split" by the
not-actually-important FreezeLimit cutoff, despite being essentially
similar in every way that matters.

If you want to make the same heuristic more conservative: only do this
when no existing tuples are frozen, since that could be taken as a
sign of the original heuristic not quite working on the same heap page
at an earlier stage.

I suspect that even very conservative versions of the same basic idea
would still help a lot.
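
Written out as a sketch, the decision could be as simple as this standalone
function (the predicate arguments are hypothetical values that would be
computed while scanning the page):

#include <stdbool.h>

/*
 * Illustrative heuristic only: if the page is about to be marked
 * all-visible anyway, and at least one tuple already has to be frozen,
 * then freeze every tuple and mark the page all-frozen instead of
 * leaving it all-visible-but-not-all-frozen.
 */
static bool
freeze_whole_page(bool will_mark_all_visible,
                  bool some_tuple_requires_freezing,
                  bool page_already_has_frozen_tuples)
{
    if (!will_mark_all_visible || !some_tuple_requires_freezing)
        return false;

    /*
     * Optional conservative variant from above: back off when the page
     * already contains frozen tuples.
     */
    if (page_already_has_frozen_tuples)
        return false;

    return true;
}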

> Perhaps we can have most of the benefit even without that.  If we were to
> freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip
> all-visible but not all-frozen pages if they were less than x% of the
> to-be-scanned data, we should be able to still increase relfrozenxid in a
> lot of cases?

I bet that's true. I like that idea.

If we had this policy, then the number of "extra"
visited-in-non-aggressive-vacuum pages (all-visible but not yet
all-frozen pages) could be managed over time through more
opportunistic freezing. This might make it work even better.

These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever? There won't be
so many of the pages when we do this anyway, by definition -- since
we'd have a heuristic that limited the total number (say to no more
than 10% of the total relation size, something like that).
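
As a sketch of that kind of policy -- the 10% cap and the function below are
purely illustrative, not anything in the patch:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical rule: a non-aggressive VACUUM also visits the "tenured"
 * all-visible-but-not-all-frozen pages, provided there aren't too many
 * of them relative to the table, so relfrozenxid can still be advanced.
 */
static bool
visit_tenured_pages(uint64_t rel_pages, uint64_t av_not_frozen_pages)
{
    const double max_extra_fraction = 0.10;     /* "no more than 10%" */

    if (rel_pages == 0)
        return true;
    return (double) av_not_frozen_pages / (double) rel_pages
           <= max_extra_fraction;
}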

We're smoothing out the work that currently takes place all together
during an aggressive VACUUM this way.

Moreover, there is perhaps a good chance that the total number of
all-visible-but-not-all-frozen heap pages will *stay* low over time, as a
result of this policy actually working -- there may be a virtuous
cycle that totally prevents us from getting an aggressive VACUUM even
once.

> > I have occasionally wondered if the whole idea of reading heap pages
> > with only a pin (and having cleanup locks in VACUUM) is really worth
> > it -- alternative designs seem possible. Obviously that's a BIG
> > discussion, and not one to have right now. But it seems kind of
> > relevant.
>
> With 'reading' do you mean reads-from-os, or just references to buffer
> contents?

The latter.

[1] https://postgr.es/m/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com
-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote:
> > One reason for my doubt is the following:
> >
> > We can set all-visible on a page without a FPW image (well, as long as hint
> > bits aren't logged). There's a significant difference between needing to WAL
> > log FPIs for every heap page or not, and it's not that rare for data to live
> > shorter than autovacuum_freeze_max_age or that limit never being reached.
> 
> This sounds like an objection to one specific heuristic, and not an
> objection to the general idea.

I understood you to propose that we do not have separate frozen and
all-visible states. Which I think will be problematic, because of scenarios
like the above.


> The only essential part is "opportunistic freezing during vacuum, when the
> cost is clearly very low, and the benefit is probably high". And so it now
> seems you were making a far more limited statement than I first believed.

I'm on board with freezing when we already dirty the page, and when doing
so doesn't cause an additional FPI. And I don't think I've argued against that
in the past.


> These all-visible (but not all-frozen) heap pages could be considered
> "tenured", since they have survived at least one full VACUUM cycle
> without being unset. So why not also freeze them based on the
> assumption that they'll probably stay that way forever?

Because it's a potentially massive increase in write volume? E.g. if you have
an insert-only workload, and you discard old data by dropping old partitions,
this will often add yet another rewrite, despite your data likely never
getting old enough to need to be frozen.

Given that we often immediately need to start another vacuum just when one
finished, because the vacuum took long enough to reach thresholds of vacuuming
again, I don't think the (auto-)vacuum count is a good proxy.

Maybe you meant this as a more limited concept, i.e. only doing so when the
percentage of all-visible but not all-frozen pages is small?


We could perhaps do better if we had information about the system-wide
rate of xid throughput and how often / how long past vacuums of a table took.


Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Tue, Nov 23, 2021 at 5:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Behaviours that lead to a "sudden" falling over, rather than getting gradually
> > worse, are bad - they somehow tend to happen on Friday evenings :).
>
> These are among our most important challenges IMV.

I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.

Even still, here is a rebased v2, just to fix the bitrot. This is just
a courtesy to anybody interested in the patch.

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Tue, Nov 30, 2021 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I haven't had time to work through any of your feedback just yet --
> though it's certainly a priority for me. I won't get to it until I return
> home from PGConf NYC next week.

Attached is v3, which works through most of your (Andres') feedback.

Changes in v3:

* While the first patch still gets rid of the "pinskipped_pages"
instrumentation, the second patch adds back a replacement that's
better targeted: it tracks and reports "missed_dead_tuples". This
means that log output will show the number of fully DEAD tuples with
storage that could not be pruned away, because doing so would
have required waiting for a cleanup lock. But we *don't* generally
report the number of pages that we couldn't get a cleanup lock on,
because that in itself doesn't mean that we skipped any useful work
(which is very much the point of all of the refactoring in the first
patch).

* We now have FSM processing in the lazy_scan_noprune case, which more
or less matches the standard lazy_scan_prune case.

* Many small tweaks, based on suggestions from Andres, and other
things that I noticed.

* Further simplification of the "consider skipping pages using
visibility map" logic -- now we always don't skip the last block in
the relation, without calling should_attempt_truncation() to make sure
we have a reason.

Note that this means that we'll always read the final page during
VACUUM, even when doing so is provably unhelpful. I'd prefer to keep
the code that deals with skipping pages using the visibility map as
simple as possible. There isn't much downside to always doing that
once my refactoring is in place: there is no risk that we'll wait for
a cleanup lock (on the final page in the rel) for no good reason.
We're only wasting one page access, at most.

(I'm not 100% sure that this is the right trade-off, actually, but
it's at least worth considering.)

Not included in v3:

* Still haven't added the isolation test for rel truncation, though
it's on my TODO list.

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

I'm not ready to post that today, but it shouldn't take too much
longer to be good enough to review.

Thanks
-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> * I'm still working on the optimization that we discussed on this
> thread: the optimization that allows the final relfrozenxid (that we
> set in pg_class) to be determined dynamically, based on the actual
> XIDs we observed in the table (we don't just naively use FreezeLimit).

Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

There is a kind of virtuous circle here, where successive
non-aggressive autovacuums never fall behind on freezing, and so never
fail to advance relfrozenxid (there are never any
all_visible-but-not-all_frozen pages, and we can cope with not
acquiring a cleanup lock quite well). When VACUUM chooses to freeze a
tuple opportunistically, the frozen XIDs naturally cannot hold back
the final safe relfrozenxid for the relation. Opportunistic freezing
avoids setting all_visible (without setting all_frozen) in the
visibility map. It's impossible for VACUUM to just set a page to
all_visible now, which seems like an essential part of making a decent
amount of relfrozenxid advancement take place in almost every VACUUM
operation.

Here is an example of what I'm calling a virtuous circle -- all
pgbench_history autovacuums look like this with the patch applied:

LOG:  automatic vacuum of table "regression.public.pgbench_history": index scans: 0
    pages: 0 removed, 35503 remain, 31930 skipped using visibility map (89.94% of total)
    tuples: 0 removed, 5568687 remain (547976 newly frozen), 0 are dead but not yet removable
    removal cutoff: oldest xmin was 5570281, which is now 1177 xact IDs behind
    relfrozenxid: advanced by 546618 xact IDs, new value: 5565226
    index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
    I/O timings: read: 0.003 ms, write: 0.000 ms
    avg read rate: 0.068 MB/s, avg write rate: 0.068 MB/s
    buffer usage: 7169 hits, 1 misses, 1 dirtied
    WAL usage: 7043 records, 1 full page images, 6974928 bytes
    system usage: CPU: user: 0.10 s, system: 0.00 s, elapsed: 0.11 s

Note that relfrozenxid is almost the same as oldest xmin here. Note also
that the log output shows the number of tuples newly frozen. I see the
same general trends with *every* pgbench_history autovacuum. Actually,
with every autovacuum. The history table tends to have ultra-recent
relfrozenxid values, which isn't always what we see, but that
difference may not matter. As far as I can tell, we can expect
practically every table to have a relfrozenxid that would (at least
traditionally) be considered very safe/recent. Barring weird
application issues that make it totally impossible to advance
relfrozenxid (e.g., idle cursors that hold onto a buffer pin forever),
it seems as if relfrozenxid will now steadily march forward. Sure,
relfrozenxid advancement might be held back by the occasional inability to
acquire a cleanup lock, but the effect isn't noticeable over time;
what are the chances that a cleanup lock won't be available on the
same page (with the same old XID) more than once or twice? The odds of
that happening become astronomically tiny, long before there is any
real danger (barring pathological cases). For example, if the chance of
failing to get a cleanup lock on a given page during any one VACUUM
were as high as 1%, and the failures were independent, then failing on
that same page in five consecutive VACUUMs would happen about once in
ten billion attempts.

In the past, we've always talked about opportunistic freezing as a way
of avoiding re-dirtying heap pages during successive VACUUM operations
-- especially as a way of lowering the total volume of WAL. While I
agree that that's important, I have deliberately ignored it for now,
preferring to focus on the relfrozenxid stuff, and smoothing out the
cost of freezing (avoiding big shocks from aggressive/anti-wraparound
autovacuums). I care more about stable performance than absolute
throughput, but even still I believe that the approach I've taken to
opportunistic freezing is probably too aggressive. But it's dead
simple, which will make it easier to understand and discuss the issue
of central importance. It may be possible to optimize the WAL-logging
used during freezing, getting the cost down to the point where
freezing early just isn't a concern. The current prototype adds extra
WAL overhead, to be sure, but even that's not wildly unreasonable (you
make some of it back on FPIs, depending on the workload -- especially
with tables like pgbench_history, where delaying freezing is a total loss).


--
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Masahiko Sawada
Date:
On Thu, Dec 16, 2021 at 5:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > * I'm still working on the optimization that we discussed on this
> > thread: the optimization that allows the final relfrozenxid (that we
> > set in pg_class) to be determined dynamically, based on the actual
> > XIDs we observed in the table (we don't just naively use FreezeLimit).
>
> Attached is v4 of the patch series, which now includes this
> optimization, broken out into its own patch. In addition, it includes
> a prototype of opportunistic freezing.
>
> My emphasis here has been on making non-aggressive VACUUMs *always*
> advance relfrozenxid, outside of certain obvious edge cases. And so
> with all the patches applied, up to and including the opportunistic
> freezing patch, every autovacuum of every table manages to advance
> relfrozenxid during benchmarking -- usually to a fairly recent value.
> I've focussed on making aggressive VACUUMs (especially anti-wraparound
> autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> user keeps canceling autovacuums, maybe due to automated script that
> performs DDL). That has taken priority over other goals, for now.

Great!

I've looked at the 0001 patch and here are some comments:

@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
                                                   xidFullScanLimit);
        aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
                                                  mxactFullScanLimit);
+       skipwithvm = true;
        if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+       {
+               /*
+                * Force aggressive mode, and disable skipping blocks using the
+                * visibility map (even those set all-frozen)
+                */
                aggressive = true;
+               skipwithvm = false;
+       }

        vacrel = (LVRelState *) palloc0(sizeof(LVRelState));

@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        vacrel->rel = rel;
        vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
                                         &vacrel->indrels);
+       vacrel->aggressive = aggressive;
        vacrel->failsafe_active = false;
        vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

---
                        /*
-                        * The current block is potentially skippable; if we've seen a
-                        * long enough run of skippable blocks to justify skipping it, and
-                        * we're not forced to check it, then go ahead and skip.
-                        * Otherwise, the page must be at least all-visible if not
-                        * all-frozen, so we can set all_visible_according_to_vm = true.
+                        * The current page can be skipped if we've seen a long enough run
+                        * of skippable blocks to justify skipping it -- provided it's not
+                        * the last page in the relation (according to rel_pages/nblocks).
+                        *
+                        * We always scan the table's last page to determine whether it
+                        * has tuples or not, even if it would otherwise be skipped
+                        * (unless we're skipping every single page in the relation). This
+                        * avoids having lazy_truncate_heap() take access-exclusive lock
+                        * on the table to attempt a truncation that just fails
+                        * immediately because there are tuples on the last page.
                         */
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > My emphasis here has been on making non-aggressive VACUUMs *always*
> > advance relfrozenxid, outside of certain obvious edge cases. And so
> > with all the patches applied, up to and including the opportunistic
> > freezing patch, every autovacuum of every table manages to advance
> > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > user keeps canceling autovacuums, maybe due to automated script that
> > performs DDL). That has taken priority over other goals, for now.
>
> Great!

Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)

> +       vacrel->aggressive = aggressive;
>         vacrel->failsafe_active = false;
>         vacrel->consider_bypass_optimization = true;
>
> How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

>                          */
> -                       if (skipping_blocks && !FORCE_CHECK_PAGE())
> +                       if (skipping_blocks && blkno < nblocks - 1)
>
> Why do we always need to scan the last page even if heap truncation is
> disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.

This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.

However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.
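
For reference, the rule amounts to roughly the following -- a simplified
model rather than the real loop in vacuumlazy.c, with the exact boundary
condition elided:

#include <stdbool.h>
#include <stdint.h>

/* Unchanged since commit bf136cf6e3 in 2009. */
#define SKIP_PAGES_THRESHOLD 32

/*
 * Given the length of the current run of skippable blocks (all-visible
 * pages, or only all-frozen pages during an aggressive VACUUM), only
 * skip when the run is long enough -- the stated rationale being to
 * keep OS readahead working across short runs.
 */
static bool
skip_this_run(uint64_t skippable_run_length)
{
    return skippable_run_length >= SKIP_PAGES_THRESHOLD;
}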

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Masahiko Sawada
Date:
On Sat, Dec 18, 2021 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > My emphasis here has been on making non-aggressive VACUUMs *always*
> > > advance relfrozenxid, outside of certain obvious edge cases. And so
> > > with all the patches applied, up to and including the opportunistic
> > > freezing patch, every autovacuum of every table manages to advance
> > > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > > user keeps canceling autovacuums, maybe due to automated script that
> > > performs DDL). That has taken priority over other goals, for now.
> >
> > Great!
>
> Maybe this is a good time to revisit basic questions about VACUUM. I
> wonder if we can get rid of some of the GUCs for VACUUM now.
>
> Can we fully get rid of vacuum_freeze_table_age?

Does it mean that a vacuum is always an aggressive vacuum? If
opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.

> We probably shouldn't be using any units, but using XIDs "feels wrong"
> to me. Even with my patch, it is theoretically possible that we won't
> be able to advance relfrozenxid very much, because we cannot get a
> cleanup lock on one single heap page with one old XID. But even in
> this extreme case, how relevant is the "age" of this old XID, really?
> What really matters is whether or not we can advance relfrozenxid in
> time (with time to spare). And so the wraparound risk of the system is
> not affected all that much by the age of the single oldest XID. The
> risk mostly comes from how much total work we still need to do to
> advance relfrozenxid. If the single old XID is quite old indeed (~1.5
> billion XIDs), but there is only one, then we just have to freeze one
> tuple to be able to safely advance relfrozenxid (maybe advance it by a
> huge amount!). How long can it take to freeze one tuple, with the
> freeze map, etc?

I think that that's true for (mostly) static tables. But regarding
constantly-updated tables, since autovacuum runs based on the number
of garbage tuples (or inserted tuples) and how old the relfrozenxid is,
if an autovacuum could not advance the relfrozenxid because it could
not get a cleanup lock on the page that has the single oldest XID,
it's likely that when autovacuum runs next time it will have to
process other pages too since the page will get dirty enough.

It might be a good idea that we remember pages where we could not get
a cleanup lock somewhere and revisit them after index cleanup. While
revisiting the pages, we don’t prune the page but only freeze tuples.
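
As a rough standalone sketch of that idea (purely hypothetical -- nothing
like this exists in the patch series), the first pass would remember the
blocks, and a freeze-only second visit would happen after index cleanup:

#include <stdint.h>
#include <stdlib.h>

/* Blocks we could not get a cleanup lock on during the first heap pass. */
typedef struct
{
    uint32_t   *blocks;
    int         nblocks;
    int         capacity;
} deferred_blocks;

static void
remember_deferred_block(deferred_blocks *d, uint32_t blkno)
{
    if (d->nblocks == d->capacity)
    {
        int         newcap = (d->capacity == 0) ? 64 : d->capacity * 2;
        uint32_t   *tmp = realloc(d->blocks, newcap * sizeof(uint32_t));

        if (tmp == NULL)
            return;             /* out of memory: just forget this block */
        d->blocks = tmp;
        d->capacity = newcap;
    }
    d->blocks[d->nblocks++] = blkno;
}

/*
 * After index cleanup, retry each remembered block.  The callback stands
 * in for "take the cleanup lock if it is available now, and freeze
 * qualifying tuples" -- no pruning on the second visit.
 */
static void
revisit_deferred_blocks(const deferred_blocks *d,
                        void (*freeze_only)(uint32_t blkno))
{
    for (int i = 0; i < d->nblocks; i++)
        freeze_only(d->blocks[i]);
}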

>
> On the other hand, the risk may be far greater if we have *many*
> tuples that are still unfrozen, whose XIDs are only "middle aged"
> right now. The idea behind vacuum_freeze_min_age seems to be to be
> lazy about work (tuple freezing) in the hope that we'll never need to
> do it, but that seems obsolete now. (It probably made a little more
> sense before the visibility map.)

Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.

>
> Using XIDs makes sense for things like autovacuum_freeze_max_age,
> because there we have to worry about wraparound and relfrozenxid
> (whether or not we like it). But with this patch, and with everything
> else (the failsafe, insert-driven autovacuums, everything we've done
> over the last several years) I think that it might be time to increase
> the autovacuum_freeze_max_age default. Maybe even to something as high
> as 800 million transaction IDs, but certainly to 400 million. What do
> you think? (Maybe don't answer just yet, something to think about.)

I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are being heavily updated. Given this
work, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would give the tables a better chance to
advance their relfrozenxid via an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.

>
> > +       vacrel->aggressive = aggressive;
> >         vacrel->failsafe_active = false;
> >         vacrel->consider_bypass_optimization = true;
> >
> > How about adding skipwithvm to LVRelState too?
>
> Agreed -- it's slightly better that way. Will change this.
>
> >                          */
> > -                       if (skipping_blocks && !FORCE_CHECK_PAGE())
> > +                       if (skipping_blocks && blkno < nblocks - 1)
> >
> > Why do we always need to scan the last page even if heap truncation is
> > disabled (or in the failsafe mode)?
>
> My goal here was to keep the behavior from commit e8429082, "Avoid
> useless truncation attempts during VACUUM", while simplifying things
> around skipping heap pages via the visibility map (including removing
> the FORCE_CHECK_PAGE() macro). Of course you're right that this
> particular change that you have highlighted does change the behavior a
> little -- now we will always treat the final page as a "scanned page",
> except perhaps when 100% of all pages in the relation are skipped
> using the visibility map.
>
> This was a deliberate choice (and perhaps even a good choice!). I
> think that avoiding accessing the last heap page like this isn't worth
> the complexity. Note that we may already access heap pages (making
> them "scanned pages") despite the fact that we know it's unnecessary:
> the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
> even try to avoid wasting CPU cycles on these
> not-skipped-but-skippable pages). So I think that the performance cost
> for the last page isn't going to be noticeable.

Agreed.
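
For reference, here is a toy standalone sketch (not vacuumlazy.c
itself; the visibility-map lookup and block counts are made up) of the
two skipping rules being discussed -- only skip runs of at least
SKIP_PAGES_THRESHOLD blocks, and always treat the final block as a
scanned page:

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32

typedef unsigned int BlockNumber;

/* toy visibility map: pretend every block from 10 onwards is all-visible */
static bool
vm_skippable(BlockNumber blkno)
{
    return blkno >= 10;
}

static void
scan_rel(BlockNumber nblocks)
{
    BlockNumber scanned = 0;

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* measure the run of skippable blocks starting here */
        BlockNumber run = 0;

        while (blkno + run < nblocks && vm_skippable(blkno + run))
            run++;

        /* with the patch, the final block is always treated as scanned */
        if (run > 0 && blkno + run == nblocks)
            run--;

        /* only skip runs of at least SKIP_PAGES_THRESHOLD blocks */
        if (run >= SKIP_PAGES_THRESHOLD)
        {
            blkno += run - 1;
            continue;
        }

        scanned++;              /* prune/freeze work would happen here */
    }

    printf("scanned %u of %u blocks\n", scanned, nblocks);
}

int
main(void)
{
    scan_rel(120);              /* prints: scanned 11 of 120 blocks */
    return 0;
}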

>
> However, now that I think about it, I wonder...what do you think of
> SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
> SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
> after the original visibility map implementation was committed in
> 2009. The idea that it helps us to advance relfrozenxid outside of
> aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
> might no longer matter with the patch -- because now we won't ever set
> a page all-visible but not all-frozen. Plus the idea that we need to
> do all this work just to get readahead from the OS
> seems...questionable.

Given the opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not 0 cost.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Mon, Dec 20, 2021 at 8:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Can we fully get rid of vacuum_freeze_table_age?
>
> Does it mean that a vacuum always is an aggressive vacuum?

No. Just somewhat more like one. Still no waiting for cleanup locks,
though. Also, autovacuum is still cancelable (that's technically from
anti-wraparound VACUUM, but you know what I mean). And there shouldn't
be a noticeable difference in terms of how many blocks can be skipped
using the VM.

> If opportunistic freezing works well on all tables, we might no longer
> need vacuum_freeze_table_age. But I’m not sure that’s true since the
> cost of freezing tuples is not 0.

That's true, of course, but right now the only goal of opportunistic
freezing is to advance relfrozenxid in every VACUUM. It needs to be
shown to be worth it, of course. But let's assume that it is worth it,
for a moment (perhaps only because we optimize freezing itself in
passing) -- then there is little use for vacuum_freeze_table_age, that
I can see.

> I think that that's true for (mostly) static tables. But for
> constantly-updated tables, autovacuum runs based on the number of
> garbage tuples (or inserted tuples) and on how old relfrozenxid is. So
> if an autovacuum could not advance relfrozenxid because it could not
> get a cleanup lock on the page that has the single oldest XID, it's
> likely that when autovacuum runs next time it will have to process
> other pages too, since enough pages will have been dirtied by then.

I'm not arguing that the age of the single oldest XID is *totally*
irrelevant. Just that it's typically much less important than the
total amount of work we'd have to do (freezing) to be able to advance
relfrozenxid.

In any case, the extreme case where we just cannot get a cleanup lock
on one particular page with an old XID is probably very rare.

> It might be a good idea to remember the pages where we could not get
> a cleanup lock and revisit them after index cleanup. While revisiting
> those pages, we would not prune but only freeze tuples.

Maybe, but I think that it would make more sense to not use
FreezeLimit for that at all. In an aggressive VACUUM (where we might
actually have to wait for a cleanup lock), why should we wait once the
age is over vacuum_freeze_min_age (usually 50 million XIDs)? The
official answer is "because we need to advance relfrozenxid". But why
not accept a much older relfrozenxid that is still sufficiently
young/safe, in order to avoid waiting for a cleanup lock?

In other words, what if our approach of "being diligent about
advancing relfrozenxid" makes the relfrozenxid problem worse, not
better? The problem with "being diligent" is that it is defined by
FreezeLimit (which is more or less the same thing as
vacuum_freeze_min_age), which is supposed to be about which tuples we
will freeze. That's a very different thing to how old relfrozenxid
should be or can be (after an aggressive VACUUM finishes).
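
A simplified standalone sketch (not the patch's code; the XID values
are made up) of the accounting that makes this possible -- the new
relfrozenxid is simply the oldest XID that remains unfrozen anywhere
in the table, so a page we couldn't freeze only holds relfrozenxid
back to its own oldest XID rather than blocking advancement outright:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 "a is older than b", in the style of TransactionIdPrecedes() */
static int
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

int
main(void)
{
    TransactionId oldest_xmin = 500000000;      /* example cutoff */
    TransactionId new_relfrozenxid = oldest_xmin;

    /* oldest XID left behind on each scanned page after (any) freezing */
    TransactionId leftover_xid_per_page[] = {499999990, 300000123, 499999000};

    for (int i = 0; i < 3; i++)
        if (xid_precedes(leftover_xid_per_page[i], new_relfrozenxid))
            new_relfrozenxid = leftover_xid_per_page[i];

    /* older than OldestXmin, but still a perfectly valid relfrozenxid */
    printf("new relfrozenxid = %u\n", new_relfrozenxid);    /* 300000123 */
    return 0;
}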

> > On the other hand, the risk may be far greater if we have *many*
> > tuples that are still unfrozen, whose XIDs are only "middle aged"
> > right now. The idea behind vacuum_freeze_min_age seems to be to be
> > lazy about work (tuple freezing) in the hope that we'll never need to
> > do it, but that seems obsolete now. (It probably made a little more
> > sense before the visibility map.)
>
> Why is it obsolete now? I guess that it's still valid depending on the
> cases, for example, heavily updated tables.

Because after the 9.6 freezemap work we'll often set the all-visible
bit in the VM, but not the all-frozen bit (unless we have the
opportunistic freezing patch applied, which specifically avoids that).
When that happens, affected heap pages will still have
older-than-vacuum_freeze_min_age-XIDs after VACUUM runs, until we get
to an aggressive VACUUM. There could be many VACUUMs before the
aggressive VACUUM.

This "freezing cliff" seems like it might be a big problem, in
general. That's what I'm trying to address here.

Either way, the system doesn't really respect vacuum_freeze_min_age in
the way that it did before 9.6 -- which is what I meant by "obsolete".

> I don’t have an objection to increasing autovacuum_freeze_max_age for
> now. One of my concerns with anti-wraparound vacuums is that too many
> tables (or several large tables) will reach autovacuum_freeze_max_age
> at once, using up autovacuum slots and preventing autovacuums from
> being launched on tables that are being heavily updated.

I think that the patch helps with that, actually -- there tends to be
"natural variation" in the relfrozenxid age of each table, which comes
from per-table workload characteristics.

> Given this work, expanding the gap between vacuum_freeze_table_age
> and autovacuum_freeze_max_age would give tables a better chance of
> advancing their relfrozenxid via an aggressive vacuum rather than an
> anti-wraparound aggressive vacuum. 400 million seems to be a good
> start.

The idea behind getting rid of vacuum_freeze_table_age (not to be
confused with the other idea about getting rid of vacuum_freeze_min_age)
is this: with the patch series, we only tend to get an anti-wraparound
VACUUM in extreme and relatively rare cases. For example, we will get
aggressive anti-wraparound VACUUMs on tables that *never* grow, but
constantly get HOT updates (e.g. the pgbench_accounts table with heap
fill factor reduced to 90). We won't really be able to use the VM when
this happens, either.

With tables like this -- tables that still get aggressive VACUUMs --
maybe the patch doesn't make a huge difference. But that's truly the
extreme case -- that is true only because there is already zero chance
of there being a non-aggressive VACUUM. We'll get aggressive
anti-wraparound VACUUMs every time we reach autovacuum_freeze_max_age,
again and again -- no change, really.

But since it's only these extreme cases that continue to get
aggressive VACUUMs, why do we still need vacuum_freeze_table_age? It
helps right now (without the patch) by "escalating" a regular VACUUM
to an aggressive one. But the cases that we still expect an aggressive
VACUUM (with the patch) are the cases where there is zero chance of
that happening. Almost by definition.

> Given the opportunistic freezing, that's true, but I'm concerned about
> whether opportunistic freezing always works well on all tables, since
> freezing tuples is not 0 cost.

That is the big question for this patch.

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Mon, Dec 20, 2021 at 9:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Given the opportunistic freezing, that's true, but I'm concerned about
> > whether opportunistic freezing always works well on all tables, since
> > freezing tuples is not 0 cost.
>
> That is the big question for this patch.

Attached is a mechanical rebase of the patch series. This new version
just fixes bitrot, caused by Masahiko's recent vacuumlazy.c
refactoring work. In other words, this revision has no significant
changes compared to the v4 that I posted back in late December -- just
want to keep CFTester green.

I still have plenty of work to do here. Especially with the final
patch (the v5-0005-* "freeze early" patch), which is generally more
speculative than the other patches. I'm playing catch-up now, since I
just returned from vacation.

--
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
> vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
> physical blocks, but we use logical units (XIDs).

I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed. So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are. Admittedly, that decision doesn't
need to be made at the start of the vacuum, as we do today. We could
happily skip waiting for a cleanup lock on pages that contain only
newer XIDs, but if there is a page that both contains an old XID and
stays pinned for a long time, we eventually have to sit there and wait
for that pin to be released. And the best way to decide when to switch
to that strategy is really based on the age of that XID, at least as I
see it, because it is the age of that XID reaching 2 billion that is
going to kill us.
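
For anyone following along, a minimal standalone sketch (plain C, not
PostgreSQL source) of why XID age is bounded at all: XIDs are 32-bit
counters compared modulo 2^32, so an unfrozen XID that falls roughly 2
billion transactions behind would start to compare as being in the
future, which is the wraparound catastrophe that forced freezing
prevents:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 comparison, in the style of TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* "age" of xid as seen from next_xid; only meaningful while below ~2^31 */
static uint32_t
xid_age(TransactionId xid, TransactionId next_xid)
{
    return next_xid - xid;
}

int
main(void)
{
    TransactionId old_xid = 1000;

    /* 1.5 billion XIDs later, the old XID still compares as "in the past" */
    printf("%d\n", xid_precedes(old_xid, old_xid + 1500000000U));   /* 1 */

    /* past ~2.1 billion it would appear to be "in the future": wraparound */
    printf("%d\n", xid_precedes(old_xid, old_xid + 2200000000U));   /* 0 */

    printf("age = %u\n", xid_age(old_xid, old_xid + 1500000000U));
    return 0;
}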

I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not. If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.

> The
> risk mostly comes from how much total work we still need to do to
> advance relfrozenxid. If the single old XID is quite old indeed (~1.5
> billion XIDs), but there is only one, then we just have to freeze one
> tuple to be able to safely advance relfrozenxid (maybe advance it by a
> huge amount!). How long can it take to freeze one tuple, with the
> freeze map, etc?

I don't really see any reason for optimism here. There could be a lot
of unfrozen pages in the relation, and we'd have to troll through all
of those in order to find that single old XID. Moreover, there is
nothing whatsoever to focus autovacuum's attention on that single old
XID rather than anything else. Nothing in the autovacuum algorithm
will cause it to focus its efforts on that single old XID at a time
when there's no pin on the page, or at a time when that XID becomes
the thing that's holding back vacuuming throughout the cluster. A lot
of vacuum problems that users experience today would be avoided if
autovacuum had perfect knowledge of what it ought to be prioritizing
at any given time, or even some knowledge. But it doesn't, and is
often busy fiddling while Rome burns.

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 6, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
> > vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
> > physical blocks, but we use logical units (XIDs).
>
> I don't see how we can get rid of these. We know that catastrophe will
> ensue if we fail to freeze old XIDs for a sufficiently long time ---
> where sufficiently long has to do with the number of XIDs that have
> been subsequently consumed.

I don't really disagree with anything you've said, I think. There are
a few subtleties here. I'll try to tease them apart.

I agree that we cannot do without something like vacrel->FreezeLimit
for the foreseeable future -- but the closely related GUC
(vacuum_freeze_min_age) is another matter. Although everything you've
said in favor of the GUC seems true, the GUC is not a particularly
effective (or natural) way of constraining the problem. It just
doesn't make sense as a tunable.

One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.
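
A minimal sketch of what that might look like, assuming FreezeLimit is
simply derived as OldestXmin - autovacuum_freeze_max_age/2 (the
function name and numbers below are illustrative, not the patch's
actual code):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

/* hypothetical: FreezeLimit derived from autovacuum_freeze_max_age alone */
static TransactionId
backstop_freeze_limit(TransactionId oldest_xmin, uint32_t freeze_max_age)
{
    TransactionId limit = oldest_xmin - (freeze_max_age / 2);

    /* if the modular subtraction landed on a special XID (0..2), bump it */
    if (limit < FirstNormalTransactionId)
        limit = FirstNormalTransactionId;

    return limit;
}

int
main(void)
{
    /* e.g. OldestXmin = 500 million, autovacuum_freeze_max_age = 200 million */
    printf("FreezeLimit = %u\n", backstop_freeze_limit(500000000U, 200000000U));
    return 0;
}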

My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.

Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.

> So it's natural to decide whether or not
> we're going to wait for cleanup locks on pages on the basis of how old
> the XIDs they contain actually are.

I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.

What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).

> I think vacuum_freeze_min_age also serves a useful purpose: it
> prevents us from freezing data that's going to be modified again or
> even deleted in the near future. Since we can't know the future, we
> must base our decision on the assumption that the future will be like
> the past: if the page hasn't been modified for a while, then we should
> assume it's not likely to be modified again soon; otherwise not.

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

> If we
> knew the time at which the page had last been modified, it would be
> very reasonable to use that here - say, freeze the XIDs if the page
> hasn't been touched in an hour, or whatever. But since we lack such
> timestamps the XID age is the closest proxy we have.

XID age is a *terrible* proxy. The age of an XID in a tuple header may
advance quickly, even when nobody modifies the same table at all.

I concede that it is true that we are (in some sense) "gambling" by
freezing early -- we may end up freezing a tuple that we subsequently
update anyway. But aren't we also "gambling" by *not* freezing early?
By not freezing, we risk getting into "freezing debt" that will have
to be paid off in one ruinously large installment. I would much rather
"gamble" on something where we can tolerate consistently "losing" than
gamble on something where I cannot ever afford to lose (even if it's
much less likely that I'll lose during any given VACUUM operation).

Besides all this, I think that we have a rather decent chance of
coming out ahead in practice by freezing early. In practice the
marginal cost of freezing early is consistently pretty low.
Cost-control-driven (as opposed to need-driven) freezing is *supposed*
to be cheaper, of course. And like it or not, freezing is really just part of
the cost of storing data using Postgres (for the time being, at least).

> > The
> > risk mostly comes from how much total work we still need to do to
> > advance relfrozenxid. If the single old XID is quite old indeed (~1.5
> > billion XIDs), but there is only one, then we just have to freeze one
> > tuple to be able to safely advance relfrozenxid (maybe advance it by a
> > huge amount!). How long can it take to freeze one tuple, with the
> > freeze map, etc?
>
> I don't really see any reason for optimism here.

> IOW, the time that it takes to freeze that one tuple *in theory* might
> be small. But in practice it may be very large, because we won't
> necessarily get around to it on any meaningful time frame.

On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 6, 2022 at 2:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> But the "freeze early" heuristics work a bit like that anyway. We
> won't freeze all the tuples on a whole heap page early if we won't
> otherwise set the heap page to all-visible (not all-frozen) in the VM
> anyway.

I believe that applications tend to update rows according to
predictable patterns. Andy Pavlo made an observation about this at one
point:

https://youtu.be/AD1HW9mLlrg?t=3202

I think that we don't do a good enough job of keeping logically
related tuples (tuples inserted around the same time) together, on the
same original heap page, which motivated a lot of my experiments with
the FSM from last year. Even still, it seems like a good idea for us
to err in the direction of assuming that tuples on the same heap page
are logically related. The tuples should all be frozen together when
possible. And *not* frozen early when the heap page as a whole can't
be frozen (barring cases with one *much* older XID before
FreezeLimit).

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Thu, Jan 6, 2022 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> One obvious reason for this is that the opportunistic freezing stuff
> is expected to be the thing that usually forces freezing -- not
> vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
> cutoff. As you more or less pointed out yourself, we still need
> FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
> just come from autovacuum_freeze_max_age/2 in all cases (no separate
> GUC), or something along those lines. We don't particularly expect the
> value of FreezeLimit to matter, at least most of the time. It should
> only noticeably affect our behavior during anti-wraparound VACUUMs,
> which become rare with the patch (e.g. my pgbench_accounts example
> upthread). Most individual tables will never get even one
> anti-wraparound VACUUM -- it just doesn't ever come for most tables in
> practice.

This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose. For example,
the optimal amount of memory for a hash table is likely different than
the optimal amount for a sort, which is why we now have
hash_mem_multiplier. When it's not even the same value that's being
used in both places, but the original value in one place and a value
derived from some formula in the other, the chances of things working
out are even less.

I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly. I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required. However, I disagree strongly with the idea
that we can assume that tables will get vacuumed regularly. That can
fail to happen for all sorts of reasons. One of the common ones is a
poor choice of autovacuum configuration. The most common problem in my
experience is a cost limit that is too low to permit the amount of
vacuuming that is actually required, but other kinds of problems like
not enough workers (so tables get starved), too many workers (so the
cost limit is being shared between many processes), autovacuum=off
either globally or on one table (because of ... reasons),
autovacuum_vacuum_insert_threshold = -1 plus not many updates (so
nothing ever triggers the vacuum), autovacuum_naptime=1d (actually seen
in the real world! ... and, no, it didn't work well), or stats
collector problems are all possible. We can *hope* that there are
going to be regular vacuums of the table long before wraparound
becomes a danger, but realistically, we better not assume that in our
choice of algorithms, because the real world is a messy place where
all sorts of crazy things happen.

Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age. When I advise customers on how
to fix vacuum problems, I am usually telling them to increase
autovacuum_vacuum_cost_limit, possibly also with an increase in
autovacuum_workers; or to increase or decrease
autovacuum_freeze_max_age depending on which problem they have; or
occasionally to adjust settings like autovacuum_naptime. It doesn't
often seem to be necessary to change vacuum_freeze_table_age or, for
that matter, vacuum_freeze_min_age. But if we remove them and then
discover scenarios where tuning them would have been useful, we'll
have no options for fixing PostgreSQL systems in the field. Waiting
for the next major release in such a scenario, or even the next minor
release, is not good. We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.

> My big issue with vacuum_freeze_min_age is that it doesn't really work
> with the freeze map work in 9.6, which creates problems that I'm
> trying to address by freezing early and so on. After all, HEAD (and
> all stable branches) can easily set a page to all-visible (but not
> all-frozen) in the VM, meaning that the page's tuples won't be
> considered for freezing until the next aggressive VACUUM. This means
> that vacuum_freeze_min_age is already frequently ignored by the
> implementation -- it's conditioned on other things that are practically
> impossible to predict.
>
> Curious about your thoughts on this existing issue with
> vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
> it creates.

So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario. We've already dirtied the page and written some WAL and
maybe generated an FPW, so doing the rest of the work now rather than
saving it until later seems likely to be a win. I think it's OK to
behave, in this situation, as if vacuum_freeze_min_age=0.

There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.
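
A tiny standalone sketch of the two-case rule described above (the
function name, and the 150 million backstop figure, are illustrative
assumptions, not anything from the patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * When the page is about to be marked all-visible, act as though
 * vacuum_freeze_min_age were 0; otherwise only freeze XIDs older than a
 * (much higher) backstop age.
 */
static bool
should_freeze_tuple(bool page_will_be_all_visible, uint32_t xid_age,
                    uint32_t backstop_age)
{
    if (page_will_be_all_visible)
        return true;                    /* vacuum_freeze_min_age = 0 case */
    return xid_age >= backstop_age;     /* backstop only */
}

int
main(void)
{
    printf("%d\n", should_freeze_tuple(true, 1000, 150000000));        /* 1 */
    printf("%d\n", should_freeze_tuple(false, 1000, 150000000));       /* 0 */
    printf("%d\n", should_freeze_tuple(false, 200000000, 150000000));  /* 1 */
    return 0;
}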

> > So it's natural to decide whether or not
> > we're going to wait for cleanup locks on pages on the basis of how old
> > the XIDs they contain actually are.
>
> I agree, but again, it's only a backstop. With the patch we'd have to
> be rather unlucky to ever need to wait like this.
>
> What are the chances that we keep failing to freeze an old XID from
> one particular page, again and again? My testing indicates that it's a
> negligible concern in practice (barring pathological cases with idle
> cursors, etc).

I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.

And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things. Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad. It's true
that there are people who get surprised by an aggressive autovacuum
kicking off unexpectedly, but it's usually the first one during the
cluster lifetime (which is typically the biggest, since the initial
load tends to be bigger than later ones) and it's usually annoying but
survivable. The places where autovacuum becomes incredibly frustrating
are the pathological cases. When insufficient resources are available
to complete the work in a timely fashion, or difficult trade-offs have
to be made, autovacuum is too dumb to make the right choices. And even
if you call your favorite PostgreSQL support provider and they provide
an expert, once it gets behind, autovacuum isn't very tractable: it
will insist on vacuuming everything, right now, in an order that it
chooses, and it's not going to take any nonsense from some
human being who thinks they might have some useful advice to provide!

> But the "freeze early" heuristics work a bit like that anyway. We
> won't freeze all the tuples on a whole heap page early if we won't
> otherwise set the heap page to all-visible (not all-frozen) in the VM
> anyway.

Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?

> > IOW, the time that it takes to freeze that one tuple *in theory* might
> > be small. But in practice it may be very large, because we won't
> > necessarily get around to it on any meaningful time frame.
>
> On second thought I agree that my specific example of 1.5 billion XIDs
> was a little too optimistic of me. But 50 million XIDs (i.e. the
> vacuum_freeze_min_age default) is too pessimistic. The important point
> is that FreezeLimit could plausibly become nothing more than a
> backstop mechanism, with the design from the patch series -- something
> that typically has no effect on what tuples actually get frozen.

I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Jan 7, 2022 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
> This seems like a weak argument. Sure, you COULD hard-code the limit
> to be autovacuum_freeze_max_age/2 rather than making it a separate
> tunable, but I don't think it's better. I am generally very skeptical
> about the idea of using the same GUC value for multiple purposes,
> because it often turns out that the optimal value for one purpose is
> different than the optimal value for some other purpose.

I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.

Of course, an anti-wraparound VACUUM will do early freezing in the
same way as any other VACUUM will (with the patch series). So even
when the FreezeLimit backstop XID cutoff actually affects the behavior
of a given VACUUM operation, it may well not be the reason why most
individual tuples that we freeze get frozen. That is, most individual
heap pages will probably have tuples frozen for some other reason.
Though it depends on workload characteristics, most individual heap
pages will typically be frozen as a group, even here. This is a
logical consequence of the fact that tuple freezing and advancing
relfrozenxid are now only loosely coupled -- it's about as loose as
the current relfrozenxid invariant will allow.

> I feel generally that a lot of the argument you're making here
> supposes that tables are going to get vacuumed regularly.

> I agree that
> IF tables are being vacuumed on a regular basis, and if as part of
> that we always push relfrozenxid forward as far as we can, we will
> rarely have a situation where aggressive strategies to avoid
> wraparound are required.

It's all relative. We hope that (with the patch) cases that only ever
get anti-wraparound VACUUMs are limited to tables where nothing else
drives VACUUM, for sensible reasons related to workload
characteristics (like the pgbench_accounts example upthread). It's
inevitable that some users will misconfigure the system, though -- no
question about that.

I don't see why users that misconfigure the system in this way should
be any worse off than they would be today. They probably won't do
substantially less freezing (usually somewhat more), and will advance
pg_class.relfrozenxid in exactly the same way as today (usually a bit
better, actually). What have I missed?

Admittedly the design of the "Freeze tuples early to advance
relfrozenxid" patch (i.e. v5-0005-*patch) is still unsettled; I need
to verify that my claims about it are really robust. But as far as I
know they are. Reviewers should certainly look at that with a critical
eye.

> Now, I agree with you in part: I don't think it's obvious that it's
> useful to tune vacuum_freeze_table_age.

That's definitely the easier argument to make. After all,
vacuum_freeze_table_age will do nothing unless VACUUM runs before the
anti-wraparound threshold (autovacuum_freeze_max_age) is reached. The
patch series should be strictly better than that. Primarily because
it's "continuous", and so isn't limited to cases where the table age
falls within the "vacuum_freeze_table_age - autovacuum_freeze_max_age"
goldilocks age range.

> We should be VERY conservative about removing
> existing settings if there's any chance that somebody could use them
> to tune their way out of trouble.

I agree, I suppose, but right now I honestly can't think of a reason
why they would be useful.

If I am wrong about this then I'm probably also wrong about some basic
facet of the high-level design, in which case I should change course
altogether. In other words, removing the GUCs is not an incidental
thing. It's possible that I would never have pursued this project if I
didn't first notice how wrong-headed the GUCs are.

> So, let's see: if we see a page where the tuples are all-visible and
> we seize the opportunity to freeze it, we can spare ourselves the need
> to ever visit that page again (unless it gets modified). But if we
> only mark it all-visible and leave the freezing for later, the next
> aggressive vacuum will have to scan and dirty the page. I'm prepared
> to believe that it's worth the cost of freezing the page in that
> scenario.

That's certainly the most compelling reason to perform early freezing.
It's not completely free of downsides, but it's pretty close.

> There's another situation in which vacuum_freeze_min_age could apply,
> though: suppose the page isn't all-visible yet. I'd argue that in that
> case we don't want to run around freezing stuff unless it's quite old
> - like older than vacuum_freeze_table_age, say. Because we know we're
> going to have to revisit this page in the next vacuum anyway, and
> expending effort to freeze tuples that may be about to be modified
> again doesn't seem prudent. So, hmm, on further reflection, maybe it's
> OK to remove vacuum_freeze_min_age. But if we do, then I think we had
> better carefully distinguish between the case where the page can
> thereby be marked all-frozen and the case where it cannot. I guess you
> say the same, further down.

I do. Although v5-0005-*patch still freezes early when the page is
dirtied by pruning, I have my doubts about that particular "freeze
early" criteria. I believe that everything I just said about
misconfigured autovacuums doesn't rely on anything more than the "most
compelling scenario for early freezing" mechanism that arranges to
make us set the all-frozen bit (not just the all-visible bit).

> I mean, those kinds of pathological cases happen *all the time*. Sure,
> there are plenty of users who don't leave cursors open. But the ones
> who do don't leave them around for short periods of time on randomly
> selected pages of the table. They are disproportionately likely to
> leave them on the same table pages over and over, just like data can't
> in general be assumed to be uniformly accessed. And not uncommonly,
> they leave them around until the snow melts.

> And we need to worry about those kinds of users, actually much more
> than we need to worry about users doing normal things.

I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?

Even in the extreme case where we are guaranteed to eventually have a
wraparound failure in the end (due to an idle cursor in an
unsupervised database), the user is still much better off, I think. We
will have at least managed to advance relfrozenxid to the exact oldest
XID on the one heap page that somebody holds an idle cursor
(conflicting buffer pin) on. And we'll usually have frozen most of the
tuples that need to be frozen. Sure, the user may need to use
single-user mode to run a manual VACUUM, but at least this process
only needs to freeze approximately one tuple to get the system back
online again.

If the DBA notices the problem before the database starts to refuse to
allocate XIDs, then they'll have a much better chance of avoiding a
wraparound failure through simple intervention (like killing the
backend with the idle cursor). We can pay down 99.9% of the "freeze
debt" independently of this intractable problem of something holding
onto an idle cursor.

> Honestly,
> autovacuum on a system where things are mostly "normal" - no
> long-running transactions, adequate resources for autovacuum to do its
> job, reasonable configuration settings - isn't that bad.

Right. Autovacuum is "too big to fail".

> > But the "freeze early" heuristics work a bit like that anyway. We
> > won't freeze all the tuples on a whole heap page early if we won't
> > otherwise set the heap page to all-visible (not all-frozen) in the VM
> > anyway.
>
> Hmm, I didn't realize that we had that. Is that an existing thing or
> something new you're proposing to do? If existing, where is it?

It's part of v5-0005-*patch. Still in flux to some degree, because
it's necessary to balance a few things. That shouldn't undermine the
arguments I've made here.

> I agree that it's OK for this to become a purely backstop mechanism
> ... but again, I think that the design of such backstop mechanisms
> should be done as carefully as we know how, because users seem to hit
> the backstop all the time. We want it to be made of, you know, nylon
> twine, rather than, say, sharp nails. :-)

Absolutely. But if autovacuum can only ever run due to
age(relfrozenxid) reaching autovacuum_freeze_max_age, then I can't see
a downside.

Again, the v5-0005-*patch needs to meet the standard that I've laid
out. If it doesn't then I've messed up already.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Fri, Jan 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I thought I was being conservative by suggesting
> autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
> make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
> these two concepts really *are* the same thing: vacrel->FreezeLimit
> becomes a backstop, just as anti-wraparound autovacuum (the
> autovacuum_freeze_max_age cutoff) becomes a backstop.

I can't follow this. If the idea is that we're going to
opportunistically freeze a page whenever that allows us to mark it
all-visible, then the remaining question is what XID age we should use
to force freezing when that rule doesn't apply. It seems to me that
there is a rebuttable presumption that that case ought to work just as
it does today - and I think I hear you saying that it should NOT work
as it does today, but should use some other threshold. Yet I can't
understand why you think that.

> I couldn't agree more. In fact, I was mostly thinking about how to
> *help* these users. Insisting on waiting for a cleanup lock before it
> becomes strictly necessary (when the table age is only 50
> million/vacuum_freeze_min_age) is actually a big part of the problem
> for these users. vacuum_freeze_min_age enforces a false dichotomy on
> aggressive VACUUMs that just isn't helpful. Why should waiting on a
> cleanup lock fix anything?

Because waiting on a lock means that we'll acquire it as soon as it's
available. If you repeatedly call your local Pizzeria Uno's and ask
whether there is a wait, and head to the restaurant only when the
answer is in the negative, you may never get there, because they may
be busy every time you call - especially if you always call around
lunch or dinner time. Even if you eventually get there, it may take
multiple days before you find a time when a table is immediately
available, whereas if you had just gone over there and stood in line,
you likely would have been seated in under an hour and savoring the
goodness of quality deep-dish pizza not too long thereafter. The same
principle applies here.

I do think that waiting for a cleanup lock when the age of the page is
only vacuum_freeze_min_age seems like it might be too aggressive, but
I don't think that's how it works. AFAICS, it's based on whether the
vacuum is marked as aggressive, which has to do with
vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the
question around: if the age of the oldest XID on the page is >150
million transactions and the buffer cleanup lock is not available now,
what makes you think that it's any more likely to be available when
the XID age reaches 200 million or 300 million or 700 million? There
is perhaps an argument for some kind of tunable that eventually shoots
the other session in the head (if we can identify it, anyway) but it
seems to me that regardless of what threshold we pick, polling is
strictly less likely to find a time when the page is available than
waiting for the cleanup lock. It has the counterbalancing advantage of
allowing the autovacuum worker to do other useful work in the meantime
and that is indeed a significant upside, but at some point you're
going to have to give up and admit that polling is a failed strategy,
and it's unclear why 150 million XIDs - or probably even 50 million
XIDs - isn't long enough to say that we're not getting the job done
with half measures.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 13, 2022 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I can't follow this. If the idea is that we're going to
> opportunistically freeze a page whenever that allows us to mark it
> all-visible, then the remaining question is what XID age we should use
> to force freezing when that rule doesn't apply.

That is the idea, yes.

> It seems to me that
> there is a rebuttable presumption that that case ought to work just as
> it does today - and I think I hear you saying that it should NOT work
> as it does today, but should use some other threshold. Yet I can't
> understand why you think that.

Cases where we can not get a cleanup lock fall into 2 sharply distinct
categories in my mind:

1. Cases where our inability to get a cleanup lock signifies nothing
at all about the page in question, or any page in the same table, with
the same workload.

2. Pathological cases. Cases where we're at least at the mercy of the
application to do something about an idle cursor, where the situation
may be entirely hopeless on a long enough timeline. (Whether or not it
actually happens in the end is less significant.)

As far as I can tell, based on testing, category 1 cases are fixed by
the patch series: while a small number of pages from tables in
category 1 cannot be cleanup-locked during each VACUUM, even with the
patch series, it happens at random, with no discernable pattern. The
overall result is that our ability to advance relfrozenxid is really
not impacted *over time*. It's reasonable to suppose that lightning
will not strike in the same place twice -- and it would really have to
strike several times to invalidate this assumption. It's not
impossible, but the chances over time are infinitesimal -- and the
aggregate effect over time (not any one VACUUM operation) is what
matters.

There are seldom more than 5 or so of these pages, even on large
tables. What are the chances that some random not-yet-all-frozen block
(that we cannot freeze tuples on) will also have the oldest
couldn't-be-frozen XID, even once? And when it is the oldest, why
should it be the oldest by very many XIDs? And what are the chances
that the same page has the same problem, again and again, without that
being due to some pathological workload thing?

Admittedly you may see a blip from this -- you might notice that the
final relfrozenxid value for that one single VACUUM isn't quite as new
as you'd like. But then the next VACUUM should catch up with the
stable long term average again. It's hard to describe exactly why this
effect is robust, but as I said, empirically, in practice, it appears
to be robust. That might not be good enough as an explanation that
justifies committing the patch series, but that's what I see. And I
think I will be able to nail it down.

AFAICT that just leaves concern for cases in category 2. More on that below.

> Even if you eventually get there, it may take
> multiple days before you find a time when a table is immediately
> available, whereas if you had just gone over there and stood in line,
> you likely would have been seated in under an hour and savoring the
> goodness of quality deep-dish pizza not too long thereafter. The same
> principle applies here.

I think that you're focussing on individual VACUUM operations, whereas
I'm more concerned about the aggregate effect of a particular policy
over time.

Let's assume for a moment that the only thing that we really care
about is reliably keeping relfrozenxid reasonably recent. Even then,
waiting for a cleanup lock (to freeze some tuples) might be the wrong
thing to do. Waiting in line means that we're not freezing other
tuples (nobody else can either). So we're allowing ourselves to fall
behind on necessary, routine maintenance work that allows us to
advance relfrozenxid....in order to advance relfrozenxid.

> I do think that waiting for a cleanup lock when the age of the page is
> only vacuum_freeze_min_age seems like it might be too aggressive, but
> I don't think that's how it works. AFAICS, it's based on whether the
> vacuum is marked as aggressive, which has to do with
> vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the
> question around: if the age of the oldest XID on the page is >150
> million transactions and the buffer cleanup lock is not available now,
> what makes you think that it's any more likely to be available when
> the XID age reaches 200 million or 300 million or 700 million?

This is my concern -- what I've called category 2 cases have this
exact quality. So given that, why not freeze what you can, elsewhere,
on other pages that don't have the same issue (presumably the vast
vast majority in the table)? That way you have the best possible
chance of recovering once the DBA gets a clue and fixes the issue.

> There
> is perhaps an argument for some kind of tunable that eventually shoots
> the other session in the head (if we can identify it, anyway) but it
> seems to me that regardless of what threshold we pick, polling is
> strictly less likely to find a time when the page is available than
> waiting for the cleanup lock. It has the counterbalancing advantage of
> allowing the autovacuum worker to do other useful work in the meantime
> and that is indeed a significant upside, but at some point you're
> going to have to give up and admit that polling is a failed strategy,
> and it's unclear why 150 million XIDs - or probably even 50 million
> XIDs - isn't long enough to say that we're not getting the job done
> with half measures.

That's kind of what I meant. The difference between 50 million and 150
million is rather unclear indeed. So having accepted that that might
be true, why not be open to the possibility that it won't turn out to
be true in the long run, for any given table? With the enhancements
from the patch series in place (particularly the early freezing
stuff), what do we have to lose by making the FreezeLimit XID cutoff
for freezing much higher than your typical vacuum_freeze_min_age?
Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age
(it can't be higher than that without also making these other settings
become meaningless, of course).

Taking a wait-and-see approach like this (not being too quick to
decide that a table is in category 1 or category 2) doesn't seem to
make wraparound failure any more likely in any particular scenario,
but makes it less likely in other scenarios. It also gives us early
visibility into the problem, because we'll see that autovacuum can no
longer advance relfrozenxid (using the enhanced log output) where
that's generally expected.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 13, 2022 at 1:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Admittedly you may see a blip from this -- you might notice that the
> final relfrozenxid value for that one single VACUUM isn't quite as new
> as you'd like. But then the next VACUUM should catch up with the
> stable long term average again. It's hard to describe exactly why this
> effect is robust, but as I said, empirically, in practice, it appears
> to be robust. That might not be good enough as an explanation that
> justifies committing the patch series, but that's what I see. And I
> think I will be able to nail it down.

Attached is v6, which like v5 is a rebased version that I'm posting to
keep CFTester happy. I pushed a commit that consolidates VACUUM
VERBOSE and autovacuum logging earlier (commit 49c9d9fc), which bitrotted
v5. So no real changes, nothing to note.

Although it technically has nothing to do with this patch series, I
will point out that it's now a lot easier to debug using VACUUM
VERBOSE, which will directly display information about how we've
advanced relfrozenxid, tuples frozen, etc:

pg@regression:5432 =# delete from mytenk2 where hundred < 15;
DELETE 1500
pg@regression:5432 =# vacuum VERBOSE mytenk2;
INFO:  vacuuming "regression.public.mytenk2"
INFO:  finished vacuuming "regression.public.mytenk2": index scans: 1
pages: 0 removed, 345 remain, 0 skipped using visibility map (0.00% of total)
tuples: 1500 removed, 8500 remain (8500 newly frozen), 0 are dead but
not yet removable
removable cutoff: 17411, which is 0 xids behind next
new relfrozenxid: 17411, which is 3 xids ahead of previous value
index scan needed: 341 pages from table (98.84% of total) had 1500
dead item identifiers removed
index "mytenk2_unique1_idx": pages: 39 in total, 0 newly deleted, 0
currently deleted, 0 reusable
index "mytenk2_unique2_idx": pages: 30 in total, 0 newly deleted, 0
currently deleted, 0 reusable
index "mytenk2_hundred_idx": pages: 11 in total, 1 newly deleted, 1
currently deleted, 0 reusable
I/O timings: read: 0.011 ms, write: 0.000 ms
avg read rate: 1.428 MB/s, avg write rate: 2.141 MB/s
buffer usage: 1133 hits, 2 misses, 3 dirtied
WAL usage: 1446 records, 1 full page images, 199702 bytes
system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
VACUUM

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> 1. Cases where our inability to get a cleanup lock signifies nothing
> at all about the page in question, or any page in the same table, with
> the same workload.
>
> 2. Pathological cases. Cases where we're at least at the mercy of the
> application to do something about an idle cursor, where the situation
> may be entirely hopeless on a long enough timeline. (Whether or not it
> actually happens in the end is less significant.)

Sure. I'm worrying about case (2). I agree that in case (1) waiting
for the lock is almost always the wrong idea.

> I think that you're focussing on individual VACUUM operations, whereas
> I'm more concerned about the aggregate effect of a particular policy
> over time.

I don't think so. I think I'm worrying about the aggregate effect of a
particular policy over time *in the pathological cases* i.e. (2).

> This is my concern -- what I've called category 2 cases have this
> exact quality. So given that, why not freeze what you can, elsewhere,
> on other pages that don't have the same issue (presumably the vast
> vast majority in the table)? That way you have the best possible
> chance of recovering once the DBA gets a clue and fixes the issue.

That's the part I'm not sure I believe. Imagine a table with a
gigantic number of pages that are not yet all-visible, a small number
of all-visible pages, and one page containing very old XIDs on which a
cursor holds a pin. I don't think it's obvious that not waiting is
best. Maybe you're going to end up vacuuming the table repeatedly and
doing nothing useful. If you avoid vacuuming it repeatedly, you still
have a lot of work to do once the DBA locates a clue.

I think there's probably an important principle buried in here: the
XID threshold that forces a vacuum had better also force waiting for
pins. If it doesn't, you can tight-loop on that table without getting
anything done.
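
A toy standalone illustration of that principle (not PostgreSQL code;
the ages are made up): if the age that forces a vacuum is lower than
the age at which the vacuum will finally wait for the pin, the forced
vacuums accomplish nothing:

#include <stdbool.h>
#include <stdio.h>

/* pretend this table's relfrozenxid is 200 million XIDs old, and the page
 * holding the oldest XID is pinned by an idle cursor that never goes away */
static const unsigned int table_age = 200000000;

/* a vacuum only freezes the pinned page if it is willing to wait for the pin */
static bool
vacuum_advances_relfrozenxid(unsigned int wait_for_pin_age)
{
    return table_age >= wait_for_pin_age;
}

int
main(void)
{
    const unsigned int trigger_age = 200000000; /* autovacuum_freeze_max_age */

    /* waiting threshold above the trigger: every forced vacuum is futile */
    for (int i = 0; i < 3; i++)
        printf("forced vacuum %d advanced? %d\n",
               i, vacuum_advances_relfrozenxid(trigger_age * 2));

    /* waiting threshold <= trigger: the forced vacuum waits and succeeds */
    printf("with matching threshold: %d\n",
           vacuum_advances_relfrozenxid(trigger_age));
    return 0;
}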

> That's kind of what I meant. The difference between 50 million and 150
> million is rather unclear indeed. So having accepted that that might
> be true, why not be open to the possibility that it won't turn out to
> be true in the long run, for any given table? With the enhancements
> from the patch series in place (particularly the early freezing
> stuff), what do we have to lose by making the FreezeLimit XID cutoff
> for freezing much higher than your typical vacuum_freeze_min_age?
> Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age
> (it can't be higher than that without also making these other settings
> become meaningless, of course).

We should probably distinguish between the situation where (a) an
adverse pin is held continuously and effectively forever and (b)
adverse pins are held frequently but for short periods of time. I
think it's possible to imagine a small, very hot table (or portion of
a table) where very high concurrency means there are often pins. In
case (a), it's not obvious that waiting will ever resolve anything,
although it might prevent other problems like infinite looping. In
case (b), a brief wait will do a lot of good. But maybe that doesn't
even matter. I think part of your argument is that if we fail to
update relfrozenxid for a while, that really isn't that bad.

I think I agree, up to a point. One consequence of failing to
immediately advance relfrozenxid might be that pg_clog and friends are
bigger, but that's pretty minor. Another consequence might be that we
might vacuum the table more times, which is more serious. I'm not
really sure that can happen to a degree that is meaningful, apart from
the infinite loop case already described, but I'm also not entirely
sure that it can't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Mon, Jan 17, 2022 at 7:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > 1. Cases where our inability to get a cleanup lock signifies nothing
> > at all about the page in question, or any page in the same table, with
> > the same workload.
> >
> > 2. Pathological cases. Cases where we're at least at the mercy of the
> > application to do something about an idle cursor, where the situation
> > may be entirely hopeless on a long enough timeline. (Whether or not it
> > actually happens in the end is less significant.)
>
> Sure. I'm worrying about case (2). I agree that in case (1) waiting
> for the lock is almost always the wrong idea.

I don't doubt that we'd each have little difficulty determining
which category (1 or 2) a given real world case should be placed in,
using a variety of methods that put the issue in context (e.g.,
looking at the application code, talking to the developers or the
DBA). Of course, it doesn't follow that it would be easy to teach
vacuumlazy.c how to determine which category a given "can't get a
cleanup lock" case falls under, since (just for starters) there is no
practical way for VACUUM to see all that context.

That's what I'm effectively trying to work around with this "wait and
see approach" that demotes FreezeLimit to a backstop (and so justifies
removing the vacuum_freeze_min_age GUC that directly dictates our
FreezeLimit today). The cure may be worse than the disease, and the cure
isn't actually all that great at the best of times, so we should wait
until the disease visibly gets pretty bad before being
"interventionist" by waiting for a cleanup lock.

I've already said plenty about why I don't like vacuum_freeze_min_age
(or FreezeLimit) due to XIDs being fundamentally the wrong unit. But
that's not the only fundamental problem that I see. The other problem
is this: vacuum_freeze_min_age also dictates when an aggressive VACUUM
will start to wait for a cleanup lock. But why should the first thing
be the same as the second thing? I see absolutely no reason for it.
(Hence the idea of making FreezeLimit a backstop, and getting rid of
the GUC itself.)
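
(For reference, FreezeLimit today is roughly OldestXmin minus
vacuum_freeze_min_age. A quick sketch of how to eyeball the settings
that get tangled together here -- nothing more than that:)

-- FreezeLimit is approximately OldestXmin - vacuum_freeze_min_age today
SELECT name, setting, boot_val
FROM pg_settings
WHERE name IN ('vacuum_freeze_min_age',
               'vacuum_freeze_table_age',
               'autovacuum_freeze_max_age')
ORDER BY name;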

> > This is my concern -- what I've called category 2 cases have this
> > exact quality. So given that, why not freeze what you can, elsewhere,
> > on other pages that don't have the same issue (presumably the vast
> > vast majority in the table)? That way you have the best possible
> > chance of recovering once the DBA gets a clue and fixes the issue.
>
> That's the part I'm not sure I believe.

To be clear, I think that I have yet to adequately demonstrate that
this is true. It's a bit tricky to do so -- absence of evidence isn't
evidence of absence. I think that your principled skepticism makes
sense right now.

Fortunately the early refactoring patches should be uncontroversial.
The controversial parts are all in the last patch in the patch series,
which isn't too much code. (Plus another patch to at least get rid of
vacuum_freeze_min_age, and maybe vacuum_freeze_table_age too, that
hasn't been written just yet.)

> Imagine a table with a
> gigantic number of pages that are not yet all-visible, a small number
> of all-visible pages, and one page containing very old XIDs on which a
> cursor holds a pin. I don't think it's obvious that not waiting is
> best. Maybe you're going to end up vacuuming the table repeatedly and
> doing nothing useful. If you avoid vacuuming it repeatedly, you still
> have a lot of work to do once the DBA locates a clue.

Maybe this is a simpler way of putting it: I want to delay waiting on
a pin until it's pretty clear that we truly have a pathological case,
which should in practice be limited to an anti-wraparound VACUUM,
which will now be naturally rare -- most individual tables will
literally never have even one anti-wraparound VACUUM.

We don't need to reason about the vacuuming schedule this way, since
anti-wraparound VACUUMs are driven by age(relfrozenxid) -- we don't
really have to predict anything. Maybe we'll need to do an
anti-wraparound VACUUM immediately after a non-aggressive autovacuum
runs, without getting a cleanup lock (due to an idle cursor
pathological case). We won't be able to advance relfrozenxid until the
anti-wraparound VACUUM runs (at the earliest) in this scenario, but it
makes no difference. Rather than predicting the future, we're covering
every possible outcome (at least to the extent that that's possible).

> I think there's probably an important principle buried in here: the
> XID threshold that forces a vacuum had better also force waiting for
> pins. If it doesn't, you can tight-loop on that table without getting
> anything done.

I absolutely agree -- that's why I think that we still need
FreezeLimit. Just as a backstop, one that in practice very rarely
influences our behavior. Probably just in those remaining cases that
are never vacuumed except for the occasional anti-wraparound VACUUM
(even then it might not be very important).

> We should probably distinguish between the situation where (a) an
> adverse pin is held continuously and effectively forever and (b)
> adverse pins are held frequently but for short periods of time.

I agree. It's just hard to do that from vacuumlazy.c, during a routine
non-aggressive VACUUM operation.

> I think it's possible to imagine a small, very hot table (or portion of
> a table) where very high concurrency means there are often pins. In
> case (a), it's not obvious that waiting will ever resolve anything,
> although it might prevent other problems like infinite looping. In
> case (b), a brief wait will do a lot of good. But maybe that doesn't
> even matter. I think part of your argument is that if we fail to
> update relfrozenxid for a while, that really isn't that bad.

Yeah, that is a part of it -- it doesn't matter (until it really
matters), and we should be careful to avoid making the situation worse
by waiting for a cleanup lock unnecessarily. That's actually a very
drastic thing to do, at least in a world where freezing has been
decoupled from advancing relfrozenxid.

Updating relfrozenxid should now be thought of as a continuous thing,
not a discrete thing. And so it's highly unlikely that any given
VACUUM will ever *completely* fail to advance relfrozenxid -- that
fact alone signals a pathological case (things that are supposed to be
continuous should not ever appear to be discrete). But you need multiple
VACUUMs to see this "signal". It is only revealed over time.

It seems wise to make the most modest possible assumptions about
what's going on here. We might well "get lucky" before the next VACUUM
comes around when we encounter what at first appears to be a
problematic case involving an idle cursor -- for all kinds of reasons.
Like maybe an opportunistic prune gets rid of the old XID for us,
without any freezing, during some brief window where the application
doesn't have a cursor. We're only talking about one or two heap pages
here.

We might also *not* "get lucky" with the application and its use of
idle cursors, of course. But in that case we must have been doomed all
along. And we'll at least have put things on a much better footing in
this disaster scenario -- there is relatively little freezing left to
do in single user mode, and relfrozenxid should already be the same as
the exact oldest XID in that one page.

> I think I agree, up to a point. One consequence of failing to
> immediately advance relfrozenxid might be that pg_clog and friends are
> bigger, but that's pretty minor.

My arguments are probabilistic (sort of), which makes it tricky.
Actual test cases/benchmarks should bear out the claims that I've
made. If anything fully convinces you, it'll be that, I think.

> Another consequence might be that we
> might vacuum the table more times, which is more serious. I'm not
> really sure that can happen to a degree that is meaningful, apart from
> the infinite loop case already described, but I'm also not entirely
> sure that it can't.

It's definitely true that this overall strategy could result in there
being more individual VACUUM operations. But that naturally
follows from teaching VACUUM to avoid waiting indefinitely.

Obviously the important question is whether we'll do
meaningfully more work for less benefit (in Postgres 15, relative to
Postgres 14). Your concern is very reasonable. I just can't imagine
how we could lose out to any notable degree. Which is a start.

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Updating relfrozenxid should now be thought of as a continuous thing,
> not a discrete thing.

I think that's pretty nearly 100% wrong. The most simplistic way of
expressing that is to say - clearly it can only happen when VACUUM
runs, which is not all the time. That's a bit facile, though; let me
try to say something a little smarter. There are real production
systems that exist today where essentially all vacuums are
anti-wraparound vacuums. And there are also real production systems
that exist today where virtually none of the vacuums are
anti-wraparound vacuums. So if we ship your proposed patches, the
frequency with which relfrozenxid gets updated is going to increase by
a large multiple, perhaps 100x, for the second group of people, who
will then perceive the movement of relfrozenxid to be much closer to
continuous than it is today even though, technically, it's still a
step function. But the people in the first category are not going to
see any difference at all.

And therefore the reasoning that says - anti-wraparound vacuums just
aren't going to happen any more - or - relfrozenxid will advance
continuously seems like dangerous wishful thinking to me. It's only
true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need
not be true in any particular environment, which to me means that all
conclusions based on the idea that it has to be true are pretty
dubious. There's no doubt in my mind that advancing relfrozenxid
opportunistically is a good idea. However, I'm not sure how reasonable
it is to change any other behavior on the basis of the fact that we're
doing it, because we don't know how often it really happens.

If someone says "every time I travel to Europe on business, I will use
the opportunity to bring you back a nice present," you can't evaluate
how much impact that will have on your life without knowing how often
they travel to Europe on business. And that varies radically from
"never" to "a lot" based on the person.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Mon, Jan 17, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Updating relfrozenxid should now be thought of as a continuous thing,
> > not a discrete thing.
>
> I think that's pretty nearly 100% wrong. The most simplistic way of
> expressing that is to say - clearly it can only happen when VACUUM
> runs, which is not all the time.

That just seems like semantics to me. The very next sentence after the
one you quoted in your reply was "And so it's highly unlikely that any
given VACUUM will ever *completely* fail to advance relfrozenxid".
It's continuous *within* each VACUUM. As far as I can tell there is
pretty much no way that the patch series will ever fail to advance
relfrozenxid *by at least a little bit*, barring pathological cases
with cursors and whatnot.

> That's a bit facile, though; let me
> try to say something a little smarter. There are real production
> systems that exist today where essentially all vacuums are
> anti-wraparound vacuums. And there are also real production systems
> that exist today where virtually none of the vacuums are
> anti-wraparound vacuums. So if we ship your proposed patches, the
> frequency with which relfrozenxid gets updated is going to increase by
> a large multiple, perhaps 100x, for the second group of people, who
> will then perceive the movement of relfrozenxid to be much closer to
> continuous than it is today even though, technically, it's still a
> step function. But the people in the first category are not going to
> see any difference at all.

Actually, I think that even the people in the first category might
well have about the same improved experience. Not just because of this
patch series, mind you. It would also have a lot to do with the
autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to
mention the freeze map. What version are these users on?

I have actually seen this for myself. With BenchmarkSQL, the largest
table (the order lines table) starts out having its autovacuums driven
entirely by autovacuum_vacuum_insert_scale_factor, even though there
is a fair amount of bloat from updates. It stays like that for hours
on HEAD. But even with my reasonably tuned setup, there is eventually
a switchover point. Eventually all autovacuums end up as aggressive
anti-wraparound VACUUMs -- this happens once the table gets
sufficiently large (this is one of the two that is append-only, with
one update to every inserted row from the delivery transaction, which
happens hours after the initial insert).

With the patch series, we have a kind of virtuous circle between
freezing and advancing relfrozenxid for the same order lines table. As
far as I can tell, we fix the problem with the patch series. Because
there are about 10 tuples inserted per new order transaction, the
actual "XID consumption rate of the table" is much lower than the
"worst case XID consumption" for such a table.

It's also true that even with the patch we still get anti-wraparound
VACUUMs for two fixed-size, hot-update-only tables: the stock table,
and the customers table. But that's no big deal. It only happens
because nothing else will ever trigger an autovacuum, no matter the
autovacuum_freeze_max_age setting.

> And therefore the reasoning that says - anti-wraparound vacuums just
> aren't going to happen any more - or - relfrozenxid will advance
> continuously seems like dangerous wishful thinking to me.

I never said that anti-wraparound vacuums just won't happen anymore. I
said that they'll be limited to cases like the stock table or
customers table case. I was very clear on that point.

With pgbench, whether or not you ever see any anti-wraparound VACUUMs
will depend on the heap fillfactor for the accounts table -- set it
low enough (maybe to 90) and you will still get them, since there
won't be any other reason to VACUUM. As for the branches table and
the tellers table, they'll get VACUUMs in any case, regardless of heap
fillfactor. And so they'll always advance relfrozenxid during each
VACUUM, and never have even one anti-wraparound VACUUM.
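
(A concrete sketch of the fillfactor setup I mean -- the table name is
the stock pgbench one, and pgbench can also do this at initialization
time with its --fillfactor option:)

-- Lower the accounts table's heap fillfactor so most updates go HOT and
-- stay on the same page; per the reasoning above, the
-- anti-wraparound/autovacuum_freeze_max_age cutoff is then the only
-- remaining trigger for an autovacuum of the table.
ALTER TABLE pgbench_accounts SET (fillfactor = 90);
-- Only affects newly written pages; re-initializing with "pgbench -i -F 90"
-- applies it from the start.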

> It's only
> true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need
> not be true in any particular environment, which to me means that all
> conclusions based on the idea that it has to be true are pretty
> dubious. There's no doubt in my mind that advancing relfrozenxid
> opportunistically is a good idea. However, I'm not sure how reasonable
> it is to change any other behavior on the basis of the fact that we're
> doing it, because we don't know how often it really happens.

It isn't that hard to see that the cases where we continue to get any
anti-wraparound VACUUMs with the patch seem to be limited to cases
like the stock/customers table, or cases like the pathological idle
cursor cases we've been discussing. Pretty narrow cases, overall.
Don't take my word for it - see for yourself.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> That just seems like semantics to me. The very next sentence after the
> one you quoted in your reply was "And so it's highly unlikely that any
> given VACUUM will ever *completely* fail to advance relfrozenxid".
> It's continuous *within* each VACUUM. As far as I can tell there is
> pretty much no way that the patch series will ever fail to advance
> relfrozenxid *by at least a little bit*, barring pathological cases
> with cursors and whatnot.

I mean this boils down to saying that VACUUM will advance relfrozenxid
except when it doesn't.

> Actually, I think that even the people in the first category might
> well have about the same improved experience. Not just because of this
> patch series, mind you. It would also have a lot to do with the
> autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to
> mention the freeze map. What version are these users on?

I think it varies. I expect the increase in the default cost limit to
have had a much more salutary effect than
autovacuum_vacuum_insert_scale_factor, but I don't know for sure. At
any rate, if you make the database big enough and generate dirty data
fast enough, it doesn't matter what the default limits are.

> I never said that anti-wraparound vacuums just won't happen anymore. I
> said that they'll be limited to cases like the stock table or
> customers table case. I was very clear on that point.

I don't know how I'm supposed to sensibly respond to a statement like
this. If you were very clear, then I'm being deliberately obtuse if I
fail to understand. If I say you weren't very clear, then we're just
contradicting each other.

> It isn't that hard to see that the cases where we continue to get any
> anti-wraparound VACUUMs with the patch seem to be limited to cases
> like the stock/customers table, or cases like the pathological idle
> cursor cases we've been discussing. Pretty narrow cases, overall.
> Don't take my word for it - see for yourself.

I don't think that's really possible. Words like "narrow" and
"pathological" are value judgments, not factual statements. If I do an
experiment where no wraparound autovacuums happen, as I'm sure I can,
then those are the normal cases where the patch helps. If I do an
experiment where they do happen, as I'm sure that I also can, you'll
probably say either that the case in question is like the
stock/customers table, or that it's pathological. What will any of
this prove?

I think we're reaching the point of diminishing returns in this
conversation. What I want to know is that users aren't going to be
harmed - even in cases where they have behavior that is like the
stock/customers table, or that you consider pathological, or whatever
other words we want to use to describe the weird things that happen to
people. And I think we've made perhaps a bit of modest progress in
exploring that issue, but certainly less than I'd like. I don't want
to spend the next several days going around in circles about it
though. That does not seem likely to make anyone happy.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Mon, Jan 17, 2022 at 8:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > That just seems like semantics to me. The very next sentence after the
> > one you quoted in your reply was "And so it's highly unlikely that any
> > given VACUUM will ever *completely* fail to advance relfrozenxid".
> > It's continuous *within* each VACUUM. As far as I can tell there is
> > pretty much no way that the patch series will ever fail to advance
> > relfrozenxid *by at least a little bit*, barring pathological cases
> > with cursors and whatnot.
>
> I mean this boils down to saying that VACUUM will advance relfrozenxid
> except when it doesn't.

It actually doesn't boil down, at all. The world is complicated and
messy, whether we like it or not.

> > I never said that anti-wraparound vacuums just won't happen anymore. I
> > said that they'll be limited to cases like the stock table or
> > customers table case. I was very clear on that point.
>
> I don't know how I'm supposed to sensibly respond to a statement like
> this. If you were very clear, then I'm being deliberately obtuse if I
> fail to understand.

I don't know if I'd accuse you of being obtuse, exactly. Mostly I just
think it's strange that you don't seem to take what I say seriously
when it cannot be proven very easily. I don't think that you intend
this to be disrespectful, and I don't take it personally. I just don't
understand it.

> > It isn't that hard to see that the cases where we continue to get any
> > anti-wraparound VACUUMs with the patch seem to be limited to cases
> > like the stock/customers table, or cases like the pathological idle
> > cursor cases we've been discussing. Pretty narrow cases, overall.
> > Don't take my word for it - see for yourself.
>
> I don't think that's really possible. Words like "narrow" and
> "pathological" are value judgments, not factual statements. If I do an
> experiment where no wraparound autovacuums happen, as I'm sure I can,
> then those are the normal cases where the patch helps. If I do an
> experiment where they do happen, as I'm sure that I also can, you'll
> probably say either that the case in question is like the
> stock/customers table, or that it's pathological. What will any of
> this prove?

You seem to be suggesting that I used words like "pathological" in
some kind of highly informal, totally subjective way, when I did no
such thing.

I quite clearly said that you'll only get an anti-wraparound VACUUM
with the patch applied when the only factor that *ever* causes *any*
autovacuum worker to VACUUM the table (assuming the workload is
stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With
a table like this, even increasing autovacuum_freeze_max_age to its
absolute maximum of 2 billion would not make it any more likely that
we'd get a non-aggressive VACUUM -- it would merely make the
anti-wraparound VACUUMs less frequent. No big change should be
expected with a table like that.

Also, since the patch is not magic, and doesn't even change the basic
invariants for relfrozenxid, it's still true that any scenario in
which it's fundamentally impossible for VACUUM to keep up will also
have anti-wraparound VACUUMs. But that's the least of the user's
trouble -- in the long run we're going to have the system refuse to
allocate new XIDs with such a workload.

The claim that I have made is 100% testable. Even if it was flat out
incorrect, not getting anti-wraparound VACUUMs per se is not the
important part. The important part is that the work is managed
intelligently, and the burden is spread out over time. I am
particularly concerned about the "freezing cliff" we get when many
pages are all-visible but not also all-frozen. Consistently avoiding
an anti-wraparound VACUUM (except with very particular workload
characteristics) is really just a side effect -- it's something that
makes the overall benefit relatively obvious, and relatively easy to
measure. I thought that you'd appreciate that.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I quite clearly said that you'll only get an anti-wraparound VACUUM
> with the patch applied when the only factor that *ever* causes *any*
> autovacuum worker to VACUUM the table (assuming the workload is
> stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With
> a table like this, even increasing autovacuum_freeze_max_age to its
> absolute maximum of 2 billion would not make it any more likely that
> we'd get a non-aggressive VACUUM -- it would merely make the
> anti-wraparound VACUUMs less frequent. No big change should be
> expected with a table like that.

Sure, I don't disagree with any of that. I don't see how I could. But
I don't see how it detracts from the points I was trying to make
either.

> Also, since the patch is not magic, and doesn't even change the basic
> invariants for relfrozenxid, it's still true that any scenario in
> which it's fundamentally impossible for VACUUM to keep up will also
> have anti-wraparound VACUUMs. But that's the least of the user's
> trouble -- in the long run we're going to have the system refuse to
> allocate new XIDs with such a workload.

Also true. But again, it's just about making sure that the patch
doesn't make other decisions that make things worse for people in that
situation. That's what I was expressing uncertainty about.

> The claim that I have made is 100% testable. Even if it was flat out
> incorrect, not getting anti-wraparound VACUUMs per se is not the
> important part. The important part is that the work is managed
> intelligently, and the burden is spread out over time. I am
> particularly concerned about the "freezing cliff" we get when many
> pages are all-visible but not also all-frozen. Consistently avoiding
> an anti-wraparound VACUUM (except with very particular workload
> characteristics) is really just a side effect -- it's something that
> makes the overall benefit relatively obvious, and relatively easy to
> measure. I thought that you'd appreciate that.

I do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Tue, Jan 18, 2022 at 6:11 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I quite clearly said that you'll only get an anti-wraparound VACUUM
> > with the patch applied when the only factor that *ever* causes *any*
> > autovacuum worker to VACUUM the table (assuming the workload is
> > stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With
> > a table like this, even increasing autovacuum_freeze_max_age to its
> > absolute maximum of 2 billion would not make it any more likely that
> > we'd get a non-aggressive VACUUM -- it would merely make the
> > anti-wraparound VACUUMs less frequent. No big change should be
> > expected with a table like that.
>
> Sure, I don't disagree with any of that. I don't see how I could. But
> I don't see how it detracts from the points I was trying to make
> either.

You said "...the reasoning that says - anti-wraparound vacuums just
aren't going to happen any more - or - relfrozenxid will advance
continuously seems like dangerous wishful thinking to me". You then
proceeded to attack a straw man -- a view that I couldn't possibly
hold. This certainly surprised me, because my actual claims seemed
well within the bounds of what is possible, and in any case can be
verified with a fairly modest effort.

That's what I was reacting to -- it had nothing to do with any
concerns you may have had. I wasn't thinking about long-idle cursors
at all. I was defending myself, because I was put in a position where
I had to defend myself.

> > Also, since the patch is not magic, and doesn't even change the basic
> > invariants for relfrozenxid, it's still true that any scenario in
> > which it's fundamentally impossible for VACUUM to keep up will also
> > have anti-wraparound VACUUMs. But that's the least of the user's
> > trouble -- in the long run we're going to have the system refuse to
> > allocate new XIDs with such a workload.
>
> Also true. But again, it's just about making sure that the patch
> doesn't make other decisions that make things worse for people in that
> situation. That's what I was expressing uncertainty about.

I am not just trying to avoid making things worse when users are in
this situation. I actually want to give users every chance to avoid
being in this situation in the first place. In fact, almost everything
I've said about this aspect of things was about improving things for
these users. It was not about covering myself -- not at all. It would
be easy for me to throw up my hands, and change nothing here (keep the
behavior that derives FreezeLimit from the vacuum_freeze_min_age
GUC), since it's all incidental to the main goals of this patch
series.

I still don't understand why you think that my idea (not yet
implemented) of making FreezeLimit into a backstop (making it
autovacuum_freeze_max_age/2 or something) and relying on the new
"early freezing" criteria for almost everything is going to make the
situation worse in this scenario with long idle cursors. It's intended
to make it better.

Why do you think that the current vacuum_freeze_min_age-based
FreezeLimit isn't actually the main problem in these scenarios? I
think that the way that that works right now (in particular during
aggressive VACUUMs) is just an accident of history. It's all path
dependence -- each incremental step may have made sense, but what we
have now doesn't seem to. Waiting for a cleanup lock might feel like
the diligent thing to do, but that doesn't make it so.

My sense is that there are very few apps that are hopelessly incapable
of advancing relfrozenxid from day one. I find it much easier to
believe that users that had this experience got away with it for a
very long time, until their luck ran out, somehow. I would like to
minimize the chance of that ever happening, to the extent that that's
possible within the confines of the basic heapam/vacuumlazy.c
invariants.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> That's what I was reacting to -- it had nothing to do with any
> concerns you may have had. I wasn't thinking about long-idle cursors
> at all. I was defending myself, because I was put in a position where
> I had to defend myself.

I don't think I've said anything on this thread that is an attack on
you. I am getting pretty frustrated with the tenor of the discussion,
though. I feel like you're the one attacking me, and I don't like it.

> I still don't understand why you think that my idea (not yet
> implemented) of making FreezeLimit into a backstop (making it
> autovacuum_freeze_max_age/2 or something) and relying on the new
> "early freezing" criteria for almost everything is going to make the
> situation worse in this scenario with long idle cursors. It's intended
> to make it better.

I just don't understand how I haven't been able to convey my concern
here by now. I've already written multiple emails about it. If none of
them were clear enough for you to understand, I'm not sure how saying
the same thing over again can help. When I say I've already written
about this, I'm referring specifically to the following:

- https://postgr.es/m/CA+TgmobKJm9BsZR3ETeb6MJdLKWxKK5ZXx0XhLf-W9kUgvOcNA@mail.gmail.com
in the second-to-last paragraph, beginning with "I don't really see"
- https://www.postgresql.org/message-id/CA%2BTgmoaGoZ2wX6T4sj0eL5YAOQKW3tS8ViMuN%2BtcqWJqFPKFaA%40mail.gmail.com
in the second paragraph beginning with "Because waiting on a lock"
- https://www.postgresql.org/message-id/CA%2BTgmoZYri_LUp4od_aea%3DA8RtjC%2B-Z1YmTc7ABzTf%2BtRD2Opw%40mail.gmail.com
in the paragraph beginning with "That's the part I'm not sure I
believe."

For all of that, I'm not even convinced that you're wrong. I just
think you might be wrong. I don't really know. It seems to me however
that you're understating the value of waiting, which I've tried to
explain in the above places. Waiting does have the very real
disadvantage of starving the rest of the system of the work that
autovacuum worker would have been doing, and that's why I think you
might be right. However, there are cases where waiting, and only
waiting, gets the job done. If you're not willing to admit that those
cases exist, or you think they don't matter, then we disagree. If you
admit that they exist and think they matter but believe that there's
some reason why increasing FreezeLimit can't cause any damage, then
either (a) you have a good reason for that belief which I have thus
far been unable to understand or (b) you're more optimistic about the
proposed change than can be entirely justified.

> My sense is that there are very few apps that are hopelessly incapable
> of advancing relfrozenxid from day one. I find it much easier to
> believe that users that had this experience got away with it for a
> very long time, until their luck ran out, somehow. I would like to
> minimize the chance of that ever happening, to the extent that that's
> possible within the confines of the basic heapam/vacuumlazy.c
> invariants.

I agree with the idea that most people are OK at the beginning and
then at some point their luck runs out and catastrophe strikes. I
think there are a couple of different kinds of catastrophe that can
happen. For instance, somebody could park a cursor in the middle of a
table someplace and leave it there until the snow melts. Or, somebody
could take a table lock and sit on it forever. Or, there could be a
corrupted page in the table that causes VACUUM to error out every time
it's reached. In the second and third situations, it doesn't matter a
bit what we do with FreezeLimit, but in the first one it might. If the
user is going to leave that cursor sitting there literally forever,
the best solution is to raise FreezeLimit as high as we possibly can.
The system is bound to shut down due to wraparound at some point, but
we at least might as well vacuum other stuff while we're waiting for
that to happen. On the other hand if that user is going to close that
cursor after 10 minutes and open a new one in the same place 10
seconds later, the best thing to do is to keep FreezeLimit as low as
possible, because the first time we wait for the pin to be released
we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
we don't do that we may keep missing the brief windows in which no
cursor is held for a very long time. But we have absolutely no way of
knowing which of those things is going to happen on any particular
system, or of estimating which one is more common in general.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Wed, Jan 19, 2022 at 6:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think I've said anything on this thread that is an attack on
> you. I am getting pretty frustrated with the tenor of the discussion,
> though. I feel like you're the one attacking me, and I don't like it.

"Attack" is a strong word (much stronger than "defend"), and I don't
think I'd use it to describe anything that has happened on this
thread. All I said was that you misrepresented my views when you
pounced on my use of the word "continuous". Which, honestly, I was
very surprised by.

> For all of that, I'm not even convinced that you're wrong. I just
> think you might be wrong. I don't really know.

I agree that I might be wrong, though of course I think that I'm
probably correct. I value your input as a critical voice -- that's
generally how we get really good designs.

> However, there are cases where waiting, and only
> waiting, gets the job done. If you're not willing to admit that those
> cases exist, or you think they don't matter, then we disagree.

They exist, of course. That's why I don't want to completely eliminate
the idea of waiting for a cleanup lock. Rather, I want to change the
design to recognize that that's an extreme measure, that should be
delayed for as long as possible. There are many ways that the problem
could naturally resolve itself.

Waiting for a cleanup lock after only 50 million XIDs (the
vacuum_freeze_min_age default) is like performing brain surgery to
treat somebody with a headache (at least with the infrastructure from
the earlier patches in place). It's not impossible that "surgery"
could help, in theory (could be a tumor, better to catch these things
early!), but that fact alone can hardly justify such a drastic
measure. That doesn't mean that brain surgery isn't ever appropriate,
of course. It should be delayed until it starts to become obvious that
it's really necessary (but before it really is too late).

> If you
> admit that they exist and think they matter but believe that there's
> some reason why increasing FreezeLimit can't cause any damage, then
> either (a) you have a good reason for that belief which I have thus
> far been unable to understand or (b) you're more optimistic about the
> proposed change than can be entirely justified.

I don't deny that it's just about possible that the changes that I'm
thinking of could make the situation worse in some cases, but I think
that the overwhelming likelihood is that things will be improved
across the board.

Consider the age of the tables from BenchmarkSQL, with the patch series:

     relname      │     age     │ mxid_age
──────────────────┼─────────────┼──────────
 bmsql_district   │         657 │        0
 bmsql_warehouse  │         696 │        0
 bmsql_item       │   1,371,978 │        0
 bmsql_config     │   1,372,061 │        0
 bmsql_new_order  │   3,754,163 │        0
 bmsql_history    │  11,545,940 │        0
 bmsql_order_line │  23,095,678 │        0
 bmsql_oorder     │  40,653,743 │        0
 bmsql_customer   │  51,371,610 │        0
 bmsql_stock      │  51,371,610 │        0
(10 rows)
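
(Those numbers come straight out of pg_class; a query along these
lines reproduces them, modulo psql formatting -- just a sketch:)

SELECT c.relname,
       age(c.relfrozenxid)    AS age,
       mxid_age(c.relminmxid) AS mxid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
ORDER BY age(c.relfrozenxid);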

We see significant "natural variation" here, unlike HEAD, where the
age of all tables is exactly the same at all times, or close to it
(incidentally, this leads to the largest tables all being
anti-wraparound VACUUMed at the same time). There is a kind of natural
ebb and flow for each table over time, as relfrozenxid is advanced,
due in part to workload characteristics. Less than half of all XIDs
will ever modify the two largest tables, for example, and so
autovacuum should probably never be launched because of the age of
either table (barring some change in workload conditions, perhaps). As
I've said a few times now, XIDs are generally "the wrong unit", except
when needed as a backstop against wraparound failure.

The natural variation that I see contributes to my optimism. A
situation where we cannot get a cleanup lock may well resolve itself,
for many reasons, that are hard to precisely nail down but are
nevertheless very real.

The vacuum_freeze_min_age design (particularly within an aggressive
VACUUM) is needlessly rigid, probably just because the assumption
before now has always been that we can only advance relfrozenxid in an
aggressive VACUUM (it might happen in a non-aggressive VACUUM if we
get very lucky, which cannot be accounted for). Because it is rigid,
it is brittle. Because it is brittle, it will (on a long enough
timeline, for a susceptible workload) actually break.

> On the other hand if that user is going to close that
> cursor after 10 minutes and open a new one in the same place 10
> seconds later, the best thing to do is to keep FreezeLimit as low as
> possible, because the first time we wait for the pin to be released
> we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
> we don't do that we may keep missing the brief windows in which no
> cursor is held for a very long time. But we have absolutely no way of
> knowing which of those things is going to happen on any particular
> system, or of estimating which one is more common in general.

I agree with all that, and I think that this particular scenario is
the crux of the issue.

The first time this happens (and we don't get a cleanup lock), then we
will at least be able to set relfrozenxid to the exact oldest unfrozen
XID. So that'll already have bought us some wallclock time -- often a
great deal (why should the oldest XID on such a page be particularly
old?). Furthermore, there will often be many more VACUUMs before we
need to do an aggressive VACUUM -- each of these VACUUM operations is
an opportunity to freeze the oldest tuple that holds up cleanup. Or
maybe this XID is in a dead tuple, and so somebody's opportunistic
pruning operation does the right thing for us. Never underestimate the
power of dumb luck, especially in a situation where there are many
individual "trials", and we only have to get lucky once.

If and when that doesn't work out, and we actually have to do an
anti-wraparound VACUUM, then something will have to give. Since
anti-wraparound VACUUMs are naturally confined to certain kinds of
tables/workloads with the patch series, we can now be pretty confident
that the problem really is with this one problematic heap page, with
the idle cursor. We could even verify this directly if we wanted to,
by noticing that the preexisting relfrozenxid is an exact match for
one XID on some can't-cleanup-lock page -- we could emit a WARNING
about the page/tuple if we wanted to. To return to my colorful analogy
from earlier, we now know that the patient almost certainly has a
brain tumor.

What new risk is implied by delaying the wait like this? Very little,
I believe. Lets say we derive FreezeLimit from
autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We
still ought to have the opportunity to wait for the cleanup lock for
rather a long time -- if the XID consumption rate is so high that that
isn't true, then we're doomed anyway. All told, there seems to be a
huge net reduction in risk with this design.
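
(To put rough numbers on that, using the stock defaults -- 200 million
for autovacuum_freeze_max_age, 50 million for vacuum_freeze_min_age --
here's a throwaway sketch:)

SELECT current_setting('autovacuum_freeze_max_age')::bigint / 2
           AS freezelimit_backstop,    -- 100 million XIDs with defaults
       current_setting('vacuum_freeze_min_age')::bigint
           AS freezelimit_today;       -- 50 million XIDs with defaults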

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Wed, Jan 19, 2022 at 2:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On the other hand if that user is going to close that
> > cursor after 10 minutes and open a new one in the same place 10
> > seconds later, the best thing to do is to keep FreezeLimit as low as
> > possible, because the first time we wait for the pin to be released
> > we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
> > we don't do that we may keep missing the brief windows in which no
> > cursor is held for a very long time. But we have absolutely no way of
> > knowing which of those things is going to happen on any particular
> > system, or of estimating which one is more common in general.
>
> I agree with all that, and I think that this particular scenario is
> the crux of the issue.

Great, I'm glad we agree on that much. I would be interested in
hearing what other people think about this scenario.

> The first time this happens (and we don't get a cleanup lock), then we
> will at least be able to set relfrozenxid to the exact oldest unfrozen
> XID. So that'll already have bought us some wallclock time -- often a
> great deal (why should the oldest XID on such a page be particularly
> old?). Furthermore, there will often be many more VACUUMs before we
> need to do an aggressive VACUUM -- each of these VACUUM operations is
> an opportunity to freeze the oldest tuple that holds up cleanup. Or
> maybe this XID is in a dead tuple, and so somebody's opportunistic
> pruning operation does the right thing for us. Never underestimate the
> power of dumb luck, especially in a situation where there are many
> individual "trials", and we only have to get lucky once.
>
> If and when that doesn't work out, and we actually have to do an
> anti-wraparound VACUUM, then something will have to give. Since
> anti-wraparound VACUUMs are naturally confined to certain kinds of
> tables/workloads with the patch series, we can now be pretty confident
> that the problem really is with this one problematic heap page, with
> the idle cursor. We could even verify this directly if we wanted to,
> by noticing that the preexisting relfrozenxid is an exact match for
> one XID on some can't-cleanup-lock page -- we could emit a WARNING
> about the page/tuple if we wanted to. To return to my colorful analogy
> from earlier, we now know that the patient almost certainly has a
> brain tumor.
>
> What new risk is implied by delaying the wait like this? Very little,
> I believe. Lets say we derive FreezeLimit from
> autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We
> still ought to have the opportunity to wait for the cleanup lock for
> rather a long time -- if the XID consumption rate is so high that that
> isn't true, then we're doomed anyway. All told, there seems to be a
> huge net reduction in risk with this design.

I'm just being honest here when I say that I can't see any huge
reduction in risk. Nor a huge increase in risk. It just seems
speculative to me. If I knew something about the system or the
workload, then I could say what would likely work out best on that
system, but in the abstract I neither know nor understand how it's
possible to know.

My gut feeling is that it's going to make very little difference
either way. People who never release their cursors or locks or
whatever are going to be sad either way, and people who usually do
will be happy either way. There's some in-between category of people
who release sometimes but not too often for whom it may matter,
possibly quite a lot. It also seems possible that one decision rather
than another will make the happy people MORE happy, or the sad people
MORE sad. For most people, though, I think it's going to be
irrelevant. The fact that you seem to view the situation quite
differently is a big part of what worries me here. At least one of us
is missing something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 20, 2022 at 6:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Great, I'm glad we agree on that much. I would be interested in
> hearing what other people think about this scenario.

Agreed.

> I'm just being honest here when I say that I can't see any huge
> reduction in risk. Nor a huge increase in risk. It just seems
> speculative to me. If I knew something about the system or the
> workload, then I could say what would likely work out best on that
> system, but in the abstract I neither know nor understand how it's
> possible to know.

I think that it's very hard to predict the timeline with a scenario
like this -- no question. But I often imagine idealized scenarios like
the one you brought up with cursors, with the intention of lowering
the overall exposure to problems to the extent that that's possible;
if it was obvious, we'd have fixed it by now already. I cannot think
of any reason why making FreezeLimit into what I've been calling a
backstop introduces any new risk, but I can think of ways in which it
avoids risk. We shouldn't be waiting indefinitely for something
totally outside our control or understanding, and so blocking all
freezing and other maintenance on the table, until it's provably
necessary.

More fundamentally, freezing should be thought of as an overhead of
storing tuples in heap blocks, as opposed to an overhead of
transactions (that allocate XIDs). Meaning that FreezeLimit becomes
almost an emergency thing, closely associated with aggressive
anti-wraparound VACUUMs.

> My gut feeling is that it's going to make very little difference
> either way. People who never release their cursors or locks or
> whatever are going to be sad either way, and people who usually do
> will be happy either way.

In a real world scenario, the rate at which XIDs are used could be
very low. Buying a few hundred million more XIDs until the pain begins
could amount to buying weeks or months for the user in practice. Plus
they have visibility into the issue, in that they can potentially see
exactly when they stopped being able to advance relfrozenxid by
looking at the autovacuum logs.

My thinking on vacuum_freeze_min_age has shifted very slightly. I now
think that I'll probably need to keep it around, just so things like
VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally)
continue to work. So maybe its default should be changed to -1, which
is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it
should still be greatly deemphasized in user docs.

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote:
> My thinking on vacuum_freeze_min_age has shifted very slightly. I now
> think that I'll probably need to keep it around, just so things like
> VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally)
> continue to work. So maybe its default should be changed to -1, which
> is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it
> should still be greatly deemphasized in user docs.

I like that better, because it lets us retain an escape valve in case
we should need it. I suggest that the documentation should say things
like "The default is believed to be suitable for most use cases" or
"We are not aware of a reason to change the default" rather than
something like "There is almost certainly no good reason to change
this" or "What kind of idiot are you, anyway?" :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 20, 2022 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > My thinking on vacuum_freeze_min_age has shifted very slightly. I now
> > think that I'll probably need to keep it around, just so things like
> > VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally)
> > continue to work. So maybe its default should be changed to -1, which
> > is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it
> > should still be greatly deemphasized in user docs.
>
> I like that better, because it lets us retain an escape valve in case
> we should need it.

I do see some value in that, too. Though it's not going to be a way of
turning off the early freezing stuff, which seems unnecessary (though
I do still have work to do on getting the overhead for that down).

> I suggest that the documentation should say things
> like "The default is believed to be suitable for most use cases" or
> "We are not aware of a reason to change the default" rather than
> something like "There is almost certainly no good reason to change
> this" or "What kind of idiot are you, anyway?" :-)

I will admit to having a big bias here: I absolutely *loathe* these
GUCs. I really, really hate them.

Consider how we have to include messy caveats about
autovacuum_freeze_min_age when talking about
autovacuum_vacuum_insert_scale_factor. Then there's the fact that you
really cannot think about the rate of XID consumption intuitively --
it has at best a weak, unpredictable relationship with anything that
users can understand, such as data stored or wall clock time.

Then there are the problems with the equivalent MultiXact GUCs, which
somehow, against all odds, are even worse:

https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Greg Stark
Date:
On Thu, 20 Jan 2022 at 17:01, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Then there's the fact that you
> really cannot think about the rate of XID consumption intuitively --
> it has at best a weak, unpredictable relationship with anything that
> users can understand, such as data stored or wall clock time.

This confuses me. "Transactions per second" is a headline database
metric that lots of users actually focus on quite heavily -- rather
too heavily imho. Ok, XIDs are only consumed by the subset of
transactions that are not read-only, but that's a detail that's pretty
easy to explain, and one that users get pretty quickly.

There are corner cases like transactions that look read-only but are
actually read-write or transactions that consume multiple xids but
complex systems are full of corner cases and people don't seem too
surprised about these things.

What I find confuses people much more is the concept of the
oldestxmin. I think most of the autovacuum problems I've seen come
from cases where autovacuum is happily kicking off useless vacuums
because the oldestxmin hasn't actually advanced enough for them to do
any useful work.

-- 
greg



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Jan 21, 2022 at 12:07 PM Greg Stark <stark@mit.edu> wrote:
> This confuses me. "Transactions per second" is a headline database
> metric that lots of users actually focus on quite heavily -- rather
> too heavily imho.

But transactions per second is for the whole database, not for
individual tables. It's also really a benchmarking thing, where the
size and variety of transactions is fixed. With something like pgbench
it actually is exactly the same thing, but such a workload is not at
all realistic.  Even BenchmarkSQL/TPC-C isn't like that, despite the
fact that it is a fairly synthetic workload (it's just not super
synthetic).

> Ok, XID consumption is only a subset of transactions
> that are not read-only but that's a detail that's pretty easy to
> explain and users get pretty quickly.

My point was mostly this: the number of distinct extant unfrozen tuple
headers (and the range of the relevant XIDs) is generally highly
unpredictable today. And the number of tuples we'll have to freeze to
be able to advance relfrozenxid by a good amount is quite variable, in
general.

For example, if we bulk extend a relation as part of an ETL process,
then the number of distinct XIDs could be as low as 1, even though we
can expect a great deal of "freeze debt" that will have to be paid off
at some point (with the current design, in the common case where the
user doesn't account for this effect because they're not already an
expert). There are other common cases that are not quite as extreme as
that, that still have the same effect -- even an expert will find it
hard or impossible to tune autovacuum_freeze_min_age for that.
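
(For instance -- a deliberately trivial sketch, with a made-up table
name -- a bulk load like this consumes exactly one XID while leaving
every one of those tuple headers unfrozen:)

-- One transaction, one XID, ten million unfrozen tuples of "freeze debt"
CREATE TABLE bulk_loaded AS
SELECT g AS id, now() AS loaded_at
FROM generate_series(1, 10000000) AS g;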

Another case of interest (that illustrates the general principle) is
something like pgbench_tellers. We'll never have an aggressive VACUUM
of the table with the patch, and we shouldn't ever need to freeze any
tuples. But, owing to workload characteristics, we'll constantly be
able to keep its relfrozenxid very current, because (even if we
introduce skew) each individual row cannot go very long without being
updated, allowing old XIDs to age out that way.

There is also an interesting middle ground, where you get a mixture of
both tendencies due to skew. The tuple that's most likely to get
updated is the one that was just updated. How are you as a DBA ever
supposed to tune autovacuum_freeze_min_age if tuples happen to be
qualitatively different in this way?

> What I find confuses people much more is the concept of the
> oldestxmin. I think most of the autovacuum problems I've seen come
> from cases where autovacuum is happily kicking off useless vacuums
> because the oldestxmin hasn't actually advanced enough for them to do
> any useful work.

As it happens, the proposed log output won't use the term oldestxmin
anymore -- I think that it makes sense to rename it to "removable
cutoff". Here's an example:

LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 317308 remain, 250258 skipped using visibility map
(78.87% of total)
tuples: 70 removed, 34105925 remain (6830471 newly frozen), 2528 are
dead but not yet removable
removable cutoff: 37574752, which is 230115 xids behind next
new relfrozenxid: 35221275, which is 5219310 xids ahead of previous value
index scan needed: 55540 pages from table (17.50% of total) had
3339809 dead item identifiers removed
index "bmsql_oorder_pkey": pages: 144257 in total, 0 newly deleted, 0
currently deleted, 0 reusable
index "bmsql_oorder_idx2": pages: 330083 in total, 0 newly deleted, 0
currently deleted, 0 reusable
I/O timings: read: 7928.207 ms, write: 1386.662 ms
avg read rate: 33.107 MB/s, avg write rate: 26.218 MB/s
buffer usage: 220825 hits, 443331 misses, 351084 dirtied
WAL usage: 576110 records, 364797 full page images, 2046767817 bytes
system usage: CPU: user: 10.62 s, system: 7.56 s, elapsed: 104.61 s

Note also that I deliberately made the "new relfrozenxid" line that
immediately follows (information that we haven't shown before now)
similar, to highlight that they're now closely related concepts. Now
if you VACUUM a table that is either empty or has only frozen tuples,
VACUUM will set relfrozenxid to oldestxmin/removable cutoff.
Internally, oldestxmin is the "starting point" for our final/target
relfrozenxid for the table. We ratchet it back dynamically, whenever
we see an older-than-current-target XID that cannot be immediately
frozen (e.g., when we can't easily get a cleanup lock on the page).
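
(One way to read the two cutoff lines in the example output: they
cross-check with simple arithmetic, shown here as a throwaway query
over the numbers from the log above:)

SELECT 37574752 + 230115  AS next_unassigned_xid,    -- "removable cutoff" + "xids behind next"
       35221275 - 5219310 AS previous_relfrozenxid;  -- "new relfrozenxid" - "xids ahead of previous"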

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I do see some value in that, too. Though it's not going to be a way of
> turning off the early freezing stuff, which seems unnecessary (though
> I do still have work to do on getting the overhead for that down).

Attached is v7, a revision that overhauls the algorithm that decides
what to freeze. I'm now calling it block-driven freezing in the commit
message. Also included is a new patch, that makes VACUUM record zero
free space in the FSM for an all-visible page, unless the total amount
of free space happens to be greater than one half of BLCKSZ.

The fact that I am now including this new FSM patch (v7-0006-*patch)
may seem like a case of expanding the scope of something that could
well do without it. But hear me out! It's true that the new FSM patch
isn't essential. I'm including it now because it seems relevant to the
approach taken with block-driven freezing -- it may even make my
general approach easier to understand. The new approach to freezing is
to freeze every tuple on a block that is about to be set all-visible
(and thus set it all-frozen too), or to not freeze anything on the
page at all (at least until one XID gets really old, which should be
rare). This approach has all the benefits that I described upthread,
and a new benefit: it effectively encourages the application to allow
pages to "become settled".

The main difference in how we freeze here (relative to v6 of the
patch) is that I'm *not* freezing a page just because it was
dirtied/pruned. I now think about freezing as an essentially
page-level thing, barring edge cases where we have to freeze
individual tuples, just because the XIDs really are getting old (it's
an edge case when we can't freeze all the tuples together due to a mix
of new and old, which is something we specifically set out to avoid
now).
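
If it helps, here's that decision rule as a stand-alone sketch -- my
own simplified model, not the patch's actual code, and the XIDs are
made up:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Circular XID comparison: true when a precedes b */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Freeze every tuple on the page when the page is about to be set
 * all-visible (so it can be set all-frozen at the same time); otherwise
 * freeze nothing, unless some XID on the page has crossed FreezeLimit.
 */
static bool
freeze_whole_page(bool setting_all_visible,
                  TransactionId oldest_unfrozen_xid,
                  TransactionId freeze_limit)
{
    if (setting_all_visible)
        return true;
    return xid_precedes(oldest_unfrozen_xid, freeze_limit);
}

int
main(void)
{
    /* Hypothetical XIDs, purely for illustration */
    printf("%d\n", freeze_whole_page(true, 37574752, 30000000));   /* 1 */
    printf("%d\n", freeze_whole_page(false, 37574752, 30000000));  /* 0 */
    printf("%d\n", freeze_whole_page(false, 29000000, 30000000));  /* 1 */
    return 0;
}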

Freezing whole pages
====================

When VACUUM sees that all remaining/unpruned tuples on a page are
all-visible, it isn't just important because of cost control
considerations. It's deeper than that. It's also treated as a
tentative signal from the application itself, about the data itself.
Which is: this page looks "settled" -- it may never be updated again,
but if there is an update it likely won't change too much about the
whole page. Also, if the page is ever updated in the future, it's
likely that that will happen at a much later time than you should
expect for those *other* nearby pages, that *don't* appear to be
settled. And so VACUUM infers that the page is *qualitatively*
different to these other nearby pages. VACUUM therefore makes it hard
(though not impossible) for future inserts or updates to disturb these
settled pages, via this FSM behavior -- it is short sighted to just
see the space remaining on the page as free space, equivalent to any
other. This holistic approach seems to work well for
TPC-C/BenchmarkSQL, and perhaps even in general. More on TPC-C below.

This is not unlike the approach taken by other DB systems, where free
space management is baked into concurrency control, and the concept of
physical data independence as we know it from Postgres never really
existed. My approach also seems related to the concept of a "tenured
generation", which is key to generational garbage collection. The
whole basis of generational garbage collection is the generational
hypothesis: "most objects die young". This is an empirical observation
about how applications written in GC'd programming languages actually
behave, not a rigorous principle, and yet in practice it appears to
always hold. Intuitively, it seems to me like the hypothesis must work
in practice because if it didn't then a counterexample nemesis
application's behavior would be totally chaotic, in every way.
Theoretically possible, but of no real concern, since the program
makes zero practical sense *as an actual program*. A Java program must
make sense to *somebody* (at least the person that wrote it), which,
it turns out, helpfully constrains the space of possibilities that any
industrial strength GC implementation needs to handle well.

The same principles seem to apply here, with VACUUM. Grouping logical
rows into pages that become their "permanent home until further
notice" may be somewhat arbitrary, at first, but that doesn't mean it
won't end up sticking. It's just like generational garbage collection:
the application isn't expected to instruct the GC about its plans for
the memory that it allocates, and yet that memory can nevertheless be
usefully organized into distinct generations through an adaptive process.

Second order effects
====================

Relating the FSM to page freezing/all-visible setting makes much more
sense if you consider the second order effects.

There is bound to be competition for free space among backends that
access the free space map. By *not* freezing a page during VACUUM
because it looks unsettled, we make its free space available in the
traditional way instead. It follows that unsettled pages (in tables
with lots of updates) are now the only place that backends that need
more free space from the FSM can look -- unsettled pages therefore
become a hot commodity, freespace-wise. A page that initially appeared
"unsettled" but went on to become settled in this newly competitive
environment might have had that happen by pure chance -- but probably not.
It *could* happen by chance, of course -- in which case the page will
get dirtied again, and the cycle continues, for now. There will be
further opportunities to figure it out, and freezing the tuples on the
page "prematurely" still has plenty of benefits.

Locality matters a lot, obviously. The goal with the FSM stuff is
merely to make it *possible* for pages to settle naturally, to the
extent that we can. We really just want to avoid hindering a naturally
occurring process -- we want to avoid destroying naturally occurring
locality. We must be willing to accept some cost for that. Even if it
takes a few attempts for certain pages, constraining the application's
choice of where to get free space from (can't be a page marked
all-visible) allows pages to *systematically* become settled over
time.

The application is in charge, really -- not VACUUM. This is already
the case, whether we like it or not. VACUUM needs to learn to live in
that reality, rather than fighting it. When VACUUM considers a page
settled, and the physical page still has a relatively large amount of
free space (say 45% of BLCKSZ, a borderline case in the new FSM
patch), "losing" so much free space certainly is unappealing. We set
the free space to 0 in the free space map all the same, because we're
cutting our losses at that point. While the exact threshold I've
proposed is tentative, the underlying theory seems pretty sound to me.
The BLCKSZ/2 cutoff (and the way that it extends the general rules for
whole-page freezing) is intended to catch pages that are qualitatively
different, as well as quantitatively different. It is a balancing act,
between not wasting space, and the risk of systemic problems involving
excessive amounts of non-HOT updates that must move a successor
version to another page.

It's possible that a higher cutoff (for example a cutoff of 80% of
BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
addition to the downsides from fragmentation -- it's far from a simple
trade-off. (Not that you should believe that 50% is special, it's just
a starting point for me.)
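
Expressed as a stand-alone sketch (this is just my shorthand for the
v7-0006 rule, with a made-up helper -- not the patch itself):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192             /* default Postgres block size */

/*
 * When VACUUM sets a page all-visible, record zero free space for it in
 * the FSM -- unless the page still has more than BLCKSZ/2 of free space,
 * in which case we cut our losses and advertise the space as usual.
 */
static uint32_t
freespace_to_record(int set_all_visible, uint32_t actual_free_space)
{
    if (set_all_visible && actual_free_space <= BLCKSZ / 2)
        return 0;               /* page is treated as settled */
    return actual_free_space;
}

int
main(void)
{
    /* The borderline case from above: ~45% of BLCKSZ still free */
    printf("%u\n", freespace_to_record(1, 3686));   /* prints 0 */
    printf("%u\n", freespace_to_record(1, 4915));   /* prints 4915 */
    printf("%u\n", freespace_to_record(0, 3686));   /* prints 3686 */
    return 0;
}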

TPC-C
=====

I'm going to talk about a benchmark that ran throughout the week,
starting on Monday. Each run lasted 24 hours, and there were 2 runs in
total, for both the patch and for master/baseline. So this benchmark
lasted 4 days, not including the initial bulk loading, with databases
that were over 450GB in size by the time I was done (that's 450GB+ for
both the patch and master). Benchmarking for days at a time is pretty
inconvenient, but it seems necessary to see certain effects in play.
We need to wait until the baseline/master case starts to have
anti-wraparound VACUUMs with default, realistic settings, which just
takes days and days.

I've made all of my data for the benchmark in question available, which
is way more information than anybody is likely to want -- I dump
anything that even might be useful from the system views in an
automated way. There are html reports for all four 24-hour-long runs.
Google drive link:

https://drive.google.com/drive/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR?usp=sharing

While the patch did well overall, and I will get to the particulars
towards the end of the email, I want to start with what I consider to
be the important part: the user/admin experience with VACUUM, and
VACUUM's performance stability. This is about making VACUUM less
scary.

As I've said several times now, with an append-only table like
pgbench_history we see a consistent pattern where relfrozenxid is set
to a value very close to the same VACUUM's OldestXmin value (even
precisely equal to OldestXmin) during each VACUUM operation, again and
again, forever -- that case is easy to understand and appreciate, and
has already been discussed. Now (with v7's new approach to freezing),
a related pattern can be seen in the case of the two big, troublesome
TPC-C tables, the orders and order lines tables.

To recap, these tables are somewhat like the history table, in that
new orders insert into both tables, again and again, forever. But they
also have one huge difference from simple append-only tables, which
is the source of most of our problems with TPC-C. The difference is:
there are also delayed, correlated updates of each row from each
table. Exactly one such update per row for both tables, which takes
place hours after each order's insert, when the earlier order is
processed by TPC-C's delivery transaction. In the long run we need the
data to age out and not get re-dirtied, as the table grows and grows
indefinitely, much like with a simple append-only table. At the same
time, we don't want to have poor free space management for these
deferred updates. It's adversarial, sort of, but in a way that is
grounded in reality.

With the order and order lines tables, relfrozenxid tends to be
advanced up to the OldestXmin used by the *previous* VACUUM operation
-- an unmistakable pattern. I'll show you all of the autovacuum log
output for the orders table during the second 24 hour long benchmark
run:

2022-01-27 01:46:27 PST  LOG:  automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1205349 remain, 887225 skipped using visibility map
(73.61% of total)
tuples: 253872 removed, 134182902 remain (26482225 newly frozen),
27193 are dead but not yet removable
removable cutoff: 243783407, older by 728844 xids when operation ended
new relfrozenxid: 215400514, which is 26840669 xids ahead of previous value
...
2022-01-27 05:54:39 PST LOG:  automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1345302 remain, 993924 skipped using visibility map
(73.88% of total)
tuples: 261656 removed, 150022816 remain (29757570 newly frozen),
29216 are dead but not yet removable
removable cutoff: 276319403, older by 826850 xids when operation ended
new relfrozenxid: 243838706, which is 28438192 xids ahead of previous value
...
2022-01-27 10:37:24 PST LOG:  automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1504707 remain, 1110002 skipped using visibility map
(73.77% of total)
tuples: 316086 removed, 167990124 remain (33754949 newly frozen),
33326 are dead but not yet removable
removable cutoff: 313328445, older by 987732 xids when operation ended
new relfrozenxid: 276309397, which is 32470691 xids ahead of previous value
...
2022-01-27 15:49:51 PST LOG:  automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1680649 remain, 1250525 skipped using visibility map
(74.41% of total)
tuples: 343946 removed, 187739072 remain (37346315 newly frozen),
38037 are dead but not yet removable
removable cutoff: 354149019, older by 1222160 xids when operation ended
new relfrozenxid: 313332249, which is 37022852 xids ahead of previous value
...
2022-01-27 21:55:34 PST LOG:  automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1886336 remain, 1403800 skipped using visibility map
(74.42% of total)
tuples: 389748 removed, 210899148 remain (43453900 newly frozen),
45802 are dead but not yet removable
removable cutoff: 401955979, older by 1458514 xids when operation ended
new relfrozenxid: 354134615, which is 40802366 xids ahead of previous value

This mostly speaks for itself, I think. (Anybody that's interested can
drill down to the logs for order lines, which looks similar.)

The effect we see with the order/order lines table isn't perfectly
reliable. Actually, it depends on how you define it. It's possible
that we won't be able to acquire a cleanup lock on the wrong page at
the wrong time, and as a result fail to advance relfrozenxid by the
usual amount, once. But that effect appears to be both rare and of no
real consequence. One could reasonably argue that we never fell
behind, because we still did 99.9%+ of the required freezing -- we
just didn't immediately get to advance relfrozenxid, because of a
temporary hiccup on one page. We will still advance relfrozenxid by a
small amount. Sometimes it'll be by only hundreds of XIDs when
millions or tens of millions of XIDs were expected. Once we advance it
by some amount, we can reasonably suppose that the issue was just a
hiccup.

On the master branch, the first 24 hour period has no anti-wraparound
VACUUMs, and so looking at that first 24 hour period gives you some
idea of how much worse off we are in the short term -- the freezing stuff
won't really start to pay for itself until the second 24 hour run with
these mostly-default freeze related settings. The second 24 hour run
on master almost exclusively has anti-wraparound VACUUMs for all the
largest tables, though -- all at the same time. And not just the first
time, either! This causes big spikes that the patch totally avoids,
simply by avoiding anti-wraparound VACUUMs. With the patch, there are
no anti-wraparound VACUUMs, barring tables that will never be vacuumed
for any other reason, where it's still inevitable -- limited to the
stock table and customers table.

It was a mistake for me to emphasize "no anti-wraparound VACUUMs
outside pathological cases" before now. I stand by those statements as
accurate, but anti-wraparound VACUUMs should not have been given so
much emphasis. Let's assume that somehow we really were to get an
anti-wraparound VACUUM against one of the tables where that's just not
expected, like this orders table -- let's suppose that I got that part
wrong, in some way. It would hardly matter at all! We'd still have
avoided the freezing cliff during this anti-wraparound VACUUM, which
is the real benefit. Chances are good that we needed to VACUUM anyway,
just to clean any very old garbage tuples up -- relfrozenxid is now
predictive of the age of the oldest garbage tuples, which might have
been a good enough reason to VACUUM anyway. The stampede of
anti-wraparound VACUUMs against multiple tables seems like it would
still be fixed, since relfrozenxid now actually tells us something
about the table (as opposed to telling us only about what the user set
vacuum_freeze_min_age to). The only concerns that this leaves for me
are all usability related, and not of primary importance (e.g. do we
really need to make anti-wraparound VACUUMs non-cancelable now?).

TPC-C raw numbers
=================

The single most important number for the patch might be the decrease
in both buffer misses and buffer hits, which I believe is caused by
the patch being able to use index-only scans much more effectively
(with modifications to BenchmarkSQL to improve the indexing strategy
[1]). This is quite clear from pg_stat_database state at the end.

Patch:

xact_commit              | 440,515,133
xact_rollback            | 1,871,142
blks_read                | 3,754,614,188
blks_hit                 | 174,551,067,731
tup_returned             | 341,222,714,073
tup_fetched              | 124,797,772,450
tup_inserted             | 2,900,197,655
tup_updated              | 4,549,948,092
tup_deleted              | 165,222,130

Here is the same pg_stat_database info for master:

xact_commit              | 440,402,505
xact_rollback            | 1,871,536
blks_read                | 4,002,682,052
blks_hit                 | 283,015,966,386
tup_returned             | 346,448,070,798
tup_fetched              | 237,052,965,901
tup_inserted             | 2,899,735,420
tup_updated              | 4,547,220,642
tup_deleted              | 165,103,426

The blks_read is x0.938 of master/baseline for the patch -- not bad.
More importantly, blks_hit is x0.616 for the patch -- quite a
significant reduction in a key cost. Note that we start to get this
particular benefit for individual read queries pretty early on --
avoiding unsetting visibility map bits like this matters right from
the start. In TPC-C terms, the ORDER_STATUS transaction will have much
lower latency, particularly tail latency, since it uses index-only
scans to good effect. There are 5 distinct transaction types from the
benchmark, and an improvement to one particular transaction type isn't
unusual -- so you often have to drill down, and look at the full html
report. The latency situation is improved across the board with the
patch, by quite a bit, especially after the second run. This server
can sustain much more throughput than the TPC-C spec formally permits,
even with the benchmark's TPM rate increased to 10x the spec-legal
limit, so query latency is the main TPC-C metric of
interest here.

WAL
===

Then there's the WAL overhead. As with practically any workload, WAL
consumption here is dominated by FPIs, despite the fact
that I've tuned checkpoints reasonably well. The patch *does* write
more WAL in the first set of runs -- it writes a total of ~3.991 TiB,
versus ~3.834 TiB for master. In other words, during the first 24 hour
run (before the trouble with the anti-wraparound freeze cliff even
begins for the master branch), the patch writes x1.040 as much WAL in
total. The good news is that the patch comes out ahead by the end,
after the second set of 24 hour runs. By the time the second run
finishes, it's 8.332 TiB of WAL total for the patch, versus 8.409 TiB
for master, putting the patch at x0.990 in the end -- a small
improvement. I believe that most of the WAL doesn't get generated by
VACUUM here anyway -- opportunistic pruning works well for this
workload.

I expect to be able to commit the first 2 patches in a couple of
weeks, since that won't need to block on making the case for the final
3 or 4 patches from the patch series. The early stuff is mostly just
refactoring work that removes needless differences between aggressive
and non-aggressive VACUUM operations. It makes a lot of sense on its
own.

[1] https://github.com/pgsql-io/benchmarksql/pull/16
--
Peter Geoghegan

Вложения

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
John Naylor
Дата:
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I do see some value in that, too. Though it's not going to be a way of
> > turning off the early freezing stuff, which seems unnecessary (though
> > I do still have work to do on getting the overhead for that down).
>
> Attached is v7, a revision that overhauls the algorithm that decides
> what to freeze. I'm now calling it block-driven freezing in the commit
> message. Also included is a new patch, that makes VACUUM record zero
> free space in the FSM for an all-visible page, unless the total amount
> of free space happens to be greater than one half of BLCKSZ.
>
> The fact that I am now including this new FSM patch (v7-0006-*patch)
> may seem like a case of expanding the scope of something that could
> well do without it. But hear me out! It's true that the new FSM patch
> isn't essential. I'm including it now because it seems relevant to the
> approach taken with block-driven freezing -- it may even make my
> general approach easier to understand.

Without having looked at the latest patches, there was something in
the back of my mind while following the discussion upthread -- the
proposed opportunistic freezing made a lot more sense if the
earlier-proposed open/closed pages concept was already available.

> Freezing whole pages
> ====================

> It's possible that a higher cutoff (for example a cutoff of 80% of
> BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> addition to the downsides from fragmentation -- it's far from a simple
> trade-off. (Not that you should believe that 50% is special, it's just
> a starting point for me.)

How was the space utilization with the 50% cutoff in the TPC-C test?

> TPC-C raw numbers
> =================
>
> The single most important number for the patch might be the decrease
> in both buffer misses and buffer hits, which I believe is caused by
> the patch being able to use index-only scans much more effectively
> (with modifications to BenchmarkSQL to improve the indexing strategy
> [1]). This is quite clear from pg_stat_database state at the end.
>
> Patch:

> blks_hit                 | 174,551,067,731
> tup_fetched              | 124,797,772,450

> Here is the same pg_stat_database info for master:

> blks_hit                 | 283,015,966,386
> tup_fetched              | 237,052,965,901

That's impressive.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Fri, Feb 4, 2022 at 2:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Without having looked at the latest patches, there was something in
> the back of my mind while following the discussion upthread -- the
> proposed opportunistic freezing made a lot more sense if the
> earlier-proposed open/closed pages concept was already available.

Yeah, sorry about that. The open/closed pages concept is still
something I plan on working on. My prototype (which I never posted to
the list) will be rebased, and I'll try to target Postgres 16.

> > Freezing whole pages
> > ====================
>
> > It's possible that a higher cutoff (for example a cutoff of 80% of
> > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> > addition to the downsides from fragmentation -- it's far from a simple
> > trade-off. (Not that you should believe that 50% is special, it's just
> > a starting point for me.)
>
> How was the space utilization with the 50% cutoff in the TPC-C test?

The picture was mixed. To get the raw numbers, compare
pg-relation-sizes-after-patch-2.out and
pg-relation-sizes-after-master-2.out files from the drive link I
provided (to repeat, get them from
https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR)

Highlights: the largest table (the bmsql_order_line table) had a total
size of x1.006 relative to master, meaning that we did slightly worse
there. However, the index on the same table was slightly smaller
instead, probably because reducing heap fragmentation tends to make
the index deletion stuff work a bit better than before.

Certain small tables (bmsql_district and bmsql_warehouse) were
actually significantly smaller (less than half their size on master),
probably just because the patch can reliably remove LP_DEAD items from
heap pages, even when a cleanup lock isn't available.

The bmsql_new_order table was quite a bit larger, but it's not that
large anyway (1250 MB on master at the very end, versus 1433 MB with
the patch). This is a clear trade-off, since we get much less
fragmentation in the same table (as evidenced by the VACUUM output,
where there are fewer pages with any LP_DEAD items per VACUUM with the
patch). The workload for that table is characterized by inserting new
orders together, and deleting the same orders as a group later on. So
we're bound to pay a cost in space utilization to lower the
fragmentation.

> > blks_hit                 | 174,551,067,731
> > tup_fetched              | 124,797,772,450
>
> > Here is the same pg_stat_database info for master:
>
> > blks_hit                 | 283,015,966,386
> > tup_fetched              | 237,052,965,901
>
> That's impressive.

Thanks!

It's still possible to get a big improvement like that with something
like TPC-C because there are certain behaviors that are clearly
suboptimal -- once you look at the details of the workload, and
compare an imaginary ideal to the actual behavior of the system. In
particular, there is really only one way that the free space
management can work for the two big tables that will perform
acceptably -- the orders have to be stored in the same place to begin
with, and stay in the same place forever (at least to the extent that
that's possible).

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> When VACUUM sees that all remaining/unpruned tuples on a page are
> all-visible, it isn't just important because of cost control
> considerations. It's deeper than that. It's also treated as a
> tentative signal from the application itself, about the data itself.
> Which is: this page looks "settled" -- it may never be updated again,
> but if there is an update it likely won't change too much about the
> whole page.

While I agree that there's some case to be made for leaving settled
pages well enough alone, your criterion for settled seems pretty much
accidental. Imagine a system where there are two applications running,
A and B. Application A runs all the time and all the transactions
which it performs are short. Therefore, when a certain page is not
modified by transaction A for a short period of time, the page will
become all-visible and will be considered settled. Application B runs
once a month and performs various transactions all of which are long,
perhaps on a completely separate set of tables. While application B is
running, pages take longer to settle not only for application B but
also for application A. It doesn't make sense to say that the
application is in control of the behavior when, in reality, it may be
some completely separate application that is controlling the behavior.

> The application is in charge, really -- not VACUUM. This is already
> the case, whether we like it or not. VACUUM needs to learn to live in
> that reality, rather than fighting it. When VACUUM considers a page
> settled, and the physical page still has a relatively large amount of
> free space (say 45% of BLCKSZ, a borderline case in the new FSM
> patch), "losing" so much free space certainly is unappealing. We set
> the free space to 0 in the free space map all the same, because we're
> cutting our losses at that point. While the exact threshold I've
> proposed is tentative, the underlying theory seems pretty sound to me.
> The BLCKSZ/2 cutoff (and the way that it extends the general rules for
> whole-page freezing) is intended to catch pages that are qualitatively
> different, as well as quantitatively different. It is a balancing act,
> between not wasting space, and the risk of systemic problems involving
> excessive amounts of non-HOT updates that must move a successor
> version to another page.

I can see that this could have significant advantages under some
circumstances. But I think it could easily be far worse under other
circumstances. I mean, you can have workloads where you do some amount
of read-write work on a table and then go read only and sequential
scan it an infinite number of times. An algorithm that causes the
table to be smaller at the point where we switch to read-only
operations, even by a modest amount, wins infinitely over anything
else. But even if you have no change in the access pattern, is it a
good idea to allow the table to be, say, 5% larger if it means that
correlated data is colocated? In general, probably yes. If that means
that the table fails to fit in shared_buffers instead of fitting, no.
If that means that the table fails to fit in the OS cache instead of
fitting, definitely no.

And to me, that kind of effect is why it's hard to gain much
confidence in regards to stuff like this via laboratory testing. I
mean, I'm glad you're doing such tests. But in a laboratory test, you
tend not to have things like a sudden and complete change in the
workload, or a random other application sometimes sharing the machine,
or only being on the edge of running out of memory. I think in general
people tend to avoid such things in benchmarking scenarios, but even
if you include stuff like this, it's hard to know what to include that
would be representative of real life, because just about anything
*could* happen in real life.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
> While I agree that there's some case to be made for leaving settled
> pages well enough alone, your criterion for settled seems pretty much
> accidental.

I fully admit that I came up with the FSM heuristic with TPC-C in
mind. But you have to start somewhere.

Fortunately, the main benefit of this patch series (avoiding the
freeze cliff during anti-wraparound VACUUMs, often avoiding
anti-wraparound VACUUMs altogether) don't depend on the experimental
FSM patch at all. I chose to post that now because it seemed to help
with my more general point about qualitatively different pages, and
freezing at the page level.

> Imagine a system where there are two applications running,
> A and B. Application A runs all the time and all the transactions
> which it performs are short. Therefore, when a certain page is not
> modified by transaction A for a short period of time, the page will
> become all-visible and will be considered settled. Application B runs
> once a month and performs various transactions all of which are long,
> perhaps on a completely separate set of tables. While application B is
> running, pages take longer to settle not only for application B but
> also for application A. It doesn't make sense to say that the
> application is in control of the behavior when, in reality, it may be
> some completely separate application that is controlling the behavior.

Application B will already block pruning by VACUUM operations against
application A's table, and so effectively blocks recording of the
resultant free space in the FSM in your scenario. And so application A
and application B should be considered the same application already.
That's just how VACUUM works.

VACUUM isn't a passive observer of the system -- it's another
participant. It both influences and is influenced by almost everything
else in the system.

> I can see that this could have significant advantages under some
> circumstances. But I think it could easily be far worse under other
> circumstances. I mean, you can have workloads where you do some amount
> of read-write work on a table and then go read only and sequential
> scan it an infinite number of times. An algorithm that causes the
> table to be smaller at the point where we switch to read-only
> operations, even by a modest amount, wins infinitely over anything
> else. But even if you have no change in the access pattern, is it a
> good idea to allow the table to be, say, 5% larger if it means that
> correlated data is colocated? In general, probably yes. If that means
> that the table fails to fit in shared_buffers instead of fitting, no.
> If that means that the table fails to fit in the OS cache instead of
> fitting, definitely no.

5% larger seems like a lot more than would be typical, based on what
I've seen. I don't think that the regression in this scenario can be
characterized as "infinitely worse", or anything like it. On a long
enough timeline, the potential upside of something like this is nearly
unlimited -- it could avoid a huge amount of write amplification. But
the potential downside seems to be small and fixed -- which is the
point (bounding the downside). The mere possibility of getting that
big benefit (avoiding the costs from heap fragmentation) is itself a
benefit, even when it turns out not to pay off in your particular
case. It can be seen as insurance.

> And to me, that kind of effect is why it's hard to gain much
> confidence in regards to stuff like this via laboratory testing. I
> mean, I'm glad you're doing such tests. But in a laboratory test, you
> tend not to have things like a sudden and complete change in the
> workload, or a random other application sometimes sharing the machine,
> or only being on the edge of running out of memory. I think in general
> people tend to avoid such things in benchmarking scenarios, but even
> if include stuff like this, it's hard to know what to include that
> would be representative of real life, because just about anything
> *could* happen in real life.

Then what could you have confidence in?

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Application B will already block pruning by VACUUM operations against
> application A's table, and so effectively blocks recording of the
> resultant free space in the FSM in your scenario. And so application A
> and application B should be considered the same application already.
> That's just how VACUUM works.

Sure ... but that also sucks. If we consider application A and
application B to be the same application, then we're basing our
decision about what to do on information that is inaccurate.

> 5% larger seems like a lot more than would be typical, based on what
> I've seen. I don't think that the regression in this scenario can be
> characterized as "infinitely worse", or anything like it. On a long
> enough timeline, the potential upside of something like this is nearly
> unlimited -- it could avoid a huge amount of write amplification. But
> the potential downside seems to be small and fixed -- which is the
> point (bounding the downside). The mere possibility of getting that
> big benefit (avoiding the costs from heap fragmentation) is itself a
> benefit, even when it turns out not to pay off in your particular
> case. It can be seen as insurance.

I don't see it that way. There are cases where avoiding writes is
better, and cases where trying to cram everything into the fewest
possible pages is better. With the right test case you can make either
strategy look superior. What I think your test case has going for it
is that it is similar to something that a lot of people, really a ton
of people, actually do with PostgreSQL. However, it's not going to be
an accurate model of what everybody does, and therein lies some
element of danger.

> Then what could you have confidence in?

Real-world experience. Which is hard to get if we don't ever commit
any patches, but a good argument for (a) having them tested by
multiple different hackers who invent test cases independently and (b)
some configurability where we can reasonably include it, so that if
anyone does experience problems they have an escape.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Application B will already block pruning by VACUUM operations against
> > application A's table, and so effectively blocks recording of the
> > resultant free space in the FSM in your scenario. And so application A
> > and application B should be considered the same application already.
> > That's just how VACUUM works.
>
> Sure ... but that also sucks. If we consider application A and
> application B to be the same application, then we're basing our
> decision about what to do on information that is inaccurate.

I agree that it sucks, but I don't think that it's particularly
relevant to the FSM prototype patch that I included with v7 of the
patch series. A heap page cannot be considered "closed" (either in the
specific sense from the patch, or in any informal sense) when it has
recently dead tuples.

At some point we should invent a fallback path for pruning, that
migrates recently dead tuples to some other subsidiary structure,
retaining only forwarding information in the heap page. But even that
won't change what I just said about closed pages (it'll just make it
easier to return and fix things up later on).

> I don't see it that way. There are cases where avoiding writes is
> better, and cases where trying to cram everything into the fewest
> possible ages is better. With the right test case you can make either
> strategy look superior.

The cost of reads is effectively much lower than writes with modern
SSDs, in TCO terms. Plus when an FSM strategy like the one from the
patch does badly according to a naive measure such as total table
size, that in itself doesn't mean that we do worse with reads. In
fact, it's quite the opposite.

The benchmark showed that v7 of the patch did very slightly worse on
overall space utilization, but far, far better on reads. In fact, the
benefits for reads were far in excess of any efficiency gains for
writes/with WAL. The greatest bottleneck is almost always latency on
modern hardware [1]. It follows that keeping logically related data
grouped together is crucial. Far more important than potentially using
very slightly more space.

The story I wanted to tell with the FSM patch was about open and
closed pages being the right long term direction. More generally, we
should emphasize managing page-level costs, and deemphasize managing
tuple-level costs, which are much less meaningful.

> What I think your test case has going for it
> is that it is similar to something that a lot of people, really a ton
> of people, actually do with PostgreSQL. However, it's not going to be
> an accurate model of what everybody does, and therein lies some
> element of danger.

No question -- agreed.

> > Then what could you have confidence in?
>
> Real-world experience. Which is hard to get if we don't ever commit
> any patches, but a good argument for (a) having them tested by
> multiple different hackers who invent test cases independently and (b)
> some configurability where we can reasonably include it, so that if
> anyone does experience problems they have an escape.

I agree.

[1] https://dl.acm.org/doi/10.1145/1022594.1022596
-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Greg Stark
Дата:
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote:
>
> My emphasis here has been on making non-aggressive VACUUMs *always*
> advance relfrozenxid, outside of certain obvious edge cases. And so
> with all the patches applied, up to and including the opportunistic
> freezing patch, every autovacuum of every table manages to advance
> relfrozenxid during benchmarking -- usually to a fairly recent value.
> I've focussed on making aggressive VACUUMs (especially anti-wraparound
> autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> user keeps canceling autovacuums, maybe due to automated script that
> performs DDL). That has taken priority over other goals, for now.

While I've seen all the above cases triggering anti-wraparound cases,
by far the majority of the cases are not of these pathological forms.

By far the majority of anti-wraparound vacuums are triggered by tables
that are very large and so don't trigger regular vacuums for "long
periods" of time and consistently hit the anti-wraparound threshold
first.

There's nothing limiting how long "long periods" is and nothing tying
it to the rate of xid consumption. It's quite common to have some
*very* large mostly static tables in databases that have other tables
that are *very* busy.

The worst I've seen is a table that took 36 hours to vacuum in a
database that consumed about a billion transactions per day... That's
extreme but these days it's quite common to see tables that get
anti-wraparound vacuums every week or so despite having < 1% modified
tuples. And databases are only getting bigger and transaction rates
faster...


-- 
greg



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote:
> On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote:
> > My emphasis here has been on making non-aggressive VACUUMs *always*
> > advance relfrozenxid, outside of certain obvious edge cases. And so
> > with all the patches applied, up to and including the opportunistic
> > freezing patch, every autovacuum of every table manages to advance
> > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > user keeps canceling autovacuums, maybe due to automated script that
> > performs DDL). That has taken priority over other goals, for now.
>
> While I've seen all the above cases triggering anti-wraparound cases
> by far the majority of the cases are not of these pathological forms.

Right - it's practically inevitable that you'll need an
anti-wraparound VACUUM to advance relfrozenxid right now. Technically
it's possible to advance relfrozenxid in any VACUUM, but in practice
it just never happens on a large table. You only need to get unlucky
with one heap page, either by failing to get a cleanup lock, or (more
likely) by setting even one single page all-visible but not all-frozen
just once (once in any VACUUM that takes place between anti-wraparound
VACUUMs).

> By far the majority of anti-wraparound vacuums are triggered by tables
> that are very large and so don't trigger regular vacuums for "long
> periods" of time and consistently hit the anti-wraparound threshold
> first.

autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14,
but only if you tune autovacuum_freeze_min_age with that goal in mind.
Which probably doesn't happen very often.

> There's nothing limiting how long "long periods" is and nothing tying
> it to the rate of xid consumption. It's quite common to have some
> *very* large mostly static tables in databases that have other tables
> that are *very* busy.
>
> The worst I've seen is a table that took 36 hours to vacuum in a
> database that consumed about a billion transactions per day... That's
> extreme but these days it's quite common to see tables that get
> anti-wraparound vacuums every week or so despite having < 1% modified
> tuples. And databases are only getting bigger and transaction rates
> faster...

Sounds very much like what I've been calling the freezing cliff. An
anti-wraparound VACUUM throws things off by suddenly dirtying many
more pages than the expected amount for a VACUUM against the table,
despite there being no change in workload characteristics. If you just
had to remove the dead tuples in such a table, then it probably
wouldn't matter if it happened earlier than expected.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Right - it's practically inevitable that you'll need an
> anti-wraparound VACUUM to advance relfrozenxid right now. Technically
> it's possible to advance relfrozenxid in any VACUUM, but in practice
> it just never happens on a large table. You only need to get unlucky
> with one heap page, either by failing to get a cleanup lock, or (more
> likely) by setting even one single page all-visible but not all-frozen
> just once (once in any VACUUM that takes place between anti-wraparound
> VACUUMs).

Minor correction: That's a slight exaggeration, since we won't skip
groups of all-visible pages that don't exceed SKIP_PAGES_THRESHOLD
blocks (32 blocks).
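
In other words, as a simplified stand-alone model of the rule (the
threshold value matches vacuumlazy.c today, everything else here is
made up for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32     /* blocks, as in vacuumlazy.c today */

/*
 * A run of consecutive all-visible blocks is only skipped when it exceeds
 * SKIP_PAGES_THRESHOLD; shorter runs get scanned anyway, so a single
 * all-visible-but-not-all-frozen page doesn't necessarily cost a
 * non-aggressive VACUUM its chance to advance relfrozenxid.
 */
static bool
skip_allvisible_run(uint32_t run_length_in_blocks)
{
    return run_length_in_blocks > SKIP_PAGES_THRESHOLD;
}

int
main(void)
{
    printf("%d\n", skip_allvisible_run(8));    /* 0: scanned anyway */
    printf("%d\n", skip_allvisible_run(500));  /* 1: skipped */
    return 0;
}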

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote:
> By far the majority of anti-wraparound vacuums are triggered by tables
> that are very large and so don't trigger regular vacuums for "long
> periods" of time and consistently hit the anti-wraparound threshold
> first.

That's interesting, because my experience is different. Most of the
time when I get asked to look at a system, it turns out that there is
a prepared transaction or a forgotten replication slot and nobody
noticed until the system hit the wraparound threshold. Or occasionally
a long-running transaction or a failing/stuck vacuum that has the same
effect.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Fri, Feb 4, 2022 at 10:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > While I've seen all the above cases triggering anti-wraparound cases
> > by far the majority of the cases are not of these pathological forms.
>
> Right - it's practically inevitable that you'll need an
> anti-wraparound VACUUM to advance relfrozenxid right now. Technically
> it's possible to advance relfrozenxid in any VACUUM, but in practice
> it just never happens on a large table. You only need to get unlucky
> with one heap page, either by failing to get a cleanup lock, or (more
> likely) by setting even one single page all-visible but not all-frozen
> just once (once in any VACUUM that takes place between anti-wraparound
> VACUUMs).

But ... if I'm not mistaken, in the kind of case that Greg is
describing, relfrozenxid will be advanced exactly as often as it is
today. That's because, if VACUUM is only ever getting triggered by XID
age advancement and not by bloat, there's no opportunity for your
patch set to advance relfrozenxid any sooner than we're doing now. So
I think that people in this kind of situation will potentially be
helped or hurt by other things the patch set does, but the eager
relfrozenxid stuff won't make any difference for them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Mon, Feb 7, 2022 at 10:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But ... if I'm not mistaken, in the kind of case that Greg is
> describing, relfrozenxid will be advanced exactly as often as it is
> today.

But what happens today in a scenario like Greg's is pathological,
despite being fairly common (common in large DBs). It doesn't seem
informative to extrapolate too much from current experience for that
reason.

> That's because, if VACUUM is only ever getting triggered by XID
> age advancement and not by bloat, there's no opportunity for your
> patch set to advance relfrozenxid any sooner than we're doing now.

We must distinguish between:

1. "VACUUM is fundamentally never going to need to run unless it is
forced to, just to advance relfrozenxid" -- this applies to tables
like the stock and customers tables from the benchmark.

and:

2. "VACUUM must sometimes run to mark newly appended heap pages
all-visible, and maybe to also remove dead tuples, but not that often
-- and yet we currently only get expensive and inconveniently timed
anti-wraparound VACUUMs, no matter what" -- this applies to all the
other big tables in the benchmark, in particular to the orders and
order lines tables, but also to simpler cases like pgbench_history.

As I've said a few times now, the patch doesn't change anything for 1.
But Greg's problem tables very much sound like they're from category
2. And what we see with the master branch for such tables is that they
always get anti-wraparound VACUUMs, past a certain size (depends on
things like exact XID rate and VACUUM settings, the insert-driven
autovacuum scheduling stuff matters). The patch never reaches that
point in practice during my testing -- and doesn't come close.

It is true that in theory, as the size of one of these "category 2"
tables tends to infinity, the patch ends up behaving the same as
master anyway. But I'm pretty sure that that usually doesn't matter at
all, or matters less than you'd think. As I emphasized when presenting
the recent v7 TPC-C benchmark, neither of the two "TPC-C big problem
tables" (which are particularly interesting/tricky examples of tables
from category 2) come close to getting an anti-wraparound VACUUM
(plus, as I said in the same email, wouldn't matter if they did).

> So I think that people in this kind of situation will potentially be
> helped or hurt by other things the patch set does, but the eager
> relfrozenxid stuff won't make any difference for them.

To be clear, I think it would if everything was in place, including
the basic relfrozenxid advancement thing, plus the new freezing stuff
(though you wouldn't need the experimental FSM thing to get this
benefit).

Here is a thought experiment that may make the general idea a bit clearer:

Imagine I reran the same benchmark as before, with the same settings,
and the expectation that everything would be the same as first time
around for the patch series. But to make things more interesting, this
time I add an adversarial element: I add an adversarial gizmo that
burns XIDs steadily, without doing any useful work. This gizmo doubles
the rate of XID consumption for the database as a whole, perhaps by
calling "SELECT txid_current()" in a loop, followed by a timed sleep
(with a delay chosen with the goal of doubling XID consumption). I
imagine that this would also burn CPU cycles, but probably not enough
to make more than a noise level impact -- so we're severely stressing
the implementation by adding this gizmo, but the stress is precisely
targeted at XID consumption and related implementation details. It's a
pretty clean experiment. What happens now?
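
For concreteness, the gizmo might be nothing more than a tiny libpq
client along these lines (connection string and sleep interval are
placeholders to be tuned until XID consumption roughly doubles -- a
sketch of the thought experiment, not something I've actually built):

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        /* txid_current() forces allocation of a new XID for this xact */
        PGresult   *res = PQexec(conn, "SELECT txid_current()");

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);

        usleep(10 * 1000);      /* 10ms -- tune to hit the target rate */
    }

    PQfinish(conn);             /* not reached */
    return 0;
}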

I believe (though haven't checked for myself) that nothing important
would change. We'd still see the same VACUUM operations occur at
approximately the same times (relative to the start of the benchmark)
that we saw with the original benchmark, and each VACUUM operation
would do approximately the same amount of physical work on each
occasion. Of course, the autovacuum log output would show that the
OldestXmin for each individual VACUUM operation had larger values than
first time around for this newly initdb'd TPC-C database (purely as a
consequence of the XID burning gizmo), but it would *also* show
*concomitant* increases for our newly set relfrozenxid. The system
should therefore hardly behave differently at all compared to the
original benchmark run, despite this adversarial gizmo.

It's fair to wonder: okay, but what if it was 4x, 8x, 16x? What then?
That does get a bit more complicated, and we should get into why that
is. But for now I'll just say that I think that even that kind of
extreme would make much less difference than you might think -- since
relfrozenxid advancement has been qualitatively improved by the patch
series. It is especially likely that nothing would change if you were
willing to increase autovacuum_freeze_max_age to get a bit more
breathing room -- room to allow the autovacuums to run at their
"natural" times. You wouldn't necessarily have to go too far -- the
extra breathing room from increasing autovacuum_freeze_max_age buys
more wall clock time *between* any two successive "naturally timed
autovacuums". Again, a virtuous cycle.

Does that make sense? It's pretty subtle, admittedly, and you no doubt
have (very reasonable) concerns about the extremes, even if you accept
all that. I just want to get the general idea across here, as a
starting point for further discussion.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > That's because, if VACUUM is only ever getting triggered by XID
> > age advancement and not by bloat, there's no opportunity for your
> > patch set to advance relfrozenxid any sooner than we're doing now.
>
> We must distinguish between:
>
> 1. "VACUUM is fundamentally never going to need to run unless it is
> forced to, just to advance relfrozenxid" -- this applies to tables
> like the stock and customers tables from the benchmark.
>
> and:
>
> 2. "VACUUM must sometimes run to mark newly appended heap pages
> all-visible, and maybe to also remove dead tuples, but not that often
> -- and yet we currently only get expensive and inconveniently timed
> anti-wraparound VACUUMs, no matter what" -- this applies to all the
> other big tables in the benchmark, in particular to the orders and
> order lines tables, but also to simpler cases like pgbench_history.

It's not really very understandable for me when you refer to the way
table X behaves in Y benchmark, because I haven't studied that in
enough detail to know. If you say things like insert-only table, or a
continuous-random-updates table, or whatever the case is, it's a lot
easier to wrap my head around it.

> Does that make sense? It's pretty subtle, admittedly, and you no doubt
> have (very reasonable) concerns about the extremes, even if you accept
> all that. I just want to get the general idea across here, as a
> starting point for further discussion.

Not really. I think you *might* be saying tables which currently get
only wraparound vacuums will end up getting other kinds of vacuums
with your patch because things will improve enough for other tables in
the system that they will be able to get more attention than they do
currently. But I'm not sure I am understanding you correctly, and even
if I am I don't understand why that would be so, and even if it is I
think it doesn't help if essentially all the tables in the system are
suffering from the problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Mon, Feb 7, 2022 at 12:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > > That's because, if VACUUM is only ever getting triggered by XID
> > > age advancement and not by bloat, there's no opportunity for your
> > > patch set to advance relfrozenxid any sooner than we're doing now.
> >
> > We must distinguish between:
> >
> > 1. "VACUUM is fundamentally never going to need to run unless it is
> > forced to, just to advance relfrozenxid" -- this applies to tables
> > like the stock and customers tables from the benchmark.
> >
> > and:
> >
> > 2. "VACUUM must sometimes run to mark newly appended heap pages
> > all-visible, and maybe to also remove dead tuples, but not that often
> > -- and yet we currently only get expensive and inconveniently timed
> > anti-wraparound VACUUMs, no matter what" -- this applies to all the
> > other big tables in the benchmark, in particular to the orders and
> > order lines tables, but also to simpler cases like pgbench_history.
>
> It's not really very understandable for me when you refer to the way
> table X behaves in Y benchmark, because I haven't studied that in
> enough detail to know. If you say things like insert-only table, or a
> continuous-random-updates table, or whatever the case is, it's a lot
> easier to wrap my head around it.

What I've called category 2 tables are the vast majority of big tables
in practice. They include pure append-only tables, but also tables
that grow and grow from inserts, but also have some updates. The point
of the TPC-C order + order lines examples was to show how broad the
category really is. And how mixtures of inserts and bloat from updates
on one single table confuse the implementation in general.

> > Does that make sense? It's pretty subtle, admittedly, and you no doubt
> > have (very reasonable) concerns about the extremes, even if you accept
> > all that. I just want to get the general idea across here, as a
> > starting point for further discussion.
>
> Not really. I think you *might* be saying tables which currently get
> only wraparound vacuums will end up getting other kinds of vacuums
> with your patch because things will improve enough for other tables in
> the system that they will be able to get more attention than they do
> currently.

Yes, I am.

> But I'm not sure I am understanding you correctly, and even
> if I am I don't understand why that would be so, and even if it is I
> think it doesn't help if essentially all the tables in the system are
> suffering from the problem.

When I say "relfrozenxid advancement has been qualitatively improved
by the patch", what I mean is that the rate of relfrozenxid
advancement is now far closer to the theoretically
optimal rate for our current design, with freezing and with 32-bit
XIDs, and with the invariants for freezing.

Consider the extreme case, and generalize from there. The simple
append-only table case is the most obvious: the final relfrozenxid is
very close to OldestXmin (only tiny noise-level differences appear),
regardless of XID consumption by the system in general, and even
within the append-only table in particular. Other cases are somewhat
trickier, but have roughly the same quality, to a surprising degree.
Lots of things that never really should have affected relfrozenxid to
begin with now no longer do, for the first time.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Jan 29, 2022 at 8:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v7, a revision that overhauls the algorithm that decides
> what to freeze. I'm now calling it block-driven freezing in the commit
> message. Also included is a new patch, that makes VACUUM record zero
> free space in the FSM for an all-visible page, unless the total amount
> of free space happens to be greater than one half of BLCKSZ.

I pushed the earlier refactoring and instrumentation patches today.

Attached is v8. No real changes -- just a rebased version.

It will be easier to benchmark and test the page-driven freezing stuff
now, since the master/baseline case will now output instrumentation
showing how relfrozenxid has been advanced (if at all) -- whether (and
to what extent) each VACUUM operation advances relfrozenxid can now be
directly compared, just by monitoring the log_autovacuum_min_duration
output for a given table over time.

--
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 11, 2022 at 8:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v8. No real changes -- just a rebased version.

Concerns about my general approach to this project (and even the
Postgres 14 VACUUM work) were expressed by Robert and Andres over on
the "Nonrandom scanned_pages distorts pg_class.reltuples set by
VACUUM" thread. Some of what was said honestly shocked me. It now
seems unwise to pursue this project on my original timeline. I even
thought about shelving it indefinitely (which is still on the table).

I propose the following compromise: the least contentious patch alone
will be in scope for Postgres 15, while the other patches will not be.
I'm referring to the first patch from v8, which adds dynamic tracking
of the oldest extant XID in each heap table, in order to be able to
use it as our new relfrozenxid. I can't imagine that I'll have
difficulty convincing Andres of the merits of this idea, for one,
since it was his idea in the first place. It makes a lot of sense,
independent of any change to how and when we freeze.

The first patch is tricky, but at least it won't require elaborate
performance validation. It doesn't change any of the basic performance
characteristics of VACUUM. It sometimes allows us to advance
relfrozenxid to a value beyond FreezeLimit (typically only possible in
an aggressive VACUUM), which is an intrinsic good. If it isn't
effective then the overhead seems very unlikely to be noticeable. It's
pretty much a strictly additive improvement.

Are there any objections to this plan?

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Fri, Feb 18, 2022 at 3:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Concerns about my general approach to this project (and even the
> Postgres 14 VACUUM work) were expressed by Robert and Andres over on
> the "Nonrandom scanned_pages distorts pg_class.reltuples set by
> VACUUM" thread. Some of what was said honestly shocked me. It now
> seems unwise to pursue this project on my original timeline. I even
> thought about shelving it indefinitely (which is still on the table).
>
> I propose the following compromise: the least contentious patch alone
> will be in scope for Postgres 15, while the other patches will not be.
> I'm referring to the first patch from v8, which adds dynamic tracking
> of the oldest extant XID in each heap table, in order to be able to
> use it as our new relfrozenxid. I can't imagine that I'll have
> difficulty convincing Andres of the merits of this idea, for one,
> since it was his idea in the first place. It makes a lot of sense,
> independent of any change to how and when we freeze.
>
> The first patch is tricky, but at least it won't require elaborate
> performance validation. It doesn't change any of the basic performance
> characteristics of VACUUM. It sometimes allows us to advance
> relfrozenxid to a value beyond FreezeLimit (typically only possible in
> an aggressive VACUUM), which is an intrinsic good. If it isn't
> effective then the overhead seems very unlikely to be noticeable. It's
> pretty much a strictly additive improvement.
>
> Are there any objections to this plan?

I really like the idea of reducing the scope of what is being changed
here, and I agree that eagerly advancing relfrozenxid carries much
less risk than the other changes.

I'd like to have a clearer idea of exactly what is in each of the
remaining patches before forming a final opinion.

What's tricky about 0001? Does it change any other behavior, either as
a necessary component of advancing relfrozenxid more eagerly, or
otherwise?

If there's a way you can make the precise contents of 0002 and 0003
more clear, I would like that, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 18, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'd like to have a clearer idea of exactly what is in each of the
> remaining patches before forming a final opinion.

Great.

> What's tricky about 0001? Does it change any other behavior, either as
> a necessary component of advancing relfrozenxid more eagerly, or
> otherwise?

It does not change any other behavior. It's totally mechanical.

0001 is tricky in the sense that there are a lot of fine details, and
if you get any one of them wrong the result might be a subtle bug. For
example, the heap_tuple_needs_freeze() code path is only used when we
cannot get a cleanup lock, which is rare -- and some of the branches
within the function are relatively rare themselves. The obvious
concern is: What if some detail of how we track the new relfrozenxid
value (and new relminmxid value) in this seldom-hit codepath is just
wrong, in whatever way we didn't think of?

On the other hand, we must already be precise in almost the same way
within heap_tuple_needs_freeze() today -- it's not all that different
(we currently need to avoid leaving any XIDs < FreezeLimit behind,
which isn't made much less complicated by the fact that it's a static
XID cutoff). Plus, we have experience with bugs like this. There was
hardening added to catch stuff like this back in 2017, following the
"freeze the dead" bug.

> If there's a way you can make the precise contents of 0002 and 0003
> more clear, I would like that, too.

The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible()
thing) wasn't on the table before now. 0002 is the patch that changes
the basic criteria for freezing, making it block-based rather than
based on the FreezeLimit cutoff (barring edge cases that are important
for correctness, but shouldn't noticeably affect freezing overhead).

The single biggest practical improvement from 0002 is that it
eliminates what I've called the freeze cliff, which is where many old
tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be
frozen all at once, in a balloon payment during an eventual aggressive
VACUUM. Although it's easy to see that that could be useful, it is
harder to justify (much harder) than anything else. Because we're
freezing more eagerly overall, we're also bound to do more freezing
without benefit in certain cases. Although I think that this can be
justified as the cost of doing business, that's a hard argument to
make.

In short, 0001 is mechanically tricky, but easy to understand at a
high level. Whereas 0002 is mechanically simple, but tricky to
understand at a high level (and therefore far trickier than 0001
overall).

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Fri, Feb 18, 2022 at 4:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It does not change any other behavior. It's totally mechanical.
>
> 0001 is tricky in the sense that there are a lot of fine details, and
> if you get any one of them wrong the result might be a subtle bug. For
> example, the heap_tuple_needs_freeze() code path is only used when we
> cannot get a cleanup lock, which is rare -- and some of the branches
> within the function are relatively rare themselves. The obvious
> concern is: What if some detail of how we track the new relfrozenxid
> value (and new relminmxid value) in this seldom-hit codepath is just
> wrong, in whatever way we didn't think of?

Right. I think we have no choice but to accept such risks if we want
to make any progress here, and every patch carries them to some
degree. I hope that someone else will review this patch in more depth
than I have just now, but what I notice reading through it is that
some of the comments seem pretty opaque. For instance:

+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation.  Assumption is that

"maintains" is fuzzy. I think you should be saying something much more
explicit, and the thing you are saying should make it clear that these
arguments are input-output arguments: i.e. the caller must set them
correctly before calling this function, and they will be updated by
the function. I don't think you have to spell all of that out in every
place where this comes up in the patch, but it needs to be clear from
what you do say. For example, I would be happier with a comment that
said something like "Every call to this function will either set
HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
it is currently newer than that. Thus, after a series of calls to this
function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
values in the tuples examined. Before calling this function, caller
should initialize *NewRelfrozenxid to <something>."

+                        * Changing nothing, so might have to ratchet
back NewRelminmxid,
+                        * NewRelfrozenxid, or both together

This comment I like.

+                        * New multixact might have remaining XID older than
+                        * NewRelfrozenxid

This one's good, too.

+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation.  Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should.  If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid.  (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).

I don't understand this one.

+        * (Actually, we maintain NewRelminmxid differently here, because we
+        * assume that XIDs that should be frozen according to cutoff_xid won't
+        * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)

This one either.

I haven't really grokked exactly what is happening in
heap_tuple_needs_freeze yet, and may not have time to study it further
in the near future. Not saying it's wrong, although improving the
comments above would likely help me out.

> > If there's a way you can make the precise contents of 0002 and 0003
> > more clear, I would like that, too.
>
> The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible()
> thing) wasn't on the table before now. 0002 is the patch that changes
> the basic criteria for freezing, making it block-based rather than
> based on the FreezeLimit cutoff (barring edge cases that are important
> for correctness, but shouldn't noticeably affect freezing overhead).
>
> The single biggest practical improvement from 0002 is that it
> eliminates what I've called the freeze cliff, which is where many old
> tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be
> frozen all at once, in a balloon payment during an eventual aggressive
> VACUUM. Although it's easy to see that that could be useful, it is
> harder to justify (much harder) than anything else. Because we're
> freezing more eagerly overall, we're also bound to do more freezing
> without benefit in certain cases. Although I think that this can be
> justified as the cost of doing business, that's a hard argument to
> make.

You've used the term "freezing cliff" repeatedly in earlier emails,
and this is the first time I've been able to understand what you
meant. I'm glad I do, now.

But can you describe the algorithm that 0002 uses to accomplish this
improvement? Like "if it sees that the page meets criteria X, then it
freezes all tuples on the page, else if it sees that that individual
tuples on the page meet criteria Y, then it freezes just those." And
like explain what of that is same/different vs. now.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-18 13:09:45 -0800, Peter Geoghegan wrote:
> 0001 is tricky in the sense that there are a lot of fine details, and
> if you get any one of them wrong the result might be a subtle bug. For
> example, the heap_tuple_needs_freeze() code path is only used when we
> cannot get a cleanup lock, which is rare -- and some of the branches
> within the function are relatively rare themselves. The obvious
> concern is: What if some detail of how we track the new relfrozenxid
> value (and new relminmxid value) in this seldom-hit codepath is just
> wrong, in whatever way we didn't think of?

I think it'd be good to add a few isolationtest cases for the
can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The
slightly harder part is verifying that VACUUM did something reasonable, but
that still should be doable?

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-18 15:54:19 -0500, Robert Haas wrote:
> > Are there any objections to this plan?
> 
> I really like the idea of reducing the scope of what is being changed
> here, and I agree that eagerly advancing relfrozenxid carries much
> less risk than the other changes.

Sounds good to me too!

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
> + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
> + * target relfrozenxid and relminmxid for the relation.  Assumption is that
>
> "maintains" is fuzzy. I think you should be saying something much more
> explicit, and the thing you are saying should make it clear that these
> arguments are input-output arguments: i.e. the caller must set them
> correctly before calling this function, and they will be updated by
> the function.

Makes sense.

> I don't think you have to spell all of that out in every
> place where this comes up in the patch, but it needs to be clear from
> what you do say. For example, I would be happier with a comment that
> said something like "Every call to this function will either set
> HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
> argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
> it is currently newer than that. Thus, after a series of calls to this
> function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
> values in the tuples examined. Before calling this function, caller
> should initialize *NewRelfrozenxid to <something>."

We have to worry about XIDs from MultiXacts (and xmax values more
generally). And we have to worry about the case where we start out
with only xmin frozen (by an earlier VACUUM), and then have to freeze
xmax too. I believe that we have to generally consider xmin and xmax
independently. For example, we cannot ignore xmax, just because we
looked at xmin, since in general xmin alone might have already been
frozen.

> + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
> + * target relfrozenxid and relminmxid for the relation.  Assumption is that
> + * caller will never freeze any of the XIDs from the tuple, even when we say
> + * that they should.  If caller opts to go with our recommendation to freeze,
> + * then it must account for the fact that it shouldn't trust how we've set
> + * NewRelfrozenxid/NewRelminmxid.  (In practice aggressive VACUUMs always take
> + * our recommendation because they must, and non-aggressive VACUUMs always opt
> + * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
>
> I don't understand this one.
>
> +        * (Actually, we maintain NewRelminmxid differently here, because we
> +        * assume that XIDs that should be frozen according to cutoff_xid won't
> +        * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
>
> This one either.

The difference between the cleanup lock path (in
lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
of these confusing comment blocks, really. Note that cutoff_xid is the
name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
have for FreezeLimit (maybe we should rename every occurrence of
cutoff_xid in heapam.c to FreezeLimit).

At a high level, we aren't changing the fundamental definition of an
aggressive VACUUM in any of the patches -- we still need to advance
relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
HEAD, today (we may be able to advance it *past* FreezeLimit, but
that's just a bonus). But in a non-aggressive VACUUM, where there is
still no strict requirement to advance relfrozenxid (by any amount),
the code added by 0001 can set relfrozenxid to any known safe value,
which could either be from before FreezeLimit, or after FreezeLimit --
almost anything is possible (provided we respect the relfrozenxid
invariant, and provided we see that we didn't skip any
all-visible-not-all-frozen pages).

Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
the aggressive case might need to wait for a full cleanup lock the
hard way, having tried and failed to do it the easy way within
lazy_scan_noprune (lazy_scan_noprune will still return false when the
call to heap_tuple_needs_freeze for any tuple returns true) -- same
as on HEAD, today.

And so the difference at issue here is: FreezeLimit/cutoff_xid only
needs to affect the new NewRelfrozenxid value we use for relfrozenxid in
heap_prepare_freeze_tuple, which is involved in real freezing -- not
in heap_tuple_needs_freeze, whose main purpose is still to help us
avoid freezing where a cleanup lock isn't immediately available. The
purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze is
only to determine its bool return value, which will only be of interest
to the aggressive case (which might have to get a cleanup lock and do
it the hard way), not the non-aggressive case (where ratcheting back
NewRelfrozenxid is generally possible, and generally leaves us with
almost as good a value).

In other words: the calls to heap_tuple_needs_freeze made from
lazy_scan_noprune are simply concerned with the page as it actually
is, whereas the similar/corresponding calls to
heap_prepare_freeze_tuple from lazy_scan_prune are concerned with
*what the page will actually become*, after freezing finishes, and
after lazy_scan_prune is done with the page entirely (ultimately
the final NewRelfrozenxid value set in pg_class.relfrozenxid only has
to be <= the oldest extant XID *at the time the VACUUM operation is
just about to end*, not some earlier time, so "being versus becoming"
is an interesting distinction for us).
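
To make the ratcheting idea concrete, here's a minimal sketch of the
pattern (illustrative only -- the helper name is made up, and the real
patch tracks MultiXactIds in a second, analogous value):

#include "postgres.h"
#include "access/transam.h"

/*
 * Illustrative sketch only: the caller initializes *NewRelfrozenxid to a
 * known-safe upper bound, and every unfrozen XID that will remain in the
 * table can only pull the tracked value backwards, never forwards.
 */
static inline void
MaybeRatchetBackNewRelfrozenxid(TransactionId xid,
                                TransactionId *NewRelfrozenxid)
{
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, *NewRelfrozenxid))
        *NewRelfrozenxid = xid;
}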

Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
here, to make all of this less confusing. I only now fully realized
how confusing all of this stuff is -- very.

> I haven't really grokked exactly what is happening in
> heap_tuple_needs_freeze yet, and may not have time to study it further
> in the near future. Not saying it's wrong, although improving the
> comments above would likely help me out.

Definitely needs more polishing.

> You've used the term "freezing cliff" repeatedly in earlier emails,
> and this is the first time I've been able to understand what you
> meant. I'm glad I do, now.

Ugh. I thought that a snappy term like that would catch on quickly. Guess not!

> But can you describe the algorithm that 0002 uses to accomplish this
> improvement? Like "if it sees that the page meets criteria X, then it
> freezes all tuples on the page, else if it sees that that individual
> tuples on the page meet criteria Y, then it freezes just those." And
> like explain what of that is same/different vs. now.

The mechanics themselves are quite simple (again, understanding the
implications is the hard part). The approach taken within 0002 is
still rough, to be honest, but wouldn't take long to clean up (there
are XXX/FIXME comments about this in 0002).

As a general rule, we try to freeze all of the remaining live tuples
on a page (following pruning) together, as a group, or none at all.
Most of the time this is triggered by our noticing that the page is
about to be set all-visible (but not all-frozen), and doing work
sufficient to mark it fully all-frozen instead. Occasionally there is
FreezeLimit to consider, which is now more of a backstop thing, used
to make sure that we never get too far behind in terms of unfrozen
XIDs. This is useful in part because it helps us avoid ending up with
a future non-aggressive VACUUM that is fundamentally unable to advance
relfrozenxid (in practice you can't skip any all-visible-but-not-all-frozen
pages when every page set in the VM is all-frozen, so nothing stops a
non-aggressive VACUUM from advancing relfrozenxid).
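
As a sketch only (hypothetical names, not the actual 0002 code), the
per-page decision amounts to something like this:

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch of the block-driven trigger described above: freeze all of the
 * page's remaining live tuples as a group, or freeze none of them.
 */
static bool
page_should_be_frozen_as_a_unit(bool will_become_all_visible,
                                TransactionId oldest_unfrozen_xid,
                                TransactionId FreezeLimit)
{
    /* Main trigger: set the page all-frozen, not merely all-visible */
    if (will_become_all_visible)
        return true;

    /* Backstop: never fall too far behind on unfrozen XIDs */
    if (TransactionIdIsNormal(oldest_unfrozen_xid) &&
        TransactionIdPrecedes(oldest_unfrozen_xid, FreezeLimit))
        return true;

    return false;
}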

We're generally doing a lot more freezing with 0002, but we still
manage to avoid freezing too much in tables like pgbench_tellers or
pgbench_branches -- tables where it makes the least sense. Such tables
will be updated so frequently that VACUUM is relatively unlikely to
ever mark any page all-visible, which implicitly avoids the main
trigger for freezing. It's also unlikely that they'll ever have an XID
old enough to trigger the fallback FreezeLimit-style criteria for freezing.

In practice, freezing tuples like this is generally not that expensive in
most tables where VACUUM freezes the majority of pages immediately
(tables that aren't like pgbench_tellers or pgbench_branches), because
they're generally big tables, where the overhead of FPIs tends
to dominate anyway (gambling that we can avoid more FPIs later on is not a
bad gamble, as gambles go). This seems to make the overhead
acceptable, on balance. Granted, you might be able to poke holes in
that argument, and reasonable people might disagree on what acceptable
should mean. There are many value judgements here, which makes it
complicated. (On the other hand, we might be able to do better here if
a particularly bad case for the 0002 work came to light.)

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 18, 2022 at 2:11 PM Andres Freund <andres@anarazel.de> wrote:
> I think it'd be good to add a few isolationtest cases for the
> can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The
> slightly harder part is verifying that VACUUM did something reasonable, but
> that still should be doable?

We could even just extend existing, related tests, from vacuum-reltuples.spec.

Another testing strategy occurs to me: we could stress-test the
implementation by simulating an environment where the no-cleanup-lock
path is hit an unusually large number of times, possibly a fixed
percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
ConditionalLockBufferForCleanup() call return false randomly. Now that
we have lazy_scan_noprune for the no-cleanup-lock path (which is as
similar to the regular lazy_scan_prune path as possible), I wouldn't
expect this ConditionalLockBufferForCleanup() testing gizmo to be too
disruptive.
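
Roughly like this, for illustration (the wrapper name is invented; the
real gizmo would just be a temporary hack at vacuumlazy.c's
ConditionalLockBufferForCleanup() call site):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Simulate cleanup lock acquisition failing a fixed percentage of the time */
#define SIMULATED_CLEANUP_LOCK_FAILURE_PCT 2

static bool
ConditionalLockBufferForCleanupAdversarial(Buffer buf)
{
    /* Pretend that somebody else holds a conflicting pin, some of the time */
    if (random() % 100 < SIMULATED_CLEANUP_LOCK_FAILURE_PCT)
        return false;

    return ConditionalLockBufferForCleanup(buf);
}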

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Another testing strategy occurs to me: we could stress-test the
> implementation by simulating an environment where the no-cleanup-lock
> path is hit an unusually large number of times, possibly a fixed
> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
> ConditionalLockBufferForCleanup() call return false randomly. Now that
> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
> similar to the regular lazy_scan_prune path as possible), I wouldn't
> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
> disruptive.

I tried this out, using the attached patch. It was quite interesting,
even when run against HEAD. I think that I might have found a bug on
HEAD, though I'm not really sure.

If you modify the patch to simulate conditions under which
ConditionalLockBufferForCleanup() fails about 2% of the time, you get
much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
without it being so aggressive as to make "make check-world" fail --
which is exactly what I expected. If you are much more aggressive
about it, and make it 50% instead (which you can get just by using the
patch as written), then some tests will fail, mostly for reasons that
aren't surprising or interesting (e.g. plan changes). This is also
what I'd have guessed would happen.

However, it gets more interesting. One thing that I did not expect to
happen at all also happened (with the current 50% rate of simulated
ConditionalLockBufferForCleanup() failure from the patch): if I run
"make check" from the pg_surgery directory, then the Postgres backend
gets stuck in an infinite loop inside lazy_scan_prune, which has been
a symptom of several tricky bugs in the past year (not every time, but
usually). Specifically, the VACUUM statement launched by the SQL
command "vacuum freeze htab2;" from the file
contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
misbehavior.

This is a temp table, which is a choice made by the tests specifically
because they need to "use a temp table so that vacuum behavior doesn't
depend on global xmin". This is convenient way of avoiding spurious
regression tests failures (e.g. from autoanalyze), and relies on the
GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
94bc27b5.

It's quite possible that this is nothing more than a bug in my
adversarial gizmo patch -- since I don't think that
ConditionalLockBufferForCleanup() can ever fail with a temp buffer
(though even that's not completely clear right now). Even if the
behavior that I saw does not indicate a bug on HEAD, it still seems
informative. At the very least, it wouldn't hurt to Assert() that the
target table isn't a temp table inside lazy_scan_noprune, documenting
our assumptions around temp tables and
ConditionalLockBufferForCleanup().
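
For example, something like this at the top of lazy_scan_noprune (just
a sketch of the suggested assertion; vacrel is vacuumlazy.c's LVRelState
for the target relation):

/*
 * Sketch of the suggested hardening: we don't ever expect a cleanup lock
 * attempt on a temp buffer to fail, so we should never reach the
 * no-cleanup-lock path for a temp table.
 */
Assert(!RelationUsesLocalBuffers(vacrel->rel));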

I haven't actually tried to debug the issue just yet, so take all this
with a grain of salt.

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

(On phone, so crappy formatting and no source access)

On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
>On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Another testing strategy occurs to me: we could stress-test the
>> implementation by simulating an environment where the no-cleanup-lock
>> path is hit an unusually large number of times, possibly a fixed
>> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
>> ConditionalLockBufferForCleanup() call return false randomly. Now that
>> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
>> similar to the regular lazy_scan_prune path as possible), I wouldn't
>> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
>> disruptive.
>
>I tried this out, using the attached patch. It was quite interesting,
>even when run against HEAD. I think that I might have found a bug on
>HEAD, though I'm not really sure.
>
>If you modify the patch to simulate conditions under which
>ConditionalLockBufferForCleanup() fails about 2% of the time, you get
>much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
>without it being so aggressive as to make "make check-world" fail --
>which is exactly what I expected. If you are much more aggressive
>about it, and make it 50% instead (which you can get just by using the
>patch as written), then some tests will fail, mostly for reasons that
>aren't surprising or interesting (e.g. plan changes). This is also
>what I'd have guessed would happen.
>
>However, it gets more interesting. One thing that I did not expect to
>happen at all also happened (with the current 50% rate of simulated
>ConditionalLockBufferForCleanup() failure from the patch): if I run
>"make check" from the pg_surgery directory, then the Postgres backend
>gets stuck in an infinite loop inside lazy_scan_prune, which has been
>a symptom of several tricky bugs in the past year (not every time, but
>usually). Specifically, the VACUUM statement launched by the SQL
>command "vacuum freeze htab2;" from the file
>contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
>misbehavior.


>This is a temp table, which is a choice made by the tests specifically
>because they need to "use a temp table so that vacuum behavior doesn't
>depend on global xmin". This is convenient way of avoiding spurious
>regression tests failures (e.g. from autoanalyze), and relies on the
>GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
>94bc27b5.

We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be
surprised if a cleanup lock failing would cause some odd behavior.

>It's quite possible that this is nothing more than a bug in my
>adversarial gizmo patch -- since I don't think that
>ConditionalLockBufferForCleanup() can ever fail with a temp buffer
>(though even that's not completely clear right now). Even if the
>behavior that I saw does not indicate a bug on HEAD, it still seems
>informative. At the very least, it wouldn't hurt to Assert() that the
>target table isn't a temp table inside lazy_scan_noprune, documenting
>our assumptions around temp tables and
>ConditionalLockBufferForCleanup().

Definitely worth looking into more.


This reminds me of a recent thing I noticed in the aio patch. Spgist can end up busy looping when buffers are locked,
instead of blocking. Not actually related, of course.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It's quite possible that this is nothing more than a bug in my
> adversarial gizmo patch -- since I don't think that
> ConditionalLockBufferForCleanup() can ever fail with a temp buffer
> (though even that's not completely clear right now). Even if the
> behavior that I saw does not indicate a bug on HEAD, it still seems
> informative.

This very much looks like a bug in pg_surgery itself now -- attached
is a draft fix.

The temp table thing was a red herring. I found I could get exactly
the same kind of failure when htab2 was a permanent table (which was
how it originally appeared, before commit 0811f766fd made it into a
temp table due to test flappiness issues). The relevant "vacuum freeze
htab2" happens at a point after the test has already deliberately
corrupted one of its tuples using heap_force_kill().  It's not that we
aren't careful enough about the corruption at some point in
vacuumlazy.c, which was my second theory. But I quickly discarded that
idea, and came up with a third theory: the relevant heap_surgery.c
path does the relevant ItemIdSetDead() to kill items, without also
defragmenting the page to remove the tuples with storage, which is
wrong.

This meant that we depended on pruning happening (in this case during
VACUUM) and defragmenting the page in passing. But there is no reason
to not defragment the page within pg_surgery (at least no obvious
reason), since we have a cleanup lock anyway.
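
The gist of what I have in mind is just this (a sketch of the idea,
not the attached patch; the function name is made up, and it ignores
WAL logging and buffer dirtying for brevity):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Sketch only: after forcibly setting the target items LP_DEAD, defragment
 * the page so that no dead tuple storage is left behind -- the same state
 * that pruning would have left the page in.
 */
static void
force_kill_items_and_defragment(Page page, OffsetNumber *offnums, int noffs)
{
    for (int i = 0; i < noffs; i++)
        ItemIdSetDead(PageGetItemId(page, offnums[i]));

    PageRepairFragmentation(page);
}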

Theoretically you could blame this on lazy_scan_noprune instead, since
it thinks it can collect LP_DEAD items while assuming that they have
no storage, but that doesn't make much sense to me. There has never
been any way of setting a heap item to LP_DEAD without also
defragmenting the page.  Since that's exactly what it means to prune a
heap page. (Actually, the same used to be true about heap vacuuming,
which worked more like heap pruning before Postgres 14, but that
doesn't seem important.)

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This very much looks like a bug in pg_surgery itself now -- attached
> is a draft fix.

Wait, that's not it either. I jumped the gun -- this isn't sufficient
(though the patch I posted might not be a bad idea anyway).

Looks like pg_surgery isn't processing HOT chains as whole units,
which it really should (at least in the context of killing items via
the heap_force_kill() function). Killing a root item in a HOT chain is
just hazardous -- disconnected/orphaned heap-only tuples are liable to
cause chaos, and should be avoided everywhere (including during
pruning, and within pg_surgery).

It's likely that the hardening I already planned on adding to pruning
[1] (as follow-up work to recent bugfix commit 18b87b201f) will
prevent lazy_scan_prune from getting stuck like this, whatever the
cause happens to be. The actual page image I see lazy_scan_prune choke
on (i.e. exhibit the same infinite loop unpleasantness we've seen
before on) is not in a consistent state at all (its tuples consist of
tuples from a single HOT chain, and the HOT chain is totally
inconsistent on account of having an LP_DEAD line pointer root item).
pg_surgery could in principle do the right thing here by always
treating HOT chains as whole units.

Leaving behind disconnected/orphaned heap-only tuples is pretty much
pointless anyway, since they'll never be accessible by index scans.
Even after a REINDEX, since there is no root item from the heap page
to go in the index. (A dump and restore might work better, though.)

[1] https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com
-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-19 17:22:33 -0800, Peter Geoghegan wrote:
> Looks like pg_surgery isn't processing HOT chains as whole units,
> which it really should (at least in the context of killing items via
> the heap_force_kill() function). Killing a root item in a HOT chain is
> just hazardous -- disconnected/orphaned heap-only tuples are liable to
> cause chaos, and should be avoided everywhere (including during
> pruning, and within pg_surgery).

How does that cause the endless loop?

It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
me. So something must have changed with your patch?


> It's likely that the hardening I already planned on adding to pruning
> [1] (as follow-up work to recent bugfix commit 18b87b201f) will
> prevent lazy_scan_prune from getting stuck like this, whatever the
> cause happens to be.

Yea, we should pick that up again. Not just for robustness or
performance. Also because it's just a lot easier to understand.


> Leaving behind disconnected/orphaned heap-only tuples is pretty much
> pointless anyway, since they'll never be accessible by index scans.
> Even after a REINDEX, since there is no root item from the heap page
> to go in the index. (A dump and restore might work better, though.)

Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as minimal work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> How does that cause the endless loop?

Attached is the page image itself, dumped via gdb (and gzip'd). This
was on recent HEAD (commit 8f388f6f, actually), plus
0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
defragmenting in pg_surgery, nothing like that.

> It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
> me. So something needs have changed with your patch?

It doesn't always happen -- only about half the time on my machine.
Maybe it's timing sensitive?

We hit the "goto retry" on offnum 2, which is the first tuple with
storage (you can see "the ghost" of the tuple from the LP_DEAD item at
offnum 1, since the page isn't defragmented in pg_surgery). I think
that this happens because the heap-only tuple at offnum 2 is fully
DEAD to lazy_scan_prune, but hasn't been recognized as such by
heap_page_prune. There is no way that they'll ever "agree" on the
tuple being DEAD right now, because pruning still doesn't assume that
an orphaned heap-only tuple is fully DEAD.

We can either do that, or we can throw an error concerning corruption
when heap_page_prune notices orphaned tuples. Neither seems
particularly appealing. But it definitely makes no sense to allow
lazy_scan_prune to spin in a futile attempt to reach agreement with
heap_page_prune about a DEAD tuple really being DEAD.

> Given that heap_surgery's raison d'etre is correcting corruption etc, I think
> it makes sense for it to do as minimal work as possible. Iterating through a
> HOT chain would be a problem if you e.g. tried to repair a page with HOT
> corruption.

I guess that's also true. There is at least a legitimate argument to
be made for not leaving behind any orphaned heap-only tuples. The
interface is a TID, and so the user may already believe that they're
killing the heap-only, not just the root item (since ctid suggests
that the TID of a heap-only tuple is the TID of the root item, which
is kind of misleading).

Anyway, we can decide on what to do in heap_surgery later, once the
main issue is under control. My point was mostly just that orphaned
heap-only tuples are definitely not okay, in general. They are the
least worst option when corruption has already happened, maybe -- but
maybe not.

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> > How does that cause the endless loop?
> 
> Attached is the page image itself, dumped via gdb (and gzip'd). This
> was on recent HEAD (commit 8f388f6f, actually), plus
> 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
> defragmenting in pg_surgery, nothing like that.

> > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
> > me. So something needs have changed with your patch?
> 
> It doesn't always happen -- only about half the time on my machine.
> Maybe it's timing sensitive?

Ah, I'd only run the tests three times or so, without it happening. Trying a
few more times repro'd it.


It's kind of surprising that this needs the
0001-Add-adversarial-ConditionalLockBuff patch to break. I suspect it's a question
of hint bits changing due to lazy_scan_noprune(), which then makes
HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
"If the tuple is DEAD and doesn't chain to anything else"
path from being taken.


> We hit the "goto retry" on offnum 2, which is the first tuple with
> storage (you can see "the ghost" of the tuple from the LP_DEAD item at
> offnum 1, since the page isn't defragmented in pg_surgery). I think
> that this happens because the heap-only tuple at offnum 2 is fully
> DEAD to lazy_scan_prune, but hasn't been recognized as such by
> heap_page_prune. There is no way that they'll ever "agree" on the
> tuple being DEAD right now, because pruning still doesn't assume that
> an orphaned heap-only tuple is fully DEAD.

> We can either do that, or we can throw an error concerning corruption
> when heap_page_prune notices orphaned tuples. Neither seems
> particularly appealing. But it definitely makes no sense to allow
> lazy_scan_prune to spin in a futile attempt to reach agreement with
> heap_page_prune about a DEAD tuple really being DEAD.

Yea, this sucks. I think we should go for the rewrite of the
heap_prune_chain() logic. The current approach is just never going to be
robust.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
> > We can either do that, or we can throw an error concerning corruption
> > when heap_page_prune notices orphaned tuples. Neither seems
> > particularly appealing. But it definitely makes no sense to allow
> > lazy_scan_prune to spin in a futile attempt to reach agreement with
> > heap_page_prune about a DEAD tuple really being DEAD.
>
> Yea, this sucks. I think we should go for the rewrite of the
> heap_prune_chain() logic. The current approach is just never going to be
> robust.

No, it just isn't robust enough. But it's not that hard to fix. My
patch really wasn't invasive.

I confirmed that HeapTupleSatisfiesVacuum() and
heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum
2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no
reason to think that there is a new bug involved). The problem here is
indeed just that heap_prune_chain() can't "get to" the tuple, given
its current design.

For anybody else that doesn't follow what we're talking about:

The "doesn't chain to anything else" code at the start of
heap_prune_chain() won't get to the heap-only tuple at offnum 2, since
the tuple is itself HeapTupleHeaderIsHotUpdated() -- the expectation
is that it'll be processed later on, once we locate the HOT chain's
root item. Since, of course, the "root item" was already LP_DEAD
before we even reached heap_page_prune() (on account of the pg_surgery
corruption), there is no possible way that that can happen later on.
And so we cannot find the same heap-only tuple and mark it LP_UNUSED
(which is how we always deal with HEAPTUPLE_DEAD heap-only tuples)
during pruning.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
> It's kind of surprising that this needs this
> 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question
> of hint bits changing due to lazy_scan_noprune(), which then makes
> HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
> "If the tuple is DEAD and doesn't chain to anything else"
> path from being taken.

That makes sense as an explanation. Goes to show just how fragile the
"DEAD and doesn't chain to anything else" logic at the top of
heap_prune_chain really is.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-19 19:07:39 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
> > > We can either do that, or we can throw an error concerning corruption
> > > when heap_page_prune notices orphaned tuples. Neither seems
> > > particularly appealing. But it definitely makes no sense to allow
> > > lazy_scan_prune to spin in a futile attempt to reach agreement with
> > > heap_page_prune about a DEAD tuple really being DEAD.
> >
> > Yea, this sucks. I think we should go for the rewrite of the
> > heap_prune_chain() logic. The current approach is just never going to be
> > robust.
> 
> No, it just isn't robust enough. But it's not that hard to fix. My
> patch really wasn't invasive.

I think we're in agreement there. We might think at some point about
backpatching too, but I'd rather have it stew in HEAD for a bit first.


> I confirmed that HeapTupleSatisfiesVacuum() and
> heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum
> 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no
> reason to think that there is a new bug involved). The problem here is
> indeed just that heap_prune_chain() can't "get to" the tuple, given
> its current design.

Right.

The reason that the "adversarial" patch makes a difference is solely that it
changes the heap_surgery test to actually kill an item, which it doesn't
intend:

create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;

-- redirected TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);


If the vacuum can get the cleanup lock (despite the adversarial patch),
heap_force_kill() doesn't do anything, because the first item is a
redirect. However, if it *can't* get a cleanup lock, heap_force_kill()
instead targets the root item, triggering the endless loop.


Hm. I think this might be a mild regression in 14. In < 14 we'd just skip the
tuple in lazy_scan_heap(), but now we have an uninterruptible endless
loop.


We'd do completely bogus stuff later in < 14 though; I think we'd just leave
it in place despite being older than relfrozenxid, which obviously has its own
set of issues.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Given that heap_surgery's raison d'etre is correcting corruption etc, I think
> > it makes sense for it to do as minimal work as possible. Iterating through a
> > HOT chain would be a problem if you e.g. tried to repair a page with HOT
> > corruption.
>
> I guess that's also true. There is at least a legitimate argument to
> be made for not leaving behind any orphaned heap-only tuples. The
> interface is a TID, and so the user may already believe that they're
> killing the heap-only, not just the root item (since ctid suggests
> that the TID of a heap-only tuple is the TID of the root item, which
> is kind of misleading).

Actually, I would say that heap_surgery's raison d'etre is making
weird errors related to corruption of this or that TID go away, so
that the user can cut their losses. That's how it's advertised.

Let's assume that we don't want to make VACUUM/pruning just treat
orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
status -- let's say that we want to err in the direction of doing
nothing at all with the page. Now we have to have a weird error in
VACUUM instead (not great, but better than just spinning between
lazy_scan_prune and heap_page_prune). And we've just created natural
demand for heap_surgery to deal with the problem by deleting whole HOT
chains (not just root items).

If we allow VACUUM to treat orphaned heap-only tuples as DEAD right
away, then we might as well do the same thing in heap_surgery, since
there is little chance that the user will get to the heap-only tuples
before VACUUM does (not something to rely on, at any rate).

Either way, I think we probably end up needing to teach heap_surgery
to kill entire HOT chains as a group, given a TID.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 7:28 PM Andres Freund <andres@anarazel.de> wrote:
> If the vacuum can get the cleanup lock due to the adversarial patch, the
> heap_force_kill() doesn't do anything, because the first item is a
> redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead
> targets the root item. Triggering the endless loop.

But it shouldn't matter if the root item is an LP_REDIRECT or a normal
(not heap-only) tuple with storage. Either way it's the root of a HOT
chain.

The fact that pg_surgery treats LP_REDIRECT items differently from the
other kind of root items is just arbitrary. It seems to have more to
do with freezing tuples than killing tuples.


--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-19 19:31:21 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think
> > > it makes sense for it to do as minimal work as possible. Iterating through a
> > > HOT chain would be a problem if you e.g. tried to repair a page with HOT
> > > corruption.
> >
> > I guess that's also true. There is at least a legitimate argument to
> > be made for not leaving behind any orphaned heap-only tuples. The
> > interface is a TID, and so the user may already believe that they're
> > killing the heap-only, not just the root item (since ctid suggests
> > that the TID of a heap-only tuple is the TID of the root item, which
> > is kind of misleading).
> 
> Actually, I would say that heap_surgery's raison d'etre is making
> weird errors related to corruption of this or that TID go away, so
> that the user can cut their losses. That's how it's advertised.

I'm not that sure those are that different... Imagine some corruption leading
to two hot chains ending in the same tid, which our fancy new secure pruning
algorithm might detect.

Either way, I'm a bit surprised about the logic to not allow killing redirect
items? What if you have a redirect pointing to an unused item?


> Let's assume that we don't want to make VACUUM/pruning just treat
> orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
> status

I don't think that'd ever be a good idea. Those tuples are visible to a
seqscan after all.


> -- let's say that we want to err in the direction of doing
> nothing at all with the page. Now we have to have a weird error in
> VACUUM instead (not great, but better than just spinning between
> lazy_scan_prune and heap_page_prune).

Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
problem here is a DEAD orphaned HOT tuples, and those we should be able to
delete with the new page pruning logic, right?


I think it might be worth getting rid of the need for the retry approach by
reusing the same HTSV status array between heap_page_prune and
lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
lazy_scan_prune() would be some form of corruption.  And it'd be a pretty
decent performance boost, HTSV ain't cheap.
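
Something roughly like this, interface-wise (hypothetical struct, just
to illustrate the shape of it; the real interface would need more
thought):

#include "postgres.h"
#include "access/htup_details.h"

/*
 * Hypothetical sketch: heap_page_prune() fills in a caller-supplied array
 * with each offset's HeapTupleSatisfiesVacuum() result, so lazy_scan_prune()
 * can reuse it instead of recomputing HTSV for every tuple.
 */
typedef struct PruneHTSVResult
{
    int         ndeleted;       /* number of tuples deleted by pruning */

    /* One entry per offset number; -1 for unused/redirect/dead items */
    int8        htsv[MaxHeapTuplesPerPage + 1];
} PruneHTSVResult;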

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not that sure those are that different... Imagine some corruption leading
> to two hot chains ending in the same tid, which our fancy new secure pruning
> algorithm might detect.

I suppose that's possible, but it doesn't seem all that likely to ever
happen, what with the xmin -> xmax cross-tuple matching stuff.

> Either way, I'm a bit surprised about the logic to not allow killing redirect
> items? What if you have a redirect pointing to an unused item?

Again, I simply think it boils down to having to treat HOT chains as a
whole unit when killing TIDs.

> > Let's assume that we don't want to make VACUUM/pruning just treat
> > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
> > status
>
> I don't think that'd ever be a good idea. Those tuples are visible to a
> seqscan after all.

I agree (I don't hate it completely, but it seems mostly bad). This is
what leads me to the conclusion that pg_surgery has to be able to do
this instead. Surely it's not okay to have something that makes VACUUM
always end in error, that cannot even be fixed by pg_surgery.

> > -- let's say that we want to err in the direction of doing
> > nothing at all with the page. Now we have to have a weird error in
> > VACUUM instead (not great, but better than just spinning between
> > lazy_scan_prune and heap_page_prune).
>
> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
> problem here is a DEAD orphaned HOT tuples, and those we should be able to
> delete with the new page pruning logic, right?

Right. But what good does that really do? The problematic page had a
third tuple (at offnum 3) that was LIVE. If we could have done
something about the problematic tuple at offnum 2 (which is where we
got stuck), then we'd still be left with a very unpleasant choice
about what happens to the third tuple.

> I think it might be worth getting rid of the need for the retry approach by
> reusing the same HTSV status array between heap_page_prune and
> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
> lazy_scan_prune() would be some form of corruption.  And it'd be a pretty
> decent performance boost, HTSV ain't cheap.

I guess it doesn't actually matter if we leave an aborted DEAD tuple
behind, that we could have pruned away, but didn't. The important
thing is to be consistent at the level of the page.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On February 19, 2022 7:56:53 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
>On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
>> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
>> problem here is a DEAD orphaned HOT tuples, and those we should be able to
>> delete with the new page pruning logic, right?
>
>Right. But what good does that really do? The problematic page had a
>third tuple (at offnum 3) that was LIVE. If we could have done
>something about the problematic tuple at offnum 2 (which is where we
>got stuck), then we'd still be left with a very unpleasant choice
>about what happens to the third tuple.

Why does anything need to happen to it from vacuum's POV?  It'll not be a problem for freezing etc. Until it's deleted
vacuum doesn't need to care.

Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue.


>> I think it might be worth getting rid of the need for the retry approach by
>> reusing the same HTSV status array between heap_page_prune and
>> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
>> lazy_scan_prune() would be some form of corruption.  And it'd be a pretty
>> decent performance boost, HTSV ain't cheap.
>
>I guess it doesn't actually matter if we leave an aborted DEAD tuple
>behind, that we could have pruned away, but didn't. The important
>thing is to be consistent at the level of the page.

That's not ok, because it opens up dangers of being interpreted differently after wraparound etc.

But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status
array?

Andres


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sat, Feb 19, 2022 at 8:21 PM Andres Freund <andres@anarazel.de> wrote:
> Why does anything need to happen to it from vacuum's POV?  It'll not be a problem for freezing etc. Until it's
deletedvacuum doesn't need to care.
 
>
> Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just
> continue.

Maybe that's true, but it's just really weird to imagine not having an
LP_REDIRECT that points to the LIVE item here, without throwing an
error. Seems kind of iffy, to say the least.

> >I guess it doesn't actually matter if we leave an aborted DEAD tuple
> >behind, that we could have pruned away, but didn't. The important
> >thing is to be consistent at the level of the page.
>
> That's not ok, because it opens up dangers of being interpreted differently after wraparound etc.
>
> But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status
> array?

Right. Fundamentally, there isn't any reason why it should matter that
VACUUM reached the heap page just before (rather than concurrent with
or just after) some xact that inserted or updated on the page aborts.
Just as long as we have a consistent idea about what's going on at the
level of the whole page (or maybe the level of each HOT chain, but the
whole page level seems simpler to me).

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Sat, Feb 19, 2022 at 8:54 PM Andres Freund <andres@anarazel.de> wrote:
> > Leaving behind disconnected/orphaned heap-only tuples is pretty much
> > pointless anyway, since they'll never be accessible by index scans.
> > Even after a REINDEX, since there is no root item from the heap page
> > to go in the index. (A dump and restore might work better, though.)
>
> Given that heap_surgery's raison d'etre is correcting corruption etc, I think
> it makes sense for it to do as minimal work as possible. Iterating through a
> HOT chain would be a problem if you e.g. tried to repair a page with HOT
> corruption.

Yeah, I agree. I don't have time to respond to all of these emails
thoroughly right now, but I think it's really important that
pg_surgery do the exact surgery the user requested, and not any other
work. I don't think that page defragmentation should EVER be REQUIRED
as a condition of other work. If other code is relying on that, I'd
say it's busted. I'm a little more uncertain about the case where we
kill the root tuple of a HOT chain, because I can see that this might
leave the page a state where sequential scans see the tuple at the end
of the chain and index scans don't. I'm not sure whether that should
be the responsibility of pg_surgery itself to avoid, or whether that's
your problem as a user of it -- although I lean mildly toward the
latter view, at the moment. But in any case surely the pruning code
can't just decide to go into an infinite loop if that happens. Code
that manipulates the states of data pages needs to be as robust
against arbitrary on-disk states as we can reasonably make it, because
pages get garbled on disk all the time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Fri, Feb 18, 2022 at 7:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> We have to worry about XIDs from MultiXacts (and xmax values more
> generally). And we have to worry about the case where we start out
> with only xmin frozen (by an earlier VACUUM), and then have to freeze
> xmax too. I believe that we have to generally consider xmin and xmax
> independently. For example, we cannot ignore xmax, just because we
> looked at xmin, since in general xmin alone might have already been
> frozen.

Right, so we at least need to add a similar comment to what I proposed
for MXIDs, and maybe other changes are needed, too.

> The difference between the cleanup lock path (in
> lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
> lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
> of these confusing comment blocks, really. Note that cutoff_xid is the
> name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
> have for FreezeLimit (maybe we should rename every occurrence of
> cutoff_xid in heapam.c to FreezeLimit).
>
> At a high level, we aren't changing the fundamental definition of an
> aggressive VACUUM in any of the patches -- we still need to advance
> relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
> HEAD, today (we may be able to advance it *past* FreezeLimit, but
> that's just a bonus). But in a non-aggressive VACUUM, where there is
> still no strict requirement to advance relfrozenxid (by any amount),
> the code added by 0001 can set relfrozenxid to any known safe value,
> which could either be from before FreezeLimit, or after FreezeLimit --
> almost anything is possible (provided we respect the relfrozenxid
> invariant, and provided we see that we didn't skip any
> all-visible-not-all-frozen pages).
>
> Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
> the aggressive case might need to wait for a full cleanup lock the
> hard way, having tried and failed to do it the easy way within
> lazy_scan_noprune (lazy_scan_noprune will still return false when any
> call to heap_tuple_needs_freeze for any tuple returns true) -- same
> as on HEAD, today.
>
> And so the difference at issue here is: FreezeLimit/cutoff_xid only
> needs to affect the new NewRelfrozenxid value we use for relfrozenxid in
> heap_prepare_freeze_tuple, which is involved in real freezing -- not
> in heap_tuple_needs_freeze, whose main purpose is still to help us
> avoid freezing where a cleanup lock isn't immediately available. While
> the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze
> is to determine its bool return value, which will only be of interest
> to the aggressive case (which might have to get a cleanup lock and do
> it the hard way), not the non-aggressive case (where ratcheting back
> NewRelfrozenxid is generally possible, and generally leaves us with
> almost as good of a value).
>
> In other words: the calls to heap_tuple_needs_freeze made from
> lazy_scan_noprune are simply concerned with the page as it actually
> is, whereas the similar/corresponding calls to
> heap_prepare_freeze_tuple from lazy_scan_prune are concerned with
> *what the page will actually become*, after freezing finishes, and
> after lazy_scan_prune is done with the page entirely (ultimately
> the final NewRelfrozenxid value set in pg_class.relfrozenxid only has
> to be <= the oldest extant XID *at the time the VACUUM operation is
> just about to end*, not some earlier time, so "being versus becoming"
> is an interesting distinction for us).
>
> Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
> here, to make all of this less confusing. I only now fully realized
> how confusing all of this stuff is -- very.

Right. I think I understand all of this, or at least most of it -- but
not from the comment. The question is how the comment can be more
clear. My general suggestion is that function header comments should
have more to do with the behavior of the function than how it fits
into the bigger picture. If it's clear to the reader what conditions
must hold before calling the function and which must hold on return,
it helps a lot. IMHO, it's the job of the comments in the calling
function to clarify why we then choose to call that function at the
place and in the way that we do.

> As a general rule, we try to freeze all of the remaining live tuples
> on a page (following pruning) together, as a group, or none at all.
> Most of the time this is triggered by our noticing that the page is
> about to be set all-visible (but not all-frozen), and doing work
> sufficient to mark it fully all-frozen instead. Occasionally there is
> FreezeLimit to consider, which is now more of a backstop thing, used
> to make sure that we never get too far behind in terms of unfrozen
> XIDs. This is useful in part because it avoids any future
> non-aggressive VACUUM that is fundamentally unable to advance
> relfrozenxid (you can't skip all-visible pages if there are only
> all-frozen pages in the VM in practice).
>
> We're generally doing a lot more freezing with 0002, but we still
> manage to avoid freezing too much in tables like pgbench_tellers or
> pgbench_branches -- tables where it makes the least sense. Such tables
> will be updated so frequently that VACUUM is relatively unlikely to
> ever mark any page all-visible, avoiding the main criteria for
> freezing implicitly. It's also unlikely that they'll ever have an XID old
> enough to trigger the fallback FreezeLimit-style criterion for freezing.
>
> In practice, freezing tuples like this is generally not that expensive in
> most tables where VACUUM freezes the majority of pages immediately
> (tables that aren't like pgbench_tellers or pgbench_branches), because
> they're generally big tables, where the overhead of FPIs tends
> to dominate anyway (gambling that we can avoid more FPIs later on is not a
> bad gamble, as gambles go). This seems to make the overhead
> acceptable, on balance. Granted, you might be able to poke holes in
> that argument, and reasonable people might disagree on what acceptable
> should mean. There are many value judgements here, which makes it
> complicated. (On the other hand we might be able to do better if there
> was a particularly bad case for the 0002 work, if one came to light.)

I think that the idea has potential, but I don't think that I
understand yet what the *exact* algorithm is. Maybe I need to read the
code, when I have some time for that. I can't form an intelligent
opinion at this stage about whether this is likely to be a net
positive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:

On Sun, Feb 20, 2022 at 7:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Right, so we at least need to add a similar comment to what I proposed
> for MXIDs, and maybe other changes are needed, too.

Agreed.

> > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
> > here, to make all of this less confusing. I only now fully realized
> > how confusing all of this stuff is -- very.
>
> Right. I think I understand all of this, or at least most of it -- but
> not from the comment. The question is how the comment can be more
> clear. My general suggestion is that function header comments should
> have more to do with the behavior of the function than how it fits
> into the bigger picture. If it's clear to the reader what conditions
> must hold before calling the function and which must hold on return,
> it helps a lot. IMHO, it's the job of the comments in the calling
> function to clarify why we then choose to call that function at the
> place and in the way that we do.

You've given me a lot of high quality feedback on all of this, which
I'll work through soon. It's hard to get the balance right here, but
it's made much easier by this kind of feedback.

> I think that the idea has potential, but I don't think that I
> understand yet what the *exact* algorithm is.

The algorithm seems to exploit a natural tendency that Andres once
described in a blog post about his snapshot scalability work [1]. To a
surprising extent, we can usefully bucket all tuples/pages into two
simple categories:

1. Very, very old ("infinitely old" for all practical purposes).

2. Very, very new.

There doesn't seem to be much need for a third "in-between" category
in practice. This seems to be at least approximately true all of the
time.

Perhaps Andres wouldn't agree with this very general statement -- he
actually said something more specific. I for one believe that the
point he made generalizes surprisingly well, though. I have my own
theories about why this appears to be true. (Executive summary: power
laws are weird, and it seems as if the sparsity-of-effects principle
makes it easy to bucket things at the highest level, in a way that
generalizes well across disparate workloads.)

> Maybe I need to read the
> code, when I have some time for that. I can't form an intelligent
> opinion at this stage about whether this is likely to be a net
> positive.

The code in the v8-0002 patch is a bit sloppy right now. I didn't
quite get around to cleaning it up -- I was focussed on performance
validation of the algorithm itself. So bear that in mind if you do
look at v8-0002 (might want to wait for v9-0002 before looking).

I believe that the only essential thing about the algorithm itself is
that it freezes all the tuples on a page when it anticipates setting
the page all-visible, or (barring edge cases) freezes none at all.
(Note that setting the page all-visible/all-frozen may happen just
after lazy_scan_prune returns, or in the second pass over the heap,
after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't
care which way it will happen.)
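
In rough C, the decision looks something like this (a sketch only --
the struct and function are invented for illustration, they're not
code from the patch):

    #include <stdbool.h>

    /* Simplified stand-in for per-page state that lazy_scan_prune tracks */
    typedef struct PageFreezeState
    {
        bool    will_mark_all_visible;  /* page becomes all-visible after pruning? */
        bool    backstop_triggered;     /* saw an XID/MXID older than FreezeLimit? */
        int     nfrozen;                /* tuples with something to freeze */
    } PageFreezeState;

    static bool
    should_freeze_page(const PageFreezeState *ps)
    {
        if (ps->nfrozen == 0)
            return false;               /* nothing freezing-eligible here */
        if (ps->backstop_triggered)
            return true;                /* FreezeLimit acts as a backstop */
        /* Otherwise freeze only when the page will become all-visible anyway */
        return ps->will_mark_all_visible;
    }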

There are one or two other design choices that we need to make, like
what exact tuples we freeze in the edge case where FreezeLimit/XID age
forces us to freeze in lazy_scan_prune. These other design choices
don't seem relevant to the issue of central importance, which is
whether or not we come out ahead overall with this new algorithm.
FreezeLimit will seldom affect our choice to freeze or not freeze now,
and so AFAICT the exact way that FreezeLimit affects which precise
freezing-eligible tuples we freeze doesn't complicate performance
validation.

Remember when I got excited about how my big TPC-C benchmark run
showed a predictable, tick/tock style pattern across VACUUM operations
against the order and order lines table [2]? It seemed very
significant to me that the OldestXmin of VACUUM operation n
consistently went on to become the new relfrozenxid for the same table
in VACUUM operation n + 1. It wasn't exactly the same XID, but very
close to it (within the range of noise). This pattern was clearly
present, even though VACUUM operation n + 1 might happen as long as 4
or 5 hours after VACUUM operation n (this was a big table).

This pattern was encouraging to me because it showed (at least for the
workload and tables in question) that the amount of unnecessary extra
freezing can't have been too bad -- the fact that we can always
advance relfrozenxid in the same way is evidence of that. Note that
the vacuum_freeze_min_age setting can't have affected our choice of
what to freeze (given what we see in the logs), and yet there is a
clear pattern where the pages (it's really pages, not tuples) that the
new algorithm doesn't freeze in VACUUM operation n will reliably get
frozen in VACUUM operation n + 1 instead.

And so this pattern seems to lend support to the general idea of
letting the workload itself be the primary driver of what pages we
freeze (not FreezeLimit, and not anything based on XIDs). That's
really the underlying principle behind the new algorithm -- freezing
is driven by workload characteristics (or page/block characteristics,
if you prefer). ISTM that vacuum_freeze_min_age is almost impossible
to tune -- XID age is just too squishy a concept for that to ever
work.

[1]
https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon
[2] https://postgr.es/m/CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com
-- scroll down to "TPC-C", which has the relevant autovacuum log
output for the orders table, covering a 24 hour period

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sun, Feb 20, 2022 at 12:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> You've given me a lot of high quality feedback on all of this, which
> I'll work through soon. It's hard to get the balance right here, but
> it's made much easier by this kind of feedback.

Attached is v9. Lots of changes. Highlights:

* Much improved 0001 ("loosen coupling" dynamic relfrozenxid tracking
patch). Some of the improvements are due to recent feedback from
Robert.

* Much improved 0002 ("Make page-level characteristics drive freezing"
patch). Whole new approach to the implementation, though the same
algorithm as before.

* No more FSM patch -- that was totally separate work, that I
shouldn't have attached to this project.

* There are 2 new patches (these are now 0003 and 0004), both of which
are concerned with allowing non-aggressive VACUUM to consistently
advance relfrozenxid. I think that 0003 makes sense on general
principle, but I'm much less sure about 0004. These aren't too
important.

While working on the new approach to freezing taken by v9-0002, I had
some insight about the issues that Robert raised around 0001, too. I
wasn't expecting that to happen.

0002 makes page-level freezing a first class thing.
heap_prepare_freeze_tuple now has some (limited) knowledge of how this
works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
helps us to respect FreezeLimit as a backstop, and so now it's passed
via the new backstop_cutoff_xid argument instead. Whenever we opt to
"freeze a page", the new page-level algorithm *always* uses the most
recent possible XID and MXID values (OldestXmin and oldestMxact) to
decide what XIDs/XMIDs need to be replaced. That might sound like it'd
be too much, but it only applies to those pages that we actually
decide to freeze (since page-level characteristics drive everything
now). FreezeLimit is only one way of triggering that now (and one of
the least interesting and rarest).
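
To illustrate how the two cutoffs relate (a toy only, with simplified
types and invented names -- not the patch's heap_prepare_freeze_tuple):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t ToyXid;            /* simplified stand-in for TransactionId */

    static bool
    toy_xid_precedes(ToyXid a, ToyXid b)
    {
        return (int32_t) (a - b) < 0;   /* modulo-2^32, like real XID comparisons */
    }

    /*
     * cutoff_xid plays the role of OldestXmin: if the page ends up frozen,
     * any xmin/xmax older than it gets replaced.  backstop_cutoff_xid plays
     * the role of FreezeLimit: anything older than it forces the caller to
     * freeze the page.
     */
    static void
    toy_prepare_freeze(ToyXid xmin, ToyXid xmax,
                       ToyXid cutoff_xid, ToyXid backstop_cutoff_xid,
                       bool *freeze_xmin, bool *freeze_xmax,
                       bool *force_freeze)
    {
        *freeze_xmin = toy_xid_precedes(xmin, cutoff_xid);
        *freeze_xmax = toy_xid_precedes(xmax, cutoff_xid);

        if (toy_xid_precedes(xmin, backstop_cutoff_xid) ||
            toy_xid_precedes(xmax, backstop_cutoff_xid))
            *force_freeze = true;       /* backstop: page must be frozen */
    }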

0002 also adds an alternative set of relfrozenxid/relminmxid tracker
variables, to make the "don't freeze the page" path within
lazy_scan_prune simpler (if you don't want to freeze the page, then
use the set of tracker variables that go with that choice, which
heap_prepare_freeze_tuple knows about and helps with). With page-level
freezing, lazy_scan_prune wants to make a decision about the page as a
whole, at the last minute, after all heap_prepare_freeze_tuple calls
have already been made. So I think that heap_prepare_freeze_tuple
needs to know about that aspect of lazy_scan_prune's behavior.

When we *don't* want to freeze the page, we more or less need
everything related to freezing inside lazy_scan_prune to behave like
lazy_scan_noprune, which never freezes the page (that's mostly the
point of lazy_scan_noprune). And that's almost what we actually do --
heap_prepare_freeze_tuple now outsources maintenance of this
alternative set of "don't freeze the page" relfrozenxid/relminmxid
tracker variables to its sibling function, heap_tuple_needs_freeze.
That is the same function that lazy_scan_noprune itself actually
calls.
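
Schematically, the two tracker pairs look like this (invented names,
just to show the shape -- each page's trackers start out copied from
the relation-level values and only ever ratchet backwards):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t ToyXid;            /* simplified stand-ins */
    typedef uint32_t ToyMxid;

    typedef struct ToyPageTrackers
    {
        /* values to adopt if the page ends up being frozen */
        ToyXid      new_relfrozenxid;
        ToyMxid     new_relminmxid;
        /* values to adopt if the page is left unfrozen */
        ToyXid      new_relfrozenxid_nofreeze;
        ToyMxid     new_relminmxid_nofreeze;
    } ToyPageTrackers;

    /* After the page-level decision, the caller keeps the matching pair */
    static void
    commit_page_trackers(bool froze_page, const ToyPageTrackers *page,
                         ToyXid *rel_xid, ToyMxid *rel_mxid)
    {
        *rel_xid = froze_page ? page->new_relfrozenxid
                              : page->new_relfrozenxid_nofreeze;
        *rel_mxid = froze_page ? page->new_relminmxid
                               : page->new_relminmxid_nofreeze;
    }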

Now back to Robert's feedback on 0001, which had very complicated
comments in the last version. This approach seems to make the "being
versus becoming" or "going to freeze versus not going to freeze"
distinctions much clearer. This is less true if you assume that 0002
won't be committed but 0001 will be. Even if that happens with
Postgres 15, I have to imagine that adding something like 0002 must be
the real goal, long term. Without 0002, the value from 0001 is far
more limited. You need both together to get the virtuous cycle I've
described.

The approach with always using OldestXmin as cutoff_xid and
oldestMxact as our cutoff_multi makes a lot of sense to me, in part
because I think that it might well cut down on the tendency of VACUUM
to allocate new MultiXacts in order to be able to freeze old ones.
AFAICT the only reason that heap_prepare_freeze_tuple does that is
because it has no flexibility on FreezeLimit and MultiXactCutoff.
These are derived from vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age, respectively, and so they're two
independent though fairly meaningless cutoffs. On the other hand,
OldestXmin and OldestMxact are not independent in the same way. We get
both of them at the same time and the same place, in
vacuum_set_xid_limits. OldestMxact really is very close to OldestXmin
-- only the units differ.

It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
freezing old ones) in large part so it can NOT freeze XIDs that it
would have been useful (and much cheaper) to remove anyway. On HEAD,
FreezeMultiXactId() doesn't get passed down the VACUUM operation's
OldestXmin at all (it actually just gets FreezeLimit passed as its
cutoff_xid argument). It cannot possibly recognize any of this for
itself.
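
As a toy illustration of what passing OldestXmin down would buy us
(simplified types, invented names -- the real logic in
FreezeMultiXactId() is far more involved than this):

    #include <stdint.h>

    typedef uint32_t ToyXid;            /* simplified stand-in */

    static int
    toy_xid_precedes(ToyXid a, ToyXid b)
    {
        return (int32_t) (a - b) < 0;   /* modulo-2^32 comparison */
    }

    /*
     * Any member XID older than OldestXmin cannot still be running, so a
     * lock-only member like that could simply be discarded when freezing
     * the page, instead of being carried into a newly allocated MultiXact.
     */
    static int
    toy_count_droppable_lockers(const ToyXid *members, int nmembers,
                                ToyXid oldest_xmin)
    {
        int     ndroppable = 0;

        for (int i = 0; i < nmembers; i++)
        {
            if (toy_xid_precedes(members[i], oldest_xmin))
                ndroppable++;
        }
        return ndroppable;
    }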

Does that theory about MultiXacts sound plausible? I'm not claiming
that the patch makes it impossible that FreezeMultiXactId() will have
to allocate a new MultiXact to freeze during VACUUM -- the
freeze-the-dead isolation tests already show that that's not true. I
just think that page-level freezing based on page characteristics with
oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff)
cutoffs might make it a lot less likely in practice. oldestXmin and
oldestMxact map to the same wall clock time, more or less -- that
seems like it might be an important distinction, independent of
everything else.

Thanks
--
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote:
> 0002 makes page-level freezing a first class thing.
> heap_prepare_freeze_tuple now has some (limited) knowledge of how this
> works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
> the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
> still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
> helps us to respect FreezeLimit as a backstop, and so now it's passed
> via the new backstop_cutoff_xid argument instead.

I am not a fan of the backstop terminology. It's still the reason we need to
do freezing for correctness reasons. It'd make more sense to me to turn it
around and call the "non-backstop" freezing opportunistic freezing or such.


> Whenever we opt to
> "freeze a page", the new page-level algorithm *always* uses the most
> recent possible XID and MXID values (OldestXmin and oldestMxact) to
> decide what XIDs/XMIDs need to be replaced. That might sound like it'd
> be too much, but it only applies to those pages that we actually
> decide to freeze (since page-level characteristics drive everything
> now). FreezeLimit is only one way of triggering that now (and one of
> the least interesting and rarest).

That largely makes sense to me and doesn't seem weird.

I'm a tad concerned about replacing mxids that have some members that are
older than OldestXmin but not older than FreezeLimit. It's not too hard to
imagine this accelerating mxid consumption considerably.  But we can probably,
if not already done, special case that.


> It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
> freezing old ones) in large part so it can NOT freeze XIDs that it
> would have been useful (and much cheaper) to remove anyway.

Well, we may have to allocate a new mxid because some members are older than
FreezeLimit but others are still running. When do we not remove xids that
would have been cheaper to remove once we decide to actually do work?


> On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
> OldestXmin at all (it actually just gets FreezeLimit passed as its
> cutoff_xid argument). It cannot possibly recognize any of this for itself.

It does recognize something like OldestXmin in a more precise and expensive
way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().


> Does that theory about MultiXacts sound plausible? I'm not claiming
> that the patch makes it impossible that FreezeMultiXactId() will have
> to allocate a new MultiXact to freeze during VACUUM -- the
> freeze-the-dead isolation tests already show that that's not true. I
> just think that page-level freezing based on page characteristics with
> oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff)
> cutoffs might make it a lot less likely in practice.

Hm. I guess I'll have to look at the code for it. It doesn't immediately
"feel" quite right.


> oldestXmin and oldestMxact map to the same wall clock time, more or less --
> that seems like it might be an important distinction, independent of
> everything else.

Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
removable (without creating a newer multi) until much later than its creation
time. So I don't think that's really true.



> From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 22 Nov 2021 10:02:30 -0800
> Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing.
>
> When VACUUM set relfrozenxid before now, it set it to whatever value was
> used to determine which tuples to freeze -- the FreezeLimit cutoff.
> This approach was very naive: the relfrozenxid invariant only requires
> that new relfrozenxid values be <= the oldest extant XID remaining in
> the table (at the point that the VACUUM operation ends), which in
> general might be much more recent than FreezeLimit.  There is no fixed
> relationship between the amount of physical work performed by VACUUM to
> make it safe to advance relfrozenxid (freezing and pruning), and the
> actual number of XIDs that relfrozenxid can be advanced by (at least in
> principle) as a result.  VACUUM might have to freeze all of the tuples
> from a hundred million heap pages just to enable relfrozenxid to be
> advanced by no more than one or two XIDs.  On the other hand, VACUUM
> might end up doing little or no work, and yet still be capable of
> advancing relfrozenxid by hundreds of millions of XIDs as a result.
>
> VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
> extant XID (and oldest extant MultiXactId) from the table, including
> XIDs from the table's remaining/unfrozen MultiXacts.  This requires that
> VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
> This optimization doesn't require any changes to the definition of
> relfrozenxid, nor does it require changes to the core design of
> freezing.


> Final relfrozenxid values must still be >= FreezeLimit in an aggressive
> VACUUM (FreezeLimit is still used as an XID-age based backstop there).
> In non-aggressive VACUUMs (where there is still no strict guarantee that
> relfrozenxid will be advanced at all), we now advance relfrozenxid by as
> much as we possibly can.  This exploits workload conditions that make it
> easy to advance relfrozenxid by many more XIDs (for the same amount of
> freezing/pruning work).

Don't we now always advance relfrozenxid as much as we can, particularly also
during aggressive vacuums?



>   * FRM_RETURN_IS_MULTI
>   *        The return value is a new MultiXactId to set as new Xmax.
>   *        (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
> + *
> + * "relfrozenxid_out" is an output value; it's used to maintain target new
> + * relfrozenxid for the relation.  It can be ignored unless "flags" contains
> + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
> + * here.  This follows the general convention: only track XIDs that will still
> + * be in the table after the ongoing VACUUM finishes.  Note that it's up to
> + * caller to maintain this when the Xid return value is itself an Xid.
> + *
> + * Note that we cannot depend on xmin to maintain relfrozenxid_out.

What does it mean for xmin to maintain something?



> + * See heap_prepare_freeze_tuple for information about the basic rules for the
> + * cutoffs used here.
> + *
> + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
> + * are the current target relfrozenxid and relminmxid for the relation.  We
> + * assume that caller will never want to freeze its tuple, even when the tuple
> + * "needs freezing" according to our return value.

I don't understand the "will never want to" bit?


> Caller should make temp
> + * copies of global tracking variables before starting to process a page, so
> + * that we can only scribble on copies.  That way caller can just discard the
> + * temp copies if it isn't okay with that assumption.
> + *
> + * Only aggressive VACUUM callers are expected to really care when a tuple
> + * "needs freezing" according to us.  It follows that non-aggressive VACUUMs
> + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
> + * cases.

Could it make sense to track can_freeze and need_freeze separately?


> @@ -7158,57 +7256,59 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
>      if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
>      {
>          MultiXactId multi;
> +        MultiXactMember *members;
> +        int            nmembers;
>
>          multi = HeapTupleHeaderGetRawXmax(tuple);
> -        if (!MultiXactIdIsValid(multi))
> -        {
> -            /* no xmax set, ignore */
> -            ;
> -        }

> -        else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
> +        if (MultiXactIdIsValid(multi) &&
> +            MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
> +            *relminmxid_nofreeze_out = multi;

I may be misreading the diff, but aren't we now continuing to use multi down
below even if !MultiXactIdIsValid()?


> +        if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
>              return true;
> -        else if (MultiXactIdPrecedes(multi, cutoff_multi))
> -            return true;
> -        else
> +        else if (MultiXactIdPrecedes(multi, backstop_cutoff_multi))
> +            needs_freeze = true;
> +
> +        /* need to check whether any member of the mxact is too old */
> +        nmembers = GetMultiXactIdMembers(multi, &members, false,
> +                                         HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));

Doesn't this mean we unpack the members even if the multi is old enough to
need freezing? Just to then do it again during freezing? Accessing multis
isn't cheap...


> +            if (TransactionIdPrecedes(members[i].xid, backstop_cutoff_xid))
> +                needs_freeze = true;
> +            if (TransactionIdPrecedes(members[i].xid,
> +                                      *relfrozenxid_nofreeze_out))
> +                *relfrozenxid_nofreeze_out = xid;
>          }
> +        if (nmembers > 0)
> +            pfree(members);
>      }
>      else
>      {
>          xid = HeapTupleHeaderGetRawXmax(tuple);
> -        if (TransactionIdIsNormal(xid) &&
> -            TransactionIdPrecedes(xid, cutoff_xid))
> -            return true;
> +        if (TransactionIdIsNormal(xid))
> +        {
> +            if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
> +                *relfrozenxid_nofreeze_out = xid;
> +            if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
> +                needs_freeze = true;
> +        }
>      }
>
>      if (tuple->t_infomask & HEAP_MOVED)
>      {
>          xid = HeapTupleHeaderGetXvac(tuple);
> -        if (TransactionIdIsNormal(xid) &&
> -            TransactionIdPrecedes(xid, cutoff_xid))
> -            return true;
> +        if (TransactionIdIsNormal(xid))
> +        {
> +            if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
> +                *relfrozenxid_nofreeze_out = xid;
> +            if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
> +                needs_freeze = true;
> +        }
>      }

This stanza is repeated a bunch. Perhaps put it in a small static inline
helper?


>      /* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
>      TransactionId FreezeLimit;
>      MultiXactId MultiXactCutoff;
> -    /* Are FreezeLimit/MultiXactCutoff still valid? */
> -    bool        freeze_cutoffs_valid;
> +    /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
> +    TransactionId NewRelfrozenXid;
> +    MultiXactId NewRelminMxid;

Struct member names starting with an upper case look profoundly ugly to
me...  But this isn't the first one, so I guess... :(




> From d10f42a1c091b4dc52670fca80a63fee4e73e20c Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 13 Dec 2021 15:00:49 -0800
> Subject: [PATCH v9 2/4] Make page-level characteristics drive freezing.
>
> Teach VACUUM to freeze all of the tuples on a page whenever it notices
> that it would otherwise mark the page all-visible, without also marking
> it all-frozen.  VACUUM typically won't freeze _any_ tuples on the page
> unless _all_ tuples (that remain after pruning) are all-visible.  This
> makes the overhead of vacuuming much more predictable over time.  We
> avoid the need for large balloon payments during aggressive VACUUMs
> (typically anti-wraparound autovacuums).  Freezing is proactive, so
> we're much less likely to get into "freezing debt".


I still suspect this will cause a very substantial increase in WAL traffic in
realistic workloads. It's common to have workloads where tuples are inserted
once, and deleted once/ partition dropped. Freezing all the tuples is a lot
more expensive than just marking the page all visible. It's not uncommon to be
bound by WAL traffic rather than buffer dirtying rate (since the latter may be
ameliorated by s_b and local storage, whereas WAL needs to be
streamed/archived).

This is particularly true because log_heap_visible() doesn't need an FPW if
checksums aren't enabled. A small record vs an FPI is a *huge* difference.


I think we'll have to make this less aggressive or tunable. Random ideas for
heuristics:

- Is it likely that freezing would not require an FPI or conversely that
  log_heap_visible() will also need an FPI?  If the page was already recently
  modified / checksums are enabled, the WAL overhead of the freezing doesn't
  play much of a role.  (A minimal sketch of this one follows the list.)

- #dead items / #force-frozen items on the page - if we already need to do
  more than just setting all-visible, we can probably afford the WAL traffic.

- relfrozenxid vs max_freeze_age / FreezeLimit. The closer they get, the more
  aggressive we should freeze all-visible pages. Might even make sense to
  start vacuuming an increasing percentage of all-visible pages during
  non-aggressive vacuums, the closer we get to FreezeLimit.

- Keep stats about the age of dead and frozen tuples over time. If all tuples are
  removed within a reasonable fraction of freeze_max_age, there's no point in
  freezing them.
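
For the first idea, a minimal sketch of the kind of test I mean (the
helper and its parameters are made up, nothing like this exists today):

    #include <stdbool.h>

    /*
     * Opportunistic freezing is "cheap", WAL-wise, when it's unlikely to
     * cost an extra FPI: either the page already emitted one since the
     * last checkpoint, or log_heap_visible() would need one anyway
     * (checksums / wal_log_hints).
     */
    static bool
    freezing_wal_overhead_is_low(bool page_already_emitted_fpi_this_checkpoint,
                                 bool checksums_or_wal_log_hints_enabled)
    {
        if (page_already_emitted_fpi_this_checkpoint)
            return true;
        if (checksums_or_wal_log_hints_enabled)
            return true;
        return false;
    }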


> The new approach to freezing also enables relfrozenxid advancement in
> non-aggressive VACUUMs, which might be enough to avoid aggressive
> VACUUMs altogether (with many individual tables/workloads).  While the
> non-aggressive case continues to skip all-visible (but not all-frozen)
> pages (thereby making relfrozenxid advancement impossible), that in
> itself will no longer hinder relfrozenxid advancement (outside of
> pg_upgrade scenarios).

I don't know how to parse "thereby making relfrozenxid advancement impossible
... will no longer hinder relfrozenxid advancement"?


> We now consistently avoid leaving behind all-visible (not all-frozen) pages.
> This (as well as work from commit 44fa84881f) makes relfrozenxid advancement
> in non-aggressive VACUUMs commonplace.

s/consistently/try to/?


> The system accumulates freezing debt in proportion to the number of
> physical heap pages with unfrozen tuples, more or less.  Anything based
> on XID age is likely to be a poor proxy for the eventual cost of
> freezing (during the inevitable anti-wraparound autovacuum).  At a high
> level, freezing is now treated as one of the costs of storing tuples in
> physical heap pages -- not a cost of transactions that allocate XIDs.
> Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
> influence what we freeze, and when, they effectively become backstops.
> It may still be necessary to "freeze a page" due to the presence of a
> particularly old XID, from before VACUUM's FreezeLimit cutoff, though
> that will be rare in practice -- FreezeLimit is just a backstop now.

I don't really like the "rare in practice" bit. It'll be rare in some
workloads but others will likely be much less affected.



> + * Although this interface is primarily tuple-based, vacuumlazy.c caller
> + * cooperates with us to decide on whether or not to freeze whole pages,
> + * together as a single group.  We prepare for freezing at the level of each
> + * tuple, but the final decision is made for the page as a whole.  All pages
> + * that are frozen within a given VACUUM operation are frozen according to
> + * cutoff_xid and cutoff_multi.  Caller _must_ freeze the whole page when
> + * we've set *force_freeze to true!
> + *
> + * cutoff_xid must be caller's oldest xmin to ensure that any XID older than
> + * it could neither be running nor seen as running by any open transaction.
> + * This ensures that the replacement will not change anyone's idea of the
> + * tuple state.  Similarly, cutoff_multi must be the smallest MultiXactId used
> + * by any open transaction (at the time that the oldest xmin was acquired).

I think this means my concern above about increasing mxid creation rate
substantially may be warranted.


> + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
> + * be <= cutoff_multi.  When any XID/XMID from before these backstop cutoffs
> + * is encountered, we set *force_freeze to true, making caller freeze the page
> + * (freezing-eligible XIDs/XMIDs will be frozen, at least).  "Backstop
> + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
> + * This shouldn't be necessary very often.  VACUUM should prefer to freeze
> + * when it's cheap (not when it's urgent).

Hm. Does this mean that we might call heap_prepare_freeze_tuple and then
decide not to freeze? Doesn't that mean we might create new multis over and
over, because we don't end up pulling the trigger on freezing the page?


> +
> +            /*
> +             * We allocated a MultiXact for this, so force freezing to avoid
> +             * wasting it
> +             */
> +            *force_freeze = true;

Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body
of the function to figure it out...



> From d2190abf366f148bae5307442e8a6245c6922e78 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 21 Feb 2022 12:46:44 -0800
> Subject: [PATCH v9 3/4] Remove aggressive VACUUM skipping special case.
>
> Since it's simply never okay to miss out on advancing relfrozenxid
> during an aggressive VACUUM (that's the whole point), the aggressive
> case treated any page from a next_unskippable_block-wise skippable block
> range as an all-frozen page (not a merely all-visible page) during
> skipping.  Such a page might not be all-visible/all-frozen at the point
> that it actually gets skipped, but it could nevertheless be safely
> skipped, and then counted in frozenskipped_pages (the page must have
> been all-frozen back when we determined the extent of the range of
> blocks to skip, since aggressive VACUUMs _must_ scan all-visible pages).
> This is necessary to ensure that aggressive VACUUMs are always capable
> of advancing relfrozenxid.

> The non-aggressive case behaved slightly differently: it rechecked the
> visibility map for each page at the point of skipping, and only counted
> pages in frozenskipped_pages when they were still all-frozen at that
> time.  But it skipped the page either way (since we already committed to
> skipping the page at the point of the recheck).  This was correct, but
> sometimes resulted in non-aggressive VACUUMs needlessly wasting an
> opportunity to advance relfrozenxid (when a page was modified in just
> the wrong way, at just the wrong time).  It also resulted in a needless
> recheck of the visibility map for each and every page skipped during
> non-aggressive VACUUMs.
>
> Avoid these problems by conditioning the "skippable page was definitely
> all-frozen when range of skippable pages was first determined" behavior
> on what the visibility map _actually said_ about the range as a whole
> back when we first determined the extent of the range (don't deduce what
> must have happened at that time on the basis of aggressive-ness).  This
> allows us to reliably count skipped pages in frozenskipped_pages when
> they were initially all-frozen.  In particular, when a page's visibility
> map bit is unset after the point where a skippable range of pages is
> initially determined, but before the point where the page is actually
> skipped, non-aggressive VACUUMs now count it in frozenskipped_pages,
> just like aggressive VACUUMs always have [1].  It's not critical for the
> non-aggressive case to get this right, but there is no reason not to.
>
> [1] Actually, it might not work that way when there happens to be a mix
> of all-visible and all-frozen pages in a range of skippable pages.
> There is no chance of VACUUM advancing relfrozenxid in this scenario
> either way, though, so it doesn't matter.

I think this commit message needs a good amount of polishing - it's very
convoluted. It's late and I didn't sleep well, but I've tried to read it
several times without really getting a sense of what this precisely does.




> From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Thu, 24 Feb 2022 17:21:45 -0800
> Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen.

To me the commit message body doesn't actually describe what this is doing...


> This is pretty much an addendum to the work in the "Make page-level
> characteristics drive freezing" commit.  It has been broken out like
> this because I'm not even sure if it's necessary.  It seems like we
> might want to be paranoid about losing out on the chance to advance
> relfrozenxid in non-aggressive VACUUMs, though.

> The only test that will trigger this case is the "freeze-the-dead"
> isolation test.  It's incredibly narrow.  On the other hand, why take a
> chance?  All it takes is one heap page that's all-visible (and not also
> all-frozen) nestled between some all-frozen heap pages to lose out on
> relfrozenxid advancement.  The SKIP_PAGES_THRESHOLD stuff won't save us
> then [1].

FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
spending a lot of time doing IO that we never need, completely trashing all CPU
caches, while not actually causing decent readahead IO from what I've seen.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
> I am not a fan of the backstop terminology. It's still the reason we need to
> do freezing for correctness reasons.

Thanks for the review!

I'm not wedded to that particular terminology, but I think that we
need something like it. Open to suggestions.

How about limit-based? Something like that?

> It'd make more sense to me to turn it
> around and call the "non-backstop" freezing opportunistic freezing or such.

The problem with that scheme is that it leads to a world where
"standard freezing" is incredibly rare (it often literally never
happens), whereas "opportunistic freezing" is incredibly common. That
doesn't make much sense to me.

We tend to think of 50 million XIDs (the vacuum_freeze_min_age
default) as being not that many. But I think that it can be a huge
number, too. Even then, it's unpredictable -- I suspect that it can
change without very much changing in the application, from the point
of view of users. That's a big part of the problem I'm trying to
address -- freezing outside of aggressive VACUUMs is way too rare (it
might barely happen at all). FreezeLimit/vacuum_freeze_min_age was
designed at a time when there was no visibility map at all, when it
made somewhat more sense as the thing that drives freezing.

Incidentally, this is part of the problem with anti-wraparound vacuums
and freezing debt -- the fact that some quite busy databases take
weeks or months to go through 50 million XIDs (or 200 million)
increases the pain of the eventual aggressive VACUUM. It's not
completely unbounded -- autovacuum_freeze_max_age is not 100% useless
here. But the extent to which that stuff bounds the debt can vary
enormously, for not-very-good reasons.

> > Whenever we opt to
> > "freeze a page", the new page-level algorithm *always* uses the most
> > recent possible XID and MXID values (OldestXmin and oldestMxact) to
> > decide what XIDs/XMIDs need to be replaced. That might sound like it'd
> > be too much, but it only applies to those pages that we actually
> > decide to freeze (since page-level characteristics drive everything
> > now). FreezeLimit is only one way of triggering that now (and one of
> > the least interesting and rarest).
>
> That largely makes sense to me and doesn't seem weird.

I'm very pleased that the main intuition behind 0002 makes sense to
you. That's a start, at least.

> I'm a tad concerned about replacing mxids that have some members that are
> older than OldestXmin but not older than FreezeLimit. It's not too hard to
> imagine this accelerating mxid consumption considerably.  But we can probably,
> if not already done, special case that.

Let's assume for a moment that this is a real problem. I'm not sure if
it is or not myself (it's complicated), but let's say that it is. The
problem may be more than offset by the positive impact on relminmxid
advancement. I have placed a large emphasis on enabling
relfrozenxid/relminmxid advancement in every non-aggressive VACUUM,
for a number of reasons -- this is one of the reasons. Finding a way
for every VACUUM operation to be "vacrel->scanned_pages +
vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
amount of relfrozenxid/relminmxid advancement possible in every
VACUUM) has a great deal of value.
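
Spelled out as a toy check (invented parameter names, not the actual
vacuumlazy.c code):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * relfrozenxid/relminmxid can only be advanced when every heap page was
     * either scanned or skipped while known to be all-frozen, so that no
     * unfrozen XID/MXID can have been missed.
     */
    static bool
    can_advance_relfrozenxid(uint32_t scanned_pages,
                             uint32_t frozenskipped_pages,
                             uint32_t orig_rel_pages)
    {
        return scanned_pages + frozenskipped_pages == orig_rel_pages;
    }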

As I said recently on the "do only critical work during single-user
vacuum?" thread, why should databases that consume too many MXIDs do
so evenly, across all their tables? There
are usually one or two large tables, and many more smaller tables. I
think it's much more likely that the largest tables consume
approximately zero MultiXactIds in these databases -- actual
MultiXactId consumption is probably concentrated in just one or two
smaller tables (even when we burn through MultiXacts very quickly).
But we don't recognize these kinds of distinctions at all right now.

Under these conditions, we will have many more opportunities to
advance relminmxid for most of the tables (including the larger
tables) all the way up to current-oldestMxact with the patch series.
Without needing to freeze *any* MultiXacts early (just freezing some
XIDs early) to get that benefit. The patch series is not just about
spreading the burden of freezing, so that non-aggressive VACUUMs
freeze more -- it's also making relfrozenxid and relminmxid more
recent, and therefore more *reliable* indicators of which tables any
wraparound problems *really* come from.

Does that make sense to you? This kind of "virtuous cycle" seems
really important to me. It's a subtle point, so I have to ask.

> > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
> > freezing old ones) in large part so it can NOT freeze XIDs that it
> > would have been useful (and much cheaper) to remove anyway.
>
> Well, we may have to allocate a new mxid because some members are older than
> FreezeLimit but others are still running. When do we not remove xids that
> would have been cheaper to remove once we decide to actually do work?

My point was that today, on HEAD, there is nothing fundamentally
special about FreezeLimit (aka cutoff_xid) as far as
heap_prepare_freeze_tuple is concerned -- and yet that's the only
cutoff it knows about, really. Why can't we do better, by "exploiting
the difference" between FreezeLimit and OldestXmin?

> > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
> > OldestXmin at all (it actually just gets FreezeLimit passed as its
> > cutoff_xid argument). It cannot possibly recognize any of this for itself.
>
> It does recognize something like OldestXmin in a more precise and expensive
> way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().

It doesn't look that way to me.

While it's true that FreezeMultiXactId() will call
MultiXactIdIsRunning(), that's only a cross-check. This cross-check is
made at a point where we've already determined that the MultiXact in
question is < cutoff_multi. In other words, it catches cases where a
"MultiXactId < cutoff_multi" Multi contains an XID *that's still
running* -- a correctness issue. Nothing to do with being smart about
avoiding allocating new MultiXacts during freezing, or exploiting the
fact that "FreezeLimit < OldestXmin" (which is almost always true,
very true).

This correctness issue is the same issue discussed in "NB: cutoff_xid
*must* be <= the current global xmin..." comments that appear at the
top of heap_prepare_freeze_tuple. That's all.

> Hm. I guess I'll have to look at the code for it. It doesn't immediately
> "feel" quite right.

I kinda think it might be. Please let me know if you see a problem
with what I've said.

> > oldestXmin and oldestMxact map to the same wall clock time, more or less --
> > that seems like it might be an important distinction, independent of
> > everything else.
>
> Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
> removable (without creating a newer multi) until much later than its creation
> time. So I don't think that's really true.

Maybe what I said above is true, even though (at the same time) I have
*also* created new problems with "young" member xids. I really don't
know right now, though.

> > Final relfrozenxid values must still be >= FreezeLimit in an aggressive
> > VACUUM (FreezeLimit is still used as an XID-age based backstop there).
> > In non-aggressive VACUUMs (where there is still no strict guarantee that
> > relfrozenxid will be advanced at all), we now advance relfrozenxid by as
> > much as we possibly can.  This exploits workload conditions that make it
> > easy to advance relfrozenxid by many more XIDs (for the same amount of
> > freezing/pruning work).
>
> Don't we now always advance relfrozenxid as much as we can, particularly also
> during aggressive vacuums?

I just meant "we hope for the best and accept what we can get". Will fix.

> >   * FRM_RETURN_IS_MULTI
> >   *           The return value is a new MultiXactId to set as new Xmax.
> >   *           (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
> > + *
> > + * "relfrozenxid_out" is an output value; it's used to maintain target new
> > + * relfrozenxid for the relation.  It can be ignored unless "flags" contains
> > + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
> > + * here.  This follows the general convention: only track XIDs that will still
> > + * be in the table after the ongoing VACUUM finishes.  Note that it's up to
> > + * caller to maintain this when the Xid return value is itself an Xid.
> > + *
> > + * Note that we cannot depend on xmin to maintain relfrozenxid_out.
>
> What does it mean for xmin to maintain something?

Will fix.

> > + * See heap_prepare_freeze_tuple for information about the basic rules for the
> > + * cutoffs used here.
> > + *
> > + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
> > + * are the current target relfrozenxid and relminmxid for the relation.  We
> > + * assume that caller will never want to freeze its tuple, even when the tuple
> > + * "needs freezing" according to our return value.
>
> I don't understand the "will never want to" bit?

I meant "even when it's a non-aggressive VACUUM, which will never want
to wait for a cleanup lock the hard way, and will therefore always
settle for these relfrozenxid_nofreeze_out and
*relminmxid_nofreeze_out values". Note the convention here, which is that
relfrozenxid_nofreeze_out is not the same thing as relfrozenxid_out --
the former variable name is used for values in cases where we *don't*
freeze, the latter for values in the cases where we do.

Will try to clear that up.

> > Caller should make temp
> > + * copies of global tracking variables before starting to process a page, so
> > + * that we can only scribble on copies.  That way caller can just discard the
> > + * temp copies if it isn't okay with that assumption.
> > + *
> > + * Only aggressive VACUUM callers are expected to really care when a tuple
> > + * "needs freezing" according to us.  It follows that non-aggressive VACUUMs
> > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
> > + * cases.
>
> Could it make sense to track can_freeze and need_freeze separately?

You mean to change the signature of heap_tuple_needs_freeze, so it
doesn't return a bool anymore? It just has two bool pointers as
arguments, can_freeze and need_freeze?

I suppose that could make sense. Don't feel strongly either way.
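
Maybe something with this shape, if we went that way (purely
hypothetical -- not settled, and not what the patch does today):

    void
    heap_tuple_needs_freeze(HeapTupleHeader tuple,
                            TransactionId backstop_cutoff_xid,
                            MultiXactId backstop_cutoff_multi,
                            TransactionId *relfrozenxid_nofreeze_out,
                            MultiXactId *relminmxid_nofreeze_out,
                            bool *can_freeze,
                            bool *need_freeze);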

> I may be misreading the diff, but aren't we now continuing to use multi down
> below even if !MultiXactIdIsValid()?

Will investigate.

> Doesn't this mean we unpack the members even if the multi is old enough to
> need freezing? Just to then do it again during freezing? Accessing multis
> isn't cheap...

Will investigate.

> This stanza is repeated a bunch. Perhaps put it in a small static inline
> helper?

Will fix.

> Struct member names starting with an upper case look profoundly ugly to
> me...  But this isn't the first one, so I guess... :(

I am in 100% agreement, actually. But you know how it goes...

> I still suspect this will cause a very substantial increase in WAL traffic in
> realistic workloads. It's common to have workloads where tuples are inserted
> once, and deleted once/ partition dropped.

I agree with the principle that this kind of use case should be
accommodated in some way.

> I think we'll have to make this less aggressive or tunable. Random ideas for
> heuristics:

The problem that all of these heuristics have is that they will tend
to make it impossible for future non-aggressive VACUUMs to be able to
advance relfrozenxid. All that it takes is one single all-visible page
to make that impossible. As I said upthread, I think that being able
to advance relfrozenxid (and especially relminmxid) by *some* amount
in every VACUUM has non-obvious value.

Maybe you can address that by changing the behavior of non-aggressive
VACUUMs, so that they are directly sensitive to this. Maybe they don't
skip any all-visible pages when there aren't too many, that kind of
thing. That needs to be in scope IMV.

> I don't know how to parse "thereby making relfrozenxid advancement impossible
> ... will no longer hinder relfrozenxid advancement"?

Will fix.

> > We now consistently avoid leaving behind all-visible (not all-frozen) pages.
> > This (as well as work from commit 44fa84881f) makes relfrozenxid advancement
> > in non-aggressive VACUUMs commonplace.
>
> s/consistently/try to/?

Will fix.

> > The system accumulates freezing debt in proportion to the number of
> > physical heap pages with unfrozen tuples, more or less.  Anything based
> > on XID age is likely to be a poor proxy for the eventual cost of
> > freezing (during the inevitable anti-wraparound autovacuum).  At a high
> > level, freezing is now treated as one of the costs of storing tuples in
> > physical heap pages -- not a cost of transactions that allocate XIDs.
> > Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
> > influence what we freeze, and when, they effectively become backstops.
> > It may still be necessary to "freeze a page" due to the presence of a
> > particularly old XID, from before VACUUM's FreezeLimit cutoff, though
> > that will be rare in practice -- FreezeLimit is just a backstop now.
>
> I don't really like the "rare in practice" bit. It'll be rare in some
> workloads but others will likely be much less affected.

Maybe. The first time one XID crosses FreezeLimit now will be enough
to trigger freezing the page. So it's still very different to today.

I'll change this, though. It's not important.

> I think this means my concern above about increasing mxid creation rate
> substantially may be warranted.

Can you think of an adversarial workload, to get a sense of the extent
of the problem?

> > + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
> > + * be <= cutoff_multi.  When any XID/XMID from before these backstop cutoffs
> > + * is encountered, we set *force_freeze to true, making caller freeze the page
> > + * (freezing-eligible XIDs/XMIDs will be frozen, at least).  "Backstop
> > + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
> > + * This shouldn't be necessary very often.  VACUUM should prefer to freeze
> > + * when it's cheap (not when it's urgent).
>
> Hm. Does this mean that we might call heap_prepare_freeze_tuple and then
> decide not to freeze?

Yes. And so heap_prepare_freeze_tuple is now a little more like its
sibling function, heap_tuple_needs_freeze.

> Doesn't that mean we might create new multis over and
> over, because we don't end up pulling the trigger on freezing the page?

> Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body
> of the function to figure it out...

Will fix.

> I think this commit message needs a good amount of polishing - it's very
> convoluted. It's late and I didn't sleep well, but I've tried to read it
> several times without really getting a sense of what this precisely does.

It received much less polishing than the others.

Think of 0003 like this:

The logic for skipping a range of blocks using the visibility map
works by deciding the range of skippable blocks (everything before
next_unskippable_block) up front. Later, we actually execute the
skipping of this range of blocks (assuming it exceeds
SKIP_PAGES_THRESHOLD). These are two separate steps.

Right now, we do this:

            if (skipping_blocks && blkno < nblocks - 1)
            {
                /*
                 * Tricky, tricky.  If this is in aggressive vacuum, the page
                 * must have been all-frozen at the time we checked whether it
                 * was skippable, but it might not be any more.  We must be
                 * careful to count it as a skipped all-frozen page in that
                 * case, or else we'll think we can't update relfrozenxid and
                 * relminmxid.  If it's not an aggressive vacuum, we don't
                 * know whether it was initially all-frozen, so we have to
                 * recheck.
                 */
                if (vacrel->aggressive ||
                    VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
                    vacrel->frozenskipped_pages++;
                continue;
            }

The fact that this is conditioned in part on "vacrel->aggressive"
concerns me here. Why should we have a special case for this, where we
condition something on aggressive-ness that isn't actually strictly
related to that? Why not just remember that the range that we're
skipping was all-frozen up-front?

That way non-aggressive VACUUMs are not unnecessarily at a
disadvantage, when it comes to being able to advance relfrozenxid.
What if we end up not incrementing vacrel->frozenskipped_pages when we
easily could have, just because this is a non-aggressive VACUUM? I
think that it's worth avoiding stuff like that whenever possible.
Maybe this particular example isn't the most important one. For
example, it probably isn't as bad as the one that was fixed by the
lazy_scan_noprune work. But why even take a chance? Seems easier to
remove the special case -- which is what this really is.
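
To sketch the idea in code -- purely illustrative, with the callback,
types, and function name invented here rather than taken from the patch
-- the all-frozen status of a range would be computed once, at the point
where the range is chosen, and then simply remembered while the range is
skipped, for aggressive and non-aggressive VACUUMs alike:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* answers "was this page all-frozen in the VM?" for the sketch */
    typedef bool (*vm_all_frozen_fn) (BlockNumber blkno, void *state);

    /*
     * Decided once, when the skippable range [start, end) is established.
     * The caller remembers the answer; no rechecking, and no special case
     * for aggressive VACUUMs when incrementing frozenskipped_pages later.
     */
    static bool
    skip_range_is_all_frozen(BlockNumber start, BlockNumber end,
                             vm_all_frozen_fn all_frozen, void *state)
    {
        for (BlockNumber blkno = start; blkno < end; blkno++)
        {
            if (!all_frozen(blkno, state))
                return false;   /* range is merely all-visible */
        }
        return true;            /* every skipped page counts as frozen-skipped */
    }

With something along those lines there is no need to consult
vacrel->aggressive at all when deciding whether to increment
frozenskipped_pages.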

> FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
> causing a lot of time doing IO that we never need, completely trashing all CPU
> caches, while not actually causing decent readahead IO from what I've seen.

I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get
rid of it, we'll need to be sensitive to how that affects relfrozenxid
advancement in non-aggressive VACUUMs IMV.

Thanks again for the review!

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote:
> On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
> > I am not a fan of the backstop terminology. It's still the reason we need to
> > do freezing for correctness reasons.
> 
> Thanks for the review!
> 
> I'm not wedded to that particular terminology, but I think that we
> need something like it. Open to suggestions.
>
> How about limit-based? Something like that?

freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
s/limit/below/? I kind of like below because that answers < vs <= which I find
hard to remember around freezing.


> > I'm a tad concerned about replacing mxids that have some members that are
> > older than OldestXmin but not older than FreezeLimit. It's not too hard to
> > imagine that accelerating mxid consumption considerably.  But we can probably,
> > if not already done, special case that.
> 
> Let's assume for a moment that this is a real problem. I'm not sure if
> it is or not myself (it's complicated), but let's say that it is. The
> problem may be more than offset by the positive impact on relminxmid
> advancement. I have placed a large emphasis on enabling
> relfrozenxid/relminxmid advancement in every non-aggressive VACUUM,
> for a number of reasons -- this is one of the reasons. Finding a way
> for every VACUUM operation to be "vacrel->scanned_pages +
> vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
> amount of relfrozenxid/relminxmid advancement possible in every
> VACUUM) has a great deal of value.

That may be true, but I think working more incrementally is better in this
area. I'd rather have a smaller improvement for a release, collect some data,
and get another improvement in the next, than see a bunch of reports of large
wins and large regressions.


> As I said recently on the "do only critical work during single-user
> vacuum?" thread, why should the largest tables in databases that
> consume too many MXIDs do so evenly, across all their tables? There
> are usually one or two large tables, and many more smaller tables. I
> think it's much more likely that the largest tables consume
> approximately zero MultiXactIds in these databases -- actual
> MultiXactId consumption is probably concentrated in just one or two
> smaller tables (even when we burn through MultiXacts very quickly).
> But we don't recognize these kinds of distinctions at all right now.

Recognizing those distinctions seems independent of freezing multixacts with
live members. I am happy with freezing them more aggressively if they don't
have live members. It's freezing mxids with live members that has me
concerned.  The limits you're proposing are quite aggressive and can advance
quickly.

I've seen large tables with plenty of multixacts. Typically concentrated over a
value range (often changing over time).


> Under these conditions, we will have many more opportunities to
> advance relminmxid for most of the tables (including the larger
> tables) all the way up to current-oldestMxact with the patch series.
> Without needing to freeze *any* MultiXacts early (just freezing some
> XIDs early) to get that benefit. The patch series is not just about
> spreading the burden of freezing, so that non-aggressive VACUUMs
> freeze more -- it's also making relfrozenxid and relminmxid more
> recent and therefore *reliable* indicators of which tables any
> wraparound problems *really* are.

My concern was explicitly about the case where we have to create new
multixacts...


> Does that make sense to you?

Yes.


> > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
> > > OldestXmin at all (it actually just gets FreezeLimit passed as its
> > > cutoff_xid argument). It cannot possibly recognize any of this for itself.
> >
> > It does recognize something like OldestXmin in a more precise and expensive
> > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().
> 
> It doesn't look that way to me.
> 
> While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(),
> that's only a cross-check.

> This cross-check is made at a point where we've already determined that the
> MultiXact in question is < cutoff_multi. In other words, it catches cases
> where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still
> running* -- a correctness issue. Nothing to do with being smart about
> avoiding allocating new MultiXacts during freezing, or exploiting the fact
> that "FreezeLimit < OldestXmin" (which is almost always true, very true).

If there is <= 1 live member in a mxact, we replace it with a plain xid
iff the xid would also get frozen. With the current freezing logic I don't see
what passing down OldestXmin would change. Or how it differs to a meaningful
degree from heap_prepare_freeze_tuple()'s logic.  I don't see how it'd avoid a
single new mxact from being allocated.



> > > Caller should make temp
> > > + * copies of global tracking variables before starting to process a page, so
> > > + * that we can only scribble on copies.  That way caller can just discard the
> > > + * temp copies if it isn't okay with that assumption.
> > > + *
> > > + * Only aggressive VACUUM callers are expected to really care when a tuple
> > > + * "needs freezing" according to us.  It follows that non-aggressive VACUUMs
> > > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
> > > + * cases.
> >
> > Could it make sense to track can_freeze and need_freeze separately?
> 
> You mean to change the signature of heap_tuple_needs_freeze, so it
> doesn't return a bool anymore? It just has two bool pointers as
> arguments, can_freeze and need_freeze?

Something like that. Or return true if there's anything to do, and then rely
on can_freeze and need_freeze for finer details. But it doesn't matter that much.
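
Just to pin down the shape being discussed, here is one possible
signature as a sketch (the parameter list, types, and out-parameter
names are assumptions for illustration, not the actual heapam.c
function):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint32_t MultiXactId;
    typedef struct HeapTupleHeaderData HeapTupleHeaderData;

    /*
     * Returns true when there is anything at all to do for this tuple;
     * finer-grained details come back through the out-parameters.
     */
    bool        heap_tuple_needs_freeze(HeapTupleHeaderData *tuple,
                                        TransactionId cutoff_xid,
                                        MultiXactId cutoff_multi,
                                        bool *can_freeze,   /* freezable now? */
                                        bool *need_freeze); /* freezing required? */

Returning "anything to do" as the bool while exposing the finer details
through the pointers would keep simple call sites simple.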


> > I still suspect this will cause a very substantial increase in WAL traffic in
> > realistic workloads. It's common to have workloads where tuples are inserted
> > once, and deleted once/ partition dropped.
> 
> I agree with the principle that this kind of use case should be
> accommodated in some way.
> 
> > I think we'll have to make this less aggressive or tunable. Random ideas for
> > heuristics:
> 
> The problem that all of these heuristics have is that they will tend
> to make it impossible for future non-aggressive VACUUMs to be able to
> advance relfrozenxid. All that it takes is one single all-visible page
> to make that impossible. As I said upthread, I think that being able
> to advance relfrozenxid (and especially relminmxid) by *some* amount
> in every VACUUM has non-obvious value.

I think that's a laudable goal. But I don't think we should go there unless we
are quite confident we've mitigated the potential downsides.

Observed horizons for "never vacuumed before" tables and for aggressive
vacuums alone would be a huge win.


> Maybe you can address that by changing the behavior of non-aggressive
> VACUUMs, so that they are directly sensitive to this. Maybe they don't
> skip any all-visible pages when there aren't too many, that kind of
> thing. That needs to be in scope IMV.

Yea. I still like my idea to have vacuum process some all-visible pages
every time and to increase that percentage based on how old the relfrozenxid
is.

We could slowly "refill" the number of all-visible pages VACUUM is allowed to
process whenever dirtying a page for other reasons.
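
As a rough illustration of that heuristic only -- the function, its
parameters, and the formula below are all made up for this sketch, and
nothing like it appears in the patches -- the budget might be computed
along these lines:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /*
     * Scale the share of all-visible pages that a non-aggressive VACUUM is
     * willing to scan by how far relfrozenxid has fallen behind, and allow
     * one extra all-visible page per page dirtied for other reasons anyway.
     */
    static BlockNumber
    allvisible_scan_budget(BlockNumber rel_pages,
                           double relfrozenxid_age,
                           double autovacuum_freeze_max_age,
                           BlockNumber pages_dirtied_anyway)
    {
        /* 0% of all-visible pages when relfrozenxid is fresh, 100% near the max */
        double      frac = relfrozenxid_age / autovacuum_freeze_max_age;

        if (frac < 0.0)
            frac = 0.0;
        if (frac > 1.0)
            frac = 1.0;

        return (BlockNumber) (frac * rel_pages) + pages_dirtied_anyway;
    }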



> > I think this means my concern above about increasing mxid creation rate
> > substantially may be warranted.
> 
> Can you think of an adversarial workload, to get a sense of the extent
> of the problem?

I'll try to come up with something.


> > FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
> > causing a lot of time doing IO that we never need, completely trashing all CPU
> > caches, while not actually causing decent readahead IO from what I've seen.
> 
> I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get
> rid of it, we'll need to be sensitive to how that affects relfrozenxid
> advancement in non-aggressive VACUUMs IMV.

It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The
relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just
because there are only 30 of them in a row.


> Thanks again for the review!

NP, I think we need a lot of improvements in this area.

I wish somebody would tackle merging heap_page_prune() with
vacuuming. Primarily so we only do a single WAL record. But also because the
separation has caused a *lot* of complexity.  I've already got more projects than
I should, otherwise I'd start on it...

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 25, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Hm. I guess I'll have to look at the code for it. It doesn't immediately
> > "feel" quite right.
>
> I kinda think it might be. Please let me know if you see a problem
> with what I've said.

Oh, wait. I have a better idea of what you meant now. The loop towards
the end of FreezeMultiXactId() will indeed "Determine whether to keep
this member or ignore it." when we need a new MultiXactId. The loop is
exact in the sense that it will only include those XIDs that are truly
needed -- those that are still running.

But why should we ever get to the FreezeMultiXactId() loop with the
stuff from 0002 in place? The whole purpose of the loop is to handle
cases where we have to remove *some* (not all) XIDs from before
cutoff_xid that appear in a MultiXact, which requires careful checking
of each XID (this is only possible when the MultiXactId is <
cutoff_multi to begin with, which is OldestMxact in the patch, which
is presumably very recent).

It's not impossible that we'll get some number of "skewed MultiXacts"
with the patch -- cases that really do necessitate allocating a new
MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
will sometimes be some number of XIDs that are < OldestXmin, but
nevertheless appear in some MultiXactIds >= OldestMxact. This seems
likely to be rare with the patch, though, since VACUUM calculates its
OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
really are in the patch) at the same point in time. Which was the
point I made in my email yesterday.

How many of these "skewed MultiXacts" can we really expect? Seems like
there might be very few in practice. But I'm really not sure about
that.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Andres Freund
Date:
Hi,

On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote:
> But why should we ever get to the FreezeMultiXactId() loop with the
> stuff from 0002 in place? The whole purpose of the loop is to handle
> cases where we have to remove *some* (not all) XIDs from before
> cutoff_xid that appear in a MultiXact, which requires careful checking
> of each XID (this is only possible when the MultiXactId is <
> cutoff_multi to begin with, which is OldestMxact in the patch, which
> is presumably very recent).
> 
> It's not impossible that we'll get some number of "skewed MultiXacts"
> with the patch -- cases that really do necessitate allocating a new
> MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
> will sometimes be some number of XIDs that are < OldestXmin, but
> nevertheless appear in some MultiXactIds >= OldestMxact. This seems
> likely to be rare with the patch, though, since VACUUM calculates its
> OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
> really are in the patch) at the same point in time. Which was the
> point I made in my email yesterday.

I don't see why it matters that OldestXmin and OldestMxact are computed at the
same time?  It's a question of the workload, not vacuum algorithm.

OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
members are older than OldestXmin (not quite true, but that's the bound), and
they always have more than one member.


> How many of these "skewed MultiXacts" can we really expect?

I don't think they're skewed in any way. It's a fundamental aspect of
multixacts.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 25, 2022 at 3:48 PM Andres Freund <andres@anarazel.de> wrote:
> I don't see why it matters that OldestXmin and OldestMxact are computed at the
> same time?  It's a question of the workload, not vacuum algorithm.

I think it's both.

> OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
> members are older than OldestXmin (not quite true, but that's the bound), and
> they have always more than one member.
>
>
> > How many of these "skewed MultiXacts" can we really expect?
>
> I don't think they're skewed in any way. It's a fundamental aspect of
> multixacts.

Having this happen to some degree is fundamental to MultiXacts, sure.
But the approach of using FreezeLimit and MultiXactCutoff in the way
that we do right now seems like it might make the problem a lot worse.
Because they're completely meaningless
cutoffs. They are magic numbers that have no relationship whatsoever
to each other.

There are problems with assuming that OldestXmin and OldestMxact
"align" -- no question. But at least it's approximately true -- which
is a start. They are at least not arbitrarily, unpredictably
different, like FreezeLimit and MultiXactCutoff are, and always will
be. I think that that's a meaningful and useful distinction.

I am okay with making the most pessimistic possible assumptions about
how any changes to how we freeze might cause FreezeMultiXactId() to
allocate more MultiXacts than before. And I accept that the patch
series shouldn't "get credit" for "offsetting" any problem like that
by making relminmxid advancement occur much more frequently (even
though that does seem very valuable). All I'm really saying is this:
in general, there are probably quite a few opportunities for
FreezeMultiXactId() to avoid allocating new XMIDs (just to freeze
XIDs) by having the full context. And maybe by making the dialog
between lazy_scan_prune and heap_prepare_freeze_tuple a bit more
nuanced.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 25, 2022 at 3:26 PM Andres Freund <andres@anarazel.de> wrote:
> freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
> s/limit/below/? I kind of like below because that answers < vs <= which I find
> hard to remember around freezing.

I like freeze_required_limit the most.

> That may be true, but I think working more incrementally is better in this
> area. I'd rather have a smaller improvement for a release, collect some data,
> and get another improvement in the next, than see a bunch of reports of large
> wins and large regressions.

I agree.

There is an important practical way in which it makes sense to treat
0001 as separate to 0002. It is true that 0001 is independently quite
useful. In practical terms, I'd be quite happy to just get 0001 into
Postgres 15, without 0002. I think that that's what you meant here, in
concrete terms, and we can agree on that now.

However, it is *also* true that there is an important practical sense
in which they *are* related. I don't want to ignore that either -- it
does matter. Most of the value to be had here comes from the synergy
between 0001 and 0002 -- or what I've been calling a "virtuous cycle",
the thing that makes it possible to advance relfrozenxid/relminmxid in
almost every VACUUM. Having both 0001 and 0002 together (or something
along the same lines) is way more valuable than having just one.

Perhaps we can even agree on this second point. I am encouraged by the
fact that you at least recognize the general validity of the key ideas
from 0002. If I am going to commit 0001 (and not 0002) ahead of
feature freeze for 15, I better be pretty sure that I have at least
roughly the right idea with 0002, too -- since that's the direction
that 0001 is going in. It almost seems dishonest to pretend that I
wasn't thinking of 0002 when I wrote 0001.

I'm glad that you seem to agree that this business of accumulating
freezing debt without any natural limit is just not okay. That is
really fundamental to me. I mean, vacuum_freeze_min_age kind of
doesn't work as designed. This is a huge problem for us.

> > Under these conditions, we will have many more opportunities to
> > advance relminmxid for most of the tables (including the larger
> > tables) all the way up to current-oldestMxact with the patch series.
> > Without needing to freeze *any* MultiXacts early (just freezing some
> > XIDs early) to get that benefit. The patch series is not just about
> > spreading the burden of freezing, so that non-aggressive VACUUMs
> > freeze more -- it's also making relfrozenxid and relminmxid more
> > recent and therefore *reliable* indicators of which tables any
> > wraparound problems *really* are.
>
> My concern was explicitly about the case where we have to create new
> multixacts...

It was a mistake on my part to counter your point about that with this
other point about eager relminmxid advancement. As I said in the last
email, while that is very valuable, it's not something that needs to
be brought into this.

> > Does that make sense to you?
>
> Yes.

Okay, great. The fact that you recognize the value in that comes as a relief.

> > You mean to change the signature of heap_tuple_needs_freeze, so it
> > doesn't return a bool anymore? It just has two bool pointers as
> > arguments, can_freeze and need_freeze?
>
> Something like that. Or return true if there's anything to do, and then rely
> on can_freeze and need_freeze for finer details. But it doesn't matter that much.

Got it.

> > The problem that all of these heuristics have is that they will tend
> > to make it impossible for future non-aggressive VACUUMs to be able to
> > advance relfrozenxid. All that it takes is one single all-visible page
> > to make that impossible. As I said upthread, I think that being able
> > to advance relfrozenxid (and especially relminmxid) by *some* amount
> > in every VACUUM has non-obvious value.
>
> I think that's a laudable goal. But I don't think we should go there unless we
> are quite confident we've mitigated the potential downsides.

True. But that works both ways. We also shouldn't err in the direction
of adding these kinds of heuristics (which have real downsides) until
the idea of mostly swallowing the cost of freezing whole pages (while
making it possible to disable) has been given a fair chance and lost. Overall, it looks
like the cost is acceptable in most cases.

I think that users will find it very reassuring to regularly and
reliably see confirmation that wraparound is being kept at bay, by
every VACUUM operation, with details that they can relate to their
workload. That has real value IMV -- even when it's theoretically
unnecessary for us to be so eager with advancing relfrozenxid.

I really don't like the idea of falling behind on freezing
systematically. You always run the "risk" of freezing being wasted.
But that way of looking at it can be penny wise, pound foolish --
maybe we should just accept that trying to predict what will happen in
the future (whether or not freezing will be worth it) is mostly not
helpful. Our users mostly complain about performance stability these
days. Big shocks are really something we ought to avoid. That does
have a cost. Why wouldn't it?

> > Maybe you can address that by changing the behavior of non-aggressive
> > VACUUMs, so that they are directly sensitive to this. Maybe they don't
> > skip any all-visible pages when there aren't too many, that kind of
> > thing. That needs to be in scope IMV.
>
> Yea. I still like my idea to have vacuum process some all-visible pages
> every time and to increase that percentage based on how old the relfrozenxid
> is.

You can quite easily construct cases where the patch does much better
than that, though -- very believable cases. Any table like
pgbench_history. And so I lean towards quantifying the cost of
page-level freezing carefully, making sure there is nothing
pathological, and then just accepting it (with a GUC to disable). The
reality is that freezing is really a cost of storing data in Postgres,
and will be for the foreseeable future.

> > Can you think of an adversarial workload, to get a sense of the extent
> > of the problem?
>
> I'll try to come up with something.

That would be very helpful. Thanks!

> It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The
> relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just
> because there are only 30 of them in a row.

Right. I imagine that SKIP_PAGES_THRESHOLD actually does help with
this, but if we actually tried we'd find a much better way.

> I wish somebody would tackle merging heap_page_prune() with
> vacuuming. Primarily so we only do a single WAL record. But also because the
> separation has caused a *lot* of complexity.  I've already got more projects than
> I should, otherwise I'd start on it...

That has value, but it doesn't feel as urgent.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Sun, Feb 20, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think that the idea has potential, but I don't think that I
> > understand yet what the *exact* algorithm is.
>
> The algorithm seems to exploit a natural tendency that Andres once
> described in a blog post about his snapshot scalability work [1]. To a
> surprising extent, we can usefully bucket all tuples/pages into two
> simple categories:
>
> 1. Very, very old ("infinitely old" for all practical purposes).
>
> 2. Very very new.
>
> There doesn't seem to be much need for a third "in-between" category
> in practice. This seems to be at least approximately true all of the
> time.
>
> Perhaps Andres wouldn't agree with this very general statement -- he
> actually said something more specific. I for one believe that the
> point he made generalizes surprisingly well, though. I have my own
> theories about why this appears to be true. (Executive summary: power
> laws are weird, and it seems as if the sparsity-of-effects principle
> makes it easy to bucket things at the highest level, in a way that
> generalizes well across disparate workloads.)

I think that this is not really a description of an algorithm -- and I
think that it is far from clear that the third "in-between" category
does not need to exist.

> Remember when I got excited about how my big TPC-C benchmark run
> showed a predictable, tick/tock style pattern across VACUUM operations
> against the order and order lines table [2]? It seemed very
> significant to me that the OldestXmin of VACUUM operation n
> consistently went on to become the new relfrozenxid for the same table
> in VACUUM operation n + 1. It wasn't exactly the same XID, but very
> close to it (within the range of noise). This pattern was clearly
> present, even though VACUUM operation n + 1 might happen as long as 4
> or 5 hours after VACUUM operation n (this was a big table).

I think findings like this are very unconvincing. TPC-C (or any
benchmark really) is so simple as to be a terrible proxy for what
vacuuming is going to look like on real-world systems. Like, it's nice
that it works, and it shows that something's working, but it doesn't
demonstrate that the patch is making the right trade-offs overall.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Tue, Mar 1, 2022 at 1:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think that this is not really a description of an algorithm -- and I
> think that it is far from clear that the third "in-between" category
> does not need to exist.

But I already described the algorithm. It is very simple
mechanistically -- though that in itself means very little. As I have
said multiple times now, the hard part is assessing what the
implications are. And the even harder part is making a judgement about
whether or not those implications are what we generally want.

> I think findings like this are very unconvincing.

TPC-C may be unrealistic in certain ways, but it is nevertheless
vastly more realistic than pgbench. pgbench is really more of a stress
test than a benchmark.

The main reasons why TPC-C is interesting here are *very* simple, and
would likely be equally true with TPC-E (just for example) -- even
though TPC-E is a very different kind of OLTP workload
overall. TPC-C (like TPC-E) features a diversity of transaction types,
some of which are more complicated than others -- which is strictly
more realistic than having only one highly synthetic OLTP transaction
type. Each transaction type doesn't necessarily modify the same tables
in the same way. This leads to natural diversity among tables and
among transactions, including:

* The typical or average number of distinct XIDs per heap page varies
significantly among each table. There are way fewer distinct XIDs per
"order line" table heap page than there are per "order" table heap
page, for the obvious reason.

* Roughly speaking, there are various different ways that free space
management ought to work in a system like Postgres. For example it is
necessary to make a "fragmentation vs. space utilization" trade-off
with the new orders table.

* There are joins in some of the transactions!

Maybe TPC-C is a crude approximation of reality, but it nevertheless
exercises relevant parts of the system to a significant degree. What
else would you expect me to use, for a project like this? To a
significant degree the relfrozenxid tracking stuff is interesting
because tables tend to have natural differences like the ones I have
highlighted on this thread. How could that not be the case? Why
wouldn't we want to take advantage of that?

There might be some danger in over-optimizing for this particular
benchmark, but right now that is so far from being the main problem
that the idea seems strange to me. pgbench doesn't need the FSM, at
all. In fact pgbench doesn't even really need VACUUM (except for
antiwraparound), once heap fillfactor is lowered to 95 or so. pgbench
simply isn't relevant, *at all*, except perhaps as a way of measuring
regressions in certain synthetic cases that don't benefit.

> TPC-C (or any
> benchmark really) is so simple as to be a terrible proxy for what
> vacuuming is going to look like on real-world systems.

Doesn't that amount to "no amount of any kind of testing or
benchmarking will convince me of anything, ever"?

There is more than one type of real-world system. I think that TPC-C
is representative of some real world systems in some regards. But even
that's not the important point for me. I find TPC-C generally
interesting for one reason: I can clearly see that Postgres does
things in a way that just doesn't make much sense, which isn't
particularly fundamental to how VACUUM works.

My only long term goal is to teach Postgres to *avoid* various
pathological cases exhibited by TPC-C (e.g., the B-Tree "split after
new tuple" mechanism from commit f21668f328 *avoids* a pathological
case from TPC-C). We don't necessarily have to agree on how important
each individual case is "in the real world" (which is impossible to
know anyway). We only have to agree that what we see is a pathological
case (because some reasonable expectation is dramatically violated),
and then work out a fix.

I don't want to teach Postgres to be clever -- I want to teach it to
avoid being stupid in cases where it exhibits behavior that really
cannot be described any other way. You seem to talk about some of this
work as if it was just as likely to have a detrimental effect
elsewhere, for some equally plausible workload, which will have a
downside that is roughly as bad as the advertised upside. I consider
that very unlikely, though. Sure, regressions are quite possible, and
a real concern -- but regressions *like that* are unlikely. Avoiding
doing what is clearly the wrong thing just seems to work out that way,
in general.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There is an important practical way in which it makes sense to treat
> 0001 as separate to 0002. It is true that 0001 is independently quite
> useful. In practical terms, I'd be quite happy to just get 0001 into
> Postgres 15, without 0002. I think that that's what you meant here, in
> concrete terms, and we can agree on that now.

Attached is v10. While this does still include the freezing patch,
it's not in scope for Postgres 15. As I've said, I still think that it
makes sense to maintain the patch series with the freezing stuff,
since it's structurally related. So, to be clear, the first two
patches from the patch series are in scope for Postgres 15. But not
the third.

Highlights:

* Changes to terminology and commit messages along the lines suggested
by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My
testing strategy here still needs work.

* Expanded refactoring by v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was
originally all about fixing a case where non-aggressive VACUUMs were
at a gratuitous disadvantage (relative to aggressive VACUUMs) around
advancing relfrozenxid -- very much like the lazy_scan_noprune work
from commit 44fa8488. And that is still its main purpose. But the
refactoring now seems related to Andres' idea of making non-aggressive
VACUUMs decide to scan a few extra all-visible pages in order to be
able to advance relfrozenxid.

The code that sets up skipping using the visibility map is made a lot
clearer by v10-0002. That patch moves a significant amount of code
from lazy_scan_heap() into a new helper routine (so it continues the
trend started by the Postgres 14 work that added lazy_scan_prune()).
Now skipping a range of heap pages using the visibility map is fundamentally based on
setting up the range up front, and then using the same saved details
about the range thereafter -- we don't have anymore ad-hoc
VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we
already decided to skip (so no calls to those routines from
lazy_scan_heap(), at least not until after we finish processing in
lazy_scan_prune()).

This is more or less what we were doing all along for one special
case: aggressive VACUUMs. We had to make sure to either increment
frozenskipped_pages or increment scanned_pages for every page from
rel_pages -- this issue is described by lazy_scan_heap() comments on
HEAD that begin with "Tricky, tricky." (these date back to the freeze
map work from 2016). Anyway, there is no reason to not go further with
that: we should make whole ranges the basic unit that we deal with
when skipping. It's a lot simpler to think in terms of entire ranges
(not individual pages) that are determined to be all-visible or
all-frozen up-front, without needing to recheck anything (regardless
of whether it's an aggressive VACUUM).

We don't need to track frozenskipped_pages this way. And it's much
more obvious that it's safe for more complicated cases, in particular
for aggressive VACUUMs.

This kind of approach seems necessary to make non-aggressive VACUUMs
do a little more work opportunistically, when they realize that they
can advance relfrozenxid relatively easily that way (which I believe
Andres favors as part of overhauling freezing). That becomes a lot
more natural when you have a clear and unambiguous separation between
deciding what range of blocks to skip, and then actually skipping. I
can imagine the new helper function added by v10-0002 (which I've
called lazy_scan_skip_range()) eventually being taught to do these
kinds of tricks.

In general I think that all of the details of what to skip need to be
decided up front. The loop in lazy_scan_heap() should execute skipping
based on the instructions it receives from the new helper function, in
the simplest way possible. The helper function can become more
intelligent about the costs and benefits of skipping in the future,
without that impacting lazy_scan_heap().
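
As a rough sketch of what that contract could look like -- the struct
layout and signature below are hypothetical, and differ in detail from
the real helper -- lazy_scan_heap() would be handed everything it needs
to know about the next range in one go:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct skip_range
    {
        BlockNumber next_unskippable_block; /* first block the main loop must scan */
        bool        skipping_current_range; /* long enough to bother skipping? */
        bool        range_all_frozen;       /* relfrozenxid advancement stays safe */
    } skip_range;

    /*
     * Declared only; the body would consult the visibility map once, when
     * the range is established, rather than page by page inside the main
     * loop.  "void *vacrel" stands in for the per-VACUUM state struct.
     */
    extern skip_range lazy_scan_skip_range(void *vacrel, BlockNumber next_block,
                                           BlockNumber rel_pages);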

--
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Sun, Mar 13, 2022 at 9:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v10. While this does still include the freezing patch,
> it's not in scope for Postgres 15. As I've said, I still think that it
> makes sense to maintain the patch series with the freezing stuff,
> since it's structurally related.

Attached is v11. Changes:

* No longer includes the patch that adds page-level freezing. It was
making it harder to assess code coverage for the patches that I'm
targeting Postgres 15 with. And so including it with each new revision
no longer seems useful. I'll pick it up for Postgres 16.

* Extensive isolation tests added to v11-0001-*, exercising a lot of
hard-to-hit code paths that are reached when VACUUM is unable to
immediately acquire a cleanup lock on some heap page. In particular,
we now have test coverage for the code in heapam.c that handles
tracking the oldest extant XID and MXID in the presence of MultiXacts
(on a no-cleanup-lock heap page).

* v11-0002-* (which is the patch that avoids missing out on advancing
relfrozenxid in non-aggressive VACUUMs due to a race condition on
HEAD) now moves even more of the logic for deciding how VACUUM will
skip using the visibility map into its own helper routine. Now
lazy_scan_heap just does what the state returned by the helper routine
tells it about the current skippable range -- it doesn't make any
decisions itself anymore. This is far simpler than what we do
currently, on HEAD.

There are no behavioral changes here, but this approach could be
pushed further to improve performance. We could easily determine
*every* page that we're going to scan (not skip) up-front in even the
largest tables, very early, before we've even scanned one page. This
could enable things like I/O prefetching, or capping the size of the
dead_items array based on our final scanned_pages (not on rel_pages).

* A new patch (v11-0003-*) alters the behavior of VACUUM's
DISABLE_PAGE_SKIPPING option. DISABLE_PAGE_SKIPPING no longer forces
aggressive VACUUM -- now it only forces the use of the visibility map,
since that behavior is totally independent of aggressiveness.

I don't feel too strongly about the DISABLE_PAGE_SKIPPING change. It
just seems logical to decouple no-vm-skipping from aggressiveness --
it might actually be helpful in testing the work from the patch series
in the future. Any page counted in scanned_pages has essentially been
processed by VACUUM with this work in place -- that was the idea
behind the lazy_scan_noprune stuff from commit 44fa8488. Bear in mind
that the relfrozenxid tracking stuff from v11-0001-* makes it almost
certain that a DISABLE_PAGE_SKIPPING-without-aggressiveness VACUUM
will still manage to advance relfrozenxid -- usually by the same
amount as an equivalent aggressive VACUUM would anyway. (Failing to
acquire a cleanup lock on some heap page might result in the final
relfrozenxid being appreciably older, but probably not, and we'd
still almost certainly manage to advance relfrozenxid by *some* small
amount.)

Of course, anybody that wants both an aggressive VACUUM and a VACUUM
that never skips even all-frozen pages in the visibility map will
still be able to get that behavior quite easily. For example,
VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) will do that. Several of our
existing tests must already use both of these options together,
because the tests require an effective vacuum_freeze_min_age of 0 (and
vacuum_multixact_freeze_min_age of 0) -- DISABLE_PAGE_SKIPPING alone
won't do that on HEAD, which seems to confuse the issue (see commit
b700f96c for an example of that).

In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently*
force lazy_scan_noprune to refuse to process a page on HEAD (it all
depends on FreezeLimit/vacuum_freeze_min_age), it is logical for
DISABLE_PAGE_SKIPPING to totally get out of the business of caring
about that -- better to limit it to caring only about the visibility
map (by no longer making it force aggressiveness).

-- 
Peter Geoghegan

Attachments

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Wed, Mar 23, 2022 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently*
> force lazy_scan_noprune to refuse to process a page on HEAD (it all
> depends on FreezeLimit/vacuum_freeze_min_age), it is logical for
> DISABLE_PAGE_SKIPPING to totally get out of the business of caring
> about that -- better to limit it to caring only about the visibility
> map (by no longer making it force aggressiveness).

It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
disable skipping pages, we have a problem.

The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
DISABLE_PAGE_SKIPPING.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
> disable skipping pages, we have a problem.

It depends on how you define skipping. DISABLE_PAGE_SKIPPING was
created at a time when a broader definition of skipping made a lot
more sense.

> The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
> DISABLE_PAGE_SKIPPING.

VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show
that 100% of all of the pages from rel_pages are scanned. A page that
is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of
its tuples frozen. But every other aspect of processing the page
happens in just the same way as it would in the cleanup
lock/lazy_scan_prune path.

We'll even still VACUUM the page if it happens to have some existing
LP_DEAD items left behind by opportunistic pruning. We don't need a
cleanup lock in lazy_scan_noprune (a share lock is all we need), nor
do we even need one in lazy_vacuum_heap_page (a regular exclusive lock
is all we need).

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Wed, Mar 23, 2022 at 4:49 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
> > disable skipping pages, we have a problem.
>
> It depends on how you define skipping. DISABLE_PAGE_SKIPPING was
> created at a time when a broader definition of skipping made a lot
> more sense.
>
> > The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
> > DISABLE_PAGE_SKIPPING.
>
> VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show
> that 100% of all of the pages from rel_pages are scanned. A page that
> is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of
> its tuples frozen. But every other aspect of processing the page
> happens in just the same way as it would in the cleanup
> lock/lazy_scan_prune path.

I see what you mean about it depending on how you define "skipping".
But I think that DISABLE_PAGE_SKIPPING is intended as a sort of
emergency safeguard when you really, really don't want to leave
anything out. And therefore I favor defining it to mean that we don't
skip any work at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I see what you mean about it depending on how you define "skipping".
> But I think that DISABLE_PAGE_SKIPPING is intended as a sort of
> emergency safeguard when you really, really don't want to leave
> anything out.

I agree.

> And therefore I favor defining it to mean that we don't
> skip any work at all.

But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot
acquire a cleanup lock on a page, unless it happens to have XIDs from
before FreezeLimit (which is probably 50 million XIDs behind
OldestXmin, the vacuum_freeze_min_age default). I don't see much
difference.

Anyway, this isn't important. I'll just drop the third patch.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Thomas Munro
Date:
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > And therefore I favor defining it to mean that we don't
> > skip any work at all.
>
> But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot
> acquire a cleanup lock on a page, unless it happens to have XIDs from
> before FreezeLimit (which is probably 50 million XIDs behind
> OldestXmin, the vacuum_freeze_min_age default). I don't see much
> difference.

Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable
all page skipping, so 3414099c turned out to be not enough.



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Wed, Mar 23, 2022 at 2:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable
> all page skipping, so 3414099c turned out to be not enough.

The proposed change to DISABLE_PAGE_SKIPPING is partly driven by that,
and partly driven by a similar concern about aggressive VACUUM.

It seems worth emphasizing the idea that an aggressive VACUUM is now
just the same as any other VACUUM except for one detail: we're
guaranteed to advance relfrozenxid to a value >= FreezeLimit at the
end. The non-aggressive case has the choice to do things that make
that impossible. But there are only two places where this can happen now:

1. Non-aggressive VACUUMs might decide to skip some all-visible pages in
the new lazy_scan_skip() helper routine for skipping with the VM (see
v11-0002-*).

2. A non-aggressive VACUUM can *always* decide to ratchet back its
target relfrozenxid in lazy_scan_noprune, to avoid waiting for a
cleanup lock -- a final value from before FreezeLimit is usually still
pretty good.

The first scenario is the only one where it becomes impossible for
non-aggressive VACUUM to be able to advance relfrozenxid (with
v11-0001-* in place) by any amount. Even that's a choice, made by
weighing costs against benefits.

There is no behavioral change in v11-0002-* (we're still using the
old SKIP_PAGES_THRESHOLD strategy), but the lazy_scan_skip()
helper routine could fairly easily be taught a lot more about the
downside of skipping all-visible pages (namely how that makes it
impossible to advance relfrozenxid).

Maybe it's worth skipping all-visible pages (there are lots of them
and age(relfrozenxid) is still low), and maybe it isn't worth it. We
should get to decide, without implementation details making
relfrozenxid advancement unsafe.

It would be great if you could take a look v11-0002-*, Robert. Does it
make sense to you?

Thanks
--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Robert Haas
Date:
On Wed, Mar 23, 2022 at 6:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It would be great if you could take a look v11-0002-*, Robert. Does it
> make sense to you?

You're probably not going to love hearing this, but I think you're
still explaining things here in ways that are too baroque and hard to
follow. I do think it's probably better. But, for example, in the
commit message for 0001, I think you could change the subject line to
"Allow non-aggressive vacuums to advance relfrozenxid" and it would be
clearer. And then I think you could eliminate about half of the first
paragraph, starting with "There is no fixed relationship", and all of
the third paragraph (which starts with "Later work..."), and I think
removing all that material would make it strictly more clear than it
is currently. I don't think it's the place of a commit message to
speculate too much on future directions or to wax eloquent on
theoretical points. If that belongs anywhere, it's in a mailing list
discussion.

It seems to me that 0002 mixes code movement with functional changes.
I'm completely on board with moving the code that decides how much to
skip into a function. That seems like a great idea, and probably
overdue. But it is not easy for me to see what has changed
functionally between the old and new code organization, and I bet it
would be possible to split this into two patches, one of which creates
a function, and the other of which fixes the problem, and I think that
would be a useful service to future readers of the code. I have a hard
time believing that if someone in the future bisects a problem back to
this commit, they're going to have an easy time finding the behavior
change in here. In fact I can't see it myself. I think the actual
functional change is to fix what is described in the second paragraph
of the commit message, but I haven't been able to figure out where the
logic is actually changing to address that. Note that I would be happy
with the behavior change happening either before or after the code
reorganization.

I also think that the commit message for 0002 is probably longer and
more complex than is really helpful, and that the subject line is too
vague, but since I don't yet understand exactly what's happening here,
I cannot comment on how I think it should be revised at this point,
except to say that the second paragraph of that commit message looks
like the most useful part.

I would also like to mention a few things that I do like about 0002.
One is that it seems to collapse two different pieces of logic for
page skipping into one. That seems good. As mentioned, it's especially
good because that logic is abstracted into a function. Also, it looks
like it is making a pretty localized change to one (1) aspect of what
VACUUM does -- and I definitely prefer patches that change only one
thing at a time.

Hope that's helpful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From
Peter Geoghegan
Date:
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> You're probably not going to love hearing this, but I think you're
> still explaining things here in ways that are too baroque and hard to
> follow. I do think it's probably better.

There are a lot of dimensions to this work. It's hard to know which to
emphasize here.

> But, for example, in the
> commit message for 0001, I think you could change the subject line to
> "Allow non-aggressive vacuums to advance relfrozenxid" and it would be
> clearer.

But non-aggressive VACUUMs have always been able to do that.

How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"

> And then I think you could eliminate about half of the first
> paragraph, starting with "There is no fixed relationship", and all of
> the third paragraph (which starts with "Later work..."), and I think
> removing all that material would make it strictly more clear than it
> is currently. I don't think it's the place of a commit message to
> speculate too much on future directions or to wax eloquent on
> theoretical points. If that belongs anywhere, it's in a mailing list
> discussion.

Okay, I'll do that.

> It seems to me that 0002 mixes code movement with functional changes.

Believe it or not, I avoided functional changes in 0002 -- at least in
one important sense. That's why you had difficulty spotting any. This
must sound peculiar, since the commit message very clearly says that
the commit avoids a problem seen only in the non-aggressive case. It's
really quite subtle.

You wrote this comment and code block (which I propose to remove in
0002), so clearly you already understand the race condition that I'm
concerned with here:

-           if (skipping_blocks && blkno < rel_pages - 1)
-           {
-               /*
-                * Tricky, tricky.  If this is in aggressive vacuum, the page
-                * must have been all-frozen at the time we checked whether it
-                * was skippable, but it might not be any more.  We must be
-                * careful to count it as a skipped all-frozen page in that
-                * case, or else we'll think we can't update relfrozenxid and
-                * relminmxid.  If it's not an aggressive vacuum, we don't
-                * know whether it was initially all-frozen, so we have to
-                * recheck.
-                */
-               if (vacrel->aggressive ||
-                   VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-                   vacrel->frozenskipped_pages++;
-               continue;
-           }

What you're saying here boils down to this: it doesn't matter what the
visibility map would say right this microsecond (in the aggressive
case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
said that this page was all-frozen *in the recent past*. That's good
enough; we will never fail to scan a page that might have an XID <
OldestXmin (ditto for XMIDs) this way, which is all that really
matters.

This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?

In 0002 we now track if each range that we actually chose to skip had
any all-visible (not all-frozen) pages -- if that happens then
relfrozenxid advancement becomes unsafe. The existing code uses
"vacrel->aggressive" as a proxy for the same condition -- the existing
code reasons based on what the visibility map must have said about the
page in the recent past. Which makes sense, but only works in the
aggressive case. The approach taken in 0002 also makes the code
simpler, which is what enabled putting the VM skipping code into its
own helper function, but that was just a bonus.

And so you could almost say that there is no behavioral change at
all. We're skipping pages in the same way, based on the same
information (from the visibility map) as before. We're just being a
bit more careful than before about how that information is tracked, to
avoid this race. A race that we always avoided in the aggressive case
is now consistently avoided.

> I'm completely on board with moving the code that decides how much to
> skip into a function. That seems like a great idea, and probably
> overdue. But it is not easy for me to see what has changed
> functionally between the old and new code organization, and I bet it
> would be possible to split this into two patches, one of which creates
> a function, and the other of which fixes the problem, and I think that
> would be a useful service to future readers of the code.

It seems kinda tricky to split up 0002 like that. It's possible, but
I'm not sure if it's possible to split it in a way that highlights the
issue that I just described. Because we already avoided the race in
the aggressive case.

> I also think that the commit message for 0002 is probably longer and
> more complex than is really helpful, and that the subject line is too
> vague, but since I don't yet understand exactly what's happening here,
> I cannot comment on how I think it should be revised at this point,
> except to say that the second paragraph of that commit message looks
> like the most useful part.

I'll work on that.

> I would also like to mention a few things that I do like about 0002.
> One is that it seems to collapse two different pieces of logic for
> page skipping into one. That seems good. As mentioned, it's especially
> good because that logic is abstracted into a function. Also, it looks
> like it is making a pretty localized change to one (1) aspect of what
> VACUUM does -- and I definitely prefer patches that change only one
> thing at a time.

Totally embracing the idea that we don't necessarily need very recent
information from the visibility map (it just has to be after
OldestXmin was established) has a lot of advantages, architecturally.
It could in principle be hours out of date in the longest VACUUM
operations -- that should be fine. This is exactly the same principle
that makes it okay to stick with our original rel_pages, even when the
table has grown during a VACUUM operation (I documented this in commit
73f6ec3d3c recently).

We could build on the approach taken by 0002 to create a totally
comprehensive picture of the ranges we're skipping up-front, before we
actually scan any pages, even with very large tables. We could in
principle cache a very large number of skippable ranges up-front,
without ever going back to the visibility map again later (unless we
need to set a bit). It really doesn't matter if somebody else unsets a
page's VM bit concurrently, at all.

I see a lot of advantage to knowing our final scanned_pages almost
immediately. Things like prefetching, capping the size of the
dead_items array more intelligently (use final scanned_pages instead
of rel_pages in dead_items_max_items()), improvements to progress
reporting...not to mention more intelligent choices about whether we
should try to advance relfrozenxid a bit earlier during non-aggressive
VACUUMs.

> Hope that's helpful.

Very helpful -- thanks!

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Thu, Mar 24, 2022 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> But non-aggressive VACUUMs have always been able to do that.
>
> How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"

Sure, that sounds nice.

> Believe it or not, I avoided functional changes in 0002 -- at least in
> one important sense. That's why you had difficulty spotting any. This
> must sound peculiar, since the commit message very clearly says that
> the commit avoids a problem seen only in the non-aggressive case. It's
> really quite subtle.

Well, I think the goal in revising the code is to be as un-subtle as
possible. Commits that people can't easily understand breed future
bugs.

> What you're saying here boils down to this: it doesn't matter what the
> visibility map would say right this microsecond (in the aggressive
> case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
> said that this page was all-frozen *in the recent past*. That's good
> enough; we will never fail to scan a page that might have an XID <
> OldestXmin (ditto for MXIDs) this way, which is all that really
> matters.

Makes sense. So maybe the commit message should try to emphasize this
point e.g. "If a page is all-frozen at the time we check whether it
can be skipped, don't allow it to affect the relfrozenxmin and
relminmxid which we set for the relation. This was previously true for
aggressive vacuums, but not for non-aggressive vacuums, which was
inconsistent. (The reason this is a safe thing to do is that any new
XIDs or MXIDs that appear on the page after we initially observe it to
be frozen must be newer than any relfrozenxid or relminmxid the
current vacuum could possibly consider storing into pg_class.)"

> This is absolutely mandatory in the aggressive case, because otherwise
> relfrozenxid advancement might be seen as unsafe. My observation is:
> Why should we accept the same race in the non-aggressive case? Why not
> do essentially the same thing in every VACUUM?

Sure, that seems like a good idea. I think I basically agree with the
goals of the patch. My concern is just about making the changes
understandable to future readers. This area is notoriously subtle, and
people are going to introduce more bugs even if the comments and code
organization are fantastic.

> And so you could almost say that there is no behavioral change at
> all.

I vigorously object to this part, though. We should always err on the
side of saying that commits *do* have behavioral changes. We should go
out of our way to call out in the commit message any possible way that
someone might notice the difference between the post-commit situation
and the pre-commit situation. It is fine, even good, to also be clear
about how we're maintaining continuity and why we don't think it's a
problem, but the only commits that should be described as not having
any behavioral change are ones that do mechanical code movement, or
are just changing comments, or something like that.

> It seems kinda tricky to split up 0002 like that. It's possible, but
> I'm not sure if it's possible to split it in a way that highlights the
> issue that I just described. Because we already avoided the race in
> the aggressive case.

I do see that there are some difficulties there. I'm not sure what to
do about that. I think a sufficiently clear commit message could
possibly be enough, rather than trying to split the patch. But I also
think splitting the patch should be considered, if that can reasonably
be done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Mar 24, 2022 at 1:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"
>
> Sure, that sounds nice.

Cool.

> > What you're saying here boils down to this: it doesn't matter what the
> > visibility map would say right this microsecond (in the aggressive
> > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
> > said that this page was all-frozen *in the recent past*. That's good
> > enough; we will never fail to scan a page that might have an XID <
> > OldestXmin (ditto for MXIDs) this way, which is all that really
> > matters.
>
> Makes sense. So maybe the commit message should try to emphasize this
> point e.g. "If a page is all-frozen at the time we check whether it
> can be skipped, don't allow it to affect the relfrozenxid and
> relminmxid which we set for the relation. This was previously true for
> aggressive vacuums, but not for non-aggressive vacuums, which was
> inconsistent. (The reason this is a safe thing to do is that any new
> XIDs or MXIDs that appear on the page after we initially observe it to
> be frozen must be newer than any relfrozenxid or relminmxid the
> current vacuum could possibly consider storing into pg_class.)"

Okay, I'll add something more like that.

Almost every aspect of relfrozenxid advancement by VACUUM seems
simpler when thought about in these terms IMV. Every VACUUM now scans
all pages that might have XIDs < OldestXmin, and so every VACUUM can
advance relfrozenxid to the oldest extant XID (barring non-aggressive
VACUUMs that *choose* to skip some all-visible pages).

There are a lot of other important details, of course. My "Every
VACUUM..." statement works well as an axiom because none of those
other details create any awkward exceptions.

> > This is absolutely mandatory in the aggressive case, because otherwise
> > relfrozenxid advancement might be seen as unsafe. My observation is:
> > Why should we accept the same race in the non-aggressive case? Why not
> > do essentially the same thing in every VACUUM?
>
> Sure, that seems like a good idea. I think I basically agree with the
> goals of the patch.

Great.

> My concern is just about making the changes
> understandable to future readers. This area is notoriously subtle, and
> people are going to introduce more bugs even if the comments and code
> organization are fantastic.

Makes sense.

> > And so you could almost say that there is no behavioral change at
> > all.
>
> I vigorously object to this part, though. We should always err on the
> side of saying that commits *do* have behavioral changes.

I think that you've taken my words too literally here. I would never
conceal the intent of a piece of work like that. I thought that it
would clarify matters to point out that I could in theory "get away
with it if I wanted to" in this instance. This was only a means of
conveying a subtle point about the behavioral changes from 0002 --
since you couldn't initially see them yourself (even with my commit
message).

Kind of like Tom Lane's 2011 talk on the query planner. The one where
he lied to the audience several times.

> > It seems kinda tricky to split up 0002 like that. It's possible, but
> > I'm not sure if it's possible to split it in a way that highlights the
> > issue that I just described. Because we already avoided the race in
> > the aggressive case.
>
> I do see that there are some difficulties there. I'm not sure what to
> do about that. I think a sufficiently clear commit message could
> possibly be enough, rather than trying to split the patch. But I also
> think splitting the patch should be considered, if that can reasonably
> be done.

I'll see if I can come up with something. It's hard to be sure about
that kind of thing when you're this close to the code.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Mar 24, 2022 at 2:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > This is absolutely mandatory in the aggressive case, because otherwise
> > > relfrozenxid advancement might be seen as unsafe. My observation is:
> > > Why should we accept the same race in the non-aggressive case? Why not
> > > do essentially the same thing in every VACUUM?
> >
> > Sure, that seems like a good idea. I think I basically agree with the
> > goals of the patch.
>
> Great.

Attached is v12. My current goal is to commit all 3 patches before
feature freeze. Note that this does not include the more complicated
patch including with previous revisions of the patch series (the
page-level freezing work that appeared in versions before v11).

Changes that appear in this new revision, v12:

* Reworking of the commit messages based on feedback from Robert.

* General cleanup of the changes to heapam.c from 0001 (the changes to
heap_prepare_freeze_tuple and related functions).  New and existing
code now fits together a bit better. I also added a couple of new
documenting assertions, to make the flow a bit easier to understand.

* Added new assertions that document
OldestXmin/FreezeLimit/relfrozenxid invariants, right at the point we
update pg_class within vacuumlazy.c.

These assertions would have a decent chance of failing if there were
any bugs in the code.

* Removed patch that made DISABLE_PAGE_SKIPPING not force aggressive
VACUUM, limiting the underlying mechanism to forcing scanning of all
pages in lazy_scan_heap (v11 was the first and last revision that
included this patch).

* Added a new small patch, 0003. This just moves the last piece of
resource allocation that still took place at the top of
lazy_scan_heap() back into its caller, heap_vacuum_rel().

The work in 0003 probably should have happened as part of the patch
that became commit 73f6ec3d -- same idea. It's totally mechanical
stuff. With 0002 and 0003, there is hardly any lazy_scan_heap code
before the main loop that iterates through blocks in rel_pages (and
the code that's still there is obviously related to the loop in a
direct and obvious way). This seems like a big overall improvement in
maintainability.

Didn't see a way to split up 0002, per Robert's suggestion 3 days ago.
As I said at the time, it's possible to split it up, but not in a way
that highlights the underlying issue (since the issue 0002 fixes was
always limited to non-aggressive VACUUMs). The commit message may have
to suffice.

--
Peter Geoghegan

Вложения

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Robert Haas
Дата:
On Sun, Mar 27, 2022 at 11:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v12. My current goal is to commit all 3 patches before
> feature freeze. Note that this does not include the more complicated
> patch included with previous revisions of the patch series (the
> page-level freezing work that appeared in versions before v11).

Reviewing 0001, focusing on the words in the patch file much more than the code:

I can understand this version of the commit message. Woohoo! I like
understanding things.

I think the header comments for FreezeMultiXactId() focus way too much
on what the caller is supposed to do and not nearly enough on what
FreezeMultiXactId() itself does. I think to some extent this also
applies to the comments within the function body.

On the other hand, the header comments for heap_prepare_freeze_tuple()
seem good to me. If I were thinking of calling this function, I would
know how to use the new arguments. If I were looking for bugs in it, I
could compare the logic in the function to what these comments say it
should be doing. Yay.

I think I understand what the first paragraph of the header comment
for heap_tuple_needs_freeze() is trying to say, but the second one is
quite confusing. I think this is again because it veers into talking
about what the caller should do rather than explaining what the
function itself does.

I don't like the statement-free else block in lazy_scan_noprune(). I
think you could delete the else{} and just put that same comment there
with one less level of indentation. There's a clear "return false"
just above so it shouldn't be confusing what's happening.

The comment hunk at the end of lazy_scan_noprune() would probably be
better if it said something more specific than "caller can tolerate
reduced processing." My guess is that it would be something like
"caller does not need to do something or other."

I have my doubts about whether the overwrite-a-future-relfrozenxid
behavior is any good, but that's a topic for another day. I suggest
keeping the words "it seems best to", though, because they convey a
level of tentativeness, which seems appropriate.

I am surprised to see you write in maintenance.sgml that the VACUUM
which most recently advanced relfrozenxid will typically be the most
recent aggressive VACUUM. I would have expected something like "(often
the most recent VACUUM)".

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Tue, Mar 29, 2022 at 10:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I can understand this version of the commit message. Woohoo! I like
> understanding things.

That's good news.

> I think the header comments for FreezeMultiXactId() focus way too much
> on what the caller is supposed to do and not nearly enough on what
> FreezeMultiXactId() itself does. I think to some extent this also
> applies to the comments within the function body.

To some extent this is a legitimate difference in style. I myself
don't think that it's intrinsically good to have these sorts of
comments. I just think that it can be the least-bad option when a
function is written with one caller and one very specific set of
requirements in mind. That is pretty much a matter of
taste, though.

> I think I understand what the first paragraph of the header comment
> for heap_tuple_needs_freeze() is trying to say, but the second one is
> quite confusing. I think this is again because it veers into talking
> about what the caller should do rather than explaining what the
> function itself does.

I wouldn't have done it that way if the function wasn't called
heap_tuple_needs_freeze().

I would be okay with removing this paragraph if the function was
renamed to reflect the fact it now tells the caller something about
the tuple having an old XID/MXID relative to the caller's own XID/MXID
cutoffs. Maybe the function name should be heap_tuple_would_freeze(),
making it clear that the function merely tells caller what
heap_prepare_freeze_tuple() *would* do, without presuming to tell the
vacuumlazy.c caller what it *should* do about any of the information
it is provided.

Then it becomes natural to see the boolean return value and the
changes the function makes to caller's relfrozenxid/relminmxid tracker
variables as independent.
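
Something like this for the signature, roughly (just a sketch of the
renaming idea, not the final patch):

/*
 * Would heap_prepare_freeze_tuple() force freezing of this tuple?
 *
 * Also maintains the caller's relfrozenxid/relminmxid trackers as a
 * side effect, ratcheting them back as needed.
 */
extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
                                    TransactionId cutoff_xid,
                                    MultiXactId cutoff_multi,
                                    TransactionId *relfrozenxid_out,
                                    MultiXactId *relminmxid_out);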

> I don't like the statement-free else block in lazy_scan_noprune(). I
> think you could delete the else{} and just put that same comment there
> with one less level of indentation. There's a clear "return false"
> just above so it shouldn't be confusing what's happening.

Okay, will fix.

> The comment hunk at the end of lazy_scan_noprune() would probably be
> better if it said something more specific than "caller can tolerate
> reduced processing." My guess is that it would be something like
> "caller does not need to do something or other."

I meant "caller can tolerate not pruning or freezing this particular
page". Will fix.

> I have my doubts about whether the overwrite-a-future-relfrozenxid
> behavior is any good, but that's a topic for another day. I suggest
> keeping the words "it seems best to", though, because they convey a
> level of tentativeness, which seems appropriate.

I agree that it's best to keep a tentative tone here. That code was
written following a very specific bug in pg_upgrade several years
back. There was a very recent bug fixed only last year, by commit
74cf7d46.

FWIW I tend to think that we'd have a much better chance of catching
that sort of thing if we'd had better relfrozenxid instrumentation
before now. Now you'd see a negative value in the "new relfrozenxid:
%u, which is %d xids ahead of previous value" part of the autovacuum
log message in the event of such a bug. That's weird enough that I bet
somebody would notice and report it.

> I am surprised to see you write in maintenance.sgml that the VACUUM
> which most recently advanced relfrozenxid will typically be the most
> recent aggressive VACUUM. I would have expected something like "(often
> the most recent VACUUM)".

That's always been true, and will only be slightly less true in
Postgres 15 -- the fact is that we only need to skip one all-visible
page to lose out, and that's not unlikely with tables that aren't
quite small, even with all the patches from v12 applied (we're still much
too naive). The work that I'll get into Postgres 15 on VACUUM is very
valuable as a basis for future improvements, but not all that valuable
to users (improved instrumentation might be the biggest benefit in 15,
or maybe relminmxid advancement for certain types of applications).

I still think that we need to do more proactive page-level freezing to
make relfrozenxid advancement happen in almost every VACUUM, but even
that won't quite be enough. There are still cases where we need to
make a choice about giving up on relfrozenxid advancement in a
non-aggressive VACUUM -- all-visible pages won't completely go away
with page-level freezing. At a minimum we'll still have edge cases
like the case where heap_lock_tuple() unsets the all-frozen bit. And
pg_upgrade'd databases, too.

0002 structures the logic for skipping using the VM in a way that will
make the choice to skip or not skip all-visible pages in
non-aggressive VACUUMs quite natural. I suspect that
SKIP_PAGES_THRESHOLD was mostly just about relfrozenxid
advancement in non-aggressive VACUUMs all along. We can do much better
than SKIP_PAGES_THRESHOLD, especially if we preprocess the entire
visibility map up-front -- we'll know the costs and benefits up-front,
before committing to early relfrozenxid advancement.
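
Just to illustrate the kind of up-front decision that becomes possible
once the whole VM has been preprocessed into ranges -- reusing the
made-up VMSkipRange sketch from upthread, with a made-up threshold:

/* Decide up-front whether to skip all-visible (not all-frozen) ranges */
static bool
skip_allvisible_ranges(VMSkipRange *ranges, int nranges,
                       BlockNumber rel_pages)
{
    BlockNumber extra_scan_pages = 0;

    for (int i = 0; i < nranges; i++)
    {
        /*
         * All-frozen ranges can be skipped without giving up on
         * relfrozenxid advancement; only all-visible ranges carry a cost
         */
        if (!ranges[i].all_frozen)
            extra_scan_pages += ranges[i].nblocks;
    }

    /* Give up on early relfrozenxid advancement only when the cost is high */
    return extra_scan_pages > rel_pages / 20;
}

We'd know extra_scan_pages before scanning a single heap page, so the
choice weighs the total cost against the benefit, instead of applying
something like SKIP_PAGES_THRESHOLD one skippable range at a time.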

Overall, aggressive vs non-aggressive VACUUM seems like a false
dichotomy to me. ISTM that it should be a totally dynamic set of
behaviors. There should probably be several different "aggressive
gradations". Most VACUUMs start out completely non-aggressive
(including even anti-wraparound autovacuums), but can escalate from
there. The non-cancellable autovacuum behavior (technically an
anti-wraparound thing, but really an aggressiveness thing) should be
something we escalate to, as with the failsafe.

Dynamic behavior works a lot better. And it makes scheduling of
autovacuum workers a lot more straightforward -- the discontinuities
seem to make that much harder, which is one more reason to avoid them
altogether.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Tue, Mar 29, 2022 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think I understand what the first paragraph of the header comment
> > for heap_tuple_needs_freeze() is trying to say, but the second one is
> > quite confusing. I think this is again because it veers into talking
> > about what the caller should do rather than explaining what the
> > function itself does.
>
> I wouldn't have done it that way if the function wasn't called
> heap_tuple_needs_freeze().
>
> I would be okay with removing this paragraph if the function was
> renamed to reflect the fact it now tells the caller something about
> the tuple having an old XID/MXID relative to the caller's own XID/MXID
> cutoffs. Maybe the function name should be heap_tuple_would_freeze(),
> making it clear that the function merely tells caller what
> heap_prepare_freeze_tuple() *would* do, without presuming to tell the
> vacuumlazy.c caller what it *should* do about any of the information
> it is provided.

Attached is v13, which does it that way. This does seem like a real
increase in clarity, albeit one that comes at the cost of renaming
heap_tuple_needs_freeze().

v13 also addresses all of the other items from Robert's most recent
round of feedback.

I would like to commit something close to v13 on Friday or Saturday.

Thanks
-- 
Peter Geoghegan

Вложения

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Justin Pryzby
Дата:
+                               diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+                               Assert(diff > 0);

Did you see that this crashed on windows cfbot?

https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)
abort() has been called
2022-03-30 03:48:30.267 GMT [5316][client backend] [pg_regress/tablefunc][3/15389:0] ERROR:  infinite recursion detected
2022-03-30 03:48:38.031 GMT [5592][postmaster] LOG:  server process (PID 5984) was terminated by exception 0xC0000354
2022-03-30 03:48:38.031 GMT [5592][postmaster] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2022-03-30 03:48:38.031 GMT [5592][postmaster] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

https://cirrus-ci.com/task/4592929254670336

00000000`007ff130 00000001`400b4ef8     postgres!ExceptionalCondition(
            char * conditionName = 0x00000001`40a915d8 "diff > 0", 
            char * errorType = 0x00000001`40a915c8 "FailedAssertion", 
            char * fileName = 0x00000001`40a91598 "c:\cirrus\src\backend\access\heap\vacuumlazy.c", 
            int lineNumber = 0n724)+0x8d [c:\cirrus\src\backend\utils\error\assert.c @ 70]
00000000`007ff170 00000001`402a0914     postgres!heap_vacuum_rel(
            struct RelationData * rel = 0x00000000`00a51088, 
            struct VacuumParams * params = 0x00000000`00a8420c, 
            struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x1038 [c:\cirrus\src\backend\access\heap\vacuumlazy.c @ 724]
00000000`007ff350 00000001`402a4686     postgres!table_relation_vacuum(
            struct RelationData * rel = 0x00000000`00a51088, 
            struct VacuumParams * params = 0x00000000`00a8420c, 
            struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x34 [c:\cirrus\src\include\access\tableam.h @ 1681]
00000000`007ff380 00000001`402a1a2d     postgres!vacuum_rel(
            unsigned int relid = 0x4ee, 
            struct RangeVar * relation = 0x00000000`01799ae0, 
            struct VacuumParams * params = 0x00000000`00a8420c)+0x5a6 [c:\cirrus\src\backend\commands\vacuum.c @ 2068]
00000000`007ff400 00000001`4050f1ef     postgres!vacuum(
            struct List * relations = 0x00000000`0179df58, 
            struct VacuumParams * params = 0x00000000`00a8420c, 
            struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0, 
            bool isTopLevel = true)+0x69d [c:\cirrus\src\backend\commands\vacuum.c @ 482]
00000000`007ff5f0 00000001`4050dc95     postgres!autovacuum_do_vac_analyze(
            struct autovac_table * tab = 0x00000000`00a84208, 
            struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x8f [c:\cirrus\src\backend\postmaster\autovacuum.c @ 3248]
00000000`007ff640 00000001`4050b4e3     postgres!do_autovacuum(void)+0xef5 [c:\cirrus\src\backend\postmaster\autovacuum.c @ 2503]

It seems like there should be even more logs, especially since it says:
[03:48:43.119] Uploading 3 artifacts for c:\cirrus\**\*.diffs
[03:48:43.122] Uploaded c:\cirrus\contrib\tsm_system_rows\regression.diffs
[03:48:43.125] Uploaded c:\cirrus\contrib\tsm_system_time\regression.diffs



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Tue, Mar 29, 2022 at 11:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> +                               diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> +                               Assert(diff > 0);
>
> Did you see that this crashed on windows cfbot?
>
> https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
> TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)

That's weird. There are very similar assertions a little earlier that
must *not* have failed here, before the call to vac_update_relstats().
I was actually thinking of removing this assertion for that reason --
I thought that it was redundant.

Perhaps something is amiss inside vac_update_relstats(), where the
boolean flag that indicates that pg_class.relfrozenxid was advanced is
set:

    if (frozenxid_updated)
        *frozenxid_updated = false;
    if (TransactionIdIsNormal(frozenxid) &&
        pgcform->relfrozenxid != frozenxid &&
        (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
         TransactionIdPrecedes(ReadNextTransactionId(),
                               pgcform->relfrozenxid)))
    {
        if (frozenxid_updated)
            *frozenxid_updated = true;
        pgcform->relfrozenxid = frozenxid;
        dirty = true;
    }

Maybe the "existing relfrozenxid is in the future, silently update
relfrozenxid" part of the condition (which involves
ReadNextTransactionId()) somehow does the wrong thing here. But how?

The other assertions take into account the fact that OldestXmin can
itself "go backwards" across VACUUM operations against the same table:

    Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
           TransactionIdPrecedesOrEquals(FreezeLimit,
                                         vacrel->NewRelfrozenXid));

Note the "vacrel->NewRelfrozenXid == OldestXmin", without which the
assertion will fail pretty easily when the regression tests are run.
Perhaps I need to do something like that with the other assertion as
well (or more likely just get rid of it). Will figure it out tomorrow.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Perhaps something is amiss inside vac_update_relstats(), where the
> boolean flag that indicates that pg_class.relfrozenxid was advanced is
> set:
>
>     if (frozenxid_updated)
>         *frozenxid_updated = false;
>     if (TransactionIdIsNormal(frozenxid) &&
>         pgcform->relfrozenxid != frozenxid &&
>         (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
>          TransactionIdPrecedes(ReadNextTransactionId(),
>                                pgcform->relfrozenxid)))
>     {
>         if (frozenxid_updated)
>             *frozenxid_updated = true;
>         pgcform->relfrozenxid = frozenxid;
>         dirty = true;
>     }
>
> Maybe the "existing relfrozenxid is in the future, silently update
> relfrozenxid" part of the condition (which involves
> ReadNextTransactionId()) somehow does the wrong thing here. But how?

I tried several times to recreate this issue on CI. No luck with that,
though -- can't get it to fail again after 4 attempts.

This was a VACUUM of pg_database, run from an autovacuum worker. I am
vaguely reminded of the two bugs fixed by Andres in commit a54e1f15.
Both were issues with the shared relcache init file affecting shared
and nailed catalog relations. Those bugs had symptoms like "ERROR:
found xmin ... from before relfrozenxid ..." for various system
catalogs.

We know that this particular assertion did not fail during the same VACUUM:

    Assert(vacrel->NewRelfrozenXid == OldestXmin ||
           TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
                                         vacrel->NewRelfrozenXid));

So it's hard to see how this could be a bug in the patch -- the final
new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
problem scenario seen on the CI Windows instance yesterday (that's why
this earlier assertion didn't fail).  The assertion I'm showing here
needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the
condition to account for the fact that
OldestXmin/GetOldestNonRemovableTransactionId() is known to "go
backwards". Without that the regression tests will fail quite easily.

The surprising part of the CI failure must have taken place just after
this assertion, when VACUUM's call to vac_update_relstats() actually
updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
presumably because the existing relfrozenxid appeared to be "in the
future" when we examine it in pg_class again. We see evidence that
this must have happened afterwards, when the closely related assertion
(used only in instrumentation code) fails:

From my patch:

>             if (frozenxid_updated)
>             {
> -               diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> +               diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> +               Assert(diff > 0);
>                 appendStringInfo(&buf,
>                                  _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> -                                FreezeLimit, diff);
> +                                vacrel->NewRelfrozenXid, diff);
>             }

Does anybody have any ideas about what might be going on here?

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote:
> I tried several times to recreate this issue on CI. No luck with that,
> though -- can't get it to fail again after 4 attempts.

It's really annoying that we don't have Assert variants that show the compared
values; that might make it easier to interpret what's going on.

Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
AssertCmp(type, a, op, b),

Then the assertion could have been something like
   AssertCmp(int32, diff, >, 0)
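
Purely to illustrate the shape I mean (a quick sketch, nothing like
this exists today, and the int64 cast obviously only works for integer
types):

#ifdef USE_ASSERT_CHECKING
#define AssertCmp(type, a, op, b) \
    do { \
        type    a_ = (a); \
        type    b_ = (b); \
        if (!(a_ op b_)) \
            elog(PANIC, "assertion failed: %s %s %s, values " INT64_FORMAT " vs " INT64_FORMAT, \
                 #a, #op, #b, (int64) a_, (int64) b_); \
    } while (0)
#else
#define AssertCmp(type, a, op, b) ((void) 0)
#endif

That way the log would have shown the actual diff value, not just that
the comparison failed.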


Does the line number in the failed run actually correspond to the xid, rather
than the mxid case? I didn't check.


You could try to increase the likelihood of reproducing the failure by
duplicating the invocation that led to the crash a few times in the
.cirrus.yml file in your dev branch. That might allow hitting the problem more
quickly.

Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf?

Or locally - one thing that Windows CI does differently from the other platforms
is that it runs isolation, contrib and a bunch of other tests using the same
cluster. Which of course increases the likelihood of autovacuum having stuff
to do, *particularly* on shared relations - normally there probably aren't
enough changes for that.

You can do something similar locally on linux with
    make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
(the prove_installcheck=true to prevent tap tests from running, we don't seem
to have another way for that)

I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more
load concurrently than running tests serially...


> We know that this particular assertion did not fail during the same VACUUM:
> 
>     Assert(vacrel->NewRelfrozenXid == OldestXmin ||
>            TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
>                                          vacrel->NewRelfrozenXid));

The comment in your patch says "is either older or newer than FreezeLimit" - I
assume that's some rephrasing damage?



> So it's hard to see how this could be a bug in the patch -- the final
> new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
> problem scenario seen on the CI Windows instance yesterday (that's why
> this earlier assertion didn't fail).

Perhaps it's worth committing improved assertions on master? If this is indeed
a pre-existing bug, and we're just missing it due to slightly less stringent
asserts, we could rectify that separately.


> The surprising part of the CI failure must have taken place just after
> this assertion, when VACUUM's call to vac_update_relstats() actually
> updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
> presumably because the existing relfrozenxid appeared to be "in the
> future" when we examine it in pg_class again. We see evidence that
> this must have happened afterwards, when the closely related assertion
> (used only in instrumentation code) fails:

Hm. This triggers some vague memories. There's some oddities around shared
relations being vacuumed separately in all the databases and thus having
separate horizons.


After "remembering" that, I looked in the cirrus log for the failed run, and
the worker was processing a shared relation last:

2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG:  automatic analyze of table
"contrib_regression.pg_catalog.pg_authid"

Obviously that's not a guarantee that the next table processed also is a
shared catalog, but ...

Oh, the relid is actually in the stack trace. 0x4ee = 1262 =
pg_database. Which makes sense, the test ends up with a high percentage of
dead rows in pg_database, due to all the different contrib tests
creating/dropping a database.



> From my patch:
> 
> >             if (frozenxid_updated)
> >             {
> > -               diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> > +               diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > +               Assert(diff > 0);
> >                 appendStringInfo(&buf,
> >                                  _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> > -                                FreezeLimit, diff);
> > +                                vacrel->NewRelfrozenXid, diff);
> >             }

Perhaps this ought to be an elog() instead of an Assert()? Something has gone
pear shaped if we get here... It's a bit annoying though, because it'd have to
be a PANIC to be visible on the bf / CI :(.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
> Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
> AssertCmp(type, a, op, b),
>
> Then the assertion could have been something like
>    AssertCmp(int32, diff, >, 0)

I'd definitely use them if they were there.

> Does the line number in the failed run actually correspond to the xid, rather
> than the mxid case? I didn't check.

Yes, I verified -- definitely relfrozenxid.

> You can do something similar locally on linux with
>     make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
> (the prove_installcheck=true to prevent tap tests from running, we don't seem
> to have another way for that)
>
> I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more
> load concurrently than running tests serially...

Can't get it to fail locally with that recipe.

> >     Assert(vacrel->NewRelfrozenXid == OldestXmin ||
> >            TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
> >                                          vacrel->NewRelfrozenXid));
>
> The comment in your patch says "is either older or newer than FreezeLimit" - I
> assume that's some rephrasing damage?

Both the comment and the assertion are correct. I see what you mean, though.

> Perhaps it's worth committing improved assertions on master? If this is indeed
> a pre-existing bug, and we're just missing it due to slightly less stringent
> asserts, we could rectify that separately.

I don't think there's much chance of the assertion actually hitting
without the rest of the patch series. The new relfrozenxid value is
always going to be OldestXmin - vacuum_freeze_min_age on HEAD, while
with the patch it's sometimes close to OldestXmin. Especially when you
have lots of dead tuples that you churn through constantly (like
pgbench_tellers, or like these system catalogs on the CI test
machine).

> Hm. This triggers some vague memories. There's some oddities around shared
> relations being vacuumed separately in all the databases and thus having
> separate horizons.

That's what I was thinking of, obviously.

> After "remembering" that, I looked in the cirrus log for the failed run, and
> the worker was processing a shared relation last:
>
> 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG:  automatic analyze of table
"contrib_regression.pg_catalog.pg_authid"

I noticed the same thing myself. Should have said sooner.

> Perhaps this ought to be an elog() instead of an Assert()? Something has gone
> pear shaped if we get here... It's a bit annoying though, because it'd have to
> be a PANIC to be visible on the bf / CI :(.

Yeah, a WARNING would be good here. I can write a new version of my
patch series with a separate patch for that this evening. Actually,
better make it a PANIC for now...
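
Something along these lines, inside vac_update_relstats(), right before
the existing relfrozenxid test (pgcform and relation are the function's
existing variables; this is just a sketch -- the exact wording, log
level, and the matching relminmxid check still need to be worked out):

    TransactionId nextXID = ReadNextTransactionId();

    /*
     * Complain loudly if the existing relfrozenxid is in the future,
     * instead of silently overwriting it
     */
    if (TransactionIdIsNormal(pgcform->relfrozenxid) &&
        TransactionIdPrecedes(nextXID, pgcform->relfrozenxid))
        elog(PANIC, "relfrozenxid %u of relation \"%s\" is in the future (next XID is %u)",
             pgcform->relfrozenxid,
             RelationGetRelationName(relation),
             nextXID);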

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 7:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Yeah, a WARNING would be good here. I can write a new version of my
> patch series with a separate patch for that this evening. Actually,
> better make it a PANIC for now...

Attached is v14, which includes a new patch that PANICs like that in
vac_update_relstats() --- 0003.

This approach also covers manual VACUUMs, unlike the failing
assertion, which is in instrumentation code (though VACUUM VERBOSE
might hit it).

I definitely think that something like this should be committed.
Silently ignoring system catalog corruption isn't okay.

-- 
Peter Geoghegan

Вложения

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

I was able to trigger the crash.

cat ~/tmp/pgbench-createdb.sql
CREATE DATABASE pgb_:client_id;
DROP DATABASE pgb_:client_id;

pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql

while I was also running

for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
 

I triggered twice now, but it took a while longer the second time.

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
        set = {__val = {4194304, 0, 0, 0, 0, 0, 216172782113783808, 2, 2377909399344644096, 18446497967838863616, 0, 0,
0,0, 0, 0}}
 
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007fe49a2db546 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 0, 0, 0,
0,0, 0, 0, 0, 0, 0, 0, 0}},
 
          sa_flags = 0, sa_restorer = 0x107e0}
        sigs = {__val = {32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
#2  0x00007fe49b9706f1 in ExceptionalCondition (conditionName=0x7fe49ba0618d "diff > 0", errorType=0x7fe49ba05bd1
"FailedAssertion",
    fileName=0x7fe49ba05b90 "/home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c", lineNumber=724)
    at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
No locals.
#3  0x00007fe49b2fc739 in heap_vacuum_rel (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10)
    at /home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:724
        buf = {
          data = 0x7fe49c17e238 "automatic vacuum of table \"contrib_regression_dict_int.pg_catalog.pg_database\":
indexscans: 1\npages: 0 removed, 3 remain, 3 scanned (100.00% of total)\ntuples: 49 removed, 53 remain, 9 are dead but
no"...,len = 279, maxlen = 1024, cursor = 0}
 
        msgfmt = 0x7fe49ba06038 "automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"
        diff = 0
        endtime = 702011687982080
        vacrel = 0x7fe49c19b5b8
        verbose = false
        instrument = true
        ru0 = {tv = {tv_sec = 1648696487, tv_usec = 975963}, ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime =
{tv_sec= 0, tv_usec = 3086}, {
 
--Type <RET> for more, q to quit, c to continue without paging--c
              ru_maxrss = 10824, __ru_maxrss_word = 10824}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0,
__ru_idrss_word= 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 449, __ru_minflt_word = 449}, {ru_majflt = 0,
__ru_majflt_word= 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0,
__ru_oublock_word= 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0,
__ru_nsignals_word= 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}}
 
        starttime = 702011687975964
        walusage_start = {wal_records = 0, wal_fpi = 0, wal_bytes = 0}
        walusage = {wal_records = 11, wal_fpi = 7, wal_bytes = 30847}
        secs = 0
        usecs = 6116
        read_rate = 16.606033355134073
        write_rate = 7.6643230869849575
        aggressive = false
        skipwithvm = true
        frozenxid_updated = true
        minmulti_updated = true
        orig_rel_pages = 3
        new_rel_pages = 3
        new_rel_allvisible = 0
        indnames = 0x7fe49c19bb28
        errcallback = {previous = 0x0, callback = 0x7fe49b3012fd <vacuum_error_callback>, arg = 0x7fe49c19b5b8}
        startreadtime = 180
        startwritetime = 0
        OldestXmin = 67552
        FreezeLimit = 4245034848
        OldestMxact = 224
        MultiXactCutoff = 4289967520
        __func__ = "heap_vacuum_rel"
#4  0x00007fe49b523d92 in table_relation_vacuum (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10)
at/home/andres/src/postgresql/src/include/access/tableam.h:1680
 
No locals.
#5  0x00007fe49b527032 in vacuum_rel (relid=1262, relation=0x7fe49c1ae360, params=0x7fe49c130d7c) at
/home/andres/src/postgresql/src/backend/commands/vacuum.c:2065
        lmode = 4
        rel = 0x7fe497a8d148
        lockrelid = {relId = 1262, dbId = 0}
        toast_relid = 0
        save_userid = 10
        save_sec_context = 0
        save_nestlevel = 2
        __func__ = "vacuum_rel"
#6  0x00007fe49b524c3b in vacuum (relations=0x7fe49c1b03a8, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10,
isTopLevel=true)at /home/andres/src/postgresql/src/backend/commands/vacuum.c:482
 
        vrel = 0x7fe49c1ae3b8
        cur__state = {l = 0x7fe49c1b03a8, i = 0}
        cur = 0x7fe49c1b03c0
        _save_exception_stack = 0x7fff97e35a10
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318940970843, 9223372036854775747, 0, 0, 0,
6126579318957748059,6139499258682879835}, __mask_was_saved = 0, __saved_mask = {__val = {32, 140619848279000,
8590910454,140619848278592, 32, 140619848278944, 7784, 140619848278592, 140619848278816, 140735741647200,
140619839915137,8458711686435861857, 32, 4869, 140619848278592, 140619848279024}}}}
 
        _do_rethrow = false
        in_vacuum = true
        stmttype = 0x7fe49baff1a7 "VACUUM"
        in_outer_xact = false
        use_own_xacts = true
        __func__ = "vacuum"
#7  0x00007fe49b6d483d in autovacuum_do_vac_analyze (tab=0x7fe49c130d78, bstrategy=0x7fe49c130e10) at
/home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:3247
        rangevar = 0x7fe49c1ae360
        rel = 0x7fe49c1ae3b8
        rel_list = 0x7fe49c1ae3f0
#8  0x00007fe49b6d34bc in do_autovacuum () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:2495
        _save_exception_stack = 0x7fff97e35d70
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318779490139, 9223372036854775747, 0, 0, 0,
6126579319014371163,6139499700101525339}, __mask_was_saved = 0, __saved_mask = {__val = {140619840139982,
140735741647712,140619841923928, 957, 140619847223443, 140735741647656, 140619847312112, 140619847223451,
140619847223443,140619847224399, 0, 139637976727552, 140619817480714, 140735741647616, 140619839856340, 1024}}}}
 
        _do_rethrow = false
        tab = 0x7fe49c130d78
        skipit = false
        stdVacuumCostDelay = 0
        stdVacuumCostLimit = 200
        iter = {cur = 0x7fe497668da0, end = 0x7fe497668da0}
        relid = 1262
        classTup = 0x7fe497a6c568
        isshared = true
        cell__state = {l = 0x7fe49c130d40, i = 0}
        classRel = 0x7fe497a5ae18
        tuple = 0x0
        relScan = 0x7fe49c130928
        dbForm = 0x7fe497a64fb8
        table_oids = 0x7fe49c130d40
        orphan_oids = 0x0
        ctl = {num_partitions = 0, ssize = 0, dsize = 1296236544, max_dsize = 140619847224424, keysize = 4, entrysize =
96,hash = 0x0, match = 0x0, keycopy = 0x0, alloc = 0x0, hcxt = 0x7fff97e35c50, hctl = 0x7fe49b9a787e
<AllocSetFree+670>}
        table_toast_map = 0x7fe49c19d2f0
        cell = 0x7fe49c130d58
        shared = 0x7fe49c17c360
        dbentry = 0x7fe49c18d7a0
        bstrategy = 0x7fe49c130e10
        key = {sk_flags = 0, sk_attno = 17, sk_strategy = 3, sk_subtype = 0, sk_collation = 950, sk_func = {fn_addr =
0x7fe49b809a6a<chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = true, fn_retset = false, fn_stats = 2 '\002', fn_extra =
0x0,fn_mcxt = 0x7fe49c12f7f0, fn_expr = 0x0}, sk_argument = 116}
 
        pg_class_desc = 0x7fe49c12f910
        effective_multixact_freeze_max_age = 400000000
        did_vacuum = false
        found_concurrent_worker = false
        i = 32740
        __func__ = "do_autovacuum"
#9  0x00007fe49b6d21c4 in AutoVacWorkerMain (argc=0, argv=0x0) at
/home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1719
        dbname =
"contrib_regression_dict_int\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
        local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318890639195, 9223372036854775747, 0, 0, 0,
6126579318785781595,6139499699353759579}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 8,
140735741648416,140735741648352, 3156423108750738944, 0, 30, 140735741647888, 140619835812981, 140735741648080,
32666874400,140735741648448, 140619836964693, 140735741652128, 2586778441, 140735741648448}}}} 
        dbid = 205328
        __func__ = "AutoVacWorkerMain"
#10 0x00007fe49b6d1d5b in StartAutoVacWorker () at
/home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1504
        worker_pid = 0
        __func__ = "StartAutoVacWorker"
#11 0x00007fe49b6e79af in StartAutovacuumWorker () at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5635
        bn = 0x7fe49c0da920
        __func__ = "StartAutovacuumWorker"
#12 0x00007fe49b6e745d in sigusr1_handler (postgres_signal_arg=10) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5340
        save_errno = 4
        __func__ = "sigusr1_handler"
#13 <signal handler called>
No locals.
#14 0x00007fe49a3a9fc4 in __GI___select (nfds=8, readfds=0x7fff97e36c20, writefds=0x0, exceptfds=0x0,
timeout=0x7fff97e36ca0)at ../sysdeps/unix/sysv/linux/select.c:71
 
        sc_ret = -4
        sc_ret = <optimized out>
        s = <optimized out>
        us = <optimized out>
        ns = <optimized out>
        ts64 = {tv_sec = 59, tv_nsec = 765565741}
        pts64 = <optimized out>
        r = <optimized out>
#15 0x00007fe49b6e26c7 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1765
        timeout = {tv_sec = 60, tv_usec = 0}
        rmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
        selres = -1
        now = 1648696487
        readmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
        nSockets = 8
        last_lockfile_recheck_time = 1648696432
        last_touch_time = 1648696072
        __func__ = "ServerLoop"
#16 0x00007fe49b6e2031 in PostmasterMain (argc=55, argv=0x7fe49c0aa2d0) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473
        opt = -1
        status = 0
        userDoption = 0x7fe49c0951d0 "/srv/dev/pgdev-dev/"
        listen_addr_saved = true
        i = 64
        output_config_variable = 0x0
        __func__ = "PostmasterMain"
#17 0x00007fe49b5d2808 in main (argc=55, argv=0x7fe49c0aa2d0) at
/home/andres/src/postgresql/src/backend/main/main.c:202
        do_check_root = true

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> I triggered twice now, but it took a while longer the second time.

Great.

I wonder if you can get an RR recording...
-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 20:28:44 -0700, Andres Freund wrote:
> I was able to trigger the crash.
> 
> cat ~/tmp/pgbench-createdb.sql
> CREATE DATABASE pgb_:client_id;
> DROP DATABASE pgb_:client_id;
> 
> pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
> 
> while I was also running
> 
> for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
 
> 
> I triggered twice now, but it took a while longer the second time.

Forgot to say how postgres was started. Via my usual devenv script, which
results in:

+ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -c hba_file=/home/andres/tmp/pgdev/pg_hba.conf -D /srv/dev/pgdev-dev/ -p 5440 -c shared_buffers=2GB -c wal_level=hot_standby -c max_wal_senders=10 -c track_io_timing=on -c restart_after_crash=false -c max_prepared_transactions=20 -c log_checkpoints=on -c min_wal_size=48MB -c max_wal_size=150GB -c 'cluster_name=dev assert' -c ssl_cert_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.pem -c ssl_key_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.key -c 'log_line_prefix=%m [%p][%b][%v:%x][%a] ' -c shared_buffers=16MB -c log_min_messages=debug1 -c log_connections=on -c allow_in_place_tablespaces=1 -c log_autovacuum_min_duration=0 -c log_lock_waits=true -c autovacuum_naptime=10s -c fsync=off
 

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> > I triggered twice now, but it took a while longer the second time.
>
> Great.
>
> I wonder if you can get an RR recording...

Started it, but looks like it's too slow.

(gdb) p MyProcPid
$1 = 2172500

(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false

There was another autovacuum of pg_database 10s before:

2022-03-30 20:35:17.622 PDT [2165344][autovacuum worker][5/3:0][] LOG:  automatic vacuum of table "postgres.pg_catalog.pg_database": index scans: 1
        pages: 0 removed, 3 remain, 3 scanned (100.00% of total)
        tuples: 61 removed, 4 remain, 1 are dead but not yet removable
        removable cutoff: 1921, older by 3 xids when operation ended
        new relfrozenxid: 717, which is 3 xids ahead of previous value
        index scan needed: 3 pages from table (100.00% of total) had 599 dead item identifiers removed
        index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
        index "pg_database_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
        I/O timings: read: 0.029 ms, write: 0.034 ms
        avg read rate: 134.120 MB/s, avg write rate: 89.413 MB/s
        buffer usage: 35 hits, 12 misses, 8 dirtied
        WAL usage: 12 records, 5 full page images, 27218 bytes
        system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

The dying backend:
2022-03-30 20:35:27.668 PDT [2172500][autovacuum worker][7/0:0][] DEBUG:  autovacuum: processing database "contrib_regression_hstore"
...
2022-03-30 20:35:27.690 PDT [2172500][autovacuum worker][7/674:0][] CONTEXT:  while cleaning up index "pg_database_oid_index" of relation "pg_catalog.pg_database"


Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false

Does this OldestXmin seem reasonable at this point in execution, based
on context? Does it look too high? Something else?

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 21:04:07 -0700, Andres Freund wrote:
> On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > I triggered twice now, but it took a while longer the second time.
> >
> > Great.
> >
> > I wonder if you can get an RR recording...
>
> Started it, but looks like it's too slow.
>
> (gdb) p MyProcPid
> $1 = 2172500
>
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false

I added a bunch of debug elogs to see what sets *frozenxid_updated to true.

(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2, aggressive = false, skipwithvm = true,
failsafe_active= false,
 
  consider_bypass_optimization = true, do_index_vacuuming = true, do_index_cleanup = true, do_rel_truncate = true,
bstrategy= 0x7fe255bb0e28, pvs = 0x0,
 
  relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42, OldestXmin = 20751, vistest = 0x7fe255058970
<GlobalVisSharedRels>,FreezeLimit = 4244988047,
 
  MultiXactCutoff = 4289967302, NewRelfrozenXid = 717, NewRelminMxid = 6, skippedallvis = false, relnamespace =
0x7fe255c17bf8"pg_catalog",
 
  relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295, offnum = 0, phase =
VACUUM_ERRCB_PHASE_SCAN_HEAP,verbose = false,
 
  dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8, removed_pages = 0, lpdead_item_pages = 0,
missed_dead_pages= 0, nonempty_pages = 8,
 
  new_rel_tuples = 124, new_live_tuples = 42, indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0,
lpdead_items= 0, live_tuples = 42,
 
  recently_dead_tuples = 82, missed_dead_tuples = 0}

But the debug elog reports that

relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6

The problem is that the crashing backend reads the relfrozenxid/relminmxid
from the shared relcache init file written by another backend:

2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG:  automatic vacuum of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
        pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
        tuples: 4 removed, 114 remain, 72 are dead but not yet removable
        removable cutoff: 20751, older by 596 xids when operation ended
        new relfrozenxid: 717, which is 3 xids ahead of previous value
        new relminmxid: 6, which is 5 mxids ahead of previous value
        index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
        index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
        index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
        I/O timings: read: 0.050 ms, write: 0.102 ms
        avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
        buffer usage: 42 hits, 22 misses, 8 dirtied
        WAL usage: 13 records, 5 full page images, 33950 bytes
        system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG:  InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG:  my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG:  reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG:  StartTransaction(1) name: unnamed;
blockState:DEFAULT; state: INPROGRESS, xid/sub>
 
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG:  reading non-shared init file

This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
normally fairly harmless - I think.

Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 21:11:48 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:
> > (gdb) p vacrel->NewRelfrozenXid
> > $3 = 717
> > (gdb) p vacrel->relfrozenxid
> > $4 = 717
> > (gdb) p OldestXmin
> > $5 = 5112
> > (gdb) p aggressive
> > $6 = false
>
> Does this OldestXmin seem reasonable at this point in execution, based
> on context? Does it look too high? Something else?

Reasonable:
(gdb) p *ShmemVariableCache
$1 = {nextOid = 78969, oidCount = 2951, nextXid = {value = 21411}, oldestXid = 714, xidVacLimit = 200000714,
  xidWarnLimit = 2107484361, xidStopLimit = 2144484361, xidWrapLimit = 2147484361, oldestXidDB = 1,
  oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = {value = 21408}, xactCompletionCount = 1635,
  oldestClogXid = 714}

I think the explanation I just sent explains the problem, without "in-memory"
confusion about what's running and what's not.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> But the debug elog reports that
>
> relfrozenxid updated 714 -> 717
> relminmxid updated 1 -> 6
>
> The problem is that the crashing backend reads the relfrozenxid/relminmxid
> from the shared relcache init file written by another backend:

We should have added logging of relfrozenxid and relminmxid a long time ago.

> This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> normally fairly harmless - I think.

Is this one pretty old?

> Perhaps we should just fetch the horizons from the "local" catalog for shared
> rels?

Not sure what you mean.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 9:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > rels?
>
> Not sure what you mean.

Wait, you mean use vacrel->relfrozenxid directly? Seems kind of ugly...

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > But the debug elog reports that
> >
> > relfrozenxid updated 714 -> 717
> > relminmxid updated 1 -> 6
> >
> > The problem is that the crashing backend reads the relfrozenxid/relminmxid
> > from the shared relcache init file written by another backend:
> 
> We should have added logging of relfrozenxid and relminmxid a long time ago.

At least at DEBUG1 or such.


> > This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> > normally fairly harmless - I think.
> 
> Is this one pretty old?

What do you mean by "this one"? The cause of the assert failure?

I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
the horizon increasing a bunch, by falsely not using an aggressive vacuum when
we should have - might even be limited to a single autovacuum cycle.



> > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > rels?
> 
> Not sure what you mean.

Basically, instead of relying on the relcache, which for shared relations is
vulnerable to seeing "too new" horizons due to the shared relcache init file,
explicitly load relfrozenxid / relminmxid from the catalog / syscache.

I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
etc. Whereas right now we only fetch the pg_class row in
vac_update_relstats(), but use the relcache before.
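
For concreteness, a rough sketch of what I mean (not a patch; the vacrel
fields follow the patch's LVRelState, and the syscache identifier is actually
spelled RELOID in the tree):

	/*
	 * Hedged sketch only: take the cutoff inputs from the pg_class row
	 * itself, rather than from rel->rd_rel, which for shared rels may have
	 * come from the shared relcache init file.
	 */
	HeapTuple	classTup;
	Form_pg_class classForm;

	classTup = SearchSysCache1(RELOID,
							   ObjectIdGetDatum(RelationGetRelid(rel)));
	if (!HeapTupleIsValid(classTup))
		elog(ERROR, "cache lookup failed for relation %u",
			 RelationGetRelid(rel));
	classForm = (Form_pg_class) GETSTRUCT(classTup);

	vacrel->relfrozenxid = classForm->relfrozenxid;
	vacrel->relminmxid = classForm->relminmxid;

	ReleaseSysCache(classTup);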

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-30 21:59:15 -0700, Andres Freund wrote:
> On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > > rels?
> > 
> > Not sure what you mean.
> 
> Basically, instead of relying on the relcache, which for shared relations is
> vulnerable to seeing "too new" horizons due to the shared relcache init file,
> explicitly load relfrozenxid / relminmxid from the catalog / syscache.
> 
> I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
> SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
> etc. Whereas right now we only fetch the pg_class row in
> vac_update_relstats(), but use the relcache before.

Perhaps we should explicitly mask out parts of relcache entries in the shared
init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to
Invalid* or such.
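
Something along these lines (exact placement assumed, shown only to
illustrate the idea):

	/*
	 * Hedged sketch: when a shared rel's entry comes from the shared init
	 * file, stamp out the fields we know may be stale, so any code that
	 * trusts them fails loudly rather than subtly.
	 */
	if (relation->rd_rel->relisshared)
	{
		relation->rd_rel->relfrozenxid = InvalidTransactionId;
		relation->rd_rel->relminmxid = InvalidMultiXactId;
	}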

I even wonder if we should just generally move those out of the fields we have
in the relcache, not just for shared rels loaded from the init
file. Presumably by just moving them into the CATALOG_VARLEN ifdef.
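
(For reference, the tail of pg_class.h currently looks roughly like this;
anything below the CATALOG_VARLEN line never shows up in Form_pg_class /
rd_rel, which is the property I'm after:)

	TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
	TransactionId relminmxid;	/* all multixacts in this rel are >= this */

#ifdef CATALOG_VARLEN			/* variable-length fields start here */
	aclitem		relacl[1];		/* access permissions */
	text		reloptions[1];	/* access-method-specific options */
	pg_node_tree relpartbound;	/* partition bound node tree */
#endif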

The only place that appears to access rd_rel->relfrozenxid outside of DDL is
heap_abort_speculative().

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> Perhaps we should explicitly mask out parts of relcache entries in the shared
> init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to
> Invalid* or such.

That has the advantage of being more honest. If you're going to break
the abstraction, then it seems best to break it in an obvious way that
leaves no doubt about what you're supposed to be relying on.

This bug doesn't seem like the kind of thing that should be left
as-is. If only because it makes it hard to add something like a
WARNING when we make relfrozenxid go backwards (on the basis of the
existing value apparently being in the future), which we really should
have been doing all along.

The whole reason why we overwrite pg_class.relfrozenxid values from
the future is to ameliorate the effects of more serious bugs like the
pg_upgrade/pg_resetwal one fixed in commit 74cf7d46 not so long ago
(mid last year). We had essentially the same pg_upgrade "from the
future" bug twice (once for relminmxid in the MultiXact bug era,
another more recent version affecting relfrozenxid).
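
(The rule in question is roughly this, paraphrasing vac_update_relstats();
the exact code may differ a little:)

	/*
	 * Overwrite relfrozenxid when the new value is more recent, or when the
	 * stored value appears to be "in the future" relative to the next XID --
	 * the latter case papers over damage from past pg_upgrade/pg_resetwal
	 * bugs.
	 */
	if (TransactionIdIsNormal(frozenxid) &&
		pgcform->relfrozenxid != frozenxid &&
		(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
		 TransactionIdPrecedes(ReadNextTransactionId(),
							   pgcform->relfrozenxid)))
	{
		pgcform->relfrozenxid = frozenxid;
		dirty = true;
	}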

> The only place that appears to access rd_rel->relfrozenxid outside of DDL is
> heap_abort_speculative().

I wonder how necessary that really is. Even if the XID is before
relfrozenxid, does that in itself really make it "in the future"?
Obviously it's often necessary to make the assumption that allowing
wraparound amounts to allowing XIDs "from the future" to exist, which
is dangerous. But why here? Won't pruning by VACUUM eventually correct
the issue anyway?

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-31 09:58:18 -0700, Peter Geoghegan wrote:
> On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> > The only place that appears to access rd_rel->relfrozenxid outside of DDL is
> > heap_abort_speculative().
> 
> I wonder how necessary that really is. Even if the XID is before
> relfrozenxid, does that in itself really make it "in the future"?
> Obviously it's often necessary to make the assumption that allowing
> wraparound amounts to allowing XIDs "from the future" to exist, which
> is dangerous. But why here? Won't pruning by VACUUM eventually correct
> the issue anyway?

I don't think we should weaken defenses against xids from before relfrozenxid
in vacuum / amcheck / .... If anything we should strengthen them.

Isn't it also just plainly required for correctness? We'd not necessarily
trigger a vacuum in time to remove the xid before approaching wraparound if we
put in an xid before relfrozenxid? That happening in prune_xid is obviously
less bad than on actual data, but still.


ISTM we should just use our own xid. Yes, it might delay cleanup a bit
longer. But unless there's already crud on the page (with prune_xid already
set), the abort of the speculative insertion isn't likely to make the
difference?
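
Roughly (sketch only, not tested):

	/*
	 * Hedged sketch: mark the page prunable using our own XID, instead of
	 * consulting rel->rd_rel->relfrozenxid (the one non-DDL reader of that
	 * relcache field).  Pruning may happen a little later, but there is no
	 * dependency on a possibly-stale relcache value.
	 */
	PageSetPrunable(page, GetCurrentTransactionId());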

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
> the horizon increasing a bunch, by falsely not using an aggressive vacuum when
> we should have - might even be limited to a single autovacuum cycle.

So, to be clear: vac_update_relstats() never actually considered the
new relfrozenxid value from its vacuumlazy.c caller to be "in the
future"? It just looked that way to the failing assertion in
vacuumlazy.c, because its own version of the original relfrozenxid was
stale from the beginning? And so the worst problem is probably just
that we don't use aggressive VACUUM when we really should in rare
cases?

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Mar 31, 2022 at 10:11 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think we should weaken defenses against xids from before relfrozenxid
> in vacuum / amcheck / .... If anything we should strengthen them.
>
> Isn't it also just plainly required for correctness? We'd not necessarily
> trigger a vacuum in time to remove the xid before approaching wraparound if we
> put in an xid before relfrozenxid? That happening in prune_xid is obviously
> > less bad than on actual data, but still.

Yeah, you're right. Ambiguity about stuff like this should be avoided
on general principle.

> ISTM we should just use our own xid. Yes, it might delay cleanup a bit
> longer. But unless there's already crud on the page (with prune_xid already
> set), the abort of the speculative insertion isn't likely to make the
> difference?

Speculative insertion abort is pretty rare in the real world, I bet.
The speculative insertion precheck is very likely to work almost
always with real workloads.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-03-31 10:12:49 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
> > the horizon increasing a bunch, by falsely not using an aggressive vacuum when
> > we should have - might even be limited to a single autovacuum cycle.
> 
> So, to be clear: vac_update_relstats() never actually considered the
> new relfrozenxid value from its vacuumlazy.c caller to be "in the
> future"?

No, I added separate debug messages for those, and also applied your patch,
and it didn't trigger.

I don't immediately see how we could end up computing a frozenxid value that
would be problematic? The pgcform->relfrozenxid value will always be the
"local" value, which afaics can be behind the other database's value (and thus
behind the value from the relcache init file). But it can't be ahead, we have
the proper invalidations for that (I think).


I do think we should apply a version of the warnings you have (with a WARNING
instead of PANIC obviously). I think it's bordering on insanity that we have
so many paths to just silently fix stuff up around vacuum. It's like we want
things to be undebuggable, and to give users no warnings about something being
up.


> It just looked that way to the failing assertion in
> vacuumlazy.c, because its own version of the original relfrozenxid was
> stale from the beginning? And so the worst problem is probably just
> that we don't use aggressive VACUUM when we really should in rare
> cases?

Yes, I think that's right.

Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5
and fsync=off made the crash trigger more quickly.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Mar 31, 2022 at 10:50 AM Andres Freund <andres@anarazel.de> wrote:
> > So, to be clear: vac_update_relstats() never actually considered the
> > new relfrozenxid value from its vacuumlazy.c caller to be "in the
> > future"?
>
> No, I added separate debug messages for those, and also applied your patch,
> and it didn't trigger.

The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Plus
the other related assert I mentioned did not trigger. So when this
"diff" assert did trigger, the value of "diff" must have been 0 (not a
negative value). While this state does technically indicate that the
"existing" relfrozenxid value (actually a stale version) appears to be
"in the future" (because the OldestXmin XID might still never have
been allocated), it won't ever be in the future according to
vac_update_relstats() (even if it used that version).
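
(For reference, the assertion has this general shape -- field names per the
patch series, not committed code:)

	int32		diff;

	diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
	Assert(diff > 0);	/* fails when diff == 0; "Assert(diff >= 0)" would not */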

I suppose that I might be wrong about that, somehow -- anything is
possible. The important point is that there is currently no evidence
that this bug (or any very recent bug) could ever allow
vac_update_relstats() to actually believe that it needs to update
relfrozenxid/relminmxid, purely because the existing value is in the
future.

The fact that vac_update_relstats() doesn't log/warn when this happens
is very unfortunate, but there is nevertheless no evidence that that
would have informed us of any bug on HEAD, even including the actual
bug here, which is a bug in vacuumlazy.c (not in vac_update_relstats).

> I do think we should apply a version of the warnings you have (with a WARNING
> instead of PANIC obviously). I think it's bordering on insanity that we have
> so many paths to just silently fix stuff up around vacuum. It's like we want
> things to be undebuggable, and to give users no warnings about something being
> up.

Yeah, it's just totally self-defeating to not at least log it. I mean
this is a code path that is only hit once per VACUUM, so there is
practically no risk of that causing any new problems.

> Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5
> and fsync=off made the crash trigger more quickly.

I'll try to do that today. I'm not feeling the most energetic right
now, to be honest.

--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".

Attached is v15. I plan to commit the first two patches (the most
substantial two patches by far) in the next couple of days, barring
objections.

v15 removes this "Assert(diff > 0)" assertion from 0001. It's not
adding any value, now that the underlying issue that it accidentally
brought to light is well understood (there are still more robust
assertions to the relfrozenxid/relminmxid invariants). "Assert(diff >
0)" is liable to fail until the underlying bug on HEAD is fixed, which
can be treated as separate work.

I also refined the WARNING patch in v15. It now actually issues
WARNINGs (rather than PANICs, which were just a temporary debugging
measure in v14). Also fixed a compiler warning in this patch, based on
a complaint from CFBot's CompilerWarnings task. I can delay committing
this WARNING patch until right before feature freeze. Seems best to
give others more opportunity for comments.

-- 
Peter Geoghegan

Вложения

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-04-01 10:54:14 -0700, Peter Geoghegan wrote:
> On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".
>
> Attached is v15. I plan to commit the first two patches (the most
> substantial two patches by far) in the next couple of days, barring
> objections.

Just saw that you committed: Wee! I think this will be a substantial
improvement for our users.


While I was writing the above I, again, realized that it'd be awfully nice to
have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
feedback about improvements more easily and for users to know what aspects
they need to tune.

Knowing how many times a table was vacuumed doesn't really tell that much, and
having to enable log_autovacuum_min_duration and then aggregate those
results is pretty painful (and version dependent).

If we just collected something like:
- number of heap passes
- time spent heap vacuuming
- number of index scans
- time spent index vacuuming
- time spent delaying
- percentage of not-yet-removable vs removable tuples

it'd start to be a heck of a lot easier to judge how well autovacuum is
coping.
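
To make that concrete, the counters could look something like this (all names
invented here, just a sketch in the style of the existing cumulative stats
structs):

/* Hedged sketch only -- no such struct exists today */
typedef struct PgStat_StatTabVacuum
{
	PgStat_Counter heap_vacuum_passes;		/* passes over the heap */
	PgStat_Counter heap_vacuum_time_us;		/* time spent heap vacuuming */
	PgStat_Counter index_vacuum_scans;		/* rounds of index vacuuming */
	PgStat_Counter index_vacuum_time_us;	/* time spent index vacuuming */
	PgStat_Counter vacuum_delay_time_us;	/* time spent in cost-based delays */
	PgStat_Counter tuples_removable;		/* dead and removable */
	PgStat_Counter tuples_not_yet_removable;	/* dead but not yet removable */
} PgStat_StatTabVacuum;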

If we tracked the related pieces above in the index stats (or perhaps
additionally there), it'd also make it easier to judge the cost of different
indexes.

- Andres



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Sun, Apr 3, 2022 at 12:05 PM Andres Freund <andres@anarazel.de> wrote:
> Just saw that you committed: Wee! I think this will be a substantial
> improvement for our users.

I hope so! I think that it's much more useful as the basis for future
work than as a standalone thing. Users of Postgres 15 might not notice
a huge difference. But it opens up a lot of new directions to take
VACUUM in.

I would like to get rid of anti-wraparound VACUUMs and aggressive
VACUUMs in Postgres 16. This isn't as radical as it sounds. It seems
quite possible to find a way for *every* VACUUM to become aggressive
progressively and dynamically. We'll still need to have autovacuum.c
know about wraparound, but it should be just another threshold,
not fundamentally different to the other thresholds (except that it's
still used when autovacuum is nominally disabled).

The behavior around autovacuum cancellations is probably still going
to be necessary when age(relfrozenxid) gets too high, but it shouldn't
be conditioned on what age(relfrozenxid) *used to be*, when the
autovacuum started. That could have been a long time ago. It should be
based on what's happening *right now*.

> While I was writing the above I, again, realized that it'd be awfully nice to
> have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
> feedback about improvements more easily and for users to know what aspects
> they need to tune.

Strongly agree. And I'm excited about the potential of the shared
memory stats patch to enable more thorough instrumentation, which
allows us to improve things with feedback that we just can't get right
now.

VACUUM is still too complicated -- that makes this kind of analysis
much harder, even for experts. You need more continuous behavior to
get value from this kind of analysis. There are too many things that
might end up mattering, that really shouldn't ever matter. Too much
potential for strange illogical discontinuities in performance over
time.

Having only one type of VACUUM (excluding VACUUM FULL) will be much
easier for users to reason about. But I also think that it'll be much
easier for us to reason about. For example, better autovacuum
scheduling will be made much easier if autovacuum.c can just assume
that every VACUUM operation will do the same amount of work. (Another
problem with the scheduling is that it uses ANALYZE statistics
(sampling) in a way that just doesn't make any sense for something
like VACUUM, which is an inherently dynamic and cyclic process.)

None of this stuff has to rely on my patch for freezing. We don't
necessarily have to make every VACUUM advance relfrozenxid to do all
this. The important point is that we definitely shouldn't be putting
off *all* freezing of all-visible pages in non-aggressive VACUUMs (or
in VACUUMs that are not expected to advance relfrozenxid). Even a very
conservative implementation could achieve all this; we need only
spread out the burden of freezing all-visible pages over time, across
multiple VACUUM operations. Make the behavior continuous.

> Knowing how many times a table was vacuumed doesn't really tell that much, and
> requiring to enable log_autovacuum_min_duration and then aggregating those
> results is pretty painful (and version dependent).

Yeah. Ideally we could avoid making the output of
log_autovacuum_min_duration into an API, by having a real API instead.
The output probably needs to evolve some more. A lot of very basic
information wasn't there until recently.

> If we just collected something like:
> - number of heap passes
> - time spent heap vacuuming
> - number of index scans
> - time spent index vacuuming
> - time spent delaying

You forgot FPIs.

> - percentage of not-yet-removable vs removable tuples

I think that we should address this directly too. By "taking a
snapshot of the visibility map", so we at least don't scan/vacuum heap
pages that don't really need it. This is also valuable because it
makes slowing down VACUUM (maybe slowing it down a lot) have fewer
downsides. At least we'll have "locked in" our scanned_pages, which we
can figure out in full before we really scan even one page.

> it'd start to be a heck of a lot easier to judge how well autovacuum is
> coping.

What about the potential of the shared memory stats stuff to totally
replace the use of ANALYZE stats in autovacuum.c? Possibly with help
from vacuumlazy.c, and the visibility map?

I see a lot of potential for exploiting the visibility map more, both
within vacuumlazy.c itself, and for autovacuum.c scheduling [1]. I'd
probably start with the scheduling stuff, and only then work out how
to show users more actionable information.

[1] https://postgr.es/m/CAH2-Wzkt9Ey9NNm7q9nSaw5jdBjVsAq3yvb4UT4M93UaJVd_xg@mail.gmail.com
--
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I also refined the WARNING patch in v15. It now actually issues
> WARNINGs (rather than PANICs, which were just a temporary debugging
> measure in v14).

Going to commit this remaining patch tomorrow, barring objections.

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Andres Freund
Дата:
Hi,

On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote:
> On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I also refined the WARNING patch in v15. It now actually issues
> > WARNINGs (rather than PANICs, which were just a temporary debugging
> > measure in v14).
> 
> Going to commit this remaining patch tomorrow, barring objections.

The remaining patch is the warnings in vac_update_relstats(), correct?  I
guess one could argue they should be LOG rather than WARNING, but I find the
project stance on that pretty impractical. So warning's ok with me.

Not sure why you used errmsg_internal()?

Otherwise LGTM.

Greetings,

Andres Freund



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund <andres@anarazel.de> wrote:
> The remaining patch are the warnings in vac_update_relstats(), correct?  I
> guess one could argue they should be LOG rather than WARNING, but I find the
> project stance on that pretty impractical. So warning's ok with me.

Right. The reason I used WARNINGs was because it matches vaguely
related WARNINGs in vac_update_relstats()'s sibling function,
vacuum_set_xid_limits().

> Not sure why you used errmsg_internal()?

The usual reason for using errmsg_internal(), I suppose. I tend to do
that with corruption related messages on the grounds that they're
usually highly obscure issues that are (by definition) never supposed
to happen. The only thing that a user can be expected to do with the
information from the message is to report it to -bugs, or find some
other similar report.
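
Roughly this shape, for anyone curious (hedged sketch; variable names and the
committed wording may differ):

	if (TransactionIdPrecedes(ReadNextTransactionId(), pgcform->relfrozenxid))
		ereport(WARNING,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg_internal("overwriting relfrozenxid %u of \"%s\", which appears to be in the future",
								 pgcform->relfrozenxid,
								 NameStr(pgcform->relname))));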

-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Right. The reason I used WARNINGs was because it matches vaguely
> related WARNINGs in vac_update_relstats()'s sibling function,
> vacuum_set_xid_limits().

Okay, pushed the relfrozenxid warning patch.

Thanks
-- 
Peter Geoghegan



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Jim Nasby
Дата:
On 4/3/22 12:05 PM, Andres Freund wrote:
> While I was writing the above I, again, realized that it'd be awfully nice to
> have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
> feedback about improvements more easily and for users to know what aspects
> they need to tune.
>
> Knowing how many times a table was vacuumed doesn't really tell that much, and
> having to enable log_autovacuum_min_duration and then aggregate those
> results is pretty painful (and version dependent).
>
> If we just collected something like:
> - number of heap passes
> - time spent heap vacuuming
> - number of index scans
> - time spent index vacuuming
> - time spent delaying
The number of passes would let you know if maintenance_work_mem is too 
small (or to stop killing 187M+ tuples in one go). The timing info would 
give you an idea of the impact of throttling.
> - percentage of not-yet-removable vs removable tuples

This'd give you an idea how bad your long-running-transaction problem is.

Another metric I think would be useful is the average utilization of 
your autovac workers. No spare workers means you almost certainly have 
tables that need vacuuming but have to wait. As a single number, it'd 
also be much easier for users to understand. I'm no stats expert, but 
one way to handle that cheaply would be to maintain an 
engineering-weighted-mean of the percentage of autovac workers that are 
in use at the end of each autovac launcher cycle (though that would 
probably not work great for people that have extreme values for launcher 
delay, or constantly muck with launcher_delay).
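
Something like this, if "engineering-weighted-mean" is read as an
exponentially weighted moving average (all names here are made up):

/* Hedged sketch only; nothing like this exists in autovacuum.c today */
#define AV_UTIL_SMOOTHING	0.1		/* weight given to the newest sample */

static double av_worker_utilization = 0.0;

static void
update_av_worker_utilization(int workers_in_use, int max_workers)
{
	double		sample = (double) workers_in_use / max_workers;

	av_worker_utilization = AV_UTIL_SMOOTHING * sample +
		(1.0 - AV_UTIL_SMOOTHING) * av_worker_utilization;
}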

>
> it'd start to be a heck of a lot easier to judge how well autovacuum is
> coping.
>
> If we tracked the related pieces above in the index stats (or perhaps
> additionally there), it'd also make it easier to judge the cost of different
> indexes.
>
> - Andres
>
>



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

От
Peter Geoghegan
Дата:
On Thu, Apr 14, 2022 at 4:19 PM Jim Nasby <nasbyj@amazon.com> wrote:
> > - percentage of not-yet-removable vs removable tuples
>
> This'd give you an idea how bad your long-running-transaction problem is.

VACUUM fundamentally works by removing those tuples that are
considered dead according to an XID-based cutoff established when the
operation begins. And so many very long running VACUUM operations will
see dead-but-not-removable tuples even when there are absolutely no
long running transactions (nor any other VACUUM operations). The only
long running thing involved might be our own long running VACUUM
operation.

I would like to reduce the number of non-removable dead tuples
encountered by VACUUM by "locking in" heap pages that we'd like to
scan up front. This would work by having VACUUM create its own local
in-memory copy of the visibility map before it even starts scanning
heap pages. That way VACUUM won't end up visiting heap pages just
because they were concurrently modified half way through our VACUUM
(by some other transactions). We don't really need to scan these pages
at all -- they have dead tuples, but not tuples that are "dead to
VACUUM".

The key idea here is to remove a big unnatural downside to slowing
VACUUM down. The cutoff would almost work like an MVCC snapshot that
describes precisely the work that VACUUM needs to do (which pages to
scan) up-front. Once that's locked in, the amount of work we're
required to do cannot go up as we're doing it (or it'll be less of an
issue, at least).

It would also help if VACUUM didn't scan pages that it already knows
don't have any dead tuples. The current SKIP_PAGES_THRESHOLD rule
could easily be improved. That's almost the same problem.

-- 
Peter Geoghegan