Autovacuum vs vac_update_datfrozenxid() vs ?

From:
Andres Freund
Date:
Hi,

While looking to understand what could be going on with [1], I think I
might have stumbled on a few issues that could at least explain parts of
the problem.

First, it seems to me that vac_update_datfrozenxid() can spuriously
(but temporarily) fail due to the 'bogus'
logic. vac_update_datfrozenxid() first determines the 'newest' xid it
accepts:
    /*
     * Identify the latest relfrozenxid and relminmxid values that we could
     * validly see during the scan.  These are conservative values, but it's
     * not really worth trying to be more exact.
     */
    lastSaneFrozenXid = ReadNewTransactionId();
    lastSaneMinMulti = ReadNextMultiXactId();

and then starts a full table scan of pg_class to determine the
database-wide relfrozenxid:
    scan = systable_beginscan(relation, InvalidOid, false,
                              NULL, 0, NULL);

Whenever vac_update_datfrozenxid() encounters a relfrozenxid that's "too
new", it'll bail out:
        /*
         * If things are working properly, no relation should have a
         * relfrozenxid or relminmxid that is "in the future".  However, such
         * cases have been known to arise due to bugs in pg_upgrade.  If we
         * see any entries that are "in the future", chicken out and don't do
         * anything.  This ensures we won't truncate clog before those
         * relations have been scanned and cleaned up.
         */
        if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
            MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
        {
            bogus = true;
            break;
        }

Which all in theory makes sense - but there are two fairly sizable races
here:
1) Because lastSaneFrozenXid is determined *before* the snapshot for
   the pg_class scan, it's entirely possible that another vacuum on
   another table in the same database concurrently starts & finishes. And
   that vacuum can very well end up setting a relfrozenxid that's newer
   than lastSaneFrozenXid.
2) Much worse, because vac_update_relstats() uses heap_inplace_update(),
   there's actually no snapshot semantics here anyway! Which means that
   the window for a race isn't just between the ReadNewTransactionId()
   and the systable_beginscan(), but instead lasts for the duration of
   the entire scan.

Either of these can then trigger vac_update_datfrozenxid() to
*silently* bail out without touching anything.
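
As an illustration only (a sketch, not a patch proposed in this thread):
one way race (1) could at least be narrowed is to re-read the cutoffs
once when a seemingly "in the future" value is encountered, before
concluding it is bogus. The loop structure below is paraphrased from
vac_update_datfrozenxid(); the names match the excerpts above.

        if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
            MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
        {
            /*
             * The value may only look "in the future" because a concurrent
             * vacuum advanced relfrozenxid/relminmxid after the cutoffs were
             * read.  Refresh the cutoffs and re-check before giving up: the
             * refreshed values are read after this tuple was fetched, so a
             * legitimate concurrent advance no longer looks bogus.
             */
            lastSaneFrozenXid = ReadNewTransactionId();
            lastSaneMinMulti = ReadNextMultiXactId();

            if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
                MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
            {
                bogus = true;
                break;
            }
        }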



I think there's also another (even larger?) race in
vac_update_datfrozenxid(): Unless I miss something, two backends can
concurrently run through the scan in vac_update_datfrozenxid() for two
different tables in the same database, both can check that they need to
update the pg_database row, and then one of them can overwrite the
results of the other. And the one that updates last might actually be
the one with an older horizon.  This is possible since there is no 'per
database' locking in heap_vacuum_rel().
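
Again purely as an illustration (a sketch, not the fix from the thread
Noah links below): one way to close this particular window would be to
serialize the check-and-update on the pg_database row and only ever move
datfrozenxid forward. The lock choice (LockSharedObject on the database's
pg_database entry) and the omitted datminmxid handling are assumptions
made for brevity:

    Relation         relation;
    HeapTuple        tuple;
    Form_pg_database dbform;

    /*
     * Serialize concurrent callers, so that fetching the current
     * datfrozenxid, comparing it, and writing the new value in place is
     * atomic with respect to other backends finishing a vacuum in the
     * same database.  Without this, the backend holding the older horizon
     * can overwrite the newer value.
     */
    LockSharedObject(DatabaseRelationId, MyDatabaseId, 0, ExclusiveLock);

    relation = table_open(DatabaseRelationId, RowExclusiveLock);
    tuple = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
    dbform = (Form_pg_database) GETSTRUCT(tuple);

    /* Only ever advance datfrozenxid; never let it go backwards. */
    if (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid))
    {
        dbform->datfrozenxid = newFrozenXid;
        heap_inplace_update(relation, tuple);
    }

    table_close(relation, RowExclusiveLock);
    UnlockSharedObject(DatabaseRelationId, MyDatabaseId, 0, ExclusiveLock);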



The way I suspect that interacts with the issue in [1] is that once
that has happened, do_start_worker() figures out it's doing an xid
wraparound vacuum based on the "wrong" datfrozenxid:
        /* Check to see if this one is at risk of wraparound */
        if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
        {
            if (avdb == NULL ||
                TransactionIdPrecedes(tmp->adw_frozenxid,
                                      avdb->adw_frozenxid))
                avdb = tmp;
            for_xid_wrap = true;
            continue;
        }
        else if (for_xid_wrap)
            continue;            /* ignore not-at-risk DBs */

which means autovac launcher will only start workers for that database -
but there might actually not be any work for it.
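
(For context: xidForceLimit in the excerpt above is computed a bit
earlier in do_start_worker() from autovacuum_freeze_max_age, roughly as
follows; this is paraphrased from memory, so the exact lines may differ:)

        /* Determine the oldest datfrozenxid we will consider as "safe" */
        recentXid = ReadNewTransactionId();
        xidForceLimit = recentXid - autovacuum_freeze_max_age;
        /* ensure it's a "normal" XID, else TransactionIdPrecedes misbehaves */
        if (xidForceLimit < FirstNormalTransactionId)
            xidForceLimit -= FirstNormalTransactionId;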


I have some theories as to why this is more common in 12, but I've not
fully nailed them down.


Greetings,

Andres Freund

[1] https://postgr.es/m/20200323152247.GB52612%40nol



Re: Autovacuum vs vac_update_datfrozenxid() vs ?

From:
Michael Paquier
Date:
On Mon, Mar 23, 2020 at 04:50:36PM -0700, Andres Freund wrote:
> Which all in theory makes sense - but there are two fairly sizable races
> here:
> 1) Because lastSaneFrozenXid is determined *before* the snapshot for
>    the pg_class scan, it's entirely possible that another vacuum on
>    another table in the same database concurrently starts & finishes. And
>    that vacuum can very well end up setting a relfrozenxid that's newer
>    than lastSaneFrozenXid.
> 2) Much worse, because vac_update_relstats() uses heap_inplace_update(),
>    there's actually no snapshot semantics here anyway! Which means that
>    the window for a race isn't just between the ReadNewTransactionId()
>    and the systable_beginscan(), but instead lasts for the duration of
>    the entire scan.
>
> Either of these can then trigger vac_update_datfrozenxid() to
> *silently* bail out without touching anything.

Hmm.  It looks like you are right on both points, with 2) being much
more plausible to trigger.  Don't we have other issues with the cutoff
calculations done within vacuum_set_xid_limits()?  We decide whether an
aggressive job happens based on data from the relation cache, but that
may not play well with concurrent jobs that have just finished and
whose invalidations may not have been processed yet?
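
(For reference, the decision being referred to looks roughly like the
following in heap_vacuum_rel(); this is paraphrased from the PG 12
sources, so details may differ.  The point is that the aggressive flag
is derived from the relation-cache copy of relfrozenxid/relminmxid,
which is exactly the data a concurrent worker may have just changed in
place.)

    vacuum_set_xid_limits(onerel,
                          params->freeze_min_age,
                          params->freeze_table_age,
                          params->multixact_freeze_min_age,
                          params->multixact_freeze_table_age,
                          &OldestXmin, &FreezeLimit, &xidFullScanLimit,
                          &MultiXactCutoff, &mxactFullScanLimit);

    /* Aggressive if the cached relfrozenxid/relminmxid is at or past the limits. */
    aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
                                               xidFullScanLimit);
    aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
                                              mxactFullScanLimit);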

> I think there's also another (even larger?) race in
> vac_update_datfrozenxid(): Unless I miss something, two backends can
> concurrently run through the scan in vac_update_datfrozenxid() for two
> different tables in the same database, both can check that they need to
> update the pg_database row, and then one of them can overwrite the
> results of the other. And the one that updates last might actually be
> the one with an older horizon.  This is possible since there is no 'per
> database' locking in heap_vacuum_rel().

The chances of hitting this get higher with a larger number of max
workers, and on a busy system it is easy to end up with multiple
concurrent workers on the same database.  Perhaps one reason it was
harder for me to reproduce the problem is that I was just using the
default for autovacuum_max_workers.  It looks worth trying with a
crazily high value for the max number of workers and more relations
doing a bunch of heavy updates (the application I saw facing a
lockdown uses literally hundreds of relations with a poor man's
partitioning schema, and some tables of that schema are heavily
updated).

> The way I suspect that interacts with the issue in [1] is that once
> that has happened, do_start_worker() figures out it's doing an xid
> wraparound vacuum based on the "wrong" datfrozenxid:
>         /* Check to see if this one is at risk of wraparound */
>         if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
>         {
>             if (avdb == NULL ||
>                 TransactionIdPrecedes(tmp->adw_frozenxid,
>                                       avdb->adw_frozenxid))
>                 avdb = tmp;
>             for_xid_wrap = true;
>             continue;
>         }
>         else if (for_xid_wrap)
>             continue;            /* ignore not-at-risk DBs */
>
> which means autovac launcher will only start workers for that database -
> but there might actually not be any work for it.

Actually, I don't think that pg_database.datfrozenxid explains
everything.  From what I saw, this field was getting freshly updated.
So workers are spawned for a wraparound, and the relation-level stats
cause workers to try to autovacuum a table, but nothing happens in the
end.

> I have some theories as to why this is more common in 12, but I've not
> fully nailed them down.

It seems to me that 2aa6e33 also makes those issues easier to trigger,
as it creates a kind of cascading effect while the launcher keeps
spawning workers.  The workers end up handling relations for which
there is nothing to do: I suspect that skipping the anti-wraparound,
non-aggressive jobs reduces the number of relation stat updates done
via vac_update_relstats(), so individual relations get stuck in a
state where every job for them is a non-aggressive anti-wraparound one
and nothing happens, as we saw in the -general thread mentioned by
Andres upthread.  And then things just loop over again and again.
Note that it is also mentioned on the other thread that issuing a
manual VACUUM FREEZE temporarily fixes the lockdown issue, as it
forces an update of the relation stats.  So it seems to me that
reverting 2aa6e33 is a necessary first step to prevent the lockdown
from happening, though that still does not address the actual issues
causing anti-wraparound, non-aggressive jobs to exist.
--
Michael

Attachments

Re: Autovacuum vs vac_update_datfrozenxid() vs ?

From:
Noah Misch
Date:
On Mon, Mar 23, 2020 at 04:50:36PM -0700, Andres Freund wrote:
> I think there's also another (even larger?) race in
> vac_update_datfrozenxid(): Unless I miss something, two backends can
> concurrently run through the scan in vac_update_datfrozenxid() for two
> different tables in the same database, both can check that they need to
> update the pg_database row, and then one of them can overwrite the
> results of the other. And the one that updates last might actually be
> the one with an older horizon.  This is possible since there is no 'per
> database' locking in heap_vacuum_rel().

Right.  This thread has a fix:
https://www.postgresql.org/message-id/flat/20190218073103.GA1434723%40rfd.leadboat.com

The CF entry blocking it is starting to see some activity.  (Your discovery
involving lastSaneFrozenXid would need a separate fix.)