Re: improving wraparound behavior

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: improving wraparound behavior
Дата
Msg-id 20190504023015.5mgpbl27tld4irw5@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: improving wraparound behavior  (Stephen Frost <sfrost@snowman.net>)
Ответы Re: improving wraparound behavior  (Stephen Frost <sfrost@snowman.net>)
Список pgsql-hackers
Hi,

On 2019-05-03 22:03:18 -0400, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > I am not sure exactly how to fix this,
> > because the calculation we use to determine the XID that can be used
> > to vacuum a specific table is pretty complex; how can the postmaster
> > know whether it's going to be able to make any progress in *any* table
> > in some database to which it's not even connected?  But it's surely
> > crazy to just keep doing something over and over that can't possibly
> > work.
>
> I definitely agree that it's foolish to keep doing something that isn't
> going to work, and it seems like a pretty large part of the issue here
> is that we don't have enough information to be more intelligent because
> we aren't connected to the database that needs the work to be done.
>
> Now, presuming we're talking about 'new feature work' here to try and
> address this, and not something that we think we can back-patch, I had
> another thought.
>
> I certainly get that having lots of extra processes around can be a
> waste of resources... but I don't recall a lot of people complaining
> about the autovacuum launcher process using up lots of resources
> unnecessairly.
>
> Perhaps we should consider having the 'global' autovacuum launcher, when
> it's decided that a database needs work to be done, launch a 'per-DB'
> launcher which manages launching autovacuum processes for that database?
> If the launcher is still running then it's because there's still work to
> be done on that database and the 'global' autovacuum launcher can skip
> it.  If the 'per-DB' launcher runs out of things to do, and the database
> it's working on is no longer in a danger zone, then it exits.
>
> There are certainly some other variations on this idea and I don't know
> that it's really better than keeping more information in shared
> memory or something else, but it seems like part of the issue is that
> the thing firing off the processes hasn't got enough info to do so
> intelligently and maybe we could fix that by having per-DB launchers
> that are actually connected to a DB.

This sounds a lot more like a wholesale redesign than some small
incremental work.  Which I think we should do, but we probably ought to
do something more minimal before the resources for something like this
are there. Perfect being the enemy of the good, and all that.


> I might have misunderstood it, but it sounded like Andres was
> proposing a new function which would basically tell you what's holding
> back the xid horizon and that sounds fantastic and would be great to
> include in this message, if possible.
>
> As in:
>
> HINT:  Run the function pg_what_is_holding_xmin_back() to identify what
> is preventing autovacuum from progressing and address it.

I was basically thinking of doing *both*, amending the message, *and*
having a new UDF.

Basically, instead of the current:

            char       *oldest_datname = get_database_name(oldest_datoid);

            /* complain even if that DB has disappeared */
            if (oldest_datname)
                ereport(WARNING,
                        (errmsg("database \"%s\" must be vacuumed within %u transactions",
                                oldest_datname,
                                xidWrapLimit - xid),
                         errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
                                 "You might also need to commit or roll back old prepared transactions, or drop stale
replicationslots.")));
 
            else
                ereport(WARNING,
                        (errmsg("database with OID %u must be vacuumed within %u transactions",
                                oldest_datoid,
                                xidWrapLimit - xid),
                         errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
                                 "You might also need to commit or roll back old prepared transactions, or drop stale
replicationslots.")));
 
        }

which is dramatically unhelpful, because often vacuuming won't do squat
(because of old [prepared] transaction, replication connection, or slot).

I'm thinking that we'd do something roughly like (in actual code) for
GetNewTransactionId():

    TransactionId dat_limit = ShmemVariableCache->oldestXid;
    TransactionId slot_limit = Min(replication_slot_xmin, replication_slot_catalog_xmin);
    Transactionid walsender_limit;
    Transactionid prepared_xact_limit;
    Transactionid backend_limit;

    ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit, &backend_limit);

    if (IsOldest(dat_limit))
       ereport(elevel,
               errmsg("close to xid wraparound, held back by database %s"),
               errdetail("current xid %u, horizon for database %u, shutting down at %u"),
               errhint("..."));
    else if (IsOldest(slot_limit))
      ereport(elevel, errmsg("close to xid wraparound, held back by replication slot %s"),
              ...);

where IsOldest wouldn't actually compare plainly numerically, but would
actually prefer showing the slot, backend, walsender, prepared_xact, as
long as they are pretty close to the dat_limit - as in those cases
vacuuming wouldn't actually solve the issue, unless the other problems
are addressed first (as autovacuum won't compute a cutoff horizon that's
newer than any of those).

and for the function I was thinking of an SRF that'd return roughly rows
like:
* "database horizon", xid, age(xid), database oid, database name
* "slot horizon", xid, age(xid), NULL, slot name
* "backend horizon", xid, age(xid), backend pid, query string?
* "walsender horizon", xid, age(xid), backend pid, connection string?
* "anti-wraparound-vacuums-start", xid, NULL, NULL
* "current xid", xid, NULL, NULL
* "xid warn limit", xid, NULL, NULL
* "xid shutdown limit", xid, NULL, NULL
* "xid wraparound limit", xid, NULL, NULL

Not sure if an SRF is really the best approach, but it seems like it'd
be the easist way to show a good overview of the problems.  I don't know
how many times this'd have prevented support escalations, but it'd be
many.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: improving wraparound behavior
Следующее
От: Stephen Frost
Дата:
Сообщение: Re: improving wraparound behavior