Re: improving wraparound behavior

Поиск
Список
Период
Сортировка
От Stephen Frost
Тема Re: improving wraparound behavior
Дата
Msg-id 20190504024111.GQ6197@tamriel.snowman.net
обсуждение исходный текст
Ответ на Re: improving wraparound behavior  (Andres Freund <andres@anarazel.de>)
Ответы Re: improving wraparound behavior  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-05-03 22:03:18 -0400, Stephen Frost wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > I am not sure exactly how to fix this,
> > > because the calculation we use to determine the XID that can be used
> > > to vacuum a specific table is pretty complex; how can the postmaster
> > > know whether it's going to be able to make any progress in *any* table
> > > in some database to which it's not even connected?  But it's surely
> > > crazy to just keep doing something over and over that can't possibly
> > > work.
> >
> > I definitely agree that it's foolish to keep doing something that isn't
> > going to work, and it seems like a pretty large part of the issue here
> > is that we don't have enough information to be more intelligent because
> > we aren't connected to the database that needs the work to be done.
> >
> > Now, presuming we're talking about 'new feature work' here to try and
> > address this, and not something that we think we can back-patch, I had
> > another thought.
> >
> > I certainly get that having lots of extra processes around can be a
> > waste of resources... but I don't recall a lot of people complaining
> > about the autovacuum launcher process using up lots of resources
> > unnecessairly.
> >
> > Perhaps we should consider having the 'global' autovacuum launcher, when
> > it's decided that a database needs work to be done, launch a 'per-DB'
> > launcher which manages launching autovacuum processes for that database?
> > If the launcher is still running then it's because there's still work to
> > be done on that database and the 'global' autovacuum launcher can skip
> > it.  If the 'per-DB' launcher runs out of things to do, and the database
> > it's working on is no longer in a danger zone, then it exits.
> >
> > There are certainly some other variations on this idea and I don't know
> > that it's really better than keeping more information in shared
> > memory or something else, but it seems like part of the issue is that
> > the thing firing off the processes hasn't got enough info to do so
> > intelligently and maybe we could fix that by having per-DB launchers
> > that are actually connected to a DB.
>
> This sounds a lot more like a wholesale redesign than some small
> incremental work.  Which I think we should do, but we probably ought to
> do something more minimal before the resources for something like this
> are there. Perfect being the enemy of the good, and all that.

I suppose it is a pretty big change in the base autovacuum launcher to
be something that's run per database instead and then deal with the
coordination between the two...  but I can't help but feel like it
wouldn't be that much *work*.  I'm not against doing something smaller
but was something smaller actually proposed for this specific issue..?

> > I might have misunderstood it, but it sounded like Andres was
> > proposing a new function which would basically tell you what's holding
> > back the xid horizon and that sounds fantastic and would be great to
> > include in this message, if possible.
> >
> > As in:
> >
> > HINT:  Run the function pg_what_is_holding_xmin_back() to identify what
> > is preventing autovacuum from progressing and address it.
>
> I was basically thinking of doing *both*, amending the message, *and*
> having a new UDF.
>
> Basically, instead of the current:
>
>             char       *oldest_datname = get_database_name(oldest_datoid);
>
>             /* complain even if that DB has disappeared */
>             if (oldest_datname)
>                 ereport(WARNING,
>                         (errmsg("database \"%s\" must be vacuumed within %u transactions",
>                                 oldest_datname,
>                                 xidWrapLimit - xid),
>                          errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
>                                  "You might also need to commit or roll back old prepared transactions, or drop stale
replicationslots."))); 
>             else
>                 ereport(WARNING,
>                         (errmsg("database with OID %u must be vacuumed within %u transactions",
>                                 oldest_datoid,
>                                 xidWrapLimit - xid),
>                          errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
>                                  "You might also need to commit or roll back old prepared transactions, or drop stale
replicationslots."))); 
>         }
>
> which is dramatically unhelpful, because often vacuuming won't do squat
> (because of old [prepared] transaction, replication connection, or slot).
>
> I'm thinking that we'd do something roughly like (in actual code) for
> GetNewTransactionId():
>
>     TransactionId dat_limit = ShmemVariableCache->oldestXid;
>     TransactionId slot_limit = Min(replication_slot_xmin, replication_slot_catalog_xmin);
>     Transactionid walsender_limit;
>     Transactionid prepared_xact_limit;
>     Transactionid backend_limit;
>
>     ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit, &backend_limit);
>
>     if (IsOldest(dat_limit))
>        ereport(elevel,
>                errmsg("close to xid wraparound, held back by database %s"),
>                errdetail("current xid %u, horizon for database %u, shutting down at %u"),
>                errhint("..."));
>     else if (IsOldest(slot_limit))
>       ereport(elevel, errmsg("close to xid wraparound, held back by replication slot %s"),
>               ...);
>
> where IsOldest wouldn't actually compare plainly numerically, but would
> actually prefer showing the slot, backend, walsender, prepared_xact, as
> long as they are pretty close to the dat_limit - as in those cases
> vacuuming wouldn't actually solve the issue, unless the other problems
> are addressed first (as autovacuum won't compute a cutoff horizon that's
> newer than any of those).

Where the errhint() above includes a recommendation to run the SRF
described below, I take it?

Also, should this really be an 'else if', or should it be just a set of
'if()'s, thereby giving users more info right up-front?  If there's one
thing I quite dislike, it's cases where you're playing whack-a-mole over
and over because the system says "X is bad!" and when you fix X, it just
turns around and says "Y is bad!", even though it had all the info it
needed to say "X and Y are bad!" right up-front.

> and for the function I was thinking of an SRF that'd return roughly rows
> like:
> * "database horizon", xid, age(xid), database oid, database name
> * "slot horizon", xid, age(xid), NULL, slot name
> * "backend horizon", xid, age(xid), backend pid, query string?
> * "walsender horizon", xid, age(xid), backend pid, connection string?
> * "anti-wraparound-vacuums-start", xid, NULL, NULL
> * "current xid", xid, NULL, NULL
> * "xid warn limit", xid, NULL, NULL
> * "xid shutdown limit", xid, NULL, NULL
> * "xid wraparound limit", xid, NULL, NULL
>
> Not sure if an SRF is really the best approach, but it seems like it'd
> be the easist way to show a good overview of the problems.  I don't know
> how many times this'd have prevented support escalations, but it'd be
> many.

Yeah, I'm not sure if an SRF makes the most sense here...  but I'm also
not sure that it's a bad idea either.  Once it's written, it'll be a lot
easier to critique the specific interface. ;)

Thanks!

Stephen

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: improving wraparound behavior
Следующее
От: Andres Freund
Дата:
Сообщение: Re: improving wraparound behavior