Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date:
Msg-id: CAH2-Wzmf7a_ByoAocKb5U1rS0PCymKfQz0MBkmdz7qt_WPHKGA@mail.gmail.com
In reply to: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On Wed, Jan 18, 2023 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > It just fits: the dead tuples approach can sometimes be so
> > completely wrong that even an alternative triggering condition based
> > on something that is virtually unrelated to the thing we actually care
> > about can do much better in practice. Consistently, reliably, for a
> > given table/workload.
>
> Hmm, I don't know. I have no intuition one way or the other for
> whether we're undercounting dead tuples, and I don't understand what
> would cause us to do that. I thought that we tracked that accurately,
> as part of the statistics system, not by sampling
> (pg_stat_all_tables.n_dead_tup).

It's both, kind of.

pgstat_report_analyze() will totally override the
tabentry->dead_tuples information that drives autovacuum.c, based on
an estimate derived from a random sample -- which seems to me to be an
approach that just doesn't have any sound theoretical basis. So while
there is a sense in which we track dead tuples incrementally and
accurately using the statistics system, we occasionally call
pgstat_report_analyze (and pgstat_report_vacuum) like this, so AFAICT
we might as well not even bother tracking things reliably the rest of
the time.
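
To make that concrete, here is a toy model -- standalone C, not
PostgreSQL code, every name in it invented -- of what the overwrite
does: we count dead tuples exactly as they're created, then replace
the exact count with an extrapolation from a small random page sample,
roughly in the spirit of pgstat_report_analyze():

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { NPAGES = 100000, SAMPLE = 300, HOTPAGES = NPAGES / 100 };

int main(void)
{
    static long dead_per_page[NPAGES];
    long tracked = 0;
    long seen = 0;

    srand((unsigned) time(NULL));

    /* Exact incremental tracking: every DELETE bumps the counter. */
    for (int i = 0; i < HOTPAGES; i++)
    {
        dead_per_page[i] = 50;
        tracked += 50;
    }
    printf("tracked exactly: %ld dead tuples\n", tracked);

    /*
     * ANALYZE-style overwrite: extrapolate from a random page sample,
     * then clobber the exact count with the estimate.
     */
    for (int i = 0; i < SAMPLE; i++)
        seen += dead_per_page[rand() % NPAGES];
    tracked = seen * (NPAGES / SAMPLE);
    printf("after overwrite: %ld dead tuples\n", tracked);

    return 0;
}

Run it a few times: the first number is exact every time, while the
second jumps all over the place, because a 300 page sample rarely hits
the 1% of pages holding all of the garbage in the right proportion.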

Random sampling works because the things that you don't sample are
very well represented by the things that you do sample. That's why
even very stale optimizer statistics can work quite well (and why the
EAV anti-pattern makes query optimization impossible) -- the
distribution is often fixed, more or less. The statistics generalize
very well because the data meets certain underlying assumptions that
all data stored in a relational database is theoretically supposed to
meet. Whereas with dead tuples, the whole point is to observe and
count dead tuples so that autovacuum can then go remove the dead
tuples -- which then utterly changes the situation! That's a huge
difference.

ISTM that you need a *totally* different approach for something that's
fundamentally dynamic, which is what this really is. Think about how
random sampling will work in a very large table with concentrated
updates: the recently modified pages are vastly outnumbered by the
pages that VACUUM could skip anyway, so the sample will almost always
miss them.
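
Here's a little simulation of exactly that (again invented for this
email, nothing to do with the real ANALYZE machinery): the same one
million dead tuples, once spread evenly and once packed into 0.1% of
the pages, estimated from the same 300-page sample:

#include <stdio.h>
#include <stdlib.h>

#define NPAGES 1000000
#define SAMPLE 300
#define TRIALS 1000

/* Extrapolate total dead tuples from a uniform random page sample. */
static long estimate(const int *dead)
{
    long seen = 0;

    for (int i = 0; i < SAMPLE; i++)
        seen += dead[rand() % NPAGES];
    return seen * (NPAGES / SAMPLE);
}

static void run(const char *label, const int *dead)
{
    long lo = -1, hi = 0, sum = 0;

    for (int t = 0; t < TRIALS; t++)
    {
        long e = estimate(dead);

        if (lo < 0 || e < lo) lo = e;
        if (e > hi) hi = e;
        sum += e;
    }
    printf("%-8s min=%-9ld mean=%-9ld max=%ld\n",
           label, lo, sum / TRIALS, hi);
}

int main(void)
{
    static int uniform[NPAGES], packed[NPAGES];

    for (int i = 0; i < NPAGES; i++)
        uniform[i] = 1;                 /* 1M dead tuples, evenly spread */
    for (int i = 0; i < NPAGES / 1000; i++)
        packed[i] = 1000;               /* the same 1M, in 0.1% of pages */

    run("uniform", uniform);
    run("packed", packed);
    return 0;
}

The uniform case nails it on every trial; the concentrated case mostly
estimates zero and occasionally estimates several million. Both have
the same mean, which is no comfort to the table that never gets
vacuumed.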

I wonder how workable it would be to just teach pgstat_report_analyze
and pgstat_report_vacuum to keep out of this, or to not update the
stats unless it's to increase the number of dead_tuples...
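
The crude version of that is a one-liner. A throwaway sketch (the
struct is pared down to the one field that matters; nothing here is
the real pgstat code):

/* Stand-in for the shared stats entry discussed above. */
typedef struct
{
    long dead_tuples;
} TabEntry;

/*
 * Ratchet: let ANALYZE raise the counter, never lower it.  A noisy low
 * estimate can no longer erase dead tuples we already counted exactly.
 */
static void
report_analyze_dead_tuples(TabEntry *tabentry, long estimated_dead)
{
    if (estimated_dead > tabentry->dead_tuples)
        tabentry->dead_tuples = estimated_dead;
}

int main(void)
{
    TabEntry t = { .dead_tuples = 50000 };

    report_analyze_dead_tuples(&t, 120);    /* low sample estimate: ignored */
    report_analyze_dead_tuples(&t, 90000);  /* higher estimate: accepted */
    return t.dead_tuples == 90000 ? 0 : 1;
}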

> I think we ought to fire autovacuum_vacuum_scale_factor out of an
> airlock.

Couldn't agree more. I think that this and the underlying statistics
are the really big problem as far as under-vacuuming is concerned.
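
For anyone following along, the trigger is (per the docs): autovacuum
runs once n_dead_tup exceeds autovacuum_vacuum_threshold +
autovacuum_vacuum_scale_factor * reltuples. Plugging in the defaults
shows how the linear model scales:

#include <stdio.h>

int main(void)
{
    const double scale_factor = 0.2;    /* autovacuum_vacuum_scale_factor */
    const double threshold = 50;        /* autovacuum_vacuum_threshold */
    const double reltuples[] = { 1e4, 1e6, 1e8, 1e10 };

    for (int i = 0; i < 4; i++)
        printf("%14.0f tuples -> autovacuum after %14.0f dead tuples\n",
               reltuples[i], threshold + scale_factor * reltuples[i]);
    return 0;
}

A ten-billion-tuple table gets to accumulate two billion dead tuples
before the scale factor even notices. That's the airlock case.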

> I think we also ought to invent some sort of better cost limit system
> that doesn't shoot you in the foot automatically as the database
> grows. Nobody actually wants to limit the rate at which the database
> vacuums stuff to a constant. What they really want to do is limit it
> to a rate that is somewhat faster than the minimum rate needed to
> avoid disaster. We should try to develop metrics for whether vacuum is
> keeping up.

Definitely agree that some kind of dynamic updating is promising:
compare what we thought at the start against what actually happened,
and correct course from there. Something cyclic, just like autovacuum
itself.
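
Purely as a sketch of the shape (nothing like this exists today, and
the constants are pulled out of thin air): at the end of each cycle,
compare what the stats predicted against what VACUUM actually found,
and nudge the next trigger accordingly:

#include <stdio.h>

/*
 * Hypothetical feedback rule: if VACUUM found far more garbage than
 * the stats predicted, fire sooner next cycle; if far less, back off.
 */
static double
next_trigger(double trigger, double predicted_dead, double observed_dead)
{
    double error = observed_dead / (predicted_dead > 0 ? predicted_dead : 1);

    if (error > 1.5)
        return trigger * 0.8;   /* under-vacuumed: tighten */
    if (error < 0.5)
        return trigger * 1.25;  /* over-vacuumed: relax */
    return trigger;
}

int main(void)
{
    double trigger = 1000000;

    /* Stats said 400k dead tuples; VACUUM actually removed 1.1M. */
    trigger = next_trigger(trigger, 400000, 1100000);
    printf("next trigger: %.0f dead tuples\n", trigger);
    return 0;
}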

> I don't actually see any reason why dead tuples, even counted in a
> relatively stupid way, isn't fundamentally good enough to get all
> tables vacuumed before we hit the XID age cutoff. It doesn't actually
> do that right now, but I feel like that must be because we're doing
> other stupid things, not because there's anything that terrible about
> the metric as such. Maybe that's wrong, but I find it hard to imagine.

On reflection, maybe you're right here. Maybe it's true that the
bigger problem is just that the implementation is bad, even on its own
terms -- since it's pretty bad! Hard to say at this point.

Depends on how you define it, too. Statistical sampling is just not
fit for purpose here. But is that a problem with
autovacuum_vacuum_scale_factor? I may have said words that could
reasonably be interpreted that way, but I'm not prepared to blame it
on the underlying autovacuum_vacuum_scale_factor model now. It's
fuzzy.

> > We're
> > still subject to the laws of physics. VACUUM would still be something
> > that more or less works at the level of the whole table, or not at
> > all. So being omniscient seems kinda overrated to me. Adding more
> > information does not in general lead to better outcomes.
>
> Yeah, I think that's true. In particular, it's not much use being
> omniscient but stupid. It would be better to have limited information
> and be smart about what you did with it.

I would put it like this: autovacuum shouldn't ever be a sucker. It
should pay attention to disconfirmatory signals. The information that
drives its decision making process should be treated as provisional.

Even if the information was correct at one point, the contents of the
table are constantly changing in a way that could matter enormously.
So we should be paying attention to where the table is going -- and
even where it might be going -- not just where it is, or was.
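
In code-shaped terms (a sketch under invented names, not a proposal):
look at the trend across recent autovacuum naptimes, and fire when the
projection crosses the threshold rather than when the snapshot does:

#include <stdbool.h>
#include <stdio.h>

/*
 * Trigger on where the table is going: project the dead-tuple count a
 * few naptimes ahead using its recent growth rate.
 */
static bool
vacuum_on_trend(long dead_now, long dead_prev, long threshold, int lookahead)
{
    long velocity = dead_now - dead_prev;       /* growth per naptime */
    long projected = dead_now + velocity * lookahead;

    return projected > threshold;
}

int main(void)
{
    /*
     * 800k dead tuples now, up from 650k one naptime ago, threshold 1M:
     * below the threshold today, but on track to cross it in two cycles.
     */
    printf("vacuum soon? %s\n",
           vacuum_on_trend(800000, 650000, 1000000, 2) ? "yes" : "no");
    return 0;
}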

> True, although it can be overdone. An extra vacuum on a big table with
> some large indexes that end up getting scanned can be very expensive
> even if the table itself is almost entirely all-visible. We can't
> afford to make too many mistakes in the direction of vacuuming early
> in such cases.

No, but we can afford to make some -- and can detect when it happened
after the fact. I would rather err on the side of over-vacuuming,
especially if the system is smart enough to self-correct when that
turns out to be the wrong approach. One of the advantages of running
VACUUM sooner is that it provides us with relatively reliable
information about the needs of the table.

We can also cheat, sort of. If we find another justification for
autovacuuming (e.g., it's a quiet time for the system as a whole), and
it works out to help with this other problem, it may be just as good
for users.

--
Peter Geoghegan


