Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From Matthias van de Meent
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Msg-id CAEze2WhxhEQEx+c+CXoDpQs1H1HgkYUK4BW-hFw5_eQxuVWqRw@mail.gmail.com
In reply to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
On Mon, 1 Nov 2021 at 16:15, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > I can propose the debugging patch to reproduce the issue that replaces
> > > the hang with the assert and modifies a pair of crash-causing test
> > > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > > down the scripts further as I have to leave for a week.)
> >
> > This bug is similar to the one fixed in commit d9d8aa9b. And so I
> > wonder if code like GlobalVisTestFor() is missing something that it
> > needs for partitioned tables.
>
> Without `autovacuum = off; fsync = off` I could not replicate the
> issue in the configured 10m time window; with those options I did get
> the reported trace in minutes.
>
> I think that I also have found the culprit, which is something we
> talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
> to never move backwards when recalculated, and because vacuum can
> update its snapshot bounds (heap_prune_satisfies_vacuum ->
> GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
> could move backwards, resulting in the observed behaviour.
>
> It was my understanding based on the mail conversation that Andres
> would fix this observed issue too while fixing [0] (whose fix was
> included with beta 2), but apparently I was wrong; I can't find the
> code for 'maybe_needed'-won't-move-backwards-in-a-backend.
>
> I (again) propose the attached patch, which ensures that this
> maybe_needed field will not move backwards for a backend. It is
> based on 14, but should be applied on head as well, because it's
> lacking there as well.
>
> Another alternative would be to replace the use of vacrel->OldestXmin
> with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
> that is not legal in how vacuum works (we cannot unilaterally decide
> that we want to retain tuples < OldestXmin).
>
> Note: After fixing the issue with retreating maybe_needed I also hit
> your segfault, and I'm still trying to find out what the source of
> that issue might be. I do think it is an issue separate from stuck
> vacuum, though.

After further debugging, I think both of these might be caused by the
same issue: xmin horizon confusion resulting from restored
snapshots:

I repeatedly see backends whose xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
results in shared_oldest_nonremovable (and others) being less than the
value from the previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant). Ergo, this issue is not really solved by my previous patch,
because apparently at this point we have snapshots with an xmin that is
only registered in the backend's procarray entry when that xmin is
already out of scope, which makes it generally impossible to determine
which tuples may or may not yet be vacuumed.
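
To make the loop above concrete, here is a minimal toy model of it. It is
not PostgreSQL code: ToyXid and every toy_* function are hypothetical names
made up for this sketch, and the retry cap exists only so the example
terminates. It assumes the scan side judges a tuple dead against a cached
cutoff while the pruning side only removes it against a freshly recomputed
horizon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit transaction id. */
typedef uint32_t ToyXid;

/* Wraparound-aware "a is older than b" (modulo-2^32 comparison). */
bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/* Pruning side: removes a deleted tuple only if its deleter is older
 * than the freshly recomputed horizon. */
bool
toy_prune_removes(ToyXid xmax, ToyXid fresh_horizon)
{
    return toy_xid_precedes(xmax, fresh_horizon);
}

/* Scan side: considers the tuple dead if its deleter is older than the
 * cutoff that was cached when the scan started. */
bool
toy_scan_sees_dead(ToyXid xmax, ToyXid cached_cutoff)
{
    return toy_xid_precedes(xmax, cached_cutoff);
}

/* Returns the number of retries needed for one tuple. The cap is an
 * assumption of this sketch so that the demonstration ends. */
int
toy_scan_page(ToyXid xmax, ToyXid cached_cutoff, ToyXid fresh_horizon,
              int max_retries)
{
    int retries = 0;

    assert(max_retries > 0);
    for (;;)
    {
        if (toy_prune_removes(xmax, fresh_horizon))
            return retries;     /* tuple was pruned away */
        if (!toy_scan_sees_dead(xmax, cached_cutoff))
            return retries;     /* tuple is kept, both sides agree */
        /* Dead per the cached cutoff, yet pruning left it: try again. */
        if (++retries >= max_retries)
            return retries;
    }
}
```

With a cached cutoff of 110 and a fresh horizon that has retreated to 100,
a tuple deleted by xid 105 exhausts any retry budget given to
toy_scan_page(); when the two values agree, it returns without retrying.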

I noticed that when this happens, generally a parallel vacuum worker
is involved. I also think that this is intimately related to [0], and
how snapshots are restored in parallel workers: A vacuum worker is
generally ignored, but if its snapshot has the oldest xmin available,
then a parallel worker launched from that vacuum worker will move the
visible xmin backwards. The same applies to concurrent index creation jobs.
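
As a rough illustration of that effect, the sketch below computes a horizon
as the oldest advertised xmin over a small array of backends, skipping the
ones flagged as vacuum. Everything here (DemoXid, DemoProc, the
ignored_as_vacuum flag, demo_oldest_nonremovable) is a hypothetical
simplification for this mail, not the procarray implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; 0 plays the role of "no xmin advertised". */
typedef uint32_t DemoXid;
#define DEMO_INVALID_XID ((DemoXid) 0)

typedef struct DemoProc
{
    DemoXid xmin;               /* DEMO_INVALID_XID if none advertised */
    bool    ignored_as_vacuum;  /* skipped when computing the horizon */
} DemoProc;

/* Wraparound-aware "a is older than b" (modulo-2^32 comparison). */
bool
demo_xid_precedes(DemoXid a, DemoXid b)
{
    return (int32_t) (a - b) < 0;
}

/* Oldest xmin among backends that count towards the horizon; falls
 * back to next_xid when nobody advertises anything. */
DemoXid
demo_oldest_nonremovable(const DemoProc *procs, size_t nprocs,
                         DemoXid next_xid)
{
    DemoXid horizon = next_xid;

    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].ignored_as_vacuum ||
            procs[i].xmin == DEMO_INVALID_XID)
            continue;
        if (demo_xid_precedes(procs[i].xmin, horizon))
            horizon = procs[i].xmin;
    }
    return horizon;
}
```

For example, with a regular backend at xmin 120, an ignored vacuum backend
at xmin 100, and an idle slot, the horizon is 120. Once a worker in the idle
slot restores the vacuum backend's snapshot and advertises xmin 100 without
being ignored, the recomputed horizon drops to 100.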

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/202110191807.5svc3kmm32tl%40alvherre.pgsql


