Re: POC: Cleaning up orphaned files using undo logs

From: Amit Kapila
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date:
Msg-id: CAA4eK1Kny1RyzuxL4-OmEFFYPRiHD+xf8Jy4DkZ1bbGV4fhHgA@mail.gmail.com
In reply to: Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, Jun 19, 2019 at 8:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The reason for the same is that currently, the undo worker keeps on
> > executing the requests if there are any.  I think this is good when
> > there are different requests, but getting the same request from the
> > error queue and doing it again and again doesn't seem to be good, and
> > I think it will not help either.
>
> Even if there are multiple requests involved, you don't want a tight
> loop like this.
>

Okay, one reason that comes to mind is that we don't want to choke the
system, as applying undo can consume CPU and generate a lot of I/O.  Is
that what you have in mind, or something else?

I see an advantage in having some sort of throttling here, so we can
have some wait time (say 100ms) between processing requests.  Do we
see any need for a GUC here?
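
To make this concrete, here is a minimal sketch of such throttling in
the worker's request loop.  This is not from the patch:
undo_request_delay_ms and the two helper functions are hypothetical
placeholders, only pg_usleep() is real.

#include "postgres.h"

/* Assumed GUC-style setting (milliseconds); name and default are illustrative. */
static int	undo_request_delay_ms = 100;

/* Hypothetical placeholders for the worker's existing request handling. */
extern bool undo_get_next_request(void);
extern void undo_execute_request(void);

/*
 * Worker loop with throttling: sleep briefly between requests so that
 * continuously applying undo doesn't monopolize CPU and I/O.
 */
static void
undo_worker_loop(void)
{
	while (undo_get_next_request())
	{
		undo_execute_request();

		if (undo_request_delay_ms > 0)
			pg_usleep((long) undo_request_delay_ms * 1000L);	/* ms -> us */
	}
}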

On the one hand, it seems a good idea to have multiple GUCs for tuning
the undo-worker machinery, because whatever default values we pick
might not be good for some users.  OTOH, providing too many GUCs can
also make the system difficult to understand and can confuse users, or
at the least they won't know exactly how to use them.  It seems to me
that we should first complete the entire patch and then decide which of
these things need a separate GUC.

> > > I assume we want some kind of cool-down between retries.
> > > 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> > > increases the retry time up to some maximum?
> >
> > Yeah, something on these lines would be good.  How about if we add
> > a failure_count to each request in the error queue?  It will get
> > incremented on each retry and we can wait in proportion to it, say
> > 10s after the first retry, 20s after the second and so on, and a
> > maximum of 10 failures (100s) will be allowed, after which the worker
> > will exit, considering it has no more work to do.
> >
> > Actually, we also need to think about what we should do with such
> > requests, because even if the undo worker exits after retrying for
> > some threshold number of times, the undo launcher will again launch a
> > new worker for this request unless we have some special handling for
> > the same.
> >
> > We can issue some WARNING once a particular request has reached the
> > maximum number of retries, but I am not sure that is enough because
> > the user might not notice it or might not take any action.  Do we
> > want to PANIC at some point, and if so, when?  Or, as an alternative,
> > we can retry at regular intervals till we succeed.
>
> PANIC is a terrible idea.  How would that fix anything?  You'll very
> possibly still have the same problem after restarting, and so you'll
> just keep on hitting the PANIC. That will mean that in addition to
> whatever problem with undo you already had, you now have a system that
> you can't use for anything at all, because it keeps restarting.
>
> The design goal here should be that if undo for a transaction fails,
> we keep retrying periodically, but with minimal adverse impact on the
> rest of the system.  That means you can't retry in a loop. It also
> means that the system needs to provide fairness: that is, it shouldn't
> be possible to create a system where one or more transactions for
> which undo keeps failing cause other transactions that could have been
> undone to get starved.
>

Agreed.

> It seems to me that thinking of this in terms of what the undo worker
> does and what the undo launcher does is probably not the right
> approach. We need to think of it more as an integrated system. Instead
> of storing a failure_count with each request in the error queue, how
> about storing a next retry time?
>

I think both failure_count and next_retry_time can work in a similar way.

I think incrementing the next retry time in multiples will be a bit
tricky.  Say the first error occurs at time X.  We can say that
next_retry_time will be X+10s=Y and error_occurred_at will be X.  The
second time it fails, how will we know that we need to set
next_retry_time to Y+20s?  Maybe we can compute Y-X, add 10s to it, and
add the result to the current time.  Now, whenever the worker or
launcher finds this request, it can check whether the current time is
greater than or equal to next_retry_time; if so, it can pick that
request, otherwise it moves on to the requests in the other queues.
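
As a rough sketch of that bookkeeping (standalone C, with assumed
struct and field names rather than anything from the patch):

#include <stdbool.h>
#include <time.h>

#define RETRY_STEP 10			/* seconds added to the wait after each failure */

/* Hypothetical per-request state; field names are illustrative only. */
typedef struct UndoErrorRequest
{
	time_t		error_occurred_at;	/* time of the first failure (X) */
	time_t		next_retry_time;	/* earliest time a retry is allowed */
} UndoErrorRequest;

/* First failure at time X: allow a retry no earlier than X + 10s (= Y). */
static void
record_first_failure(UndoErrorRequest *req, time_t now)
{
	req->error_occurred_at = now;
	req->next_retry_time = now + RETRY_STEP;
}

/*
 * Subsequent failure: the previous wait was next_retry_time -
 * error_occurred_at (Y - X), so wait that much plus another 10s,
 * counted from the current time.
 */
static void
record_repeat_failure(UndoErrorRequest *req, time_t now)
{
	time_t		prev_wait = req->next_retry_time - req->error_occurred_at;

	req->next_retry_time = now + prev_wait + RETRY_STEP;
}

/* The worker/launcher skips the request until next_retry_time has passed. */
static bool
request_is_due(const UndoErrorRequest *req, time_t now)
{
	return now >= req->next_retry_time;
}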

The failure_count approach can also work in a somewhat similar fashion.
Basically, we can use error_occurred_at and failure_count to compute
the required time.  So, if the error occurred at time X and the failure
count is 3, then we can check whether the current time is greater than
X + (3 * 10s); if so, we allow the entry to be processed, otherwise the
worker checks the other queues for work.
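
For comparison, the failure_count variant reduces to a single check
(again, the struct and field names below are assumed):

#include <stdbool.h>
#include <time.h>

#define RETRY_STEP 10			/* seconds of extra wait per recorded failure */

/* Hypothetical per-request state for the failure_count variant. */
typedef struct UndoErrorRequestFC
{
	time_t		error_occurred_at;	/* time of the first failure (X) */
	int			failure_count;		/* failed undo attempts so far */
} UndoErrorRequestFC;

/*
 * Eligible again once current_time > X + failure_count * 10s; otherwise
 * the worker moves on to the other queues.
 */
static bool
request_is_due_fc(const UndoErrorRequestFC *req, time_t now)
{
	return now > req->error_occurred_at +
		(time_t) req->failure_count * RETRY_STEP;
}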

>  I think the error queue needs to be
> ordered by database_id, then by next_retry_time, and then by order of
> insertion.  (The last part is important because next_retry_time is
> going to be prone to having ties, and we need to break those ties in
> the right way.)
>

I think it makes sense to order requests by next_retry_time and then by
error_occurred_at (the latter preserves the order of insertion).
However, I am not sure there is a need to club the requests by database
id; that could starve the error requests from other databases.
Moreover, we already have functionality wherein, if the undo worker
doesn't encounter the next request from the database it is operating on
within a certain amount of time, it peeks ahead (a few entries) in each
queue to find a request for the same database.  We don't sort by db_id
in the other queues either, so it will be consistent for this queue if
we just sort by next_retry_time and error_occurred_at.
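
To illustrate that ordering, a qsort()-style comparator over a
hypothetical queue-entry struct might look like the following (struct
and function names are assumptions, not from the patch):

#include <time.h>

/* Hypothetical error-queue entry; field names are illustrative only. */
typedef struct UndoErrorQueueEntry
{
	time_t		next_retry_time;	/* earliest time a retry is allowed */
	time_t		error_occurred_at;	/* insertion-order tie-breaker */
} UndoErrorQueueEntry;

/*
 * Order the error queue by next_retry_time, breaking ties by
 * error_occurred_at so that entries with equal retry times are
 * processed in insertion order.  There is no db_id component, matching
 * the ordering used by the other queues.
 */
static int
undo_error_queue_cmp(const void *a, const void *b)
{
	const UndoErrorQueueEntry *ea = (const UndoErrorQueueEntry *) a;
	const UndoErrorQueueEntry *eb = (const UndoErrorQueueEntry *) b;

	if (ea->next_retry_time != eb->next_retry_time)
		return (ea->next_retry_time < eb->next_retry_time) ? -1 : 1;
	if (ea->error_occurred_at != eb->error_occurred_at)
		return (ea->error_occurred_at < eb->error_occurred_at) ? -1 : 1;
	return 0;
}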

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


