Re: POC: Cleaning up orphaned files using undo logs

From Amit Kapila
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CAA4eK1J_GxUHtfqyLpXtuoU5r-oAfQOuZsc33WFoUeTBowEUBA@mail.gmail.com
In response to Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Jun 18, 2019 at 11:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patches ]
>
> I tried writing some code that throws an error from an undo log
> handler and the results were not good.  It appears that the code will
> retry in a tight loop:
>
> 2019-06-18 13:58:53.262 EDT [42803] ERROR:  robert_undo
> 2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
> 2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
..
>
> It seems clear that the error-handling aspect of this patch has not
> been given enough thought.  It's debatable what strategy should be
> used when undo fails, but retrying 40 times per millisecond isn't the
> right answer.
>

The reason for this is that, currently, the undo worker keeps executing
requests as long as there are any.  That is fine when the requests are
different, but fetching the same request from the error queue and
retrying it again and again doesn't seem good, and I don't think it
will help either.

> I assume we want some kind of cool-down between retries.
> 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> increases the retry time up to some maximum?
>

Yeah, something along these lines would be good.  How about adding a
failure_count to each request in the error queue?  It would be
incremented on each retry, and we could wait in proportion to it: 10s
after the first retry, 20s after the second, and so on, up to a maximum
failure_count of 10 (100s), after which the worker would exit on the
assumption that it has no more work to do.

Actually, we also need to think about what to do with such requests,
because even if the undo worker exits after retrying some threshold
number of times, the undo launcher will launch a new worker for the
same request again unless we add special handling for it.

We could issue a WARNING once a particular request has reached the
maximum number of retries, but I am not sure that is enough, because
the user might not notice it or might not take any action.  Do we want
to PANIC at some point, and if so, when?  The other alternative is to
keep retrying at regular intervals until we succeed.
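
If we go with a WARNING, I imagine it would just be the usual ereport
once the retry budget is exhausted, continuing the illustrative names
from the sketch above (the message wording is only a placeholder):

/* Placeholder wording; emitted once failure_count exceeds the limit. */
ereport(WARNING,
        (errmsg("giving up applying undo for transaction %u after %d failed attempts",
                XidFromFullTransactionId(req->full_xid),
                req->failure_count),
         errhint("The undo request remains pending and may need manual intervention.")));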

>  Should there be one or
> more GUCs?
>

Yeah, we could do that, something like undo_apply_error_retry_count,
but I am not completely sure about it; a predefined number, say 10 or
20, might be enough.  However, I am fine with it if you or others think
a GUC would help users in this case.
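
If we do end up exposing it, the GUC itself is straightforward; a
possible shape for the ConfigureNamesInt[] entry in
src/backend/utils/misc/guc.c is below (the group, context, default,
and bounds are placeholders, not settled choices):

/* Sketch only: default and bounds are placeholders. */
int         undo_apply_error_retry_count = 10;

/* entry for ConfigureNamesInt[] in src/backend/utils/misc/guc.c */
{
    {"undo_apply_error_retry_count", PGC_SIGHUP, RESOURCES,
        gettext_noop("Maximum number of times a failed undo request is retried "
                     "before the undo worker gives up."),
        NULL
    },
    &undo_apply_error_retry_count,
    10, 0, INT_MAX,
    NULL, NULL, NULL
},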

> Another thing that is not very nice is that when I tried to shut down
> the server via 'pg_ctl stop' while the above was happening, it did not
> shut down.  I had to use an immediate shutdown.  That's clearly not
> OK.
>

A CHECK_FOR_INTERRUPTS() call is missing in one place; will fix.
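
For reference, it is just the standard pattern of checking for
interrupts at the top of the worker's request loop, roughly like this
(the fetch/apply function names are made up for illustration):

/* Illustrative loop shape only; the real one is in the undo worker code. */
for (;;)
{
    /* Let pg_ctl stop / SIGTERM interrupt the worker promptly. */
    CHECK_FOR_INTERRUPTS();

    if (!GetNextUndoRequest(&req))      /* made-up fetch function */
        break;

    ApplyUndoRequest(&req);             /* made-up apply function */
}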

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


