Re: POC: Cleaning up orphaned files using undo logs
From | Amit Kapila
---|---
Subject | Re: POC: Cleaning up orphaned files using undo logs
Date |
Msg-id | CAA4eK1J_GxUHtfqyLpXtuoU5r-oAfQOuZsc33WFoUeTBowEUBA@mail.gmail.com
In reply to | Re: POC: Cleaning up orphaned files using undo logs (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: POC: Cleaning up orphaned files using undo logs (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
On Tue, Jun 18, 2019 at 11:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patches ]
>
> I tried writing some code that throws an error from an undo log
> handler and the results were not good.  It appears that the code will
> retry in a tight loop:
>
> 2019-06-18 13:58:53.262 EDT [42803] ERROR:  robert_undo
> 2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
> 2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
> ...
>
> It seems clear that the error-handling aspect of this patch has not
> been given enough thought.  It's debatable what strategy should be
> used when undo fails, but retrying 40 times per millisecond isn't the
> right answer.
>

The reason for this is that, currently, the undo worker keeps executing
requests as long as there are any.  That is fine when the requests are
different, but fetching the same request from the error queue and
retrying it again and again immediately is not good and will not help
either.

> I assume we want some kind of cool-down between retries.
> 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> increases the retry time up to some maximum?
>

Yeah, something along these lines would be good.  How about adding a
failure_count to each request in the error queue?  It would be
incremented on each retry, and we would wait in proportion to it, say
10s after the first retry, 20s after the second, and so on, up to a
maximum failure_count of 10 (100s), after which the worker exits,
considering that it has no more work to do.

Actually, we also need to think about what to do with such requests,
because even if the undo worker exits after retrying some threshold
number of times, the undo launcher will launch a new worker for the
request again unless we have some special handling for it.  We can
issue a WARNING once a particular request reaches the maximum number
of retries, but I am not sure that is enough, because the user might
not notice it or might not take any action.  Do we want to PANIC at
some point, and if so, when?  The other alternative is to keep
retrying at regular intervals until we succeed.

> Should there be one or
> more GUCs?
>

Yeah, we could do that, something like undo_apply_error_retry_count,
but I am not completely sure about this; maybe some predefined number,
say 10 or 20, would be enough.  However, I am fine with it if you or
others think a GUC can help users in this case.

> Another thing that is not very nice is that when I tried to shut down
> the server via 'pg_ctl stop' while the above was happening, it did not
> shut down.  I had to use an immediate shutdown.  That's clearly not
> OK.
>

CHECK_FOR_INTERRUPTS is missing in one place; will fix.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
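
A minimal sketch of the failure_count back-off described above, assuming a
hypothetical UndoErrorRequest struct for entries in the error queue; the
struct members, constants, and function name are illustrative only and not
taken from the patch:

#include "postgres.h"
#include "access/transam.h"
#include "utils/timestamp.h"

#define MAX_UNDO_APPLY_RETRIES      10      /* give up after this many failures */
#define UNDO_RETRY_BASE_DELAY_MS    10000   /* 10s after the first failure */

typedef struct UndoErrorRequest
{
    FullTransactionId full_xid;         /* transaction whose undo failed */
    int         failure_count;          /* number of failed apply attempts */
    TimestampTz next_retry_time;        /* do not retry before this time */
} UndoErrorRequest;

/*
 * Record one more failed attempt to apply the undo actions for a request,
 * and schedule the next retry 10s, 20s, 30s, ... into the future in
 * proportion to the failure count.
 */
static void
undo_request_failed(UndoErrorRequest *req)
{
    req->failure_count++;

    if (req->failure_count >= MAX_UNDO_APPLY_RETRIES)
    {
        /* Give up for now; what happens next is the open question above. */
        ereport(WARNING,
                (errmsg("giving up on undo apply for transaction %u after %d attempts",
                        XidFromFullTransactionId(req->full_xid),
                        req->failure_count)));
        return;
    }

    req->next_retry_time =
        TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                    (int64) req->failure_count * UNDO_RETRY_BASE_DELAY_MS);
}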
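
If the retry limit does become a GUC, a guc.c entry along the lines of the
undo_apply_error_retry_count suggestion might look like the sketch below;
the config group, context, and default value are assumptions:

/* Hypothetical GUC variable; the default of 10 follows the suggestion above. */
int         undo_apply_error_retry_count = 10;

/* Sketch of an entry for guc.c's ConfigureNamesInt[] table. */
{
    {"undo_apply_error_retry_count", PGC_SIGHUP, RESOURCES,
        gettext_noop("Number of times failed undo actions are retried before the undo worker gives up."),
        NULL
    },
    &undo_apply_error_retry_count,
    10, 0, INT_MAX,
    NULL, NULL, NULL
},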
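
And a sketch of how the worker's main loop could combine the back-off with
CHECK_FOR_INTERRUPTS() so that 'pg_ctl stop' can terminate it instead of it
spinning on a failing request; UndoWorkerGetNextRequest(),
UndoWorkerApplyRequest(), and WAIT_EVENT_UNDO_WORKER_MAIN are hypothetical
names, not part of the patch:

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

static void
UndoWorkerMainLoop(void)
{
    UndoErrorRequest req;

    for (;;)
    {
        CHECK_FOR_INTERRUPTS();     /* lets "pg_ctl stop" terminate the worker */

        /*
         * UndoWorkerGetNextRequest() is a hypothetical helper that skips
         * requests whose next_retry_time has not been reached yet.
         */
        if (UndoWorkerGetNextRequest(&req))
        {
            if (!UndoWorkerApplyRequest(&req))      /* hypothetical */
                undo_request_failed(&req);          /* back off, see above */
            continue;
        }

        /* Nothing runnable right now: sleep on the latch instead of spinning. */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         10 * 1000L,                /* recheck at least every 10s */
                         WAIT_EVENT_UNDO_WORKER_MAIN);
        ResetLatch(MyLatch);
    }
}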