Re: [BUG] non archived WAL removed during production crash recovery

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: [BUG] non archived WAL removed during production crash recovery
Дата
Msg-id 20200420073444.GA77439@paquier.xyz
обсуждение исходный текст
Ответ на Re: [BUG] non archived WAL removed during production crash recovery  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Ответы Re: [BUG] non archived WAL removed during production crash recovery  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Список pgsql-bugs
On Mon, Apr 20, 2020 at 04:02:31PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> As the result, +1 to what v7 is doing and discussing on earlier
> removal of such WAL segments separately if needed.

Thanks for the extra review.

>> Yeah.  We could try to do with "false" as command anyway, and see what
>> the buildfarm thinks.  As the test is skipped on Windows, I would
>> assume that it does not matter much anyway.  Let's see what others
>> think about this piece.  I don't have plans to touch again this patch
>> until likely the middle of next week.
>
> Couldn't we use "/" as a globally-results-in-failure command?  But
> that doesn't increment failed_count.  The reason is pgarch_archiveXLog
> exits with FATAL for "is a directory" error.  The comment asserts that
> we exit with FATAL for SIGINT or SIGQUIT and if so it is enough to
> check only exit-by-signal case.  The following fix worked.

Yeah, I was working on this stuff today and I noticed this problem.  I
was just going to send an email on the matter with a more portable
patch and you also just beat me to it with this one :)

So yes, using "false" may be a bad idea because we cannot rely on the
case where the command does not exist in an environment in this test.
After more testing, I have been hit hard about the fact that the
archiver exits immediately if an archive command cannot be found
(errcode = 127), and it does not report this failure back to
pg_stat_archiver, which would cause the test to wait until the timeout
of poll_query_until() kills the test.  There is however an extra
method not mentioned yet on this thread: we know that cp/copy is
portable enough per the buildfarm, so let's use a copy command that we
know *will* fail.  A simple way of doing this is a command where the
origin file does not exist.

> --- a/src/backend/postmaster/pgarch.c
> +++ b/src/backend/postmaster/pgarch.c
> @@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog)
>           * "command not found" type of error.  If we overreact it's no big
>           * deal, the postmaster will just start the archiver again.
>           */
> -        int            lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;
> +        int            lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG;
>
>          if (WIFEXITED(rc))
>          {
>
> I didn't tested it on Windows (I somehow broke my repo and it's too
> slow to clone.) but system("/") returned 1 and I think that result
> increments the counter.

No, this would be a behavior change, which is not acceptable in my
view.  (By the way, just nuke your full repo if it does not work
anymore on Windows, this method works).

>> Indeed.  The extra initialization was part of v4, and got removed as
>> of v5.  Still, it seems to me that this part was not complete without
>> updating the shared memory field correctly at the beginning of the
>> REDO processing as the last version of the patch does.
>
> I may not be following the discussion, but I think it is reasonable
> that SharedRecoveryState is initialized as CRASH then moves to ARCHIVE
> as needed and finished by NONE.  That transition also stables
> RecoveryInProgress().

Thought as well about that over the weekend, and that's still the best
option to me.

> I think it would be better be RECOVERY_STATE_DONE.

I like this suggestion better than the original in v7.

> By the way I noticed that RecoveryState is exactly a subset of
> DBState.  And changes of SharedRecoveryState happens side-by-side with
> ControlFileData->state in most places.  Coundn't we just usee
> ControlFile->state instead of SharedRecoveryState?

I actually found confusing to use the same thing, because then the
reader would thing that SharedRecoveryState could be set to more
values but we don't want that.

> By the way I found a typo.
>
> +# Recovery tests for the achiving with a standby partially check
> s/achiving/archiving/

Thanks, fixed.

Attached is an updated patch, where I tweaked more comments.

Jehan-Guillaume, who is your colleague who found originally about this
problem?  We should credit him in the commit message.
--
Michael

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Kyotaro Horiguchi
Дата:
Сообщение: Re: [BUG] non archived WAL removed during production crash recovery
Следующее
От: Sandeep Thakkar
Дата:
Сообщение: Re: BUG #16341: Installation with EnterpriseDB Community installer inNT AUTHORITY\SYSTEM context not possible