Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

Поиск
Список
Период
Сортировка
От Fujii Masao
Тема Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
Дата
Msg-id CAHGQGwEfjMbBUPhb3-1z+xqBA-O_NuL5xzZnCO=0ojyBit-YRw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.  (Masahiko Sawada <sawada.mshk@gmail.com>)
Ответы Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
Список pgsql-hackers
On Mon, Nov 20, 2017 at 4:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Nov 17, 2017 at 11:00 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Nov 16, 2017 at 8:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Thu, Nov 16, 2017 at 1:11 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> +   /*
>>>> +    * Quick exit if session is not keeping around a non-exclusive backup
>>>> +    * already started.
>>>> +    */
>>>> +   if (sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
>>>> +       return;
>>>> I think that it would be more solid to use SESSION_BACKUP_NONE for the
>>>> comparison, and complete the assertion after the quick exit as follows
>>>> as this code path should never be taken for an exclusive backup:
>>>
>>> Agreed.
>>
>> I have spent some time doing an extra lookup with tests involving one
>> and two sessions doing backups checking for failure code paths while
>> waiting for archives:
>> - One session with non-exclusive backup.
>> - One session with exclusive backup.
>> - One session is exclusive and the second is non-exclusive
>> - Both are exclusive.
>> Also double-checked on the way the logic around the cleanup callback
>> and sessionBackupState during startup, and those look clean to me. One
>> thing that was bothering me is that the callback
>> nonexclusive_base_backup_cleanup is called after do_pg_start_backup()
>> finishes. But between the moment sessionBackupState is set and the
>> callback is registered there is no CHECK_FOR_INTERRUPTS or even elog()
>> calls so even if the process starting a non-exclusive backup is tried
>> to be terminated between the moment the session lock is set and the
>> callback is registered things are handled correctly.
>>
>> So I think that we are basically safe for backups running with the SQL
>> interface.
>
> Thank you for double checking!
>
>> However, things are not completely clean for base backups taken using
>> the replication protocol. While monitoring more the code, I have
>> noticed that perform_base_backup() calls do_pg_stop_backup() *without*
>> taking any cleanup action. So if a base backup is interrupted, say
>> with SIGTERM, while do_pg_stop_backup() is called and before the
>> session lock is updated then it is possible to finish with
>> inconsistent in-memory counters. Oops.
>
> Good catch! I'd missed this case.
>
>> No need to play with breakpoints and signals in this case, using
>> something like that is enough to create inconsistent counters.
>> --- a/src/backend/replication/basebackup.c
>> +++ b/src/backend/replication/basebackup.c
>> @@ -327,6 +327,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
>>     }
>>     PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
>>
>> +   elog(ERROR, "base backups don't decrement counters here, stupid!");
>> +
>>     endptr = do_pg_stop_backup(labelfile->data, !opt->nowait, &endtli);
>>
>> A simple fix for this one is to call do_pg_stop_backup() before
>> PG_END_ENSURE_ERROR_CLEANUP(). By doing so, we also make the callback
>> cleanup logic consistent with what is done for the SQL equivalent
>> where the callback is removed after finishing going through
>> do_pg_stop_backup(). A comment would be adapted here, say something
>> like "finish the backup while still holding the cleanup callback to
>> avoid inconsistent in-memory data should the this call fail before
>> sessionBackupState is updated."
>>
>> For the start phase, the current logic is fine, because in the case of
>> the SQL interface the cleanup callback is registered after finishing
>> do_pg_start_backup().
>>
>> What do you think?
>
> I agree with your approach. It makes sense to me.
>
> Attached updated patch. Please review it.

Thanks for updating the patch! The patch basically looks good to me.

+ /*
+ * Clean up session-level lock. To avoid calling CHECK_FOR_INTERRUPTS by
+ * LWLockReleaseClearVar before changing the backup state we change it
+ * while holding the WAL insert lock.
+ */

I think that you should comment *why* we need to avoid calling
CHECK_FOR_INTERRUPTS before changing the backup state, here.

Regards,

-- 
Fujii Masao


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Martín Marqués
Дата:
Сообщение: Re: [HACKERS] pg_basebackup --progress output for batch execution
Следующее
От: Martín Marqués
Дата:
Сообщение: Using isatty() on WIN32 platform