Re: race condition when writing pg_control

From: Fujii Masao
Subject: Re: race condition when writing pg_control
Msg-id: 637e17dd-e90b-0dab-f215-dbbe9c6c726a@oss.nttdata.com
In reply to: Re: race condition when writing pg_control  (Thomas Munro <thomas.munro@gmail.com>)
Replies: Re: race condition when writing pg_control  (Michael Paquier <michael@paquier.xyz>)
List: pgsql-hackers

On 2020/05/22 13:51, Thomas Munro wrote:
> On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>>> I believe I've discovered a race condition between the startup and
>>> checkpointer processes that can cause a CRC mismatch in the pg_control
>>> file.  If a cluster crashes at the right time, the following error
>>> appears when you attempt to restart it:
>>>
>>>          FATAL:  incorrect checksum in control file
>>>
>>> This appears to be caused by some code paths in xlog_redo() that
>>> update ControlFile without taking the ControlFileLock.  The attached
>>> patch seems to be sufficient to prevent the CRC mismatch in the
>>> control file, but perhaps this is a symptom of a bigger problem with
>>> concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
>>
>> This does indeed look pretty dodgy.  CreateRestartPoint() running in
>> the checkpointer does UpdateControlFile() to compute a checksum and
>> write it out, but xlog_redo() processing
>> XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
>> interlocking.  It looks like the ancestors of that line were there
>> since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
>> UpdateControlFile() directly in the startup process (immediately after
>> that update), so no interlocking problem.  Then in cdd46c76548 (2009),
>> RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
>> in another process.
> 
> Here's a version with a commit message added.  I'll push this to all
> releases in a day or two if there are no objections.

+1 to push the patch.

Per my quick check, XLogReportParameters() seems to have a similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


