Re: Corruption during WAL replay

Поиск
Список
Период
Сортировка
От Daniel Shelepanov
Тема Re: Corruption during WAL replay
Дата
Msg-id 0c911453-a609-7814-f8de-f6c2cf9c5176@mail.ru
обсуждение исходный текст
Ответ на Re: Corruption during WAL replay  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Ответы Re: Corruption during WAL replay  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Список pgsql-hackers
On 27.09.2021 11:30, Kyotaro Horiguchi wrote:
> Thank you for the comments! (Sorry for the late resopnse.)
>
> At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
>> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
>> <horikyota.ntt@gmail.com> wrote:
>>> The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
>>> used each other, but MarkBufferDirtyHint which delays checkpoint start
>>> is called in RelationTruncate while delaying checkpoint completion.
>>> That is not a strange nor harmful behavior.  I changed delayChkpt to a
>>> bitmap integer from an enum so that both barrier are separately
>>> triggered.
>>>
>>> I'm not sure this is the way to go here, though.  This fixes the issue
>>> of a crash during RelationTruncate, but the issue of smgrtruncate
>>> failure during RelationTruncate still remains (unless we treat that
>>> failure as PANIC?).
>> I like this patch. As I understand it, we're currently cheating by
>> allowing checkpoints to complete without necessarily flushing all of
>> the pages that were dirty at the time we fixed the redo pointer out to
>> disk. We think this is OK because we know that those pages are going
>> to get truncated away, but it's not really OK because when the system
>> starts up, it has to replay WAL starting from the checkpoint's redo
>> pointer, but the state of the page is not the same as it was at the
>> time when the redo pointer was the end of WAL, so redo fails. In the
>> case described in
>> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
>> modifications are made to the page before the redo pointer is fixed
>> and those changes never make it to disk, but the truncation also never
>> makes it to the disk either. With this patch, that can't happen,
>> because no checkpoint can intervene between when we (1) decide we're
>> not going to bother writing those dirty pages and (2) actually
>> truncate them away. So either the pages will get written as part of
>> the checkpoint, or else they'll be gone before the checkpoint
>> completes. In the latter case, I suppose redo that would have modified
>> those pages will just be skipped, thus dodging the problem.
> I think your understanding is right.
>
>> In RelationTruncate, I suggest that we ought to clear the
>> delay-checkpoint flag before rather than after calling
>> FreeSpaceMapVacuumRange. Since the free space map is not fully
>> WAL-logged, anything we're doing there should be non-critical. Also, I
> Agreed and fixed.
>
>> think it might be better if MarkBufferDirtyHint stays closer to the
>> existing coding and just uses a Boolean and an if-test to decide
>> whether to clear the bit, instead of inventing a new mechanism. I
>> don't really see anything wrong with the new mechanism, but I think
>> it's better to keep the patch minimal.
> Yeah, that was a a kind of silly. Fixed.
>
>> As you say, this doesn't fix the problem that truncation might fail.
>> But as Andres and Sawada-san said, the solution to that is to get rid
>> of the comments saying that it's OK for truncation to fail and make it
>> a PANIC. However, I don't think that change needs to be part of this
>> patch. Even if we do that, we still need to do this. And even if we do
>> this, we still need to do that.
> Ok. Addition to the aboves, I rewrote the comment in RelatinoTruncate.
>
> +     * Delay the concurrent checkpoint's completion until this truncation
> +     * successfully completes, so that we don't establish a redo-point between
> +     * buffer deletion and file-truncate. Otherwise we can leave inconsistent
> +     * file content against the WAL records after the REDO position and future
> +     * recovery fails.
>
> However, a problem for me for now is that I cannot reproduce the
> problem.
>
> To avoid further confusion, the attached is named as *.patch.
>
> regards.
>
Hi. This is my first attempt to review a patch so feel free to tell me 
if I missed something.

As of today's state of REL_14_STABLE 
(ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is reproducible 
using the script provided by Daniel Wood in this 
(1335373813.287510.1573611814107@connect.xfinity.com) message. Also, the 
latest patch seems not to be applicable and requires some minor tweaks.


Regards,

Daniel Shelepanov




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: CREATEROLE and role ownership hierarchies
Следующее
От: Andres Freund
Дата:
Сообщение: TAP tests, directories and parameters