Re: Corruption during WAL replay
От | Daniel Shelepanov |
---|---|
Тема | Re: Corruption during WAL replay |
Дата | |
Msg-id | 0c911453-a609-7814-f8de-f6c2cf9c5176@mail.ru обсуждение исходный текст |
Ответ на | Re: Corruption during WAL replay (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Ответы |
Re: Corruption during WAL replay
(Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
|
Список | pgsql-hackers |
On 27.09.2021 11:30, Kyotaro Horiguchi wrote: > Thank you for the comments! (Sorry for the late resopnse.) > > At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in >> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi >> <horikyota.ntt@gmail.com> wrote: >>> The patch assumed that CHKPT_START/COMPLETE barrier are exclusively >>> used each other, but MarkBufferDirtyHint which delays checkpoint start >>> is called in RelationTruncate while delaying checkpoint completion. >>> That is not a strange nor harmful behavior. I changed delayChkpt to a >>> bitmap integer from an enum so that both barrier are separately >>> triggered. >>> >>> I'm not sure this is the way to go here, though. This fixes the issue >>> of a crash during RelationTruncate, but the issue of smgrtruncate >>> failure during RelationTruncate still remains (unless we treat that >>> failure as PANIC?). >> I like this patch. As I understand it, we're currently cheating by >> allowing checkpoints to complete without necessarily flushing all of >> the pages that were dirty at the time we fixed the redo pointer out to >> disk. We think this is OK because we know that those pages are going >> to get truncated away, but it's not really OK because when the system >> starts up, it has to replay WAL starting from the checkpoint's redo >> pointer, but the state of the page is not the same as it was at the >> time when the redo pointer was the end of WAL, so redo fails. In the >> case described in >> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com >> modifications are made to the page before the redo pointer is fixed >> and those changes never make it to disk, but the truncation also never >> makes it to the disk either. With this patch, that can't happen, >> because no checkpoint can intervene between when we (1) decide we're >> not going to bother writing those dirty pages and (2) actually >> truncate them away. So either the pages will get written as part of >> the checkpoint, or else they'll be gone before the checkpoint >> completes. In the latter case, I suppose redo that would have modified >> those pages will just be skipped, thus dodging the problem. > I think your understanding is right. > >> In RelationTruncate, I suggest that we ought to clear the >> delay-checkpoint flag before rather than after calling >> FreeSpaceMapVacuumRange. Since the free space map is not fully >> WAL-logged, anything we're doing there should be non-critical. Also, I > Agreed and fixed. > >> think it might be better if MarkBufferDirtyHint stays closer to the >> existing coding and just uses a Boolean and an if-test to decide >> whether to clear the bit, instead of inventing a new mechanism. I >> don't really see anything wrong with the new mechanism, but I think >> it's better to keep the patch minimal. > Yeah, that was a a kind of silly. Fixed. > >> As you say, this doesn't fix the problem that truncation might fail. >> But as Andres and Sawada-san said, the solution to that is to get rid >> of the comments saying that it's OK for truncation to fail and make it >> a PANIC. However, I don't think that change needs to be part of this >> patch. Even if we do that, we still need to do this. And even if we do >> this, we still need to do that. > Ok. Addition to the aboves, I rewrote the comment in RelatinoTruncate. > > + * Delay the concurrent checkpoint's completion until this truncation > + * successfully completes, so that we don't establish a redo-point between > + * buffer deletion and file-truncate. Otherwise we can leave inconsistent > + * file content against the WAL records after the REDO position and future > + * recovery fails. > > However, a problem for me for now is that I cannot reproduce the > problem. > > To avoid further confusion, the attached is named as *.patch. > > regards. > Hi. This is my first attempt to review a patch so feel free to tell me if I missed something. As of today's state of REL_14_STABLE (ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is reproducible using the script provided by Daniel Wood in this (1335373813.287510.1573611814107@connect.xfinity.com) message. Also, the latest patch seems not to be applicable and requires some minor tweaks. Regards, Daniel Shelepanov
В списке pgsql-hackers по дате отправления: