Re: [Patch] ALTER SYSTEM READ ONLY

Поиск
Список
Период
Сортировка
От Amul Sul
Тема Re: [Patch] ALTER SYSTEM READ ONLY
Дата
Msg-id CAAJ_b95=i64xCG3rNQe0T4zMqBdcge9x7CPgSQzsS6pb4LvDhQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [Patch] ALTER SYSTEM READ ONLY  (Amul Sul <sulamul@gmail.com>)
Ответы Re: [Patch] ALTER SYSTEM READ ONLY  (Amul Sul <sulamul@gmail.com>)
Список pgsql-hackers
On Tue, Nov 23, 2021 at 7:23 PM Amul Sul <sulamul@gmail.com> wrote:
>
>    On Wed, Nov 17, 2021 at 6:20 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> >   On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:
> > > >
> > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > >
> > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > > > > Attached is the rebased version of refactoring as well as the
> > > > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > > > 39a3105678a).
> > > > >
> > > > > I spent a lot of time today studying 0002, and specifically the
> > > > > question of whether EndOfLog must be the same as
> > > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > > > XLogCtl->replayEndTLI.
> > > > >
> > > > > The answer to the former question is "no" because, if we don't enter
> > > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > > > enter redo, then I think it has to be the same unless something very
> > > > > weird happens. EndOfLog gets set like this:
> > > > >
> > > > >     XLogBeginRead(xlogreader, LastRec);
> > > > >     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > > >     EndOfLog = EndRecPtr;
> > > > >
> > > > > In every case that exists in our regression tests, EndRecPtr is the
> > > > > same before these three lines of code as it is afterward. However, if
> > > > > you test with recovery_target=immediate, you can get it to be
> > > > > different, because in that case we drop out of the redo loop after
> > > > > calling recoveryStopsBefore() rather than after calling
> > > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > > is ignored when you use recovery_target_name). It seems like a really
> > > > > bad thing that neither recovery_target=immediate nor
> > > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > > add some.
> > > > >
> > > >
> > > > recovery/t/003_recovery_targets.pl has test for
> > > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > > can add that for recovery_target_lsn, recovery_target_time, and
> > > > recovery_target_xid case only where it affects.
> > > >
> > > > > Anyway, in effect, these three lines of code have the effect of
> > > > > backing up the xlogreader by one record when we stop before rather
> > > > > than after a record that we're replaying. What that means is that
> > > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > > replayed. There might be one more record that we read but did not
> > > > > replay, and that record won't impact the value we end up with in
> > > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > > record that we actually replayed. To put that another way, there's no
> > > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > > and before we change LastRec. So in the cases where
> > > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > > different from EndOfLog if something different happens when we re-read
> > > > > the last-replayed WAL record than what happened when we read it the
> > > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > > happen. I am inclined to think that it might be better not to reread
> > > > > the record at all, though.
> > > >
> > > > There are two reasons that the record is reread; first, one that you
> > > > have just explained where the redo loop drops out due to
> > > > recoveryStopsBefore() and another one is that InRecovery is false.
> > > >
> > > > In the formal case at the end, redo while-loop does read a new record
> > > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > > reach the place where we do reread record -- where we do read the
> > > > record (i.e. LastRec) before the record that redo loop has read and
> > > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > > don't need any adjustment to EndRecPtr.
> > > >
> > > > So technically one case needs reread but that is also not needed, we
> > > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > > be set conditionally something like:
> > > >
> > > > if (InRecovery)
> > > > {
> > > >     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > > >     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > > }
> > > > else
> > > > {
> > > >     EndOfLog = EndRecPtr;
> > > >     EndOfLogTLI = replayTLI;
> > > > }
> > > >
> > > > > As far as this patch goes, I think we need
> > > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > > that's only sometimes initialized and then not doing anything with it
> > > > > except in the cases where it was initialized.
> > > > >
> > > >
> > > > Another reason could be EndOfLog changes further in the following case:
> > > >
> > > > /*
> > > >  * Actually, if WAL ended in an incomplete record, skip the parts that
> > > >  * made it through and start writing after the portion that persisted.
> > > >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> > > >  * we'll do as soon as we're open for writing new WAL.)
> > > >  */
> > > > if (!XLogRecPtrIsInvalid(missingContrecPtr))
> > > > {
> > > >     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> > > >     EndOfLog = missingContrecPtr;
> > > > }
> > > >
> > > > Now only solution that I can think is to copy EndOfLog (so
> > > > EndOfLogTLI) into shared memory.
> > > >
> > > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > > > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > > > > think the regression tests contain any scenarios where we run recovery
> > > > > and the values end up different. However, I think that the code sets
> > > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > > > > record originated. So imagine that we are looking for WAL that ought
> > > > > to be in 000000010000000000000003 but we don't find it; instead we
> > > > > find 000000020000000000000003 because our recovery target timeline is
> > > > > 2, or something that has 2 in its history. We will read the WAL for
> > > > > timeline 1 from this file which has timeline 2 in the file name. I
> > > > > think if recovery ends in this file before the timeline switch, these
> > > > > values will be different. I did not try to construct a test case for
> > > > > this today due to not having enough time, so it's possible that I'm
> > > > > wrong about this, but that's how it looks to me from the code.
> > > > >
> > > >
> > > > I am not sure, I have understood this scenario due to lack of
> > > > expertise in this area -- Why would the record we looking that ought
> > > > to be in 000000010000000000000003 we don't find it? Possibly WAL
> > > > corruption or that file is missing?
> > > >
> > >
> > > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
> > > XLogFileReadAnyTLI(), I think I could make a sense that there could be
> > > a case where the record belong to TLI 1 we are looking for; we might
> > > open the file with TLI 2. But, I am wondering what's wrong if we say
> > > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
> > > in its file name -- that statement is still true, and that record
> > > should be still accessible from the filename with TLI 1.  Also, if we
> > > going to consider this reading record exists before the timeline
> > > switch point as the EndOfLog then why should be worried about the
> > > latter timeline switch which eventually everything after the EndOfLog
> > > going to be useless for us. We might continue switching TLI and/or
> > > writing the WAL right after EndOfLog, correct me if I am missing
> > > something here.
> > >
> > > Further, I still think replayEndTLI has set to the correct value what
> > > we looking for EndOfLogTLI when we go through the redo loop. When it
> > > read the record and finds a change in the current replayTLI then it
> > > updates that as:
> > >
> > > if (newReplayTLI != replayTLI)
> > > {
> > >     /* Check that it's OK to switch to this TLI */
> > >     checkTimeLineSwitch(EndRecPtr, newReplayTLI,
> > >                         prevReplayTLI, replayTLI);
> > >
> > >     /* Following WAL records should be run with new TLI */
> > >     replayTLI = newReplayTLI;
> > >     switchedTLI = true;
> > > }
> > >
> > > Then replayEndTLI gets updated. If we going to skip the reread of
> > > "LastRec" that we were discussing, then I think the following code
> > > that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> > > (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> > > should be enough, AFAICU.
> > >
> > > /*
> > >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> > >  * the end-of-log. It could be different from the timeline that EndOfLog
> > >  * nominally belongs to, if there was a timeline switch in that segment,
> > >  * and we were reading the old WAL from a segment belonging to a higher
> > >  * timeline.
> > >  */
> > > EndOfLogTLI = xlogreader->seg.ws_tli;
> > >
> >
> > I think I found the right case for this, above TLI fetch is needed in
> > the case where we do restore from the archived WAL files. In my trial,
> > the archive directory has files as below (Kindly ignore the extra
> > history file, I perform a few more trials to be sure):
> >
> > -rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
> > -rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
> > -rw-------. 1 amul amul      128 Nov 17 06:36 00000004.history
> > -rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
> > -rw-------. 1 amul amul      171 Nov 17 06:39 00000005.history
> > -rw-------. 1 amul amul      209 Nov 17 06:45 00000006.history
> > -rw-------. 1 amul amul      247 Nov 17 06:52 00000007.history
> >
> > The timeline is switched in 1F file but the archiver has backup older
> > timeline file and renamed it. While performing PITR using these
> > archived files, the .partitial file seems to be skipped from the
> > restore. The file with the next timeline id is selected to read the
> > records that belong to the previous timeline id as well (i.e. 4 here,
> > all the records before timeline switch point). Here is the files
> > inside pg_wal directory after restore, note that in the current
> > experiment, I chose recovery_target_xid = <just before the timeline#5
> > switch point > and then recovery_target_action = 'promote':
> >
> > -rw-------. 1 amul amul       85 Nov 17 07:33 00000003.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
> > -rw-------. 1 amul amul      128 Nov 17 07:33 00000004.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
> > -rw-------. 1 amul amul      171 Nov 17 07:33 00000005.history
> > -rw-------. 1 amul amul      209 Nov 17 07:33 00000006.history
> > -rw-------. 1 amul amul      247 Nov 17 07:33 00000007.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F
> >
> > The last one is the new WAL file created in that cluster.
> >
>
> With this experiment, I think it is clear that the EndOfLogTLI can be
> different from the replayEndTLI or lastReplayedTLI, and we don't have
> any other option to get that into other processes other than exporting
> into shared memory.  Similarly, we have bunch of option (e.g.
> replayEndRecPtr, lastReplayedEndRecPtr, lastSegSwitchLSN etc) to get
> EndOfLog value but those are not perfect and reliable options.
>
> Therefore, in the attached patch, I have exported EndOfLog and
> EndOfLogTLI into shared memory and attached only the refactoring
> patches since there a bunch of other work needs to be done on the main
> ASRO patches what I discussed with Robert off-list, thanks.
>

Attaching the rest of the patches. To execute XLogAcceptWrites() ->
PerformRecoveryXLogAction() in Checkpointer process; ideally, we
should perform full checkpoint but we can't do that using current
PerformRecoveryXLogAction() which would call RequestCheckpoint() with
WAIT flags which make the Checkpointer process wait infinite on itself
to finish the requested checkpoint, bad!!

The option we have is to change RequestCheckpoint() for the
Checkpointer process directly call CreateCheckPoint() as we do for
!IsPostmasterEnvironment case, but problem is that XLogWrite() running
inside Checkpointer process can reach to CreateCheckPoint() and cause
an unexpected behaviour that I have noted previously[1].  The
RequestCheckpoint() from XLogWrite() when inside Checkpointer process
is needed or not is need a separate discussion.   For now, I have
changed PerformRecoveryXLogAction() to call CreateCheckPoint() for the
Checkpointer process; in the v41-0003 version I tried to do the
changes to RequestCheckpoint() to avoid that but that change looks too
ugly.

Another problem is the recursive call to XLogAccepWrite() in the
Checkpointer process due to the aforesaid CreateCheckPoint() call from
PerformRecoveryXLogAction(). The reason is to avoid the delay in
processing WAL prohibit state change requests we do have added
ProcessWALProhibitStateChangeRequest() call multiple places that
Checkpointer can check and process while performing a long-running
checkpoint. When Checkpointer call CreateCheckPoint() from
PerformRecoveryXLogAction() then that can also hit
ProcessWALProhibitStateChangeRequest() and since XLogAccepWrite()
operation not completed yet that tried to do that again. To avoid that
I have added a flag that avoids ProcessWALProhibitStateChangeRequest()
execution is that flag is set, see
ProcessWALProhibitStateChangeRequest() in attached 0003 patch.

Note that both the issues, I noted above are boil down to
CreateCheckPoint() and its need. If we don't need to perform a full
checkpoint in our case then we might not have that recursion issue.
Instead, do the CreateEndOfRecoveryRecord() and then do the full
checkpoint that currently PerformRecoveryXLogAction() does for the
promotion case but not having full checkpoint looks might look scary.
I tried that and works fine for me, but I am not much confident about
that.

Regards,
Amul

1] https://postgr.es/m/CAAJ_b97fPWU_yyOg97Y5AtSvx5mrg2cGyz260swz5x5iPKEM+g@mail.gmail.com

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Daniel Gustafsson
Дата:
Сообщение: Re: pg_upgrade fails with non-standard ACL
Следующее
От: Daniel Gustafsson
Дата:
Сообщение: Re: Allow CURRENT_ROLE in GRANTED BY