Обсуждение: recovery_target_action=pause with confusing hint
Hello Currently during point-in-time recovery with recovery_target_action = 'pause' we print log lines: > LOG: recovery has paused > HINT: Execute pg_wal_replay_resume() to continue. My colleague told me that this is a terrible moment: to continue what exactly? It sounds like "to continue replay", similarto normal pg_wal_replay_pause/pg_wal_replay_resume behavior. We have just small note in documentation: > The paused state can be resumed by using pg_wal_replay_resume() (see Table 9.81), which then causes recovery to end. But I think this is important place and can be improved. Also the database does not respond to the promote signals at this stage. Attached patch 0001 with the test will fail. 0002 patch contains my proposed ideas: - introduce separate message for pause due pg_wal_replay_pause call and for recovery_target_action. - check for standby triggers only for recovery_target_action - I am not sure this would be safe for pg_wal_replay_pause()call case Maybe more verbose hint would be appropriate: > Execute pg_promote() to end recovery or shut down the server, change the recovery target settings to a later target andrestart to continue recovery Thoughts? regards, Sergei
Вложения
On 2020/01/30 20:00, Sergei Kornilov wrote: > Hello > > Currently during point-in-time recovery with recovery_target_action = 'pause' we print log lines: > >> LOG: recovery has paused >> HINT: Execute pg_wal_replay_resume() to continue. > > My colleague told me that this is a terrible moment: to continue what exactly? It sounds like "to continue replay", similarto normal pg_wal_replay_pause/pg_wal_replay_resume behavior. We have just small note in documentation: > >> The paused state can be resumed by using pg_wal_replay_resume() (see Table 9.81), which then causes recovery to end. > > But I think this is important place and can be improved. > > Also the database does not respond to the promote signals at this stage. Attached patch 0001 with the test will fail. > > 0002 patch contains my proposed ideas: > - introduce separate message for pause due pg_wal_replay_pause call and for recovery_target_action. +1 > - check for standby triggers only for recovery_target_action - I am not sure this would be safe for pg_wal_replay_pause()call case Agreed. Basically I think that recoveryPausesHere() should the promotion trigger whether recovery target is reached or not. But one question is; how should the recovery behave if recovery target is reached with recovery_target_action=pause after the promotion is requested? It should pause? Or promote? Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Hi Sergei, On 1/30/20 12:00 PM, Fujii Masao wrote: > >> - check for standby triggers only for recovery_target_action - I am >> not sure this would be safe for pg_wal_replay_pause() call case > > Agreed. Basically I think that recoveryPausesHere() should the promotion > trigger whether recovery target is reached or not. But one question is; > how should the recovery behave if recovery target is reached with > recovery_target_action=pause after the promotion is requested? > It should pause? Or promote? Do you have thoughts on Fujii's comments? Regards, -- -David david@pgmasters.net
On 2020/03/09 21:30, David Steele wrote: > Hi Sergei, > > On 1/30/20 12:00 PM, Fujii Masao wrote: >> >>> - check for standby triggers only for recovery_target_action - I am not sure this would be safe for pg_wal_replay_pause()call case >> >> Agreed. Basically I think that recoveryPausesHere() should the promotion >> trigger whether recovery target is reached or not. But one question is; >> how should the recovery behave if recovery target is reached with >> recovery_target_action=pause after the promotion is requested? >> It should pause? Or promote? > > Do you have thoughts on Fujii's comments? Thanks for the ping! And sorry for not reporting the current status. I started the discussion about the topic very related to this. I'm thinking to apply the change that Sergei proposes after applying the patch I attached in this thread. https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
On Mon, Mar 9, 2020 at 12:03 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > I started the discussion about the topic very related to this. > I'm thinking to apply the change that Sergei proposes after applying > the patch I attached in this thread. > https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com I think it would be good to change the primary message, not just the hint. e.g. LOG: pausing at end of recovery HINT: Execute pg_wal_replay_resume() to promote. vs. LOG: recovery has paused HINT: Execute pg_wal_replay_resume() to continue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2020/03/10 2:27, Robert Haas wrote: > On Mon, Mar 9, 2020 at 12:03 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> I started the discussion about the topic very related to this. >> I'm thinking to apply the change that Sergei proposes after applying >> the patch I attached in this thread. >> https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com > > I think it would be good to change the primary message, not just the hint. e.g. > > LOG: pausing at end of recovery > HINT: Execute pg_wal_replay_resume() to promote. > > vs. > > LOG: recovery has paused > HINT: Execute pg_wal_replay_resume() to continue. Looks good to me. Attached is the updated version of the patch. This is based on Sergei's patch. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Вложения
Hello, When I test the patch, I find an issue: I start a stream with 'promote_trigger_file' GUC valid, and exec pg_wal_replay_pause() during recovery and as below it shows success to pause at the first time. I think it use a initialize 'SharedPromoteIsTriggered' value first time I exec the pg_wal_replay_pause(). ##################################### postgres=# select pg_wal_replay_pause(); pg_wal_replay_pause --------------------- (1 row) postgres=# select pg_wal_replay_pause(); ERROR: standby promotion is ongoing HINT: pg_wal_replay_pause() cannot be executed after promotion is triggered. postgres=# select pg_wal_replay_pause(); ERROR: recovery is not in progress HINT: Recovery control functions can only be executed during recovery. postgres=# ############################################################## The new status of this patch is: Waiting on Author
Hello > When I test the patch, I find an issue: I start a stream with 'promote_trigger_file' > GUC valid, and exec pg_wal_replay_pause() during recovery and as below it > shows success to pause at the first time. I think it use a initialize > 'SharedPromoteIsTriggered' value first time I exec the pg_wal_replay_pause(). hm. Are you sure this is related to this patch? Could you explain the exact timing? I mean log_statement = all and relevantlogs. Most likely this is expected change by https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=496ee647ecd2917369ffcf1eaa0b2cdca07c8730 My proposal does not change the behavior after this commit, only changing the lines in the logs. regards, Sergei
On 2020/03/31 18:59, Sergei Kornilov wrote: > Hello > >> When I test the patch, I find an issue: I start a stream with 'promote_trigger_file' >> GUC valid, and exec pg_wal_replay_pause() during recovery and as below it >> shows success to pause at the first time. I think it use a initialize >> 'SharedPromoteIsTriggered' value first time I exec the pg_wal_replay_pause(). > > hm. Are you sure this is related to this patch? Could you explain the exact timing? I mean log_statement = all and relevantlogs. > Most likely this is expected change by https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=496ee647ecd2917369ffcf1eaa0b2cdca07c8730 Yeah, if the problem exists there, that would be my fault in the commit. movead li, could you share the detail procedure to reproduce the issue? I'm happy to investigate it. > My proposal does not change the behavior after this commit, only changing the lines in the logs. Yes. What's your opinion about the latest patch based on Robert's idea? Barring any ojection, I'd like to commit that. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Hello >> My proposal does not change the behavior after this commit, only changing the lines in the logs. > > Yes. What's your opinion about the latest patch based on Robert's idea? > Barring any ojection, I'd like to commit that. Oh, sorry. Looks good and solves my issue regards, Sergei
On 2020/03/31 19:33, Sergei Kornilov wrote: > Hello > >>> My proposal does not change the behavior after this commit, only changing the lines in the logs. >> >> Yes. What's your opinion about the latest patch based on Robert's idea? >> Barring any ojection, I'd like to commit that. > > Oh, sorry. Looks good and solves my issue Thanks for reviewing the patch! Pushed! Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
>>>> My proposal does not change the behavior after this commit, only changing the lines in the logs. >>> >>> Yes. What's your opinion about the latest patch based on Robert's idea? >>> Barring any ojection, I'd like to commit that. >> >> Oh, sorry. Looks good and solves my issue > > Thanks for reviewing the patch! Pushed! Thank you! regards, Sergei
>> When I test the patch, I find an issue: I start a stream with 'promote_trigger_file'
>> GUC valid, and exec pg_wal_replay_pause() during recovery and as below it
>> shows success to pause at the first time. I think it use a initialize
>> 'SharedPromoteIsTriggered' value first time I exec the pg_wal_replay_pause().
>hm. Are you sure this is related to this patch? Could you explain the exact timing? I mean log_statement = all and relevant logs.
>Most likely this is expected change by https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=496ee647ecd2917369ffcf1eaa0b2cdca07c8730
>My proposal does not change the behavior after this commit, only changing the lines in the logs.
I test it again with (92d31085e926253aa650b9d1e1f2f09934d0ddfc), and the
issue appeared again. Here is my test method which quite simple:
1. Setup a base backup by pg_basebackup.
2. Insert lots of data in master for the purpose I have enough time to exec
pg_wal_replay_pause() when startup the replication.
3. Configure the 'promote_trigger_file' GUC and create the trigger file.
4. Start the backup(standby), connect it immediately, and exec pg_wal_replay_pause()
Then it appears, and a test log attached.
I means when I exec the pg_wal_replay_pause() first time, nobody has check the trigger state
by CheckForStandbyTrigger(), it use a Initialized 'SharedPromoteIsTriggered' value.
And patch attached can solve the issue.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Вложения
On 2020/04/01 11:42, movead.li@highgo.ca wrote: > >>> When I test the patch, I find an issue: I start a stream with 'promote_trigger_file' >> > GUC valid, and exec pg_wal_replay_pause() during recovery and as below it > >> shows success to pause at the first time. I think it use a initialize > >> 'SharedPromoteIsTriggered' value first time I exec the pg_wal_replay_pause(). >>hm. Are you sure this is related to this patch? Could you explain the exact timing? I mean log_statement = all and relevantlogs. >>Most likely this is expected change by https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=496ee647ecd2917369ffcf1eaa0b2cdca07c8730 >>My proposal does not change the behavior after this commit, only changing the lines in the logs. > I test it again with (92d31085e926253aa650b9d1e1f2f09934d0ddfc), and the > issue appeared again. Here is my test method which quite simple: > 1. Setup a base backup by pg_basebackup. > 2. Insert lots of data in master for the purpose I have enough time to exec > pg_wal_replay_pause() when startup the replication. > 3. Configure the 'promote_trigger_file' GUC and create the trigger file. > 4. Start the backup(standby), connect it immediately, and exec pg_wal_replay_pause() > Then it appears, and a test log attached. > > I means when I exec the pg_wal_replay_pause() first time, nobody has check the trigger state > by CheckForStandbyTrigger(), it use a Initialized 'SharedPromoteIsTriggered' value. > And patch attached can solve the issue. Thanks for the explanation! But, sorry,,, I failed to understand the issue that you reported, yet... You mean that the first call of pg_wal_replay_pause() in the step #2 should check whether the trigger file exists or not? If so, could you tell me why we should do that? BTW, right now only the startup process is allowed to call CheckForStandbyTrigger(). So the backend process calling pg_wal_replay_pause() and PromoteIsTriggered() is not allowed to call CheckForStandbyTrigger(). The current logic is that the startup process is responsible for checking the trigger file and set the flag in the shmem if promotion is triggered. Then other processes like backend know whether promotion is ongoing or not from the shmem. So basically the backend doesn't need to check the trigger file itself. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
>But, sorry,,, I failed to understand the issue that you reported, yet....
>You mean that the first call of pg_wal_replay_pause() in the step #2
>should check whether the trigger file exists or not? If so, could you
>tell me why we should do that?
Sorry about my pool english. The 'pg_wal_replay_pause()' is executed
in step 4. I mention it in step 2 is for explain why it need lots of insert
data.
>BTW, right now only the startup process is allowed to call
>CheckForStandbyTrigger(). So the backend process calling
>pg_wal_replay_pause() and PromoteIsTriggered() is not allowed to call
>CheckForStandbyTrigger(). The current logic is that the startup process
>is responsible for checking the trigger file and set the flag in the shmem
It's here, startup process does not call CheckForStandbyTrigger() to check
the trigger file until a pg_wal_replay_pause() or PromoteIsTriggered() comes,
so first time to call the pg_wal_replay_pause(), it use a wrong
'SharedPromoteIsTriggered' value.
>if promotion is triggered. Then other processes like backend know
>whether promotion is ongoing or not from the shmem. So basically
>the backend doesn't need to check the trigger file itself.
If backend is not allowed to call CheckForStandbyTrigger(), then you should
find a way to hold it.
In another word, during the recovery if I add the trigger file, the starup process
do not know it at all until after a pg_wal_replay_pause() come.
In addition, although the first time I exec 'pg_wal_replay_pause' it shows success,
the startup process is keeping redo(no stop).
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/04/01 16:53, movead.li@highgo.ca wrote: > >>But, sorry,,, I failed to understand the issue that you reported, yet.... >>You mean that the first call of pg_wal_replay_pause() in the step #2 >>should check whether the trigger file exists or not? If so, could you >>tell me why we should do that? > Sorry about my pool english. The 'pg_wal_replay_pause()' is executed > in step 4. I mention it in step 2 is for explain why it need lots of insert > data. > >>BTW, right now only the startup process is allowed to call >>CheckForStandbyTrigger(). So the backend process calling >>pg_wal_replay_pause() and PromoteIsTriggered() is not allowed to call >>CheckForStandbyTrigger(). The current logic is that the startup process >>is responsible for checking the trigger file and set the flag in the shmem > It's here, startup process does not call CheckForStandbyTrigger() to check > the trigger file until a pg_wal_replay_pause() or PromoteIsTriggered() comes, > so first time to call the pg_wal_replay_pause(), it use a wrong > 'SharedPromoteIsTriggered' value. > > >>if promotion is triggered. Then other processes like backend know >>whether promotion is ongoing or not from the shmem. So basically >>the backend doesn't need to check the trigger file itself. > If backend is not allowed to call CheckForStandbyTrigger(), then you should > find a way to hold it. > In another word, during the recovery if I add the trigger file, the starup process > do not know it at all until after a pg_wal_replay_pause() come. Thanks for the explanation again! Maybe I understand your point. As far as I read the code, in the standby mode, the startup process periodically checks the trigger file in WaitForWALToBecomeAvailable(). No? There can be small delay between the creation of the trigger file and the periodic call to CheckForStandbyTrigger() by the startup process. If you execute pg_wal_replay_pause() during that delay, it would suceed. But you'd like to get rid of that delay completely? In other words, both the startup process and the backend calling pg_wal_replay_pause() should detect the existence of the trigger file immdiately after it's created? Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
>Thanks for the explanation again! Maybe I understand your point.
Great.
>As far as I read the code, in the standby mode, the startup process
>periodically checks the trigger file in WaitForWALToBecomeAvailable().
>No?
Yes it is.
>There can be small delay between the creation of the trigger file
>and the periodic call to CheckForStandbyTrigger() by the startup process.
>If you execute pg_wal_replay_pause() during that delay, it would suceed.
If there be a huge number of wal segments need a standby to rewind,
then it can not be a 'small delay'.
>But you'd like to get rid of that delay completely? In other words,
>both the startup process and the backend calling pg_wal_replay_pause()
>should detect the existence of the trigger file immdiately after
>it's created?
I want to point out the thing, the pg_wal_replay_pause() shows success but
the startup process keeping redo, it may cause something confused. So it's
better to solve the issue.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/04/01 17:58, movead.li@highgo.ca wrote: >>Thanks for the explanation again! Maybe I understand your point. > Great. > >>As far as I read the code, in the standby mode, the startup process >>periodically checks the trigger file in WaitForWALToBecomeAvailable(). >>No? > Yes it is. > >>There can be small delay between the creation of the trigger file >>and the periodic call to CheckForStandbyTrigger() by the startup process. >>If you execute pg_wal_replay_pause() during that delay, it would suceed. > If there be a huge number of wal segments need a standby to rewind, > then it can not be a 'small delay'. Yeah, that's true. >>But you'd like to get rid of that delay completely? In other words, >>both the startup process and the backend calling pg_wal_replay_pause() >>should detect the existence of the trigger file immdiately after >>it's created? > I want to point out the thing, the pg_wal_replay_pause() shows success but > the startup process keeping redo, it may cause something confused. So it's > better to solve the issue. This happens because the startup process detects the trigger file after pg_wal_replay_pause() succeeds, and then make the recovery get out of the paused state. It might be problematic to end the paused state silently? So, to make the situation less confusing, what about emitting a log message when ending the paused state because of the promotion? Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
>This happens because the startup process detects the trigger file
>after pg_wal_replay_pause() succeeds, and then make the recovery
>get out of the paused state.
Yes that is.
>It might be problematic to end the paused
>state silently? So, to make the situation less confusing, what about
>emitting a log message when ending the paused state because of
>the promotion?
But where to emit it? I think it not so good by emitting to log file,
because the user will not check it everytime. BTW, why
CheckForStandbyTrigger() can not be called in backend.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/04/01 18:56, movead.li@highgo.ca wrote: > > >This happens because the startup process detects the trigger file >>after pg_wal_replay_pause() succeeds, and then make the recovery >>get out of the paused state. > Yes that is. > > >It might be problematic to end the paused >>state silently? So, to make the situation less confusing, what about >>emitting a log message when ending the paused state because of >>the promotion? > But where to emit it? I think it not so good by emitting to log file, > because the user will not check it everytime. Yeah, I'm thinking to emit the message to log file, like the startup process does in other places :) > BTW, why > CheckForStandbyTrigger() can not be called in backend. Because 1) promote_signaled flag that IsPromoteSignaled() sees is set only in the startup process 2) the trigger file can be removed or promote_trigger_file can be changed after the backend detects it but before the startup process does. That is, the result of the trigger file detection can be different between the processes. Of course we can change CheckForStandbyTrigger() so that it can be called by backends, but I'm not sure if it's worth doing that. Or another idea to reduce the delay between the request for the promotion and the detection of it is to make the startup process call CheckForStandbyTrigger() more frequently. Calling that every replay of WAL record would be overkill and decrease the recovery performance. Calling thst every WAL file might be ok. I'm not sure if this is really improvement or not, though... Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
On 2020/04/01 19:37, Fujii Masao wrote: > > > On 2020/04/01 18:56, movead.li@highgo.ca wrote: >> >> >This happens because the startup process detects the trigger file >>> after pg_wal_replay_pause() succeeds, and then make the recovery >>> get out of the paused state. >> Yes that is. >> >> >It might be problematic to end the paused >>> state silently? So, to make the situation less confusing, what about >>> emitting a log message when ending the paused state because of >>> the promotion? >> But where to emit it? I think it not so good by emitting to log file, >> because the user will not check it everytime. > > Yeah, I'm thinking to emit the message to log file, like the startup process > does in other places :) So I'd like to propose the attached patch. The patch changes the message logged when a promotion is requested, based on whether the recovery is in paused state or not. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Вложения
>So I'd like to propose the attached patch. The patch changes the message
>logged when a promotion is requested, based on whether the recovery is
>in paused state or not.
It is a compromise, we should notice it in document I think.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/04/02 10:41, movead.li@highgo.ca wrote: > > >So I'd like to propose the attached patch. The patch changes the message >>logged when a promotion is requested, based on whether the recovery is >>in paused state or not. > It is a compromise, Ok, so barring any objection, I will commit the patch. > we should notice it in document I think. There is the following explation about the relationship the recovery pause and the promotion, in the document. You may want to add more descriptions into the docs? ------------------------------ If a promotion is triggered while recovery is paused, the paused state ends and a promotion continues. ------------------------------ Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
>> we should notice it in document I think.
>There is the following explation about the relationship the recovery
>pause and the promotion, in the document. You may want to add more
>descriptions into the docs?
>------------------------------
>If a promotion is triggered while recovery is paused, the paused
>state ends and a promotion continues.
>------------------------------
For example we can add this words:
First-time pg_wal_replay_pause() called during recovery which triggered
as promotion, pg_wal_replay_pause() show success but it did not really
pause the recovery.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/04/08 16:41, movead.li@highgo.ca wrote: > > >> we should notice it in document I think. >>There is the following explation about the relationship the recovery >>pause and the promotion, in the document. You may want to add more >>descriptions into the docs? > >------------------------------ >>If a promotion is triggered while recovery is paused, the paused >>state ends and a promotion continues. >>------------------------------ > > For example we can add this words: > First-time pg_wal_replay_pause() called during recovery which triggered > as promotion, pg_wal_replay_pause() show success but it did not really > pause the recovery. I think this is not true. In your case, pg_wal_replay_pause() succeeded because the startup process had not detected the promotion request yet at that moment. To cover your case, what about adding the following description? ----------------- There can be delay between a promotion request by users and the trigger of a promotion in the server. Note that pg_wal_replay_pause() succeeds during that delay, i.e., until a promotion is actually triggered. ----------------- Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
>To cover your case, what about adding the following description?
>-----------------
>There can be delay between a promotion request by users and the trigger of
>a promotion in the server. Note that pg_wal_replay_pause() succeeds
>during that delay, i.e., until a promotion is actually triggered.
>-----------------
Yes that's it.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca