Обсуждение: Should we remove "not fast" promotion at all?
Hi all, We discussed the $SUBJECT in the following threads: http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com Our consensus seems to remove "not fast" promotion at all because there is no use case for that promotion. Attached patch removes "not fast" promotion. Barring any objections, I will commit this patch. Regards, On Sat, Aug 3, 2013 at 4:31 PM, Tomonari Katsumata <t.katsumata1122@gmail.com> wrote: > Hi, > > I made a patch for REL9_3_STABLE which gets rid of > old promote processing. please check it. > This patch make PostgreSQL do fast promoting(*) always. > (*) which means skipping long checkpoint before increasing > timeline. > > And after this, I'll do make another patch for unlinking files which are > created by user as a trigger_file or "pg_ctl promote" command. > > --------------- > Tomonari Katsumata > 2013/7/30 Fujii Masao <masao.fujii@gmail.com> >> >> On Sat, Jul 27, 2013 at 6:57 PM, Tomonari Katsumata >> <t.katsumata1122@gmail.com> wrote: >> > Hi, >> > >> > >> >>>> Yes, it prevents PROMOTE_SIGNAL_FILE from remaining even if >> >>>> both promote files exist. >> >>>> >> >>> The command("unlink(PROMOTE_SIGNAL_FILE)") here is for >> >>> unusualy case. >> >>> Because the case is when done both procedures below. >> >>> - user create "promote" file on PGDATA >> >>> - user issue "pg_ctl promote" >> >>> >> >>> I understand the reason. >> >>> But I think it's better to unlink(PROMOTE_SIGNAL_FILE) before >> >>> unlink(FAST_PROMOTE_SIGNAL_FILE). >> >>> Because FAST_PROMOTE_SIGNAL_FILE is definetly there but >> >>> PROMOTE_SIGNAL_FILE is sometimes there or not there. >> >> >> >> I could not understand why that's better. Could you elaborate that? >> >> >> > I'm sorry for less explanation. >> > >> > I've thought that errno would be set ENOENT and >> > this may lead something wrong. >> > I checked this and I know it's not problem. >> > >> > sorry for confusing you. >> > >> > >> > >> >>> And I have another question linking this behavior. >> >>> I think TriggerFile should be removed too. >> >>> This is corner-case but it will happen. >> >>> How do you think of it ? >> >> >> >> I don't have strong opinion about that. I've never heard the complaint >> >> about that current behavior so far. >> >> >> > For example, please imagine the cascading replication environment and >> > using old master as a standby without copying the timeline history file >> > to new standby. >> > >> > ------- >> > 1. replicating 3 servers(A,B,C) >> > A->B->C >> > ("trigger_file = /tmp/trig" is set in recovery_recovery.conf on B and >> > C.) >> > >> > 2. stop server A and promoting server B with "touch /tmp/trig;pg_ctl >> > promote" >> >> Why do you need to both create the trigger file and run pg_ctl promote? >> >> Anyway, if the patch is useful for fail-safe and it doesn't break the >> current >> behavior, I'd be happy to apply it. You are suggesting that we should >> remove >> the trigger file in CheckForStandbyTrigger() even if pg_ctl promote is >> executed. >> But there can be some cases where we can get out of the WAL replay loop, >> for example, reach the recovery_target_xxx. So ISTM we should try to >> remove >> both the trigger file and "promote" file at the end of recovery >> instead. Thought? >> >> > B->C >> > (/tmp/trig file remains on server B) >> > >> > 4. stop server B and promoting server C with "pg_ctl promote" >> > C >> > >> > 5. making server B connect for standby of server C >> > C->B >> > --------- >> > >> > In step5 server B will promote as soon as it starts, >> > because "/tmp/trig" is stil there. >> > >> > >> > >> >>>> One question is that: we really still need to support normal promote? >> >>>> pg_ctl promote provides only way to do fast promotion. If we want to >> >>>> do normal promotion, we need to create PROMOTE_SIGNAL_FILE >> >>>> and send the SIGUSR1 signal to postmaster by hand. This seems messy. >> >>>> >> >>>> I think that we should remove normal promotion at all, or change >> >>>> pg_ctl promote so that provides also the way to do normal promotion. >> >>>> >> >>> I think he merit of "fast promote" is >> >>> - allowing quick connection by skipping checkpoint >> >>> and its demerit is >> >>> - taking little bit longer when crash-recovery >> >>> >> >>> If it is seldom to happen its crash soon after promoting >> >>> and "fast promte" never breaks consistency of database cluster, >> >>> I think we don't need normal promotion. >> >> >> >> You can execute checkpoint after fast promotion for that. >> >> >> > OK. >> > Then I think we should do below things. >> > - removing normal promotion at all from source >> > - adding the know-how you suggest on document >> >> IMO either is necessary. >> >> Regards, >> >> -- >> Fujii Masao > > -- Fujii Masao
Вложения
On Tue, Aug 6, 2013 at 3:24 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Hi all, > > We discussed the $SUBJECT in the following threads: > http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com > http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com > > Our consensus seems to remove "not fast" promotion at all > because there is no use case for that promotion. Indeed, if two modes of promotion are available, it is not that user-friendly if pg_ctl does not support both directly. struct stat stat_buf; - if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0 || - stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) + /* + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the + * signal handler. We now leave the file in place and let the Startup + * process do the unlink. This is the infrastructure for supporting + * various promotion modes in the future. This allows Startup to know + * the mode from the promote signal file that the postmaster left. + */ + if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) return true; Why not reshuffle this comment and remove references to 9.1 and 9.2? Something like that perhaps: Leave the promote signal file in place and let the Startup do the unlink. This infrastructure permits Startup to know the mode from the promote signal file that postmaster left, keeping the door open for support of multiple promotion modes in the future. - /* - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the - * signal handler. We now leave the file in place and let the Startup - * process do the unlink. This allows Startup to know whether we're - * doing fast or normal promotion. Fast promotion takes precedence. - */ - if (stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) - { - unlink(FAST_PROMOTE_SIGNAL_FILE); - unlink(PROMOTE_SIGNAL_FILE); - fast_promote = true; - } - else if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) - { - unlink(PROMOTE_SIGNAL_FILE); - fast_promote = false; - } - ereport(LOG, (errmsg("received promote request"))); - + unlink(PROMOTE_SIGNAL_FILE); Wouldn't it make sense to keep the call to stat() to check the file status before unlinking it? -- Michael
Hi, On 2013-08-06 03:24:58 +0900, Fujii Masao wrote: > Hi all, > > We discussed the $SUBJECT in the following threads: > http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com > http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com > > Our consensus seems to remove "not fast" promotion at all > because there is no use case for that promotion. > > Attached patch removes "not fast" promotion. Barring any objections, > I will commit this patch. FWIW I'd rather keep plain promotion for a release or two. TBH, I have a bit of trust issues regarding the new method, and I'd like to be able to test potential issues against a stock postgres by doing a normal instead of a fast promotion. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 6, 2013 at 11:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Aug 6, 2013 at 3:24 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Hi all, >> >> We discussed the $SUBJECT in the following threads: >> http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com >> http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com >> >> Our consensus seems to remove "not fast" promotion at all >> because there is no use case for that promotion. > Indeed, if two modes of promotion are available, it is not that > user-friendly if pg_ctl does not support both directly. > > struct stat stat_buf; > > - if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0 || > - stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > + /* > + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the > + * signal handler. We now leave the file in place and let the Startup > + * process do the unlink. This is the infrastructure for supporting > + * various promotion modes in the future. This allows Startup to know > + * the mode from the promote signal file that the postmaster left. > + */ > + if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > return true; > Why not reshuffle this comment and remove references to 9.1 and 9.2? I just left the old comment as it is. I'm OK with your suggestion. > - /* > - * In 9.1 and 9.2 the postmaster unlinked the promote > file inside the > - * signal handler. We now leave the file in place and > let the Startup > - * process do the unlink. This allows Startup to know > whether we're > - * doing fast or normal promotion. Fast promotion > takes precedence. > - */ > - if (stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > - { > - unlink(FAST_PROMOTE_SIGNAL_FILE); > - unlink(PROMOTE_SIGNAL_FILE); > - fast_promote = true; > - } > - else if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > - { > - unlink(PROMOTE_SIGNAL_FILE); > - fast_promote = false; > - } > - > ereport(LOG, (errmsg("received promote request"))); > - > + unlink(PROMOTE_SIGNAL_FILE); > Wouldn't it make sense to keep the call to stat() to check the file > status before unlinking it? Why do we need to check the existence of the file before removing it here? Regards, -- Fujii Masao
On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2013-08-06 03:24:58 +0900, Fujii Masao wrote: >> Hi all, >> >> We discussed the $SUBJECT in the following threads: >> http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com >> http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com >> >> Our consensus seems to remove "not fast" promotion at all >> because there is no use case for that promotion. >> >> Attached patch removes "not fast" promotion. Barring any objections, >> I will commit this patch. > > FWIW I'd rather keep plain promotion for a release or two. TBH, I have a > bit of trust issues regarding the new method, and I'd like to be able to > test potential issues against a stock postgres by doing a normal instead > of a fast promotion. So we should add new option specifying the promotion mode, into pg_ctl? Currently pg_ctl cannot trigger the normal promotion. Or, instead of normal promotion, it might be better to use another promotion technique like shutdown + remove recovery.conf + restart for that purpose? Regards, -- Fujii Masao
Fujii Masao <masao.fujii@gmail.com> schrieb: >On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> >wrote: >> Hi, >> >> On 2013-08-06 03:24:58 +0900, Fujii Masao wrote: >>> Hi all, >>> >>> We discussed the $SUBJECT in the following threads: >>> >http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com >>> >http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com >>> >>> Our consensus seems to remove "not fast" promotion at all >>> because there is no use case for that promotion. >>> >>> Attached patch removes "not fast" promotion. Barring any objections, >>> I will commit this patch. >> >> FWIW I'd rather keep plain promotion for a release or two. TBH, I >have a >> bit of trust issues regarding the new method, and I'd like to be able >to >> test potential issues against a stock postgres by doing a normal >instead >> of a fast promotion. > >So we should add new option specifying the promotion mode, into pg_ctl? >Currently pg_ctl cannot trigger the normal promotion. I am fine with only supporting doing the promotion in the old fashioned way, but I wouldn't protest against an option either. >Or, instead of normal promotion, it might be better to use another >promotion >technique like shutdown + remove recovery.conf + restart for that >purpose? That's a very bad thing to do since it suppresses the timeline increase... Regards, Andres Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 6, 2013 at 12:41 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> - >> + unlink(PROMOTE_SIGNAL_FILE); >> Wouldn't it make sense to keep the call to stat() to check the file >> status before unlinking it? > > Why do we need to check the existence of the file before removing it > here? Forget what I said, I had in mind that it might have been better to put in silence errors of unlink here. This is not mandatory. -- Michael
Fujii Masao <masao.fujii@gmail.com> writes: > On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> FWIW I'd rather keep plain promotion for a release or two. TBH, I have a >> bit of trust issues regarding the new method, and I'd like to be able to >> test potential issues against a stock postgres by doing a normal instead >> of a fast promotion. > So we should add new option specifying the promotion mode, into pg_ctl? > Currently pg_ctl cannot trigger the normal promotion. It would be silly to add such an option if we want to remove the old mode in a release or two. I think what Andres is suggesting is to leave it as-is for 9.4 and then remove the old code in 9.5 or 9.6. Which seems prudent to me. regards, tom lane
On Tue, Aug 6, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > > Fujii Masao <masao.fujii@gmail.com> schrieb: >>On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> >>wrote: >>> Hi, >>> >>> On 2013-08-06 03:24:58 +0900, Fujii Masao wrote: >>>> Hi all, >>>> >>>> We discussed the $SUBJECT in the following threads: >>>> >>http://www.postgresql.org/message-id/CA+TgmoZbR+WL8E7MF_KRp6fY4FD2pMr11TPiuyjMFX_Vtg1Wrw@mail.gmail.com >>>> >>http://www.postgresql.org/message-id/CAHGQGwEBUvgcx8X+Z0Hh+VdwYcJ8KCuRuLt1jSsxeLxPcX=0_w@mail.gmail.com >>>> >>>> Our consensus seems to remove "not fast" promotion at all >>>> because there is no use case for that promotion. >>>> >>>> Attached patch removes "not fast" promotion. Barring any objections, >>>> I will commit this patch. >>> >>> FWIW I'd rather keep plain promotion for a release or two. TBH, I >>have a >>> bit of trust issues regarding the new method, and I'd like to be able >>to >>> test potential issues against a stock postgres by doing a normal >>instead >>> of a fast promotion. >> >>So we should add new option specifying the promotion mode, into pg_ctl? >>Currently pg_ctl cannot trigger the normal promotion. > > I am fine with only supporting doing the promotion in the old fashioned way, but I wouldn't protest against an option either. Yeah an additional option would be the way to go especially if new promotion modes are supported in the future. Btw, what I like about this patch is that it opens the door for easier support of additional promotion modes. Could it be possible to use this advantage to support both the fast and non-fast promotions now with a new fresh structure? >>Or, instead of normal promotion, it might be better to use another >>promotion >>technique like shutdown + remove recovery.conf + restart for that >>purpose? > > That's a very bad thing to do since it suppresses the timeline increase... Agreed with Andres. This is unsafe as it avoids as well all the safety checks at the end of archive recovery. -- Michael
Hi,
2013/8/6 Tom Lane <tgl@sss.pgh.pa.us>
How about giving trigger_file an ability to chose "fast promote" and "normal promote"Fujii Masao <masao.fujii@gmail.com> writes:
> On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:>> FWIW I'd rather keep plain promotion for a release or two. TBH, I have aIt would be silly to add such an option if we want to remove the old mode
>> bit of trust issues regarding the new method, and I'd like to be able to
>> test potential issues against a stock postgres by doing a normal instead
>> of a fast promotion.
> So we should add new option specifying the promotion mode, into pg_ctl?
> Currently pg_ctl cannot trigger the normal promotion.
in a release or two.
I think what Andres is suggesting is to leave it as-is for 9.4 and then
remove the old code in 9.5 or 9.6. Which seems prudent to me.
like the triggerfile of pg_standby.
It means that if the contents of the trigger_file is empty or 'smart' then do "normal promote",
and it's 'fast' then do "fast promote".
I think this change would be smaller than change to pg_ctl.
And this would allow us to treat ${PGDATA}/promote and trigger_file only.(because ${PGDATA}/fast_promote is not created automatically)
regards,
---------------
---------------
Tomonari Katsumata
On Tue, Aug 6, 2013 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> wrote: >>> FWIW I'd rather keep plain promotion for a release or two. TBH, I have a >>> bit of trust issues regarding the new method, and I'd like to be able to >>> test potential issues against a stock postgres by doing a normal instead >>> of a fast promotion. > >> So we should add new option specifying the promotion mode, into pg_ctl? >> Currently pg_ctl cannot trigger the normal promotion. > > It would be silly to add such an option if we want to remove the old mode > in a release or two. Without such an option, a user cannot easily trigger the "normal" promotion when we find some problems in fast promotion. In this case, a user needs to create the "promote" file and send the SIGUSR1 signal to postmaster by hand. Or needs to execute pg_ctl promote by using old version (e.g., 9.2) of pg_ctl. Seems confusing. Regards, -- Fujii Masao
On 2013-08-07 22:26:53 +0900, Fujii Masao wrote: > On Tue, Aug 6, 2013 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Fujii Masao <masao.fujii@gmail.com> writes: > >> On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >>> FWIW I'd rather keep plain promotion for a release or two. TBH, I have a > >>> bit of trust issues regarding the new method, and I'd like to be able to > >>> test potential issues against a stock postgres by doing a normal instead > >>> of a fast promotion. > > > >> So we should add new option specifying the promotion mode, into pg_ctl? > >> Currently pg_ctl cannot trigger the normal promotion. > > > > It would be silly to add such an option if we want to remove the old mode > > in a release or two. > > Without such an option, a user cannot easily trigger the "normal" promotion > when we find some problems in fast promotion. In this case, a user needs to > create the "promote" file and send the SIGUSR1 signal to postmaster by hand. > Or needs to execute pg_ctl promote by using old version (e.g., 9.2) of pg_ctl. > Seems confusing. Seems fine for debugging to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 6, 2013 at 8:05 PM, Tomonari Katsumata <t.katsumata1122@gmail.com> wrote: > Hi, > > 2013/8/6 Tom Lane <tgl@sss.pgh.pa.us> >> >> Fujii Masao <masao.fujii@gmail.com> writes: >> > On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> >> > wrote: >> >> FWIW I'd rather keep plain promotion for a release or two. TBH, I have >> >> a >> >> bit of trust issues regarding the new method, and I'd like to be able >> >> to >> >> test potential issues against a stock postgres by doing a normal >> >> instead >> >> of a fast promotion. >> >> > So we should add new option specifying the promotion mode, into pg_ctl? >> > Currently pg_ctl cannot trigger the normal promotion. >> >> It would be silly to add such an option if we want to remove the old mode >> in a release or two. >> >> I think what Andres is suggesting is to leave it as-is for 9.4 and then >> remove the old code in 9.5 or 9.6. Which seems prudent to me. >> > How about giving trigger_file an ability to chose "fast promote" and "normal > promote" > like the triggerfile of pg_standby. > It means that if the contents of the trigger_file is empty or 'smart' then > do "normal promote", > and it's 'fast' then do "fast promote". > I think this change would be smaller than change to pg_ctl. > And this would allow us to treat ${PGDATA}/promote and trigger_file only. > (because ${PGDATA}/fast_promote is not created automatically) Indeed, this would be the way to go to have an extensible format for other promotion modes or other actions that could be triggered by a standby. So why not taking the approach suggested by Katsumata-san now? One single file to rule them all, in this case called promote, including a keyword indicating the promotion action to take. This could be controlled by pg_ctl entirely, and opens the door to extra possible modes. -- Michael
On 2013-08-08 06:40:00 +0900, Michael Paquier wrote: > On Tue, Aug 6, 2013 at 8:05 PM, Tomonari Katsumata > <t.katsumata1122@gmail.com> wrote: > > Hi, > > > > 2013/8/6 Tom Lane <tgl@sss.pgh.pa.us> > >> > >> Fujii Masao <masao.fujii@gmail.com> writes: > >> > On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> > >> > wrote: > >> >> FWIW I'd rather keep plain promotion for a release or two. TBH, I have > >> >> a > >> >> bit of trust issues regarding the new method, and I'd like to be able > >> >> to > >> >> test potential issues against a stock postgres by doing a normal > >> >> instead > >> >> of a fast promotion. > >> > >> > So we should add new option specifying the promotion mode, into pg_ctl? > >> > Currently pg_ctl cannot trigger the normal promotion. > >> > >> It would be silly to add such an option if we want to remove the old mode > >> in a release or two. > >> > >> I think what Andres is suggesting is to leave it as-is for 9.4 and then > >> remove the old code in 9.5 or 9.6. Which seems prudent to me. > >> > > How about giving trigger_file an ability to chose "fast promote" and "normal > > promote" > > like the triggerfile of pg_standby. > > It means that if the contents of the trigger_file is empty or 'smart' then > > do "normal promote", > > and it's 'fast' then do "fast promote". > > I think this change would be smaller than change to pg_ctl. > > And this would allow us to treat ${PGDATA}/promote and trigger_file only. > > (because ${PGDATA}/fast_promote is not created automatically) > Indeed, this would be the way to go to have an extensible format for > other promotion modes or other actions that could be triggered by a > standby. So why not taking the approach suggested by Katsumata-san > now? One single file to rule them all, in this case called promote, > including a keyword indicating the promotion action to take. This > could be controlled by pg_ctl entirely, and opens the door to extra > possible modes. Why are we suddenly trying to make this even more complicated? It's too late to redesign stuff without very good evidence that it's needed. Renaming trigger files and changing their format certainly doesn't seem appropriate post-beta. Let's just leave this as is, and remove the code in 9.4/9.5. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Aug 8, 2013 at 12:24 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-08-08 06:40:00 +0900, Michael Paquier wrote: >> On Tue, Aug 6, 2013 at 8:05 PM, Tomonari Katsumata >> <t.katsumata1122@gmail.com> wrote: >> > Hi, >> > >> > 2013/8/6 Tom Lane <tgl@sss.pgh.pa.us> >> >> >> >> Fujii Masao <masao.fujii@gmail.com> writes: >> >> > On Tue, Aug 6, 2013 at 11:40 AM, Andres Freund <andres@2ndquadrant.com> >> >> > wrote: >> >> >> FWIW I'd rather keep plain promotion for a release or two. TBH, I have >> >> >> a >> >> >> bit of trust issues regarding the new method, and I'd like to be able >> >> >> to >> >> >> test potential issues against a stock postgres by doing a normal >> >> >> instead >> >> >> of a fast promotion. >> >> >> >> > So we should add new option specifying the promotion mode, into pg_ctl? >> >> > Currently pg_ctl cannot trigger the normal promotion. >> >> >> >> It would be silly to add such an option if we want to remove the old mode >> >> in a release or two. >> >> >> >> I think what Andres is suggesting is to leave it as-is for 9.4 and then >> >> remove the old code in 9.5 or 9.6. Which seems prudent to me. >> >> >> > How about giving trigger_file an ability to chose "fast promote" and "normal >> > promote" >> > like the triggerfile of pg_standby. >> > It means that if the contents of the trigger_file is empty or 'smart' then >> > do "normal promote", >> > and it's 'fast' then do "fast promote". >> > I think this change would be smaller than change to pg_ctl. >> > And this would allow us to treat ${PGDATA}/promote and trigger_file only. >> > (because ${PGDATA}/fast_promote is not created automatically) >> Indeed, this would be the way to go to have an extensible format for >> other promotion modes or other actions that could be triggered by a >> standby. So why not taking the approach suggested by Katsumata-san >> now? One single file to rule them all, in this case called promote, >> including a keyword indicating the promotion action to take. This >> could be controlled by pg_ctl entirely, and opens the door to extra >> possible modes. > > Why are we suddenly trying to make this even more complicated? It's too > late to redesign stuff without very good evidence that it's > needed. Renaming trigger files and changing their format certainly > doesn't seem appropriate post-beta. > > Let's just leave this as is, and remove the code in 9.4/9.5. Sorry. I should have been clearer. I meant that for 9.4~ only. For 9.3 yes it's too late. -- Michael
On Thu, Aug 8, 2013 at 01:27:35PM +0900, Michael Paquier wrote: > > Why are we suddenly trying to make this even more complicated? It's too > > late to redesign stuff without very good evidence that it's > > needed. Renaming trigger files and changing their format certainly > > doesn't seem appropriate post-beta. > > > > Let's just leave this as is, and remove the code in 9.4/9.5. > Sorry. I should have been clearer. I meant that for 9.4~ only. For 9.3 > yes it's too late. We seem to be all over the map with the fast promotion code --- some people don't trust it, some people want an option to enable the old method, and some people want the old method removed. This has left us in an odd situation where we are going to ship 9.3 old-method code with no way to enable it in case we need it. Adding the ability to enable it in 9.4 makes no sense --- effectively, if we need the old promotion code, we are going to have to enable it in a 9.3 minor release, while if we get to 9.4 final without needing it, we can assume the fast promotion code is good and we don't need the old code. I think a prudent plan would be to remove the old promotion code just before 9.4 beta as we would know by then that the code is no longer needed. I have added this as a 9.4 open item: https://wiki.postgresql.org/wiki/PostgreSQL_9.4_Open_Items -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 2013-08-08 12:50:31 -0400, Bruce Momjian wrote: > On Thu, Aug 8, 2013 at 01:27:35PM +0900, Michael Paquier wrote: > > > Why are we suddenly trying to make this even more complicated? It's too > > > late to redesign stuff without very good evidence that it's > > > needed. Renaming trigger files and changing their format certainly > > > doesn't seem appropriate post-beta. > > > > > > Let's just leave this as is, and remove the code in 9.4/9.5. > > Sorry. I should have been clearer. I meant that for 9.4~ only. For 9.3 > > yes it's too late. > > We seem to be all over the map with the fast promotion code --- some > people don't trust it, some people want an option to enable the old > method, and some people want the old method removed. > > This has left us in an odd situation where we are going to ship 9.3 > old-method code with no way to enable it in case we need it. That's not true. It's relatively easy to trigger it. You just can't use pg_ctl. Which seems completely fine for debugging. Imo there's no need to do anything. Which is what we've concluded on before... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Bruce, all: > We seem to be all over the map with the fast promotion code --- some > people don't trust it, some people want an option to enable the old > method, and some people want the old method removed. Having read over this thread, the only reason given for retaining any ability to use "old" promotion code is because people are worried about "fast" promotion being buggy. This seems wrong. Either we have confidence is fast promotion, or we don't. If we don't have confidence, then either (a) more testing is needed, or (b) it shouldn't be the default. Again, here, we are coming up against our lack of any kind of broad replication failure testing. Of course, even if we have confidence, bugs are always possible, and leaving the old promotion code in there would make it somewhat easier to ship a 9.3.2 update which reverts the behavior. But maybe we should focus on shipping a version which is relatively bug-free instead? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-08-08 10:15:14 -0700, Josh Berkus wrote: > Bruce, all: > > > We seem to be all over the map with the fast promotion code --- some > > people don't trust it, some people want an option to enable the old > > method, and some people want the old method removed. > > Having read over this thread, the only reason given for retaining any > ability to use "old" promotion code is because people are worried about > "fast" promotion being buggy. This seems wrong. Well, it's touching one of the more complex parts of pg. > Either we have confidence is fast promotion, or we don't. If we don't > have confidence, then either (a) more testing is needed, or (b) it > shouldn't be the default. Again, here, we are coming up against our > lack of any kind of broad replication failure testing. While I think we definitely miss out there I don't think any regression suite would help much here. I am wary of unknown problems, not ones we already have tests for. The subtle ones aren't easy to test, even with a regression suite. > Of course, even if we have confidence, bugs are always possible, and > leaving the old promotion code in there would make it somewhat easier to > ship a 9.3.2 update which reverts the behavior. But maybe we should > focus on shipping a version which is relatively bug-free instead? The problem is that, especially involving HS, there's lots of subtle corner cases. And those are pretty hard to forsee and thus hard to test. Being able to tell somebody to touch some file and kill a certain process instead of pg_ctl triggering is certainly better than to have them apply complex patches which then only exhibit the old behaviour. It's not about letting people regularly use it or such. It's about being able to verify problems. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 08/08/2013 10:34 AM, Andres Freund wrote: > On 2013-08-08 10:15:14 -0700, Josh Berkus wrote: >> Either we have confidence is fast promotion, or we don't. If we don't >> have confidence, then either (a) more testing is needed, or (b) it >> shouldn't be the default. Again, here, we are coming up against our >> lack of any kind of broad replication failure testing. > > While I think we definitely miss out there I don't think any regression > suite would help much here. I am wary of unknown problems, not ones > we already have tests for. The subtle ones aren't easy to test, even > with a regression suite. Yeah, that's why we have to get beyond the mentality that regression testing is the only kind of testing. We need a destruction test for replication, and that's NOT going to be a regression test. Among other things, we'll probably need to run it on cloud hosting. > The problem is that, especially involving HS, there's lots of subtle > corner cases. And those are pretty hard to forsee and thus hard to > test. It would be useful to assemble a list of corner cases we *do* know about. This could become a test suite, and we could keep adding to it. > Being able to tell somebody to touch some file and kill a certain > process instead of pg_ctl triggering is certainly better than to have > them apply complex patches which then only exhibit the old behaviour. > It's not about letting people regularly use it or such. It's about being > able to verify problems. The problem is, if failover fails badly, the user is probably facing a corrupt database, downtime, loss of data, and restore from backup. So if we don't think that fast failover is rock-solid trustworthy --- or at least as trustworthy as slow failover was -- then we should be making it a non-default option for 9.3. We shouldn't be exposing people who don't need fast failover to new risks without their knowledge. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-08-08 10:51:45 -0700, Josh Berkus wrote: > On 08/08/2013 10:34 AM, Andres Freund wrote: > > On 2013-08-08 10:15:14 -0700, Josh Berkus wrote: > >> Either we have confidence is fast promotion, or we don't. If we don't > >> have confidence, then either (a) more testing is needed, or (b) it > >> shouldn't be the default. Again, here, we are coming up against our > >> lack of any kind of broad replication failure testing. > > > > While I think we definitely miss out there I don't think any regression > > suite would help much here. I am wary of unknown problems, not ones > > we already have tests for. The subtle ones aren't easy to test, even > > with a regression suite. > > Yeah, that's why we have to get beyond the mentality that regression > testing is the only kind of testing. We need a destruction test for > replication, and that's NOT going to be a regression test. Among other > things, we'll probably need to run it on cloud hosting. The point is, that will still mostly produce scenarios we know of. > > The problem is that, especially involving HS, there's lots of subtle > > corner cases. And those are pretty hard to forsee and thus hard to > > test. > > It would be useful to assemble a list of corner cases we *do* know > about. This could become a test suite, and we could keep adding to it. The first thing would be to build the infrastructure for HS testing. Unfortunately I don't have the time/energy for that atm. Unless somebody steps up, this won't happen :( > > Being able to tell somebody to touch some file and kill a certain > > process instead of pg_ctl triggering is certainly better than to have > > them apply complex patches which then only exhibit the old behaviour. > > It's not about letting people regularly use it or such. It's about being > > able to verify problems. > > The problem is, if failover fails badly, the user is probably facing a > corrupt database, downtime, loss of data, and restore from backup. So > if we don't think that fast failover is rock-solid trustworthy --- or at > least as trustworthy as slow failover was -- then we should be making it > a non-default option for 9.3. We shouldn't be exposing people who don't > need fast failover to new risks without their knowledge. I don't think anybody working on related areas of the code thinks it's rock solid. But anyway, I just don't see the downside of allowing problem analysis. You're free to do more testing, review, whatever before the release. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 08/08/2013 11:01 AM, Andres Freund wrote: > I don't think anybody working on related areas of the code thinks it's > rock solid. > But anyway, I just don't see the downside of allowing problem > analysis. You're free to do more testing, review, whatever before the > release. I'm 100% with you that we ought to keep the slow failover code around and accessible to debugging. What I'm asking is whether it should still be explicitly available to users, and the default. Based on your feedback, it's sounding like it should be. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Hi, I understood it's too late to change the feature. I hope it will be revised in 9.4! (2013/08/09 4:13), Josh Berkus wrote:> On 08/08/2013 11:01 AM, Andres Freund wrote:>>> I don't think anybody working on relatedareas of the code thinks it's>> rock solid.>> But anyway, I just don't see the downside of allowing problem>> analysis.You're free to do more testing, review, whatever before the>> release.>> I'm 100% with you that we ought to keepthe slow failover code around> and accessible to debugging. What I'm asking is whether it should still> be explicitlyavailable to users, and the default. Based on your> feedback, it's sounding like it should be.> In my opinion, the default behavior "fast promote" is ok. Because we don't have any explicit problem with it for now. And we shouldn't mention about "normal promote" in document. I think it will make user confused if do so. By the way, I'm thinking about when these trigger files(*) are unlinked. (*)Now I treat these three files 1) a file specified in trigger_file 2) ${PGDATA}/promote 3) ${PGDATA}/fast_promote Current source has a problem in some corner cases. It occurs by the timing of detecting these files. for example: ------ case1: createing 1) and executing "pg_ctl promote" occur at the same time. case2: creating 1) and promoting at recovery-end(by recovery_target_xxx) occur at the same time. ------ 1) is created, but promoting completes by another trigger. Both cases 1) remains on the server. If user doesn't know it and make a standby on the server, the standby will promote soon. I think this is not so big problem, but not user-friendly. Against this, I'm thinking unlinking these files before starting recovery. This should be fixed in 9.4 too? --------- Tomonari Katsumata
On 08.08.2013 20:15, Josh Berkus wrote: > Bruce, all: > >> We seem to be all over the map with the fast promotion code --- some >> people don't trust it, some people want an option to enable the old >> method, and some people want the old method removed. > > Having read over this thread, the only reason given for retaining any > ability to use "old" promotion code is because people are worried about > "fast" promotion being buggy. This seems wrong. > > Either we have confidence is fast promotion, or we don't. If we don't > have confidence, then either (a) more testing is needed, or (b) it > shouldn't be the default. Again, here, we are coming up against our > lack of any kind of broad replication failure testing. Well, I don't see much harm in keeping the old behavior as an undocumented escape hatch, as it is now. The way I'd phrase the current situation is this: 9.3 now always does "fast promotion". However, for debugging and testing purposes, you can still trigger the old behavior by manually creating a file in $PGDATA. That should never be necessary in the field, however. There's one thing that irks me with the current situation, however: if you use 9.2 version of pg_ctl against a 9.3 server, it will inadvertently trigger slow promotion, because it creates the "promote" file. Since fast mode is the default, and not only the default but the only documented mode, it's confusing if you can accidentally trigger the old behavior like that. And it's even worse if you use 9.3 pg_ctl against a 9.2 server: it will create a filed called "fast_promote" and return success, but it won't actually do anything. I think "promote" file should trigger the fast promotion, and the filename to trigger the slow mode should be called "fallback_promote" or "safe_promote" or something like that. There wasn't any good reason to change the filename primarily used. It might even break people's scripts for no good reason, if people are creating the $PGDATA/promote file themselves without using pg_ctl. (I raised this back in April, but Simon argued strongly for the current situation. I never understood why. http://www.postgresql.org/message-id/517798AE.30203@vmware.com) - Heikki
On Mon, Aug 19, 2013 at 4:20 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Well, I don't see much harm in keeping the old behavior as an undocumented > escape hatch, as it is now. The way I'd phrase the current situation is > this: 9.3 now always does "fast promotion". However, for debugging and > testing purposes, you can still trigger the old behavior by manually > creating a file in $PGDATA. That should never be necessary in the field, > however. > > There's one thing that irks me with the current situation, however: if you > use 9.2 version of pg_ctl against a 9.3 server, it will inadvertently > trigger slow promotion, because it creates the "promote" file. Since fast > mode is the default, and not only the default but the only documented mode, > it's confusing if you can accidentally trigger the old behavior like that. > > And it's even worse if you use 9.3 pg_ctl against a 9.2 server: it will > create a filed called "fast_promote" and return success, but it won't > actually do anything. > > I think "promote" file should trigger the fast promotion, and the filename > to trigger the slow mode should be called "fallback_promote" or > "safe_promote" or something like that. There wasn't any good reason to > change the filename primarily used. It might even break people's scripts for > no good reason, if people are creating the $PGDATA/promote file themselves > without using pg_ctl. +1. ...Robert
Robert Haas <robertmhaas@gmail.com> wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> I think "promote" file should trigger the fast promotion, and >> the filename to trigger the slow mode should be called >> "fallback_promote" or "safe_promote" or something like that. >> There wasn't any good reason to change the filename primarily >> used. It might even break people's scripts for no good reason, >> if people are creating the $PGDATA/promote file themselves >> without using pg_ctl. > > +1. +1 Changing it has risk with no associated benefit that I can see. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 19, 2013 at 11:20:42AM +0300, Heikki Linnakangas wrote: > And it's even worse if you use 9.3 pg_ctl against a 9.2 server: it > will create a filed called "fast_promote" and return success, but it > won't actually do anything. > > I think "promote" file should trigger the fast promotion, and the > filename to trigger the slow mode should be called > "fallback_promote" or "safe_promote" or something like that. There > wasn't any good reason to change the filename primarily used. It > might even break people's scripts for no good reason, if people are > creating the $PGDATA/promote file themselves without using pg_ctl. > > (I raised this back in April, but Simon argued strongly for the > current situation. I never understood why. > http://www.postgresql.org/message-id/517798AE.30203@vmware.com) +1 -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes: > On Mon, Aug 19, 2013 at 11:20:42AM +0300, Heikki Linnakangas wrote: >> I think "promote" file should trigger the fast promotion, and the >> filename to trigger the slow mode should be called >> "fallback_promote" or "safe_promote" or something like that. There >> wasn't any good reason to change the filename primarily used. It >> might even break people's scripts for no good reason, if people are >> creating the $PGDATA/promote file themselves without using pg_ctl. >> >> (I raised this back in April, but Simon argued strongly for the >> current situation. I never understood why. >> http://www.postgresql.org/message-id/517798AE.30203@vmware.com) > +1 If we're going to change this in 9.3, it needs to happen *now*, as in the next couple hours, because I plan to wrap rc1 this afternoon. Please stop discussing and commit something. regards, tom lane
On Mon, Aug 19, 2013 at 01:27:29PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Mon, Aug 19, 2013 at 11:20:42AM +0300, Heikki Linnakangas wrote: > >> I think "promote" file should trigger the fast promotion, and the > >> filename to trigger the slow mode should be called > >> "fallback_promote" or "safe_promote" or something like that. There > >> wasn't any good reason to change the filename primarily used. It > >> might even break people's scripts for no good reason, if people are > >> creating the $PGDATA/promote file themselves without using pg_ctl. > >> > >> (I raised this back in April, but Simon argued strongly for the > >> current situation. I never understood why. > >> http://www.postgresql.org/message-id/517798AE.30203@vmware.com) > > > +1 > > If we're going to change this in 9.3, it needs to happen *now*, as in > the next couple hours, because I plan to wrap rc1 this afternoon. > Please stop discussing and commit something. FYI, Tom, I think Heikki is working on it now. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 19.08.2013 20:27, Tom Lane wrote: > Bruce Momjian<bruce@momjian.us> writes: >> On Mon, Aug 19, 2013 at 11:20:42AM +0300, Heikki Linnakangas wrote: >>> I think "promote" file should trigger the fast promotion, and the >>> filename to trigger the slow mode should be called >>> "fallback_promote" or "safe_promote" or something like that. There >>> wasn't any good reason to change the filename primarily used. It >>> might even break people's scripts for no good reason, if people are >>> creating the $PGDATA/promote file themselves without using pg_ctl. >>> >>> (I raised this back in April, but Simon argued strongly for the >>> current situation. I never understood why. >>> http://www.postgresql.org/message-id/517798AE.30203@vmware.com) > >> +1 > > If we're going to change this in 9.3, it needs to happen *now*, as in > the next couple hours, because I plan to wrap rc1 this afternoon. > Please stop discussing and commit something. Ok, committed. The promote trigger file is now called "promote", like it was in previous versions. For the non-fast promotion, create a file called "fallback_promote". - Heikki
On 19 August 2013 09:20, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 08.08.2013 20:15, Josh Berkus wrote: >> >> Bruce, all: >> >>> We seem to be all over the map with the fast promotion code --- some >>> people don't trust it, some people want an option to enable the old >>> method, and some people want the old method removed. >> >> >> Having read over this thread, the only reason given for retaining any >> ability to use "old" promotion code is because people are worried about >> "fast" promotion being buggy. This seems wrong. >> >> Either we have confidence is fast promotion, or we don't. If we don't >> have confidence, then either (a) more testing is needed, or (b) it >> shouldn't be the default. Again, here, we are coming up against our >> lack of any kind of broad replication failure testing. > > > Well, I don't see much harm in keeping the old behavior as an undocumented > escape hatch, as it is now. The way I'd phrase the current situation is > this: 9.3 now always does "fast promotion". However, for debugging and > testing purposes, you can still trigger the old behavior by manually > creating a file in $PGDATA. That should never be necessary in the field, > however. +1, again. I have removed this item from the 9.4 open items list, since this issue has already been resolved. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services