Обсуждение: Re: BUG #15346: Replica fails to start after the crash

Поиск
Список
Период
Сортировка

Re: BUG #15346: Replica fails to start after the crash

От
Alexander Kukushkin
Дата:
Hello hackers!

It seems bgwriter running on the replicas is broken in the commit
8d68ee6 and as a result bgwriter never updates minRecoveryPoint in the
pg_control.Please see a detailed explanation below.

2018-08-29 22:54 GMT+02:00 Michael Paquier <michael@paquier.xyz>:

> This is not a solution in my opinion, as you could invalidate activities
> of backends connected to the database when the incorrect consistent
> point allows connections to come in too early.

That true, but I still think it is better than aborting startup process...

> What happens if you replay with hot_standby = on up to the latest point,
> without any concurrent connections, then issue a checkpoint on the
> standby once you got to a point newer than the complain, and finally
> restart the standby with the bgworker?
>
> Another idea I have would be to make the standby promote, issue a
> checkpoint on it, and then use pg_rewind as a trick to update the
> control file to a point newer than the inconsistency.  As PG < 9.6.10
> could make the minimum recovery point go backwards, applying the upgrade
> after the consistent point got to an incorrect state would trigger the
> failure.

Well, all these actions probably help to relife symptoms and replay
WAL up to the point when it becomes really consistent.

I was really long trying to figure out how it could happen that some
of the pages were written on disk, but pg_control wasn't updated, And
I think I managed to put all pieces of the puzzle into a nice picture:

static void
UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
{
        /* Quick check using our local copy of the variable */
        if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
                return;

        /*
         * An invalid minRecoveryPoint means that we need to recover
all the WAL,
         * i.e., we're doing crash recovery.  We never modify the control file's
         * value in that case, so we can short-circuit future checks
here too. The
         * local values of minRecoveryPoint and minRecoveryPointTLI
should not be
         * updated until crash recovery finishes.
         */
        if (XLogRecPtrIsInvalid(minRecoveryPoint))
        {
                updateMinRecoveryPoint = false;
                return;
        }

This code was changed in the commit 8d68ee6 and it broke bgwriter. Now
bgwriter never updates pg_control when flushes dirty pages to disk.
How it happens:
When bgwriter starts,  minRecoveryPoint is not initialized and if I
attach with gdb, it shows that value of minRecoveryPoint = 0,
therefore it is Invalid.
As a result, updateMinRecoveryPoint is set to false and on the next
call of UpdateMinRecoveryPoint from bgwriter it returns from the
function after the very first if.
Bgwriter itself never changes updateMinRecoveryPoint to true and boom,
we can get a lot of pages written to disk, but minRecoveryPoint in the
pg_control won't be updated!

If the replica happened to crash in such conditions it reaches a
consistency much earlier than it should!

Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 10:55:23AM +0200, Alexander Kukushkin wrote:
> Bgwriter itself never changes updateMinRecoveryPoint to true and boom,
> we can get a lot of pages written to disk, but minRecoveryPoint in the
> pg_control won't be updated!

That's indeed obvious by reading the code.  The bgwriter would be
started only once a consistent point has been reached, so the startup
process would have normally already updated the control file to the
consistent point.  Something like the attached should take care of the
problem.  As the updates of the local copy of minRecoveryPoint strongly
rely on if the startup process is used, I think that we should use
InRecovery for the sanity checks.

I'd like to also add a TAP test for that, which should be easy enough if
we do sanity checks by looking up at the output of the control file.
I'll try to put more thoughts on that.

Does it take care of the problem?
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 10:55:23AM +0200, Alexander Kukushkin wrote:
> Bgwriter itself never changes updateMinRecoveryPoint to true and boom,
> we can get a lot of pages written to disk, but minRecoveryPoint in the
> pg_control won't be updated!

That's indeed obvious by reading the code.  The bgwriter would be
started only once a consistent point has been reached, so the startup
process would have normally already updated the control file to the
consistent point.  Something like the attached should take care of the
problem.  As the updates of the local copy of minRecoveryPoint strongly
rely on if the startup process is used, I think that we should use
InRecovery for the sanity checks.

I'd like to also add a TAP test for that, which should be easy enough if
we do sanity checks by looking up at the output of the control file.
I'll try to put more thoughts on that.

Does it take care of the problem?
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Alexander Kukushkin
Дата:
Hi,

2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:

> That's indeed obvious by reading the code.  The bgwriter would be
> started only once a consistent point has been reached, so the startup
> process would have normally already updated the control file to the
> consistent point.  Something like the attached should take care of the
> problem.  As the updates of the local copy of minRecoveryPoint strongly
> rely on if the startup process is used, I think that we should use
> InRecovery for the sanity checks.
>
> I'd like to also add a TAP test for that, which should be easy enough if
> we do sanity checks by looking up at the output of the control file.
> I'll try to put more thoughts on that.
>
> Does it take care of the problem?

Yep, with the patch applied bgwriter acts as expected!

Breakpoint 1, UpdateControlFile () at xlog.c:4536
4536            INIT_CRC32C(ControlFile->crc);
(gdb) bt
#0  UpdateControlFile () at xlog.c:4536
#1  0x00005646d071ddb2 in UpdateMinRecoveryPoint (lsn=26341965784,
force=0 '\000') at xlog.c:2597
#2  0x00005646d071de65 in XLogFlush (record=26341965784) at xlog.c:2632
#3  0x00005646d09d693a in FlushBuffer (buf=0x7f8e1ca523c0,
reln=0x5646d2e86028) at bufmgr.c:2729
#4  0x00005646d09d63d6 in SyncOneBuffer (buf_id=99693,
skip_recently_used=1 '\001', wb_context=0x7ffd07757380) at
bufmgr.c:2394
#5  0x00005646d09d6172 in BgBufferSync (wb_context=0x7ffd07757380) at
bufmgr.c:2270
#6  0x00005646d097c266 in BackgroundWriterMain () at bgwriter.c:279
#7  0x00005646d073b38c in AuxiliaryProcessMain (argc=2,
argv=0x7ffd07758840) at bootstrap.c:424
#8  0x00005646d098dc4a in StartChildProcess (type=BgWriterProcess) at
postmaster.c:5300
#9  0x00005646d098d672 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:4999
#10 <signal handler called>
#11 0x00007f8e5f5a6ff7 in __GI___select (nfds=5,
readfds=0x7ffd07759060, writefds=0x0, exceptfds=0x0,
timeout=0x7ffd07758fd0) at ../sysdeps/unix/sysv/linux/select.c:41
#12 0x00005646d09890ca in ServerLoop () at postmaster.c:1685
#13 0x00005646d0988799 in PostmasterMain (argc=17,
argv=0x5646d2e53390) at postmaster.c:1329
#14 0x00005646d08d2880 in main (argc=17, argv=0x5646d2e53390) at main.c:228



Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

От
Alexander Kukushkin
Дата:
Hi,

2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:

> That's indeed obvious by reading the code.  The bgwriter would be
> started only once a consistent point has been reached, so the startup
> process would have normally already updated the control file to the
> consistent point.  Something like the attached should take care of the
> problem.  As the updates of the local copy of minRecoveryPoint strongly
> rely on if the startup process is used, I think that we should use
> InRecovery for the sanity checks.
>
> I'd like to also add a TAP test for that, which should be easy enough if
> we do sanity checks by looking up at the output of the control file.
> I'll try to put more thoughts on that.
>
> Does it take care of the problem?

Yep, with the patch applied bgwriter acts as expected!

Breakpoint 1, UpdateControlFile () at xlog.c:4536
4536            INIT_CRC32C(ControlFile->crc);
(gdb) bt
#0  UpdateControlFile () at xlog.c:4536
#1  0x00005646d071ddb2 in UpdateMinRecoveryPoint (lsn=26341965784,
force=0 '\000') at xlog.c:2597
#2  0x00005646d071de65 in XLogFlush (record=26341965784) at xlog.c:2632
#3  0x00005646d09d693a in FlushBuffer (buf=0x7f8e1ca523c0,
reln=0x5646d2e86028) at bufmgr.c:2729
#4  0x00005646d09d63d6 in SyncOneBuffer (buf_id=99693,
skip_recently_used=1 '\001', wb_context=0x7ffd07757380) at
bufmgr.c:2394
#5  0x00005646d09d6172 in BgBufferSync (wb_context=0x7ffd07757380) at
bufmgr.c:2270
#6  0x00005646d097c266 in BackgroundWriterMain () at bgwriter.c:279
#7  0x00005646d073b38c in AuxiliaryProcessMain (argc=2,
argv=0x7ffd07758840) at bootstrap.c:424
#8  0x00005646d098dc4a in StartChildProcess (type=BgWriterProcess) at
postmaster.c:5300
#9  0x00005646d098d672 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:4999
#10 <signal handler called>
#11 0x00007f8e5f5a6ff7 in __GI___select (nfds=5,
readfds=0x7ffd07759060, writefds=0x0, exceptfds=0x0,
timeout=0x7ffd07758fd0) at ../sysdeps/unix/sysv/linux/select.c:41
#12 0x00005646d09890ca in ServerLoop () at postmaster.c:1685
#13 0x00005646d0988799 in PostmasterMain (argc=17,
argv=0x5646d2e53390) at postmaster.c:1329
#14 0x00005646d08d2880 in main (argc=17, argv=0x5646d2e53390) at main.c:228



Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 04:03:43PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
>> Does it take care of the problem?
>
> Yep, with the patch applied bgwriter acts as expected!

Thanks for double-checking.

I have been struggling for a couple of hours to get a deterministic test
case out of my pocket, and I did not get one as you would need to get
the bgwriter to flush a page before crash recovery finishes, we could do
that easily with a dedicated bgworker, now pg_ctl start expects the
standby to finish recovery before, which makes it harder to trigger all
the time, particularly for slow machines .  Anyway, I did more code
review and I think that I found another issue with XLogNeedsFlush(),
which could enforce updateMinRecoveryPoint to false if called before
XLogFlush during crash recovery from another process than the startup
process, so if it got called before XLogFlush() we'd still have the same
issue for a process doing both operations.  Hence, I have come up with
the attached, which actually brings back the code to what it was before
8d68ee6 for those routines, except that we have fast-exit paths for the
startup process so as it is still able to replay all WAL available and
avoid page reference issues post-promotion, deciding when to update its
own copy of minRecoveryPoint when it finishes crash recovery.  This also
saves from a couple of locks on the control file from the startup
process.

If you apply the patch and try it on your standby, are you able to get
things up and working?
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 04:03:43PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
>> Does it take care of the problem?
>
> Yep, with the patch applied bgwriter acts as expected!

Thanks for double-checking.

I have been struggling for a couple of hours to get a deterministic test
case out of my pocket, and I did not get one as you would need to get
the bgwriter to flush a page before crash recovery finishes, we could do
that easily with a dedicated bgworker, now pg_ctl start expects the
standby to finish recovery before, which makes it harder to trigger all
the time, particularly for slow machines .  Anyway, I did more code
review and I think that I found another issue with XLogNeedsFlush(),
which could enforce updateMinRecoveryPoint to false if called before
XLogFlush during crash recovery from another process than the startup
process, so if it got called before XLogFlush() we'd still have the same
issue for a process doing both operations.  Hence, I have come up with
the attached, which actually brings back the code to what it was before
8d68ee6 for those routines, except that we have fast-exit paths for the
startup process so as it is still able to replay all WAL available and
avoid page reference issues post-promotion, deciding when to update its
own copy of minRecoveryPoint when it finishes crash recovery.  This also
saves from a couple of locks on the control file from the startup
process.

If you apply the patch and try it on your standby, are you able to get
things up and working?
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Alexander Kukushkin
Дата:
2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> I have been struggling for a couple of hours to get a deterministic test
> case out of my pocket, and I did not get one as you would need to get
> the bgwriter to flush a page before crash recovery finishes, we could do

In my case the active standby server has crashed, it wasn't in the
crash recovery mode.

> the time, particularly for slow machines .  Anyway, I did more code
> review and I think that I found another issue with XLogNeedsFlush(),
> which could enforce updateMinRecoveryPoint to false if called before
> XLogFlush during crash recovery from another process than the startup
> process, so if it got called before XLogFlush() we'd still have the same
> issue for a process doing both operations.  Hence, I have come up with

At least XLogNeedsFlush() is called just from a couple of places and
doesn't break bgwriter, but anyway, thanks for finding it.

> the attached, which actually brings back the code to what it was before
> 8d68ee6 for those routines, except that we have fast-exit paths for the
> startup process so as it is still able to replay all WAL available and
> avoid page reference issues post-promotion, deciding when to update its
> own copy of minRecoveryPoint when it finishes crash recovery.  This also
> saves from a couple of locks on the control file from the startup
> process.

Sound good.

>
> If you apply the patch and try it on your standby, are you able to get
> things up and working?

Nope, minRecoveryPoint in pg_control is still wrong and therefore
startup still aborts on the same place if there are connections open.
I think there is no way to fix it other than let it replay sufficient
amount of WAL without open connections.
Just juddging from the timestamps of WAL files in the pg_xlog it is
obvious that a few moments before the hardware crashed postgres was
replaying 0000000500000AB300000057, because the next file has smaller
mtime (it was recycled).
-rw-------  1 akukushkin akukushkin 16777216 Aug 22 07:22
0000000500000AB300000057
-rw-------  1 akukushkin akukushkin 16777216 Aug 22 07:01
0000000500000AB300000058

Minimum recovery ending location is AB3/4A1B3118, but at the same time
I managed to find pages from 0000000500000AB300000053 on disk (at
least in the index files). That could only mean that bgwriter was
flushing dirty pages, but pg_control wasn't properly updated and it
happened not during recovery after hardware crash, but while the
postgres was running before the hardware crash.

The only possible way to recover such standby - cut off all possible
connections and let it replay all WAL files it managed to write to
disk before the first crash.

Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

От
Alexander Kukushkin
Дата:
2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> I have been struggling for a couple of hours to get a deterministic test
> case out of my pocket, and I did not get one as you would need to get
> the bgwriter to flush a page before crash recovery finishes, we could do

In my case the active standby server has crashed, it wasn't in the
crash recovery mode.

> the time, particularly for slow machines .  Anyway, I did more code
> review and I think that I found another issue with XLogNeedsFlush(),
> which could enforce updateMinRecoveryPoint to false if called before
> XLogFlush during crash recovery from another process than the startup
> process, so if it got called before XLogFlush() we'd still have the same
> issue for a process doing both operations.  Hence, I have come up with

At least XLogNeedsFlush() is called just from a couple of places and
doesn't break bgwriter, but anyway, thanks for finding it.

> the attached, which actually brings back the code to what it was before
> 8d68ee6 for those routines, except that we have fast-exit paths for the
> startup process so as it is still able to replay all WAL available and
> avoid page reference issues post-promotion, deciding when to update its
> own copy of minRecoveryPoint when it finishes crash recovery.  This also
> saves from a couple of locks on the control file from the startup
> process.

Sound good.

>
> If you apply the patch and try it on your standby, are you able to get
> things up and working?

Nope, minRecoveryPoint in pg_control is still wrong and therefore
startup still aborts on the same place if there are connections open.
I think there is no way to fix it other than let it replay sufficient
amount of WAL without open connections.
Just juddging from the timestamps of WAL files in the pg_xlog it is
obvious that a few moments before the hardware crashed postgres was
replaying 0000000500000AB300000057, because the next file has smaller
mtime (it was recycled).
-rw-------  1 akukushkin akukushkin 16777216 Aug 22 07:22
0000000500000AB300000057
-rw-------  1 akukushkin akukushkin 16777216 Aug 22 07:01
0000000500000AB300000058

Minimum recovery ending location is AB3/4A1B3118, but at the same time
I managed to find pages from 0000000500000AB300000053 on disk (at
least in the index files). That could only mean that bgwriter was
flushing dirty pages, but pg_control wasn't properly updated and it
happened not during recovery after hardware crash, but while the
postgres was running before the hardware crash.

The only possible way to recover such standby - cut off all possible
connections and let it replay all WAL files it managed to write to
disk before the first crash.

Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
>> I have been struggling for a couple of hours to get a deterministic test
>> case out of my pocket, and I did not get one as you would need to get
>> the bgwriter to flush a page before crash recovery finishes, we could do
>
> In my case the active standby server has crashed, it wasn't in the
> crash recovery mode.

That's what I meant, a standby crashed and then restarted, doing crash
recovery before moving on with archive recovery once it was done with
all its local WAL.

> Minimum recovery ending location is AB3/4A1B3118, but at the same time
> I managed to find pages from 0000000500000AB300000053 on disk (at
> least in the index files). That could only mean that bgwriter was
> flushing dirty pages, but pg_control wasn't properly updated and it
> happened not during recovery after hardware crash, but while the
> postgres was running before the hardware crash.

Exactly, that would explain the incorrect reference.

> The only possible way to recover such standby - cut off all possible
> connections and let it replay all WAL files it managed to write to
> disk before the first crash.

Yeah...  I am going to apply the patch after another lookup, that will
fix the problem moving forward.  Thanks for checking by the way.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
>> I have been struggling for a couple of hours to get a deterministic test
>> case out of my pocket, and I did not get one as you would need to get
>> the bgwriter to flush a page before crash recovery finishes, we could do
>
> In my case the active standby server has crashed, it wasn't in the
> crash recovery mode.

That's what I meant, a standby crashed and then restarted, doing crash
recovery before moving on with archive recovery once it was done with
all its local WAL.

> Minimum recovery ending location is AB3/4A1B3118, but at the same time
> I managed to find pages from 0000000500000AB300000053 on disk (at
> least in the index files). That could only mean that bgwriter was
> flushing dirty pages, but pg_control wasn't properly updated and it
> happened not during recovery after hardware crash, but while the
> postgres was running before the hardware crash.

Exactly, that would explain the incorrect reference.

> The only possible way to recover such standby - cut off all possible
> connections and let it replay all WAL files it managed to write to
> disk before the first crash.

Yeah...  I am going to apply the patch after another lookup, that will
fix the problem moving forward.  Thanks for checking by the way.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Kyotaro HORIGUCHI
Дата:
At Thu, 30 Aug 2018 11:57:05 -0700, Michael Paquier <michael@paquier.xyz> wrote in
<20180830185705.GF15446@paquier.xyz>
> On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> > 2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> >> I have been struggling for a couple of hours to get a deterministic test
> >> case out of my pocket, and I did not get one as you would need to get
> >> the bgwriter to flush a page before crash recovery finishes, we could do
> > 
> > In my case the active standby server has crashed, it wasn't in the
> > crash recovery mode.
> 
> That's what I meant, a standby crashed and then restarted, doing crash
> recovery before moving on with archive recovery once it was done with
> all its local WAL.
> 
> > Minimum recovery ending location is AB3/4A1B3118, but at the same time
> > I managed to find pages from 0000000500000AB300000053 on disk (at
> > least in the index files). That could only mean that bgwriter was
> > flushing dirty pages, but pg_control wasn't properly updated and it
> > happened not during recovery after hardware crash, but while the
> > postgres was running before the hardware crash.
> 
> Exactly, that would explain the incorrect reference.
> 
> > The only possible way to recover such standby - cut off all possible
> > connections and let it replay all WAL files it managed to write to
> > disk before the first crash.
> 
> Yeah...  I am going to apply the patch after another lookup, that will
> fix the problem moving forward.  Thanks for checking by the way.

Please wait a bit.. I have a concern about this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #15346: Replica fails to start after the crash

От
Kyotaro HORIGUCHI
Дата:
At Thu, 30 Aug 2018 11:57:05 -0700, Michael Paquier <michael@paquier.xyz> wrote in
<20180830185705.GF15446@paquier.xyz>
> On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> > 2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> >> I have been struggling for a couple of hours to get a deterministic test
> >> case out of my pocket, and I did not get one as you would need to get
> >> the bgwriter to flush a page before crash recovery finishes, we could do
> > 
> > In my case the active standby server has crashed, it wasn't in the
> > crash recovery mode.
> 
> That's what I meant, a standby crashed and then restarted, doing crash
> recovery before moving on with archive recovery once it was done with
> all its local WAL.
> 
> > Minimum recovery ending location is AB3/4A1B3118, but at the same time
> > I managed to find pages from 0000000500000AB300000053 on disk (at
> > least in the index files). That could only mean that bgwriter was
> > flushing dirty pages, but pg_control wasn't properly updated and it
> > happened not during recovery after hardware crash, but while the
> > postgres was running before the hardware crash.
> 
> Exactly, that would explain the incorrect reference.
> 
> > The only possible way to recover such standby - cut off all possible
> > connections and let it replay all WAL files it managed to write to
> > disk before the first crash.
> 
> Yeah...  I am going to apply the patch after another lookup, that will
> fix the problem moving forward.  Thanks for checking by the way.

Please wait a bit.. I have a concern about this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> Please wait a bit.. I have a concern about this.

Sure, please feel free.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> Please wait a bit.. I have a concern about this.

Sure, please feel free.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Kyotaro HORIGUCHI
Дата:
At Thu, 30 Aug 2018 18:48:55 -0700, Michael Paquier <michael@paquier.xyz> wrote in
<20180831014855.GJ15446@paquier.xyz>
> On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> > Please wait a bit.. I have a concern about this.
> 
> Sure, please feel free.

Thanks.

I looked though the patch and related code and have a concern.

The patch inhibits turning off updateMinRecoveryPoint on other
than the startup process running crash-recovery except at the end
of XLogNeedsFlush.

>     /*
>      * Check minRecoveryPoint for any other process than the startup
>      * process doing crash recovery, which should not update the control
>      * file value if crash recovery is still running.
>      */
>     if (XLogRecPtrIsInvalid(minRecoveryPoint))
>       updateMinRecoveryPoint = false;

Even if any other processes than the startup calls the function
during crash recovery, they don't have a means to turn on
updateMinRecoveryPoint again. Actually the only process that
calls the function druing crash recovery is the startup. bwriter
and checkpointer doesn't. Hot-standby backends come after
crash-recvery ends. After all, we just won't have an invalid
minRecoveryPoint value there, and updateMinRecoverypoint must be
true.

Other than that I didn't find a problem. Please find the attached
patch. I also have not come up with how to check the case
automatically..


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..74692a128d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
      * i.e., we're doing crash recovery.  We never modify the control file's
      * value in that case, so we can short-circuit future checks here too. The
      * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-     * updated until crash recovery finishes.
+     * updated until crash recovery finishes.  We only do this for the startup
+     * process as it should not update its own reference of minRecoveryPoint
+     * until it has finished crash recovery to make sure that all WAL
+     * available is replayed in this case.  This also saves from extra locks
+     * taken on the control file from the startup process.
      */
-    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+    if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
     {
         updateMinRecoveryPoint = false;
         return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-    if (force || minRecoveryPoint < lsn)
+    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+        updateMinRecoveryPoint = false;
+    else if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr    newMinRecoveryPoint;
         TimeLineID    newMinRecoveryPointTLI;
@@ -3124,12 +3130,15 @@ XLogNeedsFlush(XLogRecPtr record)
     if (RecoveryInProgress())
     {
         /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery.  We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
+         * An invalid minRecoveryPoint on the startup process means that we
+         * need to recover all the WAL, i.e., we're doing crash recovery.  We
+         * never modify the control file's value in that case, so we can
+         * short-circuit future checks here too.  This triggers a quick exit
+         * path for the startup process, which cannot update its local copy of
+         * minRecoveryPoint as long as it has not replayed all WAL available
+         * when doing crash recovery.
          */
-        if (XLogRecPtrIsInvalid(minRecoveryPoint))
+        if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
             updateMinRecoveryPoint = false;
 
         /* Quick exit if already known to be updated or cannot be updated */
@@ -3146,6 +3155,12 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);
 
+        /*
+         * No other process than the startup doesn't reach here before crash
+         * recovery ends. So minRecoveryPoint must have a valid value here.
+         */
+        Assert(minRecoveryPoint != InvalidXLogRecPtr);
+
         /* check again */
         return record > minRecoveryPoint;
     }

Re: BUG #15346: Replica fails to start after the crash

От
Kyotaro HORIGUCHI
Дата:
At Thu, 30 Aug 2018 18:48:55 -0700, Michael Paquier <michael@paquier.xyz> wrote in
<20180831014855.GJ15446@paquier.xyz>
> On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> > Please wait a bit.. I have a concern about this.
> 
> Sure, please feel free.

Thanks.

I looked though the patch and related code and have a concern.

The patch inhibits turning off updateMinRecoveryPoint on other
than the startup process running crash-recovery except at the end
of XLogNeedsFlush.

>     /*
>      * Check minRecoveryPoint for any other process than the startup
>      * process doing crash recovery, which should not update the control
>      * file value if crash recovery is still running.
>      */
>     if (XLogRecPtrIsInvalid(minRecoveryPoint))
>       updateMinRecoveryPoint = false;

Even if any other processes than the startup calls the function
during crash recovery, they don't have a means to turn on
updateMinRecoveryPoint again. Actually the only process that
calls the function druing crash recovery is the startup. bwriter
and checkpointer doesn't. Hot-standby backends come after
crash-recvery ends. After all, we just won't have an invalid
minRecoveryPoint value there, and updateMinRecoverypoint must be
true.

Other than that I didn't find a problem. Please find the attached
patch. I also have not come up with how to check the case
automatically..


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..74692a128d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
      * i.e., we're doing crash recovery.  We never modify the control file's
      * value in that case, so we can short-circuit future checks here too. The
      * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-     * updated until crash recovery finishes.
+     * updated until crash recovery finishes.  We only do this for the startup
+     * process as it should not update its own reference of minRecoveryPoint
+     * until it has finished crash recovery to make sure that all WAL
+     * available is replayed in this case.  This also saves from extra locks
+     * taken on the control file from the startup process.
      */
-    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+    if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
     {
         updateMinRecoveryPoint = false;
         return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-    if (force || minRecoveryPoint < lsn)
+    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+        updateMinRecoveryPoint = false;
+    else if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr    newMinRecoveryPoint;
         TimeLineID    newMinRecoveryPointTLI;
@@ -3124,12 +3130,15 @@ XLogNeedsFlush(XLogRecPtr record)
     if (RecoveryInProgress())
     {
         /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery.  We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
+         * An invalid minRecoveryPoint on the startup process means that we
+         * need to recover all the WAL, i.e., we're doing crash recovery.  We
+         * never modify the control file's value in that case, so we can
+         * short-circuit future checks here too.  This triggers a quick exit
+         * path for the startup process, which cannot update its local copy of
+         * minRecoveryPoint as long as it has not replayed all WAL available
+         * when doing crash recovery.
          */
-        if (XLogRecPtrIsInvalid(minRecoveryPoint))
+        if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
             updateMinRecoveryPoint = false;
 
         /* Quick exit if already known to be updated or cannot be updated */
@@ -3146,6 +3155,12 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);
 
+        /*
+         * No other process than the startup doesn't reach here before crash
+         * recovery ends. So minRecoveryPoint must have a valid value here.
+         */
+        Assert(minRecoveryPoint != InvalidXLogRecPtr);
+
         /* check again */
         return record > minRecoveryPoint;
     }

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Fri, Aug 31, 2018 at 02:52:06PM +0900, Kyotaro HORIGUCHI wrote:
> The patch inhibits turning off updateMinRecoveryPoint on other
> than the startup process running crash-recovery except at the end
> of XLogNeedsFlush.

Yes that's a matter of safety, as I put into the truck any modules which
may use XLogFlush().  And that maps with the old code, so there is no
more surprise.

> Even if any other processes than the startup calls the function
> during crash recovery, they don't have a means to turn on
> updateMinRecoveryPoint again. Actually the only process that
> calls the function during crash recovery is the startup. bwriter
> and checkpointer doesn't. Hot-standby backends come after
> crash-recvery ends. After all, we just won't have an invalid
> minRecoveryPoint value there, and updateMinRecoverypoint must be
> true.

Yes, until a recovery point is reached only the startup process could
call that.  Now I would imagine that we could have a background worker
as well which is spawned when the postmaster starts, and calls those
code paths...

> +    /*
> +     * No other process than the startup doesn't reach here before crash
> +     * recovery ends. So minRecoveryPoint must have a valid value here.
> +     */
> +    Assert(minRecoveryPoint != InvalidXLogRecPtr);

...  And that would invalidate your assertion here.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Fri, Aug 31, 2018 at 02:52:06PM +0900, Kyotaro HORIGUCHI wrote:
> The patch inhibits turning off updateMinRecoveryPoint on other
> than the startup process running crash-recovery except at the end
> of XLogNeedsFlush.

Yes that's a matter of safety, as I put into the truck any modules which
may use XLogFlush().  And that maps with the old code, so there is no
more surprise.

> Even if any other processes than the startup calls the function
> during crash recovery, they don't have a means to turn on
> updateMinRecoveryPoint again. Actually the only process that
> calls the function during crash recovery is the startup. bwriter
> and checkpointer doesn't. Hot-standby backends come after
> crash-recvery ends. After all, we just won't have an invalid
> minRecoveryPoint value there, and updateMinRecoverypoint must be
> true.

Yes, until a recovery point is reached only the startup process could
call that.  Now I would imagine that we could have a background worker
as well which is spawned when the postmaster starts, and calls those
code paths...

> +    /*
> +     * No other process than the startup doesn't reach here before crash
> +     * recovery ends. So minRecoveryPoint must have a valid value here.
> +     */
> +    Assert(minRecoveryPoint != InvalidXLogRecPtr);

...  And that would invalidate your assertion here.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 11:23:54PM -0700, Michael Paquier wrote:
> Yes that's a matter of safety, as I put into the truck any modules which
> may use XLogFlush().  And that maps with the old code, so there is no
> more surprise.

Okay, I have pushed my previous version as that's the safest approach.
If you have any idea of improvements or clarifications, let's discuss
about those on a different thread and only for HEAD.

Many thanks Alexander for the whole work and Horiguchi-san for the
review.
--
Michael

Вложения

Re: BUG #15346: Replica fails to start after the crash

От
Michael Paquier
Дата:
On Thu, Aug 30, 2018 at 11:23:54PM -0700, Michael Paquier wrote:
> Yes that's a matter of safety, as I put into the truck any modules which
> may use XLogFlush().  And that maps with the old code, so there is no
> more surprise.

Okay, I have pushed my previous version as that's the safest approach.
If you have any idea of improvements or clarifications, let's discuss
about those on a different thread and only for HEAD.

Many thanks Alexander for the whole work and Horiguchi-san for the
review.
--
Michael

Вложения