Обсуждение: Problem while setting the fpw with SIGHUP

Поиск
Список
Период
Сортировка

Problem while setting the fpw with SIGHUP

От
Dilip Kumar
Дата:
While setting the full_page_write with SIGHUP I hit an assert in checkpoint process. And, that is because inside a CRITICAL section we are calling RecoveryInProgress which intern allocates memory.  So I have moved RecoveryInProgress call out of the CRITICAL section and the problem got solved.

command executed: killall -SIGHUP postgres
Crash call stack:
#0  0x00007fa19560d5d7 in raise () from /lib64/libc.so.6
#1  0x00007fa19560ecc8 in abort () from /lib64/libc.so.6
#2  0x00000000009fc991 in ExceptionalCondition (conditionName=0xc5ab1c "!(CritSectionCount == 0)", errorType=0xc5a739 "FailedAssertion", 
    fileName=0xc5a8a5 "mcxt.c", lineNumber=635) at assert.c:54
#3  0x0000000000a34e56 in MemoryContextCreate (node=0x192edc0, tag=T_AllocSetContext, size=216, nameoffset=216, methods=0xc58620 <AllocSetMethods>, 
    parent=0x18fe1b0, name=0xac1137 "WAL record construction", flags=0) at mcxt.c:635
#4  0x0000000000a2aaa1 in AllocSetContextCreateExtended (parent=0x18fe1b0, name=0xac1137 "WAL record construction", flags=0, minContextSize=0, 
    initBlockSize=8192, maxBlockSize=8388608) at aset.c:463
#5  0x000000000055983c in InitXLogInsert () at xloginsert.c:1033
#6  0x000000000054e4e5 in InitXLOGAccess () at xlog.c:8183
#7  0x000000000054df71 in RecoveryInProgress () at xlog.c:7952
#8  0x00000000005507f6 in UpdateFullPageWrites () at xlog.c:9566
#9  0x00000000007ea821 in UpdateSharedMemoryConfig () at checkpointer.c:1366
#10 0x00000000007e95a1 in CheckpointerMain () at checkpointer.c:383

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Fri, Mar 09, 2018 at 01:42:04PM +0530, Dilip Kumar wrote:
> While setting the full_page_write with SIGHUP I hit an assert in checkpoint
> process. And, that is because inside a CRITICAL section we are calling
> RecoveryInProgress which intern allocates memory.  So I have moved
> RecoveryInProgress call out of the CRITICAL section and the problem got
> solved.

Indeed, I can see how this is possible.

If you apply the following you can also have way more fun, but that's
overdoing it:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7918,6 +7918,8 @@ CheckRecoveryConsistency(void)
bool
RecoveryInProgress(void)
{
+   Assert(CritSectionCount == 0);

Anyway, it seems to me that you are not taking care of all possible race
conditions here.  RecoveryInProgress() could as well be called in
XLogFlush(), and that's a code path taken during redo.

Instead of doing what you are suggesting, why not moving
InitXLogInsert() out of InitXLOGAccess() and change InitPostgres() so as
the allocations for WAL inserts is done unconditionally?  This has
the cost of also making this allocation even for backends which are
started during recovery, still we are talking about allocating a couple
of bytes in exchange of addressing completely all race conditions in
this area.  InitXLogInsert() does not depend on any post-recovery data
like ThisTimeLineId, so a split is possible.

Thoughts?
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Mar 13, 2018 at 05:04:04PM +0900, Michael Paquier wrote:
> Instead of doing what you are suggesting, why not moving
> InitXLogInsert() out of InitXLOGAccess() and change InitPostgres() so as
> the allocations for WAL inserts is done unconditionally?  This has
> the cost of also making this allocation even for backends which are
> started during recovery, still we are talking about allocating a couple
> of bytes in exchange of addressing completely all race conditions in
> this area.  InitXLogInsert() does not depend on any post-recovery data
> like ThisTimeLineId, so a split is possible.

I have been hacking things this way, and it seems to me that it takes
care of all this class of problems.  CreateCheckPoint() actually
mentions that InitXLogInsert() cannot be called in a critical section,
so the comments don't match the code.  I also think that we still want
to be able to use RecoveryInProgress() in critical sections to do
decision-making for the generation of WAL records.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Dilip Kumar
Дата:
On Fri, Mar 16, 2018 at 10:53 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 13, 2018 at 05:04:04PM +0900, Michael Paquier wrote:
> Instead of doing what you are suggesting, why not moving
> InitXLogInsert() out of InitXLOGAccess() and change InitPostgres() so as
> the allocations for WAL inserts is done unconditionally?  This has
> the cost of also making this allocation even for backends which are
> started during recovery, still we are talking about allocating a couple
> of bytes in exchange of addressing completely all race conditions in
> this area.  InitXLogInsert() does not depend on any post-recovery data
> like ThisTimeLineId, so a split is possible.

I have been hacking things this way, and it seems to me that it takes
care of all this class of problems.  CreateCheckPoint() actually
mentions that InitXLogInsert() cannot be called in a critical section,
so the comments don't match the code.  I also think that we still want
to be able to use RecoveryInProgress() in critical sections to do
decision-making for the generation of WAL records

Thanks for the patch, the idea looks good to me.  Please find some comments and updated patch.

I think like WALWriterProcess, we need to call InitXLogInsert for the CheckpointerProcess as well as for the BgWriterProcess
because earlier they were calling InitXLogInsert while check RecoveryInProgress before inserting the WAL.

see below crash:
#0  0x00007f89273a65d7 in raise () from /lib64/libc.so.6
#1  0x00007f89273a7cc8 in abort () from /lib64/libc.so.6
#2  0x00000000009fd24e in errfinish (dummy=0) at elog.c:556
#3  0x00000000009ff70b in elog_finish (elevel=20, fmt=0xac0d1d "too much WAL data") at elog.c:1378
#4  0x0000000000558766 in XLogRegisterData (data=0xf3efac <fullPageWrites> "\001", len=1) at xloginsert.c:330
#5  0x000000000055080e in UpdateFullPageWrites () at xlog.c:9569
#6  0x00000000007ea831 in UpdateSharedMemoryConfig () at checkpointer.c:1360
#7  0x00000000007e95b1 in CheckpointerMain () at checkpointer.c:370
#8  0x0000000000561680 in AuxiliaryProcessMain (argc=2, argv=0x7fffcfd4bec0) at bootstrap.c:447

I have modified you patch and called InitXLogInsert for CheckpointerProcess and BgWriterProcess also. After that the
issue is solved and fpw is getting set properly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Mar 20, 2018 at 10:43:55AM +0530, Dilip Kumar wrote:
> I think like WALWriterProcess, we need to call InitXLogInsert for the
> CheckpointerProcess as well as for the BgWriterProcess
> because earlier they were calling InitXLogInsert while check
> RecoveryInProgress before inserting the WAL.

/* don't set signals, bgwriter has its own agenda */
+                       InitXLOGAccess();
+                       InitXLogInsert()

This is wrong, as the checkpointer is started as well on standbys, and
that InitXLOGAccess initializes things for WAL generation like
ThisTimeLineID.  So you should just call InitXLogInsert(), and a comment
would be welcome for both the bgwriter and the checkpointer.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Dilip Kumar
Дата:
On Tue, Mar 20, 2018 at 11:26 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 20, 2018 at 10:43:55AM +0530, Dilip Kumar wrote:
> I think like WALWriterProcess, we need to call InitXLogInsert for the
> CheckpointerProcess as well as for the BgWriterProcess
> because earlier they were calling InitXLogInsert while check
> RecoveryInProgress before inserting the WAL.

/* don't set signals, bgwriter has its own agenda */
+                       InitXLOGAccess();
+                       InitXLogInsert()

This is wrong, as the checkpointer is started as well on standbys, and
that InitXLOGAccess initializes things for WAL generation like
ThisTimeLineID.  So you should just call InitXLogInsert(), and a comment
would be welcome for both the bgwriter and the checkpointer.

Yeah, you are right.  Fixed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Mar 20, 2018 at 12:00:47PM +0530, Dilip Kumar wrote:
> Yeah, you are right.  Fixed.

So I have been spending a couple of hours playing with your patch, and
tested various configurations manually, like switch the fpw switch to on
and off while running in parallel pgbench.  I have also tested
promotions, etc.

While the patch does its job, it is possible to simplify a bit more the
calls to InitXLogInsert().  Particularly, the one in CreateCheckpoint()
is basically useless for the checkpointer, still it is useful for the
startup process when trigerring an end-in-recovery checkpoint.  So one
additional cleanup would be to move the call in CreateCheckpoint() to
bootstrap.c for the startup process.  In order to test that, please make
sure to create fallback_promote at the root of the data folder before
sending SIGUSR2 to the startup process which would trigger the pre-9.3
promotion where the end-of-recovery checkpoint is triggered by the
startup process itself.

+   /* Initialize the working areas for constructing WAL records. */
+   InitXLogInsert();
Instead of having the same comment for each process calling
InitXLogInsert() multiple times, I think that it would be better to
complete the comment in bootstrap.c where is written "XLOG operations".

Here is a suggestion:
/*
 * Initialize WAL-related operations and enter the main loop of each
 * process.  InitXLogInsert is called for each process which can
 * generate WAL.  While this is wasteful for processes started on
 * a standby, this gives the guarantee that initialization of WAL
 * insertion areas is able to happen in a consistent way and out of
 * any critical sections so as the facility is usable when a promotion
 * is triggered.
 */

What do you think?
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Dilip Kumar
Дата:
On Fri, Mar 23, 2018 at 1:19 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 20, 2018 at 12:00:47PM +0530, Dilip Kumar wrote:
> Yeah, you are right.  Fixed.

So I have been spending a couple of hours playing with your patch, and
tested various configurations manually, like switch the fpw switch to on
and off while running in parallel pgbench.  I have also tested
promotions, etc.

While the patch does its job, it is possible to simplify a bit more the
calls to InitXLogInsert().  Particularly, the one in CreateCheckpoint()
is basically useless for the checkpointer, still it is useful for the
startup process when trigerring an end-in-recovery checkpoint.  So one
additional cleanup would be to move the call in CreateCheckpoint() to
bootstrap.c for the startup process.

In StarupXLOG, just before the CreateCheckPoint() call,  we are calling LocalSetXLogInsertAllowed().  So, I am thinking can we just remove InitXLogInsert from CreateCheckpoint. And, we don't need to move it to bootstrap.c.  Or am I missing something?
 
In order to test that, please make
sure to create fallback_promote at the root of the data folder before
sending SIGUSR2 to the startup process which would trigger the pre-9.3
promotion where the end-of-recovery checkpoint is triggered by the
startup process itself.

+   /* Initialize the working areas for constructing WAL records. */
+   InitXLogInsert();
Instead of having the same comment for each process calling
InitXLogInsert() multiple times, I think that it would be better to
complete the comment in bootstrap.c where is written "XLOG operations".

Here is a suggestion:
/*
 * Initialize WAL-related operations and enter the main loop of each
 * process.  InitXLogInsert is called for each process which can
 * generate WAL.  While this is wasteful for processes started on
 * a standby, this gives the guarantee that initialization of WAL
 * insertion areas is able to happen in a consistent way and out of
 * any critical sections so as the facility is usable when a promotion
 * is triggered.
 */

What do you think?
 
Looks good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Mon, Mar 26, 2018 at 02:32:22PM +0530, Dilip Kumar wrote:
> On Fri, Mar 23, 2018 at 1:19 PM, Michael Paquier <michael@paquier.xyz>
> wrote:
> In StartupXLOG, just before the CreateCheckPoint() call,  we are calling
> LocalSetXLogInsertAllowed().  So, I am thinking can we just remove
> InitXLogInsert from CreateCheckpoint. And, we don't need to move it to
> bootstrap.c.  Or am I missing something?

I have finally been able to spend more time on this issue, and checked
for a couple of hours all the calls to RecoveryInProgress() that could
be triggered within a critical section to see if the move I suggested
previously is worth it ot not as this would cost some memory for
standbys all the time, which would suck for many read-only sessions.

There are a couple of calls potentially happening in critical sections,
however except for the one in UpdateFullPageWrites() those are used for
sanity  checks in code paths that should never trigger it, take
XLogInsertBegin() for example.  So with assertions this would trigger
a failure before the real elog(ERROR) message shows up.

Hence, I am changing opinion still I think that instead of the patch you
proposed first we could just do a call to InitXLogInsert() before
entering the critical section.  This is also more consistent with what
CreateCheckpoint() does.  That's also less risky for a backpatch as your
patch increases the window between the beginning of the critical section
and the real moment where the check for RecoveryInProgress is needed.  A
comment explaining why the initialization needs to happen is also
essential.

All in all, this would give the simple patch attached.

Thoughts?
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Tue, 27 Mar 2018 16:46:30 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180327074630.GD9940@paquier.xyz>
> I have finally been able to spend more time on this issue, and checked
> for a couple of hours all the calls to RecoveryInProgress() that could
> be triggered within a critical section to see if the move I suggested
> previously is worth it ot not as this would cost some memory for
> standbys all the time, which would suck for many read-only sessions.
> 
> There are a couple of calls potentially happening in critical sections,
> however except for the one in UpdateFullPageWrites() those are used for
> sanity  checks in code paths that should never trigger it, take
> XLogInsertBegin() for example.  So with assertions this would trigger
> a failure before the real elog(ERROR) message shows up.
> 
> Hence, I am changing opinion still I think that instead of the patch you
> proposed first we could just do a call to InitXLogInsert() before
> entering the critical section.  This is also more consistent with what
> CreateCheckpoint() does.  That's also less risky for a backpatch as your
> patch increases the window between the beginning of the critical section
> and the real moment where the check for RecoveryInProgress is needed.  A
> comment explaining why the initialization needs to happen is also
> essential.
> 
> All in all, this would give the simple patch attached.
> 
> Thoughts?

At the first look I was uneasy that the patch initializes xlog
working area earlier than required.

The current UpdateFullPageWrites is safe on standby and promotion
so what we should consider is only the non-standby case. I think
what we should do is just calling RecoveryInProgress() at the
beginning of CheckPointerMain, which is just the same thing with
InitPostgres, but before setting up signal handler to avoid
processing SIGHUP before being ready to insert xlog.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..c446e81299 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -197,6 +197,13 @@ CheckpointerMain(void)
 
     CheckpointerShmem->checkpointer_pid = MyProcPid;
 
+    /*
+     * We need to call InitXLOGAccess(), if the system isn't in hot-standby
+     * mode.  This is handled by calling RecoveryInProgress and ignoring the
+     * result. This needs to be done before SIGHUP handler is set up.
+     */
+    (void) RecoveryInProgress();
+
     /*
      * Properly accept or ignore signals the postmaster might send us
      *

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
> The current UpdateFullPageWrites is safe on standby and promotion
> so what we should consider is only the non-standby case. I think
> what we should do is just calling RecoveryInProgress() at the
> beginning of CheckPointerMain, which is just the same thing with
> InitPostgres, but before setting up signal handler to avoid
> processing SIGHUP before being ready to insert xlog.

Your proposal does not fix the issue for a checkpointer process started
on a standby.  After a promotion, if SIGHUP is issued with a change in
full_page_writes, then the initialization of InitXLogInsert() would
happen again in the critical section of UpdateFullPageWrites().  The
window is rather small for normal promotions as the startup process
requests a checkpoint which would do the initialization, and much larger
for fallback_promote where the startup process is in charge of doing the
end-of-recovery checkpoint.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180327130226.GA1105@paquier.xyz>
> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
> > The current UpdateFullPageWrites is safe on standby and promotion
> > so what we should consider is only the non-standby case. I think
> > what we should do is just calling RecoveryInProgress() at the
> > beginning of CheckPointerMain, which is just the same thing with
> > InitPostgres, but before setting up signal handler to avoid
> > processing SIGHUP before being ready to insert xlog.
> 
> Your proposal does not fix the issue for a checkpointer process started
> on a standby.  After a promotion, if SIGHUP is issued with a change in
> full_page_writes, then the initialization of InitXLogInsert() would
> happen again in the critical section of UpdateFullPageWrites().  The
> window is rather small for normal promotions as the startup process
> requests a checkpoint which would do the initialization, and much larger
> for fallback_promote where the startup process is in charge of doing the
> end-of-recovery checkpoint.

Yeah. I realized that after sending the mail.

I looked closer and found several problems there.

- On standby, StartupXLOG calls UpdateFullPageWrites and
  checkpointer can call the same function simultaneously, but it
  doesn't assume concurrent call.

- StartupXLOG can make a concurrent write to
  Insert->fullPageWrite so it needs to be locked.

- At the time of the very end of recovery, the startup process
  ignores possible change of full_page_writes GUC. It sticks with
  the startup value. It leads to loss of
  XLOG_CHANGE_FPW. (StartXLOG is not considering GUC changes by
  reload)

- If checkpointer calls UpdateFullPageWrites concurrently with
  change of SharedRecoveryInProgress to false in StartupXLOG the
  change may lose corresponding XLOG_CHANGE_FPW.

So, if we don't accept the current behavior, what I think we
should do are all of the follows.

A. In StartupXLOG, protect write to Insert->fullPageWrites with
  wal insert exlusive lock. Do the same thing for read in
  UpdateFullPageWrites.

B. Surround the whole UpdateFullPageWrites with any kind of lock
  to exclude concurrent calls. The attached uses ControlFileLock.
  This also exludes the function with chaging of
  SharedRecoveryInProgress to avoid loss of XLOG_CHANGE_FPW.

C. After exiting recovery mode, call UpdateFullPageWrites from
  StartupXLOG if shared fullPageWrites is found changed from the
  last known value. If checkponiter did the same thing at the
  same time, one of them completes the work.

D. Call RecoveryInProgress to set up xlog working area.

The attached does that. I don't like that it uses ControlFileLock
to exlucde concurrent UpdateFullPageWrites and StartupXLOG but
WALInsertLock cannot be used since UpdateFullPageWrites may take
the same lock.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Wed, Mar 28, 2018 at 03:40:59PM +0900, Kyotaro HORIGUCHI wrote:
> The attached does that. I don't like that it uses ControlFileLock
> to exlucde concurrent UpdateFullPageWrites and StartupXLOG but
> WALInsertLock cannot be used since UpdateFullPageWrites may take
> the same lock.

You visibly forgot your patch.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Wed, 28 Mar 2018 15:59:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180328065948.GM1105@paquier.xyz>
> On Wed, Mar 28, 2018 at 03:40:59PM +0900, Kyotaro HORIGUCHI wrote:
> > The attached does that. I don't like that it uses ControlFileLock
> > to exlucde concurrent UpdateFullPageWrites and StartupXLOG but
> > WALInsertLock cannot be used since UpdateFullPageWrites may take
> > the same lock.
> 
> You visibly forgot your patch.

Mmm, someone must have eaten that. I'm sure it is attached this
time.

I don't like UpdateFullPageWrites is using ControlFileLock to
exclusion..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cb9c2a29cb..d2bb5e16ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7558,9 +7558,11 @@ StartupXLOG(void)
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
+     * record is written, ignoreing the change of full_page_write GUC so far.
      */
+    WALInsertLockAcquireExclusive();
     Insert->fullPageWrites = lastFullPageWrites;
+    WALInsertLockRelease();
     LocalSetXLogInsertAllowed();
     UpdateFullPageWrites();
     LocalXLogInsertAllowed = -1;
@@ -7783,6 +7785,9 @@ StartupXLOG(void)
      * itself, we use the info_lck here to ensure that there are no race
      * conditions concerning visibility of other recent updates to shared
      * memory.
+     *
+     * ControlFileLock excludes concurrent update of shared fullPageWrites in
+     * addition to its ordinary use.
      */
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     ControlFile->state = DB_IN_PRODUCTION;
@@ -7790,11 +7795,25 @@ StartupXLOG(void)
 
     SpinLockAcquire(&XLogCtl->info_lck);
     XLogCtl->SharedRecoveryInProgress = false;
+    lastFullPageWrites = Insert->fullPageWrites;
     SpinLockRelease(&XLogCtl->info_lck);
 
     UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
+    /*
+     * Shared fullPageWrites at the end of recovery differs to our last known
+     * value. The change has been made by checkpointer but we haven't made
+     * corresponding XLOG_FPW_CHANGE record. We reread GUC change if any and
+     * try update shared fullPageWrites by myself. It ends with doing nothing
+     * if checkpointer already did the same thing.
+     */
+    if (lastFullPageWrites != fullPageWrites)
+    {
+        HandleStartupProcInterrupts();
+        UpdateFullPageWrites();
+    }
+
     /*
      * If there were cascading standby servers connected to us, nudge any wal
      * sender processes to notice that we've been promoted.
@@ -9525,8 +9544,10 @@ XLogReportParameters(void)
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
+ * This function assumes called concurrently from multiple processes and
+ * called concurrently with changing of shared fullPageWrites. See
+ * StartupXLOG().
+ * 
  */
 void
 UpdateFullPageWrites(void)
@@ -9536,13 +9557,25 @@ UpdateFullPageWrites(void)
     /*
      * Do nothing if full_page_writes has not been changed.
      *
-     * It's safe to check the shared full_page_writes without the lock,
-     * because we assume that there is no concurrently running process which
-     * can update it.
+     * We acquire ControlFileLock to exclude other concurrent call and change
+     * of shared fullPageWrites.
      */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+    WALInsertLockAcquireExclusive();
     if (fullPageWrites == Insert->fullPageWrites)
+    {
+        WALInsertLockRelease();
+        LWLockRelease(ControlFileLock);
         return;
+    }
+    WALInsertLockRelease();
 
+    /*
+     * Need to set up XLogInsert working area before entering critical section
+     * if we are not in recovery mode.
+     */
+    (void) RecoveryInProgress();
+        
     START_CRIT_SECTION();
 
     /*
@@ -9578,6 +9611,8 @@ UpdateFullPageWrites(void)
         WALInsertLockRelease();
     }
     END_CRIT_SECTION();
+
+    LWLockRelease(ControlFileLock);
 }
 
 /*

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Wed, Mar 28, 2018 at 12:10 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180327130226.GA1105@paquier.xyz>
>> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
>> > The current UpdateFullPageWrites is safe on standby and promotion
>> > so what we should consider is only the non-standby case. I think
>> > what we should do is just calling RecoveryInProgress() at the
>> > beginning of CheckPointerMain, which is just the same thing with
>> > InitPostgres, but before setting up signal handler to avoid
>> > processing SIGHUP before being ready to insert xlog.
>>
>> Your proposal does not fix the issue for a checkpointer process started
>> on a standby.  After a promotion, if SIGHUP is issued with a change in
>> full_page_writes, then the initialization of InitXLogInsert() would
>> happen again in the critical section of UpdateFullPageWrites().  The
>> window is rather small for normal promotions as the startup process
>> requests a checkpoint which would do the initialization, and much larger
>> for fallback_promote where the startup process is in charge of doing the
>> end-of-recovery checkpoint.
>
> Yeah. I realized that after sending the mail.
>
> I looked closer and found several problems there.
>
> - On standby, StartupXLOG calls UpdateFullPageWrites and
>   checkpointer can call the same function simultaneously, but it
>   doesn't assume concurrent call.
>
> - StartupXLOG can make a concurrent write to
>   Insert->fullPageWrite so it needs to be locked.
>
> - At the time of the very end of recovery, the startup process
>   ignores possible change of full_page_writes GUC. It sticks with
>   the startup value. It leads to loss of
>   XLOG_CHANGE_FPW. (StartXLOG is not considering GUC changes by
>   reload)
>

Won't this be covered by checkpointer process?  Basically, the next
time checkpointer sees that it has received the sighup, it will call
UpdateFullPageWrites which will log the record if required.

In general, I was wondering why in the first place this variable
(full_page_writes) is a SIGHUP variable?  I think if the user tries to
switch it to 'on' from 'off', it won't guarantee the recovery from
torn pages.  Yeah, one can turn it to 'off' from 'on' without any
problem, but as the reverse doesn't guarantee anything, it can confuse
users. What do you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Sat, 31 Mar 2018 17:43:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JSHSPhkHZXj5q2yf3x1MgBN0oYHb9JvcoVh9ZYqB5g+w@mail.gmail.com>
> On Wed, Mar 28, 2018 at 12:10 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180327130226.GA1105@paquier.xyz>
> >> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
> >> > The current UpdateFullPageWrites is safe on standby and promotion
> >> > so what we should consider is only the non-standby case. I think
> >> > what we should do is just calling RecoveryInProgress() at the
> >> > beginning of CheckPointerMain, which is just the same thing with
> >> > InitPostgres, but before setting up signal handler to avoid
> >> > processing SIGHUP before being ready to insert xlog.
> >>
> >> Your proposal does not fix the issue for a checkpointer process started
> >> on a standby.  After a promotion, if SIGHUP is issued with a change in
> >> full_page_writes, then the initialization of InitXLogInsert() would
> >> happen again in the critical section of UpdateFullPageWrites().  The
> >> window is rather small for normal promotions as the startup process
> >> requests a checkpoint which would do the initialization, and much larger
> >> for fallback_promote where the startup process is in charge of doing the
> >> end-of-recovery checkpoint.
> >
> > Yeah. I realized that after sending the mail.
> >
> > I looked closer and found several problems there.
> >
> > - On standby, StartupXLOG calls UpdateFullPageWrites and
> >   checkpointer can call the same function simultaneously, but it
> >   doesn't assume concurrent call.
> >
> > - StartupXLOG can make a concurrent write to
> >   Insert->fullPageWrite so it needs to be locked.
> >
> > - At the time of the very end of recovery, the startup process
> >   ignores possible change of full_page_writes GUC. It sticks with
> >   the startup value. It leads to loss of
> >   XLOG_CHANGE_FPW. (StartXLOG is not considering GUC changes by
> >   reload)
> >
> 
> Won't this be covered by checkpointer process?  Basically, the next
> time checkpointer sees that it has received the sighup, it will call
> UpdateFullPageWrites which will log the record if required.

Right. Checkpointer is doing the right thing, but without
writing XLOG_FPW_CHANGE records during recovery.

The problem is in StartupXLOG. It doesn't read shared FPW flag
during recovery and updates local flag according to WAL
records. Then it tries to issue XLOG_FPW_CHANGE if its local
status and shared flag are different. This is correct.

But after that, checkpointer still cannot write XLOG (before
SharedRecovoeryInProgress becomes false) but checkpointer can
change shared fullPagesWrites without writing WAL and the WAL
record is eventually lost.

> In general, I was wondering why in the first place this variable
> (full_page_writes) is a SIGHUP variable?  I think if the user tries to
> switch it to 'on' from 'off', it won't guarantee the recovery from
> torn pages.  Yeah, one can turn it to 'off' from 'on' without any
> problem, but as the reverse doesn't guarantee anything, it can confuse
> users. What do you think?

I tend to agree with you. It works as expected after the next
checkpoint. So define the variable as "it can be changed any time
but has an effect at the next checkpoint time", then remove
XLOG_FPW_CHANGE record. And that eliminates the problem of
concurrent calls since the checkpointer becomes the only modifier
of the variable. And the problematic function
UpdateFullPageWrites also no longer needs to write a WAL
record. The information is conveyed only by checkpoint records.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Wed, 04 Apr 2018 17:26:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180404.172646.238325981.horiguchi.kyotaro@lab.ntt.co.jp>
> > In general, I was wondering why in the first place this variable
> > (full_page_writes) is a SIGHUP variable?  I think if the user tries to
> > switch it to 'on' from 'off', it won't guarantee the recovery from
> > torn pages.  Yeah, one can turn it to 'off' from 'on' without any
> > problem, but as the reverse doesn't guarantee anything, it can confuse
> > users. What do you think?
> 
> I tend to agree with you. It works as expected after the next
> checkpoint. So define the variable as "it can be changed any time
> but has an effect at the next checkpoint time", then remove
> XLOG_FPW_CHANGE record. And that eliminates the problem of
> concurrent calls since the checkpointer becomes the only modifier
> of the variable. And the problematic function
> UpdateFullPageWrites also no longer needs to write a WAL
> record. The information is conveyed only by checkpoint records.

I noticed that XLOG_FPW_CHANGE(fpw=false) is still required by
pg_start/stop_backup to know FPW's turning-off without waiting
for the next checkpoint record. But XLOG_FPW_CHANGE(true) is not
required since no one uses the information. It seems even harmful
when it is written at the incorrect place.

In the attached patch, shared fullPageWrites is updated only at
REDO point and XLOG_FPW_CHANGE is written only for FPW's turning
off before REDO point.

Disabling FPW at any time makes sense when we need to slow down
the speed of WAL urgently, but...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 10f6865cf1e96e8d883ed7e03e1fafb6e73ec935 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 6 Apr 2018 13:57:48 +0900
Subject: [PATCH] Change FPW handling

The GUC full_pages_writes currently has an effect immediately. That
makes a race condition between config reload on checkpointer and
StartupXLOG. But since full page images are meaningful only when they
are attached to all WAL records covers a checkpoint, there is no
problem if we update the shared FPW only at REDO point.  On the other
hand, online backup mechanism on standby requires to know if FPW is
turned off before the next checkpoint record comes.

As the result, with this patch, changing of full_page_writes takes
effect at REDO point and additional XLOG_FPW_CHANGE is written only
for turning-off. These are sufficient for standby-backup to work
properly, reduces complexity and prevent the race condition.
---
 src/backend/access/transam/xlog.c     | 114 +++++++++-------------------------
 src/backend/postmaster/checkpointer.c |   6 --
 src/include/access/xlog.h             |   1 -
 src/include/catalog/pg_control.h      |   2 +-
 4 files changed, 31 insertions(+), 92 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b4fd8395b7..0a81650c26 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -202,14 +202,6 @@ static XLogRecPtr LastRec;
 static XLogRecPtr receivedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
-/*
- * During recovery, lastFullPageWrites keeps track of full_page_writes that
- * the replayed WAL records indicate. It's initialized with full_page_writes
- * that the recovery starting checkpoint record indicates, and then updated
- * each time XLOG_FPW_CHANGE record is replayed.
- */
-static bool lastFullPageWrites;
-
 /*
  * Local copy of SharedRecoveryInProgress variable. True actually means "not
  * known, need to check the shared state".
@@ -6776,11 +6768,7 @@ StartupXLOG(void)
      */
     restoreTwoPhaseData();
 
-    lastFullPageWrites = checkPoint.fullPageWrites;
-
     RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
-    doPageWrites = lastFullPageWrites;
-
     if (RecPtr < checkPoint.redo)
         ereport(PANIC,
                 (errmsg("invalid redo in checkpoint record")));
@@ -7575,16 +7563,6 @@ StartupXLOG(void)
     /* Pre-scan prepared transactions to find out the range of XIDs present */
     oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
-    /*
-     * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-     * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
-     */
-    Insert->fullPageWrites = lastFullPageWrites;
-    LocalSetXLogInsertAllowed();
-    UpdateFullPageWrites();
-    LocalXLogInsertAllowed = -1;
-
     if (InRecovery)
     {
         /*
@@ -7808,6 +7786,13 @@ StartupXLOG(void)
     ControlFile->state = DB_IN_PRODUCTION;
     ControlFile->time = (pg_time_t) time(NULL);
 
+    /*
+     * Set the initial value of shared fullPageWrites. Once
+     * SharedRecoveryInProgress is turned false, checkpointer will update this
+     * value.
+     */
+    XLogCtl->Insert.fullPageWrites = checkPoint.fullPageWrites;
+
     SpinLockAcquire(&XLogCtl->info_lck);
     XLogCtl->SharedRecoveryInProgress = false;
     SpinLockRelease(&XLogCtl->info_lck);
@@ -8669,6 +8654,20 @@ CreateCheckPoint(int flags)
      */
     last_important_lsn = GetLastImportantRecPtr();
 
+    /*
+     * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
+     * the flag actually takes effect. No lock is required since checkpointer
+     * is the only updator of shared fullPageWrites after recovery is
+     * finished. Both shared and local fullPageWrites do not change before the
+     * next reading below.
+     */
+    if (Insert->fullPageWrites && !fullPageWrites)
+    {
+        XLogBeginInsert();
+        XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
+        XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
+    }
+
     /*
      * We must block concurrent insertions while examining insert state to
      * determine the checkpoint REDO pointer.
@@ -8710,6 +8709,15 @@ CreateCheckPoint(int flags)
     else
         checkPoint.PrevTimeLineID = ThisTimeLineID;
 
+    /*
+     * Update shared flag of fullPageWrites. WALInsertLock ensures that this
+     * affects all WAL records exactly from REDO point. As the result a
+     * checkpoint marked as fpw=true is ensured that all WAL records have full
+     * page image.
+     */
+    if (fullPageWrites != Insert->fullPageWrites)
+        Insert->fullPageWrites = fullPageWrites;
+
     checkPoint.fullPageWrites = Insert->fullPageWrites;
 
     /*
@@ -9541,65 +9549,6 @@ XLogReportParameters(void)
     }
 }
 
-/*
- * Update full_page_writes in shared memory, and write an
- * XLOG_FPW_CHANGE record if necessary.
- *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
- */
-void
-UpdateFullPageWrites(void)
-{
-    XLogCtlInsert *Insert = &XLogCtl->Insert;
-
-    /*
-     * Do nothing if full_page_writes has not been changed.
-     *
-     * It's safe to check the shared full_page_writes without the lock,
-     * because we assume that there is no concurrently running process which
-     * can update it.
-     */
-    if (fullPageWrites == Insert->fullPageWrites)
-        return;
-
-    START_CRIT_SECTION();
-
-    /*
-     * It's always safe to take full page images, even when not strictly
-     * required, but not the other round. So if we're setting full_page_writes
-     * to true, first set it true and then write the WAL record. If we're
-     * setting it to false, first write the WAL record and then set the global
-     * flag.
-     */
-    if (fullPageWrites)
-    {
-        WALInsertLockAcquireExclusive();
-        Insert->fullPageWrites = true;
-        WALInsertLockRelease();
-    }
-
-    /*
-     * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
-     * full_page_writes during archive recovery, if required.
-     */
-    if (XLogStandbyInfoActive() && !RecoveryInProgress())
-    {
-        XLogBeginInsert();
-        XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
-
-        XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
-    }
-
-    if (!fullPageWrites)
-    {
-        WALInsertLockAcquireExclusive();
-        Insert->fullPageWrites = false;
-        WALInsertLockRelease();
-    }
-    END_CRIT_SECTION();
-}
-
 /*
  * Check that it's OK to switch to new timeline during recovery.
  *
@@ -9965,9 +9914,6 @@ xlog_redo(XLogReaderState *record)
                 XLogCtl->lastFpwDisableRecPtr = ReadRecPtr;
             SpinLockRelease(&XLogCtl->info_lck);
         }
-
-        /* Keep track of full_page_writes */
-        lastFullPageWrites = fpw;
     }
 }
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..8b87d139d0 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1359,12 +1359,6 @@ UpdateSharedMemoryConfig(void)
     /* update global shmem state for sync rep */
     SyncRepUpdateSyncStandbysDefined();
 
-    /*
-     * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
-     */
-    UpdateFullPageWrites();
-
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 421ba6d775..bc23574694 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -270,7 +270,6 @@ extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
-extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 773d9e6eba..cb911fd5a9 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -38,7 +38,7 @@ typedef struct CheckPoint
     TimeLineID    ThisTimeLineID; /* current TLI */
     TimeLineID    PrevTimeLineID; /* previous TLI, if this record begins a new
                                  * timeline (equals ThisTimeLineID otherwise) */
-    bool        fullPageWrites; /* current full_page_writes */
+    bool        fullPageWrites; /* true if all covering WALs are having FPI */
     uint32        nextXidEpoch;    /* higher-order bits of nextXid */
     TransactionId nextXid;        /* next free XID */
     Oid            nextOid;        /* next free OID */
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Apr 6, 2018 at 1:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello.
>
> At Wed, 04 Apr 2018 17:26:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20180404.172646.238325981.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> > In general, I was wondering why in the first place this variable
>> > (full_page_writes) is a SIGHUP variable?  I think if the user tries to
>> > switch it to 'on' from 'off', it won't guarantee the recovery from
>> > torn pages.  Yeah, one can turn it to 'off' from 'on' without any
>> > problem, but as the reverse doesn't guarantee anything, it can confuse
>> > users. What do you think?
>>
>> I tend to agree with you. It works as expected after the next
>> checkpoint. So define the variable as "it can be changed any time
>> but has an effect at the next checkpoint time", then remove
>> XLOG_FPW_CHANGE record. And that eliminates the problem of
>> concurrent calls since the checkpointer becomes the only modifier
>> of the variable. And the problematic function
>> UpdateFullPageWrites also no longer needs to write a WAL
>> record. The information is conveyed only by checkpoint records.
>
> I noticed that XLOG_FPW_CHANGE(fpw=false) is still required by
> pg_start/stop_backup to know FPW's turning-off without waiting
> for the next checkpoint record. But XLOG_FPW_CHANGE(true) is not
> required since no one uses the information. It seems even harmful
> when it is written at the incorrect place.
>
> In the attached patch, shared fullPageWrites is updated only at
> REDO point
>

I am not completely sure if that is the right option because this
would mean that we are defining the new scope for a GUC variable.  I
guess we should take the input of others as well.  I am not sure what
is the right way to do that, but maybe we can start a new thread with
a proper subject and description rather than discussing this under
some related bug fix patch discussion.  I guess we can try that after
CF unless some other people pitch in and share their feedback.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Fri, 6 Apr 2018 17:59:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+1zULC52G_EyNcrrxFCmBi4NUuA1CoQAKu2FFPai_Teg@mail.gmail.com>
> On Fri, Apr 6, 2018 at 1:50 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello.
> >
> > At Wed, 04 Apr 2018 17:26:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20180404.172646.238325981.horiguchi.kyotaro@lab.ntt.co.jp>
 
> >> > In general, I was wondering why in the first place this variable
> >> > (full_page_writes) is a SIGHUP variable?  I think if the user tries to
> >> > switch it to 'on' from 'off', it won't guarantee the recovery from
> >> > torn pages.  Yeah, one can turn it to 'off' from 'on' without any
> >> > problem, but as the reverse doesn't guarantee anything, it can confuse
> >> > users. What do you think?
> >>
> >> I tend to agree with you. It works as expected after the next
> >> checkpoint. So define the variable as "it can be changed any time
> >> but has an effect at the next checkpoint time", then remove
> >> XLOG_FPW_CHANGE record. And that eliminates the problem of
> >> concurrent calls since the checkpointer becomes the only modifier
> >> of the variable. And the problematic function
> >> UpdateFullPageWrites also no longer needs to write a WAL
> >> record. The information is conveyed only by checkpoint records.
> >
> > I noticed that XLOG_FPW_CHANGE(fpw=false) is still required by
> > pg_start/stop_backup to know FPW's turning-off without waiting
> > for the next checkpoint record. But XLOG_FPW_CHANGE(true) is not
> > required since no one uses the information. It seems even harmful
> > when it is written at the incorrect place.
> >
> > In the attached patch, shared fullPageWrites is updated only at
> > REDO point
> >
> 
> I am not completely sure if that is the right option because this
> would mean that we are defining the new scope for a GUC variable.  I
> guess we should take the input of others as well.  I am not sure what
> is the right way to do that, but maybe we can start a new thread with
> a proper subject and description rather than discussing this under
> some related bug fix patch discussion.  I guess we can try that after
> CF unless some other people pitch in and share their feedback.

I'd like to refrain from making a new thread since this topic is
registered as an open item (in Old Bugs section). Or making a new
thread then relinking it from the page is preferable?

I'm surprised a bit that this got confilcted so soon. On the way
rebasing, for anyone's information, I considered comment and
documentation fix but all comments and documentation can be read
as both old and new behavior. That being said, the patch contains
a small addtion about the new behavior.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fe0401335e0ac1a9ae36dbdb5c2db82f9081a1a3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 6 Apr 2018 13:57:48 +0900
Subject: [PATCH] Change FPW handling

The GUC full_pages_writes currently has an effect immediately. That
makes a race condition between config reload on checkpointer and
StartupXLOG. But since full page images are meaningful only when they
are attached to all WAL records covers a checkpoint, there is no
problem if we update the shared FPW only at REDO point.  On the other
hand, online backup mechanism on standby requires to know if FPW is
turned off before the next checkpoint record comes.

As the result, with this patch, changing of full_page_writes takes
effect at REDO point and additional XLOG_FPW_CHANGE is written only
for turning-off. These are sufficient for standby-backup to work
properly, reduces complexity and prevent the race condition.
---
 doc/src/sgml/config.sgml              |   4 +-
 src/backend/access/transam/xlog.c     | 114 +++++++++-------------------------
 src/backend/postmaster/checkpointer.c |   6 --
 src/include/access/xlog.h             |   1 -
 src/include/catalog/pg_control.h      |   2 +-
 5 files changed, 34 insertions(+), 93 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d5f2d23c4..7ea42c25e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2598,7 +2598,9 @@ include_dir 'conf.d'
        <para>
         This parameter can only be set in the <filename>postgresql.conf</filename>
         file or on the server command line.
-        The default is <literal>on</literal>.
+
+        The default is <literal>on</literal>. The change of the parmeter takes
+        effect at the next checkpoint time.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a47395174..ba9cf07d38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -202,14 +202,6 @@ static XLogRecPtr LastRec;
 static XLogRecPtr receivedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
-/*
- * During recovery, lastFullPageWrites keeps track of full_page_writes that
- * the replayed WAL records indicate. It's initialized with full_page_writes
- * that the recovery starting checkpoint record indicates, and then updated
- * each time XLOG_FPW_CHANGE record is replayed.
- */
-static bool lastFullPageWrites;
-
 /*
  * Local copy of SharedRecoveryInProgress variable. True actually means "not
  * known, need to check the shared state".
@@ -6851,11 +6843,7 @@ StartupXLOG(void)
      */
     restoreTwoPhaseData();
 
-    lastFullPageWrites = checkPoint.fullPageWrites;
-
     RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
-    doPageWrites = lastFullPageWrites;
-
     if (RecPtr < checkPoint.redo)
         ereport(PANIC,
                 (errmsg("invalid redo in checkpoint record")));
@@ -7650,16 +7638,6 @@ StartupXLOG(void)
     /* Pre-scan prepared transactions to find out the range of XIDs present */
     oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
-    /*
-     * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-     * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
-     */
-    Insert->fullPageWrites = lastFullPageWrites;
-    LocalSetXLogInsertAllowed();
-    UpdateFullPageWrites();
-    LocalXLogInsertAllowed = -1;
-
     if (InRecovery)
     {
         /*
@@ -7893,6 +7871,13 @@ StartupXLOG(void)
     ControlFile->state = DB_IN_PRODUCTION;
     ControlFile->time = (pg_time_t) time(NULL);
 
+    /*
+     * Set the initial value of shared fullPageWrites. Once
+     * SharedRecoveryInProgress is turned false, checkpointer will update this
+     * value.
+     */
+    XLogCtl->Insert.fullPageWrites = checkPoint.fullPageWrites;
+
     SpinLockAcquire(&XLogCtl->info_lck);
     XLogCtl->SharedRecoveryInProgress = false;
     SpinLockRelease(&XLogCtl->info_lck);
@@ -8754,6 +8739,20 @@ CreateCheckPoint(int flags)
      */
     last_important_lsn = GetLastImportantRecPtr();
 
+    /*
+     * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
+     * the flag actually takes effect. No lock is required since checkpointer
+     * is the only updator of shared fullPageWrites after recovery is
+     * finished. Both shared and local fullPageWrites do not change before the
+     * next reading below.
+     */
+    if (Insert->fullPageWrites && !fullPageWrites)
+    {
+        XLogBeginInsert();
+        XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
+        XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
+    }
+
     /*
      * We must block concurrent insertions while examining insert state to
      * determine the checkpoint REDO pointer.
@@ -8795,6 +8794,15 @@ CreateCheckPoint(int flags)
     else
         checkPoint.PrevTimeLineID = ThisTimeLineID;
 
+    /*
+     * Update shared flag of fullPageWrites. WALInsertLock ensures that this
+     * affects all WAL records exactly from REDO point. As the result a
+     * checkpoint marked as fpw=true is ensured that all WAL records have full
+     * page image.
+     */
+    if (fullPageWrites != Insert->fullPageWrites)
+        Insert->fullPageWrites = fullPageWrites;
+
     checkPoint.fullPageWrites = Insert->fullPageWrites;
 
     /*
@@ -9642,65 +9650,6 @@ XlogChecksums(ChecksumType new_type)
     XLogInsert(RM_XLOG_ID, XLOG_CHECKSUMS);
 }
 
-/*
- * Update full_page_writes in shared memory, and write an
- * XLOG_FPW_CHANGE record if necessary.
- *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
- */
-void
-UpdateFullPageWrites(void)
-{
-    XLogCtlInsert *Insert = &XLogCtl->Insert;
-
-    /*
-     * Do nothing if full_page_writes has not been changed.
-     *
-     * It's safe to check the shared full_page_writes without the lock,
-     * because we assume that there is no concurrently running process which
-     * can update it.
-     */
-    if (fullPageWrites == Insert->fullPageWrites)
-        return;
-
-    START_CRIT_SECTION();
-
-    /*
-     * It's always safe to take full page images, even when not strictly
-     * required, but not the other round. So if we're setting full_page_writes
-     * to true, first set it true and then write the WAL record. If we're
-     * setting it to false, first write the WAL record and then set the global
-     * flag.
-     */
-    if (fullPageWrites)
-    {
-        WALInsertLockAcquireExclusive();
-        Insert->fullPageWrites = true;
-        WALInsertLockRelease();
-    }
-
-    /*
-     * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
-     * full_page_writes during archive recovery, if required.
-     */
-    if (XLogStandbyInfoActive() && !RecoveryInProgress())
-    {
-        XLogBeginInsert();
-        XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
-
-        XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
-    }
-
-    if (!fullPageWrites)
-    {
-        WALInsertLockAcquireExclusive();
-        Insert->fullPageWrites = false;
-        WALInsertLockRelease();
-    }
-    END_CRIT_SECTION();
-}
-
 /*
  * Check that it's OK to switch to new timeline during recovery.
  *
@@ -10066,9 +10015,6 @@ xlog_redo(XLogReaderState *record)
                 XLogCtl->lastFpwDisableRecPtr = ReadRecPtr;
             SpinLockRelease(&XLogCtl->info_lck);
         }
-
-        /* Keep track of full_page_writes */
-        lastFullPageWrites = fpw;
     }
     else if (info == XLOG_CHECKSUMS)
     {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..8b87d139d0 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1359,12 +1359,6 @@ UpdateSharedMemoryConfig(void)
     /* update global shmem state for sync rep */
     SyncRepUpdateSyncStandbysDefined();
 
-    /*
-     * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
-     */
-    UpdateFullPageWrites();
-
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f21870c644..6e4648e94b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -276,7 +276,6 @@ extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
-extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 33c59f9a63..1710b8ce1e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -38,7 +38,7 @@ typedef struct CheckPoint
     TimeLineID    ThisTimeLineID; /* current TLI */
     TimeLineID    PrevTimeLineID; /* previous TLI, if this record begins a new
                                  * timeline (equals ThisTimeLineID otherwise) */
-    bool        fullPageWrites; /* current full_page_writes */
+    bool        fullPageWrites; /* true if all covering WALs are having FPI */
     uint32        nextXidEpoch;    /* higher-order bits of nextXid */
     TransactionId nextXid;        /* next free XID */
     Oid            nextOid;        /* next free OID */
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Heikki Linnakangas
Дата:
On 09/04/18 11:13, Kyotaro HORIGUCHI wrote:
> 
> At Fri, 6 Apr 2018 17:59:58 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+1zULC52G_EyNcrrxFCmBi4NUuA1CoQAKu2FFPai_Teg@mail.gmail.com>
>> On Fri, Apr 6, 2018 at 1:50 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Hello.
>>>
>>> At Wed, 04 Apr 2018 17:26:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20180404.172646.238325981.horiguchi.kyotaro@lab.ntt.co.jp>
 
>>>>> In general, I was wondering why in the first place this variable
>>>>> (full_page_writes) is a SIGHUP variable?  I think if the user tries to
>>>>> switch it to 'on' from 'off', it won't guarantee the recovery from
>>>>> torn pages.  Yeah, one can turn it to 'off' from 'on' without any
>>>>> problem, but as the reverse doesn't guarantee anything, it can confuse
>>>>> users. What do you think?
>>>>
>>>> I tend to agree with you. It works as expected after the next
>>>> checkpoint. So define the variable as "it can be changed any time
>>>> but has an effect at the next checkpoint time", then remove
>>>> XLOG_FPW_CHANGE record. And that eliminates the problem of
>>>> concurrent calls since the checkpointer becomes the only modifier
>>>> of the variable. And the problematic function
>>>> UpdateFullPageWrites also no longer needs to write a WAL
>>>> record. The information is conveyed only by checkpoint records.
>>>
>>> I noticed that XLOG_FPW_CHANGE(fpw=false) is still required by
>>> pg_start/stop_backup to know FPW's turning-off without waiting
>>> for the next checkpoint record. But XLOG_FPW_CHANGE(true) is not
>>> required since no one uses the information. It seems even harmful
>>> when it is written at the incorrect place.
>>>
>>> In the attached patch, shared fullPageWrites is updated only at
>>> REDO point
>>
>> I am not completely sure if that is the right option because this
>> would mean that we are defining the new scope for a GUC variable.  I
>> guess we should take the input of others as well.  I am not sure what
>> is the right way to do that, but maybe we can start a new thread with
>> a proper subject and description rather than discussing this under
>> some related bug fix patch discussion.  I guess we can try that after
>> CF unless some other people pitch in and share their feedback.

I think the new behavior where the GUC only takes effect at next 
checkpoint is OK. It seems quite intuitive.

> [rebased patch version]

Looks good at a quick glance. Assuming no objections from others, I'll 
take a closer look and commit tomorrow. Thanks!

- Heikki


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Wed, Apr 11, 2018 at 02:09:48PM +0300, Heikki Linnakangas wrote:
> I think the new behavior where the GUC only takes effect at next checkpoint
> is OK. It seems quite intuitive.
>
> > [rebased patch version]
>
> Looks good at a quick glance. Assuming no objections from others, I'll take
> a closer look and commit tomorrow. Thanks!

Sorry for not following up closely this thread lately.

+   /*
+    * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
+    * the flag actually takes effect. No lock is required since checkpointer
+    * is the only updator of shared fullPageWrites after recovery is
+    * finished. Both shared and local fullPageWrites do not change before the
+    * next reading below.
+    */
+   if (Insert->fullPageWrites && !fullPageWrites)
+   {
+       XLogBeginInsert();
+       XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
+       XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
+   }

This is not actually true.  If a fallback_promote is used, then
CreateCheckPoint() is called by the startup process which is in charge
of issuing the end-of-recovery checkpoint, and not the checkpointer.  So
I still fail to see how a no-lock approach is fine except if we remove
fallback_promote?
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Hello. Thanks to Heikkit for picking this up and thanks for the
commnet to Michael.

# The attached is changed only in a comment, and rebased.

At Thu, 12 Apr 2018 05:24:14 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180411202414.GA32449@paquier.xyz>
> On Wed, Apr 11, 2018 at 02:09:48PM +0300, Heikki Linnakangas wrote:
> > I think the new behavior where the GUC only takes effect at next checkpoint
> > is OK. It seems quite intuitive.
> > 
> > > [rebased patch version]
> > 
> > Looks good at a quick glance. Assuming no objections from others, I'll take
> > a closer look and commit tomorrow. Thanks!
> 
> Sorry for not following up closely this thread lately.
> 
> +   /*
> +    * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
> +    * the flag actually takes effect. No lock is required since checkpointer
> +    * is the only updator of shared fullPageWrites after recovery is
> +    * finished. Both shared and local fullPageWrites do not change before the
> +    * next reading below.
> +    */
> +   if (Insert->fullPageWrites && !fullPageWrites)
> +   {
> +       XLogBeginInsert();
> +       XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
> +       XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
> +   }
> 
> This is not actually true.  If a fallback_promote is used, then
> CreateCheckPoint() is called by the startup process which is in charge
> of issuing the end-of-recovery checkpoint, and not the checkpointer.  So
> I still fail to see how a no-lock approach is fine except if we remove
> fallback_promote?

Checkpointer never calls CreateCheckPoint while
RecoveryInProgress() == true. In other words, checkpointer is not
an updator of shared FPW at the time StartupXLOG calls
CreateCheckPoint for fallback_promote.

The comment may be somewhat confusing that it is written
there. The point is that checkpointer and StartupXLOG are
mutually excluded on updating shared FPW by
SharedRecoveryInProgress flag.

| * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
| * the flag actually takes effect. Checkpointer never calls this function
| * before StartupXLOG() turns off SharedRecoveryInProgress so there's no
| * window where checkpointer and startup processes - the only updators of
| * the flag - can update shared FPW simultaneously. Thus no lock is
| * required here. Both shared and local fullPageWrites do not change
| * before the next reading below.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f6d4857356508fa16dc5d54b92d0177dbeaae3e2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 6 Apr 2018 13:57:48 +0900
Subject: [PATCH] Change FPW handling

The GUC full_pages_writes currently has an effect immediately. That
makes a race condition between config reload on checkpointer and
StartupXLOG. But since full page images are meaningful only when they
are attached to all WAL records covers a checkpoint, there is no
problem if we update the shared FPW only at REDO point.  On the other
hand, online backup mechanism on standby requires to know if FPW is
turned off before the next checkpoint record comes.

As the result, with this patch, changing of full_page_writes takes
effect at REDO point and additional XLOG_FPW_CHANGE is written only
for turning-off. These are sufficient for standby-backup to work
properly, reduces complexity and prevent the race condition.
---
 doc/src/sgml/config.sgml                   |   4 +-
 src/backend/access/transam/xlog.c          | 116 ++++++++---------------------
 src/backend/optimizer/util/clauses.c       |   4 +-
 src/backend/parser/gram.y                  |  78 ++++++++++---------
 src/backend/parser/parse_agg.c             |  10 +++
 src/backend/parser/parse_expr.c            |   5 ++
 src/backend/parser/parse_func.c            |   3 +
 src/backend/parser/parse_utilcmd.c         |  72 +++++++++++-------
 src/backend/postmaster/checkpointer.c      |   6 --
 src/include/access/xlog.h                  |   1 -
 src/include/catalog/pg_control.h           |   2 +-
 src/include/optimizer/clauses.h            |   2 +
 src/include/parser/parse_node.h            |   1 +
 src/include/utils/rel.h                    |   6 ++
 src/test/regress/expected/create_table.out |  36 +++++----
 src/test/regress/sql/create_table.sql      |  21 +++++-
 16 files changed, 192 insertions(+), 175 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d5f2d23c4..7ea42c25e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2598,7 +2598,9 @@ include_dir 'conf.d'
        <para>
         This parameter can only be set in the <filename>postgresql.conf</filename>
         file or on the server command line.
-        The default is <literal>on</literal>.
+
+        The default is <literal>on</literal>. The change of the parmeter takes
+        effect at the next checkpoint time.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a47395174..e251cc108b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -202,14 +202,6 @@ static XLogRecPtr LastRec;
 static XLogRecPtr receivedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
-/*
- * During recovery, lastFullPageWrites keeps track of full_page_writes that
- * the replayed WAL records indicate. It's initialized with full_page_writes
- * that the recovery starting checkpoint record indicates, and then updated
- * each time XLOG_FPW_CHANGE record is replayed.
- */
-static bool lastFullPageWrites;
-
 /*
  * Local copy of SharedRecoveryInProgress variable. True actually means "not
  * known, need to check the shared state".
@@ -6851,11 +6843,7 @@ StartupXLOG(void)
      */
     restoreTwoPhaseData();
 
-    lastFullPageWrites = checkPoint.fullPageWrites;
-
     RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
-    doPageWrites = lastFullPageWrites;
-
     if (RecPtr < checkPoint.redo)
         ereport(PANIC,
                 (errmsg("invalid redo in checkpoint record")));
@@ -7650,16 +7638,6 @@ StartupXLOG(void)
     /* Pre-scan prepared transactions to find out the range of XIDs present */
     oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
-    /*
-     * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-     * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
-     */
-    Insert->fullPageWrites = lastFullPageWrites;
-    LocalSetXLogInsertAllowed();
-    UpdateFullPageWrites();
-    LocalXLogInsertAllowed = -1;
-
     if (InRecovery)
     {
         /*
@@ -7893,6 +7871,13 @@ StartupXLOG(void)
     ControlFile->state = DB_IN_PRODUCTION;
     ControlFile->time = (pg_time_t) time(NULL);
 
+    /*
+     * Set the initial value of shared fullPageWrites. Once
+     * SharedRecoveryInProgress is turned false, checkpointer will update this
+     * value.
+     */
+    XLogCtl->Insert.fullPageWrites = checkPoint.fullPageWrites;
+
     SpinLockAcquire(&XLogCtl->info_lck);
     XLogCtl->SharedRecoveryInProgress = false;
     SpinLockRelease(&XLogCtl->info_lck);
@@ -8754,6 +8739,22 @@ CreateCheckPoint(int flags)
      */
     last_important_lsn = GetLastImportantRecPtr();
 
+    /*
+     * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
+     * the flag actually takes effect. Checkpointer never calls this function
+     * before StartupXLOG() turns off SharedRecoveryInProgress so there's no
+     * window where checkpointer and startup processes - the only updators of
+     * the flag - can update shared FPW simultaneously. Thus no lock is
+     * required here. Both shared and local fullPageWrites do not change
+     * before the next reading below.
+     */
+    if (Insert->fullPageWrites && !fullPageWrites)
+    {
+        XLogBeginInsert();
+        XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
+        XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
+    }
+
     /*
      * We must block concurrent insertions while examining insert state to
      * determine the checkpoint REDO pointer.
@@ -8795,6 +8796,15 @@ CreateCheckPoint(int flags)
     else
         checkPoint.PrevTimeLineID = ThisTimeLineID;
 
+    /*
+     * Update shared flag of fullPageWrites. WALInsertLock ensures that this
+     * affects all WAL records exactly from REDO point. As the result a
+     * checkpoint marked as fpw=true is ensured that all WAL records have full
+     * page image.
+     */
+    if (fullPageWrites != Insert->fullPageWrites)
+        Insert->fullPageWrites = fullPageWrites;
+
     checkPoint.fullPageWrites = Insert->fullPageWrites;
 
     /*
@@ -9642,65 +9652,6 @@ XlogChecksums(ChecksumType new_type)
     XLogInsert(RM_XLOG_ID, XLOG_CHECKSUMS);
 }
 
-/*
- * Update full_page_writes in shared memory, and write an
- * XLOG_FPW_CHANGE record if necessary.
- *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
- */
-void
-UpdateFullPageWrites(void)
-{
-    XLogCtlInsert *Insert = &XLogCtl->Insert;
-
-    /*
-     * Do nothing if full_page_writes has not been changed.
-     *
-     * It's safe to check the shared full_page_writes without the lock,
-     * because we assume that there is no concurrently running process which
-     * can update it.
-     */
-    if (fullPageWrites == Insert->fullPageWrites)
-        return;
-
-    START_CRIT_SECTION();
-
-    /*
-     * It's always safe to take full page images, even when not strictly
-     * required, but not the other round. So if we're setting full_page_writes
-     * to true, first set it true and then write the WAL record. If we're
-     * setting it to false, first write the WAL record and then set the global
-     * flag.
-     */
-    if (fullPageWrites)
-    {
-        WALInsertLockAcquireExclusive();
-        Insert->fullPageWrites = true;
-        WALInsertLockRelease();
-    }
-
-    /*
-     * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
-     * full_page_writes during archive recovery, if required.
-     */
-    if (XLogStandbyInfoActive() && !RecoveryInProgress())
-    {
-        XLogBeginInsert();
-        XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
-
-        XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
-    }
-
-    if (!fullPageWrites)
-    {
-        WALInsertLockAcquireExclusive();
-        Insert->fullPageWrites = false;
-        WALInsertLockRelease();
-    }
-    END_CRIT_SECTION();
-}
-
 /*
  * Check that it's OK to switch to new timeline during recovery.
  *
@@ -10066,9 +10017,6 @@ xlog_redo(XLogReaderState *record)
                 XLogCtl->lastFpwDisableRecPtr = ReadRecPtr;
             SpinLockRelease(&XLogCtl->info_lck);
         }
-
-        /* Keep track of full_page_writes */
-        lastFullPageWrites = fpw;
     }
     else if (info == XLOG_CHECKSUMS)
     {
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index ed6b680ed8..cfb0984100 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -152,8 +152,6 @@ static Node *substitute_actual_parameters(Node *expr, int nargs, List *args,
 static Node *substitute_actual_parameters_mutator(Node *node,
                                      substitute_actual_parameters_context *context);
 static void sql_inline_error_callback(void *arg);
-static Expr *evaluate_expr(Expr *expr, Oid result_type, int32 result_typmod,
-              Oid result_collation);
 static Query *substitute_actual_srf_parameters(Query *expr,
                                  int nargs, List *args);
 static Node *substitute_actual_srf_parameters_mutator(Node *node,
@@ -4842,7 +4840,7 @@ sql_inline_error_callback(void *arg)
  * We use the executor's routine ExecEvalExpr() to avoid duplication of
  * code and ensure we get the same result as the executor would get.
  */
-static Expr *
+Expr *
 evaluate_expr(Expr *expr, Oid result_type, int32 result_typmod,
               Oid result_collation)
 {
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dd0c26c11b..2745c4b3da 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -180,6 +180,8 @@ static Node *makeXmlExpr(XmlExprOp op, char *name, List *named_args,
 static List *mergeTableFuncParameters(List *func_args, List *columns);
 static TypeName *TableFuncTypeName(List *columns);
 static RangeVar *makeRangeVarFromAnyName(List *names, int position, core_yyscan_t yyscanner);
+static Node *makePartRangeDatum(PartitionRangeDatumKind kind, Node *value,
+                                int location);
 static void SplitColQualList(List *qualList,
                              List **constraintList, CollateClause **collClause,
                              core_yyscan_t yyscanner);
@@ -472,7 +474,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>    columnDef columnOptions
 %type <defelt>    def_elem reloption_elem old_aggr_elem operator_def_elem
 %type <node>    def_arg columnElem where_clause where_or_current_clause
-                a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
+                a_expr u_expr b_expr b0_expr c_expr c0_expr
+                AexprConst indirection_el opt_slice_bound
                 columnref in_expr having_clause func_table xmltable array_expr
                 ExclusionWhereClause operator_def_arg
 %type <list>    rowsfrom_item rowsfrom_list opt_col_def_list
@@ -585,7 +588,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partelem>    part_elem
 %type <list>        part_params
 %type <partboundspec> PartitionBoundSpec
-%type <node>        partbound_datum PartitionRangeDatum
+%type <node>        PartitionRangeDatum
 %type <list>        hash_partbound partbound_datum_list range_datum_list
 %type <defelt>        hash_partbound_elem
 
@@ -2804,15 +2807,9 @@ hash_partbound:
             }
         ;
 
-partbound_datum:
-            Sconst            { $$ = makeStringConst($1, @1); }
-            | NumericOnly    { $$ = makeAConst($1, @1); }
-            | NULL_P        { $$ = makeNullAConst(@1); }
-        ;
-
 partbound_datum_list:
-            partbound_datum                        { $$ = list_make1($1); }
-            | partbound_datum_list ',' partbound_datum
+            u_expr                        { $$ = list_make1($1); }
+            | partbound_datum_list ',' u_expr
                                                 { $$ = lappend($1, $3); }
         ;
 
@@ -2825,33 +2822,18 @@ range_datum_list:
 PartitionRangeDatum:
             MINVALUE
                 {
-                    PartitionRangeDatum *n = makeNode(PartitionRangeDatum);
-
-                    n->kind = PARTITION_RANGE_DATUM_MINVALUE;
-                    n->value = NULL;
-                    n->location = @1;
-
-                    $$ = (Node *) n;
+                    $$ = makePartRangeDatum(PARTITION_RANGE_DATUM_MINVALUE,
+                                            NULL, @1);
                 }
             | MAXVALUE
                 {
-                    PartitionRangeDatum *n = makeNode(PartitionRangeDatum);
-
-                    n->kind = PARTITION_RANGE_DATUM_MAXVALUE;
-                    n->value = NULL;
-                    n->location = @1;
-
-                    $$ = (Node *) n;
+                    $$ = makePartRangeDatum(PARTITION_RANGE_DATUM_MAXVALUE,
+                                            NULL, @1);
                 }
-            | partbound_datum
+            | u_expr
                 {
-                    PartitionRangeDatum *n = makeNode(PartitionRangeDatum);
-
-                    n->kind = PARTITION_RANGE_DATUM_VALUE;
-                    n->value = $1;
-                    n->location = @1;
-
-                    $$ = (Node *) n;
+                    $$ = makePartRangeDatum(PARTITION_RANGE_DATUM_VALUE,
+                                            $1, @1);
                 }
         ;
 
@@ -13478,9 +13460,17 @@ a_expr:        c_expr                                    { $$ = $1; }
  * cause trouble in the places where b_expr is used.  For simplicity, we
  * just eliminate all the boolean-keyword-operator productions from b_expr.
  */
-b_expr:        c_expr
-                { $$ = $1; }
-            | b_expr TYPECAST Typename
+b_expr:        c_expr { $$ = $1; }
+            | b0_expr { $$ = $1; }
+        ;
+
+/* u_expr is a subset of b_expr usable along with unreserved keywords */
+u_expr:        c0_expr { $$ = $1; }
+            | b0_expr { $$ = $1; }
+        ;
+
+/* common part of b_expr and u_expr */
+b0_expr:    b_expr TYPECAST Typename
                 { $$ = makeTypeCast($1, $3, @2); }
             | '+' b_expr                    %prec UMINUS
                 { $$ = (Node *) makeSimpleA_Expr(AEXPR_OP, "+", NULL, $2, @1); }
@@ -13554,7 +13544,11 @@ b_expr:        c_expr
  * ambiguity to the b_expr syntax.
  */
 c_expr:        columnref                                { $$ = $1; }
-            | AexprConst                            { $$ = $1; }
+            | c0_expr                                { $$ = $1; }
+        ;
+
+/* common part of c_expr and u_expr */
+c0_expr:         AexprConst                            { $$ = $1; }
             | PARAM opt_indirection
                 {
                     ParamRef *p = makeNode(ParamRef);
@@ -16275,6 +16269,18 @@ makeRangeVarFromAnyName(List *names, int position, core_yyscan_t yyscanner)
     return r;
 }
 
+static Node *
+makePartRangeDatum(PartitionRangeDatumKind kind, Node *value, int location)
+{
+    PartitionRangeDatum *n = makeNode(PartitionRangeDatum);
+
+    n->kind = kind;
+    n->value = value;
+    n->location = location;
+
+    return (Node *) n;
+}
+
 /* Separate Constraint nodes from COLLATE clauses in a ColQualList */
 static void
 SplitColQualList(List *qualList,
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index 0307738946..4e426f2b28 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -506,6 +506,13 @@ check_agglevels_and_constraints(ParseState *pstate, Node *expr)
             else
                 err = _("grouping operations are not allowed in EXECUTE parameters");
 
+            break;
+        case EXPR_KIND_PARTITION_BOUND:
+            if (isAgg)
+                err = _("aggregate functions are not allowed in partition bound");
+            else
+                err = _("grouping operations are not allowed in partition bound");
+
             break;
         case EXPR_KIND_TRIGGER_WHEN:
             if (isAgg)
@@ -909,6 +916,9 @@ transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
         case EXPR_KIND_PARTITION_EXPRESSION:
             err = _("window functions are not allowed in partition key expressions");
             break;
+        case EXPR_KIND_PARTITION_BOUND:
+            err = _("window functions are not allowed in partition bound");
+            break;
         case EXPR_KIND_CALL_ARGUMENT:
             err = _("window functions are not allowed in CALL arguments");
             break;
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 38fbe3366f..a7f3d86f75 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -1850,6 +1850,9 @@ transformSubLink(ParseState *pstate, SubLink *sublink)
         case EXPR_KIND_CALL_ARGUMENT:
             err = _("cannot use subquery in CALL argument");
             break;
+        case EXPR_KIND_PARTITION_BOUNDS:
+            err = _("cannot use subquery in partition bounds");
+            break;
 
             /*
              * There is intentionally no default: case here, so that the
@@ -3474,6 +3477,8 @@ ParseExprKindName(ParseExprKind exprKind)
             return "WHEN";
         case EXPR_KIND_PARTITION_EXPRESSION:
             return "PARTITION BY";
+        case EXPR_KIND_PARTITION_BOUNDS:
+            return "partition bounds";
         case EXPR_KIND_CALL_ARGUMENT:
             return "CALL";
         case EXPR_KIND_MERGE_WHEN_AND:
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 615aee6d15..fef05388b6 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -2306,6 +2306,9 @@ check_srf_call_placement(ParseState *pstate, Node *last_srf, int location)
         case EXPR_KIND_PARTITION_EXPRESSION:
             err = _("set-returning functions are not allowed in partition key expressions");
             break;
+        case EXPR_KIND_PARTITION_BOUND:
+            err = _("set-returning functions are not allowed in partition bounds");
+            break;
         case EXPR_KIND_CALL_ARGUMENT:
             err = _("set-returning functions are not allowed in CALL arguments");
             break;
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index f9f9904bad..0112d22d23 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -48,6 +48,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/planner.h"
 #include "parser/analyze.h"
 #include "parser/parse_clause.h"
@@ -138,8 +139,9 @@ static void transformColumnType(CreateStmtContext *cxt, ColumnDef *column);
 static void setSchemaName(char *context_schema, char **stmt_schema_name);
 static void transformPartitionCmd(CreateStmtContext *cxt, PartitionCmd *cmd);
 static void validateInfiniteBounds(ParseState *pstate, List *blist);
-static Const *transformPartitionBoundValue(ParseState *pstate, A_Const *con,
-                             const char *colName, Oid colType, int32 colTypmod);
+static Const *transformPartitionBoundValue(ParseState *pstate, Node *con,
+                             const char *colName, Oid colType, int32 colTypmod,
+                             Oid colCollation);
 
 
 /*
@@ -3651,6 +3653,7 @@ transformPartitionBound(ParseState *pstate, Relation parent,
         char       *colname;
         Oid            coltype;
         int32        coltypmod;
+        Oid            colcollation;
 
         if (spec->strategy != PARTITION_STRATEGY_LIST)
             ereport(ERROR,
@@ -3670,17 +3673,19 @@ transformPartitionBound(ParseState *pstate, Relation parent,
         /* Need its type data too */
         coltype = get_partition_col_typid(key, 0);
         coltypmod = get_partition_col_typmod(key, 0);
+        colcollation = get_partition_col_collation(key, 0);
 
         result_spec->listdatums = NIL;
         foreach(cell, spec->listdatums)
         {
-            A_Const    *con = castNode(A_Const, lfirst(cell));
+            Node       *expr = (Node *)lfirst (cell);
             Const       *value;
             ListCell   *cell2;
             bool        duplicate;
 
-            value = transformPartitionBoundValue(pstate, con,
-                                                 colname, coltype, coltypmod);
+            value = transformPartitionBoundValue(pstate, expr,
+                                                 colname, coltype, coltypmod,
+                                                 colcollation);
 
             /* Don't add to the result if the value is a duplicate */
             duplicate = false;
@@ -3740,7 +3745,7 @@ transformPartitionBound(ParseState *pstate, Relation parent,
             char       *colname;
             Oid            coltype;
             int32        coltypmod;
-            A_Const    *con;
+            Oid            colcollation;
             Const       *value;
 
             /* Get the column's name in case we need to output an error */
@@ -3758,13 +3763,15 @@ transformPartitionBound(ParseState *pstate, Relation parent,
             /* Need its type data too */
             coltype = get_partition_col_typid(key, i);
             coltypmod = get_partition_col_typmod(key, i);
+            colcollation = get_partition_col_collation(key, i);
 
             if (ldatum->value)
             {
-                con = castNode(A_Const, ldatum->value);
-                value = transformPartitionBoundValue(pstate, con,
+                value = transformPartitionBoundValue(pstate,
+                                                     ldatum->value,
                                                      colname,
-                                                     coltype, coltypmod);
+                                                     coltype, coltypmod,
+                                                     colcollation);
                 if (value->constisnull)
                     ereport(ERROR,
                             (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
@@ -3775,10 +3782,11 @@ transformPartitionBound(ParseState *pstate, Relation parent,
 
             if (rdatum->value)
             {
-                con = castNode(A_Const, rdatum->value);
-                value = transformPartitionBoundValue(pstate, con,
+                value = transformPartitionBoundValue(pstate,
+                                                     rdatum->value,
                                                      colname,
-                                                     coltype, coltypmod);
+                                                     coltype, coltypmod,
+                                                     colcollation);
                 if (value->constisnull)
                     ereport(ERROR,
                             (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
@@ -3845,13 +3853,14 @@ validateInfiniteBounds(ParseState *pstate, List *blist)
  * Transform one constant in a partition bound spec
  */
 static Const *
-transformPartitionBoundValue(ParseState *pstate, A_Const *con,
-                             const char *colName, Oid colType, int32 colTypmod)
+transformPartitionBoundValue(ParseState *pstate, Node *val,
+                             const char *colName, Oid colType, int32 colTypmod,
+                             Oid colCollation)
 {
     Node       *value;
 
-    /* Make it into a Const */
-    value = (Node *) make_const(pstate, &con->val, con->location);
+    /* Transform raw parsetree */
+    value = transformExpr(pstate, val, EXPR_KIND_PARTITION_BOUND);
 
     /* Coerce to correct type */
     value = coerce_to_target_type(pstate,
@@ -3867,21 +3876,32 @@ transformPartitionBoundValue(ParseState *pstate, A_Const *con,
                 (errcode(ERRCODE_DATATYPE_MISMATCH),
                  errmsg("specified value cannot be cast to type %s for column \"%s\"",
                         format_type_be(colType), colName),
-                 parser_errposition(pstate, con->location)));
+                 parser_errposition(pstate, exprLocation(val))));
+
+    /* Fix collations after all else */
+    assign_expr_collations(pstate, value);
+
+    /*
+     * Check for conflict between explict collations. Partition key expression
+     * has precedence over partition bound value.
+     */
+    if (exprCollation(value) != DEFAULT_COLLATION_OID &&
+        colCollation != exprCollation(value))    
+        ereport(ERROR,
+                (errcode(ERRCODE_COLLATION_MISMATCH),
+                 errmsg("collation mismatch between partition key expression (%d) and partition bound value (%d)",
colCollation,exprCollation(value)),
 
+                 parser_errposition(pstate, exprLocation(val))));
+                
 
     /* Simplify the expression, in case we had a coercion */
     if (!IsA(value, Const))
         value = (Node *) expression_planner((Expr *) value);
 
-    /* Fail if we don't have a constant (i.e., non-immutable coercion) */
+    /* Eval if we still don't have a constant (i.e., non-immutable coercion) */
     if (!IsA(value, Const))
-        ereport(ERROR,
-                (errcode(ERRCODE_DATATYPE_MISMATCH),
-                 errmsg("specified value cannot be cast to type %s for column \"%s\"",
-                        format_type_be(colType), colName),
-                 errdetail("The cast requires a non-immutable conversion."),
-                 errhint("Try putting the literal value in single quotes."),
-                 parser_errposition(pstate, con->location)));
-
+        value = (Node *)evaluate_expr((Expr *) value, colType, colTypmod,
+                                      colCollation);
+    
+    Assert(IsA(value, Const));
     return (Const *) value;
 }
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..8b87d139d0 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1359,12 +1359,6 @@ UpdateSharedMemoryConfig(void)
     /* update global shmem state for sync rep */
     SyncRepUpdateSyncStandbysDefined();
 
-    /*
-     * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
-     */
-    UpdateFullPageWrites();
-
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f21870c644..6e4648e94b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -276,7 +276,6 @@ extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
-extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 33c59f9a63..1710b8ce1e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -38,7 +38,7 @@ typedef struct CheckPoint
     TimeLineID    ThisTimeLineID; /* current TLI */
     TimeLineID    PrevTimeLineID; /* previous TLI, if this record begins a new
                                  * timeline (equals ThisTimeLineID otherwise) */
-    bool        fullPageWrites; /* current full_page_writes */
+    bool        fullPageWrites; /* true if all covering WALs are having FPI */
     uint32        nextXidEpoch;    /* higher-order bits of nextXid */
     TransactionId nextXid;        /* next free XID */
     Oid            nextOid;        /* next free OID */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index ba4fa4b68b..4b1a5b96f8 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -85,4 +85,6 @@ extern Node *estimate_expression_value(PlannerInfo *root, Node *node);
 extern Query *inline_set_returning_function(PlannerInfo *root,
                               RangeTblEntry *rte);
 
+extern Expr *evaluate_expr(Expr *expr, Oid result_type, int32 result_typmod,
+                           Oid result_collation);
 #endif                            /* CLAUSES_H */
diff --git a/src/include/parser/parse_node.h b/src/include/parser/parse_node.h
index 3fd2151ccb..9175a32c42 100644
--- a/src/include/parser/parse_node.h
+++ b/src/include/parser/parse_node.h
@@ -70,6 +70,7 @@ typedef enum ParseExprKind
     EXPR_KIND_TRIGGER_WHEN,        /* WHEN condition in CREATE TRIGGER */
     EXPR_KIND_POLICY,            /* USING or WITH CHECK expr in policy */
     EXPR_KIND_PARTITION_EXPRESSION, /* PARTITION BY expression */
+    EXPR_KIND_PARTITION_BOUND,     /* partition bounds value */
     EXPR_KIND_CALL_ARGUMENT        /* procedure argument in CALL */
 } ParseExprKind;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ffffde01da..215f5fa06e 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -660,6 +660,12 @@ get_partition_col_typmod(PartitionKey key, int col)
     return key->parttypmod[col];
 }
 
+static inline Oid
+get_partition_col_collation(PartitionKey key, int col)
+{
+    return key->partcollation[col];
+}
+
 /*
  * RelationGetPartitionDesc
  *        Returns partition descriptor for a relation.
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index e724439037..2080a656e4 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -449,14 +449,6 @@ CREATE TABLE list_parted (
 CREATE TABLE part_1 PARTITION OF list_parted FOR VALUES IN ('1');
 CREATE TABLE part_2 PARTITION OF list_parted FOR VALUES IN (2);
 CREATE TABLE part_null PARTITION OF list_parted FOR VALUES IN (null);
-CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN (int '1');
-ERROR:  syntax error at or near "int"
-LINE 1: ... fail_part PARTITION OF list_parted FOR VALUES IN (int '1');
-                                                              ^
-CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN ('1'::int);
-ERROR:  syntax error at or near "::"
-LINE 1: ...fail_part PARTITION OF list_parted FOR VALUES IN ('1'::int);
-                                                                ^
 -- syntax does not allow empty list of values for list partitions
 CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN ();
 ERROR:  syntax error at or near ")"
@@ -490,12 +482,8 @@ CREATE TABLE moneyp (
     a money
 ) PARTITION BY LIST (a);
 CREATE TABLE moneyp_10 PARTITION OF moneyp FOR VALUES IN (10);
-ERROR:  specified value cannot be cast to type money for column "a"
-LINE 1: ...EATE TABLE moneyp_10 PARTITION OF moneyp FOR VALUES IN (10);
-                                                                   ^
-DETAIL:  The cast requires a non-immutable conversion.
-HINT:  Try putting the literal value in single quotes.
-CREATE TABLE moneyp_10 PARTITION OF moneyp FOR VALUES IN ('10');
+CREATE TABLE moneyp_11 PARTITION OF moneyp FOR VALUES IN ('11');
+CREATE TABLE moneyp_12 PARTITION OF moneyp FOR VALUES IN (to_char(12, '99')::int);
 DROP TABLE moneyp;
 -- immutable cast should work, though
 CREATE TABLE bigintp (
@@ -683,6 +671,26 @@ ERROR:  modulus for hash partition must be a positive integer
 -- remainder must be greater than or equal to zero and less than modulus
 CREATE TABLE fail_part PARTITION OF hash_parted2 FOR VALUES WITH (MODULUS 8, REMAINDER 8);
 ERROR:  remainder for hash partition must be less than modulus
+-- check for collation handling
+CREATE TABLE col_parted (
+    a varchar
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part PARTITION OF col_parted FOR VALUES IN (('a' collate "en_US"));
+ERROR:  collation mismatch between partition key expression (100) and partition bound value (12638)
+LINE 1: ...fail_part PARTITION OF col_parted FOR VALUES IN (('a' collat...
+                                                             ^
+CREATE TABLE success_part PARTITION OF col_parted FOR VALUES IN ('a');
+DROP TABLE col_parted;
+CREATE TABLE col_parted (
+    a varchar collate "en_US"
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part PARTITION OF col_parted FOR VALUES IN (('a' collate "en_GB"));
+ERROR:  collation mismatch between partition key expression (12638) and partition bound value (12631)
+LINE 1: ...fail_part PARTITION OF col_parted FOR VALUES IN (('a' collat...
+                                                             ^
+CREATE TABLE success_part PARTITION OF col_parted FOR VALUES IN ('a');
+CREATE TABLE success_part2 PARTITION OF col_parted FOR VALUES IN (('b' collate "en_US"));
+DROP TABLE col_parted;
 -- check schema propagation from parent
 CREATE TABLE parted (
     a text,
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index 235bef13dc..f739d89a75 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -432,8 +432,6 @@ CREATE TABLE list_parted (
 CREATE TABLE part_1 PARTITION OF list_parted FOR VALUES IN ('1');
 CREATE TABLE part_2 PARTITION OF list_parted FOR VALUES IN (2);
 CREATE TABLE part_null PARTITION OF list_parted FOR VALUES IN (null);
-CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN (int '1');
-CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN ('1'::int);
 
 -- syntax does not allow empty list of values for list partitions
 CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN ();
@@ -458,7 +456,8 @@ CREATE TABLE moneyp (
     a money
 ) PARTITION BY LIST (a);
 CREATE TABLE moneyp_10 PARTITION OF moneyp FOR VALUES IN (10);
-CREATE TABLE moneyp_10 PARTITION OF moneyp FOR VALUES IN ('10');
+CREATE TABLE moneyp_11 PARTITION OF moneyp FOR VALUES IN ('11');
+CREATE TABLE moneyp_12 PARTITION OF moneyp FOR VALUES IN (to_char(12, '99')::int);
 DROP TABLE moneyp;
 
 -- immutable cast should work, though
@@ -620,6 +619,22 @@ CREATE TABLE fail_part PARTITION OF hash_parted2 FOR VALUES WITH (MODULUS 0, REM
 -- remainder must be greater than or equal to zero and less than modulus
 CREATE TABLE fail_part PARTITION OF hash_parted2 FOR VALUES WITH (MODULUS 8, REMAINDER 8);
 
+-- check for collation handling
+CREATE TABLE col_parted (
+    a varchar
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part PARTITION OF col_parted FOR VALUES IN (('a' collate "en_US"));
+CREATE TABLE success_part PARTITION OF col_parted FOR VALUES IN ('a');
+DROP TABLE col_parted;
+
+CREATE TABLE col_parted (
+    a varchar collate "en_US"
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part PARTITION OF col_parted FOR VALUES IN (('a' collate "en_GB"));
+CREATE TABLE success_part PARTITION OF col_parted FOR VALUES IN ('a');
+CREATE TABLE success_part2 PARTITION OF col_parted FOR VALUES IN (('b' collate "en_US"));
+DROP TABLE col_parted;
+
 -- check schema propagation from parent
 
 CREATE TABLE parted (
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Apr 12, 2018 at 10:34:30AM +0900, Kyotaro HORIGUCHI wrote:
> Checkpointer never calls CreateCheckPoint while
> RecoveryInProgress() == true. In other words, checkpointer is not
> an updator of shared FPW at the time StartupXLOG calls
> CreateCheckPoint for fallback_promote.

I have been able to spend a couple of hours on your patch, wrapping my
mind on your stuff.  So what I had in mind was something like this type
of scenario:
1) The startup process requires a restart point.
2) The checkpointer receives the request, and blocks before reading
RecoveryInProgress().
3) A fallback_promote is triggered, making the startup process call
CreateCheckpoint().
4) Startup process finishes checkpoint, updates Insert->fullPageWrites.
5) Checkpoint reads RecoveryInProgress to false, moves on with its
checkpoint.

> The comment may be somewhat confusing that it is written
> there. The point is that checkpointer and StartupXLOG are
> mutually excluded on updating shared FPW by
> SharedRecoveryInProgress flag.

Indeed.  I can see that it is the main key point of the patch.

> | * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
> | * the flag actually takes effect. Checkpointer never calls this function
> | * before StartupXLOG() turns off SharedRecoveryInProgress so there's no
> | * window where checkpointer and startup processes - the only updators of
> | * the flag - can update shared FPW simultaneously. Thus no lock is
> | * required here. Both shared and local fullPageWrites do not change
> | * before the next reading below.

Yeah, this reduces the confusion.

(The latest patch is a mix of two patches.)

+        The default is <literal>on</literal>. The change of the parmeter takes
+        effect at the next checkpoint time.
s/parmeter/parameter/

By the way, I would vote for keeping track in WAL of full_page_writes
switched from off to on.  This is not used in the backend, but that's
still useful for debugging end-user issues.

Actually, I was wondering why a spin lock is not taken in
RecoveryInProgress when reading SharedRecoveryInProgress, but that's
from 1a3d1044 which added a memory barrier instead as guarantee...
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Thu, 12 Apr 2018 14:07:53 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180412050753.GA19289@paquier.xyz>
> On Thu, Apr 12, 2018 at 10:34:30AM +0900, Kyotaro HORIGUCHI wrote:
> > Checkpointer never calls CreateCheckPoint while
> > RecoveryInProgress() == true. In other words, checkpointer is not
> > an updator of shared FPW at the time StartupXLOG calls
> > CreateCheckPoint for fallback_promote.
> 
> I have been able to spend a couple of hours on your patch, wrapping my
> mind on your stuff.  So what I had in mind was something like this type
> of scenario:

Thank for the precise explanation.

The scenario that CreateCheckPoint is called simultaneously in
different prcoesses seems broken by itself in the first place but
I put that aside for now.

> 1) The startup process requires a restart point.
> 2) The checkpointer receives the request, and blocks before reading
> RecoveryInProgress().

RecoveryInProgress doesn't take lock. But I assume here that
checkpointer is taking a long time after entering
RecoveryInProgress and haven't actually read
SharedRecoveryInProgress.

> 3) A fallback_promote is triggered, making the startup process call
> CreateCheckpoint().

I'm confused here. It seems to me that StartupXLOG calls
CreateCheckPoint only in bootstrap or standalone cases. No
concurrent processe is running in the cases.

Even if CreateCheckPoint is called there in the IsUnderPostmater
case, checkpointer never calls CreateCheckPoint during the
checkpoint. This is safe in regard to shared FPW. (but checkpoint
is being blocked in this scenario.)

Assuming that RequestCheckpoint() is called here, the request is
merged with the previous request above and the function returns
after the checkpoint ends. (That is, checkpointer must continue
to run in this case.)

> 4) Startup process finishes checkpoint, updates Insert->fullPageWrites.

According to this scenario, checkpionter is still stalling
now. So it is not a concurrent update.

> 5) Checkpoint reads RecoveryInProgress to false, moves on with its
> checkpoint.

If checkpointer sees SharedRecoveryInProgress being false,
Create(or Request)CheckPoint called in (3) must have finished and
StartupXLOG() no longer updates shared FPW. There's no concurrent
update.

> > The comment may be somewhat confusing that it is written
> > there. The point is that checkpointer and StartupXLOG are
> > mutually excluded on updating shared FPW by
> > SharedRecoveryInProgress flag.
> 
> Indeed.  I can see that it is the main key point of the patch.
> 
> > | * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
> > | * the flag actually takes effect. Checkpointer never calls this function
> > | * before StartupXLOG() turns off SharedRecoveryInProgress so there's no
> > | * window where checkpointer and startup processes - the only updators of
> > | * the flag - can update shared FPW simultaneously. Thus no lock is
> > | * required here. Both shared and local fullPageWrites do not change
> > | * before the next reading below.
> 
> Yeah, this reduces the confusion.

Thanks^^;

> (The latest patch is a mix of two patches.)

Sorry I counld get this.

> +        The default is <literal>on</literal>. The change of the parmeter takes
> +        effect at the next checkpoint time.
> s/parmeter/parameter/
> 
> By the way, I would vote for keeping track in WAL of full_page_writes
> switched from off to on.  This is not used in the backend, but that's
> still useful for debugging end-user issues.

Agreed and I tried that. The problem on that is that some records
can be written after REDO point before XLOG_FPW_CHANGE(true) is
written. However this is no problem for the FPW-related stuff to
work properly (since no one looks it), the FPW record suggests
that the current checkpoint loses FPI in the first several
records. This has a far larger impact with this patch because
shared FPW is always turned on just at REDO point.

So I choosed not to write XLOG_FPW_CHANGE(false) rather than
writing bogus records.

> Actually, I was wondering why a spin lock is not taken in
> RecoveryInProgress when reading SharedRecoveryInProgress, but that's
> from 1a3d1044 which added a memory barrier instead as guarantee...

Maybe it doesn't need barrier, since the flag is initialized as
true and becomes false just once and delay in reading by other
processes doesn't no harm. I think that bool doesn't suffer
atomicity. Even all these are true, some description is needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Apr 12, 2018 at 04:59:10PM +0900, Kyotaro HORIGUCHI wrote:
> At Thu, 12 Apr 2018 14:07:53 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180412050753.GA19289@paquier.xyz>
>> I have been able to spend a couple of hours on your patch, wrapping my
>> mind on your stuff.  So what I had in mind was something like this type
>> of scenario:
>
> Thank for the precise explanation.

Just to be clear and to avoid incorrect conclusion.  This is the type of
scenarios I imagined about when I read your previous email, concluding
such scenarios those cannot apply per the strong assumption on
SharedRecoveryInProgress your patch heavily relies on.  In short I have
no objections.

>> (The latest patch is a mix of two patches.)
>
> Sorry I counld get this.

The patch called v2-0001-Change-FPW-handling.patch posted on
https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp,
which is the latest version available, is a mix of the patch you are
creating for this thread and of a patch aimed a fixing an issue with
partition range handling.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Robert Haas
Дата:
On Wed, Apr 11, 2018 at 7:09 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think the new behavior where the GUC only takes effect at next checkpoint
> is OK. It seems quite intuitive.

I think it may actually be confusing.  If you run pg_ctl reload and it
reports that the value has changed, you'll expect it to have taken
effect.  But really, it will take effect at some later time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
> I think it may actually be confusing.  If you run pg_ctl reload and it
> reports that the value has changed, you'll expect it to have taken
> effect.  But really, it will take effect at some later time.

It is true that sometimes some people like to temporarily disable
full_page_writes particularly when doing some bulk load of data to
minimize the effort on WAL, and then re-enable it just after doing
the inserting this data.

Still does it matter when the change is effective?  By disabling
full_page_writes even temporarily, you accept the fact that this
instance would not be safe until the next checkpoint completes.  The
instance even finishes by writing less unnecessary WAL data if the
change is only effective at the next checkpoint.  Well, it is true that
this increases potential torn pages problems but the user is already
accepting that risk if a crash happens until the next checkpoint then it
exposes itself to torn pages anyway as it chose to disable
full_page_writes.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Apr 13, 2018 at 6:59 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
>> I think it may actually be confusing.  If you run pg_ctl reload and it
>> reports that the value has changed, you'll expect it to have taken
>> effect.  But really, it will take effect at some later time.
>

+1. I also think it is confusing and it could be difficult for end
users to know when the setting is effective.

> It is true that sometimes some people like to temporarily disable
> full_page_writes particularly when doing some bulk load of data to
> minimize the effort on WAL, and then re-enable it just after doing
> the inserting this data.
>
> Still does it matter when the change is effective?  By disabling
> full_page_writes even temporarily, you accept the fact that this
> instance would not be safe until the next checkpoint completes.  The
> instance even finishes by writing less unnecessary WAL data if the
> change is only effective at the next checkpoint.  Well, it is true that
> this increases potential torn pages problems but the user is already
> accepting that risk if a crash happens until the next checkpoint then it
> exposes itself to torn pages anyway as it chose to disable
> full_page_writes.
>

I think this means that is will be difficult for end users to predict
unless they track the next checkpoint which isn't too bad, but won't
be convenient either.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Fri, 13 Apr 2018 08:31:02 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1LVFpLf=d-7XmfwhLv7Xu53pU0bGU=wVrYWSRU4XSsyHQ@mail.gmail.com>
> On Fri, Apr 13, 2018 at 6:59 AM, Michael Paquier <michael@paquier.xyz> wrote:
> > On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
> >> I think it may actually be confusing.  If you run pg_ctl reload and it
> >> reports that the value has changed, you'll expect it to have taken
> >> effect.  But really, it will take effect at some later time.
> >
> 
> +1. I also think it is confusing and it could be difficult for end
> users to know when the setting is effective.
> 
> > It is true that sometimes some people like to temporarily disable
> > full_page_writes particularly when doing some bulk load of data to
> > minimize the effort on WAL, and then re-enable it just after doing
> > the inserting this data.
> >
> > Still does it matter when the change is effective?  By disabling
> > full_page_writes even temporarily, you accept the fact that this
> > instance would not be safe until the next checkpoint completes.  The
> > instance even finishes by writing less unnecessary WAL data if the
> > change is only effective at the next checkpoint.  Well, it is true that
> > this increases potential torn pages problems but the user is already
> > accepting that risk if a crash happens until the next checkpoint then it
> > exposes itself to torn pages anyway as it chose to disable
> > full_page_writes.

I still don't think that enabling FPW anytime is useful but
disabling seems useful as I mentioned upthread.

The problem was checkpointer changes the flag anytime including
recovery time. Startup process updates the same flag at the end
of recovery but before publicated. Letting checkpointer change
the flag only at checkpoint time is a straightforward way to
avoid conflicts with startup process.

I reconsider a bit and came up with the thought that we could
just skip changing shared FPW in checkpointer until recovery
ends, then update the flag after recovery end (perhaps at
checkpoint time in major cases). In this case, FPI is attached
from REDO point of the first checkpoint (not restartpoint) or a
bit earlier, then FPW can be flipped at any time.  I'll come up
with that soon.

> I think this means that is will be difficult for end users to predict
> unless they track the next checkpoint which isn't too bad, but won't
> be convenient either.

Looking checkpiont record is enough to know wheter the checkpoint
is protected by FPW eough, but I agree that such strictness is
not crutial.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Fri, 13 Apr 2018 13:47:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180413.134751.76149471.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 13 Apr 2018 08:31:02 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1LVFpLf=d-7XmfwhLv7Xu53pU0bGU=wVrYWSRU4XSsyHQ@mail.gmail.com>
> > On Fri, Apr 13, 2018 at 6:59 AM, Michael Paquier <michael@paquier.xyz> wrote:
> > > On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
> > >> I think it may actually be confusing.  If you run pg_ctl reload and it
> > >> reports that the value has changed, you'll expect it to have taken
> > >> effect.  But really, it will take effect at some later time.
> > >
> > 
> > +1. I also think it is confusing and it could be difficult for end
> > users to know when the setting is effective.
> > 
> > > It is true that sometimes some people like to temporarily disable
> > > full_page_writes particularly when doing some bulk load of data to
> > > minimize the effort on WAL, and then re-enable it just after doing
> > > the inserting this data.
> > >
> > > Still does it matter when the change is effective?  By disabling
> > > full_page_writes even temporarily, you accept the fact that this
> > > instance would not be safe until the next checkpoint completes.  The
> > > instance even finishes by writing less unnecessary WAL data if the
> > > change is only effective at the next checkpoint.  Well, it is true that
> > > this increases potential torn pages problems but the user is already
> > > accepting that risk if a crash happens until the next checkpoint then it
> > > exposes itself to torn pages anyway as it chose to disable
> > > full_page_writes.
> 
> I still don't think that enabling FPW anytime is useful but
> disabling seems useful as I mentioned upthread.
> 
> The problem was checkpointer changes the flag anytime including
> recovery time. Startup process updates the same flag at the end
> of recovery but before publicated. Letting checkpointer change
> the flag only at checkpoint time is a straightforward way to
> avoid conflicts with startup process.
> 
> I reconsider a bit and came up with the thought that we could
> just skip changing shared FPW in checkpointer until recovery
> ends, then update the flag after recovery end (perhaps at
> checkpoint time in major cases). In this case, FPI is attached
> from REDO point of the first checkpoint (not restartpoint) or a
> bit earlier, then FPW can be flipped at any time.  I'll come up
> with that soon.

Please find the attached. The most significant change is that
UpdateSharedMemoryConfig skips updating of shared fullPageWrites
during recovery. The original crash is fixed since this
guarantees that XLog working area is initializeed before reaching
UpdateFullPageWrites().

Addition to that, I changed CheckpointerMain so that it tries
update of shared FPW regardless of SIGHUP and provided new
function to just wakeup checkpointer. StartupXLOG wakes up
checkpointer either checkpoint is required or not and
checkpointer makes the first update of shared FPW at the time.

After this point, everything works as the same as the current
behavior.

> > I think this means that is will be difficult for end users to predict
> > unless they track the next checkpoint which isn't too bad, but won't
> > be convenient either.
> 
> Looking checkpiont record is enough to know wheter the checkpoint
> is protected by FPW eough, but I agree that such strictness is
> not crutial.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 70cf6a0c0fb4c4e421d8895c197b1bf7fecdaa8f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 13 Apr 2018 16:29:25 +0900
Subject: [PATCH] Inhibit update shared FPW during recovery

The shared full_pages_writes is currently updated by checkpointer even
during recovery. This leads to unexpected concurrent updates of the
flag or a crash by using uninitialized memory. The update is needless
in the first place so this patch inhibits checkpointer from updating
it during recovery. Instead, startup process wakes up checkpointer at
the end of recovery so as to the shared flag is updated just after.
---
 src/backend/access/transam/xlog.c     |  4 +++
 src/backend/postmaster/checkpointer.c | 57 ++++++++++++++++++++++-------------
 src/include/postmaster/bgwriter.h     |  1 +
 3 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08dc9ba031..581844904d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7826,9 +7826,13 @@ StartupXLOG(void)
      * isn't required for consistency, but the last restartpoint might be far
      * back, and in case of a crash, recovering from it might take a longer
      * than is appropriate now that we're not in standby mode anymore.
+     * If no checkpoint need, request checkpointer to process shared
+     * configuration instead.
      */
     if (fast_promoted)
         RequestCheckpoint(CHECKPOINT_FORCE);
+    else
+        WakeupCheckpointer();
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..332eba1c7b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -332,12 +332,6 @@ CheckpointerMain(void)
      */
     PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Ensure all shared memory values are set correctly for the config. Doing
-     * this here ensures no race conditions from other concurrent updaters.
-     */
-    UpdateSharedMemoryConfig();
-
     /*
      * Advertise our latch that backends can use to wake us up while we're
      * sleeping.
@@ -368,20 +362,25 @@ CheckpointerMain(void)
         {
             got_SIGHUP = false;
             ProcessConfigFile(PGC_SIGHUP);
-
-            /*
-             * Checkpointer is the last process to shut down, so we ask it to
-             * hold the keys for a range of other tasks required most of which
-             * have nothing to do with checkpointing at all.
-             *
-             * For various reasons, some config values can change dynamically
-             * so the primary copy of them is held in shared memory to make
-             * sure all backends see the same value.  We make Checkpointer
-             * responsible for updating the shared memory copy if the
-             * parameter setting changes because of SIGHUP.
-             */
-            UpdateSharedMemoryConfig();
         }
+
+        /*
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
+         *
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
+         *
+         * Since some of the shared memory copy are not updated during
+          * recovery and are to be in sync as soon as recovery ends, call
+          * UpdateSharedMemoryConfig() always regardless of SIGHUP.
+         */
+        UpdateSharedMemoryConfig();
+
         if (checkpoint_requested)
         {
             checkpoint_requested = false;
@@ -1090,6 +1089,19 @@ RequestCheckpoint(int flags)
     }
 }
 
+/*
+ * WakeupCheckpointer
+ *
+ * Requests checkpointer to wake up. This causes per-loop processing
+ * regardless of the necessity of a checkpoint.
+ */
+void
+WakeupCheckpointer(void)
+{
+    if (IsUnderPostmaster && CheckpointerShmem->checkpointer_pid != 0)
+        kill(CheckpointerShmem->checkpointer_pid, SIGINT);
+}
+
 /*
  * ForwardFsyncRequest
  *        Forward a file-fsync request from a backend to the checkpointer
@@ -1361,9 +1373,12 @@ UpdateSharedMemoryConfig(void)
 
     /*
      * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
+     * memory and write an XLOG_FPW_CHANGE record. Don't do that during
+     * recovery since we are not allowed to write WAL and startup process has
+     * the priority to update the shared flag.
      */
-    UpdateFullPageWrites();
+    if (!RecoveryInProgress())
+        UpdateFullPageWrites();
 
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..3aac1606a0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -29,6 +29,7 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
 extern void CheckpointerMain(void) pg_attribute_noreturn();
 
 extern void RequestCheckpoint(int flags);
+extern void WakeupCheckpointer(void);
 extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Sorry, the patch attached to the previous main is slightly
old. The attached is the correct one.

# They differ only in some phrase in a comment.

====
At Fri, 13 Apr 2018 17:28:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180413.172840.228724367.horiguchi.kyotaro@lab.ntt.co.jp>
At Fri, 13 Apr 2018 13:47:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180413.134751.76149471.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 13 Apr 2018 08:31:02 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1LVFpLf=d-7XmfwhLv7Xu53pU0bGU=wVrYWSRU4XSsyHQ@mail.gmail.com>
> > On Fri, Apr 13, 2018 at 6:59 AM, Michael Paquier <michael@paquier.xyz> wrote:
> > > On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
> > >> I think it may actually be confusing.  If you run pg_ctl reload and it
> > >> reports that the value has changed, you'll expect it to have taken
> > >> effect.  But really, it will take effect at some later time.
> > >
> > 
> > +1. I also think it is confusing and it could be difficult for end
> > users to know when the setting is effective.
> > 
> > > It is true that sometimes some people like to temporarily disable
> > > full_page_writes particularly when doing some bulk load of data to
> > > minimize the effort on WAL, and then re-enable it just after doing
> > > the inserting this data.
> > >
> > > Still does it matter when the change is effective?  By disabling
> > > full_page_writes even temporarily, you accept the fact that this
> > > instance would not be safe until the next checkpoint completes.  The
> > > instance even finishes by writing less unnecessary WAL data if the
> > > change is only effective at the next checkpoint.  Well, it is true that
> > > this increases potential torn pages problems but the user is already
> > > accepting that risk if a crash happens until the next checkpoint then it
> > > exposes itself to torn pages anyway as it chose to disable
> > > full_page_writes.
> 
> I still don't think that enabling FPW anytime is useful but
> disabling seems useful as I mentioned upthread.
> 
> The problem was checkpointer changes the flag anytime including
> recovery time. Startup process updates the same flag at the end
> of recovery but before publicated. Letting checkpointer change
> the flag only at checkpoint time is a straightforward way to
> avoid conflicts with startup process.
> 
> I reconsider a bit and came up with the thought that we could
> just skip changing shared FPW in checkpointer until recovery
> ends, then update the flag after recovery end (perhaps at
> checkpoint time in major cases). In this case, FPI is attached
> from REDO point of the first checkpoint (not restartpoint) or a
> bit earlier, then FPW can be flipped at any time.  I'll come up
> with that soon.

Please find the attached. The most significant change is that
UpdateSharedMemoryConfig skips updating of shared fullPageWrites
during recovery. The original crash is fixed since this
guarantees that XLog working area is initializeed before reaching
UpdateFullPageWrites().

Addition to that, I changed CheckpointerMain so that it tries
update of shared FPW regardless of SIGHUP and provided new
function to just wakeup checkpointer. StartupXLOG wakes up
checkpointer either checkpoint is required or not and
checkpointer makes the first update of shared FPW at the time.

After this point, everything works as the same as the current
behavior.

> > I think this means that is will be difficult for end users to predict
> > unless they track the next checkpoint which isn't too bad, but won't
> > be convenient either.
> 
> Looking checkpiont record is enough to know wheter the checkpoint
> is protected by FPW eough, but I agree that such strictness is
> not crutial.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b17158ce5cb8f29c587fcf8b47ac1ebb8bac6a70 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 13 Apr 2018 16:29:25 +0900
Subject: [PATCH] Inhibit update shared FPW during recovery

The shared full_pages_writes is currently updated by checkpointer even
during recovery. This leads to unexpected concurrent updates of the
flag or a crash by using uninitialized memory. The update is needless
in the first place so this patch inhibits checkpointer from updating
it during recovery. Instead, startup process wakes up checkpointer at
the end of recovery so as to the shared flag is updated just after.
---
 src/backend/access/transam/xlog.c     |  4 +++
 src/backend/postmaster/checkpointer.c | 58 ++++++++++++++++++++++-------------
 src/include/postmaster/bgwriter.h     |  1 +
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08dc9ba031..d518cc4a60 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7826,9 +7826,13 @@ StartupXLOG(void)
      * isn't required for consistency, but the last restartpoint might be far
      * back, and in case of a crash, recovering from it might take a longer
      * than is appropriate now that we're not in standby mode anymore.
+     * If checkpoint is not need, request checkpointer to process shared
+     * configuration instead.
      */
     if (fast_promoted)
         RequestCheckpoint(CHECKPOINT_FORCE);
+    else
+        WakeupCheckpointer();
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..3e14f537ef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -332,12 +332,6 @@ CheckpointerMain(void)
      */
     PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Ensure all shared memory values are set correctly for the config. Doing
-     * this here ensures no race conditions from other concurrent updaters.
-     */
-    UpdateSharedMemoryConfig();
-
     /*
      * Advertise our latch that backends can use to wake us up while we're
      * sleeping.
@@ -368,20 +362,26 @@ CheckpointerMain(void)
         {
             got_SIGHUP = false;
             ProcessConfigFile(PGC_SIGHUP);
-
-            /*
-             * Checkpointer is the last process to shut down, so we ask it to
-             * hold the keys for a range of other tasks required most of which
-             * have nothing to do with checkpointing at all.
-             *
-             * For various reasons, some config values can change dynamically
-             * so the primary copy of them is held in shared memory to make
-             * sure all backends see the same value.  We make Checkpointer
-             * responsible for updating the shared memory copy if the
-             * parameter setting changes because of SIGHUP.
-             */
-            UpdateSharedMemoryConfig();
         }
+
+        /*
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
+         *
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
+         *
+         * Since some of the shared memory copy are in sync only after
+          * recovery ends, skipped updated are not refrected until the next
+          * SIGHUP. So call UpdateSharedMemoryConfig() regardless of SIGHUP in
+          * order to sync them without waiting the next reload.
+         */
+        UpdateSharedMemoryConfig();
+
         if (checkpoint_requested)
         {
             checkpoint_requested = false;
@@ -1090,6 +1090,19 @@ RequestCheckpoint(int flags)
     }
 }
 
+/*
+ * WakeupCheckpointer
+ *
+ * Requests checkpointer to wake up. This causes per-loop processing
+ * regardless of the necessity of a checkpoint.
+ */
+void
+WakeupCheckpointer(void)
+{
+    if (IsUnderPostmaster && CheckpointerShmem->checkpointer_pid != 0)
+        kill(CheckpointerShmem->checkpointer_pid, SIGINT);
+}
+
 /*
  * ForwardFsyncRequest
  *        Forward a file-fsync request from a backend to the checkpointer
@@ -1361,9 +1374,12 @@ UpdateSharedMemoryConfig(void)
 
     /*
      * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
+     * memory and write an XLOG_FPW_CHANGE record. Don't do that during
+     * recovery since we are not allowed to write WAL and startup process has
+     * the priority to update the shared flag.
      */
-    UpdateFullPageWrites();
+    if (!RecoveryInProgress())
+        UpdateFullPageWrites();
 
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..3aac1606a0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -29,6 +29,7 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
 extern void CheckpointerMain(void) pg_attribute_noreturn();
 
 extern void RequestCheckpoint(int flags);
+extern void WakeupCheckpointer(void);
 extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Robert Haas
Дата:
On Thu, Apr 12, 2018 at 9:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
> Still does it matter when the change is effective?

I don't really care deeply about when the change takes effect, but I
do care about whether the time when the system *says* the change took
effect is the same as when it *actually* took effect.  If those aren't
the same, it's confusing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Apr 13, 2018 at 10:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 12, 2018 at 9:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
>> Still does it matter when the change is effective?
>
> I don't really care deeply about when the change takes effect, but I
> do care about whether the time when the system *says* the change took
> effect is the same as when it *actually* took effect.  If those aren't
> the same, it's confusing.
>

So, what in your opinion is the way to deal with this?  If we make it
a PGC_POSTMASTER parameter, it will have a very clear behavior and
users don't need to bother whether they have a risk of torn page
problem or not and as a side-impact the code will be simplified as
well.  However, as Michael said the people who get the benefit of this
option by disabling/enabling this parameter might complain.  Keeping
it as a SIGHUP option has the drawback that even after the user has
enabled it, there is a risk of torn pages.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Robert Haas
Дата:
On Wed, Apr 18, 2018 at 10:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Apr 13, 2018 at 10:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Apr 12, 2018 at 9:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
>>> Still does it matter when the change is effective?
>>
>> I don't really care deeply about when the change takes effect, but I
>> do care about whether the time when the system *says* the change took
>> effect is the same as when it *actually* took effect.  If those aren't
>> the same, it's confusing.
>>
>
> So, what in your opinion is the way to deal with this?  If we make it
> a PGC_POSTMASTER parameter, it will have a very clear behavior and
> users don't need to bother whether they have a risk of torn page
> problem or not and as a side-impact the code will be simplified as
> well.  However, as Michael said the people who get the benefit of this
> option by disabling/enabling this parameter might complain.  Keeping
> it as a SIGHUP option has the drawback that even after the user has
> enabled it, there is a risk of torn pages.

I would just document the risks.  If the documentation says that you
can't rely on the value until after the next checkpoint, or whatever
the rule is, then I think we're fine.  I don't think that we really
have the infrastructure to do any better; if we try, we'll just end up
with odd warts.  Documenting the current set of warts is less churn
and less work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Wed, Apr 18, 2018 at 10:52:51AM -0400, Robert Haas wrote:
> I would just document the risks.  If the documentation says that you
> can't rely on the value until after the next checkpoint, or whatever
> the rule is, then I think we're fine.  I don't think that we really
> have the infrastructure to do any better; if we try, we'll just end up
> with odd warts.  Documenting the current set of warts is less churn
> and less work.

The last version of the patch proposed has eaten this diff which was
part of one of the past versions (v2-0001-Change-FPW-handling.patch from
https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp):
+        The default is <literal>on</literal>. The change of the parameter takes
+        effect at the next checkpoint time.
So there were some documentation about the beHavior change for what it's
worth.

And, er, actually, I was thinking again about the case where a user
wants to disable full_page_writes temporarily to do some bulk load and
then re-enable it.  With the patch proposed to actually update the FPW
effect at checkpoint time, then a user would need to issue a manual
checkpoint after updating the configuration and reloading, which may
create more I/O than he'd want to pay for, then a second checkpoint
would need to be issued after the configuration comes back again.  That
would cause a regression which could surprise a class of users.  WAL and
FPW overhead is a problem which shows up a lot when doing bulk-loading
of data.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Thu, Apr 19, 2018 at 7:19 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Apr 18, 2018 at 10:52:51AM -0400, Robert Haas wrote:
>> I would just document the risks.  If the documentation says that you
>> can't rely on the value until after the next checkpoint, or whatever
>> the rule is, then I think we're fine.  I don't think that we really
>> have the infrastructure to do any better; if we try, we'll just end up
>> with odd warts.  Documenting the current set of warts is less churn
>> and less work.
>
> The last version of the patch proposed has eaten this diff which was
> part of one of the past versions (v2-0001-Change-FPW-handling.patch from
> https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp):
> +        The default is <literal>on</literal>. The change of the parameter takes
> +        effect at the next checkpoint time.
> So there were some documentation about the beHavior change for what it's
> worth.
>
> And, er, actually, I was thinking again about the case where a user
> wants to disable full_page_writes temporarily to do some bulk load and
> then re-enable it.  With the patch proposed to actually update the FPW
> effect at checkpoint time, then a user would need to issue a manual
> checkpoint after updating the configuration and reloading, which may
> create more I/O than he'd want to pay for, then a second checkpoint
> would need to be issued after the configuration comes back again.
>

Why a second checkpoint?  One checkpoint either manual or automatic
should be enough to make the setting effective.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Apr 19, 2018 at 07:11:43PM +0530, Amit Kapila wrote:
> On Thu, Apr 19, 2018 at 7:19 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> And, er, actually, I was thinking again about the case where a user
>> wants to disable full_page_writes temporarily to do some bulk load and
>> then re-enable it.  With the patch proposed to actually update the FPW
>> effect at checkpoint time, then a user would need to issue a manual
>> checkpoint after updating the configuration and reloading, which may
>> create more I/O than he'd want to pay for, then a second checkpoint
>> would need to be issued after the configuration comes back again.
>
> Why a second checkpoint?  One checkpoint either manual or automatic
> should be enough to make the setting effective.

I was thinking about cases where users have say hourly cron jobs in
charge of doing some maintenance of update cleanups, where they would
need to be sure that full_page_writes are back online after doing the
bulk-load.  In this case an extra checkpoint would be necessary to make
the parameter update effective.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
By the way, I think I found a bug of FPW.

The following steps yields INSERT record that doesn't have a FPI
after a checkpoint.

(Start server with full_page_writes = off)
CREATE TABLE t (a int);
CHECKPOINT;
INSERT INTO t VALUES (1);
ALTER SYSTEM SET full_page_writes TO on;
SELECT pg_reload_conf();
CHECKPOINT;
INSERT INTO t VALUES (1);

The last insert is expected to write a record with FPI but it
doesn't actually. No FPI will be written for the page after that.

It seems that the reason is that XLogInsertRecord is forgetting
to check doPageWrites' update.

In the failure case, fpw_lsn has been set by XLogRecordAssemble
but doPageWrite is false at the time and it considers that no FPI
is required then it sets fpw_lsn to InvalidXLogRecPtr.

After that, XLogInsertRecord receives the record but it thinks
that doPageWrites is true since it looks the shared value.

> if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)

So this line thinks that "no FPI is omitted in this record" but
actually the record is just forgotten to attach them.

The attached patch lets XLogInsertRecord check if doPageWrites
has been turned on after the last call and cause recomputation in
the case.

> * If there are any registered buffers, and a full-page image was not taken
> * of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
> * signals that the assembled record is only good for insertion on the
> * assumption that the RedoRecPtr and doPageWrites values were up-to-date.

And the patch fixes one comment typo of XLogInsertRecord.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 40ce98bba0496b1eb0a982134eae9cec01d532a8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 20 Apr 2018 15:01:26 +0900
Subject: [PATCH] Correctly attach FPI to the first record after a checkpoint

XLogInsert fails to attach a required FPIs to the first record after a
checkpoint if no other record has been written after full_page_writes
was turned on. This makes incosistency between fpw flag of the
checkpoint record and the following record's FPW status. This patch
makes XLogInsertRecord to cause a recomputation when the given record
is generated during FPW is off but found that the flag has been turned
on.
---
 src/backend/access/transam/xlog.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08dc9ba031..27753f7321 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -937,7 +937,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  *
  * If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
  * WAL record applies to, that were not included in the record as full page
- * images.  If fpw_lsn >= RedoRecPtr, the function does not perform the
+ * images.  If fpw_lsn <= RedoRecPtr, the function does not perform the
  * insertion and returns InvalidXLogRecPtr.  The caller can then recalculate
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
@@ -970,6 +970,7 @@ XLogInsertRecord(XLogRecData *rdata,
                                info == XLOG_SWITCH);
     XLogRecPtr    StartPos;
     XLogRecPtr    EndPos;
+    bool        prevDoPageWrites = doPageWrites;
 
     /* we assume that all of the record header is in the first chunk */
     Assert(rdata->len >= SizeOfXLogRecord);
@@ -1022,7 +1023,8 @@ XLogInsertRecord(XLogRecData *rdata,
      * This can only happen just after a checkpoint, so it's better to be slow
      * in this case and fast otherwise.
      *
-     * If we aren't doing full-page writes then RedoRecPtr doesn't actually
+     * If doPageWrites was just turned on, we force a recomputation. Otherwise
+     * if we aren't doing full-page writes then RedoRecPtr doesn't actually
      * affect the contents of the XLOG record, so we'll update our local copy
      * but not force a recomputation.  (If doPageWrites was just turned off,
      * we could recompute the record without full pages, but we choose not to
@@ -1035,7 +1037,9 @@ XLogInsertRecord(XLogRecData *rdata,
     }
     doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
 
-    if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
+    if (doPageWrites &&
+        (!prevDoPageWrites ||
+         (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
     {
         /*
          * Oops, some buffer now needs to be backed up that the caller didn't
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
I noticed that the previous patch is a mixture with another
patch.. sorry.

At Thu, 19 Apr 2018 19:11:43 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1LZ8UBfOMWMHAQGZ0_5N7aWPOXgUYTvCV+J2SXkpmrjRg@mail.gmail.com>
> On Thu, Apr 19, 2018 at 7:19 AM, Michael Paquier <michael@paquier.xyz> wrote:
> > On Wed, Apr 18, 2018 at 10:52:51AM -0400, Robert Haas wrote:
> >> I would just document the risks.  If the documentation says that you
> >> can't rely on the value until after the next checkpoint, or whatever
> >> the rule is, then I think we're fine.  I don't think that we really
> >> have the infrastructure to do any better; if we try, we'll just end up
> >> with odd warts.  Documenting the current set of warts is less churn
> >> and less work.
> >
> > The last version of the patch proposed has eaten this diff which was
> > part of one of the past versions (v2-0001-Change-FPW-handling.patch from
> > https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp):
> > +        The default is <literal>on</literal>. The change of the parameter takes
> > +        effect at the next checkpoint time.
> > So there were some documentation about the beHavior change for what it's
> > worth.
> >
> > And, er, actually, I was thinking again about the case where a user
> > wants to disable full_page_writes temporarily to do some bulk load and
> > then re-enable it.  With the patch proposed to actually update the FPW
> > effect at checkpoint time, then a user would need to issue a manual
> > checkpoint after updating the configuration and reloading, which may
> > create more I/O than he'd want to pay for, then a second checkpoint
> > would need to be issued after the configuration comes back again.
> >
> 
> Why a second checkpoint?  One checkpoint either manual or automatic
> should be enough to make the setting effective.

One significant point in my first patch is anyway the FPW is
useless untill the second checkpoint starts. In the sense of the
timing when *useful* FPW comes back, the second checkpoint is
required regardless of the patch. As Amit said, there is no
difference whether it is manual or automatic.

On the other hand the timing when FPW is actually turned off is
different. Focusing on user's view, one can run bulkload
immediately under the current behavior but should wait for the
next checkpoint starts with the first patch, which can be caused
manually but may be delayed after the currently running
checkpoint if any.

I think that starting the *first* checkpoint before bulkload is
not such a bother but on the other hand the behavior can lead to
FPW flood for those who are accostomed to the current behavior.

The attached first patch is the bugfix proposed in this thread.
The second fixes the cocurrent update problem only.

The third changes the behavior so that turning-on happenes only
on checkpoints and turning-off happens any time.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9b3496ceb6f39b017698225fef289974bd01fb07 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 20 Apr 2018 15:01:26 +0900
Subject: [PATCH 1/3] Correctly attach FPI to the first record after a
 checkpoint

XLogInsert fails to attach a required FPIs to the first record after a
checkpoint if no other record has been written after full_page_writes
was turned on. This makes incosistency between fpw flag of the
checkpoint record and the following record's FPW status. This patch
makes XLogInsertRecord to cause a recomputation when the given record
is generated during FPW is off but found that the flag has been turned
on.
---
 src/backend/access/transam/xlog.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08dc9ba031..27753f7321 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -937,7 +937,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  *
  * If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
  * WAL record applies to, that were not included in the record as full page
- * images.  If fpw_lsn >= RedoRecPtr, the function does not perform the
+ * images.  If fpw_lsn <= RedoRecPtr, the function does not perform the
  * insertion and returns InvalidXLogRecPtr.  The caller can then recalculate
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
@@ -970,6 +970,7 @@ XLogInsertRecord(XLogRecData *rdata,
                                info == XLOG_SWITCH);
     XLogRecPtr    StartPos;
     XLogRecPtr    EndPos;
+    bool        prevDoPageWrites = doPageWrites;
 
     /* we assume that all of the record header is in the first chunk */
     Assert(rdata->len >= SizeOfXLogRecord);
@@ -1022,7 +1023,8 @@ XLogInsertRecord(XLogRecData *rdata,
      * This can only happen just after a checkpoint, so it's better to be slow
      * in this case and fast otherwise.
      *
-     * If we aren't doing full-page writes then RedoRecPtr doesn't actually
+     * If doPageWrites was just turned on, we force a recomputation. Otherwise
+     * if we aren't doing full-page writes then RedoRecPtr doesn't actually
      * affect the contents of the XLOG record, so we'll update our local copy
      * but not force a recomputation.  (If doPageWrites was just turned off,
      * we could recompute the record without full pages, but we choose not to
@@ -1035,7 +1037,9 @@ XLogInsertRecord(XLogRecData *rdata,
     }
     doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
 
-    if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
+    if (doPageWrites &&
+        (!prevDoPageWrites ||
+         (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
     {
         /*
          * Oops, some buffer now needs to be backed up that the caller didn't
-- 
2.16.3

From cd6f3162ed9ed678a61b6080a40f004920684b55 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 20 Apr 2018 17:27:49 +0900
Subject: [PATCH 2/3] step1

---
 src/backend/access/transam/xlog.c     | 19 ++++++++-----
 src/backend/postmaster/checkpointer.c | 51 ++++++++++++++++++++++++-----------
 src/include/catalog/pg_control.h      |  2 +-
 src/include/postmaster/bgwriter.h     |  1 +
 4 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 27753f7321..72cc880139 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7580,14 +7580,15 @@ StartupXLOG(void)
     oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
 
     /*
-     * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-     * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
+     * Update full_page_writes in shared memory with the lastest value before
+     * resource manager writes cleanup WAL records or checkpoint record is
+     * written. We don't need to write XLOG_FPW_CHANGE since this just
+     * reflects the status at the last redo'ed record. No lock is required
+     * since startup is the only updator of the flag at this
+     * point. Checkpointer will take over after SharedRecoveryInProgress is
+     * turned false.
      */
     Insert->fullPageWrites = lastFullPageWrites;
-    LocalSetXLogInsertAllowed();
-    UpdateFullPageWrites();
-    LocalXLogInsertAllowed = -1;
 
     if (InRecovery)
     {
@@ -7829,10 +7830,14 @@ StartupXLOG(void)
      * If this was a fast promotion, request an (online) checkpoint now. This
      * isn't required for consistency, but the last restartpoint might be far
      * back, and in case of a crash, recovering from it might take a longer
-     * than is appropriate now that we're not in standby mode anymore.
+     * than is appropriate now that we're not in standby mode anymore.  If
+     * checkpoint is not needed, tell checkpointer to update shared
+     * configuration instead.
      */
     if (fast_promoted)
         RequestCheckpoint(CHECKPOINT_FORCE);
+    else
+        WakeupCheckpointer();
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4b452e7cee..23d404f082 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -368,20 +368,26 @@ CheckpointerMain(void)
         {
             got_SIGHUP = false;
             ProcessConfigFile(PGC_SIGHUP);
-
-            /*
-             * Checkpointer is the last process to shut down, so we ask it to
-             * hold the keys for a range of other tasks required most of which
-             * have nothing to do with checkpointing at all.
-             *
-             * For various reasons, some config values can change dynamically
-             * so the primary copy of them is held in shared memory to make
-             * sure all backends see the same value.  We make Checkpointer
-             * responsible for updating the shared memory copy if the
-             * parameter setting changes because of SIGHUP.
-             */
-            UpdateSharedMemoryConfig();
         }
+
+        /*
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
+         *
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
+         *
+         * Since some of the shared memory copy are in sync only after
+          * recovery ends, skipped updated are not refrected until the next
+          * SIGHUP. So call UpdateSharedMemoryConfig() regardless of SIGHUP in
+          * order to sync them without waiting the next reload.
+         */
+        UpdateSharedMemoryConfig();
+
         if (checkpoint_requested)
         {
             checkpoint_requested = false;
@@ -1090,6 +1096,19 @@ RequestCheckpoint(int flags)
     }
 }
 
+/*
+ * WakeupCheckpointer
+ *
+ * Requests checkpointer to wake up. This causes per-loop processing
+ * regardless of the necessity of a checkpoint.
+ */
+void
+WakeupCheckpointer(void)
+{
+    if (IsUnderPostmaster && CheckpointerShmem->checkpointer_pid != 0)
+        kill(CheckpointerShmem->checkpointer_pid, SIGINT);
+}
+
 /*
  * ForwardFsyncRequest
  *        Forward a file-fsync request from a backend to the checkpointer
@@ -1361,9 +1380,11 @@ UpdateSharedMemoryConfig(void)
 
     /*
      * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
+     * memory and write an XLOG_FPW_CHANGE record. We don't do this until
+     * recovery ends.
      */
-    UpdateFullPageWrites();
+    if (!RecoveryInProgress())
+        UpdateFullPageWrites();
 
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 773d9e6eba..a7e22e0f5f 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -38,7 +38,7 @@ typedef struct CheckPoint
     TimeLineID    ThisTimeLineID; /* current TLI */
     TimeLineID    PrevTimeLineID; /* previous TLI, if this record begins a new
                                  * timeline (equals ThisTimeLineID otherwise) */
-    bool        fullPageWrites; /* current full_page_writes */
+    bool        fullPageWrites; /* full_page_writes at REDO start point */
     uint32        nextXidEpoch;    /* higher-order bits of nextXid */
     TransactionId nextXid;        /* next free XID */
     Oid            nextOid;        /* next free OID */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..3aac1606a0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -29,6 +29,7 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
 extern void CheckpointerMain(void) pg_attribute_noreturn();
 
 extern void RequestCheckpoint(int flags);
+extern void WakeupCheckpointer(void);
 extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-- 
2.16.3

From 8cb3bb6cf49d64eee0f27fb9f2264abf0cf89f70 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 20 Apr 2018 17:28:56 +0900
Subject: [PATCH 3/3] step2

---
 doc/src/sgml/config.sgml              |  5 ++++-
 src/backend/access/transam/xlog.c     | 15 ++++++++++++---
 src/backend/postmaster/checkpointer.c | 14 ++++----------
 src/include/access/xlog.h             |  2 +-
 4 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d5f2d23c4..d39f0c12d4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2598,7 +2598,10 @@ include_dir 'conf.d'
        <para>
         This parameter can only be set in the <filename>postgresql.conf</filename>
         file or on the server command line.
-        The default is <literal>on</literal>.
+
+        The default is <literal>on</literal>. Turning of of this variable is
+        immediately takes effect and turning on takes effect at the next
+        checkpoint time.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 72cc880139..efb3abd979 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8678,6 +8678,9 @@ CreateCheckPoint(int flags)
      */
     last_important_lsn = GetLastImportantRecPtr();
 
+    /* update fullPageWrites in shared memory if turned on */
+    UpdateFullPageWrites(true);
+
     /*
      * We must block concurrent insertions while examining insert state to
      * determine the checkpoint REDO pointer.
@@ -9554,11 +9557,13 @@ XLogReportParameters(void)
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
+ * Processes only turning on if turn_on is true or turning off if not.
+ *
+ * Note: this function assumes there is no other process running concurrently
+ * that could update it.
  */
 void
-UpdateFullPageWrites(void)
+UpdateFullPageWrites(bool turn_on)
 {
     XLogCtlInsert *Insert = &XLogCtl->Insert;
 
@@ -9572,6 +9577,10 @@ UpdateFullPageWrites(void)
     if (fullPageWrites == Insert->fullPageWrites)
         return;
 
+    /* return if we don't handle the change now  */
+    if ((turn_on && !fullPageWrites) || (!turn_on && fullPageWrites))
+        return;
+
     START_CRIT_SECTION();
 
     /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 23d404f082..60d1f6b3b0 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -332,12 +332,6 @@ CheckpointerMain(void)
      */
     PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Ensure all shared memory values are set correctly for the config. Doing
-     * this here ensures no race conditions from other concurrent updaters.
-     */
-    UpdateSharedMemoryConfig();
-
     /*
      * Advertise our latch that backends can use to wake us up while we're
      * sleeping.
@@ -1379,12 +1373,12 @@ UpdateSharedMemoryConfig(void)
     SyncRepUpdateSyncStandbysDefined();
 
     /*
-     * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record. We don't do this until
-     * recovery ends.
+     * If full_page_writes has been turned off by SIGHUP, we update it in
+     * shared memory and write an XLOG_FPW_CHANGE record. We don't do this
+     * until recovery ends.
      */
     if (!RecoveryInProgress())
-        UpdateFullPageWrites();
+        UpdateFullPageWrites(false);
 
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 421ba6d775..dc90d64a2c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -270,7 +270,7 @@ extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
-extern void UpdateFullPageWrites(void);
+extern void UpdateFullPageWrites(bool turn_on);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Apr 20, 2018 at 11:40 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> By the way, I think I found a bug of FPW.
>
> The following steps yields INSERT record that doesn't have a FPI
> after a checkpoint.
>
> (Start server with full_page_writes = off)
> CREATE TABLE t (a int);
> CHECKPOINT;
> INSERT INTO t VALUES (1);
> ALTER SYSTEM SET full_page_writes TO on;
> SELECT pg_reload_conf();
> CHECKPOINT;
> INSERT INTO t VALUES (1);
>
> The last insert is expected to write a record with FPI but it
> doesn't actually. No FPI will be written for the page after that.
>
> It seems that the reason is that XLogInsertRecord is forgetting
> to check doPageWrites' update.
>
> In the failure case, fpw_lsn has been set by XLogRecordAssemble
> but doPageWrite is false at the time and it considers that no FPI
> is required then it sets fpw_lsn to InvalidXLogRecPtr.
>
> After that, XLogInsertRecord receives the record but it thinks
> that doPageWrites is true since it looks the shared value.
>
>> if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
>
> So this line thinks that "no FPI is omitted in this record" but
> actually the record is just forgotten to attach them.
>
> The attached patch lets XLogInsertRecord check if doPageWrites
> has been turned on after the last call and cause recomputation in
> the case.
>

I think you have correctly spotted the problem and your fix looks good
to me.  As this is a separate problem and fix is different from what
we are discussing here, I think this can be committed it separately.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Robert Haas
Дата:
On Wed, Apr 18, 2018 at 9:49 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Apr 18, 2018 at 10:52:51AM -0400, Robert Haas wrote:
>> I would just document the risks.  If the documentation says that you
>> can't rely on the value until after the next checkpoint, or whatever
>> the rule is, then I think we're fine.  I don't think that we really
>> have the infrastructure to do any better; if we try, we'll just end up
>> with odd warts.  Documenting the current set of warts is less churn
>> and less work.
>
> The last version of the patch proposed has eaten this diff which was
> part of one of the past versions (v2-0001-Change-FPW-handling.patch from
> https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp):
> +        The default is <literal>on</literal>. The change of the parameter takes
> +        effect at the next checkpoint time.
> So there were some documentation about the beHavior change for what it's
> worth.

Fine, but that doesn't answer the question of whether we actually need
to or should change the behavior in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Mon, Apr 23, 2018 at 12:21:04PM -0400, Robert Haas wrote:
> Fine, but that doesn't answer the question of whether we actually need
> to or should change the behavior in the first place.

Per the last arguments that would be "No, we don't want to change it as
it would surprise some users":
https://www.postgresql.org/message-id/20180420010402.GF2024@paquier.xyz
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Tue, 24 Apr 2018 08:52:17 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180423235217.GB1570@paquier.xyz>
> On Mon, Apr 23, 2018 at 12:21:04PM -0400, Robert Haas wrote:
> > Fine, but that doesn't answer the question of whether we actually need
> > to or should change the behavior in the first place.
> 
> Per the last arguments that would be "No, we don't want to change it as
> it would surprise some users":
> https://www.postgresql.org/message-id/20180420010402.GF2024@paquier.xyz

The answer is that the change of behavior is not required to fix
the bug. So I'm fine with applying only (0001 and) 0002 here.

https://www.postgresql.org/message-id/20180420010402.GF2024@paquier.xyz

One reason that I introduced the "restriction" was that it was
useful to avoid concurrent udpate of the flag. It is now avoided
in another way.

Just my opinion on the behavior follows.

I don't think anyone confirms that FPI come back after switching
full_page_writes (one of the reason is it needs pg_waldump).
pg_start/stop_backup() on standby says that "You need to turn on
FPW, then do checkpoint". It suggests that FPI's don't work
before the next checkpoint. If we keep the current behavior, the
documentation might need an additional phrase something like "FPW
ensures that data is protected from partial writes after the next
chackpoint starts". On the other hand honestly I don't have
confidence that the WAL reduction worth the additional complexity
by 0003.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Apr 20, 2018 at 6:06 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Apr 20, 2018 at 11:40 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> By the way, I think I found a bug of FPW.
>>
>> The following steps yields INSERT record that doesn't have a FPI
>> after a checkpoint.
>>
>> (Start server with full_page_writes = off)
>> CREATE TABLE t (a int);
>> CHECKPOINT;
>> INSERT INTO t VALUES (1);
>> ALTER SYSTEM SET full_page_writes TO on;
>> SELECT pg_reload_conf();
>> CHECKPOINT;
>> INSERT INTO t VALUES (1);
>>
>> The last insert is expected to write a record with FPI but it
>> doesn't actually. No FPI will be written for the page after that.
>>
>> It seems that the reason is that XLogInsertRecord is forgetting
>> to check doPageWrites' update.
>>
>> In the failure case, fpw_lsn has been set by XLogRecordAssemble
>> but doPageWrite is false at the time and it considers that no FPI
>> is required then it sets fpw_lsn to InvalidXLogRecPtr.
>>
>> After that, XLogInsertRecord receives the record but it thinks
>> that doPageWrites is true since it looks the shared value.
>>
>>> if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
>>
>> So this line thinks that "no FPI is omitted in this record" but
>> actually the record is just forgotten to attach them.
>>
>> The attached patch lets XLogInsertRecord check if doPageWrites
>> has been turned on after the last call and cause recomputation in
>> the case.
>>
>
> I think you have correctly spotted the problem and your fix looks good
> to me.  As this is a separate problem and fix is different from what
> we are discussing here, I think this can be committed it separately.
>

AFAICS, this problem has been introduced by commit
2c03216d831160bedd72d45f712601b6f7d03f1c, so we should backpatch till
9.5.  Please find attached the slightly modified patch for this bug.
I have modified one of the comments in the patch and the proposed
commit message.  Can you please once cross-verify if the problem exits
till 9.5 and that this patch fixes it?  Also, I don't see an easy way
to write a test for it, do you have anything in mind?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Tue, Apr 24, 2018 at 7:10 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 24 Apr 2018 08:52:17 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180423235217.GB1570@paquier.xyz>
>> On Mon, Apr 23, 2018 at 12:21:04PM -0400, Robert Haas wrote:
>> > Fine, but that doesn't answer the question of whether we actually need
>> > to or should change the behavior in the first place.
>>
>> Per the last arguments that would be "No, we don't want to change it as
>> it would surprise some users":
>> https://www.postgresql.org/message-id/20180420010402.GF2024@paquier.xyz
>
> The answer is that the change of behavior is not required to fix
> the bug. So I'm fine with applying only (0001 and) 0002 here.
>

I have just responded to your first patch (0001).  Can you once again
summarize what the 0002 exactly accomplishes?  I think one of the
goals is to fix the original problem reported in this thread and other
is you have found the concurrency issue.  Is it possible to have
separate patches for those or you think they are interrelated and
needs to be fixed together?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Sat, Jul 28, 2018 at 07:10:24PM +0530, Amit Kapila wrote:
> I have just responded to your first patch (0001).  Can you once again
> summarize what the 0002 exactly accomplishes?  I think one of the
> goals is to fix the original problem reported in this thread and other
> is you have found the concurrency issue.  Is it possible to have
> separate patches for those or you think they are interrelated and
> needs to be fixed together?

That would be nice.  The last time I read this thread I have been rather
confused about what was being discussed, what were the set of problems,
and what was being fixed.  Speaking of which, this is one of the bugfix
patches I wanted to look at once I have untanggled the autovacuum one
for temporary relations and the DOS issues with lock lookups.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Thank you, Amit, Michael.

At Sun, 29 Jul 2018 08:19:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180728231911.GB1471@paquier.xyz>
> On Sat, Jul 28, 2018 at 07:10:24PM +0530, Amit Kapila wrote:
> > I have just responded to your first patch (0001).  Can you once again
> > summarize what the 0002 exactly accomplishes?  I think one of the
> > goals is to fix the original problem reported in this thread and other
> > is you have found the concurrency issue.  Is it possible to have
> > separate patches for those or you think they are interrelated and
> > needs to be fixed together?
> 
> That would be nice.  The last time I read this thread I have been rather
> confused about what was being discussed, what were the set of problems,
> and what was being fixed.  Speaking of which, this is one of the bugfix
> patches I wanted to look at once I have untanggled the autovacuum one
> for temporary relations and the DOS issues with lock lookups.

It's a long time ago.. Let me have a bit of time to blow dust off..

https://www.postgresql.org/message-id/20180420.173354.43303926.horiguchi.kyotaro@lab.ntt.co.jp

...done..i working..

The original problem here was UpdateFullPageWrites() can call
RecoveryInProgress(), which does palloc in a critical
section. This happens when standy is commanded to reload with
change of full_pages_writes during recovery.

While exploring it, I found that update of fullPageWrite flag is
updated concurrently against its design. A race condition happens
in the following steps in my diagnosis.

1. While the startup process is running recovery, we didn't
 consider that checkpointer may be running concurrently, but it
 happens for standby.

2. Checkpointer considers itself (or designed) as the *only*
 updator of shared config including fillPageWrites. In reality
 the startup is another concurent updator on standby. Addition to
 that, checkpointer assumes that it is started under WAL-emitting
 state, that is, InitXLogInsert() has been called elsewhere, but
 it is not the case on standby.

 Note that checkpointer mustn't update FPW flag before recovery
 ends because startup will overrides the flag.

3. As the result, when standby gets full_page_writes changed and
 SIGHUP during recovery, checkpointer tries to update FPW flag
 and calls RecoveryInProgress() on the way. bang!


With the 0002-step1.patch, checkpointer really becomes the only
updator of the FPW flag after recovery ends. StartupXLOG()
updates it just once before checkpointer starts to update it.

Under the normal(slow?) promotion mode, checkpointer is woken up
explicitly to make the config setting effective as soon as
possible. (This might be unnecessary.)

In checkpointer, RecoveryInProgress() is called preior to
UpdateFPW() to disable update FPW during recovery, so the crash
that is the issue here is fixed.

FYI I think that 0002-tesp2.patch is rejected.

Please find the attacehd revised patch. It is rebased and
provided with a commit message.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c839e039d4ebb06c0dd345f7992a460a27827805 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 1 Aug 2018 15:29:01 +0900
Subject: [PATCH] Fix FPW flags updates during recovery.

Each of checkpointer and startup processes thinks it is the only
updator of fullPageWrites flag ignorantly of each other. This causes
race condition and leads checkpionter to a crash in a certain case.

This patch makes chackpionter the only updator and organize fix the
sequence around recovery end in order to fullPageWrite flag is
correctly updated.
---
 src/backend/access/transam/xlog.c     | 19 ++++++++-----
 src/backend/postmaster/checkpointer.c | 50 ++++++++++++++++++++++++-----------
 src/include/postmaster/bgwriter.h     |  1 +
 3 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..f315d11745 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7692,14 +7692,15 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
     /*
-     * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-     * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
+     * Update full_page_writes in shared memory with the lastest value before
+     * resource manager writes cleanup WAL records or checkpoint record is
+     * written. We don't need to write XLOG_FPW_CHANGE since this just
+     * reflects the status at the last redo'ed record. No lock is required
+     * since startup is the only updator of the flag at this
+     * point. Checkpointer will take over after SharedRecoveryInProgress is
+     * turned to false.
      */
     Insert->fullPageWrites = lastFullPageWrites;
-    LocalSetXLogInsertAllowed();
-    UpdateFullPageWrites();
-    LocalXLogInsertAllowed = -1;
 
     if (InRecovery)
     {
@@ -7941,10 +7942,14 @@ StartupXLOG(void)
      * If this was a fast promotion, request an (online) checkpoint now. This
      * isn't required for consistency, but the last restartpoint might be far
      * back, and in case of a crash, recovering from it might take a longer
-     * than is appropriate now that we're not in standby mode anymore.
+     * than is appropriate now that we're not in standby mode anymore.  If
+     * checkpoint is not needed, tell checkpointer to update shared
+     * configuration instead.
      */
     if (fast_promoted)
         RequestCheckpoint(CHECKPOINT_FORCE);
+    else
+        WakeupCheckpointer();
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index de1b22d045..892f597cde 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -358,20 +358,25 @@ CheckpointerMain(void)
         {
             got_SIGHUP = false;
             ProcessConfigFile(PGC_SIGHUP);
-
-            /*
-             * Checkpointer is the last process to shut down, so we ask it to
-             * hold the keys for a range of other tasks required most of which
-             * have nothing to do with checkpointing at all.
-             *
-             * For various reasons, some config values can change dynamically
-             * so the primary copy of them is held in shared memory to make
-             * sure all backends see the same value.  We make Checkpointer
-             * responsible for updating the shared memory copy if the
-             * parameter setting changes because of SIGHUP.
-             */
-            UpdateSharedMemoryConfig();
         }
+
+        /*
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
+         *
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
+         *
+         * Since some of the shared variables are omitted during
+          * recovery. UpdateSharedMemoryConfig() is needed regardless of SIGHUP
+          * in order to keep them in sync without waiting the next reload.
+         */
+        UpdateSharedMemoryConfig();
+
         if (checkpoint_requested)
         {
             checkpoint_requested = false;
@@ -1080,6 +1085,19 @@ RequestCheckpoint(int flags)
     }
 }
 
+/*
+ * WakeupCheckpointer
+ *
+ * Requests checkpointer to wake up. This causes per-loop processing
+ * regardless of the necessity of a checkpoint.
+ */
+void
+WakeupCheckpointer(void)
+{
+    if (IsUnderPostmaster && CheckpointerShmem->checkpointer_pid != 0)
+        kill(CheckpointerShmem->checkpointer_pid, SIGINT);
+}
+
 /*
  * ForwardFsyncRequest
  *        Forward a file-fsync request from a backend to the checkpointer
@@ -1351,9 +1369,11 @@ UpdateSharedMemoryConfig(void)
 
     /*
      * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
+     * memory and write an XLOG_FPW_CHANGE record. We don't do this until
+     * recovery ends.
      */
-    UpdateFullPageWrites();
+    if (!RecoveryInProgress())
+        UpdateFullPageWrites();
 
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..3aac1606a0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -29,6 +29,7 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
 extern void CheckpointerMain(void) pg_attribute_noreturn();
 
 extern void RequestCheckpoint(int flags);
+extern void WakeupCheckpointer(void);
 extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Wed, Aug 1, 2018 at 12:56 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Thank you, Amit, Michael.
>

Can you verify the first patch that I have posted above [1]?  We can
commit it separately.

>
> It's a long time ago.. Let me have a bit of time to blow dust off..
>
> https://www.postgresql.org/message-id/20180420.173354.43303926.horiguchi.kyotaro@lab.ntt.co.jp
>
> ...done..i working..
>
> The original problem here was UpdateFullPageWrites() can call
> RecoveryInProgress(), which does palloc in a critical
> section. This happens when standy is commanded to reload with
> change of full_pages_writes during recovery.
>

AFAIU from the original problem reported by Dilip, it can happen
during checkpoint without standby as well.

> While exploring it, I found that update of fullPageWrite flag is
> updated concurrently against its design. A race condition happens
> in the following steps in my diagnosis.
>
> 1. While the startup process is running recovery, we didn't
>  consider that checkpointer may be running concurrently, but it
>  happens for standby.
>
> 2. Checkpointer considers itself (or designed) as the *only*
>  updator of shared config including fillPageWrites. In reality
>  the startup is another concurent updator on standby. Addition to
>  that, checkpointer assumes that it is started under WAL-emitting
>  state, that is, InitXLogInsert() has been called elsewhere, but
>  it is not the case on standby.
>
>  Note that checkpointer mustn't update FPW flag before recovery
>  ends because startup will overrides the flag.
>
> 3. As the result, when standby gets full_page_writes changed and
>  SIGHUP during recovery, checkpointer tries to update FPW flag
>  and calls RecoveryInProgress() on the way. bang!
>
>
> With the 0002-step1.patch, checkpointer really becomes the only
> updator of the FPW flag after recovery ends. StartupXLOG()
> updates it just once before checkpointer starts to update it.
>

- * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
- * record before resource manager writes cleanup WAL records or checkpoint
- * record is written.
+ * Update full_page_writes in shared memory with the lastest value before
+ * resource manager writes cleanup WAL records or checkpoint record is
+ * written. We don't need to write XLOG_FPW_CHANGE since this just
+ * reflects the status at the last redo'ed record. No lock is required
+ * since startup is the only updator of the flag at this
+ * point. Checkpointer will take over after SharedRecoveryInProgress is
+ * turned to false.
  */
  Insert->fullPageWrites = lastFullPageWrites;
- LocalSetXLogInsertAllowed();
- UpdateFullPageWrites();
- LocalXLogInsertAllowed = -1;

lastFullPageWrites will contain the latest value among the replayed
WAL records.  It can still be different fullPageWrites which is
updated by UpdateFullPageWrites().  So, I don't think it is advisable
to remove it and rely on checkpointer to update it.  I think when it
is called from this code path, it will anyway not write
XLOG_FPW_CHANGE because of the !RecoveryInProgress() check.

> Under the normal(slow?) promotion mode, checkpointer is woken up
> explicitly to make the config setting effective as soon as
> possible. (This might be unnecessary.)
>

I am not sure this is the right approach.  If we are worried about
concurrency issues, then we can use lock to protect it.

> In checkpointer, RecoveryInProgress() is called preior to
> UpdateFPW() to disable update FPW during recovery, so the crash
> that is the issue here is fixed.
>

It seems to me that you are trying to resolve two different problems
in the same patch.  I think we can have a patch to deal with
concurrency issue if any and a separate patch to call
RecoveryInProgress outside critical section.

[1] - https://www.postgresql.org/message-id/CAA4eK1JvKxsHfM6GCHoKNas-7vDSniLgaAm%3DcG_OaQaOYRNc3w%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Sat, 25 Aug 2018 14:50:53 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K7dVgKU4zrNTSCW=EoqALG38XmNT0HK9Wdkr935iwTQg@mail.gmail.com>
> On Wed, Aug 1, 2018 at 12:56 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > Thank you, Amit, Michael.
> >
> 
> Can you verify the first patch that I have posted above [1]?  We can
> commit it separately.

Thanks for prompting. The difference is in a comment and I'm fine
with the change.


> > It's a long time ago.. Let me have a bit of time to blow dust off..
> >
> > https://www.postgresql.org/message-id/20180420.173354.43303926.horiguchi.kyotaro@lab.ntt.co.jp
> >
> > ...done..i working..
> >
> > The original problem here was UpdateFullPageWrites() can call
> > RecoveryInProgress(), which does palloc in a critical
> > section. This happens when standy is commanded to reload with
> > change of full_pages_writes during recovery.
> >
> 
> AFAIU from the original problem reported by Dilip, it can happen
> during checkpoint without standby as well.

Yes, standby is not needed but archive (streaming) recovery is
required to start checkpointer.

> > While exploring it, I found that update of fullPageWrite flag is
> > updated concurrently against its design. A race condition happens
> > in the following steps in my diagnosis.
> >
> > 1. While the startup process is running recovery, we didn't
> >  consider that checkpointer may be running concurrently, but it
> >  happens for standby.
> >
> > 2. Checkpointer considers itself (or designed) as the *only*
> >  updator of shared config including fillPageWrites. In reality
> >  the startup is another concurent updator on standby. Addition to
> >  that, checkpointer assumes that it is started under WAL-emitting
> >  state, that is, InitXLogInsert() has been called elsewhere, but
> >  it is not the case on standby.
> >
> >  Note that checkpointer mustn't update FPW flag before recovery
> >  ends because startup will overrides the flag.
> >
> > 3. As the result, when standby gets full_page_writes changed and
> >  SIGHUP during recovery, checkpointer tries to update FPW flag
> >  and calls RecoveryInProgress() on the way. bang!
> >
> >
> > With the 0002-step1.patch, checkpointer really becomes the only
> > updator of the FPW flag after recovery ends. StartupXLOG()
> > updates it just once before checkpointer starts to update it.
> >
> 
> - * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
> - * record before resource manager writes cleanup WAL records or checkpoint
> - * record is written.
> + * Update full_page_writes in shared memory with the lastest value before
> + * resource manager writes cleanup WAL records or checkpoint record is
> + * written. We don't need to write XLOG_FPW_CHANGE since this just
> + * reflects the status at the last redo'ed record. No lock is required
> + * since startup is the only updator of the flag at this
> + * point. Checkpointer will take over after SharedRecoveryInProgress is
> + * turned to false.
>   */
>   Insert->fullPageWrites = lastFullPageWrites;
> - LocalSetXLogInsertAllowed();
> - UpdateFullPageWrites();
> - LocalXLogInsertAllowed = -1;
> 
> lastFullPageWrites will contain the latest value among the replayed
> WAL records.  It can still be different fullPageWrites which is
> updated by UpdateFullPageWrites().  So, I don't think it is advisable
> to remove it and rely on checkpointer to update it.  I think when it
> is called from this code path, it will anyway not write
> XLOG_FPW_CHANGE because of the !RecoveryInProgress() check.

Unfortunately startup doesn't process reloads by SIGHUP. So just
letting startup process set sharedFPW doesn't work correctly. I
don't think reload during redo loop will be apparently
safe. Forcibly reloading config without SIGHUP just before
UpdateFullPageWrites() in StartupXLOG breaks the reload
semantics.

# The comment for StartupPorcessSigHagndler is wrong in the sense..

> > Under the normal(slow?) promotion mode, checkpointer is woken up
> > explicitly to make the config setting effective as soon as
> > possible. (This might be unnecessary.)
> >
> 
> I am not sure this is the right approach.  If we are worried about
> concurrency issues, then we can use lock to protect it.

Since only checkpointer knows the right current value of the
flag, it is responsible for the final (just after recovery end)
setup of the flag. Actually we don't need to wake up checkpoiner
as soon as the end of recovery, but it must
UpdateFullPageWrites() before the first checkpoint starts without
receiving SIGHUP.


> > In checkpointer, RecoveryInProgress() is called preior to
> > UpdateFPW() to disable update FPW during recovery, so the crash
> > that is the issue here is fixed.
> >
> 
> It seems to me that you are trying to resolve two different problems
> in the same patch.  I think we can have a patch to deal with
> concurrency issue if any and a separate patch to call
> RecoveryInProgress outside critical section.

Hmm. Perhaps right. The change of UpdateShareMemoryConfig alone
resolves the initial issue and others are needed to have the flag
set correctly in the problematic scenario. I don't mind split
them. Please find the attached.

As a separate issue, postmater routes SIGHUP to startup process
but it just wakes up the process and doesn't cause config
reloading.  On the other hand walreceiver, the only waker of the
process AFAICS, doesn't use SIGHUP and calls WakeupRecovery()
directly. So startup process might should just ignore SIGHUP for
clarity..

The attached third patch does that but leaving the SIGHUP routing
in postmaster.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7e635cc2512699d2324eaf0451a28ecbe4838181 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Aug 2018 18:48:46 +0900
Subject: [PATCH 1/3] Let checkpointer not update shared FPW flag during
 recovery

If reloaded happens during recovery, checkpointer may try to update
the shared FPW flag and causes crash in certain timing. This patch
fixes the crash by hihibiting the update during recovery.
---
 src/backend/postmaster/checkpointer.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..011ff1d442 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1345,9 +1345,11 @@ UpdateSharedMemoryConfig(void)
 
     /*
      * If full_page_writes has been changed by SIGHUP, we update it in shared
-     * memory and write an XLOG_FPW_CHANGE record.
+     * memory and write an XLOG_FPW_CHANGE record. We cannot modify it before
+     * recovery ends since startup process can change it.
      */
-    UpdateFullPageWrites();
+    if (!RecoveryInProgress())
+        UpdateFullPageWrites();
 
     elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
-- 
2.16.3

From d3bf39688ae575656bec7db3b441c7c004392598 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Aug 2018 18:53:20 +0900
Subject: [PATCH 2/3] Set correct value of full_page_write reloaded during
 recovery

If full_page_write was edited in config file then reloaded during
recovery, the change will be postponed until the next reload. This
patch fixes that.
---
 src/backend/access/transam/xlog.c     | 19 +++++++++------
 src/backend/postmaster/checkpointer.c | 44 ++++++++++++++++++++++++-----------
 src/include/postmaster/bgwriter.h     |  1 +
 3 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..f315d11745 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7692,14 +7692,15 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
     /*
-     * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-     * record before resource manager writes cleanup WAL records or checkpoint
-     * record is written.
+     * Update full_page_writes in shared memory with the lastest value before
+     * resource manager writes cleanup WAL records or checkpoint record is
+     * written. We don't need to write XLOG_FPW_CHANGE since this just
+     * reflects the status at the last redo'ed record. No lock is required
+     * since startup is the only updator of the flag at this
+     * point. Checkpointer will take over after SharedRecoveryInProgress is
+     * turned to false.
      */
     Insert->fullPageWrites = lastFullPageWrites;
-    LocalSetXLogInsertAllowed();
-    UpdateFullPageWrites();
-    LocalXLogInsertAllowed = -1;
 
     if (InRecovery)
     {
@@ -7941,10 +7942,14 @@ StartupXLOG(void)
      * If this was a fast promotion, request an (online) checkpoint now. This
      * isn't required for consistency, but the last restartpoint might be far
      * back, and in case of a crash, recovering from it might take a longer
-     * than is appropriate now that we're not in standby mode anymore.
+     * than is appropriate now that we're not in standby mode anymore.  If
+     * checkpoint is not needed, tell checkpointer to update shared
+     * configuration instead.
      */
     if (fast_promoted)
         RequestCheckpoint(CHECKPOINT_FORCE);
+    else
+        WakeupCheckpointer();
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 011ff1d442..63bb4904ef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -358,20 +358,25 @@ CheckpointerMain(void)
         {
             got_SIGHUP = false;
             ProcessConfigFile(PGC_SIGHUP);
-
-            /*
-             * Checkpointer is the last process to shut down, so we ask it to
-             * hold the keys for a range of other tasks required most of which
-             * have nothing to do with checkpointing at all.
-             *
-             * For various reasons, some config values can change dynamically
-             * so the primary copy of them is held in shared memory to make
-             * sure all backends see the same value.  We make Checkpointer
-             * responsible for updating the shared memory copy if the
-             * parameter setting changes because of SIGHUP.
-             */
-            UpdateSharedMemoryConfig();
         }
+
+        /*
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
+         *
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
+         *
+         * Since some of the shared variables are omitted during
+         * recovery. UpdateSharedMemoryConfig() is needed regardless of SIGHUP
+         * in order to keep them in sync without waiting the next reload.
+         */
+        UpdateSharedMemoryConfig();
+
         if (checkpoint_requested)
         {
             checkpoint_requested = false;
@@ -1074,6 +1079,19 @@ RequestCheckpoint(int flags)
     }
 }
 
+/*
+ * WakeupCheckpointer
+ *
+ * Requests checkpointer to wake up. This causes per-loop processing
+ * regardless of the necessity of a checkpoint.
+ */
+void
+WakeupCheckpointer(void)
+{
+    if (IsUnderPostmaster && CheckpointerShmem->checkpointer_pid != 0)
+        kill(CheckpointerShmem->checkpointer_pid, SIGINT);
+}
+
 /*
  * ForwardFsyncRequest
  *        Forward a file-fsync request from a backend to the checkpointer
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d..3aac1606a0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -29,6 +29,7 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
 extern void CheckpointerMain(void) pg_attribute_noreturn();
 
 extern void RequestCheckpoint(int flags);
+extern void WakeupCheckpointer(void);
 extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-- 
2.16.3

From 4cc5e950c126368cc619447c8d548c9de788ca7f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Aug 2018 19:31:04 +0900
Subject: [PATCH 3/3] Let startup process ignore SIGHUP.

SIGHUP does nothing on startup process. Ignore it.
---
 src/backend/postmaster/startup.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 2926211e35..124c0cc1ac 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -110,18 +110,6 @@ StartupProcTriggerHandler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
-/* SIGHUP: set flag to re-read config file at next convenient time */
-static void
-StartupProcSigHupHandler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    WakeupRecovery();
-
-    errno = save_errno;
-}
-
 /* SIGTERM: set flag to abort redo and exit */
 static void
 StartupProcShutdownHandler(SIGNAL_ARGS)
@@ -175,7 +163,7 @@ StartupProcessMain(void)
     /*
      * Properly accept or ignore signals the postmaster might send us.
      */
-    pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
+    pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
     pqsignal(SIGINT, SIG_IGN);    /* ignore query cancel */
     pqsignal(SIGTERM, StartupProcShutdownHandler);    /* request shutdown */
     pqsignal(SIGQUIT, startupproc_quickdie);    /* hard crash time */
-- 
2.16.3


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Aug 28, 2018 at 07:34:36PM +0900, Kyotaro HORIGUCHI wrote:
> Thanks for prompting. The difference is in a comment and I'm fine
> with the change.

/*
 * Properly accept or ignore signals the postmaster might send us.
 */
-       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
+       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */

I am finally coming back to this patch set, and that's one of the first
things I am going to help moving on for this CF.  And this bit from the
last patch series is not acceptable as there are some parameters which
are used by the startup process which can be reloaded.  One of them is
wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Tue, Aug 28, 2018 at 4:05 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Hello.
>
> At Sat, 25 Aug 2018 14:50:53 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K7dVgKU4zrNTSCW=EoqALG38XmNT0HK9Wdkr935iwTQg@mail.gmail.com>
> > On Wed, Aug 1, 2018 at 12:56 PM Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > >
> > > Thank you, Amit, Michael.
> > >
> >
> > Can you verify the first patch that I have posted above [1]?  We can
> > commit it separately.
>
> Thanks for prompting. The difference is in a comment and I'm fine
> with the change.
>

Thanks, but what I wanted you to verify is the commit that broke it in
9.5.  On again looking at it, I think it is below code in commit
2076db2aea that caused this problem.  If possible, can you once test
it before and at this commit or at least do the manual review of same
to cross-verify?

+       doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
-       /*
-        * Also check to see if fullPageWrites or forcePageWrites was
just turned
-        * on; if we weren't already doing full-page writes then go back and
-        * recompute. (If it was just turned off, we could recompute the record
-        * without full pages, but we choose not to bother.)
-        */
-       if ((Insert->fullPageWrites || Insert->forcePageWrites) &&
!doPageWrites)
+       if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr &&
doPageWrites)
        {
-               /* Oops, must redo it with full-page data. */
+               /*
+                * Oops, some buffer now needs to be backed up that the caller
+                * didn't back up.  Start over.
+                */
                WALInsertLockRelease();
                END_CRIT_SECTION();
-               rdt_lastnormal->next = NULL;
-               info = info_orig;
-               goto begin;
+               return InvalidXLogRecPtr;
        }


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Mon, Sep 10, 2018 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thanks, but what I wanted you to verify is the commit that broke it in
> 9.5.  On again looking at it, I think it is below code in commit
> 2076db2aea that caused this problem.  If possible, can you once test
> it before and at this commit or at least do the manual review of same
> to cross-verify?
>

I have myself investigated this further and found that this problem
started occuring due to code rearrangement in commits 2c03216d83 and
2076db2aea.  In commit 2076db2aea, below check for comparing the old
and new value for fullPageWrites got changed:

> -       if ((Insert->fullPageWrites || Insert->forcePageWrites) &&
> !doPageWrites)
> +       if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr &&
> doPageWrites)
>         {

However, it alone didn't lead to this problem because
XLogRecordAssemble() gives the valid value of fpw_lsn irrespective of
whether full-page-writes was enabled or not. Later in commit
2c03216d83, we changed XLogRecordAssemble() such that it will give the
valid value of fpw_lsn only if doPageWrites is enabled, basically this
part of the commit:

+ /* Determine if this block needs to be backed up */
+ if (regbuf->flags & REGBUF_FORCE_IMAGE)
+ needs_backup = true;
+ else if (regbuf->flags & REGBUF_NO_IMAGE)
+ needs_backup = false;
+ else if (!doPageWrites)
+ needs_backup = false;
+ else
  {
- /* Simple data, just include it */
- len += rdt->len;
+ /*
+ * We assume page LSN is first data on *every* page that can be
+ * passed to XLogInsert, whether it has the standard page layout
+ * or not.
+ */
+ XLogRecPtr page_lsn = PageGetLSN(regbuf->page);
+
+ needs_backup = (page_lsn <= RedoRecPtr);
+ if (!needs_backup)
+ {
+ if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
+ *fpw_lsn = page_lsn;
+ }
  }

So, the problem started appearing after some rearrangement of code in
both the above-mentioned commits.  I verified that this problem
doesn't exist in versions <=9.4, so backpatch-through 9.5.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 13, 2018 at 04:38:30PM +0530, Amit Kapila wrote:
> So, the problem started appearing after some rearrangement of code in
> both the above-mentioned commits.  I verified that this problem
> doesn't exist in versions <=9.4, so backpatch-through 9.5.

Thanks Amit for taking care of this first problem.  I am going to send
another patch which is able to take care of concurrent updates of
Insert->fullPageWrites for the checkpointer and the startup process
to fix the original issue reported by Dilip Kumar, so as we are able to
close definitely the loop on this thread.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> /*
>  * Properly accept or ignore signals the postmaster might send us.
>  */
> -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
>
> I am finally coming back to this patch set, and that's one of the first
> things I am going to help moving on for this CF.  And this bit from the
> last patch series is not acceptable as there are some parameters which
> are used by the startup process which can be reloaded.  One of them is
> wal_retrieve_retry_interval for tuning when fetching WAL at recovery.

So, I have been working on this problem again and I have reviewed the
thread, and there have been many things discussed in the last couple of
months:
1) We do not want to initialize XLogInsert stuff unconditionally for all
processes at the moment recovery begins, but we just want to initialize
it once WAL write is open for business.
2) Both the checkpointer and the startup process can call
UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
incorrect values.
3) We do not want a palloc() in a critical section because of
RecoveryinProgress being called.

And the root issue here is 2), because the checkpointer tries to update
Insert->fullPageWrites but it does not need to do so until recovery has
been finished.  So in order to fix the original issue I am proposing a
simple fix: let's make sure that the checkpointer does not update
Insert->fullPageWrites until recovery finishes, and let's have the
startup process do the first update once it finishes recovery and
inserts by itself the XLOG_PARAMETER_CHANGE.  This way the order of
events is purely sequential and we don't need to worry about having the
checkpointer and the startup process eat on each other's plate because
the checkpointer would only try to work on updating the shared memory
value of full_page_writes once SharedRecoveryInProgress is switched to
true, and that happens after the startup process does its initial call
to UpdateFullPageWrites().  I have improved as well all the comments
around to make clear the behavior wanted.

Thoughts?
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> > /*
> >  * Properly accept or ignore signals the postmaster might send us.
> >  */
> > -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> > +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> >
> > I am finally coming back to this patch set, and that's one of the first
> > things I am going to help moving on for this CF.  And this bit from the
> > last patch series is not acceptable as there are some parameters which
> > are used by the startup process which can be reloaded.  One of them is
> > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
>
> So, I have been working on this problem again and I have reviewed the
> thread, and there have been many things discussed in the last couple of
> months:
> 1) We do not want to initialize XLogInsert stuff unconditionally for all
> processes at the moment recovery begins, but we just want to initialize
> it once WAL write is open for business.
> 2) Both the checkpointer and the startup process can call
> UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
> incorrect values.

Can you share the steps to reproduce this problem?

> 3) We do not want a palloc() in a critical section because of
> RecoveryinProgress being called.
>
> And the root issue here is 2), because the checkpointer tries to update
> Insert->fullPageWrites but it does not need to do so until recovery has
> been finished.  So in order to fix the original issue I am proposing a
> simple fix: let's make sure that the checkpointer does not update
> Insert->fullPageWrites until recovery finishes, and let's have the
> startup process do the first update once it finishes recovery and
> inserts by itself the XLOG_PARAMETER_CHANGE.  This way the order of
> events is purely sequential and we don't need to worry about having the
> checkpointer and the startup process eat on each other's plate because
> the checkpointer would only try to work on updating the shared memory
> value of full_page_writes once SharedRecoveryInProgress is switched to
> true, and that happens after the startup process does its initial call
> to UpdateFullPageWrites().  I have improved as well all the comments
> around to make clear the behavior wanted.
>
> Thoughts?
>

 UpdateFullPageWrites(void)
 {
  XLogCtlInsert *Insert = &XLogCtl->Insert;
+ /*
+ * Check if recovery is still in progress before entering this critical
+ * section, as some memory allocation could happen at the end of
+ * recovery.  There is nothing to do for a system still in recovery.
+ * Note that we need to process things here at the end of recovery for
+ * the startup process, which is why this checks after InRecovery.
+ */
+ if (RecoveryInProgress() && !InRecovery)
+ return;
+

On a regular startup when there is no recovery, it won't allow us to
log the WAL record (XLOG_FPW_CHANGE) which can happen without above
change.  You can check that by setting full_page_writes=off and start
the system.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Thu, 6 Sep 2018 16:37:28 -0700, Michael Paquier <michael@paquier.xyz> wrote in <20180906233728.GR2726@paquier.xyz>
> On Tue, Aug 28, 2018 at 07:34:36PM +0900, Kyotaro HORIGUCHI wrote:
> > Thanks for prompting. The difference is in a comment and I'm fine
> > with the change.
> 
> /*
>  * Properly accept or ignore signals the postmaster might send us.
>  */
> -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> 
> I am finally coming back to this patch set, and that's one of the first
> things I am going to help moving on for this CF.  And this bit from the
> last patch series is not acceptable as there are some parameters which
> are used by the startup process which can be reloaded.  One of them is
> wal_retrieve_retry_interval for tuning when fetching WAL at recovery.

The third patch actually is not mandatory in the patch set. The
only motive of that is it doesn't nothing. The handler for SIGHUP
just sets got_SIGHUP then wakes up the process, and the variable
is not looked up by no one. If you mind the change, it can be
removed with no side effect.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Tue, 18 Sep 2018 11:38:50 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180918.113850.164570138.horiguchi.kyotaro@lab.ntt.co.jp>
> At Thu, 6 Sep 2018 16:37:28 -0700, Michael Paquier <michael@paquier.xyz> wrote in
<20180906233728.GR2726@paquier.xyz>
> > I am finally coming back to this patch set, and that's one of the first
> > things I am going to help moving on for this CF.  And this bit from the
> > last patch series is not acceptable as there are some parameters which
> > are used by the startup process which can be reloaded.  One of them is
> > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> 
> The third patch actually is not mandatory in the patch set. The
> only motive of that is it doesn't nothing. The handler for SIGHUP
> just sets got_SIGHUP then wakes up the process, and the variable
> is not looked up by no one. If you mind the change, it can be
> removed with no side effect.

I was wrong here. It was handled in HandleStartupProcInterrupts
called from StartupXLOG. So, it should be just removed from the
set. Sorry for the bogus patch.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Fri, Sep 14, 2018 at 04:30:37PM +0530, Amit Kapila wrote:
> On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
>> So, I have been working on this problem again and I have reviewed the
>> thread, and there have been many things discussed in the last couple of
>> months:
>> 1) We do not want to initialize XLogInsert stuff unconditionally for all
>> processes at the moment recovery begins, but we just want to initialize
>> it once WAL write is open for business.
>> 2) Both the checkpointer and the startup process can call
>> UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
>> incorrect values.
>
> Can you share the steps to reproduce this problem?

This refers to the first problem reported on this thread:
https://www.postgresql.org/message-id/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC%3DC2whnzBM8TAcv%3DstckYUw%40mail.gmail.com

In order to reproduce the problem, you can for example stop the server
in immediate mode.  Then attach a debugger to it and add a breakpoint to
UpdateFullPageWrites.  You can check that XLOG insert has not been
initialized yet by looking at xloginsert_cxt ot ThisTimeLineID.  On a
second session, switch full_page_writes to on or off, reload the
parameters and then trigger a checkpoint.  The important point is to
trigger an inconsistency between XLogCtl->Insert->fullPageWrites and
the value of fullPageWrites within the checkpointer context.  With the
checkpoint triggered, the debugger will stop at UpdateFullPageWrites
immediately.  At this point, you can simply check if fullPageWrites
Insert->fullPageWrites have the same value or a different one.  If the
values match, simply switch full_page_writes and reload again, with the
checkpointer still waiting at the beginning of UpdateFullPageWrites.
SIGHUP will make the checkpointer process hang a bit, and then it will
move on.  At this point you will be able to see the failure:
TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
2018-09-18 15:06:39 JST [7396]: [11-1] db=,user=,app=,client= LOG:
checkpointer process (PID 7399) was terminated by signal 6: Aborted

> On a regular startup when there is no recovery, it won't allow us to
> log the WAL record (XLOG_FPW_CHANGE) which can happen without above
> change.  You can check that by setting full_page_writes=off and start
> the system.

Oh, good point, InRecovery is set to false in this case so that would be
skipped.  We can simply fix that by adding a flag, say "force" to
UpdateFullPageWrites to allow a process to enforce the update of FPW
even if RecoveryInProgress returns true, which would be the case for the
startup process.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Sep 18, 2018 at 01:06:09PM +0900, Kyotaro HORIGUCHI wrote:
> I was wrong here. It was handled in HandleStartupProcInterrupts
> called from StartupXLOG. So, it should be just removed from the
> set. Sorry for the bogus patch.

Thanks for confirming.

Still, it looks like a waste to abuse on SIGINT just to forcibly wake up
the checkpointer and request from it a checkpoint...  And you could just
have used a new parameter for the checkpointer appended with
CHECKPOINT_FORCE.  I think that my approach of just making the set of
events purely ordered will save from any kind of race conditions, while
I suspect that what you propose here does not close all the holes.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Kyotaro HORIGUCHI
Дата:
At Fri, 14 Sep 2018 16:30:37 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+Xfx5jD2CHmLPNqXeOzqRLKG9TNr8wfs3-cPf9Ln9RVg@mail.gmail.com>
> On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> > > /*
> > >  * Properly accept or ignore signals the postmaster might send us.
> > >  */
> > > -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> > > +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> > >
> > > I am finally coming back to this patch set, and that's one of the first
> > > things I am going to help moving on for this CF.  And this bit from the
> > > last patch series is not acceptable as there are some parameters which
> > > are used by the startup process which can be reloaded.  One of them is
> > > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> >
> > So, I have been working on this problem again and I have reviewed the
> > thread, and there have been many things discussed in the last couple of
> > months:
> > 1) We do not want to initialize XLogInsert stuff unconditionally for all
> > processes at the moment recovery begins, but we just want to initialize
> > it once WAL write is open for business.
> > 2) Both the checkpointer and the startup process can call
> > UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
> > incorrect values.
> 
> Can you share the steps to reproduce this problem?

The window for the issue is small.

It happens if checkpointer first looks SharedRecoveryInProgress
is turned off at the beginning of the CheckPointerMain loop.
The window is needed be widen to make sure the issue happens.

Applying the first patch attched, the following steps will cause
the issue with high reliability.

1. initdb  (full_page_writes = on by default)

2. put recovery.conf so that server can accept promote signal

3. touch /tmp/hoge
   change full_page_write to off in postgresql.conf

4. pg_ctl_promote

The attached second file is a script do the above steps.
Server writes the following log message and die.

| 2018-09-18 13:55:49.928 JST [16405] LOG:  database system is ready to accept connections
| TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
| 2018-09-18 13:55:50.546 JST [16405] LOG:  checkpointer process (PID 16407) was terminated by signal 6: Aborted

We can preallocating the XLogInsert buffer just to prevent the
crash. This is done by calling RecoveryInProgress() before
UpdateFullPageWrites() finds it turned to false. This is an
isolated problem. But it has another issue that FPW_CHANGE record
can be missing or wrong FPW state after recovery end.

It comes from the fact that responsibility to update the flag is
not atomically passed from startup to checkpointer. (The window
is after UpdateFullPageWrites() call and until setting
SharedRecoveryInProgress to false.)

My latest patch tries to remove the window by imposing all
responsibility to apply config file changes to the shared FPW
flag on the checkpointer. RecoveryInProgress() is changed to be
called prior to UpdateFullPageWrites on the way doing that.


> > 3) We do not want a palloc() in a critical section because of
> > RecoveryinProgress being called.
> >
> > And the root issue here is 2), because the checkpointer tries to update
> > Insert->fullPageWrites but it does not need to do so until recovery has
> > been finished.  So in order to fix the original issue I am proposing a
> > simple fix: let's make sure that the checkpointer does not update
> > Insert->fullPageWrites until recovery finishes, and let's have the
> > startup process do the first update once it finishes recovery and
> > inserts by itself the XLOG_PARAMETER_CHANGE.  This way the order of
> > events is purely sequential and we don't need to worry about having the
> > checkpointer and the startup process eat on each other's plate because
> > the checkpointer would only try to work on updating the shared memory
> > value of full_page_writes once SharedRecoveryInProgress is switched to
> > true, and that happens after the startup process does its initial call
> > to UpdateFullPageWrites().  I have improved as well all the comments
> > around to make clear the behavior wanted.
> >
> > Thoughts?

InRecoery is turned off after the last UpdateFullPageWrite() and
before SharedRecoveryInProgress is turned off. So it is still
leaving the window. Thus, checkpointer stops flipping the value
before SharedRecoveryInProgress's turning off. (I don't
understand InRecovery condition..) However, this lets
checkpointer omit reloading after UpdateFullPagesWrite() in
startup until SharedRecoveryInProgress is tunred off.

>  UpdateFullPageWrites(void)
>  {
>   XLogCtlInsert *Insert = &XLogCtl->Insert;
> + /*
> + * Check if recovery is still in progress before entering this critical
> + * section, as some memory allocation could happen at the end of
> + * recovery.  There is nothing to do for a system still in recovery.
> + * Note that we need to process things here at the end of recovery for
> + * the startup process, which is why this checks after InRecovery.
> + */
> + if (RecoveryInProgress() && !InRecovery)
> + return;
> +
> 
> On a regular startup when there is no recovery, it won't allow us to
> log the WAL record (XLOG_FPW_CHANGE) which can happen without above
> change.  You can check that by setting full_page_writes=off and start
> the system.

Agreed. "If we need to do that in the start process," we need to
change the shared flag and issue FPW_CHANGE always when the
database state is different from configuration file, regardless
of what StartXLOG() did until the point.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3025d0badb..c34a6f6f6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7961,6 +7961,17 @@ StartupXLOG(void)
      */
     WalSndWakeup();
 
+    /* DEBUG: cause a reload */
+    {
+        struct stat b;
+        if (stat("/tmp/hoge", &b) == 0)
+        {
+            elog(LOG, "STARTUP SLEEP FOR 1s");
+            sleep(1);
+            elog(LOG, "DONE.");
+            DirectFunctionCall1(pg_reload_conf, 0);
+        }
+    }
     /*
      * If this was a fast promotion, request an (online) checkpoint now. This
      * isn't required for consistency, but the last restartpoint might be far
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..26da229ca1 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -354,6 +354,9 @@ CheckpointerMain(void)
          */
         AbsorbFsyncRequests();
 
+        elog(LOG, "CHECKPOINTER SLEEP FOR 3s");
+        sleep(3);
+        elog(LOG, "DONE.");
         if (got_SIGHUP)
         {
             got_SIGHUP = false;
#! /bin/bash

set -e

#PGDATA=<somewhere>
if [ "$PGDATA" = "" ]; then
  echo set pgdata
  exit 1;
fi

# rm -rf $PGDATA  ## danger!!
rm -f /tmp/hoge
initdb

cat << EOF > $PGDATA/recovery.conf
standby_mode='yes'
primary_conninfo='host=/tmp port=9999'
EOF
cat << EOF >> $PGDATA/postgresql.conf
restart_after_crash = off
EOF

pg_ctl start
sleep 5
touch /tmp/hoge

cat << EOF >> $PGDATA/postgresql.conf
full_page_writes = off
EOF

pg_ctl promote
sleep 10
pg_ctl stop -m i



Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Tue, Sep 18, 2018 at 04:15:42PM +0900, Kyotaro HORIGUCHI wrote:
> My latest patch tries to remove the window by imposing all
> responsibility to apply config file changes to the shared FPW
> flag on the checkpointer. RecoveryInProgress() is changed to be
> called prior to UpdateFullPageWrites on the way doing that.

I still need to look at that in details.  That may be better than what I
am proposing.  At quick glance what I propose is more simple, and does
not need enforcing a checkpoint using SIGINT.

> InRecoery is turned off after the last UpdateFullPageWrite() and
> before SharedRecoveryInProgress is turned off. So it is still
> leaving the window. Thus, checkpointer stops flipping the value
> before SharedRecoveryInProgress's turning off. (I don't
> understand InRecovery condition..) However, this lets
> checkpointer omit reloading after UpdateFullPagesWrite() in
> startup until SharedRecoveryInProgress is tunred off.

That won't matter in this case, as RecoveryInProgress() gets called out
of the critical section in the previous patch I sent, so there is no
failure.  Let's not forget that the first issue is the allocation in the
critical section.  The second issue is that UpdateFullPageWrites may be
called concurrently across multiple processes, which is not something it
is designed for.  The second issue gets addressed in my proposal my
making sure that the checkpointer never tries to update full_page_writes
by himself until recovery has finished, and that the startup process is
the only updater once recovery ends.

> Agreed. "If we need to do that in the start process," we need to
> change the shared flag and issue FPW_CHANGE always when the
> database state is different from configuration file, regardless
> of what StartXLOG() did until the point.

Definitely my mistake here.  Attached is a patch to show what I have in
mind by making sure that the startup process generates a record even
after switching full_page_writes when started normally.  This removes
the condition based on InRecovery, and uses a new argument for
UpdateFullPageWrites() instead.  Your test case,as well as what I do
manually for testing, pass without triggering the assertion.

+   /* DEBUG: cause a reload */
+   {
+       struct stat b;
+       if (stat("/tmp/hoge", &b) == 0)
+       {
+           elog(LOG, "STARTUP SLEEP FOR 1s");
+           sleep(1);
+           elog(LOG, "DONE.");
+           DirectFunctionCall1(pg_reload_conf, 0);
+       }
+   }
The way you patch the backend this way is always nice to see so as it is
easy to reproduce bugs, and it actually helps in reproducing the
assertion failure correctly ;)
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Wed, Sep 19, 2018 at 02:20:34PM +0900, Michael Paquier wrote:
> On Tue, Sep 18, 2018 at 04:15:42PM +0900, Kyotaro HORIGUCHI wrote:
>> My latest patch tries to remove the window by imposing all
>> responsibility to apply config file changes to the shared FPW
>> flag on the checkpointer. RecoveryInProgress() is changed to be
>> called prior to UpdateFullPageWrites on the way doing that.
>
> I still need to look at that in details.  That may be better than what I
> am proposing.  At quick glance what I propose is more simple, and does
> not need enforcing a checkpoint using SIGINT.

I have finally looked at the patch set from Horiguchi-san here:
https://www.postgresql.org/message-id/20180828.193436.253621888.horiguchi.kyotaro@lab.ntt.co.jp

And actually this is very close to what my proposal does, except for a
couple of caveats.  Here are my notes:
1) The issue with palloc() happening in critical sections is able to go
away, by making the logic behind UpdateFullPageWrites() more complicated
than necessary.  With the proposed patch, UpdateFullPageWrites() never
gets called by the checkpointer until the startup process has done its
business.  I would have believed that keeping the check to
RecoveryInProgress() directly in UpdateFullPageWrites() makes the logic
more simple.  That's actually what my proposal does.  With your patch,
it would be even possible to add an assertion so as this never gets
called while in recovery.
2) At the end of recovery, if there is a crash before the checkpointer
is able to update the shared parameters it needs to work on
away, then no XLOG_CHANGE_PARAMETER record is generated.  This is not a
problem for normal cases, but there is one scenario where this is a
problem: at the end of recovery if the bgwriter is not started, then the
startup process creates a checkpoint by itself, and it would miss the
record generation.
3) SIGINT is abused of, in such a way that the checkpointer may generate
two checkpoints where only one is needed post-recovery.
4) SyncRepUpdateSyncStandbysDefined() would get called even without
SIGHUP being reached.  This feels also like a future trap waiting to
bite as well.

When I see your patch, I actually see the same kind of logic as what I
propose which is summarized in two points, and that's a good thing:
a) Avoid the allocation in the critical section.
b) Avoid two processes to call UpdateFullPageWrites at the same time.

Now the points mentioned above make what you are proposing weaker in my
opinion, and 2) is an actual bug, side effect of the proposed patch.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Tue, Sep 18, 2018 at 12:46 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> At Fri, 14 Sep 2018 16:30:37 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+Xfx5jD2CHmLPNqXeOzqRLKG9TNr8wfs3-cPf9Ln9RVg@mail.gmail.com>
> > On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> > > > /*
> > > >  * Properly accept or ignore signals the postmaster might send us.
> > > >  */
> > > > -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> > > > +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> > > >
> > > > I am finally coming back to this patch set, and that's one of the first
> > > > things I am going to help moving on for this CF.  And this bit from the
> > > > last patch series is not acceptable as there are some parameters which
> > > > are used by the startup process which can be reloaded.  One of them is
> > > > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> > >
> > > So, I have been working on this problem again and I have reviewed the
> > > thread, and there have been many things discussed in the last couple of
> > > months:
> > > 1) We do not want to initialize XLogInsert stuff unconditionally for all
> > > processes at the moment recovery begins, but we just want to initialize
> > > it once WAL write is open for business.
> > > 2) Both the checkpointer and the startup process can call
> > > UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
> > > incorrect values.
> >
> > Can you share the steps to reproduce this problem?
>
> The window for the issue is small.
>
> It happens if checkpointer first looks SharedRecoveryInProgress
> is turned off at the beginning of the CheckPointerMain loop.
> The window is needed be widen to make sure the issue happens.
>
> Applying the first patch attched, the following steps will cause
> the issue with high reliability.
>
> 1. initdb  (full_page_writes = on by default)
>
> 2. put recovery.conf so that server can accept promote signal
>
> 3. touch /tmp/hoge
>    change full_page_write to off in postgresql.conf
>
> 4. pg_ctl_promote
>
> The attached second file is a script do the above steps.
> Server writes the following log message and die.
>
> | 2018-09-18 13:55:49.928 JST [16405] LOG:  database system is ready to accept connections
> | TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
> | 2018-09-18 13:55:50.546 JST [16405] LOG:  checkpointer process (PID 16407) was terminated by signal 6: Aborted
>
> We can preallocating the XLogInsert buffer just to prevent the
> crash. This is done by calling RecoveryInProgress() before
> UpdateFullPageWrites() finds it turned to false. This is an
> isolated problem.
>

Yes, till this point the problem is quite clear and can be reproduced,
however, my question was for the next part for which there doesn't
seem to be a concrete test case.

> But it has another issue that FPW_CHANGE record
> can be missing or wrong FPW state after recovery end.
>
> It comes from the fact that responsibility to update the flag is
> not atomically passed from startup to checkpointer. (The window
> is after UpdateFullPageWrites() call and until setting
> SharedRecoveryInProgress to false.)
>

I understand your concern about missing FPW_CHANGE WAL record during
the above window but is that really a problem because till the
promotion is complete, it is not expected that we write any WAL
record.  Now, the other part of the problem you mentioned is "wrong
FPW state" which can happen because we don't read
Insert->fullPageWrites under lock.  If you are concerned about that
then I think we can solve it with a much less invasive change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Wed, Sep 19, 2018 at 10:50 AM Michael Paquier <michael@paquier.xyz> wrote:
> > Agreed. "If we need to do that in the start process," we need to
> > change the shared flag and issue FPW_CHANGE always when the
> > database state is different from configuration file, regardless
> > of what StartXLOG() did until the point.
>
> Definitely my mistake here.  Attached is a patch to show what I have in
> mind by making sure that the startup process generates a record even
> after switching full_page_writes when started normally.  This removes
> the condition based on InRecovery, and uses a new argument for
> UpdateFullPageWrites() instead.  Your test case,as well as what I do
> manually for testing, pass without triggering the assertion.
>

This will fix the previous problem reported by me but will lead to
some other behavior change.  If somebody toggles the full_page_writes
flag before crash recovery, then it will log the XLOG_FPW_CHANGE
record, but that was not the case without your patch.

> When I see your patch, I actually see the same kind of logic as what I
> propose which is summarized in two points, and that's a good thing:
> a) Avoid the allocation in the critical section.
> b) Avoid two processes to call UpdateFullPageWrites at the same time.

I think, in this case, it might be advisable to just fix the problem
(a) which is what has been reported originally in the thread and
AFAICS, the fix for that is clear as compared to the problem (b).  If
you agree, then we can discuss what is the best fix for the first
problem (a).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 27, 2018 at 10:03:59AM +0530, Amit Kapila wrote:
> I think, in this case, it might be advisable to just fix the problem
> (a) which is what has been reported originally in the thread and
> AFAICS, the fix for that is clear as compared to the problem (b).  If
> you agree, then we can discuss what is the best fix for the first
> problem (a).

Okay, thanks for the input.  The fix for (a) would be in my opinion to
just move the call to RecoveryInProgress() out of the critical section,
then save the result into a variable, and use the variable within the
critical section to avoid the potential palloc() problems.  What do you
think?
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Thu, Sep 27, 2018 at 10:34 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 27, 2018 at 10:03:59AM +0530, Amit Kapila wrote:
> > I think, in this case, it might be advisable to just fix the problem
> > (a) which is what has been reported originally in the thread and
> > AFAICS, the fix for that is clear as compared to the problem (b).  If
> > you agree, then we can discuss what is the best fix for the first
> > problem (a).
>
> Okay, thanks for the input.  The fix for (a) would be in my opinion to
> just move the call to RecoveryInProgress() out of the critical section,
> then save the result into a variable, and use the variable within the
> critical section to avoid the potential palloc() problems.  What do you
> think?
>

Your proposed solution makes sense to me.  IIUC, this is quite similar
to what Dilip has also proposed [1].

[1] - https://www.postgresql.org/message-id/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC%3DC2whnzBM8TAcv%3DstckYUw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Thu, Sep 27, 2018 at 11:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 27, 2018 at 10:34 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Sep 27, 2018 at 10:03:59AM +0530, Amit Kapila wrote:
> > > I think, in this case, it might be advisable to just fix the problem
> > > (a) which is what has been reported originally in the thread and
> > > AFAICS, the fix for that is clear as compared to the problem (b).  If
> > > you agree, then we can discuss what is the best fix for the first
> > > problem (a).
> >
> > Okay, thanks for the input.  The fix for (a) would be in my opinion to
> > just move the call to RecoveryInProgress() out of the critical section,
> > then save the result into a variable, and use the variable within the
> > critical section to avoid the potential palloc() problems.  What do you
> > think?
> >
>
> Your proposed solution makes sense to me.  IIUC, this is quite similar
> to what Dilip has also proposed [1].
>

I can take care of committing something along the lines of Dilip's
patch if you are okay.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 27, 2018 at 11:18:02AM +0530, Amit Kapila wrote:
> Your proposed solution makes sense to me.  IIUC, this is quite similar
> to what Dilip has also proposed [1].

Indeed.  I would just add with the patch a comment like that:
"Perform this call outside the critical section so as if the instance
just got out of recovery, the upcoming WAL insert initialization does
not trigger an assertion failure."

If that sounds fine to you, I propose that we commit Dilip's patch with
the comment addition.  That will take care of (a).
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 27, 2018 at 01:30:13PM +0530, Amit Kapila wrote:
> I can take care of committing something along the lines of Dilip's
> patch if you are okay.

Sure, feel free to if you have some room.  I am fine to take care of it
as well, so that's up to you to decide.  Adding a comment like what I
proposed upthread is necessary in my opinion.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Thu, Sep 27, 2018 at 1:32 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 27, 2018 at 11:18:02AM +0530, Amit Kapila wrote:
> > Your proposed solution makes sense to me.  IIUC, this is quite similar
> > to what Dilip has also proposed [1].
>
> Indeed.  I would just add with the patch a comment like that:
> "Perform this call outside the critical section so as if the instance
> just got out of recovery, the upcoming WAL insert initialization does
> not trigger an assertion failure."
>

I think this is mostly fine, but it seems "if the instance just got
out of recovery" doesn't fit well because it can happen anytime after
recovery, this code gets called from checkpointer.  I think we can
slightly tweak it as below:
"Perform this outside critical section so that the WAL insert
initialization done by RecoveryInProgress() doesn't trigger an
assertion failure."

What do you say?

> Sure, feel free to if you have some room.  I am fine to take care of it
> as well, so that's up to you to decide.

Okay, I will take care of it.

>  Adding a comment like what I
> proposed upthread is necessary in my opinion.

Agreed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 27, 2018 at 04:19:02PM +0530, Amit Kapila wrote:
> I think this is mostly fine, but it seems "if the instance just got
> out of recovery" doesn't fit well because it can happen anytime after
> recovery, this code gets called from checkpointer.  I think we can
> slightly tweak it as below:
> "Perform this outside critical section so that the WAL insert
> initialization done by RecoveryInProgress() doesn't trigger an
> assertion failure."
>
> What do you say?

Fine for me.

>> Sure, feel free to if you have some room.  I am fine to take care of it
>> as well, so that's up to you to decide.
>
> Okay, I will take care of it.

Thanks.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Thu, Sep 27, 2018 at 6:22 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 27, 2018 at 04:19:02PM +0530, Amit Kapila wrote:
> > I think this is mostly fine, but it seems "if the instance just got
> > out of recovery" doesn't fit well because it can happen anytime after
> > recovery, this code gets called from checkpointer.  I think we can
> > slightly tweak it as below:
> > "Perform this outside critical section so that the WAL insert
> > initialization done by RecoveryInProgress() doesn't trigger an
> > assertion failure."
> >
> > What do you say?
>
> Fine for me.
>

Okay, I am planning to commit the attached patch tomorrow unless you
or anybody else has any objections to it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Michael Paquier
Дата:
On Thu, Sep 27, 2018 at 07:38:31PM +0530, Amit Kapila wrote:
> Okay, I am planning to commit the attached patch tomorrow unless you
> or anybody else has any objections to it.

None from here.  Thanks for taking care of it.
--
Michael

Вложения

Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Sep 28, 2018 at 4:23 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 27, 2018 at 07:38:31PM +0530, Amit Kapila wrote:
> > Okay, I am planning to commit the attached patch tomorrow unless you
> > or anybody else has any objections to it.
>
> None from here.  Thanks for taking care of it.
>

Thanks, pushed!  I have backpatched till 9.5 as this has been
introduced by the commit 2c03216d83 which added the XLOG machinery
initialization in RecoveryInProgress code path.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Problem while setting the fpw with SIGHUP

От
Amit Kapila
Дата:
On Fri, Sep 28, 2018 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 28, 2018 at 4:23 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Sep 27, 2018 at 07:38:31PM +0530, Amit Kapila wrote:
> > > Okay, I am planning to commit the attached patch tomorrow unless you
> > > or anybody else has any objections to it.
> >
> > None from here.  Thanks for taking care of it.
> >
>
> Thanks, pushed!  I have backpatched till 9.5 as this has been
> introduced by the commit 2c03216d83 which added the XLOG machinery
> initialization in RecoveryInProgress code path.
>

I have marked the originally reported issue as fixed in the open-items list.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com