Re: Race conditions with checkpointer and shutdown

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Race conditions with checkpointer and shutdown
Дата
Msg-id 1080.1555599180@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: Race conditions with checkpointer and shutdown  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: Race conditions with checkpointer and shutdown  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Race conditions with checkpointer and shutdown  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think what we need to look for is reasons why (1) the postmaster
>> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
>> main loop doesn't get to noticing shutdown_requested.
>> 
>> A rather scary point for (2) is that said main loop seems to be
>> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
>> other purposes in the checkpointer process.  If there were something,
>> like say a condition variable wait, that would reset MyLatch at any
>> time during a checkpoint, then we could very easily go to sleep at the
>> bottom of the loop and not notice that there's a pending shutdown request.

> Agreed on the non-composability of that coding, but if there actually
> is anything in that loop that can reach ResetLatch(), it's well
> hidden...

Well, it's easy to see that there's no other ResetLatch call in
checkpointer.c.  It's much less obvious that there's no such
call anywhere in the code reachable from e.g. CreateCheckPoint().

To try to investigate that, I hacked things up to force an assertion
failure if ResetLatch was called from any other place in the
checkpointer process (dirty patch attached for amusement's sake).
This gets through check-world without any assertions.  That does not
really prove that there aren't corner timing cases where a latch
wait and reset could happen, but it does put a big dent in my theory.
Question is, what other theory has anybody got?

            regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a1e0423..b0ee1fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -166,6 +166,8 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;

+extern bool reset_latch_ok;
+
 /* Prototypes for private functions */

 static void CheckArchiveTimeout(void);
@@ -343,9 +345,13 @@ CheckpointerMain(void)
         int            elapsed_secs;
         int            cur_timeout;

+        reset_latch_ok = true;
+
         /* Clear any already-pending wakeups */
         ResetLatch(MyLatch);

+        reset_latch_ok = false;
+
         /*
          * Process any requests or signals received recently.
          */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917..1f0613d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -118,6 +118,8 @@ struct WaitEventSet
 #endif
 };

+bool reset_latch_ok = true;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -521,6 +523,8 @@ ResetLatch(Latch *latch)
     /* Only the owner should reset the latch */
     Assert(latch->owner_pid == MyProcPid);

+    Assert(reset_latch_ok);
+
     latch->is_set = false;

     /*

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Question about the holdable cursor
Следующее
От: David Fetter
Дата:
Сообщение: Re: block-level incremental backup