[HACKERS] More race conditions in logical replication

Поиск
Список
Период
Сортировка
От Tom Lane
Тема [HACKERS] More race conditions in logical replication
Дата
Msg-id 11904.1499039688@sss.pgh.pa.us
обсуждение исходный текст
Ответы Re: [HACKERS] More race conditions in logical replication  (Noah Misch <noah@leadboat.com>)
Re: [HACKERS] More race conditions in logical replication  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Список pgsql-hackers
I noticed a recent failure that looked suspiciously like a race condition:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07

The critical bit in the log file is

error running SQL: 'psql:<stdin>:1: ERROR:  could not drop the replication slot "tap_sub" on publisher
DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP
SUBSCRIPTIONtap_sub' at
/home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pmline 1198. 

After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit).  It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.

I'm not particularly interested in fixing this myself, so I'm just
going to add it to the open items list.

            regards, tom lane

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dc7de20..78cd416 100644
*** a/src/backend/replication/slot.c
--- b/src/backend/replication/slot.c
*************** ReplicationSlotAcquire(const char *name)
*** 358,364 ****
      if (active_pid != MyProcPid)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_IN_USE),
!                  errmsg("replication slot \"%s\" is active for PID %d",
                          name, active_pid)));

      /* We made this slot active, so it's ours now. */
--- 358,364 ----
      if (active_pid != MyProcPid)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_IN_USE),
!                  errmsg("replication slot \"%s\" is active for PID %d (acq)",
                          name, active_pid)));

      /* We made this slot active, so it's ours now. */
*************** ReplicationSlotRelease(void)
*** 391,396 ****
--- 391,398 ----
           * Mark persistent slot inactive.  We're not freeing it, just
           * disconnecting.
           */
+         pg_usleep(100000);
+         elog(LOG, "ReplicationSlotRelease: unmarking persistent slot");
          SpinLockAcquire(&slot->mutex);
          slot->active_pid = 0;
          SpinLockRelease(&slot->mutex);
*************** ReplicationSlotDropPtr(ReplicationSlot *
*** 523,528 ****
--- 525,532 ----
      {
          bool        fail_softly = slot->data.persistency != RS_PERSISTENT;

+         pg_usleep(100000);
+         elog(LOG, "ReplicationSlotDropPtr: unmarking slot after rename fail");
          SpinLockAcquire(&slot->mutex);
          slot->active_pid = 0;
          SpinLockRelease(&slot->mutex);
*************** ReplicationSlotDropPtr(ReplicationSlot *
*** 540,545 ****
--- 544,551 ----
       * and nobody can be attached to this slot and thus access it without
       * scanning the array.
       */
+     pg_usleep(100000);
+     elog(LOG, "ReplicationSlotDropPtr: unmarking slot");
      LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
      slot->active_pid = 0;
      slot->in_use = false;
*************** restart:
*** 876,882 ****
          if (active_pid)
              ereport(ERROR,
                      (errcode(ERRCODE_OBJECT_IN_USE),
!                      errmsg("replication slot \"%s\" is active for PID %d",
                              slotname, active_pid)));

          /*
--- 882,888 ----
          if (active_pid)
              ereport(ERROR,
                      (errcode(ERRCODE_OBJECT_IN_USE),
!                      errmsg("replication slot \"%s\" is active for PID %d (drop)",
                              slotname, active_pid)));

          /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: protocol version negotiation (Re: [HACKERS] Libpq PGRES_COPY_BOTH- version compatibility)
Следующее
От: Michael Paquier
Дата:
Сообщение: [HACKERS] Re: pg_ctl wait exit code (was Re: [COMMITTERS] pgsql: Additionaltests for subtransactions in recovery)