Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size
Дата
Msg-id 20210716.110500.137369520113769822.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Ответы Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-bugs
At Thu, 15 Jul 2021 17:33:20 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> On 2021-Jul-15, Alvaro Herrera wrote:
> 
> > Actually, looking again, isn't this supposed to happen in KeepLogSeg()?
> > We have a block that caps to max_slot_wal_keep_size_mb there ... why did
> > that not work?
> 
> I find that this smaller patch is sufficient to make the added test case
> work.  However, I'm not sure I understand *why* ...

It's because another checkpoint is running at the time the manual
checkpoint just before is invoked.  Two successive checkpoint hides
the defect.  The checkpoint works as the rendezvous point for the
succeeding tests so I added another sync point instead of the manual
checkpoint.  The test in the attached correctly fails if _logSegNo
were not updated by slot invalidation.

By the way, about the previous version, XLByteToSeg is needed since
KeepLogSeg doesn't advance _logSegNo, which is the wanted action
here. The test in the attached fails if only KeepLogSeg is called
again.

I confirmed that *the previous* version of the test works for Windows.
(So there's no reason that the current version doesn't work.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From cd9ecce8655359a2bbb147f4d017de3e66a2bb00 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Jul 2021 15:52:48 +0900
Subject: [PATCH v6] Advance old-segment horizon properly after slot
 invalidation

When some slots are invalidated due to the max_slot_wal_keep_size limit,
the old segment horizon should move forward to stay within the limit.
However, in commit c6550776394e we forgot to call KeepLogSeg again to
recompute the horizon after invalidating replication slots.  In cases
where other slots remained, the limits would be recomputed eventually
for other reasons, but if all slots were invalidated, the limits would
not move at all afterwards.  Repair.

Backpatch to 13 where the feature was introduced.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Marcin Krupowicz <mk@071.ovh>
Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
---
 src/backend/access/transam/xlog.c         | 24 +++++++++++++++--
 src/backend/replication/slot.c            | 26 +++++++++++++++---
 src/include/replication/slot.h            |  2 +-
 src/test/recovery/t/019_replslot_limit.pl | 33 ++++++++++++++++++-----
 4 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index edb15fe58d..752b1099ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9300,7 +9300,17 @@ CreateCheckPoint(int flags)
      */
     XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
     KeepLogSeg(recptr, &_logSegNo);
-    InvalidateObsoleteReplicationSlots(_logSegNo);
+    if (InvalidateObsoleteReplicationSlots(_logSegNo))
+    {
+        /*
+         * Some slots have been gone, recalculate the old-segment horizon
+         * excluding the invalidated slots. XLByteToSeg is needed here to
+         * advance _logSegNo.
+         */
+        XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+        KeepLogSeg(recptr, &_logSegNo);
+    }
+
     _logSegNo--;
     RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
@@ -9640,7 +9650,17 @@ CreateRestartPoint(int flags)
     replayPtr = GetXLogReplayRecPtr(&replayTLI);
     endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
     KeepLogSeg(endptr, &_logSegNo);
-    InvalidateObsoleteReplicationSlots(_logSegNo);
+    if (InvalidateObsoleteReplicationSlots(_logSegNo))
+    {
+        /*
+         * Some slots have been gone, recalculate the old-segment horizon
+         * excluding the invalidated slots. XLByteToSeg is needed here to
+         * advance _logSegNo.
+         */
+        XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+        KeepLogSeg(endptr, &_logSegNo);
+    }
+
     _logSegNo--;
 
     /*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b85d86cc..33e9acab4a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1143,11 +1143,14 @@ ReplicationSlotReserveWal(void)
  * Returns whether ReplicationSlotControlLock was released in the interim (and
  * in that case we're not holding the lock at return, otherwise we are).
  *
+ * Sets *invalidated true if the slot was invalidated. (Untouched otherwise.)
+ *
  * This is inherently racy, because we release the LWLock
  * for syscalls, so caller must restart if we return true.
  */
 static bool
-InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
+                               bool *invalidated)
 {
     int            last_signaled_pid = 0;
     bool        released_lock = false;
@@ -1204,6 +1207,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
             s->active_pid = MyProcPid;
             s->data.invalidated_at = restart_lsn;
             s->data.restart_lsn = InvalidXLogRecPtr;
+
+            /* Let caller know */
+            *invalidated = true;
         }
 
         SpinLockRelease(&s->mutex);
@@ -1291,12 +1297,15 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
  * Mark any slot that points to an LSN older than the given segment
  * as invalid; it requires WAL that's about to be removed.
  *
+ * Returns true when any slot have got invalidated.
+ *
  * NB - this runs as part of checkpoint, so avoid raising errors if possible.
  */
-void
+bool
 InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
 {
     XLogRecPtr    oldestLSN;
+    bool        invalidated = false;
 
     XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN);
 
@@ -1309,13 +1318,24 @@ restart:
         if (!s->in_use)
             continue;
 
-        if (InvalidatePossiblyObsoleteSlot(s, oldestLSN))
+        if (InvalidatePossiblyObsoleteSlot(s, oldestLSN, &invalidated))
         {
             /* if the lock was released, start from scratch */
             goto restart;
         }
     }
     LWLockRelease(ReplicationSlotControlLock);
+
+    /*
+     * If any slots have been invalidated, recalculate the resource limits.
+     */
+    if (invalidated)
+    {
+        ReplicationSlotsComputeRequiredXmin(false);
+        ReplicationSlotsComputeRequiredLSN();
+    }
+
+    return invalidated;
 }
 
 /*
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 34d95eac8e..e32fb85db8 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -213,7 +213,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
 extern void ReplicationSlotsDropDBSlots(Oid dboid);
-extern void InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno);
+extern bool InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno);
 extern ReplicationSlot *SearchNamedReplicationSlot(const char *name, bool need_lock);
 extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslotname, int szslot);
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index d4b9ff705f..bbdaaca14c 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 
 use File::Path qw(rmtree);
-use Test::More tests => $TestLib::windows_os ? 14 : 18;
+use Test::More tests => $TestLib::windows_os ? 15 : 19;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -176,7 +176,13 @@ ok( !find_in_log(
 # Advance WAL again, the slot loses the oldest segment.
 my $logstart = get_log_size($node_primary);
 advance_wal($node_primary, 7);
-$node_primary->safe_psql('postgres', "CHECKPOINT;");
+
+# This slot should be broken, wait for that to happen
+$node_primary->poll_query_until(
+    'postgres',
+    qq[
+    SELECT wal_status = 'lost' FROM pg_replication_slots
+    WHERE slot_name = 'rep1']);
 
 # WARNING should be issued
 ok( find_in_log(
@@ -185,13 +191,28 @@ ok( find_in_log(
         $logstart),
     'check that the warning is logged');
 
-# This slot should be broken
-$result = $node_primary->safe_psql('postgres',
-    "SELECT slot_name, active, restart_lsn IS NULL, wal_status, safe_wal_size FROM pg_replication_slots WHERE
slot_name= 'rep1'"
 
-);
+$result = $node_primary->safe_psql(
+    'postgres',
+    qq[
+    SELECT slot_name, active, restart_lsn IS NULL, wal_status, safe_wal_size
+    FROM pg_replication_slots WHERE slot_name = 'rep1']);
 is($result, "rep1|f|t|lost|",
     'check that the slot became inactive and the state "lost" persists');
 
+# The invalidated slot shouldn't keep the old-segment horizon back;
+# see bug #17103: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
+# Test for this by creating a new slot and comparing its restart LSN
+# to the oldest existing file.
+my $redoseg = $node_primary->safe_psql('postgres',
+    "SELECT pg_walfile_name(lsn) FROM pg_create_physical_replication_slot('s2', true)"
+);
+my $oldestseg = $node_primary->safe_psql('postgres',
+                                         "SELECT pg_ls_dir AS f FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~
'^[0-9A-F]{24}\$'ORDER BY 1 LIMIT 1"
 
+  );
+$node_primary->safe_psql('postgres',
+                         qq[SELECT pg_drop_replication_slot('s2')]);
+is($oldestseg, $redoseg, "check that segments have been removed");
+
 # The standby no longer can connect to the primary
 $logstart = get_log_size($node_standby);
 $node_standby->start;
-- 
2.27.0


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: Re: BUG #17066: Cache lookup failed when null (iso-8859-1) is passed as anycompatiblemultirange
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #17112: Sequence number is not restored when the stored procedure ends abnormally