Обсуждение: more descriptive message for process termination due to max_slot_wal_keep_size

Поиск
Список
Период
Сортировка

more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
Hello.

As complained in pgsql-bugs [1], when a process is terminated due to
max_slot_wal_keep_size, the related messages don't mention the root
cause for *the termination*.  Note that the third message does not
show for temporary replication slots.

[pid=a] LOG:  terminating process x to release replication slot "s"
[pid=x] LOG:  FATAL:  terminating connection due to administrator command
[pid=a] LOG:  invalidting slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size

The attached patch attaches a DETAIL line to the first message.

> [17605] LOG:  terminating process 17614 to release replication slot "s1"
+ [17605] DETAIL:  The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.
> [17614] FATAL:  terminating connection due to administrator command
> [17605] LOG:  invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size

Somewhat the second and fourth lines look inconsistent each other but
that wouldn't be such a problem.  I don't think we want to concatenate
the two lines together as the result is a bit too long.

> LOG:  terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds
max_slot_wal_keep_size.

What do you think about this?

[1] https://www.postgresql.org/message-id/20211214.101137.379073733372253470.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b0c27dc80aff37ef984592b79f1dd20d052299fa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Dec 2021 10:50:00 +0900
Subject: [PATCH] Make an error message about process termination more
 descriptive

If checkpointer kills a process due to a temporary replication slot
exceeding max_slot_wal_keep_size, the messages fails to describe the
cause of the termination.  It is because the message that describes
the reason that is emitted for persistent slots does not show for
temporary slots.  Add a DETAIL line to the message common to all types
of slot to describe the cause.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90ba9b417d..cba9a29113 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1254,7 +1254,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             {
                 ereport(LOG,
                         (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                                active_pid, NameStr(slotname)),
+                         errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.",
LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0


Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Ashutosh Bapat
Дата:
On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> As complained in pgsql-bugs [1], when a process is terminated due to
> max_slot_wal_keep_size, the related messages don't mention the root
> cause for *the termination*.  Note that the third message does not
> show for temporary replication slots.
>
> [pid=a] LOG:  terminating process x to release replication slot "s"
> [pid=x] LOG:  FATAL:  terminating connection due to administrator command
> [pid=a] LOG:  invalidting slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size
>
> The attached patch attaches a DETAIL line to the first message.
>
> > [17605] LOG:  terminating process 17614 to release replication slot "s1"
> + [17605] DETAIL:  The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.
> > [17614] FATAL:  terminating connection due to administrator command
> > [17605] LOG:  invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size
>
> Somewhat the second and fourth lines look inconsistent each other but
> that wouldn't be such a problem.  I don't think we want to concatenate
> the two lines together as the result is a bit too long.
>
> > LOG:  terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds
max_slot_wal_keep_size.
>
> What do you think about this?

Agree. I think we should also specify the restart_lsn value which
would be within max_slot_wal_keep_size for better understanding.

-- 
Best Wishes,
Ashutosh Bapat



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Bharath Rupireddy
Дата:
On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> As complained in pgsql-bugs [1], when a process is terminated due to
> max_slot_wal_keep_size, the related messages don't mention the root
> cause for *the termination*.  Note that the third message does not
> show for temporary replication slots.
>
> [pid=a] LOG:  "terminating process %d to release replication slot \"%s\""
> [pid=x] LOG:  FATAL:  terminating connection due to administrator command
> [pid=a] LOG:  invalidting slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size
>
> The attached patch attaches a DETAIL line to the first message.
>
> > [17605] LOG:  terminating process 17614 to release replication slot "s1"
> + [17605] DETAIL:  The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.
> > [17614] FATAL:  terminating connection due to administrator command
> > [17605] LOG:  invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size
>
> Somewhat the second and fourth lines look inconsistent each other but
> that wouldn't be such a problem.  I don't think we want to concatenate
> the two lines together as the result is a bit too long.
>
> > LOG:  terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds
max_slot_wal_keep_size.
>
> What do you think about this?
>
> [1] https://www.postgresql.org/message-id/20211214.101137.379073733372253470.horikyota.ntt%40gmail.com

+1 to give more context to the "terminating process %d to release
replication slot \"%s\"" message.

How about having below, instead of adding errdetail:
"terminating process %d to release replication slot \"%s\" whose
restart_lsn %X/%X exceeds max_slot_wal_keep_size"?

I think we can keep the "invalidating slot \"%s\" because its
restart_lsn %X/%X exceeds max_slot_wal_keep_size" message as-is. We
may not see "terminating process ..." and "invalidation slot ..."
messages together for the same slot, so having slightly different
wording is fine IMO.

Regards,
Bharath Rupireddy.



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Tue, 14 Dec 2021 19:31:21 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in 
> On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > > [17605] LOG:  terminating process 17614 to release replication slot "s1"
> > + [17605] DETAIL:  The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.
> > > [17614] FATAL:  terminating connection due to administrator command
> > > [17605] LOG:  invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size
> >
> > Somewhat the second and fourth lines look inconsistent each other but
> > that wouldn't be such a problem.  I don't think we want to concatenate
> > the two lines together as the result is a bit too long.
> >
> > > LOG:  terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds
max_slot_wal_keep_size.
> >
> > What do you think about this?
> 
> Agree. I think we should also specify the restart_lsn value which
> would be within max_slot_wal_keep_size for better understanding.

Thanks!  It seems to me the main message of the "invalidating" log has
no room for further detail.  So I split the reason out to DETAILS line
the same way with the "terminating" message in the attached second
patch. (It is separated from the first patch just for review) I
believe someone can make the DETAIL message simpler or more natural.

The attached patch set emits the following message.

> LOG:  invalidating slot "s1"
> DETAIL:  The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size.

The second line could be changed like the following or anything other.

> DETAIL:  The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size.
.....

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 6909d854a0d48f2504cbb6dfbc7eb571fd32bcd8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Dec 2021 10:50:00 +0900
Subject: [PATCH v2 1/2] Make an error message about process termination more
 descriptive

If checkpointer kills a process due to a temporary replication slot
exceeding max_slot_wal_keep_size, the messages fails to describe the
cause of the termination.  It is finally shown for persistent slots in
a subsequent message, but that message doesn't show for temporary
slots.  It's better to attach an explainaion to the termination
message directly.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90ba9b417d..cba9a29113 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1254,7 +1254,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             {
                 ereport(LOG,
                         (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                                active_pid, NameStr(slotname)),
+                         errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.",
LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

From 265a84fa6a20ee7c5178ace41096df1a4b00465c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 15 Dec 2021 13:08:20 +0900
Subject: [PATCH v2 2/2] Make the message further detailed

---
 src/backend/replication/slot.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index cba9a29113..695151e484 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1255,7 +1255,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
                 ereport(LOG,
                         (errmsg("terminating process %d to release replication slot \"%s\"",
                                 active_pid, NameStr(slotname)),
-                         errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.",
LSN_FORMAT_ARGS(restart_lsn))));
+                         errdetail("The slot's restart_lsn %X/%X is behind the limit %X/%X defined by
max_slot_wal_keep_size.",
+                                   LSN_FORMAT_ARGS(restart_lsn),
+                                   LSN_FORMAT_ARGS(oldestLSN))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
@@ -1292,9 +1294,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             ReplicationSlotRelease();
 
             ereport(LOG,
-                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
-                            NameStr(slotname),
-                            LSN_FORMAT_ARGS(restart_lsn))));
+                    (errmsg("invalidating slot \"%s\"",
+                            NameStr(slotname)),
+                     errdetail("The slot's restart_lsn %X/%X is behind the limit %X/%X defined by
max_slot_wal_keep_size.",
+                            LSN_FORMAT_ARGS(restart_lsn),
+                            LSN_FORMAT_ARGS(oldestLSN))));
 
             /* done with this slot for now */
             break;
-- 
2.27.0


Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Ashutosh Bapat
Дата:
On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 14 Dec 2021 19:31:21 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in
> > On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > > [17605] LOG:  terminating process 17614 to release replication slot "s1"
> > > + [17605] DETAIL:  The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.
> > > > [17614] FATAL:  terminating connection due to administrator command
> > > > [17605] LOG:  invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size
> > >
> > > Somewhat the second and fourth lines look inconsistent each other but
> > > that wouldn't be such a problem.  I don't think we want to concatenate
> > > the two lines together as the result is a bit too long.
> > >
> > > > LOG:  terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds
max_slot_wal_keep_size.
> > >
> > > What do you think about this?
> >
> > Agree. I think we should also specify the restart_lsn value which
> > would be within max_slot_wal_keep_size for better understanding.
>
> Thanks!  It seems to me the main message of the "invalidating" log has
> no room for further detail.  So I split the reason out to DETAILS line
> the same way with the "terminating" message in the attached second
> patch. (It is separated from the first patch just for review) I
> believe someone can make the DETAIL message simpler or more natural.
>
> The attached patch set emits the following message.
>
> > LOG:  invalidating slot "s1"
> > DETAIL:  The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size.
>
> The second line could be changed like the following or anything other.
>
> > DETAIL:  The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size.
> .....
>

The second version looks better as it gives more details. I am fine
with either of the above wordings.

I would prefer everything in the same message though since
"invalidating slot ..." is too short a LOG message. Not everybody
enabled details always.



-- 
Best Wishes,
Ashutosh Bapat



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Thu, 23 Dec 2021 18:08:08 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in 
> On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > > LOG:  invalidating slot "s1"
> > > DETAIL:  The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size.
> >
> > The second line could be changed like the following or anything other.
> >
> > > DETAIL:  The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size.
> > .....
> >
> 
> The second version looks better as it gives more details. I am fine
> with either of the above wordings.
> 
> I would prefer everything in the same message though since
> "invalidating slot ..." is too short a LOG message. Not everybody
> enabled details always.

Mmm. Right. I have gone too much to the same way with the
process-termination message.

I rearranged the meesages as follows in the attached version. (at master)

> LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
> DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.

> LOG:  invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size
c> DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.

The messages is actually incomplete even in 13 so I think the change
to the errmsg() message of the first message is worth back-patching.

- v3-0001-Make-a-message-on-process-termination-more-dscrip.patch

  Changes only the first main message and it can be back-patched to 14. 

- v3-0001-Make-a-message-on-process-termination-more-dscrip_13.patch

  The same to the above but for 13, which doesn't have LSN_FORMAT_ARGS.

- v3-0002-Add-detailed-information-to-slot-invalidation-mes.patch

  Attaches the DETAIL line shown above to both messages, only for the
  master.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4bfbb58801e9d2a0e0ab8320dd95bc850a0a397b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:52:07 +0900
Subject: [PATCH v3 1/2] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90ba9b417d..0ceac3a54d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1253,8 +1253,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

From 444b3bb64ee907280606696e0cb90f6b022c9cd6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 13:23:54 +0900
Subject: [PATCH v3] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 02047ea920..15b8934ae2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1228,8 +1228,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                (uint32) (restart_lsn >> 32),
+                                (uint32) restart_lsn)));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

From 4d37e4e6f2af557779d970f5fe4dba98c937ed49 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:58:25 +0900
Subject: [PATCH v3 2/2] Add detailed information to slot-invalidation messages

The messages says just "exceeds max_slot_wal_keep_size" but doesn't
tell the concrete LSN of the limit. That information helps operators'
understanding on the issue.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 0ceac3a54d..2b394fe0d7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1255,7 +1255,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
                 ereport(LOG,
                         (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
                                 active_pid, NameStr(slotname),
-                                LSN_FORMAT_ARGS(restart_lsn))));
+                                LSN_FORMAT_ARGS(restart_lsn)),
+                         errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.",
+                                   LSN_FORMAT_ARGS(oldestLSN))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
@@ -1294,7 +1296,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             ereport(LOG,
                     (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
                             NameStr(slotname),
-                            LSN_FORMAT_ARGS(restart_lsn))));
+                            LSN_FORMAT_ARGS(restart_lsn)),
+                     errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.",
+                               LSN_FORMAT_ARGS(oldestLSN))));
 
             /* done with this slot for now */
             break;
-- 
2.27.0


Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Masahiko Sawada
Дата:
On Fri, Dec 24, 2021 at 1:42 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 23 Dec 2021 18:08:08 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in
> > On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > > LOG:  invalidating slot "s1"
> > > > DETAIL:  The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size.
> > >
> > > The second line could be changed like the following or anything other.
> > >
> > > > DETAIL:  The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by
max_slot_wal_keep_size.
> > > .....
> > >
> >
> > The second version looks better as it gives more details. I am fine
> > with either of the above wordings.
> >
> > I would prefer everything in the same message though since
> > "invalidating slot ..." is too short a LOG message. Not everybody
> > enabled details always.
>
> Mmm. Right. I have gone too much to the same way with the
> process-termination message.
>
> I rearranged the meesages as follows in the attached version. (at master)

Thank you for the patch! +1 for improving the messages.

>
> > LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
> > DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.
>
> > LOG:  invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size
> c> DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.

-
LSN_FORMAT_ARGS(restart_lsn))));
+
LSN_FORMAT_ARGS(restart_lsn)),
+                                                errdetail("The slot
got behind the limit %X/%X determined by max_slot_wal_keep_size.",
+
LSN_FORMAT_ARGS(oldestLSN))));

Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also
by wal_keep_size?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
Thank you for the comment.

At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> Thank you for the patch! +1 for improving the messages.
> 
> >
> > > LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
> > > DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.
> >
> > > LOG:  invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size
> > c> DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.
> 
> -
> LSN_FORMAT_ARGS(restart_lsn))));
> +
> LSN_FORMAT_ARGS(restart_lsn)),
> +                                                errdetail("The slot
> got behind the limit %X/%X determined by max_slot_wal_keep_size.",
> +
> LSN_FORMAT_ARGS(oldestLSN))));
> 
> Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also
> by wal_keep_size?

Right. But I believe the two are not assumed to be used at once. One
can set wal_keep_size larger than max_slot_wal_keep_size but it is
actually a kind of ill setting.

LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size.

Mmm. I don't like this.  I feel we don't need such detail in the
message..  I'd like to hear opinions from others, please.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Masahiko Sawada
Дата:
On Fri, Dec 24, 2021 at 5:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Thank you for the comment.
>
> At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > Thank you for the patch! +1 for improving the messages.
> >
> > >
> > > > LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
> > > > DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.
> > >
> > > > LOG:  invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size
> > > c> DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.
> >
> > -
> > LSN_FORMAT_ARGS(restart_lsn))));
> > +
> > LSN_FORMAT_ARGS(restart_lsn)),
> > +                                                errdetail("The slot
> > got behind the limit %X/%X determined by max_slot_wal_keep_size.",
> > +
> > LSN_FORMAT_ARGS(oldestLSN))));
> >
> > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also
> > by wal_keep_size?
>
> Right. But I believe the two are not assumed to be used at once. One
> can set wal_keep_size larger than max_slot_wal_keep_size but it is
> actually a kind of ill setting.
>
> LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
> DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size.
>
> Mmm. I don't like this.  I feel we don't need such detail in the
> message.

How about something like:

LOG:  terminating process %d to release replication slot \"%s\"
because its restart_lsn %X/%X exceeds the limit
DETAIL:  The slot got behind the limit %X/%X
HINT: You might need to increase max_slot_wal_keep_size or wal_keep_size.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Fri, 24 Dec 2021 20:23:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> On Fri, Dec 24, 2021 at 5:30 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Right. But I believe the two are not assumed to be used at once. One
> > can set wal_keep_size larger than max_slot_wal_keep_size but it is
> > actually a kind of ill setting.
> >
> > LOG:  terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
max_slot_wal_keep_size
> > DETAIL:  The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size.
> >
> > Mmm. I don't like this.  I feel we don't need such detail in the
> > message.
> 
> How about something like:
> 
> LOG:  terminating process %d to release replication slot \"%s\"
> because its restart_lsn %X/%X exceeds the limit
> DETAIL:  The slot got behind the limit %X/%X
> HINT: You might need to increase max_slot_wal_keep_size or wal_keep_size.

The message won't be seen when max_slot_wal_keep_size is not set. So
we don't recommend to increase wal_keep_size in that case. We might
need inhibit (or warn)the two parameters from being activated at once,
but it would be another issue.

Another point is how people determine the value for the parameter.  I
suppose (or believe) max_slot_wal_keep_size is not a kind to set to
minimal first then increase later but a kind to set to maximum
allowable first.  On the other hand we suggest as the follows for
too-small max_wal_size so we could do the same for this parameter.

> HINT:  Consider increasing the configuration parameter \"max_wal_size\".

Also, I don't like we have three lines for this message. If the DETAIL
adds only the specific value of the limit, I think it'd better append
it to the main message.

So what do you say if I propose the following?

LOG:  terminating process %d to release replication slot \"%s\"
because its restart_lsn %X/%X exceeds the limit %X/%X
HINT: You might need to increase max_slot_wal_keep_size.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> So what do you say if I propose the following?
> 
> LOG:  terminating process %d to release replication slot \"%s\"
> because its restart_lsn %X/%X exceeds the limit %X/%X
> HINT: You might need to increase max_slot_wal_keep_size.

This version emits the following message.

[35785:checkpointer] LOG:  terminating process 36368 to release replication slot "s1" because its restart_lsn
0/1F000148exceeds the limit 0/21000000
 
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.
[36368:walsender] FATAL:  terminating connection due to administrator command
[36368:walsender] STATEMENT:  START_REPLICATION SLOT "s1" 0/1F000000 TIMELINE 1
[35785:checkpointer] LOG:  invalidating slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/21000000
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.

We can omit the HINT line from the termination log for non-persistent
slots but I think we don't want to bother that considering its low
frequency.

The CI was confused by the mixed patches for multiple PG versions. In
this version the patchset for master are attached as .patch and that
for PG13 as .txt.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:52:07 +0900
Subject: [PATCH v4 1/2] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index caa6b29756..92f19bcb35 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

From f31014f85c7dec160bd42e1c48f1c28a7dc22af2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:58:25 +0900
Subject: [PATCH v4 2/2] Add detailed information to slot-invalidation messages

The messages says just "exceeds max_slot_wal_keep_size" but doesn't
tell the concrete LSN of the limit. That information helps operators'
understanding on the issue.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 92f19bcb35..be21b62add 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds the limit %X/%X",
 
                                 active_pid, NameStr(slotname),
-                                LSN_FORMAT_ARGS(restart_lsn))));
+                                LSN_FORMAT_ARGS(restart_lsn),
+                                LSN_FORMAT_ARGS(oldestLSN)),
+                         errhint("You might need to increase max_slot_wal_keep_size.")));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
@@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             ReplicationSlotRelease();
 
             ereport(LOG,
-                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X",
                             NameStr(slotname),
-                            LSN_FORMAT_ARGS(restart_lsn))));
+                            LSN_FORMAT_ARGS(restart_lsn),
+                            LSN_FORMAT_ARGS(oldestLSN)),
+                     errhint("You might need to increase max_slot_wal_keep_size.")));
 
             /* done with this slot for now */
             break;
-- 
2.27.0

From 6c5a680842521b26c0c899c3d0675bd53e58ac11 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 13:23:54 +0900
Subject: [PATCH v4] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 02047ea920..15b8934ae2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1228,8 +1228,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                (uint32) (restart_lsn >> 32),
+                                (uint32) restart_lsn)));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0


Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Wed, 02 Mar 2022 15:37:19 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> The CI was confused by the mixed patches for multiple PG versions. In
> this version the patchset for master are attached as .patch and that
> for PG13 as .txt.

Yeah.... It is of course the relevant check should be fixed.  The
attached v5 adjusts 019_replslot_limit.pl.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:52:07 +0900
Subject: [PATCH v5 1/2] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index caa6b29756..92f19bcb35 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

From 7a9e571e8cf4e5ffc13d101733c1b6fc455b1aec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:58:25 +0900
Subject: [PATCH v5 2/2] Add detailed information to slot-invalidation messages

The messages says just "exceeds max_slot_wal_keep_size" but doesn't
tell the concrete LSN of the limit. That information helps operators'
understanding on the issue.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c            | 12 ++++++++----
 src/test/recovery/t/019_replslot_limit.pl |  2 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 92f19bcb35..be21b62add 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds the limit %X/%X",
 
                                 active_pid, NameStr(slotname),
-                                LSN_FORMAT_ARGS(restart_lsn))));
+                                LSN_FORMAT_ARGS(restart_lsn),
+                                LSN_FORMAT_ARGS(oldestLSN)),
+                         errhint("You might need to increase max_slot_wal_keep_size.")));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
@@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             ReplicationSlotRelease();
 
             ereport(LOG,
-                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X",
                             NameStr(slotname),
-                            LSN_FORMAT_ARGS(restart_lsn))));
+                            LSN_FORMAT_ARGS(restart_lsn),
+                            LSN_FORMAT_ARGS(oldestLSN)),
+                     errhint("You might need to increase max_slot_wal_keep_size.")));
 
             /* done with this slot for now */
             break;
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 9bb71b62c0..bac496a9cf 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -188,7 +188,7 @@ for (my $i = 0; $i < 10000; $i++)
 {
     if (find_in_log(
             $node_primary,
-            "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size",
+            "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds the limit [0-9A-F/]+",
             $logstart))
     {
         $invalidated = 1;
-- 
2.27.0


Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
"Drouvot, Bertrand"
Дата:

Hi,

On 3/2/22 7:37 AM, Kyotaro Horiguchi wrote:
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
So what do you say if I propose the following?

LOG:  terminating process %d to release replication slot \"%s\"
because its restart_lsn %X/%X exceeds the limit %X/%X
HINT: You might need to increase max_slot_wal_keep_size.
This version emits the following message.

[35785:checkpointer] LOG:  terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/21000000
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.

As the hint is to increase max_slot_wal_keep_size, what about reporting the difference in size (rather than the limit lsn)? Something along those lines?

[35785:checkpointer] LOG:  terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit by <NNN MB>.

Regards,

--

Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Mon, 5 Sep 2022 11:56:33 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in 
> Hi,
> 
> On 3/2/22 7:37 AM, Kyotaro Horiguchi wrote:
> > At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro
> > Horiguchi<horikyota.ntt@gmail.com> wrote in
> >> So what do you say if I propose the following?
> >>
> >> LOG:  terminating process %d to release replication slot \"%s\"
> >> because its restart_lsn %X/%X exceeds the limit %X/%X
> >> HINT: You might need to increase max_slot_wal_keep_size.
> > This version emits the following message.
> >
> > [35785:checkpointer] LOG: terminating process 36368 to release
> > replication slot "s1" because its restart_lsn 0/1F000148 exceeds the
> > limit 0/21000000
> > [35785:checkpointer] HINT: You might need to increase
> > max_slot_wal_keep_size.
> 
> As the hint is to increase max_slot_wal_keep_size, what about
> reporting the difference in size (rather than the limit lsn)?
> Something along those lines?
> 
> [35785:checkpointer] LOG: terminating process 36368 to release
> replication slot "s1" because its restart_lsn 0/1F000148 exceeds the
> limit by <NNN MB>.

Thanks! That might be more sensible exactly for the reason you
mentioned.  One issue doing that is size_pretty is dbsize.c local
function. Since the size is less than kB in many cases, we cannot use
fixed unit for that.

0001 and 0002 are the same with v5.

0003 exposes byte_size_pretty() to other modules.
0004 does the change by using byte_size_pretty()

After 0004 applied, they look like this.

> LOG:  terminating process 108413 to release replication slot "rep3" because its restart_lsn 0/7000D8 exceeds the
limitby 1024 kB
 
> HINT:  You might need to increase max_slot_wal_keep_size.

The reason for "1024 kB" instead of "1 MB" is the precise value is a
bit less than 1024 * 1024.


regards.

- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
"Drouvot, Bertrand"
Дата:

Hi,

On 9/6/22 7:53 AM, Kyotaro Horiguchi wrote:
At Mon, 5 Sep 2022 11:56:33 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in
Hi,

On 3/2/22 7:37 AM, Kyotaro Horiguchi wrote:
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro
Horiguchi<horikyota.ntt@gmail.com> wrote in
So what do you say if I propose the following?

LOG:  terminating process %d to release replication slot \"%s\"
because its restart_lsn %X/%X exceeds the limit %X/%X
HINT: You might need to increase max_slot_wal_keep_size.
This version emits the following message.

[35785:checkpointer] LOG: terminating process 36368 to release
replication slot "s1" because its restart_lsn 0/1F000148 exceeds the
limit 0/21000000
[35785:checkpointer] HINT: You might need to increase
max_slot_wal_keep_size.
As the hint is to increase max_slot_wal_keep_size, what about
reporting the difference in size (rather than the limit lsn)?
Something along those lines?

[35785:checkpointer] LOG: terminating process 36368 to release
replication slot "s1" because its restart_lsn 0/1F000148 exceeds the
limit by <NNN MB>.
Thanks! That might be more sensible exactly for the reason you
mentioned.  One issue doing that is size_pretty is dbsize.c local
function. Since the size is less than kB in many cases, we cannot use
fixed unit for that.

Thanks for the new patch version!. I did not realized (sorry about that) that we'd need to expose byte_size_pretty(). Now I wonder if we should not simply report the number of bytes (like I can see it is done in many places). So something like:

@@ -1298,9 +1298,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
                                byte_size_pretty(buf, sizeof(buf),
                                                                 oldestLSN - restart_lsn);
                                ereport(LOG,
-                                               (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %s",
+                                               (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %lu bytes",
                                                                active_pid, NameStr(slotname),
-                                                               LSN_FORMAT_ARGS(restart_lsn), buf),
+                                                               LSN_FORMAT_ARGS(restart_lsn), oldestLSN - restart_lsn),
                                                 errhint("You might need to increase max_slot_wal_keep_size.")));

and then forget about exposing/using byte_size_pretty() (that would be more consistent with the same kind of reporting in the existing code).

What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
(I noticed I sent a wrong version..)

At Tue, 6 Sep 2022 10:54:35 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in 
> Thanks for the new patch version!. I did not realized (sorry about
> that) that we'd need to expose byte_size_pretty(). Now I wonder if we

I didn't think we need the units larger than MB, but I used
pretty_print to prevent small number from rounding to exactly zero. On
the other hand, in typical cases it is longer than 6 digits in bytes,
which is a bit hard to read a glance.

> LOG:  terminating process 16034 to release replication slot "rep1" because its restart_lsn 0/3158000 exceeds the
limitby 15368192 bytes
 

> should not simply report the number of bytes (like I can see it is
> done in many places). So something like:
..
> + (errmsg("terminating process %d to release replication slot \"%s\"
> because its restart_lsn %X/%X exceeds the limit by %lu bytes",
..
> and then forget about exposing/using byte_size_pretty() (that would be
> more consistent with the same kind of reporting in the existing code).
> 
> What do you think?

An alterntive would be rounding up to the whole MB, or a sub-MB.

>        ereport(LOG,
>            (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds
thelimit by %.1lf MB",
 
>                active_pid, NameStr(slotname),
>                LSN_FORMAT_ARGS(restart_lsn),
>                /* round-up at sub-MB */
>                ceil((double) (oldestLSN - restart_lsn) / 1024 / 102.4) / 10),

> LOG:  terminating process 49539 to release replication slot "rep1" because its restart_lsn 0/3038000 exceeds the
limitby 15.8 MB
 

If the distance were 1 byte, it is shown as "0.1 MB".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
"Drouvot, Bertrand"
Дата:

Hi,

On 9/7/22 4:20 AM, Kyotaro Horiguchi wrote:
(I noticed I sent a wrong version..)

At Tue, 6 Sep 2022 10:54:35 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in
Thanks for the new patch version!. I did not realized (sorry about
that) that we'd need to expose byte_size_pretty(). Now I wonder if we
I didn't think we need the units larger than MB, but I used
pretty_print to prevent small number from rounding to exactly zero. 

Yeah makes sense.

Also, rounding to zero wouldn't occur with "just" displaying "oldestLSN - restart_lsn" (as proposed upthread).

On
the other hand, in typical cases it is longer than 6 digits in bytes,
which is a bit hard to read a glance.

Yeah right, but that's already the case in some part of the code, like for example in arrayfuncs.c:

                ereport(ERROR,
                        (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                         errmsg("array size exceeds the maximum allowed (%d)",
                                (int) MaxAllocSize)));

LOG:  terminating process 16034 to release replication slot "rep1" because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes
should not simply report the number of bytes (like I can see it is
done in many places). So something like:
..
+ (errmsg("terminating process %d to release replication slot \"%s\"
because its restart_lsn %X/%X exceeds the limit by %lu bytes",
..
and then forget about exposing/using byte_size_pretty() (that would be
more consistent with the same kind of reporting in the existing code).

What do you think?
An alterntive would be rounding up to the whole MB, or a sub-MB.

       ereport(LOG,           (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %.1lf MB",               active_pid, NameStr(slotname),               LSN_FORMAT_ARGS(restart_lsn),               /* round-up at sub-MB */               ceil((double) (oldestLSN - restart_lsn) / 1024 / 102.4) / 10),

typo "/ 102.4" ?

LOG:  terminating process 49539 to release replication slot "rep1" because its restart_lsn 0/3038000 exceeds the limit by 15.8 MB
If the distance were 1 byte, it is shown as "0.1 MB".

Right and I'm -1 on it, I think we should stick to the "pretty" or the "bytes only" approach (my preference being the bytes only one).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Wed, 7 Sep 2022 12:16:29 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in 
> Also, rounding to zero wouldn't occur with "just" displaying
> "oldestLSN - restart_lsn" (as proposed upthread).
..
> Yeah right, but that's already the case in some part of the code, like
> for example in arrayfuncs.c:

Fair points.

> >>         ereport(LOG,
> >>             (errmsg("terminating process %d to release replication slot
> >>             \"%s\" because its restart_lsn %X/%X exceeds the limit by %.1lf
> >>             MB",
> >>                 active_pid, NameStr(slotname),
> >>                 LSN_FORMAT_ARGS(restart_lsn),
> >>                 /* round-up at sub-MB */
> >>                 ceil((double) (oldestLSN - restart_lsn) / 1024 / 102.4) /
> >>                 10),
> 
> typo "/ 102.4" ?

No, it rounds the difference up to one decimal place. So it is devided
by 10 after ceil():p

> >> LOG: terminating process 49539 to release replication slot "rep1"
> >> because its restart_lsn 0/3038000 exceeds the limit by 15.8 MB
> > If the distance were 1 byte, it is shown as "0.1 MB".
> 
> Right and I'm -1 on it, I think we should stick to the "pretty" or the
> "bytes only" approach (my preference being the bytes only one).

Okay. the points you brought up above are sufficient grounds for not
doing so.  Now they are in the following format.

>> LOG: terminating process 16034 to release replication slot "rep1"
>> because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes

Thank you for the discussion, Bertrand!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
"Drouvot, Bertrand"
Дата:

Hi,

On 9/8/22 6:40 AM, Kyotaro Horiguchi wrote:
At Wed, 7 Sep 2022 12:16:29 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in
LOG: terminating process 49539 to release replication slot "rep1"
because its restart_lsn 0/3038000 exceeds the limit by 15.8 MB
If the distance were 1 byte, it is shown as "0.1 MB".
Right and I'm -1 on it, I think we should stick to the "pretty" or the
"bytes only" approach (my preference being the bytes only one).
Okay. the points you brought up above are sufficient grounds for not
doing so.  Now they are in the following format.

LOG: terminating process 16034 to release replication slot "rep1"
because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes
Thank you for the discussion, Bertrand!

You are welcome, thanks for the patch!

It looks good to me, barring any objections i think we can mark the CF entry as Ready for Committer.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Tom Lane
Дата:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> Okay. the points you brought up above are sufficient grounds for not
> doing so.  Now they are in the following format.

> LOG: terminating process 16034 to release replication slot "rep1"
> because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes

This seems to me to be a pretty blatant violation of our first message
style guideline [1]:

    The primary message should be short, factual, and avoid reference to
    implementation details such as specific function names. “Short” means
    “should fit on one line under normal conditions”. Use a detail message
    if needed to keep the primary message short ...

I think you should leave the primary message alone and add a DETAIL,
as the first version of the patch did.

The existing "invalidating slot" message is already in violation
of this guideline, so splitting off a DETAIL from that seems
indicated as well.

            regards, tom lane

[1] https://www.postgresql.org/docs/devel/error-style-guide.html



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Tom Lane
Дата:
... oh, one other point is that using %ld to print an int64 is entirely
not portable, as indeed the cfbot is complaining about.

I think our best practice on that is to put %lld in the format string
and explicitly cast the corresponding argument to "long long".

            regards, tom lane



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > Okay. the points you brought up above are sufficient grounds for not
> > doing so.  Now they are in the following format.
> 
> > LOG: terminating process 16034 to release replication slot "rep1"
> > because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes
> 
> This seems to me to be a pretty blatant violation of our first message
> style guideline [1]:

Thanks! It seems that I was waiting for a comment on that line.  I
thought that way at first but finally returned to the current message
as the result of discussion (in my memory). I will happily make the
main message shorter.

> I think you should leave the primary message alone and add a DETAIL,
> as the first version of the patch did.
> 
> The existing "invalidating slot" message is already in violation
> of this guideline, so splitting off a DETAIL from that seems
> indicated as well.

So I'm going to change the mssage as:

LOG:  terminating process %d to release replication slot \"%s\"
DETAIL:  The slot's restart_lsn %X/%X exceeds the limit by %lld bytes.
HINT:  You might need to increase max_slot_wal_keep_size.

LOG:  invalidating *replication* slot \"%s\"
DETAILS:  (ditto)
HINTS:  (ditto)

It seems that it's no longer useful to split out the first patch so I
merged them into one.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Tom Lane
Дата:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
>> I think you should leave the primary message alone and add a DETAIL,
>> as the first version of the patch did.

> So I'm going to change the mssage as:

> LOG:  terminating process %d to release replication slot \"%s\"
> DETAIL:  The slot's restart_lsn %X/%X exceeds the limit by %lld bytes.
> HINT:  You might need to increase max_slot_wal_keep_size.

> LOG:  invalidating *replication* slot \"%s\"
> DETAILS:  (ditto)
> HINTS:  (ditto)

I thought the latter was a little *too* short; the primary message
should at least give you some clue why that happened, even if it
doesn't offer all the detail.  After some thought I changed it to

LOG:  invalidating obsolete replication slot \"%s\"

and pushed it that way.

            regards, tom lane



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Thu, 29 Sep 2022 13:31:00 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> > LOG:  invalidating *replication* slot \"%s\"
> > DETAILS:  (ditto)
> > HINTS:  (ditto)
> 
> I thought the latter was a little *too* short; the primary message
> should at least give you some clue why that happened, even if it
> doesn't offer all the detail.  After some thought I changed it to

Yeah, agreed. It looks better. (I was about to spell it as
"invalidating slot "%s"" then changed my mind to add "replication". I
felt that it is a bit too short but didn't think about further
streaching that by adding "obsolete"..).

> LOG:  invalidating obsolete replication slot \"%s\"
> 
> and pushed it that way.

Thanks. And thanks for fixing the test script, too.

By the way, I didn't notice at that time (and forgot about the
policy), but the HINT message has variations differing only by the
variable name.

What do you think about the attached?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Tom Lane
Дата:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> By the way, I didn't notice at that time (and forgot about the
> policy), but the HINT message has variations differing only by the
> variable name.

> What do you think about the attached?

Hmm, maybe, but a quick grep for 'You might need to increase'
finds about a dozen other cases, and none of them are using %s.
If we do this we should change all of them, and they probably
need "translator:" hints.  I'm not sure whether abstracting
away the variable names will make translation harder.

            regards, tom lane



Re: more descriptive message for process termination due to max_slot_wal_keep_size

От
Kyotaro Horiguchi
Дата:
At Thu, 29 Sep 2022 22:49:00 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > By the way, I didn't notice at that time (and forgot about the
> > policy), but the HINT message has variations differing only by the
> > variable name.
> 
> > What do you think about the attached?
> 
> Hmm, maybe, but a quick grep for 'You might need to increase'
> finds about a dozen other cases, and none of them are using %s.
(Mmm. I didn't find others only in po files..)
> If we do this we should change all of them, and they probably
> need "translator:" hints.  I'm not sure whether abstracting
> away the variable names will make translation harder.

I expect that dedicated po-editing tools can lookup corresponding code
lines, which gives the answer if no hint is attached at least in this
specific case.

Anyway, thinking calmly, since we are not about to edit these
messages, it's unlikely to happen on purpose, I think. So I don't mean
to push this so hard.

Thanks for the comment!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center