Обсуждение: Design of pg_stat_subscription_workers vs pgstats

Поиск

Список

Период

Сортировка

Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

25 января 2022 г., 09:31:31

Hi,

I was looking the shared memory stats patch again. The rebase of which
collided fairly heavily with the addition of pg_stat_subscription_workers.

I'm concerned about the design of pg_stat_subscription_workers. The view was
introduced in


commit 8d74fc96db5fd547e077bf9bf4c3b67f821d71cd
Author: Amit Kapila <akapila@postgresql.org>
Date:   2021-11-30 08:54:30 +0530

    Add a view to show the stats of subscription workers.

    This commit adds a new system view pg_stat_subscription_workers, that
    shows information about any errors which occur during the application of
    logical replication changes as well as during performing initial table
    synchronization. The subscription statistics entries are removed when the
    corresponding subscription is removed.

    It also adds an SQL function pg_stat_reset_subscription_worker() to reset
    single subscription errors.

    The contents of this view can be used by an upcoming patch that skips the
    particular transaction that conflicts with the existing data on the
    subscriber.

    This view can be extended in the future to track other xact related
    statistics like the number of xacts committed/aborted for subscription
    workers.

    Author: Masahiko Sawada
    Reviewed-by: Greg Nancarrow, Hou Zhijie, Tang Haiying, Vignesh C, Dilip Kumar, Takamichi Osumi, Amit Kapila
    Discussion: https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com


I tried to skim-read the discussion leading to its introduction, but it's
extraordinarily long: 474 messages in [1], 131 messages in [2], as well as a
few other associated threads.


From the commit message alone I am concerned that this appears to be intended
to be used to store important state in pgstats. For which pgstats is
fundamentally unsuitable (pgstat can loose state during normal operation,
always looses state during crash restarts, the state can be reset).

I don't really understand the name "pg_stat_subscription_workers" - what
workers are stats kept about exactly? The columns don't obviously refer to a
single worker or such? From the contents it should be name
pg_stat_subscription_table_stats or such. But no, that'd not quite right,
because apply errors are stored per-susbcription, while initial sync stuff is
per-(subscription, table).

The pgstat entries are quite wide (292 bytes), because of the error message
stored. That's nearly twice the size of PgStat_StatTabEntry. And as far as I
can tell, once there was an error, we'll just keep the stats entry around
until the subscription is dropped.  And that includes stats for long dropped
tables, as far as I can see - except that they're hidden from view, due to a
join to pg_subscription_rel.


To me this looks like it's using pgstat as an extremely poor IPC mechanism.


Why isn't this just storing data in pg_subscription_rel?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com
[2] https://postgr.es/m/OSBPR01MB48887CA8F40C8D984A6DC00CED199%40OSBPR01MB4888.jpnprd01.prod.outlook.com

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

25 января 2022 г., 14:27:07

Hi,

On Tue, Jan 25, 2022 at 3:31 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I was looking the shared memory stats patch again. The rebase of which
> collided fairly heavily with the addition of pg_stat_subscription_workers.
>
> I'm concerned about the design of pg_stat_subscription_workers. The view was
> introduced in
>
>
> commit 8d74fc96db5fd547e077bf9bf4c3b67f821d71cd
> Author: Amit Kapila <akapila@postgresql.org>
> Date:   2021-11-30 08:54:30 +0530
>
>     Add a view to show the stats of subscription workers.
>
>     This commit adds a new system view pg_stat_subscription_workers, that
>     shows information about any errors which occur during the application of
>     logical replication changes as well as during performing initial table
>     synchronization. The subscription statistics entries are removed when the
>     corresponding subscription is removed.
>
>     It also adds an SQL function pg_stat_reset_subscription_worker() to reset
>     single subscription errors.
>
>     The contents of this view can be used by an upcoming patch that skips the
>     particular transaction that conflicts with the existing data on the
>     subscriber.
>
>     This view can be extended in the future to track other xact related
>     statistics like the number of xacts committed/aborted for subscription
>     workers.
>
>     Author: Masahiko Sawada
>     Reviewed-by: Greg Nancarrow, Hou Zhijie, Tang Haiying, Vignesh C, Dilip Kumar, Takamichi Osumi, Amit Kapila
>     Discussion: https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com
>
>
> I tried to skim-read the discussion leading to its introduction, but it's
> extraordinarily long: 474 messages in [1], 131 messages in [2], as well as a
> few other associated threads.
>
>
> From the commit message alone I am concerned that this appears to be intended
> to be used to store important state in pgstats. For which pgstats is
> fundamentally unsuitable (pgstat can loose state during normal operation,
> always looses state during crash restarts, the state can be reset).

The information on pg_stat_subscription_workers view, especially
last_error_xid, can be used to specify XID to "ALTER SUBSCRIPTION ...
SKIP (xid = XXX)" command which is proposed on the same thread, but it
doesn't mean that the new SKIP command relies on this information. The
failure XID is written in the server logs as well and the user
specifies XID manually.

>
> I don't really understand the name "pg_stat_subscription_workers" - what
> workers are stats kept about exactly? The columns don't obviously refer to a
> single worker or such? From the contents it should be name
> pg_stat_subscription_table_stats or such. But no, that'd not quite right,
> because apply errors are stored per-susbcription, while initial sync stuff is
> per-(subscription, table).

This stores stats for subscription workers namely apply and tablesync
worker, so named as pg_stat_subscription_workers.

Also, there is another proposal to add transaction statistics for
logical replication subscribers[1], and it's reasonable to merge these
statistics and this error information rather than having separate
views[2]. There also was an idea to add the transaction statistics to
pg_stat_subscription view, but it doesn't seem a good idea because the
pg_stat_subscription shows dynamic statistics whereas the transaction
statistics are accumulative statistics[3].

>
> The pgstat entries are quite wide (292 bytes), because of the error message
> stored. That's nearly twice the size of PgStat_StatTabEntry. And as far as I
> can tell, once there was an error, we'll just keep the stats entry around
> until the subscription is dropped.

We can drop the particular statistics by
pg_stat_reset_subscription_worker() function.

> And that includes stats for long dropped
> tables, as far as I can see - except that they're hidden from view, due to a
> join to pg_subscription_rel.

We are planning to drop this after successfully apply[4].

> To me this looks like it's using pgstat as an extremely poor IPC mechanism.
>
>
> Why isn't this just storing data in pg_subscription_rel?

These need to be updated on error which means for a failed xact and we
don't want to update the system catalog in that state. There will be
some challenges in a case where updating pg_subscription_rel also
failed too (what to report to the user, etc.). And moreover, we don't
want to consume space for temporary information in the system catalog.

Regards,

[1]
https://www.postgresql.org/message-id/OSBPR01MB48887CA8F40C8D984A6DC00CED199%40OSBPR01MB4888.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/CAD21AoDF7LmSALzMfmPshRw_xFcRz3WvB-me8T2gO6Ht%3D3zL2w%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAA4eK1JqwpsvjhLxV8CMYQ3NrhimZ8AFhWHh0Qn1FrL%3DLXfY6Q%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CAA4eK1%2B9yXkWkJSNtWYV2rG7QNAnoAt%2BeNH0PexoSP9ZQmXKPg%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

27 января 2022 г., 08:46:06

Hi,

I didn't quite get to responding in depth, but I wanted to at least respond to
one point today.

On 2022-01-25 20:27:07 +0900, Masahiko Sawada wrote:
> > The pgstat entries are quite wide (292 bytes), because of the error message
> > stored. That's nearly twice the size of PgStat_StatTabEntry. And as far as I
> > can tell, once there was an error, we'll just keep the stats entry around
> > until the subscription is dropped.
> 
> We can drop the particular statistics by
> pg_stat_reset_subscription_worker() function.

Only if either the user wants to drop all stats, or somehow knows the oids of
already dropped tables...

> > Why isn't this just storing data in pg_subscription_rel?
> 
> These need to be updated on error which means for a failed xact and we
> don't want to update the system catalog in that state.

Rightly so! In fact, I'm concerned with sending a pgstats message in that
state as well. But: You don't need to. Just abort the current transaction,
start a new one, and update the state.

> There will be some challenges in a case where updating pg_subscription_rel
> also failed too (what to report to the user, etc.). And moreover, we don't
> want to consume space for temporary information in the system catalog.

You're consuming resources in a *WAY* worse way right now. The stats file gets
constantly written out, and quite often read back by backends. In contrast to
parts of pg_subscription_rel or such that data can't be removed from
shared_buffers under pressure.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

27 января 2022 г., 15:07:41

On Thu, Jan 27, 2022 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-01-25 20:27:07 +0900, Masahiko Sawada wrote:
>
> > There will be some challenges in a case where updating pg_subscription_rel
> > also failed too (what to report to the user, etc.). And moreover, we don't
> > want to consume space for temporary information in the system catalog.
>
> You're consuming resources in a *WAY* worse way right now. The stats file gets
> constantly written out, and quite often read back by backends. In contrast to
> parts of pg_subscription_rel or such that data can't be removed from
> shared_buffers under pressure.
>

I don't think pg_subscription_rel is the right place to store error
info as the error can happen say while processing some message type
like BEGIN where we can't map it to pg_subscription_rel entry. There
could be other cases as well where we won't be able to map it to
pg_subscription_rel like some error related to some other table while
processing trigger function.

In general, there doesn't appear to be much advantage in storing all
the error info in system catalogs as we don't want it to be persistent
(crash-safe). Also, this information is not about any system object
that future operations can use, so won't map from that angle as well.

But, I see the point related to the size overhead of each message (296
bytes) and that is because of the error message present in each entry.
I think it would be better to store error_code instead of the message.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

27 января 2022 г., 23:18:51

On Thu, Jan 27, 2022 at 5:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 27, 2022 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-01-25 20:27:07 +0900, Masahiko Sawada wrote:
>
> > There will be some challenges in a case where updating pg_subscription_rel
> > also failed too (what to report to the user, etc.). And moreover, we don't
> > want to consume space for temporary information in the system catalog.
>
> You're consuming resources in a *WAY* worse way right now. The stats file gets
> constantly written out, and quite often read back by backends. In contrast to
> parts of pg_subscription_rel or such that data can't be removed from
> shared_buffers under pressure.
>

I don't think pg_subscription_rel is the right place to store error
info as the error can happen say while processing some message type
like BEGIN where we can't map it to pg_subscription_rel entry. There
could be other cases as well where we won't be able to map it to
pg_subscription_rel like some error related to some other table while
processing trigger function.

In general, there doesn't appear to be much advantage in storing all
the error info in system catalogs as we don't want it to be persistent
(crash-safe). Also, this information is not about any system object
that future operations can use, so won't map from that angle as well.

Repeating myself here to try and keep complaints regarding pg_stat_subscription_worker in one place.

This is my specific email with respect to the pg_stat_scription_workers design.

https://www.postgresql.org/message-id/CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT%3D38hMhJfvw%40mail.gmail.com

Specifically,

pg_stat_subscription_workers is defined as showing:

"will contain one row per subscription

worker on which errors have occurred, for workers applying logical
replication changes and workers handling the initial data copy of the
subscribed tables."

The fact that these errors remain (last_error_*) even after they are no longer relevant is my main gripe regarding this feature. The information itself is generally useful though last_error_count is not. These fields should auto-clear and be named current_error_* as they exist for purposes of describing the current state of any error-encountering logical replication worker so that ALTER SUBSCRIPTION SKIP, or someone other manual intervention, can be done with that knowledge without having to scan the subscriber's server logs.

This is my email trying to understand reality better in order to figure out what exactly is causing the limitations that are negatively impacting the design of this feature.

https://www.postgresql.org/message-id/CAKFQuwYJ7dsW%2BStsw5%2BZVoY3nwQ9j6pPt-7oYjGddH-h7uVb%2Bg%40mail.gmail.com

In short, it was convenient to use the statistics collector here even if doing so resulted in a non-user friendly (IMO) design. Given all of the limitations to the statistics collection infrastructure, and the fact that this data is not statistical in the usual usage of the term, I find that to be less than satisfying. To the point that I'd be inclined to revert this feature and hold up the ALTER SUBSCRIPTION SET patch until a more user-friendly design can be done using proper IPC techniques. (I also noted in the first email that pg_stat_archiver, solely by observing the column names it exposes, shares this same abuse of the statistics collector for non-statistical data).

In my second email I did some tracing and ended up at the PG_CATCH() block in src/backend/replication/logical/worker.c:L3629. When mentioning trying to get rid of the PG_RE_THROW() there apparently doing so completely is unwarranted due to fatal/panic errors. I am curious that the addition of the statistic reporting logic doesn't seem to care about the same. And in any case, while maybe PG_RE_THROW() cannot go away it could maybe be done conditionally, and the worker still allowed to exit should that be more desirable than making the routine safe for looping after an error.

Andres, I do not know how to be more precise than your comment "But: You don't need to. Just abort the current transaction, start a new one, and update the state.". When I suggested that idea it didn't seem to resonate with anyone on the other thread. Starting at the main PG_TRY() loop in worker.c noted above, could you maybe please explain in a bit more detail whether, and how hard, it would be to go from "just PG_RE_THROW();" to "abort and start a new transaction"?

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

28 января 2022 г., 00:15:18

Hi,

On 2022-01-27 13:18:51 -0700, David G. Johnston wrote:
> Repeating myself here to try and keep complaints regarding
> pg_stat_subscription_worker in one place.

Thanks!

> This is my specific email with respect to the pg_stat_scription_workers
> design.
>
> https://www.postgresql.org/message-id/CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT%3D38hMhJfvw%40mail.gmail.com
>
> Specifically,
>
> pg_stat_subscription_workers is defined as showing:
> "will contain one row per subscription
> worker on which errors have occurred, for workers applying logical
> replication changes and workers handling the initial data copy of the
> subscribed tables."
>
> The fact that these errors remain (last_error_*) even after they are no
> longer relevant is my main gripe regarding this feature.  The information
> itself is generally useful though last_error_count is not.  These fields
> should auto-clear and be named current_error_* as they exist for purposes
> of describing the current state of any error-encountering logical
> replication worker so that ALTER SUBSCRIPTION SKIP, or someone other manual
> intervention, can be done with that knowledge without having to scan the
> subscriber's server logs.

Indeed.

Another related thing is that using a 32bit xid for allowing skipping is a bad
idea anyway - we shouldn't adding new interfaces with xid wraparound dangers -
it's getting more and more common to have multiple wraparounds a day.  An
easily better alternative would be the LSN at which a transaction starts.

> This is my email trying to understand reality better in order to figure out
> what exactly is causing the limitations that are negatively impacting the
> design of this feature.
>
> https://www.postgresql.org/message-id/CAKFQuwYJ7dsW%2BStsw5%2BZVoY3nwQ9j6pPt-7oYjGddH-h7uVb%2Bg%40mail.gmail.com
>
> In short, it was convenient to use the statistics collector here even if
> doing so resulted in a non-user friendly (IMO) design.

And importantly, the whole justification for the scheme, namely the inability
to change actual tables in that state, just doesn't hold up. It's a few lines
to abort the failed transaction and log the error after.

Just retrying over and over at full pace doesn't seem like a good thing. We
should start to back off retries - the retries themselves can very well
contribute to making it harder to fix the problem, by holding locks etc. For
that the launcher (or workers) should check whether there's no worker because
it's errored out. With pgstats such a check would need this full sequence:

1) worker sends failure stats message
2) pgstats receive stats message
3) launcher sends ping to pgstats to request file to be written out
4) pgstats writes out the whole database's stats
5) launcher reads the whole stats file

That's a big and expensive cannon for a check whether we should delay the
launcher of a worker.

> Given all of the
> limitations to the statistics collection infrastructure, and the fact that
> this data is not statistical in the usual usage of the term, I find that to
> be less than satisfying.  To the point that I'd be inclined to revert this
> feature and hold up the ALTER SUBSCRIPTION SET patch until a more
> user-friendly design can be done using proper IPC techniques.

Same.

> In my second email I did some tracing and ended up at the PG_CATCH() block
> in src/backend/replication/logical/worker.c:L3629.  When mentioning trying
> to get rid of the PG_RE_THROW() there apparently doing so completely is
> unwarranted due to fatal/panic errors.  I am curious that the addition of
> the statistic reporting logic doesn't seem to care about the same.

We shouldn't even think about doing stuff like stats updates when we've
PANICed. You could argue its safe to do that in the FATAL case - but where
would such a FATAL validly come from? It'd be something like a user calling
pg_terminate_backend(), which isn't transaction specific, so it'd not make
sense to record details like xid in pg_stat_subscription_workers.

But the argument of needing to do something in PG_CATCH in the fatal/panic
case is bogus, because FATAL/PANIC doesn't reach PG_CATCH. errfinish() only
throws when elevel == ERROR, in the FATAL case we end with proc_exit(1), with
PANIC we abort().

> Andres, I do not know how to be more precise than your comment "But: You
> don't need to. Just abort the current transaction, start a new one, and
> update the state.".  When I suggested that idea it didn't seem to resonate
> with anyone on the other thread.  Starting at the main PG_TRY() loop in
> worker.c noted above, could you maybe please explain in a bit more detail
> whether, and how hard, it would be to go from "just PG_RE_THROW();" to
> "abort and start a new transaction"?

It's pretty easy from the POV of getting into a new transaction.

PG_CATCH():

    /* get us out of the failed transaction */
    AbortOutOfAnyTransaction();

    StartTransactionCommand();
    /* do something to remember the error we just got */
    CommitTransactionCommand();

It may be a bit harder to afterwards to to not just error out the whole
worker, because we'd need to know what to do instead.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

28 января 2022 г., 01:35:57

On Thu, Jan 27, 2022 at 2:15 PM Andres Freund <andres@anarazel.de> wrote:

Another related thing is that using a 32bit xid for allowing skipping is a bad
idea anyway - we shouldn't adding new interfaces with xid wraparound dangers -
it's getting more and more common to have multiple wraparounds a day. An
easily better alternative would be the LSN at which a transaction starts.

Interesting idea. I do not think a well-designed skipping feature need worry about wrap-around though. The XID to be skipped was just seen be a worker and because it failed it will continue to be the same XID encountered by that worker until it is resolved. There is no effective progression in time while the subscriber is stuck for wrap-around to happen. Since we want to skip the transaction as a whole adding a layer of hidden indirection to the process seems undesirable. I'm not against the idea though - to the user it is basically "copy this value from the error message in order to skip the transaction that caused the error". Then the system verifies the value and then ensures it skips one, and only one, transaction.

It's pretty easy from the POV of getting into a new transaction.

PG_CATCH():

/* get us out of the failed transaction */
AbortOutOfAnyTransaction();

StartTransactionCommand();
/* do something to remember the error we just got */
CommitTransactionCommand();

Thank you.

It may be a bit harder to afterwards to to not just error out the whole
worker, because we'd need to know what to do instead.

I imagine the launcher and worker startup code can be made to deal with the restart adequately. Just wait if the last thing seen was an error. Require the user to manually resume the worker - unless we really think a try-until-you-succeed with a backoff protocol is superior. Upon system restart all error information is cleared and we start from scratch and let the errors happen (or not depending) as they will.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

28 января 2022 г., 08:59:08

On Fri, Jan 28, 2022 at 1:49 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Thu, Jan 27, 2022 at 5:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Jan 27, 2022 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:
>> >
>> > On 2022-01-25 20:27:07 +0900, Masahiko Sawada wrote:
>> >
>> > > There will be some challenges in a case where updating pg_subscription_rel
>> > > also failed too (what to report to the user, etc.). And moreover, we don't
>> > > want to consume space for temporary information in the system catalog.
>> >
>> > You're consuming resources in a *WAY* worse way right now. The stats file gets
>> > constantly written out, and quite often read back by backends. In contrast to
>> > parts of pg_subscription_rel or such that data can't be removed from
>> > shared_buffers under pressure.
>> >
>>
>> I don't think pg_subscription_rel is the right place to store error
>> info as the error can happen say while processing some message type
>> like BEGIN where we can't map it to pg_subscription_rel entry. There
>> could be other cases as well where we won't be able to map it to
>> pg_subscription_rel like some error related to some other table while
>> processing trigger function.
>>
>> In general, there doesn't appear to be much advantage in storing all
>> the error info in system catalogs as we don't want it to be persistent
>> (crash-safe). Also, this information is not about any system object
>> that future operations can use, so won't map from that angle as well.
>
>
> Repeating myself here to try and keep complaints regarding pg_stat_subscription_worker in one place.
>
> This is my specific email with respect to the pg_stat_scription_workers design.
>
> https://www.postgresql.org/message-id/CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT%3D38hMhJfvw%40mail.gmail.com
>
> Specifically,
>
> pg_stat_subscription_workers is defined as showing:
> "will contain one row per subscription
> worker on which errors have occurred, for workers applying logical
> replication changes and workers handling the initial data copy of the
> subscribed tables."
>
> The fact that these errors remain (last_error_*) even after they are no longer relevant is my main gripe regarding
thisfeature.  The information itself is generally useful though last_error_count is not.  These fields should
auto-clearand be named current_error_* as they exist for purposes of describing the current state of any
error-encounteringlogical replication worker so that ALTER SUBSCRIPTION SKIP, or someone other manual intervention, can
bedone with that knowledge without having to scan the subscriber's server logs. 
>

We can discuss names of columns but the main reason was that tomorrow
say we want to account for total errors not only the current error
then we have to introduce the field error_count or something like that
which will then conflict with names like current_*. Similar for
transaction abort_count. In the initial versions of the patch, we were
not using last_* for column names but similar arguments led us to
change names to last_ terminology and the same was being used in
pg_stat_archiver. But, feel free to suggest better names. Yes, I agree
with an auto-clear point as well and there seems to be an agreement
for doing it after the next successful apply and or after we skipped
the failed xact.

> This is my email trying to understand reality better in order to figure out what exactly is causing the limitations
thatare negatively impacting the design of this feature. 
>
> https://www.postgresql.org/message-id/CAKFQuwYJ7dsW%2BStsw5%2BZVoY3nwQ9j6pPt-7oYjGddH-h7uVb%2Bg%40mail.gmail.com
>
> In short, it was convenient to use the statistics collector here even if doing so resulted in a non-user friendly
(IMO)design.  Given all of the limitations to the statistics collection infrastructure, and the fact that this data is
notstatistical in the usual usage of the term, I find that to be less than satisfying. 
>

I think the failures/conflicts are also important information for
users to know, so having a view of those doesn't appear to be a bad
idea. All this data is less suitable for system catalogs like
pg_subscription_rel or others for the reasons quoted in my previous
email [1]. You have already noted one view pg_stat_archiver and we do
have failure information like checksum_failures, deadlocks in some
other views. Then, we have some information like conflicts available
via pg_stat_database_conflicts. I think the error/conflict info about
apply failures is on similar lines.

>  To the point that I'd be inclined to revert this feature and hold up the ALTER SUBSCRIPTION SET patch until a more
user-friendlydesign can be done using proper IPC techniques. 
>

If we find a different/better way and that is a conclusion then I will
do it. But, in my humble opinion, let's first discuss and see why this
is incorrect? IIUC, your main argument was to allow auto skip instead
of allowing users to fetch XID (from a view or server logs) and then
use it in the command. We are already trying to brainstorm that and
Sawada-San has already proposed a couple of ideas [2]. Also, the other
idea Andres has shared is to use LSN (of the corresponding failed
transaction) instead of XID which I find better.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BMDngbOQfMcAMsrf__s2a-MMMHaCR0zwde3GVeEi-bbQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAD21AoBdEcyXKMCMws7HjcYDbbPyq_KfUbCnTX84rDeP45Hbrw%40mail.gmail.com

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

01 февраля 2022 г., 09:17:04

On Fri, Jan 28, 2022 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 28, 2022 at 1:49 AM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> >
> > On Thu, Jan 27, 2022 at 5:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Thu, Jan 27, 2022 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:
> >> >
> >> > On 2022-01-25 20:27:07 +0900, Masahiko Sawada wrote:
> >> >
> >> > > There will be some challenges in a case where updating pg_subscription_rel
> >> > > also failed too (what to report to the user, etc.). And moreover, we don't
> >> > > want to consume space for temporary information in the system catalog.
> >> >
> >> > You're consuming resources in a *WAY* worse way right now. The stats file gets
> >> > constantly written out, and quite often read back by backends. In contrast to
> >> > parts of pg_subscription_rel or such that data can't be removed from
> >> > shared_buffers under pressure.
> >> >
> >>
> >> I don't think pg_subscription_rel is the right place to store error
> >> info as the error can happen say while processing some message type
> >> like BEGIN where we can't map it to pg_subscription_rel entry. There
> >> could be other cases as well where we won't be able to map it to
> >> pg_subscription_rel like some error related to some other table while
> >> processing trigger function.
> >>
> >> In general, there doesn't appear to be much advantage in storing all
> >> the error info in system catalogs as we don't want it to be persistent
> >> (crash-safe). Also, this information is not about any system object
> >> that future operations can use, so won't map from that angle as well.
> >
> >
> > Repeating myself here to try and keep complaints regarding pg_stat_subscription_worker in one place.
> >
> > This is my specific email with respect to the pg_stat_scription_workers design.
> >
> > https://www.postgresql.org/message-id/CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT%3D38hMhJfvw%40mail.gmail.com
> >
> > Specifically,
> >
> > pg_stat_subscription_workers is defined as showing:
> > "will contain one row per subscription
> > worker on which errors have occurred, for workers applying logical
> > replication changes and workers handling the initial data copy of the
> > subscribed tables."
> >
> > The fact that these errors remain (last_error_*) even after they are no longer relevant is my main gripe regarding
thisfeature.  The information itself is generally useful though last_error_count is not.  These fields should
auto-clearand be named current_error_* as they exist for purposes of describing the current state of any
error-encounteringlogical replication worker so that ALTER SUBSCRIPTION SKIP, or someone other manual intervention, can
bedone with that knowledge without having to scan the subscriber's server logs. 
> >
>
> We can discuss names of columns but the main reason was that tomorrow
> say we want to account for total errors not only the current error
> then we have to introduce the field error_count or something like that
> which will then conflict with names like current_*. Similar for
> transaction abort_count. In the initial versions of the patch, we were
> not using last_* for column names but similar arguments led us to
> change names to last_ terminology and the same was being used in
> pg_stat_archiver. But, feel free to suggest better names. Yes, I agree
> with an auto-clear point as well and there seems to be an agreement
> for doing it after the next successful apply and or after we skipped
> the failed xact.
>
> > This is my email trying to understand reality better in order to figure out what exactly is causing the limitations
thatare negatively impacting the design of this feature. 
> >
> > https://www.postgresql.org/message-id/CAKFQuwYJ7dsW%2BStsw5%2BZVoY3nwQ9j6pPt-7oYjGddH-h7uVb%2Bg%40mail.gmail.com
> >
> > In short, it was convenient to use the statistics collector here even if doing so resulted in a non-user friendly
(IMO)design.  Given all of the limitations to the statistics collection infrastructure, and the fact that this data is
notstatistical in the usual usage of the term, I find that to be less than satisfying. 
> >
>
> I think the failures/conflicts are also important information for
> users to know, so having a view of those doesn't appear to be a bad
> idea. All this data is less suitable for system catalogs like
> pg_subscription_rel or others for the reasons quoted in my previous
> email [1].

I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
feature to pass error-XID or error-LSN information to the worker
whereas I'm also not sure of the advantages in storing all error
information in a system catalog. Since what we need to do for this
purpose is only error-XID/LSN, we can store only error-XID/LSN in the
catalog? That is, the worker stores error-XID/LSN in the catalog on an
error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
the transaction in question. The worker clears the error-XID/LSN after
successfully applying or skipping the first non-empty transaction.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

02 февраля 2022 г., 06:07:40

On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 28, 2022 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 28, 2022 at 1:49 AM David G. Johnston
> > <david.g.johnston@gmail.com> wrote:
> > >
> > >
> > > In short, it was convenient to use the statistics collector here even if doing so resulted in a non-user friendly
(IMO)design.  Given all of the limitations to the statistics collection infrastructure, and the fact that this data is
notstatistical in the usual usage of the term, I find that to be less than satisfying. 
> > >
> >
> > I think the failures/conflicts are also important information for
> > users to know, so having a view of those doesn't appear to be a bad
> > idea. All this data is less suitable for system catalogs like
> > pg_subscription_rel or others for the reasons quoted in my previous
> > email [1].
>
> I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
> feature to pass error-XID or error-LSN information to the worker
> whereas I'm also not sure of the advantages in storing all error
> information in a system catalog. Since what we need to do for this
> purpose is only error-XID/LSN, we can store only error-XID/LSN in the
> catalog? That is, the worker stores error-XID/LSN in the catalog on an
> error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
> the transaction in question. The worker clears the error-XID/LSN after
> successfully applying or skipping the first non-empty transaction.
>

Where do you propose to store this information? I think we can't use
pg_subscription_rel for reasons quoted by me in email [1]. We can
store it in pg_subscription but that won't cover tablesync cases. I
think it can work if we store at both places. I think that would be
extendable if one wants to bring parallelism on the apply-side as we
can think of storing the values in the array. The other possibility
could be to invent a new catalog for this info but I guess it will
then have to have some duplicate info from pg_subscription/_rel.

The other point is after this, do we want an interface where the user
can also be allowed to specify error_lsn or error_xid? I think it
would be better to have such flexibility as that can be extended later
to allow users to skip some specific operations like 'update',
'insert', etc., or other similar things.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BMDngbOQfMcAMsrf__s2a-MMMHaCR0zwde3GVeEi-bbQ%40mail.gmail.com

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

02 февраля 2022 г., 07:11:40

On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>
> I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
> feature to pass error-XID or error-LSN information to the worker
> whereas I'm also not sure of the advantages in storing all error
> information in a system catalog. Since what we need to do for this
> purpose is only error-XID/LSN, we can store only error-XID/LSN in the
> catalog? That is, the worker stores error-XID/LSN in the catalog on an
> error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
> the transaction in question. The worker clears the error-XID/LSN after
> successfully applying or skipping the first non-empty transaction.
>

Where do you propose to store this information?

pg_subscription_worker

The error message and context is very important. Just make sure it is only non-null when the worker state is "syncing failed" (or whatever term we use).

Records are removed upon server restart (the launcher can handle this). Consider recording a last activity timestamp (some protection/visibility against bugs or, say, a worker ending without reporting that fact). Records stay around even when the worker goes away (the user can filter the state field to omit inactive rows). I'd consider just removing them when done and/or having a reset function that the DBA could run (it should never be wrong to clear the table).

Re: XID and/or LSN, I don't know enough yet to really judge this...

The other possibility
could be to invent a new catalog for this info but I guess it will
then have to have some duplicate info from pg_subscription/_rel.

The other point is after this, do we want an interface where the user
can also be allowed to specify error_lsn or error_xid?

...but whatever is decided, tell me, the user, what my options are, the limitations, and what info to copy from this catalog into the command(s) that I issue to the server, that will make the errors go away. This is generic, not specific to the skipping a commit command or the skip-to-lsn functions, but also includes considering performing DML on the relevant table(s) to avoid the error.

I don't think the fields would be duplicated. While some of the fields seem similar, aside from the key fields the data we would show would be state info for a given worker. None of the v14 fields do this at the worker scope.

That all makes the new catalog a generally useful monitoring source and a standalone patch. I'd personally start a new thread, with a functioning patch as the first message, and a recap of what and why this rework is being done. In order for Andres to make progress on the shared memory statistics patch I would suggest reverting this and building the new patch as if this statistics collector approach never happened.

I'd still like to get some clarity regarding the observation that our error-die-restart process seems problematic. Since that process needs to talk to the new catalog anyway I'd rather commit the changes to the process (if any, but I hope we can either all agree on the status quo or get something better in for v15), and the new catalog that provides insight into that process, as part of this first commit. That includes a probable user function to restart a halted worker instead of doing so continually (even with the suggested back-off protocol).

Then the SKIP commit can go in, leveraging the state information exposed in the catalog. That discussion and work should be restarted on a new thread with an intro recap message. The existing patch should be adapted to leverage the new pg_subscription_worker catalog before starting the new thread.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

02 февраля 2022 г., 09:54:52

On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> >
>> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> > feature to pass error-XID or error-LSN information to the worker
>> > whereas I'm also not sure of the advantages in storing all error
>> > information in a system catalog. Since what we need to do for this
>> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> > the transaction in question. The worker clears the error-XID/LSN after
>> > successfully applying or skipping the first non-empty transaction.
>> >
>>
>> Where do you propose to store this information?
>
>
> pg_subscription_worker
>
> The error message and context is very important.  Just make sure it is only non-null when the worker state is
"syncingfailed" (or whatever term we use).

>
>

Sure, but is this the reason you want to store all the error info in
the system catalog? I agree that providing more error info could be
useful and also possibly the previously failed (apply) xacts info as
well but I am not able to see why you want to have that sort of info
in the catalog. I could see storing info like err_lsn/err_xid that can
allow to proceed to apply worker automatically or to slow down the
launch of errored apply worker but not all sort of other error info
(like err_cnt, err_code, err_message, err_time, etc.). I want to know
why you are insisting to make all the error info persistent via the
system catalog?

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

02 февраля 2022 г., 10:36:08

On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> >
>> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> > feature to pass error-XID or error-LSN information to the worker
>> > whereas I'm also not sure of the advantages in storing all error
>> > information in a system catalog. Since what we need to do for this
>> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> > the transaction in question. The worker clears the error-XID/LSN after
>> > successfully applying or skipping the first non-empty transaction.
>> >
>>
>> Where do you propose to store this information?
>
>
> pg_subscription_worker
>
> The error message and context is very important. Just make sure it is only non-null when the worker state is "syncing failed" (or whatever term we use).
>
>

Sure, but is this the reason you want to store all the error info in
the system catalog? I agree that providing more error info could be
useful and also possibly the previously failed (apply) xacts info as
well but I am not able to see why you want to have that sort of info
in the catalog. I could see storing info like err_lsn/err_xid that can
allow to proceed to apply worker automatically or to slow down the
launch of errored apply worker but not all sort of other error info
(like err_cnt, err_code, err_message, err_time, etc.). I want to know
why you are insisting to make all the error info persistent via the
system catalog?

I look at the catalog and am informed that the worker has stopped because of an error. I'd rather simply read the error message right then instead of having to go look at the log file. And if I am going to take an action in order to overcome the error I would have to know what that error is; so the error message is not something I can ignore. The error is an attribute of system state, and the catalog stores the current state of the (workers) system.

I already explained that the concept of err_cnt is not useful. The fact that you include it here makes me think you are still thinking that this all somehow is meant to keep track of history. It is not. The workers are state machines and "error" is one of the states - with relevant attributes to display to the user, and system, while in that state. The state machine reporting does not care about historical states nor does it report on them. There is some uncertainty if we continue with the automatic re-launch; which, now that I write this, I can see where what you call err_cnt is effectively a count of how many times the worker re-launched without the underlying problem being resolved and thus encountered the same error. If we persist with the re-launch behavior then maybe err_cnt should be left in place - with the description for it basically being the ah-ha! comment I just made. In a world where we do not typically re-launch and simply re-try without being informed there is a change - such a count remains of minimal value.

I don't really understand the confusion here though - this error data already exists in the pg_stat_subscription_workers stat collector view - the fact that I want to keep it around (just changing the reset behavior) - doesn't seem like it should be controversial. I, thinking as a user, really don't care about all of these implementation details. Whether it is a pg_stat_* view (collector or shmem IPC) or a pg_* catalog is immaterial to me. The behavior I observe is what matters. As a developer I don't want to use the statistics collector because these are not statistics and the collector is unreliable. I don't know enough about the relevant differences between shared memory IPC and catalog tables to decide between them. But catalog tables seem like a lower bar to meet and seem like they can implement the user-facing requirements as I envision them.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

02 февраля 2022 г., 15:08:37

On Wed, Feb 2, 2022 at 1:06 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important.  Just make sure it is only non-null when the worker state is
"syncingfailed" (or whatever term we use). 
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
...
...
>
> I already explained that the concept of err_cnt is not useful.  The fact that you include it here makes me think you
arestill thinking that this all somehow is meant to keep track of history.  It is not.  The workers are state machines
and"error" is one of the states - with relevant attributes to display to the user, and system, while in that state.
Thestate machine reporting does not care about historical states nor does it report on them.  There is some uncertainty
ifwe continue with the automatic re-launch; 
>

I think automatic retry will help to allow some transient errors say
like network glitches that can be resolved on retry and will keep the
behavior transparent. This is also consistent with what we do in
standby mode where if there is an error on primary due to which
standby is not able to fetch some data it will just retry. We can't
fix any error that occurred on the server-side, so the way is to retry
which is true for both standby and subscribers. Basically, I don't
think every kind of error demands user intervention. We can allow to
control it via some parameter say disable_on_error as is discussed in
CF entry [1] but don't think that should be the default.

[1] - https://commitfest.postgresql.org/36/3407/

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

02 февраля 2022 г., 18:15:05

On Wed, Feb 2, 2022 at 5:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 2, 2022 at 1:06 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

...
>
> I already explained that the concept of err_cnt is not useful. The fact that you include it here makes me think you are still thinking that this all somehow is meant to keep track of history. It is not. The workers are state machines and "error" is one of the states - with relevant attributes to display to the user, and system, while in that state. The state machine reporting does not care about historical states nor does it report on them. There is some uncertainty if we continue with the automatic re-launch;
>

I think automatic retry will help to allow some transient errors say
like network glitches that can be resolved on retry and will keep the
behavior transparent. This is also consistent with what we do in
standby mode where if there is an error on primary due to which
standby is not able to fetch some data it will just retry. We can't
fix any error that occurred on the server-side, so the way is to retry
which is true for both standby and subscribers.

Good points. In short there are two subsets of problems to deal with here. We should address them separately, though the pg_subscription_worker table should provide relevant information for both cases. If we are in a retry situation relevant information, like next_scheduled_retry (estimated), should be provided (if there is some kind of delay involved). In a situation like "unique constraint violation" the "next_scheduled_retry" would be null; or make the field a text field and print "Manual Intervention Required". Likewise, the XID/LSN would be null in a retry situation since we haven't received a wholly intact transaction from the publisher (we may know of such an ID but if the final COMMIT message is never even seen before the feed dies we should not be exposing that incomplete information to the user).

A standby is not expected to encounter any user data constraint problems so even a system with manual intervention for such will work for standbys because they will never hit that code path. And you cannot simply skip applying the failed transaction and move onto the next one - that data also never came over.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

03 февраля 2022 г., 07:33:08

On Wed, Feb 2, 2022 at 4:36 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important.  Just make sure it is only non-null when the worker state is
"syncingfailed" (or whatever term we use). 
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
> I look at the catalog and am informed that the worker has stopped because of an error.  I'd rather simply read the
errormessage right then instead of having to go look at the log file.  And if I am going to take an action in order to
overcomethe error I would have to know what that error is; so the error message is not something I can ignore.  The
erroris an attribute of system state, and the catalog stores the current state of the (workers) system. 
>
> I already explained that the concept of err_cnt is not useful.  The fact that you include it here makes me think you
arestill thinking that this all somehow is meant to keep track of history.  It is not.  The workers are state machines
and"error" is one of the states - with relevant attributes to display to the user, and system, while in that state.
Thestate machine reporting does not care about historical states nor does it report on them.  There is some uncertainty
ifwe continue with the automatic re-launch; which, now that I write this, I can see where what you call err_cnt is
effectivelya count of how many times the worker re-launched without the underlying problem being resolved and thus
encounteredthe same error.  If we persist with the re-launch behavior then maybe err_cnt should be left in place - with
thedescription for it basically being the ah-ha! comment I just made. In a world where we do not typically re-launch
andsimply re-try without being informed there is a change - such a count remains of minimal value. 
>
> I don't really understand the confusion here though - this error data already exists in the
pg_stat_subscription_workersstat collector view - the fact that I want to keep it around (just changing the reset
behavior)- doesn't seem like it should be controversial.  I, thinking as a user, really don't care about all of these
implementationdetails.  Whether it is a pg_stat_* view (collector or shmem IPC) or a pg_* catalog is immaterial to me.
Thebehavior I observe is what matters.  As a developer I don't want to use the statistics collector because these are
notstatistics and the collector is unreliable.  I don't know enough about the relevant differences between shared
memoryIPC and catalog tables to decide between them.  But catalog tables seem like a lower bar to meet and seem like
theycan implement the user-facing requirements as I envision them. 

I see that important information such as error-XID that can be used
for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
using system catalogs is a reasonable way for this purpose. But it's
still unclear to me why all error information that is currently shown
in pg_stat_subscription_workers view, including error-XID and the
error message, relation OID, action, etc., need to be stored in the
catalog. The information other than error-XID doesn't necessarily need
to be reliable compared to error-XID. I think we can have
error-XID/LSN in the pg_subscription catalog and have other error
information in pg_stat_subscription_workers view. After the user
checks the current status of logical replication by checking
error-XID/LSN, they can check pg_stat_subscription_workers for
details.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

03 февраля 2022 г., 07:48:28

On Wednesday, February 2, 2022, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

and have other error
information in pg_stat_subscription_workers view.

What benefit is there to keeping the existing collector-based pg_stat_subscripiton_workers view? If we re-write it using shmem IPC then we might as well put everything there and forego using a catalog. Then it behaves in a similar manner to pg_stat_activity but for logical replication workers.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

03 февраля 2022 г., 08:35:10

On Thu, Feb 3, 2022 at 1:48 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Wednesday, February 2, 2022, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> and have other error
>> information in pg_stat_subscription_workers view.
>
>
> What benefit is there to keeping the existing collector-based pg_stat_subscripiton_workers view?  If we re-write it
usingshmem IPC then we might as well put everything there and forego using a catalog.  Then it behaves in a similar
mannerto pg_stat_activity but for logical replication workers. 

Yes, but if we use shmem IPC, we need to allocate shared memory for
them based on the number of subscriptions, not logical replication
workers (i.e., max_logical_replication_workers). So we cannot estimate
memory in the beginning. Also, IIUC the number of subscriptions that
are concurrently working is limited by max_replication_slots (see
ReplicationStateCtl) but I think we need to remember the state of
disabled subscriptions too.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Eisentraut

Дата:

03 февраля 2022 г., 12:55:37

On 02.02.22 07:54, Amit Kapila wrote:
>>> Where do you propose to store this information?
>>
>>
>> pg_subscription_worker
>>
>> The error message and context is very important.  Just make sure it is only non-null when the worker state is
"syncingfailed" (or whatever term we use).
 

We could name the table something like pg_subscription_worker_error, so 
it's explicit that it is collecting error information only.

> Sure, but is this the reason you want to store all the error info in
> the system catalog? I agree that providing more error info could be
> useful and also possibly the previously failed (apply) xacts info as
> well but I am not able to see why you want to have that sort of info
> in the catalog. I could see storing info like err_lsn/err_xid that can
> allow to proceed to apply worker automatically or to slow down the
> launch of errored apply worker but not all sort of other error info
> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
> why you are insisting to make all the error info persistent via the
> system catalog?

Let's flip this around and ask, why not?  Tables are the place to store 
data, by default.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

04 февраля 2022 г., 06:53:06

On Thu, Feb 3, 2022 at 3:25 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 02.02.22 07:54, Amit Kapila wrote:
>
> > Sure, but is this the reason you want to store all the error info in
> > the system catalog? I agree that providing more error info could be
> > useful and also possibly the previously failed (apply) xacts info as
> > well but I am not able to see why you want to have that sort of info
> > in the catalog. I could see storing info like err_lsn/err_xid that can
> > allow to proceed to apply worker automatically or to slow down the
> > launch of errored apply worker but not all sort of other error info
> > (like err_cnt, err_code, err_message, err_time, etc.). I want to know
> > why you are insisting to make all the error info persistent via the
> > system catalog?
>
> Let's flip this around and ask, why not?
>

Because we don't necessarily need all this information after the crash
and neither is this information about any system object which we
require for performing operations on objects. OTOH, if we need some
particular error/apply state(s) that should be okay to keep in
persistent form (in system catalog). In walreceiver (for standby), we
don't store the errors/conflicts in any table, they are either
reported in logs or shared via stats. Similarly, the archiver process
do share its failure information either via stats or logs. Similar is
the case for checkpointer which also just logs the error. Now,
similarly in this case also we are sharing the error information via
logs and stats.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

15 февраля 2022 г., 08:12:52

On Thu, Feb 3, 2022 at 2:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 3, 2022 at 1:48 PM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> >
> > On Wednesday, February 2, 2022, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>
> >> and have other error
> >> information in pg_stat_subscription_workers view.
> >
> >
> > What benefit is there to keeping the existing collector-based pg_stat_subscripiton_workers view?  If we re-write it
usingshmem IPC then we might as well put everything there and forego using a catalog.  Then it behaves in a similar
mannerto pg_stat_activity but for logical replication workers. 
>
> Yes, but if we use shmem IPC, we need to allocate shared memory for
> them based on the number of subscriptions, not logical replication
> workers (i.e., max_logical_replication_workers). So we cannot estimate
> memory in the beginning. Also, IIUC the number of subscriptions that
> are concurrently working is limited by max_replication_slots (see
> ReplicationStateCtl) but I think we need to remember the state of
> disabled subscriptions too.

Correction; the replication state remains even after the subscription
is disabled, and it's removed only when the subscription is dropped.
Therefore, the number of subscriptions that can be active in the
database cluster is effectively limited by max_replication_slots.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

15 февраля 2022 г., 19:26:02

Hi,

On 2022-02-03 14:35:10 +0900, Masahiko Sawada wrote:
> Yes, but if we use shmem IPC, we need to allocate shared memory for
> them based on the number of subscriptions, not logical replication
> workers (i.e., max_logical_replication_workers). So we cannot estimate
> memory in the beginning. Also, IIUC the number of subscriptions that
> are concurrently working is limited by max_replication_slots (see
> ReplicationStateCtl) but I think we need to remember the state of
> disabled subscriptions too.

Use dshash (i.e. dsm) with a small initial allocation in non-dynamic shared
memory...

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

15 февраля 2022 г., 21:17:42

Hi,

On 2022-02-04 09:23:06 +0530, Amit Kapila wrote:
> On Thu, Feb 3, 2022 at 3:25 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> >
> > On 02.02.22 07:54, Amit Kapila wrote:
> >
> > > Sure, but is this the reason you want to store all the error info in
> > > the system catalog? I agree that providing more error info could be
> > > useful and also possibly the previously failed (apply) xacts info as
> > > well but I am not able to see why you want to have that sort of info
> > > in the catalog. I could see storing info like err_lsn/err_xid that can
> > > allow to proceed to apply worker automatically or to slow down the
> > > launch of errored apply worker but not all sort of other error info
> > > (like err_cnt, err_code, err_message, err_time, etc.). I want to know
> > > why you are insisting to make all the error info persistent via the
> > > system catalog?
> >
> > Let's flip this around and ask, why not?
> >
> 
> Because we don't necessarily need all this information after the crash
> and neither is this information about any system object which we
> require for performing operations on objects.

I find this not particularly convincing. IMO data that leads the user to
compromise "replication integrity" is pretty crucial.

And skipped data needs to be logged somewhere persistent, so that there's a
chance to analyze / recover.

We also should utilize more detailed knowledge about errors to influence at
which interval replication is retried. Serialization error: retry soon. Other
errors: retry with increasing backoff.


> In walreceiver (for standby), we don't store the errors/conflicts in any
> table, they are either reported in logs or shared via stats.

That's imo quite different - they're fundamentally time-limited problems. And
they aren't leading the user / DBA to skip transactions etc.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

15 февраля 2022 г., 21:26:41

Hi,

On 2022-02-03 13:33:08 +0900, Masahiko Sawada wrote:
> I see that important information such as error-XID that can be used
> for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
> using system catalogs is a reasonable way for this purpose. But it's
> still unclear to me why all error information that is currently shown
> in pg_stat_subscription_workers view, including error-XID and the
> error message, relation OID, action, etc., need to be stored in the
> catalog. The information other than error-XID doesn't necessarily need
> to be reliable compared to error-XID. I think we can have
> error-XID/LSN in the pg_subscription catalog and have other error
> information in pg_stat_subscription_workers view. After the user
> checks the current status of logical replication by checking
> error-XID/LSN, they can check pg_stat_subscription_workers for
> details.

The stats system isn't geared towards storing error messages and
such. Generally it is about storing counts of events etc, not about
information about a single event. Obviously there are a few cases where that
boundary isn't that clear...

IOW, storing information like:
- subscription oid
- retryable error count
- hard error count
- #replicated inserts
- #replicated updates
- #replicated deletes

is what pgstats is for. But not
- subscription oid
- error message
- xid of error
- ...

IMO the addition of the pg_stat_subscription_workers view needs to be
reverted.

Yes, there's some precedent in pg_stat_archiver. But that ship has sailed
(it's released), and it's a much more limited issue. Just because we did a not
great thing once isn't a reason to do a similar, but even less great, thing
another time.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Robert Haas

Дата:

15 февраля 2022 г., 21:31:44

On Thu, Jan 27, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
> Only if either the user wants to drop all stats, or somehow knows the oids of
> already dropped tables...

If it's really true that we can end up storing data for dropped
objects, I think that's not acceptable and needs to be fixed.

I don't currently understand the other issues on this thread well
enough to have a clear opinion on them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

16 февраля 2022 г., 11:45:03

On Tue, Feb 15, 2022 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-04 09:23:06 +0530, Amit Kapila wrote:
> > On Thu, Feb 3, 2022 at 3:25 PM Peter Eisentraut
> > <peter.eisentraut@enterprisedb.com> wrote:
> > >
> > > On 02.02.22 07:54, Amit Kapila wrote:
> > >
> > > > Sure, but is this the reason you want to store all the error info in
> > > > the system catalog? I agree that providing more error info could be
> > > > useful and also possibly the previously failed (apply) xacts info as
> > > > well but I am not able to see why you want to have that sort of info
> > > > in the catalog. I could see storing info like err_lsn/err_xid that can
> > > > allow to proceed to apply worker automatically or to slow down the
> > > > launch of errored apply worker but not all sort of other error info
> > > > (like err_cnt, err_code, err_message, err_time, etc.). I want to know
> > > > why you are insisting to make all the error info persistent via the
> > > > system catalog?
> > >
> > > Let's flip this around and ask, why not?
> > >
> >
> > Because we don't necessarily need all this information after the crash
> > and neither is this information about any system object which we
> > require for performing operations on objects.
>
> I find this not particularly convincing. IMO data that leads the user to
> compromise "replication integrity" is pretty crucial.
>
> And skipped data needs to be logged somewhere persistent, so that there's a
> chance to analyze / recover.
>

Valid point. I think we can store this in a separate table
(pg_subsciption_conflict_history or something like that) but at some
point we need to clear this data like say when the user drops the
subscription or maybe a separate interface altogether or after a
particular time interval (user-configurable or otherwise), the
subscription worker or some other background worker clears this
information.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

16 февраля 2022 г., 11:49:37

On Tue, Feb 15, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-03 13:33:08 +0900, Masahiko Sawada wrote:
> > I see that important information such as error-XID that can be used
> > for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
> > using system catalogs is a reasonable way for this purpose. But it's
> > still unclear to me why all error information that is currently shown
> > in pg_stat_subscription_workers view, including error-XID and the
> > error message, relation OID, action, etc., need to be stored in the
> > catalog. The information other than error-XID doesn't necessarily need
> > to be reliable compared to error-XID. I think we can have
> > error-XID/LSN in the pg_subscription catalog and have other error
> > information in pg_stat_subscription_workers view. After the user
> > checks the current status of logical replication by checking
> > error-XID/LSN, they can check pg_stat_subscription_workers for
> > details.
>
> The stats system isn't geared towards storing error messages and
> such. Generally it is about storing counts of events etc, not about
> information about a single event. Obviously there are a few cases where that
> boundary isn't that clear...
>

True, in the beginning, we discussed this information to be stored in
system catalog [1] (See .... Also, I am thinking that instead of a
stat view, do we need to consider having a system table .. ) but later
discussion led us to store this as stats.

> IOW, storing information like:
> - subscription oid
> - retryable error count
> - hard error count
> - #replicated inserts
> - #replicated updates
> - #replicated deletes
>
> is what pgstats is for.
>

Some of this and similar ((like error count, last_error_time)) is
present in stats and something on the lines of the other information
is proposed in [2] (xacts successfully replicated (committed),
aborted, etc) to be stored along with it.

> But not
> - subscription oid
> - error message
> - xid of error
> - ...
>

I think from the current set of things we are capturing, the other
thing in this list will be error_command (insert/update/delete..) and
or probably error_code. So, we can keep this information in a system
table.

Based on this discussion, it appears to me what we want here is to
store the error info in some persistent way (system table) and the
counters (both error and success) of logical replication in stats. If
we can't achieve this work (separation) in the next few weeks (before
the feature freeze) then I'll revert the work related to stats.

[1] - https://www.postgresql.org/message-id/CAA4eK1LTE-AYtwatvLzAw%2BVy53C92QHoB7-rVbX-9nf8ws2Vag%40mail.gmail.com
[2] - https://commitfest.postgresql.org/37/3504/

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

16 февраля 2022 г., 11:50:10

On Wed, Feb 16, 2022 at 12:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 27, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
> > Only if either the user wants to drop all stats, or somehow knows the oids of
> > already dropped tables...
>
> If it's really true that we can end up storing data for dropped
> objects, I think that's not acceptable and needs to be fixed.
>

Agreed.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

16 февраля 2022 г., 14:36:18

On Wed, Feb 16, 2022 at 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-02-03 13:33:08 +0900, Masahiko Sawada wrote:
> > > I see that important information such as error-XID that can be used
> > > for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
> > > using system catalogs is a reasonable way for this purpose. But it's
> > > still unclear to me why all error information that is currently shown
> > > in pg_stat_subscription_workers view, including error-XID and the
> > > error message, relation OID, action, etc., need to be stored in the
> > > catalog. The information other than error-XID doesn't necessarily need
> > > to be reliable compared to error-XID. I think we can have
> > > error-XID/LSN in the pg_subscription catalog and have other error
> > > information in pg_stat_subscription_workers view. After the user
> > > checks the current status of logical replication by checking
> > > error-XID/LSN, they can check pg_stat_subscription_workers for
> > > details.
> >
> > The stats system isn't geared towards storing error messages and
> > such. Generally it is about storing counts of events etc, not about
> > information about a single event. Obviously there are a few cases where that
> > boundary isn't that clear...
> >
>
> True, in the beginning, we discussed this information to be stored in
> system catalog [1] (See .... Also, I am thinking that instead of a
> stat view, do we need to consider having a system table .. ) but later
> discussion led us to store this as stats.
>
> > IOW, storing information like:
> > - subscription oid
> > - retryable error count
> > - hard error count
> > - #replicated inserts
> > - #replicated updates
> > - #replicated deletes
> >
> > is what pgstats is for.
> >
>
> Some of this and similar ((like error count, last_error_time)) is
> present in stats and something on the lines of the other information
> is proposed in [2] (xacts successfully replicated (committed),
> aborted, etc) to be stored along with it.
>
> > But not
> > - subscription oid
> > - error message
> > - xid of error
> > - ...
> >
>
> I think from the current set of things we are capturing, the other
> thing in this list will be error_command (insert/update/delete..) and
> or probably error_code. So, we can keep this information in a system
> table.

Agreed. Also, we can have also commit-LSN or replace error-XID with error-LSN?

>
> Based on this discussion, it appears to me what we want here is to
> store the error info in some persistent way (system table) and the
> counters (both error and success) of logical replication in stats. If
> we can't achieve this work (separation) in the next few weeks (before
> the feature freeze) then I'll revert the work related to stats.

There was an idea to use shmem to store error info but it seems to be
better to use a system catalog to persist them.

I'll summarize changes we discussed and make a plan of changes and
feature designs toward the feature freeze (and v16). I think that once
we get a consensus on them we can start implementation and move it
forward.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

18 февраля 2022 г., 11:26:04

On Wed, Feb 16, 2022 at 8:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Feb 16, 2022 at 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2022-02-03 13:33:08 +0900, Masahiko Sawada wrote:
> > > > I see that important information such as error-XID that can be used
> > > > for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
> > > > using system catalogs is a reasonable way for this purpose. But it's
> > > > still unclear to me why all error information that is currently shown
> > > > in pg_stat_subscription_workers view, including error-XID and the
> > > > error message, relation OID, action, etc., need to be stored in the
> > > > catalog. The information other than error-XID doesn't necessarily need
> > > > to be reliable compared to error-XID. I think we can have
> > > > error-XID/LSN in the pg_subscription catalog and have other error
> > > > information in pg_stat_subscription_workers view. After the user
> > > > checks the current status of logical replication by checking
> > > > error-XID/LSN, they can check pg_stat_subscription_workers for
> > > > details.
> > >
> > > The stats system isn't geared towards storing error messages and
> > > such. Generally it is about storing counts of events etc, not about
> > > information about a single event. Obviously there are a few cases where that
> > > boundary isn't that clear...
> > >
> >
> > True, in the beginning, we discussed this information to be stored in
> > system catalog [1] (See .... Also, I am thinking that instead of a
> > stat view, do we need to consider having a system table .. ) but later
> > discussion led us to store this as stats.
> >
> > > IOW, storing information like:
> > > - subscription oid
> > > - retryable error count
> > > - hard error count
> > > - #replicated inserts
> > > - #replicated updates
> > > - #replicated deletes
> > >
> > > is what pgstats is for.
> > >
> >
> > Some of this and similar ((like error count, last_error_time)) is
> > present in stats and something on the lines of the other information
> > is proposed in [2] (xacts successfully replicated (committed),
> > aborted, etc) to be stored along with it.
> >
> > > But not
> > > - subscription oid
> > > - error message
> > > - xid of error
> > > - ...
> > >
> >
> > I think from the current set of things we are capturing, the other
> > thing in this list will be error_command (insert/update/delete..) and
> > or probably error_code. So, we can keep this information in a system
> > table.
>
> Agreed. Also, we can have also commit-LSN or replace error-XID with error-LSN?
>
> >
> > Based on this discussion, it appears to me what we want here is to
> > store the error info in some persistent way (system table) and the
> > counters (both error and success) of logical replication in stats. If
> > we can't achieve this work (separation) in the next few weeks (before
> > the feature freeze) then I'll revert the work related to stats.
>
> There was an idea to use shmem to store error info but it seems to be
> better to use a system catalog to persist them.
>
> I'll summarize changes we discussed and make a plan of changes and
> feature designs toward the feature freeze (and v16). I think that once
> we get a consensus on them we can start implementation and move it
> forward.
>

Here is the summary of the discussion, changes, and plan.

1. Move some error information such as the error message to a new
system catalog, pg_subscription_error. The pg_subscription_error table
would have the following columns:

* sesubid : subscription Oid.
* serelid : relation Oid (NULL for apply worker).
* seerrlsn : commit-LSN or the error transaction.
* seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
* seerrmsg : error message

The tuple is inserted or updated when an apply worker or a tablesync
worker raises an error. If the same error occurs in a row, the update
is skipped. The tuple is removed in the following cases:

* the subscription is dropped.
* the table is dropped.
* the table is removed from the subscription.
* the worker successfully committed a non-empty transaction.

With this change, pg_stat_subscription_workers will be like:

* subid
* subname
* subrelid
* error_count
* last_error_timestamp

This view will be extended by adding transaction statistics proposed
on another thread[1].

2. Fix a bug in pg_stat_subscription_workers. As pointed out by
Andres, there is a bug in pg_stat_subscription_workers; it doesn't
drop entries for already-dropped tables. We need to fix it.

3. Show commit-LSN of the error transaction in errcontext. Currently,
we show XID and commit timestamp in the errcontext. But given that we
use LSN in pg_subscriptoin_error, it's better to show commit-LSN as
well (or instead of error-XID).

4. Skipping the particular conflicted transaction. There are two proposals:

4-1. Use the existing replication_origin_advance() SQL function. We
don't need to add any additional syntax, instead use
replication_origin_advance() with the error-LSN reported in either
pg_subscription_error or server logs to skip the particular
transaction.

4-2. Introduce a new syntax like ALTER SUBSCRIPTION ... SKIP. This
proposal further has two options: (1) the user specifies error-LSN
manually and (2) the user just enables skipping behavior and error-LSN
is automatically fetched from pg_subscription_error. In any way, the
command raises an error when there is no error entry in
pg_subscription_error.

We can discuss this item for details on the original thread.

5. Record skipped data to the system catalog, say
pg_subscription_conflict_history so that there is a chance to analyze
and recover. The pg_subscription_conflict_history I'm thinking is that
we record the following of all skipped changes:

* command (INSERT, UPDATE etc.)
* commit-LSN
* before row (in json format?)
* after row (in json format?)

Given that we end up writing a huge amount of history if the
transaction is very large and I think there are peoples who want to
check what changes will be skipped together before enabling skipping
behavior, I think it could be optional. Therefore I think we can
provide an option for ALTER SUBSCRIPTION ... SKIP to whether the
skipped data is recorded or not and to the
pg_subscription_conflict_history or server logs. We can discuss the
details in a new thread.

4 and 5 might be better introduced together but I think since the user
is able to check what changes will be skipped on the publisher side we
can do 5 for v16. Feedback and comments are very welcome.

Regards,

[1]
https://www.postgresql.org/message-id/flat/OSBPR01MB48887CA8F40C8D984A6DC00CED199@OSBPR01MB4888.jpnprd01.prod.outlook.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

18 февраля 2022 г., 22:46:51

On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here is the summary of the discussion, changes, and plan.

1. Move some error information such as the error message to a new
system catalog, pg_subscription_error. The pg_subscription_error table
would have the following columns:

* sesubid : subscription Oid.
* serelid : relation Oid (NULL for apply worker).
* seerrlsn : commit-LSN or the error transaction.
* seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
* seerrmsg : error message

Not a fan of the "se" prefix but overall yes. We should include a timestamp.

The tuple is inserted or updated when an apply worker or a tablesync
worker raises an error. If the same error occurs in a row, the update
is skipped.

I disagree with this - I would treat every new instance of an error to be important and insert on conflict (sesubid, serelid) the new entry, updating lsn/cmd/msg with the new values.

The tuple is removed in the following cases:

* the subscription is dropped.
* the table is dropped.

* the table is removed from the subscription.
* the worker successfully committed a non-empty transaction.

Correct. This handles the "end of sync worker" just fine since its final action should be a successful commit of a non-empty transaction.

With this change, pg_stat_subscription_workers will be like:

* subid
* subname
* subrelid
* error_count
* last_error_timestamp

This view will be extended by adding transaction statistics proposed
on another thread[1].

I haven't reviewed that thread but in-so-far as this one goes I would just drop this altogether. The error count, if desired, can be added to pg_subscription_error, and the timestamp should be added there as noted above.

If this is useful for the feature on the other thread it can be reconstituted from there.

2. Fix a bug in pg_stat_subscription_workers. As pointed out by
Andres, there is a bug in pg_stat_subscription_workers; it doesn't
drop entries for already-dropped tables. We need to fix it.

Given the above, this becomes an N/A.

3. Show commit-LSN of the error transaction in errcontext. Currently,
we show XID and commit timestamp in the errcontext. But given that we
use LSN in pg_subscriptoin_error, it's better to show commit-LSN as
well (or instead of error-XID).

Agreed, I think: what "errcontext" is this referring to?

5. Record skipped data to the system catalog, say
pg_subscription_conflict_history so that there is a chance to analyze
and recover.

We can discuss the
details in a new thread.

Agreed - the "before skipping" consideration seems considerably more helpful; but wouldn't need to be persistent, it could just be a view. A permanent record probably would best be recorded in the logs - though if we get the pre-skip functionality the user can view directly and save the results wherever they wish or we can provide a function to spool the information to the log. I don't see persistent in-database storage being that desirable here.

If we only do something after the transaction has been skipped it may be useful to add an option to the skipping command to auto-disable the subscription after the transaction has been skipped and before any subsequent transactions are applied. This will give the user a chance to process the post-skipped information before restoring sync and having the system begin changing underneath them again.

4 and 5 might be better introduced together but I think since the user
is able to check what changes will be skipped on the publisher side we
can do 5 for v16.

How would one go about doing that (checking what changes will be skipped on the publisher side)?

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

18 февраля 2022 г., 23:32:55

Hi,

On 2022-02-18 17:26:04 +0900, Masahiko Sawada wrote:
> With this change, pg_stat_subscription_workers will be like:
> 
> * subid
> * subname
> * subrelid
> * error_count
> * last_error_timestamp

> This view will be extended by adding transaction statistics proposed
> on another thread[1].

I do not agree with these bits. What's the point of these per-relation stats
at this poitns.  You're just duplicating the normal relation pg_stats here.

I really think we just should drop pg_stat_subscription_workers. Even if we
don't, we definitely should rename it, because it still isn't meaningfully
about workers.

This stuff is getting painful for me. I'm trying to clean up some stuff for
shared memory stats, and this stuff doesn't fit in with the rest. I'll have to
rework some core stuff in the shared memory stats patch to make it work with
this. Just to then quite possibly deal with reverting that part.

Given the degree we're still designing stuff at this point, I think the
appropriate thing is to revert the patch, and then try from there.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

19 февраля 2022 г., 13:49:39

On Sat, Feb 19, 2022 at 1:17 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>>
>> Here is the summary of the discussion, changes, and plan.
>>
>> 1. Move some error information such as the error message to a new
>> system catalog, pg_subscription_error. The pg_subscription_error table
>> would have the following columns:
>>
>> * sesubid : subscription Oid.
>> * serelid : relation Oid (NULL for apply worker).
>> * seerrlsn : commit-LSN or the error transaction.
>> * seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
>> * seerrmsg : error message
>
>
> Not a fan of the "se" prefix but overall yes. We should include a timestamp.
>

How about naming it pg_subscription_worker_error as Peter E. has
suggested in one of his emails? I find pg_subscription_error slightly
odd as one could imagine that even the errors related to subscription
commands like Alter Subscription.

>>
>> The tuple is inserted or updated when an apply worker or a tablesync
>> worker raises an error. If the same error occurs in a row, the update
>> is skipped.
>

Are you going to query table to check if it is same error?

>
> I disagree with this - I would treat every new instance of an error to be important and insert on conflict (sesubid,
serelid)the new entry, updating lsn/cmd/msg with the new values.

>

I don't think that will be a problem but what advantage are you
envisioning with updating the same info except for timestamp?

>> The tuple is removed in the following cases:
>>
>> * the subscription is dropped.
>> * the table is dropped.
>>
>> * the table is removed from the subscription.
>> * the worker successfully committed a non-empty transaction.
>
>
> Correct.  This handles the "end of sync worker" just fine since its final action should be a successful commit of a
non-emptytransaction.

>>

I think for tablesync workers, we may need slightly different handling
as there could probably be no transactions to apply apart from the
initial copy. Now, I think for tablesync worker, we can't postpone it
till after we update the rel state as SUBREL_STATE_SYNCDONE because if
we do it after that and there is some error updating/deleting the
tuple, the tablesync worker won't be launched again and that entry
will remain in the system for a longer duration.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

19 февраля 2022 г., 16:21:43

On Saturday, February 19, 2022, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Feb 19, 2022 at 1:17 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>>
>> Here is the summary of the discussion, changes, and plan.
>>
>> 1. Move some error information such as the error message to a new
>> system catalog, pg_subscription_error. The pg_subscription_error table
>> would have the following columns:
>>
>> * sesubid : subscription Oid.
>> * serelid : relation Oid (NULL for apply worker).
>> * seerrlsn : commit-LSN or the error transaction.
>> * seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
>> * seerrmsg : error message
>
>
> Not a fan of the "se" prefix but overall yes. We should include a timestamp.
>

How about naming it pg_subscription_worker_error as Peter E. has
suggested in one of his emails? I find pg_subscription_error slightly
odd as one could imagine that even the errors related to subscription
commands like Alter Subscription.

Adding worker makes sense.

>>
>> The tuple is inserted or updated when an apply worker or a tablesync
>> worker raises an error. If the same error occurs in a row, the update
>> is skipped.
>

Are you going to query table to check if it is same error?

I don’t get the question, the quoted text is your which I disagree with. But the error message is being captured in any case.

>
> I disagree with this - I would treat every new instance of an error to be important and insert on conflict (sesubid, serelid) the new entry, updating lsn/cmd/msg with the new values.
>

I don't think that will be a problem but what advantage are you
envisioning with updating the same info except for timestamp?

Omission of timestamp (or any other non-key field we have) from the update is an oversight.

>> The tuple is removed in the following cases:
>>
>> * the subscription is dropped.
>> * the table is dropped.
>>
>> * the table is removed from the subscription.
>> * the worker successfully committed a non-empty transaction.
>
>
> Correct. This handles the "end of sync worker" just fine since its final action should be a successful commit of a non-empty transaction.
>>

I think for tablesync workers, we may need slightly different handling
as there could probably be no transactions to apply apart from the
initial copy. Now, I think for tablesync worker, we can't postpone it
till after we update the rel state as SUBREL_STATE_SYNCDONE because if
we do it after that and there is some error updating/deleting the
tuple, the tablesync worker won't be launched again and that entry
will remain in the system for a longer duration.

Not sure…but I don’t see how you can not have a non-empty transaction while still having an error.

Are subscriptions “dropped” upon a reboot? If not, that needs its own case for row removal.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

19 февраля 2022 г., 16:36:58

On Sat, Feb 19, 2022 at 6:51 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Saturday, February 19, 2022, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Sat, Feb 19, 2022 at 1:17 AM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >>
>> >> Here is the summary of the discussion, changes, and plan.
>> >>
>> >> 1. Move some error information such as the error message to a new
>> >> system catalog, pg_subscription_error. The pg_subscription_error table
>> >> would have the following columns:
>> >>
>> >> * sesubid : subscription Oid.
>> >> * serelid : relation Oid (NULL for apply worker).
>> >> * seerrlsn : commit-LSN or the error transaction.
>> >> * seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
>> >> * seerrmsg : error message
>> >
>> >
>> > Not a fan of the "se" prefix but overall yes. We should include a timestamp.
>> >
>>
>> How about naming it pg_subscription_worker_error as Peter E. has
>> suggested in one of his emails? I find pg_subscription_error slightly
>> odd as one could imagine that even the errors related to subscription
>> commands like Alter Subscription.
>>
>
> Adding worker makes sense.
>
>>
>> >>
>> >> The tuple is inserted or updated when an apply worker or a tablesync
>> >> worker raises an error. If the same error occurs in a row, the update
>> >> is skipped.
>> >
>>
>> Are you going to query table to check if it is same error?
>
>
> I don’t get the question, the quoted text is your which I disagree with.
>

It was Sawada-San's email and this question was for him.

>  But the error message is being captured in any case.
>>
>>
>> >
>> > I disagree with this - I would treat every new instance of an error to be important and insert on conflict
(sesubid,serelid) the new entry, updating lsn/cmd/msg with the new values. 
>> >
>>
>> I don't think that will be a problem but what advantage are you
>> envisioning with updating the same info except for timestamp?
>
>
> Omission of timestamp (or any other non-key field we have) from the update is an oversight.
>

Yeah, if we decide to keep timestamp which the original proposal
doesn't have then it makes sense to update again.

>>
>> >> The tuple is removed in the following cases:
>> >>
>> >> * the subscription is dropped.
>> >> * the table is dropped.
>> >>
>> >> * the table is removed from the subscription.
>> >> * the worker successfully committed a non-empty transaction.
>> >
>> >
>> > Correct.  This handles the "end of sync worker" just fine since its final action should be a successful commit of
anon-empty transaction. 
>> >>
>>
>> I think for tablesync workers, we may need slightly different handling
>> as there could probably be no transactions to apply apart from the
>> initial copy. Now, I think for tablesync worker, we can't postpone it
>> till after we update the rel state as SUBREL_STATE_SYNCDONE because if
>> we do it after that and there is some error updating/deleting the
>> tuple, the tablesync worker won't be launched again and that entry
>> will remain in the system for a longer duration.
>
>
> Not sure…but I don’t see how you can not have a non-empty transaction while still having an error.
>
> Are subscriptions “dropped” upon a reboot?  If not, that needs its own case for row removal.
>

Subscriptions are not dropped automatically on reboot but I don't
understand what you mean by "that needs its own case for row removal"?

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

19 февраля 2022 г., 16:38:19

On Sat, Feb 19, 2022 at 5:32 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-18 17:26:04 +0900, Masahiko Sawada wrote:
> > With this change, pg_stat_subscription_workers will be like:
> >
> > * subid
> > * subname
> > * subrelid
> > * error_count
> > * last_error_timestamp
>
> > This view will be extended by adding transaction statistics proposed
> > on another thread[1].
>
> I do not agree with these bits. What's the point of these per-relation stats
> at this poitns.  You're just duplicating the normal relation pg_stats here.
>
> I really think we just should drop pg_stat_subscription_workers. Even if we
> don't, we definitely should rename it, because it still isn't meaningfully
> about workers.

The view has stats per subscription worker (i.e., apply worker and
tablesync worker), not per relation. The subrelid is OID of the
relation that the tablesync worker is synchronizing. For the stats of
apply workers, it is null.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

19 февраля 2022 г., 16:43:49

On Sat, Feb 19, 2022 at 1:17 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> 5. Record skipped data to the system catalog, say
>> pg_subscription_conflict_history so that there is a chance to analyze
>> and recover.
>
>
>> We can discuss the
>> details in a new thread.
>
> Agreed - the "before skipping" consideration seems considerably more helpful; but wouldn't need to be persistent, it
couldjust be a view.  A permanent record probably would best be recorded in the logs - though if we get the pre-skip
functionalitythe user can view directly and save the results wherever they wish or we can provide a function to spool
theinformation to the log.  I don't see persistent in-database storage being that desirable here. 
>

We can consider this but note that it could fill up a lot of LOG and
difficult to find/interpret information. Say after skipping and
logging half of the transaction data, there is some error like
"connection error", it will then repeat the entire table data again.
Also, say the table has toast columns, we have some mechanism to write
such data in tables (like by compressing, etc) but printing huge data
in Logs would be tricky and it may not be easier to read it, we may
even need some sort of tuple header, column header etc. Also, there
could be errors from other sessions in-between which we should be able
to filter out but still it's may not be that clear.

> If we only do something after the transaction has been skipped it may be useful to add an option to the skipping
commandto auto-disable the subscription after the transaction has been skipped and before any subsequent transactions
areapplied.  This will give the user a chance to process the post-skipped information before restoring sync and having
thesystem begin changing underneath them again. 
>

I think it can be helpful and probably can be done as a separate patch?

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

19 февраля 2022 г., 16:44:28

On Sat, Feb 19, 2022 at 10:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 19, 2022 at 6:51 PM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> >
> > On Saturday, February 19, 2022, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Sat, Feb 19, 2022 at 1:17 AM David G. Johnston
> >> <david.g.johnston@gmail.com> wrote:
> >> >
> >> > On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >> >>
> >> >>
> >> >> Here is the summary of the discussion, changes, and plan.
> >> >>
> >> >> 1. Move some error information such as the error message to a new
> >> >> system catalog, pg_subscription_error. The pg_subscription_error table
> >> >> would have the following columns:
> >> >>
> >> >> * sesubid : subscription Oid.
> >> >> * serelid : relation Oid (NULL for apply worker).
> >> >> * seerrlsn : commit-LSN or the error transaction.
> >> >> * seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
> >> >> * seerrmsg : error message
> >> >
> >> >
> >> > Not a fan of the "se" prefix but overall yes. We should include a timestamp.
> >> >
> >>
> >> How about naming it pg_subscription_worker_error as Peter E. has
> >> suggested in one of his emails? I find pg_subscription_error slightly
> >> odd as one could imagine that even the errors related to subscription
> >> commands like Alter Subscription.
> >>
> >
> > Adding worker makes sense.

Agreed.

> >
> >>
> >> >>
> >> >> The tuple is inserted or updated when an apply worker or a tablesync
> >> >> worker raises an error. If the same error occurs in a row, the update
> >> >> is skipped.
> >> >
> >>
> >> Are you going to query table to check if it is same error?
> >
> >
> > I don’t get the question, the quoted text is your which I disagree with.
> >
>
> It was Sawada-San's email and this question was for him.

What I wanted to say about how to insert/update the tuple to
pg_subscription_worker_error is that we insert a new tuple for the
first time, then, when the next error occurred, the worker fetches the
tuple and checks if the error (i.e., error-LSN, error-cmd, and
error-message) is exactly the same as previous one. If it is, we can
skip updating it. But if we include the error-timestamp in the tuple,
we need to update it every time.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

19 февраля 2022 г., 16:48:26

On Sat, Feb 19, 2022 at 4:47 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Fri, Feb 18, 2022 at 1:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>>
>> Here is the summary of the discussion, changes, and plan.
>>
>> 1. Move some error information such as the error message to a new
>> system catalog, pg_subscription_error. The pg_subscription_error table
>> would have the following columns:
>>
>> * sesubid : subscription Oid.
>> * serelid : relation Oid (NULL for apply worker).
>> * seerrlsn : commit-LSN or the error transaction.
>> * seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
>> * seerrmsg : error message
>
>
> Not a fan of the "se" prefix but overall yes. We should include a timestamp.
>
>>
>> The tuple is inserted or updated when an apply worker or a tablesync
>> worker raises an error. If the same error occurs in a row, the update
>> is skipped.
>
>
> I disagree with this - I would treat every new instance of an error to be important and insert on conflict (sesubid,
serelid)the new entry, updating lsn/cmd/msg with the new values. 
>
>> The tuple is removed in the following cases:
>>
>> * the subscription is dropped.
>> * the table is dropped.
>>
>> * the table is removed from the subscription.
>> * the worker successfully committed a non-empty transaction.
>
>
> Correct.  This handles the "end of sync worker" just fine since its final action should be a successful commit of a
non-emptytransaction. 
>>
>>
>> With this change, pg_stat_subscription_workers will be like:
>>
>> * subid
>> * subname
>> * subrelid
>> * error_count
>> * last_error_timestamp
>>
>> This view will be extended by adding transaction statistics proposed
>> on another thread[1].
>
>
> I haven't reviewed that thread but in-so-far as this one goes I would just drop this altogether.  The error count, if
desired,can be added to pg_subscription_error, and the timestamp should be added there as noted above. 
>
> If this is useful for the feature on the other thread it can be reconstituted from there.
>
>>
>> 2. Fix a bug in pg_stat_subscription_workers. As pointed out by
>> Andres, there is a bug in pg_stat_subscription_workers; it doesn't
>> drop entries for already-dropped tables. We need to fix it.
>
>
> Given the above, this becomes an N/A.
>
>>
>> 3. Show commit-LSN of the error transaction in errcontext. Currently,
>> we show XID and commit timestamp in the errcontext. But given that we
>> use LSN in pg_subscriptoin_error, it's better to show commit-LSN as
>> well (or instead of error-XID).
>
>
> Agreed, I think: what "errcontext" is this referring to?

The CONTEXT part in a log message. The apply worker and the tablesync
worker write the details of the changes on an error in the CONTEXT
part as follow:

ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(1) already exists.
CONTEXT:  processing remote data during "INSERT" for replication
target relation "public.test" in transaction 716 with commit timestamp
2021-09-29 15:52:45.165754+00

>>
>>
>> 5. Record skipped data to the system catalog, say
>> pg_subscription_conflict_history so that there is a chance to analyze
>> and recover.
>
>
>> We can discuss the
>> details in a new thread.
>
> Agreed - the "before skipping" consideration seems considerably more helpful; but wouldn't need to be persistent, it
couldjust be a view.  A permanent record probably would best be recorded in the logs - though if we get the pre-skip
functionalitythe user can view directly and save the results wherever they wish or we can provide a function to spool
theinformation to the log.  I don't see persistent in-database storage being that desirable here. 
>
> If we only do something after the transaction has been skipped it may be useful to add an option to the skipping
commandto auto-disable the subscription after the transaction has been skipped and before any subsequent transactions
areapplied.  This will give the user a chance to process the post-skipped information before restoring sync and having
thesystem begin changing underneath them again. 
>
>>
>>
>> 4 and 5 might be better introduced together but I think since the user
>> is able to check what changes will be skipped on the publisher side we
>> can do 5 for v16.
>
>
> How would one go about doing that (checking what changes will be skipped on the publisher side)?

We can copy the replication slot while changing the plugin by using
pg_copy_replication_slot(). Then we can check the changes by using
pg_logical_slot_peek_changes() with the copied slot.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

19 февраля 2022 г., 19:02:03

Hi,

On 2022-02-19 22:38:19 +0900, Masahiko Sawada wrote:
> On Sat, Feb 19, 2022 at 5:32 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-02-18 17:26:04 +0900, Masahiko Sawada wrote:
> > > With this change, pg_stat_subscription_workers will be like:
> > >
> > > * subid
> > > * subname
> > > * subrelid
> > > * error_count
> > > * last_error_timestamp
> >
> > > This view will be extended by adding transaction statistics proposed
> > > on another thread[1].
> >
> > I do not agree with these bits. What's the point of these per-relation stats
> > at this poitns.  You're just duplicating the normal relation pg_stats here.
> >
> > I really think we just should drop pg_stat_subscription_workers. Even if we
> > don't, we definitely should rename it, because it still isn't meaningfully
> > about workers.
> 
> The view has stats per subscription worker (i.e., apply worker and
> tablesync worker), not per relation. The subrelid is OID of the
> relation that the tablesync worker is synchronizing. For the stats of
> apply workers, it is null.

That's precisely the misuse of the stats subsystem that I'm complaining about
here. The whole design of pgstat (as it is today) only makes sense if you can
loose a message and it doesn't matter much, because it's just an incremental
counter increment that's lost.  And to be able properly prune dead pgstat
contents the underlying objects stats are kept around either need to be
permanent (e.g. stats about WAL) or a record of objects needs to exist
(e.g. stats about relations).

Even leaving everything else aside, a key of (dboid, subid, subrelid), where
subrelid can be NULL, but where (dboid, subid) is *not* unique, imo is poor
relational design.  What is the justification for mixing relation specific and
non-relation specific contents in this view?

The patch you referenced [1] should just store the stats in the
pg_stat_subscription view, not pg_stat_subscription_workers.

It *does* make sense to keep stats about the number of table syncs that failed
etc. But that should be a counter in pg_stat_subscription, not a row in
pg_stat_subscription_workers.

IOW, we should just drop pg_stat_subscription_workers, and add a few counters
to pg_stat_subscription.

Greetings,

Andres Freund

[1]
https://www.postgresql.org/message-id/TYCPR01MB8373E658CEE48FE05505A120ED529%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

19 февраля 2022 г., 19:16:40

On Sat, Feb 19, 2022 at 9:02 AM Andres Freund <andres@anarazel.de> wrote:

Even leaving everything else aside, a key of (dboid, subid, subrelid), where
subrelid can be NULL, but where (dboid, subid) is *not* unique, imo is poor
relational design. What is the justification for mixing relation specific and
non-relation specific contents in this view?

The to-be-created pg_subscription_worker_error has this same design.

It really is a worker-oriented view so the PK should ideally (ignoring dboid at the moment) be just workerid and subid and subrelid would just be attributes, of which subrelid is optional. But a worker's natural key is (subid, subrelid) so long as you accept that null has to be considered equal to itself. Not usually an ideal model to pick but it seems like this is one of those exceptions to the rule that works just fine. Feel free to use InvalidOID instead of null if that makes things more consistent and easier to implement. The conceptual model is still the same. It doesn't seem to be beneficial to have tablesync and apply workers have their own distinct relations. They are similar enough that they can share this one.

IOW, we should just drop pg_stat_subscription_workers, and add a few counters
to pg_stat_subscription.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

19 февраля 2022 г., 19:37:43

Hi,

On 2022-02-19 09:16:40 -0700, David G. Johnston wrote:
> On Sat, Feb 19, 2022 at 9:02 AM Andres Freund <andres@anarazel.de> wrote:
> > Even leaving everything else aside, a key of (dboid, subid, subrelid),
> > where
> > subrelid can be NULL, but where (dboid, subid) is *not* unique, imo is poor
> > relational design.  What is the justification for mixing relation specific
> > and
> > non-relation specific contents in this view?

> The to-be-created pg_subscription_worker_error has this same design.
> It really is a worker-oriented view so the PK should ideally (ignoring
> dboid at the moment) be just workerid and subid and subrelid would just be
> attributes, of which subrelid is optional.  But a worker's natural key is
> (subid, subrelid) so long as you accept that null has to be considered
> equal to itself.  Not usually an ideal model to pick but it seems like this
> is one of those exceptions to the rule that works just fine.  Feel free to
> use InvalidOID instead of null if that makes things more consistent and
> easier to implement.  The conceptual model is still the same.  It doesn't
> seem to be beneficial to have tablesync and apply workers have their own
> distinct relations.  They are similar enough that they can share this one.

I'm not so convinced. "User space" reasonably might want to know why a certain
replication hasn't yet become part of logical replication / why a refresh
hasn't completed yet. For that something like

  SELECT *
  FROM pg_subscription ps JOIN pg_subscription_worker_error pwe ON (ps.oid = pwe.subid)
  WHERE subrelid = 'foo'::regclass;

makes sense.

The apply side is different from that. Yes, a "subrelid = ..." predicate would
filter out non-relation specific failures, but it still seems weird that apply
failures would even need to be filtered out for something like this.

And conversely, monitoring for apply failures is conceptually looking for
something different than tablesync.

IMO the type of information you'd want for apply failures is substantially
different enough from worker failures that I don't really see the temptation
to put them in the same table.

I also still think that _worker shouldn't be part of any of the naming
here. It's an implementation detail that we use one worker for one tablesync
etc. It'd make sense for one apply worker to sync multiple small tables, and
it'd make a lot of sense for multiple apply workers to collaborate on syncing
one large relation.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

19 февраля 2022 г., 20:04:57

On Sat, Feb 19, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:

IMO the type of information you'd want for apply failures is substantially

different enough from worker failures that I don't really see the temptation
to put them in the same table.

It's an error message and a transaction LSN in both cases right now, along with knowledge of whether said transaction only affects a single table (relid is not null) or not (relid is null). Do you have a concrete idea in mind that would make this separation need more obvious?

I also still think that _worker shouldn't be part of any of the naming
here. It's an implementation detail that we use one worker for one tablesync
etc. It'd make sense for one apply worker to sync multiple small tables, and
it'd make a lot of sense for multiple apply workers to collaborate on syncing
one large relation.

Good point. The existing design doesn't actually require the "worker status" concept I described; so let's not have worker be part of the name.

So basically separate the proposed pg_subscription_error table into two: a pg_subscription_tablesync_error and pg_subscription_apply_error. The former having a relid field while the later does not. What fields belong on each?

How about we have it both ways. Two tables but provide a canonical view union'ing them that a user can query to see whether any errors are presently affecting their system?

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

21 февраля 2022 г., 06:56:46

On Sun, Feb 20, 2022 at 1:02 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-19 22:38:19 +0900, Masahiko Sawada wrote:
> > On Sat, Feb 19, 2022 at 5:32 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2022-02-18 17:26:04 +0900, Masahiko Sawada wrote:
> > > > With this change, pg_stat_subscription_workers will be like:
> > > >
> > > > * subid
> > > > * subname
> > > > * subrelid
> > > > * error_count
> > > > * last_error_timestamp
> > >
> > > > This view will be extended by adding transaction statistics proposed
> > > > on another thread[1].
> > >
> > > I do not agree with these bits. What's the point of these per-relation stats
> > > at this poitns.  You're just duplicating the normal relation pg_stats here.
> > >
> > > I really think we just should drop pg_stat_subscription_workers. Even if we
> > > don't, we definitely should rename it, because it still isn't meaningfully
> > > about workers.
> >
> > The view has stats per subscription worker (i.e., apply worker and
> > tablesync worker), not per relation. The subrelid is OID of the
> > relation that the tablesync worker is synchronizing. For the stats of
> > apply workers, it is null.
>
> That's precisely the misuse of the stats subsystem that I'm complaining about
> here. The whole design of pgstat (as it is today) only makes sense if you can
> loose a message and it doesn't matter much, because it's just an incremental
> counter increment that's lost.  And to be able properly prune dead pgstat
> contents the underlying objects stats are kept around either need to be
> permanent (e.g. stats about WAL) or a record of objects needs to exist
> (e.g. stats about relations).
>
>
> Even leaving everything else aside, a key of (dboid, subid, subrelid), where
> subrelid can be NULL, but where (dboid, subid) is *not* unique, imo is poor
> relational design.  What is the justification for mixing relation specific and
> non-relation specific contents in this view?

I think the current schema of the view with key (dboid, subid,
subrelid) comes from the fact that we store the same statistics for
both apply and tablesync. I think even if we have two separate views
for apply and tablesync, they would have almost the same columns
except for their keys. Also, from the relational design point of view,
pg_locks has a somewhat similar table schema; its database and
relation columns can be NULL.

>
>
> The patch you referenced [1] should just store the stats in the
> pg_stat_subscription view, not pg_stat_subscription_workers.
>
> It *does* make sense to keep stats about the number of table syncs that failed
> etc. But that should be a counter in pg_stat_subscription, not a row in
> pg_stat_subscription_workers.

We have discussed using pg_stat_subscription before but concluded it's
not an appropriate place to store error information because it ends up
keeping cumulative stats mixed with non-cumulative stats. To take a
precedent, we used to store accumulative statistics such as spill_txns
to pg_stat_replication, but then for the same reason we moved those
statistics to the new stats view, pg_stat_replication_slot. New
subscription statistics that we're introducing are cumulative
statistics whereas pg_stat_subscription is a dynamic statistics view.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

21 февраля 2022 г., 08:08:51

On Sat, Feb 19, 2022 at 10:07 PM Andres Freund <andres@anarazel.de> wrote:
>
>
> I also still think that _worker shouldn't be part of any of the naming
> here.
>

Okay, the other options that comes to mind for this are:
pg_subscription_replication_error, or
pg_subscription_lreplication_error, or pg_subscription_lrep_error? We
can use similar naming at another place (view) if required.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

21 февраля 2022 г., 08:10:09

On Sat, Feb 19, 2022 at 10:35 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Sat, Feb 19, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> IMO the type of information you'd want for apply failures is substantially
>>
>> different enough from worker failures that I don't really see the temptation
>> to put them in the same table.
>>
>
> It's an error message and a transaction LSN in both cases right now, along with knowledge of whether said transaction
onlyaffects a single table (relid is not null) or not (relid is null).  Do you have a concrete idea in mind that would
makethis separation need more obvious? 
>

I would also like to mention that in some cases, sync workers also
behaves like apply worker (after initial sync till it catches up with
the apply worker) and try to stream and apply changes similar to apply
worker. The error during that phase will be no different than the
apply worker. One idea to make the separation more obvious is to
introduce 'worker_type' column similar to backend_type in
pg_stat_activity which will tell whether it is an apply worker or a
table sync worker.

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

21 февраля 2022 г., 08:34:53

Hi,

On 2022-02-21 12:56:46 +0900, Masahiko Sawada wrote:
> > The patch you referenced [1] should just store the stats in the
> > pg_stat_subscription view, not pg_stat_subscription_workers.
> >
> > It *does* make sense to keep stats about the number of table syncs that failed
> > etc. But that should be a counter in pg_stat_subscription, not a row in
> > pg_stat_subscription_workers.
>
> We have discussed using pg_stat_subscription before but concluded it's
> not an appropriate place to store error information because it ends up
> keeping cumulative stats mixed with non-cumulative stats.

Well, as we've amply discussed, the non-cumulative stats shouldn't be in the
pgstat subsystem.


> To take a precedent, we used to store accumulative statistics such as
> spill_txns to pg_stat_replication, but then for the same reason we moved
> those statistics to the new stats view, pg_stat_replication_slot. New
> subscription statistics that we're introducing are cumulative statistics
> whereas pg_stat_subscription is a dynamic statistics view.

I'm happy to have cumulative subscriber stats somewhere in pgstats. But it
shouldn't be split by worker or relation, and it shouldn't contain
non-cumulative error information.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

21 февраля 2022 г., 10:09:31

On Mon, Feb 21, 2022 at 11:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-21 12:56:46 +0900, Masahiko Sawada wrote:
>
> > To take a precedent, we used to store accumulative statistics such as
> > spill_txns to pg_stat_replication, but then for the same reason we moved
> > those statistics to the new stats view, pg_stat_replication_slot. New
> > subscription statistics that we're introducing are cumulative statistics
> > whereas pg_stat_subscription is a dynamic statistics view.
>
> I'm happy to have cumulative subscriber stats somewhere in pgstats. But it
> shouldn't be split by worker or relation, and it shouldn't contain
> non-cumulative error information.
>

Fair enough. Then, how about the following keeping the following information:

* subid (subscription id)
* subname (subscription name)
* sync_error_count/sync_failure_count (number of timed table sync failed)
* apply_error_count/apply_failure_count (number of times apply failed)
* sync_success_count (number of table syncs successfully finished)
* apply_commit_count (number of transactions applied successfully)
* apply_rollback_count (number of transactions explicitly rolled back)
* stats_reset (Time at which these statistics were last reset)

The view name could be pg_stat_subscription_lrep,
pg_stat_logical_replication, or something on those lines.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

21 февраля 2022 г., 10:48:06

Hi,

On 2022-02-21 12:39:31 +0530, Amit Kapila wrote:
> Fair enough. Then, how about the following keeping the following information:

Mostly sounds good.

> * subid (subscription id)
> * subname (subscription name)

Coming from catalog, via join, I assume?

> * sync_error_count/sync_failure_count (number of timed table sync failed)
> * apply_error_count/apply_failure_count (number of times apply failed)

Yep.

> * sync_success_count (number of table syncs successfully finished)

This one I'm not quite convinced by. You can't rely on precise counts with
pgstats and we should be able to get a better idea from somewhere more
permanent which relations succeeded?  But it also doesn't do much harm, so ...

> * apply_commit_count (number of transactions applied successfully)
> * apply_rollback_count (number of transactions explicitly rolled back)

What does "explicit" mean here?

> * stats_reset (Time at which these statistics were last reset)
> 
> The view name could be pg_stat_subscription_lrep,
> pg_stat_logical_replication, or something on those lines.

pg_stat_subscription_stats :)

(I really dislike that we have pg_stat_ stuff that's not actually stats, but
something describing the current state, but that ship has well sailed).

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

21 февраля 2022 г., 12:19:01

On Mon, Feb 21, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-21 12:39:31 +0530, Amit Kapila wrote:
> > Fair enough. Then, how about the following keeping the following information:
>
> Mostly sounds good.
>
>
> > * subid (subscription id)
> > * subname (subscription name)
>
> Coming from catalog, via join, I assume?
>

The subname would be from pg_subscription catalog similar to what we
are doing now for pg_stat_subscription_workers.

>
> > * sync_error_count/sync_failure_count (number of timed table sync failed)
> > * apply_error_count/apply_failure_count (number of times apply failed)
>
> Yep.
>
>
> > * sync_success_count (number of table syncs successfully finished)
>
> This one I'm not quite convinced by. You can't rely on precise counts with
> pgstats and we should be able to get a better idea from somewhere more
> permanent which relations succeeded?  But it also doesn't do much harm, so ...
>

We can get precise information from pg_subscription_rel (rels that are
in ready/finish_copy state) but OTOH, during refresh some of the rels
would have been dropped or if a user creates/refreshes publication
with copy_data = false, then we won't get information about how many
table syncs succeeded? I have also kept this to make the sync
information look consistent considering we have sync_failure_count.

>
> > * apply_commit_count (number of transactions applied successfully)
> > * apply_rollback_count (number of transactions explicitly rolled back)
>
> What does "explicit" mean here?
>

It is for the Rollback Prepared case and probably for streaming of
in-progress transactions that eventually get rolled back.

>
> > * stats_reset (Time at which these statistics were last reset)
> >
> > The view name could be pg_stat_subscription_lrep,
> > pg_stat_logical_replication, or something on those lines.
>
> pg_stat_subscription_stats :)
>

Having *stat* two times in the name sounds slightly odd to me but let
us see what others think. One more option could be
pg_stat_subscription_replication.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

21 февраля 2022 г., 19:07:07

On Mon, Feb 21, 2022 at 2:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 21, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:

> > The view name could be pg_stat_subscription_lrep,
> > pg_stat_logical_replication, or something on those lines.
>
> pg_stat_subscription_stats :)
>

Having *stat* two times in the name sounds slightly odd to me but let
us see what others think. One more option could be
pg_stat_subscription_replication.

Agreed.

pg_stat_subscription_activity

We already have pg_stat_activity (which may be an argument against the suggestion...)

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

21 февраля 2022 г., 19:17:15

Hi,

On 2022-02-21 14:49:01 +0530, Amit Kapila wrote:
> On Mon, Feb 21, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
> > > * stats_reset (Time at which these statistics were last reset)
> > >
> > > The view name could be pg_stat_subscription_lrep,
> > > pg_stat_logical_replication, or something on those lines.
> >
> > pg_stat_subscription_stats :)

> Having *stat* two times in the name sounds slightly odd to me but let
> us see what others think. One more option could be
> pg_stat_subscription_replication.

It was a joke, making light of our bad naming in pg_stat_*, not a serious
suggestion...

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

"David G. Johnston"

Дата:

21 февраля 2022 г., 19:19:11

On Sun, Feb 20, 2022 at 10:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Feb 19, 2022 at 10:35 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Sat, Feb 19, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> IMO the type of information you'd want for apply failures is substantially
>>
>> different enough from worker failures that I don't really see the temptation
>> to put them in the same table.
>>
>
> It's an error message and a transaction LSN in both cases right now, along with knowledge of whether said transaction only affects a single table (relid is not null) or not (relid is null). Do you have a concrete idea in mind that would make this separation need more obvious?
>

I would also like to mention that in some cases, sync workers also
behaves like apply worker (after initial sync till it catches up with
the apply worker) and try to stream and apply changes similar to apply
worker. The error during that phase will be no different than the
apply worker. One idea to make the separation more obvious is to
introduce 'worker_type' column similar to backend_type in
pg_stat_activity which will tell whether it is an apply worker or a
table sync worker.

The point isn't to make the separation more obvious by specifying which worker type is doing the work. It is to make the concept of worker type (and identity) irrelevant. The end user cannot (and should not be able to) address individual workers - only the subscription.

Even while a sync worker is in synchronization mode (as opposed to whatever mode comes before synchronization mode) it still only affects a single table. To the end user the distinction between the two modes is immaterial.

The statement "will be no different than the apply worker" doesn't make sense to me given that in a multiple-table subscription (the only kind where this matters...) you will have multiple table sync workers in synchronization mode and they both cannot behave identically to an apply worker otherwise they would step on each other's toes. That two different table-specific updates produce the same error shouldn't be a problem if that is indeed what happens (though if the error is on tableA having the worker for tableB report the tableA error would be odd - but not problematic).

I'll admit I don't fully understand the details of this particular synchronization interaction but I'm not see how the discussion of "errors during table-specific updates" vs "errors during whole transaction application" can be affected by it.

David J.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

22 февраля 2022 г., 06:37:30

On Mon, Feb 21, 2022 at 9:37 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Mon, Feb 21, 2022 at 2:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Feb 21, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
>>
>> > > The view name could be pg_stat_subscription_lrep,
>> > > pg_stat_logical_replication, or something on those lines.
>> >
>> > pg_stat_subscription_stats :)
>> >
>>
>> Having *stat* two times in the name sounds slightly odd to me but let
>> us see what others think. One more option could be
>> pg_stat_subscription_replication.
>>
>
> Agreed.
>
> pg_stat_subscription_activity
>
> We already have pg_stat_activity (which may be an argument against the suggestion...)
>

I don't know if that can be an argument against it but one can imagine
that we record other subscription changes like (change of
publications, etc.). I personally feel it may be better to add
'_replication' in some way like pg_stat_sub_replication_activity but I
am fine either way.

--
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

22 февраля 2022 г., 08:45:19

On Mon, Feb 21, 2022 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 21, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-02-21 12:39:31 +0530, Amit Kapila wrote:
> > > Fair enough. Then, how about the following keeping the following information:
> >
> > Mostly sounds good.
> >
> >
> > > * subid (subscription id)
> > > * subname (subscription name)
> >
> > Coming from catalog, via join, I assume?
> >
>
> The subname would be from pg_subscription catalog similar to what we
> are doing now for pg_stat_subscription_workers.

I've attached a patch that changes pg_stat_subscription_workers view.
It removes non-cumulative values such as error details such as
error-XID and the error message from the view, and consequently the
view now has only cumulative statistics counters: apply_error_count
and sync_error_count. Since the new view name is under discussion I
temporarily chose pg_stat_subscription_activity.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Вложения

v1-0001-Reconsider-pg_stat_subscription_workers-view.patch

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

22 февраля 2022 г., 12:53:38

On Tue, Feb 22, 2022 at 11:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached a patch that changes pg_stat_subscription_workers view.
> It removes non-cumulative values such as error details such as
> error-XID and the error message from the view, and consequently the
> view now has only cumulative statistics counters: apply_error_count
> and sync_error_count. Since the new view name is under discussion I
> temporarily chose pg_stat_subscription_activity.
>

Few comments:
=============
1.
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -637,11 +637,9 @@ REVOKE EXECUTE ON FUNCTION
pg_stat_reset_single_table_counters(oid) FROM public;

 REVOKE EXECUTE ON FUNCTION
pg_stat_reset_single_function_counters(oid) FROM public;

-REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
-
-REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid) FROM public;
+REVOKE EXECUTE ON FUNCTION
pg_stat_reset_single_subscription_counters(oid) FROM public;

-REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid,
oid) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;

Is there a need to change anything about
pg_stat_reset_replication_slot() in this patch?

2. Do we still need to use LATERAL in the view's query?

3. Can we send error stats pgstat_report_stat() as that will be called
via proc_exit() path. We can set the phase (apply/sync) in
apply_error_callback_arg and then use that to send the appropriate
message. I think this will obviate the need for try..catch.

-- 
With Regards,
Amit Kapila.

RE: Design of pg_stat_subscription_workers vs pgstats

От

"osumi.takamichi@fujitsu.com"

Дата:

22 февраля 2022 г., 15:22:18

On Tuesday, February 22, 2022 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've attached a patch that changes pg_stat_subscription_workers view.
> It removes non-cumulative values such as error details such as error-XID and
> the error message from the view, and consequently the view now has only
> cumulative statistics counters: apply_error_count and sync_error_count. Since
> the new view name is under discussion I temporarily chose
> pg_stat_subscription_activity.
Hi, thank you for sharing the patch.

Few minor comments for v1.

(1) commit message's typo

This commits changes the view so that it stores only statistics
counters: apply_error_count and sync_error_count.

"This commits" -> "This commit"

(2) minor improvement suggestion for the commit message

I suggest that we touch the commit id 8d74fc9
that introduces the pg_stat_subscription_workers
in the commit message, for better traceability. Below is an example.

From:
As the result of the discussion, we've concluded that the stats
collector is not an appropriate place to store the error information of
subscription workers.

To:
As the result of the discussion about the view introduced by 8d74fc9,...

(3) doc/src/sgml/logical-replication.sgml

Kindly refer to commit id 85c61ba for the detail.
You forgot "the" in the below sentence.

@@ -346,8 +346,6 @@
   <para>
    A conflict will produce an error and will stop the replication; it must be
    resolved manually by the user.  Details about the conflict can be found in
-   <link linkend="monitoring-pg-stat-subscription-workers">
-   <structname>pg_stat_subscription_workers</structname></link> and the
    subscriber's server log.
   </para>

From:
subscriber's server log.
to:
the subscriber's server log.

(4) doc/src/sgml/monitoring.sgml

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>last_error_time</structfield> <type>timestamp with time zone</type>
+       <structfield>sync_error_count</structfield> <type>uint8</type>
       </para>
       <para>
-       Last time at which this error occurred
+       Number of times the error occurred during the initial data copy
       </para></entry>

I supposed it might be better to use "initial data sync"
or "initial data synchronization", rather than "initial data copy".

(5) src/test/subscription/t/026_worker_stats.pl

+# Truncate test_tab1 so that table sync can continue.
+$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");

The second truncate is for apply, isn't it? Therefore, kindly change

From:
Truncate test_tab1 so that table sync can continue.
To:
Truncate test_tab1 so that apply can continue.

(6) src/test/subscription/t/026_worker_stats.pl

+# Insert more data to test_tab1 on the subscriber and then on the publisher, raising an
+# error on the subscriber due to violation of the unique constraint on test_tab1.
+$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab1 VALUES (2)");

Did we need this insert ?
If you want to indicate the apply is working okay after the error of table sync is solved,
waiting for the max value in the test_tab1 becoming 2 on the subscriber by polling query
would work. But, I was not sure if this is essentially necessary for the testing purpose.

Best Regards,
    Takamichi Osumi

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

22 февраля 2022 г., 17:32:02

On Tue, Feb 22, 2022 at 6:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 22, 2022 at 11:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached a patch that changes pg_stat_subscription_workers view.
> > It removes non-cumulative values such as error details such as
> > error-XID and the error message from the view, and consequently the
> > view now has only cumulative statistics counters: apply_error_count
> > and sync_error_count. Since the new view name is under discussion I
> > temporarily chose pg_stat_subscription_activity.
> >
>
> Few comments:
> =============

Thank you for the comments!

> 1.
> --- a/src/backend/catalog/system_functions.sql
> +++ b/src/backend/catalog/system_functions.sql
> @@ -637,11 +637,9 @@ REVOKE EXECUTE ON FUNCTION
> pg_stat_reset_single_table_counters(oid) FROM public;
>
>  REVOKE EXECUTE ON FUNCTION
> pg_stat_reset_single_function_counters(oid) FROM public;
>
> -REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
> -
> -REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid) FROM public;
> +REVOKE EXECUTE ON FUNCTION
> pg_stat_reset_single_subscription_counters(oid) FROM public;
>
> -REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid,
> oid) FROM public;
> +REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
>
> Is there a need to change anything about
> pg_stat_reset_replication_slot() in this patch?

It doesn't change pg_stat_reset_replication_slot() but just changes
the order in order to put the modified function
pg_stat_reset_single_subscription_counters() closer to other similar
functions such as pg_stat_reset_single_function_counters().

>
> 2. Do we still need to use LATERAL in the view's query?

There are some functions that use LATERAL in a similar way but it
seems no need to put LATERAL before the function call. Will remove.

> 3. Can we send error stats pgstat_report_stat() as that will be called
> via proc_exit() path. We can set the phase (apply/sync) in
> apply_error_callback_arg and then use that to send the appropriate
> message. I think this will obviate the need for try..catch.

If we use pgstat_report_stat() to send subscription stats messages,
all processes end up going through that path. It might not bring
overhead in practice but I'd like to avoid it. And, since the apply
worker also calls pgstat_report_stat() at the end of the transaction,
we might need to change pgstat_report_stat() so that it doesn't send
the subscription messages when it gets called at the end of the
transaction. I think it's likely that PG_TRY() and PG_CATCH() wil be
added for example, when the disable_on_error feature or the storing
error details feature is introduced, so obviating the need for them at
this point would not benefit much.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

22 февраля 2022 г., 17:46:31

On Tue, Feb 22, 2022 at 9:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Tuesday, February 22, 2022 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've attached a patch that changes pg_stat_subscription_workers view.
> > It removes non-cumulative values such as error details such as error-XID and
> > the error message from the view, and consequently the view now has only
> > cumulative statistics counters: apply_error_count and sync_error_count. Since
> > the new view name is under discussion I temporarily chose
> > pg_stat_subscription_activity.
> Hi, thank you for sharing the patch.
>
>
> Few minor comments for v1.

Thank you for the comments!

>
> (1) commit message's typo
>
> This commits changes the view so that it stores only statistics
> counters: apply_error_count and sync_error_count.
>
> "This commits" -> "This commit"

Will fix.

>
> (2) minor improvement suggestion for the commit message
>
> I suggest that we touch the commit id 8d74fc9
> that introduces the pg_stat_subscription_workers
> in the commit message, for better traceability. Below is an example.
>
> From:
> As the result of the discussion, we've concluded that the stats
> collector is not an appropriate place to store the error information of
> subscription workers.
>
> To:
> As the result of the discussion about the view introduced by 8d74fc9,...

Okay, will add the commit reference.

>
> (3) doc/src/sgml/logical-replication.sgml
>
> Kindly refer to commit id 85c61ba for the detail.
> You forgot "the" in the below sentence.
>
> @@ -346,8 +346,6 @@
>    <para>
>     A conflict will produce an error and will stop the replication; it must be
>     resolved manually by the user.  Details about the conflict can be found in
> -   <link linkend="monitoring-pg-stat-subscription-workers">
> -   <structname>pg_stat_subscription_workers</structname></link> and the
>     subscriber's server log.
>    </para>
>
> From:
> subscriber's server log.
> to:
> the subscriber's server log.

Will fix.

>
> (4) doc/src/sgml/monitoring.sgml
>
>       <row>
>        <entry role="catalog_table_entry"><para role="column_definition">
> -       <structfield>last_error_time</structfield> <type>timestamp with time zone</type>
> +       <structfield>sync_error_count</structfield> <type>uint8</type>
>        </para>
>        <para>
> -       Last time at which this error occurred
> +       Number of times the error occurred during the initial data copy
>        </para></entry>
>
> I supposed it might be better to use "initial data sync"
> or "initial data synchronization", rather than "initial data copy".

"Initial data synchronization" sounds like the whole table
synchronization process including COPY and applying changes to catch
up. But sync_error_count is incremented only during COPY so I used
"initial data copy". What do you think?

>
> (5) src/test/subscription/t/026_worker_stats.pl
>
> +# Truncate test_tab1 so that table sync can continue.
> +$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");
>
> The second truncate is for apply, isn't it? Therefore, kindly change
>
> From:
> Truncate test_tab1 so that table sync can continue.
> To:
> Truncate test_tab1 so that apply can continue.

Right, will fix.

>
> (6) src/test/subscription/t/026_worker_stats.pl
>
> +# Insert more data to test_tab1 on the subscriber and then on the publisher, raising an
> +# error on the subscriber due to violation of the unique constraint on test_tab1.
> +$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab1 VALUES (2)");
>
> Did we need this insert ?
> If you want to indicate the apply is working okay after the error of table sync is solved,
> waiting for the max value in the test_tab1 becoming 2 on the subscriber by polling query
> would work. But, I was not sure if this is essentially necessary for the testing purpose.

You're right, it's not necessary. Also, it seems better to change the
TAP test file name from 026_worker_stats.pl to 026_stats.pl. Will
incorporate these changes.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

RE: Design of pg_stat_subscription_workers vs pgstats

От

"osumi.takamichi@fujitsu.com"

Дата:

23 февраля 2022 г., 04:13:45

On Tuesday, February 22, 2022 11:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Feb 22, 2022 at 9:22 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > (4) doc/src/sgml/monitoring.sgml
> >
> >       <row>
> >        <entry role="catalog_table_entry"><para
> role="column_definition">
> > -       <structfield>last_error_time</structfield> <type>timestamp with
> time zone</type>
> > +       <structfield>sync_error_count</structfield> <type>uint8</type>
> >        </para>
> >        <para>
> > -       Last time at which this error occurred
> > +       Number of times the error occurred during the initial data
> > + copy
> >        </para></entry>
> >
> > I supposed it might be better to use "initial data sync"
> > or "initial data synchronization", rather than "initial data copy".
> 
> "Initial data synchronization" sounds like the whole table synchronization
> process including COPY and applying changes to catch up. But
> sync_error_count is incremented only during COPY so I used "initial data copy".
> What do you think?
Okay. Please keep it as is.


Best Regards,
    Takamichi Osumi

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

23 февраля 2022 г., 05:14:15

Hi,

On 2022-02-22 14:45:19 +0900, Masahiko Sawada wrote:
> I've attached a patch that changes pg_stat_subscription_workers view.

Thanks for working on this!

Why are the stats stored in the per-database stats file / as a second level
below the database? While they're also associated with a database, it's a
global catalog, so it seems to make more sense to have them "live" globally as
well?

Not just from an aesthetical perspective, but there might also be cases where
it's useful to send stats from the stats launcher. E.g. the number of times
the launcher couldn't start a worker because the max numbers of workers was
already active or such.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

23 февраля 2022 г., 05:46:14

On Wed, Feb 23, 2022 at 11:14 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-22 14:45:19 +0900, Masahiko Sawada wrote:
> > I've attached a patch that changes pg_stat_subscription_workers view.
>
> Thanks for working on this!
>
> Why are the stats stored in the per-database stats file / as a second level
> below the database? While they're also associated with a database, it's a
> global catalog, so it seems to make more sense to have them "live" globally as
> well?

Good point. The reason why we used to use per-database stats file is
that we were storing some relation information there. But now that we
don't need to have such information, it makes more sense to have them
live globally. I'll change the patch accordingly.

>
> Not just from an aesthetical perspective, but there might also be cases where
> it's useful to send stats from the stats launcher. E.g. the number of times
> the launcher couldn't start a worker because the max numbers of workers was
> already active or such.

Good idea.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

RE: Design of pg_stat_subscription_workers vs pgstats

От

"tanghy.fnst@fujitsu.com"

Дата:

23 февраля 2022 г., 06:00:08

On Tue, Feb 22, 2022 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> I've attached a patch that changes pg_stat_subscription_workers view.
> It removes non-cumulative values such as error details such as
> error-XID and the error message from the view, and consequently the
> view now has only cumulative statistics counters: apply_error_count
> and sync_error_count. Since the new view name is under discussion I
> temporarily chose pg_stat_subscription_activity.
> 

Thanks for your patch.

Few comments:

1.
+       <structfield>apply_error_count</structfield> <type>uint8</type>
...
+       <structfield>sync_error_count</structfield> <type>uint8</type>

It seems that Postgres has no data type named uint8, should we change it to
bigint?

2.
+# Wait for the table sync error to be reported.
+$node_subscriber->poll_query_until(
+    'postgres',
+    qq[
+SELECT apply_error_count = 0 AND sync_error_count > 0
+FROM pg_stat_subscription_activity
+WHERE subname = 'tap_sub'
+]) or die "Timed out while waiting for table sync error";

We want to check table sync error here, but do we need to check
"apply_error_count = 0"? I am not sure if it is possible that the apply worker has
an unexpected error, which would cause this test to fail.

Regards,
Tang

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Smith

Дата:

23 февраля 2022 г., 06:07:45

Hi. Below are my review comments for the v1 patch.

======

1. Commit message - wording

As the result of the discussion, we've concluded that the stats
collector is not an appropriate place to store the error information of
subscription workers.

SUGGESTION
It was decided (refer to the Discussion link below) that the stats
collector is not an appropriate place to store the error information of
subscription workers.

~~~

2. Commit message - wording

This commits changes the view so that it stores only statistics
counters: apply_error_count and sync_error_count.

SUGGESTION
"This commits changes the view" --> "This patch changes the
pg_stat_subscription_workers view"

~~~

3. Commit message - wording

Removing these error details, since we don't need to record the
error information for apply workers and tablesync workers separately,
the view now has one entry per subscription.

DID THIS MEAN...
After removing these error details, there is no longer separate
information for apply workers and tablesync workers, so the view now
has only one entry per subscription.

--

But anyway, that is not entirely true either because those counters
are separate information for the different workers, right?

~~

4. Commit message - wording

Also, it changes the view name to pg_stat_subscription_activity
since the word "worker" is an implementation detail that we use one
worker for one tablesync.

SUGGESTION
"Also, it changes" --> "The patch also changes" ...

~~~

5. doc/src/sgml/monitoring.sgml - wording

   <para>
-   The <structname>pg_stat_subscription_workers</structname> view will contain
-   one row per subscription worker on which errors have occurred, for workers
-   applying logical replication changes and workers handling the initial data
-   copy of the subscribed tables.  The statistics entry is removed when the
+   The <structname>pg_stat_subscription_activity</structname> view will contain
+   one row per subscription.  The statistics entry is removed when the
    corresponding subscription is dropped.
   </para>

SUGGESTION
"The statistics entry is removed" --> "This row is removed"

~~~

6.  doc/src/sgml/monitoring.sgml - why two counters?

Please forgive this noob question...

I see there are 2 error_count columns (one for each kind of worker)
but I was wondering why it is useful for users to be able to
distinguish if the error came from the tablesync workers or from the
apply workers? Do you have any example?

Also, IIRC sometimes the tablesync might actually do a few "apply"
changes itself... so the distinction may become a bit fuzzy...

~~~

7. src/backend/postmaster/pgstat.c - comment

@@ -1313,13 +1312,13 @@ pgstat_vacuum_stat(void)
  }

  /*
- * Repeat for subscription workers.  Similarly, we needn't bother in the
- * common case where no subscription workers' stats are being collected.
+ * Repeat for subscription.  Similarly, we needn't bother in the common
+ * case where no subscription stats are being collected.
  */

typo?

"Repeat for subscription." --> "Repeat for subscriptions."

~~~

8. src/backend/postmaster/pgstat.c

@@ -3000,32 +2968,29 @@ pgstat_fetch_stat_funcentry(Oid func_id)

 /*
  * ---------
- * pgstat_fetch_stat_subworker_entry() -
+ * pgstat_fetch_stat_subentry() -
  *
  * Support function for the SQL-callable pgstat* functions. Returns
- * the collected statistics for subscription worker or NULL.
+ * the collected statistics for subscription or NULL.
  * ---------
  */
-PgStat_StatSubWorkerEntry *
-pgstat_fetch_stat_subworker_entry(Oid subid, Oid subrelid)
+PgStat_StatSubEntry *
+pgstat_fetch_stat_subentry(Oid subid)

There seems some kind of inconsistency because the member name is
called "subscriptions" but sometimes it seems singular.

Some places (e.g. pgstat_vacuum_stat) will iterate multiple results,
but then other places (like this function) just return to a single
"subscription" (or "entry").

I suspect all the code may be fine; probably it is just some
inconsistent (singular/plural) comments that have confused things a
bit.

~~~

9. src/backend/replication/logical/worker.c - subscription id

+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);

and

+ /* Report the worker failed during the application of the change */
+ pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);


Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?

~~~

10. src/include/pgstat.h - enum order

@@ -84,8 +84,8 @@ typedef enum StatMsgType
  PGSTAT_MTYPE_REPLSLOT,
  PGSTAT_MTYPE_CONNECT,
  PGSTAT_MTYPE_DISCONNECT,
+ PGSTAT_MTYPE_SUBSCRIPTIONERROR,
  PGSTAT_MTYPE_SUBSCRIPTIONPURGE,
- PGSTAT_MTYPE_SUBWORKERERROR,
 } StatMsgType;

This change rearranges the enum order. Maybe it is safer not to do this?

~~~

11. src/include/pgstat.h

@@ -767,8 +747,8 @@ typedef union PgStat_Msg
  PgStat_MsgReplSlot msg_replslot;
  PgStat_MsgConnect msg_connect;
  PgStat_MsgDisconnect msg_disconnect;
+ PgStat_MsgSubscriptionError msg_subscriptionerror;
  PgStat_MsgSubscriptionPurge msg_subscriptionpurge;
- PgStat_MsgSubWorkerError msg_subworkererror;
 } PgStat_Msg;

This change also rearranges the order. Maybe there was no good reason
to do that?

~~~

12. src/include/pgstat.h - PgStat_StatDBEntry

@@ -823,16 +803,12 @@ typedef struct PgStat_StatDBEntry
  TimestampTz stats_timestamp; /* time of db stats file update */

  /*
- * tables, functions, and subscription workers must be last in the struct,
- * because we don't write the pointers out to the stats file.
- *
- * subworkers is the hash table of PgStat_StatSubWorkerEntry which stores
- * statistics of logical replication workers: apply worker and table sync
- * worker.
+ * tables, functions, and subscription must be last in the struct, because
+ * we don't write the pointers out to the stats file.
  */

Should that say "tables, functions, and subscriptions" (plural)

------

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

23 февраля 2022 г., 06:47:31

On Tue, Feb 22, 2022 at 8:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 22, 2022 at 6:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 3. Can we send error stats pgstat_report_stat() as that will be called
> > via proc_exit() path. We can set the phase (apply/sync) in
> > apply_error_callback_arg and then use that to send the appropriate
> > message. I think this will obviate the need for try..catch.
>
> If we use pgstat_report_stat() to send subscription stats messages,
> all processes end up going through that path. It might not bring
> overhead in practice but I'd like to avoid it.
>

I am not sure about overhead but I see another problem if we use that
approach. In the exit path, logicalrep_worker_onexit() will get called
before pgstat_report_stat() and that will clear the
MyLogicalRepWorker->subid, so we won't know the id for which to send
stats. So, the way patch is doing seems reasonable to me unless
someone has better ideas.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Smith

Дата:

23 февраля 2022 г., 09:20:58

Below are my review comments just for the v1 patch test code.

======

1. "table sync error" versus "tablesync error"

+# Wait for the table sync error to be reported.
+$node_subscriber->poll_query_until(
+ 'postgres',
+ qq[
+SELECT apply_error_count = 0 AND sync_error_count > 0
+FROM pg_stat_subscription_activity
+WHERE subname = 'tap_sub'
+]) or die "Timed out while waiting for table sync error";
+
+# Truncate test_tab1 so that table sync can continue.
+$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");
+
 # Wait for initial table sync for test_tab1 to finish.

IMO all these "table sync" should be changed to "tablesync", because a
table "sync error" sounds like something completely different to a
"tablesync error".

SUGGESTIONS
- "Wait for the table sync error to be reported." --> "Wait for the
tablesync error to be reported."
- "Timed out while waiting for table sync error" --> "Timed out while
waiting for tablesync error"
- "Truncate test_tab1 so that table sync can continue." -->  "Truncate
test_tab1 so that tablesync worker can fun to completion."
- "Wait for initial table sync for test_tab1 to finish." --> "Wait for
the tablesync worker of test_tab1 to finish."

~~~

2. Unnecessary INSERT VALUES (2)?

(I think this is a duplicate of what [Osumi] #6 reported)

+# Insert more data to test_tab1 on the subscriber and then on the
publisher, raising an
+# error on the subscriber due to violation of the unique constraint
on test_tab1.
+$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab1 VALUES (2)");

Why does the test do INSERT data (2)? There is already data (1) from
the tablesync which will cause an apply worker PK violation when
another VALUES (1) is published.

Note, the test comment also needs to change...

~~~

3. Wait for the apply worker error

+# Wait for the apply error to be reported.
+$node_subscriber->poll_query_until(
+ 'postgres',
+ qq[
+SELECT apply_error_count > 0 AND sync_error_count > 0
+FROM pg_stat_subscription_activity
+WHERE subname = 'tap_sub'
+]) or die "Timed out while waiting for apply error";

This test is only for apply worker errors. So why is the test SQL
checking "AND sync_error_count > 0"?

(This is similar to what [Tang] #2 reported, but I think she was
referring to the other tablesync test)

~~~

4. Wrong worker?

(looks like a duplicate of what [Osumi] #5 already)

+
+# Truncate test_tab1 so that table sync can continue.
+$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");

 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');

Cut/paste error? Aren't you doing TRUNCATE here so the apply worker
can continue; not the tablesync worker (which already completed)

"Truncate test_tab1 so that table sync can continue." --> "Truncate
test_tab1 so that the apply worker can continue."

------
[Osumi] https://www.postgresql.org/message-id/CAD21AoBRt%3DcyKsZP83rcMkHnT498gHH0TEP34fZBrGCxT-Ahwg%40mail.gmail.com
[Tang]
https://www.postgresql.org/message-id/TYCPR01MB612840D018FEBD38268CC83BFB3C9%40TYCPR01MB6128.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

24 февраля 2022 г., 04:32:54

 Hi,

On Wed, Feb 23, 2022 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi. Below are my review comments for the v1 patch.

Thank you for the comments! I've attached the latest version patch
that incorporated all comments I got so far. The primary change from
the previous version is that the subscription statistics live globally
rather than per-database.

>
> ======
>
> 1. Commit message - wording
>
> As the result of the discussion, we've concluded that the stats
> collector is not an appropriate place to store the error information of
> subscription workers.
>
> SUGGESTION
> It was decided (refer to the Discussion link below) that the stats
> collector is not an appropriate place to store the error information of
> subscription workers.

Fixed.

>
> ~~~
>
> 2. Commit message - wording
>
> This commits changes the view so that it stores only statistics
> counters: apply_error_count and sync_error_count.
>
> SUGGESTION
> "This commits changes the view" --> "This patch changes the
> pg_stat_subscription_workers view"

Fixed.

>
> ~~~
>
> 3. Commit message - wording
>
> Removing these error details, since we don't need to record the
> error information for apply workers and tablesync workers separately,
> the view now has one entry per subscription.
>
> DID THIS MEAN...
> After removing these error details, there is no longer separate
> information for apply workers and tablesync workers, so the view now
> has only one entry per subscription.
>
> --
>
> But anyway, that is not entirely true either because those counters
> are separate information for the different workers, right?

Right. Since I also made the subscription statistics a cluster-wide
statistics, I've changed this part accordingly.

>
> ~~
>
> 4. Commit message - wording
>
> Also, it changes the view name to pg_stat_subscription_activity
> since the word "worker" is an implementation detail that we use one
> worker for one tablesync.
>
> SUGGESTION
> "Also, it changes" --> "The patch also changes" ...

Fixed.

>
> ~~~
>
> 5. doc/src/sgml/monitoring.sgml - wording
>
>    <para>
> -   The <structname>pg_stat_subscription_workers</structname> view will contain
> -   one row per subscription worker on which errors have occurred, for workers
> -   applying logical replication changes and workers handling the initial data
> -   copy of the subscribed tables.  The statistics entry is removed when the
> +   The <structname>pg_stat_subscription_activity</structname> view will contain
> +   one row per subscription.  The statistics entry is removed when the
>     corresponding subscription is dropped.
>    </para>
>
> SUGGESTION
> "The statistics entry is removed" --> "This row is removed"

On second thoughts, this sentence is not necessary since it's obvious
and descriptions of other stats view don't mention it.

>
> ~~~
>
> 6.  doc/src/sgml/monitoring.sgml - why two counters?
>
> Please forgive this noob question...
>
> I see there are 2 error_count columns (one for each kind of worker)
> but I was wondering why it is useful for users to be able to
> distinguish if the error came from the tablesync workers or from the
> apply workers? Do you have any example?
>
> Also, IIRC sometimes the tablesync might actually do a few "apply"
> changes itself... so the distinction may become a bit fuzzy...

I think that the tablesync phase and the apply phase can fail for
different reasons. So these values would be a good indicator for users
to check if each phase works fine.

After more thoughts, I think it's better to increment sync_error_count
also when a tablesync worker fails while applying the changes. These
counters will correspond to the error information entries that will be
stored in a system catalog.

>
> ~~~
>
> 7. src/backend/postmaster/pgstat.c - comment
>
> @@ -1313,13 +1312,13 @@ pgstat_vacuum_stat(void)
>   }
>
>   /*
> - * Repeat for subscription workers.  Similarly, we needn't bother in the
> - * common case where no subscription workers' stats are being collected.
> + * Repeat for subscription.  Similarly, we needn't bother in the common
> + * case where no subscription stats are being collected.
>   */
>
> typo?
>
> "Repeat for subscription." --> "Repeat for subscriptions."

Fixed.

>
> ~~~
>
> 8. src/backend/postmaster/pgstat.c
>
> @@ -3000,32 +2968,29 @@ pgstat_fetch_stat_funcentry(Oid func_id)
>
>  /*
>   * ---------
> - * pgstat_fetch_stat_subworker_entry() -
> + * pgstat_fetch_stat_subentry() -
>   *
>   * Support function for the SQL-callable pgstat* functions. Returns
> - * the collected statistics for subscription worker or NULL.
> + * the collected statistics for subscription or NULL.
>   * ---------
>   */
> -PgStat_StatSubWorkerEntry *
> -pgstat_fetch_stat_subworker_entry(Oid subid, Oid subrelid)
> +PgStat_StatSubEntry *
> +pgstat_fetch_stat_subentry(Oid subid)
>
> There seems some kind of inconsistency because the member name is
> called "subscriptions" but sometimes it seems singular.
>
> Some places (e.g. pgstat_vacuum_stat) will iterate multiple results,
> but then other places (like this function) just return to a single
> "subscription" (or "entry").
>
> I suspect all the code may be fine; probably it is just some
> inconsistent (singular/plural) comments that have confused things a
> bit.

Fixed.

>
> ~~~
>
> 9. src/backend/replication/logical/worker.c - subscription id
>
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);
>
> and
>
> + /* Report the worker failed during the application of the change */
> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);
>
>
> Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?

It's just because we used to use MyLogicalRepWorker->subid, is there
any particular reason why we should use MySubscription->oid here?

>
> ~~~
>
> 10. src/include/pgstat.h - enum order
>
> @@ -84,8 +84,8 @@ typedef enum StatMsgType
>   PGSTAT_MTYPE_REPLSLOT,
>   PGSTAT_MTYPE_CONNECT,
>   PGSTAT_MTYPE_DISCONNECT,
> + PGSTAT_MTYPE_SUBSCRIPTIONERROR,
>   PGSTAT_MTYPE_SUBSCRIPTIONPURGE,
> - PGSTAT_MTYPE_SUBWORKERERROR,
>  } StatMsgType;
>
> This change rearranges the enum order. Maybe it is safer not to do this?
>

I assume you're concerned about binary compatibility or something. I
think it should not be a problem since both
PGSTAT_MTYPE_SUBWORKERERROR and PGSTAT_MTYPE_SUBSCRIPTIONPURGE are
introduced to PG15.

> ~~~
>
> 11. src/include/pgstat.h
>
> @@ -767,8 +747,8 @@ typedef union PgStat_Msg
>   PgStat_MsgReplSlot msg_replslot;
>   PgStat_MsgConnect msg_connect;
>   PgStat_MsgDisconnect msg_disconnect;
> + PgStat_MsgSubscriptionError msg_subscriptionerror;
>   PgStat_MsgSubscriptionPurge msg_subscriptionpurge;
> - PgStat_MsgSubWorkerError msg_subworkererror;
>  } PgStat_Msg;
>
> This change also rearranges the order. Maybe there was no good reason
> to do that?

It's for keeping the alphabetical order within subscription-related messages.

>
> ~~~
>
> 12. src/include/pgstat.h - PgStat_StatDBEntry
>
> @@ -823,16 +803,12 @@ typedef struct PgStat_StatDBEntry
>   TimestampTz stats_timestamp; /* time of db stats file update */
>
>   /*
> - * tables, functions, and subscription workers must be last in the struct,
> - * because we don't write the pointers out to the stats file.
> - *
> - * subworkers is the hash table of PgStat_StatSubWorkerEntry which stores
> - * statistics of logical replication workers: apply worker and table sync
> - * worker.
> + * tables, functions, and subscription must be last in the struct, because
> + * we don't write the pointers out to the stats file.
>   */
>
> Should that say "tables, functions, and subscriptions" (plural)

This part is removed in the latest patch.

On Wed, Feb 23, 2022 at 12:00 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
>
>
> Thanks for your patch.
>
> Few comments:
>
> 1.
> +       <structfield>apply_error_count</structfield> <type>uint8</type>
> ...
> +       <structfield>sync_error_count</structfield> <type>uint8</type>
>
> It seems that Postgres has no data type named uint8, should we change it to
> bigint?

Right, fixed.

>
> 2.
> +# Wait for the table sync error to be reported.
> +$node_subscriber->poll_query_until(
> +       'postgres',
> +       qq[
> +SELECT apply_error_count = 0 AND sync_error_count > 0
> +FROM pg_stat_subscription_activity
> +WHERE subname = 'tap_sub'
> +]) or die "Timed out while waiting for table sync error";
>
> We want to check table sync error here, but do we need to check
> "apply_error_count = 0"? I am not sure if it is possible that the apply worker has
> an unexpected error, which would cause this test to fail.

Yeah, it seems better not to have this condition, fixed.

On Wed, Feb 23, 2022 at 3:21 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Below are my review comments just for the v1 patch test code.
>
> ======
> 1. "table sync error" versus "tablesync error"
>
> +# Wait for the table sync error to be reported.
> +$node_subscriber->poll_query_until(
> + 'postgres',
> + qq[
> +SELECT apply_error_count = 0 AND sync_error_count > 0
> +FROM pg_stat_subscription_activity
> +WHERE subname = 'tap_sub'
> +]) or die "Timed out while waiting for table sync error";
> +
> +# Truncate test_tab1 so that table sync can continue.
> +$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");
> +
>  # Wait for initial table sync for test_tab1 to finish.
>
> IMO all these "table sync" should be changed to "tablesync", because a
> table "sync error" sounds like something completely different to a
> "tablesync error".
>
> SUGGESTIONS
> - "Wait for the table sync error to be reported." --> "Wait for the
> tablesync error to be reported."
> - "Timed out while waiting for table sync error" --> "Timed out while
> waiting for tablesync error"
> - "Truncate test_tab1 so that table sync can continue." -->  "Truncate
> test_tab1 so that tablesync worker can fun to completion."
> - "Wait for initial table sync for test_tab1 to finish." --> "Wait for
> the tablesync worker of test_tab1 to finish."

Fixed.

>
> ~~~
>
> 3. Wait for the apply worker error
>
> +# Wait for the apply error to be reported.
> +$node_subscriber->poll_query_until(
> + 'postgres',
> + qq[
> +SELECT apply_error_count > 0 AND sync_error_count > 0
> +FROM pg_stat_subscription_activity
> +WHERE subname = 'tap_sub'
> +]) or die "Timed out while waiting for apply error";
>
> This test is only for apply worker errors. So why is the test SQL
> checking "AND sync_error_count > 0"?
>
> (This is similar to what [Tang] #2 reported, but I think she was
> referring to the other tablesync test)

Fixed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Вложения

v2-0001-Reconsider-pg_stat_subscription_workers-view.patch

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

24 февраля 2022 г., 06:18:21

On Thu, Feb 24, 2022 at 7:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > ~~~
> >
> > 6.  doc/src/sgml/monitoring.sgml - why two counters?
> >
> > Please forgive this noob question...
> >
> > I see there are 2 error_count columns (one for each kind of worker)
> > but I was wondering why it is useful for users to be able to
> > distinguish if the error came from the tablesync workers or from the
> > apply workers? Do you have any example?
> >
> > Also, IIRC sometimes the tablesync might actually do a few "apply"
> > changes itself... so the distinction may become a bit fuzzy...
>
> I think that the tablesync phase and the apply phase can fail for
> different reasons. So these values would be a good indicator for users
> to check if each phase works fine.
>
> After more thoughts, I think it's better to increment sync_error_count
> also when a tablesync worker fails while applying the changes.
>

This sounds reasonable to me because even if we are applying the
changes in tablesync worker, it is only for that particular table. So,
it seems okay to increment it under category with the description:
"Number of times the error occurred during the initial table
synchronization".

-- 
With Regards,
Amit Kapila.

RE: Design of pg_stat_subscription_workers vs pgstats

От

"tanghy.fnst@fujitsu.com"

Дата:

24 февраля 2022 г., 10:20:49

On Thu, Feb 24, 2022 9:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Thank you for the comments! I've attached the latest version patch
> that incorporated all comments I got so far. The primary change from
> the previous version is that the subscription statistics live globally
> rather than per-database.
> 

Thanks for updating the patch.

Few comments:

1. 
I think we should add some doc for column stats_reset in pg_stat_subscription_activity view.

2.
+CREATE VIEW pg_stat_subscription_activity AS
     SELECT
-        w.subid,
+        a.subid,
         s.subname,
...
+        a.apply_error_count,
+        a.sync_error_count,
+    a.stats_reset
+    FROM pg_subscription as s,
+        pg_stat_get_subscription_activity(oid) as a;

The line "a.stats_reset" uses a Tab, and we'd better use spaces here.

Regards,
Tang

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Smith

Дата:

24 февраля 2022 г., 11:53:33

Hi. Below are my review comments for the v2 patch.

======

1. Commit message

This patch changes the pg_stat_subscription_workers view (introduced
by commit 8d74fc9) so that it stores only statistics counters:
apply_error_count and sync_error_count, and has one entry for
subscription.

SUGGESTION
"and has one entry for subscription." --> "and has one entry for each
subscription."

~~~

2. Commit message

After removing these error details, there are no longer relation
information, so the subscription statistics are now a cluster-wide
statistics.

SUGGESTION
"there are no longer relation information," --> "there is no longer
any relation information,"

~~~

3. doc/src/sgml/monitoring.sgml

-      <para>
-       The error message
+       Number of times the error occurred during the application of changes
       </para></entry>
      </row>

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>last_error_time</structfield> <type>timestamp
with time zone</type>
+       <structfield>sync_error_count</structfield> <type>bigint</type>
       </para>
       <para>
-       Last time at which this error occurred
+       Number of times the error occurred during the initial table
+       synchronization
       </para></entry>

SUGGESTION (both places)
"Number of times the error occurred ..." --> "Number of times an error
occurred ..."

~~~

4. doc/src/sgml/monitoring.sgml - missing column

(duplicate - also reported by [Tang-v2])

The PG docs for the new "stats_reset" column are missing.

~~~

5. src/backend/catalog/system_views.sql - whitespace

(duplicate - also reported by [Tang-v2])

-          JOIN pg_subscription s ON (w.subid = s.oid);
+        a.apply_error_count,
+        a.sync_error_count,
+ a.stats_reset
+    FROM pg_subscription as s,
+        pg_stat_get_subscription_activity(oid) as a;

inconsistent tab/space indenting for 'a.stats_reset'.

~~~

6. src/backend/postmaster/pgstat.c - function name

+/* ----------
+ * pgstat_reset_subscription_counter() -
+ *
+ * Tell the statistics collector to reset a single subscription
+ * counter, or all subscription counters (when subid is InvalidOid).
+ *
+ * Permission checking for this function is managed through the normal
+ * GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_subscription_counter(Oid subid)

SUGGESTION (function name)
"pgstat_reset_subscription_counter" -->
"pgstat_reset_subscription_counters" (plural)

~~

7. src/backend/postmaster/pgstat.c - pgstat_recv_resetsubcounter

@@ -5645,6 +5598,51 @@
pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
  }
 }

+/* ----------
+ * pgstat_recv_resetsubcounter() -
+ *
+ * Reset some subscription statistics of the cluster.
+ * ----------
+ */
+static void
+pgstat_recv_resetsubcounter(PgStat_MsgResetsubcounter *msg, int len)


"Reset some" seems a bit vague. Why not describe that it is all or
none according to the msg->m_subid?

~~~

8. src/backend/postmaster/pgstat.c - pgstat_recv_resetsubcounter

+ if (!OidIsValid(msg->m_subid))
+ {
+ HASH_SEQ_STATUS sstat;
+
+ /* Clear all subscription counters */
+ hash_seq_init(&sstat, subscriptionStatHash);
+ while ((subentry = (PgStat_StatSubEntry *) hash_seq_search(&sstat)) != NULL)
+ pgstat_reset_subscription(subentry, ts);
+ }
+ else
+ {
+ /* Get the subscription statistics to reset */
+ subentry = pgstat_get_subscription_entry(msg->m_subid, false);
+
+ /*
+ * Nothing to do if the given subscription entry is not found.  This
+ * could happen when the subscription with the subid is removed and
+ * the corresponding statistics entry is also removed before receiving
+ * the reset message.
+ */
+ if (!subentry)
+ return;
+
+ /* Reset the stats for the requested replication slot */
+ pgstat_reset_subscription(subentry, ts);
+ }
+}

Why not reverse the if/else?

Checking OidIsValid(...) seems more natural than checking !OidIsValid(...)

~~~

9. src/backend/postmaster/pgstat.c - pgstat_recv_subscription_purge

static void
pgstat_recv_subscription_purge(PgStat_MsgSubscriptionPurge *msg, int len)
{
/* Return if we don't have replication subscription statistics */
if (subscriptionStatHash == NULL)
return;

/* Remove from hashtable if present; we don't care if it's not */
(void) hash_search(subscriptionStatHash, (void *) &(msg->m_subid),
   HASH_REMOVE, NULL);
}

SUGGESTION
Wouldn't the above code be simpler written like:

if (subscriptionStatHash)
{
/* Remove from hashtable if present; we don't care if it's not */
(void) hash_search(subscriptionStatHash, (void *) &(msg->m_subid),
   HASH_REMOVE, NULL);
}
~~~

10. src/backend/replication/logical/worker.c

(from my previous [Peter-v1] #9)

>> + /* Report the worker failed during table synchronization */
>> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);
>>
>> and
>>
>> + /* Report the worker failed during the application of the change */
>> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);
>>
>>
>> Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?

> It's just because we used to use MyLogicalRepWorker->subid, is there
> any particular reason why we should use MySubscription->oid here?

I felt MySubscription->oid is a more natural and more direct way of
expressing the same thing.

Consider:  "the oid of the current subscription" versus "the oid of
the subscription of the current worker". IMO the first one is simpler.
YMMV.

Also, it is shorter :)

~~~

11. src/include/pgstat.h  - enum order

(follow-on from my previous v1 review comment #10)

>I assume you're concerned about binary compatibility or something. I
>think it should not be a problem since both
>PGSTAT_MTYPE_SUBWORKERERROR and PGSTAT_MTYPE_SUBSCRIPTIONPURGE are
>introduced to PG15.

Yes, maybe it is OK for those ones. But now in v2 there is a new
PGSTAT_MTYPE_RESETSUBCOUNTER.

Shouldn't at least that one be put at the end for the same reason?

~~~

12. src/include/pgstat.h  - PgStat_MsgResetsubcounter

Maybe a better name for this is "PgStat_MsgResetsubcounters" (plural)?

~~~


13. src/test/subscription/t/026_worker_stats.pl - missing test?

Shouldn't there also be some test to reset the counters to confirm
that they really do get reset to zero?

~~~

14. src/tools/pgindent/typedefs.list

PgStat_MsgResetsubcounter (from pgstat.h) is missing?

------

15. pg_stat_subscription_activity view name?

Has the view name already been decided or still under discussion - I
was not sure?

If is it already decided then fine, but if not then my vote would be
for something different like:
e.g.1 - pg_stat_subscription_errors
e.g.2 - pg_stat_subscription_counters
e.g.3 - pg_stat_subscription_metrics

Maybe "activity" was chosen to be deliberately vague in case some
future unknown stats columns get added? But it means now there is a
corresponding function "pg_stat_reset_subscription_activity", when in
practice you don't really reset activity - what you want to do is
reset some statistics *about* the activity... so it all seems a bit
odd to me.

------
[Tang-v2]
https://www.postgresql.org/message-id/OS0PR01MB6113769B17E90ADC9ACA14B2FB3D9%40OS0PR01MB6113.jpnprd01.prod.outlook.com
[Peter-v1]
https://www.postgresql.org/message-id/CAHut%2BPtH-uN5rbGRh-%3DkCd8xvQYDf_JCcjLcVjW3OXGz6T%2BxCw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Eisentraut

Дата:

24 февраля 2022 г., 12:20:39

On 21.02.22 17:17, Andres Freund wrote:
> Hi,
> 
> On 2022-02-21 14:49:01 +0530, Amit Kapila wrote:
>> On Mon, Feb 21, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
>>>> * stats_reset (Time at which these statistics were last reset)
>>>>
>>>> The view name could be pg_stat_subscription_lrep,
>>>> pg_stat_logical_replication, or something on those lines.
>>>
>>> pg_stat_subscription_stats :)
> 
>> Having *stat* two times in the name sounds slightly odd to me but let
>> us see what others think. One more option could be
>> pg_stat_subscription_replication.
> 
> It was a joke, making light of our bad naming in pg_stat_*, not a serious
> suggestion...

I think pg_stat_subscription_stats is actually the least worst option.

Unless we want to consider renaming pg_stat_subscription (which is 
actually more like pg_stat_subscription_activity).  But I think that 
should be avoided.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Eisentraut

Дата:

24 февраля 2022 г., 12:48:41

On 23.02.22 03:14, Andres Freund wrote:
> Why are the stats stored in the per-database stats file / as a second level
> below the database? While they're also associated with a database, it's a
> global catalog, so it seems to make more sense to have them "live" globally as
> well?

pg_subscription being a global catalog is a bit of a lie for the benefit 
of the worker launcher, but it can be treated as a per-database catalog 
for practical purposes.

> Not just from an aesthetical perspective, but there might also be cases where
> it's useful to send stats from the stats launcher. E.g. the number of times
> the launcher couldn't start a worker because the max numbers of workers was
> already active or such.

That's a reasonable point, however.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Eisentraut

Дата:

24 февраля 2022 г., 12:53:02

On 24.02.22 02:32, Masahiko Sawada wrote:
> On Wed, Feb 23, 2022 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
>>
>> Hi. Below are my review comments for the v1 patch.
> 
> Thank you for the comments! I've attached the latest version patch
> that incorporated all comments I got so far. The primary change from
> the previous version is that the subscription statistics live globally
> rather than per-database.

I don't think the name pg_stat_subscription_activity is a good choice.

We have a view called pg_stat_activity, which is very well known.  From 
that perspective, "activity" means what is happening right now or what 
has happened most recently.  The reworked view in this patch does not 
contain that (we already have pg_stat_subscription for that), but it 
contains accumulated counters.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

24 февраля 2022 г., 14:46:04

On Thu, Feb 24, 2022 at 6:53 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 24.02.22 02:32, Masahiko Sawada wrote:
> > On Wed, Feb 23, 2022 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >>
> >> Hi. Below are my review comments for the v1 patch.
> >
> > Thank you for the comments! I've attached the latest version patch
> > that incorporated all comments I got so far. The primary change from
> > the previous version is that the subscription statistics live globally
> > rather than per-database.
>
> I don't think the name pg_stat_subscription_activity is a good choice.
>
> We have a view called pg_stat_activity, which is very well known.  From
> that perspective, "activity" means what is happening right now or what
> has happened most recently.  The reworked view in this patch does not
> contain that (we already have pg_stat_subscription for that), but it
> contains accumulated counters.

Right.

What pg_stat_subscription shows is rather suitable for the name
pg_stat_subscription_activity than the reworked view. But switching
these names would also not be a good idea.  I think it's better to use
"subscription" in the view name since it shows actually statistics for
subscriptions and subscription OID is the key. I personally prefer
pg_stat_subscription_counters among the ideas that have been proposed
so far, but I'd like to hear opinions and votes.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

24 февраля 2022 г., 15:05:26

On Thu, Feb 24, 2022 at 2:24 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 9. src/backend/postmaster/pgstat.c - pgstat_recv_subscription_purge
>
> static void
> pgstat_recv_subscription_purge(PgStat_MsgSubscriptionPurge *msg, int len)
> {
> /* Return if we don't have replication subscription statistics */
> if (subscriptionStatHash == NULL)
> return;
>
> /* Remove from hashtable if present; we don't care if it's not */
> (void) hash_search(subscriptionStatHash, (void *) &(msg->m_subid),
>    HASH_REMOVE, NULL);
> }
>
> SUGGESTION
> Wouldn't the above code be simpler written like:
>
> if (subscriptionStatHash)
> {
> /* Remove from hashtable if present; we don't care if it's not */
> (void) hash_search(subscriptionStatHash, (void *) &(msg->m_subid),
>    HASH_REMOVE, NULL);
> }
> ~~~
>

I think we can write that way as well but I would prefer the way it is
currently in the patch as we use a similar pattern in nearby code (ex.
pgstat_recv_resetreplslotcounter) and at other places in the code base
as well.


> 10. src/backend/replication/logical/worker.c
>
> (from my previous [Peter-v1] #9)
>
> >> + /* Report the worker failed during table synchronization */
> >> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);
> >>
> >> and
> >>
> >> + /* Report the worker failed during the application of the change */
> >> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);
> >>
> >>
> >> Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?
>
> > It's just because we used to use MyLogicalRepWorker->subid, is there
> > any particular reason why we should use MySubscription->oid here?
>
> I felt MySubscription->oid is a more natural and more direct way of
> expressing the same thing.
>
> Consider:  "the oid of the current subscription" versus "the oid of
> the subscription of the current worker". IMO the first one is simpler.
> YMMV.
>

I think we can use either but maybe MySubscription->oid would be
slightly better here as the same is used in nearby code as well.

> Also, it is shorter :)
>
> ~~~
>
> 11. src/include/pgstat.h  - enum order
>
> (follow-on from my previous v1 review comment #10)
>
> >I assume you're concerned about binary compatibility or something. I
> >think it should not be a problem since both
> >PGSTAT_MTYPE_SUBWORKERERROR and PGSTAT_MTYPE_SUBSCRIPTIONPURGE are
> >introduced to PG15.
>
> Yes, maybe it is OK for those ones. But now in v2 there is a new
> PGSTAT_MTYPE_RESETSUBCOUNTER.
>
> Shouldn't at least that one be put at the end for the same reason?
>
> ~~~
>

I don't see the reason to put that at end. It is better to add it near
to similar RESET enums.

>
> 13. src/test/subscription/t/026_worker_stats.pl - missing test?
>
> Shouldn't there also be some test to reset the counters to confirm
> that they really do get reset to zero?
>
> ~~~
>

I think we avoid writing tests for stats for each and every case as it
is not reliable in nature (the message can be lost). If we can find a
reliable way then it is okay.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

24 февраля 2022 г., 15:17:55

On Thu, Feb 24, 2022 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 24, 2022 at 2:24 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > 10. src/backend/replication/logical/worker.c
> >
> > (from my previous [Peter-v1] #9)
> >
> > >> + /* Report the worker failed during table synchronization */
> > >> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);
> > >>
> > >> and
> > >>
> > >> + /* Report the worker failed during the application of the change */
> > >> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);
> > >>
> > >>
> > >> Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?
> >
> > > It's just because we used to use MyLogicalRepWorker->subid, is there
> > > any particular reason why we should use MySubscription->oid here?
> >
> > I felt MySubscription->oid is a more natural and more direct way of
> > expressing the same thing.
> >
> > Consider:  "the oid of the current subscription" versus "the oid of
> > the subscription of the current worker". IMO the first one is simpler.
> > YMMV.
> >
>
> I think we can use either but maybe MySubscription->oid would be
> slightly better here as the same is used in nearby code as well.

Okay, will change.

> >
> > 13. src/test/subscription/t/026_worker_stats.pl - missing test?
> >
> > Shouldn't there also be some test to reset the counters to confirm
> > that they really do get reset to zero?
> >
> > ~~~
> >
>
> I think we avoid writing tests for stats for each and every case as it
> is not reliable in nature (the message can be lost). If we can find a
> reliable way then it is okay.

Yeah, the messages can even be out-of-order. Particularly, in this
test, the apply worker and table sync worker keep reporting the
messages, it's quite possible that the test becomes unstable. I
remember we removed unstable tests of resetting statistics before
(e.g., see fc6950913).

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Eisentraut

Дата:

24 февраля 2022 г., 15:23:55

On 24.02.22 12:46, Masahiko Sawada wrote:
>> We have a view called pg_stat_activity, which is very well known.  From
>> that perspective, "activity" means what is happening right now or what
>> has happened most recently.  The reworked view in this patch does not
>> contain that (we already have pg_stat_subscription for that), but it
>> contains accumulated counters.
> Right.
> 
> What pg_stat_subscription shows is rather suitable for the name
> pg_stat_subscription_activity than the reworked view. But switching
> these names would also not be a good idea.  I think it's better to use
> "subscription" in the view name since it shows actually statistics for
> subscriptions and subscription OID is the key. I personally prefer
> pg_stat_subscription_counters among the ideas that have been proposed
> so far, but I'd like to hear opinions and votes.

_counters will fail if there is something not a counter (such as 
last-timestamp-of-something).

Earlier, pg_stat_subscription_stats was mentioned, which doesn't have 
that problem.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

24 февраля 2022 г., 15:51:05

On Thu, Feb 24, 2022 at 9:23 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
>
> On 24.02.22 12:46, Masahiko Sawada wrote:
> >> We have a view called pg_stat_activity, which is very well known.  From
> >> that perspective, "activity" means what is happening right now or what
> >> has happened most recently.  The reworked view in this patch does not
> >> contain that (we already have pg_stat_subscription for that), but it
> >> contains accumulated counters.
> > Right.
> >
> > What pg_stat_subscription shows is rather suitable for the name
> > pg_stat_subscription_activity than the reworked view. But switching
> > these names would also not be a good idea.  I think it's better to use
> > "subscription" in the view name since it shows actually statistics for
> > subscriptions and subscription OID is the key. I personally prefer
> > pg_stat_subscription_counters among the ideas that have been proposed
> > so far, but I'd like to hear opinions and votes.
>
> _counters will fail if there is something not a counter (such as
> last-timestamp-of-something).
>
> Earlier, pg_stat_subscription_stats was mentioned, which doesn't have
> that problem.

Ah, I had misunderstood your comment. Right, _counter could be a
blocker for the future changes.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

24 февраля 2022 г., 17:45:38

Hi,

On 2022-02-24 13:23:55 +0100, Peter Eisentraut wrote:
> On 24.02.22 12:46, Masahiko Sawada wrote:
> > > We have a view called pg_stat_activity, which is very well known.  From
> > > that perspective, "activity" means what is happening right now or what
> > > has happened most recently.  The reworked view in this patch does not
> > > contain that (we already have pg_stat_subscription for that), but it
> > > contains accumulated counters.
> > Right.
> > 
> > What pg_stat_subscription shows is rather suitable for the name
> > pg_stat_subscription_activity than the reworked view. But switching
> > these names would also not be a good idea.  I think it's better to use
> > "subscription" in the view name since it shows actually statistics for
> > subscriptions and subscription OID is the key. I personally prefer
> > pg_stat_subscription_counters among the ideas that have been proposed
> > so far, but I'd like to hear opinions and votes.
> 
> _counters will fail if there is something not a counter (such as
> last-timestamp-of-something).
> 
> Earlier, pg_stat_subscription_stats was mentioned, which doesn't have that
> problem.

We really should try to fix this in a more general way at some point. We have
way too many different things mixed up in pg_stat_*.

I'd like to get something like the patch in soon though, we can still change
the name later. I've been blocked behind this stuff for weeks, and it's
getting really painful...

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

24 февраля 2022 г., 18:49:37

Hi,

Thank you for the comments!

On Thu, Feb 24, 2022 at 4:20 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
>
> On Thu, Feb 24, 2022 9:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for the comments! I've attached the latest version patch
> > that incorporated all comments I got so far. The primary change from
> > the previous version is that the subscription statistics live globally
> > rather than per-database.
> >
>
> Thanks for updating the patch.
>
> Few comments:
>
> 1.
> I think we should add some doc for column stats_reset in pg_stat_subscription_activity view.

Added.

>
> 2.
> +CREATE VIEW pg_stat_subscription_activity AS
>      SELECT
> -        w.subid,
> +        a.subid,
>          s.subname,
> ...
> +        a.apply_error_count,
> +        a.sync_error_count,
> +       a.stats_reset
> +    FROM pg_subscription as s,
> +        pg_stat_get_subscription_activity(oid) as a;
>
> The line "a.stats_reset" uses a Tab, and we'd better use spaces here.

Fixed.

On Thu, Feb 24, 2022 at 5:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi. Below are my review comments for the v2 patch.
>
> ======
>
> 1. Commit message
>
> This patch changes the pg_stat_subscription_workers view (introduced
> by commit 8d74fc9) so that it stores only statistics counters:
> apply_error_count and sync_error_count, and has one entry for
> subscription.
>
> SUGGESTION
> "and has one entry for subscription." --> "and has one entry for each
> subscription."

Fixed.

>
> ~~~
>
> 2. Commit message
>
> After removing these error details, there are no longer relation
> information, so the subscription statistics are now a cluster-wide
> statistics.
>
> SUGGESTION
> "there are no longer relation information," --> "there is no longer
> any relation information,"

Fixed.

>
> ~~~
>
> 3. doc/src/sgml/monitoring.sgml
>
> -      <para>
> -       The error message
> +       Number of times the error occurred during the application of changes
>        </para></entry>
>       </row>
>
>       <row>
>        <entry role="catalog_table_entry"><para role="column_definition">
> -       <structfield>last_error_time</structfield> <type>timestamp
> with time zone</type>
> +       <structfield>sync_error_count</structfield> <type>bigint</type>
>        </para>
>        <para>
> -       Last time at which this error occurred
> +       Number of times the error occurred during the initial table
> +       synchronization
>        </para></entry>
>
> SUGGESTION (both places)
> "Number of times the error occurred ..." --> "Number of times an error
> occurred ..."

Fixed.

>
> ~~~
>
> 4. doc/src/sgml/monitoring.sgml - missing column
>
> (duplicate - also reported by [Tang-v2])
>
> The PG docs for the new "stats_reset" column are missing.
>
> ~~~
>
> 5. src/backend/catalog/system_views.sql - whitespace
>
> (duplicate - also reported by [Tang-v2])
>
> -          JOIN pg_subscription s ON (w.subid = s.oid);
> +        a.apply_error_count,
> +        a.sync_error_count,
> + a.stats_reset
> +    FROM pg_subscription as s,
> +        pg_stat_get_subscription_activity(oid) as a;
>
> inconsistent tab/space indenting for 'a.stats_reset'.
>
> ~~~
>
> 6. src/backend/postmaster/pgstat.c - function name
>
> +/* ----------
> + * pgstat_reset_subscription_counter() -
> + *
> + * Tell the statistics collector to reset a single subscription
> + * counter, or all subscription counters (when subid is InvalidOid).
> + *
> + * Permission checking for this function is managed through the normal
> + * GRANT system.
> + * ----------
> + */
> +void
> +pgstat_reset_subscription_counter(Oid subid)
>
> SUGGESTION (function name)
> "pgstat_reset_subscription_counter" -->
> "pgstat_reset_subscription_counters" (plural)

Fixed.

>
> ~~
>
> 7. src/backend/postmaster/pgstat.c - pgstat_recv_resetsubcounter
>
> @@ -5645,6 +5598,51 @@
> pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
>   }
>  }
>
> +/* ----------
> + * pgstat_recv_resetsubcounter() -
> + *
> + * Reset some subscription statistics of the cluster.
> + * ----------
> + */
> +static void
> +pgstat_recv_resetsubcounter(PgStat_MsgResetsubcounter *msg, int len)
>
>
> "Reset some" seems a bit vague. Why not describe that it is all or
> none according to the msg->m_subid?

I think it reset none, one, or all statistics, actually. Given other
pgstat_recv_reset* functions also have similar comments, I think we
can use it rather than elaborating.

>
> ~~~
>
> 8. src/backend/postmaster/pgstat.c - pgstat_recv_resetsubcounter
>
> + if (!OidIsValid(msg->m_subid))
> + {
> + HASH_SEQ_STATUS sstat;
> +
> + /* Clear all subscription counters */
> + hash_seq_init(&sstat, subscriptionStatHash);
> + while ((subentry = (PgStat_StatSubEntry *) hash_seq_search(&sstat)) != NULL)
> + pgstat_reset_subscription(subentry, ts);
> + }
> + else
> + {
> + /* Get the subscription statistics to reset */
> + subentry = pgstat_get_subscription_entry(msg->m_subid, false);
> +
> + /*
> + * Nothing to do if the given subscription entry is not found.  This
> + * could happen when the subscription with the subid is removed and
> + * the corresponding statistics entry is also removed before receiving
> + * the reset message.
> + */
> + if (!subentry)
> + return;
> +
> + /* Reset the stats for the requested replication slot */
> + pgstat_reset_subscription(subentry, ts);
> + }
> +}
>
> Why not reverse the if/else?
>
> Checking OidIsValid(...) seems more natural than checking !OidIsValid(...)

Yes, but it's because we use the same pattern in the near function
(see pgstat_recv_resetreplslotcounter()).

>
> ~~~
>
> 9. src/backend/postmaster/pgstat.c - pgstat_recv_subscription_purge
>
> static void
> pgstat_recv_subscription_purge(PgStat_MsgSubscriptionPurge *msg, int len)
> {
> /* Return if we don't have replication subscription statistics */
> if (subscriptionStatHash == NULL)
> return;
>
> /* Remove from hashtable if present; we don't care if it's not */
> (void) hash_search(subscriptionStatHash, (void *) &(msg->m_subid),
>    HASH_REMOVE, NULL);
> }
>
> SUGGESTION
> Wouldn't the above code be simpler written like:
>
> if (subscriptionStatHash)
> {
> /* Remove from hashtable if present; we don't care if it's not */
> (void) hash_search(subscriptionStatHash, (void *) &(msg->m_subid),
>    HASH_REMOVE, NULL);
> }

Similarly, as Amit also mentioned, there is a similar pattern in the
near function. So keep it as it is

> ~~~
>
> 10. src/backend/replication/logical/worker.c
>
> (from my previous [Peter-v1] #9)
>
> >> + /* Report the worker failed during table synchronization */
> >> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);
> >>
> >> and
> >>
> >> + /* Report the worker failed during the application of the change */
> >> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);
> >>
> >>
> >> Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?
>
> > It's just because we used to use MyLogicalRepWorker->subid, is there
> > any particular reason why we should use MySubscription->oid here?
>
> I felt MySubscription->oid is a more natural and more direct way of
> expressing the same thing.
>
> Consider:  "the oid of the current subscription" versus "the oid of
> the subscription of the current worker". IMO the first one is simpler.
> YMMV.
>
> Also, it is shorter :)

Changed.

>
> ~~~
>
> 11. src/include/pgstat.h  - enum order
>
> (follow-on from my previous v1 review comment #10)
>
> >I assume you're concerned about binary compatibility or something. I
> >think it should not be a problem since both
> >PGSTAT_MTYPE_SUBWORKERERROR and PGSTAT_MTYPE_SUBSCRIPTIONPURGE are
> >introduced to PG15.
>
> Yes, maybe it is OK for those ones. But now in v2 there is a new
> PGSTAT_MTYPE_RESETSUBCOUNTER.
>
> Shouldn't at least that one be put at the end for the same reason?

I think it's better to put it near similar RESET enums.

>
> ~~~
>
> 12. src/include/pgstat.h  - PgStat_MsgResetsubcounter
>
> Maybe a better name for this is "PgStat_MsgResetsubcounters" (plural)?

I think it's better to be consistent with other similar message
structs (e.g., msg_resetsharedcounter and msg_resetslrucounter).

>
> ~~~
>
> 14. src/tools/pgindent/typedefs.list
>
> PgStat_MsgResetsubcounter (from pgstat.h) is missing?
>

Added.

> ------
>
> 15. pg_stat_subscription_activity view name?
>
> Has the view name already been decided or still under discussion - I
> was not sure?
>
> If is it already decided then fine, but if not then my vote would be
> for something different like:
> e.g.1 - pg_stat_subscription_errors
> e.g.2 - pg_stat_subscription_counters
> e.g.3 - pg_stat_subscription_metrics
>
> Maybe "activity" was chosen to be deliberately vague in case some
> future unknown stats columns get added? But it means now there is a
> corresponding function "pg_stat_reset_subscription_activity", when in
> practice you don't really reset activity - what you want to do is
> reset some statistics *about* the activity... so it all seems a bit
> odd to me.

Yes, it still needs discussion.

I've attached the updated patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Вложения

v3-0001-Reconsider-pg_stat_subscription_workers-view.patch

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Smith

Дата:

25 февраля 2022 г., 04:56:04

Below are my review comments for the v3 patch.

======

1. Commit message

(An earlier review comment [Peter-v2] #2 was only partly fixed)

"there are no longer any relation information" --> "there is no longer
any relation information"

~~~

2. doc/src/sgml/monitoring.sgml

+
<entry><structname>pg_stat_subscription_activity</structname><indexterm><primary>pg_stat_subscription_activity</primary></indexterm></entry>
+      <entry>One row per subscription, showing statistics about subscription
+      activity.
+      See <link linkend="monitoring-pg-stat-subscription-activity">
+      <structname>pg_stat_subscription_activity</structname></link>
for details.
       </entry>
      </row>

Currently these stats are only about errors. These are not really
statistics about "activity" though. Probably it is better just to
avoid that word altogether?

SUGGESTIONS

e.g.1. "One row per subscription, showing statistics about
subscription activity." --> "One row per subscription, showing
statistics about errors."

e.g.2. "One row per subscription, showing statistics about
subscription activity." --> "One row per subscription, showing
statistics about that subscription."

-----
[Peter-v2]
https://www.postgresql.org/message-id/CAHut%2BPv%3DVmXtHmPKp4fg8VDF%2BTQP6xWgL91Jn-hrqg5QObfCZA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

25 февраля 2022 г., 08:26:47

On Thu, Feb 24, 2022 at 9:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > 6. src/backend/postmaster/pgstat.c - function name
> >
> > +/* ----------
> > + * pgstat_reset_subscription_counter() -
> > + *
> > + * Tell the statistics collector to reset a single subscription
> > + * counter, or all subscription counters (when subid is InvalidOid).
> > + *
> > + * Permission checking for this function is managed through the normal
> > + * GRANT system.
> > + * ----------
> > + */
> > +void
> > +pgstat_reset_subscription_counter(Oid subid)
> >
> > SUGGESTION (function name)
> > "pgstat_reset_subscription_counter" -->
> > "pgstat_reset_subscription_counters" (plural)
>
> Fixed.
>

We don't use the plural form in other similar cases like
pgstat_reset_replslot_counter, pgstat_reset_slru_counter, so why do it
differently here?

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Greg Stark

Дата:

25 февраля 2022 г., 17:52:49

On Tue, 25 Jan 2022 at 01:32, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I was looking the shared memory stats patch again.

Can you point me to this thread? I looked for it but couldn't find it.


-- 
greg

Re: Design of pg_stat_subscription_workers vs pgstats

От

"Euler Taveira"

Дата:

25 февраля 2022 г., 22:25:01

On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:

On Tue, 25 Jan 2022 at 01:32, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I was looking the shared memory stats patch again.

Can you point me to this thread? I looked for it but couldn't find it.

https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp

Euler Taveira

EDB https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

25 февраля 2022 г., 22:32:24

Hi,

On 2022-02-25 16:25:01 -0300, Euler Taveira wrote:
> On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:
> > On Tue, 25 Jan 2022 at 01:32, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > I was looking the shared memory stats patch again.
> > 
> > Can you point me to this thread? I looked for it but couldn't find it.

> https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp

I'll post a rebased version as soon as this is resolved... I have a local one,
but it just works by nuking a bunch of tests / #ifdefing out code related to
this.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

26 февраля 2022 г., 05:51:23

On Thu, Feb 24, 2022 at 9:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>

I have reviewed the latest version and made a few changes along with
fixing some of the pending comments by Peter Smith. The changes are as
follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as
that is not required now; (b) changed the struct name
PgStat_MsgSubscriptionPurge to PgStat_MsgSubscriptionDrop to make it
similar to DropDb; (c) changed the view name to
pg_stat_subscription_stats, we can reconsider it in future if there is
a consensus on some other name, accordingly changed the reset function
name to pg_stat_reset_subscription_stats; (d) moved some of the newly
added subscription stats functions adjacent to slots to main the
consistency in code; (e) changed comments at few places; (f) added
LATERAL back to system_views query as we refer pg_subscription's oid
in the function call, previously that was not clear.

Do let me know what you think of the attached?

-- 
With Regards,
Amit Kapila.

Вложения

v4-0001-Reconsider-pg_stat_subscription_workers-view.patch

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

26 февраля 2022 г., 05:54:07

On Fri, Feb 25, 2022 at 7:26 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Below are my review comments for the v3 patch.
>
...
> 2. doc/src/sgml/monitoring.sgml
>
> +
<entry><structname>pg_stat_subscription_activity</structname><indexterm><primary>pg_stat_subscription_activity</primary></indexterm></entry>
> +      <entry>One row per subscription, showing statistics about subscription
> +      activity.
> +      See <link linkend="monitoring-pg-stat-subscription-activity">
> +      <structname>pg_stat_subscription_activity</structname></link>
> for details.
>        </entry>
>       </row>
>
> Currently these stats are only about errors. These are not really
> statistics about "activity" though. Probably it is better just to
> avoid that word altogether?
>
> SUGGESTIONS
>
> e.g.1. "One row per subscription, showing statistics about
> subscription activity." --> "One row per subscription, showing
> statistics about errors."
>

I preferred this one and made another change suggested by you in the
latest version posted by me. Thanks!

-- 
With Regards,
Amit Kapila.

RE: Design of pg_stat_subscription_workers vs pgstats

От

"osumi.takamichi@fujitsu.com"

Дата:

26 февраля 2022 г., 11:05:17

On Saturday, February 26, 2022 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have reviewed the latest version and made a few changes along with fixing
> some of the pending comments by Peter Smith. The changes are as
> follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as that is
> not required now; (b) changed the struct name PgStat_MsgSubscriptionPurge
> to PgStat_MsgSubscriptionDrop to make it similar to DropDb; (c) changed the
> view name to pg_stat_subscription_stats, we can reconsider it in future if there
> is a consensus on some other name, accordingly changed the reset function
> name to pg_stat_reset_subscription_stats; (d) moved some of the newly
> added subscription stats functions adjacent to slots to main the consistency in
> code; (e) changed comments at few places; (f) added LATERAL back to
> system_views query as we refer pg_subscription's oid in the function call,
> previously that was not clear.
> 
> Do let me know what you think of the attached?
Hi, thank you for updating the patch !

I have a couple of comments on v4.

(1)

I'm not sure if I'm correct, but I'd say the sync_error_count
can come next to the subname as the order of columns.
I felt there's case that the column order is somewhat
related to the time/processing order (I imagined
pg_stat_replication's LSN related columns).
If this was right, table sync related column could be the
first column as a counter within this patch.

(2) doc/src/sgml/monitoring.sgml

+        Resets statistics for a single subscription shown in the
+        <structname>pg_stat_subscription_stats</structname> view to zero. If
+        the argument is <literal>NULL</literal>, reset statistics for all
+        subscriptions.
        </para>

I felt we could improve the first sentence.

From:
Resets statistics for a single subscription shown in the..

To(idea1):
Resets statistics for a single subscription defined by the argument to zero.

Or,
To(idea2):
Resets statistics to zero for a single subscription or for all subscriptions.

Best Regards,
    Takamichi Osumi

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

28 февраля 2022 г., 04:18:33

On Sat, Feb 26, 2022 at 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 24, 2022 at 9:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
>
> I have reviewed the latest version and made a few changes along with
> fixing some of the pending comments by Peter Smith.

Thank you for updating the patch!

> The changes are as
> follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as
> that is not required now; (b) changed the struct name
> PgStat_MsgSubscriptionPurge to PgStat_MsgSubscriptionDrop to make it
> similar to DropDb; (c) changed the view name to
> pg_stat_subscription_stats, we can reconsider it in future if there is
> a consensus on some other name, accordingly changed the reset function
> name to pg_stat_reset_subscription_stats; (d) moved some of the newly
> added subscription stats functions adjacent to slots to main the
> consistency in code; (e) changed comments at few places;

Agreed.

> (f) added
> LATERAL back to system_views query as we refer pg_subscription's oid
> in the function call, previously that was not clear.

I think LATERAL is still unnecessary as you pointed out before. The
documentation[1] says,

LATERAL can also precede a function-call FROM item, but in this case
it is a noise word, because the function expression can refer to
earlier FROM items in any case.

The rest looks good to me.

Regards,

[1] https://www.postgresql.org/docs/devel/sql-select.html


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

28 февраля 2022 г., 05:33:34

On Sat, Feb 26, 2022 at 1:35 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Saturday, February 26, 2022 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have reviewed the latest version and made a few changes along with fixing
> > some of the pending comments by Peter Smith. The changes are as
> > follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as that is
> > not required now; (b) changed the struct name PgStat_MsgSubscriptionPurge
> > to PgStat_MsgSubscriptionDrop to make it similar to DropDb; (c) changed the
> > view name to pg_stat_subscription_stats, we can reconsider it in future if there
> > is a consensus on some other name, accordingly changed the reset function
> > name to pg_stat_reset_subscription_stats; (d) moved some of the newly
> > added subscription stats functions adjacent to slots to main the consistency in
> > code; (e) changed comments at few places; (f) added LATERAL back to
> > system_views query as we refer pg_subscription's oid in the function call,
> > previously that was not clear.
> >
> > Do let me know what you think of the attached?
> Hi, thank you for updating the patch !
>
>
> I have a couple of comments on v4.
>
> (1)
>
> I'm not sure if I'm correct, but I'd say the sync_error_count
> can come next to the subname as the order of columns.
> I felt there's case that the column order is somewhat
> related to the time/processing order (I imagined
> pg_stat_replication's LSN related columns).
> If this was right, table sync related column could be the
> first column as a counter within this patch.
>

I am not sure if there is such a correlation but even if it is there
it doesn't seem to fit here completely as sync errors can happen after
apply errors in multiple ways like via Alter Subscription ... Refresh
...

So, I don't see the need to change the order here. What do you or others think?

>
> (2) doc/src/sgml/monitoring.sgml
>
> +        Resets statistics for a single subscription shown in the
> +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> +        the argument is <literal>NULL</literal>, reset statistics for all
> +        subscriptions.
>         </para>
>
> I felt we could improve the first sentence.
>
> From:
> Resets statistics for a single subscription shown in the..
>
> To(idea1):
> Resets statistics for a single subscription defined by the argument to zero.
>

Okay, I can use this one.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

28 февраля 2022 г., 05:46:52

On Mon, Feb 28, 2022 at 11:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 26, 2022 at 1:35 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Saturday, February 26, 2022 11:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I have reviewed the latest version and made a few changes along with fixing
> > > some of the pending comments by Peter Smith. The changes are as
> > > follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as that is
> > > not required now; (b) changed the struct name PgStat_MsgSubscriptionPurge
> > > to PgStat_MsgSubscriptionDrop to make it similar to DropDb; (c) changed the
> > > view name to pg_stat_subscription_stats, we can reconsider it in future if there
> > > is a consensus on some other name, accordingly changed the reset function
> > > name to pg_stat_reset_subscription_stats; (d) moved some of the newly
> > > added subscription stats functions adjacent to slots to main the consistency in
> > > code; (e) changed comments at few places; (f) added LATERAL back to
> > > system_views query as we refer pg_subscription's oid in the function call,
> > > previously that was not clear.
> > >
> > > Do let me know what you think of the attached?
> > Hi, thank you for updating the patch !
> >
> >
> > I have a couple of comments on v4.
> >
> > (1)
> >
> > I'm not sure if I'm correct, but I'd say the sync_error_count
> > can come next to the subname as the order of columns.
> > I felt there's case that the column order is somewhat
> > related to the time/processing order (I imagined
> > pg_stat_replication's LSN related columns).
> > If this was right, table sync related column could be the
> > first column as a counter within this patch.
> >
>
> I am not sure if there is such a correlation but even if it is there
> it doesn't seem to fit here completely as sync errors can happen after
> apply errors in multiple ways like via Alter Subscription ... Refresh
> ...
>
> So, I don't see the need to change the order here. What do you or others think?

I'm also not sure about it, both sound good to me. Probably we can
change the order later.

>
> >
> > (2) doc/src/sgml/monitoring.sgml
> >
> > +        Resets statistics for a single subscription shown in the
> > +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> > +        the argument is <literal>NULL</literal>, reset statistics for all
> > +        subscriptions.
> >         </para>
> >
> > I felt we could improve the first sentence.
> >
> > From:
> > Resets statistics for a single subscription shown in the..
> >
> > To(idea1):
> > Resets statistics for a single subscription defined by the argument to zero.
> >
>
> Okay, I can use this one.

Are you going to remove the part "shown in the
pg_stat_subsctiption_stats view"? I think it's better to keep it in
order to make it clear which statistics the function resets as we have
pg_stat_subscription and pg_stat_subscription_stats.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

28 февраля 2022 г., 05:52:18

On Mon, Feb 28, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 11:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > (2) doc/src/sgml/monitoring.sgml
> > >
> > > +        Resets statistics for a single subscription shown in the
> > > +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> > > +        the argument is <literal>NULL</literal>, reset statistics for all
> > > +        subscriptions.
> > >         </para>
> > >
> > > I felt we could improve the first sentence.
> > >
> > > From:
> > > Resets statistics for a single subscription shown in the..
> > >
> > > To(idea1):
> > > Resets statistics for a single subscription defined by the argument to zero.
> > >
> >
> > Okay, I can use this one.
>
> Are you going to remove the part "shown in the
> pg_stat_subsctiption_stats view"? I think it's better to keep it in
> order to make it clear which statistics the function resets as we have
> pg_stat_subscription and pg_stat_subscription_stats.
>

How about the following:
"Resets statistics for a single subscription defined by the argument
shown in the <structname>pg_stat_subscription_stats</structname> view
to zero. If the argument is <literal>NULL</literal>, reset statistics
for all subscriptions."

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Masahiko Sawada

Дата:

28 февраля 2022 г., 06:14:07

On Mon, Feb 28, 2022 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Feb 28, 2022 at 11:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > (2) doc/src/sgml/monitoring.sgml
> > > >
> > > > +        Resets statistics for a single subscription shown in the
> > > > +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> > > > +        the argument is <literal>NULL</literal>, reset statistics for all
> > > > +        subscriptions.
> > > >         </para>
> > > >
> > > > I felt we could improve the first sentence.
> > > >
> > > > From:
> > > > Resets statistics for a single subscription shown in the..
> > > >
> > > > To(idea1):
> > > > Resets statistics for a single subscription defined by the argument to zero.
> > > >
> > >
> > > Okay, I can use this one.
> >
> > Are you going to remove the part "shown in the
> > pg_stat_subsctiption_stats view"? I think it's better to keep it in
> > order to make it clear which statistics the function resets as we have
> > pg_stat_subscription and pg_stat_subscription_stats.
> >
>
> How about the following:
> "Resets statistics for a single subscription defined by the argument
> shown in the <structname>pg_stat_subscription_stats</structname> view
> to zero. If the argument is <literal>NULL</literal>, reset statistics
> for all subscriptions."

Sounds good but I'm not sure it's correct in terms of English grammar.
Shouldn't it be something like "subscription that is defined by the
argument and shown in the pg_stat_subscription_stats"?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

RE: Design of pg_stat_subscription_workers vs pgstats

От

"osumi.takamichi@fujitsu.com"

Дата:

28 февраля 2022 г., 06:19:22

On Monday, February 28, 2022 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Feb 26, 2022 at 1:35 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Saturday, February 26, 2022 11:51 AM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > I have reviewed the latest version and made a few changes along with
> > > fixing some of the pending comments by Peter Smith. The changes are
> > > as
> > > follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as
> > > that is not required now; (b) changed the struct name
> > > PgStat_MsgSubscriptionPurge to PgStat_MsgSubscriptionDrop to make it
> > > similar to DropDb; (c) changed the view name to
> > > pg_stat_subscription_stats, we can reconsider it in future if there
> > > is a consensus on some other name, accordingly changed the reset
> > > function name to pg_stat_reset_subscription_stats; (d) moved some of
> > > the newly added subscription stats functions adjacent to slots to
> > > main the consistency in code; (e) changed comments at few places;
> > > (f) added LATERAL back to system_views query as we refer
> pg_subscription's oid in the function call, previously that was not clear.
> > >
> > > Do let me know what you think of the attached?
> > Hi, thank you for updating the patch !
> > I have a couple of comments on v4.
> >
> > (1)
> >
> > I'm not sure if I'm correct, but I'd say the sync_error_count can come
> > next to the subname as the order of columns.
> > I felt there's case that the column order is somewhat related to the
> > time/processing order (I imagined pg_stat_replication's LSN related
> > columns).
> > If this was right, table sync related column could be the first column
> > as a counter within this patch.
> >
> 
> I am not sure if there is such a correlation but even if it is there it doesn't seem
> to fit here completely as sync errors can happen after apply errors in multiple
> ways like via Alter Subscription ... Refresh ...
> 
> So, I don't see the need to change the order here. What do you or others think?
In the alter subscription case, any errors after the table sync would increment
apply_error_count.

I mentioned this, because this point of view would impact on the doc read by users
and internal source codes for developers.
I had a concern that when we extend and increase a lot of statistics (not only for this view,
but also other statistics in general), writing doc for statistics needs some alignment for better
readability.

*But*, as you mentioned, in case we don't have such a correlation, I'm okay with the current patch.
Thank you for replying.


Best Regards,
    Takamichi Osumi

Re: Design of pg_stat_subscription_workers vs pgstats

От

Peter Smith

Дата:

28 февраля 2022 г., 06:29:13

Below are my comments for the v4 patch.

These are only nitpicking comments now; Otherwise, it LGTM.

(Sorry, now I see there are some overlaps with comments posted in the
last 20 mins so take or leave these as you wish)

======

1. doc/src/sgml/monitoring.sgml

-      <para>
-       OID of the relation that the worker was processing when the
-       error occurred
+       Number of times an error occurred during the application of changes
       </para></entry>
      </row>

BEFORE
Number of times an error occurred during the application of changes
SUGGESTED
Number of times an error occurred while applying changes

~~~

2. doc/src/sgml/monitoring.sgml

+        Resets statistics for a single subscription shown in the
+        <structname>pg_stat_subscription_stats</structname> view to zero. If
+        the argument is <literal>NULL</literal>, reset statistics for all
+        subscriptions.
        </para>

SUGGESTED (simpler description, more similar to pg_stat_reset_replication_slot)
Reset statistics to zero for a single subscription. If the argument is
<literal>NULL</literal>, reset statistics for all subscriptions.

~~~

3. src/backend/replication/logical/worker.c - comment

+ /*  */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

BEFORE
Report the worker failed during the application of the change
SUGGESTED
Report the worker failed while applying changes

~~~

4. src/include/pgstat.h - comment

+typedef struct PgStat_MsgResetsubcounter
+{
+ PgStat_MsgHdr m_hdr;
+ Oid m_subid; /* InvalidOid for clearing all subscription
+ * stats */
+} PgStat_MsgResetsubcounter;

BEFORE
InvalidOid for clearing all subscription stats
SUGGESTED
InvalidOid means reset all subscription stats

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

28 февраля 2022 г., 06:50:11

On Mon, Feb 28, 2022 at 8:59 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
>
> 2. doc/src/sgml/monitoring.sgml
>
> +        Resets statistics for a single subscription shown in the
> +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> +        the argument is <literal>NULL</literal>, reset statistics for all
> +        subscriptions.
>         </para>
>
> SUGGESTED (simpler description, more similar to pg_stat_reset_replication_slot)
> Reset statistics to zero for a single subscription. If the argument is
> <literal>NULL</literal>, reset statistics for all subscriptions.
>

As discussed, it is better to keep the view name in this description
important as we have another view (pg_stat_susbcription) as well. So,
I am planning to retain the current wording.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

28 февраля 2022 г., 06:57:18

On Mon, Feb 28, 2022 at 8:49 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, February 28, 2022 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Feb 26, 2022 at 1:35 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Saturday, February 26, 2022 11:51 AM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > > I have reviewed the latest version and made a few changes along with
> > > > fixing some of the pending comments by Peter Smith. The changes are
> > > > as
> > > > follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError as
> > > > that is not required now; (b) changed the struct name
> > > > PgStat_MsgSubscriptionPurge to PgStat_MsgSubscriptionDrop to make it
> > > > similar to DropDb; (c) changed the view name to
> > > > pg_stat_subscription_stats, we can reconsider it in future if there
> > > > is a consensus on some other name, accordingly changed the reset
> > > > function name to pg_stat_reset_subscription_stats; (d) moved some of
> > > > the newly added subscription stats functions adjacent to slots to
> > > > main the consistency in code; (e) changed comments at few places;
> > > > (f) added LATERAL back to system_views query as we refer
> > pg_subscription's oid in the function call, previously that was not clear.
> > > >
> > > > Do let me know what you think of the attached?
> > > Hi, thank you for updating the patch !
> > > I have a couple of comments on v4.
> > >
> > > (1)
> > >
> > > I'm not sure if I'm correct, but I'd say the sync_error_count can come
> > > next to the subname as the order of columns.
> > > I felt there's case that the column order is somewhat related to the
> > > time/processing order (I imagined pg_stat_replication's LSN related
> > > columns).
> > > If this was right, table sync related column could be the first column
> > > as a counter within this patch.
> > >
> >
> > I am not sure if there is such a correlation but even if it is there it doesn't seem
> > to fit here completely as sync errors can happen after apply errors in multiple
> > ways like via Alter Subscription ... Refresh ...
> >
> > So, I don't see the need to change the order here. What do you or others think?
> In the alter subscription case, any errors after the table sync would increment
> apply_error_count.
>

Sure, but the point I was trying to explain was that there is no
certainty in the order of these errors.

-- 
With Regards,
Amit Kapila.

RE: Design of pg_stat_subscription_workers vs pgstats

От

"osumi.takamichi@fujitsu.com"

Дата:

28 февраля 2022 г., 07:06:34

On Monday, February 28, 2022 12:57 PM Amit Kapila <amit.kapila16@gmail.com>
> On Mon, Feb 28, 2022 at 8:49 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, February 28, 2022 11:34 AM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > On Sat, Feb 26, 2022 at 1:35 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > On Saturday, February 26, 2022 11:51 AM Amit Kapila
> > > <amit.kapila16@gmail.com> wrote:
> > > > > I have reviewed the latest version and made a few changes along
> > > > > with fixing some of the pending comments by Peter Smith. The
> > > > > changes are as
> > > > > follows: (a) Removed m_databaseid in PgStat_MsgSubscriptionError
> > > > > as that is not required now; (b) changed the struct name
> > > > > PgStat_MsgSubscriptionPurge to PgStat_MsgSubscriptionDrop to
> > > > > make it similar to DropDb; (c) changed the view name to
> > > > > pg_stat_subscription_stats, we can reconsider it in future if
> > > > > there is a consensus on some other name, accordingly changed the
> > > > > reset function name to pg_stat_reset_subscription_stats; (d)
> > > > > moved some of the newly added subscription stats functions
> > > > > adjacent to slots to main the consistency in code; (e) changed
> > > > > comments at few places;
> > > > > (f) added LATERAL back to system_views query as we refer
> > > pg_subscription's oid in the function call, previously that was not clear.
> > > > >
> > > > > Do let me know what you think of the attached?
> > > > Hi, thank you for updating the patch !
> > > > I have a couple of comments on v4.
> > > >
> > > > (1)
> > > >
> > > > I'm not sure if I'm correct, but I'd say the sync_error_count can
> > > > come next to the subname as the order of columns.
> > > > I felt there's case that the column order is somewhat related to
> > > > the time/processing order (I imagined pg_stat_replication's LSN
> > > > related columns).
> > > > If this was right, table sync related column could be the first
> > > > column as a counter within this patch.
> > > >
> > >
> > > I am not sure if there is such a correlation but even if it is there
> > > it doesn't seem to fit here completely as sync errors can happen
> > > after apply errors in multiple ways like via Alter Subscription ... Refresh ...
> > >
> > > So, I don't see the need to change the order here. What do you or others
> think?
> > In the alter subscription case, any errors after the table sync would
> > increment apply_error_count.
> >
> 
> Sure, but the point I was trying to explain was that there is no certainty in the
> order of these errors.
I got it. Thank you so much for your explanation.


I don't have other new comments on this patch.
It looks good to me as well.


Best Regards,
    Takamichi Osumi

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

28 февраля 2022 г., 07:31:58

On Mon, Feb 28, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 11:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > (2) doc/src/sgml/monitoring.sgml
> > >
> > > +        Resets statistics for a single subscription shown in the
> > > +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> > > +        the argument is <literal>NULL</literal>, reset statistics for all
> > > +        subscriptions.
> > >         </para>
> > >
> > > I felt we could improve the first sentence.
> > >
> > > From:
> > > Resets statistics for a single subscription shown in the..
> > >
> > > To(idea1):
> > > Resets statistics for a single subscription defined by the argument to zero.
> > >
> >
> > Okay, I can use this one.
>
> Are you going to remove the part "shown in the
> pg_stat_subsctiption_stats view"? I think it's better to keep it in
> order to make it clear which statistics the function resets as we have
> pg_stat_subscription and pg_stat_subscription_stats.
>

I decided to keep this part of the docs as it is and fixed a few other
minor comments raised by you and Peter. Additionally, I have bumped
the PGSTAT_FILE_FORMAT_ID. I'll push this patch tomorrow unless there
are any other major comments.

-- 
With Regards,
Amit Kapila.

Вложения

v5-0001-Reconsider-pg_stat_subscription_workers-view.patch

RE: Design of pg_stat_subscription_workers vs pgstats

От

"tanghy.fnst@fujitsu.com"

Дата:

28 февраля 2022 г., 10:45:37

On Mon, Feb 28, 2022 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Feb 28, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Mon, Feb 28, 2022 at 11:33 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > >
> > > > (2) doc/src/sgml/monitoring.sgml
> > > >
> > > > +        Resets statistics for a single subscription shown in the
> > > > +        <structname>pg_stat_subscription_stats</structname> view to zero. If
> > > > +        the argument is <literal>NULL</literal>, reset statistics for all
> > > > +        subscriptions.
> > > >         </para>
> > > >
> > > > I felt we could improve the first sentence.
> > > >
> > > > From:
> > > > Resets statistics for a single subscription shown in the..
> > > >
> > > > To(idea1):
> > > > Resets statistics for a single subscription defined by the argument to zero.
> > > >
> > >
> > > Okay, I can use this one.
> >
> > Are you going to remove the part "shown in the
> > pg_stat_subsctiption_stats view"? I think it's better to keep it in
> > order to make it clear which statistics the function resets as we have
> > pg_stat_subscription and pg_stat_subscription_stats.
> >
> 
> I decided to keep this part of the docs as it is and fixed a few other
> minor comments raised by you and Peter. Additionally, I have bumped
> the PGSTAT_FILE_FORMAT_ID. I'll push this patch tomorrow unless there
> are any other major comments.
> 

Thanks for your patch. I have finished the review/test for this patch.
The patch LGTM.

Regards,
Tang

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

02 марта 2022 г., 05:05:46

On Mon, Feb 28, 2022 at 1:15 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
>
> On Mon, Feb 28, 2022 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I decided to keep this part of the docs as it is and fixed a few other
> > minor comments raised by you and Peter. Additionally, I have bumped
> > the PGSTAT_FILE_FORMAT_ID. I'll push this patch tomorrow unless there
> > are any other major comments.
> >
>
> Thanks for your patch. I have finished the review/test for this patch.
> The patch LGTM.
>

Pushed.

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

02 марта 2022 г., 08:09:16

Hi,

On 2022-03-02 07:35:46 +0530, Amit Kapila wrote:
> Pushed.

Thanks!

Working on rebasing shared memory stats over this. Feels *much* better so far.

While rebasing, I was wondering why pgstat_reset_subscription_counter() has
"all subscription counters" support. We don't have a function to reset all
function stats or such either.

I'm asking because support for that is what currently prevents sub stats from
using the more general function for reset. It's an acceptable amount of code,
but if we don't really need it I'd rather not have it / add it in a more
general way if we want it.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

02 марта 2022 г., 10:09:57

On Wed, Mar 2, 2022 at 10:39 AM Andres Freund <andres@anarazel.de> wrote:
>
> Working on rebasing shared memory stats over this. Feels *much* better so far.
>

Good to hear that it helps. BTW, there is another patch [1] that
extends this view. I think that patch is still not ready but once it
is ready (which I expect to happen sometime in this CF), it might be
good if you would be able to check whether it has any major problem
with your integration.

> While rebasing, I was wondering why pgstat_reset_subscription_counter() has
> "all subscription counters" support. We don't have a function to reset all
> function stats or such either.
>

We have similar thing for srlu (pg_stat_reset_slru) and slots
(pg_stat_reset_replication_slot). For functions and tables, one can
use pg_stat_reset. Similarly, we have pg_stat_reset_shared() which
reset stats like WAL. This matches more with slru/slots, so we
providied it via pg_stat_reset_subscription_stats.

[1] -
https://www.postgresql.org/message-id/TYWPR01MB8362B30A904274A911C0B1CCED039%40TYWPR01MB8362.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

02 марта 2022 г., 10:56:15

Hi,

On 2022-03-02 12:39:57 +0530, Amit Kapila wrote:
> On Wed, Mar 2, 2022 at 10:39 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Working on rebasing shared memory stats over this. Feels *much* better so far.
> >
> 
> Good to hear that it helps. BTW, there is another patch [1] that
> extends this view. I think that patch is still not ready but once it
> is ready (which I expect to happen sometime in this CF), it might be
> good if you would be able to check whether it has any major problem
> with your integration.

I skimmed it briefly, and I don't see an architectural conflict. I'm not
convinced it's worth adding that information, but that's a separate
discussion.

> > While rebasing, I was wondering why pgstat_reset_subscription_counter() has
> > "all subscription counters" support. We don't have a function to reset all
> > function stats or such either.
> >
> 
> We have similar thing for srlu (pg_stat_reset_slru) and slots
> (pg_stat_reset_replication_slot).

Neither should have been added imo. We're already at 9 different reset
functions. Without a unified function to reset all stats, pretty much the only
actually relevant operation. And having pg_stat_reset_shared() with variable
'reset' systems but then also pg_stat_reset_slru() and
pg_stat_reset_subscription_stats() is absurd.

This is just making something incomprehensible evermore incomprehensible.

> For functions and tables, one can use pg_stat_reset.

Not in isolation.

Greetings,

Andres Freund

Re: Design of pg_stat_subscription_workers vs pgstats

От

Amit Kapila

Дата:

02 марта 2022 г., 13:38:17

On Wed, Mar 2, 2022 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
>
> > > While rebasing, I was wondering why pgstat_reset_subscription_counter() has
> > > "all subscription counters" support. We don't have a function to reset all
> > > function stats or such either.
> > >
> >
> > We have similar thing for srlu (pg_stat_reset_slru) and slots
> > (pg_stat_reset_replication_slot).
>
> Neither should have been added imo. We're already at 9 different reset
> functions.
>

As per [1], we have 7.

>
> And having pg_stat_reset_shared() with variable
> 'reset' systems but then also pg_stat_reset_slru() and
> pg_stat_reset_subscription_stats() is absurd.
>

I don't know. I feel if for some subsystem, we have a way to reset a
single counter like for slru or slots, one would prefer to use the
same function to reset all stats of that subsytem. Now, for WAL,
bgwriter, etc., we don't want to reset any specific counter, so doing
it via a shared function is okay but not for others.

[1] - https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-STATS-FUNCTIONS

-- 
With Regards,
Amit Kapila.

Re: Design of pg_stat_subscription_workers vs pgstats

От

Andres Freund

Дата:

03 марта 2022 г., 05:21:06

Hi,

On 2022-02-25 11:32:24 -0800, Andres Freund wrote:
> On 2022-02-25 16:25:01 -0300, Euler Taveira wrote:
> > On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:
> > > On Tue, 25 Jan 2022 at 01:32, Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I was looking the shared memory stats patch again.
> > > 
> > > Can you point me to this thread? I looked for it but couldn't find it.
> 
> > https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp
> 
> I'll post a rebased version as soon as this is resolved... I have a local one,
> but it just works by nuking a bunch of tests / #ifdefing out code related to
> this.

Now that the pg_stat_subscription_workers changes have been committed, I've
posted a rebased version to the above thread.

Greetings,

Andres Freund

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Design of pg_stat_subscription_workers vs pgstats

Вложения

Вложения

Вложения

Вложения

Вложения